The Data Stack Show - 49: MLops - The Finalization of the Data Stack with Ben Rogojan of Facebook

Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the show. Today's guest is Ben Rogojan, and he is known online as the Seattle Data Guy. Many of you may follow him on Twitter. He has lots of followers, and he has done a lot

Starting point is 00:00:41 of work with data in his career, and he has a really interesting set of things that he's doing now. So he works as a consultant. So he helps companies figure out their data stacks. And then he's also a data engineer at Facebook. And so that leads me right into my big question. You know, I love asking consultants about what they're seeing on the ground just because they have a wide field of view.

Starting point is 00:01:03 But I want to hear about the difference between what he sees on the ground as a consultant with smaller companies and then what he's dealing with at Facebook. Facebook is one of the FAANG companies. It's so large that I think a lot of us have a hard time comprehending just even the types of problems that they deal with. So I just want to hear him talk about the differences there. And I think it'll be really interesting to hear just about the difference between those

Starting point is 00:01:25 two experiences. Well, I don't think that I'm going to differentiate much from your questions, Eric. I'll just maybe go a little bit deeper on the technical side of things. But yeah, I find it like very, very interesting to see what the differences are between like an organization like Facebook and the rest of the companies out there. And I think that Ben is the perfect person to talk about that stuff. Great. Well, let's jump in and chat with Ben. Let's do it. Ben, thanks so much for joining us on the Data Stack Show. We've followed you on social for a while,

Starting point is 00:01:58 love your content, have learned a lot from your content, and are really excited to have you as a guest on the show. Thank you. Thanks so much, Eric. I really appreciate you guys having me on today. Cool. Well, let's start where we always start and give us, tell us a little bit about your background and kind of what led you to what you're doing today and then talk about what you are doing today. Yeah, no. Yeah. So just to give kind of a background of like where things started in my journey, I guess, on data really started back, I think, in college before I even knew, I think data engineering or even data science in some regards was a thing.

Starting point is 00:02:31 And I took like an epidemiology course. And then I think I was also taking some computer science courses at the time. And I was kind of enthralled by like just how you could use data to like drive just results. Like you learn about John Snow in epidemiology. I think that's the first thing most people learn about is how he kind of used data to figure out cholera and things of that nature. And I was like, man, if only there was a way you could like combine statistics and programming. And then I think it took like three more months for me to figure out that

Starting point is 00:02:57 that was a whole kind of rising, I think field at the time, right? Like 2012 was the year of that whole Harvard article coming out about the sexiest job being data science. And so that's where I originally started was like, I was like really into data science. And then eventually I think as I started working and kind of maturing and figuring out what I liked, I tended to like more of the engineering side of things. So I started flowing more towards the like data engineering, data architecture side of building things. And so then from there, it was like working at some healthcare companies and startups, working on their kind of data flows and data stacks

Starting point is 00:03:29 and data pipelines. I at the same time had like started a consulting company in data that was, again, initially started in data science, but kind of shifted over to more data engineering and data architecture, and then eventually shifted over into working in big tech at Facebook as a data engineer. So that's kind of been the whole kind of quick flow for me from school to kind of career. And then again, now I'm kind of doing the Facebook thing while also having a consulting company that I've been operating for a little bit at this point. And we've got a few clients where, again, we kind of just help them develop their whole data stack and whatever that, again, we can kind of discuss what that means here in a

Starting point is 00:04:04 second, but helping them figure out what tools fit best for them, whether it be data viz, data storage, kind of going through different options, especially right now with so many tools that are really coming out. And I think changing a lot of the game, Snowflake's been out for a while, but I think the more and more I play with it, the more and more I see people use it. So things like that, just other data warehousing tools, tools like Rudderstack and things of that nature. So yeah, I think that's kind of where I'm at right now, where I'm really enjoying both of those

Starting point is 00:04:27 fields or both of those kind of roles and getting a chance just to see a lot of different perspectives and how people are using data and trying to use data. Awesome. And it's interesting starting out in the healthcare space, there are lots of data concerns there that are the sharp end of dealing with some of the issues around governance and compliance. And so I'm sure that was just a really good entree into dealing with some of the more difficult challenges around data. I'd love to dig into actually both sides of your work that you're doing as a consultant. And then at Facebook, we've talked with several consultants on the show before,

Starting point is 00:05:04 and we love hearing about what they're seeing on the ground. As a consultant, you get a breadth of view across many companies who are trying to solve problems that are similar or at least contiguous in the data space. So why don't we start with what are you seeing on the ground? What are trends? I mean, you mentioned that more and more companies are using Snowflake, but even outside of tools, are there architectures, team structures, problems or solutions that are really interesting that have popped up over the last couple months?

Starting point is 00:05:32 Yeah, I mean, like, I think one of the big things that I'm seeing is people trying to readjust, or trying to figure out how to do more with less. It's becoming, I think, for most companies, just apparent that data is growing at a rate that if you were to continually hire data engineers at the rate required based off the current solutions that you might be using, it will become just vastly expensive to pay enough data engineers to manage all of the various data sources that you have, right? Like, I mean, small companies these days that are in the, you know, eight-figure, seven-figure range will have like 20 to 50 data sources, possibly. And hiring one data engineer for that,

Starting point is 00:06:09 that's going to be 150 to 200k, depending on where you are, is barely feasible. And having to hire more is not. So I think a lot of companies are looking to different solutions in that regard, whatever that might be. Again, Fivetran's a popular one, or a different one; it depends where in that data flow you're kind of looking to fill in that role. But I think that's one thing is like people are trying to figure out how to do more with less. So it's not that, in a weird way,

Starting point is 00:06:34 I think some people feel like data engineering is going to go away in that regard. But in my aspect, it's just like, you're just going to have to, data engineers are just going to be more capable, kind of like how software engineers, I think have been amplified over the last couple of years with like cloud becoming more efficient in terms of how you can actually like deploy code and things of that nature

Starting point is 00:06:51 much easier rather than having to like get a server and have like five people just to stand up a little bit of code. Now you can have one or two really solid engineers kind of manage a whole flow. So I think that's kind of where things are shifting in data is we're trying to figure out how do we have one really solid data engineer manage a lot more with tools and the right solutions rather than spending tons of time putting together patchwork systems built up of like Cron six combination of clients and proposals that I've written all around that. So that's been like interesting in terms of more like tools that people are selecting. I think it's just one of those interesting things because Redshift I think has had this like kind of started us off. And I think then BigQuery and Snowflake have kind of had this like quiet popularity where even if they're doing better marketing or I don't

Starting point is 00:07:42 know what it is about those two that tend to just get decent traction, at least in people's minds. So I'm still trying to figure out that whole thing more from like a perspective of like someone constantly coming into these projects and seeing those same two things rather than seeing something like Redshift or like, well, Microsoft products

Starting point is 00:07:59 kind of been a little newer. So that could also be a reason why Microsoft's a little behind. But yeah, why those two seem to be doing kind of well overall is something else that I'm kind of thinking about and trying to figure out how to kind of work with as a consultant. Yeah, it's super interesting. You know, and I'd love to, before the show, we chatted a little bit about ML. And so I want to dig into that subject a little bit later in the show.

Starting point is 00:08:20 But we're seeing a lot of companies leverage BigQuery for some of the native ML functionality, which is super interesting. It kind of, in some ways, allows teams to punch above their weight class when doing certain things. But one final question on what you're seeing, and I agree, it's super interesting. The idea of data engineers going away is a fascinating conversation in and of itself. There's no way that's going to happen, of course, in my opinion. But I'd love your opinion on what I will call the gap between companies who are actively figuring out the frontier of data using these new tools and processes, and then a lot of companies who just don't even know that some of this stuff exists. And it seems like that gap is widening and the companies that aren't adapting, it seems like they will really, I mean,

Starting point is 00:09:10 depending on the business model, they'll really struggle just because operationalizing data is really the way that, you know, we shape the experiences that modern consumers have come to expect. Do you see that happening where there's a lot of companies where it's like, man, I've never even, I didn't even know you could connect all this stuff. Yeah, no, I do kind of get that feeling where it's like, and I think part of this is due to the fact that there's kind of information overload, right? Like it's like, what is the right product? So I think sometimes people might be focused on older products that have existed forever. I mean, they're still kind of looking at that as their solution, or maybe they just, you know, they might be viewing the products that they've known forever, and seeing the limitations that they had maybe 10 years ago and thinking that they still have those limitations. And I think that's kind of like maybe what's holding people back. Like I think even with like Tableau, I remember there was a point where, I don't remember what feature it was that they had recently added, but for a while there, I didn't even realize they had added it. But there was one that was like pretty pivotal in terms of like, if you were to pick Tableau

Starting point is 00:10:07 as a good data viz solution or not. And yeah, like, if you're not staying on top of all of the stuff that's changing in every space, I think that's what can really keep people behind. I did a video on like a data engineering roadmap. And I kind of had a joke within the first five seconds, or maybe it was like 15 seconds, where I like picked a picture or an infographic of like the data space, tool-wise, right? Like it was like one of those like VC kind of pictures or infographics of like all the tools broken down. The ones that like all the VC firms have been like making their architectures and visualizing that. Yeah. But like you literally couldn't tell what you were looking

Starting point is 00:10:44 at. There was just so many. And I think that's, that's currently the, the, one of the major problems where like data ingestion alone has like 50 tools you could pick from. Right. And they range all from like tools that have been around since like 1995 and tools that came out just last year. And so like, I think that alone can make it very difficult in terms of like knowing what exists and

Starting point is 00:11:05 knowing what each of these things do and why you might want to use one or the other. So I think I think that's one of the big things. It's just it's just so hard to keep up. And then the other thing is like, I think some companies are, I think all the big most a lot of the bigger companies I think are catching up. I think some of the more mid and small companies that are finally like there's just that I think that's where the big gap is, is there's this like,

Starting point is 00:11:26 some of them like either are reading all the Medium articles or whatever they're doing and then they'll contact me and be like, we want machine learning and they're too far ahead or something like that. And then there's some people who are just so busy

Starting point is 00:11:36 and they're probably day to day and they're in their ops that they just don't feel like they have the time to keep up or maybe they don't even feel like it could help them. So I think that's another hard place for some people. They don't realize that maybe there's something that could

Starting point is 00:11:46 help them. So I think that's probably one of the gaps. Yeah. Super interesting. Okay. Let's change gears a little bit. So that's what you're seeing on the ground on the consulting side of things, but you also work as a data engineer at Facebook. And what really intrigues me about that is you get to work with companies as a consultant who are just orders of magnitude smaller than Facebook and then you also get to see, okay, what does this look like at scale? And I know we could probably do a whole episode on that, but I'd love for, especially the people in our audience who are maybe working at a company in the mid-market or even a small enterprise company, to hear what are the

Starting point is 00:12:26 unique things that you've experienced as a data engineer at a company that's the size of Facebook, a true international enterprise with a massive engineering team? Yeah, I mean, I think that alone, you just kind of stated one of the major differences right there, is that you're talking, even like you said, like a mid-size company or even some large Fortune 500 companies, they have an engineering staff of, you know, 2,000 people, like maybe a larger Fortune 500. And if you're small, mid or something like that, maybe it's a few hundred people that are part of your engineering staff. And then you're trying to compete with companies like Google, Facebook, Amazon that have engineering staffs of 15, 20, 30,000 engineers, right? And just 5,000 of them might be focused on more of the enterprise side and developing enterprise systems and the architectures and, I don't know what to call

Starting point is 00:13:18 it like just services in general that make your life so much easier if you work at those companies versus again, small mid-cap companies, you're likely either relying on pre-bought products or maybe you're trying to put out your own internal solutions. But obviously it's just hard to commit the same time, the same amount of time towards this. And then you add in the fact that many of those mid,

Starting point is 00:13:37 small and even larger Fortune 500 companies have been around so long. So they're still relying on like source systems and operational systems that are maybe super functional and super developed amazingly, but maybe their analytic systems are just antiquated or developed in such a way that it was developed for 20 years ago, data sets or something of that nature. So then you're also having to remigrate and do all this other stuff just to get yourself to a point where you can actually build systems that act like a larger company like Facebook, Amazon,

Starting point is 00:14:05 and so on. And so I think that's been the big difference I've noticed. And I've talked to people like, I don't know if you know Veronica Zhai from Fivetran. She kind of comes from finance and banking at JP Morgan. And she's kind of brought that up as well, where their ops side is amazing, but their analytics side was pretty terrible when she first started. Yeah. And she spearheaded that whole kind of development on developing kind of their whole ETL and whatever. And that's what eventually brought her over to Fivetran was like that whole like dealing with all this terribleness and then seeing Fivetran as a possible solution for herself.

Starting point is 00:14:38 So that's like one kind of area that I think companies are going to continue to deal with, right? Like you've dealt with these systems and they work well, but they're old or they're not, they're not as easy to work with because they were developed for a different thing. Yeah. I think that's one of the major differences. Yeah. That's interesting. And I want to just touch base on one thing you said that is, is pretty mind blowing. You said maybe you have, I mean, I know you're just spitballing here, but 5,000 engineers who are focused on building tools that make the engineering team's life better. And it's just crazy.

Starting point is 00:15:11 You know, there's companies that IPO with far less total employees than the number that are working on a very specific set of things inside of Facebook. And that scale is just mind-blowing. I'd love to know, I mean, just out of personal curiosity, but I think for our audience as well, at that scale, you're probably building a lot of things internally because you outstrip the ability of even what we would say like enterprise-grade SaaS products can manage at scale. And I know that different teams for different use cases will probably use like various SaaS products, but I would be surprised if you didn't have a lot of homegrown solutions just because there isn't SaaS that's built to manage what you're facing because not a lot of companies have faced that before.

Starting point is 00:16:00 Yeah. And I think that's something that when you join Facebook, they kind of bring up. Just like, you think about when Facebook came about and you think about Amazon and AWS and when it was kind of developing its things, and there was no AWS, or not to the same degree at least, when Facebook was dealing with their problems, right? They had to develop all of their own solutions, essentially. So, yeah, I mean, I think that's the thing. Like there weren't even options to some degree. And so a lot of companies, especially ones that deal with that size of data, have to develop their own, whether, again, it's Google, Facebook, Amazon. And I think that's, right, like that's why Amazon developed their products originally: it was for themselves, and then eventually realizing they could sell them, they developed their own cloud service.

Starting point is 00:16:40 But yeah, it's a combination of both. Yeah, Facebook probably can build their own or at least better integrate. I think it's another thing, right? Like regardless of the SaaS product, there's usually limitations to integrations, regardless of how well they're often developed. It's just hard. And you're always going to be limited by the SaaS provider, right? Like if you're working with Salesforce, you're only going to be able to do so much. Like it's pretty customizable, but there might be a point where you're like, oh,

Starting point is 00:17:05 I just wish I could do this one thing. And there's no, there's no engineer you can go out to and be like, Hey, can you do this thing for me? But if you build it yourself, you've got a whole team. You'd be like, Hey, can we get this feature going? And at least that's a little more feasible, obviously then you've got other problems with internal people making choices on what features to work on, but at least there's a little more control where you can go to that team and be like,

Starting point is 00:17:25 hey, could we get this feature? We think it really would change the workflow. So I think that's another reason. It's not just about scale. It's also about having that ability to integrate at a very different level than most other companies. This is great, Ben. Can you give us a little bit more information

Starting point is 00:17:41 about the teams of the engineers and what the work looks like in a company the size of Facebook, especially for data engineers? I guess I'm curious on what specifically you're looking for. Yeah, it's more of an organizational question, to be honest. I mean, we are used to think of data engineering teams to be like a small team, relatively like to the product engineering teams that you usually have, right? And of course, they don't reach the scale of what Facebook has.

Starting point is 00:18:09 How does the scale affect how the teams are structured? That's the essence of the question. Yeah, I mean, I'm gonna guess it's very similar to plenty of other companies, where oftentimes, I think Facebook tries to support a product with a team of data engineers, right? Like that way you've got a good integration with both sides, both the software side and the analysts and data scientists side. So depending on how big

Starting point is 00:18:33 the product is could change how big the team is. But overall you're trying to support some product with some data engineering team. That way it can kind of be one pipeline where they have a good relationship with the software engineers, they have a good relationship with their XFNs on the other side, and everything runs smoothly. I mean, I think there's always going to be a problem with, and this is something I see regardless of whether you work at Facebook or other companies, is software engineers I think always tend to be focused on functionality in terms of we want to make sure it works and care less about like data. And they, I mean, obviously they need data in terms of like making sure

Starting point is 00:19:06 their product is up to date, right? Like if someone clicks or posts something, they want to make sure that information gets stored, but logging and things of that nature is kind of secondary. Right? Like if the, if the product works, do you need to log things? So I think that's generally the one interesting thing that I'll often see. Yeah. That's super interesting.

Starting point is 00:19:24 Uh, do you also, I mean, we have in our minds that like data engineers, they are mainly building and maintaining data pipelines, right? What else do you see getting done by the data engineers? Like, do you see them like building like internal tooling, for example? Is this something that you see? And if yes, how is this managed?

Starting point is 00:19:44 And like how much of the work of the data engineer in the future you think it's going to be something like that? But yeah, no, I think there's definitely always going to be kind of a need to build internal tooling to kind of abstract as much as you can away in terms of building data pipelines, right? Like trying to, again, it's being a balance between abstraction and building maintainable systems. But I think that's always kind of a goal in general of data engineers is not just to build pipelines, because

Starting point is 00:20:10 they could build that with some Python scripts, but also figuring out how to build more pipelines more effectively in terms of, right, like if you have to manage 1000 pipelines, how can you manage 1000 pipelines easily, because that's, it doesn't take much for a pipeline to fail. Like I think, I think that's the one thing I found interesting about, regardless of the company that I've worked at is it doesn't take much for most pipelines to fail, it could be one column changes, one data type changes and, and regardless of how much you maybe make some component inside that whole pipeline, maybe a little more robust, so it doesn't get impacted. There's always somewhere downstream that maybe does get impacted.

Starting point is 00:20:44 Maybe a table in MySQL or something of that nature or in your data warehouse. So I think trying to figure out how to develop systems that are more robust in that sense or at least can make it easier to manage when things do go wrong or provide

Starting point is 00:20:59 better notifications, whatever it might be. I think it's kind of a role of a data engineer because we tend to know what we'll need. Again, I think it does depend on the company you work at as well. I think if you work at a larger company like Facebook, you've got, again, tons of software engineers that are probably building a lot of those products. If you work at smaller companies elsewhere, even things like Lyft, I've talked to people, you're tending to play a little more of a software engineering role and not purely focus on just like data pipelines. So yeah, I think it also just depends on the company and how much support you have from maybe software teams that maybe purely develop that kind of like data instrumentation or something of that nature.
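Ben's point that a single renamed column or changed data type can break a pipeline can be illustrated with a small check that runs before data is loaded. This is only a minimal sketch, not how Facebook or any particular tool does it: the `EXPECTED_SCHEMA` table layout and the `find_schema_drift` helper are hypothetical names made up for this example.

```python
from typing import List, Mapping

# Hypothetical expected schema for an illustrative "events" table.
# These column names and types are assumptions, not from the episode.
EXPECTED_SCHEMA = {"user_id": int, "event_name": str, "amount": float}


def find_schema_drift(
    row: Mapping[str, object],
    expected: Mapping[str, type] = EXPECTED_SCHEMA,
) -> List[str]:
    """Return a list of human-readable problems found in one incoming row."""
    problems = []
    # Check every expected column for presence and type.
    for column, expected_type in expected.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(
                f"type changed: {column} expected {expected_type.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    # Flag columns the pipeline was never told about (for example, a rename upstream).
    for column in row:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


if __name__ == "__main__":
    # An upstream change renamed "amount" to "total" and turned user_id into a string.
    for problem in find_schema_drift({"user_id": "1", "event_name": "click", "total": 9.5}):
        print(problem)
```

A check like this does not make the pipeline more robust by itself, but it turns a silent downstream failure into an early, specific notification, which is the kind of manageability Ben describes.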

Starting point is 00:21:33 Yeah, yeah. I think that was, by the way, like an excellent point, what you said about pipelines failing, and that's like regardless of the size of the company. And I think that's also a space where there are many opportunities for products also to probably be created, exactly because the concept of observability, let's say, that we used to have in typical software products, it's not so well-defined when it comes to data pipelines and data in general and probably needs some kind of like different approach. So it'll be super interesting to learn more about how do you manage this problem and like what kind of lessons you have learned from like trying to do that.

Starting point is 00:22:18 But before we go deeper into this, a couple of months ago, we had another episode with someone who came actually from Facebook. He was working there and his name is Ivan. And he left to start a company called Slapdash. And actually he took like what he learned inside Facebook and the problems that he had to solve there and in a way productized it, right? Like he came up with a product. I don't want you to tell me like exactly, but do you have the feeling that we might see something similar coming out from Facebook

Starting point is 00:22:50 also for data-related products? Yeah, I mean, obviously there's multiple reasons I probably can't speak to that. Yeah, I mean, I think overall the answer is I have no idea, right? Like I already said, like I kind of said earlier, right? Like Amazon built all this

Starting point is 00:23:05 stuff internally and then started selling it. Facebook in a sense has built a lot of similar things but has never sold it. Why, I'm unaware of. It's not, okay, it's not part of my purview. Also, even if it was, I imagine that would be something I would have to double check with someone before I would say anything. Yeah, of course, of course. Yeah, makes sense. Makes sense. Okay, cool. So let's get a little bit more technical. And let's discuss a little bit about data stacks. And based on your experience, because you have experienced like both extremes, probably through your consulting career, but also like working in a huge organization like Facebook, you have seen many different, let's say, versions of what we keep calling data stack.

Starting point is 00:23:49 So based on your experience, first of all, what is the data stack? What part of the software that the company is using to operate should be called the data stack of the company? Sure. I mean, like, I think it's, again, like you said, it's pretty broad. If you're talking purely about like maybe like the analytics data stack, I think you start with raw data and you go all the way to like data storage, data viz, and maybe

Starting point is 00:24:12 some light data analytics. I mean, if you want to include some ML stuff in there, you could, if you really want to go that far. I think it's definitely like that tip of the iceberg kind of data stack stuff. So that's why I'm not as focused on that when I refer to data stack. Yeah, I really focus on that, like raw data, like data ingestion, data storage, data transformation, and then some sort of data viz or whatever your final data product could be, because I think there's some data viz. I also think like, data products don't always have to be a dashboard.

Starting point is 00:24:39 I think there's plenty of examples of like, other forms of reporting that you could consider kind of your final product. I think one of the things I've been recently trying to work on is building something like NerdWallet has this calculator that is a cost of living calculator. And they basically scrape a bunch of information from different sources, put it all together. And then you can now put in, I want to move to LA. I'm currently making 150K and it'll calculate how much you should make. And it kind of gives you some other information about like how pleasant

Starting point is 00:25:06 it is to live there based on like some walking scores and other information you can pull from APIs and then like cost of rentals and other things that they've kind of pulled and aggregated together. So I think that's also kind of like less of a dashboard and more of like a data product side where I kind of put that in the data stack as well, because like it's part of the whole flow. So yeah, so that's kind of, I think, the steps that most people will reference and what I kind of consider it, right? In terms of like, what is the data stack?

Starting point is 00:25:33 And, and it's, and it's so broad in terms of like, even starting all the way to raw data, like, what does it mean? It's like, well, that raw data can come from everywhere, right? Event logging, SFTPs from other companies and like external files from other companies, scraping things from online, pulling government data from online, pulling from your various APIs and marketing tools and Salesforce and things of that nature,

Starting point is 00:25:55 streaming data that maybe you're getting in. So it's just so broad that even that alone, it's like, that's where it all starts in terms of complexity, and it's probably the hardest part and why so many companies, I think, right now are focused on the data ingestion layer, because it's like if you can do well and develop a good product for data ingestion, you'll do okay in terms of a product.

Starting point is 00:26:15 Yeah, it makes a lot of sense. Do you see something changing in the definition, the broad definition of data stack, based on the size of the company? I think in a weird way, it's like a lot of companies are getting access to a lot of tools that they never had access to before. Like, I don't think the term data stack was used the same way for smaller companies up until now, at least that's what I feel, right? Like a lot of companies up until now, you had your 30 Python scripts, or shell scripts or whatever you prefer to script in, that you managed in cron, and it worked

Starting point is 00:26:53 fine because you only had five data sources. Now that companies, like, all of their products are SaaS, all of them have APIs, or at least a good portion of them have APIs, being able to switch over to some more well-defined components is, I think, what I'm personally seeing more of. We can actually switch over to something, just so I don't say the same product over and over, like Rivery.io or Airbyte. I think those are two other tools that are looking to fit into the data connector, data ingestion layer. You can use those instead of having to, again, develop a bunch of custom scripts. Because again, I've created four Salesforce connectors

Starting point is 00:27:30 in my life already, right? So yeah, it's like we've all created the same things over and over again. And it makes a lot of sense that someone tries to sell that and productionize it. So yeah. Yeah. So if you were like to advise, let's say, a young startup that has their first customers,

Starting point is 00:27:48 they create some revenue, they have to do some reporting, they have to understand a little bit better how their customers interact with their product. What would be, let's say, not an ideal, but a data stack that would make sense for a young company. And the reason I'm asking that is because many times we tend to over-engineer solutions, right? It's not like you need to operate a Kafka cluster just to move your data around, right? And it's like a very common mistake that people are doing.

Starting point is 00:28:22 And it costs a lot both in technical debt and time. And at the end, they end up with results that are pretty much noise, right? So can you give some advice on how you would structure, let's say, the data stack again for a young company? Sure. I mean, I think raw data, it's, again, hard to say how you're going to pull that all, just because it depends how you've developed or like what tools you use. But yeah, beyond that, right, I think tools like RudderStack or

Starting point is 00:28:49 Segment can work well in terms of like trying to log and just getting a lot of that information out there initially and getting it to the right place. I also do like, I usually switch in between Fivetran and Airflow depending on maybe the company's technical knowledge as well as maybe price sensitivity. I think, for example, Fivetran can end up being very expensive, but if you can afford it or if it really does help you because you've got enough data sources that you're trying to manage, that can be very helpful. But I also think Airflow is kind of great because it's overall decently simple and you can automate pretty well. I think the one hard thing there is some people have a hard time managing Airflow because it does tend to be a little bit finicky for some people,

Starting point is 00:29:30 but that tends to be more of the coding side of what I'll use rather than trying to develop my own thing. In terms of maybe data storage, I think at this point I will probably switch between BigQuery and Snowflake. I think those are kind of two favorites. Or if they're not using tons and tons of data, like if it's really small, I'll even use Postgres, just because if it's small, it's like, okay, look, this is fine. We don't need to do anything crazy or go pay huge costs for crazy optimizations. But if you've got a lot of big data, Snowflake and BigQuery are great. Also, Snowflake is just like,

Starting point is 00:30:04 I like to say, I feel like Snowflake is the Apple of data warehouses. It just has this feel to it. Like, I don't know why I like using my Mac or my Apple in terms of my laptop. I just do. I don't know why I like using Snowflake compared to some of its counterparts. I just do. It's easier.

Starting point is 00:30:20 I don't know why. I'm still sitting here like, I don't know what it is. I just like it better. Okay. Maybe it's just the branding. I don't know. They've got something there. And then data viz, I think I still generally, as much as I think Looker is kind of what is often named as the modern data stack tool, I think I still just prefer Tableau's usability. It's just so much easier to build anything very quickly. And if you know what you're doing, I think it's fine. I think with Looker, the one thing is it has LookML, not machine learning, but it has its models and things of that nature where you can kind of define things a little more, which some people prefer.

Starting point is 00:30:57 And I get that. But I think if you're safe with Tableau, I think Tableau's just got easier usability and you can build up something so quickly. And yeah, that's generally what I still prefer in terms of data viz. Yeah. Makes sense. Makes sense. I actually found it very interesting that you mentioned Postgres. Especially when I talk with young companies, the first thing that I ask them when they're trying to figure out their data stack, let's say, is, okay, are you a B2B or a B2C company? And that makes a huge difference in terms of at what stage of the company data,

Starting point is 00:31:30 especially the volume of the data, becomes an issue. Like a B2B company, I mean, you can pretty much grow a lot and still just use Excel documents in some cases, especially if you're focusing on large enterprises and stuff like that, which is, of course, completely different compared to building a marketplace or building an app like DoorDash, right? Even at the early stages of DoorDash, the amounts of data generated might be huge.

Starting point is 00:31:55 And it's very common advice that I also give: just use Postgres. I mean, Postgres can scale to quite a lot of data without having to go and get yourself into using something like Snowflake, which, of course, you put it very well. I think that this parallel with Apple is amazing. I love it. It feels nice. But at the end it's like, I mean, come on, dude, you don't need that to answer a few small queries, right? Yeah, like, just use Postgres as you would do also for your product. So that's super interesting. You mentioned Snowflake and BigQuery, right? There's also Redshift, and it's a very interesting story because Redshift is the first product in the cloud data warehouse space, right?

Starting point is 00:32:45 But we tend to not talk that much about them today. In your opinion, what do you think went wrong with Redshift? And you touched a little bit about Snowflake, but what do you think Snowflake did really, really well? Yeah, I mean, this is my personal opinion. I think Redshift is just, it's not that different from most data warehouses, but it's almost too different. I feel like it's not, like in my own mind, like there's so many nuances on how it works

Starting point is 00:33:14 and how, like, you have to be just that much more technical in using it and making sure you're using it properly than I think maybe with some of the previous tools. Or at least, like, people were technical using Oracle and Microsoft SQL Server in terms of building their data warehouses back whenever, but, I don't know, it just felt like such a shift in how you thought and how you designed, right? Like you couldn't run updates, right? Like that was a weird thing. I think they might have recently added that, but there were little things where classic data warehouse modeling wouldn't necessarily work well. And I think that kind of, it's like, people are like, okay, so if I want to run an insert merge and do slowly changing dimensions, oh, I can't.

Starting point is 00:33:54 Or I gotta, like, do this weird thing and do a two-tables kind of thing, like have a staging table, have the current table, and then create a new table based off of that. And so I think that might've been a little bit clunky. I think that's the biggest thing. I think it's clunky. And again, going back to Snowflake, Snowflake just operates how you think it should work. And I think that's what makes it different. It's the same way, like, why do people prefer Macs over Windows? Like, Windows can sometimes be clunky because they're not, like, I don't know. I don't know what it is about it.
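
The staging-table workaround Ben describes here can be sketched roughly as follows. This is only an illustration under stated assumptions, not Redshift-specific code: Python's built-in sqlite3 module stands in for the warehouse, the table names (customer_current, customer_staging, customer_new) are hypothetical, and it shows a simple overwrite-style merge rather than full slowly changing dimension history.

```python
import sqlite3


def merge_via_staging(conn):
    """Rebuild the current table from staging + untouched current rows.

    No UPDATE is used: staged rows win, and current rows whose id does
    not appear in staging are carried over unchanged.
    """
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS customer_new")
    cur.execute(
        """
        CREATE TABLE customer_new AS
        SELECT id, city FROM customer_staging
        UNION ALL
        SELECT c.id, c.city
        FROM customer_current AS c
        WHERE c.id NOT IN (SELECT id FROM customer_staging)
        """
    )
    # Swap the rebuilt table in place of the old one.
    cur.execute("DROP TABLE customer_current")
    cur.execute("ALTER TABLE customer_new RENAME TO customer_current")
    conn.commit()


def demo():
    """Hypothetical data: one changed row (id 2) and one new row (id 3)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer_current (id INTEGER, city TEXT)")
    conn.execute("CREATE TABLE customer_staging (id INTEGER, city TEXT)")
    conn.executemany(
        "INSERT INTO customer_current VALUES (?, ?)",
        [(1, "Seattle"), (2, "Austin")],
    )
    conn.executemany(
        "INSERT INTO customer_staging VALUES (?, ?)",
        [(2, "Denver"), (3, "Boston")],
    )
    merge_via_staging(conn)
    return conn.execute(
        "SELECT id, city FROM customer_current ORDER BY id"
    ).fetchall()
```

Calling demo() leaves id 1 untouched, replaces the city for id 2, and adds id 3, which is the "clunky" extra table shuffle compared with a single merge statement.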

Starting point is 00:34:25 It just feels a little clunkier than it would with Apple. I'm not a designer, so even for me, sometimes the reasoning eludes me. It's just like, that's usually my description. It's like, well, this one feels clunky, that one feels smooth. That's what I can tell you. I like using one, I don't like using the other.

Starting point is 00:34:39 Yeah, yeah. Yeah, I totally agree. I think that the feeling that you get from Snowflake is that it just works, right? And if you have to scale up or scale down, again, it just works. I remember having to deal with Redshift. I don't know how it is today because, okay,

Starting point is 00:34:57 I think they have changed quite a few things and the product has matured a lot and has some kind of parity with the rest of the data warehouses out there. But having to rescale your cluster, it was a nightmare. You had downtimes, for example, right? Or having to vacuum your data, which is, okay, it's something, some kind of like relic, because they built the distributed system on top of Postgres.

Starting point is 00:35:24 Postgres has this concept of vacuuming. And then they had also to introduce stuff like deep copying, and then you had vacuum. Anyway, it was a lot of, let's say, not unnecessary work, but it could be very inconvenient when it shouldn't be inconvenient. And that's something that, from a product perspective, I think Snowflake did really, really well. And that's amazing. But on the other hand, BigQuery is not that different, right? When I first tried BigQuery, it had pretty much the same feeling. It just works, right? Why do you think that Snowflake is that much more successful in a way or we hear more about it

Starting point is 00:36:04 than BigQuery? I don't know if their marketing is better. I think maybe that's possibly one side of it. Like, I think it's their branding in general. I recall back in, oh, it must have been like 2015 or something, I went to a meetup assuming it was some sort of tech talk, and about 30 minutes in, I realized it was basically just a sales guy trying to sell Snowflake. But even that was interesting, because they were talking about the design and how they made it different and things of that nature. So I think they've just been building on it for so long that I think that's kind of,

Starting point is 00:36:40 kind of helped. I just think it's been a lot more of a branding thing. And I think it's easier to brand than it is to brand BigQuery, which is connected to Google Cloud. And so it's hard to maybe separate from that. Where Snowflake, it's like, it's just Snowflake. There's nothing else connected to it. I agree with that. I was going to say, if you think about Redshift had the first mover advantage, right? And so the product itself, in terms of being a new solution in the market, was sort of groundbreaking just because of the nature of the product and having the first mover advantage. Google is such a huge beast. And so I wonder, and this is complete conjecture, but you kind of have a strange feeling about

Starting point is 00:37:26 using your free Gmail account and then dumping all of your customer data into BigQuery and it's the same company. That's just a little bit of a weird perception. Like you said, where it's like, I mean, BigQuery, it's an awesome tool. But from a branding standpoint, I think it's really hard to overcome offering lots of free consumer products and then building a brand around corporate enterprise trust when it comes to your most valuable asset. Whereas Snowflake, that was all they did. And so it was a much more straightforward branding exercise. Yeah. I'm not a marketer, but I assume that's it. I think that plays a role. I mean, like, again, I obviously, you've noticed that I do play some marketing, but I think even then it's

Starting point is 00:38:10 like, just for me, it's sheer force of marketing. Like I'll just keep putting out content, but I don't exactly have like a marketing strategy. Yeah. Well, if there's anyone out there in the audience who has an informed opinion on this, we would love to have you on the show to discuss it because we love talking about the battle of the warehouses. Then one other thing that we had chatted about before the show was what we call, and I loved how you said it, the other side of the stack.

Starting point is 00:38:39 And I love that visual, right? Like there's a lot of the core components of the stack and however you want to architect your data stack. There's this other side of the stack that doesn't get discussed a lot, I think for a number of reasons, but it's machine learning and the ecosystem and process and tooling around machine learning. And that's been of interest to you lately. Can you tell us, just start out by saying, when you think about machine learning and MLOps, what are the types of things you're talking about, and why don't you think that they're at the forefront of the conversation with the data stack?

Starting point is 00:39:12 Sure. So yeah, when I refer to MLOps, it's almost like everything but the model. I mean, in some ways it is the model, obviously it plays a role, but there's so much other stuff that, right, encompasses getting some form of model out into production. And I think this is such a constant problem for anyone who's ever built a model, that you realize you understand how to build a model. And maybe that's what you learned in school. Maybe that's what you learned at a bootcamp, but you never learned the other side, which was, okay,

Starting point is 00:39:36 but now how does it go into production? How is it maintained? How do we make sure it's still operating correctly over time? How do we deal with the various problems that you can run into in terms of keeping models up to date? What if problems start occurring? What if data drift starts occurring? Things of that nature become kind of challenging.
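
As a rough illustration of what watching for data drift can mean in practice, here is a minimal sketch. It is not a method described in the episode: the function names, the example values, and the threshold of 2.0 are all hypothetical, and a production system would likely use a proper statistical test per feature. It simply measures how far the mean of live data has moved from the training data, in units of the training standard deviation.

```python
from statistics import mean, stdev


def drift_score(train_values, live_values):
    """Absolute shift of the live mean, in training standard deviations."""
    spread = stdev(train_values)
    shift = abs(mean(live_values) - mean(train_values))
    if spread == 0:
        # Constant training feature: any shift at all counts as drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / spread


def has_drifted(train_values, live_values, threshold=2.0):
    """Flag the feature when the shift exceeds the assumed threshold."""
    return drift_score(train_values, live_values) > threshold


# Hypothetical feature values seen at training time and in production.
training = [10, 12, 11, 13, 9]
stable_live = [11, 10, 12]
shifted_live = [20, 21, 19]
```

A check like has_drifted(training, shifted_live) could run on a schedule next to the model and trigger a notification or a retraining job when it fires.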

Starting point is 00:39:53 So I think that's kind of what I refer to as MLOps, which is all of the stuff around the ML model, which is, to me, very similar to data engineering, which is like you have the data pipeline, but then around the data pipeline, you have so much infrastructure that's just there to make sure that that data pipeline operates smoothly and that you get notifications when things go wrong. And I think that's a space that is going to continue to grow over the next few years, just because we've had now a decade or so of big tech companies and other, you know, tech companies developing their pipelines, developing good or best practices.

Starting point is 00:40:25 Like you said earlier, you have enough people with entrepreneurial drive that have worked at those companies that will now probably take those learnings and develop products to send out to other companies in the next couple of years. So that's one reason or one area that I'm kind of focusing in terms of like MLOps.

Starting point is 00:40:40 In terms of why it doesn't necessarily get discussed, I think the term itself is still kind of coming into its own. I think there was a paper, I think in 2015, that kind of kicked off the idea. But I think, if I recall, the term started getting used like 2018, 2019, in terms of MLOps. And I think that's one reason. I think people are trying to just now get attention to this idea of, okay, in order to actually get that model out into production, we need to have a system. We can't just push it out there and not think about how it lives out there in the wild. It has to have something more. And so I think it's just more as companies mature in their data understanding and data infrastructure, they'll eventually get to that point. But I think a lot of companies aren't

Starting point is 00:41:22 there yet, right? Like they're still trying to gather their data and manage it in such a way that makes sense. And so the next step after that would be like, okay, now that we have it all, we've done more, we've done dashboards, we've gotten all the value out of this quote, unquote, low hanging fruit, how do we really drive that next level? And that will be where they'll learn about, okay, I want to develop this machine learning model. Okay, wait, I've developed it. Now what? And so I think that's generally what most new people come up to. It's like, okay, now what? Ben, quick question about ML Ops.

Starting point is 00:41:50 It has like, the word comes from ML and Ops, okay? And the reason that I'm saying that is because usually with ML, we're associating data scientists and ML engineers, right? How do you see the overlap between data engineering and data engineers and like these other engineering roles participating in MLOps and whose problem is MLOps at the end? That's definitely kind of an interesting question because like, right, like you have MLO engineers

Starting point is 00:42:15 that develop models and put them out into systems. And I think they've definitely been doing that for a while at larger tech companies. But I think I've also seen a lot of people kind of implement machine learning models in things like Airflow, right? Like if you're doing a batch job for your machine learning model, it's like, okay, well, what's one way we could kick this off? Well, let's just use Airflow, right? Like that's one thing we could do to get out whatever the output is. And obviously, that's not for live models, but if you're doing something that's batch focused. So I think that's kind of the similarity, where you're dealing with either batch jobs in ML or you're dealing with live, more streaming-like jobs. And you probably come up

Starting point is 00:42:52 with similar optimization problems and performance problems as well that you would run into in data engineering when you're doing transformations and things of that nature. So that's kind of why I think they're somewhat related. Whose problem it is, I think will depend on tooling in the future. I think for now, you'll still have ML engineers kind of taking care of a lot of the implementation, or software engineers in general, just because again, you're trying to optimize something

Starting point is 00:43:14 that maybe you might not have someone that's both strong in ML and in implementation. So you might have to kind of find this happy medium. But once I think you start getting hopefully better tooling, you can get to a point where maybe ML engineer or machine learning, like researchers, or maybe data scientists can figure out a way that they can deploy it,

Starting point is 00:43:31 maybe without having to have a whole extra person required for that. Well, I mean, again, that'll depend on where tooling goes though. That's great. I mean, okay, it's probably also a little bit early. It's still under, like, definition and trying to figure out exactly

Starting point is 00:43:45 how it fits in the organization in general. But can you give us like a list of tools that you think are like core or like important for MLOps today that are very commonly used? I think there's a few that I've been like looking into myself. I think like I've personally been looking into like

Starting point is 00:44:02 things like just Azure ML and the different features that it has, because obviously it's got some things that are more focused on helping you kind of find the right model. But it also is, I think, going towards that drag and drop, very similar to SSIS, feel of a tool. I don't know what it's called, where you can kind of run models using things of that nature. I think also things like feature stores tend to be something that can play a role in MLOps. Also, I'm looking at DataRobot right now in terms of how it's going to kind of play its role. So I don't know if there are specific components. I think there are specific tools I'm looking at to try to figure out what their role is, what works best where, right? I think you're going to deal with a similar problem to the one we have in the data engineering space,

Starting point is 00:44:45 which is there's just a lot. So I'm still trying to figure out like which tools will work best. Yeah. Yeah. Actually, that's a very interesting topic which is feature stores. The reason like I find feature stores very interesting from a product perspective more and

Starting point is 00:45:01 not like the machine learning or the engineering perspective is that you hear a lot about them, but actually there aren't that many of them out there. I mean, you have the big companies that have built their own, and even companies that traditionally open source many things, like Netflix, for example, they haven't done it yet. And you pretty much have something like Tecton, which is something that, at least until recently, wasn't publicly open.

Starting point is 00:45:32 It was a very enterprise kind of product. And Feast, which is the open source one. And that's all. I mean, is there anything else out there? I think I recall doing something a while back where maybe I saw one or two more, but those are definitely the two that I recall. In fact, Tecton is one of the Slack channels I'm on.

Starting point is 00:45:51 So yeah, yeah. Those are the two that I'm well aware of, but I'm sure there might be some more, maybe more open source-y style projects out there. But yeah, it really has been kind of, you know, people haven't really tried to productionize it and make it into a product. Are you thinking about that as your next product, Costas? I don't know. I mean, it's a very interesting data problem that feature stores are trying to solve. We have an episode with someone from Tecton, actually, about feature stores, and it was very interesting. It was like the first time that I talked with someone

Starting point is 00:46:26 about feature stores, and it was very fascinating. But I find it very interesting that we don't see that many products yet. That's one. And the other is that we don't see open source projects, which is another thing. Like, for example, let's take Data Lake, right? We have Delta Lake, which has been open source.

Starting point is 00:46:48 We have Iceberg. We have Hudi. You can do your own things probably with more vanilla stuff by just using something like Parquet files and running Athena or something like that. But I would put that closer to the products that are related to data warehouses and data lakes, right? But you have quite a few open source projects there. But that's not the case with feature stores, which I find very interesting. Maybe it has to do with the nature of the problem or the products or the scale. That's another thing. What kind of company do you believe needs a feature store?

Starting point is 00:47:24 And when does it become something important? I don't think I have a good answer for that at this point. Yeah, I just don't think I have a good answer. Because I had this conversation with a guy from Tecton, and he was saying that a feature store is something that you need when it is going to affect the productivity of a team, right? Like, you need to have a sizable team to need that. It's not that just because you have someone who's creating a model or two or trying to do some prediction internally, you're going to need a feature store. Maybe that's also another reason, like maybe it's a product where you need to be at a certain scale and above to actually need it,

Starting point is 00:47:59 or it's just too early, and this whole product category is still under definition. I don't know, but it's very fascinating. It's very interesting. I'm very interested to observe how feature stores are going to progress as products. I guess at Facebook you have similar technologies that you are using, but these are all built internally, right? Yeah, yeah. I mean, I think a lot of the stuff that's like MLOps, it's funny. Like, now I take for granted a lot of things, I think, at this point, when you talk about my work internally at Facebook, just because it is the positive and negative, I think, of working at a big tech company, right?

Starting point is 00:48:37 Like you don't understand all of the problems. So before, in order to get my way through college, I worked in the culinary field. And the first restaurant I worked at was like one of the top restaurants in the city. And I eventually went to a slightly lesser one, and they kind of pointed out, they're like, yeah, you used to work somewhere where you just basically had to slice a tomato and serve it, because you've got such good ingredients, and now you actually have to work hard to make those ingredients something. So it's the same here. It's like, some places you just start with, it's such a good place.

Starting point is 00:49:09 That's like, that's just so easy. I mean, in comparison, right? Like when you have a lot of the harder problems solved, it's not to say there aren't problems, it's just different. Yeah. Yeah, yeah, makes total sense. All right, one last question from me.

Starting point is 00:49:22 What do you think is the importance of open source in this whole category of data related technologies that we see around us? Yeah. I mean, I think definitely open source always will play a role, right? There are so many things we already kind of rely on in one way or the other that are open source. Like I think even things like Hive, although it started at Facebook, went open source. It just benefits a lot from people being able to improve the overall solution and not being limited to, again, the 10 engineers that could possibly be working on it. And I think it just gives a lot more perspective on the problems that you run into

Starting point is 00:49:58 in that code base, right? Like you're not being forced to wait for someone to fix the problem. You can fix the problem. And I think especially as engineers, that tends to be our mindset anyways, right? Like, just give us the code, right? Just give me the code, I'll fix it, and then we can go forward with this and make this better together. So I think that's kind of the important thing in terms of the benefits of open source, right? Like we can in theory move faster if you have a good community around a product. Yeah. You mentioned Airflow a few times, which is an open source project.

Starting point is 00:50:31 Are there any other projects that you really, I don't know, love that are open source? I don't know if I'd say that I have huge ones I love. Like, I keep tabs on things, right? Like Airbyte's something I've been keeping tabs on. It's an interesting idea in terms of open sourcing data connectors. Yeah, I think that's the other one that I've kind of currently been paying attention to. Yeah, I think that's currently my focus.

Starting point is 00:50:56 I don't know, again, I don't think I have a love for anything. I think if something's open source, if something costs money, I think it just depends on the tool. Like if I like it, I like it. If I, like Snowflake, like Snowflake is not a cheap thing, but I like it. So I would, I would enjoy using it, but yeah, I don't, I don't think there's like a preference.

Starting point is 00:51:13 Really an interesting conversation. Loved hearing about your experience as a consultant, loved hearing about Facebook. And then of course the other side of the stack, which is a whole fascinating conversation in and of itself and something I think we should do an episode on probably here soon, Costas, because I agree it's the next wave of what's going to happen to data once everyone gets the analytics and the data unification sorted out. So Ben, really appreciate the conversation. We'd love to have you back on the show sometime soon. Yeah, no, thank you guys. I really appreciate your time. I enjoyed this conversation. Yeah. Let me know. That was a really fun show and I'm going to be a broken record here and restate something that Ben stated

Starting point is 00:51:58 and then that I also restated, but it's amazing to me, just thinking about the fact that they have more engineers working on internal tooling for engineers than many companies have total employees when they IPO. And that's just incredible to me, just thinking about having those level of resources and the types of things that you can build, the speed with which you can build them. Of course, working in a large organization, there's process and bureaucracy, but that kind of leverage is pretty mind-blowing. Absolutely. I don't think that we can understand

Starting point is 00:52:31 the scale of a tech company like Facebook. And I'm not talking about the technology. Forget about the technology. I think the most fascinating thing is the organizational scale. How you can get all these people and all these thousands of engineers and create such a consistent product experience

Starting point is 00:52:48 at the end internally and externally. It's amazing. And I don't think that it's something that you can easily experience. It was a very interesting and very fun episode. I also want to, outside of what you said, there are two things that I want to keep from our conversation.

Starting point is 00:53:03 One is that the problems at the end are the same, regardless of how big or small of a company you are. What changes is the scale, actually. And that might change the tools that you might be using. That's an interesting part of the conversation where we were saying that, okay, just use a Postgres at the end. You don't really need necessarily a huge data warehouse that is super ultra scalable like Snowflake, right?

Starting point is 00:53:27 That's one thing. And the other thing that I really liked was what Ben said about Snowflake, that it's the Apple of data warehouses. I found... I loved that. Yeah, I loved it too. I think he's like very to the point about the product experience that people get from Snowflake. So that was also amazing. And yeah, hopefully we will have him again for another episode soon.

Starting point is 00:53:50 I really enjoyed hearing that because I think especially in the world of data, we have a requirement to be very precise in our work and be very descriptive, require very specific features. And there's this intangible component of really great products that make people say, I just like to use it. And that's kind of hard to describe. And I love that he brought that up and said, it's expensive, but I just really like it. And I think that's, I think that's a big testament to Snowflake and what they've built. That's true. That's true. All right. Well, thanks for joining us and we will catch you on the next one.

Starting point is 00:54:30 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse
