The Data Stack Show - 171: Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster
Episode Date: January 3, 2024

Highlights from this week's conversation include:
- The role of an orchestrator in the lifecycle of data (1:34)
- Relevance of orchestration in data pipelines (2:45)
- Changes around DataOps and MLOps (3:37)
- Data Cleaning (11:42)
- Overview of Dagster (13:50)
- Assets vs. Tasks in Data Pipelines (19:15)
- Building a Data Pipeline with Dagster (25:40)
- Difference between a Data Asset and a Materialized Dataset (28:28)
- Defining Lineage and Data Assets in Dagster (29:32)
- The boundaries of software and organizational structures (37:25)
- The benefits of a unified orchestration framework (39:56)
- Orchestration in the development phase (45:29)
- The emergence of the analytics engineer role (51:53)
- Fluidity in data pipeline and infrastructure roles (52:40)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
We are here with Sandy Ryza from Dagster Labs.
Sandy, so excited to chat with you about data ops, workflows, data pipelines, all of the above.
Thanks for coming on the show.
Thanks for having me. Excited to chat with you.
All right, well, give us your background briefly.
Yeah. So I'm presently the lead engineer on the Dagster project.
And I think we can talk a little bit more later about what the Dagster project is, for those who aren't familiar.
Earlier in my career, I had a mix of roles that involved building data infrastructure,
building tools that help data practitioners, and working as a data practitioner and machine learning engineer myself. I started my career at Cloudera. While I was there,
I wrote this book, Advanced Analytics with Spark, that taught how to use that particular framework
to do machine learning. And then I spent a number of years as a practicing data scientist at Clover Health and
Motive, which used to be called Keep Truckin', and also worked in public transit software before finding myself back in the data tooling space at Dagster Labs.
That's awesome. I'd love to dig into the role of an orchestrator in the lifecycle of data.
Like defining it, why we need it, why it has to be an external tool, right?
And why it's not part of the query engine, for example. Also, why we currently have such a diverse, let's say, number of solutions out there,
especially when we are considering the more traditional data-related operations and DML
operations.
And we even see new orchestrators coming out that are focusing just on the ML side.
Why do we need that when we have, let's say, something that already works for data? I'd love to hear
and learn from you why that is and what it means for the practitioners out there.
What's on your mind, though? What would you like to chat about and get deeper into
during our conversation?
Yeah, the topic that you brought up is one that I've thought about quite a bit, both from this perspective of being a machine learning engineer and from this perspective of working
on tools for machine learning engineers.
And I think we can get into this later, but the fact that I ended up working on a general
purpose orchestrator kind of says a lot about how I view the role of
orchestration and data pipelines in the machine learning engineering domain. So really excited
to talk about that. Excited to also talk about orchestration in general and what it means to
build a data pipeline and the relevance of that to different roles like data engineers, machine learning
engineers, data scientists.
Yeah, that's awesome.
I think we have a lot to talk about.
And what do you think, Eric?
Let's go.
Yeah, let's get to it.
Good.
So great to have Dagster back on the podcast after such a short time.
All right.
Well, we have a ton to talk about.
And specifically, we want to talk about sort of the intersection, the changes around data ops, ML ops,
and that whole space. I mean, there's so many tools, there's so many opinions out there.
So I want to get there, but I want to start by hearing your story because it's pretty fascinating.
So can you just give us an overview of sort of the arc of your career, where you started and how you sort of ended back
in the place where you started? Yeah, my career is a bit of a loop, and I'll quickly walk you through
it. So I started out in data in 2012, which felt like a qualitatively different era of data. This
was the era when data scientist was kind of a burgeoning new term, a buzzword, the sexiest job.
Across the entire stack, a lot of the focus of where the technology was going
was big data, the other buzzword, and everyone was focused on how we could process these enormous amounts of data.
And I worked at Cloudera, which was kind of at the heart of that.
So I was a contributor to these open-source software projects
that were kind of at the heart of this big data software stack.
One of those was Hadoop MapReduce,
which was originally based on these kind of foundational
internal Google papers on how to process Google-sized data.
And the other one was Spark, which was sort of an improvement upon the original Hadoop framework
that made it accessible to a much broader set of people and for a much broader set of use cases.
For example, machine learning. So I started my career working on these open source software projects that were fundamentally
built for data practitioners, like data engineers and data scientists.
And became kind of interested in what was on the other side of the API boundary.
We were building these systems that
could process enormous amounts of data. It's like, that's cool, but it's very abstract. Like,
what value do you actually bring the world by processing these enormous amounts of data?
And so I wanted to go sort of up the value chain a little bit and learn a little bit about
what the world of using these tools looks like. So I first did that within Cloudera.
We had this internal consulting function, which was sort of like an embedded data science
team.
And we would go on site to a, let's say a large telco and help them understand their
users and use these big data tools to understand their users.
But eventually I ended up working in full-time roles
as a machine learning engineer and data person
at companies that actually had embedded versions of those functions.
So one of them was Clover Health,
where we were working on health insurance.
Another one was Keep Truckin',
which is now called Motive,
working on technology that
helps truck drivers do their jobs.
And so, you know, I started talking about how 2012 felt like a very different era in
data.
And I think in a way that's largely because the problems that people focused on were very
different at the time.
And I think there was this kind of acknowledgement that maybe the world
of data had gotten ahead of itself
a little bit or had
or the tools had maybe
solved some layer
of problems, but there was this other layer of problems
that was like bigger and scarier
on top of that layer of problems.
And it was less about the
size of the data, but about the complexity
of the data.
So, like, from the perspective of Cloudera, it's like, okay, once we make you a tool that can process two terabytes of data in 25 seconds, then you'll just take that and make your machine learning model.
You're done.
Yeah, you're done.
It's awesome. Like, right. Then they just run, you know,
fit regression model, and, you know,
who even needs a data team?
But moving to the other side of this
and becoming, being in these roles
where I was actually, you know,
developing machine learning models,
doing analyses,
trying to answer questions with data,
it became clear that the hardest part
of actually doing this job was wrangling
and structuring this enormous amount of complexity.
Starting with data that was,
I don't think you'd say garbage,
but you'd say very disorganized
and trying to bring some order, you know, not just to the data itself, but to the process that generates
and keeps that data up to date.
Yep.
And so the consequence of this was that, because doing these basic data tasks was so disorganized and difficult in these jobs, I ended up spending an enormous amount of my time, especially when I was in more lead roles and responsible for making other people on my team productive,
just building internal frameworks at these companies to do this job. And, you know, maybe we'll get to this later, but
the biggest way you can improve a machine learning model is to give better data to that model.
Yeah. Yep. So in the roles where I was responsible for building better machine learning
models, I was primarily concerned with how I could give better data to these models and do so in a
reliable, repeatable way. And I basically ended up spending a huge chunk of my time building frameworks that would
allow me and other people on my team to do that successfully.
When it came time to find a new role around 2020, I sort of thought, like, why go into a company and build another
internal version of this framework that might be really useful for that
company, when I could try to build a version of it that is accessible to many different
organizations?
It ultimately felt like a much more high-leverage thing to be able to do.
And I happened to know Nick, who was the founder of Dagster and the company, which used to be
called Elementl but is now known as Dagster Labs. I was basically like, this is a problem,
you know, I've built this system a couple times before. I want to do it again, but do it general,
and do it right this time. I talked to Nick
and joined the team at Dagster Labs
as one of the first six or seven employees.
And I've basically been working full-time on Dagster,
this open-source software project, since then.
Wow, what a story. You know, it really struck me. I love the analogy about, you know, you can process two terabytes of data in 25 seconds
or whatever. It's like, you have this race car, but in order to drive it, you actually have to go
build an oil refinery.
So I think that's an amazing analogy. Yeah. Love that. Yeah, that's super ironic.
Okay, so a couple questions here. First of all, what I'd love to
know is: going through multiple roles as a practitioner,
when did you step back?
Do you remember maybe the moment or the project where you said, wow, I'm seeing
a pattern here, because I seem to keep going back and working on this similar thing?
Yeah, so I think there's a fundamental dimension of the way my brain works that is very lazy. And what I mean by that is I really
don't like to hold a bunch of information in my head at one time. I really
want to be able to think clearly, so I really want some external system to be able to
offload that to. So pretty early on in these roles where I was doing data pipelining tasks,
I got frustrated very early with the tooling and found myself trying to at least
contribute to it and improve it in minor ways.
I think another piece there was talking to a lot of other practicing data scientists at the time.
And, you know, there was this refrain of so much of what we do, like, you know, we're hired to do machine learning, but all we do is clean the data.
I think it took some number of those conversations. I don't know how many, but for me to realize and reframe it in my mind
that like cleaning the data
isn't this like task of drudgery
that you have to do before doing the exciting part.
Like it is kind of fundamentally
the heart of the machine learning engineering job.
And, you know, you can think of it as cleaning data,
or you can think of it as producing reliable data sets that are generally useful within your organization. It is sort of this work of structuring: taking these reusable pieces of data
and then building even more useful
and reusable pieces of data on top of that.
I found that a very motivating way
to think about that work.
And yeah, I think that probably clicked
in my first data role,
but then really got reinforced in my later data roles.
Okay. Super interesting. My next question is actually more related to Dagster. So
what I'd love for you to do is tell us, you know, give us an overview of what is Dagster? What does it do? And then I'd love to know
how much of what you were building, how close was it to the stuff that Dagster
does when you were in those practitioner roles like the tools?
Got it. Okay, so trying to think about what the
best angle is to approach this.
I think
both in my life and generally in these roles,
a pretty common pattern is that
you'll have a set of analysts who aren't software engineers,
not the most technical people, although they'll have
some proficiency with Python
or some proficiency with SQL. And you'll end up
with some sort of domain-specific language or internal
framework inside of a company that allows those analysts
to do their job. It's not always like this, but if you have
a more tech-savvy analyst or some data engineer
who's responsible for supporting these analysts, they'll end up building something internally
that makes it so the analyst doesn't have to, you know, spin up a cron process and run Docker
every time they want to, let's say, keep some table up to date. And if you look at these
frameworks, thinking about the frameworks at the organizations that I was at, they always tended to revolve around tables. And so
the fundamental abstraction when you're thinking about, you know, sort of reproducible work
in a like data analyst or even machine learning role is like a table or some sort of data set.
Like, I want to start with this data that we have that's maybe sort of not clean or
not formatted in the way that's most useful to me.
And then in the course of my analysis, ideally kind of like factor out some sort of cleaner,
more useful version of this data set that, you know,
the next time I have to do this analysis, I'll be able to rely on.
At Clover Health, as well as at Keep Truckin', tables were
the natural way that we built our internal tools
to make our data scientists productive.
And then there was this interesting mismatch,
because that was the natural way for us to think about it
as the people in these data roles.
But then you look at these tools that were the orchestrators of the time, which
are still popular orchestrators now, like Airflow, for example, and they're focused on a totally different
set of abstractions.
With Airflow, you go in and you define a DAG, and a DAG is a set of tasks, and
you're fundamentally thinking about tasks when you operate your data pipeline. So the primary
challenge, as someone who was trying to do this data science and data pipelining work, was translating
from this table way of looking at the world,
which was very natural to me and the other people I worked with, to this task and workflow
based world, which was the language that tools like Airflow spoke. And so the internal frameworks
that I would end up working on at these companies were basically these translation layers.
They would allow me to express what I'm trying to do, basically write my data pipeline in terms of tables or data sets or machine learning models and the relationships
between those entities,
and then, you know, have some software figure out how to airflow that for me and
turn it into this world of DAGs and tasks. It was a messy fit. You
can get your pipeline running on a schedule, but there were all these weird
translation issues at the borders. So when it comes time for someone to debug an error or
look at logs, they're forced to think in terms of these very different abstractions from the ones
that are natural to them as data practitioners. So this is a very long-winded way of
saying that what got me excited about working on an orchestrator like Dagster was the opportunity
to build something that thought about assets. When I say an asset, I mean a table, a data
set, a machine learning model,
any sort of persistent object that captures some sort of understanding of the world.
It was the opportunity to think about that as the center, the central abstraction for building the
data pipeline and allowing everything to revolve around that. Super interesting. And can you talk about maybe just at a high level
to start with, how do you think about a system that relies on the concept of assets as opposed
to tasks? Like what are the fundamental differences there in terms of how the system itself operates,
right? Because I mean, you can
create, you know, you can do orchestration with Airflow, you can do orchestration with Dagster,
right? But we're talking about sort of two fundamentally different approaches.
That's right. It permeates in a bunch of different ways. I'm trying to think about the best way to
approach it. When you build a workflow using tasks, there's kind of this fundamentally top-down approach
where you have these sort of like individual tasks and then you assemble them into a DAG.
And the DAG, you know, it's the workflow. It defines the dependencies between those tasks.
Whereas when you're working with assets, it ends up being a fundamentally more distributed approach.
So when you define a data asset, which is synonymous with saying, I have a
table that I want to create.
Let's say there's this raw data here, that's all the raw events that come into my
system.
And then I want to create a table called cleaned events or gold events.
When I define that table, I define its dependency on the upstream events table. And the way that
the entire dependency graph is defined is at the level of individual assets, instead of
having to do this top-down approach that involves a set of DAGs.
The consequence of thinking that way is you're not forced to make these tough and often arbitrary
decisions about where the nodes in your graph go.
A common failure mode in people who build DAG-based data pipelines is they'll have one
sort of enormous, unwieldy DAG, and anytime they make a change, they have to contend with that
entire DAG, like execute it or deal with the enormity of it.
Or they'll go the opposite way, and they'll
chop up their DAG into these tiny little pieces, but then lose the ability to actually
reliably track the relationships between those pieces. And so when you think about data assets,
and you think about defining dependencies in terms of what data do I need to be able to generate this
data set, you kind of sidestep that problem
entirely. A second piece of that is the fundamentally declarative approach that comes
when you're thinking about assets first. When data engineers are questioned by
other people in the organization, like management or business stakeholders...
maybe "questioned" has too much of an interrogative connotation. But when data practitioners want to communicate with stakeholders about their work,
the language they normally communicate in is data assets.
I found this very true in my own work.
Like when I'm explaining to someone the data pipeline that I'm working on or the thing
that I'm going to produce for them, the thing that I draw on the whiteboard is the tables
that are going to be produced.
Yeah.
Yep.
And machine learning is almost even more clear.
You say, you know, I'm going to make this machine learning model.
These are the features that are going to go into it.
These are the evaluations that are going to come out of it.
So you naturally, when you're communicating about what you're building in your data pipeline,
think in terms of this network of data assets.
And so the advantage of an orchestrator
that sort of thinks primarily about the data assets
is like that language is the language that you use
to actually define your pipeline.
So the consequence of that
is that you have this degree of confidence
that the pipeline is actually going to generate this network of data assets, because that's the language the pipeline is defined in terms of.
Makes total sense.
Sorry, Eric.
Can you maybe give us an example, like a concrete example, right?
Of a pipeline and how this could be done using the concept of software-defined
assets, right?
Like in Dagster.
Yeah.
So I really wish I had the ability to use a visual aid, but I'll do my best to describe
it.
So super basic, let's say you have a table of raw event data. Let's
say you're running a website. People come onto your website and click on things in your
website, log in, maybe your website sells something.
Yep.
So you have these kind of these core basic entities that will often come in in some sort of raw form at the beginning of your pipeline.
So those might be, let's say, all the events that happen on your website.
So like clicks, page views, logins.
And your role as a data engineering team might be to deliver cleaned up versions or aggregated versions of this kind of data as core data sets.
So other people inside of your organization can use those to build data products or do
analyses.
Can I interrupt you?
Let's say we have... just to make it a little bit more concrete for someone,
let's say we collect these events, right?
All the different events.
And at the end, we want to get to the point where we have somewhere the number of signups per month.
And I'm giving this as an example because I think it's very straightforward.
And it's exactly what you are saying, but you're making the more general case, right?
That's right.
So let's say we go from an event in JSON captured with something
like RudderStack, right? Pushed to our data warehouse, and we want to end up
calculating how many signups we had this month, right? I think many people,
especially coming from using Airflow
or something like that,
get, let's say, the tasks
that are needed there, right?
How would we do that with data assets?
Great question.
So with the data asset focused way
of building that data pipeline,
the first thing you sort of do is think about the nouns. It's natural to start with where you're
trying to get to. And then, you know, I'll even do this sometimes on a whiteboard. I'll write out
where I'm trying to get to. I'll write out the data that I have. And then I'll write out a set
of intermediate data sets that will help me get from the data that
I have to the data that I'm trying to get to.
So thinking in terms of your specific example, the data that you want to get to is probably
a table that has information about these signups.
And you might even write out the schema of that table.
And the data that you're starting with is, let's say, this raw, untransformed
sequence of blobs that represent events. Maybe they're in S3. And let's say that where you want
to get to is going to be to have this table in Snowflake so that it's easy to query from sort of
dashboarding tools. Yep. Those are two nodes in your graph.
And then you think about, okay,
to actually build a reliable signup data set,
what are the subcomponents that I need to have
to be able to accurately calculate signups?
So let's see.
Maybe one of these subcomponents is the, you know, set of
times people hit the enter button on my signup form. But I also know that there's a bunch of
internal testing that we do, where people would hit that enter button, and we don't actually want to
count that in our business-facing signup metric. So it's important to exclude those internal tests.
And then we have this separate table somewhere
that is a list of all of our internal test users.
So to compute this ultimate legit signups table,
we're going to need to depend on a couple different things.
One is going to be this table of internal test users.
One is going to be this list of form
submissions for the signup form. Then you think, okay, how do I get the form submissions? Ultimately,
that will be derived from my underlying asset at the beginning of my graph, which is the list of
JSON blobs that are clicks on the website. And you draw these out and you basically put arrows connecting any dataset to the datasets
that it needs to read in order to generate itself.
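The whiteboard exercise Sandy describes, writing out datasets and drawing arrows from each one to the datasets it reads, is just a dependency graph. As a rough illustration (the asset names are hypothetical, following the signups example in the conversation), it can be sketched in plain Python, and an execution order derived from the graph alone:

```python
from graphlib import TopologicalSorter

# Each asset maps to the assets it needs to read in order to
# generate itself (names are illustrative, per the example above).
asset_deps = {
    "raw_event_blobs": [],                    # JSON blobs in S3
    "form_submissions": ["raw_event_blobs"],  # signup-form enter presses
    "internal_test_users": [],                # separately maintained table
    "signups": ["form_submissions", "internal_test_users"],
}

# An orchestrator can derive the build order from the graph alone:
# upstream assets always come before the assets that depend on them.
build_order = list(TopologicalSorter(asset_deps).static_order())
print(build_order)
```

This is the sense in which the dependency graph is defined at the level of individual assets: each entry only names its own inputs, and the overall ordering falls out of the graph rather than being assembled top-down.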
And so is the asset the dataset itself,
like the materialized dataset,
or is it the concept, let's say?
What's the difference there between the two?
Got it.
Yeah, so first of all, just to clarify,
when we say data asset in the world of Dagster,
the reason we don't use a word just like table
is that we want these to be more flexible.
So they could be, it could be relational data,
but it could also be a set of images in S3
or a machine learning model.
When we talk about an asset, we're talking about some sort of object in persistent storage.
It doesn't necessarily need to be a data warehouse, but it could be a table in a data warehouse.
It could be a file in a file system, a model that's in some sort of model store.
And that's what we're referring to when we refer to a data asset.
Okay.
Okay.
That's great.
Cool.
So from what I understand, from what we are saying, the way that
Dagster works is by actually asking the user to define the lineage, let's say the materialized steps
that the data has to go through
until it delivers the end result.
So instead of thinking in terms of
processing,
we are thinking in terms of outcomes.
So it's not, let's say,
the query per se
that generates the data.
It's the data and how it connects to the previous data set
that was the input to actually generate it.
Do I get it right?
Yeah, that's exactly right.
And I want to add, you do have to think about processing
at some point, because,
you know, Dagster isn't going to read your mind and just figure out what needs to get
run in order to build the signups data set from the events data
set. But when you write out your processing logic, you're sort of hanging it off of this scaffolding
of the data asset graph.
Okay.
And how is the user using Dagster?
Is it like an SDK?
Is it something like some annotations that you use to annotate some object, like
in a notebook? How does the user actually go and define these lineages between
Great question.
Yeah.
So Dagster exposes a Python API that allows you to define your data pipeline.
And so ultimately the most straightforward way to define a data pipeline
in Dagster is to write a Python file and include a set of asset definitions in that Python file.
An asset definition is basically a decorated function. So for example, if you want to have
an asset called signups, you would write a Python function called signups. You would decorate it with
Dagster's asset decorator just to indicate it's an asset and then optionally include metadata
about the asset, including dependencies on other assets. And then inside the body of that function,
you would include the logic that's actually required to build this signups table.
Okay. So for example, read data from some other
table and then do some
transformations and then write it out to
your
storage system.
And I would assume this is
something that is agnostic to
the system itself. Like, it can be SQL,
or it can be
a data frame SDK
that is used for Spark or PySpark or Polars or whatever.
The processing logic itself is not something
that Dagster is opinionated about.
It's exactly right, yeah.
So the idea is that it's just a Python function.
You can invoke any computation, any framework from that Python function. A really common thing to do is to invoke dbt. So for those who aren't familiar with dbt, dbt is a framework that allows defining tables, basically as SQL statements. So let's say you want to define this signups table.
You would create a file called signups.sql.
And then inside that file, you would include a select statement that says, select blah, blah, blah from the events table.
And Dagster has a dbt integration that basically will digest that dbt table definition,
have Dagster understand it, and then when it comes time to actually
execute that node in the graph, will invoke dbt to execute the SQL inside your database. Okay, that's interesting.
Why would someone do it like that, though,
and not just use directly like DAGster or DBT, right?
Why would someone use both systems together?
Got it, yeah.
So I think there's two directions to think about that question.
One is,
why wouldn't you just use Dagster? And the other one is, why wouldn't you just use dbt?
So starting with why wouldn't you just use Dagster: dbt has become a standard for expressing data transformations in SQL. And it has a lot of features
that make it really powerful at doing that. So for example, you can write macros, and the standard way that you specify data dependencies
in dbt has just become widely accepted as part of the analytics engineering skillset.
For Dagster to try to rebuild that would unnecessarily fragment the ecosystem
and make it less accessible to the set of users who are already familiar with it. One way of thinking about dbt is
even as a set of extensions to the SQL programming language that
make it useful for defining data pipelines.
So that's why DBT is a really useful tool to use even with Dagster. For the question of why not just use dbt,
dbt is very narrowly focused on a particular kind of data transformation, a particular kind
of data pipeline. And in most organizations, even when a large body of the work that they're doing
fits into the dbt framework, often a large body of their work will not fit easily into the dbt framework.
So for example, they'll have steps in their pipeline that do things that are, you know,
just fundamentally not SQL transformations.
Like maybe they'll be moving data between different storage systems, or they will be
building machine learning models.
And those don't really make sense to
represent
inside of dbt.
And so if you were to use dbt
for all your SQL and then
Dagster for all of your not
SQL stuff, you'd end up in this sort of
fragmented world.
You wouldn't have a single consistent view
or ability to execute your
entire data pipeline.
And so embedding dbt in Dagster allows you to kind of get the best of both worlds.
Okay, that makes sense.
And let's talk a little bit about... you said something interesting.
Actually, no, before we get to that, you mentioned something about dbt.
And it's very interesting, about the fragmentation.
So there are plenty of orchestrators out there,
and one of the ways that orchestrators are created
is because somehow there is a use case where, for whatever reason,
the existing orchestrators do not cover the need or whatever.
And suddenly we come up with another orchestrator out there.
And I think that's very common, especially if we take the ML world and let's say the data processing world, right?
Although both are the same thing.
But anyway, we need some way to differentiate the two.
But I think our audience gets what I'm trying to say here.
So for example, we have Flyte, right?
It is an orchestrator.
You go to the website: build and deploy data and ML pipelines. What's the difference between something like Flyte, which focuses more on the ML side of things, from my understanding at least, and something like Dagster?
And why at the end do we end up having all these different orchestration tools, right? And in this case, dbt is also an example of that, right?
Because dbt, in a way, as part of the product at least, is also kind of an orchestrator, right?
If someone lives only inside SQL, technically they can use only dbt, right? They don't need Dagster or Airflow or some other system. And I think that's very confusing at the end. The practitioners at the end, they're like, okay, what is going on here?
Right.
So tell us a little bit more about that and how you think about it, both personally as a practitioner, right, but also as Dagster, the company.
Yeah.
A lot of thoughts there.
So I think there's this truism,
which I think is true in many cases,
that software boundaries end up
sort of modeling organizational boundaries.
So teams will build software
that sort of serves the needs of their team.
And if an organization isn't structured in a certain way, that could lead to two different teams building software that solves very similar problems, but in slightly different and incompatible ways. In the world of data, often, you know, within a data organization or within a company at large,
the functions of analytics and machine learning and data engineering will be sort of
organizationally separate. Historically, I think what that has led to is that people within those
functions have ended up building, you know, maybe building internally and then going on to open source
or going on to commercialize tools
that are sort of rooted in their understanding
of that particular function.
Something that I have encountered working at companies
with fairly early data functions
is that you end up having to fill a lot of roles.
And you know that the software that's needed
to orchestrate in the world of machine learning
is actually very similar to the software
that's needed to orchestrate in the world of analytics.
So I've come to the belief
that you actually don't really need
super specific tooling for a lot of these
domains.
A lot of the boundaries and silos that are set up are sort of artificial or unnecessary.
And not only unnecessary, but actually have a fairly high cost.
So from the perspective of a machine learning team, as I mentioned earlier, the highest leverage way you
can improve your machine learning model is by feeding it better data and ensuring the data
that's coming to it is clean and correct and accurate. It becomes a lot harder to do that if
the underlying process that is generating the data uses a totally different software stack from the
software stack that you're using.
So you actually can reap a lot of benefits by having the converse, de-siloed view of
the world that allows a machine learning person to understand the impact of a change that's like
far up in the data pipeline because their machine learning model is trained using the same orchestration
framework that upstream data asset is built using.
Yeah, that makes sense.
What are the differences though, like between, let's say, building workflows or like trying
to orchestrate like ML work compared to trying to orchestrate data engineering
or analytical work, right?
What are the differences between them?
One thing that comes up in ML work,
more than in data engineering or analytical work,
is that the experimentation and development phase is often a lot richer and more intense.
So in the simplest case, if you're just building a basic table, you write a SQL query, you
run it a couple times, you commit it to your repo, and now you have that table running.
Ideally, your orchestrator is good enough that it can basically just start updating that table.
When you're working with a machine learning pipeline, often there's a whole workflow of experimentation that happens. Even if you've written kind of the perfect code the first time, you end up needing to tweak parameters to try out your model on different features.
And so the iterative process is a lot heavier.
The compute is often much more heterogeneous as well in the world of machine learning.
If you're able to express your computation in SQL, you can basically just ship it off to Snowflake or DuckDB or whatever your database is and have it execute inside of there.
But if you're dealing with machine learning models, there's a wide array of Python libraries that you could be using.
There's hardware that you might have access to, like GPUs.
You end up needing sort of a much more flexible execution substrate to orchestrate across.
So to sum up, I think two sort of larger points to what you need to think about when you're orchestrating machine learning
versus orchestrating, let's say, analytic data pipelines.
One of them is this iterative experimentation-based workflow
and the other one is this more
complex computational
environment.
Yeah, so this iterative experimentation part happens in production in ML? Or is it a completely separate task? Here's what I'm trying to get to.
I'm trying to understand. Normally, in my mind at least,
the orchestrator is something that gets into the process
when you actually go into production.
You have concluded how things should be done, and now you have to deploy something repeatedly and, obviously, in a reliable way, so that these things will keep happening, right?
Because again, like, yeah, you experiment with ML, but I would argue that like whatever
has to do with software has like a part of experimentation, right?
Even if you write a website, right?
So it's part of the nature of the job at the end.
When you build software one way or another, there is this iterative process during development.
But that's how engineering works. You reach a point and you say, hey,
okay, now that this is what I want to do, let's push it into production. Is this different with
ML? ML doesn't have this distinction and you have to incorporate the orchestrator much earlier? Or
is the orchestrator something that has to be incorporated as part of
the development phase and not only
the production phase?
Got it. Yeah, so I think
in broad strokes, I would be inclined
to agree with you.
In particular, on that point that
experimentation is a big part of software engineering
and general data engineering
as well.
So a lot of the pieces of the machine learning development pipeline can sometimes be presented as unique to machine learning, but they're actually general software development and software engineering practices, which is part of why I don't think these require specialized tools.
The one area that I would speak more about is this notion that orchestration should only be part of production. So I don't think people should be replacing their Jupyter notebooks with orchestration, but I do think it's very powerful to be able to work with an orchestrator in much earlier phases of the data development life cycle. If you think about an orchestrator abstractly, it's this system that understands the dependencies between data and upstream data, and is able to execute computation along the lines of those dependencies. And that is a really important function, even when you're early in the experimentation process.
So for example, if you're prototyping a change to the logic that generates one of your data assets,
it's often really important to understand the implications of that change.
So how it affects that data asset and how it affects downstream data assets
far before you decide to commit that change to production.
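The value of that dependency knowledge during development can be sketched without any orchestrator at all: given a graph of assets, a change to one asset implies a set of downstream assets to re-examine. The asset names below are hypothetical.

```python
# A minimal sketch of "what does this change affect?": walk the asset
# dependency graph from a changed asset to everything downstream of it.
from collections import deque

# edges: asset -> list of its direct downstream assets (hypothetical names)
downstream = {
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["user_features", "daily_report"],
    "user_features": ["churn_model"],
    "daily_report": [],
    "churn_model": [],
}


def affected_by(asset, graph):
    """Return every asset downstream of `asset`, in BFS order."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(affected_by("cleaned_events", downstream))
# → ['user_features', 'daily_report', 'churn_model']
```

An orchestrator that tracks this graph can do exactly this during prototyping: show you that tweaking `cleaned_events` implicates the features, the report, and the model before anything ships.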
Yep. 100%. Yep.
Okay. Got it.
And from your experience with Dagster,
who is the primary user that you see?
Is it more of the data engineer or the more, let's say, traditional data practitioner?
Or do you see more people coming from ML? Or any change there in terms of the trends of like who is actually like coming to learn more about
Dagster these days? Yeah. So we see a lot of different users. Let me maybe try to categorize
them in some sort of way. One pattern of Dagster use is that a data platform engineer will adopt Dagster to help them organize the computation of a bunch of different sort of functions inside their data organization.
So maybe the data platform engineer is supporting a team of analytics engineers, or maybe supporting a team of analytics engineers as well as machine learning engineers.
And they want to set up kind of like a shared orchestration environment
where all of the data assets that are being produced
by these people who are maybe a little bit less technical
can be orchestrated in one place.
So that's one pattern of Dagster usage.
Another pattern of Dagster usage is sort of the bread and butter data engineering usage.
So in this case, the person who adopts Dagster is also the one who's writing the content of the data pipelines.
They're not just facilitating other people's data pipelines; they're actually defining data assets in Dagster, writing the logic to move data around or transform that data.
And then last of all, we see a lot of people who are doing machine learning use Dagster.
And so in these cases, it's normally sort of a mixed machine learning and data pipelining
function. They'll be using Dagster to train their machine learning model, but also to generate all the features that get fed into it, and then perhaps take that machine learning model and do batch inference with it.
Yeah, makes sense.
And one last question from me, and then I'll give the microphone back to Eric.
But with the emergence of LLMs and AI engineering,
and not just ML engineering,
is there a difference in terms of what is needed
to build around LLMs?
Or with existing orchestrators like Dagster,
what do you need to go and work with LLMs and AI?
Yeah, it's interesting.
In broad strokes, you still fundamentally have data that you're feeding in, and data pipelines still exist.
There are some differences.
So, for example, feature engineering becomes less important in the world of LLMs because these models are powerful enough to be able to
kind of do some of the thinking that a machine learning engineer would have needed to do.
But at the same time, you end up needing to do prompting and moving data through vector
databases. So the pipelines you end up creating end up looking very similar. Some of the nodes
have slightly different labels. We've seen users
use Dagster for traditional
machine learning as well as LLMs.
Fundamentally, the shape
of the work is not so
different.
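The "same shape, different labels" point can be made concrete with a toy comparison: a classic ML pipeline and an LLM/RAG pipeline modeled as dependency graphs with identical topology. Both graphs here are hypothetical illustrations.

```python
# Two hypothetical pipelines as dependency graphs: node names differ,
# but the structure (a linear chain of three edges) is the same.
traditional = {
    "raw_data": ["features"],
    "features": ["trained_model"],
    "trained_model": ["batch_predictions"],
}
llm = {
    "raw_documents": ["embeddings"],
    "embeddings": ["vector_index"],
    "vector_index": ["rag_responses"],
}


def shape(graph):
    """Reduce a graph to its topology: out-degree of each node, in order."""
    return [len(children) for children in graph.values()]


print(shape(traditional) == shape(llm))  # → True
```

Which is the point: an orchestrator that operates on the graph structure does not need to care whether a node is labeled "feature engineering" or "embedding".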
That's all from me for now. Eric,
sorry for hijacking
the conversation here.
No, that was amazing.
That was amazing.
I learned so much.
We have time for one more question here, Sandy.
And I want to ask you more about roles
and team structure in a world where, you know,
the lines between data engineering and, you know, ML engineering,
ML ops, and data science really blur. I mean, many of the things that we've talked about today,
you know, you could label the conversation, you know, a conversation about ML ops or a conversation
about data engineering either way. And you kind of saw this, you know, dbt, I think, helped coin the phrase
analytics engineer, right?
Where you have, you know,
you mentioned analysts who like,
are maybe somewhat literate in SQL
or have literacy in SQL or Python,
but not, you know, actually running pipelines.
But that kind of started to change
and a lot of analysts started to learn
to run pipelines, right?
And the same with data engineers
who, you know, ran pipelines, but didn't necessarily work on the modeling layer.
And so you had this role emerge. It was kind of an analytics engineer that's a little bit of a
hybrid. What do you think is going to happen in the relationship between, traditionally, the ML engineer, data scientist, and data engineer, you know, sort of that realm?
Yeah, to your point, it definitely feels like the boundaries between these roles, if they weren't always blurry, have become very blurry.
You know, I feel like in 2015, most data scientists
would spend half their time like explaining to other people what exactly a data scientist was,
or sparring with other people about, you know, the definition of a data scientist. And thankfully,
those conversations aren't such a huge part of the job of data science anymore.
So, you know, maybe that's because people have just come to accept that it means so
many different things and trying to pin it down is a bit of a fruitless exercise.
The way that I tend to be inclined to think about it is there are these spectrums of proficiency that different people have and that, you know, maybe eventually end up getting clustered into these different roles.
So one axis of proficiency is data modeling, you know, which is sort of tightly related to sort of engaging with the facts of the particular business.
And then there are these other axes of proficiency,
which are more about infrastructure,
dealing with Kubernetes and different substrates.
I think that from what we've seen is that these boundaries are super fluid,
and it really varies from organization to organization.
How separate is the person who thinks about the data pipeline from the person who thinks about the infrastructure the data pipeline runs on? Who writes in Python, who writes purely in SQL?
And I think it's difficult to build a data platform with the assumption that these functions
are going to end up totally siloed.
Yeah, I think it's really interesting.
And I think, you know, the tooling has really helped enable a lot of this change.
You know, for example, who writes in Python, who writes in SQL?
Well, a lot of modern tooling, it doesn't matter, right?
You can have someone writing SQL and someone writing Python
and you can use the same workflow
and work on the same data set,
which is incredible.
I mean, that really is,
you know, that sounds,
you know, for anyone who's,
you know, sort of only familiar
with modern tooling
where that's like pretty recent,
that's possible.
I mean, it's...
It definitely was not.
Yeah, it's insane.
So it is pretty cool.
And I think, you know, personally,
what I see that I'm very excited about is,
you know, I think when you give people
much easier access to explore different areas
that are interesting to them,
they can follow their curiosity
without these
massive technical walls that would require a career change to overcome. The tools are making
it a lot more fluid, which I think will spark a lot of creativity, which is exciting.
Well, Sandy, we're at time here, but it's been so great. We learned so much.
You're doing incredible work at Dagster. So thanks
for giving us some of your time. Thanks so much for having me on the show.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.