The Data Stack Show - 171: Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster
Episode Date: January 3, 2024

Highlights from this week's conversation include:
- The role of an orchestrator in the lifecycle of data (1:34)
- Relevance of orchestration in data pipelines (2:45)
- Changes around DataOps and MLOps (3:37)
- Data Cleaning (11:42)
- Overview of Dagster (13:50)
- Assets vs. Tasks in Data Pipelines (19:15)
- Building a Data Pipeline with Dagster (25:40)
- Difference between a Data Asset and a Materialized Dataset (28:28)
- Defining Lineage and Data Assets in Dagster (29:32)
- The boundaries of software and organizational structures (37:25)
- The benefits of a unified orchestration framework (39:56)
- Orchestration in the development phase (45:29)
- The emergence of the analytics engineer role (51:53)
- Fluidity in data pipeline and infrastructure roles (52:40)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
We are here with Sandy Ryza from Dagster Labs.
Sandy, so excited to chat with you about data ops, workflows, data pipelines, all of the above.
Thanks for coming on the show.
Thanks for having me. Excited to chat with you.
All right, well, give us your background briefly.
Yeah. So I'm presently the lead engineer on the Dagster project.
And I think we can talk a little bit more later about what the Dagster project is, for those who aren't familiar.
Earlier in my career, I had a mix of roles that involved building data infrastructure,
building tools that help data practitioners, and working as a data practitioner and machine learning engineer myself. I started my career at Cloudera. While I was there,
I wrote this book, Advanced Analytics with Spark, that taught how to use that particular framework
to do machine learning. And then I spent a number of years as a practicing data scientist at Clover Health and
Motive, which used to be called Keep Truckin', and also worked in public transit software before finding myself back in the data tooling space at Dagster Labs.
That's awesome. I'd love to dig into the role of an orchestrator in the lifecycle of data.
Like defining it, why we need it, why it has to be an external tool, right?
And why it's not part of the query engine, for example. Also, why we currently have such a diverse, let's say, number of solutions out there,
especially when we are considering the more traditional data-related operations and DML
operations.
And we even see new orchestrators coming out that are focusing just on the ML side.
Why do we need that when we have, let's say, something that already works for data? I'd love to hear
and learn from you why that is and what it means for the practitioners out there.
What's on your mind, though? What would you like to chat about and get deeper into
during our conversation?
Yeah, the topic that you brought up is one that I've thought about quite a bit, both from this perspective of being a machine learning engineer and from this perspective of working
on tools for machine learning engineers.
And I think we can get into this later, but the fact that I ended up working on a general
purpose orchestrator kind of says a lot about how I view the role of
orchestration and data pipelines in the machine learning engineering domain. So really excited
to talk about that. Excited to also talk about orchestration in general and what it means to
build a data pipeline and the relevance of that to different roles like data engineers, machine learning
engineers, data scientists.
Yeah, that's awesome.
I think we have a lot to talk about.
And what do you think, Eric?
Let's go.
Yeah, let's get to it.
Good.
So great to have Dagster back on the podcast after such a short time.
All right.
Well, we have a ton to talk about.
And specifically, we want to talk about sort of the intersection, the changes around data ops, ML ops,
and that whole space. I mean, there's so many tools, there's so many opinions out there.
So I want to get there, but I want to start by hearing your story because it's pretty fascinating.
So can you just give us an overview of sort of the arc of your career, where you started and how you sort of ended back
in the place where you started? Yeah, my career is a bit of a loop, and I'll quickly walk you through
it. So I started out in data in 2012, which felt like a qualitatively different era of data. This
was the era when data scientist was kind of a burgeoning new term, a buzzword, the sexiest job.
Across the entire stack, a lot of the focus of where the technology was going
was big data, the other buzzword, and everyone was focused on how we could process these enormous amounts of data.
And I worked at Cloudera, which was kind of at the heart of that.
So I was a contributor to these open-source software projects
that were kind of at the heart of this big data software stack.
One of those was Hadoop MapReduce,
which was originally based on these kind of foundational
internal Google papers on how to process Google-sized data.
And the other one was Spark, which was sort of an improvement upon the original Hadoop framework
that made it accessible to a much broader set of people and for a much broader set of use cases.
For example, machine learning. So I started my career working on these open source software projects that were fundamentally
built for data practitioners, like data engineers and data scientists.
And became kind of interested in what was on the other side of the API boundary.
We were building these systems that
could process enormous amounts of data. It's like, that's cool, but it's very abstract. Like,
what value do you actually bring the world by processing these enormous amounts of data?
And so I wanted to go sort of up the value chain a little bit and learn a little bit about
what the world of using these tools looks like. So I first did that within Cloudera.
We had this internal consulting function, which was sort of like an embedded data science
team.
And we would go on site to a, let's say a large telco and help them understand their
users and use these big data tools to understand their users.
But eventually I ended up working in full-time roles
as a machine learning engineer and data person
at companies that actually had embedded versions of those functions.
So one of them was Clover Health,
where we were working on health insurance.
Another one was Keep Truckin',
which is now called Motive,
working on technology that
helps truck drivers do their jobs.
And so, you know, I started talking about how 2012 felt like a very different era in
data.
And I think in a way that's largely because the problems that people focused on were very
different at the time.
And I think there was this kind of acknowledgement that maybe the world
of data had gotten ahead of itself
a little bit or had
or the tools had maybe
solved some layer
of problems, but there was this other layer of problems
that was like bigger and scarier
on top of that layer of problems.
And it was less about the
size of the data, but about the complexity
of the data.
So, like, from the perspective of Cloudera, it's like, okay, once we make you a tool that can process two terabytes of data in 25 seconds, then you'll just take that and make your machine learning model.
You're done.
Yeah, you're done.
It's awesome. Like, right. Then they just run, you know,
fit regression model, and, you know,
who even needs a data team?
But moving to the other side of this
and becoming, being in these roles
where I was actually, you know,
developing machine learning models,
doing analyses,
trying to answer questions with data,
it became clear that the hardest part
of actually doing this job was wrangling
and structuring this enormous amount of complexity.
Starting with data that was,
I don't think you'd say garbage,
but you'd say very disorganized
and trying to bring some order, you know, not just to the data itself, but to the process that generates
and keeps that data up to date.
Yep.
And so the consequence of this was that, because doing these basic data tasks was so disorganized and difficult in these jobs, I ended up spending an enormous amount of my time, especially when I was in more lead roles and responsible for making other people on my team productive,
just building internal frameworks at these companies to do this job. And, you know, maybe we'll get to this later, but
the biggest way you can improve a machine learning model is to give better data to that model.
Yeah. Yep. So in the roles where I was responsible for building better machine learning
models, I was primarily concerned with how I could give better data to these models and do so in a
reliable, repeatable way. And I basically ended up spending a huge chunk of my time building frameworks that would
allow me and other people on my team to do that successfully.
When it came time to find a new role around 2020, I sort of thought, like, why go into a company and build another
internal version of this framework that might be really useful for that
company, when I could try to build a version of it that is accessible to many different
organizations?
It ultimately felt like a much more high-leverage thing to be able to do.
And I happened to know Nick, who was the founder of Dagster and the company, which used to be
called Elementl but is now known as Dagster Labs. I was basically like, this is a problem,
you know, I've built this system a couple times before. I want to do it again, but do it general,
and do it right this time. I talked to Nick
and joined the team at Dagster Labs
as one of the first six or seven employees.
And I've basically been working full-time on Dagster,
this open-source software project, since then.
Wow, what a story. You know, it really struck me. I love the analogy about, you know, you can process two terabytes of data in 25 seconds
or whatever. It's like, you have this race car, but in order to drive it, you actually have to go
build an oil refinery.
So I think that's an amazing analogy. Yeah. Love that. Yeah, that's super ironic.
Okay, so a couple questions here. First of all, what I'd love to
know is: going through multiple roles as a practitioner,
when did you step back?
Do you remember maybe the moment or the project where you said, wow, I'm seeing
a pattern here, because I seem to keep going back and working on this similar thing?
Yeah, so I think there's a fundamental dimension of the way my brain works that is very lazy. And what I mean by that is I really
don't like to hold a bunch of information in my head at one time. I really
want to be able to think clearly, so I really want some external system to be able to
offload that to. So pretty early on in these roles where I was doing data pipelining tasks,
I got frustrated very early with the tooling and found myself trying to at least
contribute to it and improve it in minor ways.
I think another piece there was talking to a lot of other practicing data scientists at the time.
And, you know, there was this refrain of so much of what we do, like, you know, we're hired to do machine learning, but all we do is clean the data.
I think it took some number of those conversations. I don't know how many, but for me to realize and reframe it in my mind
that like cleaning the data
isn't this like task of drudgery
that you have to do before doing the exciting part.
Like it is kind of fundamentally
the heart of the machine learning engineering job.
And, you know, you can think of it as cleaning data,
or you can think of it as producing reliable data sets that are generally useful within your organization. It is sort of this work of structuring: taking these reusable pieces of data
and then building even more useful
and reusable pieces of data on top of that.
I found that a very motivating way
to think about that work.
And yeah, I think that probably clicked
in my first data role,
but then really got reinforced in my later data roles.
Okay. Super interesting. My next question is actually more related to Dagster. So
what I'd love for you to do is tell us, you know, give us an overview of what is Dagster? What does it do? And then I'd love to know
how much of what you were building, how close was it to the stuff that Dagster
does when you were in those practitioner roles like the tools?
Got it. Okay, so trying to think about what the
best angle is to approach this.
I think
both in my life and generally in these roles,
a pretty common pattern is that
you'll have a set of analysts who aren't software engineers,
not the most technical people, although they'll have
some proficiency with Python
or some proficiency with SQL. And you'll end up
with some sort of domain-specific language or internal
framework inside of a company that allows those analysts
to do their job. It's not always like this, but if you have
a more tech-savvy analyst or some data engineer
who's responsible for supporting these analysts, they'll end up building something internally
that makes it so the analyst doesn't have to, you know, spin up a cron process and run Docker
every time they want to, let's say, keep some table up to date. And if you look at these
frameworks, thinking about the frameworks at the organizations that I was at, they always tended to revolve around tables. And so
the fundamental abstraction when you're thinking about, you know, sort of reproducible work
in a like data analyst or even machine learning role is like a table or some sort of data set.
Like, I want to start with this data that we have that's maybe sort of not clean or
not formatted in the way that's most useful to me.
And then in the course of my analysis, ideally kind of like factor out some sort of cleaner,
more useful version of this data set that, you know,
the next time I have to do this analysis, I'll be able to rely on.
At Clover Health, as well as at Keep Truckin', tables were
the natural way that we built our internal tools
to make our data scientists productive.
And then there was this interesting mismatch,
because that was the natural way for us to think about it
as the people in these data roles.
But then you look at these tools that were the orchestrators of the time, which
are still popular orchestrators now, like Airflow, for example, and they're focused on a totally different
set of abstractions.
With Airflow, you go in and you define a DAG, and a DAG is a set of tasks, and
you're fundamentally thinking about tasks when you operate your data pipeline. So the primary
challenge, as someone who was trying to do this data science and data pipelining work, was translating
from this table way of looking at the world,
which was very natural to me and the other people I worked with, to this task and workflow
based world, which was the language that tools like Airflow spoke. And so the internal frameworks
that I would end up working on at these companies were basically these translation layers.
They would allow me to express what I'm trying to do, basically write my data pipeline in terms of tables or data sets or machine learning models and the relationships
between those entities,
and then, you know, have some software figure out how to airflow that for me and
turn it into this world of DAGs and tasks. It was a messy fit. You
can get your pipeline running on a schedule, but there were all these weird
translation issues at the borders. So when it comes time for someone to debug an error or
look at logs, they're forced to think in terms of these very different abstractions from the ones
that are natural to them as data practitioners. So this is a very long-winded way of
saying that what got me excited about working on an orchestrator like Dagster was the opportunity
to build something that thought about assets. When I say an asset, I mean a table, a data
set, a machine learning model,
any sort of persistent object that captures some sort of understanding of the world.
It was the opportunity to think about that as the center, the central abstraction for building the
data pipeline and allowing everything to revolve around that. Super interesting. And can you talk about maybe just at a high level
to start with, how do you think about a system that relies on the concept of assets as opposed
to tasks? Like what are the fundamental differences there in terms of how the system itself operates,
right? Because I mean, you can
create, you know, you can do orchestration with Airflow, you can do orchestration with Dagster,
right? But we're talking about sort of two fundamentally different approaches.
That's right. It permeates in a bunch of different ways. I'm trying to think about the best way to
approach it. When you build a workflow using tasks, there's kind of this fundamentally top-down approach
where you have these sort of like individual tasks and then you assemble them into a DAG.
And the DAG, you know, it's the workflow. It defines the dependencies between those tasks.
Whereas when you're working with assets, it ends up being a fundamentally more distributed approach.
So when you define a data asset, which is synonymous with saying, I have a
table that I want to create.
Let's say there's this raw data here, that's all the raw events that come into my
system.
And then I want to create a table called cleaned events or gold events.
When I define that table, I define its dependency on the upstream events table. And the way that
the entire dependency graph is defined is at the level of individual assets, instead of
having to do this top-down approach that involves a set of DAGs.
The consequence of thinking that way is you're not forced to make these tough and often arbitrary
decisions about where the nodes in your graph go.
A common failure mode in people who build DAG-based data pipelines is they'll have one
sort of enormous, unwieldy DAG, and anytime they make a change, they have to contend with that
entire DAG, like execute it or deal with the enormity of it.
Or they'll go the opposite way, and they'll
chop up their DAG into these tiny little pieces, but then lose the ability to actually
reliably track the relationships between those pieces. And so when you think about data assets,
and you think about defining dependencies in terms of what data do I need to be able to generate this
data set, you kind of sidestep that problem
entirely. A second piece of that is the fundamentally declarative approach that comes
when you're thinking about assets first. When data engineers are questioned by
other people in the organization, like management or business stakeholders...
maybe "questioned" has too much of an interrogative connotation. But when data practitioners want to communicate with stakeholders about their work,
the language they normally communicate in is data assets.
I found this very true in my own work.
Like when I'm explaining to someone the data pipeline that I'm working on or the thing
that I'm going to produce for them, the thing that I draw on the whiteboard is the tables
that are going to be produced.
Yeah.
Yep.
And machine learning is almost even more clear.
You say, you know, I'm going to make this machine learning model.
These are the features that are going to go into it.
These are the evaluations that are going to come out of it.
So you naturally, when you're communicating about what you're building in your data pipeline,
think in terms of this network of data assets.
And so the advantage of an orchestrator
that sort of thinks primarily about the data assets
is like that language is the language that you use
to actually define your pipeline.
So the consequence of that
is that you have this degree of confidence
that the pipeline is actually going to generate this network of data assets, because that's the language the pipeline is defined in terms of.
Makes total sense.
Sorry, Eric.
Can you maybe give us an example, like a concrete example, right?
Of a pipeline and how this could be done using the concept of software-defined
assets, right?
Like in Dagster.
Yeah.
So I really wish I had the ability to use a visual aid, but I'll do my best to describe
it.
So super basic, let's say you have a table of raw event data. Let's
say you're running a website. People come onto your website and click on things in your
website, log in, maybe your website sells something.
Yep.
So you have these kind of these core basic entities that will often come in in some sort of raw form at the beginning of your pipeline.
So those might be, let's say, all the events that happen on your website.
So like clicks, page views, logins.
And your role as a data engineering team might be to deliver cleaned up versions or aggregated versions of this kind of data as core data sets.
So other people inside of your organization can use those to build data products or do
analyses.
Can I interrupt you?
Let's say we have... just to make it a little bit more concrete for someone,
let's say we collect these events, right?
All the different events.
And at the end, we want to get to the point where we have somewhere the number of signups per month.
And I'm giving this as an example because I think it's very straightforward.
And it's exactly what you are saying, but you're making the more general case, right?
That's right.
So let's say we go from an event in JSON captured with something
like RudderStack, right? Pushed to our data warehouse, and we want to end up
calculating how many signups we had this month, right? I think many people,
especially coming from using Airflow
or something like that,
get, let's say, the tasks
that are needed there, right?
How would we do that with data assets?
Great question.
So with the data asset focused way
of building that data pipeline,
the first thing you sort of do is think about the nouns. It's natural to start with where you're
trying to get to. And then, you know, I'll even do this sometimes on a whiteboard. I'll write out
where I'm trying to get to. I'll write out the data that I have. And then I'll write out a set
of intermediate data sets that will help me get from the data that
I have to the data that I'm trying to get to.
So thinking in terms of your specific example, the data that you want to get to is probably
a table that has information about these signups.
And you might even write out the schema of that table.
And the data that you're starting with is, let's say, this raw, untransformed
sequence of blobs that represent events. Maybe they're in S3. And let's say that where you want
to get to is going to be to have this table in Snowflake so that it's easy to query from sort of
dashboarding tools. Yep. Those are two nodes in your graph.
And then you think about, okay,
to actually build a reliable signup data set,
what are the subcomponents that I need to have
to be able to accurately calculate signups?
So let's see.
Maybe one of these subcomponents is the, you know, set of
times people hit the enter button on my signup form. But I also know that there's a bunch of
internal testing that we do, where people would hit that enter button, and we don't actually want to
count that in our business-facing signup metric. So it's important to exclude those internal tests.
And then we have this separate table somewhere
that is a list of all of our internal test users.
So to compute this ultimate legit signups table,
we're going to need to depend on a couple different things.
One is going to be this table of internal test users.
One is going to be this list of form
submissions for the signup form. Then you think, okay, how do I get the form submissions? Ultimately,
that will be derived from my underlying asset at the beginning of my graph, which is the list of
JSON blobs that are clicks on the website. And you draw these out and you basically put arrows connecting any dataset to the datasets
that it needs to read in order to generate itself.
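The whiteboard exercise Sandy describes, writing out datasets and drawing arrows from each one to the datasets it reads, is just a dependency graph. As a rough illustration (the asset names are hypothetical, following the signups example in the conversation), it can be sketched in plain Python, and an execution order derived from the graph alone:

```python
from graphlib import TopologicalSorter

# Each asset maps to the assets it needs to read in order to
# generate itself (names are illustrative, per the example above).
asset_deps = {
    "raw_event_blobs": [],                    # JSON blobs in S3
    "form_submissions": ["raw_event_blobs"],  # signup-form enter presses
    "internal_test_users": [],                # separately maintained table
    "signups": ["form_submissions", "internal_test_users"],
}

# An orchestrator can derive the build order from the graph alone:
# upstream assets always come before the assets that depend on them.
build_order = list(TopologicalSorter(asset_deps).static_order())
print(build_order)
```

This is the sense in which the dependency graph is defined at the level of individual assets: each entry only names its own inputs, and the overall ordering falls out of the graph rather than being assembled top-down.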
And so is the asset the dataset itself,
like the materialized dataset,
or is it the concept, let's say?
What's the difference there between the two?
Got it.
Yeah, so first of all, just to clarify,
when we say data asset in the world of Dagster,
the reason we don't use a word just like table
is that we want these to be more flexible.
So they could be, it could be relational data,
but it could also be a set of images in S3
or a machine learning model.
When we talk about an asset, we're talking about some sort of object in persistent storage.
It doesn't necessarily need to be a data warehouse, but it could be a table in a data warehouse.
It could be a file in a file system, a model that's in some sort of model store.
And that's what we're referring to when we refer to a data asset.
Okay.
Okay.
That's great.
Cool.
So from what I understand, from what we are saying, the way that
Dagster works is by actually asking the user to define the lineage, let's say the materialized steps
that the data has to go through
until it delivers the end result.
So instead of thinking in terms of
processing,
we are thinking in terms of outcomes.
So it's not, let's say,
the query per se
that generates the data.
It's the data and how it connects to the previous data set
that was the input to actually generate it.
Do I get it right?
Yeah, that's exactly right.
And I want to add, you do have to think about processing
at some point, because,
you know, Dagster isn't going to read your mind and just figure out what needs to get
run in order to build the signups data set from the events data
set. But when you write out your processing logic, you're sort of hanging it off of this scaffolding
of the data asset graph.
Okay.
And how is the user using Dagster?
Is it like an SDK?
Is it something like some annotations that you use to annotate some object, like
in a notebook? How does the user actually go and define these lineages between
Great question.
Yeah.
So Dagster exposes a Python API that allows you to define your data pipeline.
And so ultimately the most straightforward way to define a data pipeline
in Dagster is to write a Python file and include a set of asset definitions in that Python file.
An asset definition is basically a decorated function. So for example, if you want to have
an asset called signups, you would write a Python function called signups. You would decorate it with
Dagster's asset decorator just to indicate it's an asset and then optionally include metadata
about the asset, including dependencies on other assets. And then inside the body of that function,
you would include the logic that's actually required to build this signups table.
Okay. So for example, read data from some other
table and then do some
transformations and then write it out to
your
storage system.
And I would assume this is
something that is agnostic to
the system itself. Like, it can be SQL,
or it can be
a data frame SDK
that is used for Spark or PySpark or Polars or whatever.
The processing logic itself is not something
that Dagster is opinionated about.
It's exactly right, yeah.
So the idea is that it's just a Python function.
You can invoke any computation, any framework from that Python function. A really common thing to do is to invoke dbt. So for those who aren't familiar with dbt, dbt is a framework that allows defining tables, basically as SQL statements. So let's say you want to define this signups table.
You would create a file called signups.sql.
And then inside that file, you would include a select statement that says, select blah, blah, blah from the events table.
And Dagster has a dbt integration that basically will digest that dbt table definition,
have Dagster understand it, and then when it comes time to actually
execute that node in the graph, will invoke dbt to execute the SQL inside your database. Okay, that's interesting.
Why would someone do it like that, though,
and not just use directly like DAGster or DBT, right?
Why would someone use both systems together?
Got it, yeah.
So I think there's two directions to think about that question.
One is,
why wouldn't you just use Dagster? And the other one is, why wouldn't you just use dbt?
So starting with why wouldn't you just use Dagster: dbt has become a standard for expressing data transformations in SQL. And it has a lot of features
that make it really powerful at doing that. So for example, you can write macros, and the standard way that you specify data dependencies
in dbt has just become widely accepted as part of the analytics engineering skillset.
For Dagster to try to rebuild that would unnecessarily fragment the ecosystem
and make it less accessible to the set of users who are already familiar with it. One way of thinking about dbt is
even as a set of extensions to the SQL programming language that
make it useful for defining data pipelines.
So that's why DBT is a really useful tool to use even with Dagster. For the question of why not just use dbt,
dbt is very narrowly focused on a particular kind of data transformation, a particular kind
of data pipeline. And in most organizations, even when a large body of the work that they're doing
fits into the dbt framework, often a large body of their work will not fit easily into the dbt framework.
So for example, they'll have steps in their pipeline that do things that are, you know,
just fundamentally not SQL transformations.
Like maybe they'll be moving data between different storage systems, or they will be
building machine learning models.
And those don't really make sense to
represent
inside of dbt.
And so if you were to use dbt
for all your SQL and then
Dagster for all of your not
SQL stuff, you'd end up in this sort of
fragmented world.
You wouldn't have a single consistent view
or ability to execute your
entire data pipeline.
And so embedding dbt in Dagster allows you to kind of get the best of both worlds.
Okay, that makes sense.
And let's talk a little bit about... you said something interesting.
Actually, no, before we get to that, you mentioned something about dbt.
And it's very interesting, about the fragmentation.
So there are plenty of orchestrators out there,
and one of the ways that orchestrators are created
is because somehow there is a use case where, for whatever reason,
the existing orchestrators do not cover the need or whatever.
And suddenly we come up with another orchestrator out there.
And I think that's very common, especially if we take the ML world and let's say the data processing world, right?
Although both are the same thing.
But anyway, we need some way to differentiate the two.
But I think our audience gets what I'm trying to say here.
So for example, we have Flyte, right?
It is an orchestrator.
You go to the website: build and deploy data and ML pipelines. What's the difference between something like Flyte, which focuses more on the ML side of things, from my understanding at least, and something like Dagster?
And why at the end do we end up having all these different orchestration tools, right? And in this case, dbt is also an example of that, right?
Because dbt, in a way, as part of the product at least, is also kind of an orchestrator, right?
If someone lives only inside SQL, technically they can use only dbt, right? They don't need Dagster or Airflow or some other system. And I think that's very confusing at the end. The practitioners at the end, they're like, okay, what is going on here?
Right.
So tell us a little bit more about that and how you think about it, both personally as a practitioner, right, but also as Dagster, the company.
Yeah.
A lot of thoughts there.
So I think there's this truism,
which I think is true in many cases,
that software boundaries end up
sort of modeling organizational boundaries.
So teams will build software
that sort of serves the needs of their team.
And if an organization isn't structured in a certain way, that could lead to two different teams building software that solves very similar problems, but in slightly different and incompatible ways. In the world of data, often, you know, within a data organization or within a company at large,
the functions of analytics and machine learning and data engineering will be sort of
organizationally separate. Historically, I think what that has led to is that people within those
functions have ended up building, you know, maybe building internally and then going on to open source
or going on to commercialize tools
that are sort of rooted in their understanding
of that particular function.
Something that I have encountered working at companies
with fairly early data functions
is that you end up having to fill a lot of roles.
And you know that the software that's needed
to orchestrate in the world of machine learning
is actually very similar to the software
that's needed to orchestrate in the world of analytics.
So I've come to the belief
that you actually don't really need
super specific tooling for a lot of these
domains.
A lot of the boundaries and silos that are set up are sort of artificial or unnecessary.
And not only unnecessary, but actually have a fairly high cost.
So from the perspective of a machine learning team, as I mentioned earlier, the highest leverage way you
can improve your machine learning model is by feeding it better data and ensuring the data
that's coming to it is clean and correct and accurate. It becomes a lot harder to do that if
the underlying process that is generating the data uses a totally different software stack from the
software stack that you're using.
So you actually can reap a lot of benefits by having the converse, de-siloed view of
the world that allows a machine learning person to understand the impact of a change that's like
far up in the data pipeline because their machine learning model is trained using the same orchestration
framework that upstream data asset is built using.
Yeah, that makes sense.
What are the differences though, like between, let's say, building workflows or like trying
to orchestrate like ML work compared to trying to orchestrate data engineering
or analytical work, right?
What are the differences between them?
One thing that comes up in ML work,
more than in data engineering or analytical work,
is that the experimentation and development phase is often a lot richer and more intense.
So in the simplest case, if you're just building a basic table, you write a SQL query, you
run it a couple times, you commit it to your repo, and now you have that table running.
Ideally, your orchestrator is good enough that it can basically just start updating that table.
When you're working with a machine learning pipeline, often there's a whole workflow of experimentation that happens. Even if you've written kind of the perfect code the first time, you end up needing to tweak parameters to try out your model on different features.
And so the iterative process is a lot heavier.
The compute is often much more heterogeneous as well in the world of machine learning.
If you're able to express your computation in SQL, you can basically just ship it off to Snowflake or DuckDB or whatever your database is and have it execute inside of there.
But if you're dealing with machine learning models, there's a wide array of Python libraries that you could be using.
There's hardware that you might have access to, like GPUs.
You end up needing sort of a much more flexible execution substrate to orchestrate across.
So to sum up, I think two sort of larger points to what you need to think about when you're orchestrating machine learning
versus orchestrating, let's say, analytic data pipelines.
One of them is this iterative experimentation-based workflow
and the other one is this more
complex computational
environment.
Yeah, so this iterative experimentation part happens in production in ML? Or is it a completely separate task? Here's what I'm trying to get to.
I'm trying to understand. Normally, in my mind at least,
the orchestrator is something that gets into the process
when you actually go into production.
You have concluded how things should be done, and now you have to deploy something repeatedly and, obviously, in a reliable way, so that these things will keep happening, right?
Because again, like, yeah, you experiment with ML, but I would argue that like whatever
has to do with software has like a part of experimentation, right?
Even if you write a website, right?
So it's part of the nature of the job at the end.
When you build software one way or another, there is this iterative process during development.
But that's how engineering works. You reach a point and you say, hey,
okay, now that this is what I want to do, let's push it into production. Is this different with
ML? ML doesn't have this distinction and you have to incorporate the orchestrator much earlier? Or
is the orchestrator something that has to be incorporated as part of
the development phase and not only
the production phase?
Got it. Yeah, so I think
in broad strokes, I would be inclined
to agree with you.
In particular, on that point that
experimentation is a big part of software engineering
and general data engineering
as well.
So a lot of the pieces of the machine learning development pipeline can sometimes be presented as unique to machine learning, but they're actually general software development and software engineering practices, which is part of why I don't think these require specialized tools.
The one area that I would speak more about is this notion that orchestration should only be part of production. So I don't think people should be replacing their Jupyter notebooks with orchestration, but I do think it's very powerful to be able to work with an orchestrator in much earlier phases of the data development life cycle. If you think about an orchestrator abstractly, it's this system that understands the dependencies between data and upstream data, and is able to execute computation along the lines of those dependencies. And that is a really important function, even when you're early in the experimentation process.
So for example, if you're prototyping a change to the logic that generates one of your data assets,
it's often really important to understand the implications of that change.
So how it affects that data asset and how it affects downstream data assets
far before you decide to commit that change to production.
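The value of that dependency knowledge during development can be sketched without any orchestrator at all: given a graph of assets, a change to one asset implies a set of downstream assets to re-examine. The asset names below are hypothetical.

```python
# A minimal sketch of "what does this change affect?": walk the asset
# dependency graph from a changed asset to everything downstream of it.
from collections import deque

# edges: asset -> list of its direct downstream assets (hypothetical names)
downstream = {
    "raw_events": ["cleaned_events"],
    "cleaned_events": ["user_features", "daily_report"],
    "user_features": ["churn_model"],
    "daily_report": [],
    "churn_model": [],
}


def affected_by(asset, graph):
    """Return every asset downstream of `asset`, in BFS order."""
    seen, queue, order = {asset}, deque([asset]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(affected_by("cleaned_events", downstream))
# → ['user_features', 'daily_report', 'churn_model']
```

An orchestrator that tracks this graph can do exactly this during prototyping: show you that tweaking `cleaned_events` implicates the features, the report, and the model before anything ships.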
Yep. 100%. Yep.
Okay. Got it.
And from your experience with Dagster,
who is the primary user that you see?
Is it more of the data engineer or the more, let's say, traditional data practitioner?
Or do you see more people coming from ML? Or any change there in terms of the trends of like who is actually like coming to learn more about
Dagster these days? Yeah. So we see a lot of different users. Let me maybe try to categorize
them in some sort of way. One pattern of Dagster use is that a data platform engineer will adopt Dagster to help them organize the computation of a bunch of different sort of functions inside their data organization.
So maybe the data platform engineer is supporting a team of analytics engineers, or maybe supporting a team of analytics engineers as well as machine learning engineers.
And they want to set up kind of like a shared orchestration environment
where all of the data assets that are being produced
by these people who are maybe a little bit less technical
can be orchestrated in one place.
So that's one pattern of Dagster usage.
Another pattern of Dagster usage is sort of the bread and butter data engineering usage.
So in this case, the person who adopts Dagster is also the one who's writing the content of the data pipelines.
They're not just facilitating other people's data pipelines; they're actually defining data assets in Dagster, writing the logic to move data around or transform that data.
And then last of all, we see a lot of people who are doing machine learning use Dagster.
And so in these cases, it's normally sort of a mixed machine learning and data pipelining
function. They'll be using Dagster to train their machine learning model, but also to generate all the features that get fed into it, and then perhaps take that machine learning model and do batch inference with it.
Yeah, makes sense.
And one last question from me, and then I'll give the microphone back to Eric.
But with the emergence of LLMs and AI engineering,
and not just ML engineering,
is there a difference in terms of what is needed
to build around LLMs?
Or with existing orchestrators like Dagster,
what do you need to go and work with LLMs and AI?
Yeah, it's interesting.
In broad strokes, you still fundamentally have data that you're feeding in, and data pipelines still exist.
There are some differences.
So, for example, feature engineering becomes less important in the world of LLMs because these models are powerful enough to be able to
kind of do some of the thinking that a machine learning engineer would have needed to do.
But at the same time, you end up needing to do prompting and moving data through vector
databases. So the pipelines you end up creating end up looking very similar. Some of the nodes
have slightly different labels. We've seen users
use Dagster for traditional
machine learning as well as LLMs.
Fundamentally, the shape
of the work is not so
different.
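The "same shape, different labels" point can be made concrete with a toy comparison: a classic ML pipeline and an LLM/RAG pipeline modeled as dependency graphs with identical topology. Both graphs here are hypothetical illustrations.

```python
# Two hypothetical pipelines as dependency graphs: node names differ,
# but the structure (a linear chain of three edges) is the same.
traditional = {
    "raw_data": ["features"],
    "features": ["trained_model"],
    "trained_model": ["batch_predictions"],
}
llm = {
    "raw_documents": ["embeddings"],
    "embeddings": ["vector_index"],
    "vector_index": ["rag_responses"],
}


def shape(graph):
    """Reduce a graph to its topology: out-degree of each node, in order."""
    return [len(children) for children in graph.values()]


print(shape(traditional) == shape(llm))  # → True
```

Which is the point: an orchestrator that operates on the graph structure does not need to care whether a node is labeled "feature engineering" or "embedding".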
That's all from me for now. Eric,
sorry for hijacking
the conversation here.
No, that was amazing.
That was amazing.
I learned so much.
We have time for one more question here, Sandy.
And I want to ask you more about roles
and team structure in a world where, you know,
the lines between data engineering and, you know, ML engineering,
ML ops, and data science really blur. I mean, many of the things that we've talked about today,
you know, you could label the conversation, you know, a conversation about ML ops or a conversation
about data engineering either way. And you kind of saw this, you know, dbt, I think, helped coin the phrase
analytics engineer, right?
Where you have, you know,
you mentioned analysts who like,
are maybe somewhat literate in SQL
or have literacy in SQL or Python,
but not, you know, actually running pipelines.
But that kind of started to change
and a lot of analysts started to learn
to run pipelines, right?
And the same with data engineers
who, you know, ran pipelines, but didn't necessarily work on the modeling layer.
And so you had this role emerge. It was kind of an analytics engineer that's a little bit of a
hybrid. What do you think is going to happen in the relationship between, traditionally, the ML engineer, data scientist, and data engineer, you know, sort of that realm?
Yeah, to your point, it definitely feels like the boundaries between these roles, if they weren't always blurry, have become very blurry.
You know, I feel like in 2015, most data scientists
would spend half their time like explaining to other people what exactly a data scientist was,
or sparring with other people about, you know, the definition of a data scientist. And thankfully,
those conversations aren't such a huge part of the job of data science anymore.
So, you know, maybe that's because people have just come to accept that it means so
many different things and trying to pin it down is a bit of a fruitless exercise.
The way that I tend to be inclined to think about it is there are these spectrums of proficiency that different people have and that, you know, maybe eventually end up getting clustered into these different roles.
So one axis of proficiency is data modeling, you know, which is sort of tightly related to sort of engaging with the facts of the particular business.
And then there are these other axes of proficiency,
which are more about infrastructure,
dealing with Kubernetes and different substrates.
I think that from what we've seen is that these boundaries are super fluid,
and it really varies from organization to organization.
How separate is the person who thinks about the data pipeline from the person who thinks about the infrastructure the data pipeline runs on? Who writes in Python, who writes purely in SQL?
And I think it's difficult to build a data platform with the assumption that these functions
are going to end up totally siloed.
Yeah, I think it's really interesting.
And I think, you know, the tooling has really helped enable a lot of this change.
You know, for example, who writes in Python, who writes in SQL?
Well, a lot of modern tooling, it doesn't matter, right?
You can have someone writing SQL and someone writing Python
and you can use the same workflow
and work on the same data set,
which is incredible.
I mean, that really is,
you know, that sounds,
you know, for anyone who's,
you know, sort of only familiar
with modern tooling
where that's like pretty recent,
that's possible.
I mean, it's...
It definitely was not.
Yeah, it's insane.
So it is pretty cool.
And I think, you know, personally,
what I see that I'm very excited about is,
you know, I think when you give people
much easier access to explore different areas
that are interesting to them,
they can follow their curiosity
without these
massive technical walls that would require a career change to overcome. The tools are making
it a lot more fluid, which I think will spark a lot of creativity, which is exciting.
Well, Sandy, we're at time here, but it's been so great. We learned so much.
You're doing incredible work at Dagster. So thanks
for giving us some of your time. Thanks so much for having me on the show.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.