The Data Stack Show - 158: The Orchestration Layer as the Data Platform Control Plane With Nick Schrock of Dagster Labs
Episode Date: October 4, 2023
Highlights from this week's conversation include:
Nick's background and journey in data (2:28)
Founding Dagster Labs (7:50)
The evolution of data engineering (12:32)
Fragmentation in data infrastructure (15:04)
The role of orchestration in data platforms (19:53)
The importance of operational tools for data pipelines (25:01)
Lessons learned from working with GraphQL (26:19)
The role of the orchestrator in data engineering (34:51)
The boundaries between data infrastructure and product engineering (37:33)
Different orchestrators in the data infrastructure landscape (42:03)
The role of MLOps in data engineering (46:04)
Data Quality and Orchestration (51:04)
Future of Data Teams and Orchestration (54:27)
Final thoughts and takeaways (58:01)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Kostas,
very exciting guest. We've actually had lots of people on the show who have built incredible
technologies at the FAANGs or other similar huge companies, and then have gone on to do
really interesting things and found really interesting companies. Similar story today, we're going to talk with Nick, who worked at Facebook and was actually
one of the people who was behind GraphQL, which is really fascinating.
But he started a company called Dagster Labs, originally called Elemental.
And they build an orchestration tool with a goal to sort of become a control plane for data infrastructure,
which is really fascinating.
And I mean, what a journey.
I can't wait to ask him about it.
I think one of the questions that I want to dig into with him is actually basic.
We've had some orchestration players on the show. Airflow
is obviously a huge incumbent in the space, but I just want to talk about what problem orchestration
solves. It'd be fun to define a DAG. I don't think we've done that on the show, which is surprising.
And then just for myself, build a better understanding of the nature of the problem
that they're solving. So that's probably where I'm going to start. How about you?
Yeah, yeah, this is going to be a very interesting conversation. We have plenty to learn from Nick. I definitely want to ask about GraphQL first of all. Yeah. I think it's interesting.
I mean, it's more common to see people getting,
you know, starting like a company in one space because they have prior experience to the same space, right?
Yeah.
Like to the part of the industry.
But GraphQL is not part, let's say,
of like the data infrastructure per se, right?
Like it's not like a tool that typically you find there.
And I'd love to hear the story behind it.
How, like Nick, from building something so successful like GraphQL,
ended up building something in a different, let's say,
a little bit different part of the industry.
So that's one thing that's definitely I would like to chat about.
And then, I mean, we need to talk about orchestrators.
Like they are like a very important part of the infrastructure in data.
We have Airflow, right?
Which has been, let's say, the de facto solution out there.
So I'd love to talk about this whole space, like the product category of orchestrators, from Airflow to the whole landscape, and see what else exists out there outside of Dagster and Airflow, right, and why.
So yeah, let's go and have this conversation. I'm sure we are going to have an amazing time with him. I agree. Let's dig in.
Nick, welcome to the Data Stack Show. Great to be with you. Thanks for having me.
All right. Well, you have built some really cool things in your time and are building
really cool things at Dagster Labs, which is super exciting. But let's get started at the beginning. Where did your
career start and then how did you get into data? Yeah. So for everyone listening, my name is Nick Schrock. I'm the CTO and founder of Dagster Labs. And just kind of my career up until now, I'll
start in 2009. So from 2009, 2017, I worked at Facebook and that was kind of the bulk of my career,
increasingly less true over time. But while I was there, my time was dominated there by
building internal infrastructure for our application developers. So I formed this
team called Product Infrastructure, which was infrastructure for the product teams.
And our mission was to make our application developers more efficient and productive.
So we started building internal tools and internal frameworks, but we ended up externalizing
that into open source projects.
So React came out of that team and I didn't even do React, but it was kind of adjacent
to it.
And actually the CEO of Dagster Labs is Pete Hunt.
He's one of the co-creators of React.
And then what I'm personally more associated with
is I was one of the co-creators of GraphQL.
And both those, especially React,
ended up being successful open-source technologies.
And so that was really exciting to be a part of.
I left Facebook in 2017 and was figuring out what to do next. And I started asking companies outside the Valley what their biggest technical liabilities were. And regardless of the maturity of the company, this notion of data infrastructure kept on coming up as like the technical issue that was preventing them
from making progress on their business.
I remember really distinctly,
I was talking to a healthcare company
and I kind of had data on the mind and I was like,
okay, tell me about your data problems.
And I expected them to be talking about HIPAA
and like all these complicated issues.
But then what they were talking about was much lower level.
And then I remember at one point in the conversation,
I was like, wait, so you're telling me what in your mind
was preventing you from making progress in American healthcare
is the inability to do a regularized computation on a CSV file.
And they're like, yeah, pretty much.
And that was kind of the moment where I was like, okay,
this is something I should really look at. And I was like data infrastructure adjacent at Facebook.
So I knew about the issues, but I didn't live and breathe it as much as I did the application
space. So I really dug in. And what I like to say is I found the biggest mismatch between the
complexity and criticality of a problem domain and the tools and processes to support that domain that I've ever seen in my entire career. And the only thing that came close
was kind of full stack web development, say in like 2010, 2008, where you had like IE6,
super immature JavaScript frameworks, just a completely hostile development environment.
And as a result, also this kind of self-defeating
engineering culture around it.
And then you fast forward 10 years
in full stack web development,
it's like the entire universe has changed
and the tools are amazing
and the quality of software being produced
is so much better as well.
And it was clear to me that
that same sort of transformation was necessary in data.
And I really was attracted to the orchestration component of that story because I thought it was a linchpin technology, and I think we can go into that a little more. But that's what started the process of forming Dagster Labs. At the time it was called Elemental; we just recently changed the company name. And, yeah, I, you know, incorporated in 2018, started to play with some ideas. And
then really the company started to take off in 2019 with hiring full-time people and pushing
out the project publicly. And fast forward, we raised a Series B a few months ago in a very
hostile fundraising environment. So thank God we did that. And now we're scaling the company
and feeling a ton of momentum and it feels great to really kind of, you know, really hit an
inflection point with the company. Awesome. And just for, you know, I think most of our listeners
are familiar with the concept of orchestration, but tell us what Dagster Labs, what do you make and what problem do you solve?
Yeah.
So fundamentally, we sponsor, Dagster Labs sponsors an open source project called Dagster,
which is an open source orchestration framework.
And then we deliver a commercial product that leverages that framework called Dagster Cloud. So orchestration performs a very critical function
and its definition is slightly changing over time. But kind of the base case of orchestration
is that a data platform, to use a fancy term, is effectively just: you have data,
you do computation on it,
you produce more data, and subsequently you do more computation on that. And it's almost like
an assembly line, right? But instead of an assembly line, you have an assembly DAG,
directed acyclic graph. And what the orchestrator at its most primitive form does is decide when
that factory runs and ensures that things run in the right order.
And so if there's an error somewhere in that factory, you can retry the step.
So at a minimum, you're ordering and scheduling computations in production.
But that orchestration layer is a very interesting leverage point to build a more full featured product.
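To make the "assembly DAG" idea above concrete, here is a toy sketch of the most primitive orchestrator: run steps in order, and if a step fails, retry that step rather than restarting the whole factory. The step names and functions are invented for illustration; this is not Dagster code.

```python
# Toy sketch of the "assembly line" view of orchestration (all step names
# invented for illustration; this is not Dagster code): run steps in order,
# and if a step fails, retry that step instead of restarting the whole run.

def run_pipeline(steps, max_retries=2):
    """steps: list of (name, callable). Returns the names of completed steps."""
    completed = []
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                fn()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # surface the failure after exhausting retries
    return completed

attempts = {"load": 0}

def flaky_load():
    # Fails on the first attempt, succeeds on the retry.
    attempts["load"] += 1
    if attempts["load"] < 2:
        raise RuntimeError("transient failure")

steps = [("extract", lambda: None), ("load", flaky_load), ("transform", lambda: None)]
print(run_pipeline(steps))  # → ['extract', 'load', 'transform']
```

A real orchestrator adds scheduling, dependency graphs, and observability on top of this loop, but step-level retry is the base case Nick describes.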
So for example, the orchestrator has to know how to order computations. And therefore,
it has enough information to understand the lineage of two pieces of data in the system.
So for example, we have integrated data lineage. And likewise, if you're data aware,
the orchestrator ends up being a natural place to catalog all the data assets that are produced by your system.
So we really conceive of ourselves, you know, we have to meet users where they are.
And people categorize this as an orchestrator because they're comparing like, should I use this system or Dagster?
And that's the way it works. But we really conceive of ourselves as kind of a control plane for a data platform that
does orchestration as well as other components of the system. That was great. Super helpful.
One quick thing. Can you define DAG? Because I think that's a term that's thrown around a lot,
especially like as an acronym, but could you actually define directed acyclic graph just so we level set?
Because we never want to assume that all of our listeners know what all of these acronyms mean.
Totally.
So it's a highly, it's kind of an obtuse sounding term that actually describes something
that's extremely intuitive.
So it stands for directed acyclic graph.
So what does that mean? I think the best real world analogy is a recipe for cooking. So imagine that you're cooking
a recipe and you make a fundamental ingredient, that's a step. And then you use that ingredient
with two other steps. And then at some point you recombine those two sub-ingredients to do,
say, put it in the oven or something.
That is a directed acyclic graph,
because it doesn't make sense for that to have cycles, right?
Because if in the end you took something out of the oven
and then you restarted the first step,
you would never complete the recipe.
So that is kind of, you know,
a recipe is kind of the real world manifestation of a DAG.
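The recipe analogy maps directly onto code. As an illustrative sketch (the step names here are invented), Python's standard library can order such a dependency graph, and it will refuse a graph with cycles:

```python
from graphlib import TopologicalSorter

# A toy version of the recipe analogy: each step depends on earlier steps,
# edges only point forward, and there are no cycles -- a directed acyclic graph.
recipe = {
    "make_dough":   set(),
    "make_filling": set(),
    "assemble":     {"make_dough", "make_filling"},
    "bake":         {"assemble"},
}

# static_order() yields an order where every step comes after all of its
# dependencies; it raises CycleError if the graph contains a cycle.
# (The order of independent steps, like the two ingredients, may vary.)
order = list(TopologicalSorter(recipe).static_order())
print(order)
```

This ordering step is exactly what an orchestrator does before running the "factory": it linearizes the DAG so every computation runs after the things it depends on.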
I love it.
Okay.
You brought up a term when talking about the software engineering industry and you drew
a parallel to data infrastructure and you described it as a hostile environment, right?
I mean, you saw the disparate JavaScript frameworks,
and in some ways, when you think back to that world,
you have a set of tooling that's fairly primitive.
Why is the world of data infrastructure hostile?
I know that there, you know, are there,
what are the similarities, right?
And I guess maybe to direct the question a little bit, to some extent in software engineering,
like there were some primitive tools or some, you know, like very early frameworks, et cetera.
Yeah.
That's a little bit of a different problem than, like, complex infrastructure where you're dealing with fragmentation, right?
So what's the nature of the hostility in data infrastructure?
So why I think it's hostile.
So I think there is a pretty good analogy in terms of, if you think about the history
of data engineering, it kind of has a lineage that is not software engineering. I mean, historically, it was like drag and drop tools and Informatica and all this stuff.
And it was kind of thought of like a lower status job as well.
So there's also that lineage.
And I think it's because data engineering in the end ends up dealing with very physical things in the world.
Like you're moving around files and creating tables and data warehouses.
And it's actually, it is difficult to synthesize test data, to set up virtual environments where things are more flexible and whatnot, and therefore have like,
you know, because software engineering processes really work when it's like purely abstraction
based and you can kind of like shim out the right layers and have super fast dev workflows
and whatnot.
And that is fundamentally a very difficult problem in data.
The analogy to web development is that, you know, web development wasn't executing on
a physical thing, not physical data, but it was scripting the web browser, which was not designed to be a programmable surface area.
So it was just this completely hacked system, and there was no good abstractions over the browser.
So you were left just manually testing the browser.
There's no way to run JavaScript code on your laptop without booting up a browser
and this super heavyweight thing and whatnot.
And things like React constructed
the right software abstraction
between the application code and the underlying browser,
which was this incredibly hostile substrate
for software development.
I think the analogy applies to data engineering
where there aren't good enough abstraction layers between the application code and business logic that data engineers have to write, and these underlying concrete storage systems and computational runtimes, which are actually extremely inflexible and hard to deal with.
Yeah. And, you know, an example piece of infrastructure that has gotten popular these days, putting Dagster aside, but one that really sticks out, that kind of solves some of this programmability problem, is something like DuckDB, right? Which makes it very easy to program against the same runtime on your laptop as well as in the cloud. And you can actually, I guess, execute it in the web browser, to tie this all together, which is nuts, right?
So these technologies that are very developer centric, where the computation is portable between different environments, are extremely powerful.
Yeah.
Yeah.
Super helpful.
Let's, okay. Let's talk about fragmentation really quickly because I want to dig in on that a little bit. So you have sort of a fragmented,
you have these fragmented systems. It's difficult to sort of have a development environment,
you know, especially when you're dealing with these physical things like you talked about, and the infrastructure is completely different. I'm curious whether the market is going to move towards increased fragmentation, or if there's sort of a rebundling happening,
because to your point, you know, solving these problems with these disparate layers is really
hard. And so, is the market going to sort of congeal around, or is it going to produce,
you know, more vertically integrated, or I guess, horizontally integrated, depending on how you look
at it, or both systems that sort of solve some of these
problems in an integrated way. What's your take on that? And then how does Dagster fit into
your take on that? Yeah. So that's a great question. I think you said the right words,
which is it's either going to be vertical or horizontal. Because, for anyone who's talking about this subject, lots of people have different visions of the end state,
but I think everyone agrees that currently the world is too fragmented and life is too hard for
people spinning up data platforms because they have to cobble together way too many tools.
And it's extremely complicated,
both technically, as well as like maintaining that many business and support relationships.
It's just not a sustainable situation. So there are kind of two schools of thought, I think.
One is you're going to pivot back to a world of more vertical integration and go back to that
world that's like Informatica and Oracle or
Microsoft kind of in the 90s. You pick one stack and choose everything. And that was inflexible
and terrible. For technological reasons, you also got vendor lock-in. So the modern day analogy of
this is you're either going to be like a Databricks company, a Snowflake company, or you're going to
be one of the hyperscalers, Amazon.
Microsoft just has their new Fabric product, which is another kind of data platform in a
box type solution. So that's the vertically integrated story. And if you're not going to
do vertical integration, you need to solve this bundling issue, then there has to be some other layer
of integration.
That's the horizontal layer.
Now, I think of this naturally, given our position in the stack, I think the orchestrator
is a natural place to assemble all these capabilities that you need over the data platforms like
Databricks and Snowflake. And I think the other thing,
I do think that there is a natural resistance to vertical integration.
And I think companies instinctively know this
because if you go to any large company,
almost all of them are running Databricks and Snowflake.
Yeah.
Like no one just picks one.
Like everyone runs both
because they're suitable for different workloads.
And you also don't want to bet everything on one vendor and get locked in to that degree.
So I think in some ways, like the natural market resistance is doing this.
Now, Snowflake and Databricks or whomever, they might build enough capabilities or do
it in a tasteful way where customers still feel like
it's composable so that they can eventually get dominance, but I just don't think that's fundamentally the way it works. And also, like, I don't want to live in that future. I think vertical integration is boring and sad in some ways. And so then it's, you know, what's the horizontal layer going to be?
Some people think it's going to be the catalog, right? And that's the basis of the control plane.
Some people think, okay, like Apache Arrow is like the way this works because you can have portable memory formats that you move between data platforms or whatnot.
Yeah. I personally, because I mean, it's my job to say this as the founder of the company, but I think this orchestration control plane layer is like the natural place to put it
because the orchestrator by its nature, every practitioner needs to interact with it because
anyone who's putting a data product in production has to put it in some sort of asset graph, because all data comes from somewhere and goes somewhere.
So it has to be placed in the context of some sort of orchestration at some point.
And then the orchestrator is also the system that shells out and invokes every single computational
runtime and touches every storage layer.
So I view it as this very natural choke point and leverage point that has to exist at any data platform of real scale.
And, you know, the kind of user experience you want, I know it's a tortured analogy, but you really want something that feels like the iPhone, where you have this common set of rules, you have a grid of apps, and they
all kind of abide by certain rules in the ecosystem and provides order to the chaos.
But within that order, you get a ton of heterogeneity, right?
And, even though, yeah, it's a tortured analogy a bit, because the iPhone is vertically integrated, it's a vertically integrated OS with an app store, right?
So the analogy is on purpose, but I think that's, for lack of a better term, the vibe you want: some sort of superstructure that brings order to chaos, but within that, you can mix and match technologies.
Yeah, that makes total sense.
Okay, let's,
one more question for me
before I hand the mic over to Kostas.
So in that world where we think about horizontal integration, do you sort of operate with a
set of maybe sort of foundational philosophies or design principles around like when and
where the orchestration layer
enters the picture. And so let's talk about trickle down, which we started out the chat with,
talking about building these things at Facebook. The FAANGs are way ahead of most of the market because they're inventing new technologies that eventually trickle down and, sort of, you know, companies can adopt them. When we think about orchestration and Dagster in particular, is your view of the world
that this is really where you should sort of start building your infrastructure, right? So
Ceteris paribus, do you start with sort of an orchestration layer and then augment your stack over time around the orchestration layer? Or is it a situation where you really only need this when you hit
a certain level of complexity or when you have multiple storage layers or some sort of
breakpoint or threshold where orchestration is the right tool for the job? So typically, I think that orchestration should be one of the first tools that you go to in the
data platform. For example, if you only have one data warehouse and you know you will only ever use dbt, you only use templated SQL, you'll bring in no other technologies, and you don't need anything beyond a cron scheduler, and you know that for certain, yeah, you might not need a full orchestrator. Snowflake, dbt, Tableau, and you're done, you know? Yeah, something like that, right? And if you don't need any automation, right, that would be another example, where literally you can just manually do stuff and only create things on demand and you're comfortable with manual intervention.
That's another example.
But for nearly every single data project, I think that orchestration is fundamental
and essential.
You need to schedule things.
You need to order computations.
You need to do it across multiple stakeholders,
multiple technologies, typically. And by multiple stakeholders, sometimes I should probably say
multiple roles, because even as a solo practitioner, you often kind of wear different
hats depending on what you're thinking about. Sometimes you're thinking like, oh, I'm building
infrastructure for myself. And sometimes I'm building the actual data pipelines.
And I think we have work to do to educate the marketplace to convince people that the orchestration should be one of the first tools you adopt rather than one of the last.
But I think in reality, the things you do within the orchestrator are so fundamental and essential to building data pipelines that it should be in the picture from day one, right?
You're going to have errors in your production.
You need to be able to resolve those as easily as possible.
You're going to need alerting.
You're going to need to schedule things.
You're going to need to order your computations.
And, you know, those are like the basic tools, in my mind.
If you don't put an orchestrator in place, you're very quickly left with a Rube Goldberg machine of, maybe you have like four different hosted SaaS apps and they have overlapping cron schedules and you're debugging issues across multiple tools. I just don't think it's tenable. Yeah. Yeah. It's super
interesting. It's, you know, I kind of think about it as, you know, you said that like the
tortured analogy of the iPhone, but I was just thinking about like a dashboard in a car where
it's interesting because it feels like a single thing, but it actually represents a massive amount of complexity
related to very different parts of the system, right?
From braking to pressures that are running
in different pieces of the engine or transmission
or all these separate things.
But it feels, you wouldn't think about designing
a dashboard for a car in a disparate way, right? It represents a related system.
Yeah. Especially for coherent operational aspects. It's like being able to go to one
place and have the source of truth, the so-called single pane of glass where you can be like,
what's going on in my system right now? Yeah. I just, I cannot imagine running a data pipeline, a single data pipeline, let alone
an entire platform of data pipelines without that operational tool.
Yep.
All right, Kostas, I could keep going, but please, I know you have a ton of questions.
Yeah.
Thank you, Eric.
All right, Nick, let's go back to your GraphQL days first, okay, before we
get deeper into what Dagster is doing and talking more about the data infrastructure.
So I have a question, since you described your journey, what have you learned
from working with GraphQL, a tool that is
primarily used by product engineers
to go and create applications,
that you think is also very applicable to what Dagster is doing, right?
So that's a great question.
I think there's a few lessons.
One is if you can express the problem that your users are trying to solve
in concepts that make sense to them and align with their day-to-day experience
on the ground, that is enormously powerful. So the analogy is that in GraphQL, I think the
novel insight, because people are like, oh, why don't you just use something like SQL?
Well, the reason why is that SQL is fundamentally tabular and GraphQL is hierarchical.
And the reason why that's powerful is that if you're a front-end developer, the view libraries that you're dealing with, 99% of the time, it's a hierarchical structure.
You're thinking about nesting elements within each other. A query language that directly maps to that is extremely powerful, both in terms of just
understanding the query language, maybe most importantly, building client-side tools that
line up those views with that data fetching.
It's just an extremely powerful paradigm.
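As a toy illustration of that point (plain Python, not actual GraphQL), a hierarchical query can be modeled as a nested dict of field selections, and resolving it against nested data returns a result with exactly the same shape as the query — the property that makes it easy to line up data fetching with a tree of view components. All the field and data names here are invented:

```python
# Toy illustration (not real GraphQL): a hierarchical "query" is a nested
# dict of field names, and resolving it against nested data returns a result
# shaped exactly like the query -- mirroring how hierarchical queries line
# up with hierarchical view trees.

def resolve(query, data):
    result = {}
    for field, subquery in query.items():
        value = data[field]
        if subquery:  # nested selection: recurse into the child object(s)
            if isinstance(value, list):
                result[field] = [resolve(subquery, item) for item in value]
            else:
                result[field] = resolve(subquery, value)
        else:         # leaf field: take the scalar as-is
            result[field] = value
    return result

user = {
    "name": "Ada",
    "email": "ada@example.com",
    "friends": [{"name": "Grace", "email": "g@example.com"}],
}
query = {"name": {}, "friends": {"name": {}}}
print(resolve(query, user))  # → {'name': 'Ada', 'friends': [{'name': 'Grace'}]}
```

Note that fields not selected (like `email`) never appear in the result, which is the other half of why this shape works so well for front-end data fetching.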
And similarly, with Dagster, for example, we really thought about from first principles,
what are the things that you're...
What are you actually doing when you're building a data pipeline?
What's the outcome you're trying to affect in the world?
And how do you interface with the stakeholders who care about you?
And, you know, we kind of have this phrase we say around the office, virtual office,
which is like, no one gives a shit about your pipelines, right?
All they care about is the data assets that they depend on, right?
Pipelines are an implementation detail. And if you can express, from the developer's perspective, if you can start out with, hey, declare the assets you want to exist in the world, and everything downstream of that makes sense, then everything lines up better: both your own internal tools, the way you communicate with stakeholders, et cetera,
et cetera. So I think that's one lesson. And at Dagster, we really, this has been a struggle. This has been a challenge, I'd say, over years, to really dial in this language, and
we're still working on it, but getting that right is super important. And the other thing that I
learned with GraphQL is that a lot of these developers, there's kind of this common
trope. You talk to VCs or you talk to engineers, there's a lot of almost contempt for the broader
software engineering communities. Like, oh, all developers are dumb and you're used to only dealing
with the top 1% of developers and that's the way it's going to work. And what I found in the GraphQL space is that people aren't dumb. Developers know their
domain and their business. They are generally quite smart, quite bright and competent, but
they are extremely busy. So I think people confuse smart, busy people with uninterested people and that causes a lot of
people to build underpowered tools. With something like GraphQL, you're still relying on the users to do a lot; they're building a very complex piece of software. Underneath the hood, GraphQL provides this overarching structure that makes sense in their mind, and tools on top of that. But
beyond that, developers have to do all sorts of clever things. And I've always been pleasantly
shocked at how sophisticated the GraphQL community is in terms of building custom tools and whatnot. And I think the same thing applies to the data engineering world, where you don't just
want to give out-of-the-box solutions, but you also want to provide developers a toolkit
to make them more productive.
And you have to find the right balance in doing that.
But I think having that mentality is critical. And that has really served
us well in the Dagster journey. I still can't believe some of the use cases people apply this
stuff to. So from first principles, getting your mental model right, super important. And two,
understanding that your users are smart, busy people, typically building
complicated things and understanding where to give them the tools where they can do the complex
things while simplifying everything else as much as possible is also critically important.
And then the last thing is that being consistent with messaging is utterly critical. I remember, you know, with GraphQL, once we started to see meetups that we didn't organize, where people were effectively saying the things that we propagandized and advocated for. And I'm like, I remember the meeting where we decided on using that phrase, not another phrase. Now it's being repeated in Johannesburg and we didn't need to talk to anyone to make that happen.
That's like a very powerful thing.
So consistent messaging is another thing that comes to mind.
Okay, that's awesome, actually.
That was an unexpected outcome, to be honest.
I didn't expect to hear that from you,
but that actually makes a lot of sense, I think,
especially when we are talking about new paradigms
and new technology, right?
Like a period where the marketplace out there
needs to go through education, right?
So consistency, I think it is critical.
So that makes sense.
I think the other thing I've learned is that
it's really important for a technology to be viewed as a career enhancing move to adopt.
So if you can build a technology where people feel like they will advance further in their career and achieve better financial success and notoriety because they adopted you, that is an incredibly important place to be in as a technology provider.
Yeah.
A hundred percent.
A hundred percent.
Okay.
One question, just to try and help, let's say, the people out there who are coming from one side or the other. And when I say one side or the other, I mean product engineering on one side and data engineering on the other side, right? And product infrastructure and data infrastructure.
So obviously they are two different domains, but there has to be some overlap, right? It's still engineering. Both have to manage data, both manage state, both have to present something to someone, and all these things.
So the question that I want to ask you, because I don't want to just ask you what they have in common and whatnot, I'll try to do it in a little bit more creative way. From your experience with GraphQL in product engineering, if you had to choose, let's say, in data engineering, a technology that is closest to what GraphQL is for product engineering, what would you say that is?
Which part of the stack out there, it can be the orchestrator,
it can be, I don't know, like Arrow, as you mentioned at some point, the query engine, I don't know.
The good and the bad thing with data infrastructure is that there's so much to choose from out there. But what has a similar utility in the end for the engineer out there, similarly positioned in the stack, if there is one? There might or might not be something.
Well, I mean, I think one of the reasons why I was attracted to the layer in the stack that Dagster is in is that I felt that the orchestrator served as the basis of a layer which could serve a similar function as GraphQL,
but in the data domain.
Insofar as, you know, GraphQL is this very compelling choke point in a front-end stack
where all the different clients, Android, web apps, iPhone, all go through the same
schema.
And then that is then backed by a piece of software that
talks to every single service and backing store at the company and provides this organizational
layer to kind of model your entire application, right? In a similar way, I think that the orchestrator
serves the same function in terms of the place where you can
model your entire data platform, where all different stakeholders can kind of view it
through the same lens, just this graph of assets, right?
And then each one of those assets can be backed by arbitrary computation, arbitrary storage.
So I really do think, and I think that's why I was attracted to it, whether implicitly or explicitly, that this had that same property of being both a technological and organizational choke point through which you could deliver enormous value and have enormous leverage.
And why do we need such a different implementation of the technology then?
Why can't we get GraphQL and somehow adapt it to be the interface for the data infrastructure too?
What is the reason?
And it's more of a technical question when I'm asking, to be honest.
Yeah, I mean...
Why is this happening?
Well, I just think it's a completely different domain and problem space.
I remember people were like, oh, Nick, why aren't you thinking about using GraphQL for analytics?
And I'm like, absolutely not.
Absolutely not.
And the reason why is what I was talking about before is that GraphQL works for front-end
applications because the net thing that you view on the screen is hierarchical, right?
And that makes sense.
When you're dealing with analytics, you're looking at tables.
You're looking at tabular data and direct renderings of that in dashboards and whatnot.
So SQL is the right tool for analytics.
And GraphQL is a better tool,
in my opinion,
for building kind of front-end products.
And so they're completely distinct domains.
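To make that contrast concrete, here is a tiny Python sketch, with hypothetical data and no real GraphQL or SQL involved, of the same information shaped hierarchically for a front end versus tabularly for analytics:

```python
# Toy illustration: the same underlying data, shaped for two consumers.

# A front end wants a hierarchy that mirrors the screen: a profile page
# renders a user, their posts, and each post's comments as nested components.
frontend_view = {
    "user": {
        "name": "ada",
        "posts": [
            {"title": "Hello", "comments": [{"author": "bob", "text": "hi"}]},
        ],
    }
}

# An analyst wants flat rows: one record per comment, ready to aggregate.
analytics_view = [
    {"user": "ada", "post": "Hello", "comment_author": "bob", "comment_text": "hi"},
]

# Flattening the hierarchy recovers the tabular shape; reassembling a UI
# tree from rows, per request, is the kind of work a GraphQL server does.
rows = [
    {"user": frontend_view["user"]["name"], "post": p["title"],
     "comment_author": c["author"], "comment_text": c["text"]}
    for p in frontend_view["user"]["posts"]
    for c in p["comments"]
]
assert rows == analytics_view
```

The shapes are interconvertible, but each one is the natural fit for its consumer, which is the distinction being drawn here.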
Okay.
That makes total sense.
One last question that has to do with, let's say, the boundaries between these two different disciplines, which also have some similarities.
So in the data infrastructure, we are talking about orchestrators, right?
We have this concept, like Dagster, right?
We have, as you said, tasks to be executed.
We have some scheduler.
We have managing failures, all these things.
There is another, let's say, in the product engineering space,
there's also like the concept of the workflow engine, right?
And there's like a lot of conversation like lately about workflow engines
and how close they should be to the state or like should be part of the database,
like the transactional database or not, or outside.
But they have some similarities, right?
At the end, even the workflow engine is a DAG, pretty much. You have some tasks that need to have some ordering in how they get executed. Maybe not necessarily doing data processing directly, but there might be an endpoint that you have to go and hit somewhere, right?
Yeah.
Why, again, what's the difference? Why can't we have one, right? One that can drive, let's say, the data infrastructure and the processing there, and the same also with product engineering, where we have to orchestrate tasks again.
What's an example workflow engine in product engineering that you're thinking of, just so I can answer? Because a workflow engine can be different things to different people.
Yeah, 100%. Like Temporal, for example, is the product that comes to my mind, right?
Yeah, Temporal is really interesting, actually. I think fundamentally, something like Temporal is a more imperative and general-purpose tool. But you have to make explicit trade-offs there that make it less well-suited for doing data processing in the context of a data platform. I think the simplest illustration of it is that, using Temporal, if you wanted to understand the lineage of your data assets without executing them, that is literally impossible, because Temporal is a more dynamic state machine that makes very different trade-offs.
So there's nothing preventing you from using Temporal to perform a subset of the functions in a data platform orchestrator. But it just doesn't fit all the needs of data platform teams. And so, you know, there's a world where there's a data platform stack where Temporal is a component of it, but fundamentally it's very different. Something like Temporal is interesting. I'll be very curious to see how it develops over time, because it's actually an extremely invasive programming model. And if I was hired as a VP of engineering at some larger company that had bet heavily on Temporal for its infrastructure, I would be lying awake sweating at night thinking about how do I debug this if it goes wrong?
Because you're putting so much faith in the system to be re-entrant and manage all the state properly. And if something goes wrong, I'd just have a hard time debugging it. But yeah, I'm both extremely intrigued and amazed by Temporal and also kind of terrified of it.
At scale, especially.
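The lineage point above can be sketched in a few lines of toy Python. This is a hypothetical model, not the real Dagster or Temporal APIs, and all the asset and function names are made up: a declarative asset graph exposes lineage without running anything, while an imperative workflow hides its dependencies inside a function body.

```python
# Declarative style: each asset names its upstream dependencies up front,
# so the full lineage graph is known before any pipeline code runs.
assets = {
    "raw_events": [],
    "cleaned_events": ["raw_events"],
    "daily_summary": ["cleaned_events"],
    "ml_features": ["cleaned_events"],
}

def lineage(asset, graph):
    """Return every transitive upstream dependency of `asset`, computed
    purely from the declarations, with nothing executed."""
    upstream = set()
    for dep in graph[asset]:
        upstream.add(dep)
        upstream |= lineage(dep, graph)
    return upstream

# Imperative style: dependencies only exist as the order of calls inside a
# function body. Until the engine runs (or parses) it, it cannot know that
# the summary depends on the raw events.
def imperative_workflow():
    raw = extract()          # hypothetical step functions, never called here
    clean = transform(raw)
    return summarize(clean)

print(sorted(lineage("daily_summary", assets)))
```

The trade-off is exactly the one described: the imperative form is more general and dynamic, but the static view of the data platform is lost.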
Yeah, makes sense.
Makes sense.
All right, cool.
So let's focus more on the data and stuff now.
Let's talk about orchestrators in data infra, right?
Like Dagster is obviously not the first one.
There are like many different
solutions out there. Some are like more niche, some are like more generic. I would say probably
the most well-known one is Airflow, right? For sure.
So give us, like, how you see the landscape out there.
Totally. Yeah. Airflow is funny.
So the lineage of Airflow is actually from Facebook. It's based on a system that was kicked off at Facebook in 2007 called Dataswarm, which still exists. And then Max, who created Airflow, who I know very well, actually left Facebook, went to Airbnb, realized they needed a similar system, and basically built a V2 of Dataswarm.
And I think that Airflow did a couple
of really important things.
One is that you could build DAGs.
You could write your code in Python rather than having to use a UI or use some
really inflexible config system. And then it had a nice UI. And so between being able to use Python,
which gave a level of dynamism in a language that data people understood, and a high-quality UI, it really took off.
But there are a few things that are a problem with Airflow. One is that it's clearly not written
for the local development experience. And these systems are complicated enough where you need to
be able to test them, do automated testing, have fast feedback loops, because those are the foundations of developer productivity and developer productivity is
absolutely huge. And the other thing, and this is funny, we like to say that even though it is
kind of the incumbent orchestrator that people build data pipelines in, it actually is not a great tool for building data pipelines, because it's not aware of the data that it produces. It's kind of this tautological thing. It's like the wrong layer of abstraction for data pipelines has got its momentum and became a norm. But we fundamentally think a more kind of data-oriented
approach is important. So if you think about the landscape, and I'll include
dbt, Dagster, Prefect, and Airflow. Prefect is actually much more similar to something like Temporal at this point, because of Prefect's new 2.0 product. That was a company started like a year before Dagster Labs, and they have this sort of DAG-less vision that's similar to Temporal, or just arbitrary workflows. And so it's more imperative and generic.
Then you have the task-based DAG system, which is Airflow.
Then you have dbt, which is very popular, which is exclusively for Jinja templated SQL with a hint of Python these days, but like 99.9%
of the usage is templated SQL. They build a graph of data assets as well. They call them models
and they exclusively execute over the warehouse. And they're targeted at these kind of software engineer-analyst hybrids they call analytics engineers. And if you think of those as a spectrum, Dagster is kind of in between Airflow and dbt, meaning that it has a much more declarative
data-oriented approach, similar to DBT,
but is targeted towards data and ML engineers
and more trained software engineers
and is more flexible and can be backed
by any arbitrary computation,
not just Jinja-templated SQL.
So that's kind of the landscape.
On one end, declarative, hyper-focused on the data warehouse and SQL, that's dbt; all the way on the other end, something like Prefect or Temporal, which is completely DAG-less, more of a straight workflow engine; and then kind of Airflow and Dagster in between.
Okay, that was awesome.
And are there any, let's say, more,
how to say that, like, niche type of, like, orchestrators?
Like, there's this whole thing around, like, ML Ops, for example.
Like, is ML, is there, like, a group of orchestrators
that are just, like, for ML?
So ML Ops is super interesting.
I think it's something that we're going to be focusing on
in 2024
because
we actually really believe
that the MLOps
ecosystem is unnecessarily siloed.
And they don't need their own orchestrators. MLOps should be a layer, not a silo.
And there was this great article that hit Hacker News
like six months ago, which was like,
MLOps is 98% data engineering.
And I think that's totally true.
I wrote that, by the way.
You wrote that?
Oh my, I never connected that.
Okay, so this is perfect.
That's amazing.
We love that.
That's like the basis. That's going to be
the base of our product marketing next year. Because we don't emphasize our ML use cases at all. But in our cloud customer base, 90% of them use it for ETL and analytics, which makes sense. But 50% also use it for ML, and 40% also for what they call production use cases. So multi-use case is the norm. And what happens is that a data platform team brings
in Dagster and then they start using it. And then their ML team wants support in doing stuff. They talk to the ML team, and they're like, well, we want to write Python. We want to write DAGs of
stuff. We produce a bunch of intermediate tables.
And at the end of the line, we produce models.
They're like, well, that sounds like data pipelining.
And DAGster provides a great foundational tool for the data engineering components of
the MLOps job, which is 98% of it.
That's so funny that you're the one that wrote that article.
So we totally buy into that.
We view it as: Dagster is about data engineering.
So we kind of think of data engineering as this layer.
And then different parts of the data pipeline overlay different technologies on top of that layer.
So in the middle of it, in the data warehouse, you might have dbt core.
In the ML component of it, you might have MLOps as a layer of tooling on top
of that, but it all shares a common
control plane
driven by data engineering principles.
Yeah, that makes a lot of sense.
I mean, obviously, I agree. That's also the reason I wrote that post.
Great post.
Yeah, it was
much more impact than I expected, to be honest.
And it was very interesting to see the reactions, both from the ML, let's say, group of people, and also the data engineers.
Anyway, maybe we should have an episode just talking about that.
Oh, yeah, yeah.
I do believe there needs to be a convergence between the two disciplines.
It is important if you would like to keep adding more and more value and faster innovation in the industry. Otherwise, things are just way too fragmented and it doesn't make sense.
Yeah, it doesn't make any sense.
No, we got to get Sandy on this. He's the lead engineer of the project, and we could get going for two hours getting ourselves whipped up about this subject.
Yeah, we should do that.
Absolutely.
We'll arrange that.
So, okay, let's go back to like specifically to Airflow.
And I have like one last question here.
If there's something that you are, let's say, envious of that Airflow has, what would that be? One thing.
Oh, just the install base.
You know, like that's pretty much it.
I feel like we compare favorably almost every...
Actually, the install base and the existing corpus of searchable content related to the technology are the advantages of incumbency. But those are the two things I envy. But we're making good progress.
We do. That's true.
And I think you are generating also some
pretty good content out there.
So, okay.
But Airflow has been around for how many years now? Like 10 years, maybe a little bit less than that.
So that's a lot of stuff.
Max wrote it in 2014, and I think it was open-sourced pretty quickly, in 2015.
So we're getting there.
Yeah.
Yeah.
All right.
Let's talk about Dagster now. You've been working on this for quite a while. What's next? What are the next couple of things that are coming out for Dagster? And
what should we be excited about for the future releases? Yeah.
So I think that our near-term future is very much about demonstrating,
we kind of have this position in the stack.
We claim to be this operational single pane of glass.
We have companies where a bunch of different stakeholder teams are adopting it.
And now the next part of the journey is like using that leverage point
to deliver more value to teams.
So one point of that is that, you know,
I think this show is going to come out like a week before our launch week,
but we're going to be announcing embedded data quality in the orchestrator.
And that doesn't mean we're going to try to replace dbt tests or replace Soda or replace Great Expectations. Those are all systems that we can leverage. But it's more about almost making the orchestrator data quality cognizant, I would say.
And this isn't just us reaching outside of our domain. Our users explicitly want this, because they're used to looking at our asset graph and asking, what is the state of my system? And having an extra checkbox there that says I passed all my data quality tests is the most natural thing in the world to want to integrate. And then being able to alert: okay, if this thing fails, ping me in Slack or whatever.
So it's a very natural extension of the orchestration system.
And we think that in five years, any orchestration platform that doesn't include data quality in this manner will be viewed as woefully incomplete.
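As a sketch of that idea, here is a toy model in plain Python. This is not the real Dagster API, and every name in it is made up: an orchestrator that evaluates quality checks attached to each asset and fires an alert hook on failure, the "extra checkbox on the asset graph" from the conversation.

```python
# Quality checks: plain predicates over an asset's output rows.
def non_empty(rows):
    return len(rows) > 0

def no_null_ids(rows):
    return all(r.get("id") is not None for r in rows)

# Asset name -> (producer function, attached quality checks).
assets = {
    "users": (lambda: [{"id": 1}, {"id": 2}], [non_empty, no_null_ids]),
    "orders": (lambda: [{"id": None}], [non_empty, no_null_ids]),
}

def materialize_all(assets, alert):
    """Run every asset, evaluate its checks, record pass/fail per asset,
    and call the alert hook (e.g. a Slack ping) on any failure."""
    status = {}
    for name, (producer, checks) in assets.items():
        rows = producer()
        failed = [check.__name__ for check in checks if not check(rows)]
        status[name] = "passed" if not failed else f"failed: {failed}"
        if failed:
            alert(f"{name} failed checks {failed}")
    return status

alerts = []
status = materialize_all(assets, alerts.append)
print(status["users"])   # passed
```

Because the checks live next to the assets, the orchestrator, not a separate system, knows which asset is in a bad state and can gate downstream work or alert on it.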
So we're adding data quality capabilities. Similarly, we're also adding consumption management capabilities to the system.
So first of all, we're going to kind of be
augmenting our integrations to make it very straightforward
to collect metrics about consumption.
So, like, how many Snowflake credits is each asset consuming?
And then what's very unique is that one,
we can index that by asset name in our system.
So we can give reports to say like, hey,
you're recomputing this thing all the time.
It's consuming this many credits.
Are you sure you're getting enough value out of that?
And then second of all,
the orchestrator is also naturally a very interesting place
to embed cost and consumption information
because you can do cost quoting, you can provide quotas,
you can project how much computation
is going to cost going forward.
It's just a very natural place to embed
that sort of cost information. And we think that's going to be incredibly powerful. Jeff Bezos famously said, your margin is my opportunity. I think the equivalent in data is: your NDR, your net dollar retention, is our opportunity, because you can't increase your Snowflake spend 80% year over year forever. Eventually you'll run out of money, and you need tools to be able to control that. And we think the orchestrator is a natural place to do that, especially so that centralized data engineers and data platform teams can bring in all the computations of all their stakeholder teams in a way that's much, much easier, that doesn't require modification of that external code, or minimal modification of that external code.
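A rough sketch of the consumption idea, in plain Python with made-up numbers, and not a real Dagster feature as shipped: collect per-run credit metrics from integrations, index them by asset name, and flag assets whose recompute cost exceeds a quota.

```python
from collections import defaultdict

# (asset_name, credits_consumed) observations, as collected per run.
run_metrics = [
    ("daily_summary", 12.0),
    ("ml_features", 3.5),
    ("daily_summary", 11.5),
    ("daily_summary", 12.5),
]

def credits_by_asset(metrics):
    """Aggregate warehouse credits per asset name."""
    totals = defaultdict(float)
    for asset, credits in metrics:
        totals[asset] += credits
    return dict(totals)

def over_quota(totals, quota):
    """Assets worth an 'are you getting enough value out of this?' report."""
    return sorted(asset for asset, total in totals.items() if total > quota)

totals = credits_by_asset(run_metrics)
print(totals["daily_summary"])        # 36.0
print(over_quota(totals, quota=10.0)) # ['daily_summary']
```

Because the orchestrator already knows which run materialized which asset, indexing cost by asset name falls out naturally, which is the leverage being described.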
I think right now, Dagster historically has gotten dinged because it feels like it has to take over too much of your system. And I think that feedback was pretty accurate, actually. And so we've really
taken that to heart. So we really want to make it so that instead of the entire organization
having to become Dagster experts, only a centralized team has to become Dagster experts.
And everyone else kind of becomes Dagster-curious, where they just know a hint of Dagster, and then they can use our operational tools and it all kind of works
smoothly. So, you know, in the end, the goal of that launch week is really to make these centralized
data teams, data platform teams, especially way more leveraged. So make it way faster for them
to kind of bring everyone
into the orchestration platform.
And then once they're there,
be able to use these value-added features
to deliver enormous value super quickly.
So "beyond orchestration" is kind of one of our internal themes.
And we want to really kind of develop this future
of this more advanced control plane
that I think data teams desperately need.
Yeah, that makes total sense.
All right.
We are really close to the end here,
and I want to give some time to Eric
to ask any other questions that he has.
But we definitely have to do at least one more episode.
I think we have a lot to chat about.
So I'm really looking forward to do this again in the future.
Yeah, it's been great.
It's so funny you wrote that article and I was just name dropping it.
I swear to God that wasn't on purpose.
That would have been real 4D chess to pretend to not know
that you had written it, then drop it.
Yeah.
Costas, you're a real data influencer.
I mean, you're the foundation of product marketing strategy.
Okay, just one last quick question.
We've talked so much about data, and Nick, you've been so articulate and so helpful on so many subjects.
So my last question actually has nothing to do with data.
So if you weren't working with data or building technology tooling, what would you do?
Oh, I actually have a good answer for this.
We are on the cusp in the world of an energy transition, and no one understands the implications
of it.
And this isn't for environmental reasons.
We reached this tipping point where solar energy is now cheaper than all other fossil
fuel, all other energy generation.
So solar, wind, and battery together, not only is it cleaner, which is interesting,
but also it's just way cheaper.
And by definition, there's no fuel
inputs to that. And that is incredibly exciting. Because there's kind of this mentality that, oh, by decarbonizing, we're going to have to degrow and go back to some pastoral life where no one drives and no one travels. That's completely false. If we do this right, we're actually going to live in a world
with effectively infinitely abundant energy.
And working on that transition, I think, would be incredibly exciting.
Because effectively, the way that the math works out
for these clean energy systems is that the cheapest configuration of building them
means you dramatically over-provision solar and wind generation capacity so that you don't
have to build as many batteries.
And that means that most of the time during the year, you have a wild excess of eventually
limitless free energy.
So I think there's going to be entire new waves of industry
that are built to effectively
take advantage of this intermittent,
infinite, virtually free energy.
I think it could be an incredible future.
So that is what I would be working on.
Fascinating.
Man, that's an episode in and of itself.
Well, as we like to say,
we're at the buzzer. Brooks is telling us it's time to land the plane. But Nick, this has been
such a wonderful conversation. Thank you for giving us your time and we'll have you back on
very soon. Awesome. Yeah, this was great. Thanks for having me. What a guest. I feel like every
time we asked Nick a hard question, he was able to come up with an answer
that was concise and articulate
for every single question.
It was really amazing.
I think you probably asked one of the winning questions,
which was around the data orchestration landscape.
And what a fascinating answer. It was really helpful to hear him talk about the entire spectrum, where you have Temporal, which you brought up on the show, which is embedded in application code, deeply integrated, generic workflow execution, all the way over to the dbt side of things, which is Jinja templating and managing jobs for running SQL queries.
And he really painted that entire picture.
It was so helpful to me.
And that is my big takeaway.
So I think this is a show for anyone who wants to understand deeply the history, the current state, and then think well about the future of orchestration. This is a great show.
Yeah, oh, 100%. I think it's not just for the future of orchestrators. I think it's a glimpse into the future of data infrastructure in general. Yeah. And data engineering, I would say also,
because that's like another part of this episode
that I think is super unique and super fascinating
is that we talked a lot about what are the differences
and also the overlaps between product engineering
and data engineering, the tooling,
the infrastructure out there, why we need to have these different disciplines or domains, what we can learn from one and transfer to the other. And I think Nick had such an extensive experience at a kind of unique scale at Facebook.
So his perspective, I think, is very interesting and very insightful, and not that easy to find out there. So I would encourage everyone to tune in and actually listen to the conversation we had.
And hopefully we are going to have more conversations
with him in the future too.
Totally agree.
One last bonus from this episode is that
I think this is the episode
where you became a data influencer
because, and I won't give away too much,
Nick referenced that they're building
their product marketing strategy
off of a particular article
that went on the first page of Hacker News
that may or may not have been authored
by one of the co-hosts of the show.
So if that tantalizing piece of juicy information
is interesting to you,
listen to the entire show to hear more.
Subscribe if you haven't, tell a friend,
and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week. We'd also love your feedback. You can email me,
ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.