The Data Stack Show - 158: The Orchestration Layer as the Data Platform Control Plane With Nick Schrock of Dagster Labs

Episode Date: October 4, 2023

Highlights from this week’s conversation include:
- Nick’s background and journey in data (2:28)
- Founding Dagster Labs (7:50)
- The evolution of data engineering (12:32)
- Fragmentation in data infrastructure (15:04)
- The role of orchestration in data platforms (19:53)
- The importance of operational tools for data pipelines (25:01)
- Lessons learned from working with GraphQL (26:19)
- The role of the orchestrator in data engineering (34:51)
- The boundaries between data infrastructure and product engineering (37:33)
- Different orchestrators in the data infrastructure landscape (42:03)
- The role of MLOps in data engineering (46:04)
- Data Quality and Orchestration (51:04)
- Future of Data Teams and Orchestration (54:27)
- Final thoughts and takeaways (58:01)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Kostas, very exciting guest. We've actually had lots of people on the show who have built incredible technologies at the FAANGs or other similar huge companies, and then have gone on to do
Starting point is 00:00:40 really interesting things and found really interesting companies. Similar story today, we're going to talk with Nick, who worked at Facebook and was actually one of the people who was behind GraphQL, which is really fascinating. But he started a company called Dagster Labs, originally called Elemental. And they build an orchestration tool with a goal to sort of become a control plane for data infrastructure, which is really fascinating. And I mean, what a journey. I can't wait to ask him about it. I think one of the questions that I want to dig into with him is actually basic.
Starting point is 00:01:23 We've had some orchestration players on the show. Airflow is obviously a huge incumbent in the space, but I just want to talk about what problem orchestration solves. It'd be fun to define a DAG. I don't think we've done that on the show, which is surprising. And then just for myself, build a better understanding of the nature of the problem that they're solving so that's probably where i'm going to start how about you yeah yeah this is going to be a very interesting conversation we have plenty to learn from from nick i definitely want to ask about graphql first of all. Yeah. I think it's interesting. I mean, it's more common to see people getting,
Starting point is 00:02:13 you know, starting like a company in one space because they have prior experience to the same space, right? Yeah. Like to the part of the industry. But GraphQL is not part, let's say, of like the data infrastructure per se, right? Like it's not like a tool that typically you find there. And I'd love to hear the story behind it. How, like Nick, from building something so successful like GraphQL,
Starting point is 00:02:38 ended up building something in a different, let's say, a little bit different part of the industry. So that's one thing that's definitely I would like to chat about. And then, I mean, we need to talk about orchestrators. Like they are like a very important part of the infrastructure in data. We have Airflow, right? Which has been, let's say, the de facto solution out there. So I'd love to talk about
Starting point is 00:03:06 this whole space like product category of like orchestrators from Airflow to the whole landscape, like see what else exists out there outside of like Daxter and Airflow, right, and why so yeah, let's go and have
Starting point is 00:03:22 like this conversation, I'm sure like we are going to have like an amazing time with him I agree, let's go and have this conversation. I'm sure we are going to have an amazing time with him. I agree. Let's dig in. Nick, welcome to the Data Stack Show. Great to be with you. Thanks for having me. All right. Well, you have built some really cool things in your time and are building really cool things at Daxter Labs, which is super exciting. But get started at the beginning. Where did your career start and then how did you get into data? Yeah. So if everyone listening, my name is Nick Schrock. I'm the CTO and founder of Daxter Labs. And just kind of my career up until now, I'll start in 2009. So from 2009, 2017, I worked at Facebook and that was kind of the bulk of my career,
Starting point is 00:04:08 increasingly less true over time. But while I was there, my time was dominated there by building internal infrastructure for our application developers. So I formed this team called Product Infrastructure, which was infrastructure for the product teams. And our mission was to make our Apple Asian developers more efficient and productive. So we started building internal tools and internal frameworks, but we ended up externalizing that into open source projects. So React came out of that team and I didn't even do React, but it was kind of adjacent to it.
Starting point is 00:04:44 And actually the CEO of Dagster Labs is Pete Hunt. He's one of the co-creators of React. And then what I'm personally more associated with is I was one of the co-creators of GraphQL. And both those, especially React, ended up being successful open-source technologies. And so that was really exciting to be a part of. I left Facebook 2017, was figuring out what to do next. And I started asking companies outside the Valley what their biggest technical liabilities were. regardless of maturity of the company, this notion of data infrastructure kept on coming up as like the technical issue that was preventing them
Starting point is 00:05:28 from making progress on their business. I remember really distinctly, I was talking to a healthcare company and I kind of had data on the mind and I was like, okay, tell me about your data problems. And I expected them to be talking about HIPAA and like all these complicated issues. But then what they were talking about was much lower level.
Starting point is 00:05:46 And then I remember at one point in the conversation, I was like, wait, so you're telling me what in your mind was preventing you from making progress in American healthcare is the inability to do a regularized computation on a CSV file. And they're like, yeah, pretty much. And that was kind of the moment where I was like, okay, this is something I should really look at. And I was like data infrastructure adjacent at Facebook. So I knew about the issues, but I didn't live and breathe it as much as I did the application
Starting point is 00:06:15 space. So I really dug in. And what I like to say is I found the biggest mismatch between the complexity and criticality of a problem domain and the tools and processes to support that domain that I've ever seen in my entire career. And the only thing that came close was kind of full stack web development, say in like 2010, 2008, where you had like IE6, super immature JavaScript frameworks, just a completely hostile development environment. And as a result, also this kind of self-defeating engineering culture around it. And then you fast forward 10 years in full stack web development,
Starting point is 00:06:53 it's like the entire universe has changed and the tools are amazing and the quality of software being produced is so much better as well. And it was clear to me that that same sort of transformation was necessary in data. And I really was attracted and it was clear to me that that same sort of transformation was necessary in data and i really was attracted to the orchestration component of that story because i thought it was a linchpin
Starting point is 00:07:10 technology and i think we can go into that a little more but that was that's what started the process of forming dagster labs at the time it was called elemental we just recently changed the company name and yeah and i've you know incorporated in 2018, started to play with some ideas. And then really the company started to take off in 2019 with hiring full-time people and pushing out the project publicly. And fast forward, we raised a Series B a few months ago in a very hostile fundraising environment. So thank God we did that. And now we're scaling the company and feeling a ton of momentum and it feels great to really kind of, you know, really hit an inflection point with the company. Awesome. And just for, you know, I think most of our listeners
Starting point is 00:07:57 are familiar with the concept of orchestration, but tell us what Dagster Labs, what do you make and what problem do you solve? Yeah. So fundamentally, we sponsor, Dagster Labs sponsors an open source project called Dagster, which is an open source orchestration framework. And then we deliver a commercial product that leverages that framework called Dagaster Cloud. So orchestration performs a very critical function and its definition is slightly changing over time. But kind of the base case of orchestration is that when you are a data platform, to use a fancy term, is effectively just you have data, you do computation on it,
Starting point is 00:08:45 you produce more data, and subsequently you do more computation on that. And it's almost like an assembly line, right? But instead of an assembly line, you have an assembly DAG, directed acyclic graph. And what the orchestrator at its most primitive form does is decide when that factory runs and ensures that things run in the right order. And so if there's an error somewhere in that factory, you can retry the step. So at a minimum, you're ordering and scheduling computations in production. But that orchestration layer is a very interesting leverage point to build a more full featured product. So for example, the orchestrator has to know how to order computations. And therefore,
Starting point is 00:09:31 it has enough information to understand the lineage of two pieces of data in the system. So for example, we have integrated data lineage. And likewise, if you're data aware, the orchestrator ends up being a natural place to catalog all the data assets that are produced by your system. So we really conceive of ourselves, you know, we have to meet users where they are. And people categorize this as an orchestrator because they're comparing like, should I use this system or Dagster? And that's the way it works. But we really conceive of ourselves as kind of a control plane for a data platform that does orchestration as well as other components of the system. That was great. Super helpful. One quick thing. Can you define DAG? Because I think that's a term that's thrown around a lot,
Starting point is 00:10:21 especially like as an acronym, but could you actually define directed acyclical graph just so we level set? Because we never want to assume that all of our listeners know what all of these acronyms mean. Totally. So it's a highly, it's kind of an obtuse sounding term that actually describes something that's extremely intuitive. So it stands for directed acyclical graph. So what does that mean? I think the best real world analogy is a recipe for cooking. So imagine that you're cooking a recipe and you make a fundamental ingredient, that's a step. And then you use that ingredient
Starting point is 00:10:58 with two other steps. And then at some point you recombine those two sub-ingredients to do, say, put it in the oven or something. That is a directed acyclic graph, because it doesn't make sense for that to have cycles, right? Because if in the end you took something out of the oven and then you restarted the first step, you would never complete the recipe. So that is kind of, you know,
Starting point is 00:11:23 a recipe is kind of the real world manifestation of a DAG. I love it. Okay. You brought up a term when talking about the software engineering industry and you drew a parallel to data infrastructure and you described it as a hostile environment, right? I mean, you saw the disparate JavaScript frameworks, and in some ways, when you think back to that world, you have a set of tooling that's fairly primitive.
Starting point is 00:11:55 Why is the world of data infrastructure hostile? I know that there, you know, are there, what are the similarities, right? And I guess maybe to direct the question a little bit, to some extent in software engineering, like there were some primitive tools or some, you know, like very early frameworks, et cetera. Yeah. That's a little bit of a problem than like complex infrastructure where you're dealing with like a fragmented, with fragmentation, right?
Starting point is 00:12:26 So what's the nature of the hostility in data infrastructure? So why I think it's hostile. So I think there is a pretty good analogy in terms of, if you think about the history of data engineering, it kind of has a lineage that is not software engineering. I mean, historically, it was like drag and drop tools and Informatica and all this stuff. And it was kind of thought of like a lower status job as well. So there's also that lineage. And I think it's because data engineering in the end ends up dealing with very physical things in the world. Like you're moving around files and creating tables and data warehouses.
Starting point is 00:13:13 And it's actually, it is difficult to synthesize test data, to set up virtual environments where things are more flexible and whatnot, and therefore have like, you know, because software engineering processes really work when it's like purely abstraction based and you can kind of like shim out the right layers and have super fast dev workflows and whatnot. And that is fundamentally a very difficult problem in data. The analogy to web development is that, you know, web development wasn't executing on a physical thing, not physical data, but it was scripting the web browser, which was not designed to be a programmable surface area. So it was just this completely hacked system, and there was no good abstractions over the browser.
Starting point is 00:13:58 So you were left just manually testing the browser. There's no way to run JavaScript code on your laptop without booting up a browser and this super heavyweight thing and whatnot. And things like React constructed the right software abstraction between the application code and the underlying browser, which was this incredibly hostile substrate for software development.
Starting point is 00:14:20 I think the analogy applies to data engineering where there aren't good enough abstraction layers between the application code and business logic code that data engineers have to write and these underlying concrete storage systems and computational runtimes, which are actually extremely inflexible and hard to deal with. Yeah. And, you know, an example of, an example piece of infrastructure that has gotten popular these days, and putting Daxter aside, but one that really sticks out that kind of solves
Starting point is 00:14:54 some of this programmability problem is something like DuckDB, right? Which makes it very easy to kind of program against the same runtime on your laptop as well as in the cloud.
Starting point is 00:15:04 It's like, and you can actually, I guess you execute it in the web browser to tie this all together, which is nuts, right? So these technologies that kind of are very developer centric where the computation is portable between different environments are extremely powerful. Yeah. Yeah. Super helpful. Let's, okay. Let's talk about fragmentation really quickly because I want to dig in on that a little bit. So you have sort of a fragmented,
Starting point is 00:15:33 you have these fragmented systems. It's difficult to sort of have a development environment, you know, especially when you're dealing with these physical things like you talked about, the infrastructure is completely different. going to move towards increased fragmentation, or if there's sort of a rebundling happening, because to your point, you know, solving these problems with these disparate layers is really hard. And so, is the market going to sort of congeal around, or is it going to produce, you know, more vertically integrated, or I guess, horizontally integrated, depending on how you look at it, or both systems that sort of solve some of these problems in an integrated way. What's your take on that? And then how does Dagster fit into your take on that? Yeah. So that's a great question. I think you said the right words,
Starting point is 00:16:38 which is it's either going to be vertical or horizontal. Because I think anyone who's talking about this subject and who has even different... Lots of people have different visualizations of the end state, but I think everyone agrees that currently the world is too fragmented and life is too hard for people spinning up data platforms because they have to cobble together way too many tools. And it's extremely complicated, both technically, as well as like maintaining that many business and support relationships. It's just a, it's not a sustainable situation. So there's kind of two school of thoughts, I think. One is you're going to pivot back to a world of more vertical integration and go back to that world that's like Informatica and Oracle or
Starting point is 00:17:25 Microsoft kind of in the 90s. You pick one stack and choose everything. And that was inflexible and terrible. For technological reasons, you also got vendor lock-in. So the modern day analogy of this is you're either going to be like a Databricks company, a Snowflake company, or you're going to be one of the hyperscalers, Amazon. Microsoft just has their new fabric product, which is another kind of data platform in a box type solution. So that's the vertically integrated story. And if you're not going to do vertical integration, you need to solve this bundling issue, then there has to be some other layer of integration.
Starting point is 00:18:06 That's the horizontal layer. Now, I think of this naturally, given our position in the stack, I think the orchestrator is a natural place to assemble all these capabilities that you need over the data platforms like Databricks and Snowflake. And I think the other thing, I do think that there is a natural resistance to vertical integration. And I think companies instinctively know this because if you go to any large company, almost all of them are running Databricks and Snowflake.
Starting point is 00:18:40 Yeah. Like no one just picks one. Like everyone runs both because they're suitable for different workloads. And you also don't want to like bet everything on one vendor and get knocked into that degree. So I think in some ways, like the natural market resistance is doing this. Now, Snowflake and Databricks or whomever, they might build enough capabilities or do it in a tasteful way where customers still feel like
Starting point is 00:19:06 it's composable so that they can eventually get dominance but i just don't think that's fundamentally the way it works and so also like i don't want to live in that future i think like vertical integration is like boring and sad in some ways and so then it's you know what's the horizontal layer gonna be you know some people think it's gonna be the catalog you know, what's the horizontal layer going to be? You know, some people think it's going to be the cataloging, right? And that's the basis of the control plane. Some people think, okay, like Apache Arrow is like the way this works because you can have portable memory formats that you move between data platforms or whatnot. Yeah. I personally, because I mean, it's my job to say this as the founder of the company, but I think this orchestration control plane layer is like the natural place to put it because the orchestrator by its nature, every practitioner needs to interact with it because
Starting point is 00:19:57 anyone who's putting a data product in production has to put it in either, has to put it in some sort of asset graph because all data comes from somewhere and goes somewhere. So they have to place in the context of some sort of orchestration at some point. And then the orchestrator is also the system that shells out and invokes every single computational runtime and touches every storage layer. So I view it as this very natural choke leverage point that has to exist at any platform of real scale, any data platform of real scale. And, you know, the kind of user experience you want, I know it's a tortured analogy, but you really want something that feels like the iPhone, where you have this common set of rules, you have a grid of apps, and they
Starting point is 00:20:46 all kind of abide by certain rules in the ecosystem and provides order to the chaos. But within that order of chaos, you get a ton of heterogeneity, right? And that's, even though, yeah, it's a torture analogy a bit because like the iPhone is vertically integrated, but it's a vertically integrated OS with like an app store, right? So the analogy is on purpose, but I think that's like the, for lack of a better term, the vibe you want
Starting point is 00:21:09 is some sort of superstructure that brings order to chaos, but within that, you can mix and match technologies. Yeah, that makes total sense. Okay, let's, one more question for me before I hand the mic over to Costas.
Starting point is 00:21:26 So in that world where we think about horizontal integration, do you sort of operate with a set of maybe sort of foundational philosophies or design principles around like when and where the orchestration layer enters the picture. And so let's talk about trickle down, which we started out the chat with, talking about building these things at Facebook, fangs way ahead of most of the market because they're inventing new technologies that eventually then trickle down and sort of, you know, companies can adopt them. When we think about orchestration and DAXR in particular, is your view of the world that this is really where you should sort of start building your infrastructure, right? So Ceteris Paribus, like you start with sort of an orchestration layer and then augment your stack over time around the orchestration layer? Or is it a situation where you really only need this when you hit
Starting point is 00:22:31 a certain level of complexity or when you have multiple storage layers or some sort of breakpoint or threshold where orchestration is the right tool for the job? So typically, I think that orchestration should be one of the first tools that you go to in the data platform. For example, if you only have one data warehouse and you know you will only ever use DBT, you only use templated SQL, you'll bring in no other technologies and you don't need anything beyond a cron scheduler and you know that for certainty for certain yeah might not need a full orchestrator snowflake dbt tableau and you're done you know yeah something like that right and if you don't need any automation right it would be another example where right like literally you can just like manually do stuff and only create
Starting point is 00:23:27 things on demand and you're comfortable with manual intervention. That's another example. But for nearly every single data project, I think that orchestration is fundamental and essential. You need to schedule things. You need to order computations. You need to do it across multiple stakeholders, multiple technologies, typically. And by multiple stakeholders, sometimes I should probably say
Starting point is 00:23:51 multiple roles, because even as a solo practitioner, you often kind of wear different hats depending on what you're thinking about. Sometimes you're thinking like, oh, I'm building infrastructure for myself. And sometimes I'm building the actual data pipelines. And I think we have work to do to educate the marketplace to convince people that the orchestration should be one of the first tools you adopt rather than one of the last. But I think in reality, the things you do within the orchestrator are so fundamental and essential to building data pipelines that it should be in the picture from day one, right? You're going to have errors in your production. You need to be able to resolve those as easily as possible. You're going to need alerting.
Starting point is 00:24:37 You're going to need to schedule things. You're going to need to order your computations. And, you know, those are like the basic tools in mind. If you don't build an orchestrator in place, you very quickly left with like a Rube Goldberg machine of like, maybe you have like four different hosted SaaS apps and they have like overlapping cron schedules and you're like debugging issues across multiple tools. I just don't think it's tenable. Yeah. Yeah. Yeah. It's super interesting. It's, you know, I kind of think about it as, you know, you said that like the
Starting point is 00:25:13 tortured analogy of the iPhone, but I was just thinking about like a dashboard in a car where it's interesting because it feels like a single thing, but it actually represents a massive amount of complexity with relate to very different parts of the system, right? From breaking to pressures that are running in different pieces of the engine or transmission or all these separate things. But it feels, you wouldn't think about designing a dashboard for a car in a disparate way, right? It represents a related system.
Starting point is 00:25:52 Yeah. Especially for coherent operational aspects. It's like being able to go to one place and have the source of truth, the so-called single pane of glass where you can be like, what's going on in my system right now? Yeah. I just, I cannot imagine running a data pipeline, a single data pipeline, let alone an entire platform of data pipelines without that operational tool. Yep. All right, Costas, I could keep going, but please, I know you have a ton of questions. Yeah. Thank you, Eric.
Starting point is 00:26:22 All right, Nick, let's go back to your GraphQL days first, okay, before we get deeper into what Duxter is doing and talking more about the data infrastructure. So I have a question, since you described your journey, what have you learned from working with GraphQL, a tool that is primarily used by product engineers to go and create applications, that you think is also very applicable to what Daxter is doing? So that's a great question. that you think is also very applicable to what Daxter is doing, right?
Starting point is 00:27:06 So that's a great question. I think there's a few lessons. One is if you can express the problem that your users are trying to solve in concepts that make sense to them and align with their day-to-day experience in the ground, that is enormously powerful. So the analogy is that in GraphQL, I think the novel insight, because people are like, oh, why don't you just use something like SQL? Well, the reason why is that SQL is fundamentally tabular and GraphQL is hierarchical. And the reason why that's powerful is that if you're a front-end developer, the view libraries that you're dealing with, 99% of the time, it's a hierarchical structure.
Starting point is 00:27:55 You're thinking about nesting elements within each other. Query language that directly maps with that is extremely powerful, both in terms of just understanding the query language, maybe most importantly, building client-side tools that line up those views with that data fetching. It's just an extremely powerful paradigm. And similarly, with Dagster, for example, we really thought about from first principles, what are the things that you're... What are you actually doing when you're building a data pipeline? What's the outcome you're trying to affect in the world?
Starting point is 00:28:29 And how do you interface with the stakeholders who care about you? And, you know, we kind of have this phrase we say around the office, virtual office, which is like, no one gives a shit about your pipelines, right? All they care about is the data assets that they depend on, right? Pipelines are implementation detail. And if you can express from the developer's perspective, if you can start out with like, hey, declare the assets you want to exist in the world, and if everything downstream of that makes sense, then everything lines up better. Both your own internal tools, the way you communicate with stakeholders, et cetera, et cetera. So I think that's one lesson is like really, and at Daxer, we really, this has been a struggle. This has been a challenge, I'd say over years to really dial in this language and we're still working on it, but getting that right is super important. And the other thing that I learned with GraphQL is that a lot of these developers, there's kind of this common
Starting point is 00:29:28 trope. You talk to VCs or you talk to engineers, there's a lot of almost contempt for the broader software engineering communities. Like, oh, all developers are dumb and you're used to only dealing with the top 1% of developers and that's the way it's going to work. And what I found in the GraphQL space is that people aren't dumb. Developers know their domain and their business. They are generally quite smart, quite bright and competent, but they are extremely busy. So I think people confuse smart, busy people with uninterested people and that causes a lot of people to build underpowered tools like GraphQL like it used something like GraphQL you're still relying on the users to do a lot like they're building a very complex piece of software underneath the hood GraphQL provides provides this overarching structure that makes sense in their mind and tools on top of that. But
Starting point is 00:30:30 beyond that, developers have to do all sorts of clever things. And I've always been pleasantly shocked at how sophisticated the GraphQL community is in terms of building custom tools and whatnot. And I think the same thing applies to the data engineering world, where you don't just want to give out-of-the-box solutions, but you also want to provide developers a toolkit to make them more productive. And you have to find the right balance in doing that. But I think having that mentality is critical. And that has really served us well in the DAXer journey. I still can't believe some of the use cases people apply this stuff to. So from first principles, getting your mental model right, super important. And two,
Starting point is 00:31:22 understanding that your users are smart, busy people, typically building complicated things and understanding where to give them the tools where they can do the complex things while simplifying everything else as much as possible is also critically important. And then the last thing is being consistent with messaging is utterly critical. I remember, you know, the GraphQL, once we started to see meetups where people that we didn't organize, where people were effectively saying the things that we propagandized and advocated for. And I'm like, I remember the meeting where we decided on using that phrase, not another phrase. Now it's being repeated in Johannesburg and we didn't need to talk to anyone to make that happen.
Starting point is 00:32:07 That's like a very powerful thing. So consistent messaging is another thing that comes to mind. Okay, that's awesome, actually. That was an unexpected outcome, to be honest. I didn't expect to hear that from you, but that actually makes a lot of sense, I think, especially when we are talking about new paradigms and new technology, right?
Starting point is 00:32:27 Like a period where the marketplace out there needs to go through education, right? So consistency, I think it is critical. So that makes sense. I think the other thing I've learned is that it's really important for a technology to be viewed as a career enhancing move to adopt. So if you can build a technology where people feel like they will advance further in their career and achieve better financial success and notoriety because they adopt you. Like that is like an incredibly important place to be in a, in terms of a technology provider.
Starting point is 00:33:10 Yeah. A hundred percent. A hundred percent. Okay. One question just to try and help, let's say the people out there who are coming from one or the other. And when I say one, the other,
Starting point is 00:33:22 I mean product development in one product engineering in one side and data engineering on the other. And when I say one or the other, I mean product development in one, product engineering in one side, and data engineering on the other side, right? And product infrastructure and data infrastructure. So obviously there are like two different domains, but there has to be like some overlap, right? Like there's still engineering, right? It still has to do both like managed data, both managed states,
Starting point is 00:33:46 both have to present something to someone. You do something and all these things. So the question that I want to ask you, because I don't want to ask you what they have in common and whatnot, right? I'll try to do like in a little bit more creative way. From your experience with GraphQL in product engineering, if you had to choose, let's say, in data engineering,
Starting point is 00:34:11 a technology that is closer to what GraphQL is for product engineering, what you would say that this is? Which part of the stack out there, it can be the orchestrator, it can be, I don't know, like Arrow, as you mentioned at some point, the query engine, I don't know. Like the good thing, the good and the bad thing with data infrastructure is that there's so much to choose from out there, like all these things. But what has like a similar utility at the end for the engineer out there right like positioned in the stack if there is there might be might not be something wrong well i mean i think one of the reasons why i was attracted to the layer in the stack that dexter is in is that i
Starting point is 00:34:57 felt that the orchestrator served as a basis of a layer which could serve a similar function as GraphQL, but in the data domain. Insofar as, you know, GraphQL is this very compelling choke point in a front-end stack where all the different clients, Android, web apps, iPhone, all go through the same schema. And then that is then backed by a piece of software that talks to every single service and backing store at the company and provides this organizational layer to kind of model your entire application, right? In a similar way, I think that Orchestra
Starting point is 00:35:41 serves the same function in terms of the place where you can model your entire data platform, where all different stakeholders can kind of view it through the same lens, just this graph of assets, right? And then each one of those assets can be backed by arbitrary computation, arbitrary storage. So I really do think it's, and I think that's why I was attracted to it, whether implicitly or explicitly, is that I felt this kind of had that same property of being both a technological and organizational choke point through which you could deliver enormous value and have enormous leverage. And why do we need such a different implementation of the technology then? Why we can't get GraphQL and somehow adapt it and be like the interface for the data infrastructure too? What is the reason? And it's more of a technical question when I'm asking, to be honest.
Starting point is 00:36:39 Yeah, I mean... Why is this happening? Well, I just think it's a completely different domain and problem space. I remember people were like, oh, Nick, why aren't you thinking about using GraphQL for analytics? And I'm like, absolutely not. Absolutely not. And the reason why is what I was talking about before is that GraphQL works for front-end applications because the net thing that you view on the screen is hierarchical, right?
Starting point is 00:37:11 And that makes sense. When you're dealing with analytics, you're looking at tables. You're looking at tabular data and direct renderings of that in dashboards and whatnot. So SQL is the right tool for analytics. And GraphQL is a better tool, in my opinion, for building kind of front-end products. And so they're completely distinct domains.
Starting point is 00:37:33 Okay. That makes total sense. One last question that has to do with, let's say the boundaries between these two different disciplines, but that also have some similarities. So in the data infrastructure, we are talking about orchestrators, right? We have this concept like Daxter, right? We have, as you said, tasks to be executed.
Starting point is 00:37:55 We have some scheduler. We have managing failures, all these things. There is another, let's say, in the product engineering space, there's also like the concept of the workflow engine, right? And there's like a lot of conversation like lately about workflow engines and how close they should be to the state or like should be part of the database, like the transactional database or not, or outside. But they have some similarities, right?
Starting point is 00:38:25 Like at the end, even like the workflow engine is like, it is a DAG pretty much. Like you have some tasks that they need to have some ordering and how they get executed. Maybe necessarily, let's say doing like data processing directly, but there might be an end point that you have to go and like hit somewhere, right? Yeah. Why do we, like, again, what's the difference? Like why we can't have one, right? That can drive, let's say, the data infrastructure
Starting point is 00:38:56 and like the processing there. And like the same also like with product engineering where we have to orchestrate again like tasks what's an example workflow engine and product engineering that you're thinking of just so i can because workflow engine is like can be different things different people yeah 100% like temporal for example is like a product sometimes like in my mind right yeah temporal is really interesting actually i think fundamentally something like temporal is really interesting, actually. I think fundamentally, something like temporal is a more imperative and general purpose tool. But you have to make explicit trade-offs there that make it less well-suited for doing data processing in the context of a data platform. I think the simplest visualization of doing it is that using temporal,
Starting point is 00:39:50 if you wanted to understand the lineage of your data assets without executing it, that is literally impossible in temporal because temporal is a more dynamic state machine that makes very different trade-offs. So there's nothing preventing you from using temporal to perform a subset of the functions in a data platform
Starting point is 00:40:16 orchestrator. But it just doesn't fit all the needs of data platform teams. So there's a world where there needs of data platform teams. And so, you know, there's a world where there's a data platform stack where temporal is a component of it, but fundamentally, fundamentally it's very different. Something like temporal is interesting. I'll be very curious to see how it develops over time because it's actually an extremely invasive programming model and i would
Starting point is 00:40:48 if i was hired as a vp of engineering at some larger company that i bet heavily on temporal for its infrastructure i would be lying awake sweating at night thinking about like how do i debug this if it goes wrong? Cause like you're putting so much faith in the system to like re-entrant, you know, be re-entrant and like, you know,
Starting point is 00:41:13 manage all the state properly. And like, if something goes wrong, I just have a hard time debugging it. But yeah, I'm like, I'm both like extremely intrigued and amazed byal and also kind of terrified of it. At scale, especially.
Starting point is 00:41:29 Yeah, makes sense. Makes sense. All right, cool. So let's focus more on the data and stuff now. Let's talk about orchestrators in data infra, right? Like DAX is obviously not the first one. There are like many different solutions out there. Some are like more niche, some are like more generic. I would say probably
Starting point is 00:41:52 the most well-known one is Airflow, right? For sure. So let's give us, give us like, like how you see the landscape out there. Totally. Yeah. Airflow is funny. So the lineage of airflow is actually from Facebook. So it's based on a system that was kicked off in Facebook in 2007 called data swarm, which still exists. And then Max who invented airflow,
Starting point is 00:42:20 who created airflow, who I know very well. And he actually left Facebook, went to Airbnb, realized they needed a similar system and kind of basically built V2 of Data Swarm. And I think that Airflow did a couple of really important things. One is that you could build DAGs.
Starting point is 00:42:42 You could write your code in Python rather than having to use a UI or use some really inflexible config system. And then it had a nice UI. And so between being able to use Python, which gave a level of dynamicism in a language that data people understood, and a high quality UI, it really took off. But there's a few things that are a problem with Airflow. One is clearly not written for the local development experience. And these systems are complicated enough where you need to be able to test them, do automated testing, have fast feedback loops, because those are the foundations of developer productivity and developer productivity is absolutely huge. And the other thing, and this is funny, we like to say that even though it is kind of the incumbent orchestrator that people build data pipelines in, it actually is not a
Starting point is 00:43:44 great tool for building data pipelines because, it actually is not a great tool for building data pipelines because it's not aware of the data that it produces. It's kind of this like tautological thing. It's like the wrong layer of abstraction for data pipelines has got its momentum and became a norm. But we fundamentally think a more kind of data-oriented approach is important. So if you think about the landscape, and I'll include DBT, Dagster, Prefect, and Aerophone. Prefect is actually much more similar to something like Temporal at this point, because Prefect's new 2.0 product,
Starting point is 00:44:19 that was a company started like a year before Dagster Labs, and they have this sort of dagless vision that's similar to temporal or just arbitrary workflows. And so it's more imperative and generic. Then you have the task-based dag system, which is Airflow. Then you have dbt, which is very popular, which is exclusively for Jinja templated SQL with a hint of Python these days, but like 99.9% of the usage is templated SQL. They build a graph of data assets as well. They call them models and they exclusively execute over the warehouse. And they're targeted for these kind of software engineer analyst hybrids they call analytics engineer. And if you think of those as a spectrum, Dagster is kind of in between Airflow and DBT,
Starting point is 00:45:07 meaning that has a much more declarative data-oriented approach, similar to DBT, but is targeted towards data and ML engineers and more trained software engineers and is more flexible and can be backed by any arbitrary computation, not just Jinja Template SQL. So that's kind of the landscape.
Starting point is 00:45:28 One way declarative hyper-focus on the data warehouse, SQL, that's DBT, all the way to the other hand, something like Prefactor Temporal, which is completely DAGless, more of a straight workflow engine, and then kind of Airflow and DAGster in between. Okay, that was awesome. And are there any, let's say, more,
Starting point is 00:45:51 how to say that, like, niche type of, like, orchestrators? Like, there's this whole thing around, like, ML Ops, for example. Like, is ML, is there, like, a group of orchestrators that are just, like, for ML? So ML Ops is super interesting. I think it's something that we're going to be focusing on in 2024 because
Starting point is 00:46:10 we actually really believe that the MLOps ecosystem is unnecessarily siloed. There was an article and they don't need their own orchestrators. MLOps should be a layer, not a silo. And there was this great article that hit Hacker News
Starting point is 00:46:28 like six months ago, which was like, MLOps is 98% data engineering. And I think that's totally true. I wrote this news, by the way. You wrote that? Oh my, I never connected that. Okay, so this is perfect. That's amazing.
Starting point is 00:46:42 We love that. That's like the basis. That's going to be the base of our product marketing next year. Because if you interview, we don't emphasize our ML use cases at all. But our cloud customer base, over 50, so 90% of them use it for ETL and analytics, which makes sense. But 50% also use it for ML and 40% also for what they call production use cases. So multi-use case is the norm. And what happens is that a data platform team brings in Dagster and then they start using it. And then their ML team wants support and doing stuff. They talk to the ML team. They're like, well, we want to write Python. We want to write DAGs of
Starting point is 00:47:21 stuff. We produce a bunch of intermediate tables. And at the end of the line, we produce models. They're like, well, that sounds like data pipelining. And DAGster provides a great foundational tool for the data engineering components of the MLOps job, which is 98% of it. That's so funny that you're the one that wrote that article. So we totally buy into that. We view it, DAGaxter is about data engineering.
Starting point is 00:47:48 So we kind of think of data engineering as this layer. And then different parts of the data pipeline overlay different technologies on top of that layer. So in the middle of it, in the data warehouse, you might have dbt core. In the ML component of it, you might have MLOps as a layer of tooling on top of that, but it all shares a common control plane driven by data engineering principles. Yeah, that makes a lot of sense.
Starting point is 00:48:14 I mean, obviously, I agree. I also like what was the reason I wrote that post. Great post. Yeah, it was much more impact than I expected, to be honest. And it was very interesting to see the reactions, both from the ML, let's say, group of people, and also the data engineers. Anyway, maybe we should have an episode just talking about that.
Starting point is 00:48:37 Oh, yeah, yeah. I do believe there needs to be a convergence between the two disciplines. It is important if you would like to keep adding more and more value and faster innovation
Starting point is 00:48:49 like in the industry. Otherwise, it's just like things are like way too fragmented and it doesn't make sense. Yeah, it doesn't make any sense. No, we got to get Sandy
Starting point is 00:48:56 on this. He's the lead engineer of the project and we could get going for two hours getting ourselves whipped up
Starting point is 00:49:04 about this subject. Yeah, we should do that. Absolutely. We'll arrange that. So, okay, let's go back to like specifically to Airflow. And I have like one last question here. If there's something that you are, let's say, envious of that Airflow has, right? And VSF?
Starting point is 00:49:25 Yeah, what that would be, one thing. Oh, just the install base. You know, like that's pretty much it. I feel like we compare favorably almost every... Actually, the install base and the existing corpus of searchable content related to the technology are the advantages of incumbency.
Starting point is 00:49:50 But those are the two things I envy. But we're making good progress. We do. That's true. And I think you are generating also some pretty good content out there. So, okay.
Starting point is 00:50:05 But Envelope has been around also for how many years now? Like it's like 10 years, maybe like a little bit like less than that. So that's a lot of stuff. Max wrote in 2014, I think it was open source pretty quickly, 2015. So we're getting there. Yeah. Yeah. All right.
Starting point is 00:50:22 Let's talk about Daxter now. You've been working on this for quite a while. What's next? What are the next couple of things that are coming out about Daxter? And what should we be excited about for the future releases? Yeah. So I think that our near-term future is very much about demonstrating, we kind of have this position in the stack. We claim to be this operational single pane of glass. We have companies where a bunch of different stakeholder teams are adopting it.
Starting point is 00:51:04 And now the next part of the journey is like using that leverage point to deliver more value to teams. So one point of that is that, you know, I think this show is going to come out like a week before our launch week, but we're going to be announcing embedded data quality in the orchestrator. So, and that doesn't mean we're going to try to replace DBT tests or replace SODA or replace gratifications. Those are all systems where we can leverage, but it's more about almost making the orchestrator data quality cognizant, I would say.
Starting point is 00:51:39 And so it's very, we actually get, this isn't just us like reaching outside of our domain our users like explicitly want this because they're used to looking at our asset graph and being like what is the state of my system and having an extra checkbox there that says i passed all my data quality tests is the most natural thing of world to want to integrate and then being able to alert like okay if this thing fails you, ping me in Slack or whatever. So it's a very natural extension of the orchestration system. And we think that in five years, any orchestration platform that doesn't include data quality in this manner will be viewed as woefully incomplete. Similarly, so we're adding data quality capabilities.
Starting point is 00:52:26 Similarly, we also are adding consumption management capabilities to the system. So first of all, we're going to kind of be augmenting our integrations to make it very straightforward to collect metrics about consumption. So like how many snowflake credits is each asset consuming? And then what's very unique is that one, we can index that by asset name in our system.
Starting point is 00:52:55 So we can give reports to say like, hey, you're recomputing this thing all the time. It's consuming this many credits. Are you sure you're getting enough value out of that? And then second of all, the orchestrator is also naturally a very interesting place to embed cost and consumption information because you can do quoting,
Starting point is 00:53:16 you can provide quotas, you can project how much computation is going to cost going forward. It's just a very natural place to embed that sort of cost information. And we think that's going to be incredibly powerful. Jeff Bezos famously said, your margin is my opportunity. I think the equivalent to data is your NDR, your net dollar retention is our opportunity because you can't increase your snowflake, spend 80% year over year. Eventually you'll run out of money and you need tools to be able to control that. And we think that WorkSphere is a natural place to do that. especially that centralized data engineers and data platform teams can bring in all the computations of all their stakeholder teams in a way that's much, much easier, that doesn't require
Starting point is 00:54:11 modification of that external code or minimal modification of that external code. I think right now, I think Dagster historically has gotten dinged because it kind of, it feels like it has to take over too much of your system. And I think that feedback was pretty accurate actually. And so we've really taken that to heart. So we really want to make it so that instead of the entire organization having to become Dagster experts, only a centralized team has to become Dagster experts. And everyone else is kind of becomes Dagster curious where they just know hint of Dagster and then they can use our operational tools and it all kind of works smoothly. So, you know, in the end, the goal of that launch week is really to make these centralized
Starting point is 00:54:57 data teams, data platform teams, especially way more leveraged. So make it way faster for them to kind of bring everyone into the orchestration platform. And then once they're there, be able to use these value-added features to deliver enormous value super quickly. So beyond orchestration is kind of one of our internal teams. And we want to really kind of develop this future
Starting point is 00:55:24 of this more advanced control plane that I think data teams desperately need. Yeah, that makes total sense. All right. We are really close to the end here, and I want to give some time to Eric to ask any other questions that he has. But we definitely have to do at least one more episode.
Starting point is 00:55:45 I think we have a lot to chat about. So I'm really looking forward to do this again in the future. Yeah, it's been great. It's so funny you wrote that article and I was just name dropping it. I swear to God that wasn't on purpose. That would have been real 4D chess to pretend to not know that you had written it, then drop it. Yeah.
Starting point is 00:56:08 Costas, you're a real data influencer. I mean, you're the foundation of product marketing strategy. Okay, just one last quick question. We've talked so much about data, and Nick, you've been so articulate and so helpful on so many subjects. So my last question actually has nothing to do with data. So if you weren't working with data or building technology tooling, what would you do? Oh, I actually have a good answer for this. We are on the cusp in the world of an energy transition, and no one understands the implications
Starting point is 00:56:43 of it. And this isn't for environmental reasons. We reached this tipping point where solar energy is now cheaper than all other fossil fuel, all other energy generation. So solar, wind, and battery together, not only is it cleaner, which is interesting, but also it's just way cheaper. And by definition, there's no fuel inputs to that. And that is incredibly exciting because not only... Because there's kind of like
Starting point is 00:57:12 this mentality that, oh, by decarbonizing, we're going to have to degrowth and go back to some pastoral life where no one drives and no one travels. completely false. If we do this right, we're actually going to live in a world with effectively infinitely abundant energy. And working on that transition, I think, would be incredibly exciting. Because effectively, the way that the math works out for these clean energy systems is that the cheapest configuration of building them means you dramatically over-provision solar and wind generation capacity so that you don't have to build as many batteries.
Starting point is 00:57:55 And that means that most of the time during the year, you have a wild excess of eventually limitless free energy. So I think there's going to be entire new waves of industry that are built to effectively take advantage of this intermittent, infinite, virtually free energy. I think it could be an incredible future. So that is what I would be working on.
Starting point is 00:58:21 Fascinating. Man, that's an episode in and of itself. Well, as we like to say, we're at the buzzer. Brooks is telling us it's time to land the plane. But Nick, this has been such a wonderful conversation. Thank you for giving us your time and we'll have you back on very soon. Awesome. Yeah, this was great. Thanks for having me. What a guest. I feel like every time we asked Nick a hard question, he was able to come up with an answer that was concise and articulate
Starting point is 00:58:49 for every single question. It was really amazing. I think you probably asked one of the winning questions, which was around the data orchestration landscape. And what a fascinating answer. He really, I mean, it was really helpful to hear him talk about the entire spectrum where you have, you know, temporal, which you brought up on the show, which is sort of, you know, embedded in application code and sort of, you know, deeply integrated workflow, you know, sort of generic workflow execution
Starting point is 00:59:26 all the way over to, you know, the DBT side of things, which is, you know, sort of ginger templating and, you know, managing jobs for running SQL queries. And he really painted that entire picture. It was so helpful to me. And that is my big takeaway. So I think this is a show for anyone who wants to understand deeply the history the current state and then think well about the future of orchestration this is a great show yeah oh 100 i think it's not just like for the future of orchestrators i think
Starting point is 00:59:59 it's a glimpse to the future of data infrastructure in general. Yeah. And data engineering, I would say also, because that's like another part of this episode that I think is super unique and super fascinating is that we talked a lot about what are the differences and also the overlaps between product engineering and data engineering, the tooling, the infrastructure out there, why we need to have these different disciplines
Starting point is 01:00:28 or domains, what we can learn from one and transfer to the other. And I think Nick had such an extensive experience at a kind of unique scale at Facebook.
Starting point is 01:00:43 So his perspective I I think, is very interesting and very insightful and not that easy to find out there. So I would encourage everyone to tune in and actually listen to the conversation I would have. And hopefully we are going to have more conversations with him in the future too. Totally agree. One last bonus from this episode is that
Starting point is 01:01:07 I think this is the episode where you became a data influencer because, and I won't give away too much, Nick referenced that they're building their product marketing strategy off of a particular article that went on the first page of Hacker News that may or may not have been authored
Starting point is 01:01:23 by one of the co-hosts of the show. So if that tantalizing piece of juicy information is interesting to you, listen to the entire show to hear more. Subscribe if you haven't, tell a friend, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app
Starting point is 01:01:43 to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
