Drill to Detail - Drill to Detail Ep.109 'Dagster, Orchestration and Software-Defined Assets' with Special Guest Nick Schrock
Episode Date: August 3, 2023
Mark Rittman is joined in this episode by returning guest and Elementl founder Nick Schrock to talk about Dagster's role in the modern data stack ecosystem and software-defined assets, a new, declarative approach to managing data and orchestrating its maintenance.
Introducing Software-Defined Assets
Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster
Optimizing Data Materialization Using Dagster's Policies
How I use Dagster to orchestrate the production of social science data assets
Transcript
Yeah, so software-defined asset is a way to structure your data platform and the data pipelines that constitute your data platform.
And what it is, is a definition in code of an asset that is supposed to exist.
It's really about a way of thinking in terms of moving the canonical definition of a data asset from physical storage to a software definition.
Welcome to another episode of Drill to Detail, and I'm your host, Mark Rittman.
So I'm very pleased in this episode to be joined by Nick Schrock, founder of Elementl, and returning guest to the show.
So Nick, it's great to have you back with us. And for anyone who doesn't know you, tell us a bit about yourself and what you do.
Yeah, so thanks, Mark. Thanks for having me. It's a pleasure being back. So yeah, as you mentioned, I'm Nick Schrock. I'm the founder and CTO of Elementl, the company behind Dagster.
Prior to that, I spent a bunch of time at Facebook and was best known there for being
one of the co-creators of GraphQL. So Nick, you came on the show before,
a couple of years ago, talking about Dagster. And in this episode, I want to talk about a particular thing that you're focusing on now with
Dagster, which is software-defined assets. But before we do that, can you give us a bit of a,
I suppose, an elevator pitch, really, for what Dagster is? And that will set the context,
really, for the rest of this conversation. Yeah. So Dagster is a Python framework for building data pipelines. And the way that we conceptualize it is that we orient the abstraction for building data pipelines around the final output of those things, which are data assets. The purpose
of a data pipeline is to keep data assets up to date. That's the way that we define the problem.
And we're a Python framework that enables the building of those things. It really conceives
of building data pipelines across the entire software development process. So making it
extremely fast to develop, having a full test lifecycle, really thinking about how it
goes from dev to test to staging to production and so
on and so forth. And then by orienting around assets, not only can we do what traditional schedulers do, which is kind of just schedule things in production, but we also give you a base
level of data lineage, data observability, and other features that are very much oriented around
the data assets you produce. Okay, fantastic. So we'll get into a lot more of the detail of
things you've been talking about there as we go on. But just before we start, so you're the founder
of Elementl. So what's the relationship between Dagster and Elementl? Well, Elementl is the company
behind Dagster. And so this is kind of the corporate host for Dagster.
But really, the complete focus of the company is Dagster,
and the commercial product is Dagster Cloud.
Okay, fantastic. That's good.
Okay, so when I first started using Dagster,
it was almost like an alternative to tools like, say,
dbt Cloud or other DAG-type sort of orchestrators,
where we wanted to have an alternative to those tools and
potentially be able to orchestrate things other than, say, dbt jobs. But something that I've certainly encountered in projects over the last few years, or certainly the last year really, is the increasing complexity of those projects. So I suppose the number of models in a dbt project or a DAG has kind of increased the complexity of working on those, especially as a team, and of extending and governing those. You know, it's got to the point now where we start to see vendors like, say, dbt Labs talking about things like data contracts, and we'll get on to those in a moment. But I suppose projects and requirements and the amount of things you need to orchestrate have probably increased in number and complexity a lot over the last year or so.
Talk to us about that, really, and what you're seeing in the market.
So I think part of what we're seeing is a bit of everything that is new is just things that have already happened before, insofar as people who are adopting the modern data stack start out with just an ingestion tool, maybe dbt, maybe a reverse ETL tool, and that's the start of their journey. And now they're rediscovering all the problems
that large tech companies had to solve internally
or kind of more vertically integrated solutions
had solutions to many years ago.
Like Mark, you used to be an Oracle practitioner.
And I think that you're probably encountering gaps in the tooling
that would have been filled by a more all-encompassing solution like Oracle.
I've never been a deep user of it,
but just from the way that you've described it,
I feel like I can imagine the conversation of someone who's like very modern data stack native coming up to you and having some brilliant idea.
And then you being like, wow, that's interesting because I had that feature in Oracle, you know, 15 years ago.
So I think there's this kind of, you know, the modern data stack opened up the notion of building a data platform to a much broader set of people.
And so that happened.
And then also a broader set of companies in the world felt the need to build a data platform because they were a cloud from day one.
They could adopt infrastructure incrementally.
They didn't have to write a million-dollar contract to Teradata just to get a data warehouse.
They could pay as you go.
And then also the demands of the external world have made it so that people are demanding more data-driven applications.
It's just becoming more of a norm.
So data is now critical for companies all the way from inception to IPO, right? So all this lead-up is to say that I think what's happening
here is that people who are building platforms, starting with the modern data stack, are
re-encountering the same problems that many, many data teams throughout history have encountered
before. And the world is getting more demanding, right? Increasingly, the ability to manage data is not just like,
oh, we can accurately report finances.
It's critical to the operations of the business
and considered a competitive advantage.
So if you look at someone like Instacart or something, right?
Their data platform is critical to the way the product functions
because it impacts the recommendation engine.
They have to integrate data sources from a bunch of different grocery stores.
You know, it's like critical to their functioning.
So I think that more people are doing things.
They're doing things with new tools that aren't as mature, actually,
in their life cycle.
And then the external world is also more demanding
and demanding more complicated data platforms.
Every time you digitize a process,
it ends up producing data
that needs to be incorporated into your entire tool chain.
So we have more SaaS services than employees.
So do most smaller companies.
It's been interesting reading some of the kind of data Twitter and data LinkedIn.
People are actually very conscious of the cost of refreshing and keeping these things running as well.
Have you noticed that as well, I suppose, people's awareness of the cost of refreshing these and keeping these things running?
Yeah, no, we're seeing huge demand from the marketplace on
this front. This is on our roadmap, actually. Kind of jumping ahead, probably something I'll
talk about later. We really think the orchestrator is a natural place to want to control and
comprehend costs because it's the thing that kicks off compute. It's the thing that kicks
off the thing that causes the consumption of these services. And right now, the tools that people have to control those costs are extremely coarse-grained and blunt.
It's like, oh, I'm going to refresh this every week instead of every day.
Well, that's not really possible.
And you have to think about what the different SLIs are for all the different data assets in your platform. So another aspect of, I suppose, the increased complexity of doing this is how you can work together as a team.
Or when you've got, say, a bigger organization that has got maybe many distributed owners of data or distributed stakeholders,
actually the sheer complexity of trying to develop on a, say, monolithic platform,
monolithic repo, gets complicated as well. So have you found, again, that actually the costs
and the overheads and the complexity of governing and developing these sort of systems can get a
bit sort of out of hand as well? I mean, for sure. And I think what we're seeing with increasing frequency, even at what you'd consider the smaller side of organizations, 100 people or more, say, is the development of the data platform engineer and data platform engineering as a discipline.
So we just see that all over the place.
And that's actually Dagster's natural constituency, I'd say, the data platform engineer.
Because we want to combine having practitioners who are co-located with the business for efficiency purposes with the reality that those different business units have data assets that depend on each other, right? Like the machine learning team is consuming data from other teams. So you treat the data practice as a platform where there's shared infrastructure, but then you can independently sort of develop applications on top of that. And applications in this sense are like sets of data assets.
So there have been a few initiatives around adding sort of structure and governance and so on to platforms like this to deal with the complexity and so on. And one of them is the concept of data contracts. Okay. And we'll get onto how that leads into what we can talk about with Dagster in a bit, but maybe just explain what is a data contract
really in your understanding and kind of what problem does it try and solve, really, in this kind of space?
So I think there are two operative definitions of data contracts in the world right now.
And I think they're actually quite different and worth talking about.
One is a data contract between operational systems.
For instance, your main application and, say, the Postgres tables that back it, and the contract between those things and the data warehousing system.
Right. And that involves a very specific set of technologies and practices where effectively the problem you want to prevent is an application engineer changing the schema of a database table and then having that break the entire data platform.
And that's one level of data contracting.
And that is kind of a different set of tools and techniques.
And another form of it is the way that data teams
within the data platform kind of interoperate.
And they sound like the same problem,
but they end up having quite different technical solutions
to the point where they almost have
completely different solutions to them.
And so a data contract on that side of things, between data teams, is like, hey, and there's a couple of ways to structure it, but it's like, I'm producing this table in a data warehouse. It has such and such columns.
And then a downstream team somehow encodes that I depend on this table with such and such of these columns.
And if either side breaks the assumption there, you prevent the breakage before it gets committed.
So I think, you know, those are effectively the two operative definitions of data contracts.
It's an agreement between two different stakeholders in a data system about qualitative or quantitative dimensions of the data on which they operate.
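To give a rough, illustrative sense of that second flavour, an intra-platform contract can be as simple as an agreed schema that both sides check before changes land. The sketch below is purely hypothetical and not any particular vendor's API:

```python
# Hypothetical, minimal illustration of an intra-platform data contract:
# the producing team publishes the schema it promises, and the consuming team
# checks the columns it depends on against that schema before changes are merged.
PUBLISHED_ORDERS_SCHEMA = {
    "order_id": "int",
    "customer_id": "int",
    "amount": "float",
}

# Columns and types the downstream team declares it relies on.
CONSUMED_COLUMNS = {"order_id": "int", "amount": "float"}


def check_contract(published: dict, consumed: dict) -> None:
    """Fail if the producer no longer provides what the consumer relies on."""
    for column, dtype in consumed.items():
        assert column in published, f"Contract broken: missing column {column}"
        assert published[column] == dtype, f"Contract broken: {column} changed type"


if __name__ == "__main__":
    check_contract(PUBLISHED_ORDERS_SCHEMA, CONSUMED_COLUMNS)
    print("Contract satisfied")
```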
OK, so almost as quickly as data contracts became sort of the latest thing,
it seemed like the backlash against data contracts came along.
Do you have any opinion on that?
Do you have any opinion why the backlash to data contracts
seems to almost appear at the same time as data contracts?
Oh, I guess, can you give me an example of the backlash that you're talking about?
I think I saw a few flavors of this.
I think it's a practicality thing.
So obviously there's an element there of the people that made the opportunity for things to become so complex then becoming the ones that solve it through this, so there's maybe a slightly cynical element to it.
But I'm thinking more about how practical they are in reality.
There's also the element there of, I suppose, these things already existing in terms of database schemas and things anyway.
But also, if you're going to start to enforce these things,
it becomes quite impractical to then develop in that environment
and work with it.
Is that a fair kind of argument, or is that not really an issue? I mean, whether or not you call it
a data contract or not, I think
the current state of the world in 99.9%
of organizations is that if you
commit code and push it, you can't be confident that it's not
going to break anything.
And so whatever we're doing now is not working. And effectively, with a data contract, regardless of how it's structured, I think the problem it's trying to solve is, can I actually commit code to my own project and be confident that I'm not breaking anything? And I don't care if you
call it a data contract or not. But since it's not happening now, we need a new thing in order
to make that happen. So I think the backlash, I think, is kind of silly in that way.
But I mean, these things have been solved in other systems. Like, you know, in my days at Facebook, some engineer came up with a schematized logging format that made it so
that, you know, our application engineers stopped submitting log messages that broke the data
platform. And that was effectively a form of a data contract, right? So I think it's schema,
but also combined with some social engineering
to keep in mind that there's cross-team boundaries
and sometimes cross-repo boundaries,
so you can't keep schema definitions up to date all the time.
So I think that's...
So I'm a data contract supporter.
Yes, you're right. So I suppose just before data contracts became trendy, it was data products. You know, a year ago everybody was talking about data products, really. So again, I know it's not necessarily a solution to a problem as such, but what do you understand data products to be?
And how does that kind of evolve into this world as well?
We're going to talk about how that goes into software-defined assets in time.
But what's a data product in your mind?
Well, this is more of a – now, this is a subject where I'm more on the backlash side of things, I guess.
Insofar as I think that data products, effectively, is not a technology per se. It is a way of approaching the practice of data and analytics engineering with more product thinking. So instead of just, like, oh, I'm just spitting out a data warehouse table with the right stuff, and the dashboard people will figure out how to render it the right way, you're being much more thoughtful
about the, hey, I am providing this table or this machine learning model to my downstream
stakeholders. There's a lifecycle around it here. And pretty much putting a product management
process around data assets.
And I think that's appropriate, but I don't think there's like a real technology there.
Yeah.
Interesting.
So it's a thing you can throw into a bid with a customer for a contracting piece, a consulting piece of work. And it always makes you sound smart. Right. We'll treat this as a product, and so on. But okay. So let's get onto the topic now.
I mean, I'm sympathetic. I'm sympathetic to the terminology matters argument, right? Let me just give you an example of that.
I always thought it was kind of denigrating.
Everyone was like, oh, it's just all data cleaning, right?
Everyone referred to like data cleaning and data janitorial work.
And I always pushed back.
It's like, no, you have to think about it.
You are designing a data set and there are trade-offs and it should be documented. It should be a full product
process. It's not just cleaning. Actually, producing the data product is the work.
It's 90% of the work. So I actually think... I guess I'm going back on myself. I've come around to data products elevating the discipline, and I think the words do matter.
So Nick, let's get on to this topic then of software-defined assets. So give us a background to this really. Where do software-defined assets come into your thinking, and how do they evolve on from some of the things you've been talking about? And maybe just give us the kind of fundamental definition of what you're talking about, really. And we'll get into then how Dagster works in this area.
Yeah, so software-defined asset is a way to structure your data platform and the data
pipelines that constitute your data platform.
And what it is, is a definition in code of an asset that is supposed to exist.
And I guess, like, you know, originally I was calling these things solids way back in the early days of the project. And that came from, I was calling them software-structured data sets, SSDs, and then did a clever backwards acronym into solid state drive.
But I think software defined assets is a much better name for them.
But I guess it's really about a way of thinking in terms of moving the canonical definition
of a data asset from physical storage to a software definition.
And if you think about it, that's right. Because think about the exercise of like, what if you dropped a database table
from a data warehouse? Is the data asset gone? Well, no, because you can recompute it at any
point. And if that is true, then the canonical definition is actually in software. And the table is just the latest materialization of that asset.
And that's where all that language in the system comes from.
So the definition of an asset is really the software that produces that asset,
and then its upstream dependencies.
And then that kind of recurses all the way down in the asset graph.
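As a concrete illustration of that idea, a minimal software-defined asset and its upstream dependency might look something like the sketch below in Dagster's Python API (the asset names and logic here are hypothetical):

```python
from dagster import Definitions, asset, materialize


@asset
def raw_orders():
    # Hypothetical source data; in practice this might come from an ingestion tool.
    return [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 40.0}]


@asset
def orders_summary(raw_orders):
    # Naming raw_orders as a parameter declares it as an upstream dependency;
    # the asset graph is built from these co-located definitions.
    return {
        "order_count": len(raw_orders),
        "total_amount": sum(order["amount"] for order in raw_orders),
    }


# The canonical definition of each asset lives here in code; a warehouse table
# (or, in this toy example, an in-memory value) is just its latest materialization.
defs = Definitions(assets=[raw_orders, orders_summary])

if __name__ == "__main__":
    materialize([raw_orders, orders_summary])
```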
You talk about dbt a lot, and I assume your audience might be more familiar with dbt. We really think of dbt as sort of a specialized form of software-defined asset, but for the analytics engineer and for Jinja-templated SQL in the data warehouse.
That's why we're able to understand
dbt very natively, but we're a more generalized
version of that, or it doesn't have to be a dbt
model. It can be any computation.
The same way a dbt model can be materialized in the form of a view or a table or a materialized view, but the actual definition of what the item is, is in the model definition within dbt. A solid is a more generalized version of that.
Don't say solid. We don't call it that. Solid is an old term, we never speak of it again. We don't use that word anymore. A software-defined asset.
Okay, that's interesting, because I never understood what solids meant when I read the dbt, sorry, the Dagster documentation at the time, and I never understood what you meant by that. Well, that's why we changed the name. Yeah, so now that I've worked out what it is, I have to forget it. So, okay, what problems does this solve, then, in the context of what we've been talking about? Or why are software-defined assets something that you think is particularly topical at the moment, really?
So I think, first of all, it's just a much more natural way to program in these data platforms. And if you think about it, it makes the data
asset the subject of a software development lifecycle. So you can test it, you can deploy it,
you can apply all sorts of change management techniques to it, and so on and so forth.
So I think that's kind of like the generalized thing.
I think there are also very practical implications of that, in terms of it being a natural way to program.
So for example, if you're writing a data pipeline with software-defined assets, there
is no centralized DAG object that you have to construct manually. So if you're
using a more traditional platform like Airflow, typically you write your task, you write the code
that backs your assets somewhere. And then you have to find the centralized file where the DAG is being built and, you know, say node dot set upstream, and then the thing above it. And it ends up being this centralized dumping ground that no one owns
and no one understands and is difficult to orchestrate and deal with. Where in a software
defined asset context, the dependencies are co-located with the software defined asset itself,
which means that the system can effectively construct the DAG on your behalf
using a centralized coordinator.
I think that allows for a more distributed ownership model,
which I think is very compelling.
I think these huge centralized units of coordination that are manually curated are like software engineering disasters, typically.
Okay. So say I was a developer and I was developing, because I think one of the things
that I've picked up on about software-defined assets and how it works in Dagster is the
declarative sort of nature of it, really.
So let's say that I was a data warehouse developer
and I was developing maybe a subject area or something,
or maybe sort of like a part of the warehouse that was new,
that was focused around one sort of that subject.
How might the development process look different
with this declarative approach?
And what do you think that maybe means
in terms of how development would work, and then maybe the different benefits out of that? You know, talk us through how that process would work.
So if you're developing a new area, say a new data product somewhere, and you know that you depend on a couple of upstream data products from two other teams, right?
And you're doing some enrichment or something.
So prior to software-defined assets, you would have to know what DAG those live in
and then figure out how to hook it all together.
And do you want to be scheduled along with that DAG? Or do you have to
manually figure out when it's going to be updated and guess as to when I should then kick off, for example? And in this mode of thinking, all you need to know is the asset key. That's what we
call it. It's effectively the address of the asset in the system. You can declare your dependencies on it right there, co-located with your code. And then you don't have to think about
the unit of scheduling right up front. And it gets automatically inserted into this global
asset lineage graph. And that's a really powerful thing. Once you orient your system around the asset,
a lot of things fall out. So let's say you know that you need to depend on the orders asset.
If you didn't have a system like Dagster, you would have to just magically know
what DAG that thing lives in. Whereas in a system like Dagster, you just go to the tool, literally go to the search box, type orders, and it shows up.
And so you have what we call this operational asset catalog kind of baked into the core orchestration system.
So you no longer have to constantly do this remapping of task to asset, asset to task, back to DAG.
Everything's just like, you just look up the asset that you're looking for.
And that bleeds down all the way down to the programming model level.
So I think very concretely, you can just start writing code, you can depend on the upstream
data products that you depend on, and you're off to the races.
And you don't have to think about how you're slotting into everything operationally
from the first get-go.
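To sketch what that might look like in code, depending on another team's asset by its key could be roughly as follows (the key, names, and enrichment logic here are hypothetical):

```python
from dagster import AssetIn, AssetKey, asset


# Hypothetical new data product that depends on an "orders" asset owned by
# another team. All we need to know is its asset key; the dependency is
# declared here, co-located with the code, and the asset slots into the
# global asset lineage graph automatically.
@asset(ins={"orders": AssetIn(key=AssetKey(["warehouse", "orders"]))})
def enriched_orders(orders):
    # Hypothetical enrichment step over the upstream team's data.
    return [dict(row, channel="online") for row in orders]
```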
Okay.
So does this really, I suppose,
does this really mean that conceptually
a developer in this area would,
currently they're all about pipeline engineering.
So they're about, like you say,
it's knowing where in the dag things go and so on.
But maybe this is more about now kind of,
I suppose, data asset engineering
and understanding, like you say,
the catalog of assets.
And then kind of, you know, you're working with those.
You're working at a high level of abstraction, really.
Is that really a valid thing to say?
Yeah.
I mean, we really think of ourselves as a data engineering platform.
And in the previous world, the way you did data engineering is you built DAGs. But if you are using this system, you are using an abstraction which is much closer to the actual job which you're doing.
And your job is
you're keeping data assets
up to date.
So I don't think
we're going to be calling,
just like,
I don't think anyone
calls themselves
a data pipeline engineer now.
I don't think anyone
calls themselves
a data asset engineer.
I just think that this is a far more natural way to do the activity that is data engineering. And I think the other thing, we've been
really focusing on this data engineering as an activity component. I think it's worth digging
into because we see analytics engineers, data engineers, and ML engineers all using the system very successfully.
And I think the reason why is that data engineering is the bulk of all of those people's
jobs. Meaning building production data pipelines that keep data assets up to date.
Maybe that's 90% of a data engineer's job. Maybe it's 80% of an ML engineer's job.
Yeah, but there's this core activity that unites all those disciplines.
Okay, okay.
So is it, so, I mean, again, listeners might be familiar with other products,
like, say, Airflow or Prefect, you know.
How does Dagster and this approach differ from those products and those projects?
So Airflow is the dominant incumbent in the space, and they have a very traditional task-based approach.
And so you build a DAG, a directed acyclic graph of tasks, and you put those on a schedule, right?
And Airflow typically and historically has not really thought about the full development lifecycle. Their domain is to schedule and order computations in production.
And I think it's been a useful tool for doing that. Prefect is interesting. The project came
out of Airflow. So the founder of Prefect, for instance, built XCOMs in Airflow.
And then I think effectively started Prefect because he wanted the project to go one way
and it didn't. And what's interesting is that I think that all three projects started out
somewhat like if you kind of squinted and
looked at the code, they kind of all look somewhat similar, but they've actually diverged
quite a bit where we have really bet heavily on the software-defined assets direction
and really focusing on the data platform use case. So integrated lineage, observability, whatnot. Whereas Prefect has gone,
I would say, more generic and imperative. And generic in a good way, meaning generalized.
So Prefect is cool insofar as you can just write Python code and you don't have to construct the DAG ahead of time. And it's more of like a distributed state machine, almost like Temporal or Cadence, if you're familiar with that domain of products.
And I think that gets you flexibility, but it's also more difficult to handle operationally.
And there's a very basic consequence of it, where in Prefect, you cannot visualize the shape of the computation
before it executes, whereas in Airflow and Dagster, you can. So in Dagster, you can load up
your project and boom, the entire lineage graph of the assets that will be created is there.
Whereas in Prefect, by its definition, it has to be a blank page.
And only once you start executing can it actually... Because it can do loops, it has a more
flexible, dynamic execution engine, and so on and so forth. So I think they've gone less declarative
and more imperative, less data platform specific, and more generalized. Whereas we
have gone much more declarative, much more specialized to the data platform use case.
Okay. So I remember when I spoke to you before, you talked about, or you said that the most
important product in the modern data stack was the orchestrator, because everything revolves
around that. Maybe just kind of reiterate that a little bit, and maybe talk about how the software-defined asset sort of approach you're taking and the way that Dagster works fit in, and particularly why orchestration is so important to the product and why Dagster does orchestration really well.
So the orchestrator, or the orchestration layer, is a very unique part of the stack because it's like a choke point that every practitioner has to interact with, and every storage system and compute system ends up being invoked by the orchestrator.
All data has to come from somewhere and go somewhere.
And it's the orchestrator which orchestrates that process.
So to me, it has always been fairly clear that there needs to be a much more advanced control plane than has existed previously for data platforms.
And that orchestration is the central pillar of that.
And it's core to your programming model, meaning that if you're building data assets,
you are writing code which takes data and produces data.
It's like a very fundamental activity, and as a result, any data asset that's being put into production has to interact with the orchestrator.
And then, you know, it's kind of a point of focus as important as the data warehouse itself in terms of its centrality. If the data warehouse is the data plane, you know, an orchestrator, or a more expansive vision of the orchestrator, is the control plane for the data platform.
And then if you visualize, you know, I really visualize that control plane, so to speak, as the global asset lineage graph.
It's like the living, breathing global asset lineage graph that you can kick off computation from, that you can program against. And so naturally, the control plane should be oriented around, you know, having a canonical definition of your assets oriented in a graph, and that should be software-defined.
So that's really the way I think about it.
So we talked earlier on about the complexity
of projects and platforms,
and there's concepts like data mesh out there
that are about sort of, I suppose,
distributing the ownership
and the kind of the transformation of projects
amongst the people that are using it and the parts that are using it.
How can software-defined assets help with this really?
Is it part of the same problem it's solving or is it complementary or what really?
So I love talking about data mesh.
When we announced Dagster 1.0 and really talked about software-defined assets and were demoing it, someone left a comment on YouTube, and it was the most liked one.
And they said, thank you for expressing the concepts of the data mesh in a way that my coworkers could possibly understand.
And I think there's something there.
So I think the, you know,
let me talk about data mesh for a second.
Data mesh is, you know,
it's kind of the microservices approach of thinking
applied to the data domain.
Ironically, I actually don't like microservices, but I think the approach is much more appropriate in the data domain.
I think that the data mesh,
the practitioners of it or the advocates of it kind
of have a vocabulary problem in that they use very obtuse and weird vocabulary in my view.
You start talking about architectural quanta and polysemes, and there's all this terminology around it, where I think what it really is, is empowering stakeholder ownership in the data platform. And then making it so that,
to me, the most fundamental and essential and good idea in the data mesh is that the
assets should be the interface between teams. That a team's job in a data platform should be to expose a set of data products that then other
teams can latch onto and say that when this thing changes, I want to do a computation,
and then maximizing the amount of autonomy within that system. And I think that even though we don't
come from the data mesh community, I think that software-defined assets is the most practical way to actually execute a data mesh strategy at a company.
Because it's just the ideas slot in.
There is this global asset lineage graph.
We aren't imposing that on anyone.
It is a fundamental underlying reality that exists.
Like these data assets in reality depend on other teams' data assets, and that is encoded
in the tool.
But we empower those teams to deploy independently with their own Python, you know, in their
own Python environments.
They can deploy to a centralized control plane on their own schedule, right?
Independently, they can operate their own data assets independently, schedule them independently,
monitor them independently.
But the true interconnections between all those different teams
is expressed within a single tool. And it's a single platform that a centralized data platform
team can build unified tooling around, which is really exciting. So that's kind of the
relationship there. Literally, you can open up Dagster and in the product, see the mesh.
The mesh is the global asset lineage graph.
You can literally see it.
Right.
Brilliant.
So we touched a little bit earlier on about, I suppose, helping keep the cost of these sorts of things down.
And I suppose being more mindful about the way that assets are refreshed
and so on.
So what's going on with Dagster on that?
And what kind of problem is it trying to solve, really?
Yeah, so I think on the cost front, there's a couple things happening.
One is an experimental feature we've released already,
which I know Rittman Analytics is using, which we call declarative scheduling; auto-materialization is kind of the other name. I think we're moving towards declarative scheduling as the name. It rolls off the tongue a little more. But what it says is that instead of
having just like a single hourly cron job where all your assets are refreshed on a unified schedule,
you can instead annotate assets with their scheduling requirements and then allow the system to satisfy those requirements while kicking off as little computation as possible. That's the way it's approached. And it's very difficult to accomplish that objective with one centralized scheduling policy, because there's no way that one centralized artifact can understand all the different requirements, all the different stakeholders that could possibly be involved in that asset graph. So the idea here is to allow asset owners to annotate their assets with SLAs and then let the scheduler do all the work for them and do sophisticated, fine-grained things that satisfy those SLAs with the minimum amount of compute and consumption.
So what you just said there sounds more like a more fine-grained and thoughtful way of refreshing data, not necessarily about cost, really. Is that correct? That's right.
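As a rough sketch of what annotating an asset with its own scheduling requirements can look like, using the freshness and auto-materialization policies Dagster exposed around this time (the asset, numbers, and policy choices below are hypothetical, and these APIs were experimental):

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset


# Hypothetical asset annotated with its own requirement rather than being tied
# to a single global cron schedule: it should never be more than an hour stale,
# and the scheduler works out when to materialize it (and its upstreams) to
# satisfy that while kicking off as little computation as possible.
@asset(
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
)
def daily_revenue(orders_summary):
    # Hypothetical computation over an upstream asset.
    return orders_summary["total_amount"]
```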
So one of the advantages is cost, but not the only one.
And you're going to say the second thing.
Yeah.
Then we are actively looking at incorporating cost management more directly into the orchestrator.
So by having the asset be an abstraction in the orchestration system,
we can provide tools that are like, Hey, sort this asset graph by the amount of time it takes
to execute each, you know, each asset within it. And do we truly understand like what are the most
expensive assets in the system? And then you, as a human or an engineer, can be like, actually, why are we spending so much money
or consuming so many resources keeping this so up to date?
It's not even that useful, right?
And we really saw this from early days.
We had an early user who, for example, built a monitoring job
which actually detected which BI tools
had stopped querying certain tables
and then automatically submitted PRs
to delete unused DBT models, right?
And that type of,
once you have the entire asset lineage graph
and all this information around it,
that type of tooling,
it's pretty straightforward to build.
And the result is that we have all these integrations that invoke Databricks or Snowflake or dbt or anything else.
And we want to provide a centralized reporting mechanism where we can surface consumption metrics on a per-asset basis
so people can make prioritization decisions.
And we think that's potentially super compelling.
And then you can also get into the type of world where you can prevent, or warn, people if they're about to kick off a backfill that's going to be extremely expensive, and things of that nature. There are a lot of interesting things we can do by integrating cost management into the orchestration system.
Okay. Okay.
So let's kind of step back a little bit from the detail of these features and
think about, I suppose,
Dagster and the market and where you sort of see this, and who your kind of ideal user and so on is. So first of all, is Dagster just a better version of products like, say, dbt Cloud, for example? You know, would you compete with it? Would you be quite happy to hoover up all their customers and serve all their use cases? Or is Dagster a sort of more niche product serving a more niche kind of set of requirements? What do you see as the addressable market? And who's your ideal user?
So we think our addressable market is pretty universal.
We think that almost every company needs this.
I guess you want to talk about dbt cloud and we can get into that.
dbt cloud.
Well, no, just, yeah.
Well, dbt cloud is interesting to talk about.
It's an interesting product.
From one standpoint, it is a niche orchestrator
because it is an orchestrator that only does, you know, coarse-grained orchestration of dbt projects, right? Whereas the domain of orchestration is very large beyond that. And dbt Cloud is really three parallel products, I would say. One is, you know, the IDE. Two is what I'll call all the stakeholder-facing
components of it. So they have a semantic layer, they have dbt docs, they have features where you can embed the state of assets within BI tools.
That's like another pillar.
And then there's their scheduling system.
They have a pretty bare-bones orchestration system.
So we have lots of users who have moved off the orchestration piece of dbt cloud because their needs have gone beyond what dbt cloud can provide.
dbt cloud doesn't have sensors.
You can't orchestrate non-dbt things on it, and so on and so forth. We have no interest in building the web IDE component of dbt Cloud. And, you know, as we get farther away from the engineer, kind of broadly, we become less interested in those products.
I think that dbt kind of owns the analyst persona.
And, you know, that is not in our near future. But people who have grown beyond dbt Cloud's orchestration capabilities are often talking to us, and that is happening, and we're very happy about it. And then, you know, we can programmatically invoke dbt Cloud jobs as well; we have customers that do that if they want to leverage the other features that dbt Cloud has.
So back at the start you mentioned, and actually this was in the last episode we recorded, the idea of a data platform engineer. Okay, and I think we certainly just skipped over that a little bit at the start,
but how central is that to your thinking really in terms of your kind of
ideal, you know, customer persona or user persona?
And what does that really mean in your mind?
So a data platform engineer is, I'll call it a role.
Okay.
In some organizations, that role is a human.
In some organizations, that role is an entire team.
And in some, there are teams of one where part of their brain is the data platform engineer.
Meaning, data platform engineering is scaling data engineering within an org. So setting up the infrastructure, building the CI/CD pipeline,
setting up the workflow,
building shared infrastructure that spans different data pipelines, right?
And just given how multi-stakeholder
the systems are,
and the data platforms are just,
they're very particular to the needs of the business.
And you just always end up building a,
even like a little platform,
whenever you build a set of data pipelines,
it just always happens.
You know,
like if I build a data system and I'm the only engineer, my natural inclination is that I probably spend too much time being the data platform engineer and not enough time being the data engineer.
But I think that anyone, it kind of uses both parts of your brain.
And one part of your brain, you're thinking about, hey, how do I make it more efficient
to build any data pipeline in this context?
And then as a data engineer, you're building that pipeline.
And that's who's on your mind, really, as the kind of customer, the user persona for Dagster.
Is that correct?
So our user persona are data and ML engineers
who embrace software engineering best practices.
So that's kind of how we define our ideal customer persona.
So it's not like we're persuading them
that infrastructure as code is good.
They are an already persuaded human being on that front,
and they're looking for orchestration or a data platform control plane
that conspicuously embraces those values. The fact that we
really focus
on automated testing and
CICD processes and all
that is just
instrumental to our ideal customer
profile.
I think that's actually a broad set of people, a large set that is increasing in the world.
Okay.
And I suppose the last question for me, really, before I ask what's coming next, is the financial model for Elementl as well. So, you know, as you said, you sell Dagster Cloud and there's the open source Dagster project. How do you make money, and how do you, I suppose, plan to grow the business and keep the business alive? What's the plan for the financial side of what you're doing, really?
Our business model is Dagster Cloud. So you go to our website. You want to use Dagster, you do not want to self-host, and you want proprietary features that are more enterprise-oriented. You can sign up for Dagster Cloud and effortlessly, from your first PR, have a data platform at your disposal. And that is the business model.
So we have an enterprise tier for enterprise customers, and we can serve tons of customers all the way up to the Fortune 100. And then we also have a self-serve product as well, which is actually growing like
gangbusters, which is very exciting. It's one of those graphs you love to see as a startup founder.
And so yeah, you're a developer, you want to use Dagster, and you pay us to host it for you and provide awesome features.
Okay, and how does the open source community play into this?
How does that contribute into what you're doing with Dagster?
Yeah, the way we really think about it, Dagster is an ambitious project, and we want to change the way that people think about building data pipelines. And so, you know, I guess one of the reasons to do open source is that open source is a vibe,
for lack of a better term. You know, I like the Slack communities. It's fun to get
people to participate in the process and meet lots of like-minded, interesting people that way.
But the other aspect, from a more kind of brass-tacks business perspective, is that the objective here is for Dagster to become a ubiquitous standard.
It's just the way that you build data pipelines, and then if you want to run those in production, Dagster Cloud is the easiest and most powerful way of doing that.
And it is in our interest for this to be a ubiquitous standard, which is why we want to
kind of spread it as far and wide as possible and maximize adoption. In some ways, you can
actually think of our commercial product as through the lens of adoption maximization,
because there are tons of people who say would want to use Dagster, but who didn't want to self
host. And therefore they were excluded from using the technology. Or there's people at big companies who
wanted to use Dagster, but they didn't have these complex enterprise features that require a team to maintain 24-7.
And so in a lot of ways, you can just view it as, you know, we have business objectives,
obviously, but you can also just view everything through the lens of adoption maximization.
What's it like selling into the enterprise out of interest?
I mean, anecdotally, I hear that it's quite lucrative, but it's also, you know, you really have to dance to their tune and the product has to be...
I mean, so we have a...
We rely on passionate champions in the org to get us in the door and then facilitate the process with economic buyers and stakeholders.
So typically the way it works is that a highly technical tech lead or say head of data has brought us in, meaning the technology,
evaluated it, maybe kick-tested it with some of their own internal systems
just to see how it would work and get familiar with the programming model.
And then you start out a commercial conversation.
And then that's about managing the internal stakeholders,
improving the economic value.
So we're also blessed with a very talented sales team, a sales team where all the engineers that they work with like them,
which is kind of a miracle.
We kind of have a bunch of unicorns on that front.
So, you know, enterprise sales can be a pain in the butt, you know,
and especially the bigger companies that are more established
have their own peculiarities and procurement processes.
But, you know, overall, it's been... We haven't had to, nor would we,
build custom features that are specific to a customer.
We've been able to keep the platform generalized,
which I actually think is good for every stakeholder.
Fantastic. Fantastic.
I'll cut that bit out, actually,
because that's more of my interest,
rather than anything else,
because it's always interesting to hear about enterprise sales.
But I mean, it's...
So last thing then, Nick, just to finish off then,
how do people find out more about Dagster
and the concept of software-defined assets?
Well, yeah, the easiest thing is just to go to Dagster.io
and click on Docs and, you know,
start playing with the open source software.
And then anyone in the world can install on their laptop,
pip install Dagster, and you're off to the races.
So if you want to engage with the community,
the center of gravity is our Slack community.
So that's also one click away on our website.
So that is the way to engage.
Fantastic. Fantastic.
Great.
Well, I know certainly within our consulting team that there's a huge uptake of Dagster
and there's blogs on our website about it as well.
And everyone enthuses about it.
So it's a product that we really love using.
So Nick, thanks very much for coming on the show.
Appreciate you answering all my questions there.
And yeah, good luck with the product going forward and take care.
All right. Thanks so much. Thanks for having me.
Thanks for being such great users. Thank you.