Drill to Detail - Drill to Detail Ep.109 'Dagster, Orchestration and Software-Defined Assets' with Special Guest Nick Schrock
Episode Date: August 3, 2023
Mark Rittman is joined in this episode by returning guest and Elementl founder Nick Schrock to talk about Dagster's role in the modern data stack ecosystem and software-defined assets, a new, declarative approach to managing data and orchestrating its maintenance.
Introducing Software-Defined Assets
Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster
Optimizing Data Materialization Using Dagster's Policies
How I use Dagster to orchestrate the production of social science data assets
Transcript
Yeah, so software-defined asset is a way to structure your data platform and the data pipelines that constitute your data platform.
And what it is, is a definition in code of an asset that is supposed to exist.
It's really about a way of thinking in terms of moving the canonical definition of a data asset from physical storage to a software definition.
Welcome to another episode of Drill to Detail, and I'm your host, Mark Rittman.
So I'm very pleased in this episode to be joined by Nick Schrock, founder of Elementl, and returning guest to the show.
So Nick, it's great to have you back with us. And for anyone who doesn't know you, tell us a bit about yourself and what you do.
Yeah, so thanks, Mark. Thanks for having me. It's a pleasure being back. So yeah, as you mentioned, I'm Nick Schrock. I'm the founder and CTO of Elementl, the company behind Dagster.
Prior to that, I spent a bunch of time at Facebook and was best known there for being
one of the co-creators of GraphQL. So Nick, you came on the show before,
a couple of years ago, talking about Dagster. And in this episode, I want to talk about a particular thing that you're focusing on now with
Dagster, which is software-defined assets. But before we do that, can you give us a bit of a,
I suppose, an elevator pitch, really, for what Dagster is? And that will set the context,
really, for the rest of this conversation. Yeah. So Dagster is a Python framework for building data pipelines. And the way that we conceptualize it is that we orient the abstraction for building data pipelines around the final output of those things, which are data assets. The purpose
of a data pipeline is to keep data assets up to date. That's the way that we define the problem.
And we're a Python framework that enables the building of those things. It really conceives
of building data pipelines across the entire software development process. So making it
extremely fast to develop, having a full test lifecycle, really thinking about how it
goes from dev to test to staging to production and so
on and so forth. And then by orienting around assets, not only can we do what traditional schedulers do, which is kind of just schedule things in production, but we also give you a base
level of data lineage, data observability, and other features that are very much oriented around
the data assets you produce. Okay, fantastic. So we'll get into a lot more of the detail of
things you've been talking about there as we go on. But just before we start, so you're the founder
of Elementl. So what's the relationship between Dagster and Elementl? Well, Elementl is the company
behind Dagster. And so this is kind of the corporate host for Dagster.
But really, the complete focus of the company is Dagster,
and the commercial product is Dagster Cloud.
Okay, fantastic. That's good.
Okay, so when I first started using Dagster,
it was almost like an alternative to tools like, say,
dbt Cloud or other DAG-type sort of orchestrators,
where we wanted to have an alternative to those tools and
potentially be able to orchestrate things other than, say, dbt jobs. But something that I've certainly encountered in projects over the last few years, or certainly the last year really, is the increasing complexity of those projects. So I suppose the number of models in a dbt project or a DAG has kind of increased the complexity of working on those, especially as a team, and of extending and governing those. You know, it's got to the point now where we start to see vendors like, say, dbt Labs talking about things like data contracts, and we'll get on to those in a moment. But I suppose projects and requirements and the amount of things you need to orchestrate have probably increased in number and complexity a lot over the last year or so.
Talk to us about that, really, and what you're seeing in the market.
So I think part of what we're seeing is a bit of everything that is new is just things that have already happened before, insofar as people who are adopting the modern data stack start out with just an ingestion tool, maybe dbt, maybe a reverse ETL tool, and that's the start of their journey. And now they're rediscovering all the problems
that large tech companies had to solve internally
or kind of more vertically integrated solutions
had solutions to many years ago.
Like Mark, you used to be an Oracle practitioner.
And I think that you're probably encountering gaps in the tooling
that would have been filled by a more all-encompassing solution like Oracle.
I've never been a deep user of it,
but just from the way that you've described it,
I feel like I can imagine the conversation of someone who's like very modern data stack native coming up to you and having some brilliant idea.
And then you being like, wow, that's interesting because I had that feature in Oracle, you know, 15 years ago.
So I think there's this kind of, you know, the modern data stack opened up the notion of building a data platform to a much broader set of people.
And so that happened.
And then also a broader set of companies in the world felt the need to build a data platform because they were a cloud from day one.
They could adopt infrastructure incrementally.
They didn't have to write a million-dollar contract to Teradata just to get a data warehouse.
They could pay as you go.
And then also the demands of the external world have made it so that people are demanding more data-driven applications.
It's just becoming more of a norm.
So data is now critical for companies all the way from inception to IPO, right? So all this lead-up is to say that I think what's happening
here is that people who are building platforms, starting with the modern data stack, are
re-encountering the same problems that many, many data teams throughout history have encountered
before. And the world is getting more demanding, right? Increasingly, the ability to manage data is not just like,
oh, we can accurately report finances.
It's critical to the operations of the business
and considered a competitive advantage.
So if you look at someone like Instacart or something, right?
Their data platform is critical to the way the product functions
because it impacts the recommendation engine.
They have to integrate data sources from a bunch of different grocery stores.
You know, it's like critical to their functioning.
So I think that more people are doing things.
They're doing things with new tools that aren't as mature, actually,
in their life cycle.
And then the external world is also more demanding
and demanding more complicated data platforms.
Every time you digitize a process,
it ends up producing data
that needs to be incorporated into your entire tool chain.
So we have more SaaS services than employees.
So do most smaller companies.
It's been interesting reading some of the kind of data Twitter and data LinkedIn.
People are actually very conscious of the cost of refreshing and keeping these things running as well.
Have you noticed that as well, I suppose, people's awareness of the cost of refreshing these and keeping these things running?
Yeah, no, we're seeing huge demand from the marketplace on
this front. This is on our roadmap, actually. Kind of jumping ahead, probably something I'll
talk about later. We really think the orchestrator is a natural place to want to control and
comprehend costs because it's the thing that kicks off compute. It's the thing that kicks
off the thing that causes the consumption of these services. And right now, the tools that people have to control those costs are extremely coarse-grained and blunt.
It's like, oh, I'm going to refresh this every week instead of every day.
Well, that's not really possible.
And you have to think about what the different SLIs are for all the different data assets in your platform. So another aspect of, I suppose, the increased complexity of doing this is how you can work together as a team.
Or when you've got, say, a bigger organization that has got maybe many distributed owners of data or distributed stakeholders,
actually the sheer complexity of trying to develop on a, say, monolithic platform,
monolithic repo, gets complicated as well. So have you found, again, that actually the costs
and the overheads and the complexity of governing and developing these sort of systems can get a
bit sort of out of hand as well? I mean, for sure. And I think what we're seeing with increasing frequency, even at what you'd consider the smaller side of organizations, 100 people or more, say, is the development of the data platform engineer and data platform engineering as a discipline.
So we just see that all over the place.
And that's actually Dagster's natural constituency, I'd say, the data platform engineer.
Because we want to combine having practitioners who are co-located with the business for efficiency purposes with the reality that those different business units have data assets that depend on each other, right? Like the machine learning team is consuming data from other teams. So you treat the data practice as a platform where there's shared infrastructure, but then you can independently sort of develop applications on top of that. And applications in this sense are like sets of data assets.
So there have been a few initiatives around adding sort of structure and governance and so on to platforms like this to deal with the complexity and so on. And one of them is the concept of data contracts. Okay. And we'll get onto how that leads into what we can talk about with Dagster in a bit, but maybe just explain what is a data contract
really in your understanding and kind of what problem does it try and solve, really, in this kind of space?
So I think there are two operative definitions of data contracts in the world right now.
And I think they're actually quite different and worth talking about.
One is a data contract between operational systems.
For instance, your main application and, say, the Postgres tables that back it, and the contract between those things and the data warehousing system.
Right. And that involves a very specific set of technologies and practices where effectively the problem you want to prevent is an application engineer changing the schema of a database table and then having that break the entire data platform.
And that's one level of data contracting.
And that is kind of a different set of tools and techniques.
And another form of it is the way that data teams
within the data platform kind of interoperate.
And they sound like the same problem,
but they end up having quite different technical solutions
to the point where they almost have
completely different solutions to them.
And so a data contract on that side of things, between data teams, is like, hey, and there's a couple of ways to structure it, but it's like, I'm producing this table in a data warehouse. It has such and such columns.
And then a downstream team somehow encodes that I depend on this table with such and such of these columns.
And if either side breaks the assumption there, you prevent the breakage before it gets committed.
So I think, you know, those are effectively the two operative definitions of data contracts.
It's an agreement between two different stakeholders in a data system about qualitative or quantitative dimensions of the data on which they operate.
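To give a rough, illustrative sense of that second flavour, an intra-platform contract can be as simple as an agreed schema that both sides check before changes land. The sketch below is purely hypothetical and not any particular vendor's API:

```python
# Hypothetical, minimal illustration of an intra-platform data contract:
# the producing team publishes the schema it promises, and the consuming team
# checks the columns it depends on against that schema before changes are merged.
PUBLISHED_ORDERS_SCHEMA = {
    "order_id": "int",
    "customer_id": "int",
    "amount": "float",
}

# Columns and types the downstream team declares it relies on.
CONSUMED_COLUMNS = {"order_id": "int", "amount": "float"}


def check_contract(published: dict, consumed: dict) -> None:
    """Fail if the producer no longer provides what the consumer relies on."""
    for column, dtype in consumed.items():
        assert column in published, f"Contract broken: missing column {column}"
        assert published[column] == dtype, f"Contract broken: {column} changed type"


if __name__ == "__main__":
    check_contract(PUBLISHED_ORDERS_SCHEMA, CONSUMED_COLUMNS)
    print("Contract satisfied")
```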
OK, so almost as quickly as data contracts became sort of the latest thing,
it seemed like the backlash against data contracts came along.
Do you have any opinion on that?
Do you have any opinion why the backlash to data contracts
seems to almost appear at the same time as data contracts?
Oh, I guess, can you give me an example of the backlash that you're talking about?
I think I saw a few flavors of this.
I think it's a practicality thing.
So obviously there's an element there of the people that made the opportunity for things to become so complex then becoming the ones that solve it through this, so there's maybe a slightly cynical element to it.
But I'm thinking more about how practical they are in reality.
There's also the element there of, I suppose, these things already existing in terms of database schemas and things anyway.
But also, if you're going to start to enforce these things,
it becomes quite impractical to then develop in that environment
and work with it.
Is that a fair kind of argument, or is that not really an issue? I mean, whether or not you call it
a data contract or not, I think
the current state of the world in 99.9%
of organizations is that if you
commit code and push it, you can't be confident that it's not
going to break anything.
And so whatever we're doing now is not working. And effectively, with a data contract, regardless of how it's structured, I think the problem it's trying to solve is, can I actually commit code to my own project and be confident that I'm not breaking anything? And I don't care if you
call it a data contract or not. But since it's not happening now, we need a new thing in order
to make that happen. So I think the backlash, I think, is kind of silly in that way.
But I mean, these things have been solved in other systems. Like, you know, in my days at Facebook, some engineer came up with a schematized logging format that made it so
that, you know, our application engineers stopped submitting log messages that broke the data
platform. And that was effectively a form of a data contract, right? So I think it's schema,
but also combined with some social engineering
to keep in mind that there's cross-team boundaries
and sometimes cross-repo boundaries,
so you can't keep schema definitions up to date all the time.
So I think that's...
So I'm a data contract supporter.
Yes, you're right. So I suppose just before data contracts became trendy, it was data products. You know, a year ago everybody was talking about data products, really. So again, I know it's not necessarily a solution to a problem as such, but what do you understand data products to be?
And how does that kind of evolve into this world as well?
We're going to talk about how that goes into software-defined assets in time.
But what's a data product in your mind?
Well, this is more of a – now, this is a subject where I'm more on the backlash side of things, I guess.
Insofar as I think that data products, effectively, is not a technology per se. It is a way of approaching the practice of data and analytics engineering with more product thinking. So instead of just, like, oh, I'm just spitting out a data warehouse table with the right stuff, and the dashboard people will figure out how to render it the right way, you're being much more thoughtful
about the, hey, I am providing this table or this machine learning model to my downstream
stakeholders. There's a lifecycle around it here. And pretty much putting a product management
process around data assets.
And I think that's appropriate, but I don't think there's like a real technology there.
Yeah.
Interesting.
So it's a thing you can throw into a bid with a customer for a contracting piece, a consulting piece of work. And it always makes you sound smart. Right. We'll treat this as a product, and so on. But okay. So let's get onto the topic now.
I mean, I'm sympathetic. I'm sympathetic to the terminology matters argument, right? Let me just give you an example of that.
I always thought it was kind of denigrating.
Everyone was like, oh, it's just all data cleaning, right?
Everyone referred to like data cleaning and data janitorial work.
And I always pushed back.
It's like, no, you have to think about it.
You are designing a data set and there are trade-offs and it should be documented. It should be a full product
process. It's not just cleaning. Actually, producing the data product is the work.
It's 90% of the work. So I actually think... I guess I'm going back on myself. I've come around to data products elevating the discipline, and I think the words do matter.
So Nick, let's get on to this topic then of software-defined assets. So give us a background to this really. Where do software-defined assets come into your thinking, and how do they evolve on from some of the things you've been talking about? And maybe just give us the kind of fundamental definition of what you're talking about, really. And we'll get into then how Dagster works in this area.
Yeah, so software-defined asset is a way to structure your data platform and the data
pipelines that constitute your data platform.
And what it is, is a definition in code of an asset that is supposed to exist.
And I guess, like, you know, originally I was calling these things solids way back in the early days of the project. And that came from, I was calling them software-structured data sets, SSDs, and then did a clever backwards acronym into solid state drive.
But I think software defined assets is a much better name for them.
But I guess it's really about a way of thinking in terms of moving the canonical definition
of a data asset from physical storage to a software definition.
And if you think about it, that's right. Because think about the exercise of like, what if you dropped a database table
from a data warehouse? Is the data asset gone? Well, no, because you can recompute it at any
point. And if that is true, then the canonical definition is actually in software. And the table is just the latest materialization of that asset.
And that's where all that language in the system comes from.
So the definition of an asset is really the software that produces that asset,
and then its upstream dependencies.
And then that kind of recurses all the way down in the asset graph.
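As a concrete illustration of that idea, a minimal software-defined asset and its upstream dependency might look something like the sketch below in Dagster's Python API (the asset names and logic here are hypothetical):

```python
from dagster import Definitions, asset, materialize


@asset
def raw_orders():
    # Hypothetical source data; in practice this might come from an ingestion tool.
    return [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": 40.0}]


@asset
def orders_summary(raw_orders):
    # Naming raw_orders as a parameter declares it as an upstream dependency;
    # the asset graph is built from these co-located definitions.
    return {
        "order_count": len(raw_orders),
        "total_amount": sum(order["amount"] for order in raw_orders),
    }


# The canonical definition of each asset lives here in code; a warehouse table
# (or, in this toy example, an in-memory value) is just its latest materialization.
defs = Definitions(assets=[raw_orders, orders_summary])

if __name__ == "__main__":
    materialize([raw_orders, orders_summary])
```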
You talk about dbt a lot, and I assume your audience might be more familiar with dbt. We really think of dbt as sort of a specialized form of software-defined asset, but for the analytics engineer and for Jinja-templated SQL in the data warehouse.
That's why we're able to understand
dbt very natively, but we're a more generalized
version of that, or it doesn't have to be a dbt
model. It can be any computation.
The same way a dbt model can be materialized in the form of a view or a table or a materialized view, but the actual definition of what the item is, is in the model definition within dbt. A solid is a more generalized version of that.
Don't say solid. We don't call it that. Solid is an old term, we never speak of it again. We don't use that word anymore. A software-defined asset.
Okay, that's interesting, because I never understood what solids meant when I read the dbt, sorry, the Dagster documentation at the time, and I never understood what you meant by that. Well, that's why we changed the name. Yeah, so now that I've worked out what it is, I have to forget it. So, okay, what problems does this solve, then, in the context of what we've been talking about? Or why are software-defined assets something that you think is particularly topical at the moment, really?
So I think, first of all, it's just a much more natural way to program in these data platforms. And if you think about it, it makes the data
asset the subject of a software development lifecycle. So you can test it, you can deploy it,
you can apply all sorts of change management techniques to it, and so on and so forth.
So I think that's kind of like the generalized thing.
I think there are also very practical implications of that, in terms of it being a natural way to program.
So for example, if you're writing a data pipeline with software-defined assets, there
is no centralized DAG object that you have to construct manually. So if you're
using a more traditional platform like Airflow, typically you write your task, you write the code
that backs your assets somewhere. And then you have to find the centralized file where the DAG is being built and, you know, say node dot set upstream, and then the thing above it. And it ends up being this centralized dumping ground that no one owns
and no one understands and is difficult to orchestrate and deal with. Where in a software
defined asset context, the dependencies are co-located with the software defined asset itself,
which means that the system can effectively construct the DAG on your behalf
using a centralized coordinator.
I think that allows for a more distributed ownership model,
which I think is very compelling.
I think these huge centralized units of coordination that are manually curated are like software engineering disasters, typically.
Okay. So say I was a developer and I was developing, because I think one of the things
that I've picked up on about software-defined assets and how it works in Dagster is the
declarative sort of nature of it, really.
So let's say that I was a data warehouse developer
and I was developing maybe a subject area or something,
or maybe sort of like a part of the warehouse that was new,
that was focused around one sort of that subject.
How might the development process look different
with this declarative approach?
And what do you think that maybe means
in terms of how development would work, and then maybe the different benefits out of that? You know, talk us through how that process would work.
So if you're developing a new area, say a new data product somewhere, and you know that you depend on a couple of upstream data products from two other teams, right?
And you're doing some enrichment or something.
So prior to software-defined assets, you would have to know what DAG those live in
and then figure out how to hook it all together.
And do you want to be scheduled along with that DAG? Or do you have to
manually figure out when it's going to be updated and guess as to when I should then kick off, for example? And in this mode of thinking, all you need to know is the asset key. That's what we
call it. It's effectively the address of the asset in the system. You can declare your dependencies on it right there, co-located with your code. And then you don't have to think about
the unit of scheduling right up front. And it gets automatically inserted into this global
asset lineage graph. And that's a really powerful thing. Once you orient your system around the asset,
a lot of things fall out. So let's say you know that you need to depend on the orders asset.
If you didn't have a system like Dagster, you would have to just magically know
what DAG that thing lives in. Whereas in a system like Dagster, you just go to the tool, literally go to the search box, type orders, and it shows up.
And so you have what we call this operational asset catalog kind of baked into the core orchestration system.
So you no longer have to constantly do this remapping of task to asset, asset to task, back to DAG.
Everything's just like, you just look up the asset that you're looking for.
And that bleeds down all the way down to the programming model level.
So I think very concretely, you can just start writing code, you can depend on the upstream
data products that you depend on, and you're off to the races.
And you don't have to think about how you're slotting into everything operationally
from the first get-go.
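To sketch what that might look like in code, depending on another team's asset by its key could be roughly as follows (the key, names, and enrichment logic here are hypothetical):

```python
from dagster import AssetIn, AssetKey, asset


# Hypothetical new data product that depends on an "orders" asset owned by
# another team. All we need to know is its asset key; the dependency is
# declared here, co-located with the code, and the asset slots into the
# global asset lineage graph automatically.
@asset(ins={"orders": AssetIn(key=AssetKey(["warehouse", "orders"]))})
def enriched_orders(orders):
    # Hypothetical enrichment step over the upstream team's data.
    return [dict(row, channel="online") for row in orders]
```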
Okay.
So does this really, I suppose,
does this really mean that conceptually
a developer in this area would,
currently they're all about pipeline engineering.
So they're about, like you say,
it's knowing where in the dag things go and so on.
But maybe this is more about now kind of,
I suppose, data asset engineering
and understanding, like you say,
the catalog of assets.
And then kind of, you know, you're working with those.
You're working at a high level of abstraction, really.
Is that really a valid thing to say?
Yeah.
I mean, we really think of ourselves as a data engineering platform.
And in the previous world, the way you did data engineering is you built DAGs. But if you are using this system, you are using an abstraction which is much closer to the actual job which you're doing.
And your job is
you're keeping data assets
up to date.
So I don't think
we're going to be calling,
just like,
I don't think anyone
calls themselves
a data pipeline engineer now.
I don't think anyone
calls themselves
a data asset engineer.
I just think that this is a far more natural way to do the activity that is data engineering. And I think the other thing, we've been
really focusing on this data engineering as an activity component. I think it's worth digging
into because we see analytics engineers, data engineers, and ML engineers all using the system very successfully.
And I think the reason why is that data engineering is the bulk of all of those people's
jobs. Meaning building production data pipelines that keep data assets up to date.
Maybe that's 90% of a data engineer's job. Maybe it's 80% of an ML engineer's job.
Yeah, but there's this core activity that unites all those disciplines.
Okay, okay.
So is it, so, I mean, again, listeners might be familiar with other products,
like, say, Airflow or Prefect, you know.
How does Dagster and this approach differ from those products and those projects?
So Airflow is the dominant incumbent in the space, and they have a very traditional task-based approach.
And so you build a DAG, a directed acyclic graph of tasks, and you put those on a schedule, right?
And Airflow typically and historically has not really thought about the full development lifecycle. Their domain is to schedule and order computations in production.
And I think it's been a useful tool for doing that. Prefect is interesting. The project came
out of Airflow. So the founder of Prefect, for instance, built XCOMs in Airflow.
And then I think effectively started Prefect because he wanted the project to go one way
and it didn't. And what's interesting is that I think that all three projects started out
somewhat like if you kind of squinted and
looked at the code, they kind of all look somewhat similar, but they've actually diverged
quite a bit where we have really bet heavily on the software-defined assets direction
and really focusing on the data platform use case. So integrated lineage, observability, whatnot. Whereas Prefect has gone,
I would say, more generic and imperative. And generic in a good way, meaning generalized.
So Prefect is cool insofar as you can just write Python code and you don't have to construct the DAG ahead of time. And it's more of like a distributed state machine, almost like Temporal or Cadence, if you're familiar with that domain of products.
And I think that gets you flexibility, but it's also more difficult to handle operationally.
And there's a very basic consequence of it, where in Prefect, you cannot visualize the shape of the computation
before it executes, whereas in Airflow and Dagster, you can. So in Dagster, you can load up
your project and boom, the entire lineage graph of the assets that will be created is there.
Whereas in Prefect, by its definition, it has to be a blank page.
And only once you start executing can it actually... Because it can do loops, it has a more
flexible, dynamic execution engine, and so on and so forth. So I think they've gone less declarative
and more imperative, less data platform specific, and more generalized. Whereas we
have gone much more declarative, much more specialized to the data platform use case.
Okay. So I remember when I spoke to you before, you talked about, or you said that the most
important product in the modern data stack was the orchestrator, because everything revolves
around that. Maybe just kind of reiterate that a little bit, and maybe talk about how the software-defined asset sort of approach you're taking and the way that Dagster works fit in, and particularly why orchestration is so important to the product and why Dagster does orchestration really well.
So the orchestrator, or the orchestration layer, is a very unique part of the stack because it's like a choke point that every practitioner has to interact with, and every storage system and compute system ends up being invoked by the orchestrator.
All data has to come from somewhere and go somewhere.
And it's the orchestrator which orchestrates that process.
So to me, it has always been fairly clear that there needs to be a much more advanced control plane than has existed previously for data platforms.
And that orchestration is the central pillar of that.
And it's core to your programming model, meaning that if you're building data assets,
you are writing code which takes data and produces data.
It's like a very fundamental activity, and as a result, any data asset that's being put into production has to interact with the orchestrator.
And then, you know, it's kind of a point of focus as important as the data warehouse itself in terms of its centrality. If the data warehouse is the data plane, you know, an orchestrator, or a more expansive vision of the orchestrator, is the control plane for the data platform.
And then if you visualize, you know, I really visualize that control plane, so to speak, as the global asset lineage graph.
It's like the living, breathing global asset lineage graph that you can kick off computation from, that you can program against. And so naturally, the control plane should be oriented around, you know, having a canonical definition of your assets oriented in a graph, and that should be software-defined.
So that's really the way I think about it.
So we talked earlier on about the complexity
of projects and platforms,
and there's concepts like data mesh out there
that are about sort of, I suppose,
distributing the ownership
and the kind of the transformation of projects
amongst the people that are using it and the parts that are using it.
How can software-defined assets help with this really?
Is it part of the same problem it's solving or is it complementary or what really?
So I love talking about data mesh.
When we announced Dagster 1.0 and really talked about software-defined assets and were demoing it, someone left a comment on YouTube, and it was the most liked one.
And they said, thank you for expressing the concepts of the data mesh in a way that my coworkers could possibly understand.
And I think there's something there.
So I think the, you know,
let me talk about data mesh for a second.
Data mesh is, you know,
it's kind of the microservices approach of thinking
applied to the data domain.
Ironically, I actually don't like microservices, but I think the approach is much more appropriate in the data domain.
I think that the data mesh,
the practitioners of it or the advocates of it kind
of have a vocabulary problem in that they use very obtuse and weird vocabulary in my view.
You start talking about architectural quanta and polysemes, and there's all this terminology around it, where I think what it really is, is empowering stakeholder ownership in the data platform. And then making it so that,
to me, the most fundamental and essential and good idea in the data mesh is that the
assets should be the interface between teams. That a team's job in a data platform should be to expose a set of data products that then other
teams can latch onto and say that when this thing changes, I want to do a computation,
and then maximizing the amount of autonomy within that system. And I think that even though we don't
come from the data mesh community, I think that software-defined assets is the most practical way to actually execute a data mesh strategy at a company.
Because it's just the ideas slot in.
There is this global asset lineage graph.
We aren't imposing that on anyone.
It is a fundamental underlying reality that exists.
Like these data assets in reality depend on other teams' data assets, and that is encoded
in the tool.
But we empower those teams to deploy independently with their own Python, you know, in their
own Python environments.
They can deploy to a centralized control plane on their own schedule, right?
Independently, they can operate their own data assets independently, schedule them independently,
monitor them independently.
But the true interconnections between all those different teams
is expressed within a single tool. And it's a single platform that a centralized data platform
team can build unified tooling around, which is really exciting. So that's kind of the
relationship there. Literally, you can open up Dagster and in the product, see the mesh.
The mesh is the global asset lineage graph.
You can literally see it.
Right.
Brilliant.
So we touched a little bit earlier on about, I suppose, helping keep the cost of these sorts of things down.
And I suppose being more mindful about the way that assets are refreshed
and so on.
So what's going on with Dagster on that?
And what kind of problem is it trying to solve, really?
Yeah, so I think on the cost front, there's a couple things happening.
One is an experimental feature we've released already,
which I know Rittman Analytics is using, which we call declarative scheduling; auto-materialization is kind of the other name. I think we're moving towards declarative scheduling as the name. It rolls off the tongue a little more. But what it says is that instead of
having just like a single hourly cron job where all your assets are refreshed on a unified schedule,
you can instead annotate assets with their scheduling requirements and then allow the system to satisfy those requirements while kicking off as little computation as possible. That's the way it's approached. And it's very difficult to accomplish that objective with one centralized scheduling policy, because there's no way that one centralized artifact can understand all the different requirements, all the different stakeholders that could possibly be involved in that asset graph. So the idea here is to allow asset owners to annotate their assets with SLAs and then let the scheduler do all the work for them and do sophisticated, fine-grained things that satisfy those SLAs with the minimum amount of compute and consumption.
So what you just said there sounds more like a more fine-grained and thoughtful way of refreshing data, not necessarily about cost, really. Is that correct? That's right.
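As a rough sketch of what annotating an asset with its own scheduling requirements can look like, using the freshness and auto-materialization policies Dagster exposed around this time (the asset, numbers, and policy choices below are hypothetical, and these APIs were experimental):

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset


# Hypothetical asset annotated with its own requirement rather than being tied
# to a single global cron schedule: it should never be more than an hour stale,
# and the scheduler works out when to materialize it (and its upstreams) to
# satisfy that while kicking off as little computation as possible.
@asset(
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
)
def daily_revenue(orders_summary):
    # Hypothetical computation over an upstream asset.
    return orders_summary["total_amount"]
```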
So one of the advantages is cost, but not the only one.
And you're going to say the second thing.
Yeah.
Then we are actively looking at incorporating cost management more directly into the orchestrator.
So by having the asset be an abstraction in the orchestration system,
we can provide tools that are like, Hey, sort this asset graph by the amount of time it takes
to execute each, you know, each asset within it. And do we truly understand like what are the most
expensive assets in the system? And then you, as a human or an engineer, can be like, actually, why are we spending so much money
or consuming so many resources keeping this so up to date?
It's not even that useful, right?
And we really saw this from early days.
We had an early user who, for example, built a monitoring job
which actually detected which BI tools
had stopped querying certain tables
and then automatically submitted PRs
to delete unused DBT models, right?
And that type of,
once you have the entire asset lineage graph
and all this information around it,
that type of tooling,
it's pretty straightforward to build.
And the result is that we have all these integrations that invoke Databricks or Snowflake or dbt or anything else.
And we want to provide a centralized reporting mechanism where we can surface consumption metrics on a per-asset basis
so people can make prioritization decisions.
And we think that's potentially super compelling.
And then you can also get into the type of world where you can prevent, or warn, people if they're about to kick off a backfill that's going to be extremely expensive, and things of that nature. There are a lot of interesting things we can do by integrating cost management into the orchestration system.
Okay. Okay.
So let's kind of step back a little bit from the detail of these features and
think about, I suppose,
Dagster and the market and where you sort of see this, and who your kind of ideal user and so on is. So first of all, is Dagster just a better version of products like, say, dbt Cloud, for example? You know, would you compete with it? Would you be quite happy to hoover up all their customers and serve all their use cases? Or is Dagster a sort of more niche product serving a more niche kind of set of requirements? What do you see as the addressable market? And who's your ideal user?
So we think our addressable market is pretty universal.
We think that almost every company needs this.
I guess you want to talk about dbt cloud and we can get into that.
dbt cloud.
Well, no, just, yeah.
Well, dbt cloud is interesting to talk about.
It's an interesting product.
From one standpoint, it is a niche orchestrator
because it is an orchestrator that only does, you know, coarse-grained orchestration of dbt projects, right? Whereas the domain of orchestration is very large beyond that. And dbt Cloud is really three parallel products, I would say. One is, you know, the IDE. Two is what I'll call all the stakeholder-facing
components of it. So they have a semantic layer, they have dbt docs, they have features where you can embed the state of assets within BI tools.
That's like another pillar.
And then there's their scheduling system.
They have a pretty bare-bones orchestration system.
So we have lots of users who have moved off the orchestration piece of dbt cloud because their needs have gone beyond what dbt cloud can provide.
dbt cloud doesn't have sensors.
You can't orchestrate non-dbt things on it, and so on and so forth. We have no interest in building the web IDE component of dbt Cloud. And, you know, as we get farther away from the engineer, kind of broadly, we become less interested in those products.
I think that dbt kind of owns the analyst persona.
And, you know, that is not in our near future. But people who have grown beyond dbt Cloud's orchestration capabilities are often talking to us, and that is happening, and we're very happy about it. And then, you know, we can programmatically invoke dbt Cloud jobs as well; we have customers that do that if they want to leverage the other features that dbt Cloud has.
So back at the start you mentioned, and actually this was in the last episode we recorded, the idea of a data platform engineer. Okay, and I think we certainly just skipped over that a little bit at the start,
but how central is that to your thinking really in terms of your kind of
ideal, you know, customer persona or user persona?
And what does that really mean in your mind?
So a data platform engineer is, I'll call it a role.
Okay.
In some organizations, that role is a human.
In some organizations, that role is an entire team.
And in some, there are teams of one where part of their brain is the data platform engineer.
Meaning, data platform engineering is scaling data engineering within an org. So setting up the infrastructure, building the CI/CD pipeline,
setting up the workflow,
building shared infrastructure that spans different data pipelines, right?
And just given how multi-stakeholder
the systems are,
and the data platforms are just,
they're very particular to the needs of the business.
And you just always end up building a,
even like a little platform,
whenever you build a set of data pipelines,
it just always happens.
You know,
like if I build a data system and I'm the only engineer, my natural inclination is that I probably spend too much time being the data platform engineer and not enough time being the data engineer.
But I think that anyone, it kind of uses both parts of your brain.
And one part of your brain, you're thinking about, hey, how do I make it more efficient
to build any data pipeline in this context?
And then as a data engineer, you're building that pipeline.
And that's who's on your mind, really, as the kind of customer, the user persona for Dagster.
Is that correct?
So our user persona are data and ML engineers
who embrace software engineering best practices.
So that's kind of how we define our ideal customer persona.
So it's not like we're persuading them
that infrastructure as code is good.
They are an already persuaded human being on that front,
and they're looking for orchestration or a data platform control plane
that conspicuously embraces those values. The fact that we
really focus
on automated testing and
CICD processes and all
that is just
instrumental to our ideal customer
profile.
I think that's actually a broad set of people, a large set that is increasing in the world.
Okay.
And I suppose the last question for me, really, before I ask what's coming next, is the financial model for Elementl as well. So, you know, as you said, you sell Dagster Cloud and there's the open source Dagster project. How do you make money, and how do you, I suppose, plan to grow the business and keep the business alive? What's the plan for the financial side of what you're doing, really?
Our business model is Dagster Cloud. So you go to our website. You want to use Dagster, you do not want to self-host, and you want proprietary features that are more enterprise-oriented. You can sign up for Dagster Cloud and effortlessly, from your first PR, have a data platform at your disposal. And that is the business model.
So we have an enterprise tier for enterprise customers, and we can serve tons of customers all the way up to the Fortune 100. And then we also have a self-serve product as well, which is actually growing like
gangbusters, which is very exciting. It's one of those graphs you love to see as a startup founder.
And so yeah, you're a developer, you want to use Dagster, and you pay us to host it for you and provide awesome features.
Okay, and how does the open source community play into this?
How does that contribute into what you're doing with Dagster?
Yeah, the way we really think about it, Dagster is an ambitious project, and we want to change the way that people think about building data pipelines. And so, you know, I guess one of the reasons to do open source is that open source is a vibe,
for lack of a better term. You know, I like the Slack communities. It's fun to get
people to participate in the process and meet lots of like-minded, interesting people that way.
But the other aspect, from a more kind of brass-tacks business perspective, is that the objective here is for Dagster to become a ubiquitous standard.
It's just the way that you build data pipelines, and then if you want to run those in production, Dagster Cloud is the easiest and most powerful way of doing that.
And it is in our interest for this to be a ubiquitous standard, which is why we want to
kind of spread it as far and wide as possible and maximize adoption. In some ways, you can
actually think of our commercial product as through the lens of adoption maximization,
because there are tons of people who say would want to use Dagster, but who didn't want to self
host. And therefore they were excluded from using the technology. Or there's people at big companies who
wanted to use Dagster, but they didn't have these complex enterprise features that require a team to maintain 24-7.
And so in a lot of ways, you can just view it as, you know, we have business objectives,
obviously, but you can also just view everything through the lens of adoption maximization.
What's it like selling into the enterprise out of interest?
I mean, anecdotally, I hear that it's quite lucrative, but it's also, you know, you really have to dance to their tune and the product has to be...
I mean, so we have a...
We rely on passionate champions in the org to get us in the door and then facilitate the process with economic buyers and stakeholders.
So typically the way it works is that a highly technical tech lead or say head of data has brought us in, meaning the technology,
evaluated it, maybe kick-tested it with some of their own internal systems
just to see how it would work and get familiar with the programming model.
And then you start out a commercial conversation.
And then that's about managing the internal stakeholders,
improving the economic value.
So we're also blessed with a very talented sales team, a sales team where all the engineers that they work with like them,
which is kind of a miracle.
We kind of have a bunch of unicorns on that front.
So, you know, enterprise sales can be a pain in the butt, you know,
and especially the bigger companies that are more established
have their own peculiarities and procurement processes.
But, you know, overall, it's been... We haven't had to, nor would we,
build custom features that are specific to a customer.
We've been able to keep the platform generalized,
which I actually think is good for every stakeholder.
Fantastic. Fantastic.
I'll cut that bit out, actually,
because that's more of my interest,
rather than anything else,
because it's always interesting to hear about enterprise sales.
But I mean, it's...
So last thing then, Nick, just to finish off then,
how do people find out more about Dagster
and the concept of software-defined assets?
Well, yeah, the easiest thing is just to go to Dagster.io
and click on Docs and, you know,
start playing with the open source software.
And then anyone in the world can install on their laptop,
pip install Dagster, and you're off to the races.
So if you want to engage with the community,
the center of gravity is our Slack community.
So that's also one click away on our website.
So that is the way to engage.
Fantastic. Fantastic.
Great.
Well, I know certainly within our consulting team that there's a huge uptake of Dagster
and there's blogs on our website about it as well.
And everyone enthuses about it.
So it's a product that we really love using.
So Nick, thanks very much for coming on the show.
Appreciate you answering all my questions there.
And yeah, good luck with the product going forward and take care.
All right. Thanks so much. Thanks for having me.
Thanks for being such great users. Thank you.