The Data Stack Show - 246: AI, Abstractions, and the Future of Data Engineering with Pete Hunt of Dagster
Episode Date: May 28, 2025

Highlights from this week's conversation include:
Pete's Background and Journey in Data (1:36)
Evolution of Data Practices (3:02)
Integration Challenges with Acquired Companies (5:13)
Trust and Safety as a Service (8:12)
Transition to Dagster (11:26)
Value Creation in Networking (14:42)
Observability in Data Pipelines (18:44)
The Era of Big Complexity (21:38)
Abstraction as a Tool for Complexity (24:41)
Composability and Workflow Engines (28:08)
The Need for Guardrails (33:13)
AI in Development Tools (36:24)
Internal Components Marketplace (40:14)
Reimagining Data Integration (43:03)
Importance of Abstraction in Data Tools (46:17)
Parting Advice for Listeners and Closing Thoughts (48:01)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
Before we dig into today's episode,
we want to give a huge thanks
to our presenting sponsor, RudderStack.
They give us the equipment and time
to do this show week in, week out,
and provide you the valuable content.
RudderStack provides customer data infrastructure
and is used by the world's most innovative companies
to collect, transform, and deliver their event data
wherever it's needed, all in real time.
You can learn more at rudderstack.com.
Okay, so special episode here today.
We're here with Pete Hunt from Dagster.
Pete is actually the fourth person from Dagster
we have ever talked to on the show,
which is I think a show record.
I think so.
And also if you're like, hey, this is an unfamiliar voice,
what's the deal?
Eric is on a plane right now,
so couldn't make the recording.
I'm Brooks, producer of the show.
You probably heard me here and there before,
but here to kick things off today
and excited to connect with
Pete.
So Pete, what we always do first in our intros, will you give us just like the quick high
level version of your background?
We'll get more in depth later.
But yeah, tell us kind of where you started and what you're doing today.
Yeah, it's great to be here.
Thanks for having me.
I'm Pete.
I'm the CEO here at Dagster.
Come from an engineering background.
So kind of the first big thing I worked on was React.js at Facebook, which
was a large, successful open source project.
Then I really wanted to get into entrepreneurship.
So I left and started a company called Smyte. That's really
where I got into data and large-scale stream processing, trying to find fake
and compromised accounts on the internet.
Ended up selling that to the company that was known as Twitter back then.
Stayed there for a couple of years and then my old buddy from Facebook, Nick Schrock,
recruited me over to Dagster and before I even knew what was happening, I was CEO.
So that was very exciting.
It was very cool.
So Pete, just so many things to talk about.
One of the things I want to talk about
in regards to data teams,
which we talked about before the show,
is this idea of data people starting,
let's say kind of more from an analyst background,
they're not from a development background,
and we're seeing people kind of drifting that way.
We're seeing data practices drift that way.
So I want to dig into that with you.
And then what are you excited to talk about? I mean, I love talking about that way. So I want to dig into that with you. And then what are you excited to talk about?
I mean, I love talking about that stuff. I'm, as you can probably guess, I'm very into like dev tools and frameworks and infrastructure.
And in many ways, that's about enabling different personas to participate in like an engineering process.
And I think it's just a really exciting time to talk about that kind of thing, both because those practices are evolving, obviously,
but also there's a lot of use of large language models
to generate code.
I think that changes the math a little bit on who
can do what on the stack.
So we could talk about maybe how DevTools best practices impact
that.
Awesome.
So good.
Yeah, John and I have been talking a good bit about actually the kind of shifting ground underneath us all, and excited to get your take on how GenAI is kind of changing the landscape here.
So let's dig in.
Let's do it.
All right.
All right, Pete.
Again, we are so excited to have another person from Dagster here on the show. You started at Facebook, didn't start in data,
but I imagine even back then,
just tell us a little more about kind of working with data then.
Like, did you think about data a lot, or was it really at Smyte that you're just like, hey, now I'm getting into this?
Or was it kind of something that you had kind of always
maybe had an affinity for or kind of drifted towards?
Well, certainly back then, you know,
I was originally working on a product team
and it's very, you know, engineering-empowered
type of organization.
So back then, the latest and greatest technology was Hive.
And so I was pulling, you know,
I was pulling my own metrics to decide, you know,
number one, like what products should we focus on?
Because again, like back then it was very much like
individual small teams in many ways
making their own decisions as to what to prioritize.
We wanted to make data-driven decisions.
So we're pulling from,
I think there were like weekly snapshots back then.
I think that was the best we could do.
And it was the era when you would, like, tee up your Hive query and then go get lunch, and 20 minutes later it would have the wrong answer, and then you'd have to run it again and go get coffee.
So I was always a user there for both guiding product development and debugging stuff.
And then over time, as we would do things like acquire Instagram, for example,
they would have to get integrated with the data systems at Facebook.
And so there was a data integration problem that I was a part of. So I was always, I kind of started
on the periphery, but it was always around me. Yeah. Well, and I'm sure you got very familiar
with the problems and kind of friction points for actually working with the data. Yeah. Yeah. I was
around when they rolled out this thing, I think it was called Peregrine originally,
but that eventually became Presto and then Trino.
And it took those 20-minute queries down
to a minute or something.
And it was like, you could sit at your desk
and still be in flow.
And it was incredible.
I was shocked that you could see that.
Was there a sense of euphoria when everybody realized,
well, we have this now?
Yeah.
I mean, they're never going to attribute stock growth
to the data platform team.
But think about it, right?
It's a big, giant social network.
It's not like you can talk to your users at any sort of scale
to figure out what they want.
You have to make the decisions based on data, right?
And if you're able to make your data-driven decisions, it's like 20 times
faster. That's a big deal, right? So I think a lot of that, the growth in that company comes down
to technologies like that, that enable these like business and technical people to be able to make
quick decisions. Totally. That's really cool. So this is really funny, I just thought of this. You remember 2012, 2013, when the, I think they were called MOOCs, massive online courses, like a Udacity, right?
Yeah, yeah.
That was like around the time I was first getting
more into some data science stuff.
So there was a course, I still remember the course.
I mean, it's been years now. I wish I could remember her name, one of the data scientists at Facebook at the time. And now, in the context of this conversation, I'm thinking through some of the really interesting things she covered as part of that data science class, you know, in the social space. But it's really interesting when you start to look at it in an adversarial context as well.
That was the thing that I did after Facebook was like trying to find fake accounts.
And, you know, there were common birthdays.
To give listeners context, the timing of this was very crucial, right?
This was during kind of COVID and lots of unrest.
Yeah, we started the company actually, like end of 2014,
early 2015, but we ended up selling it to Twitter in 2018.
And that was like, you know, I was there from 2018 to 2022.
And so it was very much like, you know, elections, COVID.
I think there was like some global disaster too that I can't remember.
That you've blocked from your mind at this point.
Just really quickly, Smyte is the name of the company.
What did y'all do?
We called it kind of Trust and Safety as a service, but really what it was, we would ingest event data from marketplaces and social networks, and then in like near real time, we would try to basically find fake accounts and compromised accounts.
What's interesting, this was really where I got into data.
I actually got in through, like, stream processing mostly.
And what was kind of interesting about this problem is,
first of all, it sounds like a machine learning problem,
right?
It's like, oh, you like do some feature engineering,
you label the data and you like throw it into
like logistic regression or something and you get a classifier out.
That doesn't work.
And the reason why is because it's adversarial.
And so you actually don't have up to date labels because the patterns, the attacker
patterns change all the time.
So oftentimes, at least back then, I think now they're using like transformer models
and they work really well.
But back then there was like this combination
of like anomaly detection, manually curated heuristics
and targeted machine learning at specific problems.
And the thing is you had to like respond
and label the data like, you know, at very low latency.
Because, you know, you're talking about, if you wait five minutes, you can get a lot of spam into a system or compromise a lot of accounts in five minutes.
So it was just a very interesting problem.
And that's really where I fell in love with the data space.
It was really cool.
Nice.
I imagine you think a lot, probably read a lot,
about similar problems. But I think today it's like trust and safety with all these foundational models. Is that, I mean, is that something you're still super interested in, kind of in today's day and age with AI?
So to tell you the truth here, I found trust and safety to be a very interesting technical problem.
I thought it was like, I mean, all the problems that I just laid out to me as like somebody
that like grew up writing code and was really interested
in like distributed systems and like data analysis and stuff.
It's like a mystery and you can apply many different types
of techniques to get to the answer.
And there's all sorts of like interesting tricks
you can use.
So I thought it was very fascinating from an intellectual perspective. And it was, you know, obviously fulfilling in that that technology is now primarily used for things like child safety and fighting cybercrime over there.
I haven't worked there for a long time now,
but I believe it's still used over there
for those sorts of applications,
which is obviously like a very fulfilling thing.
Yeah.
But there are lots of other parts to that. It's also a very fraught category, basically.
Right.
Like, you know, I think that everybody wants to stop like child predators.
Right.
But there's, you know, once you get beyond that, there gets to be a very big gray
area around policy, you know, what's legal, what's not, what's proper, what's not.
And to me, especially during that time,
it was just like, it's pretty messy
and it wasn't really what was fulfilling for me.
And really for me, what I'm excited about was like
all these interesting data problems
and making developers happy
through like dev tools and infrastructure.
So I did end up leaving after like three and a half years
or so, but it was a good run there.
And so reconnected with a friend from Facebook
and went to Dagster.
Yeah, tell us a little more about that.
Yeah, so let's see.
I had known Nick back in the day,
I was working on React.js and he was working on GraphQL.
They were both, like, open source projects
that came out of Facebook at that time.
We like metaphorically and actually sat down the hall from each other.
And, you know, we always stayed in touch and, and we're friends.
And I knew he started Dagster and I put a little, a little money
in early at the seed round.
So I was always close to the company.
And, you know, when he was looking for a new head of engineering, it was
right around the time that I was kind of ready to wrap up at Twitter.
And he was like, hey, you know, I need a new head of engineering.
Can you help me search?
And I helped him with the search for a little while.
And then I just decided, hey, you know what?
I can use a little bit of a change.
I'll come over and be head of engineering.
All right.
Like, if I have to.
Yeah, yeah.
And, you know, we had a really good first year and it was one of those things
where I had done the CEO thing before. And you know, I knew it was like a job where you're
really busy and you don't have time to do everything. And so I would kind of try to
find places where his attention was elsewhere and just try to like help out there. Right?
So like there are certain things, you know, that you kind of learn the first time around
and mistakes that you make that you don't want to make the second time around.
So I helped him not make those mistakes
the second time around.
And by the end of the year, he's like, listen, man,
like I've been a solo founder for a long time.
It's a ton of work.
And frankly, I think he got into it
because he wanted to like write code
and work with customers and like be a, you know,
be a technical visionary or whatever.
And that's the CTO's job, not the CEO's job.
The CEO's job is to clean the toilets
and make sure that there's money in the company bank account
and stuff like that.
So we talked and we decided that it made sense for me to be CEO
and he could step into the CTO spot.
And I think it's been great, you know?
For me, it's like stepping into an old pair of shoes,
picking up right where I left off.
And for Nick, I think he gets to work on the stuff that he's really excited to work on and go deep on.
Yeah.
I just want to call out something really quick there
that I think is cool and not a given.
Like having that background of like somebody
that you know and trust,
because, like, if he had brought somebody else in as head of engineering, you know, that's unlikely to have happened. It was possible it could have happened, but I think it's cool.
Like having that like kind of long-term connection
where you can have that flexibility, right?
Where you both like kind of understand,
you know how to work together
and you can do some neat stuff like that
that maybe you couldn't normally do in other contexts.
Yeah, you know, it's like people you work with, you know,
like they always come back around in the future
and you never know who you're gonna work with in the future.
So I guess like always in my career,
I was always kind of trying to have this like aura
of like value around me.
Not, it's not about me, it's about actually other people.
It's like somebody like if they have some sort of interaction
or they're working with me,
like they come away like more successful.
It was like how I was,
how I tried to like think about my early career.
And I think that kind of like worked in a lot of ways
and helped create like a good network for me.
And like very concretely, like what that means is like,
often I like wouldn't work on the cool thing.
So like back at Facebook, for example,
the transition to native mobile was like a big deal.
That was where all the best people were going.
They were retraining to go to mobile.
I just stayed on the website
because that's kind of where people needed it.
They needed somebody who was good,
who was willing to be focused on maybe the thing
that wasn't super hot right now.
And that's where React came from,
and that was a really successful project.
And so like, I think for me, that strategy really worked
and created a great network for me
that has served me well in the future.
Yeah, really cool and just great,
I think, advice for anyone and everyone.
I do also want to call out, I think he said, you know,
Nick wanted to be the technical visionary.
We have had him on the show before.
It's probably been about a year ago,
but if you want to hear a technical visionary,
go back and listen to that episode.
I mean, the way he articulates his vision
for orchestration in Dagster is,
I mean, it is pretty incredible.
So yeah, go back and check that one out.
Yeah, the vision has only gotten bigger for sure.
Yeah, for sure.
I wanna get into the kind of nitty gritty
of orchestration, talk about Dagster.
But before we do that,
can we just get like your definition of orchestration,
just zooming all the way out, kind of basic level,
what is orchestration?
And then from there, I think we can,
I may let John take over and go deep.
And I have no idea what orchestration is.
I was gonna say, everybody,
I'm excited to hear your definition.
Well, I don't, you know, it's interesting
because everybody does have a different definition, right?
And, you know, I mean, just to get really concrete
really quickly, we're like the thing that schedules,
runs and monitors your data pipelines.
But I think that when you frame it like that,
there's a wide variety of technologies you can use.
And I think that we're on kind of this evolutionary path
from a, like, you can imagine a spectrum or a timeline.
There's like schedulers over here,
and there's a control plane over here, right?
And it's similar to kind of how you saw container orchestration and back-end infrastructure evolve over time.
You started with like a single server
and like daemons running on the Unix box
all the way to something like Kubernetes, which is really a control plane
And so we kind of think of orchestration as going a similar route.
So you start with cron, or something that looks like cron, a scheduler built into a product, like a Control-M or something.
And, you know, that thing is very simple.
It just runs your jobs at a certain time.
You quickly find that, you know, you're overcomputing.
You're running every step at every time slice.
Failures become a big problem. Observability is like non-existent.
And so then you move to something that's like a workflow orchestrator, right?
This would be like an Apache airflow or something like that.
And now you've got like a smarter cron, one that can retry the workflow and retry individual steps, right? So you've seen a major improvement over something like cron.
The problem is though, that like,
if you're on a data team,
it's kind of an impedance mismatch
between what those workflow orchestrators are doing
and what you're trying to do as a data team.
It's like specifically like,
the data team is thinking in terms of tables
or machine learning models or files in a data warehouse.
And the workflow engine is thinking in terms
of like these opaque steps, right?
So we kind of see this move from more of, like, workflow orchestration to, like, a data control plane.
And like fundamentally what you need there is a deep understanding of the data assets, the lineage between them, the current state of them, and all the metadata.
And then you get this rich system of record
of every single data asset in your organization.
And once you've got that information,
you can really build a bunch of interesting
observability stuff on top of that, and it really helps.
To me, that's the last piece of orchestration
is being able to observe what's happening
and let a human operator fix issues with your pipelines.
So I just want to pause and see if that made any degree of sense.
Yeah, I mean, definitely to me,
but Brooks is probably a better one to respond to that.
No, it was great.
No, yeah, love kind of breaking down the fundamentals.
It was great.
We're gonna take a quick break from the episode to talk about our sponsor, RudderStack.
Now, I could say a bunch of nice things as if I found a fancy new tool, but John has been implementing RudderStack for over half a decade. John, you work with customer event data every day and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go.
Yeah, Eric, as you know, customer data can get messy.
Have I mentioned that you have implemented the longest running production instance of RudderStack, at six years and going?
Yes, I can confirm that.
And one of the reasons we picked RudderStack was that it does not store the data, and we can live stream data to our downstream tools.
If you need to deliver clean customer event data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more.
So now it's my fun time. So on the technical side, we talked about this a little bit in the intro.
I'd love to, well, we'll start here.
Let's talk data stack.
Let's talk a little bit of evolution, modern data stack, and talk about
tooling, like how you've seen that evolve and then maybe where you see it headed.
Sure.
Yeah.
I mean, I think, you know, we were talking about running 20-minute queries on Hive back in the day.
That was definitely the pre-modern data stack. In many ways, big data was still a challenge, right?
I would say that even in the Hive era, like, we had these big data tools, but big data was not a solved problem yet.
We couldn't easily compute over unbounded sets of data
in a reasonable way, or effectively unbounded anyway.
And so I would say with the arrival of tools
like that Peregrine thing that I told you about,
but also Snowflake, Databricks, BigQuery,
you really got to almost interactive query speeds.
And to me, right around the time that BigQuery came out,
and I think it's like the Google Dremel paper,
that to me was when big data became a solved problem.
OK, we know how to do this.
We can throw more compute at the problem and solve most data challenges.
Then once you get this new capability,
people start using it, right?
And they start to basically build a bunch of stuff on top.
And to me, that was kind of like where we entered
the modern data stack era.
And the way we think about it at Dagster
is that created this era of big complexity
where suddenly you have all these different stakeholders
building all this mission critical stuff
on top of this new capability that they have.
But oftentimes the tooling and infrastructure doesn't support the level of service they need to provide for a really production-grade system, right?
So like specifically what I'm talking about here is clicking buttons in a UI and pressing save, and critical state is now in some system that only one team knows about and is not version controlled. And, you know, everybody knows there was a big market correction in 2022, 2023; maybe those people got laid off. And now suddenly you've got this whole giant data estate, or mansion, that's built on top of like one little wooden pillar that is not maintained by anybody, and termites are munching at it, and eventually that thing's going to give out.
Right.
So, you know, at Dagster, we really believe that software engineering best practices are the way to tame big complexity.
This has been a trend like in every other part
of engineering, right?
Like again, you know, citing two examples,
you think about infrastructure management,
it started out as like a sysadmin,
SSHing into an individual box
and like running the magic commands
that only that person knew to make sure
that like INITD was running or whatever.
And now it's all done through tools like Terraform
and infrastructure as code, right?
And now it's, you know, you can roll back,
you can like onboard a new person
and they can actually understand what's going on.
And you look at the front end world,
it's a similar thing, right?
It used to be there was a big giant hairy CSS file that nobody knew how to work with or could understand.
And today there are tools like React and CSS modules and stuff like that, like really enable
this like kind of standardized way to build and operate, you know, your applications and it's all
version controlled. And so we have been trying to do that for data. We're not the only people trying
to do it for data. Like we've seen dbt for example bring this style of development to that particular persona.
But we think, you know, that a data platform control plane really brings this way of managing complexity to the whole data platform.
Yeah, for sure.
And then I think this leads perfectly into kind of our next topic here, the complexity topic: you have a lot of complexity,
you're managing a lot of complexity.
So what strategies are you guys at Dagster thinking about
to make it more simple, right?
Because it is a complex problem, right?
There's a certain amount of complexity.
It's just a complex problem.
But I know you guys are working on a lot of strategies
to simplify.
Yep.
So there's exactly one tool that we all have in our arsenal to address complexity of any kind, which is abstraction. It's almost like taking this weird amorphous problem, finding the common pattern and path to success, and then wrapping it up in a box and making it kind of a repeatable process.
So you kind of present a clean, understandable interface
to a really complex problem underneath.
So you can kind of solve the lower layer once
and then the problems above it are a bit simpler.
So really it's about abstraction, right?
When we talk about complexity management, you know, we've seen this in all areas of engineering.
We started out writing assembler code with go-tos.
Then we abstracted that away into structured programming
languages with reusable functions,
object-oriented programming, et cetera.
And we think getting the abstraction right in the data
platform is key to managing the complexity of the data platform.
And what we saw was that-
Can you just articulate getting the abstraction right?
Like, how do you think about that?
What does that mean to you?
And even some examples of maybe how kind of you and the team
think about the problems you're solving.
Yeah, I mean, this is the art and science
of building a framework, right? How do you get the abstraction right?
Yeah.
And so I would say, you know, people write books on this stuff. So you can read, like, Martin Fowler, and, you know, people talk about coupling and cohesion as principles here. You want to have the different parts of your system have low coupling, so they can be examined independently, and high cohesion, so when you read a single module, it makes sense. There's a wide body of computer science literature about this, but really what I think it comes down to is, how much power do you want the user to sacrifice in order to give them some new value? That's kind of fundamentally the first thing that you think about.
And then the second thing is like,
how do they pull the escape hatch when they need to?
Or do you even want them to be able to pull an escape hatch?
And usually, in most systems, you do want to give them some sort of escape hatch that is reasonable.
So that's how I really think about it.
And so oftentimes you're trading some amount of flexibility
in order to get some property of the system that you want.
And oftentimes that is increased developer velocity
or increased observability or ability to debug your system.
Does that make sense at all?
Totally, yeah, that was awesome.
Thank you.
Yeah, and I think one of the things too, 'cause we talked about this, there's the roles question that we were talking about before the show. Like, all right, I'm more of an analyst, or now dbt's introduced this kind of analytics engineer, all-right-we're-going-to-blur-some-lines-here piece. And then there's the other side of, I'm a more traditionally trained engineer, or maybe I'm a DevOps person or something.
So I think it'd be interesting to talk about,
even maybe specifically, maybe generally with orchestrators,
but specifically with Dagster,
how do you see those coming together?
And then on top of that,
we've got this new component of the AI piece
that also makes that a little bit more complicated as far as what the roles might even look like.
Yeah. So I would start by saying, we've talked a bit about abstraction, right?
Very much related to that is this notion of composability, which is like, okay, I've abstracted away one component of this system.
And when I connect it to a different component of the system, it works in a predictable way. So the idea here is that we've given you a set of abstractions and you can put them together. Lego blocks is the analogy everybody uses, but there are a number of analogies you could use. You can combine them in ways that the abstraction author didn't prescribe for you, and the system still has the properties that you want, or that we all agreed to, right?
And so I promise I'm gonna get to answering your question.
Like the challenge with kind of like these workflow engines
is that the task abstraction is a very weak abstraction.
You don't trade very much power.
It can do like kind of anything,
but in exchange, you don't get very much benefit
from it either.
Like you can't really arbitrarily compose them together
and it's actually quite difficult to observe
what's going on inside of an opaque task
unless you manually instrument it
with some sort of observability system.
And so you don't get very many composability benefits.
And so then when you wanna onboard
a bunch of different stakeholders,
you either like, you know, regardless of their persona,
whether they're software engineers or data analysts
or infrastructure engineers, you're still like,
if you have a really weak abstraction, it's very risky
because people can step on each other's toes
and cause interactions between the components
as you don't expect.
So what often happens with a workflow orchestration tool is you either just get a big mess, or the user builds their own abstractions, like a platform team that will build their own abstractions on top.
And then their stakeholders will onboard onto that.
And I think what usually happens is both.
It's usually like a big mess.
And then the team is like,
oh man, we got to go clean this up.
And the process of cleaning that up is like
refactoring into these abstractions
that then these stakeholder teams can use.
So, you know, our abstraction, by the way,
is this thing called a data asset,
which can represent like a table in a data warehouse
or a file in an object store or something.
And that to us is like a really great way to enable different stakeholder teams to interact.
And so what we see today is we will get like machine learning
engineering teams that will be building stuff in like,
either notebooks or just like kind of stuff
using the Python scientific stack,
being able to integrate, write, and deploy data assets into Dagster right alongside the dbt engineer that imports their
dbt project into Dagster and every dbt model gets represented as an asset
within Dagster.
So this is what I mean by having those two teams work together, because, like, I saw this at Twitter, right?
Like we're trying to find spam.
There's a team that's using machine learning, actually it was in Scala.
And then there's a team that's like hacking together
like SQL queries that like work,
that have like the magic regex
that finds the spam campaign, right?
You need both of these things to really deliver, right?
And they both depend on the same upstream datasets, but they're using completely different stacks and they can't build
on each other's work. And we often found them building parallel datasets that did exactly the
same thing, except they were slightly different, because there wasn't a good
abstraction for them to be able to collaborate on one platform. So that's why we keep hammering on this notion of an abstraction.
And so we think that over time, more stakeholders will be able to participate directly
using this abstraction. And with the rise of tools like Claude Code and Cursor,
what I think is happening and is going to happen is that
individuals are going to feel more empowered to work in areas of the stack
that they were not previously familiar with.
We're seeing this today at Dagster where we would have engineers that previously would only work in the Python code base.
And then when it came time to deploy something, they'd like call up somebody from our platform
team and say, hey, can you help me write the Terraform to do the RDS config or whatever.
And today they're like using LLMs to generate the Terraform config, getting it reviewed
by the platform team.
And it's just much more efficient.
You know, they're just able to do more.
So I do think that the boundaries between teams and stakeholders are changing, for sure.
So we talked about guardrails.
In this new world where more people
are able to do more things, I mean,
guardrails kind of jumped in my mind.
And you talked about, OK, platform team
is reviewing the stuff that they did.
But are you thinking about that kind of more critically and even maybe
at a higher level, right? It's like, hey, as more people do more things, like we need guardrails,
we want to have more freedom, but we need guardrails and here's how we're doing that at
Dagster. Yeah, so in many ways, like, abstractions are guardrails, right? Yeah. And we often hear, you know, you talk to a data platform team
and they often say, listen, like, you know, it's our job
to build the central platform, the shared set of tools
and best practices to enable other teams to be successful.
Right.
And the, like their goal really is to give those teams
as much autonomy as possible, but like no more than that. You
know what I mean? So they want to put like guardrails that make sure the organization
stays compliant with the obligations they made to their customers and regulators. Also,
make sure that everybody stays within budget and leverages the best practices and tools to make
those teams more successful. And so very much like the platform team's job is to build these
abstractions,
build these guardrails, you know, for these stakeholder teams.
Now, the way that this has worked in the past, you know, with Dagster in particular, is we have this notion of an asset as, like, our kind of fundamental unit of composition.
And it works really well for teams that are, you know, Python-forward, right? They could take Dagster out of the box and generally be pretty successful. It also works really well for organizations, or technologies rather, where we have a really good out-of-the-box integration. So dbt is a really good example. Point Dagster at your dbt project, and your dbt developers can be Dagster users, no problem. But there is a big world out there
of like diverse stakeholders, you know,
different tools and a lot of organizations
have like their own tools that they built internally.
And, you know, they wanted a way
to basically build those guardrails
or build those abstraction layers on top of Dagster
for their stakeholder teams.
And we saw customers doing that just using Python.
They would maybe build a YAML DSL, a domain-specific language, on top of Dagster.
Maybe they would build a special Python library
that would translate their domain concepts
into Dagster concepts.
But what we had found was everybody was doing it
in slightly different ways.
The tooling was often like MVP status.
So like they didn't have a beautiful VS Code
auto-complete extension for their thing.
And it was, you know,
whatever they were able to get done in the limited time
that they had to work on this.
And so we said, listen, let's take all the stuff
that users are already doing and build them great tooling
to build their own abstractions.
And we'll ship some out-of-the-box abstractions
too for common use cases, like a DBT or data movement tool.
And so that's a thing that we built called Dagster Components.
And I think it's very interesting developing
new dev tools and new abstractions in the age of AI.
Because we consider, like, Claude
as like a user just alongside like our design partners,
right?
So we'll test an API and we'll say, hey,
did Claude actually understand this?
Was it able to, like, one-shot what we wanted to do?
And you know, you see this with tools like, if you talk to the folks at Vercel working on v0, they're probably doing similar things, right?
Was that easier or harder than you expected it to be?
What's interesting is the stuff that makes it good for LLMs
makes it good for humans too.
I was going to ask, are there parts where you've seen divergent paths there, or so far has it been like you can just kind of optimize for both at the same time?
It's 80% both at the same time and 20% different.
So, like, the way that you provide the documentation to the LLM is quite different than how you provide it to humans.
Yeah.
Right.
You want to like, actually, with humans,
you want to like give them a bunch of examples and contexts
and stuff like that.
And with an LLM, you have a finite number of tokens
that you really want to be able to burn on this sort of thing.
So the way you deliver the documentation is different.
And there's this thing called model context protocol,
which we built kind of an integration for. It lets you integrate with, like, all these LLMs
and give them kind of programmatic access
to tools and documentation.
So certainly like that's a thing specifically for LLMs,
but 80% of it is like, if we were talking to an LLM,
we would say we need to reduce the number of tokens in the context window.
And if we're talking to a human, it's like,
we want you to only have to look at one file
to be able to solve this problem or know what's going on.
And the code should be concise.
These are things that humans like,
and LLMs I think also benefit from. Similarly, LLMs, like humans, need feedback from tooling that says,
hey, did I write my code correctly?
Does it pass the schema check?
Does it, you know, initialize correctly?
And that needs to be as fast as possible
so the LLM can work, and it's the same thing for a human, right?
So what is interesting is like,
as we started to put this through the like LLM wringer,
the framework just got a lot better for humans.
So, I don't know, we live in this weird age of, like, cybernetic programming.
Yeah.
Yeah.
Well, so I think the components thing,
that conversation is really interesting.
And I just wanna make sure that I and our listeners kind of understand: is the future state here where there's going to be a specific component optimized for a specific destination, like Postgres or something, and a specific source, like Salesforce? I don't know. Is that how I should think about components? I know it's kind of an abstraction above that, but do you think that's one of the kind of practical use cases?
So let me give you an example. I can give you a couple of examples because it is an abstraction
above that, right? So we're going to ship with importing your dbt project as a component.
So there's a dbt project component. We'll ship with various BI tools and data movement tools as components. So you want to integrate with whatever ELT tool or whatever BI tool you want, there's a component for that. And we think that's going to actually reduce the time from not knowing Dagster at all to having something in production by, like, 10X, right?
But really the value is, like, we're shipping an internal components marketplace
for enterprises where like,
they're not gonna want their teams to take
any SaaS data movement tool off of the shelf, right?
They're gonna have their approved vendors that they use, or their approved technologies they use.
And so they're gonna build their own internal component
that makes it very easy for teams
to spin up their own data movement pipeline
and adhere to their best practices.
Okay.
And then bringing in Model Context Protocol, MCP,
things like that.
So I've got this marketplace,
and it's like that abstraction layer up, which is higher leverage, right? Like, you're not trying to connect directly to, you know, specific SaaS tools; you're up at the layer above, where you're working with, like, all the common extraction tools, for example, or extraction and transformation tools.
So then, like if I'm kind of an ordinary or kind of a less technical user,
I theoretically could use whatever my company uses
as far as like an LLM. There's potentially like from the Dagster side this model context protocol
which gives context to the LLM for what I'm doing here. And then I could describe in English,
hey I want to move data from here, transform it in this way and I want it to land here, for example.
That's right. Yeah. And the way to think about it, too, is, like, the person doing that, they probably spike really deep on some other technology. Maybe they're a really good financial analyst, maybe they're a really good dbt developer, or a really good machine learning person. And they're not a Dagster expert, right? So they're just like, integrate this thing with this thing.
And then like do it for them.
But there is gonna probably be a small team of Dagster experts at the company, right?
And their job is gonna be to basically build
those custom components.
And through the model context protocol integration
between, like, Dagster and Claude Code or Cursor or whatever, those Dagster data platform engineers can teach the model how to be really effective in their stack.
Right.
So it's actually like, once you see the full development workflow, it's like, I
think it's going to really change how teams develop.
I think it's just going to empower like a lot more folks to participate in like
a self-service way without creating, like, a bunch of technical debt or having to block on other teams to build part of their solution.
So here's a follow-up question then, and it's an unfair question. We've talked a lot about how data is drifting toward a lot of these workflows that are really kind of more mature, like what a front-end developer might do, or even the DevOps world. Because data is a little bit more greenfield, what is something where you're like, I think we can get better here because we know how all these other workflows work? Does that make sense? In terms of the value, are there a couple of things you're excited about that actually may be better because you get to kind of start over?
Oh yeah.
I mean, I think that everybody knows that there's like
too many tools and the integration between tools
is like a big pain in the butt.
And so, like, you know, if you have an orchestrator that understands the asset lineage in a very deep way, and understands where the data is coming from, where it's going to, where it's stored, the current status of it, whether it's passing its quality checks or not, you get a really great observability tool and data discovery tool, just automatically, right?
And it's not like we did a process
to document all our data,
we just wrote the code in this way
and we got all these capabilities.
And this is like, again,
the power of like a really good abstraction, right?
It's like, you know, we put some guardrails in place,
which probably sacrifices a little bit of power.
In exchange though, you get like a data catalog,
like out of the box.
And you get like, you know,
an understanding of the freshness of your data assets.
And like, if they fail their freshness checks,
we can like automatically remediate it
and stuff like that.
So it actually is a bit of a rethinking of the stack. When you start to go from, hey, an orchestrator is just a fancy scheduler, to, no, this is a control plane across the whole data platform, it actually does rethink what you can do and the shape of the stack.
I kind of think of it as like,
you've got your big data compute layer below here,
like Snowflake, Databricks, and stuff like that.
You've got your BI tools and data activation up here.
There's a bunch of messy stuff in the middle.
We can really help there. A control plane really tames the complexity of that messy middle.
Right. Yeah, I think that makes a lot of sense.
This is kind of a very specific question. So there's a lot. There's kind of the general flow, which we've talked through several times, where we've got what's called sources, we've got transformations, steps that are happening in the middle, and we've got data landing. What about some of the more, what do you call it, edge cases? Because I still think they're very common, but some of these, like ML and AI, where I've got unstructured data that I want to bring in. Or even maybe a more edge case: I have fairly sophisticated security and governance that I need to maintain, and I have all these SQL scripts that run to do things, or I have auditors here today. There's just all this interesting, you know, essentially long tail of people that I think are also going to be users. So, you know, you don't have to talk to all of those, but maybe pick one of those, and I'd be curious to learn more.
I mean, I'll tell you, when you zoom out, we're really in the business of taming the complexity, and we think that software engineering best practices are the way to do that. It kind of implies, like, a technical or semi-technical user, right?
Oh, there are teams, like when I was working on trust and safety at Twitter, right? There's a ton of data compliance and privacy things that you have to work through. And there's, you know, giant legal orgs that you have to interface with. I think generally, Dagster is not the tool for them.
Sure.
But for a lot of the kind of technical stakeholders
that are writing SQL or doing data analysis
or anything like that,
that's kind of really where we see like Dagster,
you know, being kind of the tool for them.
I'm not sure if I answered your question, but.
Yeah, no, I think that's helpful
because essentially like to do the abstractions well
and make them useful, you can't solve for every use case, or the abstraction is kind of bad
or not really an abstraction, right?
Right, it's a kitchen sink, right?
Yeah, exactly.
Yeah, we don't wanna do everything.
Like we designed the asset abstraction
and a couple of other abstractions around it
that make sense.
It's kind of like you define the asset abstraction,
you think about the life cycle of a data asset,
and then we can hook in with best of breed tools
or custom code from the user,
and then bring it all together into one place.
Right.
We are almost at the buzzer here, but Dagster Components is out now. If folks want to learn more or see it in action, what's the best way? Should they go to the website?
Yeah, they should go to dagster.io.
And it's an open source framework,
so you can read the documentation
and install it yourself.
Or if you request a demo from our team,
we'll get you on with an engineer,
and they can demo our commercial offering.
Super exciting.
Last question before we wrap here, Pete. You have had an extremely interesting and, I think fair to say, prolific career. You strike me as pretty humble, so you might not say that about yourself, but you have learned a lot of interesting lessons along the way.
Parting piece of advice to our listeners, somebody working in data day to day, maybe especially facing AI changing a lot of things, with the jury still out on exactly what things will look like.
But what would be kind of just a parting piece of advice that you'd give our listeners?
Yeah, I mean, it's a very interesting and broad question, but I would say just be like
an empathetic person and try to help people out.
And like in the tech industry, it's the type of thing where like helping people out indirectly
equals success, you know?
And so even if you're just totally selfish and you're totally looking out for yourself,
adopting a default strategy of being an empathetic person will be good for everybody. So that's what I would leave people with.
Yeah, that's great. Well, Pete, been an awesome show. Thank you so much for coming on. And
yeah, I'm sure the way it's going, we'll have someone else from Dagster here pretty soon.
So we'll look forward to it. But thank you, Pete.
Cool. Thanks, guys.
Yeah, thank you, Pete.