The Data Stack Show - 246: AI, Abstractions, and the Future of Data Engineering with Pete Hunt of Dagster
Episode Date: May 28, 2025

Highlights from this week's conversation include:
Pete's Background and Journey in Data (1:36)
Evolution of Data Practices (3:02)
Integration Challenges with Acquired Companies (5:13)
Trust and Safety as a Service (8:12)
Transition to Dagster (11:26)
Value Creation in Networking (14:42)
Observability in Data Pipelines (18:44)
The Era of Big Complexity (21:38)
Abstraction as a Tool for Complexity (24:41)
Composability and Workflow Engines (28:08)
The Need for Guardrails (33:13)
AI in Development Tools (36:24)
Internal Components Marketplace (40:14)
Reimagining Data Integration (43:03)
Importance of Abstraction in Data Tools (46:17)
Parting Advice for Listeners and Closing Thoughts (48:01)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, I'm Eric Dodds.
And I'm John Wessel.
Welcome to the Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human
challenges involved in data work.
Join our casual conversations with innovators and data professionals to learn about new
data technologies and how data teams are run at top companies.
Before we dig into today's episode,
we want to give a huge thanks
to our presenting sponsor, RudderStack.
They give us the equipment and time
to do this show week in, week out,
and provide you the valuable content.
RudderStack provides customer data infrastructure
and is used by the world's most innovative companies
to collect, transform, and deliver their event data
wherever it's needed, all in real time.
You can learn more at rudderstack.com.
Okay, so special episode here today.
We're here with Pete Hunt from Dagster.
Pete is actually the fourth person from Dagster
we have ever talked to on the show,
which is I think a show record.
I think so.
And also if you're like, hey, this is an unfamiliar voice,
what's the deal?
Eric is on a plane right now,
so couldn't make the recording.
I'm Brooks, producer of the show.
You probably heard me here and there before,
but here to kick things off today
and excited to connect with
Pete.
So Pete, what we always do first in our intros, will you give us just like the quick high
level version of your background?
We'll get more in depth later.
But yeah, tell us kind of where you started and what you're doing today.
Yeah, it's great to be here.
Thanks for having me.
I'm Pete.
I'm the CEO here at Dagster.
Come from an engineering background.
So kind of the first big thing I worked on was React.js at Facebook, which
was a large, successful open source project.
Then I really wanted to get into entrepreneurship.
So I left and started a company called Smyte. That's really
where I got into data and large-scale stream processing, trying to find fake
and compromised accounts on the internet.
Ended up selling that to the company that was known as Twitter back then.
Stayed there for a couple of years and then my old buddy from Facebook, Nick Schrock,
recruited me over to Dagster and before I even knew what was happening, I was CEO.
So that was very exciting.
It was very cool.
So Pete, just so many things to talk about.
One of the things I want to talk about
in regards to data teams,
which we talked about before the show,
is this idea of data people starting,
let's say kind of more from an analyst background,
they're not from a development background,
and we're seeing people kind of drifting that way.
We're seeing data practices drift that way.
So I want to dig into that with you.
And then what are you excited to talk about? I mean, I love talking about that way. So I want to dig into that with you. And then what are you excited to talk about?
I mean, I love talking about that stuff. I'm, as you can probably guess, I'm very into like dev tools and frameworks and infrastructure.
And in many ways, that's about enabling different personas to participate in like an engineering process.
And I think it's just a really exciting time to talk about that kind of thing, both because those practices are evolving, obviously,
but also there's a lot of use of large language models
to generate code.
I think that changes the math a little bit on who
can do what on the stack.
So we could talk about maybe how DevTools best practices impact
that.
Awesome.
So good.
Yeah, John and I have been talking a good bit about actually the kind of shifting ground underneath us all, and excited to get your take on how GenAI is kind of changing the landscape here.
So let's dig in.
Let's do it.
All right.
All right, Pete.
Again, we are so excited to have another person from Dagster here on the show. You started at Facebook, didn't start in data,
but I imagine even back then,
just tell us a little more about kind of working with data then.
Like, did you think about data a lot, or was it really at Smyte that you're just like, hey, now I'm getting into this?
Or was it kind of something that you had kind of always
maybe had an affinity for or kind of drifted towards?
Well, certainly back then, you know,
I was originally working on a product team
and it's very, you know, engineering-empowered
type of organization.
So back then, the latest and greatest technology was Hive.
And so I was pulling, you know,
I was pulling my own metrics to decide, you know,
number one, like what products should we focus on?
Because again, like back then it was very much like
individual small teams in many ways
making their own decisions as to what to prioritize.
We wanted to make data-driven decisions.
So we're pulling from,
I think there were like weekly snapshots back then.
I think that was the best we could do.
And it was the era when you would, like, tee up your Hive query and then go get lunch, and 20 minutes later it would have the wrong answer, and then you'd have to run it again and go get coffee.
So I was always a user there for both guiding product development and debugging stuff.
And then over time, as we would do things like acquire Instagram, for example,
they would have to get integrated with the data systems at Facebook.
And so there was a data integration problem that I was a part of. So I was always, I kind of started
on the periphery, but it was always around me. Yeah. Well, and I'm sure you got very familiar
with the problems and kind of friction points for actually working with the data. Yeah. Yeah. I was
around when they rolled out this thing, I think it was called Peregrine originally,
but that eventually became Presto and then Trino.
And it took those 20-minute queries down
to a minute or something.
And it was like, you could sit at your desk
and still be in flow.
And it was incredible.
I was shocked that you could see that.
Was there a sense of euphoria when everybody realized,
well, we have this now?
Yeah.
I mean, they're never going to attribute stock growth
to the data platform team.
But think about it, right?
It's a big, giant social network.
It's not like you can talk to your users at any sort of scale
to figure out what they want.
You have to make the decisions based on data, right?
And if you're able to make your data-driven decisions, it's like 20 times
faster. That's a big deal, right? So I think a lot of that, the growth in that company comes down
to technologies like that, that enable these like business and technical people to be able to make
quick decisions. Totally. That's really cool. So this is really funny, I just thought of this. You remember 2012, 2013, when the, I think they were called MOOCs, massive online courses, like a Udacity, right?
Yeah, yeah.
That was like around the time I was first getting
more into some data science stuff.
So there was a course, I still remember the course.
I mean, it's been years now. I wish I could remember her name, one of the data scientists at Facebook at the time. And now, in the context of this conversation, I'm thinking through some of the really interesting things she covered as part of that data science class, you know, in the social space. But it's really interesting when you start to look at it in an adversarial context as well.
That was the thing that I did after Facebook was like trying to find fake accounts.
And, you know, there were common birthdays.
To give listeners context, the timing of this was very crucial, right?
This was during kind of COVID and lots of unrest.
Yeah, we started the company actually, like end of 2014,
early 2015, but we ended up selling it to Twitter in 2018.
And that was like, you know, I was there from 2018 to 2022.
And so it was very much like, you know, elections, COVID.
I think there was like some global disaster too that I can't remember.
That you've blocked from your mind at this point.
Just really quickly, Smyte is the name of the company.
What did y'all do?
We called it kind of Trust and Safety as a service, but really what it was, we would ingest event data from marketplaces and social networks, and then in like near real time, we would try to basically find fake accounts and compromised accounts.
What's interesting, this was really where I got into data.
I actually got in through, like, stream processing mostly.
And what was kind of interesting about this problem is,
first of all, it sounds like a machine learning problem,
right?
It's like, oh, you like do some feature engineering,
you label the data and you like throw it into
like logistic regression or something and you get a classifier out.
That doesn't work.
And the reason why is because it's adversarial.
And so you actually don't have up to date labels because the patterns, the attacker
patterns change all the time.
So oftentimes, at least back then, I think now they're using like transformer models
and they work really well.
But back then there was like this combination
of like anomaly detection, manually curated heuristics
and targeted machine learning at specific problems.
And the thing is you had to like respond
and label the data like, you know, at very low latency.
Because, you know, you're talking about, if you wait five minutes, you can get a lot of spam into a system or compromise a lot of accounts in five minutes.
So it was just a very interesting problem.
And that's really where I fell in love with the data space.
It was really cool.
Nice.
I imagine you think a lot, probably read a lot,
about similar problems. But I think today it's like trust and safety with all these foundational models. Is that, I mean, is that something you're still super interested in, kind of in today's day and age with AI?
So to tell you the truth here, I found trust and safety to be a very interesting technical problem.
I thought it was like, I mean, all the problems that I just laid out to me as like somebody
that like grew up writing code and was really interested
in like distributed systems and like data analysis and stuff.
It's like a mystery and you can apply many different types
of techniques to get to the answer.
And there's all sorts of like interesting tricks
you can use.
So I thought it was very fascinating from an intellectual perspective. And it was, you know, obviously fulfilling in that that technology is now primarily used for things like child safety and fighting cybercrime over there.
I haven't worked there for a long time now,
but I believe it's still used over there
for those sorts of applications,
which is obviously like a very fulfilling thing.
Yeah.
But there are lots of other parts to that. It's also a very fraught category, basically.
Right.
Like, you know, I think that everybody wants to stop like child predators.
Right.
But there's, you know, once you get beyond that, there gets to be a very big gray
area around policy, you know, what's legal, what's not, what's proper, what's not.
And to me, especially during that time,
it was just like, it's pretty messy
and it wasn't really what was fulfilling for me.
And really for me, what I'm excited about was like
all these interesting data problems
and making developers happy
through like dev tools and infrastructure.
So I did end up leaving after like three and a half years
or so, but it was a good run there.
And so reconnected with a friend from Facebook
and went to Dagster.
Yeah, tell us a little more about that.
Yeah, so let's see.
I had known Nick back in the day,
I was working on React.js and he was working on GraphQL.
They were both, like, open source projects
that came out of Facebook at that time.
We like metaphorically and actually sat down the hall from each other.
And, you know, we always stayed in touch and, and we're friends.
And I knew he started Dagster and I put a little, a little money
in early at the seed round.
So I was always close to the company.
And, you know, when he was looking for a new head of engineering, it was
right around the time that I was kind of ready to wrap up at Twitter.
And he was like, hey, you know, I need a new head of engineering.
Can you help me search?
And I helped him with the search for a little while.
And then I just decided, hey, you know what?
I can use a little bit of a change.
I'll come over and be head of engineering.
All right.
Like, if I have to.
Yeah, yeah.
And, you know, we had a really good first year and it was one of those things
where I had done the CEO thing before. And you know, I knew it was like a job where you're
really busy and you don't have time to do everything. And so I would kind of try to
find places where his attention was elsewhere and just try to like help out there. Right?
So like there are certain things, you know, that you kind of learn the first time around
and mistakes that you make that you don't want to make the second time around.
So I helped him not make those mistakes
the second time around.
And by the end of the year, he's like, listen, man,
like I've been a solo founder for a long time.
It's a ton of work.
And frankly, I think he got into it
because he wanted to like write code
and work with customers and like be a, you know,
be a technical visionary or whatever.
And that's the CTO's job, not the CEO's job.
The CEO's job is to clean the toilets
and make sure that there's money in the company bank account
and stuff like that.
So we talked and we decided that it made sense for me to be CEO
and he could step into the CTO spot.
And I think it's been great, you know?
For me, it's like stepping into an old pair of shoes,
picking up right where I left off.
And for Nick, I think he gets to work on the stuff that he's really excited to work on and go deep on.
Yeah.
I just want to call out something really quick there
that I think is cool and not a given.
Like having that background of like somebody
that you know and trust,
because, like, if he had brought somebody else in as head of engineering, you know, that's unlikely to have happened. It was possible it could have happened, but I think it's cool.
Like having that like kind of long-term connection
where you can have that flexibility, right?
Where you both like kind of understand,
you know how to work together
and you can do some neat stuff like that
that maybe you couldn't normally do in other contexts.
Yeah, you know, it's like people you work with, you know,
like they always come back around in the future
and you never know who you're gonna work with in the future.
So I guess like always in my career,
I was always kind of trying to have this like aura
of like value around me.
Not, it's not about me, it's about actually other people.
It's like somebody like if they have some sort of interaction
or they're working with me,
like they come away like more successful.
It was like how I was,
how I tried to like think about my early career.
And I think that kind of like worked in a lot of ways
and helped create like a good network for me.
And like very concretely, like what that means is like,
often I like wouldn't work on the cool thing.
So like back at Facebook, for example,
the transition to native mobile was like a big deal.
That was where all the best people were going.
They were retraining to go to mobile.
I just stayed on the website
because that's kind of where people needed it.
They needed somebody who was good,
who was willing to be focused on maybe the thing
that wasn't super hot right now.
And that's where React came from,
and that was a really successful project.
And so like, I think for me, that strategy really worked
and created a great network for me
that has served me well in the future.
Yeah, really cool and just great,
I think, advice for anyone and everyone.
I do also want to call out, I think he said, you know,
Nick wanted to be the technical visionary.
We have had him on the show before.
It's probably been about a year ago,
but if you want to hear a technical visionary,
go back and listen to that episode.
I mean, the way he articulates his vision
for orchestration in Dagster is,
I mean, it is pretty incredible.
So yeah, go back and check that one out.
Yeah, the vision has only gotten bigger for sure.
Yeah, for sure.
I wanna get into the kind of nitty gritty
of orchestration, talk about Dagster.
But before we do that,
can we just get like your definition of orchestration,
just zooming all the way out, kind of basic level,
what is orchestration?
And then from there, I think we can,
I may let John take over and go deep.
And I have no idea what orchestration is.
I was gonna say, everybody,
I'm excited to hear your definition.
Well, I don't, you know, it's interesting
because everybody does have a different definition, right?
And, you know, I mean, just to get really concrete
really quickly, we're like the thing that schedules,
runs and monitors your data pipelines.
But I think that when you frame it like that,
there's a wide variety of technologies you can use.
And I think that we're on kind of this evolutionary path
from a, like, you can imagine a spectrum or a timeline.
There's like schedulers over here,
and there's a control plane over here, right?
And it's similar to kind of how you saw container orchestration and back-end infrastructure evolve over time.
You started with like a single server
and like daemons running on the Unix box
all the way to something like Kubernetes, which is really a control plane
And so we kind of think of orchestration as going a similar route.
So you start with cron, or something that looks like cron, a scheduler built into a product, like a Control-M or something.
And, you know, that thing is very simple.
It just runs your jobs at a certain time.
You quickly find that, you know, you're overcomputing.
You're running every step at every time slice.
Failures become a big problem. Observability is like non-existent.
And so then you move to something that's like a workflow orchestrator, right?
This would be like an Apache airflow or something like that.
And now you've got like a smarter cron, one that can retry the workflow and retry individual steps, right? So you've seen a major improvement over something like cron.
The problem is though, that like,
if you're on a data team,
it's kind of an impedance mismatch
between what those workflow orchestrators are doing
and what you're trying to do as a data team.
It's like specifically like,
the data team is thinking in terms of tables
or machine learning models or files in a data warehouse.
And the workflow engine is thinking in terms
of like these opaque steps, right?
So we kind of see this move from more of, like, workflow orchestration to, like, a data control plane.
And like fundamentally what you need there is a deep understanding of the data assets, the lineage between them, the current state of them, and all the metadata.
And then you get this rich system of record
of every single data asset in your organization.
And once you've got that information,
you can really build a bunch of interesting
observability stuff on top of that, and it really helps.
To me, that's the last piece of orchestration
is being able to observe what's happening
and let a human operator fix issues with your pipelines.
So I just want to pause and see if that made any degree of sense.
Yeah, I mean, definitely to me,
but Brooks is probably a better one to respond to that.
No, it was great.
No, yeah, love kind of breaking down the fundamentals.
It was great.
We're gonna take a quick break from the episode to talk about our sponsor, RudderStack.
Now, I could say a bunch of nice things as if I found a fancy new tool, but John has been implementing RudderStack for over half a decade. John, you work with customer event data every day and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go.
Yeah, Eric, as you know, customer data can get messy.
Have I mentioned that you have implemented the longest running production instance of RudderStack, at six years and going?
Yes, I can confirm that.
And one of the reasons we picked RudderStack was that it does not store the data, and we can live stream data to our downstream tools.
If you need to deliver clean customer event data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more.
So now it's my fun time. So on the technical side, we talked about this a little bit in the intro.
I'd love to, well, we'll start here.
Let's talk data stack.
Let's talk a little bit of evolution, modern data stack, and talk about
tooling, like how you've seen that evolve and then maybe where you see it headed.
Sure.
Yeah.
I mean, I think, you know, we were talking about running 20-minute queries on Hive back in the day.
That was definitely the pre-modern data stack. In many ways, big data was still a challenge, right?
I would say that even in the Hive era, like, we had these big data tools, but big data was not a solved problem yet.
We couldn't easily compute over unbounded sets of data
in a reasonable way, or effectively unbounded anyway.
And so I would say with the arrival of tools
like that Peregrine thing that I told you about,
but also Snowflake, Databricks, BigQuery,
you really got to almost interactive query speeds.
And to me, right around the time that BigQuery came out,
and I think it's like the Google Dremel paper,
that to me was when big data became a solved problem.
OK, we know how to do this.
We can throw more compute at the problem and solve most data challenges.
Then once you get this new capability,
people start using it, right?
And they start to basically build a bunch of stuff on top.
And to me, that was kind of like where we entered
the modern data stack era.
And the way we think about it at Dagster
is that created this era of big complexity
where suddenly you have all these different stakeholders
building all this mission critical stuff
on top of this new capability that they have.
But oftentimes the tooling and infrastructure doesn't support the level of service they need to provide for a really production-grade system, right?
So like specifically what I'm talking about here is clicking buttons in a UI and pressing save, and critical state is now in some system that only one team knows about and is not version controlled. And, you know, everybody knows there was a big market correction in 2022, 2023; maybe those people got laid off. And now suddenly you've got this whole giant data estate, or mansion, that's built on top of like one little wooden pillar that is not maintained by anybody, and termites are munching at it, and eventually that thing's going to give out.
Right.
So, you know, at Dagster, we really believe that software engineering best practices are the way to tame big complexity.
This has been a trend like in every other part
of engineering, right?
Like again, you know, citing two examples,
you think about infrastructure management,
it started out as like a sysadmin,
SSHing into an individual box
and like running the magic commands
that only that person knew to make sure
that like INITD was running or whatever.
And now it's all done through tools like Terraform
and infrastructure as code, right?
And now it's, you know, you can roll back,
you can like onboard a new person
and they can actually understand what's going on.
And you look at the front end world,
it's a similar thing, right?
It used to be there was a big giant hairy CSS file that nobody knew how to work with or could understand.
And today there are tools like React and CSS modules and stuff like that, like really enable
this like kind of standardized way to build and operate, you know, your applications and it's all
version controlled. And so we have been trying to do that for data. We're not the only people trying
to do it for data. Like we've seen dbt for example bring this style of development to that particular persona.
But we think, you know, that a data platform control plane really brings this way of managing complexity to the whole data platform.
Yeah, for sure.
And then I think this leads perfectly into kind of our next topic here, the complexity topic: you have a lot of complexity,
you're managing a lot of complexity.
So what strategies are you guys at Dagster thinking about
to make it more simple, right?
Because it is a complex problem, right?
There's a certain amount of complexity.
It's just a complex problem.
But I know you guys are working on a lot of strategies
to simplify.
Yep.
So there's exactly one tool that we all have in our arsenal to address complexity of any kind, which is abstraction. It's almost like taking this weird amorphous problem, finding the common pattern and path to success, and then wrapping it up in a box and making it kind of a repeatable process.
So you kind of present a clean, understandable interface
to a really complex problem underneath.
So you can kind of solve the lower layer once
and then the problems above it are a bit simpler.
So really it's about abstraction, right?
When we talk about complexity management, you know, we've seen this in all areas of engineering.
We started out writing assembler code with go-tos.
Then we abstracted that away into structured programming
languages with reusable functions,
object-oriented programming, et cetera.
And we think getting the abstraction right in the data
platform is key to managing the complexity of the data platform.
And what we saw was that-
Can you just articulate getting the abstraction right?
Like, how do you think about that?
What does that mean to you?
And even some examples of maybe how kind of you and the team
think about the problems you're solving.
Yeah, I mean, this is the art and science
of building a framework, right? How do you get the abstraction right?
Yeah.
And so I would say, you know, people write books on this stuff. So you can read, like, Martin Fowler, and, you know, people talk about coupling and cohesion as principles here. You want to have the different parts of your system have low coupling, so they can be examined independently, and high cohesion, so when you read a single module, it makes sense. There's a wide body of computer science literature about this, but really what I think it comes down to is, how much power do you want the user to sacrifice in order to give them some new value? That's kind of fundamentally the first thing that you think about.
And then the second thing is like,
how do they pull the escape hatch when they need to?
Or do you even want them to be able to pull an escape hatch?
And usually, in most systems, you do want to give them some sort of escape hatch that is reasonable.
So that's how I really think about it.
And so oftentimes you're trading some amount of flexibility
in order to get some property of the system that you want.
And oftentimes that is increased developer velocity
or increased observability or ability to debug your system.
Does that make sense at all?
Totally, yeah, that was awesome.
Thank you.
Yeah, and I think one of the things too, 'cause we talked about this, there's the roles question that we were talking about before the show. Like, all right, I'm more of an analyst, or now dbt's introduced this kind of analytics engineer, all-right-we're-going-to-blur-some-lines-here piece. And then there's the other side of, I'm a more traditionally trained engineer, or maybe I'm a DevOps person or something.
So I think it'd be interesting to talk about,
even maybe specifically, maybe generally with orchestrators,
but specifically with Dagster,
how do you see those coming together?
And then on top of that,
we've got this new component of the AI piece
that also makes that a little bit more complicated as far as what the roles might even look like.
Yeah. So I would start by saying, we've talked a bit about abstraction, right?
Very much related to that is this notion of composability, which is like, okay, I've abstracted away one component of this system.
And when I connect it to a different component of the system, it works in a predictable way. So the idea here is that we've given you a set of abstractions and you can put them together. Lego blocks is the analogy everybody uses, but there are a number of analogies you could use. You can combine them in ways that the abstraction author didn't prescribe for you, and the system still has the properties that you want, or that we all agreed to, right?
And so I promise I'm gonna get to answering your question.
Like the challenge with kind of like these workflow engines
is that the task abstraction is a very weak abstraction.
You don't trade very much power.
It can do like kind of anything,
but in exchange, you don't get very much benefit
from it either.
Like you can't really arbitrarily compose them together
and it's actually quite difficult to observe
what's going on inside of an opaque task
unless you manually instrument it
with some sort of observability system.
And so you don't get very many composability benefits.
And so then when you wanna onboard
a bunch of different stakeholders,
you either like, you know, regardless of their persona,
whether they're software engineers or data analysts
or infrastructure engineers, you're still like,
if you have a really weak abstraction, it's very risky
because people can step on each other's toes
and cause interactions between the components
as you don't expect.
So what often happens with a workflow orchestration tool is you either just get a big mess, or the user builds their own abstractions, like a platform team that will build their own abstractions on top.
And then their stakeholders will onboard onto that.
And I think what usually happens is both.
It's usually like a big mess.
And then the team is like,
oh man, we got to go clean this up.
And the process of cleaning that up is like
refactoring into these abstractions
that then these stakeholder teams can use.
So, you know, our abstraction, by the way,
is this thing called a data asset,
which can represent like a table in a data warehouse
or a file in an object store or something.
And that to us is like a really great way to enable different stakeholder teams to interact.
And so what we see today is we will get like machine learning
engineering teams that will be building stuff in like,
either notebooks or just like kind of stuff
using the Python scientific stack,
being able to integrate, write, and deploy data assets into Dagster right alongside the dbt engineer that imports their
dbt project into Dagster and every dbt model gets represented as an asset
within Dagster.
So this is what I mean by having those two teams work together, because, like, I saw this at Twitter, right?
Like we're trying to find spam.
There's a team that's using machine learning, actually it was in Scala.
And then there's a team that's like hacking together
like SQL queries that like work,
that have like the magic regex
that finds the spam campaign, right?
You need both of these things to really deliver, right?
And they both depend on the same upstream datasets, but they're using completely different stacks and they can't build
on each other's work. And we often found them building parallel datasets that did exactly the
same thing, except they were slightly different, because there wasn't a good
abstraction for them to be able to collaborate on one platform. So that's why we keep hammering on this notion of an abstraction.
And so we think that over time, more stakeholders will be able to participate directly
using this abstraction. And with the rise of tools like Claude Code and Cursor,
what I think is happening and is going to happen is that
individuals are going to feel more empowered to work in areas of the stack
that they were not previously familiar with.
We're seeing this today at Dagster where we would have engineers that previously would only work in the Python code base.
And then when it came time to deploy something, they'd like call up somebody from our platform
team and say, hey, can you help me write the Terraform to do the RDS config or whatever.
And today they're like using LLMs to generate the Terraform config, getting it reviewed
by the platform team.
And it's just much more efficient.
You know, they're just able to do more.
So I do think that the boundaries between teams and stakeholders are changing, for sure.
So we talked about guardrails.
In this new world where more people
are able to do more things, I mean,
guardrails kind of jumped in my mind.
And you talked about, OK, platform team
is reviewing the stuff that they did.
But are you thinking about that kind of more critically and even maybe
at a higher level, right? It's like, hey, as more people do more things, like we need guardrails,
we want to have more freedom, but we need guardrails and here's how we're doing that at
Dagster. Yeah, so in many ways, like, abstractions are guardrails, right? Yeah. And we often hear, you know, you talk to a data platform team
and they often say, listen, like, you know, it's our job
to build the central platform, the shared set of tools
and best practices to enable other teams to be successful.
Right.
And the, like their goal really is to give those teams
as much autonomy as possible, but like no more than that. You
know what I mean? So they want to put like guardrails that make sure the organization
stays compliant with the obligations they made to their customers and regulators. Also,
make sure that everybody stays within budget and leverages the best practices and tools to make
those teams more successful. And so very much like the platform team's job is to build these
abstractions,
build these guardrails, you know, for these stakeholder teams.
Now, the way that this has worked in the past, you know, with Dagster in particular, is we have this notion of an asset as, like, our kind of fundamental unit of composition.
And it works really well for teams that are, you know, Python-forward, right? They could take Dagster out of the box and generally be pretty successful. It also works really well for organizations, or technologies rather, where we have a really good out-of-the-box integration. So dbt is a really good example. Point Dagster at your dbt project, and your dbt developers can be Dagster users, no problem. But there is a big world out there
of like diverse stakeholders, you know,
different tools and a lot of organizations
have like their own tools that they built internally.
And, you know, they wanted a way
to basically build those guardrails
or build those abstraction layers on top of Dagster
for their stakeholder teams.
And we saw customers doing that just using Python.
They would maybe build a YAML DSL, a domain-specific language, on top of Dagster.
Maybe they would build a special Python library
that would translate their domain concepts
into Dagster concepts.
But what we had found was everybody was doing it
in slightly different ways.
The tooling was often like MVP status.
So like they didn't have a beautiful VS Code
auto-complete extension for their thing.
And it was, you know,
whatever they were able to get done in the limited time
that they had to work on this.
And so we said, listen, let's take all the stuff
that users are already doing and build them great tooling
to build their own abstractions.
And we'll ship some out-of-the-box abstractions
too for common use cases, like a DBT or data movement tool.
And so that's a thing that we built called Dagster Components.
And I think it's very interesting developing
new dev tools and new abstractions in the age of AI.
Because we consider, like, Claude
as like a user just alongside like our design partners,
right?
So we'll test an API and we'll say, hey,
did Claude actually understand this?
Was it able to, like, one-shot what we wanted to do?
And you know, you see this with tools like, if you talk to the folks at Vercel working on v0, they're probably doing similar things, right?
Was that easier or harder than you expected it to be?
What's interesting is the stuff that makes it good for LLMs
makes it good for humans too.
I was going to ask, are there parts where you've seen divergent paths there, or so far has it been like you can just kind of optimize for both at the same time?
It's 80% both at the same time and 20% different.
So, like, the way that you provide the documentation to the LLM is quite different than how you provide it to humans.
Yeah.
Right.
You want to like, actually, with humans,
you want to like give them a bunch of examples and contexts
and stuff like that.
And with an LLM, you have a finite number of tokens
that you really want to be able to burn on this sort of thing.
So the way you deliver the documentation is different.
And there's this thing called model context protocol,
which we built kind of an integration for. It lets you integrate with, like, all these LLMs
and give them kind of programmatic access
to tools and documentation.
So certainly like that's a thing specifically for LLMs,
but 80% of it is like, if we were talking to an LLM,
we would say we need to reduce the number of tokens in the context window.
And if we're talking to a human, it's like,
we want you to only have to look at one file
to be able to solve this problem or know what's going on.
And the code should be concise.
These are things that humans like,
and LLMs I think also benefit from. Similarly, LLMs, like humans, need feedback from tooling that says,
hey, did I write my code correctly?
Does it pass the schema check?
Does it, you know, initialize correctly?
And that needs to be as fast as possible
so the LLM can work, and it's the same thing for a human, right?
So what is interesting is like,
as we started to put this through the like LLM wringer,
the framework just got a lot better for humans.
So, I don't know, we live in this weird age of, like, cybernetic programming.
Yeah.
Yeah.
Well, so I think the components thing,
that conversation is really interesting.
And I just wanna make sure that I and our listeners kind of understand: is the future state here where there's going to be a specific component optimized for a specific destination, like Postgres or something, and a specific source, like Salesforce? I don't know. Is that how I should think about components? I know it's kind of an abstraction above that, but do you think that's one of the kind of practical use cases?
So let me give you an example. I can give you a couple of examples because it is an abstraction
above that, right? So we're going to ship with importing your dbt project as a component.
So there's a dbt project component. We'll ship with various BI tools and data movement tools as components. So you want to integrate with whatever ELT tool or whatever BI tool you want, there's a component for that. And we think that's going to actually reduce the time from not knowing Dagster at all to having something in production by, like, 10X, right?
But really the value is, like, we're shipping an internal components marketplace
for enterprises where like,
they're not gonna want their teams to take
any SaaS data movement tool off of the shelf, right?
They're gonna have their approved vendors that they use, or their approved technologies they use.
And so they're gonna build their own internal component
that makes it very easy for teams
to spin up their own data movement pipeline
and adhere to their best practices.
Okay.
And then bringing in Model Context Protocol, MCP,
things like that.
So I've got this marketplace,
and it's like that abstraction layer up, which is higher leverage, right? Like, you're not trying to connect directly to, you know, specific SaaS tools; you're up at the layer above, where you're working with, like, all the common extraction tools, for example, or extraction and transformation tools.
So then, like if I'm kind of an ordinary or kind of a less technical user,
I theoretically could use whatever my company uses
as far as like an LLM. There's potentially like from the Dagster side this model context protocol
which gives context to the LLM for what I'm doing here. And then I could describe in English,
hey I want to move data from here, transform it in this way and I want it to land here, for example.
That's right. Yeah. And the way to think about it, too, is, like, the person doing that, they probably spike really deep on some other technology. Maybe they're a really good financial analyst, maybe they're a really good dbt developer, or a really good machine learning person. And they're not a Dagster expert, right? So they're just like, integrate this thing with this thing.
And then like do it for them.
But there is gonna probably be a small team of Dagster experts at the company, right?
And their job is gonna be to basically build
those custom components.
And through the model context protocol integration
between, like, Dagster and Claude Code or Cursor or whatever, those Dagster data platform engineers can teach the model how to be really effective in their stack.
Right.
So it's actually like, once you see the full development workflow, it's like, I
think it's going to really change how teams develop.
I think it's just going to empower like a lot more folks to participate in like
a self-service way without creating, like, a bunch of technical debt or having to block on other teams to build part of their solution.
So here's a follow-up question then, and it's an unfair question. We've talked a lot about how data is drifting toward a lot of these workflows that are really kind of more mature, like what a front-end developer might do, or even the DevOps world. Because data is a little bit more greenfield, what is something where you're like, I think we can get better here because we know how all these other workflows work? Does that make sense? In terms of the value, are there a couple of things you're excited about that actually may be better because you get to kind of start over?
Oh yeah.
I mean, I think that everybody knows that there's like
too many tools and the integration between tools
is like a big pain in the butt.
And so, like, you know, if you have an orchestrator that understands the asset lineage in a very deep way, and understands where the data is coming from, where it's going to, where it's stored, the current status of it, whether it's passing its quality checks or not, you get a really great observability tool and data discovery tool, just automatically, right?
And it's not like we did a process
to document all our data,
we just wrote the code in this way
and we got all these capabilities.
And this is like, again,
the power of like a really good abstraction, right?
It's like, you know, we put some guardrails in place,
which probably sacrifices a little bit of power.
In exchange though, you get like a data catalog,
like out of the box.
And you get like, you know,
an understanding of the freshness of your data assets.
And like, if they fail their freshness checks,
we can like automatically remediate it
and stuff like that.
So it actually is a bit of a rethinking of the stack. When you start to go from, hey, an orchestrator is just a fancy scheduler, to, no, this is a control plane across the whole data platform, it actually does rethink what you can do and the shape of the stack.
I kind of think of it as like,
you've got your big data compute layer below here,
like Snowflake, Databricks, and stuff like that.
You've got your BI tools and data activation up here.
There's a bunch of messy stuff in the middle.
We can really help there. A control plane really tames the complexity of that messy middle.
Right. Yeah, I think that makes a lot of sense.
This is kind of a very specific question. So there's a lot. There's kind of the general flow, which we've talked through several times, where we've got what's called sources, we've got transformations, steps that are happening in the middle, and we've got data landing. What about some of the more, what do you call it, edge cases? Because I still think they're very common, but some of these, like ML and AI, where I've got unstructured data that I want to bring in. Or even maybe a more edge case: I have fairly sophisticated security and governance that I need to maintain, and I have all these SQL scripts that run to do things, or I have auditors here today. There's just all this interesting, you know, essentially long tail of people that I think are also going to be users. So, you know, you don't have to talk to all of those, but maybe pick one of those, and I'd be curious to learn more.
I mean, I'll tell you, when you zoom out, we're really in the business of taming the complexity, and we think that software engineering best practices are the way to do that. It kind of implies, like, a technical or semi-technical user, right?
Oh, there are teams, like when I was working on trust and safety at Twitter, right? There's a ton of data compliance and privacy things that you have to work through. And there's, you know, giant legal orgs that you have to interface with. I think generally, Dagster is not the tool for them.
Sure.
But for a lot of the kind of technical stakeholders
that are writing SQL or doing data analysis
or anything like that,
that's kind of really where we see like Dagster,
you know, being kind of the tool for them.
I'm not sure if I answered your question, but.
Yeah, no, I think that's helpful
because essentially like to do the abstractions well
and make them useful, you can't solve for every use case, or the abstraction is kind of bad
or not really an abstraction, right?
Right, it's a kitchen sink, right?
Yeah, exactly.
Yeah, we don't wanna do everything.
Like we designed the asset abstraction
and a couple of other abstractions around it
that make sense.
It's kind of like you define the asset abstraction,
you think about the life cycle of a data asset,
and then we can hook in with best of breed tools
or custom code from the user,
and then bring it all together into one place.
Right.
We are almost at the buzzer here, but Dagster Components is out now. If folks want to learn more or see it in action, what's the best way? Should they go to the website?
Yeah, they should go to dagster.io.
And it's an open source framework,
so you can read the documentation
and install it yourself.
Or if you request a demo from our team,
we'll get you on with an engineer,
and they can demo our commercial offering.
Super exciting.
Last question before we wrap here, Pete. You have had an extremely interesting and, I think fair to say, prolific career. You strike me as pretty humble, so you might not say that about yourself, but you have learned a lot of interesting lessons along the way.
Parting piece of advice to our listeners, somebody working in data day to day, maybe especially facing AI changing a lot of things, with the jury still out on exactly what things will look like.
But what would be kind of just a parting piece of advice that you'd give our listeners?
Yeah, I mean, it's a very interesting and broad question, but I would say just be like
an empathetic person and try to help people out.
And like in the tech industry, it's the type of thing where like helping people out indirectly
equals success, you know?
And so even if you're just totally selfish and you're totally looking out for yourself,
adopting a default strategy of being an empathetic person will be good for everybody. So that's what I would leave people with.
Yeah, that's great. Well, Pete, been an awesome show. Thank you so much for coming on. And
yeah, I'm sure the way it's going, we'll have someone else from Dagster here pretty soon.
So we'll look forward to it. But thank you, Pete.
Cool. Thanks, guys.
Yeah, thank you, Pete.