The Data Stack Show - 148: Exploring the Intersection of DAGs, ML Code, and Complex Code Bases: An Elegant Solution Unveiled with Stefan Krawczyk of DAGWorks
Episode Date: July 26, 2023
Highlights from this week's conversation include:
Stefan's background in data (2:39)
What is DAGWorks? (3:55)
How building point solutions influenced Stefan's journey (5:03)
Solving the tooling problems of self-service at an organization (11:44)
Creating Hamilton (15:53)
How Hamilton works with definitions and time-series data (19:34)
What makes Hamilton an ML-oriented framework? (23:39)
Navigating the differences between ML teams and other data teams (26:27)
Understanding the fundamentals of Hamilton (28:25)
Dealing with types and conflicts in programming (33:18)
How Hamilton helps improve pipelines and maintain data (37:11)
Why unit testing is important for a data scientist (44:54)
The ups and downs of founding and building a data solution (46:32)
Connecting with DAGWorks and trying out Hamilton (50:01)
Final thoughts and takeaways (52:46)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack.
They've been helping us put on the show for years and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in
your warehouse or data lake. You should go check it out at rudderstack.com today.
Welcome back to the Data Stack Show. Costas, super excited for the topic today. So we're
going to talk with Stefan from DAGWorks. He developed a really interesting
technology at Stitch Fix called Hamilton. And, you know, we actually haven't talked about DAGs
a ton on the show. Airflow's kind of come up here and there. And Hamilton's a fascinating take on
this where you sort of declare functions and it sort of produces a DAG that makes it much easier to test code,
understand code and actually produce code,
which is pretty fascinating.
And this is all in the Python ecosystem, ML stuff.
It's very cool.
I want to know what led Stefan from originally working on some of these end use cases, so building, you know, an experimentation platform or an experimentation framework for testing and all the data and the trimmings that go into that, to going far deeper in the stack and actually building sort of platform-level tooling that enables the building of those tools, if that makes sense.
So to me, that's a fascinating journey.
Very difficult problem to solve from a developer experience standpoint.
But yeah, excited to hear about his journey.
How about you?
Yeah. I mean, I definitely want to learn more about Hamilton, the project itself, and the whole
journey from coming up with the problem inside Stitch Fix and
ending up with like an open source project that's currently like the foundation for a
company.
So that's definitely something that I would like to chat about with Stefan, and get deeper into what Hamilton is, because these kinds of systems, similarly to what dbt also does, right, they have a lot of value, but they also rely a lot on the people using them and adopting these solutions. So I want to hear from Stefan about that, like how we can actually do this and how we can, you know, onboard people until they figure out the value of actually using something like this in their everyday work.
All right.
Well, let's dig in and talk about DAGs, Python, and Hamilton.
Let's do it.
Stefan, welcome to the Data Stack Show.
So excited to chat with you.
So many questions.
But first, of course, give us kind of your background and what led you to starting DAG
Works.
Thanks for having me.
Yeah, so DAGWorks. I'm the CEO of DAGWorks, DAG as in directed acyclic graph. We're a recent YC batch graduate, and at a high level, we're simplifying ETL pipeline management, targeting ML and AI use cases.
In terms of, you know, my background, how I got here to be CEO of a small startup: I came over here to Silicon Valley back in 2007. I did an internship at IBM and then I went to grad school at Stanford. I finished a master's in computer science right at the time when it was still classically trained, so all the deep learning stuff was just what the PhDs were doing, so I'm still kind of catching up on coursework there. But otherwise, I worked at companies like LinkedIn and Nextdoor, where I was engineer number 13 and did a lot of initial things. Then I went to a small startup, Idibon, that crashed and burned, which was a good time. But otherwise, before starting the company, I was at Stitch Fix for six years, helping data scientists streamline their model productionization efforts.
All right. Love it. And give us, just go one click deeper with DAGworks, right?
So I think a lot of our listeners are familiar with DAGs sort of in a general sense,
but you're starting a company around it. So can you go one click deeper and tell us,
What does the product do? As a startup, we're still evolving, but effectively, for the practitioners listening: if you've ever inherited an ETL or a pipeline that you're horrified by, or had to come in and debug something you've written yourself and it's failing slightly because of upstream data changes or code changes you weren't aware of because your teammate made them or something, right?
We're essentially trying to solve that problem
because we feel that you can get things to production
pretty easily these days,
but really the problem then becomes
how do you maintain and manage these over time
such that you don't slow down
and, rather than spending six months to rewrite it when someone leaves, there should be a more standard way to maintain, manage, and therefore operate these data and ML pipelines.
Yep.
I love it.
Well, tons of specific questions there, but let's rewind just a little bit.
So at Nextdoor, you said, you know, you were very early and you built a lot of first things, right?
So you sort of built the data warehousing, data lake infrastructure, testing infrastructure for experimentation, etc.
So you were sort of really on the front lines, shipping stuff that was hitting production and producing test results and all that sort of stuff. And now you're building, you know, really platform tooling for the people who are
going to enable those things. And so I would just love to hear about what, you know, tell us about
your experience at Nextdoor, you know, launching a bunch of those things. Did that influence the
way that you thought about platforms? Because I would guess, I mean, I could be way wrong, but you were building a lot of point solutions that weren't a platform and then probably eventually needed to be, you know, sort of platform tooling at scale.
Yeah, a lot to unpack there. So if I get off track, feel free to bring me in. But I want to say, before going to Nextdoor, I was actually at LinkedIn, where I had the opportunity to see a larger company with a bit of established infrastructure. So, for example, they had a Hadoop cluster, and I saw all the problems of writing jobs, trying to maintain, understand, debug, trusting a data set: can I use this to build a better model, right? And so
the allure of Nextdoor was like, hey, it's a small,
it's also a social network,
like, they're going to be building or needing
these things, can I build them out there? So that was part of the motivation
and also, like, you know,
I liked building product as much as I liked
kind of building kind of the infrastructure of things, right?
And so I think
from that perspective, you know, going from zero to one and having a blank canvas
is like, you know,
terrifying and exciting
at the same time.
Back then,
like it was a very different
environment as it is now
because now there's a lot of vendors,
a lot of off-the-shelf solutions.
But back then,
you really had to kind of,
you know,
build most of the things yourself.
I mean, AWS was just
in its infancy.
I remember getting a demo
of Snowflake
when they were just building things out. And so, yeah, at Nextdoor I got the opportunity, got the keys to AWS effectively, to try to solve business problems. The first one, for example, being we need a data warehouse, because up until that point, they were actually running queries off of the production databases. So if you were using the site on a Sunday, things could have been impacted because of the queries. Or at least they were getting to the scale where the queries, which were at least off the read replicas, were locking them up, right? And so having seen this is where, partly, you have to think of things from first principles and see how the sausage is made, as the expression goes. I think you get a better appreciation for the things that you can build on top, and then also how the decisions you make lower down eventually impact you at a higher level. So at Nextdoor, one of the things I built out was an A/B testing experimentation system, for example, and then trying to connect that all the way back with things that happen on the website so you can do inference. So it was pretty easy, or we made it much easier: if you wanted to create a change, you could feature flag it, turn it on, and then get some metrics and telemetry. Yeah. And so I guess in terms of
going to, say, a place like Stitch Fix, and I had a startup hop in between, right, I think I realized that, one, I'm not excited by being given a data set and figuring out what to do with it; I'm more excited about building the tools. So I had a great time building experimentation stuff. I had, at LinkedIn, prototyped content-based kind of recommendation infrastructure as well, right? In which case I realized that my passion was more, you know, helping other people be more successful. And so, in which case, Stitch Fix, with its allure of a lot of different modeling problems, it wasn't a shop that just wanted to optimize, say, targeted ads, right? It actually had a lot of different problems and they were hiring a lot of people to solve them.
That's great.
So when you went to Stitch Fix,
were you hired specifically to work on platform and tooling stuff?
Yeah.
Yeah.
So one of the reasons why I left Nextdoor was because I realized that machine learning wasn't quite key to the company. I could build things myself, but I wanted to be part of a team to bounce ideas off and work at a place that valued it, where it was a little more core to what the company was doing. So I actually went to an NLP-for-enterprise startup for that purpose, where I got to delve into how do you build machine learning models on top of Spark and then get them to production, right? Unfortunately, that was a good roller coaster ride, but the company ran out of money and had to fold. And so then I realized, yeah, I wanted to build more of these platforms. In which case, at Stitch Fix,
they were avant-garde at the time where they're actually hiring a platform team pretty early
to help enable and build out
kind of self-service infrastructure for data scientists.
So the model, for those who don't know: Stitch Fix is a personal styling service. So if you don't like shopping and picking out your own clothes, it's a great kind of service for you. But very early on, they had an environment where they hired data scientists who were in their own organization. They weren't attached to marketing or engineering; they were their own organization.
Oh, interesting.
And they were tasked with prototyping and then engineering the things that were required to get to production. But they were starting to hire out a platform team so that, rather than data scientists having to do a lot of the engineering work themselves, it would slowly bring in some of the abstractions and layers to help make that self-service easier. So, for example, the platform team owned Jenkins, the Spark cluster, setting up Kafka, the Redshift instance, and then helping.
And so I was part of a team that was more focused on the,
okay, how do you get machine learning
and then plug it back into the business?
So part of my journey was building actually one team
that was focused on backend kind of model deployment
another on setting up the centralized experimentation infrastructure, and a third being what we call model lifecycle,
which is end-to-end, like how do we actually speed up getting a model from development to production.
Makes total sense. Now, can we dig into the self-service piece of that a little bit? So
when you came to Stitch Fix, it sounds like culturally they had sort of committed to, we want to enable more self-service.
Can you talk about who specifically in the org needed self-service and what were the problems they were facing?
Like, what were the bottlenecks that not having tooling for self-service was creating?
Yeah.
I want to mention, so there's, I think,
a pretty reasonable kind of summary of kind of what things were at the time.
My former VP, Jeff Magnuson, wrote a post,
a pretty famous one called
Engineers Shouldn't Write ETL.
So if you haven't seen that post or haven't heard of it, go take a look at it. But effectively, part of the thesis was that, you know, being handed work thrown over a wall isn't very happy work for that person. And they're also kind of disconnected from business value. And so the idea at Stitch Fix was, well, you know, can we have the data scientist, the person who has the idea, who is also talking, say, with the business partner, do it end to end?
So at Stitch Fix, each data science team
was effectively partnered with some sort of team marketing,
operations, styling, merchandise, right?
And so they were trying to help those teams
make better decisions.
And so the thought was, you know,
iteration loops are key in terms of machine learning differentiating. So how can we speed up this loop? The easiest way to speed up this loop is if the person who's building it can also take it to production and then, you know, close the loop and then iterate and make better decisions that way. So that was really,
you know, the philosophical kind of thesis as to what it was. And so I want to say,
it wasn't necessarily like a problem. It was more like, hey, this is the framing. This is how we want to operate. In which case, then, the framing for the platform team was: how can we build in capabilities and provide an easier time for that data scientist to get more done without engineering it themselves? But we weren't on anyone's critical path. Obviously, if you want to use the Spark cluster, you have to use the cluster, but in terms of, you know, APIs to read and write, before the platform team came in, people were writing their own tools and solutions, right?
So Stitch Fix hired very capable PhDs from various walks of life who weren't computer scientists by background, but some of them knew that they could abstract things. And so, in which case, part of it was, you know, competing with data scientists' in-house abstractions and trying to gain ownership of them as a platform to better manage them.
Yeah, well, I was going to ask about that
because, you know, you're okay, self-service,
let's make the cycle time faster.
You know, that sounds really great on the surface.
But, you know, you're talking about like, you know, multiple data scientists, you know,
sort of working for different internal stakeholders who have already built some of their own tooling.
Was it challenging?
You know, was there pushback or was generally people were excited about it?
I mean, I know the tool eventually had to prove itself out and get adoption internally, but culturally, what was it like to enter that, you know, sort of mandate, I guess,
if you will?
I mean, it was, I mean, a mixed bag.
I mean, like, it depends.
So a very academic type environment, so very much open to suggestion and discussion, very
high communication bar.
So there was a weekly thing called beverage minute, where you could kind of present and talk about things, and that's where people did; that was your kind of forum to disseminate stuff. And so people were always eager to learn best practices, right? But I think, people being practically minded, if they built something and they're like, well, I don't have that problem, you know, why should I use your tool, why should I bother spending time, I mean, coming from very practical concerns of, what's in it for me, right? So that, if anything, was a bit of a challenge. If some team had a little bit of a solution but the other teams did not, you could get the other teams on, but that one team would be like, well, I don't think the opportunity cost is there yet, right?
Yep. That makes total sense. Okay. So one of the big pieces of work that
came from your efforts at Stitch Fix was Hamilton, you know, which is intimately tied to DAGWorks. So can you set the stage for where Hamilton came from inside of Stitch Fix and sort of, maybe, the particular flavor of problem that it was solving?
Yeah. So it was built for a data science team.
So one data science team, one of the oldest teams there, basically had a code base that was, you know, five or six years old at that point and had gone through a lot of team members. And so it wasn't written or structured by people with a software engineering background, but effectively they had to forecast things about the business that the business could make operational decisions on.
And so they're basically doing time series forecasting.
And what is pretty common in time series forecasting is that you are continually adding and updating the code because things change in the business.
Time moves, you know, you have to account for it, right? And so one way you do that is you write kind of inputs or features, right? And so at a high level, getting a forecast up, the pipeline or the ETL for a forecast, was, you could say, simple or pretty standard, you know, only a couple of steps. But the software challenges of adding, maintaining, updating, and changing the code that was within that macro pipeline was really the challenge and was really slowing them down. They were also operationally always under the gun, because they had to provide things the business needed to make decisions on, you know, they had to model different scenarios and certain things. And so, in which case, they weren't in a position to really, you know, do things themselves.
In which case, you know, their manager came to the platform team and was like, hey, help. And so, yeah, what I found really was that the macro pipeline wasn't the challenge. It was the code within the steps that needed to be changed and updated, right? And so this is where, yeah, getting to production was easy, but the maintenance aspect, maintaining and changing, was really the struggle. And so with Hamilton, the idea was, you know, how can we... this is a plus for work-from-home Wednesday, so if there was no work-from-home Wednesday, I might not have come up with this, but I had a full day to think about this problem. And analyzing and looking at their code, well, one of the biggest problems was they needed to create a data set, or a data frame, with thousands of columns. And because with time series forecasting it's very easy for you to create your inputs as derivatives of other columns, the ability to express the transforms was really important, and to be confident that, like, if you change one... you don't know what's downstream of it.
All the dependencies.
Yeah.
Because the code base was so big. And it wasn't, you know, that well structured, right?
And so I came up with Hamilton where effectively I was like,
I was trying to make it as simple as possible from a process perspective of
given an output,
how can you quickly and easily map that back to code?
And the definition for it, right?
And so Hamilton at a high level is a microframework
for describing data flows, right?
And so a data flow is essentially compute and data movement.
This is exactly what they're doing with their process
to create this large data frame,
given some source data,
put it through a bunch of transforms, create a table.
And so Hamilton was kind of created from that problem
of like, yeah, the software engineering need.
And I mean, I could dive into more details
of how Hamilton works,
but I'm going to first ask whether
I've given enough high-level context.
No, that's super helpful.
And one thing I actually want to drill into, because I want to hand the mic off to Kostas
in a second and dig into the guts of how Hamilton works, but we're talking about time series
data and especially around features specifically.
One of the things that's interesting about Hamilton, and maybe I'm jumping the gun a little bit here, is that being more sort of declarative rather than imperative creates a much more flexible environment, at least from my greener perspective, in terms of definitions, right? Because one of the problems with time series data and definitions is that if a definition changes, which it will, and you have a large code base, it's not that you can't get a current view of how that definition looks with your snapshot data; it's actually going back and recomputing and updating everything historically in order to, you know, rerun models and all that sort of stuff, which is really interesting.
Were you thinking a lot about the definition piece
with Hamilton and sort of making it easier
to create definitions that didn't require, you know,
like updating a hundred, you know, different points in the code?
Yeah. I mean, effectively, if you can make it really simple to map an output to code, to logic, that means there's really only one place to do it. And one part of the problem with the code base as it was before was, you know, there wasn't a good testing story, there wasn't a good documentation story, it was hard to see dependencies between things. And then when you updated something,
you didn't know, to your point,
how confident you were in what you actually changed or impacted, right?
Yeah.
Because everything was effectively in a large script
where you had to run everything to test something.
So there was this kind of real inertia to really,
or a lot of energy required to understand changes and impacts.
And so effectively, by rewriting things as functions, which we'll dig into, it helps really abstract and encapsulate what the dependencies are. And so therefore, if you are going to make a change, it's very much easier to logically reason and find, say, in the code base, you know, the upstream and downstream dependencies of this. And so you have a far more procedural, methodical way that you can then add, update, and change workflows. Whereas before, if it's a script, or whatever software engineering practices you're using, you have to take a lot more care and concern when you do that. But with Hamilton, the paradigm kind of forces you to do things in a particular way that makes this particularly beneficial for, you know, changing, updating, and maintaining.
Yeah, absolutely.
You know, it's amazing.
Even if, you know, even on teams
that really are diligent about best practices
with software engineering,
it's amazing as code bases grow,
the amount of tribal knowledge that's needed to make significant changes. You always end up with a handful of people who know all of the nooks and the crannies, and sort of that one dependency that's, you know, the killer when you push to production without tinkering with it.
One thing for the listeners, I think, since your audience is probably familiar with dbt: I want to say Hamilton's very similar, I guess, to what dbt did for SQL, right? Before dbt, it was a bit of the wild west of how you maintain and manage your SQL files, how they link together, right? How do you test and document them, right? Hamilton does pretty much the same thing, but for Python functions, Python transforms, right? And so it gives you this very opinionated, structured way where you end up actually being more productive and able to write and manage more code than you would otherwise, which I think dbt did for SQL.
Yeah, absolutely.
All right, Costas, I've been monopolizing,
and I know you have a ton of questions about how this works.
I do too.
Please.
You can get back in the conversation whenever you want,
so don't be shy
So, Stefan, first question: what makes Hamilton an ML-oriented framework? Why is it for ML and not for something else, right?
I want to say its roots are definitely in the machine-learning-oriented camp. Effectively, what I was describing was a feature engineering problem for time series forecasting, right? I mean, Hamilton, since then, we've kind of added and adjusted it to operate over any Python object type, because it was initially focused on pandas; now it isn't. I effectively call it a bit of a Swiss Army knife, in that anything you can model in a DAG, or at least anything you would draw as a workflow diagram, Hamilton's maybe one of the easiest ways to directly write code that maps to it. But specifically, you know, I think Python and machine learning are very coupled together. Software engineering practices are hard in machine learning, in which case I feel Hamilton specifically is trying to target the software engineering aspects of things, in which case I think machine learning and data work is least mature there. And so the very waffly answer is that its roots are from that,
and so therefore I think it's targeting more of those problem spaces.
But people have been applying Hamilton to much wider use cases
than just machine learning.
Yeah, yeah, 100%.
I'm always finding it very fascinating to hear from practitioners like you
about the unique challenges that the ML workloads have
compared to any other data workload, right?
I mean, Hamilton is actually a little less around workloads and more about team
process and code that helps define those things, right?
Since, you know, individuals build models or data or, you know, artifacts, right?
But teams own them, right?
And you need kind of different practices to make it work, right?
I mean, there is the infrastructure side, like do you do feature engineering over gigabytes of data.
But then there's also, well, how do you actually maintain the definition of the code to ensure that it's correct, that it can live a long, prosperous life.
When you leave, someone else can inherit it.
And so Hamilton is kind of starting from that angle first.
But definitely, I could see a future where you can, you know, use it on
Spark.
You can use it in a notebook.
You can use it in a web service anywhere that Python runs.
Right.
So definitely it has integrations and extensions that definitely also extend out to more of
the data processing side.
Yep.
Yep.
And okay.
So let's change like the question a little bit.
And instead of like talking about like the workloads, let's talk about the teams.
Like how ML teams and the people in these ML teams might be different than a data engineering team or a data infra team, right?
So tell us a little bit more about that.
Like how things are different for ML teams compared to, I don't know, a BI team, right?
I mean, there's a bit of nuance here, because it depends on whether you're applying machine learning to then go into an online setting or if it's all in an offline world, right? There are slightly different kinds of SLAs and tolerances.
Most data scientists, machine learning engineers I know don't have computer science backgrounds.
And I want to say this is probably almost even true for the data engineers I know as well, right? But effectively, you're trying to couple data and compute together in a way that yields a statistical model representation, which is some bytes in memory that you then want to ship out. How you get there and how you produce it really, I think, impacts how the company operates, how the team operates, and the ease and effectiveness with which you can quickly get results. So I want to say, yeah, there's a lot more focus, you could say, where MLOps is trying to become like a DevOps practice, right, where it's kind of giving you the guiding principles on how to operate and manage things.
And then I guess in terms of how it relates
to other things, I actually think
machine learning is a bit of a
superset of analytics workflows.
So I think the
same problems exist on the
analytics side, maybe obviously slightly different focuses
and endpoints, but effectively
you're effectively generally using the same infrastructure, or reusing it, as a better term, and then you generally have to connect and intersect with that world as well. And so I want to say it's more of a superset of that, and it therefore has slightly different challenges, because the things that you produce are more likely to then end up in other places, like, you know, online in a web service, versus analytics results, which are just served from a dashboard and looked at.
Okay, that's great. So, okay, you mentioned at some point, when we were discussing with Eric, that Hamilton is an opinionated way of doing things around ML.
You gave a very good example
for people to understand with dbt, where dbt came and put some kind of guardrails on how things should be getting done.
Can you
take us a little bit through that?
What does
this mean? How the world is
perceived from the lenses, from the point of view of Hamilton?
What is the terminology used, right? Like, is it data frames? Tell us a little bit about the vocabulary and all these things that we should know to understand the fundamentals of Hamilton.
Sure. So as I said, Hamilton's a microframework for describing data flows. I say microframework in that it's embeddable anywhere that Python runs. It doesn't contain state, and all it's doing is really helping you, you could say, orchestrate code. It is not a macro orchestration system, as opposed to something like Airflow, Prefect, or Dagster, which
Hamilton, instead, you think of things,
the units are functions.
And so rather than writing procedural code
where you're assigning, say,
a column to a data frame object,
in Hamilton, instead,
you would rewrite that as a function
where the column name is the name of the function,
and the function input arguments
declare dependencies or other things that are required as input
to compute that column.
So inherently, I guess, there's macro versus micro. I call Hamilton a micro orchestration framework, a kind of view of the world, versus macro, which is something that it isn't, right? We're writing functions that are declarative, where the function name means something and the function input arguments also declare dependencies. You're not writing scripts.
With Hamilton, there is a bit of,
well, you don't call the functions directly.
You need to write some driver code.
And so with Hamilton, the other concept is that you have this driver, right? And so, given the functions that you have written, you have to curate all your functions into Python modules. Python modules are, you could say, representations of parts of your DAG, if you think visually in terms of nodes and edges, where functions are nodes and edges are the dependencies of what's required to be passed in. That's, I guess, the nuts and bolts of Hamilton. You write functions that go into modules, but then you need a driver, some driver script, to read those modules and build this kind of DAG representation of the world. That's, you could say, the script code that you would then plug into any way that you run Python.
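To make the function-as-node idea concrete, here is a minimal sketch in the spirit of Hamilton's hello-world, split across two files: the function names become output names, and the argument names declare upstream dependencies. The module and column names are illustrative, and the driver calls follow Hamilton's documented pattern, so treat the exact signatures as an approximation rather than the definitive API.

```python
# --- my_functions.py : each function defines one node in the DAG ---
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling three-week average of spend; depends on the `spend` input."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Spend per signup; dependencies are declared purely by argument names."""
    return spend / signups


# --- run.py : the driver reads the module, builds the DAG, and executes outputs ---
import pandas as pd
from hamilton import driver
import my_functions

dr = driver.Driver({}, my_functions)  # config dict plus the module(s) holding your functions
df = dr.execute(
    ["avg_3wk_spend", "spend_per_signup"],
    inputs={"spend": pd.Series([10, 20, 30, 40]), "signups": pd.Series([1, 2, 4, 8])},
)
print(df)
```

The point is the inversion: you never call spend_per_signup yourself; you ask the driver for outputs by name and it figures out the execution order.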
I'll pause there. Any clarifications, or are you following along so far?
Yeah.
So just to make sure that
like I understand like
correctly, right.
And consider me as like a
very naive, let's say
practitioner around that
stuff.
Right.
So if I'm going to start
developing using Hamilton,
I'll start thinking in terms of columns, right? So I don't really start from the concept of having a table or something like a data frame, right? So technically I can create, let's say, independent columns, and then I can mix and match to create output data sets, in a way, right?
Yeah, so Hamilton's roots are in wrangling, say, pandas data frames. So, to go back to time series and time series data, it's very easy to think in columns when you're processing this type of data. And so with Hamilton, you can therefore think of a function as equivalent to representing a column, and the framework forces you to only have one definition.
So if you have a column name X, there's only one place that you can have X in your DAG, or there's only one node that can be called X to compute and create that, right?
So Hamilton forces you to have one declaration of this.
And so the function name is kind of equivalent to the column name, or an output you can get. But when you write that function, you haven't actually said what data comes into it; you've only declared, through the function arguments, the names of other columns or inputs that are required. So with Hamilton, you're not coupling context when you're writing these functions, and so you effectively come up with, you could say, a column definition or a feature definition that is kind of invariant to context.
The way that Hamilton then stitches things together
is through names, right?
And so if you have a column named foo
that takes in an argument bar,
Hamilton, when you go to compute foo,
will either look for a function called bar
or it will expect some input called bar to come in.
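As a quick illustration of that name-based stitching, here is a hypothetical foo/bar pair; the names and logic are made up, and the point is only that foo never references bar directly, it just asks for it by argument name.

```python
def bar(raw_value: int) -> int:
    """One way to satisfy `bar`: define it as another function in the DAG."""
    return raw_value * 2

def foo(bar: int) -> int:
    """`foo` declares a dependency on `bar` purely through its argument name."""
    return bar + 1
```

If no bar function exists in the modules you hand to the driver, Hamilton instead expects bar to be provided at execution time, for example via something like the driver's inputs argument.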
100%.
Okay, so we chain together functions, right? And create, let's say, a new column from other columns. And you said the context is not that important.
And you said the context is not that important.
When I define a function, I just link, let's say, the inputs.
Mm-hmm.
But, okay, coming again from a little bit more of a traditional programming background: how do you deal with types, for example? How do I avoid having issues with types and conflicts and stuff like that?
Yeah, so Hamilton's pretty lightweight here. When you write a function, it has to declare an output type, and the input arguments also have to be type annotated. So when Hamilton constructs the DAG of how things are chained together, it does a quick check of, hey, do these function types match? You have the flexibility to fuzzy them as much as you like. But effectively, that's at DAG construction. At runtime, there's also a brief check on inputs to the DAG to make sure that the types match at least the expected input arguments.
But otherwise, there's a bit of an assumption that if you say a function outputs a pandas data frame, it's a pandas data frame. And the reason why we don't do anything too strict there is that, well, if you want to reuse your pandas code and run it with pandas on Spark, assuming you meet that subset of the API, to everyone who's reading the code it looks like a pandas data frame, but underneath it could be a PySpark data frame wrapped in the pandas-on-Spark API. So effectively, with Hamilton, the DAG enforces types to ensure that functions match, but you have flexibility: if you really want to perturb that, you can write some code to fuzzy that up. Otherwise, at runtime, there isn't much of an enforcement check. But if you do really want that, there is also the facility of what's called a check_output annotation that you can add to a function, which can do a runtime data quality check for you. You could then check the type, the cardinality, or the values of a particular output.
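To picture what that looks like, here is a small sketch combining the type annotations with a runtime check. The decorator and parameter names follow Hamilton's documented check_output data quality feature, but treat the exact signature, and the metric itself, as assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output  # Hamilton's data-quality decorator

@check_output(data_type=np.float64, range=(0.0, 1.0), importance="warn")
def conversion_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    """Declared types are checked when the DAG is built; the decorator adds a runtime
    check that the computed values are floats within [0, 1], warning on violation."""
    return signups / visits
```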
Okay, that's cool. So let's say I want to start playing around with Hamilton, right?
And I already have some existing environment where I create pipelines and I work with my
data, right?
How do I migrate to Hamilton?
What do I have to do?
Yeah, it's a good question.
So Hamilton, as I said, it runs anywhere that Python runs.
So all you need is to really,
you know, say you're using pandas
just for the sake of argument.
You can replace however much code you want,
you know, with Hamilton.
So you can slowly, you could say,
change parts of your code base
and replace it with Hamilton code.
I mean, in terms of actually migrating, the easiest thing is to save the input data, save the target output data, and then, as you're migrating things, write transforms and functions and see whether the old way and the new way line up and match. But from an actual practicality and, you know, POC perspective, it's really up to you to scope how big of a chunk you want to move to Hamilton. Because all you need to do is just pip install the Hamilton library, the only real impediment to trying something is the time to chunk out what code you want to translate to Hamilton. But otherwise, there shouldn't be any system dependencies really stopping you.
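A minimal sketch of that migration check, assuming you have snapshotted the legacy pipeline's inputs and output to parquet files: run the Hamilton version over the same inputs and compare. The file names and the my_functions module are hypothetical, and the driver calls mirror Hamilton's documented pattern rather than a prescribed migration API.

```python
import pandas as pd
from hamilton import driver
import my_functions  # hypothetical module containing the migrated transforms

# Snapshots captured from the legacy pipeline.
legacy_inputs = pd.read_parquet("saved_inputs.parquet")
legacy_output = pd.read_parquet("saved_output.parquet")

# Recompute the same columns with the Hamilton DAG, feeding each input column by name.
dr = driver.Driver({}, my_functions)
new_output = dr.execute(
    list(legacy_output.columns),
    inputs={name: legacy_inputs[name] for name in legacy_inputs.columns},
)

# Fail loudly if the rewrite drifts from the legacy behavior.
pd.testing.assert_frame_equal(new_output[legacy_output.columns], legacy_output, check_dtype=False)
```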
Okay, that's super cool. And you mentioned at the beginning of the conversation
that, okay, well, it's one thing to build something, and it's a completely different thing to operate and maintain something, right? And that's where a lot of pain exists today. Having, let's say, a pipeline, handing this pipeline to a new engineer, trying to figure out what's going on in there, updating that, improving that, it's hard.
And from my understanding, one of the goals and the vision of Hamilton is to help with that and to
actually bring best practices that we have in software engineering
also when we work with data pipelines.
So how is this done? Let's say I've built it, right?
I've used Hamilton. I have now a pipeline that builds whatever
the input of a
service that takes like a model is.
What's next?
Like what kind of tooling do I have around Hamilton that helps me, let's say, go there and debug a pipeline, or improve a pipeline, and in general maintain the pipeline?
Yeah.
Yeah.
So good question.
So one is, I'm going to claim that a junior data scientist can write Hamilton code and no one's going to be terrified of inheriting it. Because part of, I guess, one of the things that the framework forces you to do is basically chunk things up into functions. One nice thing of chunking things up into functions is that everything is unit testable; not to say that you have to add unit tests, but if you really want to, you can. And then you also always have the function docstring, where you can add more specific documentation. Now, because everything is stitched together by naming, you're also forced to name things slightly more verbosely, so you can pretty much read the function definition and understand things, right? And so I just want to set the context of the base level of what Hamilton gives you. You can effectively think of it as a senior software engineer in your back pocket, without you having to hire one, because you're decoupling logic, it's reusable from day one because you're forced to curate modules, and then you have this great testing story. And then one of the facilities that's built into the Hamilton framework natively is that you can output a graph, a visualization, of how everything actually connects, or how a particular execution path looks, right? So with that as the base, I want to say, if someone's coming in to make a change, there isn't much extra tooling you need at a low level to be confident. So if someone's
there isn't much extra tooling you need at a low level right to to be confident so if someone's
making a change to a particular piece of logic,
it's only a single function, right?
The function, you know,
who's downstream of that
because you just need to find people,
you know, grab the code base
for whoever has that
function input arguments, right?
If you're adding something,
you know, you're not going to clobber anything
or impact anything
because it's a very separate thing
that you're creating, right?
Similarly, if you're deleting or removing things, you can also easily go through the code base to find them. So pull requests, therefore, are a little easier and simpler, because things are chunked in a way that a lot of the changes already have all the context around them, and they're not in disparate parts of the code base when a change is made.
So therefore, in terms of debugging,
because you have this DAG structure,
if there's an issue, it's pretty
methodical to debug something. So if you
see an output,
it looks funky, well,
it's very easy for you to map to where the code should be.
So if the logic in the function
looks off, you can test it,
unit test it. But if it's not,
then you know it's a function input argument, so you effectively know what was run before this. So you can then logically step through the code base: okay, well, if it's not this, then it's this; if it's not this, then it's this. And you can set a pdb.set_trace() or, you know, debugging output within it, right? And so this is where I was saying
this paradigm forces this kind of simplicity
or very structured or standardized way
of approaching it and debugging stuff,
in which case, therefore,
anyone new who comes to the code base,
they don't need to read a wall of text
and be consuming from a fire hose.
Instead, if they want to see a particular output,
they can use the tool to visualize
that particular execution path
and then just walk through the code there
or with whoever is handing it off.
So I think it really simplifies a lot of the decisions
and effectively encodes a lot of the best practices
that you would naturally have in a good code base
to make it easy for someone to come and update,
maintain, and then also debug.
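As a sketch of that visualization facility: the driver can render the whole DAG, or just one output's execution path, to a file you can hand to whoever inherits the pipeline. The method names below follow Hamilton's documented visualization helpers (they rely on graphviz), but treat the exact signatures, and the module and output names, as assumptions.

```python
from hamilton import driver
import my_functions  # hypothetical module of transforms

dr = driver.Driver({}, my_functions)

# Render every function/node in the DAG: one picture of how the code base fits together.
dr.display_all_functions("./full_dag.dot")

# Render only the execution path needed to compute a single output.
dr.visualize_execution(["spend_per_signup"], "./spend_per_signup_path.dot", {"format": "png"})
```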
Mm-hmm.
You mentioned the documentation of... I was browsing the GitHub repo for Hamilton,
and there's a very interesting matrix there
that compares the features of Hamilton with other systems.
I think it really helps someone to understand
exactly what Hamilton is.
But I want to ask about the code. You mentioned at some point that the code is always unit-testable,
right? And it's always true for Hamilton, but it's not for other systems like DBT, for example, or
Feast or Airflow. Can you elaborate a little bit more on that?
Like why with Hamilton we can do that, right?
And why we cannot with Airflow, for example?
Yeah, yeah.
It's very easy to... In systems, so, given a blank slate of Python, you can write a script, right? And one of the things that's very easy, and that most people do, is they want to get from A to B as fast as possible. In the data world, that means loading some data, doing some transforms, and then loading it back out, right? And so if you think of the context that you have just coupled together to do that: you've made an assumption of where the data is coming from, maybe that it's of a particular format or type, and the logic is now very much coupled to that particular context. So, you know, most data scientists cut and paste code rather than refactoring it for reuse, right? And that's partly because of that coupling of context. And then you've also assumed what the outputs are. So you could make that code always testable, but you need to think about it when you're writing it, right? You need to structure things in a certain way, because if you couple things, or you write functions that take in certain things, that means the unit test is a pain, because you have to mock different data loaders and APIs to make it work. Whereas with Hamilton, you're really forced to chunk things separately, or at least if there's anything complex, it's actually contained in a single function, in a single place, right? And so it is therefore much easier, if you need to write a unit test, to write it in Hamilton and have it be maintainable. Whereas in the other context, you have to think about that as you're writing it, but most people don't. And so in which case, it's a problem of inertia, and people generally add to the code base to make it look like how it already is, and so the problem just propagates. Unless you find that one person; there's generally one person in every company who really likes cleaning up code.
You find them and they want to do it.
But those people are a rarity,
in which case, for me,
I'm more of a reframe the problem
to make problems go away type of guy.
And so, in which case, with Hamilton, it's like, yeah, reframe the problem a little bit by getting you to write code a certain way to start. But then all these other problems you just don't have to deal with, because we've designed the way you write code such that testing and documentation friendliness are always true.
Yeah, and
one more question on unit tests. I want to ask this question to you because you mentioned at the beginning,
and it's very true,
that many of the practitioners in the ML
and the data science domain,
and that's also true for many of the data engineers out there,
don't necessarily come from a software engineering background.
So probably they're also like not exposed to unit testing
and why unit testing is important, right?
So why is unit testing important for a data scientist?
It's important if you have a particular logic
that you want to ensure that A, you have written correctly
and B, if someone changes that they don't, you know,
break it inadvertently, right?
And so I want to say it's not true that you always need unit tests, for simple functions, right; it's mainly for the things where you really want to enshrine the logic, and also to potentially help other people understand, like, these are the bounds of the logic. A classic example of this at, say, Stitch Fix was: you had a survey response to a particular question, and you wanted to transform that survey response into a particular input or output, right? A unit test was a great way to encapsulate and enshrine a bit of that logic, right? To ensure that, hey, if something changes, or if assumptions change, you could easily understand and see whether that test broke or not.
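As a tiny sketch of what enshrining that kind of logic looks like in practice: because a Hamilton-style transform is just a plain Python function, pytest can exercise it directly, with no data loaders or orchestration to mock. The mapping and function below are entirely hypothetical.

```python
import pandas as pd

def satisfaction_score(survey_response: pd.Series) -> pd.Series:
    """Map free-form survey answers to a numeric score; unknown answers default to neutral."""
    mapping = {"love it": 1.0, "it's fine": 0.5, "hate it": 0.0}
    return survey_response.str.lower().map(mapping).fillna(0.5)

def test_satisfaction_score_handles_unknown_answers():
    # Enshrines the assumptions: casing is ignored and unrecognized answers score as neutral.
    result = satisfaction_score(pd.Series(["Love it", "meh"]))
    assert result.tolist() == [1.0, 0.5]
```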
Cool.
So let's pause a little bit here on Hamilton. I want to ask you, because we talked a lot about Hamilton, but Hamilton is also the seed of a company that you've built, right? And today I would like to hear from you a little bit about this journey, how things started from within Stitch Fix.
As you said, there was a problem there.
We described how you started building Hamilton
to the point of today being the CEO of a company
that is building a product and a business on top of this solution.
So tell us a little bit about this experience,
how you decided to do it,
the good things around it, like whatever made you happy so far.
And if you can share also some of the bitter parts of doing this, because I'm sure it's not easy.
That would be awesome.
I went to Stanford, got bitten by the bug, and for the last decade I've been thinking about starting a company. In terms of how DAGWorks got started and the idea for it,
we did a lot of build versus buy on the platform team at Stitch Fix. So we saw a lot of vendors come in, and quite frankly, I was like, I think we actually have better ideas or assumptions, or we could even build a better product here. And so we built most things at Stitch Fix; I'd say for that reason we only brought in a few things, right? And so Hamilton actually started out more as a branding exercise. Part of it was that, of the things my team built, it was the easiest to open source, but from that perspective it was also, I guess, the most interesting. I do think it's a pretty different approach than other people are taking. And so part of it was, I think it's unique, and it just happened to be easier to open source than other things. And so we open sourced it, and the reaction from people was... well, yeah.
I honestly initially thought Hamilton was a bit of, you know, a cute metaprogramming hack in Python to get things to work, and I wasn't quite sure whether other people would get the same value out of it. Suffice to say, people did, which was exciting. And then realizing, you know, at Stitch Fix we had a hundred-plus data scientists to deal with, but with open source it's kind of like, wow, you actually have thousands of people you could potentially help and reach.
Right.
And so that was invigorating from a personal perspective of like, you know, just being
able to reach more people and, you know, and help more people.
So I think, you know, with open source, there's the challenge of actually how you start a
business around it.
I mean, if you look at other companies, you know, dbt for example, they didn't really take off until they were three or four years out in open source, right? Hamilton was actually built in 2019; we only open sourced it 18 months ago. I mean, I did know it was sticky, because the teams that used it internally at Stitch Fix loved it, but it's exciting to see its adoption grow. And so from that perspective, you know, seeing open source get adopted, me being excited by helping other people, and having been thinking about companies for the last decade, I thought now was a good time, because I still think I know something people don't: that machine learning tech debt is going to come home to roost in the next few years for all the people who brought machine learning into production and are now feeling the pains of, you know, vendor ops, as it's sometimes called, of stitching together all these MLOps solutions. And so timing, knowing something the market doesn't, and then having the passion for it were roughly the three things that led myself and the other co-creator of Hamilton to start DAGWorks.
That's awesome.
And one last quick question for me before I hand the microphone back to Eric.
Where can someone learn more, both about Hamilton and the company?
Yeah, so if you want to try out Hamilton,
we have a website called tryhamilton.dev.
It runs Pyodide, because Hamilton has a small dependency footprint, so you can actually load Python up in the browser and play around without having to install anything. Otherwise, for the DAGWorks platform that we're building around Hamilton, you can think of it at a high level as: Hamilton is the technology, and the DAGWorks platform is a product around it. You can go to dagworks.io. And by the time this releases, I think we should be taking off the beta wait list.
And so if that's still there, do sign up.
We'll get you on it quickly.
Else, hopefully we'll have more of a self-service means to kind of play around with what we
built on top of Hamilton.
That's great.
Eric, all yours.
All right.
Well, we have to ask the question,
where did the name Hamilton come from?
Good question.
So at Stitch Fix, the team that we were building this for, you know, I was going to say this was pretty fantastic. It was basically a rewrite of how they wrote code and how they pushed things.
The team was called the Forecasting, Estimation,
and Demand team, or the Fed team for short.
I had also
recently learned more about American history because the Hamilton musical had come out. I was like, what's foundational and associated with the Fed?
Well, Alexander Hamilton created the actual Federal Reserve.
Yeah, yeah.
And so then there were other names, right?
But then as I started thinking about it more, I'm like, well, Hamilton also, you know, the Fed team is also trying to model the business in a way.
So there are Hamiltonian physics concepts, right?
And then the actual implementation of what we're doing is graph Theory 101, effectively, right? And so for computer science,
there's also Hamiltonian concepts there. So I was like, oh, great. Hamilton's
probably the best name for it since it helps
tie together all these things. I love it. Well,
Stefan, this has been such a wonderful time. We've learned
so much. And thank you again for giving us a little bit of your day
to chat about DAGs, Hamilton, Python, open source, and more.
Thanks for having me.
It was a good time. In terms of being more succinct in my responses, I think that's the lesson I've learned from this podcast. I need to work on that a little bit more.
But otherwise, yeah, much appreciated for having me on and thanks for the conversation.
Anytime. You were great. But Costas, I loved the show because we covered a variety of topics
with Stefan from Dagworks and Hamilton. I think one of the most fascinating things about the show
to me was we started out
thinking we were going to talk a lot about DAGs, because DAGworks, the name of the company is
focused on DAGs. But really what's interesting is that it's not necessarily a tool for DAGs like
you would think about Airflow necessarily. It's actually a tool for
writing clean testable ML code that produces a DAG. And so the DAG is almost sort of a consequence
of an entire methodology, which is Hamilton, which is absolutely fascinating. And so I really
appreciated the way that Stefan sort of got at the heart of the problem.
It's not like we need another DAG tool, right?
We actually need a tool that solves sort of problems
with complex growing code bases at the core.
And a DAG is sort of a natural consequence of that
and a way to view the solution, but not the only one.
So I think that was my big takeaway.
I think it's a very interesting, elegant solution or way to approach the problem.
Yeah. DAGs appear everywhere with these kind of problems, right? Like anything that's like
close to a workload or there is some kind of like dependency there, there's always a DAG somewhere,
right? And similarly, again with Hamilton: the same way, if you think about dbt, dbt also is a DAG. Every dbt project is a graph that connects models with each other. The difference, of course, is that we have dbt, which lives in the SQL world, and then we have Hamilton, which lives in the Python world and is also targeting a different audience, right? So at the end, what Hamilton is trying to do is bring the value of, let's say, the guardrails that a framework like dbt is offering to the BI and analytics professionals out there, to the ML community, right? Because they also have that need, and probably in deeper complexity compared to, let's say, the BI world, just because, by nature, ML models and features have deeper dependencies on each other. So
it's very interesting to see how the patterns emerge in different sides of the industry, but at their core they remain the same.
So
I think everyone should go and take a look at Hamilton. They also have a sandbox, a playground, where you can try it online if you want, and they've started building a company on top of it, so any feedback is going to be super useful for the Hamilton folks. So I would encourage everyone to go and do it.
Definitely. And while you're checking out Hamilton, I think it's
tryhamilton.dev. Head over to The Data Stack Show on your favorite podcast app and subscribe to the Data Stack Show. Tell a friend if you haven't, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.