The Data Stack Show - 148: Exploring the Intersection of DAGs, ML Code, and Complex Code Bases: An Elegant Solution Unveiled with Stefan Krawczyk of DAGWorks
Episode Date: July 26, 2023
Highlights from this week's conversation include:
Stefan's background in data (2:39)
What is DAGWorks? (3:55)
How building point solutions influenced Stefan's journey (5:03)
Solving the tooling problems of self-service at an organization (11:44)
Creating Hamilton (15:53)
How Hamilton works with definitions and time-series data (19:34)
What makes Hamilton an ML-oriented framework? (23:39)
Navigating the differences between ML teams and other data teams (26:27)
Understanding the fundamentals of Hamilton (28:25)
Dealing with types and conflicts in programming (33:18)
How Hamilton helps improve pipelines and maintain data (37:11)
Why unit testing is important for a data scientist (44:54)
The ups and downs of founding and building a data solution (46:32)
Connecting with DAGWorks and trying out Hamilton (50:01)
Final thoughts and takeaways (52:46)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack.
They've been helping us put on the show for years and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in
your warehouse or data lake. You should go check it out at rudderstack.com today.
Welcome back to the Data Stack Show. Costas, super excited for the topic today. So we're
going to talk with Stefan from DAGWorks. He developed a really interesting
technology at Stitch Fix called Hamilton. And, you know, we actually haven't talked about DAGs
a ton on the show. Airflow's kind of come up here and there. And Hamilton's a fascinating take on
this where you sort of declare functions and it sort of produces a DAG that makes it much easier to test code,
understand code and actually produce code,
which is pretty fascinating.
And this is all in the Python ecosystem, ML stuff.
It's very cool.
I want to know what led Stefan from originally working on some of these end use cases, so building, you know, an experimentation platform or an experimentation framework for testing and all the data and the trimmings that go into that, to going far deeper in the stack and actually building sort of platform-level tooling that enables the building of those tools, if that makes sense.
So to me, that's a fascinating journey.
Very difficult problem to solve from a developer experience standpoint.
But yeah, excited to hear about his journey.
How about you?
Yeah. I mean, I definitely want to learn more about Hamilton, the project itself, and the whole
journey from coming up with the problem inside Stitch Fix and
ending up with like an open source project that's currently like the foundation for a
company.
So that's definitely something that I would like to chat about with Stefan, and get deeper into what Hamilton is, because these kinds of systems, similarly to what dbt also does, right, they have a lot of value, but they also rely a lot on the people using them and adopting these solutions. So I want to hear from Stefan about that, like how we can actually do this and how we can, you know, onboard people until they figure out the value of actually using something like this in their everyday work.
All right.
Well, let's dig in and talk about DAGs, Python, and Hamilton.
Let's do it.
Stefan, welcome to the Data Stack Show.
So excited to chat with you.
So many questions.
But first, of course, give us kind of your background and what led you to starting DAG
Works.
Thanks for having me.
Yeah, so DAGWorks. I'm the CEO of DAGWorks, DAG as in directed acyclic graph. We're a recent YC batch graduate, and at a high level, we're simplifying ETL pipeline management, targeting ML and AI use cases.
In terms of, you know, my background, how I got here to be CEO of a small startup: I came over here to Silicon Valley back in 2007. I did an internship at IBM and then I went to grad school at Stanford. I finished a master's in computer science right at the time when it was still classically trained, so all the deep learning stuff was just what the PhDs were doing, so I'm still kind of catching up on coursework there. But otherwise, I worked at companies like LinkedIn and Nextdoor, where I was engineer number 13 and did a lot of initial things. Then I went to a small startup, Idibon, that crashed and burned, which was a good time. But otherwise, before starting the company, I was at Stitch Fix for six years, helping data scientists streamline their model productionization efforts.
All right. Love it. And give us, just go one click deeper with DAGworks, right?
So I think a lot of our listeners are familiar with DAGs sort of in a general sense,
but you're starting a company around it. So can you go one click deeper and tell us,
What does the product do? As a startup, we're still evolving, but effectively, for the practitioners listening: if you've ever inherited an ETL or a pipeline that you're horrified by, or had to come in and debug something you've written yourself and it's failing slightly because of upstream data changes or code changes you weren't aware of because your teammate made them or something, right?
We're essentially trying to solve that problem
because we feel that you can get things to production
pretty easily these days,
but really the problem then becomes
how do you maintain and manage these over time
such that you don't slow down
and, rather than spending six months to rewrite it when someone leaves, there should be a more standard way to maintain, manage, and therefore operate these data and ML pipelines.
Yep.
I love it.
Well, tons of specific questions there, but let's rewind just a little bit.
So at Nextdoor, you said, you know, you were very early and you built a lot of first things, right?
So you sort of built the data warehousing, data lake infrastructure, testing infrastructure for experimentation, etc.
So you were sort of really on the front lines, shipping stuff that was hitting production and producing test results and all that sort of stuff. And now you're building, you know, really platform tooling for the people who are
going to enable those things. And so I would just love to hear about what, you know, tell us about
your experience at Nextdoor, you know, launching a bunch of those things. Did that influence the
way that you thought about platforms? Because I would guess, I mean, I could be way wrong, but you were building a lot of point solutions that weren't a platform and then probably eventually needed to be, you know, sort of platform tooling at scale.
Yeah, a lot to unpack there. So if I get off track, feel free to bring me in. But I want to say, before going to Nextdoor, I was actually at LinkedIn, where I had the opportunity to see a larger company with a bit of established infrastructure. So, for example, they had a Hadoop cluster, and I saw all the problems of writing jobs, trying to maintain, understand, debug, trusting a data set: can I use this to build a better model, right? And so
the allure of Nextdoor was like, hey, it's a small,
it's also a social network,
like, they're going to be building or needing
these things, can I build them out there? So that was part of the motivation
and also, like, you know,
I liked building product as much as I liked
kind of building kind of the infrastructure of things, right?
And so I think
from that perspective, you know, going from zero to one and having a blank canvas
is like, you know,
terrifying and exciting
at the same time.
Back then,
like it was a very different
environment as it is now
because now there's a lot of vendors,
a lot of off-the-shelf solutions.
But back then,
you really had to kind of,
you know,
build most of the things yourself.
I mean, AWS was just
in its infancy.
I remember getting a demo
of Snowflake
when they were just building things out. And so, yeah, at Nextdoor I got the opportunity, got the keys to AWS effectively, to try to solve business problems. The first one, for example, being we need a data warehouse, because up until that point, they were actually running queries off of the production databases. So if you were using the site on a Sunday, things could have been impacted because of the queries. Or at least they were getting to the scale where the queries, which were at least off the read replicas, were locking them up, right? And so having seen this is where, partly, you have to think of things from first principles and see how the sausage is made, as the expression goes. I think you get a better appreciation for the things that you can build on top, and then also how the decisions you make lower down eventually impact you at a higher level. So at Nextdoor, one of the things I built out was an A/B testing experimentation system, for example, and then trying to connect that all the way back with things that happen on the website so you can do inference. So it was pretty easy, or we made it much easier: if you wanted to create a change, you could feature flag it, turn it on, and then get some metrics and telemetry. Yeah. And so I guess in terms of
going to, say, a place like Stitch Fix, and I had a startup hop in between, right, I think I realized that, one, I'm not excited by being given a data set and figuring out what to do with it; I'm more excited about building the tools. So I had a great time building experimentation stuff. I had, at LinkedIn, prototyped content-based kind of recommendation infrastructure as well, right? In which case I realized that my passion was more, you know, helping other people be more successful. And so, in which case, Stitch Fix, with its allure of a lot of different modeling problems, it wasn't a shop that just wanted to optimize, say, targeted ads, right? It actually had a lot of different problems and they were hiring a lot of people to solve them.
That's great.
So when you went to Stitch Fix,
were you hired specifically to work on platform and tooling stuff?
Yeah.
Yeah.
So one of the reasons why I left Nextdoor was because I realized that machine learning wasn't quite key to the company. I could build things myself, but I wanted to be part of a team to bounce ideas off and work at a place that valued it, where it was a little more core to what the company was doing. So I actually went to an NLP-for-enterprise startup for that purpose, where I got to delve into how do you build machine learning models on top of Spark and then get them to production, right? Unfortunately, that was a good roller coaster ride, but the company ran out of money and had to fold. And so then I realized, yeah, I wanted to build more of these platforms. In which case, at Stitch Fix,
they were avant-garde at the time where they're actually hiring a platform team pretty early
to help enable and build out
kind of self-service infrastructure for data scientists.
So the model, for those who don't know: Stitch Fix is a personal styling service. So if you don't like shopping and picking out your own clothes, it's a great kind of service for you. But very early on, they had an environment where they hired data scientists who were in their own organization. They weren't attached to marketing or engineering; they were their own organization.
Oh, interesting.
And they were tasked with prototyping and then engineering the things that were required to get to production. But they were starting to hire out a platform team so that, rather than data scientists having to do a lot of the engineering work themselves, it would slowly bring in some of the abstractions and layers to help make that self-service easier. So, for example, the platform team owned Jenkins, the Spark cluster, setting up Kafka, the Redshift instance, and then helping.
And so I was part of a team that was more focused on the,
okay, how do you get machine learning
and then plug it back into the business?
So part of my journey was building actually one team
that was focused on backend kind of model deployment
another on setting up the centralized experimentation infrastructure, and a third being what we call model lifecycle,
which is end-to-end, like how do we actually speed up getting a model from development to production.
Makes total sense. Now, can we dig into the self-service piece of that a little bit? So
when you came to Stitch Fix, it sounds like culturally they had sort of committed to, we want to enable more self-service.
Can you talk about who specifically in the org needed self-service and what were the problems they were facing?
Like, what were the bottlenecks that not having tooling for self-service was creating?
Yeah.
I want to mention, so there's, I think,
a pretty reasonable kind of summary of kind of what things were at the time.
My former VP, Jeff Magnuson, wrote a post,
a pretty famous one called
Engineers Shouldn't Write ETL.
So if you haven't seen that post or haven't heard of it, go take a look at it. But effectively, part of the thesis was that, you know, being handed work thrown over a wall isn't very happy work for that person. And they're also kind of disconnected from business value. And so the idea at Stitch Fix was, well, you know, can we have the data scientist, the person who has the idea, who is also talking, say, with the business partner, do it end to end?
So at Stitch Fix, each data science team
was effectively partnered with some sort of team marketing,
operations, styling, merchandise, right?
And so they were trying to help those teams
make better decisions.
And so the thought was, you know,
iteration loops are key in terms of machine learning differentiating. So how can we speed up this loop? The easiest way to speed up this loop is if the person who's building it can also take it to production and then, you know, close the loop and then iterate and make better decisions that way. So that was really,
you know, the philosophical kind of thesis as to what it was. And so I want to say,
it wasn't necessarily like a problem. It was more like, hey, this is the framing. This is how we want to operate. In which case, then, the framing for the platform team was: how can we build in capabilities and provide an easier time for that data scientist to get more done without engineering it themselves? But we weren't on anyone's critical path. Obviously, if you want to use the Spark cluster, you have to use the cluster, but in terms of, you know, APIs to read and write, before the platform team came in, people were writing their own tools and solutions, right?
So Stitch Fix hired very capable PhDs from various walks of life who weren't computer scientists by background, but some of them knew that they could abstract things. And so, in which case, part of it was, you know, competing with data scientists' in-house abstractions and trying to gain ownership of them as a platform to better manage them.
Yeah, well, I was going to ask about that
because, you know, you're okay, self-service,
let's make the cycle time faster.
You know, that sounds really great on the surface.
But, you know, you're talking about like, you know, multiple data scientists, you know,
sort of working for different internal stakeholders who have already built some of their own tooling.
Was it challenging?
You know, was there pushback or was generally people were excited about it?
I mean, I know the tool eventually had to prove itself out and get adoption internally, but culturally, what was it like to enter that, you know, sort of mandate, I guess,
if you will?
I mean, it was, I mean, a mixed bag.
I mean, like, it depends.
So a very academic type environment, so very much open to suggestion and discussion, very
high communication bar.
So there was a weekly thing called beverage minute, where you could kind of present and talk about things, and that's where people did; that was your kind of forum to disseminate stuff. And so people were always eager to learn best practices, right? But I think, people being practically minded, if they built something and they're like, well, I don't have that problem, you know, why should I use your tool, why should I bother spending time, I mean, coming from very practical concerns of, what's in it for me, right? So that, if anything, was a bit of a challenge. If some team had a little bit of a solution but the other teams did not, you could get the other teams on, but that one team would be like, well, I don't think the opportunity cost is there yet, right?
Yep. That makes total sense. Okay. So one of the big pieces of work that
came from your efforts at Stitch Fix was Hamilton, you know, which is intimately tied to DAGWorks. So can you set the stage for where Hamilton came from inside of Stitch Fix and sort of, maybe, the particular flavor of problem that it was solving?
Yeah. So it was built for a data science team.
So one data science team, one of the oldest teams there, basically had a code base that was, you know, five or six years old at that point and had gone through a lot of team members. And so it wasn't written or structured by people with a software engineering background, but effectively they had to forecast things about the business that the business could make operational decisions on.
And so they're basically doing time series forecasting.
And what is pretty common in time series forecasting is that you are continually adding and updating the code because things change in the business.
Time moves, you know, you have to account for it, right? And so one way you do that is you write kind of inputs or features, right? And so at a high level, getting a forecast up, the pipeline or the ETL for a forecast, was, you could say, simple or pretty standard, you know, only a couple of steps. But the software challenges of adding, maintaining, updating, and changing the code that was within that macro pipeline was really the challenge and was really slowing them down. They were also operationally always under the gun, because they had to provide things the business needed to make decisions on, you know, they had to model different scenarios and certain things. And so, in which case, they weren't in a position to really, you know, do things themselves.
In which case, you know, their manager came to the platform team and was like, hey, help. And so, yeah, what I found really was that the macro pipeline wasn't the challenge. It was the code within the steps that needed to be changed and updated, right? And so this is where, yeah, getting to production was easy, but the maintenance aspect, maintaining and changing, was really the struggle. And so with Hamilton, the idea was, you know, how can we... this is a plus for work-from-home Wednesday, so if there was no work-from-home Wednesday, I might not have come up with this, but I had a full day to think about this problem. And analyzing and looking at their code, well, one of the biggest problems was they needed to create a data set, or a data frame, with thousands of columns. And because with time series forecasting it's very easy for you to create your inputs as derivatives of other columns, the ability to express the transforms was really important, and to be confident that, like, if you change one... you don't know what's downstream of it.
All the dependencies.
Yeah.
Because the code base was so big. And it wasn't, you know, that well structured, right?
And so I came up with Hamilton where effectively I was like,
I was trying to make it as simple as possible from a process perspective of
given an output,
how can you quickly and easily map that back to code?
And the definition for it, right?
And so Hamilton at a high level is a microframework
for describing data flows, right?
And so a data flow is essentially compute and data movement.
This is exactly what they're doing with their process
to create this large data frame,
given some source data,
put it through a bunch of transforms, create a table.
And so Hamilton was kind of created from that problem
of like, yeah, the software engineering need.
And I mean, I could dive into more details
of how Hamilton works,
but I'm going to first ask whether
I've given enough high-level context.
No, that's super helpful.
And one thing I actually want to drill into, because I want to hand the mic off to Kostas
in a second and dig into the guts of how Hamilton works, but we're talking about time series
data and especially around features specifically.
One of the things that's interesting about Hamilton, and maybe I'm jumping the gun a little bit here, is that being more sort of declarative rather than imperative creates a much more flexible environment, at least from my greener perspective, in terms of definitions, right? Because one of the problems with time series data and definitions is that if a definition changes, which it will, and you have a large code base, it's not that you can't get a current view of how that definition looks with your snapshot data; it's actually going back and recomputing and updating everything historically in order to, you know, rerun models and all that sort of stuff, which is really interesting.
Were you thinking a lot about the definition piece
with Hamilton and sort of making it easier
to create definitions that didn't require, you know,
like updating a hundred, you know, different points in the code?
Yeah. I mean, effectively, if you can make it really simple to map an output to code, to logic, that means there's really only one place to do it. And one part of the problem with the code base as it was before was, you know, there wasn't a good testing story, there wasn't a good documentation story, it was hard to see dependencies between things. And then when you updated something,
you didn't know, to your point,
how confident you were in what you actually changed or impacted, right?
Yeah.
Because everything was effectively in a large script
where you had to run everything to test something.
So there was this kind of real inertia to really,
or a lot of energy required to understand changes and impacts.
And so effectively, by rewriting things as functions, which we'll dig into, it helps really abstract and encapsulate what the dependencies are. And so therefore, if you are going to make a change, it's very much easier to logically reason and find, say, in the code base, you know, the upstream and downstream dependencies of this. And so you have a far more procedural, methodical way that you can then add, update, and change workflows. Whereas before, if it's a script, or whatever software engineering practices you're using, you have to take a lot more care and concern when you do that. But with Hamilton, the paradigm kind of forces you to do things in a particular way that makes this particularly beneficial for, you know, changing, updating, and maintaining.
Yeah, absolutely.
You know, it's amazing.
Even if, you know, even on teams
that really are diligent about best practices
with software engineering,
it's amazing as code bases grow,
the amount of tribal knowledge that's needed to make significant changes. You always end up with a handful of people who know all of the nooks and the crannies, and sort of that one dependency that's, you know, the killer when you push to production without tinkering with it.
One thing for the listeners, I think, since your audience is probably familiar with dbt: I want to say Hamilton's very similar, I guess, to what dbt did for SQL, right? Before dbt, it was a bit of the wild west of how you maintain and manage your SQL files, how they link together, right? How do you test and document them, right? Hamilton does pretty much the same thing, but for Python functions, Python transforms, right? And so it gives you this very opinionated, structured way where you end up actually being more productive and able to write and manage more code than you would otherwise, which I think dbt did for SQL.
Yeah, absolutely.
All right, Costas, I've been monopolizing,
and I know you have a ton of questions about how this works.
I do too.
Please.
You can get back in the conversation whenever you want,
so don't be shy
So, Stefan, first question: what makes Hamilton an ML-oriented framework? Why is it for ML and not for something else, right?
I want to say its roots are definitely in the machine-learning-oriented camp. Effectively, what I was describing was a feature engineering problem for time series forecasting, right? I mean, Hamilton, since then, we've kind of added and adjusted it to operate over any Python object type, because it was initially focused on pandas; now it isn't. I effectively call it a bit of a Swiss Army knife, in that anything you can model in a DAG, or at least anything you would draw as a workflow diagram, Hamilton's maybe one of the easiest ways to directly write code that maps to it. But specifically, you know, I think Python and machine learning are very coupled together. Software engineering practices are hard in machine learning, in which case I feel Hamilton specifically is trying to target the software engineering aspects of things, in which case I think machine learning and data work is least mature there. And so the very waffly answer is that its roots are from that,
and so therefore I think it's targeting more of those problem spaces.
But people have been applying Hamilton to much wider use cases
than just machine learning.
Yeah, yeah, 100%.
I'm always finding it very fascinating to hear from practitioners like you
about the unique challenges that the ML workloads have
compared to any other data workload, right?
I mean, Hamilton is actually a little less around workloads and more about team
process and code that helps define those things, right?
Since, you know, individuals build models or data or, you know, artifacts, right?
But teams own them, right?
And you need kind of different practices to make it work, right?
I mean, there is the infrastructure side, like do you do feature engineering over gigabytes of data.
But then there's also, well, how do you actually maintain the definition of the code to ensure that it's correct, that it can live a long, prosperous life.
When you leave, someone else can inherit it.
And so Hamilton is kind of starting from that angle first.
But definitely, I could see a future where you can, you know, use it on
Spark.
You can use it in a notebook.
You can use it in a web service anywhere that Python runs.
Right.
So definitely it has integrations and extensions that definitely also extend out to more of
the data processing side.
Yep.
Yep.
And okay.
So let's change like the question a little bit.
And instead of like talking about like the workloads, let's talk about the teams.
Like how ML teams and the people in these ML teams might be different than a data engineering team or a data infra team, right?
So tell us a little bit more about that.
Like how things are different for ML teams compared to, I don't know, a BI team, right?
I mean, there's a bit of nuance here, because it depends on whether you're applying machine learning to then go into an online setting or if it's all in an offline world, right? There are slightly different kinds of SLAs and tolerances.
Most data scientists, machine learning engineers I know don't have computer science backgrounds.
And I want to say this is probably almost even true for the data engineers I know as well, right? But effectively, you're trying to couple data and compute together in a way that yields a statistical model representation, which is some bytes in memory that you then want to ship out. How you get there and how you produce it really, I think, impacts how the company operates, how the team operates, and the ease and effectiveness with which you can quickly get results. So I want to say, yeah, there's a lot more focus, you could say, where MLOps is trying to become like a DevOps practice, right, where it's kind of giving you the guiding principles on how to operate and manage things.
And then I guess in terms of how it relates
to other things, I actually think
machine learning is a bit of a
superset of analytics workflows.
So I think the
same problems exist on the
analytics side, maybe obviously slightly different focuses
and endpoints, but effectively
you're effectively generally using the same infrastructure, or reusing it, as a better term, and then you generally have to connect and intersect with that world as well. And so I want to say it's more of a superset of that, and it therefore has slightly different challenges, because the things that you produce are more likely to then end up in other places, like, you know, online in a web service, versus analytics results, which are just served from a dashboard and looked at.
Okay, that's great. So, okay, you mentioned at some point, when we were discussing with Eric, that Hamilton is an opinionated way of doing things around ML.
You gave a very good example
for people to understand with dbt, where dbt came and put some kind of guardrails on how things should be getting done.
Can you
take us a little bit through that?
What does
this mean? How the world is
perceived from the lenses, from the point of view of Hamilton?
What is the terminology used, right? Like, is it data frames? Tell us a little bit about the vocabulary and all these things that we should know to understand the fundamentals of Hamilton.
Sure. So as I said, Hamilton's a microframework for describing data flows. I say microframework in that it's embeddable anywhere that Python runs. It doesn't contain state, and all it's doing is really helping you, you could say, orchestrate code. It is not a macro orchestration system, as opposed to something like Airflow, Prefect, or Dagster, which
Hamilton, instead, you think of things,
the units are functions.
And so rather than writing procedural code
where you're assigning, say,
a column to a data frame object,
in Hamilton, instead,
you would rewrite that as a function
where the column name is the name of the function,
and the function input arguments
declare dependencies or other things that are required as input
to compute that column.
So inherently, I guess, there's macro versus micro. I call Hamilton a micro orchestration framework, a kind of view of the world, versus macro, which is something that it isn't, right? We're writing functions that are declarative, where the function name means something and the function input arguments also declare dependencies. You're not writing scripts.
With Hamilton, there is a bit of,
well, you don't call the functions directly.
You need to write some driver code.
And so with Hamilton, the other concept is that you have this driver, right? And so, given the functions that you have written, you have to curate all your functions into Python modules. Python modules are, you could say, representations of parts of your DAG, if you think visually in terms of nodes and edges, where functions are nodes and edges are the dependencies of what's required to be passed in. That's, I guess, the nuts and bolts of Hamilton. You write functions that go into modules, but then you need a driver, some driver script, to read those modules and build this kind of DAG representation of the world. That's, you could say, the script code that you would then plug into any way that you run Python.
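To make the function-as-node idea concrete, here is a minimal sketch in the spirit of Hamilton's hello-world, split across two files: the function names become output names, and the argument names declare upstream dependencies. The module and column names are illustrative, and the driver calls follow Hamilton's documented pattern, so treat the exact signatures as an approximation rather than the definitive API.

```python
# --- my_functions.py : each function defines one node in the DAG ---
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling three-week average of spend; depends on the `spend` input."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Spend per signup; dependencies are declared purely by argument names."""
    return spend / signups


# --- run.py : the driver reads the module, builds the DAG, and executes outputs ---
import pandas as pd
from hamilton import driver
import my_functions

dr = driver.Driver({}, my_functions)  # config dict plus the module(s) holding your functions
df = dr.execute(
    ["avg_3wk_spend", "spend_per_signup"],
    inputs={"spend": pd.Series([10, 20, 30, 40]), "signups": pd.Series([1, 2, 4, 8])},
)
print(df)
```

The point is the inversion: you never call spend_per_signup yourself; you ask the driver for outputs by name and it figures out the execution order.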
I'll pause there. Any clarifications, or are you following along so far?
Yeah.
So just to make sure that
like I understand like
correctly, right.
And consider me as like a
very naive, let's say
practitioner around that
stuff.
Right.
So if I'm going to start
developing using Hamilton,
I'll start thinking in terms of columns, right? So I don't really start from the concept of having a table or something like a data frame, right? So technically I can create, let's say, independent columns, and then I can mix and match to create output data sets, in a way, right?
Yeah, so Hamilton's roots are in wrangling, say, pandas data frames. So, to go back to time series and time series data, it's very easy to think in columns when you're processing this type of data. And so with Hamilton, you can therefore think of a function as equivalent to representing a column, and the framework forces you to only have one definition.
So if you have a column name X, there's only one place that you can have X in your DAG, or there's only one node that can be called X to compute and create that, right?
So Hamilton forces you to have one declaration of this.
And so the function name is kind of equivalent to the column name, or an output you can get. But when you write that function, you haven't actually said what data comes into it; you've only declared, through the function arguments, the names of other columns or inputs that are required. So with Hamilton, you're not coupling context when you're writing these functions, and so you effectively come up with, you could say, a column definition or a feature definition that is kind of invariant to context.
The way that Hamilton then stitches things together
is through names, right?
And so if you have a column named foo
that takes in an argument bar,
Hamilton, when you go to compute foo,
will either look for a function called bar
or it will expect some input called bar to come in.
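As a quick illustration of that name-based stitching, here is a hypothetical foo/bar pair; the names and logic are made up, and the point is only that foo never references bar directly, it just asks for it by argument name.

```python
def bar(raw_value: int) -> int:
    """One way to satisfy `bar`: define it as another function in the DAG."""
    return raw_value * 2

def foo(bar: int) -> int:
    """`foo` declares a dependency on `bar` purely through its argument name."""
    return bar + 1
```

If no bar function exists in the modules you hand to the driver, Hamilton instead expects bar to be provided at execution time, for example via something like the driver's inputs argument.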
100%.
Okay, so we chain together functions, right? And create, let's say, a new column from other columns. And you said the context is not that important.
And you said the context is not that important.
When I define a function, I just link, let's say, the inputs.
Mm-hmm.
But, okay, coming again from a little bit more of a traditional programming background: how do you deal with types, for example? How do I avoid having issues with types and conflicts and stuff like that?
Yeah, so Hamilton's pretty lightweight here. When you write a function, it has to declare an output type, and the input arguments also have to be type annotated. So when Hamilton constructs the DAG of how things are chained together, it does a quick check of, hey, do these function types match? You have the flexibility to fuzzy them as much as you like. But effectively, that's at DAG construction. At runtime, there's also a brief check on inputs to the DAG to make sure that the types match at least the expected input arguments.
But otherwise, there's a bit of an assumption that if you say a function outputs a pandas data frame, it's a pandas data frame. And the reason why we don't do anything too strict there is that, well, if you want to reuse your pandas code and run it with pandas on Spark, assuming you meet that subset of the API, to everyone who's reading the code it looks like a pandas data frame, but underneath it could be a PySpark data frame wrapped in the pandas-on-Spark API. So effectively, with Hamilton, the DAG enforces types to ensure that functions match, but you have flexibility: if you really want to perturb that, you can write some code to fuzzy that up. Otherwise, at runtime, there isn't much of an enforcement check. But if you do really want that, there is also the facility of what's called a check_output annotation that you can add to a function, which can do a runtime data quality check for you. You could then check the type, the cardinality, or the values of a particular output.
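To picture what that looks like, here is a small sketch combining the type annotations with a runtime check. The decorator and parameter names follow Hamilton's documented check_output data quality feature, but treat the exact signature, and the metric itself, as assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from hamilton.function_modifiers import check_output  # Hamilton's data-quality decorator

@check_output(data_type=np.float64, range=(0.0, 1.0), importance="warn")
def conversion_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    """Declared types are checked when the DAG is built; the decorator adds a runtime
    check that the computed values are floats within [0, 1], warning on violation."""
    return signups / visits
```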
Okay, that's cool. So let's say I want to start playing around with Hamilton, right?
And I already have some existing environment where I create pipelines and I work with my
data, right?
How do I migrate to Hamilton?
What do I have to do?
Yeah, it's a good question.
So Hamilton, as I said, it runs anywhere that Python runs.
So all you need is to really,
you know, say you're using pandas
just for the sake of argument.
You can replace however much code you want,
you know, with Hamilton.
So you can slowly, you could say,
change parts of your code base
and replace it with Hamilton code.
I mean, in terms of actually migrating, the easiest thing is to save the input data, save the target output data, and then, as you're migrating things, write transforms and functions and see whether the old way and the new way line up and match. But from an actual practicality and, you know, POC perspective, it's really up to you to scope how big of a chunk you want to move to Hamilton. Because all you need to do is just pip install the Hamilton library, the only real impediment to trying something is the time to chunk out what code you want to translate to Hamilton. But otherwise, there shouldn't be any system dependencies really stopping you.
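A minimal sketch of that migration check, assuming you have snapshotted the legacy pipeline's inputs and output to parquet files: run the Hamilton version over the same inputs and compare. The file names and the my_functions module are hypothetical, and the driver calls mirror Hamilton's documented pattern rather than a prescribed migration API.

```python
import pandas as pd
from hamilton import driver
import my_functions  # hypothetical module containing the migrated transforms

# Snapshots captured from the legacy pipeline.
legacy_inputs = pd.read_parquet("saved_inputs.parquet")
legacy_output = pd.read_parquet("saved_output.parquet")

# Recompute the same columns with the Hamilton DAG, feeding each input column by name.
dr = driver.Driver({}, my_functions)
new_output = dr.execute(
    list(legacy_output.columns),
    inputs={name: legacy_inputs[name] for name in legacy_inputs.columns},
)

# Fail loudly if the rewrite drifts from the legacy behavior.
pd.testing.assert_frame_equal(new_output[legacy_output.columns], legacy_output, check_dtype=False)
```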
Okay, that's super cool. And you mentioned at the beginning of the conversation
that, okay, well, it's one thing to build something, and it's a completely different thing to operate and maintain something, right? And that's where a lot of pain exists today. Having, let's say, a pipeline, handing this pipeline to a new engineer, trying to figure out what's going on in there, updating that, improving that, it's hard.
And from my understanding, one of the goals and the vision of Hamilton is to help with that and to
actually bring best practices that we have in software engineering
also when we work with data pipelines.
So how is this done? Let's say I've built it, right?
I've used Hamilton. I have now a pipeline that builds whatever
the input of a
service that takes like a model is.
What's next?
Like what kind of tooling do I have around Hamilton that helps me, let's say, go there and debug a pipeline, or improve a pipeline, and in general maintain the pipeline?
Yeah.
Yeah.
So good question.
So one is, I'm going to claim that a junior data scientist can write Hamilton code and no one's going to be terrified of inheriting it. Because part of, I guess, one of the things that the framework forces you to do is basically chunk things up into functions. One nice thing of chunking things up into functions is that everything is unit testable; not to say that you have to add unit tests, but if you really want to, you can. And then you also always have the function docstring, where you can add more specific documentation. Now, because everything is stitched together by naming, you're also forced to name things slightly more verbosely, so you can pretty much read the function definition and understand things, right? And so I just want to set the context of the base level of what Hamilton gives you. You can effectively think of it as a senior software engineer in your back pocket, without you having to hire one, because you're decoupling logic, it's reusable from day one because you're forced to curate modules, and then you have this great testing story. And then one of the facilities that's built into the Hamilton framework natively is that you can output a graph, a visualization, of how everything actually connects, or how a particular execution path looks, right? So with that as the base, I want to say, if someone's coming in to make a change, there isn't much extra tooling you need at a low level to be confident. So if someone's
there isn't much extra tooling you need at a low level right to to be confident so if someone's
making a change to a particular piece of logic,
it's only a single function, right?
The function, you know,
who's downstream of that
because you just need to find people,
you know, grab the code base
for whoever has that
function input arguments, right?
If you're adding something,
you know, you're not going to clobber anything
or impact anything
because it's a very separate thing
that you're creating, right?
Similarly, if you're deleting or removing things, you can also easily go through the code base to find them. So pull requests, therefore, are a little easier and simpler, because things are chunked in a way that a lot of the changes already have all the context around them, and they're not in disparate parts of the code base when a change is made.
So therefore, in terms of debugging,
because you have this DAG structure,
if there's an issue, it's pretty
methodical to debug something. So if you
see an output,
it looks funky, well,
it's very easy for you to map to where the code should be.
So if the logic in the function
looks off, you can test it,
unit test it. But if it's not,
then you know it's a function input argument, so you effectively know what was run before this. So you can then logically step through the code base: okay, well, if it's not this, then it's this; if it's not this, then it's this. And you can set a pdb.set_trace() or, you know, debugging output within it, right? And so this is where I was saying
this paradigm forces this kind of simplicity
or very structured or standardized way
of approaching it and debugging stuff,
in which case, therefore,
anyone new who comes to the code base,
they don't need to read a wall of text
and be consuming from a fire hose.
Instead, if they want to see a particular output,
they can use the tool to visualize
that particular execution path
and then just walk through the code there
or with whoever is handing it off.
So I think it really simplifies a lot of the decisions
and effectively encodes a lot of the best practices
that you would naturally have in a good code base
to make it easy for someone to come and update,
maintain, and then also debug.
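As a sketch of that visualization facility: the driver can render the whole DAG, or just one output's execution path, to a file you can hand to whoever inherits the pipeline. The method names below follow Hamilton's documented visualization helpers (they rely on graphviz), but treat the exact signatures, and the module and output names, as assumptions.

```python
from hamilton import driver
import my_functions  # hypothetical module of transforms

dr = driver.Driver({}, my_functions)

# Render every function/node in the DAG: one picture of how the code base fits together.
dr.display_all_functions("./full_dag.dot")

# Render only the execution path needed to compute a single output.
dr.visualize_execution(["spend_per_signup"], "./spend_per_signup_path.dot", {"format": "png"})
```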
Mm-hmm.
You mentioned the documentation of... I was browsing the GitHub repo for Hamilton,
and there's a very interesting matrix there
that compares the features of Hamilton with other systems.
I think it really helps someone to understand
exactly what Hamilton is.
But I want to ask about the code. You mentioned at some point that the code is always unit-testable,
right? And it's always true for Hamilton, but it's not for other systems like DBT, for example, or
Feast or Airflow. Can you elaborate a little bit more on that?
Like why with Hamilton we can do that, right?
And why we cannot with Airflow, for example?
Yeah, yeah.
It's very easy to... In systems, so, given a blank slate of Python, you can write a script, right? And one of the things that's very easy, and that most people do, is they want to get from A to B as fast as possible. In the data world, that means loading some data, doing some transforms, and then loading it back out, right? And so if you think of the context that you have just coupled together to do that: you've made an assumption of where the data is coming from, maybe that it's of a particular format or type, and the logic is now very much coupled to that particular context. So, you know, most data scientists cut and paste code rather than refactoring it for reuse, right? And that's partly because of that coupling of context. And then you've also assumed what the outputs are. So you could make that code always testable, but you need to think about it when you're writing it, right? You need to structure things in a certain way, because if you couple things, or you write functions that take in certain things, that means the unit test is a pain, because you have to mock different data loaders and APIs to make it work. Whereas with Hamilton, you're really forced to chunk things separately, or at least if there's anything complex, it's actually contained in a single function, in a single place, right? And so it is therefore much easier, if you need to write a unit test, to write it in Hamilton and have it be maintainable. Whereas in the other context, you have to think about that as you're writing it, but most people don't. And so in which case, it's a problem of inertia, and people generally add to the code base to make it look like how it already is, and so the problem just propagates. Unless you find that one person; there's generally one person in every company who really likes cleaning up code.
You find them and they want to do it.
But those people are a rarity,
in which case, for me,
I'm more of a reframe the problem
to make problems go away type of guy.
And so, in which case, with Hamilton, it's like, yeah, reframe the problem a little bit by getting you to write code a certain way to start. But then all these other problems you just don't have to deal with, because we've designed the way you write code such that testing and documentation friendliness are always true.
Yeah, and
one more question on unit tests. I want to ask this question to you because you mentioned at the beginning,
and it's very true,
that many of the practitioners in the ML
and the data science domain,
and that's also true for many of the data engineers out there,
don't necessarily come from a software engineering background.
So probably they're also like not exposed to unit testing
and why unit testing is important, right?
So why is unit testing important for a data scientist?
It's important if you have a particular logic
that you want to ensure that A, you have written correctly
and B, if someone changes that they don't, you know,
break it inadvertently, right?
And so I want to say it's not true that you always need unit tests, for simple functions, right; it's mainly for the things where you really want to enshrine the logic, and also to potentially help other people understand, like, these are the bounds of the logic. A classic example of this at, say, Stitch Fix was: you had a survey response to a particular question, and you wanted to transform that survey response into a particular input or output, right? A unit test was a great way to encapsulate and enshrine a bit of that logic, right? To ensure that, hey, if something changes, or if assumptions change, you could easily understand and see whether that test broke or not.
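As a tiny sketch of what enshrining that kind of logic looks like in practice: because a Hamilton-style transform is just a plain Python function, pytest can exercise it directly, with no data loaders or orchestration to mock. The mapping and function below are entirely hypothetical.

```python
import pandas as pd

def satisfaction_score(survey_response: pd.Series) -> pd.Series:
    """Map free-form survey answers to a numeric score; unknown answers default to neutral."""
    mapping = {"love it": 1.0, "it's fine": 0.5, "hate it": 0.0}
    return survey_response.str.lower().map(mapping).fillna(0.5)

def test_satisfaction_score_handles_unknown_answers():
    # Enshrines the assumptions: casing is ignored and unrecognized answers score as neutral.
    result = satisfaction_score(pd.Series(["Love it", "meh"]))
    assert result.tolist() == [1.0, 0.5]
```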
Cool.
So let's pause a little bit here on Hamilton. I want to ask you, because we talked a lot about Hamilton, but Hamilton is also the seed of a company that you've built, right? And today I would like to hear from you a little bit about this journey, how things started from within Stitch Fix.
As you said, there was a problem there.
We described how you started building Hamilton
to the point of today being the CEO of a company
that is building a product and a business on top of this solution.
So tell us a little bit about this experience,
how you decided to do it,
the good things around it, like whatever made you happy so far.
And if you can share also some of the bitter parts of doing this, because I'm sure it's not easy.
That would be awesome.
I went to Stanford, got bitten by the bug, and for the last decade I've been thinking about starting a company. In terms of how DAGWorks got started and the idea for it,
we did a lot of build versus buy on the platform team at Stitch Fix. So we saw a lot of vendors come in, and quite frankly, I was like, I think we actually have better ideas or assumptions, or we could even build a better product here. And so we built most things at Stitch Fix; I'd say for that reason we only brought in a few things, right? And so Hamilton actually started out more as a branding exercise. Part of it was that, of the things my team built, it was the easiest to open source, but from that perspective it was also, I guess, the most interesting. I do think it's a pretty different approach than other people are taking. And so part of it was, I think it's unique, and it just happened to be easier to open source than other things. And so we open sourced it, and the reaction from people was... well, yeah.
I honestly initially thought Hamilton was a bit of, you know, a cute metaprogramming hack in Python to get things to work, and I wasn't quite sure whether other people would get the same value out of it. Suffice to say, people did, which was exciting. And then realizing, you know, at Stitch Fix we had a hundred-plus data scientists to deal with, but with open source it's kind of like, wow, you actually have thousands of people you could potentially help and reach.
Right.
And so that was invigorating from a personal perspective of like, you know, just being
able to reach more people and, you know, and help more people.
So I think, you know, with open source, there's the challenge of actually how you start a
business around it.
I mean, if you look at other companies, you know, dbt for example, they didn't really take off until they were three or four years out in open source, right? Hamilton was actually built in 2019; we only open sourced it 18 months ago. I mean, I did know it was sticky, because the teams that used it internally at Stitch Fix loved it, but it's exciting to see its adoption grow. And so from that perspective, you know, seeing open source get adopted, me being excited by helping other people, and having been thinking about companies for the last decade, I thought now was a good time, because I still think I know something people don't: that machine learning tech debt is going to come home to roost in the next few years for all the people who brought machine learning into production and are now feeling the pains of, you know, vendor ops, as it's sometimes called, of stitching together all these MLOps solutions. And so timing, knowing something the market doesn't, and then having the passion for it were roughly the three things that led myself and the other co-creator of Hamilton to start DAGWorks.
That's awesome.
And one last quick question for me before I hand the microphone back to Eric.
Where can someone learn more, both about Hamilton and the company?
Yeah, so if you want to try out Hamilton,
we have a website called tryhamilton.dev.
It runs Pyodide, because Hamilton has a small dependency footprint, so you can actually load Python up in the browser and play around without having to install anything. Otherwise, for the DAGWorks platform that we're building around Hamilton, you can think of it at a high level as: Hamilton is the technology, and the DAGWorks platform is a product around it. You can go to dagworks.io. And by the time this releases, I think we should be taking off the beta wait list.
And so if that's still there, do sign up.
We'll get you on it quickly.
Else, hopefully we'll have more of a self-service means to kind of play around with what we
built on top of Hamilton.
That's great.
Eric, all yours.
All right.
Well, we have to ask the question,
where did the name Hamilton come from?
Good question.
So at Stitch Fix, the team that we were building this for, you know, I was going to say this was pretty fantastic. It was basically a rewrite of how they wrote code and how they pushed things.
The team was called the Forecasting, Estimation,
and Demand team, or the Fed team for short.
I had also
recently learned more about American history because the Hamilton musical had come out. I was like, what's foundational and associated with the Fed?
Well, Alexander Hamilton created the actual Federal Reserve.
Yeah, yeah.
And so then there were other names, right?
But then as I started thinking about it more, I'm like, well, Hamilton also, you know, the Fed team is also trying to model the business in a way.
So there are Hamiltonian physics concepts, right?
And then the actual implementation of what we're doing is graph Theory 101, effectively, right? And so for computer science,
there's also Hamiltonian concepts there. So I was like, oh, great. Hamilton's
probably the best name for it since it helps
tie together all these things. I love it. Well,
Stefan, this has been such a wonderful time. We've learned
so much. And thank you again for giving us a little bit of your day
to chat about DAGs, Hamilton, Python, open source, and more.
Thanks for having me.
It was a good time. In terms of being more succinct in my responses, I think that's the lesson I've learned from this podcast. I need to work on that a little bit more.
But otherwise, yeah, much appreciated for having me on and thanks for the conversation.
Anytime. You were great. But Costas, I loved the show because we covered a variety of topics
with Stefan from Dagworks and Hamilton. I think one of the most fascinating things about the show
to me was we started out
thinking we were going to talk a lot about DAGs, because DAGworks, the name of the company is
focused on DAGs. But really what's interesting is that it's not necessarily a tool for DAGs like
you would think about Airflow necessarily. It's actually a tool for
writing clean testable ML code that produces a DAG. And so the DAG is almost sort of a consequence
of an entire methodology, which is Hamilton, which is absolutely fascinating. And so I really
appreciated the way that Stefan sort of got at the heart of the problem.
It's not like we need another DAG tool, right?
We actually need a tool that solves sort of problems
with complex growing code bases at the core.
And a DAG is sort of a natural consequence of that
and a way to view the solution, but not the only one.
So I think that was my big takeaway.
I think it's a very interesting, elegant solution or way to approach the problem.
Yeah. DAGs appear everywhere with these kind of problems, right? Like anything that's like
close to a workload or there is some kind of like dependency there, there's always a DAG somewhere,
right? And similarly, again with Hamilton: the same way, if you think about dbt, dbt also is a DAG. Every dbt project is a graph that connects models with each other. The difference, of course, is that we have dbt, which lives in the SQL world, and then we have Hamilton, which lives in the Python world and is also targeting a different audience, right? So at the end, what Hamilton is trying to do is bring the value of, let's say, the guardrails that a framework like dbt is offering to the BI and analytics professionals out there, to the ML community, right? Because they also have that need, and probably in deeper complexity compared to, let's say, the BI world, just because, by nature, ML models and features have deeper dependencies on each other. So
it's very interesting to see how the patterns emerge in different sides of the industry, but at their core they remain the same.
So
I think everyone should go and take a look at Hamilton. They also have a sandbox, a playground, where you can try it online if you want, and they've started building a company on top of it, so any feedback is going to be super useful for the Hamilton folks. So I would encourage everyone to go and do it.
Definitely. And while you're checking out Hamilton, I think it's
tryhamilton.dev. Head over to The Data Stack Show on your favorite podcast app and subscribe to the Data Stack Show. Tell a friend if you haven't, and we will catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.