The Data Stack Show - The PRQL: Does Machine Learning Need Its Own Orchestrator? Featuring Sandy Ryza of Dagster
Episode Date: January 2, 2024
In this bonus episode, Eric and Kostas preview their upcoming conversation with Sandy Ryza of Dagster. ...
Transcript
Welcome to the Data Stack Show prequel.
This is a short bonus episode where we preview the upcoming show.
You'll get to meet our guest and hear about the topics we're going to cover.
If they're interesting to you, you can catch the full-length show when it drops on Wednesday.
We are here with Sandy Ryza from Dagster Labs.
Sandy, so excited to chat with you about data ops, workflows, data pipelines, all of the above. Thanks for coming on the show.
Thanks for having me. Excited to chat with you.
All right. Well, give us your background briefly.
Yeah, so I'm presently the lead engineer on the Dagster project. And I think we can talk a little bit more later about what the Dagster project is, for those who aren't familiar.
Earlier in my career, I had a mix of roles that involved building data infrastructure, building tools that would help data practitioners, and working as a data practitioner and machine learning engineer myself. I started my career at Cloudera. While I was there, I wrote this book, Advanced Analytics with Spark, that taught how to use that particular framework to do machine learning. And then I spent a number of years as a practicing data scientist at Clover Health and Motive, which used to be called KeepTruckin, and also worked in public transit software before finding myself back in the data tooling space at Dagster Labs.
That's awesome, Sandy. And I think we're going to have a lot to talk about, but something that I'm particularly interested in going deeper into is the role of, let's say, an orchestrator in the lifecycle of data: defining it, why we need it, why it has to be an external tool, right? And it's not part of the query engine, for example.
And also why we currently have such a diverse, let's say, number of solutions out there, especially when we are considering the more traditional data-related operations and the ML operations. And we even see new orchestrators coming out that are focusing just on the ML side. Why do we need that when we have, let's say, something that already works for data? And I'd love to hear and learn from you why that is and what it means for the practitioners out there, right?
What's on your mind, though? What would you like to chat about and get deeper into during our conversation?
Yeah, the topic that you brought up is one that I've thought about quite a bit, both from this perspective of being a machine learning engineer and from this perspective of working on tools for machine learning engineers. And, you know, I think we can get into this later, but the fact that I ended up working on a general purpose orchestrator kind of says a lot about how I view the role of orchestration and data pipelines in the machine learning engineering domain. So really excited to talk about that.
Excited to also talk about orchestration in general and what it means to build a data pipeline and the relevance of that to different roles, like data engineers, machine learning engineers, data scientists.
Yeah, that's awesome. I think we have a lot to talk about. What do you think, Eric?
Yeah, let's get to it.
Alright, that's a wrap for the prequel.
The full-length episode will drop Wednesday morning.
Subscribe now so you don't miss it.