The Data Stack Show - The PRQL: Does Machine Learning Need Its Own Orchestrator? Featuring Sandy Ryza of Dagster

Episode Date: January 2, 2024

In this bonus episode, Eric and Kostas preview their upcoming conversation with Sandy Ryza of Dagster. ...

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show prequel. This is a short bonus episode where we preview the upcoming show. You'll get to meet our guest and hear about the topics we're going to cover. If they're interesting to you, you can catch the full-length show when it drops on Wednesday. We are here with Sandy Ryza from Dagster Labs. Sandy, so excited to chat with you about data ops, workflows, data pipelines, all of the above. Thanks for coming on the show. Thanks for having me. Excited to chat with you. All right. Well, give us your background briefly. Yeah, so I'm presently the lead engineer on the Dagster project. And I think we can talk a little bit more about what the Dagster project is for those
Starting point is 00:00:50 who aren't familiar later. Earlier in my career, I had a mix of roles that involved building data infrastructure, building tools that would help data practitioners, and working as a data practitioner, a machine learning engineer, myself. I started my career at Cloudera. While I was there, I wrote this book, Advanced Analytics with Spark, that taught how to use that particular framework to do machine learning. And then spent a number of years as a practicing data scientist at Clover Health,
Starting point is 00:01:20 Motive, which used to be called KeepTruckin, and also worked in public transit software before finding myself back in the data tooling space at Dagster Labs. That's awesome, Sandy. And I think we're going to have a lot to talk about, but something that I'm particularly interested in going deeper into is the role of,
Starting point is 00:01:43 let's say, an orchestrator in the lifecycle of data, like defining it, why we need it, why it has to be like an external tool, right? And it's not part of the query engine, for example. And also why currently we have such a diverse, let's say, number of solutions out there, especially when we are considering the more traditional data-related operations and the ML operations. And we even see new orchestrators coming out that are focusing just on the ML side. Why we need that when we have, let's say,
Starting point is 00:02:18 something that already works for data. And I'd love to hear and learn from you why is that and what it means for the practitioners out there, right? What's in your mind, though? What you would like to chat and get deeper into during our conversation? Yeah, the topic that you brought up
Starting point is 00:02:42 is one that I've thought about quite a bit, both from this perspective of being a machine learning engineer and from this perspective of working on tools for machine learning engineers. And, you know, I think we can get into this later, but the fact that I ended up working on a general purpose orchestrator kind of says a lot about how I view the role of orchestration and data pipelines in the machine learning engineering domain. So really excited to talk about that. Excited to also talk about orchestration in general and what it means to build a data pipeline and the relevance of that to different roles, like data engineers, machine learning engineers, data scientists.
Starting point is 00:03:25 Yeah, that's awesome. I think we have a lot to talk about. What do you think, Eric? Yeah, let's get to it. Alright, that's a wrap for the prequel. The full-length episode will drop Wednesday morning. Subscribe now so you don't miss it.
