The Data Stack Show - The PRQL: The Data Supply Chain with Chad Sanderson of Gable.ai
Episode Date: March 25, 2024
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show prequel.
This is a short bonus episode where we preview the upcoming show.
You'll get to meet our guest and hear about the topics we're going to cover.
If they're interesting to you, you can catch the full-length show when it drops on Wednesday.
We are here with Chad Sanderson. Chad, you have a really long history working in data quality and have actually even founded a company, Gable.ai. So we have so much to talk about, but of course, we want to start at the beginning. Tell us how you got into data.

Yeah, well, great to be here with you folks. Thanks for having me on again. It's been a while, but I really enjoyed the last conversation.
And in terms of where I got started in data, I've been doing this for a pretty long time.
Started as an analyst and working at a very small company in Northern Georgia that produced grow parts,
and then ended up working as a data scientist within Oracle. And then from there, I kind of
fell in love with the infrastructure side of the house. I felt like building things for other
people to use was more validating and rewarding than trying to be a smart scientist myself and ended up doing that at a few big companies.
I worked on the data platform team at Sephora and Subway, the AI platform team over at Microsoft.
And then most recently, I led data infrastructure for a great tech company called Convoy.
That's awesome.
By the way, it's not the first time that we have you here, Chad.
So I'm very excited to continue the conversation
from where we left and many things happened since then.
But one of the things that I really want to talk with you about
is the supply chain around data and data infrastructure.
There's always a lot of focus, either on the people who are managing the infrastructure
or the people who are the downstream consumers, right?
Like the people who are the analysts or the data scientists.
But one of the parts of the supply chain that we don't talk about that much is further upstream, where the data is actually captured, generated, and transferred into the data infrastructure. And apparently many of the issues that we deal with stem from that.
There are organizational issues.
We're talking about very different
engineering teams involved there
with different goals and needs.
But at the end, all these people and these systems, they need to work together
if we want to have data that we can rely on.
So I'd love to get a little bit deeper into that and spend some time together
to talk about the importance of this, the issues there, and what we can do to
make things better, right? So that's one of the things that I'd love to hear your thoughts on.
What's on your mind? What would you like to talk about?
Well, I think that's a great topic, first of all, and it's very timely and topical. The modern data stack is still, I think, on the tip of everybody's tongue, but it's become a bit of a sour word these days. The promise, maybe five to eight years ago, was that by adopting the modern data stack, you would be able to get
all of this utility and value from data. And I think to some degree that was true. The modern
data stack did allow teams to get started with their data implementations very quickly, to move
off of their old legacy infrastructure very quickly, to get a dashboard spun up fast to answer some
questions about their product. But maintaining the system over time became challenging. And that's
where the phrase that you used, which is data supply chain, comes into play. This idea that
data is not just a pipeline, it's also people. And it's people focusing on different aspects of the data. An application developer who is emitting events to a transactional database is using data for one thing. A data engineering team that is extracting that data and potentially transforming it into some core table in the warehouse is using it for something different.
A front end engineer who is using, you know, rudder stack to emit events is doing something
totally different.
An analyst is doing something totally different.
And yet all of these people are fundamentally interconnected with each other.
And that is a supply chain. And this is very different, I think, to the way that
software engineers on the application side think about their work. In fact, they try to become as
modular and as decoupled from the rest of the organization as possible so that they can move
faster. Whereas in the data world, if you take this supply chain view, decoupling is actually
impossible. It's just not actually feasible to do because we're so reliant on transformations by other
people within the company.
And if you start looking at the pipeline as more of a supply chain, then you can begin
to make comparisons to other supply chains in the real world and see where they put their
focus.
So as a very quick example, McDonald's obviously runs a massive supply chain, and they've spent billions of dollars on quality at the producers, not the consumers. Meaning if you're a manufacturer of the beef patties that are used in their sandwiches,
you are the one that's doing quality at the sort of patty creation layer.
It's not the responsibility of the individual retailers and the stores that are putting
the patties on the buns to individually inspect every patty for quality.
You can imagine the type of cost and inefficiency issues that would lead to when the focus is speed.
And so the patty suppliers and the stores and McDonald's corporate have to be in a really
tight feedback loop with each other, communicating about compliance and regulations and governance and quality so that the end retailer doesn't
have to worry about a lot of these issues.
And the last thing I'll say about McDonald's, because I think it's such a fascinating use
case, is that the suppliers actually track, on their own, the patty needs, the volume requirements for each individual store. So when those numbers get low, they can automatically push more patties to each store as needed.
So it's a very different way of doing things,
having these tight feedback loops,
versus the way that I think most data teams operate today.
Yeah, yeah, makes sense.
Okay.
I think we have like a lot to talk about.
Eric, what do you think?
Let's do it.
Let's do it.
All right.
That's a wrap for the prequel.
The full length episode will drop Wednesday morning.
Subscribe now so you don't miss it.