The Data Stack Show - 183: Why Modern Data Quality Must Move Beyond Traditional Data Management Practices with Chad Sanderson of Gable.ai
Episode Date: March 27, 2024

Highlights from this week's conversation include:
- Chad's background and journey in data (0:46)
- Importance of Data Supply Chain (2:19)
- Challenges with Modern Data Stack (3:28)
- Comparing Data Supply Chain to Real-world Supply Chains (4:49)
- Overview of Gable.ai (8:05)
- Rethinking Data Catalogs (11:42)
- New Ideas for Managing Data (15:16)
- Data Discovery and Governance Challenges (18:51)
- Static Code Analysis and AI Impact on Data (24:55)
- Creating Contracts and Defining Data Lineage (27:31)
- Data Quality Issues and Upstream Problems (32:32)
- Challenges with Third-Party Vendors and External Data (34:29)
- Incentivizing Engineers for Data Quality (40:28)
- Feedback Loops and Actionability in Data Catalogs (45:30)
- Missing metadata (48:57)
- Role of AI in data semantics (50:27)
- Data as a product (54:26)
- Slowing down to go faster (57:38)
- Quantifying the cost of data changes (1:01:24)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We are here with Chad Sanderson. Chad, you have a
really long history working in data quality and have actually even founded a company,
Gable.ai. So we have so much to talk about, but of course we want to start at the beginning. Tell us how you got into data in the beginning. Yeah, well, great to be here with you folks. Thanks for having me on again.
It's been a while, but I really enjoyed the last conversation. And in terms of where I got started
in data, I've been doing this for a pretty long time. I started as an analyst, working at a very small company in northern Georgia that produced grow parts, and then ended up working as a data scientist at Oracle.
And then from there, I kind of fell in love with the infrastructure side of the house. I felt like
building things for other people to use was more validating and rewarding than trying to be a smart
scientist myself and ended up doing that at a few big companies. I worked on the data platform team
at Sephora and Subway, the AI platform team over at Microsoft. And then most recently, I led
data infrastructure for a great tech company called Convoy.
That's awesome.
By the way, I mean, it's not the first time that we have you here, Chad.
So I'm very excited to continue the conversation from where we left.
Many things happened since then.
But one of the things that I really want to talk with you about is the supply chain around data and data infrastructure.
There's always a lot of focus either on the people who are managing the infrastructure or the people who are the downstream consumers, right?
Like the people who are the analysts or the data scientists.
But one of the parts of the supply chain that we don't talk about that much is further and further upstream,
where the data is actually captured, generated, and transferred into the data infrastructure.
And apparently, many of the issues that we deal with stem from that.
There are organizational issues.
We're talking about very different engineering teams involved
there with different goals
and needs.
But at the end, all
these people and these systems, they need to work
together if we want to have data that
we can rely on.
So I'd love to
get a little bit deeper into that
and spend some time together to talk
about the importance of this, the issues there, and what we can do to make things better.
So that's one of the things that I'd love to hear your thoughts on.
What's on your mind? What would you like to talk about?
Well, I think that's a great topic, first of all. And it's very timely and topical. You know, the modern data stack is still,
I think, on the tip of everybody's tongue.
But it's become a bit of a sour word these days, I think.
There was a belief maybe five to eight years ago that by adopting the modern data stack,
you would be able to get all of this utility and value from data. And I think to some degree,
that was true. The modern data stack did allow teams to get started with their data implementations
very quickly, to move off of their old legacy infrastructure very quickly, to get a dashboard spun up fast, to answer some questions about their product.
But maintaining the system over time became challenging. And that's where the phrase that
you used, which is data supply chain, comes into play. This idea that data is not just a pipeline. It's also people.
And it's people focusing on different aspects of the data.
An application developer who is emitting events to a transactional database is using data for one thing.
A data engineering team that is extracting that data and potentially transforming
it into some core table in the warehouse is using it for something different.
A front-end engineer who's using, you know, RudderStack to emit events is doing something
totally different.
An analyst is doing something totally different.
And yet all of these people are fundamentally interconnected with each other.
And that is a supply chain.
And this is very different, I think,
to the way that software engineers
on the application side think about their work.
In fact, they try to become as modular
and as decoupled from the rest of the organization
as possible so that they can move faster.
Whereas in the data world,
if you take this supply chain view,
decoupling is actually impossible.
It's just not actually feasible to do
because we're so reliant on transformations
by other people within the company.
And if you start looking at the pipeline
as more of a supply chain,
then you can begin to make comparisons
to other supply chains in the real world
and see where they put their focus.
So as a very quick example,
McDonald's is obviously a massive supply chain, and they've spent billions of dollars in optimizing that supply chain over
years. One of the most interesting things that I found is that when we talk about quality,
McDonald's tries to put the primary burden of quality onto the producers, not the consumers.
Meaning if you're a manufacturer of the beef patties that are used in their sandwiches,
you are the one that's doing quality at the sort of patty creation layer.
It's not the responsibility of the individual retailers and the stores that are putting
the patties on the buns to individually inspect every patty for quality. You can imagine the type of cost and inefficiency issues that
would lead to when the focus is speed. And so the patty suppliers and the stores and McDonald's
corporate have to be in a really tight feedback loop with each other, communicating about compliance and regulations and governance
and quality so that the end retailer doesn't have to worry about a lot of these issues. And the last thing I'll say about McDonald's, because I think it's
such a fascinating use case, is that the suppliers actually track, on their own, the patty needs, the volume requirements for each individual
store. So when those numbers get low, they can automatically push more patties to each store
when it's needed. So it's a very different way of doing things, having these tight feedback loops
versus the way that I think most data teams operate today. Yeah, yeah, makes sense. Okay, I think we have like a lot to talk about.
Eric, what do you think? Let's do it. Let's do it. We love having guests back on, especially when
they've tackled really exciting things in between their first time on the show and their second time
on the show. And you actually founded a company called Gable.ai. And so we have tons to talk
about in terms of data quality generally, but I do not want to keep our listeners on the edge of their seats, you know, for the whole hour. So give us the overview of Gable. Yeah. So Gable is really
trying to tackle a problem that I've personally experienced for basically every role in my career.
Every time I started at a new organization, my focus as a data leader was to understand the
use cases for data in the company and start to
apply data management best practices, beginning with my immediate team, which is analysts
and data scientists and data engineers.
We would always go through that process.
And at some point, we would still be facing massive quality compliance and governance
issues.
And that's because I found that a significant number of
these quality issues were coming from the upstream data producers that just weren't aware of my
existence. And as time went on, I found that these producers were not averse to supporting us, but
they did not have the tool set to effectively do that. Oftentimes, it required me explaining to them how data worked
or trying to get them to use a different tool outside of their stack
or saying, hey, here's a data catalog, and I want you to look at it
any time that you make a change to ensure you're not breaking anybody.
And this is just very hard and complex.
And so we developed Gable to act as the data management surface for data producers.
It's something that any engineer or data platform manager can use to number one,
understand the quality of their data coming from the source systems. Number two, can create
contracts, whether one-sided or two-sided,
around the expectations of that data. And then number three, protect themselves from changes
to the data. And that might mean data that is already in flight. So maybe I'm consuming an API
from a third-party provider and they decide to suddenly change the schema out from under me, we want to
be able to detect that change before it causes an impact on the pipelines. Or it could mean
someone making a change to the actual code. Like maybe there's some Python function in code that
is producing data and the software engineer making that change just doesn't know that it's going to cause an impact downstream, we want to be able to catch that using the tools that engineers already leverage,
like GitHub and GitLab, and stop that, or at least give information to both sides that
a change is coming. So yeah, that's basically how the tool works. That's Gable, and that's the
high-level problem we're trying to solve.
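To make the contract idea concrete, here is a minimal sketch in Python of what a one-sided contract over an event payload might look like, with a consumer publishing expectations and a check function reporting violations. The field names, rules, and helper are invented for illustration; this is not Gable's actual API.

```python
# A minimal, hypothetical sketch of a one-sided data contract: a consumer
# publishes expectations about an event payload, and a check function
# reports violations. None of this is Gable's actual implementation.
from datetime import datetime

ACTIVE_USER_EVENT_CONTRACT = {
    "user_id":   {"type": str, "required": True},
    "event_ts":  {"type": datetime, "required": True},
    "plan_tier": {"type": str, "required": False},
}

def violations(record: dict, contract: dict) -> list[str]:
    """Return human-readable contract violations for a single record."""
    problems = []
    for field, rule in contract.items():
        if field not in record:
            if rule["required"]:
                problems.append(f"missing required field '{field}'")
        elif not isinstance(record[field], rule["type"]):
            problems.append(
                f"'{field}' is {type(record[field]).__name__}, "
                f"expected {rule['type'].__name__}"
            )
    return problems

print(violations({"user_id": 42}, ACTIVE_USER_EVENT_CONTRACT))
# ["'user_id' is int, expected str", "missing required field 'event_ts'"]
```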
Awesome. Well, I have some specific questions about Gable that I want to get to, to dig into the product a little bit more, especially since you've chosen the .ai URL.
I want to dig into the reason behind that because I know it was intentional.
Let's zoom out a little bit first. One of the things we were chatting about before we hit record was the traditional way,
and you actually mentioned this term data catalog, right? It's a huge buzzword. There are entire
companies formed around this concept of a data catalog today. We were chatting a little bit about
how there are certain concepts that have been around
for a long time, like a data catalog, but maybe they aren't necessarily the right way to solve
problems modern day. So why don't we just talk about the data catalog, for instance? Do you
think that it's one of those concepts that we should retain, right? Because there are certain things historically
that are good for us to retain, but things change, right? So maybe we don't need to retain everything.
Yeah, I think a catalog is one of those ideas that conceptually makes an enormous amount of sense
on the surface. If I have a large number of objects and I want to go searching for a specific
object in that pile, having a catalog that allows me to quickly and easily find the thing that I
need makes a lot of sense. But like you said, I think this is an older idea that's based around
a very particular organizational model. So the original concept of
the data catalog back from the on-prem days was actually taken from like a library where you have
an enormous amount of books. You've got someone who comes into the library and is trying to find
something specific and they go to a computer or they open one of the very old school
documents, like a literal catalog. And from there, they can search and try to find what they need.
But this requires a certain management model of the catalog itself, right? You've got librarians,
people who know all of the books in the library. They maintain the catalog. They're very careful about what they
bring in and out. They're curating the catalog itself and they can add all of the relevant
sort of metadata, quote unquote, about the catalog that gives people the information that they need.
This was also true in the on-prem data world. When you had data architects and data stewards,
you had to be very explicit about the data
that you were bringing into your ecosystem.
You had to know exactly what that data was,
where it came from, what it was going to be used for.
And the catalog that you then provided to your consumers
was this very sort of narrow curated list
of all of the data that could possibly exist.
But in the modern sort of data stack, it's not like that.
It's more of a, you know, you've got your data lake and that is a dumping ground for
thousands or hundreds of thousands of data points.
There really is no curation anymore. And so what happens in
that worldview, I think that the model, the underlying model needs to change.
It makes total sense. One, digging in on that a little bit more,
we think about the data lake and, you know, of course there's tons of memes around it being a
data swamp, you know, and, you know, we're collecting more data
than we ever have before. What are the new ideas that we need to think about in order to manage
that, right? Because what's attractive about a data catalog, I guess you could say, is that
you have, call it like a single source of truth or, you know, sort of a shared set of definitions,
whatever you want to call it,
that people can use as a reference point. And like you said, when the producers were, you know,
they had to engineer all of that stuff, right? And so they basically designed from a spec and that is
your data catalog, right? Essentially. But when you can just point SaaS pipelines from any source
to your data lake or your data warehouse.
It's this crazy world where like, could a data catalog even keep up?
And so what are some new ideas for us to sort of operate in this new world?
I think it's a question of socio-technical engineering. So funnily enough, there is sort of a modern day library,
which I would say is Amazon. I mean, that's sort of Jeff Bezos' whole original idea.
It was a bookstore on the internet. But it was different from a typical library because it was totally decentralized. There wasn't one person
curating all the books in the library. The curation actually fell onto the sellers of those books.
And what Amazon did is they built an algorithm that was based around search. It was a ranking
algorithm. And that ranking algorithm would elevate certain books
higher in search based on their relevancy and the metadata that these curators or the book owners
would actually add. And there's a really strong, powerful incentive for the owner of each book
to play the game, right? To participate. Because if they do a good job adding their context,
it ranks higher, which means more people pay them money. And the same is true for any sort
of ranking algorithm-based system like Google or anything else, right? You're incentivizing the
people who own the websites to add the metadata so that they get searched for more often.
I think this paradigm is what a lot of the modern cataloging solutions
have tried to emulate. Like, let's move more to search. Let's move more to machine learning-based
ranking. But the problem to me is that it hasn't captured that socio-technological incentive.
The Amazon book owner, their incentive is money. The Google website owner, their incentive is, you know, clicks or whatever value they get from someone going to their website. What is the
incentive of a data analyst or a data scientist to provide all of that metadata to get their
particular asset ranked is that even something they want at all? Because if they're working around a data
asset, do they want to expose that to the broader organization? Does that mean if they have
thousands of people now taking a dependency on it, that it becomes part of their workload to
support it, which they may not want to do nor have the time to do? So I think the incentives
are not aligned. And in order to exist in this federated world, there has to be a
way to better align those incentives. I think that's what needs to change.
Well, okay. You brought up two concepts in there, and I'm going to label them, but let me know if
I'm labeling my data incorrectly. But there's this concept of data discovery. I think the point
about search is really interesting, right? Okay, so you have this massive data lake and you have a
search-focused data catalog type product that allows, you know, and you can apply ranking,
et cetera. But in many ways, that's sort of data discovery, right? The bookseller on Amazon
is trying to help people who like murder mystery fiction to discover their work, right?
Which is great. I mean, that is certainly a problem, right? But when you think about
the other use of the data catalog beyond just discovery, there's a governance aspect, right? Because
there's these questions of, we found something that is not in the catalog. Should it be in there,
right? Or there's something in the catalog that has changed, or we need to update the catalog
itself, right? And so how do you marry those two worlds? And I mean, I agree, the catalog is a really,
is it even the right way to think about that?
Because discovery and governance or quality or whatever labels you want to put on that side of it
are extremely different challenges.
Yeah, I think that's exactly right.
I think that they have very different implications as well.
I do think that a great discovery system requires solving a couple of problems.
I think the first is really great discovery actually requires more context than a system
built on top of downstream data alone is able to provide. If I'm a data
scientist or an analyst, and I was at one point in my career, what I really wanted when I was
doing a search for data was to understand, you know, what does this data mean? Who is using it?
Which is an indication of trust. Where is it coming from?
What was its intended purpose?
Can I trust it at all?
And how should I use it, right?
These are sort of the big categories of questions that I wanted to answer.
If a data catalog is simply scraping data from, you know, a Snowflake instance and then putting a UI on it, putting it into a list and letting people, you know, look at the metadata, it's only answering sort of a small subset of those
questions that I have. It's like, yep, what is the thing? Can I find something that matches
the string input that I typed into a search box? But all the other questions I now have to go and
figure out basically on my own, by talking to people, potentially talking to engineers, trying to trace this to some code-based resource or some other external resource.
And that lowers the utility of the catalog by quite a bit.
And then there's the governance side that you mentioned. And governance and quality is really interesting,
kind of like I implied before,
because in sort of a supply chain universe,
the quality and the governance is going to be on the producer.
I mean, it's really the only way.
And if the governance is going to be on the producer,
that means that the producer needs to have an incentive
to add that governance in the first place.
And I think today it's very hard as a producer to even know who is taking a dependency on the data that you are generating.
You don't know how they're using it, and therefore you don't even know what metadata would be relevant for them.
And you may not even want to expose all of that metadata, like I mentioned before.
So to your earlier point, I think catalog is probably, at least to me anyway,
it's not the right way of framing the problem.
If I could frame it differently, it may be more around like inventory management.
And that's more of the supply chain take
than sort of the old school take.
Yeah, absolutely fascinating.
When we think about, and actually I'd love to dig
into the practical nature of Gable really quickly,
just because I think it's interesting to talk about the supply chain, and maybe a fun way to do it. You know, you and I recently talked about some of the data quality features that RudderStack recently implemented, right? And I think it's a
good example because they're a very small slice of the pie, right? They're designed to, you know,
help catch errors and event data at the source, right?
At the very beginning, right?
So you have events being emitted from some website or app.
You can have sort of defined schemas that allow you to say,
look, if this property is missing, drop the event, do whatever, right?
Propagate an error, send it downstream.
First of all, would you consider that as sort of a producer, a source? How does that orient us in Gable? Where would the RudderStack sort of data source sit? Is that a producer?
Absolutely. I think that RudderStack would be a producer. I think pretty much, you know, the way I've thought about it is that there are really two types of producer assets, I guess. Or maybe three. There are code assets. There are structures of data, so like schemas and things like this.
And then there's the actual contents of data itself. And like you said, there's lots and lots of different slices
of this problem where the events that you're emitting
from your application, like with RudderStack,
are one area where you need this type of coverage.
Like I said, APIs that you ingest,
you've got kind of backend events,
you've got custom frontend events, you've got, you know, C# and .NET, and all of these other sort of
this very wide variety of things. And so I think everything that you talked about sort of in the
Rudderstack webinar, which was, you know, being able to check the data live as it's flowing from
one system to another system, doing schema management, all of that we consider totally relevant to what Gable is working on as well.
We also are trying to look at things like,
can we actually analyze the code pre-deployment
and figure out if a change that's coming through a pull request
is going to cause a violation of
the contract, wherein the contract is just an expectation of the data from a consumer.
And there is some level of sophistication to that. We do have, for example, like static code analysis
that crawls an abstract syntax tree. We can basically figure out when a change is made,
what are all of the sort of
dependencies in code that power that change, and what all the function calls are. And then if any
function call is modified anywhere in that syntax tree, we can then recognize that it's going to
impact the data in some way. And then in addition to that, and this is where I think things get really cool, is we can layer on artificial intelligence.
So not only would we know how different changes within that syntax tree can affect the schema, we can also know how they affect the actual contents of the data before the change is deployed.
So an example of that would be, and this is like a typically very difficult thing to catch pre-deployment is,
you know, let's say I have a datetime field, and we can say it's like datetime.now, and a product engineer decides to change that to datetime.utcnow. If you've been in data for any amount of time, you know UTC is a very common date format; engineers love it. But that change represents an enormous amount
of difficulty to detect and modify in all the places in all the areas. In CICD, not only could
we identify that change is going to happen, but we could actually understand that it is changing
to UTC and then communicate that to everyone that depends on that data.
That allows the consumer to either say, okay, I'm going to prepare all of my queries for UTC from
now on. Or if it's a really important thing and you might say, hey, software engineer,
I want to give you some feedback that you're going to cause an outage to 10 different teams.
So please don't make this change right now.
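As a rough illustration of the static analysis Chad describes, here is a toy Python sketch that diffs the dotted call names in two versions of a file and flags the datetime.now to datetime.utcnow switch. A real tool would resolve imports, trace dependencies across the whole syntax tree, and map assets to contracts; everything here, including the sample sources, is simplified and hypothetical.

```python
# A toy sketch of the static analysis idea: diff the dotted call names in
# two versions of a file and flag the datetime.now -> datetime.utcnow
# switch. A real tool would resolve imports and trace full dependency
# chains; everything here is simplified for illustration.
import ast

def dotted_calls(source: str) -> set[str]:
    """Collect call names like 'datetime.now' from a piece of source."""
    calls = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if isinstance(node.func.value, ast.Name):
                calls.add(f"{node.func.value.id}.{node.func.attr}")
    return calls

before = "def order_event():\n    return {'created_at': datetime.now()}"
after = "def order_event():\n    return {'created_at': datetime.utcnow()}"

added = dotted_calls(after) - dotted_calls(before)
removed = dotted_calls(before) - dotted_calls(after)
if "datetime.utcnow" in added and "datetime.now" in removed:
    # In CI, this is the point where every consumer registered against
    # the contract for this asset would be notified of the change.
    print("Timestamp semantics changed: datetime.now -> datetime.utcnow")
```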
That's like one big part of the platform
is that you're shifting left, trying to catch things closer to the source as a part of DevOps.
And then the other side of it is,
like you said with RudderStack,
we try to catch stuff in flight as well.
So if someone has made a bunch of changes, if there's a lot of changes
coming through in files that land in a Postgres database or in S3, we look at those
files individually, map them back to the contracts, and then we can send some signals to the data
platform team to say, hey, there's some
bad data that's coming through. Now is your opportunity to get in front of it so that it
doesn't actually make its way into the pipeline. Yep. I want to drill down that just a little bit
more. And I'm going to give you an example of a contract, but please feel free to trash it and
pick something else. But let's take this contract around like a metric like active users, right?
You know, of course, like one of those definitions
that you ask around a company
and you get five different definitions,
we need to turn that into a contract
so that all the reports downstream
are using sort of the same metric or whatever, right?
And maybe RudderStack event data
is a contributor to that definition,
you know, based on a timestamp of some user activity, right? But
there are tons of other ingredients into that metric, right? And so maybe you need to roll that
up at an account level. And so you either need, you know, a copy of the Postgres production database,
you know, so you can make that connection or, you know, Salesforce or whatever it is, right? You need maybe subscription data from a
payment system so that you know what plan they're on so you can look at active users by all those
different tiers. So we have that contract in Gable. And so can you just kind of describe the
way that you might wire in a couple of those other
pieces beyond just the event data. Because I think the other interesting thing is, you know, we think about data quality at RudderStack as just trying to align to a schema definition. But what's
interesting is that the downstream definition in a contract actually may interpret that
differently in the business context, as opposed to
there's a diff on the schema and something's different, right?
Yes. So I think there's sort of two different ways to think about this. One way, and the way that I
usually recommend people to think about this problem is to start from the top down.
There's a couple of reasons for that. One, it can be organizationally very difficult to ask someone
downstream to take a contract around something like a transformation around a metric in Snowflake
or something like that, or BigQuery, if the inputs to that contract are not under contract, right?
That feels a bit scary.
It's like I am now taking accountability for something
that I may not necessarily control.
And so oftentimes there is pushback to that,
which is why I usually say that the best place to start with contracts
is from the
sources first, and then waterfall your way down. Interesting.
The second piece of that is that, like, I think there's a longer term horizon on this stuff where everything I just said doesn't apply, which is starting to integrate more concepts of data lineage into contract definition.
So let's say that I have this sort of metrics table and I want to put a contract on it,
but nothing exists. In the ideal world, you would be able to say, I want this contract,
and now I want some underlying system to figure out what all of the sources are sort of end
to end.
I want to create almost like a data lineage in reverse.
And then I simply want to either ask for a contract or to start collecting data on how
changes to those upstream systems are ultimately going to affect this transformation
of mine downstream.
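A hedged sketch of that "lineage in reverse" idea: given a downstream metric, walk upstream edges until you reach the root producers. The graph shape and asset names below are invented for illustration.

```python
# A sketch, under invented names, of constructing "lineage in reverse":
# from a downstream metric, walk upstream edges to find every root
# producer that ultimately feeds it.
from collections import deque

# each asset maps to the upstream assets it is derived from
upstream = {
    "metrics.active_users": ["warehouse.events", "warehouse.accounts"],
    "warehouse.events": ["app.tracking_plan"],
    "warehouse.accounts": ["salesforce.accounts", "billing.subscriptions"],
}

def root_producers(asset: str) -> set[str]:
    """Breadth-first walk from a consumer back to its root producers."""
    seen, queue, roots = {asset}, deque([asset]), set()
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        if not parents and node != asset:
            roots.add(node)  # no parents means this is a root producer
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots

print(root_producers("metrics.active_users"))
# {'app.tracking_plan', 'salesforce.accounts', 'billing.subscriptions'}
```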
This is something that we hear a lot where teams basically say, I want contracts, but
I don't really have the social, like, political capital to go to my engineering team and
tell them what to do without evidence.
And they would like to just collect that data first.
So I think that's sort of the other piece: being able to construct that lineage, understanding how things are changing, collecting the data, and creating the evidence
for the contracts and then implementing them from there. Yeah. Love the phrase around,
I don't want responsibility for something that's not under contract. Okay. I actually have a
question. I know Costas has a ton of questions, but I actually have a question for you and for Costas. When we think about contracts, right, so I think about, you know, I brought up the example of active users, but it could be any number of things. Costas, you've been a practitioner, you've built a bunch of data tooling. How fragmented are the problems of data quality? And I guess maybe we could think about the 80-20 rule. And part of the reason I
want to ask is because, you know, even, you know, in the work that I do with, you know, analytics
and stuff like that, you always wonder, it's like, man, I mean, this is kind of messy. Like,
I wonder what it's like at other companies. Is it the same set of problems? Is it really
fragmented? Does the 80-20 rule apply, where there's like, you need this set of, you know, contracts and they'll take care of 80% of the problems? But what have you seen? Chad, maybe start with you, and then Costas, we'd love your thoughts as well. The numbers that I have seen are that 50 to 70% of data quality issues are coming from the
upstream source systems or the data producers. That's sort of the most typical range that I've
heard. Now, within that, I think that there is a pretty wide variety of problems. For example,
like databases, changes to databases, not really
that complex of a problem. And the reason why it's generally not a problem for data teams is because
engineers don't do a lot of backwards incompatible stuff because they're scared
of deleting columns that other teams are using. Sure. Yeah, yeah.
And so, but there is still a quality problem there, which is like, well,
as a software engineer, maybe I'm just going to add a new column that contains data from the old column, and I don't
communicate that to the team downstream.
So that's an issue.
And then on the actual kind of business logic code side of the house, this is where we hear
issues on the data content.
And that's like that sort of datetime UTC change that I mentioned before.
We also hear a ton of problems around third party vendors, especially schema changes.
And that's because they're really under no obligation to not make those changes.
And a lot of the actual financial and legal contracts between teams don't account for changes to the actual data structures themselves, right?
The SLA is more about uptime of the actual service, but not, will this data suddenly change from today to tomorrow?
So depending on where companies have built the majority of their data infrastructure,
you'll see a very different sort of split
in what upstream problems are causing the most issues.
Yeah, I think you described it all very well. And it probably gets even more complicated when we start considering all the different roles out there that can make changes to a database schema,
right?
Like, for example, let's say you're using Salesforce.
I mean, Salesforce at the end, it is like a user interface, like on a database.
You have people there who can go into a table, though they don't see it as a table, they see it as leads or whatever, and make changes there. Right. And these changes can propagate down to the data infrastructure
that we have and like all that stuff.
So I think, and that's what I find very interesting with what Chad was saying about the catalog, because yeah, sure, back then we had a very narrow set of producers at the end, right?
That were under a lot of control by someone.
But pretty much every system that we are using in the company to do something is potentially a data producer.
And the people behind them are not necessarily data people or even engineers, right?
They can be salespeople or marketing people or HR people or whatever.
I don't think anyone can, let's say, require them to understand what UTC even is when they are going to make changes. And that's obviously on top of what, let's say, Salesforce on their own
might change there, which I would say is probably more rare than what is caused by, let's say, the actual users. So yeah, I mean, I think it makes total sense that
most of the, let's say, the problems come from the production of the data out there.
But it's also, I think, the question I have for you, like actually Chad is, even
if we focus only on the production side, right?
Let's go upstream.
Is there among, let's say the upstream producers of like a typical company out there, another
Pareto kind of distribution in terms of where most of the problems come from, compared to others?
Yeah.
I mean, I think you actually touched on a few of them.
A lot of these sort of third-party tools like Salesforce, HubSpot, SAP that are maintained
by teams outside of the data organization.
I mean, you said it exactly.
It doesn't seem like a problem as a salesperson or a Salesforce administrator to delete a
couple columns in your schema that you're working with.
But if you're relying on that data for your finance team or your machine learning team,
this becomes hugely problematic.
So this is almost always a source of pain.
I think the other thing that's very problematic are the events. And we hear front-end
events are especially notorious. And this is something I think that Eric and the RudderStack
team are sort of working on, but we hear it all the time where you have this relatively legacy
code base and there's a ton of different objects in code that are generating data. And for every single feature that's deployed, those may or may not change.
The events may suddenly stop firing or new events might be suddenly added and no one
is told about that.
And the ETL processes don't get built.
There's just such a large communication gap between the teams sort of working on the features
that are producing the data and the teams that are using the data that, you know, really
anything that can go wrong oftentimes does.
And then the other really big area, I think, is the external data.
This is where it's just, it is unbelievably problematic.
And a lot of companies, they're not sort of ingesting real-time data feeds.
It's sort of much longer batch processes that take a lot longer to load.
So it might be every quarter I pull in a big data set or every couple months I pull in
a big data set.
And there's so much change that happens on the producer side between, you know, the times that they
vend these large data sets out that it could look like a completely different thing when
you get from month to month or quarter to quarter.
And there's so much work that then has to go into sort of putting the data into a format that can actually be ingested into the existing pipeline that it just causes huge problems.
You know, there's a company I was talking to where they basically said the data team
lost our entire December to one of those changes.
And I think that these types of things are very common.
Eric, anything you want to add there?
No, I know you have a ton more questions.
Of course, I could ask a bunch of questions, but I'm just soaking this up like a sponge.
I love it.
Okay. Okay.
Okay.
So let's talk about events.
Let's get a little bit deeper into that.
And before we get into the data and the technology part, let's talk a little bit about humans
and organizations there.
So I have a feeling that not that many front-end developers have ever been promoted
because of their data hygiene when it comes to events, right? So how do we align that? Because
you made a very good point about, let's say, the incentives out there in the marketplace,
for example, where people are actually incentivized to go and put good metadata or even get to the
point where they try to game the algorithms with the metadata that they put there. But
in organizations, teams are not necessarily even aligned between them; inside engineering, the data teams and the product teams might not be aligned, right? Like, how can we do that? And where are the limits, at the end, of technology with that stuff, right?
Exactly.
I mean, I think that your last sentence there hit it exactly.
I think that technology can only do so much.
In my opinion, and what I've seen, like you said, it comes down to incentives.
And so the question is, in fact, when I was at
Convoy, I asked engineers this exact question. I went to them and I said, hey, how can I get you
to care about the data that you're producing because you're changing things and it's causing
a big problem for us? And the answer that I heard pretty consistently was, well, I need to know who has a dependency on me. Who is using that data?
Why are they using it? And when am I going to do something that affects them? I don't have any of
that context right now when I'm going through my day-to-day work. And so it feels a bit bad.
I think if you're an engineer and you're going through your typical processes, you're making some sort of code change. It gets reviewed by everybody on your team. They give you the
thumbs up, you ship it, you deploy it. And then two and a half weeks later, some data guy comes
to bang on your door and say, hey, you made a change and it broke everything downstream.
It's like at that point, you've already moved on to the next project. You're working on something
new. You've left the old stuff behind. It just doesn't feel good to have to retract all of that. And
this is why something we've heard a lot is like product engineers generally tend to see
data things as being the realm of data people, right? Anything sort of in the data warehouse is kind of treated as a black box.
And if there's a problem caused there,
then the data teams will just,
they'll deal with it downstream.
And I think that this mentality needs to change.
And I think that product can help it change.
So one example of this is DevSecOps,
right? The whole discipline of DevSecOps has evolved over the past five to seven years
from security engineers that have basically said, look, we cannot permanently be in a reactive state
when it comes to security issues. We can respond to hacking, we can respond to fraud, but the best case scenario
for us is to start to incorporate security best practices into the software development
lifecycle as, for example, just another step within CICD. And I think this is what needs to
happen within data. Checks for data quality should be another step within
CICD. And that step, just like any other integration test or any other form of code review,
should communicate context to both the producer and the consumer of what's about to go wrong.
So if I can tell an engineer, for example, hey, the change that you are about to
make is going to cause this amount of damage downstream to these data products and these
people, you've now created a sense of accountability. If they continue to deploy the change,
even in that environment, well, you can't say you don't know anymore. It's no longer a black box.
It's been open. And it provides an
opportunity for the data scientists to plead their case and say, hey, you're about to break us. Can
you at least give us a few days or give us a week to account for this? I think that is a type of
communication that changes culture over time. Yeah, makes total sense. And okay, we talked
about the people and how they are involved and what is needed there,
but let's talk also a little bit about technology. What are the tools that are missing?
And what are the tools also that... Where are the opportunities, let's say, in the toolbox that we
have today to go and build new things? You mentioned, for example, the catalog, that it's a concept that probably has to be evolved.
And it's something that we had a panel a couple of weeks ago with folks like Ryan from Iceberg
and Wes McKinney.
And it was one of the things that came up, that the catalog is one of these things that we might have to rethink.
With catalogs, by the way, I think what most people have in their mind, they are thinking of a place where I can go and, as you said, it's an inventory of things, right? Where I can find my assets and reason about what I can do and what I'm looking for.
But catalogs are also what,
let's say, fuel the query engines out there. There's also metadata that the systems need to go and execute the queries. So there are multiple different layers from the machine up to the human
that they have to interact with. So what are the tooling that you see missing and what are the opportunities?
So what I think is missing for the catalog to be effective is feedback loops and actionability.
Basically, or to maybe phrase it another way, give something, get something. If I can provide as a consumer or even a producer for that matter,
if I can provide information to a catalog that helps me in some way,
then I am more inclined to provide that information more frequently. And as a data product owner,
what I would like to get back in return,
or one of the most valuable things that I could get back,
is either some information about where my data is coming from,
the source of truth, who it's actually owned by,
this sort of class of problems that I mentioned before that I'm interested in,
or I get data quality in response, right? And so this kind of ties back to the point I was making
earlier around lineage. And I'll just give you a very simple example to illustrate, you know,
let's say within the warehouse, there's sort of a table, maybe a raw table that's owned by a data
engineer. And then a few transformation steps away there is, I don't know, like Eric was saying, some metric that's been produced by a product team
and they don't want that to break. Now, what they could do, through whatever the system is, is effectively describe what their quality needs are. And then we could traverse
the lineage graph and say, okay, I can
now communicate these quality needs to all of the producers who manage data that ultimately inputs
to this metric. And I can be sure that if there is ever going to be a change that violated those
expectations, I would know about it in advance. Now I, as the metric owner, am a lot more inclined to add good
information, right? So I've created a feedback loop where I'm providing metadata and details
about my data object that I maintain. I'm getting something which is quality in return.
And now I've built something that is robust that someone else can take a dependency on.
And I think this is the type of system that basically has to exist where
the team, the producer team of some data object is getting a lot of value in return for contributing the metadata in the context, which I don't think is the case today.
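As a small, hypothetical sketch of that feedback loop: a metric owner registers quality expectations, and the system fans them out to every transitive producer via the lineage graph. All asset names and rules below are made up.

```python
# A sketch of the feedback loop: a metric owner registers quality
# expectations, and the system fans them out to every transitive
# producer via the lineage graph. All names and rules are invented.
upstream = {
    "metrics.weekly_active": ["warehouse.events"],
    "warehouse.events": ["app.tracking_plan", "vendor.api_feed"],
}

expectations = {
    "metrics.weekly_active": ["event_ts must be UTC", "user_id is non-null"],
}

def notify_producers(asset: str) -> None:
    """Push a consumer's expectations to each transitive producer."""
    stack, seen = list(upstream.get(asset, [])), set()
    while stack:
        producer = stack.pop()
        if producer in seen:
            continue
        seen.add(producer)
        for rule in expectations[asset]:
            # A real system would register a CI check for the producer
            # rather than print a message.
            print(f"notify {producer}: downstream '{asset}' expects {rule}")
        stack.extend(upstream.get(producer, []))

notify_producers("metrics.weekly_active")
```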
And you mentioned the word metadata, and you said you'd like people to go and add the metadata there. What is the metadata that's missing right now to construct these things? Because the lineage graph, okay, it's not a new concept, right? It's been around for a while, but why isn't what we already have enough?
What is missing from there?
Well, I think it's a couple of things.
I think one thing is that, number one, the lineage graph doesn't actually go far enough.
And you hear this a lot, like right now, especially in the modern data stack, the limits, the
edges of the lineage graph basically end at structured data.
And if that's where you stop, then you're missing another 50% of the lineage, which
means that if something does change in that sort of unstructured code-based
world, it is ultimately still going to impact you. Any monitoring or quality checks at that point
are just reactive to the changes that have happened. So number one, you need to actually
have the full lineage in place in order for the system to actually work the way that I'm describing it. And then in terms of what metadata is missing, I think there's a massive amount, right? Number one,
probably the biggest question that I had as a data scientist, and got as a data platform
leader is what does a single row in this table actually represent? That data is found almost nowhere in the catalogs
because again, there's no real incentive
for someone to go through all of the various objects
that they own and add that.
Same is true for all the columns.
Like if we have a column called,
I don't know, distance. Convoy was a freight company, and so this idea of distance was very important. We had probably 12 different definitions of distance, and none of them were laid out explicitly in the catalog.
Distance might be in terms of miles. It might be in terms of time. It might be in terms of
geography. It might be some combination of all of those. But if I, as the owner of that data product, can communicate exactly what I mean
by distance, then that's going to help the upstream teams better communicate when something
changes that impacts my understanding. So yeah, I think that's sort of the idea is I think all of
the semantic information about the data, that's the missing metadata, in my opinion. Yeah, yeah, makes sense. Do you see an opportunity there for AI to play a role
with the semantics of all this data that we have? And if yes, how?
Yes, number one, I think so. I think the challenge, though, is that, well, again, I think there's a couple ways that
this can play out.
Ultimately, I think that this is what all businesses will need to do in order to really
scale up their AI operations.
They are going to need to add some sort of language-based semantic information to their core datasets. Otherwise,
all this idea of like, oh, I'm just going to be able to automatically query any data in my
dataset and ask it any question, all of that's going to be impossible because the semantic
information is not there to do it. It's just tables and columns and nobody knows what this
stuff actually refers to. I think one option is that the
leadership could just say, okay, everybody that owns something in data, we're going to spend a
year or maybe two years going to all of the big data sets in our organization and trying to fill
out as much of the semantic detail as we possibly can. I think that could help as a start, but I tried this when I was
onboarding a data catalog and it's like temporary, right? Like you get the initial boost, like maybe
for a month, you get a ton of metadata added all at once. And then it just kind of gradually
slopes off and ultimately isn't maintained, which is pretty problematic.
I think a better way to do it is to start from the sources and trickle down in the same way I was describing to Eric before. And I think all of this comes back to the contract.
If you can have a contract that is rich with this semantic information,
starting from the source systems, it is the responsibility of the producers to maintain. They understand what all of their dependencies are. Anytime something changes
with the contract, they're actually not allowed to deploy that change unless they have evolved
the contract and contributed the required semantic update. Then you get this sort of nice
model of inheritance where every single data set that is leveraging
that semantic metadata can then use it to build their own contract.
And I think a lot of that could actually be automated.
This is more of a far off future, but I think it would be a more sustainable way of ensuring
that the catalog is actually up to date and the data is trustworthy.
Yeah, makes total sense.
Eric, we're close to the end here.
So I'd like to give you some time to ask any other questions you have.
Yeah, so two more questions for me.
One, just following on to the AI topic.
What are the, you know, when you think about the risks, and this is somewhat of a tired topic,
but I think it's really interesting in the context of data quality as we're discussing it, I agree with you that AI can have a massive impact on the ability to scale certain aspects of this, right?
But when we're talking about a data contract, the impact of something going wrong is significant, right?
It's not like you need to double-check your facts because you're, you know, researching some information, right? You're talking about something, you know,
someone potentially making an errant decision, you know, for a business. So how do you think
about that aspect? And, you know, I guess maybe as we think about the next several years,
when do you see that problem being worked out?
I think that it's going to require treating the data as a product in terms of the environments
that data teams are using. And what I mean by that is, today, when we are building software applications, what delineates a software application in a QA
sort of test environment from something that is production and deployed to users is the process
that it follows to get there. Ultimately, code is not that dissimilar. It's just that there's a
series of quality checks and CICD checks and unit testing
and integration testing and code review and monitoring. It's the process you follow that
actually makes like some bit of code a production system or not. And I think that in the data world,
exactly as you've said, what makes something production, is it trustworthy? Is there a very
clear owner? Do we know exactly what this data
means? Is there a mechanism for evolving the data over time? Do we have the ability to iteratively
manage that context? And I think the process that has to be followed from kind of like experimental
data sets to a production data set is a lot of the same stuff. It's CICD and unit
tests and integration. I think contracts play a really big part of that. There needs to be a
contract in place before we consider data production grade. And I think this is where
the environments come in. There need to be literally different environments for a data asset that is
production versus one that is not. And I think that should have impacts on where you can use
that data. If we don't have a data set that has a contract and has gone through the productionization
process, I can't use it in my machine learning model, and I can't share it with our executive team in a dashboard or report.
And in the same way that, like, I can't deploy something to a customer if I don't follow my, you know, code quality process.
I think this is the thing that probably needs to change the most.
Like right now in data, we don't delineate at all between what is production
and what is not production in the sense of like customer utility. It's all sort of
bunched into a big spaghetti glob. Yeah. Super helpful. All right. Last question.
You know, a lot of what we've talked about, one way to summarize it could be, you know, you almost need to slow down to go faster, right?
You know, actually defining contracts, actually putting data producers under contract.
You know, you used the term socio-technological, right? It involves people. That takes time. Can you speak to the listener who
has followed along this conversation and said, man, I would love to start fixing this problem
at my company, but it's really hard to get things to slow down so you can go faster in the future.
What would be the top couple pieces of
advice for that person? So yeah, so first of all, I agree with you, there is some element of slowing
down. But at the same time, I would say that, like, I think that's the same for code quality too, right? GitHub does slow us down, right? And CICD checks do slow us down. And having something like LaunchDarkly that controls feature deployments is going slower than just deploying everything to 100% of our audience. But what software teams have realized is that in the long run, if you do not have these
types of quality gates in place, you'll be dealing with bugs so frequently that you'll be spending a
lot more time on that than you will on shipping products. So that's sort of the first framing
that I would take, because I think this falls under that exact sort of class of problems. The second thing I would say is, I think the problems
that a lot of engineering organizations and even more business units have with slowing down on the
data side is that they are still not treating their data like it is a product. They're treating it more like, hey,
it's just some airy thing. I want an answer to a question. I get an answer to a question.
It's not something that needs a maintainer and it has to be robust and trustworthy and scalable and
all these other things, which is kind of the implication. It's like if I ask a question
about my business, it is implied that it is trustworthy and that it's high quality, but oftentimes that connection is not made. And so
what I oftentimes recommend people to do is you have to illustrate that to the company and then
illustrate the gap. So a concept I used a lot at Convoy was this idea of tier one data services. And that basically means there is some set of
data objects at your business where a data quality issue can be tracked back to revenue
or business value. So in Convoy's case, we were using a lot of machine learning models.
A single null value for a record would mean that particular row of training data would
need to get thrown out.
And if you're doing that a lot, then you can actually map that to a dip in accuracy.
And if you know how much value your model is producing, then every percentage point
in inaccuracy can be traced to a literal dollar sign, right?
And so that's sort of one application.
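As a purely illustrative back-of-the-envelope, with every number invented, that mapping from upstream nulls to dollars might look like this:

```python
# A back-of-the-envelope sketch, with every number invented, of tracing
# upstream null values in a tier-one ML service to a dollar figure.
null_rate = 0.03                      # 3% of training rows dropped for nulls
model_value_per_month = 500_000.0     # dollars credited to the model monthly
accuracy_loss_per_pct_dropped = 0.4   # assumed: 1% fewer rows -> 0.4% accuracy lost

dropped_pct = null_rate * 100
accuracy_loss_pct = dropped_pct * accuracy_loss_per_pct_dropped
monthly_cost = model_value_per_month * accuracy_loss_pct / 100
print(f"~${monthly_cost:,.0f}/month attributable to upstream nulls")
# ~$6,000/month
```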
I think there's lots of applications within finance.
There's some really important reporting that goes on.
Once you sort of identify all of these use cases for data,
what I then like to do is map out the lineage
and go all the way back to the source systems
to the very beginning and say,
okay, now we see that there is this tree.
There's all these producers and consumers
that are feeding into this ultimate data product.
And then the question is,
how many of these producers and consumers have contracts?
How many of them know that this downstream system even exists?
And how many times has that data been changed in a way that's ultimately backwards incompatible and causes quality issues with that system? Now, with all
of that, you can actually quantify the cost of any potential change to any input to your tier one
data service. And you can put that in front of a CTO or a head of engineering or head of data or
even the CEO, and the level of risk that the company faces by not having something like this in place becomes immediately apparent. So that's a really excellent way to get started. A lot of companies are beginning just
with paper contracts and saying, here are the agreements and the expectations that we need
as a set of data consumers and then working to implement those more programmatically over time.
Such helpful advice
that I really need to take to heart
in the stuff I do with data every day.
Chad, thank you so much for joining us.
If anyone is interested in connecting with Chad,
you can find him on LinkedIn.
Gable.ai is the website.
So you can head there, check out the product. And Chad,
yeah, thank you again for such a great conversation. Thank you for having me.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You
can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.