The Data Stack Show - 132: Data Quality and Data Contracts with Chad Sanderson of Data Quality Camp
Episode Date: March 29, 2023

Highlights from this week's conversation include:
- Chad's background in data (2:10)
- Breaking down data quality (4:02)
- Semantic and logical layers of data (10:04)
- What are data contracts and how do they... work? (17:41)
- Implicit contracts at companies (24:01)
- Where do data contracts fit in data infrastructure? (28:14)
- The value of data contracts to the producer and consumer (31:18)
- Tools needed in effective data contracts (46:13)
- The importance of community in data quality (50:53)
- Getting connected to Data Quality Camp (1:00:55)
- Final thoughts and takeaways (1:01:53)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show,
Kostas. Today we're going to talk with Chad Sanderson. He has had a long career as a data practitioner, but he runs a community and creates a lot of content around data quality. And he talks
a lot about data contracts in particular. And that's what I
want to ask him about. I don't think we've talked about data contracts in the show. We've
had discussions about data quality and a lot of the tooling that's trying to accomplish that. But
data contracts, I think, is a new subject. And so I want the breakdown, the 101
from Chad, because he is the expert. Yeah, a hundred percent.
I'm very excited to have him on the show, to be honest.
Data contracts is one of these concepts, I mean, we keep hearing more and more about it lately. But it's not the data contracts themselves, to be honest, that I'm so excited about.
It's more about having Chad on the show, because, you know, we tend to talk about different things in data infrastructure, usually from the point of view of bringing a solution, but in this case we'll have a person who's just so passionate about data quality.
Right.
And who tries to build not just a product, but more of a change in the way that people work around data, and data quality in particular.
So there are much broader things that we can discuss with him.
And I'm really looking forward to do that and talk about data quality in general, why it is important, why it's so hard to define it, where data products fit in this narrative, and how we can make things better.
All right, well, let's dig in and talk data contracts with Chad.
Let's do it.
Chad, welcome to the Data Stack Show. A privilege to have you on.
A privilege to be here. Thanks for having me. Absolutely. Well, we have a ton to talk about.
I've read a lot of your work. It's been a huge help to me and the way that I think about a lot
of things related to data quality. But give us your background and kind of what led you to what you're doing today
with the community and the content.
Yeah, so these days I spend most of my time doing writing,
going to various conferences.
I may have a book or two in the works at the moment, so stay tuned on that.
And I'm running a community called Data Quality Camp.
And that community is focused around managing data quality at scale,
which is an unsolved need.
There's a lot of ways to think about data quality and not too many standards.
So I thought it'd be good to, you know, stand up some community that
helps people manage their transition to data quality a little bit better.
Before that, I spent three years at a company called Convoy, which is a freight tech startup based out of Seattle, working on more small- and medium-data problems, where the issue is less around cost and compute and more
around complexity of the data and ownership. And from my time there trying to solve these problems
around the accumulation of tech debt in the data warehouse, ownership problems between producers
and consumers, we derived this programmatic initiative around an idea called data contracts.
And that's what I spend most of my time writing and talking about these days.
Prior to that, I was at Microsoft.
I worked on their artificial intelligence platform team.
And then I've been in big data for just around 10 years altogether.
I worked at a variety of other companies in the e-commerce space.
Okay, Chad, you mentioned that you do a lot of thinking and writing about data quality at scale.
You faced this problem in previous jobs. But as you mentioned, data quality, there's a lot that goes into it. It's a very wide field. There are lots of companies that are trying to solve this
problem. And they're doing it in some very different ways, right? Very different approaches.
Can you break down data quality for us? How do you frame such a big topic?
Yeah, so I would basically break down data quality into two main categories that each have their own subdivisions of delegations of concern, I guess.
The first layer of quality is what I think any engineering team would think of when they hear quality, right?
Does the application work the way that it was intended?
If we have a set of requirements for a product, are those requirements being met? Are the SLAs that we have being met?
And that could include the freshness of the data.
It could include whether or not there are serious breaking changes being made to dependencies: is the API of any service being consumed evolving in a way that is conducive to the health of that application?
And so that's
one part of it, I think, where we're treating
the data itself like a
product, and so there is some expected
level of quality
mapped to the requirements.
The other element of quality that I think is unique to data is the idea of truth or
trustworthiness.
The data needs to map to some real world reality.
And if I have a shipment and I have data about the shipment, and I know where the shipment was dropped off, where it was initiated from, and whether or not it arrived on time, all of that should map to whatever really happened in the real world. And that is a really complex subject, because you have different levels of big T truth and little t truth, if you spend any time in philosophy 101. So there's big T truth, where there's some objective meaning to what
happened. And then there's our little t truth, which is the subjective interpretation of what
happened. And all of us, many people at a company have maybe different interpretations of what a
specific metric might mean or what a specific dimension might mean. So part of data quality is ensuring that everyone is speaking the same language
and that the objective truth about the world is reflected in the data itself. That's what I think
are the main components of data quality. That's super helpful. And I love the philosophy angle. Do you see more struggle on the big T
side or the little T side? I mean, obviously, if you don't get the big T right, then you're
going to have a lot of problems with the little T. But do companies really struggle with the big
T side of things? I think basically every company I've ever talked to struggles with the big T side
of things. And this is not to jump the gun, but this is one of the major issues that data
contracts is attempting to solve is ensuring that the data is defined in a way that maps
to the real world and that it doesn't change unexpectedly for reasons that may have to
do with something other than the data itself.
Like, oh, we decided to launch a new feature or we decided to drop a column or rename a column
because it didn't really fit what we were attempting to do with our application.
The goal is for the data we're collecting from our source systems to be as tightly mapped to
that big T truth as possible. And part of that mapping has to come from the consumers who understand what the data
needs and what it maps to having a great relationship with producers who are responsible
for maintaining the systems that are collecting that data.
So yeah, I think that big T truth is a huge problem.
Little t truth is also a huge problem.
And it really depends on where in the organization you're looking and what type of business it
is.
But there are massive disagreements at, again, almost every company I've talked to,
about what a particular metric even means.
We had this dimension at Convoy that was called shipment distance.
And you would think that's a pretty straightforward thing.
It's just the distance between a shipment's origin point and its destination point. But there were so many people that couldn't exactly agree on what specifically we were talking about. We could be talking about distance in kilometers or distance in miles. Some people wanted to define the starting point as where the shipment was dropped off. Some people wanted to define it as where the trucker was driving
from. And these types of differences in thinking sort of apply to the use cases that the consumer
is attempting to solve. So wrangling everybody's brain around the same semantic concepts is very challenging.
Yeah.
Now, I want to get practical for a second.
Would you describe shipment distance
as a big T or a little t?
There's definitely elements of both.
And so this is where we kind of get
to the philosophy of all of this, right?
And it actually becomes
a really challenging conversation to have. There is obviously some objective distance, right? The
shipment is traveling from one place to another place. So that part is real. The question is,
what explicitly do we mean by shipment? The distance part is real. It's the shipment part, applying distance
to the shipment where people might disagree, right? Yeah. Yeah. That makes total sense. I was
thinking about even something like delivery, right? It seems like it's binary. This thing
was delivered or it wasn't, right? But one team could say, okay, if it gets to the physical destination, it's delivered.
But another team may say, well, no, it's when the customer opens it and verifies that what they got is correct.
That's a successful delivery, or whatever it is, which are all useful. But the question that comes up then is, you start to face,
and I'm speaking from experience here, you know, even in stuff that we do every day with our own
data, is, okay, so you have some disagreements, right? Not because anyone's necessarily right or wrong in a lot of cases. It's that in order to interpret their job or understand the effectiveness of their work, you need to measure something in a slightly different way, right? But the problem that often comes up is that you start to have this proliferation. It's like, okay, well now we have shipping distance underscore X, and it's like 19 different variations.
How do you, and I want to get practical when we talk about, you know, data contracts and stuff,
but philosophically, where do you fall on the spectrum of like, you know, we need to
provide consumers with the information that they need to do their job well, without allowing,
you know, things to run rampant and to create all of this metrics debt,
which just spirals out of control.
I feel like the first time you say,
okay, we'll just cut a new version of this,
it's like weeks later,
the warehouse is already getting messy.
Yeah, so I sort of see
the data platform environment split into two halves.
There is the semantic layer and the logical layer.
And I'm using those terms a bit differently, I think, from how a lot of other companies
use them.
And there's a reason why I think companies use them in a different way.
But like when people talk about semantics, that means at least in every definition I've seen, that's like the nature of the thing itself.
Right? Like, if I say the semantics of a car, I'm talking about the nature of a car,
I'm not talking about abstract interpretations of cars. I'm saying, well, a car has an engine, and it has four tires, and it has a function, which is to move from one place to another place.
And so that's one layer I think needs to exist in the data platform.
And then the other layer that needs to exist is the logical layer.
So these are derivations of real world objects and events.
And those are kind of subject to our interpretation.
Something like margin is an example of a logical construction, right? There's no real, physical thing called margin that exists in the world that we can grasp. It depends on how we, the humans who work at a company, choose to
define margin. And it can be cut many different ways. I think that semantic layer needs to have one type of governance and
implementation and coordination. And I think that the logical layer needs to have a very different
type of governance and organization that is based around the promotion of sort of crowdsourced, almost Reddit-upvoted artifacts, right?
So if we have, you know, if as a company, we agree that this definition of margin is
the one that is most commonly used by the business, that doesn't mean that everyone
else can't have their own interpretation.
That's fine.
But if anybody in the company has a question, what is margin according to some common definition, there should be a very easy way for them to get access to that data without
having to try to understand the 30 unique versions of margin that exist all across the
business.
So I think there needs to exist some plane where there can be iteration.
Teams can derive, you know, logical aggregations
based on real-world semantic objects.
And as those logical derivations
become more and more valuable at the company,
they are elevated to a higher level of importance
and treated like an API.
And then there can be discussion, right?
So if you want to change the definition of margin
that is powering key data
objects, and actually, let me take a step back because I think when we're talking about these
elevated meanings, it's not just in a sort of abstract like, oh yeah, there's one set of metrics
that are good and one set that's bad. Having that elevated version of a metric should allow you to
use the metric in ways that are more actionable
and production grade.
For example, if I want to use this concept of margin in a dashboard that I serve out
to my customers, then I have to use the official version, the first elevated version.
If I want to serve it to a sales team, or if I want to do something that maybe goes
across teams, then I have to contribute back to this central, almost open-source definition of
the metric.
If you want to create your own version of a metric and it lives in your little local
environment and you tinker with it and you apply it to a dashboard that only you see,
that's fine.
But once we start sort of going cross-company, that's when we need to have some agreement on what these terms actually mean.
Yeah, that makes total sense.
So a centralized agreement on the most important things,
but you're not removing decentralization from the equation, right?
Exactly.
That makes total sense.
And this is my approach to both the semantic layer and the logical layer.
I think that there is sometimes a misconception in data where we'll sort of look at all the SQL in our data warehouse, and we'll look at the pipelines that are actively failing, and the business logic where, you know, there's no clear agreement on what these entities are.
And we take this approach of we need to go and remodel everything.
We need to have a very clear and well-agreed and established data model.
And we have an entity called Shipments, and it's owned by this team, and it has to live here, whether we have data mesh or data warehouses or data marts, whatever. But it's always pitched as this big, massive overhaul, where there is going to be a big T truth that applies everywhere, and you're not allowed to change that. And that's just not feasible or realistic.
In my experience, you have to give people the ability to sort of iterate and tinker
and prototype and sort of try out new things, but give them a path to
move from prototype sort of design environment to a production high trust environment. And that
production high trust environment needs to be supported by all the best practices in software engineering, applied to a smaller but far more valuable and condensed slice of our data pipelines.
Yep. No, that makes total sense.
Okay, so where do data contracts fit in here, right?
I know we've probably been walking all around the subject of data contracts
in the philosophical discussion.
and I guess the practical discussion as well, around quality, but...
Okay, break down data contracts for us and where they fit into everything that you just outlined.
Yeah.
So data contracts are, at their core, agreements between producers and consumers, enforced through a programmatic mechanism. Put simply, it's like a data API. And an API for data is more robust and comprehensive than, I would say, a traditional API, because
you're not just thinking about the schema and the evolution of the schema, but you're
taking into consideration the data itself, right?
So then this goes back to that real world truth that I was mentioning before. If I have an expectation that a particular ID field always is a 10 character string, then I need to ensure that the
data itself reflects that. And if I get a nine-character string or a 15-character string, that means that somewhere a bug or a regression has been introduced. And that means my assumption that this data represents the big T truth has been violated, because it doesn't make sense for an ID to be 15 characters.
It doesn't work in our system.
Right.
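(To make that concrete: a minimal sketch, in Python, of the kind of data-level check Chad describes. The shipment record, field names, and the ten-character rule are illustrative, not from any specific tool.)

```python
from dataclasses import dataclass

@dataclass
class Record:
    shipment_id: str
    origin: str
    destination: str

def check_contract(record: Record) -> list[str]:
    """Return contract violations for one record (empty list = valid)."""
    violations = []
    # Schema-level expectation: the field exists and is a string.
    if not isinstance(record.shipment_id, str):
        violations.append("shipment_id must be a string")
    # Data-level expectation: IDs in this system are exactly 10 characters,
    # so a 9- or 15-character ID signals an upstream bug, not a schema change.
    elif len(record.shipment_id) != 10:
        violations.append(
            f"shipment_id must be 10 characters, got {len(record.shipment_id)}"
        )
    return violations

print(check_contract(Record("AB12345678", "SEA", "PDX")))  # [] -> valid
print(check_contract(Record("AB123", "SEA", "PDX")))       # one violation
```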
So, I mentioned data quality before and split it into these two halves: you've got this issue that's about truth and semantics, and you have this other issue that's about whether the product maps to the requirements that I have. I think that data contracts actually start primarily on the right side of that.
It starts as a quality mechanism to say, is my data product working the way that I expect?
And do I have a very clear owner that's willing to fix bugs and regressions in that data product?
But I think that over time, it can be used to transition more to solve some of the semantic
problems that I mentioned before. Yeah, that makes total sense.
One question, this is kind of a practical or maybe a specific question. One of the challenges that I've seen come up over and over again, as it relates to data contracts is on the logic side, on the consumer side, right? So one of the challenges is that you have like a sales
team or a marketing team or a product team, and they have some sort of tooling that allows them
to do whatever they do, right? So they're sending messages or they're moving deals through some sort
of life cycle or whatever. And tons of logic lives in there, right? But those systems tend to be
very inflexible, understandably, right? Because they're built for that purpose.
And so when you think about a contract, I think one of the challenges is that you have logic,
business logic, that I would say many times is a contributor or informer of even some of the
semantic like big T truth. This is what a closed deal is or whatever, right? So that lives in a
downstream tool. But when we think about an API, as you described it for data, you know, that a lot
of that has to be centralized in infrastructure. How do you think about that in the world of data contracts and even the technical side of data contracts?
Yeah, it's definitely a challenging problem, but it's actually one that I think is going to be solved at some point.
Salesforce, for example, has their own sort of DevOps-oriented infrastructure now, where changes are logged through job actions. And so if you're a developer, you can tie into that. And I think there's a lot of interesting potential in those types of systems, like essentially being able to say, hey,
we detected by running a check that you were about to drop a column
in your Salesforce schema.
There's someone downstream that has a dependency on you,
so we're not going to let you do that.
And obviously you need an engineer
to implement a system like that,
but you can abstract the messaging
up to the level of the non-technical user.
There are obviously some systems that are very old, like ERP systems and
things like that, that, you know, maybe will never fully integrate, like
they'll never have their own like DevOps solution, but even then I don't think
it's an impossible problem to solve.
The challenge is really getting in between the change and the data making its way to whatever its business-critical pipeline is.
So for example, you could do something where you say, look, I just want to have some staging
table where I drop all the data from Salesforce or my ERP system.
I run a set of checks.
Ideally, if it was real time, all the better. But most of this stuff is pushed out through batch systems, so you can run a check maybe once per day or once every few hours. And if you see any violations of this contract downstream, then you can revert to a previous version, or you could try to parse through that data and only allow whatever, at the row level, meets the contract through into the pipeline.
And then you can try to have some alerts or notification
for the salesperson or the business person
that made the change that said,
hey, something that you pushed out earlier in the day
or yesterday was a violation of a contract
and you're potentially causing
a machine learning model to break.
We're gonna need you to go in and update that, right?
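(As a rough illustration of the staging-table pattern just described, here is a hedged Python sketch, with pandas standing in for the staging layer and a print standing in for the producer notification; the table and column names are hypothetical.)

```python
import pandas as pd

def enforce_contract(staged: pd.DataFrame) -> pd.DataFrame:
    """Split staged rows into passing and violating sets; alert on violations."""
    # Hypothetical contract: account_id is a non-null 10-character string.
    ok = staged["account_id"].astype("string").str.len().eq(10).fillna(False)
    passing, violating = staged[ok], staged[~ok]
    if not violating.empty:
        # In practice this would notify the producer (Slack, ticket, pager).
        print(f"Contract violation: quarantining {len(violating)} row(s)")
    return passing

staged = pd.DataFrame({
    "account_id": ["AB12345678", "SHORT", None],
    "amount": [100, 250, 75],
})
clean = enforce_contract(staged)  # only the valid row flows downstream
print(clean)
```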
So some of this probably is going to require
significant culture change.
Like it's just people learning
that changes that you make to data
have impacts elsewhere.
But some of it is like having the right tooling
to just like get in between bad data
arriving in a pipeline
and having some messaging
that goes out to these
producers. Yep. What happens when, you know, I think a lot of companies have, I would say,
maybe implicit contracts, but not explicit contracts around data, right? Especially when
there's not, you know, centralized infrastructure or other sort of tooling to
mitigate that. How do you see that play out at a lot of companies?
Yeah. A ton of companies have implicit contracts. I call them non-consensual APIs.
That's great.
Yeah. And it's not good. It never really plays out well, honestly.
I don't think I've seen a single situation of those implicit contracts actually being positive for anyone downstream.
But it also makes sense why they exist, right?
You have some software engineer who owns a Postgres database or MySQL database or something like that,
they are thinking in terms of their production applications and ensuring that their applications
have the right data to function. And they're not thinking about the downstream data or the
analytics or the machine learning at all. And that's because a lot of the tooling that teams use, like some ELT tools or CDC,
allows these teams to not be concerned with those problems, right?
And say, hey, I'm just going to plug into your database.
I'm going to pull your data out.
I'm going to do something fancy with it because I need to move quickly.
And the engineering team says, okay, that's cool.
But just so you know, you have a dependency on me.
And that's that. I just don't need to worry about it.
Like, you're going to fix this issue.
And that's usually fine for the first few years that a company exists, right?
Because A, it's very easy to be in the loop whenever an engineer makes a potential breaking
change to your pipeline.
And B, you know, people are just thoughtful and nobody's a jerk. And the data, I would say, isn't useful enough to really have any sort of strict data quality guidelines around it. It's mainly for, you know, analytics, maybe it's for BI, you know, okay.
If my customer churn table is down for a few hours, or it's maybe down for a couple of days
while some analyst comes in and fixes it, that's fine. It's not that big of a deal. But once you start getting to scale, and now you have
data engineers that are being bottlenecked, or they are the bottleneck in a lot of cases,
because you've got this large team of data consumers and data scientists, and they have
machine learning models, and those models are breaking all the time, and you have all these changes that are happening. And all of those
tickets get routed to this central bottleneck, the data engineering team, and they're spending
all their time just solving tickets constantly. And it's not fun for anybody. It's not fun for
the consumers because they're not having their problems addressed in a timely way.
It's not fun for the data engineers because they're just constantly underwater and they don't get to do what they actually want to do, which
is do engineering and like build things.
And it's not really fun for the data producers either because they get
yelled at, you know, like every other week about something that they
broke that they had no idea about.
And so, yeah, that's sort of how I've seen it typically play out.
Like most companies I've seen on the modern data stack that adopt that, you know, just
move fast and break things, early architecture, get to a point where like that doesn't actually
work anymore.
Let's go through like, let's say, a quick example of some data infrastructure and where
like the data contracts exist in it, right? Like, let's assume we have like a typical example of a production database.
Postgres generates data, of course.
You want to export the data from there.
So there's some kind of ETL, like CDC, whatever.
It doesn't matter, right?
Like, take the data out of there, put it in a data warehouse.
There are some steps of transformation that will happen to the data there, and you will end up
with some tables
that can be consumed for analytical
purposes. Let's keep it in the
simplest, most
common scenario of analytics.
Let's not talk about a
more exotic use case.
Where do data contracts fit in this environment?
And the reason I'm asking is because you use the words API and data contracts.
And in my mind, an API is always like a contract between two systems, right?
And in the world of the data engineer, we actually have way too many systems that we need to orchestrate or like make them operate.
Right.
So help me understand a little bit, like, where do we start putting these data contracts in such like an environment?
So in general, we'll start at a high level and sort of drill down to the tactical. At a high level, I think that data contracts need to exist anytime there is a handoff of data from one team to another team.
So that could be from the Postgres database to the data lake.
It could be from the data lake to the data warehouse.
It could be from one team that owns a particular data model in the data warehouse to another team that consumes that model.
But anytime data is handed off and there's some transformation that's happening, there
needs to be a data contract and that sort of API input output needs to exist.
As you rightly pointed out, depending on where you're at in the pipeline, the vehicle, the mechanism of enforcement, that the data contract takes is going to look different.
So if you're trying to enforce at the production Postgres level, then you're probably going to need something in CI/CD.
You want to prevent the changes from being made before they happen as often as you can. If you have a CDC and you've
got an event bus, then you might want to do a set of enforcements there, right? We want to look at
each row. And if we detect that at the row level, there's a violation on the contracts, we can
sideline that data, stick it into a DLQ, get to backfilling later, and send out an alert to the
team that's on call for that contract. The overall goal is to try to shift the ownership as far left as we can for each contract, and try to make the enforcement as tactile and as embedded into the developer workflow as we
possibly can. So if we're just talking about like Postgres, for example, or we're talking about the use case, we might want to start off by defining a contract in some schema serialization framework.
So it could be Protobuf, it could be Avro, it could be JSON Schema, though I don't recommend
that.
You'll want to store that contract in some type of registry.
And then there should be a mechanism of doing backwards incompatibility checks on that stored
contract and ideally
on the data itself during the actual build process.
And then you can, you know, break the build and send an alert that says, hey,
there's been a contract violation.
That's like one example.
But like I said, for each transformation stage, there are things that you can do that
you can try to tie back to a producer.
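(One hedged sketch of that build-time check: asking Confluent Schema Registry's REST compatibility endpoint whether a proposed Avro schema is backwards compatible, and failing the CI build if not. The registry URL, subject name, and schema here are placeholders; a registry client library could do the same.)

```python
import json
import sys

import requests

REGISTRY_URL = "http://localhost:8081"   # placeholder registry address
SUBJECT = "shipments-value"              # placeholder subject name

# The proposed new schema for the shipments topic (Avro, as a JSON string).
new_schema = {
    "type": "record",
    "name": "Shipment",
    "fields": [
        {"name": "shipment_id", "type": "string"},
        {"name": "distance_km", "type": "double"},
    ],
}

# Ask the registry whether the new schema is compatible with the latest
# registered version; run this during the build and fail on incompatibility.
resp = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(new_schema)},
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    sys.exit("Breaking schema change: contract violation, failing the build")
print("Schema change is backwards compatible")
```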
Yeah, that makes a lot of sense.
And, okay, there are many different people involved, right?
Yes.
Probably more than the technologies involved in this whole process.
So let's take, don't overcomplicate it,
but let's at least assume two basic categories.
We have the data producers and the data consumers, right?
What's the value that each one of them gets from implementing data contracts?
Yes.
So a lot of this comes down to the implementation, but I would say the primary value that the producer gets is awareness.
If it's implemented the right way.
And there's a caveat, which is that the data contract is a really meaningful piece of technology, but it serves a function.
And this very specific function that it serves
is to define contracts and to enforce contracts.
All around that core function,
I think there needs to exist other capabilities
at an organization which add the value
that you're talking about.
And I think of this as not super dissimilar from GitHub,
where at its source or at the core,
GitHub is a platform that facilitates source control.
But all around source control,
you have this other functionality that brings engineers
from all across the company together.
Pull requests, you know, code diffs, things like that.
Just make deploying and managing sort of, you know, code deployments in an agile way,
like very easy for everybody.
And that creates a great incentive to actually use the system.
Data contracts require something similar.
What we found at Convoy was that awareness was the big value for the producer.
And that meant understanding if you own some upstream database, how is that data actually
being used?
Where is it being used?
And if you're going to make some change, how is that going to impact someone else?
The reason that this is a valuable thing is obviously because as an engineer, you want
to build scalable, maintainable systems and you don't want to break anybody.
Also, you deserve credit.
If your data is being used in, let's say, a pricing model for the company and you ensure
high data quality for your piece of the pie, and that makes the model better, then
that's something that you as an engineer really deserve credit for.
And then on the final part, it's not good if software engineers have to attend or participate in a COE (a correction-of-error review) because there was some breaking change that was made to a very valuable data product.
So as often as we can avoid that, right, that would be ideal for them.
So the next thing: the value for the consumer is really having higher quality data, specifically for the things that are most important to them.
And by that, I mean, I don't think that data contracts need to apply everywhere.
Not everywhere you have data or every
use case of data requires a contract. I think because contracts do add time and they do add
additional effort, they should only be applied where the ROI justification makes sense. So if
you've got, you know, like, I mean, we mentioned analytics, but ideally it would be some report
that adds a lot of value back to the company, like a dashboard, the CEO looks at
every single morning. Maybe in that case, a contract would make a lot of sense. And if you've
got some data consumer that's on the hook for ensuring that the data is correct, they probably
never want to be in a situation where they go into that meeting like, Oh, sorry guys, the dashboard
is broken, and I have no idea why. Just from a career perspective and also from a business
perspective, that's not really great. There's actually a couple more things I wanted to mention on the producer side really quickly that's very valuable.
One of them is I think that contracts are bidirectional systems.
So lineage to me is a huge part of the contract, being able to understand, you know, where the data is actually being used, what feeds into the contract, and also who is using the source data.
And if it's bidirectional, it means that not only should the producer be accountable to the consumer, but the consumer has to be accountable to the producer.
So GDPR is a really great example of where this adds an enormous amount of value, right?
Like if you're an engineer and you're generating some data that might be audited or you are
accountable for how it's used at the company, you need to have that insight.
Otherwise, it doesn't make sense to make the data available to anybody at all.
So yeah, there's a couple examples.
Okay, that's awesome.
And, okay, you mentioned earlier to Eric that there are always some implicit
contracts, right? Yes.
Let's say the company reaches the point
where things being implicit
is not a good thing.
And I think pretty much everyone who has
been working for a while has
experienced that, right? It's part of
the evolution of building systems.
Where do we start
in making things explicit?
I'm asking you because you have the experience
of talking with so many different teams and people.
Who starts this conversation about the data products
and who usually pushes enough for this to happen?
Who is the driving force behind it?
Yeah, great question.
Generally, the driving force is the data engineering team or the data platform team.
The reason for that is they are the bottleneck.
They're feeling a tremendous amount of pain in most cases.
This was my team, right?
Every day, we'd have 10, 15, 20 service desk tickets.
And they all essentially follow the same pattern, which is something happened in a production system, and the downstream team did not have the ability to solve it themselves, and they relied on us.
And we had a lot of churn for that reason. So the data engineers generally want to get out of that communication cycle between the producers and consumers. And this is a method of managing that decentralization.
In terms of where you start, it's a big cultural transition.
A lot of it depends on the company, honestly.
If you've got a use case that is unbelievably valuable to the
business, then you can probably skip a couple of steps, right?
Like, so if you're Amazon and, you know, you have your recommendations model or whatever,
and that's making you $2 billion a year, I would guess with about 99% certainty that
they have a lot of mechanisms in place to prevent that model from just breaking randomly.
So that's a great starting point.
Is there something that's really valuable to the business?
I think you can actually start directly with the producer in that case and saying, hey,
there's some constraints that we need.
There's some policies that we need to implement about how data is changed.
And we're actually not going to allow you to make schema changes or make significant
changes to the data because whatever feature you're building is not as important as our recommendation model.
Like there's nothing that you could create
that could generate more value than that.
And so therefore we're going to block you
and that's probably a business decision.
In most other cases,
what I'd say is the best thing to do
is invest in this sort of awareness infrastructure.
The goal is not to initiate change
from the producer side on day one.
It's to allow everybody in the pipeline to just figure out what would happen based on
the changes that they make.
As an engineer, if you don't have the context on what's going to happen if you do this, then you can't possibly make an informed decision, nor can you take ownership of the data in the future.
This is what we did at Convoy.
So we basically said, hey, we have a valuable use case.
We want to inform, but not break.
We had a GitHub bot so that if there was a change, a potentially breaking change that was being
made, we would use that GitHub bot to alert and say, hey, here's how the data is being
used downstream.
Here's the data product.
Here's the SLA.
Here's what's going to happen if the pipeline fails,
like it's going to be an incident or not.
And here's the person that you should go talk to
to actually, you know, work through this change.
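(A simplified sketch of what such a bot step might look like; the downstream-usage lookup is stubbed out, and the repo, PR number, and lineage details are hypothetical. The comment is posted through GitHub's standard issue-comments endpoint.)

```python
import os
import requests

def downstream_usage(column: str) -> dict:
    """Stub: in a real system this would query a lineage/contract service."""
    return {
        "data_product": "pricing_model_features",
        "sla": "hourly refresh",
        "owner": "@data-science-pricing",
    }

def comment_on_pr(repo: str, pr_number: int, column: str) -> None:
    usage = downstream_usage(column)
    body = (
        f"Heads up: this change drops `{column}`, which feeds "
        f"{usage['data_product']} (SLA: {usage['sla']}).\n"
        f"Please talk to {usage['owner']} before merging."
    )
    # Standard GitHub REST endpoint for commenting on a pull request.
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": body},
    )
    resp.raise_for_status()

comment_on_pr("acme/warehouse-migrations", 1234, "shipment_distance")
```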
And then the producer has a choice.
They can either say, you know what?
I think it's fine.
Doesn't really seem like a big use case.
I really need to get this thing out the door.
And that's okay.
They just push the change.
They're willing to deal with the results. And at worst, we can still alert the downstream consumers that a
change is coming. We know exactly like why, where it's coming from, how to sort of negotiate and
deal with the problem. And in the best case scenario, they say, oh yeah, maybe I should go
have a conversation with this person because I don't want to break them. And we come to some
amicable conclusion of the contract. That said, I'm sort of answering your question in reverse there, but the first part was, where do you start?
I think you have to,
but it's much easier to start on the producer's side.
If you can get contracts on the producer's side first,
then every transformation step below it,
the owner is going to feel much more confident saying,
oh, well, my data is now under contract
and therefore I feel comfortable vending that data to someone else.
If you try to go from the bottom up, you don't have that, right?
Like you could still potentially be broken.
And now, as a data owner, you're sort of right back to square one, where instead of the data engineer, that onus has just been shifted to whoever the data consumer or analytics engineer is that owns that data set.
And that's not a good feeling.
Yeah, that makes a lot of sense.
My question is like, you know,
whenever we're talking about like APIs and contracts
and all that stuff, like usually we have,
like it's a two-sided thing, right?
Like there are two parties that have to agree on something, right?
Right, right.
And I can think of like engineers,
having a conversation and figuring out the schema, how it should look, or adding, removing, all these things, the more technical things.
But here we are talking about, at least how I understand it so far, a process that
has implications from the engineering side to the highest level, let's say, consumer of data, right?
Because as you said, like there might be, I don't know,
like a dashboard that the CEO like uses to report to the board, right?
When you have to communicate between the consumer and the producer
to create contracts, right?
And here we have people that, like, okay, they think in very different terms, right?
Like, even the language that they are using is, like, different.
How can this happen?
Or maybe it doesn't.
I don't know.
Maybe it's not that important, right?
But how do you deal with that?
Yeah.
So, two things.
The first thing is that I think there is a maturity curve of
implementing contracts at a company. And I think the curve starts with the producers and
the technical consumers having that conversation because at least their language is the most
similar to each other. It's still different, but it's the most similar to each other. And I believe
the vehicle of communication is the PR. And in that PR, if you can communicate, hey, this is how
the data is being used. Here's information about the lineage so you can see how it's transformed,
what the final data set looks like. And here are all the constraints and why we need those
constraints. That is probably enough information. That's sort of like the right level of communication
for producers and consumers to have a fruitful, productive discussion. I think that for the non-technical consumer, it's a lot more challenging to have that conversation directly with the producer.
So, and again, you know, I'm not even this far yet, but it's where I want to get to: I think that, in the same way there's this sort of surface for conversation between the producer and the consumer, there needs to be a similar surface for conversation between the non-technical consumer and the technical consumer.
Where a non-technical consumer can essentially say, hey, here's what I know about the business.
Here's what I know needs to be true.
And that technical consumer is able to translate that set of requirements into contracts
that can then be
fulfilled by the producer. So I think it's probably a sort of a double hop of communication.
And how does it work with a semantic layer in place? I know that you talked at the beginning about the difference between a semantic layer and a logical layer. But at least in my experience with the enterprise, where you have the Collibras of the world out there, it's a very top-down kind of situation, where the board will come
and define what the revenue is and then we are going to create the terminology of this
is what revenue is, and then this has to spread across the rest of the organization. So how can these two things align, right?
It's very tricky.
It's absolutely a very tricky thing.
Basically, this is going to be an unsatisfactory answer, but I think that there really needs
to exist levels of abstraction that are based around, you know, fundamental engineering artifacts. I think it would be very hard to go directly from the business wanting to define some metric to then taking that and translating it, if the foundations of trustworthy data are not there in the engineering and programmatic sense. That's why I always recommend starting off by ensuring that you have this sort of foundational, highly trustworthy data pipeline that is
defined between the technical producer and the technical consumer. And then I think there's
lots of interesting ways that you can focus on abstraction
the higher that you go,
which is, like I said,
it's sort of a non-satisfying answer
because people want to do all these interesting things
with the semantic layer today.
And my personal opinion is that we're sort of trying to reverse decades of bad practice,
just to be frank with you.
We've kind of been doing data the wrong way,
where in a lot of cases, we've started at the end.
We started with the analytics sort of BI tool
and said, let's just sort of very quickly get data
into these really complex analytical instruments.
And we can build out a lot of cool stuff
and build out all our metrics and everything else.
And the fundamental architecture and upstream ownership
is just not there. And then we reach a point where we want
to do so many more interesting things with our data. We want to have OLAP cubes and do slice
and dice and have semantic layers and have these like APIs and all this other great stuff,
but you don't even have ownership from the source. And so I think we need to reverse that trend,
start from the top, work our way down, and then build the layers of abstraction onto that.
So then ideally the non-technical consumer can say, Hey, I have this
version of margin that I would like to define, and here's how I like to define
it, and that just sort of back-propagates through the system.
But I don't think the foundations are in place to do things like that yet.
I totally agree with that.
All right.
So let's talk about tooling.
I think you mentioned a lot of GitOps stuff, right? PRs, working all together on Git, and a lot of stuff. So if we'd like to start implementing data contracts today, outside of the Git repo, what do we need? What are the tools, let's say the fundamental tools, that an engineering team needs?
Yeah.
So just starting from like a requirements first perspective, and then, you know, we could talk about very specific tooling.
From a requirements perspective, you need some mechanism of defining a contract. You need some form of a registry; a schema registry could work for that.
You need some schema serialization framework to work in.
So you need to be using protobuf or Avro or JSON schema.
And then you need some mechanism of detecting backwards-incompatible changes.
So, you know, I sort of wrote a whole article about exactly how you do this,
but you can, you know, you can use Docker, you can spin up a clone of the database.
You can run a backwards incompatibility check against that.
During the build phase, you can do a check against the Kafka schema registry and do
backwards incompatibility checks against that.
I would say that having the schema evolution pieces in place are
the most foundational aspects of the contract.
And they're the most foundational aspects of ownership in general.
So if you get that in place, you're like 50% of the way there. The next big piece is how you enforce on non-schema-related data issues, semantics,
cardinality. And so there's a few different places where enforcement makes sense. It really
just depends on your use case and how the data moves through the pipeline. But like in Convoy's
case, we had a data lake, we were doing streaming. We were using CDC with Debezium.
We were already using Flink as a stream processing layer.
And we were also using Snowflake.
And so when you just think about like that spectrum of technologies, what we could do
is we could have checks in the CI/CD layer.
We could do checks in the application code on values.
So if we detected that there's some value that falls
outside of the constraint, we could block it there. In the stream, we could use Flink to
run some Flink SQL and have checks like, at the row level, does this entity have a many-to-one
relationship with another entity? And is that what we actually observed? If yes, great. Allow it
through the pipeline. If no, sidelight it. And then when what we actually observed? If yes, great. Allow it through the pipeline.
If no, sidelight it.
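(A rough sketch of that stream-level enforcement, using a plain Kafka consumer/producer pair in place of Flink; the topic names and the row-level rule itself are hypothetical.)

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-enforcer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["shipments"])

def meets_contract(event: dict) -> bool:
    # Hypothetical row-level rule: every event carries exactly one carrier.
    return isinstance(event.get("carrier_id"), str) and len(event["carrier_id"]) > 0

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Valid rows continue down the pipeline; violations go to a DLQ topic
    # for later backfill, and the on-call team is alerted separately.
    topic = "shipments.clean" if meets_contract(event) else "shipments.dlq"
    producer.produce(topic, msg.value())
    producer.poll(0)
```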
And then when the data actually lands in like a lake or a warehouse,
you can take data profiles.
So, like, WhyLabs has a really cool open source tool
for doing data profiles.
You've got a bunch of great tools
for monitoring out there,
like Monte Carlo and LightUp and elementary.io,
which is the open source version.
And so you can do all those checks there.
And then you've got the warehouse.
And in the warehouse, you've got, you know,
you've got Airflow, you've got dbt tests, you've got Great Expectations, and you can implement your CI/CD checks still using the schema registry if you're using a tool like dbt.
And then you would have to do checks on batch, right? You'd have just a batch process. You run all your checks there. You see if it passes or not. And then you have some, you know, system in place for either rolling back to a previous version or shunting the data to another table or something like that.
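(And for the batch end of the spectrum, a minimal sketch of running declarative checks and shunting violating rows to a quarantine table, with sqlite standing in for the warehouse; the checks themselves are illustrative.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipments (shipment_id TEXT, distance_km REAL);
    INSERT INTO shipments VALUES ('AB12345678', 420.5), ('BAD', -3.0);
""")

# Declarative batch checks: each query counts contract violations.
checks = {
    "id_length": "SELECT COUNT(*) FROM shipments WHERE LENGTH(shipment_id) != 10",
    "non_negative_distance": "SELECT COUNT(*) FROM shipments WHERE distance_km < 0",
}

failed = {name: n for name, sql in checks.items()
          if (n := conn.execute(sql).fetchone()[0]) > 0}

if failed:
    # Shunt violating rows to a quarantine table instead of publishing.
    conn.execute("""
        CREATE TABLE shipments_quarantine AS
        SELECT * FROM shipments
        WHERE LENGTH(shipment_id) != 10 OR distance_km < 0
    """)
    print(f"Checks failed: {failed}; rows quarantined, publish blocked")
else:
    print("All contract checks passed; safe to publish")
```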
So technically, all the tools to do this stuff already exist, right?
All the open source tools are out there.
It's just a matter of stringing all the pieces together
so that you have the right level of enforcement
in the right place.
At least like that's how you would do
sort of the core data contracting technology.
Yeah, makes a lot of sense.
All right, cool.
One last question from me,
then I'll give the mic back to you, Eric.
So we've been talking all this time, and I think we were equally talking, about technology and people, right? People are always involved at the end. You have to take the people and educate them, or make them, I don't know, change the way that they do things, and all that stuff, and agree at the end. We have a contract at the end. We have to agree and sign it.
And I know that you are very active in building a community around that stuff. So I'd like to ask you about that. How important is education, and how does a community act as a vessel for this change to happen? So share your wisdom with us about the community, because it's a very interesting topic.
Yeah, absolutely.
So I think community is critical here. And there's a couple of reasons I think it's critical.
The first is that, you know, one of the things I've heard a lot from people that have read my content is, wow, you're saying things that seem so obvious in retrospect. Like, of course you
can't solve data quality unless the producer gets involved.
Like, how could you possibly do it?
Like garbage in garbage out doesn't make any sense if the garbage is already in, right?
Like you, you have to prevent it from getting in the first place.
And there's only one way to do that.
And that's to start from source.
And I think part of the community is giving people the weapons, maybe that's not the right word, but giving them the tools to have
the appropriate conversations with their producers or with their consumers. And oftentimes data
engineers and data platform engineers are so in the weeds, focusing on the day-to-day work,
that it's hard for them to take a step back and figure out, how do I have these conversations in the bigger sense? And this is something I think community
is really useful for. It's like saying, oh, wait a second, I can actually contextualize
all the problems I'm having in this larger narrative about the company. Why is data set
up the way that it is? How is data quality affected by these various sort of pieces in the business working together?
And how can I speak to that and propose changes that actually make more sense?
The other reason I think that community is valuable for at least talking about data contracts
is because, as you said, historically, these types of problems have been purely organizational,
right?
We need to make some organizational shift, right?
You hear a lot about data mesh, and data mesh is an organizational thing. It's like, we need to restructure our organization so we have better ownership of data objects and domains, which I don't think is entirely necessary to get to a point where you actually have enough problems solved that people don't really hate doing their work every day. But it has been this
really heavy organizational process and getting in front of these people saying, actually, you still will always need some
element of cultural transition, but technology can really help because technology makes that
cultural transition easy, right?
It's easy for the producers to take the ownership.
If it's easy for them to understand how the data is being used, and if it's easy for the
consumer to define what they need, then people will do it, right? The bottleneck is, people don't do the right thing if doing the right thing is very hard and it takes away from their primary work. Right? So I think another great message to spread through the culture is helping people
overcome the traumas of the past, where they've tried to do this stuff before, and they've just
gotten smacked down by the fist of reality.
It's like, well, you've got to understand the reason you failed, right?
The reason that you got smacked down is because you were asking the business to do this massive
cultural change and it's not really tied to business value.
And it would have taken a year and a half or two years.
And you had to, it had to involve the entire organization instead of doing it iteratively
and programmatically and like very efficiently, you know?
So I think the community is really great for sharing stories like that
and for just helping people think through these types of issues.
This is great. Eric, any questions?
Yeah, we're close to the buzzer here, but that was really helpful, practical advice.
Yeah, it is so funny. I mean, you said this at the beginning of the show, but it's almost cathartic to think about a full reset.
You know, like, let's just do a full reset and like build all this stuff from the ground up or whatever. But that's not actually reality. But another question, I think, on the practical side to close us out here.
So on the cultural side, I think that was really helpful. On the tooling side, it seems like
there's a bit of a gap. And you described it really well, Chad, when you talked about,
okay, well, you're a small company, and it's okay, a pipeline breaks, and you know, someone's dashboard goes
down. And so they send a Slack message, hey, something wrong here. Oh, yeah, let me look at
that. Okay, like, you get it up and running in the next day. But you know, it's not like the
company's losing money, because, you know, this data flow or pipeline broke, but you inevitably
in that environment, like accrue a bunch of debt that
you're going to have to pay back at some point. And it's interesting because those smaller teams
don't often have the resources to implement dedicated tooling around APIs or data contracts
or whatever. How do you approach that, thinking about our listeners who
are maybe at smaller companies, they maybe are working on the cultural side,
but from a tooling standpoint, it's like, well, I'm definitely not going to get the budget to go
buy a really nice dedicated tool for some of this stuff. But I also don't have the bandwidth to start building some of this stuff internally. What is it? Where should they
start? How should they think about it? So I think if you're at a small company,
the best thing that you can do is to try to be in the loop whenever producers are making changes to things, and just establish a good relationship with those teams. Right? So, you know, if there's a meeting, explain, like, hey, we have some important data.
Can you just invite me, you know, whenever you're talking about making a major database change? Just sort of loop me in. Let's put together a dedicated Slack channel: if you have changes that impact the database at all, we can push all the alerts to that Slack channel.
So at least I'm notified.
I can ask you questions.
But I think it really is sort of a getting in the loop
and having the conversation.
If you don't have the resources for like tools
or open source technology or whatever,
or building something.
And I think that the point of transition starts to come when there is some data asset where, if you have incremental data quality, you start to experience incremental value back to the business, measurable value back to the business, right?
So I've got, you know, maybe a machine learning model and it's
a relevance model and it's running every day and I know it's making us money and we're having to
drop 10, 20% of the data due to null values. And those no values are sort of being caused by issues
with upstream systems. And you say, okay, if I'm just able to solve this one problem, this very small slice of a pipeline
by getting a contract on a few, maybe one schema, or maybe even one or two columns upstream.
And I can say, hey, I was able to reduce the amount of nulls flowing into this table
by 25, 30%. And I can connect that and say like, Hey, there's some real world ground truth. And
we're making better predictions. And now our model is making more money. You have just justified why
data quality is a meaningful investment to me. What too many teams do, I found is like,
they try to take this very holistic approach and say, well, we need data quality everywhere.
We need monitoring everywhere. We need checks on everything.
And number one, that leads to alert fatigue almost 100% of the time, like I said. The metaphor I've used before is: if your house is on fire, you don't need a fire detector, you need the fireman, right? You already know the house is on fire. You don't need a bunch of alerts to tell you that you're burning up; you need someone to come and solve the problem. And so if you have a million different alarms going off, it actually numbs you and desensitizes the teams to data quality issues, which is a bad thing.
And so you need to focus on a smaller piece of the problem that's manageable, that's iterative,
that's not going to be a massive cultural shift for the producers and where you have
clear business value.
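As an illustration of that narrow, high-value slice, here's a minimal sketch of a column-level contract check along the lines Chad describes: assert that the null rate on one upstream column stays below a threshold, and fail loudly if it doesn't. The shipments table, price column, and 10% threshold are hypothetical, and an in-memory SQLite database stands in for your warehouse so the example is self-contained.

```python
import sqlite3


def null_rate(conn: sqlite3.Connection, table: str, column: str) -> float:
    """Fraction of rows in `table` where `column` is NULL.

    Identifiers are interpolated directly, so they must be trusted values
    (fine for a contract defined in code, not for user input).
    """
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0


def check_contract(conn: sqlite3.Connection, table: str, column: str,
                   max_null_rate: float) -> None:
    """Fail the pipeline (or CI job) if the column violates its contract."""
    rate = null_rate(conn, table, column)
    if rate > max_null_rate:
        raise AssertionError(
            f"Contract violated: {table}.{column} null rate {rate:.1%} "
            f"exceeds allowed {max_null_rate:.1%}"
        )


if __name__ == "__main__":
    # In-memory stand-in for an upstream table feeding the model.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shipments (id INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO shipments VALUES (?, ?)",
        [(1, 10.0), (2, None), (3, 12.5), (4, 11.0)],
    )
    # 1 of 4 rows is NULL (25%), so this deliberately fails the 10% contract.
    check_contract(conn, "shipments", "price", max_null_rate=0.10)
```

Run in CI against the producer's data, a check like this is what lets you say "nulls flowing into this table dropped by 25 to 30%" with a number behind it.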
This is exactly what we did at Convoy. And I will be honest and say I didn't start out doing that; I had to learn that it was the right approach. I took the big, wide approach at first, and that totally bombed and completely failed. And then when I switched to the smaller, narrower approach, we just got so much more traction. And the great thing was, because it wasn't as large of a lift on the producer side, the engineers got to familiarize themselves with these processes, and it turned out they're like, wait a second. This is just integration testing. This is just CI/CD. This is just APIs for data. Of course,
we should be doing this. Like, why aren't we doing this? And in fact, at some point,
I know maybe a lot of listeners will find this hard to believe, but the conversation actually flipped. Instead of the consumers going to the producers and saying, hey, I need you to take ownership over this stuff, it was the producers going to the consumers and saying, hey, I have some data here.
Is it useful to you?
And if it is, how do I put a contract around it?
So I think you just have to give people time and space and allow them to see the successes one by one, and not try to rush it and solve all the problems in the world in one single project.
I love it. I think that is so well said. Chad, this has been such a helpful episode. Even for the work that I do in my day-to-day job, there's just so much here to implement right away. So thank you. Thank you for joining the show. If people want to check out the community, where should they go?
So you can go into your browser right now and type in dataquality.camp/slack, and you'll get redirected to the Slack channel. It's Slack, so it's totally free. And right now, we're mainly a community for
networking and finding peers who are in the data space. So there's lots of people who are like
heads of data science at big companies, heads of data engineering, heads of data platform.
And they're all talking about how they're implementing data contracts and monitoring
and data quality of all sorts. But later in the year, maybe middle of the
year, we're going to start working on some other things like in-person events and meetups, training
courses, stuff like that. So there's a lot planned. Very cool. Well, keep us posted on the books as
well. And we'll have you back on to talk about whichever one you publish first.
That would be fantastic. Great talking to you folks. Thanks.
All right, Kostas, fascinating conversation with Chad Sanderson, who runs Data Quality Camp,
which is a community.
He produces a ton of content.
And we covered so many topics.
One of the things he kept returning to over and over again, which I think was incredibly helpful, is that there's so much practical stuff for people to get from the show. I felt like I could walk away from the show with practical things that I could start doing tomorrow to make data quality better. And I think that was
really refreshing because the conversation around data quality can feel really big, right? It's a huge problem.
How do you fix it? We have so much tech debt. What tools do I use? Where do you solve the
problem in the pipeline? Do you try to do things proactively with schema management? Do you try to
do sort of passive detection? I mean,
there's so many things. And I walked away, especially at the end there, with a couple of practical things in my mind, like, I should probably go do this tomorrow to make our data quality better. There are small things that I can do. And so I think both for the listeners who are data leaders and for the ones doing the work on the ground every day, it was a hugely practical, helpful episode in terms of what you can do tomorrow to start improving data quality. We also talked about philosophy, which is always fun.
Absolutely. What I'll keep from the conversation that we had with Chad,
which I found very refreshing and interesting, is that he's giving a definition of what data quality is from the perspective of, let's say, the agents that are involved in the process of working with data, and not trying to give an objective definition of, oh, you have this metric and that metric, something a machine can automatically reason about.
And I think that's the most important thing here. But at the end, no matter what, data is information
and we have to agree on how we use it.
And that's, I think, the big change that Chad brings with his ideas.
And the most interesting thing is that he's not keeping it abstract. It's not an abstract paradigm where an organization has to go and hire hundreds of consultants to coach you on how to do it. He tells you that you can do it today; the tooling is out there. And he positions technology in a very interesting way, in terms of what its role is in making this happen. So,
I would encourage everyone to listen and re-listen to this episode, because I think there's a lot of wisdom in the things that we discussed, both in terms of technology and how it can be used, and also the importance of the people in the organization implementing processes around data.
I agree. Well, thank you again for joining the Data Stack Show.
We will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.