The Data Stack Show - 132: Data Quality and Data Contracts with Chad Sanderson of Data Quality Camp
Episode Date: March 29, 2023

Highlights from this week's conversation include:
- Chad's background in data (2:10)
- Breaking down data quality (4:02)
- Semantic and logical layers of data (10:04)
- What are data contracts and how do they... work? (17:41)
- Implicit contracts at companies (24:01)
- Where do data contracts fit in data infrastructure? (28:14)
- The value of data contracts to the producer and consumer (31:18)
- Tools needed in effective data contracts (46:13)
- The importance of community in data quality (50:53)
- Getting connected to Data Quality Camp (1:00:55)
- Final thoughts and takeaways (1:01:53)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show,
Kostas. Today we're going to talk with Chad Sanderson. He has had a long career as a data practitioner, but he runs a community and creates a lot of content around data quality. And he talks
a lot about data contracts in particular. And that's what I
want to ask him about. I don't think we've talked about data contracts in the show. We've
had discussions about data quality and a lot of the tooling that's trying to accomplish that. But
data contracts, I think, is a new subject. And so I want the breakdown, the 101
from Chad, because he is the expert. Yeah, a hundred percent.
I'm very excited to have him on the show, to be honest.
Data contracts is one of these concepts, I mean, we keep hearing more and more about it lately. But it's not the data contracts themselves, to be honest, that I'm so excited about.
It's more about having Chad on the show, because, you know, we tend to talk about different things in data infrastructure, usually from the point of view of bringing a solution, but in this case we'll have a person who's just so passionate about data quality.
Right.
And who tries to build not just a product, but more of a change in the way that people work around data, and data quality in particular.
So there are much broader things that we can discuss with him.
And I'm really looking forward to do that and talk about data quality in general, why it is important, why it's so hard to define it, where data products fit in this narrative, and how we can make things better.
All right, well, let's dig in and talk data contracts with Chad.
Let's do it.
Chad, welcome to the Data Stack Show. A privilege to have you on.
A privilege to be here. Thanks for having me. Absolutely. Well, we have a ton to talk about.
I've read a lot of your work. It's been a huge help to me and the way that I think about a lot
of things related to data quality. But give us your background and kind of what led you to what you're doing today
with the community and the content.
Yeah, so these days I spend most of my time doing writing,
going to various conferences.
I may have a book or two in the works at the moment, so stay tuned on that.
And I'm running a community called Data Quality Camp.
And that community is focused around managing data quality at scale,
which is an unsolved need.
There's a lot of ways to think about data quality and not too many standards.
So I thought it'd be good to, you know, stand up some community that
helps people manage their transition to data quality a little bit better.
Before that, I spent three years at a company called Convoy, which is a freight tech startup based out of Seattle, working on more small- and medium-data problems, where the issue is less around cost and compute and more
around complexity of the data and ownership. And from my time there trying to solve these problems
around the accumulation of tech debt in the data warehouse, ownership problems between producers
and consumers, we derived this programmatic initiative around an idea called data contracts.
And that's what I spend most of my time writing and talking about these days.
Prior to that, I was at Microsoft.
I worked on their artificial intelligence platform team.
And then I've been in big data for just around 10 years altogether.
I worked at a variety of other companies in the e-commerce space.
Okay, Chad, you mentioned that you do a lot of thinking and writing about data quality at scale.
You faced this problem in previous jobs. But as you mentioned, data quality, there's a lot that goes into it. It's a very wide field. There are lots of companies that are trying to solve this
problem. And they're doing it in some very different ways, right? Very different approaches.
Can you break down data quality for us? How do you frame such a big topic?
Yeah, so I would basically break down data quality into two main categories that each have their own subdivisions of delegations of concern, I guess.
The first layer of quality is what I think any engineering team would think of when they hear quality, right?
Does the application work the way that it was intended?
If we have a set of requirements for a product, are those requirements being met? Are the SLAs that we have being met?
And that could include the freshness of the data.
It could include whether or not there are serious breaking changes being made to dependencies: is the API of any service being consumed evolving in a way that is conducive to the health of that application?
And so that's
one part of it, I think, where we're treating
the data itself like a
product, and so there is some expected
level of quality
mapped to the requirements.
The other element of quality that I think is unique to data is the idea of truth or
trustworthiness.
The data needs to map to some real world reality.
And if I have a shipment and I have data about the shipment, and I know where the shipment was dropped off, where it was initiated from, and whether or not it arrived on time, all of that should map to whatever really happened in the real world. And that is a really complex subject, because you have different levels of big T truth and little t truth, if you spend any time in philosophy 101. So there's big T truth, where there's some objective meaning to what
happened. And then there's our little t truth, which is the subjective interpretation of what
happened. And all of us, many people at a company have maybe different interpretations of what a
specific metric might mean or what a specific dimension might mean. So part of data quality is ensuring that everyone is speaking the same language
and that the objective truth about the world is reflected in the data itself. That's what I think
are the main components of data quality. That's super helpful. And I love the philosophy angle. Do you see more struggle on the big T
side or the little T side? I mean, obviously, if you don't get the big T right, then you're
going to have a lot of problems with the little T. But do companies really struggle with the big
T side of things? I think basically every company I've ever talked to struggles with the big T side
of things. And this is not to jump the gun, but this is one of the major issues that data
contracts is attempting to solve is ensuring that the data is defined in a way that maps
to the real world and that it doesn't change unexpectedly for reasons that may have to
do with something other than the data itself.
Like, oh, we decided to launch a new feature or we decided to drop a column or rename a column
because it didn't really fit what we were attempting to do with our application.
The goal is for the data we're collecting from our source systems to be as tightly mapped to
that big T truth as possible. And part of that mapping has to come from the consumers who understand what the data
needs and what it maps to having a great relationship with producers who are responsible
for maintaining the systems that are collecting that data.
So yeah, I think that big T truth is a huge problem.
Little t truth is also a huge problem.
And it really depends on where in the organization you're looking and what type of business it
is.
But there are massive disagreements at, again, almost every company I've talked to,
about what a particular metric even means.
We had this dimension at Convoy that was called shipment distance.
And you would think that's a pretty straightforward thing.
It's just the distance between a shipment's origin point and its destination point. But there were so many people that couldn't exactly agree on what specifically we were talking about. We could be talking about distance in kilometers or distance in miles. Some people wanted to define the starting point as where the shipment was dropped off. Some people wanted to define it as where the trucker was driving
from. And these types of differences in thinking sort of apply to the use cases that the consumer
is attempting to solve. So wrangling everybody's brain around the same semantic concepts is very challenging.
Yeah.
Now, I want to get practical for a second.
Would you describe shipment distance
as a big T or a little t?
There's definitely elements of both.
And so this is where we kind of get
to the philosophy of all of this, right?
And it actually becomes
a really challenging conversation to have. There is obviously some objective distance, right? The
shipment is traveling from one place to another place. So that part is real. The question is,
what explicitly do we mean by shipment? The distance part is real. It's the shipment part, applying distance
to the shipment where people might disagree, right? Yeah. Yeah. That makes total sense. I was
thinking about even something like delivery, right? It seems like it's binary. This thing
was delivered or it wasn't, right? But one team could say, okay, if it gets to the physical destination, it's delivered.
But another team may say, well, no, it's when the customer opens it and verifies that what they got is correct.
That's a successful delivery, or whatever it is, which are all useful. But the question that comes up then is, you start to face,
and I'm speaking from experience here, you know, even in stuff that we do every day with our own
data, is, okay, so you have some disagreements, right? Not because anyone's necessarily right or wrong in a lot of cases. It's that in order to interpret their job or understand the effectiveness of their work, you need to measure something in a slightly different way, right? But the problem that often comes up is that you start to have this proliferation. It's like, okay, well now we have shipping distance underscore X, and it's like 19 different variations.
How do you, and I want to get practical when we talk about, you know, data contracts and stuff,
but philosophically, where do you fall on the spectrum of like, you know, we need to
provide consumers with the information that they need to do their job well, without allowing,
you know, things to run rampant and to create all of this metrics debt,
which just spirals out of control.
I feel like the first time you say,
okay, we'll just cut a new version of this,
it's like weeks later,
the warehouse is already getting messy.
Yeah, so I sort of see
the data platform environment split into two halves.
There is the semantic layer and the logical layer.
And I'm using those terms a bit differently, I think, from how a lot of other companies
use them.
And there's a reason why I think companies use them in a different way.
But like when people talk about semantics, that means at least in every definition I've seen, that's like the nature of the thing itself.
Right? Like, if I say the semantics of a car, I'm talking about the nature of a car,
I'm not talking about abstract interpretations of cars. I'm saying, well, a car has an engine, and it has four tires, and it has a function, which is to move from one place to another place.
And so that's one layer I think needs to exist in the data platform.
And then the other layer that needs to exist is the logical layer.
So these are derivations of real world objects and events.
And those are kind of subject to our interpretation.
Something like margin is an example of a logical construction, right? There's no real, physical thing called margin that exists in the world that we can grasp. It depends on how we, the humans who work at a company, choose to
define margin. And it can be cut many different ways. I think that semantic layer needs to have one type of governance and
implementation and coordination. And I think that the logical layer needs to have a very different
type of governance and organization that is based around the promotion of sort of crowdsourced, almost Reddit-upvoted artifacts, right?
So if we have, you know, if as a company, we agree that this definition of margin is
the one that is most commonly used by the business, that doesn't mean that everyone
else can't have their own interpretation.
That's fine.
But if anybody in the company has a question, what is margin according to some common definition, there should be a very easy way for them to get access to that data without
having to try to understand the 30 unique versions of margin that exist all across the
business.
So I think there needs to exist some plane where there can be iteration.
Teams can derive, you know, logical aggregations
based on real-world semantic objects.
And as those logical derivations
become more and more valuable at the company,
they are elevated to a higher level of importance
and treated like an API.
And then there can be discussion, right?
So if you want to change the definition of margin
that is powering key data
objects, and actually, let me take a step back because I think when we're talking about these
elevated meanings, it's not just in a sort of abstract like, oh yeah, there's one set of metrics
that are good and one set that's bad. Having that elevated version of a metric should allow you to
use the metric in ways that are more actionable
and production grade.
For example, if I want to use this concept of margin in a dashboard that I serve out
to my customers, then I have to use the official version, the first elevated version.
If I want to serve it to a sales team, or if I want to do something that maybe goes
across teams, then I have to contribute back to this central, almost open-source definition of
the metric.
If you want to create your own version of a metric and it lives in your little local
environment and you tinker with it and you apply it to a dashboard that only you see,
that's fine.
But once we start sort of going cross-company, that's when we need to have some agreement on what these terms actually mean.
Yeah, that makes total sense.
So a centralized agreement on the most important things,
but you're not removing decentralization from the equation, right?
Exactly.
That makes total sense.
And this is my approach to both the semantic layer and the logical layer.
I think that there is sometimes a misconception in data where we'll sort of look at all the SQL in our data warehouse, and we'll look at the pipelines that are actively failing, and the business logic where, you know, there's no clear agreement on what these entities are.
And we take this approach of we need to go and remodel everything.
We need to have a very clear and well-agreed and established data model.
And we have an entity called Shipments, and it's owned by this team, and it has to live here, whether we have data mesh or data warehouses or data marts, whatever. But it's always pitched as this big, massive overhaul, where there is going to be a big T truth that applies everywhere, and you're not allowed to change that. And that's just not feasible or realistic.
In my experience, you have to give people the ability to sort of iterate and tinker
and prototype and sort of try out new things, but give them a path to
move from prototype sort of design environment to a production high trust environment. And that
production high trust environment needs to be supported by all the best practices in software engineering, applied to a smaller but far more valuable and condensed slice of our data pipelines.
Yep. No, that makes total sense.
Okay, so where do data contracts fit in here, right?
I know we've probably been walking all around the subject of data contracts
in the philosophical discussion.
and I guess the practical discussion as well, around quality, but...
Okay, break down data contracts for us and where they fit into everything that you just outlined.
Yeah.
So data contracts are, at their core, agreements between producers and consumers, enforced through a programmatic mechanism. Put simply, it's like a data API. And an API for data is more robust and comprehensive than, I would say, a traditional API, because
you're not just thinking about the schema and the evolution of the schema, but you're
taking into consideration the data itself, right?
So then this goes back to that real world truth that I was mentioning before. If I have an expectation that a particular ID field always is a 10 character string, then I need to ensure that the
data itself reflects that. And if I get a nine-character string or a 15-character string, that means that somewhere a bug or a regression has been introduced. And that means my assumption that this data represents the big T truth has been violated, because it doesn't make sense for an ID to be 15 characters.
It doesn't work in our system.
Right.
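(To make that concrete: a minimal sketch, in Python, of the kind of data-level check Chad describes. The shipment record, field names, and the ten-character rule are illustrative, not from any specific tool.)

```python
from dataclasses import dataclass

@dataclass
class Record:
    shipment_id: str
    origin: str
    destination: str

def check_contract(record: Record) -> list[str]:
    """Return contract violations for one record (empty list = valid)."""
    violations = []
    # Schema-level expectation: the field exists and is a string.
    if not isinstance(record.shipment_id, str):
        violations.append("shipment_id must be a string")
    # Data-level expectation: IDs in this system are exactly 10 characters,
    # so a 9- or 15-character ID signals an upstream bug, not a schema change.
    elif len(record.shipment_id) != 10:
        violations.append(
            f"shipment_id must be 10 characters, got {len(record.shipment_id)}"
        )
    return violations

print(check_contract(Record("AB12345678", "SEA", "PDX")))  # [] -> valid
print(check_contract(Record("AB123", "SEA", "PDX")))       # one violation
```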
So, I mentioned data quality before and split it into these two halves: you've got this issue that's about truth and semantics, and you have this other issue that's about whether the product maps to the requirements that I have. I think that data contracts actually start primarily on the right side of that.
It starts as a quality mechanism to say, is my data product working the way that I expect?
And do I have a very clear owner that's willing to fix bugs and regressions in that data product?
But I think that over time, it can be used to transition more to solve some of the semantic
problems that I mentioned before. Yeah, that makes total sense.
One question, this is kind of a practical or maybe a specific question. One of the challenges that I've seen come up over and over again, as it relates to data contracts is on the logic side, on the consumer side, right? So one of the challenges is that you have like a sales
team or a marketing team or a product team, and they have some sort of tooling that allows them
to do whatever they do, right? So they're sending messages or they're moving deals through some sort
of life cycle or whatever. And tons of logic lives in there, right? But those systems tend to be
very inflexible, understandably, right? Because they're built for that purpose.
And so when you think about a contract, I think one of the challenges is that you have logic,
business logic, that I would say many times is a contributor or informer of even some of the
semantic like big T truth. This is what a closed deal is or whatever, right? So that lives in a
downstream tool. But when we think about an API, as you described it for data, you know, that a lot
of that has to be centralized in infrastructure. How do you think about that in the world of data contracts and even the technical side of data contracts?
Yeah, it's definitely a challenging problem, but it's actually one that I think is going to be solved at some point.
Salesforce, for example, has their own sort of DevOps-oriented infrastructure now, where changes are logged through job actions. And so if you're a developer, you can tie into that. And I think there's a lot of interesting potential in those types of systems, like essentially being able to say, hey,
we detected by running a check that you were about to drop a column
in your Salesforce schema.
There's someone downstream that has a dependency on you,
so we're not going to let you do that.
And obviously you need an engineer
to implement a system like that,
but you can abstract the messaging
up to the level of the non-technical user.
There are obviously some systems that are very old, like ERP systems and
things like that, that, you know, maybe will never fully integrate, like
they'll never have their own like DevOps solution, but even then I don't think
it's an impossible problem to solve.
The challenge is really getting in between the change and the data making its way to whatever its business-critical pipeline is.
So for example, you could do something where you say, look, I just want to have some staging
table where I drop all the data from Salesforce or my ERP system.
I run a set of checks.
Ideally, if it was real time, all the better. But most of this stuff is pushed out through batch systems, so you can run a check maybe once per day or once every few hours. And if you see any violations of this contract downstream, then you can revert to a previous version, or you could try to parse through that data and only allow whatever, at the row level, meets the contract through into the pipeline.
And then you can try to have some alerts or notification
for the salesperson or the business person
that made the change that said,
hey, something that you pushed out earlier in the day
or yesterday was a violation of a contract
and you're potentially causing
a machine learning model to break.
We're gonna need you to go in and update that, right?
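(As a rough illustration of the staging-table pattern just described, here is a hedged Python sketch, with pandas standing in for the staging layer and a print standing in for the producer notification; the table and column names are hypothetical.)

```python
import pandas as pd

def enforce_contract(staged: pd.DataFrame) -> pd.DataFrame:
    """Split staged rows into passing and violating sets; alert on violations."""
    # Hypothetical contract: account_id is a non-null 10-character string.
    ok = staged["account_id"].astype("string").str.len().eq(10).fillna(False)
    passing, violating = staged[ok], staged[~ok]
    if not violating.empty:
        # In practice this would notify the producer (Slack, ticket, pager).
        print(f"Contract violation: quarantining {len(violating)} row(s)")
    return passing

staged = pd.DataFrame({
    "account_id": ["AB12345678", "SHORT", None],
    "amount": [100, 250, 75],
})
clean = enforce_contract(staged)  # only the valid row flows downstream
print(clean)
```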
So some of this probably is going to require
significant culture change.
Like it's just people learning
that changes that you make to data
have impacts elsewhere.
But some of it is like having the right tooling
to just like get in between bad data
arriving in a pipeline
and having some messaging
that goes out to these
producers. Yep. What happens when, you know, I think a lot of companies have, I would say,
maybe implicit contracts, but not explicit contracts around data, right? Especially when
there's not, you know, centralized infrastructure or other sort of tooling to
mitigate that. How do you see that play out at a lot of companies?
Yeah. A ton of companies have implicit contracts. I call them non-consensual APIs.
That's great.
Yeah. And it's not good. It never really plays out well, honestly.
I don't think I've seen a single situation of those implicit contracts actually being positive for anyone downstream.
But it also makes sense why they exist, right?
You have some software engineer who owns a Postgres database or MySQL database or something like that,
they are thinking in terms of their production applications and ensuring that their applications
have the right data to function. And they're not thinking about the downstream data or the
analytics or the machine learning at all. And that's because a lot of the tooling that teams use, like some ELT tools or CDC,
allows these teams to not be concerned with those problems, right?
And say, hey, I'm just going to plug into your database.
I'm going to pull your data out.
I'm going to do something fancy with it because I need to move quickly.
And the engineering team says, okay, that's cool.
But just so you know, you have a dependency on me.
And that's that. I just don't need to worry about it.
Like, you're going to fix this issue.
And that's usually fine for the first few years that a company exists, right?
Because A, it's very easy to be in the loop whenever an engineer makes a potential breaking
change to your pipeline.
And B, you know, people are just thoughtful and nobody's a jerk. And the data, I would say, isn't useful enough to really have any sort of strict data quality guidelines around it. It's mainly for, you know, analytics, maybe it's for BI, you know, okay.
If my customer churn table is down for a few hours, or it's maybe down for a couple of days
while some analyst comes in and fixes it, that's fine. It's not that big of a deal. But once you start getting to scale, and now you have
data engineers that are being bottlenecked, or they are the bottleneck in a lot of cases,
because you've got this large team of data consumers and data scientists, and they have
machine learning models, and those models are breaking all the time, and you have all these changes that are happening. And all of those
tickets get routed to this central bottleneck, the data engineering team, and they're spending
all their time just solving tickets constantly. And it's not fun for anybody. It's not fun for
the consumers because they're not having their problems addressed in a timely way.
It's not fun for the data engineers because they're just constantly underwater and they don't get to do what they actually want to do, which
is do engineering and like build things.
And it's not really fun for the data producers either because they get
yelled at, you know, like every other week about something that they
broke that they had no idea about.
And so, yeah, that's sort of how I've seen it typically play out.
Like most companies I've seen on the modern data stack that adopt that, you know, just
move fast and break things, early architecture, get to a point where like that doesn't actually
work anymore.
Let's go through like, let's say, a quick example of some data infrastructure and where
like the data contracts exist in it, right? Like, let's assume we have like a typical example of a production database.
Postgres generates data, of course.
You want to export the data from there.
So there's some kind of ETL, like CDC, whatever.
It doesn't matter, right?
Like, take the data out of there, put it in a data warehouse.
There are some steps of transformation that will happen to the data there, and you will end up
with some tables
that can be consumed for analytical
purposes. Let's keep it in the
simplest, most
common scenario of analytics.
Let's not talk about a
more exotic use case.
Where do data contracts fit in this environment?
And the reason I'm asking is because you use the words API and data contracts.
And in my mind, an API is always like a contract between two systems, right?
And in the world of the data engineer, we actually have way too many systems that we need to orchestrate or like make them operate.
Right.
So help me understand a little bit, like, where do we start putting these data contracts in such like an environment?
So in general, we'll start at a high level and sort of drill down to the tactical. At a high level, I think that data contracts need to exist anytime there is a handoff of data from one team to another team.
So that could be from the Postgres database to the data lake.
It could be from the data lake to the data warehouse.
It could be from one team that owns a particular data model in the data warehouse to another team that consumes that model.
But anytime data is handed off and there's some transformation that's happening, there
needs to be a data contract and that sort of API input output needs to exist.
As you rightly pointed out, depending on where you're at in the pipeline, the vehicle, the mechanism of enforcement, that the data contract takes is going to look different.
So if you're trying to enforce at the production Postgres level, then you're probably going to need something in CI/CD.
You want to prevent the changes from being made before they happen as often as you can. If you have a CDC and you've
got an event bus, then you might want to do a set of enforcements there, right? We want to look at
each row. And if we detect that at the row level, there's a violation on the contracts, we can
sideline that data, stick it into a DLQ, get to backfilling later, and send out an alert to the
team that's on call for that contract. The overall goal is to try to shift the ownership as far left as we can for each contract, and try to make the enforcement as tactile and as embedded into the developer workflow as we
possibly can. So if we're just talking about like Postgres, for example, or we're talking about the use case, we might want to start off by defining a contract in some schema serialization framework.
So it could be Protobuf, it could be Avro, it could be JSON Schema, though I don't recommend
that.
You'll want to store that contract in some type of registry.
And then there should be a mechanism of doing backwards incompatibility checks on that stored
contract and ideally
on the data itself during the actual build process.
And then you can, you know, break the build and send an alert that says, hey,
there's been a contract violation.
That's like one example.
But like I said, for each transformation stage, there are things that you can do that
you can try to tie back to a producer.
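(One hedged sketch of that build-time check: asking Confluent Schema Registry's REST compatibility endpoint whether a proposed Avro schema is backwards compatible, and failing the CI build if not. The registry URL, subject name, and schema here are placeholders; a registry client library could do the same.)

```python
import json
import sys

import requests

REGISTRY_URL = "http://localhost:8081"   # placeholder registry address
SUBJECT = "shipments-value"              # placeholder subject name

# The proposed new schema for the shipments topic (Avro, as a JSON string).
new_schema = {
    "type": "record",
    "name": "Shipment",
    "fields": [
        {"name": "shipment_id", "type": "string"},
        {"name": "distance_km", "type": "double"},
    ],
}

# Ask the registry whether the new schema is compatible with the latest
# registered version; run this during the build and fail on incompatibility.
resp = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(new_schema)},
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    sys.exit("Breaking schema change: contract violation, failing the build")
print("Schema change is backwards compatible")
```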
Yeah, that makes a lot of sense.
And, okay, there are many different people involved, right?
Yes.
Probably more than the technologies involved in this whole process.
So let's take, don't overcomplicate it,
but let's at least assume two basic categories.
We have the data producers and the data consumers, right?
What's the value that each one of them gets from implementing data contracts?
Yes.
So a lot of this comes down to the implementation, but I would say the primary value that the producer gets is awareness.
If it's implemented the right way.
And there's a caveat, which is that the data contract is a really meaningful piece of technology, but it serves a function.
And this very specific function that it serves
is to define contracts and to enforce contracts.
All around that core function,
I think there needs to exist other capabilities
at an organization which add the value
that you're talking about.
And I think of this as not super dissimilar from GitHub,
where at its source or at the core,
GitHub is a platform that facilitates source control.
But all around source control,
you have this other functionality that brings engineers
from all across the company together.
Pull requests, you know, code diffs, things like that.
Just make deploying and managing sort of, you know, code deployments in an agile way,
like very easy for everybody.
And that creates a great incentive to actually use the system.
Data contracts require something similar.
What we found at Convoy was that awareness was the big value for the producer.
And that meant understanding if you own some upstream database, how is that data actually
being used?
Where is it being used?
And if you're going to make some change, how is that going to impact someone else?
The reason that this is a valuable thing is obviously because as an engineer, you want
to build scalable, maintainable systems and you don't want to break anybody.
Also, you deserve credit.
If your data is being used in, let's say, a pricing model for the company and you ensure
high data quality for your piece of the pie, and that makes the model better, then
that's something that you as an engineer really deserve credit for.
And then on the final part, it's not good if software engineers have to attend or participate in a COE (a correction-of-error review) because there was some breaking change that was made to a very valuable data product.
So as often as we can avoid that, right, that would be ideal for them.
So the next thing: the value for the consumer is really having higher quality data, specifically for the things that are most important to them.
And by that, I mean, I don't think that data contracts need to apply everywhere.
Not everywhere you have data or every
use case of data requires a contract. I think because contracts do add time and they do add
additional effort, they should only be applied where the ROI justification makes sense. So if
you've got, you know, like, I mean, we mentioned analytics, but ideally it would be some report
that adds a lot of value back to the company, like a dashboard, the CEO looks at
every single morning. Maybe in that case, a contract would make a lot of sense. And if you've
got some data consumer that's on the hook for ensuring that the data is correct, they probably
never want to be in a situation where they go into that meeting like, Oh, sorry guys, the dashboard
is broken, and I have no idea why. Just from a career perspective and also from a business
perspective, that's not really great. There's actually a couple more things I wanted to mention on the producer side really quickly that's very valuable.
One of them is I think that contracts are bidirectional systems.
So lineage to me is a huge part of the contract, being able to understand, you know, where the data is actually being used, what feeds into the contract, and also who is using the source data.
And if it's bidirectional, it means that not only should the producer be accountable to the consumer, but the consumer has to be accountable to the producer.
So GDPR is a really great example of where this adds an enormous amount of value, right?
Like if you're an engineer and you're generating some data that might be audited or you are
accountable for how it's used at the company, you need to have that insight.
Otherwise, it doesn't make sense to make the data available to anybody at all.
So yeah, there's a couple examples.
Okay, that's awesome.
And, okay, you mentioned earlier to Eric that there are always some implicit
contracts, right? Yes.
Let's say the company reaches the point
where things being implicit
is not a good thing.
And I think pretty much everyone who has
been working for a while has
experienced that, right? It's part of
the evolution of building systems.
Where do we start
in making things explicit?
I'm asking you because you have the experience
of talking with so many different teams and people.
Who starts this conversation about the data products
and who usually pushes enough for this to happen?
Who is the driving force behind it?
Yeah, great question.
Generally, the driving force is the data engineering team or the data platform team.
The reason for that is they are the bottleneck.
They're feeling a tremendous amount of pain in most cases.
This was my team, right?
Every day, we'd have 10, 15, 20 service desk tickets.
And they all essentially follow the same pattern, which is something happened in a production system, and the downstream team did not have the ability to solve it themselves, and they relied on us.
And we had a lot of churn for that reason. So the data engineers generally want to get out of that communication cycle between the producers and consumers. And this is a method of managing that decentralization.
In terms of where you start, it's a big cultural transition.
A lot of it depends on the company, honestly.
If you've got a use case that is unbelievably valuable to the
business, then you can probably skip a couple of steps, right?
Like, so if you're Amazon and, you know, you have your recommendations model or whatever,
and that's making you $2 billion a year, I would guess with about 99% certainty that
they have a lot of mechanisms in place to prevent that model from just breaking randomly.
So that's a great starting point.
Is there something that's really valuable to the business?
I think you can actually start directly with the producer in that case and saying, hey,
there's some constraints that we need.
There's some policies that we need to implement about how data is changed.
And we're actually not going to allow you to make schema changes or make significant
changes to the data because whatever feature you're building is not as important as our recommendation model.
Like there's nothing that you could create
that could generate more value than that.
And so therefore we're going to block you
and that's probably a business decision.
In most other cases,
what I'd say is the best thing to do
is invest in this sort of awareness infrastructure.
The goal is not to initiate change
from the producer side on day one.
It's to allow everybody in the pipeline to just figure out what would happen based on
the changes that they make.
As an engineer, if you don't have the context on what's going to happen if you do this, then you can't possibly make an informed decision, nor can you take ownership of the data in the future.
This is what we did at Convoy.
So we basically said, hey, we have a valuable use case.
We want to inform, but not break.
We had a GitHub bot so that if there was a change, a potentially breaking change that was being
made, we would use that GitHub bot to alert and say, hey, here's how the data is being
used downstream.
Here's the data product.
Here's the SLA.
Here's what's going to happen if the pipeline fails,
like it's going to be an incident or not.
And here's the person that you should go talk to
to actually, you know, work through this change.
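(A simplified sketch of what such a bot step might look like; the downstream-usage lookup is stubbed out, and the repo, PR number, and lineage details are hypothetical. The comment is posted through GitHub's standard issue-comments endpoint.)

```python
import os
import requests

def downstream_usage(column: str) -> dict:
    """Stub: in a real system this would query a lineage/contract service."""
    return {
        "data_product": "pricing_model_features",
        "sla": "hourly refresh",
        "owner": "@data-science-pricing",
    }

def comment_on_pr(repo: str, pr_number: int, column: str) -> None:
    usage = downstream_usage(column)
    body = (
        f"Heads up: this change drops `{column}`, which feeds "
        f"{usage['data_product']} (SLA: {usage['sla']}).\n"
        f"Please talk to {usage['owner']} before merging."
    )
    # Standard GitHub REST endpoint for commenting on a pull request.
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": body},
    )
    resp.raise_for_status()

comment_on_pr("acme/warehouse-migrations", 1234, "shipment_distance")
```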
And then the producer has a choice.
They can either say, you know what?
I think it's fine.
Doesn't really seem like a big use case.
I really need to get this thing out the door.
And that's okay.
They just push the change.
They're willing to deal with the results. And at worst, we can still alert the downstream consumers that a
change is coming. We know exactly like why, where it's coming from, how to sort of negotiate and
deal with the problem. And in the best case scenario, they say, oh yeah, maybe I should go
have a conversation with this person because I don't want to break them. And we come to some
amicable conclusion of the contract. That said, I'm sort of answering your question in reverse there, but the first part was, where do you start?
I think you have to,
but it's much easier to start on the producer's side.
If you can get contracts on the producer's side first,
then every transformation step below it,
the owner is going to feel much more confident saying,
oh, well, my data is now under contract
and therefore I feel comfortable vending that data to someone else.
If you try to go from the bottom up, you don't have that, right?
Like you could still potentially be broken.
And now, as a data owner, you're sort of right back to square one, where instead of the data engineer, that onus has just been shifted to whoever the data consumer or analytics engineer is that owns that data set.
And that's not a good feeling.
Yeah, that makes a lot of sense.
My question is like, you know,
whenever we're talking about like APIs and contracts
and all that stuff, like usually we have,
like it's a two-sided thing, right?
Like there are two parties that have to agree on something, right?
Right, right.
And I can think of like engineers,
having a conversation and figuring out the schema, how it should look, or adding, removing, all these things, the more technical things.
But here we are talking about, at least how I understand it so far, a process that
has implications from the engineering side to the highest level, let's say, consumer of data, right?
Because as you said, like there might be, I don't know,
like a dashboard that the CEO like uses to report to the board, right?
When you have to communicate between the consumer and the producer
to create contracts, right?
And here we have people that, like, okay, they think in very different terms, right?
Like, even the language that they are using is, like, different.
How can this happen?
Or maybe it doesn't.
I don't know.
Maybe it's not that important, right?
But how do you deal with that?
Yeah.
So, two things.
The first thing is that I think there is a maturity curve of
implementing contracts at a company. And I think the curve starts with the producers and
the technical consumers having that conversation because at least their language is the most
similar to each other. It's still different, but it's the most similar to each other. And I believe
the vehicle of communication is the PR. And in that PR, if you can communicate, hey, this is how
the data is being used. Here's information about the lineage so you can see how it's transformed,
what the final data set looks like. And here are all the constraints and why we need those
constraints. That is probably enough information. That's sort of like the right level of communication
for producers and consumers to have a fruitful, productive discussion. I think that for the non-technical consumer, it's a lot more challenging to have that conversation directly with the producer.
So, and again, you know, I'm not even this far yet, but it's where I want to get to: I think that, in the same way there's this sort of surface for conversation between the producer and the consumer, there needs to be a similar surface for conversation between the non-technical consumer and the technical consumer.
Where a non-technical consumer can essentially say, hey, here's what I know about the business.
Here's what I know needs to be true.
And that technical consumer is able to translate that set of requirements into contracts
that can then be
fulfilled by the producer. So I think it's probably a sort of a double hop of communication.
And how does it work with a semantic layer in place? I know that you talked at the beginning about the difference between a semantic layer and a logical layer. But at least in my experience with the enterprise, where you have the Collibras of the world out there, it's a very top-down kind of situation, where the board will come
and define what the revenue is and then we are going to create the terminology of this
is what revenue is, and then this has to spread across the rest of the organization. So how can these two things align, right?
It's very tricky.
It's absolutely a very tricky thing.
Basically, this is going to be an unsatisfactory answer, but I think that there really needs
to exist levels of abstraction that are based around, you know, fundamental engineering artifacts. I think it would be very hard to go directly from the business wanting to define some metric to then taking that and translating it, if the foundations of trustworthy data are not there in the engineering and programmatic sense. That's why I always recommend starting off by ensuring that you have this sort of foundational, highly trustworthy data pipeline that is
defined between the technical producer and the technical consumer. And then I think there's
lots of interesting ways that you can focus on abstraction
the higher that you go,
which is, like I said,
it's sort of a non-satisfying answer
because people want to do all these interesting things
with the semantic layer today.
And my personal opinion is that we're sort of trying to reverse decades of bad practice,
just to be frank with you.
We've kind of been doing data the wrong way,
where in a lot of cases, we've started at the end.
We started with the analytics sort of BI tool
and said, let's just sort of very quickly get data
into these really complex analytical instruments.
And we can build out a lot of cool stuff
and build out all our metrics and everything else.
And the fundamental architecture and upstream ownership
is just not there. And then we reach a point where we want
to do so many more interesting things with our data. We want to have OLAP cubes and do slice
and dice and have semantic layers and have these like APIs and all this other great stuff,
but you don't even have ownership from the source. And so I think we need to reverse that trend,
start from the top, work our way down, and then build the layers of abstraction onto that.
So then ideally the non-technical consumer can say, Hey, I have this
version of margin that I would like to define, and here's how I like to define
it, and that just sort of back-propagates through the system.
But I don't think the foundations are in place to do things like that yet.
I totally agree with that.
All right.
So let's talk about tooling.
I think you mentioned a lot of GitOps stuff, right? PRs, working all together on Git, and a lot of stuff. So if we'd like to start implementing data contracts today, outside of the Git repo, what do we need? What are the tools, let's say the fundamental tools, that an engineering team needs?
Yeah.
So just starting from like a requirements first perspective, and then, you know, we could talk about very specific tooling.
From a requirements perspective, you need some mechanism of defining a contract. You need some form of a registry; a schema registry could work for that.
You need some schema serialization framework to work in.
So you need to be using protobuf or Avro or JSON schema.
And then you need some mechanism of detecting backwards-incompatible changes.
So, you know, I sort of wrote a whole article about exactly how you do this,
but you can, you know, you can use Docker, you can spin up a clone of the database.
You can run a backwards incompatibility check against that.
During the build phase, you can do a check against the Kafka schema registry and do
backwards incompatibility checks against that.
I would say that having the schema evolution pieces in place are
the most foundational aspects of the contract.
And they're the most foundational aspects of ownership in general.
So if you get that in place, you're like 50% of the way there. The next big piece is how you enforce on non-schema-related data issues, semantics,
cardinality. And so there's a few different places where enforcement makes sense. It really
just depends on your use case and how the data moves through the pipeline. But like in Convoy's
case, we had a data lake, we were doing streaming. We were using CDC with Debezium.
We were already using Flink as a stream processing layer.
And we were also using Snowflake.
And so when you just think about like that spectrum of technologies, what we could do
is we could have checks in the CI/CD layer.
We could do checks in the application code on values.
So if we detected that there's some value that falls
outside of the constraint, we could block it there. In the stream, we could use Flink to
run some Flink SQL and have checks like, at the row level, does this entity have a many-to-one
relationship with another entity? And is that what we actually observed? If yes, great. Allow it
through the pipeline. If no, sidelight it. And then when what we actually observed? If yes, great. Allow it through the pipeline.
If no, sidelight it.
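(A rough sketch of that stream-level enforcement, using a plain Kafka consumer/producer pair in place of Flink; the topic names and the row-level rule itself are hypothetical.)

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "contract-enforcer",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["shipments"])

def meets_contract(event: dict) -> bool:
    # Hypothetical row-level rule: every event carries exactly one carrier.
    return isinstance(event.get("carrier_id"), str) and len(event["carrier_id"]) > 0

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Valid rows continue down the pipeline; violations go to a DLQ topic
    # for later backfill, and the on-call team is alerted separately.
    topic = "shipments.clean" if meets_contract(event) else "shipments.dlq"
    producer.produce(topic, msg.value())
    producer.poll(0)
```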
And then when the data actually lands in like a lake or a warehouse,
you can take data profiles.
So, like, WhyLabs has a really cool open source tool
for doing data profiles.
You've got a bunch of great tools
for monitoring out there,
like Monte Carlo and LightUp and elementary.io,
which is the open source version.
And so you can do all those checks there.
And then you've got the warehouse.
And in the warehouse, you've got, you know,
you've got Airflow, you've got dbt tests, you've got Great Expectations, and you can implement your CI/CD checks still using the schema registry if you're using a tool like dbt.
And then you would have to do checks on batch, right? You'd have just a batch process. You run all your checks there. You see if it passes or not. And then you have some, you know, system in place for either rolling back to a previous version or shunting the data to another table or something like that.
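(And for the batch end of the spectrum, a minimal sketch of running declarative checks and shunting violating rows to a quarantine table, with sqlite standing in for the warehouse; the checks themselves are illustrative.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shipments (shipment_id TEXT, distance_km REAL);
    INSERT INTO shipments VALUES ('AB12345678', 420.5), ('BAD', -3.0);
""")

# Declarative batch checks: each query counts contract violations.
checks = {
    "id_length": "SELECT COUNT(*) FROM shipments WHERE LENGTH(shipment_id) != 10",
    "non_negative_distance": "SELECT COUNT(*) FROM shipments WHERE distance_km < 0",
}

failed = {name: n for name, sql in checks.items()
          if (n := conn.execute(sql).fetchone()[0]) > 0}

if failed:
    # Shunt violating rows to a quarantine table instead of publishing.
    conn.execute("""
        CREATE TABLE shipments_quarantine AS
        SELECT * FROM shipments
        WHERE LENGTH(shipment_id) != 10 OR distance_km < 0
    """)
    print(f"Checks failed: {failed}; rows quarantined, publish blocked")
else:
    print("All contract checks passed; safe to publish")
```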
So technically, all the tools to do this stuff already exist, right?
All the open source tools are out there.
It's just a matter of stringing all the pieces together
so that you have the right level of enforcement
in the right place.
At least like that's how you would do
sort of the core data contracting technology.
Yeah, makes a lot of sense.
All right, cool.
One last question from me,
then I'll give the mic back to you, Eric.
So we've been talking all this time, and I think we were equally talking, about technology and people, right? People are always involved at the end. You have to take the people and educate them, or make them, I don't know, change the way that they do things, and all that stuff, and agree at the end. We have a contract at the end. We have to agree and sign it.
And I know that you are very active in building a community around that stuff. So I'd like to ask you about that. How important is education, and how does a community act as a vessel for this change to happen? So share your wisdom with us about the community, because it's a very interesting topic.
Yeah, absolutely.
So I think community is critical here. And there's a couple of reasons I think it's critical.
The first is that, you know, one of the things I've heard a lot from people that have read my content is, wow, you're saying things that seem so obvious in retrospect. Like, of course you
can't solve data quality unless the producer gets involved.
Like, how could you possibly do it?
Like garbage in garbage out doesn't make any sense if the garbage is already in, right?
Like you, you have to prevent it from getting in the first place.
And there's only one way to do that.
And that's to start from source.
And I think part of the community is giving people the weapons, maybe that's not the right word, but giving them the tools to have
the appropriate conversations with their producers or with their consumers. And oftentimes data
engineers and data platform engineers are so in the weeds, focusing on the day-to-day work,
that it's hard for them to take a step back and figure out, how do I have these conversations in the bigger sense? And this is something I think community
is really useful for. It's like saying, oh, wait a second, I can actually contextualize
all the problems I'm having in this larger narrative about the company. Why is data set
up the way that it is? How is data quality affected by these various sort of pieces in the business working together?
And how can I speak to that and propose changes that actually make more sense?
The other reason I think that community is valuable for at least talking about data contracts
is because, as you said, historically, these types of problems have been purely organizational,
right?
We need to make some organizational shift, right?
You hear a lot about data mesh, and data mesh is an organizational thing. It's like, we need to restructure our organization so we have better ownership of data objects and domains, which I don't think is entirely necessary to get to a point where you actually have enough problems solved that people don't really hate doing their work every day. But it has been this
really heavy organizational process and getting in front of these people saying, actually, you still will always need some
element of cultural transition, but technology can really help because technology makes that
cultural transition easy, right?
It's easy for the producers to take the ownership.
If it's easy for them to understand how the data is being used, and if it's easy for the
consumer to define what they need, then people will do it, right? The bottleneck is, people don't do the right thing if doing the right thing is very hard and it takes away from their primary work. Right? So I think another great message to spread through the culture is helping people
overcome the traumas of the past, where they've tried to do this stuff before, and they've just
gotten smacked down by the fist of reality.
It's like, well, you've got to understand the reason you failed, right?
The reason that you got smacked down is because you were asking the business to do this massive
cultural change and it's not really tied to business value.
And it would have taken a year and a half or two years.
And you had to, it had to involve the entire organization instead of doing it iteratively
and programmatically and like very efficiently, you know?
So I think the community is really great for sharing stories like that
and for just helping people think through these types of issues.
This is great. Eric, any questions?
Yeah, we're close to the buzzer here, but that was really helpful, practical advice.
Yeah, it is so funny. I mean, you said this at the beginning of the show, but it's almost cathartic to think about a full reset.
You know, like, let's just do a full reset and like build all this stuff from the ground up or whatever. But that's not actually reality. But another question, I think, on the practical side to close us out here.
So on the cultural side, I think that was really helpful. On the tooling side, it seems like
there's a bit of a gap. And you described it really well, Chad, when you talked about,
okay, well, you're a small company, and it's okay, a pipeline breaks, and you know, someone's dashboard goes
down. And so they send a Slack message, hey, something wrong here. Oh, yeah, let me look at
that. Okay, like, you get it up and running in the next day. But you know, it's not like the
company's losing money, because, you know, this data flow or pipeline broke, but you inevitably
in that environment, like accrue a bunch of debt that
you're going to have to pay back at some point. And it's interesting because those smaller teams
don't often have the resources to implement dedicated tooling around APIs or data contracts
or whatever. How do you approach that, thinking about our listeners who
are maybe at smaller companies, they maybe are working on the cultural side,
but from a tooling standpoint, it's like, well, I'm definitely not going to get the budget to go
buy a really nice dedicated tool for some of this stuff. But I also don't have the bandwidth to start building some of this stuff internally. What is it? Where should they
start? How should they think about it? So I think if you're at a small company,
the best thing that you can do is to try to be in the loop whenever producers are making changes to things, and just establish a good relationship with those teams. Right? So, you know, if there's a meeting, explain, like, hey, we have some important data.
Can you just invite me, you know, whenever you're talking about making a major database change? Just sort of loop me in. Let's put together a dedicated Slack channel: if you have changes that impact the database at all, we can push all the alerts to that Slack channel.
So at least I'm notified.
I can ask you questions.
But I think it really is sort of a getting in the loop
and having the conversation.
If you don't have the resources for like tools
or open source technology or whatever,
or building something.
And I think that the point of transition starts to come when there is some data asset where, if you have incremental data quality, you start to experience incremental value back to the business, measurable value back to the business, right?
So I've got, you know, maybe a machine learning model and it's
a relevance model and it's running every day and I know it's making us money and we're having to
drop 10, 20% of the data due to null values. And those no values are sort of being caused by issues
with upstream systems. And you say, okay, if I'm just able to solve this one problem, this very small slice of a pipeline
by getting a contract on a few, maybe one schema, or maybe even one or two columns upstream.
And I can say, hey, I was able to reduce the amount of nulls flowing into this table
by 25, 30%. And I can connect that and say like, Hey, there's some real world ground truth. And
we're making better predictions. And now our model is making more money. You have just justified why
data quality is a meaningful investment to me. What too many teams do, I found is like,
they try to take this very holistic approach and say, well, we need data quality everywhere.
We need monitoring everywhere. We need checks on everything.
And number one, that leads to alert fatigue almost 100% of the time, like I said. The metaphor I've used before is: if your house is on fire, you don't need a fire detector, you need the fireman, right? You already know the house is on fire. You don't need a bunch of alerts to tell you that you're burning up; you need someone to come and solve the problem. And so if you have a million different alarms going off, it actually numbs you and desensitizes the teams to data quality issues, which is a bad thing.
And so you need to focus on a smaller piece of the problem that's manageable, that's iterative,
that's not going to be a massive cultural shift for the producers and where you have
clear business value.
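As an illustration of that narrow, high-value slice, here's a minimal sketch of a column-level contract check along the lines Chad describes: assert that the null rate on one upstream column stays below a threshold, and fail loudly if it doesn't. The shipments table, price column, and 10% threshold are hypothetical, and an in-memory SQLite database stands in for your warehouse so the example is self-contained.

```python
import sqlite3


def null_rate(conn: sqlite3.Connection, table: str, column: str) -> float:
    """Fraction of rows in `table` where `column` is NULL.

    Identifiers are interpolated directly, so they must be trusted values
    (fine for a contract defined in code, not for user input).
    """
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0


def check_contract(conn: sqlite3.Connection, table: str, column: str,
                   max_null_rate: float) -> None:
    """Fail the pipeline (or CI job) if the column violates its contract."""
    rate = null_rate(conn, table, column)
    if rate > max_null_rate:
        raise AssertionError(
            f"Contract violated: {table}.{column} null rate {rate:.1%} "
            f"exceeds allowed {max_null_rate:.1%}"
        )


if __name__ == "__main__":
    # In-memory stand-in for an upstream table feeding the model.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shipments (id INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO shipments VALUES (?, ?)",
        [(1, 10.0), (2, None), (3, 12.5), (4, 11.0)],
    )
    # 1 of 4 rows is NULL (25%), so this deliberately fails the 10% contract.
    check_contract(conn, "shipments", "price", max_null_rate=0.10)
```

Run in CI against the producer's data, a check like this is what lets you say "nulls flowing into this table dropped by 25 to 30%" with a number behind it.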
This is exactly what we did at Convoy. And I will be honest and say I didn't start out doing that; I had to learn that it was the right approach. I took the big, wide approach at first, and that totally bombed and completely failed. And then when I switched to the smaller, narrower approach, we just got so much more traction. And the great thing was, because it wasn't as large of a lift on the producer side, the engineers got to familiarize themselves with these processes, and it turned out they're like, wait a second. This is just integration testing. This is just CI/CD. This is just APIs for data. Of course,
we should be doing this. Like, why aren't we doing this? And in fact, at some point,
I know maybe a lot of listeners will find this hard to believe, but the conversation actually flipped. Instead of the consumers going to the producers and saying, hey, I need you to take ownership over this stuff, it was the producers going to the consumers and saying, hey, I have some data here.
Is it useful to you?
And if it is, how do I put a contract around it?
So I think you just have to give people time and space and allow them to see the successes one by one, and not try to rush it and solve all the problems in the world in one single project.
I love it. I think that is so well said. Chad, this has been such a helpful episode. Even for the work that I do in my day-to-day job, there's just so much here to implement right away. So thank you. Thank you for joining the show. If people want to check out the community, where should they go?
So you can go into your browser right now and type in dataquality.camp/slack, and you'll get redirected to the Slack channel. It's Slack, so it's totally free. And right now, we're mainly a community for
networking and finding peers who are in the data space. So there's lots of people who are like
heads of data science at big companies, heads of data engineering, heads of data platform.
And they're all talking about how they're implementing data contracts and monitoring
and data quality of all sorts. But later in the year, maybe middle of the
year, we're going to start working on some other things like in-person events and meetups, training
courses, stuff like that. So there's a lot planned. Very cool. Well, keep us posted on the books as
well. And we'll have you back on to talk about whichever one you publish first.
That would be fantastic. Great talking to you folks. Thanks.
All right, Kostas, fascinating conversation with Chad Sanderson, who runs Data Quality Camp,
which is a community.
He produces a ton of content.
And we covered so many topics.
One of the things he kept returning to over and over again, which I think was incredibly helpful, is that there's so much practical stuff for people to get from the show. I felt like I could walk away from the show with practical things that I could start doing tomorrow to make data quality better. And I think that was
really refreshing because the conversation around data quality can feel really big, right? It's a huge problem.
How do you fix it? We have so much tech debt. What tools do I use? Where do you solve the
problem in the pipeline? Do you try to do things proactively with schema management? Do you try to
do sort of passive detection? I mean,
there's so many things. And I walked away, especially at the end there, with a couple of practical things in my mind, like, I should probably go do this tomorrow to make our data quality better. There are small things that I can do. And so I think both for the listeners who are data leaders and for the ones doing the work on the ground every day, it was a hugely practical, helpful episode in terms of what you can do tomorrow to start improving data quality. We also talked about philosophy, which is always fun.
Absolutely. What I'll keep from the conversation that we had with Chad,
which I found very refreshing and interesting, is that he's giving a definition of what data quality is from the perspective of, let's say, the agents that are involved in the process of working with data, and not trying to give an objective definition of, oh, you have this metric and that metric, something a machine can automatically reason about.
And I think that's the most important thing here. But at the end, no matter what, data is information
and we have to agree on how we use it.
And that's, I think, the big change that Chad brings with his ideas.
And the most interesting thing is that he's not keeping it abstract. It's not an abstract paradigm where an organization has to go and hire hundreds of consultants to coach you on how to do it. He tells you that you can do it today; the tooling is out there. And he positions technology in a very interesting way, in terms of what its role is in making this happen. So,
I would encourage everyone to listen and re-listen to this episode, because I think there's a lot of wisdom in the things that we discussed, both in terms of technology and how it can be used, and also the importance of the people in the organization implementing processes around data.
I agree. Well, thank you again for joining the Data Stack Show.
We will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.