The Data Stack Show - 183: Why Modern Data Quality Must Move Beyond Traditional Data Management Practices with Chad Sanderson of Gable.ai

Episode Date: March 27, 2024

Highlights from this week’s conversation include:
- Chad’s background and journey in data (0:46)
- Importance of Data Supply Chain (2:19)
- Challenges with Modern Data Stack (3:28)
- Comparing Data Supply Chain to Real-world Supply Chains (4:49)
- Overview of Gable.ai (8:05)
- Rethinking Data Catalogs (11:42)
- New Ideas for Managing Data (15:16)
- Data Discovery and Governance Challenges (18:51)
- Static Code Analysis and AI Impact on Data (24:55)
- Creating Contracts and Defining Data Lineage (27:31)
- Data Quality Issues and Upstream Problems (32:32)
- Challenges with Third-Party Vendors and External Data (34:29)
- Incentivizing Engineers for Data Quality (40:28)
- Feedback Loops and Actionability in Data Catalogs (45:30)
- Missing metadata (48:57)
- Role of AI in data semantics (50:27)
- Data as a product (54:26)
- Slowing down to go faster (57:38)
- Quantifying the cost of data changes (1:01:24)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. We are here with Chad Sanderson. Chad, you have a really long history working in data quality and have actually even founded a company, Gable.ai. So we have so much to talk about, but of course we want to start at the beginning. Tell us how you got into data in the beginning. Yeah, well, great to be here with you folks. Thanks for having me on again.
Starting point is 00:00:51 It's been a while, but I really enjoyed the last conversation. And in terms of where I got started in data, I've been doing this for a pretty long time. Started as an analyst and working at a very small company in northern Georgia that produced grow parts, and then ended up working as a data scientist within Oracle. And then from there, I kind of fell in love with the infrastructure side of the house. I felt like building things for other people to use was more validating and rewarding than trying to be a smart scientist myself and ended up doing that at a few big companies. I worked on the data platform team at Sephora and Subway, the AI platform team over at Microsoft. And then most recently, I led data infrastructure for a great tech company called Convoy.
Starting point is 00:01:47 That's awesome. By the way, I mean, it's not the first time that we have you here, Chad. So I'm very excited to continue the conversation from where we left. Many things happened since then. But one of the things that I really want to talk with you about is the supply chain around data and data infrastructure. There's always a lot of focus either on the people who are managing the infrastructure or the people who are the downstream consumers, right? Like the people who are the analysts or the data scientists. But one of the parts in the supply chain that we don't talk that much is going more and more upstream,
Starting point is 00:02:27 where the data is actually captured, generated, and transferred into the data infrastructure. And apparently, many of the issues that we deal with stem from that. There are organizational issues. We're talking about very different engineering teams involved there with different goals and needs. But at the end, all these people and these systems, they need to work
Starting point is 00:02:53 together if we want to have data that we can rely on. So I'd love to get a little bit deeper into that and spend some time together to talk about the importance of this, the issues there, and what we can do to make things better. So that's one of the things that I'd love to hear your thoughts on. What's in your mind, what you would like to talk about?
Starting point is 00:03:19 Well, I think that's a great topic, first of all. And it's very timely and topical as teams are, you know, the modern data stack is still, I think, on the tip of everybody's tongue. But it's become a bit of a sour word these days, I think. There was a belief maybe five to eight years ago that by adopting the modern data stack, you would be able to get all of this utility and value from data. And I think to some degree, that was true. The modern data stack did allow teams to get started with their data implementations very quickly, to move off of their old legacy infrastructure very quickly, to get a dashboard spun up fast, to answer some questions about their product. But maintaining the system over time became challenging. And that's where the phrase that
Starting point is 00:04:16 you used, which is data supply chain, comes into play. This idea that data is not just a pipeline. It's also people. And it's people focusing on different aspects of the data. An application developer who is emitting events to a transactional database is using data for one thing. A data engineering team that is extracting that data and potentially transforming it into some core table in the warehouse is using it for something different. A front end engineer who's using, you know, rudder stack to emit events is doing something totally different. An analyst is doing something totally different.
Starting point is 00:04:58 And yet all of these people are fundamentally interconnected with each other. And that is a supply chain. And this is very different, I think, to the way that software engineers on the application side think about their work. In fact, they try to become as modular and as decoupled from the rest of the organization as possible so that they can move faster.
Starting point is 00:05:22 Whereas in the data world, if you take this supply chain view, decoupling is actually impossible. It's just not actually feasible to do because we're so reliant on transformations by other people within the company. And if you start looking at the pipeline as more of a supply chain,
Starting point is 00:05:36 then you can begin to make comparisons to other supply chains in the real world and see where they put their focus. So as a very quick example, McDonald's is obviously a massive supply chain, and they've spent billions of dollars in optimizing that supply chain over years. One of the most interesting things that I found is that when we talk about quality, McDonald's tries to put the primary burden of quality onto the producers, not the consumers. Meaning if you're a manufacturer of the beef patties that are used in their sandwiches,
Starting point is 00:06:12 you are the one that's doing quality at the sort of patty creation layer. It's not the responsibility of the individual retailers and the stores that are putting the patties on the buns to individually inspect every patty for quality. You can imagine the type of cost and inefficiency issues that would lead to when the focus is speed. And so the patty suppliers and the stores and McDonald's corporate have to be in a really tight feedback loop with each other, communicating about compliance and regulations and governance and quality so that the end retailer doesn't have to worry about a lot of these capacity, about a lot of these issues. And the last thing I'll say about McDonald's, because I think it's such a fascinating use case, is that the suppliers actually track on their own how the patty needs, like the volume requirements for each individual
Starting point is 00:07:11 store. So when those numbers get low, they can automatically push more patties to each store when it's needed. So it's a very different way of doing things, having these tight feedback loops versus the way that I think most data teams operate today. Yeah, yeah, makes sense. Okay, I think we have like a lot to talk about. Eric, what do you think? Let's do it. Let's do it. We love having guests back on, especially when they've tackled really exciting things in between their first time on the show and their second time on the show. And you actually founded a company called Gable.ai. And so we have tons to talk about in terms of data quality generally, but I do not want to keep our listeners on the edge of the seat, you know, for the whole hour. So give us the, give us the overview of Gable. Yeah. So Gable is really
Starting point is 00:08:08 trying to tackle a problem that I've personally experienced for basically every role in my career. Every time I started at a new organization, my focus as a data leader was to understand the use cases for data in the company and start to apply data management best practices, beginning with my immediate team, which is analysts and data scientists and data engineers. We would always go through that process. And at some point, we would still be facing massive quality compliance and governance issues.
Starting point is 00:08:41 And that's because I found that a significant number of these quality issues were coming from the upstream data producers that just weren't aware of my existence. And as time went on, I found that these producers were not averse to supporting us, but they did not have the tool set to effectively do that. Oftentimes, it required me explaining to them how data worked or trying to get them to use a different tool outside of their stack or saying, hey, here's a data catalog, and I want you to look at it any time that you make a change to ensure you're not breaking anybody. And this is just very hard and complex.
Starting point is 00:09:21 And so we developed Gable to act as the data management surface for data producer. It's something that any engineer or data platform manager can use to number one, understand the quality of their data coming from the source systems. Number two, can create contracts, whether one-sided or two-sided, around the expectations of that data. And then number three, protect themselves from changes to the data. And that might mean data that is already in flight. So maybe I'm consuming an API from a third-party provider and they decide to suddenly change the schema out from under me, we want to be able to detect that change before it causes an impact on the pipelines. Or it could mean
Starting point is 00:10:11 someone making a change to the actual code. Like maybe there's some Python function in code that is producing data and the software engineer making that change just doesn't know that it's going to cause an impact downstream, we want to be able to catch that using the tools that engineers already leverage, like GitHub and GitLab, and stop that, or at least give them information to both sides that a change is coming. So yeah, that's basically how the tool works. That's Gable, and that's the high-level problem we're trying to solve. Awesome. Well, I have some specific questions about Gable that I want to get to, dig into the product a little bit more, especially you've chosen the.ai URL. I want to dig into the reason behind that because I know it was intentional. Let's zoom out a little bit first. One of the things we were chatting about before we hit record was the traditional way,
Starting point is 00:11:12 and you actually mentioned this term data catalog, right? It's a huge buzzword. There are entire companies formed around this concept of a data catalog today. We were chatting a little bit about how there are certain concepts that have been around for a long time, like a data catalog, but maybe they aren't necessarily the right way to solve problems modern day. So why don't we just talk about the data catalog, for instance? Do you think that it's one of those concepts that we should retain, right? Because there are certain things historically that are good for us to retain, but things change, right? So maybe we don't need to retain everything. Yeah, I think a catalog is one of those ideas that conceptually makes an enormous amount of sense
Starting point is 00:12:00 on the surface. If I have a large number of objects and I want to go searching for a specific object in that pile, having a catalog that allows me to quickly and easily find the thing that I need makes a lot of sense. But like you said, I think this is a older idea that's based around a very particular organizational model. So the original concept of the data catalog back from the on-prem days was actually taken from like a library where you have an enormous amount of books. You've got someone who comes into the library and is trying to find something specific and they go to a computer or they open one of the very old school documents, like a literal catalog. And from there, they can search and try to find what they need.
Starting point is 00:12:51 But this requires a certain management model of the catalog itself, right? You've got librarians, people who know all of the books in the library. They maintain the catalog. They're very careful about what they bring in and out. They're curating the catalog itself and they can add all of the relevant sort of metadata, quote unquote, about catalog that gives people the information that they need. This was also true in the on-prem data world as well. When you have data architects and data stewards, you had to be very explicit about the data that you were bringing into your ecosystem. You had to know exactly what that data was,
Starting point is 00:13:35 where it came from, what it was going to be used for. And the catalog that you then provided to your consumers was this very sort of narrow curated list of all of the data that could possibly exist. But in the modern sort of data stack, it's not like that. It's more of a, you know, you've got your data lake and that is a dumping ground for thousands or hundreds of thousands of data points. There really is no curation anymore. And so what happens in
Starting point is 00:14:07 that worldview, I think that the model, the underlying model needs to change. It makes total sense. One, digging in on that a little bit more, we think about the data lake and, you know, of course there's tons of memes around it being a data swamp, you know, and, you know, we're collecting more data than we ever have before. What are the new ideas that we need to think about in order to manage that, right? Because what's attractive about a data catalog, I guess you could say, is that you have, call it like a single source of truth or, you know, sort of a shared set of definitions, whatever you want to call it,
Starting point is 00:14:45 that people can use as a reference point. And like you said, when the producers were, you know, they had to engineer all of that stuff, right? And so they basically designed from a spec and that is your data catalog, right? Essentially. But when you can just point SaaS pipelines from any source to your data lake or your data warehouse. It's this crazy world where like, could a data catalog even keep up? And so what are some new ideas for us to sort of operate in this new world? I think it's a question of socio-technical engineering. So funnily enough, there is sort of a modern day library, which I would say is Amazon. I mean, that's sort of Jeff Bezos' whole original idea.
Starting point is 00:15:41 It was a bookstore on the internet. But it was different from a typical library because it was totally decentralized. There wasn't one person curating all the books in the library. The curation actually fell onto the sellers of those books. And what Amazon did is they built an algorithm that was based around search. It was a ranking algorithm. And that ranking algorithm would elevate certain books higher in search based on their relevancy and the metadata that these curators or the book owners would actually add. And there's a really strong, powerful incentive for the owner of each book to play the game, right? To participate. Because if they do a good job adding their context, it ranks higher, which means more people pay them money. And the same is true for any sort
Starting point is 00:16:31 of ranking algorithm-based system like Google or anything else, right? You're incentivizing the people who own the websites to add the metadata so that they get searched for more often. I think this paradigm is what a lot of the modern cataloging solutions have tried to emulate. Like, let's move more to search. Let's move more to machine learning-based ranking. But the problem to me is that it hasn't captured that socio-technological incentive. The Amazon book owner, their incentive is money. The Google website owner owner their incentive is money the google website owner their incentive is you know clicks or whatever value they get from someone going to their website what is the incentive of a data analyst or a data scientist to provide all of that metadata to get their
Starting point is 00:17:18 particular asset ranked is that even something they want at all? Because if they're working around a data asset, do they want to expose that to the broader organization? Does that mean if they have thousands of people now taking a dependency on it, that it becomes part of their workload to support it, which they may not want to do nor have the time to do? So I think the incentives are not aligned. And in order to exist in this federated world, there has to be a way to better align those incentives. I think that's what needs to change. Well, okay. You brought up two concepts in there, and I'm going to label them, but let me know if I'm labeling my data incorrectly. But there's this concept of data discovery. I think the point
Starting point is 00:18:06 about search is really interesting, right? Okay, so you have this massive data lake and you have a search-focused data catalog type product that allows, you know, and you can apply ranking, et cetera. But in many ways, that's sort of data discovery, right? The bookseller on Amazon is trying to help people who like murder mystery fiction to discover their work, right? Which is great. I mean, that is certainly a problem, right? But when you think about the other use of the data catalog beyond just discovery, there's a governance aspect, right? Because there's these questions of, we found something that is not in the catalog. Should it be in there, right? Or there's something in the catalog that has changed, or we need to update the catalog
Starting point is 00:18:58 itself, right? And so how do you marry those two worlds? And I mean, I agree, the catalog is a really, is it even the right way to think about that? Because discovery and governance or quality or whatever labels you want to put on that side of it are extremely different challenges. Yeah, I think that's exactly right. I think that they have very different implications as well. I do think that a great discovery system requires a couple problems. I think the first is really great discovery actually requires more context than a system
Starting point is 00:19:42 built on top of downstream data alone is able to provide. If I'm a data scientist or an analyst, and I was at one point in my career, what I really wanted when I was doing a search for data was to understand, you know, what does this data mean? Who is using it? Which is an indication of trust. Where is it coming from? What was its intended purpose? Can I trust it at all? And how should I use it, right? These are sort of the big categories of questions that I wanted to answer.
Starting point is 00:20:21 If a data catalog is simply scraping data from, you know, a Snowflake instance and then putting a UI on it, putting it into a list and letting people, you know, look at the metadata, it's only answering sort of a small subset of those questions that I have. It's like, yep, what is the thing? Do I can I find something that matches the string input that I typed into a search box? But all the other questions I now have to go and figure out basically on my own, by talking to people, potentially talking to engineers, trying to trace this to some code-based resource or some other external resource. And that lowers the utility of the catalog by quite a bit. And then there's the governance side that you mentioned. And governance and quality is really interesting, kind of like I implied before, because in sort of a supply chain universe,
Starting point is 00:21:13 the quality and the governance is going to be on the producer. I mean, it's really the only way. And if the governance is going to be on the producer, that means that the producer needs to have an incentive to add that governance in the first place. And I think today it's very hard as a producer to even know who is taking a dependency on the data that you are generating. You don't know how they're using it, and therefore you don't even know what metadata would be relevant for them. And you may not even want to expose all of that metadata, like I mentioned before.
Starting point is 00:21:52 So to your earlier point, I think catalog is probably, at least to me anyway, it's not the right way of framing the problem. If I could frame it differently, it may be more around like inventory management. And that's more of the supply chain take than sort of the old school take. Yeah, absolutely fascinating. When we think about, and actually I'd love to dig into the practical nature of Gable really quickly,
Starting point is 00:22:23 just because I think it's interesting to talk about the supply chain. And maybe a fun way to do quickly, just because I think it's interesting to talk about the supply chain and maybe a fun way to do it. You know, you and I recently talked about some of the data quality features that Rudder SAC recently implemented, right? And I think it's a good example because they're a very small slice of the pie, right? They're designed to, you know, help catch errors and event data at the source, right? At the very beginning, right? So you have events being emitted from some website or app.
Starting point is 00:22:50 You can have sort of defined schemas that allow you to say, look, if this property is missing, drop the event, do whatever, right? Propagate an error, send it downstream. First of all, would you consider that as sort of a like a producer a source how does that like orient us to in gable where would the rudder stack sort of data source like sit is that a producer absolutely i i think that a rudder stack would be a producer i think pretty much you know the way i've thought about it is that there's really two types of producer assets, I guess. There's code assets, or maybe three.
Starting point is 00:23:31 There's sort of code assets. There's structures of data, so like schemas and things like this. And then there's the actual contents of data itself. And like you said, there's lots and lots of different slices of this problem where the events that you're emitting from your application like RutterStack are one area where you need this type of coverage. Like I said, APIs that you ingest, you've got kind of backend events,
Starting point is 00:24:04 you've got custom frontend events, you've got, you know, C sharp and.net and like all of these other sort of this very wide variety of things. And so I think everything that you talked about sort of in the Rudderstack webinar, which was, you know, being able to check the data live as it's flowing from one system to another system, doing schema management, all of that we consider. I think that's totally relevant to what Gable is working on as well. We also are trying to look at things like, can we actually analyze the code pre-deployment and figure out if a change that's coming through a pull request
Starting point is 00:24:43 is going to cause a violation of the contract, wherein the contract is just an expectation of the data from a consumer. And there is some level of sophistication to that. We do have, for example, like static code analysis that crawls an abstract syntax tree. We can basically figure out when a change is made, what are all of the sort of dependencies in code that like power that change, what all the function calls. And then if any function call is modified anywhere in that syntax tree, we can then recognize that it's going to impact the data in some way. And then in addition to that, and this is where I think things get really cool, is we can layer on artificial intelligence.
Starting point is 00:25:27 So not only would we know how different changes within that syntax tree can affect the schema, we can also know how it affects the actual constants of the data before the change is deployed. So an example of that would be, and this is like a typically very difficult thing to catch pre-deployment is, you know, let's say I have a date time field and we can say it's like datetime.now and a product engineer decides to change that to like datetime.utc now. If you've been in data for any amount of time, like a very common date format, engineers love it. but like that change represents an enormous amount of difficulty to detect and modify in all the places in all the areas. In CICD, not only could we identify that change is going to happen, but we could actually understand that it is changing to UTC and then communicate that to everyone that depends on that data.
Starting point is 00:26:25 That allows the consumer to either say, okay, I'm going to prepare all of my queries for UTC from now on. Or if it's a really important thing and you might say, hey, software engineer, I want to give you some feedback that you're going to cause an outage to 10 different teams. So please don't make this change right now. That's like one big part of the platform is that like you're shifting left, trying to catch them. They're closer to the source as a part of DevOps.
Starting point is 00:26:55 And then the other side of it is, like you said with Rutherstack, we try to catch stuff in flight as well. So if someone has made a bunch of changes, if there's a lot of changes coming through in files that land in a Postgres database in S3, we run at S3, we look at those files individually, map them back to the contracts, and then we can send some signals to the data platform team to say, hey, there's some bad data that's coming through. Now is your opportunity to get in front of it so that it
Starting point is 00:27:29 doesn't actually make its way into the pipeline. Yep. I want to drill down that just a little bit more. And I'm going to give you an example of a contract, but please feel free to trash it and pick something else. But let's take this contract around like a metric like active users, right? You know, of course, like one of those definitions that you ask around a company and you get five different definitions, we need to turn that into a contract so that all the reports downstream
Starting point is 00:27:54 are using sort of the same metric or whatever, right? And maybe Rutter Stack Event Data is a contributor to that definition, you know, based on a timestamp of some user activity, right? But there are tons of other ingredients into that metric, right? And so maybe you need to roll that up at an account level. And so you either need, you know, a copy of the Postgres production database, you know, so you can make that connection or, you know, Salesforce or whatever it is, right? You need maybe subscription data from a payment system so that you know what plan they're on so you can look at active users by all those
Starting point is 00:28:34 different tiers. So we have that contract in Gable. And so can you just kind of describe the way that you might wire in a couple of those other pieces beyond just the event data because i think the other interesting thing is you know we think about data quality at ruddersack we're just trying to align to a schema definition but what's interesting is that the downstream definition in a contract actually may interpret that differently in the business context, as opposed to there's a diff on the schema and something's different, right? Yes. So I think there's sort of two different ways to think about this. One way, and the way that I
Starting point is 00:29:19 usually recommend people to think about this problem is to start from the top down. There's a couple of reasons for that. One, it can be organizationally very difficult to ask someone downstream to take a contract around something like a transformation around a metric in Snowflake or something like that, or BigQuery, if the inputs to that contract are not under contract, right? That feels a bit scary. It's like I am now taking accountability for something that I may not necessarily control. And so oftentimes there is pushback to that,
Starting point is 00:30:00 which is why I usually say that the best place to start with contracts is from the sources first, and then waterfall your way down. Interesting. The second piece of that is, the second piece of that is that, like, I think that there's a longer term horizon on this stuff where everything I just said doesn't apply, which is starting to integrate more concepts around of data lineage into contract definition. So let's say that I have this sort of metrics table and I want to put a contract on it, but nothing exists. In the ideal world, you would be able to say, I want this contract, and now I want some underlying system to figure out what all of the sources are sort of end
Starting point is 00:30:45 to end. I want to create almost like a data lineage in reverse. And then I simply want to either ask for a contract or to start collecting data on how changes to those upstream systems are ultimately going to affect this transformation of mine downstream. This is something that we hear a lot where teams basically say, I want contracts, but I don't really have the social, like political capital to put in my engineering team and tell them what to do without evidence.
Starting point is 00:31:18 And they would like to just collect that data first. So I think that's sort of the other is being able to construct that lineage, understanding how things changing, collecting the data and creating the evidence for the contracts and then implementing them from there. Yeah. Love the phrase around, I don't want responsibility for something that's not under contract. Okay. I actually have a question. I know Costas has a ton of questions, but I actually have a question for you and for Costas. When we think about contracts, right, so I think about, you know, I brought up the example of active users, it could be any number of things. You've been a practitioner, Costas, you've been a practitioner, you've built a bunch of data tooling. How fragmented are the problems of data quality? And I guess maybe we could think about the 80-20 rule. And part of the reason I want to ask is because, you know, even, you know, in the work that I do with, you know, analytics and stuff like that, you always wonder, it's like, man, I mean, this is kind of messy. Like, I wonder what it's like at other companies. Is it the same set of problems? Is it really
Starting point is 00:32:19 fragmented? Does the 80-20 rule apply where there's like, you need these set of, you know, contracts and they'll take care of 80% of the problems? But what have you seen? Chad, maybe start with you and then Costas would love your thoughts as well. The numbers that I have seen is that 50 to 70% of the data quality issues are coming from the upstream source systems or the data producers. That's sort of the most typical range that I've heard. Now, within that, I think that there is a pretty wide variety of problems. For example, like databases, changes to databases, not really that complex of a problem. And the reason why it's generally not a problem for data teams is because engineers don't do a lot of backwards incompatible stuff because they're scared of deleting columns that other teams are using. Sure. Yeah, yeah.
Starting point is 00:33:19 And so, but there is still a quality problem there, which is like, well, as a software engineer, maybe I'm just going to add a new column that contains data from the old column, and I don't communicate that to the team downstream. So that's an issue. And then on the actual kind of business logic code side of the house, this is where we hear issues on the data content. And that's like that sort of daytime UTC change that I mentioned before. We also hear a ton of problems around third party vendors, especially schema changes.
Starting point is 00:33:53 And that's because they're really under no obligation to not make those changes. And a lot of the actual like financial, the legal contracts between teams, doesn't account for changes to the actual data structures themselves, right? The SLA is more about uptime of the actual service, but not, will this data suddenly change from tomorrow to today? So depending on where companies have built the majority of their data infrastructure, you'll see a very different sort of split in what upstream problems are causing the most issues.
Starting point is 00:34:29 Yeah, I think it's all like described it very well. And I think it gets, it's probably getting, it gets like even more complicated when we start like considering all the different roles out there that they can make changes to a database schema, right? Like, for example, let's say you're using Salesforce. I mean, Salesforce at the end, it is like a user interface, like on a database. You have people there who can go and in a table that they don't see it as a table, they
Starting point is 00:34:59 see it as a leads or whatever. They can go and make make changes there. Right. And these changes can propagate down like to the like the data infrastructure that we have and like all that stuff. So I think especially like after Chas and that's like what I find like very interesting with like Chad was saying about like the catalog because yeah, sure, like back then we had very narrow narrow set of producers at the end, right? That were under a lot of control by someone. But pretty much every system that we are using in the company to do something is potentially a data producer.
Starting point is 00:35:36 And the people behind them are not necessarily data people or even engineers, right? They can be salespeople or marketing people or HR people or whatever. I don't think anyone can, let's say, require from them to understand what UTC even is when they are going to make changes. And that's obviously on top of what, let's say, Salesforce on their own might change there, which I would say is probably more rare than what is caused by, let's say, the actual users. So yeah, I mean, I think it makes total sense that most of the, let's say, the problems come from the production of the data out there. But it's also, I think, the question I have for you, like actually Chad is, even if we focus only on the production side, right?
Starting point is 00:36:29 Let's go upstream. Is there among, let's say the upstream producers of like a typical company out there, another Pareto kind of distribution in terms of like where most of the problems come from compared like to others? Yeah. I mean, I think you actually touched on a few of them. A lot of these sort of third-party tools like Salesforce, HubSpot, SAP that are maintained by teams outside of the data organization.
Starting point is 00:36:57 I mean, you said it exactly. It doesn't seem like a problem as a salesperson or a Salesforce administrator to delete a couple columns in your schema that you're working with. But if you're relying on that data for your finance team or your machine learning team, this becomes hugely problematic. So this is almost always a source of pain. I think the other thing that's very problematic are the events. And we hear front-end events are especially notorious. And this is something I think that Eric and the Rudderstack
Starting point is 00:37:33 team are sort of working on, but we hear it all the time where you have this relatively legacy code base and there's a ton of different objects in code that are generating data. And for every single feature that's deployed, those may or may not change. The events may suddenly stop firing or new events might be suddenly added and no one is told about that. And the ETL processes don't get built. There's just such a large communication gap between the teams sort of working on the features that are producing the data and the teams that are using the data that, you know, really anything that can go wrong oftentimes does.
Starting point is 00:38:14 And then the other really big area, I think, is the external data. This is where it's just, it is unbelievably problematic. And a lot of companies, they're not sort of ingesting real-time data feeds. It's sort of much longer batch processes that take a lot longer to load. So it might be every quarter I pull in a big data set or every couple months I pull in a big data set. And there's so much change that happens on the producer side between, you know, the times that they vend these large data sets out that it could look like a completely different thing when
Starting point is 00:38:50 you get from month to month or quarter to quarter. And there's so much work that then has to go into sort of putting the data into a format that can actually be ingested into the existing pipeline that it just causes it. You know, there's a company I was talking to where they basically said the data team lost our entire December to one of those changes. And I think that these types of things are very common. Eric, anything you want to add there? No, I know you have a ton more questions.
Starting point is 00:39:19 Of course, I could ask a bunch of questions, but I'm just soaking this up like a sponge. I love it. Okay. Okay. Okay. So let's talk about events. Let's get a little bit deeper into that. And before we get into the data and the technology part, let's talk a little bit about humans and organizations there.
Starting point is 00:39:41 So I have a feeling that not that many front-end developers have ever been promoted because of their data hygiene when it comes to events, right? So how do we align that? Because you made a very good point about, let's say, the incentives out there in the marketplace, for example, where people are actually incentivized to go and put good metadata or even get to the point where they try to game the algorithms with the metadata that they put there. But in organizations that are not necessarily even aligned between them inside engineering, the data teams and the product teams might not be aligned. right? Like, how can we do that? And like, where are the limits at the end of like technology with that stuff, right? Exactly.
Starting point is 00:40:29 I mean, I think that your last sentence there hit it exactly. I think that technology can only do so much. In my opinion, and what I've seen, like you said, it comes down to incentives. And so the question is, in fact, when I was at Convoy, I asked engineers this exact question as I went to them and I said, hey, how can I get you to care about the data that you're producing because you're changing things and it's causing a big problem for us? And the answer that I heard pretty consistently was, well, I need to know who has a dependency on me. Who is using that data? Why are they using it? And when am I going to do something that affects them? I don't have any of
Starting point is 00:41:14 that context right now when I'm going through my day-to-day work. And so it feels a bit bad. I think if you're an engineer and you're going through your typical processes, you're making some sort of code change. It gets reviewed by everybody on your team. They give you the thumbs up, you ship it, you deploy it. And then two and a half weeks later, some data guy comes to bang on your door and say, hey, you made a change and it broke everything downstream. It's like at that point, you've already moved on to the next project. You're working on something new. You've left the old stuff behind. It just doesn't feel good to have to retract all of that. And this is why something we've heard a lot is like product engineers generally tend to see data things as being the realm of data people, right? If you are anything sort of in the data warehouse
Starting point is 00:42:06 is kind of treated as a black box. And if there's a problem caused there, then the data teams will just, they'll deal with it downstream. And I think that this mentality needs, it needs to change. And I think that product can help it change. So one example of this is DevSecOps,
Starting point is 00:42:25 right? The whole discipline of DevSecOps has evolved over the past five to seven years from security engineers that have basically said, look, we cannot permanently be in a reactive state when it comes to security issues. We can respond to hacking, we can respond to fraud, but the best case scenario for us is to start to incorporate security best practices into the software development lifecycle as, for example, just another step within CICD. And I think this is what needs to happen within data. Checks for data quality should be another step within CICD. And that step, just like any other integration test or any other form of code review, should communicate context to both the producer and the consumer of what's about to go wrong.
Starting point is 00:43:21 So if I can tell an engineer, for example, hey, the change that you are about to make is going to cause this amount of damage downstream to these data products and these people, you've now created a sense of accountability. If they continue to deploy the change, even in that environment, well, you can't say you don't know anymore. It's no longer a black box. It's been open. And it provides an opportunity for the data scientists to plead their case and say, hey, you're about to break us. Can you at least give us a few days or give us a week to account for this? I think that is a type of communication that changes culture over time. Yeah, makes total sense. And okay, we talked
Starting point is 00:44:03 about the people and how they are involved and what is needed there, but let's talk also a little bit about technology. What are the tools that are missing? And what are the tools also that... Where are the opportunities, let's say, in the toolbox that we have today to go and build new things? You mentioned, for example, the catalog, that it's a concept that probably has to be evolved. And it's something that we had a panel a couple of weeks ago with folks like Ryan from Iceberg and Wes McKinney. And it was one of the things that came up, that the is like one of these things that we might have like to rethink about.
Starting point is 00:44:48 It's catalogs, by the way, I think what like most people have in their mind, they are thinking of like a place where I can go and, as you said, like it's an inventory of like things, right?
Starting point is 00:44:58 Where I can find my assets and like reason about what I can do and what I'm looking for. But catalogs are also what, let's say, fuel the query engines out there. There's also metadata that the systems need to go and execute the queries. So there are multiple different layers from the machine up to the human that they have to interact with. So what are the tooling that you see missing and what are the opportunities? So what I think is missing for the catalog to be effective is feedback loops and action
Starting point is 00:45:38 ability. Basically, or to maybe phrase it another way, give something, get something. If I can provide as a consumer or even a producer for that matter, if I can provide information to a catalog that helps me in some way, then I am more inclined to provide that information more frequently. And as a data product owner, what I would like to get back in return, or one of the most valuable things that I could get back, is either some information about where my data is coming from, the source of truth, who it's actually owned by,
Starting point is 00:46:20 this sort of class of problems that I mentioned before that I'm interested in, or I get data quality in response, right? And so this kind of ties back to the point I was making earlier around lineage. And I'll just give you a very simple example to illustrate, you know, let's say within the warehouse, there's sort of a table, maybe a raw table that's owned by a data engineer. And then a few transformation steps away, there was, I don't know what Eric was saying. There was like some metric that's been produced by a product team and they don't want that to break. Now, what they could do is that if they, through whatever the system is, they could effectively describe what their quality needs are. And then we could traverse
Starting point is 00:47:02 the lineage graph and say, okay, I can now communicate these quality needs to all of the producers who manage data that ultimately inputs to this metric. And I can be sure that if there is ever going to be a change that violated those expectations, I would know about it in advance. Now I, as the metric owner, am a lot more inclined to add good information, right? So I've created a feedback loop where I'm providing metadata and details about my data object that I maintain. I'm getting something which is quality in return. And now I've built something that is robust that someone else can take a dependency on. And I think this is the type of system that basically has to exist where
Starting point is 00:47:53 the team, the producer team of some data object is getting a lot of value in return for contributing the metadata in the context, which I don't think is the case today. And you say you mentioned the word like metadata and in said device, people like to go and add the metadata there. What is the metadata that's missing right now? To construct these... Because Lineage, okay, the Lineage graph is not a new concept, right? It's been around for a while, but why it's not enough what we have already? What is missing from there?
Starting point is 00:48:19 Well, I think it's a couple of things. I think one thing is that, number one, the lineage graph doesn't actually go far enough. And you hear this a lot, like right now, especially in the modern data stack, the limits, the edges of the lineage graph basically end at structured data. And if that's where you stop, then you're missing another 50% of the lineage, which means that if something does change in that sort of unstructured code-based world, it is ultimately still going to impact you. Any monitoring or quality checks at that point are just reactive to the changes that have happened. So number one, you need to actually
Starting point is 00:48:59 have the full lineage in place in order for the system to actually work the way that I'm describing it. And then in terms of what metadata is missing, I think there's a massive amount, right? Number one, for just like the biggest question that I probably had as a data scientist and got as a data platform leader is what does a single row in this table actually represent? That data is found almost nowhere in the catalogs because again, there's no real incentive for someone to go through all of the various objects that they own and add that. Same is true for all the columns. Like if we have a column called,
Starting point is 00:49:37 I don't know, in convoy, it was a freight company. And so this idea of distance was very important. We had probably 12 this idea of distance was very important. We had probably 12 different definitions of distance, and none of them were laid out explicitly in the catalog. Distance might be in terms of miles. It might be in terms of time. It might be in terms of geography. It might be some combination of all of those. But if I, as the owner of that data product, can communicate exactly what I mean by distance, then that's going to help the upstream teams better communicate when something changes that impacts my understanding. So yeah, I think that's sort of the idea is I think all of
Starting point is 00:50:19 the semantic information about the data, that's the missing metadata, in my opinion. Yeah, yeah, makes sense. Do you see an opportunity there for AI to play a role with the semantics of all this data that we have? And if yes, how? Yes, number one, I think so. I think the challenge, though, is that, well, again, I think there's a couple ways that this can play out. Ultimately, I think that this is what all businesses will need to do in order to really scale up their AI operations. They are going to need to add some sort of language-based semantic information to their core datasets. Otherwise, all this idea of like, oh, I'm just going to be able to automatically query any data in my
Starting point is 00:51:13 dataset and ask it any question, all of that's going to be impossible because the semantic information is not there to do it. It's just tables and columns and nobody knows what this stuff actually refers to. I think one option is that the leadership could just say, okay, everybody that owns something in data, we're going to spend a year or maybe two years going to all of the big data sets in our organization and trying to fill out as much of the semantic detail as we possibly can. I think that could help as a start, but I tried this when I was onboarding a data catalog and it's like temporary, right? Like you get the initial boost, like maybe for a month, you get a ton of metadata added all at once. And then it just kind of gradually
Starting point is 00:52:00 slopes off and ultimately isn't maintained, which is pretty problematic. I think a better way to do it is to start from the sources and trickle down in the same way I was describing Eric before. And I think all of this comes back to the contract. If you can have a contract that is rich with this semantic information, starting from the source systems, it is the responsibility of the producers to maintain. They understand what all of their dependencies are. Anytime something changes with the contract, they're actually not allowed to deploy that change unless they have evolved the contract and contributed the required semantic update. Then you get this sort of nice model of inheritance where every single data set that is leveraging that semantic metadata can then use it to build their own contract.
Starting point is 00:52:50 And I think a lot of that could actually be automated. This is more of a far off future, but I think it would be a more sustainable way of ensuring that the catalog is actually up to date and the data is trustworthy. Yeah, makes total sense. Eric, we're close to the end here. So I'd like to give you some time to ask any other questions you have. Yeah, so two more questions for me. One, just following on to the AI topic.
Starting point is 00:53:22 What are the, you know, when you think about the risks, and this is somewhat of a tired topic, but I think it's really interesting in the context of data quality as we're discussing it, I agree with you that AI can have a massive impact on the ability to scale certain aspects of this, right? But when we're talking about a data contract, the impact of something going wrong is significant, right? It's not like you need to double check your facts because you're researching some information, right? It's not like you need to double check your facts because you're, you know, researching some information, right? You're talking about something, you know, someone potentially making an errant decision, you know, for a business. So how do you think about that aspect? And, you know, I guess maybe as we think about the next several years, when do you see that problem being worked out?
Starting point is 00:54:26 I think that it's going to require treating the data as a product in terms of the environments that data teams are using. And what I mean by that is, today, when we are building software applications, what delineates a software application in a QA sort of test environment from something that is production and deployed to users is the process that it follows to get there. Ultimately, code is not that dissimilar. It's just that there's a series of quality checks and CICD checks and unit testing and integration testing and code review and monitoring. It's the process you follow that actually makes like some bit of code a production system or not. And I think that in the data world, exactly as you've said, what makes something production, is it trustworthy? Is there a very
Starting point is 00:55:23 clear owner? Do we know exactly what this data means? Is there a mechanism for evolving the data over time? Do we have the ability to iteratively manage that context? And I think the process that has to be followed from kind of like experimental data sets to a production data set is a lot of the same stuff. It's CICD and unit tests and integration. I think contracts play a really big part of that. There needs to be a contract in place before we consider data production grade. And I think this is where the environments comes in. There needs to be actually literally different environments for a data asset that is production versus one that is not. And I think that should have impacts on where you can use
Starting point is 00:56:13 that data. If we don't have a data set that has a contract and has gone through the productionization process, I can't use it in my machine learning model, or I can't use it, I can't share it with our executive team in a dashboard or report. And in the same way that like, I can't deploy something to a customer if I don't follow my, you know, quality, my code quality process. I think this is the thing that probably needs to change the most. Like right now in data, we don't delineate at all between what is production and what is not production in the sense of like customer utility. It's all sort of
Starting point is 00:56:49 bunched into a big spaghetti glob. Yeah. Super helpful. All right. Last question. You know, a lot of what we've talked about, one way to summarize it could be, you know, you almost need to slow down to go faster, right? Right? You know, actually defining contracts, actually putting data producers under contract. You know, you use the term socio technological, right? It involves people. That takes time. Can you speak to the listener who has followed along this conversation and said, man, I would love to start fixing this problem at my company, but it's really hard to get things to slow down so you can go faster in the future. What would be the top couple pieces of advice for that person? So yeah, so first of all, I agree with you, there is some element of slowing
Starting point is 00:57:55 down. But at the same time, I would say that, like, I think that's the same for code quality too, right? GitHub does slow us down, right? And CICD checks do slow us down. And having something like LaunchDarkly that controls feature deployments is going slower than just deploying everything to 100% of our audience. But what software teams have realized is that in the long run, if you do not have these types of quality gates in place, you'll be dealing with bugs so frequently that you'll be spending a lot more time on that than you will on shipping products. So that's sort of the first framing that I would take, because I think this falls under that exact sort of class of problems. The second thing I would say is, I think the problems that a lot of engineering organizations and even more business units have with slowing down on the data side is that they are still not treating their data like it is a product. They're treating it more like, hey, it's just some airy thing. I want an answer to a question. I get an answer to a question. It's not something that needs a maintainer and it has to be robust and trustworthy and scalable and
Starting point is 00:59:18 all these other things, which is kind of the implication. It's like if I ask a question about my business, it is implied that it is trustworthy and that it's high quality, but oftentimes that connection is not made. And so what I oftentimes recommend people to do is you have to illustrate that to the company and then illustrate the gap. So a concept I use a lot at Convoy was this idea of tier one data services. And that basically means there are some set of data objects at your business where a data quality issue can be tracked back to revenue or business value. So in Convoy's case, we were using a lot of machine learning models. A single null value for a record would mean that particular row of training data would need to get thrown out.
Starting point is 01:00:07 And if you're doing that a lot, then you can actually map that to a deep inaccuracy. And if you know how much value your model is producing, then every percentage point in inaccuracy can be traced to a literal dollar sign, right? And so that's sort of one application. I think there's lots of applications within finance. There's some really important reporting that goes on. Once you sort of identify all of these use cases for data, what I then like to do is map out the lineage
Starting point is 01:00:39 and go all the way back to the source systems to the very beginning and say, okay, now we see that there is this tree. There's all these producers and consumers that are feeding into this ultimate data product. And then the question is, how many of these producers and consumers have contracts? How many of them know that this downstream system even exists?
Starting point is 01:01:02 And how many times has that data been changed in a way that's ultimately backwards and compatible and causes quality issues with that system even exists? And how many times has that data been changed in a way that's ultimately backwards and compatible and causes quality issues with that system? Now, with all of that, you can actually quantify the cost of any potential change to any input to your tier one data service. And you can put that in front of a CTO or a head of engineering or head of data or even the CEO, and it becomes immediately important the level of risk that the company faces not having something like this in place. So that's a really excellent way to get started. A lot of companies are beginning just
Starting point is 01:01:34 with paper contracts and saying, here are the agreements and the expectations that we need as a set of data consumers and then working to implement those more programmatically over time. Such helpful advice that I really need to take to heart in the stuff I do at Data Every Day. Chad, thank you so much for joining us. If anyone is interested in connecting with Chad, you can find him on LinkedIn.
Starting point is 01:02:00 Gable.ai is the website. So you can head there, check out the product. And Chad, yeah, thank you again for such a great conversation. Thank you for having me. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Starting point is 00:00:00 Learn how to build a CDP on your data warehouse at rudderstack.com. Thank you.
