The Data Stack Show - 86: Solving the Data Quality Problem with Bigeye, Great Expectations, Metaplane, and Lightup.ai

Episode Date: May 11, 2022

Highlights from this week's conversation include: Guest introductions (1:02), Defining data quality (4:08), Forgetting to apply software best practices (8:33), Differentiating observability and data quality (17:53), Who should care about quality in the organization (26:55), Why this is still a valid conversation (35:44), and The jurisdiction of various components (45:39).

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome everyone to the Data Stack Show live. We have done a couple of these now and it's quickly becoming one of our favorite things that we do. We have a great show lined up today. We are going to talk about data quality, which is a very wide-ranging subject. We've collected some of the top thinkers in the industry who
Starting point is 00:00:44 are building products around this, and we're super excited to chat. Also, for those of you who are joining, please drop questions in the Q&A as we go through the show. We'd love to answer those live. We love going down rabbit trails that are helpful for the listeners. So please drop your questions in the Q&A. And why don't we go ahead and get started with some introductions. I'll just go in order of the people I see on Zoom. So Ben, do you want to give us a brief background and tell us about yourself? Sure. Thanks, Eric. And thanks for having us on. I'm Ben Castleton, co-founder at Superconductive, and we're the team behind Great Expectations. And basically, it's a product that allows us to test and validate data and provide documentation, and it's an open-source platform.
Starting point is 00:01:32 Excited to be here. Excited to have you. All right. Manu, you're up next. Hey, everyone. This is Manu. I'm founder and CEO of LightUp Data. We're a three-year-old startup backed by Andreessen solving the data quality problem, or what we might sometimes call data observability. We are not open source and proudly so at this point, but super excited to be building the product we have and the progress we have made since we started. This is not an easy problem, as I'm sure every panelist here will agree to. And we'd love to talk more about what the problem means, save you from how we do it.
Starting point is 00:02:05 That can always come later, but I think there's a lot of conversations still needing to happen around why we should be looking at this problem and why it's central to the modern data stack. And I'm very excited to be here on this panel today. So thanks for having me on the show. Of course, so excited to have you.
Starting point is 00:02:20 All right, Kevin. Hi, everyone. I'm Kevin, the co-founder and CEO of Metaplane, a plug-and-play data observability tool that helps you spot data quality issues, assess the impact, and figure out the root cause. And also, thank you both for creating this DMZ between competitors. I think we can help grow the market together. And we chatted before the show, and I was really looking forward to this conversation. Absolutely. Well, so excited to have you. Love the DMZ description.
Starting point is 00:02:49 All right, Igor. I guess we left all of our knives at home, right? Hey, everyone. I'm Igor. I'm the co-founder and CTO here at BigEye. We are a data observability platform. We help you build the sort of workflows that you need in order to detect and deal with any sort of data quality problems that come up in your systems. And as everyone else mentioned, Eric and Costas,
Starting point is 00:03:11 thanks a lot for organizing this. I'm very excited to have a conversation and see what all these other brilliant minds think about the data quality space. Absolutely. Well, so excited to have you. Well, let's start. So one thing I love about this group is that I think everyone really shares the same mindset that, you know, getting into the technical details is certainly important, but it really is about the core problem that we're trying to solve. And so I'd love to try to put definition to data quality. And actually, we'll just go in reverse order here. So I'd love for you to try to define data quality in just a couple sentences, because I think,
Starting point is 00:03:50 you know, for a lot of our listeners, you hear data quality, data observability, data lineage, you know, anomaly detection, like there are tons of terms. Let's zoom out from the details a little bit. Igor, can you define data quality for us? Sure. So if we want to zoom out all the way, I think that data quality is knowing whether or not your data is fit for the use that is intended for it by the business. I think this is about as broad as I can make that definition. Everything else that we talk about, data observability, lineage, issue management, notification management, all of those are tools in order to detect and deal with problems that come up. Data quality itself is really just, can the business do something useful with the data? They want to do something. And is the data in the right shape, in the right format, in the right state for somebody in the organization to work with it?
Starting point is 00:04:55 Love it. All right, Kevin. Very well said by Igor. I mean, data exists to be used, right? It exists to make a decision on it, to take an action on it, to automate something. So when there is a departure from what you'd expect, like you expect an hourly dashboard
Starting point is 00:05:10 and it's been more than an hour, then that is a data quality issue. And plus one to the idea that data lineage, data observability, all of these, these are just technologies in service of the problem, which is data quality.
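Kevin's hourly-dashboard example maps onto the most basic check these tools run: measure how long ago the table behind the dashboard last received data and compare that lag to the cadence the consumer expects. Here is a minimal sketch in Python, assuming a warehouse table with a load-timestamp column reachable over a standard DB-API connection; the table name, column name, and one-hour SLA are purely illustrative, not any panelist's product.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(conn, table: str, ts_column: str, max_lag: timedelta) -> bool:
    """Return True if the table received data within the expected cadence."""
    cur = conn.cursor()
    # Table and column names are assumed to come from trusted config, not user input.
    cur.execute(f"SELECT MAX({ts_column}) FROM {table}")
    last_loaded = cur.fetchone()[0]
    if last_loaded is None:
        return False  # the table has never loaded anything: treat it as stale
    # Assumes the warehouse returns timezone-aware UTC timestamps.
    return datetime.now(timezone.utc) - last_loaded <= max_lag

# The dashboard is fed hourly, so more than an hour of lag is exactly the
# departure-from-expectation Kevin describes.
# stale = not is_fresh(conn, "analytics.orders", "loaded_at", timedelta(hours=1))
```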
Starting point is 00:05:25 Love it. So, I mean, I think Igor and Kevin did a great job of kind of talking about the broad definitions. Maybe I can take a stab at making it a little sharper from my point of view. I think of data quality as issues where data is broken, but your infrastructure is healthy. So kind of think about what class of issues we would want to call data quality issues. You know, there's issues where infrastructure goes down, a machine goes down, and now data stops showing up. And that's a data quality issue to some extent.
Starting point is 00:05:55 But you're trying to differentiate between a machine going down and any effects that can have versus when everything is looking good on your Datadog dashboard or your APM tool. And yet, when you have a data-driven product, it's not doing what it's supposed to do, right? And I'd love to dig deeper into this and resolve this conversation, but there's kind of issues that tend to be very close to the business and usually end up requiring manual judgment. And that's a class of data quality issues that's very important, but not necessarily
Starting point is 00:06:21 solvable by a tool. So we are starting to kind of converge on the definition of what are more operational data quality issues, where you can put a system at it and have something detected autonomously, right? As opposed to requiring manual judgment. But where I would leave this is
Starting point is 00:06:37 issues that are not coming from the infrastructure and your data is still broken. Love it. All right, Ben, you're going last after everyone else gave their definitions. I don't know if that's easier or harder, but you're up. I just took notes on what they said. I'm just going to repeat it here.
Starting point is 00:06:53 But seriously, love what Manu's saying. And we think of the same thing. It's kind of like if you are running data through an application and nothing's wrong with that, but it can still have a really impactful problem set if you've got data that isn't what you expect it to be. And so at Great Expectations we think of separately testing the data as just as important as testing your software. And that's kind of the crux of why we came up with that name. I also wanted to just comment and say, the way we say it is data quality can be kind of reframed to say, you're basically looking to see if the data is fit for the purpose that you intend it for. And so not necessarily,
Starting point is 00:07:41 like I could have a data set that has all these issues that, you know, somebody else might say are issues, but if it's intended for a purpose that I have in mind and it works for that purpose, then I don't really have data quality issues that I'm concerned about. And so we think of that as just testing to see if it's fit for the purpose that it's intended for. Fascinating. Yeah, that is. I'd love to dig into that a little bit more later in the show,
Starting point is 00:08:05 because quality is both highly objective, you know, on certain vectors and highly subjective on other vectors. So I'd love to talk about that. First, though, I'd love to hear and we can just, we'll just do a round robin again, and we'll go in reverse order again. And then I'll hand the mic to Kostas. But one thing I'm interested in is, you know, all of you are building, you know, really great tools in this space. But as, you know, as we all know, the data space is adopting a lot of paradigms
Starting point is 00:08:35 from software engineering, right? And quality, observability, you know, sort of lineage, issue management, all the sort of constellation of these specific components that sort of, you know, ladder up to providing, you know, whatever data quality is. Why is this something that it almost seems like, okay, we learned all of this in software, great. We have a set of best practices. And then now it seems like, and maybe this is just perception, but now it's kind of like, oh, dang it.
Starting point is 00:09:06 We forgot that when we started working with data, even though, you know, we knew it was really, really important. So Ben, why do you think that it's, you know, why do you think, you know, there's sort of a long period where it wasn't necessarily a first class citizen, even in some of the core pieces of data infrastructure that came out? Why do you think that is? Yeah, I've actually done a fair amount of wondering about that myself, what actually caused it, meaning the entire world, the ecosystem, the economics behind it. Why was that? I think there's a combination of factors. One obvious one being the technology innovation and just like storage, compute, all the, you know, the Moore's law happening in technology, so that you've got a lot of ability to test data that you did not have, you know, a decade ago. It just moves so fast. So that's one factor. And then, I don't know if this is true, but sometimes
Starting point is 00:10:06 I wonder if it was just kind of missed, because we started tackling what we knew about, which is code, right? Let's look at testing code, let's look at infrastructure, to figure out how we make our code work. I think I did this too. I remember when I first started coding, I assumed every time I wrote a bit of code that it would work perfectly, you know, after I wrote it: I'm just going to hit enter and then run it and it's going to work. So we assumed, well, if we fix that code, we're going to be fine. And then you start to realize, with all the data we have now, that it just doesn't work without testing the data. So I'll let others chat. Those are some of my ideas.
Starting point is 00:10:47 Super helpful. All right, Manu. I mean, it's an interesting question. Maybe I'll challenge the premise of the question first a little bit, right? I mean, how old software is, right? We've been talking about it since, what, 1950s, 60s, maybe, right? But when did we see Splunk happen?
Starting point is 00:11:03 And when did we see Datadog become a universal? There was a huge gap between those two movements, right? Great point. Software used to be wild, wild west too for a very long time. And then you started to see some open source libraries half done, kind of trying to get productized and then people sending them across and people still kind of moving logs on FTP and whatnot, right? Back in the day.
Starting point is 00:11:24 So it took some time for best practices to emerge. I guess the question then becomes, now that that has happened with software, can't we just borrow that into data? I would love to. I think we all would. But the fact is that data is just so different than software, right? And we have seen some false starts
Starting point is 00:11:40 when trying to carry over ideas, not taking favorites here, but data versioning, for example, is not nearly as powerful as software versioning. We're still trying to figure out what that actually means, what kind of value it creates. Or it took some time for the data build tool (dbt) to show up
Starting point is 00:11:58 and took a very different form than software build tools, right? And so it doesn't really carry over. And I've thought a lot about why that's the case. Why is there kind of a drift between the two? It turns out software is a design-time testable entity, right? The person producing software is a human being. But data, at least the way we talk about it now, is a runtime entity
Starting point is 00:12:23 which is coming in autonomously. You don't have human intervention in most cases, right? So it's just a fundamentally different object to be building or testing or monitoring or putting quality controls on, right? There's no such thing as, let me test my data today, because tomorrow it's going to look different. It's a dynamic thing. It's a runtime entity, right? So that's where all those ideas need to be redone. And that's why we end up going back to a clean slate. Super helpful points. Kevin.
Starting point is 00:12:57 I agree with Ben that we're kind of riding on, you know, big daddy Moore's law a little bit, with, you know, compute and storage costs going down so that there's not really a trade-off between analytical workloads and quality workloads anymore. I mean, people have been testing their data for a very long time, right? OG data and ETL developers have always had SQL scripts running against their databases. Now what's possible is that you can run those SQL scripts without taking down prod as well. I'd say that the demand side is also shifting a little bit, right? Where some of the first BI tools came before Edgar Codd wrote his paper on relational databases. Like BI is very old, maybe one of the first applications of software,
Starting point is 00:13:41 but now it goes beyond decision support, right? We're seeing reverse ETL. I don't know what the kids call it these days. We have first ETL into operational workflows. You're seeing more automations powered by data, like machine learning, right? Now the stakes are higher. So even though everything old is new again, it's new, but with a vengeance and like the degree is such that the stakes of
Starting point is 00:14:07 data going down is much higher. And on the tools that we use to address that, both technologically and conceptually, I agree with Manu that we have to be critical about what we borrow, right? We can't borrow wholesale because there are many differences: data being a dynamic entity, but also the first-class objects, like lineage. What is the lineage in software, right? You have traces, and that's important, and it fills a somewhat similar role, but it's not quite the same thing. And as a result, even the, you know, heuristics, like treat your software like cattle. We cannot treat our data like cattle.
Starting point is 00:14:46 They're very precious thoroughbred racehorses that we have to coddle a little bit. But that's a whole other thing. Yeah, love it. I want to dig in a little bit to what Manu said as well about testing and data being dynamic. I think this is why data observability as a term fits so well. Observability wasn't really a term of its own until, I mean, I'd say 15-ish years ago, maybe, when the notion of monitoring your systems and your applications through externally measurable properties, in order to understand their internal behaviors, took hold. The Datadogs and New Relics and Dynatraces of the world have made that possible for software systems, where if I have an application running on a server, I can guess at its health by monitoring things like latency
Starting point is 00:15:59 and response times and cpu utilization and there's a lot of externalities that are going to affect these measurements. In the same way that software has externalities like, oh, well, our switch went down. So now we're routing through a different switch and your latency is going to go up just because of that. Or something else is running on the system, the CPU, there's memory contention, and that's going to affect.
Starting point is 00:16:21 Going back to Manu's point, there's externalities in the data. The data changes all the time. And so the best thing we can do is just monitor across those properties and say, this is what we can observe about the data, and it is behaving differently than we expect it to behave. So that may or may not be a problem with the data itself, but you need to be aware of that. And I think that's why data observability has been such a correct term for the act of looking at data and making sure that it is right and working as expected. And data quality then is saying like, okay, well, when you are doing that data observability, are we actually, is it
Starting point is 00:17:08 satisfying all of the rules that we expect? But yes, we need to understand that data will change. And going back to the software comparison, software didn't really pick this up until very recently. Well, very recently, air quotes here for everyone listening on audio is because 15 years ago is obviously not recent at all, but also in this, in the sense of like software has been around for 50, 60 years. Like that is a fairly recent development in the software. Thank you so much, because actually you almost gave an answer to my question, to be honest, which is about the difference between the terms of like observability and data quality. And actually, when you guys did like your introductions,
Starting point is 00:17:51 and hopefully I'm not wrong, two of you mentioned that you are building data quality products, and two of you said data observability products. Okay. So I'd love to like get a little bit deeper into that and understand like what are the reasons for each one of you to choose one or the other and try like to add a little bit more context that will be helpful primarily for me, because as you all know, I'm a very selfish person and I want to learn primarily myself, but also like for our audience. So let's start with you, Igor, because, okay, you pretty much like gave a response to that, applied to that already, but if you can add like a
Starting point is 00:18:33 little bit more context, I think it would be great. Yeah. And I'm going to throw another term at you just to make this even more complicated than it already is. Please do. We at BigEye have been thinking about data reliability and data reliability engineering as, going back to that software comparison, the data equivalent of SRE and site reliability, where SRE
Starting point is 00:18:59 introduces best practices for maintaining software systems and making sure that they are reliable and up and usable. Data reliability is the same set of tools and practices applied to the data space. Now, zooming into that, you need both operational best practices, and this is a very human problem and a process problem that is trickier to solve than tooling. But you also need the tools in order to support those processes. And data observability is a tool in the toolbox that helps you generate the signals to understand what state your data is in.
Starting point is 00:19:35 So then you can enact the processes in order to go and repair it or change it or modify it or update your assumptions about it in order to make it more reliable. And so the reason I say BigEye is a data observability platform is because we are solving that piece of the toolbox right now. We are solving for how can you most efficiently monitor the state of your data so you can start making better decisions about what to do and start creating the sorts of processes that you need to have around those signals in order to make your data reliable. And the interesting part to me is we are starting to see data reliability engineer show up as a job title. And that is very, very exciting to me because that encompasses that thinking of, I am using tools and setting up the processes
Starting point is 00:20:30 for my organization to have the trust and understand that the data is as reliable as possible. So hopefully that answers the question a little bit more, but also there's a whole nother wrench in it. Oh, yeah. And I'm pretty sure that everyone is going to add even more to the context on the show.
Starting point is 00:20:48 Kevin, you're next. That's got to be good, to see that job posting, or to see that job title. Well, I'm a little bit conflicted when I agree so much with every other panelist on the show. And I assure the audience that we have not talked beforehand. So if we all agree with each other, it either means that we're all right or we're all wrong, and I hope it's the former. But, you know, how Igor described data
Starting point is 00:21:10 observability of trying to collect as many externally observable properties of your data system as possible. I mean, that's a hundred percent correct. I would just, you know, bucket those properties into like metrics, metadata, lineage, and logs as four ways to organize it. And I always return back to our customers, which are data teams, they spend all day providing data to their customers, right? Providing data to the sales team to make better decisions, understand the impact of their work, and prioritize their work. But what is the data for the data team, right? How do data teams know which tables are being used, which ones should be deprecated, what models I should create? And data observability
Starting point is 00:21:57 is kind of creating the data for the data team. And metadata is that data. Now we're getting very meta for a second. But so we have this technology, which is data observability. And data quality is one really important problem that it solves. But it does solve other problems, right? Spend management is a major issue for data teams. Prioritizing work or refactoring a model and knowing what the downstream impact might be. Data engineering as a job, and I'm sure we'll talk about the roles later, is a bucket of a whole bunch of disparate jobs to be done.
Starting point is 00:22:30 And data observability is like one technology that kind of like layers on top of all of that. Not that any one is more important than the other, but it's not a direct one-to-one mapping, data quality and data observability. Mm-hmm. Manu, you're next. Yeah, so I see it a bit differently.
Starting point is 00:22:48 You know, I think a lot of it, by the way, is just basically irrational and non-technical. It is because, you know, when you say data quality, the first thing you probably think of is a 20-year-old tool that no one wants to be called today. And you invent this new term called data observability just because you want to stand out, right? But let's leave aside all that. Let's go into the actual semantics here.
Starting point is 00:23:08 I think all of these terms are actually great terms. And they're describing different aspects of the same overall goal you're trying to accomplish. If you just kind of go back to the language a little bit. What are we trying to accomplish? We want good data quality. We want reliable data. We want to operate it well. So that's your overall objective.
Starting point is 00:23:32 And what do you need to do that? You need to make sure that if data breaks, you catch it. So you need monitoring on some observable property of your data. And before you can monitor something, what do you need? You need observability, right? You need those externally visible signals that are telling you how your data looks like. So these all work together.
Starting point is 00:23:54 They're different layers in the journey, right? So the objective is good data or data quality or data reliability. The starting point is to have observables and add observability. And then in between somewhere there, you have monitoring, right? Just having observables is not enough if you're not monitoring them, right? You need to make sure if something changes, you will catch it. And then you have management of it. So all of these terms are actually accurate. It depends on whether you want
Starting point is 00:24:21 to talk more about building those observable signals, or you want to talk more about the end result, which is data quality. I think it's just different interpretations of the same objective. That makes total sense. And Ben? Yeah, thank you so much. Good stuff here. And I have a hard time disagreeing significantly with anyone. I would say for Great Expectations, we kind of approached it
Starting point is 00:24:45 from a different angle, which was, well, you know, you've got the ocean to boil and let's start with putting one test on, you know, one data asset and figure out how to prevent certain problems with understanding whether or not this data is what you expected in a specific way. And building those tests, you can't really get to scale without going out and doing what, you know, Kevin and Igor are talking about with data observability, having a system and software that allows you to observe the results of those, the metrics, the logs and all that. But we approached it from testing as the entry point to collaboration with people across teams. And so it's just kind of, we're coming at it from a different angle, but I think we all know sort of the same problem set is what we're trying to solve. For us, it's more about,
Starting point is 00:25:39 well, how can this team collaborate with another team around data and remove some of the friction involved in data workflows that almost always is a very diverse set of workflows and diverse set of people trying to collaborate around data to make the entire thing work. And so you've got to put in place almost like contracts where you can see, you know, I expect it to be like this when it comes to this point where our team takes over and make sure that you can test that and validate it and then have good documentation around it. And that's an entry point into the whole, you know, the whole world of observability and just data collaboration in general.
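One way to make the contract idea concrete: the expectations themselves are the agreement a downstream team publishes for a handoff point, and validating them is how you check that the agreement still holds. Below is a minimal sketch using Great Expectations' classic pandas-style API; the file, column names, and thresholds are invented for illustration, and newer releases organize the same expectations into suites and checkpoints rather than calling them directly on a dataframe.

```python
import great_expectations as ge

# The dataset the upstream team drops at the handoff point, e.g. a staging export.
df = ge.read_csv("orders_staging.csv")

# The downstream team's contract: what must hold for this data to be usable.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_unique("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=100_000)
df.expect_table_row_count_to_be_between(min_value=1_000, max_value=1_000_000)

# The validation result doubles as documentation of what each team expects,
# and a failure becomes a conversation between teams instead of a surprise.
results = df.validate()
print(results)
```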
Starting point is 00:26:18 Mm-hmm. So you mentioned something very interesting, Ben. You said that you are focusing on the collaboration with other teams. And that's a very good trigger for my next question, which is, who should care about quality in the organization? Who are, let's say, the people that are the next data reliability engineers, right? When this becomes a thing, which it is becoming, from what it seems. So let's start with you, Ben. Like who are the people who should be like
Starting point is 00:26:49 really, really interested in this and actually like use, what are the users of like products like? Yeah, should we just all agree to start with the board of directors, the CEO, the C-suite, all the managers, everybody in the board of directors? Maybe I won't get too much disagreement there, but specifically our products, we've aimed at first the users
Starting point is 00:27:14 who are oftentimes data engineers building data pipelines. And then sometimes this second level of people who run into data quality issues are not necessarily coders or data engineers and are more like subject matter experts, people who are analyzing the data and sort of run into the problems. And these kinds of conversations are really helpful to enlighten, you know, the entire ecosystem of business people who should be concerned about data quality, because when you have problems, it touches all of them. One of the reasons that I really love working with products that have to do with data is that you cannot focus on only one persona. Like, yeah, you have the main user of your product that might be the data engineer, let's say, for example. But then you also have data analysts who are going to be working with the data. And even the output of the data analyst is probably going to be consumed by, like, marketing managers or someone else. So you have like this chain of stakeholders where one consumes, let's say, the output of the other. But when you're building products like this, and especially when it comes to data quality, you really have to consider pretty much all of them. And I have a question about that, but that's for later, because now I want to ask Manu about his opinion on that. I see. I mean, this is a very interesting and a very important topic.
Starting point is 00:28:53 I think what Ben is saying makes a lot of sense, which is that it should come all the way from the top. And we are actually seeing signs of that. So that's actually to the credit of the executives who do realize the importance of data. Maybe it's because data is coming to prime time now and it's driving really high value use cases now. And if data breaks, everyone knows, you know, inventory is not getting stocked the right way. Or the CFO is looking at, hey, why is this sales number suddenly low? And then everyone suddenly cares about it, right? I think Igor hit upon this initially, right?
Starting point is 00:29:25 We would love for the data reliability engineer persona to emerge. We're not actually seeing that yet. I think there's still a little bit of a debate, even when we talk to customers, where they will actually sometimes ask us, who should be looking at this? Who should be tasked with maintaining data quality? And we see this tension right now where the stakeholders who have the most to gain or lose are so-called SMEs, or the business stakeholders who are consumers of data. And they're the ones, unfortunately, who are also the ones who detect those issues for the first time.
Starting point is 00:30:00 Right. But the people who I strongly believe should own data quality are the producers of data. So let's say the data engineering team who are moving in data, storing it, and then producing finished data that others can consume. And we are starting to see that happen, but I think it's also
Starting point is 00:30:17 going through the same kind of tension that you saw with DevOps and the SRE role finally getting split out, where it was an orphan before. And you know, when you talk about it in software, software engineers were the best equipped to deal with software quality issues, but they were the ones that were not really interested in operating software and dealing with those issues. Right. So then we had to create this, and Google came out with the SRE handbook in 2016,
Starting point is 00:30:44 2017 and said, this is its own discipline and this is a first-class role. And it requires a really skilled person to be running software and keeping Google's page load time under a second, right? And I think the same transition is now happening in data. So who should own it? I think it should not be the consumers of data. Between producers and a different persona, I think that's still an open question. Kevin? I agree that it's an open question. And also that data teams, as much as we love them, they do not produce or consume data, right? Like it's the go-to-market teams, the product teams, engineering teams that are like, it's a human putting in a
Starting point is 00:31:22 number or it's a machine popping out a number. And at the end of the pipeline, it's someone reading that number or taking action on it. And as much as we talk about Snowflake as the source of truth, like it is, Snowflake does not ship with data. It is a box into which you put your truth. And as a result, we've seen both from, you know, the data team being solely responsible for data quality and the organization being responsible for data quality. I've seen everywhere along the spectrum be successful. But I think it requires being realistic about behavior change. Because if they go to someone like a sales rep who's putting in the wrong number, right? Why do doctors have bad handwriting?
Starting point is 00:32:05 It's because they don't have to read their own handwriting. Pharmacists do. And if you don't suffer the consequences of your own actions, it's very hard to change that unless someone rules with an iron fist, which could work. So I'm just noting that the data teams that I've seen have most success with up-leveling the state of data quality through an organization, gets everyone looped in.
Starting point is 00:32:28 No, they say, this is the goal, this is how we reach it, but I'm going to need you to all be on my side. And to help you be on my side, it's going to hurt a little bit. Igor? Yeah, I'm piecing a couple of things together that have been said earlier. I want to go back. Eric made this comment, I think, at the very beginning, which is like data quality is pretty objective,
Starting point is 00:32:51 but sometimes it's subjective. I'm actually going to argue with that. I'm going to say it's always subjective. And the reason that data quality is always subjective is because the only people who can define what is expected about the data are the end consumers of the data. And so this is where I actually agree with Kevin, and I want to push back a little bit on, and get a little more out of, your statement, Manu, that the data team,
Starting point is 00:33:19 the data producers need to be the ultimate owners of the data quality. And the reason for this is data producers often don't know what the data actually means and what it's being used for. Like I was a data engineer back at Uber. I had a bunch of pipelines and the pipelines pushed things around and they transformed it. And I talked to a PM and I talked to a data scientist and they told me what it should look like at the end. And I made that happen. But what was that data being used for? What were their expectations about that data? Unless that was communicated to me by the business itself, by the data consumers themselves, I would never know or be able to encode that. And so to Kevin's point, I think it's important to get everyone on the same page, but what's even more important is allowing the business stakeholders to define what data quality means for them.
Starting point is 00:34:10 What are their expectations of the data and providing the sorts of systems and tools that let them encode those expectations into the rest of the processes that are being run by the data producers. So going back to my statement of data reliability engineers, data reliability engineers are really the people who support the tools and enforce the processes and create these processes in the organization to have this sort of cohesion. Going back to Kevin's point, it's going to be a little painful. Somebody is going to show up with a giant document and say, if you want to build a dashboard, you have to say what the expectations of that dashboard are going to be. And you now have an extra four hours of work. Nobody likes that. But that is the only way that you will get real, valuable expectations from your data that can then be enforced through the tools that the
Starting point is 00:35:02 data reliability engineer team or the producers team can start enforcing. Igor, I don't know what's happening today, but like you almost always answer the next question that I have. My daughter. Yeah, yeah, yeah. So, okay. My God, I think it's an interesting segue here to dig deeper into what we have been discussing.
Starting point is 00:35:26 It is very interesting because we're starting to now get into why we are talking about data quality and data observability and how these two things relate to each other, right? Why is this still a conversation, right? Absolutely. And I think, let me just, if you allow me, I'll just challenge Igor a little bit on this
Starting point is 00:35:43 and say, yes, everything is subjective, but at that zoom level, but if you're going to do one more zoom in, right, I think this starts to factorize into two different components. One is subjective, one need not be. It appears so, but it need not be, right? So when you think of how to measure data and what it needs to be, these are two different things, right? So how you measure it is an indicator. We like to call it data quality indicator, just like you think of a KPI. So how you measure your business, that's a KPI. How you measure your data quality, that's a DQI.
Starting point is 00:36:17 In the language we use, data quality indicators, right? How you measure the performance of a website, you think of page load time as a metric, right? So the metric definition here can actually be objective. And that's very technical. That's actually coming from engineers for the most part. Now, what this needs to be for the business to be successful, that's subjective. And that usually comes from the business stakeholder, right? So the criteria, the rule on a metric, yes, I think that's very subjective. People will have different interpretations of
Starting point is 00:36:47 what that metric needs to be. I think we have an opportunity here to standardize it. I know, you know, I would love to hear what Ben has to say to that, because Great Expectations has been going around talking about creating a standard for it. And I think that can be done and we'll see more and more of that happen over time. We have seen that definitely happen with IT observability, where we talk about CPU and memory and disk, the most basic metrics that everyone will always want to measure.
Starting point is 00:37:13 Do you want to run your CPU super hot at 80% or do you want to run it super safe at 40%? Well, that's a business choice, right? I mean, that depends on how you want to operate, but there's no two ways about it: you should be looking at CPU utilization as a metric. Can't argue with wanting
Starting point is 00:37:29 to create a standard for some of those. Yeah. I was agreeing with Igor and now I'm agreeing with Manu. So yeah, really hard to disagree with you guys.
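To put Manu's DQI-versus-rule split in code: the indicator is computed the same way everywhere, while the threshold on top of it is the business's call, exactly like choosing to alarm on CPU at 80% versus 40%. A small hypothetical sketch, with the table, column, and 1% tolerance all made up:

```python
def null_rate(conn, table: str, column: str) -> float:
    """The DQI: fraction of rows where the column is NULL. Objective, and
    measured the same way for any table in any warehouse."""
    cur = conn.cursor()
    # COUNT(column) counts only non-NULL values; COUNT(*) counts all rows.
    cur.execute(f"SELECT COUNT(*) - COUNT({column}), COUNT(*) FROM {table}")
    nulls, total = cur.fetchone()
    return nulls / total if total else 0.0

# The rule: what the indicator needs to be. This part belongs to the business.
MAX_NULL_RATE = 0.01  # e.g. marketing tolerates 1% missing email addresses

def email_check_passes(conn) -> bool:
    return null_rate(conn, "analytics.customers", "email") <= MAX_NULL_RATE
```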
Starting point is 00:37:37 You're too smart. I have something here before you go on, Igor. Like what I really love as a person who is like, today I work mainly in products and I have tried to build businesses in the past, is that you are touching one of the most, let's say, interesting problems,
Starting point is 00:37:58 which is how the world is perceived from the lenses of an engineer and how it is perceived by the lenses of an actual user. So you have this subjectivity versus objectivity, exactly like what engineering managers and product managers have to fight every day. When they try to define, okay, what are we going to build next? So I love that.
Starting point is 00:38:27 And I think it's one of the most interesting challenges that you guys have to solve with your products. Because at the end, you have all these different, let's say, people that are involved. And yeah, you have the data engineer. The data engineer needs something very concrete that is going to be measured, right? But then how do you communicate the output of what is measured and observed there to
Starting point is 00:38:48 the marketing manager who the only thing that that person cares about is like how much I can trust or I cannot trust the data, right? Like even the language is different. And that's my actually like also like the question that I would follow up, but you started like answering there is like, how, how do you think guys that this can actually happen? Because it sounds like from a product design and management perspective, like a huge, huge challenge. And Igor, please go on. Like I interrupted you. Yeah, no, I will. I can, I can weave in both answers into one. So I agree that the signals can be objective in software
Starting point is 00:39:33 because infrastructure and software all behaves the same way. They all consume CPU and memory and they have endpoints and those endpoints have latencies and they're hit a certain number of times. And these are non-negotiable facts about software. The problem is there are very, very, very few non-negotiable facts about data. And in my mind, the things that I've usually been able to enumerate at this point are the table needs to be loading on some cadence. The table needs to be loading some number of records. Those two signals are non-negotiable. And probably, actually, that's about it. I don't even think nulls are non-negotiable
Starting point is 00:40:15 because some fields can be null, some can't. There's still negotiation there. So there's only really two signals in data that are actually objective. Everything else is subjective, because do I care if this column is ever null? Maybe I don't. Maybe there's like some extras field that somebody's dumping in here and may or may not choose to populate in the log. Do I even want to start measuring that? Does that matter to me? Is that field, to Costas's point, ever being consumed downstream? And that is where you get into that subjectivity of, does observing it even matter? And to follow up on that, the business stakeholders
Starting point is 00:41:00 care about the data when it's actually being used in a data product. When I say data product, I mean something like a dashboard or an ML model that's generating output, something that they are then consuming. I think the best way to do this is to surface that information as close to the consumer as possible. So we're talking into the BI tools, into their query editors, into their data catalogs, where they're actually starting to interact with the data. Now, this is where subjectivity comes back. What matters? How do you determine that a dashboard is no longer fit for use? And the only way to do that is to take the person who has built the dashboard and say, here is what matters about this dashboard.
Starting point is 00:41:43 Here are properties that need to be held true. And that is always subjective on a business by business basis and even a dashboard by dashboard basis. And so that's why I still stand firm on data quality is subjective, but hopefully Kostas, I've also answered your question with that. Yep. Yep. Yep. You did. Ben, what do you think about that? Yeah, I definitely think when we talk about data quality, again, it goes back to our definition of it's being fit for the purpose
Starting point is 00:42:14 that you intend it for. And also, you don't want to waste a lot. There's a cost to testing data. There's a cost to the software tools in both effort to manage and also the technology as well. testing data, there's a cost to the software tools in both effort to manage and, you know, also the technology as well. But you don't want to test everything. You don't want to observe everything because that doesn't help you. You want to observe the important things and you want to
Starting point is 00:42:37 test the important things. And so I agree, you kind of have to start with the end in mind. However, when I go back to our customers, I think very objectively, I can always say, well, a good place to test is from the application to its first landing place, first staging area. Are you dropping it in an S3 bucket and you want to know, did it get from the application to that S3 bucket in the same form or did something get missed there? So testing on ingestion into the data warehouse. And then you've always got this, like, is the data quality or is the data as you expect through the transforms? And is it as you expect right before it gets pumped out into either an AI
Starting point is 00:43:26 model or some sort of BI tool? And so those things are usually objectively true with the customers that we see. Usually there's problems that happen in there. And so we see there's some objective rules that we want to test. And you mentioned a number of rows, Igor, and very few things that are objective. That's an interesting idea. And I think you're right that you have to be subjectively pulling in what business users want to be able to decide what other metrics you're testing. But having standards about how to test that just seems like it'll create a lot of efficiency. And so that's kind of the angle we're going after. And I would just say that one thing, when I think about observability and how this conversation wraps together,
Starting point is 00:44:16 I do not, I want to push back on the idea that you can have the source of truth be in a data warehouse. Like it really has to encompass a much broader set of infrastructure. And so you want to be able to execute data quality tests and just observability outside the data warehouse in order to get a complete picture. And that's really important in our framing for our products. Ben, I'd love to dig into that a little bit. So let's talk about what is the jurisdiction of these various components. We don't need to get into the components, right? But the storage layer seems to be a really logical starting point, right? Because you're trying to make Snowflake your source of truth. Great. I mean, that's good. And that actually in many ways can sort of expedite some of these data quality issues because you sort of have comprehensive way to detect certain things, et cetera. But what is the jurisdiction? And I'd love, I mean, feel free to take the question wherever
Starting point is 00:45:20 you want. I'm interested in kind of the philosophical aspect of that, right? Because you can reach into, I mean, so Igor said, you can actually reach into the BI tools that people are using, right? So what is the jurisdiction? I'd love to know the way that you think about that. Yeah, well, I think it comes back to the business drivers. And we've mentioned, can't remember if it was you, Kevin, but mentioning, well, sure, a go-to-market team uses data. If you've got your data in Salesforce or some other CRM, by the way, totally separate topic, but there's new ways of selling and sales teams want to use data. And there's all this innovation happening around that
Starting point is 00:46:03 with product-led growth. And you think about the data that's used there, that's going to drive where you want to test it. So if you're really working in Salesforce, for example, you're going to want to have some tests around the Salesforce integration with whatever product analytics you're doing. And you're going to want to be seeing that the data is as you expect so that your sales teams are operating on infrastructure that is producing the stuff that they want to use every day. And you don't want to have that just show up when two weeks later, your sales team has been very not efficient and not able to manage their processes because the data is bad.
Starting point is 00:46:46 So that's what drives it. And so we talk about wanting to test it at the source. And then there's no jurisdiction here besides a pipeline, I think. And that usually crosses a wide variety of infrastructure. Yeah. Fascinating. Okay. So Kevin, you said you described Metaplane as plug and play. So what's your take on jurisdiction? It's a tough question, right? Like I think even in software observability, the jury is still out on, do you want to test the symptoms or do you want to test the causes? There are arguments for and against both. You could say if we are monitoring data within Snowflake that, okay, if something goes wrong, that it's too late. The problem has already occurred. Or you could say, this is, returning to the previous topic, what matters to the person using the data and therefore we should test it. I think there's been a trend towards monitoring the symptoms first. One, because that's much more aligned with what the users perceive. And two, because it helps you, you know, prioritize what kinds of causes to debug upstream.
Starting point is 00:47:57 But the jury is still out. I think there's both like two ways to do it. And this is another case, Eric, of everything old is new again, right? When we're talking about focusing on the outcomes of data, wow, people have been talking about this in the academic literature for 30 years, right? Of extrinsic versus intrinsic data quality dimensions of, do you want to enforce referential integrity or go talk to your user? And testing the symptoms versus causes is yet another thing where I think they came to a conclusion 30 years ago and we're trying to
Starting point is 00:48:29 rederive it for the modern data ecosystem. Got it. All right, Igor, what say you? I think there's a difference in jurisdiction of what is monitored versus where the information is presented. So going back to what Ben said about you have a process that takes data, puts it into S3 that needs to actually land there. It needs to be a non-zero byte file, whatever other properties it needs to have. I think when I talk about data observability, I'm thinking about looking at the data contents itself. I feel like there is a whole other dimension to data quality, which is, are the processes that are generating my data running correctly? And I feel like there needs
Starting point is 00:49:21 to be a little bit more disambiguation in the term. I don't know, maybe it's data pipeline monitoring, process monitoring, whatever we want to call that. But there's a whole nother sphere of monitoring, which is actually much closer to application monitoring, which can be much more objective, such as did my pipeline run? Yes or no? How long did it take to run? How much memory to consume? All of these properties about the process itself that you can start monitoring and creating signal on. So I think that is still under the jurisdiction of data quality, and it impacts the state
Starting point is 00:50:01 of the data. But I actually think it's a slightly different problem. In terms of the BI tool thing that I mentioned, I think that is surfacing the information that is coming out of these systems. And I do not want to have jurisdiction over the BI tool. That is not in my best interest, not in BigEye's best interest at all. But we need to surface the signals that we're collecting and the information that we know about the state of the data closer to where the consumers are looking at it. And if that means pushing that into BI tools and pushing that into data querying tools, that's what it has to be. Great. All right, Manu, we'd love for you to have the last word briefly on jurisdiction, and then we can wrap up with a really good question from one of the listeners. I need to bring into context
Starting point is 00:50:54 what Damien just said on the chat here, which is very relatable, actually. That used to be my experience when I was building data pipelines. And we see this time and again where it eventually lands on the data engineer building the pipeline. So as much as I think the producers of data should own data quality, it should not mean that that becomes a bottleneck and all data quality is now becoming that one person or one team's responsibility. But we see that happen just way too many times, right?
Starting point is 00:51:21 Where you go to data engineers and they're frustrated that instead of building pipelines, they're just chasing data quality issues. And that used to be me. And then I'm going around trying to understand the business context and saying, hey, you know, it's not really my job to do, or I'm not the expert on, what data quality should even look like or how to interpret this data. So I think it goes back to, who should own data quality?
Starting point is 00:51:46 I think that role needs to be carved out. And the more we are talking to customers, the more we are seeing a separation happen where they're creating an offshoot team out of data engineering and starting to call that a data quality or a data governance team. So these are people who are engineers by trade, who understand how to operate data
Starting point is 00:52:07 and are now starting to ramp up on interpretation of data and have an operational mindset and enjoy doing that, right? Now, this is more of a platform team though, right? They're not the only ones responsible for every single system out there because the source of truth actually is not Snowflake. If transactions don't reconcile in Snowflake, no one gets fired, right? Because you will go to Oracle DB
Starting point is 00:52:31 and if that doesn't reconcile, there's a problem. But if that's working fine, your transaction process is fine, right? So now you can't hold this one person responsible for data quality in Snowflake and Oracle and Kafka and data models being shipped out of dbt and whatnot, right?
Starting point is 00:52:49 So they're an enabler. They need to create an easy medium for different stakeholders to come in and specify their own data quality tests, which could hit source systems, which could hit analytics systems, and anything in between, right? So that's where I see this evolution going.
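One hedged sketch of the "easy medium" Manu describes: a declarative list of checks, each owned by a stakeholder and pointed at whichever system it needs to hit, with a thin runner that evaluates them. The systems, queries, and thresholds below are hypothetical, and real platforms layer scheduling, alerting, and anomaly detection on top of something like this.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Check:
    name: str
    owner: str                        # the stakeholder who defined it and gets notified
    system: str                       # e.g. "snowflake", "oracle"
    sql: str                          # query that computes the indicator
    passes: Callable[[float], bool]   # the rule applied to the indicator

CHECKS: List[Check] = [
    Check("orders_freshness_hours", "data-eng", "snowflake",
          "SELECT DATEDIFF('hour', MAX(loaded_at), CURRENT_TIMESTAMP) FROM analytics.orders",
          lambda hours: hours <= 1),
    Check("ledger_reconciles", "finance", "oracle",
          "SELECT ABS(SUM(debit) - SUM(credit)) FROM gl.transactions",
          lambda diff: diff == 0),
]

def run_checks(connections: Dict[str, object]) -> None:
    """Run every check against its own system and report pass/fail per owner."""
    for check in CHECKS:
        cur = connections[check.system].cursor()
        cur.execute(check.sql)
        value = cur.fetchone()[0]
        status = "OK" if check.passes(value) else "FAIL"
        print(f"[{status}] {check.name} (owner: {check.owner}) -> {value}")
```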
Starting point is 00:53:06 Yeah, love it. And we're right at the buzzer here, but I'm going to read this question because you answered it, Manu, which is great, but I'd love to hear from the other panelists. I'll just read this really quickly to give the listeners context. So Damian said,
Starting point is 00:53:18 Hi, everyone, I'm a data engineer working for a startup located in Paris, and I'm the only data engineer so far. I think a lot of us can relate to that. Regarding responsibility for data quality, I really strive not to become the single point of knowledge and responsible for all pipelines simply because it's not possible to know everything about the business. Instead, I really believe in what could be called data ops and about providing the tools and infrastructure to empower developers to be more conscious about data quality and what they are producing. But now I wonder, do you think this is a good approach? And if so, do you have any advice
Starting point is 00:53:54 on how to onboard people onto these topics? So Manu, a great answer. And let's see, we'll go Ben, Kevin, and then Igor can close us out. Yeah, great question. And it is near and dear to many of our hearts, I think. And I would just punch up a couple of other things. I think data quality tools are not going to answer the entire spectrum of this problem. And it is super important to be integrated with a stack that allows you to do some of this.
Starting point is 00:54:48 And we just, I really believe strongly that should be mediated with software. Like software is well-suited to do this. And so that's what I think a lot of us here on this call are trying to build to make that easier. So great question. Yeah. Thanks, Ben. Kevin? I agree that no one tool will solve all of your problems. And I would start by,
Starting point is 00:55:08 I agree that no one tool will solve all of your problems. And I would start by, you know, speaking to your audience, right? Speak in a language that they understand, appealing to what they're interested in. If you go to an engineer and say, you know, you shipped an event name change that caused a fan-out with, you know, downstream data assets, eyes glaze over. If you go to your, you know, your VP of sales and say, you know, there's data quality issues, we have to invest in data integrity, yeah, eyes glaze over. You know, talk about, you know, we want you to ship this change without us having to come back to you one month later like yelling at you, and we want your reps to put in data in a way that makes it so that your dashboards can be up and that you can be confident in it. And then once you speak the language of the rest of the org, then take it to the next level. Then
Starting point is 00:55:49 talk about contracts, talk about what they expect from the data and everything that we just talked about. It's GG at that point. Love it. All right, Igor. I think two pieces of advice from me. One is start small and figure out the biggest pain point in the business right now. To Kevin's point, speak the language of the business, figure out what they are struggling the most with and or what you are struggling the most with in order to support the business and try to solve that problem and build the tooling and process around those areas. If it's data quality, then that's what it is. If it's data discovery and nobody knows where any dashboard lives or how to find them, maybe that's the first place you need to look and solve the problem for.
Starting point is 00:56:33 The second, I think, to answer the original question is, is this a good idea? Yes, totally a good idea. The best way to onboard leadership into this, in my mind, would be to explain to them how this is going to help you and your team scale. You're a single data engineer and you're a small startup. You need to be efficient and you probably are playing a lot of roles in the organization. And by showing how DataOps can help you scale through tools and processes, you can say, look, I don't have to spend an hour a day doing this data quality check
Starting point is 00:57:07 because if somebody else can help me, can go and define what they expect and get the notifications, you will just funnel the most important things back to me. And that is going to resonate a lot with your manager, with your leadership team,
Starting point is 00:57:20 because they're going to say, great, you want to make yourself more efficient. We are all for that because now if we have to hire only one other data engineer rather than two more, because you have the tools and processes in place to become that efficient, I think that's going to be a very easy sell in your organization. Love it. Well, this has been such a helpful show. I learned so much about data quality and all of the other components that surround it. So Ben, Manu, Kevin, Igor, thank you so much for giving us some time and joining us on the Data Stack Show live. And thank you to all the listeners with all the great questions. Thank you. It was a pleasure being here.
Starting point is 00:57:59 Thanks for having us. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
