The Data Stack Show - 67: Now is the Time to Think About Data Quality with Manu Bansal of Lightup Data

Episode Date: December 22, 2021

Highlights from this week's conversation include:
- Manu's career background and describing Lightup (2:31)
- Why traditional tools don't work for modern data problems (6:04)
- How a data lake differs from a data warehouse (11:35)
- Defining data quality (14:07)
- The business impact of solving and applying data quality (31:36)
- Constructing a healthy financial view on the impact of data (41:09)
- How to work with unstructured data in a meaningful way (47:44)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Today, we're talking with Manu from LightUp.ai, and we're going to talk about data quality. And as a marketer, when I think about data quality
Starting point is 00:00:39 and sort of how to understand it, there are so many variables. And in marketing, one thing we talk about a lot is seasonality, which a lot of marketers use as an excuse. You're laughing because... Yeah, I think like marketers are the best people to find excuses why they don't read bad data. But I'm interested to know, when you think about data quality,
how do you control for things like that? Because that's a really challenging problem, especially when I think about a tool like LightUp that's SaaS that's trying to do that. So that's what I'm going to pick his brain about. How about you? Well, first of all, we have to say that it's been a long time since the last time that we were together in the same place.
Starting point is 00:01:23 That's true. We're in San Francisco together. Yeah. So this is a very special. It's a very special episode. Yeah, yeah. But yeah, when it comes to quality, to be honest, I think I have more fundamental questions.
Like, what is quality? What do we mean by that? We see many vendors coming into this space, and each one has their own definition of what quality is and how we should implement it. Everyone is using some kind of metaphor with SRE, DevOps, infrastructure monitoring, and Datadog. We are the Datadog of data, blah, blah, and all that stuff. Sure.
But that's like a very bold claim, right? So I would like to investigate and see how close the tooling we have to offer for data right now is to the tooling that we have for SREs. And yeah, that's pretty much what I think I'll be chatting with Manu about. Well, we'll find out. Let's dig in. Manu, welcome to the Data Stack Show. We're really excited to chat about all things data quality. I'm glad to be here. It's exciting.
So give us a little bit of your background, kind of give us an overview of your career. How did you get into working with data? And then tell us about what you do at LightUp and what LightUp is. Yeah, let's dive right into it. It's a long history we're talking about here. So I come from a very technical background, computer scientist by training, did a lot of signal processing for my PhD at Stanford, which was an interesting detour, and was building software for wireless systems and embedded systems and dealing with a lot of data. So we kind of talk about data now in a very different setting. That was processing 20 million events per second at microsecond latencies at one point, right? And then we spun out that research into a company called Uhana that VMware acquired in 2019, where we built a predictive analytics pipeline for telcos, right? So this is like AT&Ts and Verizons of the world. It was a very, very interesting experience.
Starting point is 00:03:25 We're processing 3 million events per second through a distributed system coming in over Kafka, going into Apache Haddon, which is like Flink, and then dumping it into InfluxDB, a time series database, right? And then serving it out to whoever cares to make use of that data. And all of that was happening at sub-second latency.
Starting point is 00:03:44 Very interesting scale, very interesting richness of data that we were dealing with in the telco space, which you don't normally hear a lot about. And one of the hardest problems that we had to deal with, which we didn't have a good solution to at the time, was just keeping the data healthy. It's like on any given day, the data would just change on us without any notice. And before we know it, we are producing junk on the output side, predicting, hey, Eric is going to get
Starting point is 00:04:11 a gigabit per second on his iPhone. And the customer is like, this is ridiculous, right? And it turns out maybe sometimes it was our fault. At times, it actually wasn't our issue at all. We were just getting fed garbage data into the pipeline. And then the system was producing garbage out. And we could keep our services up and running. We could monitor our application endpoints, but we didn't have a way to detect those kinds of data faults or data outages, so to speak.
Starting point is 00:04:41 And I said, okay, there has to be a better way to build and monitor data pipelines than just relying on customer telling us your system is faulty right now. So I said, okay, this problem needs solving. We were doing that for telcos, but as we now know, the whole world has become data-driven, right? It's FinTech, it's consumer tech,
Starting point is 00:05:04 it's hospitality, you name the vertical, right? And you guys are at the forefront of it in many ways, right? So you're seeing this firsthand, obviously. And now that we are starting to rely on data so heavily, we need a way to make sure the data we are building off of is actually worth trusting, right? And so we said, okay, this is a problem that needs solving. The old tools don't work in the new data stack. And we had ideas because we had seen this problem firsthand and that's how LightUp was born. And that's what you're solving today.
Very cool. And let's dig in just a little bit there. So the old tools don't work in the new data stack. Could you give us just a couple of specific examples of the tools that were insufficient for the job, in the context of dealing with that type of data at that scale in the telco? Yeah, I mean, if you look at it, data quality is a problem that's at least two decades old, right? If you look at the space, Gartner runs a Magic Quadrant on it, for example, right? Informatica has been talking about that since 2005, actually maybe even before that. Talend has had a product since the early 2000s. Before we even talked about big data, we used to talk about data quality, and then the whole Hadoop ecosystem happened, and the rest is kind of history at this point. So the traditional tools are designed in a setting
Starting point is 00:06:23 where maybe you had a spreadsheet worth of data. You got a data dump from a third party that feeds you consumer phone numbers or names, for example. And now you want to put it to use in your marketing campaign. So now you had a data steward who would have days to look at that data, make sure it's all good, fit for use, right? And then publish it into whoever was the internal stakeholder, right? Or if you are distributing data, you will do the same process in that setting, right? That's what the old tools were designed for, kind of built for small data volumes, usually static data, right? Where you have a human in the loop who can stare at length, right?
Starting point is 00:07:04 And kind of facilitate that interactive process of manual judgment on data quality, right? What we are now talking about, the kind of pipeline I described where you have Kafka bringing in million events per second or even per day, right? Feeding into a Spark system down the stream. You're talking about maybe minutes of end-to-end delay or even less,
Starting point is 00:07:25 right? So that's a setting in which now you have still the same problem, which is, is my data healthy? And if it's going to go trigger an action at the end of the pipeline or populate a dashboard for an exec or show up as a result for my end user, I need to make sure that the data is healthy. But that's a setting in which the old tools simply don't work, right? I mean, for a variety of reasons. It's scale, it's real-time nature, it's just the data cardinality you're dealing with, right? It's not just one spreadsheet you're talking about, thousands of tables, hundreds of columns each, right? So that's the setting that we are designing for now. That's the big change we have seen.
Manu, one question before we move forward with talking more about data quality. You mentioned a few technologies and a few architectures, but you also mentioned the term "the modern data stack" a lot. Can you give us a definition of what this data stack we are talking about is, in which quality becomes relevant? That's actually a very interesting question, not just from the point of view of what is the stack that we're designing for, but also why now is a good time to think about data quality, right?
Starting point is 00:08:34 For anyone building the so-called modern data stack, right? So kind of let me just take a segue back into how big data has evolved, right? So it started out with, let's say, the Hadoop ecosystem, right? Very file-based, hourly batches maybe was the best you could do, but it was mostly like once a day kind of data processing of large, big batches, right? Then we saw Spark happen, and that was kind of happening, building on Hadoop and made it in memory. And then we saw Kafka happen, right?
Starting point is 00:09:08 And we used to talk about ETL stacks at the time, right? So extracting data, transforming it either through disk or in memory using Spark, and then putting it to use by loading it into a data warehouse or database usually, right? Because the data volume would be quite compacted. We would have almost produced finished metrics by the time it would be published out of the big data stack. And that was a very hard stack to work with. It wasn't clear. If you wanted to even monitor it,
where would your monitoring tool integrate? Because data was all in flight, all in memory. Either you have Kafka or you have Spark, where it's like, what's my story for, what's a canonical stack I can rely on for which I can now build monitoring support? And it wasn't clear. Why we are now finally articulating a modern data stack, and why it's catching on, is because we've now gone to an ELT architecture. So the data warehouses since then have caught up. So back still in, I guess, the early part of the 2010s, we didn't have scalable data stores. And then Redshift was starting
Starting point is 00:10:13 to get born. And then you saw Snowflake happen, and then BigQuery. And these basically started to pull us back to the convenient architecture, which is easy to reason about, easy to work with, right? Which is you have a central store of data. That's, let's say, a snowflake or Databricks. And now you have all the raw data landing into that one place and getting persisted through stages of transformation all the way to finished metrics, right? So to me, that's the crux of modern data stack.
Starting point is 00:10:44 We can debate what are the right tools, what's the right level of aggregation, should one house do all of it, or should it be 20 different components working together? That's less important. I think it's more the idea that you have the central store. Yeah. And this central store you are talking about, are you thinking more in terms of like a data warehouse architecture or a data lake architecture work? And I'm asking that because we started with Hadoop that was like processing over a file system and we end up talking about file systems, right? Today, data lakes.
Starting point is 00:11:15 So which one of the two you find as like, let's say the most easy to work with or like the most important one? So I think there's two things here, you know, one is kind of a logical choice. So to me, a data warehouse is actually a logical structure, right? You're declaring a certain data store to be a data warehouse, right? But like I could take Snowflake for example, and call it a data warehouse. If I wanted, I could actually call it a data lake too. Yeah. Right? So it really comes down to what I want to declare
as less prepared, more raw data dump, versus what I want to call finished data that is fit for use, right? So we have heard people use a terminology where they would call Snowflake the data lake, right? And then you could take Databricks and then you could say, yeah, that's my data lake and my data warehouse, right? And it's really just a logical partition. Or you could just entirely work off of an object store. You just dump everything into S3, but you still need some
Starting point is 00:12:13 query layer on top, right? So you could bring Spark as a query engine, or you could use Presto or Trino, right? So you have choices there. So I think it's kind of less important to decide if you're using a data lake or a data warehouse style of design, right? I think it more comes down to having a central store. And so that's one part of the story, right? The other is what is the scale of data you're dealing with, right? So where we see this distinction becoming important is when the so-called data warehouse technologies like Snowflake start to be too expensive, just from a data volume point of view, people will start to say, let's give up on some of the functionality of the query engine and structured data definition, right? And go to a more free form, less structured data destination, which I'm going to call a data
Starting point is 00:13:01 lake. But both are great. It's easier to work with a structured data set because now you have a query language available. You could just hit it with SQL. But what we are seeing, I mean, the lake house pattern, for example, is giving you the same facility at the lake scale. I think that's where the world is going to go. So that distinction is just going to keep shrinking. And really it just comes down to disaggregation
where you have a store, which is scalable, and then you have a query engine on top, which can serve out that data through a well-understood query language. All right. And let's go back to quality, right? I think we should define it, first of all. Yeah, I was thinking, you've mentioned data quality several times, and it's one of those terms where I know it when I see it. If I see bad data quality, I know it, but I don't really think about it unless I see it, which means that there's generally a problem. So I'd love to know: what is your definition of data quality? I'm happy that you know bad data when you see it, because that's a great starting point. So you're right. It's kind of an elusive thing to define, honestly, right? What helps is to think about the symptoms that you're going to see
if you were running into bad data, right? And especially the symptoms that you don't have monitoring support for today, right? So what would bad data issues be? How would they end up affecting data consumers, right? Which would actually totally go unnoticed right now, right? So to me, that's the big umbrella way of defining data quality issues.
Starting point is 00:14:47 Anything that creates this ridiculous output from your data-driven product, right? Whether it's an internal consumer, an external consumer, and went unnoticed, right? So how could it present, for example, maybe let's start there, right? So let's say you have a food delivery entity.
Starting point is 00:15:06 Now your orders are not making it to the person who ordered the food because you have an issue with the data describing their address or the phone number. It could be as simple as that. Now your product is entirely failing or you are, let's say, Uber and rides are not showing up on time and customers are complaining that the ETA estimates are all off, right? It turns out it's because you are making an error loading up data from the mapping service. So you don't have the right traffic information coming in, right?
Starting point is 00:15:37 It could be credit scores getting mispredicted because the data you are pulling as a bank or as a credit scoring entity from the credit bureau is malformed. So now you cannot correctly predict the credit score, right? Or ticket prices are getting in a factor of 100, $5,000 tickets are getting sold for $50, right? So you have pricing errors, right? So data quality issues show up in a variety of forms depending on the nature of the business and often to an extent where there's direct top line impact to the business. Those are the settings we're discussing. There are kind of equally harmful, but let's say more face saving issues where you have bad data showing up on a CFO's dashboard.
The CFO just says, look, sales volume is looking too low today and I can't explain this, right? Turns out you're dropping transactions, so you're not counting your sales correctly. It could be issues of that nature too. So to me, data quality issues are any of those issues where data is not what you expected it to be, and that issue went unnoticed, right? Unnoticed by your IT monitoring tools, unnoticed by your APM tools, right? Anything that you have to monitor infrastructure is not able to catch it, right? It's kind of like those hidden data outages, if you will, which could be dropped events,
Starting point is 00:17:00 which could be data getting delayed, which could be schemas being wrong or values just being plain wrong. You're reporting cents instead of dollars and now everything is half-assed, right? One question for you though, and this is such an interesting challenge because let's go to, so those, you kind of defined two broad categories, which I think is really helpful. One is the end customer who's using an app or a website or a service, something goes wrong there, right? Which is sort of the worst type of problem because you're getting feedback from the people who are sort of the lifeblood of your business telling you that something's really wrong, right?
Starting point is 00:17:38 So like it's too late at that point. Let's go to the example of someone in the C-suite looking at a dashboard for sales. I think one interesting challenge in that example is that the data engineer or analyst who is managing the pipelines to deliver those reports, sometimes it may go unnoticed because they don't have context for what thresholds, right? Like seasonality, we hired a bunch of new salespeople. I mean, there's like a lot of factors there, which I think complicate that because they may notice it in terms of, oh, well, this number looks lower than it did last week, but sales forecasts fluctuate, right?
Okay. Can you speak to that a little bit? Because there's also this organizational side of it, where the people dealing with the data may see the actual data and a deviation from whatever is expected, but they don't have the context to interpret that necessarily. That's a great question, actually. And I've been drawing comparisons to other monitoring tools that we understand better now, right? Like Datadog or New Relic. But there's a big difference between the two. Monitoring data
from a data quality point of view versus monitoring a more standardized asset like a virtual machine, right? Which is, when you talk about monitoring IT and you're talking about monitoring a container or a VM, the majority of metrics are well understood and they're standardized. It doesn't matter if it's an AWS VM or an Azure VM, right? You're monitoring CPU and memory and disk, right? Sure. When it comes to monitoring data though,
Starting point is 00:19:20 that's no longer true. And that's why it's such a unique, in many ways, a more challenging problem, which is that you're dealing with the customer's data model here. And that data model is actually specific to the organization that you're trying to monitor data quality in. I mean, even I'd argue that even Lyft and Uber are not going to have the same exact data model internally, right? Sure. And that's why we are seeing all the data technologies basically enabling definition of data models or data quality monitors, as opposed to prescribing a specific set of prebuilt definitions, right? So that's what we need to do.
That's the kind of grand challenge, if you will. But we're seeing success stories all around right now. I mean, dbt, for example, is really structuring how you could produce your data model day over day, minute over minute if you wanted, where the model definition is still coming from you, the data engineering team or the analyst team, right? Or Rudder, for example: you guys are doing data collection, but you will let the end user define what data should be collected and what the event structure should look like, right? And what the tool is providing you is an easy way of encoding your own definitions so that the system can take over and put it into production and do it at scale continuously.
It's the same with data quality. And I think that's the key to unlocking the design of the solution, which is you need to design it in a way that makes it easy for someone to go describe a data quality monitor they want to instantiate, without enforcing or limiting them in any way, so that it can work on any data model. But at the same time, taking care of everything else that needs to happen after you have put that definition in place: continuously evaluating that data quality rule, continuously doing it at scale, doing all the incident management workflow around it, right? And so on. Manu, I have a question.
Starting point is 00:21:23 You mentioned the traditional infrastructure monitoring, right? And so on. Manu, I have a question. You mentioned the traditional infrastructure monitoring, right? There are many attempts there and the industry there has matured enough to start discussing and introducing some standardization, right? At least on how we communicate, the metrics that we get
and all these things. Do you think that we will reach something similar with data? And the thing that comes to my mind, actually, is that I find what Great Expectations is doing very interesting, right? Where you have the expectations there, and people can contribute new expectations, blah, blah, blah, all these things. And the reason I'm asking about that is because, from what I understand from the conversation the two of you already had, the problem with data is that semantics matter, right? The semantics around the data that you are consuming may completely change what it means to have bad or good data, right? Yeah. So somehow we need to get the definition of the semantics into the loop, and I think the schema itself is still at a syntactic level, right? There's more information there that we don't get. So how can we overcome that, and what do you see happening in the future? And to add on to that, I think one of the other interesting things is, you used the example of a CPU, right? And so CPU to CPU, there may be some differences, right? Manufactured by a different manufacturer or whatever, but the core vitals are the same. Not only is that different from company to company, but I was thinking about, when you were talking about definitions, every SaaS company ever
has had years of conversations about the definition of a lead, the definition of an MQL, the definition of an SQL, sales accepted lead. It's like anyone who's worked inside of a SaaS company knows that stuff's really hard. But what's interesting is the schemas change from company to company, but within a company, the schema changes over time, right? We change the MQL definition and that has a direct impact on what you would consider quality and sort of the downstream context. Yeah, absolutely. I think this is a really important question in terms of how do you scale keeping data health under check? How do you scale that? And this is regardless of how we solve the problem, how anyone solves the problem.
The thing is, this is something that we need to solve as an industry. I mean, if we're going to scale our data-driven operations, we need to answer this question. Because if the answer is that there is no common factor between one data quality rule and the next, it's game over. I mean, then we are just always going to be praying, and that's the best solution you can come to. We need better than that, right? And I thought a lot about this problem even before we started Lightup, right? Because we wanted to be sure that we're building something that can actually apply generically, regardless of what vertical you are in or what team within the organization you are in, right? I mean, if you're going to get stuck in the definition of SQL and MQL,
Starting point is 00:24:28 it's not going to scale, right? So kind of go back to my time at Uhana, the company I built before this, where we were building the predictive analytics pipeline for telcos, right? And I look back at the process we used to follow in just doing our own ad hoc manual data debugging, right? And the thing that strikes me about that time is that the process was actually pretty much the same playbook, issue after issue. Even though I had the domain knowledge and I understood the context, or my colleague did that, our basic process was, something is wrong with the data. That's
Starting point is 00:25:05 something I can smell or someone told me about. I need to debug if it's really an issue, right? And the first thing I would do is I would say, give me the data as it stands today, but now let's pull data from a week ago when I know things were running fine, right? And the first thing I would do is compare these two data sets, right? And then look for significant differences, which would explain or confirm that yes, data is indeed looking very different today. Then I would start to reason about in what ways is it different, or is it really a problem, or what might have caused this difference, depending on what's the shape of the difference. You're seeing too many events.
Starting point is 00:25:45 Okay, maybe there's duplications happening somewhere in my pipeline. I know what module I touched in the data pipeline that could have introduced duplicates. Or I'm seeing records getting dropped. Okay, I exactly know I have a bad filter somewhere, right? So then it will start to tell me how to go about fixing the problem. But the process was actually fairly generic, which is I was comparing data on hand with data from a time I knew things were good.
Starting point is 00:26:11 And that's something we now call anomaly detection, right? Anomaly detection is kind of this bastardized term where you think of false positives as the very first thing. But to me, anomaly detection is more kind of a recipe or an algorithm, right? It's not a solution by itself, right? So it's like, that's the principle you want to apply where you can compare data from today with data in the past, right? And then it starts to be a repeatable process.
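To make that recipe concrete, here is a minimal Python sketch of the week-over-week comparison, assuming a pandas DataFrame of events; the profile metrics and the 25% drift tolerance are illustrative assumptions, not a description of Lightup's implementation.

```python
# A hedged sketch of "compare today's data against a known-good slice from a
# week ago". The profile metrics and tolerance are illustrative assumptions.
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Collect simple health metrics for one slice of data."""
    return {
        "row_count": float(len(df)),
        "null_rate": float(df.isna().mean().mean()),       # overall fraction of nulls
        "duplicate_rate": float(df.duplicated().mean()),   # fraction of exact duplicate rows
    }

def compare_to_baseline(today: pd.DataFrame, baseline: pd.DataFrame,
                        tolerance: float = 0.25) -> dict:
    """Flag any metric that drifts more than `tolerance` from the baseline slice."""
    t, b = profile(today), profile(baseline)
    anomalies = {}
    for metric, baseline_value in b.items():
        drift = abs(t[metric] - baseline_value) / max(baseline_value, 1e-9)
        if drift > tolerance:
            anomalies[metric] = {"baseline": baseline_value, "today": t[metric]}
    return anomalies
```

The point is the shape of the playbook: profile the slice on hand, profile a slice from a time you trust, and only then reason about what kind of difference you are looking at.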
Starting point is 00:26:36 Then I don't need to know what the semantics of this data asset exactly are, how it relates to business, right? So that's what's actually working really well in the field for us. So one quick question there, and this is really tactical, and I'm coming at this as a consumer of reports and as someone who probably on a daily, weekly, monthly basis looks at the data from this week and compares it to the last couple of weeks, just to sort of do like all of us who work in business, like do our own internal gut anomaly detection, which is rarely statistically significant. But how do you approach the problem of controlling for seasonality, which as a marketer,
Starting point is 00:27:20 seasonality is the excuse that you can use for any problem with data. It's like, oh, the numbers look bad. And it's like, well, it's seasonality. But if you think about the holidays, right? Like November, December and people going on vacation month over month data from sort of October to November or December to January is hard. And I also think about this in the context of, let's say you also don't have year over year data for this particular data set. How do you think about controlling for those problems when comparing time periods of the same data? Yeah. And this kind of ties back to, I think, the question Kostas was raising earlier, right? Is there a standard here that can emerge,
Starting point is 00:28:01 even if it doesn't exist today, right? The quick answer is, if you as the human being cannot affirmatively say if the data is good or not, the system probably cannot, right? So there is a boundary we need to draw between what is a confident conclusion about the health of data and what is subjective interpretation, right? And we'll always have that boundary. The question is, how much can we have as kind of the standardized tests on data, right? We want to keep shifting that boundary and start to actually think about test-driven development for data pipelines or your data assets, right? If you can't test if the data is good or not, how do you even base a business decision on it?
Starting point is 00:28:52 So the question is, like, what are the tests I can run? And those are the tests you would always want to run. That's what the data quality system should be solving. Everything else should be left to human experts to interpret, right? I believe there's a large set of tests that can always be run, right? Some of these are very generic, right? Like what is the data delay?
Starting point is 00:29:11 That doesn't really depend on seasonality, right? I mean, your data pipeline is processing with the same delay in November as it will in January. The events you are seeing coming in, the data volume, well, that depends on how much people are interacting with your service.
Starting point is 00:29:25 Yes, that is subject to seasonality. So there are different kinds of tests you can run. Some are more black and white than others. The ones that are very clear-cut, these should be standardized. And we should always insist on defining a contract around them, right? Data delay, you should always measure that. Null values, unique values, it's like things that you might even say
Starting point is 00:29:48 are extensions of data integrity constraints that we used to put in relational databases, right? And then there are all these interpretations that you're drawing around, what are the semantics of what is the content of data, right? And there is a set of tests where experts can get together,
Starting point is 00:30:06 let's say someone who understands the sales data and says, yeah, I mean, look, a sales value of $5,000 simply doesn't make sense for my consumer product, right? So that's definitely wrong. And you start to encode some of that knowledge, but these become what I would call custom tests, right? And then the system should facilitate easily encoding those data model specific tests. And then there's everything else
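As a rough illustration of that split, here is a hedged Python sketch of a few always-applicable checks next to one custom check; the column names, the one-hour freshness contract, and the $0-$500 order range are assumptions made up for the example.

```python
# Generic checks (delay, nulls, uniqueness) versus one custom, domain-specific
# check. All names and thresholds here are illustrative assumptions.
from datetime import datetime, timedelta, timezone
import pandas as pd

def generic_checks(df: pd.DataFrame, ts_col: str, key_col: str) -> dict:
    now = datetime.now(timezone.utc)
    return {
        # Data delay: how stale is the newest record? Assumes tz-aware timestamps.
        "fresh_within_1h": (now - df[ts_col].max()) < timedelta(hours=1),
        # Null values: no column should be mostly empty.
        "null_rate_below_5pct": bool((df.isna().mean() < 0.05).all()),
        # Unique values: a primary-key style integrity constraint.
        "key_is_unique": bool(df[key_col].is_unique),
    }

def custom_checks(orders: pd.DataFrame) -> dict:
    # Encodes domain knowledge, e.g. "a $5,000 order doesn't make sense for
    # my consumer product" -- only an expert on this data can set the range.
    return {
        "order_value_in_range": bool(orders["order_value"].between(0, 500).all()),
    }
```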
Starting point is 00:30:30 where you can't even be sure what's good and what's not, right? And we should not even try and attempt to make a conclusion on those aspects. So Manu, okay, it's very clear from the conversation so far that you are very passionate about data quality. But you're also building a product for data quality, right?
So do you want to tell us a bit more about that? How do you solve the problem of data quality with the product that you're building? Yeah, I mean, maybe even before I go there, right? Why am I so passionate about this? Why do I think it's high time that we all started to think about data quality, right? Data is the gas here, and you have no control over the quality of the gas you're putting into it, right? Before you know it, you're just going to throw a wrench into the engine and your car comes crashing down. So to me, it's actually bread and butter to just monitor the health of your data. We should all be doing it. I think the hard part here has two or three different angles. One is what we have been discussing so far, which is what are standardized tests and what is very custom to the business, right? That's definitely
one hard part. But I think there are other kinds of challenges that we are seeing people run into who have actually been solving data quality for decades now, right? I'm talking about Fortune 500 companies with very strict controls over any data that's being collected or put to use, right? I mean, they have somehow managed to find the right mix of standard tests and custom tests, right? But where we are starting to see a lot of limitations now in incumbent tools is, number one, around data volume. Data volume has just grown, what, a hundredfold in the last decade or less, right? And we are anticipating tremendous growth in the next five years, right? And now with the new modern data stack, the old tools simply don't work, right? They can't keep up with that data volume.
Starting point is 00:32:37 So that's one challenge that we are solving. Another challenge we are seeing is kind of what comes along with data volume, which is data cardinality, right? Or what you would sometimes call variety, right? It's no longer just a couple of spreadsheets and a couple of tables in a MySQL DB. You're talking about thousands of tables and potentially a total of a million different columns across those tables in Databricks or Snowflake, right? I mean, at that scale, if you're writing a test by hand for every single column separately, it's not just expensive, it's infeasible. You're never going to finish covering all your data assets, right?
Starting point is 00:33:15 And the typical number we see is 1% to 5% of all your tables and columns being actually covered by tests, right? I mean, you don't work with that kind of coverage for your software, right? You look for 95% or 99% unit test coverage on software that's in production, right? And we are running with 1% coverage for data in production, right? Why is that, right? I mean, that's kind of the biggest challenge here, which is it's just too hard right now to build out data quality tests at scale, right?
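One way to picture the coverage gap: the generic checks don't have to be written by hand at all. The sketch below walks a catalog and emits a default null-rate query for every column it finds; it uses SQLite purely as a stand-in, and the catalog queries would look different on Snowflake or Databricks.

```python
# Illustrative only: generate one default null-rate check per column instead
# of hand-writing thousands of tests. SQLite is a stand-in for a warehouse.
import sqlite3

def generate_column_checks(conn: sqlite3.Connection) -> list[str]:
    checks = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        for _, column, *_ in conn.execute(f"PRAGMA table_info({table})"):
            checks.append(
                f"SELECT 1.0 * SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END)"
                f" / COUNT(*) AS null_rate FROM {table}"
            )
    return checks
```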
Starting point is 00:33:49 So that's the second big problem we are solving. The third one that we are seeing a lot is this shift to more and more real-time stats, right? And real-time here in the data world could even be as much as a minute of latency or a couple of minutes of latency, right? That's already very real-time compared to doing nightly data processing with Hadoop, right? And the old tools were built for that environment where you were processing data once a day. Now you're talking about doing this every minute, maybe even every second, right?
So you need the notion of streaming checks, which are kind of in line with your data pipeline, if you're going to be able to do this before shit hits the fan or before bad data is actually put to use, right? You need to catch bad data before it can do the damage. And that's the third big challenge that all tools are running into, which we are solving really well. You know, I was thinking back to our conversation about looking at the sales number and seeing a sales number that's off.
Starting point is 00:34:53 that's covered in the 1%. Yeah. You know, because if the head of sales isn't happy, then no one's happy. So that's actually, that's an interesting point. There's this big, I don't know if I would call it a tussle or this tension,
Starting point is 00:35:09 but there's kind of a debate in the industry right now about who owns data quality, right? And the point you're making, Eric, right? That you're covering your sales numbers. Well, it's because the sales team really, really, really cares about this number being correct, right?
Starting point is 00:35:23 And they will find a way to do it themselves, or they'll get the data engineering team to implement some controls, right? But why aren't we covering the rest of our assets? Well, because if you look at the front end of the pipeline, which the data engineering team is responsible for, right? That's where you have the whole multitude of data assets, but the data engineering team is not yet seeing this as a responsibility, right? It's really left to the consumer of the data, which is who is an analyst
Starting point is 00:35:53 or a salesperson or marketing person, right? Or an exec for that matter, or even your customer, right? Who is the one pointing out data issues, right? But we need that shift in mindset where we say, look, you need to be producing good data in the first place. It cannot be an afterthought after you have already served me with bad data, right? And kind of a thought experiment I like to run is, suppose there was an internal data market at the organization where, say,
Starting point is 00:36:23 the analyst has to pay for the data and buy it from the data engineer producing it. What would happen then, right? How would that change the landscape of data quality? I'm very curious if someone wants to run that experiment, but it's like if I was an analyst buying data from data engineering, I would want to make sure that I understand what they're selling me, right? What's the spec? What's the contract? What's the QC on it?
Starting point is 00:36:54 I'm buying a product from you. It's a data product, right? Can you prove to me that this is worth the dollar you're charging for it? And now suddenly it will become the responsibility of the data engineers producing data to make sure that that data is worth selling and it will actually sell to the analysts internally. Right. So, I mean, short of creating that market, I think that's the shift we need. I love thinking about that in terms of economics, because I also think back to your point about testing. So I'm just trying to put myself in the shoes of the analyst who is essentially working as a middleman between a manufacturer, say, and then the person that they're delivering the final product to, right? And so my approach,
naturally, just thinking about the economics, is I have a limited amount of resources. I know I'll get more resources if I can deliver the right product. And so what I'm going to do is buy like a small piece first and deliver that and ensure that it solves a problem downstream. And looking at sort of the QA and being more meticulous about that would naturally drive sort of a testing mindset, right? Which is super interesting.
Starting point is 00:38:09 I love that analogy. That's a really great way to think about it. I just see that going so far, right? Where it'll force communication between the data consumer and the producer where a spec will have to be defined first, right? And then that'll start to bubble up. What are the data tests, right? And yeah, maybe some cannot ever be encoded into logic, right?
Starting point is 00:38:31 But most of those tests can, and they'll start to get pushed upstream and they will start to affect the data pipeline itself. They'll become part of the data production pipeline itself, right? It's like what CICD has now become. And the kind of shift or the blurring of lines between software QA
and software development, right? With all the DevOps mentality, I feel like a lot of those themes will now start to make it into the data world. Short of creating that market, I think it's still happening, because as data leaders are being pushed on the value they're creating for the organization, they have no choice but to start asking those questions, and we are seeing that happen now, right? And the best leaders are actually doing this, going around asking what are the controls in place, right? And making sure that the data engineering teams have the right tests in place and have a way of not only writing those tests, but being able to maintain them over time so that they don't fall out of step with the evolution of the data pipeline itself, right? So you're just going to see more and more of that happen. You know, it's interesting to think about economics, which I don't know a lot about, but when you put money into the equation, what's interesting, and I would love your thoughts on this one, is that part of the challenge is that data, in terms of quantity, to your point, is not a scarce resource. And when you introduce money into the equation, what you're actually doing is shaping the way that the manufacturers implement their process, or even if you wanted to boil it down to raw units,
Starting point is 00:40:21 would be how they use their time even. Right. And it's really interesting to think about because volume is like, it doesn't feel scarce. Right. And most people actually probably feel like they have too much data. Like I have so much data, I can't analyze all of it. Right. So yeah, that's, that's really interesting. So how do you think, I guess my question is, what are some ways that you've seen companies outside of actually making an analyst buy data? Maybe we should get like, I need to go get Monopoly money. Maybe we should do that. And we can try this experiment where...
I would like to issue tokens or NFTs. Oh, that's right. Yes. We could do it on blockchain. Outside of using Monopoly money, what are some ways that you've seen companies do a good job of creating a healthy contract that starts to drive that testing mentality? Interesting. It's starting to take us in the direction of the team structures of the best functioning data teams, right? Where data is most trustworthy and is actually really returning ROI, right? And we see this pattern where a young company will just have, let's say, one team doing it all, right? They're the producers of data, they're the consumers for the most part, they're the ones analyzing it and whatnot, right? Over time then, as the team grows and the pipeline matures, you start to see some split
happening between the data engineering skill set, the people who can store and process data at scale, and then people who know how to extract meaning out of it, who would be the analysts, right? And then the third stage that we see is this kind of birth of a data quality or a data governance team, usually out of the data engineering team, right? So the organizations that are thinking about it the most will now start to separate out the function, which would kind of be this unaddressed or implicit function within the data engineering team, where it's best effort, pretty much ad hoc, with no well-defined contracts being passed from the consumers of data to the producers of data. But then they realize that's the effect they're seeing, which is why data keeps breaking, right? And so then what they realize is they need to create some separation
Starting point is 00:42:52 between data engineers and data quality people. And then they would start to create this data quality analyst group or data quality engineering group, right? Who works closely with both sides, but now starts to become the bridge between what is the definition of good data and how to test for that definition, right? And at the same time, like they are not interested in writing data engineering pipelines, right? So
they start to be, now, kind of like the SRE team, and what that did to software engineering and software production or operations: they started to be the specialist group who would understand enough about how to think about software behavior in production. Page load time, for example: okay, Google is supposed to load in under a second, otherwise we start to lose users. But at the same time, they'll have a sense of why the page could be slowing down, right? And how to even measure the page load time. They could be analyzing the metrics and whatnot, right?
So I think that's the kind of maturity model that we are seeing. We're starting to see a data quality team or data governance team or a data reliability team, if you will. And you start to see more and more of that: someone who is bridging the gap between these two worlds and enforcing a certain contract between the two teams. Manu, we keep talking about the SRE and the DevOps paradigm. One of the things that is quite important when you're monitoring infrastructure is also doing some kind of root cause analysis. You need to figure out, okay, we have a problem. Now how do we solve it? Now, data infrastructure is quite a complex thing. Also, a piece of data goes through a lot of, let's say, transformations, and it's not something immutable from the moment that you capture it until you go and consume it. How can we do that with data? How can we try and figure out what's going wrong and fix it? Yeah. I used to be a networking person. I tend to go back to what I understood in networks. Network debugging is a hard problem. And in many ways, I see the data pipeline being a topology of sorts, right? We call it lineage sometimes, or we think of it as a DAG maybe, right? But data can crisscross at multiple places. So from where it starts to where it finishes, it may not actually even look like a tree anymore,
Starting point is 00:45:23 right? So it's a very generic, you know, topology to which data is flowing. And, you know, how do you monitor a system like that, right? And how do you trace back an error on one end of it to a source of the error on the other end, right? I mean, look, truth be told, it's not an easy problem to solve, right? I don't think we know how to do that very effectively right now. And when we don't know how to do it, we end up relying on experts, right? So I think the
Starting point is 00:45:53 short-term answer to this question is we need to facilitate presentation of data quality information from all parts of the data pipeline so that an expert, right? Or a group of experts for that matter, right? Can look at that information and then come to a conclusion on the root cause analysis, right? So that's what the tool can do here, which is to create this single pane of glass, a single source of truth
Starting point is 00:46:19 where the data engineer, the data quality engineer, data quality analyst, maybe the data analyst or analytics engineer, maybe even the business stakeholder who is eventually consuming the data can all pull their context, be able to root cause what's on hand, right? And then be able to say things like, okay, this can be explained by seasonality. This is not a data pipeline problem. This cannot be explained by seasonality or the definition of sales number we have, right? So let's go further back. And now the data engineer starts to contribute their insight into it, right? So I think it's going to be a collaborative process, definitely in the short
Starting point is 00:47:00 term, maybe forever. And the way to approach it, in my opinion, is to facilitate that collaboration between all the different stakeholders at different layers of the stack. Yeah. One last question for me, and then Eric can ask his own questions. We talked a lot about structured data. We talked about schema.
We talked about measuring things and all that stuff. But in a modern data stack, you don't only have structured data, right? So what happens with the unstructured data? What happens when we are working with binary formats, with free text, with images, with labels, more towards what we usually hear about with ML and that kind of pipeline, because that's also a data pipeline at the end, right? So does what we've talked about so far apply there, or do we need a different approach for quality there? It perfectly applies.
What happens today is we just don't think about it as much, right? I mean, we just ignore it and hope things will sort themselves out. Ignore the problem and it'll go away. But actually, that's one of the very interesting problem statements that we are hearing from the customers we're talking to. They're asking us exactly the same question that you just asked me, right? What happens to my other data, right? And the good news is that the leading data teams are starting to realize that that data matters, if not more; it matters at least as much as your structured data. But in many ways,
Starting point is 00:48:41 that's actually even a more important place to monitor the health of data, because you want to shift the problem left or shift detection left as much as you can, right? We all understand that if you don't collect good data in the first place, you're not going to be able to produce good data on the tail end of the pipeline, right? That's clear. But if you discover it at the tail end, and you're trying to now fix the issue, you have to go back all the way and clean it out from the entirety of your pipeline for days. And that's a very expensive operation. You're so much better off by just detecting the problem before this bad data percolated downstream. So it's more proactive, it's more economical, and it cuts your costs a lot, not just in terms of productivity,
Starting point is 00:49:28 but also in terms of the repercussions of bad data, right? And in many ways, the root cause problem also gets solved because now you're directly monitoring the source of that problem, right? So you know that that is the root cause instead of having to trace back from tail end of the pipeline all the way to the source of the problem, right? So that needs to happen. In some ways, it's actually more challenging because data is less structured. So how do you even start to analyze it, right? There are tests
Starting point is 00:49:55 that you can run, right? You have Kafka event stream coming in, let's say, and you could just track the delay that the events are coming in at or the volume that they're showing up with, right? Or the schema that the event has. And some of those tests, for example, are part of Rudder or Segment, right? And even Kafka on the cloud side now, right? So I think we're going to see more and more of that, but that's something that we are also solving.
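Those in-stream checks can be sketched with a plain Kafka consumer. The topic name, payload fields, thresholds, and the assumption that event_time is in epoch seconds are all illustrative; this is not how Rudder, Segment, or Lightup implement it, just the shape of the idea.

```python
# A hedged sketch of streaming checks on a Kafka topic: event delay, schema
# drift, and per-minute volume. Names and thresholds are assumptions.
import json
import time
from kafka import KafkaConsumer  # pip install kafka-python

EXPECTED_FIELDS = {"event_id", "user_id", "event_time", "event_type"}

consumer = KafkaConsumer(
    "events",                               # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

window_start, window_count = time.time(), 0
for message in consumer:
    event = message.value
    window_count += 1

    # Delay check: how long after it happened did the event arrive?
    delay_s = time.time() - event["event_time"]
    if delay_s > 300:
        print(f"late event: {delay_s:.0f}s behind")

    # Schema check: flag missing or unexpected top-level fields.
    drift = EXPECTED_FIELDS.symmetric_difference(event.keys())
    if drift:
        print(f"schema drift: {drift}")

    # Volume check: compare each one-minute window against a floor
    # (only evaluated while messages are flowing, which is fine for a sketch).
    if time.time() - window_start >= 60:
        if window_count < 1000:
            print(f"low volume: {window_count} events in the last minute")
        window_start, window_count = time.time(), 0
```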
Starting point is 00:50:18 Eventually you want a single pane of glass to be measuring the data health, right? Right from ingestion to the object store to finally a data warehouse, right? So we need to do more of that. Absolutely. Well, we're close to time, but Manu, one last question for you. And I'm just thinking about our listeners who know they may have some data quality issues
That's a great question. When I think about what would be the first thing I would recommend someone doing, I feel like when we hire new recruits, or let's say when we were young programmers ourselves starting out in our careers, right? We would have the tendency to just write software first, then think about testing, right? And as you start to become more and more senior,
Starting point is 00:51:29 you flip it and you say, let me first stub out my tests, even if I don't implement them yet. But at least let me write this because it brings the spec out, right? I mean, it tells you what your module is supposed to do. Anything you can test is what it needs to do. Anything you cannot test is actually immaterial. That's functionality you should never be implementing because you don't
Starting point is 00:51:49 even have a way of proving it, right? And that starts to open up the design space so you can find an efficient and an effective solution. Now, I would say the same thing to data engineers who are listening to this, right? First, think about how will you prove to your consumer that you are giving them the data they ask for, right? And go ask them, right? What properties do they expect to be true of this data asset? Whether it's a table, whether it's S3 dumps, or just Kafka events you're collecting, right?
What are some constraints I can apply on it? What are some invariants you would like to see in this data set which are always going to be true? It doesn't depend on how you are going to use this data; it's just innate to the data I'm producing. If there's one recommendation, it would be just that. Think of test-driven data development, start there, and then very quickly you will start to look for ways to implement and encode your tests, right? So the rest of it will sort itself out.
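To picture that recommendation, here is a hedged sketch of "stub the tests first" using pytest; the orders table, its columns, and the freshness contract are hypothetical stand-ins for whatever invariants your consumer actually asks for.

```python
# Test-driven data development, sketched: each stub is a contract agreed on
# with the data's consumer before any pipeline code is written. Names are
# hypothetical.
import pandas as pd
import pytest

@pytest.fixture
def orders() -> pd.DataFrame:
    # Load the asset under test; the path is an assumption for illustration.
    return pd.read_parquet("warehouse/orders.parquet")

def test_order_id_is_unique(orders):
    assert orders["order_id"].is_unique

def test_no_null_customer_ids(orders):
    assert orders["customer_id"].notna().all()

def test_amounts_are_positive_dollars(orders):
    # Guards against the cents-versus-dollars class of error discussed earlier.
    assert (orders["amount_usd"] > 0).all()

def test_freshness():
    # Stubbed contract: data is never more than one hour behind.
    pytest.skip("not implemented yet; the stub itself documents the spec")
```

Even the skipped stub is useful: it writes the spec down, which is the part the producer and the consumer have to agree on.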
Sure. Well, that's great advice. And we have had such a good conversation, Manu. Thanks for joining us. One last question quickly: if people want to learn more about LightUp, where should they go? You can start at lightup.ai, a very simple URL, or just get in touch with me or anyone at the company. We'd be very happy to strike up a conversation and give you an unbiased opinion on what you could be doing for data testing. If you wanted to try LightUp, we could very quickly set you up. We can deploy as a SaaS service. We can also deploy in your own environment. Oh, interesting. Right?
So we can bring our software to where data already lives, which makes it very easy to get this going without any compliance or regulatory issues. So there's a full spectrum of ways in which we can work with you. Just find us at lightup.ai. Cool. Well, thanks, Manu. Thank you for the advice and the thoughts, and thanks for joining us on the show. Thanks for having me. It was a pleasure
Starting point is 00:53:50 talking to you both, Eric and Costas. My takeaway, which is not going to surprise you, is that I love the idea of running an economic experiment inside of an organization where you have to use monopoly money to buy data. Like I want to- Come on, it has to be true money. Like can you play poker with fake money? I mean, I guess technically you can, but you're right. It's not as fun.
Starting point is 00:54:15 Exactly. But this is an experiment I want to run. Like maybe we could get a Harvard economist to help us sort of design an economic experiment on this. Yeah, it's very interesting. I think when we start like talking, I think that the whole point of that is that we are starting in a very concrete way
to talk in terms of value, right? And we are taught, like, whatever theories we have, like our objection with products, technology, blah, blah, blah, whatever. At the end, what's the value and for whom, right? Okay, data: obviously, in most cases our customers are internal, right? I am generating or moving the data around because marketing wants that, so my customer is marketing. Yeah. But at the end, this relationship and the quality of the product and the experience that our consumers are going to have is not that different from customers that are paying us for that, right? So, yeah, I think it makes total sense. It's a very good experiment and we need to figure out a way to do it. Yeah. All right. Well, if you want to volunteer, we can help facilitate running this economic experiment in your organization. Lots of great shows coming up. Thanks for joining us again, and make sure to subscribe if you haven't already so you get notified of the next episode, and we will catch you then. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
Starting point is 00:55:48 That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
