The Data Stack Show - 57: Improving Data Quality Using Data Product SLAs with Egor Gryaznov of Bigeye

Episode Date: October 13, 2021

Highlights from this week's conversation include:

- Egor's software engineering background and history with Uber (2:19)
- Experimentation platforms and analytics definitions (7:49)
- Bigeye's function and use cases (9:40)
- Managing the relationship between the data engineer maintaining the pipelines and the downstream teams providing the context (18:49)
- Pinpointing problems in data compared to problems in software (21:55)
- Defining data quality at Bigeye (24:13)
- Machine learning models as a data product (28:38)
- Determining SLAs (32:22)
- How Bigeye brings different parties together and addresses natural communication barriers (36:42)
- Looking at when an organization needs to implement data quality tooling (45:54)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back. Really excited to talk to our guest Igor today.
Starting point is 00:00:30 He founded a company called Big Guy and they are in the data quality space. And it's a really interesting topic. I think my burning question, Costas, is really what does data quality look like in an organization? And when do those problems start to become really acute? We've talked about scale a lot. The startup that's two people in a garage are just querying their Postgres database, and they don't even really have a sense of what their data is going to look like. And then at scale at a company like Uber, where Igor spent time working building data products, it's a completely different game. So I'm interested in his perspective on when the problems become acute,
Starting point is 00:01:15 when do you need tooling around data quality? How about you? Yeah, I think I would start by trying together with Igor to define what exactly is data quality, or at least like give some kind of better definition. It's one of these terms that's like together with some other stuff that they go under the broader umbrella of products related to data governance that we talk a lot about them. We use the terms a lot, like quality is something that it's very easy for anyone to have an
Starting point is 00:01:44 opinion on quality. But I don't think that we really have a very clear definition of what data quality is. And I'd love to try and make this much more clear today with Igor. And I'm sure that we will have more stuff to chat with him. Absolutely. We always do. Well, let's dive in. Let's do it. Igor, welcome to the show. Super excited to have you him. Absolutely. We always do. Well, let's dive in. Let's do it. Igor, welcome to the show. Super excited to have you with us today. Thanks a lot for having me, Eric and Costas. It's great to be on the show. All right. Many exciting things to talk about, but as always, we'd love to hear about your background and then hear about BigEye.
Starting point is 00:02:29 Oh, definitely. So I'm Igor. I'm the co-founder and CTO of BigEye. My background is as a software engineer that fell into the data space. My first job, I was working on call center analytics. We were working on a new platform using Hadoop, which back in 2012 was the hot new technology. At that point in time, we were writing raw Java MapReduce jobs and just trying to process information to understand how do we even make Hadoop and MapReduce into a scalable solution. Obviously, now there's a lot better technologies out there for scaling analytics, but it was definitely an interesting introduction to the world of data. From there, I got into data warehousing. I joined a company called One King's Lane, which is an e-commerce company. My team set up the data warehousing stack
Starting point is 00:03:17 from infrastructure all the way through ETL and data modeling and visualization. So I got a taste of what does the whole space look like? How do you scale a data platform from the lowest level, which is just set up your database and get the data in there all the way through? What do the analysts use? How do we present dashboards? What tooling do we want there? An interesting part, something interesting about that experience was we were one of the first Looker users. Yeah. We actually had some of the, I think one of the co-founders come in and present to us because this was when Looker was just getting started. Yeah. Oh, wow. Okay. So what, so remind me, what, what time period is that? So this is 2013, 2014.
Starting point is 00:04:03 Okay. Yeah. Yeah. Wild. Okay, yeah. Yeah, wild. Okay, yeah. Just on the last couple of shows, we've talked about what it was like to build a stack back then versus now, right? I mean, just drastically different. And like, of course, Looker is now part of Google. And so that's amazing.
Starting point is 00:04:19 It was very different back then. Back then, everything was hand-rolled. We wrote so many Bash and Python scripts just to make everything work together. And the options back then for analytics and BI were really either Mode or Looker for the more modern ones. Or you go with something like Tableau or MicroStrategy if you have people who are using that. But we decided to try out Looker and it was interesting just to get into LookML. They had some really great ideas even back then. I remember being really excited. Yeah. Awesome. Okay. And so from there, I actually late 2014 joined Uber. Uber at this point was trying to
Starting point is 00:05:08 scale their analytics and their data team. They were doing all of their analytics on a Postgres replica. There were a few of these replicas that wasn't scaling. The company at that time was 1800 people or so. So we had the experience of building out the data platform, myself plus a couple of other folks joined and started the data warehouse team at Uber, did the same thing that we did at One Kings Lane, but at 100X the pace and scale, set up the infrastructure, all the ETL necessary, data modeling, just
Starting point is 00:05:47 corralling people into telling us how does the business look at the data and what does it mean, all the visualization stuff. And by late 2015, early 2016, once the core was there, the platform was there, I worked on a lot more specific projects. I did a stint in ad tech. I worked on the experimentation platform, which I ended up being the tech lead for, for my last two years at Uber. And then I also worked and contributed to some of the core data platform efforts. Wow. And just, this is a quick, just nerd question from a marketer who does a lot of testing. Did you,
Starting point is 00:06:32 was your, was your testing platform hand-rolled in-house? Yes. Everything, everything at Uber in general was hand-rolled in-house. I think this is a little bit of a engineering fallacy of we can do it better and we know how to build something better than what we could buy. And I think it's still prevalent today. It was definitely prevalent back then, but a lot of things that Uber was building in-house were very specific and hyper-focused to the problems that Uber was experiencing, which were generally not the same problems that every other company would experience. The pacing, the scale, the types of data that we had, it was all fairly unique compared to what else
Starting point is 00:07:18 was on the market. Yeah. I just asked because testing infrastructure is, I mean, a lot of things in the data space are non-trivial. But when you think about statistical significance and there's a lot of math that goes into it, when you get into multivariate stuff, it gets into pretty gutsy mathematics. In addition to like executing software that has a very, it's at the sharp end of like user experience. And so it's just, it just seems like a very complex to build. Yeah. And I, it's interesting that you bring up statistical significance on the experimentation platform team. The, my biggest project there was in experiment analytics tool where users would come pick the metrics that they want pick their experiment and then the tool would go and compute the metrics run all the statistical analysis necessary and then show them statistical significance on that and it was a really
Starting point is 00:08:16 interesting experience scaling that out and making it generic enough where it could be used with any experiment in any metric but at the time, still corralling everybody into a sane set of metrics that everyone needs to look at. Right. Just basic. How many trips are being taken and is your experiment negatively affecting that? Yeah. Yeah. Yeah.
Starting point is 00:08:37 That kind of goes back to like the hardest part about any analytics are actually just the definitions across the company. I could definitely get into that. I know there's actually a lot of tools nowadays that are starting to address metric definitions and consistency. You have Superset, you have Transform. They're all working to help businesses standardize their metric definitions. At Uber, same problem. It's a large business. Everyone defines metrics differently.
Starting point is 00:09:10 And I remember part of the difficulty was just getting everyone on the same page of even what is a trip? What is revenue? How do you count the fare splits and making sure that all of this is, all of this works into the definition of the metric and that every team can actually meaningfully use it.
Starting point is 00:09:29 Sure. Love it. Well, thank you for going down a little rabbit hole there. Super fascinating. But okay, so Uber and now BigEye. Now BigEye. And so BigEye is a data quality platform. We want to help people ensure with their data so that people can know ahead of time rather than being unpleasantly surprised when they open a broken dashboard. Yeah, totally.
Starting point is 00:10:12 And so could you just give us a couple of use cases, right? So companies are collecting this type of data and then there's some sort of aberration or derivation or like, could you just give us a use case maybe from a customer or just to give our audience an idea of this is what it looks like in action? Sure. So I actually bet that every single one of your listeners has experienced the data quality problem at some point in the past and probably as recent as earlier this week. So data quality problems really range from anything as simple as, well, our vendor told us that they would deliver the data by Monday at 8am and they did not. And now it's Tuesday and we don't have our data yet. So this is known as freshness or latency issues where the data just isn't being refreshed on time. This isn't
Starting point is 00:11:13 always third-party data. Sometimes this is internal data that just doesn't get updated for whatever reason. Then from there, you move on to more interesting cases of data quality, such as, let's say your business has, on average, 10,000 users logging in a day. And you can look at your logins table and you can see the fluctuation. You're like, it's right around 10,000, maybe a little bit more users during the week, less on the weekends, but it's in a reasonable range. Now, what if all of a sudden 100,000 users sign it because you're getting spammed by a bunch of bots? You would want to know about that before all that information goes into your analytics and you're presenting this dashboard to your head of growth. And it's like, look, we just 10x'd our business overnight. But that's not real data.
Starting point is 00:12:07 But you don't know that because you're not actually looking at the data before you're using it. And so there's a lot of examples of data quality problems. But it always boils down to the question of, can I trust my data? And can I actually make accurate decisions with the data that I'm using? You mentioned your journey through Uber. You worked in many different things, right? As you said, you started from pretty much building the whole infrastructure, had many projects later on.
Starting point is 00:12:39 What made you focus with big eye on quality? Why do you think quality is important or at least why you are so excited about it? I'm excited about quality because it was one of the biggest pain points that we had when we were first building out the data platform over at Uber. The number of times that somebody would message us on Slack, it was HipChat back then, but same, same. Someone would message us and say, something looks wrong with my query. I'm pulling up this dashboard. These numbers don't make sense. And that's about it.
Starting point is 00:13:18 There's zero other visibility into what they mean by something goes wrong. And this happens over and over and over again. And sometimes it's internal analysts and data scientists talking to the data engineering team, which was the case for us. Sometimes it's the executive looking at some KPI dashboard who then messages an analyst and says, this smells fishy, like something looks wrong in this graph.
Starting point is 00:13:46 Can you double check this for me? If the analyst doesn't have a place to go to and say, all of the data that is feeding this dashboard is high quality and trustworthy and is at least consistent with what we expect that data to look like, if they can't say that with high confidence, then that analyst has to go and waste all their time and probably spend half a day just digging through a bunch of tables and SQL queries, just trying to understand why does something look wrong. Now, the problem of data quality is even more acute today because it's so much easier to scale data platforms today. If you think about even 2014, 2015, when we were building out platforms, everything had to be rolled in-house. There wasn't that much tooling around.
Starting point is 00:14:44 You would buy a data warehousing solution at that point. It was either Vertica or Teradata. If you were really ahead of the curve, maybe you're already adopting left on your own. Great. I have a place to put my data. Now what do I do with it? Well, now you build out these processes to make sure you're ingesting it and modeling it and presenting it in a very controlled manner. And so everyone had eyes on the data pretty much all the time. There would be an analyst who's responsible for that specific data set, that specific dashboard, and they would know what it looks like. They would have a gut feel for what it looks like, and they could identify issues early. In today's world, if I were to build out
Starting point is 00:15:35 a data platform at any business today in 2021, it would take me a matter of days. I would go to Snowflake and I swipe a credit card and I get a data warehouse. I go to Fivetran, I swipe a credit card, I get my ETL. And I go to Looker, Remote get into my warehouse and actually start using for business decisions has grown exponentially because I can just connect as many things that I want through Fivetran. All of a sudden, I have my marketing data going here and I have all my sales data going here and I have my product data going here. And I'm just one data engineer and I can't meaningfully know what should this data look like? Is it correct? Can people be using it? And so now I'm fielding all these questions from the business saying, well, my dashboard looks wrong. And my only answer to that is, well, the pipeline's running fine.
Starting point is 00:16:39 So I have no idea what else is going on there. And so because of that, data engineering today does not scale linearly from a headcount perspective with the amount of data that is actually being scale with the business and scale with their data growth. And so that's why right now is such a good time to focus on building tools around scaling, helping data teams scale. So for example, data quality, understanding where data is coming from, all this pipeline management, DBT is another great example of this. DBT is just a very, very fast way for us to build data models in a repeatable, sane process. And so this is why you're seeing this revolution in data tooling, is because data has started to scale so much faster than it ever has before. Igor, one point you made that I think is really interesting that we haven't talked a lot about
Starting point is 00:17:45 is the context around the problem. And it's funny because earlier this week on the marketing side, we have lots of data pipelines that run and do our reporting. And there was a number that just kind of seemed a little bit off and it wasn't off enough to like be super concerning, but I was just like, that's really interesting. Is that correct? So anyways, I went through that. The context, I think, is something that's really hard to translate.
Starting point is 00:18:15 So if you think about marketing going to the data engineer, there's so much context that the marketer has around these. The campaigns we're running and these are the conversion rates that we're looking at and all that stuff that the data engineer doesn't have. How do you address that problem? And in many ways, I think it's in some ways it transcends the tooling and gets into the cultural aspects. But I would just love to know the ways that you approach that problem or have approached that in the past. And what does a successful relationship look like there between downstream teams who have context and the data engineer who's making sure the pipelines are running?
Starting point is 00:18:49 I think that's a really interesting question because there's really two sides to this problem as you surfaced. One side is really organization. From an organization perspective, you have disparate roles that don't really understand each other's domain. Marketers don't understand data pipelines. Maybe they're writing some basic SQL, but they're probably not at the level of what the data engineer is doing on a day-to-day basis. Then you have the data engineer who just says, well, I'm already overwhelmed with all of this data that I need to move over into the warehouse. I don't have time to understand every single business domain. And so I think the right answer is to make them meet in the middle and bring the two knowledge bases together into a way that they can both benefit from each other.
Starting point is 00:19:48 And so something that BigEye sets out to do is to build a tool that allows for that process to happen. We want to allow users to express their expectations around their data in a way that is understandable to the business user so the business user can bring their context over. If they say that, well, we expect our average sale price or average number of views on an ad to be some around 200, they have that information and they should be able to provide that information into a data quality monitoring tool, which we do that through a simple to understand WYSIWYG UI so that someone can just come in and say, this is exactly what I'm expressing here. But on the flip side, it needs to be scalable enough where the data engineer can say, okay, this thing is alerting me.
Starting point is 00:20:49 It's saying that something is wrong with this data set. Where do I even start? And so that should then be able to provide enough context to the data engineer to say, well, it's this table. Here's the metric that's alerting you. Here's some SQL that you can start running right now to try to help debug it. And I think that in the future, this then extends even further into really joint runbooks. This issue fired, the data engineer knows what they need to do in order to fix it on the infrastructure side. But then the marketer can then come in and contribute to the same runbook and say,
Starting point is 00:21:30 by the way, here is the expectation. Here's why this is the expectation. So now you have context around why this is a problem. Interesting. It's almost like really good error logging in a way right like if you think about like a lot of detail around here's the basis of the problem here's where you should start troubleshooting like it's super interesting it's like a stack trace for a software yeah you you look you look at the next thing down and you're like okay well great where what line of code caused that what line
Starting point is 00:22:03 of code caused that i think line of code caused that? I think it's just so much harder in data because there is no stack trace for data. Yeah. If I, if I could have like any tool at the snap of a finger, it would be a stack trace for data where you can say, here are the 10 records here that are causing this. And they, by the way, they actually came from here and they came from here. And I know lineage is a very, very popular topic nowadays, but no one's doing lineage to the degree of at the record level. Software engineering has this line in this file caused your exception. Data has, at best, this is something interesting that's going on in your exception. Data has, at best, this is something interesting that's going on in your data,
Starting point is 00:22:50 but which like 100 records out of the 10 billion that I loaded today are causing this? Good luck. And it's easy for some cases. Sometimes you can say, okay, well, this column should never be null.
Starting point is 00:23:03 And so if you have a null record in there, fine. This is a very easy filter, easy fix. But what if your average moves? Or what if you're doing some machine learning and your distribution shifts? Your variance goes up. What caused that? Well, it could be anything.
Starting point is 00:23:23 You can't really tell. And so I think it's just so much trickier in data than it is in software to pinpoint these problems. Yeah, absolutely. And the way that something that I observed all this time that you are talking, Igor, it seems that quality is something that touches pretty much every part of the organization, right? Like it starts from the hardware that you use over there, right? Up to how the VP of marketing, for example, interprets the numbers, right? And I want to ask you, it sounds like a very big problem, like hard even to define, right? As a problem. Talking about quality is easy to use the term quality, but
Starting point is 00:24:05 at the end, if you want to solve the problem, you need a better definition of the problem. So how do you define a big I quality? So I think that's really interesting because everyone, I agree with you, everyone defines data quality differently. In our viewpoint, data quality is about the quality of the final data product. So I'm going to take a step back and I use the word data product a lot. But if you think about software, it's very easy to define what is the end deliverable for software. It's usually a website, an API, an SDK, whatever it is, that is the product. And when something goes wrong with that product, it's immediately apparent what is going wrong. If you go to your webpage and it throws an error, then that your product is broken.
Starting point is 00:25:08 For data, it's important to define what are those data products. Now, a lot of the times data engineers will say, this table is the data product. My deliverable is the fact that this table exists in the warehouse and it's being updated consistently. But you need to take a few steps forward from that because that table is then used in ETLs, goes into other tables, which then eventually go into a dashboard, a machine learning model, maybe feeding some sort of product functionality for your core application. So that is the end deliverable for the data team. And so then it's important to measure quality at that stage. It's important to understand that my KPI dashboard is good to use. My product is the KPI dashboard.
Starting point is 00:26:03 Now there might be 10 tables that are going into this dashboard. No one really cares about the tables. People care about looking at the dashboard at the end of the day. The tables are helpful in order to inform us about what could cause this dashboard to go wrong, what could cause this data product to be broken. And at BigEye, we have a concept of SLAs, which are customers used to define the state of data products. So if you think about SLAs from software service level agreements, it's the ability to say, when is my application available? And when would I consider it unhealthy, broken, and which metrics are contributing into that? So for applications, that's error rates, latency, throughput, however you want to define your
Starting point is 00:27:01 SLA. For data products, it becomes the combined metrics that you're measuring about the underlying data sources. So for example, let's say I have my KPI dashboard. Let's say I have two tables feeding into that, my users table and my sales table. Right. Now, if my sales table is delayed, for example, or all of a sudden we notice that there are negative values in the sales amount column, which should never happen, then I can say this table is unhealthy because this metric is outside of its expected range in the same way that you could do that for latency. And then that can then flow into your data product and say, the KPI dashboard is unhealthy because something that's feeding that KPI dashboard has turned unhealthy because one of the metrics has
Starting point is 00:27:57 gone outside of an expected range. And so the KPI dashboard has an SLA. That SLA is now red. It's violated because an underlying metric has violated the SLA. And so we measure quality at that end product level, but enable users to build up that SLA from the underlying components, from those metrics that we are using to measure the state of the data. It's very interesting. You're mentioning as an example of data products, usually like the outcome of BI, which is like reporting, right? What other data products you see usually in an organization today? Machine learning models are going to be the most popular ones today. And it's actually interesting because BI is the most easily understood and grasped example of a data product because there is a dashboard that you can
Starting point is 00:28:57 see on a screen. It's very easy to understand when that goes wrong. There are a lot more data products today that are more automated and less apparent when they go wrong. So machine learning models are a great example of this. These machine learning models, you have a training data set, it's going to go build that model, and then it's going to use that model to make some sort of prediction. And usually that model feeds some sort of product functionality. So at the end of the day, if we want to talk about the machine learning model being the end data product or that product functionality being the end data product, it doesn't really matter. Usually that's pretty one-to-one. So let's talk about the machine learning model. Now, if the data going into that model that's used for training is incorrect in some way, and this is the classic phrase, garbage in, even more so than a broken dashboard,
Starting point is 00:30:05 because a broken dashboard, a human's going to look at this and make a decision about whether to trust the data or not, and whether to make a decision about the data or not. If you have a machine learning model, no human's looking at that. The first person that's going to notice is the customer trying to use the product feature. Let's go back to the example of Uber even. Let's say you have a machine learning model that says, this is how far away we'll accept drivers to accept a pickup. And let's say that model trained on bad data. And now it's saying all of our drivers are coming from half an hour away because we are
Starting point is 00:30:48 failing to dispatch anyone closer. Well, that's a problem. And that's a customer facing problem. They're going to stop using the app. That's immediate impact to the business. And it's dangerous because no one's looking at it. No one's looking at the model and saying, well, what's the model doing? And sure, sometimes you have really tight feedback loops that can measure the outputs of the model
Starting point is 00:31:08 and say that, okay, something looks wrong. Let's roll it back. But most businesses don't have this and most organizations don't build this into any sort of automated flow. And so if you look at the, you must monitor that data quality of those training datasets. And you have to monitor it holistically enough and deeply enough to be able to detect issues that can cause these things. And a lot of times, those training datasets aren't monitored at all. I mean, it would be great if even the inputs to those training datasets were monitored, but even those sometimes aren't monitored. And so now you have a bunch of like, who knows what's going into this model and you just expect it to work.
Starting point is 00:31:49 And that's just not how machine learning works. Yeah, absolutely. So, okay. We are talking about like a range of like different data products where the stakeholders involved in them are like different, right? So who is the person who defines the SLAs for Big I? Because if I understand correctly, that's where everything starts, right? Someone has to define the SLA and then the SLA is attached to a number of metrics that you are calculating below the SLA and you come up with the warnings.
Starting point is 00:32:20 So who is the responsible entity? So SLAs need to be agreements between both parties. And when I say both parties, the way that I see data teams organized is really into two segments, data producers and data consumers. At the end of the day, there's going to be somebody who is producing the data that you are using. So you typically, these are data engineers. If an even easier example is third-party data, let's say Facebook is sending you your impressions and all your ad metrics. They probably have an SLA with you that
Starting point is 00:32:59 says, we promise to deliver this on this cadence and it's going to be complete and so on and so forth. That SLA is between the data producer, Facebook, and the data consumer, which is your team. Now within an organization, same thing. A data engineer is the data producer. And then the data consumer is usually the analyst, the data scientist building the ML model, the product engineer who's actually consuming some data feed and then using it in the product. So the SLA needs to be a contract between both of them. And so within an organization, sometimes it's a little bit tricky because there are different expectations of what the data should look like from the consumer and the producer, but they can at least meet in the middle and say, all right, I expect this data to be updated daily. And then the producer might say, yep, that's totally reasonable.
Starting point is 00:33:56 We're updating it more frequently already. So that's a totally reasonable expectation. Once they come together and set that expectation, that can then go into the SLA. The SLA for that data product now includes that expectation. And you can go down the list and make all of these expectations. And then the interesting part here is there can be auxiliary SLAs. You can have your core SLAs where full stop, is this data product good or not? So is it on time?
Starting point is 00:34:27 Is it complete? Are there any serious anomalies in data? Nulls, bad formats, incomplete data feeds. We expected 1,000 records. We got 200. But then what we actually see our customers do is build auxiliary SLAs. So the data consumers are then saying, well, we have expectations about what the actual data should look like.
Starting point is 00:34:52 And that might not even be the problem of the data engineer building the pipeline. This might actually be a, we instrumented logging incorrectly in our product. And by the time that it got here, something looks wrong. And so then they will go and instrument their own expectations around what the data looks like. We expect to have three product tiers and any other values are invalid. We expect a specific range of numbers when we were looking at how much we're charging users.
Starting point is 00:35:23 Maybe it's somewhere between $1 and $100 because we know we don't have anything outside of that range. And if it's outside of that range, then the data itself is bad. I shouldn't be using it. And so it really depends on what that SLA is trying to represent. But usually the way that we see it is there is a joint SLA that just is that fundamental, is this data good? Answers that fundamental, is this data good question? And then there's the secondary SLAs around, well, what does the data look like? And is it meeting my expectations as a consumer of it? How do you bring these people together as part of BigEye? Because from what I understand, it's something quite important that affects also at the end the outcome of the product itself, right?
Starting point is 00:36:09 A wrong SLA, for example, a not well-defined SLA or the thresholds of the SLA not being right at the end might affect the value that BigEye delivers. So how do you handle human nature at the end and how people can communicate or in most cases cannot communicate really light question here yeah this is a softball yeah how do you solve massive organizational issues i think that's something that we'll always have to work on. At Big Eye, our goal is to build a product that allows for people to come together and talk about these very important topics. I think there's the flip side to it, as you mentioned, that human nature side of people just don't want to do that.
Starting point is 00:37:07 People want to focus on their work and their problems. And for us, a lot of that is just education. Even me coming on the show, someone's going to listen to this and think, oh, maybe I should go talk to my data scientist and just wonder what's important to you about your data. And even just taking those small steps of education is important to us. From a product perspective, I don't think a product can ever solve organizational issues. And the only way to do that is through education and through really at the end of the day, empathy. You have to have empathy for your co-worker. You have to understand that they are also just trying to do their job and understanding what makes their life a little bit easier is just going to make for a better organization overall. Yeah, it's very interesting. The reason that I am asking is because I've seen many companies that they're building one way or another, like data-related products.
Starting point is 00:38:06 And most of them, they have also to tackle some kind of organizational obstacle there. Because I think it's the nature, when you're working with data, it's like one thing is manage the infrastructure, like the technology, blah, blah, blah, all that stuff. But at the end, you have pretty much the whole company as a stakeholder who's going to consume this data. So it's always a collaborative game at the end. And I have at least seen a couple of different ways that companies are trying to solve these problems. One is the common, let's say, the GitHub approach, right?
Starting point is 00:38:39 We are trying to build a collaborative platform where people can get on the platform and collaborate and blah, blah, blah, put them on a workflow and all that stuff. Education. That's like a very, very good point. And I think this is where also like marketing from the perspective of the company becomes like an amazing tool because you can educate your users and customers. And then there's also, because you mentioned Looker, and one of the things that always
Starting point is 00:39:07 impressed me with Looker is how they solved an organizational problem, which in their case was like making sure that they separate the data engineer and the business user as much as possible. And they did that by providing LookML for the developer to create the modeling and then a user interface, which is as easy as Excel for someone to use and do pivots. And the two people, okay, they have to talk to each other, but at least the whole communication is much easier.
Starting point is 00:39:38 Have you ever had to implement a LookML model in production? Yeah, I have. Well, I said they tried. I didn't say they succeeded. Okay. The reason I ask is because if you completely separate the two, you're never going to get anything meaningful out of it. Because for the model to be meaningful, you need that input from the business and that stakeholder in order to know what needs to go into it. The pivot table stuff is great. And I mean, Tableau did the same thing. Tableau extracts are meant to do the same thing, which is, well, here is a pre-built data set that you can now go and whizzy wig and drag and drop
Starting point is 00:40:20 your way around. But without knowing what needs to go into that data set, you're just going to get into that same cycle of the stakeholder is going to come back to you and say, oh, this doesn't exist in my model. And then you're going to go at it. And then they're going to say, well, why is this filter on here? And you're like, I don't know.
Starting point is 00:40:38 Somebody else told me to do that. And so it's just going to go this way. Yeah, absolutely. I was going to say, I think, Acosta, that's a really interesting observation. A few thoughts here. So one, and I'm making some assumptions here. This is a little bit of a hot take. So Igor, please correct me if I'm wrong.
Starting point is 00:40:55 But if you think about data quality, it happens at different places in the stack or in the data flow, right? So one thing that we've talked about a lot on the show and that Costas and I talk a lot about is tracking plans. That absolutely gets right at the heart of organizational problems, right. Cause it gets to like share definitions and then like new processes and all of these things. Right. And I mean, there are some really interesting companies doing some really cool things with
Starting point is 00:41:23 tracking plans and I know there are some great solutions out there, but it's a really hard organizational problem because a lot of companies just don't do it, right? I mean, the bottom line is like, it's just, you have your work to deliver and like tracking plans slow things down and their demands on the teams and collaboration is hard in general. And so that's sort of a difficult organizational problem to solve. When you were talking about SLAs, and I'm going to speak specifically to the BI use case, because I think the ML one is a lot more complicated. But what's interesting is I was thinking about even just my own day to day. And it's the way that big guys approaching it through the paradigm of SLAs is really interesting because even in our organization, a lot of those are just already defined, even if they're not made explicit. I know what my SLAs are for the marketing dashboard. And I haven't necessarily written that down or gone to tell a person who's writing DBbt models or whatever and those conversations come up organically but i know those right so those are it's not difficult for me to mine that information i already know that and so it's interesting when we think about the organizational challenge like
Starting point is 00:42:34 a product that formalizes something that already exists on some level but just hasn't been made explicit and then adds collaboration i think is a really interesting. And I think that that's like a, I think that's where a really well-done product can actually really facilitate that. I won't speak to the machine learning because that seems, and I'm non-technical, so that seems like a much more difficult problem, but it seems like a lot of the SLA's already exist in the organization. There's just not a great way to formalize them. And a lot of it is just getting it out of people's heads. Like you said, you have an SLA in your head. You know what you expect out of this data.
Starting point is 00:43:13 But a lot of the tools that data teams build internally are usually very technical and they're very geared towards the data engineering team. They usually involve writing some sort of SQL or configuration or checking in code even sometimes or updating the ETL. And those can capture a lot of the basic information. Again, like how fresh is my data? How many records do I have? But it's not going to get all of that
Starting point is 00:43:45 stakeholder knowledge in there. So that's why any data quality tool needs to be able to be accessible enough to extract that information for the stakeholders, the data consumers to come in and actually express what they have in their head. Because at least that gives you a starting point. You can at least go and write that down, create that configuration, start that monitoring. And when it goes off, you can then go to your data engineering team and say, here were my expectations of this data. Which one of these did I get wrong? Do you understand something that I'm not understanding? And do I need to adjust my expectations? And a lot of times the data producer team, the data engineering team might say, nope, you got that right. And this is a
Starting point is 00:44:32 real issue. We just didn't know about it. Thanks for flagging that. But it's important to first and foremost, get that out of your head, get that out of the stakeholder's head and in a place where somebody can see it, visualize it and understand it. And then that will prompt that conversation. Sure. Well, we're getting close to time here, but one thing I'd like to talk about is to kind of get specific on when we think about data quality and we think about Big Eye as a tool that helps solve that. One thing we talk a lot about on the show is that, and you mentioned this with Uber, right? Like the data problems at Uber scale are very different than data problems at a much smaller company. What are the symptoms that you think, and maybe that you even see with your customers that necessitate like,
Starting point is 00:45:27 okay, we need to start thinking about data quality and tooling around that. Are there particular tools in the stack or data pipelines that are indicative of a need for this? I just love to give our listeners a sense of when, when is the, I mean, it's kind of an acute need. Like, like you said, everyone faces a data problem, and everyone's probably face it this week, but in from the big eye perspective, when do you need to implement tooling. I'm in a very biased answer. As soon as you have data in a warehouse you probably want data quality tooling. In a more objective answer, it really depends. My gut feel would always come from how much time are you spending on data quality problems? And this is typically a question for the data engineering side, but it works for the business as well. Yeah.
Starting point is 00:46:27 How much time are you wasting looking at dashboards that are broken? On the data engineering side, how much time are you spending fielding questions from the business about why their stuff's broken? Or can you look into this query and tell me what's going wrong? Yeah. Because one question a week, fine. But if you're spending five hours, 10 hours a week, just debugging people's SQL to help them understand what's going wrong, you might want to invest in something that's a little bit more automated. Because at the end of the day, people just want to do the job that's fun. They want to do their
Starting point is 00:47:04 job. They want to do the fun parts of it. For the business, that's getting new insights, making decisions, driving direction. For the engineers, that's I want to build frameworks and I want to create new pipelines and explore new tooling. And neither side can do that if there's too many data quality problems because they get in the way. And so at some point, the business will have this critical point. I actually have a term called the oh shit moment, which is at one point, did you have such a big data quality problem that it completely derailed the whole business. Say KPIs were wrong, sales numbers were wrong. A product rollout couldn't be tracked because the instrumentation wasn't correct and no one noticed for a week until you went to pull the report. So at some point,
Starting point is 00:47:57 you're going to have that moment and you're going to realize we can never have a moment like that again. We need to start worrying about that. Yeah. Costas, I'd be interested in your thought on this. So Igor, it's a little bit of a leading question because I had kind of my own thought is, as I think about this and I'm just putting it through the lens of my own experience, for me, the trigger would almost be like, I have all my data in my warehouse. You start to build out dashboards, but then you go through this weird period where your dashboards aren't stable because you have all this data and you're just trying to figure out what should I measure? How should I measure it? And then you get to a point where you're like, okay, this is the dashboard that the marketing team is looking at every single day. And these are the numbers.
Starting point is 00:48:42 And then you have a baseline from which derivations become really important and that's the like slas in your head so to me it's almost like okay the first signs of dashboard stability give you your initial set of slas when you can measure from but would you agree with that cost us because you've done i mean all sorts of reporting especially on the products yeah i would agree with that, Costas? Because you've done, I mean, all sorts of reporting, especially on the products. Yeah, I would agree with the biased version of Igor's opinion, to be honest. Like the sooner you have at least some principles, I mean, you might not want to start using like a product or something, but at least have some principles to check what's going on with the customer facing side of your data product let's say okay which is i don't know like the your dashboard for example i think the better
Starting point is 00:49:34 you're going to be i mean it's it's amazing how many times i've heard from like pretty big companies data engineers coming to us and be like, oh, we just realized that this pipeline stopped running three weeks ago. Whoa, something is feeding. This pipeline is feeding something, right? So why it takes so long to feed? Because that's a great point, this oh shit moment. Usually it's very late when this happens and someone is angry right you might have your board meeting and you don't have your numbers
Starting point is 00:50:12 okay nice fun right that's the where you have like the the common excuse we are still working on our infrastructure until next. It's part of our OKRs, right? Reporting infrastructure as close as an OKR next quarter. But yeah, I think the sooner you do that, the better. And one of the reasons is outside of like, okay, avoiding these OS oh shit moments is because like people especially people that they start working for the first time like with data to understand and educate themselves that like data always something will go wrong with them there's no perfect data out there it's i mean pretty much can be proven in like computer science that you cannot have that period okay so i remember for example i'll give an example that i kept like remembering while i was like talking the first time i mean when we started
Starting point is 00:51:11 blendo we were using at the beginning google analytics right so we were taking like the numbers and like measuring from there then we started using mixed panel and we were like oh we have another data source like with the same data let's compare the two now that we have them like on our data warehouse and of course they didn't match right okay what do you do now but the most important thing is like not just like how you are going to tackle the problem but realizing that this is the reality that you are going to be operating in and getting into this habit of caring about data quality will make you understand and incorporate this as part of your business practices, which is like, in my opinion,
Starting point is 00:51:50 it's probably super important to start as soon as you start reporting, even on an Excel document, like some numbers to your board. The more you wait, the worse the problem is going to get. Oh, shit. Well, I also think about something that Costas says a lot, which is it's easy to talk about data in a way that it almost comes across as static. But the reality is data is changing a lot, right? New pipelines are added, other pipelines are deprecated, right? Like it is never static within an organization and the complexity is only increasing. Even within a pipeline. I mean, even taking, looking at just one pipeline, because there's plenty of, I've seen teams that have one table.
Starting point is 00:52:35 And like, this is our event log table. It is 500 columns wide and it stores every single event that happens in our whole product. And those, even those pipelines can go wrong, even if nothing changes about the pipeline. There's no new pipeline, but you stop publishing a signup event and all of a sudden your conversion goes to zero
Starting point is 00:52:56 because no one's signing up. Even within a pipeline, things can go wrong. And like, even their data is never static. Well, we are, we're at the buzzer here. Igor, before we jump off, if someone wants to learn more about BigEye, try it, where should they go? What should they do? They can go to BigEye.com. They can also email me, Igor, at BigEye.com. Awesome. This has been such a fun conversation. So many rabbit holes we could have gone down. We'll have to save that for another episode. And thank you again for giving us some of your time to have a, to talk shop about data. It's great. It was my pleasure.
Starting point is 00:53:36 I really enjoyed the conversation. Thank you. I love talking to our guests. They're just so smart and learn so many things. My big takeaway is the paradigm of SLAs. And I love the framework that Igor used to talk about SLAs for data products. a really, really smart way to approach the data quality problem. So I'm even thinking about that for my own day-to-day work. So I just really appreciated that perspective. Yeah, absolutely. Actually, I would say that it's like a broader theme in the way that he was approaching the problem of building data quality-related products. If you notice that there were like two main things that happened during our conversation.
Starting point is 00:54:26 One was the use and the definition of the term SLA, which comes like from software engineering, right? And there is again, like the usage of the term data product, which he also defined exactly. And it's again, like a term that we are much more familiar when it comes to like software, but it's something that we can reuse also in data. And I think that's what Igor and Big Eye, what they're trying to do is get a number
Starting point is 00:54:51 of best practices and principles that they are much more mature in software engineering and apply all these, like also the problem of data management and data consumption. And I think they're doing a pretty good job. And I'm really looking forward to have another follow-up episode with him because I think we just scratched the surface of quality. We didn't even talk what happens after we define the SLAs. So there are many more things that we can discuss with Igor. And I'm really looking forward to do that in the imminent future. Absolutely. Well, thanks again for joining us on the show. And we'll catch you on the next episode.
Starting point is 00:55:29 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
