The Data Stack Show - Data Debrief: Can Tools Help Solve Data Quality Organizational Challenges?
Episode Date: October 15, 2021. On this Data Debrief, Eric and Kostas are joined by Brian from RudderStack to talk about data quality. ...
Transcript
This is our first show debrief for The Data Stack Show.
We just had a great conversation about data quality with a really cool company in the space.
We invited Brian from the RudderStack team to come debrief with us.
This is fun because he hasn't heard any of the conversation that we just had.
So we're going to get hopefully some hot takes from Brian.
You have a long history of actually dealing with data quality problems at some big companies.
You want to just give us a quick background?
Yeah, absolutely. Thanks, Eric. Hey everyone, I'm Brian. I'm on the product team here at RudderStack. I was previously over at DoorDash and Dropbox, where I worked on their internal data platform teams on the product side, across warehousing, ETL, experimentation, and machine learning. So I'm trying to bring some of that over to the conversation. Cool. Okay, I'm going to start it off. This is a debrief,
so we can get straight into some potentially controversial questions or things that you may be very opinionated about. Okay. So we talked about
data quality, and our guests used this really interesting concept of SLAs for data products. So the example is a dashboard for, say, the marketing team. As someone working in marketing, I'm looking at this dashboard every day, and I have SLAs defined where I have acceptable ranges. And if there's a deviation, having that defined SLA with the data engineering team is really helpful for troubleshooting if something goes wrong, et cetera. But one thing we talked about, and I'm so interested because you've seen this at really
large organizations: you bump up against cultural challenges within a company, because ultimately that requires collaboration between people who are on separate teams and going about their work differently.
Obviously their work is deeply interrelated, but in some ways that seems like a Sisyphean challenge, right?
Can a product actually help solve a cultural collaboration issue?
So what's the Brian hot take on this?
What is the hot take?
Man, I think just in general, the data quality stuff is really difficult because there are so many things that can go wrong.
And culturally, it's not just the end users that are affected by these SLAs; oftentimes the end users are actually pretty forgiving in terms of how bad things can get.
One thing is that when dashboards break, people don't necessarily know whether it's their fault, or the fault of the person immediately upstream of them, or the person upstream of that, and so on.
Because you have the whole life cycle of a data point, right?
You have the analysts who created the ETLs, and those could break. You have the data engineering pipelines, and the pipeline writ large could break. You have the teams instrumenting the events, and that could break.
Sometimes even the production data is wrong, or doesn't really match: people have hacked the production data, so the analytical interpretation of it might not match what people actually see.
So there are just so many things that can go wrong that it's really about setting expectations: figuring out what the person at the end of the pipeline can expect, making a best effort at chasing down the person who can fix the specific thing, or at least setting expectations for each stage of the process.
And intuitively, I think part of the value of what these observability tools provide is some of that: being able to add checkpoints at each of these stages and say, these are the expectations for the team before it. But it is an organizational challenge, right? Because there are multiple different teams, different sub-teams within those teams, the organization of the org. Even if everyone's under the same umbrella, like, oh, you're all part of one team or whatever, that doesn't help all the time, right? Some teams don't talk to other teams all the time; sometimes they suck at communicating. So tools do provide a bit of that, but it's a mix. You have to really meet in the middle.
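To make that concrete, here is a minimal sketch of what a per-stage expectation check along these lines could look like. This is a plain Python illustration, not any particular vendor's tool; the stage names, metric names, and acceptable ranges are made up for the example.

from dataclasses import dataclass

@dataclass
class Expectation:
    stage: str      # e.g. "event instrumentation", "warehouse ETL", "dashboard"
    metric: str     # the metric this checkpoint watches
    low: float      # lower bound of the acceptable range (the agreed "SLA")
    high: float     # upper bound of the acceptable range

def check(exp: Expectation, observed: float) -> str:
    """Report whether one checkpoint is inside its agreed range."""
    if exp.low <= observed <= exp.high:
        return f"OK    {exp.stage}: {exp.metric} = {observed}"
    return (f"ALERT {exp.stage}: {exp.metric} = {observed} is outside "
            f"[{exp.low}, {exp.high}] -- route this to the owner of that stage")

# Hypothetical checkpoints along the life cycle of one metric.
checkpoints = [
    (Expectation("event instrumentation", "signup_events_per_hour", 500, 5000), 4200),
    (Expectation("warehouse ETL", "signups_loaded_per_hour", 500, 5000), 0),
    (Expectation("marketing dashboard", "daily_signups", 10000, 120000), 0),
]

for exp, observed in checkpoints:
    print(check(exp, observed))

The point is less the check itself than that each stage has an owner and an agreed range, so a failure points at a specific team rather than leaving the dashboard consumer guessing.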
Yeah, totally. Okay, I'm kind of changing the format here. Kostas is laughing because I'm changing the rules on the fly. I said, Kostas, we each only get one question. And then I said, actually, cancel that: we each get one question and one follow-up question. So I have one follow-up question. Since we're talking about organizational structure, if we think about data teams inside of companies, is there a format that you've seen work really well?
One concept we've talked about on the podcast is a centralized data team versus what one guest called structured embedding, where you have a data professional on, say, the marketing team, right? And they have a dotted line back to the data team.
And I know it varies by company, but just in your experience, was there a structure where you said, man, this actually works pretty well?
I don't know.
I think, basically, yeah. So in both the orgs I've worked at, it was pretty centralized, where the central data platform team was incentivized to build a software ecosystem that was really extensible. It allowed a lot of people to use it, and it was basically a product internally for the organization.
Whereas after that you had, initially, the analytics team. One cool thing about both companies was that I joined when they were around 400 or 500 people and watched them grow to 2,000, 3,000, 4,000 people. And so the types of profiles, the types of people making the changes, shifted a little bit.
And so initially it was the analysts building the ETLs, the people who were embedded: either product analysts would build ETLs for their own teams, or other analysts would, like sales analysts or marketing analysts. We didn't quite get to the point where HR analysts were building HR pipelines, but it kind of got there as well. And you can imagine that across each business line, there's maybe somebody who can own the data for all of those.
Eventually, around the 1,000-to-2,000-person mark is when data engineering and BI became a much bigger thing.
So BI teams really started getting built out, or a data engineering team, which was distinctly different from the data platform team: the data engineering team actually cared about the data and about building more sophisticated pipelines, whereas the platform team was purely on the software side and begrudgingly helped with the pipelines if necessary.
I think it worked decently well.
So the analysts and engineering teams were staffed out to other teams. You can call it embedded, you can call it assigned, whatever. That's probably where the balance between domain expertise and general system expertise sat; that's where the line was drawn.
Well, that's a super interesting distinction between the platform team working on the ecosystem versus the data engineering team focused on the data. Okay, Kostas, I used my follow-up and I will not change the rules again. It's your turn.
Oh, you can do that if you want. That's fine. Thank you, Eric. Thank you for the time. So Brian, you have worked in organizations that, by definition, have to work with a lot of data, right?
And I would like to stick a little bit to the theme of the episode, because it was about data quality.
Can you give us a couple of pieces of advice, or tools, on how you managed quality in these organizations? And, something else, I'll put my follow-up question in now: how big of a problem was it, in the end, for you to manage the quality of the data?
Quality is a much more open-ended question. It was probably something that we didn't really end up building that much software around; we built very simple checks, but it was really difficult to wrap our heads around what the right way to build quality software was: whether it should be stuff that's self-serve for individuals, or something that's centrally administered, kind of a security-room type thing where people are actually watching the screens. And it was tough, because initially we thought, oh, it should be self-serve, because we can't check everything, and people kind of know what they want to check. People came to us with very specific requirements, like, oh, we want this table to be very accurate within here, but these other tables we don't really care about. But I think that translating
what people wanted, and when they wanted it, was hard. Of the people who complain about data quality, or who get burned by it or suffer from it, only a fraction would have set something up beforehand.
And so I think we actually would have benefited, and these companies would probably benefit, from some more holistic thinking that these services are starting to provide.
I don't know exactly what they're doing, whether they're just scanning everything or something; it's some kind of combination of the two.
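As a rough sketch of the self-serve flavor, teams could declare what "accurate enough" means for the tables they care about and leave everything else unchecked. The table names and thresholds below are made up, and SQLite is used only so the example runs end to end.

import sqlite3

# Hypothetical self-serve registry: each team declares the tables it cares
# about and what "accurate enough" means; undeclared tables are not checked.
CHECKS = {
    "orders_daily": {"min_rows": 1, "max_null_rate": {"customer_id": 0.01}},
}

def run_checks(conn):
    problems = []
    for table, spec in CHECKS.items():
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        if rows < spec["min_rows"]:
            problems.append(f"{table}: only {rows} rows (expected >= {spec['min_rows']})")
            continue
        for column, max_rate in spec.get("max_null_rate", {}).items():
            nulls = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
            ).fetchone()[0]
            if nulls / rows > max_rate:
                problems.append(
                    f"{table}.{column}: null rate {nulls / rows:.1%} exceeds {max_rate:.1%}"
                )
    return problems

# Toy data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_daily (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders_daily VALUES (?, ?)", [(1, 10), (2, None), (3, 12)])
print(run_checks(conn) or "all declared checks passed")

A centrally administered version would look more like the second approach described here: running something like this across every table on a schedule and having someone watch the output.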
Ultimately, where quality hit us the most was actually quality in terms of reliability, where data was late.
You would see zeros, and you wouldn't know if the zeros meant that the ETL didn't run, or that the ETL ran incorrectly, or that the problem was upstream.
That was a big debugging exercise. But I think if you took the number of incidents where somebody comes back and says your data didn't work, it was almost always because of an issue in the pipeline, probably that the pipeline didn't run, rather than, say, the code being messed up.
Because usually the code doesn't have as many bugs as the pipeline does.
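As a sketch of how that "zeros" ambiguity could be separated mechanically, assuming the orchestrator or warehouse exposes a last-run timestamp and a loaded-row count for the table (the field names here are made up):

from datetime import datetime, timedelta, timezone
from typing import Optional

def diagnose(last_run_at: Optional[datetime], rows_loaded: Optional[int],
             expected_every: timedelta = timedelta(hours=24)) -> str:
    """Separate 'the ETL never ran' from 'the ETL ran but loaded nothing'."""
    now = datetime.now(timezone.utc)
    if last_run_at is None or now - last_run_at > expected_every:
        return "LATE: the ETL did not run in its expected window -- check the scheduler"
    if not rows_loaded:
        return "EMPTY: the ETL ran but loaded zero rows -- look upstream"
    return f"OK: loaded {rows_loaded} rows at {last_run_at.isoformat()}"

# A dashboard full of zeros could come from either failure mode:
print(diagnose(last_run_at=None, rows_loaded=None))
print(diagnose(datetime.now(timezone.utc) - timedelta(hours=1), rows_loaded=0))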
Yep.
Makes sense.
Brian, thank you for joining us for the debrief. This is exciting,
our very first debrief. So congratulations on being the inaugural debrief guest.
Absolutely. Can't give you more hot takes yet. Maybe hot takes one day.
Maybe that's it. That's a tease. I'll give you the nice vanilla answers for now.
That's great. Yeah, we got to ease into it.
We got to ease into the real spicy stuff.
Yeah, I've seen you get worked up.
Cool. Thanks for joining us.
All right. Thanks.