The Data Stack Show - Data Debrief: Can Tools Help Solve Data Quality Organizational Challenges?
Episode Date: October 15, 2021. On this Data Debrief, Eric and Kostas are joined by Brian from RudderStack to talk about data quality. ...
Transcript
This is our first show debrief for The Data Stack Show.
We just had a great conversation about data quality with a really cool company in the space.
We invited Brian from the RudderStack team to come debrief with us.
This is fun because he hasn't heard any of the conversation that we just had.
So we're going to get hopefully some hot takes from Brian.
You have a long history of actually dealing with data quality problems at some big companies.
You want to just give us a quick background?
Yeah, absolutely. Thanks, Eric. Hey everyone, I'm Brian. I'm on the product team here at RudderStack. I was previously over at DoorDash and Dropbox, where I worked on their internal data platform teams on the product side, across warehousing, ETL, experimentation, and machine learning. So I'm trying to bring some of that over to the conversation. Cool. Okay, I'm going to start it off. This is a debrief,
so we can get straight into some potentially controversial questions or things that you may be very opinionated about. Okay. So we talked about
data quality, and our guests used this really interesting concept of SLAs for data products. So the example is a dashboard for, say, the marketing team. As someone working in marketing, I'm looking at this dashboard every day, and I have SLAs defined where I have acceptable ranges. And if there's a deviation, having that defined SLA with the data engineering team is really helpful for troubleshooting if something goes wrong, et cetera. But one thing we talked about, and I'm so interested because you've seen this at really
large organizations: you bump up against cultural challenges within a company, because ultimately that requires collaboration between people who are on separate teams and going about their work differently.
Obviously their work is deeply interrelated, but in some ways that seems like a Sisyphean challenge, right?
Can a product actually help solve a cultural collaboration issue?
So what's the Brian hot take on this?
What is the hot take?
Man, I think just in general, the data quality stuff is really difficult because there are so many things that can go wrong.
And culturally, it's not just the end users that are affected by these SLAs; oftentimes the end users are actually pretty forgiving in terms of how bad things can get.
One thing is that when dashboards break, people don't necessarily know whether it's their fault, or the fault of the person immediately upstream of them, or the person upstream of that, and so on.
Because you have the whole life cycle of a data point, right?
You have the analysts who created the ETLs, and those could break. You have the data engineering pipelines, and the pipeline writ large could break. You have the teams instrumenting the events, and that could break.
Sometimes even the production data is wrong, or doesn't really match: people have hacked the production data, so the analytical interpretation of it might not match what people actually see.
So there are just so many things that can go wrong that it's really about setting expectations: figuring out what the person at the end of the pipeline can expect, making a best effort at chasing down the person who can fix the specific thing, or at least setting expectations for each stage of the process.
And intuitively, I think part of the value of what these observability tools provide is some of that: being able to add checkpoints at each of these stages and say, these are the expectations for the team before it. But it is an organizational challenge, right? Because there are multiple different teams, different sub-teams within those teams, the organization of the org. Even if everyone's under the same umbrella, like, oh, you're all part of one team or whatever, that doesn't help all the time, right? Some teams don't talk to other teams all the time; sometimes they suck at communicating. So tools do provide a bit of that, but it's a mix. You have to really meet in the middle.
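To make that concrete, here is a minimal sketch of what a per-stage expectation check along these lines could look like. This is a plain Python illustration, not any particular vendor's tool; the stage names, metric names, and acceptable ranges are made up for the example.

from dataclasses import dataclass

@dataclass
class Expectation:
    stage: str      # e.g. "event instrumentation", "warehouse ETL", "dashboard"
    metric: str     # the metric this checkpoint watches
    low: float      # lower bound of the acceptable range (the agreed "SLA")
    high: float     # upper bound of the acceptable range

def check(exp: Expectation, observed: float) -> str:
    """Report whether one checkpoint is inside its agreed range."""
    if exp.low <= observed <= exp.high:
        return f"OK    {exp.stage}: {exp.metric} = {observed}"
    return (f"ALERT {exp.stage}: {exp.metric} = {observed} is outside "
            f"[{exp.low}, {exp.high}] -- route this to the owner of that stage")

# Hypothetical checkpoints along the life cycle of one metric.
checkpoints = [
    (Expectation("event instrumentation", "signup_events_per_hour", 500, 5000), 4200),
    (Expectation("warehouse ETL", "signups_loaded_per_hour", 500, 5000), 0),
    (Expectation("marketing dashboard", "daily_signups", 10000, 120000), 0),
]

for exp, observed in checkpoints:
    print(check(exp, observed))

The point is less the check itself than that each stage has an owner and an agreed range, so a failure points at a specific team rather than leaving the dashboard consumer guessing.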
Yeah, totally. Okay, I'm kind of changing the format here. Kostas is laughing because I'm changing the rules on the fly. I said, Kostas, we each only get one question. And then I said, actually, cancel that: we each get one question and one follow-up question. So I have one follow-up question. Since we're talking about organizational structure, if we think about data teams inside of companies, is there a format that you've seen work really well?
One concept we've talked about on the podcast is a centralized data team versus what one guest called structured embedding, where you have a data professional on, say, the marketing team, right? And they have a dotted line back to the data team.
And I know it varies by company, but just in your experience, was there a structure where you said, man, this actually works pretty well?
I don't know.
I think, basically, yeah. So in both the orgs I've worked at, it was pretty centralized, where the central data platform team was incentivized to build a software ecosystem that was really extensible. It allowed a lot of people to use it, and it was basically a product internally for the organization.
Whereas after that you had, initially, the analytics team. One cool thing about both companies was that I joined when they were around 400 or 500 people and watched them grow to 2,000, 3,000, 4,000 people. And so the types of profiles, the types of people making the changes, shifted a little bit.
And so initially it was the analysts building the ETLs, the people who were embedded: either product analysts would build ETLs for their own teams, or other analysts would, like sales analysts or marketing analysts. We didn't quite get to the point where HR analysts were building HR pipelines, but it kind of got there as well. And you can imagine that across each business line, there's maybe somebody who can own the data for all of those.
Eventually, around the 1,000-to-2,000-person mark is when data engineering and BI became a much bigger thing.
So BI teams really started getting built out, or a data engineering team, which was distinctly different from the data platform team: the data engineering team actually cared about the data and about building more sophisticated pipelines, whereas the platform team was purely on the software side and begrudgingly helped with the pipelines if necessary.
I think it worked decently well.
So the analysts and engineering teams were staffed out to other teams. You can call it embedded, you can call it assigned, whatever. That's probably where the balance between domain expertise and general system expertise sat; that's where the line was drawn.
Well, that's a super interesting distinction between the platform team working on the ecosystem versus the data engineering team focused on the data. Okay, Kostas, I used my follow-up and I will not change the rules again. It's your turn.
Oh, you can do that if you want. That's fine. Thank you, Eric. Thank you for the time. So Brian, you have worked in organizations that, by definition, have to work with a lot of data, right?
And I would like to stick a little bit to the theme of the episode, because it was about data quality.
Can you give us a couple of pieces of advice, or tools, on how you managed quality in these organizations? And, something else, I'll put my follow-up question in now: how big of a problem was it, in the end, for you to manage the quality of the data?
Quality is a much more open-ended question. It was probably something that we didn't really end up building that much software around; we built very simple checks, but it was really difficult to wrap our heads around what the right way to build quality software was: whether it should be stuff that's self-serve for individuals, or something that's centrally administered, kind of a security-room type thing where people are actually watching the screens. And it was tough, because initially we thought, oh, it should be self-serve, because we can't check everything, and people kind of know what they want to check. People came to us with very specific requirements, like, oh, we want this table to be very accurate within here, but these other tables we don't really care about. But I think that translating
what people wanted, and when they wanted it, was hard. Of the people who complain about data quality, or who get burned by it or suffer from it, only a fraction would have set something up beforehand.
And so I think we actually would have benefited, and these companies would probably benefit, from some more holistic thinking that these services are starting to provide.
I don't know exactly what they're doing, whether they're just scanning everything or something; it's some kind of combination of the two.
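As a rough sketch of the self-serve flavor, teams could declare what "accurate enough" means for the tables they care about and leave everything else unchecked. The table names and thresholds below are made up, and SQLite is used only so the example runs end to end.

import sqlite3

# Hypothetical self-serve registry: each team declares the tables it cares
# about and what "accurate enough" means; undeclared tables are not checked.
CHECKS = {
    "orders_daily": {"min_rows": 1, "max_null_rate": {"customer_id": 0.01}},
}

def run_checks(conn):
    problems = []
    for table, spec in CHECKS.items():
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        if rows < spec["min_rows"]:
            problems.append(f"{table}: only {rows} rows (expected >= {spec['min_rows']})")
            continue
        for column, max_rate in spec.get("max_null_rate", {}).items():
            nulls = conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
            ).fetchone()[0]
            if nulls / rows > max_rate:
                problems.append(
                    f"{table}.{column}: null rate {nulls / rows:.1%} exceeds {max_rate:.1%}"
                )
    return problems

# Toy data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_daily (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders_daily VALUES (?, ?)", [(1, 10), (2, None), (3, 12)])
print(run_checks(conn) or "all declared checks passed")

A centrally administered version would look more like the second approach described here: running something like this across every table on a schedule and having someone watch the output.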
Ultimately, where quality hit us the most was actually quality in terms of reliability, where data was late.
You would see zeros, and you wouldn't know if the zeros meant that the ETL didn't run, or that the ETL ran incorrectly, or that the problem was upstream.
That was a big debugging exercise. But I think if you took the number of incidents where somebody comes back and says your data didn't work, it was almost always because of an issue in the pipeline, probably that the pipeline didn't run, rather than, say, the code being messed up.
Because usually the code doesn't have as many bugs as the pipeline does.
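As a sketch of how that "zeros" ambiguity could be separated mechanically, assuming the orchestrator or warehouse exposes a last-run timestamp and a loaded-row count for the table (the field names here are made up):

from datetime import datetime, timedelta, timezone
from typing import Optional

def diagnose(last_run_at: Optional[datetime], rows_loaded: Optional[int],
             expected_every: timedelta = timedelta(hours=24)) -> str:
    """Separate 'the ETL never ran' from 'the ETL ran but loaded nothing'."""
    now = datetime.now(timezone.utc)
    if last_run_at is None or now - last_run_at > expected_every:
        return "LATE: the ETL did not run in its expected window -- check the scheduler"
    if not rows_loaded:
        return "EMPTY: the ETL ran but loaded zero rows -- look upstream"
    return f"OK: loaded {rows_loaded} rows at {last_run_at.isoformat()}"

# A dashboard full of zeros could come from either failure mode:
print(diagnose(last_run_at=None, rows_loaded=None))
print(diagnose(datetime.now(timezone.utc) - timedelta(hours=1), rows_loaded=0))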
Yep.
Makes sense.
Brian, thank you for joining us for the debrief. This is exciting,
our very first debrief. So congratulations on being the inaugural debrief guest.
Absolutely. Can't give you more hot takes yet. Maybe hot takes one day.
Maybe that's it. That's a tease. I'll give you the nice vanilla answers for now.
That's great. Yeah, we got to ease into it.
We got to ease into the real spicy stuff.
Yeah, I've seen you get worked up.
Cool. Thanks for joining us.
All right. Thanks.