The Data Stack Show - The PRQL: What’s the Hardest Part About Data Quality?

Episode Date: August 12, 2022

Eric and Kostas preview their upcoming conversation with James Campbell at Superconductive. ...

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show prequel, where we talk about the show we just recorded to give you a little teaser. Kostas, we talked with James from Great Expectations, which is a data quality tool. And it's really interesting. I think one of the things that was really interesting to me about the show was their approach to solving the data quality problem. A lot of the data quality companies we've talked with sort of sit on top of some repository of data, right? And then sort of detect changes, right? So it's on the data warehouse or the data lake or whatever, right?
Starting point is 00:00:39 And so it can detect variances, but it sits on top of a repository. And Great Expectations takes a different approach. They insert checkpoints based on very explicit definitions, right? So you insert checkpoints within a data flow. You've built data products,
Starting point is 00:00:57 you've built data pipelines. Do you think that there's merit to, well, actually, this is a better way to say it: do you think you need both methodologies, or is there one primary way that you would approach solving data quality? You are really asking hard questions today. I don't know what's wrong with you. My answer is, I don't know, to be honest. I think there's probably no one way that you can solve quality.
Starting point is 00:01:33 It's also, and I think we have discussed this a lot with all the data quality folks on this show, there are so many different aspects of it that you have to go after. So I would say, I mean, I don't know, but my guess is that, yeah, you probably need both. But it also depends on, let's say, the use case that you have, how you work with data, and also what kind of data you work with. What I find very interesting with Great Expectations is that they are not focusing only on the problem of running tests on the data. They also focus on helping people come up with the definitions and share the
Starting point is 00:02:25 feeling, which is, let's say, not a purely technical problem, but it's a very important aspect of the problem of quality. Is this the right way to do it? With these collaborative environments that they have with notebooks, or in a different way? I don't know. I think it's still early in the industry; there's more experimentation to happen there, and we'll see in the end what the market will adopt and use.
Starting point is 00:02:59 But I have to say that, regardless of how the solution will look, this problem of communicating and defining the expectations that each person has around the data is part of the problem, and probably the hardest part. So sooner or later, I think most of the vendors out there will have to address it somehow. Yeah, I agree. It was a super interesting episode. And to give another little teaser here, I think this makes Great Expectations really unique: when you write tests, they automatically turn into documentation
that's easily understandable by, you know, data consumers, which is a really unique approach. So definitely tune in to hear James talk about all things data quality and the way that Great Expectations solves the problem. And we will catch you on the next show.
