The Data Stack Show - The PRQL: What’s the Hardest Part About Data Quality?
Episode Date: August 12, 2022
Eric and Kostas preview their upcoming conversation with James Campbell at Superconductive. ...
Transcript
Welcome to the Data Stack Show prequel, where we talk about the show we just recorded to give you
a little teaser. Kostas, we talked with James from Great Expectations, which is a data quality tool.
And it's really interesting. I think one of the things that was really interesting to me about
the show was their approach to solving the data quality problem.
A lot of the data quality companies we've talked with
sort of sit on top of some repository of data, right?
And then sort of detect changes, right?
So it's on the data warehouse or the data lake or whatever, right?
And so it can sort of detect variances,
but it sort of sits on top of a repository.
And Great Expectations takes a different approach.
They sort of insert checkpoints
based on very explicit definitions, right?
And so you sort of insert checkpoints
like within a data flow.
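The checkpoint idea Eric describes here — explicit, named expectations evaluated inside the pipeline as data flows past, rather than a monitor sitting on top of the warehouse — could be sketched roughly like this. This is a hypothetical minimal sketch of the concept, not Great Expectations' actual API; the function names (`expect_not_null`, `run_checkpoint`) are invented for illustration:

```python
# Hypothetical sketch: a checkpoint is a set of explicit expectations that a
# batch of data must pass before the pipeline continues. Not the real
# Great Expectations API; names here are invented for illustration.

def expect_not_null(rows, column):
    """One explicit expectation: no row may be missing a value for `column`."""
    bad = [r for r in rows if r.get(column) is None]
    return {
        "expectation": f"{column} is not null",
        "success": not bad,
        "failures": len(bad),
    }

def run_checkpoint(rows, expectations):
    """Evaluate every expectation against the batch; raise if any fails,
    so downstream pipeline steps never see data that broke a check."""
    results = [check(rows) for check in expectations]
    if not all(r["success"] for r in results):
        raise ValueError(f"Checkpoint failed: {results}")
    return results

# A toy batch flowing through a pipeline step.
batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
results = run_checkpoint(batch, [lambda rows: expect_not_null(rows, "email")])
```

The point of the sketch is the placement: the check runs at a specific step in the flow, against the batch in hand, instead of scanning a repository after the fact.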
You've built data products,
you've built data pipelines.
Do you think that there's merit to like,
well, actually this is a better way to say it.
Do you think you need both methodologies or is there like one sort of primary way that you would approach solving data quality?
You are really asking like hard questions today.
Like, I don't know what's wrong with you.
My answer is like, I don't know, to be honest.
I think it's probably like there's no one way that you can solve quality.
It's also like, and I think we have discussed this a lot,
like with all the data quality folks on this show.
There are like so many different aspects of it that you have to go after.
So I would say that, I mean, I don't know, but my guess is that, yeah, probably you need both, but it's also like depends on like the, let's say the use case that you
have and how you work with data and also like what kind of data you work with.
What I find like very interesting with Great Expectations is that they are not
focusing only on like the problem of running tests on the data.
They also focus on helping people come up with the definitions and share them,
which is, let's say, not a purely technical problem, but it's a
very important aspect of the problem of quality.
Like, is this the right way to do it,
with these collaborative environments that they have with notebooks, or in a different
way?
I don't know.
Like, I think it's still early in the industry; there's more experimentation
to happen there, and we'll see at the end what the market will adopt and use.
But I have to say that, regardless of what the solution will look like, this problem of communicating and defining
the expectations that each person has around the data is part of the problem, and probably
the hardest part. So I don't know, sooner or later, I think most of the vendors out there
will have to address it somehow. Yeah, I agree.
It was a super interesting episode.
And I think one of the things to give another little teaser here is that
I think this makes Great Expectations really unique in that
when you write tests, they automatically turn into documentation
that's easily understandable by, you know, sort of data consumers,
which is a really distinctive approach.
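The "tests become documentation" idea mentioned here could be sketched as follows. This is a hypothetical illustration of the concept, not Great Expectations' actual Data Docs renderer; the `render_docs` function and the template strings are invented for this example:

```python
# Hypothetical sketch: each machine-checkable expectation also renders as a
# plain-English sentence a data consumer can read, so the test suite doubles
# as documentation. Not Great Expectations' real renderer.

def render_docs(expectations):
    """Turn a suite of expectation definitions into human-readable sentences."""
    templates = {
        "not_null": "Values in column '{column}' must never be null.",
        "between": "Values in column '{column}' must be between {low} and {high}.",
    }
    # str.format ignores unused keyword arguments, so one dict per
    # expectation can feed any template.
    return [templates[e["type"]].format(**e) for e in expectations]

# A toy expectation suite: the same definitions the checks would run on.
suite = [
    {"type": "not_null", "column": "email"},
    {"type": "between", "column": "age", "low": 0, "high": 120},
]
docs = render_docs(suite)
# docs[0] == "Values in column 'email' must never be null."
```

The design choice being illustrated: because the expectation is an explicit, structured definition rather than ad-hoc test code, the same definition can drive both validation and consumer-facing documentation.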
So definitely tune in to hear James talk about
all things data quality
and the way that Great Expectations solves the problem.
And we will catch you on the next show.