The Data Stack Show - 100: Data Quality Is Relative to Purpose with James Campbell of Superconductive
Episode Date: August 17, 2022

Highlights from this week's conversation include:
- James' role at Great Expectations (2:33)
- What Great Expectations does (5:49)
- How Great Expectations approaches data quality (7:01)
- Why a data engineer should use Great Expectations (16:41)
- Defining "data quality" (19:16)
- Translating expectations from one domain to the other (27:00)
- Community around Great Expectations (30:59)
- The user experience (33:41)
- Something exciting on the horizon (40:27)
- Interacting with marketers in a non-technical way (43:57)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show. Kostas, today we are talking with James from Great Expectations. Now, we've already talked with Ben from that
company and so we sort of have gotten some interesting thoughts on definitions around
data quality, etc. But Great Expectations is a fascinating tool. It's a Python library,
and they also have a command line interface.
And so the way that they approach the problem from a technical standpoint is super interesting.
One of my questions is going to be around, if we have time to get to it,
around how they think about the interaction between different parties within an organization
who need to agree on sort of data definitions, right?
That's like a huge thing with data, right?
You have some sort of variance from some sort of data definition.
So I want to hear their approach on that, both in terms of whether their product supports it, but also from a philosophical standpoint, because there are potentially some limits to what software can solve in that regard.
So that's my burning question.
How about you?
Well, it seems that you are going after the hard questions.
So I have to be the good cop this time.
You're usually the bad cop.
So, I mean, my intention is to talk with him a little bit more about the product and the
technology itself.
We had the opportunity with Ben to talk a lot about data quality and the need and all
that stuff at,
let's say, a little bit of a higher level.
So I think it's a great opportunity to get a little bit more tangible around the product: how it is used, what kinds of problems it solves, and in what unique ways these
problems get solved by Great Expectations.
So that's what I'm going to ask.
All right.
Well, let's dive in.
Let's do it.
James, welcome to the Data Stack Show.
Thank you so much.
I'm excited to be here.
All right. Well, give us your brief background and tell us what you do at Great Expectations.
I'm the CTO at Great Expectations and one of the co-founders of the project together with Abe Gong.
It's crazy to think it's been about five years now.
Oh, wow.
And it's been quite a journey, driven tremendously by community. And now the
company getting to focus on product is really delightful. Before working on Great Expectations,
I spent most of my career in the US federal government, specifically in the intelligence
community. And I was an analyst. So I did a lot of work on originally cybersecurity and understanding strategic cyber threats and then broader political modeling.
And in both of those domains, I had a really exciting chance to be able to move back and forth between very quantitative and very qualitative types of analysis.
You know, I sometimes joked that some of my job was in Microsoft Word, and
then I'd have a job in Microsoft Excel, and back to Word, and back to Excel.
Obviously not just Excel for those data volumes, but that's been a lot of how I've
gotten to spend my time.
And now it's just a delight to work across, again, so much of the domain at Superconductive.
Yeah, very cool.
Tons of questions about Great Expectations, but really quickly, it's always interesting to hear about
things like political modeling, etc. Right? Because I think those of us who haven't done it,
which is most people, subconsciously have this idea of what
it's like from the movies, you know, secrets and all that sort of stuff.
But the thing that's really important to remember
in that field is that there's a whole bunch of different sources for how you
build models. And I think a lot of contemporary machine learning and AI focuses on the
structure of data, trying to use data as the driving factor in building an understanding.
And, you know, at least in my experience, a lot of the practical modeling applications
are still very much driven by significant domain expertise being put into the model
itself.
The structure of the model plays a significant role.
And so maybe one way to say it is, big data was all the rage, and there are still significant
worlds where it doesn't take a lot of data. In some ways, the defining characteristic
of the intelligence world is that maybe there's just one critical piece of information that
changes everything, and you're in the hunt for that. Super interesting. Well, I could go on
and on about that, but I want to talk about some of the guts of great expectations.
So we talked with Ben and had some really good chats around sort of definitions around
data quality, et cetera.
And so I'm excited to dig into sort of the technical details.
So, first question about Great Expectations.
Actually, before you get going, could you just give us a super high level:
what does Great Expectations do, for our listeners who may not be familiar?
Absolutely.
Great expectations gives users the ability to make verifiable assertions about data.
So it allows you to define what you expect.
It also helps you learn what to expect from previous data. And then we can test those expectations against new data as it arrives and produce clear documentation describing whether or not it meets the expectations and when it doesn't, what exactly is different.
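What "verifiable assertions about data" means can be reduced to a small sketch. This is a from-scratch illustration in plain Python, not the Great Expectations API; only the verbose naming style is borrowed:

```python
# Illustrative sketch: a verifiable assertion about data that, on failure,
# reports exactly what is different (this is not the Great Expectations API).

def expect_column_values_to_not_be_null(rows, column):
    """Check an expectation against a batch and explain any failures."""
    bad = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "expectation": f"expect_column_values_to_not_be_null({column!r})",
        "success": not bad,
        "unexpected_count": len(bad),
        "unexpected_rows": bad,  # which rows failed, so the report is explainable
    }

batch = [{"id": 1, "amount": 9.5}, {"id": 2, "amount": None}]
report = expect_column_values_to_not_be_null(batch, "amount")
print(report["success"])           # False
print(report["unexpected_rows"])   # [1]
```

The point of the structure is the last two fields: a failed check carries its own diagnosis rather than just a pass/fail bit.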
Super cool.
Okay, so this is my first question.
And I have actually, you know, in looking at the show on the calendar, I've been so excited to ask you this question. So data quality is a broad problem and there are a number
of ways to solve it, right? I mean, even including just sort of brute force SQL with, you know,
really raw, messy data on the warehouse, right? Which everyone, you know, hates.
But what's interesting to me, sort of, if I can put it this way about like
the geography of the data stack as it relates to data quality, is you can address the issue
of data quality in multiple places, right? And maybe you do want to address it in multiple places.
So, the two-part question. The first part is: where does Great Expectations sit
in the stack, in sort of the geography, you know, of the data flow?
Great question. And I think the answer is sort of everywhere. But it's not the same
expectations that will exist everywhere. So when you think about
the stack that you described, the stack of data,
I think there are two pretty distinct things
that happen during that process.
One is data kind of moves between systems
and is potentially enriched or augmented
or clarified along that process.
And the other is that data is synthesized, right?
You have a roll-up, you have an analytic running, you have a new model running on data.
And the output of all those things, it's another data set.
So there's two pretty distinct operations in a data system in that way.
And for each of those types of operations,
great expectations helps you both protect the inputs and protect the outputs.
And actually, one of the things we've talked about on our team is that what
makes Great Expectations powerful, but also what makes it a challenge to ensure we're
making it easy for users to use effectively, is helping to
differentiate those
different ways that people are addressing data quality problems.
Super interesting.
Okay, so can you dig in one click deeper?
So those two sort of key points where data is moving between a system and then data is
being synthesized to result in another data set, how does Great Expectations interact with those two points? Because one is sort of...
well, actually, this is more of a question for you. If data is moving between systems,
it can either be raw data that's maybe being batched out to go into
a warehouse, right?
Or there actually could be transformations happening, et cetera.
So we'd love to hear about that.
And then, you know, also the flip side where, you know, data is being synthesized.
Yeah.
So I think for the first case where data is moving through transformation or enrichment, I think of that as being really applicable to what I'd call a contract model.
So, you know, there are vendors that
provide data and the fact that we can go out and buy data sets that are curated and by definition,
high quality, for example, it could be stock data, it could be health insurance records,
it could be any number of different, it could be weather data. There's all kinds of data sets that
have been processed and have
characteristics that make them valuable for certain kinds of decisions. So the first thing in there is
being able to ensure that both parties understand what they're getting, right? Like when you're
buying something, we want to contract about that. We want to know, hey, I have a column and this is
what it should look like. Now, in the past, a lot of the way we dealt with that was we had these endless giant coordination meetings. And
I kid you not, I've been the recipient of, I think, a 175-page diagram describing a
data set that we were buying. And it was like, what do I do with this? Right? So a big
part of what we're doing there is we're making it possible for you to just agree in
very precise terms that are self-healing.
The documentation, that 175-page PDF is self-healing because the biggest problem with those things
is that they immediately get out of date.
But by making that contract a living artifact, something that can be tested as data continues
to flow, it can be immediately flagged when there's a problem,
and then we can also update the contract.
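The living-contract idea can be sketched like this (illustrative only; the JSON contract format here is invented for the example, not a Great Expectations artifact):

```python
# Sketch of a "living" data contract: the contract is a machine-checkable
# artifact, so every new delivery is re-validated against it instead of the
# agreement going stale like a 175-page PDF.
import json

contract = json.loads("""
{
  "columns": {
    "ticker": {"type": "str", "nullable": false},
    "price":  {"type": "float", "min": 0}
  }
}
""")

def validate(batch, contract):
    problems = []
    for i, row in enumerate(batch):
        for name, spec in contract["columns"].items():
            value = row.get(name)
            if value is None:
                if not spec.get("nullable", True):
                    problems.append((i, name, "null not allowed"))
                continue
            if type(value).__name__ != spec["type"]:
                problems.append((i, name, f"expected {spec['type']}"))
            elif "min" in spec and value < spec["min"]:
                problems.append((i, name, f"below min {spec['min']}"))
    return problems

delivery = [{"ticker": "ACME", "price": 12.0}, {"ticker": None, "price": -1.0}]
print(validate(delivery, contract))
# flags the null ticker and the negative price immediately
```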
So with respect to the second thing, the analogy that I think of there is kind of like from physics,
you know, the concept of an emergent property. So if you look at a volume of air, you can think about,
okay, where are all the molecules?
What are their characteristics?
And those might be the columns, you know, this one is at location X, Y, Z and
has momentum alpha and so forth.
But in an analytic context, what we're doing is we're looking at a higher
order property: pressure, volume, right?
We don't need to look at all those individual records anymore. And that's what
a model is doing, right? It's like taking all these individual pieces of information,
synthesizing them together. And the key thing that happens there is that the nature of the
information is completely different. And we're reasoning about a different quantity. So I'm not
reasoning anymore about Xs and Ys and Zs. I'm reasoning about pressures and volumes.
And being able to support that kind of transition is really, really important.
And it's one of our critical goals and why we have done a lot of investment in supporting
a contributor gallery, for example, where people can define expectations that are meaningful
for them.
Like, you know, expect this column to be in a particular geography, right? So we're not saying
it has to be an X value and a Y value. We're saying it needs to be in New York.
Like if this is a lat long, it needs to be in New York. And that reflects how we think about data.
And it helps you move to that emergent property, which is, I think, where data quality
really needs to be as a field,
because that's where we're helping stakeholders really get to the value they need.
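A rough from-scratch sketch of such a domain-level expectation follows. The bounding box is an approximation of New York State chosen purely for illustration, and this is not how Great Expectations implements geography checks:

```python
# Illustrative sketch of a domain-level expectation: instead of asserting raw
# X/Y ranges, the expectation speaks the user's language ("in New York") and
# translates that into a coordinate check internally.
NY_BOUNDS = {"lat": (40.5, 45.0), "lon": (-79.8, -71.8)}  # rough approximation

def expect_points_to_be_in_new_york(points):
    (lat_lo, lat_hi), (lon_lo, lon_hi) = NY_BOUNDS["lat"], NY_BOUNDS["lon"]
    outside = [p for p in points
               if not (lat_lo <= p[0] <= lat_hi and lon_lo <= p[1] <= lon_hi)]
    return {"success": not outside, "unexpected": outside}

points = [(40.71, -74.0),    # New York City
          (34.05, -118.24)]  # Los Angeles: should be flagged
print(expect_points_to_be_in_new_york(points))
```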
Yeah. Okay. So super helpful. And I love the physics analogy, you know, sort of the
individual components that make up something like pressure is a super helpful analogy.
The second part of the question is why you chose to solve it that way. And I would
love to hear you talk about maybe ways that you had seen it solved before and then why you decided
to sort of structure great expectations, you know, the way that you did. That's such a rich question.
I love it. The first thing is like ways I've seen it solved before.
And I think actually one of the important things
is that it's not just before.
When we encounter users of Great Expectations,
I actually consider it a point of pride
that many of them are like,
oh, I've written something like this.
Like, I've solved this.
I've written the test for nullity, for volume,
for means, for stationarity of a distribution. And the reason I think that's really a good thing
is that it reflects the fact that we're kind of in tune with how people process the world.
Now, what's the key difference? The key insight that makes how we're solving the problem different is that we're providing what I would call a general-purpose language. We like to
call it an open shared standard. I mean, some of the hallmarks of
Great Expectations are that the names of expectations are incredibly verbose, and people love it. I mean, I love it. It's very precise, you know, expect_column_kl_divergence_to_be_less_than. It's a long name,
sure, but it means something,
and it helps people really express their expectation.
So why that? You asked this question, why that? What does it give you? I think one of the most important things it gives you is explainability.
So when I get back, when I see that some piece of data maybe doesn't match my expectation,
then it can explain what the expectation was because I told it what the expectation was.
And we can go into, like if we get in,
you know, we should dive into kind of some of the more technical details
because what I don't want to suggest
is you have to sit at your keyboard
and type expect this 100,000 times.
Like, no, you don't need to do that.
But what we can do is really make it easy
for you to get that very explainable report back
of, all right, you know,
this column is supposed to exist,
and it doesn't, right?
Which in many cases,
those are like the real problems that break dashboards or something.
Yeah.
Yeah.
It makes total sense.
And the verbose naming couldn't be better aligned with sort of the Dickens reference.
I'm impressed.
Yeah.
You're totally right.
That's great.
Super long, you know, one page sentences.
Hopefully your expectation, you know, and great expectations isn't a page long.
Okay.
Kostas, I've been, I've been stealing the mic.
Yeah, you did, but it's fun.
So you can continue doing it if you want.
It's fine with me.
I have a few questions to ask, and I'd like to focus a little bit
more on the product experience first, and also on the problem
that would lead someone to a solution like Great Expectations.
Many times we assume that everyone who listens out there, they are aware of why we are doing
the things that we are doing, but that's not always the case.
So I would like to start from the very, very basics.
Let's think of, and I'd love to hear it from you, the work of a data engineer, or whoever
is, let's say, the person who faces problems that can be solved with Great Expectations, and
go through a small scenario until we reach the point where we can say, yeah, now we can talk about Great Expectations and how this problem can be solved with it.
So can you do that for us, please?
I can do my best.
I think there are a lot of different ways, but one of the key whys, why do people turn to this tool,
is that they want to get ahead and be proactive instead of reactive. Already, a lot of data
engineering teams face this. We call them
data horror stories sometimes, and I think other people have used similar terms, and these are out there all the time. You get a call: hey, my dashboard is broken.
And when somebody says my dashboard is broken, I think it's useful
to think about the way they're seeing the world. So, you know, a great example would be:
a salesperson, Northeast region, sales show zero.
I know that's not true.
And the reason I like framing it in that way
is like they had an expectation.
Like I was out there, I made the sale.
I saw, I wrote the ink down.
I know it's not zero.
I expected it to not be zero, but it's zero.
So the data engineering team turns to a tool
like great expectations in that case
because they want to be ahead.
So they don't want to get the call, hey, the dashboard is broken. They want to see
the issue first and be able to resolve it before it ever becomes this breakage or embarrassment.
That is one of the really common problems. Another really common problem is the PagerDuty
problem.
In that first example I gave, it's a semantic failure, right? The dashboard
ran, there is a number in the cell, it's just not the number the person expected.
Other times you have, you know, a schema mismatch or a load that totally failed or something like that, where the key thing you're trying to solve is: I don't want to get a page at midnight.
And if I am responding to a problem, I want to have the diagnostic information that I need to be able to get to a solution right away and to be able to
zero in on where the problem actually happened.
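The two failure modes James distinguishes, a hard schema failure versus a semantic failure where the pipeline ran but the number is wrong, can be sketched as two kinds of checks. This is a plain-Python illustration, not the Great Expectations API:

```python
# Illustrative sketch: a schema check catches the "load totally failed" class
# of problem; a semantic check catches "sales show zero, but I know that's
# wrong". Each returns diagnostics so you can zero in on the cause.

def check_schema(rows, required_columns):
    missing = [c for c in required_columns if rows and c not in rows[0]]
    return {"check": "schema", "success": not missing, "missing": missing}

def check_semantics(rows, column, region):
    total = sum(r[column] for r in rows if r["region"] == region)
    return {"check": "semantic", "success": total > 0,
            "detail": f"{region} {column} total = {total}"}

rows = [{"region": "Northeast", "sales": 0},
        {"region": "West", "sales": 120}]
print(check_schema(rows, ["region", "sales"]))      # passes
print(check_semantics(rows, "sales", "Northeast"))  # fails before anyone calls
```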
Yeah, and I think there are variations on those, but they're
both forms of being able to be proactive in addressing the core function that
you're trying to serve as a data organization.
Yeah, absolutely.
Okay, I have a slightly trickier question now.
Tricky also for me to communicate in the best possible way.
So when we are talking about quality in general and quality in data,
I'm always, let's say, confused or I'm going back and forth between two definitions.
One definition, and I'll start using a little bit of metaphor here, is more like the stamp of QA
that we have on products, which is more about the consumer of the product, in this case data,
to trust the product that is getting used, right?
So that's one reason,
one way that we implement quality,
as humans in general, in products.
The other is to use tools like Great Expectations as a debugging tool, right?
As a way for the developer or the engineer,
or whoever is responsible
for anything related to the consumption of the data, to figure out what is wrong.
Now, obviously, both fall under data quality in general.
Which one of these two, let's say, definitions of quality do you think is closer to what is needed today?
That is a tricky question.
Good.
I think I almost need to add a third one.
Oh, let's do that.
In order to be able to like flesh them out.
Yeah.
And it's very similar to debugging, but just sticking on the
theme of proactivity for a moment, it's the proactive debugging, i.e.
it's the way that you generate your expectations in the first place.
And the reason I think that's really critical is like you alluded to this in the first definition, you know, quality is a term that is relative to a purpose.
You know, in some ways, quality is like your fitness for doing some particular job.
And so it will vary like that.
You know, the same data is high quality for purpose one and not high quality for purpose two, for example. And so being able to support the process of generating your understanding, kind of
your mental model of the world, I think is actually one of the most important things that
the broader data quality ecosystem can do.
So, okay.
So I added the third, and maybe what that lets me do is say, look,
they're all important. It's really more a matter of phase. And to be honest with you, this is
something we've been talking about a lot on our team internally lately: making sure
that we're not trying to have the same conceptual objects, in this case
literal code objects, the same objects in Great Expectations, do too much, but rather exposing
APIs and interfaces that are more intuitive for the way that you're using the tool.
And that might be in this phase of I'm building and creating my expectations.
It might be in the phase of I am, you know, kind of ensuring quality, i.e. performing that QA function to make sure that I'm going to meet the needs of the product or the downstream data consumer.
And then it also might be I'm performing a debugging task.
Now, that last one is pretty challenging for us today because it's very interactive and it's also interactive on a potentially historical dataset, and so there's a lot of nuance there that we should dive
into if we're going to get into more of the technical detail.
I don't know if that...
I basically sidestepped your question and I said all three, but...
I don't know.
It's fine.
I think it's very...
first of all, there's value in adding the third dimension there.
And to be honest, I don't expect...
I mean, I think these questions are a work in progress, both the questions and the answers.
And I think it's important to ask these questions, even if we don't have
definitive answers right now, because we need to think about this stuff.
And I think it's important also for the people, not like us who, at the end,
sell products around this stuff, but
the people whose job it is to ensure the quality of the data, or deliver, let's say,
data sets to other people to work on, to
ask the right questions and have the right, let's say, conceptual models around that
kind of stuff. So I think there's always
value in these conversations, even if, let's say, the answers are not yes or no. It's fine. It's
good. It's important to have the conversation.
To that end, actually, I think there's one
part of the quality as QA that is really important, but that is, I think, a little
bit less obvious or less clear in a lot of platforms that kind of provide data quality.
And that's that it's really a two-way street.
I mean, there's the provider and the consumer of data, but also, if I'm providing an analytic model or a curated data set or a
dashboard or just a data product, right, a giant collection of records, it's actually
potentially very useful for me to be able to package together with that a description
of what I think this is good at doing. And similarly, if what I'm providing
is a model without data,
it's actually very valuable for me to be able to say,
when you use this model,
make sure that your data looks like this.
And that way you can clarify it.
It's like IKEA: IKEA is really good
because the instructions are simple in some ways. Or Lego.
Think about Lego, these incredible instructions.
Imagine being able to give the consumer of a complicated product, like Lego does, really elegant instructions based around how the data should look.
That's a very interesting point.
How do you think we can get there?
Because I don't feel like we have that today, right?
Yeah. I mean, well, I think the answer is going to be that we allow people to provide expectation suites together with their products that are validated at the time that the person brings in their own data.
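One way to picture that is a plain-Python sketch, not any actual Great Expectations feature: a model shipped together with expectations about its inputs, which are checked before the consumer's own data is run through it. All the names and thresholds here are invented for the example:

```python
# Illustrative sketch: ship a model with "instructions" about its inputs,
# so a consumer can verify "your data looks like this" before using it.

class PackagedModel:
    def __init__(self, predict, input_expectations):
        self.predict = predict
        self.input_expectations = input_expectations  # the packaged instructions

    def check_inputs(self, rows):
        failures = [name for name, check in self.input_expectations.items()
                    if not all(check(r) for r in rows)]
        return failures  # empty means the data matches what the model expects

model = PackagedModel(
    predict=lambda r: r["age"] * 0.1,  # stand-in for a real model
    input_expectations={
        "age_is_present": lambda r: r.get("age") is not None,
        "age_in_training_range":
            lambda r: r.get("age") is not None and 18 <= r["age"] <= 90,
    },
)

print(model.check_inputs([{"age": 30}, {"age": 45}]))  # []
print(model.check_inputs([{"age": 30}, {"age": 7}]))   # ['age_in_training_range']
```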
Sure. But okay, let's get a little bit deeper into that, because it also touches the
product experience.
So an expectation might be interpreted in a very different way from a technical point
of view than from, let's say, the point of view of the consumer of the data who might
not be a technical person.
Let's say we have a marketeer.
At the end, the marketeer wants to know, can I trust that the segmentation that I'm
going to do on this data is actually representative of the reality that I'm
addressing out there, but on the other hand, the person who is the data engineer
probably doesn't even know what segmentation is or they shouldn't care.
Right?
It's not their job to do that.
Or their brains are not trained to think that way.
Maybe they think more in terms of the skewness of the data, or do we have
nulls or not nulls, or, I don't know, these kinds of parameters.
So how do we bridge these two?
How do we, maybe "semantically translate" is the right way to say it,
translate the expectations from one domain to the other, right? So we can apply them at the end.
I mean, that's such a beautiful question for me. Like to me, that is the core of what we're
doing. And so like, let's dive in a little bit of how Great Expectations does that.
And it's basically that. Specifically, the way we decompose things:
we have a core object, one of the key concepts in Great Expectations, called the expectation.
And what it does is it's a semantic translation machine.
We often call it a grammar.
It's this long, verbose sentence.
Now, I definitely don't want to suggest that this is an easy, solved problem that you can just go pick up.
But let's take your hard example of a marketer who says,
you know, what they're saying is, expect this segmentation to make sense, right?
That's kind of how they're thinking about the world.
So we have to decompose that into what we call metrics. And so what an expectation does is it asks for metrics about the data.
And metrics are a very general concept in great expectations.
So it doesn't have to just be like a statistic.
Mean could be a metric.
Number of nulls could be a metric.
But so could number of nulls outside of a particular range, or country code of lat-long pairs.
These are also metrics.
So it's a pretty general compute engine under the hood.
So the expectation author is providing this verbose,
declarative language.
And then they're also doing the translation into what are the metrics that make this mean that.
And then Great Expectations
is sort of an orchestration engine
that goes out, reaches and touches the data,
finds the values of those metrics,
does the comparison and reassembly,
and then surfaces the result in the language in which the
marketer was thinking about the world.
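That decomposition, expectations declaring the metrics they need and an orchestrator computing and reassembling them, can be sketched as follows. This is illustrative, not the internal Great Expectations metric API, and the segmentation expectation name is invented for the example:

```python
# Illustrative sketch: an expectation asks for metrics; a tiny "orchestrator"
# computes the metric values against the data, does the comparison, and
# surfaces the result in domain language.

METRICS = {
    "null_count": lambda rows, col: sum(1 for r in rows if r.get(col) is None),
    "row_count":  lambda rows, col: len(rows),
}

def expect_segment_column_to_be_complete(rows, col, max_null_fraction=0.0):
    # The marketer's "expect this segmentation to make sense" is translated
    # into metrics: how many nulls, out of how many rows?
    nulls = METRICS["null_count"](rows, col)
    total = METRICS["row_count"](rows, col)
    success = total > 0 and nulls / total <= max_null_fraction
    return {"success": success,
            "message": f"{nulls} of {total} rows have no '{col}' segment"}

rows = [{"segment": "enterprise"}, {"segment": None}, {"segment": "smb"}]
print(expect_segment_column_to_be_complete(rows, "segment"))
```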
Okay.
Okay.
This is great.
How do we assign metrics?
How do we come up with the right metrics for the great expectation from the great marketeer?
Yeah.
Well, I think our answer to that is community.
And so we have what we call our Expectations Gallery, where we're trying to encourage, and we want
to encourage, a robust process of community engagement, for people to be able to expand
the vocabulary of expectations to include things that make sense in
their domain and that do these metric translations. And so in order to do that, they're adding new
expectations, they're adding new metrics. But what we're trying to do is make
sure that we're providing the substrate for expressing that, the mechanism for
allowing people to express it, but then letting them take ownership
of the semantic and domain model.
And, you know, the short version, of course,
is that there isn't a right answer to the question,
is this the valid segmentation?
It depends on the organization.
And so one of the things we see a lot is what we call custom expectations.
So, you know, yes, schema, nullity, volume, all these things, people use those expectations a lot.
But also they say, okay, well, I want to say expect column...
and I'll pick an example
we've used a lot: expect column values to be a valid ICD code.
And under the hood, that might just be translated
into a fancy regex, but the
fact that there's that translation is actually
really important, because that's what makes it usable
to the
marketer consumer, in our case.
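A plain-Python sketch of such a custom expectation follows. The regex captures only a simplified ICD-10-like shape and is not a complete validator; the function name simply mirrors the custom-expectation naming convention:

```python
# Illustrative sketch of a custom, domain-named expectation backed by
# "a fancy regex". The pattern below is a simplified ICD-10 shape
# (letter, two characters, optional dot-suffix), for illustration only.
import re

ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")

def expect_column_values_to_be_valid_icd_code(rows, column):
    bad = [r[column] for r in rows if not ICD10_SHAPE.match(str(r[column]))]
    return {"success": not bad, "unexpected_values": bad}

rows = [{"dx": "E11.9"}, {"dx": "J45"}, {"dx": "not-a-code"}]
print(expect_column_values_to_be_valid_icd_code(rows, "dx"))
```

The domain user reads "valid ICD code"; only the expectation author ever sees the regex. That is the translation James describes.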
Yeah, absolutely. Okay.
That's super interesting.
And how is the engagement
of the community around that? What
have you seen so far? I mean, obviously I'm aware of the great community that Great Expectations has, but what have
you seen happening there, and what have you seen working and what not?
Sure. And, you know,
the power of open source is one of these things that just blows my mind over and over. So I certainly
can't suggest that I can wrap my head around all of it. Some of the things that work, we've done
hackathons and we've, like I mentioned, produced this gallery. We're doing some experiments in the
space of what we call packages, where we have, you know, domain leaders in a field
who are willing to commit and say, these are expectations that are
useful and valuable for understanding the kinds of concepts that are
relevant in their domain.
And an example of a community-contributed project that's really been interesting is a data profiling framework, where there are expectations built around the Capital One Data
Profiler. And they use that kind of as the semantic engine to infer types and allow you
to make expectations like, there shouldn't be PII here. And then that gets translated
through. Now, we didn't build that on our team,
but we are excited about supporting that community.
Now, your question was also, where are the challenges?
And, to be honest, yeah,
there are still challenges, right?
There are lots and lots of expectations out there,
so there's a discovery problem, there's synthesis,
there's a lot of work left to do in that space,
or not left, there are a lot of opportunities still
to be had for helping people engage.
And that's where, you know, what I would really emphasize is
our goal of making this a shared standard that people can engage
on together and, you know, improve.
So that's an exciting area.
That's one area.
There's lots of other exciting things we're working on too, but that's definitely one.
Yeah, yeah.
Well, these are like very, very super interesting, like, parts of like growing
and building and like having actual, like the community as part of the
product experience itself, right?
It is part of the product at the end.
Uh, anyway, that's another conversation for another time.
Like we can discuss a lot about the community. So what I want to ask you now is like, okay, we talked about the problems, like how you
think about like the solution, but let's talk a little bit more about like the experience
that the user has with great expectations.
So how, how do I use great expectations?
I mean, I have like, let's say, an idea right now that there are some
expectations somewhere that I'm like testing against my data. But like, how do I operationalize
great expectations as part of my like day to day job, like as a data engineer?
Yeah, that's a great question. We think of it as there's sort of four key steps to what it means to use great expectations.
And we call it like our universal map.
First, just to be like really explicit on it,
great expectations is a Python library.
So, you know, you run a pip install
in the case of great expectations today.
Now, you know, not to get into like the commercial aspect, we are building a cloud
product as well that's designed to make it very accessible, especially just expanding the reach
of people, but also simplifying the setup, but just setting it up. So we set up, we run our
pip install. Next step is connecting the data. Now, this is where it could mean I'm just going
to grab a batch of data, like I'm going to read a CSV off my file system and work with this data.
There are also, there's more you can do to connect to data where you also configure the
ability for Great Expectations to understand what your assets are and how they're divided into batches, so that as new batches of data come in, we can understand the stream of data as a unit. So, you know, you connect to data. And to be honest with you,
that's an area where I see us needing to do some, some work,
like some of the magic of great expectations early on came from the fact that
connect to data was a one liner read CSV.
And as we've added the power and expressivity around ensuring that you can understand batches and so forth, it has become a little more difficult there.
And so that's one of the things we're actually working on right now is like bringing that kind of magical viral experience back.
Anyway.
So next thing I do is I connect my data.
It's like literally add a data source.
It could be, you know, pointing it at an S3 bucket, say,
or connecting to a database, to a warehouse.
And that's one of the important things about Great Expectations
is we work across all the different backends, right?
All the SQL dialects, Spark, Pandas in memory,
you know, pulling data in from S3, all that, any of those things.
So we do that configuration.
Next thing is you create your expectations.
And for us, that's a notebook experience.
So you're in a Jupyter notebook.
You have this sample of data.
And it's an interactive, real-time experience.
I say, expect column values to be not null.
And I get an immediate response.
We check that right away.
And it says, hey, success.
Or, actually, 5% of these values are null.
And so what we see there is this interactive exploratory process where you're creating expectations.
Another way you can do that is with profiling, where you ask great expectations, you know, go out and build a model of this dataset and propose, you know,
a long list of expectations back to me and I'll choose which ones to accept, maybe all
of them.
And then that will become my expectation suite.
So create expectations.
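The interactive loop James describes, declare an expectation against a sample and get an immediate pass/fail result, can be sketched like this. This is a toy illustration of the idea, not the real Great Expectations API (the function here just mimics the shape of a real expectation result):

```python
# Toy sketch of the create-expectations feedback loop: check an
# expectation against a sample of data and get back a result that says
# whether it passed and what fraction of values were unexpected.

def expect_column_values_to_not_be_null(rows, column):
    """Check every value in `column`; report success and % unexpected."""
    values = [row.get(column) for row in rows]
    nulls = [v for v in values if v is None]
    pct_unexpected = 100.0 * len(nulls) / len(values) if values else 0.0
    return {
        "success": not nulls,
        "unexpected_percent": pct_unexpected,
    }

sample = [
    {"user_id": 1}, {"user_id": 2}, {"user_id": None}, {"user_id": 4},
]
result = expect_column_values_to_not_be_null(sample, "user_id")
print(result)  # {'success': False, 'unexpected_percent': 25.0}
```

The real library exposes expectations with this naming pattern (`expect_column_values_to_not_be_null` is the one James says aloud), and the interactive "hey, success / actually 5% are null" response comes from exactly this kind of immediate check against a batch.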
And then the last step is the validation.
So what you do typically, what we see people run is what we call a checkpoint, and they'll embed that into an Airflow pipeline or into a Prefect pipeline, or, you know, wherever they're running their validations. And, you know, it's just an operation: great expectations, run checkpoint.
And what that then does is produce a validation result, which we convert
into a webpage
that you can share and post with your team
that says, here were the expectations.
These ones passed.
These ones didn't pass.
If they didn't pass,
here's some samples,
examples of what went wrong.
And so you have that very tangible,
shareable, visible report
of what the state of your validation
was. So that's how great expectations gets used in practice.
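The checkpoint step, run a saved suite against a new batch and produce a shareable pass/fail report with samples of what went wrong, can be sketched as below. All names here are hypothetical stand-ins for illustration, not the real Great Expectations internals:

```python
# Toy sketch of a checkpoint: apply every expectation in a suite to a
# new batch of data and produce a validation result listing which
# expectations passed, which failed, and a few failing samples
# (the raw material for the rendered, shareable report).

def run_checkpoint(batch, suite):
    results = []
    for exp in suite:
        column, check = exp["column"], exp["check"]
        bad = [row.get(column) for row in batch if not check(row.get(column))]
        results.append({
            "expectation": exp["name"],
            "column": column,
            "success": not bad,
            "unexpected_samples": bad[:3],  # examples for the report
        })
    return {"success": all(r["success"] for r in results), "results": results}

suite = [
    {"name": "values_not_null", "column": "amount",
     "check": lambda v: v is not None},
    {"name": "values_positive", "column": "amount",
     "check": lambda v: v is not None and v > 0},
]
batch = [{"amount": 10.0}, {"amount": -2.5}, {"amount": 3.0}]
report = run_checkpoint(batch, suite)
```

In the real tool, this validation result is a JSON artifact that gets rendered into the Data Docs webpage James describes.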
Okay, that's super cool.
And you mentioned that like you support like pretty much, let's say, like every
backend out there.
How does this translate in terms of like interacting with these backends?
So let's say, for example, like I have my data on a data lake on S3, right?
Or the same data,
I might have it like on a snowflake.
So first of all,
is the experience the same
that I get for great expectations,
like regardless of what I have
like as a backend?
That's one of the pieces of magic
is, you know,
we talked about that trend,
like what great expectations is doing is translating between expectations and metrics.
And then, one layer deeper than that, translating the request for the metric, say the mean, into the SQL dialect that will give us back that metric.
Or if we're in Spark, we'll translate that
into the appropriate Spark command to say,
give me the mean value back.
So great expectations is handling that
and then sort of re-bubbling it back up
into the semantic layer for you.
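The translation James describes can be sketched as a dispatch: the expectation itself only asks for a metric, and a backend-specific provider decides how to compute it, in-memory arithmetic for a local batch, or a generated SQL string for a warehouse. These function names are illustrative, not the real Great Expectations internals:

```python
# Toy sketch of expectation -> metric -> backend translation.

def mean_metric_python(rows, column):
    # In-memory backend: compute the mean directly.
    values = [row[column] for row in rows]
    return sum(values) / len(values)

def mean_metric_sql(table, column):
    # SQL backend: emit dialect-appropriate SQL instead of computing locally.
    return f"SELECT AVG({column}) FROM {table}"

def expect_column_mean_between(metric_value, min_v, max_v):
    # The expectation is backend-agnostic: it only ever sees the metric.
    return {"success": min_v <= metric_value <= max_v}

rows = [{"price": 10.0}, {"price": 20.0}, {"price": 30.0}]
m = mean_metric_python(rows, "price")          # 20.0
check = expect_column_mean_between(m, 15, 25)  # {'success': True}
sql = mean_metric_sql("orders", "price")       # "SELECT AVG(price) FROM orders"
```

This separation is why the experience stays the same whether the data lives in S3-backed files, Snowflake, Spark, or an in-memory Pandas frame: only the metric provider changes, and the result bubbles back up to the semantic layer.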
Okay, and this is like something that's sort of interactive,
or it's something that, let's say, I run the expectation every one hour.
Is the result kept somewhere so I can go and see how the expectation has changed in time?
How does this part work?
Because I can see they're generating even more data at the end.
Right.
Yeah, totally right.
Wow.
This is such a rich area for us to continue to build on, to be honest with you.
So today, the way that works is the core validation result is a big kind of JSON artifact. Like I mentioned, we do render that and translate it into HTML. So if you go to your generated Data Docs site, you'll see a list of all the validations that have run. Now, what we're doing right now in the cloud product is providing a much more interactive, rich, linkable experience.
You can't really do that in open source. Like, when you're producing a JSON report, it's just hard to have that kind of referential structure, all those references between different elements of an object for the data. So we're making that more possible in a cloud environment. But in the open source, what you can do is you can absolutely get that list of all the things, you know, you'll see a little X, like this one failed this time and this time and this time, and then it passed this time and that time and the other time.
Okay.
That's pretty cool.
All right.
So one last question from me, and then like, I'll give the microphone back to Eric.
I think I punished him enough for abusing the ownership of the microphone at the beginning.
So share something exciting that is coming in the future for great expectations.
I am absolutely thrilled
about the cloud product.
And I know that probably sounds like,
oh, of course he's from...
You know, it's because we can...
You know, it's like...
We have a new concept available to us.
And what is that concept?
It's the user.
In a library,
we don't really have access to the user.
We don't see who you are
and when you're interacting.
You're interacting with files
and going to these static webpages.
Whereas when we're,
like where we really want to go
is we want to facilitate collaboration.
At the end of the day, quality is fitness for purpose.
We talked about contracts at the beginning and ensuring that people are on the same page.
And so what I'm really, really excited about is the world in which
you and I are sharing something, sharing a piece of data,
and we both have our expectations about it.
And we can just say, hey, let's go look at that validation result together.
Drop in a comment like, oh, you know what?
This expectation should be a little bit different.
But we're turning data quality into more of a collaborative enterprise.
That's really exciting to me.
Yeah, super, super interesting.
All right, that's all from my side.
I mean, we're probably going to need another episode to discuss a little bit more.
But Eric, it's all yours now.
So interesting. This has been such a fun conversation.
James, I'm interested to know,
understanding the technical flow was super
helpful, right? So, pip install great expectations. Also, again, the Dickens reference is not lost on me.
So amazing work. Thank you. Amazing work there. And it's like the best sort of, you know,
smile on the mind. So let's play out a little example here, right? You kind of talked through the flow of a data engineer who is, you know, sort of implementing these expectations, connecting to data sources, running checkpoints. Now I want to get to the relationship between that data engineer.
Let's just say that data engineer is Costas, you know, and he's using great expectations to, you know, sort of drive data quality.
But I'm on the marketing side, you know, whether it's reports or this or that, right? That sort of presupposes that Costas and I have talked about what I expect to see in my reports and, like, you know, the data types in these columns and all that sort of stuff. So could you help
us understand what does it look like from the beginning in terms of that sort of relationship, right? Where I come with a
requirement, you know, and say, hey, like, you know, whatever it is, like purchase values always
need to be in X format, right? Because if not, then we undercount and then, you know, my boss
gets mad at me and blah, blah, blah, right? So I have that expectation as a marketer, but I don't
do anything technical, right? So, like, how do I interact with that process?
And I'd just love to hear, how do your customers do that?
When an expectation sort of originates with someone who's non-technical on a different team.
Great. Yeah, happy to dive into that.
Actually, I think my favorite example of this is, I mentioned it's crazy.
We started this, I think, in 2017, we gave a talk about great expectations.
And at that very first talk, there was somebody on a data engineering team facing this problem.
And, you know, she implemented Great Expectations, still a very, very young product, in her team.
And we had a chance to connect with her much later and hear about it. And what
she was doing, and I think this makes a ton of sense to me, and I do think we can make
better workflows with a web app and so forth. But what she was doing is she was literally
kind of creating forms for her team and a structured interview. So it became a kind of a requirement solicitation exercise. And it really just helped to structure the conversation in a way that was really valuable.
And this is something I saw a lot in my kind of analytics work and that sometimes we call round trips to the domain experts.
So in your example, this marketing stakeholder is an expert
in what the data should look like. It's really expensive to have to go back and forth and like,
you know, expensive in time and complexity and all these things. So again, like first answer
of how I see that done is it provides a mechanism for conducting structured conversations and interviews to elicit what those expectations are. The second way that I would flag, and Kostas, this is similar to what I think you were hinting at, is experimentation and having a really good notebook experience. Now, maybe you're gonna say, well, wait, did the domain experts sit down and write notebooks?
And no, I don't think that's the case.
But I think what actually is happening in that way
is that you can accelerate the level of domain expertise,
if you will, of a data engineer extremely quickly
when you're letting them operate
on these kind of higher order concepts instead of like, I'm staring at a CSV file.
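The "structured interview" idea above can be sketched as a small mapping from a domain expert's plain-English form answers to expectation configurations an engineer could load into a suite. The form fields and the shape of the output are hypothetical, though the two `expectation_type` names are real Great Expectations expectation names:

```python
# Toy sketch of requirement solicitation: answers collected from a
# non-technical stakeholder (via a form or structured interview) are
# translated mechanically into expectation configurations.

FORM_RESPONSES = [
    # e.g. the marketer's rule: purchase values are always required
    # and must be positive.
    {"column": "purchase_value", "required": True, "min": 0.01, "max": None},
    {"column": "email", "required": True, "min": None, "max": None},
]

def responses_to_expectations(responses):
    suite = []
    for r in responses:
        if r["required"]:
            suite.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": r["column"]},
            })
        if r["min"] is not None or r["max"] is not None:
            suite.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": r["column"],
                           "min_value": r["min"], "max_value": r["max"]},
            })
    return suite

suite = responses_to_expectations(FORM_RESPONSES)
```

The point of the sketch is the workflow, not the code: the form structures the conversation, and the translation into expectations is then mechanical enough that a data engineer (or a web app) can do it without another round trip to the domain expert.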
Yep.
Super interesting.
Super interesting.
Yeah.
That makes a ton of sense that, uh, well, and actually let me follow that up with another
question.
So, because there are different approaches to this, and I'm super interested in sort of the philosophy behind the decisions here at Great Expectations. Some companies that are trying to solve data quality think that there should be a software layer that sort of facilitates that interview process, right? Like a structured interview process, like you said. And that's an interesting one. I actually struggle to know how I feel about that.
I mean, not that I'm an expert in this, but like, you know, actually sort of validating data
technically is a pretty big challenge in and of itself, right? But like, you know, solving
relational connections, you know, between two people who play very, very different roles in like a complex business.
You know, it's kind of like, can software even solve that problem, right? I mean, so in one way, it's really encouraging to hear like, I mean, just do a really good like structured
interview and have a form. And that is a way that you can create like a very helpful, but like
simple interface
between these two people.
And then the data engineer can take that and translate it, you know, using the notebook
interface.
But all that said, like philosophically, what role do you think software plays and maybe
even specifically data quality software plays in facilitating that relational interaction,
you know, regardless of the actual technical data validation.
That's a stumper for a last question.
You got to end on a high note.
I don't know the answer to that question.
And let me tell you why we're like, what leads me to say that?
I absolutely believe in the power of software to facilitate structured interactions of any form.
So to me, when I say you have a form and you have a structured interview,
I absolutely believe software could facilitate that and make it happen in a useful, in a powerful way.
So it's actually not that software replaces that.
And to even go further, no doubt in my mind, software will play a role
in supporting those kinds of interactions. And we can do that. In fact, the way I think about it, we can do that in intelligent ways: it doesn't need to start from a blank form, from a blank slate asking, what is the answer here? We can say, here's what we've seen in the past.
Firstly, does that meet your expectations?
Secondly, you know, what would cause the things to be different?
So that's the first part.
You know, so on that side, actually,
maybe I'd be like slightly different than your priors.
On the other side, there's, you know,
in modeling, we call this elicitation. You're interacting with an actor, you're eliciting their knowledge. And one of the things that I found is that experts actually can have a very difficult time understanding which parameters in a model are kind of doing the work. And so if I say to you, about how many days should it be sunny in your city, and we're going to put a quality measure on that... So I guess what I'm saying is,
there's a huge amount of, that's the rich problem.
And I think it's not,
that doesn't have to be the problem of data quality
to solve alone.
That can be done in concert with a much richer ecosystem
of the way that we facilitate collaboration between people.
And like some of my other,
like one of my long-term passions is like the way that we communicate collaboration between people. And like some of my other, like one of my long-term kind of passions
is like the way that we communicate probability effectively.
Like what does it actually look like
for people to see probability
and understand probability?
So I think there's a lot to be done
in that space as well.
So I'm going to give myself a little bit of a pass: we don't have to solve that problem yet in order to be providing a lot of value in data quality.
Yeah, no, that was a really helpful answer. And sorry to throw that one in.
Such a good show. Thank you for your thoughtful, articulate answers. We've learned so much,
and I really appreciate you giving us some of your time.
Eric, Kostas, thanks so much for having me. It's been a pleasure.
You know, Kostas, I
tried to think of like a really good
like takeaway from this substantive
material from the show but
I'm going to phone it in because
I can't get over how much I enjoy
the like multiple references to
Charles Dickens not only in the name of the product,
but pip install,
Pip being a character in Great Expectations.
And then the verbose nature
of what you name an expectation,
those being really long.
Just a data quality product
as a Python library,
having such like clever references to Charles Dickens
is just, it just makes me really, really happy.
So that's my big takeaway.
Yeah, absolutely.
I mean, there's like some marketing genius
behind Great Expectations.
I don't know, like maybe that's what we should do with them, actually.
Like, get the founders and
get into, like, the conversation
of how they came up with that stuff.
Like, that's... I mean, it's not like
a data-related
conversation, but I don't know.
I think it feels like a very fascinating topic,
like, how they came up with that, because
it is pretty unique, and it's, like,
extremely smart.
Like you can have like a conversation with someone from great expectations and like
every other sentence, like make a connection with Dickens, which is, it's kind of crazy.
So yeah, we need to, we need to figure out, like we need to get the playbook from them somehow.
Yeah, absolutely.
It'd be great to have all the founders on the show and have them read passages from the book Great Expectations.
Yeah, yeah, absolutely.
We should do that.
But outside of these, like, okay, it's always, like, a great pleasure, like, to discuss with the folks from Great Expectations because you can see there are some very interesting ideas and
very deep knowledge around
how they solve the problem
of quality and
how they're moving forward.
And
I'm also looking forward to
see their cloud products
and what this will bring
into this whole experience
of doing data quality with great expectations.
I agree.
All right.
Well, thank you for joining us on the Data Stack Show.
Tell someone about the show if you haven't yet.
We always like to get new listeners
and we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack,
the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.