The Data Stack Show - 100: Data Quality Is Relative to Purpose with James Campbell of Superconductive

Episode Date: August 17, 2022

Highlights from this week's conversation include:
- James' role at Great Expectations (2:33)
- What Great Expectations does (5:49)
- How Great Expectations approaches data quality (7:01)
- Why a data engineer should use Great Expectations (16:41)
- Defining "data quality" (19:16)
- Translating expectations from one domain to the other (27:00)
- Community around Great Expectations (30:59)
- The user experience (33:41)
- Something exciting on the horizon (40:27)
- Interacting with marketers in a non-technical way (43:57)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Kostas, today we are talking with James from Great Expectations. Now we've already talked with Ben from that company and so we sort of have gotten some interesting thoughts on definitions around data quality, etc. But Great Expectations is a fascinating tool. It's a command line tool
Starting point is 00:00:41 or they have a command line interface. It's a Python library. And so the way that they approach the problem from a technical standpoint is super interesting. One of my questions is going to be around, if we have time to get to it, around how they think about the interaction between different parties within an organization who need to agree on sort of data definitions, right? That's like a huge thing with data, right? You have some sort of variance from like some sort of data definition. So I want to hear their approach on that, both like from a, does their product support it, but also from like a philosophical standpoint, because, you know, there's sort of, you know, potentially some limits to what software can solve, you know, in that regard.
Starting point is 00:01:26 So that's my burning question. How about you? Well, it seems that you are going after the hard questions. So I have to be the good prop this time. You're usually the bad cop. So, I mean, I, my intention is like to talk with him a little bit more about the product and the technology itself. We had the opportunity with Ben to talk a lot about data quality and the need and all
Starting point is 00:01:55 that stuff. Let's say a little bit of a higher level. So I think it's a great opportunity to get a little bit more tangible around like the products, how it is used, what kind of problems it solves and in what unique ways these problems get solved by a great expectation. So that's what I'm going to ask. All right. Well, let's dive in. Let's do it.
Starting point is 00:02:20 James, welcome to the Data Sack Show. Thank you so much. I'm excited to be here. All right. Well, give us your brief background and tell us what you do at Great Expectations. I'm the CTO at Great Expectations and one of the co-founders of the project together with Abe Gong. It's crazy to think this. It's been about five years ago now. Oh, wow. years ago now. And it's been quite a journey, driven tremendously by community. And now the company getting to focus on product is really delightful. Before working on Great Expectations,
Starting point is 00:02:55 I spent most of my career in the US federal government, specifically in the intelligence community. And I was an analyst. So I did a lot of work on originally cybersecurity and understanding strategic cyber threats and then broader political modeling. And in both of those domains, I had a really exciting chance to be able to move back and forth between very quantitative and very qualitative types of analysis. You know, I sometimes joked, you know, I was like, some of my job was Microsoft Word, and then I'd go have a job in Microsoft Excel and back to Word and then back to Excel. Obviously, you know, not just Excel for data volumes, but that's been a lot of how I've gotten to spend my time. And then now it's just a delight to work across, again, so much of the domain of superconductive.
Starting point is 00:03:43 Yeah, very cool. Tons of questions about that great expectation that really quickly was it, you know, it's always interesting to hear about things like political modeling, etc. Right. Because, you know, the I think we all subconsciously, those of us who haven't done it, you know, which is most people, you have this idea of kind of what it's like in the movies, you know, of like secrets and all that sort of stuff. But things that's that's really important to remember in that in that field is like there there's a whole bunch of different sources for how you build models and i think a lot of the contemporary machine learning and ai focuses around the
Starting point is 00:04:39 structure of data and like trying to use data to be the driving factor of building an understanding. And, you know, at least in my experience, a lot of the kind of practical modeling applications are still very much driven by significant domain expertise being put into the model itself. This, you know, structure of the model plays a significant role. And so it's like maybe one way to say it is, you know, big data was all the rage and there's still significant worlds where it doesn't take a lot of data. And in some ways, like the defining characteristic of the intelligence world is that maybe there's just one critical piece of information that
Starting point is 00:05:15 changes everything and you're in the hunt for that. Super interesting. Well, I could go on and on about that, but I want to talk about some of the guts of great expectations. So we talked with Ben and had some really good chats around sort of definitions around data quality, et cetera. And so I'm excited to dig into sort of the technical details. So first question about great expectations. So, and actually before you get going, could you just give us a super high level, like what does great expectations do for our listeners who may not be familiar?
Starting point is 00:05:47 Absolutely. Great expectations gives users the ability to make verifiable assertions about data. So it allows you to define what you expect. It also helps you learn what to expect from previous data. And then we can test those expectations against new data as it arrives and produce clear documentation describing whether or not it meets the expectations and when it doesn't, what exactly is different. Super cool. Okay, so this is my first question. And I have actually, you know, in looking at the show on the calendar, I've been so excited to ask you this question. So data quality is a broad problem and there are a number of ways to solve it, right? I mean, even including just sort of brute force SQL with, you know,
Starting point is 00:06:36 really raw, messy data on the warehouse, right? Which everyone, you know, hates. But what's interesting to me, sort of, if I can put it this way about like the geography of the data stack as it relates to data quality, is you can address the issue of data quality in multiple places, right? And maybe you do want to address it in multiple places. So the two part question, the first part is, where, where does great expectations sit in the stack and, and sort of in the geography, you know, the data flow? Great question. And I think the answer is, is sort of everywhere. But it's not the same expectations that will exist everywhere. So when you think about that stack of,
Starting point is 00:07:27 that you described sort of the stack of data, I think there are two pretty distinct things that happen during that. One is data kind of moves between systems and is potentially enriched or augmented or clarified along that process. And the other is that data is synthesized, right? You have a roll-up, you have an analytic running, you have a new model running on data.
Starting point is 00:07:55 And the output of all those things, it's another data set. So there's two pretty distinct operations in a data system in that way. And for each of those types of operations, great expectations helps you both protect the inputs and protect the outputs. And actually, one of the things that we've talked about on our team that makes great expectations powerful, but also, you know, it's a challenge for us to ensure we're making it easy for users to understand how to use it effectively is helping to kind of differentiate those
Starting point is 00:08:25 different ways that people are addressing data quality problems. Super interesting. Okay, so can you dig in one click deeper? So those two sort of key points where data is moving between a system and then data is being synthesized to result in another data set how does great expectations interact with those two points right because one is sort of well actually this is more of a question for you right like if data is moving from systems it can either sort of be raw data that's you know sort of maybe being like buttoned out to go into a warehouse, right?
Starting point is 00:09:05 Or there actually could be transformations happening, et cetera. So we'd love to hear about that. And then, you know, also the flip side where, you know, data is being synthesized. Yeah. So I think for the first case where data is moving through transformation or enrichment, I think of that as being really applicable to what I'd call a contract model. So, you know So there's vendors that provide data and the fact that we can go out and buy data sets that are curated and by definition, high quality, for example, it could be stock data, it could be health insurance records,
Starting point is 00:09:39 it could be any number of different, it could be weather data. There's all kinds of data sets that have been processed and have characteristics that make them valuable for certain kinds of decisions. So the first thing in there is being able to ensure that both parties understand what they're getting, right? Like when you're buying something, we want to contract about that. We want to know, hey, I have a column and this is what it should look like. Now in the past, a lot of the ways that we dealt with that was we had these like endless giant coordination meetings or, you know, and I, you know, I kid you not, I've been the recipient of, I think like 175 page diagram describing this data set that we were buying. And it was like, you know, what do I do with this? Right. So big
Starting point is 00:10:22 part of what we're doing there is we're making it possible for you to just agree in very precise terms that are self-healing. The documentation, that 175-page PDF is self-healing because the biggest problem with those things is that they immediately get out of date. But by making that contract a living artifact, something that can be tested as data continues to flow, it can be immediately flagged when there's a problem. And then that can be tested as data continues to flow. It can be immediately flagged when there's a problem, and then we can also update the contract. So with respect to the second thing, the analogy that I think of there is kind of like from physics,
Starting point is 00:10:57 you know, the concept of an emergent property. So, you know, if you look at a volume of air, you know, you can think about like, okay, if you look at a volume of, of air, you know, you can think about like, okay, where are all the molecules? Like what's, what are their characteristics? And like, those might be the columns, you know, this one is at location X, Y, Z, and has momentum alpha and so forth. But when an analytic context, what we're doing is we're looking at a higher order property, pressure, volume, right?
Starting point is 00:11:23 We don't need to look at all those individual records anymore. And that's what a model is doing, right? It's like taking all these individual pieces of information, synthesizing them together. And the key thing that happens there is that the nature of the information is completely different. And we're reasoning about a different quantity. So I'm not reasoning anymore about Xs and Ys and Zs. I'm reasoning about pressures and volumes. And being able to support that kind of transition is really, really important. And it's one of our critical goals and why we have done a lot of investment in supporting a contributor gallery, for example, where people can define expectations that are meaningful
Starting point is 00:12:02 for them. Like this, you know, expect this column to be in a particular geography. Right. So, you know, we're not saying it's like, it has to be an X value and a Y value. We're, we're saying like, it needs to be in New York. Like if this is a lat long, it needs to be in New York. And that reflects how we think about data. And it helps you move to that emergent property, which is, I think where, where data quality really needs to be as a field is because that's where we're helping stakeholders really get to the value they need. Yeah. Okay. So super helpful. And I love the physics analogy, you know, sort of the
Starting point is 00:12:36 individual components that make up something like pressure is a super helpful analogy. The second part of the question is why you chose to solve it that way. And I would love to hear you talk about maybe ways that you had seen it solved before and then why you decided to sort of structure great expectations, you know, the way that you did. That's such a rich question. I love it. The first thing is like ways I've seen it solved before. And I think actually one of the important things, it's not just before. When we encounter users of great expectations,
Starting point is 00:13:14 I actually consider a point of pride being the fact that many of them are like, oh, I've written something like this. Like I've solved this. I've written the test for nullity, for volume, for means, for stationarity of a distribution. And the reason I think that's really a good thing is that it reflects the fact that we're kind of in tune with how people process the world. Now, what's the key difference? Like the key insight, I think that makes it different in like how we're solving the problem is that we're providing a, a, what I would call like a general purpose language. We like to
Starting point is 00:13:51 call it like an open shared standard where it's designed. I mean, like some of the hallmarks of great expectations are the names of expectations are incredibly verbose and people love it i mean i love it it's it's very precise you know expect column kl divergence to be less than you know it's like this long name sure but it but it means something and it helps people really express again express their expectation the key thing that so so why why that i i you know you asked this question like why that? What does it give you? I think one of the most important things that it gives you is explainability. So when I get back, when I see that some piece of data maybe doesn't match my expectation, then it can explain what the expectation was because I told it what the expectation was. And we can go into, like if we get in,
Starting point is 00:14:46 you know, we should dive into kind of some of the more technical details because what I don't want to suggest is you have to sit at your keyboard and type expect this 100,000 times. Like, no, you don't need to do that. But what we can do is really make it easy for you to get that very explainable report back of, all right, you know,
Starting point is 00:15:00 you didn't think there was, you know, this column is supposed to exist. It doesn't, right? Which in many cases, those are like the real problems that break dashboards or something. Yeah. Yeah. It makes total sense.
Starting point is 00:15:10 And the verbose naming couldn't be better aligned with sort of the Dickens reference. I'm impressed. Yeah. You're totally right. That's great. Super long, you know, one page sentences. Hopefully your expectation, you know, and great expectations isn't a page long. Okay.
Starting point is 00:15:28 Kostas, I've been, I've been stealing the mic. Yeah, you did, but it's fun. So you can continue doing it if you want. It's fine with me. I have a few questions to ask and like, I, I'd like, like to focus a little bit more like on the, the product experience first, and also the problem that would lead someone to a solution by great expectations. Many times we assume that everyone who listens out there, they are aware of why we are doing
Starting point is 00:15:59 the things that we are doing, but that's not always the case. So I would like to start from the very, very basics. Like let's, let's think of like, and I'd love to hear that like from you, or like describe the work of like a data engineer or like whoever is like, let's say the person who faces problems that can't be solved with great expectations and like describe, like go through like a small scenario until we reach the point where we can say, yeah, now we can talk about great expectations and how this problem can be solved with this. So can you do that for us, please? I can do my best to present some.
Starting point is 00:16:37 I think there's a lot of different ways. ways but i think one of the key things one of the whys like why do people turn to this tool is that they want to get ahead and be proactive instead of reactive already a lot of data engineering teams face this question of you know i got the phone and we we call for we call them data forestories sometimes and i think you, other people have used similar terms and these are out there all the time, but you get a call, Hey, my dashboard is broken. And it, you know, it's when you, when somebody says my dashboard is broken, I think it's useful to think about the way they're seeing the world. And so, you know, great example would be salesperson Northeast region sales show zero. I know that's not true.
Starting point is 00:17:28 And the reason I like framing it in that way is like they had an expectation. Like I was out there, I made the sale. I saw, I wrote the ink down. I know it's not zero. I expected it to not be zero, but it's zero. So the data engineering team turns to a tool like great expectations in that case
Starting point is 00:17:43 because they want to be ahead. So they don't want to get the call. know hey the dashboard is broken they want to see the issue first and be able to resolve it before it ever becomes this broken or or embarrassment that is one of the really common problems another really common problem is like the pager duty problem you know like if systems sometimes system some in example i gave that first example the it's a semantic failure right like the the dashboard ran then there's a there is a number in the cell it's just not the number the person expected other times it's like you have you know a schema mismatch or a load that totally failed or something like that, where the key thing that you're trying to solve is like, I don't want to get a page at this page at midnight.
Starting point is 00:18:33 And if I am responding to a problem, I want to have the diagnostic information that I need to be able to get to a solution right away and to be able to zero in on where the problem actually happened. Yeah, and I think, you know, there's variations on those, but I think kind of this, they're both kind of forms of being able to be proactive in addressing like your core function that you're trying to solve as a data organization. Yeah, absolutely. Okay, I have a slightly trickier question now. Tricky also for me to communicate in the best possible way.
Starting point is 00:19:15 So when we are talking about quality in general and quality in data, I'm always, let's say, confused or I'm going back and forth between two definitions. One definition, and I'll stop using a little bit of metaphors here, is more like the stamp of QA that we have on products, which is more about the consumer of the product, in this case data, to trust the product that is getting used, right? So that's like one reason that, one way that we implement like quality, like humans in general, like in products.
Starting point is 00:19:55 The other is like use these tools, like great expectations as a debugging tool, right? Like as a way for the developer or like the engineer or whoever like is responsible, like for anything related to the consumption of the data to figure out what is wrong. Now, obviously, data quality in general. Which one of these two, let's say, definitions of quality you think are closer to what is needed today? That is a tricky question. Good.
Starting point is 00:20:38 I think I almost need to add a third one. Oh, let's do that. In order to be able to like flesh them out. Yeah. And it's very similar to debugging, but it's the, but like just sticking on the theme for a moment of proactivity, it's like the proactive debugging, i.e. it's the, it's the way that you generate your expectations in the first place. And the reason I think that's really critical is like you alluded to this in the first definition, you know, quality is a term that is relative to a purpose.
Starting point is 00:21:10 You know, in some ways, quality is like your fitness for doing some particular job. And so it will vary like that. You know, the same data is high quality for purpose one and not high quality for purpose two, for example. And so being able to support the process of generating your understanding, like kind of your mental model of the world, I think is actually one of the most important areas that the broader data quality ecosystem can support, can do. So, okay. So I added the third and then maybe what that lets me do is say look they're all important it's really more more of a phase and and to be honest with you this is
Starting point is 00:21:51 something we've been we've been talking about a lot on our team internally lately is making sure that we're not trying to have the same conceptual objects like the same you know in this case it's like literal code object like the same objects in great expectations sort of do too much, but rather exposing APIs and interfaces that are more intuitive for the way that you're using this tool. And that might be in this phase of I'm building and creating my expectations. It might be in the phase of I am, you know, kind of ensuring quality, i.e. performing that QA function to make sure that I'm going to meet the needs of the product or the downstream data consumer. And then it also might be I'm performing a debugging task. Now, that last one is pretty challenging for us today because it's very interactive and it's also interactive on a potentially historical dataset, and so there's a lot of nuance there that we should dive
Starting point is 00:22:48 into if we're going to get into more of the technical detail. I don't know if that... I basically sidestepped your question and I said all three, but... I don't know. It's fine. I think it's very... first of all, there's value in adding the third dimension there. And to be honest, I don't expect...
Starting point is 00:23:02 I mean, I think these questions are like work in progress, like both the questions and the answers. And I think it's important like to ask these questions, even if we don't have like definitely like answers right now, but we need to think about that stuff. And I think it's important also like for the people, not like us who at the end, we sell products, but around that stuff, but like the people who their job is like to ensure like the quality of the data or deliver, let's say, data sets to people like to work on them and like all these things to have, let's say, the right, to ask the right questions or have the right, let's say, conceptual models around like that
Starting point is 00:23:43 kind of stuff. So I think it's, there's always like value in these conversations, let's say, conceptual models around that kind of stuff. So I think there's always value in these conversations, even if, let's say, the answers are not yes or no. It's fine. It's good. It's important to have the conversation. To that end, actually, I think there's one part of the quality as QA that is really important, but that is, I think, a little bit less obvious or less clear in a lot of platforms that kind of provide data quality. And that's that it's really a two-way street. The person who is, and I mean, both like there's the provider and the consumer data, but also if I'm providing a analytic model or a curated data set or a dashboard or just a data product, right, a giant collection of records, it's actually
Starting point is 00:24:33 potentially very useful for me to be able to kind of package together with that a description of what I think this is good at doing. And similarly, if what I'm providing is a model without data, it's actually very valuable for me to be able to say, when you use this model, make sure that your data looks like this. And that way you can clarify, it's like making,
Starting point is 00:25:02 like Ikea is really good because the instructions are simple in some ways or Lego. Think about Lego, these incredible instructions. Like imagine being able to give, give a consumer of a complicated product, like Lego, like really elegant instructions based around how the data. That's very, that's a very interesting point. How do you think we can get there? Because I don't feel like we have that today, right? Yeah. I mean, well, I think the answer is going to be that we allow people to provide expectation suites together with their products that are validated at the time that the person brings in their own data.
Starting point is 00:25:41 Sure. But okay, let's get a little bit deeper into that because also touches the product experience. So an expectation might be interpreted in a very different way from a technical point of view than from, let's say, the point of view of the consumer of the data who might not be a technical person. Let's say we have a marketeer. At the end, the marketeer wants to know, can I trust that the segmentation that I'm going to do on this data is actually representative of the reality that I'm
Starting point is 00:26:13 addressing out there, but on the other hand, the person who is the data engineer probably doesn't even know what segmentation is or they shouldn't care. Right? It's not their job to do that. Or their brains are not like trained to think that way. Maybe they think more in terms of like the skewance of data or like, do we have nulls or not nulls or I don't know, like what are the, like this kind of parameters. So how do we bridge this too?
Starting point is 00:26:40 Like how do we semant, like sem, maybe it's like the right way to say that, like translate the expectations from one domain to the other, right? So we can apply them at the end. I mean, that's such a beautiful question for me. Like to me, that is the core of what we're doing. And so like, let's dive in a little bit of how Great Expectations does that. And it's basically that, specifically the way we decompose things, we have a core object, one of the key concepts in Great Expectations is called the expectation. And what that does is it is a semantic translation machine. We often call it a grammar.
Starting point is 00:27:24 It's this long long verbose sentence. Now I definitely don't want to suggest that this is like an easy solved problem. Like go pick up the one, but let's take your hard example of a marketer who says, you know, what they're saying is expect this segmentation to make sense, right? Like that's kind of how they're thinking about the world. So we have to decompose that into what we call metrics. And so what an expectation does is it asks for metrics about the data. And metrics are a very general concept in great expectations. So it doesn't have to just be like a statistic.
Starting point is 00:28:00 Mean could be a metric. Number of nulls could be a metric. But so could number of nulls could be a metric, but so could number of nulls outside of a particular range or country code of lat long pairs. These are also metrics. So it's a pretty general compute engine under the hood. So the expectation author is, is providing this declarative verbose to declarative language.
Starting point is 00:28:22 And then they're also doing that translation into what are the metrics that make this mean that. And then great expectations is sort of an orchestration engine that goes out, reaches and touches the data, gets the values, you know, finds the values of those metrics, does the comparison reassembly and then surfaces the result in the language of the way the
Starting point is 00:28:46 marketer was thinking about the world. Okay. Okay. This is great. How do we assign metrics? How do we come up with the right metrics for the great expectation from the great marketeer? Yeah. Well, I think our answer to that is community okay and so we have we
Starting point is 00:29:10 have what we call our expectations gallery where we're we're we're trying to encourage and we want to encourage a robust process of community engagement for people to be able to expand the vocabulary of expectations to include things that are that make sense in their domain that do these metric translations and so in order to do that they're adding new expectations they're adding new metrics but you know what we're what we're trying to do is make sure that we're providing the substrate for expressing that or like the the mechanism for for for allowing people to express it, but then letting them take ownership of the semantic and domain model.
Starting point is 00:29:50 And, you know, the short version, of course, is like there isn't a right answer to the question, is this the right, you know, is this the valid segmentation? You know, it depends on the organization. And so one of the things we see a lot is what we call custom expectations. So, you know, yes, schema, nullity, volume, all these things, people use those expectations a lot. But also they say, okay, well, I want to say expect columns. And I'll pick an example.
Starting point is 00:30:19 We've used a lot of expect column values to be a valid ICD code. And what, you know, under the hood, that might just be translated into a fancy regex, but the fact that there's that translation is actually really important because that's what makes it usable to the marketer consumer in our case. Yeah, absolutely. Okay.
Starting point is 00:30:38 That's super interesting. And like, how is the engagement of the community around that? Like, what have you seen so far? I mean, okay okay i obviously i'm aware of the great community that great expectation has but what have you seen like happening there and what do you have seen like working and what not sure and you know it's like the power of open source is like one of these things that just blows my mind over and over. So I certainly can't suggest that I can wrap my head around all of it. Some of the things that work, we've done hackathons and we've, like I mentioned, produced this gallery. We're doing some experiments in the
Starting point is 00:31:19 space of what we call packages, where we have, you know, domain leaders in a domain or field who can, who are willing to kind of commit and say, these are expectations that are like useful and valid or valuable to be able to understand the kinds of concepts that are relevant. And like an example, actually, of a community contributed project that's really been interesting is a data profiling framework where there's expectations that are built around the Capital One data profiler. And they use that kind of as the semantic engine to infer types and allow you to make expectations like there shouldn't be PII here. And then that gets translated through. Now, we didn't build that on our team,
Starting point is 00:32:06 but we are excited about supporting that community. Now, your question was also where are the challenges? And like, to be honest, yeah, there's still challenges, right? Like there's lots and lots of expectations out there that there's a discovery problem, there's synthesis, there's a lot of work left to do in that space, or not left, there's a lot of opportunities still
Starting point is 00:32:24 to be had for helping people engage. And like, that's where I'm, you know, what I would, what I would really emphasize is our goal of, of making this a shared standard that people, people can engage on together and you know, and, and improve. So that's an exciting, that's an exciting area. That's one area. There's lots of other exciting things we're working on too, but that's definitely. Yeah.
Starting point is 00:32:44 Well, these are like very, very super interesting, like parts of like exciting, that's an exciting area. That's one area. There's lots of other exciting things we're working on too, but that's definitely. Yeah, yeah. Well, these are like very, very super interesting, like, parts of like growing and building and like having actual, like the community as part of the product experience itself, right? It is part of the product at the end. Uh, anyway, that's another conversation for another time. Like we can discuss a lot about the community. So what I want to ask you now is like, okay, we talked about the problems, like how you
Starting point is 00:33:11 think about like the solution, but let's talk a little bit more about like the experience that the user has with great expectations. So how, how do I use great expectations? I mean, I have like, let's say, an idea right now that there are some expectations somewhere that I'm like testing against my data. But like, how do I operationalize great expectations as part of my like day to day job, like as a data engineer? Yeah, that's a great question. We think of it as there's sort of four key steps to what it means to use great expectations. And we call it like our universal map.
Starting point is 00:33:50 First, just to be like really explicit on it, great expectations is a Python library. So, you know, you run a pip install in the case of great expectations today. Now, you know, not to get into like the commercial aspect, we are building a cloud product as well that's designed to make it very accessible, especially just expanding the reach of people, but also simplifying the setup, but just setting it up. So we set up, we run our pip install. Next step is connecting the data. Now, this is where it could mean I'm just going
Starting point is 00:34:26 to grab a batch of data, like I'm going to read a CSV off my file system and work with this data. There are also, there's more you can do to connect to data where you also configure the ability for great expectations to understand what your assets are and how they're divided into batches so that, you know, as new, new batches of data come in, we can, we can understand the, the, the, the stream of data as a, as, as a unit. So, you know, you connect to data and, and to be honest with you, that's an area where I see us needing to do some, some work, like some of the magic of great expectations early on came from the fact that connect to data was a one liner read CSV.
Starting point is 00:35:07 And as we've added the power and expressivity around ensuring that you can understand batches and so forth, it has become a little more difficult there. And so that's one of the things we're actually working on right now is like bringing that kind of magical viral experience back. Anyway. So next thing I do is I connect my data. It's like literally add a data source. It could be, you know, pointing it at an S3 bucket, say, or connecting to a database, to a warehouse. And that's one of the important things about gridded expectations
Starting point is 00:35:32 is we work across all the different backends, right? All the SQL dialects, Spark, Pandas in memory, you know, pulling data in from S3, all that, any of those things. So we do that configuration. Next thing is you create your expectations. And for us, that's a notebook experience. So you're in a Jupyter notebook. You have this sample of data.
Starting point is 00:35:52 And it's an interactive, real-time experience. I say, expect column values to be not null. And I get an immediate response. We check that right away. And it says, hey, success. Or actually, 5% of these values are normal. And so what we see there is this interactive exploratory process where you're creating expectations. Another way you can do that is with profiling, where you ask great expectations, you know, go out and build a model of this dataset and propose, you know,
Starting point is 00:36:27 a long list of expectations back to me and I'll choose which ones to accept, maybe all of them. And then that will become my expectation suite. So create expectations. And then the last step is the validation. So what you do typically, what we see people run is what we call a checkpoint and they'll embed that into a airflow pipeline or into a prefect pipeline or, or, you know, wherever they're running their validations. And you know, it's, it's, it's just an operation, graded
Starting point is 00:36:56 expectations run checkpoint. And what that then does is produce a validation result, which we convert into a webpage that you can share and post with your team that says, here were the expectations. These ones passed. These ones didn't pass. If they didn't pass,
Starting point is 00:37:16 here's some samples, examples of what went wrong. And so you have that very tangible, shareable, visible report of what the state of your validation was. So that's how great expectations gets used in practice. Okay, that's super cool. And you mentioned that like you support like pretty much, let's say, like every
Starting point is 00:37:36 backend out there. How does this translate in terms of like interacting with these backends? So let's say, for example, like I have my data on a data lake on S3, right? Or the same data, I might have it like on a snowflake. So first of all, is the experience the same that I get for great expectations,
Starting point is 00:37:54 like regardless of what I have like as a backend? That's one of the pieces of magic is, you know, we talked about that trend, like what great expectations is doing is translating between expectations and metrics. And then one layer deeper than that translate the request for the metric mean into the SQL dialect that will give us back that metric.
Starting point is 00:38:30 Or if we're in Spark, we'll translate that into the appropriate Spark command to say, give me the mean value back. So great expectations is handling that and then sort of re-bubbling it back up into the semantic layer for you. Okay, and this is like something that's sort of interactive, or it's something that, let's say, I run the expectation every one hour.
Starting point is 00:38:52 Is the result kept somewhere so I can go and see how the expectation has changed in time? How does this part work? Because I can see they're generating even more data at the end. Right. Yeah, totally right. Wow. This is such a rich area for us to continue to build on, to be honest with you. So today the way that works is like the core validation result is a, is a, is a
Starting point is 00:39:18 big kind of Jason artifact. Like I mentioned, we, we do render that and translate it into HTML. So you have you if you go to the your your generated data docs site, you'll see Yeah, you'll see a list of all the validations that are on. Now, what we're what we're doing right now in the cloud product is providing a much more more kind of interactive, like rich, linkable experiences. You can't, you can't really do that in open source. Like when you're producing a JSON report, you know, it's just hard to have that kind of referential, you know, all those references between different elements of an object for the data.
Starting point is 00:39:53 So we're, we're, we're making that possible more possible in a cloud environment, but, but in the open source, what you can, again, what you can do is you can absolutely get that list of all the things, you know, here X, you'll see a little X, like this one failed this time and this time and this time, and then it passed this time and that time and the other time. Okay. That's pretty cool. All right. So one last question from me, and then like, I'll give the microphone back to Eric.
Starting point is 00:40:15 I think I punished him enough for abusing the ownership of the microphone at the beginning. So share something exciting that is coming in the future for great expectations. I am absolutely thrilled about the cloud product. And I know that probably sounds like, oh, of course he's from... You know, it's because we can... You know, it's like...
Starting point is 00:40:41 We have a new concept available to us. And what is that concept? It's the user. In a library, we don't really have access to the user. We don't see who you are and when you're interacting. You're interacting with files
Starting point is 00:40:56 and going to these static webpages. Whereas when we're, like where we really want to go is we want to facilitate collaboration. At the end of the day, quality is fitness for purpose. We talked about contracts at the beginning and ensuring that people are on the same page. And so what I'm really, really excited about is the world in which you and I are sharing something, sharing a piece of data,
Starting point is 00:41:23 and we both have our expectations about it. And we can just say, hey, let's go look at that validation result together. Drop in a comment like, oh, you know what? This expectation should be a little bit different. But we're turning data quality into this very, the potential for a collaborative, more of a collaborative enterprise. That's really exciting to me. Yeah, super, super interesting.
Starting point is 00:41:47 All right, that's all from my side. I mean, we're probably going to need another episode to discuss a little bit more. But Eric, it's all yours now. So interesting. This has been such a fun conversation. James, I'm interested to know, understanding the technical flow was super helpful, right? So Pippin saw great expectations. Also, again, the Dickens reference not lost on me. So amazing work. Thank you. Amazing work there. And it's like the best sort of, you know,
Starting point is 00:42:18 smile on the mind. So let's play out a little example here, right? So you kind of talked to the flow of like a data engineer who is, you know, sort of implementing these expectations, connecting to data sources, running checkpoints, you know, and some sort of, you to the relationship between that data engineer. Let's just say that data engineer is Costas, you know, and he's using great expectations to, you know, sort of drive data quality. But I'm the market side, you know, for whatever it's reports or this or that, right? Sort of presupposes that Costas and I have talked about like what I expect to see in my reports and like, you know, the data types in these columns and all that sort of stuff. So could you help us understand what does it look like from the beginning in terms of that sort of relationship, right? Where I come with a requirement, you know, and say, hey, like, you know, whatever it is, like purchase values always need to be in X format, right? Because if not, then we undercount and then, you know, my boss gets mad at me and blah, blah, blah, right? So I have that expectation as a marketer, but I don't
Starting point is 00:43:43 do anything technical, right? So like So how do I interact with process? And I'd just love to hear, how do your customers do that? When an expectation sort of originates with someone who's non-technical on a different team. Great. Yeah, happy to dive into that. Actually, I think my favorite example of this is, I mentioned it's crazy. We started this, I think, in 2017, we gave a talk about great expectations. And at that very first talk, there was somebody on a data engineering team facing this problem. And, you know, she, she implemented great expectations, still very, very, very young
Starting point is 00:44:19 product and her team. And we had a chance to connect with her much later and hear about it. And what she was doing, and I think this makes a ton of sense to me, and I do think we can make better workflows with a web app and so forth. But what she was doing is she was literally kind of creating forms for her team and a structured interview. So it became a kind of a requirement solicitation exercise. And it really just helped to structure the conversation in a way that was really valuable. And this is something I saw a lot in my kind of analytics work and that sometimes we call round trips to the domain experts. So in your example, this marketing stakeholder is an expert in what the data should look like. It's really expensive to have to go back and forth and like,
Starting point is 00:45:14 you know, expensive in time and complexity and all these things. So again, like first answer of how I see that done is it provides a mechanism for conducting structured conversations and interviews to elicit what those expectations are the second way that I would flag is, and Kostas, this is similar to what you were kind of, I think hinting at is like, it's experimentation and having, having a real good notebook experience now, so maybe you're gonna say, well, wait, wait, did the domain experts sit down and write notebooks? And no, I don't think that's the case. But I think what actually is happening in that way
Starting point is 00:45:53 is that you can accelerate the level of domain expertise, if you will, of a data engineer extremely quickly when you're letting them operate on these kind of higher order concepts instead of like, I'm staring at a CSV file. Yep. Super interesting. Super interesting. Yeah.
Starting point is 00:46:13 That makes a ton of sense that, uh, well, and actually let me follow that up with another question. So because there are different approaches to this and like, I'm, I'm super interested in sort of the philosophy behind the decision sphere expectations thing some some companies that are trying to solve data quality think that there should be a software layer that sort of like facilitates that interview process right like a structured interview process like you said and that's an interesting, I actually struggle to know how I feel about that. I mean, not that I'm an expert in this, but like, you know, actually sort of validating data technically is a pretty big challenge in and of itself, right? But like, you know, solving
Starting point is 00:46:58 relational connections, you know, between two people who play very, very different roles in like a complex business. You know, it's kind of like, can software even solve that problem, right? I mean, so in one way, it's really encouraging to hear like, I mean, just do a really good like structured interview and have a form. And that is a way that you can create like a very helpful, but like simple interface between these two people. And then the data engineer can take that and translate it, you know, using the notebook interface. But all that said, like philosophically, what role do you think software plays and maybe
Starting point is 00:47:38 even specifically data quality software plays in facilitating that relational interaction, you know, irregardless of the actual technical data validation. That's a stumper for a last question. You got to end on a high note. I don't know the answer to that question. And let me tell you why we're like, what leads me to say that? I absolutely believe in the power of software to facilitate structured interactions of any form. So to me, when I say you have a form and you have a structured interview,
Starting point is 00:48:09 I absolutely believe software could facilitate that and make it happen in a useful, in a powerful way. So it's actually not about that was what we replaced. And to even go further, no doubt in my mind, software will play a role in supporting those kinds of interactions. And we can do that. In fact, the way I think about it, we can do that in intelligent ways by making it doesn't need to start from a form, from a blank slate. What is the answer here? We can say, here's the past. Here's what we've seen in the past. Firstly, does that meet your expectations? Secondly, you know, what would cause the things to be different?
Starting point is 00:48:54 So that's the first part. You know, so on that side, actually, maybe I'd be like slightly different than your priors. On the other side, there's, you know, in modeling, we call this a lot like elicitation so you're interacting with an actor you're eliciting their knowledge and one of the things that i found is like experts actually can have a very difficult time understanding which parameters in a model are kind of doing the work and so if i if i say to you you, about how many days should it be sunny in your city? And we're going to put a quality measure. So I think, I guess what I'm saying is,
Starting point is 00:49:46 there's a huge amount of, that's the rich problem. And I think it's not, that doesn't have to be the problem of data quality to solve alone. That can be done in concert with a much richer ecosystem of the way that we facilitate collaboration between people. And like some of my other, like one of my long-term passions is like the way that we communicate collaboration between people. And like some of my other, like one of my long-term kind of passions
Starting point is 00:50:06 is like the way that we communicate probability effectively. Like what does it actually look like for people to see probability and understand probability? So I think there's a lot to be done in that space as well. So I'm going to give myself a little bit of a vibe that we don't have to solve that problem yet
Starting point is 00:50:21 in order to be providing a lot of value and data quality. Yeah, no, that was a really, a really helpful answer. I'm sorry to throw that problem yet in order to keep providing a lot of value in data quality. Yeah, no, that was a really, really helpful answer. And sorry to throw that one in. Such a good show. Thank you for your thoughtful, articulate answers. We've learned so much, and I really appreciate you giving us some of your time. Eric Costas, thanks so much for having me it's been a pleasure you know Kostas I tried to think of like a really good like takeaway from this substantive
Starting point is 00:50:51 material from the show but I'm going to phone it in because I can't get over how much I enjoy the like multiple references to Charles Dickens not only in the name of the product, but pip install, pip secure during great expectations. And then the verbose nature
Starting point is 00:51:15 of what you name an expectation, those being really long. Just a data quality product as a Python library, having such like clever references to Charles Dickens is just, it just makes me really, really happy. So that's my big takeaway. Yeah, absolutely.
Starting point is 00:51:38 I mean, there's like some marketing genius behind Great Expectations. I don't know, like maybe that's what we should do with them, actually. Like, get the founders and get into, like, the conversation of how they came up with that stuff. Like, that's... I mean, it's not like a data-related
Starting point is 00:51:56 conversation, but I don't know. I think it feels like a very fascinating topic, like, how they came up with that, because it is pretty unique, and it's, like, extremely smart. Like you can have like a conversation with someone from great expectations and like every other sentence, like make a connection with Dickens, which is, it's kind of crazy. So yeah, we need to, we need to figure out, like we need to get the playbook from them somehow.
Starting point is 00:52:25 Yeah, absolutely. It'd be great to have all the founders on the show and have them read passages from the book Great Expectations. Yeah, yeah, absolutely. We should do that. But outside of these, like, okay, it's always, like, a great pleasure, like, to discuss with the folks from Great Expectations because you can see there are some very interesting ideas and very deep knowledge around how they solve the problem of quality and
Starting point is 00:52:52 how they're moving forward. And I'm also looking forward to see their cloud products and what this will bring into this whole experience of doing data quality with great expectations. I agree.
Starting point is 00:53:08 All right. Well, thank you for joining us on the Data Stack Show. Tell someone about the show if you haven't yet. We always like to get new listeners and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
Starting point is 00:53:32 That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
