The Data Stack Show - Data Council Week (Ep 2): Testing and Observability Are Two Sides of the Same Coin With Ben Castleton of Great Expectations

Episode Date: April 26, 2022

Highlights from this week's conversation include:

- Ben's background and career journey (2:13)
- The birth of Great Expectations (5:02)
- Defining software engineering (9:38)
- Adopting open source products (13:04)
- Working in data versus healthcare (18:01)
- What's next for Great Expectations (20:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show, still recording on site at Data Council in Austin. We had a great conversation with Firebolt,
Starting point is 00:00:34 and the one we're about to have is with a company called Great Expectations. Now, Kostas, this is what I'm interested in as far as Great Expectations. One, the name, but two, it has really seen, you know, sort of of the data quality, data observability, variety of tools. The community and adoption that Great Expectations has is pretty impressive, you know, and I think that as an open source project in that space, they've really had a ton of adoption. And so I'm interested to hear about, you know, sort of the origin story, like why did they choose to open source it, you know, and how they've, how they've grown that community. How about you? Yeah, absolutely.
Starting point is 00:01:12 I mean, learning more about the community is something that I definitely hope to happen. Like they are, they have a very vivid community, but they're one of these, you know, like cases, like the community that you have, like on dbt, like people are like obsessed with the technology. So yeah, I mean, I want to learn more about the technology itself, how it differentiates with the rest of like the data quality tools out there and to chat about the community and what it means to have an open source dimension
Starting point is 00:01:40 to a product that mainly does data quality. So I'm really looking forward to this conversation. All right, let's dig in. Let's do it. Ben, welcome to the show. I have been lurking sort of in the background looking at great expectations for a long time. So really fun to meet you here at Data Council Austin and hear about the origin story. So thanks for giving us some time. Yeah, no problem. Thank you. Okay. So give us your background and tell us what led to, you know, sort of starting Great Expectations. Yeah. Well, so my background is basically started as an accountant and then
Starting point is 00:02:17 switched over into healthcare when accounting became, I was in hedge funds. I was basically working to make sure that billionaires stayed billionaires. And I didn't feel like that was doing anything good for the world. And I had a good friend in Boston at the time who told me, you got to get into healthcare and data's where it's at. So switched over doing analytics and data. And that led me to meet up with Abe. And we realized there's a lot of work to be done to
Starting point is 00:02:47 help analytics in healthcare, you know, help more people and work faster. So this was a consulting firm. It was not a product firm at all. So it wasn't SaaS from the beginning? No, no, not at all. We were sort of like a tools enabled like consulting. And so my background led to figuring out how can we sell consulting? How can we do data engineering for healthcare companies? Sure. Yeah. Not where we started, but we had this meeting way back at the beginning where I remember us saying, yeah, it's okay, Abe, if you spend 5% of your time on great expectations, because yeah, maybe that'll help your career somehow. I'm not sure. Google does like whatever 20% time or something is like you get 5%. But it became clear in 2019 that Great Expectations had legs. It was taking off. There was a lot of demand across industry. And so we pivoted the company to deeply embedded in their teams and figuring out
Starting point is 00:04:06 what are the problems that they're really trying to solve. That's, you know, I know DBT has a great story there and same thing. We had real problems that we were trying to solve with this little side project and we would use it on our early clients and then it started to take off on its own okay so it's it's really interesting for me to hear that you were in the healthcare space doing work there because i wouldn't think the natural like decision is we're going to open source this and really build an open source ecosystem around this tool right because healthcare, you just kind of think about like protecting IP in healthcare. And so tell us that story. Like, I mean, Great Expectations has an unbelievable community around it. And how did that come out of the healthcare consulting?
Starting point is 00:05:00 Yeah. So actually Great Expectations was started by cross-team collaboration with Abe and James, who was working with the NSA and they were sort of collaborating across organizations to figure out how can we, you know, solve some of these problems we're seeing. So that was going on in parallel to us building up this healthcare consulting firm. Got it. That was the 5% time. Yeah. And you know, you can go over there and do that thing you're doing Abe. And eventually James came over and joined our team as we, as we moved more towards, you know, getting great expectations out there. But James and Abe really started this together and he's, James is our co-founder,
Starting point is 00:05:44 but he was in a different company when when he helped co-founder what we've got going on so yeah it started crossed industry and we've never had like a demand for it from like specific industries it's always been just like demand from everywhere um and then we tried to use it in healthcare a little bit. Yeah. Makes total sense. Oh, you stole my question. Oh, it's a good one. Well, first off, we love the name Great Expectations. I think, I don't know if that was Abe and James together, but definitely Abe's got his name all over it. Loving, you know, old English literature and Charles Dickens.
Starting point is 00:06:25 And so the puns with pip install, great expectations, it's endless. So good. quality and do that out in the open and figure out, figure out how we can validate and test if we're getting what we expect from data at different points in the life cycle. And then, you know, there's lots of different places you can go, but that's the entry point into figuring out how to collaborate better around data and enable collaboration. And okay. I'll ask, I have like a model question later on that, but I'll talk with many things are happening right now, like in the industry.
Starting point is 00:07:11 Right. I mean, common is that they are in the quality data quality space or like data observability, like there are different terms of ideas, right? Yeah. What's the difference? Like how did you see like great expectations come into play with what is happening with this category and where we start in terms of the categories. Like, I will think of the doubtful.
Starting point is 00:07:31 We're still like trying to figure it out. Yeah. Well, I, I'm going to tell you that we figured it out. I mean, I'm, I'm mostly kidding here, but yes, there's a lot of work to do in, in figuring out from an industry, how the industry is going to play out. observing data, we're starting at it from the point where you say, well, we want to be able to test that data as it moves through a system is fit for the purpose that we want it to be fit for. And so in order to do that, you have to have people defining, you know, this is what we expect it to look like. And we don't think you can ever get away from people. So when you talk about like
Starting point is 00:08:25 human in the loop AI systems where you have, you know, people involved, that's more closely what we think it looks like, as opposed to AI coming in and solving everything and telling you what the problems are you need to know. It's more human in the loop systems that sort of evolve with machine learning and work together to figure out how to make stuff faster and automate a lot of those pieces. Yeah, makes total sense. So because in this industry, like we love to borrow, let's say, terminology from software engineering.
Starting point is 00:08:56 So software engineering, we have like unit tests, we have like integration tests, like it's a much more mature like discipline when you can't like, of course, you're there, right? What you would say that in software engineering is closer to what greatest expectation is, is like, it's like building like unit tests, for example, something similar to that. Is it like something else? Like that's, I mean, other people are talking about, you know, they're talking about, you know, data, for example, when it comes to what it is.
Starting point is 00:09:26 So what you would say is like the closer part of the infomercials were integrated to what wave of experimentation. Yeah, I've seen that question quite a few times. And when we've talked about it internally, we would look at testing and observability as two sides of the same coin. That you can't really you can't really split them apart and say okay we're we're doing this so for us you you can't get away from observability as something that you need like you've got to be able to kind of see what if i come into my data warehouse or let's say i've got all my data and um an s3 and running over here we've got spark and then we've got piping it into the data warehouse here, we've got SparkMod.
Starting point is 00:10:05 And then we've got piping it into the data warehouse. And then we've got, you know, after that, we're using Jupyter Notebooks to do some analysis. I want to be able to see everything and understand and understand where the problems are. And so, yes, that's important. But understanding the specific tests and places where you can validate, that's the other side of the coin that you can't like separate those out.
Starting point is 00:10:28 So in our platform, we feel like you've got to build both of those to make sense. The testing, you can build individual tests and that would be a very manual and labor intensive process to build all the tests that you want. And so we need to have machine coming in and say, well, how can we get 80% of that automatically? And that, and that's where you get into kind of more smart tooling. And then also building observability into this, making sure that you can see that in a, in an easy way from a central place or making sure you're alerting the right people that need to be alerted.
Starting point is 00:11:02 So yeah, both sides of those we feel are really important. Henry Suryawirawan, And I'll get back to software engineering again. And we covered the concept of like CI CD there where testing happens, you know, right, like we write the tests when the code is like full singular poster at some point, it's been built like this go around, blah, blah, blah, all the things that software engineers know about. What's the frozen with data because we don't really have CICP, right? It's like we have like something, I don't know, meaning we for like, yes, so in the pipeline of like create capturing, creating and consuming
Starting point is 00:11:39 data exists where, where should we test it as an as an issue yeah well first off it is cool to see some companies actually going after that versioning of data i love seeing that that sort of action happening obviously there's a lot of work to do there but as far as as far as testing goes where should it fit in same way you would do with software. We would say that before you release a model to production and start, you know, getting production results off it, you want to make sure it's tested. And let's say, you know, in the same way software, you would say, oh, well, I'm going to commit. I'm going to make a commit. And now I'm going to, you going to run my integration tests or I've got unit tests on that. And then we run that before we deploy. It's kind of the same pattern
Starting point is 00:12:30 with data. It's just that we don't have mature infrastructure around that process yet in the industry. But you're starting to see a lot of those pieces get built out, especially like you see it in MLOps. You've got all this tooling that's coming out there. We see a lot of that tooling as being built and we are right in the middle of that. Like you have to test before you deploy the same way you would with software. All right. So let's talk a little bit about open source.
Starting point is 00:12:57 What's the relationship with open source? Yeah. So again, we were talking a little bit before this and I mentioned I might've been a skeptic a few years ago and now i'm like why would you ever build a company without having an open source product which is so interesting right because you think i mean to your average person you say hey we're going to build something and we're going to give it away for free to the entire world and then we're going to build a business on it. Yeah. And they kind of say like, okay. The business trait is going, wait, wait, wait.
Starting point is 00:13:26 Right, exactly. It's not making sense. Yeah. But I guess there's two things. One, I think, like this is my personal belief, I think most people are good and they want to do good things. And so this appeals to both the altruistic side of me and most of the people I work with and the people I remember working with.
Starting point is 00:13:48 They love doing something cool and giving it away. So that's the one. It actually appeals to a side of us that's very personal and we want to do something good and cool. And that feeds into how much excitement you get, right? And then the other side is, well, if I want to get, if I want to be deploying my product and get thousands of people using it and eventually millions, like what's the fastest way to do that? But I'm a bottoms-up approach where the people who actually use the software can just get it for free. They can tell their friends about it. They can deploy it. They can share it, building in ways that you can share it. Open source is fantastic for disseminating an idea and getting it out there in a way that if you have a paywall, it's just going to be much slower, orders of magnitude slower.
Starting point is 00:14:44 Yeah. Talk about the time a little bit. And I know like we're coming up on time here because you have a team dinner to get to. We'll be respectful of that, but talk about the time. So did you start out as open source? Because I know you said like, even maybe in the early days, you didn't necessarily think that open source is like, this is the best decision, you know made. How long did it take? Because there's an adoption period, there's sort of a validation period from a community standpoint. How did that play into it? Abe and I had a conversation at one point where Abe was saying, you know, if our company never makes money, I would still be really happy if the open source project really got far and wide and a lot of people used it. And understanding that, okay, there's this other side that we're going to be happy to build a community and build open source. And then bringing it back now where it's like, well, even if we were making a lot of money, it would feel like a failure if the open source project died or we didn't, you know, we weren't able to create something actually useful for a lot of companies. So there's a
Starting point is 00:15:54 commitment to open source that sort of supersedes the commitment to the business, but then the business, like it's, it's really going to follow follow. There's a lot of business value in having that open source community. So the timeline is really, okay, let's put it out there. Let's see what happens. We start to get, you know, a few hundred stars, people using it. We start to see deployments. And then it was really figuring out that we're trying to build a shared language. So we need a community because a language cannot exist without a community.
Starting point is 00:16:31 Or like grow or develop. Yeah, or grow or develop. And so starting that community and then starting to see the growth of that, that was really what kind of inspired us to realize, okay, this is how we can build a business around this. And, and it was a couple, you know, it was a couple of years before we really could see that. Sure. And at the beginning, it was sort of a side project, but after a couple of years, you see that growth and then we could tell, okay, we can build a business. Yeah. Which is, you know, it's easy to look back and say a couple of years, but,
Starting point is 00:17:02 you know, we can all think of experiences in our life where like going through a several year period of something, like it doesn't necessarily feel like just a couple of years when you're in the middle of those years, you know? Well, and during those years we did hire, I think maybe one or a couple of engineers and those of us on the consulting side were paying for them. We didn't have investment, but it was super fun during those years. Yeah. Oh, yeah. Okay.
Starting point is 00:17:28 Well, we want to be respectful of your time. So I have one more question than Costas. I'll give you the last word here. So you went from, you know, making sure that billionaires stay billionaires. And so what is it like sort of coming from that world and then maybe even the healthcare world, you know, where there's sort of maybe, you know, in healthcare, there's probably like bureaucracy, things move slower. What's it like now working for sort of a really modern open source company in the data space? What are sort of the biggest things that you notice as differences? Well, I think for my personality, I needed to be in a smaller organization. So I really appreciated just being able to be with a group of people who get together and decide together, like, what's the best thing to do here?
Starting point is 00:18:13 Not what are you supposed to do? Not what does that, you know, report say I'm supposed to do? Not what is this policy, but what should we do? What's the best thing to do? And so it feels really fun to do that and then be around other people who just want to do that. And I think the small startup, you know, really attracts those types of people. I also am kind of a risk junkie, so I just wanted to see if we could do it. If we fail, okay, you know, sorry, we're out some money, you know, take a hit on the salary, but let's see if we can do this. And if we do, it's really exciting.
Starting point is 00:18:49 So that definitely resonated with me personally. But also, like, if you talk to Abe, he's been kind of pretty vocal about being really concerned with how data is used. And is it ethical? Like, are we doing things that are actually good in the world with data? And one of the cool things about great expectations is it kind of helps you make explicit some of the assumptions and the rules and the things that you're expecting about data. And that has larger implications for like, should we do this with our data, right? And making that explicit in documentation.
Starting point is 00:19:28 And so it's kind of fun to have some ethical purpose behind what you're doing as well. Yeah. Before you get the last word, Costas, I just want to say I really appreciate that. And I appreciate it sounds like there's an ethos inside of Great Expectations where you're doing some like really interesting technical things, but it's very clear that there's a culture where you see the larger picture and sort of operate according to a value system within that. And I just really appreciate that. So thank you. Yeah. That means a lot at the end of the day, we're, we're all people here and, and, and we're building some software, but we're, we're people building software. So thank you.
Starting point is 00:20:07 Yeah, that's amazing. I think that's like one thing, like having this kind of this dimension of this is like in the companies of what super is, let's say, one is from great companies too, and really important to see like what's next for great expectations, so that's my last question. Like what we should like, share with us something exciting that is coming like the near future yeah well i'd say there's there's been so many like i i mean we're really excited
Starting point is 00:20:33 about all these opportunities there are but a focus for us going forward is always going to be to invest in the community around great expectations and invest in the open source, kind of build that up to be something that is super useful, not just for an individual to start to make some tests, but maybe an individual to put hundreds or thousands of tests on a data warehouse really, really quickly and be able to do that just with the open source product, right? So there's a lot more investment we can do to make it seamless, to make it easy to use. And we're not just going to save those for the commercial product. We're going to do a lot of that in the open source so that we can really feel good about, hey, we're enabling data engineers to
Starting point is 00:21:19 do something really powerful just with the open source product. And then obviously it is exciting to see how we can deploy that in organizations at the enterprise level, and that's going to involve collaborative workflows. So that's my role. I'm personally excited to see us release a commercial product that can enable enterprises to do some good stuff with data quality. All right. Well, thank you so much. I think we're going to get you out the with data quality. All right. Well, thank you so much. I think we're going to get you out the door in time for team dinner.
Starting point is 00:21:49 And we're excited to talk with Abe tomorrow. This will be a two-part episode. This will be really fun. But Ben, thanks for giving us some of your time. Yeah, thank you so much. So good to be here. What a fun conversation. I cannot wait to talk with the technical co-founder.
Starting point is 00:22:03 A couple of things, I think. It's always amazing to hear the origin stories. And there are a lot of similarities here with the DBT story, where you sort of have a consultancy and then technology coming out of it. And I think one of my takeaways, I have two. The first one is it takes a lot of courage to be running a consultancy and you can make a lot of money with a consultancy and do cool things. And they were working in the healthcare space and that can have a really significant impact in a positive way. And to say, okay,
Starting point is 00:22:39 we're going to go really invest in this open source side project. I know it takes a lot of courage and I just have a huge amount of respect for teams that can do that. Because that's, you know, you look back now and it's like, oh, this is so cool. There's a great community, right? But in the very beginning, that's a very sort of, it can be a scary proposition. And then the other thing is, you know, I just hats off to them for, you know, doing the pip install great expectations because that's one of the cleverest like tech company names I've ever heard of. It makes me smile every time I think about it. I want to install it just so I can type. Yeah.
Starting point is 00:23:20 Yeah. Makes sense. Yeah. They know what they're doing, for sure, like on many different levels, like on the product level, on the community level. Most importantly, what I want to keep from this conversation is like the passion that the founders have
Starting point is 00:23:37 about building a company and the whole, let's say, what it means to build a company outside of like just the founders, right? And that's exactly where like, it makes it so interesting to see people obsessed so much with the community. Like they don't see the work that this company is doing just like, okay. As a way, like to create value in a very monetary way.
Starting point is 00:24:03 Like there are more things there. And I think that's what, I mean, as I said during the conversation, this is what differentiates really good companies to great companies, what makes a great company. But also it's a huge, huge indicator of the commitment that the founders have to make this happen. So I'm very happy that I must have this conversation and connect with the great expectation people. And I'm really looking forward to see what's next for them because they are very creative and I'm sure that we are going to be surprised. Outside of this and a bit more on the technical side, I love the fact
Starting point is 00:24:45 that we see more and more of best practices from software engineering entering the work of working with data. We discussed about unit testing and how great expectations are related to that. So yeah, I mean another great conversation
Starting point is 00:25:02 and I think we should have more conversations with the great expectation folks. There are other people in the team. And I think we should have more conversations with the great expectation folks. There are other people in the team there that I think should be on the show. I agree. We'll do it. All right. Several more great episodes coming at you from recording on site at Data Council Austin. We'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
Starting point is 00:25:35 That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
