The Data Stack Show - Data Council Week (Ep 2): Testing and Observability Are Two Sides of the Same Coin With Ben Castleton of Great Expectations
Episode Date: April 26, 2022

Highlights from this week's conversation include:

- Ben's background and career journey (2:13)
- The birth of Great Expectations (5:02)
- Defining software engineering (9:38)
- Adopting open source products (13:04)
- Working in data versus healthcare (18:01)
- What's next for Great Expectations (20:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are
run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show, still recording
on site at Data Council in Austin. We had a great conversation with Firebolt,
and the one we're about to have is with a company called Great Expectations. Now, Kostas,
this is what I'm interested in as far as Great Expectations. One, the name. But two, of all the, you know, sort of data quality, data observability variety of tools, the community and adoption that Great Expectations has is pretty impressive, and I think that as an open source project in that space, they've really had a ton of adoption.
And so I'm interested to hear about, you know, sort of the origin story, like why did they choose to open source it, and how they've grown that community.
How about you?
Yeah, absolutely.
I mean, learning more about the community is something that I definitely hope happens. They have a very vibrant community, one of these cases, like the community you have around dbt, where people are obsessed with the technology.
So yeah, I mean, I want to learn more about the technology itself, how it differentiates from the rest of the data quality tools out there, and to chat about the community and what it means to have an open source dimension to a product that mainly does data quality.
So I'm really looking forward to this
conversation. All right, let's dig in. Let's do it. Ben, welcome to the show. I have been lurking sort of in the background looking at Great Expectations for a long time. So really fun
to meet you here at Data Council Austin and hear about the origin story. So thanks for giving us
some time. Yeah, no problem. Thank you.
Okay. So give us your background and tell us what led to, you know, sort of starting Great
Expectations. Yeah. Well, my background: I basically started as an accountant and then switched over into healthcare. Back when I was in accounting, I was in hedge funds. I was basically working to make sure that billionaires stayed billionaires.
And I didn't feel like that was doing anything good for the world.
And I had a good friend in Boston at the time who told me, you got to get into healthcare
and data's where it's at.
So I switched over to doing analytics and data.
And that led me to meet up with Abe.
And we realized there's a lot of work to be done to
help analytics in healthcare, you know, help more people and work faster. So this was a consulting
firm. It was not a product firm at all. So it wasn't SaaS from the beginning?
No, no, not at all. We were sort of like a tools-enabled consulting firm. And so my background led to figuring out: how can we sell consulting? How can we do data engineering for healthcare companies? Sure. Yeah. Not where
we started, but we had this meeting way back at the beginning where I remember us saying,
yeah, it's okay, Abe, if you spend 5% of your time on great expectations, because
yeah, maybe that'll help your career somehow. I'm not sure.
Google does, like, whatever, 20% time or something; he got 5%. But it became clear in 2019 that Great Expectations had legs. It was taking off. There was a lot of demand across industries. And so we pivoted the company. We had been deeply embedded in our clients' teams, figuring out what are the problems that they're really trying to solve. That's, you know, I know dbt has a great
story there and same thing. We had real problems that we were trying to solve with this little
side project, and we would use it on our early clients, and then it started to take off on its own.
Okay, so it's really interesting
for me to hear that you were in the healthcare space doing work there, because I wouldn't think the natural decision is, we're going to open source this and really build an open source ecosystem around this tool, right? Because with healthcare, you just kind of think about protecting IP. And so
tell us that story. Like, I mean, Great Expectations has an unbelievable community
around it. And how did that come out of the healthcare consulting?
Yeah. So actually, Great Expectations was started through cross-team collaboration between Abe and James, who was working with the NSA, and they were sort of collaborating across
organizations to figure out how can we, you know, solve some of these problems we're seeing.
So that was going on in parallel to us building up this healthcare consulting firm. Got it. That
was the 5% time. Yeah. And you
know, you can go over there and do that thing you're doing, Abe. And eventually James came over and joined our team as we moved more towards, you know, getting Great Expectations out there. But James and Abe really started this together. James is our co-founder, but he was in a different company when he helped co-found what we've got going on. So yeah, it started cross-industry, and we've never had demand for it from specific industries. It's always been just demand from everywhere. And then we tried to use it in healthcare a little bit. Yeah. Makes total sense.
Oh, you stole my question.
Oh, it's a good one. Well, first off, we love the name Great Expectations. I think,
I don't know if that was Abe and James together, but definitely Abe's got his name all over it.
Loving, you know, old English literature and Charles Dickens.
And so the puns with pip install great_expectations, it's endless.
So good. And the idea was really to take on data quality and do that out in the open, and figure out how we can validate and test if we're getting what we expect from data at different points in the life cycle.
And then, you know, there's lots of different places you can go, but that's the entry point
into figuring out how to collaborate better around data and enable collaboration.
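[Editor's note: for readers new to the project, here is a minimal sketch of what declaring and validating expectations looked like with the classic Great Expectations pandas API around the time of this episode. The file name and columns are hypothetical stand-ins.]

```python
# A minimal sketch using the classic Great Expectations pandas API (circa 2022).
# "orders.csv" and its columns are hypothetical stand-ins.
import pandas as pd
import great_expectations as ge

# Wrap an ordinary DataFrame so it gains expect_* methods.
df = ge.from_pandas(pd.read_csv("orders.csv"))

# Declare what we expect the data to look like at this point in the lifecycle.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Re-run every recorded expectation and report overall pass/fail.
result = df.validate()
print(result.success)
```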
And okay, I have a more technical question on that for later, but let me ask: many things are happening right now in the industry, right?
I mean, what's common is that they are in the data quality space, or data observability; there are different terms and ideas, right?
Yeah.
What's the difference? How did you see Great Expectations come into play with what is happening with this category, and where do we stand in terms of the categories? I would think we're still trying to figure it out.
Yeah.
Well, I'm going to tell you that we figured it out. I'm mostly kidding here, but yes, there's a lot of work to do in figuring out how the industry is going to play out. When it comes to observing data, we're starting from the point where you say, well, we want to be able to test that data, as it moves through a system, is fit for the purpose that we want it to be fit for.
And so in order to do that, you have to have people defining, you know, this is what we expect
it to look like. And we don't think you can ever get away from people. So when you talk about like
human in the loop AI systems where you have, you know, people involved, that's more closely what
we think it looks like, as opposed to AI coming in and solving everything and telling you what the problems are that you need to know about. It's more human-in-the-loop systems that sort of evolve with
machine learning and work together to figure out how to make
stuff faster and automate a lot of those pieces.
Yeah, makes total sense.
So, because in this industry we love to borrow, let's say, terminology from software engineering. In software engineering, we have unit tests, we have integration tests; it's a much more mature discipline, of course, right?
What would you say in software engineering is closest to what Great Expectations is? Is it like building unit tests, for example, something similar to that? Or is it something else? I mean, other people are talking about, you know, different things when it comes to what it is. So what would you say is the closest software engineering parallel to what Great Expectations does?
Yeah, I've seen that question quite a few times.
And when we've talked about it internally, we would look at testing and observability as two sides of the same coin.
You can't really split them apart and say, okay, we're doing just this. So for us, you can't get away from observability as something that you need. You've got to be able to see: if I come into my data warehouse, or let's say I've got all my data in S3, and running over here we've got Spark, and then we've got it piping into the data warehouse. And then we've got, you know, after that,
we're using Jupyter Notebooks to do some analysis.
I want to be able to see everything and understand where the problems are.
And so, yes, that's important.
But understanding the specific tests and places where you can validate, that's the other side of the coin. You can't separate those out. So in our platform, we feel like you've got to build both of those for it to make sense.
For the testing, you can build individual tests, but that would be a very manual and labor-intensive process to build all the tests that you want. And so we need to have the machine coming in and say, well, how can we get 80% of that automatically? And that's where you get into kind of more smart tooling.
And then also building observability into this, making sure that you can see that in an easy way from a central place, or making sure you're alerting the right
people that need to be alerted.
So yeah, both sides of those we feel are really important.
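[Editor's note: a rough illustration of the two-sides-of-the-same-coin idea from this answer. The expectations are the testing side; routing the validation results to an alert is the observability side. notify() and the file/column names are hypothetical stand-ins, and this is a sketch of the pattern, not the project's own alerting integrations, which ship separately.]

```python
# Sketch: expectations as the "testing" side, alerting as the "observability" side.
import pandas as pd
import great_expectations as ge

def notify(message: str) -> None:
    # Hypothetical stand-in for a real alerting hook (Slack, PagerDuty, email).
    print(f"ALERT: {message}")

# Hypothetical extract pulled from the warehouse stage of the pipeline.
df = ge.from_pandas(pd.read_csv("warehouse_extract.csv"))

# Human-defined tests: what "fit for purpose" means at this point.
df.expect_table_row_count_to_be_between(min_value=1)
df.expect_column_values_to_not_be_null("user_id", mostly=0.99)  # tolerate 1% nulls

# Surface failures to the people who need to be alerted.
validation = df.validate()
for r in validation.results:
    if not r.success:
        notify(f"Expectation failed: {r.expectation_config.expectation_type}")
```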
And I'll get back to software engineering again. We have the concept of CI/CD there, where testing happens, you know, right? We write the tests, the code gets pushed at some point, it gets built, the tests go around, blah, blah, blah, all the things that software engineers know about.
What happens with data? Because we don't really have CI/CD there, right? We have something, I don't know. So in the pipeline of capturing, creating, and consuming data, where should we test?
Yeah, well, first off, it is cool to see
some companies actually going after that versioning of data. I love seeing that sort of action happening. Obviously there's a lot of work to do there, but as far as testing goes, where should it fit in? The same way it does with software. We would say that before you release a model to production and start, you know, getting production results off it, you want to make sure it's tested. And, you know, in the same way with software, you would say, oh, well, I'm going to commit.
I'm going to make a commit.
And now I'm going to run my integration tests, or I've
got unit tests on that. And then we run that before we deploy. It's kind of the same pattern
with data. It's just that we don't have mature infrastructure around that process yet in the
industry. But you're starting to see a lot of those pieces get built out, especially like you
see it in MLOps. You've got all this tooling that's coming out there.
We see a lot of that tooling as being built and we are right in the middle of that.
Like you have to test before you deploy
the same way you would with software.
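[Editor's note: a minimal sketch of the test-before-you-deploy pattern Ben describes, written as a script a CI job could run before promoting a dataset or model; a non-zero exit fails the pipeline the way a failing unit test blocks a software deploy. The path and columns are hypothetical.]

```python
# ci_data_check.py — a sketch of a CI gate: validate data before deploying.
import sys

import pandas as pd
import great_expectations as ge

# Hypothetical artifact produced earlier in the pipeline.
df = ge.from_pandas(pd.read_parquet("exports/training_data.parquet"))

# The "unit tests" for this dataset.
df.expect_column_values_to_not_be_null("label")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

result = df.validate()
if not result.success:
    sys.exit(1)  # non-zero exit fails the CI job, blocking the deploy
```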
All right.
So let's talk a little bit about open source.
What's the relationship with open source?
Yeah.
So again, we were talking a little bit before this
and I mentioned I might've been a skeptic a few
years ago, and now I'm like, why would you ever build a company without having an open source product?
Which is so interesting, right? Because, I mean, to your average person you say, hey, we're going to build something and we're going to give it away for free to the entire world, and then we're going to build a business on it. Yeah. And they kind of say, like, okay. The business brain is going, wait, wait, wait.
Right, exactly.
It's not making sense.
Yeah.
But I guess there's two things.
One, I think, like this is my personal belief,
I think most people are good and they want to do good things.
And so this appeals to both the altruistic side of me
and most of the people I work with and the people I remember working with.
They love doing something cool and giving it away.
So that's the one.
It actually appeals to a side of us that's very personal and we want to do something good and cool.
And that feeds into how much excitement you get, right? And then the other side is, well, if I want to be deploying my product and get thousands of people using it, and eventually millions, what's the fastest way to do that? It's a bottoms-up approach, where the people who actually use the software can just get it for free.
They can tell their friends about it.
They can deploy it.
They can share it; we're building it in ways that you can share it.
Open source is fantastic for disseminating an idea and getting it out there in a way that if you have a paywall, it's just going to be much slower, orders of magnitude slower.
Yeah.
Talk about the timeline a little bit. And I know we're coming up on time here because you have a team dinner to get to. We'll be respectful of that, but talk about the timeline. So did you start out as open source? Because I know you said, even maybe in the early days, you didn't necessarily think that open source was the best decision, you know. How long did it take? Because there's an adoption period, there's sort of a validation period from a community standpoint. How did that play into it?
Abe and I had a conversation at one point where Abe was saying, you know, if our company never makes money, I would still be really happy if the open source project really got far and wide and a lot of people used it.
And understanding that, okay, there's this other side that we're going to be happy to build a community and build open source. And then bringing it back now where it's like, well, even if we were making a
lot of money, it would feel like a failure if the open source project died or we didn't, you know,
we weren't able to create something actually useful for a lot of companies. So there's a
commitment to open source that sort of supersedes the commitment to the business, but then the
business, like, it's really going to follow. There's a lot of business value in having that open source community.
So the timeline is really, okay, let's put it out there.
Let's see what happens.
We start to get, you know, a few hundred stars, people using it.
We start to see deployments.
And then it was really figuring out that we're trying to build a shared language.
So we need a community because a language cannot exist without a community.
Or like grow or develop.
Yeah, or grow or develop.
And so starting that community and then starting to see the growth of that, that was really what kind of inspired us to realize, okay, this is how we can build a business around this. And,
And it was a couple of years, you know, before we really could see that.
Sure.
And at the beginning, it was sort of a side project, but after a couple of years,
you see that growth and then we could tell, okay, we can build a business.
Yeah. Which is, you know, it's easy to look back and say a couple of years, but,
you know, we can all think of experiences in our life where like going through a several year period of something, like it doesn't necessarily feel
like just a couple of years when you're in the middle of those years, you know?
Well, and during those years we did hire, I think maybe one or a couple of engineers
and those of us on the consulting side were paying for them. We didn't have investment,
but it was super fun during those years.
Yeah.
Oh, yeah.
Okay.
Well, we want to be respectful of your time.
So I have one more question, and then Kostas,
I'll give you the last word here.
So you went from, you know, making sure that billionaires stay billionaires.
And so what is it like sort of coming from that world and then maybe even the healthcare world, you know, where there's sort of maybe, you know, in healthcare, there's probably like bureaucracy, things move slower.
What's it like now working for sort of a really modern open source company in the data space?
What are sort of the biggest things that you notice as differences?
Well, I think for my personality, I needed to be in a smaller organization. So I really appreciated just being able to be with a group of people who get together and decide together, like, what's the best thing to do here?
Not what are you supposed to do?
Not what does that, you know, report say I'm supposed to do?
Not what is this policy, but what should we do?
What's the best thing to do? And so it feels really fun to do that and then be around other people who just want to do that.
And I think the small startup, you know, really attracts those types of people.
I also am kind of a risk junkie, so I just wanted to see if we could do it.
If we fail, okay, you know, sorry, we're out some money, you know, take a hit on the salary, but let's see if we can do this.
And if we do, it's really exciting.
So that definitely resonated with me personally.
But also, like, if you talk to Abe, he's been kind of pretty vocal about being really concerned with how data is used.
And is it ethical?
Like, are we doing things that are actually good
in the world with data? And one of the cool things about great expectations is it kind of helps you
make explicit some of the assumptions and the rules and the things that you're expecting about
data. And that has larger implications for like, should we do this with our data, right?
And making that explicit in documentation.
And so it's kind of fun to have some ethical purpose behind what you're doing as well.
Yeah.
Before you get the last word, Kostas, I just want to say I really appreciate that.
And I appreciate that it sounds like there's an ethos inside of Great Expectations where you're doing some really interesting technical things,
but it's very clear that there's a culture where you see the larger picture and sort of operate according to a value system within that. And I just really appreciate that.
So thank you. Yeah. That means a lot. At the end of the day, we're all people here, and we're building some software, but we're people building software.
So thank you.
Yeah, that's amazing. I think that's one thing, having this kind of dimension in a company. It's, let's say, what separates good companies from great companies. And it's really important to see what's next for Great Expectations, so that's my last question: share with us something exciting that is coming in the near future.
Yeah, well, I'd say there have been so many... I mean, we're really excited about all the opportunities there are, but a focus for us going forward is always going to be to invest in the community around Great Expectations and invest in the open source,
kind of build that up to be something that is super useful, not just for an individual to
start to make some tests, but maybe an individual to put hundreds or thousands of tests on a data
warehouse really, really quickly and be able to do that just with the open source product, right?
So there's a lot more investment we can do to make it seamless, to make it easy to use.
And we're not just going to save those for the commercial product. We're going to do a lot of
that in the open source so that we can really feel good about, hey, we're enabling data engineers to
do something really powerful just with the open source product. And then obviously it is exciting to see how we can deploy that in organizations at the
enterprise level, and that's going to involve collaborative workflows.
So that's my role.
I'm personally excited to see us release a commercial product that can enable enterprises
to do some good stuff with data quality.
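[Editor's note: one way to picture the hundreds-or-thousands-of-tests goal is bulk-generating a baseline expectation per column instead of hand-writing each test. Great Expectations also ships profilers that automate this kind of thing; the loop below just keeps the idea explicit. The extract file and the 95% threshold are hypothetical.]

```python
# Sketch: bulk-generating a baseline expectation for every column.
import pandas as pd
import great_expectations as ge

# Hypothetical extract of one warehouse table.
df = ge.from_pandas(pd.read_csv("warehouse_extract.csv"))

# One baseline test per column; mostly=0.95 tolerates up to 5% nulls.
for col in df.columns:
    df.expect_column_values_to_not_be_null(col, mostly=0.95)

# Collect everything into a reusable, shareable suite.
suite = df.get_expectation_suite()
print(f"Generated {len(suite.expectations)} expectations")
```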
All right. Well, thank you so much. I think we're going to get you out the door in time for team dinner.
And we're excited to talk with Abe tomorrow.
This will be a two-part episode.
This will be really fun.
But Ben, thanks for giving us some of your time.
Yeah, thank you so much.
So good to be here.
What a fun conversation.
I cannot wait to talk with the technical co-founder.
A couple of things, I think.
It's always amazing to hear the origin stories.
And there are a lot of similarities here with the dbt story,
where you sort of have a consultancy and then technology coming out of it.
And for my takeaways, I have two.
The first one is it takes a lot of courage to be running a consultancy and you can make a
lot of money with a consultancy and do cool things. And they were working in the healthcare
space and that can have a really significant impact in a positive way. And to say, okay,
we're going to go really invest in this open source side project. I know it takes a lot of
courage and I just have a huge amount of respect for teams that can do that. Because that's, you
know, you look back now and it's like, oh, this is so cool. There's a great community, right? But
in the very beginning, that's a very sort of, it can be a scary proposition. And then the other
thing is, you know, I just hats off to them for, you know, doing the pip install great expectations because that's one of the cleverest like tech company names I've ever heard of.
It makes me smile every time I think about it.
I want to install it just so I can type it.
Yeah.
Yeah.
Makes sense.
Yeah.
They know what they're doing, for sure,
like on many different levels,
like on the product level, on the community level.
Most importantly, what I want to keep from this conversation
is like the passion that the founders have
about building a company
and the whole, let's say,
what it means to build a company outside
of like just the founders, right?
And that's exactly what makes it so interesting: to see people so obsessed with the community. They don't see the work that this company is doing just as a way to create value in a purely monetary sense. There are more things there.
And I think that's, I mean, as I said during the conversation, what differentiates really good companies from great companies, what makes a great company.
But also it's a huge, huge indicator of the commitment that the founders have to make
this happen. So I'm very happy that we had this conversation and connected with the Great Expectations people.
And I'm really looking forward to seeing what's next for them, because they are very creative
and I'm sure that we are going to be surprised.
Outside of this and a bit more on the technical side, I love the fact
that we see more and more
of best practices from software engineering
entering the world of working with data. We discussed unit testing and how Great Expectations relates to that. So yeah, I mean,
another great conversation
and I think we should have more conversations with the Great Expectations folks. There are other people on the team there that I think should be on the show.
I agree. We'll do it. All right. Several more great episodes coming at you from
recording on site at Data Council Austin. We'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.