Orchestrate all the Things - Superconductive scores $21M Series A funding to sustain growth of its Great Expectations open source framework for data quality. Featuring CEO and co-founder Abe Gong
Episode Date: May 20, 2021

Ensuring data quality is essential for analytics, data science and machine learning. Superconductive's Great Expectations open source framework wants to do for data quality what test-driven development did for software quality. Technical debt is a well-known concept in software development. It's what happens when unclear or forgotten assumptions are buried inside a complex, interconnected codebase, and it leads to poor software quality. The same thing also applies to data pipelines, it's called pipeline debt, and it's time we did something about it. That's the gist of what motivated Abe Gong and James Campbell to start Great Expectations in 2018. Great Expectations is an open-source tool that aims to make it easier to test data pipelines, and therefore increase data quality. Superconductive, the force behind Great Expectations, has announced it has received $21 million in Series A funding led by Index Ventures with CRV and Root Ventures participating. We caught up with Gong to learn more about Great Expectations. Article published on ZDNet.
Transcript
Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together.
Ensuring data quality is essential for analytics, data science and machine learning.
Superconductive's Great Expectations open source framework wants to do for data quality what test-driven development did for software quality. Technical debt is a well-known concept in software development. It's what happens when unclear or forgotten assumptions are
buried inside the complex interconnected code base and it leads to poor software
quality. The same thing also applies to data pipelines. It's called pipeline debt
and it's time we did something about it. That's the gist of what motivated Abe
Gong and James Campbell to start Great Expectations in 2018.
Great Expectations is an open-source tool that aims to make it easier to test data pipelines and therefore increase data quality.
Today, Superconductive, the force behind Great Expectations, has announced it has received a $21 million Series A funding round.
And we caught up with Abe Gong
to learn more about Great Expectations.
I hope you will enjoy the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
It's a $21 million round being led by Index Ventures.
CRV and Root, who invested in our seed round, also participated, and we're thrilled to have them on board. I mean, Index has such a great brand and reputation for open source and developer tools in particular, so having them lead the round is just awesome. There's nobody else we would rather have lead it. For me personally, I'm coming from a data science,
data engineering background. So the thing that I've done several times is work with a growth
stage company to build out data as a function within the organization. So that's often starting
hands-on keyboard, making architecture decisions, actually writing a lot of the code. And then over
time, hiring people, building the team and kind of transitioning into management. So I've done that transition a few times now, from being the first data scientist at Jawbone and working with IoT data, personal health, kind of upload-heavy mobile apps
to more recently being the chief data officer at Aspire Health, where I was working to build
integrations to all the major insurance companies
in the country. There's a lot of internal complexity there. Is that helpful? Does that
give you a good start, at least on the round and on me? Yeah, sure. And yeah, I wonder if you could
also share a few more words on, well, company background basically. I know that you didn't just start it on your own, you have a co-founder if I'm not mistaken. So if you'd like to say just a few words on, you know, how you met and how you got interested in expectations basically, which is the next topic in line for discussion. And I'll try to get as much in-depth on that as possible in the next minutes.
But just if you could give us an intro and what expectations are
and how you got started with that, basically.
Got it.
And George, on this, there are a few moving pieces.
So I'll trust you to know how to sort this out.
The company is called Superconductive.
We actually started originally
as Superconductive Health, focused on healthcare vertical data integration, data analytics,
and so on. As part of that work, we built this open source project called Great Expectations.
And that was actually built in collaboration with some people outside the company,
including James Campbell, who was kind of the other half of Great Expectations in the very early days.
Two years ago we started to see a lot of organic community momentum around the Great Expectations project, and we kind of looked at it and realized something really interesting was happening around data quality, and it wasn't limited to healthcare. So we made the decision to change our business model, focus exclusively on Great Expectations, and take it to market. So we went from being kind of a healthcare data company to being specifically data quality, built around Great Expectations as an open source project.
Is that helpful? Is that clear? I know there are a few moving parts.
Yeah, yeah, pretty much. And I think it will get even more clear as we get into what
expectations actually are. So I have to admit that, you know, initially when I heard the name
Great Expectations, and you're obviously aware of that, it kind of relates to literature. And I was
like, so okay, so how is that relevant for data science?
Because like opening up your website and your GitHub,
it was immediately obvious that what you do is data science related,
but the connection was not entirely obvious at first.
So it became a bit more clear after digging up a little bit and,
you know, trying to figure out what expectations are.
So the simplest way I could possibly frame it, and you can have your say on whether this is a correct framing or not,
is that, to me, expectations look like a kind of meta schema.
Or in terms of when we're talking about things like test frameworks or even logic programming,
they look a bit like assertions. So statements that are expected to be true basically. And
they apply to data and they kind of try to foresee what the shape of data will look like, basically. So things like this column is expected to have such values
or this value is supposed to be filled in and things like that.
So would you say that's correct framing of what expectations are?
I think that's right.
We think of expectations as declarative assertions about data,
so that everyone in the organization can understand what the data is supposed to be.
They can literally have shared expectations of their data.
And that goes from engineering teams that are using them as sort of a testing framework
to documentation that's automatically compiled from those tests that become a really helpful
collaboration tool for non-engineers around the organization who still need a seat at the table
for understanding data. There's another way you can think of it, by the way, which is you've said
it's sort of a meta schema. We prefer to think of it as a shared open standard for data quality so that anyone can
declare, like, here's what we expect. And I'll grant you, like, that certainly has schema-like
aspects, but you can also do things that go way beyond schema. For example, checking the
distribution within columns or looking for statistical relationships among different
columns. All of those are things where you can expect this relationship
or expect this distribution.
So I think if you say schema,
it'll feel fairly narrow.
And the scope of what you can do
with this abstraction is very, very broad.
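To make that concrete, here is a minimal sketch of what declaring expectations can look like in Python, assuming the classic Pandas-backed Great Expectations API; the file name, column names and thresholds are invented for illustration.

```python
import great_expectations as ge

# Load a CSV into a Great Expectations-wrapped Pandas DataFrame
# (file and column names here are hypothetical).
df = ge.read_csv("orders.csv")

# Schema-like assertions: the column exists and is never null.
df.expect_column_to_exist("order_id")
df.expect_column_values_to_not_be_null("order_id")

# Assertions that go beyond schema: value ranges and distributional checks.
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
df.expect_column_mean_to_be_between("amount", min_value=10, max_value=500)

# Evaluate every declared expectation and get a machine-readable result,
# including an overall success flag and per-expectation details.
results = df.validate()
print(results)
```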
Okay.
Okay.
Yeah.
Thanks for clarifying that
because yeah, I admit,
I didn't exactly have all the time in the world to go
deep into the weeds of how everything about expectations works, so at first...
I wouldn't expect you to, excuse me. I totally understand.
Okay. So that kind of actually already answers a follow-up question I had, in a way, which was like, okay, so how is that useful exactly? And I know it may sound like a dumb question initially, but
you know basically your intended scope of application is data pipeline so
when you have data ingested at some point
and then the various transformations being applied to it
and eventually being used downstream
for a number of applications and purposes,
be it dashboards, BI, machine learning,
reporting to clients, yes.
All of that.
So the goal, I think what you're targeting
is basically to have a sort of quality assurance in this pipeline, so that when something changes or breaks in an upstream step, it doesn't negatively affect everything that's downstream from that.
So how does having expectations help in that goal?
So I think it helps in two ways, at two levels.
And for both levels, we think of speed and trust, kind of like speed, trust, confidence,
like those are the things that you get out of it.
So at the level of an individual team, say, you know, five data scientists, data engineers
collaborating on a project, they act a lot like unit tests in the sense that everybody
can declare what they expect.
Everybody can be confident there's a source of record for how the data should operate at a given
stage in the pipeline. So that's at the team level. But there's also a really interesting
thing that happens between teams. So I've been looking at the Escher drawing behind you and
thinking of handshakes. So when you look at modern data systems,
the data is usually passed from team to team several times before it actually informs decisions
or goes into product. And what that means is if the teams aren't aligned on what to expect of the
data throughout that process, it's really easy for the downstream process to be wrong. And sometimes
like really egregiously wrong. So for example, I've now met three teams
where they built awesome machine learning models.
And when the machine learning model
was deployed in production,
it was deployed upside down.
So it's like making the worst possible predictions.
And like, I mean, that's a really simple example,
but there are all kinds of other subtle ways
where a data pipeline can kind of look correct at one stage, and if there isn't shared understanding at the next stage, it'll be wrong. So by having declared
expectations of what the data should look like at each kind of gate along the way, at each stage of
the pipeline, you have this handshake where you know that there is trust and shared understanding
all the way through the organization. Okay. So since this applies to pipeline,
when data first enters the pipeline,
well, you have different scenarios, basically.
So you may have entirely unstructured data, like, I don't know,
CSV files or even documents or even multimedia files,
like audio and video.
And it's pretty hard to apply any kind of schema to those.
But in other scenarios, you have data that may be structured,
like coming, for example, from relational databases
or other types of data management systems
that do internally have some kind of schema.
So I wonder if, well, I noticed that you list in your GitHub, again,
a number of systems with which you integrate.
And I wonder what the nature of this integration is
and whether you leverage these existing schema mechanisms
in systems that actually provide those.
It's interesting that you include CSV as unstructured
because, I mean, it isn't really structured and yet there's a lot of data work that happens to CSVs and that is a format that we support.
So we've built Great Expectations to be kind of ubiquitous across many backends. Like if you want a shared standard for data, it has to work everywhere. So Great Expectations can work natively, meaning you can compile your expectations and execute them in Python Pandas, which covers a lot of machine learning notebook, CSV-type work. You can also use it on lots of different dialects of SQL. And the big ones are the data warehouses: Snowflake, BigQuery, Redshift, SQL Data Warehouse.
It also works in Spark.
And so you can use the same expectations
and trust that regardless of where your pipeline
is being executed, whether it's in SQL or Spark or in Pandas, like all of those places,
expectations will execute the same way
and return the same results.
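For reference, this is roughly what that backend-agnostic, declarative form looks like once an expectation suite is serialized; it is shown here as a Python dict with invented names, and it is this representation that gets compiled to Pandas, a SQL dialect, or Spark at validation time.

```python
# A sketch of a serialized expectation suite. Nothing here is tied to a
# specific backend; the same declarations can be executed against Pandas,
# Snowflake, BigQuery, Redshift, or Spark. Names are hypothetical.
expectation_suite = {
    "expectation_suite_name": "orders.warning",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 10_000},
        },
    ],
}
```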
Okay, okay. However, I still wonder if, so, you know,
data enters your pipeline at some point and, well, we can discuss whether, for example,
CSV is structured or not. I was more thinking on the aspect of whether you can actually specify
things like, you know, conditions and ranges for
your fields and things like that in CSV. Well, depending on the tool you use, you may be able
to do that, but that's a different discussion. But let's assume that, you know, whatever your
data is, when it enters your pipeline, it's coming from a system that does already have some kind of
mechanism to impose rules and structure and all of those things. I was wondering whether expectations interplay with that, or whether you kind of cast that aside and start building expectations from scratch, as if nothing existed.
I see. We see both. So you can write expectations completely from scratch. You can
kind of impose your own structure that way. We also have in our system what we call profilers,
which allow you to inspect data or sometimes metadata, which would include schemas,
and from that generate expectations. And we think of those sort of like translators, right?
There are a bunch of different kind of metadata languages
or schema languages that encode some information
about data quality.
And in order to be a shared source of record,
you really want to be able to bring those
into a single home.
So examples would be,
there are teams that have written schema inspectors, like introspectors that will look at their data warehouses
and pull schema
information out. Sometimes you have things like, for example, we've met a lot of teams that have
naming conventions where they'll say something like timestamp underscore DT. And the underscore
DT means it's a certain type of timestamp and therefore you should apply these expectations to
it. We've also seen things like teams
that have some data validation going on
at the source in a web form with JSON schema
or something like that.
And there's some nice modules
for being able to compile JSON schema to expectations
so that you can have API checks
that are then reflected in your tests
and your documentation in the data warehouse
or in your machine learning pipelines downstream.
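As a rough illustration of that "translator" idea, here is a hand-rolled sketch of turning a small JSON Schema fragment into expectation-style declarations; this is not Great Expectations' own JSON Schema module, just a toy version of the mapping, with an invented web-form schema.

```python
# A toy translator from a JSON Schema fragment to expectation-style
# declarations. Great Expectations ships real tooling for this; the point
# here is only to show how schema information can be carried over.
def json_schema_to_expectations(schema: dict) -> list:
    expectations = []
    required = set(schema.get("required", []))
    for column, spec in schema.get("properties", {}).items():
        if column in required:
            expectations.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": column},
            })
        if "minimum" in spec or "maximum" in spec:
            expectations.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {
                    "column": column,
                    "min_value": spec.get("minimum"),
                    "max_value": spec.get("maximum"),
                },
            })
    return expectations

# Example: a web form field that is required and range-checked at the source.
form_schema = {
    "properties": {"age": {"type": "integer", "minimum": 0, "maximum": 120}},
    "required": ["age"],
}
print(json_schema_to_expectations(form_schema))
```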
So anyway, that's a very elaborate yes, but like we don't want to reinvent wheels on this. Like if that information exists, we just want to make sure that it kind of can live in a
common home and be useful. Okay. Okay. That makes sense. And actually, George, if I can, there's
one other benefit that we see there, which is data infrastructure is changing so fast today that many teams want future proofness.
They know that it's likely that they're going to swap out a piece of their underlying data infrastructure later.
And when they do, switching from, say, a Postgres schema to a Snowflake schema, that could be lossy.
But if you know what your important expectations are,
you've already agreed on that,
and you have an expectation
we'll work cross-platform that way,
you can make a transition or a migration
without having to worry about
if you're too tightly bound to that infrastructure.
Okay, yeah.
Actually, the future-proofing aspect
that you touched upon is important.
I would also say that it's perhaps ambitious if you aim to fill that gap, let's say.
But let's cover the ambitions part a bit later.
I was going to ask you actually about two of the aspects that you touched upon in the way that expectations function.
So you mentioned tests and you also mentioned documentation. So
yes, now it's also clear to me that you do leverage pre-existing conditions, schemas,
and all of those things to the extent possible to create the initial expectations when you import
data in your pipelines. So one of the ways that expectations function
is as a kind of test.
And in your documentation,
in your site and GitHub,
you also make the parallel to software tests.
And you say that, well, software is one thing.
You also need something,
the equivalent, let's say, for data pipelines.
And this is why expectations actually don't apply to the code that does the transformation
pipeline, but they apply to the actual data. So my question is, fine, okay, you create your expectations, whether it's from scratch or importing some pre-existing schema or whatnot, and you have a pipeline, and you know further down the line these
expectations may break because they were
misunderstood or because the pipeline applies some transformation that breaks some rule or
whatnot. So what happens then? Do you get a notification? Does the pipeline stop? How can the team
deal with that? So there are several options there. And the library is built to be flexible.
So this could get complicated.
But at a high level, we see two basic patterns.
One is sometimes people deploy their expectations within a data pipeline like Airflow or Spark.
And in that case, if an expectation breaks and you don't just want it to be a warning, you want to treat it as a failure, you can halt the pipeline. You can stop bad data from propagating and save yourself a ton of time cleaning it up. And of course, you know, now the data pipeline is stopped and somebody's going to have to go and investigate. In order to help with that investigation, we can generate a lightweight report. The same way that we generate documentation from the expectations, we can generate reports from failed expectations. And so that can immediately point teams to like,
okay, here is the place where things broke. So zero in on this. And actually what we find is,
as compared to something like anomaly detection, having these very clear declarative statements of
like, oh, now there are null values in this
column that never had null values before, or 5% of values are out of range. And you said that only
2% was acceptable. Having those diagnostic clues to start your investigation is really, really
helpful. So that's one pattern. That's deploying it in a data pipeline where you can actually halt processing. In many cases, the teams that care most about data quality are actually not the teams that fully control the data pipeline. And in that case, they're usually inspecting data at rest. So they're verifying that, say, data that has been ELT-ed into a data warehouse is correct upon
arrival. And in that case, you can't necessarily stop the pipeline, but you can still
catch it as soon as possible, which is still way, way better than catching it when, you know,
an angry director of marketing comes to you and asks why the dashboard is broken.
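A minimal sketch of that first pattern, stopping a pipeline step when validation fails; it assumes the classic Pandas-backed API and would typically be wrapped in an orchestrator task such as an Airflow PythonOperator, with the file path, column names and thresholds invented for illustration.

```python
import great_expectations as ge

def validate_orders_batch(path: str) -> None:
    """Validate a freshly landed file and fail the task if expectations break.

    A sketch only: path, columns and thresholds are hypothetical.
    """
    batch = ge.read_csv(path)
    batch.expect_column_values_to_not_be_null("order_id")
    # `mostly=0.98` tolerates up to 2% of values outside the range.
    batch.expect_column_values_to_be_between(
        "amount", min_value=0, max_value=10_000, mostly=0.98
    )
    results = batch.validate()
    if not results.success:
        # Raising here halts the pipeline so bad data stops propagating;
        # the validation result doubles as a lightweight diagnostic report.
        raise ValueError(f"Data validation failed: {results}")
```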
Okay. So how does this inspection of data at rest work? Do you have some kind of tool that
people can use for that?
It's the same tool all the way through. The internal infrastructure we call a validator, and it's the thing I was describing before when I said that you can compile your expectations into SQL queries, for example. So those can be scheduled and set up to run that way.
Okay, okay, okay.
And also, you touched upon documentation at some point, and, you know, from the very superficial look I had at expectations while preparing, well, they look kind of like statements. So expect this column to be, you know, such and such and so on.
So I wonder how do you generate documentation out of that?
And who's your target audience for that?
And what kind of shape does the documentation
that you generate have?
So also very flexible.
The way that we think of the target audience is
if you're a data scientist or data
engineer, you're probably using SQL, you're probably using Python, maybe Spark, like you're
fluent in a programming language. And so reading those declarative statements in JSON, for example,
it's not a problem for you, right? You already speak that language. But the number of non-engineers in a given organization
who need to use data and have domain expertise,
have kind of genuine input, in most organizations,
they outnumber the data scientists and data engineers themselves.
So all of those people are stakeholders.
You need some kind of translation there
where they can trust what's going on in the data.
And if you look at the state of things today,
most organizations have a wiki
of some kind, or maybe a data catalog, but the data catalog has to be updated by hand.
And what that means is there's always a lag between when something in the data pipeline changes,
when the documentation gets changed, and that lag creates a lack of trust and confidence.
Because those wikis are, I mean, I've worked in this department, those wikis are never fully up to date. As much as they're supposed to be the source of truth, nobody can completely rely on them. And if you try to make it so you can completely rely on them, it becomes a huge amount of manual work. With expectations, you essentially get a data dictionary or metadata that can populate a data catalog and say, this column should never have null values, or this column can have up to 10% null values.
Here's a graph that shows the intended distribution for this column.
Here are the regexes that should or should not apply to this column. Like having that is
super, super helpful for getting the clarity around the whole organization on how the data
really works. And the guarantee that we can make, which, really, I don't think any technology outside Great Expectations is providing in the same way we are, is that as long as you're running your tests, everyone can trust the docs, because they know that the docs actually reflect the current state of the data: they are compiled directly from your tests. There's no kind of additional step in between where you might lose information.
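As a sketch of how that compilation step is typically kicked off in code (assuming a project that has already been set up with a Great Expectations data context; treat the details as illustrative rather than the one canonical deployment):

```python
from great_expectations.data_context import DataContext

# Assumes a project initialized with the Great Expectations CLI, so the
# context knows where expectation suites, validation results and the
# generated docs site live.
context = DataContext()

# Rebuild the human-readable Data Docs site directly from the declared
# expectation suites and the latest validation results -- no hand-written
# wiki step in between.
context.build_data_docs()
```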
You can probably tell, it's a thing I'm really excited about. It's a thing I really wish I had
had when I was working at previous jobs. Yeah. Well, just a quick one on graphs, actually, because you did just mention them. And I also saw that you have a blog post on directed acyclic graphs and how dependencies in pipelines, well, kind of come in that shape. And I was wondering, well,
it totally resonated with me. Graph is a kind of pet interest that I have,
as I mentioned in the beginning.
So I was just wondering.
I studied graph theory for a couple of years in grad school.
So I'm right there with you.
Okay.
So I was just wondering if it's something
that you use internally in some way,
this kind of dependency graph, basically.
So Great Expectations today, let me say that differently. Data lineage, I think, is one of the core abstractions, like one of the main types of metadata that's coming up, and just surfacing it is super important today. And one of the things that has, I think, made Great Expectations so successful is we have integrations with basically all of the major data orchestrators, which is another reason why I like the name
of your operation.
So they track lineage.
And what we see is that in some ways, they kind of symbolize that handshake I was talking
about.
So being able to guarantee data quality at each stage of those graphs
is a big part of the value that we're bringing.
Great Expectations itself isn't like a super sophisticated graph engine.
We work very well with tools like that.
I don't know if I'm answering your question directly.
I feel like you may be reaching for something more there.
No, it wasn't, you know, like a major focus topic for me anyway.
It just got my attention.
And that's why I thought I would ask whether, you know,
you're leveraging some kind of graph engine
or whether this is in some way central to what you do.
Got it.
I mean, I think in DAGs. It's actually a running joke at Superconductive that, you know, Abe's always saying everything's a DAG, because it is. Like causal relationships are DAGs and ontologies are DAGs and like so many things are graphs of that kind.
Yeah, yeah, sure. Okay, I guess we will have to be wrapping up shortly, so now let's get to the future-proofing part, which also kind of leads naturally to your future plans. So you mentioned earlier that part of the ambition that you have for Great Expectations is to be able to serve
that role for people. So when, for example, they change some part of their pipeline or the data management or storage systems or whatever, you would like them to be able to keep using Great Expectations as a way of imposing rules and structures and all of that stuff on their data. So in order for that to happen, well, you need a lot of things, basically. You need to have a large footprint.
You need lots of funding, which you just got.
You need a big team.
And, well, I'm wondering if you can assess, basically,
how close do you think you currently are to that goal
and whether you see yourself actually succeeding in that?
I mean, I would say you're maybe not even being ambitious enough there.
I think supporting migration across infrastructure,
that's one problem that data teams face today.
And I mean, we really want to be a shared source of record
for data collaboration of all kinds, starting with data quality.
In terms of footprint,
the Great Expectations open source community,
which we barely talked about,
is one of the fastest growing data communities in the world.
I mean, the Slack channel didn't exist
until just over 18 months ago.
Now it has 3,000 members in it.
We have hundreds of people joining every month.
And we're approaching the point
where the open source library
will be downloaded a million times every month.
So in terms of overall adoption, there's no other tool in the data ops movement
that comes close in terms of adoption for data quality. Funding will certainly help on that.
And we're really excited. I mean, most of last year, we had a single engineer working in the
Slack channel and supporting people and answering requests on GitHub.
We're really excited to be able to grow the team,
put more resources behind open source
and continue to grow it.
And then I should mention for completeness,
we're not just working on open source.
We've also been quietly working
with a small set of design partners
around a paid offering that will go on top of open source.
And we're getting ready to expand
that design partnership program.
So lots of good things coming very soon there.
Yeah.
Yeah.
Thanks for mentioning that.
Well, actually both the community growth
and your plans for monetizing basically,
because I think they're both quite central.
So open source is kind of, in a way, the de facto way to grow companies these days.
So it's great that you have traction on that part.
But also you...
Well, you cover a lot of them.
And you also obviously need the monetization plan
because having community growth is great,
but how do you make it all sustainable?
So I was wondering what that plan is, actually,
if you can just briefly mention that.
So I think one thing to emphasize,
just because sometimes people are concerned
when an open source community raises money,
everything that's open source will always stay open source.
We're firmly committed to that path, no doubt.
And I want to be on record in public,
unambiguous about it.
When you look at the potential
for data collaboration in organizations,
it goes way beyond just developers.
And so being able to build a layer on top
of the open source project
that assists with communication,
collaboration, resolution of incidents,
things like that,
there's a lot of scope for reducing friction among engineering teams,
but also, and I think especially, between engineering teams
and other people in the organization.
So, I mean, at a high level, I think you could think of it as
the open source project is a shared open standard.
It's going to be super helpful for developers
and always will be in the data
ecosystem.
And then for additional collaboration and kind of enterprise needs, there's a whole
lot of additional things you can build on top of that.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.