Orchestrate all the Things - Superconductive scores $21M Series A funding to sustain growth of its Great Expectations open source framework for data quality. Featuring CEO and co-founder Abe Gong
Episode Date: May 20, 2021

Ensuring data quality is essential for analytics, data science and machine learning. Superconductive's Great Expectations open source framework wants to do for data quality what test-driven development did for software quality. Technical debt is a well-known concept in software development. It's what happens when unclear or forgotten assumptions are buried inside a complex, interconnected codebase, and it leads to poor software quality. The same thing also applies to data pipelines, it's called pipeline debt, and it's time we did something about it. That's the gist of what motivated Abe Gong and James Campbell to start Great Expectations in 2018. Great Expectations is an open-source tool that aims to make it easier to test data pipelines, and therefore increase data quality. Superconductive, the force behind Great Expectations, has announced it has received $21 million in Series A funding led by Index Ventures with CRV and Root Ventures participating. We caught up with Gong to learn more about Great Expectations. Article published on ZDNet.
Transcript
Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together.
Ensuring data quality is essential for analytics, data science and machine learning.
Superconductive's Great Expectations open source framework wants to do for data quality what test-driven development did for software quality. Technical debt is a well-known concept in software development. It's what happens when unclear or forgotten assumptions are
buried inside the complex interconnected code base and it leads to poor software
quality. The same thing also applies to data pipelines. It's called pipeline debt
and it's time we did something about it. That's the gist of what motivated Abe
Gong and James Campbell to start Great Expectations in 2018.
Great Expectations is an open-source tool that aims to make it easier to test data pipelines and therefore increase data quality.
Today, Superconductive, the force behind Great Expectations, has announced it has received a $21 million Series A funding round.
And we caught up with Abe Gong
to learn more about Great Expectations.
I hope you will enjoy the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
It's a $21 million round being led by Index Ventures.
CRV and Root, who invested in our seed round, also participated, and we're thrilled to have them on board. I mean, Index has such a great brand and reputation for open source and developer tools in particular, so having them lead the round is just awesome. There's nobody else we would rather have lead it. For me personally, I'm coming from a data science,
data engineering background. So the thing that I've done several times is work with a growth
stage company to build out data as a function within the organization. So that's often starting
hands-on keyboard, making architecture decisions, actually writing a lot of the code. And then over
time, hiring people, building the team and kind of transitioning into management. So I've done that transition a few times now, from being the first data scientist at Jawbone and working with IoT data, personal health, kind of upload-heavy mobile apps
to more recently being the chief data officer at Aspire Health, where I was working to build
integrations to all the major insurance companies
in the country. There's a lot of internal complexity there. Is that helpful? Does that
give you a good start, at least on the round and on me? Yeah, sure. And yeah, I wonder if you could
also share a few more words on, well, company background basically. I know that you didn't just start it on your own, you have a co-founder if I'm not mistaken. So if you'd like to say just a few words on, you know, how you met and how you got interested in expectations basically, which is the next topic in line for discussion. And I'll try to get as much in-depth on that as possible in the next minutes.
But just if you could give us an intro and what expectations are
and how you got started with that, basically.
Got it.
And George, on this, there are a few moving pieces.
So I'll trust you to know how to sort this out.
The company is called Superconductive.
We actually started originally
as Superconductive Health, focused on healthcare vertical data integration, data analytics,
and so on. As part of that work, we built this open source project called Great Expectations.
And that was actually built in collaboration with some people outside the company,
including James Campbell, who was kind of the other half of Great Expectations in the very early days.
Two years ago we started to see a lot of organic community momentum around the Great Expectations project, and we kind of looked at it and realized something really interesting was happening around data quality, and it wasn't limited to healthcare. So we made the decision to change our business model, focus exclusively on Great Expectations, and take it to market. So we went from being kind of a healthcare data company to being specifically data quality, built around Great Expectations as an open source project.
Is that helpful? Is that clear? I know there are a few moving parts.
Yeah, yeah, pretty much. And I think it will get even more clear as we get into what
expectations actually are. So I have to admit that, you know, initially when I heard the name
Great Expectations, and you're obviously aware of that, it kind of relates to literature. And I was
like, so okay, so how is that relevant for data science?
Because like opening up your website and your GitHub,
it was immediately obvious that what you do is data science related,
but the connection was not entirely obvious at first.
So it became a bit more clear after digging up a little bit and,
you know, trying to figure out what expectations are.
So the simplest way I could possibly frame it, and you can have your say on whether this is a correct framing or not,
is that, to me, expectations look like a kind of meta schema.
Or in terms of when we're talking about things like test frameworks or even logic programming,
they look a bit like assertions. So statements that are expected to be true basically. And
they apply to data and they kind of try to foresee what the shape of data will look like, basically. So things like this column is expected to have such values
or this value is supposed to be filled in and things like that.
So would you say that's correct framing of what expectations are?
I think that's right.
We think of expectations as declarative assertions about data,
so that everyone in the organization can understand what the data is supposed to be.
They can literally have shared expectations of their data.
And that goes from engineering teams that are using them as sort of a testing framework
to documentation that's automatically compiled from those tests that become a really helpful
collaboration tool for non-engineers around the organization who still need a seat at the table
for understanding data. There's another way you can think of it, by the way, which is you've said
it's sort of a meta schema. We prefer to think of it as a shared open standard for data quality so that anyone can
declare, like, here's what we expect. And I'll grant you, like, that certainly has schema-like
aspects, but you can also do things that go way beyond schema. For example, checking the
distribution within columns or looking for statistical relationships among different
columns. All of those are things where you can expect this relationship
or expect this distribution.
So I think if you say schema,
it'll feel fairly narrow.
And the scope of what you can do
with this abstraction is very, very broad.
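To make that concrete, here is a minimal sketch of what declaring expectations can look like in Python, assuming the classic Pandas-backed Great Expectations API; the file name, column names and thresholds are invented for illustration.

```python
import great_expectations as ge

# Load a CSV into a Great Expectations-wrapped Pandas DataFrame
# (file and column names here are hypothetical).
df = ge.read_csv("orders.csv")

# Schema-like assertions: the column exists and is never null.
df.expect_column_to_exist("order_id")
df.expect_column_values_to_not_be_null("order_id")

# Assertions that go beyond schema: value ranges and distributional checks.
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
df.expect_column_mean_to_be_between("amount", min_value=10, max_value=500)

# Evaluate every declared expectation and get a machine-readable result,
# including an overall success flag and per-expectation details.
results = df.validate()
print(results)
```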
Okay.
Okay.
Yeah.
Thanks for clarifying that
because yeah, I admit,
I didn't exactly have all the time in the world to go
deep into the weeds of how everything about expectations works, so at first...
I wouldn't expect you to, excuse me. I totally understand.
Okay. So that kind of actually already answers a follow-up question I had, in a way, which was like, okay, so how is that useful exactly? And I know it may sound like a dumb question initially, but
you know basically your intended scope of application is data pipeline so
when you have data ingested at some point
and then the various transformations being applied to it
and eventually being used downstream
for a number of applications and purposes,
be it dashboards, BI, machine learning,
reporting to clients, yes.
All of that.
So the goal, I think what you're targeting
is basically to have a sort of quality assurance in this pipeline, so that when something changes or breaks in an upstream step, it doesn't negatively affect everything that's downstream from that.
So how does having expectations help in that goal?
So I think it helps in two ways, at two levels.
And for both levels, we think of speed and trust, kind of like speed, trust, confidence,
like those are the things that you get out of it.
So at the level of an individual team, say, you know, five data scientists, data engineers
collaborating on a project, they act a lot like unit tests in the sense that everybody
can declare what they expect.
Everybody can be confident there's a source of record for how the data should operate at a given
stage in the pipeline. So that's at the team level. But there's also a really interesting
thing that happens between teams. So I've been looking at the Escher drawing behind you and
thinking of handshakes. So when you look at modern data systems,
the data is usually passed from team to team several times before it actually informs decisions
or goes into product. And what that means is if the teams aren't aligned on what to expect of the
data throughout that process, it's really easy for the downstream process to be wrong. And sometimes
like really egregiously wrong. So for example, I've now met three teams
where they built awesome machine learning models.
And when the machine learning model
was deployed in production,
it was deployed upside down.
So it's like making the worst possible predictions.
And like, I mean, that's a really simple example,
but there are all kinds of other subtle ways
where a data pipeline can kind of look correct at one stage, and if there isn't shared understanding at the next stage, it'll be wrong. So by having declared
expectations of what the data should look like at each kind of gate along the way, at each stage of
the pipeline, you have this handshake where you know that there is trust and shared understanding
all the way through the organization. Okay. So since this applies to pipeline,
when data first enters the pipeline,
well, you have different scenarios, basically.
So you may have entirely unstructured data, like, I don't know,
CSV files or even documents or even multimedia files,
like audio and video.
And it's pretty hard to apply any kind of schema to those.
But in other scenarios, you have data that may be structured,
like coming, for example, from relational databases
or other types of data management systems
that do internally have some kind of schema.
So I wonder if, well, I noticed that you list in your GitHub, again,
a number of systems with which you integrate.
And I wonder what the nature of this integration is
and whether you leverage these existing schema mechanisms
in systems that actually provide those.
It's interesting that you include CSV as unstructured
because, I mean, it isn't really structured and yet there's a lot of data work that happens to CSVs and that is a format that we support.
So we've built Great Expectations to be kind of ubiquitous across many backends. Like if you want a shared standard for data, it has to work everywhere. So Great Expectations can work natively, meaning you can compile your expectations and execute them in Python Pandas, which covers a lot of machine learning notebook, CSV-type work. You can also use it on lots of different dialects of SQL. And the big ones are the data warehouses: Snowflake, BigQuery, Redshift, SQL Data Warehouse.
It also works in Spark.
And so you can use the same expectations
and trust that regardless of where your pipeline
is being executed, whether it's in SQL or Spark or in Pandas, like all of those places,
expectations will execute the same way
and return the same results.
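For reference, this is roughly what that backend-agnostic, declarative form looks like once an expectation suite is serialized; it is shown here as a Python dict with invented names, and it is this representation that gets compiled to Pandas, a SQL dialect, or Spark at validation time.

```python
# A sketch of a serialized expectation suite. Nothing here is tied to a
# specific backend; the same declarations can be executed against Pandas,
# Snowflake, BigQuery, Redshift, or Spark. Names are hypothetical.
expectation_suite = {
    "expectation_suite_name": "orders.warning",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "amount", "min_value": 0, "max_value": 10_000},
        },
    ],
}
```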
Okay, okay. However, I still wonder if, so, you know,
data enters your pipeline at some point and, well, we can discuss whether, for example,
CSV is structured or not. I was more thinking on the aspect of whether you can actually specify
things like, you know, conditions and ranges for
your fields and things like that in CSV. Well, depending on the tool you use, you may be able
to do that, but that's a different discussion. But let's assume that, you know, whatever your
data is, when it enters your pipeline, it's coming from a system that does already have some kind of
mechanism to impose rules and structure and all of those things. I was wondering whether expectations interplay with that, or whether you kind of cast that aside and start building expectations from scratch, as if nothing existed.
I see. We see both. So you can write expectations completely from scratch. You can
kind of impose your own structure that way. We also have in our system what we call profilers,
which allow you to inspect data or sometimes metadata, which would include schemas,
and from that generate expectations. And we think of those sort of like translators, right?
There are a bunch of different kind of metadata languages
or schema languages that encode some information
about data quality.
And in order to be a shared source of record,
you really want to be able to bring those
into a single home.
So examples would be,
there are teams that have written schema inspectors, like introspectors that will look at their data warehouses
and pull schema
information out. Sometimes you have things like, for example, we've met a lot of teams that have
naming conventions where they'll say something like timestamp underscore DT. And the underscore
DT means it's a certain type of timestamp and therefore you should apply these expectations to
it. We've also seen things like teams
that have some data validation going on
at the source in a web form with JSON schema
or something like that.
And there's some nice modules
for being able to compile JSON schema to expectations
so that you can have API checks
that are then reflected in your tests
and your documentation in the data warehouse
or in your machine learning pipelines downstream.
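As a rough illustration of that "translator" idea, here is a hand-rolled sketch of turning a small JSON Schema fragment into expectation-style declarations; this is not Great Expectations' own JSON Schema module, just a toy version of the mapping, with an invented web-form schema.

```python
# A toy translator from a JSON Schema fragment to expectation-style
# declarations. Great Expectations ships real tooling for this; the point
# here is only to show how schema information can be carried over.
def json_schema_to_expectations(schema: dict) -> list:
    expectations = []
    required = set(schema.get("required", []))
    for column, spec in schema.get("properties", {}).items():
        if column in required:
            expectations.append({
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": column},
            })
        if "minimum" in spec or "maximum" in spec:
            expectations.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {
                    "column": column,
                    "min_value": spec.get("minimum"),
                    "max_value": spec.get("maximum"),
                },
            })
    return expectations

# Example: a web form field that is required and range-checked at the source.
form_schema = {
    "properties": {"age": {"type": "integer", "minimum": 0, "maximum": 120}},
    "required": ["age"],
}
print(json_schema_to_expectations(form_schema))
```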
So anyway, that's a very elaborate yes, but like we don't want to reinvent wheels on this. Like if that information exists, we just want to make sure that it kind of can live in a
common home and be useful. Okay. Okay. That makes sense. And actually, George, if I can, there's
one other benefit that we see there, which is data infrastructure is changing so fast today that many teams want future proofness.
They know that it's likely that they're going to swap out a piece of their underlying data infrastructure later.
And when they do, switching from, say, a Postgres schema to a Snowflake schema, that could be lossy.
But if you know what your important expectations are,
you've already agreed on that,
and you have an expectation
we'll work cross-platform that way,
you can make a transition or a migration
without having to worry about
if you're too tightly bound to that infrastructure.
Okay, yeah.
Actually, the future-proofing aspect
that you touched upon is important.
I would also say that it's perhaps ambitious if you aim to fill that gap, let's say.
But let's cover the ambitions part a bit later.
I was going to ask you actually about two of the aspects that you touched upon in the way that expectations function.
So you mentioned tests and you also mentioned documentation. So
yes, now it's also clear to me that you do leverage pre-existing conditions, schemas,
and all of those things to the extent possible to create the initial expectations when you import
data in your pipelines. So one of the ways that expectations function
is as a kind of test.
And in your documentation,
in your site and GitHub,
you also make the parallel to software tests.
And you say that, well, software is one thing.
You also need something,
the equivalent, let's say, for data pipelines.
And this is why expectations actually don't apply to the code that does the transformation
pipeline, but they apply to the actual data. So my question is, fine, okay, you create your expectations, whether it's from scratch or importing some pre-existing schema or whatnot, and you have a pipeline, and you know further down the line these
expectations may break because they were
misunderstood or because the pipeline applies some transformation that breaks some rule or
whatnot. So what happens then? Do you get a notification? Does the pipeline stop? How can the team
deal with that? So there are several options there. And the library is built to be flexible.
So this could get complicated.
But at a high level, we see two basic patterns.
One is sometimes people deploy their expectations within a data pipeline like Airflow or Spark.
And in that case, if an expectation breaks and you don't just want it to be a warning, you want to treat it as a failure, you can halt the pipeline. You can stop bad data from propagating and save yourself a ton of time cleaning it up. And of course, you know, now the data pipeline is stopped and somebody's going to have to go and investigate. In order to help with that investigation, we can generate a lightweight report. The same way that we generate documentation from the expectations, we can generate reports from failed expectations. And so that can immediately point teams to like,
okay, here is the place where things broke. So zero in on this. And actually what we find is,
as compared to something like anomaly detection, having these very clear declarative statements of
like, oh, now there are null values in this
column that never had null values before, or 5% of values are out of range. And you said that only
2% was acceptable. Having those diagnostic clues to start your investigation is really, really
helpful. So that's one pattern. That's deploying it in a data pipeline where you can actually halt processing. In many cases, the teams that care most about data quality are actually not the teams that fully control the data pipeline. And in that case, they're usually inspecting data at rest. So they're verifying that, say, data that has been ELT-ed into a data warehouse is correct upon
arrival. And in that case, you can't necessarily stop the pipeline, but you can still
catch it as soon as possible, which is still way, way better than catching it when, you know,
an angry director of marketing comes to you and asks why the dashboard is broken.
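A minimal sketch of that first pattern, stopping a pipeline step when validation fails; it assumes the classic Pandas-backed API and would typically be wrapped in an orchestrator task such as an Airflow PythonOperator, with the file path, column names and thresholds invented for illustration.

```python
import great_expectations as ge

def validate_orders_batch(path: str) -> None:
    """Validate a freshly landed file and fail the task if expectations break.

    A sketch only: path, columns and thresholds are hypothetical.
    """
    batch = ge.read_csv(path)
    batch.expect_column_values_to_not_be_null("order_id")
    # `mostly=0.98` tolerates up to 2% of values outside the range.
    batch.expect_column_values_to_be_between(
        "amount", min_value=0, max_value=10_000, mostly=0.98
    )
    results = batch.validate()
    if not results.success:
        # Raising here halts the pipeline so bad data stops propagating;
        # the validation result doubles as a lightweight diagnostic report.
        raise ValueError(f"Data validation failed: {results}")
```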
Okay. So how does this inspection of data at rest work? Do you have some kind of tool that
people can use for that?
It's the same tool all the way through. The internal infrastructure we call a validator, and it's the thing I was describing before when I said that you can compile your expectations into SQL queries, for example. So those can be scheduled and set up to run that way.
Okay, okay, okay.
And also, you touched upon documentation at some point, and, you know, from the very superficial look I had at expectations while preparing, well, they look kind of like statements. So expect this column to be, you know, such and such and so on.
So I wonder how do you generate documentation out of that?
And who's your target audience for that?
And what kind of shape does the documentation
that you generate have?
So also very flexible.
The way that we think of the target audience is
if you're a data scientist or data
engineer, you're probably using SQL, you're probably using Python, maybe Spark, like you're
fluent in a programming language. And so reading those declarative statements in JSON, for example,
it's not a problem for you, right? You already speak that language. But the number of non-engineers in a given organization
who need to use data and have domain expertise,
have kind of genuine input, in most organizations,
they outnumber the data scientists and data engineers themselves.
So all of those people are stakeholders.
You need some kind of translation there
where they can trust what's going on in the data.
And if you look at the state of things today,
most organizations have a wiki
of some kind, or maybe a data catalog, but the data catalog has to be updated by hand.
And what that means is there's always a lag between when something in the data pipeline changes,
when the documentation gets changed, and that lag creates a lack of trust and confidence.
Because those wikis are, I mean, I've worked in this department, those wikis are never fully up to date. As much as they're supposed to be the source of truth, nobody can completely rely on them. And if you try to make it so you can completely rely on them, it becomes a huge amount of manual work. With expectations, you essentially get a data dictionary or metadata that can populate a data catalog and say, this column should never have null values, or this column can have up to 10% null values.
Here's a graph that shows the intended distribution for this column.
Here are the regexes that should or should not apply to this column. Like having that is
super, super helpful for getting the clarity around the whole organization on how the data
really works. And the guarantee that we can make, which, really, I don't think any technology outside Great Expectations is providing in the same way we are, is that as long as you're running your tests, everyone can trust the docs, because they know that the docs actually reflect the current state of the data: they are compiled directly from your tests. There's no kind of additional step in between where you might lose information.
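As a sketch of how that compilation step is typically kicked off in code (assuming a project that has already been set up with a Great Expectations data context; treat the details as illustrative rather than the one canonical deployment):

```python
from great_expectations.data_context import DataContext

# Assumes a project initialized with the Great Expectations CLI, so the
# context knows where expectation suites, validation results and the
# generated docs site live.
context = DataContext()

# Rebuild the human-readable Data Docs site directly from the declared
# expectation suites and the latest validation results -- no hand-written
# wiki step in between.
context.build_data_docs()
```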
You can probably tell, it's a thing I'm really excited about. It's a thing I really wish I had
had when I was working at previous jobs. Yeah. Well, just a quick one on graphs, actually, because you did just mention them. And I also saw that you have a blog post on directed acyclic graphs and how dependencies in pipelines, well, kind of come in that shape. And I was wondering, well,
it totally resonated with me. Graph is a kind of pet interest that I have,
as I mentioned in the beginning.
So I was just wondering.
I studied graph theory for a couple of years in grad school.
So I'm right there with you.
Okay.
So I was just wondering if it's something
that you use internally in some way,
this kind of dependency graph, basically.
So Great Expectations today, let me say that differently. Data lineage, I think, is one of the core abstractions, like one of the main types of metadata that's coming up, and just surfacing it is super important today. And one of the things that has, I think, made Great Expectations so successful is we have integrations with basically all of the major data orchestrators, which is another reason why I like the name
of your operation.
So they track lineage.
And what we see is that in some ways, they kind of symbolize that handshake I was talking
about.
So being able to guarantee data quality at each stage of those graphs
is a big part of the value that we're bringing.
Great Expectations itself isn't like a super sophisticated graph engine.
We work very well with tools like that.
I don't know if I'm answering your question directly.
I feel like you may be reaching for something more there.
No, it wasn't, you know, like a major focus topic for me anyway.
It just got my attention.
And that's why I thought I would ask whether, you know,
you're leveraging some kind of graph engine
or whether this is in some way central to what you do.
Got it.
I mean, I think in DAGs. It's actually a running joke at Superconductive that, you know, Abe's always saying everything's a DAG, because it is. Like causal relationships are DAGs and ontologies are DAGs and like so many things are graphs of that kind.
Yeah, yeah, sure. Okay, I guess we will have to be wrapping up shortly, so now let's get to the future-proofing part, which also kind of leads naturally to your future plans. So you mentioned earlier that part of the ambition that you have for Great Expectations is to be able to serve
that role for people. So when, for example, they change some part of their pipeline or the data management or storage systems or whatever, you would like them to be able to keep using Great Expectations as a way of imposing rules and structures and all of that stuff on their data. So in order for that to happen, well, you need a lot of things, basically. You need to have a large footprint.
You need lots of funding, which you just got.
You need a big team.
And, well, I'm wondering if you can assess, basically,
how close do you think you currently are to that goal
and whether you see yourself actually succeeding in that?
I mean, I would say you're maybe not even being ambitious enough there.
I think supporting migration across infrastructure,
that's one problem that data teams face today.
And I mean, we really want to be a shared source of record
for data collaboration of all kinds, starting with data quality.
In terms of footprint,
the Great Expectations open source community,
which we barely talked about,
is one of the fastest growing data communities in the world.
I mean, the Slack channel didn't exist
until just over 18 months ago.
Now it has 3,000 members in it.
We have hundreds of people joining every month.
And we're approaching the point
where the open source library
will be downloaded a million times every month.
So in terms of overall adoption, there's no other tool in the data ops movement
that comes close in terms of adoption for data quality. Funding will certainly help on that.
And we're really excited. I mean, most of last year, we had a single engineer working in the
Slack channel and supporting people and answering requests on GitHub.
We're really excited to be able to grow the team,
put more resources behind open source
and continue to grow it.
And then I should mention for completeness,
we're not just working on open source.
We've also been quietly working
with a small set of design partners
around a paid offering that will go on top of open source.
And we're getting ready to expand
that design partnership program.
So lots of good things coming very soon there.
Yeah.
Yeah.
Thanks for mentioning that.
Well, actually both the community growth
and your plans for monetizing basically,
because I think they're both quite central.
So open source is kind of, in a way, the de facto way to grow companies these days.
So it's great that you have traction on that part.
But also you...
Well, you cover a lot of them.
And you also obviously need the monetization plan
because having community growth is great,
but how do you make it all sustainable?
So I was wondering what that plan is, actually,
if you can just briefly mention that.
So I think one thing to emphasize,
just because sometimes people are concerned
when an open source community raises money,
everything that's open source will always stay open source.
We're firmly committed to that path, no doubt.
And I want to be on record in public,
unambiguous about it.
When you look at the potential
for data collaboration in organizations,
it goes way beyond just developers.
And so being able to build a layer on top
of the open source project
that assists with communication,
collaboration, resolution of incidents,
things like that,
there's a lot of scope for reducing friction among engineering teams,
but also, and I think especially, between engineering teams
and other people in the organization.
So, I mean, at a high level, I think you could think of it as
the open source project is a shared open standard.
It's going to be super helpful for developers
and always will be in the data
ecosystem.
And then for additional collaboration and kind of enterprise needs, there's a whole
lot of additional things you can build on top of that.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.