The Data Stack Show - 93: There Is No Data Observability Without Lineage with Kevin Hu of Metaplane
Episode Date: June 29, 2022. Highlights from this week's conversation include: Kevin's background and career journey (1:54), Metaplane and the problem that it solves (6:47), The silence of data problems (9:53), Data physics work that requires more (13:35), Trusting data when bugs are present (19:12), Building a navigable experience (22:36), Developing anomaly detection (30:06), What Metaplane provides today (35:05), Metaplane's plans for the future (37:45), Comparing BigQuery, Snowflake, and Redshift (40:56), Why data goes bad (48:15), Advice for data trust workers (59:24). The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show. Today,
we're talking with Kevin Hu from Metaplane. Costas, there are a lot of tools in the data observability space, and that's what Metaplane does. And I'm interested to know, of course,
I do a lot of stalking on LinkedIn for the shows, but I want to know how he went from MIT to
starting Metaplane, you know, because that's an interesting dynamic sort of coming out
of academia and then going through Y Combinator and starting a company.
So I just want to hear that backstory.
How about you?
Yeah, I want to learn more about the product, to be honest.
I mean, it's data observability and data quality and like, I don't know what other
name we're going to have tomorrow for the category.
It's like a very hot
product category right now in terms of
like development and like innovation.
And I think he's the right person like to
chat about that.
So let's see how
Metaplane understands
and implements data observability and
also what's next after that. Like what are the plans
there and where the destination is going?
Let's do it.
Kevin, welcome to the Data Stack Show.
We're so excited to chat with you.
So excited to be here.
I'm a longtime listener of the show.
I recognize both of your voices and to be here with you on the Zoom, it's really a privilege.
So thank you.
Cool.
Well, we always love hearing from our listeners, and especially when they are
guests on the show.
So I want to, of course, I do LinkedIn stalking.
Our listeners know this.
You probably know this from listening to the show.
So you started at MIT studying physics and then you made the switch over to focusing
on more computer science subjects.
And so I have two questions for you.
One, why did you make the switch?
And then two, did that influence you starting Metaplane,
actually sort of studying those topics
from an academic standpoint?
Yeah, I think, well, one, great research.
It's true, Costas and I have both found ourselves
in the either fortunate and privileged or unfortunate
place of seeing each other at some point.
And I did start studying physics.
And I remember the gauntlet course at the time, which was the experimental lab course
everyone took as a junior, was notorious for burning people out.
And one week, you replicate a Nobel Prize
winning experiment and the second week you analyze it. Something that really stood out to me was the
people who had the hardest time in the course weren't necessarily the people who weren't the
best physics students, but it was the people who didn't know MATLAB and didn't know Python.
So they could collect the data, but weren't able to analyze it.
They were the ones who were pulling all-nighters.
And at the same time, my sister, who is a biologist,
she had about five years of data on fish behavior.
So tilapia are very interesting fish.
You have a tank of them, you drop in another tilapia,
and all the other tilapia change.
Oh, fascinating.
Yeah, they're very tribal, very easy to observe.
And at the end of five years, she messages me saying, hey, Kevin, can you help me analyze
this data because I don't know R. And to me, this is just absurd because why are some of
the brightest people in the world bottlenecked? Because they don't know how to write
code. And obviously that doesn't apply only to scientists, but really to anyone who works in
an organization who either produces data or consumes data. If they don't know how to program,
you're not necessarily working with data in the most low friction way. So that's how I got into CS research, trying to build tools and develop methods
for automated data analysis.
This is back in 2013.
Okay, wow.
Super interesting.
Tilapia are also tasty, by the way,
you know, if you're a good cook.
That's a good point.
That is a data point.
That's a qualitative data point.
Happy to share that with your sister.
I have plenty of tilapia data points too.
Hopefully your listeners are not fish or people.
That's right.
Okay.
So tell us, so you studied computer science tooling, how to sort of support people,
help people, based on your experience of really bright people not being able to analyze data. Take us from there to starting Metaplane and then tell us what Metaplane
is and does. So for six years, we built tools that, given a CSV, try to predict the most
interesting, by some measure, visualizations or analyses
that could come from that CSV.
So at first it was really rule-based, but then it was more machine learning-based where
we had a lot of datasets and visualizations and analyses scraped from the web.
And the papers were really interesting.
And it turned out you could predict how analysts worked on a data set with relatively high
accuracy.
The problem was when we tried to deploy it at large companies, including Colgate-Palmolive,
Estee Lauder, they funded a large part of my PhD.
And I still have many goodie bags.
Some of my colleagues have GPUs.
I have retinol.
Lots of toothpaste.
Yeah, tons of toothpaste.
I'm not complaining.
But the problem was when we wanted to deploy these tools, it became very clear, like, okay, connect us to your database.
And they'll ask, like, okay, what database?
We have, like, 23 instances of SAP.
This was back in 2015 and 2016.
So it was a bit worse back then than it is today.
But it became clear that data quality is one of the biggest impediments to working with data.
Not necessarily the last mile, when you have a final clean data set and you're generating the analyses from it.
So that's the motivation to build Metaplane where, you know,
we couldn't necessarily make that flower grow. Now we have the augmented analytics and different
categories arising, trying to do that analysis, but we figure, you know, we can plant the garden,
maybe someone else can take it from there. Very cool. And so tell us, tell us about Metaplane.
Like what's, what's the problem that it solves?
So Metaplane, we like to think of it as the Datadog for data. It's a data observability tool that connects across your data stack to your warehouse like Snowflake, to your transformation tool like dbt, a BI tool like Looker. And very simply, we tell you when something might be going wrong.
Specifically, there's a big asymmetry that we observe today where data teams are responsible for hundreds or thousands of tables and dashboards. And this is great in part because data is becoming
a product, right? It's no longer used just within the main vein of
BI and decision support, even though that will always be important, but with reverse ETL,
okay, maybe that term is not cool anymore, but being sent to and activated in marketing tools,
being used to train machine learning models, and that is all good. The promise of data is
starting to be more and more true. However,
while your data team is responsible for hundreds of tables, your VP of sales only cares about
one report, which is the Looker dashboard that they're currently looking at.
So there's this asymmetry where frequently teams find out about data issues or silent data bugs,
as we call them, when the users of data notice it
and then message the data team.
That matters for two reasons.
One is that if you've received those Slack alerts
and if you're listening to this podcast,
you probably have,
you know that there goes your afternoon
and you did not have much time to spare to begin with.
But two, data trust is very easy to lose and hard to regain, especially when it comes to data.
Because once that VP of sales decides to, okay, screw this, I'm going to have my RevOps team build up reporting in a shadow data stack.
Then what was the point of getting a snowflake and getting all this data together to begin with?
If we don't have a culture around trusting data,
it doesn't really matter how much of it you collect or use.
Yeah, absolutely. I want to dig in on one thing and then I'll hand the mic over to Costas.
But could you describe, so you mentioned the silence of sort of errors, you know, or bugs or problems that happen with data, which is a
really interesting way to think about the problems that we face in data. So two questions for you.
One, how do you think sort of the audible nature of those things differs in data, say, as compared
with like software engineering? because, you know,
software engineering, like if we think about Datadog, you know, there's a lot of defined process and tooling or whatever, a lot of that's being adopted into the data world.
So one would love a comparison there. And then two, could you just describe in a, you know,
on a deeper level and maybe do this first, like what, describe a silent problem and like,
why are the problems with data silent or why do you even use that term?
Yeah, let's start from that silent data bug.
Great questions. A silent data bug is when all of your jobs are running fine, right?
Airflow is all green, Snowflake is up, and yet your table might have 10% of the rows that you expected.
Or that some distribution like the mean revenue metric has shifted a little bit over to an
incorrect value.
So these sorts of issues in the data itself, unless you have something that is continuously monitoring the values of the
data, aren't necessarily flagged by infrastructural checks, like whether your systems are up or your
jobs are running.
And that's why we do want to make the silent data bugs more audible, increase the volume
a little bit, because if you don't know about these issues occurring along the way, then inevitably the only place that you
will notice it is at the very end, right?
When the data is being consumed.
One, because that person has the most incentives to make
sure that the data is correct.
But frequently the person who's using the data also has
the most domain expertise.
If they're on the sales team, they might know what exactly should
go into this revenue number.
They might not have known how it was calculated along the way,
but they know when it's wrong.
And that is one departure from software observability, which really is the
inspiration for data observability.
Right.
The term was completely co-opted from, like, the Datadogs and Splunks of the world.
But to be fair, they co-opted the term from control theory, where observability has a
very strict definition, right?
As like the mathematical dual for the controllability of a system, a dynamical system where you
want to understand how like the state changes from the inputs.
So I don't feel too bad about stealing the term. All art is theft, right? Exactly, exactly. If we
keep tracing it all the way down, like back hundreds of years, we'll find, you know, a Dutch
physicist trying to figure out how to make windmills turn at the same rate as grain is ground, which is true.
I love it.
Just to finish that thought, in the software world, before the Datadogs, right,
you would frequently find out about, I mean, software and infrastructure issues when the API went down or when your heartbeat check failed.
But as the number of assets that you're deploying increases and increases, that level of visibility
is just not sufficient, right?
Now, if you're on a software team, it's almost mind-blowing to think that you'd wait for your customers
to find out when your API is failing or when a query is slow.
You want to find out about that regression internally.
Yeah, absolutely.
Okay.
Before we resume the conversation about observability, I want you to go back to physics and your other graduate studies.
And I want to ask you, and that's like a very personal curiosity that I have, like from all the stuff that you have done in physics, what was, let's say,
the one that required the most in terms of working with data and using R or Python? What do you think
couldn't exist in a way almost, let's say, if we can exaggerate, as a domain of physics,
if we didn't have today
computers and all these languages and all these systems to go and crunch the data.
I have two answers to that question.
One is when I was doing more pure physics research, like AMO, atomic, molecular, and
optical physics research. You can think about ultra-cold atoms using laser cooling and trapping, where
the fine level of control that you need to calibrate these systems,
and then the amount of data that you're retrieving from the systems that you're observing, is
immense.
Right.
There's a reason why, you know, high-performance computing was really, like, invented at CERN,
and why the web was kind of invented at these scientific research facilities: they
had the need for data first.
And then even today, the scientific computing ecosystem almost exists separate from our
data stack.
Yeah.
The qualities of the data are completely different.
Yeah.
The other strand was, at some point, I got more interested in, like, quantitative social science research. So we
published this paper on the network of languages, trying to understand how information flows
from person to person via the languages that they know. Specifically, there's nothing stopping us from going to any news site in another language,
besides the fact that we might not know that language.
We had tons of data at the time about bilingual Twitter users, about Wikipedia editors who
edited Wikipedia in more than one language. Mm-hmm.
With translations from one language to another to try and figure out the
connectedness and the clusters of different languages.
So that wasn't necessarily a problem of big data.
It all fit on one person's laptop, but we wouldn't have collected that data.
Yeah.
If it wasn't today.
Yeah.
A hundred percent.
No, no, no.
That's super interesting.
And yeah, I remember at some point,
one of the first episodes that we had,
we had a guest who worked at CERN.
He was taking care of the infrastructure there
and writing code in C++ to transfer data there.
And it was funny to hear him saying
what was his first impression
when after his PhD,
he went into the industry
and hearing about big data
and people saying,
okay, we need a whole cluster
to transfer this data.
And he was like,
okay, are you serious?
You can't say that.
Yeah, he was like,
oh, I mean,
he was dealing with petabytes and petabytes of data.
I mean, just an unbelievable amount.
So he goes to work in insurance and he's like, I mean, this is the kiddie pool.
Totally.
There's levels to the game, right?
And I'm sure that when he goes down the hall to another person at CERN, they're like, petabytes?
Like, we have even more data than that.
Yeah.
Yeah.
Yeah.
It's super interesting, like to see the different
perspectives when someone is coming like from scientific computing and the
point of view that they have and like how you solve the problems, like with
working with a lot of data.
Although, okay.
We also have to say that like the needs are completely different, like the
environment, the context in which they do the processing, is also very different.
So it's not like exactly comparable, right? Like you cannot say that the work that Facebook is
trying to do with the data that they have is like the same type of problems that are solved by
highly parallelized algorithms, like trying to solve partial differential equations, for example,
right? Like there's like very, very different like problems and they have different needs,
both in terms of infrastructure and the software and the algorithms that
we are using.
But yeah, like a hundred percent.
I mean, there is a reason, as you said, that like the internet, that the web
came out of CERN and like all these technologies, like they're like highly
associated like with physics.
Okay.
Enough with physics.
Let's go back to data observability. So I have a question about...
We use a lot, and it's very interesting
because you talked about this experiment with languages
and when you're bilingual and all that stuff.
But something similar, I think, is also happening
when we introduce new product categories, right? Like, as you said, like we, we stole like the term observability from Datadog that
took the term observability from, like, control theory, and who knows what the
Dutch guy was doing.
But when we are talking about, we are using, and you used with Eric, like, the term bug, right?
And silent bug.
But, okay, in software, when we are talking about bugs, there's like
a very, let's say, key relationship between, how to say that, like it's a very deterministic
thing, right?
Like, okay, there are like a few bugs that it's hard like to find them, especially like
in distributed systems and stuff like that, where the behavior is not deterministic necessarily. But broadly, when we're talking about bugs, we are talking about
something very deterministic as a system, right? But with data, my feeling is that when we're
talking about bugs on data, it's not exactly that. There's much more vagueness there, and it's not
that clear to define what the bug is. And that's why many times I say that maybe it's better to use the term trust, like how
much we can trust the data, right?
So from a binary relationship, bug or not bug, we go into how much we can trust something.
So what's your experience with that?
And what's common and what's not common between patterns from software engineering and data and working with data.
You're so right that the way that we refer to data as having
bugs is not a one-to-one with software, right?
Like a software bug, it's a logical issue that somehow your logic did not produce the outcomes that you'd expected when it encountered the real world.
Right.
Either the real world was more complicated than you thought, which is the case, or your logic was not sound.
Yep.
In which case, get someone to review your PRs.
Mm-hmm. The engineers on my team will be like, well, Kevin, yeah, the data bugs are interesting because I think the root cause can be equally similar in some cases where, yes, there are logical issues in your DAG.
Your DAG extending beyond the warehouse, but from very beginning to very end, right?
It is conceptually a chain of logical operations, but the data could be input wrong, right?
It either came from a machine that did not do what you expected or a person entered in
the wrong number.
So you're right that the scope of a data bug is a little bit larger in that sense.
And as a result, what goes into data observability is slightly different than what goes into software observability.
In software, you have the notion of traces, right?
You have an incident that occurs, but also the traces, the time-correlated or the request-scoped logs that help you figure out,
okay, where did this begin and where did this end?
And in data, right,
that's kind of replaced by the concept
of lineage.
But the tricky
thing is that lineage is
never perfect.
At least not
until Snowflake
starts surfacing it to everyone,
and even then Snowflake will not cover it end-to-end,
right? You also need a BI tool and upstream as well.
Maybe they'll work with RudderStack to figure it out, but there's always
some loss of resolution along the way.
So as a result, right.
Even if you build all those integrations and build an amazing parser, like you're
still working with incomplete information, whereas traces in the DevOps world can be extremely exact.
You might not be inferring causality,
but at least you have all the metadata that is relevant.
Yeah.
I mean, okay, like with observability in DevOps, from a product perspective,
the problem that you have there is that you need to build an experience that's probably
going... There's too much resolution
in a way, right? There's just
too much data and you need to help
the user navigate all this data to find their
root cause, right? So
that's the problem
that you have trying to design
a product experience with that. But when we're talking about
data observability, we have
vagueness together with probably way too much data
at the same time.
Because if you start collecting all the data,
you can also have an explosion there.
So how do you do that?
How do you build an experience that can help people
navigate this vagueness and complexity at the same time to figure out like the
root cause of the problem, right?
It's at the end or figure out if they can trust the data or not.
Part of this is a very challenging, like, computational problem on the back end, and then another part of it is a UI/UX
problem, which is no less difficult. It may even be more important. So let's take, for example, a table
that is delayed, right? It's usually refreshed every 10 minutes and it's been, you know, let's
say it's been two hours, and that
is unusual even after taking seasonality into account. Where if we surface this issue to
a customer, then we'd be like, okay, that's useful. But almost always the first question is,
does this matter? Maybe the table is not being used by anyone.
Maybe we don't need to fix it right now.
And then the second question is, what is the root cause?
So can I do something about it?
And only when all those three pieces fall into place, like a real issue has occurred,
it has an impact, and I can do something about it,
is this necessarily going to bubble to the top of your triage list.
But to answer your question, what that means is being very, I mean, it means a few things on the Metaplane side or any tool that's trying to do this for you. One is building really robust
integrations across your data stack. So it needs to be in your BI tool, ingesting all of your
dashboards and the components of those dashboards and getting the lineage to a table in as fine resolution as possible and making sure that that's up to date and reflecting the latest
state of your warehouse and latest state of your BI tool. It means disambiguating entities correctly.
So if you have a transactional database that's being replicated into your analytical database,
right? How do you know that one table refers to the other?
If you have a Fivetran sync, how do you know that this Fivetran sync is syncing those two,
like entity A to entity B?
That's a tough problem.
And then the third piece is, I'll call it prioritization, right?
Is one table might have 100 downstream dashboards, right?
And how exactly do you want to surface this to your user? Right.
Do you just say the number 100 or do you list all 100?
And there's a principle, at least in information visualization, Shneiderman's mantra, from the
inventor of the treemap.
He's a professor at the University of Maryland, I believe.
He always says like overview first and then filter and
finally details on demand.
So the way that we try to do it at Metaplane is like giving you as useful of an overview
of what happened in an incident and then letting you filter down what you think is relevant
and then finally zooming in on the details when you want it.
For example, the number of times that one dashboard that depends
on this table has been used.
Okay.
That's super interesting.
And you mentioned, okay, you said like, it's both like a UI UX
and a computational problem.
Uh, let's talk a little bit more about the computational problem.
So what are the challenges there?
Like what needs to happen on the backend, and what
methodology and algorithms do you have to use to track these things and
make sure that you surface the right thing to the user at the end?
One tough problem is anomaly detection.
One reason why data observability exists as a category is because it's tough to test your data manually.
There are great tools to do that where you say, okay, I expect this value to be above some threshold.
And honestly, every company should probably have a tool like that for the most critical tables. However, it becomes quite cumbersome to write code across your entire data
warehouse and then merge a PR every time the data changes, which is why data
observability comes in where us and everyone in the category says, okay, you
do that for the most important tables, but let our tool handle testing for everything else.
One necessary ingredient is some sort of anomaly detection.
It could be machine learning-based.
It could be more traditional time series analysis
where we track this number for you.
And of course, we had to take the traditional components
into account.
Here's a trend component.
Here's a seasonal component, but there's a lot of
bespoke aspects to both enterprise data.
So for example, row counts tend to go up and they tend to go up
at the same rate over time.
And if you use an off the shelf tool, you're just going to be sending false
alerts every single time it goes up.
But two, like your data is particular, right?
And your company is a little bit different.
So there's a lot of work that goes into anomaly detection because if you cry
wolf too many times, you're just going to get turned off.
Yeah.
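To make that concrete, here is a minimal sketch, in Python, of the kind of anomaly detection being described: track a metadata time series such as daily row counts, remove a fitted trend and a weekly seasonal component, and only alert when the remaining residual is unusually large. The linear trend, the weekly period, and the threshold are illustrative assumptions, not Metaplane's actual model.

```python
import numpy as np

def detect_row_count_anomalies(row_counts, period=7, threshold=4.0):
    """Return indices of observations whose residual looks anomalous."""
    y = np.asarray(row_counts, dtype=float)
    t = np.arange(len(y))

    # 1. Trend: row counts tend to grow steadily, so fit a simple line.
    slope, intercept = np.polyfit(t, y, deg=1)
    detrended = y - (slope * t + intercept)

    # 2. Seasonality: take the median detrended value for each position in the
    #    weekly cycle (medians keep one bad day from skewing the estimate).
    seasonal = np.array([np.median(detrended[i::period]) for i in range(period)])
    residual = detrended - seasonal[t % period]

    # 3. Robust scoring: use median/MAD so past anomalies don't inflate the scale.
    mad = np.median(np.abs(residual - np.median(residual))) or 1.0
    scores = 0.6745 * (residual - np.median(residual)) / mad
    return [i for i, s in enumerate(scores) if abs(s) > threshold]

# Example: steady growth plus a weekend dip, with one bad load on day 40.
counts = [100_000 + 1_500 * d - (20_000 if d % 7 in (5, 6) else 0) for d in range(60)]
counts[40] = 30_000
print(detect_row_count_anomalies(counts))  # [40]
```

A detector that ignored the trend would flag every new day simply because the counts keep growing; folding the trend and seasonality in first is the difference being pointed at here.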
The other component is log ingestion, where, let's say you're using
Snowflake, you get 365 days of query history,
and a tool like Metaplane will be ingesting all that query history and then parsing it for both usage.
So understanding how tables and columns are being used, but also lineage.
Mm-hmm.
So like what, what does this query depend on and what does it transform those dependencies into?
And this is a notoriously difficult problem.
I think no one has figured it out
with 100% coverage and 100% accuracy
across all data warehouses,
except for the people who,
the data warehouse vendors themselves.
Yeah, why do you say that that problem
is, like, notoriously hard?
What makes it so hard? Like, you have all the queries that have been executed, like, the past
365 days. What's the difficult part in using that to do, like, the lineage?
It's a combination of differing SQL dialects from warehouse to warehouse.
So things are starting to get standardized, right?
But the parser that you write for Snowflake is different than the one that you might write for Redshift.
And secondly, there's often a lot of ambiguity within the data warehouse, right?
Which tables are being used within this query?
And that's a relatively easy problem,
but then what columns are being used by those tables?
And tables might have very overlapping
or duplicate column names.
And you might say, okay, well, SQL is a well-defined language, right?
Snowflake is able to turn this SQL
into columns and tables that are being used,
but they have access to the metadata
and they have access to their runtime.
Yeah, yeah, yeah.
Absolutely, absolutely.
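As a rough illustration of the table-level half of that parsing problem, here is a sketch using the open-source sqlglot parser. The library choice, the Snowflake dialect, and the simple target-versus-source split are assumptions made for the example, not Metaplane's implementation; column-level lineage is much harder precisely because resolving ambiguous column names needs the warehouse's own schema metadata.

```python
import sqlglot
from sqlglot import exp

def qualified_name(table):
    """Build database.schema.table from whatever parts the query spelled out."""
    return ".".join(part for part in (table.catalog, table.db, table.name) if part)

def table_level_lineage(query, dialect="snowflake"):
    """For a CTAS or INSERT, return (target table, set of source tables)."""
    parsed = sqlglot.parse_one(query, read=dialect)

    target = None
    if isinstance(parsed, (exp.Create, exp.Insert)):
        target = qualified_name(parsed.find(exp.Table))  # first table is the write target

    sources = {qualified_name(t) for t in parsed.find_all(exp.Table)}
    sources.discard(target)
    return target, sources

query = """
    create or replace table analytics.fct_orders as
    select o.id, o.amount, c.region
    from raw.orders o
    join raw.customers c on o.customer_id = c.id
"""
print(table_level_lineage(query))
# ('analytics.fct_orders', {'raw.orders', 'raw.customers'})
```

Run over a year of query history, edges like these can be stitched into a dependency graph, but each dialect quirk or unresolved column chips away at the coverage discussed above.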
So you think that this could be easier
to handle if more metadata were exposed
by the database system at the end.
Right? If the information that was exposed through Snowflake, for example, was more complete,
that would help a lot to figure these things out. So it's more about exposing more of the internals
of the database system that is needed there. That's interesting.
It's very interesting.
All right.
Okay, about anomaly detection.
What are you doing in your product
with anomaly detection right now?
Like, do you have some kind of functionality around that and how does it work?
Yeah.
One quick note on the data warehouses releasing their internal lineage.
I know that Snowflake is starting to do this.
It may only be available to enterprise customers right now.
Oh, okay.
But the moment they do that, one whole category of tools will have a
much harder time, the data lineage tools and everyone else will be
exponentially more powerful.
If we had access to that for all of our Snowflake customers, which is
basically almost all of our customers, it'd be insane the amount of
workflows we could unlock.
Okay.
That's interesting, actually.
So it's going to be a problem for the lineage companies and the products out there, obviously,
because the product is going, like the functionality is going to be provided, let's say, by Snowflake.
But at the same time, this is going to make things much more interesting
for you.
But is there a reason?
I mean, why is this going to happen?
Outside of having access to the metadata, to the additional metadata, is there something
else that's going to make it more interesting because all your customers are on Snowflake
or it doesn't matter?
I think it's primarily being able to rely on their
lineage over our lineage.
Part of it, like does it mean that they're much more correct and up to date
and have higher coverage than we do?
Mm-hmm.
Yeah, but on the other hand, that's like only the lineage that
lives as part of Snowflake, right?
Like what happens before and after that.
So let's say you have, I don that. So let's say you have,
I don't know, let's say you have Spark doing some stuff on your S3 to prepare the data,
and then you load this data into Snowflake, which I think it's pretty common, like in many use cases.
So even like if Snowflake does that, how can you see outside of Snowflake,
especially like before the data gets ingested into Snowflake?
Totally.
Yeah.
They don't have the full picture, which is why data observability tools come in
and kind of augment, right, say, okay, the lineage within the warehouse might
be a very key part of the picture.
But it's not all of it, right?
It's not the downstream impact.
It's not the upstream root cause.
Yeah.
Which is how the two play together a little bit.
Yeah, it makes sense.
Makes sense.
Okay.
So back to anomaly detection.
What do we get from you today in terms of anomaly detection?
Like what's, what's happened?
Like what can I use out of the box?
So out of the box right now, if you go to metaplane.dev, you can sign up.
And sign up through email or G Suite and connect your warehouse, your transformation
tool, your BI tool.
Typically, people can do this within 15 minutes.
We've had highly motivated users do it
within five, which is insane because I can't even do it within five. But I guess when you want it,
you really are motivated to do it. And off the bat, we cover your warehouse with tests based
on information schema metadata. So for Snowflake, right, row counts and schema and freshness kind of
come for free across your warehouse.
You can go a little bit deeper with out-of-the-box tests, like testing
uniqueness, nullness, the distribution of numeric columns, or you can
write custom SQL tests. And with all of these tests, our customers
usually blanket their database and have hundreds of tests on top of those within like 30 minutes.
Then you just let it sit because we have the anomaly detection kind of running for you in the background as we collect this historical training set.
And depending on how frequently your data changes, it can be either between one day or five days
until you start getting alerts on that data.
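Because those out-of-the-box tests lean on metadata rather than table scans, a freshness check of the kind described here can be sketched against Snowflake's INFORMATION_SCHEMA, which already exposes row counts and last-modified timestamps. The connection parameters and the 24-hour staleness threshold below are placeholder assumptions, not Metaplane's defaults.

```python
import snowflake.connector

FRESHNESS_SQL = """
    select table_schema, table_name, row_count, last_altered
    from information_schema.tables
    where table_type = 'BASE TABLE'
      and last_altered < dateadd(hour, -24, current_timestamp())
"""

def stale_tables(connection_params):
    """Return (schema, table, row_count, last_altered) for tables not updated in 24 hours."""
    conn = snowflake.connector.connect(**connection_params)  # account, user, password, ...
    try:
        cur = conn.cursor()
        cur.execute(FRESHNESS_SQL)  # reads metadata only, so it adds almost no compute cost
        return cur.fetchall()
    finally:
        conn.close()
```

Because the query touches only the information schema, it can run at the top of every hour without noticeably adding to warehouse spend, which is also why this class of check comes "for free."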
Okay.
So, all right.
So it's like between one and five days.
That's neat.
And the deployments that you have so far, right, because we are talking
about like data observability, the conversation that we have is like focusing a little bit more, that's how I feel at least,
on the data warehouse.
So would you say that what Metaplane is doing today is more of observability of the data
warehouse, or do you provide, let's say, observability across the whole data stack that the company might have?
Let's say I have streaming data and I have a Kafka somewhere.
And then I also have a couple of other databases.
And then I might also have a Teradata instance somewhere running.
What kind of coverage would you say that Metaplane provides today?
We are focused on the warehouse
and its next door neighbors right now.
Part of that is a strategic move as a company, right?
Like we want to start from the place
with the highest concentration.
And Snowflake is getting tons of market share
as is Redshift, as is BigQuery,
so we don't have to build a whole slew of integrations.
Those three cover a lot of the market today.
And most of our customers use one of those three.
We have the downstream BI integration,
so Looker, Tableau, Mode, Sigma,
kind of go down the list, Metabase, we support,
as well as the transactional databases
like MySQL and Postgres,
and increasingly many OLAP databases like ClickHouse. That's where we stop. And honestly,
that's where everyone in our category stops today. I'm not very happy with that because
this is just a level one of monitoring. When you check out an observability tool in two years or in five years,
it's going to be completely different.
It's going to be much like the picture that you described, Costas,
where it's like fully end to end.
That's, I think that is not only important, but really critical because data is
ultimately not produced from your data warehouse, right?
Snowflake does not sell you data.
It sells you a container into which you can put your data, but that data is being produced
by product teams, engineering teams, go to the market teams, and they're being consumed
by those teams too.
So when we talk about data trust, which you mentioned before, which I think is a much
better category name than data observability, because what is that? That trust is ultimately
in the hands of the people who consume and produce the data. That's where we as a category have to go.
That's interesting. Okay. So what's your experience so far with the other, let's say, big container of data, which is data lakes, right?
So we have the data warehouses, a much more structured environment there, but we also have data lakes.
Okay, Databricks is dominating there.
Completely different environment when it comes to interacting with data. And okay, I mean,
there's also like this new thing now with the lakehouse, where you also have like SQL interfaces
there. But what have you seen so far, like with data lakes and observability there, because that's
also like a big part, right, of like working with data, especially with big amounts of data. And
in many cases, let's say, a lot of the work that is happening
before the data is loaded
into something like Snowflake,
it has to go through like data lake, right?
So is Metaplane doing something with them today?
Plans to do something like in the future?
And what do you think is the role
that data lakes will have in the future?
Honestly, we don't come across data lakes too often.
Part of it is where we're focused in the market.
If you're, for example,
at a company with less than 5,000 people,
Metaplane is probably the right choice for you
as the data observability tool.
It has fast time to value, fast time to implement,
a focus on the workflows.
And if you're above 5,000, there are other options on the market and you might be in a position to build it in-house too.
We found, maybe this is incorrect, that Databricks is much more highly concentrated at the enterprise.
And when we come across a company that uses Databricks,
frequently, they're also using Snowflake or a data warehouse, and they're using
Spark for, like, pre-Snowflake transformation.
Yeah, yeah, yeah.
A hundred percent.
Oh, that's, that's interesting. But you don't see the need right now for Metaplane to work into
observability for these environments, right? And the reason I'm asking is because technically
it's something very different. And I'd love to hear what are the challenges there? What
are the differences? And learn a little bit more about that.
That's why I'm insisting on these questions around the data lakes and the Spark ecosystem.
There are some big challenges.
I mean, there are some engineering challenges, like having to rewrite all of our SQL queries into Spark queries.
And having it run not necessarily on a table,
but on a data frame.
And there are also differences
in terms of the metadata that's available to you,
where data warehouse metadata,
we found, is quite rich in comparison
with the metadata that you might have
within a data lake,
where you might have the number of rows,
or you might not, right?
You might have to run a table scan for that,
or continuously monitor the queries
to keep a log of the number of rows.
Even to get the schema, you might have to do a read.
It's, in general, much harder to have
the level of visibility that you have in a warehouse
as into a data lake.
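As a small illustration of that gap, the same freshness and volume numbers that a warehouse hands over from its information schema typically require a Spark job and a file scan on a data lake. This is a hypothetical sketch; the path and column names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-profile-sketch").getOrCreate()

# Even discovering the schema means reading files from object storage.
df = spark.read.parquet("s3://example-bucket/orders/")

profile = df.agg(
    F.count("*").alias("row_count"),                        # needs a scan, not a metadata lookup
    F.max("updated_at").alias("latest_record"),             # stand-in for a freshness check
    F.avg(F.col("amount").isNull().cast("double")).alias("amount_null_rate"),
).collect()[0]

print(profile.asDict())
```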
Yeah, a hundred percent.
I mean, the query engine makes like a huge difference there when you have to interact
like with that stuff.
All right, cool.
So Snowflake, Metaplane, like your experience so far, because I mean, you mentioned BigQuery,
Snowflake and Redshift.
And from what I understand, like there's probably like a big part of your customer base on Snowflake. What's your experience like with these three platforms so far?
Like give us like your pros and cons of each one of them.
There's pros and cons of each.
For sure.
Snowflake has the richest metadata in terms of the freshness and the
row counts of different tables.
BigQuery also has that
metadata. However, to use Metaplane, our customers either have to tack us onto an existing warehouse or
they provision a warehouse specifically for Metaplane. And this is nice because you can
separate out the compute and keep track of our internal spend that is incurred through this monitoring.
But at the same time, we necessarily impose a cost, whereas some users who use Redshift clusters that are not at their full capacity can tack on Metaplane at no visible financial cost to themselves.
That makes sense.
Yeah.
I think that's like, okay, it's the trade-off between having like the
elasticity of the serverless model that BigQuery has compared to, you know,
like paying for a cluster that yeah, obviously it can be underutilized and
when it's underutilized, you can put more stuff there without paying more.
Right.
But yeah, it's like the tradeoff that every infrastructure team has to face
at some point with hard decisions.
Right.
But, let's say, in terms of what is supported, is
Metaplane the same experience across all the three different platforms,
or do you have more functionality towards one or the other
because of what they expose?
It's the same experience across all three.
Okay.
No major differences.
Okay.
That's great.
And how much of a concern is the cost at the end?
I mean, the additional cost that is incurred by a platform like Metaplane that continuously monitors the data on the data warehouse.
It's surprisingly much less than people might expect.
We're using the information schema as much as possible and the existing metadata.
So for the tests that rely on your metadata, right, we can read that within seconds at the top of the hour or whatever frequency you set.
And it turns out to be a pretty negligible amount of overhead compared to spend that you might have
from other processes running on your data warehouse, like measured in single digit
percentage points. Some customers have longer running queries for much larger tables or more sophisticated
monitoring, but typically that step is taken more deliberately so that the cost is more justified.
So there are like, let's say there are just cases where like people are, okay, you have,
let's say a continuous monitoring where you establish, let's say your, how to say that,
like the monitors and they run every,
I don't know, one hour, 10 minutes, one minute, whatever.
But do you see also like ad hoc monitoring that users do?
Like, do they use the tool also for not just for monitoring, but also to debug problems
with the data?
Totally.
That is the next step. After the monitoring, like the flag kind of goes off, now you have, well, one, you know that an incident occurred, but two, you have this historical record of what the data should be and how it has been over time.
It's a little bit like debugging.
Once you have a product analytics tool.
Yeah.
If you did not have a product analytics tool,
you don't necessarily know
what the latency has been over time,
what all the dependencies are,
what has happened in a user's journey.
And it's very similar with Metaplane
where in addition to the core incident management workflow,
there's another component,
which is trust and awareness in data where
teams that bring on Metaplane, of course, at first it's often because, you know,
stuff has hit the fan and they're like, okay, now we need to get ahead of it next time around.
But right after implementing Metaplane, it could be within a few minutes and you see how queries
are being used across the warehouse, how the lineage looks from within your data stack.
It's like, wow, how did I live without this?
Yeah.
Yeah.
Familiar quote.
Okay.
Take us by the hand now and like, give us like an example.
Like, let's say we have an incident, right?
Like a monitor goes off and it's like, oh, something is wrong with this table.
Okay.
And from things that you have experienced, like a common example, like describe to us
like the journey that the user goes through Metaplane from that moment until they can
resolve the problem.
And I'd love to hear like what happens inside
Metaplane for that and what outside, right? Like how these
two like work together for the user like to figure out and solve
the problem.
So today, Metaplane is like, let's say you have like a home
like security system. It is the alarm and it is the video.
It does not call the police for you.
And it does not do the tracking for you.
So in Metaplane, we will send you a Slack alert or maybe a PagerDuty alert saying, this value, we expected it to be 5 million.
It fluctuates a little bit, but now it's at 1 million.
These are the downstream BI reports.
So this dashboard has been last viewed today
this many times by these people.
And here are the upstream dependencies.
So here are all the dbt models that go into this model.
And what you can do from there is click into the application and kind of see the overall impact of this view.
And assess like, okay, what are the immediate upstream root causes?
And then two, you can give feedback to our models, where if this is actually an anomaly and you want to continue to be alerted on this, then you mark it as such.
If it was actually normal, because at the end of the day, data does change and no anomaly detection tool is a hundred percent accurate,
Yep.
you click on it and say, okay, this is actually a normal occurrence,
do not continue to alert me on this, and then we'll kind of exclude it from our models. Frequently, when you have an alert, our customers start a
whole conversation around that alert, looping in other members of their team, creating
Jira or, like, Linear tickets to address this issue. But that is where we stop, short of the actual incident resolution.
That's where we want to go in the future.
But today, it kind of stops there.
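For a sense of what such an alert bundles together, here is a hypothetical payload with made-up names and numbers; Metaplane has not published its actual schema, so this is only an illustration of the impact and root-cause context described above.

```python
alert = {
    "monitor": "row_count",
    "entity": "analytics.fct_orders",
    "expected": 5_000_000,
    "observed": 1_000_000,
    "downstream": [
        # Impact: which dashboards depend on the table and who is looking at them.
        {"dashboard": "Revenue Overview", "views_today": 14, "viewers": ["vp_of_sales"]},
    ],
    "upstream": [
        # Root-cause hints: the dbt models that feed the table.
        {"dbt_model": "stg_orders"},
        {"dbt_model": "int_orders_joined"},
    ],
    "feedback": None,  # the user later marks this as a real anomaly or a normal change
}
```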
Yeah, makes sense.
And that's my last question.
I'll give it to Eric.
Give us some from your experience
because obviously you've been exposed
to many different users out there and issues.
So what's one of the most common
reasons that data goes bad? I like how you said that there's many issues, because that's what
we've observed too. It's like the whole, you know, Tolstoy quote of, all happy families are
alike, all unhappy families are unhappy in a unique way. The same thing is true for data, right?
Where there's so many reasons why data can go wrong.
It goes back to what we were saying of, you know, either someone put it in wrong,
machine did something wrong, or there is some logic that's applied incorrectly.
But that said, across all of our customers, delays or freshness errors are probably the most common issue.
Second is probably a schema change, whether it's within the data warehouse or upstream.
And the third is a volume change, where the amount of data that's being loaded or exists is higher or lower than you expect.
It's a whole long tail from there.
And all of that is kind of correlated with the causes of data quality issues.
This depends on the team, right?
If it's a one-person team, you do not have many data engineers or analytics
engineers stepping on each other's code, right?
And there might be many more third-party dependencies that cause issues.
If you're on a larger team, perhaps shipping bugs might be, like actual
software bugs, not data bugs, might be more frequent.
Awesome.
Eric, all yours.
I monopolized the conversation, but now you can ask all your really, really hard questions.
It was fascinating.
Okay.
So I want to, let's dig into Tolstoy a bit more because that quote is an amazing quote.
I think it's called, isn't it like a principle, like the Anna Karenina principle or something?
That's exactly what it is.
Yeah.
Okay.
So this is the reason I want to dig into that a little bit more.
You've mentioned the word trust a lot through our conversation.
And in fact, that's been a recurring theme on the show, you know, sort of through a bunch
of different iterations.
I would even say from the very beginning, Costas, just one of the themes that comes
up consistently.
So what's interesting though, is if we think about
some of the examples we've talked about, you know, you have the executive stakeholder who's,
you know, refreshing a looker report and something's wrong, or the salesperson,
you know, doesn't necessarily know exactly why, but they know the revenue numbers off or whatever.
And so what's interesting is that's, those examples kind of represent a one-dimensional trust almost, right?
Which is things don't go wrong, right?
Like, I trust you if nothing ever goes wrong.
Which, you know, in the real world, like that sort of one-dimensional trust, you know, isn't really a great foundation for relationships.
So, like, you know, it's just kind of like the Anna Karenina principle, which I know I'm sort of stretching a little bit.
So thank you for humoring me.
But like, it's interesting, right?
Like, if the reports aren't broken, then everyone's happy, right?
Like, things are good. What are the other dimensions of trust, A, that you've seen, or B, that you are trying
to impact with Metaplane or the way that you think about, you know, data quality and lineage
and those sorts of things?
I love how you brought it back to trust because that is simultaneously a very simple problem.
I mean, you could state it simply, but also extremely complex,
like you're alluding to, where you could define trust, not necessarily that something's going
wrong, but that there's some contract between two parties that is violated in some way. And if the
contract is not explicit, then the two parties will always have implicit contracts. And unfortunately, in the data world,
the implicit expectation of a data consumer
is frequently that the data is just not wrong.
It's exactly what you're saying.
The data is wrong.
What am I paying you for?
Why are we paying Snowflake so much money
if the data is wrong?
But as we're alluding to,
that is not a reasonable expectation across the board.
A reasonable expectation from a data consumer might be, I am aware that data is not perfect,
that it will never be perfect, the same way that you will never have software without
bugs and code.
So how can you expect that to be true for data as well?
But I think part of it is establishing these contracts and these expectations up front with
both the data consumers as well as with data producers and saying, okay, this is what you
can expect from the data and how it will trend over time and how I will try my best as a team
to make sure that it meets the demands of this particular
use case.
I think that's a shift that I would love to see in the data world.
Instead of talking about data being perfect or being ideal, we could talk about it
being sufficient for the use case at hand.
Where if this dashboard is being used every hour, right.
Do we really need real time streaming data?
Right.
If this is making more of a directional decision, as opposed
to being sent to your customer, right.
Does the data have to be completely correct?
Right.
Enough to like shatter your trust in it over time.
Right.
So I think really reverse engineering from the outcome and the people who are using the
data is the most clarifying approach that we found to think about data quality and data
trust over time.
Super interesting.
Okay, let's dig into that just a little bit more, just because I'm thinking about our
listeners who, you know, and even myself,
you know, we deal with these types of things every day. So I love what you said, but my guess
would be that there are a lot of people out there who, well, let me put it this way.
If you have an explicit contract that requires mutual understanding, right? And even mutual
agreement on, let's say it's a real estate contract, right? Like there's mutual agreement
on say default and other things, right? Which both parties need to have a good understanding of
for expectations to be set well, right? So if we carry that analogy over to an explicit contract between a data consumer and say,
like the person who's building the data product, you know, in whatever form that takes,
one of the challenges I think probably a lot of our listeners have faced is that if you
try to make that contract explicit, the consumer oftentimes can just say,
you know what? I don't actually really care about these definitions that we're trying to agree on.
And sometimes maybe there's some ill intent there, but a lot of times it's like,
look, I'm busy. We're all busy. And I would love to like understand like your pipeline infrastructure and
data drift issues and whatever. Can you speak to how you've seen that dynamic play out? I mean,
I think in some ways that's getting better as data becomes more valued across the organization,
but I think in a lot of places there can still be a struggle to actually make an explicit
contract, like a practical reality and a collaborative
process inside of a company. You're right. It is an idealistic process. However, I do think
the conversation is important, not just to talk about expectations of the data, but really just
to understand what exactly do the users of data want, right? And, you know, members of data teams are, it's a tough job, right? Because
a classic example is, okay, someone asks you for a dashboard, but do they really want a dashboard?
Do they really want this number to be continuously updating over time and to have a relatively fixed
set of questions that can be, you know,
varied a little bit, but not be super flexible.
Or do they want data activation, to use it in Salesforce?
Or do they just want a number like right now?
And it doesn't have to be changing over time. Or do they want a data application that is maybe more involved,
but is much more flexible and has
both inputs and outputs, right?
I think that is the importance of having a conversation about expectations from users,
like your stakeholders. You know, there are some downsides and it takes a lot of time,
but I think once the consumers of your data feel like you really understand where they're coming from, that is a foundation from which you can build trust.
Right. It's like, okay, they kind of get what I'm asking for.
And in reverse, I know the amount of work that goes into producing data products. Okay, now the trust is much less brittle, and maybe you don't need that explicit
contract, but what you develop implicitly, you know, an implicit contract, is that, yeah, I know, okay, even when it's not completely right,
I can still trust it, because there's a human on the other end of it.
yeah if only there were software that could solve the problem of time compression and mutual understanding and the investment that it takes to build that between two humans.
We talked before this call about all of the SaaS products that exist, but I really think tools are just tools, right?
They exist because people use them to do processes more effectively and more consistently over time.
If a tool doesn't result in something actually changing in terms of people's behavior, you know, and this is a tool that actually is being used by people, not machines, then is it really that important?
Yeah, totally.
Okay, well, we're close to the buzzer here.
I want to end by asking you an admittedly unfair question, but one that I think will be
really helpful for our listeners and for me.
And I'll start with the unfairness.
So none of the answers to this question can relate to Metaplane or data lineage or data,
you know, quality tooling at all.
Okay.
So outside of, you know, what you're sort of trying to build, you know, with your life
and your team, if you could give one piece of advice to our listeners out there who are
working in data in terms of building data trust, even maybe like
one practical thing they could do this week before the week is over, what's the one thing that you
would tell them to do? Like if you could only do one thing to sort of improve trust, what would
that one thing be outside of all the, you know, data lineage? So sorry for the unfair question.
No, no. Well, as you said, at the end of the day, data lineage, data observability, it's just a technology,
right?
It is one technology that can be used to solve a much broader problem that can't be solved
by one tool or even like 10 tools.
I would say to conduct some user interviews.
If you had a week or two weeks, have one-on-ones with every person
at the company who could be using your data or is not using the data as much as you would like,
or in the ways that you would want, and sit down and really approach them as if you're
like a founder building a product for a customer. What do you really want here?
What problem are you trying to solve?
How will you know that you've solved that problem?
And how can I improve the product that I'm developing for you?
That, I think, is a process that we've seen our customers, especially the ones who are very, very high performing data teams, do over time.
And that really starts you from this position of the trust is yours and it's yours to lose,
as opposed to starting from a position where you have to build it up over time.
Super helpful.
All righty.
Well, thanks for giving us a couple of extra minutes for me to ask you an unfair question.
This has been such a great conversation and best of luck with Metaplane.
It sounds like an awesome tool
and it sounds like you're doing great stuff.
Thanks, Eric.
Thanks, Costas.
This has been an amazing conversation
and thanks for having me on.
I'm such a fan.
Absolutely.
Well, Costas, of course,
I have to bring up Tilapia
and the fact that you can drop a Tilapia into a tank and they all start
to behave the same, you know, which is interesting, which actually is pretty similar to VCs with
new data technology.
It's like you drop a new data technology and all the VCs start to behave the exact same,
you know, which is really interesting.
So that was one takeaway.
Do you think we should rename FOMO
into the Tilapia Effect or something?
VC FOMO, the Tilapia Effect.
I love it.
So that was one thing.
On a more serious note,
I thought the discussion around implicit
and explicit contracts was really helpful.
You know, I think we talk about the way that data professionals interact with other teams, the way that tooling sort of facilitates
those interactions, et cetera. And it was helpful for me, even in my own day-to-day work, to really
just think about what implicit contracts do I have with other people in the organization, right?
Whether they be consumers of data that I produce, you know, maybe for my boss or, you know,
for the data that I consume from other data producers.
So that was really helpful for me.
Yeah, a hundred percent.
I think that's like a big part of building organizations.
And I am pretty sure that you have experienced that by like building companies from scratch
and like scaling a company or a team, like big part of
it is actually figuring out all these contracts and make them more explicit. Like when we say
like we need the process to make things scale, that's what pretty much we are talking about,
right? Like when you're alone and you're running the whole growth function on your own, like,
yeah, you have like plenty of contracts with yourself, right? And then you've got the other person and then another person,
and suddenly the contract is not exactly the same, right? And that's where friction starts.
And I think one of the first steps that you have to do when like you're trying to scale
an organization is actually doing that. And that's human nature. It's something that we see with data, it's something that we see
with software, it's something that we see with everything. So yeah, a hundred percent. I think that was like
an extremely interesting part of the conversation that we had, outside of all the rest that we talked
about, like the technologies, where observability goes, and how they all work together.
But that was actually my other very interesting point of how related these
products are with some foundational products like the data warehouse, for
example, and what the data warehouse exposes and the metadata there and how
this can be used to deliver even more value in observability
and all these things.
So yeah, always interesting to chat with Kevin
and hope to have him back really soon.
Agree.
All right.
Well, thank you for listening.
And if you like the show,
why don't you tell a friend or a colleague about it?
We would love for you to share the episodes
that you like the most with people you care about.
And we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me at eric at datastackshow.com.
That's E-R-I-C at datastackshow dot com. The show is brought to you by RudderStack,
the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.