The Data Stack Show - 93: There Is No Data Observability Without Lineage with Kevin Hu of Metaplane

Episode Date: June 29, 2022

Highlights from this week’s conversation include:
Kevin’s background and career journey (1:54)
Metaplane and the problem that it solves (6:47)
The silence of data problems (9:53)
Data physics work that requires more (13:35)
Trusting data when bugs are present (19:12)
Building a navigable experience (22:36)
Developing anomaly detection (30:06)
What Metaplane provides today (35:05)
Metaplane’s plans for the future (37:45)
Comparing BigQuery, Snowflake, and Redshift (40:56)
Why data goes bad (48:15)
Advice for data trust workers (59:24)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, we're talking with Kevin Hu from Metaplane. Costas, there are a lot of tools in the data observability space, and that's what Metaplane does. And I'm interested to know, of course, I do a lot of stalking on our guests for the shows, but I want to know how he went from MIT to
Starting point is 00:00:46 starting Metaplane, you know, because that's an interesting dynamic sort of coming out of academia and then going through Y Combinator and starting a company. So I just want to hear that backstory. How about you? Yeah, I want to learn more about the product, to be honest. I mean, it's data observability and data quality and like, I don't know what other name we're going to have tomorrow for the category. It's like a very hot
Starting point is 00:01:08 product category right now in terms of like development and like innovation. And I think he's the right person like to chat about that. So let's see how Metaplane understands and implements data observability and also what's next after that. Like what are the plans
Starting point is 00:01:24 there and where the destination is going? Let's do it. Kevin, welcome to the Data Stack Show. We're so excited to chat with you. So excited to be here. I'm a longtime listener of the show. I recognize both of your voices and to be here with you on the Zoom, it's really a privilege. So thank you.
Starting point is 00:01:43 Cool. Well, we always love hearing from our listeners, and especially when they are guests on the show. So I want to, of course, I do LinkedIn stalking. Our listeners know this. You probably know this from listening to the show. So you started at MIT studying physics and then you made the switch over to focusing on more computer science subjects.
Starting point is 00:02:04 And so I have two questions for you. One, why did you make the switch? And then two, did that influence you starting Metaplane, actually sort of studying those topics from an academic standpoint? Yeah, I think, well, one, great research. It's true, Costas and I have both found ourselves in the either fortunate and privileged or unfortunate
Starting point is 00:02:26 place of seeing each other at some point. And I did start studying physics. And I remember the gauntlet course at the time, which was the experimental lab course everyone took as a junior, was notorious for burning people out. One week, you replicate a Nobel Prize-winning experiment, and the next week you analyze it. Something that really stood out to me was that the people who had the hardest time in the course weren't necessarily the weakest physics students, but the people who didn't know MATLAB and didn't know Python.
Starting point is 00:03:02 So they could collect the data, but weren't able to analyze it. They were the ones who are pulling all-nighters. And at the same time, my sister, who is a biologist, she had about five years of data on fish behavior. So tilapia are very interesting fish. You have a tank of them, you drop in another tilapia, and all the other tilapia change. Oh, fascinating.
Starting point is 00:03:25 Yeah, they're very tribal, very easy to observe. And at the end of five years, she messages me saying, hey, Kevin, can you help me analyze this data because I don't know R. And to me, this is just absurd because why are some of the brightest people in the world bottlenecked? Because they don't know how to write code. And obviously that doesn't apply only to scientists, but really to anyone who works in an organization who either produces data or consumes data. If they don't know how to program, you're not necessarily working with data in the most low friction way. So that's how I got into CS research, trying to build tools and develop methods for automated data analysis.
Starting point is 00:04:10 This is back in 2013. Okay, wow. Super interesting. Tilapia are also tasty, by the way, you know, if you're a good cook. That's a good point. That is a data point. That's a qualitative data point.
Starting point is 00:04:24 Happy to share that with your sister. I have plenty of tilapia data points too. Hopefully your listeners are not fish or people. That's right. Okay. So tell us, so you studied, you studied computer science tooling, how to sort of support people, help people based on your experience of really bright people not being able to analyze data, take us from there to starting Metaplane and then tell us what Metaplane is and does. So for six years, we built tools that given a CSV, try and predict the most
Starting point is 00:05:03 interesting by some measure, like visualizations or analyses that could come from that CSV. So at first it was really rule-based, but then it was more machine learning-based where we had a lot of datasets and visualizations and analyses scraped from the web. And the papers were really interesting. And it turned out you could predict how analysts worked on a data set with relatively high accuracy. The problem was when we tried to deploy it at large companies, including Colgate-Palmolive,
Starting point is 00:05:34 Estee Lauder, they funded a large part of my PhD. And I still have many goodie bags. Some of my colleagues have GPUs. I have retinol. Lots of toothpaste. Yeah, tons of toothpaste. I'm not complaining. But the problem was when we wanted to deploy these tools, it became very clear, like, okay, connect us to your database.
Starting point is 00:05:54 And they'll ask, like, okay, what database? We have, like, 23 instances of SAP. This was back in 2015 and 2016. So it was a bit worse back then than it is today. But it became clear that data quality is one of the biggest impediments to working with data, not necessarily the last mile, when you already have a final clean data set and you're generating analyses from it. So that's the motivation to build Metaplane where, you know, we couldn't necessarily make that flower grow. Now we have the augmented analytics and different
Starting point is 00:06:32 categories arising, trying to do that analysis, but we figure, you know, we can plant the garden, maybe someone else can take it from there. Very cool. And so tell us, tell us about Metaplane. Like what's, what's the problem that it solves? So Metaplane, we like to think of it as the Datadog for data. It's a data observability tool that connects across your data stack to your warehouse like Snowflake, to your transformation tool like dbt, a BI tool like Looker. And very simply, we tell you when something might be going wrong. Specifically, there's a big asymmetry that we observe today where data teams are responsible for hundreds or thousands of tables and dashboards. And this is great in part because data is becoming a product, right? It's no longer used just within the main vein of BI and decision support, even though that will always be important, but getting reverse ETL'd, okay, maybe that term is not cool anymore, but being activated into marketing tools,
Starting point is 00:07:38 being used to train machine learning models, and that is all good. The promise of data is starting to be more and more true. However, while your data team is responsible for hundreds of tables, your VP of sales only cares about one report, which is the Looker dashboard that they're currently looking at. So there's this asymmetry where frequently teams find out about data issues or silent data bugs, as we call them, when the users of data notice it and then message the data team. That matters for two reasons.
Starting point is 00:08:10 One is that if you've received those Slack alerts, and if you're listening to this podcast, you probably have, you know that there goes your afternoon, and you did not have much time to spare to begin with. But two, trust is very easy to lose and hard to regain, especially when it comes to data. Because once that VP of sales decides, okay, screw this, I'm going to have my RevOps team build up reporting in a shadow data stack, then what was the point of getting Snowflake and getting all this data together to begin with?
Starting point is 00:08:42 If we don't have a culture around trusting data, it doesn't really matter how much of it you collect or use. Yeah, absolutely. I want to dig in on one thing and then I'll hand the mic over to Costas. But could you describe, so you mentioned the silence of sort of errors, you know, or bugs or problems that happen with data, which is a really interesting way to think about the problems that we face in data. So two questions for you. One, how do you think sort of the audible nature of those things differs in data, say, as compared with like software engineering? Because, you know, in software engineering, like if we think about Datadog, you know, there's a lot of defined process and tooling or whatever, and a lot of that's being adopted into the data world.
Starting point is 00:09:32 So one, would love a comparison there. And then two, could you just describe on a deeper level, and maybe do this first, describe a silent problem: why are the problems with data silent, or why do you even use that term? Yeah, let's start from that silent data bug. Great questions. Frequently, all of your jobs are running fine, right? Airflow is all green, Snowflake is up, and yet your table might have 10% of the rows that you expected. Or some distribution, like the mean revenue metric, has shifted a little bit over to an incorrect value.
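To make the "silent" part concrete, here is a minimal sketch of the kind of value-level check being described: the pipeline ran successfully, but the values themselves drifted. The numbers and the 10% tolerance are made-up assumptions for illustration, not Metaplane's implementation.

```python
# Hypothetical "silent data bug" check: jobs are green, but the data is off.
def silent_bug_suspected(current: float, history: list[float], tolerance: float = 0.10) -> bool:
    """Flag when the current value deviates from the recent average by more than `tolerance`."""
    baseline = sum(history) / len(history)
    return abs(current - baseline) / baseline > tolerance

# Row count arrived ~10% short even though every Airflow task succeeded.
print(silent_bug_suspected(current=90_000, history=[100_500, 99_800, 100_200]))  # True

# The mean revenue metric shifted slightly to an incorrect value.
print(silent_bug_suspected(current=43.1, history=[48.2, 47.9, 48.4]))  # True
```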
Starting point is 00:10:17 So these sorts of issues in the data itself, unless you have something that is continuously monitoring the values of the data, aren't necessarily flagged by infrastructure checks, like whether your systems are up or your jobs ran. And that's why we do want to make the silent data bugs more audible, increase the volume a little bit, because if you don't know about these issues occurring along the way, then inevitably the only place that you will notice it is at the very end, right? When the data is being consumed. One, because that person has the most incentive to make
Starting point is 00:10:56 sure that the data is correct. But frequently the person who's using the data also has the most domain expertise. If they're on the sales team, they might know what exactly should go into this revenue number. They might not have known how it was calculated along the way, but they know when it's wrong. And that is one departure from software observability, which really is the
Starting point is 00:11:20 inspiration for data observability. Right. The term was completely co-opted from like the Datadogs and Splunks of the world. But to be fair, they co-opted the term from control theory, where observability has a very strict definition, right? It's the mathematical dual of the controllability of a system, a dynamical system where you want to infer the internal state from the outputs. So I don't feel too bad about stealing the term. All art is theft, right? Exactly, exactly. If we
Starting point is 00:11:52 keep tracing it all the way down, like back hundreds of years, we'll find, you know, a Dutch physicist trying to figure out how to make windmills turn at the same rate as the grain is ground, which is true. I love it. Just to finish that, one thought: in the software world, before the Datadogs, right, you would frequently find out about data issues, I mean, software and infrastructure issues, when the API went down or when your heartbeat check failed. But as the number of assets that you're deploying increases and increases, that level of visibility is just not sufficient, right? Now, if you're on a software team, it's almost mind-blowing to think that you want your customers
Starting point is 00:12:37 to find out when your API is failing or when a query is slow. You want to find out about that regression internally. Yeah, absolutely. Okay. Before we resume the conversation about observability, I want you to go back to physics and your other graduate studies. And I want to ask you, and that's like a very personal curiosity that I have, like from all the stuff that you have done in physics, what was, let's say, the one that required the most in terms of working with data and using R or Python? What do you think couldn't exist in a way almost, let's say, if we can exaggerate, as a domain of physics,
Starting point is 00:13:24 if we didn't have computers today and all these languages and all these systems to go and crunch the data. I have two answers to that question. One is when I was doing more pure physics research, like AMO, atomic, molecular, and optical physics research, you can think about ultracold atoms using laser cooling and trapping, where the fine level of control that you need to calibrate these systems, and the amount of data that you're retrieving from the systems that you're observing, is immense.
Starting point is 00:13:59 Right. There's a reason why, you know, high-performance computing was really invented at CERN, and why the web was kind of invented at these scientific research facilities: they had the need for data first. And then even today, the scientific computing ecosystem almost exists separate from our data stack. Yeah. The qualities of the data are completely different.
Starting point is 00:14:24 Yeah. The other strand was, at some point I got more interested in quantitative social science research, so we published this paper on the network of languages, trying to understand how information flows from person to person via the languages that they know. Specifically, there's nothing stopping us from going to any news site in another language, besides the fact that we might not know that language. We had tons of data at the time about bilingual Twitter users, about Wikipedia editors who edited Wikipedia in more than one language. Mm-hmm.
Starting point is 00:15:08 connectedness and the clusters of different languages. So that wasn't necessarily a problem of big data necessarily. It'll all fit on one person's laptop, but we wouldn't have collected that data. Yeah. If it wasn't today. Yeah. A hundred percent. No, no, no.
Starting point is 00:15:25 That's super interesting. And yeah, I remember at some point, one of the first episodes that we had, we had a guest who worked at CERN. He was taking care of the infrastructure there and writing code in C++ to transfer data there. And it was funny to hear him saying what was his first impression
Starting point is 00:15:46 when after his PhD, he went into the industry and hearing about big data and people saying, okay, we need a whole cluster to transfer this data. And he was like, okay, are you serious?
Starting point is 00:15:58 You can't say that. Yeah, he was like, oh, I mean, he was dealing with petabytes and petabytes of data. I mean, just an unbelievable amount. So he goes to work in insurance and he's like, I mean, this is the kiddie pool. Totally. There's levels to the game, right?
Starting point is 00:16:17 And I'm sure that when he goes down the hall to another person at CERN, they're like, petabytes? We have even more data than that. Yeah. Yeah. Yeah. It's super interesting, like, to see the different perspectives when someone is coming from scientific computing, the point of view that they have and how you solve the problems when
Starting point is 00:16:35 working day to day with a lot of data. Although, okay, we also have to say that the needs are completely different; the environment, the context in which they do the processing, is also very different. So it's not exactly comparable, right? Like you cannot say that the work that Facebook is trying to do with the data that they have is the same type of problem that is solved by highly parallelized algorithms, like trying to solve partial differential equations, for example, right? Like there are very, very different problems and they have different needs,
Starting point is 00:17:04 both in terms of infrastructure and the software and the algorithms that we are using. But yeah, like a hundred percent. I mean, there is a reason, as you said, that like the internet, that the web came out of CERN and like all these technologies, like they're like highly associated like with physics. Okay. Enough with physics.
Starting point is 00:17:21 Let's go back to data observability. So I have a question about the terms we use. And it's very interesting because you talked about this experiment with languages and when you're bilingual and all that stuff. But something similar, I think, is also happening when we introduce new product categories, right? Like, as you said, we stole the term observability from Datadog, which took the term observability from control theory, and who knows what the Dutch guy was doing.
Starting point is 00:17:59 But when we are talking about the terms we're using, you used with Eric the term bug, right, and silent bug. But, okay, in software, when we are talking about bugs, how to say that, it's a very deterministic thing, right? Okay, there are a few bugs that are hard to find, especially in distributed systems and stuff like that, where the behavior is not necessarily deterministic. But broadly, when we're talking about bugs, we are talking about
Starting point is 00:18:28 something very deterministic as a system, right? But with data, my feeling is that when we're talking about bugs on data, it's not exactly that. There's much more vagueness there, and it's not that clear to define what the bug is. And that's why many times I say that maybe it's better to use the term trust, like how much we can trust the data, right? So from a binary relationship, bug or not bug, we go into how much we can trust something. So what's your experience with that? And what's common and what's not common between patterns from software engineering and data and working with data? You're so right that the way that we refer to data as having
Starting point is 00:19:15 bugs is not, is not a one-to-one with software, right? Like a software bug, it's a logical issue that somehow your logic did not produce the outcomes that you'd expected when it encountered the real world. Right. Either the real world was more complicated than you thought, which is the case, or your logic was not sound. Yep. In which case, get someone to review your PRs. Mm-hmm. The engineers on my team will be like, well, Kevin, yeah, the data bugs are interesting because I think the root cause can be equally similar in some cases where, yes, there are logical issues in your DAG. Your DAG extending beyond the warehouse, but from very beginning to very end, right?
Starting point is 00:19:58 It is conceptually a chain of logical operations, but the data could be input wrong, right? It either came from a machine that did not do what you expected or a person entered in the wrong number. So you're right that the scope of a data bug is a little bit larger in that sense. And as a result, what goes into data observability is slightly different than what goes into software observability. In software, you have the notion of traces, right? You have an incident that occurs, but also the traces, the time correlated or the request scoped logs that help you. Okay, where did this begin and where did this end?
Starting point is 00:20:44 And in data, right, that's kind of replaced by the concept of lineage. But the tricky thing is that lineage is never perfect. That is, until Snowflake
Starting point is 00:21:00 starts surfacing it to everyone, and even then Snowflake will not cover it end-to-end, right? You also need a BI tool, and upstream as well. Maybe they'll work with RudderStack to figure it out, but there's always some loss of resolution along the way. So as a result, even if you build all those integrations and build an amazing parser, you're still working with incomplete information, whereas traces in the DevOps world can be extremely exact.
Starting point is 00:21:29 You might not be inferring causality, but at least you have all the metadata that is relevant. Yeah. I mean, okay, like with observability in DevOps, from a product perspective, the problem that you have there is that you need to build an experience that's probably going... There's too much resolution in a way, right? There's just too much data and you need to help
Starting point is 00:21:52 the user navigate all this data to find their root cause, right? So that's the problem that you have trying to design a product experience with that. But when we're talking about data observability, we have vagueness together with probably way too much data at the same time.
Starting point is 00:22:08 Because if you start collecting all the data, you can also have an explosion there. So how do you do that? How do you build an experience that can help people navigate this vagueness and complexity at the same time, to figure out the root cause of the problem at the end, or figure out if they can trust the data or not? Part of this is a very challenging computational problem on the backend, and then another part of it is a UI/UX
Starting point is 00:22:46 problem, which is no less difficult; it may even be more important. So let's take, for example, a table that is delayed, right? It's usually refreshed every 10 minutes and it's been, you know, let's say it's been two hours, and that is unusual even after taking seasonality into account. If we surface this issue to a customer, then we'd be like, okay, that's useful. But almost always the first question is, does this matter? Maybe the table is not being used by anyone. Maybe we don't need to fix it right now. And then the second question is, what is the root cause?
Starting point is 00:23:30 So can I do something about it? Only when all those three pieces fall into place, a real issue has occurred, it has an impact, and I can do something about it, is this going to bubble to the top of your triage list. But to answer your question, what that means is, I mean, it means a few things on the Metaplane side, or any tool that's trying to do this for you. One is building really robust integrations across your data stack. So it needs to be in your BI tool, ingesting all of your dashboards and the components of those dashboards, getting the lineage to a table in as fine a resolution as possible, and making sure that that's up to date and reflecting the latest state of your warehouse and the latest state of your BI tool. Two, it means disambiguating entities correctly.
Starting point is 00:24:17 So if you have a transactional database that's being replicated into your analytical database, right, how do you know that one table refers to the other? If you have a Fivetran sync, how do you know that this Fivetran sync is syncing those two, like entity A to entity B? That's a tough problem. And then the third piece is, I'll call it prioritization, right? One table might have 100 downstream dashboards, right? And how exactly do you want to surface this to your user? Right.
Starting point is 00:24:47 Do you just say the number 100 or do you list all 100? And there's a principle, at least in information visualization, Shneiderman's mantra, from the inventor of the treemap. He's a professor at the University of Maryland, I believe. He always says: overview first, then zoom and filter, and finally details on demand. So the way that we try to do it at Metaplane is giving you as useful of an overview of what happened in an incident, then letting you filter down to what you think is relevant,
Starting point is 00:25:20 and then finally zooming in on the details when you want it. For example, the number of times that one dashboard that depends on this table has been used. Okay. That's super interesting. And you mentioned, okay, you said it's both a UI/UX and a computational problem. Let's talk a little bit more about the computational problem. So what are the challenges there? Like, what needs to happen on the backend, and what are the methodologies and the algorithms that you have to use to track these things and make sure that you surface the right thing to the user at the end?
Starting point is 00:25:41 So what are the challenges there? Like what are the challenges that needs to happen on the backend and like the methodology and the algorithms that you have to use to track these things and make sure that you surface the right thing to the user at the end? One tough problem is anomaly detection. One reason why data observability exists as a category is because it's tough to test your data manually. There are great tools to do that where you say, okay, I expect this value to be above some threshold. And honestly, every company should probably have a tool like that for the most critical tables. However, it becomes quite cumbersome to write code across your entire data
Starting point is 00:26:29 warehouse and then merge a PR every time the data changes, which is why data observability comes in where us and everyone in the category says, okay, you do that for the most important tables, but let our tool handle testing for everything else. One necessary ingredient is some sort of anomaly detection. It could be machine learning-based. It could be more traditional time series analysis where we track this number for you. And of course, we had to take the traditional components
Starting point is 00:27:02 into account. Here's a trend component. Here's a seasonal component. But there are a lot of bespoke aspects to enterprise data. So for example, one, row counts tend to go up, and they tend to go up at the same rate over time. And if you use an off-the-shelf tool, you're just going to be sending false alerts every single time it goes up.
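As a sketch of why an off-the-shelf threshold struggles with a metric that grows steadily, here is a generic time-series check that removes trend and weekly seasonality by differencing against the same weekday, then flags only unusual changes. The numbers are invented, and this is a stand-in for the general idea, not Metaplane's actual anomaly detection.

```python
import statistics

# Hypothetical daily row counts: steady growth of ~120 rows/day with small jitter,
# then a sudden drop on the last day.
daily_row_counts = [10_000 + 120 * i + 7 * (i % 3) for i in range(19)] + [6_000]

SEASON = 7  # compare each day with the same weekday one week earlier

def seasonal_diffs(series, season=SEASON):
    return [series[i] - series[i - season] for i in range(season, len(series))]

diffs = seasonal_diffs(daily_row_counts)
baseline, latest = diffs[:-1], diffs[-1]
mu = statistics.mean(baseline)
sigma = statistics.stdev(baseline) or 1e-9

z = (latest - mu) / sigma
if abs(z) > 3:
    print(f"Anomaly: week-over-week change of {latest} rows is {z:.1f} sigma from normal growth.")
else:
    print("Growth looks normal; no alert.")
```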
Starting point is 00:27:21 But two, your data is particular, right? And your company is a little bit different. So there's a lot of work that goes into anomaly detection, because if you cry wolf too many times, people are just going to turn you off. Yeah. The other component is log ingestion, where, let's say you're using Snowflake, you get 365 days of query history; a tool like Metaplane will be ingesting all that query history and then parsing it for both usage,
Starting point is 00:27:54 so understanding how tables and columns are being used, but also lineage. Mm-hmm. So, what does this query depend on, and what does it transform those dependencies into? And this is a notoriously difficult problem. I think no one has figured it out with 100% coverage and 100% accuracy across all data warehouses, except for the data warehouse vendors themselves. Yeah, why do you say that that problem is notoriously hard? What makes it so hard? You have all the queries that have been executed over the past 365 days. What's the difficult part in using that to derive the lineage?
Starting point is 00:28:17 It's a combination of differing SQL dialects from warehouse to warehouse. Things are starting to get standardized, right? But the parser that you write for Snowflake is different from the one that you might write for Redshift.
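For a sense of what "an amazing parser" has to do, here is a small sketch using sqlglot, an open-source SQL parser that understands multiple warehouse dialects. Whether Metaplane uses sqlglot or anything like it is purely an assumption for illustration, and this only extracts table-level references; the column-level ambiguity described next is where it gets hard.

```python
import sqlglot
from sqlglot import exp

query = """
CREATE TABLE analytics.daily_revenue AS
SELECT o.order_date, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
GROUP BY o.order_date
"""

def referenced_tables(sql: str, dialect: str = "snowflake") -> set[str]:
    """Return every schema.table mentioned in a single statement."""
    tree = sqlglot.parse_one(sql, read=dialect)
    names = set()
    for table in tree.find_all(exp.Table):
        schema = table.db  # empty string when the query doesn't qualify the table
        names.add(f"{schema}.{table.name}" if schema else table.name)
    return names

# Parsing the same statement with different dialect rules.
print(referenced_tables(query, dialect="snowflake"))
print(referenced_tables(query, dialect="redshift"))
# {'analytics.daily_revenue', 'raw.orders', 'raw.customers'} for both in this simple case.
```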
Starting point is 00:28:55 And secondly, there's often a lot of ambiguity within the data warehouse, right? Which tables are being used within this query? That's a relatively easy problem, but then what columns are being used from those tables? And tables might have very overlapping or duplicate column names. And you might say, okay, well, the compiler is able to figure it out, SQL is a well-defined language, right?
Starting point is 00:29:26 Snowflake is able to turn this SQL into columns and tables that are being used, but they have access to the metadata and they have access to their runtime. Yeah, yeah, yeah. Absolutely, absolutely. So you think that this could be easier to handle if more metadata were exposed
Starting point is 00:29:44 by the database system at the end. Right? If the information that was exposed through Snowflake, for example, was richer, that would help a lot to figure these things out. So it's more about exposing more of the internals of the database system, at the end, that is needed there. That's interesting. That's very interesting. All right. Okay. Now, anomaly detection, though. What are you doing in your product?
Starting point is 00:30:11 Like, around anomaly detection right now, do you have some kind of functionality around that, and how does it work? Yeah. One quick note on the data warehouses releasing their internal lineage: I know that Snowflake is starting to do this. It may only be available to enterprise customers right now. Oh, okay. But the moment they do that, one whole category of tools will have a
Starting point is 00:30:29 much harder time, the data lineage tools, and everyone else will be exponentially more powerful. If we had access to that for all of our Snowflake customers, which is basically almost all of our customers, it'd be insane the amount of workflows we could unlock. Okay. That's interesting, actually. So it's going to be a problem for the lineage companies and the products out there, obviously,
Starting point is 00:30:54 because the product is going, like the functionality is going to be provided, let's say, by Snowflake. But at the same time, this is going to make things much more interesting for you. But is there a reason? I mean, why is this going to happen? Outside of having access to the metadata, to the additional metadata, is there something else that's going to make it more interesting because all your customers are on Snowflake or it doesn't matter?
Starting point is 00:31:23 I think it's primarily being able to rely on their lineage over our lineage. Part of it is that they're much more correct and up to date and have higher coverage than we do. Mm-hmm. Yeah, but on the other hand, that's only the lineage that lives as part of Snowflake, right? Like, what happens before and after that?
Starting point is 00:31:42 So let's say you have, I don't know, let's say you have Spark doing some stuff on your S3 to prepare the data, and then you load this data into Snowflake, which I think is pretty common in many use cases. So even if Snowflake does that, how can you see outside of Snowflake, especially before the data gets ingested into Snowflake? Totally. Yeah. They don't have the full picture, which is why data observability tools come in
Starting point is 00:32:16 and kind of augment, right, say, okay, the lineage within the warehouse might be a very key part of the picture. But it's not all of it, right? It's not the downstream impact. It's not the upstream root cause. Yeah. Which is how the two play together a little bit. Yeah, it makes sense.
Starting point is 00:32:36 Makes sense. Okay. So back to anomaly detection. What do we get from you today in terms of anomaly detection? Like, what's happening? Like, what can I use out of the box? So out of the box right now, if you go to metaplane.dev, you can sign up. Sign up through email or G Suite and connect your warehouse, your transformation
Starting point is 00:32:59 tool, your BI tool. Typically, people can do this within 15 minutes. We've had highly motivated users do it within five, which is insane because I can't even do it within five. But I guess when you want it, you really are motivated to do it. And off the bat, we cover your warehouse with tests based on information schema metadata. So for Snowflake, right, row counts and schema and freshness kind of come for free across your warehouse. You can go a little bit deeper with out-of-the-box tests, like testing
Starting point is 00:33:33 uniqueness, nullness, the distribution of numeric columns, or you can write custom SQL tests. For all of these tests, our customers usually blanket their database and have hundreds of tests on top of those within like 30 minutes. Then you just let it sit, because we have the anomaly detection kind of running for you in the background as we collect this historical training set. And depending on how frequently your data changes, it can be between one day and five days until you start getting alerts on that data.
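Here is a rough sketch of what a metadata-based freshness and volume check can look like against Snowflake, where INFORMATION_SCHEMA.TABLES already exposes ROW_COUNT and LAST_ALTERED so no table data needs to be scanned. Connection details, thresholds, and expected row counts are placeholders; this is not Metaplane's code.

```python
from datetime import datetime, timezone
import snowflake.connector

METADATA_SQL = """
SELECT table_schema, table_name, row_count, last_altered
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
"""

MAX_STALENESS_HOURS = 24                           # assumed freshness threshold
MIN_EXPECTED_ROWS = {"ANALYTICS.ORDERS": 50_000}   # hypothetical expectations

conn = snowflake.connector.connect(
    account="my_account", user="monitor", password="REPLACE_ME",
    warehouse="MONITORING_WH", database="ANALYTICS_DB",
)

with conn.cursor() as cur:
    cur.execute(METADATA_SQL)
    now = datetime.now(timezone.utc)
    for schema, table, row_count, last_altered in cur:
        full_name = f"{schema}.{table}"
        age_hours = (now - last_altered).total_seconds() / 3600
        if age_hours > MAX_STALENESS_HOURS:
            print(f"Freshness alert: {full_name} last altered {age_hours:.0f}h ago")
        expected = MIN_EXPECTED_ROWS.get(full_name)
        if expected is not None and row_count < expected:
            print(f"Volume alert: {full_name} has {row_count} rows, expected at least {expected}")
```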
Starting point is 00:34:13 Okay, all right. So it's like between one and five days. That's neat. And the deployments that you have so far, right, because we are talking about data observability, the conversation that we have is focusing a little bit more, that's how I feel at least, on the data warehouse. So would you say that what Metaplane is doing today is more observability of the data warehouse, or do you provide, let's say, observability across the whole data stack that the company might have?
Starting point is 00:34:51 Let's say I have streaming data and I have a Kafka somewhere. And then I also have a couple of other databases. And then I might also have a Teradata instance somewhere running. What kind of coverage would you say that Metaplane provides today? We are focused on the warehouse and its next-door neighbors right now. Part of that is a strategic move as a company, right? Like, we want to start from the place with the highest concentration.
Starting point is 00:35:17 And Snowflake is getting tons of market share, as is Redshift, as is BigQuery, so we don't have to build a whole slew of integrations. Those three cover a lot of the market today. And most of our customers use one of those three. We have the downstream BI integrations, so Looker, Tableau, Mode, Sigma, kind of go down the list, Metabase, we support,
Starting point is 00:35:41 as well as the transactional databases like MySQL and Postgres, and increasingly many OLAP databases like ClickHouse. That's where we stop. And honestly, that's where everyone in our category stops today. I'm not very happy with that, because this is just level one of monitoring. When you check out an observability tool in two years or in five years, it's going to be completely different. It's going to be much like the picture that you described, Costas, where it's fully end to end.
Starting point is 00:36:15 That's, I think that is not only important, but really critical because data is ultimately not produced from your data warehouse, right? Snowflake does not sell you data. It sells you a container into which you can put your data, but that data is being produced by product teams, engineering teams, go to the market teams, and they're being consumed by those teams too. So when we talk about data trust, which you mentioned before, which I think is a much better category name than data observability, because what is that? That trust is ultimately
Starting point is 00:36:52 in the hands of the people who consume and produce the data. That's where we as a category have to go. That's interesting. Okay. So what's your experience so far with the other, let's say, big container of data, which is data lakes, right? So we have the data warehouses, a much more structured environment there, but we also have data lakes. Okay, Databricks is dominating there. Completely different environment when it comes to interacting with data. And okay, I mean, there's also this new thing now with the lakehouse, where you also have SQL interfaces there. But what have you seen so far with data lakes and observability there? Because that's also a big part, right, of working with data, especially with big amounts of data. And
Starting point is 00:37:42 in many cases, let's say, a lot of the work is happening before the data is loaded into something like Snowflake; it has to go through a data lake, right? So is Metaplane doing something with them today? Plans to do something in the future? And what do you think is the role that data lakes will have in the future?
Starting point is 00:38:04 Honestly, we don't come across data lakes too often. Part of it is where we're focused in the market. If you're, for example, at a company with fewer than 5,000 people, Metaplane is probably the right choice for you as the data observability tool. It has the time to value, the time to implement, the focus on the workflows.
Starting point is 00:38:26 And if you're above 5,000, there are other options on the market and you might be in a position to build it in-house too. We found, maybe this is incorrect, that Databricks is much more highly concentrated at the enterprise. And when we come across a company that uses Databricks, frequently they're also using Snowflake or another data warehouse, and they're using Spark for pre-Snowflake transformation. Yeah, yeah, yeah. A hundred percent. Oh, that's interesting. But you don't see the need right now for Metaplane to move into
Starting point is 00:39:12 observability for these environments, right? And the reason I'm asking is because technically it's something very different. And I'd love to hear what are the challenges there? What are the differences? And learn a little bit more about that. That's why I'm insisting on these questions around the data lakes and the Spark ecosystem. There are some big challenges. I mean, there are some engineering challenges, like having to rewrite all of our SQL queries into Spark queries. And having it run not necessarily on a table, but on a data frame.
Starting point is 00:39:47 And there are also differences in terms of the metadata that's available to you, where data warehouse metadata, we found, is quite rich in comparison with the metadata that you might have within a data lake, where you might have the number of rows, or you might not, right?
Starting point is 00:40:06 You might have to run a table scan for that, or continuously monitor the queries to keep a log of the number of rows. Even to get the schema, you might have to do a read. It's, in general, much harder to have the level of visibility into a data lake that you have in a warehouse. Yeah, a hundred percent.
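As a small illustration of the gap being described: in a warehouse, row counts and schemas come from a metadata query, while on a lake of Parquet files even basic profiling has to touch the files themselves. The path and session setup below are assumptions, just to show the shape of it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-profiling-sketch").getOrCreate()

# Hypothetical path to a table stored as Parquet on object storage.
df = spark.read.parquet("s3://my-bucket/warehouse/orders/")

schema = df.schema      # inferred by reading Parquet file footers from storage
row_count = df.count()  # runs a Spark job over the files; there is no catalog row count to read

print(f"{len(schema.fields)} columns, {row_count} rows")
```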
Starting point is 00:40:28 I mean, the query engine makes a huge difference there when you have to interact with that stuff. All right, cool. So Snowflake, Metaplane, your experience so far, because, I mean, you mentioned BigQuery, Snowflake, and Redshift. And from what I understand, there's probably a big part of your customer base on Snowflake. What's your experience like with these three platforms so far? Give us your pros and cons of each one of them. There are pros and cons of each.
Starting point is 00:40:55 For sure. Snowflake has the richest metadata in terms of the freshness and the row counts of different tables. BigQuery also has that metadata. However, to use Metaplane, our customers either tack us onto an existing warehouse or they provision a warehouse specifically for Metaplane. And this is nice because you can separate out the compute and keep track of the internal spend that is incurred through this monitoring. But at the same time, we necessarily impose a cost, whereas some users who have a Redshift cluster that's not at full capacity can tack on Metaplane at no visible financial cost to themselves.
Starting point is 00:41:43 That makes sense. Yeah. I think that's, okay, the trade-off between having the elasticity of the serverless model that BigQuery has, compared to, you know, paying for a cluster that, yeah, obviously can be underutilized, and when it's underutilized, you can put more stuff there without paying more. Right. But yeah, it's the tradeoff that every infrastructure team has to face
Starting point is 00:42:08 at some point with hard decisions. Right. But like from, let's say, in terms of what is supported, like do you, is like Metaplane like the same experience across all the three different platforms or like you have like more functionality towards one or the other because of what they expose? It's the same experience across all three. Okay.
Starting point is 00:42:31 And how much of a concern is the cost at the end? I mean, the additional cost that is incurred by a platform like Metaplane that continuously monitors the data on the data warehouse. It's surprisingly much less than people might expect, as we're using the information schema as much as possible, and the existing metadata. So the tests that rely on your metadata, right, we can read that within seconds at the top of the hour, or whatever frequency you set.
Starting point is 00:43:05 And it turns out to be a pretty negligible amount of overhead compared to spend that you might have from other processes running on your data warehouse, like measured in single digit percentage points. Some customers have longer running queries for much larger tables or more sophisticated monitoring, but typically that step is taken more deliberately so that the cost is more justified. So there are like, let's say there are just cases where like people are, okay, you have, let's say a continuous monitoring where you establish, let's say your, how to say that, like the monitors and they run every, I don't know, one hour, 10 minutes, one minute, whatever.
Starting point is 00:43:47 But do you also see ad hoc monitoring that users do? Like, do they use the tool not just for monitoring, but also to debug problems with the data? Totally. That is the next step after the monitoring, after the flag kind of goes off: now, well, one, you know that an incident occurred, but two, you have this historical record of what the data should be and how it has been over time. It's a little bit like debugging once you have a product analytics tool. Yeah.
Starting point is 00:44:21 Yeah. If you did not have a product analytics tool, you wouldn't necessarily know what the latency has been over time, what all the dependencies are, what has happened in a user's journey. And it's very similar with Metaplane, where in addition to the core incident management workflow, there's another component,
Starting point is 00:44:42 which is trust and awareness in data. Teams that bring on Metaplane, at first it's often because, you know, stuff has hit the fan and they're like, okay, now we need to get ahead of it next time around. But right after implementing Metaplane, it could be within a few minutes, and you see how queries are being used across the warehouse, how the lineage looks from within your data stack. It's like, wow, how did I live without this? Yeah. Yeah.
Starting point is 00:45:13 Familiar quote. Okay. Okay. Okay. Take us by the hand now and like, give us like an example. Like, let's say we have an incident, right? Like a monitor goes off and it's like, oh, something is wrong with this table. Okay.
Starting point is 00:45:28 And from things that you have experienced, like a common example, like describe to us like the journey that the user goes through metaplane from that moment until they can resolve the problem. And I'd love to hear like what happens inside Metaplane for that and what outside, right? Like how these two like work together for the user like to figure out and solve the problem. So today, Metaplane is like, let's say you have like a home
Starting point is 00:46:02 security system. It is the alarm and it is the video. It does not call the police for you. And it does not do the tracking for you. So in Metaplane, we will send you a Slack alert or maybe a PagerDuty alert saying, this value, we expected it to be 5 million. It fluctuates a little bit, but now it's at 1 million. These are the downstream BI reports. This dashboard was last viewed today this many times by these people.
Starting point is 00:46:40 And here are the upstream dependencies. So here are all the dbt models that go into this model. And what you can do from there is click into the application and kind of see the overall impact of this view, and assess, okay, what are the immediate upstream root causes? And then two, you can give feedback to our models: if this is actually an anomaly and you want to continue to be alerted on cases like this, you mark it and we'll kind of exclude it from our models. If it was actually normal, because at the end of the day data does change and no anomaly detection tool is a hundred percent accurate, yep, you click on it and say, okay, this is actually a normal occurrence,
Starting point is 00:47:25 do not continue to alert me on this. Frequently, when you have an alert, our customers start a whole conversation around that alert, looping in other members of their team, creating Jira or Linear tickets to address this issue. But where we stop is the actual incident resolution. That's where we want to go in the future. But today, it kind of stops there. Yeah, makes sense. And that's my last question, then I'll give it to Eric.
Starting point is 00:47:56 Give us something from your experience, because obviously you've been exposed to many different users out there and issues. So what's one of the most common reasons that data goes bad? I like how you said that there are many issues, because that's what we've observed too. It's like the whole, you know, Tolstoy quote: all happy families are alike; all unhappy families are unhappy in their own way. The same thing is true for data, right? There are so many reasons why data can go wrong.
Starting point is 00:48:29 It goes back to what we were saying: you know, either someone put it in wrong, a machine did something wrong, or there is some logic that's applied incorrectly. But that said, across all of our customers, delays or freshness errors are probably the most common issue. Second is probably a schema change, whether it's within the data warehouse or upstream. And the third is a volume change, where the amount of data that's being loaded or that exists is higher or lower than you expect. It's a whole long tail from there. And all of that is kind of correlated with the causes of data quality issues. This depends on the team, right?
Starting point is 00:49:16 If it's a one-person team, you do not have many data engineers or analytics engineers stepping on each other with code, right? And there might be many more third-party dependencies that cause issues. If you're on a larger team, perhaps shipping bugs, like actual software bugs, not data bugs, might be more frequent. Awesome. Eric, all yours. I monopolized the conversation, but now you can ask all your really, really hard questions.
Starting point is 00:49:47 It was fascinating. Okay. So I want to, let's dig into Tolstoy a bit more, because that quote is an amazing quote. I think it's called, isn't it a principle, like the Anna Karenina principle or something? That's exactly what it is. Yeah. Okay. So this is the reason I want to dig into that a little bit more.
Starting point is 00:50:08 You've mentioned the word trust a lot through our conversation. And in fact, that's been a recurring theme on the show, you know, sort of through a bunch of different iterations. I would even say from the very beginning, Costas, just one of the themes that comes up consistently. So what's interesting though, is if we think about some of the examples we've talked about, you know, you have the executive stakeholder who's, you know, refreshing a looker report and something's wrong, or the salesperson,
Starting point is 00:50:35 you know, doesn't necessarily know exactly why, but they know the revenue number's off or whatever. And so what's interesting is that those examples kind of represent a one-dimensional trust almost, right? Which is, things don't go wrong, right? Like, I trust you if nothing ever goes wrong. Which, you know, in the real world, that sort of one-dimensional trust isn't really a great foundation for relationships. So, like, you know, it's just kind of like the Anna Karenina principle, which I know I'm sort of stretching a little bit. So thank you for humoring me. But like, it's interesting, right?
Starting point is 00:51:18 Like, if the reports aren't broken, then everyone's happy, right? Like, things are good. What are the other dimensions of trust, A, that you've seen, or B, that you are trying to impact with Metaplane or the way that you think about, you know, data quality and lineage and those sorts of things? I love how you brought it back to trust because that is simultaneously a very simple problem. I mean, you could state it simply, but also extremely complex, like you're alluding to, where you could define trust, not necessarily that something's going wrong, but that there's some contract between two parties that is violated in some way. And if the
Starting point is 00:51:58 contract is not explicit, then the two parties will always have implicit contracts. And unfortunately, in the data world, the implicit expectation of a data consumer is frequently that the data is just not wrong. It's exactly what you're saying. The data is wrong. What am I paying you for? Why are we paying Snowflake so much money if the data is wrong?
Starting point is 00:52:20 But as we're alluding to, that is not a reasonable expectation across the board. A reasonable expectation from a data consumer might be, I am aware that data is not perfect, that it will never be perfect, the same way that you will never have software without bugs in the code. So how can you expect that to be true for data? But I think part of it is establishing these contracts and these expectations up front with both the data consumers as well as the data producers, and saying, okay, this is what you
Starting point is 00:52:55 can expect from the data, how it will trend over time, and how I will try my best as a team to make sure that it meets the demands of this particular use case. I think that's a shift that I would love to see in the data world: instead of talking about data being perfect or ideal, talking about it being sufficient for the use case at hand. Where, if this dashboard is being used every hour, right, do we really need real-time streaming data?
Starting point is 00:53:30 Right. If this is informing more of a directional decision, as opposed to being sent to your customer, does the data have to be completely correct? Is an error enough to shatter your trust in it over time? So I think really reverse engineering from the outcome and the people who are using the
Starting point is 00:53:49 data is the most clarifying approach that we found to think about data quality and data trust over time. Super interesting. Okay, let's dig into that just a little bit more, just because I'm thinking about our listeners who, you know, and even myself, you know, we deal with these types of things every day. So I love what you said, but my guess would be that there are a lot of people out there who, well, let me put it this way. If you have an explicit contract that requires mutual understanding, right? And even mutual
Starting point is 00:54:28 agreement on, let's say it's a real estate contract, right? Like there's mutual agreement on say default and other things, right? Which both parties need to have a good understanding of for expectations to be set well, right? So if we carry that analogy over to an explicit contract between a data consumer and say, like the person who's building the data product, you know, in whatever form that takes, one of the challenges I think probably a lot of our listeners have faced is that if you try to make that contract explicit, the consumer oftentimes can just say, you know what? I don't actually really care about these definitions that we're trying to agree on. And sometimes maybe there's some malcontent there, but a lot of times it's like,
Starting point is 00:55:18 look, I'm busy. We're all busy. And I would love to like understand like your pipeline infrastructure and data drift issues and whatever. Can you speak to how you've seen that dynamic play out? I mean, I think in some ways that's getting better as data becomes more valued across the organization, but I think in a lot of places there can still be a struggle to actually make an explicit contract, like a practical reality and a collaborative process inside of a company. You're right. It is an idealistic process. However, I do think the conversation is important, not just to talk about expectations of the data, but really just to understand what exactly do the users of data want, right? And, you know, members of data teams are, it's a tough job, right? Because
Starting point is 00:56:06 a classic example is, okay, someone asks you for a dashboard, but do they really want a dashboard? Do they really want this number to be continuously updating over time and to have a relatively fixed set of questions that can be, you know, varied a little bit, but not be super flexible? Or do they want data activation, to push it into Salesforce? Or do they just want a number, like, right now, and it doesn't have to be changing over time? Or do they want a data application that is maybe more involved, but is much more flexible and has
Starting point is 00:56:46 both inputs and outputs, right? I think that is the importance of having a conversation about expectations with your users, your stakeholders. You know, there are some downsides and it takes a lot of time, but I think once the consumers of your data feel like you really understand where they're coming from, that is a foundation from which you can build trust. Right. It's like, okay, they kind of get what I'm asking for. And in reverse, I know the amount of work that goes into producing data products. Okay, now the trust is much less brittle, and maybe you don't need that explicit contract, but what you develop implicitly, you know, is an implicit contract: yeah, I know, okay, even when it's broken, I can still trust it, because there's a human on the other end of it.
Starting point is 00:57:40 yeah if only there were software that could solve the problem of time compression and mutual understanding and the investment that it takes to build that between two humans. We talked before this call about all of the SaaS products that exist, but I really think tools are just tools, right? They exist because people use them to do processes more effectively and more consistently over time. If a tool doesn't result in something actually changing in terms of people's behavior, you know, and this is a tool that actually is being used by people, not machines, then is it really that important? Yeah, totally. Okay, well, we're close to the buzzer here. I want to end by asking you an admittedly unfair question, but that I think will be really helpful for our listeners and for me.
Starting point is 00:58:36 And I'll start with the unfairness. So none of the answers to this question can relate to Metaplane or data lineage or data, you know, quality tooling at all. Okay. So outside of, you know, what you're sort of trying to build, you know, with your life and your team, if you could give one piece of advice to our listeners out there who are working in data in terms of building data trust, even maybe like one practical thing they could do this week before the week is over, what's the one thing that you
Starting point is 00:59:12 would tell them to do? Like, if you could only do one thing to sort of improve trust, what would that one thing be, outside of all the, you know, data lineage? So sorry for the unfair question. No, no. Well, truth be told, at the end of the day, data lineage, data observability, it's just a technology, right? It is one technology that can be used to solve a much broader problem that can't be solved by one tool or even ten tools. I would say to conduct some user interviews. If you had a week or two weeks, have one-on-ones with every person
Starting point is 00:59:47 at the company who could be using your data or is not using the data as much as you would like, or in the ways that you would want, and sit down and really approach them as if you're like a founder building a product for a customer. What do you really want here? What problem are you trying to solve? How will you know that you've solved that problem? And how can I improve the product that I'm developing for you? That, I think, is a process that we've seen our customers, especially the ones who are very, very high performing data teams, do over time. And that really starts you from this position of the trust is yours and it's yours to lose,
Starting point is 01:00:31 as opposed to you start from, you have to build it up over time. Super helpful. All righty. Well, thanks for giving us a couple of extra minutes for me to ask you an unfair question. This has been such a great conversation and best of luck with Metaplane. It sounds like an awesome tool and it sounds like you're doing great stuff. Thanks, Eric.
Starting point is 01:00:49 Thanks, Costas. This has been an amazing conversation and thanks for having me on. I'm such a fan. Absolutely. Well, Costas, of course, I have to bring up Tilapia and the fact that you can drop a Tilapia into a tank and they all start
Starting point is 01:01:06 to behave the same, you know, which is interesting, which actually is pretty similar to VCs with new data technology. It's like you drop a new data technology and all the VCs start to behave the exact same, you know, which is really interesting. So that was one takeaway. Do you think we should rename FOMO into the Tilapia Effect or something? VC FOMO, the Tilapia Effect.
Starting point is 01:01:35 I love it. So that was one thing. On a more serious note, I thought the discussion around implicit and explicit contracts was really helpful. You know, I think we talk about the way that data professionals interact with other teams, the way that tooling sort of facilitates those interactions, et cetera. And it was helpful for me, even in my own day-to-day work, to really just think about what implicit contracts do I have with other people in the organization, right?
Starting point is 01:02:02 Whether they be consumers of data that I produce, you know, maybe for my boss or, you know, for the data that I consume from other data producers. So that was really helpful for me. Yeah, a hundred percent. I think that's like a big part of building organizations. And I am pretty sure that you have experienced that by like building companies from scratch and like scaling a company or a team, like big part of it is actually figuring out all these contracts and make them more explicit. Like when we say
Starting point is 01:02:32 like we need the process to make things scale, that's pretty much what we are talking about, right? Like when you're alone and you're running the whole growth function on your own, yeah, you have plenty of contracts with yourself, right? And then you've got another person, and then another person, and suddenly the contract is not exactly the same, right? And that's where friction starts. And I think one of the first steps that you have to take when you're trying to scale an organization is actually doing that. And that's human nature. It's something that we see with data, something that we see with software, something that we see with everything. So yeah, a hundred percent. I think that was like
Starting point is 01:03:14 an extremely interesting part of the conversation that we had, outside of all the rest that we talked about, like the technologies, where observability goes, and how they all work together. But that was actually my other very interesting point: how related these products are to some foundational products like the data warehouse, for example, and what the data warehouse exposes, the metadata there, and how this can be used to deliver even more value in observability and all these things. So yeah, always interesting to chat with Kevin
Starting point is 01:03:49 and hope to have him back really soon. Agree. All right. Well, thank you for listening. And if you like the show, why don't you tell a friend or a colleague about it? We would love for you to share the episodes that you like the most with people you care about.
Starting point is 01:04:04 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow dot com. The show is brought to you by RudderStack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
