The Data Stack Show - 85: You Can Stop Doing Data Fire Drills with Barr Moses of Monte Carlo

Episode Date: May 4, 2022

Highlights from this week’s conversation include:
Barr’s background and career journey (2:12)
Trust: a technical or human problem? (9:47)
Behind the name “Monte Carlo” (15:41)
Defining data accuracy and reliability (17:36)
How much can be done with standardization (22:27)
How to avoid frustration when generating data about data (25:49)
Defining “resolution” (28:59)
Understanding the concept of SLAs (33:25)
Building a company for a category that doesn’t exist yet (37:40)
What it looks like to use Monte Carlo (44:07)
The best part about working with data teams (47:28)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Today, we are talking with Barr from Monte Carlo. She's one of the co-founders and CEO of the company, and they are in the data observability space. And Costas, one of my questions that I hope we
Starting point is 00:00:39 have time to get to is sort of the practical nature of what it takes to set up data observability in your company. I think about, let's say you inherit a piece of software that you need to go back and write a bunch of tests for. I'm not an engineer, but I've been close enough to that to know that that's like, no one really wants that job. So I want to know how hard is it to actually do this? Because we all have messy stacks, you know, because we're trying to sort of build these things out as we go. So that's my question. How about you? I think I'll spend some time with you on definitions, like trying to define like better what data quality is, what data reliability is, what data observability is. I mean,
Starting point is 00:01:23 we're using all these terms and we, in many cases, we take the definitions for granted because they are mainly used metaphorically coming like from other domains. It's a lot to always like exactly the same, right? Like it doesn't mean that's an SLA for server availability is the same thing as for data availability, right? So I think she's the right person to have these conversations and try to understand a little bit better what all these terms mean. All right, well, let's dig in and get some definitions. Let's do it.
Starting point is 00:01:55 Bar, welcome to the DataSec Show. We have wanted to talk to you actually for a long time. So what a treat to finally have you here with us. It's great to be here. Thanks for having me. Okay, so give us your brief background So what a treat to finally have you here with us. It's great to be here. Thanks for having me. Okay. So give us your brief background and kind of what led you to starting Monte Carlo. Yeah.
Starting point is 00:02:11 So let's see. I was born and raised in Israel. I moved to California about 12 years ago. Worked with data teams throughout my career. Most recently was at a company called Gainsight, where we work with organizations to help them make sense of their customer data and basically improve their customer success and really create the customer success category. A large part of that was actually making that data-driven. That was a huge shift in the category. Before that, the world of customer success was really
Starting point is 00:02:41 built on pretty fluffy customer relationships, actually, kind of like you buy from me, I buy from you. And the introduction of subscriptions and recurring revenue actually forced organizations to think through how do we make customers successful every single day. And oftentimes that actually requires data to do that well. And also, you know, this was around like, you know, the middle of the last decade when just it was easier to actually like ingest data and process data and analyze it. And so really sort of trying to become a data-driven organization was something that more and more companies were
Starting point is 00:03:14 doing. And as I work with these organizations, those VP operations, I noticed that, you know, you sort of look at a company, there's like someone there that makes a decision, like, let's get data driven, whatever that means. And they're like, let's just like hire lots of data people and pour a lot of money into this and, and, you know, just, just really get data driven for a second. And honestly, like that, those initiatives often fail. And, and I've found that the number one reason that those fails is, is because people actually don't trust the data. And so people, you know, might look to making a decision about the data or actually using that data, surfacing it to customers, using it in production. And when the data is wrong, people are like, well, you know, why don't we just like resort to gut-based decision making? This isn't really working for us.
Starting point is 00:04:00 And it's incredibly hard for data teams to actually know that their data is accurate and reliable. So we actually started Monte Carlo, you know, with the goal of sort of the mission of accelerating the world's adoption of data by eliminating what we call data downtime, which is basically periods of time when data is wrong or inaccurate. And a lot of the concepts that we use are actually concepts that have worked well for engineers. So we're not really sort of, you know, reinventing the wheel in any way. We're actually taking concepts that work well in other spaces and bringing them over. So we started the company about three years ago. It's been incredible to see the data observability sort of category really accelerate. You know, it's been, you know, I feel incredibly fortunate to be sort of at the forefront of this and work with amazing customers who are pioneering this. We have folks like Vimeo and Affirm and Fox and CNN and Auth0 and really, really strong
Starting point is 00:04:52 sort of data teams who are actually adopting sort of data observability as a best practice and a real part of the modern data stack. Amazing. Okay. I have so many questions that I know Costas does too, but of course, we do a little bit of LinkedIn stalking. And one of my favorite questions to ask is how previous experiences sort of influence the way that you approach what you're doing today, especially as a founder. So I noticed that you were involved in the military as a commander, which is fascinating.
Starting point is 00:05:20 And so I just love to know, are there any parts of that experience that have influenced the way that you think about starting a company, running a company, and even solving the data observability problem? For sure. So in Israel, a military service is mandatory. And so everyone is drafted at age 18. Women typically do two years and men do three years. I actually originally wanted to be a pilot, but I was not accepted. And so as a result, I actually ended up being an intelligence unit as part of the Israeli Air Force. So basically working on data intelligence that's related to operations as part of the Air Force units. Now, you know, I was quite young at the time, joined, you know, 18 years old and, you know, was promoted pretty quickly to be a commander, which meant that I was
Starting point is 00:06:11 responsible for many other 18-year-old kids. And, you know, you're joining the military without professional training, right? So you just finished high school and especially without sort of domain expertise, right? And so you actually like, you don't have a college degree and you don't have sort of further degrees. And so you really have to learn on the job. There's no more on the job than that, right? And the second is like, you know, it's also a lot of responsibility at a very young age. So I learned a ton, you know, it's definitely like a reality check, right? You have a lot of responsibility. And as a commander, you're responsible for, you know, your soldiers on different levels. You're responsible for their professional expertise, right? Making sure that they are the best at
Starting point is 00:06:54 their job. You're also responsible for their physical well-being. And finally, you're also responsible for their mental well-being, right? For making sure that they are driven and motivated and excited about what they're doing, right? And that there's camaraderie. And so at a very young age, I was sort of thrown into this situation where, you know, I really needed to create a cohesive team of folks with very little experience to do work that was very impactful and important and to do it in a way that people can also thrive in it. And that was hard. Wow. So, you know, I think I obviously learned a lot from that experience. I learned a lot about like what people care and how to care about people, but also how to motivate them. But also
Starting point is 00:07:39 honestly, like about the power of bringing people who are really aligned on a mission. And even if you don't have all the experience in the world and, world, and even if like you don't, you're not the world's greatest expert on this particular topic, you actually can learn on the job and you actually can make a big impact. And I think that gave me confidence later on, both to take on things that I necessarily wasn't necessarily the greatest expert on and diving into something and, and, you know, being confident that we can bring something to the table and also making bets on other people who are, you know, perhaps earlier in their career and, and helping set them up for success and making them shine. And that's one of the things that are most important to me in building
Starting point is 00:08:18 Monte Carlo is that, you know, we were really proud of the journey that we're on and that we make, we make Monte Carlo sort of a company that of a company that can be life-changing for people. That is incredible. Thank you so much for sharing. And you are building a great company. And so those lessons that you learned are clearly evident in the company and the team. Okay, I know I want Costas to ask a bunch of questions here, but I'll kick it off. Let's talk about trust.
Starting point is 00:08:43 So you mentioned trust. And we actually were talking with someone, a data engineer from a big mobile consumer company. And we asked, what is the hardest problem that you've ever had to solve as a data engineer? And it was fascinating because he paused for a second. He kind of stared off camera and you could tell he was thinking really hard. And his answer was trust. He said, that's the hardest problem that I face every day. So my question for you, and would love to just hear your thinking on this, even philosophically as you're building Monte Carlo, how much of that is a technical problem and how much of that is a human problem, right? Because there are certainly technical aspects to it, to your point, like adopting principles from software engineering, there's testing,
Starting point is 00:09:34 there's all sorts of stuff, right? But trust is a very visceral human experience, right? So I'd just love to hear how you think about that. Oh, man, that's a great question. So first of all, I love that you asked that because that's actually what I've experienced before starting MoneyCrawl. So to save the context a little bit for how we started the company, I actually started three companies in parallel when I started MoneyCrawl to explore what does the product market fit look like or doesn't look like. And so I actually read this book. It's called The Mom Test. It's a pretty bad title, but the book is quite good. I don't know if you had a chance to read it. I haven't read it, but I've heard of it. And
Starting point is 00:10:14 the title is like so hilarious. What? And basically it gives good guidelines to help think about how to have conversations with customers early on. And the idea is basically there are people who, if you would share an idea with them, they would give you positive feedback on that idea no matter what. And you have to find the people who would give you real feedback. Those are often folks who actually don't care about you or what you're doing. Yeah. And that's contrary to most ways in which startups get started today, which is like going to your network.
Starting point is 00:10:44 And so actually what I did is reached out to like hundreds of people who knew nothing about me, who owed me nothing, who were data engineers, perhaps like the person that you spoke with. And I just asked them like, what's keeping you up at night? Like, what's your biggest nightmare? What's like, what's some like shit
Starting point is 00:10:58 that's like annoying you these days? And their reaction was so visceral to this thing around data trust. People were like, I wake up sweating at night because there's a report that my CMO is going to look at tomorrow morning. And I'm not sure that the numbers are going to be accurate and everything is going to work out. Or I remember a chief data officer of a public company told me, you know, last week we nearly reported the wrong numbers to wall street. Like we caught it 24 hours before it was someone, someone on my team, like not even me, someone on my team caught the issue 24 hours. Like that is sort of, you know,
Starting point is 00:11:38 sort of implications for your, you know, for your job, right. For your professional integrity for, it goes into so many levels. And so I similarly, you know, in starting the company, saw that, right? But like people are just really sort of visceral about this. It's kind of ironic
Starting point is 00:11:54 given that it is, you know, you're like, you had one job, just get the number. Sure. But it's freaking hard. And I experienced that myself as well. So I was leading data analytics team and like the numbers were wrong all the time. And I would get like WTF emails from like my CEO and others are like, what's going on? And so, you know, first I'll say like,
Starting point is 00:12:17 I think the kind of importance of this problem is that it's not just, you know, getting the data right. It's also like tied deeply to people's kind of pride, professional pride and, you know, sort of satisfaction in their job as well. And I think that's why it's even more important to solve it. So, you know, specific to your question on like, how do you go about thinking about a solution like this? It's definitely a combination of tech and people, right? On the tech side, I think there's changes in the last couple of years that have made it possible for a company like Monte Carlo to create a standard solution. So the rise of data warehouses and BIs and honestly, a standardization around a finite set of solutions that a typical company will use has allowed us to build a finite
Starting point is 00:13:03 set of integrations, right? So Monte Carlo today can, you know, we support all data warehouses like Redshift, BigQuery, Snowflake, all BI solutions, Tableau, Looker, Sisense, Mode, and Data Lakes, starting to support that as well. And also ETL solutions and orchestrators like Airflow and dbt. And because of there's sort of this rise of quote, there should be like a buzzword alert on the podcast right now. I love it. It's been a lot. The modern day data stack. Thank you. So with the rise of that, it's basically like
Starting point is 00:13:39 sort of a consolidation or kind of an agreement or what are like the top vendors that folks work with, right? And that actually allows standardization in terms of how do you think about pulling metadata across those stacks and how to think about like what we call the pillars of data observability, which we sort of codified as a framework to think about common shared metrics for how to think about data observability. So, you know, I'll explain a little bit here, you know, what does data observability. So I'll explain a little bit here. What does data observability mean? It really is a corollary to observability and software engineering. And software engineering, it's very well understood what you would measure, right? So if you use Epidynamics or Neuralic or Datadog or whatnot, you look at specific metrics or engineering looks at specific metrics to make sure that you have
Starting point is 00:14:22 five nines of availability. Now, let's look at data organizations. You have people like data engineers, data analysts, data scientists who are working to create data products. And those data products can be dashboards or machine learning models or tables or data sets in production. And they need to make sure that those data products are reliable. But how do you measure that? Whatever you use in software engineering does not translate. tables or data sets in production, and they need to make sure that those data products are reliable. But how do you measure that? Whatever you use in software engineering does not translate. And so we actually had to codify what that means and sort of build around that. And so, you know, when we think about, you know, the tech part, there's definitely kind of advancement that have allowed that. And when you think about the people part, it's definitely the rise of what I would call
Starting point is 00:15:05 sort of the data engineer role in particular, because that person is now responsible not only for the job to be completed, but for the job to be completed, the data to be accurate, the data to be on time, sort of everything else that actually encapsulates what we would call sort of trusted data.
Starting point is 00:15:23 Barb, I have quite a few questions around that. So I will have the opportunity to get much deeper into observability. But before that, I have a very cliche question to make first, which is why Monte Carlo, why did you choose this name for the company? Well, that's a great question. That is a great question. We are big Formula One fans at Monte Carlo. I'm actually wearing a Formula One hat, but that's actually not the reason.
Starting point is 00:15:47 But so, you know, when we started the company, one of our couple hundred of people from data organizations, asking them about what their problems are. We actually got to a point where we very quickly wanted to start working with customers. And so I actually had like 24 hours or something crazy like that to choose a name. And I was like, okay, we'll just choose something for now and figure it out later. And I studied math and stats in college. So I literally opened my stats book, who I have right here, actually, and riffed through. And the options were Bayes' theorem, which did not seem to be great. The next option was Markov chains, which seemed even worse for a company. And the third was Monte Carlo. And I was like,
Starting point is 00:16:42 oh, Monte Carlo, I can work with that, actually. It's something that is both approachable, people know about it, but also has its sort of roots in data, if you will. And so it's named after the simulation in that sense. Okay. Okay. That's very interesting. I really like the bridge between the racing there in Monte Carlo and the statistical method. All right. Cool. on the car and the statistical method. All right, cool. So as you were like talking about data
Starting point is 00:17:06 and data quality and observability, you mentioned like a couple of terms and two of them is data has to be accurate and it has to be reliable, right? So, I mean, it makes total sense, right? I don't think that anyone's going to argue against that. But many times we forget like to be a little bit more of engineers, let's say,
Starting point is 00:17:26 and try to be a little bit more accurate on this term. So what does it mean to be accurate and reliable when it comes to data? Great question. So I think what you're saying is spot on in terms of
Starting point is 00:17:40 let's get more, let's introduce more diligence to what it means to get trusted data. And I think the path to that is actually operationalizing what that means. So let me get a little bit more specific. When we think about, we sort of call this like the data reliability lifecycle, there's sort of three core components to it that help us, or what we've seen is that our customers, if they actually operationalize with these three sort of parts, that helps them generate trust. So there's three core components to it.
Starting point is 00:18:13 The first is detection. The second is resolution. And the third is prevention. So double click into each of these stages is like introducing tech, but also processes, SLAs, contracts between teams, ownership, clarifying sort of who's responsible, who's accountable, et cetera. So on the detection side, actually understanding when data breaks and understanding why it breaks or understanding the impact of it. So let's sort of define how we actually do that.
Starting point is 00:18:44 I sort of talked before about how in observability and engineering, it's really clear what you're measuring. It's not as clear in data. So what we did is sort of from all the conversations that we had with companies, ranging from large organizations like Facebook and Uber and Google who built this in-house to small startups who didn't, we basically codified like all the reasons for why data breaks and all the symptoms for it and all the different things that people do to deal with it. And we've come up with these five pillars
Starting point is 00:19:11 that we think together help bring sort of that holistic picture. So the first is freshness of the data. So there's different ways to look at it, but basically give you an example. If there's a particular table that gets updated three times an hour for the last week, and today it hasn't gotten updated yet for the last three hours, that is a freshness problem that can potentially indicate about some problem with your data. There's different ways to look at
Starting point is 00:19:35 freshness, right? You can look at timestamps, you can look at volume of the data, actually. There's different ways to measure freshness, but basically like the data arriving on time. The second concept is volume. So literally like, again, pretty straightforward, but basically like the data arriving on time. The second concept is volume. So literally like, again, pretty straightforward, but like you can look at the number of rows over time and say like, okay, the number of rows has grown 5% every single day for the last week. Today, it suddenly dropped by 30%. What's going on? Did we miss something? So maybe the job was completed, but data actually wasn't transferred, for example. So the second around volume. The third is around distribution. So distribution is sort of like a catch-all phrase for changes at the field level. So for example, if you're, this is like a credit
Starting point is 00:20:14 card field and you're expecting numbers, and then suddenly you get letters in that field, for example. Or if you have, you know, shoe sizes and it's suddenly like, suddenly like, I don't know, shoe size of 100 or something, that's obviously like doesn't make sense. You can look at percentage null values, percentage negative values, et cetera. The fourth is schema changes. So actually, schema changes is a very, it's like a common culprit for data going wrong. And so oftentimes, like someone, an engineer might make a change somewhere that will result in field type changing or in a table added or removed. And everyone downstream is not aware of that.
Starting point is 00:20:51 So automatically tracking all changes to tables, to fields that are being added, removed, or edited. That's a fourth pillar. And the fifth pillar is lineage. And so we actually just released a great blog about how we built field level lineage. And when I say lineage, I mean both table-level and field-level lineage. And actually being able to automatically reconstruct your lineage without any manual input, just by connecting to your data warehouse and your data lake and your BI, actually understanding the connections between tables and fields and overlaying data health on top of that is incredibly powerful. So being able to say someone made a change somewhere
Starting point is 00:21:30 upstream and that resulted in this table downstream that now is not having the data up is now, you know, doesn't have the right data and resulting in this report downstream that now has like, you know, a higher percentage than what you expect null rates, having that view is sort of the start of what we call this detection phase. Does that make sense? I'll just pause there. Oh, absolutely. Absolutely. I love the way that you have codified this, to be honest. But I have a follow-up question to that. The way that you describe it makes total sense,
Starting point is 00:22:04 from my at least experience so far. Like a follow-up question to that. Like the way that you described makes like total sense, okay? Like from my at least experience so far. But how much can be done with standardization, right? And how much we need to account for the business context that this data operates in to make this framework work? Or we can completely automate it. Like what's your experience so far on that? Great question. So what we see typically is that most companies today try to do a lot of this stuff manually with tests that they would write. And so, you know, 80% of issues actually go uncatched or unnoticed, meaning they hear about it from someone downstream.
Starting point is 00:22:49 So let's say, for example, my CMO or my data scientist or someone that I'm working with is like, hey, like ping ping on Slack. Why is the data wrong? Something here is off. Help. It's probably your fault. Go figure it out. Kind of like not fun nodes on Slack. And so 80% is caught in that way. And 20% is caught with sort of manual kind of tests or manual checks that you could write based on business knowledge. What we think in sort of a better world and what we're seeing with our customers and other sort of strong data teams who are implementing this is that actually with standardization automation, you can catch 80% of the issues. That's the reality of what we're seeing. Like, I think everyone really thinks that they're a snowflake and
Starting point is 00:23:33 everyone thinks that they're unique. Yes, it's true. They are snowflakes. And you can also automate a lot of that stuff. Again, thanks to standardization in solution in the modern data step, right? And so I think what we're seeing is that 80% of the issues can be caught with automation. And then there's probably like 10 to 15% where only you would know, like no automation in the world can actually catch that. And so that like 10 to 15% is one where like really the unique expertise of data engineers and data analysts should actually spend time figuring out like, for example, you know, our customers view this data every Monday at 6 a.m. So at 5.55, this better be accurate. Or for example, this field in our business doesn't make sense to be higher or lower than 100, for example.
Starting point is 00:24:21 Like, let's say, you know, I have a ticker value, for example. Like, there's some, you know, sort of some use case in particular where only the business would understand that. And so, but that is only like the 10 to 15%. It doesn't make sense for those teams to spend their entire time actually building that. And then, you know, hopefully there's like 1% or maybe 2% that's like caught by things that just went unnoticed and
Starting point is 00:24:45 would be caught by our customers, sort of like downstream consumers. And so I would say like expertise and domain know-how is very, very important, but I feel like we can make better use of that so that data engineering teams don't spend their entire time building this, start with something off the shelf and then can add their knowledge sort of in a more custom directed way, if that makes sense. Yeah, yeah, 100%. Yeah, I totally agree. And okay, so far we had data to work with, right? A lot of data, actually. One of the biggest problems that we have is that we have way too many data sometimes. And now we also generate data about the data so we can monitor the data, right?
Starting point is 00:25:26 So how do we deal with that? Like what kind of like experience we need, what kind of experiences we need to build as like product people or like companies or vendors in this space to make sure that like at the end, we don't cause more frustration to our users, but we actually like help them figure out what's going on with that data. Yeah, I was, I sort of referred to, there's like this big, like hoarding of data. And now there's this like big hoarding of metadata. And you're like, that was useless.
Starting point is 00:25:57 This is going to be useless too, but we're just going to do it because that's what people do. It was just like the hoard data. You know, I'll also say something controversial. I think, you know, metadata by itself is quite useless. Lineage by itself is quite useless. Like nobody gives a shit about that. It doesn't matter, right? It's like great eye candy and you're like, oh, yay, like I have lineage. But where is it actually useful? What we're seeing is that metadata and, you know, things like metadata and lineage can help and be particularly useful
Starting point is 00:26:27 when they're applied to a specific problem. So for example, I sort of talked about the data reliability cycle of detection resolution prevention. In the detection phase, if something breaks, like if some table breaks, but nobody's actually using that table, like there's no downstream dependencies on that. And no one is clearing that. Maybe I don't care about that particular table, right? Maybe actually I don't need to know that the data is inaccurate there. It can be inaccurate and who cares, right? On the other hand, if there's a particular table that the data is inaccurate, there's 300,000 downstream nodes that are connected to it. There are 10 reports that my CEO and my top executives
Starting point is 00:27:06 and all of my customers are using on a daily basis. Yeah, I better get that data right. So actually using that context can help inform that and make it better. And similarly, we see that throughout this sort of life cycle. So in the second part of resolution, which is basically like, how do you speed up, you know, moving from being the first to know about data issues, moving to being very fast and quick and actually identifying the resolution of data issues. That's where these things can also help us give us clues to, okay, you know, a table here is broke. And at the same time, there was a change in this field or these three other tables also broke at the same time. And this is related to a change that a particular user in the marketing team made, for example, around Ascend.
Starting point is 00:27:48 So you can use all that information together to actually speed up resolution of data incidents. We find that data engineers and data analysts and data scientists spend oftentimes between 40 to 80% of their time on data fire drills. And so if you can use all that stuff in context with this, it's actually really powerful. And then the third part of like prevention, we're actually finding that, you know, people have reported sort of north of 70 to 80% of their data downtime incidents reduced once they have more access to this context that metadata and a lot of other metadata and data together can combine. So, for example, if we have a report on deteriorating queries that can give us really insight into where there are specific problems in our infrastructure, that can help us give clues as to how do we build a more robust infrastructure overall that can reduce data downtime incidents.
Starting point is 00:28:48 So being more proactive about how we manage our metadata and our data, I think also helps us make sure that it's more robust and trusted at the end of the day. Yeah, 100%. Okay, so we talked about detection a lot. The next step is resolution, right? So what does resolution mean? Yeah, so resolution, great question, means moving from the world today in which it often takes weeks, sometimes months. I hope, I haven't seen a case of years, but I might have to go back to my notes, but literally very long periods of time that it takes to identify the problem, do a root cause analysis, and fix the data, right? And moving that to shortening that time significantly, right? Also thinking
Starting point is 00:29:26 about SLAs for that, right? So like how quickly are those issues have to be resolved and for what severity, right? P0, P1, P2, P3, different issues, we should have different agreements on that. And so actually having SLAs and contracts with your data team on different data sets. So one thing that we're seeing is thinking of as part of this movement of like buzzword alert data as a product, there is, you know, thinking about sort of different domains, right? And sort of domain specific ownership. And so you can have like specific data sets and pipelines that are, you know,
Starting point is 00:29:58 sort of mostly used by the marketing team, for example. And maybe the SLAs there are different than SLAs for data that's used by the finance team or the finance domain. And actually, like, you know, maybe finance uses the data like once a quarter to report to Wall Street. So, you know, you have more time. But maybe marketing, it's actually feeding, you know, like a pricing algorithm for pricing houses in a particular market. And so, you know, if you're underpricing or overpricing a house, there's a big difference there, for example. Or just to give another example, one of our customers, Vimeo, it's a video hosting platform. They have a very strong team and they actually have used
Starting point is 00:30:36 data in the pandemic to not only sort of sustain their growth, but even fuel it. They've done that by sort of identifying new areas of sort of revenue and opening new revenue channels for them. And a lot of that has been doing, has been enabled by introducing data observability concepts of sort of detection, resolution, and prevention across the business. So for them, you know, they use real-time data to basically make a decision on like how much bandwidth does a particular user need, for example. And so the SLAs on that kind of data is obviously very, very different than the others. And so, you know, thinking about resolution, you sort of think about kind of impact radius, right? So who's actually impacted by this? And then also sort of downstream, then also upstream, how do you actually
Starting point is 00:31:21 locate, you know, that particular problem? And sometimes a problem, you know, can go sort of beyond a particular data warehouse and can be part of sort of, you can kind of use like logs from sort of DET or Airflow or sort of other orchestrators to help sort of pinpoint that. necessarily resolution is about like automatically identifying the problem and solving it for you as a platform, but rather giving data engineers the tools and the information to identify the problem way faster. Kind of like in the same way that you would use New Relic or Splunk to identify a problem in the infrastructure or application side. Yep. A hundred percent. Okay. I need your help to understand a little bit better the concept of SLA, right? So, I mean, I love using metaphors in general because it helps a lot to help people understand what we are building, especially when we are talking about a new category like data observability here. But the problem that I have is that an SLA, when it comes to infrastructures,
Starting point is 00:32:24 it's a very, very specific thing that has to do with the availability. We measure how much time a service or a server or whatever is available to use. Now with data, things are a little bit more complicated there. And the problem there is that you might even have edge cases where the data, because something went wrong, does not even exist, right? Let's take, for example, how Rutherstack is used, right? We have a developer, an SDK is integrated on a website, for example. If the developer does not capture correctly an event, let's say, instead of sending the actual values and just nulls there, like the data was never captured, right? So what does it mean to set an SLA there? Or how do we communicate also like these kinds of problems that might exist with data and
Starting point is 00:33:18 that we don't have with observability when it comes to infrastructure, right? Yeah, such a great question. And, you know, I can give my due sense from what I see our customers do. But, you know, just to be honest and transparent, these are early days of the category, right? So we are defining this sort of as we go. And I think for data teams, more and more teams are doing this, but it is still early days. Like, you know, I remember actually Facebook did a meta remember actually Facebook did an observability summit a few weeks ago. And we're talking about, I think, That's a very vague concept. How do you actually do that, right? And so one definition that we
Starting point is 00:34:07 sort of introduced is a combination of three metrics. One is number of incidents that you had. The second is time to detection, like your average or median time to detection. And the third is time to resolution. And so like measuring data downtime as dependent on those three measures. Now, I can introduce that concept all day long, but most teams actually don't even measure time to detection time to resolution yet. And so you're right. In order to introduce things like SLAs, there needs to be a baseline that's established. And I think that's where we're at these days with establishing those baselines. Now, particularly in the example that you shared, you know, you could, or an example
Starting point is 00:34:45 of an SLA can be the, you know, this particular table gets updated every three hours and we can only accept a deviance of X hours or we need it to be, you know, on time 99.99% of the time, for example. So that's like a very, very specific kind of SLA definition for a specific table that we know that like this needs to be updated on a regular cadence. So that's like an example for freshness problem, freshness SLA, for example. But I totally agree with you. This is like very, very early days. And, you know, I'm happy to share actually, there's like a blog that we wrote with an example of like an SLA dashboard that you can actually get an example of Calibri Games,
Starting point is 00:35:25 I believe, actually, that has specific SLAs from how often and how regularly they get their marketing data from third-party vendors like Facebook, for example. They have particular agreements on how often that data is received and contracts between teams on when they can actually use that, when they can rely on that. So I'm happy to share that as an example for something that's put in practice. But I agree with you, it's early days and we're seeing a lot of innovation. So I'm excited about where we're going with this.
Starting point is 00:35:53 But I can tell you that I've seen a data team that like operates with 100% data uptime on all five pillars. And they're like, our data is perfect. Don't talk to us. Like I've just never seen anything like that. Yeah, yeah, yeah, absolutely. And I think that's like one of the differences that we are going to see with
Starting point is 00:36:11 SLAs in this industry, like compared to the infrastructure observabilities that, okay, data is not like a server. It's not like a, you know, like a binary thing. It exists or it doesn't exist. I mean, it runs or it doesn't run. Right. So I think when we are going to like get to a point where we understand the you know, like a binary thing, it exists or it doesn't exist. I mean, it runs or it doesn't run, right? So I think when we are going to like get to a point where we understand the SLAs better,
Starting point is 00:36:30 there are going to be like much more dimensions that we measure there. And I think that it's going back to the beginning of the conversation. It has to do more with trust, like trying to measure like how much I can trust at the end, this data set that I have. And not that much about is the data set available or not available
Starting point is 00:36:46 or like these kinds of things. Awesome. Like that was like super interesting. Thank you so much. You really helped me like understand better, like the concept of SNAs. One last question from me, and then I'll give the stage to Eric
Starting point is 00:36:59 because I monopolized the conversation. And this time I want to ask you something that has to do more with building a company and not that much about data and data observability. So you are one of these quite rare cases of people who decided to start a company while a new category was under formation and you are at the forefront of that, right?
Starting point is 00:37:22 What does that mean? And what are, let's say, the fun part of doing this, like trying to build a company while the category is not yet there? And what's Hacks? Great question. So in general, I would say it's funny, you know, when you start a company, kind of everything is really hard. Like nothing is easy, right? So what's the, you know when you start a company kind of everything is really hard like nothing is easy right you so so what's the you know i always remind myself that's what's the percentage of startups that fail i think it's like 99.9 startups have failed and so you're by definition embarking on a journey that you know has like the lord like very likely to fail yes and yet you are getting
Starting point is 00:38:02 started right there's there's just something very weird about that. There's something wrong with these people. I know. I am one of them. Exactly. You're by definition like, why am I doing something that is basically doomed to fail, right? You develop different ways to think about the world, right?
Starting point is 00:38:20 One of the things that helped me get conviction early on, you know, that's why I sort of mentioned I actually started three different companies. I was like, look, if I'm going to get off of my cushy job and get other people to like leave their cushy job, it better be a DIA that I have a lot of conviction on and I'm really excited about. And so I spent a lot of time before actually starting the company with customers and with their pain. And that gave me a lot of conviction about this.
Starting point is 00:38:43 Now, you know, when you sort of start a category, it really like, it sort of makes the existence of the company hinge on the category creation. And I remember I was actually, I was chatting with, I think it was Lloyd Pab, one of the founders of Looker. And he's like, look, the worst thing that can happen to startup early days is that like, nobody cares. Just nobody gives a shit about it or about the category.
Starting point is 00:39:04 And so it's way better to have like, either like some strong reaction, either love or hate, but some strong reactions. And so we spent a lot of time on category creation at the very, very first days of the company. So actually one of the first few things was that, you know, I wrote a blog post about data downtime and about my experiences. And I actually, I remember this, this was before we actually incorporated the company. I was curious whether the concept of data downtime was something that anybody cares about. And so I actually applied to a conference, Data Council, actually, and with like the title of data downtime. And, you know, this was before there was a product or anything like that. And I was like, let's, you know, let's see what happens. And I assumed, you know, only four people would show
Starting point is 00:39:48 up and it would be kind of awkward and I'll kind of like hang out with them and talk about, you know, some random things. And then we'll just move on with our lives and pretend like they didn't happen. And actually there were like, I don't know, more than a hundred people showed up. And after the talk, they were like, thank you for giving us language for this. And we feel like this is the beginning of a movement. And this was literally like, it was just me. I didn't have, you know, I wasn't with my co-founder yet and there wasn't anyone else, but you know, that gave me the sense that like, okay, we're working on someone that can't something that people care about and we need to like actively invest in it all the time. So I would say like, you know,
Starting point is 00:40:26 the fun parts are that you're working on something that's like really important to people. And you're solving like real customer problems. Like, you know, we, a hundred percent of our customers renewed at Monte Carlo with us this past year. And so many of them are actually like, it's because we've been able to make a positive impact on their lives. And that's, you know, the stuff that I'm really excited about. Like we're actually doing something that people really care about. You know, I think the hard thing about it is that it's hard, right? We're not competing against, you know, 10 other options for customers. We're actually educating our customers on, you know, the fact that
Starting point is 00:41:06 there's a problem they're very well aware of. But there's a lot of education on like, hey, there's actually a different way to think about this problem, right? So our customers live in a world where they might have to manually look at reports all the time to make sure that the data is accurate. And now, you know, we're introducing a world where you can actually sort of rely on sort of an alert to tell you that something is wrong versus having to like manually check that. And there's also an education of saying, hey, look, there's concepts in observability and software engineering that worked. They can
Starting point is 00:41:38 also work in data. Like there's a lot of engineering best practices that we can bring over. And that is the education part. So we invest a lot in it. Like we write a lot of engineering best practices that we can bring over. And that is the education part. So we invest a lot in it. Like we write a lot. You know, we spend a lot of time with our customers on it. We really see it as like an existential part of the company. Yeah, 100%. I think of like the mechanism of metaphors are like something very, very important in category creation in general.
Starting point is 00:42:00 That's why I ask also about the SLAs and all these things. Anyway, we can be like discussing about the stuff for like four hours, but Eric, all yours. All right. I think we have time for one more question. And Bar, I'd love to actually get practical here. So in an ideal world, you know, you can have this experience of like, let's say building software from scratch, you know, with all of these principles, like unit testing and, you know, all this sort of stuff, which is like, you know, it's cathartic to think about that, I think for all of us, just because, you know, cleaning up messes is really hard, right?
Starting point is 00:42:35 But the reality for most companies running, you know, sort of even a stack that has a moderate level of complexity, you know, which I would say is most companies are trying to do something related to the modern data stack. When you think about observability, you're often coming into an environment that, not because anyone made really bad decisions, but especially as companies are just growing quickly. I mean, even thinking about reacting to COVID, all of the data and everything there, you're dealing with a situation where there's a lot of complexity. There were a lot of things that it would have been great if you had done them when you were initially building the stack, but you didn't because
Starting point is 00:43:18 you're moving too fast, you didn't have enough resources, or there's new technology coming in, et cetera. So could you help our listeners understand, it would be so great to just do observability from the ground up and have it integrated into every piece of what you're doing, but that's not the world that anyone lives in. So practically, when you go into a customer and you're implementing Monte Carlo, what's the lift? I'm thinking about the data engineers, maybe even heads of data who are just kind of like, that sounds so nice, but like, I don't know if I have like literally the resources to do like a six-month project and all this sort of stuff. So like, what does it look like? How hard is it? How long does it take? Like what's a lift? How many people, et cetera?
Starting point is 00:44:05 For sure. By the way, I don't think I've ever seen, for the record, I don't think I've seen a customer who literally has the perfect or even a clean, great setup. Most folks, I don't know, 99% have a lot of debt, a lot of people who have like come and gone, a lot of questions that they have, a lot of complexity, right? I think that's actually the reality for like most everyone. And Money Call particularly, having been in our shoes of, in the shoes of our, of teams that we work with, we recognize how little time they have and how unrealistic it is to spend six months to do something like that. And so actually early on in the company building, we've invested a lot in making it incredibly easy to get started with
Starting point is 00:44:50 Monty Prado. And so if you have sort of a standard stack, which I would say like, you know, Redshift, BigQuery, Snowflake, Looker, Tableau, et cetera, you can actually get started in less than 30 minutes. So less than 30 minutes. And those five pillars I talked about, you get that all of the box, all of the box. So within 24 hours, you actually have table and field level lineage and within our models
Starting point is 00:45:12 start working and within a couple of weeks, you will start having detection, resolution and prevention sort of features working for you. You can add customization on that, on top of that for your own, but those five pillars
Starting point is 00:45:21 are automatically out of the box within that 30 minute onboarding. No other sort are automatically out of the box within that 30-minute onboarding. No other sort of integration work required. Wow. That's amazing. So pretty low lift, I would say. And then you can get into customization. And then just again, just trying to help our audience understand what this dynamic looks like inside of organizations. Who are the users or who's the primary user of Monte Carlo? And how do they interact with the product? And what does that cadence look like as part of their workflow? Yeah. So users are sort of data teams. And so most typically data engineering, data analysts,
Starting point is 00:45:58 sometimes data scientists as well. I would say that titles are a little bit murky these days. Sure. Right. It depends a little bit on who are the people actually responsible for the data being accurate. Those three titles are the ones that we see the most. I would say more data engineers and data analysts. And then in terms of what does it actually look like, folks are incorporating it more and more into their workflows. And so that might mean waking up in the morning and checking to see whether the status of the data is up to date. And then throughout the know, throughout the day, like, you know, I want to make a change somewhere and I want to understand if I'm going to change someone's workflow because of it. So I
Starting point is 00:46:33 might go in and see, okay, if I'm making a change to this field or particular table, who is downstream actually like dependent on this and that, you know, would need to know. So I'll be thoughtful about that change. And then, you know, maybe later in the afternoon, I get sort of, you know, an alert about a particular problem in the data. And then I might sort of double click into it and understand like, it's impacted, what are the, you know, what are the queries, etc. And kind of doing research to understand what actually happened. It's mostly sort of embedded into folks sort of day to day, if you will. And largely, you know, it's sort of, it's because folks end up spending a lot of time on data fire drills. And so the goal is to reduce
Starting point is 00:47:12 that amount of time, basically. Yep. Okay, one last question. I said we had time for one more, but I've asked three. So I love that you talk so much about interacting with customers, talking with customers. So you're really close to these data teams. So really quickly, I just want to know, what's your favorite part about working with data teams? You've worked with so many different teams over your career, but what do you love in particular about working with data teams? That's the favorite part of my job. Literally, that's like what gets me out of bed in the morning is, you know, to sort of hear the amazing stories of data teams. I think maybe the favorite part of like how powerful data is, you know, we work with companies like Fox, for example,
Starting point is 00:47:54 that, you know, covers events like the Super Bowl. And they literally track like, you know, number of users and time spent on content and devices and like literally powering such important events. And, you know, then we have customers in healthcare that use data for diagnostics. And it's just like the use cases are so wide, even more so like with COVID-19 and everything, it's becoming even more important. It's just inspiring what data teams are actually working on.
Starting point is 00:48:20 It's pretty freaking cool. It makes me really proud to be working with them. Awesome. Well, thank you again so much for giving us some of your time. A wonderful conversation. And we loved having you on the show. Thank you so much for having me. Okay, Costas, this is something that I've thought about a lot over the years. And I don't know why I've thought about this. But when you think about technology, and maybe this just comes from me
Starting point is 00:48:43 doing a lot of consulting, but I was like, okay, there's probably like 10 business models across B2B and B2C that you could build basically sort of a predefined data schema and stack for. And it would work for like 90% of the companies out there, right? Like a lot of times the customizations are not necessarily a good thing, even though each business is unique, right? Like a lot of times the customizations are not necessarily a good thing, even though each business is unique, right? And it was so interesting to me to hear Barr, to some extent, validate that, right? Like we probably can solve 80% of the data problems because they're fairly known quantities. And some of that has to do with the tooling and other stuff like that. But it was just really interesting to hear her
Starting point is 00:49:30 talk with such a high level of confidence and say, look, yeah, every business is unique, but really only 10% to 15% of the problems are of the nature where it needs customized resources and we can automate the rest of it. I just, that was really cool to see her talk with such confidence about that. Yeah, a hundred percent. I totally agree with you. And validate my idea, of course. Yeah, I totally agree. I mean, it was very interesting to hear from someone like her, talking about solidization and how standardization like should be part
Starting point is 00:50:05 of the products that we offer. You know, like standardization is usually, I mean, mentioned a lot by engineers, but in the more, let's say like kind of a context, usually we don't really consider that as part of like building a business, but it is important. And I think it's even more important when we are trying to build like a new category as like Par is doing right now with the rest of the vendors like in this space. So people need guidance,
Starting point is 00:50:36 people need education and standardizing processes and concepts is one of the best tools you have like to do that. So yeah, like I love that part. processes and concepts is one of the best tools you have to do that. So yeah, I love that part. And the whole conversation was amazing. And hopefully we're going to have you back again and discuss a bit more about all these concepts. I agree.
Starting point is 00:50:55 Well, thanks for joining us again. We will catch you on the next show. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack,
Starting point is 00:51:21 the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
