The Data Stack Show - 180: Data Observability and AI for Data Operations Featuring Kunal Agarwal of Unravel Data
Episode Date: March 6, 2024

Highlights from this week’s conversation include:
- The evolution of data operations (1:13)
- Unravel's role in simplifying data operations (2:17)
- Kunal’s journey from fashion to enterprise data management (5:23)
- The Unravel platform and its components (10:08)
- Challenges in data operations at scale (16:34)
- Users of Unravel within an organization (22:32)
- Calculating ROI on data products (25:55)
- Understanding the cost of data operations (27:01)
- Measuring productivity and reliability (30:59)
- Diversity of technologies in data operations (34:52)
- Efficiency in cost management (44:15)
- Implementing observability in AI (47:55)
- Challenges of AI adoption (50:17)
- Final thoughts and takeaways (51:36)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Hi, Data Stack Show listeners. I'm Pete Soderling, and I'd like to personally invite you to Data
Council Austin this March 26 to 28, where I'll play host to hundreds of attendees,
100 plus top speakers, and dozens of hot startups in the cutting edge of data science,
engineering, and AI. If you're sick and tired of salesy data conferences like I was,
you'll understand exactly why I started Data Council and how it's become known for being the best vendor-neutral, no BS, technical data conference around. The community that attends
Data Council are some of the smartest founders, data engineers, and scientists, CTOs, heads of
data, lead engineers, investors, and community organizers who are all working together to build
the future of data and AI. And as a listener to the Data Stack Show, you can join us at the event at a
special price. Get 20% discount off tickets by using promo code DATASTACK20. That's DATASTACK20.
But don't just take my word that it's the best data event out there. Our attendees refer to
Data Council as Spring Break for Data Geeks. So come on down to Austin and join us for an amazing time with the data community.
I can't wait to see you there.
Welcome to the Data Stack Show. Each week we explore the world of data by talking to the
people shaping its future. You'll learn about new data technology and trends and how data teams and
processes are run at top companies. The Data Stack Show is brought to you by Rudderstack,
the CDP for developers. You can learn more at rudderstack.com. We are here with Kunal from Unravel Data.
Kunal, thanks for spending a few minutes with us today.
Eric, Costas, thank you so much for having me here.
All right, give us your background.
How did you get into data?
And what are you doing today at Unravel?
Yeah, so Kunal Agarwal, founder and CEO of Unravel Data, which I started with my co-founder Shivnath Babu, who is a professor of computer science at Duke University.
We both started this company to simplify data operations.
We feel data engineers and data teams spend too much time firefighting issues rather than being productive on the data stack.
And we wanted to automate and simplify some of that.
Kunal, I really would like to chat about
how data operations have changed in the past 10 years.
It's extremely interesting that you've seen this whole, from the Hadoop days up to today,
all the changes that have happened.
And I have a feeling that the complexity around data operations has exploded, right?
Especially with having pretty much every one or two years new use cases around data coming in, right?
So even, let's say, observability in data.
What does it mean?
What did it mean five years ago, and what does it mean today, when we also have AI in the mix, for example, right?
So I'd love to get more into this journey, how things have changed, and what it means today to operate, to be an operator around data.
And of course, learn about Unravel and how it helps in that.
What about you?
What are like some topics that you're excited about?
Yeah, no, of course, there's never a dull moment in the life of a data team member,
especially for the last 10 years.
So we've gone from doing things with Hadoop as one Swiss Army knife, if you will,
to having a multi-system stack now, which used to run on-prem and is now primarily running on the cloud.
That's a mega change that's happened.
The other is we've gone from doing these batch workloads with ETL to now doing real-time
or near real-time workloads in production and not just as a science project.
And then we've gone from doing these BI, business intelligence, or just advanced analytics
workloads to now doing machine learning and AI in production. So if you're a part of a data team
as a data engineer or a data scientist or a data analyst, you've had to keep up with the demand
of your business and also had to keep retooling and reskilling yourself on how do you work on
a MapReduce-based system
to now a BigQuery and a Snowflake system.
It's incredible, the pace of change and the rate at which things are evolving in this ecosystem, right?
So that's a very exciting part for us because what Unravel is ultimately helping to do is
to simplify how these data engineers or data analysts are creating their applications, how they're making sure that these applications are reliable, that they work on time every time, and that they're able to scale in a very efficient manner.
It's not a linear scale in dollars versus productivity.
Can we bend the cost curve as these environments are scaling up so that they're starting to get more bang for their investments?
And as we now see, the most exciting thing, it's actually here right now, not even the near future is AI
and how those workloads are changing businesses and turning industries upside down.
So it's a really powerful industry to be a part of and really exciting time to be a part of this
industry, but it's not for the faint hearted. It's for people who are up for a challenge, who like change,
who like evolving,
who like to try out new things.
And that's what makes it exciting overall
for everybody in this industry.
And I'm sure you've had the same experience
too, Costas.
Yeah, 100%.
I can't wait to get deeper into all that.
Eric, what do you think?
Let's dive in.
Let's do it.
You know, I'm so excited to chat with you today and dig into all things data ops.
But your story actually, as a tech founder, started in the fashion industry.
You know, and you've come a long way in enterprise data ops management.
So go back to the beginning. How did you start in the fashion industry as a tech founder?
You know, I just spent a lot of time trying to figure out what to wear every day. I'm sure we all did, and we still don't look that good, do we? It came from an actual frustration of, you know, why don't we have something like we have,
you know, for songs that recommend what you should be listening to. You're not always
thinking about the exact song you want to listen to. It just shows up and it's the right song at
the right moment. So we decided to create an algorithm that helps you decide what to wear
based on a lot of different factors around where you are,
what the weather is like, what your friends are wearing, and then it picks out stuff that you
actually have in your wardrobe versus things that you should be getting in your wardrobe.
It was exciting, but I realized that B2B enterprise software is where I have more
experience and more of a liking. But if you break down even that fashion experience,
it's really a, call it a big data model that had a recommendation engine running on top of it,
based on a lot of data that helps you connect the dots and figure out what you should be wearing.
But that experience, and I was consulting with Oracle products back in the day, working with large enterprises.
I started to get the first exposure to technologies like Hadoop,
which obviously was very nascent.
We're talking back in 2012, 2013 timeframe.
It was very powerful.
You could get a lot of large-scale processing done
for a really cheap price because it's open-source software.
It could run on commodity hardware, unlike Vertica or Teradata back in the day, which cost
millions of dollars. But we realized that it needed to be a complete product. It needed to
not have rough edges. It needed to be simple, intuitive for more users to get on the platform
and start to use this powerful technology.
So that's when I met my co-founder, Shivnath, who was at Duke University.
And we figured that if we were able to simplify running applications in a high-performance way and make that automated, then that would reduce the amount of toil a data engineer spends in getting
their applications into production.
And that was the hypothesis that we started Unravel with.
And then since, we've actually extended the platform to obviously continue to focus on
performance and reliability, but then also start to think about efficiency and cost.
And then as Kostas was talking about earlier, the evolution has also led us to make sure
that we are able to support all technologies that data teams are using.
Back in the day, it was just one technology called Hadoop that everybody used.
Now, it's a whole zoo of animals, all complicated names, but really powerful stuff.
So we want to give users a choice and bring any technology that they're using, and then be able to get that same quality of service and an efficient way to go and scale your environment, right?
That's really a promise to the customers.
But yeah, it's been a fun journey from the fashion days to now, Eric. Yeah.
Before we jump into Unravel, I do have to ask: what was the most surprising thing you learned about the fashion industry, or even fashion consumers, with your diving into that world?
You know, interestingly enough, we learned that men engaged more than women did.
Yeah, much higher than anticipated and, you know, marginally higher than certain categories of women in different demographies.
You know, when you bring it down by regions and age, there were some men that were participating
and being more active about this than women were.
And I think the reason for that is women have so many other outlets
to discuss fashion and men did not.
And this became one of those places where they would actually engage
with and understand like, oh, what are my choices for,
you know, where I'm going to sit at.
But then you also have men who go all the other way,
like the Zuckerbergs and, you know, the Steve Jobs of the world.
They just have a uniform for that every day.
And, you know, come to think of it,
that may just be a better time saving way than to...
Yeah, yeah, totally.
But that was definitely, you know, interesting and insightful.
Yeah, maybe we're just much more clueless when it comes to fashion.
And so you kind of created this outlet where they could talk about it more.
Yeah, that's probably it.
Well, give us an overview of the Unravel platform.
There are multiple components here.
We talked about DataOps and maybe we can start just with a definition of DataOps.
How do you define DataOps?
Yeah. So it's rather simple. Think about all the stages your data pipeline or your code has
to go to get an outcome. All the code that you have to write, all the sequencing you have to do,
the infrastructure that it's running on, the services that it's touching.
It's a rather complicated tangling of wires, if you may.
That's the kind of visual that comes to mind.
And when something goes wrong or something's not behaving the way you're anticipating it to behave,
then you start to ask the questions of what's going on, why it's happening, and how do I
go and fix it?
And to answer those questions, you need this thing called observability, right?
That's the simplest way to think about it.
You need to understand everything that's happening inside to then be able to ask it questions.
So the Unravel platform at its center is an observability platform for data ops teams
and for data ops, really, that helps do a couple of things.
Number one, makes your applications highly performing.
So your business is depending on certain data pipelines
or certain AI models finishing correctly and on time,
otherwise revenues hurt or your products aren't advancing.
So Unravel makes sure that happens, that your
service level agreements internally and externally are met, which are called SLAs.
The second thing is, if you do have a reliability issue, then Unravel helps you troubleshoot that
and fix those issues in a proactive and automated fashion, which we'll dive into.
And then third is, nobody's running a small data environment these days
because every company is becoming a data company.
So when you've got them spending $100,000
to $1 million to $10 million
to the bigger company spending hundreds of millions of dollars,
you need to make sure that
you're doing it in an efficient manner.
And what we're seeing is
companies are wasting upwards of 30 to 40%
of the cloud bill by just doing wasteful things and inefficient things that they may not even be
aware of. There are some common things like keeping the tap on when you're brushing your teeth.
So you should be turning them off from something as mundane as that to writing more efficient code.
But there's a lot of, you know, efficiency to be gained out there.
So when we step back, we look at, hey, let's connect to everything.
So if a data team has 7, 12, 14 different components that they put together in their stack, Unravel connects and collects data from everywhere.
So we know absolutely everything that's going on.
And then collect data from all the layers of the stack,
horizontally as well as vertically.
So from your code all the way down to infrastructure,
see everything, gather everything, measure everything.
And once all this data is inside the unravel platform or our service,
that's when we run our algorithms and our AI models on top of it
to automatically detect what the problems are, so you don't have to go hunting for them.
Tell you why it's happening, so you don't have to do the cross-connection; it's connecting
the dots for you so you don't have to go understand why something happened. And then,
in some cases, give you an automatic resolution.
And in some cases, give you a guided remedy where it's not possible to automate things.
But at least tell you what to go and do to go and get out of this particular issue so that it stops the trial and error that's going on in your head.
Maybe I should try this out.
Maybe I should try that out.
You don't have to do that anymore. And what we've seen is if you approach it in this way, then you can save several hours per problem per engineer inside a company, which ultimately manifests itself in better productivity and efficiency rates across your organization, but also in the efficiency of your infrastructure.
But more importantly, you can now start to depend on your
data outcomes. Companies are betting their reputation and their money on data outcomes.
And if it doesn't work half the time, then it's useless. Now you can stand behind it and say,
you know what, this thing that we're launching, this recommendation engine that we're creating,
or this fraud prevention app that we're launching, it will work on time every time.
And that's when companies can start to confidently invest the second wave
of AI or any other applications that they may be creating out there.
Yeah, it makes total sense. I wanted to get into the analogy. I loved turning the tap off
while you're brushing your teeth. It made me think about something you said earlier, which was you started in data back
when, in terms of big data, Hadoop was really the main game in town.
And it made me think back to business intelligence originally was a finance function, right?
A lot of times it reported up to the CFO, right?
Which in many ways makes a lot of sense.
But then, you know, there are a couple of dynamics that happen.
Number one, the cost of storage just starts to plummet.
Storage and compute separate with this big migration to the cloud.
And so all of a sudden, even just that is this massive workflow optimization, right?
Wow, like, you know, we can be so much more efficient
than we used to.
We can run way more queries, et cetera.
Pipeline technology advanced significantly.
And so it's way easier to move data around
and cheaper to move data around, you know?
And so free-for-all is probably too strong of a term, but, you know, it's like, well, yeah, I mean, let's just load the data
warehouse, load the data lake. We can do all sorts of analytics, self-serve analytics, you know,
machine learning, all this sort of stuff. And now it's sort of, we're coming full circle, right?
And like, when you get the compute bill at the end of the month,
finance is like, okay, who's,
we got to figure out who's, you know,
who owes what on this big compute bill.
Can you kind of talk like, talk through that?
Cause you've lived through that story
and unravels lived through that story in many ways.
Yeah, no, you're absolutely right, Eric.
So if you break it down, there's three things that have increased, right? So the number of use cases for data has increased. We've gone
from this, hey, it's good for financial reporting and it's good for understanding our sales data
to now, you know, we want to create brand new products. We want to improve our operations,
right? So the use case for data has increased.
The data sets that we're capturing has increased.
We were only capturing a subset of our financial data and our sales data.
Now we know everything about the customer.
Right, right, right.
Yeah, it was just transactions mainly, and now it's every digital touchpoint.
Exactly.
And the users of the data technologies have increased as well.
Earlier, this was limited to the hardcore engineers.
Maybe the financial analysts who knew how to switch from Excel pivot tables to maybe getting into a more powerful system.
But that's really it.
Now, product guys are on it.
Marketing guys are on it.
Every department of the company wants to get on it.
Legal teams want to get on it, right?
So the number of people jumping on these platforms has increased.
By the way, all those three things are good things to happen because you can get some
great outcomes with data.
But back to your point around, this does become a mess as companies start to scale this out.
Because the promise that we had heard that cloud will solve all problems and world hunger
is actually not true, even though cloud is better suited for data analytics.
Absolutely.
It's got limitless compute, limitless storage that you can definitely scale out your systems
for sure.
But as companies started to democratize data access and give it away to a large number of audiences,
there started to be spurts of people using data analytics
in a fashion that it should not be used,
knowingly or unknowingly.
And a big part of that is the range of skills
that people have in these different departments.
Not everybody is an expert on the data systems.
And you may have, you know,
on the other end of the spectrum,
some people that are, you know,
who just know drag-and-drop tools,
maybe some SQL.
And unknowingly what's happening with them
is they're creating
inefficiencies in code, inefficiencies in the way these pipelines are being scheduled and run,
or just how these AI models are being used. So I'll give an example. Again, like a mundane one,
like turning the tap off while you're brushing your teeth: it could be a select star that a novice user does on mega tables.
And this is the case that I hear about from our customers every week.
And it racks up hundreds of thousands of dollars of bills.
And you just scratch your head like, who did that?
Why did they do that?
Sometimes a select star may be what you need to do on a table.
But then who did that? Why did you do that? How do we control it from happening next time? Is the question people start asking once
they're shocked. That's just one simple example. There are a hundred other ways in which people
can creep up on these inefficiencies. Other ones are, for example, in architectures that are not
serverless or even in serverless architectures, you have to understand what's the size of your warehouse
or what's the size of your containers.
And if you give people a small, medium, large, extra large,
guess what?
Everybody's choosing extra large.
Right.
Of course.
The most important, biggest, baddest workloads
compared to the next guy.
Nobody ever chooses small, maybe medium.
But then, you know, you run a profiler on that and you understand that you're only using
10% of the resources.
So it could have been one tenth of your cost, but people don't know that.
So, you know, I can go on.
There's so many of these inefficiencies that happen all the time.
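The warehouse-sizing arithmetic in that example is easy to sketch. Here is a hypothetical sanity check; the size tiers, costs, and the `right_size` helper are illustrative assumptions, not Unravel's actual model:

```python
# Illustrative warehouse right-sizing check (not Unravel's actual logic).
# If a profiler shows you only use 10% of an extra-large warehouse,
# a smaller tier can serve the same workload at a fraction of the cost.

# Hypothetical hourly costs: each tier doubles in cost and capacity.
TIER_COST = {"small": 2.0, "medium": 4.0, "large": 8.0, "xlarge": 16.0}
TIER_CAPACITY = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}  # small = 1x

def right_size(current_tier: str, utilization: float) -> tuple[str, float]:
    """Pick the cheapest tier whose capacity covers observed peak usage."""
    needed = TIER_CAPACITY[current_tier] * utilization
    for tier in ("small", "medium", "large", "xlarge"):
        if TIER_CAPACITY[tier] >= needed:
            savings = 1 - TIER_COST[tier] / TIER_COST[current_tier]
            return tier, savings

# An extra-large warehouse running at 10% utilization:
tier, savings = right_size("xlarge", 0.10)
print(tier, f"{savings:.0%}")  # small 88%
```

That is roughly the "you're only using 10% of the resources, so it could have been one tenth of your cost" observation, expressed as a lookup.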
But even before you go to improving the system,
just understanding who's spending what
becomes a critical issue.
So companies have a policy of showback now
as the enterprises start to increase their usage.
Showback really is, hey, look,
we spent a million dollars last month.
We've got five departments.
Did you all spend $200,000?
No, this person spent $100,000, this person spent $300,000, et cetera, et cetera. So can we please break down and understand
who's spending what? And then companies are also going to a chargeback where the group actually
has to pay for what they spent out of that million dollars and that's how we're going to go and pay
this particular bill. So it's an interesting evolution because on-prem, you didn't have to think about that
because of a set of resources. So the worst you could do is Eric could steal from Costas
and Costas workloads would stop, but the bill would still be the same because it's hardware
that you are appreciating over time. Yeah.
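The showback arithmetic Kunal walks through reduces to a proportional allocation. A toy sketch, where the department names and usage numbers are hypothetical stand-ins for his $1M example:

```python
# Toy showback/chargeback breakdown: attribute a shared cloud bill to the
# departments that actually incurred it, rather than splitting it evenly.

def showback(total_bill: float, usage_by_dept: dict[str, float]) -> dict[str, float]:
    """Allocate the bill in proportion to each department's measured usage."""
    total_usage = sum(usage_by_dept.values())
    return {dept: total_bill * usage / total_usage
            for dept, usage in usage_by_dept.items()}

# Five departments, a $1M month. An even split would say $200K each,
# but measured usage tells a different story:
usage = {"marketing": 100, "product": 300, "finance": 50, "ml": 450, "legal": 100}
bill = showback(1_000_000, usage)
print(bill["ml"])  # 450000.0
```

Chargeback is the same breakdown with teeth: each group actually pays its allocated share instead of just seeing it on a report.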
Yeah, I mean, this is a fascinating conversation. And I agree, the showback and then the chargeback dynamic is, you know, super interesting. Can you help break down, you know, we're talking about larger companies here, where a smaller company isn't necessarily going to face these issues, because their workloads are fairly simple, right? And even their stack is simpler.
But at scale, when these things really become a problem, can you talk about who is the sort of
owner of unravel within an organization? Or who is that group? And can you just kind of break down
what are their day-to-day problems that they face? You know, I'm sure we have some of those people in our audience, and then some people who work with data in an org, but maybe they're not as familiar with that person's sort of day-to-day issues relative to this infrastructure.
Yeah, so multiple people in the data teams use Unravel. Let's start with the data engineers, because they're always near and dear to our heart.
So Unravel helps data engineers in a couple of ways.
Number one, when you are developing your application,
removing errors, removing bottlenecks from your code
is something that Unravel helps you out with automatically.
Putting that code into production
and making sure that it's running there,
meeting its SLAs,
every time, is something that Unravel also helps out with, using its AI engine to understand deviations in performance, what happened, whether something new was introduced.
And if you don't have something like Unravel, then data engineers get called into these production
issues or into the cycle that,
you know, moves their applications from dev to production when somebody's doing a code check or a code review, for example, right? The other side that we help data engineers out with is, again,
when it's running in a production system, your boss, the head of product or the head of business
unit may say, I need you to go and cut this cost down.
I need you to run this more efficiently.
And now these data engineers have to go and hunt for ways
in which they can reduce their costs,
which is also something that Unravel can help you automate
and tell you in plain English,
look, go and do this thing.
You will not sacrifice performance,
but your cost is going to improve by so much, right?
So that's one group of people that use Unravel on an everyday basis.
The other group of people are the centralized group of leaders and operators who are responsible
for making sure that this environment serves the purpose for every business unit, meaning
they may get a Databricks environment, they may get a Snowflake environment with some Kafka, with some Starburst, right? And they are providing this to their
business and saying, hey, use the stack and it will run well. So that's the other group that
uses Unravel to make sure that both the performance, reliability, and the cost part are taken
care of. And then this group is able to also set budgets and guardrails
for all these subgroups of products that are using this platform so that they can
proactively understand how the cost trends are going towards this month. So not surprised at
the end of the month. And then if there's any misuse or rogue usage,
you're catching it live rather than catching it in retrospect, where you've already burned
through the dollars, and you're able to fix that problem in real time. And ultimately,
when companies mature to becoming a true data-driven organization, where they're actually
generating revenue from their data applications and products,
then we have business leaders using our product
to go and understand what's the ROI
and what are the margins of running these data endeavors
to then ultimately go and generate revenue for the company
and can we improve that in a certain way.
So it really starts to go bottoms up
where people are running, you know, applications
all the way up to how those applications are serving the business.
Yeah.
One quick question.
And I know Kostas has a ton of questions, but I want to dig into the ROI question, because
you've mentioned a couple of times, you know, sort of data output or data product, right?
And I'm just going to pull an example, and tell me if this is a bad one, and maybe you can pick one. But I think about, you know, TurboTax. It's their system that allows
end users to submit their information, you know, and essentially file a tax return, right?
That's a hugely intensive data operation, because it's ingesting all this information,
it's running it through
all sorts of queries. I'm sure there's machine learning going on in the background. It has to
check it against all sorts of regulations. I mean, that thing is probably a gnarly app,
and it requires a huge amount of data infrastructure. And so can you walk us
through how would you think about calculating the ROI on that product
from a data and infrastructure standpoint?
Such a good example, Eric.
So you can take Intuit TurboTax that's ultimately costing you $10 a pop, right?
And then you walk backwards and you have to really understand what the unit cost of serving
just Eric is, to understand the margins you're making on that product.
So any data product has multiple stages.
You have to collect the data.
You have to cleanse the data.
Then you're running some algorithms on top of it.
And then you're getting some outcomes.
And I'm making it very simple.
All the engineers listening to me on this podcast probably like, yeah, that's like a hundred steps
for us. That's what they presented
in the board meeting.
Exactly. Especially the guys
at Intuit are like, Kunal, you're making this sound way too
simple. It's probably a
hundred nested
workflows, right? Running something
on Airflow, something running
on Spark.
And I know Intuit now is on Amazon.
They did this mega migration from on-prem, again, as part of the evolution.
So anyhow, what you need to ultimately do is understand what is the cost of one unit
of work.
So if you're running it on Spark, what's the cost of that Spark job?
And how many Spark jobs do you have?
And then understand that end-to-end from source to outcome, what's that cost of all
those different stages and multiple pipelines put together really is.
And then you have to think about what is the optimized cost version of doing that?
So there's a cost.
It's costing me $10,000 to run this pipeline.
But if I understand there's room for optimization,
then I can make this pipeline run for $6,000, for example, right?
Now, how many users can that $6,000 serve? Say it can serve 1,000 people.
Great, it's six bucks a pop, right?
On the cost side.
And then you want to bend the cost curve as you're scaling up.
So if it's for 1,000 people, what does it look like for 10,000 people?
And the answer should not be linear.
And if it's 100,000 people, right?
And that's how you start to scale it out.
And then you understand how much margin you can get.
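The unit-economics walkthrough above is a few lines of arithmetic. A sketch using Kunal's hypothetical numbers ($10 a pop on the price side, a $10,000 pipeline optimized to $6,000, serving 1,000 people); the 10x scaling figure is an invented illustration of "bending the curve":

```python
# Back-of-envelope unit economics for a data product, using the
# hypothetical numbers from the conversation.

price_per_user = 10.00        # what the product charges, e.g. $10 a pop
pipeline_cost = 10_000.00     # end-to-end cost of one pipeline run
optimized_cost = 6_000.00     # same pipeline after removing inefficiencies
users_served = 1_000

cost_per_user = optimized_cost / users_served      # six bucks a pop
margin_per_user = price_per_user - cost_per_user

# "Bending the cost curve": serving 10x the users should cost less
# than 10x. Say 10,000 users cost $45,000 instead of a linear $60,000:
cost_at_10x = 45_000.00
linear_cost_at_10x = optimized_cost * 10
curve_bent = cost_at_10x < linear_cost_at_10x

print(cost_per_user, margin_per_user, curve_bent)  # 6.0 4.0 True
```

The point of doing this from the design phase is exactly the one Kunal makes: you want to know the product is not feasible before you have spent millions finding out.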
Now, that's a very advanced, mature company that is using Unravel's data to be able to
do that.
But what we're encouraging people to do is start to think about that from the get-go
because you don't want to run a full project and spend millions of dollars to then come
to the outcome that, you know what, this is actually not a feasible product or this is
not a feasible project for our business to even get into.
This is especially true, Eric, in the age of AI, as everybody wants to create for themselves
an AI outcome.
But then if you get LLMs off the shelf, you're spending about $3 to $10 million a year.
But if you create your own LLM, you're looking at $150 to $200 million of spend.
So really understanding how are you going to measure those costs? How are you
going to break them down? And then thinking about products and what it could actually mean,
super important from the design phase itself. And that comes back to our philosophy of
measure everything, right? It's a philosophy of bring all your data into a data lake.
Now that you've done that, start to measure every process from the get-go. So you at least know
how much this costs from the get-go
to then think about how you should be scaling this out. Yeah, makes total sense. Okay, Kostas,
one more question. Forgive me. So we're talking about infrastructure, right? But I'd love to know,
you know, and I'll use the example of sales and marketing costs. You usually measure,
well, you measure it in a ton of ways, right? But when you're measuring it from a finance perspective, you'll measure like the spend, you know, okay, so how much spend are we, you know, marketing spend do we have?
And then you have a fully loaded cost, which includes all of the headcount, all the commissions on the sales side, right? When it comes to data and the types of ROI you're talking about,
how are organizations thinking about the human capital aspect of it, right? Because it's not
like these systems just run themselves, at least now, maybe in the future. But you have people
who need to run these systems, right? And how do you think about that as part of the cost equation there?
It is one of the bigger parts of the cost equation, actually.
So even on the data side, just the data stack side, it's infrastructure, it's the data sets itself, it's the services that you're using.
You know, all of that stuff adds up to your total cost of the stack.
And then you've got the cost of the people. The way we think about the cost of the people is
thinking about a measurement, like a throughput measurement of what kind of productivity are you
getting from a class of people. That's what we've seen works best. The productivity or throughput
metric could be anything that's relevant for your organization. It could be how many data pipelines per team, per member of the team. It could be how many AI models,
how many new pipelines are you able to generate every month with your team. And then, you know,
you could also map that back to how many issues, how many problems, how much downtime did you have in your environment, and then start to see the productivity of your team across that.
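A throughput metric like the one Kunal describes can be sketched in a few lines. Everything below is a hypothetical example of the shape of the calculation, not Unravel's actual model:

```python
# Hedged sketch: a per-team throughput metric of the kind described above.
# All function names and figures are invented examples.

def throughput(pipelines_shipped, team_size, months):
    """New pipelines delivered per engineer per month."""
    return pipelines_shipped / (team_size * months)

def reliability_adjusted(pipelines_shipped, incidents, team_size, months):
    """Same metric, discounted by incidents: net useful output per engineer-month."""
    return max(pipelines_shipped - incidents, 0) / (team_size * months)

# Example: a 5-person team ships 30 pipelines in 3 months, with 6 incidents.
raw = throughput(30, 5, 3)               # 2.0 pipelines per engineer per month
net = reliability_adjusted(30, 6, 5, 3)  # 1.6 after discounting incident rework
```

Mapping issues and downtime back against raw output, as suggested above, is what turns a vanity count into a productivity signal.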
What we have seen, though, Eric, is that people's productivity is nowhere near where it should be.
Even getting half productivity, meaning four hours of productive time a day out of the eight hours a data engineer is working, is average right now. That's what you're getting. So people are spending half their time firefighting,
wasting time on troubleshooting, debugging, fixing problems, things are breaking,
trying to stand them back up. Things were working yesterday, today they're not.
It's a complicated piece of tech that these guys are running, and unfortunately they haven't had enough time to train themselves.
People running on Oracle systems have been masters of Oracle systems over 20 years.
People running Databricks, Snowflake, BigQuery have been running them for two years, three years at most.
So they haven't gone through those experiences and sorted this out.
So productivity will get better as more maturity happens and more experience of these data
teams, but the business cannot stop.
The businesses are running because the competitors are creating amazing data outcomes, and they
just need to get theirs out in the market as well.
And that's where automation around what we do with Unravel,
you don't have to be an expert. It tells you in plain English how to go and fix certain things.
So you could be a person who's coming straight from Teradata onto Snowflake,
and you would know Snowflake overnight. And if you had any issues, you wouldn't be spending four
hours a day doing that. It'd be a couple of clicks and a minute or two if it's not completely automated.
Okay, I have a question about reliability, especially in the environment,
like the mature environment that you have seen in the enterprise
and let's say more like purely data-driven companies, right?
My experience, especially with enterprise, because what is interesting with them is that they've been around long enough, right, to go through many different products for what they are trying to do. And what I've seen in practice is that usually technology does not get replaced immediately; you usually end up with pretty much everything running together.
I think if anyone could take a look into a big account of a Fortune 100 company, they
would probably see pretty much every possible vendor in there operating. How is reliability managed when you have so many different systems and so many
steps that the data has to go through? Let's start from a technology perspective
for now, because when we get into the people
aspect of it, it gets
even more complicated.
But how
have you seen things working there
with all this diversity
of technologies operating
together at the end?
You're right, Kostas.
The people side is hard
because no one person is an expert in all the systems in the stack.
And that's an inherent problem.
On the technology side, look, people are choosing different technologies for different use cases.
That's a reason why they have different stacks.
And the other reason is just compliance, that certain data cannot move to the cloud.
That's why they have an on-prem version and a cloud version.
And then the third, as Eric was pointing out, is with democratization and opening up the data stack to the company, people are kind of encouraged to go.
You're like, hey, if you want to go spin something up, go spin something up, right?
If you want to start a Snowflake cluster, start that out.
And before you knew it, you had these bursts of clusters here and there.
And then before you know it, the entire company started to use it.
All these technologies are very different.
There are similarities, which end at, hey, that's a SQL engine. But if you worked in Presto and Trino, the way you triage that versus triaging a Spark SQL application is completely different.
So reliability is something that has always been an issue
since the early days from MapReduce.
And the only way to solve that is to understand
what's happening under the hood.
And a lot of people just don't have that skill set.
Like you know how to drive a car, but you don't know how to fix a car.
It's the same thing.
And what Unravel does is attacks that problem head on by automating all the steps that somebody
would do in triaging.
So collecting logs, collecting metrics, connecting the dots between
all of these different causes and effects, and then bubbling up and saying, look, there could
be a hundred things that could be causing this problem today, but this is what it is, and this
is how you need to resolve it. So instead of just giving you a check engine light, imagine it was more descriptive, and you didn't even have to take it to a mechanic. We just say, hey, there's this problem in your wheel, get this fixed. And that would be a faster way to resolve it. So that's where we have seen the cloud fallacy, that, hey, the cloud is a no-ops or low-ops solution, actually fall flat, because bad code is bad code, right? It doesn't matter where you write it. So you can have the same experience and problems no matter which environment you're running on, if the underlying cause of those problems is similar across these different environments. So while the cloud has made some things easier, it's not a silver bullet that will resolve all your reliability problems itself.
And the way it manifests itself, and coming a little bit to the people side of the question, is it could be an internal or an external application that you're running.
If it's an external one, like your consumers are running on it, say you're doing an online banking app, and if that doesn't work, your customers can't use your services.
And if it's an internal one, then there are people who are waiting for that report, who
are waiting for that analysis that business decisions are getting held up for.
Each of them has an SLA.
And what Unravel helps you do is guarantee those SLAs.
So we've seen in companies where those SLAs were missed 10% of the time, 7% of the time,
sometimes 20% of the time. So we've gone from 80% SLA to about a 99% SLA attainment
for those kinds of different data applications, just because a system is looking over them
and making sure that problems are caught proactively.
And there is a fast solution to fixing that problem without it hairballing into an even bigger issue.
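The SLA attainment figure Kunal cites (from 80% to about 99%) is just the fraction of runs that met their deadline. A minimal sketch, with invented run records:

```python
# Hedged sketch: measuring SLA attainment for data applications, as in the
# 80% -> 99% example above. The run records and dates are made up.
from datetime import datetime

def sla_attainment(runs):
    """Fraction of runs that finished at or before their deadline."""
    met = sum(1 for finished, deadline in runs if finished <= deadline)
    return met / len(runs)

runs = [
    (datetime(2024, 3, 4, 8, 50), datetime(2024, 3, 4, 9, 0)),  # met
    (datetime(2024, 3, 5, 9, 10), datetime(2024, 3, 5, 9, 0)),  # missed
    (datetime(2024, 3, 6, 8, 59), datetime(2024, 3, 6, 9, 0)),  # met
    (datetime(2024, 3, 7, 8, 30), datetime(2024, 3, 7, 9, 0)),  # met
]
attainment = sla_attainment(runs)  # 0.75, i.e. 75% of SLAs met
```

Tracking this per application, rather than one global number, is what makes a miss actionable for the team that owns it.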
And, okay, you said Unravel guarantees the SLAs, but there's also the human factor here, right?
Like, at some point, someone needs to go there and fix something, right? So, how is this
working between the technology and the person who is on call that day?
How is this relationship working with Unravel?
Yeah. So, before Unravel,
you would get the problem,
you would be notified about the problem much later, because now it's visible to somebody. So Kostas did not get his report. Kostas is the CEO of a company, and now in a Monday morning meeting with his exec team he was not able to make the decisions he needed to make. 10 AM, this problem gets logged. Somebody on the data team gets
called. The person on the data team understands if it's a code level problem or infrastructure
level problem, and then tries to ping the relative teams. And by the way, there's a big fight that's
happening over here right now. There's a lot of finger pointing going on. Infrastructure guys are
saying it's a code problem. Code guys are saying it's an infrastructure problem. I'm sure we've
all been there. And then it turns out, okay, say we've identified that it's become a
code level issue. Then we try to find the data engineer who actually created that application
and wrote that piece of code, but then go and debug and dissect that. So as you can see,
this is a very involved process, lots of people in it, lots of time spent. And then this person is going to dig into logs,
check out a lot of metrics.
And by the way, each unit of work can have a 100-page log.
So you look at thousands and thousands of logs to go and understand what's happening.
So it's a very inefficient process, really.
This used to take several hours in man-hours,
which could actually be days in terms of clock time.
And then, you know, don't forget about the loss in productivity.
That even happens on the business side
because the applications aren't working properly, right?
With Unravel, because we are able to do the identification proactively,
you will firstly understand this problem before you see this
problem.
It's like, hey, this application is not going to finish on time, so the Monday morning report is not going to be generated on time.
We'll notify you about that when the app is running and then tell you what you need to
do to go and fix that.
Secondly, because it's root causing the problem, there's no more finger pointing.
It's like, look, today's issue is infrastructure.
Today's issue is code. Today's issue is code.
Today's issue is data layout or your services itself.
So you can pinpoint it and bring the team together on one side of the table, rather than being combative. And then it's giving you a guided remedy, or it's taking an action on your behalf.
So in the guided remedy, it'll tell you what to go and do.
So depending on your role and permission, you go and do those actions and fix or improve the reliability
and performance of this application. But then in a lot of cases, Unravel can also take the action
on your behalf. So you can complete the loop of doing the action as well and see the results. So a lot of times we see people wanting to,
as a simple example,
prevent any app from spending more than $10,000 on the cluster,
as an example.
So Unravel can take that action on your behalf
and stop this data pipeline or this machine learning model
as soon as it nears $9,000, for example, right?
So that you don't have to suffer about it
and then resolve this
problem reactively.
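A proactive cost guardrail of the kind just described can be sketched as a simple policy check. The 90% stop threshold, the alert threshold, and the action names here are illustrative assumptions, not Unravel's actual API:

```python
# Hedged sketch of a proactive cost guardrail like the one described above:
# stop a pipeline before it blows through its budget. Thresholds and action
# names are invented for illustration.

def guardrail_action(spend_so_far: float, budget: float,
                     stop_fraction: float = 0.9,
                     alert_fraction: float = 0.75) -> str:
    """Decide what to do with a running job given its spend so far."""
    if spend_so_far >= budget * stop_fraction:
        return "stop"    # halt before the hard limit is actually hit
    if spend_so_far >= budget * alert_fraction:
        return "alert"   # warn the owner while there is still headroom
    return "allow"

# A job nearing $9,000 of a $10,000 cap gets stopped, as in the example above.
action = guardrail_action(9_000, 10_000)  # "stop"
```

Evaluating this on every cost update, rather than at the end of the run, is what turns a reactive bill shock into a proactive stop.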
So what we've done is improve efficiency,
improve the productivity of this team,
and made it more like teamwork,
that everybody's on the same team rather than being
on different teams, because when problems
happen, that's where finger-pointing starts, so you want to avoid
that as well. Yeah, 100%.
Okay,
and if we switch now
to the cost
management,
again, you have a very unique perspective here
because you have seen things happening on the cloud,
but you also have seen how things
work on-prem, right?
And by the way, there are
cases where you have a hybrid solution,
especially, as we said, in the
enterprise. You might have
data systems running on their own data centers and also have parts of the workloads that are running
on the cloud. But let's say the economics of one and the other are very different.
When you have your own data center, you bought your own hardware, you have it there.
You can't really go and ask for more hardware.
That's probably going to take some time to become available.
And on the cloud, you have a completely different situation. At any time, you can pretty much request whatever you want, right? So the equations there, of trying to figure out what the cost is
when you operate these workloads is different.
Can you help us understand a little bit the differences there
and what it means to operate efficiently on-prem
and what it means to operate efficiently in the cloud?
Yeah, so when you think about on-prem costs,
you're thinking about cost per machine,
the fully loaded cost per machine.
So the hardware for getting that machine,
all the software and services you're going to run on that.
So what's your licensing cost for everything?
And then depending on the type of hardware
you try to depreciate that over three to five years,
straight line depreciation.
So if it costs you $30,000,
it's about $10,000 a year, right?
Just roughly.
On the cloud, obviously,
it's pay by the drink,
you know, 20 cents per hour
for running one machine.
And then, you know,
you keep adding more machines,
keeps adding up, obviously.
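The two cost models can be put side by side with a quick back-of-envelope calculation. The $30,000 machine, three-year straight-line depreciation, and $0.20/hour rate come from the conversation; real-world rates and utilization vary widely, so this only shows the mechanics:

```python
# Hedged sketch: straight-line on-prem depreciation vs pay-by-the-drink cloud,
# using the rough figures from the conversation.

def onprem_yearly(machine_cost, years=3):
    """Straight-line depreciation: the same charge every year of the asset's life."""
    return machine_cost / years

def cloud_yearly(rate_per_hour, hours_per_day):
    """Cloud cost for one machine running hours_per_day, every day, for a year."""
    return rate_per_hour * hours_per_day * 365

onprem = onprem_yearly(30_000)      # $10,000 a year, as in the example
always_on = cloud_yearly(0.20, 24)  # one machine left running 24/7
bursty = cloud_yearly(0.20, 4)      # the same machine used 4 hours a day
```

The `bursty` case is why seasonal or experimental workloads favor the cloud, while large, predictable, always-on fleets are where on-prem tends to win.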
So there's a lot of differences
in how people approach
both of these equations.
In some cases, people say, look, if you have predictable workloads, stuff that just needs to run every day, it's not going to change.
It's going to be the same way every day.
It's better and cheaper to run it on-prem.
That's what we've seen across the majority of the enterprises, especially for large scale workloads. And then if you have experimental workloads,
things that you may be just trying out,
or you've got seasonality in your environment,
you've got Friday, Saturday, Sunday workloads
are bigger than Monday, Tuesday, Wednesday workloads,
for example.
In any kind of situation like that,
having a more liquid environment that can scale up and down is a better use of resources as well as cost.
That's the primary difference.
The way to start thinking about the cloud cost in particular is nobody knows what it's going to be on day one. You can have some sort of an idea
if you break down your workloads into CPU and memory and just the basic units,
you're never going to be right. So it's always good to, again, measure everything from day one
so you can start to see the trends and patterns of these things. So by the end of month two,
month three, you at least have an idea of what this yearly
cost could be.
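The "measure from day one, project by month two or three" idea amounts to extrapolating a yearly bill from a few observed months. A rough sketch with made-up numbers, not a forecasting product:

```python
# Hedged sketch: projecting a yearly cloud bill from the first few months of
# measured spend, as suggested above. Two naive projections for comparison.

def run_rate_projection(monthly_spend):
    """Average of the observed months, extended to 12 months (assumes flat spend)."""
    return sum(monthly_spend) / len(monthly_spend) * 12

def trend_projection(monthly_spend):
    """Extend the month-over-month growth of the last two observed months."""
    growth = monthly_spend[-1] - monthly_spend[-2]
    last = monthly_spend[-1]
    remaining = 12 - len(monthly_spend)
    future = sum(last + growth * (i + 1) for i in range(remaining))
    return sum(monthly_spend) + future

spend = [8_000, 10_000, 12_000]    # months one through three (hypothetical)
flat = run_rate_projection(spend)  # assumes spend stays at the average
growing = trend_projection(spend)  # much higher, since spend is climbing
```

The gap between the two projections is itself the useful signal: it tells you how quickly spend is trending away from the run rate, and where guardrails are needed.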
And then start to put proactive guardrails to avoid exactly the problem that Kostas,
you were talking about, which is, hey, yeah, cloud has infinite scale, but do we want to
give people that power because you don't have infinite money?
And how do we put some sort of guardrails against that? Now, obviously,
looking at just the numbers only tells you part of the story. You've got to talk to your
team and understand what they're actually trying to do. In some cases, they may not even know
that they're creating these inefficiencies. In some cases, it may be that that's their actual use case.
And they're like, yeah, spend $100,000 on that query because it was doing this amazing thing
and we needed to run that way.
And then put the guardrails appropriately. But then people who are running a hybrid environment,
they're also using it in a unique way because they're thinking of, let's use the power that we have on-prem. And only when we need to burst workloads, only when we need to scale up workloads,
then we use the cloud. But then
everybody's got their own patterns and
anti-patterns and how they run these things, but these are
the most common ones that we see.
That's super interesting.
One last question for me, because we're close
to the end here.
How have things changed
because of AI, and I'm
talking about data observability
here, and I don't necessarily care that much about how it changes in terms of helping
someone to perform observability, but more about how we implement observability when we are
implementing AI. It's different, I would assume, when you have BI. It's different when you have ML.
It might be even more different when you have AI, although they're similar with ML.
But what have you seen out there? I'm sure you have much more experience with that.
And I'm very curious to hear from a vendor what is missing today, or what works, when we're trying to actually bring the same value of observability, but when we are doing AI.
Yeah, look, AI is super interesting. Every company is rushing to create some innovative products with AI, or at least they're starting off with using AI to improve their own operations, right?
But when you break it down,
it's again a series of data steps
and sequencing of data steps that need to happen
to create meaningful AI outcomes.
So a couple of steps are actually similar
to say BI workloads,
where you would have your ETL or your ELT of bringing
data in and prepping that data as a common step.
So in fact, we almost always recommend to people: think about all your data apps as
being modular pieces, and think about what you can repeat and reuse, so that you do well on cost as well as efficiency.
So that's one of the ways.
But yeah, to answer your question, you still have to have something that can observe multiple systems.
Because AI is, again, not a one system or one technology-based app.
You need something for data ingestion,
you need something for data modeling,
you need something for running your AI algorithms
on top of, something to serve it, et cetera.
So you need observability that is capable
of measuring things in multiple services
across multiple environments.
And what we're seeing, this is becoming very real,
is people are actually moving
to a multi-cloud environment as well. So you need a technology that cuts across these
pieces too. Now with AI, you will again have more teams and more users using your data platform
because the ideas for AI-generated apps are going to come from everywhere in the organization.
You're going to have your legal teams, for example, jumping on and saying, hey, we can
use this data set.
We're doing these amazing things with AI for our company, which means that leaders need
to be even more careful and recognize that you're going to have varying skills of people.
And with that may come in more complexity and inefficiencies into your platform.
So having observability from the get-go to measure all these pieces is going to be even more crucial as you take on this AI work.
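Concretely, observability that "cuts across" ingestion, modeling, training, and serving usually starts with tagging every metric or cost event with its pipeline and stage, then rolling them up. A minimal sketch, with invented event records and field names:

```python
# Hedged sketch: rolling up cost events by (pipeline, stage) so a single AI
# app's whole chain is visible in one place. Records and names are made up.
from collections import defaultdict

def rollup(events):
    """Aggregate cost per (pipeline, stage) across all observed events."""
    totals = defaultdict(float)
    for e in events:
        totals[(e["pipeline"], e["stage"])] += e["cost"]
    return dict(totals)

events = [
    {"pipeline": "churn-llm", "stage": "ingest", "cost": 120.0},
    {"pipeline": "churn-llm", "stage": "train",  "cost": 900.0},
    {"pipeline": "churn-llm", "stage": "serve",  "cost": 75.0},
    {"pipeline": "churn-llm", "stage": "ingest", "cost": 30.0},
]
by_stage = rollup(events)
# {('churn-llm', 'ingest'): 150.0, ('churn-llm', 'train'): 900.0, ('churn-llm', 'serve'): 75.0}
```

The same rollup works across clouds as long as every service emits events with consistent tags, which is the hard organizational part of multi-system observability.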
Okay, that's great.
Eric, back to you.
Yes, Kunal.
Okay, I have to ask on a personal note,
now having done a consumer startup and an enterprise startup, would you ever go back to consumer? Is that an itch left to scratch again?
Eric, for sure. There are all these exciting things that need to yet be created on the consumer side, believe it or not.
But yeah, that's going to be one of the companies, you know, that I do create in the future.
I don't know how much in the future, but definitely an itch to scratch.
Awesome.
Well, thanks so much for joining us today.
We learned so much.
And best of luck with Unravel and your future consumer app.
Thank you.
Eric, Kostas, thank you so much for having me here.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.