The Data Stack Show - 147: Where Data and Infrastructure Converge Featuring Lars Kamp of Resoto

Episode Date: July 19, 2023

Highlights from this week's conversation include:

- Lars' work on Resoto in helping to cut cloud costs for organizations (2:02)
- The trend of large resources to micro resources (5:59)
- What are some of the typical resource drains in data infrastructure (8:56)
- Managing cost on the backend with scale and experimentation (12:51)
- Solutions for resource management problems (17:38)
- How Resoto is solving pain points in resource management (26:17)
- Navigating the complexities of data infrastructure (29:01)
- Resoto's solution for interpreting difficult cloud data products (36:35)
- Exploring relationships of data points and finding solutions (43:40)
- Querying in a graph database (47:46)
- How to go from graph to SQL (49:13)
- How can data teams plan for costs in the coming years (50:53)
- Final thoughts and takeaways (53:49)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack. They've been helping us put on the show for years and they just launched an awesome new product called Profiles. It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake. You should go check it out at rudderstack.com today. Welcome back to the Data Stack Show. Kostas, another new topic. This has been
Starting point is 00:00:38 a great spring and early summer chatting about things that we haven't really covered on the show a ton before. Today, we're talking with Lars Kamp, and we're going to talk about resource management, which is everyone's favorite topic and the data stack. But for real, I think this is becoming a really big concern, especially in the macroeconomic climate. Understanding how to run a cost-efficient data stack is becoming more and more critical. And it's very difficult to do because of the complexity of the systems. It's not just about managing, let's say, your warehouse compute bill, for example, right? Especially teams that run, you know, sort of large experimentation platforms and need to spin up and down resources. There's a huge amount there. So Lars has worked with a team that's created a tool to help you do that, which is really fascinating. I want to start usually where we always start, which is the nature of the problem. Resource management is not something I
Starting point is 00:01:45 have a ton of personal experience with, and it sort of goes deep into sort of DevOps, you know, SRE world, which is fascinating. So I think it'll be fun to cover that topic on the show. Yeah, 100%. I won't say much. The only thing that I would say is that, although it might sound that the topic today is not directly related to data, we are actually together with Lars going to prove that resource management is a data problem. Let's go and do that. That is a great teaser.
Starting point is 00:02:18 Let's dive in and talk with Lars. Yeah, let's do it. Lars, welcome to the Data Sack Show. Hey, Eric. Hey, Kostas. Good to see you both. For sure. Well, you know, of course, we go back a little ways, but tell us about Risotto and what you're working on. Yeah, Risotto, the name stands for resource tool.
Starting point is 00:02:40 And our value prop is we cut your cloud costs by 50%. Risotto is a cloud-assisted inventory for infrastructure engineers. And the magic behind cutting your cloud costs by 50% is that we find and delete the expired resources in your cloud,
Starting point is 00:02:58 aka zombie resources that drift, along with all the associated resources. This is one use case. You may have heard This is one use case you may have heard about in software, you may have heard about garbage collection. And we do the same for your cloud infrastructure. And when I say cloud infrastructure, I think about AWS, GCP, but also Kubernetes. For sure. Okay, so I want to break this problem down, right? And so when you think about waste and cloud infrastructure, I mean, you're talking about an extremely broad footprint,
Starting point is 00:03:32 potential, you know, potential problems. What are the things that drove you to actually like invest and work on Risotto as a product? Were there particular problems that you saw in terms of resource management that sort of said, this is a major problem? Yeah. What we saw is a trend to what I would call peak ops.
Starting point is 00:04:01 You have all these different DevOps, FinOps, DevSecOps tools, and they all use the same approach. There's usually an agent that you install in your infrastructure to get data out of your infrastructure. And then you take some sort of remediative action that very moment. It's all very real-time.
Starting point is 00:04:23 Like alerting against thresholds, et cetera. That's right very real-time very reactive. Like alerting against thresholds, etc. That's right. That's right, right? But if you look at what has happened to cloud infrastructure in the past decade, five years, two years, a number of trends. So,
Starting point is 00:04:40 number one, we see a smaller size of function. And so you've gone from a beefy compute instance to a smaller size of function, right? And so you've gone from a beefy compute instance to a very tiny Lambda function that may have a lifespan of minutes in some cases, right? But you're dealing now, and as the size of function goes down, the volume of resources has gone up at the same time, right?
Starting point is 00:05:00 And so, you know, you may, let's just say, you may still be spending, I don't know, I'm making this up, a million a month, but you're not spending it on 10,000 resources. Now you're spending it on a million of resources that cost you a dollar a month, just speaking of orders of magnitude. Yeah, just distributing the crossover, a larger footprint. Yeah. So that's one change, right? That's driven by the product roadmaps of the cloud providers.
Starting point is 00:05:23 The second change is that, well, to deal with this large number of resources, you can't do that in a console anymore. You need to do that in code. And so the second change is infrastructure as code, right? So you have tools like Terraform, Pulumi, or CloudFormation that you use to deploy and manage these resources. And so the lifecycle of these resources really shortens if they get updated a lot, right? And so now you're dealing with an inventory that's not only larger,
Starting point is 00:05:56 but it also changes all the time. So there's this change that's going on all the time. So that's the second one. And I would say the third one is that when people hear cloud, they usually think of production environments. Like, so my app that's running somewhere. There's also this world of tests
Starting point is 00:06:14 and staging environments that has grown much, much faster because developers want to have the liberty to experiment. Like, I'm going to spin up this experiment. And so it's these three trends that contribute to this trend of growing cloud infrastructure and the intractable complexity that comes with it. Yeah, absolutely. One question, when you think about the
Starting point is 00:06:38 sort of migration from large resources to micro resources, right? So you were spending, you know, a million dollars on, you know, 10 things and now you're spending it on thousands of things. What's driving that? Like, what's the, is that architecture? Is that, I mean, you said the product roadmaps, but can we dig into that just a little bit more? Yeah.
Starting point is 00:07:02 If you think about the portfolios of the big cloud providers, I think the number of products that AWS offers, I think the number last time I checked was like 382 different products. That doesn't include all the different SKUs, the stock keeping units below that. And then maybe the different flavors, right?
Starting point is 00:07:27 And so, you know, if you start counting the number of APIs that are associated with those products, it goes into the thousands and tens of thousands, right? And the cloud developers have developed that in response to market needs, right? And I think one of the trends is obviously the trend to microservices. I know as we record this podcast, there was this little snatch move that Amazon had where they said, oh, Amazon Prime Video, we migrated it back to a monolithic application.
Starting point is 00:07:57 But really, you know, it's customer demand. And one big shift was the trend to microservices and smaller components of an application. And on the testing and dev side, just curious, how much of that do you think is driven by the rise of large scale ML practices inside of companies? So I mean, sort of the extreme end of that is self you know, self-driving cars who need to run like really significant tests and, you know, sort of validations before they roll things out to production. I mean, you're talking about, you know, things that most companies would only dream of in terms of production scale models running and they're doing as sort of a test.
Starting point is 00:08:45 Is that sort of, I know that's the sharp end, but is that sort of a driver? Yeah, and I think actually that's where the world of data and infrastructure, where they kind of converge. Let's speak broadly about data products. And I would include machine learning, AI, into that needs to run somewhere, right?
Starting point is 00:09:06 And for that, you need infrastructure. And usually that's probably Kubernetes because you have elastic workloads, you need orchestration. And so, yes, depending on the industry and how mature they are, then yes, I would argue that a lot of those workloads are driven by the data world.
Starting point is 00:09:26 Yeah, that makes total sense. Okay. So Risotto helps you eliminate waste. Where in the landscape and what kinds of resource drains, or sorry, what are the causes of sort of maybe the most acute resource strains? Where do those come from? And I guess maybe to direct the question a little bit is, are those clustering around use cases, right? Like, you know, we need to spin up a, you know, a cluster of, you know, 32 nodes, you know, whatever it is, in order to do this thing. And then, you know, we forget to spin it down. Is that being driven by a particular type of use case? Or is it sort of agnostic? And it's just sort of a general problem?
Starting point is 00:10:15 I think it's the latter. It's more of a general problem. And you nailed it, right? Like there's a developer says, okay, we're going to test this. We're going to spin up a workload, right? But there's also machines that spin up workloads, your CI, CD pipelines, auto scaling and all of that, right? And then, as you said, in theory, these tools should all clean up after them automatically. Reality is that doesn't always happen, right? And then also developers might forget about it they're humans too right they're under pressure they need to deliver right and so do i spent my time my friday on you know finding my next experiment next week you know to help the company ship product or do i spent my time like sifting through my consoles finding whatever needs to be cleaned up i think the answer is clear right the
Starting point is 00:11:06 answer is very clear you know how is this i mean what we're talking about here really is technical debt right how is this okay you i yeah yeah i'm interested in this yeah i guess that's one way of looking at it right but i think there's always going to be technical debt in anything in software and data. And I think the way we like to look at it is, look, we want to, at the end of the day, this comes down to control, right? How do I control this giant pile of resources in my infrastructure? And, you know, on the one hand, you want to give developers the freedom to experiment and spin up resources. But on the other hand, as an infrastructure engineer in charge of this, I want to stay in
Starting point is 00:11:58 control. And, you know, you can give lots of freedom, but then you're not in control anymore. And if you're trying to impose too much control, then nothing gets done. And so what we're saying is, why can't I have both? Yeah. Right? And then as we start looking at the problem, well, how can I do that? If I have this giant pile of resources,
Starting point is 00:12:25 then you quickly come to an answer that includes data. And that means collecting data about the state of the infrastructure. Not in real time, not in high granularity, but something like a snapshot every hour. And as I collect that data about my infrastructure, I get a good picture of what's going on in my infrastructure. So this whole concept of exploration is something that I think we know
Starting point is 00:12:58 from the analytics engineering world and everything that has happened with the modern data stack in the past five years. And I think as we've seen these changes with infrastructure, and everything that has happened with the modern data stack in the past five years. And I think as we've seen these changes with infrastructure, we can apply some of these lessons learned to infrastructure data. Yeah. So that's kind of how I like to look at the world.
Starting point is 00:13:18 Yeah. Let me push on that a little bit more because, Lars, I know you're a man of conviction. Do you envision a world where, because Lars I know you're a man of conviction do you envision a world where because I agree with you like the you know in the sort of world of like infinite compute in the warehouse and you know analytics engineering you know you're sort of you know able to explore with you know unbounded vigor if you can say that, right? And there's low likelihood that your SQL queries are going to cause someone to tap you on the shoulder and say, this is causing a problem, right?
Starting point is 00:13:54 There's probably a lot more happening elsewhere. Do you envision the same thing on the infrastructure side where really we should look at, we should operate as if these are infinite resources and we have tools that help us manage the cost control on the backend as we explore what is possible with scale and experimentation? Or do you have a more measured approach where you need to sort of consider the constraints going in? I think it's the former, right? You don't want to put boundaries on experiments.
Starting point is 00:14:39 I mean, you kind of have to, right? But... Sure, there are physical limitations. Yeah, yeah. But spinning up and spinning down clusters, and worrying about the cost afterwards. I mean, in my mind, and I'm, you know, I'm obviously showing my cards here. But in my mind, I mean, we should sort of spin up and spin down and like, do a spike and do a huge experiment and worrying about the cost to me i mean that's a huge accelerant to a company if you can sort of control that well let's go okay
Starting point is 00:15:13 so look going back to the problem giant pilot resources lots of experiments that are sort of driving that but we don't want to go back and say, okay, now you cannot do an experience anymore, right? But we want to be in control. We want to know when that happens. And I think if I can find a way to give my development team the liberties to use all the tools at their disposal out there, all the different cloud products, then I think that's to the benefit of the company, right?
Starting point is 00:15:46 The but now is like, but what do I do to stay in control, right? And I think the existing approaches include more ops tools, like, okay, let's monitor this, let's instrument that, let's deploy an agent there, right? And I think what we're proposing is what our conviction is here with Risotto is, well, look, there's a place in time for tools that give you real-time data with high granularity, right? And that's probably for your production applications. But for everything else, you know, you probably don't need real-time. And you probably do not need, like, second granularity.
Starting point is 00:16:20 A snapshot is enough. And I think that's a concept. You know, this is the data stack show and i think it's a concept that will i would expect your listeners it'll resonate with your listeners it's like we have this in analytics right and so where i run batch jobs from all my sales and marketing systems and i unify all this data in my snowflflake or Redshift cluster, and then I analyze it. And then I apply the insights from my systems, either in a dashboard, or maybe I use something like reverse ETL
Starting point is 00:16:55 where I make it actionable, right? And so I think that chain, you know, ETL my data, put it into a singular inventory, analyze it, create metrics, react to those metrics, put it back into production action. That's something I think we can apply to the infrastructure world. And so the basic concept here then would be to say, okay, so you're a developer, you have your account, you have your infrastructure resources, and you go crazy. Go at it. All we need to know is what exactly is you have your infrastructure resources, and you go crazy. Go at it, right?
Starting point is 00:17:27 All we need to know is what exactly is happening with your infrastructure. And what we do is we collect data from your infrastructure. We can go into detail how exactly we do that. But at the end of the day, it's almost like data integration for infrastructure engineers. We now go into your infrastructure. We call the cloud APIs, the same APIs, you know, Terraform or Pulumi uses to deploy resources. We call the same APIs now for data extraction. Like,
Starting point is 00:17:52 tell me about these resources that are running. Tell me about their configuration. Tell me about their state, right? And we put that all into a single repository. Yeah, that's fascinating. I mean, there really seems to be a pretty clear parallel to, you know, sort of analytics on the modern data stack. Let's step back just a little bit. How are SREs or infrastructure engineers doing this today? I mean, it's obviously a problem or you wouldn't be trying to build a solution for it. And it's very compelling to hear about. I mean, I think about it almost as like an executive dashboard, but for resource management, where you say, this trend is concerning and we need to go
Starting point is 00:18:38 back and understand the lineage and the cause of this. And so we're just going to trace it back and fix the problem, right? Yeah. Ironically, similar to BI, but you're building a solution. So how are people solving this today? I mean, one thing he mentioned was a lot of monitoring tools, which obviously is not, you know, helping, maybe that makes it more complex. Yeah. And we can go through the options and maybe talk a little bit about the pros and cons of these different options. So number one is, as you said, I call them XOps tools, right? So different operational tools.
Starting point is 00:19:16 I think it's in the name. It's not an analytics tool. It's an operational tool, right? That collects data in various ways from the infrastructure for a very specific and opinionated use case. That's one. There's definitely the world of scripts where infrastructure engineers have built their own little, right?
Starting point is 00:19:39 Use some sort of governance tool. You can cloud custodian is a good example. So yeah, it's the world of scripts, basically, YAMLs, right? The third one is the world of consults. But that is also very constrained
Starting point is 00:19:52 because if you think about like, if these companies operate, even as a startup, right? You operate in different regions, you know, each developer gets more, maybe an account. And sort of the number
Starting point is 00:20:02 of combinations you go up goes into thousands and tens of thousands, right? And so that stops working. But those are the three things that we see today. And then the fourth one is some of the cloud providers. You have native cloud provider products. Google has a product called, it's called Google Cloud Asset Inventory.
Starting point is 00:20:25 They do something similar to what Risotto does. And, you know, lo and behold, extracts the data into a BigQuery instance, right? AWS has a product called AWS Config that extracts configuration data from your resources and stores it in an S3 bucket, right? And from there, you can query it with Athena and you can visualize it in the QuickSight dashboard. So I think the parallels to the modern data stack are pretty obvious to this world here. Yeah, yeah.
Starting point is 00:20:54 Okay, two more questions. One is just my curiosity and the other one, I think, will be a great handoff to Kostas. The first one is how big is this problem? I mean, cloud costs are a very hot topic. I mean, there's obviously macroeconomic influence to this where data leaders, infrastructure leaders are trying to control costs. But regardless of the macro environment, we sort of have this weird world of infinite scalability, but it can bite you. So how big is the problem?
Starting point is 00:21:37 I think everyone has theirs. So no matter if you're small or big, it's just like, how urgent is it of a problem for you? And I think you nailed it. So macroeconomic changes, that's definitely one driving factor. I think it also depends on the industry. If you're a SaaS application, then probably cloud cost
Starting point is 00:22:00 is a first order business problem. Yeah, sure. Gross margin. Gross margin, right? And so I think everyone is affected by it. If we talk to users, our open source users and customers today, I think the common theme is always like, gosh, I wish we would have done this two or three years ago. This being putting something into place that prevents sprawl in the first place.
Starting point is 00:22:37 It doesn't just react to it. You know, that's also one of our underlying principles. Today's approach to anything security or cloud cost is like okay we wait for it to happen you know we drive the car off the cliff and i'm exaggerating right and then we're going to take action right yeah sure you know it's like oh we can't worry about this later we're growing and we're saying is well you can actually have both right but just prevent it don't even yeah have don't even wait for the sprawl to happen. Yeah.
Starting point is 00:23:08 Well, okay. So I have a sort of a 1B question here. This world of sort of infinitely scale of resources. I mean, people who have sort of, you know, recent experience here, you know, sort of young in the industry, you know, maybe that's very common to them. You know, I think people who've been working in infrastructure for a long time are very sensitive to cost control because that was a very big problem, you know, not too many years ago.
Starting point is 00:23:37 But how many, you know, one thing that's really interesting when you think about the sprawl of this problem across a very complex infrastructure. I mean, I think the example you gave about sort of moving to lambdas, even, you know, the sprawl is unbelievable, and has actually happened very quickly. There are probably really smart infrastructure engineers who don't necessarily have the, you know, sort of, you know, innate sense of how to manage that sprawl or even the experience to know when slippage is happening. Do you see that as a big problem? I mean, on some level, we're talking about drastically increasing complexity. And you have really smart people where maybe they, it's very difficult for them to detect,
Starting point is 00:24:26 not because they're not smart, but because there's slippage happening across a million different vectors on a small scale that add up to a pretty big problem. Yes, you nailed it. And it's not just cost, it's security as well. And I think at some point we may want to talk about what this common data layer, this common infrastructure data layer, what problems that layer can solve for us. But what you said, it's exactly right and i think the actual problem there is executive awareness it's a little bit like you know every time we go through a platform shift you know it's like oh 15 you know 15 years ago or whatever 10 years ago you know you you're a ceo of a company and all the all of a sudden you needed to understand what mobile ads are, right?
Starting point is 00:25:27 Oh, it's a new distribution channel, right? And it's like you had to really dig in and start to understand it. The company who chose not to understand that, the CEOs, they went out of business, right? Yeah. And I think we see the same going on right now with, you know, chat GPT-4 and all these things. As an executive, you need to familiarize with these things and how they impact your business. And I think it's the same for infrastructure. And what I observe, and this is also a little bit of just my personal opinion, that number one, tech or execs are not always very aware of what's going on with their
Starting point is 00:25:59 cloud infrastructure. I think that leads to change. In general, they look at developers as a productivity asset, they're building developers as a productivity asset. They're building code. We're shipping product. Whereas an infrastructure engineer, like an SRE, is more looked at as a cost center. So we're trying to hire lots of developers, but we're going to try to limit the number of SREs. And there's not a single SRE or infrastructure engineer that I know who's not stressed out. Yeah, sure.
Starting point is 00:26:24 Right? They're the ones who are holding the ship together, right? And something's got to give. And usually, you know, that something is either the cloud bill, you know, security, and all of that. Okay, so question number two, and this is where I'm going to need
Starting point is 00:26:40 Kostas to jump in here, but... This is your third question. No, I said 1A, 1B. Oh, Gamma. Okay. All yours. Okay. How does risotto actually solve this problem? There you go, Kostas. My second question. Oh, that's just it. Okay, great. Yeah. So, yeah, and I don't want to turn this into a commercial for Risotto. I'm actually way more excited to talk about solving this problem with data. And I think that's the fun part of the show that we can talk about now, right? We define the problem, right?
Starting point is 00:27:18 Lots of sprawl. We don't know what's going on in our infant start shared. It causes all sorts of problems, cost, security. And how do we get back in control? And the point number one is, well, I need to have data about the state of my resources. How do I get this data? And when you look into acquiring that data, like with anything, the data acquisition part is the hard part, right? And what we have done is we have built a set of collectors that calls cloud APIs and collects metadata from these cloud APIs.
Starting point is 00:27:55 What's metadata, right? Let's take it. It's something like, okay, I have an EC2 instance. I have an EC2 instance. What time is it start date, what time did we see it first in the infrastructure, how many cores does it have, what is the
Starting point is 00:28:13 attached storage volume, what VPC does it run in, so it's information about the state and the configuration of the resource, but also the relationships of that resource to the other assets in the cloud. And that's really an ETL process, right?
Starting point is 00:28:32 So these collectors, they run on the schedule. By default with us, it's one hour, but you can ramp it up to whatever fits your needs, 30 minutes, 20 minutes, 15 minutes. And they run, there's a worker, 30 minutes, 20 minutes, 15 minutes. And they run these, you know, there's a worker, it runs and we extract this data and we put it into a single place. In our place, in our case, that's a graph database, not a cloud warehouse, but there's also ways that you can export it to, you know,
Starting point is 00:29:00 like a Snowflake cluster or S3. We have a product called Cloud2SQL, which obviously, as the name suggests, transforms the data into tables and rows. For the core product for Zorio, we chose a graph database. We can talk a little bit about why we did that, but I think
Starting point is 00:29:17 for the listeners of this show who come more from a modern data stack, I think the part that will resonate is like, look, there're connectors. We know how to talk to cloud APIs, extract data, test the data, transform it into a unified format, and put it into a single place. Lars, now I think I can ask my questions, right?
Starting point is 00:29:38 Right, Eric? Am I allowed? Yes. So, okay. Before we get deeper into the risotto technical details, I want to ask you something related to the conversation you had with Eric earlier. So you mentioned like the complexity around the cloud infrastructure today, right? Yes.
Starting point is 00:30:02 And I think like pretty much like every person who has worked, let's say, in this industry the past 20 years, they know that what we are trying to do is add more and more abstraction layers to simplify the way that we interact with infrastructure, right? Like from bare metal, today we have serverless with Lambda functions and all that stuff, right? Yes, yes.
Starting point is 00:30:24 So my question, there are two questions here, actually. The first one is, to me, it feels like there's a kind of paradox here, right? Like, one, we are trying to reduce, let's say, the surface area of how we interact with infrastructure by all these abstractions. But at the same time, it becomes harder and harder to actually understand how the infrastructure we are using is operating and is part of our product. And that's what it becomes like to fill this gap and so on. But why do you think this is happening?
Starting point is 00:31:00 Well, you mentioned it, but some of it is driven by these different levels of extraction. So this evolution from bare metal to today we have Kubernetes, so we're running these different pods, they run on some sort of machine that we don't even see or know of. Even just the question like, okay, what is this pod running on with machine? That's not straightforward to answer, especially if you have thousands of them, right? But I think that's these different level of abstractions that drive that.
Starting point is 00:31:34 The second one is that, ironically, there's this thing called the well-architected framework that AWS has, and that suggests to separate your workloads into different cloud your workloads into different cloud accounts, into different regions, right? And that happens for control reasons, for security reasons, for failover reasons, all of that stuff, right? And so what on one end is a really good principle to apply to be, you know, to have a secure
Starting point is 00:32:01 infrastructure, to have a resilient infrastructure, to have a scalable infrastructure, just adds to the amount of fragmentation and therefore loss of control over the resources running in your infrastructure. Yeah, that makes a lot of sense. And okay, like from the SRE point of view, let's say we have a person who is at least aware of like all these different resources that are participating like in this infrastructure right but they are not the only people who are interacting with that right like the rest of the engineers are actually interacting even indirectly with all these things so from let's say the sre perspective let's's say we're using a tool like Risotto.
Starting point is 00:32:46 We have a much deeper understanding of what is happening there, like the behaviors that our engineers have with all these resources. How easy or difficult is it to communicate these things back to our product engineers? Because, as you said, SREs are not enough, right? They have many things to do, and the complexity of the problem they are dealing with is exploding, right? Yes. How do we communicate to the rest of the engineers about these things that Resoto is helping to solve? Yeah, I obviously love talking about Resoto, but let's abstract the problem from the solution and how we solve it, right? I think what you want to do is make engineering efficiency a KPI for your product engineers.
Starting point is 00:33:38 Yeah. And one of the problems with that is that they have zero visibility into the cost of a resource, the lifetime of a resource, right? There are other tools out there that solve the deployment problems. Like, well, if I deploy this, how much actually does this cost me, right? But just making that part of good engineering habits, efficiency, I think that's the first change we need to do. Some have that, some don't, right? And then we can debate about, okay, how exactly do we do that, right?
Starting point is 00:34:13 And for instance, let me give you an example of what one of our customers has done, a company called D2IQ. They are a managed Kubernetes provider. They introduced a very simple process by introducing two tags. Tags are basically key-value pairs attached to a resource.
Starting point is 00:34:41 And they chose two tags, a name, the name of the engineer who deployed this resource and an expiration date of the resource. Meaning, and they have certain rules in certain accounts, like look, you know, if you're in these accounts,
Starting point is 00:34:54 no resource should live longer than two days or should live no longer than Friday night, 7 p.m. of the week it was deployed, right? And that's a date, right? And so the name is absolutely required because if the resource is running, and I'm an SRE, and I have developers sitting across the globe,
Starting point is 00:35:15 I want to know who deployed it. I want to have a quick way to talk to that person. It's like, hey, what are you doing with this? That's number one. Number two, the expiration tag is just a way to say, look, this is the lifetime of the resource. And once you go beyond that date, we're going to clean this resource up in test,
Starting point is 00:35:33 not in production, right? And so what do they use Resoto for? Well, number one, once the resources are deployed, they use Resoto to check if every resource adheres to those two principles, to those two tags. And then two things happen. If the tags are incorrect, they use Resoto to correct the tags. There's a little bit of logic. People make typing errors and all sorts of mistakes, right? And so we correct those resources. For instance, if the
Starting point is 00:36:02 expiration date is longer than what's allowed in the policy, we automatically shorten the lifetime of the resource to the max allowed. Now, if neither of those two tags has been applied, D2IQ just deletes the resource in test. And I know that may seem draconian to some people, but we'll talk about that in a little bit. They had huge efficiency and developer productivity improvements.
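The two-tag process Lars describes here can be sketched in a few lines of code. This is a hypothetical illustration of the policy, not D2IQ's or Resoto's actual implementation; the resource shape, tag names, and the two-day limit are assumptions for the example:

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: every resource needs an "owner" tag and an "expires"
# tag no further out than MAX_LIFETIME after the resource was created.
MAX_LIFETIME = timedelta(days=2)

def enforce_tag_policy(resource, now=None):
    """Decide what to do with one resource, assumed to look like
    {"id": ..., "created": datetime, "tags": {"owner": ..., "expires": "YYYY-MM-DD"}}."""
    now = now or datetime.now(timezone.utc)
    tags = resource.get("tags", {})

    # Neither tag applied: delete the resource (in test accounts only).
    if "owner" not in tags or "expires" not in tags:
        return "delete"

    try:
        expires = datetime.fromisoformat(tags["expires"]).replace(tzinfo=timezone.utc)
    except ValueError:
        return "fix-tag"  # a typo in the date gets corrected, not deleted

    # Expiration set further out than the policy allows: shorten it.
    if expires > resource["created"] + MAX_LIFETIME:
        return "shorten"

    # Past its expiration date: clean it up.
    if expires < now:
        return "delete"
    return "ok"
```

In practice, a tool would run a check like this over every collected resource and then take the corresponding action, which is the loop Lars describes automating.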
Starting point is 00:36:29 And at first they said, oh gosh, I can't work like that. But it turns out you can. And just those two simple tags, an owner, the name, and an expiration date, and then deleting the resource when it has reached its expiration date, has done wonders, right? I think in their specific case, they saved 78% of their infrastructure spend, which is unheard of, right? So it doesn't need to be complex, right? If you change the philosophy and you make it a KPI of the engineers, and
Starting point is 00:37:02 then you give them the tools to reach that KPI, lo and behold, it'll happen, you know? It'll happen. Yep, 100%. Yeah, that's awesome. And let's go back to Resoto now and the technical side of things. Because it sounds like we are dealing with a data problem here.
Starting point is 00:37:21 Yes. So when you are talking about extracting data from the cloud provider, right? You probably know much better than me, but AWS has hundreds of products, right? Yes. Is each of these products, let's say, a separate data source for you? Are there differences in terms of how this data looks, and how do they differ from each other? Because they are different products that we are talking about, right? We have things from CloudFormation, which is one product, to EC2, which is a very different kind of resource, right? How do these things work?
Starting point is 00:38:05 Like, how does the data look, and how rich does this data model have to be in order to represent all these products? Yeah, yeah. It's a zoo, right? So let me peel the onion here a little bit around products. What's a product?
Starting point is 00:38:22 What's a cloud product, right? So I think AWS's overall portfolio is something like, I think 382 different products, right? And then let's take one product as an example, compute instances, right? If you double click EC2,
Starting point is 00:38:38 Elastic Compute Cloud, if you double click on that, then EC2 in itself has, I think, like 200 different compute instance types, right? So it adds up. And then each product has its own set of APIs. As I said, those APIs are usually optimized for deploying resources, less so for extracting
Starting point is 00:39:02 data. And it also depends on the cloud provider. GCP, you know, they came later to the market. They've done a lot of things right when it comes to their APIs. They're a lot more consistent. But for AWS, on the other side, things can be pretty inconsistent. And what do I mean by that? Let me give you a very specific example.
Starting point is 00:39:22 If I just want to know the age of a resource, like if I talk about the life cycle of my resources, I want to know, on average, how old are my resources, right? Why would I want to do that? Why do I want to know that? Because, well, you know, if my resources tend to be old in terms of like months, and they don't tend to be updated frequently, then it's probably an indicator that we're not doing a lot of development, right? On the other hand, if they have a short life cycle, they get updated frequently, a great indicator of lots of development activity. So how do I get an age of a resource? Well, I need some sort of timestamp, right? Well, AWS does provide that timestamp. But let me give you two very specific examples for timestamps. So EC2 has a date timestamp, right? The compute instance.
Starting point is 00:40:10 And it's, you know, I think, what's the ISO norm called? 8601, right? So it's a string. It's year, month, day, hour, minute, second, like that. That's a timestamp. So I can easily calculate the delta to today's date, or the time right now. Then there's a second product, I'm just picking two random examples: SQS, queues, right? They have a property called CreatedTimestamp, and that value is the number of seconds that has elapsed since the Unix epoch time, which
Starting point is 00:40:39 is zero o'clock on January 1st, 1970, right? And so you have two different products with two different APIs that kind of tell you the same thing, which is, you know, when was this resource created, but in completely different formats. And that happens across hundreds of resources
Starting point is 00:41:00 and thousands of APIs. So extracting data from these cloud providers is not really straightforward. You need to put a lot of work into building the connectors and understanding the data you get and then extract it and represent it in a way that it's consumable for the user. That was a little bit of a long explanation,
Starting point is 00:41:18 but I hope that explains the problem a little bit more in detail why this is hard. Absolutely. I think it's very interesting. I think people that never had a reason to go use these APIs, they probably consider, let's say, AWS as one big API, right? But actually, that's far away from reality. There are actually hundreds of different APIs, hundreds of different
Starting point is 00:41:48 data models, and all of these need to be aligned if you are going to process them in a consistent way. And that's the ETL part of the data problem, right? And when we are talking about data for resources, I would assume that we are talking about, as you mentioned, the creation date. For example, there's a set of metadata fields that describe these resources. What other information are we talking about here? There are names, dates, there are probably custom key-values that people use as annotations for whatever reason, like the ones you mentioned to apply processes there.
Starting point is 00:42:32 Give us a little bit more of how this data looks. Even for an EC2 instance, right? Let's say the default row that describes an EC2 instance. How does it look? Oh, I mean, you're going down a rabbit hole with this one, right? So the number of fields and properties can go into
Starting point is 00:42:54 their dozens for a specific instance, right? And, you know, we have a whole section on data models on our website. But a create timestamp, last updated, name of the resource, tags, and then depending on what the resource is.
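To make the earlier timestamp point concrete: the two formats Lars contrasted, EC2-style ISO 8601 strings and SQS-style Unix epoch seconds, can be normalized with a small helper. This is a sketch under those two assumptions only; real connectors deal with many more variants and field names:

```python
from datetime import datetime, timezone

def parse_created(value):
    """Normalize a cloud 'created' timestamp to a timezone-aware datetime.

    Handles two shapes: ISO 8601 strings like "2023-07-19T00:00:00Z"
    (EC2-style) and seconds since the Unix epoch as an int, float, or
    numeric string (SQS-style CreatedTimestamp).
    """
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    s = str(value).strip()
    if s.isdigit():  # epoch seconds delivered as a string
        return datetime.fromtimestamp(int(s), tz=timezone.utc)
    # fromisoformat() on Python < 3.11 rejects a trailing "Z"
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def age_days(value, now=None):
    """Delta between now and the resource's creation time, in whole days."""
    now = now or datetime.now(timezone.utc)
    return (now - parse_created(value)).days
```

With a helper like this, "average age of my resources" becomes one pass over the collected data instead of per-API special cases.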
Starting point is 00:43:16 Also relationships. Like, oh, what is attached to this EC2 instance? It's dozens. In some cases, it's dozens of fields, in some cases, it's like, you know, eight or nine, but it's a lot, right? And then I think the underappreciated skill there is
Starting point is 00:43:34 that if you want to work with this data, and that's the value that some of these XOps tools provide that I've mentioned, right, is you have to understand the data model for each individual cloud provider and resource, right? And so all of a sudden you have to become an expert
Starting point is 00:43:52 in, you know, 382 products plus the different properties of each API, of each resource, right? And so now all of a sudden you're looking at, I don't know, let's take 10 as a number, right? It's like 3,820 properties that you need to understand. And I think that's where you quickly run into limits where you go, yeah, maybe, you know, a single infrastructure engineer just can't do that.
Starting point is 00:44:16 Yeah. And you mentioned relationships, right? And that's part of the abstraction, right? You start, you have storage. Storage is attached to EC2 instances. The EC2 instances might be part of, like, a cluster that you have, blah, blah, blah. All these things have, like, some kind of hierarchy,
Starting point is 00:44:40 like connection, right, and relation. Yeah. Which I don't know how explicit or implicit it is, but if I understand correctly, part of what Resoto is doing is allowing you to exploit these relationships and learn about your infrastructure as a whole. So tell us a little bit more about that, both of, let's say, how the world looks and how Resoto models this world. Yeah, so I think, good point, the relationships.
Starting point is 00:45:22 Let's talk about the problem there. As you said, these cloud assets or cloud resources are really nested. A storage volume is attached to an EC2 instance. The EC2 instance runs in a VPC. The VPC runs in a region. The region belongs to a cloud account. It's all nested. Then maybe there's an IP address, obviously, that belongs to the EC2 instance.
Starting point is 00:45:54 And finally, there are things like IAM rules or policy access policies, which can be nested too. So the complexity gets high quickly. And understanding these relationships is beneficial for a number of reasons. Number one, just
Starting point is 00:46:14 asking a question like, how many resources do I have? Or what is everything behind this IP address? That's what these relationships tell me. But also, if I want to clean up my resources, right? So I can look at this lonely EC2 instance. I'll just keep going back to the EC2 instance because everyone is familiar, I would assume, with that product. You know, I can now look at this lonely EC2 instance and say, it looks unused. I can probably delete it. But what you don't know when you do that
Starting point is 00:46:48 is something that's called the blast radius. If I delete that, what else will go down? And you just don't know. You don't know. You don't know without the relationships. And that's why there's so much value in capturing the dependencies and the relationships of these assets.
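A minimal sketch of the blast-radius idea: if you store an edge from each resource to the things that depend on it, the blast radius is just a graph traversal. The resource names and edges below are made up for illustration:

```python
from collections import defaultdict, deque

# Edges point from a resource to the things that depend on it.
depends_on_me = defaultdict(list)

def link(resource, dependent):
    depends_on_me[resource].append(dependent)

def blast_radius(resource):
    """Everything that could be affected if `resource` went away:
    a breadth-first walk over the dependency edges."""
    seen, queue = set(), deque([resource])
    while queue:
        for dep in depends_on_me[queue.popleft()]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Illustrative nesting: volume -> instance -> node -> pod.
link("ebs-volume-1", "ec2-instance-1")   # the instance uses the volume
link("ec2-instance-1", "k8s-node-1")     # the node runs on the instance
link("k8s-node-1", "payments-pod")       # the pod is scheduled on the node
```

Deleting the "lonely" volume turns out to reach a pod three hops away, which is exactly what you cannot see without capturing the relationships.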
Starting point is 00:47:07 And how do you do that with Resoto? Yeah, so we have a data model, we look at the resource and all the different properties, and we map that to our data model. And then we also map the dependencies. It's a little bit of a visual thing. I think it's best to check out what this looks like
Starting point is 00:47:38 in reality in our docs. But it really is like a graph. It's a graph, right? And we map that up front. And it's a statically typed model, right? And so, you know, if we know, for instance, for a compute instance, it has a certain number of cores,
Starting point is 00:48:00 well, that's an integer, right? And so we put a lot of time into understanding the data models, we map those relationships, statically typed, we collect the data. And because it's statically typed, it also allows us to index this data really quickly. And so search becomes really fast versus, you know, running some batch jobs with SQL queries. And how do you query this data using Resoto? Because we are talking about a graph data model here. I would assume you're probably using a graph database. Yes, that's correct. Yeah, so we're open source, right?
Starting point is 00:48:38 And the graph database we use is called ArangoDB. There's other products out there. I think most people, when they hear graph database, they will think Neo4j, right? They kind of pioneered the model. And, you know, graph databases are pretty powerful for very specific use cases,
Starting point is 00:48:57 but, you know, a graph query language is pretty hard. Unless you do it every day, I'm not sure it's worth the investment to learn that language. So what we've done is, you know, we've created our own domain-specific language. It's a search syntax that simplifies
Starting point is 00:49:12 a lot of these things. And it's actually understandable for humans. You know, terms like search. We offer full-text search, right? And so it's this domain-specific search syntax that you can use. And anyone who's familiar with using, you know, a command-line tool will be
Starting point is 00:49:26 able to pick it up very quickly. And in the future, something we're working on is Jinja templates, so that you don't even have to really know the syntax. You write in Jinja, and then it just automatically creates the syntax in our
Starting point is 00:49:42 Resoto syntax. Yeah, and one last question for me, and then I'll give the microphone back to Eric. You mentioned that outside of Resoto, there's also another product called Cloud2SQL, right? Yeah. Yes. How does this work?
Starting point is 00:50:02 How do you go from this graph data model into SQL? How do you do that? Yeah. So there is an existing world of analytics engineering, right? And we have all the data infrastructure in place already. And there's nothing that keeps us from working with infrastructure data the same way we work with data from your CRM, from Salesforce, Google Analytics, Marketo. And that's why we introduced Cloud2SQL. Basically, all the things that I told you about, the graph and the
Starting point is 00:50:43 dependencies and all of that, we just flatten the data out, right? And put it into tables and rows. And then you can export it to a destination of your choice, right? And we call it SQL because you can export it to Snowflake and to Postgres, also S3, right? And that's what we use that product for. Now, you will lose these relationships, right, from the graphs.
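A rough sketch of that flattening idea, using an in-memory SQLite database: each graph relationship becomes a foreign-key column, and an ordinary join rebuilds it. The table and column names here are made up for illustration, not Cloud2SQL's actual schema:

```python
import sqlite3

# Flattened graph: the volume -> instance edge became a plain
# instance_id column on the volumes table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE instances (id TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE volumes   (id TEXT PRIMARY KEY, instance_id TEXT, size_gb INT);
    INSERT INTO instances VALUES ('i-1', 'us-east-1');
    INSERT INTO volumes   VALUES ('vol-1', 'i-1', 100), ('vol-2', 'i-1', 50);
""")

# Rebuild the relationship with a join, e.g. total storage per instance.
rows = con.execute("""
    SELECT i.id, i.region, SUM(v.size_gb) AS total_gb
    FROM instances i JOIN volumes v ON v.instance_id = i.id
    GROUP BY i.id, i.region
""").fetchall()
```

The same join could live in a dbt model feeding a dashboard, which is the workflow described next.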
Starting point is 00:51:10 And they're useful for use cases like, obviously, cleanup and security. But then, in theory, you can rebuild them by writing your own joins across these tables. And, you know, you can put it into a dbt model and then expose it to your Metabase dashboard. So that's an option we wanted to give existing analytics engineers. And that's why we introduced Cloud2SQL. Awesome. Eric, all yours again. All right, Lars,
Starting point is 00:51:31 I guess the question is how can data teams and infrastructure engineers in general and, you know, we have sort of more data engineers and data teams, you know, as listeners for the podcast. I guess the big question is the cloud infrastructure ecosystem is expanding at an
Starting point is 00:51:56 alarming rate. How should they think about that? And how can they sort of plan for what's coming in the next several years, especially as it relates to cost? Because it's going to be a problem. I mean, you know, AWS is a great example, but, you know, all the other cloud providers are going to become just as complicated over time. Yeah. You know, I would approach this from a perspective of what do we need to deliver our customers? I don't mean to go too far away from the actual infrastructure, but we know that to stay alive, these companies need to ship and develop a lot of new digital products.
Starting point is 00:52:39 And for that, I need to have empowered developers. They need to be able to spin up infrastructure. I do not want to put too many blockers around them. So this is the world I want to live in. And I want to give them the freedom to try out new tools and new products that the cloud providers give me. Now, how do I stay in control when I do that? And this world will never go away.
Starting point is 00:53:02 There will always be more innovation, more products, right? So how do I stay in control while I do that? And I think the answer, as we discussed, includes data. But it's to say, look, there's a time
Starting point is 00:53:18 for reactive intervention, like real time, high granularity data, right? And there's tons of great tools out there that do that. But there's also time just for exploration, for tracking long-term trends, and for using data to take remediated action
Starting point is 00:53:35 that steers my infrastructure back on the path that I want it to be on without having some sort of incident, right? And I think that's the philosophy that we're proposing, right? You use data as an input to write code so that your developers don't have to. Yeah. I love it. All right. Well, Lars, this has been absolutely fascinating, an area that we haven't covered a ton on the show, but it has direct impact on all sorts of data stuff across the stack. So thank you for educating us and sharing your insights, and best of luck with Resoto.
Starting point is 00:54:10 I appreciate it. Always good seeing you guys. And as a listener to the Data Stack Show, a longtime listener myself, I'm actually very excited that I can be a guest now. Well, it's a true privilege. And thank you for your support. Thank you, guys. Costas, what a fascinating conversation with Lars from Resoto. I learned a huge amount. And I think
Starting point is 00:54:37 my big takeaway is that, you know, I kind of went into this conversation expecting to be astounded by the complexity of sort of resource management across the entire ecosystem of infrastructure and tooling, which I was. It's a very large scope, complex problem. But the bigger thing was how similar the issue is actually to sort of a standard data flow in terms of the solution, right? And so Lars kind of described it as you're, you know, you're sort of ingesting inputs, you're doing some sort of modeling, and then you're pushing those back out, right? And so when we think about the modern data stack, I mean, that's, you know, bread and butter for a data engineer dealing with customer data, for example. So it really struck me that sort of, you know, there's an elegant architecture that already exists for solving this like pretty complex problem.
Starting point is 00:55:30 Yeah, yeah, 100%. I think outside of like proving today that resource management is a data problem, I think we also proved that like everyone is a data engineer, right? Like every software engineer is a data engineer at the end. You need to, in a way, many of the problems that we are talking about solving actually relate,
Starting point is 00:55:52 sorry, contain a big part of data engineering work that has to be done. Data has to be exported, data has to be transformed somehow, modeled, and, of course, being exposed to the data consumer for value
Starting point is 00:56:09 to be created there. And I think especially... I mean, okay, it will sound like it has been said many times, I think, already, we're entering this decade of everything's going to be around data. But I think we start seeing that a lot. And we start seeing that by actually like getting into domains that don't necessarily feel like they are, you know, data problems or data related like technologies that have to be built. But at the end, that's exactly what is happening, right?
Starting point is 00:56:42 And I think especially with AI and all the stuff that's happening right now, we are going to see more and more of, let's say, these domains to come back and being rebuilt and rediscovered around the data problems that can be defined there, including sales, marketing, like pretty much like everything. And yeah, it was super fascinating. Like we should get Lars back again. He's a good friend. And I think like whenever we talk with him, we always come up like with very interesting insights. No, I completely agree. So much to learn and would love to have Lars back.
Starting point is 00:57:27 Such a deep thinker about these problems. Go ahead and subscribe to the show if you haven't. Look it up on your favorite podcast network. Tell a friend, and we will catch you
Starting point is 00:57:37 on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback.
Starting point is 00:57:48 You can email me, Eric Dodds, at eric@datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack,
Starting point is 00:58:00 the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
