The Data Stack Show - 147: Where Data and Infrastructure Converge Featuring Lars Kamp of Resoto
Episode Date: July 19, 2023

Highlights from this week's conversation include:

- Lars's work on Resoto in helping to cut cloud costs for organizations (2:02)
- The trend from large resources to micro resources (5:59)
- What are some of the typical resource drains in data infrastructure (8:56)
- Managing cost on the backend with scale and experimentation (12:51)
- Solutions for resource management problems (17:38)
- How Resoto is solving pain points in resource management (26:17)
- Navigating the complexities of data infrastructure (29:01)
- Resoto's solution for interpreting difficult cloud data products (36:35)
- Exploring relationships of data points and finding solutions (43:40)
- Querying in a graph database (47:46)
- How to go from graph to SQL (49:13)
- How can data teams plan for costs in the coming years (50:53)
- Final thoughts and takeaways (53:49)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack.
They've been helping us put on the show for years and they just launched an awesome new product called Profiles.
It makes it easy to build an identity graph and complete customer profiles right in your warehouse or data lake.
You should go check it out at rudderstack.com today.
Welcome back to the Data Stack Show. Kostas, another new topic. This has been
a great spring and early summer chatting about things that we haven't really covered on the show a ton before.
Today, we're talking with Lars Kamp, and we're going to talk about resource management,
which is everyone's favorite topic in the data stack. But for real, I think
this is becoming a really big concern, especially in the macroeconomic climate. Understanding how to run a cost-efficient data stack is becoming more and more critical. And it's very difficult to do because of the complexity of the systems. It's not just about managing, let's say, your warehouse compute bill, for example, right? Especially teams that run,
you know, sort of large experimentation platforms and need to spin up and down resources.
There's a huge amount there. So Lars has worked with a team that's created a tool to help you do
that, which is really fascinating. I want to start usually where we always start, which is
the nature of the problem. Resource management is not something I
have a ton of personal experience with, and it sort of goes deep into sort of DevOps,
you know, SRE world, which is fascinating. So I think it'll be fun to cover that topic on the show.
Yeah, 100%. I won't say much. The only thing that I would say is that,
although it might sound that the topic today is not directly
related to data, we are actually together with Lars going to prove that resource management
is a data problem.
Let's go and do that.
That is a great teaser.
Let's dive in and talk with Lars.
Yeah, let's do it.
Lars, welcome to the Data Stack Show.
Hey, Eric. Hey, Kostas. Good to see you both.
For sure.
Well, you know, of course, we go back a little ways,
but tell us about Resoto and what you're working on.
Yeah, Resoto, the name stands for resource tool.
And our value prop is we cut your cloud costs by 50%.
Resoto is a
cloud asset inventory
for infrastructure engineers.
And the magic
behind cutting your cloud costs by 50%
is that we find and delete
the expired resources in your cloud,
aka zombie resources that drift,
along with all the
associated resources.
This is one use case. In software, you may have heard about garbage collection. And we do the same for your cloud
infrastructure. And when I say cloud infrastructure, I think about AWS, GCP, but also Kubernetes.
For sure. Okay, so I want to break this problem down, right? And so when you think about waste and cloud infrastructure,
I mean, you're talking about an extremely broad footprint
of potential problems.
What are the things that drove you to actually like invest
and work on Resoto as a product?
Were there particular problems that you saw
in terms of resource management that sort of said,
this is a major problem?
Yeah.
What we saw is a trend to what I would call peak ops.
You have all these different DevOps, FinOps, DevSecOps tools,
and they all use the same approach.
There's usually an agent that you install
in your infrastructure
to get data out of your infrastructure.
And then you take some sort of remediative action
that very moment.
It's all very real-time, very reactive.
Like alerting against thresholds, et cetera.
That's right, right?
But if you look at what has happened
to cloud infrastructure in the past
decade, five years,
two years,
a number of trends. So,
number one,
we see a smaller size of function.
And so you've gone from a beefy compute instance to a very tiny Lambda function
that may have a lifespan of minutes in some cases, right?
But you're dealing now,
and as the size of function goes down,
the volume of resources has gone up at the same time, right?
And so, you know, you may, let's just say,
you may still be spending, I don't know,
I'm making this up, a million a month, but you're not spending it on 10,000 resources.
Now you're spending it on a million resources that cost you a dollar a month, just speaking of orders of magnitude.
Yeah, just distributing the cost over a larger footprint.
Yeah.
So that's one change, right?
That's driven by the product roadmaps of the cloud providers.
The second change is that, well, to deal with this large number of resources,
you can't do that in a console anymore.
You need to do that in code.
And so the second change is infrastructure as code, right?
So you have tools like Terraform, Pulumi, or CloudFormation that you use to deploy and
manage these resources. And so the lifecycle of these resources really shortens
if they get updated a lot, right?
And so now you're dealing with an inventory that's not only larger,
but it also changes all the time.
So there's this change that's going on all the time.
So that's the second one.
And I would say the third one is that
when people hear cloud,
they usually think of production environments.
Like, so my app that's running somewhere.
There's also this world of tests
and staging environments
that has grown much, much faster
because developers want to have
the liberty to experiment.
Like, I'm going to spin up this experiment.
And so it's these three trends
that contribute to this trend of growing cloud infrastructure and the intractable complexity
that comes with it. Yeah, absolutely. One question, when you think about the
sort of migration from large resources to micro resources, right?
So you were spending, you know, a million dollars on, you know, 10 things and now you're
spending it on thousands of things.
What's driving that?
Like, what's the, is that architecture?
Is that, I mean, you said the product roadmaps, but can we dig into that just a little bit
more?
Yeah.
If you think about the portfolios
of the big cloud providers,
I think the number of products that AWS offers,
I think the number last time I checked
was like 382 different products.
That doesn't include all the different SKUs,
the stock keeping units below that.
And then maybe the different flavors, right?
And so, you know, if you start counting the number of APIs
that are associated with those products,
it goes into the thousands and tens of thousands, right?
And the cloud providers have developed that
in response to market needs, right?
And I think one of the trends is obviously the trend to microservices.
I know as we record this podcast, there was this little news item from Amazon where
they said, oh, Amazon Prime Video, we migrated it back to a monolithic application.
But really, you know, it's customer demand.
And one big shift was the trend to microservices and smaller components of an application.
And on the testing and dev side, just curious, how much of that do you think is driven by
the rise of large scale ML practices inside of companies?
So I mean, sort of the extreme end of that is, you know, self-driving cars who need to run like really
significant tests and, you know, sort of validations before they roll things out to
production. I mean, you're talking about, you know, things that most companies would only dream
of in terms of production scale models running and they're doing as sort of a test.
Is that sort of, I know that's the sharp end,
but is that sort of a driver?
Yeah, and I think actually that's where
the world of data and infrastructure,
where they kind of converge.
Let's speak broadly about data products.
And I would include machine learning and AI
in that. All of that needs to run somewhere, right?
And for that, you need infrastructure.
And usually that's probably Kubernetes
because you have elastic workloads,
you need orchestration.
And so, yes, depending on the industry
and how mature they are,
then yes, I would argue that a lot of those workloads
are driven by the data world.
Yeah, that makes total sense. Okay. So Resoto helps you eliminate waste.
Where in the landscape, and what kinds of resource drains, or sorry, what are the causes of sort of
maybe the most acute resource drains?
Where do those come from? And I guess maybe to direct the question a little bit is,
are those clustering around use cases, right? Like, you know, we need to spin up a, you know,
a cluster of, you know, 32 nodes, you know, whatever it is, in order to do this thing. And then,
you know, we forget to spin it down. Is that being driven by a particular
type of use case? Or is it sort of agnostic? And it's just sort of a general problem?
I think it's the latter. It's more of a general problem. And you nailed it, right? Like there's
a developer says, okay, we're going to test this. We're going to spin up a workload, right? But there's also machines that spin up workloads, your CI, CD pipelines, auto scaling and all of that, right?
And then, as you said, in theory, these tools should all clean up after them automatically.
Reality is that doesn't always happen, right?
And then also, developers might forget about it. They're humans too, right? They're under pressure, they need to deliver, right? And so, do I spend my time, my Friday, on, you know, finding my next experiment next week to help the company ship product, or do I spend my time sifting through my consoles finding whatever needs to be cleaned up? I think the answer is clear, right? The answer is very clear.
You know, I mean, what we're talking about here really is technical debt, right? I'm interested in this.
Yeah, I guess that's one way of looking at it, right? But I think there's always going to be technical debt in anything in software and data.
And I think the way we like to look at it is, look, we want to, at the end of the day,
this comes down to control, right?
How do I control this giant pile of resources in my infrastructure? And, you know,
on the one hand, you want to give developers the freedom to experiment and spin up resources.
But on the other hand, as an infrastructure engineer in charge of this, I want to stay in
control. And, you know, you can give lots of freedom, but then you're not in control anymore.
And if you're trying to impose too much control,
then nothing gets done.
And so what we're saying is, why can't I have both?
Yeah.
Right?
And then as we start looking at the problem, well, how can I do that?
If I have this giant pile of resources,
then you quickly come to an answer that includes data.
And that means collecting data about the state of the infrastructure.
Not in real time, not in high granularity,
but something like a snapshot every hour.
And as I collect that data about my infrastructure,
I get a good picture of what's going on in my infrastructure.
So this whole concept of exploration
is something that I think we know
from the analytics engineering world
and everything that has happened with the modern data stack
in the past five years.
And I think as we've seen these changes with infrastructure,
we can apply some of these lessons learned to infrastructure data.
Yeah.
So that's kind of how I like to look at the world.
Yeah.
Let me push on that a little bit more because, Lars,
I know you're a man of conviction.
Do you envision a world where, because I agree with you, in the sort of world of infinite compute in
the warehouse and, you know, analytics engineering, you're sort of able to explore
with unbounded vigor, if you can say that, right?
And there's low likelihood that your SQL queries are going to cause someone to tap you on the
shoulder and say, this is causing a problem, right?
There's probably a lot more happening elsewhere.
Do you envision the same thing on the infrastructure side where really we should look at, we should operate as if these are
infinite resources and we have tools that help us manage the cost control on the backend
as we explore what is possible with scale and experimentation? Or do you have a more measured approach
where you need to sort of consider
the constraints going in?
I think it's the former, right?
You don't want to put boundaries on experiments.
I mean, you kind of have to, right?
But...
Sure, there are physical limitations.
Yeah, yeah.
But spinning up and spinning down clusters, and worrying about the cost afterwards. I mean, in my mind, and I'm, you know, I'm
obviously showing my cards here. But in my mind, I mean, we should sort of spin up and spin down
and, like, do a spike and do a huge experiment, and worry about the cost afterwards. To
me, I mean, that's a huge accelerant to a company, if you can sort of control that.
Well, let's go. Okay, so look, going back to the problem: giant pile of resources, lots of experiments that are sort of
driving that. But we don't want to go back and say, okay, now you cannot do an
experiment anymore, right?
But we want to be in control.
We want to know when that happens.
And I think if I can find a way to give my development team the liberties to use all
the tools at their disposal out there, all the different cloud products, then I think
that's to the benefit of the company, right?
The but now is like, but what do I do to stay in control, right?
And I think the existing approaches include more ops tools, like, okay, let's monitor
this, let's instrument that, let's deploy an agent there, right?
And I think what we're proposing, what our conviction is here with Resoto, is, well,
look, there's a place in time for tools that give you real-time data with high granularity, right?
And that's probably for your production applications.
But for everything else, you know, you probably don't need real-time.
And you probably do not need, like, second granularity.
A snapshot is enough.
And I think that's a concept.
You know, this is the Data Stack Show, and I think it's a concept that will resonate with your listeners.
It's like, we have this in analytics, right? Where I run batch jobs from all my sales and
marketing systems, and I unify all this data in my Snowflake or Redshift cluster, and then I analyze it.
And then I apply the insights from my systems,
either in a dashboard,
or maybe I use something like reverse ETL
where I make it actionable, right?
And so I think that chain, you know,
ETL my data, put it into a singular inventory,
analyze it, create metrics, react to those metrics, put it back into production action.
That's something I think we can apply to the infrastructure world.
And so the basic concept here then would be to say, okay, so you're a developer, you have your account, you have your infrastructure resources, and you go crazy.
Go at it, right?
All we need to know is what exactly is happening with your infrastructure.
And what we do is we collect data from your infrastructure.
We can go into detail how exactly we do that.
But at the end of the day,
it's almost like data integration for infrastructure engineers.
We now go into your infrastructure.
We call the cloud APIs, the same APIs, you know,
Terraform or Pulumi uses to deploy resources. We call the same APIs now for data extraction. Like,
tell me about these resources that are running. Tell me about their configuration. Tell me about
their state, right? And we put that all into a single repository.
Yeah, that's fascinating. I mean, there really seems to be a pretty
clear parallel to, you know, sort of analytics on the modern data stack.
Let's step back just a little bit. How are SREs or infrastructure engineers doing this
today? I mean, it's obviously a problem or you wouldn't be trying to build a solution for
it. And it's very compelling to hear about. I mean, I think about it almost as like an executive
dashboard, but for resource management, where you say, this trend is concerning and we need to go
back and understand the lineage and the cause of this. And so we're just going to trace it back and
fix the problem, right? Yeah. Ironically, similar to BI, but you're building a solution. So
how are people solving this today? I mean, one thing you mentioned was a lot of monitoring tools,
which obviously is not, you know, helping, maybe that makes it more complex.
Yeah. And we can go through the options and maybe talk a little bit about the pros and cons
of these different options.
So number one is, as you said, I call them XOps tools, right?
So different operational tools.
I think it's in the name.
It's not an analytics tool.
It's an operational tool, right?
That collects data in various ways from the infrastructure for a very specific and
opinionated use case.
That's one.
There's definitely the world of scripts where infrastructure engineers have built their
own little tools, right?
Or use some sort of governance tool.
Cloud Custodian is a good example.
So yeah, it's the world of scripts, basically,
YAMLs, right?
The third one is
the world of consoles.
But that is also
very constrained
because if you think about like,
if these companies operate,
even as a startup, right?
You operate in different regions,
you know,
each developer gets,
maybe, an account.
And sort of the number
of combinations goes up
into the thousands and tens of thousands, right?
And so that stops working.
But those are the three things that we see today.
And then the fourth one is some of the cloud providers.
You have native cloud provider products.
Google has a product called,
it's called Google Cloud Asset Inventory.
They do something similar to what Resoto does.
And, you know, lo and behold, it extracts the data into a BigQuery instance, right?
AWS has a product called AWS Config that extracts configuration data from your resources and stores it in an S3 bucket, right?
And from there, you can query it with Athena
and you can visualize it in the QuickSight dashboard.
So I think the parallels to the modern data stack
are pretty obvious to this world here.
Yeah, yeah.
Okay, two more questions.
One is just my curiosity and the other one,
I think, will be a great handoff to Kostas.
The first one is how big is this problem?
I mean, cloud costs are a very hot topic. I mean, there's obviously macroeconomic
influence to this where data leaders, infrastructure leaders are trying to control costs.
But regardless of the macro environment, we sort of have this
weird world of infinite scalability, but it can bite you. So how big is the problem?
I think everyone has this problem. So no matter if you're small or big, it's just a question of
how urgent a problem it is for you.
And I think you nailed it.
So macroeconomic changes,
that's definitely one driving factor.
I think it also depends on the industry.
If you're a SaaS application,
then probably cloud cost
is a first order business problem.
Yeah, sure. Gross margin. Gross margin,
right? And so I think everyone is affected
by it. If we talk to
users, our open source users and customers today,
I think
the common theme is always like, gosh, I wish we would have done this two or three years ago.
This being putting something into place that prevents sprawl in the first place.
It doesn't just react to it.
You know, that's also one of our underlying principles.
Today's approach to anything security or cloud
cost is like, okay, we wait for it to happen. You know, we drive the car off the cliff, and I'm
exaggerating, right? And then we're going to take action, right? Yeah, sure. You know, it's like, oh, we
can worry about this later, we're growing. And what we're saying is, well, you can actually have both,
right? But just prevent it. Don't even wait for the sprawl to happen.
Yeah.
Well, okay.
So I have a sort of a 1B question here.
This world of sort of infinitely scalable resources.
I mean, people who have sort of, you know, recent experience here, you know, sort of
young in the industry, you know, maybe that's very common to them.
You know, I think people who've been working in infrastructure for a long time are very
sensitive to cost control because that was a very big problem, you know, not too many
years ago.
But, you know, one thing that's really interesting is when you think about the sprawl
of this problem across a very complex infrastructure.
I mean, I think the example you gave about sort of moving to lambdas, even, you know,
the sprawl is unbelievable, and has actually happened very quickly. There are probably
really smart infrastructure engineers who don't necessarily have the, you know, sort of, you know, innate sense of how to manage that sprawl or even the experience to
know when slippage is happening. Do you see that as a big problem? I mean,
on some level, we're talking about drastically increasing complexity. And you have really smart
people where maybe they, it's very difficult for them to detect,
not because they're not smart, but because there's slippage happening across a million
different vectors on a small scale that add up to a pretty big problem.
Yes, you nailed it.
And it's not just cost, it's security as well. And I think at some point we may want to talk about what this common data layer, this common infrastructure data layer, what problems that layer can solve for us.
But what you said, it's exactly right. And I think the actual problem there is executive awareness.
It's a little bit like, you know, every time we go through a platform shift. It's like,
oh, 15 years ago or whatever, 10 years ago, you know, you're a CEO of a company,
and all of a sudden you needed to understand what mobile ads are, right?
Oh, it's a new distribution channel, right?
And it's like you had to really dig in and start to understand it.
The companies who chose not to understand that, the CEOs, they went out of business, right?
Yeah.
And I think we see the same going on right now with, you know, ChatGPT, GPT-4, and all these things.
As an executive, you need to familiarize with these things and how they impact your business. And I think it's the
same for infrastructure. And what I observe, and this is also a little bit of just my personal
opinion, that number one, tech or execs are not always very aware of what's going on with their
cloud infrastructure. I think that needs to change. In general, they look at developers as a
productivity asset.
They're building code.
We're shipping product.
Whereas an infrastructure engineer, like an SRE, is more looked at as a cost center.
So we're trying to hire lots of developers, but we're going to try to limit the number of SREs.
And there's not a single SRE or infrastructure engineer that I know who's not stressed out.
Yeah, sure.
Right?
They're the ones who are holding the ship together, right?
And something's got to give.
And usually, you know, that something is either
the cloud bill, you know, security,
and all of that.
Okay, so question number two,
and this is where I'm going to need
Kostas to jump in here, but...
This is your third question.
No, I said 1A, 1B. Oh, Gamma. Okay.
All yours. Okay. How does Resoto actually solve this problem? There you go, Kostas.
My second question. Oh, that's just it. Okay, great. Yeah. So, yeah, and I don't want to turn this into a commercial for Resoto.
I'm actually way more excited to talk about solving this problem with data.
And I think that's the fun part of the show that we can talk about now, right?
We define the problem, right?
Lots of sprawl.
We don't know what's going on in our infrastructure.
It causes all sorts of problems, cost, security.
And how do we get back in control?
And the point number one is, well, I need to have data about the state of my resources.
How do I get this data?
And when you look into acquiring that data, like with anything, the data acquisition part is the hard part, right? And what we have done is we have built a set of collectors
that calls cloud APIs and collects metadata from these cloud APIs.
What's metadata, right?
Let's take it.
It's something like, okay, I have an EC2 instance.
I have an EC2 instance.
What time is it start date, what time did
we see it first in the infrastructure,
how many cores does
it have, what is the
attached storage volume,
what VPC does it run
in, so it's information
about the state and the configuration of
the resource, but also
the relationships of that resource
to the other assets in the cloud.
And that's really an ETL process, right?
So these collectors, they run on a schedule.
By default with us, it's one hour,
but you can ramp it up to whatever fits your needs,
30 minutes, 20 minutes, 15 minutes.
And they run, you know, there's a worker, it runs and we extract this data
and we put it into a single place.
In our case, that's a graph database, not a cloud warehouse,
but there's also ways that you can export it to, you know,
like a Snowflake cluster or S3.
We have a product called Cloud2SQL,
which obviously, as the name suggests,
transforms the data
into tables and rows.
For the core product for Resoto,
we chose a graph database. We can talk a little bit about
why we did that, but I think
for the listeners of this show who come
more from a modern data stack,
I think the part that will resonate is like,
look, there are connectors.
We know how to talk to cloud APIs,
extract data, test the data,
transform it into a unified format, and put it into a single place.
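The "extract, transform into a unified format, put it into a single place" chain can be illustrated with a Cloud2SQL-style sketch. Everything below is hypothetical: the table name, columns, and cost figures are invented for the example, and an in-memory SQLite database stands in for a warehouse such as Snowflake.

```python
import sqlite3

# Hypothetical flattened records, as a Cloud2SQL-style export might produce them.
records = [
    {"id": "i-0abc123", "instance_type": "t3.medium",
     "vpc_id": "vpc-12345", "monthly_cost_usd": 30.37},
    {"id": "i-0def456", "instance_type": "t3.large",
     "vpc_id": "vpc-12345", "monthly_cost_usd": 60.74},
]

conn = sqlite3.connect(":memory:")  # stand-in for a Snowflake cluster or S3 + Athena
conn.execute("""
    CREATE TABLE aws_ec2_instance (
        id TEXT PRIMARY KEY,
        instance_type TEXT,
        vpc_id TEXT,
        monthly_cost_usd REAL
    )
""")
conn.executemany(
    "INSERT INTO aws_ec2_instance VALUES (:id, :instance_type, :vpc_id, :monthly_cost_usd)",
    records,
)

# Once the inventory is tables and rows, ordinary SQL answers cost questions.
total = conn.execute(
    "SELECT SUM(monthly_cost_usd) FROM aws_ec2_instance WHERE vpc_id = 'vpc-12345'"
).fetchone()[0]
print(f"VPC spend: ${total:.2f}/month")
```

The design point is the one Lars makes: once infrastructure state lands in rows, the familiar analytics toolchain (SQL, dashboards, metrics) applies unchanged.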
Lars, now I think I can ask my questions, right?
Right, Eric?
Am I allowed?
Yes.
So, okay.
Before we get deeper into the Resoto technical details,
I want to ask you something related to the conversation you had with Eric earlier.
So you mentioned like the complexity around the cloud infrastructure today, right?
Yes.
And I think like pretty much like every person who has worked,
let's say, in this industry the past 20 years,
they know that what we are trying to do is add more and more
abstraction layers to simplify the way that we interact
with infrastructure, right?
Like from bare metal, today we have serverless
with Lambda functions and all that stuff, right?
Yes, yes.
So my question, there are two questions here, actually.
The first one is, to me, it feels like there's a kind of paradox here, right?
Like, one, we are trying to reduce, let's say, the surface area of how we interact with
infrastructure by all these abstractions.
But at the same time, it becomes harder and harder to actually understand how the infrastructure
we are using is operating and is part of our product.
And that's where tooling comes in, to fill this gap and so on.
But why do you think this is happening?
Well, you mentioned it, but some of it is driven by these different levels of abstraction.
So this evolution from bare metal to today we have Kubernetes, so we're running these
different pods, they run on some sort of machine that we don't even see or know of.
Even just the question like, okay, what machine is this pod running on?
That's not straightforward to answer,
especially if you have thousands of them, right?
But I think it's these different levels of abstraction
that drive that.
The second one is that, ironically,
there's this thing called the well-architected framework
that AWS has, and that suggests to separate your workloads
into different
cloud accounts, into different regions, right?
And that happens for control reasons, for security reasons, for failover reasons, all
of that stuff, right?
And so what on one end is a really good principle to apply to be, you know, to have a secure
infrastructure, to have a resilient infrastructure, to have a scalable infrastructure, just adds to the amount of fragmentation
and therefore loss of control over the resources running in your infrastructure.
Yeah, that makes a lot of sense.
And okay, like from the SRE point of view, let's say we have a person
who is at least aware of like all these different resources that are participating
like in this infrastructure, right? But they are not the only people who are interacting with that,
right? Like, the rest of the engineers are actually interacting, even indirectly, with all these things.
So from, let's say, the SRE perspective, let's say we're using a tool like Resoto.
We have a much deeper understanding of what is happening there,
like the behaviors that our engineers have with all these resources.
How easy or difficult is it to communicate these things back to our product engineers?
Because, as you said, there are not enough SREs, right? They have many things to do, and the
complexity of the problem they are dealing with is exploding, right? Yes. How do we
communicate to the rest of the engineers about these things that Resoto is helping
to solve? Yeah, I obviously love talking about Resoto, but let's abstract the problem from the solution and how we solve it, right?
I think what you want to do is make engineering efficiency a KPI for your product engineers.
Yeah.
And one of the problems with that is that they have zero visibility into the cost of a resource, the lifetime of a resource, right?
There are other tools out there that solve the deployment problems.
Like, well, if I deploy this, how much actually does this cost me, right?
But just making that part of good engineering habits, efficiency, I think that's the first change we need to do.
Some have that, some don't, right?
And then we can debate about,
okay, how exactly do we do that, right?
And for instance, I'll give you
an example of what one of our customers has done,
a company called D2IQ.
They are a managed Kubernetes provider.
They introduced a very simple process
by introducing two tags. Tags are
basically key-value pairs
attached to a resource.
And they chose two tags, a name,
the name of the engineer
who deployed this resource
and an expiration date of the resource.
And they have certain rules
in certain accounts:
look, you know,
if you're in these accounts,
no resource should live longer than two days,
or no longer than Friday night,
7 p.m. of the week it was deployed, right?
And that's a date, right?
And so the name is absolutely required
because if the resource is running,
and I'm an SRE,
and I have developers sitting across the globe,
I want to know who deployed it.
I want to have a quick way to talk to that person.
It's like, hey, what are you doing with this?
That's number one.
Number two, the expiration tag is just a way to say,
look, this is the lifetime of the resource.
And once you go beyond that date,
we're going to terminate this resource in test,
not in production, right?
And so what do they use Resoto for?
Well, number one, once the resources are deployed,
they use Resoto to check
if every resource adheres to those two principles,
to those two tags. And then two things happen. If the tags are incorrect,
they use Resoto to correct the tags. There's a little bit of logic. People make typing errors
and all sorts of mistakes, right? And so we correct those resources. For instance, if the
expiration date is longer than what's allowed in the policy,
we automatically shorten the lifetime of the resource
to the max allowed.
Now, if neither of those two tags has been applied,
D2IQ just deletes the resource in test.
And I know that may seem draconian to some people,
but bear with me a little bit.
They had huge efficiency and developer productivity improvements.
And at first they said like, oh gosh, I can't work like that.
But it turns out you can.
And just those two simple tags, an owner and an expiration date, and then deleting the resource
once it has reached its expiration date, has done wonders, right?
I think in their specific case, they saved 78% of their infrastructure spend, which
is unheard of, right?
So it doesn't need to be complex, right?
If you change the philosophy and you make it a KPI for the engineers, and
then you give them the tools to reach that KPI,
lo and behold, it'll happen, you know?
It'll happen.
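The two-tag policy Lars describes (require an owner and an expiration, auto-correct out-of-policy expirations, delete untagged resources in test accounts) can be sketched as a small check. This is a hypothetical Python sketch of the logic, not D2IQ's or Resoto's actual implementation; the tag names, the resource shape, and the two-day maximum lifetime are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: in these accounts, no resource lives longer than two days.
MAX_LIFETIME = timedelta(days=2)

def enforce_tag_policy(resource, now=None):
    """Return the action to take for one resource under the two-tag policy."""
    now = now or datetime.now(timezone.utc)
    tags = resource.get("tags", {})
    # Rule 1: both tags must exist; otherwise the resource is deleted (in test).
    if "owner" not in tags or "expires" not in tags:
        return "delete"
    # Rule 2: an expiration beyond the allowed maximum is shortened, not deleted.
    expires = datetime.fromisoformat(tags["expires"])
    if expires > now + MAX_LIFETIME:
        resource["tags"]["expires"] = (now + MAX_LIFETIME).isoformat()
        return "shortened"
    # Rule 3: a resource past its expiration date gets cleaned up.
    if expires < now:
        return "delete"
    return "ok"
```

The point of the sketch is how little logic is needed once the two tags are mandatory: every resource falls into one of three buckets, and two of them are handled automatically.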
Yep, 100%. Yeah, that's awesome.
And let's go back to Resoto now
and the technical side of things.
But because it sounds like we are dealing
with a data problem here.
Yes.
So when you are talking about extracting data from the cloud provider, right? Like,
you probably know much better than me, but AWS has hundreds of products, right?
Yes.
Is each of these products, let's say, a separate data source for you? Are there differences
in terms of how this data looks and how they differ from each other?
Because they are different products that we are talking about, right? We have things from
CloudFormation, which is one product, to EC2, which is another very different kind of resource, right? How do these things work?
Like, how does the data look,
and how rich does this data model have to be
in order to represent all these products?
Yeah, yeah.
It's a zoo, right?
So let me peel the onion here a little bit
around products.
What's a product?
What's a cloud product, right?
So I think AWS's overall portfolio
is something like,
I think 382 different products, right?
And then let's take one product
as an example,
compute instances, right?
If you double-click on EC2,
Elastic Compute Cloud,
then EC2 itself has,
I think, like 200 different
compute instance types, right?
So it adds up.
And then each product has its own set of APIs.
As I said, those APIs are usually optimized for deploying resources, less so for extracting
data.
And it also depends on the cloud provider.
GCP, you know, they came later to the market.
They've done a lot of things right when it comes to their APIs.
They're a lot more consistent.
But for AWS, on the other side, things can be pretty inconsistent.
And what do I mean by that?
Let me give you a very specific example.
If I just want to know the age of a resource, like if I talk about the life cycle of my resources, I want to know, on average,
how old are my resources, right? Why would I want to do that? Why do I want to know that?
Because, well, you know, if my resources tend to be old in terms of like months, and they don't
tend to be updated frequently, then it's probably an indicator that we're not doing a lot of development, right? On the other hand, if they have a short life cycle,
they get updated frequently, a great indicator of lots of development activity. So how do I get an
age of a resource? Well, I need some sort of timestamp, right? Well, AWS does provide that
timestamp. But let me give you two very specific examples for timestamps. So EC2 has a date timestamp, right?
The compute instance.
And it's, you know, the ISO standard,
8601, right?
So it's a string.
It's a year, month, day, hour, minute, second, like that.
That's a timestamp.
So I can easily calculate the delta to the time right now. Then there's a second
product, I'm just picking two random examples: SQS, the queue service, right? It has a property called CreatedTimestamp,
and that value is the number of seconds that has elapsed since the Unix epoch, which
is zero o'clock on January 1st, 1970, right? And so you have
two different products
with two different APIs
that kind of tell you the same thing,
which is like, you know,
when was this resource created,
but in completely different formats.
And that happens across hundreds of resources
and thousands of APIs.
So extracting data from these cloud providers is
not really straightforward. You need to put a
lot of work into building the connectors
and understanding the data you get
and then extract it and represent it in a way
that it's consumable for the user.
That was a little bit of a long explanation,
but I hope that explains the problem a little bit more
in detail why this is hard.
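The timestamp mismatch Lars describes, an ISO 8601 string from EC2 versus epoch seconds from SQS, is exactly the kind of normalization such a connector has to do for every field. A minimal Python sketch, assuming just these two formats exist:

```python
from datetime import datetime, timezone

def normalize_created_at(value):
    """Normalize a cloud-provider creation timestamp to a UTC datetime.

    Handles both styles from the example: an ISO 8601 string (as EC2
    returns) and seconds since the Unix epoch (as SQS's CreatedTimestamp
    attribute returns).
    """
    if isinstance(value, (int, float)) or (isinstance(value, str) and value.isdigit()):
        return datetime.fromtimestamp(float(value), tz=timezone.utc)
    # ISO 8601; map a trailing 'Z' so fromisoformat accepts it on older Pythons.
    return datetime.fromisoformat(str(value).replace("Z", "+00:00"))

def age_days(value, now=None):
    """Age of a resource in whole days, regardless of the source format."""
    now = now or datetime.now(timezone.utc)
    return (now - normalize_created_at(value)).days
```

Multiply this by hundreds of resource types and thousands of API properties, and the "building connectors" work becomes clear.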
Absolutely. I think
it's very interesting.
I think people that never had a reason to go use these APIs probably consider,
let's say, AWS as one big API, right?
But actually, that's far from reality.
There are actually hundreds of different APIs, hundreds of different
data models, and all of these need to be aligned if you are going to
process them in a consistent way.
And that's like the ETL part of the data problem, right?
And when we are talking about data for resources, I would assume that we are talking about, as you mentioned, the creation date,
for example. There's a set of metadata fields that describe these resources.
What other information are we talking about here?
There are names, dates, and probably custom key-values that people use as annotations for whatever reason,
like the ones you mentioned for applying processes.
Give us a little bit more of a sense of what this data looks like,
even for an EC2 instance, right?
Let's say the default row that describes an EC2 instance.
How does it look?
Oh, I mean,
you're going down
a rabbit hole with this one, right? The number
of fields and properties can go into
the dozens for a specific
instance, right?
And, you know,
we have a
whole section on data models
on our website.
But a create timestamp, last updated, name of the resource, tags,
and then depending on what the resource is.
Also relationships.
Like, oh, what is attached to this EC2 instance?
It's dozens.
In some cases, it's dozens of fields,
in some cases, it's like, you know, eight or
nine, but it's a lot,
right? And then I think the
underappreciated skill there is
that if you want to work
with this data, and that's the value
that some of these XOps tools provide
that I've mentioned, right, is
you have to understand
the data model for each individual cloud provider
and resource, right?
And so all of a sudden you have to become an expert
in, you know, 382 products plus the different properties
of each API, of each resource, right?
And so now all of a sudden you're looking at,
I don't know, let's take 10 as a number, right?
It's like 3,820 properties that you need to understand.
And I think that's where you quickly run into limits
where you go, yeah, maybe, you know,
a single infrastructure engineer just can't do that.
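One common answer to that property explosion is to normalize every product into a shared, statically typed base model, which is roughly the approach Lars describes later. A hypothetical Python sketch; the class and field names here are assumptions for illustration, not Resoto's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class BaseResource:
    """Fields shared by every resource, regardless of provider or product.

    Each product's API response is normalized into this shape once, so
    downstream tooling never has to know 382 different data models.
    """
    id: str
    name: str
    created_at: datetime
    last_updated: Optional[datetime] = None
    tags: dict = field(default_factory=dict)

@dataclass
class ComputeInstance(BaseResource):
    # Statically typed, product-specific properties (assumed names).
    cores: int = 0
    memory_gb: float = 0.0
```

The static typing is what later makes indexing and fast search possible: the tool knows up front that `cores` is an integer, not an opaque string.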
Yeah.
And you mentioned relationships, right?
And that's part of the abstraction, right?
You start, you have storage.
Storage is attached to EC2 instances.
The EC2 instances might be part of, like, a cluster that you have,
blah, blah, blah.
All these things have, like, some kind of hierarchy,
like connection, right, and relation.
Yeah.
Which I don't know how explicit or implicit it is, but
if I understand correctly, part of what Resoto is doing is allowing you to
exploit these relationships and learn about your infrastructure as a whole.
So tell us a little bit more about that,
both, let's say, how the world looks and how Resoto models this world.
Yeah, so I think, good point, the relationships.
Let's talk about the problem there.
As you said, these cloud assets or cloud resources are really nested.
A storage volume is attached to an EC2 instance.
The EC2 instance runs in a VPC.
The VPC runs in a region.
The region belongs to a cloud account.
It's all nested. Then maybe there's an IP address, obviously,
that belongs to the EC2 instance.
And finally, there are things like
IAM roles or
access policies, which can be nested too.
So the complexity
gets
high quickly. And understanding
these relationships is beneficial for a number of reasons.
Number one, just
asking a question like, how many resources do I
have? Or what is everything behind this IP address?
That's what these relationships tell me. But also, if I want to clean up my resources, right, I can
look at this lonely EC2 instance. I'll just keep going back to the EC2 instance because everyone
is familiar, I would assume, with that product. You know, I can now look at this lonely EC2 instance
and say, it looks unused. I can probably delete it.
But what you don't know when you do that
is something that's called the blast radius.
If I delete that, what else will go down?
And you just don't know.
You don't know.
You don't know without the relationships.
And that's why there's so much value
in capturing the dependencies
and the relationships of these assets.
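The blast radius idea can be sketched as a simple traversal over the dependency edges: starting from the resource you want to delete, collect everything that transitively depends on it. An illustrative Python sketch, with made-up resource names:

```python
from collections import defaultdict

def blast_radius(edges, resource):
    """Everything that transitively depends on `resource`.

    `edges` is a list of (resource, dependent) pairs, e.g.
    ("ec2-1", "vol-1") meaning the volume is attached to the instance
    and goes down with it.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    seen, stack = set(), [resource]
    # Depth-first walk over the dependency graph.
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen
```

With the relationships captured, "what else will go down?" becomes a graph query instead of guesswork.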
And how do you do that with Resoto?
Yeah, so we have a data model, we look at the resource and all the
different properties, and
we map that to our data model.
And then we also map
the dependencies.
It's a little bit of a visual thing.
I think it's best to check out what this looks like
in reality in our docs.
But it really
is like a graph. It's a graph, right?
And we map that up front.
And it's a statically typed model, right?
And so, you know, if we know, for instance,
that a compute instance has a certain number of cores,
well, that's an integer, right?
And so we put a lot of time into understanding the data models, we map those relationships, statically typed, and we collect the data.
And because it's statically typed, it also allows us to index this data really quickly.
And so search becomes really fast versus, you know, running some batch jobs with SQL queries. And how do you query this data using Resoto?
Because we are talking about a graph data model here.
I would assume probably you're using a graph database.
Yes, that's correct.
Yeah, so we're open source, right?
And the graph database we use is called ArangoDB.
There are other products out there.
I think most people, when they hear graph database,
they will think Neo4j, right?
They kind of pioneered the model.
And, you know,
graph databases are pretty powerful for very specific use cases,
but, you know,
a graph query language is pretty hard.
Unless you do it every day,
I'm not sure it's worth the investment
to learn that language.
So what we've done is, you know,
we've created our own domain-specific language.
It's a search syntax that simplifies
a lot of these things.
And it's actually understandable for humans.
You know, like terms like search.
We offer full-text search, right?
And so it's this domain,
it's this search syntax that you can use.
And anyone who's familiar with using,
you know, a command-line tool will be
able to pick it up very quickly.
And in the future, something we're
working on are Jinja templates
so that you don't even have to
really know, like you write in
Jinja and then it just automatically
creates the query
in our
Resoto syntax.
Yeah, and one last question for me,
and then I'll give the microphone back to Eric.
You mentioned that outside of Resoto,
there's also another product called Cloud to SQL, right?
Yeah.
Yes.
How does this work?
How do you go from this graph data model into SQL?
How do you do that?
Yeah.
So there is an existing world of analytics engineering, right?
And we have all the data infrastructure in place already.
And there's nothing that keeps us from working with infrastructure data the same way we work
with data from your CRM, from Salesforce, Google Analytics, Marketo. And that's why we introduced
Cloud to SQL. Basically, all the things that I told you about, the graph and the dependencies and all of that,
we just flatten the data out, right?
And put it into tables and rows.
And then you can export it to a destination of your choice, right?
And we call it SQL because you can export it to Snowflake and to Postgres, also S3, right?
And that's what we use that product for.
Now, you will lose these relationships, right,
from the graph.
And they're useful for use cases like, obviously, cleanup and security.
But then, in theory, you can rebuild them by writing your own joins across these tables.
And, you know, you can put it into a dbt model.
And then you expose it to your Metabase dashboard. So that's an option we wanted to give existing analytics engineers.
And that's why we introduced Cloud to SQL.
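The flatten-then-rejoin workflow Lars describes can be sketched end to end with an in-memory SQLite database: graph edges become plain foreign-key columns in the flattened tables, and a join rebuilds the relationship. The table and column names here are assumptions for illustration, not Cloud to SQL's actual output schema:

```python
import sqlite3

# Hypothetical flattened export: one table per resource kind, with the
# graph edge "volume attached to instance" kept as an instance_id column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id TEXT, name TEXT)")
conn.execute("CREATE TABLE volumes (id TEXT, instance_id TEXT, size_gb INT)")
conn.executemany("INSERT INTO instances VALUES (?, ?)",
                 [("i-1", "web"), ("i-2", "batch")])
conn.executemany("INSERT INTO volumes VALUES (?, ?, ?)",
                 [("v-1", "i-1", 100), ("v-2", "i-1", 50), ("v-3", "i-2", 200)])

# Rebuild the relationship with an ordinary join, e.g. storage per instance.
rows = conn.execute("""
    SELECT i.name, SUM(v.size_gb)
    FROM instances i JOIN volumes v ON v.instance_id = i.id
    GROUP BY i.name
    ORDER BY i.name
""").fetchall()
```

This is exactly the shape an analytics engineer already works with, which is the point: once flattened, infrastructure data flows through the same warehouse, dbt, and dashboard tooling as CRM data.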
Awesome.
Eric, all yours again.
Alright, Lars,
I guess the question is
how can
data teams
and infrastructure engineers in general
and, you know, we have sort of more data
engineers and data
teams, you know, as listeners for the podcast.
I guess the big question is the cloud infrastructure ecosystem is expanding at an
alarming rate. How should they think about that? And how can they sort of plan for what's coming in the next several years, especially as it
relates to cost?
Because it's going to be a problem.
I mean, you know, AWS is a great example, but, you know, all the other cloud providers
are going to become just as complicated over time.
Yeah.
You know, I would approach this from a perspective of what do we need to deliver our customers?
I don't mean to go too far away from the actual infrastructure, but we know that to stay alive, these companies need to ship and develop a lot of new digital products.
And for that, I need to have empowered developers.
They need to be able to spin up infrastructure.
I do not want to put too many blockers around them.
So this is the world I want to live in.
And I want to give them the freedom to try out new tools and new products that the cloud
providers give me.
Now, how do I stay in control when I do that?
And this world will never go away.
There will always be more innovation, more product, right?
And how do I stay in control
while I do that?
And I think it's the answer,
as we discussed on this,
includes data,
but it's to say like,
look, there's a time
for reactive intervention,
like real time,
high granularity data, right?
And there's tons of great tools
out there that do that.
But there's also time just for exploration,
for tracking long-term trends,
and for using data to take remediating action
that steers my infrastructure back on the path
that I want it to be on
without having some sort of incident, right?
And I think that's the philosophy that we're proposing, right?
You use data as an input to write code so that your developers don't have to.
Yeah. I love it. All right. Well, Lars, this has been absolutely fascinating,
an area that we haven't covered a ton on the show, but has direct impact on all sorts of
data stuff across the stack. So thank you for educating us and sharing your insights and best of luck with Resoto.
I appreciate it.
Always good seeing you guys.
And as a listener to the Data Stack Show, a longtime listener myself, I'm actually very
excited that I can be a guest now.
Well, it's a true privilege.
And thank you for your support.
Thank you, guys.
Costas, what a fascinating conversation with Lars from Resoto. I learned a huge amount. And I think
my big takeaway is that, you know, I kind of went into this conversation expecting to be astounded by the complexity
of sort of resource management across the entire ecosystem of infrastructure and tooling,
which I was. It's a very large scope, complex problem. But the bigger thing was how similar
the issue is actually to sort of a standard data flow in terms of the solution, right?
And so Lars kind of described it as you're, you know, you're sort of ingesting inputs, you're doing some sort of modeling, and then you're pushing those back out, right?
And so when we think about the modern data stack, I mean, that's, you know, bread and butter for a data engineer dealing with customer data, for example. So it really struck me that sort of, you know,
there's an elegant architecture that already exists
for solving this like pretty complex problem.
Yeah, yeah, 100%.
I think outside of like proving today
that resource management is a data problem,
I think we also proved that like everyone is a data engineer, right?
Like every software engineer is a data engineer at the end.
In a way,
many of the problems that we are talking about solving
actually contain
a big part of data engineering work
that has to be done.
Data has to be exported,
data has to be transformed somehow,
modeled,
and, of course,
being exposed to the data consumer for value
to be created there.
And I think especially...
I mean, okay, it will sound like it has been said many times already: we're entering
this decade where everything is going to be around data. But I think we're starting to see that a lot.
And we see that by actually getting into domains that don't necessarily
feel like they are, you know, data problems or data-related technologies that have
to be built.
But at the end, that's exactly what is happening, right?
And I think especially with AI and all the stuff that's happening right now,
we are going to see more and more of, let's say, these domains come back
and be rebuilt and rediscovered around the data problems that can be defined there,
including sales, marketing, pretty much
everything. And yeah, it was super fascinating. We should get Lars back again. He's a good
friend. And I think whenever we talk with him, we always come away with very interesting
insights. No, I completely agree. So much to learn
and would love to have Lars back.
Such a deep thinker
about these problems.
Go ahead and subscribe
to the show if you haven't.
Look it up
on your favorite podcast network,
tell a friend,
and we will catch you
on the next one.
We hope you enjoyed this episode
of the Data Stack Show.
Be sure to subscribe
on your favorite podcast app
to get notified
about new episodes every week.
We'd also love your feedback.
You can email me,
Eric Dodds,
at eric@datastackshow.com.
That's E-R-I-C
at datastackshow.com.
The show is brought to you
by RudderStack,
the CDP for developers.
Learn how to build a CDP
on your data warehouse
at rudderstack.com.