Orchestrate all the Things - Reducing cloud waste by optimizing Kubernetes with machine learning. Featuring StormForge CEO / Founder Matt Provo
Episode Date: February 23, 2022. Applications are proliferating, cloud complexity is exploding, and Kubernetes is prevailing as the foundation for application deployment in the cloud. That sounds like an optimization task ripe for machine learning, and StormForge is doing just that. Article published on ZDNet.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Applications are proliferating, cloud complexity is exploding,
and Kubernetes is prevailing as the foundation for application deployment in the cloud.
That sounds like an optimization task ripe for machine learning,
and StormForge is doing just that.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
So I'm a product guy by background.
I was at Apple for a long period of time.
This is my second company after leaving Apple.
And we started in 2016.
And from the very beginning, the company has been focused on developing machine learning-driven solutions for the enterprise. In late 2018, early 2019, we had raised about $6 or $7 million at that point, and we were using our core technology to manage how electricity is consumed in large HVAC and manufacturing equipment and the like.
And at the time, we were a Docker swarm shop.
And we were having some challenges on scaling the product.
Part of that was related to our use of Docker at the time.
And so when we lifted and shifted to Kubernetes
was really when we found kind of the perfect use case
for our core competency from a machine learning standpoint.
And so it was an interesting shift
because it was about five weeks
after the most recent board meeting.
And I went back to the board, and they were super supportive of us pivoting. We spent a bunch of time talking to developers and users about the pain points that we solve for, to validate that these were real and that we add something. And from that point forward, we've been continuing to obviously build the solution, but also grow the team, build up the go-to-market, and grow the company from there.
I've always wanted to build a company that also has an impact. And so a big part of our story is helping to reduce carbon emissions, and in particular those related to cloud waste. As more and more people move to the cloud and more resources are consumed, there is a direct connection to cloud waste and, in many ways, carbon emissions and carbon footprint. And so the company has a pretty strong mission-oriented side to it as well. That's something that is really exciting for me to be a part of leading, but also, I think, exciting for the team to be a part of as well.
Okay, thank you. I did a little bit of looking around, as much as time would permit, and found out there seems to have been a merger at some point in the company's history. I was wondering if you could refer to that a little bit, and any other key company facts that you'd like to share, things like, I don't know, headcount and capital raised and this type of thing.
Yeah, sure. And by the way, we have a nice presentation that we can send you afterwards as well that kind of walks through different pieces of who we are, where we came from, what we're announcing, and where the platform's headed going forward, and all that as well. But yeah, from a company overview standpoint, our largest investor is Insight Partners out of New York. They led our Series B funding, which was $63 million. So we've raised just over $70 million in funding at this point. We did acquire a performance testing solution out of Germany, so we have operations now in the US and Germany and a little bit in APAC. We made that acquisition in 2020. And as we get into our solution, you'll come to
understand that performance and load tests historically have been our biggest data input
from a machine learning standpoint. So we were getting a lot of requests from people to kind of
couple the performance testing data input side directly to our machine learning,
you know, without having to bring in another vendor. And so we don't require that, but we do
offer that as a part of our solution. If people have their own load testing suite already,
we will likely integrate into that pretty seamlessly. But yeah, we did make that
acquisition in 2020. We're coming off our strongest quarter in the company's history,
coming out of Q4 and really having a nice Q1 already, leading up to a nice 2022.
And then we've got some really exciting and interesting partnerships that we're announcing
with AWS and Datadog and others, as well as adding a new board member. So the former COO, most recently of Tricentis, has joined our board, and we'll be announcing that as well.
Okay, well, thank you. And yeah, with that, you already touched on the product side of things, which was going to be my next question.
So I did have, again, just a superficial look, to be honest with you, on the product side of things.
And I'm sure you've repeated that a thousand times already.
But for the benefit of people who may be listening to the podcast,
if you could just give a brief end-to-end, let's say,
description of what it is you do and the different steps in this lifecycle.
Yeah, no, absolutely.
So I'll start at the highest of levels, the sort of tweet of what we do, if you will, which is around automatically optimizing how Kubernetes applications use resources. Users should be able to manage those resources at scale without having to choose between something like cost or performance. They should be able to receive options back on how they configure their application, and ultimately how they use resources, that allow them to operate across the metrics that they care about for that application.
And so we work in two environments.
If you think about kind of the CICD pipeline,
we work in a pre-prod or non-production environment,
and we use load tests and performance tests as the data input.
We call that the experimentation side of the StormForge platform,
where developers in a pre-production environment are putting load against their applications, to then use machine learning to actually spin up versions of the application
against the goals or the scenarios that they have. And we are returning back to them
configuration options to deploy the application that typically results in somewhere between 40
and 60 percent cost savings and somewhere between 30 and 50 percent increase in performance.
And that's all taking place, as I said, in a pre-production environment.
What we're adding and announcing to the platform is some very similar capabilities,
but in a production environment.
And so we call this the observation side of our platform.
And so we are using, in this case, telemetry data and observability data, integrated through APMs, to take that information and, again, connect it to the metrics they care about for that application, and provide nearly real-time recommendations that the user can choose to either manually deploy
or kind of what we would call set and forget, which allows the machine learning to say, these are the best recommendations from
a resource standpoint within certain thresholds that the user defines. And you can automatically
deploy those across the frequency that you care about. And so it's exciting to be able to work
across both environments in the same platform
and be able to provide that value for our users.
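[Aside: to make the "configuration options" and cost-savings figures described above a bit more concrete, here is a minimal, hypothetical Python sketch of how a baseline versus a recommended Kubernetes resource configuration could translate into a monthly cost delta. The resource shapes and per-unit prices are assumptions for illustration only; this is not StormForge output or real cloud pricing.]

```python
# Minimal sketch (not StormForge output): compare a baseline Kubernetes resource
# configuration with a recommended one and estimate the cost delta.
# Resource shapes and per-unit prices below are illustrative assumptions.

BASELINE = {"replicas": 6, "cpu_cores": 2.0, "memory_gib": 8.0}
RECOMMENDED = {"replicas": 4, "cpu_cores": 1.5, "memory_gib": 5.0}

# Hypothetical on-demand prices per hour (assumed, not real cloud pricing).
PRICE_PER_CPU_HOUR = 0.031
PRICE_PER_GIB_HOUR = 0.004
HOURS_PER_MONTH = 730

def monthly_cost(config: dict) -> float:
    """Estimate monthly cost from replica count and per-replica requests."""
    per_replica = (config["cpu_cores"] * PRICE_PER_CPU_HOUR
                   + config["memory_gib"] * PRICE_PER_GIB_HOUR)
    return config["replicas"] * per_replica * HOURS_PER_MONTH

baseline_cost = monthly_cost(BASELINE)
recommended_cost = monthly_cost(RECOMMENDED)
savings_pct = 100 * (baseline_cost - recommended_cost) / baseline_cost

print(f"baseline:    ${baseline_cost:,.2f}/month")
print(f"recommended: ${recommended_cost:,.2f}/month")
print(f"savings:     {savings_pct:.0f}%")   # lands in the 40-60% range mentioned
```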
Okay. Thanks for the high-level description.
And I would say that conceptually, at least,
what you do seems straightforward enough.
I think there are, however, a few points that deserve highlighting, one of which you kind of touched upon already, which is that this basically seems like an optimization problem, the crux of which being: what is it that the users want to optimize for? And there are different options there; you can optimize for performance, you can optimize for cost, or whatever else.
And how does it work there from your point of view?
So do you give users a pre-compiled, let's say, list of options?
Or do they come to you with other things that they want to optimize? And however many of those there are, suppose you have like a list of two or three things, do they have to prioritize?
Obviously, you can't optimize for everything at the same time.
So how does that work?
Yeah, no, it's a great question.
So we have a couple views that I'll highlight.
One is StormForge does not know the nature of the end user's application. We don't even necessarily
out of the gate know the business goals or the SLAs that they're tied to. We don't know the
metrics they care about. And we're okay with that. We need to be able to provide enough flexibility
and a user experience that allows the developer themselves to say, these are the
things I care about. These are the objectives that I need to meet or stay within, and here are my
goals. And from that point forward, when we have that, the machine learning kicks in and takes over
and will provide many times tens, if not hundreds, of configuration options that meet or exceed those objectives.
And so, yeah, oftentimes it starts out with something like memory and CPU, for example, as two parameters that they might care about. But pretty quickly and pretty often, because our IP as an organization is in the space of multi-objective optimization using machine learning, we are able to go to sort of an infinite number of parameters. Usually with the really sophisticated users, it can sometimes be above 10 parameters that people are looking at, getting information back to be able to decide which option they want to move forward with.
And so you're also right.
Most often it is sort of on a cost versus performance continuum.
But there's certainly other scenarios that people come up with.
Our charge is also to empower developers into the process, not automate them out of it.
And so we want them to be involved. We want them to be augmenting even the machine learning capabilities by giving feedback, which we allow for in our UI and in the product experience itself. And what we find pretty quickly is that over time, users trust the results more and more.
And the last thing they want to be doing in many cases is manually tuning their applications.
They would rather be working on other tasks and kind of watching over what StormForge is providing for them.
And when that starts to happen, we know it's a good sign of success.
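[Aside: the multi-objective idea described here can be illustrated with a small sketch. Given candidate configurations that have each been measured on two hypothetical objectives, cost and p95 latency, keep only the non-dominated ones, i.e. the Pareto front the user would then choose from. This is a generic illustration of multi-objective selection, not StormForge's actual algorithm, and the trial data is made up.]

```python
# Generic illustration of multi-objective selection (not StormForge's algorithm):
# given measured trials, keep the Pareto-optimal ones, i.e. those for which no
# other trial is better on every objective (lower cost AND lower p95 latency).

from dataclasses import dataclass

@dataclass
class Trial:
    name: str
    cost_per_month: float   # objective 1: minimize
    p95_latency_ms: float   # objective 2: minimize

def dominates(a: Trial, b: Trial) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    return (a.cost_per_month <= b.cost_per_month
            and a.p95_latency_ms <= b.p95_latency_ms
            and (a.cost_per_month < b.cost_per_month
                 or a.p95_latency_ms < b.p95_latency_ms))

def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Return the non-dominated trials: the cost/latency trade-off frontier."""
    return [t for t in trials
            if not any(dominates(o, t) for o in trials if o is not t)]

# Hypothetical measurements from a handful of experiment trials.
trials = [
    Trial("baseline", 400.0, 180.0),
    Trial("config-a", 220.0, 190.0),
    Trial("config-b", 260.0, 140.0),
    Trial("config-c", 300.0, 200.0),
    Trial("config-d", 180.0, 230.0),
]

for t in pareto_front(trials):   # prints config-a, config-b, config-d
    print(f"{t.name}: ${t.cost_per_month}/month, p95 {t.p95_latency_ms} ms")
```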
Okay.
So obviously, as you said, basically your core IP is around multi-objective optimization. And in order to generate machine learning models to address that in the first place, obviously you must have fed them with a lot of high-quality data. And I'm wondering, where did you get that data to bootstrap them?
Yeah, that's a good question.
So garbage in, garbage out would be my first response. Machine learning is only as good as the data that it's given and that it's fed. And so early on, in the first few years that we were in existence, we were on a data hunt, is just the honest answer. We have an incredibly talented team, and we have from the beginning, but we were data poor, if you will: talent rich and data poor. And so in our first five or six pretty significant customer engagements, we charged next to nothing in exchange for
access to the appropriate data that we needed. And in some cases, we also invested in data
labeling and really making sure that the integrity of the core data set itself was at the highest of
qualities. And this was actually before some of the MLOps platforms
that you see today around building
and managing machine learning models
had kind of taken off at all.
And so it was a painful process for us
to figure out how to get the data into the right spot.
But I'm glad we invested early
because we were able to build that base, that repository
of information that was helpful and accurate. And then once we had that base, with every customer engagement, every user that we bring in (we do have kind of a try-before-you-buy motion in what we do), we don't take people's data, but the more deployments and the more reps that we have, the better that repository and the better that learning gets. And so now the machine learning is sort of at a place where it's not plateauing, it continues to learn, but it's got an incredibly strong base. It has seen an incredible breadth and depth of scenarios and different sort of application or environment world views, if you will, across a bunch of different parameters. And so at this point it's about tweaking and changing and continuing to build it, for sure.
We continue to invest in it.
But that core is there, and that's what we've protected from an IP perspective as well.
That was actually going to be my follow-up question on that.
Obviously, Kubernetes itself is evolving, and there's more and different parameters to be set.
And on the other hand, each application scenario is different.
Well, obviously, there's going to be similarities, but no two applications are the same, probably.
So I was wondering if those models need tailoring for each application that you encounter, and also whether you need to tweak and adjust the model
for different evolving versions of Kubernetes
and how do you manage to do that?
Yeah.
So the StormForge platform, I think, is a very unique combination,
which is a difficult thing to get over and it took us years.
A lot of machine learning or AI work, frankly, either remains in academia or kind of dies before it can be productized and really taken forward. And one of the main reasons is because of the
intersection between fields like data science and engineering.
And so at the tip of that intersection is productization.
We've got incredibly talented data scientists, and we've got incredibly talented engineers on the Kubernetes side.
And so that intersection that we've sort of crossed over between data science
and engineering is one piece of that puzzle. I wouldn't call it individual tailoring of a model; it's the core model itself for any given application, but there is
some unique learning that takes place each time a new application or a new scenario is introduced.
And, you know, that's a pretty quick process.
So I'll give you an example. For every parameter that's introduced to a scenario within our world (and by the way, the machine learning is doing this on its own, we're not hand-tweaking anything), it takes about 10 minutes for each additional parameter to kind of kick in once we get beyond two or three. And so if someone wants to go to five parameters for an experiment, then you're talking about maybe 45 minutes to an hour of lead time. And then from that point forward, the machine learning has kind of learned and caught up and is able to return back configurations that include that parameter. So I don't know if that answers your question, but there's a little bit of learning that takes place. Overall we view that as a good thing, because again, the more scenarios and situations that we can encounter, the better our performance can be.
Yeah, no question that it's a good thing. My question was not around that,
it was mostly about how you manage to do that. And the next thing I wanted to ask you is that it seems the release you're about to announce, and you can correct me if my understanding is not correct, basically extends what you have already been doing in pre-production to production. So I was wondering, what did it take to do that? Because my sort of educated guess, you can call it, is that in pre-production,
you can experiment a lot. You can play around with different configurations and so on. In production, I'm not so sure that's actually possible.
So was that the main obstacle that you had to overcome?
And if yes, how did you do that?
Yeah, that's a good question.
You were cutting out just a little bit,
but I think I got all of it.
One distinction I do like to make is
there's a very fine line between learning and observing from production data, getting the most out of it, and our system returning recommendations that can be either manually or automatically deployed; there's a very fine line between that and live tuning in production. And we're not live tuning in production. Although it's a very fine line, when you cross over that line, the level of risk is unmanageable and untenable. It's not something we want to take on. It's also not something that any of our users, our customers, really want. We've asked them if they would go that far with things, and the unequivocal answer has been no, they don't want to go that far. And so, to answer your question, what has worked for us in pre-production, we added many of those same capabilities in production.
So it is an extension of the platform, for sure. It also means that we're no longer single-threaded on either an environment, so pre-production, or a data input, which would be load or performance tests. And so as we extend, if you
will, to adding those production capabilities, think about us though as a vertical solution
that's kind of right now at the application layer. We will, across this year, be going down the stack,
adding new data inputs, and ultimately then going beyond the application layer
and looking at the entire cluster itself. And so we'll add things like traces and logs and
even kernel level stuff that will allow us to continue driving optimization forward,
but across the entire stack, as opposed to kind of focusing
only at the application layer, if that makes sense.
Okay. So it sounds basically like what you're describing is that you're extending what you feed your models with. But then the question is, okay, so you feed them with more sources of data,
but is the process of optimizing
and deploying optimized applications the same?
I'm guessing probably not, by the way you described it,
because, well, you don't want to have downtime.
So if it's not the same, how does it work?
So, by the way, I didn't answer part of your question.
So a lot of what is transferable
and a lot of what will continue to be transferable
are the core pieces of the platform.
So user management, roles and bindings, so RBAC permission-related items, the core infrastructure, the UI: for all of those things, you'll continue to be able to go to one spot and have one experience, which is good. But based on the technique, the approach, and the data source, we do distinguish between those different approaches.
So you have a tab in our UI where you're focused on the pre-production side of things.
You have a separate tab in our UI where you're focused on the production and observability things.
And then so on and so forth as we go down the stack because the business objectives, the risk,
the scenarios themselves are different
at different parts of the stack.
And so we're very much in kind of R&D mode
related to traces and logs and kernel
and some of those other things.
So I don't have a great answer for you now
on exactly how that's going to work, but it will be in the same UI. And I believe we will still kind of separate
the approaches as we go down the stack.
Okay. So then I'm guessing the takeaway is that,
as far as production environments go,
you do collect the data, you do use it, but you don't actually redeploy applications.
You don't optimize applications for redeployment, right?
With the production side?
So we actually do, because at the point of optimization, what we're saying to the user is, where is your risk tolerance?
What are you comfortable with from an automation standpoint?
And what are your kind of min and max goals by metric?
And we'll return options back to you that meet or exceed those goals.
In the production scenario, we are competing, quote-unquote, against something within a Kubernetes environment called the VPA, the vertical pod autoscaler. We're also adding capabilities around the HPA, the horizontal pod autoscaler, and we'll allow what we call
two-way intelligent scaling. And so the optimization and sort of the value we provide
is measured against what the VPA and the HPA are recommending for the user within a Kubernetes
environment. So even in the production scenario, we are seeing cost savings.
They're not quite as high as the pre-production options,
but you're still talking 20% to 30%,
and you're still talking 20% improvement in performance typically.
And again, those aren't things you have to choose between.
You can have both.
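[Aside: for readers less familiar with the autoscalers mentioned here, the Kubernetes HPA scales the number of pod replicas, while the VPA adjusts per-pod CPU and memory requests. The HPA's core behavior is roughly the ratio calculation sketched below; the real controller also applies a tolerance band, stabilization windows, and min/max replica limits. This is a simplified orientation sketch, not StormForge code.]

```python
# Rough sketch of the core Kubernetes HPA scaling rule (simplified: the real
# controller adds a tolerance band, stabilization windows, and replica limits).

import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Example: 4 replicas averaging 80% CPU utilization against a 50% target.
print(hpa_desired_replicas(4, current_metric=80.0, target_metric=50.0))  # -> 7
```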
Okay, so then I guess the final decision
on whether to take up the recommendations and the optimizations that come out of the platform
is with the people running the operations of the client, right? Yes, and that decision could be
on a continuous basis for the application, deployment over deployment, or it could be a decision at a certain point in time that says, as long as you are within these thresholds, you can automatically deploy for me.
And the user gets to decide that.
Think of it as kind of like a slider of risk, risk tolerance, if you will.
And then another slider, which is like completely manual deployment on the left and completely automatic deployment on the right.
And that's literally what we provide in the UI for the user.
And they can kind of decide where they fit.
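[Aside: a minimal sketch of how such a "risk slider" could gate automatic deployment, assuming user-defined thresholds and a manual-versus-automatic mode. The field names and policy shape here are hypothetical, not the actual StormForge UI or API.]

```python
# Hypothetical sketch of threshold-gated "set and forget" deployment.
# Field names and the policy shape are assumptions, not StormForge's API.

from dataclasses import dataclass

@dataclass
class Policy:
    auto_deploy: bool          # the manual-vs-automatic "slider"
    min_cpu_cores: float       # user-defined risk thresholds
    max_cpu_cores: float
    min_memory_gib: float
    max_memory_gib: float

@dataclass
class Recommendation:
    cpu_cores: float
    memory_gib: float

def within_thresholds(rec: Recommendation, policy: Policy) -> bool:
    return (policy.min_cpu_cores <= rec.cpu_cores <= policy.max_cpu_cores
            and policy.min_memory_gib <= rec.memory_gib <= policy.max_memory_gib)

def decide(rec: Recommendation, policy: Policy) -> str:
    """Auto-apply only when automation is on and the recommendation stays inside the user's bounds."""
    if policy.auto_deploy and within_thresholds(rec, policy):
        return "apply automatically"
    return "queue for manual review"

policy = Policy(auto_deploy=True, min_cpu_cores=0.5, max_cpu_cores=2.0,
                min_memory_gib=1.0, max_memory_gib=8.0)
print(decide(Recommendation(cpu_cores=1.2, memory_gib=4.0), policy))  # apply automatically
print(decide(Recommendation(cpu_cores=3.0, memory_gib=4.0), policy))  # queue for manual review
```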
Okay, I see. There's also something else that caught my attention in browsing around your site, basically,
which was, I think you have something like
a sort of money-back guarantee in a way,
which says, well, if we can't save you
at least 30% of your cost, we'll pay the difference.
And you also have some charities
that you endorse and give out money
in case this clause is activated.
I'm just wondering, has it been used much?
Have you been forced to pay people much?
Well, number one, the company backs it, but it's also a personal guarantee by me.
So my wife really loves that I did that, by the way.
But no, we have never had to; we've never been below 30%.
So the guarantee says that if we don't save you 30%, you know, then we'll do that.
And that's, again, connected not just to the performance of our solutions, but also to the mission behind what we're trying to do around reduction of carbon footprint and emissions.
We do partner with those organizations, which are ones that are obviously doing great work,
not connected to our technology directly in any way, but ones that we support. But yeah, we take it very seriously.
People challenge us often and we've been able to deliver.
Thanks.
I think we're almost out of time.
So we should be probably wrapping up.
And if you would just want to share
what's next basically for StormForge.
So after releasing this, what's on your roadmap?
Yeah, so most of what's on our roadmap, and again, we'll send that along to you, is the pathway that I was talking about in going down the stack itself. So continuing to add additional data sources to be able to go beyond the application layer, that stuff we are adding aggressively. We're also releasing fully on-prem, air-gapped capabilities for our solution. We get a lot of requests to go into government or regulated environments with banks and others. And so everything that we're able to do through the cloud today shortly will be available in a completely on-prem, air-gapped environment.
We're releasing a new report, or entry point for the platform if you will, that we're calling Cluster Health. So this is like a one-click install where pretty immediately you get information back about your cluster, and it kind of guides you to what should I think about optimizing first, or where should I focus my time. So we'll be launching that.
I mentioned the two-way intelligent scaling with the HPA and the VPA. And then with Optimize Live, which is the new piece of our platform,
Out of the box, we will have integrations with Datadog and Prometheus.
We're also announcing a pretty significant partnership with AWS,
but we'll add additional integrations with things like Dynatrace
and other APMs across this year as well.
Okay, well, thanks.
It sounds like, well, you're going to be keeping busy.
So good luck with everything.
And thanks for the conversation.
I hope you enjoyed the podcast. If you like my work, you can follow Linked Data
Orchestration on Twitter, LinkedIn and Facebook.