Orchestrate all the Things - Reducing cloud waste by optimizing Kubernetes with machine learning. Featuring StormForge CEO / Founder Matt Provo

Episode Date: February 23, 2022

Applications are proliferating, cloud complexity is exploding, and Kubernetes is prevailing as the foundation for application deployment in the cloud. That sounds like an optimization task ripe for machine learning, and StormForge is doing just that. Article published on ZDNet

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Applications are proliferating, cloud complexity is exploding, and Kubernetes is prevailing as the foundation for application deployment in the cloud. That sounds like an optimization task ripe for machine learning, and StormForge is doing just that. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration
Starting point is 00:00:28 on Twitter, LinkedIn, and Facebook. So I'm a product guy by background. I was at Apple for a long period of time. This is my second company after leaving Apple. And we started in 2016. And from the very beginning, the company has been focused on developing machine learning driven solutions for the enterprise. And in late 2018, early 2019, we had raised about six or seven million at that point, and we were using our core technology to manage how electricity is consumed in large HVAC and manufacturing equipment and the like. And at the time, we were a Docker Swarm shop.
Starting point is 00:01:17 And we were having some challenges on scaling the product. Part of that was related to our use of Docker at the time. And so when we lifted and shifted to Kubernetes was really when we found kind of the perfect use case for our core competency from a machine learning standpoint. And so it was an interesting shift because it was about five weeks after the most recent board meeting.
Starting point is 00:01:44 And I went back to the board and they were super supportive on us pivoting away and we spent a bunch of time talking to developers and users about the pain points that we solve for to validate really that that these were real, and, and that we add something. And from that point forward, we've been continuing to obviously build the solution, but also, you know, grow the team, build up the go-to-market and kind of, you know, grow the company from there. I've always wanted to build a company that also has an impact. And so a big part of our story is, uh, is helping to reduce carbon emissions. And in particular, those related to cloud waste and, uh, you know, as more and more people move to the cloud and more, uh, resources are consumed, of course, uh, there's, there is a direct connection to, uh, to, to cloud waste and, and in many ways, uh, you know, carbon emissions and carbon footprint. And so the company has a pretty strong mission oriented side to it as well. And, you know, that's, that's something that, you know,
Starting point is 00:02:54 is really exciting for me to be a part of leading, but also I think exciting for the team to be a part of as well. Okay, thank you. I did a little bit of looking around, as much as time would permit, and found out there seems to have been a merger at some point in the company's history, and I was wondering if you could refer to that a little bit, and any other, well, key company facts that you'd like to share, things like, I don't know, headcount and capital raised and this type of thing. Yeah, sure. And by the way, we have a nice presentation that we can send you afterwards as well that kind of walks through different
Starting point is 00:03:41 pieces of who we are, where we came from, what we're announcing, where the platform's headed going forward and all that as well. But yeah, from a company overview standpoint, our largest investor is Insight Partners out of New York. They led our Series B funding, which was 63 million, so we've raised just over 70 million in funding at this point. We did acquire a performance testing solution out of Germany, so we have operations now in the US and Germany and a little bit in APAC. We made that acquisition in 2020. And, you know, as we get into our solution, you'll come to understand that performance and load tests historically have been our biggest data input from a machine learning standpoint. So we were getting a lot of requests from people to kind of couple the performance testing data input side directly to our machine learning,
Starting point is 00:04:46 you know, without having to bring in another vendor. And so we don't require that, but we do offer that as a part of our solution. If people have their own load testing suite already, we will likely integrate into that pretty seamlessly. But yeah, we did make that acquisition in 2020. We're coming off our strongest quarter in the company's history, coming out of Q4 and really having a nice Q1 already, leading up to a nice 2022. And then we've got some really exciting and interesting partnerships that we're announcing with AWS and Datadog and others, as well as also adding a new board member. So the former COO, most recently of Tricentis, has joined our board, and we'll be announcing that as well. Okay. Well, thank you. And yeah, with that,
Starting point is 00:05:47 you already touched on the product side of things, which was going to be my next question. So I did have, again, just a superficial look, to be honest with you, on the product side of things. And I'm sure you've repeated that a thousand times already. But for the benefit of people who may be listening to the podcast, if you could just give a brief end-to-end, let's say, description of what it is you do and the different steps in this lifecycle.
Starting point is 00:06:19 Yeah, no, absolutely. So I'll start at the highest of levels, the sort of the tweet of what we do, if you will, which is around automating Kubernetes resource management: users should be able to manage those resources at scale without having to choose between something like cost or performance. They should be able to receive options back on how they configure their application and ultimately how they use resources that allow them to operate across the metrics that they care about for that application. And so we work in two environments. If you think about kind of the CI/CD pipeline,
Starting point is 00:07:08 we work in a pre-prod or non-production environment, and we use load tests and performance tests as the data input. We call that the experimentation side of the StormForge platform, where developers in a pre-production environment are putting load against their applications to then use machine learning to actually spin up versions of the application against the goals or the scenarios that they have. And we are returning back to them configuration options to deploy the application that typically results in somewhere between 40
Starting point is 00:07:46 and 60 percent cost savings and somewhere between 30 and 50 percent increase in performance. And that's all taking place, as I said, in a pre-production environment. What we're adding and announcing to the platform is some very similar capabilities, but in a production environment. And so we call this the observation side of our platform. And so we are using, in this case, telemetry data, observability data, integration through APMs, to take that information, again connected to the metrics they care about for that application, and provide nearly real-time recommendations that the user can choose to either manually deploy
Starting point is 00:08:38 or kind of what we would call set and forget, which allows the machine learning to say, these are the best recommendations from a resource standpoint within certain thresholds that the user defines. And you can automatically deploy those across the frequency that you care about. And so it's exciting to be able to work across both environments in the same platform and be able to provide that value for our users.
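For readers who want to picture what one of those recommendations amounts to inside a Kubernetes cluster, here is a minimal sketch using the standard Kubernetes Python client: applying recommended CPU and memory requests and limits to a Deployment. The deployment name, namespace, container name and values are made up for illustration, and this is not StormForge's own automation, just the kind of change such a recommendation ultimately drives.

```python
# Minimal sketch: applying a recommended resource configuration to a Deployment.
# Names and values are hypothetical; a real rollout would also validate and
# monitor the change rather than blindly patching.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster
apps = client.AppsV1Api()

# A "configuration option" ultimately boils down to settings like these.
recommended = {
    "requests": {"cpu": "500m", "memory": "768Mi"},
    "limits": {"cpu": "1", "memory": "1Gi"},
}

# Strategic-merge patch targeting the container by name.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "api", "resources": recommended}]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="my-app", namespace="default", body=patch)
```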
Starting point is 00:09:20 Okay, thanks for the high-level description. And I would say that, conceptually at least, what you do seems straightforward enough. I think there are, however, a few points that deserve highlighting, one of which you kind of touched upon already, which is that, well, this basically seems like an optimization problem, the crux of which being, again, what is it that the users want to optimize for? And there's different options there: you can optimize for performance, you can optimize for cost, or whatever else. And how does it work there from your point of view? So do you give users a pre-compiled, let's say, list of options? Or do they come to you with the other things that they want to optimize? And how many of those, you
Starting point is 00:10:05 know, suppose you have like a list of two or three things, do they have to prioritize? Obviously, you can't optimize for everything at the same time. So how does that work? Yeah, no, it's a great question. So we have a couple views that I'll highlight. One is StormForge does not know the nature of the end user's application. We don't even necessarily out of the gate know the business goals or the SLAs that they're tied to. We don't know the metrics they care about. And we're okay with that. We need to be able to provide enough flexibility
Starting point is 00:10:40 and a user experience that allows the developer themselves to say, these are the things I care about, these are the objectives that I need to meet or stay within, and here are my goals. And from that point forward, when we have that, the machine learning kicks in and takes over and will provide many times tens, if not hundreds, of configuration options that meet or exceed those objectives. And so, yeah, oftentimes it starts out with something like memory and CPU, for example, as two parameters that they might care about. But pretty quickly and pretty often, our IP as an organization is in the space of multi-objective optimization using machine learning. And so we are able to go, you know, sort of to an infinite number of parameters, but usually with the really sophisticated users it, you know, can sometimes
Starting point is 00:11:41 be above 10 parameters that people are looking at and getting information back to be able to decide which option they want to move forward with. And so you're also right. Most often it is sort of on a cost versus performance continuum. But there's certainly other scenarios that people come up with. Our charge is also to empower developers into the process, not automate them out of it. And so we want them to be involved. We want them to be augmenting even the machine learning capabilities by giving feedback, which we allow for in our UI and in the product experience itself. And what we find pretty quickly is that over time, users trust the results more and more. And the last thing they want to be doing in many cases is manually tuning their applications. They would rather be working on other tasks and kind of watching over what StormForge is providing for them. And when that starts to happen, we know it's a good sign of success.
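To make the multi-objective idea a bit more concrete, here is a minimal, purely illustrative Python sketch: a handful of candidate configurations, each scored on estimated cost and p95 latency from a trial, filtered against user-defined objectives and then reduced to the non-dominated (Pareto-optimal) options. The numbers and field names are invented, and StormForge's actual machine learning is of course far more involved than a brute-force filter like this; the point is simply that the user states the objectives and gets back the configurations that survive them.

```python
# Illustrative sketch (not StormForge code): multi-objective selection over
# candidate pod configurations, each scored on monthly cost and p95 latency.
from dataclasses import dataclass

@dataclass
class Candidate:
    cpu_millicores: int
    memory_mib: int
    cost_per_month: float   # e.g. estimated from requested resources
    p95_latency_ms: float   # e.g. measured during a load-test trial

def meets_objectives(c: Candidate, max_cost: float, max_latency_ms: float) -> bool:
    """User-defined objectives: stay within a cost budget and a latency goal."""
    return c.cost_per_month <= max_cost and c.p95_latency_ms <= max_latency_ms

def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep configurations that no other candidate beats on both cost and latency."""
    front = []
    for c in candidates:
        dominated = any(
            o.cost_per_month <= c.cost_per_month
            and o.p95_latency_ms <= c.p95_latency_ms
            and (o.cost_per_month < c.cost_per_month or o.p95_latency_ms < c.p95_latency_ms)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

trials = [
    Candidate(500, 512, 95.0, 180.0),
    Candidate(750, 768, 140.0, 125.0),
    Candidate(1000, 1024, 190.0, 120.0),
    Candidate(2000, 2048, 380.0, 118.0),
]
viable = [c for c in trials if meets_objectives(c, max_cost=200.0, max_latency_ms=150.0)]
for option in pareto_front(viable):
    print(option)
```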
Starting point is 00:12:47 Okay. So obviously, as you said, your core IP is basically around multi-objective optimization. And in order to generate machine learning models to address that in the first place, obviously you must have fed them with a lot of high quality data. And I'm wondering, where is it that you got that data to boot?
Starting point is 00:13:20 Yeah, that's a good question. So garbage in, garbage out would be my first response. Machine learning is only as good as the data that it's given and that it's fed. And so early on, the first few years that we were in existence, we were on a data hunt, is just the honest answer. We have an incredibly talented team, and we have from the beginning. But we were, you know, talent rich and data poor, if you will. And so we actually engaged our first five or six, you know, pretty significant customer engagements. We charged next to nothing in exchange for access to the appropriate data that we needed. And in some cases, we also invested in data labeling and really making sure that the integrity of the core data set itself was at the highest of
Starting point is 00:14:21 qualities. And this was actually before some of the MLOps platforms that you see today around building and managing machine learning models had kind of taken off at all. And so it was a painful process for us to figure out how to get the data into the right spot. But I'm glad we invested early because we were able to build that base, that repository
Starting point is 00:14:47 of information that was helpful and accurate. And then once we had that base, you know, with every customer engagement, every user that we bring in (we do have a kind of a try-before-you-buy motion in what we do), we don't take people's data, but the more deployments and the more reps that we have, you know, the better that repository and the better that learning gets. And so now the machine learning is sort of at a place where it's not plateauing, it continues to learn, but it's got an incredibly strong base. It has seen an incredible breadth and depth of scenarios and different sorts of application or environment world views, if you will, across a bunch of different parameters. And so at this point it's about, you know, tweaking and changing and continuing to build it, for sure.
Starting point is 00:15:46 We continue to invest in it. But that core is there, and that's what we've protected from an IP perspective as well. That was actually going to be my follow-up question on that. Obviously, Kubernetes itself is evolving, and there's more and different parameters to be set. And on the other hand, each application scenario is different. Well, obviously, there's going to be similarities, but no two applications are the same, probably. So I was wondering whether those models need tailoring for each application that you encounter, and also whether you need to tweak and adjust the model for different evolving versions of Kubernetes
Starting point is 00:16:31 and how do you manage to do that? Yeah. So the StormForge platform, I think, is a very unique combination, which is a difficult thing to get over and it took us years. A lot of machine learning or AI work, frankly, either remains in academia or kind of dies before it can be productized and really taken forward. And one of the main reasons is because of the intersection between fields like data science and engineering. And so at the tip of that intersection is productization. And while we've got incredibly talented data scientists, we've got incredibly talented engineers on the Kubernetes side.
Starting point is 00:17:23 And so that intersection that we've sort of crossed over between data science and engineering is one piece to that puzzle. There's, I wouldn't call it individual tailoring of a model; it's the core model itself for any given application, but there is, you know, some unique learning that takes place each time a new application or a new scenario is introduced. And, you know, that's a pretty quick process. I'll give you an example: for every parameter that's introduced to a scenario within our world, every additional parameter, it takes
Starting point is 00:18:06 about 10 minutes for the machine learning to kick in, and by the way, it's doing this on its own, we're not hand-tweaking anything. It takes about 10 minutes for each additional parameter to kick in once we get beyond two or three. And so, you know, if someone wants to go to five parameters for an experiment, then you're talking about, you know, maybe 45 minutes to an hour of lead time. And then from that point forward, the machine learning has kind of learned and caught up and is able to return back configurations that include that parameter. So I don't know if that answers your question, but, you know, there's a little bit of kind of learning that takes place. But overall we view that as a good thing, because,
Starting point is 00:18:53 again, the more scenarios and more situations that we can encounter, you know, the better our performance can be. Yeah, yeah, no question that it's a good thing. My question was not around that, it was mostly about, well, how do you manage to do that? And well, the next thing I wanted to ask you is that, well, it seems that the release you're about to announce, you can correct me if my understanding is not correct, but it seems like basically it extends what you have already been doing in pre-production to production. So I was wondering, well, what did it take to do that? Because my sort of educated guess, you can call it, is that, well, in pre-production, you can experiment a lot. You can play around with different configurations and so on. In production, I'm not so sure that's actually possible.
Starting point is 00:19:47 So was that the main obstacle that you had to overcome? And if yes, how did you do that? Yeah, that's a good question. You were cutting out just a little bit, but I think I got all of it. One distinction I do like to make is there's a very fine line between learning and observing from production data, getting the most out of it, our system returning recommendations that can be either manually or automatically deployed. There's a very fine line between that and live tuning in production.
Starting point is 00:20:27 And we're not live tuning in production. Although it's a very fine line, when you cross over that line, the level of risk is unmanageable and untenable. It's not something we want to take on. It's also not something that any of our users, our customers, really want. I mean, we've asked them if they would go that far with things, and the unequivocal answer has been no, they don't want to go that far. And so, I will say, what has worked for us in pre-production, you know, we added many of those same capabilities in production. So it is an extension of the platform for sure.
Starting point is 00:21:13 It also means that we're not single-threaded any longer on either an environment, so pre-production, or a data input, which would be load or performance tests. And so as we extend, if you will, to adding those production capabilities, think about us though as a vertical solution that's kind of right now at the application layer. We will, across this year, be going down the stack, adding new data inputs, and ultimately then going beyond the application layer and looking at the entire cluster itself. And so we'll add things like traces and logs and even kernel-level stuff that will allow us to continue driving optimization forward,
Starting point is 00:22:01 but across the entire stack, as opposed to kind of focusing only at the application layer, if that makes sense. Okay, I see. Then it sounds basically like what you're describing is that you're extending what you feed your models with. But then the question is, okay, so you feed it with more sources of data, but is the process of optimizing and deploying optimized applications the same? I'm guessing probably no, by the way you described it, because, well, you don't want to have downtime.
Starting point is 00:22:39 So if it's not the same, how does it work? So we will... By the way, I didn't answer part of your question. So a lot of what is transferable and a lot of what will continue to be transferable are the core pieces of the platform. So user management, roles and bindings,
Starting point is 00:23:01 so RBAC permissions-related items, the core infrastructure, the UI: with all of those things, you'll continue to be able to go to one spot and have one experience, which is good. But based on the technique, the approach and the data source, we do distinguish between those different approaches. So you have a tab in our UI where you're focused on the pre-production side of things. You have a separate tab in our UI where you're focused on the production and observability things. And then so on and so forth as we go down the stack, because the business objectives, the risk, the scenarios themselves are different
Starting point is 00:23:50 at different parts of the stack. And so we're very much in kind of R&D mode related to traces and logs and kernel and some of those other things. So I don't have a great answer for you now on exactly how that's going to work, but it will be in the same UI. And I believe we will still kind of separate the approaches as we go down the stack. Okay. So then I'm guessing the takeaway is that, as far as production environments go,
Starting point is 00:24:25 you do collect the data, you do use it, but you don't actually redeploy applications. You don't optimize applications for redeployment, right? With the production side? So we actually do, because at the point of optimization, what we're saying to the user is, where is your risk tolerance? What are you comfortable with from an automation standpoint? And what are your kind of min and max goals by metric? And we'll return options back to you that meet or exceed those goals. In the production scenario, we are competing, quote-unquote, against something within a
Starting point is 00:25:15 Kubernetes environment called the VPA. And so that's the vertical pod autoscaler; we're also adding capabilities around the HPA, the horizontal pod autoscaler, and we'll allow what we call two-way intelligent scaling. And so the optimization and sort of the value we provide is measured against what the VPA and the HPA are recommending for the user within a Kubernetes environment. So even in the production scenario, we are seeing cost savings. They're not quite as high as the pre-production options, but you're still talking 20% to 30%, and you're still talking 20% improvement in performance typically. And again, those aren't things you have to choose between.
Starting point is 00:26:01 You can have both. Okay, so then I guess the final decision on whether to take up the recommendations and the optimizations that come out of the platform is with the people running the operations of the client, right? Yes, and that decision could be on a continuous basis for the application, deployment over deployment, or it could be a decision at a certain point in time that says, as long as you are within these thresholds, you can automatically deploy for me. And the user gets to decide that. Think of it as kind of like a slider of risk tolerance, if you will. And then another slider, which is like completely manual deployment on the left and completely automatic deployment on the right. And that's literally what we provide in the UI for the user. And they can kind of decide where they fit.
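As a rough illustration of that set-and-forget idea, here is a small, hypothetical Python sketch (not the StormForge API): a recommendation is clamped to the user's min and max goals per metric, and an automation setting decides whether it ships on its own or waits for approval.

```python
# Hypothetical sketch of set-and-forget within user-defined thresholds.
# Field names, units and values are invented for illustration.

def clamp(value: int, lo: int, hi: int) -> int:
    """Keep a recommended value inside the user's min/max goal for that metric."""
    return max(lo, min(hi, value))

def plan_deployment(recommended_cpu_m: int, recommended_mem_mib: int,
                    thresholds: dict, auto_deploy: bool) -> dict:
    plan = {
        "cpu_millicores": clamp(recommended_cpu_m,
                                thresholds["cpu_m"][0], thresholds["cpu_m"][1]),
        "memory_mib": clamp(recommended_mem_mib,
                            thresholds["mem_mib"][0], thresholds["mem_mib"][1]),
    }
    # The "slider": fully automatic on one end, manual approval on the other.
    plan["action"] = "apply" if auto_deploy else "await manual approval"
    return plan

# User-defined min/max goals per metric, with the slider set to fully automatic.
user_thresholds = {"cpu_m": (250, 2000), "mem_mib": (256, 4096)}
print(plan_deployment(recommended_cpu_m=600, recommended_mem_mib=700,
                      thresholds=user_thresholds, auto_deploy=True))
```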
Starting point is 00:26:53 Okay, I see. There's also something else that caught my attention in browsing around your site, basically, which was, I think you have something like a sort of money-back guarantee in a way, which says, well, if we can't save you at least 30% of your cost, we'll pay the difference. And you also have some charities
Starting point is 00:27:21 that you endorse and give out money to in case this clause is activated. I'm just wondering, has it been used much? Have you been forced to pay people much? Well, number one, it's a company backing, but it's also a personal guarantee by me. So my wife really loves that I did that, by the way. But no, we have never had to; we've never been below 30%. So the guarantee says that if we don't save you 30%, you know, then we'll do that.
Starting point is 00:27:55 And that's, again, connected also not just to the performance of our solutions, but also to the mission behind what we're trying to do around reduction of carbon footprint and emissions. We do partner with those organizations, which are ones that are obviously doing great work, not connected to our technology directly in any way, but ones that we support. But yeah, we take it very seriously. People challenge us often and we've been able to deliver. Thanks. I think we're almost out of time.
Starting point is 00:28:36 So we should probably be wrapping up. And if you would just want to share what's next, basically, for StormForge. So after releasing this, what's on your roadmap? Yeah, so most of what's on our roadmap, and again, we'll send that along to you, is the pathway that I was talking about in going down the stack itself. So continuing to add additional data sources, you know, to be able to go beyond the application layer, that stuff we are adding aggressively. We're also,
Starting point is 00:29:14 we're also releasing fully on-prem, air-gapped capabilities for our solution. We get a lot of requests to go into government or regulated environments with banks and others, and so everything that we're able to do through the cloud today will shortly be available in a completely on-prem, air-gapped environment. We're also releasing a new report, or entry point for the platform, if you will, that we're calling cluster health. So this is like a one-click install where pretty immediately you get information back about your cluster, and it kind of guides you to, you know, what should I think about optimizing first, or where should I focus my time? You know, so we'll be launching that.
Starting point is 00:30:06 I mentioned the two-way intelligent scaling with the HPA and the VPA. And then we are shipping Optimize Live, which is the new piece of our platform. Out of the box, we will have integrations with Datadog and Prometheus. We're also announcing a pretty significant partnership with AWS, and we'll add additional integrations with things like Dynatrace and other APMs across this year as well.
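For a sense of what an out-of-the-box Prometheus integration can feed an optimizer like this, here is a small illustrative sketch that pulls per-container CPU and memory usage from the standard cAdvisor metrics via the Prometheus HTTP API. The Prometheus URL and namespace are placeholders, and this is not StormForge's integration code.

```python
# Illustrative sketch: the kind of per-container usage telemetry a Prometheus
# integration can pull. Endpoint and namespace are placeholders.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # placeholder endpoint

queries = {
    # Average CPU usage (in cores) per container over the last hour.
    "cpu_cores": 'avg by (container) (rate(container_cpu_usage_seconds_total{namespace="shop"}[1h]))',
    # Current working-set memory per container, in bytes.
    "memory_bytes": 'avg by (container) (container_memory_working_set_bytes{namespace="shop"})',
}

for name, promql in queries.items():
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(name, series["metric"].get("container"), series["value"][1])
```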
Starting point is 00:30:37 Okay, well, thanks. It sounds like, well, you're going to be keeping busy. So good luck with everything. And thanks for the conversation. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
