PurePerformance - Let the Machines optimize the Machines: Goal-Driven Performance Tuning with Stefano Doni

Episode Date: May 13, 2019

Did you know that the JVM has 700+ configuration settings? Did you know that MongoDB performance can be improved by 50% just by tuning the right database and OS knobs? Ever thought that slower I/O can... actually speed up database transaction times? In this episode we invited Stefano Doni, CTO at Akamas.io, who gives us a new perspective on how to approach performance optimization for complex environments. Instead of manually tweaking knobs on all sorts of runtimes or services, they developed a goal-driven AI engine that automatically identifies the optimal settings for any application as it is under load. Make sure to check out their website and white papers, where they go into details about how their algorithms work, which metrics they optimize, and how you can apply their technology in a continuous delivery process.
https://www.linkedin.com/in/stefanodoni/
https://www.akamas.io/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson, and as always, I'm here with Andy Grabner, my co-host and performicator extraordinaire. Wow, that's a new one. Right? That's like a performicator, a performance educator. Yeah, I don't know what it really means, but it sounds awesome. It sounds like a title I want to put on my LinkedIn or my Twitter. Yeah. And I know you said my voice sounds odd today,
Starting point is 00:00:55 and it probably sounded like crap in the previous podcast, so I'm sorry for people that have to listen to me. I got a new laptop, and it seems the audio is not the same right now. We still need to figure out the settings. I still think the content is the same, while the audio might not be perfect, but the content should be the same level of quality. But now, thanks to you,
Starting point is 00:01:18 I might have to put an explicit warning on here, because you just cursed there, Andy. I can't believe it you just brought us down to a new low as always we can only go higher from here right exactly yeah and speaking of going higher from here andy would you like to introduce our guest today of course i want um so it's another guest that uh we actually met a couple of weeks ago in the French Alps. Well, I met him before then. I met him last year in Barcelona at our PERFORM conference.
Starting point is 00:01:51 But our paths crossed again at the PEC event, the Neotis PEC in the south of France. Right. I remember that. Yeah. And Stefano Doni from Moviri is with us today. And I think what I really loved about his presentation, he talked about how he and his colleagues are solving some really big problems in the performance engineering space. And without going into any further details, I actually want to pass over the ball to Stefano. Or Stefano, how is the right pronunciation, Stefano?
Starting point is 00:02:24 Stefano, I guess. Yes, that's correct. All right. Hey, Stefano, would you mind introducing yourself a little bit of background on what you're doing and what you've been doing and why performance is so interesting for you? And then we'll figure out what the things are that you're doing that will help a lot of the performance engineers out there. Sure.
Starting point is 00:02:44 So thank you. Hi, everybody. This is Stefano Doni from Moviro Akamas. Thank you for having me today. So Akamas, I will just describe our new company, which is Akamas, which is a new technology company born in 2018 from within the Moviro group. So for those familiar, Moviro is a global software and services company,
Starting point is 00:03:09 which has almost 20 years of expertise focusing mainly on performance engineering. So we love making apps run faster, use less resources and meeting business demands. The company was founded in 2000 as a spin-off of Polytechnic of Milan, which is one of the leading universities in Europe, actually. You may know us for one of the earlier products we created, which when we were called Neptuny, which was Kaplan.
Starting point is 00:03:41 It was one of the today leading products in the capacity planning space, which we sold to BNC Software in 2010. Wow, that's pretty cool. So first of all, what I just learned that you are from Milan and for those people that never been to Milan, I unfortunately have never been, but I think Milan is also very well known for fashion. And now I know that you have one of the leading technology universities there as well. And you had some great products like Kaplan. I didn't know that.
Starting point is 00:04:11 So thanks for that fact. That's pretty cool. That's great. That's great. Thank you. I also want to point out, Andy, that he mentioned Akamas, the new company. But I want to mention to people that it's akamas.io. Because I just tried, when I saw about that, I was like mention to people that it's akamas.io because I just tried, you know, when I saw about it,
Starting point is 00:04:28 I was like, oh, let me Google Akamas, see, you know, read up about them. And obviously it's sort of an island or some sort of a promontory off of Cyprus. So because it's a geological place, make sure you type in akamas.io to get to their website. Give you a little plug there. Yeah, thank you, Brian. You're welcome.
Starting point is 00:04:48 So Stefano, what caused you to found this new company? What type of problem did you or are you going to solve? Or are you solving? Yeah, sure. So the real problem that we try to solve is actually the problem of performance optimization. So we work at doing lots and lots of performance optimization activities in my previous job at Moveri. So I joined Moveri in 2005. So doing performance work mainly all the time, performance tuning, performance testing, capacity optimization, and so on. And we really started noticing that the complexity of the IT stacks today is hiding significant
Starting point is 00:05:33 performance optimization opportunities. So today we are finding that even performance experts and even technology experts, so we are also speaking, including also the vendors of the technologies, are no longer able to find ways to really extract more performance out of today's enterprise and cloud applications. So we really found out that there are several problems, key problems that are in a way blocking us to extract the full potential of our stacks today. And really the key reason is related to the potential of our stacks today.
Starting point is 00:06:05 And really the key reason is related to the complexity of our stack. So we are seeing that, in a way, we are discovering a trend that is happening behind the scenes. So it's also a little bit unknown to many professionals that work into this space, is that today, many of the key technology pieces in the stack, like JVM, like a database, like a big data solution, all those components in the stacks
Starting point is 00:06:33 have hundreds and hundreds of configuration parameters. So those are knobs that you can configure into your application or into your configuration, database, settings, and so on, they can really have a tremendous impact on the performance of the end-to-end application. So we have lots of examples in terms of the numbers of parameters for key technologies. If you take, for example, a Java virtual machine today, which is one of the key pieces running, key runtime that is powering many business-critical enterprise applications,
Starting point is 00:07:16 but also many cloud-native technologies, the JVM is not going away anytime soon. So a modern JVM from Oracle or OpenJDK today has more than 700 parameters. Did you just say 700? Yes, 700 parameters. And pretty much nobody doesn't touch anything about this. So our typical experience for our customer is that everybody is running with default settings that are provided by the vendors. But the key question for us is, what is the potential in terms of performance optimization,
Starting point is 00:07:55 so making our application run faster or consume less resources, if we are able to actually exploit this space of optimization, which is in a way totally greenfield today. So nobody is doing, in a way, is starting to looking into the space in a scientific way to actually see what we can extract in terms of performance. So is this, I mean, I know we've looked into this a little bit, but, and I think I know what you guys are doing, but pretty much what you're saying is that these, let's focus on Java now. The Java runtime has evolved over many years,
Starting point is 00:08:30 over decades now, I would say. And over the years, the JVM vendors have added a lot of knobs to optimize the JVM for particular use cases. But the problem is obviously, there's so many different use cases out there. And in the end, typically there's these performance experts that then start tuning it, whether it's the garbage collection settings, whether it's the memory, I guess the sizes that obviously
Starting point is 00:08:56 also influence this garbage collection. And now I only cover garbage collection and I only know a handful of these settings. And you said 700 knobs that people can turn. I wonder, I mean you're right, I mean it's probably impossible to find the correct combination for an individual application out there unless you have a lot of time and you turn every knob individually and see what the effect is, right? Yeah, exactly. So the key point is that we are not only seeing that the number of parameters is huge, but
Starting point is 00:09:34 it's also increasing. So it's very interesting to figure out. It was a surprise also for us that some technology, pretty much all technologies today are increasing the number of parameters. So in a way, this is a little bit unexpected. So we might expect that mature technologies like, as you said, JVM, I think it has 25 years, modern JVM, so since the start date in 1995.
Starting point is 00:10:08 But also if you see the Linux kernel, so even at the operating system level, we have this increase in terms of parameters. So, we are not seeing, in a way, a trend where the technology is, in a way, self-managing out of the box as we might expect that everything is at their most performant in a way configuration irrespective of what is the actual customer workload so totally totally on the other way so we are seeing that the number of parameters is increasing for exactly the same reason that you said. So vendors are putting more and more parameters, more and more configuration options into their technology. And of course, this is not a problem that
Starting point is 00:10:53 vendors are not smart enough to determine the best settings. As you correctly said, it's all a matter of workloads of applications. So we have a wide spectrum of use cases and workloads. So you can redesign any technology or even any car to be able to perform at best independently of, in a way, the race and the condition of the tracks and so on. Yeah, I think... go on, Andy. Yeah, and I said, so the problem obviously that you guys solved
Starting point is 00:11:30 and now to kind of come to the conclusion of why you make our lives so much easier is because you came up with an approach of figuring out the right and optimal settings of all of these 700 different knobs for a specific application under a certain workload? Is this the right description of what you do? Yeah, actually, yeah. I would say yes. So we basically, the outcome is that we have designed a new solution, which is actually Akamas.
Starting point is 00:12:02 It's a new product because we really face this kind of problems in our daily consulting activities, and we didn't find out anything on the market ready to solve this kind of problem. So we actually designed a new product. And the key problem that we are solving is exactly that. So how we can actually leverage to explore this kind of huge complexity in terms of the parameters and all the combination, if you think about it.
Starting point is 00:12:29 So it's not just a matter of visiting 700 of parameters, 700 experiments to understand how to better tune a JVM, but you have also this combination across the layers in the stack. So you have the operating system. Now we have the containers. So containers are interacting a lot with JVMs. So it's not just, in a way, a sandbox where you put a JVM and it will behave like before. Containers are changing pretty much dramatically how the JVM itself is behaving, again,
Starting point is 00:13:06 in terms of memory management, garbage collection. But it's not just actually JVM parameters. We have also application. Many, many times we face a problem of tuning application level parameters. So application have parameters like number of threads or the length of the queues and the buffers, sizing of the connection pools.
Starting point is 00:13:32 So pretty much across the entire stack, we have layers like middleware that have lots and lots of parameters. And we actually solved the problem of properly extracting the top performance from today's application in a continuous way with the power of artificial intelligence well now you have to explain though what artificial intelligence really mean because we know we've been talking about ai a lot and it's a still a very hyped word and and now the question is what type of ai this really is because we and just coming back to dynatrace we were also talking about ai and we are
Starting point is 00:14:12 talking about the eye but we now more talk about a deterministic ai which means we we we have a model underneath and then we apply certain rules and um, you know, there's obviously, my point is, you know, when you just throw the term AI out there, it's hard to say what this really is. Is it just big number crunching? Is it machine learning? Is it neural networks? What does your AI do?
Starting point is 00:14:38 Yeah, sure. That's a great question. So we also face this kind of problem. So today every product must be ai driven otherwise it won't it won't sell any anything anymore so actually ai for us it's the it's actually the right answer to be able to explore this very complex optimization space so we use machine learning for actually determining better than the human brain, because we have found out that even the application and vendor experts are no longer able to understand
Starting point is 00:15:10 how to tune all those parameters. But we think that machine learning is actually the right answer to be able to navigate all these high-dimensional spaces and understand all the relationship among the parameters. In terms of which kind of AI, we are not using traditional, in a way, conventional neural networks, which are one of the, in a way, most used machine learning technologies today for several reasons. First of all, today, neural networks require a very, very big data set for the training.
Starting point is 00:15:46 So one of the key concepts that we have in Akamas is that we have to be able to optimize an application in a very fast way. So one of the things that we do is that we not only do prediction in terms of what will be the best parameters to tune our IT stacks, but we also apply the parameters to actual systems, production systems or test systems. So we need to absolutely be in a way fast in terms of converging to great configuration. So we can for sure in a a way, trying thousands of different knob values and performing different thousands of experiments in order to determine the best settings. So the AI that we built up in Akamas is an AI that has been designed by our PhD team that is mostly coming from Polytechnic Milan and also lots of researchers
Starting point is 00:16:48 that are basically just focusing on this topic in order to design a different kind of algorithms that can be used when you have to optimize what we call costly function. So for us, when we have to decide if a given configuration of the IT stack is good enough in terms of performance, we have to evaluate it. So we have to apply the new configuration to the actual production system and measure the goodness in a way. The response time in terms of response time,
Starting point is 00:17:22 perhaps we might want to make the application run faster. So we might want to minimize the response time or we might want to increase the throughput, for example, increasing the number of payments per seconds of a financial platform. And in doing that, we must evaluate the parameters. So we really have a costly function here because to evaluate a given settings or parameters,
Starting point is 00:17:49 we must run, for example, performance experiments that might last even hours of measurement of production systems. So we have designed new algorithms that are actually able to understand by doing very few sampling, very few evaluation of the different parameters and at the same time being able to, in a way, rapidly converge towards optimal settings. And also on the other side, the key goal for us is to avoid, in a way, unsafe regions. So we cannot really afford, in a way,
Starting point is 00:18:25 to propose, to configure the IT stacks with bad configuration, which might impact the user experience. So our AI... I just got a couple of questions here. I'm sorry to interrupt you, but that means you are, are you normally operating in a pre-prod or in a production environment?
Starting point is 00:18:51 Well, we can do both. Actually, the typical starting point to evaluate the technology for the customer is to work on pre-prod environments if they have them ready. So the typical setup for us would be to automate the performance testing and the tuning of the application in a test bed. That is something that we can do. More, in a way, mature customers that are, for example,
Starting point is 00:19:18 adopting modern DevOps practices have actually adopted the production online tuning, which is actually interesting, where we can integrate with newer tools. For example, SpinHacker, which enables you to do incremental rollouts and new deployments of new releases. And we can plug in into those processes and use these kind of techniques also, not just for the purposes of releasing new application release faster, but also to, in a way, optimize the performance of applications.
Starting point is 00:19:55 So that's actually pretty cool. So that means what you're saying, in a production environment, you would be like a canary release, but instead of the canary being new application code or new features or code changes it is your configuration changes that you put in and then you compare it with the main line of of the application exactly yeah well it's pretty cool hey um now the the other question that i, so you said your algorithm is obviously measuring the performance on different dimensions, response time, as you said, memory, cost efficiency. But you also said your algorithm is also making changes on the fly and then test how the impact is of these changes. Does this mean in order for your solution to work, you need to be able to change these parameters on the fly
Starting point is 00:20:50 or need to have the ability to restart JVMs because I assume some of these parameters can only be changed during startup? So how does this work? Yeah, that's a great question. Actually, within the product, we have this concept of optimization pack. So optimization pack, it's in a way like an app that you download on your phone. It's a package that contains the knowledge about the parameters of a specific technology that we can tune.
Starting point is 00:21:21 So within this optimization pack, we can also specify in a way additional properties or parameters. For example, one of the key actually capabilities to being able to say which parameter can be applied online. So it can be hot changed as the system is actually working with respect to other parameters that might require a restart. So as you say, for example, JVNs today have very few online tunable parameters
Starting point is 00:21:50 and they need restart. But in a way, we are seeing an interest, a good trend on this regard. Many technologies are more and more including parameters that can be live changed. So for example, databases more and more are able to change also key properties like the memory size, the buffer pool size live without any restart.
Starting point is 00:22:15 So this is something that we can do without actually impacting the application in terms of restarting it. So that's pretty cool. Yeah, Andy, this makes me think of bridging tools once again, right? So first of all, Stefano, thank you for saying all this stuff because I think this is an area that I haven't really thought about. Obviously, there have been ideas of GC tuning.
Starting point is 00:22:41 In fact, we just had somebody talking a few episodes ago talking about really writing memory optimized code but the idea of you know having so many knobs to tweak on the jvm settings probably most people don't even know about them there are a few common ones people take care of but what i'm imagining will happen um with a tool like this is someone is going to try to optimize JVM settings or whatever technology they're using to run bad code. Now, on the flip side, let's say we can identify bad code. So where I'm going with this is wondering how in the future these kind of things can merge together.
Starting point is 00:23:23 Because let's say you take a simple example like thread pool database thread pool right um and maybe your tool is identifying let's increase the thread pool because you're running out of threads and it's an easy thing to to run however on on the side of our tool will be like hey this transaction is making 30 calls to the database, the exact same query all the time. That could be optimized, which would then reduce the amount of threads you actually need on the JVM.
Starting point is 00:23:52 There's these multiple tiers. It sounds like on your side, you could do a lot with the JVM plus looking at how that's running on the host, but there's a gap of the code performance and are you making these changes to accommodate poor code um that you might not be aware of yeah i don't even know if that's a question more of a more of a more of a conceptual thought of like how can we put all these pieces together to you know make it make it all one. Yeah, that's, I think, what will actually happen
Starting point is 00:24:26 because actually today for us, application code and also the way the application developers write database queries are out of scope for us. So we are not actually improving them. We might work on that in the future. But in a way, what we found out, nevertheless, is that we are finding that there are huge improvements possible just by working on the infrastructure. But I see the possibility that
Starting point is 00:24:55 when we improve the infrastructure on one side, it will spend less time, perhaps, to work on writing more efficient code as centr simply possible. Yeah, it's interesting because there's so many areas to not only tweak for performance, but also so many different things you need to consider, like how do we improve performance? How much performance can we improve by changing the JVM settings? How much performance can we improve by making changes to host settings? How much performance can we improve by making changes to host settings? How much performance can we improve by making code changes? And it's just expanding this world of performance to more and more considerations to put into, because obviously the answer isn't always code.
Starting point is 00:25:35 The answer isn't always a tweak on the settings, but having all this information at your fingertips through whatever combination of tools you can get that is awesome. And incorporated into that, I just want to have one more follow-up. Do you have, because we see this a lot with the tools, a lot of the new tools now, is there like an API that you can either expose data from or ingest data into from your tool so that if you did want to use other things to combine data sets to maybe look at the broader range, is that part of what's in there yet?
Starting point is 00:26:08 Yeah, sure. So this is part, actually, one of the key pillars of our vision. So what we wanted to be is not, in a way, an AI-driven platform to optimize just JVMs or a specific technology. So what we built for us is an end-to-end platform for machine learning driven performance optimization that can really work on any technology on our side. So we can work from applications to middlewares,
Starting point is 00:26:39 to databases, JVMs, even on the cloud. And it's currently, as you said, so the other thing that the other key design decision was not to reinvent the wheel. So in a way, in order to let Akamas work out its magic, we do not require the customers, the users, to reimplement, for example, their performance testing or implement their monitoring strategy
Starting point is 00:27:03 by using our, in a way, agents. So the solution is totally agentless. So the key design decision we took was to instead integrate against the market leading solution for load testing, for monitoring, for configuration management so that we can actually bring in, we have open APIs and SDKs so that you can actually develop new integrations and bring in additional data that you may have from other tools that we currently don't support out of the box, and pretty much everything can work.
Starting point is 00:27:40 Well, it is also the way you mentioned earlier, there are also application configuration parameters. And obviously, these parameters will be independent or will be different for every application. So is that a way how an application developer can or an architect can get metrics into Akamas and say, hey, here are my configuration options and here are some of the metrics I can tell you from within my app. And then you start analyzing it and then you start tweaking them. Yeah, exactly. So we decided to build up a very flexible platform. So actually, it's very easy to instruct Akamas
Starting point is 00:28:19 to work on a different scope, optimization scope, a new optimization scope, for example, in order to tune application-specific parameters. And really, the experience we aimed on this area was to design a modern DevOps experience. So we have an infrastructure as code approach in terms of describing what Akamas needs to do. So everything is based on YAML files
Starting point is 00:28:45 inspired by Kubernetes or modern tools like Kubernetes and Dockers. And what you can do is describe your application, how it is composed, your parameters, and also how you can gather metrics and so on. And also, most importantly, what you want to optimize, what is your goal, what is your optimization scope, and so on. And also most importantly, what you cannot, what you want to optimize, what is your goal, what is your optimization scope and so on. So everything is described in terms of an infrastructure that's called
Starting point is 00:29:13 Paradigm. And then we have a modern CLI that is very much appreciated. We see by, uh, in a way DevOps kind of a profile that actually like to use CLIs to automate all the stuff. And also we have open APIs, so standard HTTP REST APIs that you can also use to completely automate the automation. So you can include, for example, an optimization activity of three days, something like that, that you can plug in in your development pipeline.
Starting point is 00:29:46 It's all entirely possible by invoking our APIs from existing tools. Pretty cool. Hey, can we talk, I have a couple of questions on some of the common things you find and what are the common things that you guys are tuning or giving recommendations on. But can you, before we go into that detail, give us some maybe examples on what's the typical improvement
Starting point is 00:30:12 that comes out of such an exercise? Because, you know, are we talking about 2% or 3% of memory improvement or 2% or 3% of performance improvement? Are we talking about 10%? What is the typical the typical improvement people see okay so we are seeing very very interesting improvements so definitely in the range of tens of percent so 20 50 and those sometimes in terms of entire factors of improvements. So one of the, by the way, of the white paper that you can download on our website is actually telling the story of one of the customers
Starting point is 00:30:55 that we had, which was operating a financial service, a payment service. And they asked us to optimize their IT stack, which was composed by a traditional, in a way, JBoss, JVM, and Linux operating system stack. And their most important business matrix was payments per second. So they wanted to increase the payments per second within a given infrastructural footprint, no matter what. So we tuned the infrastructure after the customer already
Starting point is 00:31:28 tuned with manual approaches, the same application with a three-month effort. And it also involved the vendors of the technologies. And by the way, it challenged us in order to improve further with respect to what the expert already achieved. And we integrated Akamai Syntho there, performance testing environments, and we were able to achieve 55 more per minute per second
Starting point is 00:31:55 out of the same infrastructure, the same application. So this is an example of the kind of improvements that are in a way lying into the IT stack. And the customer was shocked to understand that after three months, experts tuning activities, there still was this kind of improvement potential that were pretty much unexplored. Can you give me a quick, I mean, it sounds 55. Can you give me a little more reference on like you improved it by X many percent because 55 itself
Starting point is 00:32:28 on its own is a little hard to understand. Is it 55 from a thousand? Is it 55 from a hundred? Just the percentage numbers or how much? It's 55 percent. Okay, wow. I thought you said 55 additional transactions.
Starting point is 00:32:44 That's obviously impressive. That's obviously impressive. Yeah, that's very impressive. Wow. And the other thing that we are noticing also on the databases, it's another very interesting area where we are finding out very big, potentially even bigger results. So another white paper that you can find on our website is about the performance optimization of MongoDB,
Starting point is 00:33:12 where we compare the performance that you can get from a MongoDB database that you can manage, for example, on-prem or on the cloud. And we compare it against the AWS DocumentDB, which is a new database as a service offering. So it's a managed service from AWS, which pretty much is compatible with the existing MongoDB applications.
Starting point is 00:33:40 So what we have done is we benchmarked by using industrial or very common performance benchmarks for NoSQL databases. And what we saw is an impressive improvement that we can get from MongoDB because we are actually able to tune it in order to extract more performance. So in this case, we tuned the MongoDB, we tuned also the Linux operating system, and we were able to pretty much more than double the performance that we can get out of MongoDB or out of DocumentDB from AWS. Wow. And are these... I'm not that familiar with Mongo. Are these changes that can be done on runtime?
Starting point is 00:34:26 Does the MongoDB database give you these options to change it during runtime or do you need to restart the nodes all the time? How does this work with Mongo? Yeah, most of the parameters can be live changed actually on Mongo. And the other set of parameters that we tuned was on the operating system and pretty much the parameters of Linux OS can be live changed, actually. Wow. Can you give me, because this is a question that I tried to bring in earlier.
Starting point is 00:34:57 So I know there's a lot of things that people can change and configuration options. But for those people that are wondering now, I mean, what are the top three things that you find, let's say, in a Java application? What are the most common settings that people have completely gotten wrong or where the defaults are just on average not good? Is there anything that you can say? Is there any magic?
Starting point is 00:35:22 Is there any list of metrics or configuration options that always come up that will be changed from the default? Yeah, sure. That's a great question. So, of course, the answer is related to it's dependent on the workload, on
Starting point is 00:35:39 the application, and also on the infrastructure. But I can say that on JVMs, pretty much, if you search, of course, online about JVM tunings, you will find lots and lots of suggestions and stories of performance optimization that are mainly around garbage collections, of course. And we are seeing that actually memory management, memory sizing, so the amount of heap, the type of garbage collection, and also the garbage collection specific options
Starting point is 00:36:13 can really play a huge role. And the key thing that we are constantly finding out is that the beauty of iCommerce is that once you find the product gives you the best configuration for your application, it allows you to do basically two things. One is to actually understand what are the actual parameters that got applied and that gave you these kind of results. And the other thing that we can provide is the list of most important parameters. And about the stories about findings that are actually very far from the defaults, we have lots and lots of things that we can share.
Starting point is 00:36:58 One of the things that we have found also on MongoDB, which is basically not Java, is that even talking about the most common sizing parameter, which is the heap size, we are finding out really unexpected findings. So we are finding out that sometimes it's totally contrary to what are the best practices about JVM tunings. So if you take Java Performance Book by Charlie Hunt,
Starting point is 00:37:26 which is one of the Bible in the space. So one of the key suggestions is that you have to leave room for the GC to work, to properly clean the memory without introducing too much poses into the application. But what we found out is that sometimes setting the JVN with a smaller heap is more helpful. This is due to the fact that, of course, we also tune not just the heap size, but we also tune the garbage collector specific parameters.
Starting point is 00:37:58 Yeah, that's interesting because a lot of people probably would look at tuning one of those components, not the other necessarily I'm just going to not do that at all. But you're giving them the ability to make multiple changes together that harmonize, right? So that's really exciting. Exactly. The other great finding that we found out is something that is typically not done in the traditional manual trial and error processes.
Starting point is 00:38:49 What happens if we tune different layers in the stack at the same time? And we are finding out very interesting results. And for the MongoDB, for example, we found out that... We found out some operating system settings that were able to actually more than double the performance of MorgonDB in terms of actual queries per second that the database could support. And when we looked at the operating system,
Starting point is 00:39:17 so which kind of effects these kind of settings actually made on the OS, we found out some totally unexpected things. So those settings actually, if you made on the OS, we found out some totally unexpected things. So those settings actually, if you look just the OS, those settings were related to how the operating system performed storage operations on the storage devices. And those kinds of settings actually made the OS doing less number of IOs against the storage devices.
Starting point is 00:39:44 So it was, the OS suddenly was doing less storage operation, but it was doing much larger, requesting much larger amount of bytes to the storage. And when we look at the effect, so this kind of change suddenly actually doubled the performance of MongoDB, which is our, in a way, top-level metric. But if you look at the OS level, this change actually makes the IEO operations lower.
Starting point is 00:40:12 So this is interesting because if you just look at the OS and you are trying to optimize your operating system, you most likely wouldn't choose this kind of settings because it is actually increasing the response time of your operation. So if you're just focusing on one layer, you may actually miss one of the key improvements because if you look at your topmost application level layer, like MongoDB or your entire application,
Starting point is 00:40:44 this change is actually the change that is making your application run twice as fast. This sounds like, I mean, I don't even know how to say this now, but essentially what you're saying is if you would have individual experts that only look into their silos and now the silos are not, let's say, vertical or more horizontal, your OS layer, your process layer, your application layer, then every individual silo would optimize based on what they think is best
Starting point is 00:41:15 for their silo, for their layer. But eventually, you never look at the full stack. And this is why having this holistic approach where you're looking at all the metrics and your AI is trying out different combinations and then by learning what impact an individual change has, it is then starting to tweak the right settings in the best way until you have the optimum result. And the optimum result
Starting point is 00:41:38 is not necessarily optimizing IO or CPU, it's optimizing the overall performance. And that is a combination of response time, memory consumption and costs, right? Yeah, yeah, exactly. So the key piece, the key design, again, goal, which is one of the key values is that the solution is goal-driven,
Starting point is 00:42:00 meaning that you can say to Akamas, what is your goal? So I want to maximize my key business metric, or perhaps in such a situation, I want to make, in a way, I want to reduce the footprint of my databases so that they can cost less on the cloud. And the nice thing is that AI will work out the different settings for you that make your end goal increase or decrease in case of cost, no matter what.
Starting point is 00:42:28 So it has no, in a way, bias in terms of avoiding certain settings that doesn't follow vendor best practices. But in a way, it's acting in order to optimize the AT stack so that your final metric, your end user or your cost metric will improve no matter what. Now, have you found, I can imagine customers would be tempted to try to be smart themselves and take, okay, the tool told us to make these settings on this application.
Starting point is 00:43:08 I learned enough now from what that did that I'm going to tweak my own settings on this other application based on what it told me for the other one, because I think they're pretty similar. Do you find any of your customers are trying to do that themselves? Or are they just embracing it and saying, you know what, this knows what it's doing. I'm not even going to try to become the expert on this because there's too much going on here. I'll just apply the tool to this one as well and let it do its magic. Does that make sense?
Starting point is 00:43:37 Yeah, sure. Great question. So what we are finding out is that of course course, customers can actually learn from the result of the optimization. So, in one way, as I said, Agamas is also providing in output the top performance parameters.
Starting point is 00:43:56 So, you can actually use this kind of output as a way to explain why AI is actually improving the performance. And you can actually learn from the output and you get a chance to actually improve the knowledge of a particular technology.
Starting point is 00:44:13 And then with this kind of knowledge, you can actually, in a way, you are better suited to improve the performance of other applications. In general, then, what happens is that if you're just focusing on a small, in a way, tuning scope, so a small number of parameters, this can be done. But typically what we see is that the parameters are so many that it's hard for a human brain in a way to reason and to transfer transfer this kind of knowledge to to other
Starting point is 00:44:45 applications right and do you also find that it does the is there any impact on the efficiency of the tool to do its job to have people um set up let's say again using the jvm example to put their application on a default jvm or is it better for them to make the tweaks they think they should make first, or does it not really matter? So you mentioned, sorry, didn't got a question? Oh, so yeah, just to repeat it. So does the tool operate better
Starting point is 00:45:22 if it's starting at like the default jvm settings like best practices in using something like this is it better for a team to say hey i have a new application i'm not going to touch the jvm settings i'm just going to drop my application on it apply your tool to it with all the default settings does is there any kind of pattern to that works better than maybe the team trying to set their own settings first and then having your tool analyze it okay so i got it so actually it doesn't it doesn't actually matter what is the starting point so it's it's basically not that impacting for for us because we have so many of so many parameters that uh even in some situation, the default configuration, in a way, the baseline configuration,
Starting point is 00:46:07 which is for us the starting point of the optimization, was the result of a manual optimization process, like in the banking case study that I mentioned before. But in a way, we managed to improve the performance much more than that. I would say that we have actually seen a couple of situations where teams did their own tweaking, but actually it resulted in worse performance than the baseline. Yeah.
Starting point is 00:46:38 So it may also happen, something like that. So if you are not actually a technology expert, so let's take Java for example and let's take the the most common now the default garbage collector the g1gc so it's not easy to tune the g1gc so we have seen situation where people in a way try to tune the g1gc in in the hope of extracting more performance but at the end of the day, we removed the settings that the customer applied and the application performed better. So we also seen something like that.
Starting point is 00:47:15 This is also related to the, in a way, the performance methodology because sometimes people, of course, are not that expert in performance methodology. And one of the key things that we needed to face also in Alchemist is how to be robust to noise. So sometimes you will do the exact same performance test across the same application, one after the other,
Starting point is 00:47:40 and you can see pretty different results. This might be due to noisy neighbors on the clouds. This might be due to different initialization of memory within the operating system. There are lots of reasons why this can happen. And perhaps if you don't have, in a way, a scientific methodology, also in performance evaluation you will uh in a way you will be we will try to optimize something that is actually perhaps related to noise so we will tune your gc in the
Starting point is 00:48:14 hope to extract more performance but at the end of the day the source of performance variability is related to the environment and not your specific settings. Andy, I think this is really exciting, right? I'm blown away. I mean, this is stuff that we've been talking a lot about, how automation is impacting or is going to change our jobs. And I believe you guys are building a solution that clearly does a much better job for performance engineering and performance optimization, especially on the layers that you have visibility to right now, right? OS process.
Starting point is 00:48:52 I mean, this is, and I really like the, like what you said earlier, right? You are not biased. You're not biased as being an expert in a certain JVM. Your AI is just goal-driven. And it's then trying whatever it can to get as close as possible to that goal. And I think that's a great approach, actually. And I think it's really great, too, because most of these kind of settings that you're addressing are ones people usually don't think about so much unless they start saying we have to optimize
Starting point is 00:49:27 performance or we were having some issues right so this is giving people a chance at the start to already get ahead of that curve and do everything they can so that those issues don't crop up earlier and that's the whole idea right is to stop these things from even occurring so that at least you're going in knowing you've optimized. You can continuously optimize by integrating this into your pipeline and into those builds and making sure, okay, maybe we made a code change. Now let's run, let's take an analysis on how the JVM settings are working with some of the maybe the new code that's put in there. And you can stay ahead of it instead of waiting until an issue comes in and then everyone scratches their head and says, oh, well, maybe we need to tune memory.
Starting point is 00:50:10 You're doing it up front, which is being proactive, I guess. Let me shorten it so it's awesome. Exactly. Thank you. Maybe you want to hire us for your marketing team to explain. It's awesome. It's awesome. Yeah, that would be great. All right. Thank you for the kind words, guys. Andy,
Starting point is 00:50:29 would you like to go ahead and summon the Summaryator or was there any other? Yeah, I think we should sum it up. Come on. Do it. So my summary is, man,
Starting point is 00:50:39 there are 700 plus configuration changes in the JVM. I had no clue about that and I'm sure that the more that I never thought of on all the other runtimes and layers and operating systems and Docker containers and PaaS environments and cloud environments that I've never heard of. And while it would be an interesting job
Starting point is 00:50:57 to play around with every setting and learn it and the impact it has, I don't think this is the job we should strive for. I think this is exactly where automation, big data analysis, AI, what you build can help. So what you guys have been doing or what you are doing is, I think, a great gift to our industry, which is just let the machines figure out how to optimize the machines so that we can focus on the creative part again. And I really like that you have a goal-driven approach, that you are taking data from all
Starting point is 00:51:39 the different tools and the different layers out there, that you have a non-biased approach to analyzing the data and making the changes. And also thanks for the reference to your white papers. So people should definitely check out the white papers on akamas.io or follow them on at Akamas Labs. So thank you so much. And I'm sure we'll hear probably more from you because it seems you're definitely you mentioned earlier you're really busy right now so the stuff that you're doing really resonates well
Starting point is 00:52:12 with the industry and hopefully we'll have you back in a couple of months so you can tell us more about you know what else you guys are up to and whom you helped and thank you so much it's really cool so thanks a lot for the great recap, Andy.
Starting point is 00:52:26 And I will be pleased to, to be there in the next month. So, so thank you very much. Yeah. And I think we're going to have to find a way to get over to Milan. There'll be some way. Yeah.
Starting point is 00:52:39 We could also do a recording live in our offices. Yes. Yes. I think we can get Dynatrace to justify a live recording, right? To fly us over. I think so, too. Yeah. Stefano, thank you very much.
Starting point is 00:52:57 This is really awesome. I can't wait to see how this continues to develop. I took a quick look you know at the technology supported and it's funny because as i mentioned a few weeks ago or a month ago or so we had a conversation with uh conrad kakosa who wrote this gigantic book about dot net memory um you know mastering your memory in your dot net applications where you don't have too much control over the settings but there are some little bits where you can do the settings. But it's going to be interesting, too, to see the more you go into the cloud areas where
Starting point is 00:53:33 there's more of an extraction away from some of those components, what's going to be left for people to be able to tweak. Obviously, if you're running a JVM on your own server, you have full access. And I see you're doing some stuff with the AWS DocumentDB, but as people go into these cloud components more and more, into the abstraction layers, serverless, and all these kind of components,
Starting point is 00:53:57 being able to get anything from it is going to be really important too. So I think it's going to be really cool to see where you all expand on being able to find the tiniest little thing that somebody can control that might give them that component that they might not have been aware of. So best of luck with all this. And I think it's really exciting.
Starting point is 00:54:16 Thanks a lot, guys. Thank you. Thank you. Thank you. Bye.
