PurePerformance - Let the Machines optimize the Machines: Goal-Driven Performance Tuning with Stefano Doni
Episode Date: May 13, 2019. Did you know that the JVM has 700+ configuration settings? Did you know that MongoDB performance can be improved by 50% just by tuning the right database and OS knobs? Ever thought that slower I/O can actually speed up database transaction times? In this episode we invited Stefano Doni, CTO at Akamas.io, who gives us a new perspective on how to approach performance optimization for complex environments. Instead of manually tweaking knobs on all sorts of runtimes or services, they developed a goal-driven AI engine that automatically identifies the optimal settings for any application while it is under load. Make sure to check out their website and white papers, where they go into detail about how their algorithms work, which metrics they optimize, and how you can apply their technology in a continuous delivery process.
https://www.linkedin.com/in/stefanodoni/
https://www.akamas.io/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson, and as always, I'm here with Andy Grabner, my co-host and performicator extraordinaire.
Wow, that's a new one.
Right? That's like a performicator, a performance educator. Yeah, I don't know what it really means, but it sounds awesome.
It sounds like a title I want to put on my LinkedIn or my Twitter.
Yeah.
And I know you said my voice sounds odd today,
and it probably sounded like crap in the previous podcast,
so I'm sorry for people that have to listen to me.
I got a new laptop, and it seems the audio is not the same right now.
We still need to figure out the settings.
I still think the content is the same,
while the audio might not be perfect,
but the content should be the same level of quality.
But now, thanks to you,
I might have to put an explicit warning on here,
because you just cursed there, Andy.
I can't believe it. You just brought us down to a new low.
As always, we can only go higher from here, right?
Exactly, yeah.
And speaking of going higher from here, Andy, would you like to introduce our guest today?
Of course I want to. So it's another guest that we actually met a couple of weeks ago in the French Alps.
Well, I met him before then.
I met him last year in Barcelona at our PERFORM conference.
But our paths crossed again at the PAC event, the Neotys PAC in the south of France.
Right.
I remember that.
Yeah.
And Stefano Doni from Moviri is with us today.
And I think what I really loved about his presentation, he talked about how he and his colleagues are solving some really big problems in the performance engineering space.
And without going into any further details, I actually want to pass over the ball to Stefano.
Or Stefano, how is the right pronunciation, Stefano?
Stefano, I guess.
Yes, that's correct.
All right.
Hey, Stefano, would you mind introducing yourself a little bit of background on what you're
doing and what you've been doing and why performance is so interesting for you?
And then we'll figure out what the things are that you're doing that will help a lot
of the performance engineers out there.
Sure.
So thank you.
Hi, everybody.
This is Stefano Doni from Moviri and Akamas.
Thank you for having me today.
So I will just describe our new company,
which is Akamas, a new technology company
born in 2018 from within the Moviri group.
So for those familiar, Moviri is a global software and services company,
which has almost 20 years of expertise focusing mainly on performance
engineering. So we love making apps run faster,
use less resources and meeting business demands.
The company was founded in 2000
as a spin-off of Polytechnic of Milan,
which is one of the leading universities in Europe, actually.
You may know us for one of the earlier products we created,
back when we were called Neptuny, which was Caplan.
It is one of today's leading products
in the capacity planning space,
which we sold to BMC Software in 2010. Wow, that's pretty cool. So first of all,
what I just learned is that you are from Milan. And for those people that have never been to Milan,
I unfortunately have never been either, but I think Milan is also very well known for fashion.
And now I know that you have one of the leading technology universities there as well.
And you had some great products like Kaplan.
I didn't know that.
So thanks for that fact.
That's pretty cool.
That's great.
That's great.
Thank you.
I also want to point out, Andy, that he mentioned Akamas, the new company.
But I want to mention to people that it's akamas.io.
Because when I first saw it,
I was like, oh, let me Google Akamas,
see, you know, read up about them.
And obviously it's sort of an island or some sort of a promontory off of Cyprus.
So because it's a geological place,
make sure you type in akamas.io to get to their website.
Give you a little plug there.
Yeah, thank you, Brian.
You're welcome.
So Stefano, what caused you to found this new company?
What type of problem did you or are you going to solve?
Or are you solving?
Yeah, sure.
So the real problem that we try to solve is actually the problem of performance optimization.
So we did lots and lots of performance optimization activities in my previous job at Moviri.
So I joined Moviri in 2005, doing performance work pretty much all the time: performance tuning, performance testing, capacity optimization, and so on.
And we really started noticing that the complexity of the IT stacks today is hiding significant
performance optimization opportunities. So today we are finding that even performance experts
and even technology experts, so we are also speaking, including also the vendors of the technologies,
are no longer able to find ways to really extract more performance
out of today's enterprise and cloud applications.
So we really found out that there are several
key problems that are, in a way, blocking us
from extracting the full potential of our stacks today.
And really, the key reason is related
to the complexity of our stack.
So we are seeing, in a way,
a trend that is happening behind the scenes,
one that is also a little bit unknown to many professionals
that work in this space: today,
many of the key technology pieces in the stack, like JVM, like a database,
like a big data solution, all those components in the stacks
have hundreds and hundreds of configuration parameters.
So those are knobs that you can configure in your application,
your database, your OS settings, and so on,
and they can really have a tremendous impact on the performance of the end-to-end application.
So we have lots of examples in terms of the numbers of parameters for key technologies.
If you take, for example, a Java virtual machine today,
which is one of the key pieces running,
key runtime that is powering many business-critical enterprise applications,
but also many cloud-native technologies,
the JVM is not going away anytime soon.
So a modern JVM from Oracle or OpenJDK today has more than 700 parameters.
Did you just say 700?
Yes, 700 parameters.
And pretty much nobody touches any of this. So our typical experience with our customers is that everybody is running with the default settings that are provided by the vendors.
But the key question for us is,
what is the potential in terms of performance optimization,
so making our application run faster or consume less resources, if we are able to actually exploit this space of optimization,
which is in a way totally greenfield today.
So nobody is, in a way, looking into this space in a scientific
way to actually see what we can extract in terms of performance.
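For listeners who want a feel for the scale Stefano describes, here is a minimal sketch that counts the flags a local HotSpot JVM reports; it assumes a `java` binary on the PATH, and the exact count varies by vendor and version.

```python
# Minimal sketch: count the configurable flags a local HotSpot JVM exposes.
# Assumes `java` is on the PATH; the number varies by JVM vendor and version.
import subprocess

def count_jvm_flags() -> int:
    out = subprocess.run(
        ["java", "-XX:+PrintFlagsFinal", "-version"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Flag lines look like: "bool UseG1GC = true {product} {default}"
    return sum(1 for line in out.splitlines() if "=" in line and "{" in line)

if __name__ == "__main__":
    print(f"This JVM exposes roughly {count_jvm_flags()} configurable flags")
```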
So is this, I mean, I know we've looked into this a little bit, but, and I think I know
what you guys are doing, but pretty much what you're saying is that these,
let's focus on Java now.
The Java runtime has evolved over many years,
over decades now, I would say.
And over the years,
the JVM vendors have added a lot of knobs
to optimize the JVM for particular use cases.
But the problem is obviously,
there's so many different use cases out there.
And in the end, typically there's these performance experts that then start tuning it, whether
it's the garbage collection settings, whether it's the memory, I guess the sizes that obviously
also influence this garbage collection.
And now I only cover garbage collection and I only know a handful of these settings.
And you said 700
knobs that people can turn. I wonder, I mean you're right, I mean it's probably impossible to find
the correct combination for an individual application out there unless you have a lot of time
and you turn every knob individually and see what the effect is, right?
Yeah, exactly.
So the key point is that we are not only seeing that the number of parameters is huge, but
it's also increasing.
So it's very interesting:
it was a surprise also for us that some technologies, pretty much all technologies today,
are increasing the number of parameters.
So in a way, this is a little bit unexpected.
So we might expect that mature technologies like,
as you said, the JVM, which is about 25 years old now,
since it started in 1995, would have settled down by now.
But also if you see the Linux kernel, so even at the operating system level,
we have this increase in terms of parameters.
So we are not seeing, in a way, a trend where the technology is
self-managing out of the box, where everything is at its most performant
configuration irrespective of the actual customer workload. It's totally the other way around.
So we are seeing that the number of
parameters is increasing for exactly the same reason that you said: vendors are putting more and more parameters, more and more configuration
options into their technology. And of course, the problem is not that
vendors are not smart enough to determine the best settings.
As you correctly said, it's all a matter of the workloads of applications.
So we have a wide spectrum of use cases and workloads.
So you can't design any technology, or even any car,
to perform at its best independently of, in a way,
the race and the condition of the track, and so on.
Yeah, I think... go on, Andy.
Yeah, and I said, so the problem obviously that you guys solved
and now to kind of come to the conclusion of why you make our lives so much easier
is because you came up with an approach of figuring out the right and optimal settings
of all of these 700 different knobs for a specific application under a certain workload?
Is this the right description of what you do?
Yeah, actually, yeah.
I would say yes.
So we basically, the outcome is that we have designed a new solution,
which is actually Akamas.
It's a new product, because we really faced these kinds of problems
in our daily consulting activities,
and we didn't find anything on the market ready to solve them.
So we actually designed a new product.
And the key problem that we are solving is exactly that.
So how can we actually explore this kind of huge complexity
in terms of the parameters and all their combinations,
if you think about it?
So it's not just a matter of visiting 700 parameters,
or running 700 experiments, to understand how to better tune a JVM;
you also have the combinations across the layers in the stack.
So you have the operating system.
Now we have the containers.
So containers are interacting a lot with JVMs.
So it's not just, in a way, a sandbox where you put a JVM and it will behave like before.
Containers are changing pretty much dramatically how the JVM itself is behaving, again,
in terms of memory management, garbage collection.
But it's not just actually JVM parameters.
We have also application.
Many, many times we face a problem of tuning application
level parameters.
So applications have parameters like the number of threads,
the length of queues and buffers,
the sizing of connection pools.
So pretty much across the entire stack,
we have layers like middleware
that have lots and lots of parameters.
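As an aside, here is a minimal sketch of the kind of application-level knobs being described, with a thread-pool size, queue length, and connection-pool size exposed through environment variables so an external tuner could vary them per experiment; the variable names are invented for illustration.

```python
# Illustrative only: application-level tuning knobs exposed via environment
# variables, so an external tool can change them between experiments.
import os
import queue
from concurrent.futures import ThreadPoolExecutor

WORKER_THREADS = int(os.getenv("APP_WORKER_THREADS", "16"))     # hypothetical knob
REQUEST_QUEUE_LEN = int(os.getenv("APP_QUEUE_LENGTH", "1024"))  # hypothetical knob
DB_POOL_SIZE = int(os.getenv("APP_DB_POOL_SIZE", "32"))         # hypothetical knob

executor = ThreadPoolExecutor(max_workers=WORKER_THREADS)
request_queue: "queue.Queue[bytes]" = queue.Queue(maxsize=REQUEST_QUEUE_LEN)
print(f"threads={WORKER_THREADS} queue={REQUEST_QUEUE_LEN} db_pool={DB_POOL_SIZE}")
```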
And we actually solved the problem of properly extracting the top performance
from today's applications in a continuous way, with the power of artificial intelligence.
Well, now you have to explain what artificial intelligence really means, because
we've been talking about AI a lot, and it's still a very hyped word. And now the question is what type of AI
this really is. Just coming back to Dynatrace, we were also talking about AI, and we
now talk more about a deterministic AI, which means we have a model
underneath and then we apply certain rules. And, you know, there's obviously,
my point is, you know,
when you just throw the term AI out there,
it's hard to say what this really is. Is it just big number crunching?
Is it machine learning?
Is it neural networks?
What does your AI do?
Yeah, sure.
That's a great question.
So we also face this kind of problem.
So today every product must be AI-driven, otherwise
it won't sell anything anymore. So actually, AI for us is the
right answer to be able to explore this very complex optimization space. So we use machine
learning for actually determining better configurations than the human brain can, because we have found out that even the application
and vendor experts are no longer able to understand
how to tune all those parameters.
But we think that machine learning is actually the right answer
to be able to navigate all these high-dimensional spaces
and understand all the relationship among the parameters.
In terms of which kind of AI, we are not using traditional, in a way, conventional neural
networks, which are one of the, in a way, most used machine learning technologies today
for several reasons.
First of all, today, neural networks require a very, very big data set for the training.
So one of the key concepts that we have in Akamas is that we have to be able to optimize an application in a very fast way.
So one of the things that we do is that we not only do prediction in terms of what will be the best parameters to tune our IT stacks,
but we also apply the parameters to actual systems, production systems or test systems.
So we absolutely need to be fast, in a way, in terms of converging to a great configuration.
So we cannot, for sure, try thousands of different
knob values and perform thousands of different experiments in order to
determine the best settings. So the AI that we built into Akamas is an AI
that has been designed by our PhD team that is mostly coming from Polytechnic Milan and also lots of researchers
that are basically just focusing on this topic, in order to design a different kind of algorithm that
can be used when you have to optimize what we call a costly function. So for us, when we have to decide
if a given configuration of the IT stack is good enough
in terms of performance, we have to evaluate it.
So we have to apply the new configuration
to the actual production system
and measure the goodness, in a way,
for example in terms of response time.
Perhaps we might want to make the application run faster.
So we might want to minimize the response time
or we might want to increase the throughput,
for example, increasing the number of payments per seconds
of a financial platform.
And in doing that, we must evaluate the parameters.
So we really have a costly function here
because to evaluate a given set of parameters,
we must run, for example, performance experiments that
might last even hours of measurement of production
systems.
So we have designed new algorithms that are actually
able to learn by doing very little sampling, very few evaluations
of the different parameters, and at the same time being able to, in a way, rapidly converge
towards optimal settings.
And also on the other side, the key goal for us is to avoid, in a way, unsafe regions. So we cannot really afford, in a way,
to propose, to configure the IT stacks
with bad configuration,
which might impact the user experience.
So our AI...
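To make the "costly function" idea concrete, here is a deliberately simplified sketch of a goal-driven tuning loop: a small experiment budget, a stand-in for the expensive load test, and a safety check so configurations that would hurt users are discarded. The knob names and the plain random search are invented for illustration; this is not Akamas' actual algorithm.

```python
# Simplified sketch of goal-driven tuning of a costly function: each evaluation
# stands for applying a configuration and running a real load test, so the
# experiment budget stays small and unsafe configurations are rejected.
import random

SEARCH_SPACE = {                 # hypothetical knobs and their ranges
    "heap_mb":        (512, 8192),
    "gc_threads":     (1, 16),
    "conn_pool_size": (8, 256),
}
BUDGET = 20                      # few, expensive experiments

def run_load_test(cfg: dict) -> dict:
    # Stand-in for the costly step: in reality this applies cfg to a test or
    # production system and measures it for minutes or hours.
    tps = cfg["conn_pool_size"] * 2 - abs(cfg["heap_mb"] - 4096) / 100
    p99 = 200 + abs(cfg["heap_mb"] - 4096) / 20
    return {"throughput_tps": tps, "p99_latency_ms": p99}

def is_safe(metrics: dict) -> bool:
    return metrics["p99_latency_ms"] < 400   # never keep configs that hurt users

best_cfg, best_tps = None, float("-inf")
for _ in range(BUDGET):
    cfg = {k: random.randint(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}
    metrics = run_load_test(cfg)             # the costly evaluation
    if is_safe(metrics) and metrics["throughput_tps"] > best_tps:
        best_cfg, best_tps = cfg, metrics["throughput_tps"]

print("best configuration found:", best_cfg)
```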
I just got a couple of questions here.
I'm sorry to interrupt you,
but that means you are,
are you normally operating in a pre-prod or in a production environment?
Well, we can do both.
Actually, the typical starting point to evaluate the technology for the customer
is to work on pre-prod environments if they have them ready.
So the typical setup for us would
be to automate the performance testing and the tuning
of the application in a test bed.
That is something that we can do.
More, in a way, mature customers that are, for example,
adopting modern DevOps practices have actually
adopted the production online tuning,
which is actually interesting, where we can integrate with newer tools.
For example, Spinnaker, which enables you to do incremental rollouts
and deployments of new releases.
And we can plug into those processes and use these kinds of techniques
not just for the purpose of releasing new application versions faster,
but also to, in a way, optimize the performance of applications.
So that's actually pretty cool.
So that means what you're saying, in a production environment,
you would be like a canary release, but instead of the canary being new application
code or new features or code changes, it is your configuration changes that you put in, and then
you compare it with the main line of the application.
Exactly, yeah.
Well, that's pretty cool.
Hey, now the other question that I have: so you said your algorithm is obviously measuring the performance on different dimensions, response time, as you said, memory, cost efficiency.
But you also said your algorithm is also making changes on the fly and then testing what the impact of these changes is. Does this mean, in order for your solution to work,
you need to be able to change these parameters on the fly
or need to have the ability to restart JVMs
because I assume some of these parameters
can only be changed during startup?
So how does this work?
Yeah, that's a great question.
Actually, within the product,
we have this concept of optimization pack.
So optimization pack, it's in a way like an app that you download on your phone. It's a package that contains the knowledge about the parameters of a specific technology that we can tune.
So within this optimization pack, we can also specify in a way additional properties
or parameters.
For example, one of the key capabilities
is being able to say which parameters can be applied online,
so they can be hot-changed as the system is actually working,
as opposed to other parameters that might require a restart.
So as you say, for example, JVMs today
have very few online tunable parameters
and they need a restart.
But in a way, we are seeing an interesting,
good trend in this regard.
Many technologies are more and more
including parameters that can be live changed.
So for example, databases more and more are able to change
also key properties like the memory size,
the buffer pool size live without any restart.
So this is something that we can do
without actually impacting the application
in terms of restarting it.
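For illustration, a small sketch of the idea just described: a catalogue of parameters that records which ones can be applied live and which need a restart. The parameter names and structure are invented; this is not Akamas' actual optimization-pack format.

```python
# Illustrative parameter catalogue: which knobs can be hot-applied vs. need a restart.
OPTIMIZATION_PACK = {
    "jvm.max_heap_mb":       {"range": (512, 16384), "live": False},   # restart required
    "jvm.gc_type":           {"values": ["G1", "Parallel"], "live": False},
    "mongodb.cache_size_gb": {"range": (1, 64), "live": True},         # hot-changeable
    "os.read_ahead_kb":      {"range": (8, 4096), "live": True},
}

def apply_setting(param: str, value) -> None:
    if OPTIMIZATION_PACK[param]["live"]:
        print(f"hot-applying {param}={value} on the running system")
    else:
        print(f"staging {param}={value}; a restart is required for it to take effect")

apply_setting("os.read_ahead_kb", 1024)
apply_setting("jvm.max_heap_mb", 2048)
```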
So that's pretty cool.
Yeah, Andy, this makes me think of bridging tools once again, right?
So first of all, Stefano, thank you for saying all this stuff
because I think this is an area that I haven't really thought about.
Obviously, there have been ideas of GC tuning.
In fact, we just had somebody talking a few episodes ago talking
about really writing memory optimized code but the idea of you know having so many knobs to tweak on
the jvm settings probably most people don't even know about them there are a few common ones people
take care of but what i'm imagining will happen um with a tool like this is someone is going to try to optimize JVM settings
or whatever technology they're using to run bad code.
Now, on the flip side, let's say we can identify bad code.
So where I'm going with this is wondering how in the future
these kind of things can merge together.
Because let's say you take a simple example like a thread pool, a database thread pool, right? And maybe your
tool is identifying, let's increase the thread pool because you're running out of threads, and
it's an easy thing to do. However, on the side of our tool, it will be like, hey, this transaction
is making 30 calls to the database,
the exact same query all the time.
That could be optimized,
which would then reduce the amount of threads
you actually need on the JVM.
There's these multiple tiers.
It sounds like on your side,
you could do a lot with the JVM
plus looking at how that's running on the host,
but there's a gap of the code performance
and are you making these changes to accommodate poor code that you might not be aware of?
Yeah, I don't even know if that's a question, more of a conceptual thought of,
like, how can we put all these pieces together to, you know, make it all one.
Yeah, that's, I think, what will actually happen,
because actually today for us,
application code and also the way the application developers
write database queries are out of scope for us.
So we are not actually improving them.
We might work on that in the future.
But in a way, what we found out, nevertheless, is that we
are finding that there are huge improvements possible just by working
on the infrastructure. But I see the possibility that,
when we improve the infrastructure on one side, you will
perhaps spend less time having to work on writing more efficient code,
because it simply becomes less necessary.
Yeah, it's interesting because there's so many areas to not only tweak for performance, but
also so many different things you need to consider, like how do we improve performance?
How much performance can we improve by changing the JVM settings? How much performance can we
improve by making changes to host settings? How much performance can we improve by making changes to host settings? How much performance can we improve by making code changes? And it's just expanding this world of performance
to more and more considerations to put into, because obviously the answer isn't always code.
The answer isn't always a tweak on the settings, but having all this information at your fingertips
through whatever combination of tools you can get that is awesome. And incorporated into that,
I just want to have one more follow-up.
Do you have, because we see this a lot with the tools,
a lot of the new tools now, is there like an API
that you can either expose data from or ingest data into from your tool
so that if you did want to use other things to combine data sets
to maybe look at the broader range, is that part of what's in there yet?
Yeah, sure.
So this is part, actually, one of the key pillars of our vision.
So what we wanted to be is not, in a way, an AI-driven platform
to optimize just JVMs or a specific technology.
So what we built for us is an end-to-end platform
for machine learning driven performance optimization
that can really work on any technology on our side.
So we can work from applications to middlewares,
to databases, JVMs, even on the cloud.
And, as you said,
the other key design decision
was not to reinvent the wheel.
So in a way, in order to let Akamas work out its magic,
we do not require the customers, the users,
to reimplement, for example, their performance testing
or implement their monitoring strategy
by using our,
in a way, agents. So the solution is totally agentless.
So the key design decision we took was to instead integrate with
the market-leading solutions for load testing, for monitoring, for
configuration management, so that we can actually bring that data in. We have open APIs and SDKs so that you can actually develop new integrations
and bring in additional data that you may have from other tools
that we currently don't support out of the box,
and pretty much everything can work.
Well, as you also mentioned earlier,
there are also application configuration parameters.
And obviously, these parameters will be different for every application.
So is that a way for an application developer or an architect to get metrics into Akamas and say, hey, here are my configuration options and here are some of the metrics I can tell you from within my app?
And then you start analyzing it and then you start tweaking them.
Yeah, exactly.
So we decided to build up a very flexible platform.
So actually, it's very easy to instruct Akamas
to work on a different scope, optimization scope,
a new optimization scope, for example,
in order to tune application-specific parameters.
And really, what we aimed for in this area
was to design a modern DevOps experience.
So we have an infrastructure as code approach
in terms of describing what Akamas needs to do.
So everything is based on YAML files,
inspired by modern tools like Kubernetes and Docker.
And what you can do is describe your application,
how it is composed, your parameters,
and also how you can gather metrics and so on.
And also, most importantly, what you want to optimize:
what is your goal, what is your optimization scope, and so on.
So everything is described in terms of an infrastructure-as-code
paradigm.
And then we have a modern CLI that is very much appreciated,
we see, by DevOps kinds of profiles that actually
like to use CLIs to automate all this stuff.
And also we have open APIs, so standard HTTP REST APIs
that you can also use to completely automate the optimization.
So you can include, for example, an optimization activity of three days,
something like that, that you can plug into your development pipeline.
It's all entirely possible by invoking our APIs
from existing tools.
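As a sketch of what invoking such APIs from an existing pipeline might look like, here is a hypothetical example; the host, endpoint path, and payload fields are invented for illustration and are not the real Akamas API.

```python
# Hypothetical pipeline step kicking off an optimization study over a REST API.
# Host, paths, and fields are invented; consult the vendor docs for the real API.
# Assumes the third-party `requests` package is installed.
import requests

BASE_URL = "https://akamas.example.internal/api"    # placeholder host

study = {
    "name": "checkout-service-jvm-tuning",
    "goal": {"maximize": "payments_per_second"},    # goal-driven, as discussed above
    "max_duration_hours": 72,                       # e.g. a three-day optimization activity
}

resp = requests.post(f"{BASE_URL}/studies", json=study, timeout=30)
resp.raise_for_status()
print("optimization study started:", resp.json().get("id"))
```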
Pretty cool.
Hey, can we talk, I have a couple of questions
on some of the common things you find
and what are the common things that you guys are tuning
or giving recommendations on.
But can you, before we go into that detail, give us some maybe examples on what's the typical improvement
that comes out of such an exercise?
Because, you know, are we talking about 2% or 3% of memory improvement or 2% or 3% of performance improvement?
Are we talking about 10%?
What is the typical improvement people see?
Okay, so we are seeing very, very interesting improvements, definitely in the
range of tens of percent, so 20%, 50%, and sometimes even entire factors of improvement.
So, by the way, one of the white papers that you can download on our website
is actually telling the story of one of the customers
that we had, which was operating a financial service,
a payment service.
And they asked us to optimize their IT stack,
which was composed by a traditional,
in a way, JBoss, JVM, and Linux operating system stack. And their most important business
metric was payments per second. So they wanted to increase the payments per second within
a given infrastructural footprint, no matter what.
So we tuned the infrastructure after the customer had already
tuned the same application with manual approaches,
with a three-month effort.
And it also involved the vendors of the technologies.
And by the way, that challenged us
to improve further with respect to what
the experts had already achieved.
And we integrated Akamas into their performance testing environment,
and we were able to achieve 55 more payments per second
out of the same infrastructure, the same application.
So this is an example of the kind of improvements that are in a way
lying into the IT stack.
And the customer was shocked to learn that after three months of expert tuning activities,
there was still this kind of improvement potential that was pretty much unexplored.
Can you give me a quick... I mean, you said 55.
Can you give me a little more reference on like you improved it by
X many percent because 55 itself
on its own is a little hard
to understand. Is it 55 from a thousand?
Is it 55 from a hundred?
Just the percentage numbers or how much?
It's 55 percent.
Okay, wow.
I thought you said 55 additional
transactions.
That's obviously impressive.
Yeah, that's very impressive.
Wow.
And the other thing that we are noticing also on the databases,
it's another very interesting area where we are finding out very big,
potentially even bigger results.
So another white paper that you can find on our website
is about the performance optimization of MongoDB,
where we compare the performance that you can get
from a MongoDB database that you can manage,
for example, on-prem or on the cloud.
And we compare it against the AWS DocumentDB,
which is a new database as a service offering.
So it's a managed service from AWS,
which pretty much is compatible
with the existing MongoDB applications.
So what we have done is we benchmarked
by using industrial or very common performance benchmarks for NoSQL databases.
And what we saw is an impressive improvement that we can get from MongoDB because we are actually able to tune it in order to extract more performance. So in this case, we tuned the MongoDB,
we tuned also the Linux operating system,
and we were able to pretty much more than double the performance
that we can get out of MongoDB or out of DocumentDB from AWS.
Wow. And are these... I'm not that familiar with Mongo.
Are these changes that can be done on runtime?
Does the MongoDB database give you these options to change it during runtime
or do you need to restart the nodes all the time?
How does this work with Mongo?
Yeah, most of the parameters can be live changed actually on Mongo.
And the other set of parameters that we tuned was on the operating system
and pretty much the parameters of Linux OS can be live changed, actually.
Wow.
Can you give me, because this is a question that I tried to bring in earlier.
So I know there's a lot of things that people can change and configuration options. But for those people that are wondering now,
I mean, what are the top three things that you find,
let's say, in a Java application?
What are the most common settings
that people have completely gotten wrong
or where the defaults are just on average not good?
Is there anything that you can say?
Is there any magic?
Is there any list of metrics or configuration
options that always come up
that will be
changed from the default?
Yeah, sure. That's a great question.
So, of course, the answer is
that it depends on the workload, on
the application, and also on the infrastructure.
But I can say
that on JVMs, pretty much, if you search, of course, online about JVM tunings,
you will find lots and lots of suggestions and stories of performance optimization that
are mainly around garbage collections, of course.
And we are seeing that actually memory management, memory sizing, so the amount of heap,
the type of garbage collection,
and also the garbage collection specific options
can really play a huge role.
And the key thing that we are constantly finding out
is that the beauty of Akamas is that once
the product gives you the best configuration for your application, it allows you to do basically two things.
One is to actually understand what are the actual parameters that got applied and that gave you these kind of results.
And the other thing that we can provide is the list of most important parameters.
And about the stories about findings that are actually very far from the defaults,
we have lots and lots of things that we can share.
One of the things that we have found also on MongoDB,
which is basically not Java,
is that even talking about the most common sizing parameter,
which is the heap size,
we are finding out really unexpected findings.
So we are finding out that sometimes it's totally contrary
to the best practices for JVM tuning.
So if you take the Java Performance book by Charlie Hunt,
which is one of the Bibles in the space,
one of the key suggestions is that you have to leave room
for the GC to work, to properly clean the memory
without introducing too many pauses into the application.
But what we found out is that sometimes setting the JVM
with a smaller heap is more helpful.
This is due to the fact that, of course, we also tune not just the heap size, but we also
tune the garbage collector specific parameters.
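For a concrete picture of heap size and GC-specific options being explored together, here are a few illustrative HotSpot flag combinations of the kind a tuner might evaluate; they are examples of a search space, not recommendations from the episode.

```python
# Purely illustrative HotSpot flag combinations a tuner might evaluate together.
# These are examples of a search space, not recommended settings.
CANDIDATE_JVM_CONFIGS = [
    ["-Xmx4g", "-XX:+UseG1GC", "-XX:MaxGCPauseMillis=200"],
    ["-Xmx2g", "-XX:+UseG1GC", "-XX:MaxGCPauseMillis=50", "-XX:G1HeapRegionSize=16m"],
    ["-Xmx3g", "-XX:+UseParallelGC", "-XX:ParallelGCThreads=8"],
]

for flags in CANDIDATE_JVM_CONFIGS:
    # Each combination would be one experiment against the same workload.
    print("java", " ".join(flags), "-jar app.jar")
```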
Yeah, that's interesting, because a lot of people probably would look at tuning one of those components but not the other, or would say, I'm just not going to do that at all.
But you're giving them the ability to make multiple changes together
that harmonize, right?
So that's really exciting.
Exactly.
The other great finding that we found out
is something that is typically not done
in the traditional manual trial and error processes.
What happens if we tune different layers in the stack at the same time?
And we are finding out very interesting results.
And for the MongoDB, for example, we found out that...
We found out some operating system settings that
were able to actually more than double the performance of
MongoDB in terms of actual queries per second
that the database could support.
And when we looked at the operating system,
so which kind of effect these settings
actually had on the OS, we found out some totally
unexpected things.
So, if you look just at the OS,
those settings were related to how the operating system
performed storage operations on the storage devices.
And those kinds of settings actually made the OS
do a lower number of I/Os against the storage devices.
So the OS suddenly was doing fewer storage operations,
but it was requesting a much larger amount
of bytes from the storage.
And when we look at the effect, so this kind of change
suddenly actually doubled the performance of MongoDB, which
is our, in a way, top-level metric.
But if you look at the OS level,
this change actually makes the I/O operations slower.
So this is interesting because if you just look at the OS
and you are trying to optimize your operating system,
you most likely wouldn't choose this kind of settings
because it is actually increasing the response time of your operation.
So if you're just focusing on one layer,
you may actually miss one of the key improvements
because if you look at your topmost application level layer,
like MongoDB or your entire application,
this change is actually the change that is making your application run
twice as fast.
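As one concrete example of the kind of OS-level storage knob being described here (not necessarily the exact setting from the study), Linux block-device readahead trades the number of I/O requests against the size of each request; a minimal sketch, assuming a device named sda:

```python
# Minimal sketch: inspect (and optionally change, as root) Linux block-device
# readahead, an OS knob that trades fewer, larger reads against more, smaller ones.
from pathlib import Path

DEVICE = "sda"   # example device name
ra_path = Path(f"/sys/block/{DEVICE}/queue/read_ahead_kb")

print("current readahead:", ra_path.read_text().strip(), "KB")
# ra_path.write_text("1024")   # uncomment and run as root to request larger, fewer reads
```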
This sounds like, I mean, I don't even know how to
say this now, but essentially what you're saying is
if you would have individual experts that only look into their silos and now
the silos are not, let's say, vertical but more horizontal,
your OS layer, your process layer, your application layer,
then every individual silo would optimize based on what they think is best
for their silo, for their layer.
But eventually, you never look at the full stack.
And this is why having this holistic approach where you're looking at all the metrics
and your AI is trying out different combinations
and then by learning what impact an individual change has,
it is then starting to tweak the right settings
in the best way until you have the optimum result.
And the optimum result
is not necessarily optimizing IO or CPU,
it's optimizing the overall performance.
And that is a combination of response time,
memory consumption and costs, right?
Yeah, yeah, exactly.
So the key piece, the key design goal again,
which is one of the key values, is that
the solution is goal-driven,
meaning that you can say to Akamas,
what is your goal?
So I want to maximize my key business metric,
or perhaps in such a situation, I want to make, in a way,
I want to reduce the footprint of my databases
so that they can cost less on the cloud.
And the nice thing is that AI will work out the different settings for you
that make your end goal increase or decrease in case of cost, no matter what.
So it has no bias, in a way, in terms of avoiding certain settings
that don't follow vendor best practices.
But in a way, it's acting in order to optimize the IT stack
so that your final metric, your end-user metric or your cost metric, will improve no matter what.
Now, have you found,
I can imagine customers would be tempted to try to be smart themselves and
take, okay,
the tool told us to make these settings on this application.
I learned enough now from what that did that I'm going to tweak my own settings on this other
application based on what it told me for the other one, because I think they're pretty similar.
Do you find any of your customers are trying to do that themselves? Or are they just embracing it
and saying, you know what, this knows what it's doing.
I'm not even going to try to become the expert on this because there's too much going on
here.
I'll just apply the tool to this one as well and let it do its magic.
Does that make sense?
Yeah, sure.
Great question.
So what we are finding out is that, of course, customers can actually learn from
the result of the optimization.
So, in one way, as I said,
Akamas is also providing
in output
the top performance parameters.
So, you can actually use
this kind of output
as a way to explain why
AI is actually improving
the performance.
And you can actually learn from the output
and you get a chance to actually improve the knowledge
of a particular technology.
And then with this kind of knowledge,
you can actually, in a way,
you are better suited to improve the performance
of other applications.
In general, then, what happens is that if you're just focusing on a small, in a way, tuning
scope, so a small number of parameters, this can be done.
But typically what we see is that the parameters are so many that it's hard for a human brain
in a way, to reason about and to transfer this kind of knowledge to other
applications.
Right. And do you also find that... is there any impact on the efficiency
of the tool to do its job? To have people set up, let's say, again using the JVM example,
their application on a default JVM, or is it better for them to make the tweaks
they think they should make first,
or does it not really matter?
Sorry, I didn't quite get the question?
Oh, so yeah, just to repeat it.
So does the tool operate better
if it's starting at, like, the default JVM settings? Like, best practices
in using something like this: is it better for a team to say, hey, I have a new application,
I'm not going to touch the JVM settings, I'm just going to drop my application on it,
apply your tool to it with all the default settings? Is there any kind of pattern
to what works better, than maybe the team trying to set their own settings first
and then having your tool analyze it?
Okay, so I got it. So actually, it doesn't matter what the starting point is. It's basically not that impactful for us, because
we have so many parameters that even in some situations, the default configuration, in a way, the baseline configuration,
which is for us the starting point of the optimization,
was the result of a manual optimization process,
like in the banking case study that I mentioned before.
But in a way, we managed to improve the performance much more than that.
I would say that we have actually seen a couple of situations
where teams did their own tweaking,
but actually it resulted in worse performance than the baseline.
Yeah.
So it may also happen, something like that.
So if you are not actually a technology expert,
so let's take Java, for example, and let's take the most common, now the default,
garbage collector, the G1 GC. It's not easy to tune the G1 GC. So we have seen
situations where people, in a way, try to tune the G1 GC in the hope of
extracting more performance. But at the end of the day, we removed the settings that the customer applied
and the application performed better.
So we also seen something like that.
This is also related to the, in a way,
the performance methodology
because sometimes people, of course,
are not that expert in performance methodology.
And one of the key things that we needed to face also
in Akamas is how to be robust to noise.
So sometimes you will do the exact same performance
test across the same application, one after the other,
and you can see pretty different results.
This might be due to noisy neighbors on the clouds.
This might be due to different initialization of memory
within the operating system.
There are lots of reasons why this can happen.
And perhaps if you don't have, in a way, a scientific methodology
in performance evaluation as well, you will, in a way, end up trying to
optimize something that is actually related to noise. So you will tune your GC in the
hope of extracting more performance, but at the end of the day, the source of performance variability
is related to the environment and not your specific settings.
Andy, I think this is really exciting, right?
I'm blown away.
I mean, this is stuff that we've been talking a lot about,
how automation is impacting or is going to change our jobs. And I believe you guys are building a solution that clearly does a much better job
for performance engineering and performance optimization,
especially on the layers that you have visibility to right now, right?
OS process.
I mean, this is, and I really like the, like what you said earlier, right?
You are not biased.
You're not biased as being an expert in a certain JVM.
Your AI is just goal-driven.
And it's then trying whatever it can to get as close as possible to that goal.
And I think that's a great approach, actually.
And I think it's really great, too, because most of these kind of settings that you're addressing
are ones people usually don't think about so much unless they start saying, we have to optimize
performance, or we're having some issues, right? So this is giving people a chance at the start
to already get ahead of that curve and do everything they can so that those issues don't
crop up. And that's the whole idea, right? To stop these things from even occurring, so
that at least you're going in knowing you've optimized.
You can continuously optimize by integrating this into your pipeline and into those builds and making sure, okay, maybe we made a code change.
Now let's run, let's take an analysis on how the JVM settings are working with some of the maybe the new code that's put in there.
And you can stay ahead of it instead of waiting until an issue comes in and then everyone scratches their head and says,
oh, well, maybe we need to tune memory.
You're doing it up front, which is being proactive, I guess.
Let me shorten it: it's awesome.
Exactly. Thank you.
Maybe you want to hire us for your marketing team to explain.
It's awesome. It's awesome. Yeah, that would be great.
All right.
Thank you for the kind words, guys.
Andy,
would you like to go ahead
and summon the Summaryator
or was there any other?
Yeah, I think we should sum it up.
Come on.
Do it.
So my summary is,
man,
there are 700-plus configuration settings
in the JVM.
I had no clue about that,
and I'm sure there are many more that I never thought of
on all the other runtimes and layers and operating systems
and Docker containers and PaaS environments
and cloud environments that I've never heard of.
And while it would be an interesting job
to play around with every setting and learn it
and the impact it has,
I don't think this is the job we should strive for.
I think this is exactly where automation, big data analysis, AI, what you build can help.
So what you guys have been doing or what you are doing is, I think, a great gift to our industry,
which is just let the machines figure out how to optimize the machines so that we
can focus on the creative part again.
And I really like that you have a goal-driven approach, that you are taking data from all
the different tools and the different layers out there, that you have a non-biased approach
to analyzing the data and
making the changes. And also thanks for the reference to your white papers. So people
should definitely check out the white papers on akamas.io, or follow them at Akamas Labs.
So thank you so much. And I'm sure we'll probably hear more from you, because it seems,
as you mentioned earlier, you're really busy
right now, so the stuff that you're doing
really resonates well
with the industry. And hopefully we'll have you back
in a couple of months so you
can tell us more about,
you know, what else you guys are up to and
whom you've helped.
Thank you so much, it's really cool.
Thanks a lot for the great recap,
Andy.
And I will be pleased
to be there in the next months.
So thank you very much.
Yeah.
And I think we're going to have to find a way to get over to Milan.
There'll be some way.
Yeah.
We could also do a recording live in our offices.
Yes.
Yes.
I think we can get Dynatrace to justify a live recording, right?
To fly us over.
I think so, too.
Yeah.
Stefano, thank you very much.
This is really awesome.
I can't wait to see how this continues to develop.
I took a quick look, you know, at the technologies supported,
and it's funny, because as I mentioned a few weeks ago, or a month ago or so, we had a conversation
with Konrad Kokosa, who wrote this gigantic book about .NET memory, you know, mastering
memory in your .NET applications, where you don't have too much control over the settings, but
there are some little bits where you can tweak the settings.
But it's going to be interesting, too, to see, the more you go into the cloud areas where
there's more of an abstraction away from some of those components, what's going to be left
for people to be able to tweak.
Obviously, if you're running a JVM on your own server,
you have full access.
And I see you're doing some stuff with the AWS DocumentDB,
but as people go into these cloud components more and more,
into the abstraction layers, serverless,
and all these kind of components,
being able to get anything from it
is going to be really important too.
So I think it's going to be really cool
to see where you all expand on being able to
find the tiniest little thing that somebody can control that might give them that component
that they might not have been aware of.
So best of luck with all this.
And I think it's really exciting.
Thanks a lot, guys.
Thank you.
Thank you.
Thank you.
Bye.