PurePerformance - Optimizing Cloud Native Power Consumption using Kepler with Marcelo Amaral

Episode Date: January 29, 2024

Marcelo Amaral is a researcher for cloud system optimization and sustainability. With his background in performance engineering, where he optimized microservice workloads in containerized environments, making the leap toward analyzing and optimizing energy consumption was easy. Tune in to this episode and learn about Kepler, the CNCF project Marcelo is working on, which provides metrics for workload energy consumption based on power models trained by the community. Marcelo goes into detail about how Kepler works and also provides practical advice for any developer to keep energy consumption in mind when making architectural and coding decisions.

To learn more about Kepler and today's episode, check out:
Marcelo on LinkedIn: https://www.linkedin.com/in/mcamaral/
CNCF blog post on Kepler: https://www.cncf.io/blog/2023/10/11/exploring-keplers-potentials-unveiling-cloud-application-power-consumption/
Kepler GitHub repo: https://github.com/sustainable-computing-io/kepler

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have with me my wonderful co-host. Hi Andy. Hey, what's up Brian? Nothing, I had a weird dream. I gotta tell you about my dream.
Starting point is 00:00:41 Yeah, yeah, this was a weird one. I had this dream that you and I started a podcast together, like, long time ago, 2015, I think, and that today we were recording our 200th episode or something. It was a very bizarre dream. And I don't know if it's real, but it's really cool if it is. 200 episodes? Yeah. Are you serious? That doesn't include all the perform stuff. Yeah. Yeah. Wow. 200 episodes yeah that doesn't include all the perform stuff yeah 200 episodes and I mean we couldn't have found a better guest
Starting point is 00:01:12 and a better topic well before we go to our guests I want to give a special shout out to any of our listeners out there especially for any who have been with us since the beginning because you know without people listening to it we wouldn't be allowed to keep doing this so big, big thank you to everyone who's been here
Starting point is 00:01:28 but the people who make it all possible as you're getting to are our guests nobody would want to hear you and I talk to each other 200 times especially when we look at the download statistics those that we do solo, they are typically not as well rated
Starting point is 00:01:43 anyway, hey, we want to not keep Marcelo waiting. Marcelo, thank you so much for being on the show. I'm just looking at your LinkedIn profile, which we also link to in the description. It says, Research on Cloud System Optimization and Sustainability, which is a very hot topic. Before we dive into the topic, Marcelo, can you just introduce yourself a little bit to our listeners,
Starting point is 00:02:08 like who you are, what you do, and what actually brought you to the place where you are right now? Yeah, sure. I'm trying to do it quickly, you know. First of all, thank you for the invitation. So I'm very glad to be part of the podcast, to be nice experience to be here. And so I'm Marcelo Amaral. I'm actually from Brazil and I did my undergrad in Brazil. And I went to Barcelona, Spain for my PhD, where I was working at Barcelona Supercomputer Center for five years with
Starting point is 00:02:48 collaborations with IBM, IBM Research. And after I finished my PhD, I went to IBM Tokyo for a postdoc and then I became a regular researcher, an employee in IBM. My background, it's performance, as you guys. So my PhD is related to performance analysis and optimization on the cloud, HPC and cloud. Also, I was doing things related to that. And when I joined IBM, I was working on performance analysis of workloads on the cloud, especially microservice to try to understand the bottlenecks, you know, where the where performance problems are coming from all the hard problems in it.
Starting point is 00:03:35 That's the classical problems when we are analyzing applications. And after that, the IBM acquired the Hat, and we had some collaborations between IBM Research and Red Hat. And I joined the project that's called KubeVert, which is a project to create virtual machines on Kubernetes ecosystem. And I was responsible to do performance analysis and optimization on kubewirt, especially for scalability analysis. I created the CI-CD pipeline for performance analysis in the kubewirt. And then after that, I joined the Kepler project. The Kepler is the project to measure the energy consumption of applications on the cloud.
Starting point is 00:04:28 And that is a project that started with Red Hat. And IBM joined efforts to improve the project. Yeah, that's the quick background. Thank you. That's awesome. Yeah, and if people want to connect with you, because obviously you have a really cool background tying it to performance. Our podcast was initially launched to really talk about performance engineering topics. That's why pure performance. Before we dive into the sustainability topic and
Starting point is 00:04:55 more into Kepler, I got a question for you. So you said you've optimized and analyzed performance problems in microservice environments on Kubernetes or in containerized applications. What is the number one thing that shines out of being typically the reason for systems not performing, not scaling? Is there a number one thing or maybe what are the typical things you found? Well, it's many things, of course, but I would say the two things are most important ones. First, network, you know, latency. It's very important for a microservice.
Starting point is 00:05:35 It's a very small message. So throughput normally is, of course, depends on depends on which microservice you are analyzing. But it's latency typically is more important for microservice. There are small services sending small messages around. And if there are some jitter or something that is happening with the network, it is not stable enough. We can see a lot of problems with that. And storage also depends. If you have database, things like that, storage impacts a lot.
Starting point is 00:06:04 And if it's storage impacts a lot. And it's if it's, you know, distributed storage system. So then it's network plus storage, isn't it? So it's hard also to isolate when you have distributed storage to see what's the problem is really the storage or the network. But I would say that these are the most common thing that impacts the performance CPU is important, but I think this is not the first number one thing that happened. But there are a lot of things that impact the performance of the micro server.
Starting point is 00:06:37 But I would say this is true. Start with that and say the other parts. Yeah. And obviously it makes sense, right? If you think about that traditional architecture, like monolithic, you put data in, a lot of activities happening, and then you get data out and you send this over the network.
Starting point is 00:06:55 Now with microservices, you are breaking up this problem into small individual pieces and you always have to, you know, send the individual pieces of the result to the next service. And, next service and things have to be stored somewhere in caches or wherever it is. So it makes a lot of sense that there's more constraint and more pressure on these systems
Starting point is 00:07:16 that connect the microservices. Now do you feel that most of these problems could have been avoided by better design of the microservice and also the type of data that was sent back and forth and stored? Or was it more that the underlying setup of, let's say, your network itself? And I don't know, maybe in a Kubernetes environment, for me, always service meshes comes to mind. Or any proxies in the middle was it more like the underlying system was not properly configured and sized or was it more an application architectural issues of not like implementing the microservice well enough yeah again so everything that we were saying, so both can impact the performance. So I would say that the design decision is important, of course.
Starting point is 00:08:10 If an application is created with the monolithic in mind, so it means it's not full parallelized, it has dependence, synchronization problem, it will be hard to to scale isn't it so but if it can be more independent and the request can go more parallel you don't have like single queue you know it's typically we see queues in microservice but it's something that it's impacts the scalability and performance. Or maybe multiple kills. So it should have in mind parallelism. And then performance will be better for that. Especially if a lot of parallelism happens, latency between services can be minimized.
Starting point is 00:09:04 It's like you don't see too much the latency of one request because the other request is going well. So things like that. So it's getting better. The design decision also impacts that. That's interesting, Andy. I think we've been talking about this
Starting point is 00:09:19 or at least internally. I don't know if we've talked about it much in the podcast. But I think one of the, especially when you're going from monolith to microservice, the idea of properly modeling that setup, observing that setup, and then tweaking, right? Because it's very tempting to just take, here are my different functions, let me break them out into microservices, And hey, I'm in microservices now. What we've seen, or at least what I've been aware of, is sometimes people go ahead and create, I don't know if this is a real term or not, but there's this idea of a nano service. Something that's way too small that you shouldn't have even broken out.
Starting point is 00:10:03 All you did was add network latency, which feeds back into what you're saying. But I hadn't even thought, not that I'm doing computer programming, but I love this idea of making sure you can run things in parallel, and also queues being a problem. I don't think I've heard about that, at least not in my world. But it goes back to that idea of microservices isn't just picking what you want to run in a microservice. It's a real design consideration, and you have to spend some time experimenting and see how it runs and then tweak and fine-tune it so that you get that performance out of it. And it's almost like some of the common problem patterns we've seen in traditional performance
Starting point is 00:10:41 where these ideas have been around for quite a long time. We've been talking about proper microservices for for a while but it sounds from your experience that this is still a very very common issue yeah yeah cool hey marcello thank you so much but it's always interesting when we when we have somebody with the performance background on the calls to just uh you know discuss a little bit about the stuff that Brian and I know about. But this podcast was actually triggered by Henrik. Henrik Rex is one of our colleagues, and I think you've worked with him as he was looking into Kepler
Starting point is 00:11:20 in his Is It Observable channel. And then you also pointed me to a blog post that, folks, if you're listening to this and you want to read that blog post, the link will be in the description. It's called Exploring Kepler's Potentials, Unveiling Cloud Application Power Consumption. And it was a guest post by you and some other colleagues on the CNCF, on the Cloud Native Computing Foundation blog,
Starting point is 00:11:41 really talking about how Kepler gives you the insights into the actual consumption of power in your applications Computing Foundation blog, really talking about how Kepler gives you the insights into the actual consumption of power in your applications and kind of tying this back to these design decisions and implementation decisions. Every decision we make in the end needs to be powered by the system they run on. And so power consumption always should be on top of our mind. Marcelo, can you tell us a little bit about the Kepler project and what type of data it produces and what type of use cases it enables and also how you see people using Kepler?
Starting point is 00:12:20 Yeah, of course. I think first of all, I will start with some introduction, we know motivation about the project. So energy consumption, I think, I like to think there are two ways to see the importance of that, you know, to measure energy consumption. First of all is money. So cloud providers, the infrastructure, it's, says you know they are paying a lot of money for energy right now the energy consumption costs are not completely exposed to user so user just rent like this the you know uh resource uh and pay you know without the knowledge of how much energy it's costing to maintain those servers. Some servers consume more energy than the others. There are some differences on that.
Starting point is 00:13:14 And especially if we go for the AI workloads now that's using GPUs, GPUs consume a lot of energy. And this is the capacity planning must be analyzing the energy consumption of the servers on the cloud infrastructure. If we go for public cloud or private cloud, this is all things that need to be analyzed. aspect is i would say the social uh uh responsibility of the you know co2 emissions the global warming all of these things that we have been you know here for years and then it's the data centers is consuming a lot of energy again especially for ai workloads and it's something important to not to pay attention to that. So if we think of, for example, chat GPT, it's being trained on a lot of computers. So and it's not only one training shot, so it's retraining all the time.
Starting point is 00:14:18 So it's consuming a lot of resource, a lot of energy, and it can be like some sustainability problem. So the first thing is to bring awareness for people. People need to know how much energy is really doing, isn't it? So what's the energy consumption of data center? There are some analysis for that. I don't remember exactly, you know, the projection for that, but it's like, if they, there are some comparison between U.S that data centers consumes like a thousand of house you know home appliance uh energy consumption so
Starting point is 00:14:53 it's something big that we need to really pay attention on data centers um and then that's what Kepler the Kepler project comes from. The first aspect is to enable observability, to expose the energy consumption of a map, the energy consumption of holes to application. This is not a easy problem, and there are some ways to solve that. So then I'm going to describe a little bit about that. There is no way to, right now, there is no harder counters on the CPU or on the machine to account the energy consumption of application for instructions, for example, that it's running. Storage operation, CPU CPU operation memory operation there is no higher counter uh calculate the energy consumption you know accumulating that on the
Starting point is 00:15:53 hardware so we need to do that in a software-based form what we do now it's the the way that we do it simple but it comes with a lot of challenge okay so so think about that the energy consumption there are two aspects of energy consumption first what we describe it's the idle power with a constant power consumption that when nothing's running the machine there is something being energized in the node and it's consuming power. And there is the dynamic power, where is the power that is associated with the load. So the static power, there is like the GHG protocol
Starting point is 00:16:39 that defines what's the fair association of power consumption to applications. And it's defined that association of power consumption to applications and it's defined that the static power consumption should be divided based on the size of application it's like if you are in a condo and then the bigger house pays more you know so if the application your virtual machines allocate more resources so it will be you you know, associated more static power to this virtual machine. And the dynamic power is related to the resource utilization. So it's using CPU. And then it's the analysis is if 10%, if that one application is using 10% of the CPU, the 10% of the energy consumption is associated with this
Starting point is 00:17:26 application. It's very simple like that. Of course, defining CPU utilization can be complex. It's instruction cache. There are a lot of components inside. But just to general view, like we get resource utilization and do this one to one mapping for resource utilization with the dynamic power so given that so then we have like uh two scenarios on the cloud where we have directly access to bare metal and virtual machines so bare metal we have more flexibility more access to
Starting point is 00:18:01 things uh and virt you know virtualization actually hides things from the users. And it's a virtual machine actually doesn't expose things from the bare metal. Then we need to discuss like these two different ways that how we tackle the problem. So bare metal is the easiest one. So typically bare metal has sensors that measure the energy consumption of the node and the resource. For example, Intel machines, x86, Intel created the application that it's called Rapple, that do some analysis. It's software-based, but it's based on harder counters and the currents and
Starting point is 00:18:46 voltage it estimates the energy consumption of the cpu and with a very good accuracy so there there are a lot of works that have done some analysis with like external meters and compare what's rapid x is it's a exporting and what the external meter is is actually saying what the cpu is consuming the energy and it typically has like a good match there's of course always some precision loss when we are doing estimation but there is a good occurrence for that um amd also has things like that uhs, NVIDIA GPUs actually at least NVIDIA GPUs also expose
Starting point is 00:19:29 the energy consumption of GPUs ARM machines I don't know we have an idea to extend it for ARM but it's an ongoing project that we are doing that so we don't know exactly how to do it right now at least I don't know so maybe other people
Starting point is 00:19:44 from the community know but I'm not aware how much energy consumption of ARM system right now. But it should have some API. Just need to do some investigation for that. So given that, I'm saying there are some APIs that expose the energy consumption of the node, the total node energy consumption, or for a specific resource. So we can break down the energy consumption cpu d-ran storage so depends on the availability of information we can associate this energy
Starting point is 00:20:15 consumption application based on the resource again the dynamic power based on the resource utilization and bare mat we have access to that on vms on the other hand we don't have access to that especially because it's there are two two two problems many problems why vm is not exposed that so first of all is is secured um if the direct information from the host, we don't typically get in the public cloud, that is not exposed to VMs, because it contains information of other VMs in it. So the total information from the node doesn't go there. It could be, so we have like this also envisioned this idea in Kepler. So maybe in the future of cloud providers, they will expose these things, but not right now.
Starting point is 00:21:06 It's to measure the energy consumption of VMs and expose that. It can be in different ways. Just with hypervisor hypercalls inside the VM, we can access that. A file that is mounted to the VM or an external API. The user has its own token and can access information about this on VM. And then, but since we don't have that right now, it's something that we are just proposing and maybe in the future cloud providers can have that. It will be much better.
Starting point is 00:21:42 We use power models. So so for VMS that's what capper is doing what's the power model it's just like a simple regression so we collect in a bare metal node collect the energy consumption of the of uh the node the resource utilization of application and by running a lot of different workloads from with different configuration, different change the CPU frequency, you know, collecting a lot of data, we just do some regression and the power model is the regression can be linear, can be nonlinear depends. It just run multiple algorithms and check which one has better occurrence to estimate and this will be the power model that will be used that's kepler is doing
Starting point is 00:22:32 and and then those power models are public uh available so kepler i think i didn't introduce that i think it's a good time to do that Kepler is a open source project totally open so there is no commercial version of Kepler right now and there are the community a lot of people with different companies are contributing to Kepler and Kepler is the first project related to measured energy consumption of application that becomes a CNCF project. So it's a Kubernetes official project that is related to CNCF. So I think that's one of the main differentiation of Kepler right now. It's fully implemented to be part of Kubernetes, but it can also run standalone outside Kubernetes for, for example,
Starting point is 00:23:23 IoT use case where it cannot be running Kubernetes, you know, kubelet in the node because it's not powerful enough, you know, the device, and can run Kepler standalone inside an exposed matrix. So, yeah, back into the virtual machine. So we have power models and power models has limitations, of course. First of all, occurrence, it can have some penalties because it's a regression. It's impossible to run all the scenarios where all different kinds of applications. We try to stress a lot of different scenarios a lot of different applications but it will be never perfect um but it's we have a very high occurrence for
Starting point is 00:24:15 the models that we train um but there are some other uh more important you know limitations for power models one that it's i think this is important to say is, again, so I was telling you that we have dynamic power and the static power, the idle power. And the idle power must be divided by the number of the VMs that are running the node on the cloud. But on public cloud, we have no idea how many vms are running the note
Starting point is 00:24:47 although we can know the you know the cpu architecture the the bare metal note that the vm i have some information about what which the bare metal note that the vm is running on we don't know how many vms are running so for right now what we do is we don't know how many VMs are running. So right now, what we do is we don't expose the idle power. We just focus on the dynamic power for the public cloud. So when we are in the bare-math, we have all of the information. And then I would say this is one of the challenges. The second challenge, I think this is also very important, is power models are architecture dependent.
Starting point is 00:25:32 So if we create a power model that is related to some specific CPU model, it will be different than a different model because it's a different CPU model, it has the different than a different model because it's different. CPU model consumes, has the power consumption curve differently. So it's not only the baseline, you know, idle power, but how the energy consumption increase with the load change. Number of CPUs also impact, hyper threads also impact. And all of this information is important.
Starting point is 00:26:06 So we need to have power models specific for CPU models. And then for public cloud, especially, for example, Amazon Cloud, there are a lot of different machines. So there are some efforts now for the community to create power models actually kepler we have in mind that we ask the community to contribute so if different companies can come and help to create power models for different nodes and everything it's open uh it's we are always improving how the way that we train the power model to make it easier for people to contribute for that. And if
Starting point is 00:26:48 someone has a different enough CPU version, different even not only for servers, it can be like for end users like laptops. People can sometimes run things like that and just want to know what the application is consuming on their laptop.
Starting point is 00:27:04 And Capra can also run that. So but it's a different CPU model. So you need to train a power model for that. And this is something that the community can help, you know. And that's actually a good maybe reminder, folks. First of all, we will link to the blog post. We will link to the Kepler project. And there's actually a really nice overview
Starting point is 00:27:29 of Kepler, the architecture, where you also see the online learning model server. I guess that's where you have your energy models in and where people can actually then, as you said, contribute the energy consumption both in idle and also in the dynamic stage. Because this is, I think, obviously the great benefit of having a big community that works together
Starting point is 00:27:50 because there's so many different hardware settings out there. And if you already have a model that knows what's the energy consumption for idle versus static, then you can do the actual calculation. That's really good. Yeah, definitely. So yeah, so if people can join our efforts, it would be very welcome, especially for trained parameters.
Starting point is 00:28:13 Yeah. So Marcelo, I know you obviously know Kepler in and out, and that's why it's fascinating to listen to you also, listening to what are actually the challenges, what problems do you solve, how do you solve it, the difference between on-premise, like bare metal, and in the cloud.
Starting point is 00:28:31 From an end-user perspective, because you mentioned earlier that capacity planning, everybody that does capacity planning must look into energy consumption, so Kepler is a great way to get this level of insights. Can you quickly tell us what you see out there as people get started with Kepler? What are the first things that they do
Starting point is 00:28:52 to fully leverage the data that comes out? What are the biggest wins and the fastest wins that people can achieve with Kepler? Yeah, as I was mentioning to you, first we have observability, so people can be aware and understand what's the energy consumption. And there are some techniques, especially in the implementation of code, to try to minimize the energy consumption of applications.
Starting point is 00:29:19 So it's not fully clear for everyone that, but there are some techniques. For example, if the application is waking up the CPU too much time, it's called like a power virus. Because it makes too intense requests to the CPU. And it changes the power mode of the CPU. So from the energy saving perspective and to full performance and start to consume more energy. So I think the first thing is developers can get their energy consumption from application and try different things and try to understand also discover that it's it's still like a uh open research area to understand how to minimize damage consumption of application so observability and the next thing is the optimization so optimization can be in different ways as i mentioned changing the code um or resource allocation so there are different perspective like
Starting point is 00:30:30 when you are it's if you are there are some definition like if we were running less application in a node then all is less energy efficient because the load is not linear. So with the energy consumption, if you have more load to the node, it's consuming less energy if you have less node spread in different nodes, something like that. So then you can do consolidation. This is only the energy efficient perspective but if you go to the CO2 it also has the impact of
Starting point is 00:31:08 what's the energy source what's the solar panels is it like wind wind thermal it changes the CO2 emission Wind, exactly. Wind, exactly. So thermal.
Starting point is 00:31:27 It's changed the CO2 emission. And also it's not only regional-based, it's also time-based. Also, it's seasonal. So if it's winter or summer, it depends on the country, it's changed also the CO2 emission for the data center. So based on that information, it's possible to do optimization to allocate resource. There are some projects that are called Kepler. It's part of Kepler. It's inside the, you know, sustainability AI umbrella.
Starting point is 00:32:01 It's like a sibling project of Kepler. So it's using Kepler information to actually optimize and allocate resource on Kubernetes and try to understand the energy consumption and CO2 emissions for different regions and time-based and allocating scheduling the pods to different nodes based on that to try to optimize things. So again, I think the first thing that the user should do is just install Kepler. It's exposed metrics to Prometheus. So deploy Kepler. Then we have the Grafana dashboard, so the user can just install and go there and check the energy consumption.
Starting point is 00:32:49 See how the application is scaling, what's the energy consumption is also scaling there, and see how the application is using. And play around. So try to understand and do some optimizations to see how the application can be more energy efficient. And this the other hand, if it has access, depends, so the user sometimes doesn't have access to the Kubernetes, it's just the user perspective. So, but maybe can ask, you know, the system administrator too,
Starting point is 00:33:22 okay, so we want to make our cluster more energy efficient, so we want to have better scheduling decision in Kubernetes to allocate resource more in a sustainable way. So then we can go for different projects, for example, Clever, which try to optimize the resource allocation. Yeah, cool. And I got to remind everybody,
Starting point is 00:33:45 like you've talked a lot about the architecture and what type of data it produces. The blog post that you wrote, Marcelo, a co-author with a couple of your colleagues, is really doing a great job also with visualizations on the architecture,
Starting point is 00:34:00 how you get the data on the different types of hardware. So really, you know really check it out. I know this is a podcast and it's audio, but sometimes visually it just helps. And this is why check out the blog post, also the Git repository. Now you said Kepler is obviously producing
Starting point is 00:34:18 the Prometheus metrics that give you the energy consumption on. And I'm looking at one of the dashboards. There was a Grafana dashboard on the page that is linked. I think it's on the consumption per namespace, the consumption per pot. Obviously, you can do all of your analytics. You can probably then also compare different versions of Kubernetes, different
Starting point is 00:34:49 versions of your software, different versions of the stack. You can compare different types of hardware, different types of sizes. So that's the interesting new field where performance engineers, site reliability engineers, I don't know, energy engineers, maybe we need to have a new term that energy optimization engineers that need to do all this because I don't think we can ask every developer
Starting point is 00:35:13 to really, you know, by default just get all this data and analyze it. There needs to be some entity in an organization that really knows what to do, how to get this data, what to do with it, and then mentor and help engineers to actually optimize.
Starting point is 00:35:30 They would need a cool name. I'm sorry. They would need a cool name, not just like energy engineers that have to come up with something a lot cooler. Sorry about that, Marcelo. What were you going to say there, Marcelo? I think you mentioned something very interesting.
Starting point is 00:35:45 So, you know, the users, you know, run that and start to understand what the metrics means. So I would say there are a couple of things that we were actually talking since the beginning that we were saying about microservice. What's the design decision? Does the design impact the performance, things like that. Instead of it, it's the same. So design decision impacts the energy consumption and also the programming language. So there are some stats that say, for example, Python, because it's interpreter language, it consumes a lot of energy, much more than C, for example. Also performance is much better with C, C++.
Starting point is 00:36:27 So then it depends on the perspective, the decisions, how we implement things. If we want to improve the performance, but also improve the energy consumption, the programming language is also something important. Of course, sometimes it's not possible to change all the applications, but in the microservice world that's the interesting part. So you can change one service, maybe.
Starting point is 00:36:54 Start small. Change one thing. Okay, let's switch for a different programming language, this one, and see how it impacts the performance and the energy. And people can start to play around and try to understand better how things behave. And I also want to give a shout out here, Brian, to our friends from Akamas. Marcelo, for you, Akamas is a company from Italy. And what they have done, they have a system where they can, it's like goal-based optimizations. And basically, you say you want to optimize your JVMs or your Kubernetes configurations on, let's say,
Starting point is 00:37:34 memory consumption on CPU, and then their system is using observability data, and then constantly making changes to all of the hundreds and thousands of settings we have in the JVMs and in the CLRs and in Kubernetes to basically find an optimal setting for the application under certain workload. Because what they've found out, if you go to the Java world, where Brian and I have done a lot of work, the way how you're selecting your garbage collector that has a big impact, the heap sizes. And so optimizing the settings is something
Starting point is 00:38:07 that can also win you a lot of improvements from a CPU memory perspective. And it's a very good point. So it's classical to analyze the performance of all of those you know fine tuning things. But how is the energy consumption? So maybe it's linear related, but maybe it's not. So that's the interesting part. Like energy consumption is starting to be some hot topic and we are trying to not only analyze the performance, but also check how the energy consumption is related to tuning applications as well. Yeah, that brings me to the thought I've had for... A lot of thoughts have come into my mind as this discussion has unfolded.
Starting point is 00:38:56 So first of all, thank you for getting my brain working this morning and really sparking a lot of imagination. But you mentioned, we had someone on the podcast a little while ago who was talking about data center power, right? And as you mentioned, power source considerations, time of day considerations, and shifting your load to different centers that you have. Now you're bringing into the idea of the code and the application power consumption, observability of that power consumption, which then leads me directly into the similar idea of taking those numbers, the observability data, and automatically making changes based on the power consumption, but going back to our, um,
Starting point is 00:39:45 Andy and I's and your original career path, performance has to be consideration as well, right? What might be good for power consumption might destroy your performance and vice versa, right? So as you were saying, and that was actually one of the questions, but you, you, you addressed it is, you know, is power consumption linear or logarithmic, right? So at certain points it's more efficient and then it loses efficiency. So just like we see, as performance response times and all starts going down,
Starting point is 00:40:13 we can have tools that automate spinning up new instances of microservices to take care of that performance. But same thing, if we take that, then take the energy data and start managing based on that, now we suddenly can automate and optimize for both performance and energy, which I think would be a real fantastic win. Obviously, some of the challenges you'd have on that are some of these incomplete models that you have with the cloud providers and virtualization that you don't completely get to. But I think models are where we have to start, right? Yeah. get to, but I think models are where we have to start. One other thought I wanted to get in here before we run out of time was it would be
Starting point is 00:40:47 really interesting to see, way back I had a five minute dabble or five minute exploration into capacity planning. This was back in the bare metal days and they had all these models and data algorithms based on CPU, brand, memory, and all this that you can plug in to see what your capacity might be based on CPU, brand, memory, and all this that you can plug in to see what your capacity might be based on your current workload. It'd be interesting to get those models into a developer's IDE, right? So that when they develop just in their IDE, it's running against that model. And as you're saying, the coding decisions, depending on what CPU or whatever model you're
Starting point is 00:41:23 running on, it could then say, hey, and I just said say hey, and that's a new speech pattern that people do these days, and I just did it, so sorry everybody listening. The model can suggest rewriting the code or tell you that based on the model you picked on, this is an inefficient way to write your code. Maybe with AI in the future it can suggest a fix for it, but at least even at the developer level, before you even get to the hardware, it'll expose these power, what would we call them,
Starting point is 00:41:55 energy regressions? That might be a new term to their... Maybe we need to start patenting these new terms. Yeah, I think there are a lot of things to explore. As you mentioned, AI, it's become a very hot topic to have this generative AI to write codes and they are solving problems and of course not like uh the full calls but it gives like the skeleton you know to write things but it's not based on energy efficient calls and so it's gonna be like maybe the future should improve those kind of things and
Starting point is 00:42:39 and uh have those suggested skeletons also to be energy efficient. Awesome. Hey, Marcelo, I know this is, the topic should have been discussed much longer than just the last couple of months or years, since we, as you mentioned earlier, as a world, need to fight a lot of the impact we cause because of too much energy consumption. But I'm sure the topic will be discussed much longer.
Starting point is 00:43:09 So hopefully, keep doing your work. Keep doing the great stuff on Kepler and on educating. Great that you contribute to the CNCF. And let's stay in touch and let's make sure we have you back on the show with updates in a couple of months from now to see what's happening.
Starting point is 00:43:28 Yeah, thank you very much. And if anyone has some questions, you can contact me and I will be glad to answer things. Yeah, we'll definitely make sure to link your LinkedIn profile and whatever else you typically use for social media. Maybe one question that typically Brian asks, but I'll ask it now. Are you coming to any conferences? I don't know, KubeCon is coming up in Paris next, like in March. Is there a chance for people to meet you?
Starting point is 00:43:58 I've been to the last KubeCon in, you know cook-on i present kepler there it was like something that was interesting um i submit a talk for the next cook-on but let's see it's yeah how it goes well if anyone from kubcon is listening put them on you know it's interesting that you mentioned that it's in paris because i was hearing some i didn't dive deep into it, but I was hearing some of the reports from the Paris Climate Summit that just happened. And a lot of what they were seeing is although there was a commitment to draw down on energy consumption, and I don't know if it was just energy or oil, whatever they were looking at, but there's been an increase in a lot of these areas. And I think that goes and ties hand in hand to the more we introduce things like chat GPT, the more everything in our life becomes electronified, relying on computers for everything, that's going to keep driving up that demand.
Starting point is 00:44:54 So at this point, this is a critical point in Kepler and this energy modeling, because if we just continue going forward without considering that, it's just going to get worse and worse and worse. So I think the timing is, it's, yeah, we should have been working on it long ago, obviously, but I think there's now starting to be that public awareness of it. Right. So it's a great time to really be pushing these things. Thank you for what you're doing there. Thank you. Just to mention the last thing,
Starting point is 00:45:29 I think sustainability is becoming a hot topic as well in Europe. Especially maybe in the Cooke comparison, we can have some discussion about that. The government, the European government, is asking for all the companies that are using AI workloads
Starting point is 00:45:45 to report the energy consumption. So, then it's starting to become something that the government is pushing and it will attract much more attention. Yeah. In the future.
Starting point is 00:45:58 And as I was joking about our last sustainability podcast, if we ever see the United States getting on board, then we know we'll be the last ones, right? But yeah, no, I think it's important. And to your point, there has to be that government regulation in there because we know we can't just necessarily
Starting point is 00:46:17 just trust businesses to do, right? They're always going to do what's best for the bottom line, right? And if sustainability is good for the bottom line, they'll do it. But there's the incentivizations that need to be in there. So awesome. I think you and everyone who are working on this, everyone who's working in that sustainability field is, again, thank you. Thank you for our children's future, really.
Starting point is 00:46:41 Because although it sounds a little cheesy cheesy it's definitely a very important topic alright and thank you for our listeners this was I think an interesting topic for our 200th episode because we consume power recording these we consume power putting them up but it's been a great run
Starting point is 00:47:02 so far and there's more to come everybody hopefully we'll see you But it's been a great run so far, and there's more to come, everybody. Hopefully, we'll see you. Hey, maybe we'll have you back on for episode 400. No, that's going to be five years from now. It'll be too late. It should be sooner. Yeah, absolutely sooner. Love to get updates from you.
Starting point is 00:47:18 Great, thank you. All right. Thank you, everybody. Thank you. Thank you, everybody. Thank you. Thank you.
