PurePerformance - How performance engineering saves the euro cup, holidays and keeps cloud costs low with Almudena Vivanco

Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. That's okay, no, that's okay. And people may wonder, why do we speak Spanish today? They may wonder, maybe not. Maybe we're just practicing because we're going to take a lovely trip to Barcelona, or wherever our guest might be. I don't want to choose locations. Well, maybe somebody has been to Barcelona recently and found out that his Spanish is not good enough, even though he's been practicing for many years with Duolingo. But I think without further ado, the reason why we speak Spanish, or at least try to use

Starting point is 00:01:18 our most limited version of Spanish that we have, is because of our guest today. Hola, Almudena, ¿cómo estás? It's great to have you here. We saw each other just a couple of weeks ago in Barcelona at the Cloud Native Meetup, where we both presented. And then you actually reminded me that we go way, way back in history. And actually this is now your moment. Maybe you can quickly give our audience an introduction of who you are, why performance engineering, how long you've been in the field, what keeps you motivated.

Starting point is 00:01:55 And then we dive into some of the topics that you've presented at the meetup because this was really a great presentation, despite my lack of Spanish skills. I was really fascinated. So, but now over to you. Who are you? I'm really happy to be here with you. As you said, our history goes back, well, not history, but 20 years. I started in performance engineering about 18 years ago. I was a developer first. My first contact was with Silk Performer. At that point, it was from Borland.

Starting point is 00:02:33 I was in pre-sales, so I was selling Silk Performer to public administrations, basically. That was the main goal at that point. It was 2004, 2005. There was this huge shift in the technology world when everyone wanted to migrate their administration, for example, to electronic administration. That was my duty in the public administration. It was great.

Starting point is 00:03:07 So that was my first contact, CWI, Borland, with performance and testing. Why performance engineering? My background is mathematical. I have a degree in computing, computability as we say here, and applied maths. So basically, I love numbers. I love doing models, mathematical models, and simulating stuff. So I think that it was actually what I had to do as a performance engineer. It is the field I'm most comfortable with. And I love it. Of course, at the start, it was like NBC, so it was like client-server, that kind of stuff. I remember like

Starting point is 00:03:59 too many years ago, 15 or 16 years ago, with an Oracle database, that you had to think ahead, six months ahead, about how much CPU you would need. You had to work out in advance how much CPU you could have in your Exadata and your Oracle database. And now it's all scalable. Everything has changed. Now you just have to move the slider and it goes. You have to pay, of course, you have to have your credit card. But it's way easier. There was a point, and I think that it was like the point of no return for us, for performance engineering.

Starting point is 00:04:43 We thought that performance engineering had no meaning in the cloud world. And then we realized that the costs of the cloud, of the cloud providers, were giving us a reason to still be here, to have a meaning in the world of IT. And maybe we changed the name to SRE, to whatever they want to call us now,

Starting point is 00:05:07 but we are still doing the same. Just trying to guess the costs, the scalability, the resilience of our systems, of our solutions. So that's what I do, basically. Hey, Andy, before we dive in, I just wanted to say, you gave me a flashback there about how you had to know how much CPU you needed.

Starting point is 00:05:36 I remember the days where, if we were maxing out our server, we had to go order a new one, wait for it to come in, get it in the rack, get everything installed, test, make sure it was running. And then we could start pushing things over to it. It's like, I forget about that. We forget what we have now. It's crazy. Holy crap.

Starting point is 00:05:57 Anyway, I forgot. Going to the bunkers, to the data center, with a cable, because there you had to have a cable. Okay. So, yeah, it was a very different world. Anyway, I just wanted to point that out because I'm sure a lot of our listeners are unfamiliar with that world. But anyhow, I'm being an old man here reminiscing about the good old days, quote unquote.

Starting point is 00:06:20 Andy, I know you have a lot you want to... Yeah, it also reminds me, and this is before going into current topics, a little bit more memory lane. 18 years ago, you said, you started with Silk Performer. We just recently had Ernst Ambichl on the podcast, the chief software architect and creator of Silk Performer. And it was also just phenomenal to see, you know,

Starting point is 00:06:41 how we all love, obviously, that product, yeah. I mean, whether it's Silk Performer or LoadRunner, there are so many great tools out there that have really revolutionized the field and inspired so many performance engineers, and it's just nice to remember all this. And then you said, I'm looking at your LinkedIn profile, right? So back then you worked at Aventia, right? Yeah. Aventia, yeah. And I also like the fact that you said you started as a software

Starting point is 00:07:14 engineer, is this right? Yeah. As a developer in C#. As a developer in C#. For me it was similar. I also started as a developer, but when I joined Segue back then, before they got acquired by Borland, we had to go as an engineer through,

Starting point is 00:07:32 I think it was three months of QA. So we had to start in QA and quality assurance. And I was, I think I mentioned this in the podcast with Ernst, I was testing Silk Performer with Silk Performer, which was a great way to learn the product, a great way to learn the strengths and weaknesses, and then a great way to then become an advocate

Starting point is 00:07:52 for performance engineering. And I learned a lot of these things on how to do performance testing from my colleagues back then. Whether it's Ernst or Didi Strasser, there are so many great people that I had the luxury to learn from. One thing that you said, and I think this was an interesting sentence, you said you thought that performance engineering no longer has a place in the cloud. But then you realized that the costs are obviously very important to keep track of. While it may be obvious for us, could you explain quickly

Starting point is 00:08:28 how does performance engineering help you with the cloud costs? So I remember before joining Lidl, I was in Telefónica R&D. And I was at Expo QA, which starts in a couple of days here in Madrid. It is a huge event about quality in Spain. And there was this roundtable where we were talking about how the cloud improves the general feeling about performance, or removes these boundaries. Of course, you just pay more and you have more CPU,

Starting point is 00:09:04 more memory, more everything, and you could just mitigate whatever bottleneck your software could have. I was like, yeah, but you are still wasting your money. If you do that, you waste your money. And I think that's it for performance engineers: you just want to optimize everything. Not only for costs and money, but for the carbon footprint as well. I think that's a really important subject right now, that

Starting point is 00:09:37 we have to talk more about the carbon footprint of our solutions. And that was like six years ago. It was like, of course, you have to optimize your costs, you have to optimize your solution. Because if you are not investing in optimizing, you're just wasting your money and your time. And then you don't need to have the best developers, you just have monkeys coding. That's not what we are supposed to do. And of course, you need a performance engineer

Starting point is 00:10:08 just to test everything, like the scalability of the solution. Maybe, okay, you have Kubernetes and you have a thousand nodes, but are your probes properly set up? Are your objectives clear? Whatever you need in order to scale properly and efficiently. So I think that was the goal for performance engineering. You just have to rebrand a bit. You are not only testing, you have more to deal with or to cope with,

Starting point is 00:10:40 like the scalability. But I think that's a way to go. It's just a new improvement in our careers. I think, Andy, it's interesting. I'm going to butcher your name. Let me see if I can get it right. Amudena? You can say Almu. Almu, okay.

Starting point is 00:11:01 Almu is, yeah. You would think, going back to even this idea, you had to get a new server and rack it, right, especially if it was a CPU, right? Memory, you could often add memory into it, but you'd still have to get someone in there. But you would think that code optimization would have been a hotter topic back in that day

Starting point is 00:11:18 because it was so much harder to get more CPU power. It was much harder to get a bigger server. I think back then, though, the tooling didn't allow so much for looking at optimization of code. Once we can start looking at traces and once we can start looking at architecture, service flows and things like that,

Starting point is 00:11:38 we started having the ability to look at the optimization. But it is still striking that back then it was only like, well, we need more. And then you move to VMware and you could assign more CPUs on VMware or any virtualization platform, but that was the big one.

Starting point is 00:11:55 But then you'd run out of space in your cluster and you'd have to get another thing for the cluster. And then as we transitioned to cloud, there was still the habit of, because it was so easy to just add new components in the cloud to add that. I guess the thought that's going through my mind is what made people, and I don't know if there's an answer, but what made people finally look and say, instead of throwing hardware at this, maybe we should look at the code we're

Starting point is 00:12:21 writing. Because at some time there was a shift, right. At least for some people. And I don't know if it's hand in hand with the tooling that allowed that to happen, or was it just since people could so easily change the configuration in the cloud regions, when the finance team started getting the bill, did they suddenly realize, oh my gosh, all these things are coming in? Well, about the tooling, I think that it has always been there.

Starting point is 00:12:52 I mean, I remember being in a talk with Brendan Gregg about the Linux tools. strace has been around since the 80s. You just have to know what to look at. So the tooling, or at least, well, some of the tooling has been there. Maybe it was not human-readable or not easy to read, but

Starting point is 00:13:16 it has always been there. I remember just fighting with, not CPU, but I was working on a proxy at some point in my life, and we had problems with the connections, with all the list of files. And it was hard to learn how to read a Wireshark capture, a tcpdump.

Starting point is 00:13:43 And it was not easy, but you had the tooling there. Maybe adding observability to this layer makes everything easier. It was harder in 2013 to scale based on some human-readable data. Like, I had to scale based on a list of files, for example, in the proxy context. And yeah, I think the tooling was there, but it was hard to understand, or hard to report to someone else who was not involved in performance

Starting point is 00:14:26 engineering and scalability. It was hard to explain, okay, we have these limitations in our hardware. So it was hardware-based. And we moved on from VMware. I remember when I was in Movistar TV, we had to allocate the different machines across the hosts, like, this one is eating all the resources from the other virtual machines, so we have to put it on another host. That kind of stuff. We were moving all the virtual machines around just to make them work. It was a streaming platform and it was on Windows because we had Mediaroom. It was like, okay, it was hell.

Starting point is 00:15:07 I'm not going to talk about that. I don't want to remember that one. But yeah, I think that nowadays we have observability that is more reachable, so you can understand data easily, and you can just give data to someone else and report it. It makes everything easier. You have the cloud, and usually you have CloudWatch,

Starting point is 00:15:31 or you have insights, or you have something else that helps you just to give the report, that kind of report of the scalability performance report to the POs, to the PMs, to product people. That just fills the gap between the two worlds, that is business and systems. I think the performance engineers are always in the middle.

Starting point is 00:15:56 We have to be aware of the business and of the systems, of the monitoring systems that we have. I think that now it's easier. A couple of thoughts quickly, because you mentioned performance engineering evolved from performance testing to just running load

Starting point is 00:16:18 and then basically analyzing the results and then maybe giving suggestions, and now really being site reliability engineers, call them SRE or whatever you call them in your organization. But you need to know much more, and you also, I think, give more guidance and mentorship to application teams to right-size, to configure correctly, to do everything right from the start, because you, with the background of performance engineering, know much more about how the systems really interact, especially as we're moving into this complex world,

Starting point is 00:16:48 like how to properly configure your resource limits, your request limits, how to properly configure your queues, how you properly do everything to make sure that your system is properly sized. So I really like what you said, that performance engineering, with the emergence of the cloud and also now with Kubernetes, has really shifted from just being maybe performance testers to being true engineers, and not just

Starting point is 00:17:11 performance, but it's really about resiliency, availability. I guess security is also a topic, even though I'm not sure how often you touch on security. Now it's not that much. When I was in Telefónica R&D, I was in the cybersecurity department. I was going to say uptown. Sorry, in the department of security. So it was a proxy. So it was security, everything. It was like a huge topic. Right now here, I have under my responsibility pen tests and that kind of stuff, but not service meshes, the normal stuff, but not security as a product.

Starting point is 00:17:50 Security as a system, but not as a product itself. And then the other thing you said: observability has changed over the years for the good. Because, you know, 15 years ago, 20 years ago, when we started in that

Starting point is 00:18:06 space, observability was something that you turned on when you had to, and then it was really hard, because I remember the early days of Dynatrace. You had to install your Java agent, your .NET agent. You had to enable it. It was impacting the startup time. People were not

Starting point is 00:18:22 comfortable with it. You could kill applications if you made configuration mistakes. But now, in 2024, observability, as we always say, is no longer optional. It's mandatory and it's baked in. Observability is baked into our cloud vendors. You're getting all the metrics. You're getting logs. You're getting traces.

Starting point is 00:18:41 It's just there. And then additionally, with frameworks like OpenTelemetry, we give people the chance to enrich that telemetry data with what they think is important, but using a standard, which also then makes it easier to make these tools better because we all work on the standard. So I like that a lot. Now to your current job, because I think you're working for Lidl.

Starting point is 00:19:06 And for those people that don't know Lidl, maybe you can give a little bit of context on what Lidl is doing. So, actually, I work for Schwarz. That is the group that Lidl belongs to. So, I work for Kaufland and Lidl and Monsieur Cousin, everything like that. I'm the performance engineer of the company, basically. So the mentoring that you mentioned is very important for me. I don't have any team. I just mentor people in the squads to be aware of performance.

Starting point is 00:19:40 It's not a task. It's like a culture. It's like the DevOps culture. So I try to implement the performance culture in the teams and in the company. But I still run tests. I do some jam all the time. But I work along with the product teams. And Lidl, Schwarz, it's a retailer. We have the Lidl online, and we have a loyalty program that is Lidl Plus.

Starting point is 00:20:16 And we have, like, a lot of users all over Europe, 32 countries. And soon we will be in the States as well. Well, the loyalty program is not like a retailer itself. So usually the conversion rate, the engagement, is way higher than in a retailer. So it involves a lot of scalability and a lot of campaigns and a lot of research. So it's a performance challenge in itself. And Lidl Plus started like six years ago.

Starting point is 00:20:56 And I started like five and a half years ago. So we were like in the pilot. And it was cloud native from the start. That was cool. That is pretty nice. And it's like we are part of a corporate, a huge corporate. But STRM, the Lidl Plus project, is like, I don't know how to say, but innovation, in that we are allowed to go cloud native.

Starting point is 00:21:23 We are allowed to go OpenTelemetry, Kubernetes, all the stuff that is state of the art, right? So it is a pretty cool project. And it's pretty close to the users. That's something I... Coming from a proxy, where you don't see a user in your life, it's like you have these people on Saturday that go for their shopping and they have to have their coupons ready and discounts in the stores. So you are pretty close to the users. It was not only monitoring stuff to have performance monitors

Starting point is 00:22:08 or KPIs or whatever. We needed to see the user experience. That was when we implemented Dynatrace, for example, which allows us to see, for the two applications that we have, the two mobile applications, the user experience,

Starting point is 00:22:25 how they are using the applications and how good or bad their experience is in the application itself, in the solution. So that's cool for me. It's a good project and it's still growing. So there's a lot of stuff to do. We had the COVID period, where we worked a lot. Contactless, everything that was improving,

Starting point is 00:22:53 like, the experience of the users during the pandemic. Yeah, it's a good project. When you gave the presentation... first of all, thanks for giving us some background also on the organizational structure, that you have basically kind of an innovation hub within Schwarz IT where you can play around and use the latest technology. I have the slides open from your presentation, which, by the way, if you're okay with it, we will also share with the listeners. I see things here like Kubernetes, Argo, KEDA for event-driven scaling, all really important components for a site reliability engineer. Prometheus is on here. Really, really cool that you could explore these new technologies for that. You mentioned that campaigns

Starting point is 00:23:47 are super important, right? And not only in retail, but everywhere campaigns are important. I remember, and also if you look at the slides, classical campaigns, where all of a sudden a lot of traffic comes in and kind of houses burn down, or in this case, maybe servers go crazy. You have a lot of houses, like Lidl stores, in your presentation to visualize when things go wrong. But can you fill us in a little bit on campaigns, right? If you run campaigns, if your organization runs campaigns,

Starting point is 00:24:25 and if you as a performance engineer, a site reliability engineer, are actually the link between engineering and the business, what are some of the lessons learned that you had? Because I'm pretty sure many of our listeners are in a similar spot.

Starting point is 00:24:39 So as a performance engineer or SRE, you always have to be close to business, to the business analysts. Sometimes you are not aware when the campaigns are going to be out in the jungle. Because you have 32 countries, and maybe the campaign in Cyprus is not that important. Sorry, Cyprus. But of course you have to be aware when there's a huge campaign like the Easter campaign

Starting point is 00:25:10 or Christmas campaign or Black Friday, that kind of stuff where usually, you know, you have a precise date in your calendar. But for example, right now we are in the

Starting point is 00:25:25 championships, European FIFA, whatever it's called in English, in the football championships. And sometimes you are not aware that they are like, I don't know, but we have a raffle to take the

Starting point is 00:25:42 kids to the fields, to the match. And as a performance engineer, you're not aware. Maybe not even the business units are aware, because that depends on the countries. But usually I try to talk all the time with them. And I have a calendar shared with the business units where they tell me, okay, at this time in the year, we will have the start of this campaign for this country. We will have like a TV advertisement

Starting point is 00:26:10 at this point. Just try to scale or whatever, if you want to, if you have to. Like, for example, on Black Friday or at Christmas, we scale up preventively. So,

Starting point is 00:26:27 beforehand. In the FIFA, in the football ones, we are not scaling that much, but countries come and go. It's like every day you suddenly have a peak in one of the countries, and it's like, why is this?

Starting point is 00:26:43 Why is this coming from Croatia? Maybe they are selling a cheese or something. Or a PlayStation was sold on a Black Friday in Holland, and it lasted like 10 minutes, something like that, and we were not aware of that.

Starting point is 00:26:59 A PlayStation going for 200 euros. It was like insane. These kinds of campaigns. But I think that the thing that you have to learn is this one: if you're going to fail, you have to fail. That's for sure. Even in a campaign where it's like

Starting point is 00:27:18 the revenue will be huge. Maybe the costs in your infrastructure will be higher than the revenue of the campaign. So you have to be aware sometimes that there's like a balance. Maybe it's about the brand. For example, you cannot fail during Christmas. But maybe during the UEFA, the football, you can fail a bit, maybe 10%. You don't have the revenue that you expect. I don't know. It's not as important

Starting point is 00:27:47 branding-wise as Christmas or Black Friday. So you have to create a balance between that, and always talk to the business units. And you have to measure how long it takes for you to create a new region, to get up, to absorb the load, whatever, to be resilient. How long does it take for me to be up again? And to train not only your software, your solution, but your teams as well. To have a procedure for how we replicate a region, and you have to do it

Starting point is 00:28:29 beforehand. Those are things that we try to do in order to avoid downtime during the campaigns. Are you doing this proactively? Like, does this mean you're running game days? You're doing chaos engineering? You have a talk about chaos engineering in Lidl, and you took one of the links, it's in Montevideo, in the Wolver. I talk about chaos engineering, how we do it. So I guess you also have to have, besides the chaos engineering, a deep understanding of what are your most sensitive areas of the architecture, which are

Starting point is 00:29:14 the ones that are likely to fall over first, so that when you do have these unexpected loads, then, as you were saying, you have a plan for what to do if that does go down. This way, everybody's ready. I mean, you have one of these unexpected campaigns. You see the one system fall over.

Starting point is 00:29:31 You say, well, that's expected. And we have something in place to remediate that because we planned for it. So a lot of it comes down to planning, right? Whether or not you're planning for a known event like Black Friday. Andy, we go back to those old ideas of stripping everything out of the website that you don't need just to handle that scalability of the event. But this is more of the unknown events that are going to suddenly spike up like those local country situations.

Starting point is 00:30:00 So it really sounds like it's about being prepared, knowing your system, knowing where your risk areas are, and having a plan for those risk areas. I think that's one of the main issues that we have when we move to microservices. There is no longer this guy, Superman, that knew everything and was going to fix everything because he knew the whole monolith. So we lost that. We have the resilience now because it's microservices, but I always say the same.

Starting point is 00:30:32 When our home, that is a microservice, fails, for the user, it's not the home microservice, it's the whole Lidl Plus that is not working. If the login is not working, it's the whole Lidl Plus that is not working. If the payment doesn't work,

Starting point is 00:30:47 it's the whole Lidl Plus that is not working. And the performance engineers, I think, or SREs, we have this full vision of all the products, all the solutions, not only microservice

Starting point is 00:31:01 by microservice. You have the observability. You have to centralize observability, and you have everything more or less in your mind, maybe not product-wise, or like the latest feature, but you know all the flows. You know where things can fail, which is the weakest part of the chain.

Starting point is 00:31:23 How do you have to monitor and alert that? And I think that's one of the reasons why performance engineers are important in organizations like this one, that it's Agile, it's Scrum, microservices everywhere, because we have this global vision of everything. And it's just because we have been running tests,

Starting point is 00:31:45 chaos engineering, performance, scalability, resilience, so we know where the difficult parts of the solution are. So I think that gives us a pretty good spot in the organization. Hey, in your slides, in your presentation, you also talk about a topic that is very dear to my heart, which is SLOs, service level objectives. Can you help me understand how you end up with SLOs? Who do you talk to? How do you define what are good SLOs that you've seen?

Starting point is 00:32:22 How do you enforce them? What do they mean? So when we started with SLOs like two years ago, the CEO and the CTO gave this task to the agile coach. And it was like, okay. But they didn't have the vision. When I landed in the project, like at the start of this year, it was like, let's start from the beginning. Because you have, okay, you have the SMART business objective, maybe one that's product-wise, like, at Christmas we're going to sell 50,000 porks. But we have to put that in number of

Starting point is 00:33:09 requests, experience of the user, response time, updates, that kind of stuff that you have to just translate that into something that is more technical, more tied to the infrastructure itself.
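The translation described here, from a sales target to technical numbers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conversion rate, requests per session, campaign length, and peak factor below are hypothetical inputs, and none of the names or figures other than the 50,000 target come from the episode.

```python
from dataclasses import dataclass


@dataclass
class CampaignAssumptions:
    """Hypothetical inputs; only the 50,000 target is taken from the conversation."""
    target_orders: int         # business objective, e.g. 50,000 sales
    conversion_rate: float     # assumed fraction of sessions that end in an order
    requests_per_session: int  # assumed backend requests generated by one session
    campaign_hours: float      # assumed duration of the campaign
    peak_to_average: float     # assumed ratio of peak traffic to average traffic


def required_sessions(a: CampaignAssumptions) -> float:
    # Sessions needed so that conversion_rate of them reach the order target.
    return a.target_orders / a.conversion_rate


def average_rps(a: CampaignAssumptions) -> float:
    # Total requests spread evenly over the campaign window, in requests per second.
    total_requests = required_sessions(a) * a.requests_per_session
    return total_requests / (a.campaign_hours * 3600)


def peak_rps(a: CampaignAssumptions) -> float:
    # Capacity target: the average rate scaled by the assumed peak factor.
    return average_rps(a) * a.peak_to_average


example = CampaignAssumptions(
    target_orders=50_000,
    conversion_rate=0.05,
    requests_per_session=20,
    campaign_hours=100,
    peak_to_average=4.0,
)
print(f"sessions needed: {required_sessions(example):,.0f}")
print(f"average load:    {average_rps(example):.1f} req/s")
print(f"peak load:       {peak_rps(example):.1f} req/s")
```

The peak figure, rather than the average, is what a load test or a pre-scaling plan would typically be sized against, which is why the peak factor is kept as an explicit input.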

Starting point is 00:33:25 That's when I joined the team creating the service level objectives. Just to translate these SMART business objectives into service levels. And that was when actually, like I think all organizations do after huge growth over four years, we started to think about the costs, to reduce costs, to remove the vendor lock-in, to go to the standards, Kubernetes, OpenTelemetry. And then it was like the time to say, okay, our objective is this one.

Starting point is 00:34:03 We have to reduce the cost of this one. So, our Kubernetes has to cost less than our Web Apps in Azure. Our response time has to be lower than the one that we have in Azure. Our CPU usage lower, our requests, the experience of the users, the number of issues that become problems in production, all those kinds of objectives. We focused on three milestones. It was like scalability, we have to scale still. At the campaigns, we have to scale properly and efficiently. We have to be resilient. If we have a campaign and it just creates an outage, all the other microservices have to be resilient. We have to mitigate the outage of these microservices.
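The "new platform must beat the old one" objectives listed here can be written down as data and checked mechanically. The Python sketch below is a minimal illustration, not the team's actual tooling: the `Objective` class, the metric names, and every number are hypothetical, and it assumes all objectives are of the "lower is better" kind mentioned in the conversation (cost, response time, CPU usage).

```python
from dataclasses import dataclass


@dataclass
class Objective:
    """One 'lower is better' objective comparing two platforms (hypothetical)."""
    name: str
    baseline: float   # value measured on the old platform
    candidate: float  # value measured on the new platform

    def met(self) -> bool:
        # The objective is met only if the new platform is strictly lower.
        return self.candidate < self.baseline


def unmet(objectives: list[Objective]) -> list[str]:
    # Names of the objectives the new platform does not yet satisfy.
    return [o.name for o in objectives if not o.met()]


# Made-up measurements, purely for illustration.
objectives = [
    Objective("monthly_cost_eur", baseline=1000.0, candidate=800.0),
    Objective("p95_response_ms", baseline=300.0, candidate=250.0),
    Objective("cpu_usage_percent", baseline=60.0, candidate=65.0),
]
print("unmet objectives:", unmet(objectives))
```

Keeping the objectives as plain data makes it easy to report the same list to product people and to engineers, which matches the point made earlier about bridging business and systems.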

Starting point is 00:34:56 So you have to be resilient, you have to retry the request, blah, blah, blah, blah, and availability. Depending on the severity or the criticality of the product, for instance, the single sign-on has five nines,

Starting point is 00:35:15 but, I don't know, the "open gift" campaign at Christmas, you don't need more than three, so you don't have to have three availability zones, or redundancy and geo-replication and that kind of stuff. And that was an objective. This service has five nines, so we have to be geo-replicated, availability zone

Starting point is 00:35:41 redundant, blah, blah, blah. And I think that the point is, like, we have to sell 50,000 porks. How does that impact our servers, availability-, scalability-, and resilience-wise? Because in the end, Lidl sells potatoes

Starting point is 00:36:01 what we do with the pork, hopefully. Pork and potato is definitely important. We have one really big one. It was an SLA

Starting point is 00:36:17 because it's the other part of the stores. It's not within our organization. It's GK. It's another company. So we have this SLA, and it was like the time it takes for tickets to arrive to the user. But I think it's pretty important. If a user is in the queue and he's paying, how long does it take for the ticket to arrive in the application? And from that one, tickets, coupons, discounts, everything else just

Starting point is 00:36:48 appeared. It was like, you just have to know where to look. It was like, okay, the tickets, but then if we have the ticket, we have the summary and the processing of the discounts, the winning moments, the scratch card that is going to be won after the ticket. It all came along. It was like, we have these big ones, okay, and the other objectives were really easy to achieve, or to know where to look. So yeah, but we're working on that. So we have to implement some more, to make them more observable, to make them easy to reach. Right now we have them in Dynatrace, but we wanted to move them to somewhere else. I'm not going to say it now. You can say it

Starting point is 00:37:39 now. We want to make them more reachable for the whole company and then the training was for the whole company all the whole squads from the product data to the product owners, product managers everyone was involved in the training of Celo I think that was

Starting point is 00:38:01 very important. It was not only for the technical people, but it was like, you know, the business: tell me, what do we have to achieve? Do we have to go to a new country? Do we have to be better in the engagement? What do we have to do business-wise?

Starting point is 00:38:16 Okay, we translate that into the service level. I'm taking a lot of notes here, and I think, you know, a lot I learn from you. We learn from our guests too. I've seen all the videos from Andy talking about this.

Starting point is 00:38:38 I think it's a great confirmation that what we see and what we have seen is really stuff that happens and matters in the real world. Because we are working for a vendor and we try to be as close as possible to our end users. And with our history, we've been in that space for a while. But I made a note of the "how we sell 50,000 porks for Christmas". This could be an interesting title for a conference talk, maybe. Is it going to be ibérico jamón? No, it's in Romania. It's a Christmas custom in Romania that they eat pork, and they buy a half of pork. So yeah, it's a use case.

Starting point is 00:39:33 It's every year we have that. We have the same in Germany, we have fireworks. We sell the fireworks for Germany. What happens in Austria? Because I know you're also active in Austria. Any strange customs I should know about my own country? No, not really. Not that I'm aware of. They sell trips to the hills so you can go sing.

Starting point is 00:39:58 The one of the fireworks... I love the one of the porks, because the software solution is tied to the warehouse, not to the data warehouse, but to the warehouse itself where the

Starting point is 00:40:14 porks are. And the fireworks one, it's the same. The warehouse has the fireworks, but there's a security chain there. Logistics, because fireworks are flammable and you just have to

Starting point is 00:40:29 deliver them in a safe way. So it's a really complicated flow. It's a tricky one, but it's pretty cool. But it only works in Germany and only during Christmas. I don't know what the Germans do with their fireworks. Well, I guess they buy it probably for New Year's,

Starting point is 00:40:47 but they already buy it for their kids as a Christmas present to then fire off maybe for New Year's. What I want to also recap, I think, again, is what I've also seen in many organizations now, where the former performance engineers, now turned site reliability engineers, are really the ones that are connecting the dots to all the different stakeholders to come up with good SLOs that are tied to business objectives. I think you said it very nicely: you have to translate smart business objectives into service level agreements or

Starting point is 00:41:21 service level objectives, and once you have kind of the first two layers figured out, it's very easy to trickle down to the technical metrics. And I think that's just, folks, something to consider if you struggle with who should be responsible: maybe you take the responsibility yourself and drive this initiative. Because you have the overview of everything, and you should be able to talk to everyone, because you need to know what the business is planning

Starting point is 00:41:51 for your campaigns, for your capacity planning, for your scaling. So you are in a perfect position to then also define SLAs and then enforce them. That was what shocked me when I was the SLO coach, that I was the one doing the task. It was like, why are you doing that? And in the end, they just told me, you're the most suitable person to do it. Just go ahead.

Starting point is 00:42:17 Just free me from this. And I run the trainings and I try to standardize all over the organization. And that's because of what you have just said: we have the vision. So we have to talk to the business units. We have to talk to the SREs themselves, the platform engineering guys and the SREs from each of the domains and the products.

Starting point is 00:42:46 So we have this vision that we have to achieve this objective. How are we going to do that? How are we going to scale? How resilient are we going to be? How much do we have to improve our solution or how much

Starting point is 00:43:01 do we have to pay for the solution. That as well. So yes, I think that's why I took over the responsibility for the SLOs in my organization. One last question for you. I assume you're running fully in the cloud. Is this right? With your Kubernetes, everything is in the cloud?

Starting point is 00:43:29 Are there any thoughts on whether at some point in time you actually reach a size in the cloud where it again makes sense to think about building and pulling things back into on-premise data centers that you may still have? Or is the cloud really working out for you when you've figured out how to be cost-efficient? We are considering some of this in the data layers.

Starting point is 00:44:00 We are running the, how do you call that one? The polyglot data architecture. We have the data platform and this kind of stuff. And the cost of that is huge. And at some point, I think that we will consider it. There are some milestones. Like if you reach 100, it's just an example,

Starting point is 00:44:28 but 100 million users, or when we are moving to the States, where we will have data from the States and, you know, the data protection laws are different, so we have to store them in different places. We will have to reconsider whether we move some of our data storage back to a data center. And as well, we are creating, so, no, Lidl, Schwarz are creating

Starting point is 00:45:07 a new cloud and it's StackIt. StackIt is a German cloud, European cloud that will be like following

Starting point is 00:45:19 the standards of the European law. And that's something that we will be using. So it's like going back to the data center, but in our own cloud. So it's more or less a middle way. I think that StackIt will be covering one gap that we have right now,

Starting point is 00:45:38 that there are three major players, Google, Amazon, and Azure, Microsoft, and they are all from the States. And at some point, as Europeans, we'll have to think of a cloud in Europe following the European rules, and

Starting point is 00:45:57 want to have something in Europe, I think. And it's like it's going to fill that gap, hopefully. And that's for us, it's good. The data will be in our own cloud, and it's our data centers. It will be cool. Awesome. Andy, that left you speechless, huh?

Starting point is 00:46:21 I think that we lost him. Oh, no. Andy's having some... well, he just lost audio. Well, I will then start: are there any final thoughts you wanted to get out? We're pretty much at the end here. Anything that you wanted to say

Starting point is 00:46:36 that you didn't get a chance to? Or do you have any speaking engagements coming up? I will be in Berlin, at the DevOps in Berlin, on June 19th. Okay, perfect. So hopefully I will see you there.

Starting point is 00:46:54 And I'm just back. I'm not sure if you will be coming to Berlin, to DevOps Berlin. I will be there. I was actually suggesting, Almudena, that you should put in a CFP for KubeCon North America.

Starting point is 00:47:10 I think that would actually be a really cool thing. Talking about your experience on selling 50,000 porks, all with Kubernetes, OpenTelemetry, Keptn,

Starting point is 00:47:27 whatever else. For the next year one. That is in London, right? Next year. I could go to that one. But North America is still... Could be a good way to make the names Lidl and Schwarz known in the US, as you are in the market. When I was at Dynatrace Perform, there was this woman who was like, what is Lidl? Well, it's huge, you know, and when I started to show her numbers, it was like, oh, this

Starting point is 00:47:58 is huge. Yeah, it's huge. So yeah, I think that I will take your advice and I will write a paper, like how to sell 50,000 porks for Christmas without breaking the cloud costs, and how to sell fireworks to children in Germany for Christmas presents. What can go wrong, right? And that's more Andy's way of putting it, not Lidl's.

Starting point is 00:48:31 Awesome. Really appreciate you being on today. I do want to say that... and it's one of my favorite sayings, and it's not really even a saying, it's just made up. Really appreciate you being on. I'm glad you're learning from Andy; we learn from our guests like you. It's always fantastic to

Starting point is 00:48:55 have people on, so thank you so much. And thank you for coming on at such short notice. Everybody, Almudena really saved our butts today. So thank you. Thanks for the invitation. I love to be here. Alright. I guess that'll wrap the show. Thanks everyone. Bye-bye. Bye-bye.
