PurePerformance - The 3 Levels of SRE and bridging the gap to DevOps with Michael Wildpaner
Episode Date: August 15, 2022

SRE vs DevOps, SRE or DevOps, or is it SRE & DevOps? No better person to ask than somebody who has been an SRE for much longer than our industry has been talking about Site Reliability Engineering. Michael Wildpaner, Sr. Engineering Director Cloud Security at Google, started as an SRE for Google Maps back in 2006. Fast forward to 2022: Michael has a lot of hands-on experience with the SRE role, the different levels of SRE that an organization can apply, and how it connects with DevOps. Tune in and hear his personal stories from more than 15 years at Google. While not everyone is Google, there is for sure a lot we can take out of this conversation.

Here are some of my personal takeaways:
Core idea of SRE: take engineers that understand distributed systems and "annoy" / guide developers to build more resilient systems from the start
Design for automation: this already starts with naming your infrastructure (aka don't use Lord of the Rings names)
SREs help so that you DO NOT DESIGN yourself into a corner
Observability is the foundation of good SRE, as it enables incident management and insights all the way up to user insights
Tip: Ensure new hires understand that you have a blameless culture

As follow-up material, check out these links:
LinkedIn: https://www.linkedin.com/in/michael-wildpaner
Talk at DevOps Fusion 2022: https://devops-fusion.com/en/speaker/michael-wildpaner/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Andy Grabner and as always I have with me my co-host Brian Wilson.
Hello Brian, how are you today?
I'm awesome Andy.
I think this is the first time I actually mixed the names, and you tried to catch me making a mistake or getting me to start laughing.
But I'm good. And it is a hot summer day here, believe it or not. Well, it is July, and I guess it's supposed to be hot.
We're having another heat wave, and the next one is already rolling in.
Heat waves mean we need something to cool us down, but today we actually have a hot topic.
Nothing to cool us down, but another hot topic.
Nice, Andy.
Well done.
Thank you so much.
You're going to get an award one day.
Yeah, but not for being a good comedian.
I don't think so.
But I want to not wait any longer.
I want to give our guest today the chance to introduce himself.
Michael Wildpaner, or Michael. We're both Austrians,
even though Michael found his way towards the West,
which means not too far west from Austria: Switzerland and Zurich.
But before I say anything else stupid that doesn't make any sense,
Michael, welcome to the show.
And please do me the favor and introduce yourself to our audience.
Hey, Andi.
Thank you so much for the nice words.
And hi, Brian.
I'm very honored to be on this podcast with you today.
To start with a fun story: when I had my basic training at Google in 2006 in the Bay Area,
another famous Austrian was governor of California.
So I kept on being asked, hey, you really sound like the governor.
Another funny story tied to that, when I first started working at Dynatrace, I was in the
New York area and a lot of the Austrians would come over and we'd go to conferences and all.
And there was this sketch on Saturday Night Live called Hans and Franz, Pumping Iron with Hans and Franz.
It was sometime in the late 80s.
And the whole idea was these two guys were Austrian bodybuilders who worshipped Arnold Schwarzenegger.
And I forget who it was. It might have been Alois.
He was over and I was like, oh, you sound like this Hans and Franz thing.
Let me, let me show you this sketch, you know,
this comedy bit that we all laugh at here.
And he watches it and he was like, I don't get it.
What's so funny about it?
And I was like, oh, we're laughing at you.
I get it now. That's terrible.
It was funny because of the whole bit there with Schwarzenegger. In fact, because of Andy we used to call him the Summarator, because he always did a summary at the end, and I would actually intersperse Arnold quotes at appropriate times. So yeah, it's probably a joke you all get really tired of really quickly, I imagine. But it's still fun. He's done some fantastic movies and provides plenty of entertainment.
Anyway, way off topic.
Exactly, way off topic. Michael, I know you just briefly mentioned that you work for Google, but I think there's a little bit more to the story, especially for our target audience. In the last couple of years,
we've talked a lot about performance engineering,
but we drifted over a little bit
over the last couple of years towards DevOps,
towards site reliability engineering.
Obviously a topic that we all know
has been heavily made popular by the works from Google
and what you as an organization wrote about
how you're doing things,
but you personally have been very much involved, I think,
in that whole movement towards site reliability engineering.
Can you fill us in a little bit on the background?
Because I think that's going to be very interesting.
Give people some context of who you really are.
Yes, yes, with pleasure.
So let me start with one sentence before Google.
I spent a bunch of time working on bioinformatics,
so I have a little bit of a background in high-performance computing.
This is also what got me into the SRE organization.
Very importantly, when I joined Google,
I already knew that we shouldn't use Lord of the Rings names to name our servers.
But if you want to build a big distributed system,
you should have a numbering scheme
and not like build individual things.
Like today we would probably say cattle, not pets.
Like that's one of the terms that's used in the DevOps and SRE community for that principle.
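To make the "numbering scheme, not pet names" principle concrete, here is a minimal Python sketch. The site/role names and the four-digit scheme are hypothetical, not Google's actual conventions; the point is only that structured names make fleets easy to generate, filter, and iterate over.

```python
# Hypothetical sketch: a structured naming scheme makes it trivial to
# generate, filter, and iterate over fleets of hosts programmatically.

def host_names(site: str, role: str, count: int) -> list[str]:
    """Generate predictable host names like 'eu1-web-0007'."""
    return [f"{site}-{role}-{i:04d}" for i in range(count)]

# With "cattle" names, automation is a loop and a pattern match ...
fleet = host_names("eu1", "web", 1000)
canaries = [h for h in fleet if int(h.rsplit("-", 1)[1]) < 10]

# ... with "pet" names, every tool needs a hand-maintained list.
pets = ["gandalf", "frodo", "balrog"]

print(len(fleet), canaries[:3], pets)
```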
And I also did a lot of network security before joining Google. And this obviously
came from the university scene and people hacking each other in a very friendly way.
And that became quite an interesting area to actively work in. Today, most really good attacks
have like a multi-step setup
where you're mixing some social engineering
with like a lot of technical capabilities.
At that time, it was all much easier.
Like people had firewalls
that were not set up in a great way.
They had like Wi-Fi access points.
So one of my favorite pre-Google memories is sitting in my very old car,
having a big antenna on the roof and trying to find like open
or not very well secured Wi-Fi access points in big industrial areas.
So that was the good old times.
At Google, I got dropped in at the deep end.
So I started as an engineer in the site reliability team for Google Maps.
And Google Maps at that point was mostly a stateless service.
It is still relatively stateless when compared with other Google services,
but it was a really interesting learning experience in how can you ship a large amount
of highly structured data as quickly as possible to a huge number of devices. So latency was
absolutely the first thing we always looked at. We had a very deep culture of performance testing and regression measurement.
If a build created latency regressions for significant use cases,
we as SREs would block that build from going out.
So there was a very big focus on every 10 milliseconds count or every millisecond
counts in that particular application. These days, I'm no longer working on maps,
but the focus has moved a little bit to mobile. And in mobile, you usually have an application
and the application has a chance to do caching and prefetching and a lot of very smart logic. There is still a lot of focus on latency.
But let me just say that the story between mobile and desktop is quite different.
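The latency-gating idea Michael describes could look roughly like the sketch below. This is a hypothetical illustration of blocking a build on a p99 regression, not Google's actual release tooling; the 5% tolerance and the sample data are made up.

```python
# Hypothetical sketch of a latency-regression gate: compare a candidate
# build's latency percentiles against the current baseline and block the
# release if a significant use case regresses beyond a tolerance.

from statistics import quantiles

def p99(samples_ms: list[float]) -> float:
    """99th-percentile latency of a list of samples, in milliseconds."""
    return quantiles(samples_ms, n=100)[98]

def release_allowed(baseline_ms: list[float],
                    candidate_ms: list[float],
                    tolerance: float = 0.05) -> bool:
    """Allow the release only if candidate p99 is within 5% of baseline."""
    return p99(candidate_ms) <= p99(baseline_ms) * (1 + tolerance)

baseline = [42.0, 45.0, 43.5, 44.0, 41.0] * 200   # made-up measurements
candidate = [x * 1.12 for x in baseline]           # a 12% regression

print(release_allowed(baseline, candidate))        # False -> block the build
```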
I got a quick question on a couple of things, actually, that you just mentioned.
The first thing you said earlier is don't use the Lord of the Rings names.
And this reminded me actually, I think, of the presentation you did in DevOps Fusion.
This is where we actually met each other.
Thanks again, because you invited me to the Google office in Zurich the day before the conference.
And we spent quite some nice time together and had a chat
about what you're doing. And what you said today, I mean, it's the same as what you said on stage,
but it really struck me a lot because you said, in order to think about automation,
the first thing is you need to get rid of these names. You need to come up with a naming scheme
that is built for automation and not to make you feel good. And I think I see this a lot with people that I am working with these days
because automation is a big topic, right?
How can we scale, in our case, observability?
And you can only scale observability and then automation
if you have the basics right.
And the basics start with something as simple as the naming,
because it's easier to iterate through a thousand hosts if
you have numbers that go up and they're not called Gandalf and whatever else, right? I mean, that's the thing. So this was really fascinating.
The second thing that I wanted to say, and I think we should probably dedicate a whole episode to this, is that both Brian and I, in our backgrounds,
have done a lot of performance testing in our early days.
And just hearing from you the challenge of pushing a lot of structured data out to many,
many devices, and then you having to say the new build is good enough or not is something
actually that we have been doing over many years by doing performance testing.
Maybe not at Google Maps scale, but at other scales. Now one question I have on this: were you already called SREs back then, when you started at Google and worked on Google Maps?
Yes. When I joined Google, I think the term SRE had been around for a few years, and the SRE organization
had existed for roughly three years.
Please don't quote me on that, but I think the formal SRE organization was likely created in 2003.
Ben Treynor started off SRE at Google.
He coined the term and he popularized the idea.
And if you allow me, I will just repeat the core idea here once.
I know it has been described many times.
But the core idea of SRE is to put software engineers that understand distributed systems
into a role that is not operations, like SREs are not operations,
but that has just enough operations component to really annoy the engineers.
And once you annoy them, they will go and fix whatever toily process is generating that operational load.
And that's the core idea behind SRE.
It is not meant as a group that will keep on doing whatever their operations or toil work is forever, but the idea
is really: be just annoyed enough at the sharp edges of the system to remove them.
So really encourage them. I mean, I guess there are different words we can find in the English
language, annoying is one of them, but also motivating them, educating them or mentoring them in
building distributed systems that are by default more resilient, for whoever needs to take
care of the operational aspects later.
And I guess in the best sense, people can just, the system can take care of itself.
The system is resilient by default.
Yeah, that's actually, I think this is part of the core mission of SRE.
If you think about the different SRE engagement models, like how can an SRE team or a site
reliability engineer work with the development team?
And there are various stages.
And the most severe stage is like this service is absolutely mission critical.
You really need like dedicated SREs on the system.
This is mostly important if like downtime would have a huge, huge negative effect on
the organization.
Like there are some use cases where you really want, with a reaction time of three to
five minutes, somebody that actually really deeply knows
the distributed system,
that knows how to do reverse engineering on the fly.
This is a big, big part of the SRE job.
Let me try to reverse engineer this.
The systems today are all too big to have in your head
or on a very large whiteboard.
So in that case, you have a dedicated SRE team
that's permanently working with your developers
and is very actively working
on improving the distributed system itself.
But there are also other SRE engagement models
that are less costly
and that start to look more DevOps-like.
And I think this is also a nice segue into the conversation around this bridging DevOps and SRE talk.
There's a lighter engagement model where if you're building
a really important service, but you do not want to put
a huge SRE team on it, you just mix a few experienced SREs into their on-call rotation.
And they're going to lead by example, like they're going to do a good job in cleaning up after any
incidents. They will create a blameless postmortem culture, like all the things that are necessary in DevOps and in SRE.
And it will help improve the system.
But just mixing them into an existing on-call rotation
can already change the focus on reliability.
To share a personal story,
when Google was launching one of the backends to G+, to the social service
that Google launched, I was running the security SRE team at Google and we ran one of the access
control services that started out with G+, and that's now being used in many products.
And it was personally one of my best experiences as a site reliability engineer, because we did
not have a team to cover the full service. We joined the developer on-call rotation,
and it was, I would say, three to four months of pure mutual learning. I learned a lot from the developer team on how to build that kind of high
performance access control system.
And we were able to share with the developer team
a lot of the learnings we had from earlier services.
So this mixed model, in my opinion, has a lot of value.
And the third model is more of a teaching model
where you join a team, but just for one or two weeks.
This is very DevOps-like,
like institute a little bit of best practices
and SRE and DevOps culture and then move on
and the original development team will carry on
effectively DevOps-ing their service.
Really fascinating. I think I remember
some of these stories from when we met in Zurich.
But let me ask you a question.
For your personal story with the Google Plus access control system:
Do you have a ratio that you had between the developers and the SREs that did the on-call rotation?
Did you have a one-to-one?
Or what did it look like?
as they probably have situations like this and any guidance would be interesting.
Yeah, I will be honest.
I would not call this guidance, but I can share a little bit of data. The best exchange we had was when the team was roughly,
I would say, 50-50. And it was very practical, because our development team was in
the US and the SREs were in Europe, so we would naturally share the 24-hour rotation into 12-hour shifts, which meant that every day the responsibility for the service had to bounce between Dev and SRE.
And every day we had to speak the same language and every day we had to clean up on all of the issues that we saw.
I found that very practical. I'm not saying it's the best setup for any situation, but it was definitely
a good split.
And then with the overlap, I would assume Europe and then maybe the West Coast, right? I
guess you at least have a couple of hours of real overlap as well. And that's great. And again, I don't want to call it a recommendation,
so I'm sorry that I used that word, but these are just some insights.
You also mentioned the first level is more like dedicated SRE teams.
And in the very beginning, you said typically you decide
what level of SRE engagement you want based on the criticality of the apps.
How is it right now at Google?
Do you still have certain apps where, based on criticality,
you then really decide which model fits?
And how do you decide the criticality?
Because I assume, if I look at Google from the outside,
I would guess search is obviously most critical,
because that's where you make a lot of money.
Then also, I guess, you know, when it's Gmail
and all the business apps that you have,
do they all have their own dedicated SRE teams?
Yes, these apps all have their own dedicated SRE teams
at various levels of detail.
In some cases, when you look at cloud, at GCP,
like we actually have quite a big number of SRE teams because every component of cloud has its own idiosyncratic issues.
And the skillset is actually in the day-to-day operation quite different, whether you're running the load balancing system or you're running a distributed database.
Yes, they're all built on the same base infrastructure and the reverse engineering that I mentioned earlier looks very similar in all of these services.
But there is definitely a lot of knowledge that is really ingrained and that is very
different. If I'm doing front-end traffic management,
I need to make sure that we're not dropping any queries, I need to understand what
stabilized TCP anycast is, and I need to know what route injection is.
There are a lot of skills on the front-end serving side that are very specific to what people are
doing there.
And if you're running, let's say, Spanner,
you need to understand how the actual distributed transaction system is working.
And there's a lot of performance interactions between different layers of the stack,
obviously, when you're running a database system.
And that takes quite a bit of learning to get proficient at.
Your presentation at DevOps Fusion was about bridging the gap between DevOps and SRE. I think the DevOps Fusion team recorded the session,
and it will be published at some point. You started off with, what's the right word, the two amigos,
the gunslingers, kind of trying to shoot each other.
It was DevOps and SRE, but obviously the story that you tried to tell is bridging the gap between these two.
And I also have a hard time sometimes explaining kind of really how they overlap
and how they kind of benefit from each other.
I think you did a really great job in it,
but I just want to give you again my explanation that I give,
and then I would like to then throw it back to you on how you see it.
When I talk with
people and they're familiar with DevOps and then they ask what this SRE is all about, I
typically say, and I know people cannot see my hands right now, but I always point from two directions.
I say DevOps, for me, at least what I see, are the teams that are using automation to speed up
delivery. So they are the ones that are providing an automated way
to get code changes from development all the way into production.
On the other side, I see SREs, like you said a little bit,
they have more ops experience.
They're coming more from the ops side,
and they're using automation to keep systems reliable and resilient.
And kind of like there's no gap in the middle,
even though it seems like a little gap
when I point the fingers from left to right,
but it's more like they're influencing each other
because SREs, they have knowledge
on how to build resilient systems
and certain things should be baked into the pipeline,
like the automated monitoring,
the automated testing,
the automated validation.
And so I would like to now hear from you
how you are approaching this whole topic of DevOps and SRE,
where the overlaps are, and really how we can bridge the gap
because we don't want to tell the world
there's DevOps on one side, there's SRE on the other side,
and then we're throwing it over the wall again.
Yeah, I really like the model that you're describing,
like starting on these two ends,
but effectively meeting in an overlapping zone in the middle.
Let's look at one extreme example.
Let's say you only have like one person or one very small team
like building and operating a whole service.
And so what does that person need to do?
Like outside of our conversation, somebody needs to do product management.
Like somebody obviously needs to understand like what we're building.
But when we go to the software engineering and the operations,
there is like super deep domain knowledge in like the design of what we're building.
Like many of those systems have like non-distributed systems aspects that need a lot of expertise. Let's say we only have one team.
So these folks now have a great interest to iterate fast. I think to the audience of this podcast,
this is like bread and butter. But many organizations to this day still have like big releases,
like once a month, once a quarter,
where they are betting the house on this release,
like going out of the door and then something doesn't work.
Like one out of like a thousand small changes or two out of a thousand changes break.
You have to roll back.
And it's like essentially a war room every month or it's a war room every quarter to
get the release out.
And so I, and maybe you live in a bubble where it doesn't exist or maybe you see it, but
you also see people moving away from that model.
But there's still a lot of organizations that do that kind of software development.
So let's assume our very modern team doesn't want to do that.
And this is where what you called DevOps,
what I would also call release engineering, would come in. A team that is dedicated to the velocity
of the software development process.
And that team can eliminate all kinds of problems.
Like how do we speed up code reviews?
I personally think code reviews are absolutely necessary.
Like I would never start a software engineering project
without mandatory code reviews, to be very blunt here.
That team could work on: do we have deterministic builds? How long
does it take to do an incremental build? These are all topics that would really improve
the developer experience. Do we have proper staging environments? Do we have A/B testing? Do we have
a release system that can allow your
changelist to be out in staging within minutes or within one hour instead of in days?
So I think this is one of the core parts of DevOps. And in SRE it is not that important to do
it because usually like if the service is so important that you need a dedicated SRE
team, then the service is usually also so important that you have a dedicated release
management team.
So SREs generally demand that a really good release pipeline exists and SREs have the
skill to help build it, but it's not their main job.
They usually show up and say like, hey, you're building this by hand, what is this?
Can we please have a useful release pipeline?
The SRE
skill set is overlapping with these other groups
and really starts at the design
of the distributed part of the system.
Is the system stateless? Is the system stateful?
What storage mechanisms do we use?
How do we do load balancing? How do we do caching?
When you have a cache invalidation event,
how expensive is it to refill your cache?
Do you have a capacity cache?
Or is this cache used for latency?
There's just many, many questions that SREs will ask
when they see an existing design of a distributed system
or they see a new design doc coming up
where they're trying to make sure
that people are not designing themselves into a corner.
And I think this is really where the SRE software engineering skill set is starting.
And then it goes kind of downstream from there.
Like, does the release tool have exponential rollouts or is it trying to update everything at the same time?
Does the release tool have any safety checks
in case the build turns out to be broken in production?
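As an illustration of the exponential rollout with safety checks that Michael mentions, here is a hedged sketch. The wave sizes, health threshold, and error-rate source are all hypothetical stand-ins for whatever a real release tool and monitoring system would provide.

```python
# Hypothetical sketch of an exponential rollout with a health check between
# waves: update 1%, then 2%, 4% ... of instances, and stop (or roll back) as
# soon as the new build's error rate looks unhealthy.

import random

def healthy(error_rate: float, threshold: float = 0.01) -> bool:
    return error_rate <= threshold

def error_rate_of_new_build() -> float:
    # Stand-in for real monitoring data on the freshly updated instances.
    return random.uniform(0.0, 0.02)

def rollout(total_instances: int) -> None:
    updated, wave = 0, max(1, total_instances // 100)   # start at ~1%
    while updated < total_instances:
        wave = min(wave, total_instances - updated)
        updated += wave
        print(f"updated {updated}/{total_instances}")
        if not healthy(error_rate_of_new_build()):
            print("health check failed -> halt rollout and roll back")
            return
        wave *= 2                                        # exponential growth

rollout(10_000)
```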
And you mentioned a lot of good stuff around monitoring and debugging earlier.
I really love the idea that you said there
of people designing themselves into a corner.
I think that even architecturally,
we see people choose themselves into a corner too, depending on which technologies they choose and why, and whether they're making an informed
decision early on with Kubernetes, right? Everybody just wanted to jump to Kubernetes
for no good reason, because that was the latest thing, right? Or we've seen people like, hey,
we're going to move 100% serverless. Why? Because we want to be serverless.
And then they get stuck in a situation.
And as Andy's talking about bringing in the observability side of things,
how are you going to observe that if you're on a platform
where there isn't good enough observability tooling?
Or what have you designed for that bit, as opposed to making another choice?
So there's all these choices that go into everything you do
that if you're not paying attention, as you said,
you end up designing yourself into this corner.
And now how do you get out of that?
And I just got to also say at this point too,
it's really the level of SRE that you're discussing is,
I think a lot of the times we've discussed SRE in the past, Andy,
it's not been this deep of a level.
A lot of my exposure to SRE work has not been this deep as well.
So this is just, I'm very quiet because it's just smacking me in the head really hard.
Like, wow, there is so much more than the high-level SRE stuff you hear about a lot more commonly.
So I really appreciate you going into that level there.
If you want, I can share a few thoughts on this,
designing yourself into a corner.
My most sensitive part when I look at the design
is the way that people are structuring their storage.
And the reason for that is,
if you fix a distributed systems design problem in any of the stateless parts of your service, that's usually easy to roll out.
You need to be careful that you have forward and backward compatible APIs so that you can update parts of your system. You need to be sure that if you're doing some experiment launches, so a feature is only
visible to part of your audience, that these experiment configurations are consistently
applied everywhere.
But it is still, from a conceptual perspective, relatively easy to move from one stateless
processing system to a second one.
But the moment that you're hitting either a spinning disk or these days maybe an SSD,
you're committing the sins of the past to your storage system.
And if you're then trying to fix anything there,
like unless you have the capacity to rewrite everything that was ever written to stable storage
to a new schema, you have to keep the code that has to live with the sins of the past forever.
This is something I learned.
I spent a lot of time in Gmail.
And once you had anything written to a storage system,
you needed to be able to parse that and unmarshal that again and process it again.
And if you're building a sufficiently large system,
that becomes very, very expensive.
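A minimal sketch of the "sins of the past" problem with stored data: once version-1 records are on disk, every reader has to keep the version-1 unmarshalling path alive. The JSON schema and version tag here are hypothetical, purely for illustration.

```python
# Hypothetical sketch: once data has been written under an old schema,
# readers must keep the old unmarshalling path forever (or rewrite every
# stored record). A version tag on each record is what makes that possible.

import json

def unmarshal(record: str) -> dict:
    data = json.loads(record)
    version = data.get("v", 1)
    if version == 1:
        # Sin of the past: v1 stored the full name as a single field.
        first, _, last = data["name"].partition(" ")
        return {"first": first, "last": last}
    if version == 2:
        return {"first": data["first"], "last": data["last"]}
    raise ValueError(f"unknown schema version {version}")

old = '{"name": "Ada Lovelace"}'                       # written years ago
new = '{"v": 2, "first": "Ada", "last": "Lovelace"}'   # current schema
print(unmarshal(old), unmarshal(new))
```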
And in modern distributed systems,
it's really hard to just use a relational database.
Like I'm a kid of the relational database generation.
Like I started out, I can't even pinpoint it.
Let's say I spent a lot of time with Oracle early on.
I played with like Postgres.
Before it was called Postgres, it had a query language called QUEL, which was not SQL compatible.
I spent a lot of time with MySQL.
And these are really, really powerful tools to store data. But unless you manage to design with sharding in your mind, unless you already build a distributed systems design
where the subset of the data that needs to be relationally consistent
and relationally managed and using foreign keys
and all the actual good features of a database,
unless you manage to design this to be very granular.
Like relational databases do not scale.
And so Gmail is not using a relational database in the classical sense.
But what is really important is that when you're processing email,
you can have a unit which is a user.
And that makes your life much easier.
If you're building a system that's all about sharing and you have this huge ball of hundreds of millions of users and they're sharing things with each other, it's effectively impossible to disentangle that. So if you're building a sharing system with millions of users and
billions of items shared and you start out with a relational
model, you need to start from scratch. You cannot
scale something in a relational database with a huge ball of
foreign keys and hard relationships.
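To illustrate the "user as the unit" idea in code, here is a small sketch of hash-based sharding by user ID. The shard count and hashing choice are assumptions; the takeaway is that per-user data lands on one shard, while cross-user sharing relationships are exactly what such a scheme cannot express cheaply.

```python
# Hypothetical sketch of sharding by user: if the user is the unit of
# consistency, a hash of the user ID picks the shard, and everything that
# must stay transactionally consistent for that user lives on one shard.

import hashlib

NUM_SHARDS = 64

def shard_for(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All of one user's data goes to the same shard; cross-user "sharing"
# relationships are exactly what this scheme cannot express cheaply.
print(shard_for("alice@example.com"), shard_for("bob@example.com"))
```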
I'm especially sensitive when it comes to the design
of your storage system because there, like,
migrations are hard and mistakes are on your disk,
like, forever or until you rewrite the data.
I love these stories because, obviously,
they come from your experience.
I want to quickly throw in one of my stories, and I think this is also a way to date myself.
I remember in the early days of Dynatrace, a customer came to us and they said,
Hey, we built this new high transaction volume application on top of SharePoint.
And we don't know why things are slow.
And we read some of your blog articles on SharePoint.
Can you help us?
And I mean, I should have known when they
mentioned the words SharePoint and high-transaction-volume system
in one sentence that this was most likely the problem.
And this was exactly the problem.
They misused the flexibility that SharePoint gave you with the lists you can put data into. This was like 15 years
ago. It's really flexible, but obviously it doesn't scale. And the recommendation,
maybe I was kind of the SRE back then that brought them out of their corner, I told
them: you need to rewrite your system, because you made a very critical, wrong architectural decision
on how you want to store and how you want to treat your data.
And the question is for me now, a lot of organizations,
and especially I would think if we look at startups
that are coming up with new ideas,
where is the trade-off between over-engineering things and
actually giving yourself, in the beginning, the runway and the speed without having to think about these
things? Because first you need to figure out whether your business is even a good business idea.
But then, with the effect that if you don't do it right from the beginning, eventually you have a lot
of technical debt and you need to rewrite everything. Have you been in a situation like this, where you made certain
decisions on purpose in a different way and then later changed course because you knew you now had to change it?
So let me start out by admitting that I've been in this situation of trying to
over-engineer stuff many, many times in my career. And indeed, as an engineer I
had to learn to step back and actually look at the time to market, or the time to
deliver the system, versus creating the absolutely perfect setup. So I think
this is a social skill that engineers need to learn,
hopefully not by mistakes, hopefully by like osmosis from really great mentors.
One thing that Amazon talked about a lot that I love is this idea that APIs rule. You should start out with
not a stable, but a sensible API
between two systems.
At Google we talk about this a lot.
Luiz Barroso, one of our most senior engineers,
really pushes people to iterate
on the implementation behind an API.
This is where you can make the most smart engineering trade-offs.
Even if the API is not picture perfect,
like as long as you can keep people using the API
and not trying to peek around it and use some other interfaces,
for example, as long as you can keep people
from ever directly talking to your
storage system, you have a chance as an engineering team to really provide fantastic optimizations.
And so that's very important.
If you design your whole system around your storage schema, if you're using the storage
schema as the lingua franca between teams, you have painted yourself
into the very small corner where the full API of the storage system is the API between all parts of
the system. And this is just way too complex. And yeah, if everybody can read from the
relational database, you're doing it wrong.
Yeah. I remember, Brian,
these days, a couple of years ago
with our first generation
product when everybody wanted to get
direct access to the data store
because they knew there's some data in there
and our other APIs didn't give them
the data that they wanted. And then they re-engineered
and we had a lot of people
building their own solutions on top of
the direct database model that we had,
the Datastore model.
And this had all sorts of side effects
and actually made it really hard to migrate these people over to another platform.
You can't specify an SLA on a database API.
It just doesn't work.
The APIs are too complex.
Sorry, Brian, I interrupted you.
Oh, no, no, no.
I was just going to comment on the idea.
Andy, you brought up in the beginning
of this line of discussion here
the idea of over-engineering
versus re-architecting.
And I don't think we have the answers here,
but I'd be curious to find out,
especially for startups, is it beneficial to get to a point where you have to re-architect?
Because let's say you do, you need to find out the viability of your product if people are going to like it.
Having a chance, having the opportunity built in by force, I guess, where your hand is forced, once you get to a certain success level, to
re-architect and start fresh again.
Almost sounds like an opportunity
to wipe the slate clean
and take everything you've learned
and start fresh again and build anew from there.
I mean, yeah, there's always the pain of,
do we have the time, everything else.
But if you're going to be successful,
I doubt you're going to be on the right path from day one.
So getting yourself to that level
and having the opportunity to re-architect
almost seems like it would be beneficial,
but obviously we'd have to talk to a lot of startup people
to find out.
Let me put on my manager hat for one minute.
And I looked a lot at risk management.
If I think about rewriting a complex system,
I would try to think about how can I de-risk the rewrite.
And the most risky rewrite a company or an organization is going to do
is the full system rewrite.
Like, hey, let's throw away everything.
We learned a lot of stuff. Let's do it again.
And this has all the risks. It has the risk that some of the learnings will be forgotten.
It has the risk that you're not re-implementing the same feature set, but suddenly you're also
trying to satisfy a huge list of other features. It has the risk that the team is overambitious and it's trying to, like
we talked about, when is enough enough?
So you have a super high risk with ending up like the second system effect.
It's very, very well documented in the systems literature: either you cannot launch the rewrite
or it takes you five times as long.
And so if I look at this from a pure risk management perspective,
what I would say is plan for rewriting your first system,
but try to have some stable APIs
so that you can rewrite the system iteratively in parts.
Okay, throw away your front end.
That's totally fine.
But as you're rewriting your front end,
try to use the same storage backend if possible
because you don't want to have all of your balls
in the air at the same time.
Okay, you're changing your storage.
That's really, really critical.
You need to have a very good process.
I've been in critical storage migrations
and it's usually let's build the new storage system
but have a scaffolding,
like have an API that is simulating the old storage system.
Okay, let's start to do it right for a few users or for some subset.
Okay, let's have a validation system that can, via the API,
retrieve every piece of data in the data unit and compare it bitwise.
It's just a huge dance.
So you're doing this dance, and then someday you trust
the new storage system and you flip the source of truth over,
and then you still wait and have the old system running in standby, and then
at some point you convince yourself: okay, fine, the new system is working and you can turn off
the old one. If you're also rebuilding every batch process, every front end, every messaging
system at the same time, this is not going to converge, because you can never tell: is this
now a regression in the new storage system
or did we change everything else at the same time?
So I would ask people to de-risk the rewriting process
by picking like some API level
and then only rewriting on one side of the API at a time.
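Here is a hedged sketch of that migration "dance" behind a stable API: dual-write to the old and new stores, validate bit for bit, and only then flip the source of truth. The class and method names are hypothetical; real migrations add batching, backfill, and per-user cutover.

```python
# Hypothetical sketch of the migration "dance": keep the old store as the
# source of truth, dual-write to the new one behind the same API, and run a
# validator that reads every unit through the API and compares it bit for bit.

class Store:
    def __init__(self):
        self.data: dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self.data[key] = value
    def get(self, key: str) -> bytes | None:
        return self.data.get(key)

class MigratingStore:
    """The stable API; callers never know two backends exist."""
    def __init__(self, old: Store, new: Store):
        self.old, self.new = old, new
        self.source_of_truth = self.old
    def put(self, key: str, value: bytes) -> None:
        self.old.put(key, value)          # dual-write during migration
        self.new.put(key, value)
    def get(self, key: str) -> bytes | None:
        return self.source_of_truth.get(key)
    def validate(self) -> bool:
        return all(self.new.get(k) == v for k, v in self.old.data.items())
    def flip(self) -> None:
        assert self.validate(), "refuse to flip while stores disagree"
        self.source_of_truth = self.new   # old store stays on standby

store = MigratingStore(Store(), Store())
store.put("user/1", b"hello")
store.flip()
print(store.get("user/1"))
```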
And this comes back to what you said earlier, right?
Start with a good API.
And then I think if you have a good defined API,
APIs for me are like a contract, obviously, right?
They're a contract between two parties.
And it's also a great way to then dare enforce,
if it makes sense, your SLAs, your SLOs, right?
You want to make sure that the API is responding
in the way you expect it from a performance perspective, from an availability perspective.
Because a lot of people always ask us, because obviously when we talk about SRE, a lot of the time, a lot of SLOs come to mind, right?
And SLOs everywhere.
And then I say, well, SLOs everywhere might not be the best approach,
but you want to define SLOs where it's critical for you. And you talk about these critical APIs
to the outside, to your end user, but then also to business critical backend systems like the
storage. And there it makes sense because you know, if you're not meeting your SLOs there,
it will have an impact. And if they are kind of stable, right? If you want to keep these APIs stable,
that means you also have history.
And as you're changing and iterating
through the implementation,
you know how the performance has changed,
how the resiliency has changed
with the new implementation.
And I think that's why it's so critical
to define these good APIs.
And they will be good.
And then next year they will not be good,
but they're still APIs.
So you still have a meeting point between the teams or between the systems.
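A minimal sketch of checking an SLO at such an API boundary, assuming you already collect per-request outcomes: compute availability and p95 latency over a window and compare them to the agreed targets. The targets and the synthetic request window are made up for illustration.

```python
# Hypothetical sketch of an SLO check at an API boundary: given a window of
# request outcomes, compute availability and p95 latency and compare them
# against the contract agreed between the two teams.

from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    ok: bool

def slo_met(requests: list[Request],
            target_availability: float = 0.999,
            target_p95_ms: float = 200.0) -> bool:
    availability = sum(r.ok for r in requests) / len(requests)
    p95 = quantiles([r.latency_ms for r in requests], n=100)[94]
    return availability >= target_availability and p95 <= target_p95_ms

window = [Request(latency_ms=50 + i % 120, ok=i % 1500 != 0)
          for i in range(10_000)]
print(slo_met(window))
```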
I want to cover one quick thing because I have your presentation open here in front of me on the other screen.
And in your presentation, you made a very good point where you said, why are we writing software?
We're not writing software to make us feel good, but in the end to make money.
How do we make money?
We make money because we can sell the software to somebody.
So on the top, there should always be the user.
And you had this maturity model,
kind of like a pyramid.
And I think you called it the,
you gave kudos to Mikey Dickerson's
hierarchy of service reliability needs,
which I thought was really nice
and especially the way you spoke to it. As far as I remember, you said that at the top is the user, but
then at the bottom of the pyramid is monitoring, right? Observability. Which is essential, because if
you don't have observability built in, then you don't know, when a user is complaining,
why they are complaining. Are they complaining
because they just have a bad day? I would just like to ask you a little bit about,
from an SRE perspective, do you see SREs in their mentoring role that they have,
also to make sure that this hierarchy of needs is then baked into the platforms? Or what type
of responsibilities does an SRE team have? I see here things like retrospective, root cause analysis,
capacity planning, other certain things SREs need to build and mentor and build into the platforms
or are there also certain aspects that are purely an SRE role and that always stay with an SRE?
Yeah, so I think that you have outlined very nicely the steps on how to push good ideas into a large engineering organization.
Step one is you do it and you show people how it's done. When I joined SRE at Google, yes, we were the ones that were adding monitoring metrics to binaries by hand as software engineers.
Let's export this metric.
Let's create a counter.
Let's create a ratio, whatever.
And we configured the monitoring system to scrape them and aggregate them and alert on them.
So from an organizational maturity perspective, it was relatively low, but it was absolutely what was needed at that time.
Because we had systems that were like not black boxes, but that were like absolutely not instrumented enough to determine, is this release good?
Do we have enough capacity?
Why did the latency in this
cluster shoot up? These are all questions that SREs had to answer every day, and in some
products we didn't have the capability. So I think this is step one.
Step two is changing the programming language frameworks to make it easy for the
software engineers to help themselves.
I think that's the second step. And this is where in the last 10 years, the open source world has
made such big progress. There are many, I don't want to quote them, but there are many open source
language frameworks that have understood that the software engineer needs to be able to export data easily.
This can't be like black magic.
So I think changing the software frameworks is the second step.
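As one example of what those open-source frameworks look like today, here is a small sketch using the Prometheus Python client (not Google's internal monitoring stack): the engineer exports a counter and a histogram, and the monitoring system scrapes them from an HTTP endpoint.

```python
# Example of "export a counter, let the monitoring system scrape it", using
# the open-source Prometheus Python client (pip install prometheus-client).
# This is not Google's internal stack, just the same idea in the open.

import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency")

@LATENCY.time()
def handle_request() -> None:
    time.sleep(0.01)                      # pretend to do work
    REQUESTS.labels(status="ok").inc()    # the hand-exported metric

if __name__ == "__main__":
    start_http_server(8000)               # scrape target at :8000/metrics
    while True:
        handle_request()
```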
And the third step is creating or improving the production platforms
that the software runs on.
It's absolutely not a coincidence. On the framework level,
I can give you one example.
Like in some products, we had huge problems with overload.
And the original SRE response was relatively operations heavy.
Well, let's add more capacity.
Or the release is bad.
Like, let's roll back.
But in reality, when you look at your request stream,
not every request at an RPC level should have the same QoS class.
It's a notion that's coming from the network
that we didn't use at the RPC level in the beginning.
So we're treating all the client requests with essentially the same priority.
But in some cases, the user just clicked the save button and they really want their data to be stored.
And in other cases, the browser or some part of the backend processing on a mobile app is just caching data.
And these two operations obviously don't have the same QoS class.
So on the framework level,
SREs at Google went in and built load shedding and
traffic management into the programming libraries, so that you
effectively cannot start a new binary
that will not be able to protect itself
from overload.
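A hedged sketch of that idea, priority-aware load shedding baked into the serving path: when utilization climbs, background traffic is rejected first and user-critical writes keep flowing. The QoS classes and thresholds here are hypothetical, not the actual library behavior.

```python
# Hypothetical sketch of priority-based load shedding: when the server is
# near capacity, drop the "prefetch"-class requests first and keep serving
# the "user clicked save" class. Real RPC frameworks build this into the
# library so every binary gets it for free.

from enum import IntEnum

class QoS(IntEnum):
    CRITICAL = 0      # user-visible writes, e.g. "save"
    DEFAULT = 1
    BEST_EFFORT = 2   # background caching / prefetching

def admit(qos: QoS, current_load: float, capacity: float) -> bool:
    """Shed the lowest classes first as utilization climbs."""
    utilization = current_load / capacity
    if utilization < 0.8:
        return True
    if utilization < 0.95:
        return qos <= QoS.DEFAULT
    return qos == QoS.CRITICAL

for q in QoS:
    print(q.name, admit(q, current_load=90, capacity=100))
```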
And the third level is really this platform level
where you probably need to have a good microservice platform.
But not everything will move to microservices or will move to,
I don't know, you mentioned cloud functions earlier.
You will probably have some classical kind of fat binaries with multiple endpoints.
So you need a really good platform to run those.
And when you look at Mikey Dickerson's hierarchy of service reliability needs, which
is everything here on this slide, I just renamed a few of the fields.
You see that the pervasive monitoring
is essentially enabling everything else
because you can't do incident response
if you don't know what the system just did.
You can't do root cause analysis
if you don't have a good recording of what happened.
I don't want to read everything,
but like testing, capacity planning,
like all of this is referring back to the
capabilities below it, especially
monitoring.
That's why I really liked
it and also the way
you talked about it at DevOps Fusion
and wanted to bring it up. Michael,
we are approaching the top of the
hour here while we're
recording.
I wanted to make sure that we also give you the chance in the end to say
anything else that we may have not talked about that is important for you,
for our listeners to understand in the context of the topic we talked today.
Is there anything we missed?
Any final words?
Yeah, if I can share one thing, like when you are introducing new engineers
into DevOps or an SRE role,
like it's really important to make sure that they understand that this is like a blameless culture.
Like my biggest problem, I can share with you as a new team member in the SRE team,
my biggest problem was like I never wanted to make a visible mistake.
And I spent like so many hours like reading source code, digging through design docs,
just so that nobody knows that I don't fully understand the whole system. And this is
an anti-pattern. And luckily, like super nice and way more senior folks like tapped me on the
shoulder and said like, hey, you should have asked.
There are five people around you that can answer this question in 10 seconds.
And as long as you have a good culture of batching up your requests
or your questions, you're not annoying anybody.
People love for you to reach out.
And I think this is what all of us need to do.
When we see new team members show up.
We need to take away this load from their shoulders that they have to be perfect
because our software systems are not perfect
and we shouldn't expect humans to be perfect.
It just doesn't work.
In that way, software systems are just a reflection of the imperfection of us humans.
Absolutely.
That's what it is.
Awesome.
Brian.
Yes.
I know you're fighting with your camera there a little bit.
Well, I turned it off before because I was getting a little latency,
but I figured I'd try to turn it back on.
Did you have an SLO violation on the latency?
I sure did. And I'm going to contact my cable provider
for violating their SLA.
Yeah, exactly.
Yeah, I think that, again, Michael,
we will write a little summary of this
to make sure the people that browse
through the podcast
apps that they have, because we know there's more than our podcast out there.
We will also provide a couple of links, so I will definitely link back to your
social media profiles, and if there's anything else we should post, just let
us know, we can add the links to the proceedings. It was a pleasure having you.
Again, thank you so much for also hosting me in Zurich before the conference. This was very much appreciated and just shows what a great culture you have there at
Google, and I hope we see each other again face to face at some time. Thank you, Andreas, for the
kind invitation, and thanks to both of you for the really interesting chat.
Yeah, I just wanted to say this is one of those ones where I have to just sit back and absorb.
It was a lot
of fantastic information. Really, really appreciate
it. And it was also a reminder,
you know, Andy and I moved
over from being on
the hardworking side of the fence
to the software sales
sort of side of the fence years
and years ago. And
especially hearing the level of detail on some of the stories you go on,
for me personally, it gave me a reminder of what our customers are going through.
I have engagements, they're very light level engagements,
and I just see, I guess, the surface of it.
And getting a reminder of what's going on in the background and all
the complexities and things they have to tackle on a daily basis, both in their daily
work and strategically big picture is always a fantastic reminder. So really appreciate you
sharing all this information with us and our audience as well. Sounds good. Thank you so much,
gentlemen. Thank you. Have a wonderful day and thanks for listening.
Have a great day. Goodbye.
Bye.
Cheers.