PurePerformance - 066 Load Shedding & SRE at Google with Acacio Cruz
Episode Date: July 16, 2018
Have you heard about Load Shedding? If not, then dive into this discussion with Acacio Cruz, Engineering Director at Google ( https://twitter.com/acaciocruz ). He walks us through what Google learnt from one of the early outages at Gmail and how he and his team are now applying concepts such as load shedding to avoid disruption of their services despite spikes of load or unpredictable requests. We also discuss SRE (Site Reliability Engineering), how it started and transformed at Google, and how we should think about automation, configuration of automation, and automation of automation. For more details – including visuals – we encourage you to watch Acacio's breakout session from devone.at on YouTube (Load Shedding at Google).
https://devone.at/speakers/#acaciocruz
https://www.youtube.com/watch?v=XNEIkivvaV4
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson and to kind of go back to what some of the older kids out there might know,
I'm going to pull a Casey Kasem and pull a long distance dedication here because my colleague Andy Grabner is now long distance.
I believe this is the first podcast we're recording that he is now back in Linz, Austria, back to his home.
So, Andy, hello.
And it's with a tear of distance that I welcome you back to the show.
Hello.
Thank you so much.
Yeah, and I think we coined the term, I'm back in Mozart time, right?
Mozart time zone.
Correct. Correct.
Exactly.
It's a little challenging sometimes to figure out in which time zone we are putting, well, we are organizing our recordings.
And now it's Mozart time and Denver time.
Well, and I was trying to look up something good for Denver time.
I looked up famous people from Denver.
The first one I hit was Tim Allen, and I'm not really a fan of his, so I'm not going to go with him. And then I saw Lon Chaney, who played, I believe, the werewolf in the original movie. But I don't think many people are going to know if I say Lon Chaney time. They'd be like, what? That makes no sense. So we'll just keep it as Denver time.
Yeah, but talking about time, our guest today actually comes, currently, you know, he's in a country that is known for being always on time. At least they have all the tools and they create the most precise watches in the world: Switzerland.
And Acacio, I hope you're still with us. And again, instead of me introducing who you are, maybe you want to just share a couple of words about who you are, who you're working for right now. And yeah, just give us a little intro.
Hi, good morning or good afternoon, depending on the time zone you're in.
My name is Acacio Cruz. I'm in Zurich and I work for Google, especially in the
frameworks organization. We do software frameworks for the rest of the organization,
production platforms.
I started my career a long time ago in 2007 at Google.
I started in an organization called SRE,
Site Reliability Engineering,
and Gmail was my first project.
So specifically on the delivery side of things, spam, abuse, and delivery.
So our scope grew over time.
And in 2009, I don't know if you want me to give an intro already about the talk, but in 2009 there were these very famous outages that Gmail suffered. And out of that set of incidents, which were very public and very impactful to the product and to the organization, my team started a little project around load shedding. And that's part of the reason why I'm here today.
Yeah, so you mentioned a couple of things. First of all, Google. I guess most people have heard about this little company out of Silicon Valley.
You're one of the few companies where I believe the company name is implicitly used as a verb as well.
I mean, you Google for something.
I think you have a major impact on everyone around the world.
And in case somebody could have somehow escaped knowing who Google is, then please Google for Google.
And there's also the Google effect, the impact you've had on people's expectations on performance
around the world, right?
Exactly.
Yeah, yeah.
Back in the days when we did a lot of work with people like Steve Souders around web performance optimization, Google was always the gold standard.
And everybody was looking up to Google about how fast the page has to be.
So very honored, Acacio, that you find the time and share your stories.
Now, Acacio, I think the reason why we got to talk: we invited you, or the Dynatrace Linz team invited you, a couple weeks back to speak at DevOne in Linz, and the talk that you gave was around load shedding. Now, I also had the pleasure to meet you twice now in person. First, in March, I met you in the Google offices in Zurich. I happened to be in Zurich for a conference, and then you invited me over and we got to chat, and I think you explained some of the concepts to me. And then two weeks ago we met again at DevOps Tallinn. You did a little different talk there around microservices, but I love the load shedding.
And I think the first question I have for you, for people that may have never heard the term load shedding: is there a quick introduction that you can give, and kind of the problem that you wanted to solve with it? We understand that you had the big problem in, you know, 2009, as you said. But load shedding, where does this name come from, and what's the principle of it?
Yeah, so succinctly, load shedding comes basically from the metaphor of water, like when you shed water, or you shed something heavy off your back. And basically it's avoiding a workload becoming overloaded by avoiding work. At the core of the principle is: don't do work that might damage your task, your process, your mission, your goal.
So if my goal is trying to hang out, I can say I'm load shedding, boss.
Sorry, ignore my stupid interjections. Keep going.
You're load shedding work. That's exactly the nature of that. The core principle is while still being productive. And that's what some people are not able to do.
Right.
And so I think the core principle then is actually knowing what is the maximum amount of load I can handle before I get overloaded.
Isn't that the biggest challenge then?
Yeah, correct.
So, you know, even though the principle, like the goal is fairly simple to articulate, it's
actually fairly complex because, you know, knowing what is the actual load in a task
is not a simple process.
And it's actually not unidimensional at all, right?
It's not just about CPU.
It's about memory.
It's about IO.
It's about latency.
So over the years, the bulk of the work that my team is doing is actually around the modeling of load.
Because effectively, there's two parts to the problem. One of them is rejecting work effectively, but the second part, which is actually the most important one, is knowing when to reject work. Which means that you need to know exactly what is the load of your task, your process, and that's where it becomes really complicated.
And so knowing, I mean, this is, are these,
I mean, typically when we talk about load, right,
I would assume that the key metric or the most basic metric would be,
I don't know, the number of requests that you can handle.
But I would then assume that you have a uniform distribution of the same requests coming in all the time.
And I know in your talk at DevOne, one of the principles that you brought up is actually categorizing the requests into different buckets because not every request is the same, right?
Exactly. So that was one of the early lessons that we had, especially the service that caused a lot of the Gmail outages, is that a request per se is not enough information.
And a lot of the systems out there are based on max requests in flight or number of requests per second, rate limiting.
But for most systems, requests are like quantums.
There are big quantums, small quantums, and you don't know until you actually measure
them.
So there are multiple techniques.
You can use bucketing, as you mentioned.
If you already know the characteristics of your requests well, you can try to bucket them. This is a fairly
coarse process. But over the years, we actually lean towards different methodologies as well.
And for instance, one of them, but that is afforded to us because we have full introspection
into our stack. And my team also does the framework side.
We actually measure real time the cost of every request in flight.
And that enables us to actually compute the cost per request and actually have a trend per client. Because, for instance, one of the things that puzzled us, you know, for a long time... The system I'm referring to was actually the contacts service, managing the contact list of a user. And our top customer was Gmail. And one of the things that puzzled us was that Gmail would send traffic to the contacts service from different clusters, from different sources.
And even though it's the same customer, which was Gmail,
using the same request type,
some clusters would be cheaper than other clusters.
And it took us actually many weeks to figure that out.
And we only figured out in the end because I was a Gmail SRE in the past.
And then I recognized the cluster names. So I recognized that the clusters that were younger had the cheaper
traffic. And then we, you know, I, you know, we had the realization that because the clusters were
younger, that's where new accounts were being created. New accounts tend to have less contacts.
Therefore, the average contact list size per user on the new clusters was smaller than on an older cluster. And therefore, the same traffic was less expensive on them.
But all of this came as a non-obvious consequence of the simple process of provisioning for Gmail.
So this is one of the reasons that over time we moved away from requests as a pure metric and also manual bucketing or even mathematical modeling.
And we prefer now to use real-time live metrics
and do cost accounting per request.
Does that answer your question?
Probably too much detail.
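For listeners who want to picture what cost accounting per request could look like, here is a minimal TypeScript sketch, not Google's actual framework code; the interface, the smoothing, and the way the dimensions are collapsed into one scalar are illustrative assumptions.

```typescript
// Minimal sketch of per-request cost accounting (hypothetical, not Google's framework).
interface RequestCost {
  bytesRead: number;  // bytes pulled from backends or caches while serving the request
  wallMillis: number; // measured processing time as a rough CPU proxy
}

class CostAccountant {
  // Exponentially weighted moving average of request cost, tracked per client.
  private avgCostPerClient = new Map<string, number>();
  private readonly alpha = 0.2; // smoothing factor

  record(clientId: string, cost: RequestCost): void {
    // Collapse the dimensions into one scalar; a real system would keep them separate.
    const scalar = cost.bytesRead + cost.wallMillis * 1000;
    const prev = this.avgCostPerClient.get(clientId) ?? scalar;
    this.avgCostPerClient.set(clientId, this.alpha * scalar + (1 - this.alpha) * prev);
  }

  // Estimate the cost of the next request from this client before accepting it.
  estimate(clientId: string): number {
    return this.avgCostPerClient.get(clientId) ?? 0;
  }
}
```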
Well, I wanted to bring in... In your talk, you gave a good example, just to bring it down to maybe some of the people who might not be grasping it 100% yet. I think you had a great example in your talk about, if you did a query for, and I'll put myself in your place this time, this way I'm not making fun of you. If I did a query for all of my friends, I'd come back with maybe, you know, a query that would take two milliseconds, because there'd be five responses, right? And if I did a query of all of, I believe, Britney Spears's friends, right?
Which is a little bit of a dated one, but yes, we did one for Britney Spears.
You would get hundreds of millions of rows of a payload back,
which is where the idea of a request is not,
requests are not equal.
And it's about the cost of the request.
And one thing I wanted to abstract from that was also,
you know, the first thought in my mind was,
well, instead of, let's say this is going back to a database,
throw in a caching layer, right? Because that's always a great performance tip to put in there.
But I think even if you throw in a caching layer, you still have to pull in that payload from cache.
It might not take as long as running the query. And if you're trying to get to those Google fast
speeds, right, that's when you really have to start thinking about, okay, yeah, we do have a
caching layer, which is a huge improvement over hitting the database directly.
But there's still, you know, a request is still not a request.
There's a payload that has to get transferred.
Exactly right.
You know, a few bytes versus a few megabytes.
I can actually, you know, share that when we went and did this implementation for the first time, one of the things that we found, a very direct correlation between a request cost and the dependencies, whatever, in the background, was actually bytes read. So if you read X bytes from a backend, which could be already in memory, so it could be from a cache, or X megabytes from the same backend, it correlated linearly.
So there was, because most of the data,
most of the services in the backend are simple.
What they do is they read data, they transform it a little bit,
and send it back to the client.
So there's always some measure of data size. That's one of the most
important components in request cost.
Right. So I want to quickly rewind a little bit for people that may have not even started bucketizing their requests. So you mentioned there's different approaches of doing it. And one way I could think about is: if I have, let's say, a classical application, whether it's a monolith or already broken into smaller pieces, I could look at production data and I could start bucketizing my requests, maybe by looking at the different URLs and URL-and-parameter combinations. Is that a good approach, where you can say, well, let's figure out what are your small, medium, and large transactions by looking at, you know, a combination of URL and parameters, and with that you have at least an estimate? Or is this the wrong way? Would it be more like bucketizing the critical ones, like, you know, what are the important requests to the business or the feature versus less critical ones?
So those are two different dimensions, you know. So criticality, as you mentioned, should be used at the moment of deciding what to drop. So that's one dimension. You know, we actually have... you know, it's just an approach, but what we use is basically: critical; sheddable, which is slightly less important; async, and here things like refresh of contacts in the background.
And finally, batch, you know, batch being stuff that's generated from MapReduces or pipelines or even on the server side, but batch work that can happen somewhere else, right?
This is just one possible classification.
You can have more levels or less levels depending on what's relevant to your organization.
But then that is used for the moment of, you know, I've already come to the conclusion that I have to drop something. I have three requests. You know, which ones will I drop?
And you start by the lowest priority.
The cost accounting is a different dimension, because you can have the same request being critical, and you can have the same request type, you know, being batch, depending on the source. A typical example is: update my contact now, because the user is waiting, because they just clicked OK on changing an email address.
because it just clicked OK on changing email address.
Or you can have a map reduce, a pipeline that is running on,
that is just moving everyone from at live.com to at hotmail.com,
going back into the past, for example.
So even though the request is the same and the cost will be the same, the priorities are different.
So that's one of the reasons why we keep cost accounting separate from criticality.
Typically in our stack, the request comes with a criticality attached.
If it doesn't, there's a default criticality.
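As a rough illustration of the criticality dimension just described (critical, sheddable, async, batch), here is a hedged TypeScript sketch of dropping the lowest-criticality work first once a server has decided it is overloaded; the ordering is from the talk, but the thresholds and names are assumptions, not Google's implementation.

```typescript
// Hypothetical sketch: shed the lowest-criticality work first when overloaded.
enum Criticality { Batch = 0, Async = 1, Sheddable = 2, Critical = 3 }

interface PendingRequest { id: string; criticality: Criticality }

// Returns true if the request should be rejected, given the current load fraction (0..1).
function shouldShed(req: PendingRequest, loadFraction: number): boolean {
  // The hotter the server runs, the higher the criticality bar for acceptance.
  if (loadFraction > 0.95) return req.criticality < Criticality.Critical;
  if (loadFraction > 0.9) return req.criticality < Criticality.Sheddable;
  if (loadFraction > 0.8) return req.criticality < Criticality.Async;
  return false; // below 80% load, accept everything, including batch work
}
```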
But going back to the question about bucketing, if I may: it's a potential way. What we found is that it's very ops-heavy, because releases change. So what you're talking about we call query cost modeling.
And query cost modeling can be done in two ways,
either top-down where you say,
I have a model that I create
and then I do the topology and taxonomy
of all the requests and I assign a cost, right?
Or it can be done, which was our first approach,
which was done statistically.
You run against the system, you check the logs,
and then you compute the model.
If you have a good person that is able to do linear regression,
for instance, and stuff like that,
you can actually get those inferences.
And for instance, that's what we did for a long time in the situation that I mentioned before, where we came to the conclusion that bytes read was the most important criterion.
But unfortunately, as your software evolves, as you build features, that changes quite
quickly.
An example is read all my contacts. That has a certain cost, but over time, for UI prettification reasons, somebody decides that read all my contacts should return the same data, but sorted. So that same feature, that same request, becomes more expensive silently.
And that means that the model is, if it's manually done, it will be completely out of date.
If it's done from performance data, it will be out of date by the refresh period of analyzing the data. And meanwhile, you have a huge gap between your cost prediction and reality,
which can be, in certain cases, can be an outage.
That's pretty cool.
And thanks for your insights.
I just thought, you know, maybe bring an idea up
for people that have not done anything,
so that will be a good way to at least get started.
I remember, I think it must have been at least five or six years ago, I wrote a blog on the resource budget of a request. But I took it a little more broadly back in the days.
I did a lot of performance optimization.
So we talked about measuring the resource consumption of, let's say, loading a page from browser all the way into the database.
And basically, this is kind of what you are doing now, but on the fly, right?
Back in the days, I did it in the testing environment.
I used testing tools to drive a certain feature on a build-to-build basis
and then using whatever the browser gave me
and also the backend monitoring tools to figure out
how many bytes are sent over the wire from browser to the web server,
what happens between the web server and the app server,
and then the app server to the database.
And then we basically use that resource consumption.
We then also framed it into you have a certain budget,
and when you're making changes through your code or to your code,
you want to stay within a certain budget limit.
Correct.
So just to, you know, I wasn't putting down your idea, actually, Andy, because in some places it actually makes a lot of sense. And if you measure the
requests data, you can actually build a predictive model. You know, our case in our service was
particularly complex because we had a lot of data and a lot of customers. But if you have something where it's not dependent on the data itself, it's more driven by the URL or by the request parameters,
then you can actually have this simple model of keeping the logs measured and then from the
logs, you know, extract the correlations and create basically your cost model that if you do it daily, your model
will be off at most by one release day.
So that is something that is possible. You just need to accept that
over time it's going to have a management cost.
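As a toy version of the statistical query cost modeling described here, the sketch below fits a simple least-squares line of measured request cost against bytes read from a batch of logs; rerun daily, the model is at most one release behind, as noted above. The log fields and the single-feature model are assumptions for illustration.

```typescript
// Hypothetical sketch: derive a linear cost model (cost ~ intercept + slope * bytesRead) from logs.
interface LogEntry { bytesRead: number; costMillis: number }

function fitCostModel(logs: LogEntry[]): { intercept: number; slope: number } {
  const n = logs.length;
  const meanX = logs.reduce((s, e) => s + e.bytesRead, 0) / n;
  const meanY = logs.reduce((s, e) => s + e.costMillis, 0) / n;
  let num = 0;
  let den = 0;
  for (const e of logs) {
    num += (e.bytesRead - meanX) * (e.costMillis - meanY);
    den += (e.bytesRead - meanX) ** 2;
  }
  const slope = den === 0 ? 0 : num / den;
  return { intercept: meanY - slope * meanX, slope };
}

// Predict the cost of an incoming request from its expected payload size.
function predictCost(model: { intercept: number; slope: number }, bytesRead: number): number {
  return model.intercept + model.slope * bytesRead;
}
```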
Yeah, but I still, I mean, obviously where we should
help people to get towards is what you're doing.
It's dynamic cost measurement.
And what do you call it?
I think you have a metric for that, right?
Yeah, there's several internal names.
I think one of the areas where we also are looking are not necessarily on measuring the cost but looking at the secondary effects. And there are some very interesting approaches
that are even available outside of our, you know,
rich instrumentation environment at Google.
For instance, one typical method that is available,
and it's actually very interesting in event-oriented programming models
like Node.js and similar
is basically looking at the event loop latency.
So there's even some implementations out there.
What they do is they actually drop these probes on the event loop
and then measure how long it takes until they're called again.
So if the event loop, you know, passes a threshold, it basically flips the state of the machine, so it will reject new requests. I think that's a very interesting model because it also accounts
for factors outside of the direct processing of the process.
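The event loop technique described here is something you can try in any Node.js service. Below is a hedged TypeScript sketch that drops a probe on the event loop, measures how late it fires, and flips the process into a shedding state once lag passes a threshold; the interval, threshold, and flag name are arbitrary choices, not a reference implementation.

```typescript
// Hypothetical sketch: flip into a "shedding" state when event loop lag passes a threshold.
let shedding = false;

const INTERVAL_MS = 100;      // how often the probe is expected to fire
const LAG_THRESHOLD_MS = 200; // lag above this flips the server into shedding mode

let last = Date.now();
setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS; // how late this tick fired
  shedding = lag > LAG_THRESHOLD_MS;
  last = now;
}, INTERVAL_MS);

// In the request handler: reject cheaply (e.g. HTTP 503) while overloaded.
export function acceptRequest(): boolean {
  return !shedding;
}
```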
So that's one of the things that, for instance, we learned
was what we call antagonistic workloads
because for those that are not familiar with Google,
we run things in a containerized environment
that's now being popular,
but we've been running this for many years now,
and we're not in full control of what runs in the machine where we run a job.
So you might be running a user-facing workload, and then the cluster manager schedules, for
instance, a map reduce or pipeline.
And even though we actually have container isolation, at the bottom end it's the CPU, the memory lanes and bandwidth, and those become constrained, which means that the performance characteristics of your job change.
It can be pretty dramatic in some architectures.
And therefore, performance of a job
can actually change any time.
That's one of the lessons of our world: you cannot count on it. Even if you do a fully fixed modeling, the world is changing around the process, even down to the physical machine. So, going back to the idea of an event loop, or using thread pool latency as well, or thread count, similar almost to the Linux uptime data, where you actually see the number of runnable threads and what is the state of your thread pool, for instance.
Those are interesting secondary metrics
that almost everyone can implement.
Yeah, Andy, that's funny.
It reminds me,
some of that reminds me of way back,
Acacio, one of the things
we used to promote a lot
was the idea of,
you know, when looking at your database pools,
don't look at the pool usage,
but look at instead the amount of time
it takes to execute get connection.
Because if your get connection... if your pool's at 100% but your connections are still really fast, who cares, right? Whereas if it's starting to creep up... But I mean, that's just going back many years ago. Expanding that into this idea, where you're doing it on a much grander scale, I think it's really cool. It also just kind of goes back... I remember when I heard that idea, I said, that makes a lot of sense. And it's kind of cool seeing that, like, I guess it did make sense, because obviously you guys are doing a much more complex and mature model of something like that.
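Brian's older example translates into a few lines: time the connection-acquisition call itself rather than watching pool utilization. This sketch wraps a hypothetical pool's getConnection and records how long callers waited; the pool interface is assumed, not any specific library's API.

```typescript
// Hypothetical sketch: measure time-to-acquire a connection instead of pool usage.
interface Pool<T> { getConnection(): Promise<T> }

async function timedGetConnection<T>(pool: Pool<T>, record: (ms: number) => void): Promise<T> {
  const start = Date.now();
  const conn = await pool.getConnection();
  record(Date.now() - start); // the interesting signal: how long callers are waiting
  return conn;
}
```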
Yeah, so that's part of the – oh, my apologies.
No, I was just wondering, so if you look at these metrics and you know that obviously it's time to kind of stop accepting your requests, is this the only thing you do with these metrics?
Or does this information then also feed into your scalable architecture where you say, well, this particular container may have started to reject requests,
but then we have to spin up more instances.
So I assume you also use the same data points to scale up your infrastructure.
So we actually have separate control systems because load shedding,
let's say, horizon of information is, you know, sub 10 seconds.
You know, you need to be able to react to load, you know, very fast.
So part of the problems that we face at our scale, but I think other people have the same concerns, is that load can spike very fast.
And we're talking about, you know,
within one second,
you need to be able to effect load.
Otherwise, your task is going to crash.
What you're talking about,
which is basically scaling a job,
you know, to accept a variable workload,
those control systems exist,
but they operate at a wider timeframe, right?
I, you know, we have some systems internally that are actually able to spin up with load, and, you know, Google Cloud is actually bringing systems in that space to our customers, but they operate as well in the, let's say, one-minute range.
But, you know, if you have a spike in traffic for whatever reason, and most of the times those reasons are actually bugs, you know, on the client side, it's unintentional.
You need to be able to survive until, you know, those systems kick in.
And this is where load shedding fills the sort of availability gap.
Yeah.
Yeah.
No, that makes a lot of sense.
And I know you had another train of thought earlier when we kind of, when I interrupted you.
I'm not sure if that's still there.
If it's still there, feel free to continue.
Yes, I was just going to say that there is not a binary accept-reject, right?
So part of the way that we can use this information, you know, basically there's multiple metrics to measure the load.
And for instance, one of the interesting ones
that come to us often in the world of streaming
is actually memory pressure.
Because if you're doing video streaming,
it's almost irrelevant what is your CPU load
if you don't have memory to read the data and ship it out.
So for the streaming environments, you know, memory pressure is the key criteria.
In some areas where it's around IOPS, right, because the disk is a fixed capacity.
So regardless of the CPU or memory, you know, IOPS are important to measure and to understand and to model.
But one of the things that we've learned as well is over time, using a single metric is usually not the best approach. So we sort of have all of these cost trackers running, and then the prediction of
the load is a combination of these cost trackers. And we can plug in more cost trackers for specific
applications. And we have a, let's say, a default formula that sort of measures all of this and then
makes a verdict that is a combination of all of these input factors.
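To make the idea of combining cost trackers concrete, here is a hedged sketch in which several trackers (CPU, memory pressure, IOPS, and so on) each report a utilization fraction and a default formula takes the most constrained one as the verdict; the formula is a stand-in, not the one Acacio's team uses.

```typescript
// Hypothetical sketch: combine multiple cost trackers into one overload verdict.
interface CostTracker {
  name: string;
  utilization(): number; // 0..1, fraction of the tracked resource currently in use
}

function loadVerdict(trackers: CostTracker[], threshold = 0.9): { overloaded: boolean; worst: string } {
  if (trackers.length === 0) return { overloaded: false, worst: "none" };
  let worst = trackers[0];
  for (const t of trackers) {
    if (t.utilization() > worst.utilization()) worst = t;
  }
  // Default formula: the most constrained resource decides; applications can plug in their own.
  return { overloaded: worst.utilization() > threshold, worst: worst.name };
}
```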
You know, honestly, I mean, the more I listen to this,
the more I think I have to start writing another series of blogs
on how we can use, now just a little connection over to Dynatrace,
how we can encourage our customers to use the Dynatrace API
to automatically pull out these cost metrics
because we have the end-to-end tracing.
We do the instrumentation for our customers.
And then we also have APIs where we can actually pull out things like
what's the resource consumption, how many bytes do we read from the backend.
And we can do this per request because we have these pure paths.
That would be an excellent way if you can, you know, from a product,
if you can automatically generate, you know, request cost prediction
or request cost estimates per, you know, per target and parameter.
So, you know, you have a combinatorial explosion in all of this, so you
need to manage this to become
still useful.
But that would be a very
interesting product, because then you can actually
tie into the
mechanism that the customers
can use to
throttle traffic.
Yeah, exactly. And then, I mean, we
already do this multidimensional
baselining based on
metrics that we are all aware of, response
time, throughput, number of database queries.
But then, as you said, if we would
allow our users
to define this
formula where they know
best what to factor in and then
additionally store this information
on a request-by-request basis, then do the multidimensional baselining on that and then make this data available.
Then that's awesome.
Now, Acacio, so you mentioned in the very beginning SREs, Site Reliability Engineering.
I would assume if our listeners out there, if they think about, wow, this is all cool.
I mean it sounds amazing.
We have to do it.
But who is doing it?
I would assume that this is the new role, or this is one of the roles, of the site reliability engineer: figuring out how to get this data out, but also how to then help the architects and the engineers to actually build applications that allow this type of measurement, and then also allow dealing with heavy overload and then correctly handling that situation.
So do you agree with me?
Is this one of the roles of site reliability engineering?
Yeah, so when we started this work, we actually were SREs. My whole organization... for the first five years, the load shedding team was an SRE team.
That work, that experience and that knowledge, sometimes painful as the Gmail outage case was, actually taught us a lot. A lot of the techniques that I mentioned,
a lot of the insights that we learned over time were
actually after postmortems, right?
For instance, we learn after an outage
that a feature can change the model cost
of a single request type, right?
So we learn that CPU tracking
is not necessarily complete to have a load view, right?
We learn after another outage with a different service that memory tracking is more important
for different workloads.
So all of this comes as part of the role of SRE.
Where we've gone past that over the recent years is, because even in SRE there's a lot of scope, there's a lot of knowledge.
So my team has been doing this for eight years now.
There's a lot of knowledge that we basically packed into code
and into configuration and into best practices.
So we're trying to make as zero conf, as we call it, as possible.
So nowadays at Google, you know, SREs are trained on load shedding, how to operate our systems, how to tweak them for better optimization. But the system can be built from scratch with zero-conf and have a decent, you know, load shedding configuration that will work for the majority of the cases.
And then, of course, different workloads, different work types,
and you can actually always squeeze a little bit more, right?
Because, as I mentioned before, the goal of this is to make the verdict
of should I accept traffic or not, and we have some policies.
But in the environments where high performance is needed,
very high performance is needed, very high performance is needed,
or on the contrary, where resource usage accuracy is needed,
you actually need an SRE that understands that specific workload,
that understands the constraints of the system to go and tune our default parameters
to figure out exactly when should we start, you know, doing load chaining, when
we should actually be a little bit more flexible in the load, when we should actually run tighter
and hotter, even though at the cost of latency, all of those are decisions that, you know,
the system and the architecture implements and our frameworks allow to orchestrate.
But we need somebody that has like that really deep knowledge of the system to be able to to to to effect
I just wanted to bring up that we've been talking a lot about the different signposts and some techniques of measuring and knowing when to load shed, but I don't really think we dove too much into approaches to load shedding. And I'm referring specifically to, you know, if we look at the presentation, where Acacio is looking at traffic and breaking requests down into the ideas of batch, async, sheddable, and critical. Right. Which is, when you see these different things, like if your Node loop is slowing down, if certain other things are having an artifact, we talk about shedding, but what do we shed?
And that's the idea of, okay, first thing you could do is probably, and I'll try my hand at speaking this, drop your batch calls.
You just drop those completely.
Stop all your asyncs, and you'll go ahead and do retries on these later.
And then you have traffic that you categorize into sheddable,
which you can start chipping away at as well.
Correct.
But making way for the critical ones that absolutely have to run now,
can't be rerun later.
But that's what the end goal is with all this data that we're collecting, right,
is to be able to find a way to reduce the load on those servers by basically stripping off or shedding the unnecessary pieces currently.
And part of that is by identifying and categorizing the different type of traffic.
Andy, is that what you were talking about when you were talking about buckets before?
Yeah, exactly.
Okay.
Yeah.
All right.
So maybe you did cover it just when you were talking buckets.
I was thinking a different way.
No, but that's perfect.
I mean, that's also where, as Akatsio also corrected me earlier, it's like two concepts.
The first concept is knowing when we are getting to a stage where we are overloaded and we have to shed load.
And then the second thing is knowing what to shed away.
And I think that's the key thing here.
Yeah.
And I do encourage everybody to, we'll put the link up to the talk.
I don't want to say it's too technical.
It's very technical.
Again, I'm the dummy in the group.
So at some points I was like scratching my head a little bit.
But overall, I got a ton out of it.
So it's still very, very, very worthwhile watching.
No matter what level you're at, you're going to get a ton out of it. And it makes a lot of sense. And just think of
the graphics really speak really well to these concepts. So I think it was a really well done
presentation for people of all levels.
Oh, thank you. I'm always told that I need to simplify it.
But the problem that we have is if we talk to a high level, it's not useful.
But to get it to a point where it's useful, you need to take it to a level of complexity that I know is hard to grasp, but we try our best.
I just took it as some of these things are some things for me to research and learn if it went a little over my head. So I didn't take it in a negative way.
But I think there's a ton in there that's very accessible.
So I think you really kind of bridge the gap.
And, you know, if you have somebody
who takes the approach I do,
they'll say, well, let me go look into
what those things mean.
So anyhow, I didn't mean to sidetrack there,
but I just wanted to make sure
we kind of covered that point a little.
Yeah. So Acacio, I have two questions on the thing you said last.
First, I wanted to go into quickly SRE again when you started because I believe a lot of the people that we are talking to right now, they understand.
They're getting to understanding what SRE means and the new role. But how do you get started?
And more importantly for me is which roles did the people have within your team when
you started it that moved over to SRE?
Was it more driven by folks that had operational experience?
Was it more folks that had a load testing and a performance engineering background? Was it architects? Because I think a lot of people wonder, you know, is it
the right thing for me to move towards an SRE role?
Do I have to be the one that needs to figure all
this stuff out that you tell me or not?
So in terms of role in SRE, we don't have that sort of partitioning.
My teams were comprised of both people with what we call a SysEng background and software engineering. And the software engineers would meet the bar of any software engineer at Google. And the SysEngs actually had a little bit more systems knowledge, kernel knowledge, networking knowledge.
So that's how we differentiated.
But that's basically interview time.
When we got to the job, there was very few differences.
And we also don't really have the architect role. It's a little bit lofty.
I think it boils down to interest and aptitude, right? So for some people, the area of traffic management,
because this is where it really falls, it's very interesting.
Because we've been always talking about load shedding on a process,
but this is actually just the, let's call it the leaf of traffic management.
Traffic management is the process where you have a flow of requests that is coming,
and it needs to find a target, you know, somewhere to reply to that request.
And there's the routing aspect, you know, shall I go to cluster A or cluster B?
You know, should I go to, you know, America zone or to the European zone if
you're running on a Google Cloud or Amazon or something like that, right? So that's that part.
The second part is actually then to figure out what is the most appropriate for the request.
And then finally, the request comes to the last stop, which is the process that handles it.
And it can either be very quick saying, no, I can't handle,
or can do the proper work.
We've been focusing on the last aspect.
But at Google, we tend to treat all of these things together.
So the SRE organization is very focused on capacity planning.
And capacity planning is capacity to handle the requests. And then, therefore, the traffic management is tied also to capacity planning.
And all of this process, even from the deployment of the architecture of a system,
there's always two questions that are asked,
is where is my client traffic coming from and where are my back ends. And that actually is part of the whole story
of the role of managing traffic and then
managing load shedding and managing capacity planning and managing the
routing of traffic. So people that like that sort of thing are the most appropriate for this role, basically.
That was a nice summary.
That was a nice overview.
Thank you.
Hey, and the other thing that I wanted to cover,
that I want to touch on,
and I know we talked about this in Tallinn a couple weeks ago,
and I really, every time when I talk with you
and we share some thoughts,
it always feels that I learn,
I always learn from you and go away with some new ideas,
but I always feel like I never give enough back to you.
So hopefully at some point in time, I can also give something back to you.
But what you inspired me again two weeks ago, and you mentioned this earlier,
you said zero configuration.
In Tallinn, we talked about, or you explained, what an engineer has to do at Google to start a new microservice. I think the minimum configuration is just saying what is this microservice all about, what type of service is it, is it a front end or a back end, who is the owner, and that's pretty much it.
And based on this individual piece of information, you can actually automate the generation of all the other configuration
elements that are necessary in the end-to-end delivery pipeline. And then you looked at my monitoring-as-code configuration, and you said, well, it's a nice thing what you do, but it's too much configuration still, there's so much that can be automated. And I really like that, because you're right, there's still too much configuration. I think I read a statistic, and I think it was some CTO from some large system integrator, and they said that they are currently fighting with the big challenge of having a ratio of one to four from code to configuration. They came up with so many configuration files, nobody has any clue anymore what this is all about, and it's all getting too complex. And basically, with your approach, boil it down to the bare minimum and then just automate the generation of everything else that needs to be generated for all these other tools. I think that is a great concept and a great, you know, best practice. So like, yes.
Thank you for touching on that as well, because it's not just that we can generate configuration.
There are several principles that, that I, you know, specifically my org believes in,
which is, you know, the configuration needs to be machine editable.
So that you basically can also read, you know, even user generated configs, can read it, process it and regenerate back if need to.
Because there's one aspect in today's systems, people create a config.
They probably have a nice template with most of the best practices,
and then they clone it, and then they tweak it for their specific instance.
But over time, the best practices change.
So there is a certain bit rot.
So what was golden today is going to be slightly stale in six months and it's going to be completely obsolete in
three years and in four years it's going to be a pile of junk.
So we need to go from a world where we clone templates and keep reapplying settings to more of an approach where, when I say managed config, you actually have, let's say, let's call it a template config, and then the user of that system injects the changes that it needs for its system.
So to remove from the abstract, imagine that you have a, let's call it an Apache config.
And Apache is fairly complex as well once you start getting to a bunch of backends.
But instead of having an Apache config template that people change as they add new complexity to their site, what they could probably do is have a tool that reads the Apache config and then injects a specific, you know, site config done in a configuration language, maybe a text protocol buffer, or, you know, JSON, if that's how the tool can operate. But this means that every time that you change the Apache config, you can, again, regenerate the final config without actually having to change the config provided for that instance.
So you can, you keep it alive, you keep it fresh.
And that way you can integrate, you know, the different types of configs. For instance, you have CGI, you can actually define how a CGI is modeled and then how that gets merged into the Apache config.
But also enables more interesting changes.
This is where we are at the current stage, which is: imagine that you come to the decision that you no longer need Apache, or Apache doesn't fill your needs, and you need to go to Nginx or lighttpd or HAProxy or the new thing that Facebook just open-sourced.
You can actually then transparently bring that site definition that you created
and then now generate an NGINX config,
you know, without having to change the,
you know, everything on your pipeline.
You just have one person or a subset of the people
that really know the config environment
and then can keep it alive and fresh.
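A toy version of the template-config-plus-injected-site-definition idea: the service owner keeps a small, machine-editable site description, and a generator emits the concrete proxy config, so the generator can evolve (or be swapped for another proxy) without touching every cloned config. The site definition format here is invented purely for illustration.

```typescript
// Hypothetical sketch: generate a concrete proxy config from a machine-editable site definition.
interface SiteDefinition {
  serverName: string;
  listenPort: number;
  backends: string[]; // e.g. ["10.0.0.1:8080", "10.0.0.2:8080"]
}

function generateNginx(site: SiteDefinition): string {
  const upstreams = site.backends.map(b => `    server ${b};`).join("\n");
  return [
    `upstream ${site.serverName}_backends {`,
    upstreams,
    `}`,
    `server {`,
    `    listen ${site.listenPort};`,
    `    server_name ${site.serverName};`,
    `    location / { proxy_pass http://${site.serverName}_backends; }`,
    `}`,
  ].join("\n");
}

// Moving to a different proxy later means writing a new generator, not rewriting every site's config.
const site: SiteDefinition = { serverName: "example", listenPort: 80, backends: ["10.0.0.1:8080"] };
console.log(generateNginx(site));
```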
Yeah, well, that makes a lot of sense.
No, that was very inspiring.
I mean, and I think your talk from Tallinn is also online, at least the slides will be published in a way. I'm not 100% sure, but I think...
No, I'm not allowed to publish slides.
Oh, okay.
It can be if it was filmed.
Yeah.
But I think at least we have the short
podcast that we recorded.
And I'm sure there will be some opportunities.
So anyway, my shout out again is if anybody from the listeners wants to check out the other things you have to say besides what you have said at DevOne,
I'm sure there's other opportunities to meet you in different conferences. Or just I think you also, you know, partly contributed to books, right?
You were contributing to the SRE book that came out a couple of years ago.
Correct.
So my TL, actually, for the load shedding team, Alejo, Alejandro, he actually is one of the authors of the traffic management chapter, where he also outlines the load shedding principles there.
And then there is the broader topic about frameworks as part of the way to do zero-conf and best practice in the services from the ground up, right? The how to build systems with best practices,
which I co-wrote that with one of my TLs
on the systems platform.
Pretty cool.
Hey, Brian, do you have anything that you want to still cover,
especially knowing our audience?
Is there anything open, or just point them to more resources and we put the links up?
You know, I think we put the links up.
There's quite a lot.
They're very rich in information.
You know, I think one of the things I get out of it the most, which is a challenge for anybody, is to challenge your assumptions. You know, one thing Acacio talks about is, you know,
with the idea of retries and a lot of people hate the idea of retries and how
if you do them right, they can actually be faster, you know, and,
and just even the idea of tossing away this traffic and doing it later and
figuring that out. Most people just kind of would be like,
how can you do that? And, but it's like, no, it can be done.
And I think a lot of these, you know,
the dev one presentation goes into that quite well.
So I think that's going to cover that. Well, there's a lot, I think a lot of the topics in here, we can talk about a lot of it, but I think a lot of it's also better served with the visuals. So from my point of view, I think we covered what's mostly coverable without visuals.
You know, I'm glad you mentioned retries if I can probe in that topic.
As you said, there's a cultural resistance even from, you know, from the Google developers
because they have the good intention, right?
It's like, oh, I don't want to return an
error to my customer, to my client. So I'll do my best to handle just this one request, right?
That is actually the attitude and it's positive. But the reality is if you do this statistically,
you end up with high latency and high latency is the death of a product.
So being able to say, you know, I will embrace, you know, throttling because what I'll do is I'll really try my best to make sure that the majority of my requests are fast. And if I can't do it fast,
I'll just reject it. So if you embrace that approach and then you say on my client side,
I'll work my clients so that they also embrace this process and if they see
a throttling error, they immediately retry fast somewhere else.
Those two things combined are very powerful.
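The fail-fast-and-retry-elsewhere pattern described here might look like this on the client side: if a backend replies with a throttling error, immediately try a different replica rather than waiting on the sick one. The error type and the backend list are assumptions for the sketch.

```typescript
// Hypothetical sketch: on a throttling error, retry immediately against another replica.
class ThrottledError extends Error {}

async function callWithFastRetry<T>(
  backends: Array<() => Promise<T>> // each entry calls one replica
): Promise<T> {
  let lastErr: unknown;
  for (const call of backends) {
    try {
      return await call(); // fast path: the first healthy replica answers
    } catch (err) {
      if (!(err instanceof ThrottledError)) throw err; // only retry on throttling
      lastErr = err; // overloaded replica: move on immediately instead of waiting
    }
  }
  throw lastErr; // every replica throttled us
}
```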
Just to go back to the original system that I was mentioning before, the contacts service, we used to have quite a bit of tail latency. That is, what I call, like, the 99th percentile of latency was quite high, because part of it was very large contact lists.
But I mentioned before, for instance,
we would have antagonistic workloads.
For some reason, sometimes some of the processes
get what we call sick.
Either there's a MapReduce running on the machine, or, you know, there's a problem with memory contention, or there is, like, very heavy work being done because of garbage collection on a Java task.
You know, for whatever reason, sometimes some tasks get sick.
So what you want very quickly is to identify that.
And basically, it's almost like if you have a rock in the middle of your stream, and the stream is your requests. You want the water to flow to other areas. You very quickly get the signal that that task is overloaded and is throttling, and you divert your requests elsewhere.
Overall and statistically, your system performs much better, much faster,
even though there is a retry.
And I think on a very different and much more gross level, another concept that ties into that, but again, being very different, but also kind of counterintuitive to what people might think in terms of retries, is, and Andy, I don't remember if we covered this, but it's better to tell users, come back later, we're having some problems, than try to serve them super slow.
Because if they're trying to use the site and it's crawling slow,
they're going to hate it and they're going to have a negative opinion of you.
Whereas if you just say, hey, we're having issues,
can you try back in 10 minutes or whatever?
The majority of the people will be like, oh, great, yeah,
I can do this in 10 minutes.
Of course, you might lose a few people,
but just this idea of sometimes taking a hit is better than trying to persist.
Exactly.
I used to have someone in my team that before they used to work at Ticketmaster.
And Ticketmaster is a very interesting human load event.
Every time they have a popular artist that starts selling tickets,
millions of people try to get the tickets within minutes.
And they do exactly that.
They do load shedding on the application side at the human scale, right, which is basically: they know that they can't handle a million people clicking on tickets, and the system will crash down. So what they do, they apply load shedding very simply: the number of people that are allowed to go from "I'm seeing the Ticketmaster front page" to "I'm allowed to actually enter the sales pipeline" is limited. So they rate limit there, you know, the transition. And that is a much better experience than, you know, clicking on the buy button and waiting for 15 minutes and getting a timeout, right?
So that usability impact,
it's much more positive
for them to lose the customer
that will go away
because you can't buy the ticket
because it's at a gate
than actually lose the customer
halfway through the process.
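The Ticketmaster example is, at its core, an admission gate: only let so many users per second cross from browsing into the purchase flow. A hedged token-bucket sketch, with made-up numbers:

```typescript
// Hypothetical sketch: token-bucket admission gate between browsing and the sales pipeline.
class AdmissionGate {
  private tokens: number;

  constructor(private ratePerSecond: number, private burst: number) {
    this.tokens = burst;
    // Refill once per second, capped at the burst size.
    setInterval(() => {
      this.tokens = Math.min(this.burst, this.tokens + this.ratePerSecond);
    }, 1000);
  }

  // Returns true if this user may enter the purchase flow right now.
  tryEnter(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // show "please try again in a few minutes" instead of timing out mid-purchase
  }
}

const gate = new AdmissionGate(100, 200); // admit roughly 100 users per second, burst of 200
```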
Yeah, although, you know,
although you always know it's some bot somewhere who got in front of you in line.
But latency is the death of a product.
Yeah.
I think the credit to this, what you mentioned earlier, Brian, was Ryan Townsend from Shift Commerce. He was also one of the speakers at DevOps Tallinn. He talked about this: better a bad experience than a slow experience. I think that's the thing he said.
That was cool.
All right. Acacio, did we
miss anything?
Is there anything else you want to
make sure that people understand?
Again, knowing that they will be able to watch your full presentation from dev one and follow you on all the other work.
But anything else where you say, hey, this is something people should understand about load shedding, especially when it gets started.
Anything we missed?
Yes, I think there's only one very high level principle.
Because hearing a talk like this, and as Andy has said as well, it's a fairly technical area, and people might walk away with the impression this only applies to Google. But the reality is, at Google, at our scale, sometimes we can handle events that are outage-potential with our resources. We are fortunate that we have resources that can solve problems.
But if you're a small company and you have
your business that depends on one app and you have
six servers that are running, those are critical
to the business.
So it's even more important for a small company to be able to survive an event.
And in the current compute world, distributed world, it's a very hostile world, right?
A script kiddie can DoS almost anybody and take you out of business.
So being able to survive hostile players, being able to survive client-side bugs,
being able to survive unexpected events, being able to survive fat fingering
is even more critical at a small scale because that means that you're surviving
not just the application,
but you're making your business survive. So even if you don't go fully technical into load shedding specifically, start considering how critical the systems are to the survivability of the business, and then apply some of these measures and inch your way towards basically surviving.
Survival is the key.
Awesome.
Hey, Andy.
Yeah, do you think it's time?
I think it's time to summon the Summaryator.
Summon the Summaryator, yeah.
Let's do it.
I think what I learned today
is that the concept of site reliability engineering that has been floating out there for a while, but I know that while Google has been living this and further expanding it and refining it over the years, many of the people that we talk to are just getting started and they want to know what this all means. I think what I learned today, it is one of the aspects of site reliability engineering
is trying to figure out how to help the company and the business to build resilient systems.
I think resiliency is in the end what we all want, whether it is making sure a spike in
load cannot bring the system down, whether, as you said, a script kitty cannot bring the business down.
Thinking about these concepts and how we can shed load of the critical backend processes,
then obviously making sure that the architecture that has been used to build the systems can handle the situation of,
you know, now we know we're getting in an unhealthy state.
How can we shed load off? How can we correctly respond with errors to the next level? How can we communicate the error state back to the customer in a way that we're not making them angry, but making them want to wait?
I think these are all extremely interesting concepts, and I believe it's going to be part of many organizations
that are now just thinking about what site reliability engineering is for them.
I really much encourage everyone to watch your presentation,
to read your blogs and your books,
and simply just be open that these are the things we have to do these days to make sure that our digital businesses become successful and stay successful. Because becoming famous, or a blog post, a news article, may shatter everything into pieces if you didn't do your homework well. And I like your last
recommendation in the end where you said you may not
do everything at once, but start thinking about how you can
at least build something that is resilient against
a script kiddie or something like that.
I like that.
Thank you.
You choked me up there, Andy.
And I just want to add a couple of my thoughts to that as well.
First of all, I just want to thank you for being on the show. I think it's wonderful that companies like Google that have resources to actually put time, people, and money into solving these problems and looking at these problems turn around and then share it with the rest of us who don't necessarily have the ability to set up these projects around that. Obviously, it's critical
to performance, but a lot of times, especially in smaller companies, it's just plow ahead.
So the fact that all these companies are sharing this information is wonderful. So thank you for
being part of that. And Andy, a lot of times, as you know, some of the topics are a little bit more over my head. Ocasio, my background, very briefly, I was a communication major who ended up somehow in load testing
because I was making a crummy salary and I had a friend working in computers making twice the amount of money that I did.
So having an inquisitive mind and an analytical mind, I eventually got into load testing.
So some of the deeper topics, you know, go over my head a little bit. I'm always trying to level up. But I think
what I enjoyed about this one a lot is it's so performance filled, that the more I look at the
more I listen, the more I see, the more I can find in here a lot of fascinating ideas to dig deeper and deeper into.
I really love the idea that not all requests are equal.
That's kind of a thing that used to come up a lot, just even in load testing.
You wouldn't run certain requests a lot because they're so heavy.
But then there was always the question of, well, what if users do it?
And you'd have a developer saying, well, that's not how it's meant to be used.
Well, what if they do it, right?
If people keep searching for Britney Spears because she suddenly went to rehab again or something, you know, things like that will happen.
But I also really loved the idea of other indicators that systems are going down, the Node loop you talked about specifically, and ideas where, you know, there's something running in a process or in a service that might indicate that there's a larger problem going on overall that you can't see quite as well.
And maybe certain other kind of traditional metrics wouldn't quite be an indicator.
But the idea of like, you know, suddenly that loop that goes through is taking longer.
Well, that's an indication of an overall larger issue. Just a really fresh way of looking at things
and opens a whole, you know, kind of can of worms,
but in a good way of new ways to look at performance.
So again, thanks for sharing it.
I think there's a lot to dig into this
and a lot to get out with repeat viewings and listens.
Thank you for inviting me.
It was a pleasure.
And I'll just leave with
a parting story.
I was part of the team
that built
Google+, our social
platform, and
one of the
users of the platform was
a K-pop
group from Korea called
AKB48, sorry, from Japan, actually, which stands for Akihabara in Tokyo.
And they actually do a member vote out, vote in every year.
So our social team was actually not prepared.
You know, we had not developed the product to have a single user with so many comments that exceeded, you know, a few per second.
So just to go back to your point, everything will be used and abused in the most unexpected
ways by our users.
And that's how I learned about AKB48,
which actually has 64 members, not 48.
Oh my gosh. It's kind of like Menudo.
The vote-in, vote-out.
It's an excellent story.
Thanks a lot for coming on
and thanks for everyone for listening.
If you have any questions or comments, you can tweet us at pure underscore DT
or send an email if you're into that at pureperformance at dynatrace.com.
Acacio, do you have a Twitter handle?
AcacioCruz.
Okay, AcacioCruz.
Yep, and that'll be in the description, so definitely make sure you follow him.
Thanks for listening, and Andy, great to have you back on after your move,
and hope you're enjoying life back in Austria.
Have you seen any koalas?
Any what? Koalas? Yeah, exactly.
Let's get out of Austria on the other side of the globe.
All right, thank you, everybody.
Thank you.
Thank you.
Bye.