PurePerformance - 066 Load Shedding & SRE at Google with Acacio Cruz
Episode Date: July 16, 2018
Have you heard about Load Shedding? If not, then dive into this discussion with Acacio Cruz, Engineering Director at Google ( https://twitter.com/acaciocruz ). He walks us through what Google learnt from one of the early outages at Gmail and how he and his team are now applying concepts such as load shedding to avoid disruption of their services despite spikes of load or unpredictable requests. We also discuss SRE (Site Reliability Engineering), how it started and transformed at Google, and how we should think about automation, configuration of automation, and automation of automation. For more details – including visuals – we encourage you to watch Acacio's breakout session from devone.at on YouTube (Load Shedding at Google).
https://devone.at/speakers/#acaciocruz
https://www.youtube.com/watch?v=XNEIkivvaV4
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello, everybody, and welcome to another episode of Pure Performance.
My name is Brian Wilson and to kind of go back to what some of the older kids out there might know,
I'm going to pull a Casey Kasem and pull a long distance dedication here because my colleague Andy Grabner is now long distance.
I believe this is the first podcast we're recording that he is now back in Linz, Austria, back to his home.
So, Andy, hello.
And it's with a tear of distance that I welcome you back to the show.
Hello.
Thank you so much.
Yeah, and I think we coined the term, I'm back in Mozart time, right?
Mozart time zone.
Correct. Correct.
Exactly.
It's a little challenging sometimes to figure out in which time zone we are putting, well, we are organizing our recordings.
And now it's Mozart time and Denver time.
Well, and I was trying to look up something good for Denver time.
I looked up famous people from Denver.
The first one I hit was Tim Allen, and I'm not really a fan of his, so I'm not going to go with him. And then I saw Lon Chaney, who played, I believe, the werewolf in the original movie. But I don't think many people are going to know if I say Lon Chaney time. They'd be like, what? That makes no sense. So we'll just keep it as Denver time.
Yeah, but talking about time, our guest today actually comes, currently, you know, he's in a country that is known for being always on time. At least they have all the tools and they create the most precise watches in the world: Switzerland.
And Acacio, I hope you're still with us. And again, instead of me introducing who you are, maybe you want to just share a couple of words about who you are, who you're working for right now. And yeah, just give us a little intro.
Hi, good morning or good afternoon, depending on the time zone you're in.
My name is Acacio Cruz. I'm in Zurich and I work for Google, especially in the
frameworks organization. We do software frameworks for the rest of the organization,
production platforms.
I started my career a long time ago in 2007 at Google.
I started in an organization called SRE,
Site Reliability Engineering,
and Gmail was my first project.
So specifically on the delivery side of things, spam, abuse, and delivery.
So our scope grew over time.
And in 2009, I don't know if you want me to give an intro already about the talk, but in 2009 there were these very famous outages that Gmail suffered. And out of that set of incidents, which were very public and very impactful to the product and to the organization, my team started a little project around load shedding. And that's part of the reason why I'm here today.
Yeah, so you mentioned a couple of things. First of all, Google. I guess most people have heard about this little company out of Silicon Valley.
You're one of the few companies where I believe the company name is implicitly used as a verb as well.
I mean, you Google for something.
I think you have a major impact on everyone around the world.
And in case somebody could have somehow escaped knowing who Google is, then please Google for Google.
And there's also the Google effect, the impact you've had on people's expectations on performance
around the world, right?
Exactly.
Yeah, yeah.
Back in the days when we did a lot of work with people like Steve Souders around web performance optimization, Google was always the gold standard.
And everybody was looking up to Google about how fast the page has to be.
So very honored, Acacio, that you find the time and share your stories.
Now, Acacio, I think the reason why we got to talk: we invited you, or the Dynatrace Linz team invited you, a couple weeks back to speak at DevOne in Linz, and the talk that you gave was around load shedding. Now, I also had the pleasure to meet you twice now in person. First, in March, I met you in the Google offices in Zurich. I happened to be in Zurich for a conference, and then you invited me over and we got to chat, and I think you explained some of the concepts to me. And then two weeks ago we met again at DevOps Tallinn. You did a little different talk there around microservices, but I love the load shedding.
And I think the first question I have for you, for people that may have never heard the term load shedding: is there a quick introduction that you can give, and kind of the problem that you wanted to solve with it? We understand that you had the big problem in, you know, 2009, as you said. But load shedding, where does this name come from, and what's the principle of it?
Yeah, so succinctly, load shedding comes basically from the metaphor of water, like when you shed water, or you shed something heavy off your back. And basically it's avoiding a workload becoming overloaded by avoiding work. At the core of the principle is: don't do work that might damage your task, your process, your mission, your goal.
So if my goal is trying to hang out, I can say I'm load shedding, boss.
Sorry, ignore my stupid interjections. Keep going.
You're load shedding work. That's exactly the nature of that. The core principle is while still being productive. And that's what some people are not able to do.
Right.
And so I think the core principle then is actually knowing what is the maximum amount of load I can handle before I get overloaded.
Isn't that the biggest challenge then?
Yeah, correct.
So, you know, even though the principle, like the goal is fairly simple to articulate, it's
actually fairly complex because, you know, knowing what is the actual load in a task
is not a simple process.
And it's actually not unidimensional at all, right?
It's not just about CPU.
It's about memory.
It's about IO.
It's about latency.
So over the years, the bulk of the work that my team is doing is actually around the modeling of load.
Because effectively, there's two parts to the problem. One of them is rejecting work effectively, but the second part, which is actually the most important one, is knowing when to reject work. Which means that you need to know exactly what is the load of your task, your process, and that's where it becomes really complicated.
And so knowing, I mean, this is, are these,
I mean, typically when we talk about load, right,
I would assume that the key metric or the most basic metric would be,
I don't know, the number of requests that you can handle.
But I would then assume that you have a uniform distribution of the same requests coming in all the time.
And I know in your talk at DevOne, one of the principles that you brought up is actually categorizing the requests into different buckets because not every request is the same, right?
Exactly. So that was one of the early lessons that we had, especially the service that caused a lot of the Gmail outages, is that a request per se is not enough information.
And a lot of the systems out there are based on max requests in flight or number of requests per second, rate limiting.
But for most systems, requests are like quantums.
There are big quantums, small quantums, and you don't know until you actually measure
them.
So there are multiple techniques.
You can use bucketing, as you mentioned.
If you already know the characteristics of your requests well, you can try to bucket them. This is a fairly
coarse process. But over the years, we actually lean towards different methodologies as well.
And for instance, one of them, but that is afforded to us because we have full introspection
into our stack. And my team also does the framework side.
We actually measure real time the cost of every request in flight.
And that enables us to actually compute the cost per request and actually have a trend per client. Because, for instance, one of the things that puzzled us, you know, for a long time... The system I'm referring to was actually the contacts service, managing the contact list of a user. And our top customer was Gmail. And one of the things that puzzled us was that Gmail would send traffic to the contacts service from different clusters, from different sources.
And even though it's the same customer, which was Gmail,
using the same request type,
some clusters would be cheaper than other clusters.
And it took us actually many weeks to figure that out.
And we only figured out in the end because I was a Gmail SRE in the past.
And then I recognized the cluster names. So I recognized that the clusters that were younger had the cheaper
traffic. And then we, you know, I, you know, we had the realization that because the clusters were
younger, that's where new accounts were being created. New accounts tend to have less contacts.
Therefore, the average contact list size per user on the new clusters was smaller than on an older cluster. And therefore, the same traffic was less expensive on them.
But all of this came as a non-obvious consequence of the simple process of provisioning for Gmail.
So this is one of the reasons that over time we moved away from requests as a pure metric and also manual bucketing or even mathematical modeling.
And we prefer now to use real-time live metrics
and do cost accounting per request.
Does that answer your question?
Probably too much detail.
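For listeners who want to picture what cost accounting per request could look like, here is a minimal TypeScript sketch, not Google's actual framework code; the interface, the smoothing, and the way the dimensions are collapsed into one scalar are illustrative assumptions.

```typescript
// Minimal sketch of per-request cost accounting (hypothetical, not Google's framework).
interface RequestCost {
  bytesRead: number;  // bytes pulled from backends or caches while serving the request
  wallMillis: number; // measured processing time as a rough CPU proxy
}

class CostAccountant {
  // Exponentially weighted moving average of request cost, tracked per client.
  private avgCostPerClient = new Map<string, number>();
  private readonly alpha = 0.2; // smoothing factor

  record(clientId: string, cost: RequestCost): void {
    // Collapse the dimensions into one scalar; a real system would keep them separate.
    const scalar = cost.bytesRead + cost.wallMillis * 1000;
    const prev = this.avgCostPerClient.get(clientId) ?? scalar;
    this.avgCostPerClient.set(clientId, this.alpha * scalar + (1 - this.alpha) * prev);
  }

  // Estimate the cost of the next request from this client before accepting it.
  estimate(clientId: string): number {
    return this.avgCostPerClient.get(clientId) ?? 0;
  }
}
```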
Well, I wanted to bring in... In your talk, you gave a good example, just to bring it down to maybe some of the people who might not be grasping it 100% yet. I think you had a great example in your talk about, if you did a query for, and I'll put myself in your place this time, this way I'm not making fun of you. If I did a query for all of my friends, I'd come back with maybe, you know, a query that would take two milliseconds, because there'd be five responses, right? And if I did a query of all of, I believe, Britney Spears's friends, right?
Which is a little bit of a dated one, but yes, we did one for Britney Spears.
You would get hundreds of millions of rows of a payload back,
which is where the idea of a request is not,
requests are not equal.
And it's about the cost of the request.
And one thing I wanted to abstract from that was also,
you know, the first thought in my mind was,
well, instead of, let's say this is going back to a database,
throw in a caching layer, right? Because that's always a great performance tip to put in there.
But I think even if you throw in a caching layer, you still have to pull in that payload from cache.
It might not take as long as running the query. And if you're trying to get to those Google fast
speeds, right, that's when you really have to start thinking about, okay, yeah, we do have a
caching layer, which is a huge improvement over hitting the database directly.
But there's still, you know, a request is still not a request.
There's a payload that has to get transferred.
Exactly right.
You know, a few bytes versus a few megabytes.
I can actually, you know, share that when we went and did this implementation for the first time, one of the things that we found, a very direct correlation between a request cost and the dependencies, whatever, in the background, was actually bytes read. So if you read X bytes from a backend, which could be already in memory, so it could be from a cache, or X megabytes from the same backend, it correlated linearly.
So there was, because most of the data,
most of the services in the backend are simple.
What they do is they read data, they transform it a little bit,
and send it back to the client.
So there's always some measure of data size. That's one of the most
important components in request cost.
Right. So I want to quickly rewind a little bit for people that may have not even started bucketizing their requests. So you mentioned there's different approaches of doing it. And one way I could think about is: if I have, let's say, a classical application, whether it's a monolith or already broken into smaller pieces, I could look at production data and I could start bucketizing my requests, maybe by looking at the different URLs and URL-and-parameter combinations. Is that a good approach, where you can say, well, let's figure out what are your small, medium, and large transactions by looking at, you know, a combination of URL and parameters, and with that you have at least an estimate? Or is this the wrong way? Would it be more like bucketizing the critical ones, like, you know, what are the important requests to the business or the feature versus less critical ones?
So those are two different dimensions, you know. So criticality, as you mentioned, should be used at the moment of deciding what to drop. So that's one dimension. You know, we actually have... you know, it's just an approach, but what we use is basically: critical; sheddable, which is slightly less important; async, and here things like refresh of contacts in the background.
And finally, batch, you know, batch being stuff that's generated from MapReduces or pipelines or even on the server side, but batch work that can happen somewhere else, right?
This is just one possible classification.
You can have more levels or less levels depending on what's relevant to your organization.
But then that is used for the moment of, you know, I've already come to the conclusion that I have to drop something. I have three requests. You know, which ones will I drop?
And you start by the lowest priority.
The cost accounting is a different dimension, because you can have the same request being critical, and you can have the same request type, you know, being batch, depending on the source. A typical example is: update my contact now, because the user is waiting, because they just clicked OK on changing an email address.
because it just clicked OK on changing email address.
Or you can have a map reduce, a pipeline that is running on,
that is just moving everyone from at live.com to at hotmail.com,
going back into the past, for example.
So even though the request is the same and the cost will be the same, the priorities are different.
So that's one of the reasons why we keep cost accounting separate from criticality.
Typically in our stack, the request comes with a criticality attached.
If it doesn't, there's a default criticality.
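As a rough illustration of the criticality dimension just described (critical, sheddable, async, batch), here is a hedged TypeScript sketch of dropping the lowest-criticality work first once a server has decided it is overloaded; the ordering is from the talk, but the thresholds and names are assumptions, not Google's implementation.

```typescript
// Hypothetical sketch: shed the lowest-criticality work first when overloaded.
enum Criticality { Batch = 0, Async = 1, Sheddable = 2, Critical = 3 }

interface PendingRequest { id: string; criticality: Criticality }

// Returns true if the request should be rejected, given the current load fraction (0..1).
function shouldShed(req: PendingRequest, loadFraction: number): boolean {
  // The hotter the server runs, the higher the criticality bar for acceptance.
  if (loadFraction > 0.95) return req.criticality < Criticality.Critical;
  if (loadFraction > 0.9) return req.criticality < Criticality.Sheddable;
  if (loadFraction > 0.8) return req.criticality < Criticality.Async;
  return false; // below 80% load, accept everything, including batch work
}
```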
But going back to the question about bucketing, if I may: it's a potential way. What we found is that it's very ops-heavy, because releases change. So what you're talking about we call query cost modeling.
And query cost modeling can be done in two ways,
either top-down where you say,
I have a model that I create
and then I do the topology and taxonomy
of all the requests and I assign a cost, right?
Or it can be done, which was our first approach,
which was done statistically.
You run against the system, you check the logs,
and then you compute the model.
If you have a good person that is able to do linear regression,
for instance, and stuff like that,
you can actually get those inferences.
And for instance, that's what we did for a long time in the situation that I mentioned before, where we came to the conclusion that bytes read was the most important criterion.
But unfortunately, as your software evolves, as you build features, that changes quite
quickly.
An example is read all my contacts. That has a certain cost, but over time, for UI prettification reasons, somebody decides that read all my contacts should return the same data, but sorted. So that same feature, that same request, becomes more expensive silently.
And that means that the model is, if it's manually done, it will be completely out of date.
If it's done from performance data, it will be out of date by the refresh period of analyzing the data. And meanwhile, you have a huge gap between your cost prediction and reality,
which can be, in certain cases, can be an outage.
That's pretty cool.
And thanks for your insights.
I just thought, you know, maybe bring an idea up
for people that have not done anything,
so that will be a good way to at least get started.
I remember, I think it must have been at least five or six years ago, I wrote a blog on the resource budget of a request. But I took it a little more broadly back in the days.
I did a lot of performance optimization.
So we talked about measuring the resource consumption of, let's say, loading a page from browser all the way into the database.
And basically, this is kind of what you are doing now, but on the fly, right?
Back in the days, I did it in the testing environment.
I used testing tools to drive a certain feature on a build-to-build basis
and then using whatever the browser gave me
and also the backend monitoring tools to figure out
how many bytes are sent over the wire from browser to the web server,
what happens between the web server and the app server,
and then the app server to the database.
And then we basically use that resource consumption.
We then also framed it into you have a certain budget,
and when you're making changes through your code or to your code,
you want to stay within a certain budget limit.
Correct.
So just to, you know, I wasn't putting down your idea, actually, Andy, because in some places it actually makes a lot of sense. And if you measure the
requests data, you can actually build a predictive model. You know, our case in our service was
particularly complex because we had a lot of data and a lot of customers. But if you have something where it's not dependent on the data itself, it's more driven by the URL or by the request parameters,
then you can actually have this simple model of keeping the logs measured and then from the
logs, you know, extract the correlations and create basically your cost model that if you do it daily, your model
will be off at most by one release day.
So that is something that is possible. You just need to accept that
over time it's going to have a management cost.
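As a toy version of the statistical query cost modeling described here, the sketch below fits a simple least-squares line of measured request cost against bytes read from a batch of logs; rerun daily, the model is at most one release behind, as noted above. The log fields and the single-feature model are assumptions for illustration.

```typescript
// Hypothetical sketch: derive a linear cost model (cost ~ intercept + slope * bytesRead) from logs.
interface LogEntry { bytesRead: number; costMillis: number }

function fitCostModel(logs: LogEntry[]): { intercept: number; slope: number } {
  const n = logs.length;
  const meanX = logs.reduce((s, e) => s + e.bytesRead, 0) / n;
  const meanY = logs.reduce((s, e) => s + e.costMillis, 0) / n;
  let num = 0;
  let den = 0;
  for (const e of logs) {
    num += (e.bytesRead - meanX) * (e.costMillis - meanY);
    den += (e.bytesRead - meanX) ** 2;
  }
  const slope = den === 0 ? 0 : num / den;
  return { intercept: meanY - slope * meanX, slope };
}

// Predict the cost of an incoming request from its expected payload size.
function predictCost(model: { intercept: number; slope: number }, bytesRead: number): number {
  return model.intercept + model.slope * bytesRead;
}
```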
Yeah, but I still, I mean, obviously where we should
help people to get towards is what you're doing.
It's dynamic cost measurement.
And what do you call it?
I think you have a metric for that, right?
Yeah, there's several internal names.
I think one of the areas where we also are looking are not necessarily on measuring the cost but looking at the secondary effects. And there are some very interesting approaches
that are even available outside of our, you know,
rich instrumentation environment at Google.
For instance, one typical method that is available,
and it's actually very interesting in event-oriented programming models
like Node.js and similar
is basically looking at the event loop latency.
So there's even some implementations out there.
What they do is they actually drop these probes on the event loop
and then measure how long it takes until they're called again.
So if the event loop, you know, passes a threshold, it basically flips the state of the machine, so it will reject new requests. I think that's a very interesting model because it also accounts
for factors outside of the direct processing of the process.
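The event loop technique described here is something you can try in any Node.js service. Below is a hedged TypeScript sketch that drops a probe on the event loop, measures how late it fires, and flips the process into a shedding state once lag passes a threshold; the interval, threshold, and flag name are arbitrary choices, not a reference implementation.

```typescript
// Hypothetical sketch: flip into a "shedding" state when event loop lag passes a threshold.
let shedding = false;

const INTERVAL_MS = 100;      // how often the probe is expected to fire
const LAG_THRESHOLD_MS = 200; // lag above this flips the server into shedding mode

let last = Date.now();
setInterval(() => {
  const now = Date.now();
  const lag = now - last - INTERVAL_MS; // how late this tick fired
  shedding = lag > LAG_THRESHOLD_MS;
  last = now;
}, INTERVAL_MS);

// In the request handler: reject cheaply (e.g. HTTP 503) while overloaded.
export function acceptRequest(): boolean {
  return !shedding;
}
```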
So that's one of the things that, for instance, we learned
was what we call antagonistic workloads
because for those that are not familiar with Google,
we run things in a containerized environment
that's now being popular,
but we've been running this for many years now,
and we're not in full control of what runs in the machine where we run a job.
So you might be running a user-facing workload, and then the cluster manager schedules, for
instance, a map reduce or pipeline.
And even though we actually have container isolation, at the bottom end it's the CPU, the memory lanes and bandwidth, and those become constrained, which means that the performance characteristics of your job change.
It can be pretty dramatic in some architectures.
And therefore, performance of a job
can actually change any time.
That's one of the lessons of our world: you cannot count on it. Even if you do a fully fixed modeling, the world is changing around the process, even down to the physical machine. So, going back to the idea of an event loop, or using thread pool latency as well, or thread count, similar almost to the Linux uptime data, where you actually see the number of runnable threads and what is the state of your thread pool, for instance.
Those are interesting secondary metrics
that almost everyone can implement.
Yeah, Andy, that's funny.
It reminds me,
some of that reminds me of way back,
Acacio, one of the things
we used to promote a lot
was the idea of,
you know, when looking at your database pools,
don't look at the pool usage,
but look at instead the amount of time
it takes to execute get connection.
Because if your get connection... if your pool's at 100% but your connections are still really fast, who cares, right? Whereas if it's starting to creep up... But I mean, that's just going back many years ago. Expanding that into this idea, where you're doing it on a much grander scale, I think it's really cool. It also just kind of goes back... I remember when I heard that idea, I said, that makes a lot of sense. And it's kind of cool seeing that, like, I guess it did make sense, because obviously you guys are doing a much more complex and mature model of something like that.
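Brian's older example translates into a few lines: time the connection-acquisition call itself rather than watching pool utilization. This sketch wraps a hypothetical pool's getConnection and records how long callers waited; the pool interface is assumed, not any specific library's API.

```typescript
// Hypothetical sketch: measure time-to-acquire a connection instead of pool usage.
interface Pool<T> { getConnection(): Promise<T> }

async function timedGetConnection<T>(pool: Pool<T>, record: (ms: number) => void): Promise<T> {
  const start = Date.now();
  const conn = await pool.getConnection();
  record(Date.now() - start); // the interesting signal: how long callers are waiting
  return conn;
}
```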
Yeah, so that's part of the – oh, my apologies.
No, I was just wondering, so if you look at these metrics and you know that obviously it's time to kind of stop accepting your requests, is this the only thing you do with these metrics?
Or does this information then also feed into your scalable architecture where you say, well, this particular container may have started to reject requests,
but then we have to spin up more instances.
So I assume you also use the same data points to scale up your infrastructure.
So we actually have separate control systems because load shedding,
let's say, horizon of information is, you know, sub 10 seconds.
You know, you need to be able to react to load, you know, very fast.
So part of the problems that we face at our scale, but I think other people have the same concerns, is that load can spike very fast.
And we're talking about, you know,
within one second,
you need to be able to effect load.
Otherwise, your task is going to crash.
What you're talking about,
which is basically scaling a job,
you know, to accept a variable workload,
those control systems exist,
but they operate at a wider timeframe, right?
I, you know, we have some systems internally that are actually able to spin up with load, and, you know, Google Cloud is actually bringing systems in that space to our customers, but they operate as well in the, let's say, one-minute range.
But, you know, if you have a spike in traffic for whatever reason, and most of the times those reasons are actually bugs, you know, on the client side, it's unintentional.
You need to be able to survive until, you know, those systems kick in.
And this is where load shedding fills the sort of availability gap.
Yeah.
Yeah.
No, that makes a lot of sense.
And I know you had another train of thought earlier when we kind of, when I interrupted you.
I'm not sure if that's still there.
If it's still there, feel free to continue.
Yes, I was just going to say that there is not a binary accept-reject, right?
So part of the way that we can use this information, you know, basically there's multiple metrics to measure the load.
And for instance, one of the interesting ones
that come to us often in the world of streaming
is actually memory pressure.
Because if you're doing video streaming,
it's almost irrelevant what is your CPU load
if you don't have memory to read the data and ship it out.
So for the streaming environments, you know, memory pressure is the key criteria.
In some areas where it's around IOPS, right, because the disk is a fixed capacity.
So regardless of the CPU or memory, you know, IOPS are important to measure and to understand and to model.
But one of the things that we've learned as well is over time, using a single metric is usually not the best approach. So we sort of have all of these cost trackers running, and then the prediction of
the load is a combination of these cost trackers. And we can plug in more cost trackers for specific
applications. And we have a, let's say, a default formula that sort of measures all of this and then
makes a verdict that is a combination of all of these input factors.
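To make the idea of combining cost trackers concrete, here is a hedged sketch in which several trackers (CPU, memory pressure, IOPS, and so on) each report a utilization fraction and a default formula takes the most constrained one as the verdict; the formula is a stand-in, not the one Acacio's team uses.

```typescript
// Hypothetical sketch: combine multiple cost trackers into one overload verdict.
interface CostTracker {
  name: string;
  utilization(): number; // 0..1, fraction of the tracked resource currently in use
}

function loadVerdict(trackers: CostTracker[], threshold = 0.9): { overloaded: boolean; worst: string } {
  if (trackers.length === 0) return { overloaded: false, worst: "none" };
  let worst = trackers[0];
  for (const t of trackers) {
    if (t.utilization() > worst.utilization()) worst = t;
  }
  // Default formula: the most constrained resource decides; applications can plug in their own.
  return { overloaded: worst.utilization() > threshold, worst: worst.name };
}
```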
You know, honestly, I mean, the more I listen to this,
the more I think I have to start writing another series of blogs
on how we can use, now just a little connection over to Dynatrace,
how we can encourage our customers to use the Dynatrace API
to automatically pull out these cost metrics
because we have the end-to-end tracing.
We do the instrumentation for our customers.
And then we also have APIs where we can actually pull out things like
what's the resource consumption, how many bytes do we read from the backend.
And we can do this per request because we have these pure paths.
That would be an excellent way if you can, you know, from a product,
if you can automatically generate, you know, request cost prediction
or request cost estimates per, you know, per target and parameter.
So, you know, you have a combinatorial explosion in all of this, so you
need to manage this to become
still useful.
But that would be a very
interesting product, because then you can actually
tie into the
mechanism that the customers
can use to
throttle traffic.
Yeah, exactly. And then, I mean, we
already do this multidimensional
baselining based on
metrics that we are all aware of, response
time, throughput, number of database queries.
But then, as you said, if we would
allow our users
to define this
formula where they know
best what to factor in and then
additionally store this information
on a request-by-request basis, then do the multidimensional baselining on that and then make this data available.
Then that's awesome.
Now, Acacio, so you mentioned in the very beginning SREs, Site Reliability Engineering.
I would assume if our listeners out there, if they think about, wow, this is all cool.
I mean it sounds amazing.
We have to do it.
But who is doing it?
I would assume that this is the new role, or this is one of the roles, of the site reliability engineer: figuring out how to get this data out, but also how to then help the architects and the engineers to actually build applications that allow this type of measurement, and then also allow dealing with heavy overload and then correctly handling that situation.
So do you agree with me?
Is this one of the roles of site reliability engineering?
Yeah, so when we started this work, we actually were SREs. My whole organization... for the first five years, the load shedding team was an SRE team.
That work, that experience and that knowledge, sometimes painful as the Gmail outage case was, actually taught us a lot. A lot of the techniques that I mentioned,
a lot of the insights that we learned over time were
actually after postmortems, right?
For instance, we learn after an outage
that a feature can change the model cost
of a single request type, right?
So we learn that CPU tracking
is not necessarily complete to have a load view, right?
We learn after another outage with a different service that memory tracking is more important
for different workloads.
So all of this comes as part of the role of SRE.
Where we've gone past that over the recent years is, because even in SRE there's a lot of scope, there's a lot of knowledge.
So my team has been doing this for eight years now.
There's a lot of knowledge that we basically packed into code
and into configuration and into best practices.
So we're trying to make as zero conf, as we call it, as possible.
So nowadays at Google, you know, SREs are trained on load shedding, how to operate our systems, how to tweak them for better optimization. But the system can be built from scratch with zero-conf and have a decent, you know, load shedding configuration that will work for the majority of the cases.
And then, of course, different workloads, different work types,
and you can actually always squeeze a little bit more, right?
Because, as I mentioned before, the goal of this is to make the verdict
of should I accept traffic or not, and we have some policies.
But in the environments where high performance is needed,
very high performance is needed, very high performance is needed,
or on the contrary, where resource usage accuracy is needed,
you actually need an SRE that understands that specific workload,
that understands the constraints of the system to go and tune our default parameters
to figure out exactly when should we start, you know, doing load chaining, when
we should actually be a little bit more flexible in the load, when we should actually run tighter
and hotter, even though at the cost of latency, all of those are decisions that, you know,
the system and the architecture implements and our frameworks allow to orchestrate.
But we need somebody that has like that really deep knowledge of the system to be able to to to to effect
I just wanted to bring up that we've been talking a lot about the different signposts and some techniques of measuring and knowing when to load shed, but I don't really think we dove too much into approaches to load shedding. And I'm referring specifically to, you know, if we look at the presentation, where Acacio is looking at traffic and breaking requests down into the ideas of batch, async, sheddable, and critical. Right. Which is, when you see these different things, like if your Node loop is slowing down, if certain other things are having an artifact, we talk about shedding, but what do we shed?
And that's the idea of, okay, first thing you could do is probably, and I'll try my hand at speaking this, drop your batch calls.
You just drop those completely.
Stop all your asyncs, and you'll go ahead and do retries on these later.
And then you have traffic that you categorize into sheddable,
which you can start chipping away at as well.
Correct.
But making way for the critical ones that absolutely have to run now,
can't be rerun later.
But that's what the end goal is with all this data that we're collecting, right,
is to be able to find a way to reduce the load on those servers by basically stripping off or shedding the unnecessary pieces currently.
And part of that is by identifying and categorizing the different type of traffic.
Andy, is that what you were talking about when you were talking about buckets before?
Yeah, exactly.
Okay.
Yeah.
All right.
So maybe you did cover it just when you were talking buckets.
I was thinking a different way.
No, but that's perfect.
I mean, that's also where, as Akatsio also corrected me earlier, it's like two concepts.
The first concept is knowing when we are getting to a stage where we are overloaded and we have to shed load.
And then the second thing is knowing what to shed away.
And I think that's the key thing here.
Yeah.
And I do encourage everybody to, we'll put the link up to the talk.
I don't want to say it's too technical.
It's very technical.
Again, I'm the dummy in the group.
So at some points I was like scratching my head a little bit.
But overall, I got a ton out of it.
So it's still very, very, very worthwhile watching.
No matter what level you're at, you're going to get a ton out of it. And it makes a lot of sense. And just think of
the graphics really speak really well to these concepts. So I think it was a really well done
presentation for people of all levels.
Oh, thank you. I'm always told that I need to simplify it.
But the problem that we have is if we talk to a high level, it's not useful.
But to get it to a point where it's useful, you need to take it to a level of complexity that I know is hard to grasp, but we try our best.
I just took it as some of these things are some things for me to research and learn if it went a little over my head. So I didn't take it in a negative way.
But I think there's a ton in there that's very accessible.
So I think you really kind of bridge the gap.
And, you know, if you have somebody
who takes the approach I do,
they'll say, well, let me go look into
what those things mean.
So anyhow, I didn't mean to sidetrack there,
but I just wanted to make sure
we kind of covered that point a little.
Yeah. So Acacio, I have two questions on the thing you said last.
First, I wanted to go into quickly SRE again when you started because I believe a lot of the people that we are talking to right now, they understand.
They're getting to understanding what SRE means and the new role. But how do you get started?
And more importantly for me is which roles did the people have within your team when
you started it that moved over to SRE?
Was it more driven by folks that had operational experience?
Was it more folks that had a load testing and a performance engineering background? Was it architects? Because I think a lot of people wonder, you know, is it
the right thing for me to move towards an SRE role?
Do I have to be the one that needs to figure all
this stuff out that you tell me or not?
So in terms of role in SRE, we don't have that sort of partitioning.
My teams were comprised of both people with what we call a SysEng background and software engineering. And the software engineers would meet the bar of any software engineer at Google. And the SysEngs actually had a little bit more systems knowledge, kernel knowledge, networking knowledge.
So that's how we differentiated.
But that's basically interview time.
When we got to the job, there was very few differences.
And we also don't really have the architect role. It's a little bit lofty.
I think it boils down to interest and aptitude, right? So for some people, the area of traffic management,
because this is where it really falls, it's very interesting.
Because we've been always talking about load shedding on a process,
but this is actually just the, let's call it the leaf of traffic management.
Traffic management is the process where you have a flow of requests that is coming,
and it needs to find a target, you know, somewhere to reply to that request.
And there's the routing aspect, you know, shall I go to cluster A or cluster B?
You know, should I go to, you know, America zone or to the European zone if
you're running on a Google Cloud or Amazon or something like that, right? So that's that part.
The second part is actually then to figure out what is the most appropriate for the request.
And then finally, the request comes to the last stop, which is the process that handles it.
And it can either be very quick saying, no, I can't handle,
or can do the proper work.
We've been focusing on the last aspect.
But at Google, we tend to treat all of these things together.
So the SRE organization is very focused on capacity planning.
And capacity planning is capacity to handle the requests. And then, therefore, the traffic management is tied also to capacity planning.
And all of this process, even from the deployment of the architecture of a system,
there's always two questions that are asked,
is where is my client traffic coming from and where are my back ends. And that actually is part of the whole story
of the role of managing traffic and then
managing load shedding and managing capacity planning and managing the
routing of traffic. So people that like that sort of thing are the most appropriate for this role, basically.
That was a nice summary.
That was a nice overview.
Thank you.
Hey, and the other thing that I wanted to cover,
that I want to touch on,
and I know we talked about this in Tallinn a couple weeks ago,
and I really, every time when I talk with you
and we share some thoughts,
it always feels that I learn,
I always learn from you and go away with some new ideas,
but I always feel like I never give enough back to you.
So hopefully at some point in time, I can also give something back to you.
But what you inspired me again two weeks ago, and you mentioned this earlier,
you said zero configuration.
In Tallinn, we talked about, or you explained, what an engineer has to do at Google to start a new microservice. I think the minimum configuration is just saying what is this microservice all about, what type of service is it, is it a front end or a back end, who is the owner, and that's pretty much it.
And based on this individual piece of information, you can actually automate the generation of all the other configuration
elements that are necessary in the end-to-end delivery pipeline. And then you looked at my monitoring-as-code configuration, and you said, well, it's a nice thing what you do, but it's too much configuration still, there's so much that can be automated. And I really like that, because you're right, there's still too much configuration. I think I read a statistic, and I think it was some CTO from some large system integrator, and they said that they are currently fighting with the big challenge of having a ratio of one to four from code to configuration. They came up with so many configuration files, nobody has any clue anymore what this is all about, and it's all getting too complex. And basically, with your approach, boil it down to the bare minimum and then just automate the generation of everything else that needs to be generated for all these other tools. I think that is a great concept and a great, you know, best practice. So like, yes.
Thank you for touching on that as well, because it's not just that we can generate configuration.
There are several principles that, that I, you know, specifically my org believes in,
which is, you know, the configuration needs to be machine editable.
So that you basically can also read, you know, even user generated configs, can read it, process it and regenerate back if need to.
Because there's one aspect in today's systems, people create a config.
They probably have a nice template with most of the best practices,
and then they clone it, and then they tweak it for their specific instance.
But over time, the best practices change.
So there is a certain bit rot.
So what was golden today is going to be slightly stale in six months and it's going to be completely obsolete in
three years and in four years it's going to be a pile of junk.
So we need to go from a world where we clone templates and keep reapplying settings to more of an approach where, when I say managed config, you actually have, let's say, let's call it a template config, and then the user of that system injects the changes that it needs for its system.
So to remove from the abstract, imagine that you have a, let's call it an Apache config.
And Apache is fairly complex as well once you start getting to a bunch of backends.
But instead of having an Apache config template that people change as they add new complexity to their site, what they could probably do is have a tool that reads the Apache config and then injects a specific, you know, site config done in a configuration language, maybe a text protocol buffer, or, you know, JSON, if that's how the tool can operate. But this means that every time that you change the Apache config, you can, again, regenerate the final config without actually having to change the config provided for that instance.
So you can, you keep it alive, you keep it fresh.
And that way you can integrate, you know, the different types of configs. For instance, you have CGI, you can actually define how a CGI is modeled and then how that gets merged into the Apache config.
But also enables more interesting changes.
This is where we are at the current stage, which is: imagine that you come to the decision that you no longer need Apache, or Apache doesn't fill your needs, and you need to go to Nginx or lighttpd or HAProxy or the new thing that Facebook just open-sourced.
You can actually then transparently bring that site definition that you created
and then now generate an NGINX config,
you know, without having to change the,
you know, everything on your pipeline.
You just have one person or a subset of the people
that really know the config environment
and then can keep it alive and fresh.
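A toy version of the template-config-plus-injected-site-definition idea: the service owner keeps a small, machine-editable site description, and a generator emits the concrete proxy config, so the generator can evolve (or be swapped for another proxy) without touching every cloned config. The site definition format here is invented purely for illustration.

```typescript
// Hypothetical sketch: generate a concrete proxy config from a machine-editable site definition.
interface SiteDefinition {
  serverName: string;
  listenPort: number;
  backends: string[]; // e.g. ["10.0.0.1:8080", "10.0.0.2:8080"]
}

function generateNginx(site: SiteDefinition): string {
  const upstreams = site.backends.map(b => `    server ${b};`).join("\n");
  return [
    `upstream ${site.serverName}_backends {`,
    upstreams,
    `}`,
    `server {`,
    `    listen ${site.listenPort};`,
    `    server_name ${site.serverName};`,
    `    location / { proxy_pass http://${site.serverName}_backends; }`,
    `}`,
  ].join("\n");
}

// Moving to a different proxy later means writing a new generator, not rewriting every site's config.
const site: SiteDefinition = { serverName: "example", listenPort: 80, backends: ["10.0.0.1:8080"] };
console.log(generateNginx(site));
```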
Yeah, well, that makes a lot of sense.
No, that was very inspiring.
I mean, and I think your talk from Tallinn is also online, at least the slides will be published in a way. I'm not 100% sure, but I think...
No, I'm not allowed to publish slides.
Oh, okay.
It can be if it was filmed.
Yeah.
But I think at least we have the short
podcast that we recorded.
And I'm sure there will be some opportunities.
So anyway, my shout out again is if anybody from the listeners wants to check out the other things you have to say besides what you have said at DevOne,
I'm sure there's other opportunities to meet you in different conferences. Or just I think you also, you know, partly contributed to books, right?
You were contributing to the SRE book that came out a couple of years ago.
Correct.
So my TL, actually, for the load shedding team, Alejo, Alejandro, he actually is one of the authors of the traffic management chapter, where he also outlines the load shedding principles there.
And then there is the broader topic about frameworks as part of the way to do zero-conf and best practice in the services from the ground up, right? The how to build systems with best practices,
which I co-wrote that with one of my TLs
on the systems platform.
Pretty cool.
Hey, Brian, do you have anything that you want to still cover,
especially knowing our audience?
Is there anything open, or just point them to more resources and we put the links up?
You know, I think we put the links up.
There's quite a lot.
They're very rich in information.
You know, I think one of the things I get out of it the most, which is a challenge for anybody, is to challenge your assumptions. You know, one thing Acacio talks about is, you know,
with the idea of retries and a lot of people hate the idea of retries and how
if you do them right, they can actually be faster, you know, and,
and just even the idea of tossing away this traffic and doing it later and
figuring that out. Most people just kind of would be like,
how can you do that? And, but it's like, no, it can be done.
And I think a lot of these, you know,
the dev one presentation goes into that quite well.
So I think that's going to cover that. Well, there's a lot, I think a lot of the topics in here, we can talk about a lot of it, but I think a lot of it's also better served with the visuals. So from my point of view, I think we covered what's mostly coverable without visuals.
You know, I'm glad you mentioned retries if I can probe in that topic.
As you said, there's a cultural resistance even from, you know, from the Google developers
because they have the good intention, right?
It's like, oh, I don't want to return an
error to my customer, to my client. So I'll do my best to handle just this one request, right?
That is actually the attitude and it's positive. But the reality is if you do this statistically,
you end up with high latency and high latency is the death of a product.
So being able to say, you know, I will embrace, you know, throttling because what I'll do is I'll really try my best to make sure that the majority of my requests are fast. And if I can't do it fast,
I'll just reject it. So if you embrace that approach and then you say on my client side,
I'll work my clients so that they also embrace this process and if they see
a throttling error, they immediately retry fast somewhere else.
Those two things combined are very powerful.
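The fail-fast-and-retry-elsewhere pattern described here might look like this on the client side: if a backend replies with a throttling error, immediately try a different replica rather than waiting on the sick one. The error type and the backend list are assumptions for the sketch.

```typescript
// Hypothetical sketch: on a throttling error, retry immediately against another replica.
class ThrottledError extends Error {}

async function callWithFastRetry<T>(
  backends: Array<() => Promise<T>> // each entry calls one replica
): Promise<T> {
  let lastErr: unknown;
  for (const call of backends) {
    try {
      return await call(); // fast path: the first healthy replica answers
    } catch (err) {
      if (!(err instanceof ThrottledError)) throw err; // only retry on throttling
      lastErr = err; // overloaded replica: move on immediately instead of waiting
    }
  }
  throw lastErr; // every replica throttled us
}
```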
Just to go back to the original system that I was mentioning before, the contacts service, we used to have quite a bit of tail latency. That is, what I call, like, the 99th percentile of latency was quite high, because part of it was very large contact lists.
But I mentioned before, for instance,
we would have antagonistic workloads.
For some reason, sometimes some of the processes
get what we call sick.
Either there's a MapReduce running on the machine, or, you know, there's a problem with memory contention, or there is, like, very heavy work being done because of garbage collection on a Java task.
You know, for whatever reason, sometimes some tasks get sick.
So what you want very quickly is to identify that.
And basically, it's almost like if you have a rock in the middle of your stream, and the stream is your requests. You want the water to flow to other areas. You very quickly get the signal that that task is overloaded and is throttling, and you divert your requests elsewhere.
Overall and statistically, your system performs much better, much faster,
even though there is a retry.
And I think on a very different and much more gross level, another concept that ties into that, but again, being very different, but also kind of counterintuitive to what people might think in terms of retries, is, and Andy, I don't remember if we covered this, but it's better to tell users, come back later, we're having some problems, than try to serve them super slow.
Because if they're trying to use the site and it's crawling slow,
they're going to hate it and they're going to have a negative opinion of you.
Whereas if you just say, hey, we're having issues,
can you try back in 10 minutes or whatever?
The majority of the people will be like, oh, great, yeah,
I can do this in 10 minutes.
Of course, you might lose a few people,
but just this idea of sometimes taking a hit is better than trying to persist.
Exactly.
I used to have someone in my team that before they used to work at Ticketmaster.
And Ticketmaster is a very interesting human load event.
Every time they have a popular artist that starts selling tickets,
millions of people try to get the tickets within minutes.
And they do exactly that.
They do load shedding on the application side at the human scale, right, which is basically: they know that they can't handle a million people clicking on tickets, and the system will crash down. So what they do, they apply load shedding very simply: the number of people that are allowed to go from "I'm seeing the Ticketmaster front page" to "I'm allowed to actually enter the sales pipeline" is limited. So they rate limit there, you know, the transition. And that is a much better experience than, you know, clicking on the buy button and waiting for 15 minutes and getting a timeout, right?
So that usability impact,
it's much more positive
for them to lose the customer
that will go away
because you can't buy the ticket
because it's at a gate
than actually lose the customer
halfway through the process.
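The Ticketmaster example is, at its core, an admission gate: only let so many users per second cross from browsing into the purchase flow. A hedged token-bucket sketch, with made-up numbers:

```typescript
// Hypothetical sketch: token-bucket admission gate between browsing and the sales pipeline.
class AdmissionGate {
  private tokens: number;

  constructor(private ratePerSecond: number, private burst: number) {
    this.tokens = burst;
    // Refill once per second, capped at the burst size.
    setInterval(() => {
      this.tokens = Math.min(this.burst, this.tokens + this.ratePerSecond);
    }, 1000);
  }

  // Returns true if this user may enter the purchase flow right now.
  tryEnter(): boolean {
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // show "please try again in a few minutes" instead of timing out mid-purchase
  }
}

const gate = new AdmissionGate(100, 200); // admit roughly 100 users per second, burst of 200
```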
Yeah, although, you know,
although you always know it's some bot somewhere who got in front of you in line.
But latency is the death of a product.
Yeah.
I think the credit to this, what you mentioned earlier, Brian, was Ryan Townsend from Shift Commerce. He was also one of the speakers at DevOps Tallinn. He talked about this: better a bad experience than a slow experience. I think that's the thing he said.
That was cool.
All right. Acacio, did we
miss anything?
Is there anything else you want to
make sure that people understand?
Again, knowing that they will be able to watch your full presentation from dev one and follow you on all the other work.
But anything else where you say, hey, this is something people should understand about load shedding, especially when it gets started.
Anything we missed?
Yes, I think there's only one very high level principle.
Because hearing a talk like this, and as Andy has said as well, it's a fairly technical area, and people might walk away with the impression this only applies to Google. But the reality is, at Google, at our scale, sometimes we can handle events that are outage-potential with our resources. We are fortunate that we have resources that can solve problems.
But if you're a small company and you have
your business that depends on one app and you have
six servers that are running, those are critical
to the business.
So it's even more important for a small company to be able to survive an event.
And in the current compute world, distributed world, it's a very hostile world, right?
A script kiddie can DoS almost anybody and take you out of business.
So being able to survive hostile players, being able to survive client-side bugs,
being able to survive unexpected events, being able to survive fat fingering
is even more critical at a small scale because that means that you're surviving
not just the application,
but you're making your business survive. So even if you don't go fully technical into load shedding specifically, start considering how critical the systems are to the survivability of the business, and then apply some of these measures and inch your way towards basically surviving.
Survival is the key.
Awesome.
Hey, Andy.
Yeah, do you think it's time?
I think it's time to summon the Summaryator.
Summon the Summaryator, yeah.
Let's do it.
I think what I learned today
is that the concept of site reliability engineering that has been floating out there for a while, but I know that while Google has been living this and further expanding it and refining it over the years, many of the people that we talk to are just getting started and they want to know what this all means. I think what I learned today, it is one of the aspects of site reliability engineering
is trying to figure out how to help the company and the business to build resilient systems.
I think resiliency is in the end what we all want, whether it is making sure a spike in
load cannot bring the system down, whether, as you said, a script kitty cannot bring the business down.
Thinking about these concepts and how we can shed load of the critical backend processes,
then obviously making sure that the architecture that has been used to build the systems can handle the situation of,
you know, now we know we're getting in an unhealthy state.
How can we shed load off? How can we correctly respond with errors to the next level? How can we communicate the error state back to the customer in a way that we're not making them angry, but making them want to wait?
I think these are all extremely interesting concepts, and I believe it's going to be part of many organizations
that are now just thinking about what site reliability engineering is for them.
I really much encourage everyone to watch your presentation,
to read your blogs and your books,
and simply just be open that these are the things we have to do these days to make sure that our digital businesses become successful and stay successful. Because becoming famous, or a blog post, a news article, may shatter everything into pieces if you didn't do your homework well. And I like your last
recommendation in the end where you said you may not
do everything at once, but start thinking about how you can
at least build something that is resilient against
a script kiddie or something like that.
I like that.
Thank you.
You choked me up there, Andy.
And I just want to add a couple of my thoughts to that as well.
First of all, I just want to thank you for being on the show. I think it's wonderful that companies like Google that have resources to actually put time, people, and money into solving these problems and looking at these problems turn around and then share it with the rest of us who don't necessarily have the ability to set up these projects around that. Obviously, it's critical
to performance, but a lot of times, especially in smaller companies, it's just plow ahead.
So the fact that all these companies are sharing this information is wonderful. So thank you for
being part of that. And Andy, a lot of times, as you know, some of the topics are a little bit more over my head. Ocasio, my background, very briefly, I was a communication major who ended up somehow in load testing
because I was making a crummy salary and I had a friend working in computers making twice the amount of money that I did.
So having an inquisitive mind and an analytical mind, I eventually got into load testing.
So some of the deeper topics, you know, go over my head a little bit. I'm always trying to level up. But I think
what I enjoyed about this one a lot is it's so performance filled, that the more I look at the
more I listen, the more I see, the more I can find in here a lot of fascinating ideas to dig deeper and deeper into.
I really love the idea that not all requests are equal.
That's kind of a thing that used to come up a lot, just even in load testing.
You wouldn't run certain requests a lot because they're so heavy.
But then there was always the question of, well, what if users do it?
And you'd have a developer saying, well, that's not how it's meant to be used.
Well, what if they do it, right?
If people keep searching for Britney Spears because she suddenly went to rehab again or something, you know, things like that will happen.
But I also really loved the idea of other indicators that systems are going down, the Node loop you talked about specifically, and ideas where, you know, there's something running in a process or in a service that might indicate that there's a larger problem going on overall that you can't see quite as well.
And maybe certain other kind of traditional metrics wouldn't quite be an indicator.
But the idea of like, you know, suddenly that loop that goes through is taking longer.
Well, that's an indication of an overall larger issue. Just a really fresh way of looking at things
and opens a whole, you know, kind of can of worms,
but in a good way of new ways to look at performance.
So again, thanks for sharing it.
I think there's a lot to dig into this
and a lot to get out with repeat viewings and listens.
Thank you for inviting me.
It was a pleasure.
And I'll just leave with
a parting story.
I was part of the team
that built
Google+, our social
platform, and
one of the
users of the platform was
a K-pop
group from Korea called
AKB48, sorry, from Japan, actually, which stands for Akihabara in Tokyo.
And they actually do a member vote out, vote in every year.
So our social team was actually not prepared.
You know, we had not developed the product to have a single user with so many comments that exceeded, you know, a few per second.
So just to go back to your point, everything will be used and abused in the most unexpected
ways by our users.
And that's how I learned about AKB48,
which actually has 64 members, not 48.
Oh my gosh. It's kind of like Menudo.
The vote-in, vote-out.
It's an excellent story.
Thanks a lot for coming on
and thanks for everyone for listening.
If you have any questions or comments, you can tweet us at pure underscore DT
or send an email if you're into that at pureperformance at dynatrace.com.
Acacio, do you have a Twitter handle?
AcacioCruz.
Okay, AcacioCruz.
Yep, and that'll be in the description, so definitely make sure you follow him.
Thanks for listening, and Andy, great to have you back on after your move,
and hope you're enjoying life back in Austria.
Have you seen any koalas?
Any what? Koalas? Yeah, exactly.
Let's get out of Austria on the other side of the globe.
All right, thank you, everybody.
Thank you.
Thank you.
Bye.