PurePerformance - The 3 Levels of SRE and bridging the gap to DevOps with Michael Wildpaner
Episode Date: August 15, 2022

SRE vs DevOps, SRE or DevOps, or is it SRE & DevOps? No better person to ask than somebody who has been an SRE for much longer than our industry has been talking about Site Reliability Engineering. Michael Wildpaner, Sr. Engineering Director Cloud Security at Google, started as an SRE for Google Maps back in 2006. Fast forward to 2022: Michael has a lot of hands-on experience with the SRE role, the different levels of SRE that an organization can apply, and how it connects with DevOps. Tune in and hear his personal stories from more than 15 years at Google. While not everyone is Google, there is for sure a lot we can take out of this conversation.

Here are some of my personal takeaways:
Core idea of SRE: take engineers that understand distributed systems and "annoy" / guide developers to build more resilient systems from the start
Design for automation: this already starts with naming your infrastructure (aka don't use Lord of the Rings names)
SREs help so that you DO NOT DESIGN yourself into a corner
Observability is the foundation of good SRE, as it enables incident management and insights all the way up to user insights
Tip: Ensure new hires understand that you have a blameless culture

As follow-up material, check out these links:
LinkedIn: https://www.linkedin.com/in/michael-wildpaner
Talk at DevOps Fusion 2022: https://devops-fusion.com/en/speaker/michael-wildpaner/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Andy Grabner and as always I have with me my co-host Brian Wilson.
Hello Brian, how are you today?
I'm awesome Andy.
I think this is the first time I actually mixed the names, and you tried to catch me making a mistake or getting me to start laughing.
But I'm good. And it is a hot summer day here, believe it or not. Well, it is July, and I guess it's supposed to be hot.
We're having another heat wave, and the next one is already rolling in.
Heat waves mean we need something to cool us down, but today we actually have a hot topic.
Nothing to cool us down, but another hot topic.
Nice, Andy.
Well done.
Thank you so much.
You're going to get an award one day.
Yeah, but not for being a good comedian.
I don't think so.
But I want to not wait any longer.
I want to give our guest today the chance to introduce himself.
Michael Wildpaner, or Michael. We're both Austrians,
even though Michael found his way towards the West,
which means not too far west from Austria: Switzerland and Zurich.
But before I say anything else stupid that doesn't make any sense,
Michael, welcome to the show.
And please do me the favor and introduce yourself to our audience.
Hey, Andi.
Thank you so much for the nice words.
And hi, Brian.
I'm very honored to be on this podcast with you today.
To start with a fun story: when I had my basic training at Google in 2006 in the Bay Area,
another famous Austrian was governor of California.
So I kept on being asked, hey, you really sound like the governor.
Another funny story tied to that, when I first started working at Dynatrace, I was in the
New York area and a lot of the Austrians would come over and we'd go to conferences and all.
And there was this sketch on Saturday Night Live called Hans and Franz, Pumping Iron with Hans and Franz.
It was sometime in the late 80s.
And the whole idea was these two guys were Austrian bodybuilders who worshipped Arnold Schwarzenegger.
And I forget who it was. It might have been Alois.
He was over and I was like, oh, you sound like this Hans and Franz thing.
Let me, let me show you this sketch, you know,
this comedy bit that we all laugh at here.
And he watches it and he was like, I don't get it.
What's so funny about it?
And I was like, oh, we're laughing at you.
I get it now. That's terrible.
It was funny because of the whole bit there with Schwarzenegger. In fact, because of Andy we used to call him the Summarator, because he always did a summary at the end, and I would actually intersperse Arnold quotes at appropriate times. So yeah, it's probably a joke you all get really tired of really quickly, I imagine. But it's still fun. He's done some fantastic movies and provides plenty of entertainment.
Anyway, way off topic.
Exactly, way off topic. Michael, I know you just briefly mentioned that you work for Google, but I think there's a little bit more to the story, especially for our target audience. In the last couple of years,
we've talked a lot about performance engineering,
but we drifted over a little bit
over the last couple of years towards DevOps,
towards site reliability engineering.
Obviously a topic that we all know
has been heavily made popular by the works from Google
and what you as an organization wrote about
how you're doing things,
but you personally have been very much involved, I think,
in that whole movement towards site reliability engineering.
Can you fill us in a little bit on the background?
Because I think that's going to be very interesting.
Give people some context of who you really are.
Yes, yes, with pleasure.
So let me start with one sentence before Google.
I spent a bunch of time working on bioinformatics,
so I have a little bit of a background in high-performance computing.
This is also what got me into the SRE organization.
Very importantly, when I joined Google,
I already knew that we shouldn't use Lord of the Rings names to name our servers.
But if you want to build a big distributed system,
you should have a numbering scheme
and not like build individual things.
Like today we would probably say cattle, not pets.
Like that's one of the terms that's used in the DevOps and SRE community for that principle.
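To make the "numbering scheme, not pet names" principle concrete, here is a minimal Python sketch. The site/role names and the four-digit scheme are hypothetical, not Google's actual conventions; the point is only that structured names make fleets easy to generate, filter, and iterate over.

```python
# Hypothetical sketch: a structured naming scheme makes it trivial to
# generate, filter, and iterate over fleets of hosts programmatically.

def host_names(site: str, role: str, count: int) -> list[str]:
    """Generate predictable host names like 'eu1-web-0007'."""
    return [f"{site}-{role}-{i:04d}" for i in range(count)]

# With "cattle" names, automation is a loop and a pattern match ...
fleet = host_names("eu1", "web", 1000)
canaries = [h for h in fleet if int(h.rsplit("-", 1)[1]) < 10]

# ... with "pet" names, every tool needs a hand-maintained list.
pets = ["gandalf", "frodo", "balrog"]

print(len(fleet), canaries[:3], pets)
```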
And I also did a lot of network security before joining Google. And this obviously
came from the university scene and people hacking each other in a very friendly way.
And that became quite an interesting area to actively work in. Today, most really good attacks
have like a multi-step setup
where you're mixing some social engineering
with like a lot of technical capabilities.
At that time, it was all much easier.
Like people had firewalls
that were not set up in a great way.
They had like Wi-Fi access points.
So one of my favorite pre-Google memories is sitting in my very old car,
having a big antenna on the roof and trying to find like open
or not very well secured Wi-Fi access points in big industrial areas.
So that was the good old times.
At Google, I got dropped in at the deep end.
So I started as an engineer in the site reliability team for Google Maps.
And Google Maps at that point was mostly a stateless service.
It is still relatively stateless when compared with other Google services,
but it was a really interesting learning experience in how can you ship a large amount
of highly structured data as quickly as possible to a huge number of devices. So latency was
absolutely the first thing we always looked at. We had a very deep culture of performance testing and regression measurement.
If a build created latency regressions for significant use cases,
we as SREs would block that build from going out.
So there was a very big focus on every 10 milliseconds count or every millisecond
counts in that particular application. These days, I'm no longer working on maps,
but the focus has moved a little bit to mobile. And in mobile, you usually have an application
and the application has a chance to do caching and prefetching and a lot of very smart logic. There is still a lot of focus on latency.
But let me just say that the story between mobile and desktop is quite different.
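The latency-gating idea Michael describes could look roughly like the sketch below. This is a hypothetical illustration of blocking a build on a p99 regression, not Google's actual release tooling; the 5% tolerance and the sample data are made up.

```python
# Hypothetical sketch of a latency-regression gate: compare a candidate
# build's latency percentiles against the current baseline and block the
# release if a significant use case regresses beyond a tolerance.

from statistics import quantiles

def p99(samples_ms: list[float]) -> float:
    """99th-percentile latency of a list of samples, in milliseconds."""
    return quantiles(samples_ms, n=100)[98]

def release_allowed(baseline_ms: list[float],
                    candidate_ms: list[float],
                    tolerance: float = 0.05) -> bool:
    """Allow the release only if candidate p99 is within 5% of baseline."""
    return p99(candidate_ms) <= p99(baseline_ms) * (1 + tolerance)

baseline = [42.0, 45.0, 43.5, 44.0, 41.0] * 200   # made-up measurements
candidate = [x * 1.12 for x in baseline]           # a 12% regression

print(release_allowed(baseline, candidate))        # False -> block the build
```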
I got a quick question on a couple of things, actually, that you just mentioned.
The first thing you said earlier is don't use the Lord of the Rings names.
And this reminded me actually, I think, of the presentation you did in DevOps Fusion.
This is where we actually met each other.
Thanks again, because you invited me to the Google office in Zurich the day before the conference.
And we spent quite some nice time together and had a chat
about what you're doing. And what you said today, I mean, it's the same as what you said on stage,
but it really struck me a lot because you said, in order to think about automation,
the first thing is you need to get rid of these names. You need to come up with a naming scheme
that is built for automation and not to make you feel good. And I think I see this a lot with people that I am working with these days
because automation is a big topic, right?
How can we scale, in our case, observability?
And you can only scale observability and then automation
if you have the basics right.
And the basics start with something as simple as the naming,
because it's easier to iterate through a thousand hosts if
you have numbers that go up and they're not called Gandalf and whatever else, right? I mean, that's the thing. So this was really fascinating.
The second thing that I wanted to say, and I think we should probably dedicate a whole episode to this, is that both Brian and I, in our backgrounds,
have done a lot of performance testing in our early days.
And just hearing from you the challenge of pushing a lot of structured data out to many,
many devices, and then you having to say the new build is good enough or not is something
actually that we have been doing over many years by doing performance testing.
Maybe not at Google Maps scale, but at other scales. Now one question I have on this: were you already called SREs back then, when you started at Google and worked on Google Maps?
Yes. When I joined Google, I think the term SRE had been around for a few years, and the SRE organization
had existed for roughly three years.
Please don't quote me on that, but I think the formal SRE organization was likely created in 2003.
Ben Treynor started off SRE at Google.
He coined the term and he popularized the idea.
And if you allow me, I will just repeat the core idea here once.
I know it has been described many times.
But the core idea of SRE is to put software engineers that understand distributed systems
into a role that is not operations, like SREs are not operations,
but that has just enough operations component to really annoy the engineers.
And once you annoy them, they will go and fix whatever toily process is generating that operational load.
And that's the core idea behind SRE.
It is not meant as a group that will keep on doing whatever their operations or toil work is forever, but the idea
is really: be just annoyed enough at the sharp edges of the system to remove them.
So really encourage them. I mean, I guess there are different words we can find in the English
language, annoying is one of them, but also motivating them, educating them or mentoring them in
building distributed systems that are by default more resilient, for whoever needs to take
care of the operational aspects later.
And I guess in the best sense, people can just, the system can take care of itself.
The system is resilient by default.
Yeah, that's actually, I think this is part of the core mission of SRE.
If you think about the different SRE engagement models, like how can an SRE team or a site
reliability engineer work with the development team?
And there are various stages.
And the most severe stage is like this service is absolutely mission critical.
You really need like dedicated SREs on the system.
This is mostly important if like downtime would have a huge, huge negative effect on
the organization.
Like there are some use cases where you really want, with a reaction time of three to
five minutes, somebody that actually really deeply knows
the distributed system,
that knows how to do reverse engineering on the fly.
This is a big, big part of the SRE job.
Let me try to reverse engineer this.
The systems today are all too big to have in your head
or on a very large whiteboard.
So in that case, you have a dedicated SRE team
that's permanently working with your developers
and is very actively working
on improving the distributed system itself.
But there are also other SRE engagement models
that are less costly
and that start to look more DevOps-like.
And I think this is also a nice segue into the conversation around this bridging DevOps and SRE talk.
There's a lighter engagement model where if you're building
a really important service, but you do not want to put
a huge SRE team on it, you just mix a few experienced SREs into their on-call rotation.
And they're going to lead by example, like they're going to do a good job in cleaning up after any
incidents. They will create a blameless postmortem culture, like all the things that are necessary in DevOps and in SRE.
And it will help improve the system.
But just mixing them into an existing on-call rotation
can already change the focus on reliability.
To share a personal story,
when Google was launching one of the backends to G+, to the social service
that Google launched, I was running the security SRE team at Google and we ran one of the access
control services that started out with G+, and that's now being used in many products.
And it was personally one of my best experiences as a site reliability engineer, because we did
not have a team to cover the full service. We joined the developer on-call rotation,
and it was, I would say, three to four months of pure mutual learning. I learned a lot from the developer team on how to build that kind of high
performance access control system.
And we were able to share with the developer team
a lot of the learnings we had from earlier services.
So this mixed model, in my opinion, has a lot of value.
And the third model is more of a teaching model
where you join a team, but just for one or two weeks.
This is very DevOps-like,
like institute a little bit of best practices
and SRE and DevOps culture and then move on
and the original development team will carry on
effectively DevOps-ing their service.
Really fascinating. I think I remember
some of these stories from when we met in Zurich.
But let me ask you a question.
For your personal story with the Google Plus access control system:
Do you have a ratio that you had between the developers and the SREs that did the on-call rotation?
Did you have a one-to-one?
Or what did it look like?
as they probably have situations like this and any guidance would be interesting.
Yeah, I will be honest.
I would not call this guidance, but I can share a little bit of data. The best exchange we had was when the team was roughly,
I would say, 50-50. And it was very practical, because our development team was in
the US and the SREs were in Europe, so we would naturally share the 24-hour rotation into 12-hour shifts, which meant that every day the responsibility for the service had to bounce between Dev and SRE.
And every day we had to speak the same language and every day we had to clean up on all of the issues that we saw.
I found that very practical. I'm not saying it's the best setup for any situation, but it was definitely
a good split.
And then with the overlap, I would assume Europe and then maybe the West Coast, right? I
guess you at least have a couple of hours of real overlap as well. And that's great. And again, I don't want to call it a recommendation,
so I'm sorry that I used that word, but these are just some insights.
You also mentioned the first level is more like dedicated SRE teams.
And in the very beginning, you said typically you decide
what level of SRE engagement you want based on the criticality of the apps.
How is it right now at Google?
Do you still have certain apps where, based on criticality,
you then really decide which model fits?
And how do you decide the criticality?
Because I assume, if I look at Google from the outside,
I would guess search is obviously most critical,
because that's where you make a lot of money.
Then also, I guess, you know, when it's Gmail
and all the business apps that you have,
do they all have their own dedicated SRE teams?
Yes, these apps all have their own dedicated SRE teams
at various levels of detail.
In some cases, when you look at cloud, at GCP,
like we actually have quite a big number of SRE teams because every component of cloud has its own idiosyncratic issues.
And the skillset is actually in the day-to-day operation quite different, whether you're running the load balancing system or you're running a distributed database.
Yes, they're all built on the same base infrastructure and the reverse engineering that I mentioned earlier looks very similar in all of these services.
But there is definitely a lot of knowledge that is really ingrained and that is very
different. If I'm doing front-end traffic management,
I need to make sure that we're not dropping any queries, I need to understand what
stabilized TCP anycast is, and I need to know what route injection is.
There are a lot of skills on the front-end serving side that are very specific to what people are
doing there.
And if you're running, let's say, Spanner,
you need to understand how the actual distributed transaction system is working.
And there's a lot of performance interactions between different layers of the stack,
obviously, when you're running a database system.
And that takes quite a bit of learning to get proficient at.
Your presentation at DevOps Fusion was about bridging the gap between DevOps and SRE. I think the DevOps Fusion team recorded the session,
and it will be published at some point. You started off with, what's the right word, the two amigos,
the gunslingers, kind of trying to shoot each other.
It was DevOps and SRE, but obviously the story that you tried to tell is bridging the gap between these two.
And I also have a hard time sometimes explaining kind of really how they overlap
and how they kind of benefit from each other.
I think you did a really great job in it,
but I just want to give you again my explanation that I give,
and then I would like to then throw it back to you on how you see it.
When I talk with
people and they're familiar with DevOps and then they ask what this SRE is all about, I
typically say, and I know people cannot see my hands right now, but I always point from two directions.
I say DevOps, for me, at least what I see, are the teams that are using automation to speed up
delivery. So they are the ones that are providing an automated way
to get code changes from development all the way into production.
On the other side, I see SREs, like you said a little bit,
they have more ops experience.
They're coming more from the ops side,
and they're using automation to keep systems reliable and resilient.
And kind of like there's no gap in the middle,
even though it seems like a little gap
when I point the fingers from left to right,
but it's more like they're influencing each other
because SREs, they have knowledge
on how to build resilient systems
and certain things should be baked into the pipeline,
like the automated monitoring,
the automated testing,
the automated validation.
And so I would like to now hear from you
how you are approaching this whole topic of DevOps and SRE,
where the overlaps are, and really how we can bridge the gap
because we don't want to tell the world
there's DevOps on one side, there's SRE on the other side,
and then we're throwing it over the wall again.
Yeah, I really like the model that you're describing,
like starting on these two ends,
but effectively meeting in an overlapping zone in the middle.
Let's look at one extreme example.
Let's say you only have like one person or one very small team
like building and operating a whole service.
And so what does that person need to do?
Like outside of our conversation, somebody needs to do product management.
Like somebody obviously needs to understand like what we're building.
But when we go to the software engineering and the operations,
there is like super deep domain knowledge in like the design of what we're building.
Like many of those systems have like non-distributed systems aspects that need a lot of expertise. Let's say we only have one team.
So these folks now have a great interest to iterate fast. I think to the audience of this podcast,
this is like bread and butter. But many organizations to this day still have like big releases,
like once a month, once a quarter,
where they are betting the house on this release,
like going out of the door and then something doesn't work.
Like one out of like a thousand small changes or two out of a thousand changes break.
You have to roll back.
And it's like essentially a war room every month or it's a war room every quarter to
get the release out.
And so I, and maybe you live in a bubble where it doesn't exist or maybe you see it, but
you also see people moving away from that model.
But there's still a lot of organizations that do that kind of software development.
So let's assume our very modern team doesn't want to do that.
And this is where what you called DevOps,
what I would also call release engineering, would come in. A team that is dedicated to the velocity
of the software development process.
And that team can eliminate all kinds of problems.
Like how do we speed up code reviews?
I personally think code reviews are absolutely necessary.
Like I would never start a software engineering project
without mandatory code reviews, to be very blunt here.
That team could work on: do we have deterministic builds? How long
does it take to do an incremental build? These are all topics that would really improve
the developer experience. Do we have proper staging environments? Do we have A/B testing? Do we have
a release system that can allow your
changelist to be out in staging within minutes or within one hour instead of in days?
So I think this is one of the core parts of DevOps. And in SRE it is not that important to do
it because usually like if the service is so important that you need a dedicated SRE
team, then the service is usually also so important that you have a dedicated release
management team.
So SREs generally demand that a really good release pipeline exists and SREs have the
skill to help build it, but it's not their main job.
They usually show up and say like, hey, you're building this by hand, what is this?
Can we please have a useful release pipeline?
The SRE
skill set is overlapping with these other groups
and really starts at the design
of the distributed part of the system.
Is the system stateless? Is the system stateful?
What storage mechanisms do we use?
How do we do load balancing? How do we do caching?
When you have a cache invalidation event,
how expensive is it to refill your cache?
Do you have a capacity cache?
Or is this cache used for latency?
There's just many, many questions that SREs will ask
when they see an existing design of a distributed system
or they see a new design doc coming up
where they're trying to make sure
that people are not designing themselves into a corner.
And I think this is really where the SRE software engineering skill set is starting.
And then it goes kind of downstream from there.
Like, does the release tool have exponential rollouts or is it trying to update everything at the same time?
Does the release tool have any safety checks
in case the build turns out to be broken in production?
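As an illustration of the exponential rollout with safety checks that Michael mentions, here is a hedged sketch. The wave sizes, health threshold, and error-rate source are all hypothetical stand-ins for whatever a real release tool and monitoring system would provide.

```python
# Hypothetical sketch of an exponential rollout with a health check between
# waves: update 1%, then 2%, 4% ... of instances, and stop (or roll back) as
# soon as the new build's error rate looks unhealthy.

import random

def healthy(error_rate: float, threshold: float = 0.01) -> bool:
    return error_rate <= threshold

def error_rate_of_new_build() -> float:
    # Stand-in for real monitoring data on the freshly updated instances.
    return random.uniform(0.0, 0.02)

def rollout(total_instances: int) -> None:
    updated, wave = 0, max(1, total_instances // 100)   # start at ~1%
    while updated < total_instances:
        wave = min(wave, total_instances - updated)
        updated += wave
        print(f"updated {updated}/{total_instances}")
        if not healthy(error_rate_of_new_build()):
            print("health check failed -> halt rollout and roll back")
            return
        wave *= 2                                        # exponential growth

rollout(10_000)
```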
And you mentioned a lot of good stuff around monitoring and debugging earlier.
I really love the idea that you said there
of people designing themselves into a corner.
I think that even architecturally,
we see people choose themselves into a corner too, depending on which technologies they choose and why, and whether they're making an informed
decision early on with Kubernetes, right? Everybody just wanted to jump to Kubernetes
for no good reason, because that was the latest thing, right? Or we've seen people like, hey,
we're going to move 100% serverless. Why? Because we want to be serverless.
And then they get stuck in a situation.
And as Andy's talking about bringing in the observability side of things,
how are you going to observe that if you're on a platform
where there isn't good enough observability tooling?
Or what have you designed for that bit, as opposed to making another choice?
So there's all these choices that go into everything you do
that if you're not paying attention, as you said,
you end up designing yourself into this corner.
And now how do you get out of that?
And I just got to also say at this point too,
it's really the level of SRE that you're discussing is,
I think a lot of the times we've discussed SRE in the past, Andy,
it's not been this deep of a level.
A lot of my exposure to SRE work has not been this deep as well.
So this is just, I'm very quiet because it's just smacking me in the head really hard.
Like, wow, there is so much more than the high-level SRE stuff you hear about a lot more commonly.
So I really appreciate you going into that level there.
If you want, I can share a few thoughts on this,
designing yourself into a corner.
My most sensitive part when I look at the design
is the way that people are structuring their storage.
And the reason for that is,
if you fix a distributed systems design problem in any of the stateless parts of your service, that's usually easy to roll out.
You need to be careful that you have forward and backward compatible APIs so that you can update parts of your system. You need to be sure that if you're doing some experiment launches, so a feature is only
visible to part of your audience, that these experiment configurations are consistently
applied everywhere.
But it is still, from a conceptual perspective, relatively easy to move from one stateless
processing system to a second one.
But the moment that you're hitting either a spinning disk or these days maybe an SSD,
you're committing the sins of the past to your storage system.
And if you're then trying to fix anything there,
like unless you have the capacity to rewrite everything that was ever written to stable storage
to a new schema, you have to keep the code that has to live with the sins of the past forever.
This is something I learned.
I spent a lot of time in Gmail.
And once you had anything written to a storage system,
you needed to be able to parse that and unmarshal that again and process it again.
And if you're building a sufficiently large system,
that becomes very, very expensive.
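A minimal sketch of the "sins of the past" problem with stored data: once version-1 records are on disk, every reader has to keep the version-1 unmarshalling path alive. The JSON schema and version tag here are hypothetical, purely for illustration.

```python
# Hypothetical sketch: once data has been written under an old schema,
# readers must keep the old unmarshalling path forever (or rewrite every
# stored record). A version tag on each record is what makes that possible.

import json

def unmarshal(record: str) -> dict:
    data = json.loads(record)
    version = data.get("v", 1)
    if version == 1:
        # Sin of the past: v1 stored the full name as a single field.
        first, _, last = data["name"].partition(" ")
        return {"first": first, "last": last}
    if version == 2:
        return {"first": data["first"], "last": data["last"]}
    raise ValueError(f"unknown schema version {version}")

old = '{"name": "Ada Lovelace"}'                       # written years ago
new = '{"v": 2, "first": "Ada", "last": "Lovelace"}'   # current schema
print(unmarshal(old), unmarshal(new))
```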
And in modern distributed systems,
it's really hard to just use a relational database.
Like I'm a kid of the relational database generation.
Like I started out, I can't even pinpoint it.
Let's say I spent a lot of time with Oracle early on.
I played with like Postgres.
Before it was called Postgres, it had a query language called QUEL, which was not SQL compatible.
I spent a lot of time with MySQL.
And these are really, really powerful tools to store data. But unless you manage to design with sharding in your mind, unless you already build a distributed systems design
where the subset of the data that needs to be relationally consistent
and relationally managed and using foreign keys
and all the actual good features of a database,
unless you manage to design this to be very granular.
Like relational databases do not scale.
And so Gmail is not using a relational database in the classical sense.
But what is really important is that when you're processing email,
you can have a unit which is a user.
And that makes your life much easier.
If you're building a system that's all about sharing and you have this huge ball of hundreds of millions of users and they're sharing things with each other, it's effectively impossible to disentangle that. So if you're building a sharing system with millions of users and
billions of items shared and you start out with a relational
model, you need to start from scratch. You cannot
scale something in a relational database with a huge ball of
foreign keys and hard relationships.
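To illustrate the "user as the unit" idea in code, here is a small sketch of hash-based sharding by user ID. The shard count and hashing choice are assumptions; the takeaway is that per-user data lands on one shard, while cross-user sharing relationships are exactly what such a scheme cannot express cheaply.

```python
# Hypothetical sketch of sharding by user: if the user is the unit of
# consistency, a hash of the user ID picks the shard, and everything that
# must stay transactionally consistent for that user lives on one shard.

import hashlib

NUM_SHARDS = 64

def shard_for(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All of one user's data goes to the same shard; cross-user "sharing"
# relationships are exactly what this scheme cannot express cheaply.
print(shard_for("alice@example.com"), shard_for("bob@example.com"))
```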
I'm especially sensitive when it comes to the design
of your storage system because there, like,
migrations are hard and mistakes are on your disk,
like, forever or until you rewrite the data.
I love these stories because, obviously,
they come from your experience.
I want to quickly throw in one of my stories, and I think this is also a way to date myself.
I remember in the early days of Dynatrace, a customer came to us and they said,
Hey, we built this new high transaction volume application on top of SharePoint.
And we don't know why things are slow.
And we read some of your blog articles on SharePoint.
Can you help us?
And I mean, I should have known when they
mentioned the words SharePoint and high-transaction-volume system
in one sentence that this was most likely the problem.
And this was exactly the problem.
They misused the flexibility that SharePoint gave you with the lists you can put data into. This was like 15 years
ago. It's really flexible, but obviously it doesn't scale. And the recommendation,
maybe I was kind of the SRE back then that brought them out of their corner, I told
them: you need to rewrite your system, because you made a very critical, wrong architectural decision
on how you want to store and how you want to treat your data.
And the question is for me now, a lot of organizations,
and especially I would think if we look at startups
that are coming up with new ideas,
where is the trade-off between over-engineering things and
actually giving yourself, in the beginning, the runway and the speed without having to think about these
things? Because first you need to figure out whether your business is even a good business idea.
But then, with the effect that if you don't do it right from the beginning, eventually you have a lot
of technical debt and you need to rewrite everything. Have you been in a situation like this, where you made certain
decisions on purpose in a different way and then later changed course because you knew you now had to change it?
So let me start out by admitting that I've been in this situation of trying to
over-engineer stuff many, many times in my career. And indeed, as an engineer I
had to learn to step back and actually look at the time to market, or the time to
deliver the system, versus creating the absolutely perfect setup. So I think
this is a social skill that engineers need to learn,
hopefully not by mistakes, hopefully by like osmosis from really great mentors.
One thing that Amazon talked about a lot that I love is this idea that APIs rule. You should start out with
not a stable, but a sensible API
between two systems.
At Google we talk about this a lot.
Luiz Barroso, one of our most senior engineers,
really pushes people to iterate
on the implementation behind an API.
This is where you can make the most smart engineering trade-offs.
Even if the API is not picture perfect,
like as long as you can keep people using the API
and not trying to peek around it and use some other interfaces,
for example, as long as you can keep people
from ever directly talking to your
storage system, you have a chance as an engineering team to really provide fantastic optimizations.
And so that's very important.
If you design your whole system around your storage schema, if you're using the storage
schema as the lingua franca between teams, you have painted yourself
into the very small corner where the full API of the storage system is the API between all parts of
the system. And this is just way too complex. And yeah, if everybody can read from the
relational database, you're doing it wrong.
Yeah. I remember, Brian,
these days, a couple of years ago
with our first generation
product when everybody wanted to get
direct access to the data store
because they knew there's some data in there
and our other APIs didn't give them
the data that they wanted. And then they re-engineered
and we had a lot of people
building their own solutions on top of
the direct database model that we had,
the Datastore model.
And this had all sorts of side effects
and actually made it really hard to migrate these people over to another platform.
You can't specify an SLA on a database API.
It just doesn't work.
The APIs are too complex.
Sorry, Brian, I interrupted you.
Oh, no, no, no.
I was just going to comment on the idea.
Andy, you brought up in the beginning
of this line of discussion here
the idea of over-engineering
versus re-architecting.
And I don't think we have the answers here,
but I'd be curious to find out,
especially for startups, is it beneficial to get to a point where you have to re-architect?
Because let's say you do, you need to find out the viability of your product if people are going to like it.
Having a chance, having the opportunity built in by force, I guess, where your hand is forced, once you get to a certain success level, to
re-architect and start fresh again.
Almost sounds like an opportunity
to wipe the slate clean
and take everything you've learned
and start fresh again and build anew from there.
I mean, yeah, there's always the pain of,
do we have the time, everything else.
But if you're going to be successful,
I doubt you're going to be on the right path from day one.
So getting yourself to that level
and having the opportunity to re-architect
almost seems like it would be beneficial,
but obviously we'd have to talk to a lot of startup people
to find out.
Let me put on my manager hat for one minute.
And I looked a lot at risk management.
If I think about rewriting a complex system,
I would try to think about how can I de-risk the rewrite.
And the most risky rewrite a company or an organization is going to do
is the full system rewrite.
Like, hey, let's throw away everything.
We learned a lot of stuff. Let's do it again.
And this has all the risks. It has the risk that some of the learnings will be forgotten.
It has the risk that you're not re-implementing the same feature set, but suddenly you're also
trying to satisfy a huge list of other features. It has the risk that the team is overambitious and it's trying to, like
we talked about, when is enough enough?
So you have a super high risk with ending up like the second system effect.
It's very, very well documented in the systems literature: either you cannot launch the rewrite
or it takes you five times as long.
And so if I look at this from a pure risk management perspective,
what I would say is plan for rewriting your first system,
but try to have some stable APIs
so that you can rewrite the system iteratively in parts.
Okay, throw away your front end.
That's totally fine.
But as you're rewriting your front end,
try to use the same storage backend if possible
because you don't want to have all of your balls
in the air at the same time.
Okay, you're changing your storage.
That's really, really critical.
You need to have a very good process.
I've been in critical storage migrations
and it's usually let's build the new storage system
but have a scaffolding,
like have an API that is simulating the old storage system.
Okay, let's start to do it right for a few users or for some subset.
Okay, let's have a validation system that can, via the API,
retrieve every piece of data in the data unit and compare it bitwise.
It's just a huge dance.
So you're doing this dance, and then someday you trust
the new storage system and you flip the source of truth over,
and then you still wait and have the old system running in standby, and then
at some point you convince yourself: okay, fine, the new system is working and you can turn off
the old one. If you're also rebuilding every batch process, every front end, every messaging
system at the same time, this is not going to converge, because you can never tell: is this
now a regression in the new storage system
or did we change everything else at the same time?
So I would ask people to de-risk the rewriting process
by picking like some API level
and then only rewriting on one side of the API at a time.
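Here is a hedged sketch of that migration "dance" behind a stable API: dual-write to the old and new stores, validate bit for bit, and only then flip the source of truth. The class and method names are hypothetical; real migrations add batching, backfill, and per-user cutover.

```python
# Hypothetical sketch of the migration "dance": keep the old store as the
# source of truth, dual-write to the new one behind the same API, and run a
# validator that reads every unit through the API and compares it bit for bit.

class Store:
    def __init__(self):
        self.data: dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self.data[key] = value
    def get(self, key: str) -> bytes | None:
        return self.data.get(key)

class MigratingStore:
    """The stable API; callers never know two backends exist."""
    def __init__(self, old: Store, new: Store):
        self.old, self.new = old, new
        self.source_of_truth = self.old
    def put(self, key: str, value: bytes) -> None:
        self.old.put(key, value)          # dual-write during migration
        self.new.put(key, value)
    def get(self, key: str) -> bytes | None:
        return self.source_of_truth.get(key)
    def validate(self) -> bool:
        return all(self.new.get(k) == v for k, v in self.old.data.items())
    def flip(self) -> None:
        assert self.validate(), "refuse to flip while stores disagree"
        self.source_of_truth = self.new   # old store stays on standby

store = MigratingStore(Store(), Store())
store.put("user/1", b"hello")
store.flip()
print(store.get("user/1"))
```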
And this comes back to what you said earlier, right?
Start with a good API.
And then I think if you have a good defined API,
APIs for me are like a contract, obviously, right?
They're a contract between two parties.
And it's also a great way to then dare enforce,
if it makes sense, your SLAs, your SLOs, right?
You want to make sure that the API is responding
in the way you expect it from a performance perspective, from an availability perspective.
Because a lot of people always ask us, because obviously when we talk about SRE, a lot of the time, a lot of SLOs come to mind, right?
And SLOs everywhere.
And then I say, well, SLOs everywhere might not be the best approach,
but you want to define SLOs where it's critical for you. And you talk about these critical APIs
to the outside, to your end user, but then also to business critical backend systems like the
storage. And there it makes sense because you know, if you're not meeting your SLOs there,
it will have an impact. And if they are kind of stable, right? If you want to keep these APIs stable,
that means you also have history.
And as you're changing and iterating
through the implementation,
you know how the performance has changed,
how the resiliency has changed
with the new implementation.
And I think that's why it's so critical
to define these good APIs.
And they will be good.
And then next year they will not be good,
but they're still APIs.
So you still have a meeting point between the teams or between the systems.
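A minimal sketch of checking an SLO at such an API boundary, assuming you already collect per-request outcomes: compute availability and p95 latency over a window and compare them to the agreed targets. The targets and the synthetic request window are made up for illustration.

```python
# Hypothetical sketch of an SLO check at an API boundary: given a window of
# request outcomes, compute availability and p95 latency and compare them
# against the contract agreed between the two teams.

from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float
    ok: bool

def slo_met(requests: list[Request],
            target_availability: float = 0.999,
            target_p95_ms: float = 200.0) -> bool:
    availability = sum(r.ok for r in requests) / len(requests)
    p95 = quantiles([r.latency_ms for r in requests], n=100)[94]
    return availability >= target_availability and p95 <= target_p95_ms

window = [Request(latency_ms=50 + i % 120, ok=i % 1500 != 0)
          for i in range(10_000)]
print(slo_met(window))
```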
I want to cover one quick thing because I have your presentation open here in front of me on the other screen.
And in your presentation, you made a very good point where you said, why are we writing software?
We're not writing software to make us feel good, but in the end to make money.
How do we make money?
We make money because we can sell the software to somebody.
So on the top, there should always be the user.
And you had this maturity model,
kind of like a pyramid.
And I think you called it the,
you gave kudos to Mikey Dickerson's
hierarchy of service reliability needs,
which I thought was really nice
and especially the way you spoke to it. As far as I remember, you said that at the top is the user, but
then at the bottom of the pyramid is monitoring, right? Observability. Which is essential, because if
you don't have observability built in, then you don't know, when a user is complaining,
why they are complaining. Are they complaining
because they just have a bad day? I would just like to ask you a little bit about,
from an SRE perspective, do you see SREs in their mentoring role that they have,
also to make sure that this hierarchy of needs is then baked into the platforms? Or what type
of responsibilities does an SRE team have? I see here things like retrospective, root cause analysis,
capacity planning, other certain things SREs need to build and mentor and build into the platforms
or are there also certain aspects that are purely an SRE role and that always stay with an SRE?
Yeah, so I think that you have outlined very nicely the steps on how to push good ideas into a large engineering organization.
Step one is you do it and you show people how it's done. When I joined SRE at Google, yes, we were the ones that were adding monitoring metrics to binaries by hand as software engineers.
Let's export this metric.
Let's create a counter.
Let's create a ratio, whatever.
And we configured the monitoring system to scrape them and aggregate them and alert on them.
So from an organizational maturity perspective, it was relatively low, but it was absolutely what was needed at that time.
Because we had systems that were like not black boxes, but that were like absolutely not instrumented enough to determine, is this release good?
Do we have enough capacity?
Why did the latency in this
cluster shoot up? These are all questions that SREs had to answer every day, and in some
products we didn't have the capability. So I think this is step one.
Step two is changing the programming language frameworks to make it easy for the
software engineers to help themselves.
I think that's the second step. And this is where in the last 10 years, the open source world has
made such big progress. There are many, I don't want to quote them, but there are many open source
language frameworks that have understood that the software engineer needs to be able to export data easily.
This can't be like black magic.
So I think changing the software frameworks is the second step.
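As one example of what those open-source frameworks look like today, here is a small sketch using the Prometheus Python client (not Google's internal monitoring stack): the engineer exports a counter and a histogram, and the monitoring system scrapes them from an HTTP endpoint.

```python
# Example of "export a counter, let the monitoring system scrape it", using
# the open-source Prometheus Python client (pip install prometheus-client).
# This is not Google's internal stack, just the same idea in the open.

import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Requests handled", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency")

@LATENCY.time()
def handle_request() -> None:
    time.sleep(0.01)                      # pretend to do work
    REQUESTS.labels(status="ok").inc()    # the hand-exported metric

if __name__ == "__main__":
    start_http_server(8000)               # scrape target at :8000/metrics
    while True:
        handle_request()
```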
And the third step is creating or improving the production platforms
that the software runs on.
It's absolutely not a coincidence. On the framework level,
I can give you one example.
Like in some products, we had huge problems with overload.
And the original SRE response was relatively operations heavy.
Well, let's add more capacity.
Or the release is bad.
Like, let's roll back.
But in reality, when you look at your request stream,
not every request at an RPC level should have the same QoS class.
It's a notion that's coming from the network
that we didn't use at the RPC level in the beginning.
So we're treating all the client requests with essentially the same priority.
But in some cases, the user just clicked the save button and they really want their data to be stored.
And in other cases, the browser or some part of the backend processing on a mobile app is just caching data.
And these two operations obviously don't have the same QoS class.
So on the framework level,
SREs at Google went in and built load shedding and
traffic management into the programming libraries, so that you
effectively cannot start a new binary
that will not be able to protect itself
from overload.
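A hedged sketch of that idea, priority-aware load shedding baked into the serving path: when utilization climbs, background traffic is rejected first and user-critical writes keep flowing. The QoS classes and thresholds here are hypothetical, not the actual library behavior.

```python
# Hypothetical sketch of priority-based load shedding: when the server is
# near capacity, drop the "prefetch"-class requests first and keep serving
# the "user clicked save" class. Real RPC frameworks build this into the
# library so every binary gets it for free.

from enum import IntEnum

class QoS(IntEnum):
    CRITICAL = 0      # user-visible writes, e.g. "save"
    DEFAULT = 1
    BEST_EFFORT = 2   # background caching / prefetching

def admit(qos: QoS, current_load: float, capacity: float) -> bool:
    """Shed the lowest classes first as utilization climbs."""
    utilization = current_load / capacity
    if utilization < 0.8:
        return True
    if utilization < 0.95:
        return qos <= QoS.DEFAULT
    return qos == QoS.CRITICAL

for q in QoS:
    print(q.name, admit(q, current_load=90, capacity=100))
```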
And the third level is really this platform level
where you probably need to have a good microservice platform.
But not everything will move to microservices or will move to,
I don't know, you mentioned cloud functions earlier.
You will probably have some classical kind of fat binaries with multiple endpoints.
So you need a really good platform to run those.
And when you look at Mikey Dickerson's hierarchy of service reliability needs, which
is everything here on this slide, I just renamed a few of the fields.
You see that the pervasive monitoring
is essentially enabling everything else
because you can't do incident response
if you don't know what the system just did.
You can't do root cause analysis
if you don't have a good recording of what happened.
I don't want to read everything,
but like testing, capacity planning,
like all of this is referring back to the
capabilities below it, especially
monitoring.
That's why I really liked
it and also the way
you talked about it at DevOps Fusion
and wanted to bring it up. Michael,
we are approaching the top of the
hour here while we're
recording.
I wanted to make sure that we also give you the chance in the end to say
anything else that we may have not talked about that is important for you,
for our listeners to understand in the context of the topic we talked today.
Is there anything we missed?
Any final words?
Yeah, if I can share one thing, like when you are introducing new engineers
into DevOps or an SRE role,
like it's really important to make sure that they understand that this is like a blameless culture.
Like my biggest problem, I can share with you as a new team member in the SRE team,
my biggest problem was like I never wanted to make a visible mistake.
And I spent like so many hours like reading source code, digging through design docs,
just so that nobody knows that I don't fully understand the whole system. And this is
an anti-pattern. And luckily, like super nice and way more senior folks like tapped me on the
shoulder and said like, hey, you should have asked.
There are five people around you that can answer this question in 10 seconds.
And as long as you have a good culture of batching up your requests
or your questions, you're not annoying anybody.
People love for you to reach out.
And I think this is what all of us need to do.
When we see new team members show up.
We need to take away this load from their shoulders that they have to be perfect
because our software systems are not perfect
and we shouldn't expect humans to be perfect.
It just doesn't work.
In that way, software systems are just a reflection of the imperfection of us humans.
Absolutely.
That's what it is.
Awesome.
Brian.
Yes.
I know you're fighting with your camera there a little bit.
Well, I turned it off before because I was getting a little latency,
but I figured I'd try to turn it back on.
Did you have an SLO violation on the latency?
I sure did. And I'm going to contact my cable provider
for violating their SLA.
Yeah, exactly.
Yeah, I think that, again, Michael,
we will write a little summary of this
to make sure the people that browse
through the podcast
apps that they have, because we know there's more than our podcast out there.
We will also provide a couple of links, so I will definitely link back to your
social media profiles, and if there's anything else we should post, just let
us know, we can add the links to the proceedings. It was a pleasure having you.
Again, thank you so much for also hosting me in Zurich before the conference. This was very much appreciated and just shows what a great culture you have there at
Google, and I hope we see each other again face to face at some time. Thank you, Andreas, for the
kind invitation, and thanks to both of you for the really interesting chat.
Yeah, I just wanted to say this is one of those ones where I have to just sit back and absorb.
It was a lot
of fantastic information. Really, really appreciate
it. And it was also a reminder,
you know, Andy and I moved
over from being on
the hardworking side of the fence
to the software sales
sort of side of the fence years
and years ago. And
especially hearing the level of detail on some of the stories you go on,
for me personally, it gave me a reminder of what our customers are going through.
I have engagements, they're very light level engagements,
and I just see, I guess, the surface of it.
And getting a reminder of what's going on in the background and all
the complexities and things they have to tackle on a daily basis, both in their daily
work and strategically big picture is always a fantastic reminder. So really appreciate you
sharing all this information with us and our audience as well. Sounds good. Thank you so much,
gentlemen. Thank you. Have a wonderful day and thanks for listening.
Have a great day. Goodbye.
Bye.
Cheers.