Screaming in the Cloud - Episode 19: I want to build a world spanning search engine on top of GCP
Episode Date: July 19, 2018

Some companies that offer services expect you to do things their way or take the highway. However, Google expects people to simply adapt the tech company's suggestions and best practices for their specific context. This is how things are done at Google, but this may not work in your environment. Today, we're talking to Liz Fong-Jones, a Senior Staff Site Reliability Engineer (SRE) at Google. Liz works on the Google Cloud Customer Reliability Engineering (CRE) team and enjoys helping people adapt reliability practices in a way that makes sense for their companies.

Some of the highlights of the show include:
- Liz figures out an appropriate level of reliability for a service and how a service is engineered to meet that target
- Staff SRE involves implementation, and then identifying and solving problems
- Google's CRE team makes sure Google Cloud customers can build seamless services on the Google Cloud Platform (GCP)
- Service Level Objectives (SLOs) include error budgets, service level indicators, and key metrics to resolve issues when technology fails
- Learn from failures through incident reports and shared post-mortems; be transparent with customers and yourself
- GCP: Is it part of Google or not? It's not a division between old and new
- Perceptions and misunderstandings of how Google does things and how it's a different environment
- Google's efforts toward customer service and responsiveness to needs
- Migrating between different cloud providers vs. higher-level services
- How to use Cloud machine learning-based products
- GCP needs to focus on usability to maintain its pace of growth
- Offer sensible APIs; turn up, turn down, and update in a programmatic fashion
- Promotion vs. a different job: when you've learned as much as you can, look for another team to teach you something new
- What is cloud and what isn't? Cloud deployments require SRE to be successful, but SREs can work on systems that do not necessarily run in the cloud

Links:
- Cloud Spanner
- Kubernetes
- Cloud Bigtable
- Google Cloud Platform blog - CRE Life Lessons
- Google SRE on YouTube
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is sponsored by
ReactiveOps, solving the world's problems by pouring Kubernetes on them. If you're interested
in working for a company that's fully remote and is staffed by clued people, or you have challenges
handling Kubernetes in your environment because it's new, different, and frankly, not your
company's core competency, then reach out to
ReactiveOps at reactiveops.com. Welcome to Screaming in the Cloud. I'm Corey Quinn.
Joining me today is Liz Fong-Jones, who's a staff site reliability engineer at Google,
who works on the Google Cloud customer reliability engineering team. She lives with her wife,
metamour, and two Samoyeds in Brooklyn. And in her spare
time, she plays classical piano, leads an EVE Online Alliance, and advocates for transgender
rights. Welcome to the show, Liz. Hi, Corey. It's great to be here. Well, thank you for joining me.
Let's start at the beginning. What is a staff site reliability engineer? So, Corey, I think you have
to break that down into two pieces, which is let's talk about the SRE part first, and then we'll talk about kind of what the staff part means.
So I'm a site reliability engineer, and we're the people that are the specialists in figuring out
what's an appropriate level of reliability for a service, and how do we make sure that it is
engineered to meet that target. So we don't target 100% availability, but we target a reasonable level of availability
that meets our customers' requirements.
So this means we write software,
mostly because it turns out that
you can't really run a highly available service
by doing a bunch of manual work.
And we also make sure that
when things go bump in the night,
that we learn from them,
that we are the people that do some degree
of coordinating incident response and then figuring out what do we need to proactively do
next time. So that's kind of in a nutshell, the SRE role. And then in terms of what it means to
be a staff SRE. So our career progression basically roughly goes, you know, you start off
and we hand you a project and say, here's a design doc. Please go implement this.
And then we eventually say, here's a problem.
We know you can figure this out.
Please write a design doc and solve it.
And then eventually we ask you to figure out which problems are useful to solve over the next year.
What should the team be focusing on over the next year?
And the place in my career where I'm at right now,
where I'm a staff site reliability engineer, is I work with many different teams to kind of coordinate their roadmaps, figure out
what is it that we need to work together on. So kind of the thing that makes me a staff engineer
rather than a senior engineer is that aspect of looking outside of my team. I would take it a step
further having seen some of the conference talks that you've given. Something that has always been very distinctive about how you frame things has been the way that you set context around,
this is how we do things at Google. This may not work in your environment.
And that's a theme that I don't see emerging very often among speakers from large, well-respected tech companies. Yeah, that's totally a response, I think, to a lot of criticism that I've seen around
companies saying, you know, do it our way or the highway, right? And I think Charity
Majors phrases this really well, that it's a matter of context, that you have to make sure
that people are adapting your suggestions and your best practices for their specific context. And I think that that's a thing that makes me really excited about
my current role is being able to focus on helping people adapt reliability practices in a way that
makes sense for their companies. So taking it a step further, you mentioned that you were on the
Google Cloud Customer Reliability Engineering team.
Yeah.
What is that?
So the CRE team, or Customer Reliability Engineering team, focuses on making sure that
Google Cloud customers understand how it is that the services they're building on top of
the GCP platform work, and making sure that they are able to build services
that operate seamlessly.
And this is something where you can do a lift and shift
onto the cloud platform,
but that's not really the thing
that's going to give you the full benefits,
that you have to look beyond that
and figure out how am I designing this?
How am I architecting this?
Am I integrating my operations
with my cloud platform providers' operations? So that's the thing that we think about a lot is,
how do we make sure that all of our major customers have SLOs, service-level objectives,
that make sure that they have error budgets, make sure that they have their service-level
indicators and key metrics available to people in GCP, and that our key metrics are exposed to them for their usage of
our platform so that we can improve everyone's time to resolve issues so that when there is an
outage, it maybe stings less and that we're setting an expectation of this is what our service level
objective is. You can expect us to deliver that, but you cannot expect us to deliver 100% reliability
because it would be prohibitively expensive.
So that's kind of what our team does is we act as a conduit between customers and Google to make sure that we're integrating and operating efficiently and that we are using best practices.
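To make the error budget arithmetic described here concrete, the following is a minimal Python sketch of the idea: an SLO target implies a fixed allowance of downtime over a window, and each outage spends part of it. The 99.9% target and 30-day window are illustrative assumptions, not figures quoted in the episode.

```python
# A minimal sketch of error-budget arithmetic. The SLO target and window
# below are hypothetical numbers, not Google's actual figures.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed downtime (in minutes) for a given SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget

if __name__ == "__main__":
    # Example: a 99.9% availability SLO over 30 days allows ~43.2 minutes of
    # downtime; a 30-minute outage consumes roughly 69% of that budget.
    slo, window, outage = 0.999, 30, 30.0
    print(f"budget: {error_budget_minutes(slo, window):.1f} min")
    print(f"remaining: {budget_remaining(slo, window, outage):.0%}")
```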
One thing that I guess is common to SREs across the entire spectrum is this stuff is very complex
and outages invariably happen.
Technology fails, human beings fail,
software has bugs in it.
And it's one of those areas
where no one remembers all of the times that you were up,
but that one time where a thing fell over,
depending on what it was,
you'll hear about that years later,
and it becomes almost the narrative that defines it. There were a couple of notable outages in the
last couple of years for AWS and for GCP, and you see that people still tend to bring those up.
How do you, I guess, transition away from the narrative being, we keep things up and running except that one time we didn't.
And I guess, turn that into a story that culturally drives the narrative of outages happen,
we don't try to yell at people for them, we try to improve so that those outages don't recur,
and those systems become more robust with time. I think there's a couple of angles that you can
take on that, one of which is
incident reports and shared postmortems, right? The idea that when you have a large failure or
even a small failure, that you talk to the people that were affected and you go over kind of this is
what went wrong on our side. This is what we're doing to make sure that this can't happen again.
And furthermore, we talk kind of
more concretely about what's the impact in your error budget, right? You know, if your error budget
says you can be down for two hours per year, and we eat 30 minutes of that error budget, then we
can talk about what are we going to do to be more conservative with the other hour and a half
remaining. I think that the other piece of doing shared postmortems is that it enables you to do things like say, hey, yes, there was an outage, but these are some strategies that would have mitigated the impact for you earlier, right? And then kind of work together on figuring out how to implement them. So it's kind of a mechanism of turning incidents from, oh my God, everything exploded, to, this is a learning opportunity. This is what we're going to learn from it; this is what we did learn from it. One thing that tends to stand out
about Google is that they're very open with the learnings that come out of outages, that come out
of various crises, that as they solve these global world-spanning problems, they release white papers,
they talk more about how they're thinking about these problems
and what they're doing to address them than most other companies. In many cases, you'll see an
outage that makes headlines, and the company will release a very thin root cause analysis or
post-mortem or whatever term we're using this week, that turns into a narrative of,
there was a problem, we fixed it, and it doesn't go deeper than that.
Has that just been part of Google's culture forever? Or was there something that drove that,
that led to an awakening? I think that there's two aspects to that. One of which is that we,
as well as many other companies, do produce very robust internal post-mortems. I think that's kind
of a prerequisite to having that level of openness with your customers is to be open with yourself
first. But as far as explaining kind of what's going on under the hood to customers, it's really
important to, if you have a system that is really scary for customers to understand, for instance,
when GCP came out with the Google
App Engine, even before it was called GCP, or when we came out with Cloud Spanner, these are
things that it's really hard for people to get a visceral sense of what are the risks involved here?
How am I going to make this? How is this engineered? How can I be confident in how it's going to work
and why the failure patterns I might expect aren't there or the failure patterns I have seen have been remediated?
So I think having so much technology that doesn't necessarily have a large number of parallels at the time that it was released tends to motivate us to be a lot more transparent than we otherwise would be. To that end, and this, I guess, could break down in either a technical direction or a
culture direction, and I'm thrilled to explore both, but GCP is sort of perceived as being
part of Google, but not part of Google, at least from those of us outside trying to read
the tea leaves.
For example, search, to my knowledge, does not run on top of GCP for the technical side of it.
From the culture side, I'm wondering how embedded the GCP teams are compared to the rest of quote-unquote Google proper.
Yeah, totally. Google itself is a large user of GCP in terms of having a large number of very mission-critical corporate
applications running on top of GCP.
So, for instance, things like our company directory or things that are relatively, you
can say that they're kind of not very serious, except they are in the sense of these are
applications of the form that many customers outside of Google want to bring to GCP.
Things like our financial system, things like our company directory,
things like our internal meme generator,
all the way up to eventually being able to run engineers' workstations on GCP.
So that's kind of one angle to think about is how are we, you know,
most customers aren't coming to GCP and saying, I want to build a world-spanning
web search engine on top of GCP. So I think the set of applications that we've chosen to run on
top of GCP that are developed by Google represent user workloads fairly well. And then
some applications just don't make financial sense to run on top of GCP because virtualization
imposes overhead. And the security requirements that we have and the performance requirements
that we have just mean that it makes sense to not impose that extra overhead. So it's kind of a new development versus old development type of thing, as well as thinking about the requirements of the application.
And then I think on the culture front, GCP has been built by the very same teams who developed the underlying original Google infrastructure.
And it's run by the same SRE teams. Like, one SRE team will be responsible both for running the blob storage
system and the Google Cloud Storage system. That's one team, right? That's not multiple teams. So there's
not really a division between kind of what you're describing as old Google and new Google. So where we can, we just kind of, we use the lessons that we learn
from having operated almost a legacy service
with an enormously complicated API.
And we say, how much of that do customers really need?
Why don't we just simplify it?
Because we're not constrained by having to
haul around a 15-year-old API.
So I think that that is really the crux of GCP development,
is that it's not a division between old and new.
It's instead the same people who developed the old
developing the new
and bringing all the lessons that we've learned from it.
There definitely is a distinction
between building technical infrastructure at Google,
whether it be GCP or not GCP,
and building products like Google Web
Search or ads. That's definitely true in that there's a difference between being someone who
develops technical infrastructure and someone who uses technical infrastructure. But even so,
there are definitely blurry lines. For instance, a lot of the work that the ads SRE teams have done has been building platforms that make it possible for individual ad development teams to basically implement their business logic on top of a framework that's going to work reliably for them.
So I think that that's an area in which someone can come to a GCP team and feel very comfortable is this idea of you're building infrastructure.
One theme that tends to emerge is that you'll see people in relatively small companies that
are getting off the ground talking about how Google does things and how it's a very different
environment there. And often in a sort of disparaging way of, well, I was talking to
my friend at Google, and they spin things up with just one command line and have an entire environment. Why can't we do that? All without understanding that two decades of very intelligent engineers working to build out infrastructure tooling, to the point where it is push-button-receive-cluster, is a non-trivial investment for a company to make. And most companies are
not going to make that leap. And that's been something that has, I guess, eluded people's
understanding for a long time. That said, it does feel like GCP is aiming at solving for that
problem. You effectively get Google-class infrastructure billed by the hour or second,
depending on how you want to slice that.
Is that a fair assessment? Yeah, I think that's a totally reasonable assessment to make,
is that having a lot of these developer productivity tools available for the first time
in GCP means that companies don't have to reinvent the wheel every time, that they can instead
make use of our investment in that technology. So to that end, what do you wish that people understood better
about GCP out here in the wilds that are not Google? I think that the main thing that I wish
that people understood better about GCP is the notion that we want to be not just a, you know, have a vendor-customer relationship,
but instead to have a partnership relationship with large customers.
And I think that that's a situation where people say, you know,
oh, I just want to compare on price or I just want to compare on features.
But I think that there's a difference between kind of buying interchangeable widgets and actually working together on building a shared system that incorporates the best of Google's technology and lets you innovate on top of that.
And I think that that's kind of one misunderstanding I see people having when they're looking at what's our cloud migration strategy.
Getting there, I guess, from even a customer service perspective,
has been a somewhat interesting road.
Historically, Google was very focused on not having a customer service department.
You should be able to have the system just work,
and staffing a call center back in the early days for a search engine
wasn't an area in which the company was prepared to invest.
Now that you're running
companies' production infrastructure at very large scale for a wide variety of clients,
that requires a level of engagement with those enterprise customers that looks a lot like a
traditional model that you've seen with Microsoft, Oracle, etc. for the past many decades.
What has that transition been like, as I guess Google has woken up to the idea of,
huh, a frequently asked questions list
probably isn't going to cut it
when people are dropping tens of millions of dollars a year on this.
Yeah, I think that that's something that
Diane Greene has been super sharp about,
that she's recognized that challenge
and better positioned Google to be responsive to the
needs of large customers. In the past year, even from my perspective dealing with customers that
are both in GCP and AWS, I've seen a marked improvement in that respect. So it's definitely
something that is evolving rapidly. The challenge in any sort of corporate reputational style of
thing is that it takes time to make the change, but far longer for the reputation of the way
things used to be to fade. It's sort of the curse of success. When you're a household name,
people form opinions and don't change them even in the light of new information. Yeah, that's definitely a mindset and mindshare issue
that we are hoping to address in part by talking about
kind of what are we doing?
How does it impact developers?
And that's kind of why I really like working so much
with our developer advocacy team
in terms of getting those kinds of messages out there about, hey, here's what's going on. Here's some reasons why you should look and see whether
GCP makes sense for you. And if it doesn't make sense for you, we'll be the first people to tell
you that as well. To that end, something that a lot of companies like to talk about is remaining
provider agnostic, where they could, in theory,
pick up their thing, whatever it looks like,
from AWS and move it to GCP,
or from GCP into this rickety ancient data center
that's falling to pieces,
or wherever they want to move things.
And I understand wanting that security blanket.
As a counterpoint, you're offering
some very differentiated higher-level things,
such as Google's Cloud Spanner, a world-spanning ACID-compliant database that effectively lets you
treat it like any other SQL database, except it's multi-region. You can write to it,
you can read from it from anywhere on the planet. Technically, this is amazing. From a business
perspective, rolling out an application built around something like this
is in some cases considered a non-starter because it doesn't seem like there's another option.
Well, what if Google decides they want to turn all of GCP off and or burn themselves to the ground
and or just go out of business and sell hats or something. Great, awesome. I don't see those things happening,
but people at least want a theoretical Exodus story.
How do you find that the desire,
even if unrealized, for lock-in
competes with the ability to,
at least in theory, be cloud agnostic?
I think that that's, in some way,
it's a matter of choice.
It's a business decision that companies can make.
Do you want to deal with the operability headache of keeping all of your services on raw VMs
and being able to migrate those VM-based workloads between different cloud providers that all offer VMs?
Or do you want to use higher level services? And I think that there's a tremendous
amount of interest in even making some of those higher level services available cross cloud.
If you look at what's going on with Kubernetes right now, it's a huge situation where GCP
offers Google Kubernetes engine, obviously, but there are many other cloud providers that also
offer Kubernetes-based services. And that's kind of an opportunity to do something that is
differentiated, but is also something that people can choose to migrate if they choose to.
Another example that I'd offer there is Cloud Bigtable. I used to work on Cloud Bigtable before
my current team. And one of the selling points of Cloud Bigtable is you can operate your service just against a regular HBase backend that you maintain yourself.
Or you can choose to run that workload against Cloud Bigtable.
And it all uses the same API.
You basically have to compile in the stub to talk to Cloud Bigtable and you're done.
So I think that that's definitely kind of the best of both worlds where everyone is using the
common standard. They may have different implementations on the back end. So in a way,
right, like given that Cloud Spanner is very SQL-like, if you are willing to forego some of
the technical benefits, you could go and use a different SQL-like application if you really wanted to.
And it might be less reliable or less performant.
And I think that there's also the angle of if you really do care about sticking to kind of the common denominators, then you can choose to use Cloud SQL instead.
You can choose to run MySQL databases
on raw VMs if you happen to have that particular strain of masochism. So there's a variety of
different options, and it's just engineering trade-offs that people have to choose.
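The HBase-versus-Cloud Bigtable point above boils down to programming against a common interface, so the backend becomes an engineering trade-off rather than a rewrite. The sketch below illustrates that pattern in Python with hypothetical stand-in classes; it deliberately does not reproduce the real HBase or Cloud Bigtable client APIs.

```python
# Hypothetical sketch of the "same API, swap the backend" pattern from the
# Bigtable/HBase discussion. These classes are illustrative stand-ins and do
# not reproduce the real HBase or Cloud Bigtable client libraries.
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class WideColumnStore(ABC):
    """The common surface the application code is written against."""

    @abstractmethod
    def put(self, row_key: bytes, column: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, row_key: bytes, column: str) -> Optional[bytes]: ...


class SelfManagedHBase(WideColumnStore):
    """Backend you run and operate yourself (storage stubbed with a dict)."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[bytes, str], bytes] = {}

    def put(self, row_key: bytes, column: str, value: bytes) -> None:
        self._rows[(row_key, column)] = value

    def get(self, row_key: bytes, column: str) -> Optional[bytes]:
        return self._rows.get((row_key, column))


class ManagedBigtable(WideColumnStore):
    """Provider-operated backend; same interface, different operational story."""

    def __init__(self, instance_id: str) -> None:
        self.instance_id = instance_id  # placeholder for real connection setup
        self._rows: Dict[Tuple[bytes, str], bytes] = {}

    def put(self, row_key: bytes, column: str, value: bytes) -> None:
        self._rows[(row_key, column)] = value

    def get(self, row_key: bytes, column: str) -> Optional[bytes]:
        return self._rows.get((row_key, column))


def record_login(store: WideColumnStore, user: str) -> None:
    # Application code only sees the interface, so choosing the backend is a
    # decision made where the store is constructed, not a rewrite.
    store.put(user.encode(), "events:last_login", b"2018-07-19")
```

Swapping SelfManagedHBase() for ManagedBigtable("my-instance") at construction time is the whole migration, which is the same shape as the compiled-in stub swap described above.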
In a similar vein, if you were to take a look at all of the different offerings that GCP has, what's one that you think is underappreciated in the larger community that you wish more people knew about?
I think that one of the biggest opportunities that people have, and that they don't really fully understand how to use, is the various cloud machine learning-based products. Everyone has machine learning as a giant buzzword.
But I really think that over the coming couple of years that we're going to see more people being able to use cloud machine learning in a way that makes business sense for them.
And that is a much easier way of doing things rather than feeling like, oh my god, I have to
go through all of this training in order to learn how to use it. So I think that that's kind of one
of the places that's going to grow fairly rapidly. Something that I'm seeing in the machine learning
space is people are concerned less with the how and looking less for technical enhancements in machine learning.
They're still stuck on the why.
I struggled with this for a little while myself, where I love the idea of the capability, but before getting into what it costs to train models and
run this stuff itself, I first need to understand how it applies to my life. And maybe this is a
limitation of my own lack of imagination, but I struggle to identify machine learning use cases
until they're explicitly pointed out to me. Is this uncommon? Am I just dense? Or is this something that tends to be
more industry-wide? I think that as people who build software and who think about reliability
and cost type things, we have a tendency to avoid things that are new and scary that we don't
understand. And I definitely could have counted myself in that camp a year or two ago
saying, why should we use machine learning on alerts?
That means that if it breaks, then we're not going to understand how to debug it, right?
So I think that that's definitely...
If you're not building consumer-facing products,
it's a lot harder to see the benefits of ML, and it's a lot easier to appreciate the
risks of it. Whereas for people that are trying to do consumer-facing things, like being able to
do object recognition, or being able to transcribe speech to text, or being able to transcribe
written words into text, or being
able to do machine translation, right? These are all things that are powered by machine learning,
right? And in fact, they're offered as prepackaged solutions rather than you must train your own
model. And I think that that's kind of an area that we overlook a lot as people that don't
necessarily think about the consumer-facing products quite as much. As with so many other things, it feels like it's an area that is rapidly evolving,
and we're going to start seeing improvements in that space relatively soon.
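The "prepackaged solutions rather than train your own model" point is easiest to see with one of the hosted APIs. Here is a rough sketch using the Cloud Translation client for Python; it assumes the google-cloud-translate package (the older v2-style client) is installed and application default credentials are configured, and the sample text and target language are placeholders.

```python
# Rough sketch: calling a prepackaged ML API (Cloud Translation) instead of
# training a model yourself. Assumes `pip install google-cloud-translate`
# (v2-style client) and application default credentials.
from google.cloud import translate_v2 as translate

def translate_text(text: str, target_language: str = "es") -> str:
    client = translate.Client()
    result = client.translate(text, target_language=target_language)
    # The v2 client returns a dict containing the translated string and the
    # detected source language.
    return result["translatedText"]

if __name__ == "__main__":
    print(translate_text("Screaming in the Cloud"))
```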
Speaking of, what's something that you see that GCP itself needs to, or could stand to improve upon?
I think that it is always a challenge to onboard people. There have been a
lot of improvements, but still, focusing on usability is something that GCP really needs to
get better at in order to be able to maintain a pace of growth. Because it really is people who are experimenting with GCP who decide to adopt it
just as much as it is people saying, you know, hey, I'm putting out a request to proposals from
the top three cloud providers for a $100 million contract, right? Those are both cases that we need
to pay attention to. And I think that the investment in kind of doing that
high touch cloud sales and support work also has to be accompanied with what are we doing for the
next generation of developers. I will say as someone who first picked up the GCP control
panel for a project a couple of months back, I was very pleasantly surprised.
At first, I thought it was going to go the opposite direction, where I did a quick project,
and then I was done. And now it was the fun prospect of hunt down all of the services that
I spun up and make sure they're turned off so I don't wind up with a surprise bill three months
later. And in Amazon world, that takes the better part
of a day. In GCP, it was click on the expansion thing next to the particular project and terminate
all billing resources. Now it pops up a scary warning that this will turn things off. Are you
okay with that? Which in this case I was. I clicked it and there was no step two. That was an eye-opening moment for me.
Yeah, I think that the set of features that are offered are very robust and powerful.
I think it's kind of a discoverability problem, where if I look at the GCP control panels,
I still am a little bit like I'm sitting in the cockpit of a space shuttle, right? There are so
many different options. And I think that that's the area that I wish that there were a little bit more effort
paid into. The first time I set something up in a cloud environment, I admit it. I'm like all of
the things I make fun of in some of my own talks. I click through the console, I spin a thing up,
and we're good. In Amazon land, great. How do I
convert that into code?
Good luck, idiot, is the
effective answer they give. With GCP,
it spits it out. Here's
a curl command that does exactly
what you'd want to do, and it's easily understood.
It breaks down the API calls,
and I can shove that into Terraform.
I can put it in a script. I can curl bash
it if I'd like to live very dangerously.
It lends itself to rapid and effective automation.
Rather than spinning something up, then I have to retrofit all of the code to it and then tear it down and hope I got everything right or I get to explore this whole area again.
That was transformative the first time I saw it.
I couldn't believe I was seeing it.
And then I very quickly moved on to, why isn't everything like this?
This is wonderful.
Yeah, I think that's in large part influenced by how we've done deployments with internal
Google technology for years and years is the idea of, yes, you have to be able to offer
sensible APIs and do turn-ups, turn-downs, and updates in a programmatic fashion.
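The console-to-code workflow Corey describes ends in exactly this kind of programmatic call. As a hedged illustration, the sketch below uses the Google API discovery client for Python to list Compute Engine instances, roughly what you would wire into a script or configuration management; it assumes the google-api-python-client package, application default credentials, and placeholder project and zone names.

```python
# Sketch of driving GCP through its API rather than the console. Assumes
# `pip install google-api-python-client` and application default credentials
# (e.g. `gcloud auth application-default login`). Project and zone are
# placeholders.
from googleapiclient import discovery

def list_instances(project: str, zone: str):
    compute = discovery.build("compute", "v1")
    result = compute.instances().list(project=project, zone=zone).execute()
    # The API returns a dict; "items" is absent when the zone has no instances.
    return result.get("items", [])

if __name__ == "__main__":
    for instance in list_instances("my-project", "us-central1-a"):
        print(instance["name"], instance["status"])
```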
So let's talk a little bit about you, rather than GCP, for a minute.
You've been at Google a decent amount of time.
How many years now?
Ten.
That is forever in cloud space.
And during that time, you went from being an individual contributor
to managing a team.
Now you're an individual contributor again.
Let's talk a little bit about that.
In many companies, that would be considered a demotion.
Google is one of the few companies that's very explicit about having a technical ladder that is distinct from the management ladder.
And going between ladders in one direction or the other
is absolutely not a promotion. It's a different job.
Yeah, absolutely. I think that one additional thing is that you can have direct reports
as someone who's on the individual contributor ladder. The difference is kind of really,
where are you focusing your time? How many reports do you have? So I've been on, I think,
eight teams now in 10 years at Google. So
it doesn't feel like that long because I only spend a year to two years in each place.
And the thing that I find that I do is when I feel like I've learned as much as I can out of
one team, I'll go and look for another team that's going to stretch me in some dimension
or teach me something new. And that's how I came to the decision to become a manager for a few years,
was that I really wanted to get some experience with helping people's career development
rather than just purely focusing on technology.
I think that even once I stopped being a manager,
that inner voice just doesn't turn off.
It's a skill that you acquire that you can hold on to and use in varying ways, even if you're not officially someone's manager.
So I think that everyone should give being a manager a shot at least once if that's something that you're interested in because it teaches you a lot.
It helps you better understand your company and helps you better understand how you're going to interact with people. But for me personally, kind of having this opportunity to try being a manager and then
discovering that I didn't actually want to, in the long term, have my career growth being tied to
how big of a scope the set of people I managed was responsible for, but that instead I wanted
to work on kind of cross-cutting projects that
are between multiple working groups. That was really a useful thing for me to learn and then
go and pivot. It's nice to see companies being supportive of that. In many environments,
making the transition you just described would have entailed at least three different companies.
Is it fair to say that Google is almost like a bunch of companies tied together
under one umbrella, even down at the relatively granular organization level? Or is this more a
story of Google being very supportive of people's needs as they grow? As a manager, you are taught
at Google to look out for the best interest of your reports,
even if it means that they may wind up leaving your team or going on to another job ladder.
So it's kind of your job to support people in developing their careers. I think that that
mindset and perspective, as opposed to, I'm going to keep this person on my team because they're
doing productive work on my team.
I think that's a huge difference from a lot of companies.
And as far as our culture, we have a culture that is fairly uniform between different teams.
We have a set of engineering tools that are fairly uniform between teams.
So as a result, sure, it may take you six months or even a year to become fully productive as an engineer at Google. Once you have that base set of skills, you can move between teams, kind of sharing that same cultural basis and sharing that same technical basis. And I think that that's one of the magical things about Google. So at this point, it would be fair to say that you'd
recommend working at Google to someone who was on the fence about it. I think that Google is a
company that is very self-aware of a lot of things,
that it knows how to do some things well,
and that it also has some areas in which it faces challenges that are not unique to Google,
but that are uniquely things that we're talking about,
kind of having public conversations within the company
or even sometimes external to the company
about what does it mean to have a culture of inclusion, right? And I think that it can be scary
sometimes on the outside, looking at that and saying, you know, oh my goodness, I'm not
sure if I want to work at Google because I've seen a bunch of awful stuff in the news about Google or whatever.
But I think that on balance, Google is a place where you can have a lot of opportunities to do impactful things. And not just technically impactful things, but things that are
culturally impactful for all of information technology. So that was a long-winded way of saying,
yes, I would recommend Google as a place to work.
But I do think that those are useful things
to think about as far as
what are you looking for in an opportunity?
What are your interests?
Do they match with what you would be doing at Google?
And I think that that's also an area
where you should really carefully talk to
whoever is the hiring manager
and make sure, is this team the right fit for me or not?
And if not, you can say no, and then your recruiter will find you some other team to
look at.
Is Google still in a hiring place where every person they bring aboard, doesn't matter if
they're there to clean whiteboards or do accounting,
they still put them through a CS 101 algorithms test? So the hiring mechanisms for software engineers test a mixture of,
can you write code?
Can you practically apply lessons from computer science?
And can you do systems design?
Kind of those three things are tested during the interview process.
For site reliability engineers, we don't necessarily mandate that people have previous computer science knowledge.
Because it's been advantageous to hire people that are systems engineers.
People who have real-world
practical experience with, this is how systems break, this is how we can engineer systems better.
So for those set of people that don't have that computer science background,
we tend to focus a lot more on interviewing people on troubleshooting, figuring out in a real
situation, what would you do in order to make
sure that the impact on customers was as low as possible to root cause and debug and kind of
bisect the problem? Or focusing much more on your systems design skills, or focusing much more on,
do you understand at least some area of the Linux stack or of the distributed system stack
in a way that you can practically describe
to someone who's interviewing you.
So I think the answer is yes,
if you are interviewing for a software engineering position,
you will probably be asked to do whiteboard coding.
You will probably be asked questions
that rely on having some degree of ability
to pick the right data structure,
pick the right algorithm.
But I think that there's kind of a range and flexibility,
at least as far as SRE is concerned.
Thank you, Liz.
One more question for you before we start wrapping up
and calling it a show.
Is there anything you're working on that you want to mention
or tell our listeners about?
Yeah, so I want to point people to two resources.
The first resource is on the Google Cloud Platform blog.
There is a set of posts made by my team,
the CRE Customer Reliability Engineering team,
and they're all called CRE Life Lessons.
We'll put a link to those in the show notes.
And then secondly, I have a project with Seth Vargo,
who's a developer advocate at Google,
who is working
with me on a set of YouTube videos that explain in five-minute chunks what is SRE, what are the
key principles of SRE, how can you apply them. So I'm really excited about this project, and we'll
put a link to that in the show notes as well. Wonderful. A follow-up question for you on that.
Do you see that cloud and SRE are intrinsically linked?
Can you have one without the other? Or is it more or less two completely separate concepts
smashed together in the form of one person, come to life?
So I think that SRE doesn't really specifically require that you do it in the cloud. So the key things about SRE are, number one,
do you have service level objectives and error budgets?
And number two, do you have limits on the amount of operational work
that you're doing in order to conserve your ability to do project work?
As long as you're doing those two things,
I posit what you're doing is SRE.
And neither of those two things specifically mandate
any kind of cloud deployment. However, I think that if you are trying to run a cloud
service at scale, and you're not adopting something in the SRE methodology space or in the DevOps
space, you're going to really struggle to operate your service, that it's going to result in having to hire a
bunch of people to do manual operational work because you're not setting targets for your
reliability, or you're not setting limits on how much operational load your systems can generate,
and you're not engineering that work away. So I think that cloud deployments require SRE to be successful, but I think that SREs can and have worked on systems that are not necessarily running in the cloud.
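Liz's second criterion, capping operational work to protect project time, can be expressed as a simple check. A minimal sketch follows, assuming a 50% cap (the commonly cited SRE figure) and hand-tracked hours; the numbers are illustrative.

```python
# Minimal sketch of the "limit on operational work" idea: compare toil
# (interrupts, tickets, on-call response) against project work and flag when
# it crosses a cap. The 50% cap is an assumption, used here for illustration.

def toil_fraction(toil_hours: float, project_hours: float) -> float:
    total = toil_hours + project_hours
    return toil_hours / total if total else 0.0

def over_ops_limit(toil_hours: float, project_hours: float,
                   cap: float = 0.5) -> bool:
    """True when operational load exceeds the cap and needs engineering away."""
    return toil_fraction(toil_hours, project_hours) > cap

if __name__ == "__main__":
    # Example week: 22 hours of toil versus 18 hours of project work.
    print(over_ops_limit(22, 18))  # True: time to engineer the work away
```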
Yes, to go any deeper on that one turns very much into the question of, well, what is cloud and what isn't it?
And down that path lies madness.
Indeed.
Thank you for joining me, Liz. I'm
Corey Quinn. This is Screaming in the Cloud.