Screaming in the Cloud - Episode 19: I want to build a world spanning search engine on top of GCP

Episode Date: July 19, 2018

Some companies that offer services expect you to do things their way or hit the highway. Google, by contrast, expects people to adapt the company's suggestions and best practices for their specific context. This is how things are done at Google, but it may not work in your environment. Today, we're talking to Liz Fong-Jones, a Senior Staff Site Reliability Engineer (SRE) at Google. Liz works on the Google Cloud Customer Reliability Engineering (CRE) team and enjoys helping people adapt reliability practices in a way that makes sense for their companies.

Some of the highlights of the show include:

- How Liz figures out an appropriate level of reliability for a service, and how a service is engineered to meet that target
- Staff SRE involves implementation, and then identifying and solving problems
- Google's CRE team makes sure Google Cloud customers can build seamless services on the Google Cloud Platform (GCP)
- Service Level Objectives (SLOs) include error budgets, service level indicators, and key metrics to resolve issues when technology fails
- Learn from failures through incident reports and shared postmortems; be transparent with customers and yourself
- GCP: Is it part of Google or not? It's not a division between old and new
- Perceptions and misunderstandings of how Google does things and how it's a different environment
- Google's efforts toward customer service and responsiveness to needs
- Migrating between different cloud providers vs. using higher-level services
- How to use cloud machine learning-based products
- GCP needs to focus on usability to maintain its pace of growth
- Offer sensible APIs; turn up, turn down, and update in a programmatic fashion
- Promotion vs. a different job: when you've learned as much as you can, look for another team to teach you something new
- What is cloud and what isn't? Cloud deployments require SRE to be successful, but SREs can work on systems that do not necessarily run in the cloud.
Links:

- Cloud Spanner
- Kubernetes
- Cloud Bigtable
- Google Cloud Platform blog - CRE Life Lessons
- Google SRE on YouTube
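The SLO and error-budget arithmetic that comes up in the conversation can be sketched in a few lines. This is an illustrative toy with hypothetical numbers (a 99.9% target, a two-hour annual budget, a 30-minute outage), not Google's actual SLOs or CRE tooling:

```python
# Toy illustration of SLO / error-budget arithmetic.
# The availability targets and outage durations here are hypothetical,
# not Google's actual SLOs or CRE tooling.

def error_budget_minutes(slo: float, window_days: int = 365) -> float:
    """Minutes of downtime an availability SLO permits over a window."""
    return window_days * 24 * 60 * (1.0 - slo)

def remaining_budget(budget_minutes: float, outages: list) -> float:
    """Budget left after subtracting each outage's duration (in minutes)."""
    return budget_minutes - sum(outages)

# A 99.9% target allows roughly 8.8 hours of downtime per year.
print(round(error_budget_minutes(0.999) / 60, 1))

# The example from the conversation: a two-hour annual budget,
# with one 30-minute outage already spent against it.
print(remaining_budget(budget_minutes=120, outages=[30]))
```

The point Liz makes is that the budget turns an outage from a crisis into a quantity: burning 30 minutes of a two-hour annual budget leaves an hour and a half to be conservative with.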

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This week's episode of Screaming in the Cloud is sponsored by ReactiveOps, solving the world's problems by pouring Kubernetes on them. If you're interested in working for a company that's fully remote and is staffed by clued people, or you have challenges
Starting point is 00:00:37 handling Kubernetes in your environment because it's new, different, and frankly, not your company's core competency, then reach out to ReactiveOps at reactiveops.com. Welcome to Screaming in the Cloud. I'm Corey Quinn. Joining me today is Liz Fong-Jones, who's a staff site reliability engineer at Google, who works on the Google Cloud customer reliability engineering team. She lives with her wife, Metamor, and two Samoyeds in Brooklyn. And in her spare time, she plays classical piano, leads an EVE Online Alliance, and advocates for transgender rights. Welcome to the show, Liz. Hi, Corey. It's great to be here. Well, thank you for joining me.
Starting point is 00:01:15 Let's start at the beginning. What is a staff site reliability engineer? So, Corey, I think you have to break that down into two pieces, which is let's talk about the SRE part first, and then we'll talk about kind of what the staff part means. So I'm a site reliability engineer, and we're the people that are the specialists in figuring out what's an appropriate level of reliability for a service, and how do we make sure that it is engineered to meet that target. So we don't target 100% availability, but we target a reasonable level of availability that meets our customers' requirements. So this means we write software, mostly because it turns out that
Starting point is 00:01:52 you can't really run a highly available service by doing a bunch of manual work. And we also make sure that when things go bump in the night, that we learn from them, that we are the people that do some degree of coordinating incident response and then figuring out what do we need to proactively do next time. So that's kind of in a nutshell, the SRE role. And then in terms of what it means to
Starting point is 00:02:14 be a staff SRE. So our career progression basically roughly goes, you know, you start off and we hand you a project and say, here's a design doc. Please go implement this. And then we eventually say, here's a problem. We know you can figure this out. Please write a design doc and solve it. And then eventually we ask you to figure out which problems are useful to solve over the next year. What should the team be focusing on over the next year? And the place in my career where I'm at right now,
Starting point is 00:02:42 where I'm a staff site reliability engineer, is I work with many different teams to kind of coordinate their roadmaps figure out what is it that we need to work together on so kind of the thing that makes me a staff engineer rather than a senior engineer is that aspect of looking outside of my team I would take it a step further having seen some of the conference talks that you've given. Something that has always been very distinctive about how you frame things has been the way that you set context around, this is how we do things at Google. This may not work in your environment. And that's a theme that I don't see emerging very often among speakers from large, well-respected tech companies. Yeah, that's totally a response, I think, to a lot of criticism that I've seen around companies saying, you know, do it our way or the highway, right? And I think Charity Mater's phrases this really well, that it's a matter of context, that you have to make sure
Starting point is 00:03:41 that people are adapting your suggestions and your best practices for their specific context. And I think that that's a thing that makes me really excited about my current role is being able to focus on helping people adapt reliability practices in a way that makes sense for their companies. So taking it a step further, you mentioned that you were on the Google Cloud Customer Reliability Engineering team. Yeah. What is that? So the CRE team, or Customer Reliability Engineering team, focuses on making sure that Google Cloud customers understand how is it that their services that they're building on top of
Starting point is 00:04:21 the GCP platform, making sure that they are able to build services that operate seamlessly. And this is something where you can do a lift and shift onto the cloud platform, but that's not really the thing that's going to give you the full benefits, that you have to look beyond that and figure out how am I designing this?
Starting point is 00:04:41 How am I architecting this? Am I integrating my operations with my cloud platform providers' operations? So that's the thing that we think about a lot is, how do we make sure that all of our major customers have SLOs, service-level objectives, that make sure that they have error budgets, make sure that they have their service-level indicators and key metrics available to people in GCP, and that our key metrics are exposed to them for their usage of our platform so that we can improve everyone's time to resolve issues so that when there is an outage, it maybe stings less and that we're setting an expectation of this is what our service level
Starting point is 00:05:19 objective is. You can expect us to deliver that, but you cannot expect us to deliver 100% reliability because it would be prohibitively expensive. So that's kind of what our team does is we act as a conduit between customers and Google to make sure that we're integrating and operating efficiently and that we are using best practices. One thing that I guess is common to SREs across the entire spectrum is this stuff is very complex and outages invariably happen. Technology fails, human beings fail, software has bugs in it. And it's one of those areas
Starting point is 00:05:55 where no one remembers all of the times that you were up, but that one time where a thing fell over, depending on what it was, you'll hear about that years later, and it becomes almost the narrative that defines it. There were a couple of notable outages in the last couple of years for AWS and for GCP, and you see that people still tend to bring those up. How do you, I guess, transition away from the narrative being, we keep things up and running except that one time we didn't. And I guess, turn that into a story that culturally drives the narrative of outages happen,
Starting point is 00:06:32 we don't try to yell at people for them, we try to improve so that those outages don't recur, and those systems become more robust with time. I think there's a couple of angles that you can take on that, one of which is incident reports and shared postmortems, right? The idea that when you have a large failure or even a small failure, that you talk to the people that were affected and you go over kind of this is what went wrong on our side. This is what we're doing to make sure that this can't happen again. And furthermore, we talk kind of more concretely about what's the impact in your error budget, right? You know, if your error budget
Starting point is 00:07:09 says you can be down for two hours per year, and we eat 30 minutes of that error budget, then we can talk about what are we going to do to be more conservative with the other hour and a half remaining. I think that the other piece of doing shared postmortems is that it enables you to do things like say, hey, yes, there was an outage, but these are some mitigating strategies that would have mitigated the impact for you earlier, right? And then kind of work together on figuring out how to implement them. So it's kind of a mechanism of turning instance from, oh my God, everything exploded to this is a learning opportunity. This is what we're going to learn from it, this is what we did learn from it. One thing that tends to stand out about Google is that they're very open with the learnings that come out of outages, that come out of various crises, that as they solve these global world-spanning problems, they release white papers, they talk more about how they're thinking about these problems and what they're doing to address them than most other companies. In many cases, you'll see an outage that makes headlines, and the company will release a very thin root cause analysis or
Starting point is 00:08:20 post-mortem or whatever term we're using this week, that turns into a narrative of, there was a problem, we fixed it, and it doesn't go deeper than that. Is that just been part of Google's culture forever? Or was there something that drove that, that led to an awakening? I think that there's two aspects to that. One of which is that we, as well as many other companies, do produce very robust internal post-mortems. I think that's kind of a prerequisite to having that level of openness with your customers is to be open with yourself first. But as far as explaining kind of what's going on under the hood to customers, it's really important to, if you have a system that is really scary for customers to understand, for instance,
Starting point is 00:09:03 when GCP came out with the Google App Engine, even before it was called GCP, or when we came out with Cloud Spanner, these are things that it's really hard for people to get a visceral sense of what are the risks involved here? How am I going to make this? How is this engineered? How can I be confident in how it's going to work and why the failure patterns I might expect aren't there or the failure patterns I have seen have been remediated? So I think having so much technology that doesn't necessarily have a large number of parallels at the time that it was released tends to motivate us to be a lot more transparent than we otherwise would be. To that end, and this, I guess, could break down in either a technical direction or a culture direction, and I'm thrilled to explore both, but GCP is sort of perceived as being part of Google, but not part of Google, at least from those of us outside trying to read
Starting point is 00:09:59 the tea leaves. For example, search, to my knowledge, does not run on top of GCP for the technical side of it. From the culture side, I'm wondering how embedded the GCP teams are compared to the rest of quote-unquote Google proper. Yeah, totally. a large user of GCP in terms of having a large number of very mission-critical corporate applications running on top of GCP. So, for instance, things like our company directory or things that are relatively, you can say that they're kind of not very serious, except they are in the sense of these are applications of the form that many customers outside of Google want to bring to GCP.
Starting point is 00:10:47 Things like our financial system, things like our company directory, things like our internal meme generator, all the way up to eventually being able to run engineers' workstations on GCP. So that's kind of one angle to think about is how are we, you know, most customers aren't coming to GCP and saying, I want to build a world-spanning web search engine on top of GCP. So I think the set of applications that we've chosen to run on top of GCP that are developed by Google represent user workloads fairly well. And then some applications just don't make financial sense to run on top of GCP because virtualization
Starting point is 00:11:32 imposes overhead. And the security requirements that we have and the performance requirements that we have just mean that it makes sense to not impose that extra overhead. So it's kind of a new development versus old development type of thing, as well as thinking about the requirements of the application. And then I think on the culture front, GCP has been built by the very same teams who developed the underlying original Google infrastructure. And it's run by the same sre teams like one sre team will be responsible both for running the blob storage system and the google cloud storage system that's one team right that's not multiple teams so there's not really a division between kind of what you're describing as old Google and new Google. So where we can, we just kind of, we use the lessons that we learn from having operated almost a legacy service with an enormously complicated API.
Starting point is 00:12:34 And we say, how much of that do customers really need? Why don't we just simplify it? Because we're not constrained by having to haul around a 15-year-old API. So I think that that is really the crux of GCP development, is that it's not a division between old and new. It's instead the same people who developed the old developing the new
Starting point is 00:12:53 and bringing all the lessons that we've learned from it. There definitely is a distinction between building technical infrastructure at Google, whether it be GCP or not GCP, and building products like Google Web Search or ads. That's definitely true in that there's a difference between being someone who develops technical infrastructure and someone who uses technical infrastructure. But even so, there are definitely blurry lines. For instance, a lot of the work that the ads SRE teams have done has been building platforms that make it possible for individual ad development teams to basically implement their business logic on top of a framework that's going to work reliably for them.
Starting point is 00:13:37 So I think that that's an area in which someone can come to a GCP team and feel very comfortable is this idea of you're building infrastructure. One theme that tends to emerge is that you'll see people in relatively small companies that are getting off the ground talking about how Google does things and how it's a very different environment there. And often in a sort of disparaging way of, well, I was talking to my friend at Google and they spin things up with just one command line, and you have an entire environment. Why can't we do that? Without understanding that two decades of very intelligent engineers working to build out infrastructure tooling to the point where it is push-button-receive-cluster is a non-trivial investment for a company to make. And most companies are not going to make that leap. And that's been something that has, I guess, eluded people's understanding for a long time. That said, it does feel like GCP is aiming at solving for that problem. You effectively get Google class infrastructure billed by the hour or second,
Starting point is 00:14:43 depending on how you want to slice that. Is that a fair assessment? Yeah, I think that's a totally reasonable assessment to make, is that having a lot of these developer productivity tools available for the first time in GCP means that companies don't have to reinvent the wheel every time, that they can instead make use of our investment in that technology. So to that end, what do you wish that people understood better about GCP out here in the wilds that are not Google? I think that the main thing that I wish that people understood better about GCP is the notion that we want to be not just a, you know, have a vendor-customer relationship, but instead to have a partnership relationship with large customers.
Starting point is 00:15:31 And I think that that's a situation where people say, you know, oh, I just want to compare on price or I just want to compare on features. But I think that there's a difference between kind of buying interchangeable widgets and actually working together on building a shared system that incorporates the best of Google's technology and lets you innovate on top of that. And I think that that's kind of one misunderstanding they see people having when they're looking at what's our cloud migration strategy. Getting there, I guess, from even a customer service perspective, has been a somewhat interesting road. Historically, Google was very focused on not having a customer service department. You should be able to have the system just work,
Starting point is 00:16:17 and staffing a call center back in the early days for a search engine wasn't an area in which the company was prepared to invest. Now that you're running companies' production infrastructure at very large scale for a wide variety of clients, that requires a level of engagement with those enterprise customers that looks a lot like a traditional model that you've seen with Microsoft, Oracle, etc. for the past many decades. What has that transition been like? I guess Google has woken up to the idea of, Oracle, etc. for the past many decades. What has that transition been like? As I guess Google has woken up to the idea of,
Starting point is 00:16:48 huh, a frequently asked questions list probably isn't going to cut it when people are dropping tens of millions of dollars a year on this. Yeah, I think that that's something that Diane Greene has been super sharp about, that she's recognized that challenge and better positioned Google to be responsive to the needs of large customers. In the past year, even from my perspective dealing with customers that
Starting point is 00:17:12 are both in GCP and AWS, I've seen a market improvement in that respect. So it's definitely something that is evolving rapidly. The challenge in any sort of corporate reputational style of thing is that it takes time to make the change, but far longer for the reputation of the way things used to be to fade. It's sort of the curse of success. When you're a household name, people form opinions and don't change them even in the light of new information. Yeah, that's definitely a mindset and mindshare issue that we are hoping to address in part by talking about kind of what are we doing? How does it impact developers?
Starting point is 00:17:57 And that's kind of why I really like working so much with our developer advocacy team in terms of getting those kinds of messages out there about, hey, here's what's going on. Here's some reasons why you should look and see whether GCP makes sense for you. And if it doesn't make sense for you, we'll be the first people to tell you that as well. To that end, something that a lot of companies like to talk about is remaining provider agnostic, where they could, in theory, pick up their thing, whatever it looks like, from AWS and move it to GCP,
Starting point is 00:18:29 or from GCP into this rickety ancient data center that's falling to pieces, or wherever they want to move things. And I understand wanting that security blanket. As a counterpoint, you're offering some very differentiated higher-level things, such as Google's Cloud Spanner, a world-spanning ACID compliance database that effectively lets you treat it like any other SQL database, except it's in multi-regions. You can write to it,
Starting point is 00:18:56 you can read from it from anywhere on the planet. Technically, this is amazing. From a business perspective, rolling out an application built around something like this is in some cases considered a non-starter because it doesn't seem like there's another option. Well, what if Google decides they want to turn all of GCP off and or burn themselves to the ground and or just go out of business and sell hats or something. Great, awesome. I don't see those things happening, but people at least want a theoretical Exodus story. How do you find that the desire, even if unrealized, for lock-in
Starting point is 00:19:33 competes with the ability to, at least in theory, be cloud agnostic? I think that that's, in some way, it's a matter of choice. It's a business decision that companies can make. Do you want to deal with the operability headache of keeping all of your services on raw VMs and being able to migrate those VM-based workloads between different cloud providers that all offer VMs? Or do you want to use higher level services? And I think that there's a tremendous
Starting point is 00:20:07 amount of interest in even making some of those higher level services available cross cloud. If you look at what's going on with Kubernetes right now, it's a huge situation where GCP offers Google Kubernetes engine, obviously, but there are many other cloud providers that also offer Kubernetes-based services. And that's kind of an opportunity to do something that is differentiated, but is also something that people can choose to migrate if they choose to. Another example that I'd offer there is Cloud Bigtable. I used to work on Cloud Bigtable before my current team. And one of the selling points of Cloud Bigtable is you can operate your service just against a regular HBase backend that you maintain yourself. Or you can choose to run that workload against Cloud Bigtable.
Starting point is 00:20:58 And it all uses the same API. You basically have to compile in the stub to talk to Cloud Bigtable and you're done. So I think that that's definitely kind of the best of both worlds where everyone is using the common standard. They may have different implementations on the back end. So in a way, right, like given that Cloud Spanner is very SQL-like, if you are willing to forego some of the technical benefits, you could go and use a different SQL-like application if you really wanted to. And it might be less reliable or less performant. And I think that there's also the angle of if you really do care about sticking to kind of the common denominators, then you can choose to use Cloud SQL instead.
Starting point is 00:21:44 You can choose to run MySQL databases on raw VMs if you happen to have that particular strain of masochism. So there's a variety of different options, and it's just engineering trade-offs that people have to choose. In a similar vein, if you were to take a look at all of the different offerings that GCP has, what's one that you think is underappreciated in the larger community that you wish more people knew about? I think that one of the biggest opportunities that people have that they don't really fully understand how to use is the various cloud machine learning-based products that everyone has machine learning as a giant buzzword. But I really think that over the coming couple of years that we're going to see more people being able to use cloud machine learning in a way that makes business sense for them. And that is a much easier way of doing things rather than feeling like, oh my god, I have to go through all of this training in order to learn how to use it. So I think that that's kind of one
Starting point is 00:22:51 of the places that's going to grow fairly rapidly. Something that I'm seeing in the machine learning space is people are concerned less with the how and looking less for technical enhancements in machine learning. They're still stuck on the why. I struggled with this for a little while myself, where I love the idea of, capability, what it costs to train models and run this stuff itself, I first need to understand how it applies to my life. And maybe this is a limitation of my own lack of imagination, but I struggle to identify machine learning use cases until they're explicitly pointed out to me. Is this uncommon? Am I just dense? Or is this something that tends to be more industry-wide? I think that as people who build software and who think about reliability
Starting point is 00:23:54 and cost type things, we have a tendency to avoid things that are new and scary that we don't understand. And I definitely could have counted myself in that camp a year or two ago saying, why should we use machine learning on alerts? That means that if it breaks, then we're not going to understand how to debug it, right? So I think that that's definitely... If you're not building consumer-facing products, it's a lot harder to see the benefits of ML, and it's a lot easier to appreciate the risks of it. Whereas for people that are trying to do consumer-facing things, like being able to
Starting point is 00:24:35 do object recognition, or being able to transcribe speech to text, or being able to transcribe written words into text, or being able to transcribe written words into text or being able to do machine translation, right? These are all things that are powered by machine learning, right? And in fact, they're offered as prepackaged solutions rather than you must train your own model. And I think that that's kind of an area that we overlook a lot as people that don't necessarily think about the consumer-facing products quite as much. As with so many other things, it feels like it's an area that is rapidly evolving, and we're going to start seeing improvements in that space relatively soon. Speaking of, what's something that you see that GCP itself needs to, or could stand to improve upon?
Starting point is 00:25:21 I think that it is always a challenge to onboard people. There have been a lot of improvements, but still, focusing on usability is something that GCP really needs to get better at in order to be able to maintain a pace of growth. Because it really is people who are experimenting with GCP who decide to adopt it just as much as it is people saying, you know, hey, I'm putting out a request to proposals from the top three cloud providers for a $100 million contract, right? Those are both cases that we need to pay attention to. And I think that the investment in kind of doing that high touch cloud sales and support work also has to be accompanied with what are we doing for the next generation of developers. I will say as someone who first picked up the GCP control
Starting point is 00:26:20 panel for a project a couple of months back, I was very pleasantly surprised. At first, I thought it was going to go the opposite direction, where I did a quick project, and then I was done. And now it was the fun prospect of hunt down all of the services that I spun up and make sure they're turned off so I don't wind up with a surprise bill three months later. And in Amazon world, that takes the better part of a day. In GCP, it was click on the expansion thing next to the particular project and terminate all billing resources. Now it pops up a scary warning that this will turn things off. Are you okay with that? Which in this case I was. I clicked it and there was no step two. That was an eye-opening moment for me.
Starting point is 00:27:06 Yeah, I think that the set of features that are offered are very robust and powerful. I think it's kind of a discoverability problem, where if I look at the GCP control panels, and I still am a little bit like I'm sitting in the cockpit of a space show, right? There are so many different options. And I think that that's the area that I wish that there were a little bit like I'm sitting in the cockpit of a space show, right? Like there are so many different options. And I think that that's the area that I wish that there were a little bit more effort paid into. The first time I set something up in a cloud environment, I admit it. I'm like all of the things I make fun of in some of my own talks. I click through the console, I spin a thing up, and we're good. In Amazon land, great. How do I convert that into code?
Starting point is 00:27:47 Good luck, idiot, is the effective answer they give. With GCP, it spits it out. Here's a curl command that does exactly what you'd want to do, and it's easily understood. It breaks down the API calls, and I can shove that into Terraform. I can put it in a script. I can curl bash
Starting point is 00:28:03 it if I'd like to live very dangerously. It lends itself to rapid and effective automation. Rather than spinning something up, then I have to retrofit all of the code to it and then tear it down and hope I got everything right or I get to explore this whole area again. That was transformative the first time I saw it. I couldn't believe I was seeing it. And then I very quickly moved on to, why isn't everything like this? This is wonderful. Yeah, I think that's in large part influenced by how we've done deployments with internal
Starting point is 00:28:37 Google technology for years and years is the idea of, yes, you have to be able to offer sensible APIs and do tear up, turn, and updates in a programmatic fashion. So let's talk a little bit about you, rather than GCP, for a minute. You've been at Google a decent amount of time. How many years now? Ten. That is forever in cloud space. And during that time, you went from being an individual contributor
Starting point is 00:29:02 to managing a team. Now you're an individual contributor again. Let's talk a little bit about that. In many companies, that would be considered a demotion. In Google, it's one of the few companies that's very explicit about having a technical ladder that is distinct from the management ladder. And going between ladders in one direction or the other is absolutely not a promotion. It's a different job. Yeah, absolutely. I think that one additional thing is that you can have direct reports
Starting point is 00:29:34 as someone who's on the individual contribution ladder. The difference is kind of really, where are you focusing your time? How many reports do you have? So I've been on, I think, eight teams now in 10 years at Google. So it doesn't feel like that long because I only spend a year to two years in each place. And the thing that I find that I do is when I feel like I've learned as much as I can out of one team, I'll go and look for another team that's going to stretch me in some dimension or teach me something new. And that's how I came to the decision to become a manager for a few years, was that I really wanted to get some experience with helping people's career development
Starting point is 00:30:15 rather than just purely focusing on technology. I think that even once I stopped being a manager, that inner voice just doesn't turn off. It's a skill that you acquire that you can hold on to and use in varying ways, even if you're not officially someone's manager. So I think that everyone should give being a manager a shot at least once if that's something that you're interested in because it teaches you a lot. It helps you better understand your company and helps you better understand how you're going to interact with people. But for me personally, kind of having this opportunity to try being a manager and then discovering that I didn't actually want to, in the long term, have my career growth tied to how big of a scope the set of people I managed was responsible for, but that instead I wanted
Starting point is 00:31:02 to work on kind of cross-cutting projects that are between multiple working groups. That was really a useful thing for me to learn and then go and pivot. It's nice to see companies being supportive of that. In many environments, making the transition you just described would have entailed at least three different companies. Is it fair to say that Google is almost like a bunch of companies tied together under one umbrella, even down at the relatively granular organization level? Or is this more a story of Google being very supportive of people's needs as they grow? As a manager, you are taught at Google to look out for the best interest of your reports,
Starting point is 00:31:48 even if it means that they may wind up leaving your team or going on to another job ladder. So it's kind of your job to support people in developing their careers. I think that that mindset and perspective, as opposed to, I'm going to keep this person on my team because they're doing productive work on my team. I think that's a huge difference from a lot of companies. And as far as our culture, we have a culture that is fairly uniform between different teams. We have a set of engineering tools that are fairly uniform between teams. So as a result, sure, it may take you six months or even a year to become fully productive as an engineer at Google. Once you have that base set of skills, you can take kind of sharing that same cultural basis and sharing that same technical basis. And I think that that's one of the magical things about Google. So at this point, you would be fair to say that you'd
Starting point is 00:32:54 recommend working at Google to someone who was on the fence about it. I think that Google is a company that is very self-aware of a lot of things, that it knows how to do some things well, and that it also has some areas in which it faces challenges that are not unique to Google, but that are uniquely things that we're talking about, kind of having public conversations within the company or even sometimes external to the company about what does it mean to have a a a culture of inclusion right and i think that it can be scary
Starting point is 00:33:35 sometimes on the outside looking at that and saying you know oh my goodness like i'm i'm not sure if i want to work at google because i've seen a bunch of awful stuff in the news about Google or whatever. But I think that on balance, Google is a place where you can have a lot of opportunities to do impactful things. And not just technically impactful things, but things that are culturally impactful for all of information technology. So that was a long-winded way of saying, yes, I would recommend Google as a place to work. But I do think that those are useful things to think about as far as what are you looking for in an opportunity?
Starting point is 00:34:14 What are your interests? Do they match with what you would be doing at Google? And I think that that's also an area where you should really carefully talk to whoever is the hiring manager and make sure, is this team the right fit for me or not? And if not, you can say no, and then your recruiter will find you some other team to look at.
Starting point is 00:34:34 Is Google still in a hiring place where every person they bring aboard, doesn't matter if they're there to clean whiteboards or do accounting, they still put them through a CS 101 algorithms test? So the hiring mechanisms for software engineers test a mixture of, can you write code? Can you practically apply lessons from computer science? And can you do systems design? Kind of those three things are tested during the interview process. For site reliability engineers, we don't necessarily mandate that people have previous computer science knowledge.
Starting point is 00:35:19 Because it's been advantageous to hire people that are systems engineers. People who have real-world practical experience with, this is how systems break, this is how we can engineer systems better. So for those set of people that don't have that computer science background, we tend to focus a lot more on interviewing people on troubleshooting, figuring out in a real situation, what would you do in order to make sure that the impact on customers was as low as possible to root cause and debug and kind of bisect the problem? Or focusing much more on your systems design skills, or focusing much more on,
Starting point is 00:35:58 do you understand at least some area of the Linux stack or of the distributed system stack in a way that you can practically describe to someone who's interviewing you. So I think the answer is yes, if you are interviewing for a software engineering position, you will probably be asked to do whiteboard coding. You will probably be asked questions that rely on having some degree of ability
Starting point is 00:36:22 to pick the right data structure, pick the right algorithm. But I think that there's kind of a range and flexibility, at least as far as SRE is concerned. Thank you, Liz. One more question for you before we start wrapping up and calling it a show. Is there anything you're working on that you want to mention
Starting point is 00:36:38 or tell our listeners about? Yeah, so I want to point people to two resources. The first resource is on the Google Cloud Platform blog. There is a set of posts made by my team, the CRE Customer Reliability Engineering team, and they're all called CRE Life Lessons. We'll put a link to those in the show notes. And then secondly, I have a project with Seth Vargo,
Starting point is 00:37:02 who's a developer advocate at Google, who is working with me on a set of YouTube videos that explain in five-minute chunks what is SRE, what are the key principles of SRE, how can you apply them. So I'm really excited about this project, and we'll put a link to that in the show notes as well. Wonderful. A follow-up question for you on that. Do you see that cloud and SRE are intrinsically linked? Can you have one without the other? Or is it more or less two completely separate concepts smashed together in the form of one person, brought to life?
Starting point is 00:37:36 So I think that SRE doesn't really specifically require that you do it in the cloud. So the key things about SRE are, number one, do you have service level objectives and error budgets? And number two, do you have limits on the amount of operational work that you're doing in order to conserve your ability to do project work? As long as you're doing those two things, I posit what you're doing is SRE. And neither of those two things specifically mandate any kind of cloud deployment. However, I think that if you are trying to run a cloud
Starting point is 00:38:10 service at scale, and you're not adopting something in the SRE methodology space or in the DevOps space, you're going to really struggle to operate your service, that it's going to result in having to hire a bunch of people to do manual operational work because you're not setting targets for your reliability, or you're not setting limits on how much operational load your systems can generate, and you're not engineering that work away. So I think that cloud deployments require SRE to be successful, but I think that SREs can and have worked on systems that are not necessarily running in the cloud. Yes, to go any deeper on that one turns very much into the question of, well, what is cloud and what isn't it? And down that path lies madness. Indeed.
Starting point is 00:39:03 Thank you for joining me, Liz. I'm Corey Quinn. This is Screaming in the Cloud.
