Screaming in the Cloud - Episode 3: Turning Off Someone Else's Site as a Service

Episode Date: March 28, 2018

How do you encourage businesses to pick Google Cloud over Amazon and other providers? How do you advocate for selecting Google Cloud and being successful on that platform? Google Cloud is not just a toy with fun features, but a capable cloud service. Today, we’re talking to Seth Vargo, a Senior Staff Developer Advocate at Google. Previously, he worked at HashiCorp in a similar advocacy role and worked very closely with Terraform, Vault, Consul, Nomad, and other tools. He left HashiCorp to join Google Cloud and talk about those tools and his experiences with Chef and Puppet, as well as the communities surrounding them. He wants to share with you how to use these tools to integrate with Google Cloud and help drive product direction. Some of the highlights of the show include:

- Billing is one of Google Cloud's strengths. The button you click in the user interface to disable billing across an entire project and delete all billable resources has an API, so you can build a chat bot or a script that does the same thing.
- The console represents anything you've done by pointing and clicking, and shows you what that looks like in code form.
- Permissions extend across accounts, which matters because turning off someone else's website as a service can be beneficial. You can invite anyone with a Google account, not just ‘@gmail.com’ but ‘@’ any domain, and give them admin or editor permissions across a project. They're effectively part of your organization within the scope of that project. For example, this is useful for training, or for a consultant who needs to see all of their different clients in one dashboard while the clients can't see each other.
- Google is a household name. However, it's important to recognize that advocacy is not just external advocacy; there's an internal component to it. There are many parts of Google and many features of Google Cloud that people aren't aware of.
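The disable-billing API mentioned above can be sketched as a plain REST call. This is a hedged sketch: the endpoint shape is an assumption based on the Cloud Billing API's `projects.updateBillingInfo` method (clearing `billingAccountName` detaches the billing account), and the helper function name is illustrative, not part of any Google client library.

```python
# Sketch of the "disable billing for a project" call discussed in the episode.
# Assumption: the Cloud Billing REST API's updateBillingInfo endpoint, where an
# empty billingAccountName means "no billing account attached". We only build
# the request here; sending it would require an authenticated HTTP client.
import json


def disable_billing_request(project_id: str):
    """Build (but do not send) the REST request that detaches billing."""
    url = f"https://cloudbilling.googleapis.com/v1/projects/{project_id}/billingInfo"
    body = json.dumps({"billingAccountName": ""})  # empty name == billing disabled
    return ("PUT", url, body)


method, url, body = disable_billing_request("my-demo-project")
print(method, url)
```

In practice you would send this with an authenticated client (or via an official client library), or wire it into the kind of chat bot Seth describes.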
As an advocate, Seth’s job is to help people win. Besides showing people how they can be successful on Google Cloud, Seth focuses on strategic complaining. He is deeply ingrained in several DevOps and configuration management communities, which provide him with positive and negative feedback. It’s his job to take that feedback and convert it into meaningful action items for product teams to prioritize and put on roadmaps. Then, the voice of the communities is echoed in the features and products being developed internally. Amazon has been in the Cloud business for a long time. What took Google so long? For a long time, Google was perceived as being late to the party and unable to offer services as comprehensive and mature as Amazon’s. Now, Google Cloud comes across as neither substandard nor a toy that isn’t fit for serious business. It’s a fully featured platform, and the choice comes down to preferences and pre-existing environments, not capability. Small and mid-size companies typically pick a Cloud provider and stick with their choice. Larger companies and enterprises, such as Fortune 50 and Fortune 500 companies, pick multiple Clouds. This is usually due to some type of legal compliance issue, or because certain Cloud providers have specific features where they shine. Externally at Google, there is the Deployment Manager tool at cloud.google.com. It’s the equivalent of CloudFormation, and teams at Google are staffed full time to perform engineering work on it. Every API that you reach by clicking a button on cloud.google.com or viewing the API docs is accessible via Deployment Manager. Google Cloud also partners with open source tools and the corresponding companies. There are people at Google, paid by Google, who work full time on open source tools like Terraform, Chef, and Puppet. This allows you to provision Google Cloud resources using the tools that you prefer.
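As an illustration of the Deployment Manager approach described above, a minimal configuration might look like the following. This is a sketch, not a definitive template: the bucket name is a placeholder (bucket names must be globally unique), and the deployment name in the command is illustrative.

```yaml
# Minimal Deployment Manager config: declares one Cloud Storage bucket.
# Deploy with (hypothetical deployment name):
#   gcloud deployment-manager deployments create my-deployment --config config.yaml
resources:
- name: my-example-bucket-12345   # placeholder; must be globally unique
  type: storage.v1.bucket
```

As with CloudFormation, the configuration is declarative: you describe the resources, and the service creates, updates, or deletes them to match.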
According to Seth, there are five key pillars of DevOps: 1) Reduce organizational silos and break down barriers between teams; 2) Accept failure as the norm; 3) Implement gradual change; 4) Leverage tooling and automation; and 5) Measure everything. Think of DevOps as an interface in a programming language like Java: it doesn’t actually define what you do, but gives you a high-level description of what the function is supposed to implement. The SRE discipline is a prescribed way of performing those five pillars of DevOps. Specific tools and technologies used within Google, some of which are exposed publicly as part of Google Cloud, enable that kind of DevOps culture and DevOps mindset. Part of the reason abstract classes exist in programming is that there’s more than one way to solve a problem, and SRE is just one of those ways. It’s the way that has worked best for Google, and it has worked best for a number of customers that Google is working with. But there are other ways, too. Google supports those ways and recognizes that there isn’t just one path to operational success, but many ways to reach that prosperity. The book, Site Reliability Engineering, describes how Google does SRE, which Google evangelized to the world because it can help people improve their operations. The flip side of that is that organizations need to be cognizant of their own requirements. Google has always been held up, along with several other companies, as a shining beacon of how infrastructure management could be. But some say there are still problems with its infrastructure, even after 20-some years and billions invested. Every company has problems, some of them technical, some cultural. Google is no exception. The one key difference is the way Google handles issues from a cultural perspective. It focuses on fixing the problem and making sure it doesn’t happen again. There’s a very blameless culture. Conferences tend to include a lot of hand waving and storytelling.
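Seth's interface-and-class analogy can be sketched in Python. The class and method names here are illustrative, not from any real framework: `DevOps` declares the five pillars without saying how to do them, and `SRE` is one concrete class (of many possible ones) that implements the interface.

```python
# The analogy from the episode: DevOps is an abstract interface naming the
# five pillars; SRE is one class that implements it. Names are illustrative.
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five pillars as an interface: what to do, not how."""

    @abstractmethod
    def reduce_silos(self): ...

    @abstractmethod
    def accept_failure(self): ...

    @abstractmethod
    def implement_gradual_change(self): ...

    @abstractmethod
    def tooling_and_automation(self): ...

    @abstractmethod
    def measure_everything(self): ...


class SRE(DevOps):
    """Google's implementation: one class among many possible ones."""

    def reduce_silos(self):
        return "shared ownership between product and SRE teams"

    def accept_failure(self):
        return "blameless postmortems and error budgets"

    def implement_gradual_change(self):
        return "small, easily rolled-back releases"

    def tooling_and_automation(self):
        return "automate away toil"

    def measure_everything(self):
        return "SLIs and SLOs"


print(SRE().measure_everything())
```

An agile shop or a different operations practice would simply be another class implementing the same interface, which is exactly the point of the analogy.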
But as an industry, more war stories need to be told instead of pleasure stories. Conference organizers want to see sunshine and rainbows because that sells tickets and makes people happy. The systemic problem is how to talk about problems out in the open. Becoming frustrated and trying to figure out why computers do certain things relates to a key component of the SRE discipline referred to as toil: work tied to systems that either we don’t understand or that doesn’t make sense to automate. Those going to Google Cloud to ‘move and improve’ tend to be a mix of those from other Cloud providers and those from...

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode of Screaming in the Cloud is sponsored by my friends at GorillaStack. GorillaStack's a unique automation solution for cloud cost optimization, which of course is something near and dear to my heart. By day, I'm a consultant who fixes exactly one
Starting point is 00:00:38 problem, which is the horrifying AWS bill. Every organization eventually hits a point where they start to really, really care about their cloud spend, either in terms of caring about the actual dollars and cents that they're spending, or in understanding what teams or projects are costing money and starting to build predictive analytics around that. And it turns out that early on in my consulting work, I spent an awful lot of time talking with some of my clients about a capability that GorillaStack has already built. There's a laundry list of analytics offerings in this space that tell you what you're spending and where it goes, and then they stop.
Starting point is 00:01:19 Or worse, they slap a beta label on that side of it for remediation and then say that they're not responsible for anything or everything that their system winds up doing. So some folks try and go in a direction of doing things to write their own code, such as spinning down developer environments out of hours, bolting together a bunch of different services to handle snapshot aging, having a custom Slack bot that you build that alerts you when your budget's hitting a red line. And this is all generic stuff. It's the undifferentiated heavy lifting that's not terribly specific to your own specific environment. So why build it when you can buy it? GorillaStack does all of this. Think of it more or less like if this, then that, IFTTT for AWS. It can manage resources. It can alert folks when things
Starting point is 00:02:07 are about to turn off. It keeps people appraised of what's going on. More or less the works. Go check them out. They're at GorillaStack.com, spelled exactly like it sounds. Gorilla like the animal. Stack as in a pile of things. Use the discount code SCREAMING for 15% off the first year. Thanks again for your% off the first year. Thanks again for your support, Gorillastack. Appreciate it. Hello and welcome to Screaming in the Cloud. Today I'm joined by Seth Vargo, a senior staff developer advocate at Google. Thanks for joining me, Seth. Hey, thanks for having me, Corey. Excited to be here today.
Starting point is 00:02:44 Always a pleasure to talk to you. So you, until very recently, were at HashiCorp for a while talking, do effectively the same type of advocacy work, to my understanding. Yeah, I left HashiCorp a few months ago to join Google Cloud. I was working very closely with tools like Terraform, Vault, Console, Nomad. And part of the reason that I left HashiCorp was that I have the opportunity to talk about those tools and some of my former experiences with tools like Chef and Puppet and the communities that surround those tools and how you can use those tools to integrate with Google Cloud and to help drive some product direction around how we can make Google Cloud a great provider
Starting point is 00:03:22 to integrate with those different tools. Fantastic. I have to confess, I personally only started working with GCP for a test project about a month or so ago. And until then, it was always this thing that was sort of hanging around the periphery of what I'd been doing. Historically, I was a dyed-in-the-wool AWS person, just because it's what I encountered in the wild. And I have to say, as I went through the process, I was extremely impressed by some of the nice features that GCP has worked into it. Two that leap to mind even now left a strong impression. The first is the
Starting point is 00:03:59 billing aspect of it. All I do is work on cloud bills and terminate all billable resources in this project is a godsend as far as the console goes. The second is the way that it gives you, it represents anything that you've done in the console by clicking and pointing. It gives you what that looks like in code form. And that is just, for those of us who are of the terrible breed of programmer, is just spectacular. Well, I'm glad you like it. And I don't know if you know this or not, but that button that you click in the UI to disable billing across an entire project and delete all billable resources, there's an API for that too. So you could build a chat bot or a script that does that as well. Everything we do on Google Cloud is API first. So anytime you click a button in that web UI, there is a
Starting point is 00:04:51 corresponding API call, which means you can build automation, compliance, and testing around these various aspects. Wonderful. And can you expose that from other people's accounts? Because frankly, turning off someone else's website as a service is something I would definitely pay for. It's definitely possible. The IAM and permission management in Google Cloud is incredibly powerful. It leverages the same IAM permissions that G Suite has, which is hosted Gmail and Calendar and all of those other things. So you can invite effectively anyone with a Google account, not just, you know, at gmail.com, but at any domain and give them admin or editor permissions across a project. And then they're effectively part of your organization within the scope of
Starting point is 00:05:36 that project. And this is really useful when you think about things like training or, you know, as a consultant, being able to see all of your different clients in one dashboard when you log in, but your clients can't see each other. That definitely opens up some possibilities with respect to being able to manage multiple accounts simultaneously, work on different environments. I definitely see the appeal. So you are a staff developer advocate at Google. I mean, historically, advocacy in some companies is, here's who we are. This is what we do. You could have spent the last 15 years living in a cave, and odds are you still know who Google is and what they do. How does that manifest itself at a company that has long ago become a household name?
Starting point is 00:06:26 Definitely. I think that's a great question. It's important to recognize that advocacy is not just external advocacy. There's an internal component to it. So yes, everyone knows that Google is a household name. I look to my left here and I have a Google Home sitting right there. But there's many parts of Google and many features of Google Cloud that people aren't aware of. So my job as an advocate, I view it as help people win.
Starting point is 00:06:55 How do I get people who want to use Google Cloud or don't know about Google Cloud the ability to be successful on the platform? And then the flip side of that is what I call strategic complaining. So I am deeply ingrained in a number of communities, the DevOps communities, the configuration management communities, and people are going to come to me with feedback. They're going to say, hey, this thing is great. They're going to say, hey, this thing is pretty terrible. And it's my job as an advocate to take that feedback and convert it into meaningful action items for our product teams and say, hey, I've heard this repeating pattern that this particular service's documentation isn't up to snuff, or this service is missing a key feature, and work with the product teams to get that prioritized on the roadmap so that the voice
Starting point is 00:07:39 of the community is being echoed in the features and the products that are being developed internally. That tends to make a lot of sense, especially as, and I mean, no disrespect by this, but Amazon was in the sort of in the cloud space by itself for a long time, as far as public cloud availability goes. And every, all the other providers, Google included, were for a long time sort of perceived as late to the party and not able to offer anything approaching as comprehensive an experience. And I think that that narrative, to some extent, is something that Google is still struggling with, even though, and again, I've been deep into the woods on AWS for a long time. When I've used GCP now, I am not left with the impression that it is substandard. I'm not left with a perception that, oh, this is a fun toy, but it's not where serious business happens. It's a fully featured platform. And at this point, it really comes down to preferences and what's pre-existing in most environments, not a capability story.
Starting point is 00:08:46 Do you find that that is a general perception of how the entire world is working? Or am I, frankly, too stuck off in my AWS world at this point? No, Corey, I don't think you're stuck in anything. I think we are moving to a world where small companies, and maybe even mid-sized companies, are going to pick a cloud provider and they're going to stick with it. But when we look at large companies, enterprises, Fortune 50, Fortune 500 companies, they're going to pick multiple clouds, actually. And they're going to do it for one of two reasons. The first is some type of legal compliance issue. So when you think about finance and trading, legally, they're required to not have dependencies on one provider. But the bigger reason is that
Starting point is 00:09:31 each cloud provider is going to have things that they're good at and things that they're not so good at. So at Google, for example, we have the best Kubernetes engine because we wrote Kubernetes. We have the folks who run Kubernetes and have been running Kubernetes for a while running GKE, which is our hosted Kubernetes offering. We also have some of the best ML in the world. We just launched AutoML, which allows you to use our models with your data so you don't have to do training. But then there are other cloud providers that have specific features where they shine. And organizations are going to be able to pick and choose, hey, I want to use this from AWS, this from Azure, and this from Google Cloud. And being able to link those things up and really leverage the true power and elasticity of the public cloud is very, very important for these
Starting point is 00:10:15 midsize and large organizations' long-term success. When I conducted a survey somewhat recently, last year in AWS, I wound up asking a bunch of snarky questions to a bunch of people. Almost everyone who answered the survey had some workloads in AWS. I'm sure there was no selection bias whatsoever in that. But there was also a very decent showing of other cloud providers along the way. What I found fascinating, and I wish I'd built in more questions around this, but in a pure AWS context, the breakdown between who was using CloudFormation and who was using Terraform, to step back into your previous role for a second, was neck and neck. It was effectively a 50-50 split, which is incredible, even though the official word
Starting point is 00:11:06 from Amazon has always been cloud formation is the way and the light. Does GCP offer some equivalent of that? Or is the official marching order there, if you want to automate this, use Terraform? Or is there something else I'm just not aware of? That's a great question, Corey. Externally at Google, we have a tool called Deployment Manager. You can check it out on cloud.google.com. It's kind of the equivalent of CloudFormation. There are teams at Google that are staffed full-time to do engineering work on that. Every API that you get by clicking a button on cloud.google.com or viewing the API
Starting point is 00:11:40 docs is accessible via the Deployment manager. However, in addition to that, Google Cloud has partnered very closely with a number of open source tools and the companies that correspond to them, one of which being Terraform. So there's a team, I'd like to give a shout out to the cloud graphite team, Eric Johnson, Dana Hoffman, Emily Yee, and the team over there who are doing these integrations with third-party tools. So to put it in perspective, there are people at Google who are paid by Google who work full-time on open-source tools like Terraform and Chef and Puppet so that you can provision GCP resources using the tools that you love.
Starting point is 00:12:19 So we offer Deployment Manager, and if you're only going to use Google and you're never going to use another cloud, then by all means use deployment manager. It's going to work best. It integrates everywhere. But if you're thinking of going multi-cloud or you already have experience with tools like Chef or Puppet or Terraform or Ansible, we want to meet you where you are. We want, you know, if you're an Ansible shop, we want Ansible to be the tool that you use to provision infrastructure. If you're a Terraform shop, we want Ansible to be the tool that you use to provision infrastructure. If you're a Terraform shop, we want Terraform to be the tool that you use to provision infrastructure. And the way that we support that is by having the Cloud Graphite team work on these tools and make sure that the GCP integrations run deep. Perfect. That's useful to know. And it's
Starting point is 00:12:57 something that I think a lot of different shops don't quite have a full awareness of. Do you happen to have any numbers you can share or just general sense of shops that are using GCP? Are they using this or are they tending to go in a Terraform direction? I mean, what is the general zeitgeist these days? I don't have numbers on the deployment manager side of things, but, you know, there are a few different customers who are using Terraform. I can't go into specifics, but it's non-zero and it's significant enough that we have multiple full-time people devoted to working on these integrations. And if we weren't seeing adoption and it wasn't important to us as
Starting point is 00:13:38 a company to support those ecosystems, we wouldn't be investing people in it. At a number of DevOps events where there are people from Google talking about how you folks do DevOps internally. And right around here is the point where I get interrupted by someone who works at Google. We don't do DevOps. We do SRE. What is the difference and what is the breakdown as far as how Google sees things? That's a great question, Corey. This is actually something that I'm focused on and I'm working closely with the SRE team internally at Google to make sure
Starting point is 00:14:09 that we're getting the right message out there. To just kind of backpedal just a little bit, there's kind of five key pillars of DevOps. The first is to reduce organizational silos and break down the barriers between teams. The second is that we have to accept failure is the norm. That's things like blameless postmortems. We have to accept that computers are inherently unreliable. So we can't expect perfection. And when we introduce humans into that, we get even more imperfection. The third is
Starting point is 00:14:35 implementing gradual change. We want to reduce the mean time to recover or the MTTR. And we realize that small incremental changes are much easier to review and roll back in the event of failure. The fourth piece is tooling and automation, right? There's entire conferences like Monitorama that gather people from the DevOps communities around monitoring, tooling, automation, you know, Chef and Puppet and Terraform obviously fit in that pillar. And then the fifth is to measure everything. No matter what we do in the first four categories, if we're not measuring it, we don't have clear gauges for success. We don't know if we've been successful. And when you think about it, these are actually really abstract topics, right? Nowhere in here
Starting point is 00:15:13 did I say use Chef or use Puppet. And nowhere in here did I say that you should hold more meetings to break down the silos or that you should use an elk stack for your measurement and monitoring and logging. And for this reason, you can think of DevOps as like an interface in a programming language, like Java or a typed language, where it doesn't actually define what you do. Instead, it gives you a very high level of what the function is supposed to implement. So there is a function in the interface that says reduce organizational silos. And the way that you implement that is kind of like a class. And SRE, as I'm learning, because I'm also new to Google,
Starting point is 00:15:56 but the way that I view this is that SRE is a class that implements DevOps. And just like you can have multiple classes that implement collection or sortable, it's possible to have multiple classes that implement DevOps. So in the SRE discipline, there's a very prescribed way for performing those five pillars of DevOps. Things like sharing ownership and SLIs and SLOs, moving fast by reducing cost of failure, sharing ownership among product teams and automation, and very specific tools and technologies that we use within Google, some of which are exposed publicly as part of Google Cloud, that enable the DevOps culture
Starting point is 00:16:33 and the DevOps mindset to take place. And I think for a while, because there are definitely some folks at Google who have been at Google for many years and aren't deeply involved in these communities like myself, they thought that SRE was the only way. So some of the advocation that I'm doing internally is saying, yeah, you know, SRE satisfies this interface, but, you know, to a certain extent, so do these agile practices over here. And so do these other technologies that other companies are using. And part of the work that I'm doing is getting people to realize that we can meet in the middle, right? That part of the reason why we have abstract classes in programming is that there's
Starting point is 00:17:08 more than one way to solve a problem. And SRE is just one of those ways. And it's the way that has worked best for Google. And it has worked best for a number of customers that Google is working with. But there are some other ways too. And we need to be able to support those ways and recognize that there isn't one true path and light to the operational success of a system, but there are in fact many ways to reach that prosperity. There was an entire book written by a team of SREs at Google for O'Reilly, entirely on the practice of site reliability engineering. And it's a fantastic book and the people who wrote it are incredibly skilled, but it almost felt like that book could have been subtitled How to Build Google, to some extent,
Starting point is 00:17:51 where a lot of what Google does and how they operate and how they think presupposes not only the tremendous investment in infrastructure that Google has made since its inception, but also Google culture. You take that and you drop it on a mid-sized credit union in the Midwest, for example, and almost every pre-existing condition that Google has no longer applies. How do you wind up driving that sort of cultural change to an environment that looks nothing like Google. You know, one of the things that I got out of reading the SRE book was this is how Google does SRE. And there is a group of people, and I've read the book a number of times, and I struggle
Starting point is 00:18:38 to see this viewpoint. There's a group of people that believe that book is Google telling the world how to do DevOps. And that's simply not the case. I know many of the authors that is in no way what they were trying to get across with that book. It was actually a storytelling exercise. How Google does SRE was a thing that we wanted to evangelize with the world because we think that it can help people improve their operations. The flip side of that is that organizations need to be cognizant of their
Starting point is 00:19:05 own requirements. If I'm a small startup of, say, less than 25 people and operations is someone's part-time job, the SRE kind of playbook isn't relevant yet. It doesn't become relevant until we have enough users and we have a team and we have these barriers that exist between the product organization and the engineering organization and the site reliability engineering organization, that these practices come into play. So when we talk to, say, a mid-sized credit union from the Midwest, the conversation can't be, use this tool with this technology and do exactly this, because there is a cultural component to SRE, just like there's a cultural component to DevOps that we have to solve first. And it's okay to kind of pick and choose, right?
Starting point is 00:19:52 We might choose the part of the SRE story, which is SLOs and SLIs, which are very strict, defined measurements of uptime and availability for a system. But we might not use the kind of the monitoring and metrics that are recommended. We might use our own technology. And it's about picking what's best for the organization. But when we go into these companies and we try to say, hey, you need to change if you want to innovate,
Starting point is 00:20:19 we have to be aware of their own roadblocks and their own hurdles and find ways to work around them. And that's why executive buy-in is really key in these situations. If you have a couple developers who just want to move faster, they're never going to be able to push these initiatives. But if you have top-level executives and VPs who are saying, okay, we're losing market share, we need to find a way to deliver faster, you'll get a lot more buy-in. Because it truly is, as you said, an organizational culture.
Starting point is 00:20:44 And that's true for both DevOps and SRE. I want to be very clear that this next question is not explicitly aimed at Google. I feel the need to warn you first on that one. But Google is always held up, along with several other companies, as a shining beacon of how infrastructure management could be. You see conference talks conducted by Googlers. You talk to people about what they're working on, and they paint a very compelling picture. A lot of other companies do this,
Starting point is 00:21:18 and in a lot of these other companies, I've later gone in and done projects there. And it is a very polite fiction. Internally, I have never yet found a company that didn't think its own infrastructure was, to some extent, a song of ice and tire fire, where you're always going to have things breaking. It feels like you're skating on the edge of disaster, but that doesn't make for a compelling keynote at these events. And after I've poured significant amounts of alcohol into various Google people, they start to nod and smile and say, yeah, there are still problems in our infrastructure,
Starting point is 00:21:55 even after 20 some odd years and billions invested. It feels like to some extent, it's never a solved problem. There's always more to improve. But to some extent, first off, would you agree that that's true? So I've only been at Google a couple of months now. I would definitely say that any company you work at where the recruiter tells you that it's all sunshine and rainbows and there's nothing ever wrong is a lie. Every company has problems, some of them technical, some of them cultural.
Starting point is 00:22:25 And I don't think Google is an exception to that rule. Being a company that's been around for a very long time, there's certainly technical debt. There have certainly been outages while I've been here. The one key difference is the way that Google handles that from a cultural perspective. We focus on fixing the problem and making sure it doesn't happen again as opposed to finding out who did what and why they did it and and what were they thinking so there's a very blameless culture which i found very unique to google it's like a top priority in any time there's an outage and the way they mitigate those outages so you know having the ability to say oh this particular cluster is not having a good day. Let's shift in real time
Starting point is 00:23:06 all of the workloads from that cluster to a new one. And a great example of something like that is whenever we had the Meltdown Inspector vulnerabilities here a few months ago, Google was able to migrate people's workloads in real time without downtime as they were upgrading or applying the patches for these CPU vulnerabilities. And that's, that's a technology, right? That's not a culture. That's a technology that's unique to Google. We wrote about it on the cloud platform blog. And that's, that's something that, that makes Google, you know, a unique, in a unique position where we can prioritize availability and reliability for our customers, even if behind the scenes,
Starting point is 00:23:44 there are some fires going on. Yeah, I think that's very fair. The counterpoint too, and the reason I keep harping on that particular area of things is I did this myself, and I've talked to other people who continue to do it now. They'll go to a conference, and they'll see a talk that is presented by one of these bright lights of tech. And then they go back to their own jobs at the end of the conference. And they feel sad because their environments are, from their perspective, far worse than what was just described.
Starting point is 00:24:17 It feels like there's not a ongoing sense of empathy or awareness in many cases that everyone's environment has problems, everyone's culture has problems, and this is built on a continuing series of incremental change. How do you find that that is being addressed these days? I don't think it is. I mean, you and I have both done keynotes at big conferences, and there's a lot of hand-waving, there's a lot of storytelling that goes on. And, you know, maybe as an industry, we need to tell more war stories instead of the pleasure stories.
Starting point is 00:24:53 I spoke at an event that Fastly did, the CDN company, and they asked me to speak about an outage that we had. And it was different for me. Like, I didn't feel comfortable doing it. I don't feel, I feel like talking about failure publicly without a resolution is often a negative connotation. So as an industry, maybe we need to start talking more about failure. And if at the end of the talk, the answer is it just never happened again. And we don't know why that's okay to talk about. But I also think that that's not going to get accepted into, you know, a CFP process. People as conference organizers, they want to see sunshine and rainbows
Starting point is 00:25:32 because that sells tickets that makes people happy. So it is a bit of a systemic problem is how do we talk about these things in the open without putting people in this like, well, what was the resolution or how did you fix it? Because sometimes computers are weird. I feel like there's an opportunity here for failure con or something similar where all we talk about is the failures that we've seen in various infrastructures and some of them don't have resolutions. I also feel like there needs to be a standing rule for a conference like that, that well, actually, have you considered is not a valid question to ask during the Q&A portion. Yes, after listening to this for 45 minutes, I'm sure you have the answer to a problem that has stymied entire teams of engineers for months.
Starting point is 00:26:21 Yes, based upon the window I've given you into this. And that's always the challenge too, but you're right. I think that it is a negative thing that companies and organizers don't necessarily want to see, but it's the real world. It's how these things work. There's a whole laundry list of things I have that I do not understand about why my systems behave in certain ways under certain conditions. And if they're not causing downtime or not painful enough, I'm never going to have the
Starting point is 00:26:52 time to dig into them, or frankly, maybe not even have the intellect to dig in and figure out why it does that. So the answer I put around is I just draw a circle around all of that and caption it, computers are terrible. The end. So that's actually interesting because what you just described is actually a key component of the SRE discipline, which is this thing called TOIL, like foil, but with a T in the front. And it's work tied to systems that either we don't understand or it doesn't make sense to automate away. So if there's that one service that goes down once a year, but it's highly available, so it doesn't matter, and someone has to connect to a prod system and restart it or kick it
Starting point is 00:27:34 to reboot, there's no point in investing 10, 15 hours to build automation and detection around that. Instead, let's just invest 15 minutes every year and do it. And part of the SRE discipline is about mitigating that. Instead, let's just, you know, invest 15 minutes every year and do it. And, you know, part of the SRE discipline is about mitigating that, right? So when that system goes down, does another one automatically pick up so that we're not rushing to get it back online? And then we can, you know, kick it kind of in our spare time. You know, I have a similar bubble. I don't call it computers are terrible. Mine is that computers are unpredictable,
Starting point is 00:28:04 because sometimes they do things in the opposite direction, which we fail to recognize. For example, I have my blog hosted on a cloud instance that is rated for a certain amount of traffic. And it was on the front page of Hacker News one day and I got way more traffic than it was rated to receive and it never died or melted down and the CPU didn't even go above 80%.
Starting point is 00:28:25 And that was one of those where, you know, computers aren't terrible. They're just unexplained or inexplicable in some ways, where it was like, the metrics clearly dictate that this machine should have melted, but it didn't. And I'm not going to question it. We're going to move on with our lives. One last topic I want to get into before we call it an episode. As you see customers coming into GCP, and I understand you haven't been there very long and your experience may not be representative. First, are they generally coming from other cloud providers or are they coming from on-premise data center deployments? I mean, I think there's a healthy mix. Like you said, I don't have much insight, but I do think that there's a number of folks who
Starting point is 00:29:04 are trying to do lift and shift, which I personally don't work a lot with lift and shift. My team in particular works with what we call move and improve. That's a trademark. It's not actually a trademark, but you heard it here first. And I may be stealing it later and claiming it as my own. So move and improve is this idea that we have VMs in a data center and we want to move them to the cloud. We could lift and shift and use something like a VM on the cloud and any cloud provider. Or we could make them cloud native in the process and leverage cloud provider specific technologies like Google Functions or hosted Kubernetes engine or some type of,
Starting point is 00:29:45 just re-architect the application and make it cloud native so that it behaves well in a highly available environment where the network isn't always 100% reliable or the machine might be moved or the application might be killed and restarted. So my team is focused a lot on move and improve,
Starting point is 00:30:00 not so much lift and shift. We have folks at Google who are definitely dedicated to lift and shift and making those customers successful. But I focus particularly more on the move and improve scenario for those customers in their own data centers. And then for customers that are coming from another cloud provider or are coming greenfield, they have an idea and they want to run it somewhere. I work with them pretty closely as well. And that's where, you know, tools and our integrations with things like Terraform and Chef and Puppet
Starting point is 00:30:26 and Cloud Foundry, et cetera, run deep. Because if they already have experience from another company or already have something running somewhere else, we want to make sure to meet them where they are. Which makes an awful lot of sense. But as a company goes through a cloud selection process and they look at the big four in the space, Google, Microsoft, Amazon, and Alibaba, all four of those companies have very different cultures and very different ways of managing infrastructure. Does that have any bearing or have an impact on how they're going
Starting point is 00:31:01 to manage their environment once it moves into one of those providers' clouds. In other words, if you're moving something into Azure, are you likely to manage it differently from a philosophical standpoint than if you're moving it into GCP? You know, realistically, there are tiny differences, but the cloud-native paradigm, right, there's some few key pillars here, like does it handle restarts well? Is it highly available? Can it be containerized, even though containers aren't necessarily required for cloud native? You know, does it package all of its dependencies with it? Can it run on different operating systems, right? All of these things are generic, right? They're not specific to a cloud
Starting point is 00:31:40 provider. When we start leveraging provider-specific technologies, right, AWS Lambda and Google Functions are similar technologies, right? They're both serverless technologies, but there's a little bit of configuration differences, the little, you know, some things here, some things there. So there's not a pure mapping. But from the application level, I don't think that there's cloud provider specific things. At the infrastructure level, there are definitely certain things, but I don't think that they're specific enough that it would actually hinder you from moving from one to the other. Okay. Thank you. It's always been a strange question.
Starting point is 00:32:17 And on the one hand, you can approach a cloud provider as more or less, oh, they just provide us virtual computers. So what they do internally and as a culture and how they think about the world really doesn't matter all that much to how you run your environment once you understand the constraints versus if you go something in a full, more or less in a full cloud native direction, then it turns into something that's very, well, how do they think about this? How should I be architecting this? And I mean, to some extent, if you gaze long enough into the Google abyss, do you become
Starting point is 00:32:52 Google in some small way as far as how you think about operations, how you think about the responsible running of environments? And I think it depends on what you're running too. You know, one of Google's big market segments is high performance computing. So people doing like genomics research, et cetera, where they might have some on-premise data and then they need elasticity of the cloud where, like you said, they're just launching VMs, right? They view the cloud as an extension of compute and they're launching them for a couple hours.
Starting point is 00:33:19 They're running very complex genomics simulations or DNA stuff that I don't understand because I haven't taken a biology class in 12 years, but it's very important. Don't get me wrong. The work is important. I just don't understand it. And what they look for in a cloud is very different than someone who's looking to run microservices, for example. And that concept there is, it's different, right? So it's not just about what does the cloud provider offer? It's also like, what problem are you trying to solve? And that's a key thing that I think we forget about every once in a while. Perfect. Thank you very much for your time, Seth. Before we wind up calling it an episode, is there anything that you're working on that you'd like to draw attention to and have
Starting point is 00:33:59 people check out? In the short term, not so much. In the long term, you should look for a lot of the stuff that Google is going to be doing in the DevOps space. The things I talked about specifically with SRE and how SRE relates to DevOps, you should see some content coming out shortly that will hopefully explain that in a lot clearer. Perfect. Thank you very much for your time, Seth, and enjoy the rest of the day.
Starting point is 00:34:22 Thanks, Corey, you too. This has been Screaming in the Cloud and I'm Corey Quinn. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.
