Screaming in the Cloud - Episode 3: Turning Off Someone Else's Site as a Service
Episode Date: March 28, 2018

How do you encourage businesses to pick Google Cloud over Amazon and other providers? How do you advocate for selecting Google Cloud and being successful on that platform? Google Cloud is not just a toy with fun features; it is a capable cloud service. Today, we're talking to Seth Vargo, a Senior Staff Developer Advocate at Google. Previously, he worked at HashiCorp in a similar advocacy role, where he worked very closely with Terraform, Vault, Consul, Nomad, and other tools. He left HashiCorp to join Google Cloud and talk about those tools and his experiences with Chef and Puppet, as well as the communities surrounding them. He wants to share with you how to use these tools to integrate with Google Cloud and help drive product direction. Some of the highlights of the show include:

- Strengths of Google Cloud include its billing controls. You can work on cloud bills and terminate all billable resources. The button you click in the user interface to disable billing across an entire project and delete all billable resources has an API, so you can build a chat bot or a script, too.
- The console represents anything you've done by clicking and pointing and also gives you what that looks like in code form.
- You can expose that for other people's accounts, because turning off someone else's website as a service can be beneficial.
- You can invite anyone with a Google account, not just @gmail.com but @ any domain, and give them admin or editor permissions across a project. They're effectively part of your organization within the scope of that project. This is useful for training, or for a consultant who needs to see all of their different clients in one dashboard while the clients can't see each other.
- Google is a household name. However, it's important to recognize that advocacy is not just external advocacy; there's an internal component to it. There are many parts of Google and many features of Google Cloud that people aren't aware of.
- As an advocate, Seth's job is to help people win. Besides showing people how they can be successful on Google Cloud, Seth focuses on strategic complaining. He is deeply ingrained in several DevOps and configuration management communities, which provide him with positive and negative feedback. It's his job to take that feedback and convert it into meaningful action items for product teams to prioritize and put on roadmaps, so that the voice of the community is echoed in the features and products being developed internally.
- Amazon has been in the cloud business for a long time. What took Google so long? For a long time, Google was perceived as being late to the party and unable to offer as comprehensive and experienced a service as Amazon. Today, Google Cloud is not viewed as substandard, nor as a fun toy where serious business doesn't happen. It's a fully featured platform, and the choice comes down to preferences and what's pre-existing in an environment, not capability.
- Small and mid-size companies typically pick a cloud provider and stick with their choice. Larger companies and enterprises, such as Fortune 50 and Fortune 500 companies, pick multiple clouds. This is usually due to some type of legal or compliance requirement, or because certain cloud providers shine at specific features.
- Externally at Google, there is the Deployment Manager tool at cloud.google.com. It's the equivalent of CloudFormation, and teams at Google are staffed full time to do engineering work on it. Every API that you get by clicking a button on cloud.google.com or viewing the API docs is accessible via Deployment Manager.
- Google Cloud also partners with open source tools and the companies behind them. There are people at Google, paid by Google, who work full time on open source tools like Terraform, Chef, and Puppet. This allows you to provision Google Cloud resources using the tools that you prefer.
- According to Seth, there are five key pillars of DevOps: 1) Reduce organizational silos and break down barriers between teams; 2) Accept failure as the norm; 3) Implement gradual change; 4) Tooling and automation; and 5) Measure everything.
- Think of DevOps as an interface in a programming language like Java or another typed language: it doesn't actually define what you do, but gives you a high-level description of what the function is supposed to implement. With the SRE discipline, there's a prescribed way of performing those five pillars of DevOps. Specific tools and technologies used within Google, some of which are exposed publicly as part of Google Cloud, enable that kind of DevOps culture and mindset.
- Part of the reason abstract classes exist in programming is that there's more than one way to solve a problem, and SRE is just one of those ways. It's the way that has worked best for Google, and it has worked best for a number of customers Google is working with. But there are other ways, too. Google supports those ways and recognizes that there isn't just one path to operational success, but many ways to reach that prosperity.
- The book Site Reliability Engineering describes how Google does SRE, something Google wanted to evangelize to the world because it can help people improve their operations. The flip side is that organizations need to be cognizant of their own requirements.
- Google has always been held up, along with several other companies, as a shining beacon of how infrastructure management could be. But there are still problems with its infrastructure, even after 20-some years and billions invested. Every company has problems, some technical, some cultural, and Google is no exception. The one key difference is the way Google handles issues from a cultural perspective: it focuses on fixing the problem and making sure it doesn't happen again. There's a very blameless culture.
- Conferences tend to include a lot of hand waving and storytelling. As an industry, more war stories need to be told instead of pleasure stories. Conference organizers want to see sunshine and rainbows because that sells tickets and makes people happy. The systemic problem is how to talk about failures out in the open.
- Becoming frustrated and trying to figure out why computers do certain things touches on a key component of the SRE discipline referred to as toil: work tied to systems that we either don't understand or that doesn't make sense to automate away.
- Those going to Google Cloud to "move and improve" tend to be a mix of those coming from other cloud providers and those coming from on-premise data centers.
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode of Screaming in the Cloud is sponsored by my friends at
GorillaStack. GorillaStack's a unique automation solution for cloud cost optimization, which of
course is something near and dear to my heart. By day, I'm a consultant who fixes exactly one
problem, which is the horrifying AWS bill. Every organization eventually hits a point where they start to really,
really care about their cloud spend, either in terms of caring about the actual dollars and
cents that they're spending, or in understanding what teams or projects are costing money and
starting to build predictive analytics around that. And it turns out that early on in my
consulting work, I spent an awful lot of time talking
with some of my clients about a capability that GorillaStack has already built.
There's a laundry list of analytics offerings in this space that tell you what you're spending
and where it goes, and then they stop.
Or worse, they slap a beta label on that side of it for remediation and then say that they're
not responsible for anything or everything that their system winds up doing. So some folks try and go in
a direction of doing things to write their own code, such as spinning down developer environments
out of hours, bolting together a bunch of different services to handle snapshot aging,
having a custom Slack bot that you build that alerts you when your budget's hitting a red line. And this is all generic stuff. It's the undifferentiated heavy lifting that's not
terribly specific to your own environment. So why build it when you can buy it?
GorillaStack does all of this. Think of it more or less like if this, then that, IFTTT for AWS.
It can manage resources. It can alert folks when things
are about to turn off. It keeps people apprised of what's going on. More or less the works.
Go check them out. They're at GorillaStack.com, spelled exactly like it sounds. Gorilla like the
animal. Stack as in a pile of things. Use the discount code SCREAMING for 15% off the first
year. Thanks again for your support, GorillaStack. Appreciate it.
Hello and welcome to Screaming in the Cloud. Today I'm joined by Seth Vargo,
a senior staff developer advocate at Google. Thanks for joining me, Seth.
Hey, thanks for having me, Corey. Excited to be here today.
Always a pleasure to talk to you. So you, until very recently, were at HashiCorp for a while doing
effectively the same type of advocacy work, to my understanding.
Yeah, I left HashiCorp a few months ago to join Google Cloud. I was working very closely with
tools like Terraform, Vault, Consul, Nomad. And part of the reason that I left HashiCorp was that
I have the opportunity to talk about
those tools and some of my former experiences with tools like Chef and Puppet and the communities
that surround those tools and how you can use those tools to integrate with Google Cloud
and to help drive some product direction around how we can make Google Cloud a great provider
to integrate with those different tools.
Fantastic.
I have to confess, I personally only started working with GCP for a test project about
a month or so ago. And until then, it was always this thing that was sort of hanging around the
periphery of what I'd been doing. Historically, I was a dyed-in-the-wool AWS person, just because
it's what I encountered in the wild. And I have to say,
as I went through the process, I was extremely impressed by some of the nice features that GCP
has worked into it. Two that leap to mind even now left a strong impression. The first is the
billing aspect of it. All I do is work on cloud bills, and "terminate all billable resources in this project"
is a godsend as far as the console goes. The second is the way that it represents
anything that you've done in the console by clicking and pointing. It gives you what that
looks like in code form. And that is just, for those of us who are of the terrible breed of programmer,
is just spectacular. Well, I'm glad you like it. And I don't know if you know this or not, but
that button that you click in the UI to disable billing across an entire project and delete all
billable resources, there's an API for that too. So you could build a chat bot or a script that does that as well. Everything we
do on Google Cloud is API first. So anytime you click a button in that web UI, there is a
corresponding API call, which means you can build automation, compliance, and testing around these
various aspects. Wonderful. And can you expose that from other people's accounts? Because frankly,
turning off someone else's website as a service is something I would definitely pay for. It's definitely possible. The IAM and
permission management in Google Cloud is incredibly powerful. It leverages the same IAM
permissions that G Suite has, which is hosted Gmail and Calendar and all of those other things.
So you can invite effectively anyone with a Google account,
not just, you know, at gmail.com, but at any domain and give them admin or editor permissions
across a project. And then they're effectively part of your organization within the scope of
that project. And this is really useful when you think about things like training or, you know,
as a consultant, being able to see all of your different clients in one dashboard when you log in, but your clients can't see each other.
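To make that concrete, here's a minimal sketch of adding an account from any domain to a project's IAM policy using the Cloud Resource Manager API through google-api-python-client. The project ID, email address, and role are placeholders, and in practice you might simply use gcloud or the console; this is just to show it's all API-driven.

```python
# A minimal sketch of granting an outside user a role on a project via the
# Cloud Resource Manager API (google-api-python-client). Project ID, member,
# and role are placeholders; application default credentials are assumed.
from googleapiclient import discovery

PROJECT_ID = "my-demo-project"          # hypothetical project
MEMBER = "user:consultant@example.com"  # any Google account, not just @gmail.com
ROLE = "roles/editor"

crm = discovery.build("cloudresourcemanager", "v1")

# Read-modify-write: fetch the current policy, append the binding, write it back.
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
binding = next((b for b in policy.get("bindings", []) if b["role"] == ROLE), None)
if binding is None:
    binding = {"role": ROLE, "members": []}
    policy.setdefault("bindings", []).append(binding)
if MEMBER not in binding["members"]:
    binding["members"].append(MEMBER)

crm.projects().setIamPolicy(resource=PROJECT_ID, body={"policy": policy}).execute()
```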
That definitely opens up some possibilities with respect to being able to manage multiple accounts simultaneously, work on different environments.
I definitely see the appeal.
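The billing kill switch Corey praised earlier has the same API-first property. A minimal sketch, assuming google-api-python-client and appropriate permissions; the project ID is a placeholder, and note that detaching the billing account stops charges but is distinct from deleting the billable resources themselves, which the console button also does.

```python
# A minimal sketch of the "disable billing on this project" button as an API
# call, using the Cloud Billing API via google-api-python-client. The project
# ID is a placeholder; an empty billingAccountName detaches the project from
# its billing account, which stops new charges.
from googleapiclient import discovery

PROJECT_ID = "my-demo-project"  # hypothetical project

billing = discovery.build("cloudbilling", "v1")
billing.projects().updateBillingInfo(
    name=f"projects/{PROJECT_ID}",
    body={"billingAccountName": ""},  # empty string = disable billing
).execute()
```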
So you are a staff developer advocate at Google. I mean, historically,
advocacy in some companies is, here's who we are. This is what we do. You could have spent the last
15 years living in a cave, and odds are you still know who Google is and what they do.
How does that manifest itself at a company that has long ago become a household name?
Definitely.
I think that's a great question.
It's important to recognize that advocacy is not just external advocacy.
There's an internal component to it.
So yes, everyone knows that Google is a household name.
I look to my left here and I have a Google Home sitting right there.
But there's many parts of Google and many features of Google Cloud that people aren't aware of.
So my job as an advocate, I view it as help people win.
How do I get people who want to use Google Cloud or don't know about Google Cloud the ability to be successful on the platform?
And then the flip side of that is what I call strategic complaining. So I am deeply ingrained in a number of communities,
the DevOps communities, the configuration management communities, and people are going
to come to me with feedback. They're going to say, hey, this thing is great. They're going to say,
hey, this thing is pretty terrible. And it's my job as an advocate to take that feedback and
convert it into meaningful action items for our product teams and say, hey, I've heard this repeating pattern that
this particular service's documentation isn't up to snuff, or this service is missing a key feature,
and work with the product teams to get that prioritized on the roadmap so that the voice
of the community is being echoed in the features and the products that are being developed internally. That tends to make a lot of sense, especially as, and I mean, no disrespect by this, but Amazon was
in the sort of in the cloud space by itself for a long time, as far as public cloud availability
goes. And every, all the other providers, Google included, were for a long time sort of perceived as late to the party and not able to offer anything approaching as comprehensive an experience.
And I think that that narrative, to some extent, is something that Google is still struggling with, even though, and again, I've been deep into the woods on AWS for a long time.
When I've used GCP now, I am not left with the impression that it is
substandard. I'm not left with a perception that, oh, this is a fun toy, but it's not where serious
business happens. It's a fully featured platform. And at this point, it really comes down to
preferences and what's pre-existing in most environments, not a capability story.
Do you find that that is a general perception of how the entire world is working?
Or am I, frankly, too stuck off in my AWS world at this point?
No, Corey, I don't think you're stuck in anything.
I think we are moving to a world where small companies, and maybe even mid-sized companies,
are going to pick a cloud provider and they're going to stick with it. But when we look at large
companies, enterprises, Fortune 50, Fortune 500 companies, they're going to pick multiple clouds,
actually. And they're going to do it for one of two reasons. The first is some type of legal
compliance issue. So when you think about finance and trading, legally, they're required to not have dependencies on one provider. But the bigger reason is that
each cloud provider is going to have things that they're good at and things that they're not so
good at. So at Google, for example, we have the best Kubernetes engine because we wrote Kubernetes.
We have the folks who run Kubernetes and have been running Kubernetes for a while running GKE, which is our hosted Kubernetes offering. We also have some of the best ML in the
world. We just launched AutoML, which allows you to use our models with your data so you don't have
to do training. But then there are other cloud providers that have specific features where they
shine. And organizations are going to be able to pick and choose, hey, I want to use this from AWS,
this from Azure, and this from Google Cloud. And being able to link those things up and really
leverage the true power and elasticity of the public cloud is very, very important for these
midsize and large organizations' long-term success. When I conducted a survey somewhat recently,
last year in AWS, I wound up asking a bunch of snarky questions to
a bunch of people. Almost everyone who answered the survey had some workloads in AWS. I'm sure
there was no selection bias whatsoever in that. But there was also a very decent showing of other
cloud providers along the way. What I found fascinating, and I wish I'd built in more questions around this,
but in a pure AWS context, the breakdown between who was using CloudFormation and who was using
Terraform, to step back into your previous role for a second, was neck and neck. It was effectively
a 50-50 split, which is incredible, even though the official word
from Amazon has always been CloudFormation is the way and the light.
Does GCP offer some equivalent of that?
Or is the official marching order there, if you want to automate this, use Terraform?
Or is there something else I'm just not aware of?
That's a great question, Corey.
Externally at Google, we have a tool called Deployment Manager. You can check it out on cloud.google.com. It's kind of the
equivalent of CloudFormation. There are teams at Google that are staffed full-time to do engineering
work on that. Every API that you get by clicking a button on cloud.google.com or viewing the API
docs is accessible via Deployment Manager. However, in addition to that,
Google Cloud has partnered very closely with a number of open source tools and the companies
that correspond to them, one of which being Terraform. So there's a team, I'd like to give
a shout out to the cloud graphite team, Eric Johnson, Dana Hoffman, Emily Yee, and the team
over there who are doing these integrations with third-party tools.
So to put it in perspective, there are people at Google who are paid by Google
who work full-time on open-source tools like Terraform and Chef and Puppet
so that you can provision GCP resources using the tools that you love.
So we offer Deployment Manager, and if you're only going to use Google
and you're never going to use another cloud, then by all means use Deployment Manager. It's going to work best.
It integrates everywhere. But if you're thinking of going multi-cloud or you already have experience
with tools like Chef or Puppet or Terraform or Ansible, we want to meet you where you are.
We want, you know, if you're an Ansible shop, we want Ansible to be the tool that you use to
provision infrastructure. If you're a Terraform shop, we want Terraform to be the tool that you use to provision
infrastructure. And the way that we support that is by having the Cloud Graphite team work on these
tools and make sure that the GCP integrations run deep. Perfect. That's useful to know. And it's
something that I think a lot of different shops don't quite have a full awareness of. Do you
happen to have any numbers you can
share or just general sense of shops that are using GCP? Are they using this or are they
tending to go in a Terraform direction? I mean, what is the general zeitgeist these days?
I don't have numbers on the deployment manager side of things, but, you know, there are
a few different customers who are using Terraform. I can't go into specifics, but
it's non-zero and it's significant enough that we have multiple full-time people devoted to
working on these integrations. And if we weren't seeing adoption and it wasn't important to us as
a company to support those ecosystems, we wouldn't be investing people in it.
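For a sense of what Deployment Manager looks like in practice, here's a minimal sketch of a Python template (Deployment Manager also accepts Jinja templates and plain YAML configs). The resource name, zone, machine type, and image are illustrative, not recommendations.

```python
# vm_template.py - an illustrative Deployment Manager Python template.
# Deployment Manager calls GenerateConfig() and expects a dict containing a
# 'resources' list; every value below is a placeholder for the example.

def GenerateConfig(context):
    """Declare a single Compute Engine instance."""
    zone = context.properties.get('zone', 'us-central1-a')
    resources = [{
        'name': 'example-vm',
        'type': 'compute.v1.instance',
        'properties': {
            'zone': zone,
            'machineType': 'zones/' + zone + '/machineTypes/f1-micro',
            'disks': [{
                'boot': True,
                'autoDelete': True,
                'initializeParams': {
                    'sourceImage': ('projects/debian-cloud/global/'
                                    'images/family/debian-9'),
                },
            }],
            'networkInterfaces': [{'network': 'global/networks/default'}],
        },
    }]
    return {'resources': resources}
```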
I've been at a number of DevOps events where there are
people from Google talking about how you folks do DevOps internally. And right around here is the
point where I get interrupted by someone who works at Google: "We don't do DevOps. We do SRE."
What is the difference and what is the breakdown as far as how Google sees things?
That's a great question, Corey. This is actually something that I'm focused on
and I'm working closely with the SRE team
internally at Google to make sure
that we're getting the right message out there.
To just kind of backpedal just a little bit,
there's kind of five key pillars of DevOps.
The first is to reduce organizational silos
and break down the barriers between teams.
The second is that we have to accept failure is the norm.
That's things like blameless postmortems. We have to accept that computers are inherently unreliable. So we can't expect
perfection. And when we introduce humans into that, we get even more imperfection. The third is
implementing gradual change. We want to reduce the mean time to recover or the MTTR. And we realize
that small incremental changes are much easier to review and roll back in the event of failure. The fourth piece is tooling and automation, right? There's entire
conferences like Monitorama that gather people from the DevOps communities around monitoring,
tooling, automation, you know, Chef and Puppet and Terraform obviously fit in that pillar.
And then the fifth is to measure everything. No matter what we do in the first four categories,
if we're not measuring it,
we don't have clear gauges for success. We don't know if we've been successful.
And when you think about it, these are actually really abstract topics, right? Nowhere in here
did I say use Chef or use Puppet. And nowhere in here did I say that you should hold more meetings
to break down the silos or that you should use an ELK stack for your measurement and monitoring and
logging.
And for this reason, you can think of DevOps as like an interface in a programming language,
like Java or a typed language, where it doesn't actually define what you do.
Instead, it gives you a very high level of what the function is supposed to implement.
So there is a function in the interface that says reduce organizational silos.
And the way that you implement that is kind of like a class. And SRE, as I'm learning, because I'm also new to Google,
but the way that I view this is that SRE is a class that implements DevOps. And just like you can have multiple classes that implement collection or sortable, it's possible to
have multiple classes that implement DevOps. So in the SRE discipline, there's a very prescribed way
for performing those five pillars of DevOps.
Things like sharing ownership and SLIs and SLOs,
moving fast by reducing cost of failure,
sharing ownership among product teams and automation,
and very specific tools and technologies that we use within Google,
some of which are exposed publicly as part of Google Cloud, that enable the DevOps culture
and the DevOps mindset to take place. And I think for a while, because there are definitely some
folks at Google who have been at Google for many years and aren't deeply involved in these
communities like myself, they thought that SRE
was the only way. So some of the advocacy that I'm doing internally is saying, yeah,
you know, SRE satisfies this interface, but, you know, to a certain extent, so do these agile
practices over here. And so do these other technologies that other companies are using.
And part of the work that I'm doing is getting people to realize that we can meet in the middle,
right? That part of the reason why we have abstract classes in programming is that there's
more than one way to solve a problem. And SRE is just one of those ways. And it's the way that has
worked best for Google. And it has worked best for a number of customers that Google is working with.
But there are some other ways too. And we need to be able to support those ways and recognize that
there isn't one true path and light to the operational success of a system, but there are in fact many ways to reach
that prosperity. There was an entire book written by a team of SREs at Google for O'Reilly, entirely
on the practice of site reliability engineering. And it's a fantastic book and the people who wrote
it are incredibly skilled,
but it almost felt like that book could have been subtitled How to Build Google, to some extent,
where a lot of what Google does and how they operate and how they think presupposes not only
the tremendous investment in infrastructure that Google has made since its inception, but also Google culture.
You take that and you drop it on a mid-sized credit union in the Midwest, for example,
and almost every pre-existing condition that Google has no longer applies.
How do you wind up driving that sort of cultural change to an environment that looks nothing like Google?
You know, one of the things that I got out of reading the SRE book was this is how Google
does SRE.
And there is a group of people, and I've read the book a number of times, and I struggle
to see this viewpoint.
There's a group of people that believe that book is Google telling the world how to do
DevOps.
And that's simply not
the case. I know many of the authors, and that is in no way what they were trying to get across with
that book. It was actually a storytelling exercise. How Google does SRE was a thing that we wanted to
evangelize with the world because we think that it can help people improve their operations.
The flip side of that is that organizations need to be cognizant of their
own requirements. If I'm a small startup of, say, less than 25 people and operations is someone's
part-time job, the SRE kind of playbook isn't relevant yet. It doesn't become relevant until
we have enough users and we have a team and we have these barriers that exist between the product
organization and the engineering organization and the site reliability engineering organization,
that these practices come into play. So when we talk to, say, a mid-sized credit union from the
Midwest, the conversation can't be, use this tool with this technology and do exactly this,
because there is a cultural component to SRE, just like there's a cultural
component to DevOps that we have to solve first. And it's okay to kind of pick and choose, right?
We might choose the part of the SRE story, which is SLOs and SLIs, which are very strict,
defined measurements of uptime and availability for a system. But we might not use the kind of the monitoring
and metrics that are recommended.
We might use our own technology.
And it's about picking what's best for the organization.
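For anyone new to the acronyms, here is a toy illustration of the difference between an SLI (the measurement) and an SLO (the target you commit to). The request counts and the 99.9% target are invented for the example.

```python
# A toy SLI/SLO calculation. All numbers are invented for illustration;
# real SLO targets are chosen per service.
GOOD_REQUESTS = 999_412
TOTAL_REQUESTS = 1_000_000
SLO_TARGET = 0.999  # 99.9% availability over the measurement window

availability_sli = GOOD_REQUESTS / TOTAL_REQUESTS
error_budget = 1 - SLO_TARGET                      # how much failure is allowed
error_budget_spent = (1 - availability_sli) / error_budget

print(f"SLI: {availability_sli:.4%}")
print(f"SLO met: {availability_sli >= SLO_TARGET}")
print(f"Error budget consumed: {error_budget_spent:.1%}")
```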
But when we go into these companies
and we try to say, hey, you need to change
if you want to innovate,
we have to be aware of their own roadblocks
and their own hurdles and find ways to work around them.
And that's why executive buy-in is really key in these situations.
If you have a couple developers who just want to move faster, they're never going to be
able to push these initiatives.
But if you have top-level executives and VPs who are saying, okay, we're losing market
share, we need to find a way to deliver faster, you'll get a lot more buy-in.
Because it truly is, as you said, an organizational culture.
And that's true
for both DevOps and SRE. I want to be very clear that this next question is not explicitly aimed
at Google. I feel the need to warn you first on that one. But Google is always held up, along
with several other companies, as a shining beacon of how infrastructure management could be.
You see conference talks conducted by Googlers.
You talk to people about what they're working on,
and they paint a very compelling picture.
A lot of other companies do this,
and in a lot of these other companies,
I've later gone in and done projects there.
And it is a very polite fiction.
Internally, I have never yet found a company that didn't think its own infrastructure was, to some extent,
a song of ice and tire fire, where you're always going to have things breaking. It feels like
you're skating on the edge of disaster, but that doesn't make for a compelling keynote at these
events. And after I've poured significant amounts of alcohol into various Google people, they
start to nod and smile and say, yeah, there are still problems in our infrastructure,
even after 20 some odd years and billions invested.
It feels like to some extent, it's never a solved problem.
There's always more to improve.
But to some extent, first off, would you agree that that's true?
So I've only been at Google a couple of months now.
I would definitely say that any company you work at where the recruiter tells you that
it's all sunshine and rainbows and there's nothing ever wrong is a lie.
Every company has problems, some of them technical, some of them cultural.
And I don't think Google is an exception to that rule. Being a company that's been around
for a very long time, there's certainly technical debt. There have certainly been outages while I've
been here. The one key difference is the way that Google handles that from a cultural perspective.
We focus on fixing the problem and making sure it doesn't happen again as opposed to finding
out who did what and why they did it and what were they thinking. So there's a very blameless
culture, which I found very unique to Google. It's like a top priority any time there's an outage,
and the way they mitigate those outages. So, you know, having the ability to say, oh, this particular
cluster is not having a good day. Let's shift in real time
all of the workloads from that cluster to a new one. And a great example of something like that
is whenever we had the Meltdown and Spectre vulnerabilities here a few months ago,
Google was able to migrate people's workloads in real time without downtime as they were
upgrading or applying the patches for
these CPU vulnerabilities. And that's a technology, right? That's not a culture. That's
a technology that's unique to Google. We wrote about it on the Cloud Platform blog. And that's
something that puts Google, you know, in a unique position where we
can prioritize availability and reliability for our customers, even if behind the scenes,
there are some fires going on.
Yeah, I think that's very fair.
The counterpoint too, and the reason I keep harping on that particular area of things
is I did this myself, and I've talked to other people who continue to do it now.
They'll go to a conference, and they'll see a talk that is presented by one of these bright
lights of tech.
And then they go back to their own jobs at the end of the conference. And they feel sad because their environments are, from their perspective, far worse than
what was just described.
It feels like there's not an ongoing sense of empathy or awareness in many cases that
everyone's environment has problems,
everyone's culture has problems, and this is built on a continuing series of incremental change.
How do you find that that is being addressed these days?
I don't think it is. I mean, you and I have both done keynotes at big conferences, and
there's a lot of hand-waving, there's a lot of storytelling that goes on.
And, you know, maybe as an industry, we need to tell more war stories instead of the pleasure
stories.
I spoke at an event that Fastly did, the CDN company, and they asked me to speak about
an outage that we had.
And it was different for me.
Like, I didn't feel comfortable doing it. I feel like talking about failure publicly without a resolution often has a negative
connotation. So as an industry, maybe we need to start talking more about failure. And if at the
end of the talk, the answer is "it just never happened again, and we don't know why," that's
okay to talk about. But I also think that that's not going to get accepted
into, you know, a CFP process. People, as conference organizers, want to see sunshine and rainbows
because that sells tickets and makes people happy. So it is a bit of a systemic problem: how
do we talk about these things in the open without putting people in this, like, "well,
what was the resolution, or how did you fix it?"
Because sometimes computers are weird. I feel like there's an opportunity here for failure con
or something similar where all we talk about is the failures that we've seen in various
infrastructures and some of them don't have resolutions. I also feel like there needs to be
a standing rule for a conference like that, that "well, actually, have you considered..." is not a valid question to ask during the Q&A portion.
Yes, after listening to this for 45 minutes, I'm sure you have the answer to a problem that has stymied entire teams of engineers for months.
Yes, based upon the window I've given you into this.
And that's always the challenge too, but you're right.
I think that it is a negative thing that companies and organizers don't necessarily want to see,
but it's the real world.
It's how these things work.
There's a whole laundry list of things I have that I do not understand about why my systems
behave in certain ways under certain conditions.
And if they're not causing downtime or not painful enough, I'm never going to have the
time to dig into them, or frankly, maybe not even have the intellect to dig in and figure
out why it does that.
So the answer I put around is I just draw a circle around all of that and caption it,
computers are terrible.
The end.
So that's actually interesting because what you just described is actually a key component of the SRE discipline, which is this thing called TOIL, like foil, but with a T in the front.
And it's work tied to systems that either we don't understand or it doesn't make sense to automate away. So if there's that one service that goes down once a year, but it's highly available, so
it doesn't matter, and someone has to connect to a prod system and restart it or kick it
to reboot, there's no point in investing 10, 15 hours to build automation and detection
around that.
Instead, let's just invest 15 minutes every year and do it.
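A quick back-of-the-envelope version of that tradeoff, using the rough numbers from the conversation; the figures are purely illustrative.

```python
# When does automating a piece of toil pay off? Rough numbers from the
# conversation: 15 minutes of manual work once a year versus roughly
# 10-15 hours to build automation. Purely illustrative.
manual_minutes_per_incident = 15
incidents_per_year = 1
automation_cost_hours = 12.5  # midpoint of the 10-15 hour estimate

manual_hours_per_year = manual_minutes_per_incident * incidents_per_year / 60
years_to_break_even = automation_cost_hours / manual_hours_per_year

print(f"Manual toil: {manual_hours_per_year:.2f} hours/year")
print(f"Automation pays for itself after ~{years_to_break_even:.0f} years")
# => roughly 50 years: clearly not worth automating this particular task.
```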
And, you know,
part of the SRE discipline is about mitigating that, right? So when that system goes down,
does another one automatically pick up so that we're not rushing to get it back online?
And then we can, you know, kick it kind of in our spare time. You know, I have a similar bubble.
I don't call it computers are terrible. Mine is that computers are unpredictable,
because sometimes they do things in the opposite direction,
which we fail to recognize.
For example, I have my blog hosted on a cloud instance
that is rated for a certain amount of traffic.
And it was on the front page of Hacker News one day
and I got way more traffic than it was rated to receive
and it never died or melted down
and the CPU didn't even go above 80%.
And that was one of those where, you know, computers aren't terrible. They're just
unexplained or inexplicable in some ways, where it was like, the metrics clearly dictate that
this machine should have melted, but it didn't. And I'm not going to question it. We're going to
move on with our lives. One last topic I want to get into before we call it an episode. As you see
customers coming into GCP, and I understand you haven't been there very long and your experience
may not be representative. First, are they generally coming from other cloud providers
or are they coming from on-premise data center deployments? I mean, I think there's a healthy
mix. Like you said, I don't have much insight, but I do think that there's a number of folks who
are trying to do lift and shift, which I personally don't work a lot with lift and shift.
My team in particular works with what we call move and improve.
That's a trademark.
It's not actually a trademark, but you heard it here first.
And I may be stealing it later and claiming it as my own.
So move and improve is this idea that we have VMs in a data center and we want to move them to the cloud.
We could lift and shift and use something like a VM on the cloud and any cloud provider.
Or we could make them cloud native in the process and leverage cloud provider specific technologies like Google Functions or hosted Kubernetes engine or some type of,
just re-architect the application
and make it cloud native
so that it behaves well
in a highly available environment
where the network isn't always 100% reliable
or the machine might be moved
or the application might be killed and restarted.
So my team is focused a lot on move and improve,
not so much lift and shift.
We have folks at Google
who are definitely dedicated to lift and shift
and making those customers successful. But I focus particularly more on the move and
improve scenario for those customers in their own data centers. And then for customers that
are coming from another cloud provider or are coming greenfield, they have an idea and they
want to run it somewhere. I work with them pretty closely as well. And that's where, you know,
tools and our integrations with things like Terraform and Chef and Puppet
and Cloud Foundry, et cetera, run deep.
Because if they already have experience from another company or already have something
running somewhere else, we want to make sure to meet them where they are.
Which makes an awful lot of sense.
But as a company goes through a cloud selection process and they look at the big four in the
space, Google, Microsoft,
Amazon, and Alibaba, all four of those companies have very different cultures and very different
ways of managing infrastructure. Does that have any bearing or have an impact on how they're going
to manage their environment once it moves into one of those providers' clouds?
In other words, if you're moving something into Azure, are you likely to manage it differently
from a philosophical standpoint than if you're moving it into GCP?
You know, realistically, there are tiny differences, but the cloud-native paradigm,
right, there's some few key pillars here, like does it handle restarts well? Is it highly
available? Can it be containerized, even though containers aren't necessarily required for cloud
native? You know, does it package all of its dependencies with it? Can it run on different
operating systems, right? All of these things are generic, right? They're not specific to a cloud
provider. When we start leveraging provider-specific technologies, right, AWS Lambda and Google Functions are similar technologies, right? They're both
serverless technologies, but there's a little bit of configuration differences, the little,
you know, some things here, some things there. So there's not a pure mapping. But from the
application level, I don't think that there's cloud provider specific things. At the infrastructure level, there are definitely certain things, but I don't think that they're
specific enough that it would actually hinder you from moving from one to the other.
Okay.
Thank you.
It's always been a strange question.
And on the one hand, you can approach a cloud provider as more or less, oh, they just provide
us virtual computers.
So what they do internally and as a culture and how they think about the world really doesn't matter all that
much to how you run your environment once you understand the constraints versus if you go
something in a full, more or less in a full cloud native direction, then it turns into something
that's very, well, how do they think about this?
How should I be architecting this?
And I mean, to some extent, if you gaze long enough into the Google abyss, do you become
Google in some small way as far as how you think about operations, how you think about
the responsible running of environments?
And I think it depends on what you're running too.
You know, one of Google's big market segments is high performance computing.
So people doing like genomics research, et cetera, where they might have some on-premise
data and then they need elasticity of the cloud where, like you said, they're just
launching VMs, right?
They view the cloud as an extension of compute and they're launching them for a couple hours.
They're running very complex genomics simulations or DNA stuff that I don't understand because
I haven't taken a biology class in 12 years, but it's very important. Don't get me wrong. The work is
important. I just don't understand it. And what they look for in a cloud is very different than
someone who's looking to run microservices, for example. And that concept there is,
it's different, right? So it's not just about what does the cloud provider offer? It's also like,
what problem are you trying to solve? And that's a key thing that I think we forget about every
once in a while. Perfect. Thank you very much for your time, Seth. Before we wind up calling it an
episode, is there anything that you're working on that you'd like to draw attention to and have
people check out? In the short term, not so much. In the long term, you should look for a lot of the
stuff that Google is going to be doing in the DevOps space.
The things I talked about specifically with SRE
and how SRE relates to DevOps,
you should see some content coming out shortly
that will hopefully explain that a lot more clearly.
Perfect. Thank you very much for your time, Seth,
and enjoy the rest of the day.
Thanks, Corey, you too.
This has been Screaming in the Cloud and I'm Corey Quinn.
This has been this week's episode
of Screaming in the Cloud.
You can also find more Corey
at screaminginthecloud.com
or wherever fine snark is sold.