Screaming in the Cloud - Episode 27: What it Took for Google to Make Changes: Outages and Mean Tweets
Episode Date: September 12, 2018

Google Cloud Platform (GCP) turned off a customer that it thought was doing something out of bounds. This led to an Internet outrage, and GCP tried to explain itself and prevent the problem in the future. Today, we’re talking to Daniel Compton, an independent software consultant who focuses on Clojure and large-scale systems. He’s currently building Deps, a private Maven repository service. As a third-party observer, we pick Daniel’s brain about the GCP issue, especially because he wrote a post called, Google Cloud Platform - The Good, Bad, and Ugly (It’s Mostly Good).

Some of the highlights of the show include:
Recommendations: Use enterprise billing - costs thousands of dollars; add phone number and extra credit card to Google account; get support contract
Google describing what happened and how it plans to prevent it in the future seemed reasonable; but why did it take this for Google to make changes?
GCP has inherited cultural issues that don’t work in the enterprise market; GCP is painfully learning that they need to change some things
Google tends to focus on writing services aimed purely at developers; it struggles to put itself in the shoes of corporate-enterprise IT shops
GCP has a few key design decisions that set it apart from AWS; focuses on global resources rather than regional resources
When picking a provider, is there a clear winner? AWS or GCP? Consider company’s values, internal capabilities, resources needed, and workload
GCP’s tendency to end service on something people are still using vs. AWS never ending a service tends to push people in one direction
GCP has built a smaller set of services that are easy to get started with, while AWS has an overwhelming number of services
Different Philosophies: Not every developer writes software as if they work at Google; AWS meets customers where they are, fixes issues, and drops prices
GCP understands where it needs to catch up and continues to iterate and release features

Links:
Daniel Compton
Daniel Compton on Twitter
Google Cloud Platform - The Good, Bad, and Ugly (It’s Mostly Good)
Deps
The REPL
Postmortem for GCP Load Balancer Outage
AWS Athena
Digital Ocean
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This week's episode of Screaming in the Cloud is generously sponsored
by DigitalOcean. I would argue that every cloud platform out there biases for different things.
Some bias for having every feature you could possibly want offered as a managed service at
varying degrees of maturity. Others bias for, hey, we heard there's some money to be made in the cloud space. Can you give us some of it?
DigitalOcean biases for neither. To me, they optimize for simplicity. I polled some friends of mine who are avid DigitalOcean supporters about why they're using it for various things,
and they all said more or less the same thing. Other offerings have a bunch of shenanigans,
root access and IP addresses.
DigitalOcean makes it all simple.
In 60 seconds, you have root access to a Linux box with an IP.
That's a direct quote, albeit with profanity about other providers taken out.
DigitalOcean also offers fixed price offerings. You always know what you're going to wind up paying this month,
so you don't wind up having a minor heart issue when the bill comes in.
Their services are also understandable without spending three months going to cloud school.
You don't have to worry about going very deep to understand what you're doing.
It's click button or make an API call and you receive a cloud resource.
They also include very understandable monitoring and alerting.
And lastly, they're not
exactly what I would call small time. Over 150,000 businesses are using them today. So go ahead and
give them a try. Visit do.co slash screaming, and they'll give you a free $100 credit to try it out.
That's do.co slash screaming. Thanks again to DigitalOcean for their support of Screaming in the Cloud.
Hello and welcome to Screaming in the Cloud.
This week, I'm joined by Daniel Compton, who is an independent consultant based in New Zealand.
Welcome to the show, Daniel.
Thanks for having me.
Thank you for taking the time to be here.
So you came to my notice a few weeks back when there was a bit of a kerfuffle with respect to GCP turning off a customer that they thought was doing something, I guess, a little out of bounds.
And it led to internet outrage. People are always mad. Google wound up posting a whole in-depth explanation of what happened, what they're doing to prevent this happening in the future. And that led me to a blog post that you not only have been writing, but been maintaining actively.
And I'll throw a link to it in the show notes.
But it comes from a perspective of looking at Google Cloud Platform, GCP,
from the perspective of someone who is also familiar with AWS.
I mean, there have been comparisons like this before,
but most of them tend to come from people
with a particular horse in the race.
You partner with neither company.
You're effectively an independent third-party observer.
And I thought that you had one of the best write-ups
that I've ever seen from that perspective
and wanted to pick your brain.
Thanks, yeah.
So as far as that, I guess, unfortunate circumstance
where Google wound up turning a customer off,
what happened there for those who aren't familiar?
Sure.
So there was a company,
I don't think we ever found out exactly who they were,
but it sounded like they were doing some work,
industrial work with windmills and big machinery,
and they were running some of it on Google Cloud.
And they got an alert saying,
your project has been shut down because of something that doesn't look right about it,
and we're scheduled to turn everything off within three days.
And that clearly scared them, and they were trying to contact Google
and there were no contact numbers
and it was the same kind of story
that we've seen on Google's consumer services
for many years now,
but now applied to Google Cloud
where the stakes are considerably higher.
And so they posted a blog post about it
and this turned into a big thing, got a lot of negative attention for Google Cloud. One of the recommendations that came out of it was enterprise billing, a feature I'd never heard of until this time,
where it's a way where you go through extra verification
and they promise not to shut you down if they detect something bad is going on.
And that's something.
But when I went to look into that,
you've got to be paying at least $2,500 a month to qualify for that, which I wasn't.
And I suspect many people using Google Cloud aren't.
And the other suggestions were to add a phone number to your Google Cloud account, add another credit card as a backup, and pretty critically, get a support
contract with Google Cloud.
And when all of that fell out, it's despite the internet rage machine that likes to kick
off on Hacker News or on Twitter and drag people under the bus, and I admit that I'm
occasionally guilty of participating in that myself, that it's a sympathetic problem
in that you run a hosting platform
that gives access to all kinds of different customers,
which means that effectively anyone
with a stolen credit card number
can spin up large quantities of resources
and begin doing terrible things with it.
And shutting down anything that has a hint of suspicion to it
is obviously not a great
plan, but being completely permissive, where whatever you want to do on the platform is just
fine, leads to everyone blocking your network at their own border, and that doesn't work either.
So it's a spectrum, and where you fall on that spectrum is a very difficult problem to solve for. And I do have an awful lot of sympathy for this. I thought that their mea culpa that they gave in a formal blog post about how this happened, what they were planning on doing in order to prevent this in the future was reasonable. It felt like they were starting to understand the level of concern this rightly
causes with people who are running production infrastructure on top of their platform.
My question for you, as someone who's been looking at Google as an outsider for a while now,
is do you think that that's going to stick, first off. And secondly, why does it take something going
this far afield to get Google to acknowledge that type of thing?
Yeah, so I definitely think it is going to stick. It's clearly gotten enough attention
that they're making changes internally to prevent this from ever happening again.
And yeah, it's definitely going to cost them. I'm sure it's already cost them customers and it's going to cost them customers for several years now.
That kind of reputational damage isn't repaired quickly
and so Google's got a long way to go.
The number of people who saw that blog post
would be a tenth, maybe a hundredth of the people who saw
Google Cloud shuts down your account.
So just from that point of view, they've got a long way to go to repair that damage.
And I think there's some cultural issues that Google has inherited
or Google Cloud has inherited from Google, the consumer organization,
where these kinds of things were what they had to do to scale,
or at least that's what they chose to do,
to be able to scale up to their current growth.
And so there's some behaviors like that,
which just don't fly in the enterprise market.
And they're learning painfully
that they need to change some of those things.
One question that always leaps to my mind is, and this might be an unfair characterization,
but it's always felt to some extent like Google focuses on writing services aimed purely at
developers who are similar to developers who would be found at Google. It seems they struggle to put themselves in the shoes of, for example,
corporate enterprise IT shops or companies whose entire ethos
does not necessarily revolve around technology.
Is that an unfair stereotype in your experience?
I don't know if that would be an unfair stereotype.
I think Google Cloud definitely has a particular philosophy
and product design bent, and that's different to AWS.
And we can talk a little bit more about that if you like.
But that definitely does mean that depending on the perspective
you're coming from, some things are going to be more suited to you
from AWS or perhaps Google Cloud.
Well, you've now gone into a fairly deep dive on both AWS and on Google Cloud.
So based upon that, and you go into this in extreme levels of depth in that blog post,
but at a high level, what is your takeaway?
So at a high level, my main takeaway is that
Google Cloud has a few key design decisions that really set it apart from AWS.
The big one is, from a developer's perspective,
the focus on global resources rather than regional ones.
What I mean by that is that in AWS, pretty much everything you do,
not entirely everything, but most things you do are scoped
to a region or perhaps even a zone.
And so that means that all of your resources are stuck within that zone.
And if you ever want to cross out into other regions, then that can be quite a lot of work
to egress those points. Whereas Google Cloud has instead architected their system
to be global by default.
Most of the resources you use are global.
Certainly, many of the resources are global.
Things like disk images, the view in the console,
you can see across all of the regions at a single time, the key
management services that you use. So that's kind of a big thing from a developer's point of view,
especially if you're looking to run across multiple regions.
I found that the counterpoint to that shared control plane where everything is global
is that it does open the door for
outages that are world-spanning. When you have a harsh boundary at the different region level,
yeah, you might wind up losing Oregon or Virginia, but the rest of the world is
generally going to be okay. In fact, I don't believe that in the past 12 years we've seen
a single global service outage for virtually anything that AWS has done.
Yeah, that's definitely the trade-off. And in fact, just
two weeks ago, there was about a half an hour or so outage
that was on the HTTP load balancers,
which also affected Stackdriver, their monitoring service,
and a few other things there too.
Yeah, for clarity, that was an outage over on the GCP side. I'll throw a link to that in the show notes as well
as far as the post-mortem that came out of that.
Yeah, and so
that's the trade-off, really. They're promising a lot there
and you definitely need to, I guess, you've got a higher level
of dependence on them.
If those load balancers go down, there's very little you can do.
At the time, I was running Deps in production on Google Cloud,
and so during that outage I was looking, do I just bypass the load balancers entirely
and redirect stuff to the instances to work around it.
And luckily, everything came back quickly enough.
But that was a little bit of a scary moment.
Absolutely.
Oh, by the way, everything's broken.
It'll be fixed soon.
Even if true, for some use cases,
it can be absolutely terrifying.
It's, well, we have paying customers and we're losing money by the minute, so
what's going on is the natural immediate panic reaction for most of us.
Yeah. And so that's
I'm sure they're going to learn from that. And there's been
before I was using Google Cloud really seriously, I know
in years past they've also had some outages on the network load balancer.
So there's definitely a risk you take,
and it's one that I'm happy to take at the moment
for the features and benefits that it provides.
But yeah, it's definitely something I keep my eye on.
So today, let's pretend that you're a new customer.
You're about to build out a thing.
And the time has come to pick a cloud provider
and you narrow it down to GCP or AWS.
Is there a clear winner today?
I don't think there's a clear winner for everybody.
One, I don't think either strictly dominates the other.
And I think that the things that you need to think about,
firstly, what are your values?
You know, as a company,
what are sort of the principles
and the things that you really value?
What are your internal capabilities?
And what is your workload like
that you're trying to run on it?
Because there's some specialty things in both AWS and GCP
that if they fit your workload, they can be gold.
And so, yeah, those are really the big differences.
And certainly from an ease of management perspective,
I would say Google Cloud definitely wins there.
You look at the ever-widening number of different instance types on AWS,
and Google Cloud has thus far managed to keep things much, much simpler.
There's basically just undifferentiated vCPUs
and memory that you can choose.
You can choose the processor family that you want, if you really want to,
although they don't really sort of push you down that path too much.
And then you just choose how many CPUs do you want,
how much memory do you want,
and you can pick just about anything on that configuration space
that you'd like.
Yeah, but as a counterpoint, if you go down that path,
how are you going to kill two and a half months doing RI calculations?
Yeah, and the pricing in Google Cloud is just a lot simpler to calculate and understand,
and they continue to make things simpler and easier for people all the time on that perspective.
So that's probably not so great for your business where you try and help
people understand their crazy AWS bills.
Believe me, I wish there wasn't a need for
my business. There are many things I would rather do instead. So when it comes time to pick a
provider, what factors should people really consider when they're trying to decide, let's
say, between GCP and AWS? I mean, it's a big decision that's kind of hard to unwind.
Yeah, yeah, it's definitely a big decision. And you're right that it's hard to unwind.
There's talk of multi-cloud, and I guess for some super large companies, that makes sense.
But for many people, the costs and limiting yourself to the lowest common denominator just really makes that not possible.
So I would look at what your team has experience in, what kind of resources you need, where they're running, where the regions are that they're running in.
Yeah, it's not an easy decision.
And I spent probably far more time than I would like to admit evaluating Google Cloud and AWS
and a few other cloud providers before I settled on Google Cloud.
So a common, I guess, criticism that some people who may or may not be
me have levied against Google historically, among them have been their propensity to end-of-life
things that people are using. The other side of that coin is that AWS will launch a new service
and that service effectively is going to be the trunkless legs of stone in the desert
or King of Kings, look upon my works, ye mighty, and despair.
That service is still running after the apocalypse.
And that tends to wind up pushing people in one direction or another.
I mean, it does definitely bloat and complicate the AWS service catalog,
but it does feel like you can rely on anything that AWS
launches to a degree that you can't potentially do with GCP. Thoughts?
Yeah. So I think it's important to distinguish the consumer Google from the Google Cloud.
So consumer Google, shutting down products is something they do relatively often, and they pay for it every time in Hacker News comments.
Oh, yes. I mean, mean tweets are absolutely something that every product manager should take into deep consideration.
Will this offend someone on the internet before we do it? Oh, yeah, that should drive all the corporate decision making.
Yeah. And so Google Cloud, to my knowledge, I don't think they've shut...
Once something's become general availability, I don't think anything's been removed or shut
down from there.
But as you say, they shut down products.
They've shut down other products in the past.
And people from the outside look at Google and they don't distinguish necessarily between
Google Cloud and Google.
They just see Google shuts down services.
It's the same logo everywhere.
Yeah, it's the same logo everywhere.
And so they think, well, how can I trust Google Cloud?
Are they going to raise the price on me 40 times
or whatever the recent Google Maps price increase was?
These are sort of unforced errors from my perspective
that are going to cost them a lot,
whereas AWS just isn't making those errors
and they pay for it in complexity, certainly.
But from a business perspective, I think usually people would prefer to be able to just rely
on something to know that it's going to be there and it's never going to get more expensive.
It only ever gets cheaper.
And yeah, that's something that AWS has done really well.
Absolutely.
The challenge too, and this sounds like a bit of a backhanded compliment in some ways,
and it's not intended that way, but GCP has built out a smaller set of services that are
relatively easy to get started with, as opposed to, oh, I'm going to spin up something new
in AWS.
I've never heard of it before.
Let's see what happens.
Oh my God, I'm staring at a list of 120 services.
I don't know what any of them do. I'm going to go raise goats instead.
There's something to be said for being more straightforward in your offering and much more defined in messaging.
Do you find that that's resonating?
I mean, right now, I look at an AWS console
and I have a decent idea of what I'm looking at
because I've been institutionalized
for 12 years of staring at these things.
But for someone who's new,
I don't see that that's there.
Yeah, I definitely,
I haven't been looking at AWS for 12 years,
but I can look at the console
and get a reasonable understanding
of what I can look at and what I can ignore. But definitely for someone coming in cold, there's a lot of stuff there. And
just even getting your bearings to even understand where you should be looking or
what you should be doing there is a big job. And so that's one of the things where I think Google
Cloud is stronger, for me, I'll say that.
Maybe not for everybody; maybe some people prefer the AWS perspective.
But certainly for smaller teams, or teams who want things to be simpler,
you generally get a smaller set of well-built, flexible primitives rather than,
you know, AWS's 18 different queuing services,
which all are slightly different
and they're all relevant in slightly different contexts.
Google Cloud just has one.
And so that's probably the best example I have
of the different product philosophies there in simplicity.
I would also argue that there might be a company
ethos discussion here with respect
to how each company
respectively views its
customers. Google,
it seems to me, and please feel free to correct
me if I'm wrong in my assessment,
that they believe
that most of the world should write software
the way that Google engineers
tend to write software. And that's not inherently a bad thing. Google software engineers are incredible.
The counterpoint is that if you take a look across the entire ecosystem,
not every developer writes software like they work at Google, to the same network design
principles, to the same baseline level of quality from the same perspective. Conversely,
it feels like Amazon throws out a lot of ridiculous but closely related services
in an effort to meet customers where they are. Is that a fair characterization?
I don't know if I would say Google expects you to write software in that way, but I think
Google Cloud is definitely
moving in a direction where they're giving you the tools that you can write software in the way
that Google does. You know, the global stuff, global load balancers, things that span the world.
These are, you know, these are primitives that Google uses internally and, you know, uses them
to run, you know, massive fleets of software.
And so they're, you know, one of the promises or the, you know, the dreams of Google Cloud
is that you too can write software that runs like Google. I would say also that AWS, you know,
has a product philosophy that you have to buy into. It's just a different one, and it's probably more flexible.
I'd give them that.
But underneath, there's sort of a meta point I wanted to make
about the ethos, which is that for the last few weeks,
I've been looking at AWS Athena,
which is a hosted Presto service from Amazon, where you can put a bunch of data in S3
and query it at super fast speeds. And we discovered a bug that affected the billing
of the service. And so it's charged by the number of bytes read. And if you touch particular column
types or column definitions in your queries, it ends up costing you to read the whole partition.
And I was talking with my boss about this and I came away and realized that I had complete faith
that AWS was going to do the right thing eventually and fix the bug and bring the pricing down for us.
There was kind of no question in my mind that AWS is going to,
my perspective of them is they're always trying to work for the customer,
bring the price down.
And it's not that Google Cloud doesn't have that.
It's just they don't have the reputation and those many years of proving it behind them.
So there's kind of maybe an empty corporate identity on the Google Cloud side.
You know, who's the Jeff Barr of Google Cloud?
I don't really see anyone there.
And I think that's something that they would do well to develop.
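As a rough illustration of the Athena billing behavior described above, the sketch below compares what a query would cost if only the touched columns were read against what it costs when the whole partition is read. Everything in it is an assumption made for the example: the price per terabyte, the column names, and the partition sizes are hypothetical and do not come from the episode.

```python
# Hypothetical illustration of per-byte query billing.
# All names and figures below are assumptions, not values from the episode.
PRICE_PER_TB_USD = 5.00      # assumed price per terabyte scanned
BYTES_PER_TB = 1024 ** 4
GIB = 1024 ** 3


def query_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of one query when billing is proportional to bytes read."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# Assumed layout of one partition: column name -> bytes stored.
partition_columns = {
    "user_id": 20 * GIB,
    "timestamp": 30 * GIB,
    "payload": 974 * GIB,
}

# Expected behavior: only the columns the query touches are read.
expected_bytes = partition_columns["user_id"] + partition_columns["timestamp"]

# Behavior described in the episode: the whole partition is read.
actual_bytes = sum(partition_columns.values())

expected_cost = query_cost_usd(expected_bytes)
actual_cost = query_cost_usd(actual_bytes)

print(f"expected: ${expected_cost:.4f}, actual: ${actual_cost:.2f}")
```

With these made-up sizes, the same query is billed for roughly twenty times the data it needed, which is why a bug of this kind shows up on the bill rather than in query results.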
Absolutely.
The counterpoint, of course, is that credit where due.
They broke the mold when they made Jeff Barr.
Yes, they're not going to easily find another one of them around.
So a common observation has been that Google's feature set is in some ways
behind AWS. And that's not surprising in that for the first five years that AWS existed,
the other major players more or less ignored them for whatever reason. And they had a tremendous
head start. Now, in some cases that let them iterate and advance very quickly. In other cases
that let them go on exciting journeys
into discovering exactly what didn't work.
SimpleDB, I'm looking at you.
How do you feel that GCP is going about catching up in that context?
So when I started writing my blog post,
I started it around January of this year.
And I had a bunch of complaints in there.
And I had drafted it all out, had sort of all of the things I wanted to talk about, but didn't sort of flesh
it all the way through. And every couple of weeks or so, a new announcement from Google Cloud would
come up and it would invalidate one of the points in my post. And I would be frustrated because I'd go,
ah, that was something I was going to talk about,
and now it's just a non-issue.
And so I think I've been,
at least from the sort of,
I'm a small-scale developer.
I'm not an enterprise developer,
and I don't have a ton of insight
into what enterprises are looking for from Google. But at least
from my perspective, it definitely seems like they understand where
they need to catch up, and they're doing so. They're
continuing to iterate and release the features.
There is definitely a feature gap there, and I think they
are working their best to catch up.
The challenge for them, though, is that AWS is not standing still.
They are accelerating much faster than Google is at the moment, honestly.
And re:Invent is not that far away.
And you can only imagine what bountiful pleasures Amazon's going
to give us.
Oh, yes. I've been hinted to by little birds that there should be more than one new
service launching, which I'm sure is now going to take the entire world by storm. Wait, they're
going to release new things? They're not declaring victory with what they have now and moving on to selling something else? I mean, that tends to be...
Maybe.
Oh, absolutely. Or serverless. They just go down to none of them.
Aurora serverless, maybe. Maybe that'll launch.
So I do encourage listeners to take about 20 minutes or so and go through your blog post. It is a fantastic point-by-point dissection of what GCP is good at, what GCP is not terrific at, and a nuanced critique of both
aspects. It's really nice to see something like this. I don't see it too often, which is why I'm
so glad that you could clear time on your schedule and the stars aligned to finally put both of us on a call at the same time.
Where else can people go to hear your impressive thought leading?
Sure. Well, I'm not sure about thought leading,
but I do write a weekly newsletter about the Clojure programming language
called The REPL, at therepl.net.
And I run a private Maven repository service,
the one that's using Google Cloud.
It's called Deps, and it's at deps.co. So yeah, those are two ways you can see what I'm up to.
Wonderful. I'll put links to those in the show notes as well. I want to thank you once again
for joining me. This has been Daniel Compton, an independent software consultant who focuses
on Clojure and large-scale systems. I'm Corey Quinn, and this is Screaming in the Cloud.