Screaming in the Cloud - Building a Partnership with Your Cloud Provider with Micheal Benedict
Episode Date: November 10, 2021
About Micheal
Micheal Benedict leads Engineering Productivity at Pinterest. He and his team focus on developer experience, building tools and platforms for over a thousand engineers to effectively code, build, deploy and operate workloads on the cloud. Mr. Benedict has also built Infrastructure and Cloud Governance programs at Pinterest and previously, at Twitter -- focussed on managing cloud vendor relationships, infrastructure budget management, cloud migration, capacity forecasting and planning and cloud cost attribution (chargeback).
Links:
Pinterest: https://www.pinterest.com
Teletraan: https://github.com/pinterest/teletraan
Twitter: https://twitter.com/micheal
Pinterestcareers.com: https://pinterestcareers.com
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
You know how Git works, right?
Sort of. Kind of. Not really.
Please ask someone else.
That's all of us.
Git is how we build things,
and Netlify is one of the best ways I've found to build those things quickly for the web.
Netlify's Git-based workflows mean that you don't have to play slap and tickle with integrating arcane nonsense and webhooks, which are themselves about as well understood as Git.
Give them a try and see what folks ranging from my fake Twitter for pets startup to
Global Fortune 2000 companies are raving about.
If you end up talking to them,
because you don't have to, they get why self-service is important, but if you do,
be sure to tell them that I sent you and watch all of the blood drain from their faces instantly.
You can find them in the AWS Marketplace or at www.netlify.com.
N-E-T-L-I-F-Y dot com. This episode is sponsored in part by our friends
at Vultr, spelled V-U-L-T-R, because they're all about helping save money, including on things like,
you know, vowels. So what they do is they are a cloud provider that provides surprisingly high
performance cloud compute at a price that,
well, sure, they claim it is better than AWS's pricing. And when they say that,
they mean that it's less money. Sure, I don't dispute that. But what I find interesting is
that it's predictable. They tell you in advance on a monthly basis what it's going to cost.
They have a bunch of advanced networking features. They have 19 global locations
and scale things elastically, not to be confused with openly, which is apparently elastic and open.
They can mean the same thing sometimes. They have had over a million users. Deployments take less
than 60 seconds across 12 pre-selected operating systems. Or if you're one of those nutters like me, you can bring your own ISO
and install basically any operating system you want.
Starting with pricing as low as $2.50 a month
for Vultr Cloud Compute,
they have plans for developers and businesses of all sizes,
except maybe Amazon,
who stubbornly insists on having something of the scale
all on their own.
But you don't have to take
my word for it with an exclusive offer for you. Sign up today for free and receive $100 in credits
to kick the tires and see for yourself. Get started at vultr.com slash morningbrief.
That's v-u-l-t-r dot com slash morningbrief. Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while,
I like to talk to people who work at very large companies that are not, in fact, themselves a
cloud provider. I know, sounds ridiculous. How can you possibly be a big company and not make
money by selling managed NAT gateways to an unsuspecting public? But I'm told it can be done.
Here to answer that question, and hopefully at
least one other, is Pinterest's head of engineering productivity, Micheal Benedict. Micheal, thank you
for taking the time to join me today. Hi, Corey. Thank you for inviting me today. I'm really excited
to talk to you. So exciting times at Pinterest in a bunch of different ways. It was recently reported,
which of course went right to the top of my inbox as 500,000 people on Twitter all said, hey, this sounds like a Corey would be
interested in it thing. It was announced that you folks had signed a $3.2 billion commitment with
AWS stretching until 2028. Now, if this is like any other large-scale AWS contract commitment deal that has been made public,
you were probably immediately inundated with a whole bunch of people who are very good at
arithmetic and not very good at business context saying, 3.2 billion, you could build
massive data centers for that. Why would anyone do this? And it's tiresome, and that's the world
in which we live. But I'm guessing you heard at least
a little bit of that from the peanut gallery. I did. And I always find it interesting when,
you know, direct comparisons are made with the total amount that's been committed. And like you
said, there's so many nuances that go into kind of how to perceive that amount and put it in context
of obviously what Pinterest does. So I at least want to take this opportunity and kind of share it with everyone that Pinterest has been on the cloud
since day one. When Ben initially started the company, that product was launched. It was a
simple Django app. It was launched on AWS from day one. And since then, it has grown to support
like 450 plus million MAUs over the course of the decade. And our infrastructure has
grown pretty complex. We started with a bunch of EC2 machines and, you know, persisting data in S3.
And since then, we have kind of explored an array of different products. In fact, sometimes working
very closely with AWS as well and helping them put together a product roadmap for some of the
items they're working on as well. So we have an amazing partnership with them. And part of the commitment on how we want to see these
numbers is how does it unlock value for Pinterest as a business over time in terms of making us
much more agile without thinking about the nuances of the infrastructure itself. And that's, I think,
one of the best ways to really put this into context, that it's not a single number we pay at the end of the month, but rather we are on track to spending
a certain amount over a period of time.
So this just keeps accruing or adding to that number.
And we basically come out with an amazing partnership with AWS where we have that commitment
and we're able to kind of leverage their products and full suite of items without any hiccups.
The most interesting part of what you said
is the word partner.
And I think that's the piece that gets lost an awful lot
when we talk about large-scale cloud negotiations.
It's not like buying a car
where you can basically beat the crap
out of the salesperson.
You can act as if $400 price difference on a car
is the difference between storm out of the dealership
and sign the contract, great.
You don't really have to deal with that person ever again.
In the context of a cloud provider,
they run your production infrastructure.
And if they have a bad day,
I promise you're going to have a bad day too.
You want to handle those negotiations
in a way that is respectful of that
because they are your partner,
whether you want them to be or not. Now, I'm not suggesting that any cloud provider is going to
hold an awkward negotiation against the customer, but at the same time, there are going to be
scenarios in which you're going to want to have strong relationships where you're going to need
to cash in political capital to some extent. And personally, I've never seen stupendous value
in trying to beat the crap out of a company
in order to get another 10th of a percent discount
on a service you barely use
just because someone decided that,
well, we didn't do well in the last negotiation,
so we're going to get them back this time.
That's great.
What are you actually planning to do as a company?
Where are you going?
And the fact that you just
alluded to that you're not just a pile of S3 and EC2 instances speaks in many ways to that. By
moving into the differentiated service world, suddenly you're able to do things that don't
look quite as much like building a better database and start looking a lot more like
servicing your users more effectively and well. And I think like you said, right, I feel like there's like a general skepticism in viewing
that the cloud providers are usually out there to rip you apart. But in reality, that's not true.
To your point, as part of the partnership, especially with AWS and Pinterest, we've got
an amazing relationship going on. And behind the scenes, there's a dedicated team at Pinterest
called the Infrastructure Governance Team, a cross-functional team with folks from finance, legal, engineering, product,
all sitting together and working with our AWS partners. Even the AWS account managers and the
TAMs are part of that to help us kind of like make both Pinterest successful. And in turn,
AWS gets that amazing customer to work with in helping build some of their
newer products as well.
And that's one of the most important things we have learned over time is that there's
two parts to it.
When you want to help improve your business agility, you want to focus not just on the
bottom line numbers as they are.
It's okay to kind of pay a premium because it offsets sort of the people capital you
would have to invest in getting there.
And that's a very tricky way
to look at math,
but that's what these teams do.
They sit down and work
through those specifics.
And for what it's worth,
in our conversations,
the AWS teams always come back
with giving us very insightful data
on how we're using their systems
to help us better think about
how we should be pricing
or looking things ahead.
I'm not the expert on this.
Like I said, there's a dedicated team sitting behind this and looking through and working
through these deals.
But that's one of the important takeaways I hope the users or the listeners of this
podcast can take away that you want to treat your cloud provider as your partner as much
as possible.
They're not always there to screw you.
That's not their goal.
And I apologize for using that term.
It is important that you sort of set that expectation
that it's in their best interest
to actually make you successful
because that's how they make money as well.
It's a long-term play.
I mean, they could gouge you this quarter
and then you're trying to evacuate as fast as possible.
Well, they had a great quarter,
but what's their long-term prospect?
There are two competing philosophies
in the world of business.
You can either make a lot of money quickly
or you can make a little bit of money and build it over time in a sustained way.
And it's clear the cloud providers are playing the long game on this because they basically have to.
I mean, it's inevitable at this point, right? I mean, look at Pinterest. It is one of those
success stories, starting as a Django app and a bunch of EC2 machines to where we are right now
with having like a three plus billion dollar commitment for a span of a couple of years.
And, you know, we do spend a pretty significant chunk
of that on a yearly basis.
So in this case, I'm sure
it was a great successful partnership.
And I'm hoping like some of the newer companies
who are building on the cloud from the get-go
are thinking about it from that perspective.
And one of the things I do want to call out, Corey,
is that, you know, we did initially start
with using the primitive services in AWS,
but it became clear over time, and I'm sure you've heard of the term multi-cloud and many of that, you know, when companies start evaluating how to make the most out of the deals
they're negotiating or signing, it is important to kind of acknowledge that the cost of sort of
any of those evaluations or even thinking about migrations never tends to get factored in. And we always
tend to treat that as being extremely simple or not. But those are engineering resources you
want to be spending more building on the product rather than these crazy costly migrations. So it's
in your best interest probably to start using the most from your cloud provider and also look for
opportunities to use other cloud providers if they provide more value in certain product offerings.
Rather than thinking about like a complete lift and shift, and I'm going to make DR as being
the primary case on why I want to be moving to multi-cloud.
Yeah.
There's a question, too, of the numbers on paper look radically different than the reality
of this.
You mentioned Pinterest has been on AWS since the beginning, which means that even if an
edict had been passed at the beginning that
thou shalt never build on anything except EC2 and S3, the end, full stop. And let's say you
went down that rabbit hole of, oh, we don't trust their load balancers. We're going to build our own
at home. We have load balancers at home. We'll use those. It's terrible. But even had you done
that and restricted yourselves just to those baseline building blocks and then decided to do a cloud migration, you're still looking back at over a decade of experience where the app has been built unconsciously reflecting the various failure modes that AWS has, the way that it responds to API calls, the latency in how long it takes to request something versus it being available, etc., etc. So even moving that baseline thing to another cloud provider
is not a trivial undertaking by any stretch of the imagination.
But that said, because the topic does always come up,
and I don't shy away from it.
I think it's something people should go into with an open mind.
How has the multi-cloud conversation progressed at Pinterest?
Because there's always a multi-cloud conversation. We've always approached it with some form of openness. It's not like we
don't want to be open to the ideas, but you really want to be thinking hard on the business case and
the business value something provides on why you want to be doing X. In this case, when we think
about multi-cloud, and again, like Pinterest did start with EC2 and S3,
and we did keep it that way for a long time.
We built a lot of primitives around it, used it.
For example, my team actually runs
sort of our bread and butter deployment system on EC2.
We help facilitate deployments
across 100,000 plus machines today.
And like you said, we have built that system
keeping in mind how AWS works and kind
of understanding sort of the nuances of region and AZ failovers and all of that and help facilitate
deployments across thousand plus microservices in the company. So thinking about leveraging,
say a Google cloud instance and how that works in theory, you know, we can always make a case
for engineering to kind of build our deployment system and expand that,
but there's really no value.
And one of the biggest cases,
usually when multi-cloud comes in,
is usually either negotiation for price
or actually a DR strategy.
Like what if AWS goes down in us-east-1?
Well, let's be honest,
they're powering half the internet from that one thing.
Yeah, so if you think your business is okay running
when AWS goes down
and half the internet is not going to be working,
how do you want to be thinking about that?
So DR is probably not the best reason
for you to be even exploring multi-cloud.
Rather, you should be thinking about
what the cloud providers are offering
as a very nuanced offering,
which your current cloud provider is not offering,
and really think about just using those specific items.
So I agree that multi-cloud for DR purposes is generally not necessarily the best
approach with the idea of being able to failover seamlessly. But I like the idea for backups.
I mean, Pinterest is a publicly traded company, which means that among other things, you have to
file risk disclosures and be responsive to auditors in a variety of different ways. There
are some regulations that start applying to you. And the idea of, well, AWS builds things out
in a super effective way, region separation, et cetera.
Whenever I talk to Amazonians,
they are always surprised that anyone wouldn't accept that,
oh, if you want backups, just use a different region.
Problem solved.
Right, but it is often easier for me
to have a rehydrate the business level of backup
that would take weeks to redeploy living on another cloud provider than it is for me to explain to all
of those auditors and regulators and financial analysts, et cetera, why I didn't go ahead and
do that path. So there's always some story for, okay, what if AWS decides that they hate us and
want to kick us off the platform? Well,
that's why legal is involved in those high-level discussions around things like risk and indemnity
and termination for convenience and for cause clauses, et cetera, et cetera. The idea of making
an all-in commitment to a cloud provider goes well beyond things that engineering thinks about.
And it's easy for those of us with engineering backgrounds to be incredibly dismissive of that. Oh, indemnity?
When does AWS ever lose data?
Yeah, but let's say one day they do.
What is your story going to be when asked some very uncomfortable questions by people who wanted you to pay attention to this during the negotiation process?
It's about dotting the I's and crossing the T's, especially with that many commas in the contractual commitments.
No, it is true. And, you know, we did evaluate that as an option. But one of the interesting things about, you know, compliance and especially auditing as well, we generally work with sort of
the best in class, you know, consultants to kind of help us work through the controls and every,
how we kind of audit, how we look at these controls, how to make sure there's like enough
accountability going through.
The interesting part was in this case as well,
we were able to sort of work with AWS in crafting a lot of those controls
and setting up sort of the right expectations
as and when we were putting our proposals together as well.
Now, again, I'm not an expert on this
and I know we have a dedicated team
from our technical program management organization
focused on this.
But early on, we realized, to your point,
the cost of any form of backups
and then being able to audit what's going in,
look at all those pipelines,
how quickly we can get the data in and out,
was proving pretty costly for us.
So we were able to work out some of that
within the constructs of what we have
with our cloud providers today
and still meet our compliance goals.
That's sort of, on some level, the higher point too, where everything comes down
to context. Everything comes down to what the business demands, what the business requires,
what the business will accept. And I'm not suggesting that in any case they're wrong.
I'm known for beating the multi-cloud is a bad default decision drum. And then people get
surprised when I'll have one-on-one conversations
and they say, well, we're multi-cloud.
Do you think we're foolish?
No, you're probably doing the right thing
just because you have context
that is specific to your business
that I, speaking in a general sense,
certainly don't have.
People don't generally wake up in the morning
and decide they're going to do a terrible job
or no job at all at work today
unless they're Facebook's VP of integrity. So it's not the sort of thing that lends itself to casual
tweet-sized pithy analysis very often. There's a strong dive into what is the level of risk a
business can accept. And my general belief is that most companies are doing this stuff right.
The universal constant in all of my consulting clients that I have spoken to
about the in-depth management piece of things is they've always asked the same question of,
so this is what we've done, but can you introduce us to the people who are doing it really right,
who have absolutely nailed this and gotten it all down? It's, yeah, absolutely no one believes that
that is them, even the folks who are, from my perspective, pretty close to having achieved it. I want to talk a bit more about what you do beyond just the headline-grabbing,
large-dollar-figure commitment to a cloud provider story.
What does engineering productivity mean at Pinterest? Where do you start? Where do you stop?
I want to just quickly touch upon that last point about multi-cloud. And like you said,
every company works within the context of what they are given
and sort of the constraints of their business.
It's probably a good time to kind of give a plug
to my previous employer at Twitter
who are doing multi-cloud in a reasonably effective way.
They are on the data centers.
They do have presence on Google Cloud and AWS.
And I know probably things have changed
since a couple of years now,
but they have sort
of embraced that environment pretty effectively to cater to their acquisitions, you know, who were
on the public cloud, help obviously with their initial set of investments in the data center
and still continue to kind of scale that out and explore, in this case, Google Cloud for a variety
of other use cases, which sounds like it's been extremely beneficial as well. So to your point, there's probably no right way to do this. There's always that context and what you're
working with comes into play as part of making these decisions. And it's important to like,
take a lot of these with a grain of salt, right? Because you can never sort of understand the
decisions, why they were made the way they were made. And for what it's worth, it sort of works
out in the end. I rarely heard like a story where it's never sort of worked out
and people are just upset with the deals they've signed.
So hopefully that sort of like helps close
that whole conversation about multi-cloud.
I hope so.
It's one of those areas where everyone has an opinion
and a lot of them do not necessarily apply universally.
But it's always fun to take, in that case, great.
I'll take the lesser trod path.
Everyone's saying multi-cloud is great, invariably,
because they're trying to sell you something.
Yeah, I have nothing particular to sell folks.
My argument has always been,
in the absence of a compelling reason not to,
pick a provider and go all in.
I don't care which provider you pick,
which people are sometimes surprised to hear.
It's like, well, what if they pick a cloud provider
that you don't do consulting work for?
Yeah, it turns out I don't actually need to win every AWS customer over to have a successful
working business. Do what makes sense for you folks. From my perspective, I want this industry
to be better. I don't want to sit here and just drum up business for myself and make self-serving
comments to empower that, which apparently is a rare tactic.
No, that's totally true, Corey. And like, one of the things you do is help people with their bills, right? Like, so this has come up so many times, and I realize we're sort of going off track
a bit from that engineering productivity discussion. Oh, which is fine. That's this
entire show's theme, if it has one. So I want to briefly just talk about the whole billing and
sort of how cost management works, because I know you spend a lot of time on that, and you help a lot of these companies be effective in how they manage their bills, right? These
questions have come up multiple times, even at Pinterest. We actually, in the past, when I was
sort of leading the infrastructure governance organization, we were working with other
companies of our similar size to better understand how they are looking into getting visibility
into their cost, setting
sort of the right controls and expectations within the engineering organization to plan
and capacity plan and effectively sort of meet those plans in a certain criteria.
And then obviously, if there is any risk to that, actively manage risk.
That was like the biggest thing those teams used to do.
And we used to talk a lot, trade notes, and get a better sense of how a lot of these companies are trying to do, for example,
Netflix or, you know, Lyft or Stripe. I recall that at Netflix, content was sort of their biggest
spend. So cloud spending was like way down in the list of things for them. But regardless,
they had like an active team looking at this on a day-to-day basis, right? So one of the things we
learned early on at Pinterest is that, you know, start investing in those visibility tools early on. No one can parse the cloud bills.
Let's be honest, like you're probably the only person who can like reverse
engineer an architecture diagram from like a cloud bill. And I think that's like definitely,
you know, you should take a patent for that or something. But in reality, like no one has the
time to do that. You want to make sure your business leaders from your finance teams to
engineering teams to, you know, the executives all have a better understanding of how to parse it.
So investing engineering resources, take that data. How do you munch it down to sort of the
cost, the utilization across the different vectors of offerings and have a very insightful
discussion? Like, you know, what are certain action items we want to be taking? It's very
easy to say, oh, we overspent on EC2 and we want to go from there. But in reality,
that's not just that thing. You'll start finding out that EC2 is being used by your Hadoop
infrastructure, which runs hundreds of thousands of jobs. Okay, now who's actually responsible for
that cost? You might find one job which is accruing sort of a lot of instance hours over a
period of time in a shared multi-tenant environment. How do you kind of attribute that cost to that
particular cost center? And then someone left the company a while back,
and that job just kept running in perpetuity. No one's checked the output for four years.
I guess it can't be that necessarily important. And digging into it requires context. It turns
out there's no SaaS tool to do this, which is unfortunate for those of us who set out originally
to build such a thing. But we discovered pretty early on, the context on this stuff is incredibly
important.
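The attribution question raised above (who pays for a job running on a shared, multi-tenant cluster?) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Pinterest's actual system: the function name `attribute_shared_cost`, the job names, the cost centers, and every number are hypothetical, and the split is a simple proportional share by instance-hours.

```python
from collections import defaultdict


def attribute_shared_cost(total_cost, job_usage):
    """Split a shared cluster's cost across cost centers by instance-hours.

    job_usage is an iterable of (job_name, cost_center, instance_hours).
    Returns a dict mapping each cost center to its share of total_cost.
    """
    hours_by_center = defaultdict(float)
    for _job_name, cost_center, instance_hours in job_usage:
        hours_by_center[cost_center] += instance_hours

    total_hours = sum(hours_by_center.values())
    if total_hours == 0:
        # Nothing ran, so there is nothing to attribute.
        return {}

    return {
        center: total_cost * hours / total_hours
        for center, hours in hours_by_center.items()
    }


# Hypothetical usage records, purely for illustration.
usage = [
    ("daily_etl", "ads", 600.0),
    ("search_index", "search", 300.0),
    ("orphaned_report", "unowned", 100.0),  # owner left; job still runs
]
shares = attribute_shared_cost(10_000.0, usage)
for center, cost in sorted(shares.items()):
    print(f"{center}: {cost:.2f}")
```

Keeping an explicit "unowned" bucket, rather than spreading that cost across known teams, is one way to make orphaned jobs like the one described here visible. Whether that fits a given organization depends on the context both speakers keep returning to.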
I love the thing you're talking about here, where you're discussing with your peer companies about
these things. Because the advice that I would give to companies with the level of spend that
you folks do is worlds apart from what I would advise someone who's building something new and
is spending maybe 500 bucks a month on their cloud bill. Those folks do not need to hire a dedicated team of people to solve for these problems. At your scale, yeah, you probably should
have had some people in here looking at this for a while now. And at some point, the guidance
changes based upon scale. And if there's one thing that we discover from the horrible pages of Hacker
News, it's that people love applying bits of wisdom
that they hear in wildly inappropriate situations.
How do you think about these things at that scale?
Because, simple example,
right now I spend about a thousand bucks a month
at the Duckbill Group, on our AWS bill.
I know, we have one too, imagine that.
And if I wind up just committing admin credentials
to GitHub, for example, and someone
compromises that and starts spinning things up to mine all the Bitcoin, yeah, I'm going to notice
that by the impact it has on the bill, which will be noticeable from orbit. At the level of spend
that you folks are at, a company would be hard-pressed to spin up enough Bitcoin miners
to materially move the billing needle on a month-to-month basis just
because of the sheer scope and scale. At small bill volumes, yeah, it's pretty easy to discover
the thing that wound up spiking your bill to three times normal. It's usually a managed NAT gateway.
At your scale, tripling the bill begins to look suspiciously like the GDP of a small country.
So what actually happened here? Invariably at that scale with that level
of massive multiplier, it's usually the simplest solution, an error somewhere in the AWS billing
system. Yes, they exist. Imagine that. They do exist and we've encountered that.
Kind of heart-stopping, isn't it? I don't know if you remember when we had
the big Spectre and Meltdown, right? And those were like interesting scenarios for us, because we had identified a lot of those issues early on, given the scale we operate.
And we were able to sort of, obviously, you know, it did have an impact on the bills and everything,
but that said, that's why you have these dedicated teams to kind of fix that. But I think one of the
points you made, you know, these are large bills and you're never going to have a 3x jump the next
day. You know, we're not going to be seeing that. And if that happens, you know, like God save us. But to your point, one of the things we do still want to be doing is look at
trends literally on a week over week basis, because even a one percentage move is a pretty
significant amount if you think about it, which could be funding some other aspects of the
business, which we would prefer to be investing on. So we do want to have enough rigor and controls in place in our
technical stack to kind of identify and alert when something is off track. And it becomes challenging
when you start using those higher order services from your public cloud provider, because there's
no clear insights on how do you kind of parse that information. One of the biggest challenges we had
at Pinterest was tying ownership to all these things. No, using tags is
not going to cut it. It was so difficult for us to get to a point where we could like put some
sense of ownership and all the things and the resources people are using, and then subsequently
have the right conversations with our ads infrastructure teams or our product teams to
kind of like help drive the cost improvements we want to be seeing. And I wouldn't be surprised if
that's not a challenge already,
even for like the smaller companies who have bills
to the tune of tens of thousands, right?
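The week-over-week trend check mentioned above can be sketched as follows. This is a rough illustration, not a description of Pinterest's tooling: the function name `week_over_week_alerts`, the 1% default threshold (borrowed from the "one percentage move" remark), and the spend figures are all assumptions.

```python
def week_over_week_alerts(weekly_spend, threshold_pct=1.0):
    """Flag weeks whose spend moved more than threshold_pct versus the prior week.

    weekly_spend is a list of totals, oldest first.
    Returns a list of (week_index, percent_change) tuples.
    """
    alerts = []
    for i in range(1, len(weekly_spend)):
        previous, current = weekly_spend[i - 1], weekly_spend[i]
        if previous == 0:
            # A percent change from zero is undefined; skip that week.
            continue
        pct_change = (current - previous) / previous * 100.0
        if abs(pct_change) > threshold_pct:
            alerts.append((i, round(pct_change, 2)))
    return alerts


# Hypothetical weekly totals, in arbitrary currency units.
weekly = [1000.0, 1005.0, 1030.0, 1028.0]
alerts = week_over_week_alerts(weekly)
print(alerts)  # [(2, 2.49)]
```

A fixed percentage threshold is the simplest possible rule. At the spend levels discussed here, a real system would likely also need per-service baselines and the ownership data that the conversation describes as hard to obtain.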
It is.
It's predicting the spend and trying to categorize it appropriately.
That's the root of all AWS bill panic on the corporate level.
It's not that the bill is 20% higher, so we're going to go broke.
Most companies spend far more on payroll than they do on infrastructure.
As you mentioned, with Netflix, content is a significantly larger expense than any of those things. Real estate's usually right up there too. But instead, it's when you're trying
to do business forecasting of, okay, if we're going to have an additional thousand monthly
active users, what will the cost for us be to service those users? And okay, if we're seeing
a sudden 20% variance, if that's the new
normal, then well, that does change our cost projections for a number of years. What happens
when you're public, there starts to become the question of, okay, do we have to restate earnings
or what's the deal here? And of course, all of this sidesteps past the unfortunate reality that
for many companies, the AWS bill is not a function of how many customers you have. It's how many
engineers you've hired. And that is always the way it winds up playing out
for some reason.
It's, why did we see a 10% increase in the bill?
Yeah, we hired another data science team.
Oops.
Always seems to be the data science folks.
I know I beat up on those folks a fair bit,
and my apologies.
And one day, if they analyze enough of the data,
they might figure out why.
So this is where I want to give a shout out
to our data science team,
especially some of the engineers working
in the infrastructure governance team,
like putting these charts together,
helping us derive insights.
So definitely props to them.
I think there's a great segue into the point you made.
As you add more engineers,
what is the impact on the bottom line?
And this is one of the things actually
as part of engineering productivity,
we think about as well on a long-term basis.
Pinterest does have over a thousand engineers today.
And to a large degree,
many of them actually have their own EC2 instances today.
And I wouldn't say it's like a significant amount of cost,
but it is a large enough number
where shutting down a c5.9xlarge
can actually fund a bunch of conference tickets
or something else.
And then you can imagine that's sort of the scale
you start kind of working with at one point. The nuance here is though, you want to like make sure there's
enough flexibility for these engineers to do their local development in a sustainable way.
But when moving to say production, we really want to tighten sort of the flexibility a bit so they
don't end up doing what you just said, like spin up a bunch of machines talking to the API directly,
which no one will be aware of. I want to share a small anecdote because when back in the day,
this was probably four years ago when we were doing some analysis on our bills,
we realized that there was a huge jump every, I believe, Wednesday in our EC2 instances
by almost like a factor of like 500 to 600 instances. And we're like, why is this happening?
What is going on?
And we found out there was like an obscure job written by someone who had left the company
calling an EC2 API to spin up like a search cluster
of 500 machines on demand
as part of pulling that ETL data together
and then shutting that cluster down,
which at times didn't work as expected
because obviously your Hadoop jobs
are very predictable, right?
So those are kind of the things we were dealing with back in the day.
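A recurring jump like the one in that anecdote can be surfaced by grouping daily instance counts by weekday and comparing each weekday's average to the overall median. The sketch below is only an illustration of that idea, not the analysis Pinterest ran; the function name, the input shape (a dict of date to instance count) and the 1.2 threshold are all assumptions.

```python
from collections import defaultdict
from statistics import mean, median

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def spiky_weekdays(daily_counts, factor=1.2):
    """Return weekday names whose mean instance count exceeds
    factor * the median daily count across the whole period.

    daily_counts: dict mapping datetime.date -> number of running instances.
    """
    by_weekday = defaultdict(list)
    for day, count in daily_counts.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        by_weekday[WEEKDAYS[day.weekday()]].append(count)

    baseline = median(daily_counts.values())
    return sorted(
        name for name, counts in by_weekday.items()
        if mean(counts) > factor * baseline
    )
```

Fed four weeks of counts where one weekday runs a few hundred instances above a steady baseline, this returns just that weekday, which is the cue to go looking for a scheduled job.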
And you want to make sure since then, this is where engineering productivity as a team
starts coming in, that our job is to enable every engineer to be doing their best work
across code building and deploying their services.
And we have done this.
Right.
You and I can sit here and have an in-depth conversation about the intricacies of AWS
billing in a bunch of different ways, because in different ways, we both specialize in it. But let's say that Pinterest theoretically was foolish enough to hire me, before I got into this space, as an engineer for terrifying reasons. And great, I start on day one as a typical software developer, if such a thing could be said to exist. How do you effectively build guardrails in
so that I don't inadvertently wind up
spinning up all the EC2 instances
available to me within an account,
which it turns out are more than one might expect sometimes,
but still leave me free to do my job
without effectively spending a nine-month safari
figuring out how AWS bills work.
And this is why teams like ours exist
to kind of help provide those tools
to help you get started.
So today, we actually don't let anyone
directly use AWS APIs or even use the UI for that matter.
And I think you'll soon realize
the moment you hit probably 30 or 40 people
in your organization,
you definitely want to lock it down.
You don't want that access to be given to anyone or everyone. And then subsequently start building some higher order tools or
abstractions so people can start using that to control effectively. In this case, if you're a
new engineer, Corey, which it seems like you were at some point.
I still write code like I am, don't worry.
So yes, you would get access to sort of our internal tool to actually help spin up what
we call as a dev app, where you get a chance to obviously choose
sort of the instance size, not the instance type itself.
And we have actually constrained
sort of the instance types we have approved
within Pinterest as well.
We don't give you sort of the entire list
you get a chance to choose and deploy to.
We actually have constrained to,
based on the workload types,
what are the instance types we want to support?
Because in the future,
if we ever want to move from C3 to C5, and I've been there, trust me, it is not an easy thing to
do. So you want to make sure that you're not letting people just use random instances and
kind of constrain that by building some of these tools. As a new engineer, you would go in, you
use the tool and actually have a dev app provisioned for you with our Pinterest image to get you
started. And then subsequently, you know, we obviously shut it down if we see you've not been using it
over a certain amount of time.
But those are sort of the, you know,
guardrails we've put in over there.
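The guardrails described here (choose a size rather than an instance type, only approved types per workload, idle boxes get shut down) could look roughly like the sketch below. It is a hypothetical illustration, not Pinterest's internal tool: the `APPROVED` table, the 14-day idle limit and every name in it are assumptions, and it calls no AWS API at all.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical allowlist: per workload type, each size maps to exactly one
# approved instance type, so engineers never pick from the full EC2 catalog.
APPROVED = {
    "general": {
        "small": "c5.large",
        "medium": "c5.2xlarge",
        "large": "c5.9xlarge",
    },
}

IDLE_LIMIT = timedelta(days=14)  # assumed threshold, not from the episode


@dataclass
class DevApp:
    owner: str
    instance_type: str
    last_used: datetime


def request_dev_app(owner, workload, size, now):
    """Resolve a (workload, size) request to an approved instance type."""
    sizes = APPROVED.get(workload)
    if sizes is None or size not in sizes:
        raise ValueError(f"{workload}/{size} is not an approved combination")
    return DevApp(owner=owner, instance_type=sizes[size], last_used=now)


def idle_dev_apps(apps, now):
    """Return the dev apps that have gone unused for longer than IDLE_LIMIT."""
    return [app for app in apps if now - app.last_used > IDLE_LIMIT]
```

Keeping the size-to-type mapping in one table is also what makes a later fleet-wide move between instance families a change in one place rather than a hunt through every team's scripts.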
So you never get a chance to directly ever use
sort of the EC2 APIs
or any of those AWS APIs to do certain things.
The similar thing applies for S3
or any of the higher order tools
which AWS would provide too.
This episode is sponsored by our friends at Oracle Cloud. Counting the pennies,
but still dreaming of deploying apps instead of hello world demos? Allow me to introduce you to
Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking,
databases, observability, management, and security. And let me be clear here: it's actually free. There's no surprise billing
until you intentionally and proactively upgrade your account. This means you can provision a
virtual machine instance or spin up an autonomous database that manages itself, all while gaining the
networking, load balancing, and storage resources that somehow never quite make it into most free
tiers needed to support the
application that you want to build. With Always Free, you can do things like run small-scale
applications or do proof-of-concept testing without spending a dime. You know that I always
like to put asterisks next to the word free. This is actually free, no asterisk. Start now. Visit
snark.cloud slash oci-free. That's snark.cloud slash oci-free.
How does that interplay with AWS launches yet another way to run containers, for example,
and that becomes a valuable potential avenue to get some business value for a developer,
but the platform you've built doesn't necessarily embrace that capability.
Or they release a feature to an existing tool that you use that could potentially be just a feature capability story, much more so than a cost savings one. How do you keep track of all of that and empower people to use those things so they're not effectively trying to re-implement DynamoDB on top of EC2?
That has been a challenge actually in the past for us because we've always been very flexible
where engineers have had an opportunity
to kind of write their own solutions many a times
rather than leveraging sort of the AWS services.
And of late, like that's one of the reasons
why we sort of have an infrastructure organization,
an extremely lean organization for what it's worth,
but then still able to kind of achieve
like outsized outputs
where we sort of like evaluate a lot of these use cases
as they come in and open up different aspects of what we want to provide, say directly
from AWS or build certain abstractions on top of it. Every time we talk about containers, obviously,
you know, we always associate that with something like Kubernetes and sort of offerings from there
on. We realized that our engineers directly never ask for those capabilities. They don't come in and
say, I need a new container orchestration system,
give that to me,
and I'm going to be extremely productive.
What we've actually realized
is that if you can provide them effective tools
and that can help them get their job done,
they would be happy with it.
For example, like they said, our deployment system,
which is actually an open source system called Teletraan,
that is sort of the bread and butter at Pinterest,
like which my team runs.
We operate 100,000 plus machines.
We have actually looked into container orchestration
where we do have like a dedicated Kubernetes team
looking at it and helping, you know,
certain use cases move there.
But we realized that the cost of sort of entire migrations
need to be like evaluated against certain use cases,
which can benefit from being on Kubernetes from day one.
You don't want to like force anyone to move there,
but give them the right incentives to move there. Case in point, let's upgrade your OS, right? Because if you're managing machines,
obviously everyone loves to upgrade their OSs.
Well, it's one of the reasons I love savings
plans versus RIs. You talk about the C3 to C5 migration, and everyone has a story about one
of those. But the most foolish or frustrating reason that I ever saw not to do the upgrade was,
well, we bought a bunch of reserved instances on the C3s, and those have a year and a half left to run. And it's foolish, not on the part of
customers, it's economically sound, but on the part of AWS, where, great, you're now forcing me
to take a contractual commitment to something that serves me less effectively, rather than getting
out of the way and letting me do my job. It's why it's so important to me, at least, that savings
plans cover Fargate and Lambda.
I wish they'd covered SageMaker
instead of SageMaker having its own thing,
because once again,
you're now architecturally constrained
based upon some ridiculous economic model
that they have imposed on us.
But that's a separate rant for another time.
No, we actually went through that process
because we do have a healthy balance
of how we do reserved instances
and how we look at on-demand.
We've never been big users of Spot in the past because just the Spot market itself, we've realized that putting
that pressure on our customers to figure out how to manage that is way more. I say customers,
in this case, engineers within the organization.
Oh yes, I want to post some pictures on Pinterest
and now I have to understand the Spot market? What?
Yeah. So in this case, when we even were
moving from C3 to C5,
and this is where that partnership really plays out effectively,
because it's also in the best interest of AWS
to deprecate their aging hardware
to support some of these new ones
where they could also be making good enough premium margins
for what it's worth and give the benefit back to the user.
So in this case, we were able to work out an extremely flexible way
of moving to C5 as soon as possible,
get help from them actually in helping us do that too, allocating capacity
and working with them on capacity management. I believe at one point we were actually one of the
largest companies with the C3 footprint, and it took quite a while for us to move to C5, but rest
assured, once we moved, the savings was just immense, right? We were able to kind of offset
any of those RIs and we were able to work behind the scenes to get that out. But obviously not a lot of that is kind of considered in a small
scale company, just because of, like you said, those constraints which have been placed in a
contractual obligation.
Well, this is an area in which I will give the same guidance to companies
of your scale, as well as small scale companies. And by small scale, I mean people on the free
tier account, give or take. So I do mean the smallest of the small. Whenever you
wind up in a scenario where you find yourself architecturally constrained by an economic
barrier like this, reach out to your account manager. I promise you have one. Every account,
even the tiny free tier accounts, have an account manager. I have an account manager who I have to
say has probably one of the most surreal jobs at AWS, just based upon the conversations I throw past him. But it's reaching out to your provider rather than trying to solve
a lot of this stuff yourself by constraining how you're building things internally is always the
right first move. Because the worst case is, is you don't get anywhere in those conversations.
Okay, but at least you explored that, as opposed to what often happens is, oh yeah,
I have a switch over here I can flip and solve your entire problem. Does that help anything? Yeah. You feel foolish finding that
out only after nine months of dedicated work, it turns out.
Which makes me wonder, Corey, I mean,
do you see a lot of that happening where folks don't tend to reach out to their account managers
or rather treat them as partners in this case, right? Because it sounds like there's just this
unhealthy tension, I would say, as to what is kind of the best help you could be getting from your account managers in this case.
Constantly.
And the challenge comes from a few things, in my experience.
The first is that the quality of account managers and the technical account managers, the folks who are embedded in many cases with your engineering teams in different ways, does vary.
AWS is scaling wildly and bursting at the seams, and people are hard to scale.
So some are fantastic. Some are decidedly less so, and most folks fall somewhere in the middle of that bell curve.
And it doesn't take too many poor experiences for the default to be, oh, those people are useless.
They never do anything we want, so why bother asking them? That leads to an unhealthy dynamic where a lot of companies will wind up treating their AWS account manager types as a ticket triage system or the last resort of places that they'll turn
when they should be involved in earlier conversations. I mean, take Pinterest as an
example of this. I'm not sure how many technical account managers you have assigned to your
account, but I'm going to go out on a limb and guess that the ratio of technical account managers to engineers working on the environment is incredibly lopsided.
It's got to be a high ratio just because of the nature of how these things work.
So there are a lot of people who are actively working on things that would almost certainly
benefit from a more holistic conversation with your AWS account team, but it doesn't
occur to them to do it just because of either perceived biases around levels of competence or poor experiences in the past or simply not knowing the capabilities that are there.
If I could tell one story around the AWS account management story, it would be talk to folks sooner about these things.
And to be clear, Pinterest has this less than other folks, but AWS does themselves no favors by having a product strategy of yes,
because very often in service of those conversations with a number of companies, there is the very real concern of,
are they doing research so that they can launch a service that competes with us?
Amazon as a whole launching a social network is admittedly one of the most
hilarious ideas I can come up with.
And I kind of hope they take a whack at it just to watch them learn all these
lessons themselves. But that is again, neither here nor there.
That story is very interesting. And I think you mentioned one thing, it's just that
lack of trust or even knowing what the account managers can actually do for you.
There seems to be just a lack of education on that. And we also found it the hard way, right?
I wouldn't say that Pinterest kind of figured this out on day one. We evolved sort of our relationship over time. Yes, our TAM engagements are sort of
lopsided, but we were able to kind of negotiate that as part of deals as we learned a bit more
on what we can and we cannot do and how these individuals are beneficial for Pinterest as well.
And well, here's a question for you without naming names. And this might illustrate part
of the challenge that customers have.
How long has your account manager,
not the technical account managers,
but your account manager been assigned to your account?
I've been at Pinterest for five years and I've been working with the same person.
And he's amazing.
Which is incredibly atypical.
A lot of smaller companies, it feels like,
oh, I'm your account manager being introduced
to you. And aren't you the third one this year? Great. What happens is that if the account manager
excels very often, they get promoted and work with a smaller number of accounts at larger spend.
And whereas if they don't find that AWS is a great place for them for a variety of reasons,
they go somewhere else and need to be backfilled. So the smaller account, it's great.
I've had more account managers in a year than you've had in five. And that is often the experience
when you start seeing significant levels of rotation. And especially on the customer engineering
side, where you wind up with, you have this big kickoff and everyone's aware of all the capabilities
and you look at it three years later and not a single person who was in that kickoff is still involved with the account on either side. And it's just sort of been evolving
evolutionarily from there. One thing that we've done in some of our larger accounts as part of
our negotiation process is when we see that the bridges have been so thoroughly burned, we will
effectively request a full account team cycle just because it's time to get new faces in where the customer, in many cases unreasonably,
is not going to say,
yeah, but a year and a half ago,
you did this terrible thing
and we're still salty about it.
Fine, whatever, I get it.
People relationships are hard.
Let's go ahead and swap some folks out
so that there are new faces with new perspectives
because that helps.
Well, first off,
if you have had so many switches in
account manager, I think that says something about how you've been working too.
I'm just kidding there, but entirely possible in seriousness. Yes. But if you talk to,
this is not just me because in my case, yeah, I feel like my account manager is whoever drew the
short straw that week, because frankly, yeah, that does seem like a great punishment to wind
up passing out to someone who's underperforming. But for a lot of folks who are in the mid tier, like spending 50 to a hundred thousand dollars a month, this is
a very common story. Yeah, actually we've heard a bit about this too. And like you said, I think,
you know, maintaining context is the most important thing. You really want your account managers to, you know,
vouch for you, really be your champion in those meetings because AWS, like you said, is so large.
Getting that exec time and, you know, reviews, and there's so many things that happen,
your account manager is the champion for you right there. And it's important. And in fact,
in your best interest to kind of have a great relationship with them as well, not treat them
as like, oh, yet another vendor. And I think that's where things start to get a bit messy,
because when you start treating them as yet another vendor, you know, there's no incentive
for them to kind of do the best for you too. You know, people relationships are hard, but that said though,
I think given the amount of customers, like these cloud companies are accruing, I wouldn't be
surprised. Like, you know, every account manager seems to be like extremely burdened, even in our
case, although I've been having a chance to work with this one person for like a long time, we've
actually expanded. We have now multiple account managers helping us out as we've started scaling
to use certain aspects of AWS,
which we have never explored before.
You know, we were a bit constrained and reserved
about what services we want to use
because there have been instances
where we have tried using something
and we've hit the wall pretty immediately.
API rate limits, or it's not ready for prime time.
And we're like, oh my God, now what do we do?
So we have been a bit more cautious, but that said, over time, you know,
having an account manager who understands how you work, what scale you have, they're able to advocate with the internal engineering teams within the cloud provider to make the best of
sort of supporting you as a customer and sort of like tell that success story all the way up. So
yeah, I can totally understand like how this may be hard, especially for those smaller companies.
For what it's worth, I think the best way
to really think about it is not treat them as your vendor,
but really sort of go out on a limb there.
Even though you sign a deal with them,
you want to make sure that you have
the continued relationship with them
to represent your voice better within the company,
which is probably hard.
That's always the hard part.
Honestly, if this were the sort of thing
that were easy to automate,
or you could wind up building out something
that winds up helping companies figure out
how to solve these things programmatically.
You talk about interesting business problems
that are only going to get larger in the fullness of time.
This is not going away.
Even if AWS stopped signing up new customers
entirely right now,
they would still have years of growth ahead of them,
just from organic growth.
And take a company with the scale of Pinterest and just think of how many years it would take to
do a full-on exodus, even if it became priority number one. It's not realistic in many cases,
which is why I've never been a big fan of multi-cloud as an approach for negotiation.
Yeah, AWS has more data on those points than any of us do. They're not worried about it.
It just makes you sound like an unsophisticated negotiator.
Pick your poison and lean in.
That is the truth you just mentioned.
And I probably want to give a call out to our head of infrastructure, Coburn.
He's also my boss, and he had brought this perspective as well as part of any negotiation discussions.
Like you just said, AWS has way more data points on this
than what we think we can do in terms of talking about, oh, we are exploring this other cloud
provider. And it's, you know, they would be like, yeah, do tell me more how that's going. And it's
probably in the best interest to never use that as a negotiation tactic because, you know, they
clearly know sort of the investments that's gone in to kind of build out what you've done. So you
might as well like be talking more. Again, this is where that relationship really plays together because
you want both of them to be successful and it's in their best interest to like still keep you happy
because the good thing about at least companies of our size is that we're probably like one phone
call away from some of their executive team where we could always talk about sort of what didn't
work for us. And I know not everyone has that opportunity,
but I'm really hoping,
and I know like,
at least with some of the interactions
we've had with the AWS teams,
they're actively working
and sort of building that relationship more and more,
giving access to those customer advisory boards
and all of them to have those direct calls
with the executives.
I don't know whether you've seen that
in your sort of experience
in helping some of these companies.
I have a different approach to it.
It turns out when you're super loud and public and noisy about AWS and spend too much time in Seattle,
you start to spend time with those people on a social basis. Because again, I'm obnoxious and
annoying to a lot of AWS folks, but I also have an obnoxious habit of being right in most of the
things I'm pointing out. And that becomes harder and harder to ignore. I mean, part of the value
that I found in being able to do this as a consultant is that I begin to compare and contrast different customer environments
on a consistent, ongoing basis. I mean, the reason that negotiation works well from my perspective
is that AWS does a bunch of these every week, and customers do these every few years with AWS.
And, well, we do an awful lot of them, too. And it's, okay, we've seen different ways
things can get structured
and it doesn't take too long
and too many engagements
before you start to see the points of commonality
and how these things flow together.
So when we wind up seeing things
that a customer is planning on architecturally
and looking to do in the future,
well, wait a minute,
have you talked to the folks
negotiating the contract about this?
Because that does potentially have bearing and it provides better data than what AWS is gathering just through looking at overall spend trends. So yeah, bring that up. That is absolutely going to impact I think, understanding the incentives. I will say that across the board, I have never yet seen a deal from AWS come through where
it was, okay, at this point, you're just trying to hoodwink the customer and get them to sign
on something that doesn't help them.
I've seen mistakes that can definitely lead to that impression.
And I've seen areas where their data is incomplete and they're making assumptions that are not
borne out in reality.
But it's not one of those bad faith type of negotiations. If it were, I would be framing a
lot of this very differently. It sounds weird to say, yeah, your vendor is not trying to screw you
over in this sense. Because look at the entire IT industry. How often has that been true about
almost any other vendor in the fullness of time? This is something a bit different, and I still think we're trying to grapple with the repercussions
of that from a negotiation standpoint and from a long-term business continuity standpoint
when your fate is linked in a shared fate context with your vendor.
It's in their best interest as well because they are trying to build a diversified portfolio.
If they help 100 companies, even if one of them becomes the next Pinterest, that's great. Right. And that continued relationship is what they're
aiming for. So assuming any bad faith over there probably is not going to be the best outcome,
like you said. And two, it's not a zero sum game. Like I always get a sense that when you're doing
these negotiations, it's like, it's an all or nothing deal. It's not. Like, you have to think,
they're also running a business. And it's important that you, as you're in a business, ask how okay you are with some of those premiums, right? Like you
cannot get a discount on everything. You cannot get the deal or the numbers you probably want
on almost everything. And to your point, architecturally, if you're moving in a certain
direction where you think in the next three years, this is what your usage is going to be, or it'll
come down to that. Obviously you should be investing more in kind of negotiating that up front
rather than managed network gateways, I guess.
So I think that's also an important mindset
to kind of take in, right,
as part of any of these negotiations,
which I'm assuming,
I don't know how you folks have been working in the past,
but at least that's one of the key items
we have taken in as part of any of these discussions.
I would agree wholeheartedly.
I think that it just comes down to understanding where you're going,
what's important,
and again, in some cases,
knowing around what things AWS
will never bend contractually.
I've seen companies spend six weeks or more
trying to negotiate custom SLAs around services.
Let me save everyone a bunch of time and money.
They will not grant them to you, I promise.
So stop asking for them.
You're not going to get them.
There are other things they will negotiate on that are going to be highly case dependent. I'm
hesitant to mention any of them just because, well, wait a minute, we did that once. Why are
you talking about that in public? I don't want to hear it and confidentiality matters. But yeah,
not everything is negotiable, but most things are. So figuring out what levers and knobs and dials you have is important.
We also found it that way, like AWS does cater to their, they are a platform and they are pretty clear in sort of like how much engagement, even if we are sort of like one of their top customers,
there's been many a times where I know their product managers have heavily pushed back on
some of the requests we have put in. And that makes me wonder, like they probably have the
same engagement, even with the smallest of customers.
There's always like an implicit assumption
that the big fishes try to kind of like
get the most out of your public cloud providers.
To your point, I don't think that's true.
We're rarely able to kind of negotiate anything exclusive
in terms of their product offerings just for us,
if that makes sense.
Case in point, tell us your capacity
for X instances or type of instances
so we as a company would know
sort of how to kind of plan out
our scale ups or scale downs.
That's not going to happen exclusively for you.
But those kinds of things are just like examples.
We have had a chance to kind of work
with their product managers
and see if, can we get some flexibility on that?
For what it's worth though,
they are willing to kind of like
find a middle ground with you
to make sure that you get your answers. And obviously you're being successful in sort of like your plans to
use certain technologies they offer or have more predictability in how you use their services.
So I know we've gone significantly over time and we are definitely going to do another episode
talking about a lot of the other things that you're involved in, because I'm going to assume
that your full-time job is not worrying about the AWS bill. In fact, you do a fair number of things beyond that.
I just get stuck on that one, given that it is what I eat, sleep, breathe, and dream about.
Absolutely. I would love to kind of talk more, especially about how we are, you know,
enabling sort of our engineers to be extremely productive in this new world and how we want to
cater to sort of this whole cloud native environment, which is
being created and make sure, you know, people are sort of doing their best work. But regardless,
Corey, I mean, this has been an amazing, insightful chat, even for me. And I really
appreciate you sort of having me on the show. No, thank you for joining me. If people want to
learn more about what you're up to and how you think about things, where can they find you?
Because I'm also going to go out on a limb and assume you're also probably hiring given that
everyone seems to be these days.
Well, that is true. And I wasn't planning to make a hiring
pitch, but I'm glad that you sort of like leaned into that one. Yes, we are hiring and you can find
me on Twitter at twitter.com slash M-I-C-H-E-A-L. My name is spelled a bit differently, so make sure you
can hit me up and my DMs are open. And obviously we have all our open roles listed on PinterestCareers.com as well.
And we will, of course, put links to that in the show notes.
Thank you so much for taking the time to speak with me today.
I really appreciate it.
Thank you, Corey.
It was really great being on your show.
And I'm sure we'll do it again in the near future.
Micheal Benedict,
Head of Engineering Productivity at Pinterest.
I am cloud economist, Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast,
please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast,
please leave a five-star review
on your podcast platform of choice,
along with a long rambling comment
about exactly how many data centers
Pinterest could build instead.
If your AWS bill keeps rising
and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill
by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business
and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production. Stay humble.