Screaming in the Cloud - Building a Partnership with Your Cloud Provider with Micheal Benedict

Episode Date: November 10, 2021

About Micheal
Micheal Benedict leads Engineering Productivity at Pinterest. He and his team focus on developer experience, building tools and platforms for over a thousand engineers to effectively code, build, deploy and operate workloads on the cloud. Mr. Benedict has also built Infrastructure and Cloud Governance programs at Pinterest and previously, at Twitter -- focussed on managing cloud vendor relationships, infrastructure budget management, cloud migration, capacity forecasting and planning and cloud cost attribution (chargeback).

Links:
Pinterest: https://www.pinterest.com
Teletraan: https://github.com/pinterest/teletraan
Twitter: https://twitter.com/micheal
Pinterestcareers.com: https://pinterestcareers.com

Transcript
Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. You know how Git works, right? Sort of. Kind of. Not really.
Starting point is 00:00:34 Please ask someone else. That's all of us. Git is how we build things, and Netlify is one of the best ways I've found to build those things quickly for the web. Netlify's Git-based workflows mean that you don't have to play slap and tickle with integrating arcane nonsense and webhooks, which are themselves about as well understood as Git. Give them a try and see what folks ranging from my fake Twitter for Pets startup to Global Fortune 2000 companies are raving about. If you end up talking to them,
Starting point is 00:01:06 because you don't have to, they get why self-service is important, but if you do, be sure to tell them that I sent you and watch all of the blood drain from their faces instantly. You can find them in the AWS Marketplace or at www.netlify.com. N-E-T-L-I-F-Y dot com. This episode is sponsored in part by our friends at Vultr, spelled V-U-L-T-R, because they're all about helping save money, including on things like, you know, vowels. So what they do is they are a cloud provider that provides surprisingly high performance cloud compute at a price that, well, sure, they claim it is better than AWS's pricing. And when they say that,
Starting point is 00:01:51 they mean that it's less money. Sure, I don't dispute that. But what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have 19 global locations and scale things elastically, not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than 60 seconds across 12 pre-selected operating systems. Or if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month
Starting point is 00:02:33 for Vultr Cloud Compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something of the scale all on their own. But you don't have to take my word for it with an exclusive offer for you. Sign up today for free and receive $100 in credits to kick the tires and see for yourself. Get started at vultr.com slash morningbrief.
Starting point is 00:02:57 That's v-u-l-t-r dot com slash morningbrief. Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, I like to talk to people who work at very large companies that are not, in fact, themselves a cloud provider. I know, sounds ridiculous. How can you possibly be a big company and not make money by selling managed NAT gateways to an unsuspecting public? But I'm told it can be done. Here to answer that question, and hopefully at least one other, is Pinterest's head of engineering productivity, Micheal Benedict. Micheal, thank you for taking the time to join me today. Hi, Corey. Thank you for inviting me today. I'm really excited to talk to you. So exciting times at Pinterest in a bunch of different ways. It was recently reported,
Starting point is 00:03:43 which of course went right to the top of my inbox as 500,000 people on Twitter all said, hey, this sounds like a Corey would be interested in it thing. It was announced that you folks had signed a $3.2 billion commitment with AWS stretching until 2028. Now, if this is like any other large-scale AWS contract commitment deal that has been made public, you were probably immediately inundated with a whole bunch of people who are very good at arithmetic and not very good at business context saying, 3.2 billion, you could build massive data centers for that. Why would anyone do this? And it's tiresome, and that's the world in which we live. But I'm guessing you heard at least a little bit of that from the peanut gallery. I did. And I always find it interesting when,
Starting point is 00:04:31 you know, direct comparisons are made with the total amount that's been committed. And like you said, there's so many nuances that go into kind of how to perceive that amount and put it in context of obviously what Pinterest does. So I at least want to take this opportunity and kind of share it with everyone that Pinterest has been on the cloud since day one. When Ben initially started the company and the product was launched, it was a simple Django app. It was launched on AWS from day one. And since then, it has grown to support like 450 plus million MAUs over the course of the decade. And our infrastructure has grown pretty complex. We started with a bunch of EC2 machines and, you know, persisting data in S3. And since then, we have kind of explored an array of different products. In fact, sometimes working
Starting point is 00:05:17 very closely with AWS as well and helping them put together a product roadmap for some of the items they're working on as well. So we have an amazing partnership with them. And part of the commitment on how we want to see these numbers is how does it unlock value for Pinterest as a business over time in terms of making us much more agile without thinking about the nuances of the infrastructure itself. And that's, I think, one of the best ways to really put this into context, that it's not a single number we pay at the end of the month, but rather we are on track to spending a certain amount over a period of time. So this just keeps accruing or adding to that number. And we basically come out with an amazing partnership in AWS where we have that commitment
Starting point is 00:05:59 and we're able to kind of leverage their products and full suite of items without any hiccups. The most interesting part of what you said is the word partner. And I think that's the piece that gets lost an awful lot when we talk about large-scale cloud negotiations. It's not like buying a car where you can basically beat the crap out of the salesperson.
Starting point is 00:06:19 You can act as if $400 price difference on a car is the difference between storm out of the dealership and sign the contract, great. You don't really have to deal with that person ever again. In the context of a cloud provider, they run your production infrastructure. And if they have a bad day, I promise you're going to have a bad day too.
Starting point is 00:06:39 You want to handle those negotiations in a way that is respectful of that because they are your partner, whether you want them to be or not. Now, I'm not suggesting that any cloud provider is going to hold an awkward negotiation against the customer, but at the same time, there are going to be scenarios in which you're going to want to have strong relationships where you're going to need to cash in political capital to some extent. And personally, I've never seen stupendous value in trying to beat the crap out of a company
Starting point is 00:07:08 in order to get another 10th of a percent discount on a service you barely use just because someone decided that, well, we didn't do well in the last negotiation, so we're going to get them back this time. That's great. What are you actually planning to do as a company? Where are you going?
Starting point is 00:07:23 And the fact that you just alluded to that you're not just a pile of S3 and EC2 instances speaks in many ways to that. By moving into the differentiated service world, suddenly you're able to do things that don't look quite as much like building a better database and start looking a lot more like servicing your users more effectively and well. And I think like you said, right, I feel like there's like a general skepticism in viewing that the cloud providers are usually out there to rip you apart. But in reality, that's not true. To your point, as part of the partnership, especially with AWS and Pinterest, we've got an amazing relationship going on. And behind the scenes, there's a dedicated team at Pinterest
Starting point is 00:08:03 called the Infrastructure Governance Team, a cross-functional team with folks from finance, legal, engineering, product, all sitting together and working with our AWS partners. Even the AWS account managers and the TAMs are part of that to help us kind of make Pinterest successful. And in turn, AWS gets that amazing customer to work with in helping build some of their newer products as well. And that's one of the most important things we have learned over time is that there's two parts to it. When you want to help improve your business agility, you want to focus not just on the
Starting point is 00:08:37 bottom line numbers as they are. It's okay to kind of pay a premium because it offsets sort of the people capital you would have to invest in getting there. And that's a very tricky way to look at math, but that's what these teams do. They sit down and work through those specifics.
Starting point is 00:08:51 And for what it's worth, in our conversations, the AWS teams always come back with very insightful data on how we're using their systems to help us better think about how we should be pricing or looking at things ahead.
Starting point is 00:09:04 I'm not the expert on this. Like I said, there's a dedicated team sitting behind this and looking through and working through these deals. But that's one of the important takeaways I hope the users or the listeners of this podcast can take away that you want to treat your cloud provider as your partner as much as possible. They're not always there to screw you. That's not their goal.
Starting point is 00:09:22 And I apologize for using that term. It is important that you sort of set that expectation that it's in their best interest to actually make you successful because that's how they make money as well. It's a long-term play. I mean, they could gouge you this quarter and then you're trying to evacuate as fast as possible.
Starting point is 00:09:37 Well, they had a great quarter, but what's their long-term prospect? There are two competing philosophies in the world of business. You can either make a lot of money quickly or you can make a little bit of money and build it over time in a sustained way. And it's clear the cloud providers are playing the long game on this because they basically have to. I mean, it's inevitable at this point, right? I mean, look at Pinterest. It is one of those
Starting point is 00:09:56 success stories, starting as a Django app and a bunch of EC2 machines to where we are right now with having like a three plus billion dollar commitment for a span of a couple of years. And, you know, we do spend a pretty significant chunk of that on a yearly basis. So in this case, I'm sure it was a great successful partnership. And I'm hoping like some of the newer companies who are building on the cloud from the get-go
Starting point is 00:10:16 are thinking about it from that perspective. And one of the things I do want to call out, Corey, is that, you know, we did initially start with using the primitive services in AWS, but it became clear over time, and I'm sure you've heard of the term multi-cloud and all of that, you know, when companies start evaluating how to make the most out of the deals they're negotiating or signing, it is important to kind of acknowledge that the cost of sort of any of those evaluations or even thinking about migrations never tends to get factored in. And we always tend to treat that as being extremely simple or not. But those are engineering resources you
Starting point is 00:10:51 want to be spending more building on the product rather than these crazy costly migrations. So it's in your best interest probably to start using the most from your cloud provider and also look for opportunities to use other cloud providers if they provide more value in certain product offerings. Rather than thinking about like a complete lift and shift, and I'm going to make DR as being the primary case on why I want to be moving to multi-cloud. Yeah. There's a question, too, of the numbers on paper look radically different than the reality of this.
Starting point is 00:11:17 You mentioned Pinterest has been on AWS since the beginning, which means that even if an edict had been passed at the beginning that thou shalt never build on anything except EC2 and S3, the end, full stop. And let's say you went down that rabbit hole of, oh, we don't trust their load balancers. We're going to build our own at home. We have load balancers at home. We'll use those. It's terrible. But even had you done that and restricted yourselves just to those baseline building blocks and then decided to do a cloud migration, you're still looking back at over a decade of experience where the app has been built on consciously reflecting the various failure modes that AWS has, the way that it responds to API calls, the latency in how long it takes to request something versus it being available, etc., etc. So even moving that baseline thing to another cloud provider is not a trivial undertaking by any stretch of the imagination. But that said, because the topic does always come up,
Starting point is 00:12:16 and I don't shy away from it. I think it's something people should go into with an open mind. How has the multi-cloud conversation progressed at Pinterest? Because there's always a multi-cloud conversation. We've always approached it with some form of openness. It's not like we don't want to be open to the ideas, but you really want to be thinking hard on the business case and the business value something provides on why you want to be doing X. In this case, when we think about multi-cloud, and again, like Pinterest did start with EC2 and S3, and we did keep it that way for a long time.
Starting point is 00:12:48 We built a lot of primitives around it, used it. For example, my team actually runs sort of our bread and butter deployment system on EC2. We help facilitate deployments across 100,000 plus machines today. And like you said, we have built that system keeping in mind how AWS works and kind of understanding sort of the nuances of region and AZ failovers and all of that and help facilitate
Starting point is 00:13:12 deployments across thousand plus microservices in the company. So thinking about leveraging, say a Google cloud instance and how that works in theory, you know, we can always make a case for engineering to kind of build our deployment system and expand that, but there's really no value. And one of the biggest cases, usually when multi-cloud comes in, is usually either negotiation for price or actually a DR strategy.
Starting point is 00:13:33 Like, what if AWS goes down in us-east-1? Well, let's be honest, they're powering half the internet from that one thing. Yeah, so if you think your business is okay running when AWS goes down and half the internet is not going to be working, how do you want to be thinking about that? So DR is probably not the best reason
Starting point is 00:13:50 for you to be even exploring multi-cloud. Rather, you should be thinking about what the cloud providers are offering as a very nuanced offering, which your current cloud provider is not offering, and really think about just using those specific items. So I agree that multi-cloud for DR purposes is generally not necessarily the best approach with the idea of being able to failover seamlessly. But I like the idea for backups.
Starting point is 00:14:12 I mean, Pinterest is a publicly traded company, which means that among other things, you have to file risk disclosures and be responsive to auditors in a variety of different ways. There are some regulations that start applying to you. And the idea of, well, AWS builds things out in a super effective way, region separation, et cetera. Whenever I talk to Amazonians, they are always surprised that anyone wouldn't accept that, oh, if you want backups, just use a different region. Problem solved.
Starting point is 00:14:40 Right, but it is often easier for me to have a rehydrate the business level of backup that would take weeks to redeploy living on another cloud provider than it is for me to explain to all of those auditors and regulators and financial analysts, et cetera, why I didn't go ahead and do that path. So there's always some story for, okay, what if AWS decides that they hate us and want to kick us off the platform? Well, that's why legal is involved in those high-level discussions around things like risk and indemnity and termination for convenience and for cause clauses, et cetera, et cetera. The idea of making
Starting point is 00:15:15 an all-in commitment to a cloud provider goes well beyond things that engineering thinks about. And it's easy for those of us with engineering backgrounds to be incredibly dismissive of that. Oh, indemnity? When does AWS ever lose data? Yeah, but let's say one day they do. What is your story going to be when asked some very uncomfortable questions by people who wanted you to pay attention to this during the negotiation process? It's about dotting the I's and crossing the T's, especially with that many commas in the contractual commitments. No, it is true. And, you know, we did evaluate that as an option. But one of the interesting things about, you know, compliance and especially auditing as well, we generally work with sort of the best in class, you know, consultants to kind of help us work through the controls and every,
Starting point is 00:16:00 how we kind of audit, how we look at these controls, how to make sure there's like enough accountability going through. The interesting part was in this case as well, we were able to sort of work with AWS in crafting a lot of those controls and setting up sort of the right expectations as and when we were putting our proposals together as well. Now, again, I'm not an expert on this and I know we have a dedicated team
Starting point is 00:16:20 from our technical program management organization focused on this. But early on, we realized, to your point, the cost of any form of backups and then being able to audit what's going in, look at all those pipelines, how quickly we can get the data in and out, was proving pretty costly for us.
Starting point is 00:16:36 So we were able to work out some of that within the constructs of what we have with our cloud providers today and still meet our compliance goals. That's sort of, on some level, the higher point too, where everything is, everything comes down to context. Everything comes down to what the business demands, what the business requires, what the business will accept. And I'm not suggesting that in any case they're wrong. I'm known for beating the multi-cloud is a bad default decision drum. And then people get
Starting point is 00:17:03 surprised when I'll have one-on-one conversations and they say, well, we're multi-cloud. Do you think we're foolish? No, you're probably doing the right thing just because you have context that is specific to your business that I, speaking in a general sense, certainly don't have.
Starting point is 00:17:18 People don't generally wake up in the morning and decide they're going to do a terrible job or no job at all at work today unless they're Facebook's VP of integrity. So it's not the sort of thing that lends itself to casual tweet-sized pithy analysis very often. There's a strong dive into what is the level of risk a business can accept. And my general belief is that most companies are doing this stuff right. The universal constant in all of my consulting clients that I have spoken to about the in-depth management piece of things is they've always asked the same question of,
Starting point is 00:17:50 so this is what we've done, but can you introduce us to the people who are doing it really right, who have absolutely nailed this and gotten it all down? It's, yeah, absolutely no one believes that that is them, even the folks who are, from my perspective, pretty close to having achieved it. I want to talk a bit more about what you do beyond just the headline-grabbing, large-dollar-figure commitment to a cloud provider story. What does engineering productivity mean at Pinterest? Where do you start? Where do you stop? I want to just quickly touch upon that last point about multi-cloud. And like you said, every company works within the context of what they are given and sort of the constraints of their business.
Starting point is 00:18:28 It's probably a good time to kind of give a plug to my previous employer at Twitter who are doing multi-cloud in a reasonably effective way. They are on the data centers. They do have presence on Google Cloud and AWS. And I know probably things have changed since a couple of years now, but they have sort
Starting point is 00:18:45 of embraced that environment pretty effectively to cater to their acquisitions, you know, who were on the public cloud, help obviously with their initial set of investments in the data center and still continue to kind of scale that out and explore, in this case, Google Cloud for a variety of other use cases, which sounds like it's been extremely beneficial as well. So to your point, there's probably no right way to do this. There's always that context and what you're working with comes into play as part of making these decisions. And it's important to like, take a lot of these with a grain of salt, right? Because you can never sort of understand the decisions, why they were made the way they were made. And for what it's worth, it sort of works out in the end. I rarely heard like a story where it's never sort of worked out
Starting point is 00:19:25 and people are just upset with the deals they've signed. So hopefully that sort of like helps close that whole conversation about multi-cloud. I hope so. It's one of those areas where everyone has an opinion and a lot of them do not necessarily apply universally. But it's always fun to take, in that case, great. I'll take the lesser trod path.
Starting point is 00:19:44 Everyone's saying multi-cloud is great, invariably, because they're trying to sell you something. Yeah, I have nothing particular to sell folks. My argument has always been, in the absence of a compelling reason not to, pick a provider and go all in. I don't care which provider you pick, which people are sometimes surprised to hear.
Starting point is 00:20:00 It's like, well, what if they pick a cloud provider that you don't do consulting work for? Yeah, it turns out I don't actually need to win every AWS customer over to have a successful working business. Do what makes sense for you folks. From my perspective, I want this industry to be better. I don't want to sit here and just drum up business for myself and make self-serving comments to empower that, which apparently is a rare tactic. No, that's totally true, Corey. And like, one of the things you do is help people with their bills, right? Like, so this has come up so many times, and I realize we're sort of going off track a bit from that engineering productivity discussion. Oh, which is fine. That's this
Starting point is 00:20:35 entire show's theme, if it has one. So I want to briefly just talk about the whole billing and sort of how cost management works, because I know you spend a lot of time on that, and you help a lot of these companies be effective in how they manage their bills, right? These questions have come up multiple times, even at Pinterest. We actually, in the past, when I was sort of leading the infrastructure governance organization, we were working with other companies of our similar size to better understand how they are looking into getting visibility into their cost, setting sort of the right controls and expectations within the engineering organization to plan and capacity plan and effectively sort of meet those plans in a certain criteria.
Starting point is 00:21:15 And then obviously, if there is any risk to that, actively manage risk. That was like the biggest thing those teams used to do. And we used to talk a lot, trade notes, and get a better sense of how a lot of these companies are trying to do, for example, Netflix or, you know, Lyft or Stripe. I recall at Netflix, content was sort of their biggest spend. So cloud spending was like way down in the list of things for them. But regardless, they had like an active team looking at this on a day-to-day basis, right? So one of the things we learned early on at Pinterest is that, you know, start investing in those visibility tools early on. No one can parse the cloud bills. Let's be honest, like you're probably the only person who can like reverse
Starting point is 00:21:51 engineer an architecture diagram from like a cloud bill. And I think that's like definitely, you know, you should take a patent for that or something. But in reality, like no one has the time to do that. You want to make sure your business leaders, from your finance teams to engineering teams to, you know, the executives, all have a better understanding of how to parse it. So investing engineering resources, take that data. How do you munch it down to sort of the cost, the utilization across the different vectors of offerings and have a very insightful discussion? Like, you know, what are certain action items we want to be taking? It's very
Starting point is 00:22:24 easy to see, oh, we overspent EC2 and we want to go from there. But in reality, that's not just that thing. You'll start finding out that EC2 is being used by your Hadoop infrastructure, which runs hundreds of thousands of jobs. Okay, now who's actually responsible for that cost? You might find that one job, which is accruing sort of a lot of instance hours over a period of time in a shared multi-tenant environment. How do you kind of attribute that cost to that particular cost center? And then someone left the company a while back, and that job just kept running in perpetuity. No one's checked the output for four years. I guess it can't be that necessarily important. And digging into it requires context. It turns
Starting point is 00:22:54 out there's no SaaS tool to do this, which is unfortunate for those of us who set out originally to build such a thing. But we discovered pretty early on, the context on this stuff is incredibly important. I love the thing you're talking about here, where you're discussing with your peer companies about these things. Because the advice that I would give to companies with the level of spend that you folks do is worlds apart from what I would advise someone who's building something new and is spending maybe 500 bucks a month on their cloud bill. Those folks do not need to hire a dedicated team of people to solve for these problems. At your scale, yeah, you probably should have had some people in here looking at this for a while now. And at some point, the guidance
Starting point is 00:23:35 changes based upon scale. And if there's one thing that we discover from the horrible pages of Hacker News, it's that people love applying bits of wisdom that they hear in wildly inappropriate situations. How do you think about these things at that scale? Because, simple example, right now I spend about a thousand bucks a month at the Duckbill Group, on our AWS bill. I know, we have one too, imagine that.
Starting point is 00:24:00 And if I wind up just committing admin credentials to GitHub, for example, and someone compromises that and starts spinning things up to mine all the Bitcoin, yeah, I'm going to notice that by the impact it has on the bill, which will be noticeable from orbit. At the level of spend that you folks are at, a company would be hard-pressed to spin up enough Bitcoin miners to materially move the billing needle on a month-to-month basis just because of the sheer scope and scale. At small bill volumes, yeah, it's pretty easy to discover the thing that wound up spiking your bill to three times normal. It's usually a managed NAT gateway.
Starting point is 00:24:35 At your scale, tripling the bill begins to look suspiciously like the GDP of a small country. So what actually happened here? Invariably at that scale with that level of massive multiplier, it's usually the simplest solution, an error somewhere in the AWS billing system. Yes, they exist. Imagine that. They do exist and we've encountered that. Kind of heart-stopping, isn't it? I don't know if you remember when we had the big Spectre and Meltdown, right? And those were like interesting scenarios for us, because we had identified a lot of those issues early on, given the scale we operate. And we were able to sort of, obviously, you know, it did have an impact on the bills and everything, but that said, that's why you have these dedicated teams to kind of fix that. But I think one of the
Starting point is 00:25:17 points you made, you know, these are large bills and you're never going to have a 3x jump the next day. You know, we're not going to be seeing that. And if that happens, you know, like God save us. But to your point, one of the things we do still want to be doing is look at trends literally on a week over week basis, because even a one percentage move is a pretty significant amount if you think about it, which could be funding some other aspects of the business, which we would prefer to be investing on. So we do want to have enough rigor and controls in place in our technical stack to kind of identify and alert when something is off track. And it becomes challenging when you start using those higher order services from your public cloud provider, because there's no clear insights on how do you kind of parse that information. One of the biggest challenges we had
Starting point is 00:26:00 at Pinterest was tying ownership to all these things. No, using tags is not going to cut it. It was so difficult for us to get to a point where we could like put some sense of ownership on all the things and the resources people are using, and then subsequently have those right conversations with our ads infrastructure teams or our product teams to kind of like help drive the cost improvements we want to be seeing. And I wouldn't be surprised if that's a challenge already, even for like the smaller companies who have bills to the tune of tens of thousands, right?
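The week-over-week trend check described above, where even a one percent move is worth flagging, can be sketched in a few lines of Python. This is an illustrative sketch only, not Pinterest's tooling: the `WeeklySpend` record, the `week_over_week_alerts` function, the owner labels, and the 1% default threshold are all hypothetical, and it assumes spend has already been attributed to an owner, which the conversation notes is the hard part.

```python
from dataclasses import dataclass


@dataclass
class WeeklySpend:
    owner: str          # hypothetical cost-center or team label
    week: str           # zero-padded ISO week label, e.g. "2021-W44"
    amount_usd: float   # spend already attributed to this owner


def week_over_week_alerts(rows, threshold_pct=1.0):
    """Return (owner, week, pct_change) for each week whose spend rose by
    more than threshold_pct compared with the same owner's previous week."""
    by_owner = {}
    for row in rows:
        by_owner.setdefault(row.owner, []).append(row)

    alerts = []
    for owner, series in by_owner.items():
        # Zero-padded ISO week labels sort chronologically as plain strings.
        series.sort(key=lambda r: r.week)
        for prev, curr in zip(series, series[1:]):
            if prev.amount_usd <= 0:
                continue  # no usable baseline to compare against
            pct = (curr.amount_usd - prev.amount_usd) / prev.amount_usd * 100.0
            if pct > threshold_pct:
                alerts.append((owner, curr.week, round(pct, 2)))
    return alerts
```

Fed two weeks of made-up data per owner, a 1.5% rise for one owner is flagged while a 0.4% rise for another is not. At the spend levels discussed here, the threshold would likely need tuning per owner rather than one global value.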
Starting point is 00:26:31 It is. It's predicting the spend and trying to categorize it appropriately. That's the root of all AWS bill panic on the corporate level. It's not that the bill is 20% higher, so we're going to go broke. Most companies spend far more on payroll than they do on infrastructure. As you mentioned, with Netflix, content is a significantly larger expense than any of those things. Real estate's usually right up there too. But instead, it's when you're trying to do business forecasting of, okay, if we're going to have an additional thousand monthly active users, what will the cost for us be to service those users? And okay, if we're seeing
Starting point is 00:27:02 a sudden 20% variance, if that's the new normal, then well, that does change our cost projections for a number of years. What happens when you're public, there starts to become the question of, okay, do we have to restate earnings or what's the deal here? And of course, all of this sidesteps past the unfortunate reality that for many companies, the AWS bill is not a function of how many customers you have. It's how many engineers you've hired. And that is always the way it winds up playing out for some reason. It's, why did we see a 10% increase in the bill?
Starting point is 00:27:30 Yeah, we hired another data science team. Oops. Always seems to be the data science folks. I know I beat up on those folks a fair bit, and my apologies. And one day, if they analyze enough of the data, they might figure out why. So this is where I want to give a shout out
Starting point is 00:27:43 to our data science team, especially some of the engineers working in the infrastructure governance team, like putting these charts together, helping us derive insights. So definitely props to them. I think there's a great segue into the point you made. As you add more engineers,
Starting point is 00:27:56 what is the impact on the bottom line? And this is one of the things actually as part of engineering productivity, we think about as well on a long-term basis. Pinterest does have over a thousand plus engineers today. And to a large degree, many of them actually have their own EC2 instances today. And I wouldn't say it's like a significant amount of cost,
Starting point is 00:28:14 but it is a large enough number where shutting down a c5.9xlarge can actually fund a bunch of conference tickets or something else. And then you can imagine that's sort of the scale you start kind of working with at one point. The nuance here is though, you want to like make sure there's enough flexibility for these engineers to do their local development in a sustainable way. But when moving to say production, we really want to tighten sort of the flexibility a bit so they
Starting point is 00:28:40 don't end up doing what you just said, like spin up a bunch of machines talking to the API directly, which no one will be aware of. I want to share a small anecdote because back in the day, this was probably four years ago when we were doing some analysis on our bills, we realized that there was a huge jump every, I believe, Wednesday in our EC2 instances by almost like a factor of like 500 to 600 instances. And we're like, why is this happening? What is going on? And we found out there was like an obscure job written by someone who had left the company calling an EC2 API to spin up like a search cluster
Starting point is 00:29:12 of 500 machines on demand as part of pulling that ETL data together and then shutting that cluster down, which at times didn't work as expected because obviously your Hadoop jobs are very predictable, right? So those are kind of the things we were dealing with back in the day. And you want to make sure since then, this is where engineering productivity as a team
Starting point is 00:29:31 starts coming in, that our job is to enable every engineer to be doing their best work across code building and deploying their services. And we have done this. Right. You and I can sit here and have an in-depth conversation about the intricacies of AWS billing in a bunch of different ways, because in different ways, we both specialize in it in many respects. But let's say that Pinterest theoretically was foolish enough to hire me before I got into this space as an engineer for terrifying reasons. And great. I start as day one as a typical software developer, if such a thing could be said to exist, how do you effectively build guardrails in so that I don't inadvertently wind up spinning up all the EC2 instances
Starting point is 00:30:10 available to me within an account, which it turns out are more than one might expect sometimes, but still leave me free to do my job without effectively spending a nine-month safari figuring out how AWS bills work. And this is why teams like ours exist to kind of help provide those tools to help you get started.
Starting point is 00:30:30 So today, we actually don't let anyone directly use AWS APIs or even use the UI for that matter. And I think you'll soon realize the moment you hit probably 30 or 40 people in your organization, you definitely want to lock it down. You don't want that access to be given to anyone or everyone. And then subsequently start building some higher order tools or abstractions so people can start using that to control effectively. In this case, if you're a
Starting point is 00:30:53 new engineer, Corey, which it seems like you were at some point. I still write code like I am, don't worry. So yes, you would get access to sort of our internal tool to actually help spin up what we call as a dev app, where you get a chance to obviously choose sort of the instance size, not the instance type itself. And we have actually constrained sort of the instance types we have approved within Pinterest as well. We don't give you sort of the entire list
Starting point is 00:31:15 you get a chance to choose and deploy to. We actually have constrained to, based on the workload types, what are the instance types we want to support? Because in the future, if we ever want to move from C3 to C5, and I've been there, trust me, it is not an easy thing to do. So you want to make sure that you're not letting people just use random instances and kind of constrain that by building some of these tools. As a new engineer, you would go in, you
Starting point is 00:31:36 use the tool and actually have a dev app provisioned for you with our Pinterest image to get you started. And then subsequently, you know, we obviously shut it down if we see you're not using it over a certain amount of time. But those are sort of the, you know, guardrails we've put in over there. So you never get a chance to directly ever use sort of the EC2 APIs or any of those AWS APIs to do certain things.
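The guardrails described here (engineers pick a size, the platform picks from approved instance families, and idle dev apps are shut down) could be sketched along these lines. This is a hypothetical illustration only: the allowlist contents, the seven-day idle limit, and every name below are assumptions, not Pinterest's internal tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical policy: workload type -> the single approved instance family.
APPROVED_FAMILY = {"compute": "c5", "memory": "r5"}
# Hypothetical set of sizes an engineer may choose from.
ALLOWED_SIZES = {"large", "xlarge", "2xlarge"}
# Hypothetical idle window after which a dev app becomes a shutdown candidate.
IDLE_LIMIT = timedelta(days=7)


@dataclass
class DevApp:
    owner: str            # recorded so every resource has a known owner
    instance_type: str
    last_used: datetime


def request_dev_app(owner, workload, size, now):
    """Validate a request against the policy and return the resulting dev app."""
    if workload not in APPROVED_FAMILY:
        raise ValueError(f"unsupported workload type: {workload}")
    if size not in ALLOWED_SIZES:
        raise ValueError(f"size not approved for dev apps: {size}")
    # The engineer chooses the size; the family comes from the policy.
    return DevApp(owner, f"{APPROVED_FAMILY[workload]}.{size}", now)


def idle_dev_apps(apps, now):
    """Return the dev apps unused for longer than IDLE_LIMIT."""
    return [app for app in apps if now - app.last_used > IDLE_LIMIT]
```

Keeping the family choice inside the policy table is what would make a later move between instance generations a one-line change rather than a hunt through individual requests.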
Starting point is 00:31:55 The similar thing applies for S3 or any of the higher order tools which AWS would provide too. This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of hello world demos? Allow me to introduce you to Oracle's always free tier. It provides over 20 free services in infrastructure, networking, databases, observability, management, and security. And let me be clear here: it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the
Starting point is 00:32:38 networking, load balancing, and storage resources that somehow never quite make it into most free tiers, needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free, no asterisk. Start now. Visit snark.cloud slash oci-free. That's snark.cloud slash oci-free. How does that interplay with AWS launches yet another way to run containers, for example, and that becomes a valuable potential avenue to get some business value for a developer,
Starting point is 00:33:20 but the platform you've built doesn't necessarily embrace that capability. Or they release a feature to an existing tool that you use that could potentially be just a feature capability story, much more so than a cost savings one. How do you keep track of all of that and empower people to use those things so they're not effectively trying to re-implement DynamoDB on top of EC2? That has been a challenge actually in the past for us because we've always been very flexible where engineers have had an opportunity to kind of write their own solutions many a time rather than leveraging sort of the AWS services. And of late, like that's one of the reasons why we sort of have an infrastructure organization, an extremely lean organization for what it's worth,
Starting point is 00:33:59 but then still able to kind of achieve like outsized outputs where we sort of like evaluate a lot of these use cases as they come in and open up different aspects of what we want to provide, say directly from AWS or build certain abstractions on top of it. Every time we talk about containers, obviously, you know, we always associate that with something like Kubernetes and sort of offerings from there on. We realized that our engineers directly never ask for those capabilities. They don't come in and say, I need a new container orchestration system,
Starting point is 00:34:25 give that to me, and I'm going to be extremely productive. What we've actually realized is that if you can provide them effective tools and that can help them get their job done, they would be happy with it. For example, like they said, our deployment system, which is actually an open source system called Teletraan,
Starting point is 00:34:39 that is sort of the bread and butter at Pinterest, like which my team runs. We operate 100,000 plus machines. We have actually looked into container orchestration where we do have like a dedicated Kubernetes team looking at it and helping, you know, certain use cases move there. But we realized that the cost of sort of entire migrations
Starting point is 00:34:55 need to be like evaluated against certain use cases, which can benefit from being on Kubernetes from day one. You don't want to like force anyone to move there, but give them the right incentives to move there. Case in point, let's upgrade your OS, right? Because if you're managing machines, obviously everyone loves to upgrade their OSs. Well, it's one of the reasons I love savings plans versus RIs. You talk about the C3 to C5 migration, and everyone has a story about one of those. But the most foolish or frustrating reason that I ever saw not to do the upgrade was, well, we bought a bunch of reserved instances on the C3s, and those are a year and a half left to run. And it's foolish, not on the part of
Starting point is 00:35:28 customers, it's economically sound, but on the part of AWS, where, great, you're now forcing me to take a contractual commitment to something that serves me less effectively, rather than getting out of the way and letting me do my job. It's why it's so important to me, at least, that savings plans cover Fargate and Lambda. I wish they'd covered SageMaker instead of SageMaker having its own thing, because once again, you're now architecturally constrained
Starting point is 00:35:50 based upon some ridiculous economic model that they have imposed on us. But that's a separate rant for another time. No, we actually went through that process because we do have a healthy balance of how we do reserved instances and how we look at on-demand. We've never been big users of Spot in the past because just the Spot market itself, we've realized that putting
Starting point is 00:36:09 that pressure on our customers to figure out how to manage that is way more. I say customers, in this case, engineers within the organization. Oh yes, I want to post some pictures on Pinterest and now I have to understand the Spot market what? Yeah. So in this case, when we even were moving from C3 to C5, and this is where that partnership really plays out effectively, because it's also in the best interest of AWS to deprecate their aging hardware to support some of these new ones
Starting point is 00:36:34 where they could also be making good enough premium margins for what it's worth and give the benefit back to the user. So in this case, we were able to work out an extremely flexible way of moving to C5 as soon as possible, get help from them actually in helping us do that too, allocating capacity and working with them on capacity management. I believe at one point we were actually one of the largest companies with the C3 footprint, and it took quite a while for us to move to C5, but rest assured, once we moved, the savings was just immense, right? We were able to kind of offset
Starting point is 00:37:01 any of those RI and we were able to work behind the scenes to get that out. But obviously not a lot of that is kind of considered in a small scale company, just because of, like you said, those constraints which have been placed in a contractual obligation. Well, this is an area in which I will give the same guidance to companies of your scale, as well as small scale companies. And by small scale, I mean people on the free tier account, give or take. So I do mean the smallest of the small. Whenever you wind up in a scenario where you find yourself architecturally constrained by an economic barrier like this, reach out to your account manager. I promise you have one. Every account, even the tiny free tier accounts, have an account manager. I have an account manager who I have to
Starting point is 00:37:40 say has probably one of the most surreal jobs at AWS, just based upon the conversations I throw past him. But it's reaching out to your provider rather than trying to solve a lot of this stuff yourself by constraining how you're building things internally is always the right first move. Because the worst case is, is you don't get anywhere in those conversations. Okay, but at least you explored that, as opposed to what often happens is, oh yeah, I have a switch over here I can flip and solve your entire problem. Does that help anything? Yeah. You feel foolish finding that out only after nine months of dedicated work, it turns out. Which makes me wonder, Corey, I mean, do you see a lot of that happening where folks don't tend to reach out to their account managers or rather treat them as partners in this case, right? Because it sounds like there's just this
Starting point is 00:38:21 unhealthy tension, I would say, as to what is kind of the best help you could be getting from your account managers in this case. Constantly. And the challenge comes from a few things, in my experience. The first is that the quality of account managers and the technical account managers, the folks who are embedded in many cases with your engineering teams in different ways, does vary. AWS is scaling wildly and bursting at the seams, and people are hard to scale. So some are fantastic. Some are decidedly less so, and most folks fall somewhere in the middle of that bell curve. And it doesn't take too many poor experiences for the default to be, oh, those people are useless. They never do anything we want, so why bother asking them? That leads to an unhealthy dynamic where a lot of companies will wind up treating their AWS account manager types as a ticket triage system or the last resort of places that they'll turn
Starting point is 00:39:12 when they should be involved in earlier conversations. I mean, take Pinterest as an example of this. I'm not sure how many technical account managers you have assigned to your account, but I'm going to go out on a limb and guess that the ratio of technical account managers to engineers working on the environment is incredibly lopsided. It's got to be a high ratio just because of the nature of how these things work. So there are a lot of people who are actively working on things that would almost certainly benefit from a more holistic conversation with your AWS account team, but it doesn't occur to them to do it just because of either perceived biases around levels of competence or poor experiences in the past or simply not knowing the capabilities that are there. If I could tell one story around the AWS account management story, it would be talk to folks sooner about these things.
Starting point is 00:39:57 And to be clear, Pinterest has this less than other folks. But AWS does themselves no favors by having a product strategy of yes, because very often in service of those conversations with a number of companies, there is the very real concern of, are they doing research so that they can launch a service that competes with us? Amazon as a whole launching a social network is admittedly one of the most hilarious ideas I can come up with. And I kind of hope they take a whack at it just to watch them learn all these lessons themselves. But that is again, neither here nor there. That story is very interesting. And I think you mentioned one thing, it's just that
Starting point is 00:40:33 lack of trust or even knowing what the account managers can actually do for you. There seems to be just a lack of education on that. And we also found it out the hard way, right? I wouldn't say that Pinterest kind of figured this out on day one. We evolved sort of our relationship over time. Yes, our TAM engagements are sort of lopsided, but we were able to kind of negotiate that as part of deals as we learned a bit more on what we can and we cannot do and how these individuals are beneficial for Pinterest as well. And well, here's a question for you without naming names. And this might illustrate part of the challenge that customers have. How long has your account manager,
Starting point is 00:41:08 not the technical account managers, but your account manager been assigned to your account? I've been at Pinterest for five years and I've been working with the same person. And he's amazing. Which is incredibly atypical. A lot of smaller companies, it feels like, oh, I'm your account manager being introduced to you. And aren't you the third one this year? Great. What happens is that if the account manager
Starting point is 00:41:31 excels very often, they get promoted and work with a smaller number of accounts at larger spend. And whereas if they don't find that AWS is a great place for them for a variety of reasons, they go somewhere else and need to be backfilled. So the smaller account, it's great. I've had more account managers in a year than you've had in five. And that is often the experience when you start seeing significant levels of rotation. And especially on the customer engineering side, where you wind up with, you have this big kickoff and everyone's aware of all the capabilities and you look at it three years later and not a single person who was in that kickoff is still involved with the account on either side. And it's just sort of been evolving evolutionarily from there. One thing that we've done in some of our larger accounts as part of
Starting point is 00:42:13 our negotiation process is when we see that the bridges have been so thoroughly burned, we will effectively request a full account team cycle just because it's time to get new faces in where the customer, in many cases unreasonably, is not going to say, yeah, but a year and a half ago, you did this terrible thing and we're still salty about it. Fine, whatever, I get it. People relationships are hard.
Starting point is 00:42:36 Let's go ahead and swap some folks out so that there are new faces with new perspectives because that helps. Well, first off, if you have had so many switches in account manager, I think that that's something speaks about how you've been working too. I'm just kidding there, but entirely possible in seriousness. Yes. But if you talk to, this is not just me because in my case, yeah, I feel like my account managers, whoever drew the
Starting point is 00:42:57 short straw that week, because frankly, yeah, that does seem like a great punishment to wind up passing out to someone who's underperforming. But for a lot of folks who are in the mid tier, like spending 50 to a hundred thousand dollars a month, this is a very common story. Yeah, actually we've heard a bit about this too. And like you said, I think, you know, maintaining context is the main thing. You really want your account managers to, you know, vouch for you, really be your champion in those meetings because AWS, like you said, is so large. Getting that exec time and, you know, reviews, and there's so many things that happen, your account manager is the champion for you right there. And it's important. And in fact, in your best interest to kind of have a great relationship with them as well, not treat them
Starting point is 00:43:34 as like, oh, yet another vendor. And I think that's where things start to get a bit messy, because when you start treating them as yet another vendor, you know, there's no incentive for them to kind of do the best for you too. You know, people relationships are hard, but that said though, I think given the amount of customers, like these cloud companies are accruing, I wouldn't be surprised. Like, you know, every account manager seems to be like extremely burdened, even in our case, although I've been having a chance to work with this one person for like a long time, we've actually expanded. We have now multiple account managers helping us out as we've started scaling to use certain aspects of AWS,
Starting point is 00:44:05 which we have never explored before. You know, we were a bit constrained and reserved about what services we want to use because there have been instances where we have tried using something and we've hit the wall pretty immediately. API rate limits, or it's not ready for prime time. And we're like, oh my God, now what do we do?
Starting point is 00:44:21 So we have been a bit more cautious, but that said, over time, you know, having an account manager who understands how you work, what scale you have, they're able to advocate with the internal engineering teams within the cloud provider to make the best of sort of supporting you as a customer and sort of like tell that success story all the way up. So yeah, I can totally understand like how this may be hard, especially for those smaller companies. For what it's worth, I think the best way to really think about it is not treat them as your vendor,
Starting point is 00:44:47 but really sort of go out on a limb there. Even though you sign a deal with them, you want to make sure that you have the continued relationship with them to represent your voice better within the company, which is probably hard. That's always the hard part. Honestly, if this were the sort of thing
Starting point is 00:45:02 that were easy to automate, or you could wind up building out something that winds up helping companies figure out how to solve these things programmatically. You talk about interesting business problems that are only going to get larger in the fullness of time. This is not going away. Even if AWS stopped signing up new customers
Starting point is 00:45:19 entirely right now, they would still have years of growth ahead of them, just from organic growth. And take a company with the scale of Pinterest and just think of how many years it would take to do a full-on exodus, even if it became priority number one. It's not realistic in many cases, which is why I've never been a big fan of multi-cloud as an approach for negotiation. Yeah, AWS has more data on those points than any of us do. They're not worried about it. It just makes you sound like an unsophisticated negotiator.
Starting point is 00:45:48 Pick your poison and lean in. That is the truth you just mentioned. And I probably want to give a call out to our head of infrastructure, Coburn. He's also my boss, and he had brought this perspective as well as part of any negotiation discussions. Like you just said, AWS has way more data points on this than what we think we can do in terms of talking about, oh, we are exploring this other cloud provider. And it's, you know, they would be like, yeah, do tell me more how that's going. And it's probably in the best interest to never use that as a negotiation tactic because, you know, they
Starting point is 00:46:19 clearly know sort of the investments that's gone in to kind of build out what you've done. So you might as well like be talking more. Again, this is where that relationship really plays together because you want both of them to be successful and it's in their best interest to like still keep you happy because the good thing about at least companies of our size is that we're probably like one phone call away from some of their executive team where we could always talk about sort of what didn't work for us. And I know not everyone has that opportunity, but I'm really hoping, and I know like,
Starting point is 00:46:47 at least with some of the interactions we've had with the AWS teams, they're actively working and sort of building that relationship more and more, giving access to those customer advisory boards and all of them to have those direct calls with the executives. I don't know whether you've seen that
Starting point is 00:46:59 in your sort of experience in helping some of these companies. I have a different approach to it. It turns out when you're super loud and public and noisy about AWS and spend too much time in Seattle, you start to spend time with those people on a social basis. Because again, I'm obnoxious and annoying to a lot of AWS folks, but I also have an obnoxious habit of being right in most of the things I'm pointing out. And that becomes harder and harder to ignore. I mean, part of the value that I found in being able to do this as a consultant is that I begin to compare and contrast different customer environments
Starting point is 00:47:29 on a consistent, ongoing basis. I mean, the reason that negotiation works well from my perspective is that AWS does a bunch of these every week, and customers do these every few years with AWS. And, well, we do an awful lot of them, too. And it's, okay, we've seen different ways things can get structured and it doesn't take too long and too many engagements before you start to see the points of commonality and how these things flow together.
Starting point is 00:47:54 So when we wind up seeing things that a customer is planning on architecturally and looking to do in the future, well, wait a minute, have you talked to the folks negotiating the contract about this? Because that does potentially have bearing and it provides better data than what AWS is gathering just through looking at overall spend trends. So yeah, bring that up. That is absolutely going to impact I think, understanding the incentives. I will say that across the board, I have never yet seen a deal from AWS come through where it was, okay, at this point, you're just trying to hoodwink the customer and get them to sign
Starting point is 00:48:33 on something that doesn't help them. I've seen mistakes that can definitely lead to that impression. And I've seen areas where their data is incomplete and they're making assumptions that are not borne out in reality. But it's not one of those bad faith type of negotiations. If it were, I would be framing a lot of this very differently. It sounds weird to say, yeah, your vendor is not trying to screw you over in this sense. Because look at the entire IT industry. How often has that been true about almost any other vendor in the fullness of time? This is something a bit different, and I still think we're trying to grapple with the repercussions
Starting point is 00:49:07 of that from a negotiation standpoint and from a long-term business continuity standpoint when your fate is linked in a shared fate context with your vendor. It's in their best interest as well because they are trying to build a diversified portfolio. If they help 100 companies, even if one of them becomes the next Pinterest, that's great. Right. And that continued relationship is what they're aiming for. So assuming any bad faith over there probably is not going to be the best outcome, like you said. And two, it's not a zero sum game. Like I always get a sense that when you're doing these negotiations, it's like, it's an all or nothing deal. It's not like you have to think they're also running a business. And it's important that you as you're in a business, how okay are you with some of those premiums, right? Like you
Starting point is 00:49:49 cannot get a discount on everything. You cannot get the deal or the numbers you probably want on almost everything. And to your point, architecturally, if you're moving in a certain direction where you think in the next three years, this is what your usage is going to be, or it'll come down to that. Obviously you should be investing more in kind of negotiating that up front rather than managed network gateways, I guess. So I think that's also an important mindset to kind of take in, right, as part of any of these negotiations,
Starting point is 00:50:13 which I'm assuming, I don't know how you folks have been working in the past, but at least that's one of the key items we have taken in as part of any of these discussions. I would agree wholeheartedly. I think that it just comes down to understanding where you're going, what's important, and again, in some cases,
Starting point is 00:50:28 knowing around what things AWS will never bend contractually. I've seen companies spend six weeks or more trying to negotiate custom SLAs around services. Let me save everyone a bunch of time and money. They will not grant them to you, I promise. So stop asking for them. You're not going to get them.
Starting point is 00:50:45 There are other things they will negotiate on that are going to be highly case dependent. I'm hesitant to mention any of them just because, well, wait a minute, we did that once. Why are you talking about that in public? I don't want to hear it and confidentiality matters. But yeah, not everything is negotiable, but most things are. So figuring out what levers and knobs and dials you have is important. We also found it that way, like AWS does cater to their, they are a platform and they are pretty clear in sort of like how much engagement, even if we are sort of like one of their top customers, there's been many a times where I know their product managers have heavily pushed back on some of the requests we have put in. And that makes me wonder, like they probably have the same engagement, even with the smallest of customers.
Starting point is 00:51:26 There's always like an implicit assumption that the big fishes try to kind of like get the most out of your public cloud providers. To your point, I don't think that's true. We're rarely able to kind of negotiate anything exclusive in terms of their product offerings just for us, if that makes sense. Case in point, tell us your capacity
Starting point is 00:51:43 for X instances or type of instances so we as a company would know sort of how to kind of plan out our scale ups or scale downs. That's not going to happen exclusively for you. But those kinds of things are just like examples. We have had a chance to kind of work with their product managers
Starting point is 00:51:57 and see if, can we get some flexibility on that? For what it's worth though, they are willing to kind of like find a middle ground with you to make sure that you get your answers. And obviously you're being successful in sort of like your plans to use certain technologies they offer or have more predictability in how you use their services. So I know we've gone significantly over time and we are definitely going to do another episode talking about a lot of the other things that you're involved in, because I'm going to assume
Starting point is 00:52:22 that your full-time job is not worrying about the AWS bill. In fact, you do a fair number of things beyond that. I just get stuck on that one, given that it is what I eat, sleep, breathe, and dream about. Absolutely. I would love to kind of talk more, especially about how we are, you know, enabling sort of our engineers to be extremely productive in this new world and how we want to cater to sort of this whole cloud native environment, which is being created and make sure, you know, people are sort of doing their best work. But regardless, Corey, I mean, this has been an amazing, insightful chat, even for me. And I really appreciate you sort of having me on the show. No, thank you for joining me. If people want to
Starting point is 00:52:56 learn more about what you're up to and how you think about things, where can they find you? Because I'm also going to go out on a limb and assume you're also probably hiring given that everyone seems to be these days. Well, that is true. And I wasn't planning to make a hiring pitch, but I'm glad that you sort of like leaned into that one. Yes, we are hiring and you can find me on Twitter at twitter.com slash M-I-C-H-E-A-L. I am spelled a bit differently, so make sure you can hit me up and my DMs are open. And obviously we have all our open roles listed on PinterestCareers.com as well. And we will, of course, put links to that in the show notes. Thank you so much for taking the time to speak with me today.
Starting point is 00:53:32 I really appreciate it. Thank you, Corey. It was really great being on your show. And I'm sure we'll do it again in the near future. Micheal Benedict, Head of Engineering Productivity at Pinterest. I am cloud economist, Corey Quinn, and this is Screaming in the Cloud.
Starting point is 00:53:46 If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a long rambling comment about exactly how many data centers
Starting point is 00:54:00 Pinterest could build instead. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business
Starting point is 00:54:24 and we get to the point. Visit duckbillgroup.com to get started. This has been a HumblePod production. Stay humble.
