Screaming in the Cloud - Breaking Down Productivity Engineering with Micheal Benedict
Episode Date: November 18, 2021

About Micheal Benedict
Micheal Benedict leads Engineering Productivity at Pinterest. He and his team focus on developer experience, building tools and platforms for over a thousand engineers to effectively code, build, deploy, and operate workloads on the cloud. Mr. Benedict has also built Infrastructure and Cloud Governance programs at Pinterest and previously at Twitter, focused on managing cloud vendor relationships, infrastructure budget management, cloud migration, capacity forecasting and planning, and cloud cost attribution (chargeback).

Links:
Pinterest: https://www.pinterest.com
Twitter: https://twitter.com/micheal
LinkedIn: https://www.linkedin.com/in/michealb/
Transcript
Hello and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
You know how Git works, right?
Sort of. Kind of. Not really. Please ask someone else.
That's all of us. Git is how we build things, and Netlify is one of the
best ways I've found to build those things quickly for the web. Netlify's Git-based workflows mean
that you don't have to play slap and tickle with integrating arcane nonsense and webhooks,
which are themselves about as well understood as Git. Give them a try and see what folks ranging from my fake Twitter for pets startup to global
Fortune 2000 companies are raving about. If you end up talking to them, because you don't have to,
they get why self-service is important. But if you do, be sure to tell them that I sent you
and watch all of the blood drain from their faces instantly. You can find them in the AWS Marketplace or at www.netlify.com. N-E-T-L-I-F-Y dot com. This episode is sponsored in part by our friends at Vultr, spelled V-U-L-T-R, because they're all about helping save money, including on things like, you know, vowels. So what they do is they are a cloud provider that
provides surprisingly high performance cloud compute at a price that, well, sure, they claim
it is better than AWS's pricing. And when they say that, they mean that it's less money. Sure,
I don't dispute that. But what I find interesting is that it's predictable. They tell you in advance
on a monthly basis what it's going to cost.
They have a bunch of advanced networking features.
They have 19 global locations and scale things elastically, not to be confused with openly, because apparently elastic and open can mean the same thing sometimes.
They have had over a million users.
Deployments take less than 60 seconds across 12 pre-selected operating systems.
Or if you're one of those nutters like me,
you can bring your own ISO and install basically any operating system you want.
Starting with pricing as low as $2.50 a month for Vultr Cloud Compute,
they have plans for developers and businesses of all sizes,
except maybe Amazon, who stubbornly
insists on having something of that scale all on their own. Try Vultr today for free by visiting vultr.com slash screaming, and you'll receive $100 in credit. That's v-u-l-t-r dot com slash
screaming. Welcome to Screaming in the Cloud. I'm Corey Quinn. Sometimes when I have conversations with guests here, we run long, really long.
And then we wind up deciding it was such a good conversation
and there's still so much more to say that we schedule a follow-up.
And that's what happened today.
Please welcome back Micheal Benedict, who is, as of the last time we spoke
and presumably still now, the head of engineering productivity
at Pinterest. Micheal, how are you? I'm doing great. And thanks for that introduction,
Corey. Thankfully, yes, I am still the head of engineering productivity. I'm really glad to
kind of speak more about it today. The last time that we spoke, we went up one side and down the
other of large scale environments running on AWS and billing aspects thereof, et cetera, et cetera.
I want to stay away from that this time and instead focus on the rest of engineering productivity,
which is always an interesting and possibly loaded term. So what is productivity engineering?
It sounds almost like it's an internal dev tools team, or is it something more?
Well, thanks for asking, because I get asked this question a lot.
So for one, our primary job is to enable every developer, at least at our company, to do
their best work.
And we want to do this by providing them a fast, safe, and a reliable path to take any
idea into production without ever worrying about the infrastructure.
As you clearly know, learning anything about how AWS works or any public cloud provider works is a ton of investment. And we do want our product engineers or mobile engineers and all
of the other folks to be focused on delivering amazing experiences to our Pinners. So we could
be doing some of the hard work and providing those abstractions for them in such a way and
taking away the pain of managing infrastructure. The challenge, of course, that I've seen is that
a lot of companies take the approach of, ah, we're going to make AWS available to all of our
engineers in its raw, unfiltered form. And that lasts until the first bill shows up. And then it's
okay, we're going to start building some guardrails around that, which makes a lot of sense. There then tends to be a move towards internal platforms that effectively wrap cloud services.
And for a while now, I've been generally down on the concept, and publicly so, in the general
sense.
That said, what I say that applies as a best practice or something that most people should
consider does tend to fall apart when we talk about specific use cases. You folks are in an extremely large environment. How do you view
it? First off, do you do internal platforms like that? And secondly, would you recommend that other
companies do the same thing? I think that's such a great question because every company evolves
with its own pace of development. And I wouldn't say Pinterest by itself had a developer productivity
or an engineering productivity organization from the get go. I think this happens when you start
realizing that your core engineers who are working on product are now spending a certain fraction of time, and it starts ballooning pretty fast, on managing the underlying systems and the infrastructure. And at that point in time, it's probably a good question to ask: how can I reduce the
friction in those people's lives such that they could be focused more on the product
and kind of centralize or provide some sort of common abstractions through a central team,
which can take away all that pain, right?
So that is generally a good guiding principle: when your engineers are spending at least 30% of their time on operating the systems rather than building capabilities, it's probably a good time to revisit and see whether a central team would make sense to take away some of that.
And just simple examples, right?
This includes upgrading the OS on your EC2 machines, or just trying to make sure you're patching sort of all the right versions on your next big Kubernetes cluster you're running for serving X number of users.
The moment you start seeing that, you want to start thinking about if there is a central team
who could take away that pain, what are the things they could be investing in to kind of help uplevel every other engineer within your organization? And I think that's one of the best ways to kind of be thinking about it.
And it was also a guiding principle for us within Pinterest
to kind of view what investments we could make in these central teams,
which can uplevel each and every different type of engineer in the company as well.
And just an example on that could be,
your mobile engineer would have very different expectations
from your backend engineer who's working on certain aspects of code in your product. And it is truly important
to understand where you want to kind of centralize capabilities, which both these types of engineers
could use, or you want to divest and have like unique capabilities where it's going to make them
productive. There's no one size fits all solution for this, but I'm happy to talk about what we have
at Pinterest, which has been reasonably working well.
But I do think there's a lot more improvements we could be doing.
Yeah, but let's also be clear that, as you've mentioned,
you are heavily biased towards EC2 instances for a lot of what you do.
If we look at the AWS console,
and we see hundreds of different services now,
and it's easy to sit here and say,
oh, internal platforms are terrible
because all of those services are going to be enhanced in various ways and you're never going to be able
to keep up with feature parity. Yeah, but if you could wrap something like EC2 in an internal
platform wrapper, that begins to be a different story. Because sure, if someone's going to go
try something new with a different AWS service, they're going to need direct access. But the EC2 product across
the board generally does not evolve in leaps and bounds with transformative changes overnight.
Let's also not forget that at a company with the scale that Pinterest operates at, hey, AWS just
dusted off a new feature and docs are still rolling out. It's not in CloudFormation yet,
but we're going to roll it out to production probably seems like the wrong direction to go in, I would assume.
And yes, I think that brings sort of one of the key guardrails, I think, which these groups
provide. So when we start thinking about what centralized teams like engineering productivity, developer tools, developer platforms actually do, is they help with a couple of things.
The top three are they can help
pave a path for the most common use cases. Like to your point, provisioning EC2 does take a set
of steps all the time. If you're going to have 1000 people doing that every time they're building
a new service or trying to expand capacity, playing with their, you know, launch templates,
those are things you could start like streamlining and making it simple by
some wrapper, because you want to address those 80% of use cases, which are usually common, and you can have a wrapper that could just automate that. And that's one of the key things, right?
Like, can you provide a paved path for those use cases?
The second thing is, can you do that by having the right guardrails in place?
How often have you heard the story that, I just clicked the button and that now spun up like 1,000-plus instances, and now you have to juggle between trying to, you know, stop the machines and do something about it?
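To make that guardrail idea concrete, here is a minimal sketch of the kind of paved-path wrapper such a team might put in front of raw EC2 provisioning. The instance limit, the tag requirement, and the `provision` helper are all hypothetical illustrations, not Pinterest's actual tooling:

```python
# Hypothetical sketch of a provisioning guardrail: cap fleet size and require
# an owning team tag before ever calling the cloud API.
import boto3

MAX_INSTANCES_PER_REQUEST = 50  # assumed policy limit, purely illustrative


def provision(ami_id: str, instance_type: str, count: int, owner: str):
    """Provision EC2 capacity through a paved path with basic guardrails."""
    if count > MAX_INSTANCES_PER_REQUEST:
        raise ValueError(
            f"Requested {count} instances; the limit is {MAX_INSTANCES_PER_REQUEST}. "
            "File a capacity request instead of clicking the button harder."
        )
    if not owner:
        raise ValueError("An owning team is required for cost attribution.")

    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=count,
        MaxCount=count,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "owner", "Value": owner}],
        }],
    )
```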
Back in 2013, you folks were still focusing on this a fair bit, I remember, because Jeremy Carroll, who I believe was your first SRE
there once upon a time, wound up doing a whole series of talks around how Pinterest approached doing an AMI factory. And back in those days, the challenges were, okay,
we have the baseline AMI, and that's great, but we also want to do deployments of things,
and we don't really want to do a new deploy of an entire fleet of EC2 instances for a single
line of config change. So how do we wind up weighing off when you bake a new AMI versus
when you just change something that is in what is deployed to them? And it was really a complicated
problem back then. I'm not convinced it's not still a complicated problem, but the answers
are a lot more cohesive. And making sure that every team, when you're talking about a company as large as Pinterest, with that many teams, is doing things in the same way seems like it's critically important. Otherwise, you wind up with a whole
bunch of unique looking instances that each have to be managed by hand as opposed to something that
can be reasoned around collectively.
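As a rough illustration of that bake-versus-config weighing-off, the policy could be as small as the sketch below. The layer names and the `needs_new_ami` helper are invented for illustration, not how Pinterest actually decided:

```python
# Hypothetical policy sketch: decide whether a change warrants baking a new
# AMI or can ride along as a config push to the existing fleet.
BAKE_TRIGGERS = {"kernel", "base_packages", "security_patch", "runtime_version"}


def needs_new_ami(changed_layers: set) -> bool:
    """Bake when the change touches the machine image itself; otherwise
    ship it as configuration without replacing the fleet."""
    return bool(changed_layers & BAKE_TRIGGERS)


# A kernel patch means re-baking; a single line of app config does not.
assert needs_new_ami({"kernel"}) is True
assert needs_new_ami({"app_config"}) is False
```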
Yep. And that last part you mentioned is extremely crucial as well because, like I said, our audience or our customers are not just the engineers. We do work with our product managers and business partners as well, because
at times we have to tie or change our architecture based on certain cost optimizations, which would
make sense. Like you just articulated, like we don't want to have all the instance types.
It does not add much value to a developer unless they're explicitly seeking a high-memory instance or a GPU-based instance in a certain way. So we can then work with our business partners to make
sure that we're committing to only a certain type of instances and how we can abstract our tools
to only give you that. For example, our deployment system, Teletraan, which is an open source system,
actually condenses down all these instance types to like a couple of categories, like high
compute, high memory. And you've probably seen that in many of the new cloud providers as well.
So people don't have to learn or know the underlying instance type. When we moved from C3 to C5, it was just called a high compute system. So the next time someone provisioned a
new service or deployed it using our system, they would just select high compute as a de facto instance type. And we would just automatically provision a C5 for them.
So that just reduces the complexity or the cognitive overhead individuals would have to
go through in learning each instance type, what is the base AMI that comes on it? What are the
different configurations that need to go in in terms of setting up your AZ scaling properties?
We give them a good reasonable set of defaults to get started with. And then they can then work on like kind of
optimizing or making changes to it.
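A minimal sketch of what that category abstraction might look like, assuming a hypothetical mapping table and defaults rather than Teletraan's real configuration:

```python
# Hypothetical mapping from workload categories to concrete instance types.
# When the platform team migrates a generation (say, C3 to C5), only this
# table changes; services keep asking for "high_compute".
from typing import Optional

INSTANCE_CATEGORIES = {
    "high_compute": "c5.2xlarge",  # was a C3 type before the migration
    "high_memory": "r5.2xlarge",
}

DEFAULTS = {
    "availability_zones": ["us-east-1a", "us-east-1b"],  # assumed defaults
    "min_size": 2,
    "max_size": 10,
}


def render_launch_config(category: str, overrides: Optional[dict] = None) -> dict:
    """Resolve a category into a full launch configuration with sane defaults."""
    config = {"instance_type": INSTANCE_CATEGORIES[category], **DEFAULTS}
    config.update(overrides or {})
    return config


# The common case: accept the defaults and never learn an instance type name.
print(render_launch_config("high_compute"))
```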
Ignoring entirely your mispronunciation of AMI, which is of course three syllables, and that is a petty hill upon which I will die. It occurs to me the
more I work with AWS in various ways, the easier it gets. And I used to think, in some respects, it was because the platform was improving so dramatically around me.
But no, in many cases, it's because the first time
you write some CloudFormation by hand,
it's a nightmare and you keep smacking into weird issues.
But the second or third time, it's super easy
because you just copied the thing you've already built
and changed the relevant bits around.
And that was the learning curve that I went through
playing around with a lot of these things. When you start looking at this from a large-scale environment
where it's not just about upskilling the people that you have to understand how these things
integrate in AWS land, but also the consistent onboarding of engineers at a fairly progressive clip, you have to start doing trainings on all of these things.
And there's a lot of knobs and dials
that can blow up and hurt people.
At some point, building the guardrails
or building the environment
in which you are getting all the stuff abstracted away
from where the application engineers
have to think about this at all,
it eventually reaches a tipping point
where it starts to feel like it's no longer optional
if you want to continue growing as a company, because you don't have the luxury of spending
six months of onboarding before you let someone touch the thing they were hired to build.
And you will see that many companies very often have very similar programming practices,
like you just described.
Even I learned it the same way.
You have a base template, you just copy paste it and start from there on.
No one goes through the bootstrapping process manually anymore.
You want to, I think we call it cargo culting, but in general, just get something to bootstrap
and start from there.
One of the things we learned the hard way is that can also lead to you kind of pushing, you know, not-great practices, because people don't know what is sort of a blessed version
of a good template or what actually would make sense. So some of the things we have been like
working on, and this is where like centralized teams like engineering productivity are really
helpful, is we provide you with the blessed or the canonical way to do certain things.
Case in point: a CI/CD pipeline, or the delivery of software services. We have invested enough in kind of experimenting
on what works with some of the more nuanced use cases
at Pinterest in helping generate
sort of a canonical version,
which would cover 80% of the use cases.
Like someone can just go and try to build a service
and they could just use the same canonical pipeline
without learning much or making changes to it.
This also reduces sort of that cargo-culting nature which I called out, rather than copying it from unknown sources which, again, may cause havoc to our systems. So we can avoid a lot of that because of these practices.
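One hedged sketch of what a blessed, canonical pipeline might look like as shared code; the stage names and the opt-out mechanism here are assumptions for illustration, not Pinterest's actual CI/CD definition:

```python
# Hypothetical canonical CI/CD pipeline: the stages every service gets by
# default, so teams don't copy-paste pipeline config from unknown sources.
from typing import List, Optional, Set

CANONICAL_STAGES = [
    "lint",
    "unit_test",
    "build_artifact",
    "deploy_canary",      # a small slice of traffic first
    "verify_metrics",     # automated checks before a wider rollout
    "deploy_production",
]


def build_pipeline(service_name: str, skip: Optional[Set[str]] = None) -> List[str]:
    """Return the blessed stage list, letting the rare 20% case opt out of
    specific stages rather than writing a pipeline from scratch."""
    skip = skip or set()
    unknown = skip - set(CANONICAL_STAGES)
    if unknown:
        raise ValueError(f"{service_name}: cannot skip unknown stages {unknown}")
    return [stage for stage in CANONICAL_STAGES if stage not in skip]


# The 80% case: no customization needed at all.
print(build_pipeline("example-service"))
```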
So let's step a little bit beyond AWS. I know I hate doing it too, but I'm going to assume that your remit is broader than, oh, AWS Whisperer slash Wrangler.
So tell me a little bit more about what it is that your day-to-day looks like, if there
is anything that could be said not to focus purely around AWS Whispering.
So one of the challenges, and I want to talk about this a bit more, is our environments
have become extremely complex over time. And it's the nature of, you know, rising entropy. We've just noticed that there's two things: we have a diverse customer base, and these include everyone trying to do different workloads or, you know, service types. What that essentially translates into is that we've realized that our solution may not fit all of them. For example,
what works for a machine learning engineer in terms of iterating on sort of building a model
and delivering a model is not the same as someone working on a long running service
and trying to sort of deploy that. The same would apply for someone trying to operate a Kafka system.
And that has made, I think, definitely our job a bit challenging in trying to assess
where do you actually draw the line on the abstraction?
What is the right layer of abstraction across your local development experience, across
when you move over to kind of like staging sort of your code in a PR model and getting
feedback and subsequently actually releasing it to production?
Because this changes dramatically based on what is the workload type you're working on.
And we feel like that has been one of the biggest challenges where I know I
spend my day to day and my team does too, in trying to like help provide some of
the right solutions for these individuals.
Very often, we'll also get asks from individuals trying to do a very nuanced thing. Of late, we have been talking about, like, thinking about how do we operate
functions, like provide functions as a service within the company. It does put us in a
difficult spot at times, because we have to ask the hard question, is this required? I know the
industry is doing it. It's definitely there. I personally believe yes, it could be a future. But
is that absolutely important? Is that going to benefit Pinterest in any formal way if we invest
in some core abstractions?
And those are difficult conversations to have, because we have excited engineers coming in trying to do amazing things.
It puts us in a hard spot as well, as to sometimes graciously saying no.
I know many companies deal with it when they have these centralized teams, but I think
it's part of that job.
Like when you say it's day to day, I would say I'm probably saying no a couple of times in that day.
Let's pretend for the sake of argument that I am, tomorrow morning, starting another company, Twitter for Pets, and over the next 10 years it grows to be larger than Pinterest in terms of infrastructure. Probably not revenue, because it turns out pets are not the lucrative source of ad revenue that I was hoping it would be, but, you know, directionally the same thing.
It seems to me that building out this sort of function, this sort of approach to things, is dramatically early as far as optimizations go when it's just me puttering around on something.
I'm always cognizant of the wrong people taking the wrong message when we're talking about things
that happen like this at scale. When does having
an engineering productivity group begin to make sense? I mentioned this earlier, like, yeah,
there's definitely not a right answer, but we can start small. For example, this group actually
started more as a delivery team. You know, when we started, like, we realized that we had, like,
different ways of deploying services or software at Pinterest.
So we first gathered together to figure out, okay, what are the different ways?
And can we start simplifying that part?
And that's where it started expanding.
Okay, we are doing button-based deployments right now.
We have 1,000 plus microservices.
And we are seeing more incidents than we wanted to because anything where there's a human involved means there's a potential gap for error, right?
I myself was involved in a Sev 0 incident. And I will be honest, like we ended up deploying a Hello World application in one of our production fleets. Not the thing I wanted to be associated
with my name, but- And you were suddenly saying hello to the world, in fact. Oops-a-doozy.
Yeah. So that really prompted us to rethink how we need to kind of enable
guardrails to kind of do safe production rollouts.
And that's how, you know, those conversations start ballooning out.
Yeah.
And that's the healthy, correct way to look at it: we've all broken production in various ways.
And you are correctly identifying, I believe, the direction you're heading in,
where this is a process problem and a tooling problem.
It is not that you are secretly crap and should never have been allowed near anything in production.
I mean, that's my excuse for me. But in your case, this is a common thing: if someone can unintentionally cause issues like that, there need to be better processes and procedures as
the organization matures. Yeah. And that's kind of like always the root or the starting point for
these discussions. And it starts like growing from there on because, okay, you've kind of helped improve the deploy process, but now we're seeing an insane amount of slowness, say, on the build processes, or even post-deploy, there's like issues in how we monitor and look into data.
And that I think forces these conversations, okay, where do we have these bespoke tools available?
What are people doing today? And you have to ask those hard questions, like what can we actually
remove from here?
The goal is not to kind of introduce yet another new system.
Many times, to be honest, Bash just gets the job done.
Personally, I'm okay with that as long as it's consistent and people are able to contribute
to it and you have good practices in kind of validating it.
If it works, we should go for it rather than introducing yet another YAML and some of those other aspects of doing that work. And that's what we encourage as well. That's how I think a lot of this starts
kind of like connecting together in terms of, okay, now this is sort of becoming a productivity
group. Like they're focused on certain challenges where investing probably one person here may uplevel a few other engineers who don't have to do that on a day-to-day basis. And I think that's one of the key items
for especially folks who are running mid-sized companies
to kind of realize and start investing
in these type of teams
to kind of like really uplevel sort of the rest of the engineering.
You've been doing this for a fair while.
If you were to go back and start over again on day one,
which is always a terrifying question on some level,
what would you have done differently about building out this function as Pinterest continued
to scale out?
Well, first, I must acknowledge that this was not just me.
And there's like a ton of people involved in helping make this happen.
No, it's fair.
We'll blame them for the missteps.
That is just fine with me.
I kid.
I kid.
I think definitely the nuances, if I look back. All the decisions that were made at that point in time, like the decision to sort of move to, you know, Phabricator, which was back then a great open source code management system, were made with the information available, choosing X at one point in time. And I think, in reality, that's how engineering organizations always evolve: you have
to make do with the information you have right now to make a decision that works for you
over a couple of years.
And I'll give you a small example of this.
There was a time when Pinterest was actually on GitHub Enterprise.
This was like circa 2013, I would say.
And it really served us well for like five plus years.
Only then at certain point, we realized that it's hard to kind of hire PHP engineers to
support a tool like that.
And we had to rethink what is sort of the ROI and the investments we would make here.
Can we ever map up or match back to sort of what are the offerings in the industry today?
And that's when you sort of make decisions that, okay, at this point in time, it's clear that business continuity tops everything; you know, it's hard to kind of operate a system which is, at this moment, not supported. And then you make a call about, you know, making a shift or moving. And I think that's the key item. Like,
I don't think there's anything dramatically I would have changed since the start, perhaps
definitely like investing a bit more in individuals for the group instead of, like, growing from there.
But that said, I'm really sort of at least proud of the fact
that usually these teams are extremely lean and small
and they always have like an outsized impact,
especially when they're working with like other engineers,
other opinionated engineers for what it's worth.
This episode is sponsored by our friends at Oracle Cloud.
Counting the pennies, but still dreaming of deploying apps
instead of Hello World demos?
Allow me to introduce you to Oracle's Always Free tier.
It provides over 20 free services and infrastructure, networking, databases, observability, management, and security.
And let me be clear here, it's actually free.
There's no surprise billing until you intentionally and proactively upgrade your
account. This means you can provision a virtual machine instance or spin up an autonomous database
that manages itself, all while gaining the networking, load balancing, and storage resources
that somehow never quite make it into most free tiers needed to support the application that you
want to build. With Always Free, you can do things like run small scale applications or do proof of concept testing without spending a dime. You know that I always
like to put asterisk next to the word free. This is actually free, no asterisk. Start now. Visit
snark.cloud slash oci-free. That's snark.cloud slash oci-free. Most folks show up intending to do good today, and you make the best decision at the time
with the context and constraints that you have.
My question, I think, is less around, well, what are the biggest mistakes you made, but
more to do with the idea of, based upon what you've learned and as you have shined light on these dark areas as you have been exploring them, has anything jumped out at you that is, oh yeah, if I knew then what I know now, I would definitely have made this other decision? Ideally,
something that applies a little more globally than specific
within Pinterest, just because the whole idea aspirationally is
that people might learn something from our conversation.
At least I will, if nothing else.
No, I think that's a great question.
And I think there's three things that jump to me top of mind.
I think technology is a means to an end, unless it gives you a competitive edge.
And it's really hard to figure out at what point in time what technology, and why we adopt it, is going to make the biggest difference.
Humans always tend to have a bias towards aligning towards where we want to go. So that's the first one in my mind.
The second one is, and we spoke about this last time, embrace your cloud provider as much as
possible. You want to avoid taking on operational burden, which is not going to add value to the
business. If there's something you see you're operating, which can be offloaded because your provider
can, trust me, do a way better job than you or your team of few can ever do.
Embrace that as soon as possible.
It's better that way because then it frees up your time to focus on the most important
thing, which, I realized over time, is that I really think teams like ours probably add the most value as the glue to all the different experiences a software engineer would go through as part of their SDLC.
If we can simplify someone's life by giving them a clear view as to where their commit
or their work is in this grand scheme of rolling out and giving them the right amount
of data to kind of take action when something goes wrong.
Trust me, they will love you for what you're doing because you're saving them a ton of time.
Many times we don't realize that when we publish 11 different UIs for you to go and check to just get your basic validation of work done. We tend to so much focus on the technological aspect of
what that tool does rather than the experience of it. And I've realized if you can bridge the
experience, especially for teams like ours, people really don't even need to know whether
you're running Kubernetes or any of those solutions behind the scenes. And I think that's
one of the biggest takeaways I have. I want to double down on something you said about the fact
that you are not going to be able to run these services as effectively as your provider can.
And relatively recently, in fact, since the first time we spoke, AWS has released an investment report for Virginia: from 2011 through 2020, they invested $35 billion in building AWS data centers there. I promise, almost no company that employs people listening to this, that is not itself a cloud provider, is going to make that kind of investment in running these things themselves. Now, do cloud providers have sharp edges? Yes,
absolutely. That is what my entire career is about, unfortunately. But you're not going to
do a better job of running things more
sustainably, more reliably, etc., etc. But there are other problems with this, and that's what I
want to start exploring here. Where in the olden days, when I ran things in data centers and they
went down a lot more as a result, sometimes when there were outages, I would have the CEO of the company just standing there, nervously worrying over my shoulder as I frantically typed to fix things. Spoiler: my typing accuracy did not improve
by having someone looming over me. Now, when there's an outage that your cloud provider takes,
in many cases, the thing that you are doing to fix it is reloading the status page and waiting
for an update because it is completely out of your hands.
Is that something that you've had to encounter? Because you can push buttons and turn dials when
things are broken and you control it, but in an AWS or other cloud provider outage,
all you can really do is wait unless you have a DR plan that is large scale and effective enough
that you won't feel foolish
or have wasted a huge amount of time and energy migrating off of it, because then it gets repaired in 10 minutes. How do you approach that from your perspective? I guess the expectation
management piece. It's definitely, I know, something which keeps a lot of folks running infrastructure up at night because, like you just said, at times we can feel
extremely powerless when, you know, we obviously don't have direct control or visibility at times
as well on what's happening. One of the things we have realized over time as part of like running
on our cloud provider for over like a decade now, it forces us to rethink a bit on our priority
workflows, what we want our Pinners to always have access to, what
they need to see, what is not important or critical.
Because it puts into perspective, even for the infrastructure teams, what is the most important thing we should always have available and running, what is okay to be in a degraded state, and until what time, right?
So it actually forces us to define SLOs
and availability criteria within the team
where we can broadcast that to the larger audience,
including the executives.
So none of this comes as a surprise at that point.
I mean, it's not the answer probably you're looking for
because there's nothing we can do
except set expectations clearly on what we can do
and how we need to think about sort of the business
when these
things do happen.
So I know people may have a different view on this.
I'm definitely curious to hear as well.
But I know at Pinterest, at least, we have sort of like converged on our priority workflows.
When something goes out, how do we kind of jump in to kind of provide a degraded experience?
We have very clear runbooks to do that.
And especially when it's a Sev 0,
we do have clear processes in place on how often we need to update
our entire company
and where things are.
And especially, you know,
this is where your partnership
with the cloud provider
is going to be a big, big boon
because you really want to know
or have visibility at the minimum,
some predictability on, you know,
when things can get resolved
and how you want to work with them
on some creative solutions.
This is outside the DR strategy, obviously, right?
You should still be focused on a DR strategy,
but these are just simple things
we have learned over time
on how to just make it predictable
for individuals within the company
so not everyone is freaking out.
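For illustration, that priority-workflow and SLO idea can be captured in a small declarative inventory like the sketch below, which runbooks and status updates could be driven from. The workflow names and numbers are invented, not Pinterest's actual targets:

```python
# Hypothetical priority-workflow inventory: what must stay up, what may
# degrade, and for how long, so a Sev 0 never turns into a debate.
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    availability_target: float  # fraction of requests that must succeed
    max_degraded_minutes: int   # how long a degraded state is acceptable


PRIORITY_WORKFLOWS = [
    Workflow("view_home_feed", availability_target=0.999, max_degraded_minutes=15),
    Workflow("save_pin", availability_target=0.995, max_degraded_minutes=60),
    Workflow("ads_reporting", availability_target=0.99, max_degraded_minutes=240),
]


def breaches_slo(workflow: Workflow, observed_availability: float) -> bool:
    """Broadcast to the wider company only when a priority target is breached."""
    return observed_availability < workflow.availability_target
```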
Yeah, from my perspective,
I think the big things that I've found that have worked in my experience, most of them from my getting them wrong the first time, are explaining that, when someone else is running the infrastructure and they take an outage, there's not much we can do, and that, no, picking up the phone and screaming at someone is not going to help us. That is the sort of thing that is best communicated to executive stakeholders when things are running well, not in the middle of that incident.
Then when things break, it's one of those, great, you're an exec. You know what your job is? Literally anything other than standing in the
middle of the engineering floor, making everyone freak out even more. We'll have a discussion later
about what the contributing factors were. When you demand that we fire someone because of an outage,
then we're going to have a long and hard talk about what kind of culture are you trying to
build here again? But there are no perfect answers here. It's easy to sit here in the sober light of day with things working correctly
and say, oh yeah, this is how outages should be handled. But then when it goes down, we're all
basically an inch away at best from running around with our hair on fire, screaming,
fix it, fix it, fix it, fix it now. And I am empathetic to that. There's a reason that I
fix AWS bills for a living. And one of those big reasons is that it's a strictly business hours problem. And I don't have to run production
infrastructure that faces anything that people care about, which is kind of amazing and freeing
for someone who spent too many years on call. Absolutely. And one of the things is this is
not only with the cloud provider, I think in today's nature of how our businesses are set up,
there's probably tons of other APIs you are using or
you're working with, you may not be aware of. And like, we ended up finding that the hard way as
well. Like there were like a certain set of APIs or services we were using in the critical path,
which we were not aware of, like when these outages happen, that's when you find that out.
So you're not only beholden to your provider at that point in time, you have to have those
expectations set with your other SaaS providers as well, other folks you're
working with, because I don't think that's going to change.
So it's probably only going to get complicated with, you know, all the
different types of tools you're using.
And that's sort of like a trade-off you need to kind of really think about.
An example here is just like, you know, like I said, we moved in the
past from, you know, GitHub to Phabricator.
I didn't close the loop on that because we're moving back to GitHub right now.
And that's one of the key projects I'm working with.
Yeah, it's a circle of life, right?
But the thing is, we did a very strong evaluation here, like, because we felt like, okay, there's a probability that GitHub can go down.
And that means people will be not productive for that couple of hours.
What do we do then?
And we had to kind of put a plan together to kind of how we can mitigate that part
and really build that confidence with the engineering teams internally.
And it's not the best solution out there.
The other solution was just run our own.
But how is that going to make any other difference?
Because we do have libraries being pulled out of GitHub and so many other
aspects of our systems, which are unknowingly dependent on it anyways. So you have to still
mitigate those issues at some point in your entire SDLC process. So that was just one example I
shared, but you know, it's not always on the cloud provider. I think there's just many aspects of at
least today, how businesses are run. You're dependent. You have critical dependencies,
probably on some SaaS provider you haven't really vetted or evaluated.
You'll find out when they go down.
So I don't think I've told this story before, but before I started this place, I was doing
a fair bit of consulting work for other companies.
And I was doing a project at Pinterest years ago.
And this was one of the best things I've ever experienced at a company site, let alone a
client site, where I was there early in the morning,
eight o'clock or so. So, you know, engineers love to show up at the crack of 1130. But so I was
working a little early and it was great. And suddenly my SSH session that I was using to
remote into something or other hung and it's tap up, tap enter a couple of times, tap it a couple
more. It was hung hard. What's the, and then someone gently taps me on the shoulder. So I
take the headphones off. It was someone from corporate it was coming around saying hey there's a slight
problem with our corporate firewall that we're fixing here's a myfi device just for you that
you can tether to to get back online and get work done until the firewall gets back and it was
incredible just the level of just being on top of things and the focus on keeping the
people who were building things and doing expensive engineering work that was awesome.
And also me, productive during that timeframe was just something I hadn't really seen before.
It really made me think about the value of where do you remove bottlenecks from people
getting their jobs done? It remains one of the most impressive things I've seen.
That is great. And as you were telling me that, I did look up our internal system to see whether
a user called Corey Quinn existed, and I should confirm this with you. I do see entries over here,
a couple of commits, but this was 2015. Was that the time like you were around or is this before
that even? That would have been around then. Yes, I didn't start this place until late 2016. I do see your commits like from 2015.
Probably terrible. I have no doubt. There's a reason I don't write code for a living anymore.
Okay, I do see a lot of GIFs. And I hope it's pronounced as Gifford. Okay, this is cool. We should definitely have a chat about this separately, Corey. Oh, yeah. Can you explain this code?
Absolutely not.
I wrote it.
Of course, I have no idea what it does.
That's the rule.
That's the way code always works.
Well, you are an honorary Pinterest engineer at this point.
And you have, yes, contributed to our API service and a couple of Puppet profiles I see over here.
Oh, yes.
You don't wind up thinking that that's a risk factor that should be disclosed.
I kid.
I kid.
I made a joke about this when VMware acquired SaltStack and I did some analytics and found there were 60-some-odd lines of code I had written way back when that were still in the current version of what was being shipped. And they thought, wait, is this actually a risk? And no, I am making a joke. The joke is that my code is bad. Fortunately, there are smart people around me
who review these things.
This is why code review is so important.
But there was a lot to admire
when I was there doing various things at Pinterest.
It was a fun environment to work in.
The level of professionalism was phenomenal.
And I was just a big fan of a lot of the automation stuff.
Phabricator was great. I loved working with it. And, right, I'm going to bring this to the next place I go. And I did. And then I looked at what it took to get it up and running, and, oh yeah, I can see why GitHub is so popular these days. But it was neat. It was
interesting seeing that type of environment up close. That is great to hear. You know, this is
what I enjoy, like, hearing some of these war stories. I am surprised, like, you seem to have committed way more than I've ever done in my duration here at Pinterest.
I do do managing for a living,
but then again, you know, Corey,
the good news is your code is still running on production
and we haven't-
Oh dear.
We haven't removed or made any changes to it.
So that's pretty amazing.
And thank you for all your contributions.
Oh, please, do you have to thank me?
I was paid. It was fine. That's the value of work for hire.
It's kind of amazing. The best part about consultants is, when we're done with a project, we get the hell out and everyone's happy about it. More happy when it's me that's leaving, because of obvious personality-related reasons. But it was just an interesting company from start to finish. I remember at one other time, I wound up opening a ticket about having a slight challenge with a flickering on my then-Apple-branded display
that everyone was using before they discontinued those. And I expected there to be an, oh, okay, you're a consultant, great, how did we not put you in the closet with a printer next to that thing, breathing the toner, like most consulting clients tend to do. And sure enough, three minutes
later, I'm getting that tap on the shoulder again. They have a whole replacement monitor.
Can you go grab a cup of coffee? We'll run the cable for you. It'll just be about five minutes.
I started to feel actively bad about requesting things because I did a lot of consulting work for
a lot of different companies and not to be unkind, but treating consultants and contractors super
well is not something
that a lot of companies optimize for.
I can't necessarily blame them for that.
It just really stood out.
Yeah, I do hope we are keeping up with that right now because I know our team definitely
has a lot of consultants working with us as well.
And it's always amazing to see like, you know, we do want to treat them as FTEs.
Like it doesn't even matter at that point because we're all individuals and we're trying to work towards common goals.
Like you just said, like, I think I personally have learned
like a few items as well from some of these folks,
which is again, like I think speaks to sort of
how we want to work and sort of create a culture of like,
we're all engineers.
We want to be solving problems together.
And as you were doing it,
we want to do it in such a way that it's still fun.
And, you know, we're not having any restrictions
of titles or roles and other pieces.
But I think I digressed.
It was really fun to see your commits, though.
I do want to track this at some point before we move completely over to GitHub.
At least, you know, keep this as a record for what it's worth.
Yeah, basically, look at this graffiti in the code base: a shitposter was here.
And here I am.
And that tends to be, on some level, the mark we leave in
the universe. What's always terrifying is looking at things I did 15 years ago in my first Linux
admin job. Can I still ping the thing that I built there? Yes, I can. And how is that even possible?
That should not have outlived me. Honestly, it should never have seen the light of day in
production. But here we are. And you never know how long that temporary kludge you put together is going to last. It still pings. And there's like a bunch of things in my mind, like, you know, when you are writing code or you're working on some projects,
the fact that it can outlast you
and sort of live on,
I think that's a big, big contribution.
And secondly, if your code can actually help uplevel like 10 other people,
I think you've really met the mark
of a 10X engineer at that point.
Yeah, the idea of the superhuman engineer has always been a strange and dangerous one. If for nothing else,
If for nothing else
from where I sit, excellence is inherently situational. Like what we just talked about,
someone at Pinterest is potentially going to be able to have that kind of impact specifically
because, to my worldview, there's enough process and things around there that empower
them to succeed. Then if you were to take that engineer and drop them
into a five-person startup where none of those things exist, they might very well flounder.
It's why I'm always a little suspicious of, this is a startup founded by engineers from Google or
Facebook or wherever it is. It's, yeah, and what aspects of that culture do you think are one-to-one
matches with the small scrappy startup in the garage? Right. I'm predicting some
challenges here. Excellence is always situational. An amazing employee at one company can get fired
at a second one for lack of performance. And that does not mean that there's anything wrong with
them. And it does not mean that they are a fraud. It means that what they needed to be successful
was present in one of those shops, but not the other. This is so true. And I really appreciate you bringing this up because, you know, whenever we discuss
any form of, you know, performance management, in my view, personally, I think that's an incorrect term to be using.
It is really, at that point in time, either you have outlived sort of the environment you are in, or the environment is going in a different direction, where I think your current skill sets probably could be best used in an environment, you know, where it's going to work. And I know it's very fuzzy at that point. But like you said, yes, excellence really
means you don't want to tie it to the number of commits you have pushed out, or any specific aspect of sort of your deliverables, or, like, how you work.
There are no easy answers to any of these things.
And it's always situational.
It's why I think people are sometimes surprised when I will make comments about the general
case of how things should be.
Then I talk to a specific environment where they do the exact opposite and I don't yell
at them for it.
It's there in a general sense: I have some guidance, but there are usually reasons things are the way they are, and I'm interested in hearing them out. Everything's situational. The worst consultant in the world is the one that shows up, has no idea what's going on, and then asks, what moron set this up? Invariably to said quote-unquote moron. And the engagement doesn't go super well from there. It's, okay, why is this the way that it is? What constraints shaped it? What was the context behind the problem you were trying to solve? And, well,
why didn't you use this AWS service? Because it didn't exist for another three years when we were
building that thing is a very common answer. Yes, you should definitely appreciate sort of
like all the decisions that were made in the past; people tend to always forget why they were made. You're absolutely right. What worked back then will probably not work now
or vice versa.
And it's always situational.
So I think I can go on about this for hours,
but I think you hit that to the point, Corey.
Yeah, I do my best.
I want to thank you for taking another block of time
out of your day to wind up talking with me
about various aspects of what it takes
to effectively achieve better levels
of engineering productivity at large companies with many teams working on shared code bases.
If people want to learn more about what you're up to, where can they find you?
I'm definitely on Twitter. Please note that I'm spelled M-I-C-H-E-A-L on Twitter, so you can definitely read my tweets there. But otherwise, you can always reach out to me on LinkedIn too.
Fantastic. And we will, of course, include a link to that in the show notes.
Thanks once again for your time. I appreciate it.
Thanks a lot, Corey.
Micheal Benedict, Head of Engineering Productivity at Pinterest.
I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice.
Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice,
along with a comment telling me that you work at Pinterest, have looked at the code base,
and would very much like a refund and an apology.
If your AWS bill keeps rising and your blood pressure is doing the same,
then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production. Stay humble.