The Peterman Pod - AWS Distinguished Eng: Learning From 3000 Incidents And How Engineering Is Changing | Marc Brooker

Starting point is 00:00:00 If you aren't doing it hands-on, your opinion about it is very likely to be completely wrong. This is Marc Brooker. He's a distinguished engineer at AWS, and I interviewed him for technical learnings from his career. 3,000 cloud system postmortems. I want to ask you what makes a good postmortem. I could spend a lot of time talking about that. You had a tweet that said that there are cases where caches are bad. I prefer to see the teams around me avoiding caching where possible. We also discussed how software engineering is changing. What is important given that code is kind of flowing like water now?

Starting point is 00:00:38 The job changes and you do different work. For someone who's structuring their career, would you say it's better to be overrated or underrated? Here's the full episode. At some point, when I was a very junior engineer, I looked at the more senior engineer. So what is the difference between you and I? I'm working more hours than you.

Starting point is 00:01:06 I'm landing more code than you. Why is it that you're so much more impactful than I am? And then I realized that kind of the direction of your work, like what is the thing that you're actually shipping matters more than the volume of your work and your contributions? What would be your advice on how do you find problems that matter? Yeah, I think you have to go super broad. So I think there's a set of those things that come in from customers, from the world. Right, like here is an unsolved problem.

Starting point is 00:01:37 I spend a lot of time meeting with AWS customers and listening to them talk about, you know, what are the things they still find difficult in our space? What are they investing in? Where are they spending their time? Where would they prefer to be not spending their time and focus on their core business instead? And so that's one rich seam of ideas and focus on what's interesting. I think completely at the other level is sort of on looking at the technical trends and you can look at just the

Starting point is 00:02:05 kind of speeds and feeds, like, wow, networks have gotten faster, storage has gotten faster. You know, we've seen this huge explosion in multi-core and now in GPUs. And, you know, so there's a bottom-up innovation trend there too, which you can also

Starting point is 00:02:22 look at and say, well, this enables all of these new things. And then broadly, kind of across the world, like, what are the big trends that are going on? What are the things that are changing in our industry? What are the things that are changing in the world? And really, it is those kind of moments of change that, you know, bring with them

Starting point is 00:02:43 the opportunity to build things and to recognize problems. And so to pick one, you know, concretely: when I was working on the Lambda team in 2020, I was talking to a lot of customers about, you know, they were super excited about building on serverless, they were super excited about building on containers. There had been this massive shift. And what people were seeing then was, wow, I love these serverless products. I love building this way. But the world of data and especially relational data doesn't fit super well into this paradigm, right?

Starting point is 00:03:19 These relational databases are still very serverful, you know, fantastically powerful products, but not kind of operationally the same. And, you know, that thinking was, you know, this felt super important to me, of like, wow, these customers have brought to me a gift of understanding something that's really important. And so I joined the Aurora team. We built Aurora Serverless and then we built DSQL. You know, we've been investing deeply across all of our database products to make them

Starting point is 00:03:49 a better fit for these, you know, serverless and container workloads. And that is an example of a trend that was brought by, you know, brought by a customer. But then also these trends that have been driven by kind of architecture or by other things going on, right? Faster networks, faster compute, faster connectivity. So one of the big technical trends in the database world right now is this sort of block storage becoming the default backend,

Starting point is 00:04:24 the default durability layer for databases of all kinds, from analytics workloads to online workloads. And there's been this incredible explosion around that. And so if you look at what we did with Aurora DSQL, for example, you know, that was very much learning from that trend and taking a lead in that trend and saying, well, we're going to make S3, this block store that we built, you know, 20 years ago, sorry, object store that we built 20 years ago,

Starting point is 00:04:54 the underlying durability layer of this new database. But obviously it doesn't have the latency properties, or the rich interface that an online database needs. And so we're going to build an architecture on top of that that deals with all of these other things in a much better way, but doesn't have to worry about durability. And so that was this perfect collision of a set of things I was hearing from customers and a set of things that were technical trends

Starting point is 00:05:24 coming together and thinking, wow, we've got this opportunity to build something now that is going to be a market-leading product that would be hard to imagine without either of those input signals. I saw something that you wrote. You mentioned that you were on call for 15 years somewhere in there. And I've heard many stories of more senior engineers negotiating out of on call because per unit time, it could be perceived as not that impactful.

Starting point is 00:05:55 And so why did you stay on call for so long? I would say that the majority of my in-practice knowledge about how to build distributed systems has come from being on call and analyzing and deeply understanding these post-mortems and COEs. You know, one of the challenges of running a company like AWS

Starting point is 00:06:24 and running large-scale systems is that folks come out of college with great, often great knowledge of computer science fundamentals, great programming skills, you know, great mathematical skills, all of that stuff is fantastic. But without the grounded knowledge of what it actually means to run and understand, you know, understand systems. And, you know, on call is one of the best ways to learn those things, best ways to see, you know, how do systems really run, how do they really behave? You know, how do customers really use them?

Starting point is 00:06:58 What happens when customers use systems in unexpected ways? how can we make systems more resilient to customers using them in different ways? And I think that should be almost a goal of on call, right? If you have folks in your teams who are on call and they're just closing the same ticket over and over and over, well, you know, that's where you need to just build some automation. And again, building automation is easier than ever. It's more powerful than ever. Fantastic.

Starting point is 00:07:26 But where you really want to spend the time of the deeper experts on your team is you know, here's something unexpected or unusual that's happened in the system. Let's deeply understand that. And let's bring that knowledge back to both improving that system and communicating broadly to the company and the outside community, what we've learned from that. And so one of the most, you know, one of the most powerful things we do at AWS is we have this mechanism of a very broad weekly meeting where we all get together, you know, engineers from

Starting point is 00:08:07 across AWS, senior leaders from across AWS, and talk about COEs, these post-mortems that we write, and what we can learn from them and how we can apply those lessons across the whole company. And I think that particular mechanism, that particular kind of Wednesday morning meeting that we have, is one of the things that has been a core, almost causal factor behind AWS's success, because it has allowed us to and forced us to spend leadership bandwidth, to spend expertise, to spend the time of our best engineers, deeply understanding how our systems operate and why they operate the way they do. And, you know, that level of being just extremely grounded in reality

Starting point is 00:09:02 helps you design better products helps you architect better systems and it helps you think more clearly about the next round of things helps you fix you know helps you fix issues and so it's this fundamental kind of learning exercise it's a real blessing so I would you know I would recommend on call to anybody who wants to learn about the practice of distributed systems. And I would certainly recommend spending time, reading COEs, reading post-mortems, and deeply reflecting on not only what can we fix tactically,

Starting point is 00:09:39 but what can we fix organizationally and strategically, and what kind of tools might need to exist to prevent this kind of thing happening again. And you asked earlier about where do ideas come from. And this is another fantastic kind of flow of ideas of saying, wow, you know, we seem to be solving this same problem over and over in different ways and getting it slightly wrong every time. You know, can we extract a tool to do that? Can we build a service around that? Can we build a feature around that to make it easier for us to get right and easier for our customers to get right? Yeah, it's interesting because I think if you

Starting point is 00:10:23 ask most engineers, they really avoid on call. But it sounds like you kind of go towards it and you've learned a lot from it because it's a major source of customer problems. Yeah. And again, you know, I think for me, it comes down to optimizing for finding the most important things to work on. And, you know, if you aren't close to operating your actual system and you don't know how it's actually working, how are you supposed to identify what to fix? You can come up with some theories about those, but they're probably not going to be right. And again, I don't think there's a huge amount of value in the rote ticket-closing work of on call. I think automation should be doing those kinds of work.

Starting point is 00:11:09 But I think there's fantastic value in deep understanding, deep investigations and deep reflection on what you learn from post-mortems and COEs. I tried to estimate a couple of months ago for a talk how many industry postmortems and Amazon COEs I'd read over my career. The best estimate I could come up with, and this was about a year ago, was between 3,000 and 4,000. And so, you know, even a little bit of lesson from each one, and it tends to stick. Yeah, that was my next question, actually. I looked at the slides from that internal presentation, and it said, I've read approximately 3,000 cloud system postmortems from across the industry.

Starting point is 00:11:54 And my immediate thought was, I wanted to ask you what makes a good postmortem. So I think, you know, what makes a really great postmortem is first really getting into the details and making sure that you deeply understand what happened rather than just assuming what happened based on the biases you bring in. And so there's a kind of lesson one there: if you can't understand what happened, well, that teaches you something about your logging and metrics and observability and,

Starting point is 00:12:25 you know, and simulations and all of these other things. And then once you deeply understand what happened, then a great post-mortem steps through the whys behind that at multiple levels, right? Like, why? Well, yeah, there was a code bug. Okay, sure.

Starting point is 00:12:46 Code bugs, yes, we can fix that. But we can't stop there, right? Like, why was that missed in testing and validation? You know, for these reasons, you know, what can we improve? What can we build around those? Okay, next step, you know, why, you know, why was our testing and validation where it was? Or, you know, why did we assume a certain thing about the behavior of the system that we wouldn't have assumed before? And so as you sort of get through these deeper and deeper layers, a great post-mortem

Starting point is 00:13:19 not only identifies kind of fixes to the proximal cause, but also identifies broader fixes to technology, to organizations, to, you know, products and so on. And so that's a kind of multiple levels thing, right? You can't get stuck on, you know, what is the most proximal cause of an incident, but you also can't get stuck on this, well, you know, things fail sometimes and what are we going to do about it? And you have to come up with a set of, you know, really concrete action items to fix things at different levels. Fix this particular line in the software that caused something. You know, fix the testing processes that didn't catch that. You know, fix the, you know, maybe social or team processes

Starting point is 00:14:10 that led to those technical processes. And then if you're seeing patterns across multiple post-mortems, sort of level those up and say, well, clearly there's a hard underlying problem here. Can we build a service around that? Can we build a library around that? Can we build a community of practice around that? Are there technical changes we can make to avoid whole classes of things?

Starting point is 00:14:40 So that's quite a long-winded answer, but I do think it all flows from understanding, and understanding at multiple levels, like understanding immediately, like, what happened, but also understanding, you know, broadly what happened, you know, technologically and organizationally and in context. And then the ability to connect that particular event or post-mortem with other ones, you know, and see those patterns. One of the things that we did in DSQL was we spent a lot of time, as we were designing that, looking around relational database-related post-mortems, both our own and our customers', and thinking about, you know, how can we design a database that helps people avoid falling into these traps? And, you know, a really common kind of outage pattern for folks with relational databases is you have a client on a

Starting point is 00:15:40 distributed system, starts a transaction, and then goes out to lunch for whatever reason. And that could be a GC pause, or it could be a lossy network, or it could be a loss of connectivity, and now it's holding locks. And so if you look at relational databases, they don't tend to be resilient to clients misbehaving in that way. And that's a really common cause of operational issues for systems built on relational databases. And so as we were designing DSQL, we were thinking how do we avoid broadly that class of problems? So folks can say, hey, I'm going to build on DSQL and just not have this whole class of problems.

Starting point is 00:16:22 And I think that's a really kind of powerful outer loop of the post-mortem process is to say, how do we turn all of these lessons into new services and into service improvements? How do you prevent misbehaving clients from being a problem for the database? Yeah, so in DSQL's case, we have no pessimistic locking. And so within the scope of a transaction, everything that happens in that transaction,

Starting point is 00:16:54 all of the reads happen using this mechanism called multiversion concurrency control, where for every row in the database we sort of store a history of versions. And so you can read an old version of a row without blocking writers and saying, hey, you can't update this because I just read it. And then, you know, locally within the query processor that's handling a connection, we spool the writes locally, and then you get to commit time and we do this optimistic check of, you know, can I commit this transaction at the transaction commit time. And so combining those two mechanisms of having multiversion concurrency control

Starting point is 00:17:31 and the scale out storage that comes with it and the commit time optimistic checks, we can strongly say that, you know, there is no way that a reader of a piece of data can block other writers, and there's no way that a writer of data can block readers. Writers can block writers, but only by changing data, not just by looking at it. And so you can, you know, you can say,

Starting point is 00:18:00 well, you know, I can cause, sorry, writers can't block writers, but they can prevent other writers' transactions from eventually committing by making a bunch of changes. And that is inherent to the definition of the particular database isolation level. Out of curiosity, in practice, what percent overhead would you expect for keeping copies of old rows for the sake of those stale reads?
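
A minimal sketch of the two mechanisms described above, multiversion reads plus a commit-time optimistic check. This is a toy in-memory model, not DSQL's implementation: the names `Store`, `Transaction`, and `ConflictError` are hypothetical, and the conflict rule used here (abort if a key you wrote was changed after you started, in the style of snapshot isolation) is an assumption, since the conversation does not name the isolation level.

```python
class ConflictError(Exception):
    """Raised at commit when another transaction changed a key we wrote."""


class Store:
    """Toy multiversion store: each key keeps a history of (commit_ts, value)."""

    def __init__(self):
        self.versions = {}  # key -> [(commit_ts, value), ...] in commit order
        self.clock = 0      # logical commit timestamp

    def begin(self):
        return Transaction(self, start_ts=self.clock)

    def read_at(self, key, ts):
        # Return the newest version committed at or before ts.
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= ts:
                return value
        return None

    def latest_ts(self, key):
        history = self.versions.get(key)
        return history[-1][0] if history else 0


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # writes are spooled locally until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Reads take no locks: they just pick an old enough version.
        return self.store.read_at(key, self.start_ts)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Optimistic check: fail if any key we wrote changed after we started.
        for key in self.writes:
            if self.store.latest_ts(key) > self.start_ts:
                raise ConflictError(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))


store = Store()
setup = store.begin()
setup.write("balance", 100)
setup.commit()

slow = store.begin()           # a client that reads, then "goes out to lunch"
print(slow.read("balance"))    # reads its snapshot
fast = store.begin()
fast.write("balance", 50)
fast.commit()                  # not blocked by the slow reader
print(slow.read("balance"))    # the slow client still sees its old snapshot
slow.write("balance", 0)
try:
    slow.commit()
except ConflictError as err:
    print("slow client must retry, conflict on", err)
```

In this sketch the stalled client never holds anything that other transactions wait on; the cost of its stall lands on its own commit, which fails and has to be retried.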

Starting point is 00:18:28 Yeah, it's actually surprisingly small. And it's surprisingly small because if you look at the access patterns for most online databases, even ones that do a lot of write traffic, that write traffic tends to be quite concentrated, and it's quite unusual for an online database workload or even an analytics workload to make a second version of every row in the database. Typically what it's doing is making a first, second, third,

Starting point is 00:18:56 the hundredth version of this row and a 50th version of that row, but the vast majority of data isn't changing. And so it's super workload dependent, as is everything in the database world. But the overhead tends to be relatively small. I would say it's unusual for an online database workload for that overhead on storage to be more than about 10%. From my experience, I've seen an interesting dichotomy between teams where some teams, they really understand post-mortem culture. They tend to be infrastructure teams.

Starting point is 00:19:32 They tend to take it really seriously, and everyone on those teams, the tech leads are asking you, hey, why did that happen, and really follow up and make sure it's not a problem. Then I've also noticed on other teams that is less of a strong muscle. For those teams that don't take it too seriously, what would be your pitch for why they should take it seriously? Yeah, it all comes down to where you want to spend your time, right? Do you want to spend your time improving your product and making it better,

Starting point is 00:20:01 or do you want to spend your time fighting the same fire over and over? And, you know, really, the culture of building, you know, building great post-mortem cultures is to make sure that at the product level

Starting point is 00:20:21 and at the organizational level, you are fixing known issues and you are avoiding having the same problems multiple times. And typically when I see teams that have poor post-mortem culture, I think they're probably

Starting point is 00:20:43 one of two failure modes there. One of them is a lack of focus on just the outcomes, right? Like, you know, a lack of really, I wouldn't say caring enough. I think that's a little bit too

Starting point is 00:20:59 personal, but being really focused on, you know, is this product performing super well? Are we, you know, are we really making our customers happy? And that is fundamentally a cultural and leadership problem of setting the right standards. Oh, and by the way, like, I don't think, you know, standards should be uniform, right? Like, there are places where, you know, the details really, really matter, where things like durability are just critical, and you do need to have super high standards in those places. And, you know, there are places where you want to optimize for other things and maybe have a higher production defect rate, and I think that's okay

Starting point is 00:21:45 as long as that's an intentional decision that's being made so that's kind of case one right like insufficient focus on the outcome I think two, and this is a harder one to change, is normalization of kind of operational heroics. Like, we don't need to fix these root causes because our on-calls are superheroic and they're going to stay up all night and they're going to, you know, they're going to hack around things, and they don't mind being paged a hundred times a week. And they can feel from the inside like it's a good culture, right? Like, oh, wow, these people are super strong owners. They're super engaged, they really care, they're really working hard on call, and those are all good signals.

Starting point is 00:22:30 But then when you look at it from the outside, it's like, wow, we're not actually fixing the causes of things. We're just doing this fantastically expensive investment of taking all of these people and their strong ownership and their expertise and spending them just on this break-fix cycle. And that's where you need to kind of look at it from the outside and say, well, let's take this energy of this team, fantastic energy, and focus on improving the service, getting out of the cycle, finding, you know, finding new things to fix, finding new things to build. And that can be hard because it can be hard for, you know, those folks who've been in that mode to look at it and say, this feels so good, it feels

Starting point is 00:23:15 really like we're caring about our customers and caring about our product and caring about our business to realize that, oh, no, we're actually caring about it at the wrong level and we're not serving our business in the best possible way by being so narrowly and tactically focused on this break-fix cycle. And that's where you sort of need to pop them out and say, well, let's spend more time thinking about the post-mortem. Let's spend more time thinking about the causes of things. Let's spend more time addressing these things in a more strategic way. And wow, okay, now you've got so much more time to do that because you've broken the cycle and you can improve your product in different ways.

Starting point is 00:23:58 I mean, since you have worked on AWS for almost two decades, I'm sure you have a lot of experience building distributed systems. And I think one of the most common pieces of advice that you hear, I guess this is maybe in the context of system design, is, I hear almost 100% of the time, people will say, just throw a cache on it. Or you'll have a system design and you say, how do you make it better?

Starting point is 00:24:23 Let's put a cache here, let's put a cache there. And I saw you had a tweet that said that there are cases where caches are bad despite people saying it's best practice. And I'm curious if you could explain that. Yeah, so caching's good, right? Like it's, hey, I'm going to take these core ideas from computer science of temporal and spatial locality, and I am going to exploit those to make my system faster, scale better, etc. And so, you know, obviously very attractive.

Starting point is 00:24:52 But the downside of caches, especially in distributed systems, is they have this mode, right? Like they have this, you know, there's a mode where the cache is full and the cache is full of the right data in time and space to perform very well. And there's a mode where the cache is empty or contains the wrong data. And in the first mode, the system is fast and happy and healthy. In the second mode, the system is slow,

Starting point is 00:25:22 often down, because now the back end isn't scaled to deal with all of this uncached traffic, and customers are very disappointed. And often it is down in a stable way. And this is this kind of idea of metastable failures, where the system has switched from state one to state two, and in state two it's still stable, right? Like it's still, it's down, but it's not going to come back up under its own energy because, for example, all of this traffic is causing a huge amount of contention in my database or is saturating the network.

Starting point is 00:25:58 And so I can't even refill the cache. It's not even getting the right kind of data in. And so, you know, when I talk about the downsides of caches, it's really about, you know, how do we avoid that modality between, you know, fast and, you know, that value of caches and the, you know, how do we avoid the state where we're down? And so if I go back to DSQL, like our answer there is, in DSQL, what we call the storage tier is essentially a cache, but it is a complete cache.
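
The two modes described above can be illustrated with a small discrete-time model. This is a hedged sketch, not a model of any real system: the function names, the numbers, the expiry rate, and especially the overload rule (useful backend work falls to `capacity**2 / misses` once misses exceed capacity, standing in for contention) are all assumptions chosen for illustration.

```python
def step(hit, load, capacity, expiry=0.1):
    """Advance the toy model one tick; return (new_hit_ratio, goodput)."""
    misses = load * (1.0 - hit)
    if misses <= capacity:
        served = misses                        # the backend keeps up
    else:
        served = capacity * capacity / misses  # overload: contention wastes work
    # A fraction `expiry` of cached entries goes stale each tick, and each
    # successfully served miss refills one entry.
    new_hit = min(1.0, hit * (1.0 - expiry) + served / load)
    goodput = load * hit + served
    return new_hit, goodput


def run(hit, load=1000.0, capacity=150.0, ticks=300):
    """Run the model from a starting hit ratio; return the final (hit, goodput)."""
    goodput = 0.0
    for _ in range(ticks):
        hit, goodput = step(hit, load, capacity)
    return hit, goodput


warm_hit, warm_goodput = run(hit=0.9)  # start with a warm cache
cold_hit, cold_goodput = run(hit=0.0)  # same load, but start with an empty cache
print(f"warm start: hit ratio {warm_hit:.2f}, goodput {warm_goodput:.0f} of 1000")
print(f"cold start: hit ratio {cold_hit:.2f}, goodput {cold_goodput:.0f} of 1000")
```

Under these assumptions the same offered load has two stable outcomes: a warm start settles at a high hit ratio with every request served, while a cold start settles at a low hit ratio where the overloaded backend cannot refill the cache fast enough to escape.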

Starting point is 00:26:33 It contains every row in the database. And so it doesn't have this mode where how do I recover from it being empty or containing the wrong data. It contains all of the data. Similarly, if you look at a more classical relational database design like Aurora, the Aurora leader is constantly telling the potential failover targets, here's something you should cache,

Starting point is 00:26:58 here's something you should cache, here's something you should cache. So when a failover happens, the cache is warm on the failover target. And so those are the kinds of things that you can do to avoid those modalities. But in general, you know, and I wouldn't extract this as a rule or say that, you know, this applies 100% of the time. But in general, I prefer to see the teams around me avoiding caching where possible. I prefer patterns where you have a complete materialized view of the data if you need very fast access to it. Especially if it's slow moving, just pull it down onto your local machine and work with it in memory.

Starting point is 00:27:42 If it's only being updated once a week, who cares? Like, just make lots of copies of it. So that's one pattern. Or, you know, use a scalable backend, you know, DSQL or DynamoDB or whatever your favorite scalable database is, and keep your database vendor honest about getting to the scale and performance you need rather than putting a cache in front of things. So caching isn't a bad pattern, but it is a pattern with some significant downsides that are, you know, really best avoided.
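
A minimal sketch of the "complete local copy" pattern mentioned above for small, slow-moving data. The class name `FullSnapshot` and the `load_all` callable are hypothetical; the assumption is that the whole dataset fits in memory and can be fetched in one call.

```python
import threading
from typing import Callable, Dict, Optional


class FullSnapshot:
    """Holds a complete in-memory copy of a small, slow-moving dataset."""

    def __init__(self, load_all: Callable[[], Dict[str, str]]):
        self._load_all = load_all
        self._lock = threading.Lock()
        self._data: Dict[str, str] = load_all()  # fail fast if the first load fails

    def refresh(self) -> None:
        fresh = self._load_all()  # build the new copy off to the side
        with self._lock:
            self._data = fresh    # swap in one step; readers never see a partial copy

    def get(self, key: str) -> Optional[str]:
        with self._lock:
            data = self._data
        return data.get(key)      # a missing key is a real absence, not a cache miss


source = {"us-east-1": "enabled"}
snapshot = FullSnapshot(lambda: dict(source))
source["eu-west-1"] = "enabled"   # the source of truth changes
print(snapshot.get("eu-west-1"))  # not visible until the next refresh
snapshot.refresh()                # e.g. run on a timer, weekly for weekly data
print(snapshot.get("eu-west-1"))
```

The design point is that lookups never fall through to the backend, so backend load depends only on the refresh schedule and not on request traffic; there is no empty-cache mode to recover from.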

Starting point is 00:28:16 In practice, how often do you see that metastable failure, though? Yeah, you know, this is, it's not super common, right? Like you might go years without seeing, you know, something like that. But if you look across the biggest, most impactful, you know, system post-mortems across the industry, I would say that these kinds of metastable failures have been an underlying cause in probably a majority of them. And it's super important that, you know, as an industry and as a community of practice, we understand those things deeply because also those cases where these do

Starting point is 00:28:56 happen, you know, tend to be larger-scale issues, longer-recovery-time issues, and more-complex-to-fix issues, right? Where you have to often, you know, turn it off and turn it back on again, which is this very, very painful thing for a team or an organization to do. And, you know, and so again, like, you might go years operating a system without seeing anything like this. But if you look at the most impactful issues, it's actually fairly common as an underlying cause for those issues. And so, you know, it's kind of both of these things

Starting point is 00:29:33 of being quite uncommon and being rather common. I was reading your blog and you have a series of posts on how AI may impact the future of software engineering. And I kind of want to pick your brain on that. So what's your perspective on how you think AI will impact software engineering and how it'll change things? Yeah, I mean, it's maybe harder than ever to tell the future. And so, you know, this is a set of maybe guesses and predictions about the future.

Starting point is 00:30:05 So I'll say the first thing I, you know, I deeply believe about software is, is we have only just started to see the impact that software is going to have on the world. There is such an opportunity for more software to exist, bigger software, better software, more personal software, all of these things. And so software has, throughout its 60-ish-year history, been supply constrained. And, you know, I think that's going to remain true. I think the opportunity for software in the world is just almost unbounded. And that's really exciting, right?

Starting point is 00:30:49 It's really exciting to be at a moment when the economics of building software are changing and are changing rather quickly. And that gives us an opportunity to think about what could we do in the world with a lot more software. You know, a lot more software personalization, a lot more, or just the right software in the right place at the right time. And that gives me a huge amount of excitement about the future of this industry because we have a massive opportunity ahead of us, driven by these changing economics of software development.

Starting point is 00:31:35 Now, also with those changes, there are going to be needs for us as, as software practitioners, people who build software, people who love software, to adapt. And, you know, that means that software careers are going to look different. They're going to look different early on. They're going to look different later on. I think the software business is going to look different. And the success of people in organizations over the next, you know, next, who knows, five years, decade is going to be largely predicated on their ability to adapt to that change and to lead

Starting point is 00:32:16 that change. You told this story about this guy who bet on analog circuits when obviously we know digital became kind of the more dominant way. Yet he made good money. For the people who maybe don't want to adapt, you could still get by and succeed. It's not going to be like a crazy thing. Is that kind of the takeaway and why you brought up that story? Yeah, I think that's the right takeaway. And so if I sort of break down, you know, the world into three tiers, you know, I think there's going to remain a huge amount of joy in the craft of software, you know, like the craft of joinery with, you know, with handsaws, right? Like, it's a nice way to spend time. It's not a particularly economically interesting

Starting point is 00:33:07 activity anymore. But not everything we do has to be an economically interesting opportunity. It can just be something I do because I enjoy it, because I enjoy the product of it, because I enjoy talking to people about it. Right. And so there's, you know, I don't think that is going to go away. I think we're going to see, you know, a lot of interest in that. Like there's been interest in retrocomputing and, you know, people who run an Apple II as their desktop. And like, well, again, it's wildly impractical. It's not economically interesting, but it's fun and something I, you know, could do as a hobby. And so, you know, that's going to be a remaining part of the world of software for probably

Starting point is 00:33:44 forever. And then there's this, you know, kind of story that I told in the blog post. And I think this relates to, you know, driving change in the real world is always harder than it looks from the outside, right? Like as you get into the details, things become more difficult. They become more dependent on people. They become more dependent on politics and policy and our various irrationalities as humans.

Starting point is 00:34:13 And so driven by that, you know, there is going to be a huge amount of, and a shrinking over time amount, but a huge amount of the software industry that is run in what I might call the old way, right? Past techniques, past languages, past technologies, and there's real economic opportunity in engaging with that part of the, you know, part of the world. You know, as we saw with analog electronics, analog electronics still very much exist.

Starting point is 00:34:48 In fact, there are parts of the world like, you know, like radio and power systems where there's been incredible technological advancement in those fields. But they have become more niche. And so, you know, digital became the mainstream. We wouldn't be talking like we are today if it wasn't for this, you know, 12 orders of magnitude or whatever explosion in digital transistor counts. But there's interesting opportunity there, and I think that interesting opportunity is going to change shape and become more and more specialized and more niche and great careers to be built there. And then there is the mainstream, which I think is going to adopt these new technologies from agentic development to AI-powered development,

Starting point is 00:35:36 to, you know, specification-driven development and, you know, a whole lot of other, you know, new things whose names we don't even know yet, to build software at a speed and a cost that is unimaginable to do with the old techniques. And I think that is where correctly the majority of the industry is going to be going. I think that's where the majority of careers are going to be built. I think that's where the majority of economic opportunity is.

Starting point is 00:36:10 It's the space I'd be in. If I was building a company today, it's the space I'm in in my role. And the one I would sort of personally be most excited about. But yeah, it isn't the only one. I think there's going to be the spectrum of software practice. And especially where software engages with the physical world, there are going to be some really interesting questions about how do we bring these new technologies, how do we bring these new practices into the various many

Starting point is 00:36:42 niches that software is going to and has, you know, over six decades, kind of wormed its way into. It's interesting. You mentioned joinery. I wonder if down the road we will see apps on the app store that people pay extra for because it's marketed as this was written by a human or it was written by hand. It's a bespoke custom app crazy how the world is going to change. So it sounds like change is obviously the common case. It's the one that we should be thinking about. Maybe we can break up the conversation in two parts. One is for junior engineers, what is important given that code is kind of flowing like water now?

Starting point is 00:37:27 At risk of being a bit meta about our past conversation, it really is about finding those problems that matter and doing that early in a career, and, you know, that requires an understanding of customers. It requires an understanding of the business. It requires an understanding of economics and of systems. And that can, I think that's going to move from being, you know, almost kind of senior engineer work of like, oh, well, you know, now you're going to go and talk to customers and actually understand the context of the stuff

Starting point is 00:38:01 you're building to being more and more part of even the earliest steps of an engineering career, right? Like here's the context, here's the problem, here's the customer, let's go off and work together and solve, you know, and solve this problem with all of this context. And I think that's going to be super exciting for one set of folks and a little bit frustrating for people who have come into, you know, looking for a pure software development career, right? Looking for a career where they sit down, open their IDE, start typing, and don't stop for eight hours.

Starting point is 00:38:44 I think that's going to be a mode that we're going to see fewer people in and a mode that's going to be harder and harder to build a career around. Now, the other mode of, oh, I'm excited to go off and learn from my customers about what they're building and what they need, I think that's going to be ever more highly, you know, highly valuable. And so super exciting opportunity to build, you know, build careers there. And then maybe, and this might come across as being a little bit, you know, paradoxical. I think there's also a ton of opportunity for, you know, folks who are extremely technically deep, you know, who are, you know, deep on optimization problems or deep on infrastructure

Starting point is 00:39:27 problems or deep on, you know, various scientific things, or deep on databases, or deep on, you know, one of the many, many topics that are behind our industry, because I think the ability to ask the right questions is also much more valuable than it has ever been. And so I think there is a ton of opportunity for people coming into the industry with deep technical or scientific knowledge to now leverage that in ways that, you know, maybe were hard before, right? There was too much sort of boilerplate to really, you know, to really use that leverage that you have. And so I think we're going to see a lot more of those kinds of careers of really kind of building expertise in a technical topic, in a scientific topic,

Starting point is 00:40:17 and then be able to turn that into software and software products in a way that was really difficult before and in some cases wasn't possible before and is now, you know, vastly easier. If I was to look at a career ladder's expectation, some of what you described of maybe engaging with the customers and understanding the business context, uniquely in software engineering, it feels like the earliest levels are insulated from all of that. You have your tech lead, tech leads handing out tasks, and then the early level engineers just given tasks, just converted into code. And it sounds like, you know, that part's relatively solved.

Starting point is 00:41:00 If not now, maybe I'd be surprised if a year or two from now wasn't, like, completely solved. And I think that could scare a lot of junior engineers because they would think, you're going to expect me to graduate from college or start working as a software engineer, and then I would have the senior engineer expectations. What would you say to the scared software engineer that's just entering the industry with all this change? Yeah, you know, I think, well, I would remind them that, you know, we as people who hire and build organizations of software engineers

Starting point is 00:41:40 and they as people who are building software engineering careers have really aligned incentives, right? Like, you know, it's not valuable to hire a bunch of people and set them up to fail. Like, nobody wants that. It's not an outcome that is good for anybody. And so, yeah, we're going to need to figure out how do you support people on that path? How do you help people learn those things?

Starting point is 00:42:08 How do you give them the right guardrails, you know, hey, that first time that you go out and talk to a customer, yeah, it's going to be scary. My first time talking to an AWS customer was, you know, it was super scary. But, you know, I got a bunch of help with that and I got a bunch of advice and I got a bunch of mentorship and I got a bunch of feedback and I got better and better at that over time. And I think that's exactly what these things look like is, you know, you start off and you start small and you learn, you know, as you go. And so that feedback loop goes faster. And so I don't expect that people coming in from college or, you know, will come in with all of this knowledge. I think

Starting point is 00:42:51 You know, it's never been true that people coming into technical or engineering careers straight out of college know everything. Or any career, for that matter, right? Like you talk to teachers about, you know, what they've learned on their job versus what they learned, you know, studying. You know, they learn a huge amount in things like internships and so on and over the course of a career or doctors or anybody in, you know, in a field like that. And so, yeah, it is going to be about learning. I think the emphasis on what people learn is going to be different. I think it is going to require, you know, leaders like me who, you know, care deeply about, you know, hiring and developing folks early in their career to be really thoughtful about what, you know, what does that new ladder look like. And, you know, we're doing a lot of that thinking.

Starting point is 00:43:44 I think people are doing that kind of thinking across the industry. And, yeah, it's changing fast. It's uncertain. It's an interesting time to be graduating. But again, like, it's a super exciting time. I think that's just the scale of the opportunity is bigger than it's ever been. Sounds like your advice for senior engineers is different from that of junior engineers. What is your thinking there?

Starting point is 00:44:09 Yeah. I mean, I think, you know, I think for folks there, the challenge is how do you, you know, how do you retain the value of this incredible experience and knowledge that you've gained over a career while, you know, not falling behind, while learning how do you, you know, best use the tools. And, you know, when I look at senior folks, this is a challenge, you know, ahead of them. I think a lot of people have found themselves in influence

Starting point is 00:44:39 and leadership type positions where they aren't hands-on building, you know, every day. And I think it's going to be harder and harder to be in that kind of role and be able to influence and advise in a relevant way, in a positive way. And so really, I think my advice for folks is you kind of got to get building. Like you've got to get back into it. You need to deeply understand how the practice of building software

Starting point is 00:45:15 and the practice of designing software has changed and is continuing to change. And so the challenge is how do I really take advantage of all of this knowledge and expertise that I've built up in my career and be super curious and be super hands-on and really be in the details. And the good news for that, well, I think there's two bits of good news. One of them is because of these new tools, you know, time spent as a practitioner is so much more leveraged today. You can build such cool stuff, you know, during that period of time, the amount of kind of wasted time and boilerplate and so on is so much smaller. And so you really do have this opportunity.

Starting point is 00:46:03 And the other one is, again, like, why did, you know, why did we get into the space? Well, I didn't get into it so I could go to meetings and sound smart. I got into it because I love learning and because I love building technology and because I love, you know, solving my customers' problems and because I love, you know, learning about new, you know, new technologies and learning new things. And there's more opportunity to do that than ever before. Again, you know, because of this new set of tools and the leverage that comes with them. And so really is getting back to, you know, why are you here? Why did you get into this career? And I think it really gets us as technology-focused people closer to our original answer to that.

Starting point is 00:46:45 It's really obvious to me right now when I speak to, you know, practitioners, you know, who is and who isn't using a, you know, modern set of agentic-powered developer practices, right? And the people who are have these really interesting things to say about the strengths and weaknesses of those approaches and the work that still needs to be done and the integrations that still need to be done and the things that are working and aren't. And the people who aren't, you know, using them hands on have such a poor mental model of how they work, what they're good at, what they're not good at, that the things they say about them tend to be essentially fiction. And so, you know, I think we are in this moment where if you aren't doing it hands on, your opinion about it is very likely to be completely wrong. And that takes a level of humility to admit that, you know, is tough. You know, it's tough for folks with fancy titles and it's tough for folks with distinguished careers. But I think it's a must. I feel like there's a common sentiment among software engineers when they work with someone who is a, you know, quote

Starting point is 00:48:04 unquote tech lead, but they're not really hands on. So they've kind of been in the docs for the last five years or so. And there's these minor things they can tell that this person doesn't actually understand the underlying thing. And it sounds like that gap will widen with these new tools, which is if you're looking at things from 1,000 feet up and you're not actually using the tools, that's just another thing that separates you from the people who are actually building

Starting point is 00:48:33 where you'll be very out of touch. And I think, you know, when I look at, I think that's always been true. I think it is wider than, you know, ever before. But when I look at the, you know, engineering leaders that I've really respected and learned a huge amount from over my career. You know, for example, some of the folks who built S3, you know, 20 years ago, that was such a successful product because those folks were so deep in the details and so grounded

Starting point is 00:49:05 on the use cases and so deep in the economics and really just did, you know, really thought about both the kind of strategic world of like how is this cloud thing going to change the way people want to interact with storage, but also the minute-to-minute details of what's fast now, what's slow, what's good, what's bad. And I think, you know, when you think about an extremely enduring product like S3 or EC2, I think it's been that groundedness in the details from early on, from all levels of leadership that has made those things so successful, where other products seemingly with the same amount of early promise didn't turn out to be as successful. I think one of the last topics that I wanted to ask you about was writing. You have

Starting point is 00:50:06 a ton of awesome posts on your blog. The style of writing is incredibly clear. And I was curious, why do you write so much as an engineer? Writing and speaking, but especially writing, have this incredible power. And, you know, for technical folks, it's this incredible multiplier in being able to take these ideas that are in your head and share them with the world. And, you know, you can take a set of technical ideas in your head and share them with the world by building a great product, and that's a fantastic thing to do. You can share them in the world kind of one-on-one, you know, mentorship, teach people, learn, small groups, also a great way to spend time. But the multiplication factor of doing a talk or even more of writing something is so much

Starting point is 00:51:03 higher, right? Like there are so many more people that you can share that with and it lasts for a much longer period of time. And so just having something written on my blog, even that I wrote like a decade ago that I can share with someone and say, you know, here's how to think about this problem, here's an insight that I wanted to share with you, or have people discover that organically is just super powerful. And so writing lets you scale out the impact of your expertise in space and time in a way that's really hard to do in other media.

Starting point is 00:51:38 I think with video and with podcasts and so on, you know, we've seen other ways to do that. But I think writing remains kind of uniquely powerful. And then there's also this idea, which is this kind of core belief culturally at Amazon. And I've obviously been affected by this over the years, that, you know, writing forces a level of mental clarity that speaking, making slide decks, et cetera, doesn't.

Starting point is 00:52:06 And, you know, that's something that has also really been my experience: sitting down to write something down forces me to think it through at a depth that I wouldn't have been forced to think it through without that. And so I saw one of your early conversations was with Leslie Lamport, who kind of takes that a step further and says, hey, you know, it's formal mathematics that is the next step there, and I love that point. But I think writing is this really accessible thing for people to do that does force a level of thinking. And so I do a lot of writing sometimes just for myself, right?

Starting point is 00:52:44 Like I'll write a doc, not ever intending to share it with anybody, but just to sharpen my own thinking on a particular point. And so it's some of that combination of three things, right? Like, I just have something to say and I want to say it. You know, I have something to say and I want to scale it out in time and space. And I want to sharpen my own thinking on a subject or the thinking of a small group on a subject

Starting point is 00:53:13 in a way that writing is just a super powerful tool to do. Definitely. Yeah, I remember being surprised early in my career. I had a manager, tech lead, who we would write these docs on either designs or the strategy, and he said, even if you just wrote it and you threw it away, it would still be worthwhile because you'll realize things as you're writing and that clarity will save you a lot of time down the road.

Starting point is 00:53:44 And it's interesting to me because a lot of engineers, they complain about writing docs, docs and all the stuff around the code. They kind of hate that. They say it's a sign of slow, big-company processes. And what would you say to an engineer like that, who's saying, just let me write the code? Yeah, and that's a great, you know, that's a great point. And I think it really depends on the level of problem you're trying to solve.

Starting point is 00:54:10 And so, you know, if I look at, I'm going to pick on UML for a minute here, right? Like it's a sort of semi-formal software design process. And not one that I've ever found useful because I think it just happens at the wrong semantic level. I think it's bothered with details at a level that aren't helpful, and I think a lot of the let's go off and document this has a similar problem, right? Like does this actually require that level of reflection and thinking? And so I think what, for me, separates a valuable doc writing and thinking process from a busy work process is understanding what you're getting out of it.

Starting point is 00:54:55 And what you're getting out of it might be an artifact to share with the future, which is super valuable. Either your future self, if you've got a terrible memory like me, or, you know, new teams, new people, you know, or I want to share something with customers

Starting point is 00:55:11 or I want to share something with the world. And so that's super valuable. Or I want to write down something so I can think through a really difficult, often one-way-door, kind of hard-to-change technical decision or API design decision. And I'm not going to do that every time I make a technical decision. It's not worth it because a lot of those technical decisions are either easy or not as

Starting point is 00:55:39 critical or can just be taken back if we figure out they're wrong. But I am going to spend my time that way when there are key decisions to make, when there are key insights to find. And I think, you know, and so it is that, like, what is the purpose of writing that separates well-spent time from poorly spent time? Now, there are people who still don't like writing even when it's well-spent time, even when it's like,

Starting point is 00:56:17 you know, you have to explain this piece of technology to, you know, to a future team. I think that's a skill worth developing. You know, sometimes you do need to, you know, eat your vegetables, you know, and it is a skill worth getting good at. And, you know, especially in documenting the core kind of technical decisions behind a design is so useful. And that's useful in two ways, by the way. Like, one of them is, as we think about building a big system, we make thousands of decisions. And some of those decisions are very carefully chosen, very particular, and very impactful. And some of those decisions are the best thing we could guess in the moment based on having no data to make that decision. And it's super useful for people

Starting point is 00:57:13 who are coming in to improve that system down the line to be able to look at the design and say, which of these things were very carefully chosen and thought through, and which of these things were arbitrary. And because the arbitrary things, like, okay, well, I'm going to change that, and I'm going to just go ahead and change that, because I have better data now. I've watched the system run. I can go and change those.

Starting point is 00:57:34 And these other ones are like, well, let me really engage with the reason that we made this decision. Maybe it was non-obvious. Maybe there was some more advanced thinking. And so being able to kind of understand the amount of thought that went into a decision is almost as important as understanding what that thought was. You had a really interesting blog post. This was from a while back. It's titled The Four Hobbies and Apparent Expertise.

Starting point is 00:58:01 And you introduced this really interesting idea. It's a two by two matrix. And on one side, there's doing versus discussing. And on the other side, there's the hobby and the gear. Maybe I can overlay it for people who want to see. And then later you kind of liken that to your career and how, I guess maybe we can imagine the hobby is actually coding. And maybe the gear is, let's just say it's like your dev setup or something like that. You talked about these two aspects of being in, depending on which quadrant you are,

Starting point is 00:58:34 which is there's this tradeoff between expertise and visibility where imagine you're really into coding and you're really into doing. You're going to be phenomenal in terms of expertise, but maybe not as visible because you're not talking with everyone about how cool your setup is and all of that. On the flip side, if you're really into the gear, maybe your setup in this case, and you're really into discussing, you're on all the messaging posts and that.

Starting point is 00:59:04 You might not actually be that good at coding, but you're very visible and you have this apparent competence. And I thought that trade-off was really interesting because I've seen that so much in software engineering too. There might be someone who's a really quiet coder. They never write anything, but they know everything because they've just been in the weeds all the time. And then there are people on the complete opposite end of the spectrum that are writing all the time, speaking all the time, but maybe not actually practicing as much. And my question to you is, how do you strike that balance? Because obviously too far in either direction is not optimal.

Starting point is 00:59:41 So how do you strike that balance? Yeah, that's something that I reflect on a lot. And I do explicitly think that sort of being 100% on either of those ends is a failure mode. And I think, you know, I will say that I have a lot more personal enjoyment working with the people that are 100% on the doing side and 0% on the talking side. I appreciate and deeply, you know, deeply love their expertise.

Starting point is 01:00:14 But I do think that, you know, they could have more impact and leverage if they, you know, swung a little bit away from that. You know, I tend to not enjoy as much interacting with the people who are 100% on the speaking side. But I, and I think they would, you know,

Starting point is 01:00:34 have a lot more relevant things to say, you know, if they, you know, swung a little bit back towards the center. The other challenge of being 100% on the doing side, it sort of gets back to that, how do you find the really important problems? And, you know, if you're heads down in your IDE all day, you could very likely be working on the wrong thing. You know, something that isn't as important, isn't as impactful, you know, doesn't have these properties that people want. So, you know, how do you find the optimal balance? I don't have a recipe for, you know, what really is optimal. I tend to do about, let's say, 75, 25 kind of practitioner versus, you know, teaching and communicating,

Starting point is 01:01:28 maybe 80, 20 at times. I've found that's about what feels right for me. I would say that there are great people I work with, you know, from sort of 90-10 on that scale, up to about 50-50 on that scale. I think, you know, outside of those, you know, folks tend to, you know, tend to get into trouble as practitioners, right? Like, you know, there are people whose job it is to be, you know, communicators. And that's great as long as they have the curiosity and are clear about what they, you know, know and don't know. But, you know, I found that sweet spot at that sort of 75-25 point in my career. And that's what's worked for me.

Starting point is 01:02:13 I think, and I think in this moment where things are changing so fast, there's so much to learn. You know, swinging a little bit more towards the practitioner side, I think generally will help people. But again, you don't want to go too far that way because then you lose the, you know, or what's important for you that comes with interacting with the outside world. On the doing versus discussing axes, I kind of view the doing one is if you were too far, you would be underrated. And if you were too far on the discussing, you would be overrated. And for someone who's structuring their career, would you say it's better to be overrated or underrated?

Starting point is 01:02:59 I think long term, you know, if you're using that terminology, it's probably better to be underrated. I think, you know, being overrated can feel great in the moment, but it's rarely sustainable and rarely sort of gets you to where you need to be. I really enjoy, you know, things like sports and, you know, these sorts of creative hobbies and crafts because it does, you know, turn that, let's say, perception and reality knob to very much reality, right? Like as a sports person, you can't fool the world for very long. It very quickly becomes, you know, very obvious, you know, who can and who can't. You know, I think as a craftsperson, the same, right? It very quickly becomes obvious who can and who can't.

Starting point is 01:04:01 And I think it takes a little bit longer in a field like ours where there is so much kind of qualitative stuff that goes on. But I think long term when I look at careers that I really admire and people I really admire, they tend to be people who are personally very honest about their level of knowledge and understanding and skill. So people who walk the walk,

Starting point is 01:04:26 not necessarily talk the talk. I see. Yeah, about engineers that you admire. I'd be curious because you have worked at AWS for such a long time and you have seen so many legendary engineers. Who at AWS do you look up to and why? Yeah, I mean, you know, just fantastic.

Starting point is 01:04:46 One of the blessings of working at a place like AWS is I get to work with so many great people. You know, maybe because he's retired, I'll talk a little bit about Al Vermeulen. So Al Vermeulen was one of the sort of early engineers at AWS and, say, a huge contributor to the design of S3, a really big contributor to the design of a lot of our database services over time. Al was actually the CTO of Amazon for a period of time when he realized, I think that wasn't the job he wanted to do. But I, you know, what I really

Starting point is 01:05:23 admired about Al from early in my career is, you know, very clearly he was somebody who deeply understood the things he was doing. And he could work in these two modes, right? Like, you know, I have a great memory of sort of 2010-ish, you know, arguing with Al about some of the edge cases in the Paxos paper. And, you know, he was super deep at that level, but could also get up to the really kind of executive level and talk about, you know, cloud strategy and the way

Starting point is 01:05:58 we should be explaining things to people and some of the, you know, sort of fundamental things that we need to be building. And I really admired that ability to work sort of almost at every level. And I was like, wow, you know, this is something I aspire to. And, you know, want to model my own career after. And so, you know, that's, I think that is, you know, the kind of person I've really, you know, really enjoyed working with is people who do have that, you know, do have that breadth. And I think, you know, one of the other things that is really admirable about a lot of these folks is, you know, they don't want to be celebrities, right? They want to do cool work for, you know, have an impact, do great stuff for customers, you know, optimize for having impact. You know, for people who want to continue their engineering education and really remain on top of

Starting point is 01:06:57 things, deeply understand the technology. Do you have any top technical book recommendations? You know, anybody who's building distributed system things, I highly recommend Martin Kleppmann's book. I think there's a second edition of that coming out, you know, soon. There's a new edition of the Quantitative Systems Design book, which I also think is great. Hennessy and Patterson's computer architecture book. This is a super useful one that covers a ton of ground. I read a ton of fiction and nonfiction and mostly papers when I'm reading technical things. I find engaging at that level more useful for me.

Starting point is 01:07:43 And by the way, that's become way more accessible now. One of the great ways to dive into a paper is, you know, hey, Claude, summarize this for me. And then I can dive into it and read, you know, the authors' words and I find that mode is great and it's super accessible for people who haven't been able to read papers in the past. But, you know, and then there's also a ton of insight in some really old stuff too. For example, you know, some of the algorithms that we used in Lambda to manage traffic and manage bursts of traffic come from Erlang's work like a hundred years ago on managing telephone call centers and his book about that.

Starting point is 01:08:37 And so, you know, folks also shouldn't think that, oh, well, the industry is changing super fast, and so I should only read recent things. Like there's incredible insights in some of the, you know, older work and in the foundations of computing and infrastructure and networking and computer science. There's, you know, again, maybe more leverage than ever before in, you know, deeply understanding those topics. And then last question for you is, if you could go back to your younger self when you just joined AWS and give yourself some advice, what would you say? I think maybe be a little bit bolder. I really love the team that I worked with and, you know,

Starting point is 01:09:21 especially in EC2 in the early days in EBS. And I think I was a little bit more hesitant than was optimal about leaving those teams and looking for the next thing. You know, as, you know, my own learning and impact kind of, you know, tapered off a little bit in those places. And so, you know, I think I've changed organizations kind of in a big way four times in my career and maybe five or six would have been optimal. Not a lot more, but some more. And so, you know, don't hesitate to think about, you know, what am I learning and who am I learning from? And is there a better environment to do that, you know, more quickly and to learn more things? And, you know, I'm highly, personally, highly motivated by being able to follow my curiosity. And every time I've done that

Starting point is 01:10:17 in my career, I've found that a valuable move and something that I've personally enjoyed. Awesome. Okay, well, thank you so much for your time. I really appreciate it, Marc. Thank you for sharing with the audience. This has been super fun. Thanks so much. Thank you for listening to the podcast. It's a passion project of mine that I really enjoyed building. Another passion project that I've been working on kind of in secret is building an ergonomic keyboard that I wish existed, and I finally have a prototype, so I'd love to show you what we've built.

Starting point is 01:10:53 It's ultra low profile and ergonomic, and I couldn't find anything like it on the market, so that's why we built it. I'll put a link to the keyboard in the description. You can take a look and learn more about the project there. We could definitely use your support. Also, if you have any feedback for me about the show, I'd love to hear it. Comments on YouTube have led to guests coming on like Ilya Grigorik and David Fowler.

Starting point is 01:11:15 I wasn't aware of them until someone dropped a comment. Also, feedback in the comments helped me learn to reduce the number of cliffhangers in the intros. So your comments definitely make a difference. Please keep letting me know what you'd like to see more of in the show, and I'll see you in the next episode.
