The Peterman Pod - AWS Distinguished Eng: Learning From 3000 Incidents And How Engineering Is Changing | Marc Brooker
Episode Date: April 13, 2026

In this episode, I talked to Marc Brooker, a distinguished engineer at AWS who started there as a new grad and rose through the ranks. We discussed technical learnings from 3,000+ cloud system postmortems, how software engineering is changing with AI, how to find impactful problems and much more.

🔶 My keyboard Kickstarter: https://www.kickstarter.com/projects/ryanlpeterman/compose-simple-ergonomics-beautifully-done

𝗣𝗼𝗱𝗰𝗮𝘀𝘁 𝗹𝗶𝗻𝗸𝘀:
• YouTube: https://youtu.be/u3GjIXP9N0s
• Spotify: https://open.spotify.com/episode/1qX2GfpbzxzGpGvDZVINdO?si=wsDGZo9PTbCNalKVybFVnA
• Apple: https://podcasts.apple.com/us/podcast/the-peterman-pod/id1777363835
• Transcript: https://www.developing.dev/p/aws-distinguished-eng-learnings-from

𝗘𝗽𝗶𝘀𝗼𝗱𝗲 𝗹𝗶𝗻𝗸𝘀:
• Post we discussed on hobbies and apparent expertise: https://brooker.co.za/blog/2023/04/20/hobbies.html
• Post on software engineering changing: https://brooker.co.za/blog/2026/02/07/you-are-here.html
• Post about Senior engineers and AI: https://brooker.co.za/blog/2026/03/20/ic-leadership.html
• Post on Junior engineers and AI: https://brooker.co.za/blog/2026/03/25/ic-junior.html

𝗧𝗶𝗺𝗲𝘀𝘁𝗮𝗺𝗽𝘀:
0:00 - Intro
1:27 - Finding problems that matter
11:42 - Learnings from 3000 postmortems
23:58 - Why caches are bad
29:37 - How AI will change software engineering
36:49 - Advice for junior engineers given AI
44:02 - Thoughts for senior engineers
49:59 - Why engineers should write
57:51 - Visibility and apparent expertise
1:04:23 - AWS engineers he admires
1:06:53 - Technical book recommendations
1:09:06 - Advice for his younger self
1:10:37 - Outro

𝗪𝗵𝗲𝗿𝗲 𝘁𝗼 𝗳𝗶𝗻𝗱 𝗠𝗮𝗿𝗰:
• LinkedIn: https://www.linkedin.com/in/marc-brooker-b431772b/
• Twitter/X: https://x.com/MarcJBrooker
• Personal Blog: https://brooker.co.za/blog/

𝗪𝗵𝗲𝗿𝗲 𝘁𝗼 𝗳𝗶𝗻𝗱 𝗥𝘆𝗮𝗻:
• Newsletter: https://www.developing.dev/
• X/Twitter: https://x.com/ryanlpeterman
• LinkedIn: https://www.linkedin.com/in/ryanlpeterman/
• Threads: https://www.threads.com/@ryanlpeterman
• Instagram: https://www.instagram.com/ryanlpeterman
• TikTok: https://www.tiktok.com/@ryanlpeterman
Transcript
If you aren't doing it hands-on, your opinion about it is very likely to be completely wrong.
This is Marc Brooker. He's a distinguished engineer at AWS, and I interviewed him for technical learnings from his career.
3,000 cloud system postmortems. I want to ask you what makes a good postmortem.
I could spend a lot of time talking about that.
You had a tweet that said that there are cases where caches are bad.
I prefer to see the teams around me avoiding caching where possible.
We also discussed how software engineering is changing.
What is important given that code is kind of flowing like water now?
The job changes and you do different work.
For someone who's structuring their career,
would you say it's better to be overrated or underrated?
Here's the full episode.
At some point, when I was a very junior engineer,
I looked at the more senior engineers and asked,
so what is the difference between you and me?
I'm working more hours than you.
I'm landing more code than you.
Why is it that you're so much more impactful than I am?
And then I realized that kind of the direction of your work,
like what is the thing that you're actually shipping, matters more than the volume of your work and your contributions.
What would be your advice on how do you find problems that matter?
Yeah, I think you have to go super broad.
So I think there's a set of those things that come in from customers, from the world.
Right, like here is an unsolved problem.
I spend a lot of time meeting with AWS customers and listening to them talk about,
you know, what are the things they still find difficult in our space?
What are they investing in?
Where are they spending their time?
Where would they prefer to be not spending their time and focus on their core business instead?
And so that's one rich seam of ideas and focus on what's interesting.
I think completely at the other level is looking at the technical trends.
You can look at just the kind of speeds and feeds: wow, networks have gotten faster, storage has gotten faster, we've seen this huge explosion in multi-core and now in GPUs.
So there's a bottom-up innovation trend there too, which you can also look at and say, well, this enables all of these new things.
And then broadly, kind of across the world, what are the big trends that are going on?
What are the things that are changing in our industry? What are the things that are changing in the world?
Really, it is those moments of change that bring with them the opportunity to build things and to recognize problems.
And so to pick one concretely: when I was working on the Lambda team in 2020, I was talking to a lot of customers, and they were super excited about building on serverless.
They were super excited about building on containers.
There had been this massive shift.
And what people were seeing then was, wow, I love these serverless products.
I love building this way.
But the world of data and especially relational data doesn't fit super well into this paradigm, right?
These relational databases are still very serverful, you know, fantastically powerful products,
but not kind of operationally the same.
And, you know, that felt super important
to me: wow, these customers have brought me a gift of understanding something
that's really important.
And so I joined the Aurora team.
We built Aurora Serverless and then we built Aurora DSQL.
You know, we've been investing deeply across all of our database products to make them
a better fit for these, you know, serverless and container workloads.
And that is an example of a trend that was brought by, you know,
brought by a customer.
But then also these trends that have been driven by kind of architecture or
by other things going on, right?
Faster networks, faster compute, faster connectivity.
So one of the big technical trends in the database world right now is
this sort of object storage becoming the default backend,
the default durability layer for databases of all kinds,
from analytics workloads to online workloads.
And there's been this incredible explosion around that.
And so if you look at what we did with Aurora DSQL, for example,
you know, that was very much learning from that trend
and taking a lead in that trend and saying,
well, we're going to make S3, this block store that we built, you know, 20 years ago,
sorry, object store that we built 20 years ago,
the underlying durability layer of this new database.
But obviously it doesn't have the latency properties,
or the rich interface that an online database needs.
And so we're going to build an architecture on top of that
that deals with all of these other things in a much better way,
but doesn't have to worry about durability.
And so that was this perfect collision of a set of things
I was hearing from customers and a set of things that were technical trends
coming together and thinking,
wow, we've got this opportunity to build something now
that is going to be a market-leading product,
one that would be hard to imagine without either of those input signals.
I saw something that you wrote.
You mentioned that you were on call for 15 years somewhere in there.
And I've heard many stories of more senior engineers negotiating out of on call because
per unit time, it could be perceived as not that impactful.
And so why did you stay on call for so long?
I would say that the majority
of my in-practice knowledge
about how to build distributed systems
has come from being on call
and analyzing and deeply understanding
these post-mortems and COEs.
You know, one of the challenges of running a company like AWS
and running large-scale systems
is that folks come out of college
with great, often great knowledge of computer science fundamentals,
great programming skills, you know, great mathematical skills, all of that stuff is fantastic.
But they come without the grounded knowledge of what it actually means to run and understand systems.
And, you know, on call is one of the best ways to learn those things,
one of the best ways to see, you know, how do systems really run, how do they really behave?
You know, how do customers really use them?
What happens when customers use systems in unexpected ways?
How can we make systems more resilient to customers using them in different ways?
And I think that should be almost a goal of on call, right?
If you have folks in your teams who are on call and they're just closing the same ticket over and over and over,
well, you know, that's where you need to just build some automation.
And again, building automation is easier than ever.
It's more powerful than ever.
Fantastic.
But where you really want to spend the time of the deeper experts on your team is
you know, here's something unexpected or unusual that's happened in the system.
Let's deeply understand that.
And let's bring that knowledge back to both improving that system
and communicating broadly to the company and the outside community,
what we've learned from that.
And so one of the most, you know, one of the most powerful things we do at AWS is we have
this mechanism of a very broad weekly meeting where we all get together, you know, engineers from
across AWS leaders, senior leaders from across AWS, and talk about COEs, these post-mortems
that we write, and what we can learn from them and how we can apply those lessons across the
whole company. And I think that particular mechanism, that particular kind of Wednesday morning
meeting that we have is one of the things that has been a core, almost causal factor behind
AWS's success because it has allowed us to and forced us to spend leadership bandwidth,
to spend expertise, to spend the time of our best engineers, deeply understanding how our
systems operate and why they operate the way they do.
And, you know, that level of being just extremely grounded in reality
helps you design better products, helps you architect better systems,
and it helps you think more clearly about the next round of things.
It helps you fix issues. And so it's this fundamental
kind of learning exercise. It's a real blessing.
So I would recommend on call to anybody who wants to learn
about the practice of distributed systems.
And I would certainly recommend spending time, reading COEs, reading post-mortems,
and deeply reflecting on not only what can we fix tactically,
but what can we fix organizationally and strategically,
and what kind of tools might need to exist to prevent this kind of thing happening again.
And you asked earlier about where do ideas come from.
And this is another fantastic kind of flow of ideas of saying, wow, you know, we seem to be solving this same problem over and over in different ways and getting it slightly wrong every time.
You know, can we extract a tool to do that?
Can we build a service around that?
Can we build a feature around that to make it easier for us to get right and easier for our customers to get right?
Yeah, it's interesting because I think if you
ask most engineers, they really avoid on call. But it sounds like you kind of go towards it
and you've learned a lot from it because it's a major source of customer problems.
Yeah. And again, you know, I think for me, it comes down to optimizing for finding the most
important things to work on. And, you know, if you aren't close to operating your actual system
and you don't know how it's actually working, how are you supposed to identify what to fix?
You can come up with some theories about those, but they're probably not going to be right.
And again, I don't think there's a huge amount of value in the rote ticket-closing work of on call.
I think automation should be doing that kind of work.
But I think there's fantastic value in deep understanding, deep investigations and deep reflection on what you learn from post-mortems and COEs.
I tried to estimate a couple of months ago for a talk how many industry postmortems
and Amazon COEs I'd read over my career.
The best estimate I could come up with, and this was about a year ago, was between 3,000 and 4,000.
And so, you know, even with a little bit of a lesson from each one, it tends to stick.
Yeah, that was my next question, actually.
I looked at the slides from that internal presentation, and it said,
I've read approximately 3,000 cloud system postmortems from across the industry.
And my immediate thought was, I wanted to ask you what makes a good postmortem.
So I think, you know, what makes a really great postmortem is first really getting into the details
and making sure that you deeply understand what happened rather than just assuming what happened
based on the biases you bring in.
And so there's a kind of lesson one there, which
is: if you can't understand what happened,
well, that teaches you something about your logging and metrics and observability
and, you know, simulations and all of these other things.
And then once you deeply understand what happened,
a great post-mortem steps through the whys behind that at multiple levels, right?
Like, why?
Well, yeah, there was a code bug.
Okay, sure.
Code bugs, yes, we can fix that.
But we can't stop there, right?
Like, why was that missed in testing and validation?
You know, for these reasons, you know, what can we improve?
What can we build around those?
Okay, next step, you know, why, you know, why was our testing and validation where it was?
Or, you know, why did we assume a certain thing about the behavior of the system that we wouldn't have assumed before?
And so as you sort of get through these deeper and deeper layers, a great post-mortem
not only identifies kind of fixes to the proximal cause, but also identifies broader fixes
to technology, to organizations, to, you know, products and so on. And so that's a kind
of multiple levels thing, right? You can't get stuck on, you know, what is the most proximal
cause of an incident, but you also can't get
stuck on this, well, you know, things fail sometimes and what are we going to do about it? And
you have to come up with a set of, you know, really concrete action items to fix things at different
levels. Fix this particular line in the software that caused something. You know, fix the testing
processes that didn't catch that. You know, fix the, you know, maybe social or team processes
that led to those technical processes.
And then if you're seeing patterns across multiple post-mortems,
sort of level those up and say,
well, clearly there's a hard underlying problem here.
Can we build a service around that?
Can we build a library around that?
Can we build a community of practice around that?
Are there technical changes we can make to avoid whole classes of things?
So that's quite a long-winded answer, but I do think it all flows from understanding, and understanding at multiple levels:
understanding immediately what happened, but also understanding, you know, broadly what happened, technologically and organizationally and in context.
And then the ability to connect that particular event or post-mortem with other ones and, you know, see
those patterns. One of the things that we did in DSQL was we spent a lot of time, as we were
designing that, looking at relational database-related post-mortems, both our
own and our customers', and thinking about, you know, how can we design a database that helps
people avoid falling into these traps? And, you know, a really common kind of outage pattern
for folks with relational databases is that a client in a
distributed system starts a transaction and then goes out to lunch for whatever reason.
And that could be a GC pause, or it could be a lossy network, or it could be a loss of connectivity,
and now it's holding locks. And so if you look at relational databases, they don't tend to be
resilient to clients misbehaving in that way. And that's a really common cause of operational
issues for systems built on relational databases. And so as we were designing DSQL, we were thinking
how do we avoid broadly that class of problems?
So folks can say, hey, I'm going to build on DSQL
and just not have this whole class of problems.
And I think that's a really kind of powerful outer loop
of the post-mortem process is to say,
how do we turn all of these lessons
into new services and into service improvements?
How do you prevent misbehaving clients
from being a problem for the database?
Yeah, so in DSQL's case, we have no pessimistic locking.
And so within the scope of a transaction, everything that happens in that transaction,
all of the reads, happen using this mechanism called multiversion concurrency control,
where for every row in the database we sort of store a history of versions.
And so you can read an old version of a row without blocking writers and saying,
hey, you can't update this because I just read it.
And then, you know, locally within the query processor that's handling a connection,
we spool the writes locally, and then you get to commit time and we do this optimistic check of,
you know, can I commit this transaction, at the transaction commit time.
And so combining those two mechanisms, multiversion concurrency control
and the scale-out storage that comes with it, and the commit-time optimistic checks,
we can strongly say that, you know,
there is no way that a reader of a piece of data
can block other writers,
and there's no way that a writer of data can block readers.
Writers can block writers,
but only by changing data, not just by looking at it.
And so you can, you know, you can say,
well, you know, I can cause,
sorry, writers can't block writers,
but they can prevent other writers' transactions
from eventually committing
by making a bunch of changes.
And that is inherent to the definition of the particular database isolation level.
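Editor's note: the two mechanisms described above (reads served from row version history, writes buffered locally and validated at commit) can be sketched in a few lines. This is a toy, single-process illustration, not DSQL's implementation. The names `Store`, `Transaction`, and `ConflictError` are hypothetical, and the write-write conflict rule is an assumption in the style of snapshot isolation, since the transcript does not name the isolation level.

```python
from dataclasses import dataclass, field


class ConflictError(Exception):
    """Raised when the commit-time check finds a conflicting write."""


@dataclass
class Store:
    # key -> list of (commit_ts, value) pairs, oldest first
    versions: dict = field(default_factory=dict)
    clock: int = 0  # logical commit counter (assumption: one global counter)

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def last_write_ts(self, key):
        """Commit timestamp of the newest version of key, or 0 if none."""
        history = self.versions.get(key)
        return history[-1][0] if history else 0


class Transaction:
    def __init__(self, store):
        self.store = store
        self.start_ts = store.clock  # snapshot is fixed when the transaction starts
        self.writes = {}             # writes are buffered locally until commit

    def get(self, key):
        # Reads never take locks: they see our own buffered write or the snapshot.
        if key in self.writes:
            return self.writes[key]
        return self.store.read(key, self.start_ts)

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Optimistic check: abort if any key we wrote was committed by another
        # transaction after our snapshot was taken.
        for key in self.writes:
            if self.store.last_write_ts(key) > self.start_ts:
                raise ConflictError(key)
        self.store.clock += 1
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((self.store.clock, value))
```

In this sketch a client that starts a transaction and then stalls holds nothing that other clients wait on. Other writers commit freely, readers keep seeing their snapshot, and the stalled transaction is the one that fails with `ConflictError` if it later tries to commit a key someone else changed.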
Out of curiosity, in practice, what percent overhead would you expect
for keeping copies of old rows for the sake of those stale reads?
Yeah, it's actually surprisingly small.
And it's surprisingly small because if you look at the access patterns for most online databases,
even ones that do a lot of write traffic,
that write traffic tends to be quite concentrated
and it's quite unusual for an online database workload
or even an analytics workload
to make a second version of every row in the database.
Typically what it's doing is making a first, second, third,
the hundredth version of this row and a 50th version of that row,
but the vast majority of data isn't changing.
And so it's super workload dependent,
as is everything in the database world.
But the overhead tends to be relatively small.
I would say it's unusual for an online database workload for that overhead on storage to be more than about 10%.
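Editor's note: a back-of-envelope calculation shows why concentrated writes keep version overhead small. The function name and every number below are hypothetical, chosen only for illustration. It assumes all row versions are the same size and that only a "hot" fraction of rows has old versions retained.

```python
def mvcc_storage_overhead(total_rows, hot_fraction, old_versions_per_hot_row):
    """Extra stored row versions as a fraction of live rows.

    Assumes equal-sized versions and that only hot rows retain old versions.
    """
    extra_versions = total_rows * hot_fraction * old_versions_per_hot_row
    return extra_versions / total_rows


# Hypothetical workload: 2% of rows are hot, each retaining 4 old versions.
concentrated = mvcc_storage_overhead(1_000_000, 0.02, 4)
# Contrast: every row retains one old version.
uniform = mvcc_storage_overhead(1_000_000, 1.0, 1)
print(f"concentrated: {concentrated:.0%}, uniform: {uniform:.0%}")
```

With these made-up inputs the concentrated workload costs about 8% extra storage, while rewriting every row once would double it, which is the contrast drawn in the answer above.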
From my experience, I've seen an interesting dichotomy between teams where some teams, they really understand post-mortem culture.
They tend to be infrastructure teams.
They tend to take it really seriously.
and everyone on those teams, the tech leads, are asking you,
hey, why did that happen, and they really follow up and make sure it's not a problem.
Then I've also noticed that on other teams that is less of a strong muscle.
For those teams that don't take it too seriously,
what would be your pitch for why they should take it seriously?
Yeah, it all comes down to where you want to spend your time, right?
Do you want to spend your time improving your product and making it better,
or do you want to spend your time fighting the same fire over and over?
And, you know, really, building great post-mortem cultures
is about making sure that at the product level and at the organizational level,
you are fixing known issues and you are avoiding having the same problems multiple times.
And typically when I see teams that have poor post-mortem culture,
I think there are probably one of two failure modes there.
One of them is a lack of focus on just the outcomes, right?
Like, you know, a lack of really,
I wouldn't say caring enough, I think that's a little bit too personal, but being really
focused on, you know, is this product performing super well? Are we, you know, are we really making
our customers happy? And that is fundamentally a cultural and leadership cultural problem of
setting the right standards. Oh, and by the way, like I don't think, you know, standards should be,
you know, should be uniform, right? Like there are places where, you know, the details really,
really matter, where things like durability are just critical, and you do need to have super
high standards in those places. And, you know, there are places where you want to optimize for other things and
maybe have a higher production defect rate. And I think that's okay,
as long as that's an intentional decision that's being made. So that's kind of case one, right? Like
insufficient focus on the outcome. I think case
two, and this is a harder one to change, is normalization of kind of operational heroics.
Like, we don't need to fix these root causes because our on-calls are superheroic and they're
going to stay up all night and they're going to, you know, they're going to hack around things,
and they don't mind being paged a hundred times a week. And they can feel from the inside like
it's a good culture, right? Like, oh, wow, these people are super strong owners. They're super
engaged, they really care, they're really working hard on call, and those are all good signals.
But then when you look at it from the outside, it's like, wow, we're not actually fixing
the causes of things. We're just doing this fantastically expensive investment of taking all of
these people and their strong ownership and their expertise and spending them just on this
break-fix cycle. And that's where you need to kind of look at it from the outside and say,
well, let's take this energy of this team, fantastic energy, and focus
on improving the service, getting out of the cycle, finding, you know, finding new things to fix,
finding new things to build. And that can be hard because it can be hard for, you know,
those folks who've been in that mode to look at it and say, this feels so good, it feels
really like we're caring about our customers and caring about our product and caring about
our business to realize that, oh, no, we're actually caring about it at the wrong
level and we're not serving our business in the best possible way by being so narrowly and
tactically focused on this break-fix cycle. And that's where you sort of need to pop them out and say,
well, let's spend more time thinking about the post-mortem. Let's spend more time thinking
about the causes of things. Let's spend more time addressing these things in a more strategic
way. And wow, okay, now you've got so much more time to do that because you've broken
the cycle and you can improve your product in different ways.
I mean, since you have worked at AWS for almost two decades,
I'm sure you have a lot of experience building distributed systems.
And I think one of the most common pieces of advice that you hear,
I guess this is maybe in the context of system design,
is, I hear it almost 100% of the time,
people will say, just throw a cache on it.
Or you'll have a system design and you say,
how do you make it better?
Let's put a cache here, let's put a cache there.
And I saw you had a tweet that said that there are cases where caches are bad despite
people saying it's best practice.
And I'm curious if you could explain that.
Yeah, so caching's good, right?
Like it's, hey, I'm going to take these core ideas from computer science of temporal and
spatial locality, and I am going to exploit those to make my system faster, scale better, etc.
And so, you know, obviously very attractive.
but the downside of caches, especially in distributed systems,
is they have this mode, right?
Like they have this, you know, there's a mode where the cache is full
and the cache is full of the right data in time and space
to perform very well.
And there's a mode where the cache is empty or contains the wrong data.
And in the first mode, the system is fast and happy and healthy.
In the second mode, the system is slow,
often down because now the back end isn't scaled to deal with all of this uncached traffic, and
customers are very disappointed.
And often it is down in a stable way.
And this is this kind of idea of metastable failures where the system has switched from state
one to state two, and in state two it's still stable, right?
Like it's still, it's down, but it's not going to come back up under its own energy
because, for example, all of this traffic is causing a huge amount of contention in my database
or is saturating the network.
And so I can't even refill the cache.
It's not even getting the right kind of data in.
And so, you know, when I talk about the downsides of caches, it's really about, you know,
how do we avoid that modality between, you know, fast, which is the value of caches,
and down? How do we avoid the state where we're down?
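Editor's note: the two modes come down to simple arithmetic on how much traffic reaches the backend. The sketch below uses hypothetical function names and made-up numbers. It models only the load difference between a warm and a cold cache, not the contention or network saturation that keeps a system stuck in the down state.

```python
def backend_load(request_rate, hit_rate):
    """Requests per second that miss the cache and reach the backend."""
    return request_rate * (1.0 - hit_rate)


def is_overloaded(request_rate, hit_rate, backend_capacity):
    """True when cache misses alone exceed what the backend can serve."""
    return backend_load(request_rate, hit_rate) > backend_capacity


def min_safe_hit_rate(request_rate, backend_capacity):
    """Lowest cache hit rate at which the backend still keeps up."""
    return max(0.0, 1.0 - backend_capacity / request_rate)


# Hypothetical system: 10,000 requests/s offered, backend sized for 1,000/s.
warm_mode_ok = not is_overloaded(10_000, 0.95, 1_000)  # warm cache, ~500/s reach the backend
cold_mode_down = is_overloaded(10_000, 0.0, 1_000)     # empty cache, all 10,000/s reach it
```

In this made-up system the backend is sized for a tenth of the offered load, so the service only works while the hit rate stays at or above roughly 90%. An empty or wrong cache puts it in the second mode immediately.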
And so if I go back to DSQL, our answer there is that
what we call the storage tier is essentially a cache,
but it is a complete cache.
It contains every row in the database.
And so it doesn't have this mode where how do I recover from it being empty
or containing the wrong data.
It contains all of the data.
Similarly, if you look at a more
classical relational database design like Aurora,
the Aurora leader is constantly telling the potential failover targets,
here's something you should cache,
here's something you should cache, here's something you should cache.
So when a failover happens, the cache is warm on the failover target.
And so those are the kinds of things that you can do
to avoid those modalities.
But in general, you know, and I wouldn't extract this as a rule or say that, you know, this applies 100% of the time.
But in general, I prefer to see the teams around me avoiding caching where possible.
I prefer patterns where you have a complete materialized view of the data if you need very fast access to it.
Especially if it's slow moving, just pull it down onto your local machine and work with it in memory.
If it's only being updated once a week, who cares?
Like, just make lots of copies of it.
So that's one pattern.
Or, you know, use a scalable backend, you know, DSQL or DynamoDB or whatever your
favorite scalable database is, and keep your database vendor honest about getting to the scale
and performance you need rather than putting a cache in front of things.
So caching isn't a bad pattern, but it is a pattern with some significant downsides
that are, you know, really best avoided.
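Editor's note: the first pattern, a complete local copy of slow-moving data, might look like the sketch below. The class `FullLocalCopy` and the `load_all` callable are hypothetical. The assumption is that the whole dataset fits in memory and can be fetched in one call.

```python
class FullLocalCopy:
    """Holds a complete in-memory copy of a slow-moving dataset.

    Unlike a cache there is no miss path: a lookup never falls through to the
    backend, so an empty or cold state cannot send extra load upstream.
    """

    def __init__(self, load_all):
        # load_all is a hypothetical callable returning the full dataset as a dict.
        self._load_all = load_all
        self._data = dict(load_all())

    def get(self, key):
        # A key absent here was absent from the dataset as of the last refresh.
        return self._data.get(key)

    def refresh(self):
        # Build the new copy first, then swap the reference, so readers never
        # observe a partially loaded dataset.
        fresh = dict(self._load_all())
        self._data = fresh
```

A refresh could run on a timer, say daily for data that changes weekly. Because each refresh is one bulk read per copy, backend load depends on the number of copies and the refresh interval rather than on request traffic.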
In practice, how often do you see that metastable failure, though?
Yeah, you know, this is, it's not super common, right?
Like you might go years without seeing, you know, something like that.
But if you look across the biggest, most impactful, you know,
system post-mortems across the industry,
I would say that these kinds of metastable failures have been an underlying
cause in probably a majority of them. And it's super important that, you know, as an industry and as a
community of practice, we understand those things deeply because also those cases where these do
happen, you know, tend to be larger scale issues, longer recovery time issues and, and more complex
to fix issues, right? Where you have to often, you know, turn it off and turn it back on again,
which is this very, very painful thing for a team or an organization to do.
And, you know, and so again, like you might go years
operating a system without seeing anything like this.
But if you look at the most impactful issues,
it's actually fairly common as an underlying cause for those issues.
And so, you know, it's kind of both of these things
of being quite uncommon and being rather common.
I was reading your blog and you have a series of posts
on how AI may impact the future of software engineering.
And I kind of want to pick your brain on that.
So what's your perspective on how you think AI will impact software engineering
and how it'll change things?
Yeah, I mean, it's maybe harder than ever to tell the future.
And so, you know, this is a set of maybe guesses and predictions about the future.
So I'll say the first thing I, you know, I deeply believe about software
is we have only just started to see the impact that software is going to have on the world.
There is such an opportunity for more software to exist, bigger software, better software, more personal software,
all of these things.
And so software has, throughout its 60-ish-year history, been supply constrained.
And, you know, I think that's going to remain true.
I think the opportunity for software in the world is just almost unbounded.
And that's really exciting, right?
It's really exciting to be at a moment when the economics of building software are changing
and are changing rather quickly.
And that gives us an opportunity to think about what could we do in the world with a lot more software.
You know, a lot more software personalization, a lot more,
or just the right software in the right place at the right time.
And that gives me a huge amount of excitement about the future of this industry
because we have a massive opportunity ahead of us,
driven by these changing economics of software development.
Now, also with those changes, there are going to be needs for us as
software practitioners, people who build software, people who love software, to adapt.
And, you know, that means that software careers are going to look different.
They're going to look different early on.
They're going to look different later on.
I think the software business is going to look different.
And the success of people in organizations over the next, who knows, five years or
decade is going to be largely predicated on their ability to adapt to that change and to lead
that change.
You told this story about this guy who bet on analog circuits when obviously we know
digital became kind of the more dominant way. Yet he made good money. For the people
who maybe don't want to adapt, you could still get by and succeed. It's not going to be like a
crazy thing. Is that kind of the takeaway and why you brought up that story?
Yeah, I think
that's the right takeaway. And so if I sort of break down, you know, the world into three
tiers, you know, I think there's going to remain a huge amount of joy in the craft of software,
you know, like the craft of joinery with, you know, with handsaws, right? Like, it's
a nice way to spend time. It's not a particularly economically interesting
activity anymore. But not everything we do has to be an economically interesting opportunity.
It can just be something I do because I enjoy it, because I enjoy the product of it, because I enjoy
talking to people about it. Right. And so there's, you know, I don't think that is going to go away.
I think we're going to see, you know, a lot of interest in that. Like there's been interest in retrocomputing
and, you know, people who run an Apple II as their desktop. And like, well, again, it's wildly impractical.
It's not economically interesting, but it's fun and something I, you know,
could do as a hobby.
And so, you know, that's going to be a remaining part of the world of software for probably
forever.
And then there's this, you know, kind of story that I told in the blog post.
And I think this relates to, you know, driving change in the real world is always harder
than it looks from the outside, right?
Like as you get into the details, things become more difficult.
They become more dependent on people.
They become more dependent on politics and policy
and our various irrationalities as humans.
And so driven by that, you know,
there is going to be a huge amount of,
and a shrinking over time amount,
but a huge amount of the software industry
that is run in what I might call the old way, right?
Past techniques, past languages, past technologies,
and there's real economic opportunity in engaging with that part of the, you know, part of the world.
You know, as we saw with analog electronics, analog electronics still very much exists.
In fact, there are parts of the world like, you know, like radio and power systems where there's been
incredible technological advancement in those fields.
But they have become more niche.
And so, you know, digital became the mainstream.
We wouldn't be talking like we are today
if it wasn't for this, you know, 12 orders of magnitude or whatever explosion in digital transistor counts.
But there's interesting opportunity there, and I think that interesting opportunity is going to change shape and become more and more specialized and more niche, and there are great careers to be built there.
And then there is the mainstream, which I think is going to adopt these new technologies from agentic development to AI-powered development,
to, you know, specification-driven development
and, you know, a whole lot of other, you know,
new things whose names we don't even know yet,
to build software at a speed and a cost that is unimaginable
to do with the old techniques.
And I think that is where correctly the majority of the industry is going to be going.
I think that's where the majority of careers are going to be built.
I think that's where the majority of economic opportunity is.
It's the space I'd be in
if I was building a company today. It's the space I'm in in my role,
and the one I would sort of personally be most excited about.
But yeah, it isn't the only one.
I think there's going to be the spectrum of software practice.
And especially where software engages with the physical world,
there are going to be some really interesting questions about
how do we bring these new technologies, how do we bring these new practices into the various many
niches that software is going to and has, you know, over six decades, kind of wormed its way into.
It's interesting. You mentioned joinery. I wonder if down the road we will see apps on the app
store that people pay extra for because they're marketed as "this was written by a human" or "it was written by hand."
It's a bespoke custom app. Crazy how the world is going to change.
So it sounds like change is obviously the common case.
It's the one that we should be thinking about.
Maybe we can break up the conversation in two parts.
One is for junior engineers, what is important given that code is kind of flowing like water now?
At risk of being a bit meta about our past conversation, it really is about finding those
problems that matter and doing that early in a career,
and, you know, that requires an understanding of customers.
It requires an understanding of the business.
It requires an understanding of economics and of systems.
And I think that's going to move from being, you know,
almost kind of senior engineer work of like, oh, well, you know,
now you're going to go and talk to customers and actually understand the context of the stuff
you're building, to being more and more
part of even the earliest steps of an engineering career, right? Like here's the context, here's the
problem, here's the customer, let's go off and work together and solve, you know, and solve this
problem with all of this context. And I think that's going to be super exciting for one set of folks
and a little bit frustrating for people who have come in, you know,
looking for a pure software development career, right?
Looking for a career where they sit down, open their IDE, start typing,
and don't stop for eight hours.
I think that's going to be a mode that we're going to see fewer people in
and a mode that's going to be harder and harder to build a career around.
Now, the other mode of, oh, I'm excited to go off and learn from my customers about what
they're building and what they need, I think that's going to be ever more, you know,
highly valuable. And so super exciting opportunity to build, you know, build careers there.
And then maybe, and this might come across as being a little bit, you know, paradoxical.
I think there's also a ton of opportunity for, you know, folks who are extremely technically
deep, you know, who are, you know, deep on optimization problems or deep on infrastructure
problems or deep on, you know, various scientific things, or deep on databases,
or deep on, you know, one of the many, many topics that are behind our industry, because I think
the ability to ask the right questions is also much more valuable than it has ever been.
And so I think there is a ton of opportunity for people coming into the industry with deep
technical or scientific knowledge to now leverage that in ways that, you know, maybe were
hard before, right? There was too much sort of boilerplate to really, you know, to really
use that leverage that you have. And so I think we're going to see a lot more of those kinds of
careers of really kind of building expertise in a technical topic, in a scientific topic,
and then being able to turn that into software and software products in a way that was really
difficult before and in some cases wasn't possible before and is now, you know, vastly easier.
If I was to look at a career ladder's expectation, some of what you described of maybe
engaging with the customers and understanding the business context, uniquely in software
engineering, it feels like the earliest levels are insulated from all of that. You have your
tech lead, the tech lead's handing out tasks, and then the early-level engineer's just given tasks
and just converts them into code.
And it sounds like, you know, that part's relatively solved.
If not now, I'd be surprised if a year or two from now it wasn't, like, completely solved.
And I think that could scare a lot of junior engineers because they would think,
you're going to expect me to graduate from college or start working as a software engineer,
and then I would have the senior engineer expectations.
What would you say to the scared software engineer
that's just entering the industry with all this change?
Yeah, you know, I think, well, I would remind them that, you know,
we as people who hire and build organizations of software engineers
and they as people who are building software engineering careers
have really aligned incentives, right?
Like, you know, it's not valuable to
hire a bunch of people and set them up to fail.
Like, nobody wants that.
It's not an outcome that is good for anybody.
And so, yeah, we're going to need to figure out how do you support people on, on that path?
How do you help people learn those things?
How do you give them the right guardrails, you know, hey, that first time that you go out
and talk to a customer, yeah, it's going to be scary.
My first time talking to an AWS customer was, you know,
super scary. But, you know, I got a bunch of help with that and I got a bunch of advice and I got
a bunch of mentorship and I got a bunch of feedback and I got better and better at that over time. And I
think that's exactly what these things look like: you know, you start off and you start small
and you learn, you know, as you go. And so that feedback loop goes faster. And so I don't expect
that people coming in from college will come in with all of this knowledge. I think,
you know, it's never been true that people coming into technical or engineering careers straight out of college know everything.
Or any career, for that matter, right?
Like you talk to teachers about, you know, what they've learned on their job versus what they learned, you know, studying.
You know, they learn a huge amount in things like internships and so on and over the course of a career. Or doctors, or anybody in, you know, a field like that.
And so, yeah, it is going to be about learning.
I think the emphasis on what people learn is going to be different.
I think it is going to require, you know, leaders like me who, you know, care deeply about, you know, hiring and developing folks early in their career to be really thoughtful about what, you know, what does that new ladder look like.
And, you know, we're doing a lot of that thinking.
I think people are doing that kind of thinking across the industry.
And, yeah, it's changing fast.
It's uncertain.
It's an interesting time to be graduating.
But again, like, it's a super exciting time.
I think just the scale of the opportunity is bigger than it's ever been.
Sounds like your advice for senior engineers is different from that of junior engineers.
What is your thinking there?
Yeah.
I mean, I think, you know, I think for folks there, the challenge is how do you, you know,
how do you retain the value of this incredible experience and knowledge that you've gained over a career
while, you know, not falling behind,
while learning how to, you know, best use the tools?
And, you know, when I look at senior folks,
this is a challenge, you know, ahead of them.
I think a lot of people have found themselves in influence
and leadership type positions where they aren't hands-on building, you know,
every day.
And I think it's going to be harder and harder to be in that kind of role
and be able to influence and advise in a relevant way,
in a positive way.
And so really, I think my advice for folks is you kind of got to get building.
Like you've got to get back into it.
You need to deeply understand how the practice of building software
and the practice of designing software has changed
and is continuing to change.
And so the challenge is how do I really take advantage of all of this knowledge and expertise that I've built up in my career and be super curious and be super hands-on and really be in the details.
And the good news for that, well, I think there's two bits of good news.
One of them is because of these new tools, you know, time spent as a practitioner is so much more leveraged than it was.
You can build such cool stuff, you know, during that period of time,
the amount of kind of wasted time and boilerplate and so on is so much smaller.
And so you really do have this opportunity.
And the other one is, again, like, why did, you know, why did we get into the space?
Well, I didn't get into it so I could go to meetings and sound smart.
I got into it because I love learning and because I love building technology and because I love, you know,
solving my customers' problems and because I love, you know, learning about new, you know,
new technologies and learning new things. And there's more opportunity to do that than ever before.
Again, you know, because of this new set of tools and the leverage that comes with them.
And so it really is getting back to, you know, why are you here? Why did you get into this career?
And I think it really gets us as technology-focused people closer to our original answer to that.
It's really obvious to me right now when I speak to, you know, practitioners, you know, who is and who isn't using a, you know, modern set of agentic-powered developer practices, right? And the people who are have these really interesting things to say about the strengths and weaknesses of those approaches and the work that still needs to be done and the integrations that still need to be done and the things that are working and aren't. And the people who aren't, you know, using
them hands on have such a poor mental model of how they work, what they're good at, what they're
not good at, that the things they say about them tend to be essentially fiction. And so, you know,
I think we are in this moment where if you aren't doing it hands on, your opinion about it is
very likely to be completely wrong. And that takes a level of humility to
admit that, you know, is tough. You know, it's tough for folks with fancy titles and it's tough for
folks with distinguished careers. But I think it's a must.
I feel like there's a common sentiment
among software engineers when they work with someone who is a, you know, quote
unquote tech lead, but they're not really hands on. So they've kind of been in the docs for
the last five years or so. And there's these minor things where they can tell that this person
doesn't actually understand the underlying thing.
And it sounds like that gap will widen with these new tools,
which is if you're looking at things from 1,000 feet up
and you're not actually using the tools,
that's just another thing that separates you
from the people who are actually building
where you'll be very out of touch.
I think that's always been true.
I think it is wider than, you know, ever before.
But when I
look at the, you know, engineering leaders that I've really respected and learned a huge amount
from over my career. You know, for example, some of the folks who built S3, you know, 20 years ago,
that was such a successful product because those folks were so deep in the details and so grounded
on the use cases and so deep in the economics and really just did, you know, really thought about
both the kind of strategic world of like how is this cloud thing going to change the way people
want to interact with storage, but also the minute-to-minute details of what's fast now, what's
slow, what's good, what's bad. And I think, you know, when you think about an extremely
enduring product like S3 or EC2, I think it's been that groundedness in the details
from early on, from all levels of leadership, that has made those things so successful,
where other products seemingly with the same amount of early promise didn't turn out to be as
successful.
I think one of the last topics that I wanted to ask you about was writing. You have
a ton of awesome posts on your blog. The style of writing is incredibly clear.
And I was curious, why do you write so much as an engineer?
Writing and speaking, but especially writing, have this incredible power.
And, you know, for technical folks, it's this incredible multiplier in being able to take these ideas that are in your head and share them with the world.
And, you know, you can take a set of technical ideas in your head and share them with the world by
building a great product, and that's a fantastic thing to do. You can share them with the world kind of
one-on-one, you know, mentorship, teach people, learn, small groups, also a great way to spend time.
But the multiplication factor of doing a talk or even more of writing something is so much
higher, right? Like there are so many more people that you can share that with and it lasts for a
much longer period of time.
And so just having something written on my blog, even that I wrote like a decade ago that
I can share with someone and say, you know, here's how to think about this problem,
here's an insight that I wanted to share with you, or have people discover that organically
is just super powerful.
And so writing lets you scale out the impact of your expertise in space and time in a way that's
really hard to do in other media.
I think with video and with podcasts and so on,
you know, we've seen other ways to do that.
But I think writing remains kind of uniquely powerful.
And then there's also this idea,
which is this kind of core belief culturally at Amazon.
And I've obviously been affected by this over the years,
that, you know, writing forces a level of mental clarity
that speaking, making slide decks, et cetera, doesn't.
And, you know, that has also really been my experience: sitting down
to write something down forces me to think it through at a depth that I wouldn't have
been forced to think it through without that.
And so I saw one of your early conversations was with Leslie Lamport who kind of takes
that a step further and says, hey, you know, it's formal mathematics that is the next step
there, and I love that point,
but I think writing is this really accessible thing for people to do that does force a level of thinking.
And so I do a lot of writing sometimes just for myself, right?
Like I'll write a doc, not ever intending to share it with anybody,
but just to sharpen my own thinking on a particular point.
And so it's some of that combination of three things, right?
Like, I just have something to say and I want to say it.
You know, I have something to say and I want to scale
it out in time and space.
And I want to sharpen my own thinking on a subject
or the thinking of a small group on a subject
in a way that writing is just a super powerful tool to do.
Definitely.
Yeah, I remember being surprised early in my career.
I had a manager, tech lead, who we would write these docs
on either designs or the strategy,
and he said, even if you just wrote it and you threw it away,
it would still be worthwhile because you'll realize things as you're writing
and that clarity will save you a lot of time down the road.
And it's interesting to me because a lot of engineers,
they complain about writing docs and all the stuff around the code.
They kind of hate that.
They say it's a sign of slow, big company processes.
And what would you say to an engineer like that
who's saying, just let me write the code?
Yeah, and that's a great, you know, that's a great point.
And I think it really depends on the level of problem you're trying to solve.
And so, you know, if I look at, I'm going to pick on UML for a minute here, right?
Like it's a sort of semi-formal software design process.
And not one that I've ever found useful because I think it just happens at the wrong semantic level.
I think it's bothered with details at a level that isn't helpful,
and I think a lot of the "let's go off and document this" has a similar problem, right?
Like does this actually require that level of reflection and thinking?
And so I think what, you know, for me, separates a valuable doc writing and thinking process
from a busy work process is understanding what you're getting out of it.
And what you're getting out of it might be
an artifact to share with the future,
which is super valuable.
Either your future self,
if you've got a terrible memory like me,
or, you know, new teams, new people,
or I want to share something with customers
or I want to share something with the world.
And so that's super valuable.
Or I want to write down something
so I can think through a really difficult,
often one-way door,
kind of hard to change technical decision or API design decision.
And I'm not going to do that every time I make a technical decision.
It's not worth it because a lot of those technical decisions are either easy or not as
critical or can just be taken back if we figure out they're wrong.
But I am going to spend my time that way when there are key decisions to make,
when there are key insights to find.
And so it is that, like, "what is the purpose of the writing"
that separates well-spent time from poorly spent time.
Now, there are people who still don't like writing
even when it's well-spent time, even when it's like,
you know, you have to explain this piece of
technology to, you know, a future team. I think that's a skill worth developing. You know,
sometimes you do need to, you know, eat your vegetables, and it is a skill worth getting
good at. And, you know, especially documenting the core kind of technical decisions behind a design
is so useful. And that's useful in two ways, by the way. Like, one of them is, as we think about
building a big system, we make thousands of decisions. And some of those decisions are very carefully
chosen, very particular, and very impactful. And some of those decisions are the best thing we could
guess in the moment based on having no data to make that decision. And it's super useful for people
who are coming in to improve that system down the line to be able to look at the design and say,
which of these things were very carefully chosen and thought through,
and which of these things were arbitrary.
And because the arbitrary things, like, okay, well, I'm going to change that,
and I'm going to just go ahead and change that,
because I have better data now.
I've watched the system run.
I can go and change those.
And these other ones are like, well, let me really engage with the reason that we made this decision.
Maybe it was non-obvious.
Maybe there was some more advanced thinking.
And so being able to kind of understand the amount of thought that went into a decision
is almost as important as understanding what that thought was.
You had a really interesting blog post.
This was from a while back.
It's titled The Four Hobbies and Apparent Expertise.
And you introduced this really interesting idea.
It's a two by two matrix.
And on one side, there's doing versus discussing.
And on the other side, there's the hobby and the gear.
Maybe I can overlay it for people who want to see.
And then later you kind of liken that to your career and how, I guess maybe we can imagine the hobby is actually coding.
And maybe the gear is, let's just say it's like your dev setup or something like that.
You talked about these two aspects of being in, depending on which quadrant you are,
which is there's this tradeoff between expertise and visibility where, imagine you're really into coding and you're really into doing,
you're going to be phenomenal in terms of expertise,
but maybe not as visible because you're not talking with everyone
about how cool your setup is and all of that.
On the flip side, if you're really into the gear,
maybe your setup in this case,
and you're really into discussing,
you're on all the messaging posts and that.
You might not actually be that good at coding,
but you're very visible and you have this apparent competence.
And I thought that trade-off was really interesting because I've seen that so much in software engineering too.
There might be someone who's a really quiet coder.
They never write anything, but they know everything because they've just been in the weeds all the time.
And then there are people on the complete opposite end of the spectrum that are writing all the time, speaking all the time, but maybe not actually practicing as much.
And my question to you is, how do you strike that balance?
because obviously too far in either direction is not optimal.
So how do you strike that balance?
Yeah, that's something that I reflect on a lot.
And I do explicitly think that sort of being 100% on either of those ends is a failure mode.
And I think, you know, I will say that I have a lot more personal enjoyment working with the people that are 100% on the doing side
and 0% on the talking side.
I appreciate and, you know,
deeply love their expertise.
But I do think that, you know,
they could have more impact and leverage
if they, you know,
swung a little bit away from that.
You know, I tend to not enjoy as much interacting
with the people who are 100% on the speaking side.
And I think they would, you know,
have a lot more
relevant things to say, you know, if they, you know, swung a little bit back towards the center.
The other challenge of being 100% on the doing side sort of gets back to that, how do you find the
really important problems? And, you know, if you're heads down in your IDE all day, you could
very likely be working on the wrong thing. You know, something that isn't as important, isn't as
impactful, you know, doesn't have these properties that people want. So, you know, how do you find
the optimal balance? I don't have a recipe for, you know, what really is optimal. I tend to
do about, let's say, 75-25 kind of practitioner versus, you know, teaching and communicating,
maybe 80-20 at times. I've found that's about what feels right for me.
I would say that there are great people I work with, you know, from sort of 90-10 on that scale,
up to about 50-50 on that scale.
I think, you know, outside of those, you know, folks tend to, you know, tend to get into trouble as practitioners, right?
Like, you know, there are people whose job it is to be, you know, communicators.
And that's great as long as they have the curiosity and are clear about what they, you know, know and don't know.
But, you know, I found that sweet spot at that sort of 75-25 point in my career.
And that's what's worked for me.
And I think in this moment where things are changing so fast, there's so much to learn.
You know, swinging a little bit more towards the practitioner side, I think generally will help people.
But again, you don't want to go too far that way because then you lose the, you know,
sense of what's important that comes with interacting with the outside world.
On the doing versus discussing axis, I kind of view the doing one as, if you were too far,
you would be underrated.
And if you were too far on the discussing, you would be overrated.
And for someone who's structuring their career, would you say it's better to be overrated or underrated?
I think long term, you know, if you're using that terminology, it's probably better to be underrated.
I think, you know, being overrated can feel great in the moment, but it's rarely sustainable and rarely sort of gets you to where you need to be.
I really enjoy, you know, things like sports and, you know, these sorts
of creative hobbies and crafts because it does, you know, turn that, let's say, perception
and reality knob to very much reality, right? Like as a sports person, you can't, you can't
fool the world for very long. It very quickly becomes, you know, very obvious, you know,
who can and who can't. You know, I think as a craftsperson, the same, right? It very quickly becomes
obvious who can and who can't.
And I think it takes a little bit longer
in a field like ours where there is
so much kind of qualitative stuff that goes on.
But I think long term when I look at careers
that I really admire and people I really admire,
they tend to be people who are personally very honest
about their level of knowledge and understanding and skill.
So people who walk the walk,
not necessarily talk the talk.
I see.
Yeah, about engineers that you admire.
I'd be curious because you have worked at AWS
for such a long time
and you have seen so many legendary engineers.
Who at AWS do you look up to and why?
Yeah, I mean, you know, just fantastic.
One of the blessings of working at a place like AWS
is I get to work with so many great people.
You know, maybe because he's retired,
I'll talk a little bit about Al Vermeulen.
Al was one of the sort of early engineers
at AWS and an original, say, huge contributor to the design of S3, a really big contributor to the design of a
lot of our database services over time. Al was actually the CTO of Amazon for a period of time
when he realized, I think, that wasn't the job he wanted to do. But, you know, what I really
admired about Al from early in my career is,
you know, very clearly he was somebody who deeply understood the things he was doing.
And he could work in these two modes, right?
Like, you know, I have a great memory of sort of 2010-ish, you know,
arguing with Al about some of the edge cases in the Paxos paper.
And, you know, he was super deep at that level,
but could also get up to the really kind of executive level
and talk about, you know, cloud strategy and the way
we should be explaining things to people and some of the, you know, sort of fundamental things that we need to be building.
And I really admired that ability to work sort of almost at every level. And I was like, wow, you know, this is something I aspire to.
And, you know, want to model my own career after. And so, you know, I think that is, you know, the kind of person I've really, you know, really enjoyed
working with: people who do have that, you know, do have that breadth. And I think, you know,
one of the other things that is really admirable about a lot of these folks is, you know,
they don't want to be celebrities, right? They want to do cool work for, you know, have an impact,
do great stuff for customers, you know, optimize for having impact.
You know, for people who want to continue their engineering education and really remain on top of
things and deeply understand the technology, do you have any top technical book recommendations?
You know, anybody who's building distributed system things, I highly recommend Martin Kleppmann's
book. I think there's a second edition of that coming out, you know, soon. There's a new
edition of the Quantitative Systems Design book, which I also think is great. Hennessy and Patterson's
computer architecture book. This is a super
useful one that covers a ton of ground.
I read a ton of fiction and nonfiction and mostly papers when I'm reading technical things.
I find engaging at that level more useful for me.
And by the way, that's become way more accessible now.
One of the great ways to dive into a paper is, you know, hey, Claude, summarize this for me.
And then I can dive into it and read, you know, the authors'
words, and I find that mode is great and it's super accessible for people who haven't been able
to read papers in the past. But, you know, then there's also a ton of insight in some really
old stuff too. For example, you know, some of the algorithms that we used in Lambda to manage traffic
and manage bursts of traffic come from Erlang's work like a hundred years ago on managing telephone call centers
and his book about that.
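As one illustration of the kind of result that came out of Erlang's telephone-traffic work (the conversation doesn't say which specific results Lambda uses, so this is only an assumed example), the Erlang B formula gives the probability that a request arriving at a pool of n servers finds all of them busy. A minimal Python sketch, where the function names and the example numbers are hypothetical:

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Erlang B blocking probability.

    offered_load: average demand in erlangs (arrival rate * mean service time).
    servers: number of servers (or telephone lines) in the pool.
    """
    if offered_load < 0 or servers < 0:
        raise ValueError("offered_load and servers must be non-negative")
    blocking = 1.0  # with zero servers, every request is blocked
    # Standard recurrence: B(n) = A * B(n-1) / (n + A * B(n-1))
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def min_servers(offered_load: float, target_blocking: float) -> int:
    """Smallest pool size whose blocking probability is at or below the target."""
    if not 0 < target_blocking <= 1:
        raise ValueError("target_blocking must be in (0, 1]")
    servers = 0
    blocking = 1.0
    while blocking > target_blocking:
        servers += 1
        blocking = offered_load * blocking / (servers + offered_load * blocking)
    return servers


if __name__ == "__main__":
    # Hypothetical numbers: 1 erlang of offered load on a pool of 2 servers.
    print(erlang_b(1.0, 2))
    # How many servers keep blocking at or below 25% for the same load?
    print(min_servers(1.0, 0.25))
```

With 1 erlang of load and 2 servers the formula works out to a blocking probability of 0.2, so roughly one request in five would be turned away. The recurrence is used instead of the factorial form because it stays numerically stable for large pools.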
And so, you know, folks also shouldn't think that, oh, well, the industry is changing super fast,
and so I should only read recent things.
Like there are incredible insights in some of the, you know, older work and in the foundations of
computing and infrastructure and networking and computer science, and there's,
you know, again, maybe more leverage than ever before in, you know, deeply
understanding those topics.
And then last question for you is if you could go back to your
younger self when you just joined AWS and give yourself some advice, what would you say?
I think maybe be a little bit bolder. I really loved the team that I worked with and, you know,
especially in EC2 in the early days, in EBS.
And I think I was a little bit more hesitant than was optimal about leaving those teams and looking for the next thing.
You know, as, you know, my own learning and impact kind of, you know, tapered off a little bit in those places.
And so, you know, I think I've changed organizations kind of in a big way four times in my career, and maybe five
or six would have been optimal. Not a lot more, but some more. And so, you know, don't hesitate
to think about, you know, what am I learning and who am I learning from? And is there a better
environment to do that, you know, more quickly and to learn more things? And, you know, I'm highly,
personally, highly motivated by being able to follow my curiosity. And every time I've done that
in my career, I've found that a valuable move
and something that I've personally enjoyed.
Awesome. Okay, well, thank you so much for your time. I really appreciate it, Marc.
Thank you for sharing with the audience.
This has been super fun. Thanks so much.
Thank you for listening to the podcast. It's a passion project of mine that I really enjoyed building.
Another passion project that I've been working on kind of in secret is building an ergonomic keyboard that I wish existed,
and I finally have a prototype, so I'd love to show you what we've built.
It's ultra low profile and ergonomic,
and I couldn't find anything like it on the market,
so that's why we built it.
I'll put a link to the keyboard in the description.
You can take a look and learn more about the project there.
We could definitely use your support.
Also, if you have any feedback for me about the show, I'd love to hear it.
Comments on YouTube have led to guests coming on like Ilya Grigorik and David Fowler.
I wasn't aware of them until someone dropped a comment.
Also, feedback in the comments helped me learn to reduce the number of cliffhangers in the intros.
So your comments definitely make a difference.
Please keep letting me know what you'd like to see more of in the show, and I'll see you in the next episode.
