The Pragmatic Engineer - What is a Principal Engineer at Amazon? With Steve Huynh
Episode Date: July 9, 2025

Supported by Our Partners
• Statsig: The unified platform for flags, analytics, experiments, and more.
• Graphite: The AI developer productivity platform.
• Augment Code: AI coding assistant that pro engineering teams love.

Steve Huynh spent 17 years at Amazon, including four as a Principal Engineer. In this episode of The Pragmatic Engineer, I join Steve in his studio for a deep dive into what the Principal role actually involves, why the path from Senior to Principal is so tough, and how even strong engineers can get stuck. Not because they’re unqualified, but because the bar is exceptionally high.

We discuss what’s expected at the Principal level, the kind of work that matters most, and the trade-offs that come with the title. Steve also shares how Amazon’s internal policies shaped his trajectory, and what made the Principal Engineer community one of the most rewarding parts of his time at the company.

We also go into:
• Why being promoted from Senior to Principal is one of the hardest jumps in tech
• How Amazon’s freedom of movement policy helped Steve work across multiple teams, from Kindle to Prime Video
• The scale of Amazon: handling 10k–100k+ requests per second and what that means for engineering
• Why latency became a company-wide obsession, and the research that tied it directly to revenue
• Why companies should start with a monolith, and what led Amazon to adopt microservices
• What makes the Principal Engineering community so special
• Amazon’s culture of learning from its mistakes, including COEs (correction of errors)
• The pros and cons of the Principal Engineer role
• What Steve loves about the leadership principles at Amazon
• Amazon’s intense writing culture and 6-pager format
• Why Amazon patents software and what that process looks like
• And much more!

Timestamps
(00:00) Intro
(01:11) What Steve worked on at Amazon, including Kindle, Prime Video, and payments
(04:38) How Steve was able to work on so many teams at Amazon
(09:12) An overview of the scale of Amazon and the dependency chain
(16:40) Amazon’s focus on latency and the tradeoffs they make to keep latency low at scale
(26:00) Why companies should start with a monolith
(26:44) The structure of engineering at Amazon and why Amazon’s Principal is so hard to reach
(30:44) The Principal Engineering community at Amazon
(36:06) The learning benefits of working for a tech giant
(38:44) Five challenges of being a Principal Engineer at Amazon
(49:50) The types of managing work you have to do as a Principal Engineer
(51:47) The pros and cons of the Principal Engineer role
(54:59) What Steve loves about Amazon’s leadership principles
(59:15) Amazon’s intense focus on writing
(1:01:11) Patents at Amazon
(1:07:58) Rapid fire round

The Pragmatic Engineer deepdives relevant for this episode:
• Inside Amazon’s engineering culture

See the transcript and other references from the episode at https://newsletter.pragmaticengineer.com/podcast

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@pragmaticengineer.com. Get full access to The Pragmatic Engineer at newsletter.pragmaticengineer.com/subscribe
Transcript
If you're going to optimize for performance, saying, why can't we be at one millisecond or why can't we be at 10 milliseconds and start from there?
Instead of sort of saying, hey, let's try to decrease latencies by 50% or 25%.
Let's just start from what is the conceptually fastest thing that we could do.
And that's actually how Amazon was created.
Amazon's principal engineering level is unique in many ways across big tech.
Steve Huynh was a software engineer at Amazon for 17 years and spent the last four of them as a principal engineer.
Today, we talk about the ins and outs of this role, including why being promoted from
senior to principal is so hard, even though Amazon usually has hundreds of principal engineering
openings and thousands of seniors trying to get into these positions.
The Amazon principal engineering community, the in-person events, the Slack group, and the
Principals of Amazon internal presentation series.
Engineering concepts that Amazon relies on for reliability, such as brownouts and COE (correction of errors),
and many more topics.
If you're interested in understanding one of the hardest engineering levels to get into across big tech,
together with stories of how Steve thrived in this position, this episode is for you.
Subscribing on YouTube and on your favorite podcast player greatly helps more people discover this show.
If you enjoy it, thanks for doing so.
So Steve, welcome to the podcast.
Thanks for having me.
How long were you at Amazon?
17 years?
Yeah, I was there for 17 and a half years.
And yeah, I just quit last year.
So I've been basically a year doing other things now.
And what were the things that you worked on while you were there?
You know, people always talk about my long tenure there.
But, you know, I feel like I've had like five or six jobs over that time period.
I started off on, you know, a project called Search Inside the Book.
I worked on the first Kindle launch.
Wow.
I worked on the precursor to Prime Video.
I sort of like worked there at the beginning part of my career.
And then I sort of ended my career there for the last five years of my time there.
I worked in payments.
I worked in Amazon Local, which was sort of our Groupon project when that type of business was looking like it was going to take over.
I worked on Amazon restaurants.
I worked on Amazon Tickets, which was a Ticketmaster clone.
And then my last five years was working on live sports streaming on Prime Video.
If you want to build a great product, you have to ship quickly.
But how do you know what works?
More importantly, how do you avoid shipping things that don't work?
The answer: Statsig.
Statsig is a unified platform for flags, analytics, experiments, and more,
combining five-plus products into a single platform with a unified set of data.
Here's how it works.
First, Statsig helps you ship a feature via a feature flag or config.
Then, it measures how it's working,
from alerts and errors, to replays of people using that feature,
to measurement of top-line impact.
Then you get your analytics,
user account metrics,
and dashboards to track your progress over time,
all linked to the stuff you ship.
Even better, Statsig is incredibly affordable,
with a super generous free tier,
a startup program with $50,000 of free credits,
and custom plans to help you consolidate your existing spend
on flags, analytics, or A/B testing tools.
To get started, go to statsig.com slash pragmatic.
That is S-T-A-T-S-I-G.com slash pragmatic.
Happy building.
This episode is brought to you by Graphite, the developer productivity platform that helps developers create, review, and merge smaller code changes, stay unblocked, and ship faster.
Code review is a huge time sink for engineering teams.
Most developers spend about a day per week or more reviewing code or blocked waiting for a review.
It doesn't have to be this way.
Graphite brings stacked pull requests, the workflow at the heart of the best-in-class internal code review tools at companies like Meta and Google, to every software company on GitHub.
Graphite also leverages high-signal, codebase-aware AI to give developers immediate, actionable feedback on their pull requests, allowing teams to cut down on review cycles.
Tens of thousands of developers at top companies like Asana, Ramp, Tecton, and Vercel rely on Graphite every day.
Start stacking with Graphite today for free and reduce your time to merge from days to hours.
Get started at gt.dev slash pragmatic.
That is G for Graphite, T for Technology, dot dev slash pragmatic.
So that's a lot of different teams.
How did you work on so many teams?
Is it just that there are a lot of internal transfers?
Did you get bored?
Was it just that you followed your manager?
How does it work inside Amazon?
Because when people think about companies, people who have not worked at Amazon,
they would kind of assume you go, you work there,
you're on a team for like, you know, four, five, six years.
Clearly not the case.
You know, it depends a little bit on like corporate policy
and then where you are with your career.
I started as a support engineer,
so sort of an operationally focused person.
And then, you know, I was basically like, I want to be a software developer.
And so, you know, I think getting into the company was pretty difficult.
But once I was there, I sort of set that target and changed roles.
And when I changed roles, you know, it was a natural time to move to another team.
There's also some internal policy.
So basically at Amazon, it used to be that you had to stay on a
team for at least a year before you transferred. And if you wanted to transfer, like a senior
manager or director, whoever up top, could block your transfer. And what that ended up meaning was that
like certain teams that were just terrible to work on, those teams actually had more than
100% attrition over the course of a year because you measured attrition with a year-long time
unit. Amazon did something actually smart at the corporate level. They basically said, okay,
well, you have freedom of movement now. This sort of happened, I don't know, probably like 13 years
ago, 10, 13 years ago. And so they said, you have freedom of movement now. A VP or a director
can't block you. They can say, okay, well, we need another month to get like a transition plan going.
But essentially, you have freedom of movement as long as you're not on a performance improvement plan,
which meant that certain teams were sources of high-quality engineering talent
and certain teams were sinks of high-quality engineering talent.
And it sort of created an internal marketplace for different roles.
Now, what that ended up meaning was that certain teams,
they basically didn't want you to know what the policy was.
They wanted you to sort of think that you were kind of stuck.
But, you know, despite that sort of local gamesmanship that was going on...
Yeah, like basically some managers didn't want their best people to leave, right?
Exactly.
So I just say it how it is.
But ultimately, I think it's a great strategy because if there was a team that was difficult to staff, the problem was on the management.
It wasn't something that had to be, you know, borne by the employee themselves.
And so, you know, getting back to my own career journey, at a very large company like Amazon, there are so many awesome things that are going
on. And, you know, I decided to just kind of go where my curiosity took me. Now, there were
some times where, you know, there were reorgs or, you know, a line of business got spun down.
But ultimately, you know, I think freedom of movement was one of the smartest things that
Amazon did. And I think this is something that people don't really appreciate about some large
companies. So, you know, not all companies are like Amazon and every company changes, right? Like
today, I'm assuming it would be hard to move across as many
teams within Amazon.
Depending on where you are, you know, if you're in a satellite office where there's
two teams, you can probably move on to the other team at max.
But I think this is one of the underrated things of large companies.
Like once you are in, it's almost always easier to get that job at another team from the
inside.
Yes.
Especially because you can talk to them.
You know, I talked with the Reddit mobile team and I asked, like, oh, how can
you become a platform engineer on the mobile team?
And they said, like, well, you know, most of our hires have been internal.
They just helped us out on hackathons.
They come around.
They commit stuff.
We know them.
It's a low-risk hire.
I think it's just nice to remember that when you think of a big company like Amazon or
Meta or Microsoft, it's just so many small teams.
And once you're in, you actually have almost priority access to those teams if you play
your cards right.
Absolutely.
And you know, you might interview for that team, but it's such lower stakes than an external
interview.
And, you know, just all things being equal, would you rather take somebody that's, you know,
internal and knows the culture, who knows how software is developed within a particular context,
or somebody that's just as good, but doesn't, you know, hasn't been onboarded. And I think ultimately,
you're going to pick the person that's internal, all things being equal. Yeah, it's just kind of like
business rationality for the most part. So one thing about Amazon and about large companies like Amazon
is that people externally talk about the scale. And it's hard to imagine, but can you give us a sense of
the scale that you've seen, or some tough engineering challenges that you worked on that would have been just really hard to work on at a smaller startup?
Yeah, I think that's the thing that you just, you will not see at most other places is the scale of things.
I'll give you a couple of examples. So, you know, Prime is the exclusive club that everybody is a member of.
Yeah.
And, you know, in the U.S., the shipping benefit is probably, you know, the most popular.
But globally, Prime Video is, you know, it's the thing that people use the most with their subscription.
And so if you think about, you know, our service-oriented architecture and just loading up the app, the gateway page is the place where all of our requests come in.
Right.
And so it's just like Netflix.
It's this infinite scroll of carousels.
So the gateway page is it the Amazon Prime landing page?
It's the landing page there.
And so you're like, okay, cool.
If, let's say, 95, 99% of all of your requests are coming from that page, and that page needs to be personalized, you know, and you have a service-oriented architecture with a bunch of microservices, one request to that page turns into, let's just say, hundreds of downstream requests to different services.
It might even be more than that.
It's actually kind of hard to count.
Yeah.
And is this page, right?
Like all the stuff flowing, all personalized stuff.
So that's the retail one, but I was talking about the Prime Video one.
The Prime Video one.
But essentially it's the same thing.
Yeah.
And so, you know, same thing for the retail website as well.
And so if you have one request sort of spidering out into, you know, two orders of magnitude more requests internally,
you start to see really, really large scale for these microservices.
So a microservice will have like a reverse proxy or load balancer in front of it,
and you are sort of unironically talking about things like tens of thousands of requests per second
or hundreds of thousands of requests per second coming into your service.
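The fan-out arithmetic described here can be sketched in a few lines of Python. The numbers below (1,000 page loads per second, a fan-out of 100) are illustrative assumptions, not figures from the episode, and `downstream_rps` is a hypothetical helper name.

```python
def downstream_rps(front_door_rps: float, fan_out: int) -> float:
    """Internal request rate produced when each page request fans out
    into `fan_out` calls to downstream services."""
    return front_door_rps * fan_out


# Assumed inputs: 1,000 page loads per second, each triggering roughly
# 100 downstream calls ("two orders of magnitude").
total = downstream_rps(1_000, 100)
print(f"{total:,.0f} internal requests per second")
```

Even a modest front-door rate lands in the hundreds-of-thousands range once the fan-out is multiplied in, which is the scale being described.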
So the services that are behind, you know, like there's Prime, there's all the things loading,
they're spidering out, you know, to render that one recommendation, for example,
for, I don't know, the video that you would like, it will make a lot of requests to different services.
And so when you're operating a smaller service inside of Amazon, suddenly you're going to be hit
with what you just said, 10K, 100K requests per second, that kind of scale.
Exactly. And you will essentially be DDoSing yourself.
You're just like, okay, cool. Let's change a caching configuration on some item details.
And it turns out you've just browned out like a critical service, right?
What does brown out mean?
Oh, sorry, using some jargon.
So if you want to talk about availability, suppose you are DDoSing a service or sending a lot of requests over to them.
You can just take them down.
That would be like a blackout.
And so you send a request, oh, you can't establish a connection.
It immediately comes back.
But there's a type of outage where they brown out.
So basically they're reachable.
They might accept a connection.
But, you know, they'll essentially time out, or they might return partial results or bad results,
or the only thing that they do return is, you know, a 500 for some percentage or proportion of requests.
After we waited a bunch of time for that. Yeah.
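The blackout versus brownout distinction can be made concrete with a small classifier over a window of request outcomes. This is a minimal sketch under assumptions: the `Window` fields, the `classify` function, and the 5% threshold are all hypothetical, not an Amazon definition.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    BROWNOUT = "brownout"   # reachable, but timing out or erroring
    BLACKOUT = "blackout"   # cannot even establish a connection


@dataclass
class Window:
    total: int          # requests attempted in the observation window
    conn_refused: int   # failed fast: no connection could be established
    timeouts: int       # connection accepted, but no answer in time
    server_errors: int  # connection accepted, answered with a 5xx


def classify(w: Window, degraded_threshold: float = 0.05) -> Health:
    """Classify a dependency from one window of observed outcomes.
    The 5% threshold is an arbitrary assumption for illustration."""
    if w.total == 0:
        return Health.HEALTHY
    if w.conn_refused == w.total:
        return Health.BLACKOUT
    bad = w.conn_refused + w.timeouts + w.server_errors
    if bad / w.total >= degraded_threshold:
        return Health.BROWNOUT
    return Health.HEALTHY
```

The useful property of separating the two states is that a blackout fails fast, while a brownout costs the caller the full timeout on each bad request.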
And so, you know, now we start talking about availability and resilience in the face of all of this DDoSing that you're doing to yourself.
And so the thing on top of scale that is going to really complicate things is your dependency chain, right?
And so, you know, your service is a dependency of some process that's going on.
It depends on, you know, maybe AWS, it may depend on another service.
You know, how do you make sure that if, you know, suppose there's a failure for a primary dependency
and that dependency comes back up, how do you make sure you don't just like inundate it with a bunch of
requests as it's trying to recover?
Yeah.
And so you have all of these sort of like odd dynamics that occur.
I used a brownout as an example of a perennial problem that we have, right, where there's maybe a dependency on a base
service like S3 or DynamoDB or whatever it is, there might be some increased latency
that may cause a chain reaction of a dependency going down. And then one of these sort of middle tier
services would brown out. So what are like, you know, you're an owner of the services for your
team. And so then it's like, okay, what do we do in those situations? How do we know that they're
browning out? What do we do in the face of, you know, a dependency outage?
And then critically, if there is an outage and then the service comes back up,
how do we make sure that we give it enough space so that it can breathe so that, you know,
as they're trying to recover from some sort of outage, we don't just take them down immediately again.
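One common way to give a recovering dependency room to breathe is to combine jittered exponential backoff on retries with a gradual ramp of admitted traffic. The sketch below is a generic illustration of those two techniques, not a description of what Amazon does; the function names and default values are assumptions.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0,
                  rng=random) -> float:
    """Full-jitter exponential backoff: a random delay in seconds between
    0 and min(cap, base * 2**attempt), so retries do not arrive in lockstep."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def admitted_fraction(seconds_since_recovery: float,
                      ramp_seconds: float = 60.0) -> float:
    """Share of normal traffic to send to a dependency that just came back,
    ramping linearly from 0 to 1 over `ramp_seconds`."""
    if seconds_since_recovery <= 0:
        return 0.0
    return min(1.0, seconds_since_recovery / ramp_seconds)
```

For example, thirty seconds after recovery with the default sixty-second ramp, `admitted_fraction(30)` lets through half of the normal load while the rest is shed or served from a fallback.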
And I guess for like most of us who are not working right now on these services,
like these sound pretty cool in theory.
But you're saying this was actually like, like this is not theory.
This actually was like, oh, this service is going down.
We are literally having 100K requests per second,
and we're like pushing that on to like other three services with the same
because we need to invoke three other services.
One of them has browned out.
What do we do now?
How do we fix it?
Yeah.
And I think for certain other large tech companies, you know,
you can do best effort, right?
Which is basically like, hey, we're temporarily down,
but, you know, you can, you have some sort of degraded service
that makes sense.
But if you're on, say, a website that does purchases,
now we're talking about transactions.
Or if you're in the Prime Video live video streaming use case,
now we're talking about a football game that you're unable to see.
And then when we recover, the game might be over.
Yeah.
Right?
And so it's much higher stakes.
And so I think the scale with transactional semantics, right,
like that's actually the challenge that you're
not going to see unless you sort of work for a payment processor or something like that.
Yeah, I guess that's real world pressure challenge. Like you are losing money. I'm starting to
understand why. Like I have noticed that startups love to hire from certain companies. Usually
startups love to hire from other startups because it's a similar environment. From large tech companies,
it's a bit of a, maybe I'm generalizing obviously. This will not be true 100% of the time. But for example,
hiring from Google, a lot of startups are not as happy because the people coming from Google are used to
having this amazing team around them, internal tools.
But most startups love hiring from Amazon,
and I'm starting to get a sense of why this actually is.
Yeah, I think that's part of the culture.
You know, you get hired as a software developer,
and they hand you a pager.
And before, you know, phone apps and things like that,
it was like this pager from the 90s.
And it's really great because you have to,
you have to like operate the software that you write.
You cannot write
the software, hand it over to the testing team, and then throw it over to the SRE team after you're done.
Like, you own that piece of software.
Yeah, yeah, at every team, right?
One interesting thing that we talked about yesterday over dinner with Casey Muratori is you said something interesting on how Amazon measured, on their retail website, I think it was retail, maybe Amazon Prime, that
the lower the latency of something loading, like a page loading, like a purchase page or a purchase button loading, the more revenue they got.
And they started to measure, and there was a linear
correlation: the faster
it was, the more people converted, and it seemed
it had no end. And the question
Casey asked is like okay if this is the case
what would stop Amazon
because you have the best technologies in the world
you have AWS you know you can build
whatever you want to get the latency
of the website down to let's say like 10
milliseconds or even one millisecond
because if this goes up you would maximize
revenue so can you tell me about like how
how that thing like this measurement actually
happened. And, you know, why is Amazon's website still maybe not the fastest in the world,
even though it would generate so many more billions, right? Yeah. Well, there are a couple
questions embedded in there, but we'll start with the, you know, the latency to gross revenue
measurement. So essentially somebody way back when, you know, because we invest in logs and telemetry,
started tracking how much gross revenue we would make based off of like the latency for
detail pages based off latency of Gateway, based off of latency of the checkout pages.
And they noticed this dynamic where it's like if you're faster, you just make more money.
It's a pretty clear correlation. I think you could even go as far as to say it's causation.
And so there was this really big focus on latencies. I love the idea that, you know, if you're
going to optimize for performance, saying like, why can't we be at one millisecond or why can't we be at 10 milliseconds and start
from there, instead of sort of saying, like, hey, let's try to decrease latencies by 50% or 25%,
like let's just start from what is the conceptually fastest thing that we could do.
And I think in a vacuum, the conceptually fastest thing that we could do is sort of like a monolith,
which is how Amazon started, where, you know, you have a web server with all of your catalog
information, so all of the items that are there. And then transaction processing on the host. That would
be the fastest way to run the retail website. And basically a web request would be: it opens the
HTTP or HTTPS handshake. It hits the server. The server in an ideal world has everything
cached or calculated. It sends it back. So the total like latency would be the time for this
request, the time to transfer that data based on your internet speed. And that's it. That is the absolute,
you cannot be faster than that. I don't think so.
Maybe you can do some exotic protocol, I don't know,
that predicts the future
and, like, sends it over UDP.
But yeah, but this is your baseline.
I guess the optimal would be like zero click instead of like a one click checkout, right?
So we just send you stuff before like you know you want it.
That would be the, I guess, the theoretical maximum.
But, you know, if there's some sort of like web request, right, so some HTTP request
and then some sort of like buy button, that would be the fastest, right?
And that's actually how Amazon was created.
We bought this, you know, it's sort of the opposite of horizontal scaling.
It was vertical scaling.
We bought these big Sun boxes,
and we hacked up our own web server in C++.
And to scale up, we bought bigger hardware.
And then when that didn't work,
we bought like six of these big boxes,
and that ran Amazon.
And we ran that way up until the early 2000s.
And then what we realized, we ran into a wall,
which was that, you know,
when you built the C++ binary,
the binary could only be four gigabytes.
And that was a hard limit
based off of the 32-bit software,
the architecture that we were running on before.
We could not get above four gigabytes.
And so these product managers would come and just be like,
well, just make a change for me, right, to the devs.
And then they would just be like,
I don't think you understand that this is a hard constraint.
And so we...
The size of the code or the binary code,
the compiled one, it was there.
And you had so much business logic by then
that it just filled out four gigabytes.
Yeah.
Yeah.
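The four-gigabyte ceiling follows from 32-bit addressing: a 32-bit pointer can name at most 2^32 distinct byte addresses. The snippet below shows only that arithmetic. The episode does not say exactly which part of the toolchain enforced the limit, so treat this as the general reason rather than the specific mechanism.

```python
POINTER_BITS = 32

# Number of distinct byte addresses a 32-bit pointer can represent.
addressable_bytes = 2 ** POINTER_BITS

print(addressable_bytes)             # 4294967296
print(addressable_bytes / 2 ** 30)   # 4.0 (GiB)
```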
And we had a distributed C++ build, so you know, you could, you know, it would take many, many hours for it to compile.
And so we would distribute it across desktops.
And it was this whole big thing.
But we ran into that wall.
And so what we decided to do, and I think this was super smart, was like to lean into service-oriented architectures, right?
And microservices.
Yep.
And when you break it down, a web service call is essentially, it's a remote procedure call.
So you have this execution pointer and then you're like, okay, well, I need to do some computation or I need to gather some data.
I'm going to in turn make an HTTP request downstream to another service, and then you can sort of chain those things together.
And so getting back to the original thing about performance in a world where you have to, because you have thousands and thousands of developers building, you know, this stuff and the fact that you cannot have a monolith as big as Amazon retail, you know, past something that's sort of like,
circa 2002 Amazon size, you have to lean into remote procedure call.
You have to say that there is a web service.
The best performance that you can actually get is always going to be bounded by the number
of web requests that you end up making, whether it's the first order calls to say go get
the item details, but then also any blocking call that happens downstream.
By blocking call, we mean you need to wait for this to finish to get your data.
Like, you know, a service that returns,
I don't know,
your top five most likely to buy things,
it might need to make, let's say, five requests or just one request.
It needs to wait for that before it can return.
Exactly, exactly.
And you can do this telemetry stuff.
You can do this observability stuff to figure out, you know, within that service call chain,
what the blocking call is.
And you can get some amount of visualization on it.
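The idea that end-to-end latency is bounded by the chain of blocking calls can be sketched as a critical-path calculation over a call tree. The `Call` structure, the service names, and the millisecond figures below are all hypothetical. The model assumes each service does its own work, then waits for the slowest of its concurrent calls plus the sum of its one-after-another calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Call:
    name: str
    own_ms: float  # time spent inside the service itself
    # Blocking calls issued concurrently: the caller waits for the slowest.
    parallel: List["Call"] = field(default_factory=list)
    # Blocking calls issued one after another: the caller waits for all.
    sequential: List["Call"] = field(default_factory=list)


def latency_ms(call: Call) -> float:
    """Lower bound on response time for `call` under the model above."""
    slowest_parallel = max((latency_ms(c) for c in call.parallel), default=0.0)
    total_sequential = sum(latency_ms(c) for c in call.sequential)
    return call.own_ms + slowest_parallel + total_sequential


# Made-up example: a page that calls recommendations and profile in parallel,
# where recommendations must first wait on a catalog lookup.
page = Call("gateway", 5, parallel=[
    Call("recommendations", 20, sequential=[Call("catalog", 10)]),
    Call("profile", 8),
])
print(latency_ms(page))  # 35
```

In this toy tree, speeding up `profile` changes nothing, because the blocking path runs through `recommendations` and `catalog`. That is the kind of answer the tracing and visualization mentioned above are meant to give.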
And so then you can get down to the point where it's like, okay, if we're going to start
from first principles, what's the least amount of latency that you can get for, say,
like a web request or a checkout page call, you're going to run into like the absolute minimum,
right? And it's going to be based off of like what are the required operations, you know,
evaluation or transactions or whatever for that particular request. Yeah. And then basically so as I
understand, as it became more microservices and services, this is great for
maintainability, and also you first just solve the issue of the monolith size. And, you know,
as we know with history, of course,
like now teams could be more autonomous.
They're not as dependent.
They could do the APIs,
but it was a tradeoff for latency.
And now, like, you had to go back and figure out the blocking calls,
how to speed those up, how to do, I guess, you know,
trade off things like caching.
Like, you know, you can have things fast,
but it might not be as correct on the first one.
Or, like, just tricky UI where you don't show the data just yet,
but it's coming.
And the user gets a sense of, like, progress, those kinds of things.
And it also, I think, forces teams, and product, to really say, okay, like, what is the strictly necessary processing that happens on this page?
Some of the work that I was doing before I left Prime Video was basically, like, you have these really, really big, heavy gateway page, you know, or landing page requests.
And, you know, if you're in a situation with high load, can you preemptively reduce the amount of, say, personalization
that's going on to sort of speed up that page or, you know, to increase the amount of
throughput that you're able to have. So to serve more customers, can you do that in a smart way,
right, that sort of anticipates load that's coming onto the, to that page? Say if there's a football
game coming up or something like that. Yeah. Sounds like these are just like a, they seem just
hard to solve, but now you have to solve them. So it sounds like this, this kept you busy.
Everyone else busy at Amazon, to this date, right?
Like, is this, do you think is this ongoing engineering challenge for Amazon?
Because, you know, what I would imagine, the tricky thing being here is like, okay, you can optimize whatever you have.
You can find the critical paths.
But Amazon keeps growing, right?
Like, there's new teams, new services, new everything coming on.
So this thing will change all the time.
It's an ongoing puzzle to solve.
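The idea mentioned earlier of preemptively reducing personalization under high load can be sketched as a simple degradation policy. Everything here is an assumption for illustration: the function name, the carousel counts, and the 70% utilization point at which shedding starts are invented, and the episode does not describe the actual mechanism.

```python
def personalized_carousels(load: float, max_carousels: int = 20,
                           min_carousels: int = 2,
                           shed_start: float = 0.7) -> int:
    """How many personalized carousels to render for a landing page.

    `load` is current or forecast utilization (1.0 means at capacity).
    Below `shed_start` the page is fully personalized; between `shed_start`
    and 1.0 the count drops linearly; at or above capacity only the minimum
    is kept, with the rest assumed to come from cheap, non-personalized
    defaults.
    """
    if load <= shed_start:
        return max_carousels
    if load >= 1.0:
        return min_carousels
    span = (load - shed_start) / (1.0 - shed_start)
    return round(max_carousels - span * (max_carousels - min_carousels))
```

Feeding the function a forecast rather than the current load, for instance ahead of a scheduled football game, is what would make the shedding anticipatory rather than reactive.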
Yeah, absolutely.
Yeah, I think, you know, they definitely have a ton of work in front of them.
Also, you know, it's part of their ethos to really like launch new lines of businesses really quickly.
And so, you know, the ability for a team to go from zero to launch product within the confines
and the context of a large corporate entity, I think that's, you know, part of the DNA that's there.
So as long as they're planting seeds as the sort of like internal terminology is, I think that, you know,
software developers will be in demand for quite some time.
Yeah, I guess it's a good reminder that, you know, every now and then we have the monoliths versus microservices debate, and it sounds like it kind of just makes sense for a startup to start with the monolith. Like you can always do what Amazon did and you have the benefits of latency. Everything is in one place. Like I'm sure there might be a reason to start with microservices. But if you're a small team, I mean, even today, I don't think that argument changes, right? Like Amazon got really big wins by starting with a monolith back in the day.
Yeah, absolutely. I think it just makes a
ton of sense to start with a monolith, wait till it breaks, and then the part where it
breaks is when you have, like, 50 developers working on the same piece of code. Once that sort of
breaking point occurs, then you start to, like, try to figure out, like, how you can sort of break
things up. But starting with a microservice architecture, especially when you're small,
like, what a waste of time and energy. Totally. So you were a principal engineer at Amazon, and apparently
I've learned that most companies
have different levels
and again this principal engineer
some companies have like staff level
but it's usually like entry level
mid level senior
and then you have staff
or in the case of Amazon it's principal
I've learned that Amazon's principal level
is both really hard to get into
compared to a lot of other companies
and it's pretty special in some ways
so we'll talk about that but can you tell me like
how how is the career
kind of development
because most people imagine like oh it should be
pretty straightforward
I spend like, I don't know, two years as a junior, two years as a mid, roughly, and two years
as a senior, then I get to principal. How does it actually work at Amazon? I think it's linear
up until you hit principal, right? So, you know, you join, you're a junior developer, you get
promoted to mid. At mid, you know, you're starting to influence the team, but then you get to
senior, and so now your expected impact is at the team level. And then there's this jump that you get
to principal. And principal is, it's L6?
Principal is L7.
L7.
Yes.
Yeah.
And so I think you really have to start with like why is it, why is that jump so big?
Because I think at every, pretty much any other company, it's just a linear progression.
Like there's nothing necessarily special about staff.
You know, you can just sort of go to that level of senior staff and then principal.
But for some reason, Amazon decided that they weren't going to have a staff level.
And so, and I think they sort of like couched it around like having high standards.
Basically, to get from senior to principal, you have to do like a two-and-a-half-level jump.
From L6 to L7.
Yes.
Technically it sounds like one level, but at some other companies, this might be like, you know, L8, L9 or L8 and a half.
Yeah.
And, you know, so the hand wavy argument is like, hey, we have high standards and like, you know, it's, it means something to get to that level.
It's like, fine.
But I noticed that some of the best engineers that I'd ever worked with were having such problems getting to principal.
engineer that they ended up moving to Facebook, or to Meta, or to all these other places where the
progression was just sane. Now they're staff or senior staff, now they're senior staff and, you know,
principal and distinguished engineer at other companies. And so because we had high standards,
we actually had this brain drain. And it wasn't a brain drain at lower levels. It was a brain
drain at sort of like the higher levels. And it's just an example of something where it's just like,
why did you do that to yourself? And so
that's the context for being a principal at Amazon.
It's safe to say it's wicked hard to get internally, right?
So, you know, I'm colleagues with Ethan Evans.
And so we talk about what's the hardest promotion at Amazon.
And, you know, I had made the argument that it was, you know,
it was senior engineer to principal.
And he's like, yeah, that's hard.
Actually, the hardest one, Steve, is, you know, VP to senior VP.
Because there's only eight spots or ten spots for that.
and maybe 300 VPs that are all trying to get this.
That's more of a supply and demand thing.
I will say that at Amazon,
there is gigantic demand for principal engineers.
And so there are roles that have been open for years,
I think something on the order of like 13 months
or 17 months or something like that
to get an external hire to join as a principal engineer.
But that metric is only calculated when the role is filled.
And so probably, you know, there are hundreds
of principal engineer openings at Amazon.
And there are thousands of senior engineers
who desperately want to get there.
They're putting in the work.
And so there's this sort of like, there's this tension, right?
And I don't think you see that at the lower levels.
I don't think that that's happening at senior or mid or junior.
And so that incongruity, I think, is super interesting.
But once you do get to principal engineer,
and one thing that I've never heard any other company have is there is apparently a
principal engineering community, which is, I've heard, again, from other people that is tightly
knit. It's actually special. It's actually just a really nice organization. Can you talk about that?
So like, you know, once you got in there somehow, I don't know, was it blood, sweat and tears at
promotion? There is a community. I think it's actually really great. My own history, you know,
I went from support engineer to senior engineer in like four years at Amazon. But then from senior
to principal, it took me eight years. And I got
promoted in Q1 of 2020.
Turns out to be a consequential, like, year for the industry, for the world.
That was forced remote work.
Yeah.
And so, you know, I got promoted and everybody's like, you know, congratulations.
They used to have like a principal engineer offsite where they just flew everybody into
Seattle or nearby and then to sort of like, you know, mingle and to talk to other folks.
That stopped during the pandemic.
And then, you know, by the time the pandemic restrictions started
lifting, the population of principal engineers had essentially doubled. That's still to say, like,
there are still hundreds and hundreds of openings for principal engineer, but then the, you know,
the sort of like off-site community shifted over to the senior principals that I didn't have
access to. But, you know, at the moment, the manifestation of the principal engineering community
is essentially through the Slack channel, which is absolutely awesome. And then we had principal off-sites
for like our local organization,
so like Amazon Music, Prime Video, Twitch,
that sort of thing.
Those meetups were amazing.
So the reason they were is because of this high standard
that Amazon had created.
And so what it meant is that everybody
that was able to achieve that overly high standard,
there's something exceptional about them.
There's, you know,
they're super deep in a particular technology
or they were associated with, you know,
the growth of a really large line of business, either within Amazon or externally.
They were essentially leaders within the industry.
And you could just literally, you could just scoop out five people and then put them into a room.
And the conversation is just amazing, right?
And I would sort of be like, I don't even belong here.
Like, look at this guy.
You know, he wrote a book on a particular topic.
and this guy, you know, he, you know, he was, you know, a luminary in a particular field.
And then this person just like is an amazing code machine and can just write an entire application over a weekend.
And then you're like, what am I doing here?
You know.
I do wonder if that community might be coming back now.
I know you've left, but now Amazon is in person again.
Because it sounds like a lot of the benefit was the in person part as well.
Because this is what I never heard.
even before the pandemic, I didn't hear other companies, say, for example, at Uber,
I've heard that the senior staff engineers do get together every now and then, but it was
very, like, grassroots.
So it was bottoms up, but my understanding is that Amazon actually invested, not just, you know,
some principal engineers saying, hey, let's get together, but also just kind of, you know,
like making, making sure that that group really had something.
Like, I think it's smart.
I think more companies should do it, but I'm just not seeing it.
The investment was also in terms of headcount.
So there are program managers and product managers essentially that are, you know, bringing the folks together.
Awesome.
There's a wonderful series.
It's called the Principals of Amazon series where, you know, principal engineers will just, you know, they'll do a presentation and it's recorded.
That's been happening for, you know, 20 years and, you know, we record everything that's there.
But it takes work to actually...
That's an internal series.
And is that open to everyone at Amazon or is it for the principals?
Oh, it's open for everybody at Amazon to consume.
And then, you know, there might be some senior engineers and stuff like that that would make a presentation.
That's part of their promotion packet.
It was to be able to make an Amazon-wide presentation on a particular thing.
My point was, though, that that stuff doesn't just happen on its own.
Yeah.
Like, you have to, like, you need a program manager or multiple folks to...
to sort of like herd the cats and to like schedule the off-sites and to make sure that the,
you know, the Slack channel doesn't go off the rails, right?
And it's still useful.
And it's just not going to happen like grassroots with just like throwing a bunch of people
into a room.
This episode is brought to you by Augment Code.
You're a professional software engineer, vibes will not cut it.
Augment Code is the AI assistant built for real engineering teams.
It ingests your entire repo, millions of lines, tens of thousands of files,
so every suggestion lands in context and keeps you in flow.
With Augment's new remote agent, queue up parallel tasks like bug fixes, features and
refactors, close your laptop, and return to ready-for-review pull requests.
Where other tools stall, Augment Code sprints.
Augment Code never trains on or sells your code, so your team's intellectual property stays yours.
And you don't have to switch tooling. Keep using VS Code, JetBrains, Android Studio, or even Vim.
Don't hire an AI for vibes. Get the agent
that knows you and your code base best.
Start your 14-day free trial at Augmentcode.com slash pragmatic.
I think these are the things.
I mean, we're now exposing a few of these things here and there,
but some of these companies like Amazon is a great example.
There's more to the eye than what meets the surface.
So like once you're inside Amazon, for example,
you now, as an engineer, even if not a principal engineer,
you now have access to the whole 20 years of principal presentations.
Like when I joined Uber, I was amazed at how we had the
RFCs available, like I could read all historic ones. So I think there is, and every company has
its own. Of course, once you're in there, you have access to this like knowledge base, which
it will just never be published. It cannot because it has, you know, business sensitive things,
etc. So I think as an engineer, like you can just really just like, like be a sponge when you
join, especially one of the companies that is known to be a bit more open internally.
Amazon, I think, is a really interesting one, because externally it's very closed, is my sense. They're very
careful about what they share. For example, for the post-mortems for AWS, very few are published
externally. But internally, they're all there. I understand that as an engineer you can access them.
You can learn from them, like really cool real world learnings. Absolutely. You know, it is an open
place internally. And we are so selective about what we, I say we as though I still work there.
What they publish externally. And, you know, the post-mortems, we call them COEs. C-O-E stands
for correction of error. Yeah, it's, you know, it's this idea that, you know, you have, like,
holes in Swiss cheese, and a failure requires that there's a hole across
layers. That's the best reading. Like, I would just subscribe to the email list where they were
published internally. So you have this, like, stream of disasters that are going on within the
company, and you just, you know, you grab some popcorn and you pop open one of these COEs and
you learn so much from that. And I think that
that's part of the secret sauce.
The idea, and I don't know if it's like this for 100% of them,
is that it's a blameless culture sort of thing.
And so to really screw up requires that multiple people drop the ball.
And you learn so much from that sort of stuff.
You know, the brownouts, you know, these lessons that you would learn from, you know,
trying to recover from really large dependencies,
those things are immortalized inside some of these COEs.
So there's some very famous outages that happened within Amazon.
And, you know, there was egg on our face.
And we really, really learned those lessons through those postmortems.
They're absolutely wonderful.
As a principal engineer, you know, so far we kind of glamorized the role saying, you know,
it is hard to get into.
But once you're there, you have the community, you do this really impactful work.
But one of the principal engineers at Amazon, who's still there,
called Bhavik Kothari, he collected some things that are maybe not as glamorous
or more challenging about principal engineering.
He had five of these things, or five or six.
I just want to go through with you and your take on them.
So first he wrote, there is this paradox of belonging that you're part of all teams,
yet you're part of none.
What does that mean?
Yeah, no, so Bhavik was actually a peer of mine.
We worked in Prime Video together.
Oh, awesome.
So he's an awesome dude.
Yeah, there's all of these paradoxes.
and this paradox of belonging is a really interesting one.
You know, you work for the organization.
You're working cross teams, right?
So at the senior engineer level, you're embedded on a team.
And, you know, you own the team's architecture, the operations, you know,
the software development lifecycle and the design.
But when you get to that next level where you're working across teams,
you kind of operate in this weird layer where you're not on pager duty for a particular team.
You have visibility across all of these teams that are there.
You're helping to guide and make decisions, but you're literally not on the ground floor anymore.
And so, you know, when you work with a particular team, you know, you might call the senior engineers or the mid-level engineers in and be like, hey, let's whiteboard some stuff.
Like, let's try to figure out what's going on.
You're not on the team. You're kind of this like advisor that's sort of coming in, right? But then, you know, maybe a director or VP would call you in and say like, hey, what do I own? Like, what's going on? Explain to me this outage or tell me why we can't build this thing. And then you're trying to whiteboard the architecture and the system and you're trying to say like, hey, you know, this is what's going on on the ground floor. But you weren't, you know, you weren't part of that team. So you're just sort of operating in this sort of
strata where, you know, you don't really belong on a team. You know, I'm a, I'm an immigrant,
I think you are as well. And, you know, my parents came from, from Asia. I'm not Asian, right?
So when I go back to Asia, I'm definitely from the U.S. And then growing up in this country is just like,
you know, I'm, you know, not quite an American, right? And so you, you sort of operate in this
sort of, you know, area in the gaps where your identity is really
defined by not being squarely in one of these predefined categories. So it's very similar to that
as a principal engineer. You're not on the ground floor. You're not checking in, you will check in
code, but you're not necessarily part of that team embedded on that team. Even if you are for a
short time, it's usually a short time. And like tomorrow, the director could call you up and say like,
hey, Steve, we need you on this other team. They're in trouble. Move over. Yeah, and you parachute in.
And then, you know, then they're like, oh, who's this guy? You know, and then your director is like,
what's going on? What happened during this outage? Why is, you know, why is the, why is the press writing about us?
And then you're like, well, you know, here's what's happening on the ground, but you're not really embedded on that team.
Which leads us to the next paradox that Bhavik described. He lists a few paradoxes, and this one is freedom and responsibility.
And he writes that you enjoy significant autonomy and being able to choose what you work on. However, there's an implicit expectation and accountability for a resounding impact.
Yeah. So, you know,
I reported to a VP right before I left the company.
So they were your manager, basically.
Yeah, my manager was a VP.
Oh, wow.
That's...
I don't hear many companies having engineers report into VPs.
Yeah.
That doesn't seem very standard.
You know, and so the org that he owned, you know,
I considered myself the tech advisor for that organization was about 450 people,
450 software developers.
And what did our one-on-ones consist of?
Right. Like when I would have our one-on-one, it wasn't like, hey, here's, you know, he didn't assign me work. He wasn't like, hey, I need you to build this thing. I need you to design this thing. The context that he set was basically like, here's a direction, right, that you need to go. And the way that you can achieve that type of impact was up to me. Right. So he might say something like, hey, availability is so important for, you know, live
sports. We just signed, you know, billion dollar contracts with these sports leagues. And so
we need to increase our availability posture. And then I would be like, okay. And then I would go away
and I would come back and I would be like, you know, here's what I'm working on. Right. Like that type
of dynamic does not exist at the senior engineer and below level, where you're basically telling
your boss what's happening. I was about to say that when you said, in my
manager one-on-ones, he didn't tell me what to do,
I'm like, most engineers would be like, sign me up.
Like, I don't want, you know, we all hate micromanagement.
But now when you're telling me, like, he would say like, oh, so we just signed a
billion-dollar contract.
Availability is important.
And he stops talking.
I'm like, that sounds uncomfortable.
And basically, like, you're kind of expected a little bit to, like, understand what he's
expecting, even though he doesn't know.
And then, and I'm assuming, you know, there's two ways of going, right?
You go back on the next one-on-one and you say something.
And he was like, like, Steve, like, you're a principal
engineer, this is not what I expect of you, and you don't want that. Whereas this, you know,
if you're bringing back the right thing, so it sounds like you really need to up level and like
understanding how like these people think. Absolutely. And so he's, you know, he's accountable to
his boss as well. And, you know, don't get me wrong. I didn't, you know, I had a, I owned aspects of
availability. You know, there's a multi-thousand-person organization at Prime Video doing this stuff. But we own the
live sports aspect of this. And, you know, there are playback teams. There are, you know,
recommendation teams. There are, you know, there's so many different teams that are there
that had to really step up and make sure that availability was good. But he would say
something like, hey, you know, what is our availability posture for certain aspects?
And I would have to go and figure it out.
Yeah.
Like, what are we measuring? What are we not measuring? There's a deadline for, you know,
the start of a season where we're expecting, you know, millions and millions of
concurrents to come in. What can we do between now and then, right? And then if we do write some software, like what
is the highest leverage piece of software that we could create that would increase our availability
posture? And so the way that I sort of describe it to people is you are assigned not a problem,
not even a problem space, you're assigned a direction. You can solve the problem with code.
You can solve the problem with system design and architecture. But you could also solve the problem,
say, by, you know, I don't know, hey, maybe there's some off-the-shelf software we should purchase.
maybe there's a dev team that we should start to spin up right now
whose job it is to do this particular thing.
Maybe we've identified a piece of software and it's already been scoped
that this team needs to go and build,
but it's not a priority for them.
Now we need to go and figure out like,
you know, how we can get them to do it.
Can we shuffle around resources, that sort of thing?
And so the way I describe it is like there's so many more things on the menu
that you can use to solve the problem
and I don't think people recognize that.
They think that it's just, oh, when you're a principal,
you just code a lot and it's just really complicated.
Or do more meetings.
You know, that's what all happens.
I mean, at the end of the day, like, don't get me wrong.
There's a ton of meetings that go on.
Yeah, yeah, but this is, I think it's good to, like, shine light.
Because I also feel like once, it sounds like a big change,
but I also kind of feel if you get good at this,
you might not really want to go back to, you know,
having a manager who's like, all right, here's a project.
We need to solve, like, you know, scope it out, and which you can do, right?
Yeah.
That's cool.
And now, the next challenge that Bhavik said was, this all sounds great, but there's apparently a
bandwidth challenge.
So it's easy to become this, like, social resource where people just pull you into everything
and you're breathing.
Yeah.
No, you know, I think, I wish I had taken a screenshot, but, you know, I have my outlook calendar,
right?
So it's my schedule.
My day looked like most people's week.
So it looked like somebody had just, like, blown up a
Tetris factory. Like, I would be triple or quadruple booked on a Monday all through
the day. So you would have the manager calendar as an IC. Yeah. And it's absolutely crazy, because,
you know, for that large org that I was supporting, everybody just added me as optional, or they
might try to say, like, no, you're actually required for all of these meetings. But when you have a
triple-booked calendar and you're required for this stuff, you just learn that you're going to have
to disappoint a lot of people.
Yeah.
And so it's this sort of like, you know, this thing where it's like it's almost easier to say
no now that you're obscenely overbooked versus when you're a senior engineer, you're like,
I don't have time to write code, but there's just barely enough time in between the cracks.
Yeah.
And so I think that it's almost like when your schedule breaks, that's when you are finally
freed because you know that you can sort of say no to stuff.
But ultimately, if I just went to all of the
meetings that everybody said that I would have to go to, I would be a professional meeting
attender, and I would literally have no time to do the work. And then Bhavik follows up on this next
challenge, which is being truly present, and he writes, I think it's almost like, you know,
he was sitting next to you. You find yourself physically present in one meeting while your mind
is already racing toward the next three. You know, it's a really big challenge. You know,
I pride myself on being a good communicator and being present. And when there are, there are 20 things
that are going on in the air or 100 things that are going on, it's just really, really difficult
to stay single-threaded.
And what I ended up having to do is to sort of say, like, okay, I could do all of these things
and they would be really impactful.
But I just had to aggressively prioritize and say, you know, for the availability.
I'm just looking at availability.
There's all these other fires that are going on, which is disappointing because there's so
many things that you know you could be focusing on. It's super difficult. And so, you know, I work
with a lot of people to try to get them to the next level, and they say, Steve, I'm completely overwhelmed.
There are like 20 things that are going on. And I tell them, like, do you think it gets easier
when you get to a higher level? There's just going to be more and more things on your plate. Why wait until
you burn out or you break? You can just start implementing these things now. So every high-level
tech IC I know, managers included, they have a wonderful system in order to isolate signal
and then cut out the noise. And if you don't have that, you literally won't survive. But it's just
at the principal level and above, it's just amplified that much more. I'm getting the sense that
a lot of the work you do as a principal engineer, I mean, there's huge amounts of software
engineering, and you need to be, you know, just really good at building resilient systems, learning about
new technologies. You know, for example, today, I'm assuming whoever is a principal
engineer at Amazon, they're expected to just know everything about LLMs, tradeoffs,
characteristics, et cetera. Anyway, but you also need to just
build the skills that managers have, which is managing your time, changing context,
figuring out how to get that focused time. Like, you know, contrary to popular belief,
managers actually need focus time. So, like, you know, I would also always try to carve out
some time. But you're now doing it
while your title is not manager. But actually it feels like you combine a lot of
manager responsibilities and a lot of, you know, like, experienced engineer, and boom, you get the
principal engineer role. Oh, the only upside is like you don't need to do performance reviews for
people. Congratulations. You saved a little bit of that time. Well, actually, during performance review
season, they pull the principal engineers in because if you're, if you're, so, you know, if you're stack
ranking people, okay, cool, well, we'll need to take a look at their performance. There we go. So I
reported to a VP, you know, one of my peers was a director and he was basically like, hey, Steve,
I would like you to show up to my performance review for my entire org of 100-something people.
And I'm like, I can't do that for you and for everybody else.
Okay.
So that would make sense why as a principal engineer, your compensation package will be similar
to like, is it a senior engineering manager or something like that?
Around that.
Around that.
But basically, like, the job has a lot of overlaps.
Okay, the benefit is you're not the one delivering the performance reviews to the direct reports,
but you're doing almost everything else, in terms of the effort I'm talking about.
Yeah.
Okay.
So having been a principal engineer for four years, what are the good things that you really liked about Amazon, specifically Amazon's principal engineer role?
And what are some of the, you know, not so good or it could have been better things?
I mean, the great parts are you get visibility that you just couldn't possibly have.
at the team level. You know, within a large organization like Prime Video or wherever you're at,
there are many thousands of people that are working within that organization doing so many things,
right? And typically the performance of these people is really high. There's so many different
directions that are going on. And so to survive, you kind of have to look inward and you say,
okay, well, here's my service boundary, here's all the software I own. I'm going to own everything
within the sphere of ownership. Because you've built this wall up, you tend not to be able to
to see like that broader picture.
Yeah.
And so as a principal engineer, I think it's really awesome to be able to sort of like
spelunk and be able to go to different teams and sort of see that broader picture.
And I just don't, I don't see a way that you would be able to get that, that type of
visibility that's super interesting at a lower level.
You know, I think the other thing is like, you know, whether it's, it's warranted or not,
you do get some amount of status when you go to a meeting.
People just listen to you.
They listen to your harebrained ideas, and it's kind of nice because you don't necessarily have to, like, prove yourself over and over again.
This is a bit less professional, like, not fights, but just establishing that you know what you're talking about.
Yeah.
Yeah.
Now, the bad things are, you know, there's a lot of folks that are really good in tech and being really effective as a principal engineer.
But then they also, you know, myself included, they're like, okay, cool.
well, that sort of makes me an expert in pretty much everything.
And so you would get these principal engineers together.
We had a weekly meeting.
And so it would be like, okay, if you wanted to talk about, like, establishing a constitution for a small island nation,
all of a sudden, they would just be like, well, like, here are the main considerations.
It's like, nobody has a background in government policy.
But all of a sudden, like, just because you're sort of trained to do so, you start to, like,
pitch in.
You're like, well, actually, you know, maybe we should have two branches of government or three branches
of government. And it just sounds like we would know what we're doing, but we don't. And so there's
this trap, and again, I've fallen into it many times where you actually think you're an expert
in one thing, but you're actually not, right? And so, you know, take LLMs. There's a ton of folks
that understand AI. I left before it was sort of like allowed to use internally, but I think
you can use it now. I'm not an expert in LLMs at all. But I do
think that the expectation would be that you understand, you know, how they work. But then the
expectation is also like, hey, what should our policy be? How should we be thinking about this stuff?
And I think that's fine for mature technologies potentially. Like you can ramp yourself up for it.
But as like that particular landscape is changing so quickly, I think there's this sort of trap where
you sort of, you speak as an authority, even though you haven't had the requisite time to ramp up on
something. And you were there for 17 years at Amazon. What are your favorite parts of the culture?
Like, you know, there's a lot of things that there's the values that we all know, like the frugality,
customer obsession. What were the things that you found to be like the most interesting or the
ones that have lasting impact? And how did they change? How did Amazon change over 17 years? They must have
changed. No, I think the thing I miss the most is the secret sauce. The leadership principles
are good, but I think the actual secret sauce there is principled thinking.
Right?
And so, yeah.
So, you know, there's invent and simplify and bias for action and all of this stuff.
But like, ultimately, the thing that is amazing about those leadership principles isn't the
specific stances that they took.
So they decided that customer obsession is a big deal.
They decided that bias for action is a big deal, all of these things.
But really, if you look at a meta level, you'd be like, oh, these guys
have principles that they won't budge on. I sort of think about it in terms of math and axioms.
Like, you just take certain things to be true, you know, two lines that are parallel. If you extend
them out to infinity, they won't touch each other. Yeah, you assume that's true.
Yeah, you don't prove that. It's an axiom. And then based off of that, you're able to build a system
of mathematics, right? And so it's the same thing with the corporate leadership principles at
Amazon, they basically said, okay, we are going to fix these things to be true.
There are 16 or 12 or I don't know.
They just sort of bolted some on.
They were 14 and now they're 16.
And but there are like four or five that are just really core to Amazon and we just
fix those things to be true.
Which ones were the ones that you felt were the most present?
Customer obsession.
We are absolutely customer obsessed.
We'll just burn money to delight the customer.
You can be in a meeting with a VP as an intern and you say, hey, that's a bad customer experience.
It would be like a needle coming off a record.
It would just be like, what?
What are you talking about like immediately, right?
You know, bias for action.
So like just get some stuff done.
Stop asking for permission.
Just like go and do it.
Right.
Ownership.
It's just like you own your software.
You run the, you know, you do the operations.
You own the bug count, all of this stuff.
Right.
So those are the ones that are like those are fixed.
And then you start layering
things on top of it. And I think it's really great. But, you know, you could take
Amazon and you could have like the, you know, evil goatee version of Amazon, which is just sort of
the opposite of those things. And that would still be a really valid and awesome company. So you could
say, okay, well, it's the opposite of customer obsession. It's not customer obsession or not being
customer obsessed. I think it's, you know, like being about your staff. Yeah. Which is Google.
It could be like, hey, we really care about our people above everything else. Or it could be,
you know, let's not mince words.
We care about top line or bottom line revenue.
Yeah.
That's totally valid, right?
And then you could just fix that.
You wouldn't, you can't prove that, you know, being, you know, staff focused is a bad thing.
You just build that.
And then, you know, a certain set of things will happen, like great things are going to happen.
And then, like, not so great things are going to happen.
Those not great things that happen, you can try to mitigate them, but you can't fix them because you have started with this principled approach to everything.
Yeah.
Yeah.
Yeah, it all goes like everything has.
Yeah.
I see what you mean, but I think what you're saying is like it might be less about what the specific principles are.
I mean, Amazon has theirs and we know about them, but it's just sticking to them and not wiggling.
Because if you keep wiggling, it's like, what was the point, right?
Then you're going to have a really kind of mediocre, truly not standout company, whatever you do.
What does it actually mean to be principled and to not bend?
It could be really easy to do so.
So that's an amazing secret sauce of Amazon's.
People look at the leadership principles.
I'm like, no, it's principled thinking.
Another thing.
And a lot of this, honestly, from what I understand, talking to you earlier and some other people,
a lot of it probably comes from Jeff Bezos, from the top down, being very principled,
not giving in, and saying, we will do this, whatever it takes.
Sounds like it was customer obsession initially and then some other things.
Yeah.
Yeah, absolutely.
And he was an absolute genius when it came to this.
So, you know, I'm a Jeff Bezos fan,
for sure. Like, it just worked.
Another thing that's Amazon's secret sauce is just the writing culture.
And so, you know, I spent on the order of like one to four hours every day reading while
I was a principal engineer. And we had a standard format. It was a six-page memo.
And, you know, that would be our business strategy. That would be a system design. That would be, you know,
what we call the PRFAQ,
so a press release
and frequently asked questions
for like a new line of business
or a new initiative.
And everybody was sort of constrained
to this six-page format
and everybody just produces documents
in that format
for whatever they need to do.
And so when I would try
to get up to speed on a particular thing,
I would just be like,
give me your six-pagers,
give me all your documents.
And I just got really,
really good at just reading
these documents to get up to speed,
which was a self-reinforcing
and virtuous cycle, which is just like, okay, well, now I need to express myself. And so I will
write a six-pager, and that will set the context for whatever we're working on. We'd go to a
meeting. You would read the six-pager, and it was just super great to just actually just have people
do study hall at the beginning part of a meeting where everybody just gets fast-forwarded. And then you
have a really great discussion at the end. That is an amazing culture that I think almost
every other company should replicate if they could.
But I think the difficulty would be that you actually have to be disciplined and principled,
have a reading culture, and then actually value
writing.
Yeah.
I almost wonder if unless it comes from the top, some of these things might just be really hard
to do.
Yeah.
One thing that I noticed is we're in your studio right now and you have a lot of these
blocks.
And I asked what they are.
Are they for promotions or projects
or whatever? They're for patents.
Yeah.
And this is for patent number 10,824,964.
Can you tell me about why you have these, how they come about, what you needed to do for them?
So the highest order bit is like, you know, for better or for worse, there are software patents that exist.
Amazon, they'll say that basically the reason they have them is defensively because, you know, other people will assert that, hey, you're in
violation of our patents or our IP. And then, you know, we'll use them reactively. Okay, fine, but,
you know, you're also in violation of these other things. And so, you know, there's a, there is a
culture of trying to make sure that, you know, we protect ourselves in that way. But, you know,
there's the other part of software patents, which is basically like, hey, can you really patent,
like math or whatever? And so what I learned over time is that, you know, I'm just a really bad
IP lawyer, even though, you know, as a principal engineer, I might cosplay as somebody that
really understands software patents, right? At the end of the day, you know, what we would do is we
would take our important six-pagers and we would hand them over to the legal team. And then they
would just be like, oh, this stuff is really interesting. Like, let's explore that. And so it turned
into this awesome thing where, like, we just had ready inputs to go into like the, you know,
into that particular system. A writing culture turns out has a bunch of benefits. Exactly.
And I think there's this concept called the curse of knowledge, which is essentially: if you understand something, you discount how long it took you to get there and overestimate how easy that concept is.
And so it's just like you don't get it, you don't get it, you don't get it.
And then you get it.
And then you're like, oh, that's trivial, right?
Even though, you know, there could have been, you know, it could actually be novel or it could actually be interesting.
And so what ends up happening is that you would just throw these documents over to the lawyers.
And then they would basically be like, oh,
this stuff is great. And you would just be like, well, that's just, that's just regular software
development. Or that's just the context and domain that we were living in. You know, it turns out that
there's some, some interesting stuff. This particular patent I'm, I'm proud of. So there's a system
design interview question that seems to be popular right now, which is like design Ticketmaster.
Right. And so I worked on Amazon Tickets and, you know, we ended up shuttering that business,
but, you know, we ended up building one of the fastest ticket-selling systems
in the world. Right.
We could do many, many orders per second.
So the use case is basically at T0 for a really big ticket onsale.
That's when the maximum amount of demand and requests are coming in.
And you want to sell out all of your ticket supply as quickly as possible.
The problem is, I think, one where you have seated concerts.
And so when you purchase a ticket, you know, most of the time with the system design stuff,
it'll be like general admission or it won't be a high-ticket onsale, you know, like one with a bunch of
demand.
You have to find contiguous seats.
Yeah.
So the ones that are next to each other.
Yes, exactly.
And so, you know, it's actually really hard.
Like, suppose it was a SQL database as your backing store.
Like, how do you come up with a SQL query that's just like, hey, give me the best four tickets,
you know, within this particular price range that are sitting next to each other?
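To make the difficulty concrete, here is a rough sketch of what such a query could look like, using Python's built-in sqlite3 module. The table layout, the column names, and the definition of "best" (front-most row, then left-most seat) are all assumptions for illustration; nothing here is taken from Amazon's actual system. Note that the query joins the seat table against itself for every candidate starting seat, which is the kind of per-request search the discussion below tries to avoid.

```python
import sqlite3
from typing import Optional, Tuple

# Hypothetical schema for illustration only: one row per physical seat.
SCHEMA = """
CREATE TABLE seats (
    row_num INTEGER NOT NULL,
    seat_no INTEGER NOT NULL,
    price   REAL    NOT NULL,
    status  TEXT    NOT NULL,   -- 'free' or 'taken'
    PRIMARY KEY (row_num, seat_no)
)
"""

# For every candidate starting seat `a`, count the free, affordable seats in
# the window [a.seat_no, a.seat_no + k - 1] of the same row. The window is a
# valid block only if all k seats in it qualify.
QUERY = """
SELECT a.row_num, a.seat_no
FROM seats AS a
JOIN seats AS b
  ON b.row_num = a.row_num
 AND b.seat_no BETWEEN a.seat_no AND a.seat_no + :k - 1
WHERE b.status = 'free' AND b.price <= :max_price
GROUP BY a.row_num, a.seat_no
HAVING COUNT(*) = :k
ORDER BY a.row_num, a.seat_no   -- assumption: "best" = front-most, left-most
LIMIT 1
"""


def find_block(conn: sqlite3.Connection, k: int,
               max_price: float) -> Optional[Tuple[int, int]]:
    """Return (row_num, first_seat_no) of the best block of k seats, or None."""
    return conn.execute(QUERY, {"k": k, "max_price": max_price}).fetchone()


def demo_connection() -> sqlite3.Connection:
    """Two rows of six seats at 50.0 each; seat 3 in row 1 is already taken."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [(1, s, 50.0, "taken" if s == 3 else "free") for s in range(1, 7)]
    rows += [(2, s, 50.0, "free") for s in range(1, 7)]
    conn.executemany("INSERT INTO seats VALUES (?, ?, ?, ?)", rows)
    return conn
```

With the demo data, asking for four seats under a price of 100 skips row 1, because the taken seat splits it, and lands on the start of row 2.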
Yeah.
Now you're thinking, so this is a real-world thing where you want to be as efficient as possible in terms of resource usage.
Maybe you want to minimize your CPU or memory, depending on what you have, I assume.
And you need to do it as rapidly as possible to give this to people.
Okay.
So now we're talking about a problem that seems like pretty novel in some ways, right?
Yeah.
Yeah.
And so, you know, I was, I did this patent with a senior principal.
I was a senior engineer at the time.
But the idea is like, you know, what is the theoretical maximum speed by which we could, you know, show this inventory to people?
And it turns out that, you know, even if you have a high-ticket onsale, you only have like thousands of tickets at the end of the day.
Yeah.
So instead of making a request to like a back end that would conduct some sort of search across the space, what if you actually inverted it and then you basically had each of the
individual hosts have like some view on the entire arena or venue that was there.
And you loaded up all of that availability and inventory into like L2 cache on a CPU.
Yeah.
Because it's actually not that many.
So if you have this compact representation.
Yeah, L2 cache is pretty big.
Yeah.
Then what you can do is you can do bit manipulation to like really, really quickly get
contiguous seats that are there.
And then what you do is you can like send in that particular
request and try to like reserve those particular seats.
Now it's a locking problem.
Which is much more tractable than like, hey, there's, you know, two million people that have
just hit your onsale.
And each of them, I'm going to search for each of them.
Yes.
So the inversion of that ordering process by which you like actually send out the inventory to
the individual nodes and then like load it up into CPU cache and then just do bit manipulation
and then try to lock that resource from the individual nodes,
that was the basis of this particular patent.
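The bit-manipulation step can be sketched in a few lines of Python. This is only an illustration of the general idea, not the patented design: it assumes one row of seats is packed into a single integer where bit i is set when seat i is free, and the function names (row_to_mask, find_contiguous, try_reserve) are made up for this example.

```python
from typing import Optional


def row_to_mask(seats: str) -> int:
    """Pack a row like "1101111011" into an int; bit i is set when seat i is free."""
    return sum(1 << i for i, c in enumerate(seats) if c == "1")


def find_contiguous(mask: int, k: int) -> Optional[int]:
    """Return the lowest seat index that starts k adjacent free seats, or None."""
    run = mask
    # After ANDing with k - 1 shifted copies, bit i of `run` is set only if
    # seats i, i + 1, ..., i + k - 1 are all free.
    for shift in range(1, k):
        run &= mask >> shift
    if run == 0:
        return None
    # `run & -run` isolates the lowest set bit; bit_length() gives its index.
    return (run & -run).bit_length() - 1


def try_reserve(mask: int, start: int, k: int) -> Optional[int]:
    """Clear k seats beginning at `start`; return the new mask, or None if
    any of those seats is no longer free."""
    block = ((1 << k) - 1) << start
    if (mask & block) != block:
        return None
    return mask & ~block
```

For the row "1101111011", find_contiguous with k = 4 returns seat index 3, and a second try_reserve on the same block returns None. On a real fleet, each host's copy of the mask can be stale, so the reserve step would still have to be confirmed against an authoritative store; that is the locking problem mentioned above.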
Awesome.
That's clever.
And that sounds like some, you know, people are always asking like,
oh, you know, on my job, I don't use the algorithm stuff
or any of the formal methods.
Sounds like there are some uses of it,
especially when you're trying to figure out what is it,
like when you're just taking away from the patent,
just having a problem like this and saying,
like, what is the theoretical limit that we can do?
What is the fastest possible?
To answer that, you probably want to have access to these tools.
So it's not always a waste of time and effort to actually get into these things.
And so what are you up to now that you've left Amazon a year ago after like 17, 18, very long years?
You know, I'm just making content.
I'm just sort of living the dream there.
You know, making YouTube videos.
I started up a newsletter.
I have a Discord community,
and yeah.
Yeah, and we're going to link all of those below.
I actually first got to know you before we started talking.
This was probably a few years ago from your YouTube videos,
which are, you know, you share a lot about like Amazon things,
software engineering things and just like your general thinking.
But yeah, your newsletter is a new one.
So we'll link it in the show notes below.
It's always a good way to keep in touch.
And also, you know, like on your YouTube channel.
Awesome.
So as closing, I have some rapid questions.
Okay.
So I'll just ask, and
you just shoot what comes to mind. What is career advice that greatly helped you in your path?
Yeah, I mean, this is, you know, I talk a lot about this. It's kind of like, oh, what's,
what's your favorite food or your favorite movie? It's just like there's so much there and it's
hard to pick one. What I would say is instead of saying like, hey, what's the technology that I
should learn that's really going to, you know, make my career, you know, solid? Instead, sort of
flip it around and say, like, how can I quickly learn skills
that make you sort of like recession proof, right?
That sort of makes you valuable.
It's essentially meta learning.
It's like, how can I learn something faster and faster?
If that's your focus, then you'll always be, you'll never have a problem finding a job,
and you'll never have a problem progressing in your career.
Now, some of the skills may be difficult to find resources online, but, you know, I think
if you just sort of think about like what's a valuable skill that if I knew right now would,
you know, make my, you know, job search easier or would like make me, you know, perform better on the
job. And then just sort of thinking about acquiring that skill as quickly as possible.
And do it now. Like, don't wait. Yeah. People tend to postpone things. They'll be like,
oh, well, I'll start when, you know, everything is lined up. But like to begin, you just need to begin.
Like, only when you start something will you know what you need to do, instead of saying like, oh, I need to get everything that I need first before I start.
You've used a lot of programming languages.
Which one's your favorite?
And why?
And which one do you dislike most?
Yeah.
You know, I have like a, you know, obviously there's no perfect programming language.
What I would say is like I really enjoyed Perl and nobody would ever give that answer,
but I just like this concept of like there's just so many different ways to do it. It's a write-only
language, like you can't read anybody else's Perl, and it's actually one of the
languages that like uses up the most power. It's like the least efficient, it's interpreted, it's
just like terrible. Also most of Booking.com still runs on it, or some of it. Yeah, Amazon's back end
was, you know, for a long time, it still might be, you know, sort of like Perl Mason, which is sort of like
web technology bolted onto Perl. But I just kind of like it. I just feel like I can
express myself and there's just like, there's just,
however you'd like to express
yourself, you can.
It also looked like an ASCII factory blew up
sometimes, and so it's just like, it's,
you know, now that it's on a podcast,
I wouldn't really, you know, advertise
that fact. The best programming languages
right now, I think Rust is pretty
interesting, so I might, you know, pick that up.
At the end of the day, like,
I really
love the boring languages.
Yeah. So, you know, Java
with, you know, for all of its
stuff like its verbosity and
I think it's just a great language
like a JVM based language
that has
essentially like great
library support and a bunch of stuff
written for it but it's just like super
boring maybe it's just because I'm from Amazon
and we do this like enterprise stuff like
it's a fine language
and then I see you have a large
bookshelf here you also read a lot
especially at Amazon, although mostly internal documents.
what is a book that you would recommend
something around software and
that you enjoyed and it cannot be that book.
It can't be your book.
What I would say is, you know,
I had just given the advice about, you know,
meta learning and career growth.
I think that most software developers should read a book by Cal Newport.
It's called So Good They Can't Ignore You.
And so the concept there is around career capital.
So like what are the skills that are in the most demand?
And if you can just like learn those skills,
then you become in demand.
And then, you know, from there you can choose,
what type of lifestyle that you'd like.
You can also like sort of lean into
some of the science of meta-learning,
so deliberate practice, spaced repetition,
that sort of thing.
In terms of like tech books,
I think the new AI engineering book
by Chip Huyen is amazing.
I think DDIA,
so Designing Data-Intensive Applications.
So good. A new version is coming
at the end of the year, actually.
I'm excited about that. I think that'll be pretty good.
But you know, at the end of the
day, like, you don't want one book on your bookshelf. You want 50 books on your bookshelf. And so,
you know, I think within a particular subgenre of tech books, you know, I'd have recommendations
there.
Steve, this was great. Awesome. Really enjoyed it.
Yeah, great. Thanks so much for having me.
Thanks a lot to Steve for sharing all these details. Although Amazon's principal engineering level
feels surprisingly difficult to get promoted to, I have yet to hear of as strong a
principal engineering community as the one Amazon has built and keeps investing in.
This community itself could be reason enough to consider the company at the principal-plus
level, should you have the opportunity to do so.
For a deep dive into Amazon's engineering culture, including the details on compensation,
career ladders, performance reviews and engineering processes, check out The Pragmatic
Engineer deep dive linked in the show notes below.
If you've enjoyed this podcast, please do subscribe on your favorite podcast platform
and on YouTube.
This helps more people discover the podcast, and a special thank
you if you leave a rating. Thanks and see you in the next one.
