The Pragmatic Engineer - What is a Principal Engineer at Amazon? With Steve Huynh

Starting point is 00:00:00 If you're going to optimize for performance, saying, why can't we be at one millisecond or why can't we be at 10 milliseconds and start from there? Instead of sort of saying, hey, let's try to decrease latencies by 50% or 25%. Let's just start from what is the conceptually fastest thing that we could do. And that's actually how Amazon was created. Amazon's principal engineering level is unique in many ways across big tech. Steve Huynh was a software engineer at Amazon for 17 years and worked the last four years as a principal engineer. Today, we talk about the ins and outs of this role, including why being promoted from senior to principal is so hard, even though Amazon usually has hundreds of principal engineering

Starting point is 00:00:36 openings and thousands of seniors trying to get into these positions. The Amazon principal engineering community, the in-person events, the Slack group, and the Principals of Amazon internal presentation series. Engineering concepts at Amazon around reliability, such as brownouts and COE, correction of errors, and many more topics. If you're interested in understanding one of the hardest engineering levels to get into across big tech, together with stories of how Steve thrived in this position, this episode is for you. Subscribing on YouTube and on your favorite podcast player greatly helps more people discover this show.

Starting point is 00:01:07 If you enjoy it, thanks for doing so. So Steve, welcome to the podcast. Thanks for having me. How long were you at Amazon? 17 years? Yeah, I was there for 17 and a half years. And yeah, I just quit last year. So I've been basically a year doing other things now.

Starting point is 00:01:26 And what were the things that you worked on while you were there? You know, people always talk about my long tenure there. But, you know, I feel like I've had like five or six jobs over that time period. I started off on, you know, a project called Search Inside the Book. I worked on the first Kindle launch. Wow. I worked on the precursor to Prime Video. I sort of like worked there at the beginning part of my career.

Starting point is 00:01:51 And then I sort of ended my career there for the last five years of my time there. I worked in payments. I worked in Amazon Local, which was sort of our Groupon project when that type of business was looking like it was going to take over. I worked on Amazon Restaurants. I worked on Amazon Tickets, which was a Ticketmaster clone. And then my last five years was working on live sports streaming on Prime Video. If you want to build a great product, you have to ship quickly. But how do you know what works?

Starting point is 00:02:23 More importantly, how do you avoid shipping things that don't work? The answer, Statsig. Statsig is a unified platform for flags, analytics, experiments, and more, combining five-plus products into a single platform with a unified set of data. Here's how it works. First, Statsig helps you ship a feature via feature flag or config. Then, it measures how it's working, from alerts and errors, to replays of people using that feature,

Starting point is 00:02:50 to measurement of top-line impact. Then you get your analytics, user account metrics, and dashboards to track your progress over time, all linked to the stuff you ship. Even better, Statsig is incredibly affordable. With a super generous free tier, a startup program with $50,000 of free credits

Starting point is 00:03:06 and custom plans to help you consolidate your existing spend on flags, analytics, or A/B testing tools. To get started, go to statsig.com slash pragmatic. That is S-T-A-T-S-I-G.com slash pragmatic. Happy building. This episode is brought to you by Graphite, the developer productivity platform that helps developers create, review, and merge smaller code changes, stay unblocked, and ship faster. Code review is a huge time sink for engineering teams. Most developers spend about a day per week or more reviewing code or blocked waiting for a review.

Starting point is 00:03:39 It doesn't have to be this way. Graphite brings stacked pull requests, the workflow at the heart of the best-in-class internal code review tools at companies like Meta and Google, to every software company on GitHub. Graphite also leverages high-signal, codebase-aware AI to give developers immediate actionable feedback on their pull requests, allowing teams to cut down on review cycles. Tens of thousands of developers at top companies like Asana, Ramp, Tecton, and Vercel rely on Graphite every day. Start stacking with Graphite today for free and reduce your time to merge from days to hours. Get started at gt.dev slash pragmatic. That is G for Graphite, T for Technology, dot dev slash pragmatic.

Starting point is 00:04:20 So that's a lot of different teams. How did you work on so many teams? Is it just like there's a lot of internal transfers? Did you get bored? Was it just you followed your manager? How does it work inside Amazon? Because when people think about companies, people who have not worked at Amazon, they would kind of assume you go, you work there,

Starting point is 00:04:36 you're on a team for like, you know, four, five, six years. Clearly not the case. You know, it depends a little bit on like corporate policy and then where you are with your career. I started as a support engineer, so sort of like an operationally focused person. And then, you know, I was basically like, I want to be a software developer. And so, you know, I think getting into the company was pretty difficult.

Starting point is 00:05:00 But once I was there, I sort of set that target and changed roles. And when I changed the role, you know, it was a natural time to move to another team. There's also some internal policy. So basically at Amazon, it used to be that you had to stay on a team for at least a year before you transferred. And if you wanted to transfer, like a senior manager or director, whoever up top, could block your transfer. And what that ended up meaning was that like certain teams that were just terrible to work on, those teams actually had more than 100% attrition over the course of a year because you measured attrition with a year-long time

Starting point is 00:05:43 unit. Amazon did something actually smart at the corporate level. They basically said, okay, well, you have freedom of movement now. This sort of happened, I don't know, probably like 13 years ago, 10, 13 years ago. And so they said, you have freedom of movement now. A VP or a director can't block you. They can say, okay, well, we need another month to get like a transition plan going. But essentially, you have freedom of movement as long as you're not on a performance improvement plan, which meant that certain teams were sources of high-quality engineering talent and certain teams were sinks of high-quality engineering talent. And it sort of created an internal marketplace for different roles.

Starting point is 00:06:25 Now, what that ended up meaning was that certain teams, they basically didn't want you to know what the policy was. They wanted you to sort of think that you were kind of stuck. But, you know, despite that sort of like local gamesmanship that was going on... Yeah, like basically some managers didn't want their best people to leave, right? Exactly. So I just say it how it is. But ultimately, I think it's a great strategy because it put the, like, if there was a team that was difficult to staff, the problem was on the management.

Starting point is 00:06:55 It wasn't something that had to be, you know, borne by the employee themselves. And so, you know, getting back to my own career journey, at a very large company like Amazon, there are so many awesome things that are going on. And, you know, I decided to just kind of go where my curiosity took me. Now, there were some times where, you know, there were reorgs or, you know, a line of business got spun down. But ultimately, you know, I think freedom of movement was one of the smartest things that Amazon did. And I think this is something that people don't really appreciate about some large companies. So, you know, not all companies are like Amazon and every company changes, right? Like today, I'm assuming it will be hard to move as many

Starting point is 00:07:41 teams within Amazon. Depending on where you are, you know, if you're in a satellite office where there's two teams, you can probably move on to the other team at max. But I think this is one of the underrated things of large companies. Like once you are in, it's almost always easier to get that job at another team from the inside. Yes. Especially because you can talk to them.

Starting point is 00:08:00 You know, I talked with the Reddit mobile team and I asked like, oh, how can you become a platform engineer on the mobile team? And they said, like, well, you know, most of our hires have been internal. They just helped us out on hackathons. They come around. They commit stuff. We know them. It's a low-risk hire.

Starting point is 00:08:15 I think it's just nice to remember that when you think of a big company like Amazon or Meta or Microsoft, it's just so many small teams. And once you're in, you actually have almost priority access to those teams if you play your cards right. Absolutely. And you know, you might interview for that team, but it's such lower stakes than an external interview. And, you know, just all things being equal, would you rather take somebody that's, you know,

Starting point is 00:08:39 internal and knows the culture, they know how software is developed within a particular context, or somebody that's just as good, but, you know, hasn't been onboarded? And I think ultimately, you're going to pick the person that's internal, all things being equal. Yeah, it's just kind of like business rationality for the most part. So one thing about Amazon and about large companies like Amazon is people talk externally about the scale. And it's hard to imagine, but can you give us a sense of the scale that you've seen or like some tough engineering challenges that you worked on that would have been just really hard to work on at a smaller startup? Yeah, I think that's the thing that you will not see at most other places: the scale of things. I'll give you a couple of examples. So, you know, Prime is the exclusive club that everybody is a member of.

Starting point is 00:09:29 Yeah. And, you know, in the U.S., the shipping benefit is probably, you know, the most popular. But globally, Prime Video is, you know, it's the thing that people use the most with their subscription. And so if you think about, you know, our service-oriented architecture and just loading up the app, the gateway page is the place where all of our requests come in. Right. And so it's just like Netflix. It's this infinite scroll of carousels. So the gateway page is it the Amazon Prime landing page?

Starting point is 00:10:05 It's the landing page there. And so you're like, okay, cool. If, let's say, 95, 99% of all of your requests are coming from that page, and that page needs to be personalized, you know, and you have a service-oriented architecture with a bunch of microservices, one request to that page turns into, let's just say, hundreds of downstream requests to different services. It might even be more than that. It's actually kind of hard to count. Yeah. And is this page, right? Like all the stuff flowing, all personalized stuff.

Starting point is 00:10:37 So that's the retail one, but I was talking about the Prime Video one. The Prime Video one. But essentially it's the same thing. Yeah. And so, you know, same thing for the retail website as well. And so if you have one request sort of spidering out into, you know, two orders of magnitude more requests internally, you start to see like really, really large scale for these microservices. So a microservice will have like a reverse proxy or load balancer in front of it,

Starting point is 00:11:02 and you are sort of unironically talking about things like tens of thousands of requests per second or hundreds of thousands of requests per second coming into your service. So the services that are like behind, you know, like there's the Prime, there's all the things loading, they're spidering out, like making, you know, to render that one recommendation, for example, for I don't know, the video that you would like, it will make a lot of requests to different services. And then so when you're operating a smaller service inside of Amazon, suddenly you're going to be hit with what you just said, 10K, 100K requests per second, that kind of scale. Exactly. And you will essentially be DDoSing yourself.
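The fan-out arithmetic described here can be sketched in a few lines. This is only an illustration: the function name `downstream_rps` and the traffic numbers are assumptions chosen to match the orders of magnitude mentioned in the conversation, not figures from Amazon.

```python
def downstream_rps(page_rps: float, fanout: int) -> float:
    """Total internal requests per second generated by page traffic.

    page_rps: requests per second hitting the landing page (assumed value).
    fanout:   downstream service calls triggered by one page request.
    """
    return page_rps * fanout


# Hypothetical numbers, for illustration only.
page_rps = 1_000   # assumed landing-page requests per second
fanout = 100       # "two orders of magnitude" more requests internally

total = downstream_rps(page_rps, fanout)
print(f"{page_rps:,} page req/s x {fanout} calls each = {total:,} internal req/s")
```

Even a modest page-level request rate lands in the "hundreds of thousands of requests per second" range once each page load fans out to around a hundred internal calls.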

Starting point is 00:11:41 You're just like, okay, cool. Let's change a caching configuration on some item details. And it turns out you've just browned out like a critical service, right? What does brown out mean? Oh, sorry, using some jargon. So if you want to talk about availability, suppose you are DDoSing a service or sending a lot of requests over to them, you can, you know,

Starting point is 00:12:09 you can, you can just take them down. That would be like a blackout. And so like you send a request, oh, you can't establish a connection. It immediately comes back. But there's a,

Starting point is 00:12:19 there's a type of outage where they brown out. So basically they're reachable. They might accept a connection. But, you know, they'll essentially time out or, or they might return partial results or, or bad results.

Starting point is 00:12:32 or the only thing that they do return is a, you know, a 500 for some percentage or proportion. After we waited a bunch of time for that. Yeah. And so, you know, now we start talking about like availability and resilience in the face of like all of this DDoSing that you're doing to yourself. And so the thing on top of scale that is going to really complicate things is your dependency chain, right? And so, you know, your service is a dependency of some other process that's going on. It depends on, you know, maybe AWS, it may depend on another service. You know, how do you make sure that if, you know, suppose there's a failure for a primary dependency and that dependency comes back up, how do you make sure you don't just like inundate it with a bunch of

Starting point is 00:13:17 requests as it's trying to recover? Yeah. And so you have all of these sort of like odd dynamics that occur. I used a brownout as something that is a perennial problem that we have, right, where there's maybe a dependency on a base service like S3 or DynamoDB, or whatever it is, there might be some increased latency that may cause a chain reaction of a dependency going down. And then one of these sort of middle-tier services would brown out. So, you know, you're an owner of the services for your team. And so then it's like, okay, what do we do in those situations? How do we know that they're

Starting point is 00:13:56 browning out? What do we do in the face of, you know, a dependency outage? And then critically, if there is an outage and then the service comes back up, how do we make sure that we give it enough space so that it can breathe so that, you know, as they're trying to recover from some sort of outage, we don't just take them down immediately again. And I guess for like most of us who are not working right now on these services, like these sound pretty cool in theory. But you're saying this was actually like, like this is not theory. This actually was like, oh, this service is going down.
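One common way to give a recovering dependency "enough space so that it can breathe" is to retry with exponential backoff and random jitter, so callers don't all hammer it at the same instant. The sketch below is a generic illustration of that technique, not a description of what Amazon's services actually do; the names `backoff_delay` and `call_with_retries`, and the default values, are assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def call_with_retries(call: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call a dependency, backing off between failed attempts.

    Only ConnectionError is retried here; a real client would also decide
    how to treat timeouts and 5xx responses (the brownout cases).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")
```

The jitter matters as much as the exponential growth: without it, every client that failed at the same moment would retry at the same moment, recreating the spike that took the dependency down.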

Starting point is 00:14:30 we are literally having 100K requests per second, and we're like pushing that on to like other three services with the same because we need to invoke three other services. One of them has browned out. What do we do now? How do we fix it? Yeah. And I think for certain other large tech companies, you know,

Starting point is 00:14:49 you can do best effort, right? Which is basically like, hey, we're temporarily down, but, you know, you can, you have some sort of degraded service that makes sense. But if you're on, say, a website that does purchases, now we're talking about transactions. Or if you're in the prime video, like live video streaming use case, now we're talking about a football game that you're unable to see.

Starting point is 00:15:16 And then when we recover, the game might be over. Yeah. Right? And so it's much higher stakes. And so I think the scale with transactional semantics, right, like that's actually the challenge that you're not going to see unless you sort of like work for a payment processor or something like that. Yeah, I guess that's real world pressure challenge. Like you are losing money. I'm starting to

Starting point is 00:15:40 understand why. Like I have noticed that startups love to hire from certain companies. Usually startups love to hire from other startups because it's a similar environment. From large tech companies, it's a bit of a, maybe I'm generalizing, obviously. This will not be true 100% of the time. But for example, hiring from Google, a lot of startups are not as happy because the people coming from Google are used to having this amazing team around them, internal tools. But most startups love hiring from Amazon, and I'm starting to get a sense of why this actually is. Yeah, I think that's part of the culture.

Starting point is 00:16:08 You know, you get hired as a software developer, and they hand you a pager. And before, you know, phone apps and things like that, it was like this pager from the 90s. And it's really great because you have to, you have to like operate the software that you write. If you actually, you cannot write the software, hand it over to the testing team, and then throw it over to the SRE team after you're done.

Starting point is 00:16:34 Like, you own that piece of software. Yeah, yeah, at every team, right? One interesting thing that we talked about yesterday over dinner with Casey Muratori is you said something interesting on how Amazon measured how on their retail website, I think it was retail, maybe Amazon Prime, the lower the latency of something loading, like a page loading, like a purchase page or a purchase button loading, the more revenue they got. And they started to measure and there was a linear correlation: the faster it was, the more people converted, and it seemed it had no end. And the question

Starting point is 00:17:04 Casey asked is like okay if this is the case what would stop Amazon because you have the best technologies in the world you have AWS you know you can build whatever you want to get the latency of the website down to let's say like 10 milliseconds or even one millisecond because if this goes up you would maximize

Starting point is 00:17:21 revenue. So can you tell me about like how that thing, like this measurement, actually happened? And, you know, why is Amazon's website still maybe not the fastest in the world, even though it would generate so many more billions, right? Yeah. Well, there are a couple questions embedded in there, but we'll start with the, you know, the latency to gross revenue measurement. So essentially somebody way back when, you know, because we invest in logs and telemetry, started tracking how much gross revenue we would make based off of like the latency for detail pages, based off latency of Gateway, based off of latency of the checkout pages.

Starting point is 00:18:01 And they noticed this dynamic where it's like if you're faster, you just make more money. It's a pretty clear correlation. I think you would even go as far as to say it's causation. And so there was this really big focus on latencies. I love the idea that, you know, if you're going to optimize for performance, saying like, why can't we be at one millisecond or why can't we be at 10 milliseconds and start from there, instead of sort of saying, like, hey, let's try to decrease latencies by 50% or 25%, like let's just start from what is the conceptually fastest thing that we could do. And I think in a vacuum, the conceptually fastest thing that we could do is sort of like a monolith, which is how Amazon started, where, you know, you have a web server with all of your catalog

Starting point is 00:18:51 information, so all of the items that are there. And then transaction processing on the host. That would be the fastest way to run the retail website. And basically like a web request would be it opens the HTTP or HTTPS handshake. It hits the server. The server in an ideal world has everything cached or calculated. It sends it back. So the total like latency would be the time for this request, the time to transfer that data based on your internet speed. And that's it. That is the absolute, you cannot be faster than that. I don't think so. Maybe you can do some exotic protocol that, I don't know, predicts the future.

Starting point is 00:19:26 Like, with UDP, sends it. But yeah, but this is your baseline. I guess the optimal would be like zero click instead of like a one click checkout, right? So we just send you stuff before like you know you want it. That would be the, I guess, the theoretical maximum. But, you know, if there's some sort of like web request, right, so some HTTP request and then some sort of like buy button, that would be the fastest, right? And that's actually how Amazon was created.

Starting point is 00:19:50 We bought this, you know, it's sort of the opposite of horizontal scaling. It was vertical scaling. We bought these big Sun boxes, and we hacked up our own web server in C++. And to scale up, we bought bigger hardware. And then when that didn't work, we bought like six of these big boxes, and that ran Amazon.

Starting point is 00:20:10 And we ran that way up until the early 2000s. And then what we realized, we ran into a wall, which was that, you know, when you built the C++ binary, the binary could only be four gigabytes. And that was a hard limit based off of the 32-bit software, the architecture that we were running on before.
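The four-gigabyte figure is consistent with what 32 bits can address. As a quick check, assuming the limit came from 32-bit addressing (the transcript does not say exactly which part of the toolchain or platform imposed it):

```python
ADDRESS_BITS = 32  # assumption: the ceiling comes from 32-bit addressing

limit_bytes = 2 ** ADDRESS_BITS   # number of distinct byte addresses
limit_gib = limit_bytes / 1024 ** 3

print(limit_bytes)  # 4294967296
print(limit_gib)    # 4.0
```

No amount of product pressure changes that number, which is why the developers describe it as a hard constraint rather than something to be negotiated.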

Starting point is 00:20:32 We could not get above four gigabytes. And so these product managers would come and just be like, well, just make a change for me, right, to the devs. And then they would just be like, I don't think you understand that this is a hard constraint. And so we... The size of the code or the binary code, the compiled one, it was there.

Starting point is 00:20:48 And you had so much business logic by then that it just filled out four gigabytes. Yeah. Yeah. And we had a distributed C++ build, so you know, you could, you know, it would take many, many hours for it to compile. And so we would distribute it across desktops. And it was this whole big thing. But we ran into that wall.

Starting point is 00:21:06 And so what we decided to do, and I think this was super smart, was like to lean into service-oriented architectures, right? And microservices. Yep. And when you break it down, a web service call is essentially, it's a remote procedure call. So you have this execution pointer and then you're like, okay, well, I need to do some computation or I need to gather some data. I'm going to in turn make an HTTP request downstream to another service and then you can sort of chain those things together. And so getting back to the original thing about performance, in a world where you have thousands and thousands of developers building, you know, this stuff, and the fact that you cannot have a monolith as big as Amazon retail, you know, past something that's sort of like circa-2002 Amazon size, you have to lean into remote procedure calls.

Starting point is 00:21:55 You have to say that there is a web service. The best performance that you can actually get is always going to be bounded by the number of web requests that you end up making, whether it's the first order calls to say go get the item details, but then also any blocking call that happens downstream. By blocking call, we mean like you need to wait for this to finish to get your data. Like, you know, a service that, like, returns, I don't know, your top five most likely to buy things,

Starting point is 00:22:22 it might need to make those, let's say, five requests or just one request. It needs to wait for that before it can return. Exactly, exactly. And you can do this telemetry stuff. You can do this observability stuff to figure out, you know, within that service call chain, what the blocking call is. And you can get some amount of visualization on it. And so then you can get down to the point where it's like, okay, if we're going to start

Starting point is 00:22:44 from first principles, what's the least amount of latency that you can get for, say, like a web request or a checkout page call, you're going to run into like the absolute minimum, right? And it's going to be based off of like what are the required operations, you know, evaluation or transactions or whatever for that particular request. Yeah. And then basically so as I understand like as it became a microserve like more microservices and services, this is great for maintainability and also you just, well, you first just solve the issue of the monolith size. And, you know, as we know with history, of course, like now teams could be more autonomous.
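The idea of a latency floor set by blocking calls can be sketched as a small calculation over a call graph. Everything here is hypothetical: the service names, the per-service processing times, and the function `min_latency_ms` are made up for illustration, and network time between services is ignored.

```python
from typing import Dict, List, Tuple

# service -> (own processing time in ms, blocking downstream calls)
# All names and timings are invented for this example.
CallGraph = Dict[str, Tuple[float, List[str]]]

CALL_GRAPH: CallGraph = {
    "gateway": (5, ["recommendations", "item_details"]),
    "recommendations": (20, ["purchase_history"]),
    "item_details": (8, []),
    "purchase_history": (12, []),
}


def min_latency_ms(service: str, graph: CallGraph, parallel: bool = True) -> float:
    """Lower bound on latency for a service, given its blocking calls.

    If downstream calls are issued in parallel, the slowest one dominates;
    if they are issued one after another, their latencies add up.
    """
    own_ms, deps = graph[service]
    child_ms = [min_latency_ms(dep, graph, parallel) for dep in deps]
    if not child_ms:
        return own_ms
    return own_ms + (max(child_ms) if parallel else sum(child_ms))


print(min_latency_ms("gateway", CALL_GRAPH, parallel=True))   # 37
print(min_latency_ms("gateway", CALL_GRAPH, parallel=False))  # 45
```

In this toy graph the critical path is gateway, then recommendations, then purchase_history. Speeding up item_details would not change the parallel result at all, which is the kind of thing tracing the blocking call in a service call chain is meant to reveal.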

Starting point is 00:23:21 They're not as dependent. They could do the APIs, but it was a tradeoff for latency. And now, like, you had to go back and figure out the blocking calls, how to speed those up, how to do, I guess, you know, trade off things like caching. Like, you know, you can have things fast, but it might not be as correct on the first one.

Starting point is 00:23:38 Or, like, just tricky UI where you don't show the data just yet, but it's coming, and the user gets a sense of, like, progress, those kind of things. And it also, I think, forces teams, and product, to really say, okay, like, what is the strictly necessary processing that happens on this page? Some of the work that I was doing before I left Prime Video was basically, like, you have these really, really big, heavy gateway page, you know, or landing page requests. And, you know, if you're in a situation with high load, can you preemptively reduce the amount of, say, personalization that's going on to sort of speed up that page or, you know, to increase the amount of like throughput that you're able to have? So to serve more customers, can you do that in a smart way,

Starting point is 00:24:28 right, that sort of anticipates load that's coming onto the, to that page? Say if there's a football game coming up or something like that. Yeah. Sounds like these are just like a, they seem just hard to solve, but now you have to solve them. So it sounds like this, this kept you busy. Everyone else busy at Amazon, to this date, right? Like, is this, do you think is this ongoing engineering challenge for Amazon? Because, you know, what I would imagine, the tricky thing being here is like, okay, you can optimize whatever you have. You can find the critical paths. But Amazon keeps growing, right?
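Reducing personalization ahead of anticipated load can be pictured as a simple policy function. This is a minimal sketch under invented assumptions: the thresholds, the level names, and `personalization_level` itself are hypothetical and are not taken from Prime Video's actual system.

```python
from typing import Optional


def personalization_level(current_rps: float, capacity_rps: float,
                          forecast_rps: Optional[float] = None) -> str:
    """Pick how much personalization to do for the landing page.

    Uses the larger of current and forecast load, so the page can be
    lightened *before* an expected spike (for example, a game kickoff).
    Thresholds are arbitrary illustration values.
    """
    expected = current_rps if forecast_rps is None else max(current_rps, forecast_rps)
    utilization = expected / capacity_rps
    if utilization < 0.7:
        return "full"      # all personalized carousels
    if utilization < 0.9:
        return "reduced"   # fewer personalized carousels, more cached ones
    return "generic"       # same cached page for everyone


print(personalization_level(500, 1000))                    # full
print(personalization_level(500, 1000, forecast_rps=950))  # generic
```

The second call shows the anticipatory part: current load alone would allow full personalization, but the forecast pushes the page into its cheapest mode ahead of time.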

Starting point is 00:25:03 Like, there's new teams, new services, new everything coming on. So this thing will change all the time. It's an ongoing puzzle to solve. Yeah, absolutely. Yeah, I think, you know, they definitely have a ton of work in front of them. Also, you know, it's part of their ethos to really like launch new lines of businesses really quickly. And so, you know, the ability for a team to go from zero to launch product within the confines and the context of a large corporate entity, I think that's, you know, part of the DNA that's there.

Starting point is 00:25:33 So as long as they're planting seeds, as the sort of like internal terminology is, I think that, you know, software developers will be in demand for quite some time. Yeah, I guess it's a good reminder that, you know, every now and then we have the monoliths versus microservices debate, and it sounds like it kind of just makes sense for a startup to start with the monolith. Like you can always do what Amazon did and you have the benefits of latency. Everything is in one place. Like I'm sure there might be reason to start with microservices. But if you're a small team, I mean, even today, I don't think that argument changes, right? Like Amazon got really big wins by starting with a monolith back in the day. Yeah, absolutely. I think it just makes a ton of sense to start with a monolith, wait till it breaks, and then the part where it breaks is when you have, like, 50 developers working on the same piece of code. Once that sort of breaking point occurs, then you start to, like, try to figure out, like, how you can sort of break things up. But starting with a microservice architecture, especially when you're small,

Starting point is 00:26:32 like, what a waste of time and energy. Totally. So you were a principal engineer in Amazon, and apparently I've learned that most companies have different levels and again this principal engineer some companies have like staff level but it's usually like entry level mid level senior and then you have staff

Starting point is 00:26:50 or in the case of Amazon it's principal. I've learned that Amazon's principal level is both really hard to get into compared to a lot of other companies and it's pretty special in some ways. So we'll talk about that, but can you tell me like how is the career kind of development,

Starting point is 00:27:06 because most people imagine like oh it should be pretty straightforward. I spend like, I don't know, two years as a junior, two years as a mid, roughly, and two years as a senior, then I get to principal. How does it actually work at Amazon? I think it's linear up until you hit principal, right? So, you know, you join, you're a junior developer, you get promoted to mid. At mid, you know, you're starting to influence the team, but then you get to senior, and so now your expected impact is at the team level. And then there's this jump that you get to principal. And principal is, it's L6?

Starting point is 00:27:38 principal is L7. L7. Yes. Yeah. And so I think you really have to start with like why is it, why is that jump so big? Because I think at every, pretty much any other company, it's just a linear progression. Like there's nothing necessarily special about staff. You know, you can just sort of go to that level of senior staff and then principal.

Starting point is 00:27:57 But for some reason, Amazon decided that they weren't going to have a staff level. And so, and I think they sort of like couched it around like having high standards. Basically, to get from senior to principal, you have to do like two and a half level jump. From L6, L7. Yes. Technically it sounds like one level, but at some other companies, this might be like, you know, L8, L9 or L8 and a half. Yeah. And, you know, so the hand wavy argument is like, hey, we have high standards and like, you know, it's, it means something to get to that level.

Starting point is 00:28:30 It's like, fine. But I noticed that some of the best engineers that I'd ever worked with were having such problems getting to principal engineer that they ended up moving to Facebook or to Meta or to all these other places where the progression was just sane. Now they're staff or senior staff, and, you know, principal and distinguished engineer at other companies. And so because we had high standards, we actually had this brain drain, and it wasn't a brain drain at lower levels, it was the brain drain at sort of like the higher levels. And it's just an example of something where it's just like, why did you do that to yourself? And so that

Starting point is 00:29:08 That's the context for being a principal at Amazon. It's safe to say it's wicked hard to get internally, right? So, you know, I'm colleagues with Ethan Evans. And so we talk about what's the hardest promotion at Amazon. And, you know, I had made the argument that it was, you know, it was senior engineer to principal. And he's like, yeah, that's hard. Actually, the hardest one, Steve, is, you know, VP to senior VP.

Starting point is 00:29:34 Because there's only eight spots or ten spots for that, and maybe 300 VPs that are all trying to get this. That's more of a supply and demand thing. I will say that at Amazon, there is gigantic demand for principal engineers. And so there are roles that have been open for years, I think something on the order of like 13 months or 17 months or something like that

Starting point is 00:29:58 to get an external hire to join as a principal engineer. But that metric is only calculated when the role is filled. And so probably, you know, there are hundreds of principal engineer openings at Amazon. And there are thousands of senior engineers who desperately want to get there. They're putting in the work. And so there's this sort of like, there's this tension, right?

Starting point is 00:30:22 And I don't think you see that at the lower levels. I don't think that that's happening at senior or mid or junior. And so that incongruity, I think, is super interesting. But once you do get to principal engineer, one thing that I've never heard any other company have is there is apparently a principal engineering community, which, I've heard, again, from other people, is tightly knit. It's actually special. It's actually just a really nice organization. Can you talk about that? So like, you know, once you got in there somehow, I don't know, was it blood, sweat, and tears at

Starting point is 00:30:52 promotion? There is a community. I think it's actually really great. My own history, you know, I went from support engineer to senior engineer in like four years at Amazon. But then from senior to principal, it took me eight years. And I got promoted in Q1 of 2020. Turns out to be a consequential, like, year for the industry, for the world. That was forced remote work. Yeah. And so, you know, I got promoted and everybody's like, you know, congratulations.

Starting point is 00:31:21 They used to have like a principal engineer offsite where they just flew everybody into Seattle or nearby and then to sort of like, you know, mingle and to talk to other folks. That stopped during the pandemic. And then, you know, by the time the pandemic restrictions started leaving, the population of principal engineers had essentially doubled. That's still to say, like, there are still hundreds and hundreds of openings for principal engineer, but then the, you know, the sort of like off-site community shifted over to the senior principals that I didn't have access to. But, you know, at the moment, the manifestation of the principal engineering community

Starting point is 00:31:58 is essentially through the Slack channel, which is absolutely awesome. And then we had principal off-sites for like our local organization, so like Amazon Music, Prime Video, Twitch, that sort of thing. Those meetups were amazing. So the reason they were is because of this high standard that Amazon had created. And so what it meant is that everybody

Starting point is 00:32:22 that was able to achieve that overly high standard, there's something exceptional about them. There's, you know, they're super deep in a particular technology or they were associated with, you know, the growth of a really large line of business, either within Amazon or externally. They were essentially leaders within the industry. And you could just literally, you could just scoop out five people and then put them into a room.

Starting point is 00:32:52 And the conversation is just amazing, right? And I would sort of be like, I don't even belong here. Like, look at this guy. You know, he wrote a book on a particular topic. and this guy, you know, he, you know, he was, you know, a luminary in a particular field. And then this person just like is an amazing code machine and can just write an entire application over a weekend. And then you're like, what am I doing here? You know.

Starting point is 00:33:20 I do wonder if that community might be coming back now. I know you've left, but now Amazon is back in person. Because it sounds like a lot of the benefit was the in-person part as well. Because this is what I never heard, even before the pandemic. I didn't hear it from other companies. Say, for example, at Uber, I've heard that the senior staff engineers do get together every now and then, but it was very grassroots. So it was bottoms up, but my understanding is that Amazon actually invested not just, you know,

Starting point is 00:33:47 some principal engineers saying, hey, let's get together, but also just kind of, you know, like making, making sure that that group really had something. Like, I think it's smart. I think more companies should do it, but I've just not seen it. The investment was also in terms of headcount. So there are program managers and product managers essentially that are, you know, bringing the folks together. Awesome. There's a wonderful series.

Starting point is 00:34:15 It's called the Principals of Amazon series where, you know, principal engineers will just, you know, they'll do a presentation and it's recorded. That's been happening for, you know, 20 years and, you know, we record everything that's there. But it takes work to actually... That's an internal series. And is that open to everyone at Amazon or it's for the principals? Oh, it's open for everybody at Amazon to consume. And then, you know, there might be some senior engineers and stuff like that that would make a presentation. That's part of their promotion packet.

Starting point is 00:34:45 It was to be able to make an Amazon-wide presentation on a particular thing. My point was, though, that that stuff doesn't just happen on its own. Yeah. Like, you have to, like, you need a program manager or multiple folks to... to sort of like herd the cats and to like schedule the off-sites and to make sure that the, you know, the Slack channel doesn't go off the rails, right? And it's still useful. And it's just not going to happen like grassroots with just like throwing a bunch of people

Starting point is 00:35:13 into a room. This episode is brought to you by Augment Code. You're a professional software engineer; vibes will not cut it. Augment Code is the AI assistant built for real engineering teams. It ingests your entire repo, millions of lines, tens of thousands of files, so every suggestion lands in context and keeps you in flow. With Augment's new remote agent, queue up parallel tasks like bug fixes, features and refactors, close your laptop, and return to ready-for-review pull requests.

Starting point is 00:35:40 Where other tools stall, Augment Code sprints. Augment Code never trains on or sells your code, so your team's intellectual property stays yours. And you don't have to switch tooling: keep using VS Code, JetBrains, Android Studio, or even Vim. Don't hire an AI for vibes; get the agent that knows you and your code base best. Start your 14-day free trial at Augmentcode.com slash pragmatic. I think these are the things. I mean, we're now exposing a few of these things here and there,

Starting point is 00:36:09 but some of these companies like Amazon is a great example. There's more to the eye than what meets the surface. So like once you're inside Amazon, for example, you now, as an engineer, even if not a principal engineer, you now have access to the whole 20 years of principal presentations. Like when I joined Uber, I was amazed at how we had the RFCs available, like I could read all historic ones. So I think there is, and every company has its own. Of course, once you're in there, you have access to this like knowledge base, which

Starting point is 00:36:36 will just never be published. It cannot because it has, you know, business-sensitive things, etc. So I think as an engineer, like you can just really just like, like be a sponge when you join, especially one of the companies that is known to be a bit more open internally. Amazon, I think, is a really interesting one, because externally it's very closed, is my sense. They're very careful about what they share. For example, very few of the post-mortems for AWS are published externally. But internally, they're all there. I understand that as an engineer you can access them. You can learn from them, like really cool real-world learnings. Absolutely. You know, it is an open place internally. And we are so selective about what we, I say we as though I still work there.

Starting point is 00:37:16 What they publish externally. And, you know, the post-mortems, we call them COEs. It's C-O-E; it stands for correction of error. Yeah, it's, you know, it's this idea that, you know, you have like holes in Swiss cheese, and a failure requires that there's a hole across layers. That's the best reading. Like, I would just subscribe to the email list where they were published internally. So you have this like stream of disasters that are going on within the company, and you just, you know, you grab some popcorn and you pop open one of these COEs and you learn so much from that. And I think that that's part of the secret sauce.

Starting point is 00:37:55 The idea, and I don't know if it's like this for 100% of them, is that it's a blameless culture sort of thing. And so to really screw up requires that multiple people drop the ball. And you learn so much from that sort of stuff. You know, the brownouts, you know, these lessons that you would learn from, you know, trying to recover from really large dependencies, those things are immortalized inside some of these COEs. So there's some very famous outages that happened within Amazon.

Starting point is 00:38:27 And, you know, they were an egg on our face. And we really, really learned those lessons through those postmortems. They're absolutely wonderful. As a principal engineer, you know, so far we kind of glamorized the role saying, you know, it is hard to get into. But once you're there, you have the community, you do this really impactful work. But one of the principal engineers at Amazon, who's still there, called Bavik Kotari, he collected some things that are maybe not as glamorous

Starting point is 00:38:52 or more challenging about principal engineering. He had five of these things, or five or six. I just want to go through with you and your take on them. So first he wrote, there is this paradox of belonging that you're part of all teams, yet you're part of none. What does that mean? Yeah, no, so I, Bavik was actually a peer of mine. We worked in Prime Video together.

Starting point is 00:39:14 Oh, awesome. So he's an awesome dude. Yeah, there's all of these paradoxes. and this paradox of belonging is a really interesting one. You know, you work for the organization. You're working cross teams, right? So at the senior engineer, you're embedded on a team. And, you know, you own the team's architecture, the operations, you know,

Starting point is 00:39:38 the software development lifecycle and the design. But when you get to that next level where you're working across teams, you kind of operate in this weird layer where you're not on pager duty for a particular team. You have visibility across all of these teams that are there. You're helping to guide and make decisions, but you're literally not on the ground floor anymore. And so, you know, when you work with a particular team, you know, you might call the senior engineers or the mid-level engineers in and be like, hey, let's whiteboard some stuff. Like, let's try to figure out what's going on. You're not on the team. You're kind of this like advisor that's sort of coming in, right? But then, you know, maybe a director or VP would call you in and say like, hey, what do I own? Like, what's going on? Explain to me this outage or tell me why we can't build this thing. And then you're trying to whiteboard the architecture and the system and you're trying to say like, hey, you know, this is what's going on on the ground floor. But you weren't, you know, you weren't part of that team. So you're just sort of operating in this, this sort of

Starting point is 00:40:44 strata where, you know, you don't really belong on a team. You know, I'm a, I'm an immigrant, I think you are as well. And, you know, my parents came from, from Asia. I'm not Asian, right? So when I go back to Asia, I'm definitely from the U.S. And then growing up in this country is just like, you know, I'm, you know, not quite an American, right? And so you, you sort of operate in this sort of, you know, area in the gaps where your identity is really defined by not being squarely in one of these predefined categories. So it's very similar to that as a principal engineer. You're not on the ground floor. You're not checking in, you will check in code, but you're not necessarily part of that team, embedded on that team. Even if you are for a

Starting point is 00:41:29 short time, it's usually a short time. And like tomorrow, the director could call you up and say like, hey, Steve, we need you on this other team. They're in trouble. Move over. Yeah, and you parachute in. And then, you know, then they're like, oh, who's this guy? You know, and then your director is like, what's going on? What happened during this outage? Why is, you know, why is the, why is the press writing about us? And then you're like, well, you know, here's what's happening on the ground, but you're not really embedded on that team. Which leads us to the next paradox that Bavik described. He lists a few paradoxes, and this one is the freedom of responsibility. And he writes that you enjoy significant autonomy in being able to choose what you work on. However, there's an implicit expectation and accountability for a resounding impact. Yeah. So, you know,

Starting point is 00:42:14 I reported to a VP right before I left the company. So they were your manager, basically. Yeah, my manager was a VP. Oh, wow. That's... I don't hear many companies having engineers report into VPs. Yeah. That doesn't seem very standard.

Starting point is 00:42:31 You know, and so the org that he owned, you know, I considered myself the tech advisor for that organization, was about 450 people, 450 software developers. And what did our one-on-ones consist of? Right. Like when I would have our one-on-one, it wasn't like, hey, here's, you know, he didn't assign me work. He wasn't like, hey, I need you to build this thing. I need you to design this thing. The context that he set was basically like, here's a direction, right, that you need to go. And the way that you can achieve that type of impact was up to me. Right. So he might say something like, hey, availability is so important for, you know, live sports. We just signed, you know, billion-dollar contracts with these sports leagues. And so we need to increase our availability posture. And then I would be like, okay. And then I would go away and I would come back and I would be like, you know, here's what I'm working on. Right. Like that type

Starting point is 00:43:33 of dynamic does not exist at the senior engineer and below level, where you're basically telling your boss what's happening. I was about to say that when you said, in my manager one-on-ones, he didn't tell me what to do, I'm like, most engineers would be like, sign me up. Like, I don't want, you know, we all hate micromanagement. But now when you're telling me, like, he would say like, oh, so we just signed a billion-dollar contract. Availability is important.

Starting point is 00:43:56 And he stops talking. I'm like, that sounds uncomfortable. And basically, like, you're kind of expected a little bit to, like, understand what he's expecting, even though he doesn't know. And then, and I'm assuming, you know, there's two ways of going, right? You go back on the next one-on-one and you say something. And he was like, like, Steve, like, you're a principal engineer, this is not what I expect of you, and you don't want that. Whereas this, you know,

Starting point is 00:44:18 if you're bringing back the right thing, so it sounds like you really need to up-level and like understand how like these people think. Absolutely. And so he's, you know, he's accountable to his boss as well. And, you know, don't get me wrong. I didn't, you know, I had a, I owned aspects of availability. You know, there's a multi-thousand-person organization at Prime Video doing this stuff. But we owned the live sports aspect of this. And, you know, there are playback teams. There are, you know, you know, recommendation teams. There are,

Starting point is 00:44:46 you know, there's so many different teams that are there that had to, to really step up and, and make sure

Starting point is 00:44:51 that availability was good. But he would say something like, hey, you know, what is our availability posture for certain aspects? And I would have to go

Starting point is 00:45:01 and figure it out. Yeah. Like, what are we measuring? What are we not measuring? There's a deadline for, you know, the start of a season

Starting point is 00:45:08 where we're expecting, you know, millions and millions of concurrent viewers to come in. What can we do between now and then, right? And then if we do write some software, like what is the highest leverage piece of software that we could create that would increase our availability posture? And so the way that I sort of describe it to people is you are assigned not a problem,

Starting point is 00:45:30 not even a problem space, you're assigned a direction. You can solve the problem with code. You can solve the problem with system design and architecture. But you could also solve the problem, say, by, you know, I don't know, hey, maybe there's some off-the-shelf software we should purchase. maybe there's a dev team that we should start to spin up right now whose job it is to do this particular thing. Maybe we've identified a piece of software and it's already been scoped that this team needs to go and build, but it's not a priority for them.

Starting point is 00:46:01 Now we need to go and figure out like, you know, how we can get them to do it. Can we shuffle around resources, that sort of thing? And so the way I describe it is like there's so many more things on the menu that you can use to solve the problem and I don't think people recognize that. They think that it's just, oh, when you're a principal, you just code a lot and it's just really complicated.

Starting point is 00:46:22 Or do more meetings. You know, that's what all happens. I mean, at the end of the day, like, don't get me wrong. There's a ton of meetings that go on. Yeah, yeah, but this is, I think it's good to, like, shine light. Because I also feel like once, it sounds like a big change, but I also kind of feel if you get good at this, you might not really want to go back to, you know,

Starting point is 00:46:40 having a manager is like, all right, here's a project. We need to solve, like, you know, scope it out, and which you can do, right? Yeah. That's cool. And now, the next challenge that Bavik described was, this all sounds great, but there's apparently a bandwidth challenge. So it's easy to become this, like, social resource where people just pull you into everything and you're breathing.

Starting point is 00:46:59 Yeah. No, you know, I think, I wish I had taken a screenshot, but, you know, I have my Outlook calendar, right? So it's my schedule. My day looked like most people's week. So it looked like somebody had just, like, blown up a Tetris factory. Like there was like, I would be triple or quadruple booked on a Monday all through the day. So you would have the manager calendar as an IC. Yeah. And it's absolutely crazy because

Starting point is 00:47:24 you know for that large org that I was supporting, everybody just added me as optional or or they might try to say like no, you're actually required for all of these meetings. But when you have you have a triple booked calendar and you're required for this stuff, you just learn that you're going to have to disappoint a lot of people. Yeah. And so it's this sort of like, you know, this thing where it's like it's almost easier to say no now that you're obscenely overbooked versus when you're a senior engineer, you're like, I don't have time to write code, but there's just barely enough time in between the cracks.

Starting point is 00:47:59 Yeah. And so I think that it's almost like when your schedule breaks, that's when you are finally freed because you know that you can sort of say no to stuff. But ultimately, if I just went to all of the meetings that everybody said that I would have to go to, I would be a professional meeting attender, and I would literally have no time to do the work. And then Bavik follows up on this next challenge, which is being truly present, and he writes, I think it's almost like, you know, he was sitting next to you. You find yourself physically present in one meeting while your mind

Starting point is 00:48:27 is already racing against next three. You know, it's a, it's a really big challenge. You know, I pride myself on being a good communicator and being present. And when there are, there are 20 things that are going on in the air or 100 things that are going on, it's just really, really difficult to stay single-threaded. And what I ended up having to do is to sort of say, like, okay, I could do all of these things and they would be really impactful. But I just had to aggressively prioritize and say, you know, for the availability. I'm just looking at availability.

Starting point is 00:49:00 There's all these other fires that are going on, which is disappointing because there's so many things that you know you could be focusing on. It's super difficult. And so, you know, I work with a lot of people to try to get them to the next level, and they say, Steve, I'm completely overwhelmed, there are like 20 things that are going on. And I tell them, like, do you think it gets easier when you get higher level? There's just going to be more and more things on your plate. Why wait until you burn out or you break? You can just start implementing these things now. So every high-level tech IC I know, managers included, they have a wonderful system in order to isolate signal and then cut out the noise. And if you don't have that, you literally won't survive. But it's just

Starting point is 00:49:44 at the principal level and above, it's just amplified that much more. I'm getting a sense that a lot of the work that you do as a principal engineer, I mean, there's huge amounts of software engineering and you need to be, you know, just really good at building resilient systems, learning about new technologies. You know, for example, today, I'm assuming whoever's a principal engineer at Amazon, they're expected to just know everything about LLMs, tradeoffs, characteristics, et cetera, because they're, anyway, but you also need to just build the skills that managers have, which is managing your time, changing context, figuring out how to get that focus time. Like, you know, contrary to popular belief,

Starting point is 00:50:25 like managers actually need focus time. So, like, you know, I will also always try to carve out some time. But you're now doing it while your title is not manager, but actually it feels like you combine a lot of manager responsibilities and a lot of, you know, like experienced engineer, and boom, you get the principal engineer role. Oh, the only upside is like you don't need to do performance reviews for people. Congratulations. You saved a little bit of that time. Well, actually, during performance review season, they pull the principal engineers in because if you're, if you're, so, you know, if you're stack ranking people, okay, cool, well, we'll need to take a look at their performance. There we go. So I

Starting point is 00:50:59 reported to a VP, you know, one of my peers was a director and he was basically like, hey, Steve, I would like you to show up to my performance review for my entire org of 100-something people. And I'm like, I can't do that for you and for everybody else. Okay. So that would make sense why as a principal engineer, your compensation package will be similar to like, is it a senior engineering manager or something like that? Around that. Around that.

Starting point is 00:51:23 But basically, like, the job has a lot of overlaps. Okay, the benefit is you're not the one delivering the performance reviews to the direct reports, but you're doing almost everything else, in terms of the effort I'm talking about. Yeah. Okay. So having been a principal engineer for four years, what are the good things that you really liked about Amazon, specifically Amazon's principal engineer role? And what are some of the, you know, not so good or it could have been better things? I mean, the great parts are you get visibility that you just couldn't possibly have

Starting point is 00:51:57 at the team level. You know, within a large organization like Prime Video or wherever you're at, there are many thousands of people that are working within that organization doing so many things, right? And typically the performance of these people is really high. There's so many different directions that are going on. And so to survive, you kind of have to look inward and you say, okay, well, here's my service boundary, here's all the software I own. I'm going to own everything within the sphere of ownership. Because you've built this wall up, you tend not to be able to to see like that broader picture. Yeah.

Starting point is 00:52:29 And so as a principal engineer, I think it's really awesome to be able to sort of like spelunk and be able to go to different teams and sort of see that broader picture. And I just don't, I don't see a way that you would be able to get that, that type of visibility that's super interesting at a lower level. You know, I think the other thing is like, you know, whether it's, it's warranted or not, you do get some amount of status when you go to a meeting. People just listen to you. They listen to your harebrained ideas, and it's kind of nice because you don't necessarily have to, like, prove yourself over and over again.

Starting point is 00:53:02 This is a bit less professional, like, not fights, but just establishing that you know what you're talking about. Yeah. Yeah. Now, the bad things are, you know, there's a lot of folks that are really good in tech and being really effective as a principal engineer. But then they also, you know, myself included, they're like, okay, cool. well, that sort of makes me an expert in pretty much everything. And so you would get these principal engineers together. We had a weekly meeting.

Starting point is 00:53:30 And so it would be like, okay, if you wanted to talk about, like, establishing a constitution for a small island nation, all of a sudden, they would just be like, well, like, here are the main considerations. It's like, nobody has a background in government policy. But all of a sudden, like, just because you're sort of trained to do so, you start to, like, pitch in. You're like, well, actually, you know, maybe we should have two branches of government or three branches of government. And it just sounds like we would know what we're doing, but we don't. And so there's this trap, and again, I've fallen into it many times where you actually think you're an expert

Starting point is 00:54:02 in one thing, but you're actually not, right? And so, you know, take LLMs. There's a ton of folks that understand AI. I left before it was sort of like allowed to use internally, but I think you can use it now. I'm not an expert in LLMs at all. But I do think that the expectation would be that you understand, you know, how they work. But then the expectation is also like, hey, what should our policy be? How should we be thinking about this stuff? And I think that's fine for mature technologies potentially. Like you can ramp yourself up for it. But as like that particular landscape is changing so quickly, I think there's this sort of trap where you sort of, you speak as an authority, even though you haven't had the requisite time to ramp up

Starting point is 00:54:50 something. And you were there for 17 years at Amazon. What are your favorite parts of the culture? Like, you know, there's a lot of things that there's the values that we all know, like the frugality, customer obsession. What were the things that you found to be like the most interesting or the ones that have lasting impact? And how did they change? How did Amazon change over 17 years? They must have changed. No, I think the thing I miss the most is the secret sauce. The leadership principles are good, but I think the actual secret sauce there is principled thinking. Right? And so, yeah.

Starting point is 00:55:27 So, you know, there's invent and simplify and bias for action and all of this stuff. But like, ultimately, the thing that is amazing about those leadership principles isn't the specific stances that they took. So they decided that customer obsession is a big deal. They decided that bias for action is a big deal, all of these things. But really, if you look at a meta level, you'd be like, oh, these guys have principles that they won't budge on. I sort of think about it in terms of math and axioms. Like, you just take certain things to be true, you know, two lines that are parallel. If you extend

Starting point is 00:56:02 them out to infinity, they won't touch each other. Yeah, you assume that's true. Yeah, you don't prove that. It's an axiom. And then based off of that, you're able to build a system of mathematics, right? And so it's the same thing with the corporate leadership principles at Amazon. They basically said, okay, we are going to fix these things to be true. There are 16 or 12 or I don't know. They just sort of bolted some on. They were 14 and now they're 16. But there are like four or five that are just really core to Amazon and we just

Starting point is 00:56:35 fix those things to be true. Which ones were the ones that you felt were the most present? Customer obsession. We are absolutely customer obsessed. We'll just burn money to delight the customer. You can be in a meeting with a VP as an intern and you say, hey, that's a bad customer experience. It would be like a needle coming off a record. It would just be like, what?

Starting point is 00:56:55 What are you talking about like immediately, right? You know, bias for action. So like just get some stuff done. Stop asking for permission. Just like go and do it. Right. Ownership. It's just like you own your software.

Starting point is 00:57:06 You run the, you know, you do the operations. You own the bug count, all of this stuff. Right. So those are the ones that are like, those are fixed. And then you start layering things on top of it. And I think it's really great. But, you know, you could take Amazon and you could have like the, you know, evil goatee version of Amazon, which is just sort of the opposite of those things. And that would still be a really valid and awesome company. So you could

Starting point is 00:57:30 say, okay, well, it's the opposite of customer obsession. It's not customer obsession or not being customer obsessed. I think it's, you know, like being about your staff. Yeah. Which is Google. It could be like, hey, we really care about our people above everything else. Or it could be, you know, let's not mince around it. We care about top line or bottom line revenue. Yeah. That's totally valid, right? And then you could just fix that.

Starting point is 00:57:54 You wouldn't, you can't prove that, you know, being, you know, staff focused is a bad thing. You just build that. And then, you know, a certain set of things will happen, like great things are going to happen. And then, like, not so great things are going to happen. Those not great things that happen, you can try to mitigate them, but you can't fix them because you have started with this principled approach to everything. Yeah. Yeah. Yeah, it all goes like everything has.

Starting point is 00:58:18 Yeah. I see what you mean, but I think what you're saying is like it might be less about what the specific principles are. I mean, Amazon has theirs and we know about them, but it's just sticking to them and not wiggling. Because if you keep wiggling, it's like, what was the point, right? Then you're going to have a really kind of mediocre, truly not standout company, whatever you do. What does it actually mean to be principled and to not bend? It could be really easy to do so. So that's an amazing secret sauce of Amazon's.

Starting point is 00:58:46 People look at the leadership principles. I'm like, no, it's principled thinking. Another thing. And a lot of this, honestly, from what I understand, talking to you earlier and some other people, a lot of it probably comes from Jeff Bezos, from the top down, being very principled, not giving in, saying, we will do this, whatever it takes. Sounds like it was customer obsession initially and then some other things. Yeah.

Starting point is 00:59:06 Yeah, absolutely. And he was an absolute genius when it came to that. So I'm a, you know, I'm a Jeff Bezos fan, for sure. Like, it just worked. Another thing that's Amazon's secret sauce is just a writing culture. And so, you know, I spent on the order of like one to four hours every day reading while I was a principal engineer. And we had a standard format. It was a six-page memo. And, you know, that would be our business strategy. That would be a system design. That would be, you know,

Starting point is 00:59:41 what we call the PRFAQ, so a press release and frequently asked questions for like a new line of business or a new initiative. And everybody was sort of constrained to this six-page format and everybody just produces documents

Starting point is 00:59:55 in that format for whatever they need to do. And so when I would try to get up to speed on a particular thing, I would just be like, give me your six-pagers, give me all your documents. And I just got really,

Starting point is 01:00:06 really good at just reading these documents to get up to speed, which was a self-reinforcing and virtuous cycle, which is just like, okay, well, now I need to express myself. And so I will write a six-pager, and that will set the context for whatever we're working on. We'd go to a meeting. You would read the six-pager, and it was just super great to just actually just have people do study hall at the beginning part of a meeting where everybody just gets fast-forwarded. And then you have a really great discussion at the end. That is an amazing culture that I think almost

Starting point is 01:00:40 every other company should replicate if they could. But I think the difficulty would be like you actually have to be disciplined and actually have a reading culture and principled and have a reading culture and then actually value writing. Yeah. I almost wonder if unless it comes from the top, some of these things might just be really hard to do. Yeah.

Starting point is 01:01:01 One thing that I noticed is we're in your studio right now and you have a lot of these blocks. And I asked what they are. Are they for promotions or projects or whatever? They're for patents. Yeah. And this is for patent number 10,824,964. Can you tell me about why you have these, how they come about, what you needed to do for them?

Starting point is 01:01:25 So the highest order bit is like, you know, for better or for worse, there are software patents that exist. Amazon, they'll say that basically the reason they have them is defensively because, you know, other people will assert that, hey, you're in violation of our patents or our IP. And then, you know, we'll use them reactively. Okay, fine, but, you know, you're also in violation of these other things. And so, you know, there's a, there is a culture of trying to make sure that, you know, we protect ourselves in that way. But, you know, there's the other part of software patents, which is basically like, hey, can you really patent, like math or whatever? And so what I learned over time is that, you know, I'm just a really bad IP lawyer, even though, you know, as a principal engineer, I might cosplay as somebody that

Starting point is 01:02:10 really understands software patents, right? At the end of the day, you know, what we would do is we would take our important six-pagers and we would hand them over to the legal team. And then they would just be like, oh, this stuff is really interesting. Like, let's explore that. And so it turned into this awesome thing where, like, we just had ready inputs to go into, you know, that particular system. A writing culture, it turns out, has a bunch of benefits. Exactly. And I think there's this concept called the curse of knowledge, which is essentially like, if you understand something, you discount how hard that concept was to learn. And so it's just like you don't get it, you don't get it, you don't get it. And then you get it.

Starting point is 01:02:52 And then you're like, oh, that's trivial, right? Even though, you know, it could actually be novel or it could actually be interesting. And so what ends up happening is that you would just throw these documents over to the lawyers. And then they would basically be like, oh, this stuff is great. And you would just be like, well, that's just regular software development. Or that's just the context and domain that we were living in. You know, it turns out that there's some interesting stuff. This particular patent I'm proud of. So there's a system design interview question that seems to be popular right now, which is like design Ticketmaster.

Starting point is 01:03:25 Right. And so I worked on Amazon Tickets and, you know, we ended up shuttering that business, but, you know, we ended up building like one of the world's fastest ticket selling systems. Right. We could do many, many orders per second. So the use case is basically at T0, for a really big ticket on-sale. That's when the maximum amount of demand and requests are coming in. And you want to sell out all of your ticket supply as quickly as possible. The problem is, I think, one where you have seated concerts.

Starting point is 01:03:58 And so when you purchase a ticket, you know, most of the time with the system design stuff, it'll be like general admission or it won't be a high-demand ticket on-sale, you know, like one with a bunch of demand. You have to find contiguous seats. Yeah. So the ones that are next to each other. Yes, exactly. And so, you know, it's actually really hard.

Starting point is 01:04:20 Like, suppose it was a SQL database as your backing store. Like, how do you come up with a SQL query that's just like, hey, give me the best four tickets, you know, within this particular price range that are sitting next to each other? Yeah. Now you're thinking, so this is a real-world thing where you want to be as efficient as possible in terms of resource usage. Maybe you want to minimize your CPU or memory, depending on what you have, I assume. And you need to do it as rapidly as possible to give this to people. Okay.

Starting point is 01:04:51 So now we're talking about a problem that seems like pretty novel in some ways, right? Yeah. Yeah. And so, you know, I was, I did this patent with a senior principal. I was a senior engineer at the time. But the idea is like, you know, what is the theoretical maximum speed by which we could, you know, show this inventory to people? And it turns out that, you know, even if you have a high ticket on sale, you only have like thousands of tickets at the end of the day. Yeah.

Starting point is 01:05:20 So instead of making a request to like a back end that would conduct some sort of search across the space, what if you actually inverted it and then you basically had each of the individual hosts have like some view on the entire arena or a venue that was there. And you loaded up all of that availability and inventory into like L2 cache on a CPU. Yeah. Because it's actually not that many. So if you have this compact representation. Yeah, well, cache is pretty big. Yeah.

Starting point is 01:05:51 Then what you can do is you can do bit manipulation to like really, really quickly get contiguous seats that are there. And then what you do is you can like send in that particular request and try to like reserve those particular seats. Now it's a locking problem. Which is much more tractable than like, hey, there's, you know, two million people that have just hit your on-sale. And each of them, I'm going to search for each of them.
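
[Editor's sketch] The bit manipulation described here can be illustrated with a shift-and-AND trick. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes one bit per seat (1 = free), one integer per row, and the helper names `mask_from_row` and `find_contiguous` are hypothetical. At one bit per seat, a venue with, say, 20,000 seats needs only 2,500 bytes, which is why a compact representation can plausibly sit in CPU cache.

```python
def mask_from_row(row: str) -> int:
    """Build a bitmask from a row string: '.' is a free seat, anything else is taken.

    Seat i in the string maps to bit i of the mask (seat 0 is the lowest bit).
    """
    mask = 0
    for i, ch in enumerate(row):
        if ch == ".":
            mask |= 1 << i
    return mask


def find_contiguous(free_mask: int, k: int) -> int:
    """Return the lowest seat index that starts a run of k free seats, or -1."""
    if k <= 0:
        raise ValueError("k must be positive")
    runs = free_mask
    # After i iterations, bit s of `runs` is set only if seats s..s+i are all free.
    for _ in range(k - 1):
        runs &= runs >> 1
    if runs == 0:
        return -1
    # Isolate the lowest set bit and convert it to a seat index.
    return (runs & -runs).bit_length() - 1


# Hypothetical 12-seat row: seats 1-2, 4-7 and 10-11 are free.
row = mask_from_row("X..X....XX..")
print(find_contiguous(row, 2))  # first pair of adjacent free seats
print(find_contiguous(row, 4))  # first block of four adjacent free seats
```

The search costs k - 1 shifts and ANDs per row regardless of how many buyers are waiting, so the expensive part of the request becomes the reservation itself rather than the search.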

Starting point is 01:06:17 Yes. So the inversion of that ordering process by which you like actually send out the inventory to the individual nodes and then like load it up into CPU cache and then just do bit manipulation and then try to lock that resource from the individual nodes, that was the basis of this particular patent. Awesome. That's clever. And that sounds like some, you know, people are always asking like,
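
[Editor's sketch] The inversion described above (push inventory to each node, search locally, then try to lock) can be sketched as an optimistic reservation loop. Everything here is an assumption for illustration: `SeatStore`, `Node`, `try_reserve`, and the retry count are hypothetical names and choices, and an in-process lock stands in for whatever authoritative locking service a real system would use.

```python
import threading


def find_run(free_mask: int, k: int) -> int:
    """Lowest seat index starting a run of k free seats (1 bits), or -1."""
    runs = free_mask
    for _ in range(k - 1):
        runs &= runs >> 1
    return (runs & -runs).bit_length() - 1 if runs else -1


class SeatStore:
    """Hypothetical authoritative store: the only place seats are actually locked."""

    def __init__(self, free_mask: int) -> None:
        self._free = free_mask
        self._lock = threading.Lock()

    def snapshot(self) -> int:
        with self._lock:
            return self._free

    def try_reserve(self, seats_mask: int) -> bool:
        """Atomically reserve all requested seats, or none if any is already taken."""
        with self._lock:
            if self._free & seats_mask != seats_mask:
                return False
            self._free &= ~seats_mask
            return True


class Node:
    """Hypothetical front-end host holding its own, possibly stale, copy of the inventory."""

    def __init__(self, store: SeatStore) -> None:
        self.store = store
        self.local = store.snapshot()

    def buy(self, k: int, max_attempts: int = 3):
        for _ in range(max_attempts):
            start = find_run(self.local, k)  # cheap local search, no back-end call
            if start < 0:
                return None
            wanted = ((1 << k) - 1) << start
            if self.store.try_reserve(wanted):  # the only contended step
                self.local &= ~wanted
                return list(range(start, start + k))
            self.local = self.store.snapshot()  # local view was stale; refresh and retry
        return None


store = SeatStore(0b111111)  # six free seats in one row
a, b = Node(store), Node(store)
print(a.buy(2))  # a reserves from its local view
print(b.buy(2))  # b's view is stale, so it fails once, refreshes, then succeeds
```

The design choice this illustrates is that a failed reservation is cheap: the node only refreshes its view and searches again locally, so contention is confined to the small atomic reserve step.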

Starting point is 01:06:41 oh, you know, on my job, I don't use the algorithm stuff or any of the formal methods. Sounds like there are some uses of it, especially when you're trying to figure out what is it, like when you're just taking away from the patent, just having a problem like this and saying, like, what is the theoretical limit that we can do? What is the fastest possible?

Starting point is 01:07:00 To answer that, you probably want to have access to these tools. So it's not always a waste of time and effort to actually get into these things. And so what are you up to now that you've left Amazon a year ago after like 17, 18, very long years? You know, I'm just making content. I'm just sort of living the dream there. You know, making YouTube videos. I started up a newsletter. I have a Discord community.

Starting point is 01:07:26 and yeah. Yeah, and we're going to link all of those below. I actually first got to know you before we started talking. This was probably a few years ago from your YouTube videos, where, you know, you share a lot about like Amazon things, software engineering things and just like your general thinking. But yeah, your newsletter is a new one. So we'll link it in the show notes below.

Starting point is 01:07:48 It's always a good way to keep in touch. And also, you know, like on your YouTube channel. Awesome. So as closing, I have some rapid questions. Okay. So I'll just ask, you just shoot what comes to mind. What is career advice that greatly helped you in your path? Yeah, I mean, this is, you know, I talk a lot about this. It's kind of like, oh, what's,

Starting point is 01:08:06 what's your favorite food or your favorite movie? It's just like there's so much there and it's hard to pick one. What I would say is instead of saying like, hey, what's the technology that I should learn that's really going to, you know, make my career, you know, solid? Instead, sort of flip it around and say, like, how can I quickly learn skills that make me sort of like recession proof, right? That sort of make me valuable. It's essentially meta learning. It's like, how can I learn something faster and faster?

Starting point is 01:08:37 If that's your focus, then you'll always be, you'll never have a problem finding a job, and you'll never have a problem progressing in your career. Now, some of the skills may be difficult to find resources online, but, you know, I think if you just sort of think about like what's a valuable skill that if I knew right now would, you know, make my, you know, job search easier or would like make me, you know, perform better on the job. And then just sort of thinking about acquiring that skill as quickly as possible. And do it now. Like, don't wait. Yeah. People tend to postpone themselves. They'll be like, oh, well, I'll start when, you know, everything is lined up. But like to begin, you just need to begin.

Starting point is 01:09:18 Like when you start something, only then will you know what you need to do instead of saying like, oh, I need to get everything that I need to do first before I start. You've used a lot of programming languages. Which one's your favorite? And why? And which one do you dislike most? Yeah. You know, I have like a, you know, obviously there's no perfect programming language. What I would say is like I really enjoyed Perl and nobody would ever give that answer.

Starting point is 01:09:47 But I just like this concept of like, there's just so many different ways to do it. It's a write-only language. Like you can't read anybody else's Perl. And it's actually one of the languages that like uses up the most power. It's like the least efficient. It's interpreted. It's just like terrible. Also most of Booking.com still runs on it, or some of it. Yeah. Amazon's back end was, you know, for a long time, it still might be, you know, sort of like Perl Mason, which is sort of like web technology bolted onto Perl. But I just kind of like it. I just feel like I can express myself and there's just like, however you'd like to express

Starting point is 01:10:21 yourself, you can. It also looked like an ASCII factory blew up sometimes, and so it's just like, it's, you know, now that it's on a podcast, I wouldn't really, you know, advertise that fact. The best programming languages right now, I think Rust is pretty interesting, so I might, you know, pick that up.

Starting point is 01:10:37 At the end of the day, like, I really love the boring languages. Yeah. So, you know, Java with, you know, for all of its stuff like its verbosity and I think it's just a great language like a JVM based language

Starting point is 01:10:52 that has essentially like great library support and a bunch of stuff written for it but it's just like super boring maybe it's just because I'm from Amazon and we do this like enterprise stuff like it's a fine language and then I see you have a large

Starting point is 01:11:08 bookshelf here. You also read a lot, especially at Amazon, although mostly internal documents. What is a book that you would recommend, something around software that you enjoyed? And it cannot be that book. It can't be your book. What I would say is, you know, I had just given the advice about, you know,

Starting point is 01:11:26 meta learning and career growth. I think that most software developers should read a book by Cal Newport. It's called So Good They Can't Ignore You. And so the concept there is around career capital. So like what are the skills that are in the most demand? And if you can just like learn those skills, then you become in demand. And then, you know, from there you can choose,

Starting point is 01:11:46 what type of lifestyle that you'd like. You can also like sort of lean into some of the science of meta-learning, so deliberate practice, spaced repetition, that sort of thing. In terms of like tech books, I think the new AI Engineering book by Chip Huyen is amazing.

Starting point is 01:12:04 I think DDIA, so Designing Data-Intensive Applications. So good. A new version is coming the end of the year, actually. I'm excited about that. I think that'll be pretty good. But you know, at the end of the day, like you don't want one book on your bookshelf. You want 50 books on your bookshelf. And so, you know, I think within a particular subgenre of tech books, you know, I'd have recommendations

Starting point is 01:12:28 there. Steve, this was great. Awesome. Really enjoyed it. Yeah, great. Thanks so much for having me. Thanks a lot to Steve for sharing all these details. Although Amazon's principal engineering level feels surprisingly difficult to get promoted to, I have yet to hear of a principal engineering community as strong as the one Amazon builds and keeps investing in. This community itself could be reason enough to consider the company at the principal-plus level should you have the opportunity to do so.

Starting point is 01:12:54 For a deep dive into Amazon's engineering culture, including the details on compensation, career ladders, performance reviews and engineering processes, check out the Pragmatic Engineer deep dive linked in the show notes below. If you've enjoyed this podcast, please do subscribe on your favorite podcast platform and on YouTube. This helps more people discover the podcast, and a special thank you if you leave a rating. Thanks and see you in the next one.
