The Changelog: Software Development, Open Source - What it takes to scale engineering (Interview)

Starting point is 00:00:00 What's up, friends? This week on The Changelog we're talking to Rachel Potvin, former VP of Engineering at GitHub, about what it takes to scale engineering. Rachel says it is a game changer when engineering scales beyond 100 people, so we asked her to share everything she's learned in her career of leading and scaling engineering. A massive thank you to our friends at Fastly and Fly. Those pods are fast to download because Fastly, they're fast globally. Check them out at Fastly.com. And our good friends at Fly help us put our app and our database close to our users.

Starting point is 00:00:44 No ops. No ops required. Learn more at fly.io. This episode is brought to you by Sentry. They just launched Session Replay. It's a video-like reproduction of exactly what the user sees when using your application. And I'm here with Ryan Albrecht, Senior Software Engineer at Sentry and one of the leads behind their Emerging Technologies team that built this feature. Ryan, what is this team

Starting point is 00:01:15 all about? Emerging Technologies has been one of the greatest teams I've been working on in my career. And I think it's been highly successful. We just today launched Session Replay. And so it's a big celebration here. But I think that what we've built is going to be able to help all of our customers to solve their problems faster and really look at debugging and fixing issues in a new way. So what is Session Replay? Session Replay, it's a video-like reproduction of what your users saw. Instead of recording a video, we're recording the actual DOM nodes that appear and disappear on the screen. And then we can replay those to you in your own browser. So what

Starting point is 00:01:48 this lets you do is you can actually see exactly what the user experienced in the application, take the guesswork out of trying to triage and what are the reproduction steps, stop at a point and inspect the DOM to see, you know, was this paragraph tag in the right spot? What are the CSS and the background colors? You can look at everything as if you were on that customer's machine. There you go. So if you've been playing detective, trying to track down support tickets, read through breadcrumbs, stack traces, and the like, trying to recreate the situation of a bug or an issue that your application has, now you have a game-changing feature called Session Replay. Head to Sentry.io and log into your dashboard. It's right there in the sidebar to set up in your front end.

Starting point is 00:02:27 And if you're not using Sentry, hey, what's going on? Head to Sentry.io and use the code CHANGELOG when you sign up. Again, Sentry.io and use the code CHANGELOG. Our listeners, well, you get the team plan for free for three months. Enjoy. So we're here with Rachel Potvin, former VP of Engineering at GitHub. But Rachel, you've done some amazing work at Google, engineering manager, engineering leader. Your previous role at GitHub has been amazing.

Starting point is 00:03:20 You've been the VP of Data at GitHub, making sure that lots of people can collaborate on code, which is just the most amazing thing, right? So, of course, welcome to the show. Thank you so much for having me. I'm glad to be here. Well, this is, I guess it's kind of a good time to be just leaving GitHub or being at GitHub because you guys have just done so many amazing things. You got Copilot out there. You got all sorts of things happening. Actions is just amazing. But let's talk about some of your, I guess, some of your history there.

Starting point is 00:03:49 What are some of the amazing things you've done? You've done some cool stuff, but I don't want to say what you've done. You tell us what you've done. Thanks, Adam. Yeah, it's just been just a real privilege and just so wonderful to get to work at GitHub. It's really an incredible company doing some really, really great things for developers around the world. And so, you know, it's easy to talk about so many great accomplishments. You know, I had the great privilege of leading a large swath of the product engineering team,

Starting point is 00:04:16 in fact, most of it. And so there were so many things that happened within my team that I'm just so happy to have seen get out to developers. So for instance, I got to form the team that created GitHub's advanced security product area. This came from nothing. And, you know, with a fantastic acquisition from a company called Semmle, we built up that product area to over 100 million ARR in under three years, which was a really exciting journey and really fun to work with all sorts of folks on that. Like you said, you know, my team launched Copilot, we launched Codespaces, you know, a personal favorite of mine, we launched the new GitHub code search

Starting point is 00:04:54 and navigation experiences, which I think is just phenomenal for developer productivity. You know, I got to bring lots of renewed focus to the core productivity experiences, even, you know, around repos and issues and projects and PRs, really investing in the scalability and sustainability of that legacy code base. But honestly, I would say that my favorite work, and this is, you know, kind of on brand for me, I guess, is less about specific product milestones, though those are always, you know, really, really exciting. But I really get a lot of happiness from building healthy engineering practices and a strong engineering culture that really can sustain these product launches and these features and this growth. And, of course, all of the excellent people involved. So, you know, in my role, I used to always say I'm like 50% focused on the product areas that I'm managing and 50% focused on all of engineering and what needs to happen to keep our engineering teams happy and healthy. Yeah. I'd love to examine the other 50% to some degree, because I feel like there's a lot of

Starting point is 00:06:00 personal details, I guess, a relationship that comes into business or into leading teams that just sort of kind of goes somewhat by the wayside when describing accomplishments. Like, I'm so glad that you said, you know, how you divide that up because so often is it we did this, we launched that, it was amazing. This is how it scaled. This is what was the impact. But at the same time, you kept people healthy, employed, not crazy, showing up to work, keeping their fitness, keeping their self-care going, their marriages and relationships going, you know, not just shipping, right? Yeah. One thing I was, you know, really proud of was, you know, the level of attrition within my team during the pandemic was quite a lot lower than, you know, a lot of other areas.

Starting point is 00:06:47 And I think that's because of the focus on healthy engineering teams. And look, like the size of my team grew a lot during my three and a half years at GitHub. And GitHub engineering actually tripled in size during my tenure. So, you know, that's a huge amount of growth, right? And, you know, the product area expanded so much as well. But I can tell you a little story. And I think maybe I was teed up for thinking about culture early because of the way my team first came together. So when I first joined GitHub, I mentioned this company, Semmle, that we had just acquired. It was a really, really great company that formed the basis of GitHub's advanced security. In my second month at GitHub, you know, Microsoft had already acquired GitHub and Microsoft kind of realized that there's this Azure DevOps group doing great work in the DevOps space. And they were effectively competitive with GitHub, right? So someone realized like, hey, we should like merge these teams, right? And so a whole bunch of Microsoft people were asked if they wanted to move over to GitHub, and a lot of them did. So these were sort of a lot of PhD academic types and mostly in Europe.

Starting point is 00:08:07 I remember on their onboarding, multiple people said to me, like, these Americans who are doing our onboarding are too excited about everything. And if you're, you know, if you're excited about everything, you're excited about nothing. And they're just, you know, like, we're getting exhausted. You know, this is like a different culture, right? And then one third of my team was these sort of original hubbers who had been on the startup journey, many of them with GitHub. And they sort of had this more scrappy, get it done, ship to learn type open source first culture. And then a third of my people were from Microsoft, which they had much more big company experience. They had expectations around the way things should work, established process and expectations. They were very good

Starting point is 00:08:50 at thinking about enterprise and had much more of an enterprise-focused culture. And so these were all fantastic people with very different backgrounds, experience, and expectations. And I remember realizing that the first thing I really needed to do was just to bring these folks together with a common culture and a feeling that, hey, we're all hubbers, you know, rising tide lifts all boats. I mean, we're all in this together. There needs to be no us versus them because that can very quickly become toxic. But rather, we need sort of to have a shared vision and understanding that to be successful, we all really need to work together. So I started with that and like, what am I going to do to make this, you know, a common culture? We talked a lot about, you know, focusing on the experience of all developers. So not just open source, not just enterprise, but because all developers are people and they all deserve great productivity experiences and fulfilling lives and so on. And then again, beyond my team, I wanted GitHub engineering to share a common culture as well. And, you know, I know from

Starting point is 00:09:49 experience that setting clear shared expectations is really key for establishing and promoting healthy, productive teams. So one of the first things I did when I came in, which I never expected I would be doing was I wrote and published and socialized career ladders for both managers and individual contributors in engineering. I worked really closely with HR on that, which was, of course, super important. But I also, you know, you introduce culture by what sorts of things you reward and what expectations you set. So, you know, one thing I did in that process was I introduced a new, more technically focused career path for managers, which I think lots of people should think about doing because previously at GitHub, it had sort of been, you can be a senior manager. And then if you want, you know, a promotion, you have to be a director. And director is a different job, right? To me, director is managing managers. But you should be able to have career growth as a manager of individual contributors. And so with these new ladders, we rolled out this concept of staff manager and principal manager and sort of this like technical path for managers to take, which I

Starting point is 00:10:54 think is really important for sustaining strong technical teams. And so there's lots of stuff like this, like I established a design review process and expectations around what kind of things needed to go to design review. And design review is often a communication tool as much as it is, you know, to get specific feedback on your design. Like, it lets a team over here, you know, on one area of the business understand and know what's happening in another area of the business. I created something called our principal council, which I'd love to talk about maybe a little later when we get into, you know, healthy things to do to scale engineering teams. But this group was eventually renamed to the architects group. But what it did was it really helped support the difficult cross engineering technical decision making that needs to happen. And that had really been stalling at GitHub. I set up,

Starting point is 00:11:45 you know, turns into a laundry list, right? But I set up a developer satisfaction survey, like an internal-facing survey, to find out from all the engineers at GitHub, like, what are your biggest pain points? What are the things that are slowing you down? You know, what are you dissatisfied about? What's hurting? And you know, what's good? Where can we celebrate progress? So that we can really understand and track over time, like, this is the experience that our developers are having. And if only we could, you know, focus on fixing some of these things, then we'd have happier people. You know, I also set up operational reviews. I rolled out an engineering-wide strategy that talked a lot about balancing technical debt, developer experience work, privacy and security work, along with

Starting point is 00:12:31 feature work to make sure that it was clear that these things were valued and that, you know, highly impactful work that's not directly tied to feature launches is also recognized and valued. And so this is all, you know, I could go on and on, but this is all a lot of culture work that helps manage that scale and that growth over time. That's a lot of things. Well, if there's ever... It's a lot of stuff. Yeah, it is. If we got a good person on the show to discuss this topic, I think, you know, all doubts should be removed. You obviously have a wealth of knowledge in this

Starting point is 00:13:04 space. And one of the reasons we're doing this show, Rachel, we're happy to have you here is because our audience requested you not just in type, but also by name. They want to hear more shows, not just, you know, we talk about scaling software a lot, maybe not a lot, but we talk about that. Scaling teams that scale software, you know, we talk about less. I think our audience has been clamoring for more leadership style episodes and scaling style episodes. And they got going one day in Slack, in our Slack community, and said, Rachel Potvin is the one to get on the show. And so shout out to all of our people in Slack who gave us your name. I'm so flattered. Thank you.

Starting point is 00:13:42 I'm glad to have you here. There's so much, there's so many things we could dig into of the different things that you did in order to succeed. I don't know where to start. I do want to ask about just that manager bit.

Starting point is 00:13:53 And if we just might like pull that out and then, you know, maybe just set it aside and move on. But these technical manager distinctions, is this published work?

Starting point is 00:14:01 Is this something somebody else could follow? Like, how do you distinguish between these different managerial tiers? And like, what differentiates these different roles that are non VP or non, you know, they're still manager roles. still at GitHub, because I think it's really important. I think it's easy to fall into the trap of saying there's two separate careers. There's, you know, there's a manager career, and then there's an individual contributor career, and they're different. And sure, they are different jobs, but they share a lot of commonality. And one thing I really believe is that there's a spectrum from the deepest technical person to the most strategic thinking, sort of high-level

Starting point is 00:14:46 vision thinker. And that spectrum can exist both on the individual contributor ladder and on the manager ladder. And so you want to be able to give opportunity and job and take advantage of that skill set with the different individuals where they are. Like, you know, one thing I did at GitHub, I also developed the promotion process for engineering. And I talked a lot about that staff engineer promo as well. And I know there's lots of writing out there and so on about staff engineering. But, you know, you can never, you always have to be careful when you make your career ladders. It's never a checklist, right? It's always, there takes some interpretation. It can't be too subjective,

Starting point is 00:15:30 but it takes some interpretation to say like, this style of person is having impact in this way. And the common currency really at various levels is impact. So, you know, at more junior levels, you're taking direction really well. You know when to ask questions. By the way, that applies all the way up the ladder. But, you know, you're really good at getting things done in a constrained way. And maybe the next step up, maybe you're figuring out what the way is to answer a problem. And maybe at a higher level, you're actually figuring out what the problem is that we should be talking about. But, you know, I think it can be an anti-pattern to really pigeonhole managers to say like, these are people managers and coaches and not technical individuals as well, who can understand the depth of what their team is doing. If a manager of ICs, you know, can't jump in and help coach their individual, maybe you don't need

Starting point is 00:16:17 to be the deepest domain expert on everything, right? But you at least have to be able to understand the work that's happening on your team and be able to give good coaching advice or hook your person up with someone who can give them good technical advice, great code review of what they're working on. And then I love seeing, you know, I'd love to see a GitHub distinguished manager. Just, you know, that's someone who's got a small team of people working with them on the hardest problem that GitHub has. And some people call that maybe the surgeon model, right? You're the tech lead, but you're also, you know, working so closely with this group of people that you're the right person to be the manager. Yeah. What is the difference then when you go from senior engineer to staff to technical? Like what are

Starting point is 00:16:59 some of the differences between those three opportunities, I suppose? Yeah, I think it's like I was saying, you know, it's like, how much agency and accountability are you taking? I remember, you know, having a great discussion with a principal engineer who reported to me at GitHub. And, you know, his opinion was, and I fully agree with this, there's a little bit of confidence that comes with those levels, too. So if you're going to be a principal engineer, imagine GitHub's down and you're in the Slack channel with all the people who are working on the problem. Are you willing to be the person who says we're going to roll back or actually we're going to turn off GitHub Actions

Starting point is 00:17:38 and impact only that set of our customers so that we can bring the rest of GitHub back up? And so there is that experience and confidence that comes into these levels, but then it's also sort of the nature of the type of problems that you're taking on and how much agency and accountability you're taking for the solutions yourselves. I guess the importance is to not move away from, even further, like you had said before, moving to, you know, say, a director role, which is, like you said, a completely different thing. You still keep them closer to the technical problems. Yeah. It's kind of similar to, since we kind of know about

Starting point is 00:18:15 what you're doing now after GitHub, it's kind of like the ability to keep advising, right? Like, rather than move from senior engineer and continue up your own career ladder into director, out of, say, a more technical role, you get to sort of keep leading and advising within, but keeping your technical skill set within your career path versus simply going into management, which sort of moves some of that away. You obviously leverage that experience, but you don't get to put it into practice on a daily basis. Yeah, I got to tell you, I'm talking to a lot of startups now. And there's a good group of unhappy CTOs out there who, you know, are kind of turning into people managers for the largest teams they've

Starting point is 00:18:57 ever run. And their joy is actually from doing the hands on technical stuff. And so right, I've been talking to a lot of those folks who are trying to find a way to get back to their joy and, you know, really being hands on. And then, you know, it is it is a different job to be leading a large, large organization of people. And that's a big responsibility. And it's it's a different role, too. Is that advice generalizable at all? Like, can you say to a typical CTO of a growing or hyper growing company, like, here's how you accomplish that, or here's the highest impact things you can do? Or is it always specific to this person in this place? Look, if I see an individual who's in a management role and they're really unhappy, you know, we all have

Starting point is 00:19:48 a certain amount of agency in our own lives, right? And I think we all have one life to live and it's okay to, you know, take one for the team for a while. If let's say, if you're a co-founder or founder and you're going to be the CTO of your company, and that means growing an engineering team and you're, you're going to do that for a while. But at a certain point, you know, if you're feeling unhappy on a day to day basis, you know, look at what you can do and see if you can change. And there's a lot of great managers out there. And so finding a really good partnership between an engineering leader, and a CTO, or and, you know, the top, you know, technical ICs in the company, that that's a partnership that that really needs to form. And so I think, you know, I encourage people to find their happiness. That's what I'm

Starting point is 00:20:30 trying to do. Right. Yeah. Happiness for sure. Well, if we go back to your laundry list of things that you did, and I don't mean to call it a laundry list, like dirty laundry, but like, you know, an epic, an epic list of things that you did. Where does the wherewithal or the knowledge, like, how did you know what to do in that circumstance? And like, where does your experience to kind of like, and surely some of it was probably explored and discovered as you went, but like, how'd you know, where'd you get the knowledge to say, I'm going to do these seven things in order to bring these three teams together in a way that scales and establishes culture? What's your background that brought you to that place where

Starting point is 00:21:08 you could be the one that got that done inside of GitHub? Hey, you know, I've been grinding in tech for 25 years. So I have a lot of experience. Grinding for sure, right? Yeah. I've seen a lot of ways that things didn't work. Sometimes when you see a counterexample, that's just as good as seeing a good example and even sometimes more effective. And, you know, I've tried things that didn't work. But, you know, I've seen several common patterns in scaling my own teams over many years. Like I brought multiple teams to over 100 people throughout my career. You know, at Google, I worked in developer infrastructure for a long time. And I brought those teams to over 100 people, working in an organization of 2000 people with the amazing Melody Meckfessel, who is now CEO of a company called Observable, but I worked for

Starting point is 00:21:56 her for many, many years, and learned a lot of great lessons from her, for example. Then at Google, also, I led the cloud platform and recommendations platform in Google Cloud and scaled that team from something like 30 people to well over 100 people. And then within GitHub as well, I've, you know, I've scaled multiple sub teams within my organization to over 100 people, and my team itself, when I went to leave, was over 500 people. And so, you know, hopefully you learn from experience, right? I mean, I certainly think I did. And like I said, that, you know, being thrown in the fire

Starting point is 00:22:32 at the beginning of my GitHub experience where, you know, there was a lot of things that were really surprising to me in terms of how siloed GitHub was. There were a lot of things, you know, in terms of how decision-making was happening that I could tell didn't work. You know, I can give you a quick story, which is when I first joined GitHub, a fantastic team came to me. And it was, you know, I joined two months before GitHub Universe,

Starting point is 00:22:56 which is the big developer facing conference every year. And this great team came to me, and they were working on a language feature. And they said to me, Rachel, we have this great new language feature and we want to announce it and release it at GitHub Universe. As our new VP, can you tell us, should we launch it for JavaScript or should we launch it for TypeScript, Java, Python, you know, the four next popular languages? And I was like, okay, well, you know, okay, hang on. This seems like a great feature. Do we need, you know, more research? Like, are we not confident? Like, why are we just like targeting one population versus another? And this great team said to me,

Starting point is 00:23:36 well, okay, here's the thing. When we first started this project over a year ago, it was easier for us to get CapEx budget approval, so that's like hardware, instead of OpEx budget approval, that's cloud capacity. And so we ordered a bunch of machines and we got them racked in our data center and we were running a MySQL backend and we have space for the index for JavaScript or the next four popular languages, but not both. And it takes 12 weeks to order new machines. And GitHub Universe is less than 12 weeks away. And so we've got to pick.

Starting point is 00:24:17 And, you know, for me coming from Google, my brain was melting a little bit because on-prem, what? Like, isn't it alpha? I didn't know that that still existed, right? I had a lot of learning to do when I came to GitHub. And by the way, that team did nothing wrong because that was, you know, the way things worked. I immediately said, we're moving to the cloud, you know, this is not going to work. And they had a year of pain, actually, where they couldn't scale the product that they had made. And they occasionally, you know, the scale of GitHub's code base overwhelmed them and they'd have to pull back features or turn

Starting point is 00:24:49 things off and stuff like that. And so ultimately it had to be a cloud-based product and they did successfully move to using Azure Blob Store. But that was sort of the awakening I had when I came to GitHub where I thought, oh, okay, there's trouble making maybe like these decisions that are happening in silos way too much. Like there's local optimization, I think, really happening in terms of the way teams are making decisions. And there needs to be sort of absolutely the first step of scaling is that teams have focus and agency to make their own decisions. But then there's a next step where you've grown beyond that. And there's certain decisions you need to know that you need to take to another level. And

Starting point is 00:25:27 there needs to be the ability that's not strictly product focused to make those kinds of decisions coherently for the entire organization. So I felt like I had a lot of learning to do when I came to GitHub. Understanding constraints in that case was probably key, right? Because like, if you didn't ask that question, you just thought, well, both, of course. But you had to understand the fact that they were on-prem and they had, you know, if you hadn't gotten to that part, you might've just made a premature decision or an incorrect decision to say, we should, of course, let's do both because they're all popular. We should, these are the directions to go. But once you understood their constraints, you were able to sort of understand more clearly their challenges, right? Constraints equal challenges. Yeah, absolutely. And, you know, like,

Starting point is 00:26:08 I'm really happy that, and again, it's a great team, great people. They didn't do anything wrong. That was the environment that they were in. But it also highlighted very early in my GitHub tenure, oh, interesting. This is how this is happening. And then I spoke to a whole bunch of teams, actually. And remember, GitHub had been acquired by Microsoft. And I started asking, is anyone running anything on Azure? You know, like we have a lot of AWS, we have this on-prem, I see we have some Google cloud, but like, are we running anything on Azure? And the answer was no. And, you know, I was asking around and trying to figure out like, do we plan to migrate to Azure or, you know, you know, what are we going to do here? And it became really

Starting point is 00:26:46 clear that because the product teams were so siloed, every product team was thinking of its own feature sets. And there wasn't really anyone thinking about that bigger picture of, you know, no, we're going to do the investigative work, and it's going to take time and, you know, whatever needs to happen to figure out how to move to Azure. Any one product team would have to throw their entire product roadmap under the bus in order to be able to work that out. And so you need that higher level of thinking to be like, well, wait a minute, this is something we have to prioritize. We have to be able to have the flexibility to not be so constrained to these product areas and be able to fund things like this that are going to be for the greater good. So when it comes to scaling these teams, one thing I've read from you is that you think

Starting point is 00:27:28 that 100 people is kind of this threshold of engineers. Yeah. Where it's like the game changes. I'm wondering why, like, if that's just experientially what you've seen, or is that a magic number? And then what changes and why, in your experience? Yeah, absolutely. You know, it is experiential. Like, that is what I've seen myself, but I've also spent the last several months talking to a whole bunch of startups, which has been really a lot of fun. So many bright people out there doing interesting novel things. And it's held up, this hundred-person threshold, you know, and it may be slightly different for different teams and

Starting point is 00:28:05 different companies. It's sort of, it matters the amount of complexity there is in your product space, you know, how many different sort of customer bases you're serving, how many different product areas maybe you have in your organization. So 100 is not, you know, the absolute exact moment, but definitely, it starts to be hard and things need to change at that threshold. And so I'll talk first about, you know, kind of what I've seen. And one of the main things is that eventually you hit the scale where it becomes impossible for one individual to hold context for everything that's happening within the products, but especially implementation wise in their head, right? And so certainly the individuals who are on the product teams

Starting point is 00:28:50 will have lost that thread a long time ago. They won't know what all their peer teams are doing. But you know, like maybe one person until 100 is kind of hanging on and having a good sense of the various challenges that all the teams are feeling. But eventually, that stops being humanly possible. Work will start happening that doesn't align well. Decisions will start happening that don't align well. Life is certainly easy when you have, let's say it's a founder or founding engineers or a senior technical person who can effectively make final decisions for teams when they're stuck. But now you're getting to a scale where there isn't necessarily that individual who can do that. And obviously, like, you know, we'll talk about the fact that decision making has to be delegated

Starting point is 00:29:34 to teams, right? Like, that's the first step of scale, you go from like having a single team, where everyone's working together, to splitting out into focus. And I can give you also lots of examples of where delegation doesn't happen well enough and teams are hampered because they can't make their own decisions where they really should be. And this is exacerbated when you have time zones coming into play and folks working on different schedules getting stuck and so on. So you don't want that. You need individual teams to be able to make their own decisions. But then there's these decisions that go beyond team boundaries, and they start to spin. So if two teams are, you know, invested in a decision, then they can probably hash it out, but it's these, like,

Starting point is 00:30:12 cross-engineering things, big investments. In many cases, you start to see these important technical decisions really stalling, and that's just a danger zone when important decisions that need to be made aren't being made because no one feels empowered or maybe attentive enough. And probably your eng leader is running the biggest team they've ever managed. And maybe they don't even realize that these level of scaling, right, with the 1,000-plus person engineering team. And these problems get exacerbated at every order of magnitude, for sure. But an example from GitHub is it really took us too long to decide that we were going to be moving to React in the front end. And, you know, some teams started using React, but they were doing so in inconsistent ways. And, like, are we going to be building within the GitHub monolith? Are we building services outside the monolith? What standards are we using? You know, what's our sort of like feel on do we want GitHub to get more of like an app like feel? Do we want sort of like a more static web page? I mean, there's a lot of inconsistency into how various teams were

Starting point is 00:31:19 approaching this. On top of that, you know, Microsoft was giving us some pressure about accessibility and making sure that GitHub respected accessibility standards, which is really important. Is React going to be the means to doing that? Or are we going to, you know, have some other UI policies? And so that's something that took investment, experimentation, investigation, but then ultimately GitHub was able to say like, yes, this is the North Star, this is the direction we're going to go. So then that gives a roadmap to every team when they're starting to think about a refresh of their front end. Well, now they know they don't have to guess and evaluate multiple technologies and so on. But there's lots of other things that start to happen at that 100 person threshold at all. I would say, you know, also like the technical impact of scale may start to be catching

Starting point is 00:32:06 up with you. So process and implementation that was like good enough at a smaller scale may start to become problematic. I have some examples of that I could talk about, you know, with so many engineers, a manual deploy process stops working, you know, and then you end up with all sorts of like terrible side effects to that where people are writing bigger changes that are harder to code review. And then you end up having more outages. And, you know, maybe some people who originally authored the code base are no longer around. And maybe you don't have clear code ownership for some things that were written once and aren't scaling now. And so, you know, outages start to happen.

Starting point is 00:32:43 Maybe confidence is low in terms of what needs to be done to address stability. I sort of mentioned this already, but beyond that, I think you see a lot of eng leaders who are starting to run the largest human organization they've ever run. And they're probably, you know, like we were just talking about, no longer touching the code day to day, and they might be feeling insecure that they're not on top of all the details, right? Maybe they know that important decisions aren't being made, but they're not sure that they still have the right level of insight to even make those decisions. Right. You know, maybe they're working with a CEO who is super focused on user customer facing progress, who doesn't want to hear or doesn't think about

Starting point is 00:33:22 infrastructure, tech debt, developer experience, etc. And so that starts getting less prioritized on the team. Or, you know, I've definitely talked to startups where the CEO was the one who wrote the first version of the code. And, you know, they're opinionated, but also, you know, their knowledge is stale. And so it's just like super hard job for these individuals who are trying to, you know, maintain that balancing act. And so these are all things that I think start really getting exacerbated at that 100 person scale. And the good news is there's a lot of things you can do, but it's interesting to see how prevalent it is. Yeah, for sure.

Starting point is 00:34:02 How do you then get that person or persons that has that, I guess you kind of said it was confidence in one way, but the ability to see that there's a problem there and then start to enact change? You'd mentioned, you know, they wouldn't see the problem anymore. They were too far away from it. How then, from a VP level, do you start to give people that agency to make those changes or to see more clearly and make choices and decisions? Because it seems like, you know, when you get to a hundred plus organization, engineering wise, like you had said, one individual can't hold all that in their personal RAM. It begins to be, you know, divided and whatnot. How do you get to that point to give people

Starting point is 00:34:40 more clear access to what needs to actually happen. Isn't there some quote or something that like recognizing the problem is half the battle or I'm terrible at quotes. So I think it's G.I. Joe. Knowledge is half the battle or something like that. Yeah. Oh, it's G.I. Joe.

Starting point is 00:34:56 My goodness. Going way back there then. Yeah. Wow. Okay. Well, half the battle, I believe, is from G.I. Joe. Everything else is from something else.

Starting point is 00:35:03 I think it was a combined quote. It's a remix. Either way. Yeah, either way. Let's just say it's a Rachel original then maybe. I don't know. Sure, why not? No, I'm sure it's not.

Starting point is 00:35:13 I'm sure it's not. There we go. I think you just coined it. But, you know, recognizing that things are changing and that you have to work differently and that, you know, the way things have gone before will no longer continue to work. It is something that people realize, and whether they realize it sooner or later, they will realize it. Because again, you're going to hit one of these problems where like you have a massive outage, and you don't feel equipped to handle it. Or, you know, you'll

Starting point is 00:35:42 realize that, wow, you know, we've been spinning on this decision for a really long time, and we haven't made this decision. How come we haven't made this decision? So it will be noticeable eventually. It's just sort of like, how soon do you notice? And how much do you put in place while it's easy, so that when you get to that level, you can kind of sail through it, right? Definitely, you know, a lot of things can be done. You know, you can do work to avoid technical scaling bottlenecks early by focusing on code health and having best practices in place. You can proactively invest in your developer experience before your developers are screaming

Starting point is 00:36:19 that they can't deploy anything. You can set up individuals who are directly responsible for different product areas and different technical domains to give them agency and accountability and decision making. And there's a lot of things you can do with culture to really make sure you're valuing different types of work, right? Like a failure mode I see a lot of companies get into is being way too user-facing focused. And, you know, it's great to celebrate launches and product launches and great feature launches and so on. At Google, there was an expression, landings, not launches, which I really liked because, you know, I was talking to the Copilot

Starting point is 00:36:58 team about this, you know, a year ago where like I said, I actually don't care about getting to GA with Copilot. I care one year from now, do we have a healthy team that can maintain the thing that people are depending on, right? Just getting something out the door is not what you have to worry about. You really have to worry about what happens next. And so culture has really a lot to do with that. So yeah, I mean, I think people will always hit that pain eventually. And so, you know, I'd love to help people notice it sooner and be ready to address it sooner. It seems the somewhat secret sauce might be the concern and care for actual people in the mix,

Starting point is 00:37:40 right? Like one thing is clarity and expectation. This is something you've said several times, and part of the way you lead is very clearly. Yes, exactly. But it seems like this desire to care for individuals, like, it's different whenever you lead with, like you had said, a launch, not a landing. A landing is safe, intentional, or at least it's desired to be safe and intentional. If you're landing, it's like, let's make it soft, let's not make it abrupt. Let's not damage our knees. I'm thinking like, you know, airborne for the Army, for example, when you come out of an airplane and you got a parachute on, it's easy to damage your knees if you don't land properly, right?

Starting point is 00:38:13 So landings are intentional. They're safe. They have some sort of circumstances around it. You have some care for individuals. It seems like that's a somewhat unknown secret sauce to how you lead. Well, I would say also the way I define landings is you achieved what you wanted to achieve, right? So like you can launch,

Starting point is 00:38:31 you can get top of Hacker News, whatever, and that's cool. But six months after launch, have you got the usage that you wanted to see? Do you have the retention that you wanted to see? Are you perhaps generating the revenue if it's a revenue generating product that you wanted to see? Do you see people using the product the way you expected them to be using it? And so before you go to any launch, you should have at least, you know, as clear as possible a hypothesis of and a target of where you want to be and what

Starting point is 00:39:00 you want to achieve. And, you know, that's something that I think launches are hard, but they're easier in some ways than sustaining, right? Sustaining, you have to have SLOs in place, you have to have, you know, a good on-call rotation with good playbooks. You have to understand what's the cost of keeping the lights on for this service? You know, how do we handle customer escalations and user escalations? How do we triage work? How do we prioritize? Is this scaling? What scaling bottlenecks are we going to hit? You know, sometimes success is a double edged sword, right? Because suddenly, the way you wrote this thing is no longer going to work or your number of machines that you have

Starting point is 00:39:42 in your MySQL on-prem and backend are not going to be able to fit, you know, what you're trying to do. And so to me, that's what a landing is, is really like we have something that people can depend on that's reliable, that's sustainable, and so on. One of the challenges that I am seeing is this like competing concerns with, I don't know, just like our propensity to build the wrong thing or to yak shave. You know, we have YAGNI, which when it comes to scaling, like a lot of us aren't going to need some of the scaling things. And then we do, we really do need them.

Starting point is 00:40:18 And then there's also things that we should be building right away. So like you can't bolt on security, for instance. So when it comes to like engineering something like security, you should be thinking about from the beginning. But a lot of us in trying to prepare for the possibility of scale, never get the launch done because we are setting up our CI/CD, right? We picked Kubernetes when we may never need it. Or we spent all this time developing things that we didn't need.

Starting point is 00:40:46 And then it came time for us to need something. We didn't develop that thing. Like, oh, I wish I would have had this incentive system in place, right? So, I mean, it's difficult to like pick what's worth building upfront because some of these things you said can, if you're prepared to scale,

Starting point is 00:40:59 if you picked, if you rolled out a Kubernetes cluster from the beginning, and it turned out that you had this huge launch and now you're scaling and wow, it's amazing. We can just get more nodes or whatever and it worked as opposed to like an on-prem MySQL server that just hit a wall and you're done. And so, especially now that you're talking to startups, right,

Starting point is 00:41:19 who may or may not have to scale, are there ways you can help people, help us think about these things where it's like, what's worth building now? And what is premature optimization that's going to be completely a waste of my time and never, never push my business forward? Such a good point, Jared, because I've said over and over again to my teams and to, you know, various folks that I'm advising and coaching, everything's a trade-off, right? And it's not obvious.

Starting point is 00:41:46 You have to assess the cost and the benefits. And a lot of times for startups, being first to market really matters. I think you want to be really intentional sometimes about accruing technical debt, and that's perfectly fine because you're eager to get something in the hands of customers and see,

Starting point is 00:42:03 do we have product market fit or do we not? And so being able to be thoughtful and intentional and make those decisions, I think a lot of the times, definitely don't try to over-engineer something if you don't even know if you have product market fit. Get something lightweight out there, get a prototype out there and see what kind of reaction you get and learn from your users. GitHub has one of the sort of philosophies, I guess, is called ship to learn. And I like it, and I hate it. I kind of wanted to burn it down. But I also appreciate it, right? But it's like, what I want to do is add nuance to it, which is ship to learn the things you should ship to learn and be really deliberate about the things you need to be really deliberate about, if that makes any kind of sense. And so like, what kind of decisions can you unwind quickly? Right. And so, you know, I love ship to

Starting point is 00:42:56 learn for like UI features and UI changes. I think that's really healthy and good and where you can iterate quickly, but there's, then there's changes where like, this is going to be really hard to back out of. Like I'm, you know, writing this data schema and, you know, it's going to be like difficult to undo this or I'm adopting this new infrastructure. I'm not going to ship to learn it. Let's have a design doc. Let's talk about it. Let's, you know, really get the right set of eyes on it. Like, I'll tell you, I set up this engineering wide design review process at GitHub. It's really good. Half of it is a communication tool, right? Sure, people got really good feedback on their design docs.

Starting point is 00:43:37 And by the way, not every little thing needs to go to engineering wide design review, right? There's layers and you think about like, how broadly impacting is this change I'm making? If it's just on my team, then let's just do a design for my team. And actually, maybe it's just something that I'm going to ship to learn and we don't even need a design doc. But for certain things,

Starting point is 00:43:55 I'll give you an example that the issues team at GitHub wanted to start using Cosmos DB because we've been, you know, a very MySQL backend company. And we have these more sort of NoSQL use cases cropping up for storing issue hierarchy. Cosmos DB seemed like a good fit. And so bring it to engineering-wide design review. And then all the various teams who are thinking, oh, shoot, MySQL is not really working for me either, can come and be like, oh, here's the use case I have. And it's a communication tool. And you talk about it,

Starting point is 00:44:28 you get it out in the open, and then you get some good feedback and so on. And so, yeah, everything in life is a trade-off decision. And so I would never advocate for always building for scale from the start, always addressing your technical debt immediately. No, there's very legitimate reasons to make concerted decisions there. I think the challenge I see is I definitely talked to some startups recently who maybe were intentional about saying, okay, look, we're going to just like, not worry about this technical debt. We're going to hack together this feature and get it out quickly. But then do you lose track of that technical debt? Did you forget about it? And does it show up six months later in an outage?

Starting point is 00:45:11 And actually now it's a bigger deal because, you know, various other things happened that built upon it. And so, you know, I'd always advocate for being intentional about the choices you're making and having a way to track decisions and understand where you have, you know, things that you're probably going to have to look at later. And also, by the way, like thinking about, you know, what scaling sort of throttling type limits can you put into your product initially? So, you know, I can't tell you the number of times it's happened where, like, we weren't paying attention to that API and suddenly, like, oh my gosh, a bunch of people have used it for this, like, really expensive use case that we sort of never imagined. Like the GitHub code search API, people were using to, like, count all instances of their

Starting point is 00:46:02 API being called through all GitHub ever. And it's like, that's a super expensive query. It's not really what GitHub code search is about. But there were no limits on the API. And so customers, of course, like humans will do, you know, things the easy way if they find a way. And so like, do think about how your products might be used, do put in place user limits, throttling, anticipate, you know, how things you might want to be alerted about when you hit certain thresholds and certain scales, right? I'll tell you one that is a personal sort of concern of mine that I've seen at GitHub. GitHub has about 40 repositories that go into the GitHub platform. And it's sort of a lot of the newer

Starting point is 00:46:45 product areas are in their own repos and are separate services. But there's also the GitHub monolith, which is a Ruby on Rails application, which is, you know, issues and PRs and projects and sort of all the core functionality of GitHub as a code hosting site is really in that monolith. And, you know, we've had a lot of scaling problems at GitHub with deployments, partially because of the way the Active Record paradigm works in Ruby on Rails, where the sort of data layer is too tightly coupled to the logic. And so people are making database changes all the time. And if you only have a few people working, that's manageable. But that starts to become unmanageable pretty quickly with the number of engineers like, you know, beyond that 100 person

Starting point is 00:47:30 threshold, there's certainly more than 100 people who touch the GitHub monolith. And so that's created a lot of complexity for deployment and a lot of bottlenecks that, you know, need to be addressed. I can definitely imagine that going back to the decision making, do you use and or advocate for like a decision log or some sort of like a place of record? I've heard I've never done this, but I imagine at scale, you'll want to have like, here's the decision, we went with Cosmos DB for this product. Here's the analysis we did. Here's the decision we made. Here's the constraints we were working under or the assumptions. And this is why you picked it. I've heard people say you got to have one of those because, you know, the short term memory of an org, especially in software world, we churn so

Starting point is 00:48:13 much, right? People move on and switch roles often. And so you don't have that institutional domain knowledge stick around very long. So I've heard decision logs are a great tool for that kind of knowledge. Your thoughts? Yeah, look, any tool like that is as good as it is findable and as good as it is clear and part of the culture. And I'll give you an example at Google. You may have heard of GoLinks. GoLinks is a company, I think that was created based on the way linking worked at Google, where basically, if you knew a product area, you could type go slash that product name, and you would land on their documentation. It was just fantastic, because everyone used it. But that's a cultural thing, because everyone knew where to look. I, you know, I've talked to someone who was working at DuckDuckGo recently, and they use Asana for everything. And they do decision logs, and they have,

Starting point is 00:49:09 you know, just a very clear process. And everyone knows to look there, and everyone does it. So you can't just have a decision log without the culture to go along with it. Right. You gotta buy in. You gotta have buy in. And you gotta, you know, you show people that this works, and that it's usable. And then it becomes advantageous and then people buy into the culture. I spoke recently to the CEO from a company called Dream Team.

Starting point is 00:49:33 And they have a project called Kata that I'm keeping an eye on because it looks really good in terms of this sort of project management. They do integration with Slack, integration with GitHub, integration with Jira. And again, it provides that functionality of everyone knows where to look. So you can set up a decision log in that product and type on Slack the right keyword decision, and it'll end

Starting point is 00:49:58 up there. And then people don't need to look around. I think one of the challenges I've often seen is like, yeah, let's document this decision in a Google Doc, or maybe this one's in a repo, or maybe this one is somewhere in Slack. And then, you know, that's cool. But if it's not findable, it sort of doesn't matter. So to answer your original question, yeah, I'm a fan of lightweight decision logs, you know, like, I'm a fan of design documents also. And chances are your design document points to your decision. But even more so is that culture you need around how are we doing things and where are things found to be a really big challenge. You know, I'll say even like even org chart, right? Like at GitHub, there wasn't a great org chart.
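A rough illustration of the lightweight decision log discussed above, as a minimal Python sketch. This is not a tool anyone in the conversation described using. The class names, field names, and the keyword search are all hypothetical, and the example values are placeholders. The fields simply mirror what Jared listed (the decision, the analysis, the constraints or assumptions, and why it was picked), and the search method stands in for the point that a log is only as good as it is findable.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionRecord:
    """One entry in a decision log (hypothetical field names)."""
    title: str
    decision: str
    analysis: str
    constraints: tuple[str, ...]
    rationale: str
    decided_on: date
    tags: tuple[str, ...] = ()


class DecisionLog:
    """A single, searchable place of record for decisions."""

    def __init__(self) -> None:
        self._records: list[DecisionRecord] = []

    def add(self, record: DecisionRecord) -> None:
        self._records.append(record)

    def search(self, keyword: str) -> list[DecisionRecord]:
        """Case-insensitive match against title, decision text, and tags."""
        needle = keyword.lower()
        return [
            r for r in self._records
            if needle in r.title.lower()
            or needle in r.decision.lower()
            or any(needle == t.lower() for t in r.tags)
        ]


# Illustrative entry only; the date and wording are placeholders.
log = DecisionLog()
log.add(DecisionRecord(
    title="Datastore for issue hierarchy",
    decision="Use Cosmos DB for this product area",
    analysis="Compared against the existing MySQL backend",
    constraints=("NoSQL-style use case", "assumptions recorded at decision time"),
    rationale="Seemed like a good fit for storing issue hierarchy",
    decided_on=date(2023, 1, 1),
    tags=("datastore", "issues"),
))

matches = log.search("cosmos")
```

The structure matters less than the habit: one agreed location, and a lookup that everyone knows to use.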

Starting point is 00:50:44 And one of the engineering directors on my team wrote a new org chart. It's the org chart we use now. And I was like, oh, Harry, thank you so much for doing this. Because, you know, even just being able to find who's working on what, you know, what person should I talk to?

Starting point is 00:50:58 You really have to be careful. And again, this comes to that 100 person scale around informal networks and needing to know someone who knows someone to find out the information you need as much as possible. When you get this information into systems, then you can find the answer on your own and it's easy and quick. You know, I think when you have that informal culture of network and, oh, I'll just ask

Starting point is 00:51:22 so-and-so who will know, then you propagate meetings. You know, in this remote culture, it's never just a five-minute question. You always book a 30-minute meeting with someone to ask them maybe the one question that you had. And so then you're sucking all your time into meetings. Whereas if you have clarity of where to find information, you know, that can really go a long way. Yeah. I'm kind of glad you went that direction because, Jared, I was thinking that same thing, but your question was slightly different than what I asked it. But it was more like, how do you choose the tools to communicate? Because it seems like you're a clear communicator. It's if you can find it, like you had said, and you have access to information, you don't have to have so many meetings.

Starting point is 00:51:55 You can rely less on your network because you have to know somebody who knows somebody to get access to the information. But, you know, when you're in hundreds and then to thousands, you know, I'm not asking you to use Slack over Jira, do you use this over that? But how do you organizationally choose that what becomes culture, the tools you use to communicate? Like, how do you do that? Do you build your own tools? You know, is that invented here kind of situation? Because even at small organizations like ours, which is a very small organization in comparison to yours, we still don't have a clear culture of if you want this information, go here to find it. In lots of cases, it's in code and we can go find it in our GitHub repo, of course. But like if it's written, there's like probably three different places we may have used over the last five years.

Starting point is 00:52:38 So our culture has not been adopt one tool, use it heavily. It's been fractured across many tools, never consolidate. So how do you, at that scale, hundreds or thousands? Well, don't feel bad. Okay. Yeah, don't feel bad because that is super common. Yeah, we're also early adopters. So like we try out every new thing.

Starting point is 00:52:58 And so that's part of what we do. So there's some of that culture, like we're going to try the new thing and see if it works for us. And so, yeah, we have, you know, knowledge bases spread amongst... I mean, project management is not the core competence of, you know, unless that is your business, like, you know, this Kata product that I was talking about, that is their core business. So they should use it and they should build their own thing and they should make it amazing so that everyone else can use it. But, you know, it doesn't matter,

Starting point is 00:53:38 right? Like, is it Asana? Is it GitHub projects? Is it Google Docs that are well organized? Pick your battle. I think a lot of things can work. But with lack of clarity, every team in your organization will do something different. And that's when you get into trouble. So yes, you know, just standards and consistency. And you don't want to, I mean, we can go back to everything's a trade off. You know, you don't want to be too heavy handed about things and be like, you must work this way. I was going to just ask that, like, do you just dictate it? Yeah. But there's certain things where it's a virtuous cycle, I think, where you say, this is, you know, where we put design docs, everyone do it because then you'll find the design docs you want to find. And that's a good thing. So, you know, please do this and use,

Starting point is 00:54:29 you know, you can, you know, as a leader, I can actually go and say, why didn't you do this? I need you to do this next time. But the best is when people see, well, okay, this is helping me. And so it's logical. It's not process for the sake of process. I think you have to be extremely careful about rolling out half-baked process where, you know, it's going to introduce friction for teams. And, you know, another thing I can talk about, which we touched on in decision making is different types of decisions hold different weight and can be undone or fixed or changed more easily or less easily? Well, different types of teams are working on different types of projects. And so I've definitely seen the pattern

Starting point is 00:55:11 where a leader will come to me and say, well, like, why is team A moving so quickly and team B is moving so slowly? Oh, well, team A is, you know, iterating on a UI for something, which is like important and hard work. But the pace of that change is different than Team B that's building infrastructure. And so, you know, I also never want to say, well, like Team B, you should be, you know, having a burndown chart that looks just like Team A.

Starting point is 00:55:41 And I want to see like the same amount of velocity and like, no, team B probably has to do more prototyping, more research. There's going to be some dead ends in terms of, you know, maybe what they're investigating. Maybe they have a, you know, buy or build decision to make that's going to require some research that won't end up in a milestone deliverable, right? Other than a decision. And so like keeping that in mind, I never want to be too heavy handed with process at the right amount of handedness, if that makes sense. Everyone has to figure out what that means for their organization. An adequate amount. That's my favorite saying. My wife says, how much do you want when it's like,

Starting point is 00:56:17 you know, food or it's like an adequate amount. I don't want too much or too little. I just want an adequate amount right in the middle there. When it comes to, I guess, not my problem, not that this is a good attitude to have. Like you can say, this is not my problem when it comes to decision-making. How do you deal with who owns certain problems? Obviously you got, you know, a senior engineer in place or a tech leader, somebody that's in charge, but how do you solve for that responsibility layer? Yeah, I mean, this is where, so when I talk about the things you can do to effectively scale,

Starting point is 00:56:50 I think I put them into pretty much three buckets. So there's a lot going on in code health. There's a lot of advice I have for teams around code health and developer experience and so on. There's a lot of advice I have for teams around how to think about decision making. And then the final one is culture and culture encompasses all those things and more. But it's fine. Sometimes something isn't a team's problem, right? Sometimes you want your team

Starting point is 00:57:15 focused on the product area they're working on. They should have a mechanism to surface. Maybe something's come up. Maybe we've noticed something. Where do you bring those problems? Is there an obvious place? Is there a spot where you document like, hey, this thing isn't working? I don't think it's for me to fix, but someone should know, right? The thing I set up at GitHub, which, you know, it was a learning process. All this stuff is a learning process. I think you're never done. You never say like, okay, I set everything up that I need to do.

Starting point is 00:57:43 And now my organization is humming perfectly and I can just, you know, drink a margarita and whatever. Right. But the principal council, which was renamed to the architects group, had a backlog where any engineer in the company could add an issue saying, Hey, I think someone should think about this. And not everything would get touched. Right. But the principal council was effectively the most senior engineers, individual contributors in the company, coupled with me and my two peers who were the engineering leaders. And so the most senior ICs had hands in the code on a daily basis, were deeply familiar with how things worked, and represented different product domains and infrastructure within the company. And me and my peers held the responsibility for cross-eng prioritization and funding and were able to,

Starting point is 00:58:34 you know, move people around from different teams. I think, you know, one thing you want to be careful about is that people don't develop too tight of an identity to the thing they're working on and that you don't get such siloed teams that it's difficult to move people and say, hey, look, we really need help over here. Can your expertise, you know, and what you did in the past come into play over here? So like me and my engineering counterparts were able to have conversations with people and say, hey, you know, can you come work on this problem? We're setting up a special virtual team to really address this thing. Let's get this done. I would always ask one of the most senior ICs to be

Starting point is 00:59:09 champion for any decision that needed to happen. And they were responsible for communicating decisions around that specific area. And really not necessarily being the lead implementer, but mentoring and coaching the people who were taking charge of the problem area. And so, yes, it's fair for people to say, this is not my problem, but there should be a mechanism for, you know, important things to get surfaced. Does that answer your question? For sure. I mean, the fact that you have some sort of garbage collection, essentially, which is what that is, it's like, it's almost like, how would you write a program or a compiler or something like that? It's like, well, you need garbage collection. That's kind of what that is. Like, this is not my problem, but it is a problem and somebody should know about it.

Starting point is 00:59:47 And you've got some sort of organized body willing to, you know, have an inbox for that, big or small, and then find ways to communicate that back to you and others who are leading the organization at a larger scale to say, you know, how do we deal with this in some way, shape, or form? Because the "it's not my problem" situation is a real challenge, because you might find that issue, but it's like, well, it's not mine to fix, as you said, but somebody should know about this. Who do I tell? Oh, I'll tell nobody.

Starting point is 01:00:17 Let me just get back to my job, climb my ladder, doing my thing. Okay, cool. You know, and we can't have that. And there's, you know, there's also like the DevSat survey that I talked about, right? That's a great way where you're asking your internal engineering teams anonymously, tell us like, what are your biggest pain points? What are the things you're most worried about? What are the things that are not working for you? And you can, it's not just the squeaky wheel in that case who's going to get the attention. You can see aggregated over your entire group, hey, look, true story.

Starting point is 01:00:49 Every single person is talking about how painful deployment is. That takes trust, though, doesn't it? It does. You have to have trust in an organization to say those things and not get the backlash potentially. And then you have to have a frequency in some sort of case to get that feedback often enough, right? I think you're so right. Trust is so important. And so all this stuff plays into culture.

Starting point is 01:01:11 I will tell you, I did AMAs with my team fairly frequently. AMA is ask me anything, right? And I was so happy when I would get really pointed, hard questions. I'd be like, this is, you know, I don't love this question. But I'm glad you're asking. Because then I feel like you trust me that I'm actually asking you to ask me what's on your mind. And you know, if you're only getting softballs, you're only getting easy questions, then you really have to ask yourself as a leader, like, are people scared to say the right thing? Yeah. Is there freedom of speech here? Yeah.

Starting point is 01:01:45 Is there? I mean, yeah. And sometimes it's like, look, you know, you got to move on. Like, you can disagree and commit on this. This is what the answer is. I know you don't love it, but we got to be able to move on. But other times there'll be things that I'm not even aware about. I tried all sorts of experiments. I did one time, I did an anonymous AMA, which is a really funny experience. I think it worked out well, but I had people anonymously submit questions. And I should have had a part, I should have called you guys to interview me and say the questions or something. But I answered, I did it, I did it by myself. So I did like a one hour recording of myself by myself answering these questions. And it was nice, because I was able to, you know,

Starting point is 01:02:25 gather some data to answer some of the questions too. But there was some really hard questions during the pandemic. And, you know, there were a lot of things that people were worried and insecure about. And I just thought, you know, I'm really happy that people felt safe enough to ask me these questions and that I would be able to answer them. I think that that is really important. That's a cultural thing that you can't undervalue. And even in the DevSat survey, one of the questions that I would ask is about psychological safety, how decisions were made on your team. You know, so it would be, there's a lot of questions around the specific developer experience, but there are also culture questions on there that then with that survey, I would give it, you know, as a

Starting point is 01:03:06 leadership survey. So I was interested in the broad trends across everything. But then, you know, it was a survey that each manager who had enough respondents would get so they could specifically look on their own team. Do I need to set, we used OKRs, which are objectives and key results every quarter. So set some goals. Do I need to set some goals around psychological safety on my team or maybe around some other process that's not working or on-call? On-call was a big one. Like people are really stressed out about on-call. Maybe we need to do more training. So that was a use of the survey too. And then actually the third group that would benefit from the survey was specific product areas. So like GitHub, we decided

Starting point is 01:03:45 that the paved path for development at GitHub was going to be using Codespaces. And so, you know, when we rolled that out, of course, we got lots of interesting feedback on that survey about the experience of using Codespaces. And so that was valuable feedback to the Codespaces team to be like, okay, you know, here are some things we can focus on. We want to make our internal customers really happy. And that's going to be, you know, important for them making our external customers who we have less access to happy as well. I kind of know what you mean by this. This is sort of a question to kind of get deeper at it. But when you say psychological safety, what do you mean? Like, how does that translate to actionable findings and details?

Starting point is 01:04:23 Like, what actually is that? Yeah, because, you know, I have to say, you have to be careful about over-broadening terms like that, right? Psychological safety does not mean that no one can give you constructive feedback, right? And that's really important. I think, you know, when I talk about, again, scaling eng teams and culture,

Starting point is 01:04:44 this is one that's coming to bite a whole bunch of startups. And I think it was a problem at GitHub as well, where people conflate kindness and maybe pleasantness or something like that. And so, you know, sometimes it can be really hard to make good decisions if people are too scared to say the real thing. It's actually, and I'll get back to your psychological safety bit, but it was fascinating to me when I rolled out eng-wide design reviews, because the first design review happened, it was, you know, a topic, I'm trying to think of what it was. It was something around monitoring and alerting. It was good. And this is important. It was going to affect all of engineering, right? So perfect thing for a design review. I'm hosting the session and I'm getting all these DMs, right? And so the way I would set up a design review is people are supposed to be informed coming into the room.

Starting point is 01:05:34 You want to make the high bandwidth meeting as effective as possible. So everyone's read the doc, you've put all your comments on the doc. The design review is for resolving comments that can't get resolved asynchronously, right? And so then we're in the room and I'm getting these DMs and people are saying like, this thing won't work. Like this thing they're proposing, it's never going to scale. And I'm trying to host a meeting, but then I'm DMing back like, can you say that? And people were like, well, I don't want to be a jerk. And it's like, well, you're not being a jerk if you're telling a team, you have very relevant experience. Look, you've done this before. You know, this team needs to hear what you have to say. Don't just DM me and try to get me to say it. It's going to come better from you. You've built this before. And so that was like a cultural barrier to overcome where GitHub had come from this history of consensus building, which is problematic also, right? Like consensus is great

Starting point is 01:06:37 when you get it, but you can't live by consensus, especially when you start to scale. You need directly responsible people who are accountable for decisions, who are going to make unpopular decisions. Not every decision you make can be popular, right? And so I actually took over one design review just to talk about culture and be like, hey, how do we have these hard decisions where you're not being mean to a person? You're not saying mean things about that person. We need to be able to talk. It's the thing about blameless post-mortems. Like, human error might have happened in an outage, and you have to be able to say that and say, here's some automation that we could put in place that would make it less likely for that to happen again. It's not an attack on the individual, ever, but we have to be able to learn and grow. So that's a little aside, because I get nervous sometimes when we talk about psychological safety without that framing. But psychological safety to me is

Starting point is 01:07:30 being able to say things that you're worried about, things that are on your mind, things that you think are important without fear of retaliation or retribution or fear, you know, like, and that is invaluable, right? So I always want my teams to have psychological safety so that they can ask me hard questions so that I can realize, oh, I had no idea that this was such a problem for you. And by the way, the last 10 staff engineers that I spoke to told me the same thing. Wow, now I'm going to do something about it because clearly this is like, you know, a big problem. And so if people don't feel safe bringing things up, then you just don't get the information you need. But that's different than being too pleasant or too kind, right? You know, empathy coupled with accountability. What does this liberty do then for toxicity? Does it squash it completely? Does it just expose it further? Hey, look, toxicity is something I'm never going to tolerate, you know? And I think that's a cultural thing as well. Like, what do you tolerate?

Starting point is 01:08:30 I always say how you reward and who you promote speaks more to your culture than anything you say, right? And so when I would host training sessions on promoting specifically for staff engineers, it's like, look, toxic behavior is not tolerated. So that's belittling someone, attacking someone, you know, shouting at someone. All these things have happened to me in my career. You know, we're not going to... Complaining. That's why I was, my framing there was more complaining because like you can freely complain and be toxic. You can be pleasantly toxic too. And I just wonder how that blends, you know what I mean?

Starting point is 01:09:08 So it comes back to this concept of knowing when to disagree and commit. If I tell someone, look, I've heard your point. Maybe I empathize with it, but I'm sorry, we're not doing anything about it. And then you keep bringing it up. That's being toxic, right? And so complaining

Starting point is 01:09:26 is not productive when the solution is not happening or the situation is not changing, right? So I do expect people to be productive. I do also want to hear, you know, about the things that are bothering people that are maybe not fixable, because maybe at some point in the future, they will be fixable. Or maybe there's an opportunity to move someone to a different team where that won't be as much of an issue. So, you know, it's like everything. It's a tradeoff and there's judgments involved. But, yeah, there's definitely a time to stop. Yes, it depends.

Starting point is 01:09:54 Tradeoffs. The classic answer. Is that my answer to everything? Sorry. No, no, no. That's not your. It's just it's what happens. It's inevitable.

Starting point is 01:10:03 It's more like a defeatist position than anything. So while we're talking about trade-offs, you mentioned the three buckets of scaling engineering teams, code health, decision-making, and culture. We focus a lot on decision-making and culture. We talked about code health a little bit with regards to YAGNI and premature optimization, things you can do now versus do later,

Starting point is 01:10:25 and how we often trade off code health for speed, shipping, etc. But when it comes to scaling an engineering org, what are some things you can do with regard to maintaining the health of the code, which allows everything to actually move forward productively? Yeah, great question. I feel like this is a podcast unto itself at some point, if we ever wanted to do that, because there's so many things. And you know, it's overlapping with culture, as is everything. That's going to be my answer for everything today too. But an example where it overlaps with culture is like code review. You know, I love the culture of prioritizing code review above your own work, right? It's not always feasible. I've definitely had problematic situations where a poor engineer in Europe woke up with so many code reviews in their inbox because all of the Americans you know, having code owners, and the ability to affect large scale code base evolution requires people doing effective code

Starting point is 01:11:32 review. And a failure mode I've seen is where, you know, I had another principal engineer who's reporting to me at GitHub, who made a pretty simple change to, basically, to keep it simple, the way Go worked at GitHub. And so basically, everyone writing Go code at GitHub had to review his simple code review. And that should be fast and easy, right? But it wasn't. You know, I needed to get involved to escalate for teams outside of my area to say, hey, after a month, you still haven't prioritized this code review, you need to do it so that we can roll out this change. And so really having good code review tools, and again, we talked about design review, were important. And then developer

Starting point is 01:12:15 experience and like, at what scale are you going to start thinking more about your developer experience is really important from a code health perspective. I'd love to tell you a little story about deployment at GitHub because it really resonates with many of the startups that I've spoken to recently. GitHub got into trouble with its deployment strategy and is on the right track now, thankfully. But it's a surprisingly common story to see that in developer experience, you know, build and test times get longer, and there's test suites running that don't need to run and so on. But like deployment is a particularly painful one. And I would say

Starting point is 01:12:55 there are like three areas where it really hurt at GitHub. One was just the volume of changes got too high, too many people wanting to deploy. And so there, if we're only considering GitHub's primary deploy target, which is github.com, just the number of different people wanting to deploy changes on this fairly manual process that required human engagement started creating friction. GitHub has this kind of unusual deploy then merge strategy. So for code changes, you actually deploy your code first, check that everything's working, and then merge back into

Starting point is 01:13:31 the main branch so that main is always available for rollback. It's kind of an unusual strategy that I wouldn't necessarily recommend because it's part of the scaling challenge. But GitHub moved to using deploy trains to help with that volume of changes. And this is still very manual, though. A conductor would be the first person who got on the train, would be responsible for shepherding the change. And then there would be all sorts of gamification that happened. I had a teammate who was like, why am I always the conductor every time I want to roll out a change to the monolith? And it's like, well, because everyone was hanging back, waiting for someone to take that role. And you were the sucker every time. Yeah. And so, um, you know,

Starting point is 01:14:09 this is like a bad experience. And, you know, then I started hearing from people too, like, well, I won't even try to deploy something after lunch. Because, you know, if I end up, you know, being responsible for that, like, who knows, I might be stuck till after dinner, waiting around, so I'm just gonna wait till tomorrow. And so you can see the sort of like aggregation of friction there, and how much that slows down development is just not acceptable. In DevSat, the satisfaction survey I mentioned, deployment came out as the highest friction. And then like all these other side effects that affect code health, like people writing bigger changes, code review becoming more difficult, changes being deployed become more risky.

Starting point is 01:14:48 So like an increasingly problematic situation. And that was just for .com. And then, and this is a situation, you know, that happens at a lot of startups too. GitHub.com isn't the only deploy target for GitHub. There's GitHub Enterprise Server, which is an enterprise-focused product where customers deploy GitHub Enterprise Server on-prem. And for them to do upgrades, they require downtime, right? And so the way this worked was, you know, they'd replay all the database changes, update the code. But database changes are unpredictable timing-wise. I already talked about how way too many database changes happen on GitHub

Starting point is 01:15:28 because of partly Active Record and sort of the way the monolith is sort of like not well componentized across data layer. And so, you know, then GitHub Enterprise Server customers started having an unpredictable amount of downtime for their upgrades, which is a problem. Also, most of the GitHub engineering teams were really focused on .com. So like, I got my feature out to .com, I'm done. The ops team can deal with whatever. And so then this poor ops team is managing the upgrades for, you know, Apple and IBM and all these big customers, but also lots of small customers.

Starting point is 01:16:02 You know, debugging becomes more difficult because is your feature in the enterprise server deployment or is it not? There's a whole challenge with feature flags. We did a really fantastic tech debt cleanup, actually, around feature flags where there had been so many feature flags at GitHub that were on permanently or had never been turned on or, in the worst case scenarios, were on in different configurations

Starting point is 01:16:22 for different enterprise customers. And so, you know, that became problematic as well. And then the third piece to the deployment puzzle at GitHub, which was really enough to say, stop, we got to really, really invest in how we do deployment, was, you know, on-prem enterprise product is not the state of the art. It's not where most companies want to be. And so, you know, GitHub really had to develop a cloud SaaS offering for enterprise customers. And this is something GitHub has been working on for years. There's a lot of pressure on it. Obviously, downtime for upgrades in a multi-tenant SaaS product is not a thing, right? And so there had to be a way to propagate deployment to that endpoint in a healthy way as well.
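As a rough illustration of the feature-flag cleanup described just above, here is a minimal Python sketch that sorts flags into the three groups Rachel mentions: on everywhere, off everywhere, and configured differently per deploy target. It is not GitHub's tooling. The function names, flag names, and target names are all hypothetical, and it assumes a single snapshot of flag state, so "on permanently" is approximated as "on in every target right now".

```python
from __future__ import annotations


def classify_flag(states: dict[str, bool]) -> str:
    """Classify one flag from its on/off state per deploy target (hypothetical model)."""
    if not states:
        raise ValueError("a flag needs at least one deploy target")
    values = set(states.values())
    if values == {True}:
        return "always_on"   # candidate: remove the flag, keep the new code path
    if values == {False}:
        return "never_on"    # candidate: remove the flag and the unused code path
    return "divergent"       # differs per target: needs a human decision


def audit(flags: dict[str, dict[str, bool]]) -> dict[str, list[str]]:
    """Group flag names by classification, in alphabetical order."""
    report: dict[str, list[str]] = {"always_on": [], "never_on": [], "divergent": []}
    for name, states in sorted(flags.items()):
        report[classify_flag(states)].append(name)
    return report


# Hypothetical snapshot: flag name -> {deploy target -> enabled?}
snapshot = {
    "new_diff_view": {"dotcom": True, "enterprise_a": True},
    "old_search": {"dotcom": False, "enterprise_a": False},
    "beta_api": {"dotcom": True, "enterprise_a": False},
}
print(audit(snapshot))
```

A real cleanup would presumably also look at how long each flag has been in its current state before deleting anything; the snapshot view here only shows where to start looking.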

Starting point is 01:17:06 There was lots of pressure from leadership to get this product out the door quickly. And so GitHub did try to take shortcuts, tried various strategies to replay changes from .com to the cloud, and never could work, never could scale, especially the frequency and unpredictability of the time required for database changes just made that untenable. Like how do you interleave code changes and database changes with the right timing, with the right lead time? The enterprise product would always end up getting

Starting point is 01:17:39 so far behind that it could never catch up to .com. And so that just wasn't working. What an issue there, man. That's like a big headache, basically. But it's funny because I've talked to multiple startups who are in this situation as well, where they had maybe a community product, maybe an open source product

Starting point is 01:17:58 where deployment is a little bit more straightforward. And then now they have an enterprise specific product. And in most cases, like the community product is a single deploy target. And the enterprise, it's like multiple deploy targets, like maybe you have multiple different instances, right. And so this is like completely changing the game on how deployment works. And so you have to have, you know, a thoughtful, coherent strategy for doing that, for dealing with scale. And this is one of those ones that like, I feel like deployment is hitting

Starting point is 01:18:25 everyone and something that they need to be really thoughtful about. And historically, the deployment process at GitHub and at many, many startups just depends on so much information in humans' heads, right? Like, I made this destructive database change, and I know I can't make the associated code change until, you know, the backfill has finished. And I see that that backfill has finished. So now I'll make this code change. And, you know, that much information in a human's head can work okay for a single deploy target. But when you have n deploy targets, forget about it. You know, you're done, like, just too much complexity to manage. So yeah, it's interesting. Is that the state of deployment right now-ish, I suppose?

Starting point is 01:19:09 No. Okay, so has a lot of this been solved then? GitHub's doing really good work. I would say it's in good progress, but it took, you know, this is one of those things where, like, oh, maybe 1,000-plus person scale. Right. Where you had to say, look, we can't do this quickly. There was efforts to say like, quick, get this thing out the door, right? And it was an example where it didn't work. I'll tell you other sort of factors that

Starting point is 01:19:35 happened were like, this is obviously an Azure cloud-based offering. You know, we're just going to like follow Azure process. Well, all of GitHub is using PagerDuty and Datadog and sort of like all the sort of tools you would expect where Microsoft has all these custom alerting monitoring frameworks. And it was like, well, actually, I guess we need to like rewrite all our alerts in this other environment. And so now like developers are meant to be on call and look at Datadog for this, but like this other system for this. And so, you know, that was just falling apart from a developer experience. And so GitHub's doing really good work right now on this. And like part of the key was a bunch of different strategies were tried using checkpoints. And,

Starting point is 01:20:21 you know, this is obviously something that it's a culture thing, too. I'm going to say that every time. Because one thing we didn't talk about today, which we could talk about in another podcast, is platform teams and how you can't expect magic platform teams to solve all your problems, because you really need to have product engineering involved in, you know, the work they're doing and how they work and so on. But like, every team is going to change how they do deployment on GitHub as part of this. And so it's not just a magic platform team off in a corner who's going to solve this. But the key for GitHub has been really decoupling database changes from code changes, and really seeing database changes through the entire system before moving on to associated

Starting point is 01:21:02 code changes. And so that slows velocity in some ways. And you have to work on the culture to say, okay, .com developers, maybe you're going to be slowed down a little bit, but actually this is for the greater good. And now your feature actually gets out to the enterprise product more smoothly. And so that's a win for you.
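The idea of seeing a database change through every deploy target before shipping the code that depends on it can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GitHub's actual deployment system: the DeployTarget and CodeChange classes, the blocking_targets and can_ship functions, and all migration, backfill, and target names are hypothetical. The point is that the "has the backfill finished?" check moves out of a human's head and into data that can be evaluated for n targets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DeployTarget:
    """One place the product runs (hypothetical model)."""
    name: str
    applied_migrations: set[str] = field(default_factory=set)
    finished_backfills: set[str] = field(default_factory=set)


@dataclass
class CodeChange:
    """A code change plus the database work it depends on."""
    name: str
    requires_migrations: frozenset[str] = frozenset()
    requires_backfills: frozenset[str] = frozenset()


def blocking_targets(change: CodeChange, targets: list[DeployTarget]) -> dict[str, list[str]]:
    """Return, per target, the database prerequisites that are still missing."""
    blocked: dict[str, list[str]] = {}
    for target in targets:
        missing = sorted(
            (change.requires_migrations - target.applied_migrations)
            | (change.requires_backfills - target.finished_backfills)
        )
        if missing:
            blocked[target.name] = missing
    return blocked


def can_ship(change: CodeChange, targets: list[DeployTarget]) -> bool:
    """A code change ships only once every target has its database prerequisites."""
    return not blocking_targets(change, targets)


# Hypothetical example: the backfill has finished on one target but not the other.
targets = [
    DeployTarget("dotcom", {"add_column_x"}, {"backfill_x"}),
    DeployTarget("enterprise_cloud", {"add_column_x"}, set()),
]
change = CodeChange("read_from_x", frozenset({"add_column_x"}), frozenset({"backfill_x"}))

print(blocking_targets(change, targets))  # {'enterprise_cloud': ['backfill_x']}
print(can_ship(change, targets))          # False
```

The trade-off matches what Rachel describes: the code change waits on the slowest target, which costs some velocity for the primary target but keeps the other targets from falling behind.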

Starting point is 01:21:19 So this is still in progress at GitHub. It's not a solved problem, but I have a lot of confidence in the people who are working on it that they're making great progress. For sure. For sure. Well, a lot could be said, as you said just now. We may have to do another podcast with you on more topics or have you back next year or more frequently now that we've had you on at least once. It has been great. Yes, hearing all the behind the scenes and all the challenges that come with leading, but then also instilling the right culture, displaying the right clarity and expectation,

Starting point is 01:21:49 the right documentation, the right kind of leadership. I think you truly are an example of that. And I'm so glad we had you on the show because you get to put that on display. That's awesome. And now you're on the next hierarchy of your career advising and doing fun things. I got to imagine that you have people reaching out or there's a way for folks to reach out. Is that something you're advertising? And if so, feel free to advertise.

Starting point is 01:22:14 Oh, thanks. I, yeah, I'm still figuring out what's next for me, but I'm really enjoying getting to talk to a lot of different startups and, you know, setting up some advisory roles, which has been really fulfilling. I will say there's one startup I'm working with that I just adore called EngFlow. And I've been an investor and advisor for them for a while. And they're formed from two former colleagues at Google. Helen, the CEO is a good friend. It's just incredible. And they were the folks responsible for bringing Bazel to the world. And now they're doing amazing things for build and test optimization and developer experience. So close to my heart. And they actually came in and did a hackathon in my basement last fall. And being able to be close to them and hear the excitement of everything they're building was really part of what got me energized and thinking more about this startup world.

Starting point is 01:23:07 So I have them to thank for motivating this change in my life as well. But yeah, I'm really focused on developer and data productivity. Those are passion areas for me. And I really feel like there's a lot of exciting, important work happening in that space. So the companies I've been talking to are mostly in that space. And I do think I have, you know, like some good insight in this 100 person plus scale. So there's a lot of eng leaders who are out there who are, you know, struggling managing the scale for the first time, and I'd love to be able to

Starting point is 01:23:38 help where I can. And you know, I'm enjoying my life quite a lot right now. I realized, like I may have said to you, Adam, I felt like I've been grinding for 25 years. And I realized, gosh, I had never been away for more than one night with my husband since my 10-year-old was born. And that's embarrassing. And so we're fixing that and just, you know, enjoying a little time. Yes. Yeah. It's been really good. And so I'm definitely, you know,

Starting point is 01:24:06 on a journey, living my one life and trying to be happy and still, you know, figuring out what's next. So please do reach out to me if you want to talk. There you go. Well, Rachel, it's been an absolute pleasure hearing about your journey and all the things you've learned, all the things you put into place as a leader. And we look forward to getting you back one day, someday soon, maybe for more. So thank you so much, Rachel. It's been awesome. Thank you so much. This is a lot of fun.

Starting point is 01:24:31 I appreciate you both. Thank you. And we appreciate you, Rachel. Thank you so much for joining us today on this show. Such a cool, cool story to go through all this scaling from hundreds to thousands. Such a big chasm. What do you do? How do you care? How do you communicate? How do you speak with clarity? How do you lead by example like Rachel has done? Well, you listen to this podcast. So that's one easy button. But hey, you can follow Rachel elsewhere. Links are in the show notes to

Starting point is 01:25:03 where she's been, where she's going, and what she's doing. A massive thank you to our friends at Fastly and Fly. And also to the Beats Master in residence, Break Master Cylinder. Yes, banging beats. We love them. Keep them coming. And speaking of thanks, thank you to you, listener, for tuning into our podcast. This week, every week, we love it.

Starting point is 01:25:24 Thank you so much for listening to our shows. If you want to go a level deeper, there's one version that's free and one that's paid, but either way you're invited. First of all, changelog.com slash community. Free to join. Join us in Slack. Tons of people in there always talking, like-minded folks that you can hang out with. And two, we have a paid membership, Changelog Plus Plus. That is our membership where we drop the ads, we get a little closer to the metal, and we give you some bonus content and more. Speaking of, today there is a bonus for our Plus Plus subscribers. So if you're a Plus Plus subscriber, stay tuned. If not, changelog.com

Starting point is 01:26:03 slash plus plus. But that's it. This show's done. We will see you again on Monday. Thank you. Game on.
