The Changelog: Software Development, Open Source - What it takes to scale engineering (Interview)
Episode Date: February 17, 2023
This week we're talking to Rachel Potvin, former VP of Engineering at GitHub, about what it takes to scale engineering. Rachel says it's a game-changer when engineering scales beyond 100 people. So we asked her to share everything she has learned in her career of leading and scaling engineering.
Transcript
What's up, friends? This week on The Changelog, we're talking to Rachel Potvin, former VP of
Engineering at GitHub, about what it takes to scale engineering. Rachel says it is a game changer when
engineering scales beyond 100 people. So we asked her to share everything she's learned in her career of
leading and scaling engineering.
A massive thank you to our friends at Fastly and Fly.
Those pods are fast to download because Fastly, they're fast globally.
Check them out at Fastly.com.
And our good friends at Fly help us put our app and our database close to our users.
No ops required.
Learn more at fly.io.
This episode is brought to you by Sentry.
They just launched Session Replay.
It's a video-like reproduction of exactly what the user sees when using your
application. And I'm here with Ryan Albrecht, Senior Software Engineer at Sentry and one of
the leads behind their Emerging Technologies team that built this feature. Ryan, what is this team
all about?
Emerging Technologies has been one of the greatest teams I've been working on in my
career. And I think it's been highly successful. We just today launched Session Replay. And so it's a big celebration here.
But I think that what we've built is going to be able to help all of our customers to
solve their problems faster and really look at debugging and fixing issues in a new way.
So what is Session Replay?
Session Replay, it's a video-like reproduction of what your users saw.
Instead of recording a video, we're recording the actual DOM nodes that appear and disappear
on the screen. And then we can replay those to you in your own browser. So what
this lets you do is you can actually see exactly what the user experienced in the application,
take the guesswork out of trying to triage and what are the reproduction steps, stop at a point
and inspect the DOM to see, you know, was this paragraph tag in the right spot? What are the
CSS and the background colors? You can look at everything as if you were on that customer's machine.
There you go. So if you've been playing detective,
trying to track down support tickets, read through breadcrumbs, stack traces, and the like,
trying to recreate the situation of a bug or an issue that your application has, now you have
a game-changing feature called Session Replay. Head to Sentry.io and log into your dashboard.
It's right there in the sidebar to set up in your front end.
And if you're not using Sentry, hey, what's going on?
Head to Sentry.io and use the code CHANGELOG when you sign up.
Again, Sentry.io and use the code CHANGELOG.
Our listeners, well, you get the team plan for free for three months.
Enjoy.
So we're here with Rachel Potvin, former VP of Engineering at GitHub.
But Rachel, you've done some amazing work at Google,
engineering manager, engineering leader.
Your most recent role at GitHub has been amazing.
You've been the VP of Engineering at GitHub,
making sure that lots of people can collaborate on code, which is just the most amazing thing, right?
So, of course, welcome to the show.
Thank you so much for having me. I'm glad to be here.
Well, this is, I guess it's kind of a good time to be just leaving GitHub or being at GitHub because you guys have just done so much amazing things.
You got Copilot out there. You got all sorts of things happening.
Actions is just amazing.
But let's talk about some of your, I guess, some of your history there.
What are some of the amazing things you've done?
You've done some cool stuff, but I don't want to say what you've done.
You tell us what you've done.
Thanks, Adam.
Yeah, it's just been just a real privilege and just so wonderful to get to work at GitHub.
It's really an incredible company doing some really, really great things for
developers around the world. And so, you know, it's easy to talk about so many great accomplishments.
You know, I had the great privilege of leading a large swath of the product engineering team,
in fact, most of it. And so there were so many things that happened within my team that I'm just
so happy to have seen get out to developers. So for instance,
I got to form the team that created GitHub's advanced security product area. This came from
nothing. And, you know, with a fantastic acquisition of a company called Semmle,
we built up that product area to over 100 million ARR in under three years, which was
a really exciting journey and really fun to
work with all sorts of folks on that. Like you said, you know, my team launched Copilot, we
launched Codespaces, you know, a personal favorite of mine, we launched the new GitHub code search
and navigation experiences, which I think is just phenomenal for developer productivity. You know,
I got to bring lots of renewed focus to the core productivity experiences, even, you know, around repos and issues and projects and PRs, really investing in the scalability and sustainability of that legacy code base.
But honestly, I would say that my favorite work, and this is, you know, kind of on brand for me, I guess, is less about specific product milestones, though those are always, you know, really, really exciting.
But I really get a lot of happiness from building healthy engineering practices and a strong engineering culture that really can sustain these product launches and these features and this growth.
And, of course, all of the excellent people involved. So, you know, in my role, I used to always say I'm like
50% focused on the product areas that I'm managing and 50% focused on all of engineering and what
needs to happen to keep our engineering teams happy and healthy.
Yeah. I'd love to examine the other 50% to some degree, because I feel like there's a lot of
personal details, I guess, a relationship that comes into business or into
leading teams that just sort of kind of goes somewhat by the wayside when describing
accomplishments. Like, I'm so glad that you said, you know, how you divide that up because so often
is it we did this, we launched that, it was amazing. This is how it scaled. This is what
was the impact. But at the same time, you kept people healthy,
employed, not crazy, showing up to work, keeping their fitness, keeping their self-care going,
their marriages and relationships going, you know, not just shipping, right?
Yeah. One thing I was, you know, really proud of was, you know, the level of attrition within my team during the pandemic was quite a lot lower than, you know, a lot of other areas.
And I think that's because of the focus on healthy engineering teams. And look, like the size of my
team grew a lot during my three and a half years at GitHub. And GitHub engineering actually tripled
in size during my tenure. So, you know, that's a huge amount of growth, right? And, you know, the product area expanded so much as well.
But I can tell you a little story.
And I think maybe I was teed up for thinking about culture early because of the way my
team first came together.
So when I first joined GitHub, I mentioned this company, Semmle, that we had just acquired.
It was a really, really great company that formed the basis of GitHub's advanced security. In my second month at GitHub, you know, Microsoft had already acquired GitHub and Microsoft kind of realized that there's this Azure DevOps group doing great work in the DevOps space. And they were effectively competitive with GitHub, right? So someone realized like, hey, we should like merge these teams, right? And so a whole bunch of Microsoft people were asked if they wanted to move over to GitHub, and a lot of them did.
So the Semmle folks were sort of a lot of PhD academic types and mostly in Europe.
I remember on their onboarding, multiple people said to me, like, these Americans who are doing
our onboarding are too excited about everything. And if you're, you know, if you're excited about
everything, you're excited about nothing. And they're just, you know, like, we're getting
exhausted. You know, this is like a different culture, right? And then one third of my team was these sort of original hubbers who had been on the startup
journey, many of them with GitHub. And they sort of had this more scrappy, get it done,
ship to learn type open source first culture. And then a third of my people were from Microsoft,
which they had much more big company experience. They had expectations around
the way things should work, established process and expectations. They were very good
at thinking about enterprise and had much more of an enterprise-focused culture.
And so these were all fantastic people with very different backgrounds, experience,
and expectations. And I remember realizing that the first thing I really needed to do was just
to bring these folks together with a common culture and a feeling that, hey, we're all hubbers, you know, rising tide lifts all boats. I mean, we're all in this together. There needs to be no us versus them because that can very quickly become toxic. But rather, we need sort of to have a shared vision and understanding that to be successful, we all really need to work together. So I started with that and like, what am I going to do to make this, you know, a common culture?
We talked a lot about, you know, focusing on the experience of all developers. So not just
open source, not just enterprise, but because all developers are people and they all deserve
great productivity experiences and fulfilling lives and so on. And then again, beyond my team,
I wanted GitHub engineering to share a common culture as well. And, you know, I know from
experience that setting clear shared expectations is really key for establishing and promoting
healthy, productive teams. So one of the first things I did when I came in, which I never
expected I would be doing was I wrote and published and socialized career ladders
for both managers and individual contributors in engineering. I worked really closely with HR on
that, which was, of course, super important. But I also, you know, you introduce culture by what
sorts of things you reward and what expectations you set. So, you know, one thing I did in that
process was I introduced a new, more technically focused career path for managers, which I think lots of people should think about doing because previously at GitHub, it had sort of been, you can be a senior manager. And then if you want, you know, a promotion, you have to be a director. And director is a different job, right? To me, director is managing managers. But you should be able to have career growth as a manager of individual contributors. And so with these new ladders, we rolled out this concept of staff manager
and principal manager and sort of this like technical path for managers to take, which I
think is really important for sustaining strong technical teams. And so there's lots of stuff like
this, like I established a design review process and expectations around what kind of things needed to go to design review.
And design review is often a communication tool as much as it is, you know, to get specific feedback on your design.
Like, it lets a team over here, you know, in one area of the business understand and know what's happening in another area of the business. I created something called our principal council, which I'd love to talk
about maybe a little later when we get into, you know, healthy things to do to scale engineering
teams. But this group was eventually renamed to the architects group. But what it did was
it really helped support the difficult cross engineering technical decision making that
needs to happen. And that had really been stalling at GitHub. You know, this
turns into a laundry list, right? But I set up a developer satisfaction survey,
an internal-facing survey, to find out from all the engineers at GitHub,
like, what are your biggest pain points? What are the things that are slowing you down?
You know, what are you dissatisfied about? What's hurting? And you know, what's good? Where can we celebrate progress so that we can really understand
and track over time, like this is the experience that our developers are having. And if only we
could, you know, focus on fixing some of these things, we'd have happier people. You know,
I also set up operational reviews, I rolled out an engineering-wide strategy that talked a lot about
balancing technical debt, developer experience work, privacy and security work, along with
feature work to make sure that it was clear that these things were valued and that, you know,
highly impactful work that's not directly tied to feature launches is also recognized and valued.
And so this is all, you know, I could go on and on,
but this is all a lot of culture work that helps manage that scale and that growth over time.
That's a lot of things. Well, if there's ever...
It's a lot of stuff.
Yeah, it is. If we got a good person on the show to discuss this topic, I think,
you know, all doubts should be removed. You obviously have a wealth of knowledge in this
space. And one of the reasons we're doing this show, Rachel,
we're happy to have you here is because our audience requested you not just in type,
but also by name. They want to hear more shows, not just, you know, we talk about scaling software
a lot, maybe not a lot, but we talk about that. Scaling teams that scale software, you know,
we talk about less. I think our audience has been clamoring for more leadership style episodes and scaling style episodes. And they got going one day in
our Slack community and said Rachel Potvin is the one to get on the show.
And so shout out to all of our people in Slack who gave us your name.
I'm so flattered. Thank you.
I'm glad to have you here. There's so much, there's so many things we could dig into of the different things that you did in order to succeed. I don't know where to start. I do want to ask about just that manager bit. And if we just might like pull that out and then, you know, maybe just set it aside and move on. But these technical manager distinctions, is this published work? Is this something somebody else could follow? Like, how do you distinguish between these different managerial tiers? And like, what differentiates these different roles that are non VP or non, you know, they're still manager roles.
... still at GitHub, because I think it's really important. I think it's easy to fall into the
trap of saying there's two separate careers. There's, you know, there's a manager career,
and then there's an individual contributor career, and they're different. And sure,
they are different jobs, but they share a lot of commonality. And one thing I really believe is
that there's a spectrum from the deepest technical person to the most strategic thinking, sort of high-level
vision thinker. And that spectrum can exist both on the individual contributor ladder and on the
manager ladder. And so you want to be able to give opportunity and job and take advantage of
that skill set with the different individuals where they are. Like, you know, one thing I did
at GitHub, I also developed the promotion process for engineering. And I talked a lot about that
staff engineer promo as well. And I know there's lots of writing out there and so on about staff
engineering. But, you know, you always have to be careful when you make your
career ladders. It's never a checklist, right? It always takes some interpretation.
It can't be too subjective,
but it takes some interpretation to say like, this style of person is having impact in this way.
And the common currency really at various levels is impact. So, you know, at more junior levels,
you're taking direction really well. You know when to ask questions. By the way, that applies all the way up the ladder. But, you know, you're really good at getting things done in a constrained way.
And maybe the next step up, maybe you're figuring out what the way is to answer a problem.
And maybe at a higher level, you're actually figuring out what the problem is that we should be talking about.
But, you know, I think it can be an anti-pattern to really pigeonhole managers to say like, these are people managers and coaches
and not technical individuals as well, who can understand the depth of what their team is doing.
If a manager of ICs, you know, can't jump in and help coach their individual, maybe you don't need
to be the deepest domain expert on everything, right? But you at least have to be able to
understand the work that's happening on your team and be able to give good coaching advice or hook your person up with someone who can give them good technical advice, great code review of what they're working on.
And then I love seeing, you know, I'd love to see a GitHub distinguished manager.
Just, you know, that's someone who's got a small team of people working with them on the hardest problem that GitHub has.
And some people call that
maybe the surgeon model, right? You're the tech lead, but you're also, you know, working so closely
with this group of people that you're the right person to be the manager.
Yeah. What is the difference, then, when you go from senior engineer to staff to technical? Like what are
some of the differences between those three opportunities, I suppose?
Yeah, I think it's like I was saying, you know, it's like, how much agency and accountability are you taking?
I remember, you know, having a great discussion with a principal engineer who reported to me at
GitHub. And, you know, his opinion was, and I fully agree with this, there's a little bit of
confidence that comes with those levels, too. So if you're going to be a principal engineer,
imagine GitHub's down and you're in the Slack channel with all the people who are working on the problem.
Are you willing to be the person who says we're going to roll back
or actually we're going to turn off GitHub actions
and impact only that set of our customers
so that we can bring the rest of GitHub back up?
And so there is that experience and confidence that comes into these levels,
but then it's also sort of the nature of the type of problems that you're taking on
and how much agency and accountability you're taking for the solutions yourself.
I guess the importance is to not move even further away, like you had said before, by moving to,
you know, say, a director role, which is, like you said, a completely different thing. You still keep
them closer to the technical problems. Yeah, it's kind of similar to, since we kind of know about
what you're doing now after GitHub, it was kind of like the ability to keep advising, right? Like,
rather than move from senior engineer and continue up your own career ladder into director, out of,
say, a more technical role, you get to sort of keep leading and advising within, but keeping
your technical skill set within your career path versus simply going into management,
which sort of moves some of that away. You obviously leverage that experience,
but you don't get to put it into practice on a daily basis.
Yeah, I got to tell you, I'm talking to a lot of startups now. And there's a good group of unhappy
CTOs out there who, you know, are kind of turning into people managers for the largest teams they've
ever run. And their joy is actually from doing the hands on technical stuff. And so right,
I've been talking to a lot of those folks who are trying to find a way to get back to their joy and, you know, really being hands on. And then, you know, it is a
different job to be leading a large, large organization of people. And that's a big
responsibility. And it's a different role, too.
Is that advice generalizable at all? Like,
can you say to a typical CTO of a growing or hyper growing
company, like, here's how you accomplish that, or here's the highest impact things you can do?
Or is it always specific to this person in this place?
Look, if I see an individual who's in a management role and they're really unhappy, you know, we all have
a certain amount of agency in our own lives, right? And I think we all have one life to live
and it's okay to, you know, take one for the team for a while. If let's say, if you're a co-founder
or founder and you're going to be the CTO of your company, and that means growing an engineering
team and you're going to do that for a while. But at a certain point, you know, if you're feeling unhappy on a day to day basis, you know,
look at what you can do and see if you can change. And there's a lot of great managers out there.
And so finding a really good partnership between an engineering leader and a CTO, or, you know,
the top technical ICs in the company, that's a partnership that really needs
to form. And so I think, you know, I encourage people to find their happiness. That's what I'm
trying to do. Right.
Yeah. Happiness for sure. Well, if we go back to your laundry list of
things that you did, and I don't mean to call it a laundry list, like dirty laundry, but like,
you know, an epic, an epic list of things that you did. Where does the
wherewithal or the knowledge, like, how did you know what to do in that circumstance?
And like, where does your experience to kind of like, and surely some of it was probably
explored and discovered as you went, but like, how'd you know, where'd you get the knowledge
to say, I'm going to do these seven things in order to bring these three teams together in a
way that scales and establishes culture? What's your background that brought you to that place where
you could be the one that got that done inside of GitHub?
Hey, you know, I've been grinding in tech for 25 years. So I have a lot of experience.
Grinding for sure, right?
Yeah. I've seen a lot
of ways that things didn't work. Sometimes when you see a counterexample, that's just as good as seeing a good example and even sometimes more effective.
And, you know, I've tried things that didn't work.
But, you know, I've seen several common patterns in scaling my own teams over many years.
Like I brought multiple teams to over 100 people throughout my career.
You know, at Google, I worked in developer infrastructure for a long time. And I brought those teams to over 100 people, working in an organization of 2000 people
with the amazing Melody Meckfessel, who is now CEO of a company called Observable, but I worked for
her for many, many years, and learned a lot of great lessons from her, for example. Then at Google,
also, I led the cloud platform and recommendations platform
in Google Cloud and scaled that team from something like 30 people to well over 100 people.
And then within GitHub as well, I've, you know, I've scaled multiple sub teams within my
organization to over 100 people, and my team itself, when I left, was over 500 people.
And so, you know, hopefully you learn from experience, right?
I mean, I certainly think I did.
And like I said, that, you know, being thrown in the fire
at the beginning of my GitHub experience where, you know,
there was a lot of things that were really surprising to me
in terms of how siloed GitHub was.
There were a lot of things, you know,
in terms of how decision-making was happening
that I could tell didn't work. You know, I can give you a quick story, which is when I first joined
GitHub, fantastic team came to me. And it was, you know, I joined two months before GitHub Universe,
which is the big developer facing conference every year. And this great team came to me,
and they were working on a language feature. And they said to me, Rachel, we have this
great new language feature and we want to announce it and release it at GitHub Universe. As our new
VP, can you tell us, should we launch it for JavaScript or should we launch it for TypeScript,
Java, Python, you know, the four next popular languages? And I was like, okay, well, you know,
okay, hang on. This seems like
a great feature. Do we need, you know, more research? Like, are we not confident? Like,
why are we just like targeting one population versus another? And this great team said to me,
well, okay, here's the thing. When we first started this project over a year ago,
it was easier for us to get CapEx budget approval.
So that's like hardware instead of OpEx budget approval, that's cloud capacity. And so we
ordered a bunch of machines and we got them racked in our data center and we were running a MySQL
backend and we have space for the index for JavaScript or the next four popular languages, but not both.
And it takes 12 weeks to order new machines.
And GitHub Universe is less than 12 weeks away.
And so we gotta pick.
And, you know, for me coming from Google, that was my brain was melting a little bit because on-prem, what?
Like, isn't it alpha?
I didn't know that
that still existed, right? I had a lot of learning to do when I came to GitHub. And by the way,
that team did nothing wrong because that was, you know, the way things worked. I immediately said,
we're moving to the cloud, you know, this is not going to work. And they had a year of pain,
actually, where they couldn't scale the product that they had made. And they occasionally, you
know, the scale of GitHub's code base overwhelmed them and they'd have to pull back features or turn
things off and stuff like that. And so ultimately it had to be a cloud-based product and they did
successfully move to using Azure Blob Store. But that was sort of the awakening I had when I came
to GitHub where I thought, oh, okay, there's trouble making maybe like these
decisions that are happening in silos way too much. Like there's local optimization, I think,
really happening in terms of the way teams are making decisions. And there needs to be
sort of absolutely the first step of scaling is that teams have focus and agency to make their
own decisions. But then there's a next step where you've grown beyond that. And there's certain
decisions you need to know that you need to take to another level. And
there needs to be the ability that's not strictly product focused to make those kinds of decisions
coherently for the entire organization. So I felt like I had a lot of learning to do when I came to
GitHub.
Understanding constraints in that case was probably key, right? Because like, if you didn't
ask that question, you just thought, well, both, of course. But you had to understand the fact that they were on-prem
and they had, you know, if you hadn't gotten to that part, you might've just made a premature
decision or an incorrect decision to say, of course, let's do both because they're
all popular. These are the directions to go. But once you understood their constraints,
you were able to sort of understand more clearly their challenges, right? Constraints equal challenges.
Yeah, absolutely. And, you know, like,
I'm really happy that, and again, it's a great team, great people. They didn't do anything wrong.
That was the environment that they were in. But it also highlighted very early in my GitHub tenure,
oh, interesting. This is how this is happening. And then I spoke to a whole bunch of teams,
actually. And remember, GitHub had been acquired by Microsoft. And I started asking, is anyone running anything on Azure? You know,
like we have a lot of AWS, we have this on-prem, I see we have some Google cloud, but like,
are we running anything on Azure? And the answer was no. And, you know, I was asking around and
trying to figure out like, do we plan to migrate to Azure or, you know, you know, what are we going
to do here? And it became really
clear that because the product teams were so siloed, every product team was thinking of its
own feature sets. And there wasn't really anyone thinking about that bigger picture of, you know,
no, we're going to do the investigative work, and it's going to take time and, you know, whatever
needs to happen to figure out how to move to Azure, any one product team would have to throw their entire product roadmap under the bus in order
to be able to work that out. And so you need that higher level of thinking to be like, well,
wait a minute, this is something we have to prioritize. We have to be able to have the
flexibility to not be so constrained to these product areas and be able to fund things like
this that are going to be for the greater good.
So when it comes to scaling these teams, one thing I've read from you is that you think
that 100 people is kind of this, this threshold of engineers. Yeah, where it's like the game
changes. I'm wondering why, like, if that's just experientially what you've seen, or is that a
magic number? And then what changes and why, in your experience?
Yeah, absolutely. You know, it is experiential.
Like, that is what I've seen myself, but I've also spent the last several months
talking to a whole bunch of startups, which has been really a lot of fun.
So many bright people out there doing interesting novel things.
And it's held up, this hundred-person threshold, you know, and it may be slightly different
for different teams and
different companies, it's sort of, it matters the amount of complexity there is in your product
space, you know, how many different sort of customer bases you're serving, how many different
product areas maybe you have in your organization. So 100 is not, you know, the absolute exact moment,
but definitely, it starts to be hard and things need to change
at that threshold. And so I'll talk first about, you know, kind of what I've seen. And one of
the main things is that eventually you hit the scale where it becomes impossible for one individual
to hold context for everything that's happening within the products, but especially implementation
wise in their head, right? And so certainly the individuals who are on the product teams
will have lost that thread a long time ago. They won't know what all their peer teams are doing.
But you know, like maybe one person until 100 is kind of hanging on and having a good sense of
the various challenges that all the teams are feeling. But eventually, that stops being humanly possible. Work will start happening that doesn't align well. Decisions will start
happening that don't align well. Life is certainly easy when you have, let's say it's a founder or
founding engineers or a senior technical person who can effectively make final decisions for teams
when they're stuck.
But now you're getting to a scale where there isn't necessarily that individual who can do that.
And obviously, like, you know, we'll talk about the fact that decision making has to be delegated
to teams, right? Like, that's the first step of scale, you go from like having a single team,
where everyone's working together, to splitting out into focus. And I can give you also lots of
examples of where delegation doesn't
happen well enough and teams are hampered because they can't make their own decisions where they
really should be. And this is exacerbated when you have time zones coming into play and folks
working on different schedules getting stuck and so on. So you don't want that. You need individual
teams to be able to make their own decisions. But then there's these decisions that go beyond team boundaries, and they start to spin. So if two
teams are, you know, invested in a decision, then they can probably hash it out. But it's these, like,
cross-engineering things, big investments in many cases. You start to see these important technical
decisions really stalling, and that's just a danger zone, when important decisions that need to be made aren't being made because no one feels empowered or maybe attentive enough. And probably your eng leader is running the biggest team they've ever managed. And maybe they don't even realize that these ... level of scaling, right, with the 1,000-plus person engineering team. And these problems get exacerbated at every order of magnitude, for sure.
But an example from GitHub is it really took us too long to decide that we were going to be moving to React in the front end.
And, you know, some teams started using React, but they were doing so in inconsistent ways.
And, like, are we going to be building within the GitHub monolith?
Are we building services outside the monolith? What standards are we using? You know, what's our
sort of like feel on do we want GitHub to get more of like an app like feel? Do we want sort of like
a more static web page? I mean, there's a lot of inconsistency into how various teams were
approaching this. On top of that, you know, Microsoft was giving us some pressure about
accessibility and making sure that GitHub respected accessibility standards, which is really important.
Is React going to be the means to doing that? Or are we going to, you know, have some other UI
policies? And so that's something that took investment, experimentation, investigation,
but then ultimately GitHub was able to say like, yes, this is the North Star, this is the direction we're going to go. So then that gives a roadmap to every team when they're
starting to think about a refresh of their front end. Well, now they know they don't have to guess
and evaluate multiple technologies and so on. But there's lots of other things that start to happen
at that 100 person threshold as well. I would say, you know, also like the technical impact of scale may start to be catching
up with you. So process and implementation that was like good enough at a smaller scale
may start to become problematic. I have some examples of that I could talk about,
you know, with so many engineers, a manual deploy process stops working, you know, and then you end
up with all sorts of like terrible side effects to that where people are writing bigger changes that are harder to code review.
And then you end up having more outages.
And, you know, maybe some people who originally authored the code base are no longer around.
And maybe you don't have clear code ownership for some things that were written once and aren't scaling now.
And so, you know, outages start to happen.
Maybe confidence is low in terms of
what needs to be done to address stability. I sort of mentioned this already, but beyond that,
I think you see a lot of eng leaders who are starting to run the largest human organization
they've ever run. And they're probably, you know, like we were just talking about, no longer
touching the code day to day, and they might be feeling insecure that they're not on top of all
the details, right? Maybe they know that important decisions aren't being made, but they're not sure that they still have the right level of
insight to even make those decisions. Right. You know, maybe they're working with a CEO who is
super focused on user customer facing progress, who doesn't want to hear or doesn't think about
infrastructure, tech debt, developer experience,
etc. And so that starts getting less prioritized on the team. Or, you know, I've definitely talked
to startups where the CEO was the one who wrote the first version of the code. And, you know,
they're opinionated, but also, you know, their knowledge is stale. And so it's just like super
hard job for these individuals who are trying to, you know, maintain that balancing act.
And so these are all things that I think start really getting exacerbated at that 100 person scale.
And the good news is there's a lot of things you can do, but it's interesting to see how prevalent it is.
Yeah, for sure.
How then do you get that person or persons that has that,
I guess you kind of said it was confidence in one way, but the ability to see that there's
a problem there and then start to enact change. You'd mentioned, you know, they wouldn't see the
problem anymore. They were too far away from it. How then from a VP level, do you start to give
people that agency to make those changes or to see more clearly and make
choices and decisions? Because it seems like, you know, when you get to a hundred-plus
organization, engineering-wise, like you had said, one individual can't hold all that in their
personal RAM. It begins to be, you know, divided and whatnot. How do you get to that point to give people
more clear access to what needs to actually happen. Isn't there some quote or something that like recognizing the problem is half the battle
or I'm terrible at quotes.
So I think it's G.I. Joe.
Knowledge is half the battle or something like that.
Yeah.
Oh, it's G.I. Joe.
My goodness.
Going way back there then.
Yeah.
Wow.
Okay.
Well, half the battle, I believe, is from G.I. Joe.
Everything else is from something else.
I think it was a combined quote.
It's a remix.
Either way.
Yeah, either way.
Let's just say it's a Rachel original then maybe.
I don't know.
Sure, why not?
No, I'm sure it's not.
I'm sure it's not.
There we go.
I think you just coined it.
But, you know, recognizing that things are changing and that you have to work differently
and that, you know, the way things have gone before will
no longer continue to work. It is something that people realize, and whether they realize it sooner
or later, they will realize it. Because again, you're going to hit one of these problems where
like you have a massive outage, and you don't feel equipped to handle it. Or, you know, you'll
realize that, wow, you know, we've been spinning on this decision
for a really long time, and we haven't made this decision. How come we haven't made this decision?
So it will be noticeable eventually. It's just sort of like, how soon do you notice? And how
much do you put in place while it's easy, so that when you get to that level, you can kind of sail
through it, right? Definitely, you know, a lot of things can be done.
You know, you can do work to avoid technical scaling bottlenecks early by focusing on code
health and having best practices in place.
You can proactively invest in your developer experience before your developers are screaming
that they can't deploy anything.
You can set up individuals who are directly responsible for different
product areas and different technical domains to give them agency and accountability and decision
making. And there's a lot of things you can do with culture to really make sure you're valuing
different types of work, right? Like a failure mode I see a lot of companies get into is being
way too user-facing focused. And it you know, it's great to celebrate launches
and product launches and great feature launches and so on. At Google, there was an expression,
landings, not launches, which I really liked because, you know, I was talking to the Copilot
team about this, you know, a year ago where like I said, I actually don't care about getting to
GA with Copilot. I care one year from
now, do we have a healthy team that can maintain the thing that people are depending on, right?
Just getting something out the door is not what you have to worry about. You really have to worry
about what happens next. And so culture has really a lot to do with that. So yeah, I mean,
I think people will always hit that pain eventually. And
so, you know, I'd love to help people notice it sooner and be ready to address it sooner.
It seems the somewhat secret sauce might be the concern and care for actual people in the mix,
right? Like one thing is clarity and expectation. This is something you've said
several times, and part of the way you lead is very clearly. Yes, exactly. But it seems like this desire to care
for individuals, like, it's different whenever you lead with, like you had said, a launch, not a landing.
A landing is safe, intentional, or at least it's desired to be safe and intentional. If you're
landing, it's like, let's make it soft. Let's not make it abrupt. Let's not damage our knees.
I'm thinking like, you know, airborne for the Army, for example, when you come out of
an airplane and you got a parachute on, it's easy to damage your knees if you don't land
properly, right?
So landings are intentional.
They're safe.
They have some sort of circumstances around it.
You have some care for individuals.
It seems like that's a somewhat unknown secret sauce to how you lead.
Well, I would say also the way I define landings
is you achieved what you wanted to achieve, right?
So like you can launch,
you can get top of Hacker News, whatever, and that's cool.
But six months after launch,
have you got the usage that you wanted to see?
Do you have the retention that you wanted to see?
Are you perhaps generating the revenue
if it's a revenue generating product that you wanted to see? Do you see people using the product the way you
expected them to be using it? And so before you go to any launch, you should have at least,
you know, as clear as possible a hypothesis of and a target of where you want to be and what
you want to achieve. And, you know, that's something that I think launches are hard,
but they're easier in some ways than sustaining, right? Sustaining, you have to have SLOs in place,
you have to have, you know, a good on-call rotation with good playbooks. You have to understand
what's the cost of keeping the lights on for this service? You know, how do we handle
customer escalations and user escalations?
How do we triage work? How do we prioritize? Is this scaling? What scaling bottlenecks are we
going to hit? You know, sometimes success is a double edged sword, right? Because suddenly,
the way you wrote this thing is no longer going to work or your number of machines that you have
in your MySQL on-prem and backend are not going to be able to fit, you know, what you're trying to do. And so to me, that's what
a landing is, is really like we have something that people can depend on that's reliable,
that's sustainable, and so on. One of the challenges that I am seeing is this like
competing concerns with, I don't know, just like our propensity to build the wrong thing or to
yak shave.
You know, we have YAGNI, which when it comes to scaling, like a lot of us aren't going
to need some of the scaling things.
And then we do, we really do need them.
And then there's also things that we should be building right away.
So like you can't bolt on security, for instance.
So when it comes to like engineering something like security,
you should be thinking about from the beginning.
But a lot of us in trying to prepare for the possibility of scale,
never get the launch done because we are setting up our CI/CD, right?
We picked Kubernetes when we may never need it.
Or we spent all this time developing things that we didn't need.
And then it came time for us to need something.
We didn't develop that thing.
Like, oh, I wish I would have had
this incentive system in place, right?
So, I mean, it's difficult to like pick
what's worth building upfront
because some of these things you said can,
if you're prepared to scale,
if you picked, if you rolled out a Kubernetes cluster
from the beginning,
and it turned out that you had this huge launch
and now you're scaling and wow, it's amazing.
We can just get more nodes or whatever and it worked
as opposed to like an on-prem MySQL server
that just hit a wall and you're done.
And so, especially now that you're talking to startups, right,
who may or may not have to scale,
are there ways you can help people,
help us think about these things where it's like,
what's worth building now? And what is premature optimization that's going to be
completely a waste of my time and never, never push my business forward?
Such a good point, Jared, because I've said over and over again to my teams and to, you know,
various folks that I'm advising and coaching, everything's a trade-off, right?
And it's not obvious.
You have to assess the cost and the benefits.
And a lot of times for startups,
being first to market really matters.
I think you want to be really intentional sometimes
about accruing technical debt,
and that's perfectly fine
because you're eager to get something
in the hands of customers and see,
do we have product market fit or do we not? And so being able to be thoughtful and intentional and make those decisions,
I think a lot of the times, definitely don't try to over-engineer something if you don't even know
if you have product market fit. Get something lightweight out there, get a prototype out there
and see what kind of reaction you get and learn from your users.
GitHub has one of the sort of philosophies, I guess, is called ship to learn. And I like it,
and I hate it. I kind of wanted to burn it down. But I also appreciate it, right? But it's like,
what I want to do is add nuance to it, which is ship to learn the things you should ship to learn and be really deliberate about the things you need to be really deliberate about, if that makes any kind of sense. And so
like, what kind of decisions can you unwind quickly? Right. And so, you know, I love ship to
learn for like UI features and UI changes. I think that's really healthy and good and where you can
iterate quickly, but there's, then there's changes where like,
this is going to be really hard to back out of. Like I'm, you know, writing this data schema
and, you know, it's going to be like difficult to undo this or I'm adopting this new infrastructure.
I'm not going to ship to learn it. Let's have a design doc. Let's talk about it. Let's, you know,
really get the right set of eyes on it. Like, I'll tell you,
I set up this engineering wide design review process at GitHub. It's really good. Half of it
is a communication tool, right? Sure, people got really good feedback on their design docs.
And by the way, not every little thing needs to go to engineering wide design review, right?
There's layers and you think about like, how broadly impacting is this change I'm making?
If it's just on my team,
then let's just do a design for my team.
And actually, maybe it's just something
that I'm going to ship to learn
and we don't even need a design doc.
But for certain things,
I'll give you an example that the issues team
at GitHub wanted to start using Cosmos DB
because we've been, you know,
a very MySQL backend company. And we have these more sort of
NoSQL use cases cropping up for storing issue hierarchy. Cosmos DB seemed like a good fit.
And so bring it to engineering-wide design review. And then all the various teams who are thinking,
oh, shoot, MySQL is not really working for me either, can come and be like, oh,
here's the use case I have. And it's a communication tool. And you talk about it,
you get it out in the open, and then you get some good feedback and so on. And so, yeah,
everything in life is a trade-off decision. And so I would never advocate for always building
for scale from the start, always addressing your technical debt immediately. No, there's very legitimate reasons to make
concerted decisions there. I think the challenge I see is I definitely talked to some startups
recently who maybe were intentional about saying, okay, look, we're going to just like,
not worry about this technical debt. We're going to hack together this feature and get it out
quickly. But then do you lose track
of that technical debt? Did you forget about it? And does it show up six months later in an outage?
And actually now it's a bigger deal because, you know, various other things happened that built
upon it. And so, you know, I'd always advocate for being intentional about the choices you're making and having a way to track decisions and understand
where you have, you know, things that you're probably going to have to look at later.
And also, by the way, like thinking about, you know, what scaling sort of throttling type
limits can you put into your product initially? So, you know, I can't tell you the number of
times it's happened where, like, wasn't paying attention to that API and suddenly, like, oh my
gosh, a bunch of people have used it for this, like, really expensive use case that we sort of never
imagined. Like the GitHub code search API, people were using to, like, count all instances of their
API being called through all GitHub ever. And it's like, that's
a super expensive query. It's not really what GitHub code search is about. But there were no
limits on the API. And so customers, of course, like humans will do, you know, things the
easy way and if they find a way, and so like, do think about how your products might be used, do
put in place user limits, throttling, anticipate,
you know, how things you might want to be alerted about when you hit certain thresholds and certain
scales, right? I'll tell you one that is a personal sort of concern of mine that I've seen
at GitHub. GitHub has about 40 repositories that go into the GitHub platform. And it's sort of a lot of the newer
product areas are in their own repos and are separate services. But there's also the GitHub
monolith, which is a Ruby on Rails application, which is, you know, issues and PRs and projects
and sort of all the core functionality of GitHub as a code hosting site is really in that monolith. And, you know, we've had a lot
of scaling problems at GitHub with deployments, partially because of the way the active record
paradigm works in Ruby on Rails, where the sort of data layer is too tightly coupled to the logic.
And so people are making database changes all the time. And if you only have a few people working,
that's manageable, But that starts to
become unmanageable pretty quickly with the number of engineers like, you know, beyond that 100-person
threshold, there's certainly more than 100 people who touch the GitHub monolith. And so that's
created a lot of complexity for deployment and a lot of bottlenecks that, you know, need to be
addressed. I can definitely imagine that going back to the decision making, do you
use and or advocate for like a decision log or some sort of like a place of record? I've heard
I've never done this, but I imagine at scale, you'll want to have like, here's the decision,
we went with Cosmos DB for this product. Here's the analysis we did. Here's the decision we made.
Here's the constraints we were working under or the assumptions. And this is why you picked it. I've heard people say you got to have one of those
because, you know, the short term memory of an org, especially in software world, we churn so
much, right? People move on and switch roles often. And so you don't have that institutional
domain knowledge stick around very long. So I've heard decision logs are a great tool for that kind of knowledge. Your thoughts? Yeah, look, any tool like that is as good as it is findable
and as good as it is clear and part of the culture. And I'll give you an example at Google.
You may have heard of GoLinks. GoLinks is a company, I think that was created based on the way
linking worked at Google, where basically, if you knew a product area, you could type
go slash that product name, and you would land on their documentation. It was just fantastic,
because everyone used it. But that's a cultural thing, because everyone knew where to look.
I, you know, I've talked to someone who was working at DuckDuckGo recently, and they use Asana for everything. And they do decision logs, and they have,
you know, just a very clear process. And everyone knows to look there, and everyone does it. So
you can't just have a decision log without the culture to go along with it.
Right. You gotta buy in.
You gotta have buy in. And you gotta, you know, you show people that this works, and that it's
usable.
And then it becomes advantageous and then people buy into the culture.
I spoke recently to the CEO
from a company called Dream Team.
And they have a project called Kata
that I'm keeping an eye on
because it looks really good
in terms of this sort of project management.
They do integration with Slack,
integration with GitHub, integration with Jira.
And again, it provides that functionality of everyone knows where to look. So you can set
up a decision log in that product and type on Slack the right keyword decision, and it'll end
up there. And then people don't need to look around. I think one of the challenges I've often
seen is like, yeah, let's document this decision in a Google Doc, or maybe this one's in a repo, or maybe this one is somewhere in Slack.
And then, you know, that's cool. But if it's not findable, it sort of doesn't matter. So to answer
your original question, yeah, I'm a fan of lightweight decision logs, you know, like,
I'm a fan of design documents also. And chances are your design document points to your decision.
But even more so is that culture you need around how are we doing things and where are things found to be a really big challenge.
You know, I'll say even like even org chart, right?
Like at GitHub, there wasn't a great org chart.
And one of the engineering directors on my team
wrote a new org chart.
It's the org chart we use now.
And I was like, oh, Harry,
thank you so much for doing this.
Because, you know, even just being able to find
who's working on what,
you know, what person should I talk to?
You really have to be careful.
And again, this comes to that 100 person scale
around informal networks
and needing to know someone who knows someone to find out the information you need. As much as possible,
when you get this information into systems, then you can find the answer on your own and
it's easy and quick.
You know, I think when you have that informal culture of network and, oh, I'll just ask
so-and-so who will know, then you propagate meetings. You know, in this remote culture, it's never just a five-minute question. You always book
a 30-minute meeting with someone to ask them maybe the one question that you had. And so then you're
sucking all your time into meetings. Whereas if you have clarity of where to find information,
you know, that can really go a long way. Yeah. I'm kind of glad you went that direction because,
Jared, I was thinking that same thing, but your question was slightly different than what I asked it.
But it was more like, how do you choose the tools to communicate?
Because it seems like you're a clear communicator.
It's if you can find it, like you had said, and you have access to information, you don't have to have so many meetings.
You can rely less on your network because you have to know somebody who knows somebody to get access to the information.
But, you know, when you're in hundreds and then to thousands, you know, I'm not asking you
to use Slack over Jira, do you use this over that? But how do you organizationally choose that what
becomes culture, the tools you use to communicate? Like, how do you do that? Do you build your own
tools? You know, is that invented here kind of situation? Because even at small organizations
like ours, which is a very small organization in comparison to yours, we still don't have a clear culture of if you want this information, go here to find it.
In lots of cases, it's in code and we can go find it in our GitHub repo, of course.
But like if it's written, there's like probably three different places we may have used over the last five years.
So our culture has not been adopt one tool, use it heavily.
It's been fractured across many tools, never consolidate.
So how do you, at that scale, hundreds or thousands?
Well, don't feel bad.
Okay.
Yeah, don't feel bad because that is super common.
Yeah, we're also early adopters.
So like we try out every new thing.
And so that's part of what we do.
So there's some of that culture,
like we're going to try the new thing
and see if it works for us.
And so, yeah, we have, you know, knowledge bases spread amongst...
I mean, project management is not the core competence of, you know, unless
that is your, your business, like, you know, this Kata product that I was talking about,
that is their core business. So they should use it and they should build their own thing and they
should make it amazing so that everyone else can use it. But you know, it's, it doesn't matter,
right? Like, is it Asana? Is it GitHub projects? Is it Google Docs that are well organized? Pick your battle. I think a lot of
things can work. But with lack of clarity, every team in your organization will do something
different. And that's when you get into trouble. So yes, you know, just standards and
consistency. And you don't want to, I mean, we can go back to everything's a trade off. You know,
you don't want to be too heavy handed about things and be like, you must work this way. I was going to just ask that, like, do you just dictate it?
Yeah. But there's certain things where it's, it's a virtuous cycle. I think where you say,
this is, you know, where we put design docs, everyone do it because then you'll find the
design docs you want to find. And that's a good thing. So, you know, please do this and use,
you know, you can, you know, as a leader, I can actually go and say, why didn't you do this?
I need you to do this next time. But the best is when people see, well, okay, this is helping me. And so it's logical. It's not process for the sake of process. I think you have to be
extremely careful about rolling out half-baked process where, you know, it's going to introduce
friction for teams. And, you know, another thing I can talk about, which we touched on in decision
making is different types of decisions hold different weight and can be undone or fixed or
changed more easily or less easily? Well, different types of teams
are working on different types of projects.
And so I've definitely seen the pattern
where a leader will come to me and say,
well, like, why is team A moving so quickly
and team B is moving so slowly?
Oh, well, team A is, you know,
iterating on a UI for something,
which is like important and hard work.
But the pace of that change is different than Team B that's building infrastructure.
And so, you know, I also never want to say, well, like Team B, you should be, you know, having a burndown chart that looks just like Team A.
And I want to see like the same amount of velocity and like, no, team B probably
has to do more prototyping, more research. There's going to be some dead ends in terms of, you know,
maybe what they're investigating. Maybe they have a, you know, buy or build decision to make that's
going to require some research that won't end up in a, in a milestone deliverable, right? Other
than a decision. And so like keeping that in mind, I never want to be too
heavy handed with process. It's the right amount of handedness, if that makes sense.
Everyone has to figure out what that means for their organization.
An adequate amount. That's my favorite saying. My wife says, how much do you want when it's like,
you know, food or it's like an adequate amount. I don't want too much or too little. I just want
an adequate amount right in the middle there. When it comes to,
I guess, not my problem, not that this is a good attitude to have. Like you can say,
this is not my problem when it comes to decision-making. How do you deal with who owns
certain problems? Obviously you got, you know, a senior engineer in place or a tech leader,
somebody that's in charge, but how do you solve for that responsibility layer? Yeah, I mean, this is where,
so when I talk about the things you can do
to effectively scale,
I think I put them into pretty much three buckets.
So there's a lot going on in code health.
There's a lot of advice I have for teams
around code health and developer experience and so on.
There's a lot of advice I have for teams
around how to think about decision
making. And then the final one is culture and culture encompasses all those things and more.
But it's fine. Sometimes something isn't a team's problem, right? Sometimes you want your team
focused on the product area they're working on. They should have a mechanism to surface. Maybe
something's come up. Maybe we've noticed something. Where do you bring those problems? Is there an obvious place?
Is there a spot where you document like, hey, this thing isn't working?
I don't think it's for me to fix, but someone should know, right?
The thing I set up at GitHub, which, you know, it was a learning process.
All this stuff is a learning process.
I think you're never done.
You never say like, okay, I set everything up that I need to do.
And now my organization is humming perfectly and I can just, you know, drink a margarita and whatever. Right. But the principal
council, which was renamed to the architects group, had a backlog where any engineer in the company
could add an issue saying, Hey, I think someone should think about this. And not everything would
get touched. Right. But the principal council was effectively the most senior engineers,
individual contributors in the company, coupled with me and my two peers who were the engineering
leaders. And so the most senior ICs had hands in the code on a daily basis, were deeply familiar
with how things worked, and represented different product domains and infrastructure within the company. And me and
my peers held the responsibility for cross-eng prioritization and funding and were able to,
you know, move people around from different teams. I think, you know, one thing you want to be
careful about is that people don't develop too tight of an identity to the thing they're working
on and that you don't get such siloed teams that it's difficult to move people and say, hey, look, we really
need help over here.
Can your expertise, you know, and what you did in the past come into play over here?
So like me and my engineering counterparts were able to have conversations with people
and say, hey, you know, can you come work on this problem?
We're setting up a special virtual team to really address this thing. Let's get this done. I would always ask one of the most senior ICs to be
champion for any decision that needed to happen. And they were responsible for communicating
decisions around that specific area. And really not necessarily being the lead implementer,
but mentoring and coaching the people who were taking charge of the problem area.
And so, yes, it's fair for people to say, this is not my problem, but there should be a mechanism for, you know, important things to get surfaced. Does that answer your question?
For sure. I mean, the fact that you have some sort of garbage collection, essentially,
which is what that is, it's like, it's almost like, how would you write a program or a compiler
or something like that? It's like, well, you need garbage collection. That's kind of what that is.
Like, this is not my problem, but it is a problem and somebody should know about it.
And you've got some sort of organized body willing to, you know, have an inbox for that,
big or small, and then find ways to communicate that back to you and others who are leading the
organization at a larger scale to say, you know, how do we deal with this in some way, shape,
or form? Because the it's-not-my-problem situation is a real challenge, because you might
find that issue, but it's like, well, it's not mine to fix, as you said, but somebody
should know about this.
Who do I tell?
Oh, I'll tell nobody.
Let me just get back to my job, climb my ladder, doing my thing.
Okay, cool.
You know, and we can't have that.
And there's, you know, there's also
like the DevSat survey that I talked about, right? That's a great way where you're asking
your internal engineering teams anonymously, tell us like, what are your biggest pain points? What
are the things you're most worried about? What are the things that are not working for you?
And you can, it's not just the squeaky wheel in that case who's going to get the attention. You can see aggregated over your entire group, hey, look, true story.
Every single person is talking about how painful deployment is.
That takes trust, though, doesn't it?
It does.
You have to have trust in an organization to say those things and not get the backlash potentially.
And then you have to have a frequency in some sort of case to get that feedback often enough, right?
I think you're so right.
Trust is so important.
And so all this stuff plays into culture.
I will tell you, I did AMAs with my team fairly frequently.
AMAs is ask me anything, right?
And I was so happy when I would get really pointed, hard questions.
I'd be like, this is, you know, I don't love this question.
But I'm glad you're asking. Because then I feel like you trust me that I'm actually asking you
to ask me what's on your mind. And you know, if you're only getting softballs, you're only getting
easy questions, then you really have to ask yourself as a leader, like, are people scared
to say the right thing? Yeah. Is there freedom of speech here? Yeah.
Is there? I mean, yeah. And sometimes it's like, look, you know, you got to move on. Like,
you can disagree and commit on this. This is what the answer is. I know you don't love it,
but we got to be able to move on. But other times there'll be things that I'm not even aware about.
I tried all sorts of experiments. I did one time, I did an anonymous AMA, which is a really funny experience.
I think it worked out well, but I had people anonymously submit questions. And I should have
had a part, I should have called you guys to interview me and say the questions or something.
But I answered, I did it, I did it by myself. So I did like a one hour recording of myself by myself
answering these questions. And it was nice, because I was able to, you know,
gather some data to answer some of the questions too. But there were some really hard questions
during the pandemic. And, you know, there were a lot of things that people were worried and
insecure about. And I just thought, you know, I'm really happy that people felt safe enough
to ask me these questions and that I would be able to answer them. I think that that is really
important. That's a cultural thing that you can't undervalue. And even in the DevSat survey, one of the questions
that I would ask is about psychological safety, how decisions were made on your team. You know,
so it would be, there's a lot of questions around the specific developer experience,
but there are also culture questions on there that then with that survey, I would give it, you know, as a
leadership survey. So I was interested in the broad trends across everything. But then, you know,
it was a survey that each manager who had enough respondents would get so they could specifically
look on their own team. Do I need to set, we used OKRs, which are objectives and key results every
quarter. So set some goals. Do I need to
set some goals around psychological safety on my team or maybe around some other process that's
not working or on-call? On-call was a big one. Like people are really stressed out about on-call.
Maybe we need to do more training. So that was a use of the survey too. And then actually the
third group that would benefit from the survey was specific product areas. So like GitHub, we decided
that the paved path for development at GitHub was going to be using Codespaces.
And so, you know, when we rolled that out, of course, we got lots of interesting feedback on
that survey about the experience of using Codespaces. And so that was valuable feedback
to the Codespaces team to be like, okay, you know, here are some things we can focus on.
We want to make our internal customers really happy. And that's going to be, you know, important for them making our
external customers who we have less access to happy as well. I kind of know what you mean by
this. This is sort of a question to kind of get deeper at it. But when you say psychological
safety, what do you mean? Like, how does that translate to actionable findings and details?
Like, what actually is that? Yeah, because, you know, I have to say,
you have to be careful about over-broadening
terms like that, right?
Psychological safety does not mean
that no one can give you constructive feedback, right?
And that's really important.
I think, you know, when I talk about, again,
scaling eng teams and culture,
this is one that's coming to bite a whole bunch of startups.
And I think it was a problem at GitHub as well, where people conflate kindness and maybe pleasantness or something like that.
And so, you know, sometimes it can be really hard to make good decisions if people are too scared to say the real thing.
It's actually, and I'll get back to your psychological safety bit, but it was fascinating to me when I rolled out eng-wide design reviews, because
the first design review happened, it was, you know, a topic, I'm trying to think of
what it was. It was something around monitoring and alerting. It was good. And this is important.
It was going to affect all of engineering, right? So perfect thing for a design review. I'm hosting the session and I'm getting all these DMs, right? And so the way I
would set up a design review is people are supposed to be informed coming into the room.
You want to make the high bandwidth meeting as effective as possible. So everyone's read the doc,
you've put all your comments on the doc. The design review is for resolving comments that
can't get resolved asynchronously, right? And so then we're in the room and I'm getting these DMs and people are saying like, this thing won't work. Like this thing they're proposing, it's never going to scale. And I'm trying to host a meeting, but then I'm DMing back like, can you say that? And people were like, well, I don't want to be a
jerk. And it's like, well, it's not a jerk if you're telling a team, you have very relevant
experience. Look, you've done this before. You know, this team needs to hear what you have to
say. Don't just DM me and try to get me to say it. It's going to come better from you. You've built
this before. And so that was like a cultural barrier to overcome where GitHub had come from
this history of consensus building, which is problematic also, right? Like consensus is great
when you get it, but you can't live by consensus, especially when you start to scale. You need
directly responsible people who are accountable for decisions, who are going to make unpopular decisions. Not every decision you make can be popular, right? And so I actually
took over one design review just to talk about culture and be like, hey, how do we have these
hard decisions where you're not being mean to a person? You're not saying mean things about that
person. We need to be able to talk. It's the thing about blameless post-mortems: human error might have happened in an outage, and
you have to be able to say that and say, here's some automation that we could put in place that
would make it less likely for that to happen again. It's not an attack on the individual, ever,
but we have to be able to learn and grow. So that's a little aside, because I get nervous sometimes when we talk about psychological safety without that framing. But psychological safety to me is
being able to say things that you're worried about, things that are on your mind, things that
you think are important without fear of retaliation or retribution or fear, you know, like, and that
is invaluable, right? So I always want my teams to have psychological safety so that they can ask me hard questions so that I can realize, oh, I had no idea that this was such a problem for you. And by the way, the last 10 staff engineers that I spoke to told me the same thing. Wow, now I'm going to do something about it because clearly this is like, you know, a big problem. And so if people don't feel safe bringing things up, then you just don't get the information
you need. But that's different than being too pleasant or too kind, right? You know, empathy
coupled with accountability. What does this liberty do then for toxicity? Does it squash it completely?
Does it just expose it further? Hey, look, toxicity is something I'm never going to tolerate, you know?
And I think that's a cultural thing as well.
Like, what do you tolerate?
I always say how you reward and who you promote speaks more to your culture than anything
you say, right?
And so when I would host training sessions on promoting specifically for staff engineers,
it's like, look, toxic behavior is not tolerated. So that's belittling someone, attacking someone, you know, shouting at someone.
All these things have happened to me in my career. You know, we're not going to...
Complaining. That's why I was, my framing there was more complaining because like you can
freely complain and be toxic. You can be pleasantly toxic too.
And I just wonder how that blends, you know what I mean?
So it comes back to this concept of
knowing when to disagree and commit.
If I tell someone, look, I've heard your point.
Maybe I empathize with it,
but I'm sorry, we're not doing anything about it.
And then you keep bringing it up.
That's being toxic, right?
And so complaining
is not productive when the solution is not happening or the situation is not changing,
right? So I do expect people to be productive. I do also want to hear, you know, about the things
that are bothering people that are maybe not fixable, because maybe at some point in the
future, they will be fixable. Or maybe there's an opportunity to move someone to a different team where that won't be as much of an issue.
So, you know, it's like everything.
It's a tradeoff and there's judgments involved.
But, yeah, there's definitely a time to stop.
Yes, it depends.
Tradeoffs.
The classic answer.
Is that my answer to everything?
Sorry.
No, no, no.
That's not your.
It's just it's what happens.
It's inevitable.
It's more like a defeatist position than anything.
So while we're talking about trade-offs,
you mentioned the three buckets of scaling engineering teams,
code health, decision-making, and culture.
We focus a lot on decision-making and culture.
We talked about code health a little bit
with regards to YAGNI and premature optimization,
things you can do now versus do later,
and how we often trade off code health for speed, shipping, etc. But when it comes to scaling an
engineering org, what are some things you can do with regard to maintaining the health of the code,
which allows everything to actually move forward productively?
Yeah, great question. I feel like this is a podcast unto itself at some
point, if we ever, ever wanted to do that, because there's there's so many things. And you know,
it's overlapping with culture, as is everything that's going to be my answer for everything today
too. But an example where it overlaps with culture is like code review. You know, I love the culture
of prioritizing code review above your own work, right? It's not always feasible. I've definitely had problematic situations where a poor engineer in Europe woke up with so many code reviews in their inbox because all of the Americans... You know, having code owners, and the ability to affect large-scale code base evolution, requires people doing effective code
review. And a failure mode I've seen is where, you know, I had another principal engineer who was
reporting to me at GitHub, who made a pretty simple change to, basically, to keep it simple, the way Go worked
at GitHub. And so basically, everyone writing Go code at GitHub had to review his simple code
review. And that should be fast and easy, right? But it wasn't, you know, I needed to get involved
to escalate for teams outside of my area to say, hey, you need to, you know, after a month,
you still haven't prioritized this code review, you need to do it so that we can roll out this change. And so really having
good code review tools and, again, as we talked about, design review, were important. And then developer
experience and like, at what scale are you going to start thinking more about your developer
experience is really important from a code health perspective. I'd
love to tell you a little story about deployment at GitHub because it really resonates with
many of the startups that I've spoken to recently. GitHub got into trouble with its deployment
strategy and is on the right track now, thankfully. But it's a surprisingly common story to see that
in developer experience,
you know, build and test times get longer, and there's test suites running that don't need to
run and so on. But like deployment is a particularly painful one. And I would say
there are like three areas where it really hurt at GitHub. One was just the volume of changes got
too high, too many people wanting to deploy.
And so there, if we're only considering GitHub's primary deploy target, which is github.com,
just the number of different people
wanting to deploy changes on this fairly manual process
that required human engagement started creating friction.
GitHub has this kind of unusual deploy then merge strategy. So for code changes,
you actually deploy your code first, check that everything's working, and then merge back into
the main branch so that main is always available for rollback. It's kind of an unusual strategy
that I wouldn't necessarily recommend because it's part of the scaling challenge. But GitHub
moved to using deploy trains to help with that volume of changes.
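A deploy train, as described here, batches several people's changes into one deployment, with the first person aboard acting as conductor. The following is only a minimal Python sketch of that idea, not GitHub's actual tooling: the `DeployTrain` class, the `capacity` limit, and the `schedule` helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeployTrain:
    """A batch of changes that ship together (hypothetical model)."""
    capacity: int                      # assumed limit on changes per train
    authors: list = field(default_factory=list)
    changes: list = field(default_factory=list)

    @property
    def conductor(self) -> Optional[str]:
        # The first person to board is responsible for shepherding the train.
        return self.authors[0] if self.authors else None

    def board(self, author: str, change: str) -> bool:
        """Add a change to this train; return False if the train is full."""
        if len(self.changes) >= self.capacity:
            return False
        self.authors.append(author)
        self.changes.append(change)
        return True


def schedule(queue, capacity):
    """Group (author, change) pairs into trains in arrival order."""
    trains = []
    for author, change in queue:
        # Try the newest train first; start a new one if it is full.
        if not trains or not trains[-1].board(author, change):
            train = DeployTrain(capacity)
            train.board(author, change)
            trains.append(train)
    return trains
```

Because the conductor is simply whoever boards first, a model like this makes the incentive problem visible: anyone who waits for someone else to board avoids the shepherding work.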
And this is still very manual, though. A conductor would be the first person who got on the train,
would be responsible for shepherding the change. And then there would be all sorts of gamification
that happened. I had a teammate who was like, why am I always the conductor every time I want to
roll out a change to the monolith? And it's like, well, because everyone was hanging back,
waiting for someone to take that role. And you were the sucker every time. Yeah. And so, you know,
this is like a bad experience. And, you know, then I started hearing from people too, like, well,
I won't even try to deploy something after lunch. Because, you know, if I end up, you know,
being responsible for that, like, who knows, I might be stuck till after dinner, waiting around,
so I'm just gonna wait till tomorrow. And so you can see the sort of like
aggregation of friction there, and how much that slows down development is just not acceptable.
In DevSat, the satisfaction survey I mentioned, deployment came out as the highest friction.
And then like all these other side effects that affect code health, like people writing bigger
changes, code review becoming more difficult, changes being deployed become more risky.
So like an increasingly problematic situation. And that was just for .com.
And then, and this is a situation, you know, that happens at a lot of startups too.
GitHub.com isn't the only deploy target for GitHub. There's GitHub Enterprise Server,
which is an enterprise-focused product
where customers deploy GitHub Enterprise Server on-prem. And for them to do upgrades,
they require downtime, right? And so the way this worked was, you know, they'd replay all
the database changes, update the code. But database changes are unpredictable timing-wise.
I already talked about how way too many database changes happen on GitHub
partly because of Active Record and sort of the way the monolith
is sort of like not well componentized across the data layer.
And so, you know, then GitHub Enterprise Server customers
started having an unpredictable amount of downtime for their upgrades,
which is a problem.
Also, most of the GitHub engineering teams were really focused on .com. So like, I got my feature out to .com,
I'm done. The ops team can deal with whatever. And so then this poor ops team is managing the
upgrades for, you know, Apple and IBM and all these big customers, but also lots of small customers.
You know, debugging becomes more difficult because is your feature in the enterprise server deployment or is it not?
There's a whole challenge with feature flags.
We did a really fantastic tech debt cleanup, actually,
around feature flags,
where there had been so many feature flags at GitHub
that were on permanently, or had never been turned on,
or, in the worst-case scenarios, were on
in, like, different configurations
for different enterprise customers.
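The three kinds of stale flag mentioned here (permanently on, never turned on, and configured differently per customer) can be told apart mechanically. This is a minimal Python sketch under assumed inputs, not a description of GitHub's cleanup: the `classify_flag` and `audit` functions and the environment names in the usage example are hypothetical.

```python
def classify_flag(states):
    """Classify one flag from a mapping of environment name -> enabled (bool)."""
    if not states:
        raise ValueError("flag has no recorded environments")
    values = set(states.values())
    if values == {True}:
        return "always_on"   # candidate: delete the flag, keep the new code path
    if values == {False}:
        return "never_on"    # candidate: delete the flag and the dead code path
    return "divergent"       # differs per environment; needs an owner's decision


def audit(flags):
    """Group flag names by category for a cleanup report."""
    report = {"always_on": [], "never_on": [], "divergent": []}
    for name, states in sorted(flags.items()):
        report[classify_flag(states)].append(name)
    return report
```

For example, `audit({"new_nav": {"dotcom": True, "customer_a": True}, "beta_search": {"dotcom": True, "customer_a": False}})` would list `new_nav` under `always_on` and `beta_search` under `divergent`. The divergent bucket is the one that makes debugging across enterprise deployments hard, since the same code behaves differently per customer.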
And so, you know, that became problematic as well. And then the third piece to the deployment
puzzle at GitHub, which was really enough to say, stop, we got to really, really invest in how we do
deployment, was, you know, on-prem enterprise product is not the state of the art. It's not
where most companies want to be. And so, you know, GitHub really had to develop a cloud SaaS offering for enterprise customers. And this is something
GitHub has been working on for years. There's a lot of pressure on it. Obviously, downtime for
upgrades in a multi-tenant SaaS product is not a thing, right? And so there had to be a way
to propagate deployment to that endpoint in a healthy way as well.
There was lots of pressure from leadership to get this product out the door quickly.
And so GitHub did try to take shortcuts, tried various strategies to replay changes from .com to the cloud, and it never could work, never could scale. Especially the frequency and unpredictability
of the time required for database changes
just made that untenable.
Like how do you interleave code changes
and database changes with the right timing,
with the right lead time?
The enterprise product would always end up getting
so far behind that it could never catch up to .com.
And so that just wasn't working.
What an issue there, man.
That's like a big headache, basically.
But it's funny because I've talked to multiple startups
who are in this situation as well,
where they had maybe a community product,
maybe an open source product
where deployment is a little bit more straightforward.
And then now they have an enterprise specific product.
And in most cases,
like the community product is a single deploy target. And the enterprise, it's like multiple
deploy targets, like maybe you have multiple different instances, right. And so this is like
completely changing the game on how deployment works. And so you have to have, you know,
a thoughtful, coherent strategy for doing that for, for dealing with scale. And this is one of
those ones that like, I feel like deployment is hitting
everyone and something that they need to be really thoughtful about. And historically,
the deployment process at GitHub and at many, many startups just depends on so much information
in humans' heads, right? Like, I made this destructive database change, and I know I can't
make the associated code change until, you know, the backfill has finished. And oh, I see that that backfill has finished. So now I'll make this code change.
And, you know, that much information in a human's head can work okay for a single deploy target.
But when you have n deploy targets, forget about it. You know, you're done, like,
just too much complexity to manage. So yeah, it's interesting.
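The rule a person was keeping in their head, "don't ship the code change until the backfill has finished", gets harder with n deploy targets because it has to hold on every one of them. A minimal Python sketch of that check follows, assuming each target can report which migrations have completed. The `blockers` and `can_ship` functions and the target and migration names are hypothetical, not GitHub's system.

```python
def blockers(required_migrations, completed_by_target):
    """Return {target: [migrations still pending there]} for targets not ready.

    completed_by_target maps a deploy target name to the collection of
    migrations (including backfills) that have finished on that target.
    """
    missing = {}
    for target, completed in completed_by_target.items():
        pending = sorted(set(required_migrations) - set(completed))
        if pending:
            missing[target] = pending
    return missing


def can_ship(required_migrations, completed_by_target):
    """A dependent code change may ship only when no target is blocked."""
    return not blockers(required_migrations, completed_by_target)
```

With `{"dotcom": {"backfill_repo_ids"}, "enterprise_a": set()}` as the status, a code change that requires `backfill_repo_ids` stays blocked on `enterprise_a` even though it is ready on `dotcom`. That is the case a single-target mental model misses.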
Is that the state of deployment right now-ish, I suppose?
No.
Okay, so has a lot of this been solved then?
GitHub's doing really good work.
I would say it's in good progress, but it took, you know, this is one of those things where, like, oh, maybe 1,000-plus person scale.
Right.
Where you had to say, look,
we can't do this quickly. There were efforts to say like, quick, get this thing out the door,
right? And it was an example where it didn't work. I'll tell you other sort of factors that
happened were like, this is obviously an Azure cloud-based offering. You know, we're just going
to like follow Azure process. Well, all of GitHub is using PagerDuty and Datadog and sort of like all the sort of tools you would expect
where Microsoft has all these custom alerting monitoring frameworks. And it was like,
well, actually, I guess we need to like rewrite all our alerts in this other environment. And so
now like developers are meant to be on call and look at
Datadog for this, but like this other system for this. And so, you know, that was just falling
apart from a developer experience. And so GitHub's doing really good work right now on this. And like
part of the key was a bunch of different strategies were tried using checkpoints. And,
you know, this is obviously something that it's a culture thing, too. I'm going to say that every time. Because one thing we didn't talk about today, which we could talk about in
another podcast is platform teams and how you can't expect magic platform teams to solve all
your problems, because you really need to have product engineering involved in you know, the
work they're doing and how they work and so on. But like, every team is going to change how they
do deployment on GitHub as part of this.
And so it's not just a magic platform team off in a corner who's going to solve this.
But the key for GitHub has been really decoupling database changes from code changes,
and really seeing database changes through the entire system before moving on to associated
code changes. And so that slows velocity in some ways.
And you have to work on the culture to say,
okay, .com developers,
maybe you're going to be slowed down a little bit,
but actually this is for the greater good.
And now your feature actually gets out
to the enterprise product more smoothly.
And so that's a win for you.
So this is still in progress at GitHub.
It's not a solved problem,
but I have a lot of confidence in the people
who are working on it that they're making great progress.
For sure. For sure.
Well, a lot could be said, as you said just now.
We may have to do another podcast with you on more topics or have you back next year or more frequently now that we've had you on at least once.
It has been great. Yes, hearing all the behind the scenes and all the challenges that come with leading, but then also instilling the right culture, displaying the right clarity and expectation,
the right documentation, the right kind of leadership.
I think you truly are an example of that.
And I'm so glad we had you on the show because you get to put that on display.
That's awesome.
And now you're on the next hierarchy of your career advising and doing fun things.
I got to imagine that you have people reaching out or there's a way for folks to reach out.
Is that something you're advertising?
And if so, feel free to advertise.
Oh, thanks.
I, yeah, I'm still figuring out what's next for me, but I'm really enjoying getting to talk to a lot of different startups and, you know, setting up some advisory roles, which has been really fulfilling. I will say there's one startup I'm working with that I
just adore called EngFlow. And I've been an investor and advisor for them for a while.
And they're formed from two former colleagues at Google. Helen, the CEO is a good friend. It's
just incredible. And they were the folks responsible for bringing Bazel to the world. And now they're doing amazing things for build and test optimization
and developer experience. So close to my heart. And they actually came in and did a hackathon in
my basement last fall. And being able to be close to them and hear the excitement of everything
they're building was really part of what got me energized and thinking more about this startup world.
So I have them to thank for motivating this change in my life as well.
But yeah, I'm really focused on developer and data productivity.
Those are passion areas for me.
And I really feel like there's a lot of exciting,
important work happening in that space.
So the companies I've been talking to are mostly in that space. And I do think I have, you know, like some good
insight in this 100 person plus scale. So there's a lot of eng leaders who are out there who are,
you know, struggling managing the scale for the first time, and I'd love to be able to
help where I can. And you know, I'm enjoying my life quite a lot right now. I realized,
like I may have said to you, Adam, I felt like I've been grinding for 25 years.
And I realized, gosh, I had never been away for more than one night with my husband since my 10-year-old was born.
And that's embarrassing.
And so we're fixing that and just, you know, enjoying a little time.
Yes.
Yeah.
It's been really good. And so I'm definitely, you know,
on a journey, living my one life and trying to be happy and still, you know, figuring out what's
next. So please do reach out to me if you want to talk. There you go. Well, Rachel, it's been
an absolute pleasure hearing about your journey and all the things you've learned, all the things
you put into place as a leader. And we look forward to getting you back one day, someday soon, maybe for more.
So thank you so much, Rachel.
It's been awesome.
Thank you so much.
This is a lot of fun.
I appreciate you both.
Thank you.
And we appreciate you, Rachel.
Thank you so much for joining us today on this show.
Such a cool, cool story to go through all this scaling from hundreds to thousands.
Such a big chasm. What do you do? How do you care? How do you communicate? How do you speak
with clarity? How do you lead by example like Rachel has done? Well, you listen to this podcast.
So that's one easy button. But hey, you can follow Rachel elsewhere. Links are in the show notes to
where she's been, where she's going, and what she's doing.
A massive thank you to our friends at Fastly and Fly.
And also to the Beats Master in residence, Break Master Cylinder.
Yes, banging beats.
We love them.
Keep them coming.
And speaking of thanks, thank you to you, listener, for tuning into our podcasts
this week, every week. We love it. Thank you so much
for listening to our shows. If you want to go a level deeper, there's one version that's free
and one that's paid, but either way you're invited. First of all, changelog.com/community,
free to join. Join us in Slack. Tons of people in there always talking, like-minded folks
that you can hang out with. And two, we have
a paid membership, Changelog++. That is our membership where we drop the ads, we get a little
closer to the metal, and we give you some bonus content and more. Speaking of, today there is a
bonus for our ++ subscribers. So if you're a ++ subscriber, stay tuned. If not, changelog.com/++.
But that's it. This show's done.
We will see you again on Monday. Thank you. Game on.