Software at Scale 5 - Kyle Consalus

Episode Date: January 9, 2021

Kyle is a Software Engineer at Dropbox. He's currently working in the Core Sync area, in charge of the Dropbox Sync Engine and other related components. He was previously the tech lead of the Build Infrastructure team, the tech lead of the Android team at Dropbox, and a senior engineer at Google. Kyle has been thinking about developer productivity for a while, and his tenure at Dropbox (~9 years) gives him a unique insight into the origins and consequences of some large technical decisions, the value of code review, and making developers more effective in general.

Listen on Apple Podcasts or Spotify.

We discussed technical migrations, test quarantine, the purpose of code review, multi-round code reviews, the popularity and community of Rust, the properties of Go, good software design, the design of the Dropbox sync engine, the role of a tech lead, and deciding between staying an IC vs. going into management.

Highlights

2:20 - Build Infrastructure at Dropbox. Dropbox has an in-house CI server that was written by David Cramer (CTO of Sentry). Dropbox's internal CI server might have been the first user of Sentry. How Bazel came about at Dropbox.
5:40 - Bazel might have been an overcorrection at Dropbox. The danger of copying Google for internal tools.
8:23 - The precision of Bazel and the infrastructural investments needed to keep it running. Moving things into the Bazel world could be challenging.
17:30 - How should a decision maker decide between investing in a customized but potentially superior developer experience, vs. sticking with open source tools or external best practices? For example: Git vs. Mercurial.
24:00 - Enforcing the right amount of "best practices" on the rest of engineering.
26:30 - One way to think about developer tools in your organization.
33:40 - The issues you run into if you apply a systems engineering mindset to developer tools, compared to a product development mindset.
37:42 - What test quarantine is, and why an engineering organization needs it.
47:40 - Viewpoints on code review and how they've evolved over time. The consequences of not catching a bug in code review.
56:20 - Maintaining the reliability of applications that are deployed on millions of desktop hosts and Android devices. User reports were often wrong.
62:30 - Being a long-time language nerd, and the code design of a complex sync engine. Not using standard library hashmaps due to non-determinism.
77:50 - A discussion on Golang.
88:00 - Contrasting Golang and Rust. Diffused benefits vs. specific benefits.
96:50 - Being a good tech lead, and deciding between staying an individual contributor vs. going into management. Being smart vs. being obsessive as a reason for success in software, compared to other things. The difficulty of being a Tech Lead Manager (TLM).

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Thanks, Kyle, for being a guest on the Software at Scale podcast. We were just discussing the direction of this podcast, and it's still brand new. This is episode number five, so I haven't really figured out exactly how I'm going to direct these things. I'm just hoping that I record enough episodes and figure it out somewhere along the way. But anyway, thanks for being a guest on the show. The honor was mine. And just to quickly talk about it: I've known you for around four years,
Starting point is 00:00:46
more than four years now. You were the last interview I had at Dropbox when I was deciding where to work full time. And I remember very specifically, and I've spoken to you about this before, that the interview I had with you convinced me to join Dropbox. At that time, I had just interned at Google for a few months, and I was blown away by the build system and how you could run 10,000 tests in parallel and get results in these nice checkboxes. I was like, this is so amazing; there's nothing outside in the real world that has this stuff. And then we spoke about some of the things you were working on at Dropbox, and you said, oh, you know that Blaze/Bazel thing at Google? We're building that here. And I was pretty much sold. I was like, I have to work at Dropbox. And that's what I ended up doing, joining the build infrastructure team at that time. Oh, yeah. I left that interview thinking, we've got to get this guy on our team. Thanks. And I worked there for three years, so the rest is history. But yeah, I had a lot of fun.
Starting point is 00:01:54 So we can dive right into that. So you were working at Dropbox on the developer, the build infrastructure team at that time, right? That's right, yeah. What was going on? Why was it so interesting for somebody like me? Because you're ex-Google as well. That's how you knew about Bazel and Zaze and all that. Yeah. I mean, so Dropbox spent a few years thoroughly in debt in terms of build infrastructure. And there's a reasonable argument that we sort of overcorrected. But like so, like when I like I was on the Android team initially
Starting point is 00:02:31 at Dropbox. I hadn't done Android, but I joined the team and led the team. And then but I had a habit of kind of ranting after work about how things need to be better in our, you know, developer practice infrastructure. And that led to me eventually moving over to the team that was put together to fix builds and to fix, like, test greenness and to try to make that all work well. And I ended up sort of owning our primary build infrastructure tool, which was Changes, originally developed by David Kramer. And then as he moved on, like it became sort of, I ended up becoming the de facto owner of that. And then, so it was natural as we started thinking about like, okay,
Starting point is 00:03:13 we've got like, we had Python, it was mostly all Python. It didn't really matter what you did build wise. I mean, that's a simplification, but building wasn't as much of a concern. But then over time as we got more particularly like the go stuff made it not so bad but then like as we scaled up and up it became like okay we got to do something better with the way that we are dealing with things that build because there was no there's no build step really we ran some some i guess some puppet and then just kind of started the test scripts,
Starting point is 00:03:46 which they started building in Puppet and that doesn't really work. And we had a bunch of cross-language things. And so after talking to a bunch of folks who'd worked on different, like, you know, cutting edge, because like we were, at that point it was unambiguously multilingual. Like we needed to support C++ for some critical stuff.
Starting point is 00:04:04 We needed to support Go. We needed to support C++ for some critical stuff. We needed to support Go. We needed to support Python. And we really wanted to move away from the model we previously lived in, which was all of the Python dependencies in one giant bundle that gets distributed everywhere. I'm told that was a misuse of the tool by the author, but we, it was, it was, it was a, it was a nightmare when it came to trying to make forward progress. And so the idea was we thought about pants and we looked at Buck and, but there was a lot of ex Google folks. And so we were like, I think Basil had just been open-sourced. And so that seemed like that was the right way to go. That, that was the most developed tool for the job that we were looking for. And again, at the time, the,
Starting point is 00:04:44 we had, we were pretty ambitious in terms of internal infrastructure. And so we thought, yeah, this allows us to build to the maximum thing we can envision in terms of quality infrastructure, plus a lot of other nice properties. And so that started with a grand kind of basal migration that began in the corners of the Go stuff where it was rather easy, and then kind of moved gradually along until it kind of overtook everything and became the default. And of course, the real promise of all that stuff
Starting point is 00:05:15 was to be able to do all the neat things about querying and selectivity and intelligent dependency management. And so I think, I don't remember exactly where, but I think we made some, and intelligent dependency management. I don't remember exactly where, but I think we made some, I think by the time you joined, we were making pretty good steps there and we started trying to really lean on the caching and do things beyond what was possible without a system like Bazel. And so we were just sort of taking off.
Starting point is 00:05:42 And so particularly people who were interested in that and had experience even a little bit in that area was a useful thing for us to have around too. You mentioned right in the start that arguably it was an overcorrection. Can you expand a little bit about that? It certainly is a contentious place to be but i think uh as with anyone like the danger of copying google stuff copying google practices or doing using google tools is that if you're doing if you're using the google tools the way that google is using them with the exact same intentions you'll it'll probably work out pretty well. But like the doing I think that it allowed us to build infrastructure, like it allowed us to build a very robust and elaborate
Starting point is 00:06:37 build and test infrastructure to support some of the development practices that weren't the best to begin with. So it supported models of development and testing that weren't that great. And we expanded the test infrastructure to make us even better at integration tests. And so we try to solve sort of development and organizational problems to a certain extent with build infrastructure, which I think might not have been as possible with other tools. And I think that, like, it's, I'm not entirely certain how much of the precision that is afforded by Bazel is, like, strictly necessary versus maybe less ambitious, less robust approaches
Starting point is 00:07:21 that are mostly good enough. It's hard to say. But we certainly envisioned a world in which we would be, like, you know, the sky was the limit as far as the granularity of caching being very small and everything happening in the cloud. And granted, we still might end up there. I haven't been tracking.
Starting point is 00:07:41 But you start needing more and more people with deep expertise in that very narrow area and you need to build on the distributed systems infrastructure you have to support that sort of thing. That requires work and teams associated with that and support. Whereas I think for a slightly smaller organization with out, you know, like a money printing machine of ads somewhere, like it can be more reasonable to take an approach that doesn't necessarily require that level of investment to get value. Yeah. So one thing you mentioned was the amount of precision that Bazel basically requires you to have, it can
Starting point is 00:08:27 get pretty high. And it seems like you have to keep investing in various parts of the tool. At least I've seen that you need to invest in test selection as an algorithm that you need to come up with on scratch. Now I've seen that there's some companies that have open source, something like that. But at least there's a lot of stuff that you need to manage, like the remote caching infrastructure that comes. And at Google, there are a lot of teams and there's a lot of people who can manage all of that for you. Right. If you have like a, you know, a, say, historically unique, like, systems infrastructure to allow you to, like, have top of the line caching and distribution
Starting point is 00:09:07 of all these things and process management. That certainly helps. And it's not to say that we don't have the infrastructure to support that. It's more that there's a I think there's a growing set of people thinking that the Bazal stuff is really well suited for a lot of the server use cases uh but like for for like say desktop or mobile and web like it still has a lot of value to provide but like the value prop isn't quite as much as it is in something uh like you know if you're just building a bunch of c++. And in the world of modern tooling, you can find language-specific tooling that does things that's really hard to do well in a language-independent framework without an incredible amount of investment.
Starting point is 00:09:54 So I wouldn't claim that it's a mistake. I just think that our ambitions were very much tempered into aspiring to the level of sophistication that like google has and uh like i think that might not be an entirely realistic aspiration versus like there's more of a movement now to like like for instance moving things into the basal world requires a certain level of expertise like if we want to incorporate new code bases and like it's everybody needs to kind of restructure things and learn want to incorporate new code bases, everybody needs to restructure things and learn how to incorporate those dependencies. And there's a lot of value in doing that,
Starting point is 00:10:30 but it requires some expertise versus I think maybe having a more partitioned world with domain experts in their languages or platform is dealing with that. It's hard to say which one is the right move there. I still think that Bazel is a pretty effective tool for this and that we were right to use it. And we should continue using it and continue rolling it out. But I definitely see the argument that it is not the best experience for a lot of developers. And realistically, we see people voting with their feet in a lot of cases to try to find ways to make the Bazel experience not the, you know, maybe the production one, like the release build, but for day-to-day development. Like, people have now kind of, like, again, like, if you imagine a C++ world, building takes forever. You depend on, like, a massive tree of things.
Starting point is 00:11:18 Like, at least in the time when I was working in large C++ codebases, IDs weren't of much value. And they were helpful, but didn't really do the job. Versus, you know, now, like, if you're working in, say, like, TypeScript, like, there are processes that will react in real time and editors that can do clever things. And these things probably may not actually scale up well to very large codebases in the same sense that Bazel does. But, like, maybe you're better off just carving a few seams there and just opting out a few places and living with that than trying to
Starting point is 00:11:50 establish a world in which you can expect real-time responsiveness in your IDs from something like Google or a thing like Bazel. It's a steep hill to climb with a handful of engineers working on it. And it's something that I think that maybe more critically kind of disempowers like frontline engineers from working on these things. Like, I think that particularly with you, if you're a, if you're not like a, just a massive organization with, you know, scores of people working on things, like you're pretty dependent on the people who are working on them on their own team being annoyed by something and stepping up and seeing we can do this a little better and tweaking something and maybe more bottom-up organizing and pulling
Starting point is 00:12:35 in some third-party tools. You can make all that stuff work with Bazel, but it's an extra layer of expertise required, an extra layer of expertise required, an extra layer of integration required. And I think that for a lot of larger scale or longer term things, it pays off. There's a lot of teams that would do better by kind of going off on their own and doing what makes sense to them and pulling in the tools that work right. But then you rotate a few people out, somebody gets an offer somewhere else, and then somebody has to fix that problem anyway. That can be a win or a con. It's not unambiguous. But, yeah. One example I can think of is at least for language-specific tools, Gradle, for example,
Starting point is 00:13:19 provides its own inbuilt caching. So, you don't have to think about Bazel providing caching for all languages seamlessly. And Gradle can do pretty well. And then to your second point, it seems like what you're trying to say is that Bazel gets too complex for an engineer who's not interested in how the build system works and all of the stuff behind the scenes, like Starlark, they're not empowered to make their own improvements to how they work because it's fairly complex. But if you use the language provided tool chain and you have a fairly simple thing like a bash script, an engineer might be able to figure out, you know what, we can just skip doing a lot of work in some cases.
Starting point is 00:14:04 It might not be 100% accurate, but they'll have a much faster editor effect. Which can be pretty important. I mean, it's incredibly important. And I think, like, for me personally, I think that, like, strong editor integration is one of the meaningful advances in, like, developer experience of the last, I mean, I guess it's been around for a long, long time, but it's something that Java's had for a while. I guess Smalltalk had some element of it,
Starting point is 00:14:32 but you never really had the type integration. For me, being able to, I don't need to click through the code, but if you can show me a doc and show me what type something is, and if I type something silly, you can tell me right away, like, or autocomplete. These things are really valuable. And they kind of augment the brain of the engineer doing the work so that they have real-time feedback. So anything that makes it harder to use the standard tools for real-time feedback
Starting point is 00:14:58 kind of creates a problem. And there's a pre-existing problem there of engineers having, to some extent, expectations set by smaller projects they've worked on. what's available to you if you just you know just do the out of the box stuff from like for a small project in that language that that that's kind of understandable in a lot of cases but it's also a bit of a problem in terms of like developer happiness and when we've we've seen that in a lot of places and and more so now i think now that you have like microsoft kind of going big on some of the tools that they provide. And like, you know, the jet brains. A lot of the public available stuff and with language servers and background compilers, like for the simpler cases, it works really well. And so, like, the natural question that and I know that the people using Rust mostly use standard Rust tooling and fall back to Bazel stuff if they have to.
Starting point is 00:16:14 It's just that's the experience that people want. So I think you can live with the hybrid experience. But then you're trying to make both worlds work. And no one really likes that either yeah like it's so yeah it seems like you you should you should stick with like the third-party open source tools as much as possible unless you're in a world where the third-party tools don't work at all and then you have to basically dedicate a large amount of investment, which might basically mean teams or something else, of people focused on developer productivity in order to provide a good experience to developers.
Starting point is 00:16:53 But you should not be somewhere in the middle where you provide a customized developer experience and then you don't invest enough in it so that it's at least comparable to the outside. Right. It's like if your organization is taking ownership of what development should look like and setting the guidelines and saying you must work this way, and the experience you're providing out of the box is like at least from the developer's perspective clearly worse than what they would get just with the standard tooling for that language. That represents a problem. And it's one that
Starting point is 00:17:25 not every organization can realistically invest their way out of. And so, like, I think that a lot of times, at least from a developer experience perspective, it might be more effective to, like, to make the starting point be the standard tools and then find ways to make them work in a larger code base. But let's say you're an organization like that. You've had a bunch of customized tools. And maybe not because you decided to provide, like, a customized experience. It's just that the new tools or the standardized tools have changed over time. It's like Perforce for a while and you haven't really invested in moving into Git. And now you have new engineers who have no experience with Perforce, and they're complaining about, well, why can't I just use Git?
Starting point is 00:18:10 I'm familiar with it. And you don't have the resources to invest into developer productivity, or at least you've decided not to. How do you get yourself out of that situation? If you were an engineering leader, what would you do if you were in that position? If you have tools that are just out of step with what the standard practice is. I mean, it's kind of dependent. Like sometimes like there's, I think with any larger organization, there's necessarily
Starting point is 00:18:37 going to be kind of a learning curve, like learning to do it the way that we do it there. And if you can accomplish the things that you need to accomplish and it's a relatively straightforward translation, then that's fine. People will complain. I mean, this happened with our internal use. We were using Mercurial, which is a tool that everybody, that it's publicly available.
Starting point is 00:18:58 People know it's perfectly fine. It is, in my experience, pretty comparable to Git in terms of the kind of workflows you can pull together. There's a lot of things to quibble about. But realistically for what most developers are doing, it's just fine. But there's already enough friction from everyone who is familiar with one thing being annoyed
Starting point is 00:19:18 that it's not that thing. That while we were saying you have to use Mercurial, there were people building Mercurial to Git translation layers so they could use Git. There's a weird power to the entitledness of developers that makes, like, even if you're doing something that is perfectly defensible and arguably the right thing, if it is not understood to be the right thing, it is an uphill battle that might not be worth fighting. And so, like, kind of matching developers' expectations and kind of, which sort of means matching industry expectations as they change, ends up being pretty important unless you've got
Starting point is 00:19:57 a really good story to tell about integration tooling. Because these things get harder and harder as you, like, as available tools tools, like if you're using something kind of niche or archaic, even something that's not even archaic but maybe novel, like you've chosen to use, I don't know, like darks or something for your, what was the other, some like new tool for these things. There are going to be other tools that assume you use the standard thing and that's going to be a problem. You're not going to be able to integrate with it as well. And so you have to, if you don't have a really good integration or like translation layer or something of that story to tell, you have to like, you sort of kind of have to accept that
Starting point is 00:20:36 you're going to be moving away from that eventually. And so just like not being too tightly coupled, it's pretty useful, I guess, if you're using a tool that is fairly nonstandard and drawing pretty clear lines. And so in the Perforce case, I know that Google was in that position when I was there a long, long time ago. And people were using a Git layer. But I guess for their purpose, it turned out that Git just does not work for the kind of development flow that they were working toward. Use a thing that translates Git commands. That works for most people's purposes.
Starting point is 00:21:12 If that were within a small organization such as ours, that maybe is probably not going to reinvent the internals of their source control system on their own infrastructure. You say, okay, well, I guess we're going to be migrating. We have to kind of figure out how we're going to make that work. And if that's what makes people, like, happy and enjoy what they're doing and, like, be able to use existing knowledge, then that's good. I mean, I think, like, there's a lot of arbitrary nonsense you have to internalize to become a functional software developer. And so any step that you take that forces people to discard the arbitrary nonsense they've already internalized and pick up some new stuff, uh, even if like the end state is slightly better, like that's a big cost.
Starting point is 00:21:57 And so like aiming to avoid, uh, like invalidating useful knowledge, uh, is like a pretty useful investment in developer productivity. Yeah. Even like they say that a little more concretely. Even if Mercurial has some features that are slightly better, unless you have a really strong defensible reason why you have to use Mercurial, like oh, Git just won't scale for us. Or Perforce in this example. We're using Perforce because Git doesn't scale for us. Unless you have something like that, it's just preferable to use the most widely used open source. Probably. The real measure is developer effectiveness, which is hard to measure
Starting point is 00:22:43 by itself. So you'd say say productivity or developer happiness is the closest analog you're going to get. Feature list doesn't really matter much. It doesn't matter that Git is a Byzantine tower of weirdness in certain places and people mess it up all the time. It does matter. But if the end result is that people are happier and more comfortable using a tool that maybe on its merits might not stack up as well as you would think, then like, it doesn't matter what you think it does. It matters like what the end result is. And so people were happy with it, then that's, that's what you do. And so
Starting point is 00:23:19 like, you know, it's, it's sort of a listen to your customers thing. Like you're, you can have a grand theory as to why your customers are wrong, but if they're routing around what you're offering because they don't like it, it doesn't matter whether you think it's better. You have to either sell them on it or be providing something that they want. Yeah, you can't force people to enjoy a tool.
Starting point is 00:23:47 Even if you have vendor lock-in, which you kind of do inside an organization if you're deciding this is the build system that you're going to be using. Users don't have too much of an option, but they can certainly choose not to enjoy it and build any sort of wrapper. And you've got to be really careful
Starting point is 00:24:03 where you spend your, like, you have to budget. Like, there's gonna be stuff, there's always gonna be things that you just, like, places where you need to lay down the law about what is allowed within the organization and what's not, and like, this is how we do it, this is the standard, this is how many spaces we use here,
Starting point is 00:24:20 this is the config file format or whatever. But like, there are, like, they're like, so like, you don't want to do you want to avoid cases where you're forcing people to do things that they don't think is the best thing for them and their productivity as much as you can, because it's going to have to happen anyway. And those are cases where people feel like they are not going like it, like it hurts developer happiness or hurts developer productivity when they have to you know not do the thing that they think is best like you want to allow people to exercise their judgment as freely as they can and so like the ideal case is you're just like if you're providing a landscape where they look at the their choices
Starting point is 00:24:57 the obvious correct choice is what you what you want them to do and so like you can you can really build the ecosystem support for the thing you want. That sells it well enough. If you're selling it by you have to, you better have a real good reason. You were in charge of developer productivity in a sense for more than a thousand developers at a time. I think we still have more than a thousand developers. You really had to think a lot about this stuff. Let's talk about some concrete initiatives. So, let's
Starting point is 00:25:32 say what are some things that are in a developer workflow? There's build and test, edit your code, and then there's make sure your code passes through formatters and it tests and passes all tests. Then it goes to code review and finally it gets deployed and released to production if it's like a server-based thing or released to your users if it's like a client application. What are some of the larger decisions you've made? I would include bug tracking as part of that. Yeah, that's true. So once you release it, making sure that it doesn't have any defects, as they commonly call it.
Starting point is 00:26:10 And you could definitely include monitoring and alerting, but that starts getting into like that's a critical part of development. But it's a line that we sort of we draw a bit incorrectly about how like it's it's like what like what the the craft of development sort of feels like to to like to deploy something in its full cycle and to uh ensure that it's not causing problems uh going down the road is like it's it's sort of a full integrated thing but so like the choice of developer tooling is is really a choice of like is a i think kind of has to be a like rooted in a high level choice about like how development should work in your organization and uh that's that's a that is
Starting point is 00:26:53 a question that sort of like is both kind of more complicated and simpler than the actual tools that you choose like you need to know what you're trying to do and then the tools can kind of fill that in. I think some of that will be dedicated about what tools are available to you and kind of what your legal requirements are and what kind of standard you want to hold. But it's like what does it mean to be a developer? What are we looking for? How do we ensure quality? And then that sort of feeds back in the tools. I think a mistake that we have been inclined to make historically is sort of starting by the like trying to have the tools drive development. And that, I think, is like you need to do a part of that.
Starting point is 00:27:34 Like, you know, you need to base like you need to be rooted in reality as to what you have available to you. And like they're always going to be kind of iterative steps that you take. But it's it's hard to build a good world to sort of from good world from, well, we can build faster or we can build different. Having a strong customer relationship, I think, is really important there. I think that's the top lesson, I think, of like, because you stated that I have been in charge of developer experience for a large number of people. And I think that's, I think, de facto true, but arguably it's like we decided that we needed to have development tools, and then we needed, and then we sort of decided
Starting point is 00:28:16 that teams could manage their development as they needed to. And I think we went a long time without a kind of a larger picture of what that means or what kind of standard we were going to hold or how those things should be integrated. And I think as a consequence, this was a consequence of us coming out of infrastructural debt there. Like where there were obvious problems. Like tests took way too long, were incredibly flaky.
Starting point is 00:28:41 The build was always red. It was hard to get anything deployed. No one knew how anything worked. Like, our local development environments were constantly being trashed by every update. It was a horrible mess. And so, you could go a really long way of making useful progress by just pain reduction. But I think, like like pain reduction can can bring you out of a hole, but it can't like build you a city like it. And that's where like I think that once we got past pain reduction, we sort of we thought about things in like a sort of a systems infrastructure way like, OK, well, we can be faster. We can be cheaper. We can be more efficient in the way that we do the things that we're asked to do. But the questions of are we doing the right things? Are we being asked to do the right things? Or like,
Starting point is 00:29:27 are we surfacing the right information? Is the sort of thing that requires like really thoughtful product thinking and like good relationships with your customers and kind of like that it becomes, yeah, it becomes a more of a product development question then. And, and we've, we sort of assumed for a long time that we didn't need that much of that because like It becomes more of a product development question then. We sort of assumed for a long time that we didn't need that much of that because everyone on the team is engineers. We're dog-tooting this stuff every day. That definitely meant that the things that we touched were pretty smooth. But it didn't ‑‑ it made it, I think, harder to have a ‑‑ like, when we had to think about what was the right tool to build for
Starting point is 00:30:05 improving certain processes, it became like there's a whole new skill set that was required for that, that we hadn't really developed over the years of just keeping things from being quite so broken. Yeah. So what you're saying is that there has to be some kind of product development mindset to understanding what is the best developer experience that should be provided to users. That's a question that has to be answered first before building tooling that is just, that seems better on the surface, like, oh, it's faster or it's more efficient and it's going to be cheaper to run this. So first you need to think about what is a good development experience. And
Starting point is 00:30:45 then you need to build your tooling around that rather than build the tools first and then figure just ship iterative improvements. Yeah, I think that that's true. But I think like it's you can even go like a little bit dumber than that of like of just kind of trusting your customers when they say what they want. Yeah. And because there's, like, there's, that's tricky because there's a lot of stuff that you will know that, like, when you run build infrastructure, you see people doing things and you see that it doesn't match how you think it should work or, like, that's not how this was intended to be used
Starting point is 00:31:25 like you're going to get requests for things to be done that you that are probably not the right choice and it's like you need to push back on those things but like there's i can think of like a number of examples like no i can think of a number of examples over time of there being uh just things that that by the numbers like didn't seem that important, but that assumes that the numbers were the more important thing. And that's like a systems kind of engineering way
Starting point is 00:31:54 of thinking about it. I don't know, our latency numbers are good. The P90 is actually better than it was two months ago. I don't know, this is like a total wasted two hours a week. Who cares? Relative to, we can trim a couple seconds off of this. But practically speaking, if the customers don't know that they have an input there and they don't have a way of driving
Starting point is 00:32:19 what is being built for them, it's hard to do better than kind of incremental improvements and it's hard to respond to changes in the way that development needs to be done. And it's also easy to leave behind users that don't match kind of the dominant pattern. Like the easiest example of us being not connected users was when some of the Bazel stuff first went through
Starting point is 00:32:42 and everything worked real well. And then one of the teams pointed out that this entirely broke the way that they ran tests completely. There was an entire team that basically worked by iterating against tests from their editor, or where they would run the test. And that's a perfectly fine way to develop code.
Starting point is 00:33:03 It's just not the way that most of the, particularly the systems engineers worked at Dropbox. And so it's something that we just didn't know that users were doing. And so when, like, so it seemed like, like we might've known that this was going, something that might've been broken, but it just didn't really come to mind.
Starting point is 00:33:21 And then when, once that was broken, they were understandably upset. And then we had to rethink how that all worked and send some people to sort that out. And it's an easy mistake to make. Yeah, there was no editor integration for something like Bazel test a particular target. Right, because we were thinking about it in terms of how long does it take to do X? How often does this succeed, how can we measure success within a build, within a test run, and not like how easy is it for a user to understand whether their change has done the right thing, which
Starting point is 00:33:59 is a much harder question and requires sort of a different model of thinking and then some different measurements. So in terms of reliability, like reliability or speed of running tests might have gotten faster, but the entire workflow for a set of product developers was completely broken. So at least to them, this still seemed like a regression. Yeah. Like reliably doing the wrong thing isn't that useful. Or like building a very reliable thing that people then like route around. Like you can get really good at building a particular thing, but if most people's development flows are just like, are not touching that particular kind of like testing or building until they run it in CI,
Starting point is 00:34:43 like that's like, I mean, there are different model, like, mindsets for this. But, like, I think, like, we've had some, I've seen some definite cases in which people found the local testing and building to be, like, intolerable. And so they would just, they would have all, like, had the building and testing really only happen once they sent something for review.
Starting point is 00:35:04 Like, that is not out of the norm. Or that't at least i haven't monitored for a while uh and that's and anything like oh oh goodness like they're there's like they've done they've kind of developed their whole change which they expect to be able to land without ever having touched the infrastructure that tells them like this is basically correct. And in terms of a developer workflow, that's really broken. You want to get that stuff in front of them right away. So are there any other decisions, or at least not decisions, but things that you found surprising on the way?
Starting point is 00:35:40 There's all sorts of policy that companies implement. Google has famously readability reviews that other companies seem to be implementing as well. There's all sorts of static analysis, linters that developer productivity teams like your team could have injected. Is there anything that struck out that was something that you learned? You deployed something, like how you deployed Bazel and you found out, oh, it doesn't really work for a particular user.
Starting point is 00:36:07 Were there any other examples of things like this? Hmm. I mean, this certainly is. But I can't, I'm trying to think. The most obvious thing that comes to mind is not one of that specific kind. I mean, to some extent, this requires to have a good sense of what users are doing, which isn't necessarily the case here.
Starting point is 00:36:40 Weirdly enough, I did think that there's a lot of stuff that people feel strongly about in theory, but not so much in practice. And I think that we've seen a few cases where precision and strictness just aren't that important. And people are happy enough just to mash something and have a predictable sense of how long it will take until it works. Like, the two cases that come to mind there are, like, quarantining was a mechanism that we were there. We implemented for over a course of years kind of iteratively, like got more and more sophisticated in the ways that we recognized tests that were problems and then dealt with them. And like detecting the problem ones we recognize tests that were problems and then dealt with them. And like the detecting the problem ones was never something that people,
Starting point is 00:37:28 like that's fairly straightforward. Like that's a data analysis problem. And a real- So just to interrupt you for a second. So what kind of problems are you talking about? What exactly does like quarantine? Yes, so like the problems of doing anything at scale is that like we're developing at scale in particular,
Starting point is 00:37:45 is the risk that every developer makes mistakes at some usually fairly low rate. And even with very good mechanisms for detecting mistakes, those are going to kind of sneak in. And so if you have enough developers interacting with the same code base, that kind of low rate of predictable defects becomes gradually higher and higher.
Starting point is 00:38:08 And so if anyone might break the build for anyone else, pretty soon you have like you're going to be in a situation where someone has always broken the build. And then you can't trust the build to verify your stuff. And so you need to make progress. And then like there's a very quick like everything degrades into brokenness really quickly you learn not to trust anything from the test because you are trained deeply that like this test broke i don't know why it doesn't look like mine it could have been me but it also could have been someone you know three floors up who landed something without really checking it and by the time i check again it'll be be gone. Like, it's, like, the very, like, the basic stimuli of these are, like, obvious, like, psychological training just to stop caring about things being broken.
Starting point is 00:38:52 And so, like, that's the thing that you need to do something about. And so, like, in smaller codebases, you go, oh, build broken, and someone fixes it because we all own the code and it happens infrequently. Maybe you have a sheriff that keeps an eye on it. On a large enough codebase, that sheriffing is a full-time job that requires a remarkable level of expertise. And we've had people doing that for a long period of time. And there's always some people keeping an eye out, but that's not particularly rewarding work and it's a slow thing for a human to do. And so over time, we've found ways to say, this test is unreliable. Once you can tell that from the data, you need the good data. Tests are all about results.
Starting point is 00:39:29 And so if you're not cataloging your test results somewhere in a way that you can query them efficiently, why are you running them? It's good enough, I guess, to have the last result. But anyway, so we started, you want to be able to identify tests that you can't trust, that they're providing weird results, that they're not passing, but they're also not entirely failing. It's unclear what's happening there. It's just a broken test. Like, a failing test is not a broken test.
Starting point is 00:39:54 A failing test is telling you something meaningful, hopefully. But a test that provides an inconsistent result, that's a broken test. It's not telling you what you think it should be telling you. And so we over time found better ways of like initially we would say if it passes in one commit and fails in the next and passes in the next, that was the early model of this. And then over time it became more about like, you know, statistical analysis over time because the code changes and you get a few samples at each level. But when you recognize a test is unreliable, the right course of action is to flag it as such and don't include it in any decision-making.
Starting point is 00:40:32 And then you want to make sure that the team or whoever is responsible for that test is made aware and deals with it. And so that's what we call the quarantine infrastructure. Detects these things, flags them, notifies the team responsible. And I remember when we first started going down this road and thinking, what if one of our important tests gets quarantined?
Starting point is 00:40:55 That's scary. We try to release all the time. We're always pushing. If something that we were depending on for correctness is gone, then someone could land broken code. Uh-oh. And so there was a lot of concern. And there was some people, we had some high-level conversations with people about the consequences of taking high-value tests out of the rotation. And should we page people? Or should we just block all deploys? Should it be everything is considered failing if that's not the case? Should we just block all deploys?
Starting point is 00:41:27 Should everything be considered failing if that's not the case? If we have a quarantine, your build result isn't a pass or a failure. It's neither. And that's a weird state. You should have to figure that out. I think we still have a few cases where we say you're not allowed to quarantine. You have to figure it out for like some really mission critical stuff. But for a lot of the other stuff in practice, like just like it was important like in a kind of a logical correctness sense. But in the practice of it, even those people who were deeply concerned about the idea of their tests being disabled didn't actually care that much in practice. Like, some of the people who seemed angry about it, like, after a few weeks of it being enabled didn't really seem to mind.
Starting point is 00:42:33 And I found that to be interesting. Like, the same kind of pattern applied to when we started, you know, allowing deploys to depend on subsets of tests rather than the full test suite. Like, the initial specification from this with of that functionality, which came from a lot of high level engineers thinking about it pretty hard, was that obviously we need to depend on every test that's relevant to deploy. If you're missing anything, then there's some bad stuff that could sneak in. And then that's a problem. We auto deploy this stuff. You can't just be auto deploying untested code. Which is true, generally speaking. But like there's and so we built a lot of we spent a lot of time building interesting infrastructure to like find the transitive closure of all the tests that might
Starting point is 00:43:16 be relevant to you and doing like a lot of work around that and automatically generating that because maintaining that is a pain. But in practice, that's not really what most people cared about. Being able to functionally just list a glob pattern of the things that you actually care about for a deploy is good enough for basically everybody except for a few minor cases. And then those, they can retrofit what they want to a lot better than if they need to. So it's like that precision seemed important in theory, but in practice, you recognize that your tests
Starting point is 00:43:53 are not your only safety net. You trust people writing the code, you trust people reviewing the code, you're monitoring in production. If you can't tolerate something slightly broken going into production, that's a problem. A good portion of test failures are not necessarily bad code, but inconsistencies between tests and code. And so that's part of the cost of maintaining tests weren't so much about bad code as they were about just not understanding as deeply the role that tests play.
Starting point is 00:44:33 They're a contributor to correctness. But if you spend your days looking at tests and test results, it's easy to think that they are correctness. Yeah, if you're on the team that's in charge of running all tests, you think of tests as, your vision can be kind of like narrowed down to, oh, it's really critical to make sure that all of the tests are passing as often as possible and our users feel very comfortable. But when it comes to our users in practice,
Starting point is 00:45:03 tests are only part of the story. There's so many things. Right. And like, certain legal and compliance requirements kind of feed into this, too. The same applies to code review. Where code review is a very important tool for correctness in some cases. But in a lot of cases, it's more about, like, design discussion and just getting, you getting a sanity check on these things. And a lot of and most people sending most changes, I think we have decent data on this.
Starting point is 00:45:33 I haven't checked recently. But most changes are fine. I don't know if it's 80, 90% of changes are basically okay. At first attempt. That's not as high as you might expect, but that's, when I say basically okay, I mean not in violation of anything. Not in violation of formatting guidelines,
Starting point is 00:45:53 of certain linter rules, no minor mistakes in the way that dead code is handled. Most changes are fully conforming, and particularly things that are reviewed might not be matching the exact rules that you specified for how your build file should be written, but they probably don't do something ridiculous. And in your production monitoring and deployment processes should be oriented toward the assumption that people are going to land broken stuff anyway.
Starting point is 00:46:24 And so these things are all it's an ecosystem of correctness. And like some of the compliance wording is very much about, you know, having very strict, like the tests make it correct. The expert review makes it correct. And so they, they guide you to a world in which if something like, you know, you should hard block something from ever being committed without being stamped by all the appropriate parties. But like in in practice we've found that in most cases like you can you can maybe trigger an audit in cases where you're required to and people will come back and look at it but it's in almost i i i've seen hundreds of audits of things that went in without review uh i can't even think of a single one that resulted in a meaningful change
Starting point is 00:47:07 like it's not i'm sure that there's some feedback that would have been given that wasn't and we obviously you want to make sure that owners have a say in what they're maintaining but like as far as correctness goes like uh optimizing for overall productivity within the scope of like your correctness requirements is more important than having kind of firm fortress walls of errors will not pass here. So has your view of code review evolved over time? It seems like right now code review is just seen as one step that is sometimes important, but in many cases changes are fairly small and trivial.
Starting point is 00:47:49 And they go through like it sounds like without too many rounds of review. It definitely seems like an anti-pattern if you've been an engineer at a company for a few years and you still have to go through five rounds of review. Yeah. And I think I'm not on the developer infrastructure team anymore. Before I left, I was working on how many rounds of reviews something went through. I think that's a really important warning sign for how much that shouldn't be the case. I think things that I've always enjoyed code review as a practice.
Starting point is 00:48:23 I think there's a certain element of maybe self-loathing or whatever where I'm like, yes, tell me I'm wrong. I know I'm wrong. I didn't believe this when I wrote it. I was uncertain about every aspect of this. Let's argue about it, please, so we can feel good about it. But I like getting into the minutia of what is the right way. I think there was earlier in my career, I was more into like we should find the right way. Earlier in my career, I was more into we should find the right way,
Starting point is 00:48:47 find the right patterns and adopt them universally. And so, if there's something where it's not clear what, if there's something that could be better and you're not doing something that could be better, we should discuss that. We should figure out a way, we should fix that. And then once we all understand what the better way is well we'll do that for forever more we'll lint it we'll move on like and i think there's something you said for that and particularly for people new to a team that's important but like uh nitpicking on stuff that doesn't matter is a giant waste of time and adding like round trips is a giant waste of time and like if you want to have a new pattern like fine like we'll we'll apply it to everything. That's
Starting point is 00:49:25 a migration. We should get a migration. That's fine. But I think that, yeah, it's more that like I'm much more worried now than I was a long time ago about multi-round review. I thought that was like you're being a good careful reviewer if you if like everything that goes if anything goes past you fails, like you've done you've failed as a reviewer. If anything goes past you fails, you've failed as a reviewer. I ran a team for a while with a lot of people and a lot of reviews to do. And then it became more about like, well, okay. We need to make sure that people know how to do things right. But review is the wrong place to learn that. They're sending it because they think it's ready. And if you're giving people basic practices guidance
Starting point is 00:50:10 or design guidance within the scope of a review, that's, I mean, if you have to, you have to, but you probably should have had that conversation before. And good correctness stuff is also useful if you can find it, but that's not the best use of a human's time for the most part. Make sure the interfaces are good. Make sure that any tricky thing around serialized data or known problem areas that requires human expertise is good.
Starting point is 00:50:38 Make sure that there's reasonable test coverage. But if you're executing the code in your head and trying to prove things to yourself, that's, like, maybe you need better coworkers. Like, there is a certain phase at which, like, once you've worked with someone for more than, you know, presumably a couple weeks where the first time they send you a diff and you're like, yeah, I got nothing. All right. Go on. And then you're just like, yeah, I know that I can trust them to do things. The aim should be to get to that and to get to that soon because that's a productive place to be where it's just a place to share information about the change that's happening and to offer suggestions about like, oh, we already have one of these. That's my best reviews that I ever have have been things had. Yeah, that looks good. But what about this file that already does that? You go, oh, I spent a day and a half on uselessness. But that's great. My goal wasn't to have more code.
Starting point is 00:51:36 Yeah. That's certainly the point you mentioned earlier, which was it's probably an anti-pattern to try executing the code in your head and try to find bugs through a code review process. But wouldn't you say the consequences of shipping a bug in production is very different from team to team? It is. I think an important consideration is how good your monitoring and your deploy pipelines are. There's a great joy in working in the server world where you control the hardware and the software. At least we do on all the machines. You can, within minutes, go back to the version you were on. You need to have an understanding.
Starting point is 00:52:22 It's kind of like security stuff where you need to know the risk profile to understand how much effort is actually worth applying here. And there's a massive set of changes, particularly experimental stuff. If it's not serving customer traffic, if it doesn't have critical data, if it doesn't have any data, if the risk is low and someone's gonna learn more
Starting point is 00:52:41 by getting it in production, or getting it deployed rather, then they will by having you nitpick the code. Just get it out there. Don't get me wrong. I think that having a good vision for the way code should be structured and what the abstractions are and that sort of thing is really important.
Starting point is 00:53:01 And I think it's perfectly fine to sit on reviews and be like, nope, that doesn't make sense. I think code should make sense. Bold stance. But the bar of what actually matters has certainly gone way lower. Or maybe higher? Either way.
Starting point is 00:53:23 Does it matter if the the formatting is like if you would have if you would we would have named something slightly differently like there's a certain point which that might be useful input but uh like if it's pretty easy to change and not of consequence and like yeah so it's like if it's part of the interface like or like what defines the abstraction like maybe that matters a lot if it's hard to change if it's easy to change and it's internal like nope like doesn't matter like matter. It's really how easy to change, how much risk is there in this. And if the answer is easy to change, not much risk, then unless you're trying to teach somebody something, just let it be. And you need to tune yourself to what you have. I'm now working
Starting point is 00:54:01 in client. So client meaning the Dropbox desktop application. Yeah. Which, granted, we're a modern application now. And so, like, there's updates and such. But like, before, I guess my first work in Dropbox was on the Android client. And I think we were probably a bit too conservative about our releases there. But the releases there were very manual, were user visible. Like we had to justify to our users, at least we thought, like what was happening in the release notes. And so, like, and then once something was out there, like, people might not upgrade for a while. Not everybody was auto upgrading. And so, like, it felt rather important there to anything like or if there's like some data that ends up in the database, like, you just don't control the machine. Like it, once something's
Starting point is 00:54:51 out there, it's out there, and you have to deal with it. And dealing with it might mean that you just have to live with the consequences of that change for the foreseeable future, versus server, like, as I said, like, you can just roll stuff back, and it's not that big a deal most of the time. And so yeah, like, I think you need to hold a slightly higher bar and have like a more kind of intensive release pipeline. But I mean, in modern desktop client development, we have much better practices than that because we control kind of our update schedule for the most part. And so there's like, you can control the population you roll out to and you can like, and all the major platforms have started
Starting point is 00:55:25 to really enable this sort of modern development practices. And so that really helps to make the, the distinction isn't as big as it probably was, or at least that I thought it was, between a server thing that you have like a standard kind of full control push process and a client when you don't. But there's still
Starting point is 00:55:45 a bit of a difference, I think. I think that you still need to be pretty careful about that. So, the at least the Dropbox desktop application, it's released to millions of hosts, if not tens of millions. Can you talk a little bit about some of those safety measures? How do you know that you're not shipping a bug? Yeah, that is the most important question. I mean, like the answer that, like the answer historically has been sort of more ambiguous
Starting point is 00:56:18 than it is like more recently. I mean, it is hard not to ship bugs. We rely, like we've always relied on our users to report them and on, like, kind of automated traces to be sent whenever we detect them. And, like, we've gotten more sophisticated in, like, how reliably we can detect crashes and how effectively we know that we can ensure that, like, a broken version is upgraded even if it can't run. Like, having kind of the basic bootstrapping stuff in place so that even in the worst case scenario, you're going to hear about it. You're going to be able to diagnose it and you're going to be able to fix it remotely.
Starting point is 00:56:53 That's important. Actually, on Android, that was one of the a few of the first things I worked on was adding kind of system logging that we could track and like crash reporting that we could monitor so that we like prior to that crash reporting that we could monitor. So that we, like prior to that, we just hope people would tell us in the forums, which wasn't a good system, but to be fair, like the, uh, the Android platform didn't give us anything useful then either, uh, and so like 2011. Yeah, this was a long time ago. And so like being able to like one, the first thing I learned from that was that
Starting point is 00:57:23 user reports of what happened within the app were wrong. Like, a lot of the time. And I mean, I'm sure they thought it was right. But you, like, we had traces of, like, you know, obviously scrubbed of any user data. But, like, it would be of, like, which classes were running and when. And like, where the crash happened. And the user said the crash happened when they were doing X. Categorically impossible based on the trace. We could fix the problem. But it can be clear, the telemetry stuff is just so important even if you have very dedicated and helpful users, which we do. We've always relied on that and having good telemetry.
Starting point is 00:58:05 But avoiding shipping bad stuff is also important because even if you're the best in the world at upgrading and fixing, if you could live push patches to everyone instantly, better if they'd never have a problem. Like, that's clearly an improvement. But being a sync engine is incredibly difficult. Like there are platform specific things you need to work around.
Starting point is 00:58:27 There's very nuanced states as you're trying to synchronize. Like it's a, somebody could teach, like you can go on for hours and hours about what happens within that system. But like the way that, I wasn't part of the team at the time but the way that they resolved that is they sort of had to do a ground up rewrite and with correctness and
Starting point is 00:58:47 maintainability in mind, uh, engineered to be really difficult to get wrong and easy to diagnose when it is. And they, they spent a long time with a lot of very smart people doing that. And it's still really tricky. Uh,
Starting point is 00:59:01 but so like you also have automated stuff. So we have, obviously we have like a great number of test suites, manually written stuff. And the sync engine has been publicly documented, of course, is written to be single-threaded and to be more or less fully deterministic, at least within the core parts of it.
Starting point is 00:59:20 And that allows us to write tests that generate random scenarios and to be able to replay them reliably. And so, like, I have a bug in my queue right now of something that we found with that. We do find things with this. We have, like, a lot of bulk automated randomized testing sort of trying to fuzz our way toward correctness. Because like I said, I think you need to have a good model to have good correctness. There's a lot of software stuff that you can work your way through by
Starting point is 00:59:49 sort of tweaking it until you get the right result. But at a certain level of complexity, that just absolutely falls apart. And something like our sync engine is a few orders of magnitude above that level of complexity. You can't just bug fix your way toward correctness there. You need to have a good working model that makes correctness the default. And performance the default. That was an important initiative there to make that the case. But even with that, it's still very hard. And we rely, like, and once you have a model,
Starting point is 01:00:25 like you can't imagine all the cases, like it's past some levels of human reckoning. And so you need to, like, if you, something's important, you should fuzz it. Like that's for a sync engine, that's for like basically anything that consumes user input. Like modern fuzzers are pretty good.
Starting point is 01:00:43 Like I think that that's something that we should probably internally apply to everything that uh can be uh like there's there's some very clever approaches out there for these sort of things um like so a combination of static analysis like manual testing people actually having a good mental model of how it works and being designed so that people can have that mental model being built for good testing. And then also, yeah, like having, spending a lot of computational time generating the cases that you haven't thought of in the hopes that you might find that corner case before a user does. It's expensive. It's, it's really hard.
Starting point is 01:01:18 And it takes a lot of time. And so like weighing that stuff stuff against developer productivity is also quite a trick. But if it's the core of your product, you kind of have to. The main challenge with the sync engine is there could be state that is local and there could be state that's remote and it doesn't really apply well. The state of the file system locally and the state of the file system and the state of the file system remotely, could have these subtle inconsistencies that you need to make sure you're not deleting any user data incorrectly. But you also need to perform a lot of deletes. For example,
Starting point is 01:01:55 if somebody deletes an expensive or large folder somewhere else, you want to make sure that that delete is reflected everywhere eventually. So, it's inherently a tricky problem that could be lossy for something that's really important for people, which is their files. So, you need to make sure that all of those cases are tested very well. Yeah, it's definitely a hard problem. I want to talk to you about API design. So you're one of the programmers I know who has the best thinking in terms of API design and how you would structure a code base like that. So let's talk about the sync engine.
Starting point is 01:02:33 You said that it's single-threaded and it's easy to fuzz input. So what are some things that you've seen? And I know you haven't been on this team for too long, but just what you've seen so far. How would you structure the API of this sync engine in a way that makes it easy to test and makes it easy to automate fuzzing and all of that? Yeah. So, I mean, it's so I should I'm still fairly new to the team. So, I don't have much expertise to bring. For me me personally, it's been like,
Starting point is 01:03:05 this is also my first large Rust code base I've ever stumbled into. And I've always like, I'm a long time kind of language nerd. That was, like I'm interested in like, building good models of understanding of software and like ways that that can be effectively translated into efficient execution. And so, it's sort of natural. I remember in college, I learned about
Starting point is 01:03:32 OCaml. And I was like, whoa, you can make immutable things? That's super cool. And understanding type systems and higher level functions always seemed like, wow, you can build abstractions that compose to bigger abstractions. And I think I was actually a little wrong about the way that all works. Like, you can compose abstractions like that, but not usually by being, like, it's harder to do that
Starting point is 01:03:54 by making them more and more detailed. But by making them more and more general, it becomes, like, realistically, it becomes much more easy to compose. Like, once you're just talking about bytes or a message, you know, it becomes much easier to compose. Once you talk about bytes or a message, it becomes much easier than if it becomes a 10-layer type signature. And so, like, I've been of that mind for a while.
Starting point is 01:04:15 I think C++ kind of then delving into Haskell kind of I went about as deep as I wanted to go into glamorizing that. And then after a certain while, I realized that I wasn't producing better code because of this deep fascination with that. I was just producing more code that I was more interested in. And so then after consuming the code of other people who thought that way for a while, I was like, oh, this is terrible to read. And there's no reason this needs to be a template. And like, what is, like, this isn't communicating to me. Like, so that's where I sort of got more interested in API design, I suppose, is that you're like,
Starting point is 01:04:55 like, I don't care, like, about the four type parameters we provided here. Like, what should I as a human understand this to mean? Like, when is it okay to call this? When is it not? Can I just read about this, understand what I pass at whatever turns and sort of internalize what it means rather than kind of have like it's obviously like I'm a big fan of having strong and useful type systems. I just have kind of lost the sense that the stronger the type system,
Starting point is 01:05:26 the better. And so I was, I was a little, uh, I was interested in Rust when it first came out because it's doing some novel things. Uh, but it seemed like it's, it's more precise than it's probably productive for most, uh, software that I've usually been writing. Uh, but the sync engine is kind of a different case. Like's a place where we don't necessarily care about every byte and we don't about every cycle, but lifetime management and safe concurrency are really, really important. And efficiency is also still very important.
Starting point is 01:05:57 And so it is probably the right choice there. But yeah, it is a different way of understanding certain kinds of development. Because we have a lot of async code. So, actually, the sync engine, the entirety of it functions as a single future that runs on a asynchronous executor. Where there is single threaded, like, event model executor that we run it on. So it's concurrent in its own sense, but it's single threaded. And that's, yeah. So from my understanding, it's like a game engine in a sense. You're just trying to converge one future periodically, like in a tick loop or something like that?
Starting point is 01:06:40 Yes, except that it's sort of always trying to converge. It's kind of a weird construction in that sense, but it can be like everything functions kind of with like asynchronous callbacks that go to like little worker threads to do the things that are not deterministic. But the core logic of everything is like, we've, we've got our own, like, like where there are like map hash maps and such that have randomness, we've gotten rid of that. We've got like some special abstractions that have just squeezed all of the nondeterminism
Starting point is 01:07:11 out of the basic libraries that we use. So that if you are in control of how the network and disk operations are happening, which we have special kind of mock systems for that, then if you are running a test scenario with a seed, it will continue to run exactly the same every time. That's one of the ways that the engine is constructed. But it's been actually hard for me in terms of API design, because I think the way that Rust code is structured is a bit different in how there are interfaces that you can implement in different places. It's common to come into a method or a function, I guess, and to see seven generic parameters.
Starting point is 01:08:01 What this does depends on what you pass it, which is like everything that's true with anything that you have higher level. But it's a very different model of kind of behavior and of where functions come from than like a lot of other languages. It's closer to C++. It's like C++ in a principled fashion, which is sort of always is a concession to the machine and a concession to performance requirements. I think if you want to write code that just makes sense to humans and is reliably correct, it's not going to necessarily look like that. But, yeah, reality is a compromise. But so, like, I'm it's designed in a lot of interesting ways. But I think that, like, squeezing out the determinism
Starting point is 01:08:45 that way and having clear types for everything, having clear ownership for everything is pretty key to that. I think the simple features, like you can use pattern matching and non-exhaust is a critical bit of functionality there. Which is to say, I think for language correctness, it is the confluence. It's like how all the features work together becomes important there more so than any particular bit of functionality.
Starting point is 01:09:17 So I'm still getting my rust legs. But it's incredibly complex to deal with. And I don't like that. I think complexity is... I don't, but it's incredibly complex to deal with. And I don't like that. I think complexity is, I don't want complexity, but it's, in some cases, it's necessary. And, but yeah, I think that like, but the code is very well structured. Like there's, I think that like it's,
Starting point is 01:09:41 as I said, like the network interfaces, the API stuff, everything that interacts with something on the machine or on the network is sequestered off with its own little interface. You can run the full system with any of that stuff faked out. It leans really heavily on protocol buffers to have typed representations and resilient representations of the data throughout. And it ensures that you can kind of smoothly iterate and process chunks of data from the database without doing weird or incorrect things. And so, like, a lot of the
Starting point is 01:10:21 errors are, like, that leaves us with very few actual logic errors kind of sitting around. Like, where we find things, it's usually, like, there was some case that we decided to crash instead of handling because it was rare. But now it's not so rare. So we should probably not crash on that. And so it becomes more about trying to translate product decisions into that very nuanced framework. That makes a lot of sense. And think that's pretty advanced. Getting rid of all
Starting point is 01:10:50 determinism by replacing the inbuilt hash map with a Dropbox hash map. That's pretty cool. And then you have these very standard layering of things. So you might, instead of talking to the network, you have like a fake network. That's what it sounds like. So that you can run all of your unit tests with, or you can run like a randomized test with a particular seed, and it's always gonna come back with the same answer
Starting point is 01:11:17 without it hitting the network. So, and that kind of layering, yeah, leads itself really well to randomized tests and it avoids any strange kind of logical bugs because you can assert certain properties of the file. But I mean, like, the interface design is a sort of, I guess, a kind of UX design thing. Like, where you're trying to build a world in which people can make useful and correct changes with the new things you've introduced to the system.
Starting point is 01:11:53 And that's something that I think that particularly within like a very powerful language like Rust that gives you a lot of ways to express a lot of different things. It's a, you have some sort of different tradeoffs to make. Because you can lean on the compiler to tell you about things that you might not be able to see on the page. Like in old school, like C++, it was called Hungarian notation, was a popular pattern
Starting point is 01:12:20 for a while. Because, you know, what does that mean? What does that string mean? Well, we'll just encode that into the name. But in more modern styles of language that support that sort of thing, you can create a type for that. Whether it's a duration or an ID. Or you can create a phantom type on some other thing. And that's a cool pattern there. But at a certain, it is easy to get to a world in which you can abstract something enough where the underlying code, it's hard to say that it means
Starting point is 01:12:53 something by itself. And so, correctness becomes about what it's been passed. And that starts getting, like, it's looking like all the dynamic and static. As you understand Python code, particularly in old Dropbox Python code, there was a problem that you'd say, what does this do? I got to see what we're calling it with. Whatever we're calling with, we're calling methods on that, whatever that means. That's true of any kind of dynamic dispatch but uh like it's easy if there are enough kind of dynamic parameters to something to for it for you to not have kind of an intuition for it
Starting point is 01:13:31 and that's something i think is a bit of a problem or even if like it can be understood like if there are enough like novel abstractions in the in that world that you have to have a global understanding of the system to be able to reason about things in the small. That's a broken kind of pattern in like, how dynamic stuff. The Android app used to have that problem where you would have to fetch the global state from here. You need to know or even changes have some of that going on. You need to know a bunch of systemic stuff to do anything. And so you're not really,
Starting point is 01:14:07 you're not necessarily encapsulating stuff, you're just abstracting. And so I don't have an opinion at this point to the extent of that in the Dropbox desktop app. I think it's brilliantly architected and written by people mostly smarter than me. I'd say probably exclusively smarter than me. But I know that there are parts of it where I think, like, we could probably document
Starting point is 01:14:35 this better. Because it's, like, that's the risk of having, like, a team that is deep in it for a while. And really, like, the risk of smart people is that they build things for smart people. And not everybody is smart all the time. And so, like, that's sort of the paradox of good software development is you want, like, really smart people who care about the details to build things that insulate people from the details and that you don't need to be a smart person to deal with. And so that's kind of a, I say it's like a UX design task.
Starting point is 01:15:09 You want to present the monstrosity that is the complexity of software development as something much simpler. That's how you build bigger things. If you have to deal with the complexity all the way down, it all falls apart. And so that's, like Russ gives us a lot of tools for that and the Sync Engine uses a lot of them, If you have to deal with the complexity all the way down, it all falls apart. Rust gives us a lot of tools for that.
Starting point is 01:15:26 The sync engine uses a lot of them. It's intrinsically deeply complicated. So it's something that I'm very excited to dig in more to like to broaden my understanding and to hopefully use my status as a novice to the team to bring a little bit of a different perspective. You need new people for that. I think with discipline you can keep things pretty clean and pretty simple. But if you have pressure in terms of features or, you know, like you're part of a business, it becomes pretty hard to like hold some abstract standard for its own sake. And I think that simplicity is one of those things that doesn't
Starting point is 01:16:11 always have natural advocates. It's just something that you kind of bump up against gradually versus, you know, performance or like features in some cases, just like pure velocity, like those things, like people will ensure that that gets dealt with because they have an active interest in that. But then once things get past a certain threshold, you realize, oh, this is very, very complicated and it's not really clear what you can do about it. Yeah. To one of your points that you mentioned earlier, you need to be, I think there's like a famous
Starting point is 01:16:44 quote where you need to be really, like you shouldn't be really smart when you're writing your code because you have to be double smart when you're trying to debug it. So you have to try to force yourself to keep your code as simple as possible. And then you have really smart people working on a code base. It can get pretty complex for a new person to debug what's going on. I remember in the Python codebase at least, there was a method that I was trying to look at through code search. Who's calling this method? It doesn't seem like anybody. It turns out that somebody was concatenating strings at a particular place and then calling that method. That's why it was impossible to search.
Starting point is 01:17:26 So it's interesting that language choice kind of forces a particular way for you to think about things like traits in Rust, for example, given that you have this feature, you tend to use it and that forces you to, or that makes you think about your software design in a particular way. And I think that's why the creators of go they push back so much on complexity like like generics and go is such a famous topic like why does go not have generic set but i think
Starting point is 01:17:57 the creators are just trying to push back on complexity that's introduced because then people will probably use generics for features that they shouldn't. Yeah, it goes an interesting example, I think, of what we were talking about earlier of if you don't match expectations, even if it is something that works perfectly well, that it's going to be a problem. I've heard countless examples just internally of people being like clearly upset about not being able to
Starting point is 01:18:34 just find the minimum or maximum number among two numbers um which like see like i'm like my kind of ethos is one where i'm like uh i I like the idea of, like, you know, making a fire from two sticks. I make most of my food from scratch. So, like, I'm kind of, like, a weird retro person in the same brand that a lot of the Go team seems to be. Where I'm, like, I can write an if. I don't care about an if. You know, I don't need everything to be an expression. Like, whether it looks elegant doesn't seem that important to me. It's all going in a function and I'll call it a function. Go hasn't really
Starting point is 01:19:09 it's sort of seemed to me to be a radical experiment in how little can we put into a language and get away with it. But also as an evolution of what was the I forgot what the earlier language was for the concurrency, they had some ideas about concurrency. And they also had the idea that all languages are trying really hard in a direction they don't care about. Because there's a bunch of, there are a bunch of, like, old school, like, C programmers. Like, I don't know, why don't you just garbage collect it? Was that Sozo? No, no, there was a, there was a Rob, one of Rob Pike's, I think it was a research language.
Starting point is 01:19:41 Limbo, I think might have been. That had some of the same ideas around, ideas around very, very cheap concurrency. It wasn't the first one to do it. But it's the one that was pretty influential for Go. But yeah, they the idea that I was there for its development at Google. So it was fun to watch some of that stuff and interact with them a bit. But it was pretty clear. They looked at all the giant messes of C++ code and were like,
Starting point is 01:20:06 eh. And then look at Java and go, eh, I don't like any of this stuff, but C is bad. They like C, they're comfortable with it, but it's obviously not the right choice. So you think, what do we want here? And yeah, it's like, well, you want garbage collection, easy concurrency, just build in the basic types and don't let people do anything too stupid. The rest seemed to come gradually from that. And like, oh, it should build fast.
Starting point is 01:20:32 Yeah. So it seems like they wanted to even have read this, they wanted to build a replacement or not a replacement, like an alternative to C++ and Java, like systems languages that were used at Google at that time. But it turns out that all of the programmers that were using Python and Node.js for their servers, a lot of them decided to use Go instead because of fast performance and fast build times and all of that stuff. You never even know how your decision turns out and who end up liking what you've built
Starting point is 01:21:08 as long as you've built something good that clearly some people like. I think they might have overestimated how good they could do garbage collection or how much advantage they would get from structuring data that says that garbage collection isn't a strict requirement. And I think they also kind of underestimated how people how much people actually liked
Starting point is 01:21:33 being able to like build nuanced abstractions. Like I think like they kind of messed up on some of like, I think like you can't build, it's harder to build like systems where you have like kind of new, I don't know how to like, obviously like you can't, like building like new like collection data structures is kind of a mess in standard go as is like if you want to build something that like, there are certain places like you don't always want generic, but like a few places you want it in a large system it can be pretty important like you know uh like metrics collection or that sort of thing and like you can you can work around without it and it's fine but there's always gonna be a few cases where you end up getting into like interfaces and it'll never be optimal for like those few tight loops and so you're always gonna have to farm it to something
Starting point is 01:22:21 else but then since the cost of calling from Go into anything else gets kind of pricey, that becomes kind of a problem. So it has, yeah, it was, I think they thought it was going to succeed more in the C++ space. I think they underestimated how much people kind of liked that space and liked the power that it afforded them
Starting point is 01:22:43 and the expressiveness. Because most people don't have that like whatever build it yourself thing. In a sense, it's a problem of building a team like Russ Cox is like one of the best software developers I've seen in practice. Like everyone on the team is like like, really top-tier people. And so it's like, well, you know, if you have to build your own Reddex engine, just build it. It's the sort of team where it's like, why would you import something?
Starting point is 01:23:14 Anything you import can be worse than what you'd write. And I'm sure that's not strictly part of it. But it means that the utility that they see in something, it's the smart people writing things, having a slightly different perspective thing where they were kind of trying to build a simple thing that other people could use and would be generally useful. I think that's good. But I think they didn't see the value in some of the things
Starting point is 01:23:38 that other people find value in, both because of where they stand aesthetically and the fact that it's genuinely useful to have abstractions that can check things for you. It's less flexible than a lot of the dynamic stuff, less malleable, I guess, than some of the stuff that has a lot of compile time value and less interesting than most of both sides. So it fits a very weird niche that I never thought was going to probably
Starting point is 01:24:07 take over the world, but it has its adherence. And it is actually what I wanted for a long time. It's like, I don't know. I just want to write loops, have structs, and not have to free it in Malik. Have you deployed any services in the last few months? What's your experience? Yeah, I mean, we've had a few. The thing that I think I talked Dropbox into using Go at some point.
Starting point is 01:24:28 I talked to them about it when we were talking about moving an Edge store. I think I wrote some stuff in Google in Go. I had code search at Dropbox, which was probably the first and largest mirth of Go codebase. And for me, that's like being able to start a command and have it run in a millisecond is great. That's what I want. And being able to serve a web request in five milliseconds. Because the difference for... Without any extra stuff. You don't have to optimize it. Most code is not that dynamic or not that complicated. Not true for things like the sync engine where there's a lot of very nuanced things happening.
Starting point is 01:25:08 But a lot of code is just, you know, doing fairly mundane stuff that you can do with kind of anything. And like, Go will be will have more tokens on the page than a lot of other languages in a few cases. But like, if like just suck it up and accept that, it's fine. And then like the payoff that you get, particularly as you know, people coming from the dynamic space, like you have to type a few more things. But like everything is dramatically faster, deploys easier, and like starts better and has usually
Starting point is 01:25:39 a better concurrency model than what was available to you before. And like that's like being just like, oh, yeah, this is going to be anything that's CPU bound is going to be 10 times faster than Python. Anything that's IO bound is going to be perfectly fine by default. If you need to replace it, it's going to have to be something very specialized. That's a good place to be for most software that's serving requests or processing files or what have you. It's not going to be the best at anything, but in terms of whipping up something that can basically work, it's a pretty good choice. Personally, I haven't seen too many code bases.
Starting point is 01:26:15 Like too many, I would say not code bases, but projects that have found too complex that I couldn't understand. This is one of the things I really appreciated about Go, particularly early on, less so now, but is that if I wanted to figure out if there's some algorithm I was interested in, I found that I could go to the Go source for it. And then that would, it would be more straightforward than probably any other language's implementation.
Starting point is 01:26:42 Like the dynamic stuff would would be pretty simple, too. But without types, it becomes hard to navigate. And then everything else gets bogged down in the language-specific stuff. Whereas the Go stuff, you just can't get that fancy. And if you do, it makes you pay for it syntactically. Yeah. Like if you want to understand how Tim sort works or something like that, you can just look at the Go source. And it is really easy to navigate on Godox. And we have an internally deployed version of Godox as well. So, that helps out. Yeah. I mean, when it first released, I remember thinking, like, this people are
Starting point is 01:27:18 going to hate on this a lot because it's not the best at anything. But I think the most interesting thing you can do with it is generally write useful code. Which is, like, as, again, I had been in, like, C++ and Haskell spaces a lot of the time. And so, a lot of it was about, like, finding the most elegant way of expressing something. And so I appreciated the cold water of like, nah, there's one or two ways of doing this. Neither of them look particularly fancy. Like, what does it do? The community, the amount of love
Starting point is 01:27:54 that the Rust community has for Rust, like, or at least how vocal the Rust community is compared to the Go community has always been interesting to me, given that there's clearly, I think there's more users of Go than Rust. But maybe just the amount of features that Rust gets. I don't know what it is, but there's certainly a lot of luck with it. I think that the Rust people are great. And I think Rust is a monumental achievement in software
Starting point is 01:28:23 engineering. But I also think that there is a that there is an element that people go into software because some people, because they want to be a wizard, they want to put in the work to have special knowledge and be able to do things. And be able to do them ideally. And so there was a while where Lisp was the place where there was a notion of a smug Lisp weenie. You could look down at everyone else. You were the smart one. You had the right tool. You knew how to do it best. Anyone who talked about a different language, you can talk about how Lisp can do it better. You can feel good about being into Lisp. Then typed languages took off and they were like Haskell took that space for a bit. There's always been an element, people in software like to think they're smart and like to show they're smart.
Starting point is 01:29:12 I don't think that's a controversial statement. I know that I score in that, but I personally have delighted in constructing the most clever bits of code. It's what you do. You want to write something that you are proud of. What you're proud of means, like, making something that, like, that someone else might not have been able to make that does, like, is good. And so, like, those that require you to kind of climb a mountain and then you can stand on top of that and say, this is better than what you can do with that. This can do things that that can't. You can get in arguments about Go, and Go usually loses. Our internal Go chat room is people just complaining.
Starting point is 01:29:55 And I don't disagree, but if I have to make something that works quickly, I'm not sure what I reach for. I've thought about that quite a lot, especially in like today's all of the events that have happened today and yesterday with like the capital and all that in that setting there and all of the recent arguments with like Section 230, it's unclear. It's very obvious what you get if you remove Section 230. But I think it's and you can it's not easy to win an argument, like for section 230 for
Starting point is 01:30:33 like the lack of censorship, clearly when there's obvious benefits you get by removing parts of it. But there's also like this opportunity cost, which you have without something like Section 230. How would the next Facebook or Reddit come about if every social media system or every algorithmic recommendation system was treated as a publisher? So I don't have a well-thought-out argument
Starting point is 01:31:01 around that, but it seems similar to the Go argument, where you can't really argue for any particular feature because it probably doesn't have that feature versus some other system. But just the overall lack of complexity of it is what people really like about it. At least that's what I like about it. I can probably go into a Go code base,
Starting point is 01:31:23 fun unintended, and quickly debug what's going on and quickly add a new feature. I don't feel confident about that, even with, definitely not with C++. So that would just be waiting for the compiler. There's a real danger in systems based on popularity and popular discussion of pitting things with like very specific value against things with diffuse
Starting point is 01:31:48 value, both in that it's hard to compare, but like that if it's easy, if it's easy to point to the value of one thing and like, it's hard to make the argument or maybe unpopular to make the argument or like, you know, like, or you can say like, well, if you can answer, like, be smarter like me, or, you know, use a better tool, like there's always a way that you can say, well, if you can answer, be smarter like me or use a better tool. There's always a way that you can blame the person. You can say, well, there is a way you can avoid that. People in C++ have long, well, don't do the thing that is unsafe and leaks code or leaks memory. If you just don't make mistakes, it'll be fine. And, like, there's, it's particularly, like, I feel like in public discourse where the advantages are more uniformly distributed and not, like, unique to that thing, it becomes hard to have a discussion in the large that is reasonable. Yeah.
Starting point is 01:32:41 Another example I can think of with the specific and diffuse, what do you call it, the specific advantages versus diffuse advantages or benefits is iPhone users versus Android users. I think if you pin down half of iPhone users and you ask them, why don't you use Android that has a better, phones have better cameras, better better like frame rates or whatever. A lot of them might not have an answer. We've probably not even thought about it, but they've really found some value in there.
Starting point is 01:33:11 And of course this is arguable, but I think that could be analogous to this. It is also like, I think particularly when it comes to software discussions, it's well also every discussion online, it becomes really difficult to know who has like a horse in this or who is like, because like I've gotten into long discussions.
Starting point is 01:33:32 I used to discuss people, things to people online. I don't do it as much anymore, but I've gotten into long arguments with people to find that like they're talking about the context of like personal projects that no one uses. And I'm talking about projects that with, you know, millions of daily actives. And so, I'm like, oh, like, I of course this doesn't make sense. Like, you don't care about elements of maintainability or which team can use it or how
Starting point is 01:33:58 easily it builds or how easy it is to debug or whether you can get log statements through there at all or what the tooling looks like. None of these, yeah, of course, yeah, values differ in these. And also in some cases, you find that you're talking to, like, a teenager, which is, you know, it's great. Like, teenagers are wonderful. Many of them are better than me. But, like, you're like, well, it's just a different sense of, like, perspective on what
Starting point is 01:34:21 matters. It's useful to have some understanding of what the actual discussion is. I think absent some interpersonal understanding of the grounds for evaluation of something, I find it useful just to see how good something is. You look at what people are actually doing and how people are voting with their feet and like what's, what's being used, what's being used successfully. And people score in that as a way of evaluating something because it certainly is lagging as an indicator. But I know that, like, if it, if it can't work in industry but it's good, is it, is it good?
Starting point is 01:35:04 Like if it can't integrate with anything else, that is a weakness. It could be a beautiful language, but if you can't use existing valuable stuff, that's a glaring weakness. If it's the best compiler in the world and it takes five days to build, it's not the best compiler in the world. That's why I think rust actually comes off pretty well like they they people are building things people have built things similarly i think like the the purpose of of software tools is to build useful software and so judge it by what it builds and like php as much as i have complaints about it like it's a mixed bag but like it's produced some great stuff it's there's some gl bag. But it's produced some great stuff.
Starting point is 01:35:48 There's some glaring flaws. Some of them have been fixed up. The newer versions are good. But yeah, that's the most important evaluation criteria is how does it work out when you use it. If PHP was as bad as people make it out to be, then people would have stopped using it. And I guess that's one way to think about it. But the other way, another example I can think of is the two camps that were created when people tried adding typing to Python. One camp was like, why the heck would I ever need that? And the other camp was like, this is going to be really awesome. You can really compare, you know, the people who use it for data science as a some projects, they don't really see the value in it. With us, millions of lines of untyped, bad Python code. The real question is, what percentage of your time do you spend reading other people's
Starting point is 01:36:34 code? And the higher that is, the more I bet you care. Yeah. Exactly. I want to step back a little bit. I probably need to bail pretty soon. Yeah. So this is going to be one of my last questions. I think this is going to be interesting for listeners. So you've been a tech lead a bunch of times. So I want to ask about that. What do you think constitutes a good tech lead? How do you get into that position where you're technically how would how would you decide whether you should go down like the individual contributor versus like the manager track i think it i mean as with most questions of personal like vocation it has to be derived by a combination of like
Starting point is 01:37:29 what you feel the need to do and what you do well and like for me for example like I do software like I never really thought about whether I would do software it would just was obviously what I was going like I was reading c books like when I was like 12 because like that seemed awesome like it was never like even if I like I'm not judging myself in either direction but if I was even if I was particularly bad at this work I'd be doing it somewhere because it's like it is what I feel the need to do and so it's like a calling yeah and And I don't feel like a moral sense. It's just like, that's the kind of work that I'd like to be doing. And so like being an IC versus being a manager is like, are you interested in the people
Starting point is 01:38:17 or in making the changes? And if you can find like some people like myself, I include, I feel this at times where it's like, there's a certain point at which you're like, yeah, I can code that. I know it's not necessarily a challenge, where it can feel limiting, the amount of impact that you can make as an individual. And so at a real large-scale software development, it's about coordinating people and coordinating initiatives. And if so, if you if dealing with the people side of it and dealing and like scaling on that level is more interesting to you at some point than like actually making the contributions and digging in, then then that becomes like that's probably the right call. And particularly if you have an aptitude for it. I know there are some people who would prefer not to talk to people. And while they might like to scale themselves,
Starting point is 01:39:11 maybe they might scale better by example or by documentation or by initiating, driving initiatives, spearheading them. That's a different thing. Management, it varies from company to company, but it's always about making sure your team is working well and is effective. And like, that is a, you need to understand software development deeply to do that well, but you don't necessarily need to understand
Starting point is 01:39:35 the software super deeply. I mean, like it's about the people. And so like managing engineers is a different kind of job that's related. But yeah, if that's, I think that people tend to go that way later in life because they've seen more of it and the novelty and excitement of shipping yet another version of something might be a bit lesser.
Starting point is 01:39:56 But also, yeah, it's a way to scale up. And a lot of companies make that the only way to advance. And I think that is a bit of a mistake that pushes some people maybe who wouldn't prefer to be in that position. But I also think there should be ‑‑ I'm also of the mind that there should be not that many managers. I think that ‑‑ I think it's good to have the managers focus on the people. And to, like, allow engineering to drive engineering as much as possible. But isn't a tech lead kind of in a murky position?
Starting point is 01:40:30 So there are different schools of thought on tech lead positioning. There's the lead violin model where it's like your first chair engineer, that's the person who is like if you're going to ask someone to do something hard, someone has to resolve a question, someone has to speak on behalf of the team, you know, like, that's the person you have. And there's some that's more, like, becomes kind of a combo manager, product manager, like, managing execution and that sort of thing. And, like, I can, like, as someone who's not, like, that excited about managing execution stuff, I like the first tier model more.
Starting point is 01:41:07 But I think there's probably an argument for either, depending on how you structure it. So you need to understand what the role means. But I think that in either case, you have to be someone who's most interested in how the team is doing and the success of the team that you're leading. Like you, the technical success of the team is, is what you sit up at night thinking about. And if that's not where you are or where you want to be, it's probably not the right thing to do. Like, that's why, like, there's the idea that you would give it to the strongest engineer on your team,
Starting point is 01:41:38 whatever that means, isn't always quite correct. Cause if they're just not that interested in what someone else is working on or in what no one is thinking about, then that's a problem. Technical leadership also means caring about processes and in cadences and some level of execution, regardless of what you're doing. And it means being less technically involved in most cases. You're replacing some software work with work
Starting point is 01:42:07 around at least coding work is being replaced with something else. And that's the main complaint that I hear from other tech leads. I don't feel like I write. I review code. I talk to people. And it feels easy to feel like you're getting out of step with what it is that you do, what defines your value. And so, you you have to like be where you can find value is goes a long way. I'm like, if you can't find the rewarding, if you can't find it rewarding to see other people succeed with your help, it's really hard to be happy being something that's not an IC. And that's something that like a lot of, I've seen engineers or seen tech leads and managers kind of have to grapple with. It's like, how do I feel good about this? What did I do? Like I said,
Starting point is 01:42:51 I said at the meetings, I talked to some people I didn't, nothing has my name on it, really. The impact is nebulous. Yeah. And the flip side of that is if you do care about the success of your team and you care about processes and initiatives and you start showing that even before you were a tech lead, you're basically positioning yourself to become something like that eventually. Yeah. I mean, like, it's sort of a dress for the job you want sort of thing. But if you're doing the work, that is the work.
Starting point is 01:43:18 Like caring about, I think, I mean, I think having a poor attention span can get it confused for being a tech lead or having bad focus. Because I think as a lead, you need to concern yourself with all the stuff. You should be able to point people at a task and let them focus on that. And then everything else that's going on, what should happen next, what isn't being dealt with, how we're going to deal with these alerts, What's the next thing we need to, you know, which team should we be talking to? Like those elements, like that's something that are important technical questions that are just not, like someone needs to make that call and have that conversation. And like you can't just expect that to just happen by itself.
Starting point is 01:44:00 I mean, it might. And if somebody is doing that work, you should make use of that and they should talk to you. But also if you're just doing that because you don't feel like doing your main work, that's, that's maybe not the exact same thing, but I think I might've done that a couple of times, but, but no, it's yeah. It's like, if you care, if that's the stuff that you care about, like you're sort of, you're showing yourself that this is what I want to do. Like I want to, I want to, I want to manage the health of the team and ensure that our projects are good. And if someone else is making the commits, that's fine.
Starting point is 01:44:33 Maybe someone else is better suited and maybe they should be doing it instead and maybe you should stick to your task. But it depends. We've played with Dropbox having it be not a, trying to say it's not, you're not in charge, it's not above having it be not a like trying to say like it's not and it's not like you're not in charge you're not it's not above you're just it's just a role someone plays like you know is the you know sports team position more important than others I don't know sports team positions but like it's like it's just a different position on the field uh and but practically speaking you you are wielding authority and like you do have the say in matters like where if someone I might say I want to do this and you might say no, like that, that if you have to, you have to be in that position. I think, I think like the tech lead needs to have the ability to dictate like what is acceptable technically and what is not. You have to set standards. You have to set expectations. But you also have to be able to take the lead
Starting point is 01:45:25 in that. And that's one of the things that kind of distinguishes between a manager. I think you need to maintain those skills. You need to be in the trenches. You need to be able to answer the questions. And that's why I think being a tech lead manager seems incredibly hard. If you have a really mature team with a lot of that's very collaborative, I think you can be fine. Because people can kind of, the need for someone to be the technical leading voice might be lesser. But it's a rare person who can be the voice of reason and the person who understands all the things that's happening
Starting point is 01:45:59 and also manage all the interpersonal things and execution. That seems hard to me. I don't think I could do that. Yeah, I don't think we have that many ELMs exactly for that reason. It's usually for smaller teams where it's not like, I think I suspect it's for people who are interested in management, but also don't want to give up being an IC. And that's a perfectly reasonable thing. I mean, like every team is different and it's like companies have their ideas of the roles, but like the roles are defined by who's there and what they do. And so there will always be some people picking up the slack
Starting point is 01:46:31 in one area or another. So it's like a process of wounding your self-discovery to figure out if that's the right kind of role for you. Yeah, I mean, I think this is where i think it's in like we're we're blessed to be able to work like in a field where generally we like we're we want to do what we're doing and like i don't want to speak for everybody but like that's like i know that there's a lot of jobs where people do it because because they're paid but like i'm i'm angry by broken things i want to delete things and i want to solve problems and i'm
Starting point is 01:47:05 curious why this works and i think it'd be cool to build that and like that's like that's and uh we attribute a lot of things to people being like smart or gifted and there's there are elements of that but i one of the things i realized as i become a parent and have no free time is that a lot of what i chalked up to me just being like naturally good was me just being obsessive and like actually like being doing these things in times when it wasn't technically work you know and thinking about stuff on the way to work on the way home from work you know like on the way when I was doing something else like tinkering with stuff on weekends like that's and so like one thing that it seems obvious to me is that everybody is better. People can be remarkably better at their job if it's something that they they're
Starting point is 01:47:49 excited to do. And that's where like, uh, like the, it's important for organizational health and for individual health and for like effectiveness. Like if there's something that someone really feels like doing, but I'm doing it for something that one doesn't really feel like like engineer happiness looks weirdly like effectiveness from most angles and it's much easier to measure. And so it's like, that's the, it's a really good thing to go on. And that's why like, I think sometimes if like, even if you're doing important work, if it's not that exciting and you know that you're excited about the field, that's a problem to be sorted
Starting point is 01:48:25 out. It might be that it's not that efficient. It's not that drudgery and toil is not rewarding. But we live in a, in software, you can usually automate this stuff. And so most of the time when something is boring or not rewarding, it can be a signal. And if not, if it's unavoidable then maybe like there's a new domain that people should switch to and like we were lucky enough to have a lot of options and so i think like when it comes to choice of role a lot of it comes down to
Starting point is 01:48:54 like if you're if you're excited about it do it like i've been pushed toward management things a couple times and i might go that path but the last few times I've been pushed, the very idea of it was horrifying to me. And so, like, whether or not I'd be good at it, I wouldn't be good at it if I don't want to do it, you know? Yeah, I'm super grateful that I actually like software engineering because it's such a good field to be in anyway. And then I hope for you, you get to delete code for a long, long time and I just want to say thanks for being a guest on the podcast
Starting point is 01:49:31 I had a lot of fun yeah cheers for my pleasure that's been a great hour
