Software at Scale - Software at Scale 23 - Laurent Ploix: Engineering Manager, Spotify

Episode Date: June 10, 2021

Laurent Ploix is an engineering manager on the Platform Insights team at Spotify. Previously, he was responsible for CI/CD at several Swedish companies, most recently as a Product Manager at Spotify, ...and a Continuous Integration Manager at Sungard.

Apple Podcasts | Spotify | Google Podcasts

Highlights

05:40 - How CI/CD has evolved from a niche practice to a standard and expected part of the development workflow today
12:00 - The compounding nature of CI requirements
14:00 - Workflow inflection points. At what point do companies need to rethink their engineering workflows as they grow? How that's affected by the testing pyramid and the "shape" of testing at your particular organization
20:00 - How the developer experience breaks down "at scale" due to bottlenecks, the serial nature of tooling, and the "bystander effect". Test flakiness.
28:00 - How should an engineering team decide to invest in foundational efforts vs. product work? The idea of technical debt with varying "interest rates". For example, an old library that needs an upgrade doesn't impact developers every day, but a flaky test that blocks merging code is much more disruptive
33:00 - The next iteration of CI/CD for companies when they can no longer use a managed platform like CircleCI for various reasons.
40:00 - How should we measure product velocity?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Laurent Ploix, which I hope I pronounced correctly, who is an engineering manager at Spotify. And he's been working on CI systems for a large part of his career, initially in fintech companies and now in more techie tech companies, I guess is the right way of framing it, like Spotify. And you've been in Europe, in Sweden. Is that correct?
Starting point is 00:00:43 So I was born in France and I moved to Sweden approximately 13 years ago. Can you tell listeners what got you interested in CI/CD? It's a pretty niche topic. Everybody has to deal with it but nobody is really excited about CI. So what got you interested in that? In 2003, I was a manager for a team that was developing a virtual machine. The virtual machine was running some old VMS code on Unix, various versions of Unix, like AIX, HP
Starting point is 00:01:17 UX, and a couple of other ones, plus Linux. And this virtual machine was running financial code. so we had to be careful. And the problem we had was that we had my development team, then we had the QA team as well, and so we created a new version of it, and then it went to the QA team, and the QA team was trying to find bugs. And of course there were quite a few. So they opened tickets, and we were fixing the tickets, and then we had a new version of the virtual machine, etc. So it's a very well-known process, except it takes a lot of time and of course it's very slow to convert to a bug-free version.
Starting point is 00:01:58 But we didn't really know a better way to do it, frankly. Also, we had to test against four or five different versions of Unix. So, yeah, it was difficult. And then someone in the team said, look, I found this unit test thingy. Maybe we should try that. So, well, we gave it a try. And so, we started to write a number of small cases, which today we would call just unit tests. And effectively, what happened is that the QA team still found bugs, of course, but they found the bugs that were say advanced or complicated. But the trivial bugs just disappeared. At least on one OS, which was the one where we were working, which was
Starting point is 00:02:40 Linux. But for the other ones, we still had those stupid bugs, because well, we didn't test so much on them. So that's when we thought like, hey, so we have this unit test. We could automate that, right? Why don't we have five different servers of five different OSs? And then, well, we run them. And that's when CI was born for us. So after a while, of course, we found that we could release not in six months, but in 15 days. And there were no trivial bugs anymore. Of course, there were still advanced problems or use cases that were difficult to find. But whenever we found a bug, we created unit tests around that.
Starting point is 00:03:33 And it was gone more or less forever. So actually, what happened for me is that I realized that I was super interested, of course, in developing software, but this very idea of making the life easier for developer was something that I was truly interested in and then we moved to Sweden and I started to work for another fintech company and this fintech company also had a QE team of course and like a hundred developers approximately. We were doing real-time trading systems.
Starting point is 00:04:11 And so I took over the QA team and said like, hey, let's refocus our effort onto automation for developers because we wanted to change the mindset of how the company worked. I mean, developers were really good, but at the same time, as always, the software was released to the QA team, the QA team found bugs, and it came back to developers that prioritized these Jira tickets, and et cetera, et cetera. And it just took days, every time. So it took a lot more time than it should to release the software.
Starting point is 00:04:41 So we refocused the effort entirely on two things. The first one was to change the mindset and to change the approach of the development teams so that they would think in terms of writing the test for themselves to protect against future bugs and against regression from other people maybe touching the same code and at the same time we needed to change the infrastructure of course so that we could run them so it was a lot of battles to take at the same time so it took a few years because you really have to change a lot of things in a company when you want to go from long long release cycle to short ones. Effectively, what happened is that we went from six months a year to release to
Starting point is 00:05:32 this software which was more or less reusable at any point in time, more or less. And the number of the value adding tickets had tripled over the years, whereas the number of the value adding tickets tripled over the years, whereas the number of critical tickets went down dramatically. So in some way, we can say that we tripled the velocity of this company. At this point, I was hooked. I really liked that job. And working on evolving a company, working on making life easier for developers is really something I enjoy.
Starting point is 00:06:11 Have you noticed a phase where you've had to convince engineers who were initially not used to a CI environment and trying to convince them that it's a good thing? How have you gone about that? Have you dealt with that, or is it just like an industry standard practice now? Today, I would say it's an industry standard. But 10 or 15 years ago, it was not, or at least not in a place where I worked. And what I realized is that sometimes people were not convinced that it would make their life easier to, say, for instance, write unit tests or write different types of tests, really, and react fast on failures.
Starting point is 00:06:54 But what really worked is when the system stopped working and suddenly they were blind. He was very much against tests. He was very much against that because he thought that they could basically do a good job from the beginning and not have to test so much because we were testing actually quite a lot. And that's only the day when the CI system went down that I saw the same guy come and say, hey, by the way, we are blind. Now we don't know if the quality is good or not. So please, like, please fix this incident. So like, all right, is that the same person who told me that it was not very useful? But I mean, that's not the case anymore. Like, today, nobody would challenge that, I think. Not in the software industry, at least. Yeah, but do you think this was the same
Starting point is 00:07:49 like 10 years ago or like 15 years ago? Or has it been like a gradual evolution? So I'd say that if your company depends on software to create value, you'd better adopt those practices. Effectively, CI enables you to release faster, to create value faster for your customers. And if you don't do that,
Starting point is 00:08:12 well, your competitors will, and you're going to die. So, yeah, we've seen an evolution where more and more companies adopt those practices, but the ones who don't, well, at the end of the day, they disappear. So, yeah, that's what happens. The entire industry is moving in that direction,
Starting point is 00:08:32 and one of the good signs, like one of the signs that it's becoming mainstream is that you see a lot of companies providing CI in the cloud. That's usually a sign that the practice has become mainstream. Do you feel like there's a cultural gap in a world where CI used to not exist and then you add it to a company or a team?
Starting point is 00:09:02 Is it hard to convince people to you know get thinking about keeping master green or keeping the build good so it's a question of perspective and culture as a developer if you only focus on you providing or producing code, you could argue that what really matters is your pull request or your branch, and that maybe it's the job of somebody else to go fix merge conflicts in master. But if you think holistically, if you take the entire company into account, or all the teams working on the same project, then things change. So the question becomes, how do you evolve the culture so that the mindset is that everyone
Starting point is 00:10:01 needs to go faster, not only each and every person on their own piece of code. So there can be resistance if people don't see the value of what they're doing for everyone else. In a Fentype company, I have experienced resistance from some teams that thought that they had done a good job and that somebody else for instance was using their API's and they were using the API's wrong and therefore their tests were wrong but in practice you really want to take the on you want to take the person providing the API and the person's using the API's and they have to work together to make the piece of code
Starting point is 00:10:45 work. And again, it's a culture shift. It's a culture from my own corner into everyone is in it together and we have to work together. So yeah, that's like part of the job of a CI manager is actually to think holistically and to try to change the culture of everyone so that they see the impact of what they're doing on everyone else and not only on their code. Another kind of resistance you can get is from QA teams that would think that you're basically removing their job, which is not true. In fact, what happens is that you truly don't want humans to find trivial bugs. That's a waste of time. You want humans to focus on the very hard to test problems like usability ux issues logic inside the product whatever has to do with human interaction that that is really
Starting point is 00:11:59 important to test and you need qa and but whatever can be automated, there is really no reason. But here again, it's a change of perspective. It's a change of culture. The point is that quality is not a QA problem. Quality is everybody's problem. Okay, and I think one thing that all of us who've been experienced with CI, like managing CI systems and like larger companies or even smaller ones, we noticed that over time, CI
Starting point is 00:12:33 itself gets fairly complicated because the number of developers is continuously increasing the number of tests in your code base is continuously increasing. And because of that, there's this compounding effect on the CI system where just the number of build minutes or the amount of time it takes for a build to complete takes, just gets longer. So have you experienced this in the companies you've been at so far?
Starting point is 00:13:01 Yeah, so what you see is that, as you said, if you have twice as many developers, they will probably develop, like, create twice as much code. And if you add to that, that they also create more tests. And so the time it takes to build is longer, the time it takes to test is longer. Like, you have more platforms, you have more supported OASs, more, generally speaking, more platforms in many ways. So what happens is that, yes, it explodes. And soon your one machine becomes 10 and 10 becomes 100, and that's literally impossible to test everything all the time
Starting point is 00:13:41 on every commit you make or even every pull request. So that's a time when you definitely need to have to test everything all the time on every commit you make or even every pull request. So that's a time when you definitely need to have some kind of dependency graph so that you save CPUs to Southways. I mean, typically built systems like Bazel or similar will help you only focus on the ones that actually need to be built or only need to be tested. That said, we'll see that a bit later, is that it's not enough actually most of the time. Even by
Starting point is 00:14:15 only building and testing what needs to be or what is impacted by a change, it may still be too much. So what you typically do is that you need to decide what you want to test during the pull request. And there is a certain level of risk that when you merge to master, you may have problems still happening on master. And you may not even be able to run all the tests on all the pull requests on master. Have you noticed some kind of tipping point at
Starting point is 00:14:54 which testing every single thing on a PR or on every commit is no longer feasible? Have you noticed any kinds of like patterns like number of developers number of commits coming in that's a very good question um it very much depends on the shape of your test pyramid the test pyramid typically says that at the bottom you have a lot of unit tests each and every of them covers a small portion of the code. They test very few things, but they go really, really fast. And also, as soon as they break, you know exactly what part of the code is involved. At the top of the pyramid, you got typically the end-to-end test. They take a long time to run. They cover a lot more code. In that sense, that's good. But at the same
Starting point is 00:15:42 time, they are much more difficult to troubleshoot. And there is everything in between. And as an example, in financial software, we were testing different types of financial instruments and a lot of aspects of the pricing of those instruments on a lot of markets. So this is typically something which is at the top of the pyramid that can run for hours, dozens of hours, or sometimes hundreds of hours. And of course, you don't want every developer to run that. So you have this combination where you want developers to run all the unit tests typically, at the same time, they cannot do that locally, because if you support multiple environments,
Starting point is 00:16:24 then of course, their computer is only one of those environments. At the same time, they cannot do that locally because if you support multiple environments, then of course their computer is only one of those environments. So you want them to run all the unit tests in the context of the pull request against the CI environment. Then you also want to run the long running test if we know or if we believe there could be an impact on them. At the end of the day, you probably cannot run all the long running tests against all pull requests
Starting point is 00:16:50 for all developers all the time. So you're going to run them from time to time against master and bisect to find the culprit if there is a logical merge conflict. That makes sense. So it's not necessarily about the number of developers or number of commits. It's about the shape of your tests.
Starting point is 00:17:10 Basically, how much you follow the testing pyramid, in a sense. Even if you're a small team, but if you spin up a lot of end-to-end tests for every commit, which just takes a long time, very quickly you will start running into issues. Whereas you could be like a larger team, but if you focus mainly on unit tests and all of that, you might miss out on the coverage of end-to-end tests, but you can stay relatively
Starting point is 00:17:36 freer of these concerns for a much longer time. So I think the best way to find the right balance is to basically leave it to the teams to decide, because they know the value and they know the problems that come with the different types of tests that they want. So if one end-to-end test actually covers a very large part of the code and is good for them, and they feel that they get a lot of value from that because they also fix the problems as soon as they see them, then sure,
Starting point is 00:18:12 you can run this one, that's good. But you probably don't want to run 100 of them because, or more, if they are taking a lot of time, because you won't be able to react on them anyway in which case you probably want to have a lower level uh test more more say closer to the functions or like smaller pieces of the code so i like to leave that to the team i think it's it's a much better approach because they can find the right balance for themselves. What I've seen as a problem when you grow the number of developers, it's a bit different. It's more that when you have a team of 10, there's usually no problem for, like, if you test your pull request, it gets
Starting point is 00:18:55 green, you merge to master, it's going to work most of the time. Very rarely, you're going to have conflicts, especially if you, say, rebase on master. So typically, if you bring the code from master in your pull request and you test that, and then you merge to master if it's green, most of the time that's going to work. So it's going to be no problem.
Starting point is 00:19:16 But if you go for, say, 100 or 200 developers on the same repository, what's going to happen is that suddenly you're going to see statistically rare events that start to become quite annoying, like what you call logical conflicts. Say pull requests that are sort of compatible from the code perspective, like they don't conflict, like there's no conflict in the code. But if you put their two changes together, then they break.
Starting point is 00:19:54 When this happens, that's the time when you want to have some kind of merge queue or something like that. But if you have too many pull requests per day and it takes half an hour to test every pull request and you simply cannot put them in a single day, so a plain normal merge queue won't fly. So, you need to have quite advanced mechanisms to prevent issues. So the scaling issue is, again, if you have too many developers, you need to
Starting point is 00:20:29 put in place some ways of reacting really fast when what you call master gets read. And you basically want to reduce the time it takes to build and test to the bare minimum so that you have a smaller probability to have merge conflicts between different developers. That makes sense. And the contrast to me is interesting, right? Ten developers, everything is fine. But overall, even though you are going 10x in a sense, like 10? Like 10 developers, everything is fine. But overall, like, even though you are going 10x in a sense, like 10 engineers, or 10 developers to 100, so much breaks in
Starting point is 00:21:12 that middle, right? Because yes, if they're all adding to the same code base, they're roughly mostly working at the same time. Absolutely. I've seen exactly what you've talked about as well. Like it's just things change so fast. And each unless you focus on that developer productivity aspect of it, each incremental developer you hire is just blocked on random things. And they're actually adding to like business value. And this happens in every company. So you're talking about randomness. So I cannot resist temptation to mention flakiness here.
Starting point is 00:21:49 You know, so in 99% of the cases, it's a test that failed but should pass. And in some rare cases, you have the opposite tests that basically pass but should fail. It happens when you have a bug in a test framework, for instance. So, fleckiness is unavoidable. Like, you will face it if you grow. It is actually very, very hard to tackle. But that's where,'s where you have methodologies to try to at least keep it at a low enough level that it's okay to work with. So the first thing is, flakiness, where does that come from? what's the problem? So you have different levels where it can come
Starting point is 00:22:45 from the test itself, like maybe it's by the return, possibly. It can come from the test framework. So, I mean, say, JUnit, maybe. They have bugs, God knows, or anything else. It can come from the product, right, you're testing. Maybe there's flakiness in the product itself, some risk condition or something. Then it can come from external system you depend on.
Starting point is 00:23:11 Say when you run your test, you connect to a database or something. So maybe database is down, God knows, or maybe they're just flaky. Then you got all the other things that can happen, say network, anything. Even the OS can have a bug. And on that note, I would say when you grow really big, like if you have a few hundred or a few thousand machines in your CI, what happens is that you start seeing machines behaving in a strange way from time to time. Because, for instance, there's a bug in, say, Windows or Linux or whatever
Starting point is 00:23:49 that's going to make the machine break every second year, except that now you have a thousand machines, and that's going to happen every day. Just because of statistics. So anyway, the point is, you have a lot of sources of flakiness. And think about that for a second, would you like to use a product that's going to fail every 10 times you use it? Just because someone didn't want to take into account that there was some flakiness in the test. But actually, the
Starting point is 00:24:20 source of the flakiness was the product itself. So you don't want that right so if you have flaky I mean we call that flaky test but that's not correct this flaky test results really right and in my opinion that's really really important to try your best to identify why that's flaky. And whereas it's really, really hard to do in, say, in absolute, like for everyone, it's actually not that hard to do most of the time for your own company. So if you just look at the log files of your test or your test environment or your machines, you will very likely find things that kind of look like a network problem or look like
Starting point is 00:25:07 i don't know database connection problem or or something like that so in my experience if you just parse the log files or if you parse the test results and so on you kind of have a good guess as to why i mean what's the source of the problem um which means that now you can have automation to, for instance, hide the problem from the developers and rerun the test if that comes from the OS or if that comes from a database connection or something, because that is not about the product. But you certainly don't want to hide the flakiness issues if they actually come from the product itself or from the test framework
Starting point is 00:25:47 or from the test itself. That requires fixing. So and also like when a test is failing say one out of 10 times or one out of 100 times, again you don't want your users to have a failure every 100 times. You really want to understand the root cause, and you also need to consider that like a failure. I'm very much against the idea of rerunning the test just to make it green. I would actually go the opposite way, which is like, hey, run it three times.
Starting point is 00:26:24 If it's red once, try to fix it. So that's hard to do, right? Now you need to burn three times as much CPU. But that's like, you get the idea. And in my opinion, that's really, really important to put the fixing of flakiness very, very high in the priority list. Also, I'd say,
Starting point is 00:26:54 so there are many ways to first detect this flakiness. Like you can just look at what test results you have. And if some tests are read from time to time on master, for instance, it's very likely going to be due to flakiness. But you can just have an automation that's going to run the test a hundred times during the night or so. I use other ways to detect that. And again, you really want to put that quite high on your priority list to go fix that. It is also very, very hard to understand where the root cause is. I mean, ask any developer that finding REST conditions is really hard, right? We know that.
Starting point is 00:27:37 On that note, I can just mention that if your product is, I mean, it is a really bad idea most of the time to mix asynchronous code with synchronous code. Meaning like, if your test framework is synchronous, because for instance, you click on a button, you wait for something to happen on the screen, and then you do this, and then you do that, like you're basically trying to do a synchronous thing. Do A, then do B, then do C. Okay? But if the code that you're testing against is asynchronous, because when you click on a button, the button comes back, then you have control again, but then something
Starting point is 00:28:14 happens in the background, then you end up into your timeouts and you basically create flakiness for yourself. Don't go there. The second advice is to remove as many moving parts as possible. So if your product depends on the database and the network connection or something, and I don't know, a file or on a network whatever avoid that as much as you possibly can because that's just going to reduce the sources of flakiness that that you have no control on that makes sense to me and i want to zero in on one use what you said about priority list right let's say that you're in a position where you're deciding whether you want engineers to work on a new product feature or they have to work on this platform flakiness or just in general improving the state
Starting point is 00:29:12 of ci let's say you're like a ctu or like a manager of a couple of teams what's a good way to decide or to prioritize you know we should take a step back here and instead of working on this next feature like we should put a little more resources or like a few more people on the developer productivity so like how do you figure out that trade-off or like that balance so i think the key here is to think in systems think of your development team as a system that can deliver things really quick, but also needs to go fast on the long term. If you keep on piling, say, technical depth or flaky tests on your team, effectively, it's going slower and slower and slower over time and then it's going to be a lot lot harder to fix later on so if you only focus on the next 15 days or one month or
Starting point is 00:30:14 even a quarter most likely you're gonna always prioritize uh adding the new feature and not taking fixing your flakiness or fixing your technical depth but really that doesn't help you on the long term so the balance is how much value do you need to create right now because I don't only need to to be the first on the market versus how long do you want your company to survive? And think also in terms of compound value here. If you make your company 1% faster than 1% and then 1% over time, that makes a lot. The same thing happens in the other direction. Like if you make your company 1% slower and slower and slower every time you leave a flaky
Starting point is 00:31:09 test in place, then basically you're killing your company in the long term. Then you have to take into account the culture you're creating as well. Nobody likes to work in a company where the product itself is flaky, it's not nice. Nobody likes to work in a company where the development environment is so difficult to work with because tests are flaky and nobody pays attention to them. So it's not only about pure speed, it's also about the culture and the morale. So as a CTO, I would take all of that into account and find the right balance to have a constant effort to fix those problems. On top of that, I'd like to mention that fixing something where it breaks and you just changed
Starting point is 00:31:56 it, like you just changed the code and suddenly something gets flaky, that's kind of easy to reverse or at least understand and look at what code just got modified. If you look at it 10 days later, it's a lot harder. If you look at it a year later, you got no chance. So basically the tech-deb is actually not a very good analogy in some way. Think more like a backpack in which you put stones. And the more stones you have, the heavier it gets. And at some point, you get exhausted. You cannot move anymore.
Starting point is 00:32:44 And on top of that, I'd say that you really don't understand why you have those stones. People from so many different areas and different industries run into similar problems around like tech debt and test flakiness. Because I guess these are all manifestations of the same underlying problem of like, it's easy to ship features while adding tech debt and eventually that always catches up to you. It does actually. It's a, at the fleckiness is unavoidable, but you can't keep it, you should probably keep it quite low if you, and make whatever investment you need to keep it low.
Starting point is 00:33:31 Like think about that again for a second. It's like if your developers get a red result, like a broken build, basically say every third time, and it takes 20 minutes to build and test, well, you waste a lot of time, right? Like, you're paying salary for that. Like, you're basically paying people to wait and do it again. That is just such a waste, yes. So then let's talk about the next evolution of the company, right? Like when you go from like
Starting point is 00:34:06 a hundred developers to like a 500 or a thousand, and then at a point where the CI infrastructure itself ends up becoming flaky, because at that point, like from what I've seen so far that providers like CircleCI and everything that abstract this stuff out for you, they're not really good enough. And you have to start managing your own infrastructure in order to run tests. And we can see that in every single large tech company today, they're all running their own CI systems. So what happens then? Can we that so you do get some flickiness due to OS bugs and transient network issues and changes in DNS and anything. This is where you really want to have good metrics.
Starting point is 00:35:02 You really want to know what the percentage of problems you actually face. You want to know if something is flaky, what is the percentage that is due to OSs, what is the percentage that is due to the product itself and so on. As I mentioned, most of the time, it's more or less impossible to do in general. Like generically for all companies, you cannot do that. But if you look in your own system, you can probably guess quite well that this type of trace or stack trace is due to, say, network issues. And I mean, the metrics you gather are really important because then you
Starting point is 00:35:47 can see, like, say you have a bug in Linux every second year or so. You know that's going to be a machine that fails every day or every second day or whatever. You also know, and that's actually tricky, that the machines that tend to misbehave, they tend to break, like to run the test and break them, but they also tend to break fast, which means that if you don't detect them really quick, they're gonna basically, the next build is gonna come and go to that machine, and again and again and again.
Starting point is 00:36:21 So basically, like let's say you have a dozen things to do, and in no time, one of the machines that happens to be broken is going to fail all of them. Mm-hmm. That's annoying. Of course. So what you want to do here is that you want to detect those things on the fly. Right? And if you detect that one machine, for instance, has failed three builds in a row, let's say, then you really want to go quick and
Starting point is 00:36:46 say, okay, is this machine actually broken? Like what kind of stack trace do I see here? Like what kind of logs do we have? And then you immediately want to like, remove it from the pool. Automatically, of course, right? Like if you you cannot do that manually, it's just way too big. Maybe you want to rerun the build that were read automatically again without even showing that to the developers so that they don't get affected by one machine that happens to break.
Starting point is 00:37:18 So that's something you see pretty often, actually. It's a misbehaving machine in a pool basically killing the entire expanse for everyone. So that's one thing that you see, yes, with that. Maybe the second thing, yeah, let me insist on another thing which I think is really important is that you know your matrix. You know what the fleckiness level you have. But what you don't know is how much developers perceive it.
Starting point is 00:37:54 It's very common that you have, say, a 2% fleckiness in your system. 2% of your builds are going to fail for no reason. But in fact, people believe is 20. and that's also very common that the perception is not the same and the perception is what drives behavior so like if an example is like if a developer gets a red build and say okay um did i do something wrong um before they go look at the test result, before they go look in the log files, before they go try to troubleshoot, which is costly time-wise,
Starting point is 00:38:33 maybe the first thing they're going to do is re-trigger, say, hey, run it again. And they can do that if they don't trust that the results are correct. So the question becomes, how much do they actually perceive? How much do they believe the flakiness is? And that's where you really need to be good at communicating those things. So you need to be able to say, hey, here's the actual flakiness level. And maybe if you do face a flaky test or whatever, you want to tell the developers,
Starting point is 00:39:05 okay, so this test failed. By the way, we know it's been very, very stable. Just saying. Or quite the opposite, like this test failed, but you know, you got like, it was quite unstable for the last two weeks. So like, maybe you want to troubleshoot it anyway, but not for the same reasons. That makes sense to me. Even understanding the state of things, you can get metrics from the CI system and you can get these really crisp metrics on how flaky every test is but you should also be gathering the perception of developers perhaps through like surveys and stuff and that will get you both what the system actually is but and also
Starting point is 00:39:56 like whether how the system is perceived because you can imagine a case where like two percent of tests are flaky but every single time a developer tries submitting some code, like that same test fails and they think the system is like 100% flaky. So doing that qualitative, like asking people from surveys what they think the problem is with the quantitative can help you get a bigger sense of what are the velocity blockers in the organization. And maybe you can also track like re-triggers, right? Like if people are re-triggering builds all the time,
Starting point is 00:40:30 that's probably not a good sign. Exactly. So I feel like if you are in charge one day of some kind of CI environment, you really need to love your developers, really, right? You need to pay attention to what they feel. You need to put yourself in their shoes and try to look at their experience. Try to really understand how they feel about it.
Starting point is 00:41:00 Again, when they get a red, it's bad. It's good because you know what's going wrong, or that something went wrong, but you're like, oh God, I need to fix it now. You really need to pay attention to how they feel, to the position they have, and also you need to focus on their experience. If you give them a red, you really want to tell them, if you can, where that comes from, precisely. Okay, so you've made that change. It broke that piece of code on this OS in this environment.
Starting point is 00:41:41 Here is exactly the line where it broke. And by the way, looking at the stack trace, we can relate to these other type of tests that failed as well. Yeah, by the way, this test failed, but all these other ones failed as well. So what do you think? Maybe that's related. And you need to give them context, but not too much. You need to give them context but not too much um you need to give them the right context um there's nothing worse than than just looking at i don't know hundreds of thousands of lines of log files and say hey the problem is somewhere in there so just think for yourself like
Starting point is 00:42:20 when you get someone who comes and say, hey, you got a problem. Okay, what problem? If someone comes and tries to fix something in your house and they just say, hey, sorry to tell you, but you're going to have to fix something. You really want to know what it is, right? And you know why. So being in child-obsessed environment is as much a technical and scalability problem as it is a human one. Yeah, you need to be like customer obsessed in the words of like Jeff Bezos. And now let's flip it over a little bit. So the reason why companies are trying to hire developers is so that they can ship like features and everything for users quickly right and i don't know if you run into a situation where a company will hire a certain number of developers and for whatever reason like mythical man month or just state of the tooling and stuff they each incremental developer you hire
Starting point is 00:43:22 is not gonna suddenly double the number of story points that are being shipped by the engineering team every month or every quarter and have you seen like a situation of like a frustration of you know like a leader versus the engineering team like why are we not shipping stuff faster like our competitors are shipping stuff so much and like how do you think about that and how do you like resolve some of that like are we tracking the right metrics are we tracking you know shipping features that like correctly like what are your thoughts on this in general there's quite a lot to unpack here maybe again we should think in systems here.
Starting point is 00:44:05 So the first aspect is how much teams depend on each other. So you can make a graph of dependencies between different teams and maybe realize that you have bottlenecks. It can be a bottleneck in terms of quality, it can be a bottleneck in terms of workload. The first thing you want to focus on as a TDO is to fix those bottlenecks and that will probably help the entire company go faster in fact. The second thing you want to look at is when a team needs to put a new feature in place, can they operate on their own? Do they have autonomy? Or do they somehow need other teams
Starting point is 00:44:51 to do something for them? And same thing here. If you cannot operate on your own, if you're not independent in some way or autonomous, then you enter a cycle of negotiations and so on. It's not that it's bad to negotiate, but it just takes time. So it's kind of separated from the CI aspect of that, except that you can probably measure the level of quality of different teams by looking at the CI result that they
Starting point is 00:45:21 have. If you look at it from a CI perspective, which is only one of the angles, one of the good metrics to look at is how long do people take to fix a problem when a problem happens. So I call it like NTTR4CI. So basically, if you tell a developer you got a problem to this code, like here's a broken test or here's a broken build for you, how long on average is that going to take to fix this one and get to green? This metric is very useful to understand
Starting point is 00:46:03 the velocity of your development team. I usually look at the distribution thereof. You usually have a lot of developers that fix their problem in somewhere between minutes and one hour, and then you've got a long tail. This metric is you're going to see a change over time. When the code becomes more complex, the metric goes, becomes it takes more time basically
Starting point is 00:46:36 for people to fix code. That's in my opinion, that's an excellent proxy for how agile and how efficient your organization is and how your developers can iterate fast on your code. How much? That requires, basically that means that you need to collect a lot of data from your CI
Starting point is 00:47:01 environment, of course, and then you need to be able to process that metric. But that's extremely efficient to detect the complexity. And if you see that, for instance, that the NTTR of two teams are both becoming longer and longer, you probably have, you may have, let's say, a correlation or even a causation, potentially, between those two teams. Maybe one of the team is dependent on the other one, and the code that gets created is more complex and more complicated to troubleshoot, and then you can basically go there and try
Starting point is 00:47:41 to understand what's going wrong and, say, try to isolate or, or decorrelate the teams. Yeah. I really liked this like MTTR of like CI framing because it can be expanded in so many different ways. Like if you just expanded from the CI scope and you start tracking, how long does it take for a bug to be resolved once it's created in your task tracker and you can like maybe break it down by priority and stuff you can really apply the same thinking right like if it takes too long for when a bug is created and like it takes like a month for it to resolve versus like three months and that probably means okay maybe your engineering team is not fast enough but it could also mean maybe the product team is not prioritizing it or the
Starting point is 00:48:25 right way. Maybe the specs that they're creating is unclear. Maybe there's not good enough design chops to get these things sorted. There's so many ways this thing could get unpacked. So I really like this framing of this MTTR. How long does something that's a problem take to get resolved on average? And then you can do a bunch of things on top of that. I could elaborate a little bit on that and say, for instance, that it applies for what happens in the context of pull requests. So that's basically the person, MTTR, if you know what I mean. But you can also look at it on master.
Starting point is 00:49:07 So when something breaks on master, how long does it take for the team to react and fix it? If it takes you days to fix something on master, you've got a problem. You basically want to shrink that as much as you possibly can if possible two minutes but most most likely in two hours because even like the end-to-end test if it starts to break and nobody cares about them then what exactly are you running them like it's it's apart from some from burning cpu it means nothing and actually have a true
Starting point is 00:49:48 story about uh like it was in in fintech where they had like a lot of uh end-to-end tests because what happened is that they had a lot of testers before that were doing things manually. And then the way they thought about it was just like that automated test was all about testing something that someone did and turning that into an automated test, which is, I mean, you need to start with something. So that's probably not stupid to start this way. But really what happens is that
Starting point is 00:50:20 if you don't react when they get red, like then why, right? So what I wanted to do at the time was just to delete them because, well, they were creating no value and they were just burning CPU for nothing. But it was a problem because it was actually reducing the ratio of automated tests versus manual tests, which is like, well, if they break,
Starting point is 00:50:44 then why do you run them, right? If you don't pay attention, rather. The other story about that was the fact that as soon as you realize that automated tests is not about manual tests that you just got automated, it's once you can do things, once you can create the unit tests, when you can actually create them on the fly, and when you can actually create them on the fly, and when you can generate a bunch of them to cover all sorts of different use cases. Then you can have millions of them, actually, to cover a large number of possibilities. I won't get asked, okay, so how many tests do you run?
Starting point is 00:51:20 I was like, I didn't know how to answer that because we were comparing matrices of prices, and there were millions of them. How am I supposed to answer that question? When I compare two matrices, is that one test or is that one test per number I was comparing? It was somewhere between one and a few millions. So that's sort of like one of the first question you asked me is like, you have a change in mindset when you go from purely manual test, which is useful by the way, and when you go into fully automated or at least a lot of automated tests, is that you have to change your perspective. You have to stop thinking into purely manual tests. I wouldn't like people to think that manual testing is bad. It's not. It's very useful, actually. But typically not to test a small piece of code.
Starting point is 00:52:28 It's usually to test the entire experience. And that's where you want to have people. Okay. And maybe to wrap things up a little bit, once you've gone from you know your 100 developers to like your uber scale like extremely large companies right like thousand developers and all of that at that time you really have to start innovating for making sure your developers stay productive right because bills are only going to take longer more and more developers are adding to the same thing so maybe you can talk a little bit about, you know, what's the state of the CI world? Like, what are people thinking about right now? And how do you deal with that scale once you're that big? And like,
Starting point is 00:53:14 and what's the framework you use to think about solving problems when you're that big? A good framework maybe is to look in terms of layers. So you got the hardware layers and the OS. Typically, this is something you can use from the cloud. Then you want to build on top of that. You want to store your artifacts. You want to have test frameworks, databases, and so on.
Starting point is 00:53:38 That is also probably something you can use from the cloud. Then there is the orchestration. What exactly do you run first? What exactly do you run first? What is it do you run last? How do you prioritize the builds? And so on. As far as I know today, you need to own this part. And then maybe most importantly, the feedback you give to the developers when something fails. That is one of the key aspects of CI is that it's not about running things. It's about shortening the feedback loop. And that part is crucial.
Starting point is 00:54:15 And as far as I know, there is still quite some work to do in order to have a good solution for what you deal with hundreds of thousands of tests. And that part still needs to be owned by the big companies. But there is a lot of maturity in that field. It used to be that you couldn't run much things really on the CI providers in the cloud. But today, you can tell that quite a lot. I believe that in a few years from now, all the research that is happening in that field, and it's quite a lot, will be integrated by the CI providers in the cloud.
Starting point is 00:54:58 So by research, I mean, how do you prioritize tests so that you run the ones that are most likely going to fail? How do you deal with flakiness? How do you detect that? How do you warn the developers that their pull request is most likely going to create a regression that maybe they should run this long running test before they merge, et cetera, et cetera.
Starting point is 00:55:29 All of that, as far as I know, is not currently integrated super well in CI cloud providers, but maybe I'm wrong. And I certainly hope to see that happening in a few years to come. Okay. to see that happening in a few years to come. Okay. So one thing that, you know, I've thought about this a bunch and what comes out to me as interesting is that there's two kinds of build tools
Starting point is 00:55:55 that people use, right? There is like the standard, a bash script, or you run like NPM install. Very simple. And that's how you get started. And then you notice that a lot of companies migrated these like Bazel and Buck and everything because it gives you the power of like, you know, using this dependency graph or being able to like classify everything as a target.
Starting point is 00:56:17 But at what point does it make sense to switch over? Does it make sense to switch over from like a simple build tool to something like Bazel once you are going to find ROI in these things? Because it is always like a hard migration, right? So how do you think about that? So I'd say there is something very important to understand here is that things like three, four years ago and things today are like the market is just different. There's some kind of maturities that just happened. Like I don't know, Bazel was version one, I don't know, like two years ago or something like that, approximately. Before that, when there was a change in Bazel, you had to change a lot of things in your code, or at least in your Bazel description files. That was a pain. But I'd say if you can start with Bazel today, it's probably a good idea, at least I think so, or any tool that gives you, say, a dependency graph. I think that's going to
Starting point is 00:57:21 pay off. Dependency graphs are extremely useful if you want to analyze your code for all sorts of reasons, CI being only one of them. But for instance, it's very helpful to understand if you have many teams, like which teams impact which other team. Right? Like when they change something.
Starting point is 00:57:44 Anyway, the point is, today it's possible to start with those dependency-based tools, whereas I don't think it was so easy to do three years ago. So if I were to start again, I would probably start with that. But yes, it's a very painful process to move from whatever build system you have into Bazel. Or it used to be at least. I think I certainly agree with the stabilizing part. Initially, Bazel upgrades would just be extremely disruptive. And now things are mostly the same. There's small bug fixes. There's a few incompatible flags that you would have to add and remove.
Starting point is 00:58:28 Yeah, I still don't think it's at a point where, you know, you would use Bazel for like web development or something. I think it would be amazing if, you know, Bazel was like the build tool for web development, but I don't think it's just there. And I don't know, because the philosophy of these two communities is just so vastly different. But it would be kind of amazing if there was a world where the first tool that a web developer thought about was, let's use Bazel for our front end and back end. Because I personally feel like the build tooling ecosystem is just such a mess in javascript um but i'll stop i'll stop talking about my opinions on javascript now yeah anyways well thank you for being a guest on the show i
Starting point is 00:59:16 think i've had a lot of fun you're welcome yeah and yeah this was. I hope I can catch you again for a later round tour. Maybe we can talk about CI and ecosystem in the future and be more hopeful about this.
