Software at Scale - Software at Scale 12 - John Micco: Cloud Transformation Architect, VMware

Episode Date: March 13, 2021

John Micco is a Cloud Transformation Architect at VMware, where he works on CI/CD systems. He's worked in the CI/CD space for nearly twenty years at various companies like Netflix and Google. He has authored several research papers and was a keynote speaker at the International Conference on Software Testing, Verification and Validation (ICST). Our collaboration helped shape my work on Athena at Dropbox.

This episode is extremely technical. We talk about the management of CI servers and systems in large companies, common problems faced, and themes in emerging solutions. Large-scale CI/CD management is far from being commoditized due to the custom configuration of every company's build tooling.

Apple Podcasts | Spotify | Google Podcasts

Highlights

0:00 - What is a "Cloud Transformation Architect"?
6:00 - Sharing knowledge in the CI/CD space
7:00 - A comparison between starting a car company and starting a software company, and why standardization on tools is so much trickier in software
11:30 - The scale of testing systems. Managing a system that handled 10k RPS of test result data. The quadratic growth in compute resources required to manage a testing system
18:44 - "Only 7% of code changes had problems discovered by tests"
23:36 - Is unit testing overrated?
30:00 - At what point can companies no longer afford to run all tests for every code change
35:00 - What is Culprit Finding / Regression ID? What is auto-rollback / test quarantine?
41:00 - Culprit Finding at Dropbox
45:00 - Dealing with flakiness at the test layer vs. the platform layer. An ongoing challenge for VMware
60:00 - Increase in demand for CI compute when developers don't have any meetings
65:00 - Bazel migration at VMware
73:00 - Developing an interest in CI/CD systems. Why it's an exciting space where there's a lot of innovation, and why more innovation needs to happen
80:00 - How can someone tell whether they're investing enough in developer productivity / their CI/CD experience for engineers? The trade-offs that companies should consider when deciding to run tests

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Discussion (0)
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Thank you, John, for being a guest on another episode of the Software at Scale podcast. No problem. It sounds fun. I'm quite flattered. Yeah. Yeah. John's a cloud transformation architect, which I think translates to like a high level IC or like individual contributor at VMware. Cause I remember you were like a senior staff engineer at some point. I could be mistaken.
Starting point is 00:00:35 Yeah. Well, it's funny. Titles are very funny. I mean, at Google, I was a manager and I managed a group of 24 people there, which included the continuous integration system. And then when I came to VMware, they're trying to make the transition from selling products to people on-prem where they have these big data center products into making it more of a cloud offering
Starting point is 00:00:57 that more competes with the other cloud offerings like Azure and Google and AWS, right? And so the title, cloud transformation architect: my job is to get them from shipping once every two years to shipping every month, right? And that's a hard thing to do. I've done it at a couple of companies and been working on that kind of a problem,
Starting point is 00:01:15 but that's really hard when you have these entrenched engineering systems and people have a certain way of working. Changing all of that is hard. Yeah, and that's the perfect title. You're literally a cloud transformation architect. Yeah, that's awesome. Literally, yeah. And you've been working on CI/CD stuff, in a sense, for a really long time, right? I've seen your work experience, like
Starting point is 00:01:36 mathworks google netflix vmware you've been working in this space for a really really long time for 20 years i've been working in continuous integration, continuous delivery for 20 years. It's funny that Mathworks, they make MATLAB and Simulink, which is a huge desktop software offering. And they also were struggling trying to get more frequent releases. They were releasing once every year, year and a half,
Starting point is 00:02:01 and then they wanted to do it every six months, and then we got it down to once a quarter. And that was all sort of part of what I was doing there as well. So, yeah, I've been doing this for 20 years across MathWorks and then Google. I was at MathWorks for 15 years, but I did CI/CD only for half of that time, about eight years, and then I went to Google and worked there for eight years, then I worked at Netflix for a short time, I'm not going to say how short, but it was shorter than I wanted it to be. And then I came to VMware.
Starting point is 00:02:29 I've been there for two years now. So grand total is about 20 years-ish of time doing CICD, different companies. And you've also been an author for a bunch of papers on CICD. And I think they were, at least least that those few papers and like the conference talks that I've seen on YouTube, there are some of the few things that are actually that were available maybe three or four years ago when I started looking at the CI, CDE. And I would say that at least to me, you're like a thought leader in that space. Well, I'm, yes, I've written quite a few papers. And the main reason I did it was because at this high end of software development, where you have successful companies or things that have gone past startup mode, right? Those guys are responsible for working on a CI system because you can't use Jenkins and you can't use, you know, CircleCI
Starting point is 00:03:30 or some of these off-the-shelf solutions to make things work. You actually have to have a group that does their own custom thing. And nobody was talking about it, but we were all discovering the same problems at different companies and, you know, groups of people coming up with the same solution, but they weren't talking to each other. So one of the reasons I wanted to publish and to start participating in academic conferences and so on is to start sharing because in other disciplines, like if you look at machine learning, for example, right, the academics tie everything together. They start sharing the ideas between all the different companies. And then, you know,
Starting point is 00:04:06 yeah, there's different implementations. You know, Google has their AI implementation stack and Apple has theirs and everybody has theirs. But all of the thoughts are coming from the academics who are studying this and saying, here's what you should do. You should train your models this way. You should look at them that way.
Starting point is 00:04:23 You should continually retrain. You should, here's the you should do. You should train your models this way. You should look at them that way. You should continually retrain. You should, here's the data set, you know? So, so in CI, nothing pretty much existed like that, as you said. And so my goal was starting to publish and start to get people, if we can't share code, maybe we're all going to have our own group. We're going to do our own code for CICV. At least we can share ideas. We can share truisms about, you know, it's funny. Everybody thinks their testing system is a snowflake. And in some ways it is. But when you look at test
Starting point is 00:04:51 flakiness, dimensions like test flakiness, and you look at like how we run the test, what CI does at these different companies is identical. And it comes up with finding the same things. I measured at Google and at Netflix and at VMware how often when you see a sequence of pass, pass, pass, pass for a test, when you see a fail, how often that first fail is a flake. And the answer is that first fail being a flake is about 60 to 80 percent of the time the first fail is a flake. And that means that flakes are much more common than actual developer breakages, which are super rare. And I keep seeing these, you know, papers. I just, we just reviewed one today in the journal club where, you know, they don't pay
Starting point is 00:05:33 any attention to flakiness and they just train these models to find failures. And it's like, they find flakiness. That's what they find. The tests that are flakiest are the ones that run the most often. You know, we've made that, I made that mistake, too, as one of my first systems. So anyway, these kinds of concepts, sharing them between the companies and trying to get people to see the problem and experiment with the problem the same way helps to advance the state of the art everywhere.
Starting point is 00:05:57 So people can take those ideas. They can lever them. They can start going to the next step. And that's what academics is kind of about. It's like, how do you standardize a discipline? How do you kind of move it to the next step? what's the next thing you're going to be looking for we've already solved that problem and we're getting we're climbing the ladder of the sophistication of the tools and the systems that we can build that to do this and we're sharing information
Starting point is 00:06:18 about how they work so why do you think that is why do you think like people who work or at least in the past like the people who used to work on testing systems don't share as much? Do you think there's just a lack of incentives? Well, there's never been a standardized tool. There's not a lot of money to be made in software development tools. In fact, there are almost zero companies who have ever really been successful providing software development tools, right? Almost all of them have gone out of business. Why? Because there's not that many developers. I mean, it's the second derivative, right? The mass market things like your social media, like Facebook, their potential user base is, you know, every person on earth, 8 billion users or 10 billion users or whatever it is, right? And you look at software development, it's probably worldwide, every developer on earth. I don't know, I can't, I don't have a good estimate, but it's gotta be less than a million. Maybe it is more than a million, okay? Maybe a million is a small number, but still it's not anything close to 8 billion.
Starting point is 00:07:19 And that makes the market smaller. It makes it less interesting for businesses, number one. Secondly, I'll tell you my theory. I heard this from a guy at Dropbox, I'm not Dropbox. I'm at PayPal. He told me this thing, right? So here it is. If you wanted to start a car assembly plant, you could take your $4 billion and you could get experts, you could get advice, you could figure out how to do it, and you could set it up and it would work, right? And I think Tesla is living proof that that has happened on earth, right?
Starting point is 00:07:51 They didn't know how to make cars at all. Elon Musk knows nothing about cars. He hired a bunch of experts. He said, how do you build a car? They put together and it works, right? That's easy. If you said the same thing, I want to make a successful software development pipeline, right? And I gave you the same $4 billion. to make a successful software development pipeline, right?
Starting point is 00:08:05 And I gave you the same $4 billion. It's never been done on Earth. Never. Not a single time. Why? Because the way software companies start is they start with an idea in a garage and two guys. And they figure out, or it doesn't have to be guys anymore, sorry. Two people, right?
Starting point is 00:08:22 And they come up with a good good idea and they're just like working 24 hours a day to make that idea into something that's valuable then they sell it to a venture capitalist and he says oh i'll give you a million dollars five million dollars go go build it some more and they hire a few people and either one or two things will happen either it dies which is the most often case like 85 or it lives and along way, they've done whatever they had to do to make the thing work. Take whatever shortcuts they need to take, you know, just anything to get to the next step, right? And they create processes for developing software. And they don't look at it much because they don't care, right? And then gradually, they build up their own Baroque
Starting point is 00:09:02 processes. And then they, well, none of the off-the-shelf systems really implement our Baroque process. We want to have a customing thing that implements our process and keeps our developers productive. And it's the way we look at it. It's the way we're doing it. And that yielded a lot of differences between the approaches and how people did things. It also meant that because there's no good market for it, there's not too many good off-the-shelf solutions. I mean, you look at something like Jira, Atlassian is barely struggling to get, you know, some of their development tool stuff to actually make them money, right? They're okay.
Starting point is 00:09:35 They're not huge and they're not growing very much, right? And then you look at like Jenkins, Cogby's. They tried to make a business out of Jenkins and they really couldn't do it, right? They haven't been very successful at doing that. So I think between the lack of a market to drive, you know, sort of innovation and companies to enter the space to sort of make a better, you know, CI tool, that doesn't work. And so then companies have to, are left inventing it themselves because what is there off the shelf isn't satisfying and isn't solving their problem. So I think that has led to this place. So if we start out by sharing ideas, maybe we can get to the next level and we can start sharing some code, maybe open source. Open source is a great model
Starting point is 00:10:15 for this. It's funny, all of the best development tools that you can think of, like Git or even Bazel now or some of these other tools are open source, right? I mean, because everybody knows that you can't make money from dev tools. So you just open source it and you get some contributors and you pull something together. I think something like that could be successful and could work, right? So that's kind of why the market is the way it is, right? Okay. So what you're saying is like since everybody when they start off they're
Starting point is 00:10:45 they're not really optimizing for what the software process is they're just trying to get stuff done everybody has a very like bespoke workflow and this is definitely true in the 2000s right yeah yeah if we can't get our next round of venture capital funding we're dead right so so you gotta like you gotta put a good demo together and you have to like have some some potential customers that are there excited about what you're doing. And you're thinking about that stuff. You're not thinking about, oh, how are we doing this? Because only two people, three people, five people, until you get to like 20, there's no point. You don't have enough. You're running too fast. You don't have enough overhead to even think about, well, what are we doing for a software development process?
Starting point is 00:11:23 And nowadays, there's enough that you can pull off the shelf to get yourself started. I mean, there's all kinds of tools out there that are open source and free that you can just download and start using together, you know, and open source databases and whatever, right? So you just like shove it all together and you do whatever you have to do to make it work. And to get that next round of venture capital funding, you're not thinking about it're really not and what do you think about tools like as you said like circle ci and like build guide like these next generation like or even github actions for that matter yeah well i mean look i think the to me the difficulty is going to come from making money out of it but even then the i'm a testing guy mostly that That's kind of what I do at Google. I
Starting point is 00:12:06 managed a compute farm where we tested continuously six and a half million tests against two billion lines of code. So I've been there, right? We had 10,000 QPS of reports of path failure from test results. I mean, it was crazy to do that, right? So none of the platforms that I've seen are as good at diagnosing software testing and accounting for flakiness. They're still in the phase of conceiving of the build and test problem as you have a build stage and then you have a test stage and then you have a, oh, it's either good or it's not good, right? And they don't think about things like flakiness or things like repeatability or even just like, how do you identify where a breakage happened? Like a culprit finding kind of action, right? Off the shelf tools lack that kind of sophistication when it comes to testing. Because everybody does testing a little different and managing that testing inventory is kind of hard. There are some tools that try and, you know, manage testing inventory
Starting point is 00:13:09 and come close to doing some of that work, but really, they're not, they're not that great, and it's probably because there aren't that many standard processes, like, are you, how you use Git, you know, people are doing crazy branching things, and how, maybe they're not using Git, maybe they're using some Perforce thing, or knows right i guess nowadays you probably wouldn't start using perforce but if you started out using git you could you could still have 300 branches and have to do merging and you've made yourself a mess and some off-the-shelf tool might not do it for you right so i'm um and i i've been a big convert uh from working at Google about not having a lot of branches of development, because it's just such a time waster. And it doesn't save you the thing that you're afraid of, which is integration problems.
Starting point is 00:13:55 It makes them worse. So anyway, I don't know if I answered your question. It's definitely what I think of a similar thing, right? Like the amount of compute resources you need in order to run tests is it just grows quadratically, right? Cause the more developers you have, the more your tests are going to run because you're going to have like more pull requests and the number of tests in your code base never really decreases unless you put in an effort and it's such a low priority to like delete tests and also like developers are so scared of deleting tests since
Starting point is 00:14:34 yeah what if it's like developers are deadly afraid of deleting tests i've certainly seen that so how do you control that cost while providing like, while getting some margins and actually shipping stuff? Yeah. Well, at Google, we computed that if we ran every test that was triggered on every change, we actually computed that we would need more compute power than Google search was using. And you can imagine that Google doesn't have a lot of money or doesn't have
Starting point is 00:15:04 that much money, even though they have money, they don't have enough money to spend more than they spend on Google search compute capacity to do CI. So, and no company does. I mean, look, exhaustive testing and exhaustive testing isn't the answer. At Google, we figured out, look, you know, 93%, 93% of our changes that come in have no in don't have a bug that can be found by testing, right? So you're searching for that needle in the haystack. Where's the bug that somebody committed and how can I find it as quickly as possible? This is all about risk versus reward. You know, how much you're going to spend, how much risk you're going to take of missing
Starting point is 00:15:43 a defect until later stage, right? How much risk are you going to take versus how much compute power you're going to spend, how much risk you're going to take of missing a defect until later stage, right? How much risk are you going to take versus how much compute power you're going to throw at it? How much you're going to spend, right? So you can reduce your risk by spending more, but it's just a risk. It's a tradeoff between risk and finding that defect sooner. The goal of any CI system is to find a submitted defect as fast as you can. And the answer is the more resources you throw at it, the quicker you can find that problem, but at a cost of ridiculousness, right? You can't spend
Starting point is 00:16:12 a fortune on it. No company can afford that. So that's where regression test selection, techniques like regression test selection, which reduce the pool of tests that you're trying to run, things like test prioritization, things like skipping over a bunch of changes and collecting them all and testing them together. Those are all common techniques that pretty much every company uses to reduce the cost, the cost of the exhaustive testing that you're going to do otherwise, right?
Starting point is 00:16:39 And every company makes those trade-offs differently, but they all make it. They all have to make it. There isn't any company that's rich enough. Maybe Apple's rich enough. I don't know, but there aren't very many companies that are rich enough to be able to spend a billion dollars on this testing hardware to find every defect. And you know, when you start to get into, it's interesting.
Starting point is 00:16:56 The other part is how safety critical is your application, right? A web server like Netflix is cool. They don't do any testing at all for their backends, right? So. Well, They don't do any testing at all for their back ends. They don't do any testing at all? Pretty much none. It's totally minimal. Does it boost the service? Okay. It's good enough. Send it out. And they just deploy it to a percentage of users or something?
Starting point is 00:17:18 They deploy it to a couple of canary boxes. Somebody's home and they hit play and it doesn't work. And you're frustrated. I hit play and it doesn't work and you're frustrated i hit play and it didn't work and you know that's a bad day but but you know what that the next time you hit play it does work because they already took that server down because it wasn't working right you know it was already giving them bad telemetry that it wasn't behaving the way the other servers were so you know that's great and that works for that case. But then you think of like an anti-lock brake algorithm, right? You know, and, and, you know, if you have a bug in the anti-lock brake algorithm, you can kill the person. Tesla self-driving car, they had a bug in their
Starting point is 00:17:54 autopilot, right? Boom, guys died because of that, right? Because it's like, you know, you have to test that more ahead of time. You can't wait till the guy dies and said, maybe that was a bug. That's not a great thing to have happen. So I think it's all about your risk tolerance and how much you're willing to spend. And it's just a truism, and this has been studied over and over again, that the earlier you find a defect, the closer you find it to the developer actually submitting it,
Starting point is 00:18:24 the less it costs to fix it. Including bug escapes that go all the way to customers. Those are expensive to fix because you've pissed off your customer. It's more than just whatever problem you might have had. You've pissed off your customer and they're not going to like to use your product if it's got a lot of defects in it. I think that's what testing is all about is right sizing that. And what's really funny is we are searching for a needle in a haystack. I mean, the number of confirmed breakages in speed, like I said,
Starting point is 00:18:55 only 7% of our changes had any problem whatsoever that was discoverable by tests. And then, you know, you go into that space, the 7% that did have, there's still a tiny number of tests that they broke, right? So you have to know how to schedule just those. In fact, we figured it out in one of our papers that it's like, you know, some really tiny fraction. If you are perfect and you are omniscient and you know, you knew ahead of time which test was going to break when a developer made a commit. Well, first of all, you wouldn't let them make it, right? You tell them in their IDE, hey, you shouldn't submit this
Starting point is 00:19:27 because it's going to break this test, right? But let's say you get to the place where that thing has been committed. And if you were perfect and you knew exactly which test to schedule, you could run like 0.001% of the test pool and find all the problems, right? If you had a perfect AI or a perfect way of determining ahead of time that, hey, when you commit this thing, you're going to break it. And this is what I, geez, it's so frustrating to be in the academic space right now because everybody wants to do machine learning. They come in, oh, I'm going to do machine learning. It's going to figure everything out. Okay. I'm going to find my needle in the haystack. Problem is the data set is too sparse to actually train
Starting point is 00:20:03 the model and you can skip lots of tests and not miss problems. It's like an imbalance glass problem. Yeah. You know, your AI can decide to skip, you know, 93% of the tests and never miss a problem. And, and you say, Oh, I did a great job on my AI saving me a lot of money, but it's not verified that it's actually working the way it's supposed to because the search space is so small that you can't really train and also the features that you're looking for like complexity of the code and how many lines of code are modified and which parts of the system it was in those features are hard to correlate and nobody's got good algorithms for you know ferreting out those features that you can use to really train the i like okay yeah if i change this part of the code,
Starting point is 00:20:46 I'm likely to break this test. That correlation is not discoverable really with AI. I mean, there's some ways you can train it, but I haven't seen yet a good AI model for this because it's such a sparse space. And I'm highly skeptical of the people who are doing it because they're not evaluating their model against sort of like the truth of what the failures were, right? How many defects would escape?
Starting point is 00:21:08 Yeah. So that 7% number you're talking about, has that been replicated like at other companies and across like the industry, like as far as you can tell? I don't know if anybody's actually doing it, but everybody says the same thing that actual defects being discovered by test running is a very small percentage of the tests that anybody is doing that I've seen. I like the most I've seen is like one or 2%. Now, when I said 7%, I mean 7% of the changes. If you look at it, at the pool of results of tests, like all the tests,
Starting point is 00:21:40 the millions of tests that you run, and then you look at how many actually discovered a defect, how many times you actually went from passing to failing and it was a defect that got inserted. I mean, it's like 1% or less of the tests that you run. It's less than 1% that discover a real problem, right? It's tiny numbers. So then first question, do you have any data on local runs?
Starting point is 00:22:03 Are people just so great that they're testing everything locally before they submitted to CI? So let me just start with that question. Yeah, I mean, that's a really good question. At Google, we did some experiments with that. We tried to see if we could actually detect the correlation between developer running the test and having it pass in post-commit. And again, the search space was sparse enough and the flakiness was there present enough that we could not find a strong correlation between did they run the test ahead of time and did it actually break something? I mean, that doesn't mean they didn't do it. I mean, so in other words, they didn't run the test,
Starting point is 00:22:42 it didn't break. They did run the test. We couldn't see that it broke any more or less based on that change. And I think it makes sense in a way because developers do test their changes, right? And generally, especially at Google, there's a premium on having small changes. And they have some way of testing it. Even if they're not running a formal test, they might be testing it manually. It might be doing some kind of thing to make sure it's working the way they want. And most often, and even if it's a new feature,
Starting point is 00:23:10 maybe there's no test for it. Maybe they didn't commit a test with it, right? That kind of thing. So I think, yeah, I think generally that's just a truism and probably more study is needed. I haven't done a lot of digging into that space to see, is there a correlation? We did a little bit of it at Google, but I think it needs more work. That's an area for study for sure. So any grad students listening, that's something they can try out.
Starting point is 00:23:34 Yeah, exactly. Are there, are like unit tests and just like overrated if like most of them don't fail or like, is it that that's just the way of life and like i i get asked that a lot i think i think um there's two things right especially unit tests you cannot refactor code without unit tests period right and refactoring is a part of agile you're supposed to do it that way you're supposed to like write something that works for the purpose and then when you have a new requirement refactor it until it works for the new purpose and the old purpose, right? And this is the best way.
Starting point is 00:24:09 It's been proven over and over again. And lots of studies show that too. So if you're going to do it that way, you need those damn unit tests to make sure you don't break something during refactoring. Should we be running them exhaustively on every change that gets committed to your system? The answer is no. And again, it's about your risk tolerance. You might be missing a problem with that unit test
Starting point is 00:24:30 that it would find it, but likely you didn't because developer probably ran it. If they were doing a refactor, they certainly did run it. And, you know, I think the answer to the question is, again, it's trying to identify which of your tests pool is giving you the most value to discover a problem sooner. And then having something like regression ID. Regression ID is something every company should have it. So that when you skip a thousand changes for running this test and then you run it and it finds a problem, you can find out which of those thousand broke it, right? And you can do it pretty efficiently.
Starting point is 00:25:09 So that's, and in fact, we found that to be one of the best techniques at Google. When we were skipping tests, we started skipping tests a lot, more and more and more of these, don't run those, don't run those, don't run those. And then what we would do is we got our regression identified algorithm to be super fast. And we dedicated only a small percentage of our compute to it, like 8% or something of the compute was dedicated to doing culprit finding. And we can still like find problems really fast, because we could run multiple copies of different changes and just narrow the range and find exactly the culprit. And I think that's the kind of thing where depending on how fast your tests are,
Starting point is 00:25:44 and depending on how good your algorithms are for doing regression ID, you can skip lots of changes in between and save lots of money running tests. Right. And that's kind of, I think. A much better approach than just saying, oh, well, you know, sometime in the last we haven't run our super exhaustive, big, heavy regression test, you know, in the last, you know, month, let's run it now. Like Microsoft has a tiering of tests, right? They run their lower tier and then less frequently a higher tier and less frequently another higher tier, right? And so some of their really big system tests only run like, you know, very infrequently, like once a month or something, right? And then they discover a problem.
Starting point is 00:26:23 They don't have really good culprit finding to say, oh yeah, here it is. This commit is the one that broke this test. And if you're going to have a big, fat, nasty test like that, then you should be doing, even if it takes a day, you know, a day to cover a month's worth of tests for it to go back and forth and figure out exactly, okay, this change broke it. You don't want to be spent. The thing you don't want, the thing that's most, most expensive isn't even the computers. It's the human brain. It's the people who have to interpret the tea leaves and figure out the results. And if you can send an email to the developer that broke the test and said, Hey, Bob, guess what, Bob, you broke this test and you broke
Starting point is 00:26:58 it with your commit yesterday. Maybe you should go fix it. If you can do that, if you can tell him you're giving him valuable information and you're saving lots of human toil because Bob knows what he did in that commit. And he knows like what it might've done. Right. And he can probably figure out why I broke the test. But if you say, Oh, there's a thousand commits. Hey, Mr. QE, go figure out which of these thousand commits broke this test. That's a horrible job. Nobody can do that job because why? They don't know anything about those thousand commits broke this test, that's a horrible job. Nobody can do that job. Because why? They don't know anything about those thousand commits. All they know is the test stop, and then
Starting point is 00:27:29 they have to do fault isolation. And let me say this, fault isolation is the hardest job in computer science. Going from this thing isn't behaving the way it should to why is it not behaving the way it should is a hard problem to solve. And, you know, this is where toil happens. People just spend a ton of time reading the logs and trying to figure out, okay, well, why did this thing go wrong at this spot? And what's funny, I read a paper once, right? If you run a test twice and it passes twice, you would assume that if you instrumented it and ran it with code coverage and ran it, then it passed, ran it and it passed, but the code coverage trace should be almost identical like
Starting point is 00:28:08 what lines of code got covered and they found variation lots of it in in the execution path because of timing because of you know all maybe it went around the loop where it had to wait five times you know and or maybe it didn't and so all of those things make the there's no way to predict from from any signal that you can get, whether this run is going to pass or it's going to fail right at the end. So anyway, that's, that's kind of, I'm, I'm way, I'm wandering around a lot, but you get the idea. There's, there's lots of techniques to decrease the amount of resources you spend doing testing, but then you need smart techniques that prevent human toil when you find a problem.
Starting point is 00:28:48 When you find a problem, it ought to be able to narrow it down. Automation should be capable of narrowing that problem down and giving exactly a message to Mary saying, hey, Mary, you broke test foo when you made this commit. And that's almost a critical item now in a CI system. If you can't do that, then when you're skipping your changes, which you have to do to save compute resources, you're going to have a hard time doing fault isolation. You're going to have somebody scratching their head saying, hmm, what did it say in the log? Why did it do that? And trying to correlate it back to some change that was in. Okay. So yeah that that's a lot of information i had to ask you like a few things with that right so when you're starting off you're a small company with like 10 engineers or something um you can run all of your tests before you even submit code right
Starting point is 00:29:37 because it's the the test suite is small enough and then you look at the other side of the spectrum like at least at these testing conferences you you see Google, Microsoft, Facebook, Salesforce, like the large companies, they need to do all sorts of stuff. That's like interesting, like regression finding, they can't run most tests before you submit, because that's just too expensive. At what point do you think there's like a tipping point, you know, when you go from that small company, which can just afford to run everything at what point or is there some way of knowing or or rather do you generally see a point where you know companies transition from okay we can just run everything to oh this is no longer feasible I think look until you actually have a successful company and you have i'd say at least 20 to 30 maybe more like 50 developers somewhere in that neighborhood is when you start to accrete accrete enough code
Starting point is 00:30:37 that the build starts slowing down and then you start to get the people oh i have to wait 20 minutes to do a build you know like that's, that's not good, right? The build starts slowing down. And the testing becomes, if you're doing testing, becomes a significant part of the compute you're spending on build and test cycles, right? And that's when somebody says, hmm, we need to do this less frequently, or we need to buy more hardware. And the management starts figuring out, oh, shoot, we need to get soft labs or whatever, right? There's something else that needs to be done to help with the testing.
Starting point is 00:31:10 Or we need to go to the Circle CI. So I think somewhere in that neighborhood, and it depends a lot on the specific culture that's evolving for that company, right? That culture of, okay, what are the things that, how do we do our development? Do we require people to write tests? Do we require them to run them? Do we have some other technique for managing things? Or maybe we're not doing any tests at all because we're still running after that venture capital money and we need to get to that next demo in three weeks and we can't afford to write tests, right you know you can have i think pretty large startups that don't have tests or that don't have a lot of tests and then you get on the other side of the spectrum you can have successful software that has smaller numbers of people but has large has been able to have large
Starting point is 00:31:59 numbers of tests right so so i think it all kind of depends. But eventually, you almost have to get there. There's no, there's no possibility otherwise, because there are just certain parts, like, for example, even at Netflix, the back ends, they all went, you know, pretty much without testing at some level, from my, in my view, kind of without testing, because they did the really strong canary thing, they actually tested it with their customers, they had a million, you know, people clicking the play button, you per second so it was easy to to get that you know to become a way to do testing right which they did um but the the app like the netflix app that goes into your phone or onto your computer or your tv they have to do exhaustive testing on that.
Starting point is 00:32:45 In fact, for a while, they were burning e-prompts into the TVs that you couldn't update. So for example, the very early Netflix integration, they had that thing burned into the TV. You couldn't change it. There was no way to download new software. And so gradually, it's funny,
Starting point is 00:33:02 gradually all those companies went away from that. Even Netflix today, right? A lot of the behaviors are downloaded from the web, even though their app is like kind of a smart browser. It's the same as Facebook. Facebook is really just a browser. The Facebook app on your phone is a browser. It's a browser.
Starting point is 00:33:18 And all of the software and the rendering and everything that shows up on your Facebook app is coming from the back end. It's not coming from the thing in your phone. So they don't have to update that very often. But when they do, it better work. You know, that has to be tested. And that's just as true for Facebook and Netflix than it is for anybody else. There are certain parts of your application stack that you cannot afford not to test.
Starting point is 00:33:40 You can't just back end canary, oh, the telemetry doesn't look good. Let's not update it when you're talking about an app in somebody's hand. You've already downloaded it into their phone. You can't change it until next week because the Apple Store and the Google Store both have rate limitations on how often you can push a new app version. So this is where you have to do some testing to make sure that's going to work. So I think, yeah, it's very different for different companies, but you definitely hit that point where you have to make conscious decisions about how much you're going to invest versus how much risk you're willing to take. Yeah. And I
Starting point is 00:34:15 think what you said was that at some point the build will get slow and then you start thinking, you know, okay, I'm going to try speeding up the build, but then since there's the quadratic problem of there's more developers and you're never really deleting tests, you know, okay, I'm going to try speeding up the build. But then since there's the quadratic problem of there's more developers, and you're never really deleting tests, you need to figure out smarter ways. Exactly. And in fact, this is where the state of the art is, right? Finding smart ways to do regression test selection, finding smart ways to do like things like culprit finding and so on. That's all like super important. And being able to have flakiness in your system while you're doing culprit finding. So you're not really sure, you know, you're just getting probabilistically, okay, it was this one, right? So there's a lot of, there's a lot to those
Starting point is 00:34:55 algorithms. There's a more complex than you might think given flakiness and everything else, but they're super important to the productivity of the engineering team because you don't want someone to have to face that fault isolation for, you know, a thousand changes went in and I have no idea what broke my test. So can you talk about like culprit finding? Because you can imagine for listeners who've never dealt with like such a large code base with so many tests before,
Starting point is 00:35:16 like why would, how does that exactly work? And like, what exactly does culprit finding mean? Yeah. Well, I mean, it's pretty simple, right? Your CI system is not running on every change that's committed by any developer in your organization. If your organization has any size at all, you're doing periodic build and test runs. Maybe the simplest thing is once a day.
Starting point is 00:35:38 Or even Google, we would batch 45 minutes of changes together and we were getting like three changes a second. So, you know, 45 minutes times three changes a second and several thousand changes are going to test all at once. We're going to run all those tests. And that's typical. It's not this exception. That's pretty much what everybody does is they skip over some number of changes. So now, what is CI designed for?
Starting point is 00:36:01 Its intended purpose is to have a test that was passing and to detect that somebody broke it it becomes failing and like I said that's relatively rare it does happen and this what CI is designed to find right so you had a test that was passing and now it's failing well in the middle you had I don't know 12,000 commits that came in at Google right 12,000 commits came in in that 45 minute window between when the last time you ran that test and this time you ran that test. So either you need, so what do you do? You scratch your head. Either you need some Swami or some Oracle that can examine those changes that came in the middle and say, this changed by Mary and maybe this changed by Bob or it could be Joe broke this particular test.
Starting point is 00:36:44 And then you go and experiment a little bit until you figure it out or you need automation to do Bob or it could be Joe broke this particular test. And then you go and experiment a little bit until you figure it out. Or you need automation to do that. You really don't want to saddle a person doing that work. So what does the automation do? It says, okay, change one, it was passing. Change 1,000, 12,000, it was failing. Now I'm going to go in the middle.
Starting point is 00:37:01 I'm going to say 6,000. And I'm going to say, okay, if it was passing still at 6,000, then it's on the right, the breakage, or if it was failing at 6,000, then it was on the left, the breakage, right? So in the simplest case, you divide, you can think of binary research, you just go down until you find it, right? Find exactly which change it was that broke it. Now, that's too slow. It depends on how fast your tests are. Like if your tests take 10 minutes and you do a bisection of 12,000, you know, log in, I don't know what log in of 12,
Starting point is 00:37:28 log to the base two of 12,000 is. I don't know. It'll take like a few hours. 15 or 16. Yeah. 15 or 16 times 10 minutes and you're going to be there a while. And so what we do is actually,
Starting point is 00:37:42 most of the algorithms I've seen do n-ary. So instead of binary, log to the base 2, you do log to the base 10. You're willing to run 10 copies of that test at once. So then in 10 minutes, you've bisected down to some range that's, you know, one-tenth the size of the starting range, right? So gradually, you get better and better. You isolate down. You find exactly the problem.
Starting point is 00:38:03 As I said earlier, that's easy if your test is 100 deterministic if you always run it and get a correct pass or a correct fail you know then you're 100 deterministic you're good you've seen that first failure and now you do your bisection and you hit this one the problem is we live in an imperfect world in testing so sometimes you'll run one of those intermediate points. You're running that test, you know, let's say you're running it 100 times. Well, if it's 5% flaky, five of those runs are just going to fail for no reason. And so typically in these algorithms, to make them work and to tolerate those kind of flakiness,
Starting point is 00:38:37 you have to do some kind of retries. Facebook does double up. I mean, when they're doing bisecting, they do more than one copy of the test at each change. So you can get votes faster, right? That's, I'm sorry, that's all mind blowing, right? You're kind of like, okay, how fast can I find it? All the whole idea is, when you're dealing with flakiness, you don't want to get it wrong. You don't want to say, hey, Mary, you broke this test when it was actually Bob, you know, and you can easily get it wrong because you're just sampling the ranges and you're assuming if you see a pass or a fail, you're assuming you can divide the range that way. So you do have to be careful to tolerate the flakiness that's in your system because you'll have not zero flakiness. That's just the way it is. In fact, at VMware now, we retried the passing one that came first. Why do we do that? Why retry that one that passes? We know it was passing there, right? And the reason is some of these problems can be environmental. I'll give you an example. We had a license to one of our tools expired, right? And when you ran that
Starting point is 00:39:34 tool, it would give an error, right? So, you know, at 3 p.m., the test started failing. And anytime you ran it after 3 p.m., no matter where you ran it, it would fail, right you were doing call for finding you would say oh it's the very first change but it actually wasn't a change in the code that caused it it was the license that expired so if you ran it at the passing endpoint on the left now you can tell hey that actually you know was a real it was actually a problem with the the license and not a problem with the code because that used to pass and now it's failing right um same way can happen at the other end you have this fail this lovely fail well it could be a flake you better retry it before you even start regression identification or culprit finding because because it could just be a flake 80% of the time it's
Starting point is 00:40:15 just a flake and if you retry it it'll work so so you need you need all these algorithms that seem simple in fact um when you go to some of the academic conferences I still argue with some of the academic conferences, I still argue with some of the academics about whether flakiness is an important problem. Why don't you just fix it all? And to be honest, I was at Mathworks, they tried to do that. They got a crap team of their best developers, they said, go fix all the flakiness, we don't want to have flakiness anymore. And they spent a month, they made progress. They lowered the number of flakes. But guess what?
Starting point is 00:40:46 Flakes were being inserted at about the same rate they were fixing them by the development staff. So it was like they couldn't win the war. And so then you have to acknowledge that flakiness is just part of what your system is. And you have to figure out how to deal with it, right? And that's where these more sophisticated algorithms come into play.
Starting point is 00:41:02 Yeah. Yeah, I think at Dropbox, we do something really similar. We run each test at least twice on each potential culprit on each commit. And we have a 150 area, like 150 bisect, like insect. 150, yeah, that's good. Yeah. And that's good enough for us. Yeah. And that's good enough for us. You know what? And you have to have good systems for doing this,
Starting point is 00:41:31 but you space out your continuous integration system runs. Like if you run them every half an hour, instead run them every hour. Guess what? That now gives you 50% more capacity to do culprit finding. So when a culprit comes in, you go active and you just like nail it like you do. 150 NRA, log to the base 150 of the number.
Starting point is 00:41:47 It doesn't take very long to find the culprit, right? And yeah, you double up the test runs so that you can get a vote about whether it's flaky or whether it's a real failure. Yeah, all of those things are great techniques and balancing. And you know what? I guess maybe that's something that would be fun to do. Maybe I'm looking to write a paper. I haven't written one in a while. And this might be fun to do. Maybe I'm, I'm, uh, I'm looking to write a paper. I haven't written one in a while about,
Starting point is 00:42:06 and this might be a good topic, like kind of like, how do you, what's the best optimal amount of mix between spending resources on those continuous integration chunks and the regression ID or culprit finding, uh, function, right? How do you divide your resource pool optimally so that you can take a little bit more risk, like risk a longer window of missing a problem and then be able to quickly find where the problem was inserted
Starting point is 00:42:33 and give it right to a developer saying, hey, this is busted. Or even auto rollback. We did that, some of that at Google, we did auto rollback for builds, but we didn't do auto rollback for tests because of the flakiness problem. At least when I was there, we never got to that point. I would have liked to do that,
Starting point is 00:42:49 but the flakiness made it erroneously blame or erroneously rollback changes. And that's too hard. You have to get to better algorithms for culprit finding. Yeah. I think the way we, since we could deploy all of this much earlier in our like company life cycle, when we don't have that many automatic reverts or auto rollbacks, we could like deploy it successfully. And the main way we could do that
Starting point is 00:43:13 is we had to make sure like it's passing 10 times on like the previous commit and it's failing 10 times deterministically on like the culprit. And that's when, that gave us like reasonable confidence. There's still been
Starting point is 00:43:25 false positives it's there's there were definitely false positives and we had just five and five i think about two years ago but 10 and 10 has been reasonably good do you let me ask you a question though do you do you measure your flakiness rate of your tests yes and what we've done and i don't know if this is the best strategy but this is what we've done, and I don't know if this is the best strategy, but this is what we've deployed since like a year, year and a half is as soon as something is flaky on head, we pretty much disable it and we let the developer know. And we get like a decent number of, I don't know what the percentage is, but I would say at least every day there's been five or six tests that get disabled. When you say flaky, do you mean like over a certain threshold flaky or do you mean even one flaky failure, right? We have basically four failures in 240 runs.
Starting point is 00:44:12 So on the same commit, we run the test at least like 240 times. So 10%. You're willing to tolerate 10%. Roughly. Like four out of 240. Oh, four out of 240. I thought you said four out of 40. So four out of 240. So that's two out of 240 that's not four out of 240 i thought you said four out of 40 yeah four out of 240 yes so that's two out of 140 that's about one and a half percent that's a good flakiness yeah that's
Starting point is 00:44:32 good yeah and that's have to be pretty stable and and the thing is that this is definitely more aggressive than how what humans used to do but people haven't complained to us just because they find out about flakesake so much faster. So they submit a change and then within three or four hours, it's like disabled by the Flake bot. They probably know which changes caused it. And when it's the infrastructure's fault, which is not that often, they will immediately ping like the developer tool team. Like, oh, turns out that, you know, it's the infrastructure's fault. And please, can you stop automatic quarantine?
Starting point is 00:45:06 That's what we call it. So all of this works. So in fact, let me ask you about that question. That was a big effort at Google and it's something I've been trying to do is to divide the space between flakiness that's owned by the development team, which means flakiness in the code or the test code,
Starting point is 00:45:22 code of the test or test code, right? That's one side. And the other is flakiness that's caused by infrastructure. And at Google and at VMware and at certain things at Netflix, I'm trying to get the thought process to, look, you should recategorize things that are infrastructure related.
Starting point is 00:45:37 So for example, we had a Google, a lot of problems with Selenium WebDriver, right? Selenium WebDriver, the big, you know, web test, click, click, they flake ariver, right? Millennium WebDriver, the big, you know, web test, click, click, and they flake a lot, right? And we basically noticed certain patterns that would indicate an infrastructure problem, like the browser never provisioned, or it couldn't connect, or whatever, right? And when those things happen, we would reclassify it as an internal error. And we wouldn't count it as flakiness. We would not, and we would track it with our infrastructure providers like the networking team or whoever right to try and make those things better like
Starting point is 00:46:11 one of the most important problems we're doing the same thing in vmware and and i think this is an important concept is to separate out that which the development team has control over and you can blame them for it and that which is not owned by the development team it's's owned by someone else, anyone else, doesn't matter who it is. If the developer isn't going to make a change to the code under test or the test code to fix the problem, then it's infrastructure to you and it should be tracked separately from, it shouldn't be counted as flakiness. It should be tracked separately and you should push your infrastructure
Starting point is 00:46:41 provider, whoever that is to make that better. Right? So that's a technique. I don't know if you're doing that at Dropbox, but that's a technique that I'm really excited about. I want people to be doing that kind of a thing because you don't want to be wasting the development staff time looking at failures that aren't their fault. You don't want to blame them for flakes in their code when it's really flakes in the infrastructure that's causing the problem. And it isn't always possible, but to the extent possible, we've been trying to write automatic classifiers that'll say, yeah, when this thing happens right here, we know it's because the browser didn't connect to the backend.
Starting point is 00:47:13 And so we're just going to cast that as an internal error and we're going to automatically retry it. And that's what we do at Google. That's what we do at VMware for that problem. Yeah, fortunately we didn't have such a large-scale existing issue to deal with, so we could tackle it. But we do something really similar. We have
Starting point is 00:47:29 some kind of diagnostic where it just prints out an artifact at a certain file part, like there's a failures.json, which tells the CI system, oh, this was actually an infrastructure error and you should retry it rather than a user error. And then we also give a reason, like, oh an infrastructure error and you should retry it rather than a user error.
Starting point is 00:47:46 And then we also give like a reason like, oh, infrastructure error because Bazel exit code seven. And that indicates like a Bazel failure. And what that gives us is, and then we basically track the number of different infrastructure errors. So we can go down and fix those. And yeah, those aren't like automatic, those aren't looked at as the flake system or culprit finding or anything at all. That's just, in fact, we try to hide that from developers. So you don't want them to see it. You don't want them to see it because they want to just only see the fact that it was passing or failing something they need to deal with and not this infrastructure problem. And so, yeah, that's an important concept. Again,
Starting point is 00:48:26 it's all about the vision of labor. Who's good. Who's responsible for fixing that problem. It's not your development staff. You don't want to be using their time to do anything with it. Yeah. I think we, we, we benefited a lot from just like knowing how auto rollback and all of that works at Google. Cause like without that, I don't think we would have built this stuff out and at least made it as mature as it is today.
Starting point is 00:48:47 Like it just pretty much has auto revert access throughout our code base, except for our production, like deployment configs. Because we were like, we just don't want to like auto rollback a deployment config that might cause like another outage or something. Well, I'm super happy to hear that actually,
Starting point is 00:49:03 because, you know, that's one of my main reasons for publishing when I was at Google was to trying to get other people to be able to stand on our shoulders and learn what we had done and do some similar stuff. And actually Facebook hired a whole group
Starting point is 00:49:17 to implement some of the techniques from our paper. And they're now ahead of where, I think they're maybe even a little ahead of where Google is on some of this stuff, right? So I'm super happy that that's starting to happen now in the industry. And I'm hoping it will eventually lead towards more research in the academic community, which is good. And also eventually sharing of some tools and going beyond just the idea sharing, but actually doing it the same way. Because right now, everybody's got their own implementation and still.
Starting point is 00:49:43 But at least we've been able to learn from each other the ideas the important ideas there there's and and i i stay up with um i'm going to put a shameless plug for my google journal club if you search the web for google journal club uh we review academic papers every month and we've we've had some pretty good papers come through there uh people from different companies, we had a guy from Facebook came and presented and so on. So it's like, really a good thing. It'll help you to learn more of these techniques from other people like from Google and Facebook and Dropbox and, and in some of the academics that are studying this area. And maybe you'll hear something that will impact you or that you'll
Starting point is 00:50:23 think is appropriate to apply at your company. So I'm really gratified to hear that you're doing some of the techniques that we have in our papers at Dropbox and you know, we're not done. Let's keep, let's keep improving the systems and making them better at doing this. Right. Yeah. Yeah. I think, I think I've read all of the papers, at least a year ago I had read all of the papers, at least read the abstracts were the ones that seemed relevant. So at Google and like facebook what i've seen is there's a bunch of key optimizations
Starting point is 00:50:50 you need to have right you have uh something like you don't run tests very often and you have to do that you can't run tests on every commit that's just not feasible then you use something like basil or in i think facebook's case there's, which is their build system. It's a rip off of Bazel, actually. Google developers who left Google re-implemented Blaze, which was the internal tool. It wasn't called Bazel, it was called Blaze. At Facebook called Buck and at Twitter called Pants. So you had Pants, Buck and Bazel which all implemented the same precise kind of building system. But it is really useful, right? Because it helps so much. So you had pants, buck, and Bazel, which all implemented the same precise kind of building
Starting point is 00:51:25 system. But it is really useful, right? Because it helps so much. And what that lets you do is test selection, because Bazel or buck and all of these tools, they know exactly what to run at every commit. So you can basically skip out. In our experience, we've seen like 90% of tests get skipped out on average, just because Bazel knows which tests to run.
Starting point is 00:51:44 And we have this test selection. What I've seen Facebook- Well, you also have precise dependencies. And Bazel will start executing your test as soon as all the binaries are ready, instead of waiting for the entire build to finish. It just says, oh, I have these four binaries. This test depends on these four.
Starting point is 00:51:59 Let me run that test. I don't have to wait anymore. Yeah. So that's test selection. But something interesting from Facebook's conference talk at ICST, so the International Conference of Software Testing and Verification, was they actually do retry selection. So they also try to predict when a test fails.
Starting point is 00:52:17 That was at my workshop. That was at CCIW, ITSP 2020. Yeah, I love that retry regression test selection paper. That was really interesting. Of course, none of the graphs had y-axis. You know what I'm saying? Yeah, but they actually try predicting how likely is it that a test passes
Starting point is 00:52:35 if you try retrying it, right? Yes, that's right. And I would say that 90% of them, as you said, it's probably a flake. So it might pass. But it's crazy to me that they're optimizing at that deep of a level. So that's the amount of compute they'll be saving just because of something that seems so not trivial, I should say, but it seems like a hyper-optimization.
Starting point is 00:52:58 Well, remember, they and you guys are doing lots of retries, lots, like 10x, right? So when they retry a test, like 10x, right? So when they retry a test, they don't just do it once. They throw a hammer at it. I think it might have been a little – they didn't talk about that, but it might have been interesting to see if they had just said, oh, we'll do two retries instead of 10 because we're pretty sure it'll pass on retry, or we'll do no retries and we'll just mark it as a pass because we're pretty sure it's a flake.
Starting point is 00:53:25 You know, I don't know. I don't know what the right answer is. But I also don't think they have. I haven't heard them talk about a quarantine system, which is super important to the health of the test pool. You have to be very draconian about not tolerating levels of flakiness. That's actually my biggest challenge at VMware is trying to get people to work on test flakiness has been hard. They just aren't. It doesn't resonate with them. They don't understand how much it gums up the automation. So why does like test flakiness happen? Like maybe you can talk about that because you've seen it at a bunch of different places.
Starting point is 00:53:57 Well, I mean, look, it's pretty much always the same thing. Actually, there was a Google testing blog post. God, I can't remember the guy's name i used to work with him too um and i read this one just the size of the correlation between the test binary size and the level of flakiness in the test and that only makes sense if you think about any measure cyclomatic complexity algorithm complexity the the number amount of concurrency all those things rise as the pile of software that you're testing gets larger. And then my determinism creeps in. I've had so many QEs get offended by calling it a flaky test. Because what? It's not my test that's flaky. It could be
Starting point is 00:54:39 the production code or it could be the infrastructure. Why are you blaming me? You're calling me bad test because I'm a flaky test and there's no connotation like that and i tried to explain it to people but i've had people at netflix they have objected to the word flaky they wouldn't let me use the word flaky it had to be sporadic or non-deterministic or something else right which is just nuts i mean we've kind of adopted that in the industry now as the standard term is flaky yeah but it it comes from the same things it always comes from, race conditions. It's funny, I think that you should investigate your test flakiness if it exceeds some threshold based on decreasing amounts of flakiness. If you have a flaky test that flakes once in a million times, it's not worth the engineering to go figure out whether it's production code, whether it's test
Starting point is 00:55:27 code. It's just not worth the time, right? But if you have a test that's failing 5% of the time, and it's gumming up your CI system, you want to aggressively find out what that problem is and fix it. There have been almost no studies. Jingjo Liuia lu did a study he's my he was my intern at google he did a study in i think 2014 where he tried to tell here are the the causes of flakiness and you know what i haven't seen large-scale studies i'd love to see it it's hard to figure out right how do you know whether it's test code or production code he was looking at when you make the commit to fix the bug that says it was flaky like what kind of code did you change when you make the commit to fix the bug that says it was flaky, like what kind of code did you change when you made that commit? Was it test code? Was it production code? Was it both?
Starting point is 00:56:08 You know, that sort of thing. And I don't know if it was very precise. I'd love to know better. And I don't want anyone to say that it's not production code because it is, it's not always test code and it's sometimes infrastructure, all three things contribute. Right. And yeah, I mean, you, what's so funny is when the testing systems at every company I've worked at, including VMware, right, have been the computers that are used for testing. We drive them for every ounce of compute that they have. We overcommit the CPU. We overcommit the RAM. We over everything. We're just loading them up. Here's a zillion tests. Go run them. Right. And it's funny, you can, there is a correlation because of timing issues being more acutely apparent and so on. There's a strong correlation as well between the busyness of your, your computer that you're running your tests on, like how over committed is it and how frequently you find flakes. Right? So it's funny.
Starting point is 00:57:05 At Google, we noticed this for sure. During the weekends when there was a lull in the usage of the overall compute pool and the computers were less busy doing the tests, the tests would be less flaky than they were during the peak times during the week when everybody's doing pre-commit testing and you're doing all this stuff. And the computers are just loaded, Right. And why is that? I mean, when you load down the CPU, you're basically causing race conditions to become more apparent because everything slows down to a crawl and maybe a three second, you know, wait, isn't long enough because you didn't get enough CPU slices during that three seconds to make much progress, right? So that's, and look, it's something we as an industry need to learn to cope with because that's how we're going to do testing. We're going to get 10 boxes and we're going to use, and you know what? Like I said, at Google, the test computers,
Starting point is 00:58:01 the things that were running in Bazel backends, right? Those guys had a higher utilization rate than Google search. They were pegged 99% of CPU all the time because we overcommitted them like crazy, like crazy. We would be running eight. We, I think we ran eight or 16. I don't remember what kind of computers they were testing jobs on these nodes, right. At the same time. And they would just like crush it. If you got the wrong combination of tests, like on that same node,
Starting point is 00:58:30 it could actually just crush it. And yeah, we ended up with a little bit of flakiness because of that. But the compute people loved us. Oh yeah, you guys are using your compute better than anybody else because we don't have a fricking choice. We have to write really smart algorithms. That's the other thing right every place i've been has to write really smart
Starting point is 00:58:50 queuing and you know load control algorithms with pushback signals so that you don't over commit to the point where things start failing but you don't you don't you want to keep it right at that spot where it's about to crash all the time right right? And that's getting the tuning and the pushback signals right so that you can keep those tests, as many tests running all the time as possible is the answer, right? And having that scheduler that's smart enough to say, okay, you know, I'm going to keep queuing things. The way that Google eventually worked was we would decide how often to run CI by the compute.
Starting point is 00:59:24 So we would start running the previous test cycle and it would schedule like a million tests. Usually it was like a million, million and a half, right? And it would go into a queue. They would all be in a queue and they over the next 40 minutes or so, the queue would burn down, right? It would just keep going down.
Starting point is 00:59:39 And then it would get to a point. We had a set point, like 200 items in the queue or 2,000 items or some number of items it was oh time to run the next ci cycle and then it would start preparing the test for the next ci cycle and selecting the million and a half tests that are going to run for the next cycle right and then it would include those so we never let the queue get to zero we always had a backlog of work waiting for the back ends to do the work and we would just keep the queue and watching it and we would actually if fewer tests were selected we would run more the queue and watching it. And we would actually, if fewer tests were selected,
Starting point is 01:00:05 we would run more frequently. And if more tests were selected, we would run less frequently. It was all completely adjustable and dynamic based on available compute. And really that's the trade-off to make. Use every, philosophically, use every drop of CPU and compute and memory
Starting point is 01:00:22 and everything else that you have for testing, use it all. And if you can't figure out how to do that, just keep working at it until you do. That's the best problem to solve, right? Use it all. And that helps you to reduce the risk. And as you get smarter and smarter at testing, you will be well served by that system that just knows how to keep that sucker full. And try and do it dynamically. Don't set your Jenkins job to run every 10 minutes.
Starting point is 01:00:47 Use an automated system that says, hey, when the back ends aren't busy anymore, do some more work. Here's some tests that you can run to reduce our risk pool. Randomly select them if you have to. Sorry, I'm just going out there on a limb. I think this is an important property.
Starting point is 01:01:01 Every good testing system I have ever seen works like that. It just keeps loading it just like it just keeps loading it in until it's full right yeah i think that's so many things i wish i could have implemented uh we've certainly seen like the wednesday four o'clock load like so wednesday used to be like our it still is the no meeting wednesday at dropbox it's like and then at four o'clock is when you know developers are generally submitting their code they're leaving for the day or whatever and they're like let me submit the
Starting point is 01:01:28 diff that i worked on today so we would see like overloads and increased flakiness at like four o'clock and we at one point our ci server would go down because of the number of tests yeah it's fascinating how similar the problems are in so many different companies like i would say the dropbox is similar to google like we have basil and all of those things at least technologically but to see that you know vmware netflix all of these places have kind of similar problems i just wish there was like a service that just abstracted all of this away for us and give us like a reasonably good price there is a new service for basil that has come out like there's a new company that's trying to do that it's called like build buddy and there's more than one there's several in fact okay one of them is in germany it's being run by um
Starting point is 01:02:14 uh right the the basil engineer right yeah yeah yeah he has his own company now okay okay wolf jack that's his GitHub username. Wolfjack. Yeah, that was his. Okay. So he's working on a Bazel company as well? He's a Bazel company. Fascinating.
Starting point is 01:02:31 So in fact, there's three or four competitors. I know this because at VMware we're evaluating which of the back-ends we want to start distributing our builds better on the build farm. And we're not completely converted onto Bazel. I mean, Bazel didn't exist when these guys started their build system. So they went with F cons, which is, Oh, completely nasty. Anyway. So, so we're gradually moving over to Bazel. We've got about half of it ported over and we're aggressively looking to get
Starting point is 01:02:58 RBE remote build execution going with Bazel for exactly this reason. And we don't yet have our tests in there. And that's in part because, oh God, when you do a system test for VMware, you basically have to boot VMs with the, we have an operating system that we run. It's not like you can just like, you can't, it's harder to do, boot an operating system than, you know,
Starting point is 01:03:21 run some test that's all written in Java on top of a server. Right. So, so that's, it's a much harder problem in some senses. We have good systems for managing those compute nodes and that's what we keep saturated is our farm of nodes where we can run these virtual nested, we actually run the nested virtualization of this operating system. We boot it up, we create domains.
Starting point is 01:03:45 And it's funny, we're actually using our software-defined networking system to isolate each test rig from each other test rig. There's only one IP address that gets in and then everything is inside local. And then there's one, so you can read and write and you can talk to it, but you can't access any of the machines inside there.
Starting point is 01:04:04 It's all with our software defined networking isolated from everything else. So yeah, so it's, so it's kind of like, and that, and those kinds of things are an impediment. Like when I was at Netflix, they had a software testing lab for testing Netflix app on, on Android and iPhone, right? So, so they had these little, these crazy boxes with routers in them where you could play Netflix all the time on them with the testing rig, right? And that was all custom and, again, wouldn't fit into a mold of, you know, some off-the-shelf testing system wouldn't know how to do that, right? So it is – I have respect for the problem. And I do understand that companies are different and that that has yielded some of the inability to share,
Starting point is 01:04:49 given that the market is so small. Maybe we'll get to some open source. We'll get to some ability to share some of these things. I hope so. But you know what? At least now I'm seeing the fruits of being able to share the ideas with companies like Dropbox and Facebook and others, kind of picking them up and running with them. Right. So it's good. The, the, the Basel migration
Starting point is 01:05:09 part is pretty interesting because I've read one quote where Basel it's always like, by the time you need to migrate to it, it's too painful to migrate to it as something that I've heard. How do you, how do you manage such a large scale migration? Like I remember like, at least I've heard that the Bazel migration, even at Dropbox took a long time. So, yeah. Oh yeah. We've been at Bazel migration
Starting point is 01:05:31 for about the entire two years I've been there. Okay. Right. And they started it around the time I joined there, joined VMware. And we're only half done. We just did the vCenter. We didn't do the operating system components yet.
Starting point is 01:05:44 And I think the way that we managed it half done. We just did the vCenter. We didn't do the operating system components yet. And I think the way that we managed it, absolutely. Look, we have a couple of thousand engineers working on vCenter and we basically doled it out to the teams. Each team had to work on their own part of the build. And we created stories and we had product management, program management working with them to check up on, make sure they did their work. We tracked it aggressively. We're just about done with vCenter now. It's almost completely finished. And then we have, we just started, we hired a couple of people recently, and we just really started in earnest the server side conversion to Bazel. We're very excited about it. It cut our,el cut even without RBE, cut our build times from it was like we have a caching, we did implement some caching, but it cut our build
Starting point is 01:06:33 times from like 90 minutes, like 30 minutes for recent, which is a big deal. Yeah, I mean, you know, developers don't like to wait for that shit. And how many minutes for a build? Yeah, one change in 90 minutes for a build. That's bad. Yeah. And this bad. This is a full Java code base. Well, actually, any big company that I've been at, C, Java, JavaScript, it's a mix of everything. Everybody's got everything now. It's kind of like you have to do certain things in C and C++ to go fast enough. You do certain things in Java. Then you do some UIs in Angular. And then, you know, you end up with everything. And that's exactly what we have at VMware.
Starting point is 01:07:11 It's exactly what we had at Google. We had everything. We developed from the operating system all the way at Google, from the operating system all the way back to the top of the stack, right? And now Google has less sort of embedded stuff. They don't have an operating system. They do kind of have an operating system, the operating system that runs in their compute farm.
Starting point is 01:07:28 But, you know, they're not doing a lot of changes to it. They don't have a lot of intellectual investment there. They mostly did their Kubernetes containers, which now basically everybody is using Google's container technology on their operating systems. And that's pretty much it. That was the main customization that they did. Yeah. And how much do you think RBE will speed things up?
Starting point is 01:07:50 Well, I can base it on Google. I tried to build without RBE once. I forced it to turn off RBE and I tried to build my component and because Google was all the way down to the leaves on one node, it was going to take, I'd never finished. I wasn't able to do it successfully because because it just took it took too long and it would fall somewhere in the middle and that would be that i couldn't i never figured out why it didn't work entirely um but yeah you know i think our build will go from 30 minutes to like five or ten minutes with rbe i really do i mean with the aggressive caching that rbe does and and I only know that because at Google, it was incredible. I mean, you could sync a sandbox
Starting point is 01:08:29 instantaneously with SourceFS and do a build with Bazel with the caching, and the build would finish in like, you know, five minutes. And I'm sorry, but it literally, at Google, the Bazel system was set up to build everything. It would build the malloc library. If somebody committed a change to the malloc library, it would build that sucker. There was nothing off the shelf, nothing pre-built. I mean, even Java, it would build the entire thing.
Starting point is 01:08:59 Which seems insane, but that's what they did, right? Granted, the Java build never changes that often, but when somebody commits to change the Java, then all of a sudden you have to build it. Right. And then you cache it once and it's done. Right. But, but it, it may analyze all those dependencies all the way down through the Java to everything else. Right. So, so yeah, it's with the aggressive caching and the RBEs, it's going to be a lot, lot, lot faster. And do you think like more and more companies are like adopting tools like Bazel?
Starting point is 01:09:28 Have you seen that trend? The trend is happening at the high end of the industry, right? The ones that are really successful. And I think what you said rings totally true to me, right? You don't have enough. It's kind of like the same problem you're going to get yourself into with CI. You're not really thinking about Bazel
Starting point is 01:09:48 until you are a successful company. And now you're thinking about your build is too slow, right? And then, okay, by then you've got something, whatever it is, Maven or Ant or... It's funny, Java building is so dumb. The way Java C works, Java C is kind of like its own make system it knows it's going to go read the the java the java files and compile them for classes that
Starting point is 01:10:11 you reference and it kind of doesn't stop it kind of just goes wherever it needs to go to get all the classes built and that's just really bad and so google figured out a way to get rid of that problem they basically created what they call i jarsars, which even though you didn't create an interface for your thing, it would create an iJar, which is just the interface. It was enough to get the other guy to compile. And then you could have precise dependencies. You could have things like precise dependencies, things like single compilation.
Starting point is 01:10:39 When you call to compile this file, it only compiles this class file and not the other ones. So that iJars concept, which was built into Bazel and their Java rules, super good. I mean, it just saves, it makes the Java build much more rational and cacheable than it is without that stuff. I know the person who worked on it, she was super smart. Yeah. And I've read about iJars, but I didn't realize it's like Google who made iJars. That's... Yeah. Yeah. Okay.
Starting point is 01:11:06 So yeah, you think generally that larger and larger companies are thinking more about moving to tools like Bazel just because of all of the benefits like test selection and using things like Buildfarm and RBE and all of those. Not like Bazel. I'm sorry. I think Bazel has kind of hit that market right i think it's not like basil it is basil i don't i don't know that there's very many alternatives to that system when you get to the real high-end stuff yeah it's definitely the the market leader and especially if you go to basil con or things like that you'll see that the wide variety of larger companies that that come to basil con and talk about their Bazel
Starting point is 01:11:46 migration. I remember talking to somebody from Cisco. I think the team that was managing their Bazel migration, they're like, well, we have no idea how long this is going to take. And this was two years ago. It's going to take forever. Yeah. Well, Bazel is kind of, I think one of the points of religion that came from Google that makes trouble for Bazel is most other build systems are willing to take pre-built components from just about anywhere you can find them.
Starting point is 01:12:15 Like if you use Maven, it just goes up to its Maven repo and pulls down this and this and this and this version of that. It's a component-based build system and the reason why it's such a mind shift to go to something like basil is that it doesn't it isn't a component-based build system it doesn't design to do that what it's designed to do is build everything everything in your entire tree and i think that's why it takes a lot of time to do basil migrations and i think that there that there is some talk of improving sort of the componentized nature of Bazel so that it can become easier and more approachable for companies that are switching from a component build system like Maven, where they get all their pre-built components and they want to use Bazel. It's really hard now. It's not an easy thing. You either download a
Starting point is 01:13:01 bunch of source code and check it into your repo, which most people don't want to do, or you come up with some hack way in Bazel to sort of get those components, those prebuilt components, and then insert, inject them into your build, right? Yeah, like maybe read some kind of like zip or squashFS and try injecting that with the rest of your artifacts. That's certainly tricky.
Starting point is 01:13:19 And I think even we considered building something like that at some point. So if Bazel has good support for this, that's going to be awesome. They missed the boat on that one. Yeah, because they never cared about it, I think, right? Well, it's a Google tool and Google doesn't have that problem.
Starting point is 01:13:35 They don't do it that way. Yeah, I've seen similar cases. Like I've seen that, you know, gRPC is great for the set of languages that Google uses. But like if you try using the Node.js client, it has like all sorts of like memory leaks and all of that stuff, which would definitely not exist in the production languages that they use.
Starting point is 01:13:53 Yeah, that's right. So let me go back to a little bit of the beginning and first ask you, what got you interested in testing? It's such an esoteric topic compared to most things. And I know that we're kind of running out of time. So I also want to ask you, what advice would you give to an engineer who's interested or they're trying to figure out what to do next? And what's your pitch on why they should work on testing, for example? Well, so two things. I was a customer-facing
Starting point is 01:14:20 product developer at Mathworks working on their compiler product. And I've been, I've been working on compilers and languages and language tools for a long time. So I'd kind of been in tools, right. A little bit. I knew my stuff, my way around that stuff. And the build system at math works was horrible. The CI system was absolutely atrocious. And I wrote in my, my annual review, I'll never forget it. I wrote, I would rather go work on that piece of shit
Starting point is 01:14:45 than spend another year suffering as a developer trying to use those tools because they were horrible. In fact, when I was at MathWorks, the team producing the tools, when they checked in a change, it would automatically sync out to the main share drive where you would run the tools, like the tool share. And literally you would get syntax errors
Starting point is 01:15:04 and you'd have to call Michael Meerman. Hey, Michael, you broke this thing. You broke the script again. I can't build. It's like, that's the kind of stuff it was. And literally they took me up on it. They put me in charge of running the CICD system for MathWorks.
Starting point is 01:15:20 And I hired a consultant. His name was, oh God, I should know his name. I can't think of it. Anyway, sorry, it's been 20 years since I worked with the guy. But we hired a consultant. He came in, and we changed our system, SCM system from CVS to Perforce, which was hot at the time. We fixed our build system. We started
Starting point is 01:15:46 working on a new version of the CICD system. And, you know, we made a lot of improvements for MathWorks to their build and test system. So I was really happy. And I've always kind of liked working on infrastructure. Let me say this. Advice I have to new people, which is another thing you asked me about, right? So look, the reason I like infrastructure is because you're right next to your customers and you know them, you're a developer too. You know what they do every day and you can tell a little bit about their pain. And if you need to talk to them, you go talk to them. And you also get to synthesize your own requirements. And guess what? If a company is big enough to hire an infrastructure person, you're not going to have a problem, right? You go into a customer
Starting point is 01:16:32 facing team, and then they decide to cancel that product, which happens all the time at Google, it happens at other companies too, or that feature. Maybe they're not going to do it that way anymore. And they just like, you know, cancel the whole thing and that team goes away or they have to get reassigned to other groups or whatever, right? And with infrastructure do it that way anymore. And they just like, you know, cancel the whole thing. And that team goes away, or they have to get reassigned to other other groups or whatever. Right. And with infrastructure, it's more stable. You, you have, you get to synthesize your own requirements. You get to sign up for developers and say, Hey, what if we did this? Would it help you? Right. And you interact with them more.
Starting point is 01:16:57 And I like being close to the customer. I like being able to impact their, make their life better, have concrete impact. They asked me for something. I give it to them and they're happy. Right. And having that close connection, not having a product manager in the middle, the product manager saying, Oh no, no, no. That UI has to have a click here. Nobody tells me that because,
Starting point is 01:17:18 you know, it's like, nobody cares about the internal tools that much. Yeah. Nobody cares about the internal tools. I can do whatever I want. Customers don't see it. So all of those things are good. I think the, and the other side is it's, look,
Starting point is 01:17:31 let's just be frank. Everything that companies invest in internal tools, CICD, testing, whatever, to them it's overhead. It's not adding customer value.
Starting point is 01:17:41 It's not going out to the customers. Yeah, indirectly it does. You can't ship stuff out to the customer without testing it companies get burned with that and then they say oh well if we keep shipping crappy software to our customers guess what they're not going to buy our stuff anymore so maybe instead of developing that next feature we should hire someone to work on making the process better so when we ship another release it doesn't break that for them them, that kind of thing. That's when companies wake up and they see, ha, I need to do something about this.
Starting point is 01:18:15 When you have an infrastructure team, they're guaranteed to be smaller than they need to be. They have to be scrappy. They have to be agile. They have to be willing to listen to the customers and the engineers and say, oh, you need this? I will do that. I will solve that problem. That doesn't mean you do exactly what they ask you for because engineers all have come with a solution. They never come with a problem. They come to you and say, could you do this? And you're like, well, why do you want me to do that? And then they explain and say, no, no, we should do this other thing instead. This is what you really want.
Starting point is 01:18:40 You really want culver finding. You don't want this thing that you're talking about. And so that synthesizing of about and and so so that synthesizing of the requirements and so on it's really gratifying but you are going to work in a small lean team you're going to have to be scrappy you're going to have to be like you know doing with whatever you got you know oh you only give me 10 machines for testing this year fine i'm gonna i'm gonna peg those suckers you know so so yeah it makes it it makes it fast-paced it makes it exciting i love i love
Starting point is 01:19:05 working on it i i don't think i would go back to a customer facing role at this point i enjoy it too much yeah that's just me yeah i think you've synthesized it really well like it's very similar to what my interests at least were when i started like in a testing role right you get to work right next to your customers plus you have the. So you don't have, you decide what your roadmap is. And I think that's a really powerful thing. It is. You don't have that product manager in the middle, mucking it up and telling you no, no. It's funny. The bad part is that companies do kind of under invest in this area. And that I think is sometimes to their own detriment,
Starting point is 01:19:46 but it goes back and forth. Like when I was at MathWorks, they shipped a serious customer defect to Toyota, which was their biggest customer. And Toyota came back and said, you got to have zero defects. We're not going to take your software anymore if you don't fix this problem. Right. And then they really got religion about testing and about stuff like that. They invested more in it. That kind of wake-up call has to happen for companies to get interested in investing the correct amount in doing this sort of thing. Once you get to a mature software company, you have to have investments in this or you're going to croak. There's no way out of it. You don't want your brand to suffer.
Starting point is 01:20:20 How much more time do you have? We're at time um i i can go probably let's see i have a meeting i don't think i have another meeting until four that's okay so i can spend a little bit more okay perfect so yeah i wanted to ask you more about investment right how does one know like you spoke about this briefly but like how do i know exactly like am i investing enough like am i investing the right amount of, am I investing the right amount of money? Am I investing the right amount of headcount in making my developers more productive? How do I tell? Yeah.
Starting point is 01:20:52 That's a super hard problem because it's really hard to measure engineering productivity. And I think generally what I think companies don't pay enough attention to, there's a certain, and I actually think surveys can get at this. It's probably the best measurement that I've seen that kind of is practical, right? It's like, how happy are your developers? If your developers are miserable, they're going to leave your company. They're going to be like introducing more bugs because they don't have the tools that they need to get their work done. So all you need to do, like when I started VMware, right, they hired me, it was too late for them. But I started VMware. And I went to these meetings, all the testing engineers were talking about how flaky
Starting point is 01:21:35 the infrastructure was, the infrastructure wasn't working, it was flaky, it was this, it was that. And I spent a huge investment right when I got there, trying to track these infrastructure failures separately, and report them to the infrastructure team. I'm still meeting with them every single week, right? And it's, again, listen to the pain of your engineers and your testers, right? If they're in pain, they're not going to be successful. And in my opinion, almost you want to invest enough in infrastructure, both people and hardware to do software tools and tooling to make those engineers happy and productive. If they feel like they're a point at which it becomes more valuable to invest that incremental value in a new customer feature than another engineer working on your internal infrastructure code. And companies have to make good judgments about that to stay in business. They have to get enough value going out to the customer and enough
Starting point is 01:22:39 investment in infrastructure to keep their engineers happy. And I guess there's two things. One is the engineer's happiness. And then the other part is, are your customers getting defects that they can't live with? Yeah. You know, and that caused a big shift both at Google and at MathWorks.
Starting point is 01:22:57 When I was there both times, the company, you know, leadership said zero defects. That's what we want. We don't want any defects. Even at Google. We want to track them. We have SLAs. If it's a P1 bug, you have to fix it in a week. If it's a P0 bug, you have to fix it in a day. If you don't fix it, you have to update the bug every day about why you didn't fix it.
Starting point is 01:23:16 This was because at MathWorks, it was the incident with Toyota. At Google, Google Cloud was pushing stuff up every day that would break customers was Google Cloud was pushing stuff up every day that would break customers, you know, using Google Cloud. And it was bad for a while. They had a lot. They were starting to get a bad reputation. They were starting to lose big customers. And they said, no, no, we got to fix this.
Starting point is 01:23:35 And so VMware hasn't got their zero defects on yet. I'm working on that. They're not they haven't seen an acute problem, but you know, it's scary to a company that makes their money on their software investment to their customers and every, every, whether you're a software company or not nowadays, that's what you're, you're doing. I mean, Google's doing that and Facebook is doing that too,
Starting point is 01:23:58 even though they're not selling software, their software depends on whether they live or die. Right. And, and even though it's an end to a means of advertising, it's still, they need it, right? So it's kind of like, it's that mix. Are you investing enough that your engineers are happy and productive? And are you investing enough
Starting point is 01:24:15 that customers aren't finding serious defects? And if the answer to either of those questions is like, well, our developers aren't happy and our customers are finding defects, then maybe you need to increase your investment a little bit in your internal infrastructure and your, both in people and machines to make that better. I think that's a great answer. And that gives like a really easy and good, like well to understand framework, right? Like you survey your developers and if you decide on, you know, how happy you want them
Starting point is 01:24:42 to be, and if they're like only like 30% happy and you don't like that, you should fix that. And if it turns out that you're shipping way too many bugs and you're having like maybe way too many incidents and you're not being able to fix stuff quickly enough, that's another thing which you need to fix. It's fascinating to me that, you know, like, because Google's often like it has like a good engineering brand and all that. Like it's,
Starting point is 01:25:01 it's interesting that they ran into a similar problem with Google Cloud. Yeah, they did. Totally. They were shipping out too many bugs and they made a sea change. And you know what? At that time, guess what? Did you repeat that sea change? What do you mean by that? Well, a sea change is like you're totally changing the way you're working. You're doing something new things. Like having any sort of bug tracking SLAs about how long it would take you to fix a bug was mind-bending for the whole organization and at the same time they invested more in infrastructure and in people and in hardware to help back that up and to help
Starting point is 01:25:37 you know so so look if if you don't keep your developers happy they'll leave and you don't want to have a brain drain and if you don't keep your customers happy they'll leave and you don't want to have a brain drain. And if you don't keep your customers happy, they'll leave and you won't make any money. So, you know, there's no point to a software company if you can't do both of those things well. And you can tell that you need to make more investment in your internal infrastructure when either of those things is out of balance. I think that's, you know, and that's fuzzy. I know it's fuzzy, but there isn't a precise, you can't say lines of code per hour per developer. Yeah, that doesn't make sense. It's not going to help you.
Starting point is 01:26:10 Yeah, and developers feel happier when they can ship more stuff and they're more productive, right? It's just true. When I can deliver more feature and I don't get stuck on the build, I don't get stuck on some test that flaked, I have a good experience doing that.
Starting point is 01:26:24 I'm going to be more happy and productive as a developer than if I get that stuff not working well. And the other thing I would say, if you're in tools, dog food your stuff. Use your own system to run your tests. Because if you don't, then you have that ability to sort of find your own problems, right? So it's good. Okay. Then let's go one step back. I had this question from something we were talking about like an hour ago, which I think is pretty interesting. The reason why we run tests, so we run tests like every 20 minutes.
Starting point is 01:26:53 And I think you said that initially, I used to run tests like every 45 minutes, and then it became more like elastic. But the reason why we do that is because we want all of our tests to be green before we deploy like a new version of our main website. Is that overkill? Like, I remember there's a talk by Facebook where they talk about the fact that they disable tests that are flaky. And unless it's like a critical test, they just let the push go through, even though there's like a flaky test that's disabled.
Starting point is 01:27:24 And I used to think that's really strange but like over time even my my viewpoint is google does that too yeah okay so google does that too how how much like you know it's a very like maybe it's a common industry thing that every test should pass before you deploy but like in practice how much does that matter or like what what does google do or like what have you seen in other companies well like i said flaky tests that are quarantined do not participate in making decisions about whether to release the product they're completely blind to those to that testing inventory yeah um they do have every place i've been does have some set of tests
Starting point is 01:28:01 that they want to run kind of before they ship and they're defined in different ways. Google has that too. They have this thing for ads called, um, Oh God, I can't even remember what the name of it is. This big ad testing framework that takes a couple of days and you deploy all your ad servers and they do their, their big, you know, systemy test kind of a thing. And that they, they run before they push a new version of ads okay but when you're when you're going fast and you're pushing frequently which is the best practice the it right right then then you do you you have to be able to sacrifice some tests that are flaky and okay you hope and again remember you know only even if you had that test 93 of the changes going through your system are probably clean so you know it's it's probably you're taking more risk again testing all about risk
Starting point is 01:28:54 benefit right and when you're when you have a flaky test that's coming up your system taking it out it's probably less risk than letting it run and false signals and wasting your time trying to figure out what's wrong yeah so flaky flaky test is one thing. Yeah. You, you disabled those cause they're a low signal, but now you have like the self test that is like relatively clean, but you still can't afford to run all of them before you. Well, how do you decide that? Google? That wasn't true. They would run all the tests before they deployed. Well, let me say this pretty much every 45 minutes, we would run all the tests. That's exactly pretty much what we were doing. Right. All the tests. Let me say this. Pretty much every 45 minutes, we would run all the tests. That's exactly pretty much what we were doing. Every time we got a cut, which was about once a 45 minutes based on our compute size and our test load, it probably changes over time. But that, you know,
Starting point is 01:29:43 if there's tests that you're not selecting because the code didn't change and you're not running them, you know, then there isn't much risk. Yeah, there isn't much risk. If you're using smart regression test selection, or at least using the reverse dependencies of Bazel to do regression test selection, then you're not going to miss much. It'll be okay. Yeah. But then you have developers who are working on Java, right?
Starting point is 01:30:04 Wouldn't that invalidate every single Java change? And then you have to run every single test. What do you mean? Like, if you have a developer who's editing Java, then basically every single The basal reverse dependencies are correct for Java, at least they are used ijars. Yeah, but then what that means is every single, all the Java code basically needs to get rerun, right? Cause Java is, has changed. How valuable is that right before, or like, does that, is that just the way things are? To me again, look, it's all a measure of risk.
Starting point is 01:30:43 Yeah. That you're willing to take and how much you're willing to spend to avoid the risk. Now, I think the feedback loop is super important. If you take that shortcut, then you don't run every last test. And then you find out later that some customer reports a bug and it could have been caught by that test that you skipped. Then maybe that's more important than you should, you know, think about your policies. But generally, taking more risk is probably okay. And again, it's a game. And there's not enough money in the world to pay enough to eliminate the risk, it would be prohibitively expensive. So it's kind of like that, how much
Starting point is 01:31:25 risk are you willing to take? And that's a judgment call different for different companies and different stages of their life cycle and different customer expectations. If you're a startup, you probably don't care about the defects at all, because nobody's paying for your code yet. You know, and if you're, you know, a mature company like Google ads, and you stop serving ads for five minutes, you can, you know, waste billions of dollars of revenue for Google. And then that's, then that becomes an incident and everybody's on it and you figure out how to prevent that from happening ever again. Yeah. It's such a hard problem, right? Like figuring out, but then it's all about trade-offs
Starting point is 01:32:02 and there's no, like, there there's clearly no i wish there was more i wish that you could be more precise yeah with formulas for how much to invest versus how many developers versus what kind of i don't think there are such formulas in this space i think it's partly because engineering productivity is hard to measure yeah and it's partly because the risk tolerance depends a lot i mean you press you press play on your Netflix remote and your movie doesn't start, maybe it's a bad day, but nobody dies. You hit your brakes and your anti-lock brake system malfunctions,
Starting point is 01:32:32 you die, that's a really bad problem, right? And so risk tolerance has to be different for different kinds of software. Yeah, I think if more companies come out and say you know what we don't actually run tests that often because of our risk tolerance and this is how we came out with it if there were just like more blogs talking about these things i think a lot of what how the industry thinks about this in general because i've seen i've definitely seen most developers are just like cautious they're like oh we just we should run everything all the time what's the
Starting point is 01:33:03 point of not running stuff what if there's like a bug in our selection logic? Then I think the way industry thinks about this will probably change. And it's kind of fun. All of the testing engineers that like push that change. Yeah, exactly.
Starting point is 01:33:18 Well, then you look at something like a 737 MAX, perfect example, right? It's like, you know, that's, that's something they should have tested the shit out of it they should have had so many tests they should have had anyway so so you get the idea and i think that's people need to start thinking about all testing as sort of risk mitigation and the and the risk the main risk you're having is to make your customers unhappy or cause them harm right but you want to avoid that at all costs because you want them to keep paying
Starting point is 01:33:46 you at however they do that. And, you know, if you can take appropriate amounts of risk based on that risk tolerance, and then you decide your investment based on that risk that you're trying to avoid. Right. And that's basically some way to arrive at a formula. And then the other side of the equation, obviously, is to keep your engineers happy and productive. If you have unproductive engineers, you're not going to be able to make new features.
Starting point is 01:34:12 You're not going to be able to get value to customers quickly enough in this day and age of this industry to succeed. Yeah, I feel like this is like a great stopping point. I have so many more questions around, you know, we discussed this just before, like around, you know, privacy and security and how do you incentivize those things. And I was also going to talk about code review, but I feel like we'll have a good like follow up podcast or something. This is pretty good. This is fun. I enjoy this. And when you get when you get the podcast, put up, send me the link to it. I'll share it with my family members and maybe they'll laugh. Yeah, thanks for doing this again have a good day
