Software at Scale - Software at Scale 23 - Laurent Ploix: Engineering Manager, Spotify
Episode Date: June 10, 2021
Laurent Ploix is an engineering manager on the Platform Insights team at Spotify. Previously, he was responsible for CI/CD at several Swedish companies, most recently as a Product Manager at Spotify, and a Continuous Integration Manager at Sungard.
Apple Podcasts | Spotify | Google Podcasts
Highlights
05:40 - How CI/CD has evolved from a niche practice to a standard and expected part of the development workflow today
12:00 - The compounding nature of CI requirements
14:00 - Workflow inflection points. At what point do companies need to rethink their engineering workflows as they grow? How that's affected by the testing pyramid and the "shape" of testing at your particular organization
20:00 - How the developer experience breaks down "at scale" due to bottlenecks, the serial nature of tooling, and the "bystander effect". Test flakiness.
28:00 - How should an engineering team decide to invest in foundational efforts vs. product work? The idea of technical debt with varying "interest rates". For example, an old library that needs an upgrade doesn't impact developers every day, but a flaky test that blocks merging code is much more disruptive
33:00 - The next iteration of CI/CD for companies when they can no longer use a managed platform like CircleCI for various reasons.
40:00 - How should we measure product velocity?
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining me here today is Laurent Ploix, which I hope I pronounced correctly; he's an engineering manager at Spotify.
And he's been working on CI systems for a large part of his career,
initially in fintech companies and now in more techie tech companies,
I guess is the right way of framing it, like Spotify.
And you've been in Europe, in Sweden. Is that correct?
So I was born in France and I moved to Sweden approximately
13 years ago.
Can you tell listeners what got you interested in CI, CD?
It's a pretty niche topic. Everybody has to deal with it but nobody is really
excited about CI. So what got you interested in that?
In 2003, I was a manager for a team that was developing a virtual machine. The virtual machine was running some old VMS code on Unix, various versions of Unix, like AIX, HP-UX, and a couple of other ones, plus Linux. And this virtual machine was running financial code, so we had to be careful.
And the problem we had was that we had my development team, then we had the QA team as well,
and so we created a new version of it, and then it went to the QA team, and the QA team was trying
to find bugs. And of course there were quite a few. So they opened tickets, and we were fixing
the tickets, and then we had a new version of the
virtual machine, etc. So it's a very well-known process,
except it takes a lot of time and of course it's very slow to converge to a bug-free version.
But we didn't really know a better way to do it, frankly. Also, we had to test against four or five
different versions of Unix. So, yeah,
it was difficult. And then someone in the team said, look, I found this unit test thingy.
Maybe we should try that. So, well, we gave it a try. And so, we started to write a number of
small test cases, which today we would call just unit tests. And effectively, what happened is that the QA team still found
bugs, of course, but they found the bugs that were say advanced
or complicated. But the trivial bugs just disappeared. At least
on one OS, which was the one where we were working, which was
Linux. But for the other ones, we still had those stupid bugs,
because well, we didn't test so much on them. So that's when we
thought like, hey, so we have this unit test. We could
automate that, right? Why don't we have five different servers
of five different OSs? And then, well, we run them. And that's when CI was born for us. So after
a while, of course, we found that we could release not in six months, but in 15 days.
And there were no trivial bugs anymore. Of course, there were still advanced problems or use cases that were difficult to find.
But whenever we found a bug, we created unit tests around that.
And it was gone more or less forever.
So actually, what happened for me is that I realized that I was super interested,
of course, in developing software, but this very idea of making life easier for developers was something that I was truly interested in. And then we moved to Sweden and I started to work for another fintech company, and this fintech company also had a QA team, of course, and like a
hundred developers approximately.
We were doing real-time trading systems.
And so I took over the QA team and said like, hey, let's refocus our effort onto automation for developers
because we wanted to change the mindset
of how the company worked.
I mean, developers were really good,
but at the same time, as always,
the software was released to the QA team, the QA team found bugs, and it came back to developers
that prioritized these Jira tickets, and et cetera, et cetera. And it just took days, every time.
So it took a lot more time than it should to release the software.
So we refocused the effort entirely on two things. The first one was to change the mindset
and to change the approach of the development teams so that they would think in terms of
writing the test for themselves to protect against future bugs and against regression
from other people maybe touching the same code and
at the same time we needed to change the infrastructure of course so that we
could run them so it was a lot of battles to take at the same time so it
took a few years because you really have to change a lot of things in a company
when you want to go from long release cycles to short ones. Effectively, what happened is that we went from six months to a year to release, to this software which was more or less releasable at any point in time. And the number of value-adding tickets tripled over the years, whereas the number of critical
tickets went down dramatically.
So in some way, we can say that we tripled the velocity of this company.
At this point, I was hooked.
I really liked that job.
And working on evolving a company,
working on making life easier for developers is really something I enjoy.
Have you noticed a phase where you've had to convince
engineers who were initially not used to a CI environment
and trying to convince them that it's a good thing?
How have you gone about that?
Have you dealt with that, or is it just like an industry standard practice now? Today, I would say it's an industry
standard. But 10 or 15 years ago, it was not, or at least not in a place where I worked. And
what I realized is that
sometimes people were not convinced that it would make their life easier to, say, for instance, write unit tests or write different types of tests, really, and react fast on failures. But what really worked is when the system stopped working and suddenly they were blind. There was one developer in particular who was very much against tests, very much against that, because he thought that they could basically do a good job from the beginning and not have to test so much, because we were testing actually quite a lot.
And that's only the day when the CI system went down that I saw the same guy come and
say, hey, by the way, we are
blind. Now we don't know if the quality is good or not. So please, like, please fix this incident.
So I was like, all right, isn't that the same person who told me that it was not very useful? But I mean,
that's not the case anymore. Like, today, nobody would challenge that, I think. Not in the software industry, at least.
Yeah, but do you think this was the same
like 10 years ago or like 15 years ago?
Or has it been like a gradual evolution?
So I'd say that if your company
depends on software to create value,
you'd better adopt those practices.
Effectively, CI enables you to release faster,
to create value faster for your customers.
And if you don't do that,
well, your competitors will,
and you're going to die.
So, yeah, we've seen an evolution
where more and more companies adopt those practices,
but the ones who don't,
well, at the end of the day, they disappear.
So, yeah, that's what happens.
The entire industry is moving in that direction,
and one of the good signs,
like one of the signs that it's becoming mainstream
is that you see a lot of companies
providing CI in the cloud.
That's usually a sign that the practice has become mainstream.
Do you feel like there's a cultural gap
in a world where CI used to not exist
and then you add it to a company or a team?
Is it hard to convince people to, you know, get thinking about keeping master green or keeping the build good? So it's a question of perspective and culture. As a developer, if you only focus on you providing or producing code,
you could argue that what really matters is your pull request or your branch,
and that maybe it's the job of somebody else to go fix merge conflicts in master.
But if you think holistically, if you take the entire company into account, or all the
teams working on the same project, then things change.
So the question becomes, how do you evolve the culture so that the mindset is that everyone
needs to go faster, not only each and every person on their own piece of code.
So there can be resistance if people don't see the value of what they're doing for everyone else.
In a fintech company, I have experienced resistance from some teams that thought that
they had done a good job and that
somebody else, for instance, was using their APIs, and they were using the APIs wrong, and therefore their tests were wrong. But in practice, you really want to take the person providing the API and the people using the APIs, and they have to work together to make the piece of code work. And again, it's a culture shift. It's a shift from my own corner into everyone is in
it together and we have to work together. So yeah, that's like part of the job of a CI manager is actually to think
holistically and to try to change the culture of everyone so that they see the impact of what
they're doing on everyone else and not only on their code. Another kind of resistance you can get is from QA teams that would
think that you're basically removing their job, which is not true. In fact, what
happens is that you truly don't want humans to find trivial bugs. That's a
waste of time. You want humans to focus on the very hard-to-test problems like usability, UX issues, logic inside the product, whatever has to do with human interaction. That is really important to test, and you need QA for that. But whatever can be automated, there is really no reason to have humans do it.
But here again, it's a change of perspective.
It's a change of culture.
The point is that quality is not a QA problem.
Quality is everybody's problem.
Okay, and I think one thing that all of us who've had experience with CI, like managing CI systems at larger companies or even smaller ones, have noticed is that over time, CI itself gets fairly complicated, because the number of developers is continuously increasing and the number of tests in your code base is continuously increasing. And because of that, there's this compounding effect on the CI system where the number of build minutes, or the amount of time it takes for a build to complete, just gets longer.
So have you experienced this in the companies
you've been at so far?
Yeah, so what you see is that, as you said, if you have twice as many
developers, they will probably develop, like, create twice as much code. And if you add to that,
that they also create more tests. And so the time it takes to build is longer, the time it takes to
test is longer. Like, you have more platforms, you have more supported OSs, more, generally speaking, more platforms in many ways.
So what happens is that, yes, it explodes.
And soon your one machine becomes 10 and 10 becomes 100,
and that's literally impossible to test everything all the time on every commit you make or even every pull request. So that's a time when you definitely need to have some kind of dependency graph so that you save CPU. I mean, typically build systems like Bazel or similar will help you focus only on the ones that actually need to be built or tested.
That said, and we'll see that a bit later, it's not enough actually most of the time. Even by
only building and testing what needs to be or what is impacted by a change, it may still be too much.
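To make the dependency-graph idea concrete, here is a minimal sketch, not anything from Spotify's tooling: given a hypothetical target graph and a mapping from files to the targets that own them, it walks the reverse dependencies to find which targets actually need to be rebuilt and retested for a change.

```python
from collections import deque

# Hypothetical dependency graph: each target lists the targets it depends on.
DEPS = {
    "//player:app":   ["//player:lib", "//common:utils"],
    "//player:tests": ["//player:lib"],
    "//player:lib":   ["//common:utils"],
    "//search:tests": ["//search:lib"],
    "//search:lib":   ["//common:utils"],
}

# Hypothetical mapping from source files to the target that owns them
# (a real build system derives this from its build files).
OWNERS = {
    "common/utils.py": "//common:utils",
    "player/lib.py":   "//player:lib",
    "search/lib.py":   "//search:lib",
}

def affected_targets(changed_files):
    """Anything that transitively depends on a changed target must be rebuilt and retested."""
    # Invert the graph once: target -> targets that depend on it.
    rdeps = {}
    for target, deps in DEPS.items():
        for dep in deps:
            rdeps.setdefault(dep, []).append(target)

    seeds = {OWNERS[f] for f in changed_files if f in OWNERS}
    affected, queue = set(seeds), deque(seeds)
    while queue:
        for dependent in rdeps.get(queue.popleft(), []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

if __name__ == "__main__":
    # A change to search/lib.py only touches the search targets;
    # a change to common/utils.py would touch everything.
    print(sorted(affected_targets(["search/lib.py"])))
```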
So what you typically do is that you need to decide what
you want to test during the pull request.
And there is a certain level of risk that when you merge to
master, you may have problems still happening on master.
And you may not even be able to run all the tests on all
the pull requests on master.
Have you noticed some kind of tipping point at
which testing every single thing on a PR or on every
commit is no longer feasible?
Have you noticed any kinds of patterns, like number of developers, number of commits coming in? That's a very good question. It very much depends on the shape of your test pyramid. The test pyramid typically says that at the bottom you have a lot of unit tests. Each and every one of them covers a small portion of the code. They test very few things, but they go really, really fast. And also, as soon as they break, you know exactly what part of the code is
involved. At the top of the pyramid, you got typically the end-to-end test. They take a
long time to run. They cover a lot more code. In that sense, that's good. But at the same
time, they are much more difficult to troubleshoot.
And there is everything in between.
And as an example, in financial software, we were testing different types of financial
instruments and a lot of aspects of the pricing of those instruments on a lot of markets.
So this is typically something which is at the top of the pyramid that can run for hours,
dozens of hours, or sometimes hundreds of hours. And of course, you don't want every developer to
run that. So you have this combination where you want developers to run all the unit tests typically,
but at the same time, they cannot do that locally, because if you support multiple environments, then of course their computer is only one of those environments.
So you want them to run all the unit tests in the context of the pull request against
the CI environment.
Then you also want to run the long running test if we know or if we believe there could
be an impact on them.
At the end of the day, you probably cannot run
all the long running tests against all pull requests
for all developers all the time.
So you're going to run them from time to time
against master and bisect to find the culprit
if there is a logical merge conflict.
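Since the long-running tests only run periodically against master, finding the culprit commit afterwards can itself be automated. A minimal sketch using git's built-in bisect, with a hypothetical test command and revision names:

```python
import subprocess

def find_culprit(good_rev: str, bad_rev: str, test_cmd: list) -> None:
    """Binary-search master for the commit that broke a long-running test
    which is not executed on every pull request."""
    subprocess.run(["git", "bisect", "start", bad_rev, good_rev], check=True)
    try:
        # git re-runs test_cmd on each candidate commit:
        # exit code 0 means "good", non-zero (other than 125) means "bad".
        subprocess.run(["git", "bisect", "run", *test_cmd], check=True)
    finally:
        subprocess.run(["git", "bisect", "reset"], check=True)

# Hypothetical usage:
# find_culprit("nightly-ok-tag", "origin/master",
#              ["python", "-m", "pytest", "tests/pricing/test_matrices.py"])
```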
That makes sense.
So it's not necessarily about the number of developers
or number of commits.
It's about the shape of your tests.
Basically, how much you follow the testing pyramid,
in a sense.
Even if you're a small team,
but if you spin up a lot of end-to-end tests
for every commit, which just takes a long time,
very quickly you will start running into issues.
Whereas you could be like a larger team, but if you focus mainly on unit tests and all
of that, you might miss out on the coverage of end-to-end tests, but you can stay relatively
freer of these concerns for a much longer time.
So I think the best way to find the right balance is to
basically leave it to the teams to decide, because they
know the value and they know the problems that come with
the different types of tests that they want.
So if one end-to-end test actually covers a very large
part of the code and is good for them, and they feel that they get
a lot of value from that because they also fix the problems as soon as they see them, then sure,
you can run this one, that's good. But you probably don't want to run 100 of them, or more, if they are taking a lot of time, because you won't be able to react on them anyway. In which case, you probably want to have lower-level tests, more, say, closer to the functions or to smaller pieces of the code. So I like to leave that to the teams. I think it's a much better approach, because they can find the right balance for themselves. What I've seen as a problem when you grow the number of
developers, it's a bit different.
It's more that when you have a team of 10, there's usually no
problem for, like, if you test your pull request, it gets
green, you merge to master, it's going to work
most of the time.
Very rarely, you're going to have conflicts, especially if
you, say, rebase on master.
So typically, if you bring the code from master in your pull
request and you test that, and then you merge to master if
it's green, most of the time that's going to work.
So it's going to be no problem.
But if you go for, say, 100 or 200 developers on the same
repository, what's going to happen is that suddenly you're
going to see statistically
rare events that start to become quite annoying, like what you call logical conflicts.
Say pull requests that are sort of compatible from the code perspective, like they don't conflict,
like there's no conflict in the code.
But if you put their two changes together,
then they break.
When this happens,
that's the time when you want to have some kind of merge queue
or something like that.
But if you have too many pull requests per day
and it takes half an
hour to test every pull request, you simply cannot fit them all in a single day, so a plain, normal merge queue won't fly.
So, you need to have quite advanced mechanisms to prevent issues. So the scaling issue is, again, if you have too many developers, you need to
put in place some ways of reacting really fast when what you call master gets red.
And you basically want to reduce the time it takes to build and test to the bare minimum
so that you have a smaller probability to have merge conflicts between different developers.
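As a toy illustration of what "more advanced than a plain merge queue" can mean, here is a sketch of batching with bisection on failure; `run_ci`, `with_changes`, and `reject` are hypothetical stand-ins for the real CI and repository interfaces.

```python
from collections import deque

def merge_with_batching(master, pull_requests, run_ci, batch_size=8):
    """Toy merge queue: when each CI run takes ~30 minutes and there are too many
    PRs per day to test them one by one, test them in batches. If a batch fails,
    split it in half so one bad PR cannot block everyone else indefinitely."""
    batches = deque(pull_requests[i:i + batch_size]
                    for i in range(0, len(pull_requests), batch_size))
    while batches:
        batch = batches.popleft()
        candidate = master.with_changes(batch)  # hypothetical: master + every PR in the batch
        if run_ci(candidate):
            master = candidate                  # the whole batch lands in a single CI run
        elif len(batch) == 1:
            batch[0].reject()                   # a single failing PR goes back to its author
        else:
            mid = len(batch) // 2               # bisect the failing batch to isolate the culprit(s)
            batches.appendleft(batch[mid:])
            batches.appendleft(batch[:mid])
    return master
```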
That makes sense.
And the contrast to me is interesting, right? Ten developers, everything is fine. But even though you are only going 10x in a sense, like 10 developers to 100, so much breaks in that middle, right? Because yes, if they're all adding to the same code base, they're
roughly mostly working at the same time. Absolutely. I've seen exactly what you've talked about
as well. Like it's just things change so fast. And unless you focus on that developer productivity aspect of it, each incremental developer you hire is just blocked on random things instead of actually adding business value. And this happens in every
company.
So you're talking about randomness. So I cannot resist temptation to mention flakiness here.
You know, so in 99% of the cases, it's a test that failed but should pass. And in some rare cases,
you have the opposite tests that basically pass but should fail. It happens when you have a bug in a test framework, for instance.
So, flakiness is unavoidable.
Like, you will face it if you grow.
It is actually very, very hard to tackle.
But that's where you have methodologies to try to at least keep it at a low enough level that it's okay to work with. So the first thing is: flakiness, where does that come from? What's the problem? So you have different levels it can come from. It can come from the test itself, like maybe it's badly written, possibly.
It can come from the test framework.
So, I mean, say, JUnit, maybe.
They have bugs, God knows, or anything else.
It can come from the product you're testing, right? Maybe there's flakiness in the product itself, some race condition or something.
Then it can come from external systems you depend on.
Say when you run your test, you connect to a database or something.
So maybe database is down, God knows, or maybe they're just flaky.
Then you got all the other things that can happen, say network, anything.
Even the OS can have a bug.
And on that note, I would say when you grow really big, like if you have a few hundred
or a few thousand machines in your CI, what happens is that you start seeing machines
behaving in a strange way from time to time.
Because, for instance, there's a bug in, say, Windows or Linux or whatever
that's going to make the machine break every second year,
except that now you have a thousand machines,
and that's going to happen every day.
Just because of statistics.
So anyway, the point is, you have a lot of sources of flakiness. And think about that for a second: would you like to use a product that's going to fail one out of every 10 times you use it? Just because someone didn't want to take into account that there was some flakiness in the test, when actually the source of the flakiness was the product itself. You don't want that, right? So if you have flaky... I mean, we call them flaky tests, but that's not correct; it's flaky test results, really, right? And in my opinion, that's really, really
important to try your best to identify why that's flaky. And whereas it's really, really hard to do in, say, in absolute,
like for everyone, it's actually not that hard to do most of the time for your own company. So if
you just look at the log files of your test or your test environment or your machines, you will
very likely find things that kind of look like a network problem, or look like, I don't know, a database connection problem, or something like that. So in my experience, if you just parse the log files, or if you parse the test results and so on, you kind of have a good guess as to why, I mean, what the source of the problem is, which means that now you can have automation to, for instance,
hide the problem from the developers and rerun the test if that comes from the OS
or if that comes from a database connection or something,
because that is not about the product.
But you certainly don't want to hide the flakiness issues
if they actually come from the product itself or from the test framework
or from the test itself. That requires fixing.
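A minimal sketch of that kind of automation, with made-up log patterns; the point is that the patterns are specific to your own environment, and only failures that clearly come from infrastructure get retried and hidden, never product or test failures.

```python
import re

# Hypothetical patterns, built up over time from your own log files.
INFRA_PATTERNS = {
    "network":  re.compile(r"connection (reset|refused|timed out)", re.I),
    "database": re.compile(r"(could not connect to|lost connection to).*database", re.I),
    "machine":  re.compile(r"no space left on device|out of memory", re.I),
}

def classify_failure(log_text: str) -> str:
    """Guess the source of a failed run from its logs."""
    for cause, pattern in INFRA_PATTERNS.items():
        if pattern.search(log_text):
            return cause
    return "product_or_test"  # a real failure (or a flaky test) that needs fixing

def should_auto_retry(log_text: str, attempts: int, max_retries: int = 2) -> bool:
    # Retry only causes unrelated to the product, and never retry forever.
    return classify_failure(log_text) != "product_or_test" and attempts < max_retries

if __name__ == "__main__":
    log = "ERROR: could not connect to pricing database after 3 attempts"
    print(classify_failure(log), should_auto_retry(log, attempts=0))
```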
So, and also, when a test is failing, say, one out of 10 times or one out of 100 times, again, you don't want your users to have a failure one out of 100 times.
You really want to understand the root cause, and you also need to consider that like a
failure.
I'm very much against the idea of rerunning the test just to make it green.
I would actually go the opposite way, which is like, hey, run it three times.
If it's red once, try to fix it.
So that's hard to do, right?
Now you need to burn three times as much CPU.
But that's like, you get the idea.
And in my opinion, that's really, really important
to put the fixing of flakiness very, very high
in the priority list.
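A sketch of the "run it several times and treat any red as a failure" idea, on a fixed revision so that every failure is flakiness by definition; the pytest invocation is just an example test command.

```python
import subprocess

def measure_flakiness(test_id: str, runs: int = 20) -> float:
    """Run one test repeatedly against the same, unchanged code.
    Any mix of passes and failures means the test (or something under it) is flaky."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(["python", "-m", "pytest", test_id],
                                capture_output=True)
        if result.returncode != 0:
            failures += 1
    return failures / runs

if __name__ == "__main__":
    rate = measure_flakiness("tests/pricing/test_curves.py::test_bootstrap")
    if 0 < rate < 1:
        print(f"flaky: fails {rate:.0%} of the time on identical code -> fix it")
```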
Also, I'd say,
so there are many ways to first detect this flakiness.
Like you can just look at what test results you have.
And if some tests are read from time to time on master,
for instance, it's very likely
going to be due to flakiness. But you can just have an automation that's going to run the test
a hundred times during the night or so. I use other ways to detect that. And again, you really
want to put that quite high on your priority list to go fix that. It is also very, very hard to understand where the root cause is.
I mean, ask any developer: finding race conditions is really hard, right? We know that.
On that note, I can just mention that if your product is, I mean, it is a really bad idea most of the time to mix asynchronous
code with synchronous code.
Meaning like, if your test framework is synchronous, because for instance, you click on a button,
you wait for something to happen on the screen, and then you do this, and then you do that,
like you're basically trying to do a synchronous thing.
Do A, then do B, then do C. Okay?
But if the code that you're testing against is asynchronous, because when
you click on a button, the button comes back, then you have control again, but then something
happens in the background, then you end up with timeouts and you basically create flakiness for yourself. Don't go there.
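One common way around that mismatch is to poll for the condition instead of sleeping for a fixed time; a minimal sketch, with hypothetical `button` and `label` page objects in the usage comment:

```python
import time

def wait_until(condition, timeout_s: float = 10.0, poll_s: float = 0.1) -> None:
    """Bridge an asynchronous product and a synchronous test without a fixed sleep:
    poll for the condition and fail loudly on timeout, instead of guessing a delay
    that is flaky on slow machines and wasteful on fast ones."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

# Hypothetical usage in a UI-style test:
#   button.click()
#   wait_until(lambda: label.text() == "Saved")
```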
The second advice is to remove as many moving parts as possible.
So if your product depends on the database and the network connection or something, and I don't know,
a file on a network or whatever, avoid that as much as you possibly can, because that's just going to reduce the sources of flakiness that you have no control over. That makes sense to me. And I want to zero in on one thing you said
about the priority list, right? Let's say that you're in a position where you're deciding whether you want engineers to work on a new product feature, or they have to work on this platform flakiness, or just in general improving the state of CI. Let's say you're like a CTO or like a manager of a couple of teams. What's a good way to decide, or to prioritize, you know, that we should take a step back here, and instead of working on this next feature, we should put a little more resources or like a few more people on developer productivity? So like, how do you figure out that trade-off, or like, that balance? So I think the key here is to think in systems.
think of your development team as a system that can deliver things really quick, but also needs to go fast on the long term.
If you keep on piling, say, technical debt or flaky tests on your team, effectively it's going slower and slower and slower over time, and then it's going to be a lot, lot harder to fix later on. So if you only focus on the next 15 days, or one month, or even a quarter, most likely you're gonna always prioritize adding the new feature and not fixing your flakiness or fixing your technical debt. But really, that doesn't help you in the long term. So the balance is: how much value do you need to create right now, because, I don't know, you need to be the first on the market, versus how long do you want your company to survive?
And think also in terms of compound value here.
If you make your company 1% faster, then 1%, and then 1% again over time, that makes a lot.
The same thing happens in the other direction. Like if you make your company 1% slower and slower and slower every time you leave a flaky
test in place, then basically you're killing your company in the long term.
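To put rough numbers on the compounding argument (the weekly cadence here is an arbitrary assumption for illustration, not something from the conversation):

```python
# Back-of-the-envelope compounding of a 1% change in velocity per week.
weeks = 52
print(f"1% faster each week: {1.01 ** weeks:.2f}x after a year")  # ~1.68x
print(f"1% slower each week: {0.99 ** weeks:.2f}x after a year")  # ~0.59x
```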
Then you have to take into account the culture you're creating as well.
Nobody likes to work in a company where the product itself is flaky, it's not nice. Nobody likes to work in a
company where the development environment is so difficult to work with because tests are flaky
and nobody pays attention to them. So it's not only about pure speed, it's also about the culture
and the morale. So as a CTO, I would take all of that into account and find the right balance to have a constant
effort to fix those problems.
On top of that, I'd like to mention that fixing something where it breaks and you just changed
it, like you just changed the code and suddenly something gets flaky, that's kind of easy to reverse or at least understand and look at what code just got
modified.
If you look at it 10 days later, it's a lot harder.
If you look at it a year later, you got no chance.
So basically, tech debt is actually not a very good analogy in some way. Think more like a backpack in which you put stones.
And the more stones you have, the heavier it gets.
And at some point, you get exhausted.
You cannot move anymore.
And on top of that, I'd say that you really don't understand why you have those stones.
People from so many different areas and different industries run into similar problems around
like tech debt and test flakiness.
Because I guess these are all manifestations of the same underlying problem of like, it's easy to ship features
while adding tech debt and eventually that always catches up to you.
It does actually.
Flakiness is unavoidable, but you should probably keep it quite low, and make whatever investment you need to keep it low.
Like think about that again for a second.
It's like if your developers get a red result, like a broken build, basically say every third
time, and it takes 20 minutes to build and test,
well, you waste a lot of time, right?
Like, you're paying salary for that.
Like, you're basically paying people to wait and do it again.
That is just such a waste, yes.
So then let's talk about the next evolution of the company, right? Like when you go from like
a hundred developers to like a 500 or a thousand, and then at a point where the CI infrastructure
itself ends up becoming flaky, because at that point, like from what I've seen so far that
providers like CircleCI and everything that abstract this stuff out for you, they're not
really good enough. And you have to start managing your own infrastructure in order to run tests.
And we can see that in every single large tech company today, they're all running their
own CI systems. So what happens then? Can we talk about that? So you do get some flakiness due to OS bugs and transient network issues and changes in DNS and anything.
This is where you really want to have good metrics.
You really want to know what the percentage of problems you actually face.
You want to know if something is flaky, what is the percentage that is due to OSs,
what is the percentage that is due to the product itself and so on.
As I mentioned, most of the time, it's more or less impossible to do in general.
Like generically for all companies, you cannot do that.
But if you look in your own system, you can probably guess quite well that this type of
trace or stack trace is due to, say, network issues.
And I mean, the metrics you gather are really important because then you
can see, like, say you have a bug in Linux every second year or so. You know that's going
to be a machine that fails every day or every second day or whatever. You also know, and
that's actually tricky, that the machines that tend to misbehave, they tend to break, like to run the test and break them,
but they also tend to break fast,
which means that if you don't detect them really quick,
they're gonna basically,
the next build is gonna come and go to that machine,
and again and again and again.
So basically, like let's say you have a dozen things to do, and in no time, one of the machines that happens to be broken
is going to fail all of them.
Mm-hmm. That's annoying.
Of course. So what you want to do here
is that you want to detect those things on the fly.
Right? And if you detect that one machine, for instance,
has failed three builds in a row,
let's say, then you really want to go quick and
say, okay, is this machine actually broken? Like what kind
of stack trace do I see here? Like what kind of logs do we
have? And then you immediately want to like, remove it from
the pool. Automatically, of course, right? Like, you cannot do that manually, it's just way too big. Maybe you want to rerun the builds that were red automatically, again without even showing that to the developers, so that they don't get affected by one machine that happens to break.
So that's something you see pretty often, actually.
It's a misbehaving machine in a pool basically killing the entire experience for everyone.
So that's one thing that you see, yes, with that.
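A toy sketch of such a watchdog, with an arbitrary threshold of three consecutive failures; the real policy, and how the re-runs are triggered, would obviously be specific to your CI system.

```python
from collections import defaultdict

CONSECUTIVE_FAILURES_BEFORE_QUARANTINE = 3  # threshold is a judgment call

class AgentPoolWatchdog:
    """A broken machine fails builds quickly, so it keeps grabbing new work and can
    fail everything in the queue. Track consecutive failures per machine and pull
    it out of the pool automatically."""

    def __init__(self):
        self.consecutive_failures = defaultdict(int)
        self.quarantined = set()

    def report_result(self, machine: str, passed: bool) -> bool:
        """Returns True if the machine was just quarantined (so its recent red
        builds can be re-run elsewhere without bothering the developers)."""
        if passed:
            self.consecutive_failures[machine] = 0
            return False
        self.consecutive_failures[machine] += 1
        if (machine not in self.quarantined and
                self.consecutive_failures[machine] >= CONSECUTIVE_FAILURES_BEFORE_QUARANTINE):
            self.quarantined.add(machine)  # stop scheduling builds on it
            return True
        return False
```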
Maybe the second thing, yeah, let me insist on another thing which I think is really important, is that you know your metrics. You know what flakiness level you have. But what you don't know is how much developers perceive it. It's very common that you have, say, a 2% flakiness in your system: 2% of your builds are going to fail for no reason. But in fact, people believe it's 20%.
And that's also very common, that the perception is not the same, and the perception is what drives behavior. So an example is, if a developer gets a red build and says, okay, did I do something wrong? Before they go look at the test result,
before they go look in the log files,
before they go try to troubleshoot,
which is costly time-wise,
maybe the first thing they're going to do is re-trigger,
say, hey, run it again.
And they can do that if they don't trust
that the results are correct.
So the question becomes, how
much do they actually perceive? How much do they believe the flakiness is? And that's
where you really need to be good at communicating those things. So you need to be able to say,
hey, here's the actual flakiness level. And maybe if you do face a flaky test or whatever, you want to tell the developers,
okay, so this test failed. By the way, we know it's been very, very stable. Just saying.
Or quite the opposite, like this test failed, but you know, you got like, it was quite unstable
for the last two weeks. So like, maybe you want to troubleshoot it anyway, but not for the same reasons.
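A small sketch of that kind of messaging, assuming you already store recent per-test results from master; the wording and thresholds are made up.

```python
def annotate_failure(test_name: str, recent_results: list) -> str:
    """Attach the test's recent history on master to a red result, so the
    developer's perception of flakiness matches what was actually measured."""
    if not recent_results:
        return f"{test_name} failed (no history for this test yet)."
    pass_rate = sum(recent_results) / len(recent_results)
    if pass_rate >= 0.99:
        return (f"{test_name} failed. It passed {pass_rate:.1%} of the last "
                f"{len(recent_results)} runs on master, so this is very likely "
                "caused by your change.")
    return (f"{test_name} failed, but it only passed {pass_rate:.1%} of the last "
            f"{len(recent_results)} runs on master; it is known to be unstable.")

# Hypothetical usage: annotate_failure("test_pricing_matrix", last_200_results_from_master)
```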
That makes sense to me.
Even understanding the state of things: you can get metrics from the CI system, and you can get these really crisp metrics on how flaky every test is, but you should also be gathering the perception of developers, perhaps through like surveys and stuff, and that will get you both what the system actually is and also how the system is perceived. Because you can imagine a case where like two percent of tests are flaky, but every single time a developer tries submitting some code, that same test fails, and they think the system is like 100% flaky. So doing that qualitative, like asking people through surveys what they think the problem is, along with the quantitative, can help you get a bigger sense of what the velocity blockers are in the organization.
And maybe you can also track like re-triggers, right?
Like if people are re-triggering builds all the time,
that's probably not a good sign.
Exactly.
So I feel like if you are in charge one day
of some kind of CI environment,
you really need to love your developers, really, right? You need to pay attention to what they feel.
You need to put yourself in their shoes
and try to look at their experience.
Try to really understand how they feel about it.
Again, when they get a red, it's bad.
It's good because you know what's going wrong, or that something went wrong,
but you're like, oh God, I need to fix it now.
You really need to pay attention to how they feel, to the position they have, and also
you need to focus on their experience. If you give them a red, you really want to tell them, if you can, where that comes from,
precisely.
Okay, so you've made that change.
It broke that piece of code on this OS in this environment.
Here is exactly the line where it broke. And by the way, looking at the stack trace,
we can relate to these other type of tests that failed as well.
Yeah, by the way, this test failed, but all these other ones failed as well.
So what do you think? Maybe that's related.
And you need to give them context, but not too much. You need to give them the right context. There's nothing worse than just looking at, I don't know, hundreds of thousands of lines of log files and being told, hey, the problem is somewhere in there. So just think for yourself:
when you get someone who comes and say, hey, you got a problem. Okay, what problem?
If someone comes and tries to fix something in your house and they just say, hey, sorry to tell you, but you're going to have to fix something, you really want to know what it is, right? And you know why. So being in charge of such an environment is as much a technical and scalability problem as it is a human one.
Yeah, you need to be like customer obsessed in the words of like Jeff Bezos.
And now let's flip it over a little bit.
So the reason why companies are trying to hire developers is so that they can ship, like, features and everything for users quickly, right? And I don't know if you've run into a situation where a company will hire a certain number of developers, and for whatever reason, like the mythical man-month or just the state of the tooling and stuff, each incremental developer you hire is not gonna suddenly double the number of story points that are being shipped by the engineering team every month or every quarter. And have you seen, like, a situation of frustration of, you know, like a leader versus the engineering team: why are we not shipping stuff faster, our competitors are shipping so much stuff? And like, how do you think about that, and how do you resolve some of that? Like, are we tracking the right metrics, are we tracking, you know, shipping features correctly? Like, what are your thoughts on this in general? There's quite a lot to unpack here. Maybe, again, we should think in systems here.
So the first aspect is how much teams depend on each other.
So you can make a graph of dependencies between different teams and maybe realize that you have bottlenecks.
It can be a bottleneck in terms of quality, it can be a bottleneck in terms of workload.
The first thing you want to focus on as a CTO is to fix those bottlenecks, and that will probably help the entire company go faster, in fact.
The second thing you want to look at is when a team needs to put a new feature in place, can they operate on their own?
Do they have autonomy?
Or do they somehow need other teams
to do something for them?
And same thing here.
If you cannot operate on your own,
if you're not independent in some way or autonomous,
then you enter a cycle of negotiations and so on.
It's not that it's bad to negotiate, but it just takes time.
So it's kind of separated from the CI aspect of that, except that you can probably
measure the level of quality of different teams by looking at the CI result that they
have. If you look at it from a CI perspective, which is only one of the angles, one of the good metrics to look at is how long people take to fix a problem when a problem happens. So I call it, like, MTTR for CI. So basically, if you tell a developer
you got a problem to this code, like here's a broken test
or here's a broken build for you,
how long on average is that going to take
to fix this one and get to green?
This metric is very useful to understand
the velocity of your development team.
I usually look at the distribution thereof. You usually have a lot of developers that
fix their problem in somewhere between minutes and one hour, and then you've got a long tail. With this metric, you're going to see a change over time. When the code becomes more complex, the metric grows; it basically takes more time for people to fix the code. In my opinion, that's an excellent
proxy for
how agile and how efficient your organization
is and how your developers can iterate fast on your code.
That requires, basically, that you need to collect a lot of data from your CI environment, of course, and then you need to be able to process that metric. But that's extremely efficient for detecting complexity.
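A minimal sketch of computing that metric from CI events, assuming each event is a (change_id, timestamp, passed) tuple exported from the CI system; the data here is made up.

```python
from datetime import datetime, timedelta
from statistics import median

def time_to_green(events):
    """'MTTR for CI': for each change, the time from its first red result
    until that same change is green again."""
    first_red, durations = {}, []
    for change_id, ts, passed in sorted(events, key=lambda e: e[1]):
        if not passed:
            first_red.setdefault(change_id, ts)
        elif change_id in first_red:
            durations.append(ts - first_red.pop(change_id))
    return durations

if __name__ == "__main__":
    t0 = datetime(2021, 6, 1, 9, 0)
    events = [
        ("pr-101", t0, False),
        ("pr-101", t0 + timedelta(minutes=25), True),
        ("pr-202", t0 + timedelta(hours=1), False),
        ("pr-202", t0 + timedelta(hours=7), True),
    ]
    durations = time_to_green(events)
    # Look at the whole distribution, not just the average:
    # the long tail is where the complexity problems hide.
    print("median time to green:", median(durations))
```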
And if you see, for instance, that the MTTR of two teams are both becoming longer and longer, you probably have, you may have, let's say, a correlation or
even a causation, potentially, between those two teams.
Maybe one of the team is dependent on the other one, and the code that gets created
is more complex and more complicated to troubleshoot, and then you can basically go there and try
to understand what's going wrong and, say, try to isolate or decorrelate the teams.
Yeah, I really like this MTTR-for-CI framing, because it can be expanded in so many different ways. Like, if you just expand it beyond the CI scope and you start tracking how long it takes for a bug to be resolved once it's created in your task tracker, and you can maybe break it down by priority and stuff, you can really apply the same thinking, right? If it takes too long from when a bug is created, like it takes a month for it to resolve versus three months, that probably means, okay, maybe your engineering team is not fast enough. But it could also mean maybe the product team is not prioritizing it the right way. Maybe the specs that they're creating are unclear. Maybe there's not good enough design chops to get these things sorted. There's so many ways this thing could get unpacked. So I really like this framing of this MTTR: how long does something that's a problem take to get resolved on average? And then you can do a bunch of things on top of that.
I could elaborate a little bit on that
and say, for instance, that it applies for
what happens in the context of pull requests.
So that's basically the personal MTTR, if you know what I mean.
But you can also look at it on master.
So when something breaks on master,
how long does it take for the team to react and fix it?
If it takes you days to fix something on master,
you've got a problem.
You basically want to shrink that as much as you possibly can, if possible to two minutes, but most likely to two hours,
because even, like, the end-to-end tests, if they start to break and nobody cares about them, then why exactly are you running them? Apart from burning CPU, it means nothing. And I actually have a true story about that. It was in fintech, where they had a lot of end-to-end tests, because
what happened is that they had a lot of testers before that were doing things manually.
And then the way they thought about it was just like that automated test
was all about testing something that someone did
and turning that into an automated test,
which is, I mean, you need to start with something.
So that's probably not stupid to start this way.
But really what happens is that
if you don't react when they get red,
like then why, right?
So what I wanted to do at the time was just to delete them
because, well, they were creating no value
and they were just burning CPU for nothing.
But it was a problem because it was actually reducing
the ratio of automated tests versus manual tests,
which is like, well, if they break,
then why do you run them, right? If you don't
pay attention, rather. The other story about that was the fact that as soon as you realize that automated tests are not just manual tests that you got automated, it's once you can do things like create the unit tests on the fly, and when you can generate a bunch of them to cover all sorts of different use cases.
Then you can have millions of them, actually, to cover a large number of possibilities.
I once got asked, okay, so how many tests do you run?
I was like, I didn't know how to answer that because we were comparing matrices of prices, and there were
millions of them. How am I supposed to answer that question? When I compare two matrices, is that one
test or is that one test per number I was comparing? It was somewhere between one and a few millions.
So that's sort of like one of the first questions you asked me: you have a change in mindset when you go from purely manual tests, which are useful by the way, to fully automated or at least a lot of automated tests. You have to change your perspective. You have to stop thinking in terms of purely manual tests.
I wouldn't like people to think that manual testing is bad. It's not. It's very useful, actually. But typically not to test a small piece of code.
It's usually to test the entire experience. And that's where you want to have people.
Okay. And maybe to wrap things up a little bit: once you've gone from, you know, your 100 developers to, like, your uber scale, extremely large companies, right, like a thousand developers and all of that, at that time you really have to start innovating to make sure your developers stay productive, right? Because builds are only going to take longer, more and more developers are adding to the same thing. So maybe you can talk a little bit about, you know, what's the state of the CI world? Like, what are people thinking about right now? And how do you deal with that scale once you're that big? And what's the framework you use to think about solving problems when you're that big?
A good framework maybe is to look in terms of layers. So you
got the hardware layers and the OS.
Typically, this is something you can use from the cloud.
Then you want to build on top of that.
You want to store your artifacts.
You want to have test frameworks, databases, and so on.
That is also probably something you can use from the cloud.
Then there is the orchestration.
What exactly do you run first? What do you run last? How do you prioritize the builds? And so on. As far as I know today,
you need to own this part. And then maybe most importantly, the feedback you give to the developers when something fails.
That is one of the key aspects of CI is that it's not about running things.
It's about shortening the feedback loop.
And that part is crucial.
And as far as I know, there is still quite some work to do in order to have a good solution for when you deal with hundreds of thousands of tests.
And that part still needs to be owned by the big companies.
But there is a lot of maturity in that field. It used to be that you couldn't run many things, really, on the CI providers in the cloud. But today, you can do quite a lot.
I believe that in a few years from now,
all the research that is happening in that field,
and it's quite a lot,
will be integrated by the CI providers in the cloud.
So by research, I mean,
how do you prioritize tests
so that you run the ones that are most likely going to fail?
How do you deal with flakiness?
How do you detect that?
How do you warn the developers that their pull request is most likely going to create
a regression that maybe they should run this long running test
before they merge, et cetera, et cetera.
All of that, as far as I know, is not currently integrated super well in CI cloud providers,
but maybe I'm wrong.
And I certainly hope to see that happening in a few years to come.
Okay. So one thing that, you know,
I've thought about this a bunch
and what comes out to me as interesting
is that there's two kinds of build tools
that people use, right?
There is like the standard, a bash script,
or you run like NPM install.
Very simple.
And that's how you get started.
And then you notice that a lot of companies migrated to these, like Bazel and Buck and everything, because it gives you the power of, like, you know, using this dependency graph, or being able to, like, classify everything as a target. But at what point does it make sense to switch over? Does it make sense to switch over from, like, a simple build tool to something like Bazel? When are you going to find ROI in these things? Because it is always, like, a hard migration, right? So how do you think about that? So I'd say there is something very important to understand
here is that things like three, four years ago and things today, the market is just different. Some kind of maturation just happened. Like, I don't know, Bazel hit version one, I don't know, like two years ago or something like that, approximately. Before that, when there was a change in Bazel, you had to change a lot of things in your code, or at least in your Bazel description files.
That was a pain. But I'd say if you can start with Bazel today, it's probably a good idea,
at least I think so, or any tool that gives you, say, a dependency graph. I think that's going to
pay off. Dependency graphs are extremely useful if you want to analyze your code
for all sorts of reasons,
CI being only one of them.
But for instance, it's very helpful to understand
if you have many teams,
like which teams impact which other team.
Right?
Like when they change something.
Anyway, the point is, today it's possible to start
with those dependency-based tools, whereas I don't think it was so easy to do three years
ago. So if I were to start again, I would probably start with that. But yes, it's a
very painful process to move from whatever
build system you have into Bazel. Or it used to be at least.
I think I certainly agree with the stabilizing part. Initially, Bazel upgrades would just
be extremely disruptive. And now things are mostly the same. There's small bug fixes.
There's a few incompatible flags that you would have to add and remove.
Yeah, I still don't think it's at a point where, you know, you would use Bazel for like
web development or something.
I think it would be amazing if, you know, Bazel was like the build tool for web development,
but I don't think it's just there.
And I don't know, because the philosophy of these two communities is just so vastly different. But it would be kind of amazing if there was
a world where the first tool that a web developer thought about was, let's use Bazel for our
front end and back end, because I personally feel like the build tooling ecosystem is just such a mess in JavaScript. But I'll stop talking about my opinions on JavaScript now. Yeah, anyways, well, thank you for being a guest on the show. I think I've had a lot of fun. You're welcome. Yeah, and this was fun. I hope I can catch you again for a later round too. Maybe we can talk about CI and the ecosystem in the future and be more hopeful about it.