PurePerformance - Developer Productivity Engineering: It's more than buying faster hardware with Trisha Gee
Episode Date: September 11, 2023
Do you measure build times? On your shared CI as well as local builds on the developers' workstations? Do you measure how much time devs spend debugging code or trying to understand why tests or builds are all of a sudden failing? Are you treating your pre-production environments with the same respect as your production environments?
Tune in and hear from Trisha Gee, Developer Champion at Gradle, who has helped development teams reduce wait times, become more productive with their tools (gotta love that IDE of yours) and also understand the impact of their choices on other teams (when log lines wake up people at night). Trisha explains in detail what there is to know about DPE (Developer Productivity Engineering), how it fits into Platform Engineering, why adding more hardware is not always the best solution, and why flaky tests are a passionate topic for Trisha.
Here are the links to Trisha's social media, her books and everything else we discussed during the podcast:
LinkedIn: https://www.linkedin.com/in/trishagee/
Trisha's Website: https://trishagee.com/
Trisha's Talk on DPE: https://trishagee.com/presentations/developer-productivity-engineering-whats-in-it-for-me/
Trisha's Books: https://trishagee.com/2023/07/31/summer-reading-2023/
Dave Farley on Continuous Delivery: https://www.youtube.com/channel/UCCfqyGl3nq_V0bo64CjZh8g
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everybody to another episode of Pure Performance.
As you can probably tell, this is not the voice of Brian Wilson, who typically does the intro.
It's the voice of Andy Grabner.
I hope Brian is still fast asleep because today we are recording a little early at an unusual time.
But, as usual, we have interesting guests with us.
And today I have Trisha Gee with me. Trisha, welcome to the show. Thanks for being here.
Thanks for having me.
Hey, Trisha, we met each other in early July in Barcelona at DevBCN. And I went to your talk and I was like, I want to say thank you, because you gave me a lot of inspiration when you talked about the myth of the 10x engineer versus the reality that engineering is actually there to enable the 10x organization by taking away all of the obstacles that otherwise people
have, whether they're developers or whoever else is using the platform.
I would like to talk a little bit more about this with you today from your perspective.
Before we get started, though, for those people that don't know you, could you give a brief
introduction, who you are, what motivates you, what you do in a day-to-day business?
Yeah, sure.
So I'm Trisha Gee.
I am a Java Champion and developer advocate.
A lot of people might know me from doing developer advocacy for JetBrains, doing a lot of stuff
for IntelliJ IDEA and IDEs.
I recently moved to Gradle, where I'm doing developer advocacy for developer productivity engineering.
Because at Gradle, we have a tool called Gradle Enterprise, which aims to get rid of a lot of the stuff that you were talking about.
A lot of things that get in the way of developers actually creating working products.
So the thing that kind of ties together all my experience is this passion for developer productivity.
That's kind of why I spent seven years telling people how to use their IDE,
why I spent a lot of time helping people get up to speed on the latest versions of Java,
because a lot of that is aimed at helping developers become more productive,
which is why I'm at Gradle now, because that allows me to pull back a bit higher level than just writing code
and more sort of, like you sort of mentioned,
a bit more organizational,
a bit more in terms of like the tool chain that we use
and how we actually get our code out somewhere
where people can use it.
Do you also see with all of your years of experience
that also the role of a developer
and kind of like where they live has also changed?
Because you mentioned that you focus so much on the IDE,
which is great, right?
We had to make developers more productive in their main tool
that they spend most of the time in. But have you seen also with
the evolution of software engineering in the last couple of years, whether you talk
about making sure that developers are also thinking about configuration
as code, where the stuff gets deployed. I mean, testing, obviously
we shouldn't even need to talk about that.
That should be a given anyway.
But observability, right, that's a big space where I'm in.
Do you also feel, because we've been pushing so many things
on top of, quote-unquote, the developer,
that we also need to think and rethink about
where we can make them more efficient and not just in the IDE?
Yeah, 100%.
I started coding professionally, well, actually started coding before I even graduated, so in like '97. So we're talking like 25 years of working in the industry. When I started working at Ford, they had the idea of lean, because lean manufacturing was a thing at Ford, obviously. But we're talking about software delivery lifecycle,
big release processes, big documentation.
At that time, a developer was like this classic idea of a developer.
If you give them a piece of paper, which has the technical specification of the thing they have to do,
they write the code, they should test it.
But even then, we're talking about manual testing,
a lot of that sort of thing, committing into source control. And then you had, when I was at Ford, we had one person
whose job it was to deliver that functionality. He was a release manager. And what he did was
package up the code and release it. And these release processes were complicated and difficult.
And the testing was manual and complicated and difficult.
And so over the course of my career,
we saw Agile come in,
a lot of stuff coming in through XP
in terms of more automated testing,
things like JUnit made things a lot easier
for us to write our automated tests
and give us the confidence
that we'd written the right thing,
allow us to refactor,
allow us to write better code and be more productive. DevOps sort of coming in, I worked
for Dave Farley when he was writing the continuous delivery book. So I actually worked at the
organization where he was implementing continuous delivery, like ahead of anyone even knowing what
that term meant. So I came from, I'd been working at a bank where again, we had a
difficult three hour release process. At this time, the developers were in charge of doing the release
process. It wasn't a release manager, but it was three hours of following scripts and debugging
the problems as you went, always out of hours after 6pm. Painful, painful processes. And so
continuous delivery, the reason I went to go and work with Dave is that
when he told me in the interview that continuous delivery was a thing, I was like, oh, this is
what we need. We need more automation. We need these pipelines. We need to have confidence that
the tests tell us that we've done what we wanted to do and then know that at the end of this
pipeline, this thing can be deployed into production. But as a result, and this is only
coming up to what happened up to about 10 years ago. And since then, automation's got better, and the DevOps movement has taken off a lot more. So again, pushing more of
the ops stuff onto the developer. Security is becoming more and more important because obviously
we deploy into the cloud. We're not necessarily doing stuff on-prem. So over the course of these
25 years, we've moved
from being, in my experience, we've moved from being a developer who writes code even in a text
editor, you know, and is just told what to do and then throws the code somewhere else. It's someone
else's responsibility. And now we have to think about not only writing correct code and testing
our code and being able to build it automatically,
pull in our dependencies automatically, which is another thing we didn't used to do.
But we also have to think about the operational side of stuff, the security side of stuff,
the deployment side of stuff, monitoring, observability, all of those kinds of things.
So in some ways, it's kind of a weird contradiction where in order for us to
become more effective at our jobs, in order for us to become more productive as people who produce
code, we actually have to do a lot less of the code production and a lot more of everything else,
a lot of worrying about lots of other things. One of the things I think that has enabled that
is automation and tooling to allow us to do
this kind of thing. And I don't think it's a bad thing for us to have more responsibilities because
back when we just wrote code, we didn't necessarily think about, is this the right thing to do? Is this
really what the user wants? What will the impact be on production if I make these changes? And
being more responsible for a wider range of things allows us
to write better code because we think more about the problems. But we definitely, definitely have
more responsibilities than ever before. I think you bring up, you know, as you were talking,
it felt like a pendulum a little bit, right? We started from, as you said, just focusing on code,
throw it over the wall. And now we have to do all these things.
And if I hear you correctly, obviously, it makes a lot of sense
that developers are familiar with everything that they have to do.
But on the other side, we also need to swing the pendulum back a little bit again,
because otherwise, we can only spend a certain small percentage of time
actually creating code and having to deal with so many other things.
Yes, automation has made it easier for everyone to think about automating security
and automating delivery and automating observability.
But still, it feels for me, we are pushing so many things on top of developers.
Right now, I need to not only know my IDE, I need to know 10 other tools.
And while these tools might be easier, maybe I can codify them,
but still,
I need to know a lot of things. And this is where, from my perspective, what I love about
the whole movement of platform engineering, kind of trying to figure out what are these 10 tools
doing and how can I make this even easier for the people that use my platform?
And so this is why I said the pendulum. It's like from just coding to doing everything.
And now I think we need to find somewhere the middle ground
to make sure that the developers can really be productive
and not having to waste a lot of time with all these different tools
to do all these things that they need to do.
Do I get this right?
Yes, 100%.
I think that some of the organizations I've worked in,
so I've worked in very big organizations like Ford
and like the banks and smaller ones like startups
where sometimes there's only three of us on the team.
Regardless of the size of the organization,
I have found that in the past,
there's usually like one person on the team
who really cares about tooling and productivity
and productivity for the team, right?
And it's that person who does things like
decides to use a different CI server
or decides that maybe a different tool set would be better
or does some, we had someone who wrote an automation script
to farm out tests to different agents
so that we could parallelize stuff.
But that doesn't scale.
You can't rely on one person in the team or people using their 20% time if they have 20% time.
You can't rely on those people, but their effect is enormous. It helps the productivity of all of the
developers, right? If you can speed up your CI build, or if you can speed up the test runs, or if you can automate something
which someone was doing manually,
that has a huge impact on all of the developers,
which is the 10x organization thing you're talking about.
It's not just one person working more effectively.
It's one person helping the whole team
to work more effectively.
But traditionally, developers are rewarded or at least measured on, you know, features delivered or bugs fixed or even, heaven forbid, lines of code.
So generally speaking, organizations are not going to reward one person for taking time away from that coding thing just to enable everyone else's productivity.
So platform engineering is a great movement
because the whole point is that you have teams who are dedicated to that kind of thing. It is
recognized, it is rewarded. Their job is to look at the bottlenecks and to try and help that across
everybody and not just fix the problem for the one person who's having that problem.
Platform engineering is the way to make the 10x organization.
Yeah. And I really like that you brought up the whole metric,
because obviously we're all metric-driven.
We want to figure out how productive are we by, as you said,
how many features do we push out.
From an efficiency perspective, though, I think when I look at the State of DevOps report
that just came out focusing on platform engineering,
there are ways in which they actually
measure developer efficiency, the gains in developer efficiency.
I think that's great.
We can measure the impact that the investment in platform engineering or in developer productivity
engineering, as you actually call it, has.
How many more builds in the end can I still do because I don't have to wait
for all these things, so I don't need to spend so much time in troubleshooting why builds fail,
or I can faster react to problems in production because the platform just provides me with better
visibility and better observability, yeah? Yeah, I think the measurability of some of
these things is just a really appealing thing. If you cut your build time down by even 5%, when you add up the dollar amount it costs a developer or an organization: if a developer sat there waiting for five minutes for their build, which isn't a very long build time in this world, you can add up for every developer in your organization those five minutes.
How much does it cost you?
So reducing things like build times,
reducing debug times, reducing troubleshooting times,
running tests is one of those things that often,
if you've got a good comprehensive test suite,
which is a good thing to have,
it takes longer to run it.
So like optimizing those things so we get that fast feedback. You know, you can put a dollar amount
on that sort of thing. And on top of that, there's the context switching thing. I mean,
developers have been so great at being like, right, okay, this, the CI build is going to take
40 minutes. So I'm going to work on a different thing or go for lunch or whatever. And we get good at, we think we're good at context switching.
But the fact is, if that build fails or if there's a test, which may be a flaky test or might be a real failure, we're not really sure.
By the time it fails and we come back to it, we're like, oh, I don't remember what I was doing now because I put my brain in a different place.
So the context switching has a cost as well.
And we can measure all these things.
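To make that back-of-the-envelope math concrete, here is a rough Java sketch. Every number in it (team size, builds per day, wait time, loaded cost) is an invented assumption for illustration only, and the context-switching cost Trisha mentions would come on top of whatever this produces.

```java
public class BuildWaitCost {
    public static void main(String[] args) {
        // All numbers below are illustrative assumptions, not real data.
        int developers = 100;              // developers in the organization
        int buildsPerDevPerDay = 8;        // builds each developer waits on per day
        double waitMinutesPerBuild = 5.0;  // "five minutes ... isn't a very long build time"
        double loadedCostPerHour = 75.0;   // assumed fully loaded cost of a developer hour
        int workingDaysPerYear = 220;

        double waitHoursPerYear =
                developers * buildsPerDevPerDay * waitMinutesPerBuild * workingDaysPerYear / 60.0;
        double costPerYear = waitHoursPerYear * loadedCostPerHour;

        System.out.printf("Hours spent waiting per year: %.0f%n", waitHoursPerYear);
        System.out.printf("Approximate cost per year:    $%.0f%n", costPerYear);
    }
}
```

With these made-up inputs the waiting alone works out to roughly 14,700 hours, or about a million dollars a year, before counting the cost of broken concentration.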
A lot of it just comes down to feedback time. I like the, I mean, you mentioned costs, right?
I think there's an additional aspect to costs, the real cost on the infrastructure. Because if
you think about it, if you can see at least more and more organizations are really spinning up
infrastructure on demand for different cycles of the build or different cycles of the software development lifecycle.
If you can cut down build times, not only do you save time and costs on the developer,
the context switching, but also on the amount of money you need to spend
to actually run this infrastructure coming back to sustainability.
So I think that's a really compelling thing,
especially when you're talking about things like sustainability and the greenhouse effect.
And the amount of energy that computers consume is not really great at the end of the day.
And if you can reduce those build times or test times,
you don't necessarily cut CI usage, for example,
but you will get more efficient usage of your CI environment.
For example, if you cut build time in two,
you do twice as many builds.
So it can go either way.
You can either cut down the amount of resources you need
or get twice as much stuff pushed through your pipeline.
Yeah, I think that's very good.
So your argument is don't just measure the success
by, let's say, reducing the amount of CPU cycles you need for the builds
because overall you may just run many more builds,
which means you're actually getting better software out of the door, and faster.
Right.
So you keep repeating build times, cutting build times.
That's great.
Debug times, all these context switches.
What else do you see out there?
If people are listening in to us now and say, hey, okay, this is a really interesting conversation.
What other things should I have on my radar?
What do I need to measure? And what can I improve? And where am I maybe not looking at
all, because I never thought that this could negatively impact developer productivity,
and therefore maybe it even scares developers away, or maybe they're looking for another job because
they're frustrated. Yes, I mean, that does happen. I've seen a lot of frustrated developers who are
sick of waiting around for stuff, waiting for, I left one job because it took six weeks to get my IDE environment set up.
And I'm like, this is unacceptable. Like, what? I can't sit here; you can't pay me this great salary
for me to be sat here doing literally nothing because you can't be bothered to set up my IDE
for me. So the build times thing, I just want to finish up on that before I move on to the other
ones. One of the things about build times is we often think about CI build times because that's the thing that we can see. We've got visibility
of how long it takes CI to run stuff. And we have cost associated with that as well with our cloud
stuff. What we're not often doing is measuring local build times. And so that's one place where
people can start, especially if you've got a remote and distributed team. We might not have
visibility over the fact that this one person's build takes 15 minutes and for everyone else it takes five minutes.
What can we do to fix that one problem? So measuring local build times often falls through
the gaps. It's one of those things we don't really think about because we just kind of run the build
and go and get a coffee and don't consider that it's something that can be changed. But I really
want to move on to my pet peeve, my pet project.
One of the reasons I joined Gradle
is because the Gradle Enterprise tool
has a flaky test detector.
And flaky tests are, they just drive me potty.
They're one of these things which are,
they're a time drain,
but they're one of these things you're talking about,
these sources of frustration,
this kind of like energy drain
for developers,
because you kind of go,
if you have even a small amount
of flaky tests in your test suite,
the problem that you have is
if tests fail
and you're not sure why they failed,
you're not sure if you broke them or not, then you go one of two ways.
One, you spend a whole bunch of time investigating a failure that was not your problem and you should not have been looking at.
Or two, you ignore it.
But then when you start to ignore it, that leads to the potential negative side effect of ignoring any test failures, in which case, what's the point in having any automated testing if every time they go red, a developer just goes, well, it's
probably flaky. Let's just run it a few more times or let's wait for a few more builds to see if it
really failed. And so you have a lot of noise in your results. And one of the things I really,
I learned when I was working with Dave Farley is that having an extensive test suite with
information radiators on the wall, like telling you how many tests have run and how many have
passed, gives you this lovely, warm, fluffy feeling of the things I did haven't broken anything.
But if those are going red, like occasionally for no reason whatsoever,
that whole confidence goes away. You can't do refactorings because things fall over
and you're not sure if it's you. And these flaky tests could be down to any number of different
reasons. It could be you. You could have written a bad test. It could be your production code,
which has got race conditions in it. It could be your CI environment is a bit unstable. It could be
they work under some circumstances, like it works on, you know, JDK 17, but not JDK 19.
You know, and there's a lot of different factors which could contribute to a test failing sometimes and passing other times.
And I just hate flaky tests because I think like it should pass or it should fail for the right reason.
And I don't want to spend a day trying to figure out, oh, it wasn't my fault.
I didn't break it.
So I took a couple of notes and I have a couple of comments
to what you just said.
So you mentioned if there's obviously flaky tests,
it's a source of frustration.
Eventually, it's like if you draw the parallels
to production alerting, people that actually have
to react to incidents.
Eventually, people are reporting about alert fatigue
because it's too many alerts.
What do I do with a thousand ringing bells?
And in the end, I don't really know if this is a real problem or not.
And so people sometimes then start ignoring them,
like what you just said.
You just start ignoring because you don't know anyway
if it's a real problem or not a problem.
So it's an interesting parallel to production incident management.
It is. Yeah, it is for sure.
I mean, failing tests or alerting in production have the same,
they mean the same thing.
Something's not right.
Something is poor quality.
And I agree about the alerting thing.
One of the most useful things that happened to me sort of mid-career is this: it was when there was no DevOps
and there was ops who would monitor production.
And they would complain at the development team like,
you know, you've got these alerts going off right, left, and center
and we don't know what they mean.
And we're like, we don't really know what this is.
And we weren't aware that every time we did a log.warn, that was something that would get sent
to someone's pager. We didn't know that. We just kind of were like, oh, it's warn, debug, info,
like warning, I'll put it on warning because then it'll show up in production. We didn't know
that's something that's going to wake someone up at three o'clock in the morning. So, you know,
it's very easy to get sort of cavalier about, well, you know, I'll fail it if it's a bit, if it seems a bit wrong or I'll log it to some high level just in case.
If you don't really understand what the consequences are of the thing that you're doing, you can't make the right decisions.
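To make that point concrete, here is a minimal Java/SLF4J sketch. The class, the method names, and the assumption that the operations team pages on WARN and above are all invented for illustration; the only point is that the level you pick is an operational decision, not a formatting one.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentRetryHandler {
    private static final Logger log = LoggerFactory.getLogger(PaymentRetryHandler.class);

    void onRetry(int attempt, Exception cause) {
        // INFO: visible in the logs, but (in this assumed setup) nobody gets paged.
        log.info("Payment call failed, retrying (attempt {})", attempt, cause);
    }

    void onRetriesExhausted(Exception cause) {
        // ERROR: in this assumed setup this feeds the alerting system,
        // i.e. this line can wake someone up at three o'clock in the morning.
        log.error("Payment call failed after all retries, manual action needed", cause);
    }
}
```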
Oh, that's interesting.
I need to write this down.
Do you know the consequences of the next log line you're writing?
Might.
I mean, for me, it's again, you know, I've spent a lot of time in the last years helping organizations in production, especially trying to get better in detecting root cause and getting better observability and also building better scalable and resilient systems. So for me, it sounds very similar, right?
Because in the end, you're saying, if something fails, I want to know what the root
cause is.
So observability, you just mentioned log files, they help obviously, and nowadays everybody's
talking about distributed tracing with OpenTelemetry and other frameworks we have.
But then also, you mentioned something interesting, you said, you don't know, maybe the underlying system had an issue.
Maybe somebody did an upgrade to my CI system.
Maybe some other background job ran,
and therefore it had an impact on my tests.
So in the end, we really need to invest in stable and resilient systems
that actually then build and execute our tests.
Because if this is not a given,
then we're just constantly juggling around,
like whose fault is it?
Right, exactly.
And I see that in large organizations
where the CI environment is owned by a different team,
for example.
It's very easy for developers,
the development team to be more like,
well, you know, CI is just a bit flaky.
So like, it's probably their fault that my test failed.
Instead of really doing the investigation into,
well, you know what?
I just, I introduced extra latency by doing this thing
or I didn't wait correctly on that thing.
Because it's almost always some sort of waiting asynchronous thing
which causes flakiness, right?
And it's easy to be like, well, it's not our fault
because the system doesn't work properly, the CI environment.
I read an interesting quote in Michael Nygard's Release It! book,
the most recent version.
And he said that the systems that we use for developing software,
which includes our laptops and our pipelines and the CI environments,
our test environments,
they should be treated with the same respect as production.
Because this is the environment,
this is our developer production environment.
We use this to produce the code that will go into production.
And too often we do see in a lot of different types of organizations
that the test environments are not treated with the same level of respect.
Because either they're not owned by one particular group
or people are always rushing and they just chuck stuff in there
or it's shared by a whole bunch of people
and they don't really know what it's for.
And no one's in charge of making sure these things
are well looked after, well maintained,
which is why platform engineering is an important thing, right?
Because if you have the platform engineering team
to manage these types of environments,
make it easy to set up and tear down new environments when you want it, that kind of thing.
And when I worked in places that had virtualization, pre-containerization: if you're running a whole bunch of virtual machines on top of your hardware and you're overloading your hardware in your CI environment, for example, you're not treating it with the respect that it deserves, right?
Because you need to give it the right resources
in order to get the right behavior,
the right performance for your tests and for your code.
Yeah.
Same point that I tried to make in my talk in Munich
where I said, you know, we are with an IDP,
with an internal development platform,
whatever tools you choose, whether it's Jenkins,
whether it is Argo, whatever you use,
you need to treat this as a product.
And it's a business-critical product because your engineers depend on it.
And therefore, I basically made a similar comment saying,
you need to make sure that you are treating your platform
and the individual components of your platform as business critical,
as the business critical software that you're creating.
And I brought a couple of examples on how we internally,
and we are a large organization as well at Dynatrace,
where I work, a large engineering organization.
I gave some examples on how we are monitoring our Jenkins pipelines,
how we make sure that we monitor the infrastructure they run on, and that people get alerted, obviously, when builds
start failing more often or start taking longer, because then the platform engineering team or the
developer productivity team really needs to first look into what's actually happening.
Another indicator, I think, is what I also try to say:
when people are building platforms
and you want that platform to be used,
you should also monitor that platform
from an end-user perspective.
So how many developers are actually using your pipelines?
How often do they check in code?
Because if that thing all of a sudden changes,
if they change behavior, something is wrong.
Yes, yeah, and you're right.
And that's where it's interesting.
I think platform engineering is interesting
because it allows you to pull together stuff
that traditionally would have sat in different parts of the organization.
For example, the people who take care of the hardware for CI
or virtual machines or whatever
would have been perhaps the ops team,
people who are in charge of figuring out
how often developers are committing to CI, for example.
That might be the engineering management.
And there's these sorts of responsibilities
and things like how long tests take to run.
That might be individual developers
caring about that sort of thing.
But if you pull it all together in some sort of developer productivity organization or
team or even an individual or two, they can start monitoring all these different, seemingly
different metrics to figure out like what's going on, what's the real behavior and how
can we fix these sorts of things.
For example, if CI starts to take a lot longer
to run your test suite, for example,
you don't just have to throw more machines at it.
Gradle Enterprise, for example,
has the ability to run predictive test selection
so that you can cut the number of tests
that you run down by almost as much as 70% in some cases.
And you can have build caching
so you don't have to run the tests half the time.
Or you can do test parallelization so that you can split them out and not run them in serial.
So you need to be able to have a mindset of there are different solutions to these kinds of problems.
If you're a hardware person, you'll see hardware problems and hardware solutions.
But if you're a developer productivity person, the solutions you should be aiming for are like, this is the symptom. Given that our job
is to make developers more productive, more effective, more efficient, which of the different
solutions could we choose? We don't just have to throw more hardware at the problem. We don't just
have to, I don't know, tell developers to write fewer tests or whatever it is. There are different
options that can be taken and you need to have a different mindset to figure out which of these solutions is going to work the best
for the teams in your organization.
So more hardware is not always the solution to a bottleneck.
Right, exactly.
That's a good one, yeah.
I want to have one more question on the flaky tests.
So how do we deal with flaky tests?
How do we treat them correctly?
How can we get rid of them?
I used to say, I was a bit hardcore about this.
I used to say, you should just delete them
because flaky tests are a difficult problem
and they're a non-trivial problem,
something that we do need to focus on and not just ignore.
But it is difficult, which is why lots of organizations have a bunch of them
and don't know what to do about them. I used to say delete them because I figure a flaky test is actually,
if it's definitely the test that's flaky and not the infrastructure, a flaky test is just worse
than no test at all because when it goes red, you just ignore it anyway. However, I've recently
written a blog post which should be coming out in the next few weeks in terms of the solutions we
can take. And so the first thing is
to find your flaky tests. It sounds kind of stupid, but a lot of organizations don't have
visibility over which of these failures is actually flaky. I mean, it's not that difficult
to work it out. Many organizations are rerunning tests several times, and if they go green,
then that's kind of fine, it counts as being green. But
you can flag that as flaky, because if it went red one time and then you re-ran it
in the same build and it went green, that's flaky. Another way you can detect it, and I've
worked at organizations that do this too, is looking across builds: if the test fails like every
other build or, you know, fails and passes, then it's probably flaky.
So the first thing you need to do
is you need to identify your flaky tests.
Otherwise, you're not getting anywhere.
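A minimal sketch, in plain Java, of the two detection ideas just described: rerun a test within the same build and call it flaky if the verdict changes, or look at its pass/fail history across builds on unchanged code. The Gradle Enterprise feature Trisha refers to does this for you; this is only meant to illustrate the logic, and the types here (the Callable test, the TestRun record) are invented for the example.

```java
import java.util.List;
import java.util.concurrent.Callable;

public class FlakyDetection {

    /** Rerun-in-the-same-build check: flaky if the same test both passes and fails. */
    static boolean isFlakyByRerun(Callable<Boolean> test, int reruns) throws Exception {
        boolean sawPass = false;
        boolean sawFail = false;
        for (int i = 0; i < reruns && !(sawPass && sawFail); i++) {
            if (test.call()) sawPass = true; else sawFail = true;
        }
        return sawPass && sawFail;
    }

    /** Hypothetical record of one test execution in one build. */
    record TestRun(String testName, boolean passed) {}

    /** Across-builds check: flaky if history on unchanged code contains both outcomes. */
    static boolean isFlakyAcrossBuilds(List<TestRun> historyForSameCode) {
        boolean sawPass = historyForSameCode.stream().anyMatch(TestRun::passed);
        boolean sawFail = historyForSameCode.stream().anyMatch(r -> !r.passed());
        return sawPass && sawFail;
    }
}
```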
And then there's a few other things you can do.
You can quarantine them for a bit
so that you can actually see what your real failures are
because your flaky tests are getting in the way
of you seeing real failures.
You need to address them.
So a good way of doing that is to set aside flaky test days, for example,
where the whole development team is going to focus on the flaky tests.
It doesn't have to be the most flaky first,
but if you have an idea of which ones are the most flaky,
you can prioritize them.
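One common way to do the quarantining, sketched with JUnit 5's @Tag annotation (the test class and method here are placeholders): tag the known offenders so the main verification build can exclude, or separately report, that tag, while a dedicated job or a flaky-test day still runs them.

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class CheckoutUiTest {

    @Test
    @Tag("flaky")   // quarantined: excluded from the commit build, tracked for fixing
    void confirmationBannerAppearsAfterPurchase() {
        // ... UI-driving test body omitted; this method is only a placeholder ...
    }
}
```

The build-side filtering (for example excluding that tag from the commit build via JUnit Platform tag exclusion) is build configuration rather than Java, so it isn't shown here.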
And then, so there's a video by Dave Farley
about the causes of test intermittency.
So you can actually like go away and have a look at that video to kind of figure out
what is a potential cause of those test failures and hopefully fix them. One of the things we found
when I worked with Dave, when we were looking at flaky tests is some of it was something that
could be addressed by the test structure. So once we identified a particular type of flaky test failure,
you could fix all of them.
Because some of them, they were, a lot of them were UI tests.
UI tests are particularly notorious for it
because the UI doesn't always come up at the time that you want.
So one of the things we did is we figured out a way
to wait for a specific thing to come onto the screen
and then go forward with the test.
Whereas before, we were just waiting an arbitrary amount of time
and then going on with the test, ready or not.
So these are the sorts of things you can put into place
and they'll help fix a whole bunch of your tests.
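A small Java sketch of the fix Trisha describes: instead of sleeping for an arbitrary amount of time and hoping the UI has caught up, poll for the specific condition you actually care about, with a timeout. The helper and the condition in the comment are invented for illustration; libraries such as Awaitility offer the same idea off the shelf.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

public class WaitFor {

    /** Polls the condition until it is true or the timeout expires. */
    static void waitFor(BooleanSupplier condition, Duration timeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (!condition.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            Thread.sleep(100); // short poll interval instead of one long, arbitrary sleep
        }
    }

    // Usage in a test (ui.confirmationVisible() is a hypothetical page-object call):
    //   waitFor(() -> ui.confirmationVisible(), Duration.ofSeconds(10));
    //   ...then carry on with the assertions, instead of Thread.sleep(5000) and hoping.
}
```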
And some of these things are not going to be about fixing tests.
Some of them might be infrastructure.
So you might need to find out, like,
you might need to figure out which tests need to run
on different types of agents, for example,
because you can tag tests and say,
oh, you run them in these particular environments.
Or there may need to be, I don't know,
sometimes some types of tests will just fail intermittently.
But we think, I was discussing this with my team at Gradle,
we think that's a small subset of tests and those should be kind of partitioned off to one side.
For example, these are not tests of your system. These are kind of sanity checks,
smoke tests of like external APIs or whatever. So if you're reliant on an external API,
you might have a test in place
to check that that API behaves the way you expect it to. Now, any kind of test which has network
latency or this sort of connectivity, there's a good chance that it will sometimes just time out.
But if you have those types of tests, you can put them off to one side and you don't have to
run them all the time in terms of your, what we used to call the commit build,
the build that has to run,
the tests that have to run
to give you that sanity check.
So sometimes it's a case of identifying
which ones will just by their very nature
be flaky and perhaps put them somewhere else.
Cool.
All this stuff,
I mean, there's a lot of triggers
that go off in my brain
when you mention, you know, like latency and things like this, because we've been investing a lot in chaos engineering where you're on purpose, right?
Slow certain things down, inject failure, and then figure out how the system behaves.
But I really like, again, a couple of notes that I took.
And folks, we will link to a lot of the things you mentioned, whether it's your blog post that comes out, the videos from Dave, your books that you've also written. I think we will also link to those.
I want to touch base on one thing because it also triggered a memory.
When you started on the flaky test detector, you mentioned that back in the day
you had like a radiator, a test status radiator somewhere hanging around saying green
or red.
And this reminded me of something we have in Dynatrace and we promoted it quite a bit
over the last years.
We call it the Dynatrace UFO.
It's like a flying saucer.
And it has like 16 by 2, so 32 lights, LED lights, and it's hooked up.
And the initial idea was actually hooking it up with the build pipeline
to actually flash if it's green or red.
And the idea was also at the end of the day, when you go home
and it's not green, you should check if you
are the one that actually broke the build or if something is wrong.
And now other parts of the organization,
also some of our users are using the UFO to just visualize and radiate
the status, whether it is the build pipeline, whether it's any problems in production.
Our marketing team uses it to, for instance, measure if our website is up and running because
we have synthetic checks.
But this concept of a radiator is obviously very good.
And back in the days pre-pandemic when we were all in the office, obviously everybody
saw it.
Now it's a little bit more challenging with everybody being remote and distributed,
but you can still use, obviously,
other indicators of status.
Yeah, I do miss having a large information radiator
on the wall.
When I moved to JetBrains,
it was my first 100% remote position.
And I did think about setting up a monitor
just for showing some of the stuff,
some of the status things that we had.
And in the offices, they had all that stuff.
But in the end, you're like,
I can alt-tab over to it.
But it's not the same thing.
When it's not in your face all the time,
it's not the same thing.
With my background of
observability, I've been advocating over the last 15 years that I've been here on, you know, getting
observability into your systems. And that is not only true for production but especially true also
for everything that happens before production. So from an observability perspective, are there
any best practices
that we can tell the developers? I noticed when we were at DevBCN, so at the conference
where we met, a lot of talks, and as a developer conference, a lot of talks were actually centered
around observability and also like getting real observability into your code, whether
you are, whether it's about instrumenting it with OpenTelemetry, whether it's about
creating the right logs, emitting metrics.
Is there some guidance that you can give folks out there
that are listening in on how to improve observability,
which in the end will make it easier to identify
whether a failing test is flaky or failing for the real reason?
Yes, this is a short question.
So I want to back up a little bit
and say about observability.
One of the things that we've been doing
with Gradle Enterprise is,
so the Gradle Enterprise product
kind of has like,
I was going to say two main parts.
There's more than two main parts.
One is performance: it helps to improve the performance of your build.
But the other thing is observability.
And it's observability of the sorts of things that we were not necessarily looking at before: things like local build times, test times, build failures
across the board, not just CI, local build failures, flaky tests, and test analytics as well.
So, I mean, it sounds kind of trite to just say
you should measure everything and look at everything,
but there are some things when it comes to the developer experience
that we have not been getting the visibility over that we want,
including local build times
and a lot of the things like test failures.
So flaky tests, I think flaky tests, the flaky test information that we have in Gradle Enterprise
is really, it's not that difficult to get a view on your flaky tests.
And you don't have to do much in terms of your code
to get a view on whether your tests are flaky or not.
Like I said, if you can't change your code at all,
then you can look at patterns in the test
as they've been run across various different builds.
It's a bit of a coarse-grained thing.
But the other brute force way of checking
whether a test is flaky or not
is literally just to run it more than once.
Because if you run it on the same environment
with the same code, the same hardware,
and you run it five times,
and if it passes even one of those five times,
that's a flaky test.
But it's not very efficient
because then you're basically,
we're back into that point we were talking about
using resources that you shouldn't need to use.
I mean, you shouldn't need to have to rerun a test five times,
especially because your flaky tests
are probably your expensive ones.
They probably are your database connection tests,
your UI tests, your slow tests.
So yeah, I mean, the main thing is to have,
is to consider the things
that you are not observing right now,
the things that might be impacting your developers,
things like flaky tests, things like local build times,
things like the other thing that Gradle Enterprise has
is failure analytics.
So it separates the failures into verification failures,
like your test failures, and then build failures.
And then when you look at them,
you can see how many users are being impacted by this particular failure.
So in the past as a developer, if my build failed, I was like, oh, right, okay, well,
I've just got to figure out what I've done wrong that my build has failed this time. And I'm going
to plod on on my own trying to figure out what that is. But if you're reporting all of your local
build information to somewhere, a central server somewhere,
you can get analytics on that to go,
oh, this kind of build failure happens
like every three days
and impacts 25 developers.
We should probably, whatever it is,
whether it's a process thing
or an automation thing or tooling thing,
we should probably fix this problem
because 25 people are tripping over this
every three days.
It's definitely costing time and money.
That's a good one. Thank you so much.
I mean, that's why I love this podcast so much
because you get different ideas and different experiences
from different people that are working in different areas
of the software development lifecycle and very fascinating.
And hopefully our listeners appreciate that as well, that we have a
broad variety of people with different thoughts. And I never thought about this in that detail
around flaky tests. Trisha, we are getting towards the end of the podcast. Is there anything else
that we forgot to mention? Anything that people can go to in case they want to read up, they want to follow up with you,
they want to learn more about what you do
and how to improve developer productivity?
Yeah, I mean, you're going to put a whole bunch of stuff
in the notes anyway, but if you're interested
in developer productivity engineering or Gradle Enterprise,
obviously Gradle and Gradle Enterprise
have that on their website.
You mentioned my books, which I thank you for.
The books are Head First Java,
97 Things Every Java Programmer Should Know,
and Getting to Know IntelliJ IDEA.
And Getting to Know IntelliJ IDEA
is like 400 pages of all the stuff.
Like I know IntelliJ really well.
Like this is the bare minimum, 400 pages of bare minimum I could tell people to be like,
you need to know this stuff if you're going to be productive with your IDE.
So to me, and also I'm working on a talk at the moment called DPE Starts in the IDE,
because three-letter acronyms are great.
And all the stuff we've been talking about today
has been around platform engineering and things like that. But my first love was,
you need to know your IDE because otherwise, you spend so much time writing code. And if you're
just typing it out by hand and not making the most of the tools available to you, then you're just,
you're wasting your own time. So so, yeah, buy my book.
And what else?
And I'm going to be speaking at various
conferences over the next three months or so.
So I'm going to be in like, well, next week I'm going to be in Manchester.
The week after that, I'm going to be in London, going to Madrid, going to be at
Devoxx, which I'm looking forward to. So Devoxx is in October, isn't it?
So, yeah.
Awesome.
That's great.
This one probably airs early September.
So folks, make sure that you follow Tricia
and where she's traveling to
and maybe you get a chance to catch her.
I had to write down this quote:
you need to know your IDE
because you're spending a lot,
most of your time with that tool.
So it's like, you know, you need to know your partner.
It's like when you have a relationship,
better get to know them because you're spending a lot of time.
You spend a lot of time in that IDE.
And like, I mean, we found out that when I worked at JetBrains,
we found out that, I mean, the IDEs are huge.
They have so many features in.
But like something like 95% of developers
weren't even using basic completion.
What are you using the IDE for?
Yeah, it feels to me like the early days of Microsoft Word,
all these features.
Right.
Like, oh, it can do that?
But I guess there's also a lot of things
that nobody ever uses.
And that also comes back, to close the circle, to this:
if you're interested, if you are investing in developer productivity,
you really need to first understand what your users,
so your developers, are trying to accomplish, where they're struggling,
and then optimize there and not just artificially optimize somewhere.
I think that's also very important, right?
Figure out where people are struggling, where they waste their time,
and then help them there.
And one more thing I want to say
about developer productivity
that's really important to me
is that improving developer productivity
is not about squeezing every last line of code
out of a developer.
It's about making the developers happier.
It's about allowing them to be more creative,
more innovative.
It's giving them the freedom to think
instead of spending all this time
wasting their time on boring, frustrating stuff.
That's good.
It's about making developers happier.
It's a nice line.
Trisha, thank you so much for being on this podcast.
I got to say, I always wish that Brian
would be with me
because typically the two of us
are doing this together
with our guests.
But I'm very happy
that we had this chat.
I learned a lot of things.
I hope this is also true for our listeners.
And Tricia,
I hope to get to see you again
at some point in the future
at some of the conferences.
And I hope to also have you back
because I'm pretty sure
you will learn many more new things
and you have more things to tell me
and our audience
that we couldn't cover today
so maybe in a future podcast.
I would love to come back.
I've had a lot of fun
talking about the stuff
that I really care about.
Thank you.
See you.
Bye.
Bye.