PurePerformance - Developer Productivity Engineering: It's more than buying faster hardware with Trisha Gee
Episode Date: September 11, 2023
Do you measure build times? On your shared CI as well as local builds on the developers' workstations? Do you measure how much time devs spend debugging code or trying to understand why tests or builds are all of a sudden failing? Are you treating your pre-production environments with the same respect as your production environments?
Tune in and hear from Trisha Gee, Developer Champion at Gradle, who has helped development teams reduce wait times, become more productive with their tools (gotta love that IDE of yours) and also understand the impact of their choices on other teams (when log lines wake up people at night). Trisha explains in detail what there is to know about DPE (Developer Productivity Engineering), how it fits into Platform Engineering, why adding more hardware is not always the best solution, and why flaky tests are a passionate topic for Trisha.
Here are the links to Trisha's social media, her books and everything else we discussed during the podcast:
LinkedIn: https://www.linkedin.com/in/trishagee/
Trisha's Website: https://trishagee.com/
Trisha's Talk on DPE: https://trishagee.com/presentations/developer-productivity-engineering-whats-in-it-for-me/
Trisha's Books: https://trishagee.com/2023/07/31/summer-reading-2023/
Dave Farley on Continuous Delivery: https://www.youtube.com/channel/UCCfqyGl3nq_V0bo64CjZh8g
Transcript
It's time for Pure Performance.
Get your stopwatches ready.
It's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everybody to another episode of Pure Performance.
As you can probably tell, this is not the voice of Brian Wilson, who typically does the intro.
It's the voice of Andy Grabner.
I hope Brian is still fast asleep because today we are recording a little early at an unusual time.
But, as usual, we have interesting guests with us.
And today I have Trisha Gee with me. Trisha, welcome to the show. Thanks for being here.
Thanks for having me.
Hey, Trisha, we met each other in early July in Barcelona at DevBCN. And I went to your talk and I was like, I want to say thank you, because you gave me a lot of inspiration when you talked about the myth of the 10x engineer versus the reality that engineering is actually there to enable the 10x organization by taking away all of the obstacles that otherwise people
have, whether they're developers or whoever else is using the platform.
I would like to talk a little bit more about this with you today from your perspective.
Before we get started, though, for those people that don't know you, could you give a brief
introduction, who you are, what motivates you, what you do in a day-to-day business?
Yeah, sure.
So I'm Trisha Gee.
I am a Java Champion and developer advocate.
A lot of people might know me from doing developer advocacy for JetBrains, doing a lot of stuff
for IntelliJ IDEA and IDEs.
I recently moved to Gradle, where I'm doing developer advocacy for developer productivity engineering.
Because at Gradle, we have a tool called Gradle Enterprise, which aims to get rid of a lot of the stuff that you were talking about.
A lot of things that get in the way of developers actually creating working products.
So the thing that kind of ties together all my experience is this passion for developer productivity.
That's kind of why I spent seven years telling people how to use their IDE,
why I spent a lot of time helping people get up to speed on the latest versions of Java,
because a lot of that is aimed at helping developers become more productive,
which is why I'm at Gradle now, because that allows me to pull back a bit higher level than just writing code
and more sort of, like you sort of mentioned,
a bit more organizational,
a bit more in terms of like the tool chain that we use
and how we actually get our code out somewhere
where people can use it.
Do you also see with all of your years of experience
that also the role of a developer
and kind of like where they live has also changed?
Because you mentioned that you focus so much on the IDE,
which is great, right?
We had to make developers more productive in their main tool
that they spend most of the time in. But have you seen also with
the evolution of software engineering in the last couple of years, whether you talk
about making sure that developers are also thinking about configuration
as code, where the stuff gets deployed. I mean, testing, obviously
we shouldn't even need to talk about that.
That should be a given anyway.
But observability, right, that's a big space where I'm in.
Do you also feel, because we've been pushing so many things
on top of, quote-unquote, the developer,
that we also need to think and rethink about
where we can make them more efficient and not just in the IDE?
Yeah, 100%.
I started coding professionally, well, actually started coding before I even graduated, so in like '97. So we're talking like 25 years of working in the industry. When I started working at Ford, they had the idea of lean, because lean manufacturing was a thing at Ford, obviously. But we're talking about software delivery lifecycle,
big release processes, big documentation.
At that time, a developer was like this classic idea of a developer.
If you give them a piece of paper, which has the technical specification of the thing they have to do,
they write the code, they should test it.
But even then, we're talking about manual testing,
a lot of that sort of thing, committing into source control. And then you had, when I was at Ford, we had one person
whose job it was to deliver that functionality. He was a release manager. And what he did was
package up the code and release it. And these release processes were complicated and difficult.
And the testing was manual and complicated and difficult.
And so over the course of my career,
we saw Agile come in,
a lot of stuff coming in through XP
in terms of more automated testing,
things like JUnit made things a lot easier
for us to write our automated tests
and give us the confidence
that we'd written the right thing,
allow us to refactor,
allow us to write better code and be more productive. DevOps sort of coming in, I worked
for Dave Farley when he was writing the continuous delivery book. So I actually worked at the
organization where he was implementing continuous delivery, like ahead of anyone even knowing what
that term meant. So I came from, I'd been working at a bank where again, we had a
difficult three hour release process. At this time, the developers were in charge of doing the release
process. It wasn't a release manager, but it was three hours of following scripts and debugging
the problems as you went, always out of hours after 6pm. Painful, painful processes. And so
continuous delivery, the reason I went to go and work with Dave is that
when he told me in the interview that continuous delivery was a thing, I was like, oh, this is
what we need. We need more automation. We need these pipelines. We need to have confidence that
the tests tell us that we've done what we wanted to do and then know that at the end of this
pipeline, this thing can be deployed into production. But as a result, and this is only
coming up to what happened up to about 10 years ago. And since then, automation's got better, and the DevOps movement has taken off a lot more. So again, pushing more of
the ops stuff onto the developer. Security is becoming more and more important because obviously
we deploy into the cloud. We're not necessarily doing stuff on-prem. So over the course of these
25 years, we've moved
from being, in my experience, we've moved from being a developer who writes code even in a text
editor, you know, and is just told what to do and then throws the code somewhere else. It's someone
else's responsibility. And now we have to think about not only writing correct code and testing
our code and being able to build it automatically,
pull in our dependencies automatically, which is another thing we didn't used to do.
But we also have to think about the operational side of stuff, the security side of stuff,
the deployment side of stuff, monitoring, observability, all of those kinds of things.
So in some ways, it's kind of a weird contradiction where in order for us to
become more effective at our jobs, in order for us to become more productive as people who produce
code, we actually have to do a lot less of the code production and a lot more of everything else,
a lot of worrying about lots of other things. One of the things I think that has enabled that
is automation and tooling to allow us to do
this kind of thing. And I don't think it's a bad thing for us to have more responsibilities because
back when we just wrote code, we didn't necessarily think about, is this the right thing to do? Is this
really what the user wants? What will the impact be on production if I make these changes? And
being more responsible for a wider range of things allows us
to write better code because we think more about the problems. But we definitely, definitely have
more responsibilities than ever before. I think you bring up, you know, as you were talking,
it felt like a pendulum a little bit, right? We started from, as you said, just focusing on code,
throw it over the wall. And now we have to do all these things.
And if I hear you correctly, obviously, it makes a lot of sense
that developers are familiar with everything that they have to do.
But on the other side, we also need to swing the pendulum back a little bit again,
because otherwise, we can only spend a certain small percentage of time
actually creating code and having to deal with so many other things.
Yes, automation has made it easier for everyone to think about automating security
and automating delivery and automating observability.
But still, it feels for me, we are pushing so many things on top of developers.
Right now, I need to not only know my IDE, I need to know 10 other tools.
And while these tools might be easier, maybe I can codify them,
but still,
I need to know a lot of things. And this is where, from my perspective, what I love about
the whole movement of platform engineering, kind of trying to figure out what are these 10 tools
doing and how can I make this even easier for the people that use my platform?
And so this is why I said the pendulum. It's like from just coding to doing everything.
And now I think we need to find somewhere the middle ground
to make sure that the developers can really be productive
and not having to waste a lot of time with all these different tools
to do all these things that they need to do.
Do I get this right?
Yes, 100%.
I think that some of the organizations I've worked in,
so I've worked in very big organizations like Ford
and like the banks and smaller ones like startups
where sometimes there's only three of us on the team.
Regardless of the size of the organization,
I have found that in the past,
there's usually like one person on the team
who really cares about tooling and productivity
and productivity for the team, right?
And it's that person who does things like
decides to use a different CI server
or decides that maybe a different tool set would be better
or does some, we had someone who wrote an automation script
to farm out tests to different agents
so that we could parallelize stuff.
But that doesn't scale.
You can't rely on one person in the team or people using their 20% time if they have 20% time.
You can't rely on those people, but their effect is enormous. It helps the productivity of all of the
developers, right? If you can speed up your CI build, or if you can speed up the test runs, or if you can automate something
which someone was doing manually,
that has a huge impact on all of the developers,
which is the 10x organization thing you're talking about.
It's not just one person working more effectively.
It's one person helping the whole team
to work more effectively.
But traditionally, developers are rewarded or at least measured on, you know, features delivered or bugs fixed or even, heaven forbid, lines of code.
So generally speaking, organizations are not going to reward one person for taking time away from that coding thing just to enable everyone else's productivity.
So platform engineering is a great movement
because the whole point is that you have teams who are dedicated to that kind of thing. It is
recognized, it is rewarded. Their job is to look at the bottlenecks and to try and help that across
everybody and not just fix the problem for the one person who's having that problem.
Platform engineering is the way to make the 10x organization.
Yeah. And I really like that you brought up the whole metric,
because obviously we're all metric-driven.
We want to figure out how productive are we by, as you said,
how many features do we push out.
From an efficiency perspective, though, I think when I look at the State of DevOps report
that just came out focusing on platform engineering,
there are ways in which they actually
measure developer efficiency, the gains in developer efficiency.
I think that's great.
We can measure the impact that the investment in platform engineering or in developer productivity
engineering, as you actually call it, has.
How many more builds in the end can I still do because I don't have to wait
for all these things, so I don't need to spend so much time in troubleshooting why builds fail,
or I can faster react to problems in production because the platform just provides me with better
visibility and better observability, yeah? Yeah, I think the measurability of some of
these things is just a really appealing thing. If you cut your build time down by even 5%, when you add up the dollar amount it costs a developer or an organization: if a developer sat there waiting for five minutes for their build, which isn't a very long build time in this world, you can add up for every developer in your organization those five minutes.
How much does it cost you?
So reducing things like build times,
reducing debug times, reducing troubleshooting times,
running tests is one of those things that often,
if you've got a good comprehensive test suite,
which is a good thing to have,
it takes longer to run it.
So like optimizing those things so we get that fast feedback. You know, you can put a dollar amount
on that sort of thing. And on top of that, there's the context switching thing. I mean,
developers have been so great at being like, right, okay, this, the CI build is going to take
40 minutes. So I'm going to work on a different thing or go for lunch or whatever. And we get good at, we think we're good at context switching.
But the fact is, if that build fails or if there's a test, which may be a flaky test or might be a real failure, we're not really sure.
By the time it fails and we come back to it, we're like, oh, I don't remember what I was doing now because I put my brain in a different place.
So the context switching has a cost as well.
And we can measure all these things.
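To make that back-of-the-envelope math concrete, here is a rough Java sketch. Every number in it (team size, builds per day, wait time, loaded cost) is an invented assumption for illustration only, and the context-switching cost Trisha mentions would come on top of whatever this produces.

```java
public class BuildWaitCost {
    public static void main(String[] args) {
        // All numbers below are illustrative assumptions, not real data.
        int developers = 100;              // developers in the organization
        int buildsPerDevPerDay = 8;        // builds each developer waits on per day
        double waitMinutesPerBuild = 5.0;  // "five minutes ... isn't a very long build time"
        double loadedCostPerHour = 75.0;   // assumed fully loaded cost of a developer hour
        int workingDaysPerYear = 220;

        double waitHoursPerYear =
                developers * buildsPerDevPerDay * waitMinutesPerBuild * workingDaysPerYear / 60.0;
        double costPerYear = waitHoursPerYear * loadedCostPerHour;

        System.out.printf("Hours spent waiting per year: %.0f%n", waitHoursPerYear);
        System.out.printf("Approximate cost per year:    $%.0f%n", costPerYear);
    }
}
```

With these made-up inputs the waiting alone works out to roughly 14,700 hours, or about a million dollars a year, before counting the cost of broken concentration.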
A lot of it just comes down to feedback time. I like the, I mean, you mentioned costs, right?
I think there's an additional aspect to costs, the real cost on the infrastructure. Because if
you think about it, if you can see at least more and more organizations are really spinning up
infrastructure on demand for different cycles of the build or different cycles of the software development lifecycle.
If you can cut down build times, not only do you save time and costs on the developer,
the context switching, but also on the amount of money you need to spend
to actually run this infrastructure coming back to sustainability.
So I think that's a really compelling thing,
especially when you're talking about things like sustainability and the greenhouse effect.
And the amount of energy that computers consume is not really great at the end of the day.
And if you can reduce those build times or test times,
you don't necessarily cut CI usage, for example,
but you will get more efficient usage of your CI environment.
For example, if you cut build time in two,
you do twice as many builds.
So it can go either way.
You can either cut down the amount of resources you need
or get twice as much stuff pushed through your pipeline.
Yeah, I think that's very good.
So your argument is don't just measure the success
by, let's say, reducing the amount of CPU cycles you need for the builds
because overall you may just run many more builds,
which means you're actually getting better software out of the door, and faster.
Right.
So you keep repeating build times, cutting build times.
That's great.
Debug times, all these context switches.
What else do you see out there?
If people are listening in to us now and say, hey, okay, this is a really interesting conversation.
What other things should I have on my radar?
What do I need to measure? And what can I improve? And where am I maybe not looking at
all, because I never thought that this could negatively impact developer productivity,
and therefore maybe it even scares developers away, or maybe they're looking for another job because
they're frustrated. Yes, I mean, that does happen. I've seen a lot of frustrated developers who are
sick of waiting around for stuff, waiting for, I left one job because it took six weeks to get my IDE environment set up.
And I'm like, this is unacceptable. Like, what? I can't sit here; you can't pay me this great salary
for me to be sat here doing literally nothing because you can't be bothered to set up my IDE
for me. So the build times thing, I just want to finish up on that before I move on to the other
ones. One of the things about build times is we often think about CI build times because that's the thing that we can see. We've got visibility
of how long it takes CI to run stuff. And we have cost associated with that as well with our cloud
stuff. What we're not often doing is measuring local build times. And so that's one place where
people can start, especially if you've got a remote and distributed team. We might not have
visibility over the fact that this one person's build takes 15 minutes and for everyone else it takes five minutes.
What can we do to fix that one problem? So measuring local build times often falls through
the gaps. It's one of those things we don't really think about because we just kind of run the build
and go and get a coffee and don't consider that it's something that can be changed. But I really
want to move on to my pet peeve, my pet project.
One of the reasons I joined Gradle
is because the Gradle Enterprise tool
has a flaky test detector.
And flaky tests are, they just drive me potty.
They're one of these things which are,
they're a time drain,
but they're one of these things you're talking about,
these sources of frustration,
this kind of like energy drain
for developers,
because you kind of go,
if you have even a small amount
of flaky tests in your test suite,
the problem that you have is
if tests fail
and you're not sure why they failed,
you're not sure if you broke them or not, then you go one of two ways.
One, you spend a whole bunch of time investigating a failure that was not your problem and you should not have been looking at.
Or two, you ignore it.
But then when you start to ignore it, that leads to the potential negative side effect of ignoring any test failures, in which case, what's the point in having any automated testing if every time they go red, a developer just goes, well, it's
probably flaky. Let's just run it a few more times or let's wait for a few more builds to see if it
really failed. And so you have a lot of noise in your results. And one of the things I really,
I learned when I was working with Dave Farley is that having an extensive test suite with
information radiators on the wall, like telling you how many tests have run and how many have
passed, gives you this lovely, warm, fluffy feeling of the things I did haven't broken anything.
But if those are going red, like occasionally for no reason whatsoever,
that whole confidence goes away. You can't do refactorings because things fall over
and you're not sure if it's you. And these flaky tests could be down to any number of different
reasons. It could be you. You could have written a bad test. It could be your production code,
which has got race conditions in it. It could be your CI environment is a bit unstable. It could be
they work under some circumstances, like it works on, you know, JDK 17, but not JDK 19.
You know, and there's a lot of different factors which could contribute to a test failing sometimes and passing other times.
And I just hate flaky tests because I think like it should pass or it should fail for the right reason.
And I don't want to spend a day trying to figure out, oh, it wasn't my fault.
I didn't break it.
So I took a couple of notes and I have a couple of comments
to what you just said.
So you mentioned if there's obviously flaky tests,
it's a source of frustration.
Eventually, it's like if you draw the parallels
to production alerting, people that actually have
to react to incidents.
Eventually, people are reporting about alert fatigue
because it's too many alerts.
What do I do with a thousand ringing bells?
And in the end, I don't really know if this is a real problem or not.
And so people sometimes then start ignoring them,
like what you just said.
You just start ignoring because you don't know anyway
if it's a real problem or not a problem.
So it's an interesting parallel to production incident management.
It is. Yeah, it is for sure.
I mean, failing tests or alerting in production have the same,
they mean the same thing.
Something's not right.
Something is poor quality.
And I agree about the alerting thing.
One of the most useful things that happened to me sort of mid-career is this: it was when there was no DevOps
and there was ops who would monitor production.
And they would complain at the development team like,
you know, you've got these alerts going off right, left, and center
and we don't know what they mean.
And we're like, we don't really know what this is.
And we weren't aware that every time we did a log.warn, that was something that would get sent
to someone's pager. We didn't know that. We just kind of were like, oh, it's warn, debug, info,
like warning, I'll put it on warning because then it'll show up in production. We didn't know
that's something that's going to wake someone up at three o'clock in the morning. So, you know,
it's very easy to get sort of cavalier about, well, you know, I'll fail it if it's a bit, if it seems a bit wrong or I'll log it to some high level just in case.
If you don't really understand what the consequences are of the thing that you're doing, you can't make the right decisions.
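To make that point concrete, here is a minimal Java/SLF4J sketch. The class, the method names, and the assumption that the operations team pages on WARN and above are all invented for illustration; the only point is that the level you pick is an operational decision, not a formatting one.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentRetryHandler {
    private static final Logger log = LoggerFactory.getLogger(PaymentRetryHandler.class);

    void onRetry(int attempt, Exception cause) {
        // INFO: visible in the logs, but (in this assumed setup) nobody gets paged.
        log.info("Payment call failed, retrying (attempt {})", attempt, cause);
    }

    void onRetriesExhausted(Exception cause) {
        // ERROR: in this assumed setup this feeds the alerting system,
        // i.e. this line can wake someone up at three o'clock in the morning.
        log.error("Payment call failed after all retries, manual action needed", cause);
    }
}
```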
Oh, that's interesting.
I need to write this down.
Do you know the consequences of the next log line you're writing?
Might.
I mean, for me, it's again, you know, I've spent a lot of time in the last years helping organizations in production, especially trying to get better in detecting root cause and getting better observability and also building better scalable and resilient systems. So for me, it sounds very similar, right?
Because in the end, you're saying, if something fails, I want to know what the root
cause is.
So observability, you just mentioned log files, they help obviously, and nowadays everybody's
talking about distributed tracing with OpenTelemetry and other frameworks we have.
But then also, you mentioned something interesting, you said, you don't know, maybe the underlying system had an issue.
Maybe somebody did an upgrade to my CI system.
Maybe some other background job ran,
and therefore it had an impact on my tests.
So in the end, we really need to invest in stable and resilient systems
that actually then build and execute our tests.
Because if this is not a given,
then we're just constantly juggling around,
like whose fault is it?
Right, exactly.
And I see that in large organizations
where the CI environment is owned by a different team,
for example.
It's very easy for developers,
the development team to be more like,
well, you know, CI is just a bit flaky.
So like, it's probably their fault that my test failed.
Instead of really doing the investigation into,
well, you know what?
I just, I introduced extra latency by doing this thing
or I didn't wait correctly on that thing.
Because it's almost always some sort of waiting asynchronous thing
which causes flakiness, right?
And it's easy to be like, well, it's not our fault
because the system doesn't work properly, the CI environment.
I read an interesting quote in Michael Nygard's Release It! book,
the most recent version.
And he said that the systems that we use for developing software,
which includes our laptops and our pipelines and the CI environments,
our test environments,
they should be treated with the same respect as production.
Because this is the environment,
this is our developer production environment.
We use this to produce the code that will go into production.
And too often we do see in a lot of different types of organizations
that the test environments are not treated with the same level of respect.
Because either they're not owned by one particular group
or people are always rushing and they just chuck stuff in there
or it's shared by a whole bunch of people
and they don't really know what it's for.
And no one's in charge of making sure these things
are well looked after, well maintained,
which is why platform engineering is an important thing, right?
Because if you have the platform engineering team
to manage these types of environments,
make it easy to set up and tear down new environments when you want it, that kind of thing.
And when I worked in places that had virtualization, pre-containerization: if you're running a whole bunch of virtual machines on top of your hardware and you're overloading your hardware in your CI environment, for example, you're not treating it with the respect that it deserves, right?
Because you need to give it the right resources
in order to get the right behavior,
the right performance for your tests and for your code.
Yeah.
Same point that I tried to make in my talk in Munich
where I said, you know, we are with an IDP,
with an internal development platform,
whatever tools you choose, whether it's Jenkins,
whether it is Argo, whatever you use,
you need to treat this as a product.
And it's a business-critical product because your engineers depend on it.
And therefore, I basically made a similar comment saying,
you need to make sure that you are treating your platform
and the individual components of your platform as business critical,
as the business critical software that you're creating.
And I brought a couple of examples on how we internally,
and we are a large organization as well at Dynatrace,
where I work, a large engineering organization.
I gave some examples on how we are monitoring our Jenkins pipelines,
how we make sure that we monitor the infrastructure they run on, and that people get alerted, obviously, when builds
start failing more often or start taking longer, because then the platform engineering team or the
developer productivity team really needs to first look into what's actually happening.
Another indicator, I think, is what I also try to say:
when people are building platforms
and you want that platform to be used,
you should also monitor that platform
from an end-user perspective.
So how many developers are actually using your pipelines?
How often do they check in code?
Because if that thing all of a sudden changes,
if they change behavior, something is wrong.
Yes, yeah, and you're right.
And that's where it's interesting.
I think platform engineering is interesting
because it allows you to pull together stuff
that traditionally would have sat in different parts of the organization.
For example, the people who take care of the hardware for CI
or virtual machines or whatever
would have been perhaps the ops team,
people who are in charge of figuring out
how often developers are committing to CI, for example.
That might be the engineering management.
And there's these sorts of responsibilities
and things like how long tests take to run.
That might be individual developers
caring about that sort of thing.
But if you pull it all together in some sort of developer productivity organization or
team or even an individual or two, they can start monitoring all these different, seemingly
different metrics to figure out like what's going on, what's the real behavior and how
can we fix these sorts of things.
For example, if CI starts to take a lot longer
to run your test suite, for example,
you don't just have to throw more machines at it.
Gradle Enterprise, for example,
has the ability to run predictive test selection
so that you can cut the number of tests
that you run down by almost as much as 70% in some cases.
And you can have build caching
so you don't have to run the tests half the time.
Or you can do test parallelization so that you can split them out and not run them in serial.
So you need to be able to have a mindset of there are different solutions to these kinds of problems.
If you're a hardware person, you'll see hardware problems and hardware solutions.
But if you're a developer productivity person, the solutions you should be aiming for are like, this is the symptom. Given that our job
is to make developers more productive, more effective, more efficient, which of the different
solutions could we choose? We don't just have to throw more hardware at the problem. We don't just
have to, I don't know, tell developers to write fewer tests or whatever it is. There are different
options that can be taken and you need to have a different mindset to figure out which of these solutions is going to work the best
for the teams in your organization.
So more hardware is not always the solution to a bottleneck.
Right, exactly.
That's a good one, yeah.
I want to have one more question on the flaky tests.
So how do we deal with flaky tests?
How do we treat them correctly?
How can we get rid of them?
I used to say, I was a bit hardcore about this.
I used to say, you should just delete them
because flaky tests are a difficult problem
and they're a non-trivial problem,
something that we do need to focus on and not just ignore.
But it is difficult, which is why lots of organizations have a bunch of them
and don't know what to do about them. I used to say delete them because I figure a flaky test is actually,
if it's definitely the test that's flaky and not the infrastructure, a flaky test is just worse
than no test at all because when it goes red, you just ignore it anyway. However, I've recently
written a blog post which should be coming out in the next few weeks in terms of the solutions we
can take. And so the first thing is
to find your flaky tests. It sounds kind of stupid, but a lot of organizations don't have
visibility over which of these failures is actually flaky. I mean, it's not that difficult
to work it out. Many organizations are rerunning tests several times, and if they go green,
then that's kind of fine, it counts as being green. But
you can flag that as flaky, because if it went red one time and then you re-ran it
in the same build and it went green, that's flaky. Another way you can detect it, and I've
worked at organizations that do this too, is looking across builds: if the test fails like every
other build or, you know, fails and passes, then it's probably flaky.
So the first thing you need to do
is you need to identify your flaky tests.
Otherwise, you're not getting anywhere.
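A minimal sketch, in plain Java, of the two detection ideas just described: rerun a test within the same build and call it flaky if the verdict changes, or look at its pass/fail history across builds on unchanged code. The Gradle Enterprise feature Trisha refers to does this for you; this is only meant to illustrate the logic, and the types here (the Callable test, the TestRun record) are invented for the example.

```java
import java.util.List;
import java.util.concurrent.Callable;

public class FlakyDetection {

    /** Rerun-in-the-same-build check: flaky if the same test both passes and fails. */
    static boolean isFlakyByRerun(Callable<Boolean> test, int reruns) throws Exception {
        boolean sawPass = false;
        boolean sawFail = false;
        for (int i = 0; i < reruns && !(sawPass && sawFail); i++) {
            if (test.call()) sawPass = true; else sawFail = true;
        }
        return sawPass && sawFail;
    }

    /** Hypothetical record of one test execution in one build. */
    record TestRun(String testName, boolean passed) {}

    /** Across-builds check: flaky if history on unchanged code contains both outcomes. */
    static boolean isFlakyAcrossBuilds(List<TestRun> historyForSameCode) {
        boolean sawPass = historyForSameCode.stream().anyMatch(TestRun::passed);
        boolean sawFail = historyForSameCode.stream().anyMatch(r -> !r.passed());
        return sawPass && sawFail;
    }
}
```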
And then there's a few other things you can do.
You can quarantine them for a bit
so that you can actually see what your real failures are
because your flaky tests are getting in the way
of you seeing real failures.
You need to address them.
So a good way of doing that is to set aside flaky test days, for example,
where the whole development team is going to focus on the flaky tests.
It doesn't have to be the most flaky first,
but if you have an idea of which ones are the most flaky,
you can prioritize them.
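One common way to do the quarantining, sketched with JUnit 5's @Tag annotation (the test class and method here are placeholders): tag the known offenders so the main verification build can exclude, or separately report, that tag, while a dedicated job or a flaky-test day still runs them.

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class CheckoutUiTest {

    @Test
    @Tag("flaky")   // quarantined: excluded from the commit build, tracked for fixing
    void confirmationBannerAppearsAfterPurchase() {
        // ... UI-driving test body omitted; this method is only a placeholder ...
    }
}
```

The build-side filtering (for example excluding that tag from the commit build via JUnit Platform tag exclusion) is build configuration rather than Java, so it isn't shown here.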
And then, so there's a video by Dave Farley
about the causes of test intermittency.
So you can actually like go away and have a look at that video to kind of figure out
what is a potential cause of those test failures and hopefully fix them. One of the things we found
when I worked with Dave, when we were looking at flaky tests is some of it was something that
could be addressed by the test structure. So once we identified a particular type of flaky test failure,
you could fix all of them.
Because some of them, they were, a lot of them were UI tests.
UI tests are particularly notorious for it
because the UI doesn't always come up at the time that you want.
So one of the things we did is we figured out a way
to wait for a specific thing to come onto the screen
and then go forward with the test.
Whereas before, we were just waiting an arbitrary amount of time
and then going on with the test, ready or not.
So these are the sorts of things you can put into place
and they'll help fix a whole bunch of your tests.
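A small Java sketch of the fix Trisha describes: instead of sleeping for an arbitrary amount of time and hoping the UI has caught up, poll for the specific condition you actually care about, with a timeout. The helper and the condition in the comment are invented for illustration; libraries such as Awaitility offer the same idea off the shelf.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

public class WaitFor {

    /** Polls the condition until it is true or the timeout expires. */
    static void waitFor(BooleanSupplier condition, Duration timeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (!condition.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            Thread.sleep(100); // short poll interval instead of one long, arbitrary sleep
        }
    }

    // Usage in a test (ui.confirmationVisible() is a hypothetical page-object call):
    //   waitFor(() -> ui.confirmationVisible(), Duration.ofSeconds(10));
    //   ...then carry on with the assertions, instead of Thread.sleep(5000) and hoping.
}
```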
And some of these things are not going to be about fixing tests.
Some of them might be infrastructure.
So you might need to find out, like,
you might need to figure out which tests need to run
on different types of agents, for example,
because you can tag tests and say,
oh, you run them in these particular environments.
Or there may need to be, I don't know,
sometimes some types of tests will just fail intermittently.
But we think, I was discussing this with my team at Gradle,
we think that's a small subset of tests and those should be kind of partitioned off to one side.
For example, these are not tests of your system. These are kind of sanity checks,
smoke tests of like external APIs or whatever. So if you're reliant on an external API,
you might have a test in place
to check that that API behaves the way you expect it to. Now, any kind of test which has network
latency or this sort of connectivity, there's a good chance that it will sometimes just time out.
But if you have those types of tests, you can put them off to one side and you don't have to
run them all the time in terms of your, what we used to call the commit build,
the build that has to run,
the tests that have to run
to give you that sanity check.
So sometimes it's a case of identifying
which ones will just by their very nature
be flaky and perhaps put them somewhere else.
Cool.
All this stuff,
I mean, there's a lot of triggers
that go off in my brain
when you mention, you know, like latency and things like this, because we've been investing a lot in chaos engineering where you're on purpose, right?
Slow certain things down, inject failure, and then figure out how the system behaves.
But I really like, again, a couple of notes that I took.
And folks, we will link to a lot of the things you mentioned, whether it's your blog post that comes out, the videos from Dave, your books that you've also written. I think we will also link to those.
I want to touch base on one thing because it also triggered a memory.
When you started on the flaky test detector, you mentioned that back in the day
you had like a radiator, a test status radiator somewhere hanging around saying green
or red.
And this reminded me of something we have in Dynatrace and we promoted it quite a bit
over the last years.
We call it the Dynatrace UFO.
It's like a flying saucer.
And it has like 16 by 2, so 32 lights, LED lights, and it's hooked up.
And the initial idea was actually hooking it up with the build pipeline
to actually flash if it's green or red.
And the idea was also at the end of the day, when you go home
and it's not green, you should check if you
are the one that actually broke the build or if something is wrong.
And now other parts of the organization,
also some of our users are using the UFO to just visualize and radiate
the status, whether it is the build pipeline, whether it's any problems in production.
Our marketing team uses it to, for instance, measure if our website is up and running because
we have synthetic checks.
But this concept of a radiator is obviously very good.
And back in the days pre-pandemic when we were all in the office, obviously everybody
saw it.
Now it's a little bit more challenging with everybody being remote and distributed,
but you can still use, obviously,
other indicators of status.
Yeah, I do miss having a large information radiator
on the wall.
When I moved to JetBrains,
it was my first 100% remote position.
And I did think about setting up a monitor
just for showing some of the stuff,
some of the status things that we had.
And in the offices, they had all that stuff.
But in the end, you're like,
I can alt-tab over to it.
But it's not the same thing.
When it's not in your face all the time,
it's not the same thing.
With my background of
observability, I've been advocating over the last 15 years that I've been here on, you know, getting
observability into your systems. And that is not only true for production but especially true also
for everything that happens before production. So from an observability perspective, are there
any best practices
that we can tell the developers? I noticed when we were at DevBCN, so at the conference
where we met, a lot of talks, and as a developer conference, a lot of talks were actually centered
around observability and also like getting real observability into your code, whether
you are, whether it's about instrumenting it with OpenTelemetry, whether it's about
creating the right logs, emitting metrics.
Is there some guidance that you can give folks out there
that are listening in on how to improve observability,
which in the end will make it easier to identify
whether a failing test is flaky or failing for the real reason?
Yes, this is a short question.
So I want to back up a little bit
and say about observability.
One of the things that we've been doing
with Gradle Enterprise is,
so the Gradle Enterprise product
kind of has like,
I was going to say two main parts.
There's more than two main parts.
One is performance: it helps to improve the performance of your build.
But the other thing is observability.
And it's observability of the sorts of things that we were not necessarily looking at before: things like local build times, test times, build failures
across the board, not just CI, local build failures, flaky tests, and test analytics as well.
So, I mean, it sounds kind of trite to just say
you should measure everything and look at everything,
but there are some things when it comes to the developer experience
that we have not been getting the visibility over that we want,
including local build times
and a lot of the things like test failures.
So flaky tests, I think flaky tests, the flaky test information that we have in Gradle Enterprise
is really, it's not that difficult to get a view on your flaky tests.
And you don't have to do much in terms of your code
to get a view on whether your tests are flaky or not.
Like I said, if you can't change your code at all,
then you can look at patterns in the test
as they've been run across various different builds.
It's a bit of a coarse-grained thing.
But the other brute force way of checking
whether a test is flaky or not
is literally just to run it more than once.
Because if you run it on the same environment
with the same code, the same hardware,
and you run it five times,
and if it passes even one of those five times,
that's a flaky test.
But it's not very efficient
because then you're basically,
we're back into that point we were talking about
using resources that you shouldn't need to use.
I mean, you shouldn't need to have to rerun a test five times,
especially because your flaky tests
are probably your expensive ones.
They probably are your database connection tests,
your UI tests, your slow tests.
So yeah, I mean, the main thing is to have,
is to consider the things
that you are not observing right now,
the things that might be impacting your developers,
things like flaky tests, things like local build times,
things like the other thing that Gradle Enterprise has
is failure analytics.
So it separates the failures into verification failures,
like your test failures, and then build failures.
And then when you look at them,
you can see how many users are being impacted by this particular failure.
So in the past as a developer, if my build failed, I was like, oh, right, okay, well,
I've just got to figure out what I've done wrong that my build has failed this time. And I'm going
to plod on on my own trying to figure out what that is. But if you're reporting all of your local
build information to somewhere, a central server somewhere,
you can get analytics on that to go,
oh, this kind of build failure happens
like every three days
and impacts 25 developers.
We should probably, whatever it is,
whether it's a process thing
or an automation thing or tooling thing,
we should probably fix this problem
because 25 people are tripping over this
every three days.
It's definitely costing time and money.
That's a good one. Thank you so much.
I mean, that's why I love this podcast so much
because you get different ideas and different experiences
from different people that are working in different areas
of the software development lifecycle and very fascinating.
And hopefully our listeners appreciate that as well, that we have a
broad variety of people with different thoughts. And I never thought about this in that detail
around flaky tests. Trisha, we are getting towards the end of the podcast. Is there anything else
that we forgot to mention? Anything that people can go to in case they want to read up, they want to follow up with you,
they want to learn more about what you do
and how to improve developer productivity?
Yeah, I mean, you're going to put a whole bunch of stuff
in the notes anyway, but if you're interested
in developer productivity engineering or Gradle Enterprise,
obviously Gradle and Gradle Enterprise
have that on their website.
You mentioned my books, which I thank you for.
The books are Head First Java,
97 Things Every Java Programmer Should Know,
and Getting to Know IntelliJ IDEA.
And Getting to Know IntelliJ IDEA
is like 400 pages of all the stuff.
Like I know IntelliJ really well.
Like this is the bare minimum, 400 pages of bare minimum I could tell people to be like,
you need to know this stuff if you're going to be productive with your IDE.
So to me, and also I'm working on a talk at the moment called DPE Starts in the IDE,
because three-letter acronyms are great.
And all the stuff we've been talking about today
has been around platform engineering and things like that. But my first love was,
you need to know your IDE because otherwise, you spend so much time writing code. And if you're
just typing it out by hand and not making the most of the tools available to you, then you're just,
you're wasting your own time. So so, yeah, buy my book.
And what else?
And I'm going to be speaking at various
conferences over the next three months or so.
So I'm going to be in like, well, next week I'm going to be in Manchester.
The week after that, I'm going to be in London, going to Madrid, going to be at
Devoxx, which I'm looking forward to. So Devoxx is in October, isn't it?
So, yeah.
Awesome.
That's great.
This one probably airs early September.
So folks, make sure that you follow Tricia
and where she's traveling to
and maybe you get a chance to catch her.
I had to write down this quote:
you need to know your IDE
because you're spending a lot,
most of your time with that tool.
So it's like, you know, you need to know your partner.
It's like when you have a relationship,
better get to know them because you're spending a lot of time.
You spend a lot of time in that IDE.
And like, I mean, we found out that when I worked at JetBrains,
we found out that, I mean, the IDEs are huge.
They have so many features in.
But like something like 95% of developers
weren't even using basic completion.
What are you using the IDE for?
Yeah, it feels to me like the early days of Microsoft Word,
all these features.
Right.
Like, oh, it can do that?
But I guess there's also a lot of things
that nobody ever uses.
And that also comes back, to close the circle, to this:
if you're interested, if you are investing in developer productivity,
you really need to first understand what your users,
so your developers, are trying to accomplish, where they're struggling,
and then optimize there and not just artificially optimize somewhere.
I think that's also very important, right?
Figure out where people are struggling, where they waste their time,
and then help them there.
And one more thing I want to say
about developer productivity
that's really important to me
is that improving developer productivity
is not about squeezing every last line of code
out of a developer.
It's about making the developers happier.
It's about allowing them to be more creative,
more innovative.
It's giving them the freedom to think
instead of spending all this time
wasting their time on boring, frustrating stuff.
That's good.
It's about making developers happier.
It's a nice line.
Trisha, thank you so much for being on this podcast.
I got to say, I always wish that Brian
would be with me
because typically the two of us
are doing this together
with our guests.
But I'm very happy
that we had this chat.
I learned a lot of things.
I hope this is also true for our listeners.
And Tricia,
I hope to get to see you again
at some point in the future
at some of the conferences.
And I hope to also have you back
because I'm pretty sure
you will learn many more new things
and you have more things to tell me
and our audience
that we couldn't cover today
so maybe in a future podcast.
I would love to come back.
I've had a lot of fun
talking about the stuff
that I really care about.
Thank you.
See you.
Bye.
Bye.