Two's Complement - Deploy First Development

Episode Date: August 17, 2024

Our hosts congratulate themselves on finally having decent microphones. Matt quizzes Ben on his "Deploy First" approach to software development. Ben explains branch-based deployment environments. He assures Matt he's a mortal. Matt promises to be less rubbish.

Transcript
Starting point is 00:00:00 I'm Matt Godbolt. And I'm Ben Rady. And this is Two's Complement, a programming podcast. Hey, Ben. Hey, Matt. How are you doing, my friend? I am good. And I think this time both of us have got our microphones on the right setting, pointing the right way. I'll believe it when I hear it.
Starting point is 00:00:33 You're not wrong, right? We've not had great success so far. We started out strong, and then we both upgraded microphones and things, and the number of combinations and permutations of things that can go wrong increased exponentially. This is like when you have some difficult thing to do and you're procrastinating, and you're like, I know, I'll just work on the build system or whatever. That's what we're doing instead: we were going to record a podcast, but instead we're just going to mess around with microphone setups and audio engineering. Exactly. Yeah. We've turned our podcast into an excuse to do that. That seems about right. I was meant to be writing peer reviews all day today.
Starting point is 00:01:13 And so I've been doing literally everything but that. Why is it so difficult? Why is it so hard to write things for humans when code is straightforward? You know, I'd rather sit down and write some lock-free queue than write things, nice or otherwise, about my peers. It's hard. That's why human stuff's difficult, right? Well, because there's no type checking. That's what it is. Maybe LLMs will do the type checking. Maybe that's what I should do: paste, paste. No, I should not do that. This is from me to my peers, to tell them about what,
Starting point is 00:01:51 how I think they can improve, not for some LLM to turn into word soup for 300. Anyway, this is not what we were talking about. I've just got totally sidetracked at the beginning. But, um, I was really interested, actually, in something you and I have talked about outside of this before, which is your philosophy of developing a new product or project or machine or whatever. I guess project is the main thing. Yeah. And, you know, how you approach it, right? And obviously testing is a big part of that,
Starting point is 00:02:26 but there's one aspect which I've taken on board, and I think it's really important. And you should talk about it. Yeah. You're talking about deploy first? I'm talking about deploy first. Yes. Yeah.
Starting point is 00:02:37 Yeah. I mean, generally, everything that you want to do in any sort of project is think about the risks, the things that you know that just ain't so. Right. And one of the easy things to know that just ain't so is, oh yeah, we'll just deploy this out to whatever thing. Right, right. We'll put it on a computer somewhere. I have, you know, project.py, or I just need to copy it somewhere and then just run it, right? How hard can it be? Right, exactly. So that is a place where, I think, there's a general pattern to these things, which is that it usually makes sense
Starting point is 00:03:28 to confirm easy things are actually easy. And one of those things should be deployment. Deployment should be easy, because hopefully you're going to be doing it a lot. So start with deployment when it couldn't possibly be easier, when you have one line of code that writes "hello world" to your logging system. Hello, literally. Yeah, yeah, yeah. Even if your logging system is just standard out for now, right? It's like, well, what could possibly be easier than that? Well, if it's so easy, then why don't you start with that? And then this little demon shows up and says, well, what if it's not easy? What if it's a huge pain in the butt? What if I just do it later, when I don't have to worry about it, after I have written the fun lock-free
Starting point is 00:04:09 queue and all of the other cool stuff, and I don't want to have to deal with it? Pretty quick, no. Right, exactly. It's messy, and it's a bit like writing reviews for humans, right? It can go wrong in so many different ways. And it's not the beautiful platonic ideal of my editor and my TDD-based workflow. It's like, no, some bits get tarballed up, or CI has to be working, or I have to copy it to a machine and get the environment set up and all of the things. Yes, yes.
Starting point is 00:04:43 And the more stuff you have to deploy, the harder it's going to be. It's the classic thing: it's really hard to build big, complicated things. You build big, complicated things by building tiny little things and combining them all together. You know, no one likes to review the 10,000-line PR. No one wants to be on rotation when the huge change that someone worked on for three months rolls out after they've gone on vacation. Right. You want to do things one tiny piece at a time. And when it comes to deployment, the best way to do that is to take the dumbest possible thing that could even remotely smell like the thing you want to build, and deploy it.
Starting point is 00:05:20 Right. Is it a web service? Okay, cool. Get some basic web server in place, make your hello-world.html, deploy it, and make sure that you can pull it up in a browser and that it works. Is it some running service? Okay, cool. Is this service actually running? If you kill it, what happens? Does your alerting work? If you go onto the machine it's running on and do a pkill, do you get the alert on your phone saying that it shut down?
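Here is a minimal sketch of the "dumbest possible" web service described above. It assumes Python's standard-library `http.server`. The handler name, the port, and the `main()` entry point are illustrative, not anything from the hosts' own systems. It logs one line and answers every GET with "hello world", which is just enough to exercise the deploy, the networking, and the logging path.

```python
import http.server
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("hello_service")


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Answers every GET with a fixed body; there is deliberately nothing else."""

    def do_GET(self) -> None:
        body = b"hello world\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> http.server.HTTPServer:
    # Port 8000 is an arbitrary assumption; port 0 asks the OS for a free port.
    return http.server.HTTPServer(("", port), HelloHandler)


def main() -> None:
    log.info("hello world")  # the one line that proves logging is wired up
    make_server().serve_forever()
```

Wire `main()` up as the entry point under whatever deploys your services. You can then load it in a browser, or pkill it to see whether an alert fires. Either way, you exercise the infrastructure before there is any real code to debug.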
Starting point is 00:05:48 Because you're assuming that that infrastructure is in place. And if it's not, or it doesn't work the way you think, or it's broken, or it hasn't been configured yet, there's going to be no easier time to troubleshoot that than when you have one line of code. Right? So I'm
Starting point is 00:06:04 a big fan of starting out with this, because it fits this model of, well, if this is so easy, then let's just do it right now. And if it's not easy, then we'll know about it sooner rather than later, and we can account for it. The other thing it lets you do is, as you're making changes to the code, you realize, oh, we're going to need access to a large disk because we need to cache some things on disk while we're processing. Okay, cool.
Starting point is 00:06:31 You're rolling out that change as one atomic change, rather than getting to the end of some big development cycle and going, okay, we're going to need a big disk, we're going to need two network cards, and we're going to need a backup server over here in hot deployment mode. And now you've got to configure all of that at once for this one deployment, and everything's got to work perfectly the first time. That sucks. Right. So what you're heading for is incrementality, literally from hello world, exit zero, through to a massive system that does all of these things,
Starting point is 00:07:02 but you can do it one thing at a time and be sure each thing works individually, as opposed to trying to, as you say, debug 12 different things that happen at once. There's also a lovely aspect to doing deploy first, which is,
Starting point is 00:07:13 I can usually point at some kind of dashboard with a green light that says my service is up and running. Obviously we're talking very specifically here about the kind of services in a corporate environment or similar. Obviously, if you're building a game, maybe not. Actually, no: maybe if you're building a game, you make sure that you can take your dumb APK or whatever it is, install it on your phone, and say, look, it just says "Matt's Game", and you can click the button and it quits, and that's all you can do. But now I can commit that in
Starting point is 00:07:45 the repo, and I can make my CI step do that every time: build the APK, all those kinds of things. Yeah, that's an interesting one. But I was thinking more literally of our world, where there usually is a service that you use, some kind of Docker-based thing, or Nomad, or some other homegrown system, to say, I have a new thing that's running, please make that run as well and do all the normal bits and pieces: you know, know where its logs go,
Starting point is 00:08:11 make sure they get archived, all that kind of stuff. So yeah, deploying something that I can point at, even if it's not doing anything yet, and say, well, it's running, all we have to do now is add the code. Right. You can do something. I was literally just speaking to somebody about doing this very thing.
Starting point is 00:08:27 Internally, we wanted to make a server that publishes some specific form of derived data that my team works on. And there's a team that's going to be consuming it.
Starting point is 00:08:38 And I said to the head of that team, I think I might just deploy something which publishes the value zero continuously for you. Because then I can point at it and say it's up and running, and we can get all the boring stuff out of the way: is the TCP connectivity right, are the task groups configured right, all that nonsense. And once you're receiving a bunch of zeros, later on I can replace it with, you know, some high-quality predictions for the market. Right, right. It's an exercise for the reader at that point, right? Because we all know that where the bodies are mostly buried is in the wiring.
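The "publish zeros" placeholder can be sketched as a publisher with a pluggable value source. The `send` callable is hypothetical. It stands in for whatever the real transport is, such as a TCP connection or a message bus. The function names and the one-second default interval are also illustrative.

```python
import itertools
import time
from typing import Callable, Iterator, Optional


def zeros() -> Iterator[float]:
    """Placeholder value source: 0.0, forever."""
    return itertools.repeat(0.0)


def publish(source: Iterator[float],
            send: Callable[[float], None],
            interval_s: float = 1.0,
            limit: Optional[int] = None) -> int:
    """Send each value from `source` via `send`, pausing `interval_s` seconds
    between values. `limit=None` publishes forever; returns the count sent."""
    sent = 0
    for value in itertools.islice(source, limit):
        send(value)  # hypothetical transport hook
        sent += 1
        if interval_s > 0:
            time.sleep(interval_s)
    return sent
```

Later, swapping `zeros()` for a real prediction source leaves the plumbing untouched, and the plumbing is the part most likely to be wrong.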
Starting point is 00:09:13 I do like that. So let me ask you something about that: how do you feel about continuous deployment? It seems to lead into it once you have this incrementality. If you start from deployment, do you keep it up? Yeah, absolutely. Me personally, it would have to be a very unusual circumstance for me not to take the time and energy it takes to continuously deploy a system these days. It would have to be like, there's only one piece of hardware this can run on, and it costs more than your house, so we only have one and it's in production. And so I'm like, okay, well, if that's the case, then maybe we can't continuously deploy it, but
Starting point is 00:10:04 I'm feeling we can, and we'll figure out a way. We have two computers that are a little bit like that. But one of them is a backup staging instance that we deploy to every day. Now, we don't do it continuously in our world. A lot of what we do on our side is based around the hours of a market. Starting up intraday is an important test, but it's not the same as what we normally do, which is run continuously all day, and so we want to make sure that we can in fact run continuously all day. But it's still continuous-ish. Continuous
Starting point is 00:10:35 deployment? Intermittent, regular deployment. Regular deployment, I suppose, is right. Right. You know, we have a staging environment that starts up every morning with the latest version of the code. The expectation is that if you've committed it and it passed all the tests, it goes to staging as a matter of course, and then fairly soon it'll end up in production. But the deploy first thing is something you said fairly early on in me knowing you, and it definitely struck a chord: no, that's a cool thing to do, that's an awesome thing. Yeah, so definitely worth talking about. And I really like your idea of, I'm just going to set up the service that publishes all zeros, and then you can integrate with that, because, as you were saying, you know
Starting point is 00:11:21 you want to get all of these basic things out of the way. Again, making sure the easy things are easy. Because if you deploy that, and whoever is building the consuming service goes, wow, I'm getting all ones, you know something is very wrong, and you know it has to be one of a small set of things. It can't also be bugs in your code or bugs in their code or whatever it might be. It's like, well, we just did this one stupid thing and that didn't even work, so we're clearly missing something, right? Yeah. It's much easier to debug something when it's
Starting point is 00:11:55 simple. And then you're always building on top of all of this infrastructure, right? Your deployment environment, your logging environment, any tools that you build for observability. It's really easy to just assume all that stuff works, and so you'd better make sure that it does, right? One of the things that I do in pretty much every system I build these days that has any kind of alerting built into it, which is most of them, is build ways to intentionally trigger faults. So in a web app, for example, I'll have a route that you can hit that raises an exception. And, you know, the kind of
Starting point is 00:12:37 exception that it raises might be variable. You might even be able to pass in different parameters to get different types of errors that get handled in different ways. But I want to be able to deploy my system to production, so, you know, in a continuous deployment model I'm going to make a change, or whatever it is, or just have the system running in production, and I want to hit that route that creates an exception in the production service. And then I want to see my phone light up and say there's an error in production. And I want to be able to do that at any time. If I have even the slightest whiff of a hint
Starting point is 00:13:13 that maybe the alerting is broken in some way and there's something terrible going on that I don't know about, I can immediately dismiss those fears, or most of them, by saying, well, let's cause an error on purpose and make sure that we get an alert, right? And make sure that our logs work and everything else. Of course. Yeah. For those kinds of things, you have to have a good working relationship with your operations folks, so that they know when you're about to do this. You know, maybe you,
Starting point is 00:13:36 of course, put the equivalent of your PagerDuty in maintenance mode, but you still make sure the alert appears in the UI, and then you trust that PagerDuty will in fact reach your phone. But that's a really important thing. I mean, it sort of comes back down to, you know, who watches the watchers,
Starting point is 00:13:51 you know, where do you draw the line on that observability aspect of your app? And actually, now I think about it, this is a complete non sequitur, but that kind of worrying about,
Starting point is 00:14:04 how do I test the thing that is the last line of defense? Like, when my web server has an exception. I mean, you mentioned killing the process. That's another thing you might reasonably do: log into the box, kill -9 that PID, watch it die, and then also watch your phone light up. And again, you don't want to be doing that every single day, but it's nice to be able to do it as part of your checks from time to time. Right. But yeah, how do you test that your monitoring is working? I mean, I guess with things like Prometheus and Grafana, you can just go and look at them.
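Here is a framework-agnostic sketch of the intentional-fault route described earlier in the conversation. The path `/debug/raise`, the `kind` parameter, and the fault table are all hypothetical. In a real web app you would register `handle` with your framework's router, and you would probably also put it behind authentication. Raising several exception types on purpose lets you check that each one is handled, logged, and alerted on the way you expect.

```python
import logging
from typing import Callable, Dict
from urllib.parse import parse_qs, urlparse

log = logging.getLogger("fault_injection")


class InjectedFault(Exception):
    """Marker type so alerts caused by deliberate faults are easy to recognise."""


# Hypothetical table of faults you can ask for via ?kind=...
FAULTS: Dict[str, Callable[[], Exception]] = {
    "injected": lambda: InjectedFault("deliberate fault"),
    "value": lambda: ValueError("deliberate fault: bad value"),
    "zero": lambda: ZeroDivisionError("deliberate fault: division by zero"),
}


def handle(path: str) -> str:
    """Toy request handler: normal paths return "ok"; /debug/raise raises."""
    url = urlparse(path)
    if url.path != "/debug/raise":
        return "ok"
    kind = parse_qs(url.query).get("kind", ["injected"])[0]
    make = FAULTS.get(kind)
    if make is None:
        return f"unknown fault kind: {kind}"
    log.error("raising injected fault kind=%s", kind)
    raise make()  # should surface in logs and page whoever is on call
```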
Starting point is 00:14:47 They're there. Yeah. Yeah. But, you know, there are lots of situations where you have an absence-of-evidence problem. It's like, okay, this locked thread count is always zero. Right. Have we ever seen it not be zero? Do we know that this metric works? Do we know that it's actually measuring the number of locked threads, and not some other random
Starting point is 00:15:08 variable that is unassigned or unused or whatever? Very good example as well. Because yeah, something like that is almost always going to be zero whenever you take a look at it. I mean, assuming you mean some kind of lock that you take out before you do some work. Yeah. To a first approximation, it lives at zero, but it's very important when it's non-zero. And if it's stuck at non-zero, you know there's a problem. Right, right, right.
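One way to attack the absence-of-evidence problem is to make the metric's value move on purpose. This sketch assumes that the "locked thread count" means the number of threads currently inside a lock-protected section. `CountingLock` and `held_count` are hypothetical names. Exporting the value to a monitoring system such as Prometheus is left out.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class CountingLock:
    """A mutex that also reports how many threads are inside the critical section."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._meta = threading.Lock()  # guards _held separately from the real lock
        self._held = 0

    @property
    def held_count(self) -> int:
        """The value you would export as the 'locked thread count' metric."""
        with self._meta:
            return self._held

    @contextmanager
    def hold(self) -> Iterator[None]:
        with self._lock:
            with self._meta:
                self._held += 1
            try:
                yield
            finally:
                with self._meta:
                    self._held -= 1
```

A test can hold the lock in a background thread and assert that the gauge reads 1. That runs the "have we ever seen it not be zero?" check deliberately instead of waiting for production to do it.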
Starting point is 00:15:34 Or any kind of monitoring. Yeah. I mean, how do you test that, right? Again, this is perhaps a different thing, but you can put in a big, honking global variable. I know this is a C trick, where I'll do extern bool hack = false, and then I'll poke it in the debugger
Starting point is 00:15:54 and make one of the things actually do, like, if hack, then while hack: effectively just an infinite loop. And then I can test it exogenously, but that's not very reproducible. But then, do you really want to put code like that in production? You know, I mean,
Starting point is 00:16:14 I have to say, yeah, I don't know. I'm pretty bold about putting stuff into my systems that gives me more observability. For sure, for sure. But there's observability, and then there's deliberately putting something that's known to be broken into your code, so that you can test, live in your production system, that the thing that detects it being broken works. Yeah, yeah. And I don't know. Now that I've
Starting point is 00:16:43 said it out loud as explicitly as that, it sounds terrible, especially in the context of, say, trading systems, right? There are famous cases of trading companies that have made mistakes
Starting point is 00:16:54 of this flavor and have subsequently folded. So it's not something you want to do lightly. But on the other hand, I also don't know of a better way
Starting point is 00:17:02 of being sure, you know, Portland sure, one might say. Which is a whole other conversation, probably. That's a whole episode. That's a whole other, um. Yeah. I mean, one of the questions that I frequently ask, because, you know, it would be easy to caricature me as a person who just goes, yeah, I wrote all the unit tests and they all passed, and I'm going to deploy it to production and not think about it anymore. Not just easy, but done every day, at work and online. If we had comments for this podcast, and we don't, I'm sure that would be in the comments, right? And all of the jokes you've maybe seen about the paper towel dispenser over the trash can, where once you trigger it, the paper towels just continuously stream
Starting point is 00:17:55 out. Because it sees the paper as a hand. Yes. And the caption is "all the unit tests pass", that kind of thing. Oh, there's another one with a pipe that's got a hole in it, but there's a second hole in the pipe, and the water leaking out goes into, like, an L-shaped pipe and squirts back in. Again: everything's fine here, all the tests pass, one into the other. Yeah, yeah. But one of the things that I do, and that I ask of the people who work on my teams, and I
Starting point is 00:18:27 will gladly ask of anyone who asks me for advice, is: if you build something, if you create some new capability or functionality within your system, have you ever actually seen it work? Like, in a live running system? It doesn't necessarily have to be the production system. Depending on exactly what it is, it could sometimes even just be your workstation. But have you ever actually seen it work? Because if you haven't, you have no reason to believe, however many unit tests you've written,
Starting point is 00:18:57 that it does. That's a really interesting point. Yeah. And so whatever you need to do, whether that's poking variables into the system or any of these other things, whatever you need to do to create the behavior, to create the effect that you just spent days, weeks, months trying to build, do it. Yeah. And design the system so that you can do it, and design the deployment system so that you can see it happen. Right. All of these things need to be done.
Starting point is 00:19:27 If you have this model of, okay, I'm going to just write a bunch of unit tests and continuously deploy my system into production, and it's all just going to work great, and I never have to check anything, and I never have to do anything that even smells like manual testing: I've never been able to develop software like that, and I love tests. That's really, no, that is genuinely very interesting to hear, and heartening, as a mortal who's not quite, uh, not quite in this situation. But, you know, I've always felt slightly dirty doing that. I feel like I'm admitting something here, but, you know, the fact that I even tell you these tricks that I have, right,
Starting point is 00:20:14 like, well, this is a specific deploy, I'm going to do it just to see. Because it feels ephemeral, and it feels like it could break again because it's not automated, and so I know that's sort of wrong, on the one hand. But if I take the other side and play devil's advocate against my own feeling, then seeing it work once is still infinitely better than deploying it and never having seen it work at all, right? Even if it breaks the next day, what you've done is show that the on-ramp works. And after that, you kind of assume, maybe too much, that transitively you've tested all the bits around it, and then no one should break the on-ramp, or whatever the thing is that actually causes it to happen. And obviously if you can
Starting point is 00:20:56 contrive it to happen in a controlled way, then you can have tests do it, but that's one thing. But yeah, as I say, I've always felt dirty, like I'm doing it wrong, when you go, oh, you know what I'm going to do, I'm going to put a massive sleep in this thread and comment it in and out, just so that I can show that my slow thread detector fires. Because otherwise, you know, I can write all the tests in the world, and all I'm really doing is showing that when my mock returns a time, I appropriately log it. That
Starting point is 00:21:25 doesn't feel very satisfying, right? Right, no, absolutely. The only thing I would be concerned about is if you now feel like you have to do that sort of, what I would call, exploratory testing every time you make a change, right? Yeah. Like, I don't feel comfortable deploying this out to production until I go do the sleep thing again. Right. Yeah. That's telling you that you probably need to write some tests here. Or maybe you need to design this in a way where it's easier to test, or in a way where it's observable. Maybe you're not testing it with automated tests, but you have some test environment that can simulate or reproduce it, and you get that out of there. Because, you know, if you're stuck in this world... I mean, the purpose of this kind of "have you seen it work once?" question goes right back to what we were talking about at the very start of this conversation: it's what you know that ain't so that gets you in trouble. Right.
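One way to avoid repeating the "massive sleep" experiment before every deploy is to make the timing source injectable. A test can then fake a slow operation deterministically. This is a sketch under assumptions: `SlowOpDetector` and its injected `clock` are hypothetical. As the conversation points out, a fake clock only shows that the detection and logging logic works. So it complements seeing the detector fire once for real; it doesn't replace that.

```python
import time
from typing import Callable, List


class SlowOpDetector:
    """Times an operation and records an alert if it exceeds a threshold."""

    def __init__(self, threshold_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_s = threshold_s
        self.clock = clock  # injectable so tests can simulate slowness
        self.alerts: List[str] = []

    def run(self, name: str, op: Callable[[], object]) -> object:
        start = self.clock()
        result = op()
        elapsed = self.clock() - start
        if elapsed > self.threshold_s:
            # A real system would log or page here; we record it for inspection.
            self.alerts.append(f"{name} took {elapsed:.3f}s")
        return result
```

In a test, a clock that returns 0.0 and then 5.0 makes one call look five seconds long without sleeping. A later call that sees 10.0 and then 10.1 stays under a one-second threshold and produces no alert.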
Starting point is 00:22:25 Yeah. And seeing it work is an opportunity to prove that you're wrong, that your assumptions about how things work aren't so. And if you pass on that, you're just waiting to be wrong in a fantastically horrible way. Right. But what you don't want to do is turn that into a crutch, right? Once you've done it one time, once you've had that opportunity to disprove yourself and failed to, if you feel like you have to do it over and over again, you're missing tests, or you're
Starting point is 00:22:56 missing some other form of feedback that you need to create, because it's not scalable to do that for every piece of functionality in your system every time you change it. Exactly. It's the document that says this is how you're meant to break it. Yeah. I mean, I could probably make a case, in very extreme circumstances, where you might need to do that. If your code is, say, extremely performance-sensitive, you can't have loads of if statements in the middle of it for all these different things. Then maybe. But then typically you've already paid a massive cost to develop that thing, and you put a bloody great big comment at the beginning that
Starting point is 00:23:31 says nobody touches this code under any circumstances without running these tests, and it's tucked away in the corner of your code base. Again, the lock-free stuff I was alluding to earlier. Exactly, exactly. You spend ages getting it working, and it's almost impossible to test that it's completely right under all circumstances because, you know, it's that kind of multi-threaded thing. But yeah, it has that flavor of,
Starting point is 00:23:56 well, once you do get it working at great cost, you don't touch it, and then it's fine. And if you do touch it, maybe you have some extra manual steps around it. Right, right.
Starting point is 00:24:04 And I have those things too. I have things in my code base right now that are, you know, little functions, little main entry points, that run a whole bunch of multi-threaded code in a tight loop. The hope is that if there are any concurrency issues in there, we'll find them, knowing full well it isn't going to save us. But at least if we detect one that way, we've found one. Right. Yeah, exactly. That's exactly the kind of thing. You know,
Starting point is 00:24:29 do you want your unit tests to sit there for six minutes while they run every possible combination? No, nobody wants that, but you do want them to exist. It's nice to have them, maybe run daily, or maybe just run when you make changes. I just run them when I change it. Yeah. Yeah.
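Here is a small stress harness in the spirit of those tight-loop entry points. Several producer threads hammer a queue while one consumer checks that nothing is lost, duplicated, or reordered within a single producer's stream. `queue.Queue` stands in for the hypothetical lock-free queue under test. The thread and item counts are arbitrary. As the hosts say, a pass proves little, but a failure is a real bug found.

```python
import queue
import threading
from typing import Callable


def stress(make_queue: Callable[[], queue.Queue] = queue.Queue,
           producers: int = 4, items_each: int = 10_000) -> int:
    """Hammer a queue from several producer threads; one consumer checks that
    every item arrives exactly once and in order per producer."""
    q = make_queue()

    def produce(pid: int) -> None:
        for i in range(items_each):
            q.put((pid, i))

    threads = [threading.Thread(target=produce, args=(p,)) for p in range(producers)]
    for t in threads:
        t.start()

    next_expected = [0] * producers
    for _ in range(producers * items_each):
        pid, i = q.get(timeout=10)  # raises queue.Empty if items go missing
        if i != next_expected[pid]:
            raise AssertionError(f"producer {pid}: expected {next_expected[pid]}, got {i}")
        next_expected[pid] += 1

    for t in threads:
        t.join()
    if not q.empty():
        raise AssertionError("unexpected extra items left in the queue")
    return sum(next_expected)  # total items consumed
```

Running it in a loop from a small main entry point, whenever the queue code changes, raises the odds of shaking out a race. It is not a proof of correctness.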
Starting point is 00:24:44 Because again, it's a very small amount of code under test there. Right. It's a very specific thing. And you design the system so that it stays very specific. Yeah. You don't scatter it across the whole code base. You find the abstraction that means the horrible code lives in one place.
Starting point is 00:24:58 And then you test the crap out of it using non-traditional means, and then you say, all right, we're done here, dust your hands off, and go, I hope we never have to touch that again. Right, right. It actually reminds me of the cliché about putting all your eggs in one basket, right? The original version of that cliché had more to it: put all your eggs in one basket, and then watch that basket. Interesting, yeah.
Starting point is 00:25:31 It's you can, you just watch the basket, right? Yeah, you consolidate your risk into one place where you know where to look as opposed to scatter it throughout your code base in this particular instance, yeah. Which, I mean, we've seen the number of times, you know, like anytime you have some complicated um that you have to do exactly right and then if you have to scatter it through your code base then you probably designed your apis wrong it's much better
Starting point is 00:25:54 to put it in one place and have an API that keeps the awkward thing in one place, so that when you inevitably get it wrong, you only have to fix it in one place as well. Yeah. No, that's cool. I mean, that's interesting. What I thought of when you were talking about the manual testing and stuff, and if you've never seen it fail,
Starting point is 00:26:17 or succeed, sorry, if you've never seen it succeed: that's kind of inverted from literally every test I ever write. I start up my IDE, open a new file in the relevant place, and do whatever boilerplate I need to get an empty test. The first test I write is def should_fail, which does assert False, and then I hit the button that says run my tests. And sure as heck, it should fail on the line that says assert False. Because of the number of times that I've misspelled the word test in the file name, like Tset or something like that,
Starting point is 00:26:47 or some other thing that means it doesn't treat the file as a test. It just runs it as a regular Python file, and there's nothing to do because it's just a bunch of defs. Or it's a C++ thing, or whatever. Any number of reasons why it doesn't actually execute the file as a test can give you the most horrible sense of false security. You're like, hey, I'm writing these tests one after another, not a single one has failed, I am great. Yeah, right. I am a superstar. Oh my gosh, this
Starting point is 00:27:15 day will go down in history as the day that I got everything right. And then you realize you started by misspelling the file name of the test, and now you feel very, very silly. So yeah, you might as well just throw it all out at that point. Yes, to be honest, there's not going to be anything good in there. Yeah, absolutely. But yeah, I think all of the deploy first stuff comes from a base philosophy, right? And it's the same philosophy that tells you to make sure your tests fail the way you expect them to, right? You don't have any behavior yet, you run your tests, and they fail, and they fail the way you expect. If you build something, how do you know
Starting point is 00:27:57 that it actually works? Have you ever seen it work? Well, you need to figure that out. It's like a scientific mindset, right? It's like, how am I going to prove that I'm wrong? And you think about those things in a very systematic and continuous way: with every little step you take, how am I going to prove that I'm wrong here? Right, right. And I think, yeah, it's very easy to fall into the trap of asserting the behavior that you know to be right, because you're still... I mean, that still finds a remarkable amount of idiocy in my own code. Oh yeah. The number of times I'm like, oh yeah, how could there possibly be a bug in this thing? It's so easy to write a test for it.
Starting point is 00:28:34 It's almost rude not to. And then you find the bug anyway. Oh gosh. But then, yeah, to turn it on its head and say, okay, I've written this list processor or this lock-free queue, how could I show that it isn't working? Or even, how could I show that it is working? As you say, spawn your 12 threads, get them dumping stuff in, and then make sure you can read it out the other end, or whatever.
Starting point is 00:28:55 and then you get them dumping stuff in and then you make sure that you can read it out the other end or whatever. And then, yeah. And that's, yeah, that's cool. But this was meant to be deploy and then we turned it into test, which is the way we do these things. But they go hand in hand, I think.
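The misspelled-file-name trap mentioned above is easy to reproduce with the standard library's `unittest` discovery, which by default loads only files matching `test*.py`. pytest's defaults, `test_*.py` and `*_test.py`, behave similarly. This sketch writes a deliberately failing canary test into a temporary directory and runs discovery on it. The file names and the `run_canary` helper are illustrative.

```python
import io
import os
import pathlib
import sys
import tempfile
import unittest

# A canary test that must fail. If the runner reports no failure, it never ran.
CANARY_SOURCE = (
    "import unittest\n"
    "\n"
    "class Canary(unittest.TestCase):\n"
    "    def test_should_fail(self):\n"
    "        assert False, 'canary: seeing this fail means the file is collected'\n"
)


def run_canary(filename: str) -> unittest.TestResult:
    """Write the canary under `filename` and run default unittest discovery."""
    module_name = pathlib.Path(filename).stem
    with tempfile.TemporaryDirectory() as d:
        pathlib.Path(d, filename).write_text(CANARY_SOURCE)
        try:
            # A fresh loader, so no state leaks between calls.
            suite = unittest.TestLoader().discover(d, pattern="test*.py")
            return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
        finally:
            # Undo discovery's side effects so repeated runs start clean.
            sys.modules.pop(module_name, None)
            top = os.path.abspath(d)
            if top in sys.path:
                sys.path.remove(top)
```

Under the correct name, the canary runs and fails. Under a misspelled name such as `tset_canary.py`, zero tests run and nothing complains, which is exactly the false sense of security described in the episode.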
Starting point is 00:29:09 It's about discovering the things that are broken as soon as possible. Right. Howsoever they are. And then developing confidence that it is actually working. And then, yeah, if you're deploying your software first, then you can always point to something. Like, my CEO has been asking me, when is this thing going to be up?
Starting point is 00:29:25 And I can point him at the version that's publishing zeros and go, like, we're only a configuration file away from this publishing something important instead of a bunch of zeros. But the plumbing's all done. And that's pretty satisfying. Yeah, yeah. Well, and relating some of what we were just talking about back to deployment: one of the things that you may discover, if you start asking the question "have you seen this work?", is that you need an environment to watch it run, right? Ah, yeah, yeah. And sometimes that can be your production environment, but, um, it's better if it doesn't have to be, right? Right. And so, you know,
Starting point is 00:30:05 I think we've talked on a couple of episodes before about, you know, branch-based environments and things like that. I think so. If we haven't, we should, because it's a very good topic. Let's assume we have. Yeah.
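Going back to the version that publishes nothing but zeros: here is a minimal Java sketch of that idea, assuming a pipeline with a pluggable value source feeding some transport. `ZeroFirstPublisher`, `Source`, `Sink` and `sourceNamed` are all hypothetical names. The point is that the deployment and plumbing are real from day one, and moving from zeros to real values becomes registering the real `Source` and changing one configured name.

```java
import java.util.Map;

/** Deploy-first sketch: the plumbing is real, the values are placeholders until a real source is configured. */
class ZeroFirstPublisher {
    interface Source { double next(); }            // hypothetical: what the real calculation will eventually implement
    interface Sink { void publish(double value); } // hypothetical: stand-in for the real transport (topic, socket, ...)

    /** Placeholder source that lets the whole pipeline be deployed and watched before any real logic exists. */
    static final Source ZEROS = () -> 0.0;

    private final Source source;
    private final Sink sink;

    ZeroFirstPublisher(Source source, Sink sink) {
        this.source = source;
        this.sink = sink;
    }

    /** Chooses the source by its configured name, failing fast if the name is unknown. */
    static Source sourceNamed(String name, Map<String, Source> available) {
        Source chosen = available.get(name);
        if (chosen == null) {
            throw new IllegalArgumentException("unknown source '" + name + "', expected one of " + available.keySet());
        }
        return chosen;
    }

    /** One tick of the pipeline: pull a value and hand it to the transport. */
    void publishOnce() {
        sink.publish(source.next());
    }
}
```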
Starting point is 00:30:16 You've got a very good setup on your current project, just as a précis, where effectively every branch in Git is its own environment. And so everything gets deployed to an environment with that name, and it's got a unique DNS name and blah, blah, blah. But it means that, for the hopefully lowish cost of whatever provider you're using as the backend for each branch, you effectively get a copy for every pull request.
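A branch-per-environment setup needs a deterministic way to turn a Git branch name into something usable as a host name. This sketch assumes one DNS label per branch under a made-up parent domain (`envs.example.com`); `BranchEnvironment`, `label` and `host` are illustrative names, not taken from Ben's project. It applies the usual host-name rules for a label: only letters, digits and hyphens, no leading or trailing hyphen, at most 63 characters.

```java
import java.util.Locale;

class BranchEnvironment {
    static final int MAX_LABEL = 63; // DNS limit on the length of a single label

    /** Maps a Git branch name such as "feature/Kafka_Topics" to a DNS-safe label such as "feature-kafka-topics". */
    static String label(String branch) {
        String s = branch.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-") // each run of other characters becomes a single hyphen
                .replaceAll("^-+|-+$", "");    // a label may not start or end with a hyphen
        if (s.length() > MAX_LABEL) {
            s = s.substring(0, MAX_LABEL).replaceAll("-+$", "");
        }
        if (s.isEmpty()) {
            throw new IllegalArgumentException("branch name has no usable characters: " + branch);
        }
        return s;
    }

    /** Hypothetical naming scheme: one subdomain per branch under a parent domain. */
    static String host(String branch, String parentDomain) {
        return label(branch) + "." + parentDomain;
    }
}
```

Truncation means two long branch names sharing a 63-character prefix would collide; appending a short hash of the full branch name is one common way around that.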
Starting point is 00:30:43 Right, exactly, exactly. And so that really gives people on my team, at least, no excuse when I ask, have you seen this work? Because if they're like, well, it's too hard to see it work, I'm like, well, then the design of the software is wrong
Starting point is 00:30:55 because everything else is set up to make this super easy. But yeah, if you don't have a way to easily run your system in a realistic environment that gives you confidence that it actually behaves the way you think it behaves, it's going to be very difficult for you to answer this question: have you seen this work? Have you seen it work, right? Well, I'm not sure how I would do that. I think that's something we should solve. Yeah, if you can't see how to do it, then that's a structural problem with either the way that the team is set up, or... I mean, and I'm now thinking that, like, this is
Starting point is 00:31:33 exactly the problem with my team right now: there are a number of things that are very Byzantine and bespoke, and difficult, so we kind of lean on the crutch of having a one-size-fits-all staging environment, where we do discover things. But because it's a tragedy of the commons, with everyone in there, if two people have made a screw-up because they couldn't test it any other way, then we've got two problems in the same environment. Right. And, you know, you say to them, well, I... and they can quite reasonably say, how could we have known this beforehand? And I'm like, well, sorry, you can't. So I'm taking a note here: be less rubbish. Yeah, yeah, this is all my fault. Well, yeah. And, I mean, you know, time in the testing environment is like time on the mainframe. You know, you've got to schedule it:
Starting point is 00:32:18 I get from two o'clock to four o'clock. No, I mean, for us there's a physical hardware component, which is unfortunate. Well, and that's... um, you know, not everything, but quite a lot of things can be run either on our local development machines or in small, cloud-ish environments. But in our particular case we have got some very obscure networking stuff. I have got a PO out to buy more hardware, so, you know, it will happen, but yeah, it's still less than ideal.
Starting point is 00:32:49 And some of this stuff is like... so let me... all right, we should probably wrap up. We're about the right amount of time in. But one of the things that I find hardest, and this is where I lean back on,
Starting point is 00:32:58 and I was actually only quoting you the other day about this, when I was telling people it was okay that these kinds of mistakes were happening (and I blamed you), is: it failed fast, right? These were things that we couldn't otherwise test. And when I say that, I mean we have configuration files that, um, have essentially the command line parameters for a whole bunch of interconnected programs that run in an environment, right? Right. You can imagine this. There are things like the names of Kafka topics. There are the names of the brokers for this environment versus some other environment.
Starting point is 00:33:32 There's every other command line argument you might imagine having. And so we have one that's like, this is the staging environment command line, this is the production command line, this is the development one, and so on, for like N things, right? And there's always a bit of wiring somewhere, right? I mean, yeah, there are ways and means of making them automated or whatever, but for whatever reason it's the place where we do go, oh, I'm going to turn on this flag for our staging environment that makes it different from production on purpose
Starting point is 00:33:57 because we're testing something that we want to run for a long time and then see if it compares well. Yeah, all that good stuff, right. But it's also the number one place to typo a command line flag name. Yeah. And you could try running it ahead of time on your dev machine, but then you will be publishing to a topic that is used in the staging environment, or God forbid in production, so you'd better make sure you don't type in that bit of the command. And so people, quite reasonably, don't. You're like, okay, I'll run it, maybe, and if I'm feeling really brave, I carefully comment out all the things that I know to be production-affecting. And obviously there's some network partitioning as well to prevent the worst of these things from happening, but it's still a risk. So ultimately, we all look at it, we go through code review, three of us stare at it, and then we commit it. And only then do we discover, oh, it's --blah_thing rather than --blah-thing.
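One way to guarantee that a typo like --blah_thing stops the program at startup, rather than being silently ignored, is a parser that rejects any flag it does not know. This is a hand-rolled sketch, not any particular flags library; `StrictFlags` and the flag names are hypothetical. It also points out the near miss when the only difference is `_` versus `-`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Minimal strict parser for "--name=value" flags: any unknown flag stops the program at startup. */
class StrictFlags {
    static Map<String, String> parse(String[] args, Set<String> known) {
        Map<String, String> values = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 0) {
                throw new IllegalArgumentException("expected --name=value but got: " + arg);
            }
            String name = arg.substring(2, eq);
            if (!known.contains(name)) {
                throw new IllegalArgumentException("unknown flag --" + name + suggestion(name, known));
            }
            if (values.put(name, arg.substring(eq + 1)) != null) {
                throw new IllegalArgumentException("flag given twice: --" + name);
            }
        }
        return values;
    }

    /** Suggests a known flag that differs only in '_' versus '-'. */
    private static String suggestion(String name, Set<String> known) {
        String normalized = name.replace('_', '-');
        for (String candidate : known) {
            if (candidate.replace('_', '-').equals(normalized)) {
                return " (did you mean --" + candidate + "?)";
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical flag names; a real program would list its own.
        Map<String, String> flags = parse(args, Set.of("broker", "topic", "blah-thing"));
        System.out.println("starting with " + flags);
    }
}
```

Feeding each environment's checked-in command line through `parse` in a unit test would surface the typo in code review or CI instead of at deploy time.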
Starting point is 00:34:52 Of course it is, but none of us could have spotted that stupid thing. So the only sort of comeback I have is that it fails immediately, at like seven in the morning when it deploys, and we've got plenty of time to go and make a patch release to fix it before we actually care about it. But, um, I'm now just trying to think how that works in, say, your branch-based environment. How do you deal with the fact that there are some configuration things that are different in production, right? You're like, okay, this one really has the dash-dash, you-are-allowed-to flag; here are the credentials for the thing that you can do. So the way that we do it is the configuration is literally just code. We don't have configuration files.
Starting point is 00:35:30 We have classes. And everything (and there aren't very many of these), everything that is, oh, this happens in production or this doesn't, is keyed off the branch name, right? Okay. And it's probably fewer than half a dozen things that I can think of. So that's... I mean, yeah, that makes sense. But yeah, and, not surprisingly, there are unit tests for all the configurations. So it's sort of like, oh, when you have this
Starting point is 00:35:57 setting set, then it creates these objects instead of those objects, and there are unit tests for all of those things that confirm it. Like, there's a suite of tests for the main configuration that's got the special bits in it. Yeah. There's a unit test for the deployed-branch configuration, which includes the main configuration but also all the PR ones. There's one for local testing, and there's one for our operational scripts and things that run in the same environment. And all of those things are unit tested. Got it. That makes sense. I don't know if that could work for us right now.
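A minimal sketch of "the configuration is literally just code", assuming production deploys from a branch called `main` and every other branch is a pull-request environment. The record, its fields and the host names are hypothetical. The episode only says that the differences are keyed off the branch name and that each configuration has its own unit tests, so `EnvConfigChecks` shows the kind of assertions such tests might make, written as plain checks to avoid depending on JUnit.

```java
import java.util.Objects;

/** Hypothetical per-environment settings; every field name and host here is made up for illustration. */
record EnvConfig(String branch, String broker, String topicPrefix, boolean publishExternally) {

    static final String PRODUCTION_BRANCH = "main"; // assumption: production deploys from "main"

    /** The one place where production-versus-branch differences live, all keyed off the branch name. */
    static EnvConfig forBranch(String branch) {
        Objects.requireNonNull(branch, "branch");
        if (branch.equals(PRODUCTION_BRANCH)) {
            return new EnvConfig(branch, "broker.prod.example.internal", "prod.", true);
        }
        // Every other branch (one per pull request) gets an isolated environment that never publishes outward.
        return new EnvConfig(branch, "broker.dev.example.internal", "pr." + branch + ".", false);
    }
}

/** Plain checks standing in for unit tests of each configuration. */
class EnvConfigChecks {
    static void run() {
        EnvConfig prod = EnvConfig.forBranch("main");
        check(prod.publishExternally(), "main publishes externally");
        check(prod.topicPrefix().equals("prod."), "main uses production topics");

        EnvConfig pr = EnvConfig.forBranch("fix-flag-typo");
        check(!pr.publishExternally(), "PR environments never publish externally");
        check(!pr.broker().equals(prod.broker()), "PR environments never share the production broker");
        check(pr.topicPrefix().equals("pr.fix-flag-typo."), "PR topics are namespaced by branch");
    }

    private static void check(boolean ok, String what) {
        if (!ok) throw new AssertionError("config check failed: " + what);
    }
}
```

Because a flag typo here is a compile error and every branch's settings are exercised by the checks, the "wrong broker with the right channel" class of mistake becomes something a test can assert against.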
Starting point is 00:36:31 One of the things that I like about the command-line-flag-based version is that we use it locally a lot as well. So I want to run something that looks like the staging environment, but I'm going to make tons of changes to how it looks. And then I can also paste that into Slack and say, hey, this is a reproducer for that issue; you can run this locally and it'll go. Whereas if it was a local config that I actually had to edit, I'd have to check it in somewhere and say, you need to pull this version. But yeah, that's an interesting way of solving this problem. Yeah, it's interesting, because on this project, and this is not something that I have done before, I think we talked about this a little bit
Starting point is 00:37:08 in the transition-from-Linux-to-Mac episode, where we sort of redid all of our operational things in Java, because we were forced to when we were changing operating systems. Right, and rather than running, you know, sed and awk and whatever, you're like, well, I'll just write the three lines of Java that do it, and then it works on both operating systems.
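Replacing a shell one-liner with a few lines of Java might look like the sketch below. `sed -i` is a concrete example of the portability problem: GNU sed accepts `-i` on its own, while the BSD sed that ships with macOS expects a suffix argument (possibly empty, as in `sed -i '' ...`). `ReplaceInFile` is a hypothetical name, and the sketch assumes UTF-8 text files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

/** Portable stand-in for `sed -i 's/REGEX/REPLACEMENT/g' FILE`; the name and argument order are illustrative. */
class ReplaceInFile {
    /** Replaces every match; $1, $2, ... in the replacement refer to capture groups (like \1 in sed). */
    static String substitute(String text, String regex, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    /** Rewrites the file in place, assuming UTF-8 text. */
    static void rewrite(Path file, String regex, String replacement) {
        try {
            String text = Files.readString(file, StandardCharsets.UTF_8);
            Files.writeString(file, substitute(text, regex, replacement), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 3) {
            System.err.println("usage: java ReplaceInFile.java REGEX REPLACEMENT FILE");
            System.exit(2);
        }
        rewrite(Path.of(args[2]), args[0], args[1]);
    }
}
```

With the single-file source launcher in JDK 11 and later, this runs directly as `java ReplaceInFile.java 'staging\.' 'prod.' some.conf` on either operating system, with no separate compile step.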
Starting point is 00:37:28 Yeah, yeah, yeah. And one side effect of that is that we have very much turned into this (for better or worse; I'm not saying this is a good idea, I'm just saying this is what we did), this thing where, you know, the way that we do things intersubjectively on the team is someone writes a little Java main function
Starting point is 00:37:46 and then they check it into a branch and they're like, here's this thing I'm trying out. And sometimes we actually wind up copying and pasting that code. It's like, I want to try this over here. I'm just going to copy and paste it and run it, which is a little more complicated than copying and pasting the command line args,
Starting point is 00:38:03 but maybe, I don't know. But the sort of environmental shift that we had to make to be able to develop on Macs has forced us into this mode, which I've never done before. But it works okay. Like, I don't have any major complaints about it. Um, but one of the things that it does do is allow us to leverage all of the regular environmental tools and libraries and checks and everything else that we have, where if you make a configuration for one of these scripts, it's really clear what it has access to and what it doesn't, right? Yeah, that's pretty bulletproof. Yeah, yeah. No, you've just triggered a memory, actually: at Google there were definitely some tests that would take the string and run it through the command line parser
Starting point is 00:38:50 and then make sure that the output was what you expected, that kind of level of thing, for testing things like this. And it's a similar flavor. But some of these things have complicated interdependencies, where you'd be like, well, you know, even if you got it right, you might be using the right channel but the wrong broker, and that's hard to write a test for. I don't know. Maybe I'm making an excuse. I am making excuses for myself, but, um, you've definitely given me a lot to think about. There are some definite improvements we can make. I mean, as there always are with these types of things, there's pretty much an infinite number of improvements that you can make. Yeah. So I think
Starting point is 00:39:25 given the time, we should probably leave it at that. Um, this has been incredibly useful, and I think, you know, I can definitely claim this time back, as in company time. This is a perfectly good use of company time: to teach me some ideas about how to improve my setup, and then hopefully to tell our listener the virtues, the many virtues, of deploying the hello world app before you've even written the rest of your code. Yes. And then make it fail. And then make it fail, yeah. Cool, cool. All right, well, um, I guess we'll leave it there until next time. Until next time. You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Find the show transcripts and notes at www.twoscomplement.org. Contact us
Starting point is 00:40:16 on Mastodon: we are @twoscomplement@hachyderm.io. Our theme music is by Inverse Phase. Find out more at inversephase.com.
