Two's Complement - Copypasta

Episode Date: December 19, 2023

Matt and Ben talk about when it's OK to copy and paste code. Matt explains how helpful compilers take the time-saving step of copying and pasting code for you, saving you precious microseconds. Ben re...calls things from the 80s, like word processors and Indiana Jones.

Transcript
Discussion (0)
Starting point is 00:00:00 I'm Matt Godbolt. And I'm Ben Rady. And this is Two's Compliment, a programming podcast. Hey, Ben. Hey, Matt. How are things going? Good. Good.
Starting point is 00:00:23 It's that time of the week again when we like to record a podcast. We're definitely doing that right now. That is exactly what I'm doing. I hope you're doing it too, otherwise this is going to be a very one-sided conversation. Wait, let me check. Yes. Okay, that's good. Now, as our regular listener will know, we put so much care and effort into every one of these.
Starting point is 00:00:46 We rehearse over and over again until we're word perfect. We have a clear idea of exactly what's going on. Then the script is prepared and edited well in advance. Using word perfect. Using word perfect. Oh, my gosh. Using a document editor from the 1980s. That's what we do.
Starting point is 00:01:08 That is quite something. Yeah. Actually, all right. We're already derailed. I was padding for us coming up with an idea about what this episode is going to be around. And now I've completely derailed. I was watching an American YouTube channel where they do up computers, like old retro computers, and they'd gotten their hands on a BBC Master,
Starting point is 00:01:29 which was the kind of enhanced version of the BBC Micro, my favourite computer, and was in fact actually the computer that I had growing up. And it comes with a built-in set of ROMs, and one of the ROMs is in fact a simple word processor. And seeing him discover that and then kind of trying typing it was it was so frustrating screaming at the screen go no no you press this key to get out of it although you do this type of thing you know like it's like watching someone
Starting point is 00:01:53 use vi and be confused as heck and you're like no it's obvious oh if you were actually using computers in the late 80s that so right yeah so we're not using star view whizzy yeah before whizzy wig right oh my gosh whizzy wig people don't even probably know what that means because it's like there wouldn't be a distinction in their minds to be like yeah if you're writing like a google doc you just type the thing and that's what it looks like right isn't that funny yeah like presumably like mid late 80s through i I guess, mid-late 90s, WYSIWYG was a thing, and it was like a defining difference between applications.
Starting point is 00:02:30 This one, and for those few listeners that don't know what WYSIWYG is, stands for what you see. The teenage child of our listener. The teenage child of our listener. What you see is what you get, which is to say if you edit stuff in a document in word or google docs whatever as you're typing it that's how it's going to appear on the printer right you're just writing it and it's kind of obvious it's so obvious that why would you have ever done it any
Starting point is 00:02:56 other way but back in the day of of like text-based computing very often you would have sort of essentially html style markup code around your your document and then only when you sent it to the printer did you actually discover what it was going to end up looking like you had to render it non-interactively and so there was this big difference between uh those that could show you interactively what your document would look like and those that essentially uh had to do it uh offline and then of course the response to that from the hypothetical teenage child of our listeners would be like why would you want to print it i suppose so i suppose so yeah i mean although i uh i had an uh an interview today and i was trying to print out uh the candidate's resume and twice i sent it to
Starting point is 00:03:46 my printer and twice it didn't appear and i gave up because i hadn't got time and then yeah like a caveman or no i guess not like a caveman like a modern person i had to look at the resume on a screen next to them which uh which is better but i can't doodle on it and that's i think the main reason why I print out resumes is not to actually... It's the therapy of having something that I can idly doodle upon. Right. While you're listening to them drone on about their various accomplishments.
Starting point is 00:04:17 Drone on is not... No, it was more than that. I don't want to make it sound like our interviewing process is dull. And this is not even what this episode is supposed to be. We are, what, five minutes in? This has been a hell of a tangent. This is a tangent. So what do you want to talk about today, Ben?
Starting point is 00:04:33 I think we were going to talk about copying and pasting code and when it is actually a good idea to copy and paste code. Is it ever a good idea to copy and paste code? I think we can make the... Go on, convince me. I'm going to say every bug I've ever written is a result of copy pasting some bit of code from somewhere else
Starting point is 00:04:56 and not really thinking about it. No, let's just start with that, though. When is it a good idea? When is it a good idea? So I can think of some maybe more mundane some less controversial examples of this where when you are copying and pasting from an example or a tutorial to explore something into like a repl or to see how something works is a form of experimentation right right? There's no intention that the code that you write
Starting point is 00:05:27 is going to make its way into any real live system. That's not why you're writing it. You're writing it because you have some API or some behavior that you don't understand, maybe even just part of the language, that you don't understand and you're trying to understand it, right? I think you've built an entire website around this concept. There may be some amount of that truth that i yeah um and so copying and pasting code in in that context i think is
Starting point is 00:05:51 obviously a good idea um i think people can sometimes sometimes get into trouble when they're like all right well i've got this copy and pasted code that i maybe hacked a few pieces in uh and now i'm done right no further work is necessary from this point on and that is definitely the the knee-jerk mental image i had of what i was thinking when i'm saying i copy paste my code and then yeah you don't read it properly you don't understand it properly and you move on yeah so i mean you know i think the more the more interesting places that you can you can kind of talk about this and think about this are uh when you've you've moved beyond that stage if we're not exploring like the goal here is actually to write real production going to be used by other people code? And when is copying and pasting code
Starting point is 00:06:45 a good idea in that case? I think a more controversial sort of second condition that I could probably throw out here for discussion is hinges on the definition of duplication. So I got to talk for a little bit about why, to explain this,
Starting point is 00:07:02 I got to talk for a little bit about why copying pasting code is generally bad. Yeah, i guess that's that's not a bad idea given that we just assumed i just asserted it was terrible and we didn't back it up in any meaningful way so yeah what yeah why is it bad so one way it can be bad and one way it can create like fairly serious problems is when you take a decision in a in a system that is being made in one place, and now all of a sudden you are making it in two places. And if those two places are ever out of sync, that will create a bug in your system, right? Like imagine like the way that you, you know, calculate interest for an account, and that's in one place. And then you, you know,
Starting point is 00:07:44 all of your interest calculations use this one algorithm and they're all consistent, right? And so the amount of money that you transfer from account A to account B is consistent with the statements that you send out to your customers saying like, you earned this amount of interest. This is the amount that shows up on your statement. This is the amount that we sent to the ancient bank transfer API that actually does the transfer from point A to point B. Raining code. And now, yes, exactly. And now these things are in agreement. And if you were to duplicate that code and have the code that produced the customer statements be different than the code that did
Starting point is 00:08:18 the bank transfer, they could get out of sync. And if they did, that would be a very serious bug. Yes. Right? Agreed. And so that kind of duplication is harmful. And we do our very best, I think, as software engineers to try to avoid that because we know it's harmful. And when you're just copying and pasting code, the question that you should be asking yourself is, am I doing this? Am I creating this situation now where if you change one and you forget or neglect or just
Starting point is 00:08:46 aren't aware to change the other, is that going to create a bug? Given that, I think there are situations where the opposite is kind of true, where it's sort of like, I have this piece of code and I need to do something similar but different in this code. So what I'm trying to do is I'm calculating an interest rate, but for a completely different purpose. Maybe I'm calculating a bond yield rate or something like that. So it's different enough to where it's like, yeah, there's some sort of similarity with this code, but it's not the same thing. And it is certainly
Starting point is 00:09:29 not the case that if I changed my, you know, bond interest calculation, that I should automatically and always change the customer bank account rate interest calculation, right? Those are two separate things with two different reasons to change. And so a reasonable approach to implementing that new functionality might be, depending on the structure of the code, to copy that interest rate calculation code, to drive out the new behavior that you want with tests, to delete the behavior that you don't need, and make a copy of it and be like, now this is our interest, this is our bond algorithm, and this is our customer account algorithm. And they're different and they have different reasons to change and they have different behaviors as specified by these tests. The code maybe started
Starting point is 00:10:13 out the same and maybe right now has some sort of accidental similarity between it. And you might even discover after going through that whole process that there is some essential similarity that really is the same. It's like doing any kind of interest rate calculation where you're just multiplying a balance by a rate. It's like, okay, fine. Maybe that gets pulled out into its own thing that is separate from those two things that is tested in its own way.
Starting point is 00:10:41 There's more of like a library. And that is something that you don't want to duplicate. But I would almost argue that it is totally reasonable to start with duplicating those things on purpose because they have different reasons to change, making them be the best form that they can be, and then looking at them and being like, is there any actual duplication between these two things at that point? And how do I remove that duplication? Right, that makes a lot of sense to me.
Starting point is 00:11:12 Definitely the argument I was going to make that wasn't as maybe trite as like the interest rate itself is the calculation piece of code. And if you copy the calculation code and you have it in two places, then a bug fix or a performance optimization in one doesn't necessarily make it to the other so that would be my that would just be a pile on to your argument about don't duplicating it but yeah that second reason
Starting point is 00:11:36 where it is not copy pasting as an ends it's not like i need this in two places it's like i'm starting with something that's similar and i don't want to have coupling between these things on purpose i am choosing to essentially branch the code or a little facet of the code and say these they they share an ancestry untracked as it may be in source control but they share an ancestry because they have a similar job to do but i am deliberately going to evolve them in different directions and that to me is actually more of a a point is the um sometimes by reusing a piece of code you're introducing coupling between components that otherwise wouldn't share so much coupling in code which maybe hurts you down the line um in so to try and sort of think
Starting point is 00:12:27 of a concrete reason uh example of this is certainly if you are planning on moving two systems in two different directions the fact that they share a piece of common functionality might make it more complicated and harder to test if it has to do essentially two jobs and the jobs are disjoint or partially disjoint. And at some point, obviously you want to be able to extract a little bit that is common, truly common.
Starting point is 00:12:49 And maybe that's your like util library dot function, you know, the horrific naming of whatever that part is notwithstanding, but the, the software junk. Yes. I mean, we used to have,
Starting point is 00:13:00 we used to have something. Yeah. C cruft was all of the crufty bits of code that you needed in C just to get things to work. Naming is difficult, right? We know that. It's been well documented that naming is hard, but we can try. It can and should do better. That said, though, yeah, by deliberately choosing to copy lets the two copies evolve in their own directions. And that's maybe a boon
Starting point is 00:13:26 because you're not weighing down one implementation with the changes and the modifications to support functionality it doesn't care about, which maybe bloats its API, makes it harder to test, makes it less performant. But it is a trade-off between, well, what if there is a core bug in that feature? How will you get the fix over to the
Starting point is 00:13:45 sort of equivalent in the other piece of code yeah yeah for me it comes down to this sort of litmus test of like if these two things don't change in lockstep is that a bug or a feature right right because it could be either depending on what they are context dependent yeah yeah another thing that i i think is related but not the same thing as we're talking about here, is a situation where you are using duplication as a way to evolve the design of a system. And I actually have this right now in one of a single threaded process that runs on a local machine to a distributed process that runs inside of a work queue. And the implementation of the algorithm is the same in both cases, but the sort of scaffolding around it is completely different. And we could have kind of tortured the design a little bit to unify the duplication of those two things. But the intention is to get rid of the single threaded local version and only have the distributed version. literally have two copies, essentially, of this class in the system right now. One is called job, and the other one's called job new. And job new is pretty much a copy and paste of job with
Starting point is 00:15:12 all of the changes necessary to make it run inside of our work queue. And the intention is we're going to delete job and then rename job new to job when when this transition is complete now this is definitely something that takes discipline within the team because if you don't have that discipline and you get pulled off onto other things and then this sticks around new job new job new new and then job new new and then job new new new right and it just gets terrible but you know for teams that that have that discipline and are able basically like, I say discipline, but really it's just like, do you have control over your own priorities? Right. Because if you have control over your own priorities, then you can just decide that you want to do this and then you can do it.
Starting point is 00:15:55 If your priorities are set by somebody that doesn't have an understanding of how your source code works, then I wouldn't do this because you will easily find yourself in a situation. It's like, well, we have three new job news and they're all terrible and I don't know what to do now. Yeah. But the technique itself, I think, for teams that can do this is a very valuable one. There's sort of a form of this that I do locally. This know, this is one that's living, you know, through multiple PRs and multiple commits. And, you know, it's probably going to be a few weeks before we fully make this transition. But there are definitely situations in which I will do this exact same thing, sort of like within the span of a few hours, where I'm trying to change the implementation of something in, you know, to add some some new behavior
Starting point is 00:16:48 some performance characteristics whatever it might be and uh you maybe have heard me say this before this metaphor of like the bag of sand um as in indiana jones yes as in you know for for uh the folks who don't know what whizzy wig is yes for those who don't know what Whizzywig is, let me also explain to you the first Indiana Jones movie. There's a scene at the very beginning of the Indiana Jones movie where he's in this lost, forgotten temple. And he's trying to recover this golden statue before it gets stolen by this other person who's just going to sell it. It belongs in a museum! You know, that scene, if you've ever seen it. And there's traps all to sell it. It belongs in a museum. You know, that scene, if you've ever seen it. And so he, and there's traps all around this place,
Starting point is 00:17:29 and he's trying not to get killed. And he knows that there's some weight sensor mechanism thing in this pedestal. And so he's sitting there staring at this golden idol with a bag of sand in his hand, trying to figure out what the weight of the idol is so that he can swap out one for the other. And unfortunately, Mr. Jones fails at this task and a giant boulder rolls down
Starting point is 00:17:50 at him. But in software, you can attempt to do this by creating a new implementation of something that has some characteristic that you want. Maybe it's more performance or maybe the code is simpler or maybe it's even got some new behavior in it, but it supports all the existing behavior, and completely implement that thing, and then find a point, like a single seam in the code. Maybe it's where you're instantiating a class or calling a function, and you're like, all right, I'm going to comment out the old one, and I'm going to put in the new one, and then I'm going to run all my tests, and I'm going to see what happens. And if all the tests pass, great. Now you can go and you can delete all that duplicated code that you created.
Starting point is 00:18:28 You can delete the old implementation. You can delete all those old things and clean it all up. If the tests don't pass, then you have a giant boulder rolling at you. And now you have to do something, which is usually undo and try again. Unlike Indiana Jones, you have the opportunity to undo i think the critical part of that is that first and foremost um the bag of sand itself was a copy paste potentially of the original code right it was like hey i copy pasted it and now i have pretty much carte blanche because nothing is using this currently the bag of sand is in my hand it's not on the pedestal exactly this analogy is breaking down a little bit here but i think i think you got it going um and you've got the chance to look at it and also
Starting point is 00:19:08 critically this could be it could have its own test for the new functionality it could have its own the assertions that you want to test about the the replacement characteristics of this thing and that could be committed and checked in and it could be side by side in your code base for some amount of time even while maybe it's used in a couple of new locations while you're like i definitely need the new functionality and we haven't got the old version and then at some point you make the call that there seems to be doing what you need it to do it's passing all the tests for the old system and then you have your indiana jones moment of doing the switch and that can come in a very controlled process a very controlled part of the development process.
Starting point is 00:19:47 And then you sort of commit it on a Friday and say, all right, everyone, the old system is gone on Monday. I'm really sorry if you've been hacking on the old system over the weekend. You're going to have some horrible merge conflicts on Monday morning. Right, right, right, right.
Starting point is 00:19:58 Yeah, yeah, yeah. And I mean, you can do more sophisticated things like that where you have like feature flags or like different modes where you're like running the new and the old code. And that is actually kind of what we're doing with this job that I was talking about earlier, where it's like, you know, we've got some things that are using the old job and some things are using the new job. And we're sort of slowly transitioning all everything over to be the new thing. And then when there's no more uses of the new thing, then or the old thing, then we'll just have the new thing. And we can we can delete the new thing then or the old thing then we'll just have the new thing and we can we can delete the old thing um but yeah you can also just do it with just a couple of comments right like a comment out old thing put in new thing switch back if if yeah but don't but don't check in that comment yes which i think we've we've talked about before yeah we might have we might have talked about yeah not not checking and out code. But that's a whole other episode. So here's another place where I am tempted very often to use copy-paste.
Starting point is 00:20:52 And I say this to you partly as the catharsis of admitting it in public, or at least to you and then our listener. A confession? Confessional. Thank you. That was the word I was looking for. Cleanse your soul. You'll feel a lot better.
Starting point is 00:21:07 I will. I will. Cleanse what remains of my soul. What I tend to use copy paste for is having written a test for my code and observed its output against the empty string I asserted it to be equal to. I would then read through the difference that my my failing test tells me hey got empty string expected sorry i got blah expected empty string i will read through the blah and if the blah makes sense to me if it's the formatting is right and everything
Starting point is 00:21:38 then i would be tempted to take that and copy it into the test itself yeah i mean i don't know that that's that's certainly not a deadly sin i don't know if you can see his face you know there's there is clearly some sin-like characteristics of this is trying to be kind so for me it's it's a question of if it literally is a string so you can talk about sort of like different values and things right um you know like the the place let's talk about the places where that that is probably not a great idea and some of the places where it is it probably is actually a time saver and not not a bad idea at all so if what you're doing is you have written some complicated interest rate calculation that's like compounding daily, and it's got a bunch of different factors in it.
Starting point is 00:22:28 And you write out all that code, and then you write the test that asserts that the new balance is equal to zero. And then you run the test, and it says, no, actually, it's $217.38. And then you just copy that into your answer, and then you'd be like, well, that must be correct. You're doing it wrong, right? Like, that is not what you'd be like, well, that must be correct. You're doing it wrong,
Starting point is 00:22:45 right? Like that is not what you should be doing, right? If what you're doing is you've got some like human readable string representation of an object, right? That has like, you know,
Starting point is 00:22:59 some interesting information in it that is intended for logs or you're reporting on a screen or something like that. And you've written that function that kind of appends all the stuff together and there's no branches in that code it's just gathering up a bunch of information putting it all out and you want to take that and put that into your output i think that's completely reasonable so long as you're confident that there aren't going to be any sort of like weird, like invisible character type things in there that you might not expect. Right. Right. Because you can accidentally sort of like copy and paste,
Starting point is 00:23:34 like an unprintable character. And then someone comes along and they like edit the thing, like taking out of space or putting it back in. And then all of a sudden the tests are failing and you're like, what is going on here? Right. Like this code is identical. Like what, what is the, what is the thing so like you
Starting point is 00:23:47 know as long as you're confident that the copy and paste is like uh not gonna surprise anyone with its contents then i think that's a completely reasonable thing that's actually the exact case that i do use this particular kind of thing it's like yeah i have the to string of something and it's like i i read the code uh well i wrote the code and it seems reasonably sane to me and i just want to have some kind of test somewhere that says like no one he broke this in a way that's surprising and so i i might to string something and look at it now obviously there are with things like that you have to be a bit careful because you know if you have containers that don't have a well-defined sequence to them you know sets and
Starting point is 00:24:24 things like that that don't then you can sometimes become too sensitive to minutiae of your code, and it becomes very brittle. And almost by its very nature, it's very brittle. And in fact, it reminds me a little bit of like one of our very first podcasts when we had Claire on talking about the acceptance test. It's a kind of inline acceptance test
Starting point is 00:24:41 where you're saying like, this seems reasonable to me, and i'll be interested if it changed but but you don't get the very very high fidelity of like oh clearly you act this is it was the string of some sub sub sub object that was missing now uh close paren and you get the single targeted failure that says oh that's where it went wrong probably you're just going to find out that your giant string of like the whole world object that you created just as a kind of like smorgasbord of all the things test is now different.
Starting point is 00:25:10 And hopefully your output, your diff output is good enough for you to go to spot that it's a brace missing or a parenthesis. Right, yeah, yeah. I think that's a really astute observation because I think a lot of the interesting stuff that Claire was talking about was all of the tooling and the infrastructure that they had to kind of like deal with the fact that
Starting point is 00:25:31 these things are kind of brittle and you need to do things like strip out transient dates and deal with you know like things that are potentially out of order and like all the tools she was talking about are like specifically designed to deal with those kinds of things which your unit testing framework is almost certainly not right so if you're writing that style of you probably want to use the right using for the job then yeah right or or like structure the test in a way where you understand that it's like okay we need to be very careful about how we set this up because these tools aren't set up for they're not capable of handling that kind of variation. And so we can't let that variation in.
Starting point is 00:26:07 Whereas with the acceptance testing tools that Claire was talking about, I guess that's completely fine. It comes with it out of the gate. It's doing some things for you. Yeah. That actually makes me interested. So one of the patterns when I'm writing Python code
Starting point is 00:26:18 and I'm testing like edge cases and to some extent in C++ as I'm one of the rare heathens who actually uses exceptions in my c++ code i know a lot of this unfashionable but certainly for like error cases and error conditions that are genuinely exceptional and typically will shut the application down uh i don't mind using them but that's not why i'm talking about them what typically one has with exceptions is that you have like some kind of class that holds the exception that you
Starting point is 00:26:45 can in your testing framework you can match against and say like hey i expect a an exception or an error of this type to be thrown you know missing parameter expression or something like that and it's good practice in python to create a unique error for each of your like modules so that you can catch them in tests in general, rather than just catching runtime error and things like that. But even having done that, even with the fidelity of knowing that you've caught the, this is,
Starting point is 00:27:13 you know, a key not found in remote data store exception or whatever, you probably want to look at the message that's in there. And then again, you want to kind of try and find the right amount of, um, brittleness of like, here's the exact error message and of course if it contains upstream things from say aws or google cloud it might have some arcane error message error number inside the the string that you've got
Starting point is 00:27:37 but what you really are looking for is just a bit of it and so i will typically write uh sort of things that contains a string and it must contain missing string and it must contain my key name that I asked it and then other than that all bets are off right I'm happy with that level of fidelity only so I haven't copy pasted the exact error message that I was expecting which I could easily get by just again
Starting point is 00:28:00 matching an empty string and then seeing what I get because it just seems too brittle and it's you's case or otherwise. And most of these matches also support regular expression. So you can kind of substring in regular expression bits of it. And so even then when you're copy pasting, it pays to look at what you pasted and say,
Starting point is 00:28:19 is this exactly what I wanted or should I have modified it in some small way? Yeah, yeah. Or am I just reinforcing the bug in this code by writing a test make sure the bug is not fixed yeah uh-huh right exactly um yeah i mean i think i think those are all like very very reasonable places some of the places that might be a little bit more borderline are, which I could make some maybe arguments for doing it, and I could make some maybe arguments for not doing it, is a situation where there's a library that exists that does exactly what you need it to do, right? There's some function in there.
Starting point is 00:29:07 So there's some class in there that does exactly the behavior that you want. You do not want to implement this yourself. You want to use the library. But unfortunately, the library has 100,000 transitive dependencies, none of which are necessary for the one function that you want right i have a quick question on um yes you're developing in java at the moment aren't you yes also true in python is uh yeah okay fair all right it's a fair fight yeah all right yeah i was just thinking it sounds like you know every maven thing you've ever done before we've been and in fairness you know javascript typescript as well like you're like hey i just want this thing this one function then you're like why have i got
Starting point is 00:29:48 150 megabytes of text in my node packages how is this possible how is that much code been written how could you possibly need all these things yes um but yes i will i will very easily grant this is a common problem in the java world it is also a common problem in many other worlds too. You were saying that if there's this one class that implements this perfect thing of like, this is how to parse some data structure or some string.
Starting point is 00:30:15 Right, right, right. And now I've got to pull in layers and layers and layers of transitive dependencies, some of which might conflict with my existing ones, or even if they don't, might in the future I might I might just be like, nope we're going to copy and paste this code
Starting point is 00:30:31 I'm going to go find these 12 lines of code and I'm going to put them into my project and I might even write some tests around them because now this is my code, I own it now. That's true We should also say that one should be careful about licenses when does when doing this kind of thing so always worth considering i know obviously that's a
Starting point is 00:30:51 top right in this for us but not necessarily not all open source licenses are created equal and you need to understand how they work uh you know there are things like mit that are generally pretty safe but beyond that it's sort of like ask your lawyer. Ask your – yeah, certainly it works. Yeah, exactly. But that is another situation where I would never say that that is generally the right thing to do. But in certain situations, I have done it and I wouldn't i wouldn't uh complain anybody else doing it because the trade-off just doesn't make sense like i don't want to add you know three times the
Starting point is 00:31:30 number of dependencies to this project for a 12 line function or even like a hundred line function right yeah um i i just i just want this behavior in my system and so that i'm the best way for me to get that is is to copy and paste it yeah well that's a good i think that's a good argument for for copying and pasting as well as just some cases where you just want that little bit of code but you don't need everything else that comes with it so here's an interesting uh thought i had um and this is sort of now slightly left field uh it occurs to me that um copy and pasting is only bad when a human does it right if we do it then uh it's it's uh problematic but if the computer does it then maybe it's forgivable oh and the reason i say this say this is that like one of the sort of like the mothers of
Starting point is 00:32:19 optimization in the compiler community is inlining functions, which is to say copying the function you're calling from where it was defined to where it's being used every single time it's used. And that is generally accepted to be a very good thing to do for certain metrics, at least. But the thing is, it's automated. It's done reliably by a computer every time and it and if i change the original implementation it obviously will be compiled
Starting point is 00:32:50 as long as you've got your compile and build set up correctly every time so so yeah it's it's so i wondered if there was something from the properties of copy pasting that we're losing i think it's that when a human does it you have lost the link between the original and the copy, right? Although I sort of see there's this ancestry that you could sort of track it with some way. Maybe this is missing functionality in our tooling that we... Yeah, that's an excellent point. And it's sort of like, imagine a world where, like, you know,
Starting point is 00:33:20 the lineage of a piece of copy and pasted code could be, like, very easily and obviously tracked. And I have no idea what that would actually look like. Me neither, but no. But like, you know, it's like, oh, you're changing this line. And then there's some pop-up that's like, did you know that this is a copy of this other line over here? And like in a not annoying and stupid way.
Starting point is 00:33:40 Right. I mean, many IDs, of course, do have some amount of like, hey, this code is there's these 10 lines of code are duplicated somewhere else so it's a kind of extension of that i suppose but it's more like knowingly editing a duplicate the ide could say hey somewhere else you're also looping over and you've just looped over one less than than you having the copy so do you want to update the copy or not or should should I tell you about that? So the trick to that would be that the IDE would somehow have to understand the domain of the problem
Starting point is 00:34:09 that you're trying to solve enough to know that it's like, oh no, this is just another balance calculation, just the same as this other one. Because like one of the things about duplication is you can have code that is the exact same code, character for character, and it can be not duplicated,
Starting point is 00:34:28 right? It can, you can be for completely different reasons, completely different reasons to change, completely different reasons to exist. It just happens to be in the same shape right now, but tomorrow it might be totally different. And if you,
Starting point is 00:34:41 if you changed one and change the other, that would be a huge bug and you would never want to do that. Right. And you can have code that is totally different but is actually duplicated right like the the the structure is different the the everything's different but it it's solving the same problem in two just different ways and it needs to be unified right yeah and so like the ide would have to tell the difference which humans can't all those different humans can barely do it right this is why heuristics you know are coming to these ideas and how many lines do i say and i know some ideas were like well even if you rename all
Starting point is 00:35:13 the variables i quote know it's the same yeah but yeah if you if you change the four with a do while or a some other thing it's very hard although again interestingly and this is only you know here i am just keep steering it back to things i'd love to talk about but like something that compilers do internally is they try and canonify all the different ways that you can write code so they take a loop any loop if it be a do while or a while or a four and they rewrite it in a canonical form so that um similarities in code can be. And more importantly for a compiler specifically, it's if somebody writes a piece of... So my favorite example of this is something
Starting point is 00:35:52 where if you count the number of one bits in an integer by looping over all of the different 32 possible or 64 possible things and saying, is that bit set? And then adding that up. Or if you do it by shifting it down or you do it by other ways, ultimately there are about two or three ways you could write that and if the compiler can
Starting point is 00:36:08 canonify them into one or two possible variations of intermediate representation and whether you wrote a for loop or a while loop or a do while or whatever then the optimization that looks at that and says there's a single cpu instruction that does exactly that whole loop can kick in much more easily whereas if it has to deal with every possible combination and permutation of like well they used to four over here but they used to one yeah yeah so it's kind of interesting that there's there there's some some similar shapes in all of this it's all sort of connected yeah so in an interesting way like the compiler is unifying duplication across all of the projects that it compiles in the entire world. In essence, in order to then match it against the human curated list of like, hey, this is one way you can do a pop count. And this is another way you can do a pop count.
Starting point is 00:36:56 And all the other things that happen. And obviously, it makes their testing easier if they can guarantee that like a matrix of possible combinations is reduced by saying like look if there's a loop it always looks like this it doesn't matter if it's a do or a for there's always a start condition there's always a step condition and then there's always a terminating condition and there's always a cleanup condition and they're always there's four labels it's always this way some of them are empty some of them will immediately jump to the the the beginning of the loop some of them check the condition so if you look at for example unoptimized code in uh your favorite tool for looking at uh compiled languages then you'll
Starting point is 00:37:31 often see that bizarrely it'll jump to the end of the loop for a for loop and then it'll jump back to the top again because that's where the check is done sometimes is at the end of the loop and so it'll jump the first thing is hey i set everything up and jump to the end you're like no no no start the loop you're like no it has to check the condition at the beginning of the loop and that's checked at the end so it just jumps to the end does the conditional check and goes around to the top you know and so it's really quite interesting to see these things come out but we're talking about copying and pasting and we were and now i'm talking about loop optimizers this is i mean this is i i really like the the way that you're thinking about this because this
Starting point is 00:38:04 is fascinating it's sort of like getting to the heart of, I think it gets to the of like removing duplication by temporarily creating it. So imagine a situation where you suspect that two pieces of code are actually the same underneath the hood. Like they seem like they're doing the same thing, but you're not sure. And so the way that you're going to figure this out is you're going to bag of sand it so you're going to make a copy of one of them and then you're going to try to refactor it and maybe even just rewrite it right into the shape of the other one and then as soon as like you've achieved duplication it's almost like rose and tetris yeah they disappear and they disappear and they're like these are actually the same and then you can delete your copy and then you can
Starting point is 00:39:09 delete one of the originals and now you only have one of the three that's cool right that's an interesting way yeah like sort of it's almost like refactoring where you can make changes uh yeah over and over again you say well can i or there are you know branches of mathematics where you can sort of show two things are distinct by slowly applying changes until you can show that one is equivalent to the other through a sequence of steps. So, yeah, I'm sure there's some clever things that you can do in that respect.
Starting point is 00:39:34 But yeah, duplication done by your compiler is an interesting variation on this. I like it. You get me to talk about things I'm not talking about. This is why we have this podcast that's why we have the intersection between all these different things yeah yeah i suppose so i suppose so well i think that's about all i've got for yeah i'm sure there are other
Starting point is 00:39:56 things that we could come up with you know it's it seems like this is a good amount of time to be talking about it and our listener hopefully has been kept entertained on their dog walk or commute or whatever it is explaining to their teenager what whizzy wig and printing is telling their teenager what printing and whizzy wig yeah yeah cool well i'll see you on the next one all right sounds good you've been listening to two's compliment a programming podcast by ben rady and matt godbolt find the show transcript and notes at www.twoscompliment.org contact us on mastodon we are at twos compliment at hackyderm.io music is by inverse phase find out more at inverse phase.com

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.