Two's Complement - Copypasta
Episode Date: December 19, 2023Matt and Ben talk about when it's OK to copy and paste code. Matt explains how helpful compilers take the time-saving step of copying and pasting code for you, saving you precious microseconds. Ben re...calls things from the 80s, like word processors and Indiana Jones.
Transcript
Discussion (0)
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Compliment, a programming podcast.
Hey, Ben.
Hey, Matt.
How are things going?
Good.
Good.
It's that time of the week again when we like to record a podcast.
We're definitely doing that right now.
That is exactly what I'm doing.
I hope you're doing it too, otherwise this is going to be a very one-sided conversation.
Wait, let me check.
Yes.
Okay, that's good.
Now, as our regular listener will know, we put so much care and effort into every one of these.
We rehearse over and over again until we're word perfect.
We have a clear idea of exactly what's going on.
Then the script is prepared and edited well in advance.
Using word perfect.
Using word perfect.
Oh, my gosh.
Using a document editor from the 1980s.
That's what we do.
That is quite something.
Yeah.
Actually, all right.
We're already derailed.
I was padding for us coming up with an idea about what this episode is going to be around.
And now I've completely derailed.
I was watching an American YouTube channel where they do up computers, like old retro computers,
and they'd gotten their hands on a BBC Master,
which was the kind of enhanced version of the BBC Micro,
my favourite computer,
and was in fact actually the computer that I had growing up.
And it comes with a built-in set of ROMs,
and one of the ROMs is in fact a simple word processor.
And seeing him discover that and then
kind of trying typing it was it was so frustrating screaming at the screen go no no you press this
key to get out of it although you do this type of thing you know like it's like watching someone
use vi and be confused as heck and you're like no it's obvious oh if you were actually using
computers in the late 80s that so right yeah so we're not using star view whizzy yeah before whizzy wig
right oh my gosh whizzy wig people don't even probably know what that means because it's like
there wouldn't be a distinction in their minds to be like yeah if you're writing like a google
doc you just type the thing and that's what it looks like right isn't that funny yeah like
presumably like mid late 80s through i I guess, mid-late 90s,
WYSIWYG was a thing,
and it was like a defining difference between applications.
This one, and for those few listeners that don't know what WYSIWYG is,
stands for what you see.
The teenage child of our listener.
The teenage child of our listener.
What you see is what you get,
which is to say if you edit stuff in a document in word
or google docs whatever as you're typing it that's how it's going to appear on the printer right you're
just writing it and it's kind of obvious it's so obvious that why would you have ever done it any
other way but back in the day of of like text-based computing very often you would have sort of
essentially html style markup code
around your your document and then only when you sent it to the printer did you actually discover
what it was going to end up looking like you had to render it non-interactively and so there was
this big difference between uh those that could show you interactively what your document would look like and those that essentially uh had to
do it uh offline and then of course the response to that from the hypothetical teenage child of
our listeners would be like why would you want to print it i suppose so i suppose so yeah i mean
although i uh i had an uh an interview today and i was trying to print out uh the candidate's resume and twice i sent it to
my printer and twice it didn't appear and i gave up because i hadn't got time and then yeah like
a caveman or no i guess not like a caveman like a modern person i had to look at the resume
on a screen next to them which uh which is better but i can't doodle on it and that's i think the
main reason why I print out resumes
is not to actually...
It's the therapy of having something that I can idly doodle upon.
Right.
While you're listening to them drone on about their various accomplishments.
Drone on is not...
No, it was more than that.
I don't want to make it sound like our interviewing process is dull.
And this is not even what this episode is supposed to be.
We are, what, five minutes in?
This has been a hell of a tangent.
This is a tangent.
So what do you want to talk about today, Ben?
I think we were going to talk about copying and pasting code
and when it is actually a good idea to copy and paste code.
Is it ever a good idea to copy and paste code?
I think we can make the...
Go on, convince me.
I'm going to say every bug I've ever written
is a result of copy pasting
some bit of code from somewhere else
and not really thinking about it.
No, let's just start with that, though.
When is it a good idea?
When is it a good idea?
So I can think of some
maybe more mundane some less controversial examples of this where when you are copying
and pasting from an example or a tutorial to explore something into like a repl or to see
how something works is a form of experimentation right right? There's no intention that the code that you write
is going to make its way into any real live system.
That's not why you're writing it.
You're writing it because you have some API or some behavior
that you don't understand, maybe even just part of the language,
that you don't understand and you're trying to understand it, right?
I think you've built an entire website around this concept.
There may be some amount of
that truth that i yeah um and so copying and pasting code in in that context i think is
obviously a good idea um i think people can sometimes sometimes get into trouble when
they're like all right well i've got this copy and pasted code that i maybe hacked a few pieces in uh and now i'm done right no further work is necessary from this point on and that is
definitely the the knee-jerk mental image i had of what i was thinking when i'm saying i copy paste
my code and then yeah you don't read it properly you don't understand it properly and you move on
yeah so i mean you know i think the more the more interesting places
that you can you can kind of talk about this and think about this are uh when you've you've moved
beyond that stage if we're not exploring like the goal here is actually to write real production
going to be used by other people code? And when is copying and pasting code
a good idea in that case?
I think a more controversial
sort of second condition
that I could probably throw out here
for discussion is
hinges on the definition of duplication.
So I got to talk for a little bit about why,
to explain this,
I got to talk for a little bit about
why copying pasting code is generally bad. Yeah, i guess that's that's not a bad idea given that
we just assumed i just asserted it was terrible and we didn't back it up in any meaningful way
so yeah what yeah why is it bad so one way it can be bad and one way it can create like fairly
serious problems is when you take a decision in a in a system that is being made in one place,
and now all of a sudden you are making it in two places. And if those two places are ever out of
sync, that will create a bug in your system, right? Like imagine like the way that you,
you know, calculate interest for an account, and that's in one place. And then you, you know,
all of your interest calculations use this one
algorithm and they're all consistent, right? And so the amount of money that you transfer from
account A to account B is consistent with the statements that you send out to your customers
saying like, you earned this amount of interest. This is the amount that shows up on your statement.
This is the amount that we sent to the ancient bank transfer API that actually does the transfer from point A to point B.
Raining code.
And now, yes, exactly. And now these things are in agreement. And if you were to duplicate that code
and have the code that produced the customer statements be different than the code that did
the bank transfer, they could get out of sync. And if they did, that would be a very serious bug.
Yes.
Right?
Agreed. And so
that kind of duplication is harmful. And we do our very best, I think, as software engineers to try
to avoid that because we know it's harmful. And when you're just copying and pasting code, the
question that you should be asking yourself is, am I doing this? Am I creating this situation now
where if you change one and you forget or neglect or just
aren't aware to change the other, is that going to create a bug? Given that, I think there are
situations where the opposite is kind of true, where it's sort of like, I have this piece of
code and I need to do something similar but different in this code.
So what I'm trying to do is I'm calculating an interest rate,
but for a completely different purpose.
Maybe I'm calculating a bond yield rate or something like that.
So it's different enough to where it's like,
yeah, there's some sort of similarity with this code, but it's not the same thing. And it is certainly
not the case that if I changed my, you know, bond interest calculation, that I should automatically
and always change the customer bank account rate interest calculation, right? Those are two
separate things with two different reasons to change. And so a reasonable approach to
implementing that new functionality might be, depending on the structure of the code,
to copy that interest rate calculation code, to drive out the new behavior that you want with
tests, to delete the behavior that you don't need, and make a copy of it and be like, now this is our
interest, this is our bond algorithm, and this is our customer account algorithm. And they're different and they have different reasons to
change and they have different behaviors as specified by these tests. The code maybe started
out the same and maybe right now has some sort of accidental similarity between it. And you might
even discover after going through that whole process that there is some essential
similarity that really is the same.
It's like doing any kind of interest rate calculation where you're just multiplying
a balance by a rate.
It's like, okay, fine.
Maybe that gets pulled out into its own thing that is separate from those two things that
is tested in its own way.
There's more of like a library.
And that is something that you don't
want to duplicate. But I would almost argue that it is totally reasonable to start with duplicating
those things on purpose because they have different reasons to change, making them be the
best form that they can be, and then looking at them and being like, is there any actual duplication
between these two things at that point?
And how do I remove that duplication?
Right, that makes a lot of sense to me.
Definitely the argument I was going to make
that wasn't as maybe trite as like the interest rate itself
is the calculation piece of code.
And if you copy the calculation code
and you have it in two places,
then a bug fix or a
performance optimization in one doesn't necessarily make it to the other so that would be my that
would just be a pile on to your argument about don't duplicating it but yeah that second reason
where it is not copy pasting as an ends it's not like i need this in two places it's like i'm
starting with something that's similar and i don't want to have coupling between these things on purpose i am choosing
to essentially branch the code or a little facet of the code and say these they they share an
ancestry untracked as it may be in source control but they share an ancestry because they have a
similar job to do but i am deliberately going
to evolve them in different directions and that to me is actually more of a a point is the um
sometimes by reusing a piece of code you're introducing coupling between components that
otherwise wouldn't share so much coupling in code which maybe hurts you down the line um in so to try and sort of think
of a concrete reason uh example of this is certainly if you are planning on moving two
systems in two different directions the fact that they share a piece of common functionality
might make it more complicated and harder to test if it has to do essentially two jobs and the jobs
are disjoint or partially
disjoint.
And at some point,
obviously you want to be able to extract a little bit that is common,
truly common.
And maybe that's your like util library dot function,
you know,
the horrific naming of whatever that part is notwithstanding,
but the,
the software junk.
Yes.
I mean,
we used to have,
we used to have something.
Yeah.
C cruft was all of the crufty bits of code that you needed in C just to get things to work.
Naming is difficult, right?
We know that.
It's been well documented that naming is hard, but we can try.
It can and should do better.
That said, though, yeah, by deliberately choosing to copy lets the two copies evolve in their own directions. And that's maybe a boon
because you're not weighing down one implementation
with the changes and the modifications
to support functionality it doesn't care about,
which maybe bloats its API, makes it harder to test,
makes it less performant.
But it is a trade-off between,
well, what if there is a core bug in that feature?
How will you get the fix over to the
sort of equivalent in the other piece of code yeah yeah for me it comes down to this sort of
litmus test of like if these two things don't change in lockstep is that a bug or a feature
right right because it could be either depending on what they are context dependent yeah
yeah another thing that i i think is related but not the same thing as we're talking about here,
is a situation where you are using duplication as a way to evolve the design of a system.
And I actually have this right now in one of a single threaded process that runs on a local machine to a distributed process that runs inside of a work queue.
And the implementation of the algorithm is the same in both cases, but the sort of scaffolding around it is completely different. And we could have kind of tortured the design a little bit to unify the duplication of those two things. But the intention is to get rid of the single threaded local version and only have the distributed version. literally have two copies, essentially, of this class in the system right now. One is called
job, and the other one's called job new. And job new is pretty much a copy and paste of job with
all of the changes necessary to make it run inside of our work queue. And the intention is we're going
to delete job and then rename job new to job when when this transition is complete now this is definitely
something that takes discipline within the team because if you don't have that discipline and you
get pulled off onto other things and then this sticks around new job new job new new and then
job new new and then job new new new right and it just gets terrible but you know for teams that
that have that discipline and are able basically like, I say discipline, but really it's just like, do you have control over your own priorities?
Right.
Because if you have control over your own priorities, then you can just decide that you want to do this and then you can do it.
If your priorities are set by somebody that doesn't have an understanding of how your source code works, then I wouldn't do this because you will easily find yourself in a situation. It's like,
well, we have three new job news and they're all terrible and I don't know what to do now.
Yeah. But the technique itself, I think, for teams that can do this is a very valuable one.
There's sort of a form of this that I do locally. This know, this is one that's living, you know, through multiple PRs and multiple commits.
And, you know, it's probably going to be a few weeks before we fully make this transition.
But there are definitely situations in which I will do this exact same thing, sort of like
within the span of a few hours, where I'm trying to change the implementation of something
in, you know, to add some some new behavior
some performance characteristics whatever it might be and uh you maybe have heard me say this before
this metaphor of like the bag of sand um as in indiana jones yes as in you know for for uh the
folks who don't know what whizzy wig is yes for those who don't know what Whizzywig is, let me also explain to you the first Indiana Jones movie.
There's a scene at the very beginning of the Indiana Jones movie where he's in this lost, forgotten temple.
And he's trying to recover this golden statue before it gets stolen by this other person who's just going to sell it.
It belongs in a museum!
You know, that scene, if you've ever seen it.
And there's traps all to sell it. It belongs in a museum. You know, that scene, if you've ever seen it. And so he, and there's traps all around this place,
and he's trying not to get killed.
And he knows that there's some weight sensor mechanism thing
in this pedestal.
And so he's sitting there staring at this golden idol
with a bag of sand in his hand,
trying to figure out what the weight of the idol is
so that he can swap out
one for the other. And unfortunately, Mr. Jones fails at this task and a giant boulder rolls down
at him. But in software, you can attempt to do this by creating a new implementation of something
that has some characteristic that you want. Maybe it's more performance or maybe the code is simpler
or maybe it's even got some new behavior in it, but it supports all the existing behavior, and completely implement that thing, and then find
a point, like a single seam in the code. Maybe it's where you're instantiating a class or calling
a function, and you're like, all right, I'm going to comment out the old one, and I'm going to put
in the new one, and then I'm going to run all my tests, and I'm going to see what happens.
And if all the tests pass, great.
Now you can go and you can delete all that duplicated code that you created.
You can delete the old implementation.
You can delete all those old things and clean it all up.
If the tests don't pass, then you have a giant boulder rolling at you.
And now you have to do something, which is usually undo and try again.
Unlike Indiana Jones, you have the opportunity to undo i think the critical part of that is that first and foremost um the bag of sand itself was a copy paste
potentially of the original code right it was like hey i copy pasted it and now i have pretty
much carte blanche because nothing is using this currently the bag of sand is in my hand it's not
on the pedestal exactly this analogy is breaking down a little bit here but i think i think you got it going um and you've got the chance to look at it and also
critically this could be it could have its own test for the new functionality it could have its
own the assertions that you want to test about the the replacement characteristics of this thing and
that could be committed and checked in and it could be side by side in your code base for some
amount of time even while maybe it's used in a couple of new locations while you're like i definitely need
the new functionality and we haven't got the old version and then at some point you make the call
that there seems to be doing what you need it to do it's passing all the tests for the old system
and then you have your indiana jones moment of doing the switch and that can come in a very
controlled process a very controlled part of the development process.
And then you sort of commit it on a Friday
and say, all right, everyone,
the old system is gone on Monday.
I'm really sorry if you've been hacking
on the old system over the weekend.
You're going to have some horrible merge conflicts
on Monday morning.
Right, right, right, right.
Yeah, yeah, yeah.
And I mean, you can do more sophisticated things like that
where you have like feature flags
or like different modes where you're like running the new and the old code. And that is actually kind of what we're doing with this job that I was talking about earlier, where it's like, you know, we've got some things that are using the old job and some things are using the new job. And we're sort of slowly transitioning all everything over to be the new thing. And then when there's no more uses of the new thing, then or the old thing, then we'll just have the new thing. And we can we can delete the new thing then or the old thing then we'll just have the new thing and we can we can delete the old thing um but yeah you can also just do it with just a couple of comments right
like a comment out old thing put in new thing switch back if if yeah but don't but don't check
in that comment yes which i think we've we've talked about before yeah we might have we might
have talked about yeah not not checking and out code. But that's a whole other episode.
So here's another place where I am tempted very often to use copy-paste.
And I say this to you partly as the catharsis of admitting it in public,
or at least to you and then our listener.
A confession?
Confessional.
Thank you.
That was the word I was looking for.
Cleanse your soul.
You'll feel a lot better.
I will.
I will.
Cleanse what remains of my soul.
What I tend to use copy paste for is having written a test for my code and observed its
output against the empty string I asserted it to be equal to.
I would then read through the difference that my my failing
test tells me hey got empty string expected sorry i got blah expected empty string i will read
through the blah and if the blah makes sense to me if it's the formatting is right and everything
then i would be tempted to take that and copy it into the test itself yeah i mean i don't know that that's that's certainly not a deadly sin
i don't know if you can see his face you know there's there is clearly some sin-like
characteristics of this is trying to be kind so for me it's it's a question of if it literally
is a string so you can talk about sort of like different values and things right um you know like the the place let's talk about the places where that that is probably
not a great idea and some of the places where it is it probably is actually a time saver and not
not a bad idea at all so if what you're doing is you have written some complicated interest rate
calculation that's like compounding daily,
and it's got a bunch of different factors in it.
And you write out all that code,
and then you write the test that asserts
that the new balance is equal to zero.
And then you run the test, and it says,
no, actually, it's $217.38.
And then you just copy that into your answer,
and then you'd be like, well, that must be correct.
You're doing it wrong, right? Like, that is not what you'd be like, well, that must be correct. You're doing it wrong,
right?
Like that is not what you should be doing,
right?
If what you're doing is you've got some like human readable string
representation of an object,
right?
That has like,
you know,
some interesting information in it that is intended for logs or you're
reporting on a screen or something like that. And you've written that function that kind of appends all the stuff
together and there's no branches in that code it's just gathering up a bunch of information
putting it all out and you want to take that and put that into your output i think that's
completely reasonable so long as you're confident that there aren't going to be any sort of like weird, like invisible character type things in there that you might not expect.
Right.
Right.
Because you can accidentally sort of like copy and paste,
like an unprintable character.
And then someone comes along and they like edit the thing,
like taking out of space or putting it back in.
And then all of a sudden the tests are failing and you're like,
what is going on here?
Right.
Like this code is identical.
Like what, what is the, what is the thing so like you
know as long as you're confident that the copy and paste is like uh not gonna surprise anyone
with its contents then i think that's a completely reasonable thing that's actually the exact case
that i do use this particular kind of thing it's like yeah i have the to string of something and
it's like i i read the code uh well i wrote the code
and it seems reasonably sane to me and i just want to have some kind of test somewhere that
says like no one he broke this in a way that's surprising and so i i might to string something
and look at it now obviously there are with things like that you have to be a bit careful because you
know if you have containers that don't have a well-defined sequence to them you know sets and
things like that that don't then you can sometimes become too sensitive
to minutiae of your code,
and it becomes very brittle.
And almost by its very nature, it's very brittle.
And in fact, it reminds me a little bit
of like one of our very first podcasts
when we had Claire on talking about the acceptance test.
It's a kind of inline acceptance test
where you're saying like,
this seems reasonable to me,
and i'll
be interested if it changed but but you don't get the very very high fidelity of like oh clearly you
act this is it was the string of some sub sub sub object that was missing now uh close paren
and you get the single targeted failure that says oh that's where it went wrong probably you're just
going to find out that your giant string of like the whole world object that you created just as a kind of like
smorgasbord of all the things test is now different.
And hopefully your output, your diff output
is good enough for you to go to spot
that it's a brace missing or a parenthesis.
Right, yeah, yeah.
I think that's a really astute observation
because I think a lot of the interesting stuff
that Claire was talking about
was all of the tooling and the infrastructure that they had to kind of like deal with the fact that
these things are kind of brittle and you need to do things like strip out transient dates and deal
with you know like things that are potentially out of order and like all the tools she was talking
about are like specifically designed to deal with those kinds of things which your unit testing framework is almost certainly not right so if you're writing
that style of you probably want to use the right using for the job then yeah right or or like
structure the test in a way where you understand that it's like okay we need to be very careful
about how we set this up because these tools aren't set up for they're not capable of handling
that kind of variation.
And so we can't let that variation in.
Whereas with the acceptance testing tools
that Claire was talking about,
I guess that's completely fine.
It comes with it out of the gate.
It's doing some things for you.
Yeah.
That actually makes me interested.
So one of the patterns when I'm writing Python code
and I'm testing like edge cases
and to some extent in C++
as I'm one of the rare heathens
who actually uses exceptions
in my c++ code i know a lot of this unfashionable but certainly for like error cases and error
conditions that are genuinely exceptional and typically will shut the application down
uh i don't mind using them but that's not why i'm talking about them what typically one has with
exceptions is that you have like some kind of class that holds the exception that you
can in your testing framework you can match against and say like hey i expect a an exception or an
error of this type to be thrown you know missing parameter expression or something like that and
it's good practice in python to create a unique error for each of your like modules so that you
can catch them in tests in general,
rather than just catching runtime error and things like that.
But even having done that,
even with the fidelity of knowing that you've caught the,
this is,
you know,
a key not found in remote data store exception or whatever,
you probably want to look at the message that's in there.
And then again,
you want to kind of try and find the right amount of,
um,
brittleness of like, here's the exact error message and of course if it contains upstream things from say aws or google
cloud it might have some arcane error message error number inside the the string that you've got
but what you really are looking for is just a bit of it and so i will typically write uh sort of
things that contains a string and it must contain missing string
and it must contain my key name that I asked it
and then other than that all bets are off right
I'm happy with that level of fidelity only
so I haven't copy pasted the exact error message
that I was expecting
which I could easily get by just again
matching an empty string
and then seeing what I get
because it just seems too brittle
and it's you's case or otherwise.
And most of these matches also support regular expression.
So you can kind of substring in regular expression bits of it.
And so even then when you're copy pasting,
it pays to look at what you pasted and say,
is this exactly what I wanted or should I have modified it in some small way?
Yeah, yeah. Or am I just reinforcing the bug in this code by writing a test make sure the bug is not fixed yeah uh-huh right exactly um yeah i mean i think i think those are
all like very very reasonable places some of the places that might be a little bit more borderline are,
which I could make some maybe arguments for doing it,
and I could make some maybe arguments for not doing it,
is a situation where there's a library that exists
that does exactly what you need it to do, right?
There's some function in there.
So there's some class in there that does exactly the behavior that you want.
You do not want to implement this yourself.
You want to use the library.
But unfortunately, the library has 100,000 transitive dependencies, none of which are necessary for the one function that you want right i have a quick question on
um yes you're developing in java at the moment aren't you
yes also true in python is uh yeah okay fair all right it's a fair fight yeah all right yeah i was
just thinking it sounds like you know every maven thing you've ever done before we've been
and in fairness you know javascript typescript as well like you're like hey i just want this thing this one function then you're like why have i got
150 megabytes of text in my node packages how is this possible how is that much code been written
how could you possibly need all these things yes um but yes i will i will very easily grant this
is a common problem in the java world it is also a common problem in many other worlds too.
You were saying that if there's
this one class that implements this
perfect thing of like, this is how to parse
some data structure
or some string.
Right, right, right. And now I've got to pull in
layers and layers and layers
of transitive dependencies, some of which might
conflict with my existing ones,
or even if they don't, might in the future
I might
I might just be like, nope
we're going to copy and paste this code
I'm going to go find these 12 lines
of code and I'm
going to put them into my project and I might even
write some tests around them because now this is my
code, I own it now. That's true
We should also say
that one should be careful about licenses
when does when doing this kind of thing so always worth considering i know obviously that's a
top right in this for us but not necessarily not all open source licenses are created equal
and you need to understand how they work uh you know there are things like mit that are generally
pretty safe but beyond that it's sort of like ask your lawyer.
Ask your – yeah, certainly it works.
Yeah, exactly.
But that is another situation where I would never say that that is generally the right thing to do.
But in certain situations, I have done it and I wouldn't i wouldn't uh complain anybody else doing it
because the trade-off just doesn't make sense like i don't want to add you know three times the
number of dependencies to this project for a 12 line function or even like a hundred line function
right yeah um i i just i just want this behavior in my system and so that i'm the best way for me
to get that is is to copy and paste it yeah well that's a good i think that's a
good argument for for copying and pasting as well as just some cases where you just want that little
bit of code but you don't need everything else that comes with it so here's an interesting uh
thought i had um and this is sort of now slightly left field uh it occurs to me that um copy and pasting is only bad when a human does it right if we do it
then uh it's it's uh problematic but if the computer does it then maybe it's forgivable
oh and the reason i say this say this is that like one of the sort of like the mothers of
optimization in the compiler community is inlining functions, which is to say copying the function you're calling
from where it was defined to where it's being used
every single time it's used.
And that is generally accepted to be a very good thing to do
for certain metrics, at least.
But the thing is, it's automated.
It's done reliably by a computer every
time and it and if i change the original implementation it obviously will be compiled
as long as you've got your compile and build set up correctly every time so so yeah it's it's so i
wondered if there was something from the properties of copy pasting that we're losing i think it's
that when a human does it you have lost the link between the original and the copy, right?
Although I sort of see there's this ancestry
that you could sort of track it with some way.
Maybe this is missing functionality in our tooling that we...
Yeah, that's an excellent point.
And it's sort of like, imagine a world where, like, you know,
the lineage of a piece of copy and pasted code
could be, like, very easily and obviously tracked.
And I have no idea what that would actually look like.
Me neither, but no.
But like, you know, it's like, oh, you're changing this line.
And then there's some pop-up that's like,
did you know that this is a copy of this other line over here?
And like in a not annoying and stupid way.
Right.
I mean, many IDs, of course, do have some amount of like,
hey, this code
is there's these 10 lines of code are duplicated somewhere else so it's a kind of extension of that
i suppose but it's more like knowingly editing a duplicate the ide could say hey somewhere else
you're also looping over and you've just looped over one less than than you having the copy so
do you want to update the copy or not or should should I tell you about that? So the trick to that would be that the IDE
would somehow have to understand the domain of the problem
that you're trying to solve enough to know
that it's like, oh no,
this is just another balance calculation,
just the same as this other one.
Because like one of the things about duplication
is you can have code that is the exact same code,
character for character,
and it can be not duplicated,
right?
It can,
you can be for completely different reasons,
completely different reasons to change,
completely different reasons to exist.
It just happens to be in the same shape right now,
but tomorrow it might be totally different.
And if you,
if you changed one and change the other,
that would be a huge bug and you would never want to do that.
Right.
And you can have code that is totally different but is actually duplicated right like the the the structure is different the the everything's different but it
it's solving the same problem in two just different ways and it needs to be unified right yeah and so
like the ide would have to tell the difference which humans can't
all those different humans can barely do it right this is why heuristics you know are coming to
these ideas and how many lines do i say and i know some ideas were like well even if you rename all
the variables i quote know it's the same yeah but yeah if you if you change the four with a do while
or a some other thing it's very hard although again interestingly and this is only you know
here i am just keep steering it back to things i'd love to talk about but like something that compilers do
internally is they try and canonify all the different ways that you can write code so they
take a loop any loop if it be a do while or a while or a four and they rewrite it in a canonical form
so that um similarities in code can be. And more importantly for a compiler specifically,
it's if somebody writes a piece of...
So my favorite example of this is something
where if you count the number of one bits in an integer
by looping over all of the different 32 possible
or 64 possible things and saying,
is that bit set?
And then adding that up.
Or if you do it by shifting it down
or you do it by other ways,
ultimately there are about two or three ways you could write that and if the compiler can
canonify them into one or two possible variations of intermediate representation and whether you
wrote a for loop or a while loop or a do while or whatever then the optimization that looks at that
and says there's a single cpu instruction that does exactly that whole loop can kick in much more easily whereas if it has to deal with every possible combination
and permutation of like well they used to four over here but they used to one yeah yeah so it's
kind of interesting that there's there there's some some similar shapes in all of this it's all
sort of connected yeah so in an interesting way like the compiler is unifying duplication across all of the projects that it compiles in the entire world.
In essence, in order to then match it against the human curated list of like, hey, this is one way you can do a pop count.
And this is another way you can do a pop count.
And all the other things that happen.
And obviously, it makes their testing easier if they can guarantee that like a matrix of possible combinations is
reduced by saying like look if there's a loop it always looks like this it doesn't matter if it's
a do or a for there's always a start condition there's always a step condition and then there's
always a terminating condition and there's always a cleanup condition and they're always there's
four labels it's always this way some of them are empty some of them will immediately jump to the
the the beginning of the loop some of them check the condition so if you look at for example
unoptimized code in uh your favorite tool for looking at uh compiled languages then you'll
often see that bizarrely it'll jump to the end of the loop for a for loop and then it'll jump back
to the top again because that's where the check is done sometimes is at the end of the loop and
so it'll jump the first thing is hey i set everything up and jump to the end you're like
no no no start the loop you're like no it has to check the condition at the beginning of the loop
and that's checked at the end so it just jumps to the end does the conditional check and goes
around to the top you know and so it's really quite interesting to see these things come out
but we're talking about copying and pasting and we were and now i'm talking about loop optimizers
this is i mean this is i i really like the the way that you're thinking about this because this
is fascinating it's sort of like getting to the heart of, I think it gets to the of like removing duplication by temporarily creating it.
So imagine a situation where you suspect that two pieces of code are actually the same underneath the hood.
Like they seem like they're doing the same thing, but you're not sure.
And so the way that you're going to figure this out is you're going to bag of sand it so you're going to make a copy of one of them and
then you're going to try to refactor it and maybe even just rewrite it right into the shape of the
other one and then as soon as like you've achieved duplication it's almost like rose and tetris yeah
they disappear and they disappear
and they're like these are actually the same and then you can delete your copy and then you can
delete one of the originals and now you only have one of the three that's cool right that's an
interesting way yeah like sort of it's almost like refactoring where you can make changes uh
yeah over and over again you say well can i or there are you know branches of mathematics where
you can sort of show two things are distinct by slowly applying changes until you can show that one is
equivalent to the other through a sequence of
steps. So, yeah, I'm
sure there's some clever things that you can do in that
respect.
But yeah, duplication done by
your compiler is
an interesting
variation on this. I like it.
You get me to talk about things I'm not talking about.
This is why we have this
podcast that's why we have the intersection between all these different things yeah yeah i
suppose so i suppose so well i think that's about all i've got for yeah i'm sure there are other
things that we could come up with you know it's it seems like this is a good amount of time to
be talking about it and our listener hopefully has been kept entertained on their dog walk or commute or whatever it is explaining to their teenager what whizzy wig
and printing is telling their teenager what printing and whizzy wig yeah yeah cool well
i'll see you on the next one all right sounds good
you've been listening to two's compliment a programming podcast by ben rady and matt godbolt
find the show transcript and notes at www.twoscompliment.org
contact us on mastodon we are at twos compliment at hackyderm.io music is by inverse phase find out more at inverse phase.com