Two's Complement - Ben Walks A Tree
Episode Date: August 22, 2023
Ben ventures into the forest, finds a tree traversal problem, and then fails his will save and gets fascinated by a hash map. Matt suggests zombies. Then they come up with a solution and talk about how to test it because of course they do.
Transcript
I'm Matt Godbolt.
And I'm Ben Rady.
And this is Two's Complement, a programming podcast.
Hey, Ben.
Hey, Matt.
How are you doing, my friend?
I'm working.
I'm working today.
You're working?
Yeah.
Why would you do such a thing?
Doing the day job.
I am staring at a bunch of comments in my code that are like, hey, Ben.
Hey, later, Ben.
Comments written by earlier Ben.
Yeah, past Ben.
Hey, future Ben.
You need to solve this problem. So, you know, solve this problem. And now I'm looking at these comments, being future Ben, going, yeah, I guess I gotta solve this problem now.
Well, I tell you what, that past Ben has a lot of confidence.
Misplaced? Was it misplaced confidence?
Hopefully not.
So this is interesting, actually, in sort of the taxonomy of comments.
There are, like, the "say what you're doing" comments, which are almost always a mistake. You know, the comment says, add three to the blah variable. And then you know that underneath it, blah variable plus-equals four is what it's inevitably going to say, right? Because the comment rots. So, you know, the advice typically is: if you can't make it obvious by writing the code so that it reads like what you want it to say, then you've probably got a problem with your code. But then there's the why explanation, which is more like, hey, I am doing this because of this thing that's not obvious from the context of the code, or whatever. Those are OK.
But then this is the "insert code here, this doesn't do anything yet" version of the comment.
Yes, yes.
What is the problem?
This is my task tracking style of comment.
I actually do this thing where I – we might have talked about this actually –
where I have a special check in branch builds for to-do comments.
So to-do means something very special in this project.
And it is basically like, I've queued up some work that I need to do, and you can't
merge to the main branch if you've got any of these things left in the code.
Interesting.
So that's a good discipline to have.
Yeah.
And it's actually also a great way to collaborate, because you can sit down with somebody and
kind of like put some to-do comments in the code and be like, oh, we need to fix this.
We need to fix this.
We need to fix this and not lose track of the fact that as a group, you need to do this work.
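A minimal sketch of the kind of branch-build check being described, assuming a Python codebase and a hypothetical CI step that fails when any TODO comments remain; the paths and names here are illustrative, not from the actual project:

```python
#!/usr/bin/env python3
"""Fail the branch build if any TODO comments remain in the source tree."""
import pathlib
import re
import sys

TODO = re.compile(r"\bTODO\b")

def find_todos(root="src"):
    """Collect every TODO left in the tree, as 'path:line: text' strings."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if TODO.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits

if __name__ == "__main__":
    todos = find_todos()
    for hit in todos:
        print(hit)
    # A non-zero exit fails the CI step, which is what blocks the merge to main.
    sys.exit(1 if todos else 0)
```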
That's interesting.
Definitely, I've worked on projects before where we've not allowed to-dos to be merged into main.
You know, like, essentially we can't promote anything to production if it has to-dos in it. But nowadays I subscribe more to: provided the to-do has a tracking issue associated with it, where there's the finesse of why it's okay to defer it for now, and it can be scheduled and put back on someone's actual work to-do list, then it's okay. But it might be nice to have both worlds, where you're like, this is a long-term to-do. Like, hey, to-do: we know that this doesn't scale, but for the next year it's likely to be okay. Eventually we're going to have to replace this order N squared algorithm with something cleverer, right? That's the kind of thing you can put in a to-do comment. Obviously, until you forget about it, and then, you know, everything falls over because it's too slow. But that's the whole point of having the tracking issue.
Yeah. But then this to-do is like, no, I just want to be able to commit so that someone else can pull the branch we're collaborating on.
And we can't get this into main.
Yeah.
Right.
It will not accidentally go into production without having looked at this.
Yeah.
Right.
So on the other end, I have pre-commit hooks that say, if it says do not commit anywhere in my repository, then you can't do that.
But "do not merge" is something I have written, and I have nothing that prevents it from being merged in. It's just, like, a convention there, and you hope that when you're diffing things, you'll go, oh, whoops.
Yeah. So maybe we should enforce that. Anyway, that's not why you wanted to talk about this stuff. So, yeah, what is this comment? What is this to-do comment?
So basically, the problem that I have here is I have a sort of unknown...
You can think of this as a directory structure.
This is actually in S3 storage,
but we're treating keys like directories in S3,
and that all works just fine.
It all works just fine.
So you have this sort of unknown directory structure, right?
Okay.
So there's some root path, some root key that you know,
but underneath that, you have some heterogeneous mix of files. And what you're trying to...
And subdirectories as well?
Yes. Like files and subdirectories. Potentially many layers of subdirectories, right?
And you have
no idea what the structure is.
And what you're looking for is files.
And for the sake of simplicity,
let's just say that they're all of one type.
Let's say CSV files.
So you're looking for CSV files
that have the same base path.
So one of the subdirectories in the hierarchy of subdirectories, some common one, contains underneath it a homogeneous set of CSV files, all with the same schema.
Right. So they can be in different subdirectories.
And an example of this would be, like, separated out by date, right? So maybe you have something where it's like,
you know, you've got a base path
and then under that,
you've got a subdirectory per day.
And then in each of those days,
you have, you know, a handful of CSV files
and you want to be able to detect
that that base path has all these different files
and they all have the same schema
and there are no other files in there that have a different schema.
So the part of this problem that I've already solved is a sort of schema detection. So I have,
I already have an algorithm where you can, you know, take basically a stream and send it in
there and be like, hey, this is a CSV file. I want you to inspect the headers, I want you to inspect the actual values in the CSV file and, you know, detect what the types are.
Like none-ness or otherwise of them or, yeah, date times or strings or whatever.
Yeah, yeah.
It's sort of like all the, it devolves into string, right?
It's sort of like, if I can't figure out what this is, I'm just going to call it a string,
right?
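A minimal sketch of the kind of per-column type inference being described, where the inspector tries a few parses and devolves to string when nothing fits; the function name and the exact set of types are assumptions for illustration, not the real inspector:

```python
from datetime import datetime

def infer_type(values):
    """Infer a column type from sample string values, devolving to 'string' if nothing fits."""
    def all_parse(parse):
        try:
            for v in values:
                if v != "":          # treat empties as nulls rather than type evidence
                    parse(v)
            return True
        except ValueError:
            return False

    if all_parse(datetime.fromisoformat):
        return "datetime"
    if all_parse(int):
        return "int64"               # a real inspector might also distinguish int32 by range
    if all_parse(float):
        return "float"
    return "string"                  # the catch-all: if we can't figure it out, call it a string

print(infer_type(["2023-04-14", "2023-04-15"]))  # -> datetime
print(infer_type(["1", "2", "3"]))               # -> int64
print(infer_type(["1.5", "oops"]))               # -> string
```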
But if I can parse it as, you know, a datetime or a float or a 64-bit integer or a 32-bit integer, that's what's derived.
Yeah, you infer it.
Yeah. Anyway, that's the solved part.
Yeah, that's already done.
So that works. So now, the algorithm I need to implement here is: you're given a bucket, you know, an S3 bucket, you're given a prefix within that bucket, and now you have to scan that prefix and come up with a list of schemas, where the schema contains the results from the inspector. So, like, the columns and the types, but more importantly, the paths. And the tricky bit here is, it's entirely possible that you scan this and you come up with an empty list.
Because if the structure that you're given is just a hodgepodge of heterogeneous file types and there is no root key that gives you a homogeneous set, then there's nothing to do here because what we're producing from
this is basically a queryable table where you can give it that path prefix and say,
everything underneath this has this schema, has this type.
Okay. So you're trying to find the highest point, treating it like a graph with leaf nodes where the files are. You want to find the highest point where all the children leaf nodes that you can find from that point have the same schema, and then you want to find the set of the highest points where there are different schemas. Is that fair to say?
Yes.
So, like, if you had a directory A and a directory B, and then under A is all these dated directories. So A slash 2023-04-14...
Yep.
...slash some other thing, slash whatever thing, and then there's a file.csv. And if you could say, like, well, all of the files for all the days, for all the measures under this root, under A, have the schema where it's a datetime and a value. Yeah, there's three columns with these names and these types, and they're all the same.
Yes, exactly. Across the whole of that.
And then, conversely, on the B bucket, there's a different set of things, because it's maybe a completely different set of data. But you don't know that information a priori.
Exactly.
So you just want, just...
Yeah, that's an interesting problem. And there are some sort of S3 specifics that I can think of that would either make this easier or harder. And depending on how efficient it needs to be, that's also a question to consider, right? You know, parallelism or otherwise. So I'm going to think out loud at you and see where we go from there, but if I'm gibbering too much...
So, logically, you can just ask S3: give me a dump of everything you have, and it will just start from A and give you every single giant long path name of everything, because really, under the hood, it's just a key-value store where the key is the path and the value is the contents of the thing.
So that I have implemented. I have basically a recursive search from the prefix, where it will walk recursively and give you...
Right. Well, that's the trick: you don't have to do that recursion, because...
Oh, no, I'm not doing it. The S3 client is doing that for me.
Yeah, and it's not even doing it, right? The client's not doing it either. The way S3 works is, if you give it a delimiter, it will use that as a point to stop, which will then mean that necessarily you'll hit, like, each level of the directory. But if you don't give it a delimiter at all, it just does a flat-out listing of the keys. So there's no sort of hierarchy, except the hierarchy that you put on top of it by saying, by the way, the forward slash is magic to me.
Right, right, right. You're just listing all the keys that start with that prefix. It's really nothing more than that.
Exactly. But so you could go the route where you do that, which is efficient from the point of view of just paging through the results from S3, and you get this giant list of things. But then you kind of want to reinfer afterwards the structure of the directories, having been given this list of things. So you could treat it like a tree, and so for every one of those nodes, you construct a tree in memory for each directory, and the directory has children, and the children may be other directories or they may be files, and files have schemas, right? Or you could just treat it as a schema, right? And then it maybe flows naturally from that. So if you've got it all in memory, which may not be feasible given the scale that you're working at...
I think it's okay to have it all in memory, actually. Yeah, I think that would be fine.
But then, yeah, you could write the sort of recursive thing of, say, give me the set of your children.
Oh, let's think about this.
One question you could ask is, you know, do you have homogeneous children?
And that can be recursively applied.
You know, like, you say, if all my children that are files have the same schema, and then if I ask the same question of my directory children, what is your common schema, and it's the same as my file children's... assuming that you're allowed to mix files and directories, which maybe you're not in this instance, maybe the leaves are strictly at the very, very bottom of the tree, but you could generalize to both, right? But then, essentially, for the directories, am I homogeneous, do I have a homogeneous schema, is the answer to: are all my files the same, and if I have directories, are they also homogeneously the same schema? And then it sort of naturally falls out. And you could memoize that, so it's pretty quick and easy. And then, obviously, what you do is you start at the very, very top of the graph and you say, are you homogeneous? And if it says yes, you're done, right? There you are, everything's the same at this point. If it's not, you say, okay, well then let me find all your children and ask them, are you homogeneous? And as soon as you find one that says yes, that is the root of a new tree that you never need to inspect further after that.
So obviously it has built up the tree, and then you do a breadth-first search across that tree, looking for nodes whose children are homogeneous, basically.
Yes. And you can stop at that, but obviously you are still visiting all the nodes, because in order to ask the question, are you homogeneous, internally it's going all the way down and back up again. But if you memoize it, then you don't have to hit the nodes more than once. So that would be a top-down way of doing it.
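A rough sketch of that memoized, top-down homogeneity check, assuming the listing has already been parsed into an in-memory tree; the Node class, the schema representation, and the function names are all made up for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One directory in the in-memory tree built from the full key listing."""
    name: str
    children: dict = field(default_factory=dict)    # subdirectory name -> Node
    file_schemas: set = field(default_factory=set)  # schemas of CSV files directly in this directory
    _memo: frozenset | None = None                  # memoized schemas seen anywhere below

    def schemas_below(self):
        """Every distinct schema under this node; memoized so each node is computed once."""
        if self._memo is None:
            schemas = set(self.file_schemas)
            for child in self.children.values():
                schemas |= child.schemas_below()
            self._memo = frozenset(schemas)
        return self._memo

    def is_homogeneous(self):
        return len(self.schemas_below()) == 1

def homogeneous_roots(node, prefix=""):
    """Top-down walk: stop at the first homogeneous node on each branch and report it."""
    path = prefix + node.name + "/"
    if node.is_homogeneous():
        return [path]                # everything underneath shares one schema; no need to go deeper
    roots = []
    for child in node.children.values():
        roots += homogeneous_roots(child, path)
    return roots
```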
Is there a way to do this where you sort of do, like, an order-N pass through the keys?
Because I, you know, sort of the piggybacking off of what the API from S3 is giving us is
sort of like, all right, I'm going to iterate through all these keys, right?
Yeah.
And then along the way, I'm going to take each key.
I'm going to do the inspection of the underlying data at that point, which is actually the most expensive part computationally, right?
Yeah.
Like, you know, there might be a million keys, but reading all that data is the thing that's going to take a long time, right?
Yeah, and inspecting the schemas.
But you sort of inspect the schemas as you go. And my thought here was that maybe there's some efficiency to be gained from,
once you've detected a heterogeneous sub key,
stop scanning.
There's no point in continuing to do this super expensive schema inspection once you've detected something where it's like, oh yeah, this directory has two different types of files in it. It's got a thousand files, but there's no point in checking the other 998, because these two are not the same.
Yeah, that's very true. So you could definitely... I mean, I guess that's an error condition, or an under-specification when we were talking about this to start with, which was, you know, I was assuming that you could partition the world at some point, you could find these top rooted keys. But if you find a node for which its very leaves have mixtures, then that node is bad, a bad node. Like you say, well, I can't do anything with this.
Right, right. Yeah.
But yeah, to answer your earlier question, to do this incrementally, I think...
Effectively, given one long key string name and the results of what schema this is, you can of course create speculatively, or look up if they already exist, each of the nodes in the graph, by splitting on the slash and then saying, okay, well, let's find the node that's called bob, and then the one called ian, and then the one called tracy, and then here is the 010403, and then this is, you know, moo.csv, which is where I am. And if you've just newly created a node there, then definitionally you'll say, well, at the moment, the set of all schemas that I've seen is what I've just parsed this schema to be. And this is obviously a leaf node, or rather a leaf directory, it has no further directories under it. If you're writing the second schema for that node, you know, like you said, okay, this node has direct children, stop, this is silly now. Any more things that land in this node, it's already poisoned, don't bother looking at it anymore.
Yeah.
But if not, then you've constructed the graph, except the only thing you haven't done is you haven't filled in the schema sets of the directories that only have directories as children. But that's fine, because that's what the second part of the algorithm is going to do. Once you've finished going through it, the breadth-first pass says, I don't have to look at the data anymore, because I already have the list of all the nodes, the leaf nodes, and whether or not they are poisoned, because they have a mixture or they have exactly one schema type. And then their parent... and you could potentially do this online as well, where you say, okay, every time I update a node's schema, you can update its parents to say, hey, also, by the way, this is the second thing.
Like I was thinking about this and I'm like, I could do this by like a hash map where when you process a single key, you say, I'm going to create an entry in the hash map for this directory and each of its parents, right?
I mean, what you're doing is you're storing a tree in a hash map, kind of implied, right?
Yeah, yeah. But there are no explicit nodes; the way you can go up and down is, the way that you go up is to just, like, truncate your key and look at the hash entry for that bit.
Yeah, so that would work as well. Right. Obviously, the entry there would be, like, the set of schemas.
Well, it's just the one schema, right? Because if you ever detect a conflict, well, you'd have to mark it as basically no good, right?
So it's either the schema or a marker that says this subkey is no good, right?
Right.
Okay.
But that wouldn't let you answer the question like in my original example, if you had the
A and the B top level nodes, the root would be poisoned and say, no, I have a mixture.
Right.
But that's fine, because you'd have an entry in the map for A that was good, and you'd have an entry in the map for B that was good.
Okay. But if you just have those entries... right, so, having filled in this hash map, how do you interrogate it? You need to know what the keys are, and you don't actually know what the keys are anymore. If you had the tree, you know what the nodes of the tree are, and you know from each node what children it had.
Yeah.
But you can't do that in the hash map without knowing the list of all those.
I mean, maybe I'm missing something here, but my thought was, like, okay, at the end of that sort of happy path example, I would have A as an entry and I would have B as an entry, and I would have an entry for each one of B's subdirectories and each one of A's subdirectories.
Right, right. So in my example, you have A and B and all the dated directories and all those things underneath it. You've got a hash map with many, many thousands of entries, and you're just going to walk through the hash map in whatever random order the hash map is going to give you out the nodes that are in it.
Yeah, and then from there you look at the keys of the hash map and you find the shortest.
Right, like the highest-level paths.
Right, right. So effectively, that's how you... yeah, that's the bit I was missing, the fact that it's not necessarily that straightforward to go from here's a hash map of all the keys to what's the shortest key. Well, now I have to go over every single entry in the hash map.
Yeah, it's like order N. It's like order N twice, it's 2N, right?
Yeah, as opposed to order N and then log N effectively, or N log N... not even N log N, no. I'm just not going to think too hard about that. Whereas if you'd built the tree as you were going along instead of the hash map, then you just walk it from the top, and the answer is either the top node or one of its children further down.
I think they're equivalent.
It depends how many leaves you end up throwing
into the hash map.
I mean, I guess you don't have to store
the actual files in the hash map.
That's the trick here. You don't need to store the actual files in the hash map.
You don't need to store their schema.
You only aggregate them at the directory level.
And so you are talking about, yeah.
And I think it would let you sort of abandon early where it's like when you check a key, you say, okay, is this parent already in the map marked as bad?
Because if it is, I don't need to inspect it, right?
There you are.
You turned an order N operation into an N log N one.
Yeah.
I think.
No, is that?
No, because for every N, you have to then go,
okay, there are potentially log N parents of me,
and I have to check them as I go.
Yeah, but I mean, anything that lets you skip the inspection process is going to be worthwhile.
Oh, but the inspection process would have happened when you populated the map. The reading bit out at the end is the trivial part, right? So I think our algorithm, then, putative or otherwise, is: grab the big old list of full keys from the S3 API as it's streaming through. Construct, or look up, as you say, in this hash map: have I done some subset of these things already? And maybe if you find it's poisoned already, you just skip it, right? If you've already found an entry that says, look, there are already two schemas somewhere along the list of that. Now again, if you had a tree, you walk down the tree to where you are, or walk up from the node you just created, but if you have a hash map, you break it up and you look up those things, which again is equivalent, right?
Yeah.
But assuming that so far you haven't found a node that has at least two schemas already, you go, well, now I need to see whether this schema either matches the one schema that is there, or, if there is no schema there, I need to populate the entry with this schema. So now, you know, you're cursed to go and do a long lookup and read the data and parse through it and whatever. You come back with a new schema, and if there was nothing there, you say, this is the schema of this. And then you walk up the graph and you say to all of the nodes there, hey, this is the schema. Or you check against the schema to make sure that it's the same schema as any of the parents. And as soon as you hit two, you know, well, this is poisoned now. We don't need to do anything under this point in the tree again. Which obviously means you're going to run through the API, kind of going list, list, list, list. But very quickly you'll go, yeah, I don't need to look at it, I don't need to look at the expensive part.
Okay, so that's the walking process. And then at the very end, you look over the whole map, and you find... and this is where I said it gets more tricky. You go through all the keys and you try and find the shortest key that has not been poisoned. But then if you want to find the set of the shortest keys that are, like, separable and not poisoned, it's harder to do that, I think, from just walking through the keys, because you have to essentially infer the graph structure again by breaking the key on a slash, in order to say, hey, is this... because it's not even shortest, right? You know, you could have two directories with longer names, one's got a longer name than the other, and you're like, it's not shorter, it's about the number of levels, right? So I don't know if you're saving anything by having it in the hash rather than just accepting that the tree is going to be part of your life. It's not that difficult, and you can answer the question a little bit.
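A rough sketch of the hash-map version being circled here: stream the keys once, poison a prefix when it sees a second schema, skip the expensive inspection when a file's own directory is already poisoned, and pick out the shallowest unpoisoned prefixes at the end. The names, the key layout, and the exact skip rule are one reading of the conversation, not the actual implementation:

```python
POISON = object()  # marker: more than one schema has been seen under this prefix

def directory_prefixes(key):
    """Every directory prefix of a key: 'a/b/c.csv' -> ['', 'a/', 'a/b/']."""
    parts = key.split("/")[:-1]
    return [""] + ["/".join(parts[: i + 1]) + "/" for i in range(len(parts))]

def parent_of(prefix):
    """'a/b/' -> 'a/', 'a/' -> '', '' -> None."""
    if prefix == "":
        return None
    parts = prefix.rstrip("/").split("/")
    return "/".join(parts[:-1]) + "/" if len(parts) > 1 else ""

def scan(keys, inspect_schema):
    """Stream the key listing once, mapping each prefix to its one schema or to POISON."""
    prefixes = {}
    for key in keys:
        parents = directory_prefixes(key)
        # Early abandon: if the file's own directory is already mixed, its ancestors
        # are mixed too, so the expensive read-and-infer step teaches us nothing new.
        if prefixes.get(parents[-1]) is POISON:
            continue
        schema = inspect_schema(key)      # the expensive part: fetch the object, infer types
        for p in parents:                 # walk "up the graph", updating every ancestor
            seen = prefixes.get(p)
            if seen is None:
                prefixes[p] = schema
            elif seen is not POISON and seen != schema:
                prefixes[p] = POISON      # second distinct schema: poison this prefix
    return prefixes

def homogeneous_root_prefixes(prefixes):
    """Shallowest unpoisoned prefixes: good entries whose parent is absent or poisoned."""
    roots = []
    for prefix, schema in prefixes.items():
        if schema is POISON:
            continue
        parent = parent_of(prefix)
        if parent is None or prefixes.get(parent) is POISON:
            roots.append((prefix, schema))
    return roots
```

In the A-and-B example from earlier, the root prefix ends up poisoned, but "a/" and "b/" each keep their single schema, so the final pass reports exactly those two.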
Yeah. I mean, the only reason I would lean toward a tree is because it's like an out-of-the-box data structure, and I need to do, I think, very little to... I mean, I guess the values are probably going to be reasonably complex, but anyway.
I don't know that they are, even, right? But that's that. The other thing that's worth thinking about, if we're just talking in generalities, is that even scanning over that many files takes a long time. And if you were to do multiple parallel scans, one starting at A, one starting at B, one starting at C, or whatever, the usual tricks that one does to scan over S3's list function faster...
Yeah.
...then I think both of these approaches still work. If you have multiple threads and a thread-safe data structure or map, then you might occasionally do more work than you needed to, because there were two threads that were both scanning files where a third thread discovers that actually this is a futile exercise, you know, that this point in the tree is poisoned. But as long as you can handle that situation, you only ever do N pieces of extra work in the worst case, where you have N threads.
Right.
But in the good case, you are mostly able to, yeah, parallelize on that. Although, to be honest, probably just scanning through is fast enough, and then you're going to be stuck doing the actual parsing, and that can be multi-threaded, because you just have a work queue and say, hey, here's some things I need you to do, go and get them done, and when you come back to me with the results, then I can fill in the tree.
Yeah, yeah. We were talking the other day, actually, about my efforts to scale this system horizontally, and that is one of the prime use cases of that work: being able to basically delegate work out on a per-key basis. Like, hey, could you scan the schema for this CSV file for me? It's probably going to take you, like, three minutes. But separate that out and then do that work that way.
That's perfect.
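A minimal sketch of that work-queue idea, assuming the per-key schema inspection is the slow part and using a thread pool to fan it out; `inspect_schema` and the rest of the names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def inspect_all(keys, inspect_schema, max_workers=8):
    """Fan the slow per-key schema inspection out to a pool of worker threads.

    The listing stays single-threaded; only the expensive read-and-infer step is
    parallelized, and the (key, schema) pairs come back ready to be folded into
    the tree or the prefix map on the main thread.
    """
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(zip(keys, pool.map(inspect_schema, keys)))
```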
Yeah, what a fun question. I was going to say, because this is Two's Complement, the other thing that we can talk about here is the test cases.
Of course. What tests would you write for this? And more importantly, maybe, what order would you write those tests in?
Because the most basic test, if I was going to sit down, I am about to sit down and write this.
Right, literally I interrupted you from doing this.
But the test I'm probably going to start with is, if there are no files in the bucket, return me an empty list.
Right.
This is sort of a zombie approach.
Zombies approach.
Yeah, yeah, exactly. I forget what the B is. Oh, we can do this: boundaries. Yeah, boundaries, interface... I'm just making up parts. Anyway, the trick is the Z-O-M part is, you know, test with nothing, test with one, test with a few of them.
Yeah, yeah. So, yeah, the first thing you do is say, hey, empty directory.
Yeah, so, like, you know, should return an empty, I'm just typing this out, list if there are no files, right?
And that's easy enough to write.
I just make an empty bucket using my fancy, my fake, fake but kind of real in-memory bucket.
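A sketch of that first test, assuming pytest-style tests and hypothetical `FakeBucket` and `scan_schemas` names standing in for the real in-memory fake and the scanner entry point, whatever they are actually called:

```python
# Hypothetical imports: these module paths and names are illustrative.
from myproject.testing import FakeBucket
from myproject.scanner import scan_schemas

def test_should_return_an_empty_list_if_there_are_no_files():
    bucket = FakeBucket()                            # zero case: nothing in the bucket at all
    result = scan_schemas(bucket, prefix="data/")
    assert result == []
```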
It's not fake, it's a real thing. Are you actually creating anything for an empty directory, though? I know in S3 you can create, essentially, an empty file, the file that ends with the slash, that causes it to be kind of...
I didn't even know you could do that.
That's what I think happens if you're in the, sorry, if you're in the AWS console and you make a directory: you then see this directory that's empty. Because otherwise, if you make an empty directory, the directory doesn't exist. It's like that thing in source control where you try and create a directory and you check it in, and it's like, no, there's nothing there, because what we're tracking are files, not directories.
Right, right. So I don't even know how you'd write that test.
Well, I mean, I could just make an empty bucket, right?
Right. Okay, so that's like a just completely empty bucket.
Mm-hmm. Yeah.
And then assert that it returns nothing. Yeah, and then assert that I get back an
empty list. And then, of course, the next obvious one is I'm going to make one file,
and I'm going to make sure that, you know... I don't even... so here's an interesting thing. I think a rookie mistake when writing that kind of test...
So, you know, you're about to describe the way that I'm doing it.
Okay, tell me, tell me.
So the rookie mistake when writing that kind of test would be to basically unnecessarily test through into the schema inspection algorithm. The schema inspection algorithm is already written, there are already tests for that, and through those you know that it works. Transitively, as long as you call the schema inspection routine with the right path, trust that it works, and so don't care about that bit.
Yes, yeah. All right, so I'm going to write the test that's like,
I'm going to write a file into this thing, and then I'm going to have it run the whole scanning inspection algorithm, and I should get back one schema. These are the things I'm going to assert, and these are the things I'm not going to assert. I'm going to assert that I got one.
Yep.
I'm going to assert that the path into the file is what I expected.
Yep.
I'm not going to assert the number of columns. I'm not going to assert the types and names of those columns.
Right, right. Because that behavior is already tested.
Now, that's interesting. I would have mocked that out. I don't know if that's where you're going with this. I would have said, okay, with my C++ mindset, I would pass in something which implements the interface of "test this path", and then I would give it a fake one and say, hey, all I'm going to do is record that you said you wanted me to check this schema, and then I'm going to return you back the schema "bob" or whatever, anything. Because at that level, I don't care what you're doing with that information, because I'm already trusting that it's going to work. Whereas you're actually talking, potentially, about calling the schema code but deliberately not looking at the results of it, because you don't care about the results of it. You trust that they work. So, equivalent, and you don't have to fake anything out.
And you don't have to fake anything out, and you don't have to have an interface with the... and you're already faking out the S3 at, like, a very S3 level.
Yeah, yeah. Right.
But yeah, okay. And that is a very sort of interesting point of debate, and I think there are lots of people that have different opinions on that. I used to be more in the camp of, like, I'm going to fake out the schema inspection algorithm. And these days, the way I think about it is, if I can write this so the test is fast, and the thing that I'm mocking out, or would consider mocking out, the real one, doesn't have any outside interactions, it runs quickly, it does the thing... the phrase that I've heard a lot for this is a sociable test, or testing through is another way that people describe this. But it's basically like some number of hops, one or two, or maybe three, three's a little much, but one or two hops down into the sort of, you know, dependency graph of the thing that you're testing is generally fine, so long as it doesn't have any other bad things.
Right. And to be clear here, the sort of fakeable interface you're passing around here for S3 means that it's not like you're making TCP connections and all that stuff. So you've already kind of said, okay, that's the boundary that I'm prepared to draw, and if I go one or two steps away from that, sort of Law of Demeter breakingly, then that's okay. I'm testing more of my system, but I'm also not going to rely on it, because that is not the purpose of these tests.
Exactly. Furthermore, if I break my schema detection system in some subtle way that doesn't matter for these tests, I don't want these tests to unnecessarily fail.
Exactly, exactly. You don't ever want the tree walker scanner code tests to fail because, like, oh yeah, this CSV file doesn't have a comma in the right place or something like that. That's just going to be a not informative test. You're going to be very confused as to how you possibly broke that.
Right.
Right.
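A sketch of the one-file test with the assertions dialed the way Ben describes, reusing the hypothetical `FakeBucket` and `scan_schemas` names from above; the `put` method and the `.path` attribute are guesses for illustration:

```python
def test_should_return_one_schema_for_a_single_csv_file():
    bucket = FakeBucket()
    bucket.put("data/2023-04-14/moo.csv", b"time,value\n2023-04-14T00:00:00,1.5\n")

    result = scan_schemas(bucket, prefix="data/")

    # Sociable assertions: exactly one schema, rooted where we expect it...
    assert len(result) == 1
    assert result[0].path == "data/"
    # ...and deliberately nothing about column names or types. The schema
    # inspector has its own tests for that behavior, so breaking it in a way
    # that doesn't matter here should not fail this test.
```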
So, yeah. So the assertions that you pick here basically are going to determine how much you're going to be able to use these tests to refactor, versus how much you're basically just pouring concrete around your code, right? Because, like, the most painful... I think, and people talk about this all the time, it's sort of like, yeah, you know, I've heard people say, and it just makes my head explode, it's like, yeah, I don't like writing tests because it makes it hard to refactor. And I'm just like, oh my god.
Wow, it's really very emotional. I know. Listening, you can't see him, but he's always crying here.
I hear that and I'm like, you don't understand what any of those words mean. Any of them. None of the words in that sentence were correct. The trick here is, when you over-specify, and it sort of comes almost from a good place, where you're just like, I want to make sure my code works, I'm going to write lots of good tests and I have lots of assertions and all these other things. But the problem is, when you over-specify, you're not doing the core job of what the tests are there for. The tests are there to assert that you have the behavior that you want.
But they should be agnostic about the behavior that at that level of testing, you don't care about.
And this is exactly that, right?
This tree walker scanner thing shouldn't care at all about whether or not the schema inspector is doing its job correctly.
And if you overspecify those tests, you're going to make it harder when you come back later.
You're like, oh, we need to change the way that the schema inspector works, right?
And you go and you change it, and then you have all of these failing unrelated tests,
and you're like, why are these tests failing? Well, because you over-specified, right?
Yeah.
So the dial here that you have to turn, and you really want to get it in exactly the right spot
if you can, is to specify the things that you care about for this code that is under test and not over
specify this thing if you're going to do this sort of sociable style.
Now, the other alternative here is what you were just saying, where it's like, no, I'm
just going to mock that out.
So you can change the schema inspector all you want.
It's not going to break these tests at all.
And I have a hard guarantee of that.
And there's lots of situations in which you might actually want to do that.
But what I'm about
to embark on here in the next 20 minutes when we're done with this conversation is trying to
set that dial exactly where I want it so that I have the confidence that the code that I'm writing
right now works without having over-specified into this thing that I already know works,
that I already wrote tests for, that I've actually already used in other contexts.
Yeah, yeah.
And I'm not coupling those things together.
So, what, one file.
So when you said that, before you went off on this,
not even tangent, this is a very relevant thing.
This is a major soapbox moment here.
Yeah, it was though.
Get off my lawn, you stupid kids.
But the thing that immediately sprung to mind after, like, one file is: one file with a varying number of subdirectories, in the root, and then 12 different things. Because again, that's like a zero, one, many kind of thing, where it's like, well, everybody knows the edges.
The boundaries. That's the B part of ZOMBIES, isn't it?
Yeah, boundaries. It's like, well, if it's in a subdirectory, that makes sense, I'm sure I've tested that. But who thinks to put one of these files directly in the root, right? That's probably not what you're ever planning on doing, it'd probably be a mistake to do it, but that is where my instinct tells me a body will be buried, and you'll find something interesting there. Just like, you know,
having a file that's empty,
which probably doesn't affect this in this particular case
because you've already tested your schema tracker,
but that would be another thing I think of.
Oh, yeah, someone made a file, but it's not got anything in it.
That's kind of like an edge.
That would be an edge case for your schema validator.
Yeah, yeah.
And so the interesting thing about that test is
I was having a conversation about this with our mutual friend, Mr. Pat Farley, yesterday, actually.
Patrick Farley.
Of the London Farleys.
Of the London Farleys. If I'm not sort of like lazy enough implementing the solution after I write the, you know, should handle one file, right?
If I'm like, I know how this should work and I just sort of type it out.
It is possible that I write that should handle varying subdirectories test and it passes.
Right?
Yeah.
I would hope so if you did it right, but...
Well, maybe...
So that's the thing, is if what I did is I took a step too far when I was making the
implementation, and I basically wrote untested code, right?
So I have this test for should handle one file, and I go in there and I write...
I don't write the most basic thing that I can think of.
I don't write a simple thing.
I write like, oh, well, this is what this algorithm needs to be.
So I'm just going to type it all out, right?
What I've done is I've written untested code because almost certainly I could change some
of that code or delete it or simplify it and my test would still pass, right?
And I might notice that if I go and I write that should handle varying subdirectories and it passes.
And the question that you could ask yourself at that point is, have I written untested code before and now I've tested it and now I probably want to make sure that the test, the new test that I wrote actually covers those cases.
And I might actually go in and mutate that code and break it and make sure that it fails.
Or is that test actually unnecessary?
Obviously, that case is something we need to handle.
But if that test doesn't drive out any new code, do I need it?
Or maybe more realistically, do I need the first one that I wrote?
Can I now delete the should handle one file?
Because it's impossible to have an implementation that handles varying subdirectories that doesn't also handle one file.
So that's a really interesting question, because there's, like, the code that I have just written, the implementation code that I know that I wrote in a particular way, and I'm testing that with knowledge very much of how I'm driving it out. You know, if you were going for 100% code coverage
or that kind of nonsense
that we don't really put much faith in.
But you're like, oh, I have to handle this case
because in my implementation, I treat them differently
or I don't treat them differently.
And that would either drive these things
or like, oh, I didn't get 100% coverage
because the if statement that deals with the,
oh, if it's in the root it doesn't get
covered, right? So there's that kind of part. But there's sort of a moral feeling of, like, well, I have been doing this for 20 years, and I know the kind of things where, if I make a change to the implementation, I bet you I break this because of that. And that's why those are more speculative tests. And maybe you have a stronger opinion about that, because of course reasonable people could disagree about, like, well, no one will ever do that thing, right? Well, what if they do? Or what if they change the implementation in a relatively straightforward way that might take advantage of either knowing X or Y?
Right, exactly. Oh, it's the one-file and there's-more-than-one-file kind of thing. Yep, yep.
And that is just a judgment, where it's an art form, isn't it? Right?
Right. This is art, not science.
It's because you're speculating on what the behavior of future humans will be.
That's basically what it is, right?
Like, I'm going to add these tests here, not because they drive out any code, right?
I could delete these tests and I would still have all my existing code very thoroughly tested. But I'm adding these tests because I'm scared of future people, maybe future Ben, who had some comments left for him, coming in and changing this code in what would be a reasonable way.
Yep.
But breaking something that wouldn't then be covered.
Right. Yep. And, like you said, it's an art form. It just comes from experience, really. It just comes from making mistakes on that decision and going, oh my god, I should have done it, that would have been right.
I mean, let's say, you know, you're writing a string library, and you're testing, you know, adding two strings together should return, you know, the strings. And you just know yourself, like, oh, I've realized that the empty string, I should probably make sure that I test adding an empty string to a string. So I'm speaking from experience here with this kind of stuff. You know, well, how could you implement string concatenation where you break the empty string concatenation? And the answer is, well, if you're writing an optimized version in assembly, I guarantee that you haven't put the check in at the beginning of that to say, oh, hang on, if it's zero to start with... because you just immediately want to pick up N bytes of the string and do this kind of stuff. And so that's the kind of thing where you are definitely locking in, you're hedging for the future, of like, well, this is the kind of mistake I can see myself making down the line. And, or, it's almost part of the documentation at some level as well, of like, well, these things should work. You know, the interface allows for you to do this. It's surprising, but it does allow for you to do this, and maybe I should just leave this here as, like, a boundary condition of the interface, to say this is fine.
So, yeah. And one of the other things that you need to consider with that, in addition to sort of guarding against future mistakes, is the thing that we were talking about earlier: writing those kinds of tests can also lead you to accidentally overspecify.
Right. Yeah.
Like you're testing things that are either behavior that you don't necessarily really care about or testing through into other dependent modules or classes or functions.
That's a really fascinating one, actually.
The testing things you don't care about.
I have definitely been guilty of that before, where I have thought to myself, I really want
to make sure that this thing works, but I don't really know what the answer should be.
Right.
And by writing a test, I've kind of mandated that that is what the answer needs to always
be.
Exactly.
Exactly.
And maybe I didn't want to do that.
Right.
Right.
But then I also don't want to have a path through my code that I... but then maybe
that's a question for like, well, fix the path through your code, throw an exception
instead, you know?
I don't know what... do something else to say like, hey, you're in uncharted territory
here, don't do this.
Right, right, right.
I mean, like, an obviously wrong example of that, like, bounding this problem a little
bit, like, an obviously wrong example of that would like bounding this problem a little bit, like an obviously wrong example of that would be like having something that,
that produces a set.
And then you make an assertion about the order.
Right.
Yeah.
Right.
It's like,
you don't care about the behavior.
You don't care about the order of the set.
Why are you writing a test for it?
It's like,
well,
because I wanted to know what it is.
Right.
Or I wanted to specify it or what is it?
It's like,
no,
like we are intentionally leaving that unspecified.
And that has value. The fact that it's unspecified has value, right?
Yeah. It allows us to change our mind later.
So, like, getting to the hash map thing you talked about, right?
Exactly. It's a perfect example of that. Like, when you write the code, the test that's going to iterate through the hash map and then build its world from the hash map entries in the order that they just happen to land in the hash map,
then you're definitely way into the world of like,
well,
I don't even know what sequence these things are going to happen in.
So I can't be too critical of,
of the,
yeah,
that's yeah.
Interesting.
Well,
you better go write this code.
I got to get back to work.
You want to actually go back?
Yeah.
I was going to crack the whip on you. And maybe we'll catch up with this another time and see how it turned out, whether or not any of these algorithms actually worked.
I like that idea a lot.
Cool. Well, we'll hear about it next time, perhaps.
All right. Next time.
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt.
Find the show transcripts and notes at www.twoscomplement.org.
Contact us on Mastodon. We are @twoscomplement@hachyderm.io.
Theme music by Inverse Phase, inversephase.com.