Two's Complement - Ben Walks A Tree

Episode Date: August 22, 2023

Ben ventures into the forest, finds a tree traversal problem, and then fails his will save and gets fascinated by a hash map. Matt suggests zombies. Then they come up with a solution and talk about how to test it, because of course they do.

Transcript
[00:00:00] I'm Matt Godbolt. And I'm Ben Rady. And this is Two's Complement, a programming podcast. Hey, Ben. Hey, Matt. How are you doing, my friend? I'm working. I'm working today.
[00:00:26] You're working? Yeah. Why would you do such a thing? Doing the day job. I am staring at a bunch of comments in my code that are like: hey, Ben. Hey, later Ben. Comments written by earlier Ben. Yeah, past Ben.
[00:00:43] Hey, future Ben, you need to solve this problem. So, you know, solve this problem. And now I'm looking at these comments as future Ben, going: yeah, I guess I've got to solve this problem now. Well, I'll tell you what, that past Ben has a lot of confidence. Misplaced? Was it misplaced confidence? Hopefully not. So this is interesting, actually, in the taxonomy of comments. There's the say-what-you're-doing comment, which is almost always a mistake: the comment says "add three to the blah variable", and then you know that underneath it the code inevitably says "blah variable plus equals four", because the comment rots. So typically one says: well, if you can't make it obvious by writing the code so that it reads like what you want it to say, then you've probably got a problem with your code. But then there's
[00:01:38] the why explanation, which is like: hey, I am doing this because of this thing that's not obvious from the context of the code. Those are okay. But then this is the insert-code-here, this-doesn't-do-anything-yet version of the comment. Yes, yes. What is the problem? This is my task-tracking style of comment. I actually do this thing, we might have talked about this actually,
[00:02:00] where I have a special check in branch builds for to-do comments. So TODO means something very special in this project. It basically means I've queued up some work that I need to do, and you can't merge to the main branch if you've got any of these things left in the code. Interesting. So that's a good discipline to have. Yeah. And it's actually also a great way to collaborate, because you can sit down with somebody and
[00:02:24] put some to-do comments in the code together and be like: oh, we need to fix this, we need to fix this, we need to fix this, and not lose track of the fact that as a group you need to do this work. That's interesting. I've definitely worked on projects before where we've not allowed to-dos to be merged into main: essentially, we can't promote anything to production if it has any to-dos in it. But nowadays I subscribe more to: provided the to-do has a tracking issue associated with it, capturing the finesse of why it's okay to defer it for now, so it can be scheduled and put back on someone's actual work to-do list, then it's okay. But it
[00:02:57] might be nice to have both worlds, where one kind is a long-term to-do, like: hey, TODO, we know this doesn't scale, but for the next year it's likely to be okay; eventually we're going to have to replace this order N squared algorithm with something cleverer. Right, that's the kind of thing you can put in a to-do comment. Obviously, until you forget about it and everything falls over because it's too slow. But that's the whole point of having the tracking. Yeah. But then this other to-do is like: no, I just want to be able to commit so that someone else can pull the branch we're collaborating on, and we can't get this into main.
[00:03:29] Yeah. Right. It will not accidentally go into production without having been looked at. Yeah. Right. On the other end, I have pre-commit hooks that say: if it says "do not commit" anywhere in my repository, then you can't do that. But "do not merge" is something I have written, and I have nothing that prevents it from being merged; it's just a convention, and you hope that when you're diffing things you'll go, oh, whoops. Yeah, so maybe we should enforce that. Anyway, that's not why
[00:03:53] you wanted to talk about this stuff. So yeah, what is this comment? What is this to-do comment? So basically, the problem I have here is a sort of unknown... you can think of it as a directory structure. This is actually in S3 storage, but we're treating keys like directories in S3, and that all works just fine. So you have this sort of unknown directory structure, right? Okay.
[00:04:21] So there's some root path, some root key that you know, but underneath that you have some heterogeneous mix of files. And what you're trying to... And subdirectories as well? Yes, files and subdirectories. Potentially many
[00:04:38] layers of subdirectories, right? And you have no idea what the structure is. And what you're looking for is files. And for the sake of simplicity, let's just say they're all of one type; let's say CSV files. So you're looking for CSV files
[00:04:56] that have the same base path. So one of the subdirectories in the hierarchy, a common one, contains underneath it a homogeneous set of CSV files, all with the same schema. Right. So they can be in different subdirectories. And an example of this would be data separated out by date, right? So maybe you've got a base path, and under that a subdirectory per day, and then in each of those days you have a handful of CSV files,
[00:05:34] and you want to be able to detect that that base path has all these different files, that they all have the same schema, and that there are no other files in there with a different schema. So the part of this problem that I've already solved is the schema detection. I already have an algorithm where you can basically take a stream and send it in and say: hey, this is a CSV file; I want you to inspect the headers, and I want you to inspect the actual values in
[00:06:05] the CSV file, to detect what the types are. The none-ness or otherwise of them, or date-times or strings or whatever. Yeah, yeah. It all devolves into string, right? If I can't figure out what this is, I'm just going to call it a string. But if I can parse it as a date-time, or a float, or a 64-bit integer, or a 32-bit integer, that's what's derived. Yeah, you infer it. Anyway, that's the solved part of it.
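A minimal sketch of that devolve-to-string inference might look like this in Python. All names here are hypothetical, and Ben's real inspector is his own code (it also distinguishes 32- from 64-bit integers, which plain Python ints don't):

```python
from datetime import datetime

def infer_type(value: str) -> str:
    """Guess the narrowest type for one CSV cell, devolving to 'string'
    when nothing else parses. (Sketch only: the real inspector works
    over whole columns and distinguishes integer widths.)"""
    if value == "":
        return "none"  # empty cell carries no type information
    for parse, name in ((int, "int64"),
                        (float, "float64"),
                        (datetime.fromisoformat, "datetime")):
        try:
            parse(value)
            return name
        except ValueError:
            pass
    return "string"
```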
[00:06:30] Yeah, that's already done, so that works. So now, the algorithm I need to implement here is: you're given an S3 bucket, you're given a prefix within that bucket, and you have to scan that prefix and come up with a list of schemas, where each schema contains the results from the inspector, so the columns and the types, but more importantly the paths. And the tricky bit here is that it's entirely possible you scan this and come up with an empty list. Because if the structure you're given is just a hodgepodge of heterogeneous file types, and there is no root key that gives you a homogeneous set, then there's nothing to do here, because what we're producing from this is basically a queryable table, where you can give it that path prefix and say: everything underneath this has this schema, has this type.
[00:07:37] Okay. So, treating it like a graph with leaf nodes where the files are, you're trying to find the highest points where all the leaf nodes you can reach from that point have the same schema, and then you want to find the set of those highest points, which between them have different
[00:07:58] schemas. Is that fair to say? Yes. So if you had a directory A and a directory B, and under A is all these dated directories, so a/2023-04-14/some-other-thing/whatever-thing, and then there's a file.csv, then you could say: well, all of the files, for all the days, for all the measures under this root, under A, have the schema where it's a date-time and a value. Yeah, there are three columns with these names and these types,
[00:08:32] and they're all the same. Yes, exactly, across the whole of that. But conversely, under B there's a different set of things, because it's maybe a completely different set of data. And you don't know that information a priori. Exactly. Yeah, that's an interesting problem. And there are some S3 specifics I can think of that would make this either easier or harder, and depending on how efficient it needs to be, that's also a question to consider, right? Parallelism or otherwise. So I'm going to think out loud at you and see where we go from there, but stop me if I'm gibbering too much. Logically, you can just ask S3: give me a dump of everything you have, and it will just start from A and give you every
[00:09:26] single giant long path name of everything, because really, under the hood, it's just a key-value store where the key is the path and the value is the contents of the thing. So that I have implemented: I basically have a recursive search from the prefix, where it will walk recursively and give you... Right, well, that's the trick: you don't have to do that recursion. Oh no, I'm not doing it, the S3 client is doing that for me. Yeah, and it's not even doing it, right? The client's not doing it either. The way S3 works is: if you give it a delimiter, it will use that as a point to stop, which means you'll hit each level of the directory; but if you don't give it a delimiter at all, it just does a flat-out listing of the keys, with
[00:10:10] no... So there's no hierarchy, except the hierarchy you put on top of it by saying: by the way, the forward slash is magic to me. Right, right. You're just listing all the keys that start with that prefix; it's really nothing more than that. Exactly. So you could go the route where you do that, which is efficient from the point of view of just paging through the results from S3, and you get this giant list of things, but then you kind of want to re-infer the structure of the directories afterwards, having been given this list.
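In boto3 terms, the flat, no-delimiter listing being described looks roughly like this (the client and paginator are the standard AWS SDK ones; bucket and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def all_keys(bucket: str, prefix: str):
    """Flat listing: with no Delimiter, S3 just pages back every key
    starting with the prefix; any '/' hierarchy is purely our own
    convention. (Passing Delimiter='/' instead would stop at each
    'directory' level and return CommonPrefixes.)"""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]
```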
[00:10:54] Or you could treat it like a tree: for every one of those keys you construct nodes in memory, one per directory, and the directory has children, and the children may be other directories or they may be files, and files have schemas. And then it maybe flows naturally from that. So if you've got it all in memory, which may not be feasible given the scale you're working at... I think it's okay to have it all in memory, actually. Yeah, I think that would be fine. But then you could write the sort of recursive thing of, say: give me the set of your children. Or, let's think about this: one question you could ask is, do you have homogeneous children? And that can be recursively applied.
[00:11:38] You say: do all of my children that are files have the same schema? And then, if I ask the same question of my directory children, what is your common schema, is it the same as my file children's? That's assuming you're allowed to mix files and directories, which maybe you're not in this instance; maybe the leaves are strictly at the very bottom of the tree, but you could generalize to both, right? So essentially, for a directory, "am I homogeneous", "do I have a homogeneous schema", is the answer to: are all my files the same, and if I have directories, are they also homogeneously the same schema? And then it sort of naturally falls out, and you could memoize that, so it's pretty quick and easy. And then obviously,
[00:12:31] what you do is start at the very top of the graph and say: are you homogeneous? And if it says yes, you're done right there: everything's the same at this point. If it's not, you say: okay, then let me find all your children and ask them, are you homogeneous? And as soon as you find one that says yes, that is the root of a subtree that you never need to inspect further. So having built up the tree, you do a breadth-first search across it, looking for nodes whose children are homogeneous, basically. Yes. And you can stop at that. But obviously you're still visiting all the nodes, because in order to ask the question "are you homogeneous", internally it's going all the way down and back up again. But if you memoize it, then you don't have to hit the nodes more than once. So that would be a top-down way of doing it.
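As a sketch of that top-down, memoized question plus the breadth-first pass over it (all names hypothetical; schemas are assumed to be hashable values produced by the already-written inspector):

```python
from collections import deque
from dataclasses import dataclass, field

POISONED = object()  # sentinel: this subtree mixes schemas

@dataclass
class DirNode:
    children: dict = field(default_factory=dict)    # name -> DirNode
    file_schemas: set = field(default_factory=set)  # schemas of direct file children
    _memo: object = None

    def common_schema(self):
        """The single schema shared by every file below this node, or
        POISONED if there's a mixture. Memoized, so repeated questions
        touch each node only once."""
        if self._memo is None:
            schemas = set(self.file_schemas)
            schemas.update(c.common_schema() for c in self.children.values())
            self._memo = schemas.pop() if len(schemas) == 1 else POISONED
        return self._memo

def homogeneous_roots(root: DirNode):
    """Breadth-first search: yield the highest nodes whose whole subtree
    shares one schema, and never descend past them."""
    queue = deque([("", root)])
    while queue:
        prefix, node = queue.popleft()
        if node.common_schema() is not POISONED:
            yield prefix, node.common_schema()
        else:
            for name, child in node.children.items():
                queue.append((f"{prefix}{name}/", child))
```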
[00:13:09] Is there a way to do this where you sort of do an order-N pass through the keys? Because, piggybacking off what the API from S3 is giving us, it's sort of like: all right, I'm going to iterate through all these keys, and along the way I'm going to take each key and do the inspection of the underlying data at that point, which is actually the most expensive part,
[00:13:46] computationally, right? Yeah, there might be a million keys, but reading all that data is the thing that's going to take a long time. Right, and inspecting the schemas. But you inspect the schemas as you go, and my thought here was that maybe there's some efficiency to be gained from: once you've detected a heterogeneous subkey, stop scanning. There's no point in continuing to do this super expensive schema inspection once you've detected something where it's like:
[00:14:18] oh yeah, this directory has two different types of files in it. It's got a thousand files, but there's no point in checking the other 998, because these two are not the same. Yeah, that's very true. So, I guess that's an error condition, or an under-specification from when we were talking about this to start with: I was assuming that you could partition the world, that at some point you could find these top rooted keys. But if you find a node whose very leaves
[00:14:45] have mixtures, then that node is bad, a bad node, like you say: well, I can't do anything with this. Right, right. But yeah, to answer your earlier question about doing this incrementally: effectively, given one long key string name and the result of what schema it is, you can of course create speculatively, or look up if they already exist, each of the nodes in the graph, by splitting on the slash and saying: okay, let's find the node called bob, then the one called ian, then the one called tracy, and then here is the 010403, and then this is, you know, moo.csv, which is where I am. And if you've just newly created a node,
[00:15:28] then definitionally you'll say: well, at the moment, the set of all schemas I've seen here is just what I've parsed this schema to be. And this is obviously a leaf node, or rather a leaf directory; it has no further directories under
[00:15:44] it. And if you're writing the second schema for that node, like you said: okay, stop, this is silly now; anything more that lands in this node, it's already poisoned, don't bother looking at it anymore. But if not, then you've constructed the graph. Except the only thing you haven't done is fill in the schema sets of the directories that only have directories as children. But that's fine, because that's what the second part of the algorithm is going to do: once you finish going through it, the breadth-first pass says, I don't have to look at the data anymore, because I already have the list of all the leaf nodes and whether they are poisoned because they have a mixture,
[00:16:35] or they have exactly one schema type. And then their parent... and you could potentially do this online as well, where you say: okay, every time I update a node's schema, you update its parents to say, hey, by the way, here's a second thing. So I was thinking about this, and I'm like: I could do this with a hash map, where when you process a single key, you create an entry in the hash map for this directory and each of its parents, right? I mean, what you're doing is storing a tree in a hash map, kind of implied, right? Yeah, but where there are no explicit nodes. And you can still go up and down, because the way you go up is to just truncate your key and look at the hash entry for that shorter prefix. Yeah, so that would work as well. Obviously the entry would be, like, the set
[00:17:37] of schemas. Well, it's just the one schema, right? Because if you ever detect a conflict, you'd have to mark it as basically no good. So it's either the schema, or a marker that says this subkey is no good, right? Right. Okay. But that wouldn't let you answer the question... like, in my original example, if you had the A and the B top-level nodes, the root would be poisoned and say: no, I have a mixture. Right.
[00:18:03] But that's fine, because you'd have an entry in the map for A that was good, and you'd have an entry in the map for B that was good. Okay, but if you just have those entries... having filled in this hash map, how do you interrogate it? You need to know what the keys are, and you don't actually know what the keys are anymore. If you had the tree, you'd know the nodes of the tree, and you'd know from each node what children it had; but you can't do that in the hash map without knowing the list of all those keys. I mean, maybe I'm missing something here, but my thought was: okay, at the end of that happy-path example, I would have A as an entry, and I would have B as an entry, and I would have an entry for each one of B's subdirectories, and each one of A's subdirectories. Right, right.
[00:18:52] So in my example, you have A and B and all the dated directories and all those things underneath. You've got a hash map with many, many thousands of entries, and you're just going to walk through the hash map in whatever random order it gives you the nodes. Yeah, and from there you look at the keys of the hash map and you find the shortest ones, like the highest-level paths. Right. So effectively that's how you... yeah, that's the bit I was missing: it's not necessarily that straightforward to go from here's-a-hash-map-of-all-the-keys to what's-the-shortest-key. Now I have to go over every single entry in the hash map.
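A sketch of that implied tree-in-a-hash-map, with hypothetical names; POISONED is a sentinel meaning "more than one schema under this prefix":

```python
POISONED = object()  # sentinel: more than one schema under this prefix

def record(prefix_map: dict, key: str, schema) -> None:
    """Fold one file's inferred schema into the map entry for its
    directory and every ancestor, found by truncating the key at '/'."""
    parts = key.split("/")[:-1]  # drop the file name itself
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth]) + "/"
        seen = prefix_map.get(prefix)
        if seen is None:
            prefix_map[prefix] = schema       # first schema seen here
        elif seen is not POISONED and seen != schema:
            prefix_map[prefix] = POISONED     # conflict: mark it no good
```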
[00:19:29] Yeah, it's like order N twice; it's 2N, right? As opposed to order N and then... log N, effectively? Or N log N? No, not even N log N, it's just... I'm not going to think about that. If you'd built the tree as you were going along instead of the hash map, then you just walk in at the top, and the answer is either the top node or one of the nodes just below it. I think they're equivalent.
[00:19:56] It depends how many leaves you end up throwing into the hash map. I mean, I guess you don't have to store the actual files in the hash map; that's the trick here. You don't need to store their schemas individually; you only aggregate them at the directory level. Yeah. And I think it would let you abandon early, where when you check a key, you say: okay, is this parent already in the map marked as bad? Because if it is, I don't need to inspect it, right?
[00:20:38] There you are: you turned an order-N operation into an N log N one. Yeah... I think. No, is that right? Because for every one of the N keys, you then have to go:
Starting point is 00:20:38 okay, there are potentially log N parents of me, and I have to check them as I go. Yeah, but I mean, anything that but i mean anything that lets you anything that lets you skip the inspection process is going to be worthwhile oh but you the inspection process would have happened when you populated the map yeah the reading bit out of the end is the trivial part right so i think our algorithm then putative or otherwise is grab the big old list of full keys from from the s3 api is streaming through construct or look up as you say in this hash map have i done some subset of these things already and maybe if you find it's poisoned already you just skip it right if you've already found an
[00:21:22] Now again, if you had a tree, you'd walk down the tree to where you are, or walk up from the node you just created; if you have a hash map, you break the key apart and look those prefixes up, which again is equivalent. Yeah. But assuming that so far you haven't found a node that already has at least two schemas, you go: well, now I need to see whether this schema matches the one schema that is there, or, if there is no schema there, I need to populate the entry with this schema. So now you know you're cursed to go and do a long lookup, and read the data, and parse through it, and whatever; you come back with a new schema, and if there was nothing there, you say: this is the schema of this.
[00:22:06] And then you walk up the graph and you say to all of the nodes there: hey, this is the schema. Or you check against the schema to make sure it's the same schema as any of the parents. And as soon as you hit two, you know: well, this is poisoned now, we don't need to do anything under this point in the tree again. Which obviously means you're going to run through the API, kind of going list, list, list, list, list; but very quickly you'll
[00:22:28] go: yeah, I don't need to look at this, I don't need to do the expensive part. Okay, so that's the walking process. And then at the very end, you look over the whole map, and, this is where I said it gets more tricky, you go through all the keys and you try to find the shortest key that has not been poisoned. But if you want to find the set of the shortest keys that are, like, separable and not poisoned, it's harder to do that just by walking through the keys, because you essentially have to re-infer the graph structure by breaking each key on the slashes. And it's not even "shortest", right? You could have two directories where one has a longer name than the other, and it's not about shorter, it's about the number of path segments.
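Continuing the same hypothetical sketch, that trickier final read might look like this: a prefix is the root of a homogeneous subtree when it's unpoisoned but its parent prefix is poisoned or absent.

```python
def good_roots(prefix_map: dict) -> dict:
    """Second pass over the finished map: keep the shallowest unpoisoned
    prefixes, i.e. those whose parent prefix is poisoned or missing."""
    roots = {}
    for prefix, schema in prefix_map.items():
        if schema is POISONED:
            continue
        parent = prefix.rstrip("/").rpartition("/")[0]
        parent = parent + "/" if parent else parent
        if not parent or prefix_map.get(parent, POISONED) is POISONED:
            roots[prefix] = schema
    return roots
```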
[00:23:11] So I don't know if you're saving anything by having it in the hash map, rather than just accepting that the tree is going to be part of your life; it's not that difficult, and you can answer the question a bit more directly. Yeah. I mean, the only reason I'm leaning toward it is that it's an out-of-the-box data structure, and I think I need to do very little to it. I mean, I guess the values are probably going to be reasonably complex... but anyway. I don't know that they are, even, right? But that's that. The other thing that's worth thinking about, if we're just talking in generalities, is that even scanning over that many files takes a long
[00:23:53] time. And if you were to do multiple parallel scans, one starting at A, one starting at B, one starting at C, or whatever the usual tricks are that one does to scan over S3's list function faster, then I think both of these approaches still work. If you have a thread-safe data structure or map, you might occasionally do more work than you needed to, because two threads were both scanning files where a third thread discovers that actually this is a futile exercise, that this point in the tree is poisoned.
[00:24:31] But as long as you can handle that situation, you only ever do N pieces of extra work in the worst case, where you have N threads. Right.
[00:24:42] But in the good case, you're mostly able to parallelize on that. Although, to be honest, probably just scanning through is fast enough, and then you're going to be stuck doing the actual parsing, and that can be multi-threaded, because you just have a work queue and say: hey, here are some things I need you to do, go get them done, and when you come back to me with the results, then I can fill in the tree. Yeah.
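A minimal sketch of that work-queue shape, assuming a hypothetical inspect_schema function (the already-tested inspector):

```python
from concurrent.futures import ThreadPoolExecutor

def inspect_all(keys, inspect_schema, max_workers=16):
    """Farm the expensive per-file schema inspection out to a pool of
    worker threads; the cheap listing and tree-filling stay serial."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results pair up with keys.
        return list(pool.map(inspect_schema, keys))
```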
[00:25:22] We were talking the other day, actually, about my efforts to scale this system horizontally, and one of the prime use cases of that work is being able to basically delegate work out on a per-key basis: hey, could you scan the schema for this CSV file for me? It's probably going to take you, like, three minutes, but separate that out and do the work that way. That's perfect. Yeah, what a fun question. I was going to say, because this is Two's Complement, the other thing we can talk about here is the test cases. Of course. What tests
[00:25:39] would you write for this? And more importantly, maybe, what order would you write those tests in? Because the most basic test... if I was going to sit down, and I am about to sit down and write this. Right, literally I interrupted you from doing this. But the test I'm probably going to start with is: if there are no files in the bucket, return me an empty list. Right. This is sort of the ZOMBIES approach.
[00:26:03] The ZOMBIES approach, yeah, exactly. I forget what the B is... oh, boundaries. Yeah, boundaries, interfaces. I'm just making up parts. Anyway, the trick is the Z-O-M part: test with nothing, test with one, test with a few of them. Yeah. So the first thing you do is say: hey, empty directory. Yeah, so, "should return an empty list if there are no files", I'm just typing this out, right? And that's easy enough to write. I just make an empty bucket using my fancy fake, fake-but-kind-of-real, in-memory bucket.
[00:26:44] It's not fake, it's a real thing. Are you actually creating anything for an empty directory, though? I know in S3 you can create essentially an empty file, a file whose key ends with the slash, that causes it to kind of... I didn't even know you could do that. That's what I think happens if you're in the AWS console and you make a directory: you then see this directory that's empty. Because otherwise, if you make an empty directory, the directory doesn't exist. It's like that thing in source control, where you try to create an empty directory and you check it in, and it's like: no, there's nothing there, because what we're tracking are files, not
[00:27:20] directories. Right, right. So I don't even know how you'd write that test. Well, I mean, I could just make an empty bucket, right? Right. Okay, so that's a completely empty bucket. Mm-hmm. Yeah. And then assert that it returns nothing. Yeah, assert that I get back an empty list. And then, of course, the next obvious one is: I'm going to make one file, and I'm going to make sure that... so here's an interesting thing. I think there's a rookie mistake when writing that kind of test. You know you're about to describe the way that I'm doing it, don't you? Okay, tell me, tell me. So the rookie mistake when writing
[00:27:59] that kind of test would be to basically, unnecessarily, test through into the schema inspection algorithm. The schema inspection algorithm is already written; there are already tests for that, and you know that it works. By transitivity, as long as you call the schema inspection routine with the right path, trust that it works, and don't care about that bit. Yes, exactly. All right, so I'm going to write the test that's like: I'm going to write a file into this thing, then I'm going to have it run the whole scanning-inspection algorithm, and I should get back one schema. And these are the things I'm going to assert, and these are the things I'm not going to assert. I'm going to assert that I got one. Yep. I'm going to assert that the path to the file
[00:28:50] is what I expected. Yep. I'm not going to assert the number of columns; I'm not going to assert the types and names of those columns. Right, because that behavior is already tested.
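In pytest style, the first two tests in the order Ben describes might look like this; scan_schemas and FakeS3Bucket are hypothetical stand-ins for his real scanner and in-memory bucket:

```python
# Hypothetical names throughout: scan_schemas() is the tree-walker under
# test, and FakeS3Bucket is the fake-but-kind-of-real in-memory bucket.
def test_should_return_empty_list_if_there_are_no_files():
    bucket = FakeS3Bucket()  # completely empty bucket
    assert scan_schemas(bucket, prefix="") == []

def test_should_return_one_schema_for_one_file():
    bucket = FakeS3Bucket()
    bucket.put("a/2023-04-14/moo.csv", b"when,value\n2023-04-14T00:00:00,1.5\n")
    schemas = scan_schemas(bucket, prefix="")
    assert len(schemas) == 1        # asserted: exactly one schema found
    assert schemas[0].path == "a/"  # asserted: the base path is right
    # Deliberately NOT asserted: column names and types -- that behavior
    # belongs to the already-tested schema inspector.
```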
[00:29:32] Now, that's interesting, because I would have mocked that out; I don't know if that's where you're going with this. With my C++ mindset, I would pass in something which implements the interface of inspect-this-path, and I would give it a fake one that says: hey, all I'm going to do is record that you asked me to check this schema, and then I'm going to return you the schema "bob", or whatever, anything, because at that level I don't care what you're doing with that information; I'm already trusting that it's going to work. Whereas you're actually talking about potentially calling the schema code, but deliberately not looking at the results, because you don't care about the results; you trust that they work. So it's equivalent, and you don't have to fake anything out, and you don't have to have an interface for it. And you're already faking out S3, at a very S3 level. Yeah, right. And that is a very interesting point of debate, and I think lots of people have different opinions on it. I used to be more in the camp of: I'm going to fake out the schema inspection algorithm. But these days,
[00:30:06] the way I think about it is: if I can write this so the test is fast, and the thing that I would consider mocking out, the real one, doesn't have any outside interactions, runs quickly, does the thing... The phrase I've heard a lot for this is a "sociable test"; "testing through" is another way people describe it. But it's basically some number of hops, one or two or maybe
[00:30:33] three, three's a little much, but one or two hops down into the dependency graph of the thing that you're testing is generally fine, so long as it doesn't have any of those other bad properties. Right. And to be clear here, the fakeable interface you're passing around for S3 means it's not like
[00:30:54] you're making TCP connections and all that stuff. So you've already said: okay, that's the boundary I'm prepared to draw, and if I go one or two steps away from that, Law-of-Demeter-breakingly, then that's okay: I'm testing more of my system, but I'm also not going to rely on it, because that is not the purpose of these tests. Exactly. Furthermore, if I break my schema detection system in some subtle way that doesn't matter for these tests, I don't want these tests to unnecessarily fail. Exactly. You don't ever want the tree-walker scanner code's tests to fail because, oh yeah, this CSV file doesn't have a comma in the right place or something like that. Right. That's just going to be a not
[00:31:39] informative test. You're going to be very confused as to how you possibly broke that, right? Right. So the assertions that you pick here are basically going to determine how much you're going to be able to use these tests to refactor, versus how much you're basically just pouring concrete around your code. Because the most painful... and people talk about this all the time. I've heard people say, and it just makes my head explode: yeah, I don't like writing tests, because it makes it hard to refactor. And I'm just like, oh my god. Wow, he's really very emotional. I know. Listeners, you can't see him, but he's always
[00:32:21] crying here. I hear that and I'm like: you don't understand what any of those words mean. Any of them. None of the words in that sentence were correct. The trick here is when you over-specify, and it almost comes from a good place, where you're like: I want to make sure my code works, I'm going to write lots of good tests with lots of assertions and all these other things. But the problem is, when you over-specify, you're not doing the core job of what the tests are there for. The tests are there to assert that you have the behavior that you want, but they should be agnostic about the behavior that, at that level of testing, you don't care about. And this is exactly that, right?
[00:33:07] This tree-walker scanner thing shouldn't care at all about whether or not the schema inspector is doing its job correctly. And if you over-specify those tests, you're going to make it harder when you come back later. You're like: oh, we need to change the way the schema inspector works; and you go and change it, and then you have all of these failing, unrelated tests, and you're like: why are these tests failing? Well, because you over-specified, right? Yeah. So the dial here that you have to turn, and you really want to get it in exactly the right spot if you can, is to specify the things you care about for the code under test, and not over-
[00:33:45] specify, if you're going to do this sociable style. Now, the other alternative here is what you were just saying, where it's like: no, I'm just going to mock that out, so you can change the schema inspector all you want, it's not going to break these tests at all, and I have a hard guarantee of that. And there are lots of situations in which you might actually want to do that. But what I'm about
[00:34:06] to embark on here, in the next 20 minutes when we're done with this conversation, is trying to set that dial exactly where I want it, so that I have the confidence that the code I'm writing right now works, without having over-specified into this thing that I already know works, that I already wrote tests for, that I've actually already used in other contexts. Yeah. And I'm not coupling those things together. So, what, one file... So when you said that, before you went off on this,
[00:34:35] not even a tangent, this is a very relevant thing. This is a major soapbox moment here. Yeah, it was, though. Get off my lawn, you stupid kids. But the thing that immediately sprang to mind after "one file" is one file with a varying number of subdirectories, and one directly in the root, and then 12 different things, because again, that's a zero-one-many kind of thing. Everybody knows the edges, the boundaries; that's the B part of ZOMBIES, isn't it? Yeah, boundaries. It's like, well, if
[00:35:05] it's in a subdirectory, that makes sense, I'm sure I've tested that; but who thinks to put one of these files directly in the root, right? It's probably not what you're ever planning on doing, it would probably be a mistake to do it, but that is where my instinct tells me a body will be buried, and you'll find something interesting there. Just like having a file that's empty, which probably doesn't affect this particular case, because you've already tested your schema tracker, but that would be another thing I'd think of.
[00:35:33] Oh yeah, someone made a file, but it hasn't got anything in it. That's kind of an edge. That would be an edge case for your schema validator. Yeah. And the interesting thing about that test is, I was having a conversation about this with our mutual friend, Mr. Pat Farley, yesterday, actually. Patrick Farley. Of the London Farleys.
[00:35:53] What if I'm not lazy enough implementing the solution after I write the "should handle one file" test, right? If I'm like, I know how this should work, and I just type it all out: it is possible that I write that "should handle varying subdirectories" test and it passes. Right? Yeah, I would hope so, if you did it right, but... Well, maybe. So that's the thing: if what I did is take a step too far when I was making the
[00:36:35] implementation, then I basically wrote untested code, right? So I have this test for "should handle one file", and I go in there and I write... I don't write the most basic thing that I can think of, I don't write a simple thing. I write: oh, well, this is what this algorithm needs to be, so I'm just going to type it all out. What I've done is write untested code, because almost certainly I could change some of that code, or delete it, or simplify it, and my test would still pass, right?
[00:37:01] And I might notice that if I go and write that "should handle varying subdirectories" test and it passes. And the question you could ask yourself at that point is: had I written untested code before, and now I've tested it? And now I probably want to make sure that the new test I wrote actually covers those cases, and I might actually go in and mutate that code, break it, and make sure that it fails. Or is that test actually unnecessary? Obviously that case is something we need to handle, but if that test doesn't drive out any new code, do I need it? Or, maybe more realistically, do I need the first one that I wrote? Can I now delete the "should handle one file"?
[00:37:43] Because it's impossible to have an implementation that handles varying subdirectories that doesn't also handle one file. So that's a really interesting question, because there's the code that I have just written, the implementation code that I know I wrote in a particular way, and I'm testing it very much with knowledge of how I'm driving it out. You know, if you were going for 100% code coverage, or that kind of nonsense that we don't really put much faith in, you're like: oh, I have to handle this case,
[00:38:13] because in my implementation I treat them differently, or I don't treat them differently. And that would either drive these things, or: oh, I didn't get 100% coverage, because the if statement that deals with files in the root doesn't get covered, right? So there's that kind of part. But there's also a sort of moral feeling of: well, I have been doing this for 20 years, and I know the kind of places where, if I make a change
[00:38:35] to the implementation, I bet you I break this because of that. And that's why those are more speculative tests, and maybe you have a stronger opinion about that, because of course reasonable people could disagree about, well, no one will ever do that thing, right? Well, what if they do? Or what if they change the implementation in a relatively straightforward way that might take advantage of knowing either X or Y? Right, exactly. Oh, it's the there's-one-file versus there's-more-than-one-file kind of thing. Yep. And that is just a judgment; it's an art form, isn't it? Right. This is art, not science, because you're speculating on what the behavior of future humans will be.
[00:39:10] That's basically what it is, right? I'm going to add these tests here, not because they drive out any code; I could delete these tests and I would still have all the existing code very thoroughly tested. But I'm adding these tests because I'm scared of future people, maybe future Ben, who had some comments left for him, coming in and changing this code in what would be a reasonable way, yep, but breaking something that wouldn't then be covered. Right. And, like you said, it's an art form. It just comes
[00:39:44] from experience, really. It just comes from making mistakes on that decision, and going: oh my god, I should have done it. Yeah. I mean, let's say you're writing a string library, and you're testing that adding two strings together returns the concatenated string, and you just know yourself: oh, the empty string... I should probably make sure that I test adding an empty string to a string. And I'm speaking from experience here with this kind of stuff. Well, how could you implement string concatenation in a way that breaks empty-string concatenation? And the answer is: well, if you're writing an optimized version in assembly, I
[00:40:18] guarantee that you haven't put the check in at the beginning to say: oh, hang on, what if it's zero-length to start with? Because you just immediately want to pick up N bytes of the string and do this kind of stuff. And so that's the kind of thing where you are definitely hedging for the future: well, this is the kind of mistake I can see myself making down the line. Or it's almost part of the documentation, at some level, as well: well, these things should work; the interface allows you to do this; it's surprising, but it does allow you to do this, and maybe I should just leave this here as a boundary condition of the interface, to say: this is fine. So yeah. And one
[00:41:01] of the other things you need to consider with that, in addition to guarding against future mistakes, is the thing we were talking about earlier: writing those kinds of tests can also lead you to accidentally over-specify. Right. Like, you're testing things that are either behavior you don't necessarily really care about, or testing through into other dependent modules or classes or functions. That's a really fascinating one, actually: testing things you don't care about. I have definitely been guilty of that before, where I've thought to myself: I really want to make sure this thing works, but I don't really know what the answer should be. Right.
[00:41:36] And by writing a test, I've kind of mandated that that is what the answer always needs to be. Exactly. And maybe I didn't want to do that. Right. But then I also don't want to have a path through my code that I... But then maybe
[00:41:47] that's a question of: well, fix the path through your code; throw an exception instead, do something else to say: hey, you're in uncharted territory here, don't do this. Right, right. I mean, an obviously wrong example of that, bounding this problem a little bit, would be having something that produces a set,
[00:42:08] and then you make an assertion about the order. Right. It's like: you don't care about that behavior, you don't care about the order of the set, so why are you writing a test for it?
[00:42:16] It's like: well, because I wanted to know what it is, or I wanted to specify it. And it's like: no, we are intentionally leaving that unspecified.
[00:42:24] And that has value. The fact that it's unspecified has value, right. It allows us to change our mind later. So, like the hash map thing you talked about, right, is a perfect example of that: if you write a test that's going to iterate through the hash map and build its world from the hash map entries in the order they just happen to land in the hash map, then you're definitely way into the world of: well, I don't even know what sequence these things are going to happen in.
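A tiny illustration of that over-specification, with a made-up result value:

```python
# Hypothetical result from the scanner: a set of good prefixes.
result = {"b/", "a/"}

# Over-specified: asserts an iteration order the code never promised,
# so it can break when an irrelevant implementation detail changes.
# assert list(result) == ["a/", "b/"]

# Just right: asserts only the membership we actually care about.
assert result == {"a/", "b/"}
```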
[00:42:49] So I can't be too critical of that, yeah. Interesting. Well, you'd better go write this code. I've got to get back to work.
[00:43:00] You want to actually go back? Yeah. I was going to crack the whip on you. Maybe we'll catch up with this another time and see how it turned out, whether or not any of these algorithms actually worked. I like that idea a lot. Cool, well, we'll hear about it next time, perhaps. All right, next time. You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt. Find the show transcripts and notes at www.twoscomplement.org. Contact us on Mastodon: we are @twoscomplement@hachyderm.io. Our theme music is by Inverse Phase: inversephase.com.
