Two's Complement - Compression

Episode Date: October 23, 2023

Matt and Ben talk about how compression works, specifically deflate, which is apparently everywhere. Ben gets particular about compression ratios. Matt explains how to compress /dev/random by sorting... it first.

Transcript
Starting point is 00:00:00 I'm Matt Godbolt, and I'm Ben Rady, and this is Two's Compliment, a programming podcast. Hey Ben. Hey Matt. What are we going to talk about today? I think we're going to talk about compression today. So, like squishing things? Yeah, squishing bits. Oh, bits?
Starting point is 00:00:31 Taking bits and squishing them. So like, if you squish a zero, like flat, it becomes a one? Yeah, right. That kind of thing. I see. Unless you squish it down, in which case it becomes an underscore. Oh, or a minus. Well, if you've got to squish it from both sides to make a minus. That's true.
Starting point is 00:00:49 These are the compression algorithms that we need to talk about. These are the compression algorithms that we need to talk about today. Which one of those is Zedlib? Well, in the true spirit of these podcasts, we haven't really thought too much about what we're going to talk about. We did discuss that we should talk about compression a number of months ago. I did a presentation at work about the various different compression algorithms, or rather the way that I was subverting them for a sort of side project I've got called
Starting point is 00:01:20 um oh gosh i can't even z index or z index or z index which is a way of archiving and searching through archives that you've previously um have made z uh g zipped files and um sort of random accessing within them which is something you're not supposed to be able to do but with a little bit of sour band data you can but that meant trying to understand or explain to people how z lib worked which meant talking about compression. And then we were like, hey, we should talk about that. So here we are. What do you want to talk about? I mean, what even are a compression?
Starting point is 00:01:54 So, yeah, aside from turning zeros into other various characters on your keyboard. Which is not really what they do, just to be clear. Not a real thing. Yeah, I mean, you know, quite often it makes sense to trade a little bit of extra CPU time in order to get a little bit less storage space. Or basically just bytes because you're transferring something over a network or whatever it might be. Lots of other contexts. And so you want to take a bunch of data and shrink it down into something smaller. And there are
Starting point is 00:02:31 many algorithms for this and we're going to talk about them today, but I think the first thing we should talk about are sort of the different trade-offs because it's like well, why wouldn't I want to make my data smaller? And how do I want to make my data smaller and how do i want to make my data smaller and other considerations that you might have when choosing one of these algorithms i mean so obviously the first thing that springs to mind is how good is
Starting point is 00:02:56 the compression algorithm at squishing my data in the first place like given a input file how much smaller can it be after it's being compressed and that's really easy and objective to measure you know you take your 100k file and you apply your compressor to it you like go hey it's 200 bytes now that's awesome fantastic and this one makes it 190 bytes you're like that's obviously better let's just use the 190 byte one right so that's the first thing is the compression ratio. Yeah, and people do tend to talk about it as a ratio, right? Or sometimes a percentage, right? You'll get like, you know, 90% compression, which the way I interpret 90% compression is you took something that it's sort of – it's a little – I feel like it's a little weird when you say that because it's like's like i took a hundred bytes and i compressed it down to 10 bytes right i saved 90 percent of the original size of the
Starting point is 00:03:51 original right yeah which you know it's like so i got 10 was the result but it's like one minus that right yeah um and then you know you can have, uh, you know, ratios. I feel like you'll also see things where people talk about negative compression ratios in order to express something that actually got bigger when you compressed it, which can also happen. Oh yeah. That's, but that sort of breaks that model too. Right.
Starting point is 00:04:19 So I think it's really important when you're talking about, and I, this is true all the time when you're talking about percentages and ratios and things like that where it's like it's kind of easy to lose track of what the numerator and the denominator were um so i i feel like when when you're discussing these things it never hurts to sort of be explicit right so when we're talking about compression ratios well i don't know that we'll have any concrete things, but then, yeah, in that specification, 90% would mean it is 90% smaller means that, yeah, 90% of the original size went away and you're left with a file that is 10% of the original. Right. That makes sense. So, obviously, naively, you want to pick the thing that compresses your input to be the smallest output.
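To pin down the convention being settled on here, a quick sketch in Python of the two ways the same result gets quoted (the function names are invented for illustration): a 100-byte input that compresses to 10 bytes is a ratio of 0.10, which people usually quote as "90% compression", and an output that grows shows up as a negative saving.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Fraction of the original that remains: 100 bytes -> 10 bytes gives 0.10."""
    return compressed_size / original_size

def space_saved(original_size: int, compressed_size: int) -> float:
    """'90% compression' in the sense used here: one minus that fraction."""
    return 1.0 - compression_ratio(original_size, compressed_size)

print(space_saved(100, 10))    # 0.9  -> "90% compression"; the output is 10% of the input
print(space_saved(100, 110))   # -0.1 -> "negative compression": the output actually grew
```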
Starting point is 00:05:06 Right? Right. But there are other trade-offs, obviously. Right. Otherwise, we wouldn't be bringing them up. We'd be like, yeah, well, we're done. You pick the smaller one. So what else is important?
Starting point is 00:05:15 Well, there's compression, how long it takes to do the compression. Right. And there's also how long it takes to decompress it. Right. Right. And for certain algorithms, there's how lossy it is if it is lossy. Right, right. That's a good point.
Starting point is 00:05:32 I suppose, yeah, the idea that we could talk about lossy compression as well might be too much for one podcast. Maybe we'll table that for another time. But yeah, so I was thinking we would talk really about lossless compression in this chat because the world is going to be – maybe we'll get to it at the end. Who knows? Again, exposing the listener to the lack of planning that goes into this. In fact, it's a watchword of this podcast. Yeah. I mean, at least asking the question is always a good one, right?
Starting point is 00:06:03 Yeah. Like, is this actually lossless compression? Because not every compression algorithm is. Can I assume that when I decompress it, I get the exact same bytes that I put into the compressor? That seems, it's kind of the definition of lossless versus image compression and stuff, which actually loses some of the data and you don't get it quite back as we've all seen. Audio, video can all be a little lossy. Yeah. Yeah.
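That byte-for-byte round trip is easy to check with zlib, the deflate library that comes up a little later in the episode; a minimal sketch:

```python
import zlib

original = b"squish these bits, squish these bits, squish these bits"
compressed = zlib.compress(original, 9)          # level 9: try hardest, like gzip -9
assert zlib.decompress(compressed) == original   # lossless: the exact bytes come back
print(len(original), "->", len(compressed))
```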
Starting point is 00:06:31 Yeah, yeah. some of the data and you don't get it quite back as we've all seen audio video can all be a little lossy yeah yeah yeah yeah so obviously compression time and decompression time uh another thing that i think is an interesting one to consider is how how much code it takes to decompress your input because as kind of an implied input to your decompression data is the thing that can decompress it. Now, obviously, on a Unix system or whatever, you've got GNZip, and it's like, okay, there's 500 bytes of – sorry, 500K of executable somewhere. You don't have to worry about it because everybody has GNZip somewhere, and so you can kind of like, oh, all right, we can decompress this. But if you are working
Starting point is 00:07:05 on an embedded system which has got limited resources which is one of the main reasons you might want to do compression like hey we've got an ssd and we need to read things off of the ssd and decompress them into ram to run them or to do whatever then you realize that you actually have to budget for the decompression algorithm and the amount of code space that it takes up and if you have like a you know a compression algorithm which is like hey i have a lookup table and the lookup table is 200 megabytes of like values canned values then and you just store in your compressed version you're like well it's just number seven in that table that's great compression but you need a very big decompression table somewhere else
Starting point is 00:07:45 right right yeah that's a great consideration actually i mean it's top of mind just because of like looking at uh compression like old 8-bit machines that that i'm looking at like this is cool look at how much you can compress but unfortunately the decoder is like 5k and that's too big yeah yeah yeah and i mean i guess part and parcel to that is kind of like uh fonts also have this property is like there's a certain percentage of these things are going to be available in any environment that you're delivering the data to because again a lot of times when you're compressing stuff it's in order to send it somewhere and the question is can the person on the other side actually uncompress it, right?
Starting point is 00:08:27 And so, depending on the exact transport mechanism and the clients involved and a bunch of other factors, you may or may not have access to those algorithms. And thinking about which ones are going to be more ubiquitous and which ones are going to be like, oh, actually, we need to first send the decompression library and then we can send the data itself. Yeah, the actual data. Right. Yeah, is sort of another dimension of that in a way. And, I mean, if we talk about ubiquitous, I mean, the most ubiquitous compression algorithm that I'm aware of would be Deflate, which is the algorithm that powers zip files and GZ and every web transaction you've ever done. Although, I guess ZStand standard has now taken it over but yeah all right most most web transactions are compressed with deflate
Starting point is 00:09:12 and you know colloquially everyone sort of refers to it as you know the z lib which is the library that implements it or just gzip even though it really is the same algorithm inside your your windows zip files as well um and that uses a number of techniques under the hood and we can talk about what those techniques are but like um i guess let's talk about what types of things you can do in a compression algorithm okay yeah so one of the things that i used to do a million years ago on 8-bit machines was you would observe that very often you'd have sequences of bytes that repeat. So you've got some, usually like a picture, an image, or a sprite in this instance. You'd be like, well, okay, it's the value 0 80 times, followed by a 1 and a 7,
Starting point is 00:09:58 and another three 0s, and then another 100 0s, and then whatever. That doesn't make sense, but whatever. So even just eyeballing in a hex editor, you can see that most of it is zeros. And so one might just go, a simple compression algorithm would be, okay, if I see a zero, count how many more zeros there are. And then the compressed output is zero
Starting point is 00:10:23 followed by how many zeros there were. And that's great if zero is the only thing you want to do compression for. And this is like run length encoding, as this is RLE, where we're saying, hey, I'm going to look at how long the run of this particular single number is, and I'm just going to repeat it. Now, that's cool. It makes sense the decoder is easy you read bytes and every time and you just copy them but every time you see a zero you read the next byte and you write out that many zeros straightforward right and it's pretty effective and obviously you can imagine that the decoder is trivial the encoder is pretty trivial
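A minimal sketch of that zero-run scheme as described, assuming one-byte counts (so runs get capped at 255); the function names are made up here:

```python
def rle_encode(data: bytes) -> bytes:
    """Zero-run RLE: a 0 is emitted as (0, run_length); other bytes pass through."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes([0, run])            # the zero marker, then how many zeros
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0:
            out += bytes(data[i + 1])         # bytes(n) gives n zero bytes
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sprite = bytes([0] * 80 + [1, 7] + [0] * 100)     # the kind of data described above
packed = rle_encode(sprite)
assert rle_decode(packed) == sprite
print(len(sprite), "->", len(packed))             # 182 -> 6 bytes
print(len(rle_encode(bytes([5, 0, 5]))))          # but a lone zero grows: 3 -> 4 bytes
```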
Starting point is 00:11:02 as well but the first and obvious thing you might notice is well what if there's only one zero right now you're sending zero followed by a one right and that's twice as big as it was when it came out which is not a good look and that goes to your original point about like having a negative compression ratio and this is one of the reasons that you can get negative compression ratios is that sometimes you try to encode some clever things and they're not as clever for the particular stream of data that you've got that's that's a problem yep one way around this is to divide the world into little blocks and say let's let's compress every 512 bytes let's say and then say, immediately before every block,
Starting point is 00:11:47 I'm going to put a flag that says, did I apply run length encoding or did I not apply run length encoding or some other compression, whatever algorithm. And then you can try to RLE compress 512 bytes. And if it gets bigger than 512 bytes, you go, ah, let's just store it as is and then you just put the flag that says this is just exactly as um it's just byte for byte
Starting point is 00:12:10 copy the next 512 bytes and of course you can generalize that but you know in the worst case now what we have is every 512 bytes we add one byte that says this isn't compressed and now we still made it bigger but hopefully not as bad as it would be if every byte was was there and that's another reason so the sort of metadata about the data that explains how the compressor is configured or what how to decompress the next bit can take up more space than if it was just uh raw data so that's run linked encoding that's one of the easier ones what else could we do to compress data uh well that's a good question i don't have an answer to that well what is the most common letter in the english language uh e i e yes the e right so um how many bytes does it take to to store an e uh well normally it's um one byte one right right yeah
Starting point is 00:13:07 but we know that e is more common the most if we're compressing let's say text e is more common than almost every other letter uh well it is the most common letter sorry um so what if we could say let's forget about bytes let's just treat this as a stream of bits what if we could say, let's forget about bytes. Let's just treat this as a stream of bits. What if I said I could encode an E with only two bits? Like maybe let's just say zero and then one is an E, as in the binary zero one. And then every other letter, for the sake of ease over the air and not without whiteboards to draw anything on, everything else is going to be a one bit followed by eight bits of the rest of the the letter so that means that every other letter is now nine bits boohoo but every e is only two bits right right so this is a kind of like you know you think of it like the
Starting point is 00:13:58 rle in terms of we're trading off something to encode something which is hopefully more common or is more compressed but we've made it slightly more pessimistic in another case but now if you imagine like a third of all of your text is an e you've dropped that third down to two bits and everything else is nine bits now I can't do the math in my head it It's terrible to do, but there is a way to construct a optimal tree of bits, ones and zeros, that give each character
Starting point is 00:14:31 that you could possibly have inside your output a variable number of bits based on how common it is or not. So while an E may be one zero, I can't remember what I just said, 0, 1, whatever. And now maybe the next most common letter is, I don't know, T. That might be 1, 0, 1.
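For a feel of how such a tree of codes gets built, here is a rough sketch of the construction (this is Huffman coding, which gets named later in the episode); the sample text and the helper are invented for the example, and real deflate adds extra constraints such as maximum code lengths:

```python
import heapq
from collections import Counter

def huffman_codes(freqs: Counter) -> dict:
    """Build a prefix-free code: common symbols end up with short bit strings."""
    # Each heap entry is (total_frequency, tie_breaker, {symbol: code_so_far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two rarest subtrees...
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        tie += 1
        heapq.heappush(heap, (f1 + f2, tie, merged))  # ...merge into one node
    return heap[0][2]

text = "the quick brown fox jumps over the lazy dog the end the end the end"
codes = huffman_codes(Counter(text))
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1]))[:4]:
    print(repr(sym), code)                    # frequent symbols (' ', 'e') come out shortest
bits = sum(len(codes[c]) for c in text)
print(f"{len(text) * 8} bits as plain bytes -> {bits} bits Huffman-coded")
```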
Starting point is 00:14:51 And that's unambiguously a prefix that only means it's T. And it's three bits now. And then you can start building a tree with all of the other things. And until you get to like, you know, the Q and the Z take maybe even now 12 bits to unambiguously encode. But now, you know, effectively, we are storing the letters with codes that are inversely proportional in length to how common they are. And now the document we're storing, provided it's an English document, and fits this sort of statistical arrangement of probabilities
Starting point is 00:15:30 of how likely each letter is, now we can store it in considerably less bits. I mean, it makes sense. There's a reason why the phone number for emergency services is just 112 or 999. What is it? one two or nine nine nine or what is it oh yeah that's what it is you should probably know what that is if you live in the united states just saying i would recommend that you write that down okay i'm gonna write that down it's like yeah yeah so i was gonna say that like phone numbers are have like this sort of like feeling of like
Starting point is 00:16:02 they're unambiguous you know know, 911 doesn't code. That's not the beginning of someone else's number. You never accidentally code it. But that means that anyone else starts with like a zero or a one or something like that, which is like a reason why. So you were going to say something. Oh, I was just going to say. So one of the things that I've heard about frequently
Starting point is 00:16:20 when talking about different compression algorithms is using a dictionary as a compression mechanism. And one of the things that I've heard about that is that you do, and I don't know if this is universally true, but it's something at least that I've considered in the past, is you either need to do this thing where you're saying like, okay, well, let's assume that we're going to be compressing English words. Like, okay, that's a big assumption, but sure, for certain applications, that makes sense. In that case, you can kind of define some of these things up front, including something like a dictionary, right? So you can make just a dictionary of words, right?
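A toy sketch of that pre-agreed-dictionary idea, with a made-up shared word list and two bytes per word; real dictionary coders are far more sophisticated than this:

```python
import struct

# Both ends ship this list ahead of time, so only indices go over the wire.
SHARED_WORDS = ["the", "cat", "sat", "on", "mat"]
WORD_TO_INDEX = {w: i for i, w in enumerate(SHARED_WORDS)}

def encode(text: str) -> bytes:
    return b"".join(struct.pack(">H", WORD_TO_INDEX[w]) for w in text.split())

def decode(blob: bytes) -> str:
    indices = struct.unpack(f">{len(blob) // 2}H", blob)
    return " ".join(SHARED_WORDS[i] for i in indices)

message = "the cat sat on the mat"
packed = encode(message)
print(len(message), "->", len(packed))   # 22 characters -> 12 bytes
assert decode(packed) == message
```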
Starting point is 00:16:56 And you can have each byte sequence refer to whole words instead of individual letters. And that would be another way potentially to do this. And there's somewhere on the order of tens of thousands of words in the English language. So your dictionary wouldn't be all that big. And maybe you'd get good compression out of that. Alternatively, you could build the dictionary after you've run through the data and found the most frequently occurring patterns or words or sequences of bytes or whatever you might find. But the disadvantage of that is that now you sort of have to run through everything before you can
Starting point is 00:17:31 build that dictionary or you need to have an algorithm that will build it on the fly or something like that. And so that seems like a whole category of things where you're sort of trading off between either knowing what kind of data you're processing beforehand or doing some things post hoc after you've run through the data that might make it a little bit more difficult to work with if you were, for example, trying to do compression as like a stream. Like you don't ever want to have to seek through the data multiple times or use a whole bunch of memory while you're compressing it and so you have these these trade-offs that you need to make when you're when you're doing that um are there algorithms that
Starting point is 00:18:13 you've compression algorithms that you've worked with before that sort of make these trade-offs in different ways absolutely so um we're actually slowly walking towards what deflate is going to be which is a cool cool thing you know and a totally unintentional which is great uh but um so just on that one thing you mentioned about like um if it was english words and the probabilities and the thing as i described obviously if you knew the compressor and the decompressor both knew it was english words and they'd previously agreed on the frequency of letters sorry letters i'm talking about the letters thing first they've previously agreed on the frequency of letters sorry letters i'm talking about the letters thing first they've previously agreed on the frequency of letters then they would both already know that
Starting point is 00:18:49 e is zero one and and p is whatever 101 or whatever like that but if they don't know that or if you want to like as you say um have a sort of dynamic idea about well i ran through the data and i counted how many of each letter there were and now i'm going to build my my my tree of like how many bits code for which letter uniquely for every time every piece of data then you need to actually transmit that tree to the other end so that the other end can say okay this is we agree on the code book of like how many bits mean which letter so that's a downside with um uh dynamically sending that information across because the tree is not free. It takes up some space, right?
Starting point is 00:19:30 So you have to offset the compression you get from the tree with the fact that you have to send the tree at all. One of the cooler tricks is like exactly as you said, is what if you instead of treated each byte individually you also said well dictionary elements themselves are just other symbols so now we'll just consider compressing character at a time byte at a time we'll just call them symbols and let's say the first 255s are the literal value 0 to 255 of the byte 0 to 255. And then let's say symbol 256 is the first of my dictionary. And it just means the letter, I don't know, the word cat, right?
Starting point is 00:20:15 And the 257 means dog and whatever, right? And again, if you could previously agree this on both sides. the tree building step that we talked about where depending on the relative probability or a number of occurrences of each of the symbols that you see doesn't care how many bits long those symbols are right you can have a billion symbols in that tree if you wanted and it will still be the case that the most commonly occurring ones will have the shortest bit encoding and the least frequently occurring ones will have the longest one and obviously as you put more and more things in in the limiting case that can be a very very long uh prefix so to the point where you're like well this is longer than the word that it's coding for in the dictionary let's not bother with this anymore we might as well just encode it using
Starting point is 00:20:56 smaller words so a way of naturally extending the table of um of characters and their prefixes is just to throw in words as well or previously agreed dictionary components so that you know hey this is this this pattern appears in my data more often than not so um when you see one zero one one that actually means the entire string cats are cool right because that happens in my my application a lot right that's the kind of thing you could do and um but there's another sort of aspect which is um now i have to either transmit that dictionary boom right or we have to previously agree on it again and now we're back into the world of like oh my golly uh how big is this dictionary and my little embedded system is now scuppered going oh i need to store you know a 50 megabyte table of this this dictionary that we've agreed upon and all the
Starting point is 00:21:50 prefixes to go with it so that's right right and if you do transmit it you have to transmit it after you've examined the data correct which means if you're and maybe i'm wrong about this but if you're decompressing on the other side you can't really decompress then in a stream because you don't get the dictionary until the very end well usually in the sisters you'd have to get the dictionary up ahead because the compressor needs the dictionary as well if you were to do it that way right because as you're going through you can't you if you are i'm assuming that you have a static tree that you've built that is like here are all the probabilities and they therefore don't change but obviously we can talk about how you can actually update them and that sort of leads us naturally to the next thing which is what if you didn't transmit the dictionary at all okay because yeah if i've if i've just transmitted the letters
Starting point is 00:22:43 c a and t because that's the first three letters of the document that I'm sending I have actually also just transmitted the letter, the word cat right, it took three bytes or it took however many the relative probabilities of the C, the A and the T were to transmit
Starting point is 00:22:59 but now I effectively have a dictionary record for the word cat it is three bytes back from where we're currently decompressing if you've just written if you're the decompressor and you've just written out the c the a and the t because you've decoded them so what if instead of sending an explicit dictionary of things that i've previously agreed the dictionary entries themselves are effectively references back from earlier in the document and so now i anytime i want to say the word cat i can say
Starting point is 00:23:35 depending on where you are in the output go back however many bytes to get to the the c and copy three bytes c and a and a t and so now i've effectively got a dynamic dictionary that um allows me to refer back to anything i've previously transmitted and so suddenly i if i've transmitted the word cat i can say cat over and over and over again with just one symbol and that's pretty cool even cooler is what if i had the letter a 700 times in a row you remember in our original description we were talking well you know you have run length encoding i could transmit something that says hey there's an a followed by there's a ton of them all right so you know just copy that a 600 and sorry you know there's 700 of these a's in a row but if i say um transmit the code here is an a so the decompressor
Starting point is 00:24:27 is written out an a and now i say hey there's a dictionary element i would like you to use it is go back one character and copy 699 times yeah okay so what i've just done is i'm if you can visualize it i'm reading the A immediately that I've just decompressed and I'm copying it to the current location and then the two pointers are going to walk along and so that A is going to get smeared out 700 times. And so this dictionary compression that we've just talked
Starting point is 00:24:56 about where I can back reference to a previous part of the document gives us RLE for free. That is run length encoding. I can just take an A, and then I transmit the dictionary entry that says, go back one, copy 600, and I've got my run-length encoding. So the size of that amount of data that refers to how far back in the stream you go, whether it's 4 bytes, 8 16 bytes however big it is am i correct in assuming
Starting point is 00:25:26 that that is the amount of previous data that you need to hold in memory in order to decompress the the data as you stream through it exactly right this is referred to in as a sliding window because it's the the last right m bytes usually something like 64K, that even if you're a streaming decompressor and you're passing the bytes onto some other thing, you're printing them directly to the screen, whatever you're doing, because of this, you still have to hang on to the 64K of prior data in RAM so that when the next byte comes from the wire that says, hey, no, copy back 32K and it you know, it's 27 bytes from there.
Starting point is 00:26:06 You need to be able to get access to that still. So obviously it does. And I guess this is something we didn't talk about, like the runtime memory requirements of the decompressor, right? You have to keep something around in order to be able to decompress. Yeah, that is another consideration that we forgot to consider. We forgot, yeah, an unconsidered consideration. Right, yeah.
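A sketch of the decode side being described here: literal bytes plus "go back N, copy M" references into what has already been written. The token format is invented for illustration (deflate's real encoding is bit-packed and Huffman-coded), but the overlapping byte-by-byte copy is the same trick, and it is exactly what makes "go back one, copy 699" smear a single A out into a run.

```python
def lz_decode(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):                # a literal byte
            out.append(tok)
        else:                                   # a (distance, length) back-reference
            distance, length = tok
            for _ in range(length):
                out.append(out[-distance])      # may re-read bytes written by this very copy
    return bytes(out)

# "cat " repeated via a back-reference, then 700 A's as "A" + (go back 1, copy 699):
tokens = [ord(c) for c in "cat "] + [(4, 8)] + [ord("A"), (1, 699)]
decoded = lz_decode(tokens)
print(decoded[:12], decoded.count(ord("A")))    # b'cat cat cat ' 700
```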
Starting point is 00:26:29 So anyway, so what we've got so far is we have a tree that lets us encode symbols, and symbols are not just bytes. They could be dictionary elements. And those dictionary elements are actually now phrased as go back n and copy m bytes um now this is glossing over this the actual specifics of deflate which actually has different trees for different things or whatever but like let's just stick with this for the quote simplicity um of of trying to explain it to somebody who's walking their dog at the moment or is pottering around the house on the train exactly yeah so um you mentioned a number of times you
Starting point is 00:27:14 know having to do multiple passes and the dealing of streaming and things like that unfortunately that is something that the compressor is forced to do in order to be able to make these determinations what it's likely and although there are algorithms for online building of the trees and whatever um typically what happens is the compressor chunks its input data into sort of reasonable sizes and then the compressor normally tries a few different types of algorithms uh exactly as i described right at the beginning with the rle you know like it goes hey let's try doing it with uh just a static tree let's try it with a dynamic tree let's try it with back references let's just uh and each time it says you know does this does this work out as
Starting point is 00:27:56 being smaller and it would pick the one the block that is the the smallest um of them all and ultimately there is a don't didn't compress type block, you know, storing it only, which only has a couple of bits at the beginning that says, oh, sugar, I couldn't actually compress this. And then here's just the literal data. And so that is the hedge for like the uncompressibility. It bounds the maximum size of the compressed thing
Starting point is 00:28:21 to be, you know, only a tiny percentage bigger than the input in the worst possible case. So having done this, it's now run through. It sort of encodes the dictionary-based compression. And that is, for what it's worth, a very difficult problem, right? It has to kind of search through the space of, like, should I go back two bytes and copy one byte, or should I go back further in the document and carry copy slightly more now going further back i might be able to get a longer match
Starting point is 00:28:51 that you know because maybe i've got cat and i've got ca and i can build cat by going back 700 bytes and copying three bytes or i've recently done a ca i can go back a smaller distance to get just the c and a copy those two and to get just a C and A, copy those two, and then just output a T. And those are equivalent in the output stream, but the exact bit patterns of how to describe those two operations of going back a long way and copying three versus going back a short distance
Starting point is 00:29:19 will change the overall output compression ratio because maybe the nearer ones are used more often, and so they'll have a smaller coding in the tree, and that kind of stuff. So there isn't like a one true encoding, and that's essentially when you're g-zipping something and you do g-zip minus nine, you're tending to try really, really, really hard
Starting point is 00:29:41 and go through all sorts of different combinations and try different encodings to see if it will work out as well as there are some internal acceleration structures that are sized accordingly when you do that and it's why it takes longer in the compressor because it's essentially searching through to find what might be the best encoding of the data yeah i didn't realize that when you're when you're telling these algorithms to basically try harder what they're doing is trying more internal algorithms more approaches i guess amongst other things yeah i mean most of them end up having to do um i mean if you imagine what it looks like you're trying to find a prefix um in some data you've already sent out so you've got 64k of the stuff you've previously sent and then you're looking at the next byte in the compressor and you're saying oh what is the
Starting point is 00:30:29 longest match in the 64k buffer i've got behind me um and one way to do that is to just try go back all 64k and look for the but that's a hugely order n squared operation you're now doing and so most things do some kind of stuff where they hash a few bytes and go well i just keep a hash table of like the prefixes of abc and i look at my hash table and kind of go oh i've not seen abc or if i have it was too far ago never mind um and then as i say there's all these other sort of subtleties about like well maybe you want to prefer nearer matches than further matches because you're more likely to say copy three bytes ago over and over and over again rather than copy six thousand sixty four
Starting point is 00:31:14 thousand nine hundred and twenty two bytes ago which is a very you know unique encoding compared to nearby right so so there's a whole bunch of trade-offs uh up there but anyway so you've done this a number of times and then you're you're going to tell the decompressor, hey, I picked to use a dynamic tree, which I'm going to transmit to you in this block. And then, you know, obviously it's dictionary decompression after that. So the tree gets transmitted, and there's a pretty compact representation for the tree
Starting point is 00:31:43 in terms of which bits map to which symbols. And the the decompressor has to generate that tree and then obviously look through um and just copy from before but those those chunks we're talking about these like 64k ish blocks that the compressor has chosen to like look ahead and kind of make its determination those are different from the sliding window of 64k right that's a that's a separate those are separate separate things there right so because the compressor could just choose to look at the whole file and say actually this is this technique that i'm about to do this tree that i'm going to build is good for the whole file let's send you the whole file but for pragmatic reasons of not having to go all the way to the end of the file and buffer it all in memory if you're a streaming thing, usually it's chunked into smaller pieces.
Starting point is 00:32:28 Chunking into smaller pieces also has a nice side effect because it means that youwegian maybe you need a different set of like you know vowels more often or whatever diacritic marks and stuff like that or you know more generally in your um in your data file format you're compressing there's likely to be a big header at the beginning which has some characteristic and then a whole bunch of like other data in the middle which is a different thing so so it gives it an opportunity to kind of reset and change the um the relative frequencies of all of the symbols that are occurring depending on you know on the data which of course is another degree of freedom the compressor has hey i could choose not to do that it could do the whole thing it could try it could it could try doing 1k then then 2K, then 4K.
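Tying together the match-finding discussion from a moment ago, here is a rough sketch of the hash-the-next-few-bytes trick: index every short prefix you have already emitted and only check those candidate positions, rather than scanning the whole window. The window size and all the details are invented here; real implementations add hash chains, lazy matching and plenty of other tuning.

```python
from collections import defaultdict

WINDOW = 64 * 1024   # how far back a reference may reach (the sliding window)
MIN_MATCH = 3        # hash this many bytes as the lookup key

def longest_match(data: bytes, pos: int, table: dict):
    """Best (distance, length) match for data[pos:] among indexed positions."""
    best = (0, 0)
    for cand in reversed(table.get(data[pos:pos + MIN_MATCH], [])):
        if pos - cand > WINDOW:
            break                             # older candidates are even further away
        length = 0
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best[1]:
            best = (pos - cand, length)
    return best

def index_position(data: bytes, pos: int, table: dict) -> None:
    table[data[pos:pos + MIN_MATCH]].append(pos)

data = b"cats are cool, cats are cute"
table = defaultdict(list)
for i in range(15):                           # index everything before the second "cats"
    index_position(data, i, table)
print(longest_match(data, 15, table))         # (15, 10): "cats are c" was seen 15 bytes ago
```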
Starting point is 00:33:26 It's got a lot of degrees of freedom there, which means, again, it can be difficult to write an optimal or a provably optimal compressor. And this is all deflate for what it's worth. This is all what deflate does. Deflate has dictionary base, which if people know what Lempel, Ziv, LZ77, I think recently one,iv or lemple one of the two died recently and there was a which was which was said but anyway uh lemple ziv is like all this dictionary based thing where you have like these back references to n bytes ago copy m
Starting point is 00:33:57 and there are 100 different ways of encoding it or choosing to to do it there's all the different lz 77s and other uh numbers i can't remember what the other one is now. LZ4? No, it's gone. Then, obviously, the tree is called a Huffman tree, and it's a Huffman encoding tree. It's a way of building these unambiguous sequences of bit patterns that will encode a node in a tree.
Starting point is 00:34:24 And the length of that sequence is provably the shortest unique sequence that will get you to that particular point in the tree based on its relative um uh occurrence how often it occurs so that's pretty cool um and deflate is built of those two primitives um in various like you know clever bit packing encoding and other bits and bits and pieces. So the devil is in the detail with a lot of these things. But there are some really other left
Starting point is 00:34:52 field ways of compressing data. I mean, you mentioned lossy. Yeah, right. So obviously you could decide to just, you know, I don't know, divide everything by two and then immediately you've saved one bit for every could decide to just you know i don't know divide everything by two and and then uh and then that immediately you've you've saved one bit for every every output value and then you just multiply it
Starting point is 00:35:10 by two on the way out and maybe some values you know everything's even but who cares right for some data that might work so but um sort of more wacky and i still really haven't got my head around how this works and also i can't possibly begin to explain this a because i'm not very good at understanding myself but b because we probably need a picture there is right but there exists a transformation to your data called the burrows wheeler transform okay it is a reversible transform that is i can apply the bwt to my data and I get a permutation of all the bytes of my data and I can reverse it and get my original data back, right? This has not changed in any meaningful way the contents of the data.
Starting point is 00:35:53 It's just rearranged it. Okay, okay. But the side effect of the Burrows-Wheeler transform is that it is sort of sorted your data. Your data is more sorted, it's more ordered because of the transform than it was before. Interesting. It is interesting.
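Since this really does want a picture, here is a deliberately naive sketch of the transform and its inverse - sort every rotation of the input and keep the last column. Real implementations (like the one inside bzip2) are far cleverer than this O(n² log n) version, but it shows the "reversible yet more ordered" property.

```python
def bwt(text: str) -> str:
    text += "\0"                                 # unique sentinel marks the original end
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations) # keep only the last column

def inverse_bwt(last_column: str) -> str:
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row and re-sort, n times over.
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith("\0"))
    return original[:-1]

text = "banana bandana banana"
transformed = bwt(text)
print(repr(transformed))                         # the same letters, but clumped into runs
assert inverse_bwt(transformed) == text          # and it really does reverse
```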
Starting point is 00:36:13 You know, you take a megabyte of data, you apply the BWT to it, and it's a permutation of it. And the data, if you looked at it, it was more likely to be A, A, A, A, A, A, B, B, B, B, B, C, C, C, D, D, D, E, E, A, A, A. And it might go back to A again it's not you know it can't it's not just sorting it yeah uh and and again you there's a great wikipedia uh article
Starting point is 00:36:33 on the bwt and you it's got a picture and you can see what it's doing and like the way of reversing it is more tricky than you might think but it can be done right yeah okay if you apply the bwt before you throw it into a compressor right you've done is you've made it more compressible right which is cool yes and that's what um oh which one is it tar.xz what's no because that's another one one of the BZ extension BZ2 for like title BZT is borrows wheeler I think the B inside that of BZip so it applies it
Starting point is 00:37:15 before it runs it into the compressor and actually while we're talking about that there's another general sort of technique for compression which is to apply a domain specific filter to your data before you throw it into a compressor so for example you might know that there are certain sequences in uh say arm assembly code or arm machine code i should say that are very common but they are relative to their or rather their absolute locations right
Starting point is 00:37:48 they are absolute locations in the file or relative no no they're relative sorry let me get it straight they are relative so you can imagine there's a branch instruction and it has a relative destination um and so but if you're if if every branch if every third branch is calling the same absolute destination, their encoding will be different because relative to where that branch is, it's a different offset, right? Which means it's, you know, unique. But if you were to extract that out and say, okay,
Starting point is 00:38:14 well, these are all going to the same location. I'm going to make them absolute. Right, yeah. Now suddenly it's the same bits every time. Every time we see that branch or whatever or something like that. I've just butchered it, obviously. But, yeah, so there are certain things where if you know ahead of time
Starting point is 00:38:29 that you can make the data more compressible in a reversible way, BWT being like a more generalized way of doing it, then that's another technique. But obviously the decompressor has to know how to undo that too. And it has to know that that process has happened. And very often those are incredibly application specific yeah interesting so it's like you're you're intentionally
Starting point is 00:38:50 introducing duplication in your data in a reversible way because it makes the compressions algorithms job much easier right and you know i've never actually done the experiment before of just taking a like a bunch of random text sorting it and then compressing it right and compressing it and seeing what the difference is that yeah it would be really i mean it's a measure i mean generally speaking is the measure of uh of how much disorder in the file is the entropy and you know that gives you some kind of lower bound on how much you can compress a file. So obviously, if you took a completely random file, if you catted dev random and you compressed it,
Starting point is 00:39:32 you'd imagine it shouldn't compress at all. It should be slightly bigger. But obviously, if you sorted it, what you're likely to have is an equal number of zeros, ones, twos, threes, fours, fives, sixes, all up to 255. It would compress really well. And you'd imagine it would compress down to almost nothing because it'd be like yeah there's 250 million zeros there's 250 million ones there's 250 and so on and you're like all right so in
Starting point is 00:39:54 the limiting case you can get a really good ratio but obviously it's not reversible right right right right but if you can come up with a way of encoding how to put the zeros back in the right place in the final document, then you're sort of edging towards a compression algorithm. But the amount of data it would take to encode the positions of all the zeros, for example, and then all the positions of the ones should be the same amount of data as you put in in the first place. So, you know, there's no free lunch there, but it might be better to encode or it might be easier to encode, which is like essentially what the BWT is doing. Yeah.
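The experiment Ben wonders about here is a two-liner to try; the exact sizes vary from run to run, but the shape of the result doesn't:

```python
# Compress a megabyte of random bytes, then the same bytes sorted.
# Random data doesn't compress (it comes out slightly bigger); the sorted version,
# being 256 long runs, collapses to a tiny fraction - but, as discussed, the sort
# itself isn't reversible, so this isn't a compression scheme on its own.
import os
import zlib

data = os.urandom(1_000_000)
print("random:", len(zlib.compress(data, 9)))
print("sorted:", len(zlib.compress(bytes(sorted(data)), 9)))
```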
Starting point is 00:40:27 And I'm sure I've butchered some description of it there and we'll get people tweeting at us or master learning at us for those things. think about it is that that sometimes um the the trade-off you want to make in the compressor is uh you know like imagine you're google and you want to send out the google uh you know there's a single google.png or google.whatever right and like literally every third web transfer will be of this in the whole world is on google.png right you know so you know every byte you can save on that damn png is worth its weight in gold right it right i don't know how much a byte weighs but you know whatever yes given that png internally is using deflate which is another another thing it might be worth saying i don't't care if it takes me six days to compress this PNG. If it saves me one byte, that'll pay for itself over the course of probably a day.
Starting point is 00:41:33 Millions and millions and millions of page loads, yeah. Exactly. that are dedicated to generating almost perfect bit streams of like this is the smallest conceivable sequence of valid deflate bits that will cause the output to be what you want it to be. And one of those is something called ZOPFLI, Z-O-P-F-L-I. It hasn't been touched in years, but it is a way of creating and compressing a source document and generating essentially the optimal output. And it takes hours.
Starting point is 00:42:13 Interesting. A really long time. And it's very limited. You can't do very much with it. The only reason I know about it is because of hobby projects that use this to compress stuff on a PC to be then put onto an 8-bit computer where, again, that trade-off's fine. You know, the source data is small. It's, you know, 20K,
Starting point is 00:42:30 but you want to compress it as much as actually possible for your, you know, your 8K demo you're trying to get working or your 4K demo you're trying to get working on your 6502-based machine. And so it's worth saying, well, I'll spend 30 minutes in my build
Starting point is 00:42:43 squishing it those two more bytes because I can do something cool with it. Or again, well, I'll spend 30 minutes in my build, squishing it, those two more bytes, because I can do something cool with it. Or again, yeah. That's neat. And I think the way that they work is actually they start backwards. They start at the end of the file and work backwards, and they try and find the optimal thing that would encode what remains. Something like that.
Starting point is 00:42:58 Somebody explained it to me once over a pint, and it made perfect sense at the time. And like so many of those things, as I say out loud now, I'm like how how could that work how did that work yeah exactly uh well i think you're definitely right about not being able to get to things like lossy compression in this episode because uh it seems like this is a very deep topic which maybe warrants some follow-ups right uh so i'm going to just do honorable mentions now we've talked about deflate all the way through um google and facebook have both got their own like uh versions of again i think they're all using the same techniques and slicing
Starting point is 00:43:35 and dicing there's the same tricks but with different trade-offs but z standard which is now like the seemingly the most de facto compression, at least in our company, Z standard is, is, is coming and snappy. They're both libraries again, that, that are not deflate,
Starting point is 00:43:51 but they are, they use broadly the same parts, but yeah, honorable mentions for those only because, you know, we've talked about GZIP, which is, you know,
Starting point is 00:43:58 ancient really. Right. Right. And they, you know, those things, the snappy and Z standard are usually faster and have comparable if not slightly better compression ratios too so and uh yeah one last thing there was there somebody posted a um uh so we talked about png just then which is you know
Starting point is 00:44:17 very complicated file format and ultimately the internals of it are in fact deflate compressed and whatever um some random uh coder just came up with their own image file format using a relatively made up off the fly. Like, well, one bit. And then it's like, this is how we encode the next pixel, two bits. And this is how we encode it in the next pixel, whatever. And it beats PNG hands down for lossless. So there is still room in the world for compression formats that are incredibly bespoke so this is just for image data but still though i mean there's a lot of image data in the world
Starting point is 00:44:51 that's exactly i wish i could remember it and again i'm sure we'll put something in the notes that says what it really is right but yeah as you can tell i love this stuff um if i remember rightly i've got somewhere i've got like a an online view where you can type in words and see how the um a dictionary compressor would see it and how a huffman tree would see it so we're definitely gonna have to put that we'll link that in the notes uh too so that you can have a little play around and kind of give a more of a visceral understanding about what's going on behind this but um yeah it's fun yeah that's really cool really cool all right uh maybe part one of two i don't know maybe we'll see we'll see if it works out like that
Starting point is 00:45:34 all right man well i'll talk to you later until next time my friend you've been listening to two's compliment a programming podcast by ben rady and matt godbold find the show transcript and notes at www.twoscompliment.org contact us on mastodon we are at twos compliment at hackyderm.io is by Inverse Phase.
