Two's Complement - Compression
Episode Date: October 23, 2023
Matt and Ben talk about how compression works, specifically deflate, which is apparently everywhere. Ben gets particular about compression ratios. Matt explains how to compress /dev/random by sorting... it first.
Transcript
I'm Matt Godbolt, and I'm Ben Rady, and this is Two's Complement, a programming podcast.
Hey Ben.
Hey Matt.
What are we going to talk about today?
I think we're going to talk about compression today.
So, like squishing things?
Yeah, squishing bits.
Oh, bits?
Taking bits and squishing them.
So like, if you squish a zero, like flat, it becomes a one?
Yeah, right.
That kind of thing. I see.
Unless you squish it down, in which case it becomes an underscore.
Oh, or a minus.
Well, if you've got to squish it from both sides to make a minus.
That's true.
These are the compression algorithms that we need to talk about today.
Which one of those is Zedlib?
Well, in the true spirit of these podcasts, we haven't really thought too much about what we're going to talk about. We did discuss that we should talk about compression a number of months ago. I did a presentation at work about the various different compression algorithms, or rather the way that I was subverting them for a sort of side project I've got called, oh gosh, zindex, which is a way of archiving and searching through archives that you've previously made, gzipped files, and sort of random-accessing within them, which is something you're not supposed to be able to do, but with a little bit of out-of-band data you can. But that meant trying to understand, or explain to people, how zlib worked, which meant talking about compression. And then we were like, hey, we should talk about that.
So here we are.
What do you want to talk about?
I mean, what even are a compression?
So, yeah, aside from turning zeros into other various characters on your keyboard.
Which is not really what they do, just to be clear.
Not a real thing. Yeah, I mean, you know, quite often it makes sense to trade a little bit of extra CPU time in order to get a little bit less storage space.
Or basically just bytes because you're transferring something over a network or whatever it might be.
Lots of other contexts.
And so you want to take a bunch of data
and shrink it down into something smaller.
And there are
many algorithms for this
and we're going to
talk about them today, but I think the first thing we should talk
about are sort of the different
trade-offs because it's like
well, why wouldn't I want to make my data smaller?
And how do I want to make my data smaller? And other considerations that you might have when choosing one of these algorithms.
I mean, so obviously the first thing that springs to mind is: how good is the compression algorithm at squishing my data in the first place? Like, given an input file, how much smaller can it be after it's been compressed? And that's really easy and objective to measure. You know, you take your 100K file and you apply your compressor to it and you go, hey, it's 200 bytes now, that's awesome, fantastic. And this one makes it 190 bytes. You're like, that's obviously better, let's just use the 190-byte one, right? So that's the first thing: the compression ratio.
Yeah, and people do tend to talk about it as a ratio, right?
Or sometimes a percentage, right? You'll get, like, you know, 90% compression, which... the way I interpret 90% compression is you took something that... I feel like it's a little weird when you say that, because it's like, I took a hundred bytes and I compressed it down to 10 bytes, right? I saved 90 percent of the original size of the original.
Right, yeah.
Which, you know, it's like... so 10 was the result, but it's like one minus that, right?
Yeah. And then, you know, you can have, uh, you know, ratios.
I feel like you'll also see things where people talk about negative compression ratios in
order to express something that actually got bigger when you compressed it, which can also
happen.
Oh yeah.
That's, but that sort of breaks that model too.
Right.
So I think it's really important when you're talking about, and I, this is true all the
time when you're talking about percentages and ratios and things like that where it's like it's kind of easy to lose track of what
the numerator and the denominator were um so i i feel like when when you're discussing these things
it never hurts to sort of be explicit right so when we're talking about compression ratios well
i don't know that we'll have any concrete things, but then, yeah, in that specification, 90% would mean it is 90% smaller means that, yeah, 90% of the original size went away and you're left with a file that is 10% of the original.
Right.
That makes sense.
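A quick sketch of the arithmetic being described, in Python, with made-up sizes; "savings" is the 90%-style figure and "ratio" is original size over compressed size:

```python
original_size = 100      # bytes before compression (made-up example)
compressed_size = 10     # bytes after compression

savings = 1 - compressed_size / original_size   # 0.9 -> "90% compression"
ratio = original_size / compressed_size         # 10.0 -> "10:1 compression ratio"

print(f"saved {savings:.0%}, ratio {ratio:.1f}:1")
# If compressed_size were 120, savings would be negative: the output got bigger.
```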
So, obviously, naively, you want to pick the thing that compresses your input to be the smallest output.
Right?
Right.
But there are other trade-offs, obviously.
Right.
Otherwise, we wouldn't be bringing them up.
We'd be like, yeah, well, we're done.
You pick the smaller one.
So what else is important?
Well, there's compression, how long it takes to do the compression.
Right.
And there's also how long it takes to decompress it.
Right.
Right.
And for certain algorithms, there's how lossy it is if it is lossy.
Right, right.
That's a good point.
I suppose, yeah, the idea that we could talk about lossy compression as well might be too much for one podcast.
Maybe we'll table that for another time.
But yeah, so I was thinking we would talk really about lossless compression in this chat because the world is going to be – maybe we'll get to it at the end.
Who knows?
Again, exposing the listener to the lack of planning that goes into this.
In fact, it's a watchword of this podcast.
Yeah.
I mean, at least asking the question is always a good one, right?
Yeah.
Like, is this actually lossless compression?
Because not every compression algorithm is.
Can I assume that when I decompress it, I get the exact same bytes that I put into the compressor?
That seems, it's kind of the definition of lossless versus image compression and stuff, which actually loses some of the data and you don't get it quite back as we've all seen.
Audio, video can all be a little lossy.
Yeah.
Yeah.
Yeah, yeah.
So, obviously, compression time and decompression time. Another thing that I think is an interesting one to consider is how much code it takes to decompress your input, because kind of an implied input to your decompression is the thing that can decompress it.
Now, obviously, on a Unix system or whatever, you've got gunzip, and it's like, okay, there's 500 bytes of – sorry, 500K of executable somewhere. You don't have to worry about it because everybody has gunzip somewhere, and so you can kind of go, oh, all right, we can decompress this.
But if you are working
on an embedded system which has got limited resources which is one of the main reasons you
might want to do compression like hey we've got an ssd and we need to read things off of the ssd
and decompress them into ram to run them or to do whatever then you realize that you actually have
to budget for the decompression algorithm and the amount of code
space that it takes up and if you have like a you know a compression algorithm which is like hey i
have a lookup table and the lookup table is 200 megabytes of like values canned values then and
you just store in your compressed version you're like well it's just number seven in that table
that's great compression but you need a very big decompression table somewhere else
Right, right. Yeah, that's a great consideration, actually. I mean, it's top of mind just because of looking at compression on old 8-bit machines, where I'm looking at it like, this is cool, look at how much you can compress, but unfortunately the decoder is like 5K, and that's too big.
Yeah, yeah, yeah.
And I mean, I guess part and parcel to that is kind of
like uh fonts also have this property is like there's a certain percentage of these things
are going to be available in any environment that you're delivering the data to because again a lot
of times when you're compressing stuff it's in order to send it somewhere and the question is
can the person on the other side actually uncompress it, right?
And so, depending on the exact transport mechanism and the clients involved and a bunch of other factors, you may or may not have access to those algorithms.
And thinking about which ones are going to be more ubiquitous and which ones are going to be like, oh, actually, we need to first send the decompression library and then we can send the data itself.
Yeah, the actual data.
Right.
Yeah, is sort of another dimension of that in a way.
And, I mean, if we talk about ubiquitous, I mean, the most ubiquitous compression algorithm that I'm aware of would be Deflate,
which is the algorithm that powers zip files and GZ and every web transaction you've ever done.
Although, I guess Zstandard has now taken it over. But yeah, all right, most web transactions are compressed with deflate. And, you know, colloquially everyone sort of refers to it as, you know, zlib, which is the library that implements it, or just gzip, even though it really is the same algorithm inside your Windows zip files as well. And that uses a number of
techniques under the hood and we can talk about what those techniques are but like um i guess
let's talk about what types of things you can do in a compression algorithm okay yeah so one of the
things that i used to do a million years ago on 8-bit machines was you would observe that very often you'd have sequences of bytes that repeat.
So you've got some, usually like a picture, an image, or a sprite in this instance.
You'd be like, well, okay, it's the value 0 80 times, followed by a 1 and a 7,
and another three 0s, and then another 100 0s, and then whatever.
That doesn't make sense, but whatever.
So even just eyeballing in a hex editor,
you can see that most of it is zeros.
And so one might just go,
a simple compression algorithm would be,
okay, if I see a zero, count how many more zeros there are.
And then the compressed output is zero
followed by how many zeros there were.
And that's great if zero is the only thing you want to do compression for.
And this is like run length encoding, as this is RLE, where we're saying, hey, I'm going to look at how long the run of this particular single number is, and I'm just going to repeat it.
Now, that's cool.
It makes sense. The decoder is easy: you read bytes and you just copy them, but every time you see a zero, you read the next byte and you write out that many zeros. Straightforward, right? And it's pretty effective. And obviously you can imagine that the decoder is trivial; the encoder is pretty trivial as well.
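A minimal sketch of the zero-sentinel scheme just described, in Python. The single-byte count and the capped run length are assumptions for illustration, not any particular real format:

```python
def rle_encode(data: bytes) -> bytes:
    """Encode: every 0x00 is followed by a count of how many zeros it stands for."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes([0, run])   # sentinel zero, then the run length
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Decode: copy bytes through, expanding each (0, count) pair back into zeros."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            out += bytes(data[i + 1])  # that many zero bytes
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

sample = b"\x01\x00" * 10 + b"\x00" * 80
assert rle_decode(rle_encode(sample)) == sample
```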
But the first and obvious thing you might notice is: well, what if there's only one zero? Now you're sending a zero followed by a one, and that's twice as big as it was when it came in, which is not a good look. And that goes to your original point about having a negative compression ratio. This is one of the reasons you can get negative compression ratios: sometimes you try to encode some clever thing, and it's not as clever for the particular stream of data that you've got. That's a problem.
Yep.
One way around this is to divide the world into little blocks and say, let's compress every 512 bytes, say, and then, immediately before every block, I'm going to put a flag that says: did I apply run-length encoding, or did I not, or some other compression, whatever algorithm. And then you can try to RLE-compress 512 bytes, and if it gets bigger than 512 bytes, you go, ah, let's just store it as is, and you put the flag that says this is just a byte-for-byte copy of the next 512 bytes. And of course you can generalize that. But, you know, in the worst case, now what we have is that every 512 bytes we add one byte that says "this isn't compressed", and we've still made it bigger, but hopefully not as bad as it would be otherwise. And that's another reason: the sort of metadata about the data, that explains how the compressor is configured or how to decompress the next bit, can take up more space than if it was just raw data. So that's run-length encoding. That's one of the easier ones.
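A sketch of that per-block escape hatch, continuing the hypothetical format above: a one-byte flag in front of each 512-byte block saying whether RLE was applied or the block is stored raw.

```python
BLOCK = 512
RAW, RLE = 0, 1   # hypothetical one-byte block flags

def compress_blocks(data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        candidate = rle_encode(block)          # rle_encode from the sketch above
        if len(candidate) < len(block):
            out += bytes([RLE]) + candidate    # worth it: flag + compressed block
        else:
            out += bytes([RAW]) + block        # not worth it: flag + raw copy
    return bytes(out)
# Worst case is roughly one extra flag byte per 512-byte block.
# (A real format would also need to delimit where each compressed block ends.)
```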
What else could we do to compress data?
Uh, well, that's a good question. I don't have an answer to that.
Well, what is the most common letter in the English language?
Uh, E?
E, yes, the E. Right. So how many bytes does it take to store an E?
Uh, well, normally it's one byte.
One, right. Right, yeah.
But we know that E is more common... the most... if we're compressing, let's say, text, E is more common than almost every other letter. Well, it is the most common letter, sorry.
So what if we could say: let's forget about bytes, let's just treat this as a stream of bits. What if I said I could encode an E with only two bits? Like, maybe let's just say a zero and then a one is an E, as in the binary zero-one. And then every other letter, for the sake of ease over the air and without whiteboards to draw anything on, everything else is going to be a one bit followed by the eight bits of the rest of the letter. So that means that every other letter is now nine bits, boohoo, but
every e is only two bits right right so this is a kind of like you know you think of it like the
rle in terms of we're trading off something to encode something which is hopefully more common
or is more compressed
but we've made it slightly more pessimistic in another case but now if you imagine like a third
of all of your text is an e you've dropped that third down to two bits and everything else is nine
bits. Now, I can't do the math in my head, it's terrible to do, but there is a way to construct
an optimal tree of bits,
ones and zeros,
that give each character
that you could possibly have
inside your output
a variable number of bits
based on how common it is or not.
So while an E may be one zero,
I can't remember what I just said, 0, 1, whatever.
And now maybe the next most common letter is, I don't know, T.
That might be 1, 0, 1.
And that's unambiguously a prefix that only means it's T.
And it's three bits now.
And then you can start building a tree with all of the other things.
And until you get to like, you know, the Q and the Z take maybe even now 12 bits to unambiguously encode.
But now, you know, effectively, we are storing the letters with codes
that are inversely proportional in length to how common they are.
And now the document we're storing, provided it's an English document,
and fits this sort of statistical arrangement of probabilities
of how likely each letter is,
now we can store it in considerably less bits.
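A minimal sketch of building that kind of optimal prefix code (this is Huffman coding, which gets named later in the conversation) from letter counts. The frequencies here are invented for illustration:

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build prefix-free bit strings: more frequent symbols get shorter codes."""
    tie = count()  # tiebreaker so the heap never has to compare the dicts
    heap = [(weight, next(tie), {sym: ""}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # merge the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged |= {s: "1" + c for s, c in right.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

codes = huffman_codes({"e": 30, "t": 20, "a": 15, "q": 1, "z": 1})
# 'e' ends up with the shortest code, 'q' and 'z' with the longest,
# and no code is a prefix of any other, so a bit stream decodes unambiguously.
```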
I mean, it makes sense.
There's a reason why the phone number for emergency services
is just 112 or 999, or... what is it?
Oh yeah — 911 — that's what it is. You should probably know what that is if you live in the United States. Just saying. I would recommend that you write that down.
Okay, I'm going to write that down. It's like, yeah,
Yeah, so I was going to say that, like, phone numbers have, like, this sort of feeling of, like, they're unambiguous. You know, 911 doesn't code...
That's not the beginning of someone else's number.
You never accidentally code it.
But that means that anyone else starts with like a zero or a one
or something like that, which is like a reason why.
So you were going to say something.
Oh, I was just going to say.
So one of the things that I've heard about frequently
when talking about different compression algorithms
is using a dictionary as a compression mechanism. And one of the things that I've heard about that is that you do,
and I don't know if this is universally true, but it's something at least that I've considered in
the past, is you either need to do this thing where you're saying like, okay, well, let's assume
that we're going to be compressing English words. Like, okay, that's a big assumption, but sure, for certain applications, that makes sense.
In that case, you can kind of define some of these things up front, including something
like a dictionary, right?
So you can make just a dictionary of words, right?
And you can have each byte sequence refer to whole words instead of individual letters.
And that would be another way potentially to do this.
And there's somewhere on the order of tens of thousands of words in the English language.
So your dictionary wouldn't be all that big.
And maybe you'd get good compression out of that.
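A toy sketch of that pre-agreed-dictionary idea. The word list and the integer codes are entirely made up; the point is that both sides would have to ship the same table:

```python
# Hypothetical shared dictionary: compressor and decompressor both know this list.
WORDS = ["the", "and", "compression", "dictionary", "cat"]  # ...tens of thousands in reality
WORD_TO_ID = {w: i for i, w in enumerate(WORDS)}

def encode(text: str) -> list:
    # Replace known words with small integer codes; pass unknown tokens through as-is.
    return [WORD_TO_ID.get(tok, tok) for tok in text.split()]

def decode(symbols: list) -> str:
    return " ".join(WORDS[s] if isinstance(s, int) else s for s in symbols)

assert decode(encode("the cat and the dog")) == "the cat and the dog"
```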
Alternatively, you could build the dictionary after you've run through the data and found
the most frequently occurring patterns or words or sequences of bytes or whatever you might find.
But the disadvantage of that is that now you sort of have to run through everything before you can
build that dictionary or you need to have an algorithm that will build it on the fly
or something like that. And so that seems like a whole category of things where you're sort of
trading off between either knowing what kind of
data you're processing beforehand or doing some things post hoc after you've run through the data
that might make it a little bit more difficult to work with if you were, for example, trying to do
compression as like a stream. Like you don't ever want to have to seek through the data multiple
times or use a whole bunch of memory while you're compressing it and so you have these these
trade-offs that you need to make when you're when you're doing that um are there algorithms that
you've compression algorithms that you've worked with before that sort of make these trade-offs
in different ways?
Absolutely. So we're actually slowly walking towards what deflate is going to be, which is a cool thing, you know, and totally unintentional, which is great. But so, just on that one thing you mentioned about, like, if it was English words and the probabilities, the thing as I described — obviously, if the compressor and the decompressor both knew it was English words and they'd previously agreed on the frequency of letters (sorry, letters — I'm talking about the letters thing first), then they would both already know that E is zero-one and P is, whatever, one-zero-one or whatever like that. But if they don't know that, or if you
want to like as you say um have a sort of dynamic idea about well i ran through the data and i
counted how many of each letter there were, and now I'm going to build my tree of, like, how many bits code for which letter, uniquely for every
piece of data then you need to actually transmit that tree to the other end so that the other end
can say okay this is we agree on the code book of like how many bits mean which letter so that's a
downside with um uh dynamically sending that information across
because the tree is not free.
It takes up some space, right?
So you have to offset the compression you get from the tree
with the fact that you have to send the tree at all.
One of the cooler tricks is like exactly as you said,
is what if you instead of treated each byte individually you also said well dictionary
elements themselves are just other symbols so now we'll just consider compressing character at a
time, byte at a time — we'll just call them symbols. And let's say the first 256 symbols are the literal values 0 to 255 of the byte.
And then let's say symbol 256 is the first of my dictionary.
And it just means the letter, I don't know, the word cat, right?
And the 257 means dog and whatever, right?
And again, you could previously agree this on both sides. And the tree-building step that we talked about — which depends on the relative probability or number of occurrences of each of the symbols that you see — doesn't care how many bits long those symbols are.
right you can have a billion symbols in that tree if you wanted and it will still be the case that
the most commonly occurring ones will have the shortest bit encoding and the least frequently
occurring ones will have the longest one and obviously as you put more and more things in in the limiting case that can be a very very long
uh prefix so to the point where you're like well this is longer than the word that it's coding for
in the dictionary let's not bother with this anymore we might as well just encode it using
smaller words so a way of naturally extending the table of um of characters and their prefixes is just to throw in words as
well or previously agreed dictionary components so that you know hey this is this this pattern
appears in my data more often than not so um when you see one zero one one that actually means the
entire string cats are cool right because that happens in my my application a lot right that's the kind of thing you could do and um but there's another sort of aspect which is
um now i have to either transmit that dictionary boom right or we have to previously agree on it
again and now we're back into the world of like oh my golly uh how big is this dictionary and my
little embedded system is now scuppered going oh i need to store
you know a 50 megabyte table of this this dictionary that we've agreed upon and all the
prefixes to go with it so that's right right and if you do transmit it you have to transmit it
after you've examined the data, correct? Which means — and maybe I'm wrong about this — but if you're decompressing on the other side, you can't really decompress it in a stream, because you don't get the dictionary until the very end.
Well, usually you'd have to send the dictionary up ahead, because the compressor needs the dictionary as well, if you were to do it that way, right? Because as you're going through... I'm assuming that you have a static tree that you've built that is, like, here are all
the probabilities and they therefore don't change but obviously we can talk about how you can
actually update them and that sort of leads us naturally to the next thing which is what if you
didn't transmit the dictionary at all okay because yeah if i've if i've just transmitted the letters
c a and t because that's the first three letters
of the document that I'm sending
I have actually also
just transmitted the letter, the word cat
right, it took three
bytes or it took however many the relative
probabilities of the C, the A and the T were
to transmit
but now
I effectively have a dictionary record
for the word cat it is
three bytes back from where we're currently decompressing if you've just written if you're
the decompressor and you've just written out the c the a and the t because you've decoded them
so what if instead of sending an explicit dictionary of things that i've previously
agreed the dictionary entries themselves are effectively references back
from earlier in the document and so now i anytime i want to say the word cat i can say
depending on where you are in the output go back however many bytes to get to the the c and copy
three bytes c and a and a t and so now i've effectively got a dynamic
dictionary that um allows me to refer back to anything i've previously transmitted and so
suddenly i if i've transmitted the word cat i can say cat over and over and over again with just one
symbol and that's pretty cool even cooler is what if i had the letter a 700 times in a row you remember in our original
description we were talking well you know you have run length encoding i could transmit something
that says, hey, there's an A, followed by, there's a ton of them — sorry, you know, there's 700 of these A's in a row. But if I instead transmit the code "here is an A", so the decompressor has written out an A, and now I say, hey, there's a dictionary element I would like you to use: it is "go back one character and copy 699 times".
Yeah, okay.
So what I've just done is — if you
can visualize it i'm reading the A immediately that I've just decompressed
and I'm copying it to the current location
and then the two pointers are going to walk along
and so that A is going to get smeared out
700 times.
And so this dictionary compression that we've just talked
about where I can back reference
to a previous part of the
document gives us RLE
for free. That is run length encoding.
I can just take an A,
and then I transmit the dictionary entry that says, go back one, copy 600,
and I've got my run-length encoding. So the size of that amount of data that refers to how far
back in the stream you go — whether it's 4 bytes, 8 bytes, 16 bytes, however big it is — am I correct in assuming that that is the amount of previous data that you need to hold in memory in order to decompress the data as you stream through it?
Exactly right. This is referred to as a sliding window, because it's the last M bytes, usually something like 64K. So even if you're a streaming decompressor
and you're passing the bytes onto some other thing,
you're printing them directly to the screen, whatever you're doing,
because of this, you still have to hang on to the 64K of prior data in RAM
so that when the next byte comes from the wire that says,
hey, no, copy back 32K and it you know, it's 27 bytes from there.
You need to be able to get access to that still.
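A rough sketch of what that decode loop looks like, assuming a simplified token stream of either a literal byte or a (distance, length) back-reference — not deflate's actual bit format:

```python
def lz_decode(tokens) -> bytes:
    """tokens: iterable of either an int (literal byte) or a (distance, length) pair."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)                 # literal byte
        else:
            distance, length = tok
            start = len(out) - distance
            for i in range(length):         # copy byte by byte so the copy can
                out.append(out[start + i])  # overlap itself ("go back 1, copy 699")
    return bytes(out)

# "cat cat cat" via four literals and one overlapping back-reference:
print(lz_decode(list(b"cat ") + [(4, 7)]))   # b'cat cat cat'
# Run-length encoding for free: one 'A' then "go back 1, copy 699" gives 700 A's.
print(len(lz_decode([ord("A"), (1, 699)])))  # 700
```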
So obviously it does.
And I guess this is something we didn't talk about, like the runtime memory requirements
of the decompressor, right?
You have to keep something around in order to be able to decompress.
Yeah, that is another consideration that we forgot to consider.
We forgot, yeah, an unconsidered consideration.
Right, yeah.
So anyway, so what we've got so far is we have a tree
that lets us encode symbols, and symbols are not just bytes.
They could be dictionary elements.
And those dictionary elements are actually now phrased as
go back N and copy M bytes. Now, this is glossing over the actual specifics of deflate, which actually has different trees for different things or whatever, but let's just stick with this for the, quote, simplicity of trying to explain it to somebody who's walking their dog at the moment, or is pottering around the house, or on the train.
Exactly.
Yeah. So, you mentioned a number of times, you
know having to do multiple passes and the dealing of streaming and things like that unfortunately
that is something that the compressor is forced to do in order to be able to make these determinations
of what is likely. And although there
are algorithms for online building of the trees and whatever um typically what happens is the
compressor chunks its input data into sort of reasonable sizes and then the compressor normally
tries a few different types of algorithms uh exactly as i described right at the beginning
with the rle you know like it goes hey let's try doing it with uh just a static tree let's try it with a dynamic tree let's try it with
back references let's just uh and each time it says you know does this does this work out as
being smaller? And it would pick the block that is the smallest of them all. And ultimately there is a "didn't compress" type of block — you know, a stored-only block —
which only has a couple of bits at the beginning that says,
oh, sugar, I couldn't actually compress this.
And then here's just the literal data.
And so that is the hedge for like the uncompressibility.
It bounds the maximum size of the compressed thing
to be, you know, only a tiny percentage bigger
than the input in the worst possible case.
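You can see both behaviours with Python's real zlib bindings: compressible text shrinking at the different effort levels, and incompressible input coming out only marginally bigger rather than much bigger:

```python
import os
import zlib

text = b"cats are cool. " * 10_000      # highly repetitive, very compressible
noise = os.urandom(150_000)             # incompressible random bytes

for level in (1, 6, 9):                 # zlib effort levels: 1 fast ... 9 slow/thorough
    print(level, len(zlib.compress(text, level)), len(zlib.compress(noise, level)))
# The text shrinks dramatically; the random data typically comes out slightly
# larger than it went in - the stored-block escape hatch bounds the damage.
```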
So having done this, it's now run through.
It sort of encodes the dictionary-based compression.
And that is, for what it's worth, a very difficult problem, right?
It has to kind of search through the space of, like,
should I go back two bytes and copy one byte,
or should I go back further in the document and copy slightly more? Now, going further back I might be able to get a longer match, because maybe I've got "cat" and I've got "ca", and I can build "cat" by going back 700 bytes and copying three bytes. Or, I've recently done a "ca", so I can go back a smaller distance to get just the C and the A, copy those two, and then just output a T.
And those are equivalent in the output stream,
but the exact bit patterns of how to describe those two operations
of going back a long way and copying three
versus going back a short distance
will change the overall output compression ratio
because maybe the nearer ones are used more often,
and so they'll have a smaller coding in the tree,
and that kind of stuff.
So there isn't like a one true encoding,
and that's essentially, when you're gzipping something and you do gzip -9,
you're tending to try really, really, really hard
and go through all sorts of different combinations
and try different encodings to see if it will work out as well as there are some internal acceleration
structures that are sized accordingly when you do that and it's why it takes longer in the compressor
because it's essentially searching through to find what might be the best encoding of the data
Yeah, I didn't realize that when you're telling these algorithms to basically try harder, what they're doing is trying more internal algorithms, more approaches.
I guess, amongst other things, yeah. I mean, most of them end up having to do... I mean, if you imagine what it looks like, you're trying to find a prefix in some data you've already sent out. So you've got 64K of the stuff you've previously sent, and then you're looking at the next byte in the compressor and you're saying, oh, what is the longest match in the 64K buffer I've got behind me? And one way to do that is to just go back through all 64K and look for it, but that's a huge, order-N-squared operation you're now doing. And so most things do some kind of trick where they hash a few bytes and go, well, I'll just keep a hash table of, like, the prefixes "ABC", and I look at my hash table and kind of go, oh, I've not seen "ABC", or if I have, it was too long ago, never mind. And then, as I say, there are all these other sort of subtleties, about, like, well, maybe you want to prefer nearer matches over further matches, because you're more likely to say "copy from three bytes ago" over and over and over again, rather than "copy from 64,922 bytes ago", which is a very, you know, unique encoding compared to nearby ones. So there's a whole bunch of trade-offs up there.
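A sketch of that hash-a-few-bytes trick: a toy match finder that only remembers where each 3-byte prefix was last seen, with an assumed window size. Real compressors keep chains of candidate positions and many more heuristics:

```python
WINDOW = 64 * 1024  # assumed sliding-window size

def find_match(data: bytes, pos: int, last_seen: dict):
    """Return (distance, length) for a match at pos, or None; update the prefix table."""
    prefix = data[pos:pos + 3]
    candidate = last_seen.get(prefix)
    last_seen[prefix] = pos
    if candidate is None or pos - candidate > WINDOW:
        return None                      # never seen, or too far back: emit a literal
    length = 0
    while (pos + length < len(data)
           and data[candidate + length] == data[pos + length]):
        length += 1                      # extend the match as far as it goes
    return (pos - candidate, length)
```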
But anyway, so you've done this a number of times, and then you're going to tell the decompressor,
hey, I picked to use a dynamic tree,
which I'm going to transmit to you in this block.
And then, you know, obviously it's dictionary decompression after that.
So the tree gets transmitted,
and there's a pretty compact representation for the tree
in terms of which bits map to which symbols. And the decompressor has to generate that tree and then, obviously, look through and just copy from before.
But those chunks we're talking about — these 64K-ish blocks that the compressor has chosen to look ahead over and make its determination — those are different from the sliding window of 64K, right? Those are separate things there?
Right,
so because the compressor could just choose to look at the whole file and say actually this is
this technique that i'm about to do this tree that i'm going to build is good for the whole file
let's send you the whole file but for pragmatic reasons of not having to go all the way to the
end of the file and buffer it all in memory if you're a streaming thing, usually it's chunked into smaller pieces.
Chunking into smaller pieces also has a nice side effect, because it means that if, say, your document switches from English to Norwegian partway through, maybe you need a different set of, like, you know, vowels more often, or whatever,
diacritic marks and stuff like that or you know more generally in your um in your data file format
you're compressing there's likely to be a big header at the beginning which has some characteristic
and then a whole bunch of like other data in the middle which is a different thing so so it gives
it an opportunity to kind of reset and change the um the relative frequencies of all of the symbols
that are occurring depending on you know on the data which of course is another degree of freedom
the compressor has: hey, I could choose not to do that, I could do the whole thing, I could try doing 1K, then 2K, then 4K.
It's got a lot of degrees of freedom there,
which means, again, it can be difficult to write an optimal or a provably optimal compressor.
And this is all deflate for what it's worth.
This is all what deflate does.
Deflate has the dictionary-based part, which, if people know Lempel-Ziv, LZ77 — I think Ziv or Lempel, one of the two, died recently, and there was a lot said about it — but anyway, Lempel-Ziv is, like, all this dictionary-based thing where you have these back references to "N bytes ago, copy M", and there are a hundred different ways of encoding it or choosing to do it. There's all the different LZ77s and other numbers. I can't remember what the other one is now.
LZ4?
No, it's gone.
Then, obviously, the tree is called a Huffman tree,
and it's a Huffman encoding tree.
It's a way of building these unambiguous sequences of bit patterns
that will encode a node in a tree.
And the length of that sequence is
provably the shortest unique sequence that will get you to that particular point in the tree based
on its relative occurrence — how often it occurs. So that's pretty cool. And deflate is built of those two primitives, in various, you know, clever bit-packing encodings and other bits and pieces.
So the devil is in the detail with a lot
of these things. But there
are some really other left
field ways of
compressing data.
I mean, you mentioned lossy.
Yeah, right. So obviously you could
decide to just, you know, I don't know, divide everything by two, and then immediately you've saved one bit for every output value, and then you just multiply it by two on the way out. And maybe some values, you know... not everything's even, but who cares, right? For some data that might work.
But, sort of more wacky — and I still really haven't got my head around
how this works, and also I can't possibly begin to explain this, A, because I'm not very good at understanding it myself, but B, because we probably need a picture.
Right.
But there exists a transformation of your data called the Burrows-Wheeler transform. It is a reversible transform; that is, I can apply the BWT to my data and I get a permutation of all the bytes of my data,
and I can reverse it and get my original data back, right?
This has not changed in any meaningful way the contents of the data.
It's just rearranged it.
Okay, okay.
But the side effect of the Burrows-Wheeler transform is that it is sort of sorted your data.
Your data is more sorted,
it's more ordered because of the transform
than it was before.
Interesting.
It is interesting.
You know, you take a megabyte of data,
you apply the BWT to it,
and it's a permutation of it.
And the data, if you looked at it,
it was more likely to be A, A, A, A, A, A,
B, B, B, B, B, C, C, C, D, D, D, E, E, A, A, A.
And it might go back to A again. It's not... you know, it can't... it's not just sorting it.
Yeah.
And again, there's a great Wikipedia article on the BWT, and it's got a picture, and you can see what it's doing. And the way of reversing it is more tricky than you might think, but it can be done.
Right. Yeah. Okay.
So if you apply the BWT before you throw it into a compressor, what you've done is you've made it more compressible.
Right. Which is cool.
Yes.
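A naive sketch of the forward transform (the inverse is the genuinely tricky part, as noted — see the Wikipedia article): sort all rotations of the input and take the last column. Real implementations use suffix-array tricks rather than this quadratic toy, and the sentinel byte here is just an assumption to make it reversible:

```python
def bwt(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Burrows-Wheeler transform, the naive way: last column of the sorted rotations."""
    s = data + sentinel                      # sentinel marks the end of the string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)

print(bwt(b"banana"))   # groups equal letters together: b'annb\x00aa'
```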
And that's what... oh, which one is it? tar.xz? No, that's another one. The BZ extension, BZ2, for, like, tar.bz2 — Burrows-Wheeler is, I think, the B inside of bzip. So it applies it before it runs it into the compressor.
And actually, while we're talking about that, there's another general sort of technique
for compression which is to
apply a domain specific filter to your
data before you throw it into a compressor so for example you might know that there are certain
sequences in uh say arm assembly code or arm machine code i should say that are very common
but they are relative to their... or rather, their absolute locations... right, are they absolute locations in the file, or relative? No, no, they're relative. Sorry, let me get it straight: they are relative. So you can imagine there's a branch instruction, and it has a relative destination. And so, if every branch — if every third branch — is calling the same
absolute destination,
their encoding will be different because relative to where that branch is,
it's a different offset, right?
Which means it's, you know, unique.
But if you were to extract that out and say, okay,
well, these are all going to the same location.
I'm going to make them absolute.
Right, yeah.
Now suddenly it's the same bits every time.
Every time we see that branch or whatever or something like that.
I've just butchered it, obviously.
But, yeah, so there are certain things
where if you know ahead of time
that you can make the data more compressible
in a reversible way,
BWT being like a more generalized way of doing it,
then that's another technique.
But obviously the decompressor
has to know how to undo that too.
And it has to know that that process has happened.
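As a generic stand-in for that kind of reversible, domain-specific pre-filter (the branch-offset rewrite described above is architecture-specific), here is a delta filter: it doesn't shrink anything by itself, but it turns slowly-changing values into lots of repeated small bytes, which a general-purpose compressor then handles much better. The "sensor" data below is invented for illustration:

```python
import random
import zlib

def delta_encode(data: bytes) -> bytes:
    # Store each byte as the difference from the previous one (mod 256): fully reversible.
    return bytes((b - prev) & 0xFF for prev, b in zip(b"\x00" + data, data))

def delta_decode(data: bytes) -> bytes:
    out, prev = bytearray(), 0
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)

# A random walk standing in for "slowly changing" data such as sampled audio.
random.seed(1)
walk = bytearray([128])
for _ in range(100_000):
    walk.append((walk[-1] + random.choice((-1, 0, 1))) & 0xFF)
walk = bytes(walk)

assert delta_decode(delta_encode(walk)) == walk
print(len(zlib.compress(walk, 9)), len(zlib.compress(delta_encode(walk), 9)))
# The deltas come from a tiny alphabet, so the filtered version compresses much better.
```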
And very often those are incredibly application-specific.
Yeah, interesting. So it's like you're intentionally introducing duplication in your data, in a reversible way, because it makes the compression algorithm's job much easier.
Right.
And, you know, I've never actually done the experiment of just taking, like, a bunch of random text, sorting it, and then compressing it, and seeing what the difference is.
Yeah, it would be really... I mean, generally speaking, the measure of how much disorder is in the file is the entropy, and, you know, that gives you some kind of lower bound on how much you can compress a file.
So obviously, if you took a completely random file,
if you catted /dev/random and you compressed it,
you'd imagine it shouldn't compress at all.
It should be slightly bigger.
But obviously, if you sorted it,
what you're likely to have is an equal number of zeros, ones, twos, threes, fours, fives, sixes,
all up to 255.
It would compress really well.
And you'd imagine it would compress down to almost nothing because it'd be like yeah there's 250
million zeros there's 250 million ones there's 250 and so on and you're like all right so in
the limiting case you can get a really good ratio but obviously it's not reversible right right right
right but if you can come up with a way of encoding how to put the zeros back in the right place in the final document,
then you're sort of edging towards a compression algorithm.
But the amount of data it would take to encode the positions of all the zeros, for example,
and then all the positions of the ones should be the same amount of data as you put in in the first place.
So, you know, there's no free lunch there, but it might be better to encode or it might be easier to encode,
which is like essentially what the BWT is doing.
Yeah.
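For what it's worth, the sort-the-random-data experiment is easy to run with Python's zlib — sorting makes the bytes wildly compressible, exactly because you've thrown away the information about where each byte was:

```python
import os
import zlib

data = os.urandom(1_000_000)                 # stand-in for catting /dev/random

raw = len(zlib.compress(data, 9))            # random data: comes out slightly bigger
srt = len(zlib.compress(bytes(sorted(data)), 9))  # long runs of 0,0,...,1,1,...,255

print(raw, srt)   # roughly 1,000,000-plus versus a few kilobytes at most
                  # (and, of course, the sorted version is not reversible)
```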
And I'm sure I've butchered some description of it there, and we'll get people tweeting at us, or mastodonning at us, for those things.
Another way to think about it is that sometimes the trade-off you want to make in the compressor is... you know, like, imagine you're Google and you want to send out the Google... you know, there's a single google.png or google.whatever, right? And, like, literally every third web transfer in the whole world is of this google.png, right? You know, so every byte you can save on that damn PNG is worth its weight in gold.
Right.
I don't know how much a byte weighs, but, you know, whatever.
Yes. Given that PNG internally is using deflate, which is
another thing it might be worth saying — I don't care if it takes me six days to compress this PNG.
If it saves me one byte, that'll pay for itself over the course of probably a day.
Millions and millions and millions of page loads, yeah.
Exactly. And so there are tools that are dedicated to generating almost perfect bit streams of, like, this is the smallest conceivable sequence
of valid deflate bits that will cause the output
to be what you want it to be.
And one of those is something called ZOPFLI, Z-O-P-F-L-I.
It hasn't been touched in years,
but it is a way of creating and compressing a source document and generating essentially the optimal output.
And it takes hours.
Interesting.
A really long time.
And it's very limited.
You can't do very much with it.
The only reason I know about it is because of hobby projects that use this to compress stuff on a PC to be then put onto an 8-bit computer
where, again, that trade-off's fine.
You know, the source data is small.
It's, you know, 20K,
but you want to compress it
as much as actually possible
for your, you know, your 8K demo
you're trying to get working
or your 4K demo you're trying to get working
on your 6502-based machine.
And so it's worth saying,
well, I'll spend 30 minutes in my build
squishing it those two more bytes, because I can do something cool with it.
Or again, yeah.
That's neat.
And I think the way that they work is actually they start backwards.
They start at the end of the file and work backwards,
and they try and find the optimal thing that would encode what remains.
Something like that.
Somebody explained it to me once over a pint,
and it made perfect sense at the time.
And like so many of those things, as I say it out loud now, I'm like, how could that work? How did that work?
Yeah, exactly. Well, I think you're
definitely right about not being able to get to things like lossy compression in this episode
because uh it seems like this is a very deep topic which maybe warrants some follow-ups right uh so
i'm going to just do honorable
mentions now. We've talked about deflate all the way through. Google and Facebook have both got their own, like, versions — again, I think they're all using the same techniques, slicing and dicing the same tricks but with different trade-offs — but Zstandard, which is now seemingly the de facto compression, at least in our company, Zstandard is coming, and Snappy. They're both libraries, again, that are not deflate, but they use broadly the same parts. But yeah, honorable mentions for those, only because, you know, we've talked about gzip, which is, you know, ancient, really.
Right. Right.
And they, you know, those things, Snappy and Zstandard, are usually faster and have comparable if not slightly better compression ratios too. So, yeah. One last
thing: somebody posted a... so we talked about PNG just then, which is, you know, a very complicated file format, and ultimately the internals of it are in fact deflate-compressed and whatever. Some random coder just came up with their own image file format, relatively made up on the fly.
Like, well, one bit.
And then it's like, this is how we encode the next pixel, two bits.
And this is how we encode it in the next pixel, whatever.
And it beats PNG hands down for lossless.
So there is still room in the world for compression formats that are incredibly bespoke
So this is just for image data.
But still, though, I mean, there's a lot of image data in the world.
Exactly. I wish I could remember it, and again, I'm sure we'll put something in the notes that says what it really is. But yeah, as you can tell, I love this stuff. If I remember rightly, I've got somewhere... I've got, like, an online viewer where you can type in words and see how a dictionary compressor would see it and how a Huffman tree would see it. So we're definitely going to have to link that in the notes too, so that you can have a little play around and get more of a visceral understanding of what's going on behind this. But yeah, it's fun.
Yeah, that's really cool. Really cool. All right,
maybe part one of two i don't know maybe we'll see we'll see if it works out like that
all right man well i'll talk to you later until next time my friend
You've been listening to Two's Complement, a programming podcast by Ben Rady and Matt Godbolt.
Find the show transcript and notes at www.twoscomplement.org.
Contact us on Mastodon. We are @twoscomplement@hachyderm.io.
Our theme music is by Inverse Phase.