CoRecursive: Coding Stories - Story: From Project Management to Data Compression Innovator
Episode Date: May 2, 2023

How do you accomplish something massive over time? I've had the chance to meet with a number of exceptional software developers and it's something I always wonder about. Today, I might have an answer with the incredible story of Yann Collet. Yann was a project manager who went from being burnt out on corporate life to becoming one of the most sought-after developers in the world. What happens when you build something so impressive and valuable that it essentially becomes invisible? And how do you do that when your day job is mainly organizing spreadsheets and keeping timelines on track? Yann built LZ4 and ZStandard, two of the world's fastest compression algorithms that have transformed databases, operating systems, file systems, and much more. We'll go back in time to Yann's initial steps with programming, his game-changing discoveries along the way, and how his devotion to his data compression hobby led him to create something that saves billions of dollars worldwide.
Transcript
Hi, this is CoRecursive, and I'm Adam Gordon Bell.
Each episode is the story of a piece of software being built.
How do you accomplish something massive over time?
I've had the chance to meet a number of just amazing software developers,
and it's something I always wonder about, like how did they do that?
Well, today, maybe I have an answer. I have the incredible story of Yann Collet.
Yann was a project manager who went from being burnt out on corporate life to becoming one of
the most sought-after developers in the world. What happens when you build something so impressive
and so valuable that it spreads everywhere and becomes almost invisible?
And how do you do that when your day-to-day job is mainly organizing spreadsheets?
Yann built LZ4 and ZStandard, two of the world's fastest compression algorithms, and they've transformed databases and operating systems and file systems and
much more. They're everywhere. But we'll go back in time to Yann's initial steps
with programming, his game-changing discoveries along the way, and how his devotion to data compression as a hobby led him to create something that saves billions of dollars and changes everything.
This is an unforgettable career of passion that starts very simply as a hobby.
And it starts in Paris in the 1990s.
I had decided back then that if I had to live in Paris, I would live in Paris and not around
it. Because then you have only the negative side and none of the plus side. But another
aspect of it is that if you live in Paris, you better not have a car.
So I don't drive.
I actually use my bike to go to work.
Yann was working as a project manager at a tech company that contracted with the government of France.
And let's say the product I was in charge of was not working well to the point that I started to develop some script.
I wasn't a programmer, but I could do some simple stuff just to help the product just pretend that it works.
Which obviously wasn't supposed to be my job.
But OK, so it gives me a better understanding of what's going wrong.
And at some point, I go to see my boss and say,
this product is so bad, we would better rewrite it entirely from scratch.
My only experience at that point is that I've been able to put some script.
And when I say some script, really make it clear,
there's just a few batch files.
That's what it is.
So that's really nothing.
But it gives me a little
bit of an idea of what programming is about. The project was a military radio communication system.
So the point is, you can communicate with voice, but it's awfully inconvenient to the point that
you need to set it up in advance using some huge configuration files that needs to be exactly correct.
Otherwise, it crashes.
And of course, everything must be static.
So if there is any node in the network that is not there, it's going to crash.
I mean, this is a battlefield.
You expect things to remain static?
That's nonsense.
So my goal is to have something which is fully dynamic, which essentially
discover everything when it connects, authenticate and discover its neighbor
and try to establish relations.
And it's actually fairly complex.
But the point is it makes me work directly with software engineers now.
And because of that, my product description is more than
an idea on a slide. It's actually a specification. It goes into great detail
on how it's supposed to work. This made Yann nervous. He's not done this
before and he's just not sure if he's doing it right or not. Let's just discuss
with programmers. Do they understand what I ask them to do?
How do they even accept the idea?
And to my surprise, they take the spec and say, oh yes, of course we can do that.
And it starts very quickly actually.
In a matter of two months, we have a working prototype, and then two more months and we
have a decent early product.
So it's actually fairly quick.
This was a surprise
because Yann was pretty certain
he couldn't actually contribute
to product development.
I was always convinced
this is not for me.
It's too complex.
Before that,
I've been exposed to technology,
but always on the user side,
not as a creator of technology.
And I was sure I couldn't do it.
But this experience proved to me that actually, yes, that I can do it.
The project failed for unrelated political reasons,
but it meant something to Yann.
I've got direct relation with software engineers,
and it's a pretty good relation.
We actually get along very well, probably because
I'm basically the only guy in the marketing department to try to get close to their language,
to express it in ways that they can implement. And we have a kind of feedback model where I
get close to them and we course correct anything that goes wrong.
This model gave me a form of friendship, I would say.
In a more merit-based world, maybe Yann would transfer to product development and find his calling there and live happily ever after.
But in this world, he moves on to other projects, many of which don't use his skills very well.
And despite him working hard and really caring a lot, a lot of these
projects just get canceled.
The years slowly pass by until one day something changes.
I would say it's a moment in my life where I kind of change my aspirations.
I was 35.
In my twenties, I've been really working a lot, chasing the professional success and so on.
And by my 30s, I had accumulated quite a lot of scars.
I was no longer convinced that working a lot was really a great future.
Basically, Yann was burnt out.
It resulted in an extremely reduced ambition. Basically, just be a nice guy,
have a good life.
That's good enough.
There is no point in doing more than that.
Generally speaking, I think it's good advice
to give to anyone
because it considerably reduces
the kind of pressure
that one can put upon himself
about the need, if not the requirement,
to absolutely succeed.
So I kind of changed all.
I'm thinking, okay, I will try first to get a stable job,
which I managed to get.
And from there, I discover, oh, but now I've got a ton of time.
Since I'm no longer working, I've got a ton of time so I can actually develop other activities
just to enjoy the time.
So initially, I play a bit of video games, but it feels hollow after some time.
My next occupation would be, let's learn about history. I always love the stories. So I tried,
I'm starting to learn a lot about history actually. But after a few years, I've got this
feeling that I'm reading again the same story or a different version of the same story. So I'm not learning more or not enough.
Also during the day, Yann's now working as a project manager.
I'm organizing big projects across the world. And so a lot of people depend on that. But I mean,
organizing Excel spreadsheets, essentially, making sure that the product is at the place where it's supposed to be, making sure that everybody's prepared, that the right teams are aware.
It's not exactly as thrilling as inventing a new product.
So something is missing.
And I think that's also why I go into, well, let's do some programming.
After all, I had a nice experience with real programmers,
and that will help me understand them better.
And that's how I started.
There's lots of ways to start programming.
You can get a book, you can do online tutorials.
But Yann decided to head back to his roots,
back to when he was in secondary school.
So as a student, we had the right to have a graphic calculator.
It was the early 90s.
And the one I selected was the HP 48.
So that's a very specific calculator because the CPU is really unlike others.
It's a 4-bit CPU, but it has a very large register to be able to have big numbers.
Essentially, it's a graphics calculator.
You're supposed to draw curve equations and curves out.
But of course, if it is a programming device,
that means you can try to make it run Doom or anything like that.
So it turns out that although he didn't think he could cut it
as a professional programmer,
Yann actually had programmed in the past for this HP calculator.
It was no Doom, but he had built a game called Fantasy Conquest.
So it's the idea of having a band of marauders, essentially.
And they roam around the country and they try to become the biggest band of marauders around.
It's not an intelligent game.
It was kind of working enough for some players to enjoy it, but it was never great.
And I had really no time to really go into detail because that's also student time.
We also have some studies to do.
So most of the time is actually spent there trying to get a good engineering school.
But back then when he did work on the game, he remembered it being fun, or at least something he enjoyed.
This HP calculator, it had a 131 by 64 screen,
so not a lot of pixels.
But it was enough to build a game.
The user experience, though, that was a bit of a challenge.
In order to play the game,
you need to have exactly the same calculator, an HP 48. And then we could download the games through either infrared connection directly,
or more complex was to get a cable to a PC.
And back then that wasn't common at all.
So infrared was used a lot.
And the thing is, the game was large.
It would take up to 50 kilobytes.
And that was considered very large by that time.
So you needed a memory expansion to actually even be able to load that.
And that was one of the problems.
This game was too heavy, and it deserved to be optimized a bit.
And so that's where I started to think about compression.
So in the evenings, after history has lost its luster, Yann starts
picking up this game, starts improving it, starts programming it.
It becomes his new hobby.
I'm not going to program every day.
It still has to be a hobby.
So I've got a number of activities, but when something is interesting,
you come back to it regularly.
So that's exactly what I do.
I would program around this calculator for many months.
I wouldn't say every evening, but very frequently.
And over time, skill level would simply improve just as a function of practice.
All I wanted to do is to have some fun and finishing this old game I never finished.
That's more or less all I wanted.
And yeah, I never planned anything out of this.
So it's kind of a surprise that one thing led to another.
This game did lead to so much, and it ends up taking him to the other side of the world eventually.
But immediately, it helped him with something maybe more important.
How can I explain that?
Because it's difficult to.
But I would say when you have a job, which is clearly just a job to live,
I think it's really important to have a side activity.
So this side activity doesn't have to be something useful or that brings money.
It's just a joyful side of thing.
It could have been dancing.
I've been doing dancing to race away.
It could have been anything that brings you joy in your daily life.
I don't want to give the feeling that I was doing only programming data compression in
society.
This was one of the things I was doing.
But yeah, I think it's very important to have activities outside of the job.
I would even call it compulsory if you want a fulfilled life.
And so fulfill him, this game does.
There's a couple ways to program on an HP calculator,
but to compress a game so that it doesn't require the memory expansion,
there's really only one way.
You have to program it in assembly.
And so I start.
The first iteration, of course, are not great,
but they are all learning steps.
Every small iteration gives something,
a better compression ratio, better speed,
better memory usage, something like that.
After some time, I've got something which I can be proud of,
which is essentially a very fast decompressor.
So the idea is that the game would decompress itself
on the fly.
It's actually composed of multiple small modules.
And every time a module is called, it gets into RAM,
gets decompressed.
And that helps to reduce the size of the game
by 30% approximately.
And when we think about it, 30%, that's all, it's not that big.
But for some reason, I'm proud of it or something like that.
I've invented something.
Plus, data compression, that's clearly something that always felt magical to me.
So that's the starting point.
Before I knew it, I'm realizing that I'm only working on the data compressor.
I'm not touching the game anymore.
I'm always, always trying to improve the data compression side.
And after that, once again, there is no plan.
It's just a long journey where every step I learn something new and I find that interesting.
Data compression is a complex field to learn, though.
So initially, Yann's approach is to just try things out
and discover his own way
How can I search faster? How can I encode that in a way which is less wasteful? And so on. This is insanely
slow, obviously, but it works. In many cases, I find a great idea that I'm almost proud of, and it doesn't
take long before I understand that this idea is common for programmers.
I just didn't know it.
So, for example, I invent a hashing function in order to search faster.
And it doesn't take long before I understand, well, that's common.
Everybody does that.
So it's really a journey and a ton of fun discovery.
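That hashing trick (remember where each short group of bytes was last seen, so the search for a previous occurrence becomes a single table lookup instead of a scan) is the heart of fast LZ-style compressors. A minimal sketch of the idea in C, as a hypothetical illustration rather than Yann's actual code:

```c
/* Hash the next 4 bytes and remember where that group was last seen,
   so finding a likely match is one table lookup instead of a scan.
   Hypothetical sketch, not Yann's code. */
#include <stdint.h>
#include <string.h>

#define HASH_LOG 12                                /* 4096-entry table */

static uint32_t hash4(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, 4);                              /* unaligned-safe read */
    return (v * 2654435761u) >> (32 - HASH_LOG);   /* multiplicative hash */
}

/* Returns an earlier position whose 4 bytes match src+pos, or -1.
   table[] must be zero-initialized; entries store position+1 (0 = empty). */
static long find_match(const uint8_t *src, long pos, long *table) {
    uint32_t h = hash4(src + pos);
    long candidate = table[h] - 1;
    table[h] = pos + 1;                            /* remember this spot */
    if (candidate >= 0 && memcmp(src + candidate, src + pos, 4) == 0)
        return candidate;                          /* a real match, not a collision */
    return -1;
}
```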
So Yann makes a standalone compressor for the HP
48S. At this point
though, the
scene for the HP 48
calculator is pretty much
dead, I'd say.
So there's not a lot of spectators
to see the work.
But I would nonetheless continue to
develop it. I would start
even to develop different variants,
some stronger than others.
But the real key selling point of the main algorithm
I would employ is its speed.
It reached an extremely high speed of 80 kilobytes per second,
which obviously nowadays looks like shit.
But these calculators, they don't have that much RAM,
so it's still fast for them.
And after some time, a good year, I would say, I still have this feeling that, okay,
I'm developing something.
It's interesting.
I like it.
I find it interesting.
But there is almost no one to enjoy it because the scene is dead.
So my next step here in this journey is to say, okay, let's go to the PC scene.
It's the one which is active,
2009 approximately.
The PC scene is active,
but they don't have practice
at squeezing performance
out of old underpowered CPUs
like Yann does.
I've read in newspapers
that programmers
only produce bloatware,
which are worse and worse every year.
I would have a definitive
competitive advantage there.
So I'm pretty sure of myself.
And I start developing on the PC side.
And now I have to learn C
because you don't write Saturn assembly on a PC.
So I have to write C.
And yes, very quickly, I understand that.
No, no, I'm pretty far behind.
Very, very far behind.
The compression libraries available on Windows,
they turn out to be way faster than expected.
But Yann finds a way to catch up.
He finds this online forum
where people interested in
compression gather. It's a
vBulletin forum, and to register you have to
answer questions about compression, like
who created WinRAR. Eugene
Roshal, by the way.
And this place becomes Yann's watering hole.
So I was not alone. And that's fairly important, I think. But more importantly, I think it gave me a frame of reference. I could compare, I could get evaluated. And so there was a sense of belonging
to a tribe of peers. And I think it matters because it's difficult to sustain such a long effort,
multi-years effort, with no such context at all.
Also, within this community, it's easy to evaluate each other's work.
Every once in a while, someone would come and say,
hey, I invented this.
And some people would test it and would say,
oh, it's good, or it's great, or it's not.
And there would be no shortage of people who, like me, are interested in data compression
and would test the program. So, oh, that's a perfect ground. Then I learned that, yes, data
compression is not limited to WinRAR and WinZip. You have actually hundreds of possibilities out there. And some of them are in my category, which means simple, light, and fast.
And now I've got something to compare to.
So once I finally understand how to program in C, that's why I do my comparison.
And yeah, that's why I quickly understand I'm behind, way behind.
There are a decent number of people out there which can develop very fast compression
algorithms, way better,
way more efficient. So
there is a learning curve there.
But I think the critical part
that I learned from the HP
48 experience is just
the will, the will to learn.
There's this kind of blind trust
that there is something interesting
to learn.
So let's dig a bit more.
This will to dig in is one of the keys to Yann's eventual success.
But also remember, this is just a hobby for Yann, like learning history.
And so he's not in a rush.
He wants to understand each step along the way, and he takes his time.
I think it's fair to say that in the data compression community, most
researchers were interested in the best compression ratio.
And speed, yes, is a side effect.
Let's make it not too bad.
That really comes second.
And my mind was really focused on, I want great speed, and without
sacrificing speed, I want to get a better compression ratio.
So it really changed the perspective of what matters.
And it doesn't take too long
before I got something competitive.
And at some point, also by chance,
I have an algorithm which seems to be the fastest around.
So that's LZ4.
I think it's fast because it's simple, and also for other properties
of the CPU that I wasn't aware of at the time. So I just realized, oh, this thing can decompress at
one gigabyte per second. That's crazy. I wouldn't have expected that. And that's it. I do not
continue on that. It's like one step in this learning experience. So Yann has built the fastest algorithm at this level of compression that he's seen.
And then he just moves on.
It's just a hobby, right?
And nobody cares about speed like he does.
Because the thing is, LZ4 is fast because it doesn't do much.
It does less compression than Deflate, the algorithm that's used by zip files and gz files.
And because it does less, it's very fast,
but once it's built, there's not a lot to learn.
It doesn't do a lot.
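You can see how little it does on the decoding side. An LZ4-style block is just alternating instructions: copy some literal bytes, then copy a match from output already produced. A simplified decoder sketch in C (a toy that assumes well-formed input, not the production LZ4 code):

```c
/* Toy LZ4-style block decoder: each sequence is a token, some literal
   bytes, then an (offset, length) copy from earlier output. Assumes
   well-formed input; real LZ4 handles many edge cases this skips. */
#include <stddef.h>
#include <stdint.h>

static size_t read_len(const uint8_t **ip, size_t len) {
    if (len == 15) {                       /* nibble 15 means "more bytes follow" */
        uint8_t b;
        do { b = *(*ip)++; len += b; } while (b == 255);
    }
    return len;
}

size_t toy_lz4_decode(const uint8_t *src, size_t srcSize, uint8_t *dst) {
    const uint8_t *ip = src, *iend = src + srcSize;
    uint8_t *op = dst;

    while (ip < iend) {
        uint8_t token = *ip++;

        /* 1. Copy literal bytes straight from the input. */
        size_t litLen = read_len(&ip, (size_t)(token >> 4));
        for (size_t i = 0; i < litLen; i++) *op++ = *ip++;
        if (ip >= iend) break;             /* last sequence has no match part */

        /* 2. Copy a match from bytes already decoded. */
        size_t offset = (size_t)ip[0] | ((size_t)ip[1] << 8);  /* little-endian */
        ip += 2;
        size_t matchLen = read_len(&ip, (size_t)(token & 15)) + 4;  /* minmatch 4 */
        const uint8_t *match = op - offset;
        for (size_t i = 0; i < matchLen; i++) *op++ = *match++;   /* may overlap */
    }
    return (size_t)(op - dst);             /* decompressed size */
}
```

Two tight copy loops and almost nothing else per byte: that near-absence of work is roughly where the gigabyte-per-second decompression figure comes from.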
And so Jan just moves on to more advanced compression topics.
I learn about Huffman, arithmetic coding,
reduced offset, context mixing,
things which become really, really complex.
But none of this stuff is actually usable.
It's really just for the learning experience.
I'm just following other people
who have been developing the same things before me.
So I'm not doing something really special, but I learn.
And at some point, I stopped making progresses.
I would say that's a moment where I've learned whatever was easy to learn, whatever takes a
week or two to learn. And that's a moment where I have to decide, what do I do now? Do I move on?
Like I moved on from learning history to programming. Do I move on to something else or do I double down? And so I
decide to come back to this first compressor, LZ4. And I'm thinking, okay, I've got something here.
This one is actually above its peers. So there is something there. But I achieved that outcome
almost by chance. Let's understand. Let's go deeper. Why? Why is that? And how can I improve it?
Making the first version of LZ4 was relatively easy, but understanding why it's fast and how
to make it even faster, well, that's a whole different level of complexity.
Instead of making huge progress every week, it would take months to make small progress.
But that's also the small progresses which are actually difficult to get.
After that, I'm starting to understand why it's fast,
how to make it faster, and also how to make it stronger.
So now I'm getting more into how do I search efficiently,
how do I combine, how do I pass data.
These are all fairly complex topics
that I wouldn't care spending time on if it's
just to learn quickly. But at that point, I'm more in this mentality, so I've got something
which is a bit better. If I want to make it usable, I need to polish it. And so I take the
time to polish it. And by the time I believe I've got something good,
it has taken almost a year to understand all this.
LZ4 is now faster than every single compression algorithm that Yann can get his hands on.
Yann knows that LZ4 may have tremendous impact on the world.
Data compression is all about making things smaller and more efficient.
And when you're dealing with vast amounts of data,
even small
gains in efficiency can add up to massive savings in time and money and resources.
Something like LZ4, which can compress and decompress at lightning fast speeds,
has potential to revolutionize the way that we store and manage data. It's that important.
But also, it's just some code on his computer. It's not a proper library. It doesn't
have a license. It doesn't have users and nobody knows who he is. And so its impact could be
nothing. It could be zero. The next step to change that is to open source it. And open sourcing is
more complex than it sounds. Yann's never worked as a professional developer before, and he's trying
to learn how to do things the right way. Then I understand new stuff, which is unrelated to data compression, such as never use global
variable, make that a library that people can actually integrate, all things that I wasn't
even aware of. And it takes me a few months to get that right. And at some point in 2011,
I'm thinking, okay, I'm ready now to open source it.
And one week or two before I do that, Google open-sources Snappy.
And Snappy is basically exactly in the same category as LZ4.
And it's actually, I think, a bit better than LZ4 at that time.
So I'm kind of, what?
Snappy gets a lot of attention.
It turns out that inside Google, they had needed ways to compress data from Bigtable.
And so they needed something that did less than Deflate, but did it much faster.
This is very much Jan's approach, but they came from a different direction.
As soon as it gets out, there are a lot of articles over the internet
which talk about this radically new compression,
which is so fast.
And a lot of projects which get interested.
Basically, all databases start to say, oh, but that's what we need because we need speed.
So it's an instant success.
What did, like, did you feel upset?
Upset?
No.
But maybe some form of stress.
I'm not sure. I feel as if I was in a kind of increased activity mode in my brain.
Okay, there is something happening now.
What I learned from that is that my algorithm is actually the only one able to keep up with Snappy.
So that's still something.
That's not bad.
And I was convinced that I should be able to do better.
So I doubled down again.
So now I'm focusing back on speed and making that faster, faster.
I'm looking at Snappy and thinking this thing is a bit too complex for what it does.
LZ4 is way simpler, so it should be faster.
I believe that.
I believe it should be faster, but I don't know exactly how.
So I focused on that.
And indeed, months after months, progress show up,
and 10% by 10%, LZ4 is actually faster than Snappy,
and after six months, it's actually way faster.
So at least it answers the first part,
which is a bit, I would say, egocentric, like I can do it. You took on Google.
Exactly.
It's kind of crazy for someone who wasn't even a programmer.
That's right.
Through these six months of slowly finding speed improvements, Yann is still working as a project manager.
He's still biking to work in the morning, organizing spreadsheets, and then biking home.
So some people around me are aware that I'm doing that as a hobby, but no one thinks much
about it.
I myself still have this excuse that thanks to this hobby, I'm actually a pretty efficient
product marketing manager, and I have some good relation with my programming teams.
I've got several ones, so I can put together a plan.
I know how these things communicate between them.
So that makes me confident in this role.
I think that's for the professional side, that's probably the best.
And they feel it too.
I mean, programmers also quickly understand that the marketing guy in front of them understands
programming, but most of them are not aware that I'm doing that on the side. So it's not about
talking about data compression. It's just about acquiring some culture of programming.
And yeah, data compression is more like, I don't know, my hobby.
Probably Yann is a good project manager, but in his hobby, which is very, very niche, he's slowly becoming one of the best in the
world.
Maybe before internet, you just had to be good in your local neighborhood.
So I don't say that in a very strict sense.
Local neighborhoods can be the companies that work in the same field that you know about.
But now, you need to be good at a worldwide scale.
That's insanely good.
But in the same time, now you can be very good at something very niche
that would be of no importance to anyone you know around.
None of your friends, none of your family, no one cares.
But at the scale of the planet, there are actually a non-negligible amount of people who care about that.
And now you can be very, very good at something very, very precise.
I think it's a very big change over the last 20, 30 years.
That change in the world, that change caused by the Internet,
should mean that Yann is now getting all the attention in the compression field, right?
LZ4 is faster than Snappy, and look at all the attention that Snappy got.
Well, it turns out that that's not exactly the case.
So sometimes, some people which like data compression, they would write an article
about comparing different compression algorithms.
And so I try to be visible.
And that's where I would say, hey, I've got something that you might be interested in.
And every time that happens, it ends well for LZ4
because the comparison is pretty favorable if we only look at performance metrics.
Now, on the potential user side, it's a very different story
because Snappy, you've got Google attached
to that.
So that's real work, professional work.
And this random guy, no one knows about, okay, who knows?
So I would say initially there is absolutely no traction at all.
But there is one thing I've got that Snappy doesn't.
During the time I spent understanding
how it works in a deeper way,
I developed a variant called LZ4-HC,
which is LZ4 high compression mode.
So this is a variant which is slower when it compresses,
but it has a much better compression ratio.
And the decompression speed is the same,
which means it's extremely fast.
In a scenario where you compress once
and decompress many times,
think assets in video games, for example,
that's pretty useful.
And that's where I would essentially score my first wins,
if you call it this way, my first users.
Some indie video game developers realized that they could use LZ4-HC
and get very good decompression speed out of it
and a much better compression ratio than Snappy can offer,
because it doesn't have a high compression mode.
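With the real LZ4 C API (lz4.h and lz4hc.h), the compress-once, decompress-many pattern looks roughly like this; a sketch with error handling trimmed:

```c
/* Compress once with the slower high-compression variant, decompress many
   times on the same fast path. Sketch using the real LZ4 C API; error
   handling trimmed. */
#include <stdio.h>
#include <string.h>
#include "lz4.h"
#include "lz4hc.h"

int main(void) {
    const char src[] = "game asset bytes... game asset bytes... game asset bytes...";
    int srcSize = (int)sizeof(src);
    char comp[256], back[256];

    /* Build time: spend CPU for a better ratio with LZ4-HC. */
    int cSize = LZ4_compress_HC(src, comp, srcSize, (int)sizeof(comp),
                                LZ4HC_CLEVEL_MAX);

    /* Load time: decompression is the same fast path as plain LZ4. */
    int dSize = LZ4_decompress_safe(comp, back, cSize, (int)sizeof(back));

    printf("%d -> %d bytes, round-trip %s\n", srcSize, cSize,
           (dSize == srcSize && memcmp(src, back, (size_t)srcSize) == 0)
               ? "ok" : "FAILED");
    return 0;
}
```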
The first game to use this format was for the PlayStation Portable,
which has a totally different CPU than the one on Yann's PC.
Which also introduced me on the topic of portability.
So it was a great learning experience.
And then I learned that more and more games were starting to use it.
And I don't know exactly when, but at some point, Unity decided to adopt it. And through Unity, you know, a lot of games, even today, which use LZ4,
they are just not even aware of that.
It's part of the default setting.
I don't know. Did this feel like something?
You're like, I'm on to something here?
I did something that's useful.
That's a great feeling, honestly.
At that point, that's all there is.
I mean, I'm glad.
What I don't realize is the kind of positive reinforcement loop it triggers.
Because now this is a product which has been tested,
has been shown to be better technically,
but is also used in commercial products.
And so now, even in more serious activity,
if you call it this way,
I mean, not video games,
but actually databases
for other professional systems,
now it becomes something
that's worth considering.
And now, every time there is this competition,
Snappy versus LZ4,
and LZ4 wins every time.
And after a few months,
it starts to be known.
And so it reverts.
I think more and more projects use LZ4 first.
It did not happen overnight,
but it's kind of a gradual improvement story
and that takes time.
The news takes time just to reach people.
The first big non-gaming open-source project to use LZ4 was Hadoop.
Which was supposed to be the competitor of Bigtable from Google.
That was something that was kind of, wow, this is serious now.
A bit later, I learned that it was being evaluated at the LHC, which is the Large Hadron Collider.
So that's for research purposes in Switzerland, a very big particle accelerator, which has a massive amount of data to generate and deal with.
So they need something fast.
So I said, wow, it's even useful for some kind of fundamental research.
So yeah, that's where I started to see this is more than just a toy project.
It started to be useful for a broader range of applications.
One reason for its broad use and adoption
is that when he first open-sourced LZ4,
Jan spent time learning about how to best manage
an open-source project.
So even what is a license?
What is an open-source license?
What are the choices?
Why is this one?
I had a pretty good discussion
at the very beginning of the project
with people from Unity about that. That was important because I think I would have gone for
GPL. Instead, I went for BSD, and it totally changed the scope of, I would say, the addressable
market. It was a very important decision retrospectively, but at that time, I did not
really understand it. So I kind of trusted these guys, seemed to know what they are doing.
And then of course I got some feedback.
Another thing I learned, feedback is gold.
Anytime someone comes with a problem, that's a real problem.
It would be so easy to say, hey, I don't have this problem on my computer.
I don't care.
Instead it's, Oh, you have a problem.
So it's worth solving.
How many other people had the same problem and never told about that? It actually matters. It does matter. Of
course, we all know that. But if you're a Google dev, you probably have internal Google things to
do. The external issues on GitHub, they might take a back seat. And actually, an amateur which is
focused on its own turf is going to be more reactive, more present than a professional which is already overburdened.
So indeed, that's what happened.
If you went to the Snappy project at the time, you would see the issues increasing all the time and response times getting longer and longer.
So that's part of what I call the project.
LZ4 was really much more focused,
much faster release time, correction, and so on.
And so that pays off, and LZ4 keeps spreading.
Eventually, some engineers at LG want to get it into the Linux kernel.
Getting your project merged into the Linux kernel
is a big deal, but it's also hard work.
It's a large code base.
And because of what it is,
it has lots of rules about how things should be named
and how code should be structured.
Luckily, Yann doesn't have to take this on at all.
The engineers at LG lead the effort.
But even for them, it's hard.
I see the multiple runs
where they get rejected and have to retry.
So as long as it's not done,
there is always a reason why it could not pass the bar.
And I don't know which reason it will be next time.
Could be something as simple as,
yeah, the name of your function
does not completely respect the naming convention
for this part of the kernel.
And all of these things, they look like secondary,
but I totally understand
that from the maintainer perspective, it's important.
That's what makes them able to maintain this big pile of code.
So it takes really a lot, a lot of time to get integrated.
I think it took a year.
And at that point, I realized that no one around me would understand what I was saying.
So I feel a bit alone. So yeah, for me, it was just a great achievement.
But it seems that from this community, it was a bit more than that.
We have a French word for that, adoubement.
And I'm not sure what the word in English is.
It's where a squire becomes a knight.
Oh, it's just knighted in English.
Knighted.
Yeah.
It's a simpler word, I guess.
Yeah.
Yeah, it's more direct.
I think it's more or less what happened from their point of view.
Now, I was no longer just an amateur doing something and having strangely some success.
It felt like being anointed.
It shifted my perspective on, hey, maybe it can be a bit more than just a hobby.
So LZ4 keeps spreading.
But also, Yann gets to hear from people who evaluate LZ4 for their project and then reject it.
I remember database systems, which typically cut data into blocks and compress each block individually.
Some of them would say, yeah, we absolutely need speed to the point that we don't compress, so LZ4 is great. Some of them would say, no, no, no, we compress with Zlib and
it matters because we have an installed base of customers. We cannot tell them, no, you are going
to need 20% more storage. We do need compression. We do need speed, but we have made the choice
that compression really matters, so we are already using Zlib.
And Zlib is really the standard compression library that everybody knows and everybody uses.
Zlib uses the Deflate algorithm mentioned earlier, same as zip files or .gz files.
It's old, it works, it's everywhere. I mean, any project which has accepted the speed cost of Zlib would not transition to LZ4. And actually
the majority of the addressable use cases, let's say, use Zlib. And therefore, it makes LZ4 something
nice, a kind of a niche product for a niche use case. Not a small one, but still a niche.
But the bigger majority of data compression remains inaccessible to LZ4 because it's too light.
But back in his learning days, Yann had created a number of compression algorithms.
LZ4 was the fastest, but there were others.
I made several ones, actually, which would be quite competitive with Zlib.
Let's use one of them.
Now I understand the open sourcing process.
What does that mean? It's way
more than data compression.
In terms of proportion,
LZ4 is such a simple
thing that the proportion
of data compression algorithm itself,
that's like 10% of the effort, and everything
else is about open sourcing it
properly. And so I think it was a good
choice to start with that, because
there's so much to learn.
But now I have that, I understand open source, can I take one of these old algorithms that I made
many years before and bring it to the open source community as a kind of competitor to Zlib.
And so I would start to work on that, on this idea. And I'm fairly convinced that, yeah, it can work.
I can have a compressor which is at least as good as Zlib in terms of compression ratio, but much faster.
And I was already convinced that a lot of users would like that because they were looking for more speed.
They just were not willing to accept losing compression ratio in the process.
So the path to something better than Zlib seems clear.
LZ4 does half of what Zlib does, but really fast.
And the missing part is Huffman coding.
A Huffman code replaces more common symbols with shorter codes.
If I just replace the text co-recursive on the podcast website with CR
and then put CR equals co-recursive somewhere as a
legend and then replace Adam Gordon Bell with AGB and so on with all the most common words,
I can use fewer characters. That is something like a Huffman code. But if all Yann does is
re-implement this idea, it doesn't feel like enough. What Yann wants to do is something bigger.
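Strictly, Huffman coding works on symbols rather than strings: it builds a tree from symbol frequencies and hands the frequent symbols short bit codes. A toy C program showing the classic construction (an illustration, not anything from the episode):

```c
/* Toy Huffman construction: merge the two lightest nodes until one tree
   remains; a symbol's code length is its depth. Classic textbook
   frequencies; an illustration, not code from the episode. */
#include <stdio.h>

#define N 6

int main(void) {
    const char *sym = "ABCDEF";
    long freq[2 * N - 1] = {45, 13, 12, 16, 9, 5};  /* leaf weights */
    int parent[2 * N - 1];
    int used[2 * N - 1] = {0};
    int nodes = N;

    /* Repeatedly merge the two lightest unused nodes (O(n^2) is fine here). */
    while (nodes < 2 * N - 1) {
        int lo1 = -1, lo2 = -1;
        for (int i = 0; i < nodes; i++) {
            if (used[i]) continue;
            if (lo1 < 0 || freq[i] < freq[lo1]) { lo2 = lo1; lo1 = i; }
            else if (lo2 < 0 || freq[i] < freq[lo2]) { lo2 = i; }
        }
        used[lo1] = used[lo2] = 1;
        parent[lo1] = parent[lo2] = nodes;
        freq[nodes] = freq[lo1] + freq[lo2];
        nodes++;
    }

    /* Code length of a symbol = its depth = parent hops up to the root. */
    long totalBits = 0, totalFreq = 0;
    for (int i = 0; i < N; i++) {
        int len = 0;
        for (int j = i; j != 2 * N - 2; j = parent[j]) len++;
        printf("%c: weight %2ld -> %d-bit code\n", sym[i], freq[i], len);
        totalBits += freq[i] * len;
        totalFreq += freq[i];
    }
    printf("%ld bits vs %ld bits at a flat 3 bits per symbol\n",
           totalBits, totalFreq * 3);
    return 0;
}
```

With these weights it prints 224 bits against 300 for a fixed-width code, which is the whole point: frequency-skewed data gets shorter output.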
He wants to fulfill this goal that he had when he was a kid
working on his calculator game.
The sentence, when I grow up, I will be an inventor,
I think became more and more something in my mind.
I really admire the people who can invent stuff,
and that feels magical because we just get the final product of many, many years
of iteration of efforts.
And that's what I wanted to do.
And so Yann starts work on his new entropy encoder,
pulling in ideas from his earlier research,
incorporating things he's learned about performance.
Because he needs speed and compression ratio, that's his focus.
And he does this in the open on GitHub
in his finite state entropy repo.
And so I'm releasing Finite State Entropy in 2014,
I think.
And it works.
It's actually even faster than Huffman.
And it has the compression ratio of arithmetic coding.
And that's kind of a landmark now
because, I mean, a new entropy coder,
this hasn't happened in 30 years or so, even 40.
It was a long time ago.
So that part satisfies, I would say, the inventor in me.
I created something really new.
And it's not just new.
It's actually very efficient.
And I can use it for my project of bringing a competitor to Zlib.
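FSE is Yann's table-based take on the ANS family of entropy coders introduced by Jarek Duda. The core idea fits in a few lines: the whole message lives in one integer, and frequent symbols grow it less. A toy of the rANS variant, skipping the renormalization a real codec needs (an illustration of the ANS idea, not FSE itself):

```c
/* Toy rANS: frequencies sum to M; encoding pushes a symbol onto a single
   integer state, decoding pops it back. No renormalization, so this only
   works for short messages; an illustration, not FSE. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define M 8
static const int freq[3] = {4, 2, 2};   /* 'a', 'b', 'c' */
static const int cum[3]  = {0, 4, 6};   /* cumulative frequencies */

int main(void) {
    const char *msg = "abacabaa";
    size_t n = strlen(msg);
    uint64_t x = 0;

    /* Encode in reverse: ANS is last-in, first-out. */
    for (size_t i = n; i > 0; i--) {
        int s = msg[i - 1] - 'a';
        x = (x / freq[s]) * M + cum[s] + (x % freq[s]);
    }
    printf("encoded state: %llu\n", (unsigned long long)x);

    /* Decode forward: x mod M identifies the symbol, then shrink x.
       We decode exactly n symbols, so no terminator is needed. */
    for (size_t i = 0; i < n; i++) {
        uint64_t slot = x % M;
        int s = (slot < 4) ? 0 : (slot < 6) ? 1 : 2;
        putchar('a' + s);
        x = (uint64_t)freq[s] * (x / M) + slot - cum[s];
    }
    putchar('\n');                      /* prints "abacabaa" again */
    return 0;
}
```

Each symbol grows the state by roughly log2(M / freq[s]) bits, which is its entropy, yet decoding is a divide, a modulo, and a table lookup per symbol. That combination is roughly what "Huffman speed with arithmetic-coding ratio" refers to.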
And so all these ideas start to get together. And since I'm doing that in the open as open source, everybody sees that. It's kind of the obvious next
move. And so that's how we reach the end of 2014 and I decided to release the
first version of ZStandard as a technological demo.
And it brings what it's supposed to, which means compression ratio similar to Zlib, but way better speed.
In a simple benchmark, Yann's project compresses a bit better than Zlib.
But three times faster. Three times faster.
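A comparison like that is easy to reproduce against today's released library. A minimal round trip with the real zstd C API (zstd.h), error handling mostly trimmed:

```c
/* Minimal zstd round trip using the real C API (zstd.h). The input string
   is a stand-in for real data; error handling mostly trimmed. */
#include <stdio.h>
#include <string.h>
#include <zstd.h>

int main(void) {
    const char src[] = "a log line, a log line, a log line, a log line";
    size_t srcSize = sizeof(src);
    char comp[512], back[512];  /* ZSTD_compressBound(srcSize) gives a safe size */

    size_t cSize = ZSTD_compress(comp, sizeof(comp), src, srcSize, 3);
    if (ZSTD_isError(cSize)) return 1;

    size_t dSize = ZSTD_decompress(back, sizeof(back), comp, cSize);
    printf("%zu -> %zu bytes, round-trip %s\n", srcSize, cSize,
           (dSize == srcSize && memcmp(src, back, srcSize) == 0) ? "ok" : "FAILED");
    return 0;
}
```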
He calls it ZStandard, and now he has people's attention. Because compression is happening everywhere that computers are involved. But especially at data centers,
where the big companies are hosting all their servers. Hence Google being initially ahead of
him. But this three times improvement, you know, if it can be deployed, it just gave every cloud
provider a way to save lots of servers,
a way to save lots of money.
And so now Yann starts becoming popular.
Everyone's reaching out to him.
So I would say every major player in the Silicon Valley that you can think of.
Google, Facebook, Apple, Microsoft, Amazon. Yeah, yeah.
Not Amazon.
And also other players which are less known, smaller companies.
Also at this time, the HBO show Silicon Valley had just come out.
It's a great show and it's literally about somebody discovering a better compression algorithm.
Some of my colleagues talked to me about that.
I was, what is this thing?
And yeah, they are talking about data compression.
And it seems to be a big theme in Silicon Valley.
So maybe it made data compression sexy.
I don't know.
Anyway, whether there is a consequence or not, in 2015, I ended up receiving really a lot of job offers all about data compression in the Silicon Valley.
And that's the moment to decide.
Basically, it's a one-time opportunity.
Either I play it safe, I keep my safe job in Paris and I try to continue doing
data compression on the side, or I go in fully and I fully invest my time in
data compression.
And the thing is, I also understand at that point that LZ4 was kind of a lucky
strike. It was possible because it was simple. But this one is not simple. This is a really
different level of complexity now. And hoping to do that on the side, working two or three
nights per week, that's not going to work. On top of that, I'm going to be a father. I'm waiting for my son.
And this is going to eat into this budget even more
to the point that probably there will be nothing left.
So the question is, do I do it?
I mean, it was clear that if I selected one,
I would have to give up the other.
If I was to stay in marketing,
I would have to really go all in on the politics game,
which leaves not a lot of time for programming, and the other side is even more clear.
So it was an important choice to do.
In the data compression versus project management debate, compression has at least one very big plus on its side.
Generally speaking, data compression, I think, is the only thing I did in my life which has been useful.
Everything else I tried to do was always for the benefit of some kind of middle manager
who wants to show off to his own manager, who wants to show off,
and end up being not useful, just abandoned or no impact.
Well, I've got a few counter examples, but they are far in between.
Data compression is really the first time I bring something to the world, and I know it's useful.
And so Yann, he heads to the Bay Area to start interviewing.
I was very clear that my intention was to bring ZStandard as an open source project,
fully open source, everyone can use, no string attached. And not every employer was thrilled about that.
Some wanted to keep that completely in-house.
That's Apple, I'm assuming.
I can't give name.
And really, Meta, which was called Facebook at the time,
was very clear that they do have a culture of open source.
They were fully aligned on this objective.
And this was of great importance to me.
That was the reason I was doing all this.
So I went with them, and I think it was a good choice.
So Yann takes the job at Meta, or Facebook then.
There's just one problem, though.
If there was one moment where this story could go wrong,
it's exactly this transition between marketing to programmer.
Because when you think about it, it's a pretty big transition.
Especially, I think I was 42 at the time.
So I had been an amateur for a few years,
going straight into the big league and Silicon Valley area.
That sounds like a stretch.
So there's no guarantee that it will end well.
Facebook has a bootcamp process for developers joining the team.
Yann doesn't have his visa situation worked out yet,
but for the bootcamp, it doesn't matter.
Yeah, the bootcamp period, we do have a kind of almost a hotel.
It's a very small apartment that's not too far from the campus. And so,
yeah, I stayed in Redwood City, probably. Yann is used to working on Windows in Visual Studio,
but Meta is different. He needs to learn how to work on remote servers and how to develop on Linux.
So that's also a learning experience, how to use VI in order to code, in order to have your plugins.
At the end, it's a lot of small tools and small way of thinking.
None of them looks particularly terrible.
It's just that there are a lot of them.
And so it takes time to not just be blocked on the stupid thing.
After Yann figures out his environment, which does take him a little bit longer than most,
he has to take on some small tasks.
We have to reach out to teams
which have actual code in production.
And there is a small thing to do.
It's not complex for them.
But for a boot camper,
there is a ton of context to acquire.
Why are we even doing that?
I remember one of the easier ones was just a matter of
processing a file in Python. You say, what's the problem with that? Well, the initial code I had
to fix was just loading the file and processing it. But if the file is huge, then it costs too
much memory just to load and process. So I had to change that so that it would stream small amounts of data
for processing, so that the memory budget would stay low. It's a small exercise, but it's
actually a real software, really in production. So that means it goes through a whole validation
cycle that we get to discover firsthand. And that's pretty useful to do.
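The fix was in Python, but the pattern is language-independent: read and process a bounded chunk at a time, so memory stays flat no matter how large the file grows. Sketched in C to match the other examples here (hypothetical code, not the actual bootcamp task):

```c
/* Stream a file in fixed-size chunks so memory use stays constant.
   Hypothetical sketch of the pattern, not the actual bootcamp task. */
#include <stdio.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;

    unsigned char buf[CHUNK];
    size_t got, total = 0;

    /* Never hold more than CHUNK bytes at once. */
    while ((got = fread(buf, 1, CHUNK, f)) > 0) {
        total += got;               /* stand-in for the real per-chunk work */
    }
    fclose(f);
    printf("processed %zu bytes with a %d-byte buffer\n", total, CHUNK);
    return 0;
}
```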
So Yann gets through bootcamp and he starts the data compression team at Facebook.
He gets an intern and they start working away.
But Yann's having a hard time shaking the feeling that he's out of his depth.
I mean, even my first intern looked like a genius.
So I'm surrounded by genius people.
What am I doing here?
It takes time just to accept that, yeah, there's a place.
Because the point is not to be better than everybody else around at everything.
That doesn't make sense.
There's an infinite amount of things.
The point is to be useful at something valuable so that we meet our peers, we can help them concretely.
And that demands knowledge, but also ability to learn,
to evolve. As Yann works to figure out how to work at Facebook, as he works to build ZStandard
from a demo into a real library, he also has a son. And so his life is changing really fast.
And then out of nowhere, a competitor appears: Brotli, from Google. It's in the same category as Zlib, but it's more focused on the web, and therefore it's more, once again, it's more focused on compression ratio.
But it's not bad on speed.
So, but Brotli was in the making for several years.
I had started only a few months before.
They were a team.
I was alone.
It felt like before even reaching my goal, it was already lost.
But Yann persisted, and he got help from his compression forum colleagues as well,
including Przemysław Skibiński.
He was really knowledgeable about data compression, and he wanted to help.
He wanted to contribute.
And since it's an open source project, everybody can contribute.
So he was contributing more and more.
And by the end of the year, we had this idea with my manager at Facebook.
Why don't we hire him so that he can fully contribute?
And so he receives a contract and he can work on the projects throughout 2016.
And so that's why he's one of the co-developers of ZStandard.
And his help was very welcome because I was kind of alone.
So I was a bit stretched.
Then, in 2016, at Facebook's @Scale conference, ZStandard was officially released.
Before it was on my own repository on GitHub, it was considered like a personal project I was authorized to work on.
At that point, it became an official Facebook project.
It's a great event.
First time I do that.
So the event itself, I don't know what I can say.
I think it went well.
But what was probably more important is the reception from the community.
A lot of people were waiting for this 1.0.
One of the metrics would show it, which is the number of stars on GitHub,
which got catapulted the day of the official release.
In the final comparison, ZStandard beat Zlib in every way.
At the same compression ratio, it compresses 3 to 5 times faster.
At the same compression speed, it results in files that are 10 to 15% smaller.
And besides all this, it can decompress twice as fast,
regardless of compression speed. The only thing even close to ZStandard is Brotli from Google.
So I mentioned before that Brotli was ready one year before us. They should have reached the
market before us. But they developed in C++. And I made the decision to develop in C because
I knew from LZ4 that this is a universal ABI that you can connect to from anywhere. And now you can
have Python, you can have Rust, and after, every language will connect to the C library. And also,
we have very good control over our memory allocation, which matters for embedded, for kernel.
And the Brotli people understood that, but a bit too late.
So they start by releasing something C++ and then they convert it back to C
because they understood the importance.
And it took them a whole year to get that done.
And that's part of something you cannot guess.
I mean, if you're not exposed to open source deployment.
So the LZ4 work of understanding open source and easy adoption paid off for Yann.
It only took months for ZStandard to spread.
I would say by January, most Linux distributions have ZStandard in their
package repository.
And of course, internal Facebook
products use it, and more
and more open source projects use it.
The adoption is really very fast.
And that's where I understand that
the knowledge,
the open source process
and its restrictions I
learned while doing LZ4,
that's what gets re-employed in ZStandard
and what led to such a fast adoption.
If ZStandard is three to five times faster,
then that's three to five times fewer CPUs
spent on compression everywhere that made the switch.
And like LZ4, ZStandard also opened up new frontiers.
It has dictionary compression,
which allows ZStandard to compress things that are really small
using aggregate statistical patterns.
This is a big deal at places like Meta.
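The mechanism: train a shared dictionary on many small, similar records, then compress each tiny record against it, so even a 30-byte message benefits from patterns seen across millions of its siblings. With the real zstd API (zdict.h and zstd.h), the shape of it is roughly this (toy sample data; real training wants hundreds of samples):

```c
/* Dictionary compression sketch with the real zstd API (zdict.h, zstd.h):
   train on many small samples, then compress tiny records against the
   dictionary. Toy data; real training wants hundreds of samples. */
#include <stdio.h>
#include <zstd.h>
#include <zdict.h>

int main(void) {
    /* Small, similar records concatenated, with their individual sizes. */
    const char samples[] =
        "{\"user\":1,\"action\":\"like\"}{\"user\":2,\"action\":\"like\"}"
        "{\"user\":3,\"action\":\"post\"}{\"user\":4,\"action\":\"like\"}";
    size_t sampleSizes[4] = {26, 26, 26, 26};

    char dict[1024];
    size_t dictSize = ZDICT_trainFromBuffer(dict, sizeof(dict),
                                            samples, sampleSizes, 4);
    if (ZDICT_isError(dictSize)) return 1;  /* 4 toy samples may be refused */

    const char record[] = "{\"user\":9,\"action\":\"like\"}";
    char comp[256];
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    size_t cSize = ZSTD_compress_usingDict(cctx, comp, sizeof(comp),
                                           record, sizeof(record),
                                           dict, dictSize, 3);
    if (!ZSTD_isError(cSize))
        printf("record: %zu -> %zu bytes with dictionary\n",
               sizeof(record), cSize);
    /* Decompression uses ZSTD_decompress_usingDict with the same dict. */
    ZSTD_freeCCtx(cctx);
    return 0;
}
```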
And that's not even counting secondary impacts.
Internally at Meta, the impact has been massive.
So typically we do have caches in the system
so that data can be delivered faster because it's closer.
But that, of course, you need equipment,
and that costs money.
So if you can compress more in these caches,
you actually have more data closer to the customer.
And there are two benefits to that.
First, the customer is served faster,
so its experience is smoother.
So it drives engagement.
People would use the application more because it's more pleasant.
And on the other side, there are a ton of events in the network
which do not need to happen because we do not have to go deep
into the database and storage system because we don't need to.
It's right there.
And cumulatively, these two effects, they are way more than the saving itself from the
data compression.
So that shows that, yes, there are primary savings, but sometimes the secondary effect
is really important too and worth measuring.
Secondary, primary, tertiary, however you measure it, the impact has been huge.
Let's go back to this idea that data compression can be invisible and still have a fairly big
impact, both on storage and transmission, because we also send more data.
So once something is used almost everywhere, the amount of data processed is staggering.
These are numbers we are not used to.
There are so many zeros,
we don't talk in this kind of range normally. That's insane.
This is the wildest story I've ever heard in terms of impact. A marketing professional,
a project manager, a couple evenings a week after he bikes home from his job in Paris,
starts tweaking a calculator game.
And by the end of it, Yann has shifted a whole industry's approach to data compression and saved billions and billions of dollars.
So the obvious question I have for Yann is how?
How can you have that much impact in your career, in your hobby?
So I would say the first advice here is don't do that for the success.
Success is too random and too far away.
If someone targets success,
they will lose stamina way before reaching that point.
So do something because you like it.
That's the inner force which will drive you beyond,
I would say, the normal investments that almost everybody
can also do.
I think that's a very important one.
So now if you're interested in the domain, keep going at it.
It's really a small effort regularly in the same direction, that brings you very far. I understood that from
my mother, who had a, how do you call that, an accident. And because of a brain injury,
she couldn't move much, especially her hands were affected.
But despite her difficulty controlling her hands, she decides that she should write a book.
When you do have a problem with hands, that's slow, very slow, painfully slow.
But incredibly enough, in the matter of a few years, she managed to write, publish, sell.
I don't remember the exact number.
I think it's about four books.
And they are not small ones.
That's something.
She did that because she was focused on that. And of course, it wasn't the only thing she did in her
life, but almost every day she would make an effort in the same direction, and that would make the book
closer to the finish line. So that's, I think, a pretty big lesson here. Just keep at it. Learn something new every day.
And after some time, actually, it's a lot.
It's way more than average.
That's what makes you an expert in the field.
So you can't do that effort if you don't like what you're doing.
If it's just for the money, that's too long, really.
And that's why it's important to like it.
Thank you, Yann, for sharing your story. You can find Yann on Twitter as Cyan4973 and also on GitHub and in code forums.
And thank you to Chip Turner, who hired Yann at Meta and set up the data compression team and also reached out to me and said, hey, Yann has this crazy story. You should
talk to him. If you know somebody with a story like that, yeah, send me an email, send me a note.
Also, thanks to Meta for your belief in open source and being cool about having Yann talk to
me about his experiences. If you like this podcast, if you want to help me out,
the best thing you can do is just tell other people about it.
I know it takes me a while to get out each episode,
but if you want more content from me,
you should check out my newsletter where I cover similar topics
and also follow me on Twitter at Adam Gordon Bell,
where I often share behind the scenes details of the podcast.
But really to get more content from me, the best thing you can
do is join as a supporter. Go to corecursive.com slash supporters. It's in the show notes as well.
And as a supporter, you get access to more episodes. I put out a bonus episode each month
as well. Last month's bonus episode, bonus 17, was inspired by Yann's journey. I recorded it
not long after talking to him. And it's about how you accomplish hard things over time. About two
keys that Yann highlights that also were impactful in my life. And how I think about building up
expertise or accomplishing something big over time. If you're a supporter,
check it out and let me know what you think. And until next time, and I always say this very
sincerely, thank you for listening.