CoRecursive: Coding Stories - Story: From Project Management to Data Compression Innovator

Episode Date: May 2, 2023

How do you accomplish something massive over time? I've had the chance to meet with a number of exceptional software developers and it's something I always wonder about. Today, I might have an answer with the incredible story of Yann Collet. Yann was a project manager who went from being burnt out on corporate life to becoming one of the most sought-after developers in the world. What happens when you build something so impressive and valuable that it essentially becomes invisible? And how do you do that when your day job is mainly organizing spreadsheets and keeping timelines on track? Yann built LZ4 and ZStandard - two of the world's fastest compression algorithms that have transformed databases, operating systems, file systems, and much more. We'll go back in time to Yann's initial steps with programming, his game-changing discoveries along the way and how his devotion to a data compression hobby led him to create something that saves billions of dollars worldwide.

Transcript
Starting point is 00:00:00 Hi, this is CoRecursive, and I'm Adam Gordon Bell. Each episode is the story of a piece of software being built. How do you accomplish something massive over time? I've had the chance to meet a number of just amazing software developers, and it's something I always wonder about, like how did they do that? Well today, maybe I have an answer. I have the incredible story of Yann Collet. Yann was a project manager who went from being burnt out on corporate life to becoming one of the most sought after developers in the world. What happens when you build something so impressive
Starting point is 00:00:40 and so valuable that it spreads everywhere and becomes almost invisible? And how do you do that when your day-to-day job is mainly organizing spreadsheets? Yann built LZ4 and ZStandard, two of the world's fastest compression algorithms, and they've transformed databases and operating systems and file systems and much more. They're everywhere. But we'll go back in time to Yann's initial steps with programming, his game-changing discoveries along the way, and how his devotion to data compression as a hobby led him to create something that saves billions of dollars and changes everything. This is an unforgettable career of passion that starts very simply as a hobby. And it starts in Paris in the 1990s. I had decided back then that if I had to live in Paris, I would live in Paris and not around
Starting point is 00:01:35 it. Because then you have only the negative side and none of the plus side. But another aspect of it is that if you live in Paris, you better not have a car. So I don't drive. I actually use my bike to go to work. Yann was working as a project manager at a tech company that contracted with the government of France. And let's say the product I was in charge of was not working well to the point that I started to develop some script. I wasn't a programmer, but I could do some simple stuff just to help the product just pretend that it works. Which obviously wasn't supposed to be my job.
Starting point is 00:02:18 But OK, so it gives me a better understanding of what's going wrong. And at some point, I go to see my boss and say, this product is so bad, we would better rewrite it entirely from scratch. My only experience at that point is that I've been able to put some script. And when I say some script, really make it clear, there's just a few batch files. That's what it is. So that's really nothing.
Starting point is 00:02:43 But it gives me a little bit of an idea of what programming is about. The project was a military radio communication system. So the point is, you can communicate with voice, but it's awfully inconvenient to the point that you need to set it up in advance using some huge configuration files that needs to be exactly correct. Otherwise, it crashes. And of course, everything must be static. So if there is any node in the network that is not there, it's going to crash. I mean, this is a battlefield.
Starting point is 00:03:19 You expect things to remain static? That's nonsense. So my goal is to have something which is fully dynamic, which essentially discover everything when it connects, authenticate and discover its neighbor and try to establish relations. And it's actually fairly complex. But the point is it makes me work directly with software engineers now. And because of that, my product description is more than
Starting point is 00:03:46 an idea on a slide. It's actually a specification. It goes into great detail as to how it's supposed to work. This made Yann nervous. He's not done this before and he's just not sure if he's doing it right or not. Let's just discuss with programmers. Do they understand what I ask them to do? How do they even accept the idea? And to my surprise, they take the spec and say, oh yes, of course we can do that. And it starts very quickly actually. In a matter of two months, we have a working prototype, and then two more months and we
Starting point is 00:04:19 have a decent early product. So it's actually fairly quick. This was a surprise because Yann was pretty certain he couldn't actually contribute to product development. I was always convinced this is not for me.
Starting point is 00:04:33 It's too complex. Before that, I've been exposed to technology, but always on the user side, not as a creator of technology. And I was sure I couldn't do it. But this experience proved to me that actually, yes, that I can do it. The project failed for unrelated political reasons,
Starting point is 00:04:55 but it meant something to Yann. I've got direct relation with software engineers, and it's a pretty good relation. We actually get along very well, probably because I'm basically the only guy in the marketing department to try to get close to their language, to express it in ways that they can implement. And we have a kind of feedback model where I get close to them and we course correct anything that goes wrong. This model gave me a form of friendship, I would say.
Starting point is 00:05:34 In a more merit-based world, maybe Yann would transfer to product development and find his calling there and live happily ever after. But in this world, he moves on to other projects, many of which don't use his skills very well. And despite him working hard and really caring a lot, a lot of these projects just get canceled. The years slowly pass by until one day something changes. I would say it's a moment in my life where I kind of change my aspirations. I was 35. In my twenties, I've been really working a lot, chasing the professional success and so on.
Starting point is 00:06:07 And by my 30s, I had accumulated quite a lot of scars. I was no longer convinced that working a lot was really a great future. Basically, Yann was burnt out. It resulted in an extremely reduced ambition. Basically, just be a nice guy, have a good life. That's good enough. There is no point in doing more than that. Generally speaking, I think it's a good advice
Starting point is 00:06:35 to give to anyone because it considerably reduces the kind of pressure that one can put upon himself about the need, if not the requirement, to absolutely succeed. So I kind of changed all. I'm thinking, okay, I will try first to get a stable job,
Starting point is 00:06:54 which I managed to get. And from there, I discover, oh, but now I've got a ton of time. Since I'm no longer working, I've got a ton of time so I can actually develop other activities just to enjoy the time. So initially I play a bit of video games, but it feels hollow after some time. My next occupation would be, let's learn about history. I always love the stories. So I tried, I'm starting to learn a lot about history actually. But after a few years, I've got this feeling that I'm reading again the same story or a different version of the same story. So I'm not learning more or not enough.
Starting point is 00:07:29 Also during the day, Yann's now working as a project manager. I'm organizing big projects across the world. And so a lot of people depend on that. But I mean, organizing Excel spreadsheets, essentially, making sure that the product is at the place where it's supposed to, making sure that everybody's prepared, that the right teams are aware of. It's not exactly as thrilling as inventing a new product. So something is missing. And I think that's also why I go into, well, let's do some programming. After all, I had a nice experience with real programmers, and that will help me understand them better.
Starting point is 00:08:09 And that's how I started. There's lots of ways to start programming. You can get a book, you can do online tutorials. But Yann decided to head back to his roots, back to when he was in secondary school. So as a student, we had the right to have a graphic calculator. It was the early 90s. And the one I selected was HP48.
Starting point is 00:08:31 So that's a very specific calculator because the CPU is really unlike others. It's a 4-bit CPU, but it has a very large register to be able to have big numbers. Essentially, it's a graphics calculator. You're supposed to draw curve equations and curves out. But of course, if it is a programming device, that means you can try to make it run Doom or anything like that. So it turns out that although he didn't think he could cut it as a professional programmer,
Starting point is 00:09:01 Yann actually had programmed in the past for this HP calculator. It was no Doom, but he had built a game called Fantasy Conquest. So it's the idea of having a band of marauders, essentially. And they roam around the country and they try to become the biggest band of marauders around. It's not an intelligent game. It was kind of working enough for some players to enjoy it, but it was never great. And I had really no time to really go into detail because that's also student time. We also have some studies to do.
Starting point is 00:09:36 So most of the time is actually spent there trying to get a good engineering school. But back then when he did work on the game, he remembered it being fun, or at least something he enjoyed. This HP calculator, it had a 131 by 64 screen, so not a lot of pixels. But it was enough to build a game. The user experience, though, that was a bit of a challenge. In order to play the game, you need to have exactly the same calculator, an HP 48. And then we could download the games through either infrared connection directly,
Starting point is 00:10:10 or more complex was to get a cable to a PC. And back then that wasn't common at all. So infrared was used a lot. And the thing is, the game was large. It would take up to 50 kilobytes. And that was considered very large by that time. So you needed a memory expansion to actually even be able to load that. And that was one of the problems.
Starting point is 00:10:33 This game was too heavy, and it deserved to be optimized a bit. And so that's where I started to think about compression. So in the evenings, after history has lost its luster, Yann starts picking up this game, starts improving it, starts programming it. It becomes his new hobby. I'm not going to program every day. It still has to be a hobby. So I've got a number of activities, but when something is interesting,
Starting point is 00:11:01 you come back to it regularly. So that's exactly what I do. I would program around this calculator for many months. I wouldn't say every evening, but very frequently. And over time, skill level would simply improve just as a function of practice. All I wanted to do is to have some fun and finishing this old game I never finished. That's more or less all I wanted. And yeah, I never planned anything out of this.
Starting point is 00:11:33 So it's kind of a surprise that one thing led to another. This game did lead to so much, and it ends up taking Yann to the other side of the world eventually. But immediately, it helped him with something maybe more important. How can I explain that? Because it's difficult to. But I would say when you have a job, which is clearly just a job to live, I think it's really important to have a side activity. So this side activity doesn't have to be something useful or that brings money.
Starting point is 00:12:03 It's just a joyful side of thing. It could have been dancing. I've been doing dancing to race away. It could have been anything that brings you joy in your daily life. I don't want to give the feeling that I was doing only programming data compression in society. This was one of the things I was doing. But yeah, I think it's very important to have activities outside of the job.
Starting point is 00:12:24 I would even call it compulsory if you want a fulfilled life. And so fulfill him, this game does. There's a couple ways to program on an HP calculator, but to compress a game so that it doesn't require the memory expansion, there's really only one way. You have to program it in assembly. And so I start. The first iteration, of course, are not great,
Starting point is 00:12:47 but they are all learning steps. Every small iteration gives something, a better compression ratio, better speed, better memory usage, something like that. After some time, I've got something which I can be proud of, which is essentially a very fast decompressor. So the idea is that the game would decompress itself on the fly.
Starting point is 00:13:07 It's actually composed of multiple small modules. And every time a module is called, it gets into RAM, gets decompressed. And that helps to reduce the size of the game by 30% approximately. And when we think about it, 30%, that's all, it's not that big. But for some reason, I'm proud of it or something like that. I've invented something.
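The on-the-fly scheme described here, where each module stays compressed until it is called, can be sketched in a few lines. This is a minimal illustration only: it uses Python and the standard `zlib` module as a stand-in for the hand-written HP 48 assembly compressor, and the `ModuleStore` class and the module names are hypothetical.

```python
import zlib


class ModuleStore:
    """Keeps modules compressed and inflates one only when it is called."""

    def __init__(self, modules):
        # modules: dict mapping a (hypothetical) module name to its raw bytes.
        self._packed = {name: zlib.compress(data, 9) for name, data in modules.items()}
        self._loaded = {}  # modules currently decompressed in "RAM"

    def packed_size(self):
        # Total size of everything as stored in compressed form.
        return sum(len(blob) for blob in self._packed.values())

    def call(self, name):
        # Decompress on first use, then reuse the in-memory copy.
        if name not in self._loaded:
            self._loaded[name] = zlib.decompress(self._packed[name])
        return self._loaded[name]

    def evict(self, name):
        # Free the decompressed copy; the compressed one stays available.
        self._loaded.pop(name, None)


# Made-up, highly repetitive module contents, purely for illustration.
modules = {
    "map": b"grass grass grass water grass grass " * 40,
    "combat": b"attack defend retreat attack defend " * 40,
}
store = ModuleStore(modules)
raw_size = sum(len(data) for data in modules.values())
print(raw_size, store.packed_size())
```

The trade-off is the one described in the interview: storage shrinks, and the price is a decompression step each time a module is brought into memory, which is why the speed of the decompressor matters so much.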
Starting point is 00:13:35 Plus, data compression, that's clearly something that always felt magical to me. So that's the starting point. Before I knew it, I'm realizing that I'm only working on the data compressor. I'm not touching the game anymore. I'm always, always trying to improve the data compression side. And after that, once again, there is no plan. It's just a long journey where every step I learn something new and I find that interesting. Data compression is a complex field to learn, though.
Starting point is 00:14:01 So initially, Yann's approach is to just try things out and discover his own way. How can I search faster? Can I encode that in a way which is less wasteful? And so on. This is insanely slow, obviously, but it works. In many cases I find a great idea that I'm almost proud of, and it doesn't take long before I understand that this idea is common for programmers. I just didn't know it. So, for example, I invent a hashing function in order to search faster. And it doesn't take long before I understand, well, that's common.
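The hashing idea mentioned here is commonly used to find repeated byte sequences without scanning everything seen so far. A minimal sketch, assuming a 4-byte minimum match and a table that remembers only the most recent position of each sequence; `find_matches` is a hypothetical helper, not code from Yann's compressor.

```python
def find_matches(data, min_match=4):
    """Return (position, earlier_position, length) for repeats found via a hash table."""
    table = {}    # maps a min_match-byte sequence to the last position where it was seen
    matches = []
    i = 0
    while i + min_match <= len(data):
        key = data[i:i + min_match]
        candidate = table.get(key)
        table[key] = i
        if candidate is None:
            i += 1
            continue
        # Extend the match for as long as the bytes keep agreeing.
        length = min_match
        while i + length < len(data) and data[candidate + length] == data[i + length]:
            length += 1
        matches.append((i, candidate, length))
        i += length
    return matches


print(find_matches(b"abcdefgh-abcdefgh"))  # → [(9, 0, 8)]
```

A single table lookup replaces a backwards scan through the input. The cost is that only one candidate per sequence is remembered, so a longer match elsewhere can be missed.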
Starting point is 00:14:36 Everybody does that. So it's really a journey and a ton of fun discovery. So Yann makes a standalone compressor for the HP 48S. At this point though, the scene for the HP 48 calculator is pretty much dead, I'd say.
Starting point is 00:14:55 So there's not a lot of spectators to see the work. But I would nonetheless continue to develop it. I would start even to develop different variants, some stronger than others. But the real key selling point of the main algorithm I would employ is its speed.
Starting point is 00:15:13 It reached an extremely high speed of 80 kilobytes per second, which obviously nowadays looks like shit. But these calculators, they don't have that much RAM, so it's still fast for them. And after some time, a good year, I would say, I still have this feeling that, okay, I'm developing something. It's interesting. I like it.
Starting point is 00:15:32 I find it interesting. But there is almost no one to enjoy it because the scene is dead. So my next step here in this journey is to say, okay, let's go to the PC scene. It's the one which is active, 2009 approximately. The PC scene is active, but they don't have practice at squeezing performance
Starting point is 00:15:52 out of old underpowered CPUs like Yann does. I've read in newspapers that programmers only produce bloatware, which are worse and worse every year. I would have a definitive competitive advantage there.
Starting point is 00:16:06 So I'm pretty sure of myself. And I start developing on the PC side. And now I have to learn C because you don't write a Saturn assembly. So I have to write C. And yes, very quickly, I understand that. No, no, I'm pretty far behind. Very, very far behind.
Starting point is 00:16:26 The compression libraries available on Windows, they turn out to be way faster than expected. But Yann finds a way to catch up. He finds this online forum where people interested in compression gather. It's a vBulletin forum, and to register you have to answer questions about compression, like
Starting point is 00:16:41 who created WinRAR. Eugene Roshal, by the way. And this place becomes Yann's watering hole. So I was not alone. And that's fairly important, I think. But more importantly, I think it gave me a frame of reference. I could compare, I could get evaluated. And so there was a sense of belonging to a tribe of peers. And I think it matters because it's difficult to sustain such a long effort, multi-years effort, with no such context at all. Also, within this community, it's easy to evaluate each other's work. Every once in a while, someone would come and say,
Starting point is 00:17:19 hey, I invented this. And some people would test it and would say, oh, it's good, or it's great, or it's not. And there would be no shortage of people who, like me, are interested in data compression and would test the program. So, oh, that's a perfect ground. Then I learned that, yes, data compression is not limited to WinRAR and WinZip. You have actually hundreds of possibilities out there. And some of them are in my category, which means simple, light, and fast. And now I've got something to compare to. So once I finally understand how to program in C, that's why I do my comparison.
Starting point is 00:17:56 And yeah, that's why I quickly understand I'm behind, way behind. There are a decent number of people out there which can develop very fast compression algorithms, way better, way more efficient. So there is a learning curve there. But I think the critical part that I learned from the HP 48 experience is just
Starting point is 00:18:17 the will, the will to learn. There's this kind of blind trust that there is something interesting to learn. So let's dig a bit more. This will to dig in is one of the keys to Yann's eventual success. But also remember, this is just a hobby for Yann, like learning history. And so he's not in a rush.
Starting point is 00:18:38 He wants to understand each step along the way, and he takes his time. I think it's fair to say that in the data compression community, most researchers were interested in data compression, really best ratio. And speed, yes, as a side effect. Let's make it not too bad. That really comes second. And my mind was really focused on, I want great speed and without sacrificing speed, I want to get better compression ratio.
Starting point is 00:19:06 So it really changed the perspective of what matters. And it doesn't take too long before I got something competitive. And at some point, also by chance, I have an algorithm which seems to be the fastest around. So that's LZ4. I think it's fast because it's simple, and also for other properties of the CPU that I wasn't aware of at the time. So I just realized, oh, this thing can decompress at
Starting point is 00:19:34 one gigabyte per second. That's crazy. I wouldn't have expected that. And that's it. I do not continue on that. It's like a one step in this learning experience. So Yann has built the fastest algorithm to compress at this level of compression that he's seen. And then he just moves on. It's just a hobby, right? And nobody cares about speed like he does. Because the thing is, LZ4 is fast because it doesn't do much. It does less compression than Deflate, the algorithm that's used by zip files and gz files. And because it does less, it's very fast,
Starting point is 00:20:09 but once it's built, there's not a lot to learn. It doesn't do a lot. And so Yann just moves on to more advanced compression topics. I learn about Huffman, arithmetic coding, reduced offset, context mixing, things which become really, really complex. But none of this stuff is actually usable. It's really just for the learning experience.
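Part of why LZ4 decodes so quickly is that the block format asks very little of the decoder: copy some literal bytes, then copy a run of bytes that was already produced. The sketch below decodes a single LZ4 block in Python, assuming raw block data with no frame header or checksum. The function name `lz4_block_decompress` and the hand-built input are illustrative; the real library is written in C and includes optimizations and safety checks omitted here, and the sample block ignores the end-of-block rules a real compressor follows.

```python
def lz4_block_decompress(block):
    """Decode one raw LZ4 block (no frame header, no checksum)."""
    out = bytearray()
    i = 0
    while i < len(block):
        token = block[i]
        i += 1
        # High 4 bits: number of literal bytes to copy verbatim.
        literal_len = token >> 4
        if literal_len == 15:
            # 15 means "more length bytes follow"; stop at the first byte below 255.
            while True:
                extra = block[i]
                i += 1
                literal_len += extra
                if extra != 255:
                    break
        out += block[i:i + literal_len]
        i += literal_len
        if i >= len(block):
            break  # the final sequence carries literals only
        # Two-byte little-endian offset pointing back into the output.
        offset = block[i] | (block[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")
        # Low 4 bits: match length, stored minus the minimum match of 4.
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = block[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4
        # Copy byte by byte so an overlapping match repeats recent output.
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)


# Hand-built block: literals "abcd", a match (offset 4, length 8), then literal "e".
block = bytes([0x44]) + b"abcd" + bytes([0x04, 0x00]) + bytes([0x10]) + b"e"
print(lz4_block_decompress(block))  # → b'abcdabcdabcde'
```

There is no bit-level entropy decoding step, unlike Deflate with its Huffman codes: everything is byte-aligned, so the inner loop is mostly memory copies.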
Starting point is 00:20:34 I'm just following other people who have been developing the same things before me. So I'm not doing something really special, but I learn. And at some point, I stopped making progresses. I would say that's a moment where I've learned whatever was easy to learn, whatever takes a week or two to learn. And that's a moment where I have to decide, what do I do now? Do I move on? Like I moved on from learning history to programming. Do I move on to something else or do I double down? And so I decide to come back to this first compressor, LZ4. And I'm thinking, okay, I've got something here.
Starting point is 00:21:12 This one is actually above its peers. So there is something there. But I achieved that outcome almost by chance. Let's understand. Let's go deeper. Why? Why is that? And how can I improve it? Making the first version of LZ4 was relatively easy, but understanding why it's fast and how to make it even faster, well, that's a whole different level of complexity. Instead of making huge progress every week, it would take months to make small progress. But that's also the small progresses which are actually difficult to get. After that, I'm starting to understand why it's fast, how to make it faster, and also how to make it stronger.
Starting point is 00:21:54 So now I'm getting more into how do I search efficiently, how do I combine, how do I parse data. These are all fairly complex topics that I wouldn't care spending time on if it's just to learn quickly. But at that point, I'm more in this mentality, so I've got something which is a bit better. If I want to make it usable, I need to polish it. And so I take the time to polish it. And by the time I believe I've got something good, it has taken almost a year to understand all this.
Starting point is 00:22:31 LZ4 is now faster than every single compression algorithm that Yann can get his hands on. Yann knows that LZ4 may have tremendous impact on the world. Data compression is all about making things smaller and more efficient. And when you're dealing with vast amounts of data, even small gains in efficiency can add up to massive savings in time and money and resources. Something like LZ4, which can compress and decompress at lightning fast speeds, has potential to revolutionize the way that we store and manage data. It's that important.
Starting point is 00:23:00 But also, it's just some code on his computer. It's not a proper library. It doesn't have a license. It doesn't have users and nobody knows who he is. And so its impact could be nothing. It could be zero. The next step to change that is to open source it. And open sourcing is more complex than it sounds. Yann's never worked as a professional developer before, and he's trying to learn how to do things the right way. Then I understand new stuff, which is unrelated to data compression, such as never use global variable, make that a library that people can actually integrate, all things that I wasn't even aware of. And it takes me a few months to get that right. And at some point in 2011, I'm thinking, okay, I'm ready now to open source it.
Starting point is 00:23:50 And one week or two before I do that, Google open sources Snappy. And Snappy is basically exactly in the same category as LZ4. And it's actually, I think, a bit better than LZ4 at that time. So I'm kind of, what? Snappy gets a lot of attention. It turns out that inside Google, they had needed ways to compress data from Bigtable. And so they needed something that did less than Deflate, but did it much faster. This is very much Yann's approach, but they came from a different direction.
Starting point is 00:24:22 As soon as it gets out, you get a lot of articles over internet, which talk about this radically new compression, which is so fast. And a lot of projects which get interested. Basically, all databases start to say, oh, but that's what we need because we need speed. So it's an instant success. What did, like, did you feel upset? Upset?
Starting point is 00:24:39 No. But maybe some form of stress. I'm not sure. I feel as if I was in a kind of increased activity mode in my brain. Okay, there is something happening now. What I learned from that is that my algorithm is actually the only one able to keep up with Snappy. So that's still something. That's not bad. And I was convinced that I should be able to do better.
Starting point is 00:25:06 So I doubled down again. Now I'm focusing back on speed and make that faster, faster. I'm looking at Snappy and thinking this thing is a bit too complex for what it does. LZ4 is way simpler, so it should be faster. I believe that. I believe it should be faster, but I don't know exactly how. So I focused on that. And indeed, months after months, progress show up,
Starting point is 00:25:37 and 10% by 10%, LZ4 is actually faster than Snappy, and after six months, it's actually way faster. So at least it answers the first part, which is a bit, I would say, egocentric, like I can do it. You took on Google. Exactly. It's kind of crazy for someone who wasn't even a programmer. That's right. Through these six months of slowly finding speed improvements, Yann is still working as a project manager.
Starting point is 00:25:59 He's still biking to work in the morning, organizing spreadsheets, and then biking home. So some people around me are aware that I'm doing that as a hobby, but no one thinks much about it. I myself still have this excuse that thanks to this hobby, I'm actually a pretty efficient product marketing manager, and I have some good relation with my programming teams. I've got several ones, so I can put together a plan. I know how these things communicate between them. So that makes me confident in this role.
Starting point is 00:26:34 I think that's for the professional side, that's probably the best. And they feel it too. I mean, programmers also quickly understand that the marketing guy in front of them understands programming, but most of them are not aware that I'm doing that on the side. So it's not about talking about data compression. It's just about acquiring some culture of programming. And yeah, data compression is more like, I don't know, my hobby. Probably Yann is a good project manager, but in his hobby, which is very, very niche, he's slowly becoming one of the best in the world.
Starting point is 00:27:09 Maybe before internet, you just had to be good in your local neighborhood. So I don't say that in a very strict sense. Local neighborhoods can be the companies that work in the same field that you know about. But no, you need to be good at worldwide scale. That's insanely good. But in the same time, now you can be very good at something very niche that would be of no importance to anyone you know around. None of your friends, none of your family, no one cares.
Starting point is 00:27:43 But at the scale of the planet, there are actually a non-negligible amount of people who care about that. And now you can be very, very good at something very, very precise. I think it's a very big change over the last 20, 30 years. That change in the world, that change caused by the Internet, should mean that Yann is now getting all the attention in the compression field, right? LZ4 is faster than Snappy, and look at all the attention that Snappy got. Well, it turns out that that's not exactly the case. So sometimes, some people which like data compression, they would write an article
Starting point is 00:28:21 about comparing different compression algorithms. And so I try to be visible. And that's where I would say, hey, I've got something that you might be interested in. And every time that happens, it ends well for LZ4 because the comparison is pretty favorable if we only look at performance metrics. Now, on the potential user side, it's a very different story because Snappy, you've got Google attached to that.
Starting point is 00:28:47 So that's real work, professional work. And this random guy, no one knows about, okay, who knows? So I would say initially there is absolutely no traction at all. But there is one thing I've got that Snappy doesn't. During this time, I spent understanding how it works in a more, in a deeper way. I developed a variant called LZ4-HC, which is LZ4 high compression mode.
Starting point is 00:29:16 So this is a variant which is slower when it compresses, but it has a much better compression ratio. And the decompression speed is the same, which means it's extremely fast. In a scenario where you compress once and decompress many times, think assets in video games, for example, that's pretty useful.
Starting point is 00:29:38 And that's where I would essentially score my first wins, if you call it this way, my first users. Some indie video game developers realized that they could use LZ4-HC and get very good decompression speed out of it and a much better compression ratio than Snappy can offer because it doesn't have a high compression. The first game to use this format was for the PlayStation Portable, which
Starting point is 00:30:05 has a totally different CPU than the one on Yann's PC. Which also introduced me to the topic of portability. So it was a great learning experience. And then I learned that more and more games were starting to use it. And I don't know exactly when, but at some point, Unity decided to adopt it. And through Unity, you know, a lot of games, even today, which use LZ4, they are just not even aware of that. It's part of the default setting. I don't know. Did this feel like something?
Starting point is 00:30:36 You're like, I'm on to something here? I did something that's useful. That's a great feeling, honestly. At that point, that's all there is. I mean, I'm glad. What I don't realize is the kind of positive reinforcement loop it triggers. Because now this is a product which has been tested, has been shown to be better technically,
Starting point is 00:30:59 but is also used in commercial products. And so now, even in more serious activity, if you call it this way, I mean, not video games, but actually databases for other professional systems, now it becomes something that's worth considering.
Starting point is 00:31:14 And now, every time there is this competition, Snappy versus LZ4, and LZ4 wins every time. And after a few months, it starts to be known. And so it reverses. I think more and more projects use LZ4 first. It did not happen overnight,
Starting point is 00:31:32 but it's kind of a gradual improvement story and that takes time. The news takes time just to reach people. The first big non-gaming open-source project to use LZ4 was Hadoop. Which was supposed to be the competitor of Bigtable from Google. That was something that was kind of, wow, this is serious now. A bit later, I learned that it was being evaluated in the LHC, which is the Large Hadron Collider. So that's for research purposes in Switzerland, very big accelerator of particles, which has a massive amount of data to generate and deal with.
Starting point is 00:32:09 So they need something fast. So I said, wow, it's even useful for some kind of fundamental research. So yeah, that's where I started to see this is more than just a toy project. It started to be useful for a broader range of applications. One reason for its broad use and adoption is that when he first open-sourced LZ4, Jan spent time learning about how to best manage an open-source project.
Starting point is 00:32:37 So even what is a license? What is an open-source license? What are the choices? Why is this one? I had a pretty good discussion at the very beginning of the project with people from Unity about that. That was important because I think I would have gone for GPL. Instead, I went for BSD, and it totally changed the scope of, I would say, the addressable
Starting point is 00:32:57 market. It was a very important decision retrospectively, but at that time, I did not really understand it. So I kind of trusted these guys, who seemed to know what they were doing. And then of course I got some feedback. Another thing I learned, feedback is gold. Anytime someone comes with a problem, that's a real problem. It would be so easy to say, Hey, I don't have this problem on my computer. I don't care. Instead it's, Oh, you have a problem.
Starting point is 00:33:23 So it's worth solving. How many other people had the same problem and never told anyone about it? It actually matters. It does matter. Of course, we all know that. But if you're a Google dev, you probably have internal Google things to do. The external issues on GitHub, they might take a back seat. And actually, an amateur who is focused on his own turf is going to be more reactive, more present than a professional who is already overburdened. So indeed, that's what happened. If you went to the Snappy project at the time, you would see the issues increasing all the time and response time being longer and longer. So that's part of what I call the project.
Starting point is 00:34:07 LZ4 was really much more focused, much faster release time, correction, and so on. And so that pays off, and LZ4 keeps spreading. Eventually, some engineers at LG want to get it into the Linux kernel. Getting your project merged into the Linux kernel is a big deal, but it's also hard work. It's a large code base. And because of what it is,
Starting point is 00:34:27 it has lots of rules about how things should be named and how code should be structured. Luckily, Jan doesn't have to take this on at all. The engineers at LG lead the effort. But even for them, it's hard. I see the multiple runs where they get rejected and have to retry. So as long as it's not done,
Starting point is 00:34:44 there is always a reason why it could not pass the bar. And I don't know which reason it will be next time. Could be something as simple as, yeah, the name of your function does not completely respect the naming convention for this part of the kernel. And all of these things, they look secondary, but I totally understand
Starting point is 00:35:05 that from the maintainer perspective, it's important. That's what makes them able to maintain this big pile of code. So it takes really a lot, a lot of time to get integrated. I think it took a year. And at that point, I realized that no one around me would understand what I was saying. So I feel a bit alone. So yeah, for me, it was just a great achievement. But it seems that from this community, it was a bit more than that. We have a French word for that, adoubement.
Starting point is 00:35:35 And I'm not sure what the word in English is. It's where a knight becomes a knight. Oh, it's just knighted in English. Knighted. Yeah. It's a simpler word, I guess. Yeah, it's more direct.
Starting point is 00:35:47 I think it's more or less what happened from their point of view. Now, I was no longer just an amateur doing something and having strangely some success. It felt like being anointed. It shifted my perspective on, by the way, it can be maybe a bit more than just a hobby. So LZ4 keeps spreading. But also, Jan gets to hear from people who evaluate LZ4 for their project and then reject it. I remember database systems, which typically cut data into blocks and compress each block individually. Some of them would say, yeah, we absolutely need speed to the point that we don't compress, so LZ4 is great. Some of them would say, no, no, no, we compress with Zlib and
Starting point is 00:36:30 it matters because we have an installed base of customers. We cannot tell them, no, you are going to need 20% more storage. We do need compression. We do need speed, but we have made the choice that compression really matters, so we are already using Zlib. And Zlib is really the standard compression library that everybody knows and everybody uses. Zlib uses the deflate algorithm mentioned earlier, same as zip files or .gz files. It's old, it works, it's everywhere. I mean, any project that has accepted the speed cost of Zlib would not transition to LZ4. And it's actually the majority of the addressable use cases that say, use Zlib. And therefore, it makes LZ4 something nice, a kind of a niche product for a niche use case. Not a small one, but still a niche.
Starting point is 00:37:20 But the bigger majority of data compression remains inaccessible to LZ4 because it's too light. But back in his learning days, Jan had created a number of compression algorithms. LZ4 was the fastest, but there were others. I made several ones, actually, which would be quite competitive with Zlib. Let's use one of them. Now I understand the open sourcing process. What does that mean? Which is way more than data compression.
Starting point is 00:37:48 In terms of proportion, LZ4 is such a simple thing that the proportion of data compression algorithm itself, that's like 10% of the effort, and everything else is about open sourcing it properly. And so I think it was a good choice to start with that, because
Starting point is 00:38:04 there's so much to learn. But now I have that, I understand open source, can I take one of these old algorithms that I made many years before and bring it to the open source community as a kind of competitor to Zlib. And so I would start to work on that, on this idea. And I'm fairly convinced that, yeah, it can work. I can have a compressor which is at least as good as Zlib in terms of compression ratio, but much faster. And I was already convinced that a lot of users would like that because they were looking for more speed. They just were not willing to accept losing compression ratio in the process. So the path to something better than Zlib seems clear.
Starting point is 00:38:47 LZ4 does half of what Zlib does, but really fast. And the missing part is Huffman coding. A Huffman code is replacing common strings with a shorter prefix. If I just replace the text co-recursive on the podcast website with CR and then put CR equals co-recursive somewhere as a legend and then replace Adam Gordon Bell with AGB and so on with all the most common words, I can use fewer characters. That is something like a Huffman code. But if all Jan does is re-implement this idea, it doesn't feel like enough. What Jan wants to do is something bigger.
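The abbreviation analogy above gives the flavor of the idea. In a textbook Huffman code, the units are individual symbols rather than words: each symbol gets a bit pattern, and more frequent symbols get shorter patterns. The following is a minimal Python sketch of that textbook construction, not the code used in Zlib or ZStandard. The function names `huffman_code`, `encode`, and `decode` are made up for this illustration.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a textbook Huffman code: a dict mapping each symbol to a bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tiebreaker, {symbol: code_so_far}).
    # The unique tiebreaker keeps tuple comparison from ever reaching the dict.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    """Concatenate the bit string of every symbol."""
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    """Decode by matching prefixes; this works because no code is a prefix of another."""
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)

if __name__ == "__main__":
    message = "abracadabra"
    codes = huffman_code(message)
    bits = encode(message, codes)
    print(codes)
    print(len(bits), "bits instead of", 8 * len(message))
    print(decode(bits, codes) == message)
```

In this sketch the most frequent letter of the message ends up with the shortest bit string, which is where the saving comes from.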
Starting point is 00:39:24 He wants to fulfill this goal that he had when he was a kid working on his calculator game. The sentence, when I grow up, I will be an inventor, I think became more and more something in my mind. I really admire the people who can invent stuff, and that feels magical because we just get the final product of many, many years of iteration of efforts. And that's what I wanted to do.
Starting point is 00:39:52 And so Jan starts work on his new entropy encoder, pulling in ideas from his earlier research, incorporating things he's learned about performance. Because he needs speed and compression ratio, that's his focus. And he does this in the open on GitHub in his Finite State Entropy repo. And so I'm releasing Finite State Entropy in 2014, I think, because it's nice enough.
Starting point is 00:40:12 And it works. It's actually even faster than Huffman. And it has the compression ratio of arithmetic coding. And that's kind of a landmark now because, I mean, a new entropy coder, this hasn't happened in 30 years or so, even 40. It was a long time ago. So that part satisfies, I would say, the inventor in me.
Starting point is 00:40:34 I created something really new. And it's not just new. It's actually very efficient. And I can use it for my project of bringing a competitor to Zlib. And so all these ideas start to get together. And since I'm doing that in the open as an open source, everybody sees that. It's kind of the obvious next move. And so that's how we reach the end of 2014 and I decided to release the first version of ZStandard as a technological demo. And it brings what it's supposed to, which means compression ratio similar to Zlib, but way better speed.
Starting point is 00:41:17 In a simple benchmark, Jan's project compresses a bit better than Zlib. But three times faster. Three times faster. He calls it ZStandard, and now he has people's attention. Because compression is happening everywhere computers are involved. But especially at data centers, where the big companies are hosting all their servers. Hence Google being initially ahead of him. But this three times improvement, you know, if it can be deployed, it just gave every cloud provider a way to save lots of servers, a way to save lots of money. And so now Yann starts becoming popular.
Starting point is 00:41:51 Everyone's reaching out to him. So I would say every major player in the Silicon Valley that you can think of. Google, Facebook, Apple, Microsoft, Amazon. Yeah, yeah. Not Amazon. And also other players which are less known, smaller companies. Also at this time, the HBO show Silicon Valley had just come out. It's a great show and it's literally about somebody discovering a better compression algorithm. Some of my colleagues talked to me about that.
Starting point is 00:42:19 I was, what is this thing? And yeah, they are talking about data compression. And it seems to be a big series, Silicon Valley. So maybe it made data compression sexy. I don't know. Anyway, whether there is a consequence or not, in 2015, I ended up receiving really a lot of job offers all about data compression in the Silicon Valley. And that's the moment to decide. Basically, it's a one-time opportunity.
Starting point is 00:42:46 Either I play it safe, I keep my safe job in Paris and I try to continue doing data compression on the side, or I go in fully and I fully invest my time in data compression. And the thing is, I also understand at that point that LZ4 was kind of a lucky strike. It was possible because it was simple. But this one is not simple. This is a really different level of complexity now. And hoping to do that on the side, working two or three nights per week, that's not going to work. On top of that, I'm going to be a father. So I'm waiting for my son. And this is going to eat into this budget even more
Starting point is 00:43:30 to the point that probably there will be nothing left. So the question is, do I do it? I mean, it was clear that if I selected one, I would have to give up the other. If I was to stay in marketing, I would have to really go all in on the politics game, which leaves not a lot of time for programming, and the other side is even more clear. So it was an important choice to do.
Starting point is 00:43:52 In the data compression versus project management debate, compression has at least one very big plus on its side. Generally speaking, data compression, I think, is the only thing I did in my life which has been useful. Everything else I tried to do was always for the benefit of some kind of middle manager who wants to show off to his own manager, who wants to show off, and ends up being not useful, just abandoned or no impact. Well, I've got a few counter examples, but they are few and far between. Data compression is really the first time I bring something to the world, and I know it's useful. And so Jan, he heads to the Bay Area to start interviewing.
Starting point is 00:44:34 I was very clear that my intention was to bring ZStandard as an open source project, fully open source, everyone can use, no strings attached. And not every employer was thrilled about that. Some wanted to keep that completely in-house. That's Apple, I'm assuming. I can't give names. And really, Meta, which was called Facebook at the time, was very clear that they do have a culture of open source. They were fully aligned on this objective.
Starting point is 00:45:03 And this was of great importance to me. That was the reason I was doing all this. So I went with them, and I think it was a good choice. So Jan takes the job at Meta, or Facebook then. There's just one problem, though. If there was one moment where this story could go wrong, it's exactly this transition from marketing to programmer. Because when you think about it, it's a pretty big transition.
Starting point is 00:45:30 Especially, I think I was 42 at the time. So I had been an amateur for a few years, going straight into the big league and Silicon Valley area. That sounds like a stretch. So there's no guarantee that it will end well. Facebook has a bootcamp process for developers joining the team. Jan doesn't have his visa situation worked out yet, but for the bootcamp, it doesn't matter.
Starting point is 00:45:57 Yeah, the bootcamp period, we do have a kind of almost a hotel. It's a very small apartment that's not too far from the campus. And so, yeah, I stayed in Redwood City probably. Jan is used to working on Windows in Visual Studio, but Meta is different. He needs to learn how to work on remote servers and how to develop on Linux. So that's also a learning experience, how to use VI in order to code, in order to have your plugins. At the end, it's a lot of small tools and small way of thinking. None of them looks particularly terrible. It's just that there are a lot of them.
Starting point is 00:46:37 And so it takes time to not just be blocked on the stupid thing. After Jan figures out his environment, which does take him a little bit longer than most, he has to take on some small tasks. We have to reach out to teams which have actual code in production. And there is a small thing to do. It's not complex for them. But for a boot camper,
Starting point is 00:46:58 there is a ton of context to acquire. Why are we even doing that? I remember one of the easier ones was just a matter of processing a file in Python. You say, what's the problem with that? Well, the initial code I had to fix was just loading the file and processing it. But if the file is huge, then it costs too much memory just to load and process. So I had to change that so that it would stream small amounts of data for processing, so that the memory budget would stay low. It's a small exercise, but it's actually a real software, really in production. So that means it goes through a whole validation
Starting point is 00:47:36 cycle that we get to discover firsthand. And that's pretty useful to do. So Yann gets through bootcamp and he starts the data compression team at Facebook. He gets an intern and they start working away. But Jan's having a hard time shaking the feeling that he's out of his depth. I mean, even my first intern looked like a genius. So I'm surrounded by genius people. What am I doing here? It takes time just to accept that, yeah, there's a place.
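The bootcamp task described a moment ago, changing a Python script from loading a whole file to streaming it, can be sketched roughly as follows. This is only an illustration under assumptions: the actual production code is not described in the episode, and counting newlines stands in for whatever processing the real script did. The function names and the 64 KB chunk size are hypothetical.

```python
import io

def count_newlines_all_at_once(stream):
    """Original style: read the whole file into memory, then process it."""
    data = stream.read()              # memory use grows with the file size
    return data.count(b"\n")

def count_newlines_streaming(stream, chunk_size=64 * 1024):
    """Streaming style: hold at most chunk_size bytes in memory at a time."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:                 # empty bytes means end of stream
            break
        total += chunk.count(b"\n")
    return total

if __name__ == "__main__":
    # io.BytesIO stands in for a file opened with open(path, "rb").
    sample = b"line\n" * 1000
    print(count_newlines_all_at_once(io.BytesIO(sample)))
    print(count_newlines_streaming(io.BytesIO(sample), chunk_size=7))
```

Both functions return the same count. The difference is that the streaming version keeps its memory budget bounded by the chunk size regardless of how large the file is.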
Starting point is 00:48:07 Because the point is not to be better than everybody else around at everything. That doesn't make sense. There's an infinite amount of things. The point is to be useful at something valuable so that when we meet our peers, we can help them concretely. And that demands knowledge, but also ability to learn, to evolve. As Jan works to figure out how to work at Facebook, as he works to build ZStandard from a demo into a real library, he also has a son. And so his life is changing really fast. And then out of nowhere, a competitor appears. tank of Zlib, it's more focused on the web, and therefore it's more, once again, it's more focused on compression ratio.
Starting point is 00:49:06 But it's not bad on speed. So, but Brotli was in the making for several years. I had started only a few months before. They were a team. I was alone. It felt like before even reaching my goal, it was already lost. But Jan persisted, and he got help from his compression forum colleagues as well, including Przemysław Skibiński.
Starting point is 00:49:29 He was really knowledgeable about data compression, and he wanted to help. He wanted to contribute. And since it's an open source project, everybody can contribute. So he was contributing more and more. And by the end of the year, we had this idea with my manager at Facebook. Why don't we hire him so that he can fully contribute? And so he receives a contract and he can work on the projects throughout 2016. And so that's why he's one of the co-developers of ZStandard.
Starting point is 00:50:01 And his help was very welcome because I was kind of alone. So I was a bit stretched. Then, in 2016, at Facebook's AtScale conference, ZStandard was officially released. Before it was on my own repository on GitHub, it was considered like a personal project I was authorized to work on. At that point, it became an official Facebook project. It's a great event. First time I do that. So the event itself, I don't know what I can say.
Starting point is 00:50:30 I think it went well. But what was probably more important is the reception from the community. A lot of people were waiting for this 1.0. One of the metrics would show it, which is the number of stars on GitHub, which would be catapulted on the day of the official release. In the final comparison, ZStandard beat Zlib in every way. At the same compression ratio, it compresses 3 to 5 times faster. At the same compression speed, it results in files that are 10 to 15% smaller.
Starting point is 00:51:03 And besides all this, it can decompress twice as fast, regardless of compression speed. The only thing even close to ZStandard is Brotli from Google. So I mentioned before that Brotli was ready one year before us. They should have reached the market before us. But they developed in C++. And I made the decision to develop in C because I knew from LZ4 that this is a universal ABI that you can connect to from anywhere. And now you can have Python, you can have Rust, and after that, every language will connect to the C library. And also we have very good control over our memory allocation, which matters for embedded, for kernel. And the Brotli people understood that, but a bit too late.
Starting point is 00:51:51 So they start by releasing something in C++ and then they convert it back to C because they understood the importance. And it took them a whole year to get that done. And that's part of something you cannot guess. I mean, if you're not exposed to open source deployment. So the LZ4 work of understanding open source and easy adoption paid off for Jan. It only took months for ZStandard to spread. I would say by January, most Linux distributions have ZStandard in their
Starting point is 00:52:25 package repository. And of course, internal Facebook products use it, and more and more open source projects use it. The adoption is really very fast. And that's why I understand that the knowledge, the open source process
Starting point is 00:52:42 and restrictions I learned while doing LZ4, that's what gets re-employed in ZStandard and what led to such a fast adoption. If ZStandard is three to five times faster, then that's three to five times less CPU spent on compression everywhere that made the switch. And like LZ4, ZStandard also opened up new frontiers.
Starting point is 00:53:06 It has dictionary compression, which allows ZStandard to compress things that are really small using aggregate statistical patterns. This is a big deal at places like Meta. And that's not even counting secondary impacts. Internally at Meta, the impact has been massive. So typically we do have caches in the system so that data can be delivered faster because it's closer.
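The dictionary compression idea mentioned above can be illustrated with a small Python sketch. As a stand-in, it uses the preset dictionary feature (`zdict`) of Python's standard `zlib` module rather than ZStandard's own API, and the dictionary is written by hand here, whereas ZStandard can train one from many sample messages. The helper names and the sample record are made up for the example.

```python
import zlib

def compress_with_dict(data, zdict):
    """Compress data using a preset dictionary shared by both sides."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dict(blob, zdict):
    """Decompress data; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()

if __name__ == "__main__":
    # A hand-written dictionary holding the parts that most small records share.
    shared = b'{"user_id": , "status": "active", "country": "FR"}'
    record = b'{"user_id": 42, "status": "active", "country": "FR"}'

    plain = zlib.compress(record, 9)
    with_dict = compress_with_dict(record, shared)

    print(len(record), len(plain), len(with_dict))
    print(decompress_with_dict(with_dict, shared) == record)
```

A tiny record has almost no internal repetition, so compressing it on its own gains little. With a dictionary, the compressor can refer back to content that both sides already hold, which is why this matters for workloads made of many small items.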
Starting point is 00:53:28 But for that, of course, you need equipment, and that costs money. So if you can compress more in these caches, you actually have more data closer to the customer. And there are two benefits to that. First, the customer is served faster, so their experience is smoother. So it drives engagement.
Starting point is 00:53:48 People would use the application more because it's more pleasant. And on the other side, there are a ton of events in the network which do not need to happen because we do not have to go deep into the database and storage system because we don't need to. It's right there. And cumulatively, these two effects, they are way more than the saving itself from the data compression. So that shows that, yes, there are primary savings, but sometimes the secondary effect
Starting point is 00:54:15 is really important too and worth measuring. Secondary, primary, tertiary, however you measure it, the impact has been huge. Let's go back to this idea that data compression can be invisible and still have a fairly big impact, both on storage and transmission, because we also send more data. So once something is used almost everywhere, the amount of data processed is staggering. These are numbers we are not used to. There are so many zeros, we don't talk in this kind of range normally. That's insane.
Starting point is 00:54:51 This is the wildest story I've ever heard in terms of impact. A marketing professional, a project manager, a couple evenings a week after he bikes home from his job in Paris, starts tweaking a calculator game. And by the end of it, Yann has shifted a whole industry's approach to data compression and saved billions and billions of dollars. So the obvious question I have for Yann is how? How can you have that much impact in your career, in your hobby? So I would say the first advice here is don't do that for the success. Success is too random and too far away.
Starting point is 00:55:28 If someone targets success, they will lose stamina way before reaching that point. So do something because you like it. That's the inner force which will drive you beyond, I would say, the normal investments that almost everybody can also do. I think that's a very important one. So now if you're interested in the domain, keep going at it.
Starting point is 00:55:58 It's really a small effort regularly in the same direction, that brings you very far. I understood that from my mother, which had a, how do you call that, an accident. And because of brain injury, she couldn't move much, especially her hands were affected. But despite her difficulty controlling her hands, she decides that she should write a book. When you do have a problem with hands, that's slow, very slow, painfully slow. But incredibly enough, in the matter of a few years, she managed to write, publish, sell. I don't remember the exact number. I think it's about four books.
Starting point is 00:56:42 And they are not small ones. That's something. She did that because she was focused on that. And of course it wasn't the only thing she did in her life, but almost every day she would make an effort in the same direction and that would bring the book closer to the finish line. So that's, I think, a pretty big lesson here. Just keep at it. Learn something new every day. And after some time, actually, it's a lot. It's way more than average. That's what makes you an expert in the field.
Starting point is 00:57:13 So you can't do that effort if you don't like what you're doing. If it's just for the money, that's too long, really. And that's why it's important to like. Thank you, Jan, for sharing your story. You can find Jan on Twitter as cyan4973 and also on GitHub and in code forums. And thank you to Chip Turner, who hired Jan at Meta and set up the data compression team and also reached out to me and said, hey, Jan has this crazy story. You should talk to him. If you know somebody with a story like that, yeah, send me an email, send me a note. Also, thanks to Meta for your belief in open source and being cool about having Jan talk to me about his experiences. If you like this podcast, if you want to help me out,
Starting point is 00:58:05 the best thing you can do is just tell other people about it. I know it takes me a while to get out each episode, but if you want more content from me, you should check out my newsletter where I cover similar topics and also follow me on Twitter at Adam Gordon Bell, where I often share behind the scenes details of the podcast. But really to get more content from me, the best thing you can do is join as a supporter. Go to corecursive.com slash supporters. It's in the show notes as well.
Starting point is 00:58:34 And as a supporter, you get access to more episodes. I put out a bonus episode each month as well. Last month's bonus episode, bonus 17, was inspired by Jan's journey. I recorded it not long after talking to him. And it's about how you accomplish hard things over time. About two keys that Jan highlights that also were impactful in my life. And how I think about building up expertise or accomplishing something big over time. If you're a supporter, check it out and let me know what you think. And until next time, and I always say this very sincerely, thank you for listening.
