CoRecursive: Coding Stories - Story: Frontiers of Performance with Daniel Lemire
Episode Date: December 1, 2020
Did you ever meet somebody who seemed a little bit different than the rest of the world? Maybe they question things that others wouldn't question or say things that others would never say. Daniel is a world-renowned expert on software performance, and one of the most popular open source developers, if you measure by GitHub followers. Today, he's going to share his story. It involves time at a research lab, teaching students in a new way. It will also involve upending people's assumptions about IO performance. Elon Musk and Julia Roberts will come up a little bit more than you might expect.
Links: Daniel's Blog, Daniel's GitHub, Parsing JSON Really Quickly: Lessons Learned
Transcript
Hello and welcome to CoRecursive, the people and the stories behind the code.
I'm Adam Gordon Bell.
Did you ever meet somebody who seemed a little bit different than the rest of the world?
Maybe they questioned things that others wouldn't question,
or said things that others would never say.
Meet Daniel Lemire.
You were asking, you know, whether I was entirely sane,
and I like to think that I'm a little crazy, you know.
By nature, I will obsess over things that people would just, you know,
would rather not think too much about.
Yeah, I think it's kind of a personal trait.
Daniel is a world-renowned expert on software performance
and one of the most popular open-source developers,
if you measure by GitHub followers.
Today, he's going to share his story.
It involves time at a research lab, teaching students in a new way. It will also involve upending people's assumptions
about IO performance. And Elon Musk and Julia Roberts will come up a little bit more than you
might expect. The story starts as Daniel is doing his PhD at the University of Toronto.
He gets thrown a problem and the way he solves it sets his career on a different trajectory.
It starts when a couple of geologists come to him with a data set that they have generated in what seems
to me like a very unique fashion. Basically, they're using helicopters and tied to the helicopter,
you've got a balloon of some kind. Between the two, you've got this ring and this ring throws out EM waves into the ground.
This is fairly standard stuff.
And then they capture the EM waves and they kind of know what to do with them if the data,
if the signal is perfectly clean.
So the way it's supposed to work is that you shoot this wave and then it comes back and then it's supposed to come back as an exponentially decreasing curve.
So theory tells you exactly what you should be getting.
But what you got in practice was massive garbage.
It's stuff that, you know, you cannot feed it into any computer.
So you need to clean it. And the way you sort of want to clean it is that you want to build some kind of model for what the noise is.
So as a young PhD student, they asked, well, can you help us clean up the data?
And I did, but it wasn't a quick process, because they had these CD-ROMs at the time that would have hundreds of megabytes of data on them.
I would sit down and design an algorithm
and then I would implement it and try it out
and it would be spinning forever.
And so just trying to test it out
was taking way too much time.
Were the geology guys gathered around
and you're like,
I'm going to try out this program, and then it just spins and spins?
Right.
So you have this idea.
You think it's going to solve their problem, and you try it out.
But if it takes hours for you to find it out, then it's annoying,
because of course it slows you down.
But it goes further than that: you want to give them the algorithm, and it takes them hours to check that it works.
They may not do it.
And that's actually what happened in my case.
It was too painful for them to try things out. So in my case, they really just put the stuff on their desk.
And they say, well, when we have time, we'll check it out.
And I just say, okay, fine.
You know, I wasn't waiting for them.
And then I get a call, you know, months later.
We finally got around to it. It was painful,
but yeah, it really solves our problem.
So, you know, where can we go with that?
And basically, slow computing can introduce friction.
It can make things that are possible practically very difficult.
And I had this experience over and over and over and over again until I decided, okay, so I'm going to turn my life around.
And instead of doing this algorithmic design stuff, I'm going to go down a level. I'm going to work on the problem of trying an idea and then having to wait forever for it to pan out, instead
of the higher-level problems that I can leave to other people.
So was this geology time, was this when you decided that a focus on performance, focus
on computer science was important?
That's where I was headed. So basically, being able to run code quickly
is a huge enabler.
And we can go into, you know,
why is deep learning taking off right now?
Well, you know, it's a complex topic
and there are lots of reasons,
but certainly one of the reasons for it has to do with system
performance.
If it did exactly what it does now, but it was 10 times or 100 times slower, we
might not even know about it because it would be too expensive to experiment with it.
And you wouldn't have all these applications coming out
because people would, you know,
it would be too expensive to develop.
It's like, I think it's the quote from Joseph Stalin.
So maybe it's not good to use.
He said like, quantity has a quality all its own.
If you have enough computing power,
like it can be a whole different game, right?
Right. And software that is just a little bit too slow to use can seem unbearable. But if you make
it really, really fast, then all of a sudden it's much more fun. So with this realization that
performance can be a great enabler, he finishes up his PhD and he joins a research lab. So in Canada,
we have this research institution.
It's called NRC.
It's like this research-only
government lab, basically.
And so at the time,
they were creating
this e-business initiative.
My academic career, I would say,
really started there
because it was really
this unique environment.
So you have all these really, really smart people put together in the same building, and they all have different ideas.
And because it's brand new, you don't have two old guys in the corner who run the whole show and tell everyone what to do because nobody knows what to do.
So it's basically if you're young and you have ideas, they say, well, go.
You know, we don't know what to do.
So do something.
And so this was a lot of fun for me.
Basically, we could do anything we wanted.
We're free to build the research program we wanted.
And so I really got to try things.
I work a little bit on recommender systems.
And at the time, Greg Linden had come up with the recommender system that Amazon uses.
And I thought that was really, really cool.
And so this inspired me to work a bit on this problem.
Daniel's work on recommender systems led to the creation of the Slope One family of algorithms.
According to Wikipedia, they are the simplest and most performant
collaborative filtering algorithms.
While at NRC, Daniel has another big turn in his career happen.
So I was this researcher, you know, young researcher typing at my desk.
And there's this guy that comes in.
He looks like a homeless person.
You know, he's got this long hair.
And he's swearing a lot about not being able to find a place to sit.
And I'm a little bit scared, you know, because you're there and you've got this person that looks totally out of place.
And you're wondering, you know, are they like going to sleep on the floor or something? But it turns out that it was this really, really, really brilliant guy that could never get a corporate job because he's really too strange.
But he's excessively smart, very, very smart.
And so we start talking.
And he's telling me, you know, he's telling me these stories about his vision. And he's saying, well, soon you'll have all these people,
like thousands, maybe millions of people,
taking these classes online.
And it's all going to be free.
He's a little bit on the left side of the political spectrum.
And it's all going to be free.
And I started listening to him.
And this was very inspiring.
So he was one of the guys who really shaped my vision of the world
because he was very – I think he was slightly prescient.
He really did predict a few things that did happen.
He did foresee a few things.
Because at the time, he was very preoccupied with the cost of higher education, for example,
which, as you may be aware, only got worse over time.
And so he thought, well, OK, so we need to fix this problem.
So we need to get all of these fancy profs to go online where anyone, no matter how poor they are,
can listen to them and learn from them.
So this was very inspiring, I thought.
So his name is Stephen Downes.
Now, you probably don't know him, but he's the inventor of MOOCs, you know, these massive
open online courses.
If you go on Wikipedia, they credit him with this invention.
Is he what led you to go
and try to become a teacher?
To become a professor?
Yeah, so I became a professor
and I started to build online courses.
So my first online course,
I think was in 2005.
And for credit, like, not
like you build a PowerPoint and you post it online. Yeah, actual
for-credit courses. And, except for graduate work, which
is different, I started basically teaching exclusively online at the time. And I did so,
like, I've been doing so for a long time now. So, for example, I've got this introduction to
programming class where I have, I don't know, something like 250 students a year, but it's all online, you know,
and it's actually a lot of fun.
And it's extremely cost-effective
because there's only one of me
and there are 250 students,
but it still works, you know.
Did you find it hard to get into a role like that?
I grew into it, I think.
Now I'm enjoying myself a lot.
But it was very uneasy at first.
I think academia is very conservative
in a strange way.
So, I mean, we like to think about universities
as being progressive.
And in some way, they are.
Like, you know, nobody cares if you're transgender, you know.
In that sense, it's very socially progressive.
But there are ways in which it's extremely conservative.
Like, for example, there's a tool that is perfectly fine,
but that's called MATLAB.
It's a programming language system that, to my knowledge,
is very rarely used outside of a campus.
And certainly, if you go to a data science conference, people will be using Python or R or something;
they probably won't be using MATLAB.
But if you go on campus, everyone's using MATLAB because, well, I mean, to the best
of my knowledge, the reason is that their classes were in MATLAB.
Yeah.
So then they're going to teach what they were taught, you know.
So you reproduce these things.
And when you try to challenge these ideas, academia can resist you quite a bit.
One of the things that I wrote maybe 10 years ago or maybe slightly longer on my blog at some point, I pointed out that there was a big problem with the big academic conferences.
They're very selective.
Basically, nobody from outside academia ever attends, right? So they're kind of like bubbles and everyone is kind of chasing what is hot.
If you look at just the papers, you know that, okay, this was the year of XML.
It's all about XML.
And I say, well, these actually play a negative role because they actually,
if you want to do something original, you're probably not going to be aiming
for these conferences.
The people building the real system don't show up.
It's a little bit challenging to be a contrarian in academia.
Can you think of a specific example of when you maybe had some headbutting with maybe
a department head or somebody because of your different take on things? Right. So there was something
emerging that was called the Semantic Web. I don't know if people
still use the term or it's completely gone now.
So basically, it
came out of expert systems and classical AI.
And at the time, for all sorts of reasons,
I got into this project with colleagues.
And what they were trying to do,
they were trying to leverage the semantic web
that did not yet exist.
But they thought, you know,
if Tim Berners-Lee says it's going to happen, it will.
Well, it didn't.
And then we're saying, okay, the way we should be building online classes
is through these things called learning objects.
And these learning objects are like objects in object-oriented programming.
So they have this metadata, and they
can kind of all come together automagically, and they're like Lego blocks. And at first I actually
thought this all made sense. And then I started asking questions, and then I started reading my
friend Stephen Downes, you know, and asking him, okay, but can you tell me what exactly is a learning object?
This is too abstract.
Yeah.
Then he said, well, it can be anything.
I said, okay, so we're working on anything.
So I started telling people, this is not a good direction.
And the irony is that you can go on Google and find my name.
There's a book, you know, called Canadian Semantic Web with my name on it.
I was the editor.
But I started to have real doubts.
And so I wrote a few things about this not being a good idea.
I prepared a presentation about it and so forth.
And this was very controversial.
I got emails like, why are you doing this?
And I said, well, we shouldn't go there.
And this was very unpopular.
And some people say, well, okay, you don't have tenure yet.
So at the time, I did not have tenure.
So maybe you should be quiet a little bit
and not voice your opinion too much.
But I felt really strongly that this was wrong.
So because of who I am, I couldn't resist speaking up.
And I think one lesson I learned from this, it's hard to think in the abstract.
So I always ask people to give me examples, to be concrete, right?
So software is abstract.
So someone could tell you, well, what's the best way to do X?
And they think it's a very well-defined problem.
And you say, okay, well, give me an example.
How much data do you have?
What's your workflow?
Be precise.
Tell me.
And then you can be smart about it.
But if the problem is too abstract,
if you're thinking in really general terms,
I think that most people, me included, are not smart enough to think in these abstract terms.
You need to bring it down a little bit
and to really take the thing down
and really think in concrete terms,
what does it mean?
That's why, for example, you've got this focus on software performance that is basically
all about taking concrete systems and getting hard numbers out of them.
I would say it's easy to be smart once you do that, because then you can say,
okay, I've got this hard number. I know it's probably not lying to me. I know the problem,
and then I can reason where this should go. To me, this is a really big insight from Daniel.
It's easy to be smart when you can be concrete and precise. It's really hard to be
smart when you're dealing with abstractions. Let's dig into performance though. Daniel has
started to question some of the underlying best practices about performance. So a long time ago,
when I was doing more mundane database research, one of the problems that I was dealing with,
it was just not a research question,
it was just a practical problem,
is that you've got, for example, these text files,
so say a CSV file, you know,
that maybe you exported from Excel or whatever,
and you wanted to eat them up
and include them in your program or do some processing on them.
And I remember being really annoyed at the fact that it was so slow.
So I looked into the best people were doing it.
So it turns out that the best people were using multi-threaded parsers.
So they were using several threads to read a CSV file.
And that felt strange to me
because everyone had been telling me the following.
People were telling me that the bottleneck was the disk.
So you couldn't go faster than the disk, which makes sense. And so, because you were hitting
the disk speed, the efficiency of your code didn't matter. And so I thought, well, okay,
I'm stuck because of my disk. And it was really, really annoyingly slow. Like, you know, I don't
remember the exact numbers, but reading a gigabyte of data was taking forever.
And, you know, it was really, it was slowing me down
and slowing down the experiments and so forth.
And it was annoying.
And then I started thinking about that and chatting.
And then Phil, who was very good at this stuff,
was kind enough to exchange emails with me.
He said, well, don't you think it implies that we're not
disk-bound? And he said,
of course we're not disk-bound.
It's software. We're
processor-bound. But this
was very unpopular. People would not
normally say that.
So, okay, we have the
problem. It seems like this
stuff might be CPU-bound.
Then what? That sounds like a hard problem.
Lots of people have built file processing stuff before.
It's not like a novel area.
No, it's not.
So I was telling you that there's not enough Elon Musk in the world.
And one of the things that Elon Musk does,
if you listen to him when he's thinking through,
he says, okay, so we have this problem here.
And how good could a solution be? And he's trying to do these back of the envelope thing, right? So
how much would it cost to send someone to Mars? So let's try to, you know, let's not go ask
consultants about it. Let's try to figure out from first principle. What programmers
don't do typically is
they don't do that. They don't ask.
They'll figure out
this is slow and this is
annoying, but they'll never ask
the reverse question.
How fast could it be?
You sit down.
You say, okay, I've got so many bytes, blah, blah, blah.
And when you start asking this question, your thinking switches over, because then it's kind of an engineering constraint, right?
So, I mean, the bill comes back from Amazon.
It's whatever it is.
Oh, well.
But you can ask, okay, how low could it be? Now, the important thing about this question
is that you don't need to make it that low, right?
But it gives you a range.
So, you know, if you know you're 100 times higher
than you could go, then it gives you room.
You know, you could adapt it.
In thinking about this problem,
getting CSV files parsed faster,
Daniel has another light bulb moment.
It turns out there's another file parsing task
that's chewing up computer cycles the world over.
Something that's a bottleneck,
whether people know it or not.
I was reading about really a lot of data science
and NoSQL benchmarks involve a lot of JSON.
And you would attend
talks where really,
really smart people, people
who have a lot
of followers, were saying,
well, avoid JSON. It's too slow.
So I said, okay, okay. Let's
benchmark it. And then I
figured out, as is easily
done, well, this is amazingly slow.
This is truly slow. So I asked a friend of mine, you know, Jeff Langdale, who had done a lot of
work where he was working on building really fast regular expression parsers. So I asked him,
do you think we could do better? Because, you know, I look at the numbers and say, this is terrible.
And then, okay,
but how good could it be?
And in that particular case,
I did not have a lot of experience parsing,
so I turned
to someone who does, right?
Well, okay.
And he does exactly
as I would expect.
He goes into this Elon Musk mode
and he tries to figure it out.
You know, it should be
about that much.
I took what was reported by several people
as being the fastest library available at the time,
RapidJSON from Tencent, Chinese folks.
And I was getting, on a typical file,
like 300 megabytes per second or something like that,
which sounds fast until you reason about the fact that I'm hopefully going to get that
PlayStation 5, so a game console, this week or soon.
I don't know.
And it has a disk that exceeds five gigabytes per second in reading speed.
If you're processing JSON at 300 megabytes per second, you know,
there's quite a range.
There's more than 10x difference between the two.
And of course, networks are faster, like really fast networks can be much
faster than five gigabytes per second.
So this means that you've got this huge gap.
And so then the next experiment I like to do is I just take C++.
So C++ is not a slow language.
It's considered really fast.
And I just use the standard library.
And I just call the get line function,
which is a function that takes the current line in a text file
and returns it as a string. And I
just iterate it through the
file like that. And I don't remember
the exact numbers I get, but it's something like
between 500 megabytes and
900 megabytes,
but it's well under
gigabytes per second.
Let's pause to absorb this, right?
The standard logic is that
disks are a bottleneck.
I.O. is slow.
But just calling getline from a file is maxing out one CPU core
and only getting like one-tenth of the speed of the disk.
So obviously some of the standard programming performance dogma must be wrong.
But also, and here's where Daniel lost me,
he thinks that, based on his and Jeff's back-of-the-envelope, Elon Musk-inspired calculations, they can parse JSON at disk speed.
That just seems unreasonably optimistic to me.
JSON parsing involves, you know, like infinitely nested members.
You need to reject things that don't match the spec.
You need to understand Unicode.
And doing that all at over 10 times the speed that C++ can read a line, it just sounds like it's not possible.
So when you look at that, you will think, we're dead.
There's no way we can parse a JSON file at anything close to the disk speed.
We're dead. There's no way to do it.
But if you look at the architecture of my last little test,
what it does is create a new little string object that contains the line.
So it does an allocation.
It creates a little object.
It populates it.
Then it throws it away.
It's extremely wasteful.
Even though it's like three lines of code, it looks efficient.
It's terrible.
So there are a few rules
that people who
focus on efficiency
learn and that they all
share. This is not my
finding. So
basically, you try to avoid allocation.
I mean, you need memory
at some point, but then you do it in
big chunks.
You don't go through a document and then, oh, I've got this little string with the word name in it.
Oh, let's allocate this little string there and let's put it there.
This is terribly slow.
You don't want to be doing this.
So that's the first trick in Daniel's toolbox.
Don't allocate memory unless you really have to. And when you do, allocate a big chunk.
A common pattern that people use is that they have this data structure there, and
then they build something like an iterator.
So they access it through some high-level API, and they say, well, this is nice because it's really abstract,
and then it's going to make my code very beautiful.
But this is like basically drinking beer from a straw,
which is fine, you know, because the iterator is kind of a straw.
But you're never going to win any beer drinking contest.
Like if you're with your friends at a bar,
you're just not going to drink many beers at this rate.
But this straw, this iterator, is really, really elegant.
But at the same time, it's going to block you all the time.
This is the second trick that Daniel has.
Don't use too many unnecessary abstractions.
Stay low level so that you get the full performance.
The next trick is the one I think I'm least familiar with.
And this one is about parallelism.
So when people think about parallelism,
doing things in parallel,
they always think,
oh,
he means like several cores,
but actually, with a single modern core, you've got plenty of parallelism.
First of all, in real code,
you can execute at least like three instructions per cycle, and you can reach higher.
But this is one instance of parallelism.
But there's other levels of parallelism.
For example, there's memory-level parallelism where you can...
So you may have this mental model where your processor requests a byte of memory somewhere,
and then it gets it back, and then it requests another byte of memory and gets it back.
But of course, it doesn't work that way at all. Actually, the way processors work is that they can issue multiple
memory requests at a time.
Easily 10, but we've benchmarked much wider than that, like 25 or something.
Something like the Apple processors, they're incredibly wide.
What you should derive from this is that if you can tell your processor
what to do in such a way that it can just go and do it all
without having to wait for results, then there's no data dependency.
It doesn't have to wait for this part to be done before doing that part.
So if you can avoid these data dependencies,
and if you can avoid the bad branches, then you can go really, really fast.
So there are ways to break data dependencies,
and there are ways to break the branches.
The branches are bad because the way modern processors work is that they have all this
amazing parallelism,
but then when they get to a branch, they don't know which
way to go. They don't know whether it's left
or right. And so they're going to guess.
And
most of the time they're right,
but when they're wrong,
then they have to undo all
of the work they've been doing
and come back.
So the cost can be enormous if it's done poorly.
So you have to engineer your code so there are as few branches as possible.
So you basically want to write your code having a mental model of the machine.
You see this line of code here and this line of code here,
and you want as much as possible
for the processor to be able to run both of them
at the same time.
If you think this way,
then a lot of code can become really, really much faster.
Oh, wow.
What was the end result once you applied all this?
So the story is that we reach two, three, and in some cases four
gigabytes per second. So we're not yet at the disk, but here's the
fun part: I think we can reach the disk given enough clever work. But it's just like
writing good code. It takes time.
And I don't know if I'm going to be the one breaking
the five gigabytes per second barrier.
Well, it would never be me alone in any case.
But what I'm saying is that I think people will.
If it's not me, and if not this year, then next year or in two years,
we're going to see parsing at probably five gigabytes per second.
And I gave you the strongest competitor, which was RapidJSON.
Now there are much faster alternatives.
After simdjson came along, then some other people learned, I guess, a bit from us, and they go
faster than RapidJSON. But at the time, this was the fastest competitor there was that really
was correct. Like, it was parsing everything without breaking any rules. It was really, really
fast. It was much, much, much faster than some popular alternatives. So this means the gap we're talking about is, you know, like 20 times, 30 times faster than
some other options.
So it's really interesting to think that as we're sitting on all this software architecture,
we think because we're working with this old thing that they must be as fast as they can be.
But they're probably not.
It would be a bit like being in 1980 and driving a car and thinking, well, my car cannot get much more fuel efficient.
I mean, we've been working on engines for a century or something.
This is as well tuned as it will be.
But of course, now our cars are much more fuel efficient than they were.
And so the same is true with software.
There are hard limits, but we're very often quite far from the hard limits.
And so software is like that.
There's lots of things that we accept that are actually atrociously inefficient.
So Daniel questioning assumptions about disk IO led him to create the fastest JSON parsing
library in the world. It was 20 to 30 times faster than some popularly used libraries.
But that's not all. His work on bitmap indexes is used in a lot of open-source software,
including Git, Spark, and Elasticsearch. He created a hashing algorithm that's in TensorFlow.
But always questioning assumptions
and not being afraid to ignore the rules
has not always made life easy for Daniel.
Let's go back to when he was in kindergarten.
So, you know, so they expect kids to learn to count
up to, you know, some numbers,
say 1, 2, 3, 4, 5, 6, 7, 10 or something.
And I see I got it wrong, I think.
And they ask you to memorize your phone number
and you have to tie your shoelaces.
So these are kind of cognitive tests
that you have to pass to be considered a normal human being.
So of course, I did not memorize my phone number.
And to this day, if you ask me my phone numbers, I'm quite poor at it.
I certainly don't know my office phone number nor my cell phone number.
Then, as far as counting goes, I figured I was five years old,
and so I could count to five,
and that was good enough.
And then my shoelaces, well, to this day, and this is a true story,
people will see me walking downtown Montreal, and they'll say,
well, your shoelaces aren't done.
And I'll say, oh, and then I'll go and try to do something about it.
So the story is that they decided I wasn't very smart.
So they put me into this special ed class.
Did your parents sit you down and say you're going to be switched classes?
Or do you remember the experience?
Well, yeah.
My mother was a teacher, now she's retired.
This was very embarrassing to her because obviously when you're a teacher,
you want your kids to do really well. If you're a primary school
teacher, then you want your kids to do really well in primary school.
I did do well, by the way. In the end, my grades were good.
This was a little bit of a struggle with my mother, who, well, you know, our parents are sometimes, you know, they want you to succeed.
So basically, they want you to say, well, you know, stop asking odd questions and just do what you're told.
Did they, you know, did they think that you had a learning disability?
Okay, so that's interesting, because, yeah, they definitely thought that I had a learning disability. It was
the 70s, and so it wasn't at the level it is now. Basically, now, at least in Montreal, you have something like 20% of the kids or more, you know, who have a label as having some kind of disability.
But it wasn't like that at all in the 70s. At the time, at least where I lived, schools had easy access to a child psychologist and so forth,
which I'm told now is much more difficult.
But at the time, you know, I would see this nice lady who would run tests by me and so forth.
And they did consider that a learning disability.
Whether or not the school gives him a label,
a five-year-old who refuses to learn to count past five
because he doesn't see the point of it
is unlikely to follow a conventional path in life.
One thing that's unconventional about Daniel is when he's writing code,
he tries to think of what communities might use it.
He writes code thinking about adoption first.
So the same way if I want
to go to China
and reach out to
people, I've got to speak
their language. And I think
it's the same kind of approach
with software.
It's that if you want to
reach out to Java
programmers, you
might write the nicest Rust program
or the nicest Rust library you want.
They won't pay attention
because you're not speaking their language, right?
So you have to reach out to people
and you have to write in their language.
And that's why actually I use,
I try to learn and use the most popular languages.
So I've taught myself, of course,
JavaScript, Java, Python, C++.
I've done less Rust
because until recently, Rust was low in popularity.
But of course, now it's becoming more popular.
So my stance has changed on it.
So now I'm happy to do Rust when needed.
So yeah, it's just a matter of reaching out to people.
When did you decide that shipping code was important?
Well, this relates to a good friend of mine
that I met at NRC.
His name is Martin Brooks.
And Martin Brooks gave this talk at NRC at one time.
He said, well, okay, we're at this government lab, and we're doing research for the world, for the Canadian public and so forth. That's our mission.
We're trying to make the world better, and we have this model where we do this research, then we do some kind of prototype maybe, and then, he said, then we throw it over the wall. So, you know, this wall is small, right? And you throw it over, and you hope someone is there to catch it and run with it.
But actually, if you go and you tilt your head and you look behind the wall, you see there's nobody there catching anything.
Nobody cares, right?
And he says, well, this is broken.
And you know what happened when he was giving the talk is that I was sitting there and
I thought, oh, this is really smart.
And I was taking notes and people were leaving.
Oh, really?
One by one.
Yes, because this was very upsetting.
This was very upsetting to people.
Being told that their model of research does not work.
That actually publishing papers, like, don't get me wrong.
I'm not against publishing papers.
Quite the opposite.
I think more people should be,
including all sorts of people,
should be writing research papers.
This is super important.
Apparently, even Elon Musk
wrote a research paper a few years ago.
So, you know, it's a true story.
But more people should be writing papers.
But you shouldn't just write a paper,
especially with the style that we have now in computer science in 2020,
where papers are hard to read for all sorts of complicated reasons.
Like if you go back to Turing in the 50s
or even the beginning of computer science in the 70s,
you can pick up these papers today and they're quite readable.
But now they're often very, very hard to read.
So if you hit the right topic and you're somewhat famous or something,
or you know people who are famous, your paper might get cited a lot.
But that by itself does not mean you've achieved anything
because it's just like being cited is kind of like having stars on GitHub
or something or having followers on Twitter.
It's not by itself an accomplishment.
It's not.
This is just vanity stuff.
It doesn't change the world.
It doesn't really matter.
And, you know, maybe Twitter terminates your account
and all of the followers are gone.
I don't know.
You know, but it's really virtual, right?
It doesn't really matter.
So if you want to really have an impact on the world,
you have to reach out to people, to practitioners.
The way Daniel reaches out to practitioners
is centered around collaborating with people on GitHub.
It really transformed the way I do research.
Because now I can write code.
I can interact with really, really, really smart people
that I would never have access to.
Just this morning, I was interacting with Russian programmers
who had looked at an algorithm that I wrote.
And they said, well, it's really nice,
but we have to focus on this other aspect of the problem.
And we think it could be improved if you did this instead.
And I'm like, okay, yeah.
So it's super interesting.
So this interaction just wasn't possible before.
The way I do research, I think it's a successful model,
but it's not a model that people can readily adopt,
because it really fits what I do very specifically.
Now, the people who do, like, semantic web and so forth, they've been doing, like, open
source software and so forth. There are still people working on semantic
web, and they probably don't like me very much if they're listening right now, but
but very often there's like this fake
open source thing. And even large companies have been guilty of it, where you take this thing that
you've built and you just dump it on the internet with the source code and say, there, it's open
source. I think Microsoft now understands,
but I think at some point they were doing things like that,
that they would call open source,
but really they were missing the social component,
which is the most important part.
Because open source is really not about the code.
It's really about the interaction with the people.
It's really a social thing.
This is why Daniel is known for his code,
because he embraces the social nature of open source.
His JSON parsing library isn't really his.
He's the top contributor, but he has 68 other people working with him on GitHub.
He embraced the radical ideas of Martin Brooks that,
you know, people in academia should collaborate with people outside of it.
Actually, he also ran with the ideas of Stephen Downs,
embracing remote computer science education back in 2005.
There's one story I want to revisit though.
I don't know why I keep going back to this early school days story,
but it stuck in my head.
But, like, when I feel like somebody, you know, mistreated me or misjudged me or something, right?
I think of, like, Pretty Woman. Do you know this movie?
Of course, yeah.
And, like, they don't let her shop at that store. It's like, have you ever wanted to run into your grade four teacher while you're accepting an award and be like, no, no, no?
I mean, it makes for a great movie scene, but I think it's not quite healthy, you know.
I think Paul Graham had an essay recently about, I think he called it, the privilege of orthodoxy, or something like that.
And so his take is basically, if you tend to easily think like most people in a group,
then you have this thing that he calls a privilege,
because you're never going to be challenged very much,
and people are going to say, well, you're fine.
You're one of us, and it'll be fine.
If you're, by nature, a little bit more prone to ask more questions and to be less quick to adopt the majority opinion,
then I think you're going to be always flagged as someone who is a little bit strange.
And in schools, being strange
is not always a good thing, obviously.
You know, people,
they like to believe simple things
that are being given to them.
And I think that that goes contrary
to what, for example, science is.
So you need to be able to go against the grain, at least selectively.
I don't recommend marching in the street, refusing to wear a mask at Walmart, or something.
That's not what I mean.
I mean it in a more intellectual manner.
Were you willing, in a company,
not necessarily to challenge your boss,
but to ask questions,
like, should we be doing this?
Why do we do this?
The scientific paradigm is about
always asking another question.
No matter where you are,
you always want to be challenging
the state of knowledge.
You always want to find where the frontier is.
So that was the show. I hope you found Daniel as fascinating as I did. I think
he's quite a character. If you liked this episode, do me a huge favor and just tell somebody else
about it who you think might like it. Just, you know, pinging them on Slack or WhatsApp or
however people communicate these days. This is Adam Gordon-Bell. Until next time,
thank you so much for listening.