CoRecursive: Coding Stories - Story: Frontiers of Performance with Daniel Lemire
Episode Date: December 1, 2020
Did you ever meet somebody who seemed a little bit different than the rest of the world? Maybe they question things that others wouldn't question or say things that others would never say. Daniel is a world-renowned expert on software performance, and one of the most popular open source developers, if you measure by GitHub followers. Today, he's going to share his story. It involves time at a research lab, teaching students in a new way. It will also involve upending people's assumptions about IO performance. Elon Musk and Julia Roberts will come up a little bit more than you might expect.
Links: Daniel's Blog, Daniel's GitHub, Parsing JSON Really Quickly: Lessons Learned
Transcript
Hello and welcome to CoRecursive, the people and the stories behind the code.
I'm Adam Gordon Bell.
Did you ever meet somebody who seemed a little bit different than the rest of the world?
Maybe they questioned things that others wouldn't question,
or said things that others would never say.
Meet Daniel Lemire.
You were asking, you know, whether I was entirely sane,
and I like to think that I'm a little crazy, you know.
By nature, I will obsess over things that people would just, you know,
would rather not think too much about.
Yeah, I think it's kind of a personal trait.
Daniel is a world-renowned expert on software performance
and one of the most popular open-source developers,
if you measure by GitHub followers.
Today, he's going to share his story.
It involves time at a research lab, teaching students in a new way. It will also involve upending people's assumptions
about IO performance. And Elon Musk and Julia Roberts will come up a little bit more than you
might expect. The story starts as Daniel is doing his PhD at the University of Toronto.
He gets thrown a problem and the way he solves it sets his career on a different trajectory.
It starts when a couple of geologists come to him with a data set that they have generated in what seems
to me like a very unique fashion. Basically, they're using helicopters and tied to the helicopter,
you've got a balloon of some kind. Between the two, you've got this ring and this ring throws out EM waves into the ground.
This is fairly standard stuff.
And then they capture the EM waves and they kind of know what to do with them if the data,
if the signal is perfectly clean.
So the way it's supposed to work is that you shoot this wave and then it comes back and then it's supposed to come back as an exponentially decreasing curve.
So theory tells you exactly what you should be getting.
But what you got in practice was massive garbage.
It's stuff that, you know, you cannot feed it into any computer.
So you need to clean it. And the way you sort of want to clean it is that you want to build some kind of model for what the noise is.
So as a young PhD student, they asked, well, can you help us clean up the data?
And I did, but it wasn't a quick process, because they had these CD-ROMs at the time that would have hundreds of megabytes of data on them.
I would sit down and design an algorithm
and then I would implement it and try it out
and it would be spinning forever.
And so just trying to test it out
was taking way too much time.
Were the geology guys gathered around
and you're like,
I'm going to try out this program, and then it just spins and spins?
Right.
So you have this idea.
You think it's going to solve their problem, and you try it out.
But if it takes hours for you to find it out, then it's annoying,
because of course it slows you down.
But it goes further than that: you want to give them the algorithm, and it takes them hours to check that it works.
They may not do it.
And that's actually what happened in my case.
It was too painful for them to try things out. So in my case, they really just put the stuff on their desk.
And they say, well, when we have time, we'll check it out.
And I just say, okay, fine.
You know, I wasn't waiting for them.
And then I get a call, you know, months later.
We finally got around to it. It was painful,
but yeah, it really solves our problem.
So, you know, where can we go with that?
And basically, slow computing can introduce friction.
It can make things that are possible practically very difficult.
And I had this experience over and over and over and over again until I decided, okay, so I'm going to turn my life around.
And instead of doing this algorithmic design stuff, I'm going to go down a level. I'm going to work on the problem of trying an idea and then having to wait forever for it to pan out, instead
of the higher-level problems that I can leave to other people.
So was this geology time, was this when you decided that a focus on performance, focus
on computer science was important?
That's where I was headed. So basically, being able to run code quickly
is a huge enabler.
And we can go into, you know,
why is deep learning taking off right now?
Well, you know, it's a complex topic
and there are lots of reasons,
but certainly one of the reasons for it has to do with system
performance.
If it did exactly what it does now, but it was 10 times or 100 times slower, we
might not even know about it because it would be too expensive to experiment with it.
And you wouldn't have all these applications coming out
because people would, you know,
it would be too expensive to develop.
It's like, I think it's the quote from Joseph Stalin.
So maybe it's not good to use.
He said like, quantity has a quality all its own.
If you have enough computing power,
like it can be a whole different game, right?
Right. And software that is just a little bit too slow to use can seem unbearable. But if you make
it really, really fast, then all of a sudden it's much more fun. So with this realization that
performance can be a great enabler, he finishes up his PhD and he joins a research lab. So in Canada,
we have this research institution.
It's called NRC.
It's like this research-only
government lab, basically.
And so at the time,
they were creating
this e-business initiative.
My academic career, I would say,
really started there
because it was really
this unique environment.
So you have all these really, really smart people put together in the same building, and they all have different ideas.
And because it's brand new, you don't have two old guys in the corner who run the whole show and tell everyone what to do because nobody knows what to do.
So it's basically if you're young and you have ideas, they say, well, go.
You know, we don't know what to do.
So do something.
And so this was a lot of fun for me.
Basically, we could do anything we wanted.
We're free to build the research program we wanted.
And so I really got to try things.
I work a little bit on recommender systems.
And at the time, Greg Linden had come up with the recommender system that Amazon uses.
And I thought that was really, really cool.
And so this inspired me to work a bit on this problem.
Daniel's work on recommender systems led to the creation of the Slope One family of algorithms.
According to Wikipedia, they are the simplest and most performant
collaborative filtering algorithms.
While at NRC, Daniel has another big turn in his career happen.
So I was this researcher, you know, young researcher typing at my desk.
And there's this guy that comes in.
He looks like a homeless person.
You know, he's got this long hair.
And he's swearing a lot about not being able to find a place to sit.
And I'm a little bit scared, you know, because you're there and you've got this person that looks totally out of place.
And you're wondering, you know, are they like going to sleep on the floor or something? But it turns out that it was this really, really, really brilliant guy that could never get a corporate job because he's really too strange.
But he's excessively smart, very, very smart.
And so we start talking.
And he's telling me, you know, he's telling me these stories about his vision. And he's saying, well, soon you'll have all these people,
like thousands, maybe millions of people,
taking these classes online.
And it's all going to be free.
He's a little bit on the left side of the political spectrum.
And it's all going to be free.
And I started listening to him.
And this was very inspiring.
So he was one of the guys who really shaped my vision of the world
because he was very – I think he was slightly prescient.
He really did predict a few things that did happen.
He did foresee a few things.
Because at the time, he was very preoccupied with the cost of higher education, for example,
which, as you may be aware, only got worse over time.
And so he thought, well, OK, so we need to fix this problem.
So we need to get all of these fancy profs to go online where anyone, no matter how poor they are,
can listen to them and learn from them.
So this was very inspiring, I thought.
So his name is Stephen Downes.
Now, you probably don't know him, but he's the inventor of MOOCs, you know, these massive
open online courses.
If you go on Wikipedia, they credit him with this invention.
Is he what led you to go
and try to become a teacher?
To become a professor?
Yeah, so I became a professor
and I started to build online courses.
So my first online course,
I think was in 2005.
And for credit, like, not
like you build a PowerPoint and you post it online. Yeah, actual
for-credit courses. And, except for graduate work, which
is different, I started basically teaching exclusively online at the time. And I did so,
like, I've been doing so for a long time now. So, for example, I've got this introduction to
programming class where I have, I don't know, something like 250 students a year, but it's all online, you know,
and it's actually a lot of fun.
And it's extremely cost-effective
because there's only one of me
and there are 250 students,
but it still works, you know.
Did you find it hard to get into a role like that?
I grew into it, I think.
Now I'm enjoying myself a lot.
But it was very uneasy at first.
I think academia is very conservative
in a strange way.
So, I mean, we like to think about universities
as being progressive.
And in some way, they are.
Like, you know, nobody cares if you're transgender, you know.
In that sense, it's very socially progressive.
But there are ways in which it's extremely conservative.
Like, for example, there's a tool that is perfectly fine,
but that's called MATLAB.
It's a programming language system that, to my knowledge,
is very rarely used outside of a campus.
And certainly, if you go to a data science conference, people will be using Python or R or something;
they probably won't be using MATLAB.
But if you go on campus, everyone's using MATLAB because, well, I mean, to the best
of my knowledge, the reason is that their classes were in MATLAB.
Yeah.
So then they're going to teach what they were taught, you know.
So you reproduce these things.
And when you try to challenge these ideas, academia can resist you quite a bit.
One of the things that I wrote maybe 10 years ago or maybe slightly longer on my blog at some point, I pointed out that there was a big problem with the big academic conferences.
They're very selective.
Basically, nobody from outside academia ever attends, right? So they're kind of like bubbles and everyone is kind of chasing what is hot.
If you look at just the papers, you know that, okay, this was the year of XML.
It's all about XML.
And I say, well, these actually play a negative role because they actually,
if you want to do something original, you're probably not going to be aiming
for these conferences.
The people building the real system don't show up.
It's a little bit challenging to be a contrarian in academia.
Can you think of a specific example of when you maybe had some headbutting with maybe
a department head or somebody because of your different take on things? Right. So there was something
emerging that was called the Semantic Web. I don't know if people
still use the term or it's completely gone now.
So basically, it
came out of expert systems and classical AI.
And at the time, for all sorts of reasons,
I got into this project with colleagues.
And what they were trying to do,
they were trying to leverage the semantic web
that did not yet exist.
But they thought, you know,
if Tim Berners-Lee says it's going to happen, it will.
Well, it didn't.
And then we're saying, okay, the way we should be building online classes
is through these things called learning objects.
And these learning objects are like objects in object-oriented programming.
So they have this metadata, and they
can kind of all come together automagically, and they're like Lego blocks. And at first I actually
thought this all made sense. And then I started asking questions, and then I started reading my
friend Stephen Downes, you know, and asking him, okay, but can you tell me what exactly is a learning object?
This is too abstract.
Yeah.
Then he said, well, it can be anything.
I said, okay, so we're working on anything.
So I started telling people, this is not a good direction.
And the irony is that you can go on Google and find my name.
There's a book, you know, called Canadian Semantic Web with my name on it.
I was the editor.
But I started to have real doubts.
And so I wrote a few things about this not being a good idea.
I prepared a presentation about it and so forth.
And this was very controversial.
I got emails like, why are you doing this?
And I said, well, we shouldn't go there.
And this was very unpopular.
And some people say, well, okay, you don't have tenure yet.
So at the time, I did not have tenure.
So maybe you should be quiet a little bit
and not voice your opinion too much.
But I felt really strongly that this was wrong.
So because of who I am, I couldn't resist speaking up.
And I think one lesson I learned from this, it's hard to think in the abstract.
So I always ask people to give me examples, to be concrete, right?
So software is abstract.
So someone could tell you, well, what's the best way to do X?
And they think it's a very well-defined problem.
And you say, okay, well, give me an example.
How much data do you have?
What's your workflow?
Be precise.
Tell me.
And then you can be smart about it.
But if the problem is too abstract,
if you're thinking in really general terms,
I think that most people, me included, are not smart enough to think in these abstract terms.
You need to bring it down a little bit
and to really take the thing down
and really think in concrete terms,
what does it mean?
That's why, for example, you've got this focus on software performance that is basically
all about taking concrete systems and getting hard numbers out of them.
I would say it's easy to be smart once you do that, because then you can say,
okay, I've got this hard number. I know it's probably not lying to me. I know the problem,
and then I can reason where this should go. To me, this is a really big insight from Daniel.
It's easy to be smart when you can be concrete and precise. It's really hard to be
smart when you're dealing with abstractions. Let's dig into performance though. Daniel has
started to question some of the underlying best practices about performance. So a long time ago,
when I was doing more mundane database research, one of the problems that I was dealing with,
it was just not a research question,
it was just a practical problem,
is that you've got, for example, these text files,
so say a CSV file, you know,
that maybe you exported from Excel or whatever,
and you wanted to eat them up
and include them in your program or do some processing on them.
And I remember being really annoyed at the fact that it was so slow.
So I looked into the best people were doing it.
So it turns out that the best people were using multi-threaded parsers.
So they were using several threads to read a CSV file.
And that felt strange to me
because everyone had been telling me the following.
People were telling me that the bottleneck was the disk.
So you couldn't go faster than the disk, which makes sense. And so, because you were hitting
the disk speed, the efficiency of your code didn't matter. And so I thought, well, okay,
I'm stuck because of my disk. And it was really, really annoyingly slow. Like, you know, I don't
remember the exact numbers, but reading a gigabyte of data was taking forever.
And, you know, it was really, it was slowing me down
and slowing down the experiments and so forth.
And it was annoying.
And then I started thinking about that and chatting.
And then Phil, who was very good at this stuff,
was kind enough to exchange emails with me.
He said, well, don't you think it implies that we're not
disk-bound? And he said,
of course we're not disk-bound.
It's software. We're
processor-bound. But this
was very unpopular. People would not
normally say that.
So, okay, we have the
problem. It seems like this
stuff might be CPU-bound.
Then what? That sounds like a hard problem.
Lots of people have built file processing stuff before.
It's not like a novel area.
No, it's not.
So I was telling you that there's not enough Elon Musk in the world.
And one of the things that Elon Musk does,
if you listen to him when he's thinking through,
he says, okay, so we have this problem here.
And how good could a solution be? And he's trying to do these back of the envelope thing, right? So
how much would it cost to send someone to Mars? So let's try to, you know, let's not go ask
consultants about it. Let's try to figure out from first principle. What programmers
don't do typically is
they don't do that. They don't ask.
They'll figure out
this is slow and this is
annoying, but they'll never ask
the reverse question.
How fast could it be?
You sit down.
You say, okay, I've got so many bytes, blah, blah, blah.
And when you start asking this question, your thinking switches over, because then it's kind of an engineering constraint, right?
So, I mean, the bill comes back from Amazon.
It's whatever it is.
Oh, well.
But you can ask, okay, how low could it be? Now, the important thing about this question
is that you don't need to make it that low, right?
But it gives you a range.
So, you know, if you know you're 100 times higher
than you could go, then it gives you room.
You know, you could adapt it.
In thinking about this problem,
getting CSV files parsed faster,
Daniel has another light bulb moment.
It turns out there's another file parsing task
that's chewing up computer cycles the world over.
Something that's a bottleneck,
whether people know it or not.
I was reading about really a lot of data science
and NoSQL benchmarks involve a lot of JSON.
And you would attend
talks where really,
really smart people, people
who have a lot
of followers, were saying,
well, avoid JSON. It's too slow.
So I said, okay, okay. Let's
benchmark it. And then I
figured out, as is easily
done, well, this is amazingly slow.
This is truly slow. So I asked a friend of mine, you know, Jeff Langdale, who had done a lot of
work where he was working on building really fast regular expression parsers. So I asked him,
do you think we could do better? Because, you know, I look at the numbers and say, this is terrible.
And then, okay,
but how good could it be?
And in that particular case,
I did not have a lot of experience parsing,
so I turned
to someone who does, right?
Well, okay.
And he does exactly
as I would expect.
He goes into this Elon Musk mode
and he tries to figure it out.
You know, it should be
about that much.
I took what was reported by several people
as being the fastest library available at the time,
RapidJSON from Tencent, Chinese folks.
And I was getting, on a typical file,
like 300 megabytes per second or something like that,
which sounds fast until you reason about the fact that I'm hopefully going to get that
PlayStation 5, so a game console, this week or soon.
I don't know.
And it has a disk that exceeds five gigabytes per second in reading speed.
If you're processing JSON at 300 megabytes per second, you know,
there's quite a range.
There's more than 10x difference between the two.
And of course, networks are faster, like really fast networks can be much
faster than five gigabytes per second.
So this means that you've got this huge gap.
And so then the next experiment I like to do is I just take C++.
So C++ is not a slow language.
It's considered really fast.
And I just use the standard library.
And I just call the get line function,
which is a function that takes the current line in a text file
and returns it as a string. And I
just iterate it through the
file like that. And I don't remember
the exact numbers I get, but it's something like
between 500 megabytes and
900 megabytes,
but it's well under
gigabytes per second.
Let's pause to absorb this, right?
The standard logic is that
disks are a bottleneck.
I.O. is slow.
But just calling getline from a file is maxing out one CPU core
and only getting like one-tenth of the speed of the disk.
So obviously some of the standard programming performance dogma must be wrong.
But also, and here's where Daniel lost me,
he thinks that, based on his and Jeff's back-of-the-envelope, Elon Musk-inspired calculations, they can parse JSON at disk speed.
That just seems unreasonably optimistic to me.
JSON parsing involves, you know, like infinitely nested members.
You need to reject things that don't match the spec.
You need to understand Unicode.
And doing that all at over 10 times the speed that C++ can read a line, it just sounds like it's not possible.
So when you look at that, you will think, we're dead.
There's no way we can parse a JSON file at anything close to the disk speed.
We're dead. There's no way to do it.
But if you look at the architecture of my last little test,
what it does is create a new little string object that contains the line.
So it does an allocation.
It creates a little object.
It populates it.
Then it throws it away.
It's extremely wasteful.
Even though it's like three lines of code, it looks efficient.
It's terrible.
So there are a few rules
that people who
focus on efficiency
learn and that they all
share. This is not my
finding. So
basically, you try to avoid allocation.
I mean, you need memory
at some point, but then you do it in
big chunks.
You don't go through a document and then, oh, I've got this little string with the word name in it.
Oh, let's allocate this little string there and let's put it there.
This is terribly slow.
You don't want to be doing this.
So that's the first trick in Daniel's toolbox.
Don't allocate memory unless you really have to. And when you do, allocate a big chunk.
A common pattern that people use is that they have this data structure there, and
then they build something like an iterator.
So they access it through some high-level API, and they say, well, this is nice because it's really abstract,
and then it's going to make my code very beautiful.
But this is like basically drinking beer from a straw,
which is fine, you know, because the iterator is kind of a straw.
But you're never going to win any beer drinking contest.
Like if you're with your friends at a bar,
you're just not going to drink many beers at this rate.
But this straw, this iterator, is really, really elegant.
But at the same time, it's going to block you all the time.
This is the second trick that Daniel has.
Don't use too many unnecessary abstractions.
Stay low level so that you get the full performance.
The next trick is the one I think I'm least familiar with.
And this one is about parallelism.
So when people think about parallelism,
doing things in parallel,
they always think,
oh,
he means like several cores,
but actually, with a single modern core, you've got plenty of parallelism.
First of all, in real code,
you can execute at least like three instructions per cycle, and you can reach higher.
But this is one instance of parallelism.
But there's other levels of parallelism.
For example, there's memory-level parallelism where you can...
So you may have this mental model where your processor requests a byte of memory somewhere,
and then it gets it back, and then it requests another byte of memory and gets it back.
But of course, it doesn't work that way at all. Actually, the way processors work is that they can issue multiple
memory requests at a time.
Easily 10, but we've benchmarked much wider than that, like 25 or something.
Something like the Apple processors, they're incredibly wide.
What you should derive from this is that if you can tell your processor
what to do in such a way that it can just go and do it all
without having to wait for results, then there's no data dependency.
It doesn't have to wait for this part to be done before doing that part.
So if you can avoid these data dependencies,
and if you can avoid the bad branches, then you can go really, really fast.
So there are ways to break data dependencies,
and there are ways to break the branches.
The branches are bad because the way modern processors work is that they have all this
amazing parallelism,
but then when they get to a branch, they don't know which
way to go. They don't know whether it's left
or right. And so they're going to guess.
And
most of the time they're right,
but when they're wrong,
then they have to undo all
of the work they've been doing
and come back.
So the cost can be enormous if it's done poorly.
So you have to engineer your code so there are as few branches as possible.
So you basically want to write your code having a mental model of the machine.
You see this line of code here and this line of code here,
and you want as much as possible
for the processor to be able to run both of them
at the same time.
If you think this way,
then a lot of code can become really, really much faster.
Oh, wow.
What was the end result once you applied all this?
So the story is that we reach two, three, and in some cases four
gigabytes per second. So we're not yet at the disk, but here's the
fun part: I think we can reach the disk given enough clever work. But it's just like
writing good code. It takes time.
And I don't know if I'm going to be the one breaking
the five gigabytes per second barrier.
Well, it would never be me alone in any case.
But what I'm saying is that I think people will.
If it's not me, and if not this year, then next year or in two years,
we're going to see parsing at probably five gigabytes per second.
And I gave you the strongest competitor, which was RapidJSON.
Now there are much faster alternatives.
After simdjson came along, then some other people learned, I guess, a bit from us, and they go
faster than RapidJSON. But at the time, this was the fastest competitor there was that really
was correct. Like, it was parsing everything without breaking any rules. It was really, really
fast. It was much, much, much faster than some popular alternatives. So this means the gap we're talking about is, you know, like 20 times, 30 times faster than
some other options.
So it's really interesting to think that as we're sitting on all this software architecture,
we think because we're working with this old thing that they must be as fast as they can be.
But they're probably not.
It would be a bit like being in 1980 and driving a car and thinking, well, my car cannot get much more fuel efficient.
I mean, we've been working on engines for a century or something.
This is as well tuned as it will be.
But of course, now our cars are much more fuel efficient than they were.
And so the same is true with software.
There are hard limits, but we're very often quite far from the hard limits.
And so software is like that.
There's lots of things that we accept that are actually atrociously inefficient.
So Daniel questioning assumptions about disk IO led him to create the fastest JSON parsing
library in the world. It was 20 to 30 times faster than some popularly used libraries.
But that's not all. His work on bitmap indexes is used in a lot of open-source software,
including Git, Spark, and Elasticsearch. He created a hashing algorithm that's in TensorFlow.
But always questioning assumptions
and not being afraid to ignore the rules
has not always made life easy for Daniel.
Let's go back to when he was in kindergarten.
So, you know, so they expect kids to learn to count
up to, you know, some numbers,
say 1, 2, 3, 4, 5, 6, 7, 10 or something.
And I see I got it wrong, I think.
And they ask you to memorize your phone number
and you have to tie your shoelaces.
So these are kind of cognitive tests
that you have to pass to be considered a normal human being.
So of course, I did not memorize my phone number.
And to this day, if you ask me my phone numbers, I'm quite poor at it.
I certainly don't know my office phone number nor my cell phone number.
Then, as far as counting goes, I figured I was five years old,
and so I could count to five,
and that was good enough.
And then my shoelaces, well, to this day, and this is a true story,
people will see me walking downtown Montreal, and they'll say,
well, your shoelaces aren't done.
And I'll say, oh, and then I'll go and try to do something about it.
So the story is that they decided I wasn't very smart.
So they put me into this special ed class.
Did your parents sit you down and say you're going to be switched classes?
Or do you remember the experience?
Well, yeah.
My mother was a teacher, now she's retired.
This was very embarrassing to her because obviously when you're a teacher,
you want your kids to do really well. If you're a primary school
teacher, then you want your kids to do really well in primary school.
I did do well, by the way. In the end, my grades were good.
This was a little bit of a struggle with my mother, who, well, you know, our parents are sometimes, you know, they want you to succeed.
So basically, they want you to say, well, you know, stop asking odd questions and just do what you're told.
Did they, you know, did they think that you had a learning disability?
Okay, so that's interesting, because, yeah, they definitely thought that I had a learning disability. It was
the 70s, and so it wasn't at the level it is now. Basically, now, at least in Montreal, you have something like 20% of the kids or more, you know, who have a label as having some kind of disability.
But it wasn't like that at all in the 70s. At the time, at least where I lived, schools had easy access to a child psychologist and so forth,
which I'm told now is much more difficult.
But at the time, you know, I would see this nice lady who would run tests by me and so forth.
And they did consider that a learning disability.
Whether or not the school gives him a label,
a five-year-old who refuses to learn to count past five
because he doesn't see the point of it
is unlikely to follow a conventional path in life.
One thing that's unconventional about Daniel is when he's writing code,
he tries to think of what communities might use it.
He writes code thinking about adoption first.
So the same way if I want
to go to China
and reach out to
people, I've got to speak
their language. And I think
it's the same kind of approach
with software.
It's that if you want to
reach out to Java
programmers, you
might write the nicest Rust program
or the nicest Rust library you want.
They won't pay attention
because you're not speaking their language, right?
So you have to reach out to people
and you have to write in their language.
And that's why actually I use,
I try to learn and use the most popular languages.
So I've taught myself, of course,
JavaScript, Java, Python, C++.
I've done less Rust
because until recently, Rust was low in popularity.
But of course, now it's becoming more popular.
So my stance has changed on it.
So now I'm happy to do Rust when needed.
So yeah, it's just a matter of reaching out to people.
When did you decide that shipping code was important?
Well, this relates to a good friend of mine
that I met at NRC.
His name is Martin Brooks.
And Martin Brooks gave this talk at NRC at one time.
He said, well, okay, we're at this government lab, and we're doing research for the world, for the Canadian public and so forth. That's our mission.
We're trying to make the world better, and we have this model where we do this research, then we do some kind of prototype maybe, and then, he said, then we throw it over the wall. So, you know, this wall is small, right? And you throw it over, and you hope someone is there to catch it and run with it.
But actually, if you go and you tilt your head and you look behind the wall, you see there's nobody there catching anything.
Nobody cares, right?
And he says, well, this is broken.
And you know what happened when he was giving the talk is that I was sitting there and
I thought, oh, this is really smart.
And I was taking notes and people were leaving.
Oh, really?
One by one.
Yes, because this was very upsetting.
This was very upsetting to people.
Being told that their model of research does not work.
That actually publishing papers, like, don't get me wrong.
I'm not against publishing papers.
Quite the opposite.
I think more people should be,
including all sorts of people,
should be writing research papers.
This is super important.
Apparently, even Elon Musk
wrote a research paper a few years ago.
So, you know, it's a true story.
But more people should be writing papers.
But you shouldn't just write a paper,
especially with the style that we have now in computer science in 2020,
where papers are hard to read for all sorts of complicated reasons.
Like if you go back to Turing in the 50s
or even the beginning of computer science in the 70s,
you can pick up these papers today and they're quite readable.
But now they're often very, very hard to read.
So if you hit the right topic and you're somewhat famous or something,
or you know people who are famous, your paper might get cited a lot.
But that by itself does not mean you've achieved anything
because it's just like being cited is kind of like having stars on GitHub
or something or having followers on Twitter.
It's not by itself an accomplishment.
It's not.
This is just vanity stuff.
It doesn't change the world.
It doesn't really matter.
And, you know, maybe Twitter terminates your account
and all of the followers are gone.
I don't know.
You know, but it's really virtual, right?
It doesn't really matter.
So if you want to really have an impact on the world,
you have to reach out to people, to practitioners.
The way Daniel reaches out to practitioners
is centered around collaborating with people on GitHub.
It really transformed the way I do research.
Because now I can write code.
I can interact with really, really, really smart people
that I would never have access to.
Just this morning, I was interacting with Russian programmers
who had looked at an algorithm that I wrote.
And they said, well, it's really nice,
but we have to focus on this other aspect of the problem.
And we think it could be improved if you did this instead.
And I'm like, okay, yeah.
So it's super interesting.
So this interaction just wasn't possible before.
The way I do research, I think it's a successful model,
but it's not a model that people can readily adopt,
because it really fits what I do very specifically.
Now, the people who do, like, semantic web and so forth, they've been doing, like, open
source software and so forth. There are still people working on semantic
web, and they probably don't like me very much if they're listening right now, but
but very often there's like this fake
open source thing. And even large companies have been guilty of it, where you take this thing that
you've built and you just dump it on the internet with the source code and say, there, it's open
source. I think Microsoft now understands,
but I think at some point they were doing things like that,
that they would call open source,
but really they were missing the social component,
which is the most important part.
Because open source is really not about the code.
It's really about the interaction with the people.
It's really a social thing.
This is why Daniel is known for his code,
because he embraces the social nature of open source.
His JSON parsing library isn't really his.
He's the top contributor, but he has 68 other people working with him on GitHub.
He embraced the radical ideas of Martin Brooks that,
you know, people in academia should collaborate with people outside of it.
Actually, he also ran with the ideas of Stephen Downs,
embracing remote computer science education back in 2005.
There's one story I want to revisit though.
I don't know why I keep going back to this early school days story,
but it stuck in my head.
But, like, when I feel like somebody, you know, mistreated me or misjudged me or something, right?
I think of, like, Pretty Woman. Do you know this movie?
Of course, yeah.
And, like, they don't let her shop at that store. It's like, have you ever wanted to run into your grade four teacher while you're accepting an award and be like, no, no, no?
I mean, it makes for a great movie scene, but I think it's not quite healthy, you know.
I think Paul Graham had an essay recently about, I think he called it, the privilege of orthodoxy, or something like that.
And so his take is basically, if you tend to easily think like most people in a group,
then you have this thing that he calls a privilege,
because you're never going to be challenged very much,
and people are going to say, well, you're fine.
You're one of us, and it'll be fine.
If you're, by nature, a little bit more prone to ask more questions and to be less quick to adopt the majority opinion,
then I think you're going to be always flagged as someone who is a little bit strange.
And in schools, being strange
is not always a good thing, obviously.
You know, people,
they like to believe simple things
that are being given to them.
And I think that that goes contrary
to what, for example, science is.
So you need to be able to go against the grain, at least selectively.
I don't recommend marching in the street, refusing to wear a mask at Walmart, or something.
That's not what I mean.
I mean it in a more intellectual manner.
Were you willing, in a company,
not necessarily to challenge your boss,
but to ask questions,
like, should we be doing this?
Why do we do this?
The scientific paradigm is about
always asking another question.
No matter where you are,
you always want to be challenging
the state of knowledge.
You always want to find where the frontier is.
So that was the show. I hope you found Daniel as fascinating as I did. I think
he's quite a character. If you liked this episode, do me a huge favor and just tell somebody else
about it who you think might like it. Just, you know, pinging them on Slack or WhatsApp or
however people communicate these days. This is Adam Gordon-Bell. Until next time,
thank you so much for listening.