The Changelog: Software Development, Open Source - Inside GitHub's Arctic Code Vault (Interview)
Episode Date: September 4, 2020
Earlier this year, on February 2nd, 2020, Jon Evans and his team of archivists took a snapshot of all active public repositories on GitHub and sent it to a decommissioned coal mine in the Svalbard archipelago, where it will be stored for the next 1,000 years. On this episode, Jon chats with Jerod all about the GitHub Archive Program and how they're preserving open source software for future generations.
Transcript
We have this amazing advisory board of, you know, anthropologists and historians and linguists and so forth.
And one of the interesting things that they mentioned to us, which I found fascinating, is that if you look at existing archives of, like, the Renaissance, it's full of lists of what wealthy and important people (almost all of them men, of course, because it's the 15th century) thought were the important books that should be preserved for posterity.
And apparently, you know, we have so many of those, we don't know what to do with them and don't really care about them. What we really want from that era, to really understand how that era worked, is ordinary people's shopping lists, and almost none of those survived because they weren't considered important at the time. So, you know, we thought it would be more democratic and more inclusive, and also possibly more important and give a more complete view, to be as broad as we could.
Yeah, isn't that amazing? What we're not after is some official narrative, right?
What we're after is a snapshot, a view into the daily lives of the people, the things they were doing or what they were thinking during a time period, to reconstruct our own view of what was going on at that time.
Bandwidth for Changelog is provided by Fastly.
Learn more at fastly.com.
We move fast and fix things here at Changelog because of Rollbar. Check them out at Rollbar.com.
And we're hosted on Linode cloud servers. Head to Linode.com slash Changelog.
Deciding on a cloud provider is hard enough. And figuring out pricing and projected costs,
that should just be easy.
And that's exactly why DigitalOcean
has transparent and predictable pricing
and also an awesome pricing calculator
that not only makes it easy
to figure out your cost per month,
but it also compares that cost
against AWS, Google Cloud, and also Azure.
So head to digitalocean.com
slash pricing slash calculator to play with the pricing
calculator and then head to do.co slash changelog to try DigitalOcean for free with a $100 credit.
Again, digitalocean.com slash pricing slash calculator to play with the pricing calculator
and do.co slash changelog to get your $100 credit to play with.
Alright, welcome back everyone.
This is the Changelog podcast featuring the hackers, the leaders, and the innovators in the world of software. I'm Adam Stachowiak, Editor-in-Chief here at Changelog.
On today's show, Jerod went solo to talk with Jon Evans about the GitHub Archive Program and how they're preserving open source software for future generations. On February 2nd, 2020, earlier this year, Jon and his team of archivists took a snapshot of all of the active projects on GitHub and sent them to a decommissioned coal mine in the Svalbard archipelago, where it will be stored for the next 1,000 years.
And today, we dig into the why and everything that makes that possible.
So Jon, you have a long list of credentials and experiences.
You're an award-winning author, a journalist.
You've appeared in The Guardian, Wired, TechCrunch, amongst others.
You're a world traveler; you've visited over 100 countries, and you're a software engineer.
Which seems like one of the things doesn't fit in with the others, but maybe it does.
You tell me.
It's kind of a weird graphic.
I mean, the engineering came first.
I did my degree in that, and then decided I wanted to go dancing around the world, and
then decided I wanted to write.
So I usually describe myself as easily bored,
which is how all these things kind of fit together.
So I took five years off to be a full-time novelist
and then returned to the warm embrace of the tech industry.
Are you staying busy? Are you getting bored again?
Or how are you feeling?
I'm staying pretty busy.
I mean, I have a couple of different things I'm working on.
I am actually writing a new novel.
We're all staying fairly indoors these days,
so it's mostly interior projects, as you might imagine.
No doubt. No doubt.
So back to writing.
You're also the CTO of Happy Fun Corp,
which is a software development product agency
that works with startups and enterprises.
But most germane to this conversation,
you're the founding director of the GitHub Archive Program,
which is exactly what we're here to talk about today.
I would love to hear how this program started.
The way it started for me was Nat Friedman, the CEO of GitHub,
who I've known for some years, reached out to me saying he was interested
in archiving software, particularly open source software,
which is GitHub's main focus, but certainly a main interest for Nat and myself.
And he wanted to look into the possibilities for that.
And so we kicked around a couple of ideas and decided that the best thing to do
would be to actually go ahead and launch a program
under the auspices of GitHub itself.
So I took a sabbatical from Happy Fun Corp, HFC,
and came on to work full-time at GitHub on that last year.
So Nat came to you. Why you? Why Jon?
Did he have a history of archiving things?
Are you a friend of his? Why did he select you for this?
We've not talked about it explicitly, but I think the notion was that
he wanted someone with enough technical depth and background
to understand the nitty-gritty of how actually to get all the code into whatever very long-term storage we were
talking about, which is a non-trivial process, but also a sense of imagination and a willingness to
work outside the sort of usual thinking. And I guess the history of writing novels and bouncing
around the world spoke to that to some extent.
So you took a sabbatical and you decided, well, we're going to archive this under the auspices of GitHub. What were the first steps? Was it like, go find the coldest place on earth or
get a file format down? What were your first steps?
Well, the first steps were obviously to see what other people were doing in this area,
which was actually super interesting. There's a project called The Memory of Mankind,
which is built in a salt mine in Austria, for instance,
which is perhaps the oldest working mine in the world.
It's been worked since probably 3,000 or 4,000 BC.
And they are writing down data to ceramic tablets
and putting it in this ancient salt mine in the Alps.
And then the salt slowly sort of moves and accumulates over time.
And so this is going to be sealed off by this giant slow wave of salt as a
time capsule for the future, which is a fascinating idea.
Didn't really fit with what we were doing as it's hard to fit terabytes of
code on ceramic tablets, it turns out.
And also, sealing off a time capsule with a giant wall of salt isn't the most convenient way to get to it if you want to access it.
Yeah.
Yeah, that was interesting. Then there's something called the Arc project, which is actually dropping copies of various things on the moon. The plan is to just sort of crash-land various archival facilities, drives, and so forth onto the moon. Again, a very cool idea, not super useful in case you want to access it any time in the near future, or in a whole bunch of possible futures.
And then we found out that there is a Norwegian company called Piql, P-I-Q-L, a little software joke for the relatively few who will get it, I guess a disproportionate number in this podcast,
which had just recently built in cooperation with the
Norwegian government, or at least a mining company owned by the Norwegian government,
a vault beneath an Arctic mountain in Svalbard. That was obviously of some interest. So we
proceeded things further with them, and it turned out that was going to be a pretty good fit.
Which is good, because building an entire sort of, you know, elaborate superstructure apparatus for archival
is obviously a non-trivial job,
so it was kind of nice that someone else had done a lot of that work for us.
So that's ultimately where you all chose,
this Svalbard Archipelago.
Have you been there personally, or was it merely satellite images?
Yeah, yeah, we went last year,
although I should probably mention that that's the most sort of charismatic part; it's like the charismatic megafauna of the archive, the thousand-year part, and that's probably the most wacky, out-of-the-box thinking. But there's also sort of more day-to-day, prosaic, ongoing archival programs that we're doing with the Internet Archive and the Library of Alexandria, hopefully, and the Bodleian Library and the Software Heritage Foundation and so forth.
And so it wound up being a much larger thing: archiving on a sort of week-to-week, month-to-month basis, and also the very, very long-term Arctic-mountain-under-the-permafrost one.
Right. So let's leave the permafrost for now,
and let's talk about some of this warm storage.
So you're going to have this warm-to-cold storage strategy
where you have dailies or weeklies or whatever it is, you can lay it out
that go to these different places.
So of course, like you said,
the frozen tundra is what gets all the press, right?
And of course, it's the coolest, weirdest part about it.
It's like the video is not going to be
of the Wayback Machine.
But the Wayback Machine's involved.
Tell us about all the things you're doing
to do the warm, long-term storage.
Yeah, and in fairness, the Wayback Machine does actually look cool. I don't know if you've been there, but at least one copy of the Wayback Machine sits in the Internet Archive's headquarters in San Francisco, which is a former church. And they have these sort of walls of hard drives in the back of the former church with little lights blinking whenever somebody archives something.
So there is a certain dramatic effect which goes on there.
But yeah, part of archival is making things available to people.
And in particular, something like GitHub,
which is like a tool that people use
and has critical aspects that people want to be able to access,
it's useful to have other backups of that out there.
And so the Internet Archive is currently sucking down
a whole bunch of GitHub public repositories
with the intent of making them available as Git repos
on the Wayback Machine.
So you can point your Git command line
to the Internet Archive URL
and pull down your Node package from there.
Right.
If need be in the future.
So those are effectively Git clones that are synchronized.
Yeah.
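As a minimal sketch of what that could look like for a consumer, a plain Git client would be all you need; the mirror URL below is purely hypothetical, since the real Wayback Machine layout for Git repos was still being worked out at the time of this conversation.

    # sketch: clone a repository back out of a hypothetical archive mirror
    # (assumes git is installed; the URL below is illustrative, not a real endpoint)
    import subprocess

    ARCHIVE_MIRROR = "https://archive.example.org/github/example/left-pad.git"  # hypothetical
    subprocess.run(["git", "clone", ARCHIVE_MIRROR, "left-pad-from-archive"], check=True)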
And I guess a larger perspective,
so we were sort of inspired by Stuart Brand's Long Now Foundation, who you may be familiar with. They're the people who think
that we have this mayfly-ish attitude towards history, which when in fact history turns out
is a very large thing. When they give the year, they preface it with a zero to reinforce this.
So we're in the year 02020 right now, which is sort of a fun little eyebrow-raiser that they do.
It shows their perspective, huh?
Yes, yeah, exactly.
And Brand wrote this piece about pace layers,
about how certain aspects of a society or a civilization
move very quickly, and other aspects move quite slowly.
And it makes sense to sort of look at things
from that point of view.
And so we kind of adopted that into archiving, you know, pace layers. We have the very, very
slow, under the ice for a thousand years, but we also have the sort of
more dynamic, faster, let's grab changes as they occur
several times a year pace layer, which sort of maps to software too,
and that, you know, obviously everything is changing and iterating, but you still have your baseline
of the tried and true technology
that everyone uses, and then you have the new stuff
that people are playing around with,
and changes are coming thick and fast.
Yeah, exactly.
So when you look at the Wayback Machine's version,
is that effectively a day-old thing, a week-old thing?
How old are those snapshots?
Are they synchronized in real time?
Well, they're working on a couple of test projects right now.
The objective is to get the snapshots several times a year.
So it'd never be more than three, four months old.
The actuality of that, I mean, they're still working away on it, but they are very good
at what they do.
So that's our hope and expectation.
And that's effectively a backup.
Is the point of that, if GitHub disappears, at least we have the Wayback Machine?
I mean, GitHub could disappear through some sort of BSD hacking, right?
Like pieces of the internet could vanish for a day because someone messed with BSD, because
it's, sorry, BGP, not BSD, wildly insecure protocol.
So it's nice to have that handy if, you know, for whatever reason, GitHub IP numbers aren't
accessible in your country at that time, that sort of thing.
And, you know, more generally, it's just useful to have another copy around, so you can go back and refer to that if needed.
Then there's also GH Archive, which I think lives in BigQuery.
Is that right?
We use it for our Changelog Nightly newsletter, which queries it.
It's queryable, and it's the events that happen on GitHub,
but there's also source code involved in that as well.
Is that part of the archive program, or is that a separate project altogether?
Yeah, they are affiliated with us.
They predated the archive program, and we sort of reached out
and tried to incorporate them into that.
There's them, there's also Software Heritage
who are doing much the same thing
in an archive, except they're trying to get all
source code everywhere and keep it in one
single sort of monorepo of their own.
They're based in Inria
in Paris, and they have their own
sort of technology
and scraping and so forth.
So, you know, as with all backups, you want multiple copies.
Yeah, if it doesn't exist in three places,
it doesn't exist, right?
Right.
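For anyone curious what the BigQuery side of GH Archive looks like in practice, here is a rough sketch using the google-cloud-bigquery client; the githubarchive.day table naming follows GH Archive's published convention, but treat the exact dataset and schema as assumptions to verify rather than a guaranteed interface.

    # sketch: count GitHub event types on the snapshot day via GH Archive's public BigQuery dataset
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT type, COUNT(*) AS events
        FROM `githubarchive.day.20200202`
        GROUP BY type
        ORDER BY events DESC
    """
    for row in client.query(sql).result():
        print(row.type, row.events)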
So that's the warm storage. You've got the GH archive,
which is associated. You have the other foundation,
which is associated, and then you have this long-term,
which is a snapshot.
You all did the snapshot
on February 2nd, 2020, I believe.
Was this all public GitHub repositories at the time,
or was it like you picked your favorites,
picked the most relevant repos?
We talked about that very early on,
of whether we wanted to be editorial
about what we picked and chose,
and we decided to avoid that to the extent possible.
I think that's a good idea.
Yes.
Some of the reasons are obvious,
some of them are less so.
We have this amazing advisory board of, you know,
anthropologists and historians and linguists and so forth.
And one of the interesting things that they mentioned to us,
which I found fascinating,
is that if you look at existing archives of, like, the Renaissance, it's full of lists of what wealthy and important people (almost all of them men, of course, because it's the 15th century) thought were the important books that should be preserved for posterity. And apparently, you know, we have so many of those, we don't know what to do with them and don't really care about them. What we really want from that era, to really understand how that era worked, is ordinary people's shopping lists, and almost none of those survived, because they weren't considered important at the time. So, you know, we thought it would be more democratic and more inclusive, and also possibly more important and give a more complete view, to be as broad as we could.
Yeah. Isn't that amazing? What we're not after is, like, some official narrative, right? What we're after is a snapshot, a view into the daily lives of the people, the things they were doing or what they were thinking during a time period, to reconstruct our own view of what was going on at that time.
It's amazing.
Yeah, exactly.
And if you're deciding what's important,
then you're passing judgment on what's important.
And maybe our judgment isn't so great.
Maybe 100, 200 years from now,
they're going to look back and think,
what we really care about are the Hello World apps,
the Hello World stuff,
and where they came from and what time zones they were in.
That was the most interesting thing to us right now.
So if you had public code on GitHub on February 2nd of 2020, then you have code in this archive. Is that correct?
Yeah. I can break that down for you in more detail. It was a little more complicated, because we did have space restrictions. So: any repos with any commits of any kind, regardless of how many stars or anything, between the day that the program was announced (that was GitHub Universe 2019; that was November, I think) and the snapshot date. So in the 80 days before the snapshot, all of those repos were captured.
So active projects.
Yes.
Somewhat active.
Yeah.
Any repos with any stars at all (and you can star your own repo in GitHub, of course) from the previous year before the snapshot.
We all do, don't we?
Yes, exactly.
I've been known to star mine, I'm not going to lie.
So for those, anything with a commit in the full year before the snapshot was also taken.
And anything with more than 250 stars,
regardless of commit date, was captured.
So if there's old stuff that the community thought was important
but hadn't been updated in some years,
we grabbed that as well because we figured 250 stars
was a pretty significant indicator
that somebody thought this was okay.
Obviously, this is some level of judgment, because we did have a limited amount of space,
but we tried to minimize that to the extent possible
and be as inclusive as we could without setting criteria.
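Pulling those rules together, a small illustrative sketch of the selection logic as described here might look like the following; the exact announcement date and the function shape are assumptions for illustration, not GitHub's actual implementation.

    # sketch: would a public repo have been included in the 02/02/2020 snapshot?
    from datetime import date

    ANNOUNCEMENT = date(2019, 11, 13)  # GitHub Universe 2019 (assumed exact date)
    SNAPSHOT = date(2020, 2, 2)

    def included(last_commit: date, stars: int) -> bool:
        if last_commit >= ANNOUNCEMENT:
            return True                      # any commit since the program was announced
        if stars >= 1 and (SNAPSHOT - last_commit).days <= 365:
            return True                      # at least one star plus a commit in the prior year
        return stars >= 250                  # community-endorsed, regardless of recent activity

    print(included(date(2019, 12, 1), 0))    # True: active during the announcement window
    print(included(date(2015, 6, 1), 300))   # True: old but well-starred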
So let's talk about the space required then.
That snapshot, how big was it in layman's terms that we can understand,
like terabytes, petabytes, whatever?
Yeah, 21 terabytes, and that's compressed.
Okay.
So it added up.
And do you know how much it would have been if you would just have said
all public repos,
even the old stale ones?
Would it have been like 10x, 100x that?
I don't really know.
We looked into it and we're like,
that's going to be more than we probably have space for.
10x seems high to me, but not super high.
That's a gut feeling.
I don't really have the numbers offhand.
Sure.
So when you say the space that we have, are you talking about terabytes? Are you talking about physical space in this vault, where you only have so much surface area and volume that you can fill? Is that correct?
Yeah. I mean, the vault is a former coal mine, so in terms of cubic meters or cubic feet, there's a very, very large amount of space. Coal is not the densest stuff on earth, right?
Right.
It goes down very deep. But we had a limited amount of tape that we were generating.
And 186, I think it turned out to be.
And each of those has a limitation of about 110 gigabytes,
which on the one hand is actually fairly dense
for something which is written to a visual format.
But on the other hand, when you're
accustomed to one terabyte USB sticks and so forth,
it seems a little worrying and something you have to sort of recalculate around.
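As a quick back-of-the-envelope check on those figures, the reel count and per-reel capacity line up with the 21 terabyte compressed snapshot mentioned earlier:

    # rough arithmetic on the numbers from this conversation
    total_tb = 21      # compressed snapshot size
    reels = 186        # number of film reels
    print(round(total_tb * 1000 / reels))  # ~113 GB per reel, consistent with the ~110 GB figure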
Yeah. And so a quick note for the listener.
If you're curious, do I have code in the GitHub Arctic Code Vault?
Just go to your profile. They've added now a badge for everybody who does.
And if you were active during that time period at all, then likely you do. But you can be sure, because they'll even list, if you hover over that badge, a short list of the repos that you contributed to which, yes, do have code in the Arctic Code Vault.
Very cool. When Adam and I were looking at that, because we started to notice this badge on some people's profiles, we were thinking it was just like, if you contributed to Ruby or Rails
or the Go programming language or
NPM or these very important repos
then you might
be an Arctic Code Vault
contributor. Then we both realized
we were both Arctic Code Vault
contributors. Oh, cool!
It must have been not that because we aren't contributing
to those large projects and it's very interesting
to hear that decision was made.
I think it was a very wise one to say, if it's active, we're going to snapshot it, because that's smart for many reasons.
Actually, not everyone.
I have co-workers at Happy Fun Corp who do not have the badge, because they're professional software developers, but they work in private repos day in, day out, and so forth.
And I got a couple of comments like, could have told us, Jon.
Could have mentioned that this was happening.
Because they missed the boat.
Yeah.
Or I guess it's a train or an airplane.
I don't know how you get out there.
What's up, friends? When was the last time you considered how much time your team is spending building and maintaining internal tooling? And I bet if you looked at the way your team spends time, you're probably building and maintaining those tools way more often than you thought. And you probably shouldn't have to do that. I mean, there is such a thing as Retool. Have you heard about Retool yet? Well, companies like DoorDash, Brex, Plaid, and even Amazon, they use Retool to build internal tools
super fast. And the idea is that almost all internal tools look the same. They're made up
of tables, dropdowns, buttons, text inputs, search, and all
this is very similar. And Retool gives you a point, click, drag and drop interface that makes
it super simple to build internal UIs like that in hours, not days. So stop wasting your time and
use Retool. Check them out at retool.com slash changelog. Again, retool.com slash changelog.
So we were talking about storage format,
and many of us have run into the scenario
where you think you've backed something up,
and then you wait a few years,
and you realize that there's nothing in the world
that can read that anymore,
whether it's Betamax or it's been damaged, right?
CD-ROMs, DVDs, they're still out there,
but you go 100 years in the future,
there may not be any CD readers out there that will work.
So I'm sure that was a huge consideration
when you're trying to shoot for 1,000 years.
Absolutely.
That format is super important.
Yeah, and ironically, I mean,
that's one of the reasons of the archive
is to document things like file formats and so forth
for the future.
And fortunately, this is a thing which
the format that we're using, which is sort of hardened microfilm as an oversimplification,
but it's not too much of an oversimplification, is useful for, because ultimately to just get
basic information out of a piece of film, you need some source of light and some magnifier.
So each of those 186 reels is actually in and of itself a self-contained archive.
It starts with human-readable, visible sort of text and pictures,
explaining in several languages what is on the reel and how to access it,
and how to make sense of its contents and an index of the things which are on it,
before going into the more encoded sort of QR code-ish sort of visual data encoding.
Like an instruction manual.
Yes, exactly.
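Piql's actual film encoding is proprietary and far denser, but as a loose analogy for the QR-code-ish visual encoding Jon describes, here is a tiny sketch that turns a chunk of bytes into a single 2D barcode frame using the third-party qrcode library; it illustrates the idea only, not the real format.

    # sketch: encode a small chunk of data as one 2D barcode "frame" (loose analogy only)
    # pip install qrcode[pil]
    import qrcode

    chunk = b"hello, future reader"  # any bytes worth preserving
    img = qrcode.make(chunk)         # returns a PIL image of the barcode
    img.save("frame_0001.png")       # a real reel holds many such frames plus a human-readable index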
What's the physical medium?
The physical medium is a silver halide on polyester film,
which the ISO rates for 500 years, but Piql has a special hardened film, which the Norwegian military has done some initial tests with, and they say it should be longer than that. Piql thinks it could be good for up to 2,000 years.
We're saying 1,000 years out of what seems like a reasonable abundance of caution.
Right.
Yeah.
How do they do that?
This thing will last 2,000 years.
We've tested it for three months.
Well, I mean, they do artificial testing and sort of heat treating and other forms of testing. But to an extent, yeah; I mean, the only way, obviously, you can actually test that something lasts for a thousand years is to leave it out for a thousand years. That said, I mean, as the ISO will tell you, this stuff, silver halide on polyester, is widely considered to be one of the most stable formats around, and it's not going to be going anywhere anytime soon, particularly if stored well.
And these are in boxes,
and the boxes are wrapped in aluminum film,
and the aluminum film is in a steel vault,
and the vault is in a coal mine,
and the coal mine is in an Arctic mountain, etc.
Seems pretty safe. So the conditions should be good?
Yes, we'd like to think so, yes.
Until a meteor hits that mountain,
that particular mountain.
Well, there is actually another backup.
We're taking a couple of reels
with the 15,000 most starred repos on GitHub
and also a random sampling of just all other repos
because we still wanted to include
some of the sort of inclusive, democratic,
everyone thing, even in these,
what we're calling the greatest hits versions.
And we're going to give those to libraries.
So we're intending to give those to various, you know,
more traditional archives and libraries
in other locations around the world.
Yeah, that's interesting because I did read
from some of your marketing copy.
You say this protects the priceless knowledge
by storing multiple copies on an ongoing basis
across various data formats and locations.
And I was like, and locations?
So I thought maybe this Arctic storage vault is just the first of multiple locations. But is that referring to like
the Wayback Machine and these other libraries? Or do you think you'll say, well, we got one in
the Arctic, how about the Antarctic? Or how about the equator? That'd be a bad place to store it.
Well, yes, to all of those, maybe. We don't really have, like, a fixed, you know, formal plan for the next snapshot, but I personally expect that there is going to be a next snapshot, you know, five years from now, maybe. We're working with Project Silica, which is this kind of amazing Microsoft Research project that uses femtosecond lasers and 5D polarized light technology to store enormous amounts of data on quite small platters of glass.
So that's, you know, a possible format of the future.
That's theoretically good for 10,000 years, because obviously 1,000 years isn't good enough.
You know, we have to... but it's, you know, a little uncertain what the next snapshots will look like.
But the general idea is that another way to get redundancy is to have multiple snapshots in multiple different locations.
So potentially more locations coming. What was the process to get them out to this
particular place? You mentioned it was February 2nd, 2020. The snapshot was taken. You had 186
reels put in boxes. Were these just, you slap a FedEx shipping label on them or how do you get
them up there? Originally we were going to go with them.
And in fact, we went, we being a small team of GitHub people,
went last year to sort of investigate the site,
put an initial reel with 6,000 repos in,
you know, sort of proof of concept,
prepare for the announcement, that sort of thing.
So we did go to Svalbard, go to the coal mine and so forth.
And the plan was to return for the actual deposit this year.
But then the pandemic broke out, which, as you might imagine, kind of confused the whole international logistics
part of the operation. Fortunately, Pickle is based in Norway, and Norwegians at the time,
only Norwegians could go to Svalbard, which is still COVID-free, by the way. There's not been
a single case. And it's famously quasi-illegal to die on Svalbard, so that's good.
Wait, wait, wait, wait.
Quasi-illegal to die.
Please unpack that.
Tell me what that even means.
I think this is kind of an apocryphal, maybe a too-good-to-check kind of story.
But they don't really have any facilities for death on Svalbard.
There's no morgue.
You can't bury anyone in the permafrost and so forth.
And so generally, when there's a serious medical condition,
you get sent back to the mainland one way or another.
That's hilarious.
Somewhat morbid, but interesting.
So in this particular case, you know,
our Norwegian partners wrote the data to film.
And then Svalbard is more accessible than people might think.
Until recently, it was growing into a significant tourist destination.
And there are flights there a couple of times a week still. So it flew in the belly of the twice-weekly flight to Svalbard. It's, I think, roughly the size of a Toyota Prius; for some reason, that's the volume unit we started using, you know, the GitHub archive is about the size of a Prius. So they basically packed this Prius into the belly of a passenger plane, flew it up to Svalbard, and then sent it up to the mine in the mountain itself, which is actually not far above the airport.
So it kind of overlooks the runway.
How many people live up there?
3,000.
It's variable, because there's a university there, and so there's sort of an occasional university population, but 3,000 seems about right.
It's certainly by far, given its latitude,
it's by far the largest thing north of about 70 degrees.
So you can't die there, but could you visit the mine
and see the GitHub boxes, or at least a sign that says
GitHub lives here, or something like that?
You can visit the mine.
The vault itself is locked and sealed off,
but I believe they do run, or at least they were,
running tours to the mine itself. So you can get reasonably close. Similarly, the famous Global
Seed Bank is right around the corner. You can walk from one to the other. It's about a mile
distance between the two, from the mine to the seed bank and vice versa. So you could do a sort
of twofer survival tourist destination. I'm not familiar with the seed bank. Is that like where
they keep a bunch of seeds for things?
Yeah, the Seed Vault. So every country has a seed bank to sort of maintain seeds of the various agricultural plants that they use, and then the Global Seed Vault is sort of the backup to the backup for those seed banks. And it has this very dramatic wedge-shaped building, also in Svalbard.
Yeah, I have seen the picture of that building. It's very cool. So there's no hotfixes. You can't do any bug fixes, though.
So you can't go up there, extract your code,
fix something real quick.
Cause that's it.
That's in history forever.
You know, I'm sure I have bugs in there.
Oh, I definitely do.
I fixed one the other day and I actually thought
the stupid little typo is now eternal.
So yeah, there's one of my personal repos
that happened to get captured up there.
But I mean, I guess that's part of the appeal.
Maybe in the future they'll look back and think,
in these antiquated days of software
development, they still had bugs. They didn't have
AI to automatically fix them while they were working.
How fascinating. Maybe something like that.
They might think, we don't know who this
Jared Santafel is, but he was a real idiot.
He was a real bad
programmer.
Oh, man.
Well, speaking of that, I guess, can you opt
out? Can you say, yeah, not for me.
This code is just, it's public, but
I don't want it to be in perpetuity.
You can opt out. In fact, you could
between the announcement and the snapshot.
We got very, very few opt
out requests. I forget how many, but
it was fingers of one hand,
something like that, I think.
But it is possible, and there's an option on your settings page
in GitHub somewhere now to opt out.
I don't think it's, you know, I think most people are mostly opt-in.
They like the idea of their stuff going into the future,
and they like the idea of sort of the broader perspective
of capturing, you know, not just the open source
on which society relies.
So that's obviously crucial as well,
and that's the part that may be of medium-termish practical use,
but being part of this big capture of not just software,
but kind of the tech community and to an extent a way of life
that is being snapshot and put up there.
Yeah.
How would you imagine somebody finding this or unpacking it a thousand years from now? What would they do with this archive? Would they read the code and try to run it?
I think it would be of primarily historical value. I think people might try and run the code again,
especially since there are some games there.
The Internet Archive, you can go to the Wayback Machine right now
or at least to archive.org and play the initial Prince of Persia,
for instance, which is very popular.
And I think in the same way that 8-bit became a weird aesthetic
not so long ago.
It's possible people will want to craft emulators
of today's antiquated computers
and run software the way it used to be in the old days
in the same way that, I don't know,
people build 19th century train sets
or mock train sets today.
There's also the possibility that this will actually be useful.
A thing that people don't realize necessarily is that software is surprisingly ephemeral.
Like it's all on hard drives.
Hard drives don't last that long, you know, like years, maybe decades.
Backup tapes are also, you know, they're good for decades.
And over the long run, we kind of expect everything to get copied to the next storage medium and
the next storage medium and so forth.
And probably most of it will, but also you're almost certainly going to have losses along the way. So it's easy to envision, you know, some piece of industrial software that something vital has been running on for the last 40 years without anyone noticing, that suddenly we need to patch; or some data format that's suddenly important for some high-profile legal case or something that we need to be able to access, that sort of thing.
And someone going back and saying, wait, where is that code from 2017
describing this obscure data format that looked like a good idea at the time
for about two years in 2067?
Svalbard! Svalbard!
Kind of like the beginning of an Indiana Jones movie, right?
He's got to go find the thing. I mean, it could be sort of a Rosetta Stone. If there were other code that was found that they didn't know how to interpret, they'd know how to execute it, because this has those instructions. Maybe there's an opportunity there to find the runtime that it ran against, or fix that dependency problem, like, hey, is all of npm in here? Maybe we can actually resolve all these dependencies.
That's what I think about when it comes to execution,
because a lot of the code up there,
you're not vendoring your dependencies in your repo.
So a lot of the source code is there.
Are you taking binaries, too, executable code,
or would everything have to be built from source
in this hypothetical situation
where someone's trying to restore something?
There are some binaries up there.
The repos with a lot of stars, again, mostly it's just source.
I am kind of curious myself just how many copies
of node modules we wind up capturing
because I thought seriously
about excluding that from the archive
but I decided not to in the end.
And even that might have some value
an implicit snapshot
of the various dependencies along the way and how those changed.
But it wouldn't shock me greatly if, you know, there are a lot of node modules just raw up there, duplicated over and over and over again.
But that might be useful as well.
There is also going to be a master index. So if a dependency is itself open source and public, which, you know, most open source is these days, and it's not some sort of private repo, then you should be able to, at a given time, on a good computer, sort of reconstruct most of the dependency tree for any given project.
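To make that concrete, here is a hypothetical sketch of reconstructing a Node project's dependency tree from archived repos; the directory layout and the name-to-path index are invented stand-ins, not the archive's actual master index format.

    # sketch: rebuild a (partial) dependency tree from archived repos' package.json files
    import json
    from pathlib import Path

    ARCHIVE_ROOT = Path("/mnt/archive/repos")            # hypothetical extraction location
    INDEX = {p.name: p for p in ARCHIVE_ROOT.iterdir()}  # hypothetical: package name -> repo dir

    def dependency_tree(pkg: str, seen=None) -> dict:
        seen = seen if seen is not None else set()
        if pkg in seen or pkg not in INDEX:
            return {}
        seen.add(pkg)
        manifest = INDEX[pkg] / "package.json"
        if not manifest.exists():
            return {}
        deps = json.loads(manifest.read_text()).get("dependencies", {})
        return {dep: dependency_tree(dep, seen) for dep in deps}

    print(json.dumps(dependency_tree("left-pad"), indent=2))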
What's up, friends?
Are you looking for a way to instantly troubleshoot
your applications and services running in production on Kubernetes?
Well, Pixie gives you a magical API to get instant debug data. And the best part is, this doesn't involve code changes, and there are no manual UIs, and this all lives inside Kubernetes. Pixie is an API which lives inside your platform. It harvests all the data you need, and it exposes a bunch of interfaces that you can paint to get the data that you need.
And it's essentially like a decentralized Splunk.
It's a programmable edge intelligence platform,
and it captures metrics, traces, logs, and events, all this without any code changes.
And the team behind Pixie is working hard to bring it to market for broad use by the end of 2020.
But guess what? Because you listened to this show,
I'm here to tell you how you can get your hands on the beta today
by going to pixielabs.ai.
Links are in the show notes,
so check them out to click through to the beta and their Slack community.
Once again, pixielabs.ai.
And look forward to a Pixie day coming soon.
So one of the things I read about in the documentation around this
is this idea of a tech tree.
And maybe you've already described this with the manuals,
but there's, like, a capital-T, capital-T Tech Tree,
and I wasn't sure exactly what that is.
Can you describe what the tech tree is
and how that concept plays into the archive?
Sure, yeah, and I'm glad you mentioned that because it is a distinction. It's not the same
thing as the manuals, as the guide, and the sort of instructions for decoding that's on every reel
that turns every reel into its own self-contained archive. The tech tree is a reel, possibly two; we're still compiling it, and we're going to add it to the vault once it's done. It's just of sort of larger, higher-level explanatory stuff,
mostly works, you know, pre-existing works, books and so forth, but to explain, you know,
what software engineering is, what an algorithm is, what a computer is, you know, what, how you
would hook together transistors and op-amps and so forth to form a NAND gate, how NAND gates would
make up, you know, ultimately a small microprocessor, that sort of thing.
So in theory, there'd be enough information that you could, in fact, reconstruct a fair amount of modern technology from the information, you know, on those various books.
Now, this is a very romantic and compelling image.
I should mention also, in all honesty, that our advisors were like, yeah, this is cool, but we are living in what is going to be the best documented era in all of history already.
Like, it's very unlikely we're going to have a future in which these books, many copies of these books don't already exist sitting around in many other physical libraries that are kind of easier to get to.
But we figured it would be useful as context and general understanding for the source code, which goes with it as well.
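As a tiny illustration of the bottom-up style of explanation described here (not taken from the tech tree itself, just a sketch of the idea), all of the basic Boolean operations can be expressed using nothing but NAND:

    # every basic logic gate expressed purely in terms of NAND
    def nand(a: int, b: int) -> int:
        return 0 if (a and b) else 1

    def not_(a):    return nand(a, a)
    def and_(a, b): return not_(nand(a, b))
    def or_(a, b):  return nand(not_(a), not_(b))
    def xor_(a, b):
        n = nand(a, b)
        return nand(nand(a, n), nand(b, n))

    # truth-table check; adders, and ultimately a CPU, come from wiring these together
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, and_(a, b), or_(a, b), xor_(a, b))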
So did you end up packaging that stuff up, or this is an idea that's ongoing?
It is going to be packaged up. We actually just released it for public commentary last month,
and I've been incorporating pull requests and issues on that recently. So we're going to compile
those books. We're going to put visual copies. This will all be human-readable, not encoded,
for obvious reasons, so that you get sort of the background
to begin with.
Except for Wikipedia, because that's too big.
But we're going to put a snapshot of Wikipedia.
I was just going to ask that.
It seems like that would be the easy button for this.
Just put Wikipedia up there, and you're done.
One of the highest-rated comments in the video
when we first released the video last year was,
don't forget to store Stack Overflow next door.
But Stack Overflow is also Creative Commons, so we are in fact
going to get a dump of Stack Overflow and drop it in the tech tree as well
next to Wikipedia, yes. That's awesome. What else is going in there?
Wiktionary, a couple of other things, and a list of about 200 books, mostly but not
exclusively technical, all of which is available on the Archive Program repo
at GitHub, which I
think is github.com slash
github slash archive dash program.
We'll snag that one and link it up for those interested in seeing all the things
inside the tech tree. Are you guys taking suggestions?
We are actively taking suggestions right now.
We're incorporating some at the moment.
We still have to sort of work
with publishers since we're literally making a copy.
Copyright becomes an issue, obviously.
So we have to figure out the rights issues with a bunch of these and so forth, which is one reason it's been a little slower than the rest of the project.
But we are actively working on compiling that
and adding it to the vault.
Okay, very cool.
Yeah, I found it, and you have it broken out into different areas, such as hardware architectures, hardware development, electronic components, and you have books. You have articles, I assume, written on modern software development, and under there you have these different books that are going to be included.
So that's very interesting.
What's the next iteration of this then?
What's the timing?
I guess, like you said, the pandemic has changed timings.
Was the hope to be once a year
you'd ship another thing to the Arctic
or would it be every once in a while?
When would you be updating the vault?
I think we're still figuring out
the roadmap. I wouldn't expect every year.
That seems a little...
I don't think we need that much
frequency. The sort of first
deposit captures the last
hopefully 20 or 30 years of
software.
I could see every five years.
Yeah. And I could see different data formats, again, for each one; sort of redundancy through variability, that kind of thing. The tech tree is also a thing which I think will iterate over time. Like, the romantic image of the tech tree, and one that we do aspire to, is like an actual manual for rebuilding, you know, technological civilization from scratch. The v1, as with the Long Now's Manual for Civilization, is existing works, but I could see sort of things being constructed for this purpose, kind of courses, over time. But that is hypothetical, somewhat pie-in-the-sky stuff for now.
Right.
I think the roadmap, at least the one that I personally have in mind, is snapshots every five years or so.
Oh, it looks like you're including Wiktionary as well.
That's correct, yes.
I'm just over here looking at all the...
This is a pretty big tech tree.
You have a lot of copyrights to get figured out here, don't you?
I guess you just speak to each publisher once
and you probably get all the permissions you need for that publisher.
Yeah, that's the idea.
I mean, we've got some already.
Some have been extremely helpful, like O'Reilly, Packt.
They've been great.
And we're having trouble just sort of working up the whole list of publishers.
There's quite a long list if you go through them.
But we'll get them one at a time as time goes on, hopefully.
So somewhat interestingly related with regards to the cultural and technical context of the time period,
all of the changelog and our whole network's podcasts, transcripts are open source on GitHub.
And so they are undoubtedly also in there.
Yes, that's true.
Everything up until February 2nd will be recorded word for word.
So you have thousands of conversations of technologists through the years associated with that.
That's the kind of stuff that I think is interesting in a tech tree as well.
What were people saying to each other?
Yeah, and actually that's one of the things that I get excited about.
There's a lot of source code. Source code is very important.
There's the fundamental underpinnings of open source, which is a cornerstone of technology and civilization, you know, and that's critical as well.
But also, as we all know, people use GitHub for all kinds of weird things.
There's recipes on there.
There are books on there.
There are sort of random notes on there all over the place.
And the extent to which historians of the future will find this a weird and unexpected treasure trove is kind of appealing.
All the things you'd find in there would be quite an interesting thing to dive into.
Exactly.
What about the issues? So, lots of conversations go on on GitHub that are about the source code. Is GitHub Issues going to be involved anywhere?
We did indeed also pull the issues, and the issues are in there.
So how do you decide which issues to pull? Of the repos you decided on, you just took all the issues?
Yeah. Issues, it turns out, are not that spacious. They're mostly text.
Right.
So, yeah, the issues were quite compact.
They were not really a significant figure in the sizing.
And all the comments on the issues as well.
That's correct.
I'm just enjoying the fact that there's so much drama that's just been immortalized in the Arctic Code Vault. Developer drama.
I was actually just thinking that.
The future might look back and think, wow, this was a testy and easily aggravated time.
What's wrong with these people?
Or they might look back and think, man, they were so civilized back then.
Look how they reacted.
They were so passionate.
They really cared about their bugs.
What's with these two-dimensional emoji?
We use four-dimensional emoji now.
Wow.
I mean, we got the Unicode, so all the emoji are in there as well.
So I look forward to the history of looking back on those.
The guides, I mean, there are some translations; as I mentioned, Arabic, Spanish, simplified Chinese, and Hindi. Most of it, and most of the tech tree, is in English, at least this iteration, and certainly that may change in future iterations.
But a thing which surprised me and I thought people might be interested in is that we have this great linguist as a consultant, John McWhorter at Columbia.
And he said that people assume that since English has changed a great deal in the last thousand years, they assume that it will change a great deal in the next thousand years.
But he thinks the evidence shows that's actually quite unlikely.
His estimation is that
English is more or less stable now. You know, people learn it younger, everyone's more
interconnected. It's not like in little islands evolving off on its own. And so the expectation
is that English many hundred years from now will be as different from today as like Jane Austen's
English is from ours. You know, a little weird, a little courtly, a little formal, but not that different.
So his exact quote was,
as uncool as it may be, you'd be all right with just English.
That's interesting.
I assume that it would move.
So, I mean, it's such a long time.
You'd assume it would move to where it would at least be difficult to understand.
It was impressive to me as well, actually.
We did cover our bases by, after all,
adding these other five translations of
most significant languages.
We also, in fact, just
to be on the paranoid safe side,
each reel begins with
the Universal
Declaration of Human Rights in
every known written language in Unicode.
So that's several hundred.
So even if only some obscure Basque language survives,
then we do, in fact, have a Rosetta Stone for that on each reel as well.
I wonder if there was a sense of dread when it came time to actually ship.
Because in software, we have the advantage over most disciplines
of just shipping iterative improvements at all times.
And I remember talking with folks who wrote code for NASA and stuff
where it was like, this had to work.
This was our one chance.
And you ship it to some satellite or some orbiting thing.
And it was just like, even back when they had to package up software
on a compact disc and put it
into a box and sell it to you in a box, that idea of, like, this is the final... what did they call them? Gold, uh, gold master? I don't remember what they called them back then. But, like, that was the version, and sure, we could ship patches maybe, but they weren't going to get in for three months. This is like, you get one shot at this snapshot. You know, you're putting in the Declaration of Human Rights and stuff, these things where you're like, what else should we shove in here? Did you have that moment where you're like, no, we're gonna close the vault, we're just gonna stop shoving stuff in? Or was it difficult?
No, no, no, we totally did. It was, you know, it was difficult to say, okay, this is it. It was useful that we actually had a fixed date, like, this is going to be the snapshot, you know, and we set that fairly early on, and that was good. It actually calls back... so my background, my degree is in electrical engineering, and I did a couple of co-op terms (I went to Waterloo in Canada, which does co-op) of chip design at Nortel and at Hewlett-Packard, before I went into software and spent the entire rest of my career after I graduated in software, hardware being much too unforgiving and permanent. But the chip design was a lot like that.
You're working on this VHDL, and you've
got it working, and you've got the test working, you think.
And then you actually send it out to be fabricated
somewhere and burnt into silicon.
And if you screwed up,
there's nothing that can be done.
Yes, exactly.
And so it reminded me of that for the first time in a very long
time, that you are committing this to the world, whether you like it or not.
Yeah, as a software developer, I assume that you've all but forgotten that feeling,
because don't we have the freedom right now to just not really worry about that?
It was pretty unusual, the feeling of perpetuity, the irrevocability.
It had been a long time since I'd felt that professionally.
The permafrost.
Yes, excellent metaphor. Excellent metaphor, the permafrost.
Well, that is really cool.
Anything else about the program that we haven't touched on
that I haven't asked about that you would like to discuss?
I mean, it's really awesome,
and I really appreciate you sharing the details of this program
and all the work you did to archive these things.
Anything we haven't touched on that you think we should?
I think we have captured
most of the things. I mean, I want to stress just how
important and how useful our partners have been.
You know, the Internet Archive, Software Heritage,
Stanford, the Bodleian, etc.,
etc. I'm sure I'm leaving
someone off now who really shouldn't be left off.
This is inevitably the way when you try and
enlist people. But, you know,
I think it was really important that we cast a broad
tent and tried to work with as many of these organizations. The Long Now. The Long Now have been great.
Having a conversation with anyone at the Long Now is always a mind-opening experience.
Even if it's a relatively simple one. And I guess Project
Silica, another partner. Hopefully that's the longest.
I think it was important to treat this as
not as a thing that one company is doing for one company, but that a broad consortium are doing,
you know, and hopefully as a general goodwill thing. I mean, this is,
this is not a project which has an ROI. This is a project which, you know,
we think is actually important, you know, or could be important.
It's sort of a weird project in that you sort of hope it's not really that important. In a perfect world, you know, all this data will be saved anyway, and we'll just sort of grab it off the internet a thousand years from now, and no one will care about it.
Right, but you never know.
Yeah, exactly.
Anyone who works on backups knows it's important,
even if it's not used.
So you say there's no ROI on this.
What was the magnitude of the I, at least?
You don't have to share
specific numbers,
but was this a large investment?
Is the mine,
is the rent high on the mine?
How much went into
this kind of project?
I mean, I'm pretty sure
I'm not supposed to share numbers,
but I can say I think
it was more economical
than people assume.
And in fairness, Piql, who are obviously the partner I really should have mentioned, and the Arctic World Archive, were very understanding in working with us, realizing that this was sort of a beneficial project more than a private-benefit project.
And so there's no sort of rent.
We sort of paid up front for storage in perpetuity in the World Archive, which is useful and is probably quasi-eternal
in as much as things are eternal these days,
in that it's owned by the Norwegian government.
The Svalbard Archipelago actually has its weird own legal structure.
It's quasi-open; anyone can go to Svalbard.
You don't need a passport to go there.
Anyone can work there.
And it's governed by its own special treaty,
which was signed after World War I,
which made it a place officially free of war and sort of free habitation for any human being that can get there.
Is it sovereign then, or is it underneath?
Well, it's definitely Norwegian, but it has its own special legal status as sort of extra-national territory as well.
I mean, I am not a lawyer.
My wife is a lawyer, and I'm sure
she'll be very upset at
me misrepresenting the
legal status.
Well, you go ask her
and get back to us.
Yes, that was my crude layman's takeaway from the strange legal status that Svalbard has, which is, you know, kind of an international zone of, you know, peace, freedom, and availability. So it's sort of an optimal place to store something. You know, it's not likely that conflict's going to break out there anytime soon.
Right.
It's optimal for storage, but suboptimal for living,
which is why there's only about 3,000 people there.
And no one's breaking down the doors, even though it's 100% COVID-free.
That is correct, yes.
Awesome, Jon.
So like I said, a storied career.
You've done a lot of things.
This is a very cool project. I would think it's a highlight of your career; at least, if it was my career, it would be a highlight of my career.
Absolutely, yes.
Yeah. What's coming next for you then? Can you top this one or you go back to building software products? What's next?
I mean, I am going back to building the software now. I think it's important to sort of keep doing the thing that you care about. And I do think software is important.
I'm working on a new novel.
Who knows what will come up with that.
This year has been pretty bad for plans in general, as you may have noticed.
So I've had friends calling it the great reset year.
So we will see what happens in 2021.
But, you know, I expect to stay involved with the archive program on sort of an indefinite ongoing basis.
And hope to work on the next iterations of it as well.
Awesome. Well, thanks for coming on the show and telling us all about it. We appreciate it.
Hey, thanks very much. It was a great talk.
That's it for this episode of The Changelog. Thank you for tuning in. If you haven't heard yet, we have launched Changelog++. It is our membership program that lets you get closer to the
metal, remove the ads, make them disappear, as we say, and enjoy supporting us.
It's the best way to directly support this show and our other podcasts here on changelog.com.
And if you've never been to changelog.com, you should go there now.
Again, join Changelog++ to directly support our work and make the ads disappear.
Check it out at changelog.com slash plus plus.
Of course, huge thanks to our partners who get it,
Fastly, Linode, and Rollbar.
Also, thanks to Breakmaster Cylinder
for making all of our beats. And thank
you to you for listening. We appreciate you.
That's it for this week. We'll see you
next week. Thank you.