The Changelog: Software Development, Open Source - Inside GitHub's Arctic Code Vault (Interview)
Episode Date: September 4, 2020
Earlier this year, on February 2nd, 2020, Jon Evans and his team of archivists took a snapshot of all active public repositories on GitHub and sent it to a decommissioned coal mine in the Svalbard archipelago, where it will be stored for the next 1,000 years. On this episode, Jon chats with Jerod all about the GitHub Archive Program and how they're preserving open source software for future generations.
Transcript
We have this amazing advisory board of, you know, anthropologists and historians and linguists and so forth.
And one of the interesting things that they mentioned to us, which I found fascinating, is that if you look at existing archives of, like, the Renaissance, it's full of lists of what wealthy and important people (almost all of them men, of course, because it's the 15th century) thought were the important books that should be preserved for posterity.
And apparently, you know, we have so many of those, we don't know what to do with them and don't really care about them. What we really want from that era, to really understand how that era worked, is ordinary people's shopping lists, and almost none of those survived because they weren't considered important at the time. So, you know, we thought it would be more democratic and more inclusive, and also possibly more important and give a more complete view, to be as broad as we could.
Yeah, isn't that amazing? What we're not after is some official narrative, right?
What we're after is a snapshot, a view into the daily lives of the people, the things they were doing or what they were thinking during a time period, to reconstruct our own view of what was going on at that time.
Bandwidth for Changelog is provided by Fastly.
Learn more at fastly.com.
We move fast and fix things here at Changelog because of Rollbar. Check them out at Rollbar.com.
And we're hosted on Linode cloud servers. Head to Linode.com slash Changelog.
Deciding on a cloud provider is hard enough. And figuring out pricing and projected costs,
that should just be easy.
And that's exactly why DigitalOcean
has transparent and predictable pricing
and also an awesome pricing calculator
that not only makes it easy
to figure out your cost per month,
but it also compares that cost
against AWS, Google Cloud, and also Azure.
So head to digitalocean.com
slash pricing slash calculator to play with the pricing
calculator and then head to do.co slash changelog to try DigitalOcean for free with a $100 credit.
Again, digitalocean.com slash pricing slash calculator to play with the pricing calculator
and do.co slash changelog to get your $100 credit to play with.
Alright, welcome back everyone.
This is the Changelog podcast featuring the hackers, the leaders, and the innovators in the world of software. I'm Adam Stachowiak, Editor-in-Chief here at Changelog.
On today's show, Jerod went solo to talk with Jon Evans about the GitHub Archive Program and how they're preserving open source software for future generations. On February 2nd, 2020, earlier this year, Jon and his team of archivists took a snapshot of all of the active projects on GitHub and sent them to a decommissioned coal mine in the Svalbard archipelago, where it will be stored for the next 1,000 years.
And today, we dig into the why and everything that makes that possible.
So Jon, you have a long list of credentials and experiences.
You're an award-winning author, a journalist.
You've appeared in The Guardian, Wired, TechCrunch, amongst others.
You're a world traveler; you've visited over 100 countries, and you're a software engineer.
Which seems like one of the things doesn't fit in with the others, but maybe it does.
You tell me.
It's kind of a weird graphic.
I mean, the engineering came first.
I did my degree in that, and then decided I wanted to go dancing around the world, and
then decided I wanted to write.
So I usually describe myself as easily bored,
which is how all these things kind of fit together.
So I took five years off to be a full-time novelist
and then returned to the warm embrace of the tech industry.
Are you staying busy? Are you getting bored again?
Or how are you feeling?
I'm staying pretty busy.
I mean, I have a couple of different things I'm working on.
I am actually writing a new novel.
We're all staying fairly indoors these days,
so it's mostly interior projects, as you might imagine.
No doubt. No doubt.
So back to writing.
You're also the CTO of Happy Fun Corp,
which is a software development product agency
that works with startups and enterprises.
But most germane to this conversation,
you're the founding director of the GitHub Archive Program,
which is exactly what we're here to talk about today.
I would love to hear how this program started.
The way it started for me was Nat Friedman, the CEO of GitHub,
who I've known for some years, reached out to me saying he was interested
in archiving software, particularly open source software,
which is GitHub's main focus, but certainly a main interest for Nat and myself.
And he wanted to look into the possibilities for that.
And so we kicked around a couple of ideas and decided that the best thing to do
would be to actually go ahead and launch a program
under the auspices of GitHub itself.
So I took a sabbatical from Happy Fun Corp, HFC,
and came on to work full-time at GitHub on that last year.
So Nat came to you. Why you? Why Jon?
Did he have a history of archiving things?
Are you a friend of his? Why did he select you for this?
We've not talked about it explicitly, but I think the notion was that
he wanted someone with enough technical depth and background
to understand the nitty-gritty of how actually to get all the code into whatever very long-term storage we were
talking about, which is a non-trivial process, but also a sense of imagination and a willingness to
work outside the sort of usual thinking. And I guess the history of writing novels and bouncing
around the world spoke to that to some extent.
So you took a sabbatical and you decided, well, we're going to archive this under the auspices of GitHub. What were the first steps? Was it like, go find the coldest place on earth or
get a file format down? What were your first steps?
Well, the first steps were obviously to see what other people were doing in this area,
which was actually super interesting. There's a project called The Memory of Mankind,
which is built in a salt mine in Austria, for instance,
which is perhaps the oldest working mine in the world.
It's been worked since probably 3,000 or 4,000 BC.
And they are writing down data to ceramic tablets
and putting it in this ancient salt mine in the Alps.
And then the salt slowly sort of moves and accumulates over time.
And so this is going to be sealed off by this giant slow wave of salt as a
time capsule for the future, which is a fascinating idea.
Didn't really fit with what we were doing as it's hard to fit terabytes of
code on ceramic tablets, it turns out.
And also, sealing off a time capsule with a giant wall of salt isn't the most convenient way to get to it if you want to access it.
Yeah.
Yeah, that was interesting. Then there's something called the Arc project, which is actually dropping copies of various things on the moon. The plan is to just sort of crash-land various archival facilities, drives, and so forth onto the moon. Again, a very cool idea, not super useful in case you want to access it any time in the near future, or in a whole bunch of possible futures.
And then we found out that there is a Norwegian company called Piql, P-I-Q-L, a little software joke for the relatively few who will get it, I guess a disproportionate number in this podcast,
which had just recently built in cooperation with the
Norwegian government, or at least a mining company owned by the Norwegian government,
a vault beneath an Arctic mountain in Svalbard. That was obviously of some interest. So we
proceeded things further with them, and it turned out that was going to be a pretty good fit.
Which is good, because building an entire sort of, you know, elaborate superstructure apparatus for archival
is obviously a non-trivial job,
so it was kind of nice that someone else had done a lot of that work for us.
So that's ultimately where you all chose,
this Svalbard Archipelago.
Have you been there personally, or was it merely satellite images?
Yeah, yeah, we went last year,
although I should probably mention that that's the most sort of charismatic part; it's like the charismatic megafauna of the archive, the thousand-year part, and that's probably the most wacky, out-of-the-box thinking. But there's also sort of more day-to-day, prosaic, ongoing archival programs that we're doing with the Internet Archive and the Library of Alexandria, hopefully, and the Bodleian Library and the Software Heritage Foundation and so forth.
And so it wound up being a much larger thing: archiving on a sort of week-to-week, month-to-month basis, and also the very, very long-term Arctic-mountain-under-the-permafrost one.
Right. So let's leave the permafrost for now,
and let's talk about some of this warm storage.
So you're going to have this warm-to-cold storage strategy
where you have dailies or weeklies or whatever it is, you can lay it out
that go to these different places.
So of course, like you said,
the frozen tundra is what gets all the press, right?
And of course, it's the coolest, weirdest part about it.
It's like the video is not going to be
of the Wayback Machine.
But the Wayback Machine's involved.
Tell us about all the things you're doing
to do the warm, long-term storage.
Yeah, and in fairness, the Wayback Machine does actually look cool. I don't know if you've been there, but at least one copy of the Wayback Machine sits in the Internet Archive's headquarters in San Francisco, which is a former church. And they have these sort of walls of hard drives in the back of the former church with little lights blinking whenever somebody archives something.
So there is a certain dramatic effect which goes on there.
But yeah, part of archival is making things available to people.
And in particular, something like GitHub,
which is like a tool that people use
and has critical aspects that people want to be able to access,
it's useful to have other backups of that out there.
And so the Internet Archive is currently sucking down
a whole bunch of GitHub public repositories
with the intent of making them available as Git repos
on the Wayback Machine.
So you can point your Git command line
to the Internet Archive URL
and pull down your Node package from there.
Right.
If need be in the future.
So those are effectively Git clones that are synchronized.
Yeah.
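As a minimal sketch of what that could look like for a consumer, a plain Git client would be all you need; the mirror URL below is purely hypothetical, since the real Wayback Machine layout for Git repos was still being worked out at the time of this conversation.

    # sketch: clone a repository back out of a hypothetical archive mirror
    # (assumes git is installed; the URL below is illustrative, not a real endpoint)
    import subprocess

    ARCHIVE_MIRROR = "https://archive.example.org/github/example/left-pad.git"  # hypothetical
    subprocess.run(["git", "clone", ARCHIVE_MIRROR, "left-pad-from-archive"], check=True)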
And I guess a larger perspective,
so we were sort of inspired by Stuart Brand's Long Now Foundation, who you may be familiar with. They're the people who think
that we have this mayfly-ish attitude towards history, which when in fact history turns out
is a very large thing. When they give the year, they preface it with a zero to reinforce this.
So we're in the year 02020 right now, which is sort of a fun little eyebrow-raiser that they do.
It shows their perspective, huh?
Yes, yeah, exactly.
And Brand wrote this piece about pace layers,
about how certain aspects of a society or a civilization
move very quickly, and other aspects move quite slowly.
And it makes sense to sort of look at things
from that point of view.
And so we kind of adopted that into archiving, you know, pace layers. We have the very, very
slow, under the ice for a thousand years, but we also have the sort of
more dynamic, faster, let's grab changes as they occur
several times a year pace layer, which sort of maps to software too,
and that, you know, obviously everything is changing and iterating, but you still have your baseline
of the tried and true technology
that everyone uses, and then you have the new stuff
that people are playing around with,
and changes are coming thick and fast.
Yeah, exactly.
So when you look at the Wayback Machine's version,
is that effectively a day-old thing, a week-old thing?
How old are those snapshots?
Are they synchronized in real time?
Well, they're working on a couple of test projects right now.
The objective is to get the snapshots several times a year.
So it'd never be more than three, four months old.
The actuality of that, I mean, they're still working away on it, but they are very good
at what they do.
So that's our hope and expectation.
And that's effectively a backup.
Is the point of that, if GitHub disappears, at least we have the Wayback Machine?
I mean, GitHub could disappear through some sort of BSD hacking, right?
Like pieces of the internet could vanish for a day because someone messed with BSD, because
it's, sorry, BGP, not BSD, wildly insecure protocol.
So it's nice to have that handy if, you know, for whatever reason, GitHub IP numbers aren't
accessible in your country at that time, that sort of thing.
And, you know, more generally, it's just useful to have another copy around, so you can go back and refer to that if needed.
Then there's also GH Archive, which I think lives in BigQuery.
Is that right?
We use it for our Changelog Nightly newsletter, which queries it.
It's queryable, and it's the events that happen on GitHub,
but there's also source code involved in that as well.
Is that part of the archive program, or is that a separate project altogether?
Yeah, they are affiliated with us.
They predated the archive program, and we sort of reached out
and tried to incorporate them into that.
There's them, there's also Software Heritage
who are doing much the same thing
in an archive, except they're trying to get all
source code everywhere and keep it in one
single sort of monorepo of their own.
They're based in Inria
in Paris, and they have their own
sort of technology
and scraping and so forth.
So, you know, as with all backups, you want multiple copies.
Yeah, if it doesn't exist in three places,
it doesn't exist, right?
Right.
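For anyone curious what the BigQuery side of GH Archive looks like in practice, here is a rough sketch using the google-cloud-bigquery client; the githubarchive.day table naming follows GH Archive's published convention, but treat the exact dataset and schema as assumptions to verify rather than a guaranteed interface.

    # sketch: count GitHub event types on the snapshot day via GH Archive's public BigQuery dataset
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT type, COUNT(*) AS events
        FROM `githubarchive.day.20200202`
        GROUP BY type
        ORDER BY events DESC
    """
    for row in client.query(sql).result():
        print(row.type, row.events)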
So that's the warm storage. You've got the GH archive,
which is associated. You have the other foundation,
which is associated, and then you have this long-term,
which is a snapshot.
You all did the snapshot
on February 2nd, 2020, I believe.
Was this all public GitHub repositories at the time,
or was it like you picked your favorites,
picked the most relevant repos?
We talked about that very early on,
of whether we wanted to be editorial
about what we picked and chose,
and we decided to avoid that to the extent possible.
I think that's a good idea.
Yes.
Some of the reasons are obvious,
some of them are less so.
We have this amazing advisory board of, you know,
anthropologists and historians and linguists and so forth.
And one of the interesting things that they mentioned to us,
which I found fascinating,
is that if you look at existing archives of, like, the Renaissance, it's full of lists of what wealthy and important people (almost all of them men, of course, because it's the 15th century) thought were the important books that should be preserved for posterity. And apparently, you know, we have so many of those, we don't know what to do with them and don't really care about them. What we really want from that era, to really understand how that era worked, is ordinary people's shopping lists, and almost none of those survived, because they weren't considered important at the time. So, you know, we thought it would be more democratic and more inclusive, and also possibly more important and give a more complete view, to be as broad as we could.
Yeah. Isn't that amazing? What we're not after is, like, some official narrative, right? What we're after is a snapshot, a view into the daily lives of the people, the things they were doing or what they were thinking during a time period, to reconstruct our own view of what was going on at that time.
It's amazing.
Yeah, exactly.
And if you're deciding what's important,
then you're passing judgment on what's important.
And maybe our judgment isn't so great.
Maybe 100, 200 years from now,
they're going to look back and think,
what we really care about are the Hello World apps,
the Hello World stuff,
and where they came from and what time zones they were in.
That was the most interesting thing to us right now.
So if you had public code on GitHub on February 2nd of 2020, then you have code in this archive. Is that correct?
Yeah. I can break that down for you in more detail. It was a little more complicated, because we did have space restrictions. So: any repos with any commits of any kind, regardless of how many stars or anything, between the day that the program was announced (that was GitHub Universe 2019; that was November, I think) and the snapshot date. So in the 80 days before the snapshot, all of those repos were captured.
So active projects.
Yes.
Somewhat active.
Yeah.
Any repos with any stars at all (and you can star your own repo in GitHub, of course) from the previous year before the snapshot.
We all do, don't we?
Yes, exactly.
I've been known to star mine, I'm not going to lie.
So for those, anything with a commit in the full year before the snapshot was also taken.
And anything with more than 250 stars,
regardless of commit date, was captured.
So if there's old stuff that the community thought was important
but hadn't been updated in some years,
we grabbed that as well because we figured 250 stars
was a pretty significant indicator
that somebody thought this was okay.
Obviously, this is some level of judgment, because we did have a limited amount of space,
but we tried to minimize that to the extent possible
and be as inclusive as we could without setting criteria.
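Pulling those rules together, a small illustrative sketch of the selection logic as described here might look like the following; the exact announcement date and the function shape are assumptions for illustration, not GitHub's actual implementation.

    # sketch: would a public repo have been included in the 02/02/2020 snapshot?
    from datetime import date

    ANNOUNCEMENT = date(2019, 11, 13)  # GitHub Universe 2019 (assumed exact date)
    SNAPSHOT = date(2020, 2, 2)

    def included(last_commit: date, stars: int) -> bool:
        if last_commit >= ANNOUNCEMENT:
            return True                      # any commit since the program was announced
        if stars >= 1 and (SNAPSHOT - last_commit).days <= 365:
            return True                      # at least one star plus a commit in the prior year
        return stars >= 250                  # community-endorsed, regardless of recent activity

    print(included(date(2019, 12, 1), 0))    # True: active during the announcement window
    print(included(date(2015, 6, 1), 300))   # True: old but well-starred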
So let's talk about the space required then.
That snapshot, how big was it in layman's terms that we can understand,
like terabytes, petabytes, whatever?
Yeah, 21 terabytes, and that's compressed.
Okay.
So it added up.
And do you know how much it would have been if you would just have said
all public repos,
even the old stale ones?
Would it have been like 10x, 100x that?
I don't really know.
We looked into it and we're like,
that's going to be more than we probably have space for.
10x seems high to me, but not super high.
That's a gut feeling.
I don't really have the numbers offhand.
Sure.
So when you say the space that we have, are you talking about terabytes? Are you talking about physical space in this vault, where you only have so much surface area and volume that you can fill? Is that correct?
Yeah. I mean, the vault is a former coal mine, so in terms of cubic meters or cubic feet, there's a very, very large amount of space. Coal is not the densest stuff on earth, right?
Right.
It goes down very deep. But we had a limited amount of tape that we were generating.
And 186, I think it turned out to be.
And each of those has a limitation of about 110 gigabytes,
which on the one hand is actually fairly dense
for something which is written to a visual format.
But on the other hand, when you're
accustomed to one terabyte USB sticks and so forth,
it seems a little worrying and something you have to sort of recalculate around.
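As a quick back-of-the-envelope check on those figures, the reel count and per-reel capacity line up with the 21 terabyte compressed snapshot mentioned earlier:

    # rough arithmetic on the numbers from this conversation
    total_tb = 21      # compressed snapshot size
    reels = 186        # number of film reels
    print(round(total_tb * 1000 / reels))  # ~113 GB per reel, consistent with the ~110 GB figure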
Yeah. And so a quick note for the listener.
If you're curious, do I have code in the GitHub Arctic Code Vault?
Just go to your profile. They've added now a badge for everybody who does.
And if you were active during that time period at all, then likely you do. But you can be sure, because they'll even list, if you hover over that badge, a short list of the repos that you contributed to which, yes, do have code in the Arctic Code Vault.
Very cool. When Adam and I were looking at that, because we started to notice this badge on some people's profiles, we were thinking it was just like, if you contributed to Ruby or Rails
or the Go programming language or
NPM or these very important repos
then you might
be an Arctic Code Vault
contributor. Then we both realized
we were both Arctic Code Vault
contributors. Oh, cool!
It must have been not that because we aren't contributing
to those large projects and it's very interesting
to hear that decision was made.
I think it was a very wise one to say, if it's active, we're going to snapshot it, because that's smart for many reasons.
Actually, not everyone.
I have co-workers at Happy Fun Corp who do not have the badge, because they're professional software developers, but they work in private repos day in, day out, and so forth.
And I got a couple of comments like, could have told us, Jon.
Could have mentioned that this was happening.
Because they missed the boat.
Yeah.
Or I guess it's a train or an airplane.
I don't know how you get out there.
What's up, friends? When was the last time you considered how much time your team is spending building and maintaining internal tooling? And I bet if you looked at the way your team spends time, you're probably building and maintaining those tools way more often than you thought. And you probably shouldn't have to do that. I mean, there is such a thing as Retool. Have you heard about Retool yet? Well, companies like DoorDash, Brex, Plaid, and even Amazon, they use Retool to build internal tools
super fast. And the idea is that almost all internal tools look the same. They're made up
of tables, dropdowns, buttons, text inputs, search, and all
this is very similar. And Retool gives you a point, click, drag and drop interface that makes
it super simple to build internal UIs like that in hours, not days. So stop wasting your time and
use Retool. Check them out at retool.com slash changelog. Again, retool.com slash changelog.
So we were talking about storage format,
and many of us have run into the scenario
where you think you've backed something up,
and then you wait a few years,
and you realize that there's nothing in the world
that can read that anymore,
whether it's Betamax or it's been damaged, right?
CD-ROMs, DVDs, they're still out there,
but you go 100 years in the future,
there may not be any CD readers out there that will work.
So I'm sure that was a huge consideration
when you're trying to shoot for 1,000 years.
Absolutely.
That format is super important.
Yeah, and ironically, I mean,
that's one of the reasons of the archive
is to document things like file formats and so forth
for the future.
And fortunately, this is a thing which
the format that we're using, which is sort of hardened microfilm as an oversimplification,
but it's not too much of an oversimplification, is useful for, because ultimately to just get
basic information out of a piece of film, you need some source of light and some magnifier.
So each of those 186 reels is actually in and of itself a self-contained archive.
It starts with human-readable, visible sort of text and pictures,
explaining in several languages what is on the reel and how to access it,
and how to make sense of its contents and an index of the things which are on it,
before going into the more encoded sort of QR code-ish sort of visual data encoding.
Like an instruction manual.
Yes, exactly.
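Piql's actual film encoding is proprietary and far denser, but as a loose analogy for the QR-code-ish visual encoding Jon describes, here is a tiny sketch that turns a chunk of bytes into a single 2D barcode frame using the third-party qrcode library; it illustrates the idea only, not the real format.

    # sketch: encode a small chunk of data as one 2D barcode "frame" (loose analogy only)
    # pip install qrcode[pil]
    import qrcode

    chunk = b"hello, future reader"  # any bytes worth preserving
    img = qrcode.make(chunk)         # returns a PIL image of the barcode
    img.save("frame_0001.png")       # a real reel holds many such frames plus a human-readable index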
What's the physical medium?
The physical medium is a silver halide on polyester film,
which the ISO rates for 500 years, but Piql has a special hardened film, which the Norwegian military has done some initial tests with, and they say it should be longer than that. Piql thinks it could be good for up to 2,000 years.
We're saying 1,000 years out of what seems like a reasonable abundance of caution.
Right.
Yeah.
How do they do that?
This thing will last 2,000 years.
We've tested it for three months.
Well, I mean, they do artificial testing and sort of heat treating and other forms of testing. But to an extent, yeah; I mean, the only way, obviously, you can actually test that something lasts for a thousand years is to leave it out for a thousand years. That said, I mean, as the ISO will tell you, this stuff, silver halide on polyester, is widely considered to be one of the most stable formats around, and it's not going to be going anywhere anytime soon, particularly if stored well.
And these are in boxes,
and the boxes are wrapped in aluminum film,
and the aluminum film is in a steel vault,
and the vault is in a coal mine,
and the coal mine is in an Arctic mountain, etc.
Seems pretty safe. So the conditions should be good?
Yes, we'd like to think so, yes.
Until a meteor hits that mountain,
that particular mountain.
Well, there is actually another backup.
We're taking a couple of reels
with the 15,000 most starred repos on GitHub
and also a random sampling of just all other repos
because we still wanted to include
some of the sort of inclusive, democratic,
everyone thing, even in these,
what we're calling the greatest hits versions.
And we're going to give those to libraries.
So we're intending to give those to various, you know,
more traditional archives and libraries
in other locations around the world.
Yeah, that's interesting because I did read
from some of your marketing copy.
You say this protects the priceless knowledge
by storing multiple copies on an ongoing basis
across various data formats and locations.
And I was like, and locations?
So I thought maybe this Arctic storage vault is just the first of multiple locations. But is that referring to like
the Wayback Machine and these other libraries? Or do you think you'll say, well, we got one in
the Arctic, how about the Antarctic? Or how about the equator? That'd be a bad place to store it.
Well, yes, to all of those, maybe. We don't really have, like, a fixed, you know, formal plan for the next snapshot, but I personally expect that there is going to be a next snapshot, you know, five years from now, maybe. We're working with Project Silica, which is this kind of amazing Microsoft Research project that uses femtosecond lasers and 5D polarized light technology to store enormous amounts of data on quite small platters of glass.
So that's, you know, a possible format of the future.
That's theoretically good for 10,000 years, because obviously 1,000 years isn't good enough.
You know, we have to... but it's, you know, a little uncertain what the next snapshots will look like.
But the general idea is that another way to get redundancy is to have multiple snapshots in multiple different locations.
So potentially more locations coming. What was the process to get them out to this
particular place? You mentioned it was February 2nd, 2020. The snapshot was taken. You had 186
reels put in boxes. Were these just, you slap a FedEx shipping label on them or how do you get
them up there? Originally we were going to go with them.
And in fact, we went, we being a small team of GitHub people,
went last year to sort of investigate the site,
put an initial reel with 6,000 repos in,
you know, sort of proof of concept,
prepare for the announcement, that sort of thing.
So we did go to Svalbard, go to the coal mine and so forth.
And the plan was to return for the actual deposit this year.
But then the pandemic broke out, which, as you might imagine, kind of confused the whole international logistics
part of the operation. Fortunately, Pickle is based in Norway, and Norwegians at the time,
only Norwegians could go to Svalbard, which is still COVID-free, by the way. There's not been
a single case. And it's famously quasi-illegal to die on Svalbard, so that's good.
Wait, wait, wait, wait.
Quasi-illegal to die.
Please unpack that.
Tell me what that even means.
I think this is kind of an apocryphal, maybe a too-good-to-check kind of story.
But they don't really have any facilities for death on Svalbard.
There's no morgue.
You can't bury anyone in the permafrost and so forth.
And so generally, when there's a serious medical condition,
you get sent back to the mainland one way or another.
That's hilarious.
Somewhat morbid, but interesting.
So in this particular case, you know,
our Norwegian partners wrote the data to film.
And then Svalbard is more accessible than people might think.
Until recently, it was growing into a significant tourist destination.
And there are flights there a couple of times a week still. So it flew in the belly of the twice-weekly flight to Svalbard. It's, I think, roughly the size of a Toyota Prius; for some reason, that's the volume unit we started using, you know, the GitHub archive is about the size of a Prius. So they basically packed this Prius into the belly of a passenger plane, flew it up to Svalbard, and then sent it up to the mine in the mountain itself, which is actually not far above the airport.
So it kind of overlooks the runway.
How many people live up there?
3,000.
It's variable, because there's a university there, and so there's sort of an occasional university population, but 3,000 seems about right.
It's certainly by far, given its latitude,
it's by far the largest thing north of about 70 degrees.
So you can't die there, but could you visit the mine
and see the GitHub boxes, or at least a sign that says
GitHub lives here, or something like that?
You can visit the mine.
The vault itself is locked and sealed off,
but I believe they do run, or at least they were,
running tours to the mine itself. So you can get reasonably close. Similarly, the famous Global
Seed Bank is right around the corner. You can walk from one to the other. It's about a mile
distance between the two, from the mine to the seed bank and vice versa. So you could do a sort
of twofer survival tourist destination. I'm not familiar with the seed bank. Is that like where
they keep a bunch of seeds for things?
Yeah, the Seed Vault. So every country has a seed bank to sort of maintain seeds of the various agricultural plants that they use, and then the Global Seed Vault is sort of the backup to the backup for those seed banks. And it has this very dramatic wedge-shaped building, also in Svalbard.
Yeah, I have seen the picture of that building. It's very cool. So there's no hotfixes. You can't do any bug fixes, though.
So you can't go up there, extract your code,
fix something real quick.
Cause that's it.
That's in history forever.
You know, I'm sure I have bugs in there.
Oh, I definitely do.
I fixed one the other day and I actually thought
the stupid little typo is now eternal.
So yeah, there's one of my personal repos
that happened to get captured up there.
But I mean, I guess that's part of the appeal.
Maybe in the future they'll look back and think,
in these antiquated days of software
development, they still had bugs. They didn't have
AI to automatically fix them while they were working.
How fascinating. Maybe something like that.
They might think, we don't know who this
Jared Santafel is, but he was a real idiot.
He was a real bad
programmer.
Oh, man.
Well, speaking of that, I guess, can you opt
out? Can you say, yeah, not for me.
This code is just, it's public, but
I don't want it to be in perpetuity.
You can opt out. In fact, you could
between the announcement and the snapshot.
We got very, very few opt
out requests. I forget how many, but
it was fingers of one hand,
something like that, I think.
But it is possible, and there's an option on your settings page
in GitHub somewhere now to opt out.
I don't think it's, you know, I think most people are mostly opt-in.
They like the idea of their stuff going into the future,
and they like the idea of sort of the broader perspective
of capturing, you know, not just the open source
on which society relies.
So that's obviously crucial as well,
and that's the part that may be of medium-termish practical use,
but being part of this big capture of not just software,
but kind of the tech community and to an extent a way of life
that is being snapshot and put up there.
Yeah.
How would you imagine somebody finding this or unpacking it a thousand years from now? What would they do with this archive? Would they read the code and try to run it?
I think it would be of primarily historical value. I think people might try and run the code again,
especially since there are some games there.
The Internet Archive, you can go to the Wayback Machine right now
or at least to archive.org and play the initial Prince of Persia,
for instance, which is very popular.
And I think in the same way that 8-bit became a weird aesthetic
not so long ago.
It's possible people will want to craft emulators
of today's antiquated computers
and run software the way it used to be in the old days
in the same way that, I don't know,
people build 19th century train sets
or mock train sets today.
There's also the possibility that this will actually be useful.
A thing that people don't realize necessarily is that software is surprisingly ephemeral.
Like it's all on hard drives.
Hard drives don't last that long, you know, like years, maybe decades.
Backup tapes are also, you know, they're good for decades.
And over the long run, we kind of expect everything to get copied to the next storage medium and
the next storage medium and so forth.
And probably most of it will, but also you're almost certainly going to have losses along the way. So it's easy to envision, you know, some piece of industrial software that something vital has been running on for the last 40 years without anyone noticing, that suddenly we need to patch; or some data format that's suddenly important for some high-profile legal case or something that we need to be able to access, that sort of thing.
And someone going back and saying, wait, where is that code from 2017
describing this obscure data format that looked like a good idea at the time
for about two years in 2067?
Svalbard! Svalbard!
Kind of like the beginning of an Indiana Jones movie, right?
He's got to go find the thing. I mean, it could be sort of a Rosetta Stone. If there were other code that was found that they didn't know how to interpret, they'd know how to execute it, because this has those instructions. Maybe there's an opportunity there to find the runtime that it ran against, or fix that dependency problem, like, hey, is all of npm in here? Maybe we can actually resolve all these dependencies.
That's what I think about when it comes to execution,
because a lot of the code up there,
you're not vendoring your dependencies in your repo.
So a lot of the source code is there.
Are you taking binaries, too, executable code,
or would everything have to be built from source
in this hypothetical situation
where someone's trying to restore something?
There are some binaries up there.
The repos with a lot of stars, again, mostly it's just source.
I am kind of curious myself just how many copies
of node modules we wind up capturing
because I thought seriously
about excluding that from the archive
but I decided not to in the end.
And even that might have some value
an implicit snapshot
of the various dependencies along the way and how those changed.
But it wouldn't shock me greatly if, you know, there are a lot of node modules just raw up there, duplicated over and over and over again.
But that might be useful as well.
There is also going to be a master index. So if a dependency is itself open source and public, which, you know, most open source is these days, and it's not some sort of private repo, then you should be able to, at a given time, on a good computer, sort of reconstruct most of the dependency tree for any given project.
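To make that concrete, here is a hypothetical sketch of reconstructing a Node project's dependency tree from archived repos; the directory layout and the name-to-path index are invented stand-ins, not the archive's actual master index format.

    # sketch: rebuild a (partial) dependency tree from archived repos' package.json files
    import json
    from pathlib import Path

    ARCHIVE_ROOT = Path("/mnt/archive/repos")            # hypothetical extraction location
    INDEX = {p.name: p for p in ARCHIVE_ROOT.iterdir()}  # hypothetical: package name -> repo dir

    def dependency_tree(pkg: str, seen=None) -> dict:
        seen = seen if seen is not None else set()
        if pkg in seen or pkg not in INDEX:
            return {}
        seen.add(pkg)
        manifest = INDEX[pkg] / "package.json"
        if not manifest.exists():
            return {}
        deps = json.loads(manifest.read_text()).get("dependencies", {})
        return {dep: dependency_tree(dep, seen) for dep in deps}

    print(json.dumps(dependency_tree("left-pad"), indent=2))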
What's up, friends?
Are you looking for a way to instantly troubleshoot
your applications and services running in production on Kubernetes?
Well, Pixie gives you a magical API to get instant debug data. And the best part is, this doesn't involve code changes, and there are no manual UIs, and this all lives inside Kubernetes. Pixie is an API which lives inside your platform. It harvests all the data you need, and it exposes a bunch of interfaces that you can paint to get the data that you need.
And it's essentially like a decentralized Splunk.
It's a programmable edge intelligence platform,
and it captures metrics, traces, logs, and events, all this without any code changes.
And the team behind Pixie is working hard to bring it to market for broad use by the end of 2020.
But guess what? Because you listened to this show,
I'm here to tell you how you can get your hands on the beta today
by going to pixielabs.ai.
Links are in the show notes,
so check them out to click through to the beta and their Slack community.
Once again, pixielabs.ai.
And look forward to a Pixie day coming soon.
So one of the things I read about in the documentation around this
is this idea of a tech tree.
And maybe you've already described this with the manuals,
but there's, like, a capital-T, capital-T Tech Tree,
and I wasn't sure exactly what that is.
Can you describe what the tech tree is
and how that concept plays into the archive?
Sure, yeah, and I'm glad you mentioned that because it is a distinction. It's not the same
thing as the manuals, as the guide, and the sort of instructions for decoding that's on every reel
that turns every reel into its own self-contained archive. The tech tree is a reel, possibly two; we're still compiling it, and we're going to add it to the vault once it's done. It's just of sort of larger, higher-level explanatory stuff,
mostly works, you know, pre-existing works, books and so forth, but to explain, you know,
what software engineering is, what an algorithm is, what a computer is, you know, what, how you
would hook together transistors and op-amps and so forth to form a NAND gate, how NAND gates would
make up, you know, ultimately a small microprocessor, that sort of thing.
So in theory, there'd be enough information that you could, in fact, reconstruct a fair amount of modern technology from the information, you know, on those various books.
Now, this is a very romantic and compelling image.
I should mention also, in all honesty, that our advisors were like, yeah, this is cool, but we are living in what is going to be the best documented era in all of history already.
Like, it's very unlikely we're going to have a future in which these books, many copies of these books don't already exist sitting around in many other physical libraries that are kind of easier to get to.
But we figured it would be useful as context and general understanding for the source code, which goes with it as well.
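As a tiny illustration of the bottom-up style of explanation described here (not taken from the tech tree itself, just a sketch of the idea), all of the basic Boolean operations can be expressed using nothing but NAND:

    # every basic logic gate expressed purely in terms of NAND
    def nand(a: int, b: int) -> int:
        return 0 if (a and b) else 1

    def not_(a):    return nand(a, a)
    def and_(a, b): return not_(nand(a, b))
    def or_(a, b):  return nand(not_(a), not_(b))
    def xor_(a, b):
        n = nand(a, b)
        return nand(nand(a, n), nand(b, n))

    # truth-table check; adders, and ultimately a CPU, come from wiring these together
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, and_(a, b), or_(a, b), xor_(a, b))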
So did you end up packaging that stuff up, or this is an idea that's ongoing?
It is going to be packaged up. We actually just released it for public commentary last month,
and I've been incorporating pull requests and issues on that recently. So we're going to compile
those books. We're going to put visual copies. This will all be human-readable, not encoded,
for obvious reasons, so that you get sort of the background
to begin with.
Except for Wikipedia, because that's too big.
But we're going to put a snapshot of Wikipedia.
I was just going to ask that.
It seems like that would be the easy button for this.
Just put Wikipedia up there, and you're done.
One of the highest-rated comments in the video
when we first released the video last year was,
don't forget to store Stack Overflow next door.
But Stack Overflow is also Creative Commons, so we are in fact
going to get a dump of Stack Overflow and drop it in the tech tree as well
next to Wikipedia, yes. That's awesome. What else is going in there?
Wiktionary, a couple of other things, and a list of about 200 books, mostly but not
exclusively technical, all of which is available on the Archive Program repo
at GitHub, which I
think is github.com slash
github slash archive dash program.
We'll snag that one and link it up for those interested in seeing all the things
inside the tech tree. Are you guys taking suggestions?
We are actively taking suggestions right now.
We're incorporating some at the moment.
We still have to sort of work
with publishers since we're literally making a copy.
Copyright becomes an issue, obviously.
So we have to figure out the rights issues with a bunch of these and so forth, which is one reason it's been a little slower than the rest of the project.
But we are actively working on compiling that
and adding it to the vault.
Okay, very cool.
Yeah, I found it, and you have it broken out into different areas, such as hardware architectures, hardware development, electronic components, and you have books. You have articles, I assume, written on modern software development, and under there you have these different books that are going to be included.
So that's very interesting.
What's the next iteration of this then?
What's the timing?
I guess, like you said, the pandemic has changed timings.
Was the hope to be once a year
you'd ship another thing to the Arctic
or would it be every once in a while?
When would you be updating the vault?
I think we're still figuring out
the roadmap. I wouldn't expect every year.
That seems a little...
I don't think we need that much
frequency. The sort of first
deposit captures the last
hopefully 20 or 30 years of
software.
I could see every five years.
Yeah. And I could see different data formats, again, for each one; sort of redundancy through variability, that kind of thing. The tech tree is also a thing which I think will iterate over time. Like, the romantic image of the tech tree, and one that we do aspire to, is like an actual manual for rebuilding, you know, technological civilization from scratch. The v1, as with the Long Now's Manual for Civilization, is existing works, but I could see sort of things being constructed for this purpose, kind of courses, over time. But that is hypothetical, somewhat pie-in-the-sky stuff for now.
Right.
I think the roadmap, at least the one that I personally have in mind, is snapshots every five years or so.
Oh, it looks like you're including Wiktionary as well.
That's correct, yes.
I'm just over here looking at all the...
This is a pretty big tech tree.
You have a lot of copyrights to get figured out here, don't you?
I guess you just speak to each publisher once
and you probably get all the permissions you need for that publisher.
Yeah, that's the idea.
I mean, we've got some already.
Some have been extremely helpful, like O'Reilly, Packt.
They've been great.
And we're having trouble just sort of working up the whole list of publishers.
There's quite a long list if you go through them.
But we'll get them one at a time as time goes on, hopefully.
So somewhat interestingly related with regards to the cultural and technical context of the time period,
all of the changelog and our whole network's podcasts, transcripts are open source on GitHub.
And so they are undoubtedly also in there.
Yes, that's true.
Everything up until February 2nd will be recorded word for word.
So you have thousands of conversations of technologists through the years associated with that.
That's the kind of stuff that I think is interesting in a tech tree as well.
What were people saying to each other?
Yeah, and actually that's one of the things that I get excited about.
There's a lot of source code. Source code is very important.
There's the fundamental underpinnings of open source, which is a cornerstone of technology and civilization, you know, and that's critical as well.
But also, as we all know, people use GitHub for all kinds of weird things.
There's recipes on there.
There are books on there.
There are sort of random notes on there all over the place.
And the extent to which historians of the future will find this a weird and unexpected treasure trove is kind of appealing.
All the things you'd find in there would be quite an interesting thing to dive into.
Exactly.
What about the issues? So, lots of conversations go on on GitHub that are about the source code. Is GitHub Issues going to be involved anywhere?
We did indeed also pull the issues, and the issues are in there.
So how do you decide which issues to pull? Of the repos you decided on, you just took all the issues?
Yeah. Issues, it turns out, are not that spacious. They're mostly text.
Right.
So, yeah, the issues were quite compact.
They were not really a significant figure in the sizing.
And all the comments on the issues as well.
That's correct.
I'm just enjoying the fact that there's so much drama that's just been immortalized in the Arctic Code Vault. Developer drama.
I was actually just thinking that.
The future might look back and think, wow, this was a testy and easily aggravated time.
What's wrong with these people?
Or they might look back and think, man, they were so civilized back then.
Look how they reacted.
They were so passionate.
They really cared about their bugs.
What's with these two-dimensional emoji?
We use four-dimensional emoji now.
Wow.
I mean, we got the Unicode, so all the emoji are in there as well.
So I look forward to the history of looking back on those.
The guides, I mean, there are some translations; as I mentioned, Arabic, Spanish, simplified Chinese, and Hindi. Most of it, and most of the tech tree, is in English, at least this iteration, and certainly that may change in future iterations.
But a thing which surprised me and I thought people might be interested in is that we have this great linguist as a consultant, John McWhorter at Columbia.
And he said that people assume that since English has changed a great deal in the last thousand years, they assume that it will change a great deal in the next thousand years.
But he thinks the evidence shows that's actually quite unlikely.
His estimation is that
English is more or less stable now. You know, people learn it younger, everyone's more
interconnected. It's not like in little islands evolving off on its own. And so the expectation
is that English many hundred years from now will be as different from today as like Jane Austen's
English is from ours. You know, a little weird, a little courtly, a little formal, but not that different.
So his exact quote was,
as uncool as it may be, you'd be all right with just English.
That's interesting.
I assume that it would move.
So, I mean, it's such a long time.
You'd assume it would move to where it would at least be difficult to understand.
It was impressive to me as well, actually.
We did cover our bases by, after all,
adding these other five translations of
most significant languages.
We also, in fact, just
to be on the paranoid safe side,
each reel begins with
the Universal
Declaration of Human Rights in
every known written language in Unicode.
So that's several hundred.
So even if only some obscure Basque language survives,
then we do, in fact, have a Rosetta Stone for that on each reel as well.
I wonder if there was a sense of dread when it came time to actually ship.
Because in software, we have the advantage over most disciplines
of just shipping iterative improvements at all times.
And I remember talking with folks who wrote code for NASA and stuff
where it was like, this had to work.
This was our one chance.
And you ship it to some satellite or some orbiting thing.
And it was just like, even back when they had to package up software
on a compact disc and put it
into a box and sell it to you in a box, that idea of, like, this is the final... what did they call them? Gold, uh, gold master? I don't remember what they called them back then. But, like, that was the version, and sure, we could ship patches maybe, but they weren't going to get in for three months. This is like, you get one shot at this snapshot. You know, you're putting in the Declaration of Human Rights and stuff, these things where you're like, what else should we shove in here? Did you have that moment where you're like, no, we're gonna close the vault, we're just gonna stop shoving stuff in? Or was it difficult?
No, no, no, we totally did. It was, you know, it was difficult to say, okay, this is it. It was useful that we actually had a fixed date, like, this is going to be the snapshot, you know, and we set that fairly early on, and that was good. It actually calls back... so my background, my degree is in electrical engineering, and I did a couple of co-op terms (I went to Waterloo in Canada, which does co-op) of chip design at Nortel and at Hewlett-Packard, before I went into software and spent the entire rest of my career after I graduated in software, hardware being much too unforgiving and permanent. But the chip design was a lot like that.
You're working on this VHDL, and you've
got it working, and you've got the test working, you think.
And then you actually send it out to be fabricated
somewhere and burnt into silicon.
And if you screwed up,
there's nothing that can be done.
Yes, exactly.
And so it reminded me of that for the first time in a very long
time, that you are committing this to the world, whether you like it or not.
Yeah, as a software developer, I assume that you've all but forgotten that feeling,
because don't we have the freedom right now to just not really worry about that?
It was pretty unusual, the feeling of perpetuity, the irrevocability.
It had been a long time since I'd felt that professionally.
The permafrost.
Yes, excellent metaphor. Excellent metaphor, the permafrost.
Well, that is really cool.
Anything else about the program that we haven't touched on
that I haven't asked about that you would like to discuss?
I mean, it's really awesome,
and I really appreciate you sharing the details of this program
and all the work you did to archive these things.
Anything we haven't touched on that you think we should?
I think we have captured
most of the things. I mean, I want to stress just how
important and how useful our partners have been.
You know, the Internet Archive, Software Heritage,
Stanford, the Bodleian, etc.,
etc. I'm sure I'm leaving
someone off now who really shouldn't be left off.
This is inevitably the way when you try and
enlist people. But, you know,
I think it was really important that we cast a broad
tent and tried to work with as many of these organizations. The Long Now. The Long Now have been great.
Having a conversation with anyone at the Long Now is always a mind-opening experience.
Even if it's a relatively simple one. And I guess Project
Silica, another partner. Hopefully that's the longest.
I think it was important to treat this as
not as a thing that one company is doing for one company, but that a broad consortium are doing,
you know, and hopefully as a general goodwill thing. I mean, this is,
this is not a project which has an ROI. This is a project which, you know,
we think is actually important, you know, or could be important.
It's sort of a weird project in that you sort of hope it's not really that important. In a perfect world, you know, all this data will be saved anyway, and we'll just sort of grab it off the internet a thousand years from now, and no one will care about it.
Right, but you never know.
Yeah, exactly.
Anyone who works on backups knows it's important,
even if it's not used.
So you say there's no ROI on this.
What was the magnitude of the I, at least?
You don't have to share
specific numbers,
but was this a large investment?
Is the mine,
is the rent high on the mine?
How much went into
this kind of project?
I mean, I'm pretty sure
I'm not supposed to share numbers,
but I can say I think
it was more economical
than people assume.
And in fairness, Piql, who are obviously the partner I really should have mentioned, and the Arctic World Archive, were very understanding in working with us, realizing that this was sort of a beneficial project more than a private-benefit project.
And so there's no sort of rent.
We sort of paid up front for storage in perpetuity in the World Archive, which is useful and is probably quasi-eternal
in as much as things are eternal these days,
in that it's owned by the Norwegian government.
The Svalbard Archipelago actually has its weird own legal structure.
It's quasi-open; anyone can go to Svalbard.
You don't need a passport to go there.
Anyone can work there.
And it's governed by its own special treaty,
which was signed after World War I,
which made it a place officially free of war and sort of free habitation for any human being that can get there.
Is it sovereign then, or is it underneath?
Well, it's definitely Norwegian, but it has its own special legal status as sort of extra-national territory as well.
I mean, I am not a lawyer.
My wife is a lawyer, and I'm sure
she'll be very upset at
me misrepresenting the
legal status.
Well, you go ask her
and get back to us.
Yes, that was my crude layman's takeaway from the strange legal status that Svalbard has, which is, you know, kind of an international zone of, you know, peace, freedom, and availability. So it's sort of an optimal place to store something. You know, it's not likely that conflict's going to break out there anytime soon.
Right.
It's optimal for storage, but suboptimal for living,
which is why there's only about 3,000 people there.
And no one's breaking down the doors, even though it's 100% COVID-free.
That is correct, yes.
Awesome, Jon.
So like I said, a storied career.
You've done a lot of things.
This is a very cool project. I would think it's a highlight of your career; at least, if it was my career, it would be a highlight of my career.
Absolutely, yes.
Yeah. What's coming next for you then? Can you top this one or you go back to building software products? What's next?
I mean, I am going back to building the software now. I think it's important to sort of keep doing the thing that you care about. And I do think software is important.
I'm working on a new novel.
Who knows what will come up with that.
This year has been pretty bad for plans in general, as you may have noticed.
So I've had friends calling it the great reset year.
So we will see what happens in 2021.
But, you know, I expect to stay involved with the archive program on sort of an indefinite ongoing basis.
And hope to work on the next iterations of it as well.
Awesome. Well, thanks for coming on the show and telling us all about it. We appreciate it.
Hey, thanks very much. It was a great talk.
That's it for this episode of The Changelog. Thank you for tuning in. If you haven't heard yet, we have launched Changelog++. It is our membership program that lets you get closer to the
metal, remove the ads, make them disappear, as we say, and enjoy supporting us.
It's the best way to directly support this show and our other podcasts here on changelog.com.
And if you've never been to changelog.com, you should go there now.
Again, join Changelog++ to directly support our work and make the ads disappear.
Check it out at changelog.com slash plus plus.
Of course, huge thanks to our partners who get it,
Fastly, Linode, and Rollbar.
Also, thanks to Breakmaster Cylinder
for making all of our beats. And thank
you to you for listening. We appreciate you.
That's it for this week. We'll see you
next week. Thank you.