The Changelog: Software Development, Open Source - Let's archive the web (Interview)

Starting point is 00:00:00 What's up friends, welcome back, this is The Changelog. Software moves fast, so keep up. On today's show we're joined by Nick Sweeting, the archive guy, talking about the importance of archiving digital content, his work on ArchiveBox to make it easier, the challenges faced by Archive.org and the Wayback Machine, and the need for both centralized and distributed archiving solutions. Nick also shared some cool stories: his personal experiences with internet censorship via the Great Firewall while living in China. Okay, we got lots to cover today.

Starting point is 00:00:45 A massive thank you to our friends and our partners over at Fly.io. Yes, that's the home of ChangeLog.com, and it's also the public cloud built for developers who ship. That's us, that's you. Learn more at Fly.io. Okay, let's archive. What's up friends, I'm here with Kurt Mackey, co-founder and CEO of Fly. As you know, we love Fly. That is the home of changelog.com. But Kurt, I want to know how you explain Fly to developers. Do you tell them a story first? How do you do it? I kind of change how I

Starting point is 00:01:21 explain it based on almost like the generation of developer I'm talking to. So like for me, I built and shipped apps on Heroku, which if you've never used Heroku is roughly like building and shipping an app on Vercel today. It's just it's 2024 instead of 2008 or whatever. And what frustrated me about doing that was I didn't I got stuck. You can build and ship a Rails app with a Postgres on Heroku the same way you can build and ship a Next.js app on Vercel. But as soon as you want to do something interesting, like as soon as you want to, at the time, I think one of the things I ran into is like, I wanted to add what used to be like kind of the basis for Elasticsearch. I want to do full text search in my applications. You kind of hit

Starting point is 00:01:59 this wall with something like Heroku where you can't really do that. I think lately we've seen it with like people wanting to add LLMs kind of inference stuff to their applications. On Vercel or Heroku or Cloudflare or whoever these days, they've started like releasing abstractions that sort of let you do this, but I can't just run the model I'd run locally on these black box platforms that are very specialized. For the people my age, it's always like,

Starting point is 00:02:22 oh, Heroku is great, but I outgrew it. And one of the things that I felt like I should be able to do when I was using Heroku was like, run my app close to people in Tokyo for users that were in Tokyo. And that was never possible. For modern generation devs, it's a lot more Vercel based. It's a lot like Vercel is great right up until you hit one of their hard line boundaries. And then you're kind of stuck. The other one, we've had someone within the company. I can't remember the name of this game, but the tagline was like five minutes to start forever to master. That's sort of how we're pitching fly is like you can get an app going in five minutes, but there's so much depth to the platform that you're never going to run out of

Starting point is 00:02:55 things you can do with it. So unlike AWS or Heroku or Vercel, which are all great platforms, the cool thing we love here at ChangeLog most about Fly is that no matter what we want to do on the platform, we have primitives, we have abilities, and we as developers can chart our own mission on Fly. It is a no-limits platform built for developers, and we think you should try it out. Go to fly.io to learn more.

Starting point is 00:03:24 Launch your app in five minutes. Too easy. Once again, fly.io. We are here with Nick Sweeting, a full-stack software engineer in Oakland and founder of ArchiveBox.io. Nick, welcome to The Changelog. Thanks for inviting me. It's a pleasure to be with y'all. Pleasure to have you. You want to archive stuff. Let's archive stuff. Let's be pack rats. Let's start with why archive? I mean, isn't that just a lot of work and no gain? Why archive stuff?

Starting point is 00:04:24 Yeah, it's a totally valid question. I think for most people, the answer is maybe you don't have to archive stuff and that's okay. Archiving is sort of a curation role and some people are drawn to it and some people are not. And I think that responsible archiving involves some amount of curation labor. It doesn't have to be a lot of labor, but it's the labor of choosing what's important and what is not. And that can be just for yourself. It can be for your

Starting point is 00:04:50 family. It can be for your friends. It can be for your academic institutions. But it is some labor that you're taking on by deciding to preserve something. And just acknowledge that and pat yourself on the back. And if you do decide to archive, keep in mind that it's not just a one-time decision. You're going to have to decide, oh, do I move this data from this hard drive to the next one when it inevitably gets old? Do I give this data to my kids? Will they care about it? Do I give it to a library? Where does it go next? What should I do if someone asks me to delete it and they don't want it preserved? And all of those things are sort of things that you have to think about. But if you're excited about archiving, you know, don't weigh yourself down with all of that. Just save one or two things and see if you like it.

Starting point is 00:05:32 When it comes to archiving the web or digital artifacts, I'm not sure how broad ArchiveBox's ambitions are, but I thought we had archiving the web kind of figured out, like there was a whole group of people who are enthusiastic about it and still are enthusiastic about it. Of course, I'm referring to archive.org, the Wayback Machine, and that entire operation, which felt like the web's archive was in good hands. And all you have to do is donate to that, those good hands, or support those good hands and hope that everything continues as normal. But recently it seems like they've been going through trials and tribulations. I'm not sure the exact details of who and why have been attacking the Wayback Machine

Starting point is 00:06:19 and trying to take archive.org either offline or somehow ruin it, but it seems like maybe that's an assumption that is not well-based. What do you think about that? I think archive.org is doing an incredible job. They're tasked with a really hard problem of doing this labor that I just described, but at a massive scale for the entire internet. They effectively become moderators for the entire internet. Because if someone doesn't like the content that they've decided to preserve,

Starting point is 00:06:48 which is basically everything they can get their hands on, they get personally attacked and they have to take the flak for it. So it's a really, really tough position that they're in as the sort of centralized curators of everything. And inevitably, they're going to get attacked by people who don't like stuff. And I think that they've done an incredible job so far, but there's limits to a central moderation team that has to be able to manage and defend

Starting point is 00:07:12 every piece of content on the internet from attack. So they've undergone attack recently. Do we know the motivations of these attackers? Is it simply, we don't know yet? Adam, do you know? Nick, do you know? I don't know. That's why I'm asking the question in earnest.

Starting point is 00:07:28 I don't know the answer to this. They've actually been going through a lot of stuff. I mean, they had not just like a DDoS attack, a situation where you have somebody trying to take it down or keeping the site offline. They've had a major copyright case loss recently where they were trying to archive things that, you know, I think

Starting point is 00:07:45 we as a society want these things to be archived. And like you had said, Nick, this might be part of that curation aspect, like just us as humans wanting to preserve, not so much to break copyright, but there was some breaking there. So there's a point of breaking, I suppose, or a breaking point with the Internet Archive where you've got copyright concerns, things like that. They've had various versions of attacks that isn't just simply an attack or an attack vector trying to take it down. It's beyond that. I would say one thing about the copyright case, if you'll allow me a moment. Yeah, please.

Starting point is 00:08:20 Their stance is pretty admirable. I think I originally was quite worried about it. And I commented online and was like, yeah, why are they risking the whole Internet Archive to take this stance? It seems like they should spin out a separate company if they really want to fight the publishers on this. And I talked to Brewster about it, and I've sort of come around now.

Starting point is 00:08:40 Who's Brewster? Brewster's the founder of Archive.org. Sweet, okay, great. Incredible character. It's been his life's mission to make all of human knowledge available for everyone. And I think he's doing a great job. But his take on it was that he's personally wealthy from a dot-com era sale. And he wants to do good things with that money.

Starting point is 00:09:02 And part of that is rebuking publishers when they start really crossing lines around content ownership. And the archive.org is actually properly legally structured so that these things are isolated. He's not risking archive.org and the internet archive by doing this, by taking this fairly strong stance against publishers forcing content licensing as the only option upon ebook readers. So basically, publishers were saying, we're not going to sell you an ebook anymore. And this effectively makes libraries lending ebooks impossible, because you can't reshare the license to an ebook. They want to charge for every view of the ebook. And so libraries can no longer lend ebooks. And so he just thought that this was an

Starting point is 00:09:45 egregious line to cross. And he's like, okay, you know, as someone fairly well off, who cares a lot about this, and who cares a lot about the freedom of information access for future generations, I can afford to take a stance and lose sometimes on cases like this. And I think that this case needs to be very publicly fought and won or lost. And it's not jeopardizing the rest of the Internet Archive. I think that that message doesn't get out enough. So they did the right thing there. And they have this software that does, it's CDL.

Starting point is 00:10:13 You may know this, Nick. It's Controlled Digital Lending is what this program, it's not just software, it's a program they had to allow this. I wasn't sure of the details of which books. I think it was mostly older books, but it was essentially ruled that it was fought in the Second Circuit Court recently in September. That's why this is so fresh in my brain, at least the details to some degree, basically concluding that this practice of this controlled digital lending that the Internet Archive is doing,

Starting point is 00:10:43 it harmed the publishers' markets by providing free digital copies of books. And, you know, I don't know those specific details, like which kind of books. Were they new? I mean, obviously, if they're new, that doesn't make any sense. But if they're older, or it's sort of like almost public domain, maybe that makes sense. You know, certainly if it's public domain, it makes sense. But yeah, I mean, I think at that point you don't have much of a leg to stand on in terms of the fight. But, you know, I'm for freedom of information. I'm not for freedom of information insofar as it takes away a corporation's ability to control their own work and their own financial destiny with the things they've helped create in the world as information. But there is a line there that at some point we have to adjust. And I applaud them for trying to adjust it.

Starting point is 00:11:32 Yeah, I think they broadly agree with not depriving publishers of content ownership. That's not really the issue they're fighting. They're more fighting that the publishers crossed a line by forcing licensing as the only option for content access and that that was not where the line was before that they moved it. And this is their way of fighting back. And that this there's broadly been a sort of Overton window shift of what is acceptable content release policy in the first place. And the publishers have successfully moved that to licensing only. And you can no longer own anything. And that's what they were fighting.

Starting point is 00:12:08 So yes, they did cross some lines with the controlled digital lending, where they were not counting how many copies they lent out. And I think that they expected to get sued for that. I think that they wanted to take a fairly strong stance there by saying that the way that the publishers are releasing the content in the first place is unacceptable. There's a, we can go 17,000 more layers deeper on this. There is an article on the EFF, or I should just say EFF.org, the EFF

Starting point is 00:12:37 website, Electronic Frontier Foundation, that, you know, gives a few more details. There's four different publishers, Hachette, HarperCollins, Wiley, and Penguin Random House. And the stance basically was that these libraries have paid publishers billions, I'm quoting, libraries have paid publishers billions of dollars for books in their print collections. And they're investing enormous resources in digitalization in order to preserve those texts. And they say the CDL helps to ensure that the public can keep access to those full books that they've bought and paid for, basically. That ensures the usage, digital versions of them, they've already paid for. So it seems like there's some details there for sure. But they've lost that case publicly recently. But again, it's back to several different ways this central point is being attacked, whether it's

Starting point is 00:13:28 in the court of law. Legally or technically. Yeah, which brings us back to ArchiveBox and maybe it's the need for it to be distributed. Yeah, I just think fundamentally that both should exist. I think having big centralized

Starting point is 00:13:43 resources is awesome because centralized moderation is effective. You can keep bad actors out if you take a stance and you don't get dragged down by politics too much. You can do a really good job and you can provide an amazing free public resource for a lot of people, and that's awesome. But we should also have distributed archives that cover all of the things that the central archives can't just from a scale perspective. A lot of different people saving stuff on a lot of different hard drives is always going to be able to save more and know about more content. Not everyone wants to report what they find to the Internet Archive. Maybe you want to save something without announcing to the world that you're saving it.

Starting point is 00:14:21 There are lots of reasons, political, personal. Sure. So when did you start ArchiveBox? And what was the initial inspiration for that? What made you actually get the editor out and start coding? So the initial, I'll start with the initial inspiration. I grew up partly in China. My family moved when I was nine and I did like middle school, high school there, had an amazing time. And I obviously ran into the problem of having censored internet. So, you know, we'd read news articles

Starting point is 00:14:53 and then 20 minutes later you refresh and it's a 404, it's gone. Great firewall. Yeah. So you get used to, just for practical reasons, you get used to saving pages out of your browser or screenshotting them or making PDFs just as a default whenever you find something interesting in order to be able to share it with people there. And so that led to creating a small tool called Bookmark Archiver that I was

Starting point is 00:15:15 just using to auto-download all of my Pocket sort of saved articles. And that was a side project for many years. And I've sort of come back to it over time, adding features here and there. And then I used it. There was a funny security incident when Equifax got hacked. I used it to make a spoof site impersonating Equifax's site and got a whole bunch of viral attention for that. And I was like, OK, this is just a random, interesting side project. It's not actually what I care about working on. But a nice thing to come out of that was a bunch of attention towards Bookmark Archiver,

Starting point is 00:15:50 where a bunch of people were like, oh, I would use this. This seems useful. And then so I've been slowly chipping away at it, adding features over the years. And then I quit my consulting job a couple years ago and decided to work on it full time. And over the last year and a half, I've been building it up full time. Wow. Some layers there for sure. Yeah.

Starting point is 00:16:10 I was thinking about this. Not sure if it's a direct one-to-one, but have you read the book Fahrenheit 451? Yeah. Okay. You smiled. Nobody saw that smile. What made you smile about that? He's read it.

Starting point is 00:16:34 Well, there's a lot of interesting layers to that book that are becoming increasingly relevant, which is kind of terrible. But I don't know. There's a lot of misinformation and disinformation these days. And it's sort of, you know, at the foothills of where Fahrenheit 451 starts, before it's outright deletion of information as a public strategy becoming acceptable. That's my concern. Jared opened up with why it should be archived. You didn't say a fool's errand, but you've said that before in other cases.

Starting point is 00:16:56 I'm sure that you probably didn't say that. No. What would you say then? Is it not important? No, I think it's incredibly important. It's incredibly important. Okay, good. And I think that's why I'm like, let's get ArchiveBox on the show. Right.

Starting point is 00:17:09 I'm a huge fan of Archive.org. I think it's a shame that it's getting so many, you know, problems. And I think that if we can decentralize those problems across a bunch of people, that's probably better. So no, I'm not against it by any means. I don't think it's... Gotcha. A fool's errand. And I do think it's a hard problem and... Laborious, for sure. And expensive and lots of stuff, which is why the software needs to be there.

Starting point is 00:17:30 But go ahead, Adam. I was not trying to, you know, say you were saying something bad or good, but it just, it seemed like "why do it" was the question, like in a negative light. It was the question. What's the point? Yeah, what's the point of this?

Starting point is 00:17:40 I think it's a valid question. Yeah. I think so too. But I think when we look at, you know, when we cross-examine the challenges which we opened up with for the internet archive and then this book or the premise of

Starting point is 00:17:51 this fictitious book in light of today's world and then your history of living in China behind the Great Firewall and the challenges that come from internet disappearing essentially. Like truth is, you know, you can go online, see a price for something, and tomorrow that price changes. But unless you screenshot it or something, you

Starting point is 00:18:10 can't go back to that retailer and say, hey, look, the deal should still be the deal. They're like, nah, we just changed that price behind the scenes or something like that. Like your only truth is the artifact you can claim or that you have a hold of. And I think that's kind of the premise of the desire to archive the internet so that we can preserve it for years to come, but at the same time, just to hold true what's true. I think there's one more public perspective that's pretty common that's maybe worth addressing around why is archiving worth it. A lot of people sort of have the valid idea that, oh, with AI tools or with modern

Starting point is 00:18:47 technology or better tooling over time, we can have our computers just sort of osmosis all of the content and keep track of what's important for us. And we don't actually need to preserve the actual website the way I saw it originally. Like, well, just use a browser extension that sort of ingests it all into a model or, you know, they're training models on the whole internet all the time anyway, why do we need to save the original sites? Let's just keep these models over time, and that's good enough. I think that that's, it's a reasonable thought, and that might work in the long run, but I think in the short run, we haven't seen those models be accurate enough to recall all of the original content without hallucinating

Starting point is 00:19:23 at all, and then, unfortunately, the subsequent models get trained on the output of the initial models. So it's really important to keep those primary sources around for as long as possible, because our future kids' kids' kids might care for historical purposes also, you know, what did websites look like, but also for contextual purposes, how is this content delivered in what format, what ads were on the page, you know, all of these things are things that future people might care about that might not seem important now. That's part of why archiving this active curation, this active labor that I describe it as is important is because you're trying to preserve as much of the original historical context of the world around this piece of content at that moment in time with the content. It's not

Starting point is 00:20:03 always just about the raw content. Right. Well said. And I don't think the technique you describe, because of the way large language models work, I mean, they are effectively compression algorithms. And so lossy by definition. I mean, they're not lossless. Maybe eventually they become lossless.

Starting point is 00:20:20 And so they can have both your compressed artifact and your original artifact perfectly pristine. Well, then they're just archivists, aren't they? And so we are still archiving. We're just letting the machines do it. But you're kind of letting the machine do it, right? That's what your software does. Sort of. Yeah. So I actually don't take as much issue with the compression. I think all archiving is lossy to some degree. I take more issue with the lack of perspective of the tool. I think that the perspective of the person doing the saving is almost as important as the actual record. Because

Starting point is 00:20:52 if I visit a website in the US and on the Eastern time zone, I'm going to see a totally different New York Times homepage than if I visit it from Germany. Or if I visit my Facebook timeline, it's going to look totally different to me than to someone else. So the perspective of the person viewing it is almost as important. And these models don't have that perspective. They don't record any information about who's doing the saving. Why are they doing the saving? When did they do the saving? What did they visit before and after? And so all of that stuff is part of the curatorial work of creating these archives. Gotcha. So that's something that's unique to the web then because of the dynamism of the documents.
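As a rough illustration of the context Nick lists here (who did the saving, why, when, from where, and what they visited before and after), the following is a minimal Python sketch of a metadata record that could sit alongside a saved page. Every name in it (SnapshotContext, record_context, to_sidecar_json, and all the field names) is hypothetical and chosen for illustration; it is not ArchiveBox's actual schema or API.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class SnapshotContext:
    """Hypothetical record of the curator's perspective for one saved page."""
    url: str
    saved_at: str                         # ISO 8601 timestamp of the capture
    saved_by: str                         # who did the saving
    viewed_from: str                      # where the page was viewed from, e.g. "US/Eastern"
    logged_in: bool                       # whether the capture was made behind a login
    reason: str = ""                      # why the curator thought it was worth keeping
    visited_before: Optional[str] = None  # page visited just before, if known
    visited_after: Optional[str] = None   # page visited just after, if known
    tags: List[str] = field(default_factory=list)


def record_context(url, saved_by, viewed_from, logged_in,
                   reason="", tags=None, now=None):
    """Build a context record; `now` can be passed in to make the timestamp repeatable."""
    when = now or datetime.now(timezone.utc)
    return SnapshotContext(
        url=url,
        saved_at=when.isoformat(),
        saved_by=saved_by,
        viewed_from=viewed_from,
        logged_in=logged_in,
        reason=reason,
        tags=list(tags or []),
    )


def to_sidecar_json(ctx):
    """Serialize the record so it can be stored next to the saved page."""
    return json.dumps(asdict(ctx), indent=2, sort_keys=True)
```

Keeping the record separate from the page content is one possible design choice: the original capture stays untouched while the curator's perspective travels with it.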

Starting point is 00:21:31 Because if we were going to archive ancient writings, maybe you want to know what cave this came from and all the context you could possibly gather. But there's not the perspective of the gatherer. Maybe they choose to exclude some stuff or, you know, there's censorship and things like there is a bit of an editorial to like decide what to archive. But, you know, based on one person in Seattle and the other person in London gets two different web pages. That's a really good point. But it seems like it's almost unique to web. If you go back far enough, I think you'll encounter editorial adjustments more often, right? History is written by the victors and the victors are the ones who retranscribe it over the years. And so you're essentially getting layers of

Starting point is 00:22:14 delayed perspective added. I think that if you look very closely at any sort of historical archives, the older they get, the more perspective is necessary because those are each layers of decisions to decide to keep this around. Fair. The documents don't change though. Yeah. Right. Unless they like literally change them. Unless there's, that's fraud then. So now we're talking about fraud. Hopefully. Yeah. But then you get libraries of Alexandria and you have to retranscribe things from memory or oral history. Once you get to a long enough timescale, it all becomes layers of recollection. But yeah, you're right. Hopefully the documents don't change in the 100-year timescale.

Starting point is 00:22:47 The interesting thing, though, is that the internet is fairly young in comparison to pretty much any other archived medium. It's one thing to have an archive or a museum of paintings or of art or of different artifacts. The web is a uniquely, like Jared said, it's dynamic, you know, but at the same time, the perspective of, we don't know what's important right now until later. So it's almost like archive as much as society might think is important because we're not really sure what is important right in this moment. We have to have a zoom out, which is time, right? The time is the perspective. 50 years from now, the world and the web or whatever the web becomes or whatever the web makes the world become will be drastically different for sure. Bad or good, we're not sure. But today's breadcrumbs,

Starting point is 00:23:40 so to speak, may point us to why or how or what later on, because the questions we'll have later are unknown to us. It's almost like the unknown unknowns. Just archive it all as much as you can and distribute it and protect it so that we have the opportunity for the look back. Yeah, that's, I think, a good strategy for a central actor like archive.org. Their strategy is just archive literally everything they can get their hands on. You submit a URL to them, they'll archive it. I do think that breaks down somewhat

Starting point is 00:24:10 in distributed archiving, where the goal is slightly different because you're empowering individuals to save things that they care about. It's a little counterintuitive, but actually recommending that people save as much as they possibly can tends to backfire because they end up with massive multi-terabyte collections that they just can't handle. They can't deal with,

Starting point is 00:24:29 and they don't know who to send it to. And eventually they stop paying for hosting. So that's why I really stress this sort of archiving as an active curation line. It may get old, but for distributed archiving, it's especially important. It's especially important to recognize that the people running these are really contributing labor. They're contributing public service to other people, and they should do it to the extent that they can sustainably do it. And if you dive headfirst into saving everything you possibly think is useful, I've seen many, many people burn out on archiving from that. It's a fad. They'll get into it a little bit. They download 10,000 URLs, and then they're like, okay, I don't know what to do with this.

Starting point is 00:25:08 It's too big to search. It's too big to use. It's kind of cool. Maybe I'll send it to someone. And it actually dies faster. Whereas if you empower people to archive what they care about and sort of harp on that a lot so that you make it easy to curate and tag and add context. It's the context that indicates why it's valuable. And it's a different strategy than a big library of Alexandria warehouse where you just store everything you possibly can. It's more about having nodes of these curations of different groups. And these nodes can then start sharing what they think is important with each other. And through this sort of federated network of decision making on what is important, you end up with the same average result at the end of basically everything that anyone has cared about at some point being saved. But putting that whole

Starting point is 00:25:56 responsibility on one person of, oh, if you're starting archiving, you must archive everything you possibly can. I think it actually tends to backfire more than it does good. I can certainly see that. So ArchiveBox is to empower individuals to archive that which they care about from the web. So this is a tool for downloading web pages, storing them offline in their own little archives that you can bring them up and look at them again. You know, HTML, JavaScript, PDFs, images, like the raw nuts and bolts of what puts a website together. Is the end goal then, like we all have our own little archives, is it like you described and like ArchiveBox

Starting point is 00:26:34 is somehow going to provide this Wayback Machine based on this federation of me agreeing and other people agreeing, which feels a little kumbaya, but would be awesome if we all agreed to like share our little view of the world with everybody else. Is that the idea? No, actually. So I don't want archives to be necessarily defaulting to being public for everyone, because again, that's

Starting point is 00:26:58 not the role of this distributed archiving tool like it's a great role for a library but it doesn't work as well for distributed archiving because of cookies, because of authentication. Basically, one of the main selling points when you actually get down to it and you're like, do I really run this tomorrow? Is it worth it or not? Is, oh, I can save my social media. I can save stuff behind paywalls. I can save stuff that I have to be logged in to see. Archive.org cannot save any of that, and they won't take it. Or they'll upload it for you,

Starting point is 00:27:27 and they'll hold it privately for you, but you won't be able to share it with anyone because they don't want your cookies, right? They've archived your cookies, your login sessions, all of that. So a lot of that content is kind of unshareable until you die or stop using those accounts. And so it gets really tricky. That's the main selling point of saving stuff locally.

Starting point is 00:27:47 If I start adding features of like, oh, share your archives with the whole world, most people don't want that. They're saving their Facebook photos. They're saving their news articles and stuff they read, but also a lot of their own personal browsing history. They don't necessarily want to share the URLs only, and they don't necessarily want to share the

Starting point is 00:28:03 snapshotted page content. But it's important for the longevity of humanity and this information for it to be shareable eventually. And so I think very carefully about sort of different ways to tackle that issue. It's a really human issue. It's not a technical problem, right? Do you have time unlock? Do you try to incentivize people to donate their archives to a public collection by providing free hosting in exchange for them releasing the information? Do you have scrubbing tools that try to go through

Starting point is 00:28:29 and scrub all the sensitive information? If you do that, where do you stop? Because you are tampering, you know, archivists try very hard, like you were saying, to not tamper with the original documents. But the original document has someone's personal email and username and password in the HTML somewhere, there's a trade off. At some point, you do have to scrub that for it to be useful to other people without being harmful to the original curator. Curation is an act of labor. We shouldn't punish those people doing the curation by spreading their social media logins to the world. So it's a very delicate balance. And I think that the answer is there's no one permission setting that gets pushed on

Starting point is 00:29:04 everyone ever. This tool is never going to force everyone to upload all their archives to a big federated network. This tool is never going to force everyone to only have private archives and not be able to upload stuff to a big federated network. Instead, it's going to give a range of options, and it's going to be annoying to some people that they have to decide, you know, do I share this with other people or not? But I think that that's the right move for now, giving the full spread. Do I keep it local? Do I share it with my neighbors who I know and trust?

Starting point is 00:29:34 Or do I share it with everyone in a big, untrusted, scary world where someone might use this content to hurt me later? And every social app network platform has to make these decisions when they first start. For sure. The time unlock is super interesting because we recently spent some time with Jordan Eldredge. I'm not sure if you heard that episode, Nick, the Winamp era, where he had dug through different Winamp themes. I love Winamp. And he had found in these themes all kinds of digital artifacts, things that shouldn't have been there. Because he has this Winamp, not theme, what are they called? Skins.

Starting point is 00:30:06 Skins, yeah. He has this Winamp Skin Museum, which is really rad. And in that, he had found like old pictures of people. It's like basically a compressed file of a folder of files. And in there is like the stuff you'd normally have for a skin, but then like random things that he found in there. And he shared some of those. And we were looking at pictures of people from the nineties and like old audio files of like, you

Starting point is 00:30:29 know, kids at their computer recording weird noises. And it was just really enjoyable to kind of have that snapshot of the past people we'd never met and never will meet. Sure. If we had seen it right after they had taken it now, it's like almost, it's a privacy violation, right? Like, I didn't want you seeing that. Well, you shouldn't have dropped it in your Winamp skin. That's right, purposefully. Over time, like, you know, they're gone and old or dead or, you know, it's just like the context is gone.

Starting point is 00:30:54 There's no fear there. And it's really, for us, it was nostalgic, but there's lots of reasons why that would be interesting. So I like that time unlock option of like, you know, maybe, like you said, maybe I donate my archive when I die or every 20 years, like go back 20 years and those are now publicly available.
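The "time unlock" idea above (go back 20 years and make those snapshots public) can be sketched as a simple date check. This is only an illustration of the rule as described in the conversation, not a feature of any existing tool; the function names and the 20-year default are assumptions taken from the example figure mentioned here.

```python
from datetime import date


def unlock_date(saved_on: date, lock_years: int = 20) -> date:
    """Return the first day a snapshot saved on `saved_on` would become public.

    Hypothetical helper: the 20-year default mirrors the figure floated in the
    conversation, not any real ArchiveBox setting.
    """
    try:
        return saved_on.replace(year=saved_on.year + lock_years)
    except ValueError:
        # saved_on was Feb 29 and the target year is not a leap year.
        return saved_on.replace(year=saved_on.year + lock_years, day=28)


def is_public(saved_on: date, today: date, lock_years: int = 20) -> bool:
    """True once the lock period for a snapshot has fully elapsed."""
    return today >= unlock_date(saved_on, lock_years)
```

A curator-controlled setting like `lock_years` keeps the decision with the person who did the archiving, which matches the point made earlier that no single permission setting should be pushed on everyone.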

Starting point is 00:31:13 Similar to how stuff gets declassified, you know, in our government. Yeah. I think that would be really cool. Yeah, that's sort of what I'm gravitating towards as an initial carrot to offer is like, you know, if you agree to time unlock, then I'll host your stuff for free as a backup. It gets dicey when I have to rehost

Starting point is 00:31:31 content for other people. So the way archive.org works is they basically operate as a library, right? They're a nonprofit institution. They don't earn income from their hosting. They have a separate LLC that does some paid services, but it's a separate LLC. And they're basically not earning revenue directly off of re-hosting often copyrighted content. I would have to, if I ran a public hosted service where I'm mirroring people's content, I would have to either be a library like them, in which case I can't accept payment for hosting at all.

Starting point is 00:32:00 So this is the only way that I could offer to host people's stuff. Or I have to figure out some other new legal system that hasn't been invented yet to do this. Basically, you're trying to make a business out of BitTorrent, right? It's a very similar problem, right? It's very hard to charge for this and not be legally liable for re-hosting copyrighted content. So there's probably some middle ground where people are buying an app that they are running locally, that they are operating, that's connecting them to other people running this app. But I am nevering that stuff if I get copyright complaints. If someone sends a DMCA notice and says, I have to take it down, I have to comply as a central agency.

Starting point is 00:32:51 But the people running those individual archiving apps can still share it if they want to. Something like that. That's sort of a middle ground option. Okay, friends, I'm with a good friend of mine, Avthar Sewrathan from Timescale. They are positioning Postgres for everything from IoT, sensors, AI, dev tools, crypto, and finance apps. So Avthar, help me understand why Timescale feels Postgres is most well positioned to be the database for AI applications.

Starting point is 00:33:19 It's the most popular database according to the Stack Overflow Developer Survey. And Postgres, one of the distinguishing characteristics is that it's extensible. And so you can extend it for use cases beyond just relational and transactional data for use cases like time series and analytics. That's kind of where Timescale, the company, started, as well as now more recently Vector Search and Vector Storage, which are super impactful for applications like RAG, recommendation systems, and even AI agents, which we're seeing more and more of those things today. Yeah, Postgres is super powerful. It's well loved by developers. I feel like more devs, because they know it, it can enable more developers to become AI developers, AI engineers, and build AI apps. From our side, we think Postgres is really the no-brainer choice. You don't have to manage a different database.

Starting point is 00:34:07 You don't have to deal with data synchronization and data isolation because you have like three different systems and three different sources of truth. And one area where we've done work in is around the performance and scalability. So we've built an extension called pgvectorscale that enhances the performance and scalability of Postgres so that you can use it with confidence for large-scale AI applications like RAG and agents and such.

Starting point is 00:34:30 And then also another area is, coming back to something that you said, enabling more and more developers to make the jump into building AI applications and become AI engineers using the expertise that they already have. And so that's where we built the pgai extension that brings LLMs to Postgres to enable things like LLM reasoning on your Postgres data, as well as embedding creation. And for all those reasons, I think when you're building an AI application,

Starting point is 00:34:53 you don't have to use something new. You can just use Postgres. Well, friends, learn how Timescale is making Postgres powerful. Over 3 million Timescale databases power IoT, sensors, AI, dev tools, crypto and finance applications, and they do it all on Postgres.

Starting point is 00:35:08 Timescale uses Postgres for everything, and now you can too. Learn more at timescale.com. Again, timescale.com. And also by our friends over at Wix. I've got just 30 seconds to tell you about Wix Studio, the web platform for freelancers, agencies, and enterprises. So here are a few things you can do in 30 seconds or less on Studio. Number one, integrate, extend, and write custom scripts in a VS Code-based IDE. Two, leverage zero setup dev, test, and production

Starting point is 00:35:42 environments. Three, ship faster with an AI code assistant. And four, work with Wix headless APIs on any tech stack. Wix Studio is for devs who build websites, sell apps, go headless, or manage clients. Well, my time is up, but the list keeps going on. Step into Wix Studio and see for yourself. Go to wix.com slash studio. Once again, wix.com slash studio.

Starting point is 00:36:06 I would be motivated to archive for legacy. You know, what's the internet today for me is not the same internet of tomorrow for my kids. And so I think that would be where I would personally find some motivation. And I'm kind of hanging out in that motivational space because you're, like, describing, you know, archive 10,000 URLs, you get burnt out and you sort of quit. And so the job of you is to instill the obvious software to do the job but at the same time bootstrap and educate the people that you want to sort of clone and say, this is why it's important.

Starting point is 00:36:46 Here's how you can use it for yourself. Here's ways you can even share it with others that make it so that you stay motivated. Yes. Over time, if I can't show off my stuff, the things I think are cool or have a purpose or a reason to do it, I'll eventually become bored with the practice and just basically move on. I think for me personally, I would want an archive box for my future generations. And it's not to be narcissistic. It's because those, it's my people, my closest people is who I really care about in life. Sure, I care about everybody and I'm a kind person, but at the same time, like family is family. You know, I want my kids to know where I came from, what was important

Starting point is 00:37:28 about me. And maybe it's part of the podcast. Maybe it's part of the, you know, the web amp museum, so to speak, you know, these little things that were cool to me that eventually my kids can spelunk and be like curious and explore and find new things and reach back and all that good stuff. Or maybe they decide to donate it to a museum and then the museum decides to, you know, bring a whole new life to it. Like your kids have a bunch of interesting agency and choice that they can make. But yeah, that's a great point. Legacy is a common attractor for individuals who want to do archiving. I'd say it's right now it's an even split sort of between journalists, researchers, lawyers.

Starting point is 00:38:07 Lawyers are the biggest category, to be honest, about archiving. And individuals who want legacy or just sort of personal use, archival of their bookmarks, that kind of thing. Imagine this headline in 2070. Seemingly long-time digital pack rat finally, through family and legacy, has had their internet archive or their

Starting point is 00:38:26 web, their archive box, donated, and it's enabled this new technology to be the foundation of, I don't know, like, reaching for the stars here, but imagine that kind of headline. Like somebody who was really archiving the good stuff, and they gave it to future society and enabled this brand new thing that is just super cool. Well, you also have certain creators through time who were prolific, and they wrote way more than they published, for instance. And then that person died and they became famous because they wrote such great prose, and over time you're like, wow, what if we had their unpublished works?

Starting point is 00:39:05 What if we had their journal? What if we had their thoughts? We could mine those for such interesting insights like Albert Einstein's and such. Yeah, there's a delicate balance there, though, because with any content that people create, they're being vulnerable and sharing a part of themselves that they might not otherwise share if they knew that everything they shared was instantly public 100% of the time. Well, that's why I'm speaking of legacy, though. This is like your foundation that you arranged.

Starting point is 00:39:32 They decided that we're in the context of you saying that finally this person died and their foundation decided to open up their archive box, for instance. Yeah, that's totally a fair game. Yeah, yeah, yeah. And then they probably scrubbed it first just to make sure it's not embarrassing and stuff. And then like, you know, the public benefits, that's where I was going.

Starting point is 00:39:47 Not just like, hey, all your secrets are public now that you're dead, you know. Well, I was also meaning for people running these distributed nodes,

Starting point is 00:39:55 I think it's also important to sort of discourage the, oh, archive everything you possibly see mentality because I think that would also kind of destroy the internet to some degree.

Starting point is 00:40:03 Like part of the beauty of the internet is that there are pseudonymous spaces, there are anonymous spaces, and there are real-name spaces. But you're not forced to be one and the same identity across all of them. And so you get more vulnerability, more connection, more willingness to share things online that you might not have in person. And that the threat of everyone watching is actually tape recording everything they see 100% of the time. And even if they don't decide to share it today, within 20

Starting point is 00:40:31 years, 100% of everything is going to be online copied by everyone. I think that that is rightfully a scary concept for some, especially people who feel more threatened, right? If you don't experience a lot of threats online day to day, it seems like, oh, that's not a big deal. Like, you know, if my stuff is time unlocked in 20 years, like I'll be fine. If you're experiencing a lot of oppression today, and you don't think your situation is going to change, having all of your social media public in 20 years might not seem as attractive an option. And I just want to acknowledge that there's sort of a range, there's a range of privacy that's needed. And there's a range of respect that needs to be given to privacy from archiving tool makers, to acknowledge that

Starting point is 00:41:11 we're not trying to build the tape recorder for the entire internet, especially the private stuff, the stuff that requires cookies and logins. Because archive.org doesn't have this problem, right? They're not archiving stuff behind logins. But of course, I am pro-archiving in general. I love people to

Starting point is 00:41:25 archive. I just, you know, I feel like these points don't get harped on enough when people talk about archiving online. And so I feel like this is the right space to give them a little bit of airtime. Sure. So tell us how Archivebox works then mechanically as a person who might use it. How do I point it at things and how do I decide? Just like walk us through it. Yeah, so Archivebox right now is a self-hosted Docker app mostly and a pip library. So if you don't know what those things are, I'd say Archivebox is not for you.

Starting point is 00:41:55 There are other apps out there that do a way better job of providing a nice user interface, a nice iOS app. And all of that is coming for Archivebox eventually. But right now we're a server that you run like NextCloud or Plex or Home Assistant that you set up on a little $5 a month machine. It's totally fine. You run a couple commands. It takes five minutes to get it running.

Starting point is 00:42:15 You have an admin interface, web UI, and you have a browser extension that you can use to submit URLs. Or you can just paste in URLs manually or drag them in from a spreadsheet or your bookmarks out of your browser. There's ways to ingest most of the common ways that you would want to send a list of URLs to this. Then it goes through and it pretty serially, we can't do too many in parallel because you'll get blocked pretty fast.

Starting point is 00:42:40 So we just go through one by one. And for every URL, we save it in a ton of different formats. So the raw HTML; we'll save SingleFile, which is an excellent way to get everything into one HTML file, including JavaScript and images and all that. Wget, youtube-dl, so we rip all the audio, video, subtitles out, video metadata, comments, photo galleries, like basically every piece of content. Archivebox's stance is to actually rip it out of the original page. We're not trying to do the, oh, preserve it perfectly in its original format thing.

Starting point is 00:43:11 Because I think that that, even though I harped on before how important the original context is of a piece of content, honestly, it's a really difficult technical problem. And so I'm going the other direction where I'm actually trying to get the content out into its usable forms for LLMs and for humans to actually use it. And so I don't actually write it to this WARC standard, which is sort of the internet archiving standard file. I think it's a little bit unapproachable for most people who don't interact with WARC files on a day-to-day basis. And so instead, Archivebox writes everything as raw files to the file system. You get a normal PNG, a normal PDF, a normal .txt file with article text.

Starting point is 00:43:48 You get JSON. You get just basically really simple, common file formats that I think will survive for more than 100 years. And you get it all flat on the file system right there. You can just dig in and look at it. There's no complicated binary formats, nothing like that. Yeah, so that's generally how it works. And then you can set up scheduled archives

Starting point is 00:44:05 that pull in stuff on a daily basis. You could archive your own Twitter feed or Hacker News or whatever you want. And then you can tag it. You can send an archive to someone else. You can export it statically in a way that you can share. And the distributed sharing between archiving nodes is coming. I'm working really hard on that, but that's not out yet.
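The flow described above (go through URLs one at a time, run several extractors per URL, and write plain files to a folder) can be sketched roughly as follows. This is a minimal illustration, not ArchiveBox's actual code or on-disk layout: the `Extractor` type, the hashed folder names, the `index.json` file, and the one-second delay are all assumptions made for the example.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# Hypothetical extractor type: takes a URL, returns the bytes to store.
Extractor = Callable[[str], bytes]


def snapshot_dir(root: Path, url: str) -> Path:
    """Derive a stable folder name from the URL (an assumed layout)."""
    return root / hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def archive_serially(
    urls: Iterable[str],
    extractors: Dict[str, Extractor],
    root: Path,
    delay_seconds: float = 1.0,
) -> List[Path]:
    """Archive URLs one by one, writing each output as an ordinary file."""
    folders = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # stay serial and slow to avoid blocks
        folder = snapshot_dir(root, url)
        folder.mkdir(parents=True, exist_ok=True)
        index = {"url": url, "outputs": {}}
        for filename, extract in extractors.items():
            try:
                data = extract(url)
                (folder / filename).write_bytes(data)
                index["outputs"][filename] = "ok"
            except Exception as exc:  # one failed method should not stop the rest
                index["outputs"][filename] = f"failed: {exc}"
        # A small JSON summary sits next to the raw files.
        (folder / "index.json").write_text(json.dumps(index, indent=2), encoding="utf-8")
        folders.append(folder)
    return folders
```

Real extractors would wrap tools such as wget or a headless browser; here they are just callables, so any later tool can read the resulting files without a decompression or parsing step.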

Starting point is 00:44:25 So that's how it works so far. How do you deal with the, if it's a flat file, how do you deal with file size or archive size over time? I understand the reason why you're doing that because you want it to be preserved in a format that is accessible, whereas WARC, which I believe is W-A-R-C, right? That's the file format. Yeah. Where it's stuck in this other thing that may not be accessible. You know, I don't know, like, zip files will probably be

Starting point is 00:44:52 around forever, but, you know, these randos that might not be, which WARC is not, but at some point somebody might be like, no, that's not cool anymore, regular PDFs, let's do that. I think WARCs will last. So WARC is actually a zip file. Modern WARCs, like WACZ, is just a zip file. You can add dot zip on the end and uncompress it. I don't think it's too bad. Like, really, once you get used to them, they're very easy to work with and they're quite standard. And I think that will survive for a really long time. I just want archive box to be like immediately usable by the next tool that you want to consume the data with. Like I don't want multiple decompression steps and stuff like that. So for your concern about file size, yeah, it does take up a lot of space. It's not as bad as you would expect, though. I'd say about 1,000 URLs take up, on average, about five gigabytes with most of the methods enabled.

Starting point is 00:45:39 So as long as you're not saving only YouTube videos, you can expect, if most of your content is text, plenty of images still, but no massive, massive videos, because that's what really skyrockets it quickly, about five gigs per thousand URLs. So, you know, 10,000, not too bad, 50 gigs, you could probably stick that on a drive somewhere. As storage gets cheaper, that's not that big of an issue. I, for my big, you know, massive archives that I keep, I use ZFS that has built-in compression and lately fast deduplication.
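The rule of thumb quoted above (about five gigabytes per thousand URLs for mostly text-and-image content) is easy to turn into a back-of-the-envelope estimate. The function name below is made up for illustration, and the ratio is only the rough average given in the conversation; video-heavy collections would be far larger.

```python
GB_PER_THOUSAND_URLS = 5.0  # rough average quoted in the conversation


def estimate_archive_gb(url_count: int, gb_per_thousand: float = GB_PER_THOUSAND_URLS) -> float:
    """Estimate uncompressed archive size in gigabytes for a number of URLs."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return url_count / 1000 * gb_per_thousand


if __name__ == "__main__":
    for count in (1_000, 10_000, 100_000):
        print(f"{count:>7} URLs is roughly {estimate_archive_gb(count):.0f} GB")
```

Filesystem-level compression or deduplication would lower the on-disk figure, which is why the estimate is described as uncompressed.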

Starting point is 00:46:07 And so I like to solve those issues at the file system. Where you dedupe, huh? I'm experimenting with a new fast dedupe feature. I haven't used it on the big, big archive yet, but it's working well. I usually disable dedupe, honestly. I mean, I don't have a need for it. But I think if I was running an archive, I would probably want it. Yeah, it's like one of the few cases where it makes sense

Starting point is 00:46:25 but specifically the new, recently released, like in the last few months, fast dedup rewrite by, is it iXsystems,

Starting point is 00:46:32 or another company, stepped in and contributed a big update to it. Interesting. So it's more reasonable now for people to run it.

Starting point is 00:46:39 Yeah. As I was asking that question, I was thinking, Adam, don't worry about it, the file system will do it. So I was going to ask you what your favorite file system was or what file system is beneath this thing. I love ZFS. I assumed you'd say ZFS, and I'm thankful that you did.

Starting point is 00:46:54 So am I. Otherwise, we'd have a fight, you know. It'd be Adam versus Nick. It's not worth going there. It's the wrong place to slap somebody, you know. Well, I know I can appreciate your taste in so many other things that I know that you appreciate ZFS. So there you go. There you go. And that really, I mean, I'm a ZFS guy myself. That's exactly what I would put, you know, this archive on.

Starting point is 00:47:14 I would spin up a new ZFS file system and I would let that file system do all the work of compression, dedupe, stuff like that that would matter, and let the archive box do its thing, which is what it should do. Let me, as the user and the curator of it, interact with the original file system versus, or the original file types versus what the file system can do for me. Yeah, one way to make that more accessible is I've added support for rclone recently, so you can link it up to a Google Drive or a,... A lot of people don't have terabytes of storage at home anymore, and so letting people use their Google Drive as their storage, I think, is

Starting point is 00:47:50 important. And then Google Drive, they'll still charge you for every file, but they're doing de-duping on their side. Same with AWS or all of them. I think that'll get cheap enough over time that it's not a big issue. I think most people are going to run into losing motivation

Starting point is 00:48:07 sooner than they're going to run into running out of storage. File system size, yeah. Yeah. How do you get this to go? Well, I guess maybe the better question would be, how well used is this? How much are people using Archivebox, and what would it take to make it more used, more adopted?

Starting point is 00:48:26 Yeah, so I don't have analytics in the actual product. There's only a few stats that I keep track of. So there's 6 million Docker Hub pulls so far, 6 or 7 million if you include both repos. That's a lot. The PyPI installs are sitting at around 70,000 a month. And the Google Chrome extension only has about 2,000 users. So a lot of those are automated. You know, people have scripts that auto-update their Docker container or auto-update their pip packages. But I think it's in the tens of thousands. Exact numbers, I don't know. When people open GitHub issues, that's a pretty strong indicator that they care enough to say something. And there's thousands of GitHub issues

Starting point is 00:49:06 and hundreds of contributors and a few grants and donations. Not enough to make it a sustainable business model, but enough that I can't ignore it. Lots of attention whenever stuff goes on Hacker News. So I know people care about the issues and I know that people are using it, but I refuse to add analytics,

Starting point is 00:49:21 so hard to say. You're one of us. You are one of us. So how does it get into your credentialed stuff? Do you have to be using Google Chrome? Is that the extension? Is that what that does? Or is it grabbing cookies out of your cookie jar? How does it do it? Yeah, that's a, it's constantly evolving. I'm trying to make this as smooth as possible.

Starting point is 00:49:42 The golden rule is don't let people use their normal accounts. This is based on talking to a lot of my industry peers. We just don't think that the scrubbing tech is there yet to sanitize these archives. And unless people really, really know what they're doing, which some people do, and they can save that stuff, you don't know who the audience of an archive is going to be in five or ten years. And so people are going to forget, oh, this archive was saved with cookies turned on, which means your whole personal information is probably mirrored in the HTML somewhere.

Starting point is 00:50:15 So I basically force people to create separate accounts for archiving. If you want to archive Facebook stuff, you make a second fake Facebook account, invite it to all the groups that you wanted to have access to. It's an arduous process. It's annoying. And I'm being paid by companies to automate it. So that's how Archivebox is a sustainable business right now: the paid service that I offer to companies is creation of sock puppet accounts. There's no engagement. I have a hard rule: I don't allow these accounts to do anything other than view. But you create these accounts, you log them into all the groups that you want to be able to save stuff from. And then these accounts will archive on your behalf. And that way, if the accounts

Starting point is 00:50:53 get burned by, you know, an archive being shared or something, it's fine. They're not, you know, they're not real info. They're not tied to anyone. Interesting. So that's some of the labor you were talking about earlier. You know, like this is hard work. It's not like, uh, just download it and click go. Like, you're gonna be, yeah, doing some stuff here. Yeah, it's not too bad. So with the recent changes, I've made it smoother. So there's a VNC container running in the background, so it'll open Chrome automatically. You can just go to a new tab, you'll see like a desktop Chrome, you log into all your sites, and then it'll save those cookies automatically, and then you just close it.

Starting point is 00:51:27 You never have to think about it again. It'll stay logged in. If it kicks you out of some site, you just reopen that VNC window and log back in. So I'm trying to make it as smooth as possible. I do allow you to import cookies from your existing Chrome. I just strongly don't recommend it, unless you're the only person who's ever going to look at your archives

Starting point is 00:51:42 for the next, you know, how many years, or if the people that you're sharing the archives with are people that you really trust. Or if you're willing to manually sanitize. And I think most people don't understand that risk. So I don't make it too easy. Is this only a single player game? Like is there an archive scenario where it's like a group? Like let's say Jared and I were like, man, that was cool.

Starting point is 00:52:03 Nick is awesome. And we start our own archive, essentially. And it's like anybody who's in and around the ChangeLog podcast universe, just had to say that, Jared, they can join in. Or there's a mission here. And we can, similar to the way you would have a core team member or commit rights, you can have this membership, so to speak, to an archive. Is that out there? Yeah. Is that part of your plan?

Starting point is 00:52:28 Yeah, that is my plan. That's the core mission is actually to serve that group. So Archivebox is primarily aimed at organizations to save what they collectively care about. And so there are users, there are permissions, there's sharing stuff, there's multiple logins. And the idea is your org probably has shared ability to access some resources. So your org only has to set up these credentials once for the archiving bot. And then when people submit URLs, it doesn't archive with the person's URLs, it archives with the archiving bot's URLs. And so an org can collectively maintain access to all the

Starting point is 00:53:01 resources that they care about. And then the org's archiving bot will also have access and will just save any URL that anyone in the org submits. And that's how the paying customers are using Archivebox today. So I work with nonprofits that monitor disinformation campaigns and look for evidence of war crimes on social media. As I was saying before, it's lawyers who pay for this. They pay for evidence collection, both to catch the social networks breaking their own terms of service and their own rules, to help governments with regulatory issues around how social media is behaving,

Starting point is 00:53:33 but also to look for war crimes. It's interesting. So they're doing this method of shared one collection and they have teams of researchers that submit URLs to the shared collection. But you can't reveal who the researchers are because they're researching really sensitive content.

Starting point is 00:53:47 You can't burn their identities. Yeah, it's like a journalist and their source. Exactly. When you got into, when you even first had the spark of this idea, did you think that's what you would be doing to sustain it and get paid? Sock puppets. No, but now that I'm working on it, it's a surprisingly fun problem

Starting point is 00:54:07 because I get to red team. I love security stuff. And now I'm a red teamer. I literally, my job is to break like CAPTCHAs and rate limits and login walls for good cause. Like I'm supporting, you know, anti-disinformation, especially after the, you know, recent election. It's motivating to actually work on what matters right now. I feel

Starting point is 00:54:25 like this really matters, and directly working on anti-disinformation and, like, mass social media manipulation is motivating, for sure. What an interesting job you have. Wow. So Jared, uh, what are we doing about our archive box? When we spin this thing up, did you already spin up a new Fly machine for this? I've not tried it yet. I'm excited about Docker being the, you know, is that one of the primary ways that folks do spin it up and play with it? I imagine like a Docker compose or Docker file

Starting point is 00:54:53 just generally is an easier thing than anything else. You know, I would think, but the archiving crowd attracts a lot of people who still want to do stuff the old school way, unfortunately. Which is zip files onto a machine? Yeah, or apt install every single dependency manually. And some people really want to do that.

Starting point is 00:55:12 But unfortunately, a surprisingly large amount of the user base will not touch Docker and will only apt install every single dependency manually. And so I spent the last two months writing my own runtime dependency manager for Archivebox. It's a whole new library called abxdl that uses the Python type system to basically have unique... I went a little overboard designing this, but it was pretty fun. Basically, Archivebox is now pluginized, so people can contribute plugins. It's really hard for me to maintain the auto-login for Facebook and Twitter and Instagram and TikTok and YouTube and Quora and all of these. So I want a community to come build around little scripts that do things automatically while archiving. And I'm working with other archiving companies to sort of share a common spec for this. But part of what these plugins need to be able to do is access dependencies. So YouTube DL or Wget or Curl or things that the

Starting point is 00:56:01 user might not have installed on their system. And so if I'm allowing people to install plugins from an app store or ecosystem type deal, it needs to also be able to install random packages at runtime. And so Archivebox now has this whole built-in package manager. And I have a rant blog post about the inevitable progress of building a tool is that everyone eventually bakes a package manager into their tool. Once you go far enough in any product evolution, eventually you're going to have

Starting point is 00:56:27 to write your own package manager. So ABXDL is both a, a runtime as well as a CLI tool. Is that, am I reading that right? Based on the, the repo on archive box on GitHub? Um, I wouldn't just, it's, it's closer to like an ORM for package managers. Gotcha. It's just a layer between software and the system, like Ansible or PyInfra.

Starting point is 00:56:49 In fact, it uses those under the hood. It just gives you nice, clean Python types for different packages and package managers and allows you to define in a sort of flat YAML format all of the things that a plugin needs, regardless of whether they come from Brew or Pip or NPM or cargo. Or yeah.
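
Editor's sketch: the "ORM for package managers" idea described here could look roughly like the following in Python. This is not the abx-pkg API. The Dependency class, the INSTALL_TEMPLATES table, and the flat spec layout (a plain dict standing in for the YAML file) are hypothetical names made up for illustration. It only shows how one declaration can map to several package managers.

```python
import shutil
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical table: how each package manager installs a package.
INSTALL_TEMPLATES: Dict[str, List[str]] = {
    "brew": ["brew", "install", "{pkg}"],
    "pip": ["pip", "install", "{pkg}"],
    "npm": ["npm", "install", "-g", "{pkg}"],
    "cargo": ["cargo", "install", "{pkg}"],
}


@dataclass
class Dependency:
    """One binary a plugin needs, plus the package that provides it per manager."""
    binary: str
    providers: Dict[str, str]

    def abspath(self) -> Optional[str]:
        # Path of the binary if it is already on PATH, else None.
        return shutil.which(self.binary)

    def install_commands(self) -> List[List[str]]:
        # One candidate install command per known manager, in declared order.
        return [
            [part.format(pkg=pkg) for part in INSTALL_TEMPLATES[manager]]
            for manager, pkg in self.providers.items()
            if manager in INSTALL_TEMPLATES
        ]


def load_spec(spec: Dict[str, Dict[str, str]]) -> List[Dependency]:
    # The flat mapping stands in for the YAML format mentioned in the interview.
    return [Dependency(binary, providers) for binary, providers in spec.items()]


PLUGIN_SPEC = {
    "yt-dlp": {"pip": "yt-dlp", "brew": "yt-dlp"},
    "wget": {"brew": "wget"},
}

for dep in load_spec(PLUGIN_SPEC):
    status = dep.abspath() or "not installed"
    print(dep.binary, status, dep.install_commands())
```

The sketch only builds the install commands and never runs them, so a caller can decide which manager to use on a given machine.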

Starting point is 00:57:07 I dig the writing here. You say, ever wish you could yt-dlp, gallery-dl, wget, curl, puppeteer, et cetera, all in one command? ABXDL is an all-in-one CLI.

Starting point is 00:57:19 Is that not the same thing? Is that a different thing? I mixed up my own names. ABXDL is ArchiveBox, but simpler. I was referring to ABXPKG, which is,

Starting point is 00:57:27 okay. That's where the confusion is at. Okay. ABXDL sounds cool though. ABXDL is a simplified archive box. That's a one-liner. It's a one-liner for all the tools you might need. So like you give it a URL and it's going to figure it out.

Starting point is 00:57:40 Rip every piece of content that you possibly can out of this page by any means necessary and put it in a folder. That's cool. I like that tool. Yeah, I like that tool a lot. So to clarify the confusion here, ABXPKG is the runtime you're talking about. Yeah, correct. Sorry about that. But you said ABXDL, and so I went up and found your repo and then it tangented us, in a positive way, but now we're less confused.
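
Editor's sketch: the "give it a URL and it figures it out" behavior could be approximated by a small dispatcher like the one below. This is not how abx-dl is implemented. The VIDEO_HOSTS set and the plan_downloads function are hypothetical, and the sketch only builds command lines for yt-dlp and wget without running them.

```python
from typing import List
from urllib.parse import urlparse

# Hypothetical rule: hosts where a video downloader is worth trying first.
VIDEO_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "vimeo.com"}


def plan_downloads(url: str, out_dir: str = "archive") -> List[List[str]]:
    """Return the commands to try, in order, for one URL."""
    host = urlparse(url).netloc.lower()
    commands: List[List[str]] = []
    if host in VIDEO_HOSTS:
        commands.append(["yt-dlp", "--paths", out_dir, url])
    # Always grab the page itself and the assets it needs to render.
    commands.append(["wget", "--page-requisites", "--directory-prefix", out_dir, url])
    return commands


for command in plan_downloads("https://www.youtube.com/watch?v=example"):
    print(" ".join(command))
```

A real tool would need far more detection logic than a host list. The point is only that one entry point can fan out to several downloaders and collect everything into one folder.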

Starting point is 00:58:04 But now we're more excited because we know two tools, not just one. We're getting two for one here, okay? That's why I like you, Nick. ABXDL is pretty cool. So what you're saying then, if I'm reading this right, is this ready for prime time? Is this, you know, okay, so this is coming soon.

Starting point is 00:58:20 Which way, which tool? ABXDL. Yeah, ABXPKG is ready. We've been using that for months now. ABXDL I just announced, because this evolution of plug-inizing ArchiveBox inevitably makes it a little bit too complicated for some people. And so ABXDL is stepping in to fill in behind and basically provide a new tool that is way simpler than ArchiveBox to all the people that really don't want to spend time with Docker or setting up services or logins and all that. They're just like, give me the files now. Because that's how ArchiveBox started. Originally, it was like

Starting point is 00:58:52 ABXDL. It evolved so much that now we need a simpler replacement. Yeah. To put it more simply, you write it well. ABXDL is a CLI tool to auto-detect and download everything available from a URL. So just like you would use yt-dlp, which I use. I obviously use Wget, prefer curl, but either-or, pick your flavor. So if you're using these kinds of tools,

Starting point is 00:59:16 you can potentially at some point in the future replace those things if you're trying to archive with ABXDL. Yep, it should be a fairly drop-in replacement. It's got a few of its own flags. Like, you know, you can provide cookies, you can tell it to ignore SSL warnings. It's got the usual things that you would be able to configure, but I'm aiming for, like, direct drop-in replacement for Wget or curl. I want to confess something here on the show, if you don't mind. We always like confessions. One thing I do like to do sometimes is I run my own Plex box and I don't always want to,

Starting point is 00:59:51 it's almost my version of archiving now that I'm thinking about it out loud, is I will take some music that I like from YouTube and it's not to take it from me so I can give it to everybody else and be a distributor. It's more so I can have my copy

Starting point is 01:00:04 and I'm not spending web resources. I'm spending LAN resources, so to speak. It's allowed. That's legally allowed. Yeah, and so I use YTDLP to pull down different things into a WAV file, mostly like coding music and stuff like that, that I'm like,

Starting point is 01:00:20 I want to keep going back to this YouTube URL and have a tab open. I would rather just have it play in my truck or play on my phone or wherever it's at. And so, you know, Plex Amp is the iOS app. And so I can play that from my Plex at my home wherever. And so I YTDLP all the time. I mean, all the time, like several times a month, all the time, you know, but enough to be like, this is a useful tool and this is how I use it. And occasionally I'll pull down a video if I

Starting point is 01:00:48 want to archive it forever, but my file system has been the archive. So I think I'm like one step removed from actually becoming an ArchiveBox user. That's great. That's how a lot of people start. You start with the content you care about and hold onto that, right? Use that as motivation to get more into archiving. Yeah, don't break yourself into having to save everything. Just save the stuff you want to save. Yeah, I like the idea and premise. I think the thing I want is I want it to catch on. And I think organizationally it's good. Like, that's where you're sort of seeing a lot of the movement, so to speak. But I still think there's opportunity elsewhere, but I think that it might just get burnt out.

Starting point is 01:01:26 I don't know, like, what would motivate somebody to do it continuously forever if it wasn't legacy things like we said earlier, you know? Isn't it just a cron job after you got it all set up? I mean, what do you got to keep doing? Yeah, so part of it is on me to make this easier, right? My tool right now is not incredibly

Starting point is 01:01:42 so user-easy that you can just set it up and it runs in the background forever. I'm trying to get there. And once it is at that point, then I think it'll be less important to select for people who are really motivated to archive. But right now, because there are still hurdles to curating and managing all the storage and passing hard drives around and deciding who gets to look at it and scrubbing stuff out, I am selecting on purpose more for people

Starting point is 01:02:05 who are willing to take on this workload. There are other tools, like Webrecorder is amazing. They have a new cloud offering that lets you do stuff. They're the team that I'm collaborating with on this behaviors spec, we're calling it, to share these plugins between different tools. There's SingleFile. There's lots of browser extensions that make it fairly easy

Starting point is 01:02:22 to save stuff passively as you're browsing. I think those are great options for people that are looking for sort of easy, passive archiving. But yeah, a lot of the hard decisions don't come until you're six months into archiving and now you have a few terabytes that you need to move around between places. How big is your personal archive? I have, I guess there's a fuzzy line. So I have many personal archives for different things. I tend to start a new collection for a new campaign, I guess I'll call it. A lot of different tools call these campaigns.

Starting point is 01:02:55 So like if I care about my YouTube favorites, for example, that's going to be a hefty bucket of stuff. So I'll start a dedicated collection just for that. That's probably the biggest one. It's a few terabytes. It's not insane. But then I have a bunch of these collections. And so altogether, I probably have about 20 terabytes saved in a little ZFS thing over there on the shelf. I'm a big bare metal fan. I tend to not pay for lots of cloud hosting. It's mirrored. I have a 3-2-1 backup, but I think that all in all,

Starting point is 01:03:27 up around 20 terabytes. Well, friends, I'm here with a friend of mine, Michael Grinich, co-founder and CEO of WorkOS. We're big fans of WorkOS here. Michael, tell me about AuthKit. What is this? How does it work? Why'd you make it?

Starting point is 01:03:44 WorkOS has been building stuff in authentication for a long time, since the very beginning. But we really focused initially on just enterprise auth, single sign-on SAML authentication. But a year or two into that, we heard from more people that they wanted all the auth stuff covered. Two-factor auth, password auth, with blocking passwords that have been reused. They wanted auth with, you know, other third party systems. And they wanted really WorkOS to handle all the business logic around tying together identities, provisioning users, and even more advanced things like role based access control and permissions. So we started thinking about that more how we could offer it

Starting point is 01:04:18 as an API. And then we realized we had this amazing experience with Radix, with this API, really the component system for building front-end experiences for developers. Radix is downloaded tens of millions of times every month for doing exactly this. So we glued those two things together and we built AuthKit. So AuthKit is the easiest way to add auth to any app, not just Next.js if you're building a Rails app or a Django app or a just straight up Express app or something. It comes with a hosted login box. So you can customize that. You can style it. You can build your own login experience, too.

Starting point is 01:04:52 It's extremely modular. You can just use the backend APIs in a headless fashion. But out of the box, it gives you everything you need to be able to serve customers. And it's tied into the WorkOS platform. So you can really, really quickly add any enterprise features you need. So we have a lot of companies that start using it because they anticipate they're going to grow up market and want to serve enterprise. And they don't want to have to re-architect their auth stack when they do that. So it's kind of a way to like future-proof your auth system for your future growth. And we have people that have done that. People that started

Starting point is 01:05:20 off and they're like, oh, I'm just kicking the tires. I'm just doing this. And then poof, their app gets a bunch of traction, starts growing. It's awesome. And they go close Coinbase or Disney or United Airlines or, you know, it's like a major customer. And instead of saying, oh, no, sorry, we don't have any of these enterprise things and we're going to have to rebuild everything. Just go into the WorkOS dashboard and check a box and you're done. Aside from the fact that AuthKit is just awesome. The real awesome thing

Starting point is 01:05:45 is that it is free for up to 1 million users. Yes, 1 million monthly active users are included in this out of the gate. So use it from day one. And when you need to scale to enterprise, you're already ready. Too easy. You can learn more at authkit.com or, of course, workos.com. Big fans, check it out. One million users for free. Wow. Workos.com or authkit.com. As you're describing these YouTube favorites,

Starting point is 01:06:21 I have many playlists on many social media accounts. And I would say the one I would probably almost covet, like love it to death almost, is my YouTube playlists. They're all private, obviously, like only I can see them. But now I'm thinking like you said that, I feel like if I can archive my playlists, then I know because there's times I go back to them and it says this video is not here anymore because it was removed. And I'm like, well, why was it? It was useful to me at one point.

Starting point is 01:06:50 I'm not trying to like get somebody politically for any reason. So like, I know it's not that kind of content. It's just like, for some reason, somebody upset and it's not available to the public anymore. And my ability to archive that, now you're making me,

Starting point is 01:07:03 see, you're getting me. You're getting me. Yeah, definitely save that stuff. YouTube, I think, is a great starting point because- For sure. It's also, interestingly enough, text copyright, audio copyright, video copyright, music copyright, they're all very different fields legally. There's not that much overlap.

Starting point is 01:07:20 Like the way those cases are handled, the way that what the precedent is in the courts is very, very different. You have a Supreme Court judge to thank for the ability to save video locally, who had a TiVo and was like, I don't understand why I can't just TiVo my stuff at home. Like, who am I hurting by doing that? And so you have a fair use exemption to basically TiVo your video content at home. Now, of course, platforms will argue you're violating their terms of service by cloning that. But like, realistically, the precedent is set. You can save video that you care about at home and it's probably going to be okay as

Starting point is 01:07:53 long as you're not charging people to access it or depriving the original, like spamming it in their public channels saying, hey, I have a free version, come over here. It's an interesting problem in the fact that you have this archive box idea and the things that you do to do the archiving is you as a individual or an organization, you identify something worth archiving. So that's step one, right? Step two is having the necessary software technology, whether it's a plugin or a CLI tool or something that goes out there and gets the thing and says, okay, I've got the thing. And I assume as part of the ABX DL, at some point you'll have some sort of config that says, this is where you put it. And that's the archive box. That is the

Starting point is 01:08:37 file system. That's ZFS backed, praying everybody follows your rule or at least your desires. And then you have this ability, this viewer, so to speak, the hallways and the rooms of the museum, right? Those are the different, am I missing anything else that's in the sphere of how you would interact with or curate or view this museum slash archive? No, you basically perfectly identified it. There's different words used for those different

Starting point is 01:09:05 areas. You know, the viewer's often called the replayer because you're, like, replaying a recording. But yeah, that's basically it. So the ArchiveBox as it is now, if I went out there today and spun up the Docker, because I'm that kind of person, I would spin up the Docker version of it. What is that? That's not the DL thing, right? I mean, it is, it's baked into it as it is, but this ABXDL is a secondary CLI tool that enhances or adds to what the ArchiveBox will eventually do or does now currently, right? Yeah, so to dive into the nitty-gritty for, like, a couple minutes. So ArchiveBox internally is a Django application. It exposes a command line interface that is the same package as the Django web app.

Starting point is 01:09:53 Like it's all in one pip package. So you can pip install ArchiveBox without any of the Docker stuff. And you immediately get the CLI. You get a Python API. It uses SQLite and it just saves to whatever current folder you're in. It'll create a collection, it'll create a SQLite database on disk, it'll create folders for all the archives and logs and all that. So you don't need a continuously running container at all. If you just want to basically replace youtube-dl, you can pip install archivebox, then archivebox add https://whatever, and it'll just spin all that up locally and archive that one URL and then exit.
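
Editor's sketch: the one-shot CLI workflow described here can be driven from a short Python script. The helper names archivebox_commands and archive_url are hypothetical. The sketch assumes ArchiveBox was installed with pip so that an archivebox executable is on PATH, and it assumes running archivebox init before archivebox add is acceptable for a new collection folder.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def archivebox_commands(url: str) -> List[List[str]]:
    # Initialize the collection in the current folder, then archive one URL.
    return [["archivebox", "init"], ["archivebox", "add", url]]


def archive_url(url: str, collection_dir: str) -> bool:
    """Run the commands inside collection_dir; return False if archivebox is missing."""
    if shutil.which("archivebox") is None:
        return False
    Path(collection_dir).mkdir(parents=True, exist_ok=True)
    for command in archivebox_commands(url):
        # The collection (SQLite database, archive folders, logs) lives in cwd.
        subprocess.run(command, cwd=collection_dir, check=True)
    return True


for command in archivebox_commands("https://example.com"):
    print(" ".join(command))
```

Calling archive_url again with the same collection_dir and a different URL would add to the same collection, which matches the "run another command in the same directory" behavior mentioned in the interview.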

Starting point is 01:10:27 And then if you run another command in the same directory, it'll add the next URL to the same collection. You import 1,000 from Google Chrome, it'll run them all right there and exit. So you can use it as a CLI tool. You can use it as a long-running app. You can use it as a Docker container. All of these are actually just one Django package

Starting point is 01:10:45 underneath. And that's like the first, first principles of this, because then you got the challenge where you got orgs, you want to view it and you want to enjoy it. Well, you're not in that setting whatsoever. Like you're probably on the web, right? You're probably in some sort of web application. And so your viewer, your, would you call them a playback person? What was the terminology for? Replayer. Replayer, yeah. Yeah, replayer. So if you've got a replayer out there, they're probably on the web. That's a whole different problem set, right?

Starting point is 01:11:11 Yeah, so the CLI tool, because everything just saves raw straight up to the file system as raw files, you don't actually ever have to see the ArchiveBox UI at all. You don't have to use the replayer. You don't have to use the admin interface. You don't have to use the, you know, anything. You can just use the file system. Or some people never see the file system at all, right? They're running it on Fly.io and it's, you know, a hosted file system and they only see the web UI. And so, yeah, fundamentally I'm serving like two different groups. I personally use both heavily. So I'm running my own web UI, but I also very often go

Starting point is 01:11:43 into the file system because I want to play with a local LLM and I want to train it on all my YouTube videos or I want to train it on all the articles that I read last month or stuff like that. Right. The reason why I'm asking you how to experience it is because I'm literally thinking about, okay, if I started to do this,

Starting point is 01:11:59 you know, one job is to archive. Got it. Okay, cool. It's on my file system. And the next job is later on, I want to experience it or replay it and be the eventual consumer I will be of my YouTube playlist, for example. And I'll admit it's mostly cooking videos. All the confessions.

Starting point is 01:12:18 It's mostly cooking videos. You know, right now I'm trying to perfect my chicken Parmigiana recipe. Like I am like, I am trying to nail it from the, the sauce, the original, you know, tomatoes to use, you know, the garlic, all the process, you know, which olive oil, like I'm trying to perfect it. And so I've got a collection of videos. And so future Adam, like once I've perfected it or my kids, you know, even a year from now, like they will want to, they would want to view this stuff.

Starting point is 01:12:46 But the here and now is the useful. I think if you can make this archiving like useful today to me so that if it's useful for me to archive and then also experience my archive means that I'll curate it better over time because it's today useful, not tomorrow useful or some fictitious future that may or may not even come to fruition. That's what I'm thinking about because I'm already doing that in a way with my music, but I'm not using it in the way I'm using it.

Starting point is 01:13:11 I'm doing it in a way that is today useful. And today useful is on Plex and experiencing it as music because that's what it is. Plex doesn't really serve me to serve my YouTube playlist, but this Django app or this web interface could be more full-featured at some point so that you invite people to archive and experience today so that it has future generation payoffs. Yeah, 100%. You're touching on a really key part of why archiving is

Starting point is 01:13:39 hard for it to spread virally is you need to convince people that it's useful today when most people only realize archiving is important when it's too late, once they're already missing something. So making it really useful today is super important to me. And I think another big part of that is search, making sure search is really good, making sure you can quickly find, like, I go to great lengths to get the subtitles for every video and add them to the full text search. So you can search by content of video. Extracting text by any means necessary is super important. Making sure that the search engine is fast, works really well. We use Sonic, which is a Rust-based Elasticsearch all-in-one binary replacement. It's awesome. There's other ways that we can make it

Starting point is 01:14:19 really useful now, too. We can try and do, like, not everyone wants this, but some people really want it. AI-based summarization or categorization after the fact. So let's say you have, you know, a thousand URLs saved. I don't want to have to go in and click through each one to find the article that I care about. What if they all also had a column that was, you know, a two-sentence summary of the article and the author and the byline and the date it was published extracted out. So I call these extractors and Archivebox is designed to be able to add many extractors over time. And I envision it being like a home assistant type ecosystem or NextCloud or WordPress ecosystem where you have tons of plugins for all the extractors of the things that you care about. And the extractors come with their own replayers. So if you have an extractor that

Starting point is 01:15:04 specializes in getting YouTube videos, it will also provide a nice replayer UI to look at your YouTube videos. If you have an extractor that gets article text out of the page, it should also provide a nice article reading UI. You have an extractor that gets cooking recipes, but it just gets the recipe part,

Starting point is 01:15:20 then you also need a replayer that shows cooking recipes nicely. And so this is how I imagine the ecosystem evolving over time. Yeah. It's almost like an internet on top of the internet, powered by, I would say, probably importance to somebody. It's almost like its own index, too.
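
Editor's sketch: the extractor and replayer pairing, together with the full text search over extracted text, can be illustrated with a toy in-memory model. This is not ArchiveBox's plugin interface or its Sonic-backed search. The Plugin and Archive classes, the first_line extractor, and the word-level index are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Plugin:
    """Hypothetical plugin: an extractor paired with the replayer for its output."""
    name: str
    extract: Callable[[str], str]  # raw page content -> extracted text
    replay: Callable[[str], str]   # extracted text -> viewable rendering


class Archive:
    def __init__(self) -> None:
        self.plugins: Dict[str, Plugin] = {}
        self.outputs: Dict[Tuple[str, str], str] = {}  # (url, plugin name) -> text
        self.index: Dict[str, Set[str]] = {}           # lowercase word -> urls

    def register(self, plugin: Plugin) -> None:
        self.plugins[plugin.name] = plugin

    def add(self, url: str, raw: str) -> None:
        # Run every registered extractor and index whatever text it produces.
        for plugin in self.plugins.values():
            text = plugin.extract(raw)
            self.outputs[(url, plugin.name)] = text
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(url)

    def search(self, word: str) -> List[str]:
        return sorted(self.index.get(word.lower(), set()))

    def replay(self, url: str, plugin_name: str) -> str:
        return self.plugins[plugin_name].replay(self.outputs[(url, plugin_name)])


def first_line(raw: str) -> str:
    # Toy extractor: treat the first line of the page as its title.
    return raw.splitlines()[0] if raw else ""


archive = Archive()
archive.register(Plugin("title", first_line, lambda text: "TITLE: " + text))
archive.add("https://example.com/parm", "Chicken Parmigiana\nStep 1: make the sauce")
print(archive.search("parmigiana"))
print(archive.replay("https://example.com/parm", "title"))
```

Only text that an extractor pulls out becomes searchable here, which is why getting subtitles and article text out of each page matters so much for search quality.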

Starting point is 01:15:36 That's why I think there's a lot of the possibility, the potential here is just tremendous if you can put it out there in the right way. I'm not saying the way you're doing it is wrong because you're iterating, right? You're trying to get to this eventual long-term really useful thing. Because if I'm an archiver and I do things well and it's useful to me and I can expose that stuff in some way, like the things that I think are important to me because of who I am or what I do or the way I think that adds layers of importance to the thing itself. It's not about the actual content and the archiving the content is one important aspect, but it's also what was archived, not what is it in the literal files or the content. It's like,

Starting point is 01:16:17 what was it? Who and why? Those are things that I think is like a sentiment layer that's just not out there really. And I think if you can find a way to expose that, you know what I mean? Then you sort of like get this aspect of like invitation into it, either as a consumer or replayer, as you've said, or somebody who's actually an archiver and joins in. Another really interesting idea that other tools have played with is preserving the context in which a page was discovered. Like, oh, I clicked these three links in a row from this Google search, and that's how I found this thing that I then decided to save. Like, saving that whole research chain of the URLs that you found is maybe interesting context, and that makes it more valuable.

Starting point is 01:16:59 Possibly. Possibly. It's like session replay in a way for a scenario. I can see how that adds context, but it's also complexity. Yeah, I don't personally see value in that necessarily, except for when I would see value in it, of course. It's like, how did I actually find this website? Oh, that's right, I was watching this, which I watched that, and that led me to this. And that's why I really don't mind YouTube's algorithm, honestly, because it's interesting how it knows what I want to check out in the future. And, like, my whole time, I was just full of chicken parmesan.

Starting point is 01:17:29 You know, it's like, it's endless. It's pretty easy for you then, I guess. It's easy. Yeah. It's easy. Yeah. YouTube doesn't have me. It doesn't have me figured out like it does you, Adam. I can just show you chicken parmesan, but I'm constantly mad at it. Is that right? That's a shame. Yeah. I get angry at it all the time.

Starting point is 01:17:42 Like I don't want to watch this. And I subscribed to somebody six months ago and you haven't shown me one of their videos in three months. And I forgot they existed. You know, I'm there too. I'm with you on the same anger point. You know, you should check out the Tweaks for YouTube extension. It's totally changed my relationship with YouTube. It lets you change the homepage algorithm. It lets you make videos faster than 2x. It lets you... What's SideQuest right now? What is this?

Starting point is 01:18:06 Faster than 2x? That's blasphemy, man. Come on. People create those videos for you to watch. That's right. 1x for life. Not all videos, only the ones that are very slow. I'm fine with faster than 1x, but faster than 2x?

Starting point is 01:18:19 Holy cow. That's not the only reason. They also hide a lot of clutter in the UI. It's basically infinite configuration options for YouTube. I love that idea. I will check it out. My problem with that is I experience YouTube in so many different contexts that aren't my computer. Yeah. My phone, my TV. Yeah. My computer, other people's things. So yeah, for sure. Anyways, off on a YouTube rant. You were going to say something and I cut you off, Nick. Well, I think back to the earlier concept of this index being sort of like

Starting point is 01:18:48 worth sharing as this collection together or worth sharing of like the what of the archive. I think that's a really important point. And the replayers, like one thing to think about is if you take this to its logical extreme and everyone archives enough content that they care about,

Starting point is 01:19:04 that the internet is broadly copied multiple times over, what's the point of hosting anymore? What's the point of hosting stuff on your own? Once you publish it and enough people have archived it, just stop paying for hosting. And people already use archive.org like that today. And it's kind of an interesting thought experiment to think about if this becomes the content distribution mechanism for the internet, what happens. But I also don't think that will happen. I think that in any social system, you have two ways to share things.

Starting point is 01:19:34 You can share by reference or you can share by copy. The internet right now is usually share by reference. You share a URL to something and it's referring to the original content hosted by the creator. SMS is share by copy. When you text someone, they have a copy of the SMS. If you delete it off your phone, it's not deleting the original content hosted by the creator. SMS is share by copy. When you text someone, they have a copy of the SMS. If you delete it off your phone, it's not deleting it off of their phone. Email is share by copy, BitTorrent is share by copy.
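
Editor's sketch: the difference between share by reference and share by copy can be shown with a toy model. The Host class and the example URL are hypothetical. The model only shows what each recipient is left with once the creator deletes the original.

```python
from typing import Dict, Optional


class Host:
    """The creator's server: the single place the content lives."""

    def __init__(self) -> None:
        self.pages: Dict[str, str] = {}

    def fetch(self, url: str) -> Optional[str]:
        return self.pages.get(url)


host = Host()
host.pages["https://example.com/post"] = "original post"

# Share by reference: the recipient only holds the URL.
shared_link = "https://example.com/post"
# Share by copy: the recipient holds their own copy of the content.
shared_copy = host.fetch("https://example.com/post")

# The creator deletes the post.
del host.pages["https://example.com/post"]

print(host.fetch(shared_link))  # the reference no longer resolves to content
print(shared_copy)              # the copy is unaffected by the deletion
```

Archiving turns a reference into a copy, which is what protects the reader from deletion and also what takes retraction out of the creator's hands.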

Starting point is 01:19:53 Discord is share by reference. You delete the Discord server, everything on it is gone. Even though it looks like messaging, it's not share by copy. It's kind of interesting to think about. I think most share-by-copy systems broadly will not succeed in taking over as being the content distribution mechanisms for the world. I think whether that's IPFS, whether that's BitTorrent, whether that's anything that's share-by-copy,

Starting point is 01:20:18 I don't think it's going to become the de facto way we share content simply because it deprives the original creators of the power to monetize or delete their content. You can't moderate, you can't get rid of CSAM once it's out there, you can't get rid of misinformation, you can't get rid of libel. Artists, musicians, creators don't necessarily want to publish on a platform where they lose control the moment they share something the first time. It's immediately copied millions of times, they can't ever retract it or ask people to pay for it. So I think archiving

Starting point is 01:20:45 is fundamentally limited in that societally, at the human scale, people don't want to shift to losing control over their content authorship. Yeah. And so people striving to make archiving do that, to, like, really replace all sharing of content by any other means, I think are a little misguided. And so it helps actually hone the focus a little more and make it easier to work on this problem, to not try to replace the entire internet, because that's where it goes quickly if you don't think it through. Well, I'm excited.

Starting point is 01:21:19 I think Adam's probably already got his Docker commands queued up. I think you got him, Nick. I'm a little bit more reserved in my, I'll wait until Adam sells me. He's going to sell me something. I'm already doing it, so it's like a better version of it, I think. It might help me organize myself. Yeah, this sounds like something that you're working too hard, and actually it's going to help you work less hard.

Starting point is 01:21:41 Yeah. So archivebox.org. I did see that you went ahead and took the... Or .io. I don't have the .org. My bad. Archivebox.io. Oh gosh, you're part of that crew.

Starting point is 01:21:53 Oh yeah, I have some regrets, but .com is too expensive, and .org, I wasn't a nonprofit when I first started, so I didn't... I was going to bring up the nonprofit. So you actually went ahead and went through the time and effort to get that done. So that's a step. I'm not my own nonprofit.

Starting point is 01:22:09 I'm a fiscally sponsored project through the excellent Hack Club Bank. Yeah. I see. So you took a shortcut. Oh, very cool. So does that provide you some leniency? Because you mentioned you're trying to decide,

Starting point is 01:22:20 you know, should you go nonprofit? Should you go profit? Do you have leniency because you didn't... because it's a proxy that you can change later? How does that work? No matter what, I'm going to have to be both. There has to be a non-profit component. There has to be a for-profit component. It's going to be a peer corporate structure relationship

Starting point is 01:22:39 similar to any company that does massive content re-hosting, like Archive.org, like OpenAI, like Mozilla, like Maps. Basically, you have a nonprofit and you have LLCs underneath it that do anything relating to money. The content is only ever hosted by the nonprofit, which is not earning revenue for it, but you can sell software that people use that contributes to that pool of content. And so the financial motivations are kept

Starting point is 01:23:11 separate. You're not incentivized to profit off of the copyrighted material, which I think is important because as this eventually grows beyond just me, I don't want to have sort of corporate structuring that is pushing it in the direction of destroying copyright. Gotcha. Anything else? Anything we, a stone we have left unturned? I didn't ask a lot of questions of you guys. I would love to hear more about your own personal backgrounds.

Starting point is 01:23:40 You know, have you ever inherited a big legacy collection of stuff from your parents or grandparents? Like, do you have any sort of personal interests? Just photos. Nothing digital. We're the first generation, I would say, probably, for Jared and I, in digital, right? Like, we have our parents in there, but by and large, for me at least, all my parents are dead. So, do you have kids now? I do have kids, yeah. Nice. What would you love to see them enjoy in 30 years? If, you know, they could only save, let's say, a couple hundred pieces of your digital life. Hmm.

Starting point is 01:24:10 His chicken parmesan. Yeah. Well, I always have fond memories of, I would probably say photos is probably the, the, the easiest one and videos, right?

Starting point is 01:24:19 And those kind of go in the same category. Yeah. Like personal videos. Yeah, definitely videos. I would say, if I kind of put them in the same lump, you know, it's the Photos app, everything in the Photos app. Yes. That's interesting. I think it's mostly memories, less like artifacts. I don't know, I haven't really thought about that, honestly. I

Starting point is 01:24:37 I do think that eventually my copied versions of my playlists that really feature chicken parmigiana or the best steak ever or the most amazing smash burger of your entire life, those three things in particular are staples in our household. You're gonna have to send me that last one. I'm a huge smash burger fan in the last few months. Well, you have to come to my house, because that's the best one. Sorry about that. And you're invited. I'll gladly make you a smash burger. I would say those kind of things I imagine my kids will want to take on because we make homemade marshmallows. We do interesting things for the holidays. And just generally, you know, we like to make our own food and we really appreciate that process. And I'm trying to get my kids to think about that kind of stuff more so and what goes into the food.

Starting point is 01:25:25 Like even so far as like making your own sauce. It's not because we're crazy. I'm like if I can buy that sauce for whatever and I can buy the actual ingredients for like one quarter of the price and I enjoy it better and I know what went in it, that to me is like an A plus for all the things. So yeah, I mean I would say those are the things. Things that point to those principles, not so much the things themselves. I think this YouTube playlist with my buddy Frank Proto might be... I say my buddy because I actually reached out to this chef literally recently. I'll tell you. This is really plus-plus content, but either way, I'll tell you.

Starting point is 01:25:57 So I call him a friend because he's a future friend. His name is Frank Proto. He's a chef. Okay? And I reached out to him on Instagram. I'm like, hey, I'm a big fan. I've made your pancakes. So pancakes from scratch.

Starting point is 01:26:09 I've made your spaghetti. I've made this and that. Big fan. How hard is it to book you for a podcast? He's like, not hard at all. That's his only response. But it's not hard at all. So long story short, a future Changelog podcast will feature a chef.

Starting point is 01:26:23 Amazing. Yeah. Chef Frank Proto. Check him out. Proto Cooks, I believe, is his channel, but he does some cool stuff. Anything he makes,

Starting point is 01:26:32 I will make. Frank's amazing. So I think those things are things that I appreciate, and I know my kids appreciate them because they have the second order effects of me making them for them,

Starting point is 01:26:41 and so they'll eventually appreciate where I've gathered my knowledge from. So I will eventually create my own recipe that is a culmination of 17 recipes, you know, a trick from here, a tactic from that, or these particular tomatoes from that person's recipe or where they got them at. Or if I want to spice it up, this is how I do it. You know, I get the simple version and the complex version and it's all cooking related, but I think that's probably the easiest answer I can give you right now, which is something related to cooking. Cooking is actually a shockingly popular answer to that question.

Starting point is 01:27:13 A lot of people, myself included, increasingly, as I'm starting the beginnings of a family. I'm winning you over, right? You're wanting to take on my... We can share a box, so to speak. Yeah, that's what my wife would love. Basically, photos, some news, and a lot of cooking recipes preserved. And also, some personal work portfolio is important to journalists, especially. I think a lot of people that do writing for a living

Starting point is 01:27:46 see a lot of their content sort of disappear when the publishers go bankrupt. So that's a common answer I get. Yeah, everyone has a really unique and interesting answer usually to that question of what do they want to save?

Starting point is 01:27:59 And then the alternate version, if you don't mind me asking one more follow-up, is now take away the 100 URL requirement, but now pretend you can't save any individual piece of content, but your kids will get a model trained on everything that you save with no limit. You could feed this model 20 terabytes of training data. What do you limit it to now? What do you want the model to have and what don't you?

Starting point is 01:28:22 That's TMI. No worries. Yeah, I'm also gonna, I'll pass on that one. Not because it's TMI, although that's hilarious, it's because I would have to think really hard about that. More food for thought for people to think about, because I think it sort of gets the gears turning on perspective. And yeah, it's an interesting question. I like that question. Yeah, I like that idea though. I like the premise of the question, not so much the answer I'll give. I like the idea of, you know, self, it's almost like knowledge for the future, and this LLM is an encapsulation of some version of, the obvious answer is like, you know, just copy my psyche, copy my entire who I am, you know, full on Ready Player One, actually Ready Player Two, with the ONI headset kind of thing and a replay of who I am.

Starting point is 01:29:09 That's the obvious best case, but that's so weird. Such weird implications. But also the victors write the history. You get a chance to rewrite your own history book. You can cut out all the bad parts. Let me give a different answer. I started to think about it more.

Starting point is 01:29:25 And I realized this is a false dichotomy. There's no reason that it can't be both. But my answer is spend way more time with your kids and talk to them about life, about what you think, about what you believe, and why you do what you do. And just spend a whole bunch of time with them. And you don't have to give them a model. They'll already have it. Well, it might not be for your kids. It might be for your kids' kids' kids. Yeah. Well, people go, people come and go, you know? Yeah. We don't have to like sustain our psyches into the future. Well, that really cuts deep to the

Starting point is 01:29:57 heart of archiving. Like that. I also believe this. I think that like death is an important part of life. It's sort of the recycling engine that really tests, is this idea worth propagating or not? Because if someone doesn't propagate it, then maybe it wasn't worth propagating. And that's sort of where I want people to go when they think about these ideas. It maybe seems weird coming from the archiving guy to be like, Oh, you know, don't archive so much mortality. Yeah. But I honestly believe this. And I think that there's some beauty in ephemerality. And that's why I want archiving to be really intentional. Because you are depriving the original creator of that decision to let death recycle their ideas by dragging their ideas, kicking and screaming into the next generation.

Starting point is 01:30:39 But we have to do it. There's a balance. There's a balance. It's the only thing that makes life exciting because what's old is new again to so many people because there's nobody to propagate forever. Right. You know, there is mortality, not immortality. And so I can have this idea, which I thought was mine,

Starting point is 01:30:55 but it's not. It's just recycled. Somebody else had it. It's recycled. Yeah. And it's only new to me because it's new to me. A deep note to end on. Yeah.

Starting point is 01:31:04 Perhaps. It was fun, I think. Archivebox.io, to clarify. Check it out, man. If you're jiving on this, we do have a Zulip. I'm sure there's an episode topic. Is that what they call it? Not channels, it's a topic. Hop in there, say hello. Nick, I see that you have a Zulip for ArchiveBox, so if you want to dig deeper in the community, go hang out there in Nick's Zulip for ArchiveBox, but also come in ours if you're not there already. changelog.com/community, and comment on this episode and say what's up and tell us what you're archiving or what you thought about this episode or

Starting point is 01:31:40 say hi to Nick if he's there. All that good stuff. Good times, Nick. Thank you. Yes, thanks, Nick. Thank you so much for having me. And I'll join the Zulip right away. I didn't realize y'all had a Zulip. Yeah, man. Heck yeah, man. Zulip for life. Well, this episode cuts a little deep, you know. It makes you think. What you archive is, in a way, what you're thinking. It's almost like search history, but not really. It wouldn't be hard to backtrace where you came from based on what you archived. Obviously, privacy plays a role. But I think the time unlock that Jerod mentioned is kind of interesting because at some point, it doesn't matter to you.

Starting point is 01:32:23 Contextually, it's gone. It's not relevant, applicable. You can't be persecuted or canceled necessarily. Maybe future generations can be, which is interesting to think about. But I think we all have a bit of an archivist in our blood, right? Anyone who's in tech, anyone who's in software, anyone who's in the development of software products has got to be a bit of a pack rat in some way, shape, or form, or someone recovering from. And this conversation around ArchiveBox, this conversation with Nick, has got me personally thinking about the things that

Starting point is 01:32:59 matter to me, digitally, of course, and the things that I see, get impressed by, get changed by, and they're important to me. And whether or not I would be sad to have not archived them or to be able to go back again to experience it or to share it with future generations. I love this conversation. I hope you did too. It's got me thinking. Okay, so archivebox.io, there is a bonus by the way. We went a little deep, one, maybe two, maybe three layers deeper, and we gave Nick some advice. We encouraged him and advised him on some different directions. And if you're not a

Starting point is 01:33:40 plus plus subscriber, hey, that just means that you end the show now. Okay? And that's cool. But it's not. Because you can easily go to changelog.com slash plus plus, become a paying subscriber. Ten bucks a month. A hundred bucks a year. You drop the ads. You get closer to that cool Changelog metal. You get bonus content like today.

Starting point is 01:34:02 And you directly support us, which is just the coolest thing ever. Honestly, I love it. I appreciate it. I know Jerod does too. But changelog.com slash plus plus. It's better. It's better because of all the reasons I've said, and you get today's bonus content with Nick, and that's a win. Once again, changelog.com slash plus plus. It's better. Okay, so some awesome brands support us, love us. We love them. You should love them because they love us

Starting point is 01:34:34 and we love them and all the things. Fly.io. Timescale.com. Wix. That's awesome. Wix Studio. The coolest thing ever. Wix Studio. And then, of course, our friends over at WorkOS. WorkOS.com, Michael Grinich and team.

Starting point is 01:34:54 So awesome. WorkOS. AuthKit. My gosh. They're killing it. And, of course, the beat freak in residence, Breakmaster Cylinder. Man, the beats are banging. Thank you, BMC. Thank you. Okay, that's it. The show's done.

Starting point is 01:35:10 We're off this Friday because, hey, Thanksgiving. Enjoy your family. Enjoy your time away. We are. And we'll see you next week. Peace. Thank you. Not that I'm suggesting a rename, but because you, you, the .org is so expensive, a name adjacent, and it might be a terrible play on words, but a good play on words, is instead of ArchiveBox, what if it was Archive Machine?

Starting point is 01:36:16 And then ArchiveMachine.org is available right now for $10. Just saying. So you haven't been entrenched enough where a name change might be impossible. It is available and you are pursuing a nonprofit future kind of thing. And you also have the Wayback Machine. So it's sort of like adjacent to what people already might know. And so this is the archive machine that may power the Wayback Machine of your life kind of idea. And the .org is available literally right now. Cool. Yeah, ArchiveBox actually was a suggestion from a community member, Filippo Valsorda,

Starting point is 01:36:50 who's been a longtime supporter and an interesting crypto guy. We know Filippo. Yeah, he's great. He is awesome. He's the longest term supporter of ArchiveBox from the very beginning, has been reliably donating 20 bucks a month.

Starting point is 01:37:04 And I know him from Recurse Center in New York. But yeah, I think either he or someone right after him in the same conversation thread, we were brainstorming name ideas. It's funny that you offer that, Adam, replacing the box with machine, because as you were describing some of the, what I would say, brand hurdles of us understanding the current value of something like this, I thought maybe the word archive was the one.


Nick Sweeting joins Adam and Jerod to talk about the importance of archiving digital content, his work on ArchiveBox to make it easier, the challenges faced by Archive.org and the Wayback Machine, and... the need for both centralized and distributed archiving solutions.
