The a16z Show - Preserving Digital History: How to Close the Web's 'Memory Hole'

Starting point is 00:00:00 Hi, welcome to the a16z podcast, I'm Zoran. Conventional wisdom has it that most everything important is accessible online and that it will exist forever. But in fact, more than 98% of the information on the web is lost within 20 years, and huge gaps exist in our cultural history. This includes published works like news reports, books, music, and video. a16z investment partner Alex Pruden and I interviewed Brewster Kahle and Sam Williams, who are using different approaches to attack this problem. Brewster co-founded the Internet Archive. It's well known for creating the Wayback Machine, which crawls a billion URLs every day. Sam co-founded Arweave,

Starting point is 00:00:37 a company that uses decentralized crypto networks to store information forever. For both of them, this issue has implications that go far beyond just data storage. It touches on issues of censorship, government manipulation of information, and how historical context is necessary for well-functioning societies.

Starting point is 00:00:53 We talk about the challenges of trying to preserve digital history in an era of deepfakes and fake news, how decentralized models like Arweave can fit in, and what types of economic models will help sustain data preservation. We also discuss base layer protocols versus archive layers and evolving standards for the cryptographic hash functions

Starting point is 00:01:12 crucial for tagging in metadata. But we begin with Brewster and how he started his quest to preserve digital history. The idea is to build the Library of Alexandria for the digital age. Could we go and make all the published works of mankind available to everybody and in computer readable form so we could build a global brain, kind of building on the ideas of Vannevar Bush and Ted Nelson,

Starting point is 00:01:34 which even by 1980 when I started on this project, it was kind of assumed that it was going to be done or done already. So I thought, okay, let's just go and do that. A lot of people would say we already have the Internet, all the knowledge that we need is already there and accessible. Why is that true or not true? Oh, it's actually all fairly thin. I mean, there's a Wikipedia page on all sorts of things,

Starting point is 00:01:56 and there's certainly all sorts of commentary about current events. So the good, bad, and the ugly out there. But a lot of the actual published works aren't there: books, music, video. It's astonishing how little of it, or the 20th century is largely not there. Or if you take web pages, they only last 100 days on average before they're changed or deleted. So we're building our culture on shifting sand. So I think it was a cruel joke to call web pages pages, as if they were like a Gutenberg Bible that would last hundreds of years if you didn't pay attention to it.

Starting point is 00:02:31 It's just not true. This is all shifting. So we needed to build a library now that people are publishing with this new medium. And so it's kind of notable here we are in 2020 and we seem to be living in this age of fake news, I guess, where the truth itself can seem slippery. People can't even agree on what sources of info are reliable. What does that mean for the Internet Archive? Is this kind of a unique period in history as you've been doing this for the last four decades? And what are the challenges that that presents for someone like you trying to preserve knowledge? I think it's showing the need for more and more context around what people are looking at, being able to reference real materials.

Starting point is 00:03:07 Really, the web grew up in parallel and separate from, say, the book world. The book world just staunchly, you can go and buy a book from Amazon, or you can download it on your Kindle, but the web and books never grew up together. So we're trying to make all the footnotes, the references in Wikipedia, live links. So if it's a book and it's got a page number, it opens right to the right page. Or just take Wikipedia's footnotes

Starting point is 00:03:34 that go to other web pages. We started based on talking with Katherine Maher, the exec director of Wikipedia. She was worried that truth might fracture based on basically bad citation availability. So we went and made a robot to fix broken links in Wikipedia.

Starting point is 00:03:50 We fixed 11 million broken links. So the idea is to try to make this good idea of the World Wide Web into something you could actually live on and depend on. And in this era of fake news that's really grown up throughout the web, it shows how fragile information is if you don't have an ecosystem of references, context, reliability, organizations you can believe in. So we're trying to play a role in hooking these things together into this wondrous thing, which is the World Wide Web. One of the reasons I'm personally so excited about what you and Sam are doing is that I firmly believe that human progress is largely dependent on our understanding of our own human history.

Starting point is 00:04:36 And I think this notion of how we study history and how we understand events that happened in the past, this is called historiography, I think is an absolutely critical and often underappreciated part of that. I think it gets to kind of this, maybe this is the same as when you were referring to it as a global brain. But I guess can you in your mind describe maybe the benefits, maybe tangible benefits that we have by having this better, more granular understanding of events exactly as they happened on the web in the past? Well, let's just take there's going and understanding the World Wide Web's history so that politicians can't just erase their past or, I don't know, when you have pundits that want to go and now be in government, they just want to have erased their past. And if you just make it so that they can do that, then people just feel much more free to just make things up. If there's no accountability for what it is you've said, you've done, you've published, if you can just erase it, then that's a really scary time. It gets into the whole George Orwell rewrite-history thing, which we're seeing more of. And as the governments around the world are swinging right in a very hardcore way, and we're losing newspapers, we're getting more and more control and balkanization of the internet.

Starting point is 00:05:48 But let's just take the 20th century. The 20th century documents from the 20th century are largely not online. Copyright in the United States ends in 1924, so if it's published before 1924, it can be public domain. And so our book project, we digitize and make them just publicly downloadable. Google's less so, they kind of still bound up the public domain. But at least a lot of materials pre-1924 are available online. But after that, if you look at the graph of what's digitized and put on the Internet Archive book collection, it goes up to 1924 and then craters.

Starting point is 00:06:23 And it's basically almost nothing through to the 20th century. And it's not because those are all in print, because if you try to go to Amazon and buy them new, they're not in Amazon either. So the 20th century, we're bringing up a generation without access to the information and what happened during the 20th century. It was a very impactful, very important century.

Starting point is 00:06:43 And if we don't learn from it, if we can't even reference it, then people can go and say, oh, this didn't happen, or it really happened this way, or actually people were really cheerful about going into the Japanese internment camps. It's like, it's just not true at all.

Starting point is 00:06:59 But if you don't have referenceable materials either through the web era or back through the 20th century, you can just make things up. And we're suffering from that right now. Sam, you're working on, or you're driven by some of these same kind of ideas, I think, and you're doing it in a slightly different way

Starting point is 00:07:16 than what Brewster has done. Talk a little bit about the decentralized web and, as you call it, the permaweb with Arweave. Yeah, I mean, so we definitely started with some of the same ideals. Like, the real question we were trying to answer when we started Arweave was, how is it that we can close the memory hole, as George Orwell puts it in 1984, this idea that a politician can instruct someone to take a record of history and then delete it such that nobody could ever prove what happened in the past.

Starting point is 00:07:43 Yeah, and so we were thinking about how, well, we were thinking about distributed networks at the time, and then we thought about how they could be used in practice. And one of the things they do exceptionally well is replicate data around the world with no single central point of failure. And that's great. But of course, in a typical blockchain system, you could only store a very small amount of data. So with Arweave, we essentially, yeah, built it to address that question. How do we make it so that we can store very large amounts of data trustlessly in exceptionally

Starting point is 00:08:09 large numbers of places across the world so it can't be deleted by, well, all kinds of people, in fact. So that's what we were going for. And then when we built this, which was mainly driven by this question of how are we going to make verifiable copies of news archives, we realized that the technology underneath was essentially just a permanent hard drive that could really be used for storing all things that should be permanently archived. And then subsequently, yeah, from there we realized, well, if you make that data storage available inside a web browser, what you have is a version of the web, except all of the data inside it is permanent. And this goes to address some of the problems that Brewster was talking about with the fact that, you know, web links last on average 100 days.

Starting point is 00:08:52 And I think what's really striking, 98.4% of the content on the web is lost within 20 years. And 98.4% is essentially the entire thing reset, which is really incredible given that we're embedding it at the core of our societies as the fundamental way of transmitting information between people. So we see with the permaweb that you essentially have a system by which to address this. You make an endowment structure which pays for the storage of information forever in the network where there is no single centralized point of failure. And you essentially make an open business model that anyone can offer their hard drive to and get paid for offering storage such that there is no centralized authority that needs to manage that network either. So that's what we're working on with the permaweb.

Starting point is 00:09:37 I think the question that potentially comes up sometimes when people discuss the Internet Archive is the sustainability question. In other words, if the Internet Archive goes away tomorrow, who is going to continue to do this important work? How can we make sure that this important work is continuing to be done? And so, yeah, we'd love to hear your thoughts on kind of the approach that you're taking and then this kind of sort of different approach that Sam is taking. Oh, yeah, no, I think there should be lots of approaches.

Starting point is 00:10:02 There should be lots of libraries. So the Internet Archive does play a major reference role by building the Wayback Machine. We crawl about a billion URLs every day now, which is kind of amazing to me, a billion a day. It's a total of maybe 800 billion in the Wayback Machine collection at this point. And it's getting large. So we try to collect a copy of every web page every two months

Starting point is 00:10:28 as one robot's sort of mandate. Another is collecting the top web pages. That's just the home pages plus one. We have active news collecting going on, but we also now have 600 organizations, about 1,000 librarians, that are directing particular crawls to go and make sure that we've got good collections in their particular subject expertise. And lastly, there's a service called Save Page Now. So if one goes to web.archive.org and puts in a URL,

Starting point is 00:11:01 it will archive it and hand you back an archival URL right then. And that's being done now at 80 per second. So we're starting to get woven into the web. In the Brave browser, if you hit a 404, it will ask the Wayback Machine, is there a copy of that, and then offer that to you. So it's kind of being woven into making the web a more reliable piece of infrastructure. The Internet Archive stores it in a couple places spinning, but also has partial copies in Canada, in Amsterdam, and in Alexandria, Egypt.

Starting point is 00:11:37 So the idea is to have multiple copies in multiple places. And that's not to mention all of the libraries that we're working with that go and get copies of their national collections from the Internet Archive when they contract with us. So, for instance, when the Library of Congress contracts with the Internet Archive to collect it, they bring a copy home with them. So I like the concept of lots of copies keeps us safe. And what Arweave is trying to do is use more decentralized technologies, which I think is a great idea. The decentralized web has been an initiative of the Internet Archive for several years. There's now been many gatherings. There's a dweb.archive.org, which is a prototype trying to make the Internet Archive, but decentralized. And that leverages

Starting point is 00:12:24 WebTorrent, IPFS, and GUN to go and build a decentralized Internet Archive. Again, still really early days. Very exciting project. And Arweave is part of that whole milieu of decentralized web technologies. Sam, we'd love to hear kind of your thoughts on, and maybe a specific example of, how Arweave solves this problem that we've described. And maybe, I think Arweave is different in architecture, certainly, but the same goal. Yeah, I mean, I wouldn't really say we're trying to solve the same goal. I see Arweave as like a fundamental base layer protocol, which is offering a service to anyone of any kind that wants to purchase it, which is simply permanent information storage. Whereas perhaps, Brewster, you could correct me if I'm wrong, but really the Internet

Starting point is 00:13:11 Archive is like an archive layer. It's the kind of thing built on top of the technology. So it's deliberately built to be a library. And that's one of the things you can do with Arweave, but Arweave is really just focused on basically building this hard drive that never forgets. But in terms of how you can use this in practice in environments where censorship is rampant, say, yeah, there's a really good instance from just the last month, actually. So we noticed that after the death of Dr. Li Wenliang, there was a substantial outpouring of, really, an outcry for free speech inside China on Weibo, which is basically unheard of. This is a kind of breakthrough event that requires a very large number of people to work in unison, which is very difficult to organize under

Starting point is 00:13:54 a kind of authoritarian regime because everybody is scared, right? But once enough people get together and start speaking all at once, the critical mass is such that everyone else can also start to speak. And so for about three to five days, I think it was, there was a pretty large amount of dissent in China over free speech issues. So that was very interesting, we thought. And we figured that we could very quickly just build a small scraper that could go around Weibo and collect up these artifacts, commit them to the essentially collective permanent memory that we're building, distribute them around the world, and also make them available inside China in a censorship-resistant way. And so we got this running very quickly.

Starting point is 00:14:34 And a member of our community thought, oh, wow, that's really interesting. So he built essentially a UI on top of this that not only showed all of the posts on Weibo that had been stored, but also went and detected when they had been censored inside China and then highlighted them inside this UI. So now you can go to this webpage, I think they call it Weibo Uncensored. And you can press this button that says, show only censored. And this works inside mainland China. And it'll show you all of the things on Weibo, the posts of course, that the Chinese government are attempting to censor. So it actually has this sort of Streisand effect

Starting point is 00:15:08 where it further highlights the information that they were attempting to censor because they were censoring it. And of course, there are hundreds and hundreds of Arweave nodes. So it's extraordinarily difficult for the Chinese government to come along and try and actually block access to that. And those nodes are shifting and changing every single day. I'm curious about this question of centralization and, Brewster, when you think about the Internet Archive and centralization versus decentralization, are there challenges around having single points of failure or single points of contact versus having it dispersed the way it is in an Arweave kind of model? Oh, I think you want both. I think there's reason for having sort of the dot-com sector,

Starting point is 00:15:46 the dot-org sector, dot-edu, and working together well. I think they offer different perspectives and points of view. I think the top-level domain names of the internet are kind of an interesting decomposition of the different types of organizational structures and what they're for and how, if we can get them all to cooperate well, then that can work very well. So the Internet Archive, by having partners all over the world that are in these established institutions, that is a different form of decentralization. So I think we're in early days in terms of actually how to go and pass around information peer to peer, but terrific progress is being made.

Starting point is 00:16:28 Basically, with hashes of files, you don't even need to care where you're getting it from because you know if you get a file that matches that hash, that you have the data you're looking for. So there's, with this cryptography that is now commonly available, which wasn't really available to Tim Berners-Lee when he was building the web, JavaScript makes it so your browser can be a first-class node

Starting point is 00:16:54 on the internet. You don't have to depend on servers. That's a tremendous step forward. And we now have digital cash that allows people to be able to pay for things, which has been a big missing piece on the World Wide Web that has really driven us towards advertising models, which have real problems. So I'm very excited about some of the technologies being brought to bear on some of these cultural areas. So yes, I think there's reasons to have these different systems work together, have the more traditional server-type structures be actively working on decentralization, and having the decentralized organizations, ones that are sort of born decentralized,

Starting point is 00:17:36 start to go and bridge to the more server-based systems. Just to add on to that, I think particularly of interest. So hashing is fantastic in terms of just being able to identify a piece of data at a later point. But one of the things that really got us excited when we were starting Arweave, I think, was how you could use timestamping of data, verifiable timestamping of data using blockchain or blockchain-like ledgers that involve a proof of work or proof of stake mechanism, to ensure that a piece of data existed at a certain point in time and hasn't been altered since that time. And so when we started to work with the Internet Archive data, it was natural that we looked at these torrents that you guys produced.
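The two ideas in this exchange, identifying data by its hash and timestamping it in an append-only ledger, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the Internet Archive or Arweave implement either idea: `content_id`, `fetch_verified`, and `TimestampLog` are hypothetical names, SHA-256 stands in for whichever hash function a real system uses, and the toy log has no proof of work, proof of stake, or replication.

```python
import hashlib
import time


def content_id(data: bytes) -> str:
    """Return the SHA-256 hex digest, used here as the data's identifier."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(expected_id, peers):
    """Ask untrusted peers in turn; return the first payload whose hash matches.

    `peers` is a list of zero-argument callables standing in for network
    fetches (an assumption made for this sketch).
    """
    for fetch in peers:
        data = fetch()
        if data is not None and content_id(data) == expected_id:
            return data
    return None


class TimestampLog:
    """Toy append-only log: each entry commits to the hash of the previous one.

    Altering an old entry breaks every later link. There is no consensus
    mechanism here, so this only illustrates the chaining idea.
    """

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _link(prev, data_id, timestamp):
        return hashlib.sha256(f"{prev}|{data_id}|{timestamp}".encode()).hexdigest()

    def append(self, data_id, timestamp=None):
        prev = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        ts = time.time() if timestamp is None else timestamp
        entry = {
            "prev": prev,
            "data_id": data_id,
            "timestamp": ts,
            "entry_hash": self._link(prev, data_id, ts),
        }
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute every link; False means some entry was altered."""
        prev = self.GENESIS
        for e in self.entries:
            expected = self._link(prev, e["data_id"], e["timestamp"])
            if e["prev"] != prev or e["entry_hash"] != expected:
                return False
            prev = e["entry_hash"]
        return True
```

In this sketch a reader only has to trust the identifier and the log, not whichever peer happened to serve the bytes, which is the property both speakers are describing.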

Starting point is 00:18:18 And so subsequently, we've just been archiving very large volumes of these torrents over time. And now you can go back and you can say, okay, so this is a piece of data from the Internet Archive collection that is verifiably unchanged since it was added to, since the torrent was added to the archive, you know, like 18 months ago, whenever it was. And so I think that's a really compelling and exciting way that we can use these technologies. Yeah, no, that's great. And one question I have for both of you, and I think so far we, I think we've kind of discussed at a high level or a philosophical level why this is an important problem for people to be thinking about. But, and Brewster, I know you also, you sit on the, I think, I believe you're on the board of the EFF. And so I'm sure you're used to having this conversation, you know, you're probably trying to convey to folks, hey, why should you care? So I'm curious for both of you. What is in your mind the best application or the best use of the technologies that you both are building?

Starting point is 00:19:10 Like, how might the average person relate to this in a way that they can understand or apply in their own lives? Oh, gosh. Well, I think of it as a library. And so there are some of us that are old enough to remember what libraries did. So it's physical libraries. But I think the thing that most people use the Internet Archive for is the Wayback Machine, the free service on archive.org to be able to see past web pages and be able to get to things. So there was a pronouncement by the son-in-law of the president that the national stockpile of medical equipment wasn't for the states. And within two hours, the homepage of the National Stockpile website changed its mission and took out states and Indian territories, other territories and Indian reservations. And so somebody found this, and they used the Wayback Machine's diff feature to go and surface the new page and how the diff, you know, sort of a redline version of how did it change from a couple hours before that public pronouncement of the change of what the federal

Starting point is 00:20:16 stockpile was for. That's, I think, an example of why you'd want an archive of the World Wide Web. And there's similar examples because we're also archiving television of going and saying, no, no, no, he really did say that, actually, repeatedly. To be able to think critically, I have kids in school. So to think critically, you need to be able to quote, compare, and contrast. You need to be able to go and put your hands on something and be able to quote it. And then you can compare and contrast it with other things. If you can't quote what has been said or what was pronounced, then you can't really think critically. It just flows over.

Starting point is 00:20:57 And if it flows over, then there's no accountability and people feel, well, free to make things up. And what's going on is people are making things up. And I think people have this impression that everything is on YouTube now. It's all out there. Don't worry about it. Everything's preserved. But that's not really true. There's these massive gaps, isn't there?

Starting point is 00:21:15 Oh, yeah. And YouTube also takes things down all the time. Before YouTube at Google, there was Google Videos, and that had 6 million videos. And they just decided to turn that off. So I had archived those. So those are on the Wayback Machine. There was Yahoo Videos.

Starting point is 00:21:29 Those are all gone. Apple had a self-hosted site for people that own their equipment to go and make web pages and the like. And that was called MobileMe. And they just turned that off. So there's whole swaths of the Internet that just go away. And if people just get used to things coming and going and not being reliable, then we'll have people just start to not know what to depend on and what to believe.

Starting point is 00:21:55 So this is not a good state. The Wayback Machine is a truly incredible thing. But actually so few people understand about how much else is stored inside the Internet Archive. Like it's really incredible, the amount of interesting data that is stored there. But most people, I think, just see the Wayback Machine essentially. So it's definitely worth exploring. In terms of Arweave, I think the obvious first use cases are two, really. If there's something that you would like to say that you would like to make sure will be around for a very long period of time, like a blog post, a thought you had, something like that, you can just upload it to one of the blogging platforms on the network, and it'll just perpetuate it essentially forever, backed by a sustainable endowment economic structure.

Starting point is 00:22:37 So that's very valuable. And on the other side, you can also just use it as an accountability ledger for anything. So if you make a contract with someone, for example, or you receive an email and you make a deal over this email, then you can just forward it to a certain address, which someone in the community has set up, and it will timestamp it in Arweave, and you'll be able to come back to it later and prove verifiably that you haven't tampered with that data in the intervening period. And this is very valuable when we think about things like people sharing screenshots of stuff, which people do all the time.

Starting point is 00:23:06 Those screenshots have no verifiability. And quite often we see that they've actually been faked. So with Arweave, you can totally circumvent that problem. That would just be scratching the surface, I guess, but definitely a first use case. When I excitedly describe Arweave to folks, as I often do, one of the applications that I often cite is one that a community member has made called FeedWeave. And I think that one I find resonates with folks, just the idea that you can essentially replicate a social network graph on Arweave.

Starting point is 00:23:37 And the implications of that mean that it's easier to switch between or view that graph in different ways. Yeah, definitely. I think one of the deeper implications of this system is that it's essentially a massive open database that any application can read to and any application can write to, sorry, read from and write to. And that means that you can build applications that are purely composable in nature. So yeah, someone made FeedWeave.com, which is a simple blogging platform. But of course, you can then build on top of this and you can say, okay, well, I would like to make a version of FeedWeave that is only for people that are mentioning COVID-19. So somebody has built a COVID-19 quarantine journaling app on top of the Arweave network.

Starting point is 00:24:16 Just actually today, I think it was released. And of course, if you lay out the tags correctly, you can get intercrossing views of those data across the different applications. This also highlights another feature, which is if you have these permanent web pages, the applications stored inside those web pages can, well, they essentially persist forever, and that has really interesting properties for the user, because it essentially means that if I come along to say, I don't know, WeaveMail, which is essentially an emailing service inside this network, I will always be able to get back to WeaveMail as it exists today in the future.

Starting point is 00:24:49 There is simply no way that the author can come along and change it and subsequently perhaps add adverts or make it sell people's data or something like this. So if you've audited and you understand what the application is doing today, you can have very, very strong guarantees that it is going to do that tomorrow or really any time in the future. And when you compare this to typical Web 2.0 services, the contrast is pretty stark. I mean, everyone remembers when they first used Facebook or they first used Gmail, and they think about the way they perceived those services at the time. And now, in hindsight, when we can look back and see the essentially abuses of privacy that have happened on those services, we wonder, well, would I have

Starting point is 00:25:28 signed up for it if I'd known that is what they would have eventually done, if that is what the service would have eventually become? And the answer is in some cases quite likely no. However, with a permaweb app, you just don't get that problem. So with WeaveMail, it works like it does today, and it will work that way tomorrow, and in 10 years and in 20 years. I'm really glad that you brought up the fact that most web applications

Starting point is 00:25:50 are ultimately just wrappers, UI wrappers around a database. And I think that the point that you're making is that when it's a company that owns the database, it's not, you know, it's a proprietary database. It's difficult as a user to take your part of the database, the data that's relevant to you, and move it to another application. But in this paradigm that you're describing, that's not the case, right? It's essentially frictionless. And therefore, much easier for people to, as you say, if they decide that at some point,

Starting point is 00:26:18 their privacy is not being respected or they just don't like the application, it's easy to switch. They don't have to make the tradeoff that, like, well, I have to leave all my data that's relevant here and start new somewhere else. Right. Well, there's actually kind of two sides of things, right? So it's the fact that the database itself is open and available for anyone to build on. So it's as if every single application has an API by default and by necessity. So it has to have an API in order to function. And that part is really valuable. And it's kind of interesting in the Arweave community, we have two camps about this.

Starting point is 00:26:48 It's one of the great things you get with decentralized communities like this. Some people, and I'm definitely on this side, I would say, think that the fact that the UI itself that accesses this data is permanent is important, and other people are just more focused on the database side. But certainly, yeah, both are very interesting. Brewster, I wanted to ask you actually about community, because Sam mentioned community and how big a part of the decentralized web community is.

Starting point is 00:27:12 Blockchain protocols tend to build networks and communities of early users who become sort of evangelists. And it strikes me that with the Internet Archive, there have been people along the way with you, whether they're volunteers or other advocates who have helped. Could you talk about community as it relates to the preservation of knowledge? If we don't really have a desire to keep things around, it will just fade away.

Starting point is 00:27:35 Take the example of the Library of Alexandria version one, which ran for 500 years, so it was a pretty good run. But it was a joint project, basically, between the Greeks and the Egyptians. And they built just an amazing center of knowledge for the ancient world. But by 200, 300, 400 A.D., the concept of having universal knowledge, or an encyclopedia, it started to fade as a concept with the rise of some of the newer rebellion movements, really, which was the Christians.

Starting point is 00:28:09 It was not supported as much. And there's only about eight pieces of papyrus that we believe exist today that were in the Library of Alexandria of that time. That was told to me by Dr. Serageldin, who runs the new Library of Alexandria in Alexandria, Egypt. So we need to keep a culture wanting to have information available. So how do we do that?

Starting point is 00:28:37 There's going and having lots of people buy in. I think that's your community question. And I'm really quite happy that last year 100,000 people donated money directly to the Internet Archive. And then there's lots of people that have contributed in other ways. But I'd say, you know, to your point on decentralization, people are supporting it by going and doing things on their own that may leverage the Internet Archive,

Starting point is 00:29:03 may not leverage the Internet Archive, but we all take this concept of legacy and memory very seriously. I don't think it's just a technological hack that we need to do. There are financial components, there are organizational components. And being able to share is something I find difficult for people to adjust to, especially within commercial enterprises. But every which way, people don't like to share in bulk. So how do you go and encourage that and build it in, as I think you talk about, into

Starting point is 00:29:34 the protocol itself, but make it so that there's an ongoing interest in having materials live? That, I'd say, is one of the key parts of the sociological architecture, as opposed to just the technological architecture. Yeah, what you just said reminds me of one of my favorite quotes. I'm a big science fiction fan, and Kim Stanley Robinson, who's a relatively famous sci-fi author, frames technology as not just the physical, you know, not just a computer or the way that bits and bytes move around based on the physics of electricity. Democracy he defines as a technology, for example, right? It's these social innovations that actually collectively can enable us to do things at far greater

Starting point is 00:30:18 scale, enable us to coordinate at far greater scale. And so I think broadening the frame there, I completely agree with. And I do think sometimes it's underappreciated in our current world, where technology kind of implies that it's a piece of software or hardware. And of course, it could be a blend of both. Take cryptocurrencies and digital tokens, for example. Not only do they rely on cryptography and computer science, technology as we kind of traditionally think of it, but they also have an economic dimension, like the game theory that underpins the security of that model, and even political and social dimensions

Starting point is 00:30:52 when it comes to governance of these protocols. And those economic, political, social dimensions are equally important parts. Right. There are new technologies. I love the sort of token-based things. That's completely great to really sort of focus a lot of energy, right? Money is gasoline. So that's tremendous to go and leverage these new technologies for having people coordinate their activities through contributions to these token sales. It's exciting to see this type of energy going into these sort of more traditional topics of cultural preservation. I think there is a slight detail here, which is that technology itself can be money. It's not such that the money has to be separated from technology. And in the way that you point out, Alex, about democracy being a

Starting point is 00:31:40 technology, really what we're talking about here is just mechanism design, right? They are incentive structures that we can produce that push rational humans, essentially, towards some kind of behavior that we prefer. We see the same in democracy and capitalism: essentially, we have adopted competition, helpful competition, at the basis of these things in order to produce the best outcomes we can, or at least in theory we can, at the global layer. One question that I have for both of you, actually, is, you know, I think I've read that the amount of data being produced by the human race has increased exponentially over the past couple of decades, particularly as the World Wide Web has come online and more and more

Starting point is 00:32:23 people are using it for various things. I'm curious how you view the sustainability of your efforts: how do you keep up in this race, where more and more data is created and therefore there's more and more to be archived, and can this permanent hard drive keep up in terms of storage space? I'm curious how both of you think about that. Because money can be technology, we just built a kind of technological money that creates this endowment structure to pay for storage going forward. So we essentially financialize the lowering cost of storage over time at a very, very conservative rate to ensure that we have this. Yeah, it's really an endowment in a traditional sense with a couple of modifications that allows us to pay for storage from the interest from

Starting point is 00:33:08 that principal that we put up front when we add a piece of data to the network, rather than actually taking away from that upfront payment. So yeah, I think these are, I guess, differences in approach. I mean, really, the Arweave approach is, in some sense, more like a traditional business model. It's simply decentralized rather than being a not-for-profit structure like the Internet Archive. And your question is about sort of how do you scale archives when the amount of storage that's offered to people producing the things is going exponential. I think the Internet Archive is about 60 petabytes of data at this point. And it's just what web pages look like. You talk to the folks at AWS, they don't really think about petabytes, they just think about exabytes,

Starting point is 00:33:51 which is kind of awesome. But that's not where we are. So just being able to record what the web looked like would be terrific. And it's sort of the promise of the decentralized web. And where Arweave is part of that whole sphere is, can we make the web itself sort of a self-archiving structure? So instead of going and thinking there's a website that's over there and you have to then make a copy of it, snapshots at points in time, can it be more like how Git works, so that you have a website that evolves over time, but you can go and replay it back to whatever point in time it ran? So you can even fork it.

Starting point is 00:34:32 But that's not just the code of the website, but the actual data within it. And have that data, basically, and the code itself for the website live in many places, not just one. It's too fragile to have things in just one place. Just the idea of a publisher being able to just take a book away and then have it just go away from all the shelves of all of the libraries is kind of dystopian. But it's exactly what the lobbyists for the publishing industry and the authors want to have happen: every reading event is a permissioned event.

Starting point is 00:35:07 So it can be turned off at any time. And you can understand why people want that. So think of it as sort of Netflix, right? That Netflix is the only place that you can go and see movies. If that were true, then there's too much power in the hands of Netflix to go and make, oh, some documentary that is not in favor anymore just disappear and nobody can see it again because people hadn't made decentralized copies. So the decentralized copies approach, especially if it's built right into how a next generation web works,

Starting point is 00:35:40 that would be a better system than the kind of kludge, which is the World Wide Web, and then trying to back it up separately. So I really applaud the efforts of the decentralized community to try to make it so this is just how the next generation technology works. Okay, so I'd love to get your input on something, Brewster, actually. One of the conversations we have a lot in the decentralization community is about content addressing, right?

Starting point is 00:36:07 So should a piece of content be addressed by its location, as it is on the current web, or should it be addressed by, essentially, the hash of the piece of content itself? Or the way that we see it at Arweave is that, I mean, no piece of information on the web is just a piece of information. There's much more to it than that that's interesting. There's when did it get there, who put it there,

Starting point is 00:36:31 and also other kinds of metadata. It's very strange to us that on the web at the moment, there are no metadata tags available for stuff. You just get a web page. And so, yeah, essentially what we've done with Arweave is just bake in these tags that can be of arbitrary type. And at the beginning, this was really just to add small amounts of metadata. But then we realized the community was picking this up and essentially using the whole network like a massive document store that had these sort of tags associated with stuff that they could query. And then you get these very complex web applications built on top.
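The idea Sam describes here, addressing content by its hash and attaching queryable tags, can be illustrated with a small sketch. This is not Arweave's actual data model or API; the `TaggedStore` class, its method names, and the tag names are hypothetical, and everything lives in memory purely for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class TaggedStore:
    """Toy in-memory store (hypothetical): items are keyed by the SHA-256 of their bytes."""

    items: dict = field(default_factory=dict)  # content id -> raw bytes
    tags: dict = field(default_factory=dict)   # content id -> {tag name: tag value}

    def put(self, data: bytes, **tags: str) -> str:
        # The identifier is derived from the content, not from where it is stored.
        content_id = hashlib.sha256(data).hexdigest()
        self.items[content_id] = data
        self.tags[content_id] = dict(tags)
        return content_id

    def get(self, content_id: str) -> bytes:
        data = self.items[content_id]
        # Re-hash on read so a corrupted or swapped item is detected.
        if hashlib.sha256(data).hexdigest() != content_id:
            raise ValueError("stored bytes do not match their identifier")
        return data

    def query(self, **wanted: str) -> list:
        # Return the ids of every item whose tags include all requested pairs.
        return sorted(
            cid for cid, item_tags in self.tags.items()
            if all(item_tags.get(k) == v for k, v in wanted.items())
        )


store = TaggedStore()
page_id = store.put(b"<html>hello</html>", content_type="text/html", app="demo-blog")
note_id = store.put(b"plain note", content_type="text/plain", app="demo-blog")
print(store.query(content_type="text/html") == [page_id])
```

Because the identifier depends only on the bytes, the same item can live in many places and still be referred to by one name, while the tags carry the "when, who, and what kind" context that a location-based URL does not.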

Starting point is 00:36:59 So I guess, yeah, I wondered if you had some thoughts on that. One of my mentors is Bill Dunn. He started Dow Jones Electronics in the mid-1980s. He's completely inspiring. He said the metadata is more important than the data itself. It's kind of a cool idea. It was really ahead of its time. I mean, you think of how does Google work, and it's all metadata.

Starting point is 00:37:20 It's not actually the information in the document. It's often anchor text and usage patterns and all of that. And so Bill Dunn was way ahead of it. He actually coined the term metadata, which is kind of great to know that somebody actually came up with that and who it was. So, I think, context, which is another way of putting that, is absolutely important. And context shifts over time.

Starting point is 00:37:44 So often you need a mechanism of having materials and then being able to find what people are saying about them. So being able to query based on a hash or something, some sort of persistent identifier, even if the material has moved from place to place, is absolutely critical, as is allowing new context to be built around it. And of course, all of that context then has to be referenceable. Where did it come from?

Starting point is 00:38:12 And people have to be able to comment on that. The URL, which is a sort of very primitive system that's really location-based, is not good enough. We do need mechanisms to have things be able to live in many different places and over time and still know what it was, and mechanisms of referring to these materials. And hashes are a little problematic at the moment because they keep changing. You know, okay, now we have MD5s. Oh, no, we can't use MD5s anymore.

Starting point is 00:38:43 Okay, now it's SHA-1s. Great. All right, we standardize on SHA-1s. Everybody used SHA-1s. And now we need SHA-256 because people are breaking the SHA-1s. So we have some trouble just based on how do we go and maintain these hashes over time? And I don't have a good answer for that. So that's a real puzzle.
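Brewster leaves this puzzle open. One commonly used mitigation, offered here only as a hedged sketch and not as anything either speaker proposes, is to make identifiers self-describing by prefixing the algorithm name, and to add a newer digest while the bytes still verify against the old one. The function names `make_id`, `verify`, and `add_identifier` are hypothetical; only Python's standard `hashlib` is used.

```python
import hashlib


def make_id(data: bytes, algorithm: str = "sha256") -> str:
    """Return a self-describing identifier such as 'sha256:<hex digest>'."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify(data: bytes, identifier: str) -> bool:
    """Recompute the digest named in the identifier's prefix and compare."""
    algorithm, _, expected = identifier.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


def add_identifier(data: bytes, identifiers: list, algorithm: str) -> list:
    """Return the identifier list extended with a digest under a newer algorithm.

    The new identifier is only issued if the data still matches every
    existing identifier, so old and new names provably refer to the same bytes.
    """
    if not all(verify(data, ident) for ident in identifiers):
        raise ValueError("data does not match an existing identifier")
    new_id = make_id(data, algorithm)
    return identifiers if new_id in identifiers else identifiers + [new_id]


document = b"an archived web page"
ids = [make_id(document, "sha1")]              # legacy identifier
ids = add_identifier(document, ids, "sha256")  # migrate without dropping the old name
print(ids)
```

Keeping the old identifier alongside the new one means existing references keep resolving, while new references can rely on the stronger hash. It does not remove the underlying problem that a broken hash can no longer be trusted on its own.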

Starting point is 00:39:02 But I'm glad you're baking in tags. I hope there's kind of a rendezvous, so that if you have a particular item that you can find other things that refer to that item, that's going to be an important part of a next generation web. Yeah, that makes a lot of sense. I mean, the way that it works in our protocol is we have the sort of base layer that exposes these tags, and then we have indexes on top that are actually very freeform. So there's one index that uses a kind of a very cryptic, frankly, old type of querying language,

Starting point is 00:39:34 but there's a new one that uses GraphQL and another one that looks more like SQL and this kind of thing. And then you can essentially interact with this metadata in a pretty free-form fashion. But yeah, absolutely, in terms of metadata being important, I'm always reminded of this quote. I don't know if we actually know who it came from in the first place, but it was that, quote, we kill people with metadata, or on the basis of metadata. That was in reference to the NSA and, well, the drone strike program that the US runs,

Starting point is 00:40:04 which I thought was very powerful. I mean, that's a really strong statement about the importance of this kind of data. Yeah, Brewster, I wanted to follow up on one thing that you said, which is that the standards keep changing. You know, you mentioned there was MD5, SHA-1, SHA-256. In fact, I think SHA-3 was recently standardized by NIST. And so I guess this speaks to the fact that the technology is evolving and new innovations are coming out all the time. I'm curious, for both of you, what are the technological innovations on the horizon that you think are most relevant to both of your organizations' missions? Or put another way, I guess,

Starting point is 00:40:40 what are you most excited about in the future? Decentralized autonomous organizations, for sure. This idea that we can program the structure of how an organization should work in code is exceptionally exciting. I mean, we've been pretty keen on this for a few years, even since we started, but then just in, I think, December time, we decided that we would make a small experiment where we just gave essentially an ownership token in this new DAO that we'd made, so decentralized autonomous organization, to, I think, about 20, 25 of the most core community members in the Arweave space. And we gave them a very small amount of money. And we just said, hey, you guys just go basically do what you want with it. And you might have this fear that

Starting point is 00:41:19 people would essentially just take the money and run. But quite the opposite happened. They spent almost no money. And they've achieved such incredible things in such a short period of time. I mean, they've made, I think, two or two and a half different products, depending on how you want to count it. And they've taken a pretty major step in decentralizing the infrastructure of the Arweave network. Yeah, on just a tiny amount of money. So I guess what we've found so exceptionally powerful about that was that if you truly give people ownership of something, if you give them responsibility and ownership as a decentralized community that doesn't have to have a formal leader, they're actually able to get things done and thrive and help a community like ours.

Starting point is 00:42:01 So that's been really exciting to see. And I'm sure in the future, DAOs will play a bigger and bigger part in our society. But this is probably like 10 to 20 years off, in the mainstream sense. Culturally, the thing that I'm kind of excited about is people learning to work together better, whether it's the Wikipedias, the Brave browsers,

Starting point is 00:42:19 hopefully the Chromes and Mozillas, the Internet Archives, that people sort of see that it works better to work together. In terms of what's going on right now with this whole COVID thing and the economic impact that's going to come, I'm really hoping that we definancialize more of our systems, so people actually go back to owning things for real. Because when everything's leveraged up to the hilt and it's owned and controlled to go and multiply results and it's all ROI, it's a very fragile piece of infrastructure.

Starting point is 00:42:53 So in terms of just how to build a more robust society, I think definancializing, which is sort of trying to undo some of the financial engineering masterpieces that have come out recently and have been showing the fragility and centralization problems that we have. And in the technology area, I agree with Sam about the exciting things in decentralization. I want blank but decentralized. Let's take our favorite tools, but let's make it so that it's not going and playing into somebody else's business model to go and hoard data. I use Slack all the time, but it's a little creepy, right? Because it's a third-party service. I want Slack, but decentralized.

Starting point is 00:43:36 Ideally, I'd want Slack to go and make that because they don't need to go and hoard all this information for their business model. Or Google Docs, but decentralized. Or Google Maps, but decentralized. There are these miracles of technologies that we've built centralized versions of, so we know what the future we want is,

Starting point is 00:43:54 but can we build blank but decentralized? I find that another sort of useful shower thought experiment is, how would I go and do that without having all of this information be hoarded by these organizations that may be profiting from it, or at least handing it over to people that you don't necessarily want to have that information? How do you have those toll tags that are on our cars, that allow us to go zipping through the tolls, go and be paid securely, make sure that they're done, but not leave a trace that you went through that toll booth at that time? All they should want is just to get paid for somebody to go through that

Starting point is 00:44:32 toll booth. How do we make it like cash? So those are the sorts of things I would hope that the technology community really works on. And if it's based on open standards, then there may not be quite the lock-in sort of thing that a lot of investors are looking for. But I think it may be what society is looking for. And I think in general, we should be pursuing things that society wants. Brewster, back around 1980, you set a goal. You were pretty specific, I believe. It was October 2020. You had a specific goal about making the internet a library. Can you tell us how close we are to achieving that goal? We're now really depending on this information ecosystem. Yet it's not good enough. It's not referenceable, and it doesn't have enough of the published works woven into it.

Starting point is 00:45:18 For older folks, I was trying to think, what did I learn about a library in high school? And there was this thing called the Reader's Guide to Periodical Literature. And it was a subject index. So if you were doing a report on, say, cigarette advertisements from the 1960s or the McCarthy era of the witch hunts of communists, and you wanted to go and understand how did people think about it in 1955, right? What party was for it and what party was against it, right? That kind of thing. It's very difficult to answer that kind of question on the World Wide Web now. But we had this thing called the Reader's Guide to Periodical Literature

Starting point is 00:45:58 where you could go and look up McCarthy in 1955 and go and get articles from Time Magazine, New York Times, all these different things, and at that time you actually went back to microfilm to go and pull those out, to try to walk back in time and feel like it was 1955 again. What was the media that was out there for you? Wouldn't it be great if we could actually do that again, so people could start to feel empathetic or understand how things used to work? And that is not available yet on the net. That would be a component, I would say, of making the Internet into a library.

Starting point is 00:46:40 Another is the reference desk, right? How do you go and have that helpful librarian be able to help you find things? Is that really there? Is Google good enough for that? I'd say no, we're really not there. What do we want out of a library, and how do we make it so that the next generation, this generation, our generation, all of us that are turning to our screens to answer questions, have as good, as rich, as deep a background as to what's going on, what's true, and what's not true? Who should you be thinking about, who should you be reading about, even if they're not currently in print? Those are the aspects that I find a useful thought experiment, and that we're using at the Internet Archive and other organizations to try to build a reliable, thoughtful infrastructure

Starting point is 00:47:25 to build our culture on, as opposed to kind of the slapdash techie thing that we've ended up now depending on so fully.
