a16z Podcast - Preserving Digital History: How to Close the Web's 'Memory Hole'
Episode Date: July 13, 2020
More than 98% of the information on the web is lost within 20 years, and huge gaps exist in our digital and cultural history. Zoran Basich and Alex Pruden of a16z talk to Brewster Kahle and Sam Williams, who are using different approaches to attack this problem. Brewster cofounded the Internet Archive, which is well known for creating the Wayback Machine that crawls a billion URLs every day. Sam cofounded Arweave, a company that uses decentralized crypto networks to store information forever. For both of them, this issue has implications that go far beyond just data storage. It touches on issues of censorship, government manipulation of information, and how historical context is necessary for well-functioning societies. They discuss how decentralized models offer the promise of building a next-generation web that works better for users.
Transcript
Hi, welcome to the a16z podcast, I'm Zoran. Conventional wisdom has it that most everything important
is accessible online and that it will exist forever. But in fact, more than 98% of the information
on the web is lost within 20 years, and huge gaps exist in our cultural history. This includes
published works like news reports, books, music, and video. a16z investment partner Alex Pruden and
I interviewed Brewster Kahle and Sam Williams, who are using different approaches to attack this problem.
Brewster co-founded the Internet Archive.
It's well known for creating the Wayback Machine,
which crawls a billion URLs every day.
Sam co-founded Arweave,
a company that uses decentralized crypto networks
to store information forever.
For both of them, this issue has implications
that go far beyond just data storage.
It touches on issues of censorship,
government manipulation of information,
and how historical context is necessary
for well-functioning societies.
We talk about the challenges
of trying to preserve digital history
in an era of deepfakes and fake news,
how decentralized models like Arweave can fit in,
and what types of economic models
will help sustain data preservation.
We also discuss base layer protocols versus archive layers
and evolving standards for the cryptographic hash functions
crucial for tagging and metadata.
But we begin with Brewster and how he started his quest
to preserve digital history.
The idea is to build the Library of Alexandria
for the digital age.
Could we go and make all the published works of mankind
available to everybody, and in computer-readable form so we could build a global brain,
kind of building on the ideas of Vannevar Bush and Ted Nelson,
which even by 1980, when I started on this project,
it was kind of assumed that it was going to be done or done already.
So I thought, okay, let's just go and do that.
A lot of people would say we already have the Internet,
all the knowledge that we need is already there and accessible.
Why is that true or not true?
Oh, it's actually all fairly thin.
I mean, there's a Wikipedia page on all sorts of things,
and there's certainly all sorts of commentary about current events.
So the good, bad, and the ugly out there.
But a lot of the actual published works aren't there, books, music, video.
It's astonishing how little of it is there; the 20th century is largely not there.
Or if you take web pages, they only last 100 days on average before they're changed or deleted.
So we're building our culture on shifting sand.
So I think it was a cruel joke to call web pages pages, as if they were
like a Gutenberg Bible that would last hundreds of years if you didn't pay attention to it.
It's just not true. This is all shifting. So we needed to build a library now that people are
publishing with this new medium. And so it's kind of notable here we are in 2020 and we seem
to be living in this age of fake news, I guess, where the truth itself can seem slippery.
People can't even agree on what sources of info are reliable. What does that mean for the
Internet Archive? Is this kind of a unique period in history as you've been doing this for the last
four decades, and what are the challenges that that presents for someone like you trying to
preserve knowledge? I think it's showing the need for more and more context around what people
are looking at, being able to reference real materials. Really, the web grew up in parallel
and separate from, say, the book world. The book world just staunchly, you know, you can go
and buy a book from Amazon, or you can download it on your Kindle, but the web and books never
grew up together. So we're trying to make all the footnotes, the references in Wikipedia,
live links. So if it's a book and it's got a page number, it opens right to the right page.
Or just take Wikipedia's footnotes that go to other web pages. We started based on talking with
Katherine Maher, the exec director of Wikipedia, who was worried that truth might fracture
based on basically bad citation availability. So we went and made a robot to fix broken links in
Wikipedia. We fixed 11 million broken links. So the idea is to try to make this good idea of
the World Wide Web into something you could actually live on and depend on. And in this era of fake
news that's really grown up throughout the web, it shows how fragile information is if you don't
have an ecosystem of references, context, reliability, organizations you can believe in. So we're trying to
play a role in hooking these things together into this wondrous thing, which is the
World Wide Web. One of the reasons I'm personally so excited about what you and Sam are doing is
that I firmly believe that human progress is largely dependent on our understanding of our own
human history. And I think this notion of how we study history and how we understand events that
happened in the past, which is called historiography, I think is an absolutely critical and often
underappreciated part of that. I think it gets to kind of this, maybe this is the same as
when you were referring to it as a global brain. But I guess can you in your mind describe maybe
the benefits, maybe tangible benefits that we have by having this better, more granular
understanding of events exactly as they happened on the web in the past?
Well, let's just take there's going and understanding the World Wide Web's history so that
politicians can't just erase their past or, I don't know, when you have pundits that want
to go and now be in government, they just want to have erased their past.
And if you just make it so that they can do that,
then people just feel much more free to just make things up.
If there's no accountability for what it is
you've said, you've done, you've published,
and you can just erase it, then that's a really scary time.
And it gets into the whole George Orwell,
trying-to-rewrite-history thing,
which we're seeing more of
as governments around the world
are swinging right in a very hardcore way,
and we're losing newspapers,
and we're getting more and more control
and balkanization of the internet.
But let's just take the 20th century.
Documents from the 20th century are largely not online.
Copyright in the United States ends in 1924, so if it's published before 1924, it can be
public domain.
And so our book project, we digitize and make them just publicly downloadable.
Google's less so, they kind of still bound up the public domain.
But at least a lot of materials pre-1924 are available online.
But after that, if you look at the graph of what's digitized and put on the Internet Archive book collection, it goes up to 1924, and then craters.
And it's basically almost nothing through the rest of the 20th century.
And it's not because those are all in print, because if you try to go to Amazon and buy them new, they're not in Amazon either.
So the 20th century, we're bringing up a generation without access to the information and what happened during the 20th century.
It was a very impactful, very important century.
And if we don't learn from it, if we can't even reference it, then people can go and say, oh, this didn't happen, or it really happened this way.
Or actually, people were really cheerful about going into the Japanese internment camps.
It's like, it's just not true at all.
But if you don't have referenceable materials, either through the web era or back through the 20th century, you can just make things up.
And we're suffering from that right now.
Sam, you're working on, or you're driven by some of these same kind of ideas, I think,
and you're doing it in a slightly different way than what Brewster has done.
Talk a little bit about the decentralized web and, as you call it, the permaweb with Arweave.
Yeah, I mean, so we definitely started with some of the same ideals.
Like the real question we were trying to answer when we started Arweave was,
how is it that we can close the memory hole, as George Orwell puts it in 1984?
This idea that a politician can instruct someone to take a record of history
and then delete it such that nobody could ever prove what happened in the past.
Yeah, and so we were thinking about how, well, we were thinking about distributed networks at the time,
and we were thinking about how they could be used in practice.
And one of the things they do exceptionally well is replicate data around the world with no single central point of failure.
And that's great, but of course, in a typical blockchain system, you could only store a very small amount of data.
So with Arweave, we essentially, yeah, built it to address that question.
How do we make it so that we can store very large amounts of
data trustlessly in exceptionally large numbers of places across the world so it can't be deleted
by, well, all kinds of people, in fact.
So that's what we were going for.
And then when we built this, which was mainly driven by this question of how are we going
to make verifiable copies of news archives, we realized that the technology underneath was
essentially just a permanent hard drive, which could really be used for storing all things that
should be permanently archived.
And then subsequently, yeah, from there we realized, well, if you make that data
storage available inside a web browser, what you have is a version of the web, except all of the
data inside it is permanent. And this goes to address some of the problems that Brewster was talking
about with the fact that, you know, web links last on average 100 days. And I think what's really
striking is that 98.4% of the content on the web is lost within 20 years. And 98.4% is essentially
the entire thing reset, which is really incredible given that we're embedding it at the core of our
societies as the fundamental way of transmitting information between people.
So we see with the permaweb that you essentially have a system by which to address this.
You make an endowment structure which pays for the storage of information forever in a network
where there is no single centralized point of failure.
And you essentially make an open business model that anyone can offer their hard drive to
and get paid for offering storage such that there is no centralized authority that needs
to manage that network either.
So that's what we're working on with the permaweb.
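To illustrate how a one-time endowment could, in principle, fund storage indefinitely, here is a minimal sketch. It rests on an assumption that is not stated in the conversation: that the yearly cost of storing a fixed amount of data falls by a constant fraction every year. Under that assumption the total cost over an unlimited horizon is a convergent geometric series. The function names and the numbers are hypothetical, and this is not a description of Arweave's actual pricing mechanism.

```python
def cost_over_years(first_year_cost: float, annual_decline: float, years: int) -> float:
    """Total storage cost over `years`, assuming (hypothetically) that the
    yearly cost shrinks by `annual_decline` (e.g. 0.5 = 50%) each year."""
    return sum(first_year_cost * (1 - annual_decline) ** t for t in range(years))


def perpetual_storage_cost(first_year_cost: float, annual_decline: float) -> float:
    """Limit of the series above as the horizon grows without bound.

    Geometric series: c + c(1-r) + c(1-r)^2 + ... = c / r, for 0 < r <= 1.
    """
    if not 0 < annual_decline <= 1:
        raise ValueError("annual_decline must be in (0, 1] for the sum to converge")
    return first_year_cost / annual_decline


if __name__ == "__main__":
    # Hypothetical figures: 1.0 unit of cost in year one, costs halving yearly.
    print(cost_over_years(1.0, 0.5, 3))
    print(perpetual_storage_cost(1.0, 0.5))
```

If the cost decline stalls, the series no longer converges, which is why any real endowment design would need conservative assumptions about that rate.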
I think the question potentially comes up sometimes when people discuss the Internet Archive is the sustainability question.
In other words, if the Internet Archive goes away tomorrow, who is going to continue to do this important work?
How can we make sure that this important work is continuing to be done?
And so, yeah, I would love to hear your thoughts on kind of the approach that you're taking
and then this kind of sort of different approach that Sam is taking.
Oh, yeah, no, I think there should be lots of approaches.
There should be lots of libraries.
So the Internet Archive does play a major reference role by building the Wayback Machine.
We crawl about a billion URLs every day now, which is kind of amazing to me, a billion a day.
It's a total of maybe 800 billion in the Wayback Machine collection at this point.
And it's getting large.
So we try to collect a copy of every web page every two months as one robot's sort of mandate,
and another is collecting the top web pages.
That's just the sort of home pages plus one.
We have active news collecting going on,
but we also now have 600 organizations,
about 1,000 librarians that are directing particular crawls to go
and make sure that we've got good collections
in their particular subject expertise.
And lastly, there's a service called Save Page Now.
So if one goes to web.archive.org
and puts in a URL, it will archive it and hand you back an archival URL right then.
And that's being done now at 80 per second.
So we're starting to get woven into the web.
The Brave browser, if you hit a 404, it will ask the Wayback Machine,
is there a copy of that, and then offer that to you.
So it's kind of being woven into making the web a more reliable piece of infrastructure.
The Internet Archive stores it in a couple places spinning,
but also has partial copies in Canada, in Amsterdam, and in Alexandria, Egypt.
So the idea is to have multiple copies in multiple places.
And that's not to mention all of the libraries that we're working with
that go and get copies of their national collections from the Internet Archive when they contract with us.
So, for instance, when the Library of Congress contracts with the Internet Archive to collect it,
they bring a copy home with them.
So I like the concept of lots of copies keeps stuff safe.
And what we're trying to do is use more decentralized technologies, which I think is a great idea.
The decentralized web has been an initiative of the Internet Archive for several years.
There's now been many gatherings.
There's a dweb.archive.org, which is a prototype trying to make the Internet Archive, but decentralized.
And that leverages WebTorrent, IPFS, and GUN to go and build a decentralized Internet Archive.
Again, still really early days.
Very exciting project.
And Arweave is part of that whole milieu of decentralized web technologies.
Sam, we'd love to hear kind of your thoughts on it.
And maybe a specific example of how Arweave solves this problem that we've described.
And maybe I think Arweave is different in architecture, certainly, but the same goal.
Yeah, I mean, I wouldn't really say we're trying to solve the same goal.
I see Arweave as like a fundamental base layer protocol, which is offering a service to anyone of any kind that wants to purchase it, which is simply permanent information storage.
Whereas perhaps, Brewster, you could correct me if I'm wrong, but really the Internet Archive is like an archive layer.
It's the kind of thing built on top of the technology.
So it's deliberately built to be a library.
And that's one of the things you can do with Arweave,
but Arweave is really just focused on basically building this hard drive that never forgets.
But in terms of how you can use this in practice in environments where censorship is rampant, say,
yeah, there's a really good instance from just the last month, actually.
So we noticed that after the death of Dr. Li Wenliang,
there was a substantial outpouring of, really, an outcry for free speech inside China on Weibo,
which is basically unheard of.
This is a kind of breakthrough event that requires a very large number of people
to work in unison, which is very difficult to organize under a kind of authoritarian regime
because everybody is scared, right? But once enough people get together and start speaking all at
once, the critical mass is such that everyone else can also start to speak. And so for about
three to five days, I think it was, there was a pretty large amount of dissent in China
over free speech issues. So that was very interesting, we thought. And we figured that we
could very quickly just build a small scraper that could go around Weibo and collect up these
artifacts, commit them to the essentially collective permanent memory that we're building,
distribute them around the world, and also make them available inside China in a censorship
resistant way. And so we got this running very quickly. And then a member of our community
thought, oh, wow, that's really interesting. So he built essentially a UI on top of this that
not only showed all of the posts on Weibo that had been stored, but also went and detected
when they had been censored inside China and then highlighted them inside this UI.
So now you can go to this web page, I think they call it Weibo Uncensored, and you can press this
button that says, show only censored, and this works inside mainland China, and it'll show you
all of the things on Weibo that the bot has caught that the Chinese government is attempting
to censor.
So it actually has this sort of Streisand effect, where it further highlights the information
that they were attempting to censor because they were censoring it.
And of course, there are hundreds and hundreds of Arweave nodes, so it's extraordinarily
difficult for the Chinese government to come along and try and actually block access to
that. And those nodes are shifting and changing every single day. I'm curious about this question
of centralization and Brewster, when you think about the Internet Archive and centralization versus
decentralization, are there challenges around having single points of failure or single
points of contact versus having it dispersed the way it is in an Arweave kind of model?
Oh, I think you want both. I think there's reason for having sort of the dot-com sector, the dot-org
sector, dot-edu, and working together well. I think they offer different perspectives and points of view.
I think the top level domain names of the internet are kind of an interesting decomposition
of the different types of organizational structures and what they're for and how, if we can get
them all to cooperate well, then that can work very well. So the Internet Archive by having
partners all over the world that are in these established institutions, that is a
different form of decentralization.
So I think we're in early days in terms of actually how to go and pass around information
peer to peer, but terrific progress is being made, basically, with hashes of files.
You don't even need to care where you're getting it from because you know if you get a file
that matches that hash that you have the data you're looking for.
So there's this cryptography that is now commonly available, which wasn't really available to
Tim Berners-Lee when he was building the web.
JavaScript makes it so that your browser
can be a first-class node on the Internet.
You don't have to depend on servers.
That's a tremendous step forward.
And we now have digital cash that allows people to be able to pay for things,
which has been a big missing piece on the World Wide Web
that has really driven us towards advertising models, which have real problems.
So I'm very excited about some of the technologies being brought to bear
on some of these cultural areas.
So yes, I think there's reasons to have these different systems work together,
have the more traditional server-type structures
be actively working on decentralization
and having the decentralized organizations,
ones that are sort of born decentralized,
start to go and bridge to the more server-based systems.
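The point about hashes of files can be made concrete with a short sketch: if you already know the hash of the content you want, you can accept it from any source and check it yourself. The example assumes SHA-256 as the hash function, and the names `content_id` and `fetch_verified` are hypothetical helpers for illustration, not part of WebTorrent, IPFS, or any other system mentioned here. Peers are modeled as plain callables so the example stays self-contained.

```python
import hashlib
from typing import Callable, Iterable, Optional


def content_id(data: bytes) -> str:
    """Identify content by its SHA-256 digest (hex string)."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(expected_id: str,
                   peers: Iterable[Callable[[], Optional[bytes]]]) -> bytes:
    """Ask untrusted peers in turn; return the first response whose hash matches.

    Each peer is a hypothetical callable returning bytes, or None if it
    has nothing to offer. Who the peer is does not matter, only the hash.
    """
    for peer in peers:
        data = peer()
        if data is not None and content_id(data) == expected_id:
            return data
    raise LookupError("no peer returned data matching the expected hash")


if __name__ == "__main__":
    original = b"An archived web page"
    wanted = content_id(original)
    peers = [
        lambda: b"A tampered web page",  # wrong content is rejected
        lambda: None,                    # peer that has nothing
        lambda: original,                # honest peer
    ]
    assert fetch_verified(wanted, peers) == original
```

The design choice worth noting is that trust moves from the server to the identifier: the hash has to come from somewhere you trust, but the bytes can come from anywhere.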
Just to add on to that, I think particularly of interest.
So hashing is fantastic.
in terms of just being able to identify a piece of data at a later point.
But one of the things that really got us excited when we were starting Arweave
was how you could use timestamping of data,
verifiable timestamping of data using blockchain or blockchain-like ledgers
that involve a proof of work or proof of stake mechanism
to ensure that a piece of data existed at a certain point in time
and hasn't been altered since that time.
And so when we started to work with the Internet Archive data,
it was natural that we looked at these torrents
that you guys produced.
And so subsequently, we've just been archiving very large volumes of these torrents over time.
And now you can go back and you can say, okay, so this is a piece of data from the
Internet Archive Collection that is verifiably unchanged since it was added to, since the
torrent was added to the archive, you know, like 18 months ago, whenever it was.
And so I think that's a really compelling and exciting way that we can use these technologies.
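The timestamping idea can be sketched as a toy hash-chained ledger: each entry records the hash of some data, a time, and the hash of the previous entry, so rewriting an old entry breaks every later link. This is only an illustration under simplifying assumptions. It has no proof of work, proof of stake, or network consensus, the `Entry` and `Ledger` names are hypothetical, and it is not Arweave's protocol. In a real system the latest entry hash must be agreed on by many independent parties, otherwise the most recent entry could be swapped without detection.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


@dataclass(frozen=True)
class Entry:
    data_hash: str   # SHA-256 of the archived data
    timestamp: int   # seconds since epoch, supplied by the caller
    prev_hash: str   # hash of the previous entry, linking the chain

    def entry_hash(self) -> str:
        payload = json.dumps([self.data_hash, self.timestamp, self.prev_hash])
        return hashlib.sha256(payload.encode()).hexdigest()


class Ledger:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def append(self, data: bytes, timestamp: int) -> Entry:
        prev = self.entries[-1].entry_hash() if self.entries else GENESIS
        entry = Entry(hashlib.sha256(data).hexdigest(), timestamp, prev)
        self.entries.append(entry)
        return entry

    def chain_is_intact(self) -> bool:
        """Check that every entry points at the hash of the one before it."""
        prev = GENESIS
        for entry in self.entries:
            if entry.prev_hash != prev:
                return False
            prev = entry.entry_hash()
        return True

    def timestamp_of(self, data: bytes) -> Optional[int]:
        """Return when these exact bytes were first recorded, if ever."""
        digest = hashlib.sha256(data).hexdigest()
        for entry in self.entries:
            if entry.data_hash == digest:
                return entry.timestamp
        return None
```

A usage pattern matching the conversation: record a torrent file's bytes when it is archived, then later call `timestamp_of` on a copy. A match shows the copy is byte-for-byte what was recorded at that time; any altered copy simply will not be found.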
Yeah, no, that's great.
And one question I have for both of you, and I think so far we've kind of discussed at a high
level or philosophical level why this is an important problem for people to be thinking about.
But, and Brewster, I know you also, you sit on the, I think, I believe you're on the board of
the EFF. And so I'm sure you're used to having this conversation is, you know, you're probably
trying to convey to folks, hey, why should you care? So I'm curious for both of you.
What is in your mind the best application or the best use of the technologies that you both are
building? Like, how might the average person relate to this in a way that they can understand or
apply in their own lives. Oh, gosh. Well, I think of it as a library. And so there are some of us that
are old enough to remember what libraries did. So it's physical libraries. But I think the thing that
most people use the Internet Archive for is the Wayback Machine, the free service on archive.org
to be able to see past web pages and be able to get to things. So there was a pronouncement by
the son-in-law of the president that the national stockpile of medical equipment
wasn't for the states. And within two hours, the homepage of the National Stockpile website
changed its mission and took out states and Indian territories, other territories and Indian
reservations. And so somebody found this, and they used the Wayback Machine's diff feature
to go and surface the new page and how the diff, you know, sort of a redline version of
how did it change from a couple hours before that public pronouncement of the
change of what the federal stockpile was for. That's, I think, an example of why you'd want
an archive of the World Wide Web. And there's similar examples because we're also archiving
television of going and saying, no, no, no, he really did say that, actually, repeatedly.
To be able to think critically, I have kids in school. So to think critically, you need to be able to
quote, compare, and contrast. You need to be able to go and put your hands on something and be
able to quote it. And then you can compare and contrast it with other things. If you can't quote
what has been said or what was pronounced, then you can't really think critically. It just flows over.
And if it flows over, then there's no accountability and people feel free to make things up,
and what's going on is people are making things up. And I think people have this impression that
everything is on YouTube now. It's all out there. Don't worry about it. Everything's preserved.
But that's not really true.
There's these massive gaps, isn't there?
Oh, yeah.
And YouTube also takes things down all the time.
Before YouTube, at Google, there was Google videos,
and that had six million videos.
And they just decided to turn that off.
So I had archived those.
So those are on the Wayback Machine.
There was Yahoo videos.
Those are all gone.
Apple had a self-hosted site
for people that own their equipment
to go and make web pages and the like.
And that was called MobileMe.
And they just turned that off.
So there's whole swaths of the internet that just go away.
And if people just get used to things coming and going and not being reliable,
then we'll have people just start to not know what to depend on and what to believe.
So this is not a good state.
The Wayback Machine is a truly incredible thing.
But actually so few people understand about how much else is stored inside the Internet Archive.
Like it's really incredible the amount of interesting data that is stored there.
But most people, I think, just see, you know, the Wayback Machine, essentially.
So it's definitely worth exploring.
In terms of Arweave, I think the obvious first use cases are two, really.
If there's something that you would like to say that you would like to make sure will be around for a very long period of time,
like a blog post, a thought you had, something like that, you can just upload it to one of the blogging platforms on the network,
and it'll just perpetuate it essentially forever, backed by a sustainable endowment economic structure.
So that's very valuable.
And on the other side, you can also just use it as an accountability ledger for anything.
So if you make a contract with someone, for example, or you receive an email and you make a deal over this email, then you can just forward it to a certain address, which someone in the community has set up, and it will timestamp it in Arweave.
And you'll be able to come back to it later and prove verifiably that you haven't tampered with that data in the intervening period.
And this is very valuable when we think about things like people sharing screenshots of stuff, which people do all the time.
The screenshots have no verifiability.
And quite often, we see that they've actually been faked.
So with Arweave, you can totally circumvent that problem.
That would just be a surface scratch, I guess, but definitely a first use case.
One of the, when I excitedly describe Arweave to folks, as I often do,
one of the applications that I often cite is one that a community member has made called FeedWeave.
And I think that one I find resonates with folks, just the idea that you can take,
you can essentially replicate a social network graph
on Arweave, and the implications of that mean that it's easier to switch between or view that
graph in different ways. Yeah, definitely. I think one of the deeper implications of this system is that it's
essentially a massive open database that any application can read to and any application can
write to, sorry, read from and write to. And that means that you can build applications that are
purely composable in nature. So yeah, someone made feedweave.com, which is a simple blogging platform,
but of course, you can then build on top of this and you can say, okay, well, I would like
to make a version of FeedWeave that is only for people that are mentioning COVID-19.
So somebody has built a COVID-19 quarantine journaling app on top of the Arweave network,
just actually today I think it was released.
And of course, if you lay out the tags correctly, you can get intercrossing views of those
data across the different applications.
This also highlights another feature, which is if you have these permanent web pages,
the applications stored inside those web pages can, well, essentially persist forever.
and that has really interesting properties for the user because it essentially means that if I come
along to say, I don't know, WeaveMail, which is essentially an emailing service inside this network,
I will always be able to get back to WeaveMail as it exists today in the future.
There is simply no way that the author can come along and change it and subsequently perhaps
add adverts or make it sell people's data or something like this.
So if you've audited and you understand what the application is doing today, you can have very, very strong guarantees
that it is going to do that tomorrow or really any time in the future.
And when you compare this to typical Web 2.0 services,
the contrast is pretty stark.
I mean, everyone remembers when they first used Facebook or they first used Gmail,
and they think about the way they perceived those services at the time.
And now, in hindsight, when we can look back and see the essentially abuses of privacy
that have happened on those services, we wonder, well, would I have signed up for it
if I'd known that is what they would have eventually done,
if that is what the service would have eventually become.
And the answer is, in some cases, quite likely, no.
However, with the permaweb app, you just don't get that problem.
So weavemail.com.
It works like it does today, and it will work that way tomorrow and in 10 years and in 20 years.
I'm really glad that you brought up the fact that most web applications are ultimately just
wrappers, UI wrappers around a database.
And I think that the point that you're making is that when it's a company that owns the database,
it's not, you know, it's a proprietary database.
It's difficult as a user to take your part of the database, the data that's relevant to you,
and move it to another application.
But in this paradigm that you're describing, that's not the case, right?
It's essentially frictionless, and therefore much easier for people to, as you say,
if they decide that at some point their privacy is not being respected or they just don't like the application,
it's easy to switch.
They don't have to make the trade off that, like, well, I have to leave all my data that's relevant here
and start new somewhere else.
Right.
Well, there's actually kind of two sides of things.
right? So it's the fact that the database itself is open and available for anyone to build on.
So it's as if every single application has an API by default and by necessity. So it has to have an
API in order to function. And that part is really valuable. And it's kind of interesting.
In the Arweave community, we have two camps about this. It's one of the great things you get
with decentralized communities like this. Some people, and I'm definitely on this side,
I would say, think that the fact that the UI itself that accesses this data is permanent
is what's important. And other people are just more focused on the database side.
But certainly, yeah, both are very interesting.
Brewster, I wanted to ask you actually about community because Sam mentioned community
and how big a part of the decentralized web community is.
Blockchain protocols tend to build networks and communities of early users who become
sort of evangelists. And it strikes me that with Internet Archive, there have been people
along the way with you, whether they're volunteers or other advocates who have helped.
Could you talk about community as it relates to the preservation of knowledge?
If we don't really have a desire to keep things around, it will just fade away.
The example of the Library of Alexandria Version 1, which ran for 500 years, so it was a pretty good run.
But it was a joint project, basically, between the Greeks and the Egyptians.
And they built just an amazing center of knowledge for the ancient world.
But by 200, 300, 400 A.D., the concept of having universal knowledge,
or an encyclopedia, started to fade as a concept with the rise of some of the newer
religious movements, really, which was the Christians.
It was not supported as much.
And there's only about eight pieces of papyrus that we believe exist today that were in the
Library of Alexandria of that time.
That was told to me by Dr. Serageldin, who runs the new Library of Alexandria in
Alexandria, Egypt. So we need to keep a culture wanting to have information available. So how do we
do that? There's going and having lots of people buy in. I think that's your community question.
And I'm really quite happy that last year, 100,000 people donated to the Internet Archive
directly, money. And then there's lots of people that have contributed in other ways.
But I'd say, you know, to your point on decentralization, people are supporting it by going
and doing things on their own that may leverage the Internet Archive, may not leverage the Internet
Archive, but we all take this concept of legacy and memory very seriously.
I don't think it's just a technological hack that we need to do.
There's financial components, there's organizational components, and being able to share,
I find difficult for people to adjust to, especially within commercial enterprises,
but every which way people don't like to share in bulk.
So how do you go and encourage that and make it build it in
as I think you talk about into the protocol itself,
but make it so that there's an ongoing interest in having materials live?
That, I'd say, is one of the key parts of the sociological architecture
as opposed to just the technological architecture.
Yeah, I really, what you just said reminds me of one of my favorite quotes
from, I'm a big science fiction fan, and Kim Stanley Robinson, who's a relatively famous
sci-fi author, he frames technology in terms of, it's not just like the physical, you know,
it's not just a computer or the way that bits and bytes move around, you know, based on the
physics of electricity. Democracy he defines as a technology, for example, right?
It's these social innovations that actually collectively can enable us to do things at far
greater scale, enable us to coordinate at far greater scale. And so I think broadening
the frame there, I completely agree with. And I do think sometimes it's underappreciated in our current
world where technology kind of implies that it's a piece of software or hardware. And of course,
it could be a blend of both. Like take cryptocurrencies and digital tokens, for example. So not only do
they rely on cryptography and computer science, technology as we kind of traditionally think of it,
but they also have an economic dimension. So like the game theory that underpins the security
of that model, and even a political and social dimension when it comes to governance of these protocols.
And those economic, political, social dimensions are equally important parts.
Right. There's new technologies. I love the sort of the token-based things.
That's completely great to really sort of focus a lot of energy.
Right. Money is gasoline. So that's tremendous to go and leverage these new technologies
for having people coordinate their activities through contributions to these token sales.
It's exciting to see this type of energy going into these sort of more traditional topics of cultural preservation.
I think there is a slight detail here, which is that technology itself can be money.
It's not such that the money has to be separated from technology.
And in the way that you point out, Alex, about democracy being a technology, really what we're talking about here is just mechanism design, right?
These are incentive structures that we can produce that push rational humans, essentially, towards
some kind of behavior that we prefer. We see the same in democracy and capitalism: essentially,
we have adopted competition, helpful competition, at the basis of these things, in order to
produce the best outcomes we can, or at least in theory we can, at the global layer.
One question that I have for both of you, actually, is, you know, I think I've read that the amount of
data being produced by the human race has increased exponentially over the past couple
decades, particularly as kind of the World Wide Web has come online and more and more people
are using it for various things. I'm curious how you view the sustainability of your efforts
in terms of how do you keep up in this race of more and more data is created and therefore
there's more and more that's being archived or this permanent hard drive can kind of keep up
in terms of storage space. I'm curious how both of you think about that.
Because money can be technology, we just built a kind of technological money that creates this endowment
structure to pay for storage going forward. So we essentially financialize the lowering cost
of storage over time at a very, very conservative rate to ensure that we have this, yeah,
it's really an endowment in a traditional sense with a couple of modifications that allows us to
pay for storage from the interest from that principal that we put up front when we add a piece of
data to the network, rather than actually taking away from that up-front payment.
So, yeah, I think these are, I guess, differences in approach.
I mean, really, the Arweave approach is a kind of, in some sense, it's more like a traditional
business model.
It's simply decentralized rather than being a not-for-profit structure like the Internet
Archive.
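The endowment idea described above can be illustrated with a small Python sketch. It is not Arweave's actual mechanism or parameters: the function names, the cost of 1.0 per year, and the 0.5% annual decline are all assumptions chosen for illustration. The point it shows is only that if the yearly cost of storing something keeps falling by a fixed fraction, the total cost of storing it forever is a finite sum, so a single up-front payment can in principle cover it.

```python
def perpetual_storage_cost(cost_first_year: float, annual_decline: float) -> float:
    """Total cost of storing data forever, assuming the yearly cost falls by a
    fixed fraction each year (hypothetical model, not Arweave's real parameters).

    This is the closed form of the geometric series
    cost_first_year * (1 + (1 - d) + (1 - d)**2 + ...) = cost_first_year / d.
    """
    if not 0 < annual_decline < 1:
        raise ValueError("annual_decline must be strictly between 0 and 1")
    return cost_first_year / annual_decline


def storage_cost_over(cost_first_year: float, annual_decline: float, years: int) -> float:
    """Add up the declining yearly costs for a finite number of years."""
    total = 0.0
    cost = cost_first_year
    for _ in range(years):
        total += cost
        cost *= 1 - annual_decline  # next year's storage is slightly cheaper
    return total


if __name__ == "__main__":
    # Illustrative numbers only: 1.0 unit of money per year today,
    # with a deliberately conservative 0.5% yearly decline in cost.
    print(perpetual_storage_cost(1.0, 0.005))
    print(storage_cost_over(1.0, 0.005, 200))
```

The more conservative the assumed decline, the larger the up-front payment; any real decline faster than the assumed one leaves a surplus in the endowment.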
And your question about sort of how do you scale archives when the amount of storage that's
offered to people producing the things is going exponential?
And I think the Internet Archive is about 60 petabytes of data at this point.
And it's just what web pages look like.
When you talk to the folks at AWS, they don't really think about petabytes.
They just think about exabytes, which is kind of awesome.
But that's not where we are.
So it's just where we could record things of what the web looked like would be terrific.
And it's sort of the promise of the decentralized web.
And where Arweave is part of that whole sphere is can we make the web itself sort of
a self-archiving structure.
So instead of going and thinking there's a website that's over there and you have to then make a copy of it, snapshots of periods of time, can it be more like how Git works so that you have a website that evolves over time, but you can go and replay it back to whatever point in time it ran.
So you can even fork it, but that's not just the code of the website, but the actual data within it, and have that data,
basically, and the code itself for the website live in many places, not just one.
It's too fragile to have things in just one place.
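The idea of a website that behaves like Git, so that it can be replayed to an earlier point in time, can be sketched roughly as follows. This is a hypothetical illustration, not how the Wayback Machine or any decentralized web project works: the class and method names are invented, timestamps are supplied by the caller and assumed to be non-decreasing, and each snapshot is identified by a hash of its parent's identifier plus its contents.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str         # hash of the parent id plus the contents
    parent: Optional[str]    # id of the previous snapshot, None for the first
    timestamp: int           # caller-supplied time, e.g. a Unix timestamp
    files: Dict[str, str]    # path -> contents (code and data alike)


class SiteHistory:
    """Hypothetical append-only history of a website, loosely Git-like."""

    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = []

    def commit(self, files: Dict[str, str], timestamp: int) -> str:
        """Record the full state of the site and return the snapshot id."""
        parent = self.snapshots[-1].snapshot_id if self.snapshots else None
        payload = json.dumps({"parent": parent, "files": files}, sort_keys=True)
        snapshot_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.snapshots.append(Snapshot(snapshot_id, parent, timestamp, dict(files)))
        return snapshot_id

    def replay(self, timestamp: int) -> Optional[Snapshot]:
        """Return the latest snapshot at or before `timestamp`, if any."""
        candidates = [s for s in self.snapshots if s.timestamp <= timestamp]
        return candidates[-1] if candidates else None


if __name__ == "__main__":
    history = SiteHistory()
    history.commit({"index.html": "<h1>First version</h1>"}, timestamp=100)
    history.commit({"index.html": "<h1>Second version</h1>"}, timestamp=200)
    print(history.replay(150).files["index.html"])
```

Because each identifier depends on the parent's identifier, anyone holding a copy can check that the history has not been rewritten, which is what would make it safe to keep copies in many places.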
The idea of a publisher being able to just take a book away
and then have it just go away from all the shelves of all of the libraries
is kind of dystopian, but it's exactly what the lobbyists
for the publishing industry and the authors want to have happen
is every reading event is a permissioned event,
so it can be turned off at any time.
And you can understand why people want that.
So think of it as sort of Netflix, right?
That Netflix is the only place that you can go and see movies.
If that were true, then there's too much power in the hands of Netflix to go and make,
oh, some documentary that is not in favor anymore just disappear and nobody can see it again
because people hadn't made decentralized copies.
So the decentralized copies approach, especially if it's built right
into how a next generation web works,
that would be a better system than the kind of kludge,
which is the World Wide Web,
and then try to back it up separately.
So I really applaud the efforts of the decentralized community
to try to make it so this is just how the next generation technology works.
Okay, so I'd love to get your input on something, Brewster, actually.
One of the conversations we have a lot in the decentralization community
is about content addressing.
Right? So should a piece of content be addressed by its location, as it is on the current web, or should it be addressed by the, essentially, the hash of the piece of content itself? Or the way that we see it at Arweave is that it should be really, I mean, no piece of information in the web is just a piece of information. There's much more to it than that that's interesting. There's when did it get there? Who put it there? And also other kinds of metadata. It's very strange to us that on the web at the moment, there's no metadata tag
available for stuff. You just get a web page. And so, yeah, essentially what we've done with
Arweave is just bake in these tags that can be of arbitrary type. And at the beginning, this was
really just to add small amounts of metadata. But then we realized the community was picking this up
and essentially using the whole network like a massive document store that had these sort
of tags associated with stuff that they could query. And then you get these very complex
web applications built on top. So I guess, yeah, I wondered if you had some thoughts on that.
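The contrast drawn above, between addressing content by location and addressing it by its hash with queryable tags attached, can be made concrete with a small Python sketch. This is not Arweave's API or data model; the class, method names, and tag names are invented for illustration, and the store lives in memory only.

```python
import hashlib
from typing import Dict, List


class TaggedStore:
    """Hypothetical in-memory store: content is addressed by its SHA-256 hash
    and carries free-form tags that can be queried."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._tags: Dict[str, Dict[str, str]] = {}

    def put(self, content: bytes, tags: Dict[str, str]) -> str:
        """Store content and return its address, which depends only on the bytes."""
        content_id = hashlib.sha256(content).hexdigest()
        self._data[content_id] = content
        self._tags[content_id] = dict(tags)  # re-adding the same bytes replaces its tags
        return content_id

    def get(self, content_id: str) -> bytes:
        """Fetch content by address, wherever the store happens to live."""
        return self._data[content_id]

    def query(self, **wanted: str) -> List[str]:
        """Return the addresses of all items whose tags match every given value."""
        return [
            content_id
            for content_id, tags in self._tags.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]


if __name__ == "__main__":
    store = TaggedStore()
    page = store.put(b"<h1>Hello</h1>", {"type": "page", "author": "alice"})
    store.put(b"cat picture bytes", {"type": "image", "author": "alice"})
    print(store.query(type="page") == [page])
    print(len(store.query(author="alice")))
```

The address answers "what is this?", while the tags carry the who, when, and what-kind context that a location-based URL leaves out.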
One of my mentors is Bill Dunn. He started Dow Jones Electronics. In the
mid-1980s, he was completely inspiring. He said, the metadata is more important than the data
itself. It's kind of a cool idea. It was really ahead of its time. I mean, you think of how does
Google work, and it's all metadata. It's not actually the information in the document. It's often
anchor text and usage patterns and all of that. And so Bill Dunn was way ahead of it. He coined,
he actually coined the term metadata, which is kind of great to know that somebody actually came up
with that and who it was.
So I think context, which is another way of putting that, is absolutely important.
And context shifts over time.
So often you need a mechanism of having materials and then being able to find what are people
saying about it.
So being able to query based on a hash or something, some sort of persistent identifier,
even if the material has moved from place to place, is absolutely critical,
and allowing new context to be able to be built around it.
And of course, all of that context then has to be referenceable.
Where did it come from?
And people need to be able to comment on that.
The URL, which is a sort of very primitive system that's really location-based,
is not good enough. We do need mechanisms to have things be able to live in many different places and over time
and still know what it was, with mechanisms of referring to these materials.
And hashes are a little problematic at the moment because they keep changing.
You know, okay, now we have MD5s.
Oh, no, we can't use MD5s anymore.
Okay, now it's SHA-1s.
Great.
All right, we standardize on SHA-1s.
Everybody use SHA-1s.
And now we need SHA-256, because people are breaking the SHA-1s.
So we have some trouble just based on how do we go and maintain these hashes over time?
And I don't have a good answer for that.
So that's a real puzzle,
but I'm glad you're baking in tags.
I hope there's kind of a rendezvous function
so that if you have a particular item
that you can find other things that refer to that item,
that's going to be an important part of a next generation web.
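Two of the ideas raised above, identifiers that survive a change of hash algorithm and a "rendezvous function" for finding things that refer to an item, can be sketched together in Python. This is only one possible approach, and nothing here is an existing system: the registry class and its methods are hypothetical. The sketch labels each digest with its algorithm name, records several digests per item, and keeps a reverse index of references.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set

# Algorithms to record for every item; the list can grow as standards change.
ALGORITHMS = ("sha1", "sha256", "sha3_256")
CANONICAL = "sha256"  # arbitrary choice for the internal key in this sketch


def labelled_digests(content: bytes) -> List[str]:
    """Return identifiers such as 'sha256:<hex>' for each algorithm."""
    return [f"{name}:{hashlib.new(name, content).hexdigest()}" for name in ALGORITHMS]


class Registry:
    """Hypothetical registry: resolve any known digest to one item and look up
    which other items refer to it."""

    def __init__(self) -> None:
        self._canonical: Dict[str, str] = {}
        self._referrers: Dict[str, Set[str]] = defaultdict(set)

    def register(self, content: bytes) -> str:
        canonical = f"{CANONICAL}:{hashlib.new(CANONICAL, content).hexdigest()}"
        for identifier in labelled_digests(content):
            self._canonical[identifier] = canonical  # old-style ids keep resolving
        return canonical

    def resolve(self, identifier: str) -> str:
        return self._canonical[identifier]

    def add_reference(self, from_id: str, to_id: str) -> None:
        """Record that the item `from_id` says something about `to_id`."""
        self._referrers[self.resolve(to_id)].add(self.resolve(from_id))

    def referrers(self, identifier: str) -> Set[str]:
        """The rendezvous lookup: everything known to refer to this item."""
        return set(self._referrers[self.resolve(identifier)])


if __name__ == "__main__":
    registry = Registry()
    article = registry.register(b"an article")
    comment = registry.register(b"a comment on the article")
    registry.add_reference(comment, article)
    old_style_id = labelled_digests(b"an article")[0]  # the SHA-1 identifier
    print(registry.referrers(old_style_id) == {comment})
```

Prefixing the algorithm name means an identifier says how it was computed, so a reference made with an older hash still resolves after the default algorithm moves on.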
Yeah, that makes a lot of sense.
I mean, the way that it works in our protocol
is we have the sort of base layer that exposes these tags,
and then we have indexes on top that are actually very freeform.
So there's one index that uses a kind of a very cryptic, frankly,
old type of querying language, but there's a new one that uses GraphQL and another one that
looks more like SQL and this kind of thing. And then you can essentially interact with this metadata
in a pretty freeform fashion. But yeah, absolutely, in terms of metadata being important,
I'm always reminded of this quote, I'm not sure if it, I don't know if we actually know
who it came from in the first place, but it was that, quote, we kill people with metadata
or on the basis of metadata. In reference to the NSA and, well, the drone strike
program that the U.S. runs, which I thought was very powerful. I mean, that's a really strong
statement about the importance of this kind of data. Yeah, Brewster, I wanted to follow up on one thing
that you said, which is the standards keep changing. You know, I think that you mentioned,
you know, there was MD5, SHA-1, SHA-256. In fact, I think SHA-3 was recently standardized
by NIST. And so I guess it just, you know, this speaks to the fact that the technology is
evolving and new innovations are coming out all the time. I'm curious for both of you, what
are the technological innovations on the horizon that you think are most relevant to both of your
organizations' missions? Or put another way, I guess, what are you most excited about in the future?
Decentralized autonomous organizations, for sure. This idea that we can program the structure of how
an organization should work in code is exceptionally exciting. I mean, we've been pretty keen on this
for a few years, even since we started, but then just in, I think, December time, we decided that
we would make a small experiment where we just gave essentially an ownership token in this new
DAO that we'd made, so decentralized autonomous organization, to, I think about 20, 25 of the most
core community members in the Arweave space. And we gave them a very small amount of money.
And we just said, hey, you guys just go basically do what you want with it. And you might have
this fear that people would essentially just take the money and run. But quite the opposite happened.
They spent almost no money. And they've achieved such incredible things in such a short period of time.
I mean, they've essentially, they've made like, I think, two or two and a half, depending on how you want to count it, different products, and they've taken a pretty major step in decentralizing the infrastructure of the Arweave network.
Yeah, on just a tiny amount of money.
So I guess what we've found so exceptionally powerful about that was if you truly give people ownership of something, like if you give them responsibility and ownership as a decentralized community that doesn't have to have a formal leader, they're actually able to get things done
and thrive and help the community like that.
So that's been really exciting to see.
And I'm sure in the future,
DAOs will play a bigger and bigger part in our society.
But this is probably like 10 to 20 years off,
like in the mainstream sense.
Culturally, the things that I'm kind of excited about
is people learning to work together better,
whether it's the Wikipedias, the Brave browsers,
hopefully the Chromes and Mozillas,
the Internet Archives,
that people sort of see that it works better to work together.
In terms of what's going on right now, in terms of this whole COVID thing and the economic impact that's going to be,
I'm really hoping that we de-financialize more of our systems.
So people actually go back to owning things for real.
Because when everything's leveraged up to the hilt and it's owned and controlled to go and multiply results and it's all ROI,
it's a very fragile piece of infrastructure.
So I think that that in terms of just how to build a more robust society,
I think de-financializing, which is sort of trying to undo some of the financial engineering masterpieces that have come out recently and have been showing the fragility and centralization problems that we have.
And in the technology area, I agree with Sam and the sort of the exciting things of the decentralization.
I want blank but decentralized.
Let's take our favorite tools, but let's make it so that it's not going and playing into somebody else's business model to go and hoard data.
I use Slack all the time, but it's a little creepy, right?
Because it's in a third-party service.
I want Slack, but decentralized.
Ideally, I'd want Slack to go and make that
because they don't need to go and hoard all this information
for their business model.
Or Google Docs, but decentralized.
Or Google Maps, but decentralized.
There are these miracles of technologies
that we've built centralized versions of,
so we know what the future we want is,
but can we build blank but decentralized?
I find that another sort of useful shower thought experiment
is like how would I go and do that
without having all of this information be hoarded
by these organizations that may be profiting
by it or at least handing it over to people
that you don't necessarily think that you want to have that information?
How do you have those toll tags that are on our cars
that allow us to go zipping through the tolls
go and be paid securely, make sure that they're done,
but not leave a trace that you went through that toll booth at that time?
All they should want is just to get paid for somebody to go through that toll booth.
How do we make it like cash?
So those are the sorts of things I would hope that the technology community really works on.
And if it's based on open standards, then there may not be quite the lock-in sort of thing that a lot of investors are looking for.
But I think it may be what it is society is looking for.
And I think in general we should be pursuing things that society
wants.
Brewster, back around 1980, you set a goal.
You were pretty specific, I believe.
It was October 2020.
You had a specific goal about making the Internet a library.
Can you tell us how close we are to achieving that goal?
We're now really depending on this information ecosystem, yet it's not good enough yet.
It's not referenceable, doesn't have enough of the published works woven into it.
For older folks, I was trying to think, what did I learn about a library in high school?
And there was this thing called the Reader's Guide to Periodical Literature, and it was a subject index.
So if you were doing a report on, say, cigarette advertisements from the 1960s or the McCarthy era of the witch hunts of communists, and you wanted to go and understand how did people think about it in 1955, right?
What party was for it and what party was against it, right?
That kind of thing.
it's very difficult to answer that kind of question
on the World Wide Web now
but we had this thing
called the Reader's Guide to Periodical Literature
where you could go and look up McCarthy in 1955
and go and get articles from Time Magazine,
New York Times, all these different things
and go and actually at that time you went back to microfilm
to go and pull those out
to try to walk back in time
and feel like it was 1955 again
what was the media that was out there
for you. Wouldn't it be great if we could actually do that again so people could start to feel
empathetic or understand how did things used to work? And that is not available yet on the net.
That would be a component, I would say, to making the internet into a library. Another is reference
desk, right? How do you go and have that helpful librarian be able to help you find things?
Is that really there? Is Google good enough for that?
I'd say no, we're really not there.
What do we want out of a library and how do we make it so that the next generation, this
generation, our generation, all of us that are turning to our screens to answer questions
that we have as good, as rich, as deep a background as to what's going on, what's true and what's
not true?
Who should you be thinking about?
Who should you be reading about, even if they're not currently in print?
Those are the aspects that I find a useful thought experiment and we're using at the
Internet Archive and other organizations to try to build a reliable, thoughtful infrastructure
to build our culture on as opposed to kind of the slapdash techy thing that we've ended up now
depending on so fully.