librarypunk - 160 - SciOp.net feat. Jonny and Jez (part 1)
Episode Date: March 12, 2026

We’re talking with Jonny and Jez about SciOp, a torrent-focused data preservation project that encourages academics to help with the act of making data available. It’s distributed, it’s robust, ...and SciOp is working on making it easy to do.

contact@safeguar.de

Contact Jonny:
Fedi: https://elektrine.com/remote/jonny@neuromatch.social
Bsky: https://bsky.app/profile/jo.nny.rip

Media mentioned
https://sciop.net/
https://forum.safeguar.de/about
https://blog.sciop.net/2025-08-29/webseeds
https://punctumbooks.com/titles/warez-the-infrastructure-and-aesthetics-of-piracy/
Safeguarding/sciop: collecting at-risk data in torrent rss feeds - Codeberg.org - https://codeberg.org/safeguarding/sciop
the scraping code, such as it is: https://codeberg.org/Safeguarding/sciop-scraping
sciop the blog, such as it is: https://blog.sciop.net/
https://programminghistorian.org/
https://book.the-turing-way.org/
last episodes on federation:
https://www.librarypunk.gay/e/101-mastodon-bluesky-and-bullshit-part-1-feat-jonny-saunders/
https://www.librarypunk.gay/e/102-mastodon-bluesky-and-bullshit-part-2-feat-jonny/

Tools mentioned
Webrecorder, Browsertrix: https://webrecorder.net/ https://webrecorder.net/browsertrix/
academictorrents - academic data tracker: https://academictorrents.com/
Agregore (mauve): https://agregore.mauve.moe/
DRP - DataLumos: https://www.icpsr.umich.edu/sites/datalumos/home
https://en.wikipedia.org/wiki/Distributed_hash_table
https://torrentfreak.com/controversy-as-rookie-admin-aspires-to-bittorrent-domination-080730/
DDoSecrets: https://ddosecrets.org/
Archive Team Warrior: https://tracker.archiveteam.org/

How to contribute to sciop
https://sciop.net/docs/quickstart/
Create account. Create a torrent. [find how to link for how to create a torrent] Upload. Get permission so they are visible to others. Seed.

Transcripts: https://podscripts.co/podcasts/librarypunk
Join the Discord: https://discord.gg/qWPTurTnkT
Transcript
I'm Justin. I have a new job, so I forgot what it is, and my pronouns are he and they?
I'm Sadie. I work IT at a public library, and my pronouns are they them?
I'm Jay. I'm a cataloguing librarian. I'm so sick. I'm so sorry. And my pronouns are he, him.
And we have guests. Would you like to introduce yourselves?
Yes. I am Jonny Saunders. I'm a postdoc at UCLA. I'm also now the newly appointed CEO of PureTech
global, cyberindustrial concern LLC, Shady Delaware LLC, which is my lifelong dream,
and one of the folks who works on this project, SciOp, that we'll be talking about.
And I'm Jez Cope. My pronouns are he/him. I currently work at the British Library for the next
like three weeks until I start a new job elsewhere, doing research data stuff. And I have had the
pleasure of being on this project
with Jonny for some time.
Quiet cheer, a quiet
sensible cheer.
It was maxed out. It just
doesn't go any louder than that.
Welcome. I've been wanting to get this episode
going for a while, ever since I've
watched it kind of from the
start, I guess. I think you started
SciOp last, what, year
or two? Oh, Lordy. Yeah,
last, well, it was last
January, and I think we got it
online last March. And
we've been slowly hacking away at it since then.
Oh, so we've not missed the opportunity to have a big birthday celebration.
I don't think so.
And I mean, also, you know, we get to do whatever we want.
We can say we don't have any sort of like formal analytics or anything like that.
So numbers are meaningless and we get to make them up.
Yeah, I threw SciOp up into the, like, "what's it built in?"
And it was just like, I don't know, third party analytics stuff.
I don't know what it's built in.
It's like all the code's just in your
Python and your GitHub, right?
Yep, the site.
Yep.
It's like we use none of the stack.
Like, there's not any
AWS in front of us.
There's not any Cloudflare in there either.
It's just like raw dog Python running on a random
Icelandic VPS that they keep giving us for free.
Well, shout out to FlokiNET.
They're like very legit web hosts,
but also it just feels like, I don't know,
we stole a little bit of computer space
and are running a BitTorrent tracker on it.
Yeah, I know.
They did point out that you're on an Icelandic server,
and I was like,
it just must be something Jonny has access to through work or whatever.
I don't know, who knows?
No, it's just some cool folks that probably Henrik.
Like, Henrik was the one that has contacts and connections.
Yeah, yeah, it's just like,
We're, you know, trying to figure out how to do this thing sort of legally-ish and also, like, in a way that isn't, like, totally 100% under U.S. jurisdiction, but we also don't really know how that works.
And so we're just sort of like, VPS, outside of the United States, and, you know, Iceland sounded nice at the time.
Iceland is very nice. Greenland is covered in ice, yeah.
They were, like, it's very cool. We want to support you. Can we give you
a free VPS? Yeah. Nice. So if you had to give an overview definition of SciOp, like, what is SciOp?
Oh, well, I mean, there's like an easy explanation and there's a more of a gross one.
We'll go through both. Yeah, the easy explanation is SciOp is a BitTorrent tracker for at-risk,
threatened, altered, or otherwise endangered public information. And the longer explanation is,
what we're trying to do.
We're just like, we're working on
making BitTorrent trackers, which
usually are associated with piracy.
We're not trying to necessarily distance
ourselves from that lineage of, like,
radical information liberation,
but trying to bring it into the light
and just sort of like doing something
along the lines that like previous
academic BitTorrent trackers have done,
but then actually trying to like integrate that
into institutional as well as private,
is private as in, like, privately held
as in your random hard drive kicking around
in your closet resources.
So trying to like take that,
the power of BitTorrent
and put it to use in ways similar
to what, like, DDoSecrets
and other like leakers or purveyors
of large datasets have done,
but then apply that to the public information
that's being removed from all of the government servers.
I think that's a good start.
I'm interested in the idea of like getting people back into using BitTorrent as just like a way of moving files around because, you know, I was introduced to it at a time.
I don't know.
I was, I'd have been about 10 or 11.
Or people were just like, yeah, I had no idea that there was a whole season of Futurama inside that Linux distro.
Like, so I guess it fits neatly into the chronological retelling of SciOp,
is we were, you know, we all knew that just like the fascists were going to be coming for the information, right?
But I don't know that any of us expected the ferocity and immediacy of attacking public information sources like the day that, you know, Trump took office.
And so it was just like noticing that that was happening right now.
We needed something that worked right now.
And so there wasn't time to, you know, bike shed about possible future technologies we would like to have.
We need something that is like, what can we run immediately that we can put in people's hands now to share and preserve large-scale data sets?
Because the kind of thing that's under threat are these, like, petabyte-scale climate data sets, where if we don't have them,
then even if we wanted to do anything about climate change, we couldn't, because we would just
not have the information about where we would need to put the solar plants and so on.
And so that's like the initial impetus for using BitTorrent.
Not only we love it, but it's also like it exists.
It's been kicking for 20 plus years.
And we know for a fact that just we can get someone who has never touched a computer before.
Well, most people would touch a computer that are not like computer people.
And we can put something in the hands of my mom, and she can run a BitTorrent client and double-click a torrent file and it works.
And so that's like we just got that going.
And also like getting a torrent indexer started is like pretty simple as opposed to like building on some of the more complex peer-to-peer technologies that exist.
It would take a huge amount of like technical development.
But BitTorrent works on files and it is like is a very simple thing to do.
And also, like, the reason why BitTorrent trackers and BitTorrent itself have been preserved is that it has a really nice, like, distribution of liability,
where, like, by running the BitTorrent tracker, we are not actually directly legally responsible for the content.
Like, SciOp is not serving the content of the files.
And that's served by the peers.
All we do is say, hey, you can get this file over here somewhere else.
And so it's like it serves as well as an index in case there are sort of legal [inaudible],
even though all the data that we host is public domain, people should have it. There's
nothing legally or ethically risky in there. That doesn't really matter so much to the present government.
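[Editor's illustration] The division of labor described above, where the tracker only tells peers where else a torrent can be found and never serves file content itself, can be sketched roughly as follows. This is a toy, in-memory model and not SciOp's actual code: the `ToyTracker` class, the `announce` method, the infohash string, and the example addresses are all hypothetical, and a real tracker speaks the BitTorrent announce protocol over HTTP or UDP.

```python
from collections import defaultdict


class ToyTracker:
    """Hypothetical in-memory tracker: maps a torrent's infohash to known peers."""

    def __init__(self):
        # infohash -> set of (ip, port) pairs that have announced for it
        self.swarms = defaultdict(set)

    def announce(self, infohash, ip, port):
        """Register a peer and return the other peers known for this torrent."""
        peer = (ip, port)
        others = sorted(self.swarms[infohash] - {peer})
        self.swarms[infohash].add(peer)
        return others


tracker = ToyTracker()
first = tracker.announce("abc123", "198.51.100.7", 6881)
second = tracker.announce("abc123", "203.0.113.9", 51413)
print(first)   # nobody else has announced yet
print(second)  # the first peer is handed back to the second
```

Note that nothing in the tracker's state is file data; it holds only addresses, which is the point being made about where the content actually lives.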
Yeah, legally, not legally risky now. Right. Yeah. And also, there's a part of me that wonders,
like, whenever we're talking about computer law, like, how long is it until there are bills
proposed that just say BitTorrent indexing websites must be regulated or approved?
or because all of the major,
it is an untamed part of the web
because it is the peer to peer
in a very direct way that like
very few direct peer to peer things on the web
still are.
Like there are very few things like,
like, I forget what it's called like offline websites,
things that are just hosted from connection to connection.
Like you can't even do a direct connection over IP
for a video game anymore.
You've got to go onto someone else's servers.
You can't just type in your friend's phone number and call their modem and play MechWarrior 3 anymore, right?
Yeah.
By the way, it was great back in the 56K period.
You were just like, wow, it's like instant.
Yeah.
There's just no lag.
Exactly.
And like, I mean, there's good and bad reasons for that.
And so it's like, I'm sure that, like, I feel like a lot of what we are trying to do would just, like, horrify
Sadie, like, just like
and trying to, like, run these things on
like things where, like, all the IT professionals
in the world are like, yeah, you shouldn't
be running a BitTorrent on this
here network. It's got health
information in it, or at least that's the stance of
UCLA's IT.
We're negotiating. It's fine.
But for now, I have the unregulated
undisallowed BitTorrent
client in the UCLA Health Network.
It's fine. It's safe.
But, uh,
um,
okay.
But they walled us off from the rest of the network, which is good.
But that's part of the overarching problem and part of the overarching background that we take place in, of the sort of, like, de-ownership.
I don't know what to call that.
Like, the actual enclosure of the web part, where just like now, people started talking about this with the rise in hard drive prices and memory and
stuff like that, it's like, you're not supposed to even own a computer anymore. That like,
you're supposed to be using the cloud. The only thing that you should have is a phone or some
other screen that allows you to access someone else's resources. So that's sort of what we're
fighting upstream against is the overwhelming urge of liability-terrified administrators of all
institutions of all kind wanting to run to the cloud where they have someone to sue rather than
actually owning any hardware and running any services themselves. So it's that, that precise
arrangement that allowed a lot of the immediate harms that we saw. And so some of the larger harms
to come when the shakedown of all of the tech giants reaches its next phase and they start
really leaning on AWS to start, you know, I see you're over there hosting
this open data set, or I see you're over there, like, promoting this social service.
You need to take this shit off of your network or else, you know, shake the stick of fascism at you and so on.
I think that's kind of one of the things that we saw right from the start of the Trump administration last year as well, is that
there was not only stuff that was being directly targeted and directly taken down by
diktat because it contained words like "female" or whatever the thing was, but also
there was a lot of stuff that went offline because people had to cancel contracts because
their funding was suddenly under question and because, again, they didn't have the stuff
themselves. Right. And so like that's that's sort of like the background assumption that a lot of
scientific, academic, and other, like, publicly funded or institutionally funded archives made,
that, like, we assume that this country, as its government, as its people, as, like,
the funding mechanisms and power distribution that exists, like, generally think that it's good
to know things and generally think it's good to, like, have information, period. And that changed
overnight, that just like the general agreement that, like, we should continue to do information
sort of, like, stopped being true. And so, you know, you're an organization. You don't have all the resources
in the world. You're not like planning for every adversarial contingency like this kind of a
contingency that now you have all of your archival materials and what do you do with them?
You haven't made plans for making sure that they exist in a ton of different places, like, if we were to go down, what is our succession plan? Because we weren't planning to have our archive go down or be taken down. So that's sort of where we step in. Not trying to say, like, we're trying to solve the world, or, you know, blah blah blah, it's only us and we are the true answer. That's not what I'm trying to say. Just trying to step in: okay, in an emergency situation, what are the possible
things that someone could do to preserve these large-scale
data sets, and naturally BitTorrent came to mind.
For many reasons, at least with, like, my personal history,
having a lifelong love affair watching the torrent window,
but also just like looking around and seeing the possible things that we could do.
And it just works.
We regularly get people saying,
have you considered IPFS? Have you looked at that?
Have you considered using Filecoin?
and, like, none of those are at a point where they just work in the same way that BitTorrent just allowed us to start doing stuff and just worked.
Right.
It's like every time we have taken any sort of step into like fancier technology, it has just brought in like 10,000 more like complications than it's worth.
Even just trying to use BitTorrent v2, which is a spec that is now like 10 years old or something like that,
has proven to be a monumental technical challenge.
And so just like, we're not trying to be finicky and mess with the technology.
We're trying to make an archive happen.
And so at this point, the plan is to make sure it's rock solid,
get the foundation in, and then we can try experimenting with more, you know, fabulous technologies.
But for now, we're just like at the stage of make sure the basics function.
Yeah, I remember in 2015, 2016, the first wave, sort of my introduction to this kind of work,
which was people doing archiving of government websites.
It was a lot of, it was a lot, lot more labor intensive than what I've seen this time around.
A lot of people were like, okay, we're going to have three categories of volunteers.
You're going to run BagIt.
You're going to upload.
You're going to help with outreach.
And it was incredibly like coordinated, but it was also very strange.
And also, I wasn't sure where everything was going to live.
I was dealing with it just at my work because I was in charge.
of all the health websites.
So obviously overnight,
all of those health websites broke
because whenever a new administration comes in,
they change which department is going to handle which things.
Their websites break.
They take down studies.
They add new studies.
This is the kind of breaking that our government does
to its own data infrastructure.
Every new administration that I think is kind of inevitable.
And I find it, so I was kind of like,
where has everyone been when it happened this time?
I was like, it kind of did happen eight years before.
But I did notice that people were way, way better at it this time around.
Like, I knew where the data went.
I knew, like, I understood you were using BitTorrent.
I understood what Internet Archive was doing.
I understood, like, what other web archives were doing this time around, right?
Whereas the last time I was like, I had to learn what BagIt was.
I didn't even know about it.
And I went to library school.
And it's like, that's something like only a digital archivist in a library would know about.
So not very many people know about this thing.
So I like the idea that we're using like a 20-year-old technology
because that's archival-grade, tested technology at this point.
Yeah.
I like the idea of like we're looking through our toolkit
and we're going to use the file as in like an actual file.
We're going to use the chisel.
We're not going to use the power drill.
Yeah.
And I mean like if only we have any, with this project,
it's like we would love to have that kind of,
organization to be able to have different roles and stuff like that. That's why we, for,
you know, notes for the future, we had every intention of inviting someone from the Data Rescue
Project to come along with us, but failed due to just like the chaos of all of our lives. But
like Data Rescue Project is like much better at that. Like they're sort of just like the sibling
organization that folks will know about because they are more organized and better at doing
of like a lot of things than we are.
And one of them is, is like making sure that people know what a role is, like, know how to volunteer
and know how to like, you know, find their work.
We've just been like, we're sort of like the scatterbrain cousin that will go in and do like some
heinous web scraping that when some scrape project might elude folks among them and like
in that labor relationship.
But like that's like, that's also just like part of the problem.
this time around too is that just like
there was so much of it that needed to
be done that the usual
suspects, like Internet Archive
and, like, you know, just the typical,
like, the people who always do that, probably,
there was just too much more of it than usual,
that just like we needed to engage
more people, and there were a lot of people that
wanted to, being like, how do I do this?
I think, like, we
regularly cruise the DataHoarder subreddit
and just the
amount of just random people that are just sort of like, I went and bought a million hard drives
and I want to go get a bunch of stuff and make sure that I have a copy of it.
Like this instinct to hoard information is like, well, good. It's a good pro-social one and
widely shared. And so that was initially like a lot of like when we started this project,
we actually got a decent amount of pushback like saying like, I'm just going to go and get it
and store this data like under my mattress basically.
that just like, I want to go, like, there was someone that was actually disagreeing with us about the notion of doing this in public, like trying to make, like, a BitTorrent archive where, you know, because that involves peers being recognizable and online, you know. And so someone was saying, like, that paints a target on your back. Instead, what I'm going to do is make, like, a private copy, hoard it, and not tell anyone, so that in the future when, like, we might want to have it again, I'll be able to release
it. And that doesn't really work that well. And for the obvious reasons of, you're going to forget,
you're going to die or lose the data. And then just like, so what you, it's like, anyway.
But that, like, then that poses an immediate problem is like, how do you coordinate all of that
energy and work? And like, how do you actually make sense of everyone doing everything everywhere?
And I mean, we're not original in this regard. Like, da, da, da, da, da,
a web platform as, like, a way of coordinating work. It's as old as the
internet, just sort of like these peer production models of let's all work together
in public. But like in particular, the need for not only a way of like organizing the scrapes
and the backups that were done, but like organizing the scraping, like the actual act of going
and getting it. Because like that was like one of the things that especially last spring was true
is that like we'd get advanced notice
that some site,
some data set, some archive was going down.
And we'd have like,
this thing is happening on Friday.
And there are 100 terabytes in there
and the server will only serve
each individual downloader
like three megabytes a second. Go.
You're just sort of like, how do you
coordinate and de-duplicate
that action of making sure that
we get all of the files,
but we only get them once
because we don't have time to go back and get that, you know,
for everyone to go and scrape their own individual copy.
And so that was like one of the, well,
it's one of the underdesigned but still present parts of SciOp,
is that, like, not only is it trying to be an archive for things that exist,
but it's like a coordinating space where you make a target,
like a dataset target.
Like, we need to go out and get this.
And then there's, like, a facility on there for splitting it up into smaller pieces.
And then we even have the beginnings of,
something maybe we'll talk about in a bit of just like how to actually automate that saying
I have this piece someone else go and get the next one and so it was like how to like distribute
the actual act of scraping. And so that's both, like, you know, an observability
question, like, what is everyone doing, what does everyone have, but also, like, a documentation, an
education, and, like, a resource-building problem: how do we teach people how to scrape the web?
How do we, like, you know, make these different technologies accessible?
Like, I think some of the, like, a lot of web scraping is increasingly very accessible and easy to do.
Like, thanks to some technologies from, like, you know, friends of ours over at Webrecorder, who make Browsertrix, a wonderful thing that you can just basically plug and play.
But still, there's a lot of subtlety as far as, like, what comes out the other side.
As you're saying, like, what are the formats of the web archives?
If you make all of this data backed up, how do you make sure it's actually usable in the future?
And these are all ongoing technical and social questions that we're sort of engaging with.
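[Editor's illustration] The coordination problem described above (roughly 100 terabytes, a server that gives each downloader about three megabytes a second, and a hard deadline) comes down to two small calculations: how many people need to download in parallel, and how to hand out files so each one is fetched exactly once. The sketch below is a rough illustration under stated assumptions, not SciOp's actual mechanism: it uses decimal units, a made-up four-day deadline, hypothetical file names and sizes, and a simple largest-first assignment.

```python
import heapq
import math


def downloaders_needed(total_bytes, bytes_per_second, deadline_seconds):
    """Minimum number of parallel downloaders to finish before the deadline."""
    single_downloader_seconds = total_bytes / bytes_per_second
    return math.ceil(single_downloader_seconds / deadline_seconds)


def assign_files(file_sizes, n_workers):
    """Give each file to exactly one worker: largest first, to the least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (bytes assigned, worker id)
    assignments = {worker: [] for worker in range(n_workers)}
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, worker = heapq.heappop(heap)
        assignments[worker].append(name)
        heapq.heappush(heap, (load + size, worker))
    return assignments


# 100 TB at 3 MB/s per downloader, with an assumed 4 days of notice
needed = downloaders_needed(100 * 10**12, 3 * 10**6, 4 * 24 * 3600)
# Hypothetical file names and sizes, split between two volunteers
plan = assign_files({"a.nc": 5, "b.nc": 4, "c.nc": 3, "d.nc": 2}, 2)
print(needed)
print(plan)
```

With these numbers a single downloader would need more than a year, which is why splitting the target into pieces and tracking who holds which piece matters more than any one person's bandwidth.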
I think it was very interesting that we were starting this all up at the same time as the internet was just going completely wild with scrapers and very badly written scrapers from everyone and their dog with a tech startup being like, we're going to get our own
and train our own AI and be the next OpenAI.
And so we were immediately sort of fighting all of the countermeasures that have been put in
place against that.
Yeah.
It's awkward.
Web scraping used to be cool and edgy.
And it's like I miss that part of, you know, that era of web culture where like, I mean,
the value shift where, I mean, it's, like, reflected in copyright law
and our understanding of copyright and piracy
is like consumption used to not be valuable.
Like, you know, that was not the valuable part.
The valuable part was in distribution.
And so when it became disgustingly profitable
to just consume things,
like that like shifted the landscape as far as like,
well, scraping is not really great
because before it'd be like, yeah, you'd get a web scraper,
but you weren't really concerned with like the impact
on your servers because it would be relatively minor,
you were concerned with them having your data.
And like, what are they going to do with it?
Oh, they're going to have a private copy of it.
Oh, no.
And that means they won't have to come to the website anymore.
Like, that was the biggest problem.
And so it's a real shame that like,
now when we say we're doing a scrape,
it's like, no, no, no, but not like that kind.
Like, you know, not like, you know,
that, like, scraping these sites to make them available,
like, not to hoard them and train a private
model and, like, convert other people's labor into profit for us. And so a lot of what we're doing, as far as coordinating scraping goes and making these BitTorrent-backed archives of things, it's like one of the things that is a longer-term goal, when we talk about the next step beyond the emergency response to a bunch of data disappearing, is like, what does a longer-term archive look like? And longer term,
we want to decrease the burden on these institutions and decrease the hosting like costs for them
by supplementing them with a lot of other people's resources in bandwidth. We're not trying to
be the next scraper that's hitting your page 100 times a minute just so it can get the fresh text
to eat and then like fuck off into its own oblivion. So it's a real shame that the culture
has shifted in that way. But yeah, we do have to perpetuate. I wish there was a way to
differentiate pro-social
scraping from the kind of
stuff that all of the tech giants are doing.
No, no, no, we're the nice ones.
Yeah, no, we're good.
Trust me, I'm good because I say I'm good.
Leave it to the tech giants to take something good
and pro-social and somehow make it into a
profit-churning
bullshit, so, yeah.
Yep, it's what they do best.
I think that's good enough for people
to understand kind of what SciOp's doing.
I want to talk about the people more because we're talking about pro-social stuff.
So I guess focused on the people, like, who's working SciOp now as much as you want to, you know, I know people, who's working it now?
And then who do you want to come in and like what do you want people to do with SciOp?
Like, you know, like I mentioned earlier, I don't know if it was on this recording because we had some recording issues.
But, you know, your website is available on GitHub for people to copy to make their own SciOp website.
So, like, what do you want people to do with it?
So that's talking about people.
Yeah, the people, so I think most of the folks involved are pseudonymous in some way.
I think so.
I don't necessarily, it's like torn between giving credit and also not doxing people.
You know, and so, like, we initially, so the named folks in here are, like,
the sort of core of us, I guess,
nucleated pretty quickly around Henrik,
who is taking a bit of a break now,
was like the dynamo of a human being that he is,
sort of like brought a bunch of folks together.
And then the other folks that are named
and on public documents associated with Sciop
are like Will Wades is another person
who's like helped us out greatly
as far as like connecting us with old web,
you know, whenever
everyone's like,
we need to like contact some,
you know,
someone who knows about the deep lore
about the internet.
It's sort of just like,
well,
we'll just immediately know someone.
Anyway,
I don't want to necessarily like list off,
like everyone.
I feel like we should give like a list of credits
or something like that in the show notes
because I feel bad naming people
because just like that's just part of,
the part of the internet that we are in is like
no names, no faces.
our most prolific contributor is triple shrimp,
and we have no idea who they are,
but we just know that we love them,
and they are like one of the most tenacious archivists
that I've ever seen before.
And so, yeah, so like we have a core of us that are just like,
we, you know, that we sort of know who some of the people are,
but most of them we actually don't know who they are.
And just like people who have gravitated towards this kind of project.
And as far as like the people we want to be involved and like external people to be involved
is like the, you know, the bad answer to that is everybody, you know, like that's just like,
that's like the lazy answer is like we want everybody to be involved in this.
And to some degree that's true.
We're just like that like that is actually something that I mean, I'm being flippant on that,
but like that is something that we've sort of endeavored to make available.
It's like if you want to like that's the kind of permission structure that we're
trying to make is no permission structure.
You don't have to wait for someone to tell you to do something.
You don't have to say, go and be in the queue and go and prove yourself.
Like, do one, you know, it's like getting jumped into the scraping gang is just sort of like,
go and prove to me that you can scrape the web and then we'll give you a task.
It's more like we want to like put the things in people's hands because everyone else out there
knows what needs to be preserved better than we do.
It's sort of, that's like the people that are closer to the problem and like make sure that
they can do the work of gathering and sharing their resources because like, you know, we can't
know what is at risk in the time that it takes to actually protect that information.
So, but like longer term, as far as like the code goes and like what,
The plan is with, like, other SciOps and stuff.
Yeah.
That's also a big part of, like, the dream of the project is that this comes sort of,
I think that we talked about this on one of the previous times I was on here.
Just like, this like, hold on, it might be a little bit frozen.
Hang on.
I think we can still see and hear you.
Yeah, we can see and hear you.
Okay, everyone is frozen on my screen.
That just can happen.
like you froze for me for a little bit.
The video can freeze a little bit.
Okay. Well, okay, we'll cut out the
dead air moment there.
But yeah, so I think
we talked about this a bit
on like, like one of the last times
that we were on here is just like this
history of boom and bust
and rise and calamitous
collapse of BitTorrent
trackers and how that just like that is how
BitTorrent trackers worked historically
is that, especially in private
BitTorrent trackers, you'll have this
massive accumulation of labor and organization,
accumulating these huge archives of media,
and then the law will come knocking,
and the tracker will get shut down,
the database will be lost.
And one of the things that's remarkable about private BitTorrent trackers
is how resilient that actually proves to be,
or just when the successor tracker shows up,
everyone sort of like rallies and re-uploads everything they have,
and it might come back online in a matter of weeks,
which is incredible compared to other possible results
when other kinds of archives go down.
But it still is a power struggle,
and literally it will often be a battle of wills
and a battle of personality.
I'm thinking about this Mac tracker,
BrokenStones, that,
just to go off on a little bit of a side.
There's this private BitTorrent tracker
that hosts, like,
Mac applications, BrokenStones.
And it's like, one of the admins took hostile control over the site and rerouted a bunch
of the donations to their crypto wallet or something like that.
And then so the people chose to like shut down the site to protect users from this admin.
And then the successor site showed up.
And then that hostile admin tried to worm their way in through IRC into the admin of that
new site.
Just, like, there's all this history
of dramas and sort of like a Greek tragedy of keeping these archives running.
And so anyway, like with that as a background, like, one of the things we're trying to do is,
like, try and address that problem with like a bit of a redesign about how, like, you know,
introducing federation to the idea of BitTorrent trackers, where, in the same way that, like,
I don't know, we don't need to do the
"What is federation?"
thing here. Lots of computers.
Yeah. We talked about it on the last episode.
Go see the last episodes, where we talked about federation for two episodes in a row with Jonny.
Yeah. Yeah. So just like this idea that like instead of it being one SciOp and that's like a unitary thing, like we want to do the same thing here where we have a number of these different trackers that can be online that can all talk to one another and share metadata back in. And so something like that doesn't really
exist as far as
we know there are some things that are
like it. So I, for example, I saw a project
someone linked it to me like
two or three days ago that was like
another huge French
bit torrent tracker went down.
And then a bunch of people
went and put all of those
torrents on Nostr, for
example. So there are like
other things that exist that are
roughly trying to
make these like
BitTorrent in particular, but in general just like
large
data set archives spread out across domains.
But what we're trying to do is both preserve the social coordination,
which is the main role that BitTorrent trackers actually have.
So people think, this is another question we got initially.
It's just like, why are you using a tracker?
I thought trackers were obsolete now that we have DHT and other ways of directly exchanging
peers between each other.
So that's like the technical role
of a BitTorrent tracker is to, when I go and download a torrent, I go and ask the tracker,
who else has got this torrent? And so the tracker is the one that will like be telling me,
go and connect to this other IP address and so on. And so like that's one of the roles of a tracker,
but the other more important role of the tracker is a site of social coordination,
a site of giving organization and structure to a bundle of torrents. And in particular,
giving a focus to it.
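The technical role described here, asking the tracker who else has a torrent, can be sketched roughly as below. This is a toy in-memory stand-in, not the real announce protocol and not SciOp's code; the class name, method name, info hash string, and IP addresses are all made up for illustration.

```python
from collections import defaultdict


class ToyTracker:
    """Hypothetical in-memory stand-in for a tracker's peer bookkeeping."""

    def __init__(self):
        # info_hash -> set of (ip, port) pairs currently in that swarm
        self.swarms = defaultdict(set)

    def announce(self, info_hash, ip, port):
        """Register the caller, and return the other peers known for this torrent."""
        others = sorted(self.swarms[info_hash] - {(ip, port)})
        self.swarms[info_hash].add((ip, port))
        return others


tracker = ToyTracker()
# The first peer to announce learns about nobody.
first = tracker.announce("abc123", "198.51.100.7", 6881)
# The second peer is told to go and connect to the first one.
second = tracker.announce("abc123", "203.0.113.9", 6881)
print(first)
print(second)
```

Everything else a tracker site does (organization, focus, moderation) sits on top of this very small lookup.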
So just like, you know,
in the same way that What.CD focused labor
towards archiving music,
having SciOp focus labor as, like, a place to put
the public information torrents
is, you know,
the main thing that it actually does.
And so building those kind of like social systems
into the tracker and then making that,
then the next step is making those social systems
extend across multiple trackers.
And there's like some interesting
technical and social challenges that come up with that,
where, like, again, a lot of the peer-to-peer space,
especially lately post-crypto boom,
can lean very libertarian in terms of its design and its goals,
that, like, the goal is to make the one big public archive of everything,
and that, like, that's not exactly, doesn't really fit in this context,
but also like it's just a very particular arrangement of power and how that's supposed to work.
And so, you know, we, one of, so I'm thinking about one of the most recent examples that's like a challenge for this is like,
we wanted to distribute a data set that we weren't really sure we could.
And we also wanted to make sure that like the origin of it was not so obvious.
And so we needed a way to predistribute that data before
we made it public. And so if you build your system around assuming that everything is public,
everything should be public, and it should always be immediately available, then you don't really
have the means of making these sort of like gray area, private negotiations and discussions and
stuff like that that might need to happen for data that's a little more sensitive. And so we need
a federation and a sharing model that can scale from private literally peer to peer as in like,
I want to know exactly who's involved in this swarm of peers up to the global public index.
So that's like the next step as far as what we're working on this year, where like last year we got the thing running.
And we got a bunch of data sets, we got our wins as far as like some nice archival work done and a foot in the door as far as like getting a base of seeders online.
and then the next step will be making federation happen.
And so we want to do an interesting blend of what ATProto is doing
and what ActivityPub is doing here.
I mean, I don't necessarily know,
I don't know what degree of technical detail would be good to say,
but I'll just say that like in broad strokes,
we want to do some of the nice things that ATProto is doing
as far as like having these mobile, like very much,
mobile units of
coordination where like instead of having
the server own everything
making it so that just like it is possible
for people to like own their own
personal space of this
and building in like the
social graph down to
an individual account or person
or something like that that then
exists on a network of these
gateways but then also doing
some of the things that ActivityPub does
better which is like being able to have
better control over who has access
to things and being able to have better control over, like, the actual way that the data spreads
throughout the system, as opposed to with ATProto, where you assume that it just goes everywhere.
And so it's like, that's the, like I said, I'll stop there to say that, like, we, there's a lot
of probably boring details that go into, like, what that actually means, but that's the next
step in broad strokes is we're trying to make a federation model that fits the needs of
sensitive, sometimes personally identifiable, but also public domain data and how to make that
actually function both in like a gray archive but then increasingly into legit archive space.
Yeah. So let me try and make that a little more concrete interpersonally and professionally.
So like I'm starting a new job next month. And if I wanted to approach them and say,
hey, we have a big research infrastructure.
We're a big target of this administration.
And we're a private institution.
We're not a state institution.
What angles would I approach of saying we should start a SciOp here?
I was saying like we should use our compute for this.
We should convince the local Sadi to let us start up a BitTorrent software.
And we should spend staff time on this.
So, like, if you, you already have, like, control over your ability to do this at your job.
But, like, if I was, I'm going to be in the library.
I can talk to some people in computing, but I'm probably going to have to go talk to IT or some department.
So, like, where am I?
How do I convince, like, you know, part of the bureaucracy or get them on board?
Well, yeah.
So that's, it's a complicated question.
So the.
So we're here for.
Right.
So there's different angles for this.
And some of them are like, this is all work to be done
at the moment because like, yeah, like I said,
that's next year of work.
But like, the goal is to make it as minimal as possible.
And so like that's actually something that is how SciOp has been designed
from the bottom up is that just like,
like we wanted to make it ridiculously deployable.
Where like if you were to just like pop into your neighborhood like vibe coding software
and say make me a website, what it would make you is something that is like,
has like 10 different services running.
you know, like a heavy full-stack application that has like, you know,
that you need to basically have a degree in sysadmin, you know, to run.
And SciOp is not that.
SciOp is like a Python program that's just like single install
and no external web services, you just press play and go.
And so at the moment, it is integrated as a full-stack website,
but we'll start breaking those pieces apart where what we want to do is be able to have like
a metadata federation node underlay thing
that you have all of your existing resources.
You have all of your storage in some CMS somewhere
and you have all of your metadata in a system
that we're not trying to supplant that.
That's another thing that we don't want to do
or don't want to try and do.
Because every time that someone tries to come up with
the next unifying metadata system
and ignore all of the embedded labor and time and local expertise that went into that,
it fails and it's like it's an embarrassing situation.
You now have 11 standards.
Yeah.
And it's just like disrespectful, basically, like to metadata workers.
Like, you don't know how to do your own job and stuff like that.
Just say, like, no, come and do it on our thing.
But like making it possible to have this system exist side by side with something like that,
that it can ingest the metadata.
It can bridge to your existing system and pull that metadata out and make it so that just like you have everything that you have currently, but then also have a bit torrent underlay with it.
So that like the things that that adds or concretely brings to an organization is that like what we want to offer, not as a service as in like we're selling a service, but offer as in like this is this approach, is to, one, be able to make your metadata and
your archival information more robust by giving it a concrete shareable form that just like,
this is sort of like, again, harkening back to conversations that we had like last time
when we were talking about linked data, just like, SciOp is obviously a linked data application.
Like it publishes all of its metadata in RDF already. And so just like what we want to do
is like basically like you can ingest any sort of like linked data metadata that you have.
And here's a data set and it corresponds to these files. And those files are on your
CMS somewhere. And so we can make a torrent, derive it from that data. And now what you have is
your metadata, your data on your CMS and a torrent that refers to both of those things. And so then
someone who wants to download it can download that torrent. They will download it from your
servers, but then also then become an independent source for that information. So then,
were your archive to completely fail or go down, what you already have out in the world then
are these very small, very portable descriptions of what your data is, how to get it, who has it?
And so you just make your single monopolar archive into a distributed archive with very little
additional effort or very little change to your existing systems.
Let me spin a scenario so that I can make this, because I don't want this to
bounce off like library workers too much.
And I want to make this really easy for people who are like,
I didn't understand anything you said this episode,
but I want to help.
Where do I go?
So.
Yeah.
Neither me to hurt off the baby.
So I have like,
I have like a lot of experience in academic libraries doing different things.
So let's say it sounds like what I can do is maybe I can rustle up enough IT support
to say like,
okay, I want to spin up like one server to run this SciOp
site. And what I'm going to do then is maybe work with our data librarian or our scholcomm person
and say, okay, all these data sets that we have in the repository or that people have in their
labs, we're going to help coordinate them to create torrents of that data that's already living
on their servers. So they don't have to move anything. We're going to create these torrent files.
We're going to say, here's the URI. It's going to point straight to it. So, much in the
same way that part of
institutional repository success
is emailing them saying
send us the file and we'll upload it for you,
emailing them saying,
tell us the URI for where this lives,
and then we will create a torrent for you.
And so someone in the library could start building up
the SciOp project by adding torrents and files,
and we would just have to control one small server
and then everything should be able to seed
from that URI.
Would that work?
All right.
It's about right.
Yeah, this is the webseed concept, isn't it?
Where, sort of within the torrent metadata, as well as a bunch of trackers that you can go to to ask for peers that already have this content,
you also include a plain old-fashioned HTTPS URL that also holds the same content.
And so if there are no peers online, you just start downloading it from the web, which might be
your repository or it might be some department or lab's website or whatever.
And then as soon as you've started downloading, you start peering those chunks that
you've already got.
And it kind of kickstarts, kickstarts the swarm that way.
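As a rough illustration of what that looks like inside the torrent metadata: in the BitTorrent v1 format the tracker goes under the "announce" key, and the common webseed convention (BEP 19) lists HTTP(S) URLs under "url-list". The sketch below builds such a structure for a single file with a minimal bencode encoder. The tracker URL, repository URL, file name, and file contents are placeholders, and this is not SciOp's own torrent-creation code.

```python
import hashlib


def bencode(value) -> bytes:
    """Minimal bencode encoder for ints, strings/bytes, lists, and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


# Placeholder file contents standing in for a dataset that lives in a repository.
data = b"year,value\n2025,1\n" * 2000
piece_length = 16384
pieces = b"".join(
    hashlib.sha1(data[i:i + piece_length]).digest()
    for i in range(0, len(data), piece_length)
)

torrent = {
    "announce": "https://tracker.example.org/announce",          # placeholder tracker
    "url-list": ["https://repo.example.edu/files/dataset.csv"],  # placeholder webseed
    "info": {
        "name": "dataset.csv",
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    },
}

encoded = bencode(torrent)
# The info hash identifies the torrent to trackers and peers.
info_hash = hashlib.sha1(bencode(torrent["info"])).hexdigest()
print(len(encoded), info_hash)
```

The resulting file is tiny compared to the data it describes, which is what makes it easy to pass around.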
And one of the, so I think with my institutional hat on, like, one of the nice selling points
of that is, again, going back to the scraping thing,
where all of our websites are being bombarded with scrapers all of the time,
there's potential at least to distribute that load on your own infrastructure,
because as soon as one person's downloading it,
the next person can download it from two places,
and then the next person can download it from three places.
I think, thinking about sort of what people who listen to... everyone's just turned their videos
off, or is it me?
Okay. Yeah, I think.
No, it's, it's, it's just
Jez's, uh, feed is
struggling, but
since this records locally, we'll be able
to hear everything you said just fine.
Cool, cool, cool, cool, cool.
So we'll agree with whatever you said just there, and you
sounded very smart, we promise.
Yeah, yeah, totally for sure.
That's the magic of using this
software. That's the reason we put up with all of its
bullshit is that whenever that happens, we
didn't just lose all your audio.
Yeah. I got most of it, but then you started trailing off into robot voice. It was very impressive.
That was probably about the time I noticed everyone else's video going very MPEG artifact.
Yeah, you're back now. You're back now. So what was the last bit that you said?
Where was I getting up to? Oh, yeah. So, like, there's a nice argument to be made from an institutional perspective of this will help to relieve the load
on your infrastructure.
And this is exactly why
Linux distributions have been
distributing ISOs using torrents for years.
Aside from the sort of piracy thing,
that's the other thing that BitTorrent gets used a lot for,
and it means that the small organizations
throwing up their own Linux distro
can distribute it without it
completely crashing their tiny servers.
Yeah, it makes me
think about, and something I've thought about before but haven't really explored is you know how
when you go on Archive.org and there's a file, and one of your options is just to download a torrent.
Why don't we do that for institutional repositories and data repositories? Because those are starting
to get big. I mean, I know one of the reasons is because, like, we use proprietary software for
our institutional repository and Clarivate and Elsevier don't want to support that. Because I just,
the last institutional repository I worked with was, was Digital Commons, which is the
Elsevier one, used to be called bepress.
But we put some big files on there.
You have to have software installed on your computer to handle torrents a lot of the time.
And if you're on an administered computer, like the permissions involved with that,
especially, like, if you're faculty, your, like, work computer,
you probably don't have permission to do that kind of downloading.
Yeah, but it doesn't, there's no permission needed for Elsevier to give me the torrent file
of the data on there.
Like to, you know, just there's a download button.
there's a download torrent button,
there's a download raw file button.
Like, it costs them basically nothing to do that, right?
Yeah, but then you have to have the software on your computer
to be able to do torrent.
That's my problem.
I'm just talking about the, like,
why can't our software provide that option?
Is there a technical reason?
No.
So as Jay, as Jay is saying,
it's like, like, most of the limitation is on the like recipient side.
And on like, well, that's, well, part of the beautiful thing about BitTorrent is that
it problematizes the difference between serving and receiving.
Like, that's sort of the whole point is that everyone is both.
But yeah, like, it's like, that's like the major roadblock is that just like people
needing to be able to run some different software that isn't a web browser in order to, like,
so you'd hate to have this be a way that diminishes access to information, like, you know,
putting in a place that like people can't access it just because of the software.
And so, like, there's some approaches for that.
But in the
ideal case, it is a purely additive thing that like, as exactly what you're saying of we just
add torrents and they are just an additional download mechanism. That's the idea that like, and I think
that torrents, BitTorrent can be very mysterious. And so like one of the things that we wrote in our
documentation is like, "What is BitTorrent?", trying to demystify that a bit. And it's like, it's so
simple you would not even believe it. That just like there's one of the other.
I say this all the time, and I just like don't know how much this, I never know how much this lands.
And it's like, when you're talking about peer to peer, it's like, BitTorrent is like the Hegel of peer to peer.
Where, just like, you can't really have like a disposition towards peer to peer without reconciling your disposition towards BitTorrent.
Like it's like, it's so fundamental and foundational to what peer to peer is that it's like everything else is sort of like in some way derivative of it or reacting
to it or like, you know, that, like, and so all that it is is just like you take some data, take a
whatever, you know, a file, a stream of bits, chop it up into little pieces, make hashes of
those, which are just like an abbreviated representation of that data where it's like, if you get
that data, you can do the same operation, the same hash operation, and then if the hashes
match, you know that you've got the correct data. And that's pretty much it, like that, like you
just have pieces of data and distribute them.
So what it means to make a torrent is literally to just read your files
and then produce another file that's just like a very small reference to those files.
And then that's it.
So whenever you're hosting data, presumably you have read access to the data that you're hosting,
hopefully, you just hash the data and then you post it along with a hash representation of it.
It's almost the same thing as posting checksums along with your data,
except these are checksums you can download.
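The chop-and-hash idea can be sketched in a few lines. BitTorrent v1 uses SHA-1 over fixed-size pieces; the piece size and the sample data below are arbitrary choices for illustration, and the function names are made up rather than taken from any real client.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Chop the data into fixed-size pieces and hash each one."""
    return [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, index: int, hashes: list[bytes]) -> bool:
    """Repeat the same hash operation on a received piece and compare."""
    return hashlib.sha1(piece).digest() == hashes[index]


PIECE_LENGTH = 4096
data = b"example dataset contents " * 1000
hashes = piece_hashes(data, PIECE_LENGTH)

# A piece received intact passes; a piece with one byte changed does not.
good = data[PIECE_LENGTH:2 * PIECE_LENGTH]
corrupted = b"X" + good[1:]
print(len(hashes), verify_piece(good, 1, hashes), verify_piece(corrupted, 1, hashes))
```

Because each piece is checked on its own, it does not matter which peer or server a given piece came from.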
So like, so like that's the, yeah, the minimal lift that it takes is the selling point.
And like, why don't we do that?
Like so also think about like the interface for that though with like archive.org where like
why don't you use the torrent option every time there.
And part of it is because like the important like social element of that is not really centered
on a lot of these interfaces.
So it's like like in the same way that like Wikipedia
hides all of the work of the pages in the talk pages
and in like the WikiProjects and stuff like that
so that like what comes out on the other side
is like this beautiful pristine, unproblematic
encyclopedia article.
Like if you don't surface the swarm,
like you don't show that there are N peers here,
if you were to join the swarm,
you would be like another one of them.
Like these things need more peers
versus these, you know, like, if you don't, like, design that into the interface,
then it's just sort of like, oh, it's an additional download option.
Why would I choose the weird one?
And it's like, and so part of it is like a social project of like making users of data
feel like they've got some skin in the game and they can participate in its availability.
That like that, so if I'm like a regular, so that's a, that's a, you know, a major issue with,
with a lot of the current data that's being backed up and lost, that like these are
datasets that are used by thousands of people all over the world, but they only exist in one place.
And so these are like, it's like you have so many people that rely on it and have resources
that could contribute to its, to make it available, but that's just not how they see their role
as in the system. They are a user or a consumer of this data set, not necessarily a co-owner of this
data set. And so, like, that's, that's like part of the social change of if you then surface the
So if you go on to, we don't have this yet, we're working on it, but like, academic torrents, I think it's dot com.
I forget what it is.
It might be dot org.
Whatever academic torrents is like the other like major academic data tracker.
They have like these like little sponsor widgets or whatever.
Like when you're a web seed for some data set, it'll say like hosted by blah, blah, blah, you know,
they're just like giving people like a little brag point that they can, that they can say, I'm helping to make this thing available.
Yeah, the beauty and the curse of torrenting is always like someone has to be doing it.
And the more of the people do it, the better.
And if enough people aren't doing it, then, like, no one can use that file.
Exactly.
Yeah.
And I think that's where, like, having, normalizing having webseeds as well.
Because you don't put webseeds in a torrent for the latest blockbuster movie because you don't want to, like, have it available from this website.
Yeah, maybe explain, I mean, you did kind of explain webseeds, but kind of,
maybe give it another.
It's basically like instead of having a person who has their PC on overnight,
you have a web server that has the file,
and if there's no peers, it will be the peer, basically.
Yeah, and that can literally be like your existing data repository.
Exactly.
In fact, we do this regularly.
So all a webseed is, so again, what a torrent file is,
is a set of chunks of data that have hashes to them.
And so when you're downloading over BitTorrent,
you'll go and ask other peers, give me this chunk,
give me chunk 10, you got chunk 15, cool.
A webseed is that but HTTP servers.
So just like all HTTP servers,
you can request a range of bytes from them.
So instead of going and downloading the whole file,
just saying, I need range, byte range, X to Y.
And pretty much every HTTP server can do this.
If they can't, it's like ancient, it must be extremely ancient.
Or it's been deliberately disabled.
Right, yeah, right.
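To make the byte-range point concrete: for a single-file torrent, asking a webseed for a given piece amounts to an ordinary HTTP request with a Range header covering that piece's bytes. The helper names below are made up for illustration, the file and piece sizes are arbitrary, and no network request is actually sent.

```python
def piece_byte_range(index: int, piece_length: int, file_length: int) -> tuple[int, int]:
    """Return the inclusive (start, end) byte offsets of one piece in a single file."""
    start = index * piece_length
    if index < 0 or start >= file_length:
        raise IndexError("piece index out of range")
    # The last piece is usually shorter than the others.
    end = min(start + piece_length, file_length) - 1
    return start, end


def range_header(index: int, piece_length: int, file_length: int) -> dict[str, str]:
    """Build the HTTP header that asks a server for just this piece."""
    start, end = piece_byte_range(index, piece_length, file_length)
    return {"Range": f"bytes={start}-{end}"}


FILE_LENGTH = 1_000_000
PIECE_LENGTH = 262_144

print(range_header(0, PIECE_LENGTH, FILE_LENGTH))
print(range_header(3, PIECE_LENGTH, FILE_LENGTH))
```

A server that supports range requests answers with just those bytes, which the client then hashes and checks like any other piece.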
Yeah, I remember using this program called WebAants when we were on dial-up.
Remember WebAns.
So if your computer was, this was like when DSL was starting to come out, so files were
starting to get bigger.
But if you were still on dial-up, you would add this little browser add-on called WebAns,
and you would feed the URL for the download into, you would right-click on the file and say
download with WebAz, and it would chunk it up so that if you lost
your connection, like someone made a phone call or whatever, it would download bits and pieces
and you would watch the little chunks turn blue as each chunk was downloaded because your stupid
modem couldn't keep up with trying to download like a gif of Danzig's face, yeah.
That's exactly what your BitTorrent client is doing.
Yeah.
Rather than pulling them all from the one place, it's like, who's got this piece, who's got
this piece?
So like this is like a part and that's part of the, then,
there's like layers of experience with the torrent and just like most people
never had much experience with it.
The people that have had experience with it like will that will be especially with
rare torrents that will be like a common experience too that just like I tried to
get something and it's not there. And so that's like the fear that we that we
experience from a lot of people that it's like well if we put it on bit torrent then
like eventually people will stop seeding and it will be gone. And like that's true but
that's also true of the archives. So like and and so if you think
of your web archive, like, wherever you are hosting data, like, whether that be in some
S3 bucket somewhere or your, you know, institutional hosting service. Just think of that
as a peer. And so your, it's like, when I say it's strictly additive, that's what we mean.
It's like, we're only adding more peers. And your existing thing is just a big one that's
added as a webseed. It's a different kind of peer, but a peer nonetheless. And so I think that, like,
one thing that is like an immediate and obvious thing that you can do is that like if you both have webseeds and the ability to add webseeds to things after the fact, then you can do something that you can't currently do with like existing HTTP archive tech, which is like download from multiple of them at once.
That like that so we have these data repositories where I think about from a researchers point of view too,
where like when I'm going to archive my data,
like I can put it on like my discipline-specific archive
or sometimes you might have an NIH archive,
like for the genetics and genomics people will have like there,
but you put it on there and that's the only place that it is.
And if I wanted to add it to Zenodo or whatever,
that's like an additional step.
And then someone,
and now someone will have to choose between these two things
and it's one or the other.
But like having a torrent with webseeds,
what that immediately allows you to do is if I,
so there's like, say there's like institutional collaboration,
we are co-hosting this data together,
that like this happens all the time.
Like there are like multiple institutions
that are involved with collecting,
organizing, curating, and hosting data,
is that now it's possible for me to download it
from all of that at once.
Even if there are only webseeds,
even if there are only HTTP servers in the mix,
no peers,
I can still spread my downloading out
and use bandwidth from all of these things at the same time.
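One simple way to picture spreading a download across several webseeds is to deal the pieces out round-robin, as sketched below. Real clients choose sources dynamically based on speed and availability, so this is only an illustration of the idea; the function name and the two URLs are placeholders.

```python
from itertools import cycle


def assign_pieces(num_pieces: int, sources: list[str]) -> dict[str, list[int]]:
    """Deal piece indices out to the available sources, one at a time."""
    if not sources:
        raise ValueError("need at least one source")
    plan = {source: [] for source in sources}
    for index, source in zip(range(num_pieces), cycle(sources)):
        plan[source].append(index)
    return plan


# Two hypothetical institutions co-hosting the same file over plain HTTPS.
sources = [
    "https://repo.example.edu/data.tar",
    "https://archive.example.org/data.tar",
]
plan = assign_pieces(5, sources)
for source, pieces in plan.items():
    print(source, pieces)
```

Since every piece is verified against the hashes in the torrent, it is safe to mix bytes from different hosts in one download.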
