librarypunk - 161 - SciOp.net feat. Jonny and Jez (part 2)
Episode Date: March 19, 2026
Part 2 of our discussion with Jonny and Jez about SciOp, a torrent-focused data preservation project that encourages academics to help with the act of making data available. It’s distributed, it’s... robust, and SciOp is working on making it easy to do.
contact@safeguar.de
Contact Jonny:
Fedi: https://elektrine.com/remote/jonny@neuromatch.social
Bsky: https://bsky.app/profile/jo.nny.rip
Media mentioned
https://sciop.net/
https://forum.safeguar.de/about
https://blog.sciop.net/2025-08-29/webseeds
https://punctumbooks.com/titles/warez-the-infrastructure-and-aesthetics-of-piracy/
Safeguarding/sciop: collecting at-risk data in torrent rss feeds - Codeberg.org - https://codeberg.org/safeguarding/sciop
- the scraping code, such as it is: https://codeberg.org/Safeguarding/sciop-scraping
- sciop the blog, such as it is: https://blog.sciop.net/
https://programminghistorian.org/
https://book.the-turing-way.org/
last episodes on federation:
https://www.librarypunk.gay/e/101-mastodon-bluesky-and-bullshit-part-1-feat-jonny-saunders/
https://www.librarypunk.gay/e/102-mastodon-bluesky-and-bullshit-part-2-feat-jonny/
Tools mentioned
webrecorder, browsertrix: https://webrecorder.net/ https://webrecorder.net/browsertrix/
academictorrents - academic data tracker: https://academictorrents.com/
agregore (mauve): https://agregore.mauve.moe/
DRP - DataLumos: https://www.icpsr.umich.edu/sites/datalumos/home
https://en.wikipedia.org/wiki/Distributed_hash_table
https://torrentfreak.com/controversy-as-rookie-admin-aspires-to-bittorrent-domination-080730/
ddos secrets: https://ddosecrets.org/
Archive Team Warrior: https://tracker.archiveteam.org/
How to contribute to sciop
https://sciop.net/docs/quickstart/
Create account. Create a torrent. [find how to link for how to create a torrent] Upload. Get permission so they are visible to others. Seed.
All transcripts: https://podscripts.co/podcasts/librarypunk
Join the Discord: https://discord.gg/qWPTurTnkT
Transcript
So let me give you a scenario of something I was working on.
So we were working on this linguistic project, and part of the data plan was we are going to host it in our institutional repository,
and this other institution is going to host it in their institutional repository, which is different software,
neither of which give torrent as an option.
But if we were using SciOp, I could say, when I'm creating the torrent, here's one webseed and there's another webseed.
Could I do that?
Absolutely.
So it would be one torrent but two webseeds.
Yep.
And if a third institution comes along and also has a copy on theirs,
then you can add that as a webseed later.
Yeah.
Yeah.
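As a rough illustration of the one-torrent, several-webseeds idea: in the torrent file format, HTTP webseeds are listed under a top-level `url-list` key (BEP 19), outside the `info` dictionary that the infohash is computed from. The sketch below uses only the Python standard library and is not SciOp's own tooling; the repository URLs, the file name, and the function names are made up for the example.

```python
import hashlib


def bencode(value):
    """Encode ints, strings/bytes, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def make_torrent(name, data, piece_length, webseeds):
    """Build a single-file (v1) torrent for in-memory data.

    Returns (torrent_bytes, infohash_hex).
    """
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {"name": name, "length": len(data), "piece length": piece_length, "pieces": pieces}
    meta = {"info": info, "url-list": list(webseeds)}
    return bencode(meta), hashlib.sha1(bencode(info)).hexdigest()


# Hypothetical URLs for two institutional repositories holding the same file.
seeds = [
    "https://repo-a.example.edu/files/corpus.tar",
    "https://repo-b.example.org/downloads/corpus.tar",
]
data = b"stand-in for the real dataset bytes" * 1000
torrent, infohash = make_torrent("corpus.tar", data, 16 * 1024, seeds)

# A third institution joins later: only `url-list` changes, so the infohash stays the same.
torrent3, infohash3 = make_torrent(
    "corpus.tar", data, 16 * 1024, seeds + ["https://repo-c.example.net/corpus.tar"]
)
```

Because the webseed list sits outside `info`, the torrent keeps its identity when a webseed is added; people who already have the old file would still need the updated one (or to add the URL in their client) to make use of the new source.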
So that's something like we can do now without,
because part of my brain is like how do we get DSpace to do a feature request
where it just automatically gives everyone the option to start hosting torrents,
which I think would be very nice.
But we don't have to wait for that.
Like we could start with SciOp
Like for example
Going back to like my next job
I could start making this argument of
We've already hosted the data
Now let's just start making torrents for it
And we'll have a website that points people to the torrents
Yeah we just
Yes soon soon
We don't have a we don't have a
Executable for you just yet
But that is the plan
Like that kind of thing
But like even even now
You could
You could start manually
Yes
going, okay, we've got
these are our 10 biggest
data sets. Let's make
torrents for them and just list them on our
webpage. And like, yeah,
I meant manually, I meant like,
because in my head there's going to be someone
who's like 10% of their job is going to be doing this
because I've always got like my manager hat
on of like I need to,
who is going to have to, whose
problem is this going to have to be? Is it going to be mine?
It's going to be someone else's. Who do I have to convince
to give me five hours a week of their
time? Because, you know, it's,
doable, and I don't want to waste anyone's time. But like, yeah, could we start, could you go out
and start making torrents of your data right now using webseeds? Yeah, yeah, basically, yeah. And part of it,
part of what we've been doing, I feel like we're talking a lot about the
technical challenges of this here, so maybe one last bit on that, because the more interesting
part is, you know, going off road and scraping and stuff like that. It's like the glory of
the hunt. But
the BitTorrent ecosystem is ancient and ailing and old and stuff.
And so, like, part of what we've been doing is trying to revitalize the development effort around it.
Where, like, to make a torrent, you know, there are these truly disturbing blobs of C code that you can use to make a torrent.
And it just, like, sort of half works.
And it, like, you know, segfaults on my computer half the time.
But we're just, like, trying to modernize that a bit,
where now you can actually have a program that makes good torrents efficiently,
that you can plug into like a deployment pipeline easily.
And that's part of what we made.
It's worth mentioning as well that the vast majority of the software in the BitTorrent ecosystem
is from the perspective of, I mean, for a start, there's a whole aesthetic about trying to make your torrents as small as possible.
So there is, Jonny will know this much better than I do, but culturally there's a size beyond which it's kind of assumed that torrents don't go.
And as soon as we got into scientific data sets, we immediately started breaking those assumptions.
Yep.
That's a lot of fun. That story is a lot of fun.
That's not an issue with the design of the protocol, because I think having looked at that,
it's really very flexibly designed, but it's a function of the implementation choices made by the people writing those client softwares.
Yeah.
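One place where dataset size shows up concretely is piece length. A v1 torrent stores one 20-byte SHA-1 hash per piece, so a small piece length on a multi-terabyte dataset means a very large number of pieces. The sketch below picks the smallest power-of-two piece length that keeps the piece count under a target; the target of 100,000 pieces and the 16 KiB floor are arbitrary assumptions for illustration, not limits taken from any particular client.

```python
def piece_count(total_bytes, piece_length):
    """Number of pieces needed to cover total_bytes (integer ceiling division)."""
    return -(-total_bytes // piece_length)


def choose_piece_length(total_bytes, max_pieces=100_000, min_piece=16 * 1024):
    """Smallest power-of-two piece length that keeps the piece count at or under max_pieces.

    max_pieces and min_piece are illustrative assumptions, not protocol constants.
    """
    piece = min_piece
    while piece_count(total_bytes, piece) > max_pieces:
        piece *= 2
    return piece


def pieces_field_bytes(total_bytes, piece_length):
    """Size of the v1 `pieces` field: one 20-byte SHA-1 digest per piece."""
    return 20 * piece_count(total_bytes, piece_length)


GIB = 2**30
TIB = 2**40

movie = choose_piece_length(1 * GIB)      # a "scene-sized" file
dataset = choose_piece_length(10 * TIB)   # a large scientific dataset
```

With these assumed numbers, a 1 GiB file stays at the 16 KiB floor, while a 10 TiB dataset ends up with 128 MiB pieces, which gives a sense of how far scientific data sits from the sizes that tooling built around movie-sized files was tuned for.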
So it's like, you know, the legacy of this technology.
We're trafficking in, like, one gigabyte, The Matrix 99, YIFY, dot MKV, 1080p.
It's like, speaking of BitTorrent aesthetics, the naming conventions are one of my favorites.
nice they're fun
whenever I rip a video
for like for a podcast
where, like, there's not a torrent available
so I have to like rip the DVD myself
I always put it like smazzy gang
dot rips
yeah
I feel like I should
feel like I should mention if
if like me you had a very sheltered
and naive upbringing and didn't like
get into bit torrent as a teenager
I love the I'm learning about
a lot of the social history of this
through Martin Paul Eve's book.
Hell, yes.
Great.
I made a book on,
wait, what is Martin Paul Eve's book?
Warez.
W-A-R-E-Z.
It's a phenomenal book,
both because it's like,
it specifically covers
the culture and aesthetics of piracy.
Where, just like,
a lot of books that try and write about
about piracy
just miss the entire scene of it.
Like, they just don't approach
the actual social structures.
that underlie it. And like, that's the most important part. And, and to, like, talking about, like,
he arrives at, like, this correct thing that's, like, everyone in the scene would know, but, you know,
it's like, it's not really written. I've never seen, it's the only academic work I've seen
that actually describes, like, some of the, some of these basic phenomena. Like, in piracy,
why would it be the case that if the whole point of it is to, like, steal stuff and make it
available. Why would there be beef or problems if one group were to, like, steal another group's
release and claim it as their own or something like that? That just, like, the amount of
internecine struggle that exists and it is just like, can only be really understood as just
like a, you know, a culture of honor and, you know, elite scores and stuff like that that you're
trying to get. But it, so anyway, yeah, I highly recommend that for sure. Yeah, that whole
bringing it, dragging it back onto the point I thought I was trying to make.
Can I remember that now?
It was, like, that whole warez culture very much informs all of the current technical implementations.
And, like, you find a bug because you're trying to make a torrent that's bigger than the author of libtorrent thought a torrent should be.
Right.
And you go and you report a bug.
And they're just like, that's ridiculous.
No one would ever want a torrent that big.
I'm not going to implement this.
Yep.
So this also relates to what, like, what Jay was saying just a moment ago,
just being like, like, needing to have this special software.
And, like, there is a thing called WebTorrent.
And it's, like, been designed to work in the browser and to be, like, a thing.
But it's been hampered by this culture of, like, you know, BitTorrent.
It's like, there's a thread that's been going on for
several years now on libtorrent.
That's like, enable WebTorrent for, you know, libtorrent.
So libtorrent is the software that runs underneath a lot of clients.
So like, you know, qBittorrent is a front end, like a GUI built on top of libtorrent.
And so it's like the argument is like, we want to serve people on the web that are acquiring
data with a different modality that don't or can't run BitTorrent clients, but they should
be allowed to be part of this BitTorrent
swarm. And so even if I'm downloading using WebTorrent, and I might not become, like, a permanent
seed of this thing. I should still be able to use it and still be able to be part of the swarm.
And so, also just like a side note, maybe put in the footnotes, there's this wonderful
browser, Agregore, that's made by Mauve, RangerMauve, this excellent piece of software that really
flies under the radar. That is a browser that's designed to be, you know, peer-to-peer
first, where you can actually become a peer in these peer-to-peer systems just
directly from your browser. Anyway, so what we were trying to say is that one of
the major reasons that WebTorrent support is not enabled by default is that people
want to structurally exclude WebTorrent from these peer swarms because there's
the perception that what we want are long-term seeders, traditional BitTorrent users,
and WebTorrent is just a means of hitting and running, you know, the term of art in BitTorrent
for downloading without seeding.
And so what we're trying to make the point of saying is that it's like, well, this is,
BitTorrent has always worked in this way where you have this mixture of long-term seeders.
Why do people seed the Pirate Bay torrents?
Because some people are just pathological seeders like that and just need to seed.
But many people are not.
So expanding the base of who can be involved is always more powerful than trying to limit it,
especially limit it structurally and at like a code level to only those people that we really want to be in the swarm.
Yeah.
I can see how that's a bad limit.
Like there's good pro-social limits, but saying like my assumption is always going to be correct and therefore I'm going to physically limit the tool.
Like this could be a two-person saw, but I'm going to only.
design and sell one person saws.
Exactly.
Right.
And, yeah.
And part of the issue there is, there's this term that I love.
I can't remember who introduced me to this term, cookie licking, where there's a
box of cookies and you pick up a cookie and lick it and put it back.
So it's there.
but only you can really have it.
And so it's just like this ownership of something
that should be held in the commons.
And you aren't necessarily making use of it,
but you're holding ownership of it.
And that's exactly like the BitTorrent developer scene
is cookie licked to the max.
Like, Bram Cohen,
like, grifter-in-chief of BitTorrent,
has now pivoted entirely to focusing on, like, his crypto scam.
And so he has drawn the other major developer of libtorrent along with him.
So all of their time is now devoted towards crypto scamming.
And so in the meantime, the rest of us trying to use libtorrent and BitTorrent are left in the lurch, like, we're waiting for the crypto bros to come back and care about us again, and the software isn't moving.
So that's the current state of those WebTorrent-enabling PRs, like, begging the one guy in charge
of this piece of software to enable some flag.
And so that's...
I'm feeling like the library equivalent of this
would be like MarcEdit or something.
And it's like, this is a great piece of software
and everyone uses it and it's free,
but also the developer refuses to make it open source
and refuses to accept any help in developing it in any way.
Yeah, that's the...
Like, something I've come across in my career
is sort of moving from like,
I think this is very cool and I'm able to do it.
So I'm just going to go forth and do it to like thinking how easy is it
for people who aren't me to take on this work?
I think MarcEdit is one of those because like,
I don't know if you're like on those lists or those listservs,
but so much of it is just asking people how do we do a thing
because not only does the creator of MarcEdit not make this open source
or accept any help, his documentation is also terrible and barely explains how to use the software.
I've put in there like, hey, is there a way to do this? And he goes, oh, huh, I never thought of that.
I'll have to do that when I have access to an OCLC API key again, because I don't anymore at a different job.
Like there's this whole plug-in for MarcEdit that he wasn't able to develop for a little bit
because he was no longer at an institution
where he had access to the OCLC metadata
API.
Like,
like,
this is just me going on a little rant now,
but it's,
it's a really frustrating thing when like,
he just like doesn't make it open.
And it's like,
well,
what's going to happen when he no longer can work on it?
Yeah.
Like,
no one else can take it up.
Like,
if he,
if he does sort of shuffle off this mortal coil,
there are plans to make it open source.
And like, why don't you just do it now and then you don't have to have the plans?
Yeah.
And it's because there are so many people who have, like, such cool ideas for MarcEdit or for, like, other tools.
And, like, part of me, like, understands his reasoning for not making it open source, but not really at this point.
There are two projects that I would love to just, if someone would pay me to have a roof over my head and food to eat.
that I would love to just work on.
One of them is the open source MarcEdit and the other one is the research-grade BitTorrent client.
We're going to get funding for that.
We are.
We're definitely going to get funding for that.
Yeah, this brings up maybe an interesting question. I, as someone who is all for things that are open source, or the sort of pro-social sharing and torrenting and everything, but is also very concerned about the
politics of free labor.
Yeah.
Out of people, like, I know that's a huge problem with, like, the open source space.
So I guess, I don't know how, because I'm not nearly as techy as I come off sometimes.
Like, in this project, like, what sort of, it's like, you know, if someone's going to be hosting a torrent,
they have to be able to afford the energy to do so and pay for the internet.
So in a way, it's almost like a rentier kind of thing.
You have to be able to pay to have internet to do this
and to have electricity to do this.
And so there's a lot of people contributing time and energy
and resources who aren't necessarily getting paid
for that or compensated for that.
So I don't know if that's like a conversation
that's come up with this project
or like sort of in the torrenty space or anything.
But what does that kind of look like?
There are many different flavors of this.
So I think the
place
that's reasonable to start with
is thinking about the
norm and the status quo,
just how labor works for a lot
of these public archives
and stuff like that. Where,
on the one hand,
the norm of it is already,
so if I were to, like,
think about archive.org as an example,
there's
a lot of wonderful things,
we love archive.org
like you know there's some problematic
elements of it, but just like, I'm not trying to like start a fight there, but like that...
We are, though.
You know, everybody's got their beefs.
Sort of, you know, but like what you might do is for that is I go and gather something.
I do the work of finding something and preserving something, formatting it, whatever,
and then I'll upload it to archive.org.
And so I've done the work of doing that.
But then I, in one respect, I don't have to have any resources in order to host it.
In another respect, I've done a bunch of work to basically, like,
augment archive.org's portfolio, and
I also don't have a lot of ownership over the thing that I have made or contributed to.
So there is that push and pull there: if you do make a hosted service,
you do lower the barriers and do lower the resource constraints to accessing and using that thing.
But at the same time, you then own that thing and you sort of like by necessity take control of it.
And so, like, we love Data Rescue Project.
They are doing excellent work.
They are much more organized and much more complete and rigorous in their coverage of the things that are missing.
But they're using this piece of software called DataLumos.
That just, like, requires you to, like, log in in order to access stuff.
Like, they're, you know, and it's hosted by them.
And, well, I actually don't know if it's a thing.
I think it's, like, an independent organization.
But that's like the norm of how these file hosting, file archiving, et cetera, works,
is that someone else is hosting it and they basically own it.
And that's, like, trying to break out of that loop. We want to avoid the next time
when this thing, the ownership has a hostile acquisition or loses funding or whatever
and goes down.
And so all people's time that they spent contributing to something is now just gone
because they didn't have any sort of ownership stake in it.
And that's like a little bit of like getting back into like technical details,
but like trying to bring down those resource constraints and trying to bring down like the,
you need to have an always on server in order to participate and stuff like that is like
part of the direction that we're trying to go into as far as like automating these things on my phone and my,
and my laptop and whatever and being able to be a partial peer some of the time when I have the resources in order to,
and meshing that across, you know, the fact that people are using multiple devices.
I don't want to necessarily go too far into that.
But basically, like, that question sort of underlies the whole of how we design the system
and why we are doing it in the way that we are doing it.
Because, like, we do want people to, in general, be able to have control and ownership
over, like, the things that they put their labor and time into.
and like I'm working you know
going down into just the design of the code
or just like the next thing that I'm going to do
as soon as I start like I'm pivoting
I'm doing some background work on something else
that I'm not going to talk about but like coming back into
SciOp is, like, start breaking it up into more of a plugin-based system
where, currently it's open source,
but like you don't even need to ask me
to add some functionality to it
like you could just go off and do your own thing
make that available and then whatever.
It fits in the framework and you can run it on your own.
And so, like, yeah, we're acutely focused on this problem of how do we value people's labor in a way that, like, takes us out of the equation, basically.
Like, that's like, we shouldn't even be us valuing your labor.
It should just be you being able to do something.
And so, anyway, that's part of the social.
politics of the space, but I feel like I'm getting too far into like the
socio-technological, and we haven't really even spoken much about just like the actual
stuff that's being stolen from the culture and like the more of the what actually is
being lost element of this, but sorry, don't want to go too far into the technical details
and miss the forest for the trees. I think like picking up on that and the like,
what were you going to say, Jez?
I think the aspect of that that sort of does keep bothering me
with trying to get more institutions to adopt this,
like, that would be great,
but I'm very wary that one of the ways that you could sell it
is this will make your repository cheaper.
And if everyone adopts like peer-to-peer file distribution,
because it will make their repository cheaper.
It doesn't work.
You end up with the tragedy of the commons.
So you have to have the kind of governance and the norms in place formally or informally
that sort of support the positive of the commons.
Explain that to me because in my mind we were talking about all this, you know,
increased server load because of all the scrapers.
And you could mitigate that by having torrents.
So why would, I mean, I guess is the assumption that people would stop having their own repositories,
and that's the way they would save costs?
Or because if everyone just, if DSpace gave you the same option as Internet Archive,
which is download file, download torrent, how would that be a problem for the commons?
Yeah, and I don't think that would be a problem.
I think I think that the danger is more,
when you're kind of making the business case for this to be budgeted for and things like that.
And you're saying, oh yeah, this is reducing the loads on our servers.
And they're like, great, that means we can give you less money for servers.
And I think, like, there's not a reason not to do it.
It's just a reason to be sort of aware of how these kinds of arguments play out in different ways.
Or maybe if you're starting this project, have a plan for,
how you're going to reuse that freed-up server space from the beginning.
Because also, if you don't need the server space, you don't need it. So, like,
it is good financially, to put my manager hat on again.
You know, if we don't need to spend money on something, we don't need to spend money on it.
But that's also where as a person who likes to build his own little fiefdom, I would go,
okay, I've saved money.
Now let me spend that money in something else in my fiefdom.
So, you know.
One of the cool ways of getting, I think, to get this adopted by some institutions would be to get together a little.
I want to say cabal, but I'm going to say collaboration of like two, three, four like-minded folks in different libraries.
And that's the point at which you can say, okay, for resilience and sustainability, we want to store three copies of this data.
at the moment, we all pay for hard drives
and we all have our three copies in our different machine rooms.
Let's still pay for the same amount of storage,
but we'll keep a copy of yours
and they'll keep a copy of theirs
and it's sort of increasing the geographic
distribution and resilience of that
without increasing the cost.
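The swap arrangement described here can be sketched as a simple ring: each library keeps its own collection plus copies of its neighbours' collections, so every collection ends up at three different sites while each site still stores three collections' worth of data. This is only an illustration under the assumption that the collections are roughly the same size; the library names and the function name are hypothetical.

```python
def ring_replication(sites, copies=3):
    """Assign each site's collection to `copies` distinct sites around a ring.

    Returns a dict mapping each holder to the list of collections it stores.
    Assumes all collections are about the same size, which is a simplification.
    """
    if copies > len(sites):
        raise ValueError("need at least as many sites as copies")
    plan = {site: [] for site in sites}
    for i, owner in enumerate(sites):
        for k in range(copies):
            # The owner keeps copy 0; the next sites in the ring keep the rest.
            holder = sites[(i + k) % len(sites)]
            plan[holder].append(owner)
    return plan


# Hypothetical group of four like-minded libraries.
libraries = ["library_a", "library_b", "library_c", "library_d"]
plan = ring_replication(libraries, copies=3)
```

Each library still pays for three collections of storage, the same as keeping three local copies, but the copies now sit in different machine rooms.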
Yeah, that's kind of what just pinged in my head
thinking about this from the IT perspective is like
I mean, for one, I'm all for, you know, doing shit just because it's good.
But there is also the, like, the practical part of me is like, who's going to maintain the server?
How much time is it?
How much of it is shadow IT?
Because, like, that's a huge issue, right?
It's just making sure users aren't doing some weird shit that we have no idea about.
But, like, I know that our state library has a large archival collection
of photographs from all over the country, and all sorts of libraries contribute to it. Now I'm like,
wouldn't it be cool if public libraries with more of that space could create that torrent,
keep that archive going across the state so that way it's not just on the state library to
maintain it because libraries are contributing to it but not necessarily hosting it. So now I'm like,
I could see, I could absolutely see a case for this if like certain, if certain like, if we could get like a
state library or the people who have the big archives like, I'm in Washington. So like the
UDub, why not run a torrent for UDub's archives? So like, I could see the business case for this as a
backup solution kind of thing. Like, I think what you're getting at, Jez, is, like,
it's the 3-2-1 rule of backups. You always want to have some geographically distant, right? So
this could be kind of a way to bolster that even if it's not like a full, you know, backup
of something. So I've got, like, I've been gathering a ball of yarn over here,
like a bunch of these interrelated thoughts about this. So, okay, this
is about this idea of platform fiefdoms and resource allocation and the
distinction between archivalism and active service and stuff like that. So, okay, like,
this goal of platforms and web platforms
of bringing resources down
for people to be able to use data.
It's, like, on its face, a good one.
That is a positive thing to be able to do.
To make it a website, I go to the thing,
it's all set up there for me.
I don't need to do anything.
But then that also poses an additional problem
for archiving.
How do you archive a web service?
That's a much harder thing to do
than archiving static data.
the underlying static data.
And even archiving the static data plus the code needed to use it.
Way, way harder.
Yeah.
Way, way harder.
And so I think about this, one of these examples that, like, we faced, like, drought.gov.
That just, like, this is a resource that farmers use to farm, like, essential information,
like, being able to plan crop yield and being able to just, like, have information about drought
and water usage and just, like, what the next year is going to look like.
And so that was one of the casualties of, you know, the Trump administration that took this thing down.
And we're trying to archive it and, like, how? And it's because there's, like, multiple
pieces of that. One is that just like without the funding for the thing to continue existing,
then it will not continue to be updated. You know, the thing that's
lost is not just the data. It's the means of keeping that data functioning. Like, the system itself
is lost. And you can't archive a system of labor. And
so that's, like, a fundamental limitation we cannot address ourselves.
Like, that's just like, that's not possible. And, like, trying to do that would be hubris, right?
And so, like, when that thing goes down, not only do you lose access to the thing that could be backwards looking,
but you also lose access to the thing that would be forwards looking.
And so we have partial solutions to how to archive a web service. Like, our friends over at
Webrecorder and Browsertrix have made this tool where, usually when you go and do
a web scrape or a web archive, you go and do static HTTP requests, go and download this set of
files, JavaScript, et cetera.
Browsertrix and Webrecorder work differently, where it simulates a full web browser, and it
just captures all of the network traffic that happens the whole time you are using the site.
And so, for example, if you're going to a site like drought.gov and you go and zoom in on the map, that's requesting additional map tiles. And you're scrolling around and looking at these different facets of this data. All of that can be captured and recapitulated. And so basically what you need to do is be able to, well, in the volume case, automate that kind of interactivity, go and zoom in on all of the map tiles and expand every data browser
and so on. I'm thinking about the heinous miscarriage of infrastructure that happened in the United States where, what the hell was that?
Where all the government services were using this, like, was it Plateau or Tableau? That's what it is.
Tableau, yeah.
That just like, that now all of the government sites have this, like, embedded JavaScript applet that, like, has their live updating COVID tracker or something like that.
But that's, like, a really rickety and fragile system that
doesn't serve, like, public use of information in that way.
It makes it available, but it's impossible to archive.
So, anyway, we have this, I want to bring in Browsertrix, this more
advanced kind of web scraping.
And they've done a good job of making that accessible to you.
And then you basically get into the need for, we want to then not only scrape and grab
these sites, but then make them available again.
So, like, you know, maybe drought.gov wouldn't be able to be useful in the future
because it's not being updated, but other interoperable,
interactive web services can be archived that aren't, you know, intended to be, like, live
updating things like, you know, climatological and weather information. And so that's also
currently something that, like, SciOp can't do yet. And archive.org does better. That just like
that archive.org does preserve some, you know, you can, if there is like some web service that can
be captured by their technology, you can then just go to a website and have like the same
quality of experience, like the same utility that the web service gives to you. And so that's,
that is something that the Browsertrix and Webrecorder folks were working on. In fact,
they, I mean, hopefully by the time that the podcast comes out, like,
Ilya said, like, don't mention the release yet because it's still in beta. But I'm going to
mention it anyway. They released a version of Webrecorder that is BitTorrent-powered,
that like you can make a, you know, a full network traffic backup of a service and then have,
be able to browse that from a BitTorrent archive. The limitation that they face is exactly the same
one I was talking about before, that libtorrent doesn't want to enable WebTorrent,
blah, blah, blah, but like assume that that works. Then you get into the situation where just like,
now you can have a distributed archive.org where, if I care about something, if I want something to
exist and I know it's going to go down soon, I can basically go and just use it, like open up a
Browsertrix web browser, go and touch all the knobs and fiddle with everything to record and
capture all of that. And then basically, in the act of me using it, I create an archive from that.
So I create this thing that's a WACZ file. It's just like a big zip file of all that network
traffic. And then you have the question of trust, that, like, you
can make all of these distributed copies of websites and stuff.
But then the thing that Archive.org also provides is that trust.
You trust that an Archive.org copy of a website is correct
because Archive.org systems captured it.
And they do not have a history and a reputation of modifying the things that they capture.
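Since the capture format mentioned above is described as "just like a big zip file", a copy can be inspected with ordinary zip tooling. The sketch below lists the entries of a WACZ file and computes a SHA-256 digest of the whole file, so two people holding copies can at least confirm they have the same bytes. It assumes the usual WACZ layout (a `datapackage.json` at the top level and WARC data under `archive/`), and the function name is made up for the example. A matching digest only shows that two copies are identical, not that the original capture was faithful, which is the harder trust question raised here.

```python
import hashlib
import io
import zipfile


def summarize_wacz(fileobj):
    """Return the SHA-256 of a WACZ file and a summary of the entries inside it.

    Reads the whole file into memory, so this is only suitable for small archives.
    """
    data = fileobj.read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "entries": names,
        # WARC files hold the recorded network traffic.
        "warc_files": [n for n in names if n.startswith("archive/")],
        "has_datapackage": "datapackage.json" in names,
    }
```

Usage would look like `summarize_wacz(open("capture.wacz", "rb"))`, where `capture.wacz` is a placeholder file name.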
So it's like the thing that I'm like the sort of big ball of yarn trying to pull out here
is just being sort of like that once you try imagining like an alternative to platform,
web, then you need to basically pull out and redo everything. Where, just like, you need to
rebuild the technology by which we populate these archives. You need to create a whole social
system for like distributed trust so I know who's scraping what and like whether or not this
has been useful in the past, so that, like, I can host a BitTorrent client
and, like, you know, know that I'm going to connect to peers that are trustworthy, that aren't
just, like, trying to spam the network with garbage. And so, like,
that is a massive project that involves, just like, basically, you know,
rethinking just like the way that every person is using computers.
And so like how do you like make that palatable and make that like, you know,
sort of sneak these ideas in the back door basically.
And just like, and like it's like sort of like all of these fronts are related.
Like you're communicating to administrators that we need to like it would bring down costs
in order to, you know, make torrents of everything.
That just, like, communicating to scientists, researchers, academics
that, like, when you use data, you should be partly responsible for keeping that data available.
Like, communicating to, like, the BitTorrent community,
just, like, WebTorrents are your friend.
And, like, all of these things are part of, like, you know, related to this, like,
largest part of it.
And just, like, you can do bits of them, like, independently and stuff like that.
But just, like, they become much more powerful once you have,
each of them in place.
And so...
And there's also, to jump in, if we're bringing in this whole new community of academic users,
that also puts pressure on the interpersonal ecosystem to say things like, you know,
WebTorrents are your friend.
And, like, that's another part of it too. That just like, the social tooling, in most cases,
is like an afterthought or just, like, basically dog shit, you know. If you
don't design a system with social tooling first, then inevitably what you end up with
is like the council of expert gods that like control all metadata and mediate the whole system.
And that's also, like, as you were saying, just like, who regulates, who runs this thing?
It's like it must be you.
Like you, like it pins a certain set of people into maintaining the system.
And so you need to build like the social tooling that allows people to negotiate over like who is able to post what.
and like how do we trust them?
And it's like what the Federation model is intended to bring
is like the ability to break off and make a new set of norms
and instead of like a community of archivists,
that isn't then isolated.
That doesn't become like its own separate system.
They might have different norms and standards than we do,
but like we still can like talk with each other.
So, like, that was one of the major goals, is that, like, sciop.net is
squeaky clean, super legal. Like, if you download a torrent from sciop.net, we make the promise to
you that it is legal to download it. Like, that's like our standard for that site, in part to
make it possible to develop the underlying technology so that then another archive group
that has a higher risk threshold that wants to be like hosting, you know, confidential leaks
or wants to be hosting, you know, just like, more risky information,
they can do that over there.
And like it doesn't necessarily implicate us,
but we can still be a part of that system as well.
So that's like what I was talking about before
with like the different kind of federation model
where you might want to have the idea of like dark archives.
Like who was it?
Was it the, was it, there was a New York library
that was like, we made a dark archive of all of data.gov
or something like that.
but we're not posting it publicly.
This was about a year ago at this point.
I don't know if you remember this, Jez.
Someone made a dark archive of all the government data,
and they were like, we have it and we're not sharing it.
I'm about to look that up.
But, like, you could do something like that with metadata.
Where, just like, I'm thinking about DDoSecrets or something like that,
where DDoSecrets, they're functioning,
like, what they do is legal, but it's extremely threatened.
And so we might want to make copies
of all of that metadata of all the stuff
that DDoSecrets has leaked,
but we might not want to say in advance
that we're a mirror. Like, we're mirroring
all of their content, and it's basically
like insurance. Like, they have an insurance
file up there where it's like, everyone download
this, and if DDoSecrets goes down, we'll distribute
the key and then leak some really damaging information.
But it's like insurance in that respect,
in this way, it's like, we have a mirror
of all this risky data. And if that
goes down, then we have a copy of it and can
verify that it is in fact that
data. It's like a provable
copy of the data and can be a secondary mirror. And to be able to like scale that from
widely acknowledged public mirrors of everything, where we just are going to, like, repeat it in the
same sense that, like, a Bluesky app or a Fediverse server is a public
archive of other servers' posts and stuff like that. We might want to be able to scale that to
privately archiving and reproducing these things as well. Yeah, that's yeah, that's more of the
the gray area legal nature of what one has to do when information, as we know it, is increasingly
illegal to have that you shouldn't be able to actually have the data, you should only be able
to access like a predigested AI hallucination of it. And that's the legal path to accessing information
as opposed to like, you know, being able to know stuff directly.
Yeah.
And definitely, like, we talked, I think it was last week with Hagenblicks,
about how the knowledge is part of,
it represents a power hierarchy.
And so, you know, your ability to go to college gets you,
is reflective of your already existing part in the hierarchy
and where you exist. And, you know, intellectual property
is part of that knowledge hierarchy.
And the part of AI is to get more people who don't have
knowledge, meaning, like, we can think of the word knowledge here to mean proprietary information
or data or things like that, to get that knowledge away from them so that you can then rely on
gig workers with less knowledge. So it's the same, like, Luddite, deskilling sort of thing of
you can then be pushed further down in the power hierarchy. We don't need professional managerial
people if we can have an AI that can kind of take that knowledge away from you. And so you don't
have to have it. The ownership class doesn't have to have it, but they own it, right? So
thinking of knowledge as power is really, as part of the power hierarchy is also very useful. So
I also wanted to ask, because at this point, I'm just going to split this into two episodes,
so we can go like 10 more minutes.
Oh, David, every time. Sorry.
Well, we're going to be on hiatus for a while when I move, so it's like, I would like to have
things I can, like, split out, too. If there's nothing else, I hope that what me being hyper-verbose
and talking about shit for way too long does is give you a little bit of heat off
when you need to give an episode while you do.
I'm glad that's a good outcome.
It's totally great.
There's part of one thing.
There's two things I want to get to, so you can choose which one you want to do.
For someone who has no idea, how can they start to help?
And then the other one is a discussion of,
web scraping, because you said that's the really fun part. So which one are you more excited about?
I think probably web scraping. I saw your face. I don't know, the how to get involved is probably
more important. Like, okay, I can also do that at the very, very end. Yeah, well, like, the fun off-road...
Okay, yeah. So the fun off-road part of the web scraping is basically like you're sort of
like doing reverse engineering and a little bit of hacking in a way that's like all of it is
technically illegal in the same sense that like using a computer is technically illegal because like
the Computer Fraud and Abuse Act is so broad. Like, you know, scraping a website,
like, visiting a website is illegal if they could argue that it's against the terms of service for
you right now to access their machine. But, like, that's been a joyful part of this,
of watching folks.
So in our group,
we have a mixture of folks
that are like old web scrapers
that do this all the time
and new people who have never done it.
And so that's like,
I guess this is a hybrid answer
to both of these things.
It's like, how do you get involved?
It's like you just like look around
and see what you care about
and what matters to you
and what you think is at risk.
And you go and grab it.
And alternatively, like, part of this is also scouting for that information.
Like, we thrive off of people who don't know computer stuff but know something's at risk.
And to be able to say, hey, we need help over here.
Can someone come and help us, like, handle this big thing that's about to go down?
So, like, this has happened a couple times where, like, we'll just be alerted to something and just, like, set the dogs of hell loose on this,
like, you know, set of, but, like, the barriers to participating are very low because, like,
one, if you don't know how to do it, we've written documentation that's, like, not complete by any
means. And, like, you can just go into the sciop documentation and just see if any of
that applies to you and do it. Like, if that doesn't work for you, just tell us,
I want to do this, but I don't know how to do this. And we'll do our best to, like,
describe that and to facilitate you being able to. Because it's like, we love to
scrape the web, but also, like we've been talking about, the whole test
of whether something is useful is like whether you can do it without me. Like, if you can't do it
without me, then it's shit. And we really haven't done anything or moved the needle as far as
like autonomy or power goes. So I want to make sure that people can do that on their own. So
give it a shot, try and do it using the information that you have available. And then when you get stuck,
just, like, yell at me and yell at the rest of us. Just, like, raise an issue, or, like, you can talk
to any of us on the Fediverse to say, like, how do I do this? And then the second part of
that is just, like, becoming a seed is, like, something you can do in 10 minutes.
You just go and, like, get a torrent client, let that run in your background, and
get one torrent. Go on the site and just look for, like, the thing that has very few seeds on it.
And it looks important to you, grab that. We have also, for people who have more resources,
like more storage resources, more bandwidth resources.
Like we've set up a system of RSS feeds for these torrents.
This is, like, an old feature that a lot of BitTorrent clients support.
So for example, like we tag all of our datasets.
And so one of the, when I look at the server logs and see what sites are being,
I mean what URLs are being accessed the most.
The most frequent is like the LGBT RSS feed for like getting all of the torrents that have to do
with queer people period.
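
As a rough illustration of what a torrent client does with a per-tag feed like this, here is a minimal sketch that pulls torrent links out of an RSS document. The feed contents, titles, and example.org URLs are made up for illustration and are not sciop's actual feed layout; the only assumption is the common convention of putting the .torrent link in each item's enclosure element.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed contents for illustration; a real feed's fields may differ.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>example tag feed</title>
<item><title>dataset-one</title>
  <enclosure url="https://example.org/one.torrent" type="application/x-bittorrent" length="0"/></item>
<item><title>dataset-two</title>
  <enclosure url="https://example.org/two.torrent" type="application/x-bittorrent" length="0"/></item>
<item><title>no-torrent-here</title></item>
</channel></rss>"""


def torrent_links(feed_xml: str) -> list[tuple[str, str]]:
    """Return (title, torrent URL) pairs for every item that has an enclosure."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            links.append((item.findtext("title", default=""), enclosure.get("url")))
    return links


if __name__ == "__main__":
    for title, url in torrent_links(SAMPLE_FEED):
        print(title, url)
```

A client that supports RSS polls the feed on a schedule and adds any new links it finds, which is what lets someone with spare disk and bandwidth follow a whole tag without clicking through each dataset.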
And just like that's, so, like, that's one of the biggest things that's been lost.
And, like, actually, we can't, it's like just, it's taking a side tangent.
I see there's a note of just, like, in the show notes about what has been lost.
And that is a tragic loss.
There's a lot of biomed research and a lot of, like, ethnographic research about
queer people that has been lost, in part because it has PII in it,
and like the, we can't just make like a public archive of this, of this data, but the people who
were the curators of it and the holders of it were forced to take it down, remove it, or usually
hide it, not actually destroy it. But like that, those, that's been the biggest casualty so far
of unrecoverable information that we cannot retrieve unless the people who, you know, are the
researchers or people holding it. Yeah. But that's like, that's the, that's the,
main thing is like when there is something around you that you care about, that you want to preserve,
that's the best way to get involved is that like you are the expert of your local domain.
Like, don't be waiting for someone else to tell you, go and get something.
I'm sure that you're, you know, anyone is already aware of this.
And then just like, like I said before, if you need help doing that, that's what we're here for
is to try and make like that possible for you.
And then to facilitate a group of people around you that also care about that thing and
also want to help preserve it. I don't know if that's
concrete enough for getting started.
No, no, no. Actually, so for example, I'm trying to think
of like, if I say that there is a
data set on my institution or repository or somewhere I used to work
and I go, oh, that's probably in danger. I could
create, could I create a torrent for it and get
that onto the sciop tracker so that because I need people to find
the torrent that I've created, like me as an individual
person. So if I've got that,
is there a way for me to submit
that torrent to the sciop tracker?
Yes, that's the, yes, do it.
Yes, that's the
real functionality of what sciop does. Yes,
you can submit things to it. So it's like,
it's relatively simple. Like,
there's a bunch of like moderation tooling that we have yet to build.
But, like, you go and make an account
on sciop. And for now, it's just a matter of, like,
you just need to tell us that's you.
We're working on messaging and commenting
and inter-site communication so that you can do that in-band.
But for now, just, like, make a thing, tell us that's you,
we'll give you upload permissions.
And so it's like the way that our permission system,
it's like a tiny bit different than a lot of,
than the way that a lot of other web platforms work
in the sense that just like we're trying to embody a model of soft security
where, like, the way to make a system secure and safe
is like to limit the damage that someone can do
without constraining the things that they can do.
So it's like in the same way that Wikipedia allows any anonymous user to edit the site.
The reason why that isn't a catastrophe is because there's like abundant means of preserving history,
of monitoring changes and making, like just making and discussing these changes.
So. And also undoing is easier than doing.
Exactly.
So we have, this is work-in-progress stuff, but it is the case that we have,
like, wiki-like edit logs and history that can be rolled back and stuff like that.
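
To make the soft security idea concrete, here is a minimal sketch of an append-only edit history where undoing is just another logged edit. This is a hypothetical illustration, not sciop's implementation; the names `DatasetRecord`, `Revision`, and `revert_to` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    author: str
    fields: dict  # full snapshot of the record after this edit


@dataclass
class DatasetRecord:
    """Hypothetical record type: every change is appended, nothing is overwritten."""
    history: list = field(default_factory=list)

    @property
    def current(self) -> dict:
        return self.history[-1].fields if self.history else {}

    def edit(self, author: str, **changes) -> None:
        snapshot = dict(self.current)
        snapshot.update(changes)
        self.history.append(Revision(author, snapshot))

    def revert_to(self, index: int, author: str) -> None:
        # A rollback is itself a new revision, so the bad edit stays auditable.
        self.history.append(Revision(author, dict(self.history[index].fields)))


if __name__ == "__main__":
    record = DatasetRecord()
    record.edit("alice", title="Climate data", tags=["climate"])
    record.edit("mallory", title="spam")
    record.revert_to(0, author="moderator")
    print(record.current)
    print(len(record.history), "revisions kept")
```

Because the vandalized revision is kept rather than erased, the damage anyone can do is bounded by how quickly someone notices, which is the property the Wikipedia comparison is pointing at.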
But in the meantime, just like the, you go on the site, you make an account, you can already
create a dataset. You can already create an upload without being given permissions yet.
So you make them, they're just not visible to anyone but you and the administrators.
So just like, this thing is done. The dataset is described, it has a description, it's got
metadata, it's got a title and everything like that. Like, this is ready.
for someone else to find it.
And then the upload part of that is,
I put a torrent on the website,
and here it is.
Then we'll review that and then approve it.
And then that's the whole story.
Then basically you just need to seed it.
And so, like, there's this initial period where if you make,
so we have tools for making torrents that include adding webseeds
and indeed making torrents from webseeds.
Like, again, our beloved triple shrimp,
prolific archivist, doesn't have the biggest bandwidth in the world
and doesn't have the most hard drives in the world.
So something that they will do is they'll make Webseed-only torrents that are just like,
literally just like download data, hash it, and then delete it.
Download data, hash it, delete it.
Just create the torrent file that has a reference to where the data come from,
and that's it.
So they won't even be a seed in the initial swarm,
but other people can download it just by the webseed only torrent.
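
The download, hash, delete loop can be sketched roughly as follows. This is not the sciop tooling, just a minimal illustration assuming a single-file BitTorrent v1 torrent with a `url-list` webseed entry; the function names, the example URL, and the 256 KiB default piece length are assumptions for the example.

```python
import hashlib
from typing import Iterable, Iterator


def piece_hashes(chunks: Iterable[bytes], piece_length: int) -> Iterator[bytes]:
    """Yield the SHA-1 digest of each fixed-size piece of a byte stream.

    Only about one piece is buffered at a time, so the data never has to be
    stored whole: download a bit, hash it, throw it away.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= piece_length:
            yield hashlib.sha1(bytes(buffer[:piece_length])).digest()
            del buffer[:piece_length]
    if buffer:  # final, shorter piece
        yield hashlib.sha1(bytes(buffer)).digest()


def bencode(value) -> bytes:
    """Minimal bencoder for ints, strings, bytes, lists, and string-keyed dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted((k.encode("utf-8"), v) for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def webseed_torrent(name: str, url: str, chunks: Iterable[bytes],
                    piece_length: int = 262144) -> bytes:
    """Build torrent metainfo whose only initial source is an HTTP webseed."""
    total = 0

    def counted() -> Iterator[bytes]:
        nonlocal total
        for chunk in chunks:
            total += len(chunk)
            yield chunk

    pieces = b"".join(piece_hashes(counted(), piece_length))
    info = {"name": name, "length": total,
            "piece length": piece_length, "pieces": pieces}
    return bencode({"info": info, "url-list": [url]})


if __name__ == "__main__":
    # Stand-in for a streamed HTTP download; the URL is hypothetical.
    fake_download = [b"abcd", b"efgh", b"ij"]
    torrent = webseed_torrent("x.bin", "https://example.org/x.bin", fake_download, 4)
    print(len(torrent), "bytes of metainfo, no payload data kept")
```

The resulting file holds only hashes and a pointer to the HTTP source, which is why the original server can act as the first seed while the person who made the torrent stores nothing.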
And so, like, that's the last part of it is, like, then if the data comes from something,
like some HTTP server and that can serve as the initial seed, great,
otherwise you just stay online and seed the thing until, like, other people come along and seed it,
and that's the participatory nature of, you know, group archiving:
other people, like, stuff exists and will be preserved and seeded to the degree that other people
believe that it should be seeded, backed up, archived and stuff like that. So, like, that's the
means of that: you put it on there, announce to everyone, hey, help me back this up. And then the people
that will do that will show up and do it. I think that's like the one thing that we've seen
quite a lot, because that one aspect is really unintuitive, is that we've had a few cases
where someone's scraped, downloaded a dataset,
they've created a torrent,
they've uploaded the torrent,
it's been published on the site,
and no one's ever been able to download the data
because they didn't realize up front
that they needed to keep seeding it long enough
for someone else to have a copy.
They thought of it as a repository.
Yes.
So we put notices on the upload form
and stuff like that that is like,
a torrent contains no data.
It is not a zip file.
You need to stay online and seed the data
in order to make it available to other people.
But, like, yeah, it's just an informational barrier
that is a very understandable one
because nothing else that you experience on the web works like this.
And so, like, yeah, that's a real issue
as far as making it accessible,
making it understandable what the system does
and how it works without overburdening someone
with a big, long lecture about how BitTorrent works.
I'm just thinking we need to start,
like, librarypunk academy, where we just have workshops like the Software
Carpentries, where it's just like, how to BitTorrent, what is a BitTorrent, and then we just
start posting. Because part of me was just like, because I'm a Carpentries instructor, part
of me is just like, oh, we should just have a Library Carpentries for BitTorrent. And probably
there should be Software Carpentries for BitTorrent. I don't think there is, but this is the
whole point of Carpentries, is, like, to teach academics the computer skills they need. BitTorrent
should be on there. Yeah, very much.
Like, I have on my stupidly long to-do list to submit a Programming Historian lesson on BitTorrent and a Turing Way chapter on BitTorrent.
Yeah, and we should make those, you know, if you build it in the Carpentries, or if you have it and you want me to do it, I mean, like, turning it into, like, a Carpentries style so we can put it on the Carpentries, or just have other people teach it in the Carpentries style.
I mean, that'd be great. I'd be happy to work with that.
See, this is all the shit that I'm terrible at
and that's like why I need other people in the world
with different focuses, skill sets, and expertise.
I'm terrible in this.
And so just like, if y'all folks are good at making educational materials happen,
that's like very much.
That's another way to get involved is like do the things that we are terrible at.
And what we are terrible at is teaching people how to do stuff,
writing down what we've done, et cetera.
I'm literally adding into the notes
how to contribute to Sciop
and it's like create account, create a torrent,
find a link for how to create a torrent.
That's a note to myself.
Upload. Get permission so that they are visible to others.
Seed.
Five bullet points of how you can get started
if you want to do it right now
and create a torrent to some data that you want.
We have a quick start
docs page.
Great, I'll have that too.
Yeah, but
you know, like all our docs are janky and incomplete.
I mean, it happens.
That's how all good documentation starts, right?
And if someone, if a listener's thinking, what can I contribute to this,
that sort of the documentation stuff, the outreach and external comms,
those kinds of things are things that we're sort of lacking a little bit in our core team at the moment.
And like this is like, because this project, you know,
lives in so many social systems.
Like the folks that are coming that are not from open source world,
like one cultural lesson from open source is like,
tell us our shit sucks.
We like it when you do that.
Like that's,
you know,
like if you are trying to do something and it doesn't make any sense,
it's a compliment to me and a compliment to other people
that you tell me this sucks and I hate it.
Because that means you care enough about something
to tell me that it's broken and it should be fixed.
So, like, don't be shy and say, well, they talk for, like, you know, four hours or whatever it is about all of these, like, lofty ambitions, but then I try and go here and it sucks. Like, yeah, I know it sucks. Like, you're not going to surprise me by telling me that, and you're not going to offend me, because I already know that. Like, I'm way ahead of you. I'm thinking everything I do is terrible. But, like, it's actually very helpful when people say the things that don't work for them. So, like, we have documentation
about how to make torrents and how to use the site and everything like that.
But, like, if it's a maze to you and you can't find it, or, like, this doesn't make any sense
to me, or I've never touched a terminal in my life or whatever, then, like, yeah,
your input is needed.
And I'm just thinking about Microsoft documentation and how much of it is just like absolute
bullshit labyrinthine sort of find the one line in the one article about, oh yeah, by the way,
you can't do it if you don't have this very specific permission.
And I'm like, yeah, I could test documentation
for you guys.
We'd love to have that.
Also, yeah.
And like,
OCLC's documentation,
they put everything on different pages
so you can't just control F through the documentation.
And everything's on a different website.
And genuinely,
I think the best use case for Google adding
their stupid AI summary on top of everything
is because,
one, OCLC's web presence sucks on Google.
But two,
like, it will find the actual right document page
because it's scraping all of them to pull the data
and it will at least get me to the part of the OCLC documentation I need
because luckily all of OCLC's documentation is on the open web,
which is not true of all library vendors.
I'm pretty sure Alma isn't.
So when I change jobs to Alma, that's going to go away.
Good luck, Godspeed.
That'll be fun.
How is it that so many of these library vendors are so awful at Discovery?
Like, they literally try to sell us discovery platforms.
It makes no sense. I mean, genuinely, I think OCLC knows their discovery platform called Discovery sucks. You can
email them a problem with Discovery and they go, yeah, it just does that. They'll literally say,
yeah, it just does that. Amazing. As though they have no, no agency whatsoever over it. It is its own
independent entity. That just reminds me of a time I emailed some vendor to be like, hey, does this
product do this very specific thing that is incredibly useful to do for a, God, what is it?
Like a computer reservation program. So like, so patrons can reserve computers. And they were like,
oh, yeah, it doesn't do that. It would be nice if it did, though, right? But that's kind of too hard for us right now.
And I wanted to be like, where is the thousands of dollars?
Yeah. But I'm just like, you don't even have, like, a feature list? Like, these are the
things you can submit to vote on.
Like, or like, yeah.
Anyway, I just, I always think about that and just being like, okay, well, that's, that's a you problem.
Maybe.
Can you help me out?
This is why, no matter how bad open source community shit gets, I'm always like,
remember the alternative.
Yeah.
Which is that this is all intellectual property and no one can fix it.
And no one will fix it.
And they will continue to fire their staff who can fix it.
It's like, yeah, one of the things I love about being, like, an open source programmer is just, like, I always have no resources. And so I can always say, we'd love to do that, but I've got no money. And so, like, if I ever got into the situation where I actually had resources and had to be responsive to, you know, my, whatever, share... what do you call them? Your customers and patrons.
Stakeholders, customers.
That would be just like, God, I actually have to do stuff because I actually should be able to do this, you know.
This is why we don't have a Patreon, because then people would be like, why are your transcripts not so good? I'm like, someone can do them for free, but I can't spend six hours every episode doing them. If we had a Patreon, I would feel, like, morally obligated to do, like, manual transcripts, even though all of our transcripts are online on podscripts.co now. I don't know how long that website's going to exist, but they do it for free. So if you have a podcast, submit it to them. You can just submit it. I just submitted it;
we weren't already on there.
Submit to it, they will do all of your episodes.
They will download them and do a pretty good transcript.
But I have been transcribing ours with the speaker.
It won't have the speaker identified, obviously.
But all the words will be there and you will have a searchable transcript.
And of all the like bullshit kind of website startups, I'm like, hell, this is a good service.
So go check them out.
You don't have to pay money to search one transcript.
But if you want to search multiple podcasts, then they charge you money.
but you know what, that's good service.
I mean, a lot of these bullshit startups start up with a good idea and implement it well.
And it's only when, like, the we must grow and continue to grow and number must continue to go up at all costs that it all starts to go south.
It's like we've talked about this internally with like the safeguarding group just like, how could we like get money to do this?
And it was like, what could, like, what could our service be?
Like, what could we, like, what's actually like the thing that we could sell that, like,
it's like, the ideal case is like, if you want to do something for free, it's just sort of like, yeah, we do the thing that we do for free.
And then there's some ethical freemium upsell deluxe package or whatever that doesn't compromise the core, you know, not getting into whether that's possible or whether that always, you know, undermines its own.
Sure, but you got to think about it.
But, like, in our case, it's like, we could sell network congestion as a service.
Like, that's the main thing we produce.
We've got too much bandwidth.
We'll use it all up for you.
Don't worry.
We'll make it our problem.
Right.
And it's been, like, it's been one of the biggest shames to see.
So, like, like, hard drives being cheap was just a true thing for, like, most of the
contemporary history of computing.
And like, it's like you can tell when, when hard drives are scarce, like something is going
really wrong in the world.
And so it's like, you know, the last time before this one was, again, this is like, coming
back to Bram Cohen's grift company and stuff like that, like when they launched Chia and
turned proving that you have wasted a ton of hard drive space into a currency,
and there was a huge run on hard drives
that then became a bunch of e-waste.
That was like something is sick in the world.
And now it's like, now the hard drives are all gone
because you know why.
And like, yes, indeed, again,
something is very sick in the world.
And so like go out and get more hard drives,
but not to poison the ecosystem
and to, you know, store
like the hallucinated text
that you then cannibalize your models on,
but go get hard drives to make sure that we can continue to have climate data.
I've got very into scouring eBay for secondhand hard drives last year.
Yeah, a lot of things. What I did most recently is I bought a 12
terabyte non-solid-state drive, like an old server hard drive that would just be in a rack,
and just wiped it and used it, because I needed to do a full backup of my other
storage hard drives that are like two and three terabytes. So I'm like, I need a big one. And I was just
talking with someone about like hard drives, another person who works in IT. And just the, just any kind of
hard drives are shooting up in price. And it's kind of like, do you wait it out or do you buy another 12
terabyte drive right now and find, you know, find a used one? And everyone yelled at me for buying a used
hard drive. They're like, that's a security risk. I'm like, I'm going to reformat it. It's fine.
Wait, security risk to you?
Yeah, to who?
Yeah, because I, I don't know, because I told him I had to reformat the drive and fully wipe it.
And they're like, but there could be stuff on it.
Sure, yeah.
Security risk.
That's their problem.
That's what is it.
That's what nuke and boot is for, or Boot and Nuke is for, right?
Yeah, well, they were saying that me reformatting it was not enough to get rid of any risks on the hard drive.
Sure, yeah, that's, but like, you're not going to go and execute any latent binary
that you find in, like, the nether regions of the hard drive, are you?
Like, that's, yeah.
I know, it's a storage drive.
It just literally is like I point my torrents to it.
And it's like, that's download there.
This is, I would say, of the things that like the sciop chat is concerned with,
how to make giant arrays of hard drives out of dog shit is like one of the major points
of conversation.
I have this, like, set of images that I use in all of my
posters now, where, like, we're just trying to make the argument. Like, when I go to
conferences and stuff like that, I'll just, like, go and talk to the other archives there and show them, like,
this is what sciop runs on. Because, you know, AWS wants to show you this image
of just, like, squeaky clean space-age server racks that go for miles into the core of the earth
and stuff, and we have, like, a different aesthetic. Like, one of my favorite images
in the world is of a sciop seedbox rig
where it's a normal desktop case
that constitutively does not have a side on it
as far as I'm aware.
But they ran out of hard drive bays
and so what they did is they took yellow tape
and like taped a sling
underneath the hard drive bays.
So they just sort of, like,
screwed through the tape,
screwed the tape into the drive bays that exist,
and then now the tape hangs down,
and then had a bunch of drives stacked slung in the tape,
each of them screwed through the tape,
and so I've got to just show you the image of these four hard drives
just like sloughing off into the loose space of the computer case
in a tape sling.
And it's like, so now it's like whenever people are saying like,
How do I, what, like, physical enclosure should I use for my drives?
It's always like yellow tape sling, obviously.
Like, that's the most, like, most efficient.
The hard drive hammock.
Yeah, exactly.
And it's functional, too.
It's amazing how much it reduces the vibration transmission to your desk or floor.
So the whole, the whole rig is much quieter.
Was this rig yours, Janice?
No, no, no.
I feel bad, like, we should be crediting folks, but it's also, yeah,
Like I said before, we don't know how much people want to be known for stuff.
But yeah, like...
Well, if you can get me that photo, it's going to be the photo for the episode.
Hell yeah, I'll get it to you.
I've got it on.
So I set a reminder in Discord so you can get it later.
When you were talking about the coordinating of scraping of targets,
I kind of heard that you were working on some tooling for that so that you can automate it.
I'm curious about that if we have time.
So...
Yeah.
this warrior.
So,
ill-determined name.
So there's this group, Archive Team, that's, like, loosely affiliated with
archive.org.
And they have this thing called the Archive Team Warrior.
And what that is, is, like, a Docker service that you can run. So
Archive Team does this exact thing of distributing scrapes for at-risk stuff.
And so when they have a scrape project, you'll go and run the
Archive Team Warrior for that scrape project. And then what that will do is,
they have a list of URLs to be scraped. You'll go claim a set of those, say,
I'll get these ones. You scrape it, submit the data, and then say, this URL list is done.
And you also probably will return additional URLs that you encountered along the way.
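A minimal sketch of the claim, scrape, report loop described here, in Python. It is not Archive Team's or SciOp's actual code: the `Tracker` class, `run_worker`, and the `fake_scrape` helper are hypothetical names, the tracker lives in memory instead of on a server, and "scraping" is simulated by looking up links in a small dictionary.

```python
from collections import deque


class Tracker:
    """Hypothetical in-memory stand-in for a scrape coordinator."""

    def __init__(self, seed_urls):
        self.pending = deque(seed_urls)  # URLs nobody has claimed yet
        self.claimed = {}                # url -> worker that claimed it
        self.done = set()                # URLs reported as finished

    def claim(self, worker, batch_size):
        """Hand out up to batch_size pending URLs to one worker."""
        batch = []
        while self.pending and len(batch) < batch_size:
            url = self.pending.popleft()
            self.claimed[url] = worker
            batch.append(url)
        return batch

    def report(self, worker, finished, discovered):
        """Mark URLs done and queue any newly discovered ones."""
        for url in finished:
            if self.claimed.get(url) == worker:
                del self.claimed[url]
                self.done.add(url)
        for url in discovered:
            known = url in self.done or url in self.claimed or url in self.pending
            if not known:
                self.pending.append(url)


def fake_scrape(url, link_graph):
    """Pretend to fetch a page; return the links found on it."""
    return link_graph.get(url, [])


def run_worker(tracker, worker, link_graph, batch_size=2):
    """Claim, scrape, and report until the tracker has nothing left."""
    while True:
        batch = tracker.claim(worker, batch_size)
        if not batch:
            break
        discovered = []
        for url in batch:
            discovered.extend(fake_scrape(url, link_graph))
        tracker.report(worker, batch, discovered)
```

Starting from a single root URL, one worker ends up covering every page reachable from it, because each report feeds newly seen URLs back into the pending queue. A real coordinator would also need timeouts so that claims from workers who disappear get released.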
So we've done something similar to that. And
that's a well-defined problem for, like, web scraping,
like HTTP site scraping,
but there are lots of data sets that we
encounter that are, like, a big
S3 bucket that has
logical divisions
in it, like, you know,
this data set over a year.
I'm thinking the main thing we designed this for
was, like, the Chronicling America
dataset. And so this is like
a set that is released
as bundles as like, what are they,
what's the term did they use for each of the things?
I forget, but they have like fancy names
and shit like that.
Like, they're all named after, like, sci-fi characters.
Let me see if I can, yeah, Chronicling America.
So, you know, Chronicling America is a project
across a bunch of different libraries.
And so they'll be like, the library abbreviation hyphen Falcor, like, DLC,
and IUNA underscore EGRIT.
And so they're, you know, little codenames that are batches of the
data. And I didn't give any introduction to what Chronicling America is. It's an archive of all of, like, the local newspapers from the United States through its history. So what we did is we made a thing where that is the unit of division. So we have a concept on SciOp called a data set, and that is just the abstract description of something that exists out there. And then a data set will have data set parts, and in this case, these are the batches that Chronicling America will release.
So we made a SciOp team scraper, the SciOp Team Warrior, like,
the worst name of all time.
That is a similar thing, where you run it for a given project, in this case
the Chronicling America thing, like, go and get me a batch of this.
And so it will hit SciOp and say, I've claimed this one,
I'm going to get this one.
You'll do a scrape, you'll create the torrent, upload the torrent, and then say that one is done.
And so we want to generalize that, because currently it's pretty labor intensive to create, like, a scrape task.
We called them quests.
This terminology is so terrible.
But, like, make it so that it's possible for someone to say, hey, over here, there's a data set that needs to be scraped, and say, here are the subdivisions of it
that people can go grab. And maybe we should, again, automate the process of discovery. Like,
the way that Archive Team and, well, general recursive web scraping works is you start at some
root URL, and then in the process of scraping you discover all of the other URLs that you need
to be getting. And so we need to make tooling that will allow that process of, over here
there's a data set that needs to be gotten, here are its pieces, and to go and help get it,
run this command that is, you know,
sciop-scraping, name of project.
And then it'll automatically handle the,
go and grab the data,
create the torrent, upload it.
And it even handles, like,
adding it to your torrent client,
so it, like, auto-seeds for you.
And so, like, that's, like, the tooling that we're working on.
And the goal of that is, again,
like what we were mentioning about,
like, the distributed archive.org,
like what that could look like,
where, right now,
when I go to the Wayback Machine
and archive a
URL, what that does is it sends archive.org servers to go and grab the thing, archive it,
and scrape it for you. Instead, make it so that you can do that, and anyone can do
that. So if I'm running my version of the, you know, local scraper and I want to grab something,
I can just do that, put it in the thing, go grab URL, and it dispatches that out to my
scraper. Or we could have, like, a standing army, a pro-social botnet, as I have termed it.
It's people that are saying, when there is scraping to be done, you can use these extra
resources on my computer. Like, I allocate X amount of memory, bandwidth, storage, whatever. And
I will go grab that website for you, create a signed copy of it, and make an upload of it
for you. So you have the scraping part distributed. And then again, once we figure out, with
the lovely folks over at Browsertrix and Webrecorder, how to get the BitTorrent-backed web view,
then the goal of that will be to basically replicate what the Wayback Machine does,
where you go to sciop.net slash URL, whatever prefix that is,
and then, in the way that you do this with, well, 12ft.io
used to work like this, and those, like, paywall things,
you just put the URL after that prefix, and you can go be served some BitTorrent-backed archive
of that web page.
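A small Python sketch of the prefix idea, assuming a hypothetical `/url/` route. No such route is confirmed to exist on sciop.net; the `PREFIX` value and the function name `target_from_request` are made up for illustration. It only shows how a server could recover the archived page's address from the request URL.

```python
from urllib.parse import urlsplit

# Hypothetical route prefix; the real prefix, if any, is undecided.
PREFIX = "/url/"


def target_from_request(request_url):
    """Return the URL that follows PREFIX, or None if there isn't one."""
    parts = urlsplit(request_url)
    if not parts.path.startswith(PREFIX):
        return None
    target = parts.path[len(PREFIX):]
    # The target's own query string is parsed as the request's query,
    # so it has to be re-attached.
    if parts.query:
        target += "?" + parts.query
    # Only accept web addresses as lookup keys.
    if not target.startswith(("http://", "https://")):
        return None
    return target
```

The returned string would then be the lookup key for whichever torrent-backed capture of that page is available.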
So that's, like, an actual goal.
But it's still very, it was, like most things in SciOp, done out of necessity and trying to respond to some immediate need, with the idea of generalizing it later. That's sort of the status of that distributed scraping project: there are a couple of projects that we needed that for. So Chronicling America was the big one. But then there's, like, an NPR scrape, a Department of the Interior scrape, I think some Department of Justice scrapes that are in there. But we want to, in the future, generalize that into
something that's much more possible.
Because that's ultimately the question, it's like, how do I get involved and trying to bring
that barrier down as much as possible to the point where you can distribute labor between
people who are like scouts that know about stuff that's at risk but don't have a big
scraping system and people who do have that but don't know about what they need to be going
out and scraping.
So that's like.
So you would just kind of run an EXE on your computer that once in a while would
get a task to scrape, and it would scrape, download, and upload for you? Or is this all in the browser
instead of a separate EXE file? At the moment, it's a program that you're running on your
machine, not in the browser, but, like, that's with the magic of...
When you started talking about the distributed archive or Wayback Machine,
I got confused as to, like, if that was part of it. That is, yeah. So, ideally, a lot of it would be
triggerable and
browsable and interactable
from the browser
but like some parts
some things browsers can't do
and that's good
like browsers can't access
the file system
we don't want them to
like so
like there's just like some
anyway
that like I said
it's very much in a prototype stage
now and so
yeah we'll come back to it
in the future
when it's fleshed out
and we'll do a different episode
yeah but that's the dream
okay yeah it's good
it's good if there's someone
who has expertise and would like to reach out to the SciOp team to join as a member. Is that something
they could do? Totally. You can hit me up on the Fediverse, like, I don't know, we put my
handles in the description, but then also we have an email that we respond to. It's a group email.
So that is contact at, oh God, I need to get this right. Contact at
safeguar.de. Is that right? So, that's right. Contact at
S-A-F-E-G-U-A-R dot D-E.
We're like a...
I should just put it in show notes.
I'm doing my usual thing of stuffing links and things in the chat, but like the actual show notes.
Cool URLs are our downfall.
We like, it would be easier to just say safeguard.com, something that could be said out loud, but having the cool...
You can email us.
Anyway, I was telling a professor about PlumX one time, which is an Elsevier thing,
and he says, yeah, I saw the Mexico site, but it's not anything.
I'm like, no, it's not a Mexico site.
Yeah.
It's, it's, it's, it's not a Mexican project.
It's, it's, it's actually Dutch now.
Like half of publishing ecosystem.
And, like, so we will set up a public chat, a Zulip chat, at some point.
We just like have yet to do that.
But yeah, so if someone has expertise, like,
email us, or just DM me on the Fediverse, whatever,
and then we'll get you set up.
But in the future, we'll actually have, like, a reasonable public chat.
We have a Discourse forum.
It's neglected.
Don't direct people to the Discourse forum.
No one's moderating it right now.
But I don't think it's become overrun with spambots quite yet.
I actually just don't know.
It's like, Discourse is something that I want to love because I love forums,
but, like, I have never been able to, like, actually,
reliably use them
because I find that sort of awkward.
C4 mentioned
out about ATT.
Yeah. All right. So I think
we're good to wrap. We can always do more
episodes in the future as the project comes along.
And then, so
I will put your contact info, any of the contact info
you want to save, we'll go into show notes.
And the tools we mentioned,
I've been making a list of, the media
mentioned, and then just a quick, like,
how to contribute to SIOP.
Was there any, like, absolute last thing you wanted to
cover, Sadie? No. Okay. All right. Well, thanks so much for coming on. I've been really looking forward
to this one. That's why I let it go long because I really wanted to talk about this for a while.
Yeah, glad to chat about it. It's like, you know, labor of love. It's something that like I
will be doing for the indefinite future in my life because it turns out like this is what I was
put on earth to do is apparently be trapped and entangled with BitTorrent in different ways throughout
my life. But yeah, I'm glad to be able to spread the good word. And, like, we love our librarians. The librarian audience of librarypunk is, like, the core group of people who we would love to know and have better contact with, but have been doing a bad job of reaching out to and communicating with. So we appreciate, also, just the perspective of thinking about it as a political and as a social problem. It's a perfect
match between these two universes that we live in. So I'm glad to be able to come on and
and chat with you about it. Great. All right. Good night.
