librarypunk - 160 - SciOp.net feat. Jonny and Jez (part 1)

Starting point is 00:00:26 I'm Justin. I have a new job, so I forgot what it is, and my pronouns are he and they. I'm Sadie. I work IT at a public library, and my pronouns are they/them. I'm Jay. I'm a cataloguing librarian. I'm so sick. I'm so sorry. And my pronouns are he/him. And we have guests. Would you like to introduce yourselves? Yes. I am Jonny Saunders. I'm a postdoc at UCLA. I'm also now the newly appointed CEO of PureTech global, cyberindustrial concern LLC, Shady Delaware LLC, which is my lifelong dream, and one of the folks who works on this project, SciOp, that we'll be talking about. And I'm Jez Cope. My pronouns are he/him. I currently work at the British Library for the next

Starting point is 00:01:18 like three weeks until I start a new job elsewhere, doing research data stuff. And I have had the pleasure of being on this project with Jonny for some time. Quiet cheer, a quiet sensible cheer. It was maxed out. It just doesn't go any louder than that. Welcome. I've been wanting to get this episode

Starting point is 00:01:43 going for a while, ever since I've watched it kind of from the start, I guess. I think you started SciOp last, what, year or two? Oh, Lordy. Yeah, last, well, it was last January, and I think we got it online last March. And, and

Starting point is 00:01:59 we've been slowly hacking away at it since then. Oh, so we've not missed the opportunity to have a big birthday celebration. I don't think so. And I mean, also, you know, we get to do whatever we want. We can say we don't have any sort of like formal analytics or anything like that. So numbers are meaningless and we get to make them up. Yeah, I threw SciOp into the, like, "what's it built in?" And it was just like, I don't know, third party analytics stuff.

Starting point is 00:02:29 I don't know what it's built in. It's like all the code's just in your Python and your GitHub, right? Yep, the site. Yep. It's like we use none of the stack. Like, there's not like any AWS in front of us.

Starting point is 00:02:46 There's not any Cloudflare in there either. It's just like raw dog Python running on a random Icelandic VPS that they keep giving us for free. Well, shout out to FlokiNET. They're like very legit web hosts, but also it just feels like, I don't know, we stole a little bit of computer space and are running a BitTorrent tracker on it.

Starting point is 00:03:06 Yeah, I know. They did point out that you're on an Icelandic server, and I was like, it just must be something Jonny has access to through work or whatever. I don't know, who knows? No, it's just some cool folks that probably Henrik. Like, Henrik was the one that has contacts and connections. Yeah, yeah, it's just like,

Starting point is 00:03:28 We're, you know, trying to figure out how to do this thing sort of legally-ish and also, like, in a way that isn't, like, totally 100% under U.S. jurisdiction, but we also don't really know how that works. And so we're just sort of like, VPS, outside of the United States and, you know, Iceland, it sounded nice at the time. Iceland is very nice. Greenland is covered in ice, yeah. They were, like, it's very cool. We want to support you. Can we give you a free VPS? Yeah. Nice. So if you had to give an overview definition of SciOp, like what is SciOp? Oh, well, I mean, there's like an easy explanation and there's a more of a gross one. We'll go through both. Yeah, the easy explanation is SciOp is a BitTorrent tracker for at-risk, threatened, altered, or otherwise endangered public information. And the longer explanation is,

Starting point is 00:04:27 what we're trying to do. We're just like, we're working on making BitTorrent records, which usually are associated with piracy. We're not trying to necessarily distance ourselves from that lineage of, like, radical information liberation, but trying to bring it into the light

Starting point is 00:04:45 and just sort of like doing something along the lines that like previous academic BitTorrent trackers have done, but then actually trying to like integrate that into institutional as well as private, is private as in, like, privately held as in your random hard drive kicking around in your closet resources.

Starting point is 00:05:02 So trying to like take that, the power of BitTorrent and put it to use in ways similar to what, like, DDoSecrets and other, like, leakers or purveyors of large datasets have done, but then apply that to the public information that's being removed from all of the government servers.

Starting point is 00:05:23 I think that's a good start. I'm interested in the idea of like getting people back into using BitTorrent as just like a way of moving files around because, you know, I was introduced to it at a time, I don't know, I'd have been 10 or 11, where people were just like, yeah, I had no idea that there was a whole season of Futurama inside that Linux distro. Like, so it's this, I guess it fits neatly into the chronological retelling of SciOp, is we were, you know, we all knew that just like the fascists were going to be coming for the information, right? But I don't know that any of us expected the ferocity and immediacy of attacking public information sources like the day that, you know, Trump took office.

Starting point is 00:06:13 And so it was just like noticing that that was happening right now. We needed something that worked right now. And so there wasn't time to, you know, bikeshed about possible future technologies we would like to have. We need something that is like, what can we run immediately that we can put in people's hands now to share and preserve large scale data sets? Because it's like that's like the kind of thing that's at threat, are these like petabyte scale climate data sets that, just like, if we don't have them, then even if we wanted to do anything about climate change, we couldn't because we would just not have the information about where we would need to put the solar plants and so on. And so that was like the, that's like the initial impetus for using BitTorrent.

Starting point is 00:07:03 Not only do we love it, but it's also like it exists. It's been kicking for 20 plus years. And we know for a fact that just we can get someone who has never touched a computer before. Well, most people have touched a computer, but are not like computer people. And we can put something in the hands of my mom, and she can run a BitTorrent client and double-click a torrent file and it works. And so that's like we just got that going. And also like getting a torrent indexer started is like pretty simple as opposed to like building on some of the more complex peer-to-peer technologies that exist. It would take a huge amount of like technical development.

Starting point is 00:07:37 But BitTorrent works on files and it is like a very simple thing to do. And also, like, the reason why BitTorrent trackers and BitTorrent itself have been preserved is, like, it has a really nice, like, distribution of liability. Where, like, by running the BitTorrent tracker, we are not actually directly legally responsible for the content. Like, SciOp is not serving the content of the files. And that's served by the peers. All we do is say, hey, you can get this file over here somewhere else. And so it's like it serves as well as an index in case there are sort of legal, couple of. from, even though all the data that we host is public domain, people should have it. There's

Starting point is 00:08:18 nothing legally or ethically risky in there. That doesn't really matter so much to the present government. Yeah, legally, not legally risky now. Right. Yeah. And also, there's a part of me that wonders, like, whenever we're talking about computer law, like, how long is it until there are bills proposed that just say BitTorrent indexing websites must be regulated or approved? Because, of all of the major, it is an untamed part of the web because it is peer to peer in a very direct way that like

Starting point is 00:08:52 very few direct peer to peer things on the web still are. Like there are very few things like, like, I forget what it's called like offline websites, things that are just hosted from connection to connection. Like you can't even do a direct connection over IP for a video game anymore. You've got to go onto someone else's servers.

Starting point is 00:09:12 You can't just type in your friend's phone number and call their modem and play MechWarrior 3 anymore, right? Yeah. By the way, it was great back in the 56K period. You were just like, wow, it's like instant. Yeah. There's just no lag. Exactly. And like, I mean, there's good and bad reasons for that.

Starting point is 00:09:33 And so it's like, I'm sure that, like, I feel like a lot of like the, like what we are trying to do is just like would horrify Sadie, like, just like and trying to, like, run these things on like things where, like, all the IT professionals in the world are like, yeah, you shouldn't be running a BitTorrent on this here network. It's got health information in it, or at least that's the stance of

Starting point is 00:09:56 UCLA's IT. We're negotiating. It's fine. But for now, I have the unregulated undisallowed BitTorrent client in the UCLA Health Network. It's fine. It's safe. But, uh, um,

Starting point is 00:10:11 okay. But they walled us off from the rest of the network, which is good. But that's like, that's part of the overarching problem and part of the overarching background that we take place in, that just like, of the sort of like de-ownership. I don't know what to call that. Like the, just like the actual enclosure of the web part, where just like now you, people started talking about this with the rise in hard drive prices and memory and stuff like that, is like, you're not supposed to even own a computer anymore. That like, you're supposed to be using the cloud. The only thing that you should have is a phone or some other screen that allows you to access someone else's resources. So that's sort of what we're

Starting point is 00:10:56 fighting upstream against, is the overwhelming urge of liability-terrified administrators of all institutions of all kinds wanting to run to the cloud where they have someone to sue rather than actually owning any hardware and running any services themselves. So it's that precise arrangement that allowed a lot of the immediate harms that we saw. And so some of the larger harms to come when the shakedown of all of the tech giants reaches its next phase and they start really leaning on AWS to start, you know, "I see you're over there hosting this open data set" or "I see you're over there like promoting this social service. You need to take this shit off of your network or else," you know, shake the stick of fascism at you and so on.

Starting point is 00:11:47 I think that's kind of one of the things that we saw right from the start of the Trump administration last year as well is that like, there was not only stuff that was being directly targeted and directly taken down by diktat because it contained words like female or whatever the thing was. But also there was a lot of stuff that went offline because people had to cancel contracts because their funding was suddenly under question and because, again, they didn't have the stuff themselves. Right. And so like that's sort of like the background assumption that a lot of scientific, academic, and other, like, publicly funded or institutionally funded archives made, that, like, we assume that this country, as its government, as its people, as, like,

Starting point is 00:12:49 the funding mechanisms and power distribution that exists, like, generally think that it's good to know things and generally think it's good to, like, have information, period. And that changed overnight, that just like the general agreement that like we should continue to do information just, like, stopped being true. And so you know, you're an organization. You don't have all the resources in the world. You're not like planning for every adversarial contingency like this kind of a contingency, that now you have all of your archival materials and what do you do with them? You haven't made plans for making sure that they exist in a ton of different places, that just like, if we were to go down, what is our succession plan? Because we weren't planning to have our archive go down or taken down. So that's like sort of like where we step in, and just sort of like, not trying to say like we're trying to solve the world or like that, you know, the blah blah blah blah, the rest of only us and we are the true answer. That's not what I'm trying to say. Just like trying to step in, it's like, okay, in an emergency situation, what are the possible things that someone could do to preserve these large-scale

Starting point is 00:14:01 data sets, and naturally BitTorrent came to mind. For many reasons, at least it would just like my personal history, having a lifelong love affair watching the torrent window, but also just like looking around and seeing the possible things that we could do. And it just works. We regularly get people saying, have you considered IPFS? Have you looked at that? Have you considered using Filecoin?

Starting point is 00:14:25 and like, none of those are at a point where they like just work in the same way that BitTorrent just allowed us to start doing stuff and just worked. Right. It's like every time we have taken any sort of step into like fancier technology, it has just brought in like 10,000 more like complications than it's worth. Even just trying to use BitTorrent v2, which is a spec that is now like 10 years old or something like that, has proven to be a monumental technical challenge. And so just like, we're not trying to be finicky and mess with the technology. We're trying to make an archive happen. And so like just we're trying to like at this point, the plan is to make sure it's rock solid,

Starting point is 00:15:22 get the foundation in, and then we can try experimenting with more, you know, fabulous technologies. But for now, we're just like at the stage of make sure the basics function. Yeah, I remember in 2015, 2016, the first wave, sort of my introduction to this kind of work, which was people doing archiving of government websites. It was a lot, lot more labor intensive than what I've seen this time around. A lot of people was like, okay, we're going to have three categories of volunteers. You're going to run BagIt. You're going to upload.

Starting point is 00:16:01 You're going to help with outreach. And it was incredibly like coordinated, but it was also very strange. And also, I wasn't sure where everything was going to live. I was dealing with it just at my work because I was in charge. of all the health websites. So obviously overnight, all of those health websites broke because whenever a new administration comes in,

Starting point is 00:16:24 they change which department is going to handle which things. Their websites break. They take down studies. They add new studies. This is the kind of breaking that our government does to its own data infrastructure. Every new administration that I think is kind of inevitable. And I find it, so I was kind of like,

Starting point is 00:16:43 where has everyone been when it happened this time? I was like, it kind of did happen eight years before. But I did notice that people were way, way better at it this time around. Like, I knew where the data went. I knew, like, I understand you were using BitTorrent. I understood what Internet Archive was doing. I understood, like, what other web archives were doing this time around, right? Whereas the last time I was like, I'd learn what BagIt was.

Starting point is 00:17:08 I didn't even know about it. And I went to library school. And it's like, that's something like only a digital archivist in a library would know about. So not very many people know about this thing. So I like the idea that we're using like a 20-year-old technology because that's archival grade-tested technology at this point. Yeah. I like the idea of like we're looking through our toolkit

Starting point is 00:17:28 and we're going to use the file as in like an actual file. We're going to use the chisel. We're not going to use the power drill. Yeah. And I mean like if only we have any, with this project, it's like we would love to have that kind of, organization to be able to have different roles and stuff like that. That's why we, for, you know, notes for the future, we had every intention of inviting someone from the Data Rescue

Starting point is 00:17:57 Project to come along with us, but failed due to just like the chaos of all of our lives. But like Data Rescue Project is like much better at that. Like they're sort of just like the sibling organization that folks will know about because they are more organized and better at doing of like a lot of things than we are. And one of them is, is like making sure that people know what a role is, like, know how to volunteer and know how to like, you know, find their work. We've just been like, we're sort of like the scatterbrain cousin that will go in and do like some heinous web scraping that when some scrape project might elude folks among them and like

Starting point is 00:18:38 in that labor relationship. But like that's like, that's also just like part of the problem this time around too, is that just like there was so much of it that needed to be done that the usual suspects like Internet Archive and like, you know, just like the typical, like the people who always do that

Starting point is 00:18:57 probably, there was just too much more of it, that just like we needed to engage more people and there were a lot of people that wanted to, being like, how do I do this? I think like one of our, we regularly cruise the DataHoarder subreddit and just the amount of just random people that are just sort of like, I went and bought a million hard drives

Starting point is 00:19:20 and I want to go get a bunch of stuff and make sure that I have a copy of it. Like this instinct to hoard information is like, well, good. It's a good pro-social one and widely shared. And so that was initially like a lot of like when we started this project, we actually got a decent amount of pushback like saying like, I'm just going to go and get it and store this data like under my mattress basically. That just like, I want to go, like, there was someone that was actually disagreeing with us about the notion of doing this in public, like trying to make like a BitTorrent archive where, you know, because that involves peers being recognizable and online, you know. And so that like, someone was saying like that paints a target on your back. I would instead, what I'm going to do is make like a private copy, hoard it and not tell anyone so that in the future when like we might want to have it again, like I'll be able to release it. And that doesn't really work that well. And for the obvious reasons of, you're going to forget, you're going to die or lose the data. And then just like, so what you, it's like, anyway.

Starting point is 00:20:26 But that, like, then that poses an immediate problem is like, how do you coordinate all of that energy and work? And like, how do you actually make sense of everyone doing everything everywhere? And I mean, we're not original in this regard. Like, da, da, da, da, da, a web platform, like, as like a way of, like, coordinating work. It's like as old as the internet, as just sort of just like these peer production models of let's all work together in public. But like in particular, the need for not only a way of like organizing the scrapes and the backups that were done, but like organizing the scraping, like the actual act of going and getting it. Because like that was like one of the things that especially last spring was true

Starting point is 00:21:12 is that like we'd get advanced notice that some site, some data set, some archive was going down. And we'd have like, this thing is happening on Friday. And there are 100 terabytes in there and the server will only serve each individual downloader

Starting point is 00:21:30 like three megabytes a second go. You're just sort of like, how do you coordinate and de-duplicate that action of making sure that we get all of the files, but we only get them once because we don't have time to go back and get that, you know, for everyone to go and scrape their own individual copy.
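[Editor's note] A rough back-of-the-envelope sketch of why that coordination matters, using the figures mentioned above (100 terabytes, about three megabytes a second per downloader). It assumes decimal units, a sustained per-downloader rate, and a perfect split with nothing fetched twice; the function name is hypothetical and this is not SciOp code.

```python
def days_to_scrape(total_bytes: float, bytes_per_second: float, downloaders: int) -> float:
    """Days needed if the work is split evenly and no file is fetched twice."""
    seconds = total_bytes / (bytes_per_second * downloaders)
    return seconds / 86_400  # seconds in a day


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

for n in (1, 100, 1000):
    print(n, "downloaders:", round(days_to_scrape(100 * TB, 3 * MB, n), 1), "days")
```

Under these assumptions a single downloader would need more than a year, which is why a deadline of a few days only works if many people each fetch a different slice.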

Starting point is 00:21:47 And so that was like one of the, well, it's one of the underdesigned but still present parts of SciOp, is that like not only is it trying to be an archive for things that exist, but it's like a coordinating space for, you make a target, like a dataset target. Like we need to go out and get this. And then there's like facility on there for splitting it up into smaller pieces. And then we even have the beginnings of,

Starting point is 00:22:11 something maybe we'll talk about in a bit, of just like how to actually automate that, saying I have this piece, someone else go and get the next one. And so it was like how to like distribute the actual act of scraping. And so that's both like a, what is everyone, like, you know, an observability question, like what is everyone doing, what does everyone have, but also like a documentation, an education and like a resource building problem. How do we teach people how to scrape the web? How do we, like, you know, make these different technologies accessible? Like, I think some of the, like, a lot of web scraping is increasingly very accessible and easy to do. Like, thanks to some technologies from, like, you know, friends of ours over at Webrecorder, who make Browsertrix, make a wonderful thing that you can just basically plug and play.
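[Editor's note] The "I have this piece, someone else go and get the next one" idea can be sketched as a simple claim registry. This is a minimal illustration under stated assumptions, not how SciOp actually implements it: the class, method, and piece names are all hypothetical, and a real system would also need expiry for abandoned claims and verification of finished pieces.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScrapeTarget:
    """Hypothetical dataset target, split into pieces volunteers can claim."""

    pieces: list[str]
    claims: dict[str, str] = field(default_factory=dict)  # piece -> volunteer

    def claim_next(self, volunteer: str) -> Optional[str]:
        """Assign the first unclaimed piece to this volunteer, or None if none are left."""
        for piece in self.pieces:
            if piece not in self.claims:
                self.claims[piece] = volunteer
                return piece
        return None

    def unclaimed(self) -> list[str]:
        """Pieces nobody has said they will fetch yet."""
        return [p for p in self.pieces if p not in self.claims]


target = ScrapeTarget(pieces=["part-000", "part-001", "part-002"])
first = target.claim_next("volunteer-a")
second = target.claim_next("volunteer-b")
print(first, second, target.unclaimed())
```

Because each piece is handed out at most once, every file is fetched once across the group rather than once per volunteer.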

Starting point is 00:23:00 But still, there's a lot of subtlety as far as, like, what comes out the other side. As you're saying, like, what are the formats of the web archives? If you make all of this data backed up, how do you make sure it's actually usable in the future? And these are all ongoing technical and social questions that we're sort of engaging with. I think it was very interesting that we were starting this all up at the same time as the internet was just going completely wild with scrapers and very badly written scrapers from everyone and their dog with a tech startup being like, we're going to get our own and train our own AI and be the next OpenAI. And so we were immediately sort of fighting all of the countermeasures that have been put in place against that.

Starting point is 00:23:49 Yeah. It's awkward. Web scraping used to be cool and edgy. And it's like I miss that part of, you know, that era of web culture where like, I mean, the value shift where, I mean, it's, like, reflected in, like, copyright law and our understanding of copyright and piracy, is like consumption used to not be valuable. Like, you know, that was not the valuable part.

Starting point is 00:24:15 The valuable part was in distribution. And so when it became disgustingly profitable to just consume things, like that like shifted the landscape as far as like, well, scraping is not really great because before it'd be like, yeah, you'd get a web scraper, but you weren't really concerned with like the impact on your servers because it would be relatively minor,

Starting point is 00:24:39 you were concerned with them having your data. And like, what are they going to do with it? Oh, they're going to have a private copy of it. Oh, no. And that means they won't have to come to the website anymore. Like, that was the biggest problem. And so it's a real shame that like, now when we say we're doing a scrape,

Starting point is 00:24:56 it's like, no, no, no, but not like that kind. Like, you know, not like, you know, that like scraping these sites to make them available, like, not to hoard them and train a private model and like convert other people's labor into profit for us. It's like actually to decrease, like, and so does like a lot of like what we're doing as far as like coordinating scraping goes and making, sort of making these BitTorrent-backed archives of things. It's like one of the things that is a longer term goal when we talk about like the next step beyond like the emergency response to a bunch of data disappearing is like, what does a longer term archive look like? And it's like, longer term, we want to decrease the burden on these institutions and decrease the hosting, like, costs for them by supplementing them with a lot of other people's resources and bandwidth. We're not trying to be the next scraper that's hitting your page 100 times a minute just so it can get the fresh text

Starting point is 00:25:55 to eat and then like fuck off into its own oblivion. So it's a real shame that the culture has shifted in that way. But yeah, we do have to perpetuate. I wish there was a way to differentiate pro-social scraping from the kind of stuff that all of the tech giants are doing. No, no, no, we're the nice ones. Yeah, no, we're good. Trust me, I'm good because I say I'm good.

Starting point is 00:26:20 Leave it to the tech giants to take something good and pro-social and somehow make it into a profit-churning bullshit, so, yeah. Yep, it's what they do best. I think that's good enough for people to understand kind of what SciOp's doing. I want to talk about the people more because we're talking about pro-social stuff.

Starting point is 00:26:40 So I guess focused on the people, like, who's working on SciOp now, as much as you want to, you know, I know people, who's working on it now? And then who do you want to come in and like what do you want people to do with SciOp? Like, you know, you're like I mentioned earlier, I don't know if it was on this recording because we had some recording issues. But, you know, your website is available on GitHub for people to copy to make their own SciOp website. So, like, what do you want people to do with it? So that's talking about people. Yeah, the people, so I think most of the folks involved are pseudonymous in some way. I think so.

Starting point is 00:27:16 I don't necessarily, it's like torn between giving credit and also not doxing people. You know, we're like, and so like we, we, we initial, so the named folks in here are like, We'd sort of core of us, I guess, nucleated pretty quickly around Henrik, who is taking a bit of a break now, was like the dynamo of a human being that he is, sort of like brought a bunch of folks together. And then the other folks that are named

Starting point is 00:27:49 and on public documents associated with SciOp are like Will Wades is another person who's like helped us out greatly as far as like connecting us with old web, you know, whenever everyone's like, we need to like contact some, you know,

Starting point is 00:28:07 someone who knows about the deep lore about the internet. It's sort of just like, well, we'll just immediately know someone. Anyway, I don't want to necessarily like list off, like everyone.

Starting point is 00:28:14 I feel like we should give like a list of credits or something like that in the show notes because I feel bad naming people because just like that's just part of, the part of the internet that we are in is like no names, no faces. Our most prolific contributor is triple shrimp, and we have no idea who they are,

Starting point is 00:28:37 but we just know that we love them, and they are like one of the most tenacious archivists that I've ever seen before. And so, yeah, so like we have a core of us that are just like, we, you know, that we sort of know who some of the people are, but most of them we actually don't know who they are. And just like people who have gravitated towards this kind of project. And as far as like the people we want to be involved and like external people to be involved

Starting point is 00:29:06 is like the, you know, the bad answer to that is everybody, you know, like that's just like, that's like the lazy answer is like we want everybody to be involved in this. And to some degree that's true. We're just like that like that is actually something that I mean, I'm being flippant on that, but like that is something that we've sort of endeavored to make available. It's like if you want to like that's the kind of permission structure that we're trying to make is no permission structure. You don't have to wait for someone to tell you to do something.

Starting point is 00:29:36 You don't have to say, go and be in the queue and go and prove yourself. Like, do one, you know, it's like getting jumped into the scraping gang is just sort of like, go and prove to me that you can scrape the web and then we'll give you a task. It's more like we want to like put the things in people's hands because everyone else out there knows what needs to be preserved better than we do. It's sort of, that's like the people that are closer to the problem and like make sure that they can do the work of gathering and sharing their resources because like, you know, we can't know what is at risk in the time that it takes to actually protect that information.

Starting point is 00:30:23 So, but like longer term, as far as like the code goes and like what the plan is with, like, other SciOps and stuff. Yeah. That's also a big part of, like, the dream of the project is that this comes sort of, I think that we talked about this on one of the previous times I was on here. Just like, this like, hold on, it might be a little bit frozen. Hang on. I think we can still see and hear you.

Starting point is 00:30:53 Yeah, we can see and hear you. Okay, everyone is frozen on my screen. That just can happen. Like, you froze for me for a little bit. The video can freeze a little bit. Okay. Well, okay, we'll cut out the dead air moment there. But yeah, so I think

Starting point is 00:31:09 we talked about this a bit on like, like one of the last times that we were on here is just like this history of boom and bust and rise and calamitous collapse of BitTorrent trackers and how that just like that is how BitTorrent trackers worked historically

Starting point is 00:31:25 is that, especially in private BitTorrent trackers, you'll have this massive accumulation of labor and organization, accumulating these huge archives of media, and then the law will come knocking, and the tracker will get shut down, the database will be lost. And one of the things that's remarkable about private BitTorrent trackers

Starting point is 00:31:49 is how resilient that actually proves to be, where just when the successor tracker shows up, everyone sort of like rallies and re-uploads everything they have, and it might come back online in a matter of weeks, which is incredible compared to other possible results when other kinds of archives go down. But it still is a power struggle, and literally it will often be a battle of wills

Starting point is 00:32:16 and a battle of personality. I'm thinking about this Mac tracker, Brokenstones, just to go off on a little bit of a side. There's this private BitTorrent tracker that hosts like Mac applications, Brokenstones. And it's like, one of the admins took hostile control over the site and rerouted a bunch

Starting point is 00:32:36 of the donations to their crypto wallet or something like that. And then so the people chose to like shut down the site to protect users from this admin. And then the successor site showed up. And then that hostile admin tried to worm their way in through IRC into the admin of that new site. Just being like that like there's all the history. of dramas and sort of like a Greek tragedy of keeping these archives running. And so anyway, like with that as a background, like, one of the things we're trying to do is,

Starting point is 00:33:09 like, try and address that problem with like a bit of a redesign, like, you know, introducing federation to the idea of BitTorrent trackers, where, in the same way that, like, I don't know, we don't need to do the "what is federation" thing here. Lots of computers. Yeah. We talked about it on the last episode. Go see the last episode, where we talked about federation for two episodes in a row with Jonny. Yeah. Yeah. So just like this idea that like instead of it being one SciOp and that's like a unitary thing, like we want to do the same thing here where we have a number of these different trackers that can be online that can all talk to one another and share metadata back in. And so something like that doesn't really

Starting point is 00:33:56 exist as far as we know. There are some things that are like it. So I, for example, I saw a project someone linked it to me like two or three days ago that was like another huge French BitTorrent tracker went down. And then a bunch of people

Starting point is 00:34:11 went and put all of those torrents on Nostr, for example. So there are like other things that exist that are roughly trying to make these like BitTorrent in particular, but in general just like large

Starting point is 00:34:26 data set archives spread out across domains. But what we're trying to do is both preserve the social coordination, which is the main role that BitTorrent trackers actually have. So people think, this is another question we got initially, it's just like, why are you using a tracker? I thought trackers were obsolete now that we have DHT and other ways of directly exchanging peers between each other. So that's like the technical role

Starting point is 00:34:56 of a BitTorrent tracker is to, when I go and download a torrent, I go and ask the tracker, who else has got this torrent? And so the tracker is the one that will like be telling me, go and connect to this other IP address and so on. And so like that's one of the roles of a tracker, but the other more important role of the tracker is a site of social coordination, a site of giving organization and structure to a bundle of torrents. And in particular, giving a focus to it. So just like, you know, in the same way that What.CD focused labor

Starting point is 00:35:31 towards archiving music, having SciOp focusing labor as like a place to put the public information torrents is what, you know, the main thing that it actually does. And so building those kind of like social systems into the tracker and then making that, then the next step is making those social systems

Starting point is 00:35:50 extend across multiple trackers. And there's like some interesting, technical and social challenges that come up with that, where, like, again, a lot of the peer-to-peer space, especially lately post-crypto boom, can lean very libertarian in terms of its design and its goals, that, like, the goal is to make the one big public archive of everything, and that, like, that's not exactly, doesn't really fit in this context,

Starting point is 00:36:21 but also like it's just a very particular arrangement of power and how that's supposed to work. And so, you know, we, one of, so I'm thinking about one of the most recent examples that's like a challenge for this is like, we wanted to distribute a data set that we weren't really sure we could. And we also wanted to make sure that like the origin of it was not so obvious. And so we needed a way to predistribute that data before we made it public. And so if you build your system around assuming that everything is public, everything should be public, and it should always be immediately available, then you don't really have the means of making these sort of like gray area, private negotiations and discussions and

Starting point is 00:37:06 stuff like that that might need to happen for data that's a little more sensitive. And so we need a federation and a sharing model that can scale from private, literally peer to peer, as in like, I want to know exactly who's involved in this swarm of peers, up to the global public index. So that's like the next step as far as what we're working on this year, where like last year we got the thing running. And we got a bunch of data sets, we got our wins as far as like some nice archival work done and a foot in the door as far as like getting a base of seeders online, and then the next step will be making federation happen. And so we want to do an interesting blend of what ATProto is doing and what ActivityPub is doing here.

Starting point is 00:38:01 I mean, I don't necessarily know, I don't know what degree of technical detail would be good to say, but I'll just say that like in broad strokes, we want to do some of the nice things that ATProto is doing as far as like having these mobile, like very much, mobile units of coordination where like instead of having the server own everything

Starting point is 00:38:21 making it so that just like it is possible for people to like own their own personal space of this and building in like the social graph down to an individual account or person or something like that that then exists on a network of these

Starting point is 00:38:38 gateways but then also doing some of the things that ActivityPub does better which is like being able to have better control over who has access to things and being able to have better control over, like, the actual way that the data spreads throughout the system, as opposed to with ATProto, you assume that it just goes everywhere. And so it's like, that's the, like I said, I'll stop there to say that, like, we, there's a lot of probably boring details that go into, like, what that actually means, but that's the next

Starting point is 00:39:08 step in broad strokes is we're trying to make a federation model that fits the needs of sensitive, sometimes personally identifiable, but also public domain data and how to make that actually function both in like a gray archive but then increasingly into legit archive space. Yeah. So let me try and make that a little more concrete interpersonally and professionally. So like I'm starting a new job next month. And if I wanted to approach them and say, hey, we have a big research infrastructure. We're a big target of this administration. And we're a private institution.

Starting point is 00:39:47 We're not a state institution. What angles would I approach of saying we should start a SciOp here? I'd be saying, like, we should use our compute for this. We should convince the local sysadmin to let us start up BitTorrent software. And we should spend staff time on this. So, like, if you, you already have, like, control over your ability to do this at your job. But, like, if I was, I'm going to be in the library. I can talk to some people in computing, but I'm probably going to have to go talk to IT or some department.

Starting point is 00:40:18 So, like, where am I? How do I convince, like, you know, part of the bureaucracy or get them on board? Well, yeah. So that's, it's a complicated question. So the. So we're here for. Right. So there's different angles for this.

Starting point is 00:40:32 And some of them is like, this is all work to be done at the moment because like, yeah, like I said, that's next year of work. But like, the goal is to make it as minimal as possible. And so like that's actually something that is how SciOp has been designed from the bottom up, is that just like, like we wanted to make it ridiculously deployable. Where like if you were to just like pop into your neighborhood like vibe coding software

Starting point is 00:40:57 and say make me a website, what it would make you is something that, like, has like 10 different services running, you know, like a heavy full-stack application that, you know, you need to basically have a degree in sysadmin to run. And SciOp is not that. SciOp is like a Python program that's just, like, single install and no external web services, you just press play and go. And so at the moment, it is integrated as a full-stack website,

Starting point is 00:41:28 but we'll start breaking those pieces apart where what we want to do is be able to have like a metadata federation node underlay thing that you have all of your existing resources. You have all of your storage in some CMS somewhere and you have all of your metadata in a system that we're not trying to supplant that. That's another thing that we don't want to do or don't want to try and do.

Starting point is 00:41:56 Because every time that someone tries to come up with the next unifying metadata system and ignore all of the embedded labor and time and local expertise that went into that, it fails and it's like it's an embarrassing situation. You now have 11 standards. Yeah. And it's just like disrespectful, basically, like to metadata workers. Like, you don't know how to do your own job and stuff like that.

Starting point is 00:42:23 Just say, like, no, come and do it on our thing. But like making it possible to have this system exist side by side with something like that, that it can ingest the metadata. It can bridge to your existing system and pull that metadata out and make it so that just like you have everything that you have currently, but then also have a BitTorrent underlay with it. So that like the things that that adds or concretely brings to an organization is that like what we want to offer, not a service as in like we're selling a service, but like offer as in like this approach, is to, one, be able to make your metadata and your archival information more robust by giving it a concrete shareable form that just like, this is sort of like, again, harkening back to conversations that we had like last time when we were talking about linked data, just like, SciOp is obviously a linked data application.

Starting point is 00:43:17 Like it like publishes all of its metadata in RDF already. And so just like what we want to do is like basically like you can ingest any sort of like linked data metadata that you have. And here's a data set and it corresponds to these files. And those files are on your CMS somewhere. And so we can make a torrent, derive it from that data. And now what you have is your metadata, your data on your CMS and a torrent that refers to both of those things. And so then someone who wants to download it can download that torrent. They will download it from your servers, but then also then become an independent source for that information. So then, were your archive to completely fail or go down, what you already have out in the world then

Starting point is 00:44:04 are these very small, very portable descriptions of what your data is, how to get it, who has it. And so you just make your single monopolar archive into a distributed archive with very little additional effort or very little change to your existing systems. Let me spin a scenario so that I can make this, because I don't want this to bounce off like library workers too much. And I want to make this really easy for people who are like, I didn't understand anything you said this episode, but I want to help.

Starting point is 00:44:35 Where do I go? So. Yeah. Neither me to hurt off the baby. So I have like, I have like a lot of experience in academic libraries doing different things. So let's say it sounds like what I can do is maybe I can rustle up enough IT support to say like,

Starting point is 00:44:54 okay, I want to spin up like one server to run this SciOp site. And what I'm going to do then is maybe work with our data librarian or our scholcomm person and say, okay, all these data sets that we have in the repository or that people have in their labs, we're going to help coordinate them to create torrents of that data that's already living on their servers. So they don't have to move anything. We're going to create these torrent files. We're going to say, here's the URI. It's going to point straight to it. So someone, much in the same way that part of institutional repository success

Starting point is 00:45:27 is emailing them saying send us the file and we'll upload it for you, emailing them saying, tell us the URI for where this lives, and then we will create a torrent for you. And so someone in the library could start building up the SciOp project by adding torrents and files, and we would just have to control one small server

Starting point is 00:45:46 and then everything should be able to seed from that URI. Would that work? All right. It's about right. Yeah, this is the webseed concept, isn't it? Where, sort of within the torrent metadata, as well as a bunch of trackers that you can go to to ask for peers that already have this content, you also include a plain old-fashioned HTTPS URL that also holds the same content.
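
A minimal sketch of what that torrent metadata might look like, assuming a single-file torrent. The `url-list` key is the webseed field from BEP 19; the tracker URL, the repository URL, the file name, and the tiny stand-in content are all hypothetical placeholders, and the small `bencode` helper is written here for illustration rather than taken from SciOp.

```python
import hashlib


def bencode(value):
    """Encode ints, strings/bytes, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


content = b"hello world"  # stand-in for the real file contents

metainfo = {
    # Hypothetical tracker to ask for peers.
    "announce": "https://tracker.example.org/announce",
    # Webseed: a plain HTTPS URL that holds the same content (hypothetical).
    "url-list": ["https://repository.example.edu/datasets/survey.csv"],
    "info": {
        "name": "survey.csv",
        "length": len(content),
        "piece length": 16384,
        # One 20-byte SHA-1 digest per piece; this file fits in a single piece.
        "pieces": hashlib.sha1(content).digest(),
    },
}

torrent_bytes = bencode(metainfo)  # what would be written to a .torrent file
```

The point of the sketch is only that the webseed is one extra key next to the tracker list; the existing file on the existing server does not move.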

Starting point is 00:46:19 And so if there are no peers online, you just start downloading it from the web, which might be your repository or it might be some department or lab's website or whatever. And then as soon as you've started downloading, you start peering those chunks that you've already got. And it kind of kickstarts the swarm that way. And one of the, so I think with my institutional hat on, like, one of the nice selling points of that is, again, going back to the scraping thing, in an era where all of our websites are being bombarded with scrapers all of the time,

Starting point is 00:46:58 there's potential at least to distribute that load on your own infrastructure, because as soon as one person's downloading it, the next person can download it from two places, and then the next person can download it from three places. I think the thinking about sort of what people who listen to... everyone's just turned their videos off, or is it me? Okay. Yeah, I think. No, it's, it's, it's just

Starting point is 00:47:29 Jez's, uh, feed is struggling, but since this records locally, we'll be able to hear everything you said just fine. Cool, cool, cool, cool, cool. So we'll agree with whatever you said just there and you sound very smart, we promise. Yeah, yeah, totally for sure.

Starting point is 00:47:46 That's the magic of using this software. That's the reason we put up with all of its bullshit is that whenever that happens, we didn't just lose all your audio. Yeah. I got most of it, but then you started trailing off into robot voice. It was very impressive. That was probably about the time I noticed everyone else's video going very MPEG artifact. Yeah, you're back now. You're back now. So what was the last bit that you said? Where was I getting up to? Oh, yeah. So, like, there's a nice argument to be made from an institutional perspective of this will help to relieve the load

Starting point is 00:48:23 on your infrastructure. And this is exactly why Linux distributions have been distributing ISOs using torrents for years. Aside from the sort of piracy thing, that's the other thing that BitTorrent gets used a lot for, and it means that the small organizations throwing up their own Linux distro

Starting point is 00:48:46 can distribute it without it completely crashing their tiny servers. Yeah, it makes me think about, and something I've thought about before but haven't really explored is, you know how when you go on Archive.org and there's a file, and one of your options is just to download a torrent. Why don't we do that for institutional repositories and data repositories? Because those are starting to get big. I mean, I know one of the reasons is because, like, we use proprietary software for our institutional repository and Clarivate and Elsevier don't want to support that. Because I just,

Starting point is 00:49:19 the last institutional repository I worked with was Digital Commons, which is the Elsevier one, used to be called bepress. But we put some big files on there. You have to have software installed on your computer to handle torrents a lot of the time. And if you're on an administered computer, like the permissions involved with that, especially, like, if you're faculty, your, like, work computer, you probably don't have permission to do that kind of downloading. Yeah, but it doesn't, there's no permission needed for Elsevier to give me the torrent file

Starting point is 00:49:48 of the data on there. Like to, you know, just there's a download button. there's a download torrent button, there's a download raw file button. Like, it costs them basically nothing to do that, right? Yeah, but then you have to have the software on your computer to be able to do torrent. That's my problem.

Starting point is 00:50:03 I'm just talking about the, like, why can't our software provide that option? Is there a technical reason? No. So as Jay, as Jay is saying, it's like, like, most of the limitation is on the like recipient side. And on like, well, that's, well, part of the beautiful thing about BitTorrent is that it problematizes the difference between serving and receiving.

Starting point is 00:50:26 Like, that's sort of the whole point is that everyone is both. But yeah, like, it's like, that's like the major roadblock is that just like people needing to be able to run some different software that isn't a web browser in order to, like, so you'd hate to have this be a way that diminishes access to information, like, you know, putting it in a place that like people can't access it just because of the software. And so, like, there's some approaches for that. But in the ideal case, it is a purely additive thing that like, as exactly what you're saying of we just

Starting point is 00:50:59 add torrents and they are just an additional download mechanism. That's the idea that like, and I think that torrents, BitTorrent can be very mysterious. And so like one of the things that we wrote in our documentation is like, "What is BitTorrent?", trying to demystify that a bit. And it's like, it's so simple you would not even believe it. That just like there's one of the other. I say this all the time, and I just like don't know how much this, I never know how much this lands. And it's like, when you're talking about peer to peer, it's like, BitTorrent is like the Hegel of peer to peer. Where just like, you can't really have like a disposition towards peer to peer without reconciling your disposition towards BitTorrent. Like it's like, it's so fundamental and foundational to what peer to peer is that it's like everything else is sort of like in some way derivative of it or reacting

Starting point is 00:51:52 to it or like, you know, that, like, and so all that it is is just like you take a data, take a whatever, you know, a file, a stream of bits, chop it up into little pieces, make hashes of those, which are just like an abbreviated representation of that data where it's like, if you get that data, you can do the same operation, the same hash operation, and then if the hashes match, you know that you've got the correct data. And that's pretty much it, like that, like you just have pieces of data and distribute them. So what it means to make a torrent is literally to just read your files and then produce another file that's just like a very small reference to those files.
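
The chop-and-hash step described here can be sketched in a few lines. This assumes the original (v1) BitTorrent format, which uses SHA-1 piece hashes; the function names and the tiny 4-byte piece size are illustrative only (real torrents use pieces of tens of kilobytes to several megabytes).

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> list:
    """Chop data into fixed-size pieces and return one SHA-1 digest per piece."""
    return [
        hashlib.sha1(data[start:start + piece_length]).digest()
        for start in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, index: int, hashes: list) -> bool:
    """Repeat the same hash operation on a received piece and compare."""
    return hashlib.sha1(piece).digest() == hashes[index]


data = b"abcdefghij"
hashes = piece_hashes(data, piece_length=4)  # pieces: b"abcd", b"efgh", b"ij"

good = verify_piece(b"efgh", 1, hashes)      # matches the published hash
bad = verify_piece(b"efgX", 1, hashes)       # corrupted piece, hash differs
```

The list of digests is the "very small reference" to the data: anyone holding it can check a piece from any source, trusted or not.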

Starting point is 00:52:34 And then that's it. So whenever you're hosting data, presumably you have read access to the data that you're hosting, hopefully, you just hash the data and then you post it along with a hash representation of it. It's almost the same thing as posting checksums along with your data. except these are checksums you can download. So like, so like that's the, yeah, the minimal lift that it takes is the selling point. And like, why don't we do that? Like so also think about like the interface for that though with like archive.org where like

Starting point is 00:53:09 why don't you use the torrent option every time there. And part of it is because like the important like social element of that is not really centered on a lot of these interfaces. So it's like like in the same way that like Wikipedia. hides all of the work of the pages in the talk pages and in like the Wiki projects and stuff like that so that like what comes out on the other side is like this beautiful pristine, unproblematic

Starting point is 00:53:36 encyclopedia article. Like if you don't surface the swarm, like you don't show that there are n peers here, if you were to join the swarm, you would be like another one of them. Like these things need more peers versus these, you know, like, if you don't, like, design that into the interface, then it's just sort of like, oh, it's an additional download option.

Starting point is 00:53:57 Why would I choose the weird one? And it's like, and so part of it is like a social project of like making users of data feel like they've got some skin in the game and they can participate in its availability. That like that, so if I'm like a regular, so that's a, that's a, you know, a major issue with, with a lot of the current data that's being backed up and lost, that like these are datasets that are used by thousands of people all over the world, but they only exist in one place. And so these are like, it's like you have so many people that rely on it and have resources that could contribute to its, to make it available, but that's just not how they see their role

Starting point is 00:54:38 in the system. They are a user or a consumer of this data set, not necessarily a co-owner of this data set. And so, like, that's, that's like part of the social change of if you then surface the... So if you go on to, we don't have this yet, we're working on it, but like, Academic Torrents, I think it's dot com. I forget what it is. It might be dot org. Whatever. Academic Torrents is like the other like major academic data tracker. They have like these like little sponsor widgets or whatever. Like when you're a web seed for some data set, it'll say like hosted by blah, blah, blah, you know,

Starting point is 00:55:12 they're just like giving people like a little brag point that they can, that they can say, I'm helping to make this thing available. Yeah, the beauty and the curse of torrenting is always like someone has to be doing it. And the more people do it, the better. And if enough people aren't doing it, then, like, no one can use that file. Exactly. Yeah. And I think that's where, like, having, normalizing having webseeds as well. Because you don't put webseeds in a torrent for the latest blockbuster movie because you don't want to, like, have it available from this website.

Starting point is 00:55:47 Yeah, maybe explain, I mean, you did kind of explain webseeds, but kind of, maybe give it another. It's basically like instead of having a person who has their PC on overnight, you have a web server that has the file, and if there's no peers, it will be the peer, basically. Yeah, and that can literally be like your existing data repository. Exactly. In fact, we do this regularly.

Starting point is 00:56:12 So all a webseed is, so again, what a torrent file is, is a set of chunks of data that have hashes to them. And so when you're downloading over BitTorrent, you'll go and ask other peers, give me this chunk, give me chunk 10, you got chunk 15, cool. A webseed is that but HTTP servers. So just like all HTTP servers, you can request a range of bytes from them.
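
As a rough illustration of "chunk 10" becoming a byte range, here is a sketch that maps a piece index to an HTTP `Range` header. It only builds the request object and does not send it; the URL, piece size, and file size are made-up values, and the helper names are hypothetical.

```python
import urllib.request


def piece_byte_range(index: int, piece_length: int, total_length: int) -> tuple:
    """Return the inclusive (start, end) byte offsets covered by one piece."""
    start = index * piece_length
    # The final piece is usually shorter than the rest.
    end = min(start + piece_length, total_length) - 1
    return start, end


def range_request(url: str, index: int, piece_length: int, total_length: int):
    """Build (but do not send) an HTTP request for a single piece."""
    start, end = piece_byte_range(index, piece_length, total_length)
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical webseed URL and sizes, for illustration only.
req = range_request(
    "https://repository.example.edu/datasets/survey.csv",
    index=10,
    piece_length=16384,
    total_length=200000,
)
range_header = req.get_header("Range")  # "bytes=163840-180223"
```

A server that honours the header answers with status 206 (Partial Content) and just those bytes, which the client can then check against the piece hash like any other piece.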

Starting point is 00:56:39 So instead of going and downloading the whole file, just saying, I need range, byte range, X to Y. And pretty much every HTTP server can do this. If they can't, it's like ancient, it must be extremely ancient. Or it's been deliberately disabled. Right, yeah, right. Yeah, I remember using this program called WebAnts when we were on dial-up. Remember WebAnts.

Starting point is 00:57:01 So if your computer was, this was like when DSL was starting to come out, so files were starting to get bigger. But if you were still on dial-up, you would add this little browser add-on called WebAnts, and you would feed the URL for the download into, you would right-click on the file and say download with WebAnts, and it would chunk it up so that if you lost your connection, like someone made a phone call or whatever, it would download bits and pieces and you would watch the little chunks turn blue as each chunk was downloaded because your stupid modem couldn't keep up with trying to download like a gif of Danzig's face, yeah.

Starting point is 00:57:35 That's exactly what your BitTorrent client is doing. Yeah. Rather than pulling them all from the one place, it's like, who's got this piece, who's got this piece? So like this is like a part and that's part of the, then, there's like layers of experience with the torrent and just like most people never had much experience with it. The people that have had experience with it like will that will be especially with

Starting point is 00:57:59 rare torrents that will be like a common experience too that just like I tried to get something and it's not there. And so that's like the fear that we experience from a lot of people that it's like, well, if we put it on BitTorrent then like eventually people will stop seeding, it will be gone. And like that's true but that's also true of the archives. So like, and so if you think of your web archive, like, wherever you are hosting data, like, whether that be in some S3 bucket somewhere or your, you know, institutional hosting service. Just think of that as a peer. And so your, it's like, when I say it's strictly additive, that's what we mean.

Starting point is 00:58:34 It's like, we're only adding more peers. And your existing thing is just a big one that's added as a web seed. It's a different kind of peer, but a peer nonetheless. And so I think that, like, one thing that is like an immediate and obvious thing that you can do is that like if you both have webseeds and the ability to add webseeds to things after the fact, then you can do something that you can't currently do with like existing HTTP archive tech, which is like download from multiple of them at once. That like, so we have these data repositories where, I think about it from a researcher's point of view too, where like when I'm going to archive my data, like I can put it on like my discipline-specific archive or sometimes you might have an NIH archive, like for the genetics and genomics people will have like there,

Starting point is 00:59:32 but you put it on there and that's the only place that it is. And if I wanted to add it to Zenodo or whatever, that's like an additional step. And now someone will have to choose between these two things and it's one or the other. But like having a torrent with webseeds, what that immediately allows you to do is if I,

Starting point is 00:59:51 so there's like, say there's like institutional collaboration, we are co-hosting this data together, that like this happens all the time. Like there are like multiple institutions that are involved with collecting, organizing, curating, and hosting data, is that now it's possible for me to download it from all of them at once.

Starting point is 01:00:08 Even if there are only webseeds, even if there are only HTTP servers in the mix, no peers, I can still spread my downloading out and use bandwidth from all of these things at the same time.
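
To make "download from all of them at once" a little more concrete, here is a deliberately simplified sketch that spreads piece indices across several webseeds in round-robin order. Real BitTorrent clients choose sources dynamically based on speed and availability, so this is only an assumption-laden illustration; the function name and both URLs are hypothetical.

```python
from itertools import cycle


def assign_pieces(num_pieces: int, webseeds: list) -> dict:
    """Spread piece indices across webseed URLs in round-robin order."""
    plan = {url: [] for url in webseeds}
    for index, url in zip(range(num_pieces), cycle(webseeds)):
        plan[url].append(index)
    return plan


# Two hypothetical institutions co-hosting the same data set.
plan = assign_pieces(
    5,
    [
        "https://repository.example.edu/datasets/survey.csv",
        "https://archive.example.org/mirror/survey.csv",
    ],
)
```

Each piece can then be fetched from its assigned server with a byte-range request and verified against its hash, so neither host has to serve the whole file.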
