librarypunk - 161 - SciOp.net feat. Jonny and Jez (part 2)

Starting point is 00:00:27 So let me give you a scenario of something I was working on. So we were working on this linguistic project, and part of the data plan was we are going to host it in our institutional repository, and this other institution is going to host it in their institutional repository, which is different software, neither of which gives torrent as an option. But if we were doing a SciOp, I could say, when I'm creating the torrent, here's one webseed and there's another webseed. Could I do that? Absolutely. So it would be one torrent but two webseeds.
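To make the "one torrent, two webseeds" idea concrete, here is a minimal sketch in Python. It assumes the webseed convention from BEP 19, where HTTP mirrors are listed under a top-level `url-list` key in the torrent's metainfo. The tracker URL, repository URLs, and placeholder piece hash are hypothetical, and `with_webseeds` is an illustrative helper, not part of SciOp.

```python
def bencode(value):
    """Encode ints, strings/bytes, lists and dicts with BitTorrent's bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be written in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def with_webseeds(metainfo, urls):
    """Return a copy of a metainfo dict with extra webseed URLs in 'url-list'."""
    existing = metainfo.get("url-list", [])
    if isinstance(existing, (str, bytes)):
        existing = [existing]  # a single URL may be stored as a bare string
    merged = list(existing)
    for url in urls:
        if url not in merged:
            merged.append(url)
    updated = dict(metainfo)
    updated["url-list"] = merged
    return updated


# Hypothetical torrent for a dataset held by two institutional repositories.
metainfo = {
    "announce": "https://tracker.example.org/announce",  # hypothetical tracker
    "info": {
        "name": "corpus.zip",
        "length": 1024,
        "piece length": 262144,
        "pieces": b"\x00" * 20,  # placeholder, not a real piece hash
    },
}
torrent = with_webseeds(metainfo, [
    "https://repo.first-institution.example/files/corpus.zip",
    "https://repo.second-institution.example/files/corpus.zip",
])
torrent_bytes = bencode(torrent)  # what would be written to a .torrent file
```

Because `url-list` sits outside the `info` dictionary, adding another mirror does not change the info hash that identifies the torrent.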

Starting point is 00:00:59 Yep. And if a third institution comes along and also has a copy on theirs, then you can add that as a webseed later. Yeah. Yeah. So that's something like we can do now without, because part of my brain is like how do we get DSpace to do a feature request where it just automatically gives everyone the option to start hosting torrents,

Starting point is 00:01:20 which I think would be very nice. But we don't have to wait for that. Like, we could start a SciOp. Like, for example, going back to, like, my next job, I could start making this argument of: we've already hosted the data, now let's just start making torrents for it

Starting point is 00:01:35 and we'll have a website that points people to the torrents. Yeah, we just... Yes, soon, soon. We don't have a, we don't have an executable for you just yet, but that is the plan. Like, that kind of thing. But, like, even, even now,

Starting point is 00:01:50 you could, you could start manually. Yes. Going, okay, these are our 10 biggest data sets. Let's make torrents for them and just list them on our webpage. And like, yeah,

Starting point is 00:02:04 I meant manually, I meant like, because in my head there's going to be someone who's like 10% of their job is going to be doing this because I've always got like my manager hat on of like I need to, who is going to have to, whose problem is this going to have to be? Is it going to be mine? It's going to be someone else's. Who do I have to convince

Starting point is 00:02:19 to give me five hours a week of their time? Because, you know, it's, diet bowl and I don't want to waste anyone's time but like yeah could we start could you go out and start making torrents of your data right now using webseeds yeah yeah basically yeah and and part of it part of what we've been doing there's like I feel like we're we're talking a lot about like the technical challenges of this here so like maybe last bit on that because like more interesting part is like you know going off road and scraping and stuff like that it's like the glory of the hunt like you know in in but the

Starting point is 00:02:53 the BitTorrent ecosystem is ancient and ailing and old and stuff. And so, like, part of what we've been doing is trying to revitalize the development effort around it. Where, like, to make a torrent, you know, there are, like, there are these truly disturbing blobs of C code that you can, you know, use to make a torrent. And it just, like, sort of half works. And it, like, you know, segfaults on my computer half the time. But we're just, like, trying to make that be, you know, modernize that a bit. Where, just like, now you can actually have like a program that makes good torrents efficiently that you can plug into like a deployment pipeline easily.
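As a rough illustration of what "a program that makes torrents" has to do, the sketch below builds a single-file BitTorrent v1 metainfo dictionary in Python: it hashes the file in fixed-size pieces with SHA-1, then derives the info hash and a magnet link. This is not the tool discussed in the episode. The function names and the default piece length are assumptions for illustration, and a production tool would also handle multi-file torrents, v2 hashes, and trackers.

```python
import hashlib
import os


def bencode(value):
    """Minimal bencoder for ints, strings/bytes, lists and dicts."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])  # keys are written in sorted order
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_metainfo(path, piece_length=256 * 1024, webseeds=()):
    """Build a single-file v1 metainfo dict for the file at `path`."""
    pieces = bytearray()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(piece_length)
            if not chunk:
                break
            pieces += hashlib.sha1(chunk).digest()  # 20 bytes per piece
    metainfo = {
        "info": {
            "name": os.path.basename(path),
            "length": os.path.getsize(path),
            "piece length": piece_length,
            "pieces": bytes(pieces),
        }
    }
    if webseeds:
        metainfo["url-list"] = list(webseeds)  # BEP 19 webseeds
    return metainfo


def info_hash(metainfo):
    """SHA-1 of the bencoded info dict: the identifier peers use for the torrent."""
    return hashlib.sha1(bencode(metainfo["info"])).hexdigest()


def magnet_link(metainfo):
    """A bare magnet link that could be listed on a dataset's web page."""
    return "magnet:?xt=urn:btih:" + info_hash(metainfo)


def write_torrent(metainfo, out_path):
    """Write the bencoded metainfo to a .torrent file."""
    with open(out_path, "wb") as handle:
        handle.write(bencode(metainfo))
```

In a deployment pipeline, a step along these lines could run after a dataset is published, calling `build_metainfo` with the repository's download URL as the webseed and then `write_torrent` next to the data.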

Starting point is 00:03:33 And that's part of what we made. It's worth mentioning as well that the vast majority of the software in the BitTorrent ecosystem is from the perspective of, I mean, for a start, there's a whole aesthetic about trying to make your torrents as small as possible. So like there is a, Jonny will know this much better than I do, but there's like, culturally, there's a size beyond which it's kind of assumed that torrents don't go. And as soon as we got into scientific data sets, we immediately started breaking those assumptions. Yep. That's a lot of fun. That story is a lot of fun. That's not an issue with the design of the protocol, because I think having looked at that,

Starting point is 00:04:22 it's really very flexibly designed, but it's a function of the implementation choices made by the people writing that client software. Yeah. So it's like, you know, the legacy of this technology. We're trafficking like in one gigabyte, the Matrix 99, YIFY, dot MKV, 1080p. It's like the, speaking of BitTorrent aesthetics, like the naming conventions are one of my favorites. Nice. They're fun. Whenever I rip a video for, like, for a podcast

Starting point is 00:04:57 so that every like there's not a torrent available so I have to like rip the DVD myself I always put it like smazzy gang dot rips yeah I feel like I should feel like I should mention if if like me you had a very sheltered

Starting point is 00:05:13 and naive upbringing and didn't like get into BitTorrent as a teenager, I love the, I'm learning about a lot of the social history of this through Martin Paul Eve's book. Hell, yes. Great. I made a book on,

Starting point is 00:05:27 wait, what is Martin Paul Eve's book? Warez. W-A-R-E-Z. It's a phenomenal book, both because it's like, it specifically covers the culture and aesthetics of piracy. Where, just like,

Starting point is 00:05:39 a lot of books that try and write about piracy just miss the entire scene of it. Like, they just don't approach the actual social structures that underlie it. And like, that's the most important part. And, and to, like, talking about, like, he arrives at, like, this correct thing that's, like, everyone in the scene would know, but, you know, it's like, it's not really written. I've never seen, it's the only academic work I've seen

Starting point is 00:06:07 that actually describes, like, some of the, some of these basic phenomena. Like, in piracy, why would it be the case that if the whole point of it is to, like, steal stuff and make it available. Why would there be beef or problems if one group were to, like, steal another group's release and claim it as their own or something like that? That just, like, the amount of internecine struggle that exists and it is just like, can only be really understood as just like a, you know, a culture of honor and, you know, elite scores and stuff like that that you're trying to get. But it, so anyway, yeah, I highly recommend that for sure. Yeah, that whole bringing it, dragging it back onto the point I thought I was trying to make.

Starting point is 00:06:52 Can I remember that now? Was like, that whole warez culture very much informs the, all of the current technical implementations. And you like, you'll, you go to, you find a bug because you're trying to make a torrent that's bigger than the author of libtorrent thought a torrent should be. Right. And you go and you report a bug. And they're just like, that's ridiculous. No one would ever want a torrent that big. I'm not going to implement this.
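Some back-of-the-envelope arithmetic shows why scientific datasets strain size assumptions. In a v1 torrent, each piece contributes a 20-byte SHA-1 hash to the metainfo, so piece count and `.torrent` size both grow with the data unless the piece length grows too. The sketch below is only that arithmetic in Python. The sizes are illustrative, and it does not describe the specific libtorrent limit mentioned in the conversation.

```python
def piece_stats(total_bytes, piece_length):
    """Return (number of pieces, bytes of SHA-1 hashes) for a v1 torrent."""
    pieces = -(-total_bytes // piece_length)  # ceiling division on integers
    return pieces, pieces * 20                # 20 bytes of SHA-1 per piece


GIB = 2 ** 30
TIB = 2 ** 40

# A roughly movie-sized torrent at a 256 KiB piece length.
print(piece_stats(1 * GIB, 256 * 1024))       # (4096, 81920)

# A 10 TiB dataset at the same piece length: about 800 MiB of hashes alone.
print(piece_stats(10 * TIB, 256 * 1024))      # (41943040, 838860800)

# The same dataset with 16 MiB pieces: about 12.5 MiB of hashes.
print(piece_stats(10 * TIB, 16 * 1024 * 1024))  # (655360, 13107200)
```

Tools and clients that assume small torrents tend to bake in a range of piece lengths or piece counts, which is one plausible place for assumptions like the one described here to surface.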

Starting point is 00:07:22 Yep. So this also relates to what, like, what Jay was saying just a moment ago, just being like, like, needing to have this special software. And, like, needing that, like, there is a thing called WebTorrent. And it's, like, been designed to work in the browser and to be, like, a thing. But it's been hampered by this culture of, like, you know, BitTorrent. It's like, there's a thread that's been going on for several years now on libtorrent.

Starting point is 00:07:48 That's like, enable WebTorrent for, you know, libtorrent. So libtorrent is the software that runs underneath a lot of clients. So like, you know, qBittorrent is a front end, like a GUI built on top of libtorrent. And so it's like the argument is like, we want to serve people on the web that are acquiring data with a different modality that don't or can't run BitTorrent clients, but they should be allowed to be part of this BitTorrent swarm. And so even if I'm downloading using WebTorrent, like, and I might not become, like, a permanent seed of this thing, I should still be able to use it and still be able to be part of the swarm.

Starting point is 00:08:30 And so, also just like a side note, maybe like put in the footnotes, like there's this wonderful browser, Agregore, that's made by Mauve, RangerMauve, this excellent piece of software that really flies under the radar. That is a browser that's designed to be, you know, peer-to-peer first, that just like you can become actually a peer in these peer-to-peer systems just directly from your browser. Anyway, but like, so we were trying to say that that was one of the major reasons that WebTorrent support is not enabled by default, is that like people want to structurally exclude WebTorrent from these peer swarms because there's the perception that just like what we want are long-term seeders, traditional BitTorrent users,

Starting point is 00:09:15 and WebTorrent is just a means of like hitting and running, you know, the term of art in BitTorrent for downloading without seeding. And so what we're trying to make the point of saying is that it's like, well, BitTorrent has always worked in this way where you have this mixture of long-term seeders. Why do people seed the Pirate Bay torrents? Because some people are just pathological seeders like that and just need to seed. But many people are not. So expanding the base of who can be involved is always more powerful than trying to limit it,

Starting point is 00:09:54 especially limit it structurally and at like a code level to only those people that we really want to be in the swarm. Yeah. I can see how that's a bad limit. Like there's good pro-social limits, but saying like my assumption is always going to be correct and therefore I'm going to physically limit the tool. Like this could be a two-person saw, but I'm going to only design and sell one-person saws. Exactly. Right.

Starting point is 00:10:19 And, yeah. And part of it, the issue there is that, is that the people are, this term that I, that I love. So I can't remember who introduced me to this term of cookie licking where, like, there's a box of cookies and you pick up a cookie and lick it and put it back. So it's there, but only you can really have it. And so it's just like this ownership of something that should be held in the commons.

Starting point is 00:10:52 And you aren't necessarily making use of it, but you're holding ownership of it. And that's exactly like, the BitTorrent developer scene is cookie licked to the max. Like, that just like, Bram Cohen, like, grifter in chief of BitTorrent, has now pivoted entirely to focusing on like his crypto scam. And so, like, he has drawn the other major developer of libtorrent along with it.

Starting point is 00:11:20 So all of their time is now devoted towards crypto scamming. And so in the meantime, the rest of us trying to use libtorrent and BitTorrent are stuck in the lurch of like, we're waiting for the crypto bros to come back and care about us again and the software isn't moving. So, like, that's like the current state of that WebTorrent, like enabling PRs, like begging the one guy in charge of this piece of software to enable some flag. And so that's... I'm feeling like the library equivalent of this would be like MarcEdit or something. And it's like, this is a great piece of software

Starting point is 00:11:55 and everyone uses it and it's free, but also the developer refuses to make it open source and refuses to accept any help in developing it in any way. Yeah, that's the... Like, something I've come across in my career is sort of moving from like, I think this is very cool and I'm able to do it. So I'm just going to go forth and do it to like thinking how easy is it

Starting point is 00:12:25 for people who aren't me to take on this work? I think MarcEdit is one of those because like, I don't know if you're like on those lists or those listservs, but so much of it is just asking people how do we do a thing because not only does the creator of MarcEdit not make this open source or accept any help, his documentation is also terrible and barely explains how to use the software. I've put in there like, hey, is there a way to do this? And he goes, oh, huh, I never thought of that. I'll have to do that when I have access to an OCLC API key again because I don't anymore at a different job.

Starting point is 00:13:01 Like there's this whole plug-in for MarcEdit that he wasn't able to develop for a little bit because he was no longer at an institution where he had access to the OCLC metadata API. Like, this is just me going on a little rant now, but it's,

Starting point is 00:13:18 it's a really frustrating thing when like, he just like doesn't make it open. And it's like, well, what's going to happen when he no longer can work on it? Yeah. Like, no one else can take it up.

Starting point is 00:13:29 Like, if he, if he does sort of shuffle off this mortal coil, there are plans to make it open source. And like, why don't you just do it now and then you don't have to have the plans? Yeah. And it's because there are so many people who have, like, such cool ideas for MarcEdit or for, like, other tools. And, like, part of me, like, understands his reasoning for not making it open source, but not really at this point.

Starting point is 00:13:57 There are two projects that I would love to just, if someone would pay me to have a roof over my head and food to eat, that I would love to just work on. One of them is the open source MarcEdit and the other one is the research grade BitTorrent client. We're going to get funding for that. We are. We're definitely going to get funding for that. Yeah, this brings up maybe an interesting question of like, I, as someone who is all for things that are like open source or the sort of like pro-social sharing and torrenting and everything, but is also very concerned about the politics of free labor.

Starting point is 00:14:41 Yeah. Out of people, like, I know that's a huge problem with, like, the open source space. So I guess, I don't know how, because I'm not nearly as techy as I come off sometimes. Like, in this project, like, what sort of, it's like, you know, if someone's going to be hosting a torrent, they have to be able to afford the energy to do so and pay for the internet. So in a way, it's almost like a rentier kind of thing. You have to be able to pay to have internet to do this and to have electricity to do this.

Starting point is 00:15:13 And so there's a lot of people contributing time and energy and resources who aren't necessarily getting paid for that or compensated for that. So I don't know if that's like a conversation that's come up with this project or like sort of in the torrenty space or anything. But what does that kind of look like? There are many different flavors of this.

Starting point is 00:15:31 So I think that, like, the place that I think it's reasonable to start with is, like, thinking about the norm and the status quo, just like how labor works for a lot of these public archives and a lot

Starting point is 00:15:45 and stuff like that. Where, just like, on the one hand, the norm of it is already that, just like, if I were to, so if I were to try and do, like, think about archive.org as an example, that like, there's a lot of wonderful things,

Starting point is 00:15:59 we love archive.org like you know there's some problematic elements of it, but just like, I'm not trying to like start a fight there, but like that... We are, though. You know, everybody's got their beefs. Sort of, you know, but like what you might do is for that is I go and gather something. I do the work of finding something and preserving something, formatting it, whatever, and then I'll upload it to archive.org.

Starting point is 00:16:26 And so I've done the work of doing that. But then I, in one respect, I don't have to have any resources in order to host it. In another respect, I've done a bunch of work to basically like augment archive.org's portfolio, that just like, I also don't have a lot of ownership over like the thing that I have made or contributed to. So it's like there is that push and pull there of like, if you do make a hosted service, you do lower the barriers and do lower the resource constraints to accessing and using that thing. But at the same time, you then own that thing and you sort of like by necessity take control of it.

Starting point is 00:17:03 And so, like, that's, like, we love Data Rescue Project. That is, they are doing excellent work. They are much more organized and much more complete and rigorous and, like, in their coverage of the things that are missing. But, like, they're using this piece of software called DataLumos. That just, like, requires you to, like, log in in order to access stuff. Like, they're, you know, and it's hosted by them. And, well, I actually don't know if it's a thing. I think it's, like, an independent organization.

Starting point is 00:17:32 But that's like the norm of how these file hosting, file archiving, et cetera, works, is that it's like someone else is hosting it and they basically own it. And like that's like trying to break out of that loop of, we want to avoid the next time when this thing, the ownership has a hostile acquisition or loses funding or whatever and goes down. And so all people's time that they spent contributing to something is now just gone because they didn't have any sort of ownership stake in it. And that's like a little bit of like getting back into like technical details,

Starting point is 00:18:07 but like trying to bring down those resource constraints and trying to bring down like the, you need to have an always on server in order to participate and stuff like that is like part of the direction that we're trying to go into as far as like automating these things on my phone and my, and my laptop and whatever and being able to be a partial peer some of the time when I have the resources in order to, and meshing that across, you know, the fact that people are using multiple devices. I don't want to necessarily go too far into that. But basically, like, that question sort of underlies the whole of how we design the system and why we are doing it in the way that we are doing it.

Starting point is 00:18:51 Because, like, we do want people to, in general, be able to have control and ownership over, like, the things that they put their labor and time into. and like I'm working you know going down into just the design of the code or just like the next thing that I'm going to do as soon as I start like I'm pivoting I'm doing some background work on something else that I'm not going to talk about but like coming back into

Starting point is 00:19:14 SciOp is like start breaking it up into like more of a plugin-based system where you, currently it's open source, but like you don't even need to ask me to add some functionality to it, like you could just go off and do your own thing, make that available and then whatever. It fits in the framework and you can run it on your own. And so, like, that's, yeah, we're acutely focused on this problem of how do we value people's labor in a way that, like, takes us out of the equation, basically.

Starting point is 00:19:49 Like, that's like, it shouldn't even be us valuing your labor. It should just be you being able to do something. And so, anyway, that's part of the social politics of the space, but I feel like I'm getting too far into like the socio-technological, and we haven't really even spoken much about just like the actual stuff that's being stolen from the culture and like more of the what actually is being lost element of this, but sorry, don't want to go too far into the technical details and miss the, miss the forest for the trees. I think like picking up on that and the like,

Starting point is 00:20:28 what words were you going to say, Jez? I think the aspect of that that sort of does keep bothering me with trying to get more institutions to adopt this, like, that would be great, but I'm very wary that one of the ways that you could sell it is this will make your repository cheaper. And if everyone adopts like peer-to-peer file distribution, because it will make their repository cheaper.

Starting point is 00:21:00 It doesn't work. You end up with the tragedy of the commons. So you have to have the kind of governance and the norms in place formally or informally that sort of support the positive of the commons. Explain that to me because in my mind we were talking about all this, you know, increased server load because of all the scrapers. And you could mitigate that by having torrents. So why would, I mean, I guess is the assumption that people would stop having their own repositories,

Starting point is 00:21:38 and that's the way they would save costs? Or because if everyone just, if DSpace gave you the same option as Internet Archive, which is download file, download torrent, how would that be a problem of the commons? Yeah, and I don't think that would be a problem. I think that the danger is more, when you're kind of making the business case for this to be budgeted for and things like that. And you're saying, oh yeah, this is reducing the loads on our servers. And they're like, great, that means we can give you less money for servers.

Starting point is 00:22:11 And I think, like, that's not a reason not to do it. It's just a reason to be sort of aware of how these kinds of arguments play out in different ways. Or maybe if you're starting this project, have a plan for how you're going to reuse that new, that freed up server space from the beginning. Because also if you don't need the server space, then you don't need it, so it's like, it is good financially, like, to put my manager hat on again. You know, if we don't need to spend money on something, we don't need to spend money on it. But that's also where as a person who likes to build his own little fiefdom, I would go,

Starting point is 00:22:49 okay, I've saved money. Now let me spend that money on something else in my fiefdom. So, you know. One of the cool ways of getting, I think, to get this adopted by some institutions would be to get together a little, I want to say cabal, but I'm going to say collaboration of like two, three, four like-minded folks in different libraries. And kind of that's the point at which you can say, okay, for kind of resilience and sustainability, we want to store three copies of this data. At the moment, we all pay for hard drives and we all have our three copies in our different machine rooms.

Starting point is 00:23:29 Let's still pay for the same amount of storage, but we'll keep a copy of yours and they'll keep a copy of theirs and it's sort of increasing the geographic distribution and resilience of that without increasing the cost. Yeah, that's kind of what just pinged in my head thinking about this from the IT perspective is like

Starting point is 00:23:52 I mean, for one, I'm all for, you know, doing shit just because it's good. But there is also the, like, the practical part of me is like, who's going to maintain the server? How much time is it? How much of it is shadow IT? Because, like, that's a huge issue, right? It's just making sure users aren't doing some weird shit that we have no idea about. But, like, I know that our state library has a large collection, a large archive of photographs from all over the country and all sorts of libraries contribute to it. Now I'm like,

Starting point is 00:24:28 wouldn't it be cool if public libraries with more of that space could create that torrent, keep that archive going across the state so that way it's not just on the state library to maintain it, because libraries are contributing to it but not necessarily hosting it. So now I'm like, I could see, I could absolutely see a case for this if like certain, if we could get like a state library or the people who have the big archives, like, I'm in Washington. So like the U-Dub, why not run a torrent for U-Dub's archives? So like, I could see the business case for this as a backup solution kind of thing. Like I think was kind of what you're getting at, Jez, is like it's the 3-2-1 of backups. You always want to have some geographically distant, right? So like

Starting point is 00:25:16 this could be kind of a way to bolster that even if it's not like a full, you know, backup of something. So like I've got, like, I've been gathering a ball of yarn over here, like a bunch of these like interrelated thoughts about this. So okay, so this notion of, okay, this is about this idea of platform fiefdoms and resource allocation and the distinction between archivalism and active service and stuff like that. Just, okay, so like this goal of platforms and web platforms of bringing resources down for people to be able to use data.

Starting point is 00:25:59 It's like, on its face, a good one. That is a positive thing to be able to do. To make it a website, I go to the thing, it's all set up there for me. I don't need to do anything. But then that also poses an additional problem for archiving. How do you archive a web service?

Starting point is 00:26:16 That's a much harder thing to do than archiving static data. the underlying static data. And even archiving the static data plus the code needed to use it. Way, way harder. Yeah. Way, way harder. And so I think about this, one of these examples that, like, we faced, like, drought.gov.

Starting point is 00:26:33 That just, like, this is a resource that farmers use to farm, like, essential information, like, being able to plan crop yield and being able to just, like, have information about drought and water usage and just, like, what the next year is going. And so that was one of the casualties of, you know, Trump administration that took this thing down. And we're trying to archive it and like, like, how? And it's because like there's like multiple pieces of that. One is that just like without the funding for the thing to continue existing, then it will not be continued to update. You know, just like we don't have the, the thing that's lost is not, is not just the data. It's the means of keeping that data functioning. Like the system itself

Starting point is 00:27:14 is at a loss. And you can't archive a system of labor. And so, like, that's something that, there's like a fundamental limitation we cannot address ourselves. Like, that's just like, that's not possible. And, like, trying to do that would be hubris, right? And so, like, when that thing goes down, not only do you lose access to the thing that could be backwards looking, but you also lose access to the thing that would be forwards looking. And so we have partial solutions to how to archive a web service. And it's like where, like, our friends over at Webrecorder and Browsertrix have made this tool where usually when you go and do a web scrape or a web archive, you go and do static HTTP requests, go and download this set of

Starting point is 00:28:01 files, JavaScript, and et cetera. Browsertrix and Webrecorder work differently, where it simulates a full web browser, and it just captures all of the network traffic that happens the whole time you are using the site. And so, for example, like if you're going to a site like drought.gov and you go and zoom in on the map, that's requesting additional map tiles. And you're like scrolling around and looking at these different like facets of this data. All of that can be captured and recapitulated. And so like basically what you need to do is be able to, well, in the volume case to automate that kind of like interactivity, go and zoom in on all of the map tiles and expand every data browser and so on. I'm thinking about the heinous miscarriage of infrastructure that happened in the United States where, what the hell was that? That, where all the government services were using this, like, was it plateau or tableau? That's what it is. Tableau, yeah. That just like, that now all of the government sites have this, like, embedded JavaScript applet that, like, has their live updating COVID tracker or something like that.

Starting point is 00:29:10 But that's, like, a really rickety and fragile system that, like, doesn't serve, like, public use of information in that way. It makes it available, but it's impossible to archive. So, anyway, so we have these, like, this, I want to bring in Browsertrix, this more advanced kind of web scraping. And they've done a good job of making that accessible to you. And then you basically get into the need for, we want to then not only scrape and grab these sites, but then make them available again.

Starting point is 00:29:38 So, like, that, you know, maybe drought.gov wouldn't be able to be useful in the future because it's not being updated, but other interactive web services can be archived that like aren't, you know, intended to be like live updating things like, you know, climatological and weather information. And so that's also currently something that like SciOp can't do yet. And archive.org does better. That just like that archive.org does preserve some, you know, you can, if there is like some web service that can be captured by their technology, you can then just go to a website and have like the same quality of experience, like the same utility that the web service gives to you. And so that's,

Starting point is 00:30:16 that is something that like the Browsertrix and Webrecorder folks were working on. In fact, they, I mean, hopefully by the time that this, the podcast comes out, like, they, Ilya said that like, don't mention the release yet because it's still in beta. But I'm going to mention it anyway. They just like, they released a version of Webrecorder that is BitTorrent powered, that like you can make a, you know, a full network traffic backup of a service and then be able to browse that from a BitTorrent archive. The limitation that they face is exactly the same one I was talking about before, is that just like libtorrent doesn't want to enable WebTorrent, blah, blah, blah, but like assume that that works. Then you get into the situation where just like,

Starting point is 00:31:00 now you can have distributed archive.org where if I'm, if I care about something, if I want something to exist and I know it's going to go down soon, I can basically go and just use it, like open up a Browsertrix web browser, go and touch all the knobs and fiddle with everything to record and capture all of that. And then like basically in the act of me using it, I create an archive from that. So I create like this thing that's a WACZ file. It's just like a big zip file of all that network traffic. And then you, but then you have the question of trust, that like you can make all of these distributed copies of websites and stuff. But then the thing that Archive.org also provides is that trust.
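Since a WACZ file is a zip archive, part of the trust question can be checked mechanically. The Python sketch below assumes the layout described in the WACZ specification, where a `datapackage.json` at the root lists each resource with a `path` and a `hash` of the form `sha256:<hex digest>`. Treat those field names as assumptions to check against the spec, and `verify_wacz` as an illustrative helper rather than part of any Webrecorder tool.

```python
import hashlib
import json
import zipfile


def verify_wacz(path_or_file):
    """Compare each resource in a WACZ against the hash in datapackage.json.

    Returns a dict mapping resource path to True (hash matches) or False.
    Assumes each resource entry has 'path' and 'hash' ("<algorithm>:<hex>").
    """
    results = {}
    with zipfile.ZipFile(path_or_file) as archive:
        package = json.loads(archive.read("datapackage.json"))
        for resource in package.get("resources", []):
            algorithm, _, expected = resource["hash"].partition(":")
            # Reads the whole member into memory: fine for a sketch,
            # but large WARC files would want streaming instead.
            actual = hashlib.new(algorithm, archive.read(resource["path"])).hexdigest()
            results[resource["path"]] = (actual == expected)
    return results
```

A check like this only shows that the archive is internally consistent and has not been corrupted in transit. It says nothing about who made the capture or whether they altered it before packaging, which is the social side of trust the conversation turns to here.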

Starting point is 00:31:46 You trust that an Archive.org copy of a website is correct because Archive.org systems captured it. And they do not have a history and a reputation of modifying the things that they capture. So it's like the thing that I'm, like, the sort of big ball of yarn I'm trying to pull out here is just being sort of like, once you try imagining like an alternative to the platform web, then you need to basically pull out and redo everything. Where, just like, you need to rebuild the technology by which we populate these archives. You need to create a whole social system for like distributed trust so I know who's scraping what and like whether or not this

Starting point is 00:32:21 has been useful in the past so that like I can look at different, I can host a BitTorrent client and like, you know, know that like I'm going to connect to peers that are trustworthy that aren't just like trying to like spam the network with garbage. And so like, that is a massive project that involves just like basically like, you know, rethinking just like the way that every person is using computers. And so like how do you like make that palatable and make that like, you know, sort of sneak these ideas in the back door basically. And just like, and like it's like sort of like all of these fronts are related.

Starting point is 00:32:59 Like you're communicating to administrators that we need to like it would bring down costs in order to, you know, make torrents of everything. That just, like, communicating to scientists, researchers, academics that, like, when you use data, you should be partly responsible for keeping that data available. Like, making it communicate to, like, the BitTorrent community, just, like, web torrents are friend. And, like, all of these things are part of, like, you know, related to this, like, largest part of it.

Starting point is 00:33:27 And just, like, you can do bits of them, like, independently and stuff like that. But just, like, they become much more powerful once you have, each of them in place. And so... And there's also, to jump in, there's also, if we're bringing in this whole new community

Starting point is 00:33:44 of academic users, that also puts pressure on the interpersonal ecosystem to say things like, you know, web torrents are your friend. And, and like, that's another part of it too.

Starting point is 00:33:56 That just like, the social tooling, in most cases, is like an afterthought or just like basically dog shit, you know, just like that you, like,

Starting point is 00:34:04 you have, if you, you're, design, if you don't design a system with social tooling first, then inevitably what you end up with is like the council of expert gods that like control all metadata and mediate the whole system. And that's also like a, like I say you was saying just like who regulates, who runs this thing. It's like it must be you. Like you, like it pins a certain set of people into maintaining the system.

Starting point is 00:34:26 And so you need to build like the social tooling that allows people to negotiate over like who is able to post what. and like how do we trust them? And it's like what the Federation model is intended to bring is like the ability to break off and make a new set of norms and instead of like a community of archivists, that isn't then isolated. That doesn't become like its own separate system. They might have different norms and standards than we do,

Starting point is 00:34:55 but like we still can like talk with each other. So like that was like one of the major goals is that like sciop.net, squeaky clean, super legal. Like, if you download a torrent from sciop.net, we make the promise to you that it is legal to download it. Like, that's like our standard for that site, in part to make it possible to develop the underlying technology so that then another archive group that has a higher risk threshold that wants to be like hosting, you know, confidential leaks or wants to be hosting, you know, just like more risky information, they can do that over there.

Starting point is 00:35:37 And like it doesn't necessarily implicate us, but we can still be a part of that system as well. So that's like what I was talking about before with like the different kind of federation model where you might want to have the idea of like dark archives. Like who was it? Was it the, was it, there was a New York library that was like, we made a dark archive of all of data.gov

Starting point is 00:36:00 or something like that. but we're not posting it publicly. This was about a year ago at this point. I don't know if you remember this, Chess. Someone made a dark archive of all the government data, and they were like, we have it and we're not sharing it. I'm about to look that up. But like that you could do something like that with metadata.

Starting point is 00:36:17 We're just like, I'm thinking about like DDoS secrets or something like that, where DDoS secrets is they're functioning. Like what they do is legal, but like it's extremely threatened. And so we might want to make copies. of all of that metadata of all the stuff that DDoS Secrets has leaked, but we might not want to say in advance that we're a mirror. Like we're mirroring

Starting point is 00:36:39 all of their content, and it's like basically like insurance, like they have an insurance file up there where it's like everyone download this and if DDoS Secrets goes down, we'll distribute the key and then leak some really damaging information. But it's like insurance in that respect, in this way, it's like, we have a mirror of all this risky data. And if that

Starting point is 00:36:55 goes down, then we have a copy of it and can verify that it is in fact that data. It's like a provable copy of the data and can be a secondary mirror. And to be able to like scale that from widely acknowledged public mirrors of everything, we just are going to like repeat it in the same sense that like a Bluesky app or a Fediverse server is a public archive of other servers' posts and stuff like that. We might want to be able to scale that to privately archiving and reproducing these things as well. Yeah, that's yeah, that's more of the

Starting point is 00:37:31 the gray area legal nature of what one has to do when information, as we know it, is increasingly illegal to have, that you shouldn't be able to actually have the data, you should only be able to access like a predigested AI hallucination of it. And that's the legal path to accessing information as opposed to like, you know, being able to know stuff directly. Yeah. And definitely, like, we talked, I think it was last week with Hagen Blix, about how knowledge is part of, it represents a power hierarchy.

Starting point is 00:38:13 And so, you know, your ability to go to college gets you, is reflected of your already existing part in the hierarchy and where you exist and you want to be intellectual property is part of that knowledge hierarchy. And the part of AI is to get more people who don't have knowledge, knowledge, meaning like, we can think of the word knowledge here to mean proprietary information or data or things like that, to get that knowledge away from them so that you can then rely on gate workers with less knowledge. So it's the same like Luddite, deskilling sort of thing of

Starting point is 00:38:43 you can then be pushed further down in the power hierarchy. We don't need professional managerial people if we can have an AI that can kind of take that knowledge away from you. And so you don't have to have it. The ownership class doesn't have to have it, but they own it, right? So thinking of knowledge as power is really, as part of the power hierarchy is also very useful. So I also wanted to ask, because at this point, I'm just going to split this into two episodes, so we can go like 10 more minutes. Oh, David, every time. Sorry. Well, we're going to be on hiatus for a while when I move, so it's like, I would like to have

Starting point is 00:39:18 things I can, like, split out, too. If there, if there's nothing, I hope it's that me being hyper, verbose and talking about shit for way too long does is it's give you a little bit of heat off when you need to give an episode while you do. I'm glad that's a good outcome. It's totally great. There's part of one thing. There's two things I want to get to, so you can choose which one you want to do. For someone who has no idea, how can they start to help?

Starting point is 00:39:49 And then the other one is a discussion of, web scraping because you said that's the really fun part so which one are you more excited about I think probably web scraping I saw your face I don't know the how to how to get involved is probably more important like okay I can also do that at the very very end yeah well like the fun off road okay yeah so the other by the fun off road part of the web scraping is basically like you're sort of like doing reverse engineering and a little bit of hacking in a way that's like all of it is technically illegal in the same sense that like using a computer is technically illegal because like the computer fraud and abuse act is so broad. Like, you know, that like scraping a website is all,

Starting point is 00:40:33 like visiting a website is illegal. If they could argue that it's against the terms of service for you right now to access our machine. And so like, but like that's been like a joyful part of this of watching folks. So in our group, we have a mixture of folks that are like old web scrapers that do this all the time and new people who have never done it.

Starting point is 00:41:00 And so that's like, I guess this is a hybrid answer to both of these things. It's like, how do you get involved? It's like you just like look around and see what you care about and what matters to you and what you think is at risk.

Starting point is 00:41:16 And you go and grab it. And alternatively, like, part of this is also scouting for that information. Like, we thrive off of people who don't know computer stuff and but know something's at risk. And to be able to say, hey, we need help over here. Can someone come and help us, like, handle this big thing that's about to go down? So, like, this has happened a couple times where, like, we'll just be, like, alerted to something and just, like, set the dogs of hell loose on, on this, like, like, like, you know, set of, but, but like, the, the barriers to participate in are very low because, like, one, if you don't know how to do it, we've written documentation that, like, not complete by any

Starting point is 00:42:00 means. And, like, you can just go into the Sciop documentation and just see if any of that applies to you and do it. Like, if that doesn't work for you, just tell us, I don't know how to do this. And we'll do our best to, like, describe that and to facilitate you being able to, because it's like, we love to scrape the web, but also the whole, like, as we've been talking about, just like, the whole test of whether something is useful is like whether you can do it without me. Like, if you can't do it without me, then it's shit. And we really haven't done anything or moved the needle as far as like autonomy or power goes. So I want to make sure that people can do that on their own. So

Starting point is 00:42:36 give it a shot, try and do it using the information that you have available. And then when you get stuck, just like yell at me and yell at the rest of us, just like raise an issue or like you can talk to any of us on the Fediverse to say, like, how do I do this? And then, like, then the second part of that is just like, just becoming a seed is a very, it's like something you can do in 10 minutes. If you just go and, like, get a torrent client, let that run in your background and get one torrent. Go on the site and just look for like the thing that has very few seeds on it. And it looks important to you, grab that. We have also, for people who have more resources, like more storage resources, more bandwidth resources.

Starting point is 00:43:19 Like we've set up a system of RSS feeds for these torrents. This is like an old feature that like a lot of BitTorrent clients support. So for example, like we tag all of our datasets. And so when I look at the server logs and see what sites are being, I mean what URLs are being accessed the most. The most frequent is like the LGBT RSS feed for like getting all of the torrents that have to do with queer people, period. And just like that's, so, like, that's one of the biggest things that's been lost.
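
As a rough illustration of what a BitTorrent client does with a tag feed like that, here is a minimal sketch in Python. It assumes the feed follows the common convention of attaching each torrent as an RSS enclosure of type application/x-bittorrent; the sample feed, its URLs, and the function name torrent_links are made up for the example and are not taken from the Sciop site.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real client would fetch this from a feed URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example tag feed</title>
    <item>
      <title>Dataset A</title>
      <enclosure url="https://example.org/torrents/a.torrent"
                 type="application/x-bittorrent" length="1234"/>
    </item>
    <item>
      <title>Dataset B (no torrent attached)</title>
    </item>
    <item>
      <title>Dataset C</title>
      <enclosure url="https://example.org/torrents/c.torrent"
                 type="application/x-bittorrent" length="5678"/>
    </item>
  </channel>
</rss>
"""


def torrent_links(feed_xml: str) -> list[tuple[str, str]]:
    """Return (title, torrent URL) pairs for items that carry a torrent enclosure."""
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # item has no attachment at all
        if enclosure.get("type") != "application/x-bittorrent":
            continue  # attachment is something other than a torrent
        links.append((item.findtext("title", default=""), enclosure.get("url")))
    return links


if __name__ == "__main__":
    for title, url in torrent_links(SAMPLE_FEED):
        print(title, url)
```

A client that supports RSS does roughly this on a timer and adds any torrent it has not seen before, which is why subscribing to a tag feed is enough to keep mirroring everything under that tag.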

Starting point is 00:43:52 And, like, actually, we can't, it's like just, it's taking a side tangent. I see there's a note of just, like, in the show notes about what has been lost. And that is a tragic loss. There's a lot of biomed research and a lot of, like, ethnographic research that about just queer people that has been lost, in part because it has PII in it. and like the, we can't just make like a public archive of this, of this data, but the people who were the curators of it and the holders of it were forced to take it down, remove it, or usually hide it, not actually destroy it. But like that, those, that's been the biggest casualty so far

Starting point is 00:44:30 of unrecoverable information that we cannot retrieve unless the people who, you know, are the researchers or people holding it. Yeah. But that's like, that's the, that's the, main thing is like when there is something around you that you care about, that you want to preserve, that's the best way to get involved is that like you are the expert of your local domain. Like you don't be waiting for someone else to tell you, go and get something. I'm sure that you're, you know, anyone is already aware of this. And then just like, like I said before, if you need help doing that, that's what we're here for is to try and make like that possible for you.

Starting point is 00:45:07 And then to facilitate a group of people around you that also care about that thing and also want to help preserve it. I don't know if that's content enough getting started. No, no, no. Actually, so for example, I'm trying to think of like, if I say that there is a data set on my institution or repository or somewhere I used to work and I go, oh, that's probably under danger. I could create, could I create a torrent for it and get

Starting point is 00:45:35 that onto the Sciop tracker so that, because I need people to find the torrent that I've created, like me as an individual person. So I get that. Is there a way for me to submit that torrent to the Sciop tracker? Yes, that's the, yes, do it. Yes, that's, that's real functionality of what Sciop does is yes,

Starting point is 00:45:54 you can submit things to it. So it's like, it's relatively simple. Like, there's a bunch of like moderation tooling that we have yet to build. But like, you go and make an account on Sciop. And for now, it's just a matter of like, you just need to tell us that's you. We're working on messaging and commenting and inter-site communication so that you can do that in-band.

Starting point is 00:46:16 But now just like, make a thing, tell us that's you, we'll give you upload permissions. And so it's like the way that our permission system, it's like a tiny bit different than the way that a lot of other web platforms work in the sense that just like we're trying to embody a model of soft security where like the way to make a system secure and safe is like to limit the damage that someone can do

Starting point is 00:46:41 without constraining the things that they can do. So it's like in the same way that Wikipedia allows any anonymous user to edit the site. The reason why that isn't a catastrophe is because there's like abundant means of preserving history, of monitoring changes and making, like just making and discussing these changes. So. And also undoing is easier than doing. Exactly. So we're like we have a, this is work in progress stuff, but like it we're working. It is the case of just like we have.

Starting point is 00:47:11 like wiki like edit logs and history that can be rolled back and stuff like that. But in the meantime, just like the, you go on the site, you make an account, you can already create a dataset. You can already create an upload without being given permissions yet. So you make them, they're just not visible to anyone but you and the administrators. So just like, this thing is done. The data set has is described, it has a description, it's got metadata, it's got a title and everything like that. Like, this is ready. for someone else to find it. And then the upload part of that is,

Starting point is 00:47:44 I put a torrent on the website, and here it is. Then we'll review that and then approve it. And then that's the whole story. Then basically you just need to seed it. And so, like, there's this initial period where if you make, so we have tools for making torrents that include adding webseeds and indeed making torrents from webseeds.

Starting point is 00:48:08 Like, again, our beloved triple shrimp, Prolific Archivist doesn't have the biggest bandwidth in the world and doesn't have the most hard drives in the world. So something that they will do is they'll make Webseed-only torrents that are just like, literally just like download data, hash it, and then delete it. Download data, hash it, delete it. Just create the torrent file that has a reference to where the data come from, and that's it.
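
The download-hash-delete loop described here can be sketched in a few lines of Python. This is only an illustration under some assumptions, not the actual Sciop tooling: it builds a single-file, BitTorrent v1 style torrent, takes the data as an iterable of byte chunks (in real use those would be blocks read from an HTTP response), lists the source URL under the url-list key that clients use for webseeds, and leaves out the tracker announce key. The names hash_pieces, bencode, and build_webseed_torrent are invented for the example.

```python
import hashlib
from typing import Iterable


def hash_pieces(chunks: Iterable[bytes], piece_length: int) -> bytes:
    """Concatenated SHA-1 digests of fixed-size pieces.

    Chunks can be any size; at most one partial piece is kept in memory,
    so the data itself is never stored.
    """
    digests = []
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= piece_length:
            digests.append(hashlib.sha1(bytes(buffer[:piece_length])).digest())
            del buffer[:piece_length]  # "delete it": drop the hashed bytes
    if buffer:  # final, shorter piece
        digests.append(hashlib.sha1(bytes(buffer)).digest())
    return b"".join(digests)


def bencode(value) -> bytes:
    """Tiny bencoder for ints, str/bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])  # keys must be in sorted order
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def build_webseed_torrent(name: str, chunks: Iterable[bytes],
                          piece_length: int, webseed_urls: list[str]) -> bytes:
    """Hash a stream of chunks and return the bytes of a webseed-only torrent."""
    total = 0

    def counted():
        nonlocal total
        for chunk in chunks:
            total += len(chunk)
            yield chunk

    pieces = hash_pieces(counted(), piece_length)
    info = {"name": name, "length": total,
            "piece length": piece_length, "pieces": pieces}
    return bencode({"info": info, "url-list": list(webseed_urls)})


if __name__ == "__main__":
    # Stand-in for blocks read from an HTTP download.
    fake_download = [b"abc", b"defgh", b"ij"]
    torrent = build_webseed_torrent("x.bin", fake_download, 4,
                                    ["https://example.org/x.bin"])
    print(len(torrent), "bytes of torrent metadata")
```

The point of the design is that the person making the torrent only ever holds one piece of the data at a time, and the resulting file just records the hashes plus where the data can be fetched from.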

Starting point is 00:48:35 So they won't even be a seed in the initial swarm, but other people can download it just by the webseed-only torrent. And so, like, that's the last part of it is, like, then if the data comes from something, like some HTTP server and that can serve as the initial seed, great, otherwise you just stay online and seed the thing until, like, other people come along and seed it, and that's the participatory nature of, you know, group archiving, is other people, like stuff exists and will be preserved and seeded to the degree that other people believe that it should be seeded, backed up, archived and stuff like that. So like that's like the

Starting point is 00:49:16 means of that you put it on there, announce everyone, hey, help me back this up. And then the people that will do that will show up and do it. I think that's like the one thing that we've seen quite a lot is because that one aspect is really unintuitive is that we've had a few cases. where someone's scraped, downloaded a dataset, they've created a torrent, they've uploaded the torrent, it's been published on the site, and no one's ever been able to download the data

Starting point is 00:49:44 because they didn't realize up front that they needed to keep seeding it long enough for someone else to have a copy. They thought of it as a repository. Yes. So we put notices on the upload form and stuff like that that is like, a torrent contains no data.

Starting point is 00:50:02 It is not a zip file. You need to stay online and seed the data in order to make it available to other people. But, like, yeah, it's just an informational barrier that is a very understandable one because nothing else that you experience on the web works like this. And so, like, yeah, that's a real issue as far as making it acceptable,

Starting point is 00:50:25 making it understandable what the system does and how it works without overburdening someone with a big, long lecture about how BitTorrent works. I'm just thinking we need to start, like, a librarypunk academy where we just have workshops like the Software Carpentry ones, where it's just like, how to BitTorrent, what is a BitTorrent, and then we just start posting. Because part of me was just like, because I'm a Carpentries instructor, part of me is just like, oh, we should just have a Library Carpentry for BitTorrent, and probably we

Starting point is 00:50:52 should, uh, there should be a Software Carpentry for BitTorrent. I don't think there is, but this is the whole point of Carpentries, is like to teach academics the computer skills they need. BitTorrent should be on there. Yeah, very much. Like, I have on my stupidly long to-do list to submit a Programming Historian lesson on BitTorrent and a Turing Way chapter on BitTorrent. Yeah, and we should make those, you know, if you build it in the Carpentries, or if you have it and you want me to do it, I mean, like, turning it into a Carpentries-like style so we can put it on the Carpentries or just have other people teach it in the Carpentries style. I mean, that'd be great. I'd be happy to work with that. See, this is all the shit that I'm terrible at and that's like why I need other people in the world

Starting point is 00:51:41 with different focuses, skill sets, and expertise. I'm terrible in this. And so just like, if y'all folks are good at making educational materials happen, that's like very much. That's another way to get involved is like do the things that we are terrible at. And what we are terrible at is teaching people how to do stuff, writing down what we've done, et cetera. I'm literally adding into the notes

Starting point is 00:52:06 how to contribute to Sciop and it's like create account, create a torrent, find a link for how to create a torrent. That's a note to myself. Upload to get permission so that they are visible to others. Seed. Five bullet points of how you can get started if you want to do it right now

Starting point is 00:52:21 and create a torrent to some data that you want. We have a quick start docs page. Great, I'll have that too. Yeah, but you know, like all our docs are janky and incomplete. I mean, it happens. That's how all good documentation starts, right?

Starting point is 00:52:40 And if someone, if listeners thinking, what can I contribute to this, that sort of the documentation stuff, the outreach and external comms, those kinds of things are things that we're sort of lacking a little bit in our core team at the moment. And like this is like, because this project, you know, lives in so many social systems. Like the folks that are coming that are not from open source world, like one cultural lesson from open source is like, tell us our shit sucks.

Starting point is 00:53:15 We like it when you do that. Like that's, you know, like if you are trying to do something and it doesn't make any sense, it's a compliment to me and a compliment to other people that you tell me this sucks and I hate it. Because that means you care enough about something that you, to tell me that it's broken and it should be fixed.

Starting point is 00:53:33 So like, don't be shy and say, well, they talk for like, you know, four hours or whatever it is about all of this like lofty ambitions, but then I try and go here and it sucks. Like, yeah, I know it sucks. Like, you're not going to surprise me by telling me that and you're not going to offend me because I already know that. Like, I'm way ahead of you. I'm thinking everything I do is terrible. But like it's actually very helpful when people say the things that don't work for them. So like we have documentation about how to make torrents and how to use the site and everything like that. But like if you, if it's a maze to you and you can't find it or like this doesn't make any sense to you or I've never touched a terminal in my life or whatever, then like, yeah, that's your, your input is needed. And I'm just thinking about Microsoft documentation and how much of it is just like absolute bullshit labyrinthine sort of find the one line in the one article about, oh yeah, by the way, you can't do it if you don't have this very specific permission.

Starting point is 00:54:31 And I'm like, yeah, I could test documentation for you guys. We'd love to have that. Also, yeah. And like, OCLC's documentation, they put everything on different pages so you can't just control F through the documentation.

Starting point is 00:54:47 And everything's on a different website. And genuinely, I think the best use case for Google adding their stupid AI summary on top of everything is because, one, OCLC's web presence sucks on Google. But two, like, it will find the actual right document page

Starting point is 00:55:03 because it's scraping all of them to pull the data and it will at least get me to the part of the OCLC documentation I need because luckily all of OCLC's documentation is on the open web, which is not true of all library vendors. I'm pretty sure Alma isn't. So when I change jobs to Alma, that's going to go away. Good luck, Godspeed. That'll be fun.

Starting point is 00:55:25 How is it that so many of these library vendors are so awful at Discovery? Like they literally try to sell us Discovery platform. sense. I mean, genuinely, I think OCLC knows their discovery platform called discovery sucks. You can email them a problem with discovery and they go, yeah, it just does that. They'll literally they'll say, yeah, it just does that. Amazing. As though they have no, no agency whatsoever over it. It is its own independent entity. That just reminds me of a time I emailed some vendor to be like, hey, does this product do this very specific thing that is incredibly useful to do for a, God, what is it? Like a computer reservation program. So like, so patrons can reserve computers. And they were like,

Starting point is 00:56:15 oh, yeah, it doesn't do that. It would be nice if it did, though, right? But that's kind of too hard for us right now. And I wanted to be like, where is the thousands of dollars? Yeah. But I'm just like, you don't even have like a feature list. Like these are the things. things you can submit to vote on. Like, or like, yeah. Anyway, I just, I always think about that and just being like, okay, well, that's, that's a you problem. Maybe. Can you help me out?

Starting point is 00:56:44 This is why no better, no matter how bad open source community shit gets, I'm always like, remember the alternative. Yeah. Which is that this is all intellectual property and no one can fix it. And no one will fix it. And they will continue to fire their staff who can fix it. It's like, it's like, yeah, one of the things I love it as being like a open source program is just like I always have no resources. And so I can always say, we'd love to do that, but I've got no money. And so like if I ever go into the situation where I actually had resources and had to be responsive to, you know, my whatever share, what do you call? You customers and Patriots. Stakeholders, customers.

Starting point is 00:57:24 That would be just like, God, I actually have to do stuff because I actually should be able to do this, you know. this is why we don't have a Patreon because then people would be like why why do your transcripts not so good i'm like someone can do them for free but i can't spend six hours every episode doing them if we had a patreon i would feel like morally obligated to do to do like manual transcripts even though all of our transcripts are online on podtranscripts.com now i don't know how long that website's going to exist but they do it for free so if you have a podcast submit it to them you can just submit it i just submit it i just We weren't already on there. Submit to it, they will do all of your episodes. They will download them and do a pretty good transcript. But I have been transcribing ours with the speaker. It won't have the speaker identified, obviously. But all the words will be there and you will have a searchable transcript.

Starting point is 00:58:15 And of all the like bullshit kind of website startups, I'm like, hell, this is a good service. So go check them out. You don't have to pay money to search one transcript. But if you want to search multiple podcasts, then they charge you money. But you know what, that's good service. I mean, a lot of these bullshit startups start up with a good idea and implement it well. And it's only when like the we must grow and continue to grow and number must continue to go up at all costs that it all starts to go south. It's like we've talked about this internally with like the safeguarding group just like, how could we like get money to do this?

Starting point is 00:58:55 And it was like, what could, like, what could our service be? Like, what could we, like, what's actually like the thing that we could sell that, like, it's like, the ideal case is like, if you want to do something for free, it's just sort of like, yeah, we do the thing that we do for free. And then there's some ethical freemium upsell deluxe package or whatever that doesn't compromise the core, you know, not getting into whether that's possible or whether that always, you know, undermines its own. Sure, but you got to think about it. But, like, in our case, it's like, we could sell network congestion as a service. Like, that's the main thing we produce. We've got too much bandwidth.

Starting point is 00:59:37 We'll use it all up for you. Don't worry. We'll make it our problem. Right. And it's been, like, it's been one of the biggest shames to see. So, like, like, hard drives being cheap was just a true thing for, like, most of the contemporary history of computing. And like, it's like you can tell when, when hard drives are scarce, like something is going

Starting point is 01:00:03 really wrong in the world. And so it's like, you know, the last time before this one was, again, this is like, coming back to Bram Cohen's grift company and stuff like that, like when they launched Chia and turned proof of proving that you have wasted a ton of hard drives space into a currency. and there was a huge run on hard drives that then became a bunch of e-waste. That was like something is sick in the world. And now it's like, now the hard drives are all gone

Starting point is 01:00:36 because you know why. And like, yes, indeed, again, something is very sick in the world. And so like go out and get more hard drives, but not to poison the ecosystem and to, you know, store like the hallucinated text that you then cannibalize your models on,

Starting point is 01:00:54 but go get hard drives to make sure that we can continue to have climate data. I've got very into scouring eBay for secondhand hard drives last year. Yeah, a lot of things, I recently, because what I did most recently is I bought a 12 terabyte non-solid state drive, like an old server hard drive that you'd just be in a rack, and just wiped it and used it because I needed to do a full backup of my other. storage hard drives that are like two and three terabytes. So I'm like, I need a big one. And I was just talking with someone about like hard drives, another person who works in IT. And just the, just any kind of hard drives are shooting up at price. And it's kind of like, do you wait it out or do you buy another 12

Starting point is 01:01:41 terabyte drive right now and find, you know, find a used one? And everyone yelled at me for buying a used hard drive. They're like, that's a security risk. I'm like, I'm going to reformat it. It's fine. Wait, security risk to you? Yeah, to who? Yeah, because I, I don't know, because I told him I had to reformat the drive and fully wipe it. And they're like, but there could be stuff on it. Sure, yeah. Security risk.

Starting point is 01:02:03 That's their problem. That's what is it. That's what nuke and boot is for, or Boot and Nuke is for, right? Yeah, well, they were saying that me reformatting it was not enough to get rid of any risks on the hard drive. Sure, yeah, that's, but like you're not going to go and execute any latent binary that you find in like the nether regions of the hard drive, are you? Like, that's, yeah. I know, it's a storage drive.

Starting point is 01:02:28 It just literally is like I point my torrents to it. And it's like, that's download there. This is, I would say, of the things that like the sciop chat is concerned with, how to make giant arrays of hard drives out of dog shit is like one of the major points of conversation. I have this, like, a set of images that I use in all of my. posters now where like we so we're just like trying to make the argument to i like when i go to conferences and stuff like that i'll just like go and talk to the other archives there and show them like

Starting point is 01:03:02 this is what Sciop runs on. Like, because you know AWS wants to sell you this image of just like squeaky clean space age server racks that go for miles into the core of the earth and stuff, and like we're like, so the different aesthetic, like one of my favorite images in the world is of a Sciop seedbox rig where it's a normal desktop case that constitutively does not have a side on it as far as I'm aware. But they ran out of hard drive bays

Starting point is 01:03:36 and so what they did is they took yellow tape and like taped a sling underneath the hard drive bays. So it just sort of like through the tape, screwed the tape into the drive bays that exist, and then now the tape hangs down, and then had a bunch of drives stacked slung in the tape,

Starting point is 01:04:01 each of them screwed through the tape, and so I've got to just show you the image of these four hard drives just like sloughing off into the loose space of the computer case in a tape sling. And it's like, so now it's like whenever people are saying like, how do I, what, like, physical enclosure should I use for my drives? It's always like yellow tape sling, obviously. Like, that's the most, like, most efficient.

Starting point is 01:04:30 The hard drive hammock. Yeah, exactly. And it's functional, too. It's amazing how much it reduces the vibration transmission to your desk or floor. So the whole, the whole rig is much quieter. Was this rig yours, Janice? No, no, no. I feel bad, like, we should be crediting folks, but it's also, yeah,

Starting point is 01:04:49 Like I said before, we don't know how much people want to be known for stuff. But yeah, like... Well, if you can get me that photo, it's going to be the photo for the episode. Hell yeah, I'll get it to you. I've got it on. So I set a reminder in Discord so you can get it later. When you were talking about the coordinating of scraping of targets, I kind of heard that you were working on some tooling for that so that you can automate it.

Starting point is 01:05:14 I'm curious about that if we have time. So... Yeah, the warrior. So, ill-determined name. So there's this thing that this group Archive Team, that's like loosely affiliated with Archive.org, does.

Starting point is 01:05:30 They have this thing called the Archive Team Warrior. And what that is is like, it's like a Docker service that you can run where, like, so Archive Team does this exact thing of distributing scrapes for at-risk stuff. And so it's like when they'll go, they'll have a scrape project, you'll go and run the Archive Team Warrior for that scrape project. And then what that will do is it will, they have a list of URLs to be scraped. You'll go claim a set of those. Say, I'll get these ones. You scrape it, submit the data, and then say, this URL list is done. And you also probably will return additional URLs that you encountered along the way.
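
The claim, scrape, submit cycle described here can be sketched as a small in-memory work queue. This is only an illustration of the idea, not Archive Team's actual tracker or protocol: the `UrlQueue` class, its method names, and the example URLs are all hypothetical, and the scraping itself is left out.

```python
from collections import deque


class UrlQueue:
    """Toy coordinator for a distributed scrape (hypothetical, in-memory only)."""

    def __init__(self, seed_urls):
        self.pending = deque(seed_urls)  # URLs nobody has claimed yet
        self.claimed = {}                # url -> name of the worker holding it
        self.done = set()                # URLs whose data has been submitted

    def claim(self, worker, n):
        """Hand up to n unclaimed URLs to a worker."""
        batch = []
        while self.pending and len(batch) < n:
            url = self.pending.popleft()
            self.claimed[url] = worker
            batch.append(url)
        return batch

    def submit(self, worker, url, discovered=()):
        """Mark a URL as done and queue any URLs found along the way."""
        if self.claimed.get(url) != worker:
            raise ValueError(f"{url} is not claimed by {worker}")
        del self.claimed[url]
        self.done.add(url)
        for new_url in discovered:
            already_known = (
                new_url in self.done
                or new_url in self.claimed
                or new_url in self.pending
            )
            if not already_known:
                self.pending.append(new_url)


# One worker claims the root URL, "scrapes" it, and reports two links it saw.
queue = UrlQueue(["https://example.org/"])
batch = queue.claim("worker-a", 5)
queue.submit(
    "worker-a",
    batch[0],
    discovered=["https://example.org/about", "https://example.org/"],
)
```

Because already-seen URLs are skipped on submit, the queue only grows by pages that are actually new, which is the same way a recursive crawl finds its work as it goes.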

Starting point is 01:06:07 So we've done something similar to that where, and it's, and so it's like, this is sort of like an ill-defined, like, that's a well-defined problem for like web scraping, web scraping, like HTTP site scraping, but like there are lots of data sets that we encounter that are like a big S3 bucket that has like logical divisions in it, like, you know, this

Starting point is 01:06:29 this like data set over a year. I'm thinking the main thing we designed this for was like the Chronicling America dataset. And so this is like a set that is released as bundles as like, what are they, what's the term they use for each of the things? I forget, but they have like fancy names

Starting point is 01:06:47 and shit like that. Like, they're all named after, like, sci-fi characters. Let me see if I can, yeah, Chronicling America. Like, so they have, like, and they're each of, so, you know, Chronicling America is a project across a bunch of different libraries. And so they'll be like, the library abbreviation hyphen Falcor, like, DLC, and IUNA underscore EGRIT. And so it's like, they're like, you know, little codenames that are batches of the

Starting point is 01:07:17 data. And so just, I didn't give any introduction to what Chronicling America is. It's an archive of all of like the local newspapers from the United States through its history. So what we did is we made a thing where that is the unit of division. So we have a thing, a concept on sciop called a data set. And that is like, just like the abstract description of something that exists out there. And then a data set will have data set parts. And in this case, these are these like batches that Chronicling America will release. So we made a sciop team scraper, the sciop team warrior, like, the worst name of all time. That is like a similar thing where you run for a given project, in this case, like the chronicling everything, like go and get me a batch of this. And so I will go, it will hit sciop, say, I've claimed this one. I'm going to get this one. You'll do a scrape, you'll create the torrent, upload the torrent, and then say that one is done.
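
A minimal sketch of the data set and data set parts idea, with the claim and done steps as a small state machine. This is an assumption-laden illustration, not sciop's real data model or API: the class names, field names, part names like `batch-a`, and the torrent file name are all hypothetical, and the actual scrape, torrent creation, and upload are omitted.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class PartStatus(Enum):
    UNCLAIMED = "unclaimed"
    CLAIMED = "claimed"
    DONE = "done"


@dataclass
class DatasetPart:
    slug: str
    status: PartStatus = PartStatus.UNCLAIMED
    claimed_by: Optional[str] = None
    torrent_name: Optional[str] = None


@dataclass
class Dataset:
    """Abstract description of something out there, split into parts."""

    slug: str
    parts: Dict[str, DatasetPart] = field(default_factory=dict)

    def add_part(self, slug):
        self.parts[slug] = DatasetPart(slug)

    def claim_next(self, worker):
        """Give the worker the first unclaimed part, or None if none are left."""
        for part in self.parts.values():
            if part.status is PartStatus.UNCLAIMED:
                part.status = PartStatus.CLAIMED
                part.claimed_by = worker
                return part
        return None

    def complete(self, part_slug, worker, torrent_name):
        """Record the uploaded torrent and mark the part as done."""
        part = self.parts[part_slug]
        if part.status is not PartStatus.CLAIMED or part.claimed_by != worker:
            raise ValueError(f"{part_slug} is not claimed by {worker}")
        part.torrent_name = torrent_name
        part.status = PartStatus.DONE


# Hypothetical data set with two batches; one worker claims and finishes one.
ds = Dataset("chronicling-america")
ds.add_part("batch-a")
ds.add_part("batch-b")
part = ds.claim_next("worker-a")
ds.complete(part.slug, "worker-a", "batch-a.torrent")
```

Keeping the claim on the part rather than on individual files means two people running the scraper at once will not both download the same batch.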

Starting point is 01:08:21 And so we want to generalize that, because currently it's pretty labor intensive to create like a task, like a scrape task. We called them quests. It's like this terminology is so terrible. But like make it so that it's possible for someone to say, hey, over here, there's a data set that needs to be scraped and say here are the subdivisions of it that people can go grab. And maybe we should, like, again, automate the process of discovery. Like the way that Archive Team and, well, general recursive web scraping works is you start at some root URL, but in the process of scraping you discover all of the other URLs that you need to be getting. And so we need to, like, make tooling that will allow that process of, over here

Starting point is 01:09:08 there's a data set that needs to be gotten, here are its pieces, and so to go and help get it, run this command that is, you know, sciop scraping, name of project. And then it'll automatically handle the, go and grab the data, create the torrent, upload it. And it even handles, like, adding it to your torrent client,

Starting point is 01:09:29 so it, like, auto-seeds for you. And so, like, that's, like, the tooling that we're working on. And the goal of that is, again, like what we were mentioning about, like, the distributed archive.org, like what that could look like, where instead of, when I want to go to the Wayback Machine

Starting point is 01:09:45 and archive a URL, like what that does is it sends archive.org servers to go and grab the thing, archive it, and scrape it for you. Like to make it so that, like, you can do that and like anyone can do that. So if I'm running my version of the, you know, local scraper and I want to grab something, I can just do that, put it in the thing, go grab URL, and it dispatches that out to my scraper. Or we could have like a standing army, a pro-social botnet, as I have termed it. It's like people that are saying, like, when there is scraping to be done, you can use these extra resources on my computer. Like I allocate X amount of memory, bandwidth, storage, whatever. And like,

Starting point is 01:10:27 I will go grab that website for you, create like a signed copy of it and make an upload of it for you. So you have the scraping part distributed. And then again, once we figure out, like, the lovely folks over at Browsertrix, Webrecorder, like get the BitTorrent-backed web view, then the goal of that will be to basically replicate what the Wayback Machine does, where you go to sciop.net slash URL, whatever prefix that is, and then in the way that you do this with, well, it used to be like 12ft.io used to work like this, and those kinds of paywall things. You just put the URL after that prefix, and you can go be served some BitTorrent-backed archive

Starting point is 01:11:10 of that web page. So that's like an actual goal. But like we're, it's still very, it was like most things in sciop, like done out of necessity and trying to respond to some immediate need. And with the idea of generalizing it later, that's sort of like the status of that distributed scraping project is that like there are a couple of projects that we needed that for. So there's like Chronicling America was the big one. But then there's like an NPR scrape, a Department of the Interior scrape. I think like some Department of Justice scrapes that are in there. But we want to in the future generalize that into something that's much more possible. Because that's ultimately the question, it's like, how do I get involved and trying to bring that barrier down as much as possible to the point where you can distribute labor between people who are like scouts that know about stuff that's at risk but don't have a big scraping system and people who do have that but don't know about what they need to be going

Starting point is 01:12:06 out and scraping. So that's like. So you would just kind of run an EXE on your computer that once in a while would get a task to scrape, it would scrape, download, and upload for you? Or is this all in the browser instead of a separate EXE file? At the moment, it's a program that you're running on your machine, not in the browser, but like, that's with the magic of... When you started talking about the distributed archive or Wayback Machine, I was, I got confused as to like if that was part of it. That is, yeah. So, ideally, a lot of it would be

Starting point is 01:12:42 triggerable and browsable and interactable from the browser but like some parts some things browsers can't do and that's good like browsers can't access the file system

Starting point is 01:12:52 we don't want them to like so like there's just like some anyway that like I said it's very much in a prototype stage now and so yeah we'll come back to it

Starting point is 01:13:04 in the future when it's fleshed out and we'll do a different episode yeah but that's the dream okay yeah it's good it's good if there's someone who has expertise and would like to reach out to the sciop team to join as a member. Is that something they could do? Totally. You can hit me up on the Fediverse, like my, I don't know, we'll put my

Starting point is 01:13:21 handles in the description, but then also we have an email that we respond to. It's a group email. We have yet to make, so that is contact at, oh God, I need to get this right. Contact at safeguar.de. Is that right? Yeah, that's right. Contact at S-A-F-E-G-U-A-R dot D-E. We're like a... I should just put it in show notes. I'm doing my usual thing of stuffing links and things in the chat, but like the actual show notes. Cool URLs are our downfall.

Starting point is 01:13:55 We like, it would be easier to just say safeguard.com, something that could be said out loud, but having the cool... You can email us. Anyway, I was telling a professor about Plum X one time, which is an Elsevier thing. and he says, yeah, I saw the Mexico site, but it's not anything. I'm like, no, it's not a Mexico site. Yeah. It's, it's, it's, it's not a Mexican project. It's, it's, it's actually Dutch now.

Starting point is 01:14:26 Like half of the publishing ecosystem. And like, so we, we will set up a public chat, a Zulip chat at some point. We just like have yet to do that. But yeah, so if someone has expertise, like, email us, like just DM me on the Fediverse, whatever, and then we'll get you set up. But in the future, we'll actually have, like, a reasonable public chat. We have a Discourse forum.

Starting point is 01:14:53 It's neglected. Don't direct people to the Discourse forum. No one's moderating it right now. But I don't think it's become overrun with spambots quite yet. I actually just don't know. It's like Discourse is something that I want to love because I love forums, but, like, I have never been able to, like, actually, reliably use them

Starting point is 01:15:13 because I find that sort of awkward. C4 mentioned out about ATT. Yeah. All right. So I think we're good to wrap. We can always do more episodes in the future as the project comes along. And then, so I will put your contact info, any of the contact info

Starting point is 01:15:29 you want to share, will go into show notes. And the tools we mentioned, I've been making a list of, the media mentioned, and then just a quick, like, how to contribute to sciop. Was there any, like, absolute last thing you wanted to cover, Sadie? No. Okay. All right. Well, thanks so much for coming on. I've been really looking forward to this one. That's why I let it go long because I really wanted to talk about this for a while.

Starting point is 01:15:50 Yeah, glad to chat about it. It's like, you know, labor of love. It's something that like I will be doing for the indefinite future in my life because it turns out like this is what I was put on earth to do is apparently be trapped and entangled with BitTorrent in different ways throughout my life. But yeah, I'm glad to be able to spread the good word. And like, we love, we love our librarians. The librarian audience of librarypunk is like the core, like, group of people who we love to, you know, just would love to know and have better contact with, but have been doing a bad job of reaching out to and communicating with. So we appreciate like, like, like, the, just like, also just like the perspective of like thinking about it as like a political and like a, and as a social problem. Just like, it's a perfect match between these two universes that we live in. So I'm glad to be able to come on and chat with you about it. Great. All right. Good night.
