Tech Over Tea - Software Heritage Co-Founder & Former Debian Leader | Stefano Zacchiroli

Episode Date: May 8, 2026

Today we have Stefano Zacchiroli, primarily to talk about Software Heritage and the service it runs, the Software Heritage Archive: an organization that aims to preserve our open source software history for generations to come, long after the individual repos are gone.

==========Support The Channel==========
► Patreon: https://www.patreon.com/brodierobertson
► Paypal: https://www.paypal.me/BrodieRobertsonVideo
► Amazon USA: https://amzn.to/3d5gykF
► Other Methods: https://cointr.ee/brodierobertson

==========Guest Links==========
Software Heritage: https://softwareheritage.org/
Archive: https://archive.softwareheritage.org/

==========Video Platforms==========
🎥 YouTube: https://www.youtube.com/channel/UCBq5p-xOla8xhnrbhu8AIAg

==========Audio Release==========
🎵 RSS: https://anchor.fm/s/149fd51c/podcast/rss
🎵 Apple Podcast: https://podcasts.apple.com/us/podcast/tech-over-tea/id1501727953
🎵 Spotify: https://open.spotify.com/show/3IfFpfzlLo7OPsEnl4gbdM
🎵 Google Podcast: https://www.google.com/podcasts?feed=aHR0cHM6Ly9hbmNob3IuZm0vcy8xNDlmZDUxYy9wb2RjYXN0L3Jzcw==
🎵 Anchor: https://anchor.fm/tech-over-tea

==========Social Media==========
🎤 Discord: https://discord.gg/PkMRVn9
🐦 Twitter: https://twitter.com/TechOverTeaShow
📷 Instagram: https://www.instagram.com/techovertea/
🌐 Mastodon: https://mastodon.social/web/accounts/1093345

==========Credits==========
🎨 Channel Art: All my art was created by Supercozman
https://twitter.com/Supercozman
https://www.instagram.com/supercozman_draws/

DISCLOSURE: Wherever possible I use referral links, which means if you click one of the links in this video or description and make a purchase we may receive a small commission or other compensation.

Transcript
Starting point is 00:00:00 Good morning, good day, and good evening. I am, as always, your host, Brodie Robertson. And today we have someone who is from the soft... It's just called Software Heritage, or is Foundation in the name? I think it's just Software Heritage. Okay, sorry, I couldn't remember if Foundation was in the name or not. We have Stefano from Software Heritage who... Well, you've also done various other things in the past, and we might get to some of that stuff.
Starting point is 00:00:22 But the main thing here is to talk about that. So I guess introduce yourself and maybe mention some of that stuff you've done in the past, and yeah, we'll go from there. Hello, Brodie. Hello, everyone listening and watching us. So I'm Stefano Zacchiroli. I also go by Zack. That might be easier to pronounce
Starting point is 00:00:38 for the English-speaking crowd out there. So I guess I have two main lives to mention. So as a profession, I'm a researcher. I'm a professor of computer science at the Polytechnic Institute of Paris. And I work on software engineering, and I also study open source a lot. I work on reproducible builds.
Starting point is 00:00:56 I work on package managers and package dependencies, a bunch of things that maybe will be relevant later. And in that area, I'm also studying a lot digital commons, which of course Software Heritage is very much related to. We will get to that later. And my second and parallel life has been, for more than 25, I guess 30 years now,
Starting point is 00:01:18 being an activist and a developer in free and open-source software. I've been in a bunch of projects. I've been in Debian for more than 20 years now. I've been Debian project leader for three years. I'm active in the Reproducible Builds project, I'm on the steering committee, and I've also been on the Open Source Initiative board and a bunch of other things, leading me eventually to founding and creating Software Heritage 10 years or so ago.
Starting point is 00:01:42 Okay, so if people know you, they might know you from some of your earlier stuff. Yes, possible, yes. Okay, so I guess we should probably get started with, like, the main thing. What is Software Heritage, and what is the sort of primary goal of the organization? Okay. So Software Heritage is an initiative we created 10 years ago with my colleague Roberto Di Cosmo. And the short stated mission of Software Heritage is to collect, preserve, and share
Starting point is 00:02:14 all software we can find, in source code form. This is the short pitch. So basically, the idea is that there is value in software source code. There is value even if you don't want to run that piece of software. There is cultural value, there is technical value. There is a lot of effort that went into creating that software, and we don't want that value to be lost for future generations. We can get to why it is at risk later if you want.
Starting point is 00:02:40 And so the initiative of Software Heritage is precisely collecting as much of that software as we can, storing it in a way that will make it available, ideally forever, for future generations, and actually giving access to it to everybody. Everybody who needs access to that: can be people doing technical stuff, can be people resolving dependencies that no longer resolve today, can also be scientists like me that want to run an experiment
Starting point is 00:03:04 on the largest body of software source code that is available today that happens to be stored in the software heritage archive. As sort of a, I guess, easy comparison, in my video, I compared it to the Internet Archive for source code effectively. Yes, so that is a common comparison
Starting point is 00:03:21 and we are good friends with our, let's say, colleagues in digital preservation at the Internet Archive. We feel like we are basically pursuing the same mission, specifically for software source code. So that means that we do things a little bit differently. So in particular, when I say source code, we actually not only preserve the source code itself, like the files and the directories that you commonly have when you have a piece of software on your machine, but we actually preserve the full development history. So for instance, if we have archived a Git repository and that Git repository disappears, we
Starting point is 00:03:55 can actually reconstruct it. If it is altered in some malicious way by some attackers, we can actually detect that and reconstruct it to any previous state we have safeguarded. And this is true not only for Git. I mentioned Git because it's the most popular version control system today, but we do the same for Subversion and for other popular version control systems.
Starting point is 00:04:16 And so what we do differently from something like the Internet Archive is basically we also create a global data model, in which essentially all of the histories, plural, of development in the software archive are intertwined together. So we're sort of creating the global public Git repository. It is of course not stored like that, because it would not scale that way. But that is the general idea.
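A rough way to picture the deduplication this global data model enables: like Git, Software Heritage addresses every file by a hash of its content, so a file that appears in a million repositories is stored only once. A hedged sketch in Python (the identifier format here is a simplification of the real scheme):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content-addressed identifier in the style of a Git blob hash:
    sha1 over a 'blob <length>\\0' header plus the raw bytes."""
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

# Identical file contents always map to the same identifier, so the
# archive stores them once, however many forks contain them.
a = content_id(b"print('hello')\n")
b = content_id(b"print('hello')\n")
c = content_id(b"print('goodbye')\n")
print(a == b, a == c)  # True False
```

Because every object is named by its content, the archive can also detect tampering: a maliciously altered file simply no longer matches its identifier.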
Starting point is 00:04:41 So in a sense, if our website disappears, you go to the Internet Archive to actually find it and reconstruct it. If a version control system repository disappears, you come to us for the same purpose, essentially. So, I guess from someone who is a developer, you might understand why a development history is important. But for someone who might not be, why does that matter? Yes. So it matters because to make sense of a piece of software, the development history is fundamental. For instance, it's very common for developers to look at what is called the annotation or the blame of a file and see who is the last person that changed a specific line in a file.
Starting point is 00:05:26 And if you want to know that, you need the full software history. And when you see who is the person who last touched a single line of code, you can look at what is called a commit, which is basically the last recorded change that has touched that line. And a commit comes with a bunch of very useful additional information, metadata, if you want. A commit comes with information about who did that,
Starting point is 00:05:49 when that change was made, and a very precious human-targeted message that essentially the developer used to describe what the change is about and why. So essentially, the version control history gives you the context on when a change happened and what it was about. So that's very precious information for everybody who wants to look at, you know, why things were done and the history of something. Yeah, right now I've got my terminal file manager open as an example, and you can see the history. And even though this is something
Starting point is 00:06:24 which is on GitHub right now, you can publicly see it. This is something that is still being preserved. So it's not necessarily just stuff that no longer exists. You want to preserve it whilst it is still here, so that if at any point somewhere along the line it disappears, you already have everything there. It's not a matter of trying to dig back through and piece things together where there might not be that preserved history anymore.
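As a small aside for the curious, the blame metadata discussed a moment ago can be seen with `git blame --porcelain`. Here is a hedged sketch that parses that kind of per-line record; the sample output below is invented for illustration, not taken from a real repository:

```python
# Invented sample of one line's worth of `git blame --porcelain` output:
# commit hash, header fields, then the line content itself.
sample = """\
1234567890abcdef1234567890abcdef12345678 42 42 1
author Ada Lovelace
author-time 1431993600
summary guard against division by zero
\tresult = total / max(count, 1)
"""

def parse_blame(porcelain: str) -> dict:
    """Extract who last changed the line, when, and why (commit summary)."""
    info = {}
    for line in porcelain.splitlines():
        if line.startswith("author "):
            info["author"] = line[len("author "):]
        elif line.startswith("author-time "):
            info["timestamp"] = int(line[len("author-time "):])
        elif line.startswith("summary "):
            info["why"] = line[len("summary "):]
    return info

print(parse_blame(sample))
```

The "why" field is exactly the precious human-targeted commit message: it survives only if the full development history is preserved.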
Starting point is 00:06:51 Absolutely. So two things about this. So first, indeed, we proactively archive things. So we do not wait for someone to ask or say something to do that. So GitHub is an example. It's one of the forges we periodically crawl to archive software. It's not the only one. So for instance, we have many GitLab instances or other development platforms.
Starting point is 00:07:09 So we proactively do that. And the reason why we do that is that indeed, when something disappears, it might be too late to preserve it. It sounds obvious, but actually it's not so easy to know in advance. So we might discover essentially too late that something was important. And the only way to avoid not having archived it when it was still available is to actually do that, which is precisely what we do. Right, right. I think the work that is being done here is like really cool.
Starting point is 00:07:38 And it did kind of bother me that so few people seem to know about this existing. How long has the organization been around for at this point? So the initiative itself started 10 years ago. Then the organization has been different, because in the beginning we were essentially an incubation project at a research institution here in France, because both me and my colleague were from that research institution, which is Inria. But now we are essentially part still there, and part at an independent foundation that is hosted at the Inria Foundation, which came about a bit later. So the work is now 10 years old, which seems like a lot, but we actually
Starting point is 00:08:16 started very small. So in the beginning, we were, you know, two researchers and two engineers working on that. Today we are a team of more than 20 people. So it's growing slowly, because it's also important to mention that we are a non-profit initiative. We believe we're doing this for the public good, for future generations. So it's not like VC funded. We didn't have that kind of fast kickstart in terms of funding, and it's growing organically, and we think this is a very healthy way of growing that doesn't overstretch ourselves. Right, right. I've got the sponsors page open right now. How much is a diamond sponsor?
Starting point is 00:08:54 So, I don't know, I don't have it in front of my eyes. 250K, in the ballpark. Okay, right, right, okay. So diamond is like the tier where there's no maximum limit to it, I guess. Yes, I mean, it's starting from there, yes. Okay, okay. What did you say the tier under that was? No, I said that
Starting point is 00:09:14 diamond is 250K, I think. Before that there is platinum. Oh, okay, okay, yep, sorry, I misheard you. Okay, so since that like early sort of beginnings of the project, you have, you know, there are some like relatively large sponsors at this point. You know, IBM, Microsoft, Huawei. Like there is, you know, Google, AWS as well. There is sort of, I guess, some drive behind it. But I guess with how much is being stored, how does that money sort of stretch?
Starting point is 00:09:47 Like if you have an archive this big and you're constantly trying to add in new commits, like I don't know how much you can really get into like, you know. Yeah, sure. No, no, I'm happy to talk about this. So first of all, our sources of funding are very diversified. So you see the big names there, but a lot of funding is public funding. So it's a good mixture of both public and private funding. And we believe that diversification in funding is fundamental, you know, to avoid that your
Starting point is 00:10:17 mission is pulled in a direction that you don't like, and governance and all of that, of course. So you have cited the big companies, but there is also a bunch from the French state, and we have a lot of European funding. So it's very diversified in that way. And in terms of how the money is used, as usual in this kind of organization, a lot of the funds are used for actually paying salaries to the people, engineers and managers, that are working to keep the archive up and running. You said there are 20 people. Are those, sorry, are those all paid positions? Yes.
Starting point is 00:10:50 So we also have some volunteer positions, like ambassadors, for instance, that are people that are excited about the project and actually want to go and talk about it in different contexts. Can be scientific contexts, can be museum contexts, can be cultural contexts. Those are volunteer positions. But the other people you see on our team page are actually salaried positions. But so if your previous question was about the sustainability
Starting point is 00:11:13 in terms of infrastructure, for instance: so once you remove the big part of the funds that are needed to pay salaries, of course, you have infrastructure. In terms of infrastructure, we have, there too, a lot of diversification. So we have an on-premise infrastructure. So the primary copy is on servers that we run ourselves. We bought the machines. We have an ops team that's keeping the servers up and running. I think we are at two full racks right now, which are already filled.
Starting point is 00:11:40 Maybe we are edging into our third rack for that. And they are hosted here in France in the data center of. Then we have additional copies. So you've mentioned some of our sponsors that are cloud providers. And actually what they offer is in-kind resources to host a copy of the archive, copies that we control, but that are hosted on AWS, Amazon, or some of those services. And so essentially those sponsors are there to guarantee that we have additional copies, and we cannot lose them to a disk breaking down, a fire, or whatever.
Starting point is 00:12:12 And then, even though this is independent from the sponsor conversation, we have mirrors. So mirrors are independent copies. We have multiple ones in Europe for now. That means essentially, even if we wanted to destroy them ourselves, we could not do that. So this is the global picture of the sustainability of the infrastructure. How big is a mirror?
Starting point is 00:12:35 Like if someone wanted to mirror the archive, what would they need for that? Yeah, it's... It sounds like it's getting to be quite a lot now. Yeah. So it's not something you can host on your laptop. It's not even something that is usually in the realm of what a random research lab of a public university could do, because we are now into the territory of about two petabytes of data. This is before the replication that you need for redundancy. And maybe an interesting
Starting point is 00:13:03 technical detail here is that the data model is a graph. Essentially, for people who are familiar with the Git data model, Git itself is a graph, essentially, where the leaves of the graph are the files, the different versions of the files themselves, and everything else is the commits, the directories, and whatnot. So the graph itself is a very big object as well. It has several tens of billions of nodes and about one trillion arcs. And that's a complex object to manipulate, because essentially state-of-the-art graph databases cannot really scale to that kind of workload. But then you have the files themselves, which of course are fewer, we're not talking trillions, we're talking billions, tens of billions, but those take the most space to actually
Starting point is 00:13:49 store. So the sum of all the files, after deduplication, maybe this is a point that you'd like to discuss later, is two petabytes of data. So if you want to mirror essentially the Software Heritage archive, you need to be up to that level of storage, and also capable of running some services that use the graph, that use the files, and whatnot. Okay, so this is like a very substantial endeavor at this point. What sort of entities tend to take on that role of running a mirror? Yes, so it's typically, for now, for the most part, research universities, where there are researchers
Starting point is 00:14:31 that want to be able to run the kind of experiments we were discussing before. That makes sense. And of course, we have some public APIs, but if you want to mine this stuff at scale, you really need to have the data close to your computing services, or infrastructure that actually serves universities and academia in a given country. Okay. So this is the typical kind of people that are willing to mirror right now.
Starting point is 00:14:55 So I mentioned that I'm from Debian, right? So I remember the days where people were flocking to host a Debian mirror, and that was like, on spare servers you had in your server lab, just to help the community. We have not seen that kind of lab mirrors, but I think it's understandable, because a lab mirror doesn't have two petabytes usually. Right, right. That's a little bit
Starting point is 00:15:18 outside the range for most people to do. And it's not just a matter of having the two petabytes; if you want it to be long term, it needs to be able to scale
Starting point is 00:15:26 past that point. And you need people, right? Also, you need operators to operate this kind of infrastructure. It's not like you set it up and you let it go. Usually you need some maintenance
Starting point is 00:15:35 as well. I guess how quickly did it scale to that point? Like when the archive first started, how small was it, and then sort of how quickly did it start picking up pace? So I don't remember, you know, the thresholds that we passed over time, but I can mention a couple of things. So in the beginning, there were some urgent archivals that we needed to do.
Starting point is 00:16:00 So I mentioned that we started 10 years ago. Back then, already, we were seeing some forges, so some collaborative development platforms, disappearing. For instance, we have seen stuff like Gitorious being bought and essentially being closed down, or Google Code, that maybe the older among your listeners will remember, closing down, or maybe restricting the kind of access you could have. So no longer the version control system,
Starting point is 00:16:26 but only the latest tarball. And so basically we had some urgent stuff to do to retrieve copies of stuff that was basically disappearing at the time. And then we started archiving GitHub, which was the biggest platform at the time. It is still the biggest platform today. So from there, what I can tell you is that the amount of stuff we archive has been growing exponentially. And in fact, we have even looked at this retroactively, because essentially in a version control system, you have the timestamps, right?
Starting point is 00:16:55 So even if I start archiving today, I can see that some code has existed for 20 years or 30 years, based on the timestamps in the version control system. Okay. So we have monitored the evolution of the amount of files and the amount of commits that we archive after deduplication, so unique files and unique commits. And this basically gave a view of the global production of source code in the world. If you accept that we are a good approximation of that. I don't know how good we are as an approximation, but for sure we are the largest one out there which is publicly available.
Starting point is 00:17:29 I think once you have that... early on, when you don't have much going on, it's kind of hard to say that. But once you have that core data set, and at this point you've got a lot there, right? So now, years further on, you can very easily track how many additional things are being made. Actually, so we are doing that right now, and we're also looking back at the past. And so what I was mentioning is that the production of publicly available source code has been growing exponentially for the past 20 years. And there is no sign that this is stopping anytime soon. So you might wonder how it could be sustainable to archive something that is growing exponentially.
Starting point is 00:18:10 Maybe I'm anticipating your next question. But so the fact is, up to very recently, the cost of storage was decreasing faster than the rate of increase of the stuff we archive. Yes. Maybe you are aware that this seems to be changing right now? Yeah, I had to buy a new drive recently. It was probably double the price it should have been, let alone, you know, RAM as well, which is a whole additional issue when you're running these servers.
Starting point is 00:18:40 But still, the good thing with source code is that I mentioned two petabytes, right? Two petabytes is actually peanuts if you compare it to storing videos. So I don't think there are public figures about YouTube storage sizes. But, you know, if you take a 30-second video in HD of a kitten, that produces a lot of bytes in comparison to a fix to a line of code in a big project. Right. The majority of stuff that you're storing is plain text.
Starting point is 00:19:10 Yes. I mean, some people store weird things in their repos that probably shouldn't be in the repo, right? Yeah. Of course, we archive Git repositories and people store weird things in GitHub repositories, but we are still, you know, on the small side in terms of the stuff that people store on the internet. Hmm. I've got the graph open right now, and I'm noticing in, what, late 2022, it massively spiked up. Was there something that happened there? Like a growth in the...
Starting point is 00:19:45 So I need to check our journal of big changes, but I think basically we increased our capacity of crawling. Right. So it's not necessarily representative of increased production out there, but an increase in our crawling capacity. Yeah, you know, I'm seeing like a few points in the graph where there's just like a sudden spike up between one month, where it's my assumption would be either crawling increase or infrastructure,
Starting point is 00:20:09 some sort of infrastructure change that led to more code being added into it. Correct. Correct. So, how would I say... Is everything on GitHub up to a certain date covered? Or how is that crawling handled? Are things prioritized in some way? So there is no sophisticated prioritization right now. That might be something that we should add,
Starting point is 00:20:39 but maybe we can get to that later. Right now, the way it works is that we have worked with GitHub, and we have a good technical collaboration with them. And essentially, GitHub already had an API to list all the repositories. So imagine you are starting from scratch. You start on that list, and you archive all those repositories one by one. And when I say one by one, it's still subject to deduplication, right? So if we see one gazillion forks of the same repo, in general it's stored only once,
Starting point is 00:21:06 plus the few commits that are in one of the forks but not in the main repo. But then there is also a feed of updates in GitHub. Essentially, you can get a list of repositories where, associated to each repository, you also have information that says when that repository was last modified. And so that is essentially the key
Starting point is 00:21:25 that we use for prioritizing syncing. So there is no need to re-crawl a repository that has not been modified, so we only crawl the ones that have been modified. And if you're curious, also on GitHub, like I guess on most public platforms, I don't know the exact number, but something like 90% of GitHub is dead, in the sense that it has not seen any change in the past, I don't know, five years, 10 years.
Starting point is 00:21:49 So essentially the challenge there is just keeping up with the repositories that do see changes, and you can ignore basically everything else. Right, because a lot of people on GitHub will use the fork button as sort of like a repost kind of thing. But they're not actually forking the project. They're just sort of making their own clone of it.
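The incremental strategy described here, listing repositories with their last-modified timestamps and revisiting only the ones that changed since the previous visit, can be sketched roughly like this (the forge URLs and dates are made up for illustration):

```python
from datetime import datetime, timezone

def utc(y, m, d):
    return datetime(y, m, d, tzinfo=timezone.utc)

# Hypothetical forge listing: repository URL plus last-modified time,
# as a feed like GitHub's repository listing might report it.
listing = [
    ("https://forge.example/alice/tool",  utc(2026, 3, 1)),
    ("https://forge.example/bob/stale",   utc(2019, 5, 2)),
    ("https://forge.example/carol/fresh", utc(2026, 4, 7)),
]

# When we last successfully archived each repository (carol is new to us).
last_visit = {
    "https://forge.example/alice/tool": utc(2026, 1, 1),
    "https://forge.example/bob/stale":  utc(2020, 1, 1),
}

def needs_recrawl(url, modified):
    # Re-crawl only if the repo changed after our last visit,
    # or if we have never seen it at all.
    visited = last_visit.get(url)
    return visited is None or modified > visited

to_crawl = [url for url, modified in listing if needs_recrawl(url, modified)]
print(to_crawl)  # alice/tool (changed) and carol/fresh (never visited)
```

With most repositories dormant, this filter is what keeps the crawl workload proportional to the active fraction of a forge rather than its total size.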
Starting point is 00:22:10 Yeah. And then imagine how you usually use that. If you're a developer and want to submit a change to an existing repo, you do that. You do your fork, you make your change. And then your fork will either be deleted automatically when you merge your change, or it will remain there, stale forever, and there will be no need to archive it ever again. Right. You mentioned they're not going back and checking things where there isn't a change. And this, I do think, is actually a really important thing to hammer in. Because we see obviously a lot of open source infrastructure being hit by these AI scrapers, right?
Starting point is 00:22:43 And this is a problem. And you guys literally have a thing to stop people just, you know... Yeah, we are suffering from that problem too. So absolutely. But it does make me happy to know that you actually do care about that problem. And you're not just trying to create an archive of everything, who cares what happens to the infrastructure. You're trying to do it in a way that isn't destructive to the rest of the ecosystem. Absolutely. So essentially, two cases. So let's
Starting point is 00:23:12 separate the case of GitHub, because it's a huge infrastructure. Sure. Yeah. Like the GNOME GitLab, the KDE GitLab, some random person's repo. Yeah, yeah, yeah. And let's take the GitLab of a specific project from an open source foundation. So with GitHub itself, we're in touch with them. They know our bots, they recognize our GitHub accounts as bots, and, you know, they know they can scale. And so this is kind of an easy case, but we are in touch with them. So they know we are there. We didn't just start, you know, archiving without being in touch with them. For the small forges it's more interesting, in the sense that we have a feature that is called Save Forge Now, where random users interested in the archival of a specific, say,
Starting point is 00:23:56 GitLab instance can say: there is a GitLab instance there, can you please archive it? So when we receive that request, there is a manual vetting process in which essentially we reach out to the operators of that GitLab instance and we actually ask them: well, we have received this request for archival, are you okay with that? And basically there is also a conversation about, essentially, the capacity. So essentially, if it is a small instance, they would not like having, I don't know,
Starting point is 00:24:23 even 10 or 20 crawlers hammering them. And so they tell us maybe one crawler is enough. And so we crawl with only one. So that is the kind of process that we have in place to avoid, essentially, the risk that you mentioned. Okay. Okay. That's, yeah, that makes sense. Because, you know, I could imagine this.
Starting point is 00:24:46 Okay, let's go with GitLab. Git... sorry, merging GitLab and GitHub in my head. GitHub, how many crawlers do you actually have going through that site, if you happen to even know the number? So, I mean, our ops would know. I don't know the number off the top of my head, but I think it's in the ballpark of maybe 16, something like that. Oh, okay. So it's not hundreds. It's in the tens, and not very high tens.
Starting point is 00:25:14 Okay. Okay. That was a lot less of a number than I expected. Yeah, because, I mean, really, the problem is not there. They can scale. So essentially, we operate all the crawlers. We can operate on our infrastructure, and that's enough, essentially. Sure, sure. Okay, fair enough. Fair enough, I guess. And also, you mentioned the problem of AI scrapers, right, that I mentioned we are suffering from as well. That's why you are seeing those checks when you navigate the archive. So those
Starting point is 00:25:43 are a very different kind of crawler, right? They usually come from residential IPs these days. So they are all different. They are hard to identify. In our case, it's a limited number of crawlers that identify themselves and that are known to the recipient of the crawling. So a very different kind of workload. So, if you happen to know, at what point did it really start to become a problem, where that issue had to be dealt with, with crawlers coming to this archive? So for us, it's been essentially shortly before the release of some popular open LLM models, open-weight models.
Starting point is 00:26:26 And essentially, that was the sudden spike we noticed, but then it has become common. So we are one of the targets. And so I guess maybe it's two years ago or something like that. We would need to check the release dates. And I don't want to mention names, but essentially, it happened to us when it happened to everybody else.
Starting point is 00:26:45 And frankly, I think we were not even necessarily targeted ourselves. There were just, you know, AI crawlers crawling the entire web, and so essentially finding us, and going very deep into the archive structure and creating all sorts of load problems. Right, right. And there is a lot on this site. So if you have something that crawls constantly, it can crawl and will never find an end. Yeah.
Starting point is 00:27:11 And also it is not the right way, right? So imagine that you want to have a copy of our archive or other resources out there. Crawling page by page is not the most efficient way to do that, right? So reach out and ask to be a mirror operator or something. That would be way more efficient and would reduce the burden for all the people involved. Something you touched on earlier is why it's important and why we should care about our software heritage. And I kind of want to get into a bit of that. Yeah. So there are different reasons for different stakeholders, right?
Starting point is 00:27:48 So a couple of examples that we usually use when we make presentation. about software heritage is one historical value, like the source code that put the first man on the moon, right? So the Apollo mission, that code is archived in software heritage. And that code is something that basically cannot be run any longer today. It's specific assembly code for the Apollo guidance computer. So it's completely pointless to even try to run it. But if you look at the comments in that code,
Starting point is 00:28:15 it's the place where the team led by Margaret Hamilton invented software engineering as a discipline and invented a bunch of stuff that still exists today, like defensive programming, the fact that your program should be careful and should defend itself even against conditions that are not supposed to happen. Okay. So that piece of code is archived in Software Heritage, and it's important for historical value, and we don't want to lose it. Like other stuff that can be in a museum for a different branch of history to learn from. That's one example. Another example is in terms of innovation. So another typical piece of code that we show in our slides is the Quake source code, right?
Starting point is 00:28:59 It was open sourced a long time ago, authored originally by John Carmack. And in there, you find a very smart way of computing the inverse square root using only integer operations on the float's bit pattern, because that was the most efficient way to do that on consumer-grade hardware at the time. So I'm a scientist. So at the time, that kind of innovation could have been published at a computer graphics conference. Instead, it's in a piece of source code. So if we archive scientific papers because it's useful for the historical record and for the scientific record, then we should archive those pieces of source code.
Starting point is 00:29:35 And these are just, if you want, inspiring examples. But you might also have a lot of practical reasons to do that, like digital obsolescence. Like maybe you've stored some images in a format that is not a standard, so it doesn't have a public specification available, but maybe someone implemented a reader for those images. Maybe that reader is only stored on your own page or some obscure GitHub repository. So archiving it will enable someone down the road to actually read those images 20 years from now, rather than having them lost forever.
Starting point is 00:30:12 And again, there are thousands of examples like that. But I guess the key lesson here is really that source code is important per se. It contains knowledge that should be preserved, independently from even running that piece of source code. I think it's not just about preserving that old code so that, you know, you could open an old file, but preserving it so that if someone in the future wants to build a modern solution to that, they have an example of how it was once done. And maybe that way is not correct now and isn't going to run on a modern system. But it gives you an example of how it was done, so you can more easily understand a way to do it now.
Starting point is 00:30:49 Absolutely. That's absolutely the case. And, you know, source code is something an expert can actually read and make sense of. And nowadays, we even have more and more useful tools to help us understand source code. For instance, LLMs themselves are actually really good at understanding code these days. So maybe once you have found a piece of old source code you care about, you can ask either an expert friend or some tooling to explain it to you, and use it as a base to produce something else. Then there is also the entire aspect of scientific reproducibility.
Starting point is 00:31:22 I don't know if that's something you wanted to touch upon. Yeah, no, go ahead. In open science, people have been talking about... so open science is this idea that essentially we should do science only using open artifacts that are available to everybody. And there are essentially three kinds of artifacts in science. One is the papers themselves, one is the data, and the third one is software.
Starting point is 00:31:43 Okay? So in open science, a lot of people 10 years ago were still talking primarily about open access to papers: essentially, scientific papers should not be behind paywalls that only rich universities can afford. They should be available for the entire population. And they were talking a lot about open data. So data, same thing, right?
Starting point is 00:32:03 They should not be paywalled. They should not be behind closed APIs. They should be available to everyone. The importance of open source software for open science was known, it's not like we invented it, but it was very much under-discussed at the time. So our point is that a scientific paper either produces some source code as a result
Starting point is 00:32:25 or uses some source code for running the experiment, which is the case in 99% of experiments these days, by the way, and not only in the hard sciences, right? And so that source code should be stored in a permanent place and referenced properly. To this day, we have way too many scientific papers that, as the pointer to the source code related to the paper, use GitHub. And GitHub is a development platform.
Starting point is 00:32:54 I mean, it's a proprietary development platform, which is not my cup of tea. But, okay, it's a successful collaborative development platform, but it's not an archive. So if you reference your source code in a paper with just a URL for the Git repository, that GitHub repository risks disappearing. And so it's really important that all the source code associated with scientific discoveries is stored in an archive like Software Heritage and referenced properly for posterity. Posterity can be my colleague, right? It can be my colleague that two years from now just wants to retrieve the source code behind my experiment
Starting point is 00:33:30 to actually build upon it or to show that it was wrong. I mean, all sorts of good things come with public access to this kind of information. Right. And if you have open data and open papers, but you don't have the software, you're sort of in a situation where you have to reinvent the software from the results of the paper and what the data says. You can't really just interpret the data without knowing how that data was actually being used. Absolutely. So that is the use case of basically standing on the shoulders of giants. So reusing some work done by others and building upon it.
Starting point is 00:34:13 And then there is the reproducibility aspect. So I want to be able to rerun your experiments, maybe to verify them, because maybe you overlooked something. And so for that too, even if you don't want to build upon something, verification can only happen if you have the entire dependency chain of the scientific experiment. One of the things I was a little bit confused about when I first saw the archive was the SWHID, the software hash identifier. So I guess maybe you could
Starting point is 00:34:46 explain a bit of what that is and I guess why it exists. So that is an identifier scheme that we have developed and that has been recently standardized as an ISO public standard. It means it is one of the few ISO standards that are publicly available. You don't need to pay for the specification. And so that standard essentially captures the way we identify all the objects we store in the archive. So I mentioned before this graph data model. And essentially, without entering into technical details, the graph contains different kinds of objects. Each one is a node in the graph. So one kind of node is a file, a specific version of a file.
Starting point is 00:35:25 Another is a full directory of files. Another is a commit, which we call a revision. Another is a tag in Git, which we call a release. And so forth, up to the full state of a Git repository. So we have one identifier that captures the entire state of a Git repository. And so essentially, SWHIDs are a standardized way that we use to compute those identifiers. People familiar with Git will notice that the basic scheme is actually compatible with what Git currently uses, which are basically SHA-1-based identifiers.
Starting point is 00:35:58 And these are used to identify all these kinds of objects. So what we have done, essentially, is standardize, independently from Git, how you compute these identifiers, provide some open source implementations so that everyone can do that, and actually release this to the public. And so the idea is that when you archive something in Software Heritage, you can see the identifier that something has received, at different granularities, because maybe you care to reference a single file, maybe you want to reference a full directory, maybe a commit, maybe the entire repository. And then later on, 10 years from now, you will retrieve the same thing from Software Heritage
Starting point is 00:36:35 and you can recompute the identifier yourself, using an open-source implementation, and verify that they match. If they don't match, something has been modified, either in the archive or in the transfer. So it gives you some very strong integrity guarantees, which is something that you don't have with a lot of other identifiers out there.
Starting point is 00:36:56 So we mentioned papers in science. DOIs, digital object identifiers, are used a lot to identify papers, and they don't have these properties. So if you resolve today a DOI from a paper published 10 years ago, you can receive an entirely different paper, and you would have no way to verify that. Okay.
Starting point is 00:37:15 Okay. Because there are essentially men in the middle, usually the publisher and the operator of the DOI registry, that can make the DOI point to something else. That has some useful properties, okay, in the publishing industry. I'm not completely saying it's a problem, but it's something that we wanted to avoid for digital-first objects like source code artifacts. Okay.
Starting point is 00:37:39 You can see the important implications for security, for example. Sure, yes, yes, yes. I didn't know that was a thing with papers. I was not aware of that at all. But I guess, and this goes back to the whole open science thing, right? If you aren't able to reference the software that was used, how can you then reproduce the results and do further tests? You're not able to really do so if you aren't certain the software you're looking at is the same software. You are absolutely right. So as it happens,
Starting point is 00:38:17 I have also done a bunch of research on this specific topic, because I'm very interested in reproducible builds. And indeed, there are actually two problems there. One is the way you reference your software. For instance, if you just say my software is available in this Git repo and you just give a URL, well, okay, you can git clone that URL, but then which version did you mean? Okay. Sorry, that was my... My bottle. It's all good.
Starting point is 00:38:43 Okay. Okay. It didn't spill anything? It's good. Okay. Which version did you mean? So at least you need to reference the Git repository and the specific commit, for instance. Okay.
Starting point is 00:38:52 And then you have the problem of archival, in the sense that, okay, you have referenced something, but if that something is gone, you need a backup place to retrieve it from. So the idea is that with Software Heritage plus SWHIDs, you have the two together. You have a place that gives you the archive and an integrity mechanism. And if you're curious about some research results on this, we have shown, for instance, that repositories are altered way more than people imagine. So there are a lot of history rewrites that happen on GitHub. And the only way to notice them is via something like Software Heritage.
Starting point is 00:39:27 Or there are tags that disappear. Maybe you are referencing version 1.0 of a software, and that 1.0 today points to something and tomorrow points to something else. And if you have only noted 1.0, then you will not know that something has changed. And this actually impacts things like reproducible builds. And just, like, something a lot of people have, you know, used themselves is branches, right? Branches can very well be deleted or merged, or, like, a branch is not necessarily
Starting point is 00:39:58 going to be there forever. Correct. So I mean, with branches it's less bad, in the sense that people expect them to move, right? A branch, by definition, in Git, it moves. Every time you add a commit, the branch points to the new commit. So it's sort of by design. Tags, on the other hand, are meant to be immutable, and they are actually not really so.
Starting point is 00:40:19 Right. Another thing is... But yeah, but if you... What you have in mind are, like, ephemeral branches, like for development of a feature, and then they disappear. Yes, you're absolutely right. They will disappear at some point.
Starting point is 00:40:31 So if someone was referencing that branch and it gets deleted, then you no longer can retrieve the software you care about. Right, right. Or things like, um, squashing commits, things like that. Like, that's another thing you can do. Absolutely.
Starting point is 00:40:43 Yeah. That reminds me, we're talking about Git here, but also, obviously, there are other source control systems outside of Git, so you need something independent from Git itself as well. Yeah, so that's a very good point. Thanks for reminding me of that, because I didn't mention it in my previous conversation. SWHIDs are independent of the specific version control system. So we inherited them and built upon them from Git, essentially. So the only thing we added on top of
Starting point is 00:41:08 Git is an identifier of the full state of a repository, but we compute them in the same way independently from the version control system. So this allows you to do things like, well, maybe something migrated from Subversion to Git in a not-so-proper way, but the SWHID of, I don't know, the root directory of the last commit in Subversion and of the first commit in Git will be the same. So you can actually stitch the two histories together. Oh. Oh, I did not know that.
Starting point is 00:41:36 Yeah. So it's not something that is done automatically on the interface, but we do have the information to actually do that. Okay. Okay. That could very well be interesting then. Because it's not often now with newer stuff, but you do definitely see a lot of legacy,
Starting point is 00:41:53 like you look at something like the Linux kernel, for example, which used to use BitKeeper before they moved over to Git. Yes. Okay. Yeah, so actually, it's funny that you mention that, because just last week I was discussing with some of our engineers that I wanted to create some APIs to do exactly that. So essentially, there is the Git Linux kernel
Starting point is 00:42:16 repository that was started at version 2.6-something. Then, because essentially the history went back a long time and it was too slow to actually import all of it when they moved, you have the history repository that contains essentially the history before that. Git has some
Starting point is 00:42:32 support to actually have a local copy that stitches the two together, and I wanted to add the same APIs to the Software Heritage archive. So all the basic information is out there. We just, you know, lack the thing needed to stitch the two together. Okay. Okay, okay, now that sounds super cool.
Starting point is 00:42:50 Yeah, I guess, what does, like, current support for the org look like? Obviously, you've got your sponsors and stuff, but, like, I assume you've gone out and, like, talked about the archive at various events, and, like, what has been,
Starting point is 00:43:16 sort of, the general perception from developers in this space, and sort of thoughts about what you're doing here? Yeah. So as you mentioned, many developers still don't know us, but we are seeing a lot of increasing interest. So when we go to free software events, open source events, we have our booth, we are present there, and we have a lot of people essentially coming and saying, thanks for what you're doing. Because as soon as a developer realizes that this work was actually not being done before, usually they're blown away. They're very happy about the fact that someone is doing that. This is in terms of bottom-up support.
Starting point is 00:43:51 And then there is institutional support. Like, for instance, I don't know if you're aware, but right now in Europe, there are a lot of conversations about digital sovereignty. And a lot of people are looking at us as a way to be more independent from the original providers of some source code and the original operators of some commercial platforms.
Starting point is 00:44:09 So we have seen a lot of support from that as well. And also I should mention support on the cultural angle since the very beginning. I mentioned Inria as an incubator for the project at the beginning, but we've also been supported by UNESCO since the very beginning. Because essentially UNESCO recognized that in the space of digital preservation, which, of course, we did not invent, so digital preservation existed already, and software preservation existed already as well, the corner that was still not handled was the preservation of source code itself.
Starting point is 00:44:43 So we're seeing growing support in the digital preservation community and recognition of the fact that, yes, that was an unaddressed need that we basically started working on ourselves. Maybe this sort of, I assume you've probably thought about this before. There is this general understanding of the preservation of physical media, whether that be, you know, DVDs or books or music or anything else like that. Why do you think it took until you and the other founder wanted to do this for someone to really approach this as a problem?
Starting point is 00:45:25 I actually don't know, in the sense that when we discussed the idea at the very beginning, our own first reaction was, well, someone has done this already, for sure. And actually it was not the case. So my guess is that we were at the conjunction of needs. From the one end, we were both free and open source people. So we were coming from the world of free software developers. So we knew the community, we knew the value that publishing code was adding. And on the other, we had specific needs in terms of research.
Starting point is 00:45:57 So that's an interesting anecdote, maybe. So I was in Debian, and the sort of small project that I did before Software Heritage in Debian is what is called sources.debian.org, which is essentially a website where you have the source code of all the historical versions of Debian published. Okay.
Starting point is 00:46:18 And so for me, before Software Heritage, and actually I developed it with an intern student at the time, it was useful to do some research on software evolution. And so actually, to some extent, Software Heritage is scaling up that experiment from only Debian
Starting point is 00:46:35 to the entire body of software that is out there. So I guess we were at an interesting intersection of interests between academia, free software, and digital preservation, that essentially gave us the kickstart and put us in the right direction. I should also mention that some free software developers legitimately don't care about this, in the sense that if, for you, free software is only a tool to achieve some technical end, maybe you care about your specific need today and you don't
Starting point is 00:47:11 care about what will happen 20 years from now to that code. And that's fine as well. Okay, so not everybody should care about this in the same way we do. I was just thinking about this along with another discussion that I had with someone recently. I had a discussion with an openSUSE developer who has basically made it his mission for the past five or so years to go and track down software that has Year 2038 bugs. And I was thinking that with an archive like this, you can go back and do research on what actually existed at the time. Because if you think back on Y2K, there are a lot of people who legitimately think that wasn't
Starting point is 00:47:57 real, that it was never going to happen, because a lot of that software history was either behind closed doors to begin with or has since been lost to the sands of time. And people don't even realize that there was a lot of work that went into that problem to ensure that, you know, we didn't have databases rolling back on themselves and various things like that. I think being able to go back and look at where we came from, even if it's not something that is important to what you're doing today, is still important for addressing any future problems that might come along. Yes, absolutely. So here, however, you're entering the territory of somewhat open research problems
Starting point is 00:48:44 in the sense that two petabytes is not that big to store, but grepping through two petabytes is starting to become a challenging effort. So yes, we are absolutely aware of the need of having essentially an infrastructure and a framework that allows you to do all sorts of empirical searches on the content of the archive. So if you want to know, for instance, okay, let's grep for this pattern of bugs in the code as it was in 1999. That is definitely a use case we want to support, but it's also very clear for us that it should be on an infrastructure that is separate from the archive itself, in the sense
Starting point is 00:49:26 that you really don't want to mix infrastructure that you use primarily for long-term preservation and archival with infrastructure that is meant to do, you know, hard crunching of the lines of source code to figure out some pattern and extract something. So our view for this is that we need essentially a twin infrastructure, a research infrastructure, though of course it can be research even for curious people
Starting point is 00:49:48 that want to grep. It doesn't need to be academic researchers. Essentially, it's a twin infrastructure that has a mirror of the archive itself and all sorts of storage capacity and compute capacity and memory capacity to do this kind of analysis. And actually, we are in the process of, you know, asking for funding
Starting point is 00:50:06 for actually, specifically, setting up this infrastructure, which we also want to be essentially a public good, okay? Something where, of course, the cost should be subsidized and should be sustainable for everyone. But in theory, it should be a shared infrastructure, like our friends in physics are super good at doing for particle physics or whatnot. I think something like this becomes
Starting point is 00:50:30 more and more important as we move further into a digitized world. Because it's not just that random people are making, you know, more and more little bits of software. Software is becoming more embedded in the things we do, in the things we use, where things that in the past would have just been, you know, some circuits flipping some switches and stuff. Now, a lot of things have a microcontroller in them, have software actually controlling them. And I think it is important because, you know, there are these arts that get lost over the years. Like, there are certain forms of weaving, for example, that as time progresses become sort of this endangered skill. And preserving some of that sort of software legacy, software's still a relatively new thing, but if we go 100, 200 years into the future,
Starting point is 00:51:33 there's going to be ways of writing software, ways of interacting with the system, that nobody does anymore. And just because it may not be the modern way of doing things doesn't necessarily mean that that bit of our history shouldn't be preserved. Absolutely. And so here are two things, though. One is, indeed, software is more and more present in our lives. I think for a long time, that was one of the starting points of why we need to care about this stuff, also for Software Heritage. And the second thing
Starting point is 00:52:06 is archiving that software so that we can learn from it 10 or 20 years in the future. But I think the bigger problem there, in addition to preservation, is that most of the software that controls our lives is not free software, is not open source software. So that, of course, means that we cannot preserve it, but what is even worse is that you don't know what that software is doing with your data. I don't know what the software is doing with the data.
Starting point is 00:52:33 I don't know. In some cases, even the state doesn't know what the software is doing, because it's been, you know, passed on to some private company, some proprietary company. So I think two steps. So as a citizen, I think that all the software that controls my life should be free software.
Starting point is 00:52:48 Okay. And then once it has been freed, once it has been, you know, published somewhere publicly available, then as Software Heritage, we will be happy to archive it and enable all the good things that you just mentioned, you know, to get better societies in the future. And this, by the way, is not only about software, right? With data it's the same thing. So if there is some private data that controls our lives as citizens, that should be liberated as well, right? Sort of somewhat relevant to that, I believe it was posted maybe, like, yesterday or somewhere very recently. Did you hear Germany basically mandated the use of ODF? I believe this is for their government documents. I've seen the headline. I think you're right, yesterday or the day before yesterday,
Starting point is 00:53:37 but I actually haven't read the details of the article. So I'm vaguely aware of the news, but I haven't looked into the details. Okay. No, I think steps like this towards open formats that are well documented are just a good change, whether it be for documents like this or anything else. And even though, obviously, you know, they're not going to make government documents public, within an organization, being able to bring forward your legacy documents matters. You don't want to be in a situation, for example, where up until, I want to say, three years ago, the Japanese military was relying on floppy disks. Like, situations like this are really common, and not having a repeat of that hardware problem with software is going to make the continuation of these systems, even within a
Starting point is 00:54:31 singular organization, considerably easier. So I agree. So the broader topic you've touched upon is essentially lock-in, right? So depending on something that you do not control. That can be hardware, can be software, and it implies a lot of costs for transitioning out and moving to something else. What I think is important to mention here is that, of course, if it is proprietary software, you are locked in, because only the vendor of that piece of code can change it for you. But even if it is free software and properly archived, it doesn't mean you as a state have the capacity to actually work on it.
Starting point is 00:55:10 In the sense that you need skills, you need people. So I think essentially this kind of liberation of technology we are talking about should go together with investments in public resources, human resources, that are capable of maintaining that software. And this is another conversation that is really active right now in Europe, in the sense that, okay, how do we increase our capacity for developing and maintaining our own open-source software in a given state, or in a given continent in the case of Europe? And so these are, I think, very important debates on the future. And I think we have something to contribute to that as Software Heritage, in the sense that, you know, having the archive at least removes some of the single points of
Starting point is 00:55:57 failure that we currently have in public digital infrastructure, if you will. And I was just looking through some of the links on the Software Heritage website. I saw the thing where it was saying the half-life of a reference URL is approximately four years from its publication date. I didn't realize it was that short. It's especially weird with the centralization of platforms, where the early Internet was really this decentralized web where some random guy was
Starting point is 00:56:30 running a server, and then some other one talked to that server. But you've seen a lot of this with GitHub, with GitLab, with just the internet as a whole, sort of this focusing on singular platforms. And even if we can just preserve
Starting point is 00:56:46 the software, preserving all of the other information is a whole other topic. One thing that I've seen for a while, and it really does annoy me, is that a lot of, um, a lot of, like, video game tutorials and instructions sort of moved away from independent forums onto Discord servers,
Starting point is 00:57:12 and sometimes those servers just shut down, and everything that was there is just long gone. And that's a whole other thing. You could set up a whole organization just to focus on that problem. But this is a problem across the entire internet, where data is very fragile. Software is fragile. And a lot of things are not backed up anywhere. And people have this idea, you know, if it's posted online, it never disappears. And, you know, it's a good thing to live by, because you don't want to post things you shouldn't be posting. But it's not actually true.
Starting point is 00:57:49 Like, that's not true. Data can just disappear. Yeah, so I think a joke we had on this topic in our chat channels at the very beginning of Software Heritage was from XKCD, about digital preservation, which was something like: digital objects last forever, or five years, whichever comes first, right? But the point you're touching upon there is that the half-life of URL references actually is impacted very differently depending on whether we are in an internet which is fully distributed or in an internet which is very much centralized. Essentially, if it is fully distributed,
Starting point is 00:58:24 the problem you have is that you don't buy a resource on the internet, right? You just rent it. So you rent a domain, you rent a server where you have a virtual machine. So at some point, your rent for whatever reason expires. And so whoever was referencing your URL
Starting point is 00:58:42 can no longer resolve it. So basically, you can have an organic way in which references are created and at some point disappear. But it's very much, you know, a smooth distribution of this kind of thing. If you are in a centralized internet, what will happen is that your references stay stable for a long time.
Starting point is 00:59:00 And here I'm assuming that there is no way on the platform itself to invalidate those references, which there is. Okay, that's our conversation from before about tags and changing what things point to. Let's forget about that for a second. But then at some point, the entire platform can disappear. So when I was young, who would have thought that
Starting point is 00:59:20 Google Code would disappear? I mean, it's a forge, it's operated by Google, one of the tech giants, it will never disappear. And then at some point they decided they didn't have a business case anymore to operate it, and they shut it down. So GitHub today, nobody imagines that it's going to disappear, but it might disappear. I would even say that it will disappear. And so when that happens, you will have a big blip in the half-life of your URLs; essentially a gazillion URLs will disappear at the same time, when the platform is shut down. So depending on the shape and the structure of the internet,
Starting point is 00:59:57 we have very different effects on the half-life of URLs, but we need to think about that. So there is no magic solution, okay, but we need to be aware of the problem and, you know, take measures to minimize it. So that's why one of my missions with my colleagues in academia is, every time as a reviewer I see a GitHub URL in a paper, I say, okay, but that's not stable,
Starting point is 01:00:19 archive it somewhere. And the same is true for URLs on the web: if you reference a URL on the web, use an Internet Archive reference, not a bare URL. And the same for Software Heritage with GitHub URLs. I know you mentioned that running a mirror, obviously, is not viable for the average person, but if somebody does want to help out in some way, what can they do? Obviously, they can promote the existence of it, but if there's anything else, I would love to hear it. Yeah, so maybe three things.
Starting point is 01:00:54 So one, we do accept donations. We have not run big donation campaigns yet, but this is a way to support the project and something that we might want to scale up more in the future, as additional diversification of our sources of funding. The second point is that one of our founding principles for Software Heritage is that the software
Starting point is 01:01:14 we write ourselves to run the archive is free and open source software. So we are really hardcore about that. So if you go to gitlab.softwareheritage.org, all our own code is there. Okay. And people can help by contributing as they usually do with any piece of free and open source software. Okay. It's a bit more challenging on average than hacking on your, you know, random web framework
Starting point is 01:01:39 because, you know, testing something on a small scale is very different than testing something at the scale of the archive. But that's one way they can help. And finally, talking about it is indeed a very important aspect: advocacy. That can be done either in a completely informal way, as you're doing. So it's great. And by the way, thanks a lot for this. I know you have spoken about Software Heritage in the past already,
Starting point is 01:02:04 and you're doing it again today. So that's great. But then there is also some additional engagement, so people can become ambassadors of Software Heritage and then go around and give talks and presentations about Software Heritage to wherever their communities are. So those are three very concrete ways: donations, technical contributions, and promoting the project. And also, maybe one last thing.
Starting point is 01:02:29 So if you care about the preservation of a specific piece of code, we do have functions in the web UI to ask for the archival of a specific repository that maybe we have not archived yet, or has not been archived recently, or even, as we discussed before, a specific forge that we have not archived yet. Those are all super useful contributions that people can make to our initiative. Okay. I want to shift gears a bit and sort of talk about, I guess, some of your history. So, I guess, when did you get involved and interested in FOSS? And I guess what was the thing that sort of brought you to it?
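For reference, the "save code now" feature Stefano mentions is also exposed as a public HTTP API. Here is a rough sketch of how one might queue a repository for archival, assuming the `/api/1/origin/save/` endpoint shape from the public documentation; the helper name and the example repository URL are my own illustration, and real use should check the API docs for rate limits and exact URL encoding:

```python
API_ROOT = "https://archive.softwareheritage.org/api/1"

def save_request_url(origin_url: str, visit_type: str = "git") -> str:
    # The save endpoint takes the visit type (git, hg, svn, ...) and
    # the origin URL in the path; POSTing to it queues an archival visit.
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"

url = save_request_url("https://github.com/example/project")
print(url)
# POST to this URL (e.g. with requests.post) to submit the request;
# a GET on the same URL reports the status of previous requests.
```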
Starting point is 01:03:11 So I was an undergraduate studying studying computer science at the University of Bologna. And in the second year, we had a class on operating system. And the professor was a big fan of free-indopensile software. So especially he explained the notion to us. And I was immediately, you know, got into how passionate, how interesting all that was. And so we had two labs. So one lab was run by the university administrators. And one lab was run by the students.
Starting point is 01:03:41 And of course, I wanted to be one of the administrators of the student lab, and I got involved in that. The distribution was Debian. So Debian was, actually, it was not my first distribution, but the first I started looking into and contributing to. And so I first became, I guess, a volunteer system administrator of Debian machines. And then later on, when I started as a graduate student doing some research, we were working in OCaml, a functional programming language, and the packaging of OCaml libraries and tools in Debian was a bit lacking at the time. So I asked myself, hey, how can I help make this better? And that basically started my path in becoming first
Starting point is 01:04:22 adabian package maintainer and later on even a Debian project leader. And then one thing led to another, if you want. So it was exposure to both the technical sides and the philosophy. I think it's very important to understand why free software is empowering. and then scratching my own itch as the same goals, right? Something was not as good as it could be. I could make it better, and so I did. So I've got the list of Debianphrodite leaders up.
Starting point is 01:04:49 You were the leader following Steve McIntyre. Yes, correct. I guess what made you want to go for that position? You know, I don't think I even remember that. So at the time, I was a post-doctoral researcher, still working a lot on the OCaml packaging. In the meantime, I had become the maintainer of Vim. I was doing, ah, yes, maybe it was via the angle of QA.
Starting point is 01:05:15 So one of the after package maintenance, there are some very interesting technical challenges in Debian that are at the scale of distribution as a whole. And those goes under the umbrella of the quality assurance team in Debian, which is a team that basically instead of looking at specific packages, try to have a broad view, like how can we improve, the quality of package as a whole. And so that gave me
Starting point is 01:05:39 an interest in the distribution as a whole. And that led me to the, I guess, more political aspects. And of course, being a David Project Leader is more political than doing specific technical work. So it started from there. Okay. And it was super exciting because
Starting point is 01:05:55 by being David Project Leader, I got in touch with a lot of other major free software communities. So that really was something that they found very exciting. So essentially what you might called the geopolitics of open source, and that I met a lot of friends that basically are made as friends for life. Okay.
Starting point is 01:06:13 You're also on the OSI board at one point as well. Yes, it's been a while, I guess. Shortly after becoming the epipodevary page. It's probably been 10 years old or something by now. So I get, what year was it? That was 2014, assuming this is correct. Okay. I guess how did that happen?
Starting point is 01:06:42 So after I, so I run, I remain Debian prosecutor for three years. That's three one-year mandates. And at the end, one person that I got to know was Simon Phipps. And at the time, he was trying to reform the OSI management, essentially to change it from a fully self-persexual. perpetuating board to something a bit more mixed in which he wanted to have, essentially, some people that are nominated by the board, some people that are voted in by individual members,
Starting point is 01:07:15 and some people that are voted in by project representatives, where projects were essentially major foundations and major initiatives in free and open source software. So he conducted me first, I think, as Debian project leader, to see if Debian wanted to participate as one of these projects, and I said it was a good idea. And so given the project was interesting for me, in general, I'm very much interested to understand what is the best possible governance
Starting point is 01:07:42 for major free and open source product. So I thought this project was an interesting experiment and I wanted to be part of it. So I nominated myself and I had been elected as one of the elected members of the board. Okay, okay. I guess like, on us you've done a lot throughout your history
Starting point is 01:08:04 you've clearly done a lot like I I don't know I don't know where to go from there I guess was there anything we kind of haven't touched on with the
Starting point is 01:08:18 with software heritage I know you have a hard cut off in like 20 or so minutes so I'm not I'm not really sure like how long you really want to go for but was there anything pretty important important or important all that we kind of missed with the software heritage stuff throughout this talk. No, maybe so an angle we didn't cover much as the kind of research one can do on software related.
Starting point is 01:08:38 So maybe I would quickly mention a few research results that I think are interesting for the free and open source community at large. So for instance, with some colleagues, we did a lot of work on diversity. Essentially, once you have this body of code and version control history spread over time, and if you look at the timestamps we have a spread of like 50 years, you can look at, for instance, who are the people that contribute to free and open source software. And for instance, we have mapped the evolution of gender diversity and geographic origin diversity throughout the data provided by Software Heritage. A colleague of mine has also done something similar in terms of more technical evolution, like the evolution of programming languages. So, for instance, for gender, we've seen that up to the arrival of COVID,
Starting point is 01:09:25 women were growing in their yearly contribution, reaching a maximum of 10% of 10% of the of women authored contribution just before COVID, and then it has regressive since, and we have shown that it is COVID that has caused that in a causal way, not just correlation. And we did something similar for geographic diversity. And I'm mentioning this just because it shows that with this body of data, we can do very large-scale study of our entire population of contributors
Starting point is 01:09:55 to free and open-source software. It is something, it's not the kind, they are not the first studies that have been done on this, but usually the previous studies were done on much smaller scale set of repositories, and they are not reproduced. So those studies, for instance, of course GitHub can do these kind of studies, but they will publish their results, and we will have basically to trust them at face value
Starting point is 01:10:16 without being able to verify that. So with something like software heritage, essentially we have a base for conducting very large-scale experiments of the history of our own movement, and in a way that is accessible, to everyone and is reproducible and hence verifiable. So this for me is kind of, how to say, self-interpective contribution.
Starting point is 01:10:38 So we can study ourselves like sociologists do with human communities elsewhere and learn something about ourselves. Yeah, GitHub in the past has put out numbers on the percentages being used, for example. And they make sense at face value, but, you know, There's not any good way to go and verify that. So having, and not just on GitHub, right? Because possibly, I don't know, maybe there is a difference if you look at the use of GitHub versus GitLab where I would assume, maybe I'm wrong, who knows someone has to do the numbers here, I would assume that someone who's explicitly using GitLab might have differences of opinion when it comes to software licensing.
Starting point is 01:11:25 Yeah, so you're touching upon a topic that is dear to my heart, but it's also. also a bit painful in the sense that it's something that I've had in my drawer for a long time and didn't have time to work on. But yes, absolutely. The diversity of practices across different forges, it's a super interesting topic. So I'm going to make you here a result of a paper not of mine, some colleagues published a paper called the Penumbra of Open Source in which they did something fascinating. So basically, they started at the IP address level, do a reverse lookup of IP addresses, see if they encountered some host name called Git. Dot something and crawl that.
Starting point is 01:12:04 And then they compared the patterns of coding and of contribution on non-Gitab repos, basically on wild repos versus the one on GitHub. So we can do something similar or we're using software heritage data. I just didn't have the time to do it myself. But it's super interesting. This essentially is the key of the monoculture problem, right? So a lot of developers only see GitHub and see that that's the only thing that exists in town, but it is not. And studying and comparing what happens on GitHub with what happens outside of GitHub is super important.
Starting point is 01:12:39 Also to understand what are the other ways, the other possible ways of doing things. And maybe just one data point. So we have looked into references from academic papers in France to basically code hosting platforms. And you have, so the majority is GitHub, but you have, I think, a bit more than one quarter among millions of references that are coming from other places than GitHub. So it is not a small amount of people that use in scientific papers, or the reference more precisely, in scientific papers called Austin elsewhere than GitHub. Studying that is super important. And even if we have not done the studies ourselves yet, I think it's very important that we are enabling these kind of studies for the case. Okay.
Starting point is 01:13:31 And I guess you could then do similar sorts of researchers like, okay, do different regions have different preferences in the kinds of source hosting they're using? Yeah, for instance. Yes, absolutely. And then, of course, you get into the weeds of how you detect the region or something. but I mean, these are scientific challenges. Sure, sure, sure. But like you're offering the data there and you can... Correct.
Starting point is 01:13:55 It's up to the research is how they're going to handle that data. Absolutely. Okay. I'm not sure how much time you wanted to have before that meeting you had. So I think unless you got something else you want to touch on, we probably should wrap this up. Yeah, let's do that. Okay.
Starting point is 01:14:12 If people want to check out Software Heritage and help out, is everything just linked on the Software Heritage Web? Go to softwareheritage.org for the presentation of the project and to archive. Dot softwareheritage.org for the actual archive content itself. We have a community menu in the main website that leads to developer and scientific communities. So people who want to join our development channels or our channel about scientific research, they can just do that.
Starting point is 01:14:39 So we are typically mailing list, we have matrix channels and the usual. Okay. Nothing else want to say? I just wanted to thank you again for this opportunity. It's been a great chat and thanks a lot for the work you're doing. Yep, absolutely pleasure. Thank you for coming on and doing this. This was a lot of fun. You speak very, very quickly. So I apologies to anyone who may not be great at English who was watching this. My apology for that. We would slow down the recording, I guess. Yeah, good plan. Okay, I'll do my outro and then we'll sign off. Bye, Brody.
Starting point is 01:15:14 Okay, my main channel is Brodie Overson. I do Linux videos there six-ish days a week. Sometimes I stream as well. I've got the gaming channel, Brodio on Games. Right now I'm playing through Shenmoo 2 and Metal Gear Solid. If you're watching the video version, this, you can find the audio version of basically every podcast platform at Tech Over T. There is an RSS feed as well.
Starting point is 01:15:31 The video is on YouTube at Tech Over T, and so is it's on Spotify as well if you like Spotify video. How do you want to sign us off? What do you want to say? Me? Yes. I never tell anyone they're doing this. It's my favorite part.
Starting point is 01:15:43 iterate all the software. Perfect.
