The Changelog: Software Development, Open Source - The world of open source metadata (Interview)

Episode Date: November 5, 2025

Andrew Nesbitt builds tools and open datasets to support, sustain, and secure critical digital infrastructure. He's been exploring the world of open source metadata for over a decade. First with libraries.io and now with ecosyste.ms, which tracks over 12 million packages, 287 million repos, 24.5 billion dependencies, and 1.9 million maintainers. What has Andrew learned from all this, who is using this open dataset, and how does he hope others can build on top of it all? Tune in to find out.

Transcript
Starting point is 00:00:00 Welcome, friends. I'm Jared, and you are listening to The Changelog, where each week, Adam and I interview the hackers, the leaders, and the innovators of the software world. We pick their brains, we learn from their failures, we get inspired by their accomplishments, and we have a whole lot of fun along the way. This episode features Andrew Nesbitt, who builds tools and open datasets to support, sustain, and secure critical digital infrastructure. He's been exploring the world of open source metadata for over a decade, first with libraries.io and now with ecosyste.ms, which tracks 12 plus million packages, 287 million repos, 24.5 billion dependencies, and 1.9 million maintainers. What has Andrew learned from all of this? Who is using this open data set and how does he hope others can build on top of it all? You're about to find out. But first, a big thank you to our partners at fly.io, the public cloud built for developers who ship.
Starting point is 00:01:06 We love Fly. You probably will too. Learn more at fly.io. Okay, Andrew Nesbitt talking ecosystems on the Changelog. Let's do it. Well, friends, Agentic Postgres is here. And it's from our friends over at Tiger Data. This is the very first database built for agents and it's built to let you build faster.
Starting point is 00:01:27 You know, a fun side note is 80% of Claude was built with AI. Over a year ago, 25% of Google's code was AI generated. It's safe to say that now it's probably close to 100%. Most people I talk to, most developers I talk to right now, almost all their code is being generated. That's a different world. Here's the deal. Agents are the new developers. They don't click.
Starting point is 00:01:49 They don't scroll. They call. They retrieve. They parallelize. They plug into your infrastructure at the places you need to perform, but your database is probably still thinking about humans only, because that's kind of where Postgres is at. Tiger Data's philosophy is that when your agents need to spin up sandboxes, run migrations, and query huge volumes of vector and text data, well, normal Postgres, it might choke. And so they fixed that. Here's where we're at right now.
Starting point is 00:02:13 Agentic Postgres delivers these three big leaps: native search and retrieval, instant zero-copy forks, and an MCP server, plus a CLI, plus a cool free tier. Now, if this is intriguing at all, head over to tigerdata.com, install the CLI, just three commands, spin up an Agentic Postgres service, and let your agents work at the speed they expect, not the speed of the old way. The new way, Agentic Postgres, it's built for agents, is designed to elevate your developer experience and build the next big thing. Again, go to tigerdata.com to learn more. Today we're joined by, for us, an old friend, but it's been a long time. Andrew Nesbitt is here with us. And you know, Andrew, I came across ecosystems, which is eco-dot... no, ecosyste.ms. Nice domain hack, hard to say out loud, but it looks cool in the URL bar.
Starting point is 00:03:32 I came across this and I thought, this is a very cool project. It seems somewhat familiar. I can't quite put my finger on what it could possibly be. And then I saw it was from you and I'm like, oh, it makes totally sense. Like it makes total sense. This is right up your alley. Oh, you've had you on the show many times back in the day talking Octobox, talking. Libraries, I.O. talking Ruby ecosystem and dependency management.
Starting point is 00:03:55 And it looks like you're still out there kind of beating around that same bush. So, first of all, welcome back to the show. Thanks for having me. Yeah, it's great to be back. Ecosystems. It is, I mean, okay, we have a lot of context that maybe our listeners don't share. But take us back to what you're interested in, which seems like you've been interested in similar things for a long time. And you built libraries.io around this. And ecosystems is a very, similar thing. I'm wondering if it's the same old thing or if it's a new, new thing. So tell us
Starting point is 00:04:25 about your past and like collecting and organizing dependencies and the information about them and open source projects, sustainability, and then what that brought you, how that brought you to ecosystems. Yeah. Okay. So I have been swirling around the world of open source metadata for must be coming up to 10 years now. Starting with 24 pro request, That's right, 24 poor requests, yeah. That didn't kind of start from metadata, but the idea of that project was to encourage people to contribute to open source as part of kind of the run-up to Christmas. And after kind of like first getting that off the ground, we quickly ran into like, oh, how do we suggest like where should people go and contribute to? And a lot of people would try and send a poor request to a project that had no activity and like the maintainer was gone.
Starting point is 00:05:20 or just like we're struggling to be able to even like work out how to send a poor request to some projects because they were really not very friendly or easy to contribute to. And that kind of led me down this path of like, okay, well, what's a, how do you define what a good project is? And then like can we scale that up rather than manually having to have people kind of like submit their things and keep those things up to date every year because that project would, just kind of come and go every December and shut down afterwards. So the maintenance there couldn't be entirely human because there was thousands of people contributing to that project and like sending poor requests.
Starting point is 00:06:05 And it was a lot of data to try and work with. So I started to build out some basic metrics there to try and go like, does this project look like it has activity that's happening on it? Does it look like it's ever received third party like contributions and things like that? and that led me to kind of I got a job at GitHub from there and then GitHub promptly fell apart internally. Tom Preston Warner left. It was a horrible time.
Starting point is 00:06:32 And so then I left there and started Libraries I. O is essentially a like, okay, well, looking at package manager metadata is a different way of kind of getting some measure of what's an interesting open source project. Like rather than just using stars, which stars is a terrible metric and has very little kind of bearing on a lot of projects, especially as you go down from the kind of like the massive frameworks, the kind of those huge keystone projects.
Starting point is 00:07:05 Once you get down to smaller libraries and also especially the kind of like low level critical projects that are doing a lot of the kind of the real work, they don't get a lot of attention and a star is basically a measure of attention, how many people are landing on that GitHub repo page. So package manager metadata was like, oh, this is really juicy, because it kind of gives me a hook into saying, like, these libraries are being used by other people. But download stats again, available for most package managers, but not all, is often kind of wildly all over the place for certain projects, especially if they're used a lot in CI, that you'll just see, like, really inflated download stats.
Starting point is 00:07:50 And you also don't necessarily see those for dev dependencies, the things that, you know, people, especially maintainers, are installing on their laptops to be able to work on those projects, but they're not necessarily a runtime dependency of all the applications. You know, there are definitely gems that Ruby and Rails devs use locally, but aren't shipped with the Rails app, so you would never see those numbers. And the insight that I kind of accidentally tripped over was if we go mining the dependency information out of open source repositories, at a large scale, you actually start to get a really good picture of how people really open, like use open source and how they don't
Starting point is 00:08:38 use open source. Like if a project breaks, you probably don't go and unstar that project. Let's be honest, like not many people are unstarring things. They don't remember. And also, you don't, like, un-download a thing. The download count remains after you downloaded it and was like, oh, this doesn't actually work or this is not what I wanted or has become unmaintained. Whereas actual, like, I depend on this thing.
Starting point is 00:09:06 If I remove that thing as a dependency, then numbers go down. And you get a really interesting, strong signal that something is maybe not quite right with that project. So that kind of led me onto a path of I should just try and index the dependencies of every open source project
Starting point is 00:09:23 ever. And Libraries I.O. was started out as a search engine designed to be like, I can help you try and find the best package and that was primarily like that this package is well used. So therefore that implies
Starting point is 00:09:40 like that it has good documentation that it actually works and other people are using it as kind of a proxy. And it grew and grew and became a massive and expensive and difficult project to maintain as a side project whilst I was doing contracting. And me and Ben, who are working on it, were like, well, what are we going to do? How can we turn this into a, you know, a sustainable project that can fund itself? and we at the time GitHub had just implemented its own dependency graph as well
Starting point is 00:10:18 along with purchasing Dependibot and that basically they started giving that away for free that pulled the like pull the rug out of any plans we had to monetize libraries I.O. directly as well as a project I was building called Dependency CI which never really got off the ground but was back in the day was like oh, this is really cool because it could literally like block your poor request to say you're trying to add a dependency here that is not good because it doesn't have a license or it's got security issues or other things. And so we ended up selling to Tidelift
Starting point is 00:10:56 to try and find some way of recouping the costs of building out that project. But just before we did, we also made all of the code open source and all of the data open source. And all of the data open source. So it was kind of like an airdrop into the community to be like this is always going to be here if you want to use it for purposes. Didn't really work out at TideLift. There's a big cultural difference in the founders at TideLift compared to me and Ben. Me and Ben are very like we really like building and solving problems in the open and shipping stuff really quickly and seeing kind of iterating on those things and tidelift cultures was because they just sold to another company yeah who bought tidelift sooner i can't remember the name it's it's a security
Starting point is 00:11:49 company um and as a shareholder of tidelift i can tell you i didn't get anything from from the sale um bummer did you know libraries i o was there and was open source and uh after I took a break for a little while during the pandemic, which, you know, everyone had a kind of a crazy time. I went to do some contracting with Protocol Labs, basically kicking the tires on IPFS and file coin and trying to use it as a real user. It's an interesting time of actually try. Like the try. The try and that was pretty cool.
Starting point is 00:12:33 And then at the same time, was talking to Schmidt Futures, which is now Schmidt's science, but one of the kind of sub-foundations of the Schmidt Foundation, who are basically saying, like, we have researchers that were using the data from Libraries. I.O. for research. But now libraries I, like, when I left TidLift, they started to remove features of libraries, especially the API access and the data. And Schmidt Futures basically came along and said, like, could you stand up another copy of it? And I was like, we could do that.
Starting point is 00:13:12 But what if we rebuilt it from the ground up as infrastructure for research purposes, rather than taking the same code, which is like one big search engine, one honking great Rails app, and actually make it into kind of a slightly more, like take all the lessons learned, but instead of building it as a search engine, instead build it as a base layer of open source metadata, which then can be used to build a
Starting point is 00:13:43 library's AI on top of it. And that also means like we can take some of those lessons that were like, oh, actually it turns out contributing to a project that has one absolutely enormous database schema is really difficult. Like trying to stand that up yourself. is really hard as a contributor. So people would just bounce straight off the project because they're like, well, there's no way I can possibly comprehend how big this, like the stuff that's going on here. And then also the performance implications of deploying a change
Starting point is 00:14:15 that might be like, oh, you're about to touch a table with like a billion rows in it. That's going to be difficult for you to test without me giving you production access. And I really don't want to do that to random third party. open source contributors. And so ecosystems is essentially a do-over of libraries I.O. It's many different Rails apps that are focused on collecting different kinds of open source metadata and then combining them together in different ways. So there's a packages service, there's a repo service that collects the dependency
Starting point is 00:14:54 information from repositories, there's an advisory service and a commit service. and a commit service and an issue service, basically all the different things that you might be interested in. And each one of them can then be independently worked on and scaled up as like different amounts of data pour in and kind of collect in different places. And that has been going on for nearly, I want to say, three years now,
Starting point is 00:15:20 really kind of like going from, it was a nice kind of year where I just worked on it myself, didn't really tell anyone about it, just kind of like plugged away it. And there are core pieces, because Libraries O's open source, I was able to reuse like the dependency parser and a load of the mappings to the package managers,
Starting point is 00:15:41 actually like take that code and kind of reuse that in a way that is also allows you to have multiple different package manager registries where libraries O would only support one, which was really nice when Ruby Gems had its, all of its drama recently and the gem.coop popped up. I was able to go, oh, I can quickly start indexing gem.comop. It just fit straight into that new schema.
Starting point is 00:16:10 And then, like, since kind of like the past year, it's just absolutely exploded in usage. The amount of traffic today alone was 50 million requests to the API. Wow. And has become quite a piece of critical infrastructure to a number of different kind of areas of open source in terms of S-bomb enrichment and also trying to find those critical pieces of open source that need security work or need sustainability efforts to be kind of coordinated around them. Well, I'm happy to hear that you got to reuse some of your code from libraries I.O. because when I thought was going to happen, when you said I airdropped it,
Starting point is 00:16:56 I thought you were going to just catch your ownirdrop a few years later and be like, and because I open sourced it, I just relaunched it under a new, but obviously the big rewrite is a very tantalizing thing, especially when you've been living with all your mistakes for this time, is like, let's start over. But you got to reuse some of your code, which is really awesome.
Starting point is 00:17:14 So nice job open sourcing that when you still had an opportunity to do so. Yeah, absolutely. You mentioned this is used in, research, I guess, research terminology, so to speak. What exactly does that look like? Who are those folks? What kind of research are they doing? Are they developers?
Starting point is 00:17:32 Are they developer adjacent? I think they're mostly developer adjacent or in the research space. I guess you'd call them like research engineers where lots of computer science researchers are like, we want to study what these kind of behaviors are like across different package managers, or comparing, like, what are developers doing in this space versus that space, especially around the dependency stuff to be able to go like, oh, the average number of dependencies in a JavaScript app compared to a Ruby app, for example, which I think is about 10x. And then looking at kind of can you go down those dependency chains and find where the,
Starting point is 00:18:17 the security problems are or the licensed problems are and also leading into kind of like how can we encourage best practices in this space or looking at how to work out like how many projects have have taken on these various kinds of like especially just recently had a call with someone who's looking at all the attestations around trusted publishing like how many can we see like the share of usage of packages that have the trusted publishing setup and are publishing attestations into Sixthore compared to like the overall space and also then breaking that down across different ecosystems as well.
Starting point is 00:19:00 Okay friends, Augment Code, I love it. This is one of my daily driver AI agency use. Super awesome, CLI, VS code, JetBrains, anywhere you want to be, Augment Code can bring better content, context, better agent, and of course, better code. To me, Agwin code is by far one of the most powerful AI software development platforms to use out there. It's backed by the industry leading context engines.
Starting point is 00:19:25 The way they do things is so cool. You get your agent, you get your chat, you get your next edit, in completions, it's in Slack, it's in your CLI. They literally have everything you want to drive the agent, to drive better context, to drive better code for your next big thing, for your big thing you're already working on, or whatever you have in your brain. you want to dream up. So here's a prescription.
Starting point is 00:19:45 This is what I want you to do. I want you to go to augment code.com. Right in the center, you'll see install now. And just go right to the command line. There is a terminal C-L-I icon there. Click that. And it's going to take you to this page. It says install via N-PM.
Starting point is 00:20:01 Copy that, pop into your terminal, install augment code. It's called Augie. Instantiate it wherever you want to. Type in A-U-G-G-I-E and let loose. You now have all the power of augment in your terminal. Deep context, custom slash commands,
Starting point is 00:20:17 MCP servers, multimodals, prompt enhancers, user and repo rules, task lists, native tools, everything you want, right at your fingertips. Again, I'll get code.com's with my favorites.
Starting point is 00:20:28 You should check it out. This might be silly, but let me ask you this. I've been researching some CLIs and I've been researching how CLIs install themselves. Sometimes they'll leverage, the actual package manager of the distro like a Linux distro or something like that but most by
Starting point is 00:20:49 large just give you a URL to curl and pass to to bash essentially which can be problematic if you don't trust the the script if I wanted to research I guess somehow research CILized and how they install themselves and the various ways they install themselves is that something that this service could do like is that the level of research I could do yeah I mean I mean, for one thing, you would be able to quickly find everything that had kind of like tagged itself up as a C-L-I program. I'm also been indexing every image on, every public image on Docker Hub and basically running an S-bomb scanner against each one of those. There would be some juicy insights there to be able to go, like how many of these things were installed via a, like a distro package manager versus like we just have a URL for this which would be
Starting point is 00:21:51 recorded in the S-bomb basically to say like oh we found this known bit of open source and it appears to say that like it sits in the file system here which implies it was installed by apps or it's in a random space like it was probably curled down along with the Docker file that was used to build that image. And there's a good kind of million open source Docker images on Docker Hub or at least individual versions of things. And you also get the interesting aspect there
Starting point is 00:22:29 of that you can kind of multiply that by the number of downloads that some of these Docker images have. And some of those numbers are crazy, like millions and millions of downloads of a particular image And of course, those numbers inside that one container are like never reflected in the package managers upstream. So just because it was downloaded in Docker doesn't mean, you know, that actually shows up as being a million downloads in Ruby Jems or on NPM. So you start to see some really interesting things and you start to see those download numbers or the proxy for a download number of distro packages. as well, which is a really hard number to get hold of because, you know, every distro package manager is very heavily mirrored and basically like just a file system somewhere exposed over HTTP or R sync. So no one has good download stats for those things. The only place you really find that is the Debian popularity contest, which is opt in, not opt out. So you'd be able to go like, oh, okay, well, I can see here are.
Starting point is 00:23:42 the CLI programs that are being like manually downloaded inside of docket images as part of this install process. It's not going to give you everything, but it certainly gives you a good proxy for like, okay, where I can see where like relative usage of these things starts to show up, which is, you know, where I found the most useful ways of kind of sorting different piles of packages or whole registries is to go like, okay, well, if I sort this registry by the number of dependent repositories or the number of dependent packages, like which things show up at the top, and then also, like, which of those things make up 80% of all of this stuff?
Starting point is 00:24:28 And you actually end up looking like, if for 80, like, I like the 80-20 rule, but it doesn't actually turn out to be like 20% of packages make up 80% of usage. It's like 0.01% of packages make up 80% of usage. It's tiny amounts. Like there might be 2,000 node modules total that make up 80% of all of the usage of MPM in terms of downloads and in terms of like discrete dependent repositories, which is like when you then start to really focus that lens, you see a long tail of stuff that never gets used. And there's also like all kinds of spam.
Starting point is 00:25:09 and malware and stuff that floats around. But there's like a 10, 15,000 packages, which are like the packages that make up most open source usage across all these ecosystems. It's kind of amazing how massive that asymmetry is when you pin that down to individuals. Yeah, and that's like on average one maintainer per package at that critical level as well.
Starting point is 00:25:36 So that's like 15,000 people maintaining all of open source usage. That makes the SKCD comic even more poignant, you know, the one person in Nebraska, you know, replace Nebraska with wherever they are around the world. Probably in different towns. And how many of them have you had on the change lock?
Starting point is 00:25:55 That's a good question. Probably a good percentage of those. Oh, man. So there's all, there's 15,000 people basically running the world for free. Wow. I have done a little bit of indexing of, you know, like how many of those have GitHub sponsors or are their projects on Open Collective or they have some other kind of funding link?
Starting point is 00:26:18 And in terms of those top critical packages, it comes out to kind of like, depending on the ecosystem, it's somewhere between 25% and 50% have some way of, you know, like, here's an automated way you can give me a donation to the project. There's a good chunk of those as well that are, you know, massive. corporate funded projects like all of the AWS ruby gems that make up the AWS CLI are in the top of Ruby gems because they're just massively used they don't need any funding right because that's Amazon has full-time staff but there's a good they might need some funding like right here they're laying people off again hopefully they didn't lay off all the
Starting point is 00:27:02 the Ruby people maintaining the CLI there that would be awful so do you track so you're tracking those who are able to receive funding in some sort of automated fashion. Do you track funding itself? Like who's getting how much money and how? Yes. Well, where possible. So I'm tracking, I call it a funding link and some package managers have funding links support
Starting point is 00:27:25 where you can say like, oh, I get, you can donate to me over here. Repositories have the funding YAML file and I go looking for that wherever possible. And you actually see that even on GitLab and Codeberg. I don't know how well those platforms display it in the UI,
Starting point is 00:27:45 but it definitely, because obviously GitHub sponsors is not, I don't think there's a GitLab sponsors or a Codeberg sponsors. Those files do show up all over the place. And then also being able to go, like this repository is owned by a user on GitHub, who is part of GitHub sponsors,
Starting point is 00:28:04 is another way of kind of detecting that, even if they haven't, added their funding YAML file, we can kind of make a hop to say, like, oh, here's one of the maintainers to be able to support that. And I then collect the data from GitHub sponsors of every, because GitHub sponsors users are public. You don't get any financial numbers, but you do get, like, here's the number of active sponsors of things.
Starting point is 00:28:28 And here's the total, like all time. It's quite hard to get time series data out of that API. So instead, I basically just kind of. kind of snapshot it on a regular basis to go like, oh, here's what the current state of the world is in terms of GitHub sponsor funding. It's a bit weird, though. There's a lot of people who have realized that GitHub sponsors is actually quite a good way to sell digital goods.
Starting point is 00:28:55 If you go looking at the top users of GitHub sponsors who have the most people funding them, they sell things like avatars and Discord memberships and e-books and things like They're not necessarily kind of selling, like, oh, you can, I can maintain this project better for you. That's, that's not the, like, Open Collective is so much bigger in terms of actually, like, supporting the projects as a, as a collective because they're just set up in a totally different way to get sponsors. Yeah, that's fascinating. So they're kind of doing sponsorware insofar as it's not a donation or you're supporting my work. on this project. It's like, I'm actually, there's a quid pro quo here.
Starting point is 00:29:42 You're like, we're going to trade a good or a service for that sponsorship money. Really, it's a purchase of, yeah, yeah. Like, if you go looking, it's easy to see GitHub doesn't make it particularly. Like, they don't have a leaderboard, which is a good thing to not, like, putting a leaderboard on things can often produce them very strange behaviors. There's also an interesting breakdown of, like, number of users who sponsor other maintainers versus companies. Obviously, companies are going to sponsor a lot more in total amount per company. But the distribution is quite surprising in, you know, like, you're looking at
Starting point is 00:30:21 easily 10 times as many individuals are sponsoring other people on GitHub sponsors compared to the number of organizations. Like, it's quite small, really. And most of that activity is public. so it's not like there are you can be anonymous as a GitHub sponsor but you can't really hide the fact that you are that there is a sponsorship happening there there's also on Open Collective some massive donations that go to certain projects through like company sponsorships because you know they're acting as a fiscal host rather than just being like a platform to collect tips which is basically how get sponsors works reminds me is way back in the day Chad Whitaker's get-tip, which was later for Grat-Pay, remember that?
Starting point is 00:31:10 And it felt all warm and fuzzy because people were getting money for their open source. But when you go looking at it very closely, most of that was like the same 50 bucks getting passed around between friends, like not a slush fund, but like they just felt good. And so like I would make 20 bucks a month and I'm using open source. So I would give it to somebody else. And there was really no new, not enough, new money coming in. It was really just money that already existed amongst all of us maintainers, kind of patting each other on the back which was unfortunate but just the way it started
Starting point is 00:31:39 I definitely do that like I sponsor 35 different people on GitHub sponsors of just a few dollars a month to just be like I appreciate your work I don't have a huge amount of to support you with but like just as a way of saying like I notice you and like appreciate that you continue to maintain these things that I use well I hope I hoped GitHub sponsors was like big enough and mainstream enough to kind of change the the shape of that and maybe it's done it some but it sounds like there's still more indies passing you know person-to-person kind of sponsorship than there is corporate person but yeah i think the change of interest rate across the world yeah had a massive impact like you can see oh the nice thing about
Starting point is 00:32:23 open collective is they are especially open source collective is very public you can see the amounts of donations like going in and going out and there was a big drop around the time that post-COVID hit and changed all of the finances of these things was like, oh, okay, well, open source is no longer like one of the, it's an easy line item to drop, right? Because everything is free and it just continues to work for now until a security problem comes along
Starting point is 00:32:56 and then everyone starts scrambling again. So you've got 12 million packages being tracked, 287 million repositories, 24.5 billion dependencies, 1.9 million maintainers. I'm reading these stats off of your website. There's a timeline of public events on GitHub. There's issues. There's commits.
Starting point is 00:33:16 I mean, there's just tons of different data points that you're tracking. How do you store all this stuff? Where do you store it? How big is it all? Because I'm just thinking this is a data management nightmare. So that 24 billion dependencies is a bit of a headache. I bet. I mean, that's crazy. Almost all of this is stored in Postgres.
Starting point is 00:33:40 Okay. Individual Postgres instances on dedicated machines in France and Amsterdam, mostly because they're very affordable. Online.net is a very reliable host similar to Hertzner or some of these other kind of like bare metal machines. So I do the maintenance of the machine myself, and obviously scaling up is a little more tricky because there's not just a nice Heroku slider anymore. I use Doku as essentially like the open source Heroku, which is really nice. Just Git push, it builds your Docker image, and then it handles putting EngineX,
Starting point is 00:34:25 kind of proxying all of those things. Very nice for like an individual machine. It doesn't really give you any kind of multi-machine. things, but I try to avoid too much complexity when there's only a very small number of people working on doing the infrastructure, and it's mostly me, rather than I calculated like a back of the napkin thing the other day. I think it would cost me 15 times as much to host on AWS as it does to host it on dedicated machines right now. But these Postgres, each service basically has its own database. So rather than it being one that is enormous, it's split out, which at least
Starting point is 00:35:06 makes it kind of like, I can work on individual ones and be like, oh, this one is reaching capacity, so it's time to scale it up. Or I should make another box of web machines or sidekick workers separately. I don't need to kind of do everything in one big lockstep, which keeps it fairly easy to do. And then the whole website is basically read only. Like you can't. can't ask, you can't put data into it as a user. You read from it. And it, all the data comes in in the background through loading data from package managers and repositories. And there's about 2,000 different Git hosts in there that I'm constantly crawling at different rates to go like, oh, there's, there's new activity over here. So I can cache things very aggressively at the kind of
Starting point is 00:35:56 HTTP layer, I think the cash hit rate at the moment is about 60% in Cloudflare. At some points, I've got it all the way up to like 95%, but then you get some AI bots come along and they do some weird stuff and it's very hard to cash such a long tail of billions and billions of URLs that might exist on the platform and Cloudflare on the free plan is not going to cover you know, an unlimited amount of cash. You'd just kind of keep rolling over the cash over and over again. Is this a solar project again? Or is this you and Ben back together?
Starting point is 00:36:34 So Ben is working on it part-time. He is also the one of the directors at Open Source Collective, which is, you know, that's a lot of work in itself. Yeah. And then we have a few people who are doing some part-time work. Martin has done all the design work, which looks so much better than my efforts of the original. You can see that.
Starting point is 00:36:58 And there's a couple of older hidden webpages there that are very poorly designed, which is just me like making some plain bootstrap pages. And we just had James come on to help with making the project like better documented and easier to onboard as a contributor because I was running so fast on standing everything. up and scaling it up and collecting all that data that I didn't really leave a lot of documentation along the way, which is terrible. But hopefully, like, these are pretty basic Rails apps. There's not a lot of interesting stuff, like intentionally trying to make it the most boring tech
Starting point is 00:37:38 possible so that I can focus on the interesting stuff, which is like the passing or the mapping of the metadata, which is like each app has that core little nubbin of like, oh, here's where the real logic sits and that's like a nice well-tested bit of functionality with a load of rails scaffolding around it to be like, okay, write this into Postgres and then serve it up as in kind of the quickest way possible. How many apps is it now? Oh, good question. It must be coming up to 20 but some of them are quite small. Like there are, there's a load of services that are kind of like stateless. Like I will just give you a shah 256 of a tarball that you get from Ruby gems or similar. And a lot of those I basically have on the chopping block to try and turn into something a little bit more like,
Starting point is 00:38:31 imagine a GitHub actions, but for analyzing packages. So rather than it happening every time that you commit or every time you open a poor request, instead it'd be like, you can define, I want to run this kind of analysis on this package when it, a new version comes out that might be like copyright and license extraction or it might be do me a capabilities analysis of this go package using the caps lock library which will basically go like oh this library just gained network access and it can read environment variables and it became a crypto miner would be a great way of like being able to highlight some of those changes so i want to pull it down and make it a little bit kind of like fewer services
Starting point is 00:39:15 but one of those services will be basically the like, which open source analysis do you want to run against this package? And then here's a massive fire hose of every activity that is happening. And you can hook those analysis in to say like, okay, I want to run Zizmore every time I see a GitHub action change because Zizmore does the security scan on the YAML config to go like, oh, you've just introduced a footgun of GitHub actions. here and then try and publish all of those analysis back out as a public good,
Starting point is 00:39:52 just basically fling that into S3 or something as a way that allows researchers again to go and do broad analysis over the whole ecosystem or multiple ecosystems without having to spend all their time like collecting all of that base data and normalizing it and then setting up infrastructure to run all of that across you know, all of those packages is, I see that time and time and again where the paper is like 50% of the work is, oh, well, we had to collect all of this data and we had to make sure that it all fit into the right box. And then we could actually start doing the interesting research. So what I hope is we get to a place where it's like, oh, you don't need to do that.
Starting point is 00:40:36 You can just use this open data set. And that gives you a good starting point to then start to really dig into like what's going on in these ecosystems. that's the dream anyway. We're certainly working your way towards that. So does Schmidt, sciences, do they flip the bill for all of this work? So they gave a grant initially to get started, luckily because they gave it in dollars and the exchange rate was very positive for a while. So we actually managed to stretch that from a one-year grant into a two-year grant.
Starting point is 00:41:09 And then Open Collective has been supporting the project as well. as a fiscal host, but also as a, like a customer. So I built a number of tools for them to help them kind of investigate ways of trying to expand the ability to kind of let companies fund open source. And then also to try and measure the return on investment of giving two projects and try and be able to see like, oh, if I donate money here or resources, does that turn into actions and changes on the repositories. And that kept me busy for, you know, a good nine months, I think, of building out
Starting point is 00:41:54 tools for them whilst they financially supported the project. And we also have a number of customers who pay for a different license for the data. So the data is CC by SA, which is share like a copy left license. You can use it for whatever you like as long as you also. So persist the license and you credit where it came from. But if you don't want to do that, then you can pay to essentially have a CC0 license. It's not actually CC0 because there's some things there to say like, oh, don't just completely undercut us and sell that on again.
Starting point is 00:42:34 But we have a number of customers there. That basically pays for all the hosting costs. So it's self-sufficient. It runs itself as long as, but you don't get any extra feature development on top of that. So that's like where I'm trying to work on right now is to get that level of sustainability higher. And we just received a grant from Alpha Omega to basically make that happen. That's, Alpha Omega is part of OpenSSF and their goal is like turn money into security. and they have become a big user of ecosystems for doing analysis
Starting point is 00:43:15 of like who are the critical projects in a particular space? Who are the ones that are like going to be most likely impacted if there's a big security vulnerability? Who are the ones who have never had a security vulnerability and maybe don't know what to do if they get one? Things like that. So they have basically given us a grant to try and help make ecosystems long turn sustainable. So that's things like making the project easier for people to on board onto
Starting point is 00:43:45 or and also to be able to kind of like charge large companies in different ways. That might be like, oh, you want an even higher rate limit than the very friendly rate limits that are already on there. Like you want to go even harder or then you can pay for, you know, like a super rate limit or similar. And then also this kind of like pipeline of analysis will be another way that it would basically be like, oh, you want to, you want to run your LLM queries across all these package source code. Well, then you can funnel it through here. We'll just like, tee that up and trigger it every time that we see a new release of a package or similar will be another way that I think would be essentially just like, oh, you're just paying
Starting point is 00:44:32 for our CPU to do this analysis. And then the, the analysis that comes out the other side, if it is like an item potent, I guess, you know, LM queries are not item potent. You're going to get a different thing every time you do an analysis. But for a lot of those things, we just come out as a public good and companies will have paid to have it generated, but then it's shared for everyone to use, which I think is a nice thing. I mean, what I'd really like to be able to do then is to actually do revenue share with the people who are maintaining those individual command line tools that do the analysis.
Starting point is 00:45:07 Imagine being able to go like, oh, we can help with supporting Zismore and Bullet or like all of these different things that are like command line tools that analyze source code. And rather than you build a whole enterprise company around your command line tool, you can just focus on making that tool really good. And then we can run it at scale for customers and then just funnel the money back to the maintainers after whatever investment. structure costs there were to run it so that you can actually focus on building the open source tools rather than building the scaffolding around it. That would be super cool. So it sounds like there's a collection of potential income sources, some that are currently working on the ones that you're working on. The real licensing of the data for a fee seems like a good one. Is that potentially like could you see a world where there's enough people that want to do that,
Starting point is 00:46:03 that that could be enough or no? Yeah, I think so. especially this kind of dependent data, the 25 billion row table, is really juicy in terms of the insights that you can get from that. The general package data, though, is often like you can get clawed to generate you an NPM scraper very easily. Like, if you ask you to do it in Ruby, you get code that looks a lot like libraries I.O. in turtles come out. That's awesome. Do you get a nickel when that happens or what happens? No, unfortunately, no. It looks a lot like.
Starting point is 00:46:42 Yeah. Well, you know, imitation is the sincerest form of flattery. So just remember that. Yes. It's tricky to get that kind of balance of like, I want to give away as much as possible, especially as all of this data comes from open source. Like it is, it should be open because it is data about open source. But then, like, how do you continue to pay for that, uh, whilst companies,
Starting point is 00:47:06 that also can kind of go like, oh, I could just go fetch it from the source myself. And trying to get as many different ecosystems support in is a good way of kind of going, like, you really don't want to try and index the R package manager. Like, you're not going to have a good time. So, like, we try and take care of all of the horrible bits. And then also being able to fetch, like, the Linux distro package managers, which is something that I'm trying to add more distro support in because each one of those has its own kind of like horrible rabbit holes of weird and wonderful metadata and trying to
Starting point is 00:47:42 work out like how does this fit into the schema a lot of it is kind of trying to tie it around the package URL format uh pearl but not pearl the language uh although you can have a pearl pearl for a C-pan that is, you know, a pearl about pearl. That is kind of a kind of come out from efforts in the S-bomb world and was like originally kind of one of the inspirations was the libraries I. being like able to map these things into different ecosystems and kind of say like you have an ecosystem, you have a name of a package and you have a version. Like can we talk about this in a kind of fairly standardized way?
Starting point is 00:48:27 as a way of transporting these package bits of metadata between different platforms that are doing analysis of different kinds. An S-bomb is kind of like the natural conclusion of that. Of course, you have two different S-bomb standards. They can't just be one standard for things. But that being able to look things up by Perl is something that ecosystem serves really well
Starting point is 00:48:53 because you can basically then take an S-bomb and just work through. it every single package that's in there and say, can you tell me about this package? Can you tell me what security advisories are affecting the version that I've got in my S-bomb? And that is like the biggest use right now is there are lots and lots of people with GitHub actions that are just enriching their S-bombs with this kind of information. And they just, it's funny how much more traffic we get on the weekday than on the weekend. And it's, I think it's just because of the GitHub action kind of like, oh, this is happening every time someone commits.
Starting point is 00:49:32 So you see a smash of traffic of them, like, enriching their S-bombs and checking out every package that is in there. And then the weekend comes along. Everyone stops working. And the traffic shape completely changes. And also the cash hit rate completely goes through the floor because suddenly it's like, oh, there's all kinds of other weird and wonderful things happening at the weekends. especially lots more like researchers and hobbyists using it. So you've mentioned a few of the weird, gnarly things like multiple S-bomb specs, etc. You have 35 ecosystems on here, NPM, Golan, Docker, to name a few, right, crates,
Starting point is 00:50:14 Nuget, so you're in that world, across 75 registries. So I'm assuming, you know, some ecosystems have multiple registries. Yeah, Maven especially, there's lots of registries in the Maven, world and then oh even bower.io i remember bower i don't know if people are still using that anyways forever ago man no one adopts anything they don't accept any new packages but there you still find people that use them and download stuff through them yeah so what i'm wondering is like you know where where are the black sheep where is the gnarliest weirdest like let's not i don't want to create any enemies for you andrew but like which of these ecosystems are like the in your own
Starting point is 00:50:50 heart of hearts notoriously hard to work with well The hardest bits are often, like, the change over time, especially when you go back to the really old stuff. The classic one is that you'd think, like, oh, MPM, their names are case insensitive. But if you go and try and index every name in NPM, you will find about 1,000 that are case sensitive and have clashes with the, like, a different case version of the name. And those still exist on the registry. they haven't been removed and so if you try and make an index against that you're going to have a bad time
Starting point is 00:51:29 because as soon as you actually go to run that you're like, oh, that's not like that anymore. So there's things like that when you go back into the time, like going back further and further is like, oh, there's weird things here, especially when the package manager registry has like a document database
Starting point is 00:51:49 rather than something that is like always enforcing its schema in every record and you know mpm used to be couch d which is like oh they've changed some schemas of the package metadata uh so in new packages it looks different than old ones of course now it's actually post squares underneath and it pretends to be couch db which is is interesting and imagine a headache in terms of like actually like maintaining that but they still have some really old and weird like you just run into like ah this bit of metadata isn't right for these few packages because it was frozen in time there as jason in postgres now somewhere um similarly with maven they've got lots of different kinds of pom ximels uh and there's so many features in
Starting point is 00:52:42 the way that maven can like have these nested and parent pombs uh that is i'm not i don't really have a like a background in Java so I've never used maven as a user but the amount of different ways that you can describe the data that is stored in a palm XML and then published out to maven central and course once it's on maven central and it's like frozen in time almost they don't then go and update like if ruby gems adds a new attribute to their registry that becomes available in the metadata for every single endpoint because no it's just a Rails app that's generating JSON. But for the things that store the files as a historical, like, we just dump this file somewhere, then you're like, okay, my code needs to be able to
Starting point is 00:53:32 know every different possible version of this, how this worked, and then also be able to recover from it. The worst one is the R package manager. It's not huge, but it is used a lot in the research space. And they don't have an API. You have to scrape HTML from the thing. They also remove packages quite regularly, which is very strange. So R has this really weird, I think it's because it's come from a scientific kind of like non-developer background. Like R, it also has one indexed arrays, which not many programming languages have that, right? But they, their package manager won't let you pin to an older version of something. It won't say, like, I want version 1, even though there's version 2.0 is out.
Starting point is 00:54:25 And the knock-on effect of that is that, so when, as a user, if I'm going to say, install my R packages, I always get the latest version of everything. That means that if something's broken because something else got a new version, rather than the new version causing the breaking change be told off. It's actually the package that didn't upgrade to fix the problem with this other package that just updated. So if you don't, if you're not proactive in fixing breakages with your package being used with other packages, your package gets removed. It gets kicked out of that registry, which is pretty wild because, you know, people, especially in science trying to make their science reproducible, are like,
Starting point is 00:55:16 oh, my package got yanked. Like, how am I supposed to reproduce this science? It's no longer here. So they have some very strange behaviors where they'll actually make snapshots of the registry. And then, like, so you can say, I want to install my R package from this registry on this day. So you actually have like a weird historical aspect of the thing, which is, it's not like a lot of other package managers. And it's very hard to change because, you know, there's just not a lot of, we don't have a lot of funding in open source,
Starting point is 00:55:50 but in terms of research software engineers, there's no incentive there to maintain and develop software unless it has a paper attached to it, right? You get, if you can get citations, great. Like, you can continue to make a case to keep working on those things. But once it's done, it's done kind of thing. You're like, oh, you already published that paper. I don't need to continue maintaining the software.
Starting point is 00:56:13 that's something that I have an interest in trying to solve, but it's a very hard problem to kind of break into. But what I'd like to be able to do is go, like, can we connect the world of papers and citations back to the software that's being used to especially, like there's a lot of Python code that is like, might not look like it's massively used, but then when you kind of go,
Starting point is 00:56:39 oh, but it's mentioned in all these papers, especially the kind of AI papers, as well, which are just, like, exploding at the moment, if you can then say, like, we can send some of this transitive citation credit down the dependency graph to the transitive dependencies of the things mentioned in a paper, like, I bet there are maintainers who have no idea that their, like, low-level Python or Julia code is being, like, referenced in these massive papers. Like, that's the discovery aspect there, but also for the people that do know to be able to go back to their institution and say, look, my software is supporting all of this research that you're publishing, you should also support me because that will make your research better, would be a really cool thing to make happen.
Starting point is 00:57:29 Until they say, well, we already published those papers, so who cares? That attitude makes it tough for sure. Yeah, there's a lot of still that kind of like, oh, open source is just there. I can just use it. I don't need to contribute back in any way because someone else will do it is still a totally unsolved like social problem, I think,
Starting point is 00:57:49 in the wider open source space. Well, if somebody wants to write a paper on the reproducibility problem in scientific papers due to mismanaged packages in the R language, I think that would be a hit. I think it would be a hit.
Starting point is 00:58:05 Oh my gosh. I'm still dumbfounded that they would not let you pin to an older version. I know. I feel like that's going to break so many research projects that go stale, essentially.
Starting point is 00:58:17 Well, there's the software heritage project, which is a massive index of like the hashes of every file ever published to any open source thing, is basically was produced to try and help solve that problem. Like you had to make a full index of every file in every Git repository
Starting point is 00:58:36 to be able to try and get around the fact You can't pin to older versions in ours package manager. I mean, there are still other package managers that don't have lock files in them, which if you think, like, years ago, yes, it wasn't such a problem. But nowadays, like, lock files are so critical to the way that people, like, build and maintain and share their software to be able to go, like, oh, it works on my machine. It should work on yours because, you know, you're literally installing the same set of dependencies. And Docker works for that at a high level. But as soon as you want to change one
Starting point is 00:59:15 thing, you obviously blast away the whole Docker image and have to start over. Whereas the lock file works really nicely at the language level to be able to kind of solve that problem. If your package manager doesn't have one, you should definitely try and like get that added in somehow. What if AI agents could work together just like developers do? That's exactly what agency is making possible. Spelled AGN, TCI, Agency is now an open source collective under the Linux Foundation building the internet
Starting point is 00:59:48 of agents. This is a global collaboration layer where the AI agents can discover each other, connect, and execute multi-agent workflows across any framework. Everything engineers need to build and deploy multi-agent software is now available
Starting point is 01:00:04 to anyone building on agency, including trusted identity, and access management, open standards for Asian Discovery, Asian to Agent Communication Protocols, and modular pieces you can remix for scalable systems. This is a true collaboration
Starting point is 01:00:20 from Cisco, Dell, Google Cloud, Red Hat, Oracle, and more than 75 other companies all contributing to the next-gen AI stack. The code, the specs, the services, they're dropping, no strings attached, visit agency.org, that's agn, tcy-y-org to learn more and get involved.
Starting point is 01:00:40 Again, that's agency, agn-t-c-y-org. So your team has amazing ideas flying around. You know the feeling, but turning them into something real. Feels like wading through peanut butter. Super thick, right? Peanut butter is tough to walk through.
Starting point is 01:00:57 We've all been there. The gap between idea and impact, it is brutal. And just throwing AI at the problem without clarity, that only makes things worse. We all know that. That's why I checked out Miro, investigated it, love it. And that's why I recommend it.
Starting point is 01:01:13 Miro is the innovation workspace that helps teams get the right things done faster. Powered by AI, teamwork that used to take weeks now takes days. You can use Miro to plan product launches, map complex workflows. You can even generate fresh ideas from interviews, all in one place. And the Miro AI sidekicks? It's like having your own product leader, agile coach, and even a product marketer ready right there to review, clarify, and give feedback
Starting point is 01:01:42 right inside your workspace. It's cool. You can even build custom sidekicks tailored to your workflow. Plus, Miro Insights pulls together sticky notes, research, and docs into clean summaries so you spend time building, not digging.
Starting point is 01:01:59 Help teams get great done with Miro. Check out Miro.com. That is M-I-R-O dot com. Once again, Miro.com. Behind the scenes, I've had some AI literally obliterating your API. With the polite mode on, of course.
Starting point is 01:02:19 I've passed my name so you can track all the things I'm trying to do here. But it has finally found a way to craft a script that will pull back essentially some version of curl -fsSL <url>, where the URL is wherever the thing lives, and then piping that to sh. And so I've got a nice dramatic list of projects, through research, that use that command, and
Starting point is 01:02:48 what that install.sh script looks like and what some of the details in there are. So it didn't take long, but my gosh, if I did not have AI to do this for me, I would have pulled my hair out so badly. And probably not obliterating your API by any means, but just more like: you can get the data, it seems, but you've got to comb through it.
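For the curious, a small sketch of the kind of check Adam's script was doing: fetch a project's install script and flag the curl-pipe-to-shell pattern. The URL is a placeholder, not a real project or ecosyste.ms endpoint, and the regex is a deliberately narrow illustration.

```python
# Sketch: flag the `curl ... | sh` pattern in a fetched install script.
# The URL below is a placeholder; the pattern is a simplified illustration.
import re
import urllib.request

PIPE_TO_SHELL = re.compile(r"curl\s+-fsSL?\s+\S+\s*\|\s*(ba|z)?sh")

def risky_install_lines(url: str) -> list[str]:
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [line for line in text.splitlines() if PIPE_TO_SHELL.search(line)]

# Usage (placeholder URL, uncomment to run against a real script):
# for line in risky_install_lines("https://example.com/install.sh"):
#     print(line)
```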
Starting point is 01:03:07 You've got to be persistent and very... Wow, there's a lot of kind of... The schema is not simple. No. Unfortunately, it's hard to find a way to describe that in a way that doesn't just... Like, people will just switch off and kind of glaze over as you start going down into the levels. Something I've also tried to do over the past couple of years, as the AI bots have kind of gone mad, is actually let them scrape the website, right?
Starting point is 01:03:34 Rather than block them, I've said, you can go mad, in the same way as I used to let Googlebot go mad on libraries.io. Because two years in, like, we've had a full training cycle of the frontier models, they actually know what ecosystems is and they know the structures of the APIs, and they can actually just suggest those things, which is like a good and a bad thing. But I think in terms of, like, being able to get into the training data, in terms of, like, my API is here
Starting point is 01:04:07 and my service exists, is helpful to people who are using AI coding agents to do some of these things. I have dabbled in the MCP world with this stuff, and it would be very easy for anyone to build an MCP adapter on top of this. But the security implications really hurt my brain. So I have kind of, like, held off going hard into it, because, you know, every string that is returned by their MCP is essentially, like, a prompt injection. So you imagine your version number that is pulled from an npm package and then fed through an MCP server into your context. They have the ability to make a version number, especially if it's, like, semver with your pre-release string on the end of the version
Starting point is 01:04:57 number. You could make, like, prompt-injection versions, where I just start putting, like, ignore-all-previous-instructions-1.1 in the strings. The things that come from the package managers are suddenly a security vector. Or even just the description of the package, or the name of the package. Like, there's a lot of trust that happens on that kind of go-through when it comes out as an MCP server on the other side. If you're just saying, like, blindly install whatever the MCP server told me, then there's a lot of trust that you're putting into many layers of indirection that could happen. And we've definitely seen, like, loads of threat actors have realized how, like,
Starting point is 01:05:48 I'm going to use the word juicy so many times. But in terms of being able to go, like, there's no restrictions, I can publish things to a package manager, and that might be, like, the sixth level of indirection before I actually get to my target. It's very hard to see all of the moving pieces until they actually kind of all come together. But most of these package managers have zero restrictions on what you can do. Like, even GitHub only just recently started kind of saying, like, there are certain restrictions on how automated the npm publishing can be, because people were
Starting point is 01:06:25 literally, like, every commit, I'll just publish a new version. Why not? There's no restrictions. Like, 100 versions a day, which is like, why are you doing this? Well, because we could. And the cost to the registries is mad as well. Like, you see that PyPI are just showing their numbers continue to grow. And they're like, well, how the hell are we going to continue to fund this? Because it doesn't look like it's going to stop anytime soon.
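A hedged sketch of one mitigation for the version-string injection Andrew worries about above: treat every registry-supplied string as untrusted, and only let strictly validated values through to a model's context. The validation rule here is a simplified illustration, tighter than real semver on purpose.

```python
# Sketch: validate registry-supplied strings before they reach an LLM context.
# The version pattern is deliberately stricter than real semver pre-release
# tags, which is exactly the looseness the injection trick exploits.
import re

STRICT_VERSION = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.]{1,20})?$")

def safe_for_context(field: str, value: str) -> str:
    if field == "version":
        if not STRICT_VERSION.match(value):
            raise ValueError(f"suspicious version string: {value!r}")
        return value
    # Free-text fields (name, description) can't be validated this tightly;
    # at minimum, cap length and strip non-printable characters before use.
    return "".join(ch for ch in value[:200] if ch.isprintable())

print(safe_for_context("version", "1.2.3-rc.1"))           # passes
print(safe_for_context("description", "a nice package"))   # capped and stripped
# safe_for_context("version", "ignore all previous instructions-1.1")  # raises
```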
Starting point is 01:06:52 It feels like there's a lot of challenges that are kind of, like, coming down the pipe for these shared open bits of infrastructure to keep them as open as they currently are. Well, what is your take, then, with these rate limits and polling, when it comes to this polite nature you have here? Like, how do you leverage that? Because I can pass in my email, but then you say, well, I can reach out to you later. You're watching my rate limits, of course. Can you just shut me off because of me passing that email to you? Or how do you curb the enthusiasm, so to speak?
Starting point is 01:07:26 So right now we have the anonymous rate limit, which I think is fine. 5,000 requests an hour per IP address, basically. And then the polite pool, which is a term we borrowed from a service called OpenAlex, which is basically like ecosystems, but for research papers. If you pass in your email address as part of the user agent, then you just get an uprated rate limit, so that if we see that you're smashing the API, we can contact you and say, like, oh, what are you doing? Can we help you do this in a different way?
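A minimal sketch of joining the polite pool as Andrew describes it, with contact info in the User-Agent. The endpoint path and the mailto-style header are assumptions based on the discussion and on how OpenAlex documents the same idea; check the service's own OpenAPI docs for the real details.

```python
# Sketch: identify yourself in the User-Agent for the "polite pool".
# The endpoint path and header format are assumptions, not verified docs.
import urllib.request

req = urllib.request.Request(
    "https://packages.ecosyste.ms/api/v1/registries",
    headers={"User-Agent": "my-research-tool/0.1 (mailto:you@example.com)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.headers.get("Content-Type"))
```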
Starting point is 01:08:03 So far I haven't actually, like, been tracking that particularly closely. I've literally just been like, great. Cloudflare is still, like, catching most of that stuff. If you hit anything that's cached, it doesn't even touch your rate limit. So it's only the uncached things that actually affect that rate limit. But even then, it's like 10,000 requests an hour. If you're really, really hitting it, you're going to run into that. And then a 429 response is very cheap to serve up, so I can serve up a lot of, like, rate-limited requests before things start to fall over. And then, looking at the patterns and going, like, how are people using this?
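Since 429s are the expected signal when you outrun the limit, a client-side sketch: back off and retry instead of hammering. The Retry-After handling is generic HTTP behavior, not anything ecosyste.ms-specific, and a numeric Retry-After value is assumed.

```python
# Sketch: generic back-off on HTTP 429; honors a numeric Retry-After header.
import time
import urllib.error
import urllib.request

def get_with_backoff(url: str, attempts: int = 5) -> bytes:
    delay = 1.0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            # Prefer the server's hint when present (assumed numeric here).
            wait = float(err.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2  # exponential fallback if no Retry-After given
    raise RuntimeError(f"still rate limited after {attempts} attempts: {url}")
```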
Starting point is 01:08:51 And is there a way I can do, like, a higher-level API that avoids, you know, having to have someone do that trawling? Or is there a way of being able to export big chunks of data, rather than doing lots of little individual queries? That's another thing that we're exploring. Maybe, like, a big ClickHouse with a read-only, like, you can write your SQL query, or SQL-ish query, against a column store worth of data. Similar to BigQuery, but without the, you know,
Starting point is 01:09:23 whoopsie, I spent $3,000 on my one query through BigQuery because it pulled in terabytes of data. But that is a bit of an ongoing side project. It's not actually live yet for anyone else to use. But hopefully, for researchers especially, you'll be able to just be like, oh, I can just do big sweeping queries in a kind of an offline way,
Starting point is 01:09:50 rather than having to hit the live Postgres databases, because those are, like, the source of truth of these things. And often researchers aren't like, I need the most up-to-date, like, within-the-hour changes. They're like, ah, actually I'm fine with this if it's, like, a day or a week old. It's really not too much difference compared to, you know, I'm looking for the security advisory stuff that is as fresh as possible, which is often where you're scanning your SBOM and trying to find, like, where are there new vulnerabilities that are
Starting point is 01:10:23 affecting me? Yeah. How do you prioritize your time, I suppose? There seems to be a lot to cover. Even discoverability: like, if I am naturally interested, how can I pull this data out? It seems like I would have to spend a lot of time to figure that out. That's okay. But, you know, who is your user? Who are you really building the platform for? I know who's using it, but, like, how do you prioritize your time against how it's being used? Well, to be honest, the number one user is me right now. Good.
Starting point is 01:11:17 That's who I prioritize for, because I have a good picture of how you'd want to be able to pull this data out. So the APIs: each one has its own OpenAPI YAML spec, which kind of tells you, here are all the different endpoints that you'd want to use. And then there's the things that, like, I'm building applications on top of this data as well, and going, oh, this is not here. Like, or I want to be able to do it like this. So often, like, a lot of those APIs have shown up because I couldn't get them to work right. Josh Bressers has also, like, had a good amount of input in just, like, absolutely thrashing various aspects of it to look up lots of data around CVEs and the kind of rate of versions being published. There's also kind of loads of tools that have been built on top of it. So Snyk has a tool called parlay, which does SBOM enrichment. And so I can then go, and these things are open source, I can go and look at them and see, like, oh, how are they currently using the existing API?
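As a sketch of what that enrichment step looks like in miniature: walk an SBOM's components and fill in a missing license field from a package-metadata lookup. The lookup function is a stub standing in for a registry or ecosyste.ms call; Snyk's open source parlay is a real implementation of the idea.

```python
# Sketch: SBOM enrichment in miniature (CycloneDX-ish component dicts).
# `lookup_metadata` is a stub; a real tool would call a registry or
# ecosyste.ms API here. Package data below is invented.
def lookup_metadata(ecosystem: str, name: str) -> dict:
    fake_db = {("npm", "left-pad"): {"licenses": "WTFPL", "maintainers": 1}}
    return fake_db.get((ecosystem, name), {})

def enrich(components: list[dict]) -> list[dict]:
    for comp in components:
        meta = lookup_metadata(comp["ecosystem"], comp["name"])
        comp.setdefault("license", meta.get("licenses", "unknown"))
    return components

sbom = [{"ecosystem": "npm", "name": "left-pad", "version": "1.3.0"}]
print(enrich(sbom))  # license field filled in from the lookup
```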
Starting point is 01:12:10 Is there a better way that I can do it? Or do I just need to beef up the caching in some of these kinds of places? The prioritization is very much, a little bit, like just running around putting out fires. But then occasionally it's like, right, I'm going to turn everything off and I'm going to go and, like, tackle one of these slightly chunkier problems, of essentially solving a bigger challenge than just, oh, there needs to be a new API. Often that's like, oh, there needs to be another service for another kind of data, or there needs to be another way of querying this thing, because lots of people have been asking for
Starting point is 01:12:47 this. Like, the biggest thing is just, like, coming and asking for things on the issue tracker. That is a great way to kind of kick off that conversation and say, like, oh, I've been trying to do this, I'm trying to solve this problem, but I can't work out how to get through, you know? Like, I've hit a wall here. Or, there's just too many individual bits of data over there, you know? Like, can there be an aggregation of this thing somehow? And sometimes that's easy, and sometimes it's like, oh, actually, if we make this index, the index itself is going to be, like, 500 gigabytes in size. That's hard to fit into RAM. So maybe we think of another way to solve that problem, rather than just, like, adding an index for every single different way you might want to query Postgres.
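The back-of-envelope behind that hesitation, with made-up but plausible numbers: at the 24.5 billion dependency rows quoted earlier in the episode, even ~20 bytes of index entry per row lands in the half-terabyte range.

```python
rows = 24_500_000_000      # dependency rows mentioned in the episode
bytes_per_entry = 20       # assumed B-tree entry: key + row pointer + overhead
index_bytes = rows * bytes_per_entry
print(f"{index_bytes / 1e9:.0f} GB")  # -> 490 GB, roughly the "500 gigabytes" quoted
```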
Starting point is 01:13:22 I found the 'Introducing Parlay' post. They even mentioned the enriching:
Starting point is 01:13:45 parlay, you know, it's enriching these SBOMs using ecosyste.ms. So are they one of your paying customers then, considering this tool is probably part of their... No, they are using... So, parlay is an open-source tool that other people can use. And the people using it are primarily companies, because, you know, open-source developers don't actually care about SBOMs. They're like, here's the code. You know, I had to search what SBOM enrichment was. I guess I should have guessed that by: take a little bit of data and make it better. I don't know. Okay. Well, most SBOM extractions don't... Like, when you produce an SBOM
Starting point is 01:14:27 from, say, a repository or from a Docker container, it will go: here are the packages and the version numbers. But it's not going to tell you, like, and here is all of the information about that package, because they just don't have that on disk available most of the time. Some package managers, especially, like, the distro package managers, do actually have that information right there. But, you know, these SBOM generation tools don't go and hit the npm API directly to fetch all of those things. So if you want to be able to get a high-level overview of the license breakdown of all the different packages in your SBOM, then you need to enrich it by, you know, basically going through each one, fetching some extra information, and filling in the license field,
Starting point is 01:15:14 maybe, like, the maintainers. There's a load of different things in there, and it depends on which SBOM standard you're looking at as well, because they're different. But there's also, like, just being able to look up all the security CVE stuff. It's nice if you're only working in one particular ecosystem, because you can use npm audit or bundle audit. But as soon as you get into the, like, multi-ecosystem things, which every Docker container is, right? It's going to be like, oh, it's got my Django app with a JavaScript front end, and also all of the back-end, like, low-level distro package stuff. Like, there's a big collection of random bits of software in there, and I really don't want to have to use 10 different tools to enrich it. I just want one thing
Starting point is 01:15:59 that will just sweep across and support everything. You mentioned a couple of times building things on top of this. Since this is sort of a redo for you, it's kind of like a take two, to do it better. Is this the substrate for many things? What are some of those things that you mentioned? Like, you mentioned some things being built on top of, but what are those things? What's the world you envision?
Starting point is 01:16:22 There's a few that are listed on the ecosystems homepage. So we have the things that I've built for Open Collective, which are the funds app and the dashboards app. Those two things, definitely, they don't have their own data. They're essentially, like, aggregations of various bits from ecosystems to solve particular challenges. One thing I've not built is a search engine. I've kind of been like, I'd like to see if someone else would build that. You know, like, I already did that in libraries.io. But that would be a natural one to add in there. What I'd really like to build is things that help maintainers understand who is using their software. And this is going back to that 24 billion rows of
Starting point is 01:16:48 dependency data, to be able to say, like, how much are bots, how much is Docker pulls, how much is just, like, CI builds. Which, I guess those are all still users, right? I mean, if I'm that person releasing a hundred times, I'm still pulling the packages, right? Every time they commit, yeah, yeah, boom, new version, because I can, you know. And also to be able to go, like, if we can flip that graph upside down and show you, here are the key, like, people downstream depending on your library,
Starting point is 01:17:16 like CI builds which is I guess those are all still users right I mean if I'm that person releasing a hundred times I'm still pulling the packages right every time they commit yeah yeah boom new new version because I can you know and also to be able to go like if we can flip that graph upside down and show you here are the key like people downstream depending of your library, then rather than you find out that you broke them because they come into your issue tracker after you just publish that release and say, like, you broke stuff, like maybe building a CI that is like an inverse that goes, okay, well, you committed something, let me go and test this against your downstream, like your most popular downstream users to make sure
Starting point is 01:18:04 that you didn't break those things. And there's some difficult bits there in making sure, you know, those downstream CIs are reliable. They're not just going to be like, oh, actually, our tests pass all the time regardless, or they fail all the time so you can't trust, like, if you actually broke anything or not. But to be able to do that would give maintainers insights that would be like they can actually be proactive about some of these things and maybe even be able to coordinate and go like, oh, I'm able to reach out to these projects and say, like, I'm going to break this thing or I'm going to change this thing to make it better,
Starting point is 01:18:41 can I help you upgrade in the process, rather than just, you know, firing it out into the world and then not being able to know what the impact was until after the fact. Like, I've also been indexing Dependabot data as a way of being able to show, and I've no idea why GitHub hasn't done this, but as a maintainer of a thing,
Starting point is 01:19:03 if I publish a new version, I want to know how many Dependabot PRs actually, like, were successfully merged, or were closed as, like, no, I don't want this because it broke my CI, or were just completely left. Like, give me more context so that I can understand what's happening with the people that are using my stuff, at least in the open, because there are so many open source users now that it's a good proxy through to closed source. Tools like that enable maintainers to do, like, more with the same amount of time that they're putting into the project, by being more data-driven, or being able to just have more, like, visibility. Because I think
Starting point is 01:19:42 a lot of them are working in the dark a lot of the time. Partly because, you know, you put the blinkers on and you just focus on getting what you need out of your project, but also because they just have no good idea of, like, where their key consumers of those things are, and the knock-on effects. Being able to go, like, oh, I make a breaking change that breaks this other library, and that ends up having, like, a significant impact. As well as, you know, if you have a security advisory, to be like, hey, significant end users of my thing, there's going to be a security update, like, FYI, get ready to bump. Rather than be, you know, like, oh, we're stuck on this version and now we're going to have to, like, scramble to try and get it updated. To be able to
Starting point is 01:20:30 get a little bit more coordination and collaboration by, you know, being data-driven, I think, would be amazing. That's my slightly bigger picture of what I would like to build on top of it: to really empower maintainers to have an impact, to make their process better, but also then make their open source software better, because everyone uses open source software. And so then you make all software better by just improving, you know, the base layers of the most critical packages. It's a pretty big goal, but I think there's enough untapped data there that can be really powerfully leveraged to make a good go at improving some of these things.
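A toy sketch of the release-impact rollup the Dependabot indexing above would enable: given the fate of each dependent's automated update PR for a release, report how the release landed downstream. The sample data is invented; the real input would be the indexed Dependabot PRs Andrew mentions.

```python
# Sketch: summarize how dependents' automated update PRs fared for one release.
# Sample data is invented; a real version would read indexed Dependabot PRs.
from collections import Counter

def release_impact(prs: list[dict]) -> Counter:
    return Counter(pr["state"] for pr in prs)

prs_for_v2 = [
    {"repo": "app-a", "state": "merged"},
    {"repo": "app-b", "state": "merged"},
    {"repo": "app-c", "state": "closed"},  # closed unmerged: likely broke their CI
    {"repo": "app-d", "state": "open"},    # left sitting, another useful signal
]
counts = release_impact(prs_for_v2)
total = sum(counts.values())
for state, n in counts.most_common():
    print(f"{state}: {n}/{total} ({n / total:.0%})")
```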
Starting point is 01:21:15 Could you maybe discuss how that interface manifests? Like, what would you show the maintainer? What do you think? Where would you begin when it comes to exposing the data? Like, how do I get to know my users, the people using my thing? Yeah, so I could imagine you would see, like, okay, well, for this particular package, and maybe I've got lots of packages, but I just drill down to one of them, then I can see, here are, like, my top dependents,
Starting point is 01:21:44 and top being, there's lots of different ways you can define what top would be, but we can just use the ecosystems usage metrics as one thing. Here are, like, the key projects that are using your stuff, and then which versions they're currently pinned on as well. So they might be just like, oh, I always pick up the latest version, I've got a Dependabot doing the updates. But maybe there's someone who's really heavily using your stuff, but they're actually pinned to an old, like, an old major version. And that's, like, an insight into, okay, well, why were they stuck? Like, maybe I can go and help them upgrade, or I can learn that actually I made the most horrific breaking change ever, and they really, really don't want to upgrade because
Starting point is 01:22:29 of, you know, it completely causes them too many headaches to do that. And maybe I can consider that in, like, how I then continue to maintain that project going forwards. You could also then use that interface to say, okay, well, can you show me everyone that's on this specific version? Or, like, is 50% of my users stuck on an old version? Or are they stuck on an insecure version as well, to be able to go, like, well, we had this CVE three months ago, and most people... Especially thinking about this from, like, the individual packages that depend on me, to be able to see the knock-on effect: like, all the users of those packages are, like, my transitive users. There's a lot of data there, but being able to highlight, like, where those key points are of
Starting point is 01:23:20 leverage, where these things could be improved. That would be one way of that kind of being manifest. The other way you could do it, rather than it being a UI, is more like a notification system. Being like, oh, you've got the proactive kind of things, of, like, your dependents have updated
Starting point is 01:23:40 to your latest version. Or that, like, your dependents are having problems: they tried to upgrade to this thing, and, like, here's the context of this Dependabot pull request and the discussion that they had,
Starting point is 01:23:54 and they haven't yet merged it. To be able to, like, show you that, oh wow, okay, that's interesting. Like, it's having a problem for them that we didn't even imagine, because we're not using the same database as they are for our testing purposes,
Starting point is 01:24:13 something like that. And maybe there's an AI element there, once you get to, like, very large amounts of users, where you're like, actually, this is too many downstream users to reach out to. Maybe I can empower Copilot or Claude through some kind of prompt that is like,
Starting point is 01:24:33 I've described the changes in my change log in a way that helps them upgrade from one version to the next. But there's a lot of people that are very reluctant to take on some of those things because they can be wildly unreliable sometimes when you try to do things the same way over and over and over
Starting point is 01:24:53 and over. It's kind of like telemetry via exhaust, too. You're not, like, literally tracking your users. You're tracking them through the natural usage patterns of the ecosystem of open source. So you're not, like, asking them to opt into too much telemetry either. Yeah, I really try not to be too invasive. I try not to track too much data about the individuals, and instead keep it at the project level. Because, you know, for one thing, the projects are all, like, licensed in a way that says, yeah, you can share this and you can, like, understand this. Like, the licenses let you do that. Whereas, you know, tracking individual people is a much more messy thing to do, because people come and go, and they change their names, and they change their email addresses, and it can
Starting point is 01:25:43 be hard to try and pin them down. But also, you know, most open source projects, they're all volunteers. Like, trying to pin requirements on an individual is asking a lot of someone who is, like, just giving away their code. So instead it's like, oh, well, we'll look at it as: if you want to do something to help, then here's data so that you can do it. Rather than being like, we're going to force you to upgrade, you know, like, you must do this. You wouldn't want to use ecosystems to power, like, a massive wave of automated pull requests, for example. For one thing, GitHub would just shut you down straight away. They're allowed to run Copilot or Dependabot at a large scale, but you wouldn't want,
Starting point is 01:26:30 like, it would be horrible for maintainers, right, to just have... You hear Daniel from curl constantly saying, like, how many different AI bots there are. Especially if it's incentivized in any kind of way, then you're going to make a mess. But ecosystems tries to kind of just watch: what's the vibe of these ecosystems going on at the moment? And then you can use that to try and have impacts on top of that.
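A small sketch of the pinned-version view Andrew described a moment ago: bucket dependents by the major version they are pinned to and see who is stuck. The dependent data is invented for illustration.

```python
# Sketch: which major version are my dependents actually pinned to?
# Dependent names and versions are invented.
from collections import Counter

pins = {"app-a": "3.2.1", "app-b": "3.0.0", "app-c": "1.9.4", "app-d": "2.7.0"}
latest_major = 3

by_major = Counter(int(v.split(".")[0]) for v in pins.values())
for major in sorted(by_major, reverse=True):
    tag = "" if major == latest_major else "  <- stuck, worth asking why"
    print(f"v{major}.x: {by_major[major]} dependent(s){tag}")
```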
Starting point is 01:27:09 Have you found any information black holes in your desires for features or tracking things? I mean, funding, exact amounts of funding, is an example, I guess. But anywhere else where you're like, man, I could build this, but I went looking for the data and there's no data? Oh, so yeah, the funding one is a big one. The other thing that I'd really like is kind of more data around the non-code contributions, but that's really hard to get, right? Yeah. Your Discords and your Slacks are not open enough to be able to really index without, you know, you need an API key, or you need a ghost user sat in a Discord collecting everything. Getting creepy.
Starting point is 01:27:44 Yeah, it is way too much. There are tools. There's ecosystems again, tracking us. Get out of it, ecosystems. It'd start joining all the community Zoom calls with an AI chat log kind of thing. But now there are tools like that.
Starting point is 01:28:03 Bitergia has one that you can configure to track your own community, and, like, you can feed in mailing lists and you can feed in your Slack
Starting point is 01:28:21 or your Discord or similar. But you're kind of doing that at a per-community, or even just a per-repository, level. Trying to do that at a mass scale is stepping into worlds that I'm not really comfortable with, in terms of the amount of tracking of stuff. Also, it's just really, really messy. Like, open source metadata is messy, but it is, like,
Starting point is 01:28:40 tangibly okay: yeah, I can see how I can connect the dots here. Whereas once you get into, like, unstructured text of discussions of things, you're quickly into, like, right, well, we're just going to try and, like, have LLMs process everything here, and it's a horrible mess, and it's incredibly expensive. Like, we use no LLM stuff in ecosystems, because we just don't have any budget for that kind of stuff, the amount of processing to analyze 12 million packages.
Starting point is 01:29:17 Well, you do now. Our friends at Amp have a free tier, just advertising, as I use it. And it's, like, free, to an extent. I was just telling Jared about this on the pod releasing on Friday. You know, if you're not using Amp code for free for at least two hours a day or so, then you're missing a little bit of LLM work that you can get for free. Ad-supported. That's the way, that's the way of saying it. Well, yes, sorry, it is ad-supported, so you are getting advertised to. But, you know, I think that if you're not using that, and you have a use for a couple hours
Starting point is 01:29:37 a day at no cost, you know, one of the 17 advertisers they have in the network is supporting your open source, essentially. It's kind of cool. What else? Is there anything else we didn't ask you about,
Starting point is 01:29:56 ecosystems-wise? Or, I mean, we covered a lot. Yeah, I'm trying to think if there's anything. I think I covered most of my kind of, like, thinking-of-the-future things, and mostly everything that I'm working on at the moment is ecosystems,
Starting point is 01:30:12 and I haven't got any other side things. Octobox is dead, or? Octobox is ticking along. Like, GitHub copied most of the features of Octobox, and then we lost most of the customers. So I still use it every day, but there's not a lot left there. So it still works.
Starting point is 01:30:32 But it doesn't have any AI features, so it's not particularly interesting in terms of that aspect. Yeah, I think that's, like, nicely covered most of what I've been working on. Well, it's really cool stuff. I've always been impressed by your abilities and willingness to just collect all the things and then organize them and give them back out for free, for people to use for various reasons.
Starting point is 01:30:58 It's probably exciting when you see somebody using it in a new way that maybe you hadn't dreamed of, or wouldn't even care to, but you're like, oh, that's cool. It shows that you're providing real value to folks. Yeah, especially with the researchers. Like, people will come to me and say,
Starting point is 01:31:14 I'm working on this paper that's, like, investigating ways that we can get LLMs to suggest better projects, or packages, for you to use. Or, we're trying to reduce,
Starting point is 01:31:25 like, LLMs coming up with old versions of things. Like, are there good ways of training it to reduce that? What do they call it?
Starting point is 01:31:34 It's like a data lag, basically. Like, the training lag. The drift. The drift, that's it. That's an interesting challenge without resorting, again, to kind of RAG or MCP. Are there ways of doing short fine-tunes after the fact, of, like, here are the latest versions of things? And people are doing some interesting research in that space using big chunks of ecosystems data.
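A sketch of the kind of drift check that research needs: compare what a model suggested against a registry's actual latest release. The `latest_version` lookup is a stub rather than a real registry or ecosyste.ms call, and the version numbers are illustrative.

```python
# Sketch: detect "training lag" in model-suggested dependency versions.
# `latest_version` is a stub; in practice it would query a registry or
# an ecosyste.ms packages endpoint. Version numbers are illustrative.
def latest_version(name: str) -> str:
    return {"requests": "2.32.3", "django": "5.1.2"}.get(name, "0.0.0")

suggested = {"requests": "2.19.1", "django": "5.1.2"}  # e.g. from an LLM
for name, version in suggested.items():
    latest = latest_version(name)
    status = "ok" if version == latest else f"stale (latest is {latest})"
    print(f"{name}=={version}: {status}")
```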
Starting point is 01:31:59 The other thing I just started noodling on is an open source taxonomy. So, to try and define, like, a taxonomy that describes the different facets of what makes an open source project. You know: what does it do? Who is it for? What technologies does it use? There's about six different facets and about 130 different terms that I put together as, like, a V1 kind of thing. Of going, like, if you were to put these packages into a box, or six boxes,
Starting point is 01:32:35 which ones would they go in? Rather than just going, like, here's some free-text keywords, here's a load of the kind of chunks of things. Including, like, the role of the user as well. Rather than just thinking about, like, oh, it's a front-end React app, it's like: but who is it for? Is it for an end user, or is it for a sysadmin, or for a developer? And then, what domain is it in as well? It's really early, but I'm hoping it is another way that can produce some alignment in this open source discovery world. Because, you know, I worked at GitHub for a while on open source discovery
Starting point is 01:33:18 and wasn't really able to make a good dent in it there. But I think there's still a lot of low-hanging fruit in terms of, like, just helping people find the right kind of tools to use, because not many other people have really... It's also that there's just not a lot of money in that space. It's a loss leader, right, for most companies. Searching for open source is not going to turn you into... Like, oh, yeah, you can't even run a lot of ads against that kind of stuff, because open source developers are, like, the number
Starting point is 01:33:48 one user of ad block. So those ads will disappear pretty quickly. But I'm hoping that that taxonomy will be, like: here's a nice blueprint of ways that you can define your project, and put it into a shape that then allows you to kind of go, okay, well, I've got five dimensions here, but I want to rotate around one of them. I want a web framework for researchers, but I want to rotate around the technology. Like, what are my options there? Or, I'm definitely in this technology space and looking at this kind of, like, position in the stack, but what options do I have here for different users?
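A tiny sketch of that "rotate one facet" query over a faceted taxonomy: hold every dimension fixed except the one you want to vary. The facet names and project entries are invented, not actual OSS-Taxonomy terms.

```python
# Sketch: "rotate" one facet while holding the others fixed.
# Facet names and project entries are invented, not real OSS-Taxonomy terms.
projects = [
    {"name": "webby",   "kind": "web-framework", "audience": "researchers", "tech": "python"},
    {"name": "rustweb", "kind": "web-framework", "audience": "researchers", "tech": "rust"},
    {"name": "sysdash", "kind": "dashboard",     "audience": "sysadmins",   "tech": "python"},
]

def rotate(projects, fixed: dict, vary: str):
    """All values of `vary` among projects matching every facet in `fixed`."""
    return sorted({
        p[vary] for p in projects
        if all(p[facet] == value for facet, value in fixed.items())
    })

# "A web framework for researchers, but rotate around the technology":
print(rotate(projects, {"kind": "web-framework", "audience": "researchers"}, "tech"))
# -> ['python', 'rust']
```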
Starting point is 01:34:31 And to be able to kind of, like, twist the picture a little bit, but in a fairly defined space, rather than in, you know, just arbitrary free text. Because, again, you just end up in this, like, soup of words, which is like, yeah, we kind of just get very fluffy. And often projects just don't have very well-defined ways of finding things. You know, like, they don't add a description to their GitHub repo, or any keywords or topics. So you just kind of, like, never find it unless it's in a generic search engine, which is then really hard in terms of, like, oh, well, what are my options in this space? And I made this as just, like, surely someone has made one of these already. And I found
Starting point is 01:35:19 a taxonomy of software in the research space, but I did not find a taxonomy of open source software. So I was like, okay, I can make a stab at one of these. Like, I've never made a taxonomy before, but I put it together as, like, this should be interesting. And it's been useful so far, and it started some interesting conversations. But I really need some people with more experience in, you know, actually defining taxonomies than I have, to give more input and also expand it and, like, cover the problems. Because I'm pretty sure there's going to be loads of problems in it, because I basically just, like, put it together in a couple of days, as, like, okay, I think this should work. But yeah, mostly... Where does that live? That is on the ecosystems GitHub, as
Starting point is 01:36:05 also a really quick web page I made at taxonomy.ecosyste.ms. It's literally just a few days old, so it's not anywhere on the website, but it is on the GitHub org as OSS-Taxonomy, and I'll get a link in the show notes. Awesome. Yeah. Send us that, and anything else you want to make sure we get into our show notes, so that you all can just click through and find that and help Andrew figure out this taxonomy, so that we can all start to kind of formulate around it. Categorization is always useful, especially for
Starting point is 01:36:42 otherwise gray areas such as these. Especially if you're self-defining, it helps you to even flesh out your idea or your project better. I think this is fertile ground right there, honestly, because you've got so many, I would describe them as, like, ecosystem explorers. Previous to LLMs being a ubiquitous thing and agents helping you, you may have just stayed in the zone that you're comfortable in, because you're the mere human that cannot think to the next thing faster. You know,
Starting point is 01:37:14 and then you get into this LLM world and you're like, man, I can actually explore new languages, because it knows them. I know this language, and I can at least translate my knowledge. And so now you find yourself exploring in Go or Rust, whenever you would have
Starting point is 01:37:26 normally just stayed in the Ruby world, because maybe that's where you're comfortable. You know, and so when you go into those worlds, you're like, well, how do people test here? How do people deal with HTTP? How do you deal with security things? And so you find yourself exploring new worlds, where you know the Ruby world well, but you don't know the same kind of projects that would help you, in a different lens. I think that's going to be useful, honestly.
Starting point is 01:37:49 Yeah, definitely. There's also kind of the ability to see, like, where are the gaps in a particular space? Where have there not been, like, many people working? Or there's only just this one old library. Like, is there an opportunity to kind of jump in and improve that? Or, as you say, like, you come into a new ecosystem and you're like, what is the Sidekiq of X? Exactly. Often it's like, oh, well, actually, like, in the Erlang world we don't need Sidekiq, because we have, you know, OTP, it's kind of all built in. But, like, to be able to learn, like, what is the alternative to
Starting point is 01:38:32 this thing, is going to be an interesting way of challenging that. And maybe also there's kind of breaking down some of these massive projects into sub-pieces as well, to be able to go, like, okay, well, you've got something huge, but actually there's lots of, like, individual components here that can be used without you having to take on, like, oh, I've got a massive Apache Airflow install now that does everything, when actually I really only want to do, like, a piece of this. But how the hell do you go finding that if, like, their discovery is just folders full of strangely named projects? Like, that's not particularly helpful, necessarily, in terms of discovery. Let's close with this: what do you want from the world? You seem to be a pretty quiet guy.
Starting point is 01:39:19 There's definitely a blog there, so you're active. I don't know how frequently you podcast. We haven't talked personally in years, at least me personally; maybe you've talked to Jared at least once without me in the meantime. But, like, what do you want from the world for this project? What kind of response do you want from coming on the show, or producing all this work? Well, I have had my head down, like, basically since leaving Tidelift and then COVID happening. I basically just got my head down and just started, like, plugging away. I also started doing track days in a Subaru BRZ, which is an excellent way to get away from the computer. If you've got an interesting car, track days are brilliant fun.
Starting point is 01:40:05 But ecosystems has kind of, like, been building up and building up, and it's now reached the point where I'm like, I need more people helping. Kind of, like, not just contributing to the code, but, like, helping it work out where it should go next. Because I can definitely come up with lots of things I would like to see happen, but I need more input from more people on, like, how would you like to have an impact on the open source world, like, through data? So that's input in, like, feature requests, or thinking about that from a slightly
Starting point is 01:40:42 higher level, of, like, collaborations: ways that ecosystems can support different efforts, be it, like, security, or searching for projects that are like, oh, there are ways we can improve this part of an ecosystem. The collaboration is really what I would like to see more of. And I am starting to do more podcasts and, like, various kinds of things. I started a working group with the CHAOSS metrics people around package manager metadata, as trying to share the kind of learnings that I've had in developing ecosystems, and being able to kind of, like, map metadata across different
Starting point is 01:41:25 ecosystems in standardized ways. But if people are interested in ways of, you know, like, understanding and using data in open source to have impacts, then ecosystems is literally gearing up right now, through the Alpha-Omega grant that we just received, to be able to, you know, bring more people into this space and help them have real impact on, like, knock-on effects of improving open source. Wow, very cool. I'm glad COVID's over, obviously. I'm glad that you're poking your head out of the hole, little rabbit, and showing the world what you got. It's kind of cool. I like it. Good stuff, Andrew. Thanks for
Starting point is 01:42:15 coming on the show again. Yeah, thanks so much for having me. There you have it. Ecosystems, a very cool web app with a very cool domain hack. That's E-C-O-S-Y-S-T-E dot M-S. Check it out. There is so much data to dig through. I'm sure you can think of cool stuff to build on top of it. And if you do, let us know in the comments.
Starting point is 01:42:39 We hang out in Zulip. You can too. It's free. Just click the link in your show notes or find the episode page on our website and hit the discuss button. That'll get you there. Thanks again to our partners at fly.io and to our beat freak in residence, Breakmaster Cylinder. And thank you to you for listening. There's a zero percent chance we'd keep this thing afloat for 16 whole years without you.
Starting point is 01:43:01 So thank you. Seriously. It means a lot. This has been your midweek interview, but we'll be back on Friday. You got to listen to that one. It's the #define champs game. Come play along. We'll talk to you then.
