The Changelog: Software Development, Open Source - The world of open source metadata (Interview)
Episode Date: November 5, 2025
Andrew Nesbitt builds tools and open datasets to support, sustain, and secure critical digital infrastructure. He's been exploring the world of open source metadata for over a decade. First with libraries.io and now with ecosyste.ms, which tracks over 12 million packages, 287 million repos, 24.5 billion dependencies, and 1.9 million maintainers. What has Andrew learned from all this, who is using this open dataset, and how does he hope others can build on top of it all? Tune in to find out.
Transcript
Welcome, friends. I'm Jared, and you are listening to The Changelog, where each week, Adam and I interview the hackers, the leaders, and the innovators of the software world.
We pick their brains, we learn from their failures, we get inspired by their accomplishments, and we have a whole lot of fun along the way.
This episode features Andrew Nesbitt, who builds tools and open datasets to support, sustain, and secure critical digital infrastructure.
He's been exploring the world of open source metadata for over a decade, first with Libraries.io and now with ecosyste.ms, which tracks 12-plus million packages, 287 million repos, 24.5 billion dependencies, and 1.9 million maintainers.
What has Andrew learned from all of this?
Who is using this open data set and how does he hope others can build on top of it all?
You're about to find out.
But first, a big thank you to our partners at fly.io, the public cloud built for developers who ship.
We love Fly.
You probably will too.
Learn more at fly.io.
Okay, Andrew Nesbitt talking ecosystems on the Changelog.
Let's do it.
Well, friends, agentic Postgres is here.
And it's from our friends over at Tiger Data.
This is the very first database built for agents and it's built to let you build faster.
You know, a fun side note is 80% of Claude was built with AI.
Over a year ago, 25% of Google's code was AI generated.
It's safe to say that now it's probably close to 100%.
Most people I talk to, most developers I talk to right now, almost all their code is being generated.
That's a different world.
Here's the deal.
Agents are the new developers.
They don't click.
They don't scroll.
They call.
They retrieve.
They parallelize.
They plug into your infrastructure in the places you need to perform, but your database is probably still thinking about humans only, because that's kind of where Postgres is at.
Tiger Data's philosophy is that when your agents need to spin up sandboxes, run migrations, query huge volumes of vector and text data, well, normal Postgres might choke.
And so they fix that.
Here's where we're at right now.
Agentic Postgres delivers these three big leaps: native search and retrieval, instant zero-copy forks, and an MCP server, plus a CLI, plus a cool free tier.
Now, if this is intriguing at all, head over to tigerdata.com, install the CLI, just three commands, spin up an Agentic Postgres service, and let your agents work at the speed they expect, not the speed of the old way.
The new way, Agentic Postgres: it's built for agents, designed to elevate your developer experience and build the next big thing.
Again, go to tigerdata.com to learn more.
Today we're joined by, for us, an old friend from a long time back. Andrew Nesbitt is here with us.
And you know, Andrew, I came across ecosystems, which is eco... no, ecosyste.ms.
Domain, nice domain hack, hard to say out loud, but it looks cool in the URL bar.
I came across this and I thought, this is a very cool project.
It seems somewhat familiar.
I can't quite put my finger on what it could possibly be.
And then I saw it was from you and I'm like, oh, it makes totally sense.
Like it makes total sense.
This is right up your alley.
We've had you on the show many times back in the day talking Octobox, talking Libraries.io, talking Ruby ecosystem and dependency management.
And it looks like you're still out there kind of beating around that same bush.
So, first of all, welcome back to the show.
Thanks for having me. Yeah, it's great to be back.
Ecosystems. It is, I mean, okay, we have a lot of context that maybe our listeners don't share.
But take us back to what you're interested in, which seems like you've been interested
in similar things for a long time.
And you built Libraries.io around this. And ecosystems is a very similar thing. I'm wondering if it's the same old thing or if it's a new, new thing. So tell us
about your past and like collecting and organizing dependencies and the information about
them and open source projects, sustainability, and then what that brought you, how that brought
you to ecosystems. Yeah. Okay. So I have been swirling around the world of open source metadata
for, must be coming up to 10 years now, starting with 24 Pull Requests.
That's right, 24 Pull Requests, yeah.
That didn't kind of start from metadata, but the idea of that project was to encourage people to contribute to open source as part of kind of the run-up to Christmas.
And after kind of like first getting that off the ground, we quickly ran into like, oh, how do we suggest like where should people go and contribute to?
And a lot of people would try and send a pull request to a project that had no activity and like the maintainer was gone,
or just like we're struggling to be able to even like work out how to send a pull request to some projects
because they were really not very friendly or easy to contribute to.
And that kind of led me down this path of like, okay, well, what's a, how do you define what a good project is?
And then like can we scale that up rather than manually having to have people kind of like submit their things
and keep those things up to date every year because that project would,
just kind of come and go every December and shut down afterwards.
So the maintenance there couldn't be entirely human because there were thousands of people
contributing to that project and like sending pull requests.
And it was a lot of data to try and work with.
So I started to build out some basic metrics there to try and go like,
does this project look like it has activity that's happening on it?
Does it look like it's ever received third party like contributions and things like that?
and that led me to kind of I got a job at GitHub from there
and then GitHub promptly fell apart internally.
Tom Preston-Werner left.
It was a horrible time.
And so then I left there and started Libraries.io,
which is essentially a, like, okay, well,
looking at package manager metadata is a different way
of kind of getting some measure of what's an interesting open source project.
Like rather than just using stars,
which stars is a terrible metric and has very little kind of bearing on a lot of projects,
especially as you go down from the kind of like the massive frameworks, the kind of
those huge keystone projects.
Once you get down to smaller libraries and also especially the kind of like low level critical
projects that are doing a lot of the kind of the real work, they don't get a lot of
attention and a star is basically a measure of attention, how many people are landing on
that GitHub repo page. So package manager metadata was like, oh, this is really juicy,
because it kind of gives me a hook into saying, like, these libraries are being used by other
people. But download stats again, available for most package managers, but not all, is often
kind of wildly all over the place for certain projects, especially if they're used a lot in
CI, that you'll just see, like, really inflated download stats.
And you also don't necessarily see those for dev dependencies, the things that, you know,
people, especially maintainers, are installing on their laptops to be able to work on those
projects, but they're not necessarily a runtime dependency of all the applications.
You know, there are definitely gems that Ruby and Rails devs use locally, but aren't shipped
with the Rails app, so you would never see those numbers.
And the insight that I kind of accidentally tripped over was if we go mining the dependency
information out of open source repositories, at a large scale, you actually start to get
a really good picture of how people really, like, use open source and how they don't
use open source.
Like if a project breaks, you probably don't go and unstar that project.
Let's be honest, like not many people are unstarring things.
They don't remember.
And also, you don't, like, un-download a thing.
The download count remains after you downloaded it and was like,
oh, this doesn't actually work or this is not what I wanted or has become unmaintained.
Whereas actual, like, I depend on this thing.
If I remove that thing as a dependency, then numbers go down.
And you get a really interesting, strong signal that something is maybe not quite right
with that project.
So that kind of led me onto a path of, I should just try and index the dependencies of every open source project ever.
And Libraries.io started out as a search engine designed to be like, I can help you try and find the best package, and that was primarily like, this package is well used, so therefore that implies that it has good documentation, that it actually works, and other people are using it, as kind of a proxy.
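That dependent-repository signal is queryable directly once it has been indexed. A minimal sketch in Ruby, assuming the current shape of the ecosyste.ms packages API; the endpoint path and field names here are taken from the public docs and should be verified before relying on them:

    require "net/http"
    require "json"
    require "uri"

    # Look up one package and pull out the usage signals discussed above.
    def usage_signals(registry, name)
      url = "https://packages.ecosyste.ms/api/v1/registries/#{registry}/packages/#{name}"
      res = Net::HTTP.get_response(URI(url))
      raise "lookup failed: HTTP #{res.code}" unless res.is_a?(Net::HTTPSuccess)
      pkg = JSON.parse(res.body)
      {
        dependent_repos:    pkg["dependent_repos_count"],    # repos that declare it as a dependency
        dependent_packages: pkg["dependent_packages_count"], # packages that declare it
        downloads:          pkg["downloads"]                 # noisy, per the caveats above
      }
    end

    p usage_signals("rubygems.org", "rack")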
And it grew and grew and became a massive and expensive and difficult project
to maintain as a side project whilst I was doing contracting.
And me and Ben, who are working on it, were like, well, what are we going to do?
How can we turn this into a, you know, a sustainable project that can fund itself?
And at the time, GitHub had just implemented its own dependency graph as well, along with purchasing Dependabot, and they basically started giving that away for free. That pulled the rug out from under any plans we had to monetize Libraries.io
directly as well as a project I was building called Dependency CI
which never really got off the ground but was back in the day was like
oh, this is really cool because it could literally like block your pull request to say
you're trying to add a dependency here that is not good because it doesn't have a license
or it's got security issues or other things. And so we ended up selling to Tidelift
to try and find some way of recouping the costs of building out that project. But just before
we did, we also made all of the code open source and all of the data open source.
So it was kind of like an airdrop into the community to be like this is always
going to be here if you want to use it for your purposes. Didn't really work out at Tidelift. There's a big cultural difference in the founders at Tidelift compared to me and Ben. Me and Ben are very like, we really like building and solving problems in the open and shipping stuff really quickly and seeing, kind of iterating on those things, and Tidelift's culture was... because they just sold to another company.
Yeah, who bought Tidelift? Sonar, I can't remember the name. It's a security company.
And as a shareholder of Tidelift, I can tell you I didn't get anything from the sale.
Bummer.
You know, Libraries.io was there and was open source. And after
I took a break for a little while during the pandemic, which, you know, everyone had a kind of a crazy time.
I went to do some contracting with Protocol Labs, basically kicking the tires on IPFS and
Filecoin and trying to use it as a real user.
It was an interesting time of actually trying it, like really trying it, and that was pretty cool.
And then at the same time, I was talking to Schmidt Futures, which is now Schmidt Sciences, but one of the kind of sub-foundations of the Schmidt Foundation, who were basically saying, like, we have researchers that were using the data from Libraries.io for research.
But now, like, when I left Tidelift, they started to remove features of Libraries.io, especially the API access and the data.
And Schmidt Futures basically came along and said,
like, could you stand up another copy of it?
And I was like, we could do that.
But what if we rebuilt it from the ground up
as infrastructure for research purposes,
rather than taking the same code,
which is like one big search engine,
one honking great Rails app,
and actually make it into kind of a slightly more,
like take all the lessons learned, but instead of building it as a search engine,
instead build it as a base layer of open source metadata, which then can be used to build a
Libraries.io on top of it. And that also means like we can take some of those lessons that
were like, oh, actually it turns out contributing to a project that has one absolutely
enormous database schema is really difficult. Like trying to stand that up yourself is really hard as a contributor.
So people would just bounce straight off the project
because they're like, well, there's no way I can possibly comprehend
how big this, like the stuff that's going on here.
And then also the performance implications of deploying a change
that might be like, oh, you're about to touch a table with like a billion rows in it.
That's going to be difficult for you to test without me giving you production access.
And I really don't want to do that to random third-party open source contributors.
And so ecosystems is essentially a do-over of Libraries.io.
It's many different Rails apps that are focused on collecting different kinds of
open source metadata and then combining them together in different ways.
So there's a packages service, there's a repo service that collects the dependency
information from repositories, there's an advisory service and a commit service and an issue service,
basically all the different things
that you might be interested in.
And each one of them can then be independently worked on
and scaled up as like different amounts of data pour in
and kind of collect in different places.
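As a rough sketch of that split (the service hostnames below follow the pattern the live site uses, but treat the list as illustrative rather than exhaustive):

    require "net/http"
    require "uri"

    # One independent Rails app per concern, each with its own database.
    SERVICES = {
      packages:   "https://packages.ecosyste.ms",
      repos:      "https://repos.ecosyste.ms",
      advisories: "https://advisories.ecosyste.ms",
      commits:    "https://commits.ecosyste.ms",
      issues:     "https://issues.ecosyste.ms"
    }

    SERVICES.each do |name, base|
      status = Net::HTTP.get_response(URI(base)).code rescue "unreachable"
      puts format("%-11s %s (HTTP %s)", name, base, status)
    end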
And that has been going on for nearly, I want to say, three years now,
really kind of like going from,
it was a nice kind of year where I just worked on it myself,
didn't really tell anyone about it,
just kind of like plugged away at it.
And there are core pieces, because Libraries.io is open source, I was able to reuse like the dependency parser and a load of the mappings to the package managers, actually like take that code and kind of reuse it in a way that also allows you to have multiple different package manager registries, where Libraries.io would only support one, which was really nice when RubyGems had all of its drama recently and gem.coop popped up.
I was able to go, oh, I can quickly start indexing gem.coop.
It just fit straight into that new schema.
It just fit straight into that new schema.
And then, like, since kind of like the past year,
it's just absolutely exploded in usage.
The amount of traffic today alone was 50 million requests to the API.
Wow.
And has become quite a piece of critical infrastructure to a number of different kind of areas of open source in terms of SBOM enrichment and also trying to find those critical pieces of open source that need security work or need sustainability efforts to be kind of coordinated around them.
Well, I'm happy to hear that you got to reuse some of your code from Libraries.io,
because when I thought was going to happen,
when you said I airdropped it,
I thought you were going to just catch your own airdrop a few years later
and be like, and because I open sourced it,
I just relaunched it under a new,
but obviously the big rewrite is a very tantalizing thing,
especially when you've been living with all your mistakes for this time,
is like, let's start over.
But you got to reuse some of your code,
which is really awesome.
So nice job open sourcing that when you still had an opportunity to do so.
Yeah, absolutely.
You mentioned this is used in,
research, I guess, research terminology, so to speak.
What exactly does that look like?
Who are those folks?
What kind of research are they doing?
Are they developers?
Are they developer adjacent?
I think they're mostly developer adjacent or in the research space.
I guess you'd call them like research engineers where lots of computer science researchers
are like, we want to study what these kind of behaviors are like across different
package managers, or comparing, like, what are developers doing in this space versus that
space, especially around the dependency stuff to be able to go like, oh, the average number
of dependencies in a JavaScript app compared to a Ruby app, for example, which I think is about
10x. And then looking at kind of can you go down those dependency chains and find where the,
the security problems are or the license problems are, and also leading into kind of like how can we
encourage best practices in this space or looking at how to work out like how many projects have
have taken on these various kinds of like especially just recently had a call with someone who's
looking at all the attestations around trusted publishing like how many can we see like the share
of usage of packages that have the trusted publishing setup
and are publishing attestations into Sigstore
compared to like the overall space
and also then breaking that down across different ecosystems as well.
Okay friends, Augment Code, I love it.
This is one of my daily driver AI agents I use.
Super awesome, CLI, VS code, JetBrains,
anywhere you want to be, Augment Code can bring better context, better agents, and of course, better code.
To me, Augment Code is by far one of the most powerful AI software development platforms to use
out there.
It's backed by the industry leading context engines.
The way they do things is so cool.
You get your agent, you get your chat, you get your next edit, and completions, it's in Slack,
it's in your CLI.
They literally have everything you want to drive the agent, to drive better context, to drive
better code for your next big thing, for your big thing you're already working on, or whatever
you have in your brain.
you want to dream up.
So here's a prescription.
This is what I want you to do.
I want you to go to augmentcode.com.
Right in the center, you'll see install now.
And just go right to the command line.
There is a terminal C-L-I icon there.
Click that.
And it's going to take you to this page.
It says install via npm.
Copy that, pop into your terminal,
install augment code.
It's called Auggie.
Instantiate it wherever you want to.
Type in A-U-G-G-I-E and let loose.
You now have all the power of augment
in your terminal.
Deep context, custom slash commands,
MCP servers, multimodals,
prompt enhancers,
user and repo rules,
task lists, native tools,
everything you want,
right at your fingertips.
Again, augmentcode.com, it's one of my favorites.
You should check it out.
This might be silly,
but let me ask you this.
I've been researching some CLIs
and I've been researching how
CLIs install themselves.
Sometimes they'll leverage the actual package manager of the distro, like a Linux distro or something like that. But most, by and large, just give you a URL to curl and pass to bash, essentially, which can be problematic if you don't trust the script.
If I wanted to research, I guess, somehow research CLIs and how they install themselves, and the various ways they install themselves, is that something that this service could do? Like, is that the level of research I could do?
Yeah.
I mean, for one thing, you would be able to quickly find everything that had kind of like tagged itself up as a CLI program.
I've also been indexing every image on, every public image on Docker Hub and basically running an SBOM scanner against each one of those.
There would be some juicy insights there to be able to go, like how many of these things were installed via, like, a distro package manager versus, like, we just have a URL for this, which would be recorded in the SBOM, basically to say, like, oh, we found this known bit of open source and it appears to say that, like, it sits in the file system here, which implies it was installed by apt, or it's in a random place, like it was probably curled down along with the Dockerfile that was used to build that image.
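A toy version of that file-system heuristic might look like the following; the path conventions are assumptions, not guarantees, and real SBOM records carry more context than a bare path:

    # Guess how a binary got into an image from where the SBOM says it lives.
    def install_origin(path)
      case path
      when %r{\A/usr/(bin|lib|share)/} then :distro_package   # apt/apk territory
      when %r{\A/usr/local/}           then :manual_install   # classic curl-to-/usr/local spot
      when %r{\A/(opt|root|home)/}     then :vendored_or_adhoc
      else :unknown
      end
    end

    p install_origin("/usr/bin/jq")        # => :distro_package
    p install_origin("/usr/local/bin/foo") # => :manual_install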
And there's a good kind of million open source
Docker images on Docker Hub
or at least individual versions of things.
And you also get the interesting aspect there
of that you can kind of multiply that by the number of downloads
that some of these Docker images have.
And some of those numbers are crazy,
like millions and millions of downloads of a particular image
And of course, those numbers inside that one container are like never reflected in the package managers upstream.
So just because it was downloaded in Docker doesn't mean, you know, that actually shows up as being a million downloads in RubyGems or on npm.
So you start to see some really interesting things and you start to see those download numbers or the proxy for a download number of distro packages.
as well, which is a really hard number to get hold of because, you know, every distro package manager is very heavily mirrored and basically like just a file system somewhere exposed over HTTP or rsync. So no one has good download stats for those things. The only place you really find that is the Debian popularity contest, which is opt-in, not opt-out. So you'd be able to go like, oh, okay, well, I can see here are
the CLI programs that are being like manually downloaded inside of Docker images as part of this
install process.
It's not going to give you everything, but it certainly gives you a good proxy for like, okay,
where I can see where like relative usage of these things starts to show up, which is,
you know, where I found the most useful ways of kind of sorting different piles of packages
or whole registries is to go like, okay, well, if I sort this registry by the number of
dependent repositories or the number of dependent packages, like which things show up at
the top, and then also, like, which of those things make up 80% of all of this stuff?
And you actually end up, like, I like the 80-20 rule, but it doesn't actually turn out to be like 20% of packages make up 80% of usage.
It's like 0.01% of packages make up 80% of usage.
It's tiny amounts.
Like there might be 2,000 node modules total that make up 80% of all of the usage of npm in terms of downloads
and in terms of like discrete dependent repositories, which is like when you then start to really focus
that lens, you see a long tail of stuff that never gets used.
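The arithmetic behind that concentration is easy to reproduce. A toy illustration with made-up counts; feed in real name-to-dependents pairs to see the real skew:

    rows  = { "mega-framework" => 900_000, "tiny-util" => 850_000,
              "left-padder" => 400_000, "abandoned-thing" => 12 }
    total = rows.values.sum

    # Smallest set of top packages whose combined usage covers 80% of the total.
    running  = 0
    covering = rows.sort_by { |_, n| -n }
                   .take_while { |_, n| keep = running < total * 0.8; running += n; keep }

    pct = (100.0 * covering.size / rows.size).round(2)
    puts "#{covering.size} of #{rows.size} packages (#{pct}%) cover 80% of usage"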
And there's also like all kinds of spam.
and malware and stuff that floats around.
But there's like a 10, 15,000 packages,
which are like the packages that make up most open source usage
across all these ecosystems.
It's kind of amazing how massive that asymmetry is
when you pin that down to individuals.
Yeah, and that's like on average one maintainer per package
at that critical level as well.
So that's like 15,000 people
maintaining all of open source usage.
That makes the XKCD comic even more poignant,
you know, the one person in Nebraska,
you know, replace Nebraska with wherever they are
around the world.
Probably in different towns.
And how many of them have you had on the Changelog?
That's a good question.
Probably a good percentage of those.
Oh, man.
So there's all, there's 15,000 people basically running the world for free.
Wow.
I have done a little bit of indexing of, you know,
like how many of those have GitHub sponsors or are their projects on Open Collective
or they have some other kind of funding link?
And in terms of those top critical packages, it comes out to kind of like, depending on
the ecosystem, it's somewhere between 25% and 50% have some way of, you know, like,
here's an automated way you can give me a donation to the project.
There's a good chunk of those as well that are, you know, massive.
corporate funded projects like all of the AWS ruby gems that make up the AWS
CLI are in the top of Ruby gems because they're just massively used they don't need
any funding, right, because Amazon has full-time staff. But there's a good... they might need some funding, like, right now they're laying people off again. Hopefully they didn't lay off all the Ruby people maintaining the CLI there. That would be awful.
So, do you track... so you're tracking
those who are able to receive funding
in some sort of automated fashion.
Do you track funding itself?
Like who's getting how much money and how?
Yes. Well, where possible.
So I'm tracking, I call it a funding link
and some package managers have funding links support
where you can say like, oh, I get,
you can donate to me over here.
Repositories have the funding YAML file
and I go looking for that wherever possible.
And you actually see that even
on GitLab and Codeberg.
I don't know how well those platforms
display it in the UI,
but it definitely,
because obviously GitHub sponsors is not,
I don't think there's a GitLab sponsors
or a Codeberg sponsors.
Those files do show up all over the place.
And then also being able to go,
like this repository is owned by a user on GitHub,
who is part of GitHub sponsors,
is another way of kind of detecting that,
even if they haven't,
added their funding YAML file, we can kind of make a hop to say, like,
oh, here's one of the maintainers to be able to support that.
And I then collect the data from GitHub sponsors of every,
because GitHub sponsors users are public.
You don't get any financial numbers, but you do get, like,
here's the number of active sponsors of things.
And here's the total, like all time.
It's quite hard to get time series data out of that API.
So instead, I basically just kind of snapshot it on a regular basis to go like,
oh, here's what the current state of the world is in terms of GitHub sponsor funding.
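The FUNDING.yml side of that detection is simple to sketch. The raw.githubusercontent.com URL shape is GitHub's; using HEAD for the default branch is an assumption that holds for most repos:

    require "net/http"
    require "yaml"
    require "uri"

    # Fetch a repo's FUNDING.yml the way a crawler might and return the
    # declared funding links, or an empty hash if there aren't any.
    def funding_links(owner, repo, ref: "HEAD")
      url = "https://raw.githubusercontent.com/#{owner}/#{repo}/#{ref}/.github/FUNDING.yml"
      res = Net::HTTP.get_response(URI(url))
      return {} unless res.is_a?(Net::HTTPSuccess)
      YAML.safe_load(res.body) || {}
    end

    p funding_links("octobox", "octobox") # e.g. {"open_collective"=>"octobox"}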
It's a bit weird, though.
There's a lot of people who have realized that GitHub sponsors is actually quite a good way
to sell digital goods.
If you go looking at the top users of GitHub sponsors who have the most people funding them,
they sell things like avatars and Discord memberships and e-books and things like that.
They're not necessarily kind of selling, like, oh, you can, I can maintain this project better for you.
That's, that's not the... like, Open Collective is so much bigger in terms of actually, like, supporting the projects as a collective, because they're just set up in a totally different way to GitHub Sponsors.
Yeah, that's fascinating.
So they're kind of doing sponsorware insofar as it's not a donation or you're supporting my work on this project.
It's like, I'm actually, there's a quid pro quo here.
You're like, we're going to trade a good or a service for that sponsorship money.
Really, it's a purchase of, yeah, yeah.
Like, if you go looking, it's easy to see. GitHub doesn't make it particularly easy. Like, they don't have a leaderboard, which is a good thing, because putting a leaderboard on things can often produce some very strange behaviors.
There's also an interesting breakdown of, like, number of users who sponsor other
maintainers versus companies. Obviously, companies are going to sponsor a lot more in total amount
per company. But the distribution is quite surprising in, you know, like, you're looking at
easily 10 times as many individuals are sponsoring other people on GitHub sponsors compared to the
number of organizations. Like, it's quite small, really. And most of that activity is public.
So it's not like... there are... you can be anonymous as a GitHub sponsor, but you can't really hide the fact that there is a sponsorship happening there. There's also, on Open Collective, some massive donations that go to certain projects through, like, company sponsorships, because, you know, they're acting as a fiscal host rather than just being, like, a platform to collect tips, which is basically how GitHub Sponsors works.
Reminds me of way back in the day,
Chad Whitacre's Gittip, which was later Gratipay, remember that?
And it felt all warm and fuzzy because people were getting money for their open source.
But when you go looking at it very closely, most of that was like the same 50 bucks getting
passed around between friends, like not a slush fund, but like they just felt good.
And so like I would make 20 bucks a month and I'm using open source.
So I would give it to somebody else.
And there was really no new, not enough, new money coming in.
It was really just money that already existed amongst all of us maintainers, kind of
patting each other on the back, which was unfortunate, but just the way it started.
I definitely do that. Like, I sponsor 35 different people on GitHub Sponsors, of just a few dollars a month, to just be like, I appreciate your work. I don't have a huge amount to support you with, but just as a way of saying, like, I notice you and, like, appreciate that you continue to maintain these things that I use.
Well, I hoped GitHub Sponsors was, like, big enough and mainstream enough to kind of change the shape
of that, and maybe it's done it some, but it sounds like there's still more indies passing, you know, person-to-person kind of sponsorship than there is corporate-to-person.
But yeah, I think the change of interest rates across the world had a massive impact. Like, you can see... oh, the nice thing about Open Collective, especially Open Source Collective, is it's very public. You can see the amounts of donations, like, going in and going out, and there was a big drop around the time
that post-COVID hit
and changed all of the finances of these things
was like, oh, okay, well, open source is no longer
like one of the, it's an easy line item to drop, right?
Because everything is free and it just continues to work for now
until a security problem comes along
and then everyone starts scrambling again.
So you've got 12 million packages being tracked,
287 million repositories,
24.5 billion dependencies, 1.9 million maintainers.
I'm reading these stats off of your website.
There's a timeline of public events on GitHub.
There's issues.
There's commits.
I mean, there's just tons of different data points that you're tracking.
How do you store all this stuff?
Where do you store it?
How big is it all?
Because I'm just thinking this is a data management nightmare.
So that 24 billion dependencies is a bit of a headache.
I bet. I mean, that's crazy.
Almost all of this is stored in Postgres.
Okay.
Individual Postgres instances on dedicated machines in France and Amsterdam,
mostly because they're very affordable.
Online.net is a very reliable host, similar to Hetzner or some of these other kind of like bare metal machines.
So I do the maintenance of the machine myself, and obviously scaling up is a little more tricky
because there's not just a nice Heroku slider anymore.
I use Dokku as essentially like the open source Heroku, which is really nice.
Just Git push, it builds your Docker image, and then it handles putting nginx in front,
kind of proxying all of those things.
Very nice for like an individual machine.
It doesn't really give you any kind of multi-machine things, but I try to avoid too much complexity when there's only a very small number of people
working on the infrastructure, and it's mostly me. I calculated like a back-of-the-napkin thing the other day.
does to host it on dedicated machines right now. But these Postgres, each service basically has
its own database. So rather than it being one that is enormous, it's split out, which at least
makes it kind of like, I can work on individual ones and be like, oh, this one is reaching capacity,
so it's time to scale it up. Or I should make another box of web machines or Sidekiq workers
separately. I don't need to kind of do everything in one big lockstep, which keeps it fairly easy
to do. And then the whole website is basically read only. Like you can't, you can't put data into it as a user. You read from it. And all the data comes in
in the background through loading data from package managers and repositories. And there's about
2,000 different Git hosts in there that I'm constantly crawling at different rates to go like,
oh, there's, there's new activity over here. So I can cache things very aggressively at the kind of
HTTP layer. I think the cache hit rate at the moment is about 60% in Cloudflare.
At some points, I've got it all the way up to like 95%, but then you get some AI bots come along
and they do some weird stuff, and it's very hard to cache such a long tail of billions and billions of URLs that might exist on the platform, and Cloudflare on the free plan is not going to cover, you know, an unlimited amount of cache.
You'd just kind of keep rolling over the cache over and over again.
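The Rails side of that aggressive caching is mostly a matter of emitting the right headers. A sketch with illustrative names and durations (the model and values are made up; the idea is that a CDN like Cloudflare can then absorb most reads):

    class PackagesController < ApplicationController
      def show
        package = Package.find_by!(name: params[:id])
        expires_in 1.hour, public: true               # Cache-Control: public, max-age=3600
        fresh_when etag: package,                     # 304s for repeat visitors;
                   last_modified: package.updated_at  # full bodies only when data changed
      end
    end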
Is this a solo project again?
Or is this you and Ben back together?
So Ben is working on it part-time.
He is also one of the directors at Open Source Collective,
which is, you know, that's a lot of work in itself.
Yeah.
And then we have a few people who are doing some part-time work.
Martin has done all the design work,
which looks so much better than my efforts of the original.
You can see that.
And there's a couple of older hidden webpages there that are very poorly designed,
which is just me like making some plain bootstrap pages.
And we just had James come on to help with making the project
like better documented and easier to onboard as a contributor
because I was running so fast on standing everything up and scaling it up and collecting all that data that I didn't really leave a lot of documentation along the way, which is terrible. But hopefully, like, these are pretty basic Rails apps.
There's not a lot of interesting stuff, like intentionally trying to make it the most boring tech
possible so that I can focus on the interesting stuff, which is like the parsing or the mapping of the metadata. Each app has that core little nubbin of like, oh, here's where the real
logic sits and that's like a nice well-tested bit of functionality with a load of rails scaffolding
around it to be like, okay, write this into Postgres and then serve it up in kind of the quickest way possible.
How many apps is it now?
Oh, good question. It must be coming up to 20, but some
of them are quite small. Like there are, there's a load of services that are kind of like stateless.
Like I will just give you a SHA-256 of a tarball that you get from RubyGems or similar.
And a lot of those I basically have on the chopping block to try and turn into something a little bit more like,
imagine GitHub Actions, but for analyzing packages.
So rather than it happening every time that you commit or every time you open a pull request,
instead it'd be like, you can define, I want to run this kind of analysis on this package when a new version comes out. That might be like copyright and license extraction, or it might be,
do me a capabilities analysis of this Go package using the Capslock library, which will basically go like, oh, this library just gained network access and it can read environment variables and it became a crypto miner. That would be a great way of, like, being able to highlight some of those
changes. So I want to pull it down and make it a little bit, kind of like, fewer services,
but one of those services will be basically the like,
which open source analysis do you want to run against this package?
And then here's a massive fire hose of every activity that is happening.
And you can hook those analyses in to say, like,
okay, I want to run zizmor every time I see a GitHub Actions change, because zizmor does the security scan on the YAML config to go like, oh, you've just introduced a footgun of GitHub Actions here, and then try and publish all of those analyses back out as a public good, just basically fling that into S3 or something as a way that allows researchers again
to go and do broad analysis over the whole ecosystem or multiple ecosystems
without having to spend all their time like collecting all of that base data and
normalizing it and then setting up infrastructure to run all of that across
you know, all of those packages. I see that time and time again where the paper is like,
50% of the work is, oh, well, we had to collect all of this data and we had to make sure that
it all fit into the right box. And then we could actually start doing the interesting research.
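The hook system described here doesn't exist yet; the following is just a sketch of the shape of the idea, with hypothetical declarative run-this-analyzer-on-this-event rules fed by a firehose of releases:

    # Hypothetical rules: which analyzer to queue for which event.
    ANALYSES = [
      { event: :new_release, ecosystem: "go",      run: "capslock" }, # capability changes
      { event: :new_release, ecosystem: "actions", run: "zizmor"   }, # workflow footguns
      { event: :new_release, ecosystem: "*",       run: "license-extract" }
    ]

    def dispatch(event)
      ANALYSES
        .select { |a| a[:event] == event[:type] && ["*", event[:ecosystem]].include?(a[:ecosystem]) }
        .each   { |a| puts "queueing #{a[:run]} for #{event[:package]}" }
    end

    dispatch(type: :new_release, ecosystem: "go", package: "example.org/somelib")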
So what I hope is we get to a place where it's like, oh, you don't need to do that.
You can just use this open data set. And that gives you a good starting point to then start
to really dig into like what's going on in these ecosystems.
that's the dream anyway.
You're certainly working your way towards that.
So does Schmidt Sciences, do they foot the bill for all of this work?
So they gave a grant initially to get started, luckily because they gave it in dollars
and the exchange rate was very positive for a while.
So we actually managed to stretch that from a one-year grant into a two-year grant.
And then Open Collective has been supporting the project as well, as a fiscal host, but also as, like, a customer.
So I built a number of tools for them to help them kind of investigate ways of trying
to expand the ability to kind of let companies fund open source.
And then also to try and measure the return on investment of giving two projects
and try and be able to see like, oh, if I donate money here or resources, does that turn
into actions and changes on the repositories.
And that kept me busy for, you know, a good nine months, I think, of building out
tools for them whilst they financially supported the project.
And we also have a number of customers who pay for a different license for the data.
So the data is CC by SA, which is share like a copy left license.
You can use it for whatever you like as long as you also.
So persist the license and you credit where it came from.
But if you don't want to do that, then you can pay to essentially have a CC0 license.
It's not actually CC0 because there's some things there to say like, oh, don't just
completely undercut us and sell that on again.
But we have a number of customers there.
That basically pays for all the hosting costs.
So it's self-sufficient.
It runs itself as long as, but you don't get any extra feature development on top of that.
So that's what I'm trying to work on right now, to get that level of sustainability higher.
And we just received a grant from Alpha Omega to basically make that happen.
That's, Alpha Omega is part of OpenSSF and their goal is like turn money into security.
and they have become a big user of ecosystems for doing analysis
of like who are the critical projects in a particular space?
Who are the ones that are like going to be most likely impacted
if there's a big security vulnerability?
Who are the ones who have never had a security vulnerability
and maybe don't know what to do if they get one?
Things like that.
So they have basically given us a grant to try and help make ecosystems long-term sustainable. So that's things like making the project easier for people to onboard onto, and also to be able to kind of like charge large companies in different ways. That might
be like, oh, you want an even higher rate limit than the very friendly rate limits that are
already on there. Like you want to go even harder or then you can pay for, you know, like a super
rate limit or similar. And then also this kind of like pipeline of analysis will be another way that
it would basically be like, oh, you want to run your LLM queries across all this package source code. Well, then you can funnel it through here. We'll just, like,
tee that up and trigger it every time that we see a new release of a package or similar
will be another way that I think would be essentially just like, oh, you're just paying
for our CPU to do this analysis. And then the, the analysis that comes out the other side,
if it is, like, idempotent, I guess. You know, LLM queries are not idempotent.
You're going to get a different thing every time you do an analysis.
But for a lot of those things, we just come out as a public good and companies will have
paid to have it generated, but then it's shared for everyone to use, which I think is a nice
thing.
I mean, what I'd really like to be able to do then is to actually do revenue share with the
people who are maintaining those individual command line tools that do the analysis.
Imagine being able to go like, oh, we can help with supporting zizmor and Bullet, or all of these different things that are like command line tools that analyze source code.
And rather than you build a whole enterprise company around your command line tool, you can just focus on making that tool really good.
And then we can run it at scale for customers and then just funnel the money back to the maintainers after whatever infrastructure costs there were to run it, so that you can actually focus on building the open source
tools rather than building the scaffolding around it. That would be super cool. So it sounds like
there's a collection of potential income sources, some that are currently working, and ones that you're working on. The relicensing of the data for a fee seems like a good one. Is that
potentially like could you see a world where there's enough people that want to do that,
that that could be enough or no? Yeah, I think so.
especially this kind of dependent data, the 25 billion row table, is really juicy in terms of the insights that you can get from that.
The general package data, though, is often, like, you can get Claude to generate you an npm scraper very easily.
Like, if you ask it to do it in Ruby, you get code that looks a lot like Libraries.io when it comes out.
That's awesome.
Do you get a nickel when that happens or what happens?
No, unfortunately, no.
It looks a lot like.
Yeah.
Well, you know, imitation is the sincerest form of flattery.
So just remember that.
Yes.
It's tricky to get that kind of balance of like, I want to give away as much as possible,
especially as all of this data comes from open source.
Like it is, it should be open because it is data about open source.
But then, like, how do you continue to pay for that whilst companies can also kind of go like, oh, I could just go fetch it from the source myself?
And trying to get as many different ecosystems supported is a good way of kind of going, like, you really don't want to try and index the R package manager.
Like, you're not going to have a good time.
So, like, we try and take care of all of the horrible bits.
And then also being able to fetch, like, the Linux distro package managers,
which is something that I'm trying to add more distro support in because each one of those
has its own kind of like horrible rabbit holes of weird and wonderful metadata, and trying to work out, like, how does this fit into the schema. A lot of it is kind of trying to tie it around
the package URL format, purl, but not Perl the language, although you can have a Perl purl for a CPAN package, that is, you know, a purl about Perl.
That kind of came out from efforts in the SBOM world, and originally kind of one of the inspirations was Libraries.io being able to map these things into different ecosystems and kind of say, like, you have
an ecosystem, you have a name of a package and you have a version.
Like can we talk about this in a kind of fairly standardized way?
as a way of transporting these package bits of metadata
between different platforms
that are doing analysis of different kinds.
An SBOM is kind of like the natural conclusion of that.
Of course, you have two different SBOM standards.
There can't just be one standard for things.
But being able to look things up by purl is something that ecosystems serves really well, because you can basically then take an SBOM and just work through it, every single package that's in there, and say, can you tell me about this package?
Can you tell me what security advisories are affecting the version that I've got in my SBOM?
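A minimal sketch of that lookup, assuming the packages service's purl lookup endpoint; the endpoint path, parameter, and response fields here come from the public API docs and may change:

    require "net/http"
    require "json"
    require "uri"

    # A purl packs ecosystem, name, and version into one string: pkg:<type>/<name>@<version>
    purl = "pkg:gem/rack@2.2.3"

    uri = URI("https://packages.ecosyste.ms/api/v1/packages/lookup")
    uri.query = URI.encode_www_form(purl: purl)

    JSON.parse(Net::HTTP.get(uri)).each do |pkg|
      puts "#{pkg["ecosystem"]}/#{pkg["name"]} latest=#{pkg["latest_release_number"]}"
    end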
And that is like the biggest use right now: there are lots and lots of people with GitHub Actions that are just enriching their SBOMs with this kind of information.
And they just, it's funny how much more traffic we get on the weekday than on the weekend.
And it's, I think it's just because of the GitHub Actions kind of like, oh, this is happening every time someone commits.
So you see a smash of traffic of them, like, enriching their SBOMs and checking out every package that is in there.
And then the weekend comes along.
Everyone stops working.
And the traffic shape completely changes.
And also the cache hit rate completely goes through the floor because suddenly it's like, oh, there's all kinds of other weird and wonderful things happening at the weekends,
especially lots more like researchers and hobbyists using it.
So you've mentioned a few of the weird, gnarly things, like multiple SBOM specs, et cetera.
You have 35 ecosystems on here, npm, Golang, Docker, to name a few, right, Crates,
NuGet, so you're in that world, across 75 registries.
So I'm assuming, you know, some ecosystems have multiple registries.
Yeah, Maven especially, there's lots of registries in the Maven world. And then, oh, even Bower.io. I remember Bower. I don't know if people are still using that anyways.
Forever ago, man.
No one adopts anything. They don't accept any new packages, but you still find people that use them and download stuff through them.
Yeah. So what I'm wondering is, like, you know, where are the black sheep? Where is the gnarliest, weirdest... I don't want to create any enemies for you, Andrew, but which of these ecosystems are, like, in your own heart of hearts, notoriously hard to work with?
Well,
The hardest bits are often, like, the change over time, especially when you go back to the really old stuff.
The classic one is that you'd think, like, oh, npm, their names are case insensitive.
But if you go and try and index every name in npm, you will find about 1,000 that are case sensitive and have clashes with, like, a different-case version of the name.
And those still exist on the registry.
They haven't been removed, and so if you try and make an index against that, you're going to have a bad time, because as soon as you actually go to run that, you're like, oh, that's not like that anymore.
So there's things like that when you go back in time, like going back further and further is like,
like going back further and further is like,
oh, there's weird things here,
especially when the package manager registry
has like a document database
rather than something that is like always enforcing
its schema in every record. And, you know, npm used to be CouchDB, which is like, oh, they've changed some schemas of the package metadata, so in new packages it looks different than old ones. Of course, now it's actually Postgres underneath and it pretends to be CouchDB, which is interesting, and imagine a headache in terms of, like, actually maintaining that. But they still have some really old and weird, like, you just run into, like, ah, this bit of metadata isn't right for these few packages because it was frozen in time there, as JSON in Postgres now, somewhere. Similarly with Maven,
they've got lots of different kinds of pom XMLs, and there's so many features in the way that Maven can, like, have these nested and parent POMs. I don't really have, like, a background in Java, so I've never used Maven as a user, but the amount of different ways that you can describe the data that is stored in a pom XML and then published out to Maven Central... And of course, once it's on Maven Central, it's, like, frozen in time almost. They don't then go and update. Like, if RubyGems adds a new attribute to their registry, that becomes available in the metadata for every single endpoint, because it's just a Rails app that's generating JSON. But for the things that store the files as a historical,
like, we just dump this file somewhere, then you're like, okay, my code needs to be able to
know every different possible version of this, how this worked, and then also be able to
recover from it. The worst one is the R package manager. It's not huge, but it is used a lot
in the research space. And they don't have an API. You have to scrape HTML from
the thing. They also remove packages quite regularly, which is very strange. So R has this
really weird, I think it's because it's come from a scientific kind of like non-developer
background. Like R, it also has one-indexed arrays, which not many programming languages have, right? But their package manager won't let you pin to an older version of
something. It won't say, like, I want version 1, even though version 2.0 is out.
And the knock-on effect of that is that, so when, as a user, if I'm going to say, install my
R packages, I always get the latest version of everything. That means that if something's broken
because something else got a new version, rather than the new version causing the breaking
change being told off, it's actually the package that didn't upgrade to fix the problem
with this other package that just updated. So if you don't, if you're not proactive in fixing
breakages with your package being used with other packages, your package gets removed. It gets
kicked out of that registry, which is pretty wild because, you know, people, especially
in science trying to make their science reproducible, are like,
oh, my package got yanked.
Like, how am I supposed to reproduce this science?
It's no longer here.
So they have some very strange behaviors where they'll actually make snapshots of the registry.
And then, like, so you can say, I want to install my R package from this registry on this day.
So you actually have like a weird historical aspect of the thing, which is, it's not like a lot of other package managers.
And it's very hard to change because, you know,
there's just not a lot of, we don't have a lot of funding in open source,
but in terms of research software engineers,
there's no incentive there to maintain and develop software
unless it has a paper attached to it, right?
You get, if you can get citations, great.
Like, you can continue to make a case to keep working on those things.
But once it's done, it's done kind of thing.
You're like, oh, you already published that paper.
I don't need to continue maintaining the software.
that's something that I have an interest in trying to solve,
but it's a very hard problem to kind of break into.
But what I'd like to be able to do is go, like,
can we connect the world of papers and citations
back to the software that's being used to especially,
like there's a lot of Python code that is like,
might not look like it's massively used,
but then when you kind of go,
oh, but it's mentioned in all these papers,
especially the kind of AI papers,
as well, which are just, like, exploding at the moment, if you can then say, like, we can send
some of this transitive citation credit down the dependency graph to the transitive dependencies
of the things mentioned in a paper, like, I bet there are maintainers who have no idea
that their, like, low-level Python or Julia code is being, like, referenced in these massive
papers.
Like, that's the discovery aspect there, but also for the people that do know to be able to go back to their institution and say, look, my software is supporting all of this research that you're publishing, you should also support me because that will make your research better, would be a really cool thing to make happen.
Until they say, well, we already published those papers, so who cares?
That attitude makes it tough for sure.
Yeah, there's a lot of still that kind of like, oh, open source is just there.
I can just use it.
I don't need to contribute back in any way
because someone else will do it
is still a totally unsolved
like social problem, I think,
in the wider open source space.
Well, if somebody wants to write a paper
on the reproducibility problem
in scientific papers
due to mismanaged packages
in the R language,
I think that would be a hit.
I think it would be a hit.
Oh my gosh.
I'm still dumbfounded
that they would not let you pin
to an older version.
I know.
I feel like that's going to break
so many research projects
that go stale, essentially.
Well, there's the Software Heritage project,
which is a massive index
of like the hashes of every file
ever published to any open source thing, which was basically produced to try and help solve that problem.
to try and help solve that problem.
Like you had to make a full index
of every file in every Git repository
to be able to try and get around the fact
You can't pin to older versions in R's package manager.
I mean, there are still other package managers that don't have lock files in them,
which if you think, like, years ago, yes, it wasn't such a problem.
But nowadays, like, lock files are so critical to the way that people, like, build and
maintain and share their software to be able to go, like, oh, it works on my machine.
It should work on yours because, you know, you're literally installing the same set of
dependencies. And Docker works for that at a high level. But as soon as you want to change one
thing, you obviously blast away the whole Docker image and have to start over. Whereas the
lock file works really nicely at the language level to be able to kind of solve that problem.
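In the Ruby world that pinning lives in Gemfile.lock, and Bundler will read it back for you. A small sketch, assuming a lockfile in the current directory:

    require "bundler"

    # The pinned set: the exact versions every machine installing from this
    # lockfile will resolve to.
    lockfile = Bundler::LockfileParser.new(File.read("Gemfile.lock"))
    lockfile.specs.sort_by(&:name).each do |spec|
      puts "#{spec.name} #{spec.version}"
    end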
If your package manager doesn't have one, you should definitely try and like get that added in somehow.
What if AI agents could work together just like developers do? That's exactly what AGNTCY is making possible.
Spelled A-G-N-T-C-Y, AGNTCY is now
an open source collective under the Linux
Foundation building the internet
of agents. This
is a global collaboration layer where
the AI agents can discover each other,
connect, and execute multi-agent
workflows across any
framework. Everything engineers
need to build and deploy
multi-agent software is now available
to anyone building on
agency, including trusted identity,
and access management, open standards for agent discovery, agent-to-agent communication protocols, and modular
pieces you can remix
for scalable systems.
This is a true collaboration
from Cisco, Dell, Google Cloud,
Red Hat, Oracle, and more than 75
other companies all contributing to the
next-gen AI stack.
The code, the specs, the services, they're
dropping, no strings attached. Visit agntcy.org, that's A-G-N-T-C-Y dot org, to learn more and get involved.
Again, that's AGNTCY, A-G-N-T-C-Y dot org.
So your team has amazing ideas flying around.
You know the feeling,
but turning them into something real.
Feels like wading through peanut butter.
Super thick, right?
Peanut butter is tough to walk through.
We've all been there.
The gap between idea and impact,
it is brutal.
And just throwing AI at the problem without clarity,
that only makes things worse.
We all know that.
That's why I checked out Miro, investigated it, love it.
And that's why I recommend it.
Miro is the innovation workspace that helps teams get the right things done.
Faster, powered by AI, teamwork that used to take weeks, now takes days.
You can use Miro to plan product launches, map complex workflows.
You can even generate fresh ideas from interviews all in one place.
And the Miro AI sidekicks, it's like having your own product leader, agile coach,
and even a product marketer
ready right there to review,
clarify, and give feedback
right inside your workspace.
It's cool. You can even build
custom sidekicks tailored to your
workflow. Plus, Miro Insights
pulls together sticky notes,
research, and docs
into clean summaries so you spend
time building, not digging.
Help teams get great done with Miro.
Check out Miro.com.
That is M-I-R-O-com.
Once again,
Miro.com.
Behind the scenes, I've had some AI
literally obliterating your API.
With the polite mode on, of course,
I've passed my name so you can track all the things I'm trying to do here.
But it has finally found a way to craft a script that will pull back essentially some version of curl -fsSL, blah, which is the URL where the thing lives, and then piping that to sh.
And so I've got a nice dramatic list of projects through research
that use that command and what, you know, what that install sh script looks like
and what are some of the details in there.
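A rough version of that hunt, over a local corpus rather than a live API; the regex and file globs are illustrative and will miss plenty of variants:

    # Flag curl-pipe-to-shell installs in install scripts, READMEs, and Dockerfiles.
    CURL_PIPE = /\bcurl\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba|z)?sh\b/

    Dir.glob("corpus/**/{Dockerfile,README*,*.sh}").each do |path|
      File.foreach(path).with_index(1) do |line, lineno|
        puts "#{path}:#{lineno}: #{line.strip}" if line.match?(CURL_PIPE)
      end
    end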
So it didn't take long, but my gosh,
if I did not have AI to do this for me,
I would have pulled my hair out so badly.
And probably not your API by any means,
but just more like you can get the data, it seems,
but it's very... you've got to, like, comb through it. You've got to be persistent and very...
Wow, there's a lot of kind of...
The schema is not simple.
No.
Unfortunately, and it's hard to find a way to describe that in a way that doesn't just...
Like, people will just switch off and kind of glaze over as you start going into the levels.
Something I've also tried to do over the past couple years as the AI bots have kind of gone mad
is actually let them scrape the website, right?
Rather than block them, I've said, you can go mad, in the same way as I used to let Googlebot go mad on Libraries.io.
Because two years in, like we've had a full training cycle of the frontier models,
they actually know what ecosystems is and they know the structures of the APIs
and they can actually just suggest those things,
which is like a good and a bad thing.
But I think, in terms of being able to get into the training data, in terms of, like, my API is here and my service exists, it's helpful to people who are using AI coding agents to do some of these
things. I have dabbled in the MCP world with this stuff and it would be very easy for anyone
to build an MCP adapter on top of this. But the security implications really hurt my brain. So I have
kind of held off going hard into it, because, you know, every string that is returned by the MCP is essentially like a prompt injection vector.
So you imagine your version number that is pulled from an NPM package and then fed through an MCP server into your context. They have the ability to make a version number, especially if it's semver with a pre-release string on the end of the version number, into a prompt injection vector, where I just start putting, like, ignore all previous instructions dash 1.1 in the strings that come from the package managers. Suddenly a version number is a security vector, or even just the description of the package, or the name of the package. There's a lot of trust that happens on the way through, when it comes out of an MCP server on the other side. If you're just saying, like, blindly install whatever the MCP server told me, then there's a lot of trust that you're putting into many layers of indirection.
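The worry is concrete, because a semver pre-release tag is an arbitrary string. A minimal sketch of the kind of defensive check an MCP adapter could apply before registry-sourced version strings reach a model's context; the regex is the standard semver.org pattern, and the keyword screen is an illustrative assumption, not anything ecosystems actually ships:

```python
# Sketch: treat registry metadata as untrusted before it reaches an LLM
# context. A version like "1.1.0-ignore.all.previous.instructions" is
# valid semver, so parsing alone is not enough; suspicious pre-release
# tags get withheld here, and free-text fields would need stricter handling.
import re

# The semver.org recommended pattern (build-metadata group simplified).
SEMVER = re.compile(
    r"^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)

# Illustrative screen for injection-looking text; an assumption, not a standard.
SUSPICIOUS = re.compile(r"ignore|instruction|system prompt|assistant", re.I)

def safe_version(raw: str) -> str:
    """Return the version if it parses as semver and its pre-release tag
    doesn't look like prompt-injection text; otherwise a placeholder."""
    m = SEMVER.match(raw.strip())
    if not m:
        return "[invalid version withheld]"
    prerelease = m.group(4) or ""
    if SUSPICIOUS.search(prerelease):
        return f"{m.group(1)}.{m.group(2)}.{m.group(3)}-[pre-release withheld]"
    return raw.strip()

print(safe_version("1.1.0-ignore.all.previous.instructions-1.1"))
# -> 1.1.0-[pre-release withheld]
```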
And we've definitely seen, like, loads of threat actors have realized how, like,
I'm going to use the word juicy so many times.
But in terms of being able to go, like, I can just, there's no restrictions.
I can publish things to a package manager.
And that might be, like, the sixth level of indirection before I actually get to my target. It's very hard to see all of the moving pieces until they actually all come together. But most of these package managers have zero restrictions on what you can do. Like, even GitHub only just recently started saying there are certain restrictions on how automated NPM publishing can be, because people were literally, like, every commit, I'll just publish a new version.
Why not?
There's no restrictions like 100 versions a day, which is like why are you doing this?
Well, because we could.
And the cost to the registries is mad as well.
Like you see that PyPI are just showing their numbers continue to grow.
And they're like, well, how the hell are we going to continue to fund this?
Because it doesn't look like it's going to stop anytime soon.
It feels like there's a lot of challenges that are kind of like coming down the pipe
for these shared open bits of infrastructure to keep them as open as they currently are.
Well, what is your take, then, on the rate limits and polling, when it comes to this polite nature you have here?
Like, how do you leverage that?
Because I can pass in my email, but then you say, well, I can reach out to you later.
You're watching my rate limits, of course.
Can you just shut me off because of me passing that email to you?
Or how do you curb the enthusiasm, so to speak?
So right now we have the anonymous rate limit, which I think is fine: 5,000 requests an hour per IP address, basically. And then the polite pool, which is a term we borrowed from a service called OpenAlex, which is basically like ecosystems, but for research papers. They have this thing where, if you pass in your email address as part of the user agent, then you just get an uprated rate limit, so that if we see that you're smashing the API, we can contact you and say, like, oh, what are you doing? Can we help you do this in a different way? So far I haven't actually been tracking that particularly closely; I've literally just gone, great. Cloudflare is still catching most of that stuff before it gets to us; if you hit anything that's cached, it doesn't even touch your rate limit. So it's only the uncached things that actually affect that rate limit. But even then, it's like 10,000 requests an hour. If you're really, really hitting it, you're going to run into that. And then a 429 response is very cheap to serve up, so I can serve up a lot of rate-limited requests before things start to fall over.
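In practice, a polite client only needs two behaviors: identifying itself and backing off. A minimal sketch, where the mailto-in-User-Agent format follows the OpenAlex convention mentioned above, and the example endpoint is an assumption rather than a documented path:

```python
# Sketch of a "polite pool" client: identify yourself with an email in
# the User-Agent (the OpenAlex-style convention described above) and
# back off when you get a 429. The example endpoint is an assumption.
import time
import requests

POLITE_UA = "my-research-tool/0.1 (mailto:you@example.com)"

def polite_get(url: str, max_retries: int = 5) -> requests.Response:
    """GET with polite identification, honoring 429 plus Retry-After."""
    for attempt in range(max_retries):
        resp = requests.get(url, headers={"User-Agent": POLITE_UA}, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # 429s are cheap for the server; sleep instead of hammering it.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError(f"still rate limited after {max_retries} tries: {url}")

# Usage against a hypothetical endpoint:
# hosts = polite_get("https://repos.ecosyste.ms/api/v1/hosts").json()
```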
And then, looking at the patterns and going, like, how are people using this? Is there a way I can do a higher-level API that avoids, you know, having someone do that trawling? Or is there a way of exporting big chunks of data, rather than doing lots of individual little queries? That's another thing that we're exploring.
It may be like a big ClickHouse with a read-only interface, where you can write your SQL query, or SQL-ish query, against a column store worth of data. Similar to BigQuery, but without the, you know, whoopsie, I spent $3,000 on my one query through BigQuery because it pulled in terabytes of data.
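For flavor, the kind of sweeping query that setup could serve. Everything here is hypothetical; the endpoint, table, and column names are invented for illustration (and the service isn't live yet), but it shows the ClickHouse-over-HTTP pattern of posting SQL in the request body:

```python
# Sketch of the kind of sweeping, read-only analytics query such a
# ClickHouse mirror could serve. Endpoint, table, and column names are
# invented for illustration; ClickHouse itself really does accept SQL
# posted in the HTTP request body like this.
import requests

SQL = """
SELECT ecosystem,
       count() AS packages,
       sum(dependent_repos_count) AS dependents
FROM packages
WHERE licenses = ''            -- packages missing license metadata
GROUP BY ecosystem
ORDER BY dependents DESC
LIMIT 20
FORMAT JSONEachRow
"""

resp = requests.post(
    "https://clickhouse.example.org/",  # hypothetical read-only endpoint
    data=SQL,
    headers={"User-Agent": "research-sketch (mailto:you@example.com)"},
    timeout=120,
)
print(resp.text)
```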
But that is a bit of an ongoing side project.
It's not actually live yet for anyone else to use.
But hopefully for researchers, especially,
you'll be able to just be like,
oh, I can just do big sweeping queries
in a kind of an offline way
rather than having to hit the live Postgres databases
because that's like the source of truth of these things.
And often researchers aren't like, I need the most up-to-date, within-the-hour changes. They're like, ah, actually, I'm fine with this if it's a day or a week old; it's really not too much difference. Compared to, you know, the security advisory stuff, where you want data that is as fresh as possible, which is often where you're scanning your SBOM and trying to find, like, where are there new vulnerabilities that are affecting me?
Yeah. How do you prioritize your time, I suppose? There's a lot to cover, it seems. Even discoverability: if I am naturally interested, how can I pull this data out? It seems like I would have to spend a lot of time to figure that out. That's okay, but, you know, who is your user? Who are you really building the platform for? I know who's using it, but how do you prioritize your time relative to how it's being used?
Well, to be honest, the number one user is me right now.
That's who I prioritize for because I have a good picture of how you'd want to be able to pull this data out.
So the APIs: each one has its own OpenAPI YAML spec, which kind of tells you, here are all the different endpoints that you'd want to use.
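That makes endpoint discovery scriptable. A small sketch; the spec URL below is an assumption, so check each service's docs page for the real location:

```python
# Sketch: list a service's endpoints straight from its OpenAPI YAML spec.
# The spec URL is an assumption; each ecosyste.ms service documents its
# own, so check the service's docs page for the real location.
import requests
import yaml  # pip install pyyaml

SPEC_URL = "https://packages.ecosyste.ms/docs/api/v1/openapi.yaml"  # assumed

spec = yaml.safe_load(requests.get(SPEC_URL, timeout=30).text)
for path, methods in sorted(spec.get("paths", {}).items()):
    for verb, op in methods.items():
        print(f"{verb.upper():6} {path}  {op.get('summary', '')}")
```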
And then the things that, like, I'm building applications on top of this data as well and going like, oh, this is not here.
Like, or I want to be able to do it like this.
So often, like, a lot of those APIs have shown up because I couldn't get them to work right.
Josh Bressers has also, like, had a good amount of input in just, like, absolutely thrashing various aspects of it to look up lots of data around CVEs and the kind of rate of versions being published.
There's also kind of loads of tools that have been built on top of it. So Snyk has a tool called Parlay, which does SBOM enrichment.
And so I can then go and these things are open source.
I can go and look at them and see like, oh, how are they currently using the existing API?
Is there a better way that I can do?
Or do I just need to beef up the caching in some of these kinds of places?
It's very much like the prioritization a little bit is like just running around
putting out fires, but then occasionally it's like, right, I'm going to turn everything off
and I'm going to go and like tackle one of these slightly chunkier problems of essentially
like solving a bigger challenge than just, oh, there needs to be a new API.
Often that's like, oh, there needs to be another service for another kind of data or there
needs to be another way of querying this thing because lots of people have been asking for
this.
Like, the biggest thing is just coming and asking for things on the issue tracker. That's a great way to kind of kick off that conversation and say, oh, I've been trying to do this, I'm trying to solve this problem, but I can't work out how to get through; you know, I've hit a wall here. Or, there's just too many individual bits of data over there; can there be an aggregation of this thing somehow? And sometimes that's easy, and sometimes it's like,
oh, actually, if we make this index,
it's going to be like the index itself
is like 500 gigabytes in size.
That's hard to fit into RAM.
So maybe we think of another way
to solve that problem rather than just like
adding an index for every single different way
you might want to query Postgres.
I found the Introducing Parlay post. They even mention that Parlay, you know, is enriching these SBOMs using ecosystems. So are they one of your paying customers then, considering this tool is probably part of their...
No, they are using... So, Parlay is an open-source tool that other people can use.
And so they, people, and it's primarily companies, because, you know, open-source developers don't actually care about SBOMs, because they're like, here's the code.
You know, I had to search what SBOM enrichment was. I guess I should have guessed it: take a little bit of data and make it better. I don't know. Okay.
Well, most SBOM extractions don't... Like, when you produce an SBOM
from, say, a repository or from a Docker, a container, it will go, here are the packages and
the version numbers, but it's not going to tell you, like, and here is all of the information about
that package, because they just don't have that on disk available most of the time. Some package managers, especially, like, the distro package managers, do actually have that information right there. But, you know, these SBOM generation tools don't go and hit the NPM API directly
to fetch all of those things. So if you want to be able to get a high-level overview of, say, the license breakdown of all the different packages in your SBOM, then you need to enrich it by, you know, basically going through each one, fetching some extra information, and filling in the license field. Maybe there's, like, maintainers; there's a load of different fields in there, and it depends on which SBOM standard you're looking at as well, because they're different. But there's also just being able to look up all the security CVE stuff. It's nice if you're only working in one particular ecosystem, because you can use npm audit or bundle audit, but as soon as you get into the multi-ecosystem things, which every Docker container is, right? It's going to be, like, oh, it's got my Django app with a JavaScript front end, and also all of the back-end, low-level distro package stuff. There's a big collection of random bits of software in there, and I really don't want to have to use 10 different tools to enrich it. I just want one thing that will just sweep across and support everything.
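A minimal sketch of that enrichment loop, in the spirit of what Parlay does: walk a CycloneDX SBOM's components and fill in missing license fields from a package-metadata lookup. The lookup endpoint, its purl parameter, and the response shape are assumptions about the API, not its documented form:

```python
# Sketch of the enrichment loop described above: walk a CycloneDX SBOM
# and fill in missing license fields from a package-metadata lookup.
# The lookup endpoint, purl param, and response shape are assumptions.
import json
import requests

LOOKUP = "https://packages.ecosyste.ms/api/v1/packages/lookup"  # assumed
UA = {"User-Agent": "sbom-enricher-sketch (mailto:you@example.com)"}

def enrich(sbom_path: str) -> dict:
    """Fill in missing license fields on a CycloneDX JSON SBOM."""
    with open(sbom_path) as f:
        sbom = json.load(f)
    for component in sbom.get("components", []):
        purl = component.get("purl")
        if not purl or component.get("licenses"):
            continue  # nothing to look up, or already populated
        resp = requests.get(LOOKUP, params={"purl": purl}, headers=UA, timeout=30)
        if resp.ok and resp.json():
            spdx = resp.json()[0].get("licenses")  # assumed field name
            if spdx:
                component["licenses"] = [{"license": {"id": spdx}}]
    return sbom

print(json.dumps(enrich("sbom.cdx.json"), indent=2))
```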
You mentioned a couple of times building things on top of this. Since this is sort of a redo for you, it's kind of like a take two, to do it better.
Is this the substrate for many things?
What are some of those things that you mentioned?
Like you mentioned some things being built on top of,
but what are those things?
What's the world you envision?
There's a few that are listed on the ecosystems homepage.
So we have the things that I've built for Open Collective,
which are the funds app and the dashboards app.
Those two things, definitely, they don't have their own data. They're essentially aggregations of various bits from ecosystems to solve particular challenges.
One thing I've not built is a search engine.
I've kind of been like,
I'd like to see if someone else would build that.
You know, like I already did that in Libraries.io.
But that would be a natural one to add in there.
What I'd really like to build is things that help maintainers understand
who is using their software.
And this is going back to that 24 billion rows of dependency data, to be able to say, like, how much is bots, how much is Docker pulls, how much is just, like, CI builds.
Which, I guess, those are all still users, right? I mean, if I'm that person releasing a hundred times, I'm still pulling the packages every time I commit. Yeah, yeah, boom, new version, because I can, you know.
And also to be able to go, like, if we can flip that graph upside down and show you, here are the key people downstream depending on your library, then rather than you finding out that you broke them because they come into
your issue tracker after you just publish that release and say, like, you broke stuff, like maybe
building a CI that is like an inverse that goes, okay, well, you committed something, let me go
and test this against your downstream, like your most popular downstream users to make sure
that you didn't break those things. And there's some difficult bits there in making sure, you
know, those downstream CIs are reliable.
They're not just going to be like, oh, actually, our tests pass all the time regardless,
or they fail all the time so you can't trust, like, if you actually broke anything or not.
But to be able to do that would give maintainers insights that would be like they can actually
be proactive about some of these things and maybe even be able to coordinate and go like,
oh, I'm able to reach out to these projects and say, like, I'm going to break this thing
or I'm going to change this thing to make it better,
can I help you upgrade in the process
rather than just, you know, like firing it out into the world
and then not being able to know what the impact was
until after the fact.
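A rough sketch of that inverse-CI idea, with heavy caveats: the dependents endpoint below is an assumed shape, the package name is hypothetical, and real use would need the downstream-reliability filtering just described:

```python
# Rough sketch of "inverse CI": fetch a library's most popular downstream
# dependents and run their test suites against the local working tree.
# The endpoint shape and package name are assumptions; flaky downstream
# suites would need filtering before this signal could be trusted.
import os
import subprocess
import requests

BASE = "https://repos.ecosyste.ms/api/v1"  # assumed base URL
UA = {"User-Agent": "inverse-ci-sketch (mailto:you@example.com)"}

def top_dependents(ecosystem: str, package: str, limit: int = 5) -> list[str]:
    """Return repository URLs of the most-starred dependents (assumed API)."""
    resp = requests.get(
        f"{BASE}/usage/{ecosystem}/{package}/dependencies",  # assumed path
        params={"per_page": limit, "sort": "stars"},
        headers=UA,
        timeout=30,
    )
    resp.raise_for_status()
    return [dep["repository_url"] for dep in resp.json()]

def test_dependent(repo_url: str, local_lib: str) -> bool:
    """Clone a dependent, point it at the local library, run its tests."""
    name = repo_url.rstrip("/").split("/")[-1]
    subprocess.run(["git", "clone", "--depth=1", repo_url, name], check=True)
    # Install the local working tree over the released version.
    subprocess.run(
        ["pip", "install", "-e", os.path.abspath(local_lib)], check=True, cwd=name
    )
    return subprocess.run(["python", "-m", "pytest", "-x"], cwd=name).returncode == 0

for url in top_dependents("pypi", "mylib"):  # "mylib" is hypothetical
    print(url, "OK" if test_dependent(url, ".") else "BROKEN")
```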
Like I've also been indexing Dependabot data
as a way of being able to show,
I've no idea why GitHub hasn't done this,
but as a maintainer of a thing,
if I publish a new version,
I want to know how many Dependabot PRs actually, like, were successfully merged, or were closed as, like, no, I don't want this because it broke my CI, or were just completely left. Like, give me more context so that I can understand what's
happening with the people that are using my stuff, at least in the open, because there's so many
open source users now that it's a good proxy through to closed source. Tools like that that
enable maintainers to do, like, more with the same amount of time that they're putting into the
project by being more data-driven, or being able to just have more visibility. Because I think a lot of them are working in the dark a lot of the time, partly because, you know, you put the blinkers on and you just focus on getting what you need out of your project, but also because they just have no good idea of where their key consumers are, and the knock-on effects of being able to go, like, oh, I make a breaking change that breaks this other library, and that ends up having, like, a significant impact. As well as, you know, if you have a security advisory,
to be like, hey, significant end users of my thing, there's going to be a security update,
like, FYI, get ready to bump rather than be, you know, like, oh, we're stuck on this
version and now we're going to have to like scramble to try and get it updated to be able to
get a little bit more coordination and collaboration by, you know, being data driven, I think
would be amazing.
That's my slightly bigger picture of what I would like to build on top of it is to really empower maintainers to have an impact to make their process better, but also then make their open source software better because everyone uses open source software.
And so then you make all software better by just improving, you know, the base layers of the most critical packages.
It's a pretty big goal.
but I think there's enough untapped data there
that I think can be really powerfully leveraged
to make a good go at improving some of these things.
Could you maybe discuss how that interface manifests,
like what would you show the maintainer?
What do you think?
Where would you begin when it comes to exposing the data?
Like, how do I get to know my users, the people using my thing?
Yeah, so I could imagine you would see, like, okay, well, for this particular package, and maybe I've got lots of packages, but I just drill down to one of them, then I can see here are, like, my top dependents. And top being, there's lots of different ways you can define what a top would be, but we can
just use the ecosystems like usage metrics is one thing. Here are like the key projects that are
using your stuff and then which versions they're currently pinned on as well. So they might be
just like, oh, I always pick up the latest version, I've got Dependabot doing the updates,
but maybe there's someone who's really heavily using your stuff, but they're actually
pinned to an old, like an old major version. And that's like an insight into, okay, well,
why were they stuck? Like, maybe I can go and help them upgrade or I can learn that actually I made
the most horrific breaking change ever. And they really, really don't want to upgrade because
of, you know, it completely causes them too many headaches to do that. And maybe I can consider
that in like how I then continue to maintain that project going forwards. You could also then
use that interface to say, okay, well, can you show me everyone that's on this specific version,
like, or is 50% of my users stuck on an old version? Or are they stuck on an insecure version as well, to be able to go, well, we had this CVE three months ago, and most people... Especially thinking about this from, like, the individual packages that depend on me, to be able to see the knock-on effect, because all the users of those packages are, like, my transitive users. There's a lot of data there, but being able to highlight where those key points of leverage are, like, these things could be improved here. That would be
one way of that kind of being manifest. The other way you could do it, rather than it being a UI, is more like a notification system. You've got the proactive kind of things, like, your dependents have updated to your latest version. Or, your dependents are having problems: they tried to upgrade to this thing, and here's the context of this Dependabot pull request and the discussion that they had, and they haven't yet merged it. To be able to show you that, oh wow, okay, that's interesting, it's having a problem for them that we didn't even imagine, because we're not using the same database as they are for our testing purposes. Something like that.
And maybe there's an AI element there, once you get to very large numbers of users, where you're like, actually, this is too many downstream users to reach out to; maybe I can empower Copilot or Claude through some kind of prompt that is like, I've described the changes in my changelog in a way that helps them upgrade from one version to the next. But there's a lot of people that are very reluctant to take on some of those things, because they can be wildly unreliable sometimes when you try to do things the same way over and over and over
and over.
It's kind of like telemetry via exhaust, too. You're not literally tracking your users; you're tracking them through the natural usage patterns of the open source ecosystem. So you're not asking them to opt into too much telemetry either.
Yeah, I really try not to be too
invasive. I try not to track too much data about the individuals and instead keep it at the
project level, because, you know, for one thing, the projects are all licensed in a way that says, yeah, you can share this and you can understand this; the licenses let you do that. Whereas, you know, tracking individual people is a much more messy thing to do, because
people come and go and they change their names and they change their email addresses and it can
be hard to try and pin them down. But also, you know, most open source projects, they're all volunteers. Trying to pin requirements on an individual is asking a lot of someone who is, like, just giving away their code. So instead it's, oh, well, if you want to do something to help, then here's data so you can do it, rather than being like, we're going to force you to upgrade, you know, you must do this. You wouldn't want to use ecosystems to power, like, a massive wave of automated pull requests, for example. For one thing, GitHub would just shut you down straight away.
They're allowed to run Copilot or Dependabot at a large scale, but you wouldn't want that; it would be horrible for maintainers, right? You hear Daniel from curl constantly saying how many different AI bots he's dealing with. Especially if it's incentivized in any kind of way, then you're going to make a mess.
But ecosystems tries to kind of just watch.
What's the vibe of these ecosystems going on at the moment?
And then you can use that to try and have impacts on top of that.
Have you found any information black holes in your desires for features or tracking things?
I mean, funding, exact amounts of funding is an example, I guess.
But anywhere else where you're like, man, I could build this, but I went looking for the data and there's no data.
Oh, so yeah, the funding one is a big one.
The other thing that I'd really like is kind of more data around the non-code contributions,
but that's really hard to get, right?
Yeah.
Your discords and your slacks are not open enough to be able to really index without, you know,
you need an API key, or you need a ghost user sat in a Discord collecting everything.
Getting creepy.
Yeah, it is way too much.
There are tools.
There's ecosystems again tracking us.
Get out of it, ecosystems.
I'd start joining all the community Zoom calls
with an AI chat log kind of thing.
But now there are tools like that.
Bitergia has one that you can configure to track your own community: you can feed in mailing lists, and you can feed in your Slack, or your Discord, or similar. But you're kind of doing that at a per-community, or even just a per-repository, level. Trying to do that at a mass scale is stepping into worlds that I'm not really comfortable with, in terms of the amount of tracking of stuff. Also, it's just really, really messy. Like, open source metadata is messy, but it is tangibly okay; I can see how I can connect the dots here. Whereas once you get into unstructured text of discussions of things, you're quickly into, right, well, we're just going to try and have LLMs process everything here, and it's a horrible mess, and it's incredibly expensive.
Like, we use no LLM stuff in ecosystems because we just don't have any budget for that
kind of stuff, the amount of processing to analyze 12 million packages.
Well, you do now; our friends at Amp have a free tier. Just advertising it as I use it. I was just telling Jared about this on the pod releasing on Friday. You know, if you're not using Amp code for free, at least two hours a day or so, then you're missing a little bit of LLM work that you can get for free.
Ad-supported. That's the way, that's the way of saying it.
Well, yes, sorry, it is ad-supported, so you are getting advertised to. But, you know, I think that if you're not using that, and you have a use for a couple hours a day at no cost, you know, one of the 17 advertisers they have in the network is supporting your open source, essentially. It's kind of cool.
What else? Anything else we didn't ask you about, ecosystems-wise? I mean, we covered a lot.
Yeah, I'm trying to think if there's anything. I think I covered most of my kind of thinking about the future things, and that's mostly everything. What I'm working on at the moment is ecosystems, and I haven't got any other side things.
Octobox is dead, or...?
Octobox is ticking along. GitHub copied most of the features of Octobox, and then we lost most of the customers. So I still use it every day, but there's not a lot left there. So it still works. But it doesn't have any AI features, so it's not particularly interesting in terms of that aspect. Yeah, I think that nicely covers most of what I've been working on.
Well, it's really cool stuff.
I've always been impressed by your abilities and willingness
to just collect all the things and then organize them
and give them back out for free for people to use for various reasons.
It's probably exciting when you see somebody using it in a new way
that maybe you hadn't dreamed of or wouldn't even care to,
but you're like, oh, that's cool.
It shows that you're providing real value to folks.
Yeah, especially with the researchers. Like, people will come to me and say, I'm working on this paper that's investigating ways that we can get LLMs to suggest better projects, or packages, for you to use. Or, we're trying to reduce LLMs coming up with old versions of things: are there good ways of training it to reduce that? What do they call it? It's like a data lag, basically, like the training lag...
The drift?
The drift, that's it.
That's an interesting challenge without resorting again to kind of RAG or MCP.
Are there ways of doing short fine tunes after the fact of like here are the latest versions of things?
And people are doing some interesting research in that space using big chunks of ecosystems data.
The other thing I just started noodling on is an open source taxonomy.
So, to try and define a taxonomy that describes the different facets of what makes an open source project.
You know, what does it do?
Who is it for?
What technologies does it use?
There's about six different facets and about 130 different terms that I put together as like a V1 kind of thing.
Of going, like, if you were to put these packages into a box, or six boxes, which ones would they go in? Rather than just going, here's some free-text keywords, here's a load of the kind of chunks of things, including the role of the user as well. Rather than just thinking, oh, it's a front-end React app, it's like, but who is it for? Is it for an end user, or is it for a sysadmin, or for a developer? And then what domain is it in as well?
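To make the facet idea concrete, a toy sketch of the shape such a taxonomy could take; the facet names and terms below are illustrative guesses, not the actual V1 vocabulary from the repo Andrew mentions in a moment:

```python
# Toy sketch of faceted classification for open source projects. The
# facet names and terms are illustrative guesses at the idea described
# here, not the actual V1 vocabulary from the taxonomy repo.
from dataclasses import dataclass

FACETS: dict[str, set[str]] = {
    "function":   {"web-framework", "http-client", "orm", "build-tool"},
    "audience":   {"end-user", "developer", "sysadmin", "researcher"},
    "technology": {"ruby", "javascript", "rust", "postgres"},
    "domain":     {"science", "finance", "devops", "media"},
    "stack":      {"frontend", "backend", "cli", "library"},
    "maturity":   {"experimental", "stable", "legacy"},
}

@dataclass
class Classification:
    package: str
    facets: dict[str, str]  # facet name -> chosen term

    def validate(self) -> None:
        for facet, term in self.facets.items():
            if term not in FACETS.get(facet, set()):
                raise ValueError(f"{term!r} is not a known {facet!r} term")

# "A web framework for researchers": rotate any single facet to explore
# alternatives while holding the other dimensions fixed.
c = Classification("my-framework", {  # hypothetical package
    "function": "web-framework",
    "audience": "researcher",
    "technology": "ruby",
    "stack": "backend",
})
c.validate()
print(c)
```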
It's really early, but I'm hoping it is another way that can produce some alignment in this open source discovery world, because, you know, I worked at GitHub for a while on open source discovery and wasn't really able to make a good dent in it there.
But I think there's still a lot of low-hanging fruit
in terms of like just helping people find the right kind of tools to use
because not many other people have really tried; there's also just not a lot of money in that space. It's a loss leader, right, for most companies. Searching for open source is not going to turn into revenue. You can't even run a lot of ads against that kind of stuff, because open source developers are, like, the number one users of ad blockers. So those ads will disappear pretty quickly.
But I'm hoping that that taxonomy will be like, here's a nice blueprint of ways that you
can define your project and put it into ways that then allow you to kind of go like,
okay, well, I've got five dimensions here, but I want to rotate around one of them. I want a web framework for researchers, but I want to rotate about the technology.
Like, what are my options there? Or I'm definitely in this technology space and looking at this
kind of like position in the stack. But what options do I have here for different like users?
And to be able to kind of like twist the picture a little bit, but in a fairly defined space,
rather than in, you know, just arbitrary free text, because, again, you just end up in this soup of words, which is like, yeah, we kind of just get very fluffy. And often projects just don't have very well-defined ways of being found. You know, they don't add a description to their GitHub repo, or any keywords or topics, so you just kind of never find them unless it's in a generic search engine, which is then really hard in terms of, oh, well, what are my options in this
space? And I made this as just like, surely someone has made one of these already. And I found
a taxonomy of software in the research space, but I did not find a taxonomy of open source
software. So I was like, okay, I can make a stab at one of these. I've never made a taxonomy before, but I put it together as, like, this should be interesting. And it's been useful so far, and it's started some interesting conversations, but I really need some people with more experience in, you know, actually defining taxonomies than I have, to give more input and also expand it and cover the problems. Because I'm pretty sure there's going to be loads of problems in it, because I basically just put it together in a couple of days as, like, okay, I think this should work.
Where does that live?
That is on the ecosystems GitHub, and also a really quick web page I made at taxonomy.ecosyste.ms. It's literally just a few days old, so it's not anywhere on the website yet, but it is on the GitHub org as OSS-Taxonomy. I'll get a link in the show notes.
Awesome. Yeah. Send us that. And anything else you want to make sure
we get into our show notes so that you all can just click through and find that and help Andrew
figure out this taxonomy so that we can all
start to kind of
formulate around it. Categorization is always
useful, especially for
very gray, otherwise gray areas
such as these, especially if you're
self-defining, it helps you to
even flesh out your idea or your project
better. I think this is fertile ground
right there, honestly, because you got so many
I would describe it as like
ecosystem explorers.
You may have, previous to LLMs being a ubiquitous thing and agents helping you, just stayed in the zone that you're comfortable in, because you're the mere human that cannot move to the next thing faster. You know, and then you get into this LLM world and you're like, man, I can actually explore new languages because it knows them. I know this language, and I can at least translate my knowledge. And so now you find yourself exploring in Go or Rust, whenever you would have normally just stayed in the Ruby world, because maybe that's where you're comfortable.
You know, and so when you go into those worlds,
you're like, well, how do people test here?
How do people deal with HTTP?
How do you deal with security things?
And so you find yourself exploring new worlds while you know the Ruby World well, you don't know the same kind of projects that would help you in a different lens.
I think that's going to be useful, honestly.
Yeah, definitely.
There's also kind of the ability to see, like, where are the gaps in a particular space?
Where have there not been, like, many people working?
Or there's only just this one old library; is there an opportunity to kind of jump in and improve that? Or, as you say, you come into a new ecosystem and you're like, what is the Sidekiq of X?
Exactly.
Often it's like, oh, well, actually, in the Erlang world we don't need Sidekiq, because we have, you know, OTP; it's kind of all built in. But to be able to learn, like, what is the alternative to this thing, is going to be an interesting way of challenging that. And maybe also kind of breaking down some of these massive projects into sub-pieces as well, to be able to go, okay, well, you've got something huge, but actually there's lots of individual components here that can be used without you having to take on, like, oh, I've got a massive Apache Airflow install now that does everything, when actually I really only want a piece of this. But how the hell do you go finding that, if their discovery is just folders full of strangely named projects? That's not particularly helpful necessarily in terms of discovery.
Let's close with this: what do you want from the world? You seem to be a pretty quiet guy. There's definitely a blog there, so you're active; I don't know how frequently you podcast. We haven't talked personally in years, at least me personally; maybe you've talked at least once, Jared, without me in the meantime. But what do you want from the world for this project? What kind of response do you want from coming on the show, or producing all this work?
Well, I have had my head down basically since leaving Tidelift and then COVID happening. I basically just got my head down and started plugging away. I also started doing track days in a Subaru BRZ, which is an excellent way to get away from the computer. If you're into interesting cars, track days are brilliant fun.
But ecosystems has kind of like been building up and building up and it's now reached
the point where I'm like, I need more people helping kind of like, not just contributing
to the code, but helping it work out where it should go next. Because I can definitely come up with lots of things I would like to see happen,
but I need more input from more people on like,
how would you like to have an impact on the open source world,
like through data?
So that's input in like feature requests or thinking about that from a slightly
higher level of like collaborations,
ways that ecosystems can support different efforts,
be it like security,
searching for projects that are like, oh, there are ways we can improve this part of an
ecosystem. The collaboration is really what I would like to see more of. And I am starting to do more podcasts and various kinds of things. I started a working group with the CHAOSS metrics people around package manager metadata, trying to share the kind of learnings from developing ecosystems, and being able to map metadata across different ecosystems in standardized ways. But if people are interested in ways of, you know, understanding and using data in open source to have impact, then ecosystems is literally gearing up right now, through the Alpha-Omega grant that we just received, to, you know, bring more people into this space and help them have real impact on, like, knock-on
effects of improving open source.
Wow, very cool. I'm glad COVID's over, obviously. I'm glad that you're poking your head out of the hole, little rabbit, and showing the world what you've got. It's kind of cool. I like it. Good stuff, Andrew. Thanks for coming on the show again.
Yeah, thanks so much for having me.
There you have it.
Ecosystems, a very cool web app with a very cool domain hack.
That's E-C-O-S-Y-S-T-E dot M-S.
Check it out.
There is so much data to dig through.
I'm sure you can think of cool stuff to build on top of it.
And if you do, let us know in the comments.
We hang out in Zulip.
You can too.
It's free.
Just click the link in your show notes or find the episode page on our website and hit the discuss button.
That'll get you there.
Thanks again to our partners at Fly.io and to our beat freak in residence, Breakmaster Cylinder.
And thank you to you for listening.
There's a zero percent chance we'd keep this thing afloat for 16 whole years without you.
So thank you. Seriously. It means a lot.
This has been your midweek interview, but we'll be back on Friday.
You got to listen to that one.
It's the pound to find champs game.
Come play along.
We'll talk to you then.