Software Huddle - Building CI for the age of AI Agents with Aayush Shah
Episode Date: July 22, 2025
Today's episode is with Aayush Shah. Aayush is one of the co-founders of Blacksmith, which is a CI compute platform. Basically, Blacksmith will run your GitHub Actions jobs faster and with more visibility than the standard GitHub Actions CI runners. The founding team has a fun background doing systems work at Cockroach and Faire, and they're taking on a big problem in running this massive CI fleet. The explosion in AI agents has really changed the CI world. CI is more useful than ever, as you want to be sure the changes from your agents aren't breaking your existing functionality. At the same time, there's a huge increase in demand and spikiness of CI workloads as developers can fire off multiple agents to work in parallel, each needing to run the CI suite before merging. Aayush talked about how they're handling this load and facilitating visibility into test failures. We also covered cloud economics. Aayush said the traditional cloud-based storage options don't work for them -- EBS and locally attached SSDs are too expensive for their workloads, where they don't need the standard durability guarantees. He walks us through building their own fleet outside the hyperscalers and the plans going forward, along with some of the economics of multi-tenancy that Blacksmith has previously written about.
Transcript
I'd say half of our team mostly uses Claude Code at this point.
A lot of stuff is kicked off with Claude Code, and that gets you 90% of the way, and then
the final 10% you finalize with something like Cursor.
But I'd say the other half of the team is all Cursor.
Yeah.
What about you personally?
What's your workflow?
I like to start things off with Claude Code and then take it the final mile on Cursor.
Now you're in New York.
Is the whole team in New York?
Yeah.
So all of engineering is in New York.
All of our growth and sales is in San Francisco.
I find that even the location split is sort of helpful, in that one office is entirely thinking about, you
know, sales and growth and content ideas and that sort of thing, and one office is
sort of entirely focused on product. Did your idea change much in YC, or did you
go in with pretty much this idea and come out with this idea? I think we
were one of the only companies in our cohort at least where we came in with this
exact thing and we are still kind of building this exact thing. What's up everybody? This is Alex,
great episode today with Aayush Shah, who's one of the three co-founders at Blacksmith. And Blacksmith
is like a CI runner for GitHub Actions, right? It provides the compute fabric, so a much faster CI
solution, while also giving you this interesting observability into
test history and let you know,
did this test fail because I wrote some bad code or because
it's a flaky test and it fails once out of every five runs anyway,
and gives you some visibility into that.
But one thing I liked the most was him talking about just how
these AI coding agents have changed CI requirements,
especially if you're out there spinning up
10 different agents working on things at a time,
it's very bursty
when all of them finish within a few minutes,
they all wanna spin up CI to see if it works,
and it's just like you have this very bursty CI solution
that can be tricky to handle,
and just how they're handling it,
and pretty interesting stuff there.
So be sure to check it out.
As always, if you have any questions,
if you have any comments,
if you have any guests
that you wanna have on the show,
feel free to reach out to me.
I love hearing about those
and it's a great way for me to find people.
And with that, let's get to the show.
Aayush, welcome to the show.
Alex, thanks for having me.
Yeah, absolutely.
So you're one of the co-founders at Blacksmith
and you all came on my radar in the last month or two
because there was this great post you had on multi-tenancy and like running a multi-tenant service and especially like the
economics of it, which I just think is really fascinating. Like, yeah, Marc Brooker's done some
good stuff and just like AWS generally thinks about that really well and it was cool to see
you know some other SaaS services do that as well. So I guess maybe let's first start off by just
having you give us
some background on you and
Blacksmith and what Blacksmith is and does before we dive into that.
Sounds good. Really great to be here.
I'll start off with a bit of a background on myself.
I started my career five years ago at a company called Cockroach Labs,
working on CockroachDB, which is this distributed SQL database.
I worked on a lot of the data placement aspects of Cockroach.
So one of their USPs was they would let customers kind of
transparently relocate data closer to where the traffic's coming from.
Cockroach would also automatically balance your load out
across the cluster, and it would do so
in like a multi-region setup.
Like you could have arbitrarily many nodes
in arbitrarily many regions across the world,
and it would move shards of data around
without disrupting foreground traffic at all.
So I worked on a lot of those aspects of Cockroach
for a while.
After that, I had a short stint at a company
called Superblocks,
which is an enterprise
internal tool builder.
It's a competitor to Retool, if you've heard of them.
And at both of these places,
and just throughout my career,
throughout internships during college,
it felt like there were a few problems with
regards to the software life cycle that just seemed endemic across all of these companies,
across all sizes, but particularly bigger companies.
And a lot of these problems were around, I guess, context for code reviews, which I think
is being attacked in various ways through all these AI code
review tools and all these bug bots, that sort of thing.
The second class of problems that
kind of keeps growing with the size of the team
is CI, both in terms of reliability, the amount of time
and effort it takes to keep it up from within the company,
and also just the amount of time you're spending waiting
for each pull request, not just in terms of the amount of tests
each pull request runs, also just the reliability
of the test suite itself. The percentage of like
flaky tests that start blocking people ends up getting to a pretty problematic place fairly early
on in the life of the company. And, you know, so that sort of was the genesis of Blacksmith in some sense. We also realized that across me and my co-founders,
all of our employers at the time were spending a lot of time
and effort maintaining CI infrastructure specifically
on hyperscalers like AWS.
And when you dive a bit deep into it,
it sort of feels like a lot of the trade-offs
that hyperscalers like AWS have you make
are kind of the opposite of what you want to do for CI.
I think the big thing we talk a lot about in our blog posts
is this forced bundling of EBS volumes, which is precisely
the exact opposite of what you want in CI. For CI, you don't care about durably persisting any of
that data because these jobs are ephemeral anyway. For stuff that you do care about persisting,
like cache artifacts, you just want a very high performance network link to whatever blob store
or whatever storage cluster. But for the jobs themselves, they actually need to be as ephemeral
and as low durability as possible. You kind of want to make the opposite trade-off. However, on
AWS, instances with local SSDs are a lot more expensive
than the ones with EBS volumes.
And they go even further in that a lot of the most modern instance
types with the newest generation CPUs,
they don't even launch the local SSD variants for those SKUs
until much later.
So I think at the time of recording this, on AWS,
I don't even think you can get an M7a or M7i instance with,
like, you can't get the D variant, the one that has local SSDs.
Yes, we felt like there was something here.
So long story short is with Blacksmith,
we're trying to build effectively a hyperscaler
specialized on CI workloads.
And we want to go beyond just the compute fabric.
So the compute layer that makes the right trade-offs for CI
is kind of one piece of the puzzle.
But in general, the set of problems you want to solve
are how do we make engineers more productive
as they're trying to merge code?
And how do we curtail the kinds of problems
that keep growing within a company
as the engineering team grows?
And especially with all these AI code gen tools,
making it so much easier to write code and write tests,
we think the bottleneck is going to shift more and more
towards merging code with confidence
and doing so without slowing your team down.
Yep, yeah, absolutely.
OK, and so are you mostly working with GitHub Actions,
or is it sort of like all types of CI?
It's mostly GitHub Actions?
Yeah, so it's just GitHub Actions at the moment.
Okay. And is that like eating the world right now in terms of CI?
Like is everyone on GitHub Actions?
Yeah, so I'd say a lot of the larger enterprises are still on Jenkins
and we often come across customers that are on BuildKite. A lot of the, interestingly,
a lot of the early 2010s tech companies,
companies like Uber, Shopify, Slack,
a lot of these types of companies are on BuildKite
because BuildKite was at that time,
kind of the de facto CI system.
No one wanted to use Jenkins.
BuildKite was sort of this more modern, sane Jenkins.
They also made it quite easy to self-host,
as in self-host the infra for BuildKite.
And so we come across those kinds of companies a lot,
but in terms of where a lot of the growth in the market is,
it's pretty much all GitHub actions.
I feel like if we surveyed companies
that were founded
after 2019, more than 95% of them
would be on GitHub actions.
Yeah, interesting.
And so will you build runners that work with BuildKite
and things like that, or are you just like, hey,
that's eventually just going to migrate over
to where we're just focusing on GitHub actions for a while?
We're going to do it at some point, but we want to push that
off to as far in the future as possible.
The way we like to think about this problem space is anyone can
build this undifferentiated compute fabric in some sense.
If you prove out that there's some kind of a market resonance
or PMF for a specialized CI Cloud platform,
anyone can build that.
And that's not where we think we can differentiate long term.
However, if we can de-risk a lot of our more longer term bets
around improving productivity in general, as it pertains to CI,
as it pertains to reducing flaky tests and that sort of thing.
It becomes much easier for us to in the future
translate those learnings to other CI systems.
But we think we want to deepen out on this one vertical first
and then concentrically expand.
And at that point, that becomes a more de-risked proposition.
Yep, yep.
OK.
So you're the second company I've had on.
I also talked to Depot about this problem.
And did GitHub just completely drop the ball here?
Or why are GitHub action runners just quite bad?
Or is it a certain segment of use cases that they're bad at?
Or what's going on with the native
GitHub action runners?
Yeah.
So I think there's a few factors at play.
I think the big one is that they have perverse incentives to not solve the problem just because
their pricing model charges by the minute. If they give you faster compute, not only are their costs going up, their customers
are spending less money, they're consuming fewer minutes.
And with how much growth there is for GitHub, because it's the de facto platform owner for
code repositories today.
Yeah, it just feels like there's a lot of perverse incentives
for them to not try to solve the problem the right way.
And secondly, internally, I think
we view GitHub Actions almost as this low-level orchestration
layer for workflows that operate on your repository.
And I think GitHub sort of views it the same way as well.
If you think about the fact that GitHub Actions has been
around for all this time, they still don't have a UI for you
to track test results.
They don't have a way for you to look at regressions
in a particular unit test.
You could have been running GitHub Actions for six years,
and you would have no idea how a test suite has evolved
over the course of that time.
So I think there's a lot of opportunity in the market
to build this observability layer on top of GitHub Actions
and treat it as a low-level
workflow orchestration platform in some sense.
Yeah.
Yeah.
Interesting.
Okay.
You mentioned flaky tests a few different times.
I imagine that's different from just the pure speed and resource requirements that you're
also solving, but the test flakiness.
I guess where do you see that flakiness come in
and how can you help customers there?
Is that an education thing or is there stuff you can do
for more automatically without them having to change?
Like where does that flakiness come in?
Yeah, I think that's a good question.
So I feel like there's many ways
you can kind of break that problem down.
But let's start with how flakiness even creeps in
and what I would even define as flakiness.
So there's some test failures that are deterministic.
You run it on a given SHA.
Someone landed a bad commit without maybe waiting for CI.
And maybe not all of your CI jobs are required for the pull request to merge.
So someone hastily just merged a pull request that had a failing unit test.
But the good thing about it is that it always fails.
Someone can spend 30 minutes on it and get it fixed up and unblock everyone else.
Now the worst kinds of failures are these non-deterministic test failures that let's
say only fail 1% of the time or 5% of the time.
Cockroach actually had a pretty big problem with this class of failures,
where some of our failures would only happen once you stressed a unit test for an hour,
out of like 100,000 runs.
And a lot of times, that was just the test being poorly written,
not the logic being tested.
The logic being tested was always correct,
but the test maybe had some kind of raciness
across the many things it was trying to synchronize.
But sometimes it's actually the logic being tested
is non-deterministically wrong.
So when you have a flaky test that sort of passes and fails on the same SHA, on the same piece of code,
it's actually a hard problem to determine whether that's a high-quality failure that points to a real problem with the system,
or just a test that has some sort of like a synchronization issue. Now, with that being said, with our customers,
the most common thing we see is customers
that have a large test suite of these end-to-end browser-based
tests, stuff like Cypress and Playwright and Puppeteer.
With a lot of these tests, they're running it by running an external instance of their application
that sometimes talks to a local database, sometimes it may even talk to an ephemeral Supabase instance
spun up just for that test.
So it has a lot of these moving parts. It may have an external network link involved somewhere,
which can let flakes creep in.
But these kinds of tests are extremely flaky.
And we've yet to come across customers
where they have a large test suite of Cypress and Playwright
tests that's not flaky.
And almost always, these are the former type of flakiness,
where the unit test itself is...
It depends implicitly on some kind of synchronization
that isn't guaranteed by the harness,
and that's why it's like flaking.
Yep.
And are there things that you can do for that,
or is it education?
Or is it just like, hey, that's the nature of
Cypress and Playwright and browsers and all that?
So there's a number of things we can do for that.
And there's a number of things I think we're going to have to do
for our customers going forward
with the rise of a lot of these AI codegen tools.
Because the good thing about these tools is that, for example, Cursor, Windsurf, Claude Code,
they're extremely good at writing unit tests.
They are extremely competent at writing Playwright and Cypress tests as well.
However, a lot of teams do not have protections in place that prevent them
from merging flaky tests to begin with. So one of the big things that we're working on is how do
we help teams selectively stress only the new unit tests that they're adding to their suite
so that when we detect that you've introduced five new unit
tests for this new feature that you've been working on,
can we stress them for you by automatically detecting that,
running them like 1,000 times before they're even
allowed to merge in, so that you're at least preventing
new flaky tests from creeping in over time.
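(To make that concrete: below is a minimal sketch, in TypeScript, of the kind of stress loop being described, assuming a project tested with Vitest. The test name, run count, and command are illustrative assumptions, not Blacksmith's actual tooling.)

```typescript
// Minimal sketch: stress a newly added test many times to estimate its flake rate.
// Assumes a Vitest suite; the test filter and run count are illustrative only.
import { execSync } from "node:child_process";

const RUNS = 200;                     // a CI provider might run on the order of 1,000
const TEST_FILTER = "checkout flow";  // name of the newly added test (hypothetical)

let failures = 0;
for (let i = 0; i < RUNS; i++) {
  try {
    // Run only the targeted test; a non-zero exit code counts as a failure.
    execSync(`npx vitest run -t "${TEST_FILTER}"`, { stdio: "ignore" });
  } catch {
    failures++;
  }
}

const rate = ((failures / RUNS) * 100).toFixed(1);
console.log(`${failures}/${RUNS} runs failed (~${rate}% flake rate)`);
```

(In practice you would spread these runs across parallel runners rather than looping serially, but the flake-rate estimate is the same idea.)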
That's one thing.
I think another big thing is just helping teams identify
flaky tests.
And the reason this is actually even more critical, in my opinion, than the first
thing is because the cost of a flaky test isn't just the amount of time it takes to
stabilize it. It's the impact it has on everyone else from your team that's trying to merge
their PRs that is confused whether their change
broke this unit test that's seemingly unrelated.
Throughout my career, and I'm sure you've experienced this
yourself, you push up a pull request.
It has five unit test failures or five test failures
of various types, out of which maybe three or four seem completely
unrelated to your change. And you're left wondering, you know, was it actually my change or is it just
like a flaky test? We want to surface that information as readily as possible. So we want
to help customers like score or automatically rank the relevance of a test failure to their particular code commit
and help customers identify whether,
given flaky tests is even worth digging into
or should you be digging into these two other failures
that are very likely related to your change?
Yeah, yeah.
And in terms of value prop of blacksmith,
do you think of it as sort of equal weights of like,
hey, these fast on-demand resources
to go spin up a bunch of tests very quickly?
And also these higher level things like visibility into long-term test history and stuff like
that?
Or do you think one is more than the other?
Or how do you think about that value prop?
Yeah, that's a great question.
It's one thing that we think about a lot.
At the moment, I think our value prop is still quite skewed
towards just the compute.
The fact that we let you spin up 3,000 vCPUs in less
than a couple seconds.
We kind of control this entire pool of compute,
and we can handle bursts really well.
And that's still a big part of the value prop.
The hardware stack that's optimized for CI
is still a big part of the value prop.
However, over time, our observability story is improving
and that's becoming more and more of a reason
we're starting to see customers try to adopt us.
Our internal goal is to get the observability piece to a point where customers would use us
purely for that, even if the compute fabric wasn't as good.
And it could be like what you're saying, where there's the easy win in visibility of like,
hey, someone could try your new, your compute runners
and just be like, oh wow, this is five times faster
than what I was currently running.
It gets you in the door.
But then people don't move away
because they love all that historical stuff.
And you're like, just being able to see that,
they're like, I'm never gonna move to a different runner.
Or if people go to a different company, they're like, hey,
we need to have this visibility that I don't have here now.
So just falling in love with that feature over time,
even if it's not as visible right away,
requires some history to get built up
before they can see that value.
Yeah.
As any vendor will try to do, you
want to leverage the data you have
with your existing customers to make the experience better for them over time,
make the platform sticky for them.
And then that's definitely one of the ways
we're thinking about it.
And we're realizing that as we get deeper into it,
a lot of these things lay the foundations for more higher level things that we can unlock.
This scoring of flaky jobs or flaky failure modes and helping customers understand
whether their PR caused a particular failure or not was something that just came up
in the last four months when we built a global log ingestion system for customers.
We started realizing that,
oh, we can collect failure signatures
across our CI pipelines.
And based on the frequency of any given failure signature,
we can help customers understand
whether that's a flaky failure
or if it's something unique to their pull request.
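(As a rough illustration of the failure-signature idea, and not Blacksmith's actual implementation: the TypeScript sketch below normalizes an error line so that incidental details like timestamps and stack positions don't make identical failures look unique, then uses how often that signature has been seen across recent CI runs to guess whether it's a known flake. The regexes and threshold are assumptions.)

```typescript
// Normalize a raw error line into a "signature" so repeated occurrences of the
// same underlying failure collapse together despite run-specific noise.
function failureSignature(errorLine: string): string {
  return errorLine
    .replace(/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<ts>")   // ISO timestamps
    .replace(/0x[0-9a-fA-F]+/g, "<hex>")               // pointers / object ids
    .replace(/:\d+:\d+/g, ":<line>:<col>")             // stack-trace positions
    .trim();
}

// Score relevance: if this signature shows up frequently across recent CI runs,
// it is probably a pre-existing flake rather than something this PR introduced.
function classify(
  signature: string,
  historicalCounts: Map<string, number>
): "likely pre-existing flake" | "likely caused by this PR" {
  const seen = historicalCounts.get(signature) ?? 0;
  return seen >= 5 ? "likely pre-existing flake" : "likely caused by this PR"; // threshold is arbitrary
}
```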
Yeah, for sure.
One thing on Cockroach, real quick,
I imagine CI for something like Cockroach
is very different from CI for some of your customers,
like Clerk or Vee.
Things that are more web-type applications.
Is that true?
And are you more focused on the latter?
I imagine the database one is so specialized,
but maybe I'm off on that. I guess how does that work?
Yeah, that's a great question.
I'd say most of our customers are primarily web applications.
I think about 80 percent of our CI workloads are TypeScript,
just running unit tests with Vitest, that sort of thing.
However, the pain point felt by a lot of these systems companies
is often a lot greater than a lot of companies
like Clerk, for example.
Just because a lot of these companies
typically have large Rust code bases,
often pulling in a lot of dependencies.
Rust is notoriously slow at compilation.
It's one of the main things people
dislike about the language. And LLVM-based languages also end up
being surprisingly IO-heavy during compilation, which
means that if you're trying to self-host CI
for those repositories on the hyperscalers,
you are really bit by the EBS volumes and local SSDs trade-off
that they force you to make.
So we find that when we outbound a database company,
or when they come to us, they actually have a much bigger problem on average. Yeah, and a lot of times we can help them cut their CI spend
and CI runtimes down by over 60%.
Yep, yep.
So you mentioned like 80% of your stuff
is TypeScript and Jest tests and all that.
Do you have a sense of what the back end front end
split is for those?
Yeah, so I'd say frontend is,
so pretty much because we sell to companies
that were founded after 2019 or whatever,
frontend is all React.
It's all like TypeScript React
and anything else is like an anomaly.
And on the backend, I'd say the,
I don't often know the exact percentages,
but TypeScript is the clear winner.
Golang is the second biggest player.
So I'd say it's more like a...
And sorry, I don't know if it was unclear.
I meant, of the tests that are run,
how many are more pure backend tests
versus running Cypress or Playwright or a browser-based test?
I'd say it's closer to a 50-50 split.
It's funny because the state of front-end testing is so poor.
Because writing component-level tests just does not work
after a certain size outside of some very critical,
complex components that your code base might have.
So if you want to have any coverage of your front end
after a point, almost the only option you have
is something like a Cypress or Playwright test suite.
Yeah, so I'd say within that, it's fairly 50-50.
Yep. Yeah. Okay. One more cockroach question. I know that cockroach is kind of specialized. Do you use
cockroach at Blacksmith, like given that you work there? Or
are you like, hey, you know, it's cool, and I like it, but we
didn't really need it for our specific needs?
No, not at the moment. Although we did have one use case for it
where we could have used it.
And you might still kind of move to it, but we decided to start off on like a simpler
Postgres based setup.
And then the kind of nice thing about all these Postgres compatible databases is the
on-ramp.
We can migrate to them when the pain point kind of becomes evident.
Yeah.
Yeah.
Very cool.
Okay.
Let's talk a little bit about the forced bundling, the EBS and EC2 stuff.
This is interesting because we also
had Sam from PlanetScale on talking about their new metal
offering.
And he had a similar complaint as you,
but a different approach to it, right?
Where he's like, hey, EBS is super expensive.
But for their workloads, I/O is still so important.
And I guess they decided
running the instance with the locally attached NVMe
makes sense for them, actually is a lot cheaper
than using EBS.
But for your needs and requirements,
it's like that's still just way too expensive to do that,
is what you're saying, given your sort of requirements in CI.
Yeah, so there's a number of threads we can dive into there.
I think the big part of our thesis behind Blacksmith
was that CI as a class of workloads
does not need to run in a customer's cloud account.
So for example, in your conversation with Sam,
he mentioned how, if you're building a database,
if you're a database vendor
and you're not on one of the hyperscalers,
it's sort of an immediate deal breaker
because your database needs to be as close as possible
to your production application.
It needs to be as close as possible
to your production blob store bucket
because of this indirect dependency where your application also needs to be close to that.
It needs to be close to where you're running your Kafka, which is likely also in AWS.
And the moment you run anything outside of that zone,
you're hit with extremely expensive
networking costs.
And you're also just adding unnecessary latency
to every user interaction, every back-end query.
However, that is not the case for CI workloads.
In fact, for most companies, your CI
is already running outside of your cloud account. Your code already lives in GitHub, which is not in your...
No one really ever used AWS's Git server implementation.
I forget what it was called.
Code something, yeah.
So we realized that CI was a class of workloads that you could expatriate out of the hyperscalers
and customers because it also doesn't deal with their end customer data. It's just code.
The security expectations for that are not as BYOC-centric, if that makes sense.
If I'm using a database and I'm trusting a vendor with my customer's data, I do
want that to be in my cloud account as much as possible.
However, if my code is already not in my sovereign jurisdiction, I'm okay with
my CI also being out of it.
And because of all of this and the fact
that the actual hardware itself is just so much cheaper than what
you can get on the hyperscalers, it just
made sense for us to start the business off
with that kind of understanding and the fact
that our unit economics improve as we get more customers.
So, yeah, I think it's like sort of a confluence of all of these factors.
We could consider in the future like a BYOC-type offering.
Once our observability piece becomes valuable enough that that's something customers care about.
But at the moment, this trade-off just makes a lot of sense, both for us and for our customers.
Yep.
And then so where are your instances running?
So we have one data center in Germany.
We work with another company in Phoenix in the US.
We're potentially working on a third region at some point
later this year.
And that'll likely be in US East, Virginia, that area.
The other interesting thing about CI workloads
is that for most jobs, they're pretty location agnostic.
They kind of just need to be close to your caches.
So if we control our customers' cache artifacts,
we kind of have a lot of control over where
we can place their entire org.
And we also have a bunch of control
over where we can move them.
However, one exception to that is jobs that perform Docker builds
and push Docker images to container registries.
Because typically, your container registry
lives in your AWS account in something
like ECR, which is homed in some region.
So at the moment, what we'll do is if a customer has Docker builds that push to a container registry in the US, in US West, we'll home them in the Phoenix data center.
Okay, okay. Yeah, I saw that you all started in EU Central. And I guess that's true. Like, there's not a lot of, you know,
going back and forth other than that Docker push.
I guess like was EU Central cheaper than hosting in the US?
Yeah, so we worked with a vendor called Hetzner.
I think they're sort of fairly popular
for just leasing out bare metal hardware.
They are fairly reliable.
They let you attach high throughput network links
to all of your instances, also for relatively cheap.
So that was the reason we started in EU Central.
Hetzner is only in the EU,
at least their bare metal offering is only in the EU.
And then over time, as we hit more scale,
we had kind of the leverage to start working with slightly smaller
data center colos in the US.
And we see us going lower down that pipeline over time
as our scale improves.
At some point, we'll start renting rack space,
but racking it up with our own hardware.
But all of these are problems for the future,
where we explicitly want to improve our unit economics
even further.
Yeah, for sure.
And so you mentioned blob storage
being essential for the cache artifacts.
Are you using something like S3 or Google Blob,
or are you using, like, does Hetzner have something,
or are you running Minio, or I guess what are you
running for Blob storage?
Yeah, so we offer two types of caching
for CI jobs for our customers.
One is backed by an S3-compatible Blob store.
We run Minio, which is this open source S3-compatible Blob
store.
And the second kind of caching we offer
is something we call sticky disks.
And sticky disks are effectively our implementation
of a network attached block device.
It lets us offer out of the box Docker layer caching
for all Docker builds running on Blacksmith.
So for instance, if your CI job builds a Docker image,
you can change one line of code to tell Blacksmith
that this job builds a Docker image
and we will transparently mount in this network block device
that contains all of your Docker
layers into your runner.
And that will asynchronously get committed
after your build completes.
And yeah, it happens completely transparently to you.
In the most optimized scenarios, it can be up to 20x faster. However, we do,
unfortunately, find that most customers do not have the most optimized Dockerfiles.
So it ends up only being about 60, 70% faster because that's the percentage of layers that are
cached in any build. But going back to your question, for every cell that we operate, every kind of
data center or colo that we operate, we run a fleet of machines that run our Blob Store.
We run a distributed storage cluster that runs Ceph that offers sticky disks. And we have our fleet of VM agents
that orchestrate virtual machines.
And these are all in the same data center.
Once they grow past the size of the number of machines
we can have in the same DC,
we kind of have to create a new cell.
And this is to ensure that all the VM agents
effectively have like a local link to the cache clusters,
because that's a big part of our value prop.
Gotcha.
And how is running MinIO?
Is that pretty straightforward and easy?
And is it significantly cheaper than S3?
The way we run it is not particularly cheap
because we also run Minio over NVMe drives,
whereas S3 obviously does a lot of intelligent tiering.
Operationally, Minio has been extremely smooth for us to run.
We did initially have some issues
when we were starting out.
But once you get through this initial operational hurdle,
it's been relatively smooth for us.
Their story around expanding the size of the cluster
is also less than ideal.
They almost necessitate that you have
to take a minute of downtime to expand the cluster.
But that ends up being fine in a lot of cases
just because of how rare these are,
and the downtime window is fairly short.
Yep, yep.
And even if that has downtime, does that just mean, hey,
this build isn't using cache and can still run the build.
It's just going to be slower.
Exactly.
And we can time it based on the data
we have around when the usage is the lowest,
we can time the window in a way
where like no one even like notices.
Yeah, nice.
Did you all have a lot of experience doing like orchestration
on, you know, Hetzner and things like that
versus the hyperscalers or was this new to you
or what was that like?
It was relatively new to us.
However, we did have,
so one of my other co-founders
and a lot of our team is from Cockroach.
One of our founding engineers is also from Cockroach.
We did have a lot of experience fighting fires with Cockroach customers that were running Cockroach on-prem.
Big banks and customers of that nature. So we were fairly comfortable with running a big service
on our own hardware.
We dealt with a lot of typical network congestion type
problems over time.
So it felt like it was tractable.
Of course, when the rubber meets the road,
you encounter new problems.
And there's things that you have not dealt with before.
But we've actually felt like every time we
want to run something that's large scale on something
like AWS, at least initially, we're
dealing with problems that we'd rather not deal with.
I'd rather not have a full service outage because my auto scaling group
doesn't have the quota limits for that specific instance type in AWS.
I have to contact AWS support to bump up the quota limits.
I would rather deal with problems that we can, you know,
as a team, just debug on our own and fix
and mitigate on our own.
Yeah, interesting.
Yeah.
One complaint I hear a lot,
and you mentioned a little bit earlier,
is just like the network IO cost in like the hyperscalers.
Does Hetzner have anything comparable to that?
Do they charge or limit you on bandwidth
or what does that look like?
So they have no limits on bandwidth.
They do charge you for bandwidth,
but I'll have to double check myself.
But I think it's, so for hyperscalers,
the egress costs, intrazone egress costs,
are $90 per terabyte, so 9 cents a gig.
However, yeah, I believe that's right.
That might be inter-zone or inter-region.
Could be wrong about that.
But on Hetzner, it's $1 per terabyte.
And that's kind of the typical network bandwidth costs
you'll pay with a larger ISP.
So a lot of that just carries over.
Interesting.
So the cost is very comparable to the hyperscalers'.
Interesting.
They're sort of just passing that down to the customers.
And that's a huge difference.
And it's low enough that we effectively just like
don't have to think about network costs at all.
Yeah, okay, okay, cool.
Okay, let's move on to multi-tenancy a little bit.
And I would encourage everyone
to read the blog post number one.
But the cool thing you all were explaining there is,
hey, CI workloads are extremely spiky
where you might have a customer that needs like 500 vCPUs very quickly, and maybe five times a day
people push code all at once. But if you aggregate a bunch of customers together, it actually
smooths out quite a bit to where the individual spikes don't need as much there. And I know
AWS talks a lot about peak to average
ratio, where peak is like the max you need at any given time, which is basically how much you need to
have provisioned, or else have queues or something. But average is what your margins are
measured against. I guess, is this something where, when you were coming into Blacksmith, you were like,
hey, this was a key selling point?
Is this something you figured out later on about just multi-tenancy and things like that?
When you were conceiving the idea of Blacksmith, was that a big point you were focused on?
Yeah.
We knew that our unit economics would be substantially worse in the beginning when we barely had
any customers.
And we still had to have a big fleet to support even a small
number of customers because workloads, like you said, are bursty within a customer.
However, we knew that the economics would improve.
We actually ran this Monte Carlo-style simulation early on, when we were in YC, to just run some
numbers on what happens when we're running 100 jobs a minute or 1,000 jobs a minute and
then just project that out.
We knew that if we hit a large enough scale, our margins would be substantially better.
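(For a feel for that kind of exercise, here is a toy Monte Carlo sketch in TypeScript of pooled CI demand. Every number in it is invented for illustration; it only shows the general effect that the peak-to-average ratio falls as more bursty tenants share one fleet.)

```typescript
// Toy simulation: each customer is idle most minutes and occasionally bursts.
// Pooling many such customers smooths total demand, so the capacity you must
// provision (the peak) shrinks relative to what you can bill for (the average).
function peakToAverage(customers: number, minutes = 1_000): number {
  const totals = new Array<number>(minutes).fill(0);
  for (let c = 0; c < customers; c++) {
    for (let m = 0; m < minutes; m++) {
      totals[m] += Math.random() < 0.02 ? 500 : 0; // ~2% of minutes burst to 500 vCPUs
    }
  }
  const peak = Math.max(...totals);
  const avg = totals.reduce((a, b) => a + b, 0) / minutes;
  return peak / avg;
}

for (const n of [1, 10, 100, 1000]) {
  console.log(`${n} customers -> peak/avg ≈ ${peakToAverage(n).toFixed(1)}`);
}
```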
We also knew that there would be a path for us to go down the hierarchy in some sense.
So we could start off with leasing bare metal machines from a provider like Hetzner, but
over time, rack our own machines and go down that route.
It seems like that's a trend that is also catching on in general in the industry.
I know Railway, which is like an app hosting platform,
they kind of did a similar migration off GCP
onto their own bare metal hardware.
And I think the quality of service
has actually improved as a result of that.
Yeah, you're actually the second person
I've talked to in two days that's using Hetzner
for some stuff too.
So yeah, you're really starting to see a little bit of that trend, especially if you have a unique
workload or like understand the trade-offs well and like you're saying aren't scared of
getting into the orchestration stuff and realizing, hey, maybe it's not as bad as you think it is.
So I know like one thing is, you know, CI would be somewhat correlated, right?
Because it's going to be during people's work days for the most part.
I guess, is there anything you all are doing to try and help smooth those numbers or try
and make it so people are moving jobs to off hours, either with pricing or even with like
customers you're targeting in different regions of the world?
Or you just, you know, you'll figure that out later.
And right now you're sort of in growth mode
as much as possible.
We're sort of in growth mode as much as possible.
However, we're realizing that
the bursty nature of CI workloads
is actually getting worse with a lot of these like
background agent companies kind of doing a good job and the model is just getting so much better.
We've had customers, we actually have one customer where the way they operate for a lot of
product-facing features that have a well-defined boundary, they'll file a bunch of linear tickets,
they'll create a linear project, they'll file 11 tickets, they'll furnish those tickets with
as much detail as possible, and then they'll just kick off 12 background agents all in parallel, which
will, within the span of the next five, 10 minutes, kick off 1,000 CI jobs, because each
pull request may kick off 100 CI jobs. So it's becoming more bursty, which
we're sort of handling well.
And that's another aspect of what
will make self-hosting CI even harder in the future.
Because typically, the way companies will self-host CI
is they'll run like an EKS cluster,
or they'll run GitHub's ARC controller
to manage the runner pods over that EKS cluster.
And maybe they'll use something like Karpenter
to handle node provisioning and scale the size of the cluster
itself as it saturates.
We've had a number of customers where that just becomes such a big operational headache for something like CI
that's not a core differentiator.
It's not a core competency.
They'd just like rather not spend time on it.
And as these companies adopt more of these background agents,
we see that shift happening even faster.
Yeah, yeah.
I guess, how does Blacksmith charge currently?
What's the unit you charge?
So we charge for two things, both usage-based.
We charge by the minute on the runners for the compute.
And we charge for cache storage, the sticky disks that you use.
Each customer gets some free quota for the blob store cache.
They also get some free minutes on the compute. For larger customers,
we'll sort of just negotiate bespoke pricing.
That's sort of another big advantage we feel like we have.
For larger customers, we actually have a lot of leverage in terms of
pricing our offering however we want and not based on just our costs.
So a lot of vendors, a lot of folks I think on your podcast have
talked about, they kind of have to begrudgingly do cost plus pricing.
Just because their costs are so linear to their,
like their gross margins are just fixed.
And 30% of their revenues just go to AWS,
some cases even more.
So they don't have as much leeway in pricing their products
the way they would want.
We don't necessarily have that.
Sure, our costs are still
like a big percentage of our revenue.
It's still a double digit percentage,
but it's not, you know, it's not,
like, our costs are not 40%.
Being over like an 80% gross margin business
lets us price things in a lot of different ways
that work well for customers.
Yep.
OK, that's super interesting.
And then can I have as many jobs running at a time as I want?
Or is there some sort of queue or limit?
Or what does that look like?
Yes.
But we'll have alerting on our end
when a customer runs more than 1,000 vCPUs worth of jobs, we'll
get alerted and someone from our team will at least look at it to make sure that it's
not someone that's trying to DDoS the system. Early on, we had instances where people would
try to mine crypto in GitHub Actions jobs.
Every time you have a compute service,
someone's going to try to mine crypto in it.
Yeah, for sure.
Exactly.
It's kind of silly because these days,
crypto mining has become so efficient
that unless you're running it on FPGAs or GPUs,
it's completely pointless to run it on CPUs.
Yeah, so we'll have alerting on our end
to make sure someone isn't attacking the system,
but there's no real concurrency limits.
Yeah, interesting.
Yeah, I was just trying to think about that,
to try and smooth out that load,
especially with that agent stuff,
if there's some way you can incent customers
to kick that off overnight,
where it's like, hey, you have those background, like there's 12 hours overnight
where they can run it at any time
and your load is super low and can spin those up.
But it's kind of like a tricky thing.
It's like, you don't want to give,
I don't know how you incent that.
We can easily launch sort of like a low-priority SKU
that lets the customer
kind of defer the scheduling of the job to us.
And we can offer a much lower pricing on that kind of thing.
It's something we're open to doing.
However, we haven't had that much demand for this sort of thing.
And that was surprising to me as well.
I think typically people just want their blocking CI jobs
to run as soon as possible.
For stuff that is not as critical,
they're already running nightly in a cron on GitHub Actions
itself.
So the demand for this scenario where
they want the vendor to do the scheduling
in a way that's disconnected from the orchestration platform,
which is GitHub Actions in this case, we haven't had a ton of instances of that yet.
Yeah, for sure.
On that sort of AI note too, do you feel like the explosion of coding agents
and all that is like a tailwind for you guys?
Cause there's just so much more demand for CI now.
Or is it a headwind cause it's like, oh man,
it's that much spikier that we have to deal with
and it causes that problem or how is that affecting y'all?
So especially for smaller teams or younger teams
that are moving fast or adopting a lot of these tools
faster than larger enterprises.
We're seeing that our growth rate for companies, even normalizing for sort of employee count,
is like over 60% quarter over quarter, meaning that for 10 engineers, they're running 60% more CI, quarter over quarter, with the same amount
of people.
Because we started the company during this whole coding boom, it's hard for me to juxtapose
that with what that would have been before.
But that seems pretty high. And almost all of our customers are,
we see a pretty big spike or surge
over the last few months of things like the Cursor Agent
and things like Devin and Claude and all of that in our data
just commits being pushed by agents more and more.
Yeah.
Are you able to see popularity of the different, especially terminal-based agents?
Can you share any of that stuff?
Yeah. We plan on doing a series of blogs on that kind of stuff. Just because it's interesting,
typically what happens is whenever there's a big launch, so when Devin launched, there was a huge spike to the point
where it was something like 0.3 or 0.5% of all of our CI.
And then it goes down and flatlines,
but then maybe it improves over time again.
And every time there's a big new model launch with, for example,
the Claude 4 series
of models, they resulted in almost like a step function increase in how much more people
were using background coding agents in general.
And I expect to continue to see that with newer model launches.
Yep.
Yep.
Super cool.
Okay.
On that same note, similarish,
you all just raised a $3.5 million seed round,
so congrats on that.
VC environment, super interesting right now,
because there's so much AI energy.
And then you all aren't an AI app builder or something
like that.
Was it difficult to break through the noise,
or did it help having this tailwind of, hey,
these coding agents are probably the most popular AI app right now,
and they're going to need a lot of CI?
I guess, how did that go?
It was fairly difficult with some VCs.
It was fairly easy with others.
So it was sort of this very interesting,
but kind of polarizing experience in some ways
where some VCs wouldn't even want to talk to a company
raising a round that's not an AI company, whereas the others kind of totally get it.
They sort of understand that one of the second order effects of all this codegen
is that CI will become a bottleneck. Yeah, so it was very much based on who we were
talking to and sort of how technical they were in some sense. In general, yeah, I would not be,
I would not want to start a non-AI company in today's environment in general, just because
all of the, you know, all of the air is like being sucked out of the room. And for good reason, a lot of these companies
are growing at rates that have basically
never been seen before.
Yeah, yep, that's pretty wild.
You all did YC as well.
I guess, tell me about your YC experience.
Did all three of you go to the Bay Area?
Yeah, so we lived in San Francisco for four months.
YC was quite a great experience for us, honestly, just because they're
quite hands off in some ways, but they still sort of keep you on your toes, and they kind of keep you
aligned on solving important problems, problems that are important for the business, and not kind
of go down rabbit holes that are not worth pursuing at that early stage of the company, which was quite
helpful. You get a bunch of early users, early usage from other companies in the YC batch,
which is obviously quite helpful as well. I think if you were starting a company that had
immediate demand with very early stage startups,
like a SOC 2 compliance type company, then YC is the ultimate jumping off point, just because you
can end the batch with two dozen, five-figure customers. And we weren't like that because a lot of these
YC startups at that stage don't have that much CI.
They're barely writing any unit tests at that point,
but it's still valuable usage.
Yeah. It gives you a big jump in the fundraising
right after the batch, which was also super helpful.
Yeah, for sure. Did your idea change much in YC,
or did you go in with pretty much this idea
and come out with this idea?
Yeah, we were one of the...
I think we were one of the only companies in our cohort,
at least, where we came in with this exact thing
and we are still kind of building this exact thing,
where even the thing we sort of pitched in our YC interview
was this like compute fabric for CI
with all this like built-in observability.
And in some sense, we still haven't like fully achieved that
but we're also still like working on the same vision.
And I'm glad that's the case
just because the constant shifting of focus
can be really detrimental at this stage of the company.
Yeah, for sure. Okay, now you're in New York. Is the whole team in New York?
Yeah. All of engineering is in New York.
All of our growth and sales is in San Francisco.
We have two offices.
We're going to likely grow the team in this fashion
for at least another eight months to a year.
But we're open to good candidates on either coast.
If there's a great GTM hire that is only
open to working out of New York, then we'd obviously
make exceptions.
Gotcha.
So you want in-office ideally sort of with those strategic hubs, but as long as
they're in an office, you're up for it?
Yeah. I think at this stage of the company, it's just so much easier to keep everyone aligned on
what's happening in the company, the direction we want to swim in, that sort of thing, if everyone is in person.
I find that unless your entire founding team has previously worked with each other before,
and it's sort of like a bunch of team members from LinkedIn leaving to do Confluent, that sort of thing. It's very hard to have everyone be engaged
and build mutual personal relationships with each other
if you're remote.
At some point, we're gonna have to be a bit more flexible,
but it feels like the correct kind of trade-off
at this stage.
Yeah.
Are all three of the founders more engineering heavy
and associated with engineering
or is one of them, one or two in SF doing more sales GTM type stuff?
Yeah. So JP, our CEO, he's in SF. He is solely focused on GTM and sales and kind of everything
else that needs to be done. Myself and Aditya, who's our third co-founder,
we're both focused on engineering.
And that's been a good split.
We get to kind of protect each other's times
and let us kind of do what we're best at.
And I find that even the location split is sort of helpful in that one office is sort
of entirely thinking about sales and growth and content ideas and that sort of thing.
And one office is sort of entirely focused on product.
At least at the moment, that seems really great for our focus.
Cool. Yeah. And closing out like on an engineering level,
I guess, what does the code base look like?
Is it TypeScript like you're saying? Is it lower level stuff?
Or what are you using for that?
Yeah. So we have a lot of Golang.
A lot of our infra services are Golang.
All of our front end is Next.js, TypeScript, React.
Our control plane is actually written in Laravel in PHP.
How interesting.
What drove that decision?
We sort of couldn't be happier with this split
in the stack, just because,
so we built a number of projects before,
just personal projects in Laravel.
And it's just a very productive way
to write application control planes, in my opinion.
All of the problems that you typically want to solve
in a control plane, background jobs, billing,
caching in various parts of the stack, queuing various types of jobs
in particular sequences and graphs.
All of those are just solved problems in a framework
like Ruby on Rails or Laravel.
Modern PHP, for all the flak it gets,
is surprisingly good and pleasant to
work with because it's typed now.
The tooling around it is really great.
The other thing that we didn't foresee when we were starting the project off in Laravel
was how good a lot of these coding agents were going to be at generating PHP code,
just because I don't think there's another language that has more training data out there.
But at the same time, the Laravel framework itself kind of keeps the solution space a bit more constrained.
There's typically one kind of canonical way
of achieving something.
So these agents will typically write code
that looks like the way we would write it.
Yeah, it's been incredibly productive.
Wow. Yep. You're the second person
that not only uses Hetzner,
but has some Laravel and PHP in there too.
So second person in two days with that.
What's your AI workflow then? Are you using something more interactive like Cursor?
Are you using Claude Code or Codex,
or more like terminal-based agents?
Like what do you do?
I'd say half of our team mostly uses Claude Code
at this point.
A lot of stuff is kind of kicked off with Claude Code, and that gets you 90% of the way.
And then the final 10%, you sort of finalize
with something like Cursor.
But I'd say the other half of the team is like all Cursor.
Yeah, what about you personally?
What's your workflow?
I like to start things off with Claude Code
and then kind of take it the final mile on Cursor. And I think that final mile is getting shorter and shorter as these
tools improve. Yep, cool, cool. I still love Cursor so much. I haven't tried Claude Code, but
I'm seeing more people switch to it. I gotta get over to it. So yeah, it's
shockingly good. They sort of dynamically switch between
like the thinking mode and kind of the non-thinking mode.
And yeah, it's just shockingly good.
Yeah, yeah, that's fine.
Well, this has been great.
Like it's been a fun episode,
like talking about all this stuff
and I appreciate you coming on and sharing all this stuff.
If people wanna find out more about you, about Blacksmith,
where should they go?
So Blacksmith is at blacksmith.sh. If you're running GitHub actions, we can likely help you move faster. And if you're spending a lot of money, and if that's a problem, we can likely
reduce your CI spend by quite a bit as well. And we think a lot of teams will like the observability features
that we already have and a lot of the stuff
that we have in the pipeline.
Very cool.
Now that you say that, if I'm switching to Blacksmith,
is it like, hey, it's a two line change in my thing
and everything else should work?
Do I have to make some changes to my CI jobs?
Or what does that look like?
Yeah, so you install our GitHub app and then you point your CI workflow files to Blacksmith.
So it's a one-line code change to run your jobs on Blacksmith.
We also have this like nifty migration wizard
that will move your entire repository onto Blacksmith
in three clicks.
And that will apply all of the caching optimizations
on top of just moving to our runners.
Very cool. So, yeah, easy to try out and see if it's faster,
if you like the observability stuff, and try that out.
So, again, Aayush, thanks for coming on.
It's been great, and yeah, best of luck to you going forward.
Thanks for having me.