Software Huddle - Building CI for the age of AI Agents with Aayush Shah

Episode Date: July 22, 2025

Today's episode is with Aayush Shah. Aayush is one of the co-founders of Blacksmith, which is a CI compute platform. Basically, Blacksmith will run your GitHub Actions jobs faster and with more visibility than the standard GitHub Actions CI runners. The founding team has a fun background doing systems work at Cockroach and Faire, and they're taking on a big problem in running this massive CI fleet. The explosion in AI agents has really changed the CI world. CI is more useful than ever, as you want to be sure the changes from your agents aren't breaking your existing functionality. At the same time, there's a huge increase in demand and spikiness of CI workloads as developers can fire off multiple agents to work in parallel, each needing to run the CI suite before merging. Aayush talked about how they're handling this load and facilitating visibility into test failures. We also covered cloud economics. Aayush said the traditional cloud-based storage options don't work for them -- EBS and locally attached SSDs are too expensive for their workloads, where they don't need the standard durability guarantees. He walks us through building their own fleet outside the hyperscalers and the plans going forward, along with some of the economics of multi-tenancy that Blacksmith has previously written about.

Transcript
Starting point is 00:00:00 I'd say half of our team mostly uses Claude Code at this point. A lot of stuff is kicked off with Claude Code and that gets you 90% of the way and then the final 10% you finalize with something like Cursor. But I'd say the other half of the team is all Cursor. Yeah. What about you personally? What's your workflow? I like to start things off with Claude Code and then take it the final mile on Cursor.
Starting point is 00:00:28 Now you're in New York. Is the whole team in New York? Yeah. So all of engineering is in New York. All of our growth and sales is in San Francisco. I find that even the location split is sort of helpful in that one office is sort of entirely thinking about, you know, sales and growth and content ideas and that sort of thing, and one office is sort of entirely focused on product. Did your idea change much in YC or did you
Starting point is 00:00:57 go in with pretty much this idea and come out with this idea? I think we were one of the only companies in our cohort at least where we came in with this exact thing and we are still kind of building this exact thing. What's up everybody? This is Alex, great episode today with Aayush Shah who's one of the three co-founders at Blacksmith. And Blacksmith is like a CI runner for GitHub Actions, right? It provides the compute fabric, so a much faster CI solution, while also giving you this interesting observability into test history and letting you know, did this test fail because I wrote some bad code or because
Starting point is 00:01:31 it's a flaky test and it fails once out of every five runs anyway, and gives you some visibility into that. But one thing I liked the most was him talking about just how these AI coding agents have changed CI requirements, especially if you're out there spinning up 10 different agents working on things at a time, it's like very bursty
Starting point is 00:01:48 when all of them finish within a few minutes, they all wanna spin up CI to see if it works, and it's just like you have this very bursty CI solution that can be tricky to handle, and just how they're handling it, and pretty interesting stuff there. So be sure to check it out. As always, if you have any questions,
Starting point is 00:02:03 if you have any comments, if you have any guests that you wanna have on the show, feel free to reach out to me. I love hearing about those and it's a great way for me to find people. And with that, let's get to the show. Aayush, welcome to the show.
Starting point is 00:02:15 Alex, thanks for having me. Yeah, absolutely. So you're one of the co-founders at Blacksmith and you all came on my radar in the last month or two because there was this great post you had on multi-tenancy and like running a multi-tenant service and especially like the economics of it which I just think is like really fascinating like yeah Mark Brooker's done some good stuff and just like AWS generally thinks about that really well and it was cool to see you know some other SaaS services do that as well. So I guess maybe let's first start off by just
Starting point is 00:02:44 having you give us some background on you and Blacksmith and what Blacksmith is and does before we dive into that. Sounds good. Really great to be here. I'll start off with a bit of a background on myself. I started my career five years ago at a company called Cockroach Labs, working on CockroachDB, which is this distributed SQL database. I worked on a lot of the data placement aspects of Cockroach.
Starting point is 00:03:19 So one of their USPs was they would let customers kind of transparently relocate data closer to where the traffic's coming from. Cockroach would also automatically balance your load out across the cluster, and it would do so in like a multi-region setup. Like you could have arbitrarily many nodes in arbitrarily many regions across the world, and it would move shards of data around
Starting point is 00:03:45 without disrupting foreground traffic at all. So I worked on a lot of those aspects of Cockroach for a while. After that, I had a short stint at a company called Superblocks, which is an enterprise internal tool builder. It's a competitor to Retool, if you've heard of them.
Starting point is 00:04:13 And at both of these places, and just throughout my career, throughout internships during college, it felt like there were a few problems with regards to the software life cycle that just seemed endemic across all of these companies, across all sizes, but particularly bigger companies. And a lot of these problems were around, I guess, context for code reviews, which I think is being attacked in various ways through all these AI code
Starting point is 00:04:50 review tools and all these bug bots, that sort of thing. The second class of problems that kind of keeps growing with the size of the team is CI, both in terms of reliability, the amount of time and effort it takes to keep it up from within the company, and also just the amount of time you're spending waiting for each pull request, not just in terms of the amount of tests each pull request runs, also just the reliability
Starting point is 00:05:23 of the test suite itself. The percentage of like flaky tests that start blocking people ends up getting to a pretty problematic place fairly early on in the life of the company. And, you know, so that sort of was the genesis of Blacksmith in some sense. We also realized that across me and my co-founders, all of our employers at the time were spending a lot of time and effort maintaining CI infrastructure specifically on hyperscalers like AWS. And when you dive a bit deep into it, it sort of feels like a lot of the trade-offs
Starting point is 00:06:10 that hyperscalers like AWS have you make are kind of the opposite of what you want to do for CI. I think the big thing we talk a lot about in our blog posts is this forced bundling of EBS volumes, which is precisely the exact opposite of what you want in CI. For CI, you don't care about durably persisting any of that data because these jobs are ephemeral anyway. For stuff that you do care about persisting, like cache artifacts, you just want a very high performance network link to whatever blob store or whatever storage cluster. But for the jobs themselves, they actually need to be as ephemeral
Starting point is 00:06:53 and as low durability as possible. You kind of want to make the opposite trade-off. However, on AWS, instances with local SSDs are a lot more expensive than the ones with EBS volumes. And they go even further in that a lot of the most modern instance types with the newest generation CPUs, they don't even launch the local SSD variants for those SKUs until much later. So I think at the time of recording this, on AWS,
Starting point is 00:07:28 I don't even think you can get an M7A or M7I instance with, like you can't get the D variant, the one that has local SSDs. Yes, we felt like there was something here. So long story short is with Blacksmith, we're trying to build effectively a hyperscaler specialized on CI workloads. And we want to go beyond just the compute fabric. So the compute layer that makes the right trade-offs for CI
Starting point is 00:08:00 is kind of one piece of the puzzle. But in general, the set of problems you want to solve are how do we make engineers more productive as they're trying to merge code? And how do we curtail the kinds of problems that keep growing within a company as the engineering team grows? And especially with all these AI code gen tools,
Starting point is 00:08:26 making it so much easier to write code and write tests, we think the bottleneck is going to shift more and more towards merging code with confidence and doing so without slowing your team down. Yep, yeah, absolutely. OK, and so are you mostly working with GitHub Actions, or is it sort of like all types of CI? It's mostly GitHub Actions?
Starting point is 00:08:46 Yeah, so it's just GitHub Actions at the moment. Okay. And is that like eating the world right now in terms of CI? Like is everyone on GitHub Actions? Yeah, so I'd say a lot of the larger enterprises are still on Jenkins and we often come across customers that are on BuildKite. A lot of the, interestingly, a lot of the early 2010s tech companies, companies like Uber, Shopify, Slack, a lot of these types of companies are on BuildKite
Starting point is 00:09:17 because BuildKite was at that time, kind of the de facto CI system. No one wanted to use Jenkins. BuildKite was sort of this more modern, sane Jenkins. They also made it quite easy to self-host, as in self-host the infra for BuildKite. And so we come across those kinds of companies a lot, but in terms of where a lot of the growth in the market is,
Starting point is 00:09:39 it's pretty much all GitHub actions. I feel like if we surveyed companies that were founded after 2019, more than 95% of them would be on GitHub actions. Yeah, interesting. And so will you build runners that work with BuildKite and things like that, or are you just like, hey,
Starting point is 00:09:56 that's eventually just going to migrate over to where we're just focusing on GitHub Actions for a while? We're going to do it at some point, but we want to push that off to as far in the future as possible. The way we like to think about this problem space is anyone can build this undifferentiated compute fabric in some sense. If you prove out that there's some kind of a market resonance
Starting point is 00:10:29 or PMF for a specialized CI cloud platform, anyone can build that. And that's not where we think we can differentiate long term. However, if we can de-risk a lot of our more longer term bets around improving productivity in general, as it pertains to CI, as it pertains to reducing flaky tests and that sort of thing, it becomes much easier for us to in the future translate those learnings to other CI systems.
Starting point is 00:10:57 But we think we want to go deep on this one vertical first and then concentrically expand. And at that point, that becomes a more de-risked proposition. Yep, yep. OK. So you're the second company I've had on. I also talked to Depot about this problem. And did GitHub just completely drop the ball here?
Starting point is 00:11:15 Or why are GitHub action runners just quite bad? Or is it a certain segment of use cases that they're bad at? Or what's going on with the native GitHub action runners? Yeah. So I think there's a few factors at play. I think the big one is that they have perverse incentives to not solve the problem just because their pricing model charges by the minute. If they give you faster compute, not only are their costs going up, their customers
Starting point is 00:11:50 are spending less money, they're consuming fewer minutes. And with how much growth there is for GitHub, because it's the de facto platform owner for code repositories today. Yeah, it just feels like there's a lot of perverse incentives for them to not try to solve the problem the right way. And secondly, internally, I think we view GitHub Actions almost as this low-level orchestration layer for workflows that operate on your repository.
Starting point is 00:12:25 And I think GitHub sort of views it the same way as well. If you think about the fact that GitHub Actions has been around for all this time, they still don't have a UI for you to track test results. They don't have a way for you to look at regressions in a particular unit test. You could have been running GitHub Actions for six years, and you would have no idea how a test suite has evolved
Starting point is 00:12:52 over the course of that time. So I think there's a lot of opportunity in the market to build this observability layer on top of GitHub Actions and treat it as a low-level workflow orchestration platform in some sense. Yeah. Yeah. Interesting.
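A minimal TypeScript sketch of the kind of bookkeeping such an observability layer needs in order to answer questions like how a test suite has evolved over time: fold per-run outcomes into per-test pass rates and last-failure timestamps. The input shape here is assumed purely for illustration; it is not Blacksmith's or GitHub's actual API.

```typescript
// Illustrative sketch of per-test history tracking on top of raw CI results.
// The TestOutcome shape is an assumption, not a real API.

interface TestOutcome {
  runId: string;
  testName: string;
  passed: boolean;
  timestamp: number; // epoch millis of the CI run
}

interface TestHistory {
  runs: number;
  passes: number;
  passRate: number;
  lastFailureAt?: number;
}

// Fold raw per-run outcomes into a per-test history.
function buildHistory(outcomes: TestOutcome[]): Map<string, TestHistory> {
  const byTest = new Map<string, TestHistory>();
  for (const o of outcomes) {
    const h = byTest.get(o.testName) ?? { runs: 0, passes: 0, passRate: 0 };
    h.runs += 1;
    if (o.passed) {
      h.passes += 1;
    } else {
      h.lastFailureAt = Math.max(h.lastFailureAt ?? 0, o.timestamp);
    }
    h.passRate = h.passes / h.runs;
    byTest.set(o.testName, h);
  }
  return byTest;
}

// A test that mostly passes but not always is a flake candidate; a test whose
// recent outcomes flipped from passing to failing is a regression.
const history = buildHistory([
  { runId: 'r1', testName: 'checkout renders', passed: true, timestamp: 1 },
  { runId: 'r2', testName: 'checkout renders', passed: false, timestamp: 2 },
  { runId: 'r3', testName: 'checkout renders', passed: true, timestamp: 3 },
]);
console.log(history.get('checkout renders')); // { runs: 3, passes: 2, passRate: ~0.67, lastFailureAt: 2 }
```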
Starting point is 00:13:13 Okay. You mentioned flaky tests a few different times. I imagine that's different from just the pure speed and resource requirements that you're also solving, but the test flakiness. I guess where do you see that flakiness come in and how can you help customers there? Is that an education thing or is there stuff you can do for more automatically without them having to change?
Starting point is 00:13:33 Like where does that flakiness come in? Yeah, I think that's a good question. So I feel like there's many ways you can kind of break that problem down. But let's start with how flakiness even creeps in and what I would even define as flakiness. So there's some test failures that are deterministic. You run it on a given SHA.
Starting point is 00:14:01 Someone landed a bad commit without maybe waiting for CI. And maybe not all of your CI jobs are required for the pull request to merge. So someone hastily just merged a pull request that had a failing unit test. But the good thing about it is that it always fails. Someone can spend 30 minutes on it and get it fixed up and unblock everyone else. Now the worst kinds of failures are these non-deterministic test failures that let's say only fail 1% of the time or 5% of the time. Cockroach actually had a pretty big problem with this class of failures,
Starting point is 00:14:46 where some of our failures would only happen once you stressed a unit test for an hour, out of like 100,000 runs. And a lot of times, that was just the test being poorly written, not the logic being tested. The logic being tested was always correct, but the test maybe had some kind of raciness across the many things it was trying to synchronize. But sometimes it's actually the logic being tested
Starting point is 00:15:17 is non-deterministically wrong. So when you have a flaky test that sort of passes and fails on the same SHA, on the same piece of code, it's actually a hard problem to determine whether that's a high-quality failure that points to a real problem with the system, or just a test that has some sort of like a synchronization issue. Now, with that being said, with our customers, the most common thing we see is customers that have a large test suite of these end-to-end browser-based tests, stuff like Cypress and Playwright and Puppeteer. With a lot of these tests, they're running it by running an external instance of their application
Starting point is 00:16:09 that sometimes talks to a local database, sometimes it may even talk to an ephemeral Supabase instance spun up just for that test. So it has a lot of these moving parts. It may have an external network link involved somewhere, which can let flakes creep in. But these kinds of tests are extremely flaky. And we've yet to come across customers where they have a large test suite of Cypress and Playwright tests that's not flaky.
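A small TypeScript sketch of the implicit-synchronization flake being described, written against Playwright's test runner. The URL and selectors are hypothetical, and this is an illustration rather than an example from any Blacksmith customer.

```typescript
import { test, expect } from '@playwright/test';

test('cart total updates after adding an item', async ({ page }) => {
  await page.goto('https://example.com/cart'); // hypothetical app under test
  await page.getByRole('button', { name: 'Add item' }).click();

  // Flaky pattern: read the DOM immediately and assert on the snapshot.
  // If the total is updated by an async request, this races the re-render
  // and only fails on slow runs.
  //
  //   const total = await page.locator('#cart-total').textContent();
  //   expect(total).toBe('$10.00');

  // More stable pattern: web-first assertions retry until they pass or time
  // out, so the test waits for the UI to settle instead of depending on
  // implicit timing that the harness never guaranteed.
  await expect(page.locator('#cart-total')).toHaveText('$10.00');
});
```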
Starting point is 00:16:42 And almost always, these are the former type of flakiness, where the unit test itself is... It depends implicitly on some kind of synchronization that isn't guaranteed by the harness, and that's why it's like flaking. Yep. And are there things that you can do for that, or is it education?
Starting point is 00:17:05 Or is it just like, hey, that's the nature of Cypress and Playwright and browsers and all that? So there's a number of things we can do for that. And there's a number of things I think we're going to have to do for our customers going forward with the rise of a lot of these AI codegen tools. Because the good thing about these tools is that, for example, Cursor, Windsurf, Claude Code, they're extremely good at writing unit tests.
Starting point is 00:17:37 They are extremely competent at writing Playwright and Cypress tests as well. However, a lot of teams do not have protections in place that prevent them from merging flaky tests in to begin with. So one of the big things that we're working on is how do we help teams selectively stress only the new unit tests that they're adding to their suite so that when we detect that you've introduced five new unit tests for this new feature that you've been working on, can we stress them for you by automatically detecting that, running them like 1,000 times before they're even
Starting point is 00:18:17 allowed to merge in so that you're at least preventing the increase of new unit tests being flaky over time, new flaky tests creeping in. That's one thing. I think another big thing is just helping teams identify flaky tests. And the reason this is actually even more critical, in my opinion, than the first thing is because the cost of a flaky test isn't just the amount of time it takes to
Starting point is 00:18:54 stabilize it. It's the impact it has on everyone else from your team that's trying to merge their PRs that is confused whether their change broke this unit test that's seemingly unrelated. Throughout my career, and I'm sure you've experienced this yourself, you push up a pull request. It has five unit test failures or five test failures of various types, out of which maybe three or four seem completely unrelated to your change. And you're left wondering, you know, was it actually my change or is it just
Starting point is 00:19:31 like a flaky test? We want to surface that information as readily as possible. So we want to help customers like score or automatically rank the relevance of a test failure to their particular code commit and help customers identify whether a given flaky test is even worth digging into or should you be digging into these two other failures that are very likely related to your change? Yeah, yeah. And in terms of value prop of Blacksmith,
Starting point is 00:20:01 do you think of it as sort of equal weights of like, hey, these fast on-demand resources to go spin up a bunch of tests very quickly? And also these higher level things like visibility into long-term test history and stuff like that? Or do you think one is more than the other? Or how do you think about that value prop? Yeah, that's a great question.
Starting point is 00:20:21 It's one thing that we think about a lot. At the moment, I think our value prop is still quite skewed towards just the compute. The fact that we let you spin up 3,000 vCPUs in less than a couple seconds. We kind of control this entire pool of compute, and we can handle bursts really well. And that's still a big part of the value prop.
Starting point is 00:20:49 The hardware stack that's optimized for CI is still a big part of the value prop. However, over time, our observability story is improving and that's becoming more and more of a reason we're starting to see customers try to adopt us. Our internal goal is to get the observability piece to a point where customers would use us purely for that, even if the compute fabric wasn't as good.
Starting point is 00:21:25 And it could be like what you're saying, where there's the easy win in visibility of like, hey, someone could try your new, your compute runners and just be like, oh wow, this is five times faster than what I was currently running. It gets you in the door. But then people don't move away because they love all that historical stuff. And you're like, just being able to see that,
Starting point is 00:21:41 they're like, I'm never gonna move to a different runner. Or if people go to a different company, they're like, hey, we need to have this visibility that I don't have here now. So just falling in love with that feature over time, even if it's not as visible right away, requires some history to get built up before they can see that value. Yeah.
Starting point is 00:21:59 As any vendor will try to do, you want to leverage the data you have with your existing customers to make the experience better for them over time, make the platform sticky for them. And then that's definitely one of the ways we're thinking about it. And we're realizing that as we get deeper into it, a lot of these things lay the foundations for more higher level things that we can unlock.
Starting point is 00:22:27 This scoring of flaky jobs or flaky failure modes and helping customers understand whether their PR caused a particular failure or not was something that just came up in the last four months when we built a global log ingestion system for customers. We started realizing that, oh, we can collect failure signatures across our CI pipelines. And based on the frequency of any given failure signature, we can help customers understand
Starting point is 00:23:01 whether that's a flaky failure or if it's something unique to their pull request. Yeah, for sure. One thing on Cockroach, real quick, I imagine CI for something like Cockroach is very different from CI for some of your customers, like Clerk or Vee. Things that are more web-type applications.
Starting point is 00:23:20 Is that true? And are you more focused on the latter? I imagine the database one is so specialized, but maybe I'm off on that. I guess how does that work? Yeah, that's a great question. I'd say most of our customers are primarily web applications. I think about 80 percent of our CI workloads are TypeScript, just running unit tests with Vitest, that sort of thing.
Starting point is 00:23:46 However, the pain point felt by a lot of these systems companies is often a lot greater than a lot of companies like Clerk, for example. Just because a lot of these companies typically have large Rust code bases, often pulling in a lot of dependencies. Rust is notoriously slow at compilation. It's one of the main things people
Starting point is 00:24:21 dislike about the language. And LLVM-based languages also end up being surprisingly IO-heavy during compilation, which means that if you're trying to self-host CI for those repositories on the hyperscalers, you are really bit by the EBS volumes and local SSDs trade-off that they force you to make. So we find that when we outbound a database company, or when they come to us, they actually have a much bigger problem on average.
Starting point is 00:25:08 Yeah, and a lot of times we can help them cut their CI spend and CI runtimes down by over 60%. Yep, yep. So you mentioned like 80% of your stuff is TypeScript and Jest tests, all that. Do you have a sense of what the backend versus frontend split is for those?
Starting point is 00:25:28 that were founded after 2019 or whatever, frontend is all React. It's all like TypeScript React and anything else is like an anomaly. And on the backend, I'd say the, I don't often know the exact percentages, but TypeScript is the clear winner. Golang is the second biggest player.
Starting point is 00:25:54 So I'd say it's more like a... And sorry, I don't know if it was unclear. I meant, of the tests that are run, how many are more pure backend tests versus running Cypress or Playwright or a browser-based test? I'd say it's closer to a 50-50 split. It's funny because the state of front-end testing is so poor. Because writing component-level tests just does not work
Starting point is 00:26:24 after a certain size outside of some very critical, complex components that your code base might have. So if you want to have any coverage of your front end after a point, almost the only option you have is something like a Cypress or Playwright test suite. Yeah, so I'd say within that, it's fairly 50-50. Yep. Yeah. Okay. One more Cockroach question. I know that Cockroach is kind of specialized. Do you use
Starting point is 00:26:50 cockroach at Blacksmith, like given that you work there? Or are you like, hey, you know, it's cool, and I like it, but we didn't really need it for our specific needs? No, not at the moment. Although we did have one use case for it where we could have used it. And you might still kind of move to it, but we decided to start off on like a simpler Postgres based setup. And then the kind of nice thing about all these Postgres compatible databases is the
Starting point is 00:27:15 on-ramp. We can migrate to them when the pain point kind of becomes evident. Yeah. Yeah. Very cool. Okay. Let's talk a little bit about the force bundling, the EBS and EC2 stuff. This is interesting because we also
Starting point is 00:27:28 had Sam from PlanetScale on talking about their new metal offering. And he had a similar complaint as you, but a different approach to it, right? Where he's like, hey, EBS is super expensive, but for their workloads, durability is still so important and they still need that. I guess, like, they decided running the instance with the locally attached NVMe
Starting point is 00:27:50 makes sense for them, actually is a lot cheaper than using EBS. But for your needs and requirements, it's like that's still just way too expensive to do that, is what you're saying, given your sort of requirements in CI. Yeah, so there's a number of threads we can dive into there. I think the big part of our thesis behind Blacksmith was that CI as a class of workloads
Starting point is 00:28:16 does not need to run in a customer's cloud account. So for example, in your conversation with Sam, he mentioned how, if you're building a database, if you're a database vendor and you're not on one of the hyperscalers, it's sort of an immediate deal breaker because your database needs to be as close as possible to your production application.
Starting point is 00:28:40 It needs to be as close as possible to your production blob store bucket because of this indirect dependency where your application also needs to be close to that. It needs to be close to where you're running your Kafka, which is likely also in AWS. And the moment you run anything outside of that zone, you're hit with extremely expensive networking costs. And you're also just adding unnecessary latency
Starting point is 00:29:12 to every user interaction, every back-end query. However, that is not the case for CI workloads. In fact, for most companies, your CI is already running outside of your cloud account. Your code already lives in GitHub, which is not in your... No one really ever used AWS's Git server implementation. I forget what it was called. Code something, yeah. So we realized that CI was a class of workloads that you could expatriate out of the hyperscalers
Starting point is 00:29:48 and customers because it also doesn't deal with their end customer data. It's just code. The security expectations for that are not as BYOC-centric, if that makes sense. If I'm using a database and I'm trusting a vendor with my customer's data, I do want that to be in my cloud account as much as possible. However, if my code is already not in my sovereign jurisdiction, I'm okay with my CI also being out of it. And because of all of this and the fact that the actual hardware itself is just so much cheaper than what
Starting point is 00:30:34 you can get on the hyperscalers, it just made sense for us to start the business off with that kind of understanding and the fact that our unit economics improve as we get more customers. So, yeah, I think it's like sort of a confluence of all of these factors. We could consider in the future like a BYOC type offering, once our observability piece becomes valuable enough that that's something customers care about. But at the moment, this trade-off just makes a lot of sense, both for us and for our customers.
Starting point is 00:31:11 Yep. And then so where are your instances running? So we have one data center in Germany. We work with another company in Phoenix in the US. We're potentially working on a third region at some point later this year. And that'll likely be in US East, Virginia, that area. The other interesting thing about CI workloads
Starting point is 00:31:40 is that for most jobs, they're pretty location agnostic. They kind of just need to be close to your caches. So if we control our customers' cache artifacts, we kind of have a lot of control over where we can place their entire org. And we also have a bunch of control over where we can move them. However, one exception to that is jobs that perform Docker builds
Starting point is 00:32:08 and push Docker images to container registries. Because typically, your container registry lives in your AWS account in something like ECR, which is homed in some region. So at the moment, what we'll do is if a customer has Docker builds running that push to a container registry in the US and US West, we'll home them in the Phoenix data center. Okay. Yeah, I was gonna say, I saw that you all started in EU Central. And I guess that's true. Like, there's not a lot of, you know, going back and forth other than that Docker push. I guess, was EU Central cheaper than hosting in the US?
Starting point is 00:32:52 Yeah, so we worked with a vendor called Hetzner. I think they're sort of fairly popular for just leasing out bare metal hardware. They are fairly reliable. They let you attach high throughput network links to all of your instances, also for relatively cheap. So that was the reason we started in EU Central. Hetzner is only in the EU,
Starting point is 00:33:18 at least their bare metal offering is only in the EU. And then over time, as we hit more scale, we had kind of the leverage to start working with slightly smaller data center colos in the US. And we see us going lower down that pipeline over time as our scale improves. At some point, we'll start renting rack space, but racking it up with our own hardware.
Starting point is 00:33:47 But all of these are problems for the future, where we explicitly want to improve our unit economics even further. Yeah, for sure. And so you mentioned blob storage being essential for the cache artifacts. Are you using something like S3 or Google Blob, or are you using, like, does Hetzner have something,
Starting point is 00:34:06 or are you running Minio, or I guess what are you running for Blob storage? Yeah, so we offer two types of caching for CI jobs for our customers. One is backed by an S3-compatible Blob store. We run Minio, which is this open source S3-compatible Blob store. And the second kind of caching we offer
Starting point is 00:34:28 is something we call sticky disks. And sticky disks are effectively our implementation of a network attached block device. It lets us offer out of the box Docker layer caching for all Docker builds running on Blacksmith. So for instance, if your CI job builds a Docker image, you can change one line of code to tell Blacksmith that this job builds a Docker image
Starting point is 00:35:00 and we will transparently mount in this network block device that contains all of your Docker layers into your runner. And that will asynchronously get committed after your build completes. And yeah, it happens completely transparently to you. In the most optimized scenarios, it can be up to 20x faster. However, we do, unfortunately, find that most customers do not have the most optimized Dockerfiles.
Starting point is 00:35:34 So it ends up only being about 60, 70% faster because that's the percentage of layers that are cached in any build. But going back to your question, for every cell that we operate, every kind of data center or colo that we operate, we run a fleet of machines that run our Blob Store. We run a distributed storage cluster that runs Ceph that offers sticky disks. And we have our fleet of VM agents that orchestrate virtual machines. And these are all in the same data center. Once they grow past the size of the number of machines we can have in the same DC,
Starting point is 00:36:16 we kind of have to create a new cell. And this is to ensure that all the VM agents effectively have like a local link to the cache clusters, because that's a big part of our value prop. Gotcha. And how is running MinIO? Is that pretty straightforward and easy? And is it significantly cheaper than S3?
Starting point is 00:36:43 The way we run it is not particularly cheap because we also run MinIO over NVMe drives, whereas S3 obviously does a lot of intelligent tiering. Operationally, MinIO has been extremely smooth for us to run. We did initially have some issues when we were starting out. But once you get through this initial operational hurdle, it's been relatively smooth for us.
Starting point is 00:37:10 Their story around expanding the size of the cluster is also less than ideal. They almost necessitate that you have to take a minute of downtime to expand the cluster. But that ends up being fine in a lot of cases just because of how rare these are, and the downtime window is fairly short. Yep, yep.
Starting point is 00:37:30 And even if that has downtime, does that just mean, hey, this build isn't using cache and can still run the build. It's just going to be slower. Exactly. And we can time it based on the data we have around when the usage is the lowest, we can time the window in a way where like no one even like notices.
Starting point is 00:37:47 Yeah, nice. Did you all have a lot of experience doing like orchestration on, you know, Hetzner and things like that versus the hyperscalers or was this new to you or what was that like? It was relatively new to us. However, we did have, so one of my other co-founders
Starting point is 00:38:03 and a lot of our team is from Cockroach. One of our founding engineers is also from Cockroach. We did have a lot of experience fighting fires with Cockroach customers that were running Cockroach on-prem. Big banks and customers of that nature. So we were fairly comfortable with running a big service on our own hardware. We dealt with a lot of typical network congestion type problems over time. So it felt like it was tractable.
Starting point is 00:38:39 Of course, when the rubber meets the road, you encounter new problems. And there's things that you have not dealt with before. But we've actually felt like every time we want to run something that's large scale on something like AWS, at least initially, we're dealing with problems that we'd rather not deal with. I'd rather not have a full service outage because my auto scaling group
Starting point is 00:39:11 doesn't have the quota limits for that specific instance type in AWS. I have to contact AWS support to bump up the quota limits. I would rather deal with problems that we can, you know, as a team, just debug on our own and fix and mitigate on our own. Yeah, interesting. Yeah. One complaint I hear a lot,
Starting point is 00:39:34 and you mentioned a little bit earlier, is just like the network IO cost in like the hyperscalers. Does Hetzner have anything comparable to that? Do they charge or limit you on bandwidth or what does that look like? So they have no limits on bandwidth. They do charge you for bandwidth, but I'll have to double check myself.
Starting point is 00:39:53 But I think it's, so for hyperscalers, the egress costs, intrazone egress costs, are $90 per terabyte, so 9 cents a gig. However, yeah, I believe that's right. That might be inter-zone or inter-region. Could be wrong about that. But on Hetzner, it's $1 per terabyte. And that's kind of the typical network bandwidth costs
Starting point is 00:40:26 you'll pay with a larger ISP. So a lot of these call over. Interesting. So the cost is very comparable to the hyperscalers. Interesting. They're sort of just passing that down to the customers. And that's a huge difference. And it's low enough that we effectively just like
Starting point is 00:40:47 don't have to think about network costs at all. Yeah, okay, okay, cool. Okay, let's move on to multi-tenancy a little bit. And I would encourage everyone to read the blog post number one. But the cool thing you all were explaining there is, hey, CI workloads are extremely spiky where you might have a customer that needs like 500 vCPUs very quickly, and maybe five times out of five
Starting point is 00:41:11 people push code all at once. But if you aggregate a bunch of customers together, it actually smooths out quite a bit to where the individual spikes don't need as much there. And I know AWS talks a lot about peak to average ratio, where peak is like the max you need at any given time, which is basically how much you need to have provisioned or else have queues or outages or something. But average is what your margins are measured against. I guess, is this something when you were coming into Blacksmith, you're like, hey, this was a key selling point? Is this something you figured
Starting point is 00:41:50 When you were conceiving the idea of Blacksmith, was that a big point you were focused on? Yeah. We knew that our unit economics would be substantially worse in the beginning when we barely had any customers. And we still had to have a big fleet to support even a small number of customers because workloads, like you said, are bursty within a customer. However, we knew that the economics would improve. We actually ran this Monte Carlo-style simulation early on during when we were in YC to just run some
Starting point is 00:42:26 numbers on what happens when we're running 100 jobs a minute or 1,000 jobs a minute and then just project that out. We knew that if we hit a large enough scale, our margins would be substantially better. We also knew that there would be a path for us to go down the hierarchy in some sense. So we could start off with leasing bare metal machines from a provider like Hetzner, but over time, rack our own machines and go down that route. It seems like that's a trend that is also catching on in general in the industry. I know Railway, which is like an app hosting platform,
Starting point is 00:43:08 they kind of did a similar migration off GCP onto their own bare metal hardware. And I think the quality of service has actually improved as a result of that. Yeah, you're actually the second person I've talked to in two days that's using Hetzner for some stuff too. So yeah, you're really starting to see a little bit of that trend, especially if you have a unique
Starting point is 00:43:28 workload or like understand the trade-offs well and like you're saying aren't scared of getting into the orchestration stuff and realizing, hey, maybe it's not as bad as you think it is. So I know like one thing is, you know, CI load is like somewhat correlated, right? Because it's going to be during people's work days for the most part. I guess, is there anything you all are doing to try and help smooth those numbers or try and make it so people are moving jobs to off hours, either with pricing or even with like customers you're targeting in different regions of the world? Or you just, you know, you'll figure that out later.
Starting point is 00:44:03 And right now you're sort of in growth mode as much as possible. We're sort of in growth mode as much as possible. However, we're realizing that the bursty nature of CI workloads is actually getting worse with a lot of these like background agent companies kind of doing a good job and the model is just getting so much better. We've had customers, we actually have one customer where the way they operate for a lot of
Starting point is 00:44:34 product-facing features that have a well-defined boundary, they'll file a bunch of Linear tickets, they'll create a Linear project, they'll file 11 tickets, they'll furnish those tickets with as much detail as possible, and then they'll just kick off 12 background agents all in parallel, which will, within the span of the next five, 10 minutes, kick off 1,000 CI jobs, because each pull request may kick off 100 CI jobs. So it's becoming more bursty, which we're sort of handling well. And that's another aspect of what will make self-hosting CI even harder in the future.
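A toy version of the Monte Carlo-style exercise mentioned earlier in the conversation: simulate bursty per-customer demand in TypeScript and compare the peak-to-average ratio of a single customer against an aggregated fleet. The distributions and numbers are invented purely to illustrate the smoothing effect, not Blacksmith's actual model.

```typescript
// Toy Monte Carlo illustration of peak-to-average smoothing across tenants.
// Purely illustrative; the burst probability and sizes are made up.

function simulateCustomer(minutes: number): number[] {
  // Mostly idle, with occasional bursts (e.g. a stack of agent-opened PRs).
  const demand: number[] = [];
  for (let t = 0; t < minutes; t++) {
    const bursting = Math.random() < 0.02; // ~2% of minutes are bursts
    demand.push(bursting ? 500 + Math.random() * 500 : Math.random() * 20); // vCPUs
  }
  return demand;
}

function peakToAverage(series: number[]): number {
  const peak = Math.max(...series);
  const avg = series.reduce((a, b) => a + b, 0) / series.length;
  return peak / avg;
}

function aggregate(customers: number[][]): number[] {
  return customers[0].map((_, t) => customers.reduce((sum, c) => sum + c[t], 0));
}

const MINUTES = 8 * 60; // one working day
const single = simulateCustomer(MINUTES);
const fleet = aggregate(Array.from({ length: 200 }, () => simulateCustomer(MINUTES)));

// The aggregate peak-to-average ratio is far lower than any single customer's,
// which is what lets a shared fleet be provisioned much closer to average demand.
console.log('single customer peak/avg:', peakToAverage(single).toFixed(1));
console.log('200-customer fleet peak/avg:', peakToAverage(fleet).toFixed(1));
```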
Starting point is 00:45:17 Because typically, the way companies will self-host CI is they'll run like an EKS cluster, or they'll run GitHub's ARC controller to manage the runner pods over that EKS cluster. And maybe they'll use something like Karpenter to handle node provisioning and scale the size of the cluster itself as it saturates. We've had a number of customers where that just becomes such a big operational headache for something like CI
Starting point is 00:45:51 that's not a core differentiator. It's not a core competency. They'd just like rather not spend time on it. And as these companies adopt more of these background agents, we see that shift happening even faster. Yeah, yeah. How, I guess, how does Blacksmith charge currently? agents, we see that shift happening even faster. Yeah. I guess, how does Blacksmith charge currently? What's the unit you charge?
Starting point is 00:46:10 So we charge for two things, both usage-based. We charge by the minute on the runners for the compute. And we charge for cache storage, the sticky disks that you use. Each customer gets some free quota for the blob store cache. They also get some free minutes on the compute. For larger customers, we'll sort of just negotiate bespoke pricing. That's sort of another big advantage we feel like we have.
Starting point is 00:46:46 For larger customers, we actually have a lot of leverage in terms of pricing our offering however we want and not based on just our costs. So a lot of vendors, a lot of folks I think on your podcast have talked about, they kind of have to begrudgingly do cost plus pricing. Just because their costs are so linear to their, like their gross margins are just fixed. And 30% of their revenues just go to AWS, some cases even more.
Starting point is 00:47:18 So they don't have as much leeway in pricing their products the way they would want. We don't necessarily have that. Sure, our costs are still like a big percentage of our revenue. It's still a double digit percentage, but it's not, you know, like our gross margins are not 40%.
Starting point is 00:47:39 Being over like an 80% gross margin business lets us price things in a lot of different ways that work well for customers. Yep. OK, that's super interesting. And then can I have as many jobs running at a time as I want? Or is there some sort of queue or limit? Or what does that look like?
Starting point is 00:47:58 Yes. But we have alerting on our end: when a customer runs more than 1,000 vCPUs worth of jobs, we'll get alerted and someone from our team will at least look at it to make sure that it's not someone that's trying to DDoS the system. Early on, we had instances where people would try to mine crypto in GitHub Actions jobs. Every time you have a compute service, someone's going to try to mine crypto in it.
Starting point is 00:48:29 Yeah, for sure. Exactly. It's kind of silly because these days, crypto mining has become so efficient that unless you're running it on FPGAs or GPUs, it's completely pointless to run it on CPUs. Yeah, so we'll have alerting on our end to make sure someone isn't attacking the system,
Starting point is 00:48:48 but there's no real concurrency limits. Yeah, interesting. Yeah, I was just trying to think about that, to try and smooth out that load, especially with that agent stuff, if there's some way you can incent customers to kick that off overnight, where it's like, hey, you have those background, like there's 12 hours overnight
Starting point is 00:49:06 where they can run it at any time and your load is super low and can spin those up. But it's kind of like a tricky thing. It's like, you don't want to give, I don't know how you incent that. We can easily launch sort of like a low priority SKU that like lets the customer kind of defer the scheduling of the job to us.
Starting point is 00:49:25 And we can offer a much lower pricing on that kind of thing. It's something we're open to doing. However, we haven't had that much demand for this sort of thing. And that was surprising to me as well. I think typically people just want their blocking CI jobs to run as soon as possible. For stuff that is not as critical, they're already running nightly in a cron on GitHub Actions
Starting point is 00:49:56 itself. So the demand for this scenario where they want the vendor to do the scheduling in a way that's disconnected from the orchestration platform, which is GitHub Actions in this case, we haven't had a ton of instances of that yet. Yeah, for sure. On that sort of AI note too, do you feel like the explosion of coding agents and all that is like a tailwind for you guys?
Starting point is 00:50:25 Cause there's just so much more demand for CI now. Or is it a headwind cause it's like, oh man, it's that much spikier that we have to deal with and it causes that problem or how is that affecting y'all? So especially for smaller teams or younger teams that are moving fast or adopting a lot of these tools faster than larger enterprises. We're seeing that our growth rate for companies, even normalizing for sort of employee count,
Starting point is 00:50:56 is like over 60% quarter over quarter, meaning that for 10 engineers, they're running 60% more CI, quarter over quarter, with the same amount of people. Because we started the company during this whole coding boom, it's hard for me to juxtapose that with what that would have been before. But that seems pretty high. And almost all of our customers are, we see a pretty big spike or surge over the last few months of things like the Cursor Agent and things like Devin and Claude and all of that in our data
Starting point is 00:51:41 just commits being pushed by agents more and more. Yeah. Are you able to see popularity of the different, especially terminal-based agents? Can you share any of that stuff? Yeah. We plan on doing a series of blogs on that kind of stuff. Just because it's interesting, typically what happens is whenever there's a big launch, so when Devin launched, there was a huge spike to the point where it was something like 0.3 or 0.5% of all of our CI. And then it goes down and flatlines,
Starting point is 00:52:17 but then maybe it improves over time again. And every time there's a big new model launch, with, for example, the Claude 4 series of models, they resulted in almost like a step function increase in how much more people were using background coding agents in general. And I expect to continue to see that with newer model launches. Yep. Yep.
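A rough TypeScript sketch of how one might estimate the share of agent-authored commits per week from commit metadata. The markers (Co-Authored-By trailers, bot-style author names) are assumptions for illustration, not a definitive list of what any given agent actually emits, and this is not how Blacksmith necessarily does it.

```typescript
// Rough sketch: classify commits as agent-authored using assumed metadata
// markers, then compute the agent share per week.

interface Commit {
  author: string;
  message: string;
  week: string; // e.g. '2025-W20'
}

// Hypothetical markers; real agents vary in how they identify themselves.
const AGENT_MARKERS = [/co-authored-by:.*claude/i, /devin/i, /cursor agent/i];

function isAgentCommit(c: Commit): boolean {
  return AGENT_MARKERS.some((m) => m.test(c.message) || m.test(c.author));
}

// Share of agent-authored commits per week.
function agentShareByWeek(commits: Commit[]): Map<string, number> {
  const byWeek = new Map<string, Commit[]>();
  for (const c of commits) {
    const list = byWeek.get(c.week) ?? [];
    list.push(c);
    byWeek.set(c.week, list);
  }
  const share = new Map<string, number>();
  for (const [week, list] of byWeek) {
    share.set(week, list.filter(isAgentCommit).length / list.length);
  }
  return share;
}
```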
Starting point is 00:52:42 Super cool. Okay. On that same note, similarish, you all just raised a $3.5 million seed round, so congrats on that. VC environment, super interesting right now, because there's so much AI energy. And then you all aren't an AI app builder or something
Starting point is 00:52:58 like that. Was it difficult to break through the noise, or did it help having this tailwind of, hey, these coding agents are probably the most popular AI app right now, and they're going to need a lot of CI? I noise, or did it help having this tailwind of, hey, these coding agents are probably the most popular AI app right now, and they're going to need a lot of CI? I guess, how did that go? It was fairly difficult with some VCs.
Starting point is 00:53:13 It was fairly easy with others. So it was sort of this very interesting, but kind of polarizing experience in some ways where some VCs wouldn't even want to talk to a company raising a round that's not AI, whereas the others kind of totally get it. They sort of understand the second order effects of all this codegen, one of which is that CI will become a bottleneck. Yeah, so it was very much based on like who we were talking to and sort of how technical they were in some sense. In general, yeah, I would not be,
Starting point is 00:53:55 I would not want to start a non-AI company in today's environment in general, just because all of the, you know, all of the air is like being sucked out of the room. And for good reason, a lot of these companies are growing at rates that have basically never been seen before. Yeah, yep, that's pretty wild. You all did YC as well. I guess, tell me about your YC experience. Did all three of you go to the Bay Area?
Starting point is 00:54:19 Yeah, so we lived in San Francisco for four months. YC was quite a great experience for us, honestly, just because they're quite hands-off in some ways, but they still sort of keep you on your toes and they kind of keep you aligned on solving important problems, problems that are important for the business, and not kind of go down rabbit holes that are not worth pursuing at that early stage of the company, which was quite helpful. You get a bunch of early users, early usage from other companies in the YC batch, which is obviously quite helpful as well. I think if you were starting a company that had immediate demand with very early stage startups,
Starting point is 00:55:06 like a SOC 2 compliance type company, then YC is the ultimate jumping off point, just because you can end the batch with two dozen, five-figure customers. And we weren't like that because a lot of these or of YC startups at that stage don't have that much CI. They're barely writing any unit tests at that point, but it's still valuable usage. Yeah. It gives you a big jump in the fundraising right after the batch, which was also super helpful. Yeah, for sure. Did your idea change much in YC,
Starting point is 00:55:44 or did you go in with pretty much this idea and come out with this idea? Yeah, we were one of the... I think we were one of the only companies in our cohort, at least, where we came in with this exact thing and we are still kind of building this exact thing, where even the thing we sort of pitched in our YC interview was this like compute fabric for CI
Starting point is 00:56:07 with all this like built-in observability. And in some sense, we still haven't like fully achieved that but we're also still like working on the same vision. And I'm glad that's the case just because the constant shifting of focus can be really detrimental at this stage of the company. Yeah, for sure. Okay, now you're in New York. Is the whole team in New York? Yeah. All of engineering is in New York.
Starting point is 00:56:35 All of our growth and sales is in San Francisco. We have two offices. We're going to likely grow the team in this fashion for at least another eight months to a year. But we're open to good candidates on either coast. If there's a great GTM hire that is only open to working out of New York, then we'd obviously make exceptions.
Starting point is 00:57:02 Gotcha. So you want in-office ideally sort of with those strategic hubs, but as long as they're in an office, you're up for it? Yeah. I think at this stage of the company, it's just so much easier to keep everyone aligned on what's happening in the company, the direction we want to swim in, that sort of thing, if everyone is in person. I find that unless your entire founding team has previously worked with each other before, and it's sort of like a bunch of team members from LinkedIn leaving to do Confluent, that sort of thing. It's very hard to have everyone be engaged and build mutual personal relationships with each other
Starting point is 00:57:50 if you're remote. At some point, we're gonna have to be a bit more flexible, but it feels like the correct kind of trade-off at this stage. Yeah. Are all three of the founders more engineering heavy and associated with engineering or is one of them, one or two in SF doing more sales GTM type stuff?
Starting point is 00:58:10 Yeah. So JP, our CEO, he's in SF. He is solely focused on GTM and sales and kind of everything else that needs to be done. Myself and Aditya, who's our third co-founder, we're both focused on engineering. And that's been a good split. We get to kind of protect each other's times and let us kind of do what we're best at. And I find that even the location split is sort of helpful in that one office is sort of entirely thinking about sales and growth and content ideas and that sort of thing.
Starting point is 00:58:58 And one office is sort of entirely focused on product. At least at the moment, that seems really great for our focus. Cool. Yeah. And closing out like on an engineering level, I guess, what does the code base look like? Is it TypeScript like you're saying? Is it lower level stuff? Or what are you using for that? Yeah. So we have a lot of Golang. A lot of our infra services are Golang.
Starting point is 00:59:23 All of our front end is Next.js, TypeScript, React. Our control plane is actually written in Laravel in PHP. How interesting. What drove that decision? We sort of couldn't be happier with this like split in the stack just because, so we built a number of projects before, just personal projects in Laravel.
Starting point is 00:59:49 And it's just a very productive way to write application control planes, in my opinion. All of the problems that you typically want to solve in a control plane, background jobs, billing, caching in various parts of the stack, queuing various types of jobs in particular sequences and graphs. All of those are just solved problems in a framework like Ruby on Rails or Laravel.
Starting point is 01:00:22 Modern PHP, for all the flak it gets, is surprisingly good and pleasant to work with because it's typed now. The tooling around it is really great. The other thing that we didn't foresee when we were starting the project off in Laravel was how good a lot of these coding agents were going to be at generating PHP code, just because I don't think there's another language that has more training data out there. But at the same time, the Laravel framework itself kind of keeps the solution space a bit more constrained.
Starting point is 01:01:03 There's typically like one kind of canonical way of achieving something. So these agents will typically write code that looks like the way we would write it. Yeah, it's been incredibly productive. Wow. Yep. You're the second person that not only uses Hetzner, but has some Laravel and PHP in there too.
Starting point is 01:01:22 So second person in two days with that. What's your AI workflow then? Are you using like more interactive like Cursor? Are you using like Claude Code or Codex or more like terminal-based agents? Like what do you do? I'd say half of our team mostly uses Claude Code at this point. A lot of stuff is kind of kicked off with Claude Code and that gets you 90% of the way.
Starting point is 01:01:47 And then the final 10%, you sort of finalize with something like Cursor. But I'd say the other half of the team is like all Cursor. Yeah, what about you personally? What's your workflow? I like to like start things off with Claude Code and then kind of take it the final mile on Cursor. And I think that that final mile is getting shorter and shorter as these like tools improve. Yep, cool, cool. I still love Cursor so much. I haven't tried Claude Code, but
Starting point is 01:02:17 I'm seeing more people switch to it. I gotta get over to it. So yeah, it's shockingly good. They sort of dynamically switch between like the thinking mode and kind of the non-thinking mode. And yeah, it's just shockingly good. Yeah, yeah, that's fine. Well, this has been great. Like it's been a fun episode, like talking about all this stuff
Starting point is 01:02:38 and I appreciate you coming on and sharing all this stuff. If people wanna find out more about you, about Blacksmith, where should they go? So Blacksmith is at blacksmith.sh. If you're running GitHub Actions, we can likely help you move faster. And if you're spending a lot of money, and if that's a problem, we can likely reduce your CI spend by quite a bit as well. And we think a lot of teams will like the observability features that we already have and a lot of the stuff that we have in the pipeline.
Starting point is 01:03:11 Very cool. Now that you say that, if I'm switching to Blacksmith, is it like, hey, it's a two line change in my thing and everything else should work? Do I have to make some changes to my CI jobs? Or what does that look like? Yeah, so you install our GitHub app and then you point your CI workflow files to Blacksmith. So it's a one-line code change to run your jobs on Blacksmith.
Starting point is 01:03:35 We also have this like nifty migration wizard that will move your entire repository onto Blacksmith in three clicks. And that will apply all of the caching optimizations on top of just moving to our runners. Very cool. So, yeah, easy to try out and see if it's faster, if you like the observability stuff, and try that out. So, again, Aayush, thanks for coming on.
Starting point is 01:03:56 It's been great, and yeah, best of luck to you going forward. Thanks for having me.
