The Infra Pod - Turning Gaming PCs to Serverless CI for AI! (Chat with Aditya from Blacksmith)
Episode Date: August 25, 2025

Tim (Essence VC) and Ian (Keycard) sat down with Aditya, CEO of Blacksmith, to explore the inception and innovative approach of Blacksmith in the CI/CD space. Blacksmith uses gaming CPUs and NVMe SSDs to deliver high-performance, serverless CI compute. They discussed the challenges large companies face with CI systems, why GitHub Actions was chosen as the primary CI system to support, and how AI-driven code generation is pushing the need for faster and more efficient CI solutions. Aditya also highlighted the future potential of CI observability and other improvements Blacksmith is focusing on to maintain their edge in the market.

00:29 Founding Blacksmith: The Origin Story
01:16 Understanding Blacksmith's Serverless CI Compute
02:21 Challenges in CI/CD and Blacksmith's Solutions
05:14 Technical Deep Dive: Performance and Optimization
06:53 Why GitHub Actions?
08:08 Innovative Hardware Choices for CI
10:06 Scaling and Managing CI Workloads
19:14 Future of CI/CD and AI Integration
28:02 Spicy Futures: Predictions and Hot Takes
Transcript
Welcome to the InfraPod.
This is Tim from Essence, and let's go.
This is Ian, a lover of performant, efficient compute in my continuous integration and delivery pipelines.
I'm super excited today to be joined by the CEO of Blacksmith.
Aditya, please introduce yourself and tell us why in the world you decided to start Blacksmith
and also what in the world is Blacksmith.
Thank you, Tim, and Ian. Nice to meet you both.
And thanks for having me on the pod. I'm a pretty huge fan.
I have two co-founders, Ayush and Maru.
We were engineers before starting Blacksmith.
They used to work at Cockroach, and then Ayush worked at a startup called Superblocks.
I was an engineer at Fair, and we'd interned at a bunch of companies before, and every single company struggled with CI/CD. Especially at Fair, which had about 400 engineers, there was a pretty big platform team, and a big chunk of that platform team spent their time working on CI.
And I'm happy to go into the challenges of it, but at a high level, we saw teams do the same thing over and over again at every large company, and we felt there was a better way out.
And we started looking into that, thinking about CI from the ground up from first principles, and we landed at Blacksmith.
Now, what Blacksmith does is we offer really fast compute that is instantly provisioned for companies to run their CI workloads. So we're like a serverless CI compute provider.
Incredible. And what exactly does that mean, a serverless CI compute provider? What does it look like to use Blacksmith, and what do I get out of the box? How do I just pick up and use it today?
Yeah. So the distinction I want to call out is that we're a compute provider.
We're not a CI system. The only CI system we work with today is GitHub Actions. So developers can still continue to use GitHub Actions as the control plane, use the same YAML file, use the GitHub UI. All they'll have to do is install the GitHub app and, you know, replace one line of code in their GitHub Actions workflow file, pointing it to Blacksmith. And their CI workloads will run on our compute, and it runs much faster than GitHub-hosted runners or than if they're self-hosting on their own cloud account.
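As a rough sketch of what that one-line change typically looks like in a workflow file (the Blacksmith runner label below is illustrative, not taken from their docs, and the test command is hypothetical):

```yaml
jobs:
  test:
    # Before: a GitHub-hosted runner
    # runs-on: ubuntu-latest
    # After: point the job at Blacksmith's compute (label format is an assumption)
    runs-on: blacksmith-4vcpu-ubuntu-2204
    steps:
      - uses: actions/checkout@v4
      - run: npm test   # hypothetical test command
```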
Very cool. And taking a step back, what are some of these things you kept seeing people do that were causing problems? What were these pains that you kept feeling at all these different companies? I mean, I've spent a lot of time building CI/CD systems, or on top of CI/CD systems. What type of challenges are people running into that made you go, hey, someone's just got to get in there and do something about it?
Yeah. So the number one challenge that we saw at much larger companies was just about making sure CI runners or VMs were available when developers needed them.
This comes back to the nature of CI. CI workloads are peculiar. They're not like production workloads. They tend to be sparse and spiky.
So if you visualize the vCPU utilization for a company, you'll see that for a big chunk of the time it's actually zero, especially when developers are not pushing code. It's going to be zero for most of the day.
And depending on the company you're at, let's assume that every time a developer pushes code, they need to run 50 CI jobs. And let's say, collectively, they use 500 vCPUs. So every time a developer pushes code, you're using 500 vCPUs. It runs for five or ten minutes, then goes down to zero. And there are times when three or four developers are pushing code at the same time. So you might go from zero vCPUs being used all the way up to thousands, and then back down. And that's what I mean by it's super spiky. So consider this scenario.
Let's say you're self-hosting, which a lot of companies do, like what I saw at Fair. When a developer pushes code, you can go and ask AWS for 50 EC2 instances. And the drawback to doing that is you'll have to wait to get all these instances before the jobs even start. And if it's during peak business hours, that can take five to ten minutes for all your jobs to start. I'm not even talking about the duration of the CI job, I'm just talking about the jobs starting. And that's a pretty big drag on productivity. So what do you do to get around this? What most companies do is they maintain a warm pool of compute that's ready to go. They're typically using Kubernetes for this, they're using Karpenter. So they'll keep some amount of nodes warm. Let's say that covers 500 vCPUs. If one developer pushes code, things are good, it instantly gets picked up. But what happens when four developers are pushing code at the same time? You again have to autoscale, that needs to kick in, and it's not super quick. Now, one thing companies could do is over-provision to always absorb the load for four engineers, but then you're burning money. So there's this tradeoff between how much do you want to spend versus how much do you want to keep your developers waiting? And that was the
biggest problem. The second problem is performance. Now, GitHub hosted runners are running on
pretty old machines. These machines have much lower single core performance than the latest
machines, and the same for machines that are running on AWS. At Blacksmith, we run them on
consumer gaming CPUs, not server chips. And gaming CPUs, you know, because they were optimized for
gaming and making sure the game runs really fast actually has the highest single-core performance
compared to the same generation server counterparts. The other peculiar aspect of CI jobs is that
you're copying files around constantly. Think of what you're doing during code compilation. You're
constantly copying files around. And the hyperscalers most of the time tend to push you towards
using network-attached storage like EBS volumes. But for CI, you actually want the opposite. You
actually want locally attached NVME SSD so that it's super quick and you're not bottleneck by iops.
So we identified those bottlenecks and we started, he started Blacksman.
Now, the way we're getting around those problems is, A, we're using gaming CPUs with locally attached NVMe drives.
That helps with the performance problem.
Now, we have, you know, over 500 machines right now, and they're only running CI workloads.
So when we're only running one workload, things get a lot more predictable and smooth at scale.
And even if one customer wants 500 vCPUs, we can provision that instantly.
And at scale, this works out.
That's super cool.
I think there's so many questions I want to ask here.
So I want to try to divvy it up.
Let's start with: why GitHub Actions, maybe?
Because obviously for CI, people are running all kinds of CI systems. We've seen a lot of Bazel-based setups, a lot of other things. GitHub Actions is certainly probably the most common I've seen, especially for a startup: I don't want to install yet another new thing, GitHub is already there, Actions is already there. But maybe tell us the reasoning behind picking GitHub Actions as the starting point. Is it the ecosystem, or the performance just sucks, or just everything
combined? So it mainly had to do with the fact that GitHub Actions is the most popular
CI system today. Most people use GitHub and it works out of the box. Most companies starting out,
they don't really want to consider any other solution.
And we think that's where the market is consolidating.
If you look at, if you look at CSED at large, most companies are on Jenkins, like legacy
enterprises, but companies that were started five years ago are almost always on get-up actions.
And even companies like five years before that, so from 2015 to 2020, they're probably
on Circle CI and they're migrating to get-up actions in most cases.
So we were like, to start with, let's better.
on the actions market.
Okay, so I think that kind of confirms my suspicion, but I'm really intrigued about this gaming PC thing, by the way. I'm picturing like 500 Alienware machines running somewhere, you know, with a shiny logo and neon lights. I wouldn't actually assume that's the thing you're doing, even though it does make sense that it's optimized for the workload. Is this something that you just knew before going into Blacksmith? Or something you tested, like a bunch of different kinds of servers, and somehow the machine that's been running a bunch of games turned out much better? And how do you even host this damn thing? Do I have to actually go into server rooms and do it myself now? Because I don't think there's a gaming PC provider for you.
Yeah, I can go more into that. So the way we landed on this was
we were asking ourselves, you know, how can we make CI faster? And my co-founders, who worked at Cockroach, had this observation.
And it was pretty simple.
When they were working on Cockroach,
they could either build CockroachDB remotely on a server in GCP,
or they could do it on their gaming rig at home.
And they noticed that the gaming rig was substantially faster.
And it started with that, and we were like, why is that the case?
And we found out that, you know, there are two factors here,
single-core performance and locally attached NVMe SSDs.
And so we found that.
And when we were starting the company, we were like, okay, where can we get these machines?
And it turns out Hetzner was offering these machines because they hosted a lot of Call of Duty servers.
And now we also have another region in the U.S., where we work with a much larger provider. We have a contract with them, and they manage our machines: they rack them up and lease them to us.
But that's how it got started.
We were pretty scrappy.
And, you know, who would have thought that you could get Call of Duty servers and repurpose them for CI/CD?
Okay, very cool, first and foremost.
You know, you have this problem statement. So far, what we've talked about is really a large scheduling problem, and then a hardware optimization problem: let's make sure we're using hardware that's suitable for the job, and let's also make sure we schedule things so that people can use CI as much as they want and it doesn't break the bank, right? Which is broadly the problem you're talking about. What are other improvements or optimizations you're making under the hood, beyond just the raw, hey, look, we have better servers and better scheduling?
Yeah. So I think this is one of the pros of running
a single workload like CI. We can make a lot of optimizations tailored towards that.
Before I go into that, I want to say that when you're running production workloads, like a database, you care about a lot of durability guarantees that don't necessarily apply for CI. One good example of that is something like fsync, where you're flushing the page cache to disk. That's important when you're running a database, but it does not actually matter for CI workloads, because if your CI job fails, if your lint job or unit test fails, you can rerun it. And that's something that we disable on our end. And we've made a number of optimizations like that to just make CI run a lot faster.
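Blacksmith does this at the VM and filesystem layer, but as a minimal sketch of the same idea that anyone could try on a stock Ubuntu runner, you can wrap a test command with eatmydata, which turns fsync and friends into no-ops (the test command is hypothetical, and this is not Blacksmith's mechanism):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests with fsync disabled (sketch, not Blacksmith's mechanism)
        run: |
          sudo apt-get update && sudo apt-get install -y eatmydata
          eatmydata npm test   # hypothetical test command; fsync/fdatasync become no-ops
```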
And we've experimented with a number of different file systems and made a lot of modifications of our own, purely for performance.
So that's one example of what we're doing at the software layer.
But if you're only running one workload, you can also do workflow-specific optimizations.
Let me give you an example.
A lot of our customers do Docker builds.
Like that's at the heart of like how most companies like deploy.
And you probably notice this when building Docker images on your laptop.
When you do it for the first time, it's slow because it's building each layer.
But when you rebuild it the second time, and let's assume that your Dockerfile hasn't changed,
it's instant.
And the reason for that is the Docker layer cache is in your machine.
It doesn't have to rebuild all the layers from scratch.
But when people build images in CI, you're often doing it in a fresh VM.
The Docker layer cache is not present and it's slow.
And so we looked at that problem and we were like, hey, how do we solve this for our customers?
And right now what we're doing is we're mounting a Ceph block device with the org's Docker layer cache bind-mounted into the runner.
So it's just there.
There's no downloading the layer cache from somewhere, you know, which is another approach we see at companies today.
There's no doing any of that.
It's just there ready to go.
And it's almost like having your Docker layer cache persisted across CI runs.
And in a lot of cases, your Docker builds are near-instant. And this is extremely important when you're trying to get a hot fix out.
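As a sketch of what such a Docker build job can look like when the layer cache is handled by the platform rather than by explicit cache upload/download steps. The runner label is an assumption, the image name is hypothetical, and the build actions shown are the standard Docker ones (Blacksmith may ship its own drop-in equivalents):

```yaml
jobs:
  build:
    runs-on: blacksmith-4vcpu-ubuntu-2204   # illustrative Blacksmith label
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # No explicit cache save/restore steps: per the discussion above, the org's
      # Docker layer cache lives on a Ceph-backed "sticky disk" mounted into the runner.
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: false
          tags: myorg/myapp:${{ github.sha }}   # hypothetical image name
```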
And so these are Hetzner servers, right? I'm not super familiar with Hetzner, by the way, because I know they're more like a, I guess, boutique hardware provider with more selection. But I don't think they have the extensive S3 and all that kind of stuff with it, right? You basically have to do everything yourself. So how is it able to handle this sort of, I have a cache that I can bind-mount in? Most people use either EBS or some sort of variant in Amazon, right? Because it's readily available. It costs more, but you have more knobs to tune. How do you do that with Hetzner servers? Do you have, like, a separate
NetApp server running next door?
So we have two types of caches. One is a drop-in replacement for the GitHub Actions cache. These are cache artifacts, you know, you can think of your npm modules that you're downloading each time. So we run our own MinIO cluster, which is an S3-compatible object store, effectively as a replacement for S3.
And for the Docker layer caching primitive, which we call sticky disk, we're actually running a Ceph cluster ourselves.
So there are a lot of benefits to, you know, being on AWS.
But, of course, when you go bare metal, you're trading off a lot of that for, you know,
in our case, like performance for our end user, but that also means having to do a lot of
things ourselves.
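For the GitHub Actions cache replacement mentioned above, the appeal of an S3-compatible backend like MinIO is that the caching step keeps the exact same inputs; only where the tarball lands changes. A sketch, where the drop-in action name, cache path, and commands are purely hypothetical:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4            # stock action, backed by GitHub's cache service
        # uses: useblacksmith/cache@v5    # hypothetical drop-in backed by a MinIO cluster
        with:
          path: ~/.npm                    # hypothetical cache path
          key: npm-${{ hashFiles('package-lock.json') }}
          restore-keys: |
            npm-
      - run: npm ci && npm test           # hypothetical commands
```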
How much does locality matter in CI/CD workloads? Can you just throw these things anywhere and go for the cheapest power, the cheapest data center with the best hardware? Because oftentimes with production workloads, we're sitting here thinking, okay, how do I get this as close as possible to my customer? Can I move the data to the edge, how much of this can I cache at the edge, how much can I make this a near-zero response time, so they load the page and use their credit card before they realize?
Yeah, so the answer is somewhere in the middle. It matters for some jobs and it does not matter for others. Let me give you an example.
If you're running vanilla unit tests, it doesn't really matter. They can run anywhere in the world as long as, you know, it's power-efficient, like you said.
But it does matter for jobs like Docker builds. Let's say you build an image and you're pushing it to a container registry where your services are deployed. Say you're building an image in our EU region and you're trying to push it to US West: there is network latency there that could slow it down. And that was one of the reasons we decided to start a U.S. region, because we had a number of customers pushing to container registries in the U.S.
But it does not have to be super close. As long as it's in the same region or, you know, even just in the U.S., it works out. Okay. So it matters somewhat, depending on the job.
And do you have metadata about what the job is? How much can you infer? I think you're reusing the GitHub Actions YAML file format, which gives you the ability to parse this stuff out. How much can you infer from these formats about the actual job type, and are these things you can use to optimize scheduling? Help us understand: okay, so we're at the point where there are hardware benefits, scheduling benefits, cost benefits. Help us understand what you can do on top of that.
So I think it is possible, if we wanted to, to go and try to understand where a customer might be running things and auto-route jobs, but we're not doing anything too sophisticated right now.
We pin an organization to a region,
and they can email us, like, the default is the U.S.,
but if they want to run in the EU,
they can message us and we'll move all of their jobs to run in the EU.
It's a simple approach, but it's worked well so far.
Got it, got it.
But it's truly fascinating that because you had to choose your own hardware, you basically had to recreate Amazon to some degree for yourself, right? So you have storage, you have these sort of, you know, Alienware servers, I just keep picturing that in my head. What are the other things you've had to do yourself that are just not typically available outside the clouds? Because networking is actually not that easy, you know? IP addresses can exhaust really easily, you know, VIPs and stuff like that. What other stuff did you actually have to reinvent yourselves that turned out to be more work than you hoped?
We've had to tune our networking stack quite a bit. And networking is something that we're learning as we go. Like, for instance, the number of machines in a subnet: once we exceeded that, we had to figure out, and we're actually still figuring out, how do we solve those problems? I think networking is something that we're actively working on.
Yeah, because that's the thing I thought of: your IP addresses run out quick. And there's a lot of reuse, but you want to be efficient on costs, right? So you can't just keep taking them forever.
It's super fascinating with your customers, though, right? They just want fast and cheaper. Are you trying to always make sure both are satisfied for customers? Or do some folks actually tell you, you know, I just want cheap, I don't care about fast? But there are some customers whose workloads maybe inherently take hours or minutes or whatever, and they do want it faster and are actually willing to pay you more to get it faster. I've seen those situations before. And I wondered, how do you play that tradeoff here, or do you just not?
Yeah. So I think when we were starting out, this is a question that we wondered about ourselves: which value prop matters more here, and why are customers choosing us? I think we've learned that performance matters more than anything. Especially with developers, they care about using the fastest product. I think being cheaper than GitHub really helps us get through procurement and finance, because when you talk to someone from finance and you tell them that, hey, we can halve this bill, they're like, great, and the conversation ends there. But performance is what gets our customers excited. It's what keeps them happy, and that's why we keep getting more customers.
There are a lot of workloads that look like CI/CD. I mean, batch data pipelines are a great example. You know, a training run of a machine learning model is another good example. There are tons of these sort of non-real-time, batch-style, workflow-based use cases. Tell us the vision here. You know, you started in CI/CD. Do you want to go further, or do you think there's just so much opportunity there that CI will keep you busy for a long time? What do you think the broader opportunity is for a company like Blacksmith to go after? I mean, it makes sense why you start here. It's an incredible starting point.
Yeah.
I think for now we're laser-focused on CI.
And we think there's a lot there.
And we think there's a lot more that we can do for our customers.
I think compute was a really good wedge for us, for customers to start using us, but there's a lot more value that we can add
when it comes to CI observability. And that's something we're working on right now. Especially right now, with people writing a lot more code, they're writing a lot more tests and they're running these pipelines a lot more. Making sure their CI pipeline is healthy, is not failing, and is not riddled with flaky tests matters a lot. And because our customers are running their workloads on our VMs, we can offer a lot of this observability out of the box, without them having to configure anything. That's the key. One example of this: in GitHub Actions, you can go and search your CI job logs, but only for a single job. You cannot do a global search across all of your jobs historically. I'm sure you've seen the scenario where you see an error and you're wondering, did I introduce this or has this existed before? Now, unless you're automatically making sure that your CI logs are going into your Datadog or New Relic or Logics or Honeycomb, there's actually nowhere to figure that out in GitHub Actions today. But because it's running on our VMs, we actually parse these logs and expose them in our UI. So you can actually go to Blacksmith and do a global log search across all of your CI jobs. And that's helping a lot of our customers fix problems faster.
And we're going to help them fix their flaky tests and catch them in the future with that same
approach.
I mean, that's pretty cool. Because having spent a lot of time, especially early on, back in the Salesforce days in like 2013, we were heavy Selenium users. One of the first big at-scale Selenium users was basically Salesforce: they had a huge UI service layer, and we were trying to get Selenium at scale to work with all of the flaky UI tests and all the ways that Selenium itself works. We've had Paul from Browserbase on, and I talked a lot about some of my experience with Selenium, how awful it was. But it also had this native issue of flakiness, of how do I actually understand it. Do you think the next step after
observability is improvement? Like, how do you actually improve things for people, in a semi-autonomous manner: hey, we noticed this test is broken, here's the patch. Or hey, we noticed this thing, here's what you can potentially do. Is that the ultimate outcome of the flow you see from having this data, or where does this go from here?
We're still figuring it out. But I will say that that's something that we're actively thinking about. We have all of this information about what's happening in your CI, what's breaking right now. In addition to surfacing it, can we help you fix it? We're still figuring out the medium in which it's most helpful. We actually have something coming out soon where we're going to surface CI errors and post them as a PR comment. And eventually, we're experimenting around, can we have them do something about it, maybe kick off a Cursor agent to help fix it, or use Claude Code? We're still playing around with that.
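A minimal sketch of the "surface CI errors as a PR comment" idea using the preinstalled gh CLI; this is not Blacksmith's implementation, just the general shape, and the test command and comment body are illustrative:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write   # needed so the job can comment on the PR
    steps:
      - uses: actions/checkout@v4
      - run: npm test        # hypothetical test command
      - name: Post a comment on the PR when CI fails
        if: failure() && github.event_name == 'pull_request'
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh pr comment ${{ github.event.pull_request.number }} \
            --body "CI failed: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
```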
I was reading a tweet from Mitchell, you know, from HashiCorp. Mitchell was talking about how he was able to survive alongside Amazon, right? Because his motto, I think the sales pitch, was: I will support Amazon faster than Amazon. He would tell that to everybody. I saw it yesterday. And so, with that in mind, you're working with another incumbent, right, GitHub in this case, and trying to be like, I'm better than GitHub at doing their own thing. Do you think they just don't care enough about the problem, that they have too many other things to solve so they don't want to solve these little performance problems? Or what is your way of surviving alongside incumbents? Is it just, I have special expertise, they don't? Maybe tell us more about how you've figured out that there is enough for you to continue without being squashed by, you know, the big boss here.
Yeah, so there are a few different angles here. The first is, in the words of our customers, GitHub has stopped focusing on Actions and is more focused on Copilot, and there hasn't been any material improvement in Actions in a number of years. And we think we're doing a lot of things to actually help with that. We're making GitHub Actions as a platform much better and more feature-rich. And we think that's actually going to make GitHub Actions more appealing as a platform as companies think about where they should migrate to from Jenkins: should they go to GitLab, or should they go to GitHub? And if they see that GitHub has a much richer ecosystem, then that's actually better for them.
And is there any acceleration from AI for you guys at all? Does it matter to you or not? Maybe the unit testing is the same amount and it doesn't matter. I feel like vibe coding is getting more code in general out there, but I don't think the tests are really that much more yet, is my assumption. But does AI affect you at all, I guess?
Yeah. So the answer is AI has a pretty big effect on our business. And I'll say we're one of the second-order beneficiaries of this AI codegen boom. I'll break this down for a few reasons. First, people are pushing out a lot more PRs than before. And every time someone pushes code, they have to run CI. And they're running it on us, so we're running a lot more of their workload.
The second is developers are writing a lot more code than before and a lot more tests than before, which means their build times are actually going up, and the number of tests that they have to run is going up, and that's also taking more time. And that's actually creating more demand for us, because people are getting more impatient and they're like, hey, CI needs to be faster. And we're seeing this in the data too, from how many jobs our customers are pushing and how much they're also spending on us.
There's a third effect too. There are a lot of tools today, like Codex, Claude Code, Codegen, that are automatically pushing PRs and iterating on them. And I think PR Arena actually has this really great dashboard that tracks the number of PRs merged across all these providers and also aggregates it by volume on public repos. You should look at some of those numbers: they're growing exponentially. And all of those PRs have to run CI.
I know there are a lot of ephemeral compute companies, like an E2B, a Daytona. In addition to what you just said, a lot of the net-new thing about AI is high ephemerality and the task-based workflow and this sort of scale-out, right? The future of development certainly does feel more like a scaled-out branch-test-experimentation pipeline than it feels like the systems of the past. I'm curious, do you envision CI getting significantly more integrated directly into these, whether I'm using Cursor locally or I'm using some external coding agent like a Devin or a Codex? What is the future of CI and CD? Because they are the encoded feedback loop for the LLM. And so I'm kind of curious, stepping back and just thinking about the purpose of CI, the purpose of CD, the purpose of blue-green deployment, the purpose of why we even started all this stuff years ago. Where does this all end up? Certainly some part of this is a giant feedback loop that can feed into the broader machine, the broader experimentation machine, the broader, how do we actually generate something that says, yeah, this is good code or bad code, and how do we scale out the usage of AI in code development?
Yeah. I want to break this down into two different categories. I agree with what you said about feedback loops, but there's immediate feedback, which is when an agent makes a change to a file or a class, and it might run the tests associated with that. But there's also the final step of running all your tests and just making sure nothing has regressed or nothing has broken. I think that's still going to happen. And I think that matters, especially before deployment, just to make sure that things are still working. I don't see that going away. Long and short, I still think CI will persist, but the agents are going to iterate differently. And I agree that today's approach, of Devin pushing a PR and polling GitHub Actions to see if something is broken, I don't think that's going to be the case forever.
Got it.
Well, we definitely want to go into our favorite section of our podcast called the Spicy Future.
Spicy Futures.
So I'm very curious now.
Give us your spicy hot take that you believe that most people don't believe yet.
I believe that companies should 5 to 10x their CI budget.
For the reasons that I mentioned, you know, companies are pushing a lot more code, their builds are taking longer, the number of commits is exploding, and the amount of software being written is going up. I don't think companies realize that they're going to have to rethink their compute budgets. A lot of companies are introducing token budgets; I know a lot of companies are giving their employees $500 in, you know, OpenAI or Claude credits. They're going to have to do that for compute as well. It's already happening, but I don't think companies have adjusted to that
reality yet.
So, 5 to 10x CI budgets. I think the first correlation is like, hey, we should have more people able to get faster CI to unblock them, so productivity will increase, that's the first thing I thought of. Are there any other benefits you're really trying to push for there? Like, what is the effect of getting a 5 to 10x bigger CI budget, to do what?
Sorry, I should have elaborated on that more. So the reason they're going to have to increase their budget is because people are going to run so much more CI. As developers push out more PRs, the number of CI jobs is going to go up by 5 to 10x.
Got it. And you think the CI jobs are growing five times because the code we're generating is 5 to 10x more, right? Basically.
There's more code, and also the number of commits that they're pushing out is actually going to go up drastically. That's the main driver.
Got it, got it. Of course, like Claude Code and Gemini, whatever. The whole batch vibe coding thing: it's changed from vibe coding in my IDE to vibe coding all my code at this point. So it's such a drastic effect.
And just to add to that, you know, today a lot of people are using, like, Claude Code, but eventually you're going to have agents orchestrating other agents that are going to keep pushing code all the time. And every time you push code, you're going to have to run CI. Another way to put this is: every time someone pushes code, they're spending X dollars on running CI. And when the number of times they push goes up five to ten times, the downstream spend goes up five to ten times as well.
That's probably the most interesting part. When we talk to engineering leaders, everyone's interest right now is so much in AI and AI budgets, like how much am I spending on OpenAI and all of these inference providers, right? And maybe a bit on the AI tooling and platforms. But I guess we haven't really even figured out the downstream effects on costs. You're saying, hey, if you're going to push 10x more code, all the rest of your platforms are going to see a 10x cost increase as well. And people are already noticing the productivity improvement is huge. So I think that really justifies a lot of the cost right now. But it's also terrifying, basically. I'm already spending a lot on R&D. And now I'm like, maybe R&D headcount has to come down to make up for that. Or I basically have to push more products and more value out, right? Because there is sometimes a limit to how much product I can actually sell; it doesn't even matter how much code I can output. So what do you think the prediction is here? Code is getting generated way more, fewer engineers per team, so I basically don't need that many people now, and that will just get even more exacerbated over time. Or do you think something else will happen?
My prediction here
is that companies are going to do more with fewer people. So you're going to see small engineering teams do things that historically would have taken, you know, 5x more people. And I'd argue that paying for compute is a lot cheaper than paying for labor. And I think companies are going to be okay with that tradeoff. That's my bet.
What do you think about the complexity of CI jobs? One thing we've talked about: you've basically positioned your 5 to 10x CI budget around the velocity of code changes, broadly speaking. And I'd consider that a second-order effect. But there may also be a third-order effect, or maybe a second-order one, whichever order we want to pick, that with a lot more of these agents, the complexity of what we will need to do in CI will increase as well, right? We're probably going to get to some formulation of simulation in CI/CD to deal with the fact of the upstream velocity, and we're going to have fewer humans to test all these new features and test all these new interfaces and test all these different things. And so you're going to want to look at systems in a true simulation-style testing experience, something more akin to, maybe it's not the end-to-end testing of, you know, 10 years ago. But certainly the complexity of what you need to do in CI will probably change drastically as well, in a way that it hasn't, because we had other ways to manage, you know, that risk.
I agree with that. And we've talked about this a lot inside the
company. And our prediction is that people are going to need tools to run a subset of their tests. And here's why: like we spoke about, people are going to write a lot more tests. It doesn't make sense to run every single test when you've only changed a small fraction of a code base. Now, of course, if you're on something like Bazel, and Bazel can figure out which targets have changed, it can run specific tests. But most of our customers, and we see this a lot, are not on Bazel, and they shouldn't be, given the overhead. So we're going to need better ways of predicting which tests to run. And that's going to be a challenge of its own.
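A much simpler, path-based approximation of "only run the affected tests", using the community dorny/paths-filter action. Real test selection as described above would predict impacted tests rather than rely on directory layout, and the filter names, paths, and test commands here are made up:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: changes
        with:
          filters: |
            backend:
              - 'backend/**'
            frontend:
              - 'frontend/**'
      - name: Backend tests
        if: steps.changes.outputs.backend == 'true'
        run: go test ./backend/...          # hypothetical command
      - name: Frontend tests
        if: steps.changes.outputs.frontend == 'true'
        run: npm --prefix frontend test     # hypothetical command
```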
Yeah, that's sort of... Maybe this is a little side question, to be honest, but I'm actually very curious about this. Because we talked to Paul at Browserbase, right? The AI wave is creating new infrastructure primitives that are required, because web agents and more actions and automation are happening. Besides AI pushing more code to you, are you also thinking about new primitives that are required? Maybe the first thing I thought of was GPUs, obviously: I need more special-purpose compute. Or do I need to actually start to have specialized frameworks to help people test their AI stuff? How much broader or deeper in the stack do you think you need to go because of the AI changes, if at all? Or do you maybe just focus on the basics and make it fast?
Yeah, I think right now, for most of our customers, the kinds of things they're doing, they're mostly bottlenecked by CPUs. But we do have a few customers who are asking for GPU instances so that they can test some of those workloads that require GPUs. And I think over time we'll see more of those, but I don't think we're going to see a shift where all CI instances need GPU support. I don't see that happening.
Got it, got it.
Okay, this is super fascinating. We have so much we could ask, but just in the interest of time, I think we'll probably end here. Where can people find out more about Blacksmith? If I want to sign up as a user, where should I go to sign up and learn more about Blacksmith?
Yeah, yeah.
For anyone who wants to learn more about Blacksmith,
just go to blacksmith.sh and click the sign up button.
We're actually just a few clicks to try out.
And that's it.
Don't even need a credit card.
Amazing.
Amazing.
And you get gaming servers ready to go, to run Call of Duty and CI. Both, right?
Cool.
Super appreciate you being on our podcast.
Yeah.
This is a lot of fun.
Thank you for having me.
Thank you.