In The Arena by TechArena - Inside Runpod’s GPU Cloud for Long-Running AI Agents

Episode Date: January 23, 2026

Runpod head of engineering Brennen Smith joins a Data Insights episode to unpack GPU-dense clouds, hidden storage bottlenecks, and a “universal orchestrator” for long-running AI agents at scale....

Transcript
Starting point is 00:00:00 Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allison Klein. Now, let's step into the arena. Welcome to Tech Arena Data Insights. My name's Allison Klein, and it's another Data Insights episode, which means I am here with Janice Norowski with Solidigm. Welcome to the program, Janice. How are you doing? Awesome. Thank you, Allison. I'm doing really well. How are you? I'm awesome. I'm really excited about the topic today. Why don't you go ahead and introduce who you brought with you this time and what we're going to be talking about. Yes, I'm very excited to talk, as usual, about all things AI. But today we actually have RunPod with us. And I know a lot of folks out there are really excited to understand all the different types of organizations that are really spearheading AI. And today we have Brennen Smith from RunPod, who is head of engineering. So welcome to the program, Brennen.
Starting point is 00:01:04 Excellent, excellent. Thank you both. Thank you so much for having me. So, Brennen, this is your first time on Tech Arena. Why don't we just start and do a little bit of an introduction on RunPod and your role at the company? Yeah, absolutely. RunPod is one of the best places to run AI workloads. We provide not only high-scale GPU compute.
Starting point is 00:01:23 We have 28 data centers full of top-end GPUs, storage, and networking. But we also have a highly sophisticated software stack that allows developers to go from idea to actually fully in production within minutes. This ranges from training workloads all the way to inference. And that's something that we really pride ourselves on: focusing not just on how to build high-quality infrastructure, but also on how to build high-quality software that makes it so developers and AI researchers can focus on what they want to do best, which is actually getting value out to their customers, scaling up, and being able to focus on the really exciting things. Let us handle the complex side.
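To make that idea-to-production workflow concrete, here is a minimal sketch in Python. The base URL, request fields, environment variable, and response shape are hypothetical placeholders rather than RunPod's actual API; the point is only the shape of the flow described above: register a container image as a GPU-backed endpoint, then send it traffic.

```python
"""Minimal sketch of an 'idea to production in minutes' workflow.

The endpoint URL, payload fields, and response shape below are
hypothetical, not any vendor's real API; they only illustrate the
two steps: deploy a container as a GPU-backed inference endpoint,
then call it.
"""
import os
import requests

API_BASE = "https://api.example-gpu-cloud.com/v1"   # hypothetical
API_KEY = os.environ.get("GPU_CLOUD_API_KEY", "")    # hypothetical env var


def deploy_endpoint(image: str, gpu_type: str,
                    min_workers: int = 0, max_workers: int = 3) -> str:
    """Register a container image as an autoscaling inference endpoint."""
    resp = requests.post(
        f"{API_BASE}/endpoints",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "image": image,            # your own inference server image
            "gpu_type": gpu_type,      # e.g. "H200"
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["endpoint_id"]


def infer(endpoint_id: str, prompt: str) -> dict:
    """Send one request to the deployed endpoint."""
    resp = requests.post(
        f"{API_BASE}/endpoints/{endpoint_id}/run",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    eid = deploy_endpoint("ghcr.io/acme/llm-server:latest", "H200")
    print(infer(eid, "Summarize the latest training run."))
```

In practice an SDK or CLI would wrap these two calls, but declaring the workload and then invoking it is essentially the whole developer-facing surface.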
Starting point is 00:02:03 Love it. And so, Brennen, with that, can you expand a little bit more on, with the rise of GPU-dense cloud environments, how this is changing the way small teams and enterprises alike build and deploy AI systems? It's a really good question. The economic landscape, especially within the cloud space, was completely upended with the advent of AI and GPUs. One of the big challenges is how expensive and how demanding these types of workloads, and the resources to support them, are. If you look at, say, an NVIDIA B200 or an H200, you're on the order of hundreds of thousands of dollars to be able to procure one of these. So leveraging a partner such as RunPod or others
Starting point is 00:02:44 to be able to supply that, that's very helpful. And making sure that those are dynamically available, readily available, is important. But there's a second component to it as well, which is all the associated componentry. Storage and networking are the most important parts that glue these systems together. We hear time and time again from customers who love RunPod because we have focused on investing in these areas, and it results in a much cleaner, more predictable, and better experience. There's nothing worse than having downtime or stuttering or issues for customers. And by ensuring that there's high-quality storage paired up with these GPUs, or high-quality networking paired up with these GPUs,
Starting point is 00:03:24 we have been able to show that this results in a markedly better experience and a stronger net promoter score across the board. That's awesome. One of the things that I've been thinking about is that we've spent a lot of time talking about AI training and large language model training, but we're seeing a shift in the market where organizations are starting to deploy AI in meaningful ways, whether it be LLM inference across application types or even agentic workflows. What do you see in terms of infrastructure requirements, and how they'll change over the next few years, as we make this shift?
Starting point is 00:04:01 That's something I'm really excited about, actually. Right now we're at the front end of this life cycle. And I always like to say, you train once, you inference forever. Training is the equivalent, in the traditional business sense, of a highly capital-intensive job, and then you're able to return that investment over a long time. Now, a lot of the investment has gone towards training, just in terms of infrastructure development. But that long-term inferencing is what will make or break them. Can people operate at scale? Can they operate efficiently?
Starting point is 00:04:30 Are they able to deliver value to customers? And it spans throughout every single layer of the stack. I have engineers working on everything from low-level physical optimization, working with partners such as Solidigm and others to really make sure that our workloads maximize the hardware's performance and potential. And then we go all the way through the stack, up to the highest-order inference optimization: how do we distribute these LLMs and their different components
Starting point is 00:04:58 across different portions of our stack to operate either more efficiently or more performantly. This is where the innovation is happening right now across the industry: how to glue these different spaces together. And especially on the inference side, this is where Jensen Huang is famous for talking about AI factories. He keeps repeating that. And that shows what is to come,
Starting point is 00:05:20 which is that how efficiently, how well these systems are run from an operational excellence perspective will dictate the winners and losers. You run an inefficient factory? You're out. And that's where working across and throughout the stack and having a very clear vision is going to set people apart. Yeah, and I couldn't agree more with that, Brennen, especially because I've been in the industry a long time, in storage for almost 15 years. And I often hear about keeping these systems up and running, you know,
Starting point is 00:05:48 and what's really the key behind that. And storage is often seen as the bottleneck, right? It's the thing that everyone's trying to optimize to keep the GPU and CPU fed. But I've heard recently that storage is really considered the hidden bottleneck in AI. In practice, how does high-performance storage influence things such as cost, latency, and developer velocity, in your opinion? It's one of the largest parts. I was literally just talking to a couple of my engineers about this.
Starting point is 00:06:17 They were looking at Docker image loading. This isn't necessarily directly related to LLM-specific activity. But what we hear from customers is a complaint like, hey, it is slow to load a Docker image, this is slowing down my workflow, or it is slow to do X, it's slow to do Y. Where the bottlenecks occur is at all of the different interface points throughout the system. Storage is what glues everything together. And what we have found is that, every time, as long as we are optimizing our storage, we are able to make the data
Starting point is 00:06:49 move faster. And one of our theses internally is that things need to magically just work. They need to just magically appear. They need to magically happen. We just came out with a new feature called Model Store. It's in public beta. What it does is leverage special NVMe storage mechanics and global distribution to make it seem like models appear on a GPU like magic. They just appear. They're up and running. There's no latency involved. Normally it would take, you know, minutes, if not hours, to load this. And imagine you're trying to do a development cycle. You're rolling things out.
Starting point is 00:07:23 You test it, it fails. You roll it out again, test it, it fails. And that's just part of how you build software. Well, now if you're able to load things instantly onto the GPU, they're in VRAM, and it seems like magic, now you are able to cut that developer cycle down by minutes or hours. And when you compound that across a large organization, now you're talking actual, real monetary gains.
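The "models appear like magic" effect described above comes down to keeping weights on fast local NVMe near the GPU instead of pulling them over the network on every cold start. A minimal sketch of that cache-aside pattern follows; the paths and the download helper are made up for illustration and are not the actual Model Store implementation.

```python
"""Cache-aside model loading: hit local NVMe first, fall back to remote.

Paths and the download helper are placeholders; the point is only the
pattern: cold starts pay one slow remote fetch, warm starts read from
the local fast tier and are near-instant.
"""
import time
from pathlib import Path

NVME_CACHE = Path("/tmp/nvme-model-cache")   # stand-in for a fast local NVMe tier


def fetch_from_remote(model_id: str, dest: Path) -> None:
    """Placeholder for a slow pull from object storage or a registry."""
    dest.mkdir(parents=True, exist_ok=True)
    (dest / "weights.bin").write_bytes(b"\x00" * 1024)   # stand-in for real weights


def load_model(model_id: str) -> Path:
    """Return a path to model weights, populating the NVMe cache on first use."""
    local = NVME_CACHE / model_id
    if not local.exists():                       # cold start: one-time remote fetch
        start = time.time()
        fetch_from_remote(model_id, local)
        print(f"cold start: fetched {model_id} in {time.time() - start:.1f}s")
    else:                                        # warm start: local NVMe read only
        print(f"warm start: {model_id} served from local cache")
    return local / "weights.bin"


if __name__ == "__main__":
    load_model("acme/llama-demo")   # first call populates the cache
    load_model("acme/llama-demo")   # second call is near-instant
```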
Starting point is 00:07:46 So when I talk to some of our larger customers, we talk about this in terms of leveraging high-quality software and high-quality hardware as an accelerator of innovation. Every CTO is under a mandate from their CEO to make AI happen. That's just how the industry is right now. And where we have seen a lot of success is with CTOs and CIOs who are directly focused on: how do I accelerate my team's efforts? If I make my teams faster, they will be able to figure out the best outcome faster. And then from there, it's a matter of iteration and bringing successful products to market. Now, let's take us under the covers a little bit. How are you actually dialing in the right
Starting point is 00:08:26 mix of compute, storage, and orchestration tools to help developers accelerate time to value and de-risk AI projects? I touched on that a little bit, but I think you bring up a good point about de-risking. De-risking is a large portion of, call it the calculus, in a way, for these companies, because this hardware is so expensive and the developer time is so expensive. These are very expensive initiatives. And I like to borrow from the Meta mantra of fail fast: the faster that developers are able to improve their code, make it better and better and better, the better off they are. So where does the innovation happen? What unlocks that? What we do at RunPod is we look at ourselves as a global orchestrator. We don't look at
Starting point is 00:09:14 specific data centers. Now, we have 28 data centers around the globe, but our goal is to provide developers a substrate so they don't need to worry about that, and so they're able to look at and see their application as a whole. And we are able to route things around. We're constantly doing analysis on our side, with automated systems, AI models, and human PhDs, looking at how our workloads move through our systems and how we can better optimize where workloads are placed, to either improve latency, improve delivery times, or improve cold start times. These all fit together into a much larger picture. So this is where, oftentimes, the question of Kubernetes comes up, and how that fits into this whole equation. Kubernetes is wonderful
Starting point is 00:09:58 because it abstracts a lot of the hardware details and a lot of the deployment details away from the developers. And that's good, but it's very generic. At RunPod, we focus on how we leverage our partners' greatest strengths and work together on roadmap design, work together on where we're going as a team. That way, we're able to write very tailored, specific software that comes with a better outcome than just the generic status quo. And Brennen, with your experience, what lessons have you learned about giving developers access to different tiers of storage or compute so that they can quickly, you know, experiment and scale responsibly?
Starting point is 00:10:39 Yeah, that's a good question. And storage is, as I said, the key substrate that brings things together. Where we see the majority of tiering operations happen is with, obviously, hot data and cold data. Those are the traditional concepts used in large-scale clustered systems. But there are much larger demands now on parallel storage. Object storage is highly distributed, exabytes across many different machines. How does that get further optimized to be able to feed these absolute monster training runs? The second part is that that data has to remain in the data center.
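As a toy illustration of the hot/cold tiering being discussed here, a single rebalancing pass might promote frequently read objects to NVMe and demote idle ones back to object storage. The thresholds, tier names, and catalog structure below are invented for the sketch, not any vendor's real policy.

```python
"""Toy hot/cold tiering policy: promote frequently read objects to NVMe,
demote idle ones to object storage. Thresholds and tier names are
illustrative assumptions only."""
import time
from dataclasses import dataclass, field


@dataclass
class ObjectStats:
    tier: str = "object_store"          # "nvme" or "object_store"
    reads_last_hour: int = 0
    last_access: float = field(default_factory=time.time)


def rebalance(catalog: dict,
              promote_after: int = 50,
              demote_after_s: float = 6 * 3600) -> None:
    """Apply one pass of the tiering rule over the whole catalog."""
    now = time.time()
    for name, stats in catalog.items():
        if stats.tier == "object_store" and stats.reads_last_hour >= promote_after:
            stats.tier = "nvme"                  # hot: move next to the GPUs
            print(f"promote {name} -> nvme")
        elif stats.tier == "nvme" and now - stats.last_access > demote_after_s:
            stats.tier = "object_store"          # cold: free up the fast tier
            print(f"demote {name} -> object_store")


if __name__ == "__main__":
    catalog = {
        "datasets/train-shard-0001": ObjectStats(reads_last_hour=120),
        "checkpoints/old-run": ObjectStats(tier="nvme",
                                           last_access=time.time() - 24 * 3600),
    }
    rebalance(catalog)
```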
Starting point is 00:11:14 Now, it might get delivered out on a regular basis, but there are also times where it may just need to sit there and wait while things are churning, while things are working within the GPUs. Having the ability to move data between different tiers and shuffle around that cost is a really important aspect. This is one where we do rely heavily on our partners. We have very close relationships, making sure that they are able to move data between TLC or simulated SLC, and making sure that we understand how the caching mechanics are working, such that we are best able to optimize our workloads to them. But again, utilization. That's the next one. When I'm talking with our competition or others, we're always talking about utilization. That's the number one metric of success here. And the better you're able to move your data around, the better you're able to maximize your utilization, and therefore earnings, across the board. Now, when you look at the landscape of AI, it's moving so quickly, are you seeing any nascent technologies emerge that you think are going to drive the next big wave of infrastructure innovation demand? I think the really exciting areas where we'll see some
Starting point is 00:12:29 fascinating achievements involve even tighter integration. There are two verticals, but they align together. There are GPU, NPU, or APU manufacturers that are in the R&D stages now, but they really have aspirations of being the best inference provider or building the best inference chip. The second side of it is on storage: what we see over and over is that VRAM is the unlock. That's what provides the unlock of capability. How many cores you have, et cetera, effectively gauges performance. But if you try to load a model that's larger than VRAM, it's a hard line. You're not able to do it. You run out of memory, and you're done. Where I'm really excited is that there are a number of different companies out there working on converging storage with the compute layer and bringing that closer and closer together. And the closer those are able to be brought together, the blurrier the line becomes in terms of how much capability can be loaded in, how much capability can be addressed, how many parameters can be managed. So that's one area where I'm personally very excited. And that's not just storage. That's also networking.
Starting point is 00:13:39 How do we cut out layers of the stack to make things faster? The second one is that NVIDIA, obviously, is working on more of a tailored, distributed approach, and I think we'll start to see more of that: dedicated KV cache layering systems, and how to tie those together and start breaking out these large, presently monolithic systems into more tailor-made componentry. That is also where winners and losers will be made, and that's going to be a really fun arena to watch unfold. I couldn't agree more. So with that, Brennen, what does it really take for cloud platforms to support long-running, memory-intensive AI agents at scale without sacrificing overall reliability or cost efficiency? It's a good question.
Starting point is 00:14:26 So one very important area: it does get down to the software, and the reason is that how you're able to move workloads around a stack is really important. You're looking at a data center scale, you're looking at a machine scale, you're looking at a global scale. As you said, for a long-running application or inference-type workload, you have to have the ability to move things around very gracefully. And those things include not only the workloads, the inference, but also the traffic and the actual inbound request volume. How does that get shuffled around within a given country, within a given continent, or within a global scope? You want to ensure not only that there's good performance from the perspective of delivery, and locality, delivering from the nearest location, but also that you're balancing workloads from the perspective of fill and capacity management. Training is relatively simple. You have a static cluster. You might add some additional Ray nodes in, but it's a relatively static workload, and you can plan that in advance,
Starting point is 00:15:27 etc. Inference is a whole different beast, where we see customers go viral on a regular basis, all the time, and that is a really fun challenge to deal with, because all of a sudden you have this rapid spike: machines and GPUs just get consumed immediately, and we're having to spread traffic all across the globe. It's fun to watch, it's really amazing to see, but these cause interesting knock-on effects. Now you have a huge thermal increase; you've lifted the power floor of the data center over the course of minutes. Odd knock-on effects occur when those types of things happen. We've had cases where we've browned out data centers and grids before. We've had issues of different magnitudes like that. So I think as the industry matures, as the industry catches up, that's going to be a really important detail for all of us in the industry to focus on: how best to balance these types of high-demand workloads to maximize the efficiency that we're able to achieve and, as a result, frankly, take best advantage of all the resources we have available to us.
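A simplified stand-in for the kind of routing decision described here when a customer goes viral (not RunPod's actual scheduler): score each region by spare GPU capacity minus a latency penalty and send the request to the best one. Region names, capacities, and weights are invented for the sketch.

```python
"""Toy global router for a traffic spike: send each request to the region
with free GPU capacity, penalized by the network latency it adds. All
numbers and region names are made up for illustration."""
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    free_gpus: int
    latency_ms: float      # network latency from the requesting user


def pick_region(regions: list, latency_weight: float = 0.5) -> Region:
    """Score = spare capacity minus a latency penalty; highest score wins."""
    candidates = [r for r in regions if r.free_gpus > 0]
    if not candidates:
        raise RuntimeError("no capacity anywhere: queue or shed load")
    return max(candidates, key=lambda r: r.free_gpus - latency_weight * r.latency_ms)


if __name__ == "__main__":
    fleet = [
        Region("us-east", free_gpus=2, latency_ms=20),
        Region("eu-west", free_gpus=40, latency_ms=90),
        Region("ap-south", free_gpus=15, latency_ms=180),
    ]
    for _ in range(3):
        r = pick_region(fleet)
        r.free_gpus -= 1               # the routed request consumes a GPU slot
        print(f"routed to {r.name}")
```

A production scheduler would also weigh power and thermal headroom, which is exactly the knock-on effect described above.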
Starting point is 00:16:11 that's going to be a really important detail for all of us in the industry to focus on is how best to balance these types of high demand workloads to maximize the efficiency that we're able to achieve and as a result, frankly, take best advantage of all the resources we have available to us. I left your concept at the Universal Orchestrator and really removing some of the complexity from the developer in terms of the underlying infrastructure that is being utilized. Do you see any technologies or practices that will further transform this as AI advances over the next couple of years? One area that I'm personally betting on, I'm very excited about. is the convergence of infrastructure and code.
Starting point is 00:16:57 is the convergence of infrastructure and code. Right now, and historically, there's always been a separation: back in the day you had systems administrators and software; you had SREs, DevOps, and software engineers. There are better and better advances on the software side to bring this tooling closer. And the vision that we have at RunPod is, how do we bring that even tighter? How do we actually make the code self-declare and self-establish the infrastructure that's required to run it, where a developer doesn't even need to think about what infrastructure powers it? They write the code, and the infrastructure is built
Starting point is 00:17:33 automatically behind the scenes to support it. Because at the end of the day, business logic is what really defines the value-creation flywheel of software. And anything we can do to make it even easier to get global distribution, edge-type compute performance, global-scale storage, and effectively unlimited, unblocked tiers of storage, while developers just focus on their code, that's a hugely powerful paradigm. I know others are working on it, I know that's being focused on. It's an exciting space, and RunPod has some very exciting things in store there.
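One way to picture code that self-declares its infrastructure (a hypothetical pattern, not an existing RunPod API): attach the GPU and scaling requirements to the handler itself, so a deployment system could read them and provision everything behind the scenes. The decorator, field names, and GPU sizing below are all assumptions made for the sketch.

```python
"""Sketch of infrastructure-from-code: the handler declares what it needs
(GPU type, memory, scaling bounds) and a platform could provision that
automatically. The decorator and field names are hypothetical."""
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class InfraSpec:
    gpu: str
    vram_gb: int
    min_workers: int = 0
    max_workers: int = 4


def requires(spec: InfraSpec) -> Callable:
    """Attach an infrastructure declaration to a handler function."""
    def wrap(fn: Callable) -> Callable:
        fn._infra_spec = spec        # a deployer would read this at build time
        return fn
    return wrap


@requires(InfraSpec(gpu="H200", vram_gb=141, min_workers=0, max_workers=8))
def handler(request: dict) -> dict:
    """Pure business logic; nothing here knows about data centers."""
    return {"echo": request.get("prompt", "")}


if __name__ == "__main__":
    print(handler._infra_spec)       # what an orchestrator would provision against
    print(handler({"prompt": "hello"}))
```

The handler stays pure business logic; everything about where and on what hardware it runs lives in the declaration.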
Starting point is 00:18:24 This has been an amazing conversation. I feel like we could talk to you for another hour, but we can't. So with that, where can folks go to learn more about RunPod and the services you offer? You know, honestly, I think it depends on what you're looking for. If you're looking for large-scale compute, feel free to go to our website, or reach out to me on LinkedIn; I'd be happy to have a conversation. If you're an active developer, we have a very active Discord. We have employees in there, and we have a ton of active community members. It's a fun space to hang out; I personally love to spend time in there. So no matter what you're looking for, we'll have someone who can connect with you, and we're happy to share more. And infrastructure is something that's really exciting in this space, and there are more and more advances here. I think how we bridge all these layers together and make them converge, that's where the excitement is really going to
Starting point is 00:19:02 happen. Brennen, thank you so much for spending time with Janice and me. It was so exciting to learn more about RunPod and hear what you guys are up to, the foundational delivery and capabilities that the world needs right now. So congratulations on what you've accomplished thus far, and we can't wait to see what comes next. And with that, Janice, that wraps another episode of Data Insights. Thank you so much for being with me today. It's a real pleasure. Thank you, Allison. Same. Thank you, Brennen. Awesome. Thank you so much for having me. Thanks for joining Tech Arena. Subscribe and engage at our website, TechArena.com.
Starting point is 00:19:39 All content is copyright by Tech Arena.
