In The Arena by TechArena - Data Insights Series Sponsored by Solidigm: Achieving AI Scale with CoreWeave
Episode Date: March 26, 2024
TechArena hosts Allyson Klein and Jeniece Wronowski chat with CoreWeave’s Jacob Yundt about how his organization is delivering a scalable data pipeline to AI customers utilizing breakthrough VAST Data solutions featuring Solidigm QLC SSDs.
Transcript
Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators
and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to the Tech Arena. My name is Allyson Klein, and I am so delighted to be here.
We are kicking off episode two of our Data Insights Series, and I'd like to welcome back my co-host for the Data Insights Series, Jeniece Wronowski from Solidigm. Welcome to the show, Jeniece.
Hi, Allyson. Thank you so much.
So we've been at GTC, and what an exciting conference that was, really capturing the
best of innovation in the industry right now. What are your key takeaways from that?
Yeah, we have seen so much excitement around just AI in general, so many different organizations.
But there's one organization in particular that really, really stands out to me, and that is CoreWeave.
And CoreWeave is one of the world's most innovative GPU cloud providers today. They specialize in delivering massive scale of NVIDIA GPUs on top of the
industry's fastest and most flexible scalable infrastructure. So I'm really excited to talk
a little bit more today about what they're up to. I have been hearing about them all over the place,
and I'm so delighted that we got a guest to join us from CoreWeave. Do you want to introduce him?
Yeah.
So today we have Jacob Yundt from CoreWeave.
Jacob is the director of compute architecture for CoreWeave.
So Jacob, thank you so much for joining us today.
Thanks for having me.
So Jacob, I know it's a busy week for you at GTC, but thank you so much for taking the time with us.
Why don't we just start with, you know,
you're known as an innovative company and you have been driving incredible disruption into
next generation cloud computing. How are you able to deliver the scale that you're delivering with
NVIDIA GPUs?
It's a great question. Jeniece mentioned earlier how we are a specialized cloud service provider. And one of the secret weapons of CoreWeave is that our software stack is purpose-built to handle these massive GPU training clusters. From provisioning to hardware validation through passive and active health checks, all the way
through orchestration and scheduling, our cloud is uniquely designed to bring massive amounts
of GPUs online as fast as possible.
Jacob, can you tell us a little bit about how that fast access
really sets you apart from other CSPs who are trying to do similar work?
Yeah, also a good question.
So part of it is that our cloud is fast.
Our software stack, like I mentioned,
is specifically designed for bringing these clusters online as fast as possible,
but it's also responsible for making sure
that we have stable and reliable and consistent performance.
We've got our control plane that regularly runs
active and passive health checks. We want to make sure that the cluster is running at top speed
from day one until we retire the cluster. Our goal is to make sure that we identify any potential
performance issues well before the customer does. And this can be anything from detecting
hardware failures to detecting that we
have slow interconnect links to making sure that we're screening out any type of underperforming
hardware, like underperforming GPUs. In addition to that, we've got a lot of tools that we've developed
in-house to improve our customer experience. We have ways to improve data ingestion, and that
really separates us from the other clouds. We are truly designed from the ground up to support this very unique AI use case.
When you look at that AI use case and you think about enterprise adoption, you know,
one of the things that I wanted to talk to you about is where you saw enterprises adopting AI.
And obviously there's a lot of different types of solutions that they're looking for.
How are you seeing this market shape as we head further into 2024?
I mean, I think the demographic that we're targeting right now is closer to the AI startups and those that are looking for massive amounts of GPUs. That's not to say that we're not dealing with enterprise, but we're interested in customers that want to do groundbreaking work at
incredible scale.
And so speaking of that incredible scale, I kind of want to dive a little
bit into the storage specifically. Can you tell us a little bit about what the storage you're
dealing with today means to you and how it
helps you with your overall solution?
So we're just scaling GPU and compute like crazy. If we
just look at power density, we've jumped our rack power density from like 17 kilowatts to 30 kilowatts,
34 kilowatts. We're getting ready to deploy high density racks that are 80 plus, 100 kilowatts, 120 kilowatts.
And that level of density is just crazy.
But part of scaling that GPU density is making sure that we're scaling the storage accordingly.
Larger clusters typically result in us having a larger demand for storage as well. And we can't meet our customers' demands for that storage density unless we're specifically designing our hardware to meet that level of density and performance.
So we're targeting high-cap, high-performance NVMe drives. We're making sure that our software
and hardware are tuned to meet our customers' needs.
And just to follow up on the power
consumption, can you share a little bit more? Is there an advantage
with the type of storage you're using? Can you comment on how that differs from your competitors
per se or how that's helping the overall environment?
Yeah, I mean, I think the power
consumption is one aspect of it, but it's finding a good blend of performance and capacity and power
consumption. Like we can move a slider in one direction and say that like, you know, this is using little to no power, but then we may be taking up tons of
space or we're just burning performance. And one of the things that we've aggressively leaned into
is adopting QLC high cap drives. For us, that strikes the perfect balance of performance,
power consumption, density. And yeah, without using that type of
technology, I just don't think that we could be hitting our customers' requirements in terms of
all those metrics that I mentioned for density, capacity, performance, et cetera.
Jacob, we were just talking earlier in the episode about being at GTC, and I know that you've
spent the week there. What were you impressed with in
terms of the broader innovation that you saw at the conference, and how does that relate to what
CoreWeave's plans are?
So I'm a hardware guy, and I'm pretty biased towards anything that's
related to infrastructure. And right now, I am incredibly excited and a bit nervous about the
amounts of power that we're going to need to
support some of these future clusters and how we're going to cool them. So right now, liquid
cooling is a hot topic, air quotes, hot topic, because it's no longer just a nice-to-have,
but it's a must-have. We're planning to use NVIDIA's next-generation Blackwell GPUs,
and we're only planning to deploy that with liquid cooling.
And we already know that the Blackwell architecture has some pretty impressive
performance improvements. I think it's something like 30% perf improvement, 20%
improvement in power efficiency. But combined with that new architecture and our super high
dense liquid cooled racks, we're going to be able to offer just like larger,
faster clusters to our customers. And that's going to just have a huge impact on these large training jobs or just super fast inference.
Wow. I think aside from the storage comment,
I think that literally is the coolest thing you've said. And figuratively, so I'm with you.
Okay. So I appreciate that insight. But you also mentioned earlier your partnership with VAST.
Can you tell us a little bit more about the secret sauce or, you know, key benefits of working with VAST?
And, you know, feel free to comment on any of the partnership.
That would be great.
Sure.
Let me take a quick step back, though, and talk a little bit about QLC because that segues into why VAST and the partnership.
I'm a huge fan of QLC. If you haven't picked that up yet, I think it's a great product.
I was an early adopter of it at my last gig. We're deploying it aggressively at CoreWeave.
As I mentioned earlier, it's a great way to strike a balance of performance, density, cost, etc. Part of VAST's offering is to leverage QLC.
So it's a great validation that like, hey, they think it's good.
We think it's good.
Okay, maybe like we're both onto something.
But besides just the hardware that they're using to deploy their solutions, we've got
an incredible relationship with VAST right now.
They've truly been a fantastic partner.
We've aligned our internal roadmaps with their engineering roadmaps.
We've got our engineers working together right now to co-develop features, test new functionality, debug problems.
So it truly is a great collaborative partnership in the truest sense of the word.
Awesome. Jacob, can you tell us a little bit
about the overall market response to your product and how has that been?
Market response has been great. We're deploying VAST at all of our new data centers and we're
deploying it to a large range of customers. And in general, they've been extremely happy with it.
We're going to be adding
a bunch of new features, like I mentioned previously, that we're co-developing with VAST.
And I'm extremely excited about deploying more of this and offering new and better performance
and features to our customers.
You know, you've talked about your relationship with VAST. I've
got to ask, and you say that you love QLC. How has the engineering relationship been with Solidigm through this?
And what has your experience been with working with the team with QLC drives?
Solidigm's been absolutely fantastic. I mentioned that I had been working with them for a few different gigs. And the partnership with them is also just incredible. I can say that like without them,
we definitely would not have been as successful deploying QLC.
We get great engineering support.
I know that I can hit them with any type of tough question and I'll get some solid, solid answers from them.
But just overall, great support from them,
both in terms of account support, engineering support.
Right now, you know, NAND's in a little bit of a tough space in terms of availability. And in general,
Solidigm is just a great partner to work with.
Awesome. Thank you for that, Jacob.
And you guys likewise.
Can you tell us where folks can learn more about the solutions that we've discussed
today?
Sure thing. Head to the website. We've got updates,
blog posts, more documentation, and just general information where you can learn about our latest clusters, new DC designs, and new features that we're rolling out.
Well, thank you so much for
being on today, Jacob. It was a real pleasure to get to know you. You are very full of puns,
and that was really fun too. Thanks so much for being on the show today. It
was a really good time to learn a little bit more.
Yeah, thanks for having me. This was great.
Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by The Tech Arena.