In The Arena by TechArena - WEKA on AI Data Centers: A New Infrastructure Playbook
Episode Date: November 12, 2025
In this episode, Allyson Klein, Scott Shadley, and Jeneice Wnorowski (Solidigm) talk with Val Bercovici (WEKA) about aligning hardware and software, scaling AI productivity, and building next-gen data centers.
Transcript
Welcome to Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Allyson Klein.
Now, let's step into the arena.
Welcome to In the Arena. My name's Allyson Klein. And today is another Data Insights podcast, which means Jeneice Wnorowski with Solidigm is back with me.
Jeneice, how's it going?
Hey, Allyson, it's great. And it's awesome to be back.
So we have a fantastic episode today. This is amazing. Why don't you tell us what the topic is and who you've brought along with you this time?
Yes, as always, I am even more excited about this topic. But today we have an opportunity to talk token economics and overall just AI productivity. And to talk through this, we actually have WEKA joining us today, as well as a guest from Solidigm. So today we're going to hear from Val Bercovici, who is the Chief AI Officer for WEKA, and then we have Scott Shadley, Director of Thought Leadership for Solidigm.
So, guys, welcome to the program.
Great to be here.
Looking forward to a fun conversation.
Val, this is your first time on the podcast, which is awesome.
We've had WEKA on before, but why don't you go ahead and just introduce yourself and your background at WEKA?
Sure.
And Scott and I used to work together before.
You know, I think we intersected, obviously, at NetApp after the SolidFire acquisition, where I was CTO, and I
spent some time with Google helping bring their Borg project to the world as open source Kubernetes.
And since then, I joined WEKA in charge of AI strategy.
And ordinarily, that would be a part-time role.
And there would be all sorts of other tactical things to do.
But in the biggest industry of our lifetimes, growing faster than anything we've ever seen,
strategy is actually a full-time job, because in the past year alone that I've been here,
the industry has shifted so dramatically.
And you do have to think ahead and make some very clever educated guesses,
and obviously be right a bit more often than you're wrong.
And, Scott, you've been on the show quite a bit.
I know that our audience is pretty familiar with you, but just do a reintroduction.
Yeah, no problem.
I've been in the industry now a little over 25 years, so it adds some validation to my age.
I also went for one of the largest titles possible with my thought leadership.
So the more letters you have in your name, the more important you are, right?
But I've been working for Solidigm for just over three years, loving the technology growth
and direction we're going. And to Val's point,
the momentum behind the market we're driving now has been a lot of fun to follow and push and be a part of.
So Scott and Val, let's start with the overall big picture.
Time to First Token is becoming a new measure of AI responsiveness.
So why is this such a critical benchmark, and how does it change the way you both think about building infrastructure for inference?
I'll jump on this if you don't mind, Scott.
So what I love about our industry in general here and metrics like time to first token
is that they're so transparent, and it takes me back to my early database benchmarking days
when transactions per second was literally business value.
And obviously we've added so many layers of abstraction since then,
but with AI, we're back to the future, basically, where important metrics like time to first
token literally translate to revenue, OPEX, and gross margin for the inference providers
and the model builders.
And ultimately, if you take a look at big companies in the headlines, like Cursor and Anthropic
with Claude Code, it translates to their
value and profitability in the marketplace, or their valuation.
So specifically, it's around how long it takes to translate the prompts that we send
into the first byte of response that we actually see on our screens.
And it gets much, much deeper than that for the AI apps now that are more API-driven
and chat-driven.
But I've spoken enough.
Let's see, obviously, what Scott thinks about this and how we can translate it into the
rest of the conversation.
Absolutely.
And to your point, it kind of goes back to the TPS reports, right?
I love the reference to Office Space.
That was wonderful.
But yeah, it truly is.
We're all in a society now where it's about how fast we can do something.
And especially with AI now, where you're doing things with RAG and all these other ways,
and we're getting into agentic and beyond, the time to first token is not even necessarily the
time to the first token of our response, but the internal cycle of responses behind it.
So it's all about the data, but it's a matter of when you see that first response to the
data you put in, whether it be a token or any other type of economic unit at this point.
I just want to add an example because we're talking a little bit in abstract here.
The example is real-time voice translation.
Apple just announced that with their AirPods the other day.
Google did it a few months ago at their I/O conference.
Who wants to wait an awkward pregnant pause of 30, 40 seconds for a translation?
We want that to be real-time and instantaneous.
And time to first token is a key metric for that kind of use case.
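To make time to first token concrete, here is a minimal sketch of how you might measure it against a streaming endpoint. The stream here is a stand-in generator rather than any particular vendor's API, and the delay values are illustrative assumptions.

import time

def measure_ttft(stream):
    # Returns (time to first token, total time, token count) for a token stream.
    start = time.perf_counter()
    ttft = None
    count = 0
    for _token in stream:                        # any iterable yielding tokens or chunks
        if ttft is None:
            ttft = time.perf_counter() - start   # moment the first token reaches the user
        count += 1
    total = time.perf_counter() - start
    return ttft, total, count

def fake_stream():
    # Stand-in for a real model response: a prefill delay, then a steady decode rate.
    time.sleep(0.4)                              # assumed prefill latency
    for _ in range(100):
        time.sleep(0.01)                         # assumed per-token decode time
        yield "tok"

ttft, total, n = measure_ttft(fake_stream())
print(f"TTFT={ttft:.3f}s, total={total:.3f}s, ~{n/total:.0f} tokens/s")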
Now, as we're turning to a period where we're seeing broad proliferation across so many different industries,
the economics become really important.
And we know, Scott, that AI economics hinge on cost, performance, density.
There's so many variables here that IT administrators are taking a look at.
When you look at your conversations with customers,
how are they looking at how the SSDs in particular are reshaping this equation
to be able to scale AI factories to meet the need efficiently?
Yeah, it's an interesting conundrum we've run into, right?
So with all good things at launch, you go out full bore,
you put as much effort and energy as you can into it.
And then as you start to evolve it,
you realize there's a need to optimize that layout, that look, and that feel.
And so that's what's really happening with the SSD marketplace
is it's allowing us to bridge that gap even more than we've ever done before
between "I have to have memory now" versus "I have data in storage somewhere."
And the ability to do that handshake has just never been something that more legacy
hardware could do, and SSDs really have an opportunity to support it in the most amazing fashion,
but they still themselves have to be the right size. We have stuff that's sitting right next
to a GPU, next to the memory, next to whatever, and we have stuff sitting off on a network
attach of some sort, and they're not the same product. And if anybody tries to tell you
they are, then they don't know what they're talking about. And that's one thing we love about our
conversations with the customers is it's not how many SSDs you want. It's we know you need
storage, we know you need to improve your response times. How do we do that with the most effective
products on the market? So Val, I have a question for you specifically, around WEKA's NeuralMesh
Axon, which embeds storage intelligence into the GPU, or close to the GPU. Can you unpack
for us how this works, and why is it such a game changer for accelerating token readiness?
Yeah, I love this topic in general, because it shows you how different AI and AI factories, which
obviously are fundamentally AI data centers,
are from the traditional clouds and legacy data centers
we've worked with in the past.
We have this whole spectrum of capabilities
from this protocol we all use for solid-state storage;
the protocol is NVMe.
And I like to joke, in many contexts,
particularly in GPU computing, AI computing,
there's no S, there's no storage in NVMe, right?
It's Non-Volatile Memory Express.
And what we're able to do with Axon
is actually take NAND flash media, so SSDs,
and simply through software create memory performance and
a memory interface to those SSDs, certainly in aggregate, as a group, because it's all about
memory bandwidth.
So the more SSDs, or as we just referred to them, NVMe devices, we can pool together, the more
we can match both the raw bandwidth, and MBU, memory bandwidth utilization, is a key metric in
AI, but also the latency that the software of AI, which is primarily inference
servers in the monetization case, expects.
So Axon is quite simply software-defined memory.
It's software you install on a stock GPU server, either using the Nvidia reference architecture,
the DGX motherboards, or Dell, Supermicro, Lenovo, HP, you name it.
And merely by installing the software, Axon sees and pools together all of the embedded SSDs,
the embedded NVMe devices, on each and every GPU server, often also called the GPU node,
and together delivers the memory bandwidth and latency that inference servers expect. The most popular
one on the market today is an open source one called vLLM.
This software, this key software, actually sees the NVMe devices as memory now,
which is remarkable in terms of addressing the really pressing topics of token economics
and helping invert the upside down economics of AI businesses that is being widely reported on today.
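As a rough illustration of the MBU metric Val mentions, here is a back-of-the-envelope sketch. Every number below is an assumption chosen for illustration (a 70B-parameter model in FP16, an assumed KV-cache read per token, an assumed decode rate and HBM peak), not a measurement of Axon or of any real system.

# All figures are illustrative assumptions, not measurements.
model_bytes      = 70e9 * 2     # ~70B parameters at 2 bytes each (FP16)
kv_bytes_per_tok = 0.3e6        # assumed KV-cache bytes read per generated token
tokens_per_sec   = 20           # assumed decode rate for a single request
peak_bw_bytes    = 3.35e12      # assumed ~3.35 TB/s of HBM bandwidth on one GPU

# During decode, each new token touches roughly the full weights plus the KV cache.
achieved_bw = (model_bytes + kv_bytes_per_tok) * tokens_per_sec
mbu = achieved_bw / peak_bw_bytes
print(f"achieved ~{achieved_bw / 1e12:.2f} TB/s -> MBU ~{mbu:.0%}")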
So, Scott, Val just gave you a fantastic testimonial for why SSD technology is really needed in this space.
Can you unpack a little bit how Solidigm is taking advantage of this moment in terms of delivering solutions that actually balance that low latency with the high capacity that's required for AI pipelines?
Yeah, and I appreciate Val very much for giving that definition from, I would say, our partner point of view, because when we start talking about some
of this stuff, it's like, oh, yeah, they're just touting their line. But NVMe came into existence
through the legacy of the Solidigm organization and where we came from as an organization. And
when you talk about things like latency and the ability to scale, NVMe offered us that
unprecedented opportunity to do that. And you look at all the other infrastructure that's out
there for storage, and you have NVMe at the forefront of that. And we produce those products.
And when you want to talk about the capacity tradeoff, you have the ability to look at the
capacity drives we provide, which give that read performance that everybody still needs while offering
the right level of write performance to back it up when you need to actually pull the data
back into the storage products. And that's how WEKA can take advantage of the capabilities,
because regardless of the product, the drive is no longer the bottleneck. The latency provided by the
products in aggregate is such, to Val's point, that you can pull the data from a pool away from
that. And you can't do that with other technologies. This concept of a
global namespace is allowing you access to all of that capacity at once with extremely
low read latencies, but that nice write-back capability is something you just can't do with
things like a standard hard drive, for example.
And let me jump in on that, because there's a particular term here that is central to all
these discussions, called KV cache, key-value cache.
It's the working memory of LLMs, fundamentally.
And it's a cache.
It's very read-centric.
And there are occasional writes there.
And again, the raw capacity of memory you can provide for KV cache, the raw bandwidth, the latency in a read-centric workload, it's tailor-made, I think, for Solidigm's product line. It's really the sweet spot. And again, timing is everything in these markets. These are solutions that are absolutely ready for prime time today.
Yeah, I mean, Val, you're right. Timing is everything. And what seems to be an ongoing theme of this is utilization, right? Utilization is a huge driver of just overall ROI.
And WEKA, I think you've talked about this specifically, that you guys have about 90% or so GPU utilization.
Can you explain why storage architecture is really pivotal to achieving that level of efficiency?
Sure. And this is a fun topic now because a year ago, we would talk pretty opaquely about GPU utilization.
You know, more is better. I think what people are realizing now is you can basically be busy on a stationary bike or you can be busy on a Tour de France bike, you know, winning a race.
So it's how you're actually using the GPUs now.
versus just keeping them hot and busy.
And there's two primary use cases, training and inference.
In the training workload, it actually still is like the stationary bike discussion
where the busier a GPU is during training,
the more it's actually crunching all the data,
the more it's doing these things called gradient descents
and finding an optimal loss function and this magical checkpoint,
you know, the one that actually works best is what you ship as a model version.
That's the value there.
But GPU utilization is a vastly different beast when it comes to monetizing these models.
Inference is quite complicated.
It's quite a bit of a rabbit hole once you get into it.
But fundamentally, it has two pieces.
A piece called prefill, where you take your prompts and you actually convert them, more or less one for one, into tokens.
But then you create the working memory, the KV cache, by pre-filling it from those tokens.
And that just balloons maybe 100K, you know, with a PDF attachment or
something, of tokens into just many tens of gigabytes, very often, of key-value cache, as you add
up to 10 or 20,000 dimensions to each and every one of those tokens.
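To put rough numbers on that ballooning, here is a back-of-the-envelope KV-cache sizing sketch. The model shape is an illustrative 70B-class transformer with grouped-query attention, not any specific product.

# Illustrative model shape; swap in real values for the model you care about.
layers         = 80
kv_heads       = 8          # grouped-query attention: far fewer KV heads than query heads
head_dim       = 128
bytes_per_val  = 2          # FP16/BF16
context_tokens = 100_000    # e.g. a long prompt with a PDF attached

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # keys and values
total_gb = bytes_per_token * context_tokens / 1e9
print(f"~{bytes_per_token/1e3:.0f} KB of KV cache per token -> ~{total_gb:.0f} GB for {context_tokens:,} tokens")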
So pre-fill is one part of it, very GPU intensive.
So much so that Nvidia just yesterday, just to bring the headlines into the discussion,
announced this brand new GPU category for the first time in their 30-year history called
the Rubin CPX.
Don't ask me what the C, the P, and the X actually stand for;
they just like those letters nowadays.
But the Rubin CPX is basically a GPU
that's dedicated to just one phase of inference,
just this compute-intensive prefill phase of inference,
precisely because there isn't much memory required.
However, that's just the background stuff.
The foreground stuff that you and I see,
which would be the reasoning, the thought process,
in reasoning models, the reasoning tokens,
and of course what we actually pay for,
the output tokens, that's something called decode.
And that's extremely memory-intensive.
So much so that, again, we were asking GPUs in the past, and we will be for another
year or so until this new product actually ships from Nvidia. We're asking them to do two things
at once, like walk and chew gum, and they're not good at it because pre-filling and decoding are
literally asking a GPU to do very expensive kind of context switching. And when you can separate those
two things and have just pre-fill or just decode, now you have an optimally efficient GPU,
which is the most expensive thing we've ever bought in our data centers on a unit basis.
And now you're really getting a true ROI and bang for the buck to the point where we've
demonstrated you can get about five data centers worth of output from one data center
when you get GPU utilization right.
Now, I think that we've stated that SSDs are really critical for this.
And Scott, I want to go back to you because we're hitting this moment when enterprises
are starting to deploy AI in broad proliferation. I've been
at some conferences lately talking to everyone from large retailers to banks, and everybody's got
AI projects moving out of POCs into broad proliferation.
But we know something about their environments.
They're still heavily dependent on HDD systems.
How is Solidigm working with these enterprises to get them transitioned so that they can
actually deploy AI pipelines without paying a latency penalty?
Yeah, it's an interesting thing.
We've all talked over the years about the death of the hard drive, and we know that it's really not going to happen.
The hard drive is here to stay and will be forever and ever, ever, just like tape still exists today.
And they have their place in the systems, and generally those systems evolve.
And the challenge that we're seeing and the value that we're providing is the fact that we're no longer talking hot and cold, but levels of warm, right?
And we're trying to make sure that those levels of warm are appropriately driven to handle, for example, the KV cache that Val was talking about.
So you need to have a footprint of data that can be consumed to create those large KV caches and other aspects of it.
And it's just something that you can't do with a device that's designed to do only read or only write effectively.
You have to have something that can do both at the right workloads effectively.
And that's where our high capacity drives are coming into play.
We're providing that massive amount of storage, 122 terabytes, moving to 245 plus along with the rest of the marketplace.
But we're doing it with a technology and a footprint and focus that says, this is what you really need for it.
We're not trying to say, here's the fastest or here's the densest, but the most optimized for these workloads.
And that's one thing about our team, and you can see it by looking at our site and talking to the people that work here,
that we're driving this toward making the AI optimizations that are necessary so that customers don't have to try to do A or B.
It's really A plus B equals the right answer.
and that's what we're doing with these drives,
is to make sure that that's the solution we're solving for
and not trying to be overly aggressive
of being this, that, or the other,
but just a properly aligned solution for the customer.
So with that, Val, you know, with everything Scott's saying,
I wanted to get your opinion:
as companies move from pilot projects
to what we're now calling kind of AI factories,
what new challenges are you seeing around inference pipelines
and then how does WEKA really address those?
Yeah, I love that question because the challenges are so transparent right now.
If you take a look on Reddit or X, social media, you go to conferences and you ask people, you know, what they're seeing in terms of actual real world adoption, they're seeing two things.
They're addicted to where there's value.
And there's a lot of value in agents, particularly coding agents and research agents.
People are happily paying $200 a month or more now for this because it's generating thousands of dollars a month, if not tens of thousands of value.
On the other hand, the providers of these solutions literally can't take money from their best customers.
What I mean by that is it's so expensive to provide this volume of tokens that agents need,
which is anywhere from 100 to 10,000 times, you know, not percent, but X, more tokens than simple chat sessions.
And it's so expensive to provide these tokens that there's this very prominent problem of throttling and rate limits.
So you get these weird five-hour windows
where you have to optimize
just how many questions your agent
asks of a back-end model
because then the door shuts
and you can't do any more work
with that account at least
for the next four hours
if you get it wrong.
So we're getting into this weird situation
where the inefficiencies of inference
as I talked about earlier on
are becoming really transparent
and we need to introduce efficiencies
like assembly lines to AI factories
which is crazy, right?
When you and I think of a factory
today, there's no way we don't just imagine an assembly line implicitly in that.
We can't think of a factory without that.
The harsh reality of AI inference today is there are no assembly lines.
It's a really primitive process of moving data back and forth.
And again, this re-prefilling over and over, because you run out of this KV cache very quickly,
often in minutes for busy agents.
And then you're back to this expensive process.
And by the way, it takes tens and tens of kilowatts every time you reprefill this.
So we don't have assembly lines yet, and that's one of the key things we're looking forward to: with technologies like Augmented Memory Grid and yours, we're able to actually add streamlined processing and assembly lines to this prefill and decode phase of inference, drop the actual price of high-volume tokens by another 100x to 10,000x, and make these things affordable now.
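As a purely conceptual toy of that assembly-line idea, the sketch below separates prefill and decode into dedicated workers connected by a shared key-value store, so neither role has to context-switch between phases. It illustrates the pattern only; it is not WEKA's Augmented Memory Grid or any vendor's implementation, and the KV store here is just an in-process dict standing in for a pooled, flash-backed cache.

import queue
import threading

kv_store = {}                 # stand-in for a pooled, flash-backed KV cache
prefill_q = queue.Queue()
decode_q = queue.Queue()

def prefill_worker():
    # Compute-heavy phase: turn prompts into KV cache, then hand off.
    while True:
        item = prefill_q.get()
        if item is None:
            decode_q.put(None)            # propagate shutdown once all work is handed off
            break
        req_id, prompt = item
        kv_store[req_id] = f"kv({prompt})"
        decode_q.put(req_id)

def decode_worker():
    # Bandwidth-heavy phase: stream tokens out of the prepared KV cache.
    while True:
        req_id = decode_q.get()
        if req_id is None:
            break
        print(f"request {req_id}: decoding with {kv_store[req_id]}")

threads = [threading.Thread(target=prefill_worker), threading.Thread(target=decode_worker)]
for t in threads:
    t.start()
for i, prompt in enumerate(["translate this sentence", "summarize that report"]):
    prefill_q.put((i, prompt))
prefill_q.put(None)                       # signal that no more requests are coming
for t in threads:
    t.join()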
Now, you know, one thing that comes to mind is scale. And maybe I'm influenced because I was at the San Jose AI conference yesterday,
and I saw WEKA ads everywhere telling me about your performance at scale now,
but I want to turn this to Scott. Throughput at scale is really difficult,
and we're talking about thousands of GPUs.
How do you design the SSD technology so that it can work en masse
to deliver this throughput at scale to feed these GPUs?
Is there some secret sauce there, or how do you approach that?
It's a great opportunity to talk about that, right?
because there's the aspect of what we can do,
but also what we need the industry to do,
because one of the other challenges is
there's no single source
for every solution.
And so when you look at it from that perspective,
a perfect example is when you're talking about
very fast, close-to-the-GPU throughput:
we designed, or co-designed from a hardware perspective
with Nvidia, an ability to do liquid cooling of our drive,
allowing it to be more effective at its throughput
in an environment where we're now removing infrastructure
which puts power back into the system to allow the GPU to operate at a more optimal level.
So we all talk about reducing power and that kind of stuff.
We're never going to reduce the power, but we can certainly optimize the use of the power.
So locally to that GPU, we've liquid cooled the drive,
and we've introduced that product, and it's going to be shipping soon in some of those new platforms.
But if it's not there, then I've got this network attached box.
And to the bottleneck comment that Val made, our solution can be optimized to allow the right level
of write and read performance. You can actually tailor the controller architecture that we provide
through firmware hooks we offer to our customers to let them pick how fast and at what power
envelope they want to run their products. And you can do that across the family of solutions
that we provide. So it's a unique new opportunity to allow you to tailor the throughput at the
scale you need while keeping the infrastructure at the right power levels. And that's something
that we haven't really had to look at in the past because it was run them as fast as you possibly
can, but we know that that's not really what everyone means. They need the system to run at the
most optimal level possible. And I love the concept of the assembly line. As soon as you said that,
it went in my head and I started literally watching data bits flowing through the assembly line
of the AI factory. Yeah, it's not a physical infrastructure, but how that bit gets from a stored
bit on a NAND device to a GPU and back as necessary, and the whole path it has to take.
One of the most critical challenges is the downtime and eliminating the downtime.
I have a niece who works at a factory, and she shut the factory floor down for 30 seconds
to institute something new and forgot to tell her boss, and the boss got so mad at her
because the system was down for 30 seconds out of the whole day.
And we're even more critical of that in this kind of an infrastructure.
So being able to make sure our products are highly reliable as well and the quality is there
is paramount.
And that's one of the things that Solidigm is best known for in that regard.
Now, you've given us a lot of good insight into what the current challenges are, right,
and how WEKA is tackling them today.
But looking ahead, where do you see the next bottlenecks really emerging in scaled inference?
And how is Weka prepared to solve them?
This ties back to the axon topic earlier on.
So many things about GPU computing are just so fundamentally different than CPU computing.
A main one is when you take a look at a standard motherboard for a GPU server,
It has actually two prominent networks, not one.
It has that traditional network that's now referred to as the North-South network,
which is high-performance.
It's up to 400 gigabits very often, which in the world of regular CPU computing
is insane, excessive performance and bandwidth.
But it also has a secondary network, which Jensen increasingly refers to as the heart
of the rack-scale systems that Nvidia sells now.
They focus less on the chips and more on the racks, like an NVL72 rack and so forth.
And that is a compute network, also referred to as an east-west network.
And there's 16 times, one-six times more bandwidth on that east-west compute network than there is on the north-south network.
So the ability to actually address that compute network as well as the north-south storage network is critical towards unlocking some of the bottlenecks here.
And it's actually on the critical path to being able to take Solidigm drives and make them deliver memory value, which GPUs are so
hungry for, for this KV cache working memory we keep referring to.
That's one of the essential bottlenecks here.
And there's a lot of industry movement in this regard, not just at the sort of firmware
or low-level protocol stage, but very much now in the open source community, which is
encouraging to see, because that's where all these cool models we read about, like DeepSeek
and others, that's where the rubber meets the road, and the open source inference servers
take the DeepSeeks of the world and actually enable way better performance at a way lower
token cost by being able to address more and more storage as memory, for a larger working memory,
effectively.
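As a back-of-the-envelope illustration of why that east-west fabric matters when serving flash as memory: the 400 Gb/s and 16x figures come from the discussion above, while the per-drive read rate is an assumed round number for a current PCIe Gen5 SSD, not a Solidigm spec.

GBIT_BYTES = 1e9 / 8                        # bytes per gigabit

north_south = 400 * GBIT_BYTES              # ~400 Gb/s storage (north-south) network
east_west   = 16 * north_south              # ~16x more on the compute (east-west) fabric
ssd_read    = 12e9                          # assumed ~12 GB/s sequential read per SSD

for name, link in [("north-south", north_south), ("east-west", east_west)]:
    drives = link / ssd_read
    print(f"{name}: ~{link/1e9:.0f} GB/s, roughly {drives:.0f} SSDs of aggregate read bandwidth to fill it")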
Val and Scott, this has been an awesome conversation.
I want to ask you to sum up. We've talked about all of the technology and what it's capable of.
We've talked about how enterprises are really starting to throttle up their utilization of
AI.
If you were going to talk to an IT practitioner today about the one thing that they need to be
thinking about when they are navigating the tradeoffs between performance, cost, long-term
productivity, what would you tell them to pay attention to now? Scott, do you want to take that
one first? Yeah, I'll start. I'll say that the one thing that tends to still be an existing
problem in our infrastructure today is the hardware guy is not talking to the software guy.
And I think it's a prominent thing that, as we navigate forward, we need to be
more aligned within our organizations, so that whatever software needs to run has
a hardware-equivalent solution that's not overbearing, not too much, not too little.
We've gone through hyperconverge.
We've gone through disaggregated.
We've done all this stuff.
And now AI's throwing a whole new wrench into it.
And it's just making sure to optimize across both hardware and software.
Wow.
Yeah, I'll dive a little bit even deeper into that.
So again, just taking a look at the fundamental differences between a cloud data center
and an AI factory, an AI data center: a good CPU today
maybe has 100 cores; an average GPU has over 17,000 cores.
So this is so fundamentally different apples and oranges
that my advice to a lot of people,
having learned this the hard way myself over and over again
during major disruptions in tech,
is it's easier to unlearn what you think you know
about these very scalable data centers, these AI factories,
and just relearn from first principles
and relearn from a clean sheet of paper.
Because processing these distributed apps, if you will,
across 17,000 cores per single GPU,
and, God knows, you know, millions and trillions of cores in
large AI data centers, is just a fundamentally different world.
So it's better to sort of just start from scratch,
and apply your experience to how IT operations works and how budgets and cash flow work.
But when it comes to the technology designs and the optimizations and best practices,
get out there, learn, absorb.
It's never been easier to learn anything.
You can ask ChatGPT the most detailed questions,
and it'll explain to you from first principles how this works and how it's different from
what you're used to. So don't bring old biases into this new world. It's probably my best
advice. I love that. Scott and Val, it's been a real hoot. I always want to have you guys back.
While we were talking, you know, one thing that I was thinking about is that our audience is going to
want to keep the conversation going between now and the next episode that you might appear on.
Where can they find information about the solutions we talked about today and engage with you directly?
Sure. Well, for WEKA, it's weka.com. I always Google either, you know, my name or the names of some key folks at WEKA,
like one of our product managers for this Augmented Memory Grid product. His name is Calan, C-A-L-A-N, Fox.
So Google "WEKA Val blog" and "WEKA Calan blog," and all of a sudden some really cool blogs and videos will tend to come up.
Yep. And on our side, solidigm.com/ai
is a great place to start for AI-related activities. You'll find collaborations we've done
with WEKA, as well as how you can deploy different products in different ways. So it's a great
opportunity to look for it. If you want to find me, I'm SMShadley on almost every social platform.
Awesome. Thank you so much, guys. This has been a real pleasure. And with that, we're wrapping
another episode of Data Insights. Jeneice, always so much fun to host with you. Thanks so much to all of you.
Thank you, Allyson. Thanks.
Thanks for joining Tech Arena.
Subscribe and engage at our website,
techarena.ai.
All content is copyright by Tech Arena.
