@HPC Podcast Archives - OrionX.net - @HPCpodcast-107: Paul Bloch of DDN on AI Storage – Industry View
Episode Date: January 23, 2026
Our special guest today is Paul Bloch, President & Co-founder of DDN, the high performance storage and intelligent data platform company. AI runs on massive amounts of fast and reliable data, which makes topics related to storage systems especially important. We discuss a broad range: technical optimizations for AI storage, DPUs and future directions, alignment with streaming, HPC, and accelerated computing, pilot-to-production and training-to-inference technical and operational challenges, sovereign AI and data sovereignty, and more. Join us!
Audio: https://orionx.net/wp-content/uploads/2026/01/107@HPCpodcast_IV_DDN_Paul-Bloch_AI-Storage.mp3
Transcript
We're getting into real-time video, we're getting into changes, we're getting into multi-modal.
So training is still evolving.
However, now inferencing is becoming more and more important because those models are efficient.
And what's super important is to build redundancies across those, right?
So when they hit, either they do not impact the full production,
and second, that you basically get back to full production as fast as possible.
But at this point, you need the AI system to perform 100% of the time.
From OrionX in association with InsideHPC, this is the AtHPC podcast.
Join Shaheen Khan and Doug Black as they discuss supercomputing technologies
and the applications, markets, and policies that shape them.
Thank you for being with us.
Hi, everyone.
Welcome to the AtHPC podcast.
I'm Doug Black at InsideHPC.
and with me is our co-host, Shaheen Khan of OrionX.net.
Today with us is our special guest Paul Bloch.
He is president of DDN,
the High Performance Storage and Intelligent Data Platform Company.
And we're here to discuss issues around storage,
data management, and data delivery in the AI at scale era.
So, Paul, welcome.
Thank you very much.
Thank you, Doug.
Thank you, Shane.
Nice to meet you guys.
Really looking forward to this. Thanks for making the time. Okay, great. So, as I said, this episode explores how
HPC has evolved into the operational foundation for AI at scale, and also why data as opposed to
compute is the primary AI constraint and how AI is shifting into an operational level. Paul,
let me state it this way. We've talked for years about faster compute driving AI.
From your vantage point, what's actually limiting AI systems today?
Well, I mean, it's really not a question of limitation, right? It's what you need to operate
AI at scale, right? So clearly the GPUs are improving at great pace, right? I mean, you saw
Nvidia's announcement from Blackwell to Vera Rubin. I mean, actually, so GPUs are becoming
more and more powerful. The networks are becoming faster and faster and getting low latencies.
And so what you really need is you need a data platform,
a data intelligence and data storage platform
that can really not just keep up,
but really enable the GPUs and the network
to perform at full scale.
And you really need an end-to-end system
to be fully tuned, optimized,
to be able to get the reward and your TCO.
So it's not so much a limitation
as that it's really not trivial to get those very large systems
to perform at massive scale.
And we're seeing it actually with the actual deployments:
depending on the customer,
some of them are extremely successful
because they're basically using the right ingredients,
and some of them are really not efficient.
And so at the end of the day, the GPUs
are not performing the way they should.
So it's really a notion of that's where the HPC,
if you want, background comes in.
I mean, we're seeing more and more that there's
kind of a disconnect between HPC,
what we call HPC people and AI people, because really AI people don't always understand
potentially the infrastructure requirements and that it's really a must and enabling, and whereby
the HPC guys have had the experience for many, many, many years to deploy and get those
systems at full scale. So right now we're seeing actually a lot of the HPC experts are being
hired by the AI natives or by the AI companies to actually help them operate their
systems. We want to really get into that HPC-AI theme in just a moment with Shaheen. But if we could,
talk about the whole notion of data starvation, how that leaves these very expensive GPUs underutilized.
Sure. Obviously, the first level is you're going to invest and use either in the cloud, on-prem,
or on the edge, hundreds, thousands, tens of thousands, hundreds of thousands of GPUs. And what's absolutely
important is to get them in parallel fully operating at 99%. And so that's one of the challenges,
right? So the first challenge is really being able to have any and all of these GPUs in full speed
real time access the data intelligence platform, the data storage, and having that data storage
being able to respond full speed, real time at very low latency. And that's what DDN does extremely well.
and that's really kind of our historic past with HPC,
as well as our 10-year, 12-year relationship with Nvidia
of optimizing end-to-end systems.
And I think also, quite frankly,
Nvidia deploys a lot of our DDN systems
all over the data centers and their cloud.
So that's the first system,
is really the fact of having a parallel access,
not like, you know, what you would say,
like NFS limitations, right?
A lot of the other players claim they can do AI systems
at massive scale,
but really they just can't,
with NFS limitations.
And even if they've improved, they just can't.
The second issue then becomes data intelligence.
Access to data, access to massive amount of data,
access to metadata, access to orchestration,
access to data validation.
And this is all basically what has been added to our platform
to be able to basically build an end-to-end system alongside Nvidia.
Very quick question.
Do you have any sort of numbers on GPU underutilization
as a percentage of total capability?
Sure.
It's actually quite interesting.
We come in and we have tools, actually.
Some of the customers call us, you know, we get into discussions, and they might be using
either historic enterprise-class storage because that's what they had available, or maybe
some of the newcomers.
And we come in and really look at what the efficiency at the level of the system is.
And you'd be surprised.
I mean, we've seen some of the AI natives or the large cloud platforms, and we're seeing, you know,
10, 20, 30% efficiency at the level of the GPUs.
And very often, these companies don't even know
that they're basically running at sub-efficiency.
So we have tools, basically, and have been working very closely
with customers and future customers and prospects
to show them really how to optimize, how to build,
and what are the best practices
to basically deliver with DDN full efficiency on the GPU.
It's extremely important for the GPUs to operate fully
because that's actually where the core of the cost of the infrastructure comes in.
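As a rough illustration of the kind of utilization check Paul describes, here is a minimal sketch using NVIDIA's NVML Python bindings (pynvml). It is illustrative only, not DDN's tooling; the sampling window and the 50% threshold are made-up assumptions.

```python
# Hypothetical sketch: sample GPU utilization to spot GPUs that may be
# data-starved. Uses NVIDIA's NVML bindings (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples = {i: [] for i in range(len(handles))}
for _ in range(60):                       # sample once per second for a minute
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        samples[i].append(util.gpu)       # percent of time the GPU was busy
    time.sleep(1.0)

for i, vals in samples.items():
    avg = sum(vals) / len(vals)
    flag = "  <-- possibly data-starved" if avg < 50 else ""
    print(f"GPU {i}: average utilization {avg:.0f}%{flag}")

pynvml.nvmlShutdown()
```

A sustained low average here does not by itself prove the storage is the bottleneck, but it is the kind of first signal that prompts the deeper end-to-end analysis Paul mentions.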
I love hearing what you're saying, Paul, because as many people know,
my personal view is that AI is a subset of HPC,
and it's the big killer app that has grown to be huge.
I also remember people would say that a supercomputer is a system
that turns a CPU-bound problem into an I/O-bound problem,
and here you are to really address the I/O-bound aspect of it.
It's interesting because we're really part
of both worlds, right? We've lived the HPC world for 20-plus years. We've lived the AI world,
I would say, for 10, 12 years, but really with a massive acceleration over the past three years,
right, with the advent of ChatGPT. And it's kind of funny because we've seen different reactions
from the HPC community and from the AI community. So the HPC community was kind of, you know,
not laughing, but was kind of smirking about the AI guys coming in, right, at first. And I think
they were saying, well, we know all this and we know how to do it,
everything. And the AI guys were basically coming in, well, you see, with new models and new training and
new technologies. And I think at this point, there's kind of a convergence, right? The HPC community
is realizing that actually there is value in AI. They need to evolve their systems into more
closer to AI systems, right, delivering HPC as well. And then the AI guys, in the meantime,
are figuring out that, yeah, you do need the right data infrastructure guys and the right
HPC knowledge to basically get those AI systems to perform at 99, 100%.
So it's kind of been interesting to see the reactions, right, of the communities.
And I think that what's great is that actually at this point, they're converging,
and people are actually getting on the same line.
Yeah, exactly.
I think there's definitely algorithmic and mixed precision convergence.
But in terms of infrastructure and skill set and the best practices to build it,
I think a lot of those are very, very consistent with what HPC has been doing.
And of course, like you said, you have a very long history in HPC.
The other thing I wanted to raise was streaming because I also remember you were in satellite data processing,
Hollywood video processing, data processing, low latency, let's not drop a packet, high reliability.
All of that now ends up being pretty central to what AI needs to do.
Is that true?
Absolutely.
So if you look at the history of DDN, right?
I mean, 25 years ago, we literally developed a groundbreaking system
that would be a parallel access with our own ASIC, our own memory.
We delivered a system where no one in the world was really looking at anything like streaming,
satellite download, and so forth, right?
So that system became instantly a success in HPC.
It enabled us to deliver probably 90 out of the 100 largest HPC systems successfully
across many, many, many, many years, right, at massive scale.
And so what that gave us is basically the experience
and the knowledge of how to run all these massive systems at 100%, pre-AI, right?
So if you look at it, call it luck, call it vision, call it whatever you want,
but we really got the experience of, you know, kind of a 15-year head start
over any other company basically delivering AI systems.
And this is why DDN nowadays is very valued by our customers
because we come in and we deploy massive systems at, you know, groundbreaking performance
and we don't blink an eye.
I mean, it's basically just installed and delivered,
and basically people can go about the business
and basically getting results from the GPUs and the network
and from the application and the models and so forth.
very quickly. So clearly this helped us, and the value is still there today, right? Like you mentioned,
streaming, any-to-any parallel access, low latency. We do not drop a frame. We actually have
the ability to really maintain the quality of service, online upgrades. I mean, obviously all the things
that you would expect, manageability, planning of workloads. So all this is implemented in the
platforms today. Paul, let's talk about customer issues, the prevalent customer issues you run into.
Talk about the transition from pilot AI projects to operational. And what are the key problems,
whether it's compute, networking, or data? I mean, the problems, obviously, you know,
first you have to build the data center, right? Then you need clean power. You need power generation,
you need generators. So the power needs to be clean as well, right? We're seeing a lot of issues,
potentially with data centers, losing power.
And when they lose power and the generators don't get in place,
basically you have those massive systems, basically, coming down.
And it's very difficult to put them back up, right?
So clearly, cleanliness of power, having the right generators, is number one.
Then, you know, clearly you have the GPUs, the network, and the data storage portion.
On the GPU side, GPUs tend to fail, right?
GPUs will go out.
And that's where checkpointing and the fact that DDN enables very fast, either synchronous
or asynchronous checkpointing helps a lot because you literally, you know, if you lose a few
GPUs, you can really just go back to your latest checkpoint, right?
Think about your save in Microsoft Word or Excel, right?
You can really clearly just instantly checkpoint, right?
That's one of the massive advantages.
I mean, one of the advantages DDN has is basically that checkpointing
that enables you to go right back.
You have 10,000, 100,000 GPUs
on a simulation that lasts two or three weeks
or a month for a new model,
and you certainly don't want to lose any work
when you lose a GPU.
So that DDN enables extremely well.
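For readers who want a concrete picture of the checkpointing pattern Paul describes, here is a minimal, hypothetical PyTorch-style sketch of periodic checkpoints written to a fast shared filesystem. The mount path and naming scheme are assumptions for illustration; this is not DDN's implementation.

```python
# Hypothetical sketch of periodic training checkpoints to a fast shared
# filesystem, so a GPU failure only costs the work since the last save.
import os
import torch

CKPT_DIR = "/mnt/fast_parallel_fs/checkpoints"   # hypothetical shared mount
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, step):
    # Write to a temp file first, then rename atomically, so a crash
    # mid-write never corrupts the latest usable checkpoint.
    tmp = os.path.join(CKPT_DIR, f"step_{step:09d}.pt.tmp")
    final = os.path.join(CKPT_DIR, f"step_{step:09d}.pt")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, tmp)
    os.replace(tmp, final)

def restore_latest(model, optimizer):
    # Zero-padded step numbers make lexicographic sort give the newest file.
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pt"))
    if not ckpts:
        return 0
    state = torch.load(os.path.join(CKPT_DIR, ckpts[-1]))
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

# In the training loop: call save_checkpoint every N steps; after a GPU
# failure, resume from restore_latest() instead of restarting the run.
```

The faster the storage can absorb these writes (synchronously or asynchronously), the more often you can checkpoint and the less work a failure costs, which is the point Paul is making.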
The other issues become network issues, right?
You have network latencies,
you have network disappearing, you have ports,
and the same way we've been working
very closely with Nvidia
to be able to diagnose this very
quickly. Every issue gets pointed at the data storage or the data intelligence platform, and clearly what
we're finding is that about 70, 75% of the time, it is a network potential issue. So it's extremely
important to have the tools, which DDN has, to troubleshoot, diagnose, and then automatically
basically resolve these and get the system back to optimal shape. Then on the data storage, quite
frankly, it's about efficiency, it's about reliability. It's about,
basically predictability.
And on those, I think to achieve all this,
it's really not trivial at all.
A lot of people might be using NFS-based systems.
They think they're doing well.
But your analogy is like using a bicycle
instead of a Harley-Davidson, right?
You're happy with using a bicycle
if you've never done anything different.
However, if you go on a motorbike,
you'll enjoy it much more, and it's much faster.
So again, the same way.
I think that when people get on the DDN platform
running AI and HPC, they really stay with us for many, many, many years.
And this is why DDN has been basically a very successful enterprise for the past 25 years.
And we anticipate remaining the AI leader in data storage and intelligence.
Let's take a look at the AI landscape, since that's really where the party is.
There are reports of tens and tens of chip vendors that are addressing different parts of AI.
Many of them build their own systems now.
Every one of them really needs to bring compute, data, network, memory, the whole thing together.
What is your perspective on that?
I mean, Nvidia continues to be the 800-pound gorilla.
How do you navigate all of that?
How do you make sure you don't get blindsided by something?
I mean, you're going to get blindsided by something anyway, right?
Even if you prepare for it.
But, I mean, that's what makes life exciting.
That's where we're still working at it, right?
That's where my partner and I and our team are still excited every day of the week.
It's not just by the fact that we're, you know, obviously successful.
It's really what we're driving, the passion, and really understanding and enabling the industry that drives us.
So when you look at that, I mean, clearly, Nvidia, like you said, 800-pound gorilla, no doubt.
But you have other players.
You have new players for inferencing.
Look at what Nvidia just did with Grok, right?
Basically enabling and basically going to integrate the Grok inferencing model,
because clearly Jensen recognized that they had some value in the space.
So I think this is an evolving space.
Nobody has the right full answer.
And I think as we move forward, because this is still early days in the AI era,
you will have newcomers, you will have new chips, you will have new inferencing models,
you'll have new algorithms and so forth,
and it will basically get implemented into the market.
So it's exciting.
It's a very rich ecosystem.
We are working with a lot of partners.
Our partner ecosystem is becoming wider and wider.
So it's difficult on us and resources,
but we are doing it because you don't have a choice, right?
The same way you need to be working with the various orchestration platforms,
the same way you need to be optimizing for models,
the same way you need to be optimizing for different networks.
So, I mean, it's part of our heritage, right?
the fact that we've been an open system for 20, 25 years.
I mean, if you look at it when we started, it was a heck of a lot more difficult.
All the systems were closed, and we had to literally force DDN in on an IBM system
or a VAX system, whatever, that was completely closed, and we were able to deliver value there.
So today, the world is much more open, so we are thriving.
And the experience that we've had of delivering systems, no matter what operating system you use,
matter what chips you use, no matter what network you use, is extremely useful as well. We're very
flexible in that space. Right on. Now, there is definitely the whole training to inference part of
AI, and the industry seems to be shifting more and more towards inference as it gets used and it becomes
part of business processes. But there's also kind of pilot through production. Would you speak to
what you're observing in the market as customers transition from training, or maybe they don't even
do training, they're into inference, and how they go from pilot to production?
Sure. If you look at the market, right, the market is still driven by probably about 20 to 30
companies, right? The companies that are driving most of the volumes are the AI natives, the cloud
providers, the model makers, those are still training. The training has been going on and it's still
growing. I mean, if you look at it or even at Jensen's presentation today, right, the models are
basically growing 10x in parameters every year. And, you know, people are moving to millions of
GPUs where we thought that two years ago, 100,000 GPUs would be the goal, right?
At this point, you know, people are talking about 1 million, 2 million, 3 million GPUs. So,
clearly the training remains and will remain probably for, you know, the next two, three years.
It will still require more and more GPUs because we're still at the infancy of training.
If you look at the data, we're getting into real-time video, we're getting into changes,
we're getting into multimodal.
So training is still evolving.
However, now inferencing is becoming more and more important because those models are efficient, right?
And so inferencing is coming in and permeating in all the various industries, fine-tuning, applications.
And so inferencing is going to become much wider because it applies,
and it then permeates into the enterprise.
And the enterprise has barely started.
Right now, it's in an infancy.
People are using Gen AI.
They're using ChatGPT.
But they haven't really yet, you know... I mean, some of them are moving, right?
We're seeing pharma.
We're seeing some of the medical guys that are moving into production from the pilots,
absolutely.
But it's still in the infancy.
Edge, sensors, everything still remains to be created over the next three,
five, ten years.
It's going to be incredible.
Well, you all use the term static storage architectures in the face of continuous AI pipelines.
How would you describe static versus dynamic storage?
Well, I mean, you have to be dynamic at this point, right?
So when you look at dynamic, you really need a data platform that addresses any and all the problems and requirements, right?
It's no longer good enough to solve maybe just metadata, or just enterprise data,
without solving mirroring or data movement; you need to be able to do it all.
You need to be able to have clearly a cost-efficient way of storing petabytes, tens of petabytes,
hundreds of petabytes, or even exabytes.
We're seeing nowadays exabytes is becoming the new petabyte for companies.
So an exabyte is no longer that massive.
People are deploying five, ten, twenty exabytes.
So there's an explosion of data.
So you need a cost-efficient way, number one.
and power-efficient, with the density to be able to deploy those systems.
Then the system needs to be able to match and enable the GPUs.
So you have a notion of performance, bandwidth, I/O
to those systems.
Then you have a notion of intelligence.
You need to be able to retrieve, right?
When you want to do a simulation, for example,
we are about 500 times faster than anybody else,
thanks to our metadata search engines,
at figuring out which of the objects you want
to look for for a simulation. Because this is daunting. You have billions and billions and tens of
billions of files. What are you going to search through? So you need to be able to do this.
Then you have intelligence, KV cache coming in, you have the orchestration, you have the data
movement. You potentially also now, with the latest commodity scarcity and pricing issues,
need to look at how you manage your data across different types of devices.
Everybody was rushing to get SSDs or memory, and at this point, you know, the costs are escalating tremendously.
So potentially hybrid infrastructures that mix SSDs and hard drives are becoming super important for those customers,
because otherwise they're going to break their budgets.
You mentioned key-value cache and some of the other technical challenges that you've addressed.
What are some of the more recent challenges?
Like you mentioned, you know, simplicity, reliability, serviceability, throughput,
but also scalability, hybrid NVMe SSD, like special optimizations for KV cache or for prefill or decode or
transformers.
Sure.
As you pile on and you have a kind of a laundry list of all these things, then some serious
optimization emerges because there's a cumulative impact of all of these.
What are some of the highlights you can point to that have been challenging in recent times?
So clearly, yes.
So on the training side, I think we've pretty much resolved most of the issues of running really, you know, thousands, tens of thousands, hundreds of thousands of systems against our system when you do training and those simulations.
I think this has been worked out and those are really up and running and really stable and so forth.
When you're talking about inferencing, right, we're still in the early days.
And KV cache becomes super important because think about millions of people doing prompts, right, doing prompt engineering.
And the idea of KV cache is that so far, when you do a search, you're literally holding the GPU.
That means you have your first search.
You know, the system knows very well that you're going to improve your prompt.
You have to go back and make it better or add on some details.
So it basically holds right now in the GPU and GPU memory that data.
And this is extremely costly.
So what you want to do, where KV cache comes in, is that you're really using DDN as an intermediary
storage layer that basically is extremely fast and low latency.
So really you have a connection between the GPU and us, whereby your prompt gets stored and moved
from GPU to DDN at a much lower cost and TCO, so you release the GPUs,
but you still have your data and you still have your prompt getting ready to basically
be prefilled and re-decoded after the fact.
And so when you do this by millions or tens of millions of requests, you know, KV
cache becomes extremely important because otherwise you won't be able to do it with all the GPUs
you have. You're basically keeping them busy for no reason. Does that make sense?
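As a toy illustration of the KV-cache offload pattern Paul outlines, the sketch below moves a session's attention key/value tensors off the GPU to an external storage tier between prompt turns and reloads them when the user continues. The storage path and function names are hypothetical, and this is not DDN's Infinia API.

```python
# Toy sketch of KV-cache offload between prompt turns: move the key/value
# tensors off the GPU to a fast external tier, free GPU memory, then reload
# when the user continues. Path and names are made up for illustration.
import os
import torch

KV_STORE = "/mnt/fast_kv_tier"           # hypothetical low-latency storage mount

def offload_kv(session_id, past_key_values):
    """Persist a session's KV cache and release GPU memory."""
    cpu_kv = [(k.to("cpu"), v.to("cpu")) for k, v in past_key_values]
    torch.save(cpu_kv, os.path.join(KV_STORE, f"{session_id}.pt"))
    del past_key_values                   # drop GPU references
    torch.cuda.empty_cache()              # the GPU is now free for other requests

def reload_kv(session_id, device="cuda"):
    """Bring the KV cache back for the next prompt turn, instead of
    re-running a full prefill over the earlier context."""
    cpu_kv = torch.load(os.path.join(KV_STORE, f"{session_id}.pt"))
    return [(k.to(device), v.to(device)) for k, v in cpu_kv]
```

Whether this trade is worth it depends on how fast the external tier is relative to recomputing the prefill, which is why Paul stresses low latency for that intermediary layer.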
Yes, absolutely. So the other thing that complicates everything that we all do is the emergence
of new technologies, new miniaturization, when you look at networking and storage and the emergence
of DPUs as a major pivot point there. What does that mean in terms of the system architecture,
the hardware architecture that you have to pursue on a roadmap?
What does that do with the roadmap?
So actually in our case, right?
I mean, DDN is really software, right?
Historically, we've delivered complete systems
because customers like one throat to choke, right?
You basically have a one point of support
and you have someone responsible for the quality
and the efficiency of the full system
and the performance of the full system.
But really, you know, 97% of what we do in IP and R&D
is software.
And so our software actually, our latest software,
called Infinia, is really a groundbreaking platform that is built for the AI time.
So you can load it, you know, you can actually load it on your PC or your Mac.
You can load it on a GPU.
So you can literally kind of run that software, call it inferencing software,
that basically can run on a GPU on the sensor and at that point validate data and move it
to your core to be able to basically extract the value of what data you really need.
So think about DDN as really kind of an evolution from complete systems, which we will continue
delivering, but really to a software play that can load on DPUs, on memory, close to a GPU,
no matter what the application might be.
And so think about robotics, humanoids.
We think that there's a great future for DDN with Nvidia and other robotics companies
to basically deliver software value to manage and be able
to basically drive those humanoids.
Related to that really is what is a system anymore?
Because if it becomes more and more modularized and there's software glue that brings it all together,
doesn't that impact the definition of what a system is?
Well, I mean, you know, the system is whatever you define it, right?
I mean, historically you think of a server, you think of network, you think of all that,
and that's going to change, right?
You're going to have basically, it's going to be all about some type of a GPU or intelligent or DPU unit,
but at that point the software can run very close to that GPU and DPU
that's going to be super important, especially on the edge.
So the software and the intelligence software is going to become more and more crucial
because not only do you need access to data,
you need to be able to ingest it at the right performance,
you need to be able to validate it,
you need to be able to have some intelligence there
to keep whatever data is going to make sense or not.
Paul, explain for our listeners the issue around sovereign AI,
and the notion that sovereign AI is emerging as the default operating assumption.
Right now there's a battle for intelligence.
You see it, you hear it in the press, you hear of the battle between the US and China and other countries.
And really it's a notion of creating intelligence, right?
So the AI factories, the sovereign AI factories, it's a notion that each of those countries is going to need to build their own intelligence and sovereign AI factories for their services,
for their R&D, for their universities, for their people, for their IP,
for delivering easier systems to their citizens, right?
Like getting a real-time passport or just a passport on your phone
or getting, you know, just walking through the airports at this point
and not having to speak or see anyone and then just being cleared directly.
I mean, there are so many support functions and services that can be improved.
But sovereign AI becomes something where you control your destiny,
you control your IP, your data, your intelligence,
and at the end of the day, it's all about data and your creation.
So it's going to be super important for each and every country,
which we are seeing, to build up their own AI value.
And it doesn't mean, you know, really...
And so people say, yeah, but you're still buying Nvidia,
which is an American company.
But that's irrelevant.
I mean, it's really because those are the tools;
what you're creating on top of it,
you own that, right?
So it's super important that, you know,
not every country is going to have the technology.
They need to source the technology,
install it in country,
and control basically what's coming in
and keep it within country.
So I wanted to turn the conversation to the future.
We've seen really the emergence of AI,
not just as a user of storage,
but as an enabler of future storage.
We've talked about semantic storage,
computational storage.
What does that do to the future of the data management aspect of AI, whether it's hardware or software?
It's what we said: to be efficient in AI over time, you're going to need to have something close to an operating system for data.
And this is what DDN has been building, which is really a notion of addressing on-prem, cloud, edge, as well as different types of computation, I mean, either in-chip computation or close to the network.
I mean, I think at this point, what you will need is, ideally, a solution
that encompasses all the various use cases in a very simple way, and that can scale easily as well.
So this is software-based, it loads, and basically it's very flexible depending on
chipsets or networks or DPUs or whatever is going to come up, and this is what we've been working
on and deploying.
So, Paul, talking at scale in AI factories, the part of the market that you're really addressing,
typically where do things break down first, you know, compute, network, or data? Where do you see
the most common problems that your customers are up against?
The problems are going to come from pretty much every component of the system, right,
from power, from GPUs, from network, or from data storage.
And what's super important is to build redundancies across those, right?
So when they hit, either they do not impact the full production,
and then second, that you basically get back to full production as fast as possible.
So there's going to be issues, right?
You're talking about... we spoke about potential power issues, you know, generators,
having to reboot systems.
You want to avoid all this.
GPUs will fail, but then we have mechanisms like checkpointing and ways around this
so that you do not lose your simulation that you've been working on for weeks or months.
And then network, you know, network is probably where the core of the issues comes from, right?
There's an inherent instability from time to time when you push the network, you get some issues.
And so people point to the storage, but then it's not the storage.
So it's really crucial for whoever runs these to have the data storage, and the data storage people,
be able to identify when the issues are storage-based, data-storage or data-intelligence based,
or when they're network-based, right?
And when we look at most of the systems,
most of the issues come from the network,
but you still need to resolve them.
And that relies upon then basically the data storage guys
to say, yeah, this is the issue, this is the trace,
this is where it's at, and this is how to resolve it.
So we're working very closely.
We've got automated tools to be able to do this,
because at the end of the day,
the big difference between AI systems and HPC
is that AI systems cannot go down, right?
They need to operate 100% of the time, no matter what.
And these are basically things that we've had to work on in extreme detail over the past three, five years.
Because, you know, when you talk about the old times of HPC, you know, those people would understand why a system would potentially go down or what the issues were.
In the age of AI, you know, nobody cares.
People want online system 100% of the time.
If you have issues with SSDs or servers or controllers or whatever, you have to resolve it in the background while the system is performing.
So this is obviously a pretty big challenge and this is the one that we're tackling pretty efficiently.
That's excellent.
I've heard you use the word operationalization of AI.
And that sounds like what causes these requirements to appear in a big way and AI becoming more of a mission critical thing rather than an app that you're running in the background.
No, absolutely. Think about it. I mean, it's basically quality of service. It's like the old days of your phone,
when you'd call and it would disconnect. Whereas at this point, you need the AI system to perform 100% of the time.
You know, it's becoming as crucial as a banking system or more crucial.
When the access to AI, especially when it's going to permeate digital transformation or industry or omniverse or your plant or your, you know, medical systems and everything, it needs to be 100%
operational. And so that's really what we've been aiming at. If you look at the efforts this year,
we are gearing towards that 100% uptime. That's really what we hear from customers, right? They
want uptime full on. They don't want to sustain any type of downtime, for upgrades or whatever.
It needs to be uptime full on. Excellent. Doug, what have you not asked? Or Paul, what have we not
asked? No, I mean, again, it's an exciting time, right? I think that it is
challenging. I think people are super stressed because there's so much happening. Think about an industry
which is developing in, you know, 20, 50, 100,000 different ways, and you have to be basically up on all
these issues, all these challenges, all these opportunities. And I think it is an interesting time
because everybody's working probably two times or four times more than they were before
and still not believing they're achieving a full result because there's so much to be done.
Right, so I think there's kind of a new wave of, kind of, a feeling. I've been talking to
people, right, and they feel that the more they work, they're still not fully satisfied,
because they cannot complete all the tasks they have. And I think you need to be able to kind of
evolve and adapt and do the max you can. But at the end of the day, you're not going to be able to do
everything, so you need to focus on what's super important and what you're going to be able to do
and add the most value.
But the AI era is going to profoundly change the world,
I mean, the world, the way we work,
the way people interact, everything.
And, you know, you have to, I think at this point,
you have to embrace and basically learn the tools.
The same way, you know, when people ask me,
well, is AI going to replace my job?
I said, no, if you basically embrace AI
and actually enhance what you're doing
and provide more efficiency,
you basically have job security.
So it's a notion of really,
kind of a pretty, pretty, I think, exciting time, but at the same time challenging. And so people
are going to have to evolve and accept that AI is here to stay. Paul, I had one just as an aside.
I was talking with Alex at SC in November, talking about this whole notion that you all have been
in HPC storage for all these decades. And then as AI really emerged, it was apparent that this would
be the next wave, even if it is, as Shaheen likes to say, AI is an HPC workload, but that you are well
positioned to take full advantage or to leverage or to ride this next wave. And he said the key
moment that he was convinced that you all were in the right spot was when ChatGPT came out three
years ago in November. But I'm curious if that news, that explosion of LLMs and generative AI,
if that was a difficult transition for DDN.
Yeah, not at all.
I mean, if you look at it, I mean, I will tell you,
the one thing we saw pretty early is the alignment with Nvidia and Jensen, right?
We did this about 10, 12 years ago, and I think that was kind of the paradigm shift, right?
We basically aligned our engineering, we aligned our understanding our vision
into understanding what Nvidia was doing and really partnering with them,
Nvidia started using DDN systems across all of their internal systems and cloud, right?
And so that created a lot of engineering value, a lot of learning.
It was not about marketing or selling.
It was really about moving the industry and delivering end-to-end system that would deliver AI.
So clearly, we had been working for probably seven, eight years when ChatGPT came up, right?
So we have built a lot of large systems alongside with Nvidia or Nvidia customers.
We were the first ones delivering 100, 200 SuperPODs with Nvidia, as the only data
storage platform there.
And so really we were ready for the ChatGPT moment.
What ChatGPT did for us is really not so much challenge us,
but just basically delivered many, many, many, many opportunities across the world, right?
And delivered more business outcomes.
But, I mean, no, we're ready.
I mean, DDN has been ready for the past, you know, whatever, 10 years.
You know, we learned for the first 15 years,
how to deliver very large systems at full performance, at full efficiency,
at the best price points.
And basically with AI, we're now clearly optimizing for AI,
but we're built for that.
And so we know about it.
That'd be my take, too, Doug, that the market came to DDN
rather than the other way around.
It's sort of like suddenly everybody wants what we've been doing for years,
and isn't that great?
Only more so, right.
Yeah.
No, that's exactly right.
I mean, what we saw is that instead of having basically a population of potentially
500 customers in the world, right, when we were in HPC, with all the various large data centers,
we now have much larger potential single customers and obviously many more of them.
So, you know, you have to remember that, you know, for a very large HPC system, the DOE
was spending basically 500 million or 600 million on a very large system, right,
three, four, five years ago.
And nowadays, people are spending $1 billion, $2 billion, $5 billion, $10
billion, $50 billion on systems, like it's no problem.
Right on.
There's an explosion out there, which is quite interesting.
So Paul, in the context of what you see looking forward, you've mentioned that a little
bit, but how do you see the strategic decisions that you need to make?
Is DPU a threat or a leverage point?
Does it accelerate you becoming more of a software company and therefore, no worries,
we're just going to layer on top of what they do, or is that a threat?
Have you considered doing your own systems?
Yeah, no, absolutely.
No, I think it's enabling because you're going to need the edge to core to cloud,
on-prem.
You're going to need all these systems to basically perform efficiently and in unison.
And for that, you need potentially fully optimized, purpose-built specialized hardware and software
systems, right, for the core training system.
And then when you come to DPUs or sensors or humanoids and so forth,
you only need software that can be close to the unit of intelligence in this device.
So really, it's not a threat. It's convergence.
We align completely with Nvidia.
We align with other partners as well.
And this is really what makes it very exciting for us,
is that we have the right technologies to be able to deliver the solutions
no matter what you're trying to gain.
Right.
So you could be at small scale,
you could be at medium scale, you could be at massive scale,
you could be training or just inferencing.
I mean, the data and the IO profiles from training or inference are very different, right?
The challenges are extremely different.
And interestingly enough, right, in some of the early robotics testing
that we've done, we're finding that actually the performance,
the bandwidth, all the metrics remain: to enable, you know, value and TCO on your investment,
you do need the data intelligence platform to deliver performance.
And that's where DDN is king, the performance.
When people realize that they do need the performance,
we're the only ones delivering that extreme performance at any scale,
and that makes a huge difference at the level of the investment.
Yeah, it may just be me, but I think what you do at the edge,
I'm less familiar with.
I obviously know that you've been in the data center and, you know, et cetera.
But I think that edge part is really exciting.
This is really new, right?
I mean, if you're looking, this is really evolving, right?
This is going to happen over the next two, three years.
It is loading on DPU.
You saw the announcements with BlueField-3, now BlueField-4 with Nvidia.
And so we're able to really not just use them as cards or extensions.
We're actually able to run our software on a DPU, right?
So you do not need CPU or GPU.
All you need is a DPU and our software to run,
basically, inferencing software, validation, and so forth.
So this is something that we are working on and that is, in part, there.
But as the market evolves, we are ready for it.
Yeah, yeah, yeah.
That's also a good answer.
Thank you.
Okay, Paul.
Well, a wonderful discussion.
We've been with Paul Bloch, president of DDN, and thanks so much for your time.
We could go forever, and I really appreciate your time, Paul.
Thank you.
Thank you, Doug and Shaheen.
Really appreciate your time.
and that was a great discussion.
Appreciate it. Thank you.
Very good.
Take care.
That's it for this episode of the At-HPC podcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comment section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The At-HPC podcast is a production of OrionX in association with InsideHPC.
Thank you for listening.
Thank you.
