@HPC Podcast Archives - OrionX.net - @HPCpodcast-87: Penguin Solutions on AI Infrastructure – Industry View

Episode Date: July 9, 2024

In this special instance of our Industry View feature we are joined by Jonathan Ha, who is Senior Director of Product Management for AI at Penguin Solutions. We discuss the design and deployment of large scale AI infrastructure: why AI at scale is such a critical need, where the challenges lie, and what it takes to do it right.

Audio: https://orionx.net/wp-content/uploads/2024/07/087@HPCpodcast_IV_Penguin-Solutions_Jonathan-Ha_AI-Infrastructure_20240709.mp3

Transcript
Starting point is 00:00:00 All these challenges lead to product delays, quality issues, and even loss of business. And underestimating these challenges is why most AI projects fail. In fact, Harvard Business Review did a study and they put the failure rate as high as 80% for AI projects. We've been deploying and managing these large GPU clusters for Meta since 2017, back when they were still called Facebook. Through the years, we've collected, analyzed, and gotten insights from all kinds of data points in their clusters. In the cabling alone, it could be miles of cables and thousands of ports, so there
Starting point is 00:00:42 really needs to be some type of methodology. From OrionX in association with InsideHPC, this is the @HPCpodcast. Join Shaheen Khan and Doug Black as they discuss supercomputing technologies and the applications, markets, and policies that shape them. Thank you for being with us. Hi, everyone. Welcome to the @HPCpodcast. I'm Doug Black at InsideHPC with Shaheen Khan of OrionX.net. And this is a special instance of our Industry View feature in which we take on major issues in the world of HPC and AI through the lens of an industry leader. Today, we have with us Jonathan Ha. He is Senior Director of Product Management, AI Solutions at Penguin. He's been in the industry for more than 25 years. He previously held senior positions in product management at Microsoft, AMD, and AWS. So,
Starting point is 00:01:40 Jonathan, welcome. Thanks, Doug. Glad to be here. Okay. So today we're talking about AI at scale. Huge issue, major opportunity, and major challenges. So if you would, please, Jonathan, set the landscape for us. Share your thoughts on why AI at scale is such a critical need. Let's get into why it's so challenging. Sure. And thanks for that intro. Yeah, so today, organizations of all sizes are under intense pressure to leverage AI as a competitive advantage. And while we're still
Starting point is 00:02:11 in the early stages, only about 40% of enterprises are actively using AI, while another 40% are intently exploring AI options. And even though there's this massive interest, 38% of IT professionals admit that a lack of technology infrastructure is a major barrier to their AI success. So this road to successfully deploying and operating AI infrastructure at scale can be filled with often requires specialized knowledge to design, build, deploy, and manage a complete integrated system. Organizations that are rushing into this without the right experience often end up with disappointing results. Things like poor performance, wasted GP resources, frustrated data scientists, and even lost investments. Yeah, it obviously extends beyond technology and components. There's a whole need for AI talent. Oh, for sure. I mean, on top of the challenges that I mentioned, nine out of 10 organizations suffer from a shortage of AI talent.
Starting point is 00:03:17 And this exposes a lot of execution gaps in the AI system from a design and deployment perspective. Even if the hardware infrastructure is there, 83% of organizations admitted that they weren't able to fully utilize their GPU and AI hardware once it was deployed. All these challenges lead to product delays, quality issues, and even loss of business. And underestimating these challenges is why most AI projects fail. In fact, Harvard Business Review did a study and they put the failure rate as high as 80% for AI projects fail. In fact, Harvard Business Review did a study and they put the failure rate as high as 80% for AI projects. That's two times more than other corporate IT projects. But I think there's hope. And one major source of hope is working with a partner that has that deep experience and expertise in designing, building, deploying, and managing AI infrastructure at scale. Someone that
Starting point is 00:04:04 has the battle scars and wounds of working through and addressing these key challenges with customers. So you guys have deployed some really big AI systems for some really big customers and certainly qualify for having those battle scars. What do you see when you work with customers in terms of those big challenges? Yeah, no, that's a great question, Shaheen. And, you know, AI infrastructure is very different and much more complex than general purpose or even HPC compute infrastructure.
Starting point is 00:04:33 There's new technologies like GPUs and InfiniBand networks. It's a lot of complex cabling and integration. And with GPU power being so high, it's really challenging to incorporate the design implications around data center power and cooling capabilities. And organizations who try to take this on without the necessary skills or experience often end up with these very disappointing results or even failures where they're just stuck with a bunch of hardware that costs them millions of dollars. But there are three kind of main areas where we see customers challenged or have some pitfalls.
Starting point is 00:05:10 The first is around designing those AI clusters to surround bottlenecks and limitations across the different technologies. The second is a limited pre-deployment integration and testing. And then finally, I think customers really underestimate the challenges with operating this type of system at scale. Yeah, that's a really good way of putting it. So let's go down that list the way you had it. If I remember, the first one was just designing the AI cluster and figuring out where the bottlenecks and limitations are. What do you see and how do you suggest that people go around it? Yeah. So a lot of focus has been just around the GPU servers themselves.
Starting point is 00:05:48 And that's one thing, but using an unvalidated rack, row, and even cluster architecture that includes storage and networking components can lead to some unforeseen performance bottlenecks. I mean, for example, inefficient integration of multiple network technologies and topologies like Ethernet and InfiniBand can lead to poor performance that doesn't scale and may even lead to higher costs. And then, you know, we talked about this earlier, but suboptimally designing around the data center's power and thermal constraints can lead to stranded or unutilized power capacity. And that may limit the number of GPUs you can actually deploy in the data center.
Starting point is 00:06:30 The second one was pre-deployment, pre-deployment integration. And it sounds like something that you really need to do and not just do in real time when stuff shows up. What are the complexities there? Yeah. When we talk about rack integration, even inexperienced customers or even system integrators may not have like the methodologies or best practices to build these racks and clusters with complex components and very intricate cabling. I mean, the cabling alone, it could be miles of cables and thousands of ports that there really needs to be some type of methodology and experience to put that together. And with limited experience around that, that can lead to
Starting point is 00:07:14 a nightmare to support or troubleshoot problems once it's deployed at scale. And additionally, from a pre-production testing standpoint, beyond just the nodes, you really have to do some type of full rack and even mini cluster performance testing because that cluster might look good on paper, but perform poorly once deployed. And you need to validate all that stuff before you move into production and try to debug and optimize once deployed. Yeah. And also at that point, you have storage and software stack and everything now has to come together. So the whole thing needs to sync. Now, one thing you mentioned, and I think this was the third one, customers underestimating the challenge. And I think that's a really important point to get the nuance of that out. and even the big hyperscalers struggle to keep their GPU fleet at high availability. We hear that they're struggling to keep it in that 30 to 80% availability range.
Starting point is 00:08:12 And this significantly impacts their ROI, potential revenue streams, and ability to utilize that big expensive AI infrastructure to its full potential. Jonathan, when you say availability, you mean like utilization or just downtime? Yeah, so it's downtime. These server nodes are large, power-hungry, complex systems. There's multiple network fabrics from the front end to the back end. There's a lot of different pieces that from a health monitoring and management standpoint, like CPUs and general compute monitoring and management is fairly mature, but monitoring the health of GPUs, these high-speed transceivers, liquid cooling systems, and so on is new and complex. And that's where
Starting point is 00:08:56 underestimation of the amount of experience and expertise you need to manage these systems. Customers like Meta, with Penguin software and services, have been able to improve their overall cluster management. And now they are able to experience 95% availability in their fleet of tens of thousands of GPUs in a very consistent basis. Yeah, excellent stuff. So Jonathan, I fear that people listening to this array of challenges and problems might be willing, probably want to roll up in a fetal position. Now, we also know, though, that Penguin recently announced Origin AI, which is an AI at scale
Starting point is 00:09:34 solution designed to address a lot of these problems. Why don't we talk about that a little bit? Sure, Doug, and thanks for that. Like you mentioned, we just launched Origin AI last week, and it is Penguin's complete AI infrastructure solution offering that's designed.S. Navy on deploying and managing their big AI deployments. We're taking that know-how and experience and productizing it into Origin AI. This offering is really intended to take out all the guesswork and worry and give our customers a big head start on implementing their AI strategies without having to reinvent the wheel. Okay, yeah, let's get into that in a little more detail, if you would. Yeah, sure. Now, to make it more concrete, Origin AI includes predefined and validated AI architectures in small, medium, and large-sized cluster templates.
Starting point is 00:10:41 These templates are then integrated and validated in our factory at a rack and cluster level. We then include our skilled cluster management and health monitoring software to help manage that environment. And finally, we bind all this together with our AI experience deploying and managing over 75,000 GPUs and a full suite of design, build, deploy, and manage support services. With this solution, our customer is able to reduce risk and accelerate time to value. They can also realize the full cluster performance potential right at deployment and finally maximize their ongoing GPU availability and utilization. Let's get back to the industry. When you look at the customer landscape, who would be an ideal candidate for this kind of an infrastructure to say, okay,
Starting point is 00:11:31 I'm now ready to get one of these and I can basically plug and work? Yep. Yep. No, that's a great question. At a high level, ideal candidates for Origin AI would be organizations who are looking to rapidly deploy and operate AI infrastructure at scale. And that's from hundreds to tens of thousands of GPUs. And so we see two main customer types. The first are organizations or enterprises who are deploying on-prem AI infrastructure at some scale. And then second are service providers who are offering AI infrastructure as a service for third-party customers.
Starting point is 00:12:07 Awesome. Different landscapes, each of them, because one is on-prem for internal use. The other one is, of course, also on-prem, but as a way of selling to others. Yeah. So digging a little deeper with this race to utilize AI as a competitive advantage, Origin AI is great for enterprises and organizations who might be upgrading, expanding, or deploying new AI infrastructure for model training, model tuning, or developing new generative AI services. This solution also applies to those
Starting point is 00:12:40 who might be using AI infrastructure to accelerate high-performance workloads like computational fluid dynamics, financial simulations, climate modeling, and so on. Previously, organizations may have had a challenging experience deploying or managing their GPU cluster, and now they're looking for some help to do it faster, increase uptime and availability, and ultimately get better performance and ROI from their AI infrastructure. Yeah. How about industries? If you go take a vertical look, you mentioned financial services, who else? Yeah. We see a lot of interest from energy companies, retail. Oh, interesting.
Starting point is 00:13:16 Healthcare, obviously, manufacturing, as well as government and research institutions. It was nice to hear HPC and the numerically intensive computations, the traditional ones mentioned. For the enterprise, I can see that depending on the enterprise, they're going to have those things. Do you see them also for the cloud providers? Do they look like they want to have a mixed use target or are they exclusively focusing on AI? Yeah, right now there's just such a demand for AI infrastructure and for these mid-sized or specialty cloud service providers or even managed hosting providers. Origin.ai can be a really great solution for those who really need to rapidly deploy large fleets of GPU clusters
Starting point is 00:13:58 or those who need help operating or managing those large clusters to maximize performance and availability. One specific example is where we're helping a crypto Bitcoin mining provider pivot their business case to serve the AI market. And even though they have thousands of GPUs deployed, they were never architected to support AI workloads. And we're helping several customers like that to re-architect those clusters to support AI workloads. And we're helping several customers like that to re-architect those clusters to support AI workloads with InfiniBand networks and high-speed storage, as well as providing the managed services to keep the AI cluster up and running and at peak performance and availability.
Starting point is 00:14:36 Oh, interesting. So they are now able to make more money off of AI than they are off of crypto mining. That's interesting. A lot of dynamic things happening in the market. Yeah, a lot of opportunities. So Jonathan, when you say at scale, let's make sure we all understand what that means from Penguin's perspective or from the perspective of the solution. Sure, Doug. And that's a great question. So to meet the wide range of demanding AI workloads, Origin AI architectures, it that scale from 256 to more than 16,000 GPU clusters. We optimize these architectures into 1, 4, and 16 pod configurations
Starting point is 00:15:16 based on optimal utilization of InfiniBand network topologies. So we have, again, small, medium, and large. So our small cluster template is one pod of 32 nodes and 256 GPUs. And it's the smallest config to fully use all the available InfiniBand switch ports in an eight-reel, non-blocking, single-tier InfiniBand network design. So from there, we move to our medium-sized cluster template which is four pods and a thousand 24 GPUs and this template scales to eight pods and this the reason for this size is this is the maximum number of GPUs supported in a two-tier non-blocking InfiniBand network design and then for our large cluster template it's with 16 16 pods, which is 4,096 GPUs. And this design supports a three-tier InfiniBand network. And this can scale out to more than
Starting point is 00:16:13 16,000 GPUs in a cluster. So that's how we came up with our t-shirt size templates. And it's really built around optimizing and efficient scale-out network configurations to maximize every dollar that you're spending on technology and use it in your infrastructure. I want to talk about the software stack, and you mentioned skilled. But before we do that, just how far up the stack do you go, starting from chip to apps? Yeah, no, that's a good question. And if I were to look at the kind of the stack simply as infrastructure layer and then infrastructure management, and then you've got the AI models and ops layer and then the application layer, Origin AI is currently focused on the infrastructure and infrastructure management layers. So our solution includes obviously all the tested and validated firmware drivers and operating systems for all the compute storage and networking hardware.
Starting point is 00:17:12 And with that, the solution also includes our skilled clusterware cluster management software and the health monitoring capabilities, along with workload scheduling software as part of that infrastructure management layer. And as we move further up the stack, we do integrate and support partner software for a wide variety of AI frameworks, applications, and orchestration software. And we support that through our managed services as well. Yeah, I want to mention to the audience who may not know, so Skilled is S-C-Y-L-D. And it's a software package that, if I'm not mistaken, showed up like in early 2000s. Yeah, no, actually, I believe it started before then.
Starting point is 00:17:55 And it went through a couple of major re-architecture and upgrades. But yes, it's been around for quite some time. Jonathan, implications of AI for storage. Can you talk about best practices and making sure it supports the rest of the stack? Yeah. As you know, AI applications typically require massive amounts of data storage, and that requires scalable and cost-effective solutions. And at the same time, performance is crucial for not creating bottlenecks in the AI workflow. So with that context, AI storage needs are like any other large data applications in that the best storage solution
Starting point is 00:18:31 is driven by data access patterns. Large AI environments typically use a tiered storage system with both low cost bulk storage and high speed flash for working data. And balancing these trade-offs between performance and cost involves really looking at different factors like data movement between the tiers, checkpoint writes, and the application's tolerance for data retrieval speeds. And so basically, by understanding and focusing on the data access patterns across the AI workload,
Starting point is 00:19:02 we can help design the right storage solution that can effectively support the right storage solution that can effectively support the entire application stack. We talked about customer readiness to be able to hit the ground running. And within OrionX, we talk about it as product readiness, market readiness, and customer readiness. So all of those need to come together. Can you speak to that on what customers need to do to prepare for this sort of a, what NVIDIA calls AI factory, what in the old days we'd call just a data factory,
Starting point is 00:19:33 but an industrial strength deployment of harnessing the value of your data? Yeah, there's a lot of ways to talk about this. And a lot of it comes down to planning and knowing what you want to do. So to prepare for deploying and building AI factory capability, customers should first map out their AI use cases and models and data sets and really scope out the scale of the required AI infrastructure based on things like model parameters, number of users supported, or even performance needs. That'll help then inform which GPU technology capabilities are the most impactful for their workloads. And this could be anything from raw flops to supporting specific number formats and
Starting point is 00:20:18 precision, memory bandwidth, memory capacity, even GPU to GPU bandwidth, software stack performance, and so on. And they should test these theories out with some kind of proof of concept, comparing the various GPUs and technologies from different vendors with their specific workloads. So one thing organizations, especially the enterprise, really like are things like key performance indicators, KPIs. They want to know what their ROI is. What do you think they need to do to get a handle on that? Yeah, Shane, absolutely. Customers need to understand which KPIs are going to be most important for AI in their
Starting point is 00:20:55 production environment and from an ROI perspective. And this could be metrics like performance per watt, performance per dollar, time to train, token throughput, token latency, or just raw performance throughput. These metrics can impact their choices in GPU technology, network topologies, even air versus liquid cooled solutions, the size of the cluster, and other design considerations. And with all those parameters understood, this is where all these AI infrastructure sort of design considerations need to start at the data center. Because the power and cooling capabilities of the data center will dictate everything for their AI cluster and even potential expansion.
Starting point is 00:21:37 Yeah, that's the world we're in today, for sure. Now, Jonathan, we're seeing other announcements from major vendors with similar characteristics. Talk a little bit about the distinctions and differentiations here. Sure, Doug. Thanks for that question. What sets us apart from other vendors is our more than 25 years experience delivering hardware, software, and services in HPC and large-scale systems. And in the last seven years, implementing and managing AI at scale for some major organizations. With Origin AI, we're streamlining AI implementation, simplifying AI infrastructure management, and enabling predictable AI cluster performance.
Starting point is 00:22:18 Okay, yeah, certainly complexity and risk issues are paramount here, am I right? Yeah, absolutely. So first and foremost, we're reducing the complexity and risk issues are paramount here. Am I right? Yeah, absolutely. So first and foremost, we're reducing the complexity and risk for AI implementation while accelerating time to value. And as one of the only NVIDIA certified DGX ready managed service providers, we've successfully built AI infrastructure since 2017 with more than 75,000 GPUs deployed and managed. So Origin AI is based on proven architectures and methodologies that we've developed with our customers as we designed, deployed, and managed their large-scale AI infrastructure. Now, customers are always interested in predictability performance and optimizing ROI. Could you get into that a little
Starting point is 00:23:02 bit? Of course. So Origin AI Solutions leverage our testing and simulation environment to help confirm that the cluster is production ready, and we validate that cluster performance before we send it off to the customer data center. And we're not just bolting together piece parts. We start with a validated architecture, and then we fully test those racks and mini clusters in the factory to ensure performance and ROI can be realized right as it's being deployed at the customer site. Yeah. And, you know, Jonathan, the whole issue around GPUs, time delays and getting organizations getting their hands on them and accessing them. But then if they're not used to
Starting point is 00:23:42 their full potential, it seems almost tragic to me. Talk about that whole maximizing GPU availability issue. That is really, I'd say, our crown jewel. With our skilled clusterware software, combined with our expert services and on-site spares depot, we're helping our customers achieve greater than 95% availability of those GPU nodes. And this helps ensure that Origin AI maximizes cluster utilization and health alongside AI node availability to drive higher overall performance and ROI. And you might be wondering, how can Penguin achieve this level of availability when others are struggling to maintain availability at that 30 to 80 percent
Starting point is 00:24:25 range. So like I mentioned before, we started deploying and managing these large GPU clusters for meta since 2017, back when they were still called Facebook. Through the years, we've collected, analyzed, and gotten insights from all kinds of data points in their clusters. And we learned how to correlate multiple events to predict and proactively replace failing components before nodes went down. And so we wrote scripts, we've created automated processes, and even filed a few patents to productize all these learnings into what we're calling our Assured Infrastructure Module, which is part of our skilled clusterware software platform. And you can Google it, but Meta has publicly acknowledged that with Penguin,
Starting point is 00:25:05 availability of their AI clusters stay above 95% on a consistent basis. Wow. Meta, of course, is just such a great use case. And you guys have been around, as Doug was saying, for decades. So probably lots of customer case studies. But what else can you share in terms of, I know some customers don't want to be mentioned at all, but what else can you share in terms of, I know some customers don't want to be mentioned at all, but if you can share any use cases, case studies, that would be great. Yeah, you're right. Penguin's heritage in HPC plays a significant role in being able to confidently offer AI technologies and services as a trusted partner. In fact, Penguin is directly involved in 14 of the top 500 and green 500 most powerful and efficient computer systems today.
Starting point is 00:25:48 Our 25 plus years of experience designing, building, deploying, and managing high performance complex systems brings a level of confidence and trust with customers that really only a handful of companies can bring. And with HPC, we also bring deep expertise in scale-out networking with InfiniBand and operating and managing high-performance clusters at high availability. The other thing that I think sort of sets us apart is that we are an independent and agnostic technology provider, and we are very focused on customer value. And so we've integrated and managed best-in-class and innovative technologies, including liquid and immersion cooling solutions for our customers. Yeah, so that experience and expertise is not just formed in our development labs, but it's really by working hand-in-hand with our customers in some of the most powerful data centers. Jonathan, misperceptions and misunderstood points can crop up in talking about solutions like this. Do you run into that when talking to customers or potential customers about origin AI? Yeah, that's a great question. You know,
Starting point is 00:26:56 for customers who are inexperienced in the area of AI infrastructure, there is misunderstanding and perception that knowing how to deploy and manage, compete, or even HPC infrastructures is sufficient. And these infrastructures and teams are just not AI ready. And often the skill sets and resources that need to get them ready are not immediately available. And we help solve for that. But for experienced AI customers who understand these limitations, we're not just about solving IT or procurement problems with hardware solutions. Penguin is all about helping customers achieve business outcomes like accelerating time to value, reducing risk, and maximizing ROI of their AI infrastructure. And even the most experienced and deeply resourced organizations struggle with GPU availability in their AI infrastructure, and especially when there are thousands of nodes and GPUs to manage. So we at Penguin, we partner with the top technology companies to provide
Starting point is 00:27:57 leading edge solutions. And our experience coupled with our software and expert services have proven that we can help customers rapidly deploy large-scale AI infrastructure and consistently achieve more than 95% GPU availability. This is our key differentiator and value that we bring to our customers. Jonathan, excellent. Really good overview of both kind of the reality of what it takes to do this and what you guys do to help accelerate that. Where do people go to get more information on what you just talked about? Yeah, thanks. So to learn more about Penguin Solutions or Origin AI,
Starting point is 00:28:32 please visit us at our website, penguinsolutions.com. You'll find case studies, solution briefs, and contact information to chat with us. I imagine Origin AI will be prominently featured on that website. Yes. So the website, once again, is penguinsolutions.com. Go check it out. Thank you, Jonathan. What a treat to have you. Thanks, Jonathan. Thank you, Doug and Shaheen. Really enjoyed our time together. Okay. Thanks so much. Perfect. We're going to have to get you back as this space evolves so we can keep in touch with the technology realities that you guys are facing and solving. Yeah, no, that'll be great.
Starting point is 00:29:10 Good stuff. All right. Take care, everybody. Until next time. Thank you, Doug. Thank you, Jonathan. That's it for this episode of the At HPC podcast. Every episode is featured on InsideHPC.com and posted on OrionX.net.
Starting point is 00:29:24 Use the comment section or tweet us with any questions or to propose topics of discussion. If you like the show, rate and review it on Apple Podcasts or wherever you listen. The At HPC podcast is a production of OrionX
Starting point is 00:29:36 in association with Inside HPC. Thank you for listening.
