@HPC Podcast Archives - OrionX.net - @HPCpodcast-94: Penguin Solutions on HPC-AI Managed Services – Industry View
Episode Date: December 10, 2024
Special guest Ryan Smith joins Shahin and Doug to discuss the vexing challenges of implementing HPC-class AI systems in a managed services model, the landmines organizations need to avoid, and the opportunities for seizing success. This episode is part of the @HPCpodcast's Industry View feature, which takes on major issues in the world of HPC, AI, and other advanced technologies through the lens of industry leaders. [audio mp3="https://orionx.net/wp-content/uploads/2024/12/094@HPCpodcast_IV_Penguin-Solutions_Ryan-Smith_AI-HPC-Managed-Services_20241209.mp3"][/audio] The post @HPCpodcast-94: Penguin Solutions on HPC-AI Managed Services – Industry View appeared first on OrionX.net.
Transcript
There's a lot of other things inside, storage, connectivity between the servers, getting
everything installed correctly, the cluster interconnected.
So comprehensive planning is probably the primary thing that I think that people really
need to spend more time on before moving into implementing AI.
We're already seeing, even for the big boys like NVIDIA,
the replacement GPUs are coming much faster.
It's really been a pleasure to work with some top minds in the industry,
both on the Meta side and on the Penguin side.
From OrionX in association with InsideHPC, this is the @HPCpodcast.
Join Shahin Khan and Doug Black as they discuss supercomputing technologies and the applications,
markets, and policies that shape them.
Thank you for being with us.
Hi, everyone.
I'm Doug Black at Inside HPC. I'm with Shahin Khan of OrionX.net.
And today we have with us our special guest, Ryan Smith.
He is Director of Managed Services at Penguin Solutions.
Ryan is a 25-year veteran of the technology industry.
He was previously a Senior Services Manager at Stratus, among other companies.
Ryan, welcome. We're here today
to talk about the vexing challenges of implementing HPC class AI systems,
the landmines organizations need to avoid, and the opportunities for seizing success.
Ryan, organizations are struggling to implement big AI so that it delivers ROI as quickly as possible. What are
some of the major challenges they're dealing with? And if there are maybe three or four major
pitfalls that undermine projects, what would you say they are? Yeah, thanks for having me on. It's
a pleasure to be with you. I like to imagine implementing big AI similar to building a home or maybe a large office.
You're going to spend hundreds of thousands of dollars on this home, but there's a lot of things that need to be done right up front to get it going.
You'd never go into building something of this nature without considering all of the plumbing, the electrical materials and the labor that are needed to go into building that home. In fact, for me,
there's a lot of fear I have of building a home from scratch and not just buying a used one.
Primarily, scope creep. What's going to happen when we decide we need another room added onto
this house? Or what's going to happen if I decide I want to move a bathroom to a different area?
These are things that add significant cost and time onto the completion of your home. It's similar in the AI environment.
A lot of companies jump into this. They have a pretty good use plan. And I think they have a
good understanding of what they need to get. They even go work with a large OEM, perhaps,
get a bill of materials. And this company can even help them figure out maybe some networking
equipment. And then they have that equipment arrive and they have to figure out how to put
everything together. And as you can imagine, there's actually a lot more that goes into it.
Even if you have a company that helps get the network set up correctly to run AI, there are a lot of other things involved: storage, connectivity between the servers,
getting everything installed correctly, the cluster interconnected. So comprehensive planning
is probably the primary thing that I think that people really need to spend more time on before
moving into implementing AI. The second one that I would say is somewhat similar, and that's
effective cluster management. Similar to a home, I'm in a home that's
20 years old, and you can only imagine there's always something that needs to be fixed. And a
cluster is somewhat similar to that as well, in that you're always having to fix something. And
a lot of companies go into this thinking, I've got an IT department, why don't we just have them
do this? And I would equate this similar to a virtual environment that I run in the past where we had people that could jump in.
They could read about the virtual software that they're going to use.
The install went pretty well, and it was running, for the most part, pretty well until we had a problem.
And it wasn't until we had a problem that we realized the cluster didn't fail over the way that we wanted it to.
Or during an upgrade, we didn't have settings quite right.
And during that upgrade, we changed something and suddenly the cluster is not working correctly.
Whereas if we had a good virtual cluster administrator or an engineer, somebody who really understood it, we would have avoided a lot of those pitfalls.
And I think if you're walking into this and you
want to have good cluster management, you need to do the same thing. You need to not just find
somebody who understands the Linux operating system. You need to have somebody who actually understands
cluster management and specifically HPC cluster management. And then maybe the third one,
if we were to throw on there is hardware challenges. Similarly to the other two, companies may already be accustomed to replacing disks and memory and keeping hardware up and running inside of a data center.
These HPC clusters tend to use a lot more electricity, they have a much bigger footprint, and they seem to fail at a higher rate, probably because they're used at a higher rate. We ought to think of them more the way you would consider a car or a boat. And that's
in the number of hours or miles driven, as opposed to how long it sits in the data center. They will
seem to fail at a higher rate because they're running at a higher rate.
Yeah. Now the technology in AI, HPC class AI,
is changing so rapidly. Talk about some of the characteristics, the unique challenges
that poses to organizations implementing this newer class of technology. I believe Penguin has,
well, how many GPUs under management do you have now? Yeah, we've got just over 75,000. Yeah. Let's get into a
little more detail on the unique challenges these enormous GPU-driven deployments are posing.
Sure. Maybe just back up just a second. So not only do we manage 75,000 GPUs, but we actually
have a lot more than that in hours recorded from troubleshooting them. In fact, we have been able to estimate that we
have 2 billion GPU runtime hours dedicated to root cause analysis. And so one of the things
that does for us is that when we see errors come in, we have an event handler that can really
analyze what's happening behind the scenes. And I think that's really helpful for us.
So to come back to the question, it was,
what are they running into? Is that the question? Yeah. So maybe I can answer it best by how we
attack it. There's the design phase, right? So if you're designing a new cluster, there's some
things that you need to consider. And these would be the architects that have experience in designing
the cluster.
So we talked about you can go sit down with your OEM and they can say, yeah, based on this, your typical cluster will be about this size.
We can give you the network equipment and sell you everything that goes with it.
But a good architect will sit down with you and help you talk about what's your initial purpose.
What is your use case for building this? Are you going to resell it? Are you going to use it in an educational institution? Or are you going to try to
use it for some other, I don't know, pharmaceuticals or some other purpose?
And the purpose of your design will greatly impact what you design and how you use it.
Maybe the second area of that would be the build phase. So let's say
we've got all of our equipment and you're ready to go. What Penguin likes to do is we have a
state-of-the-art factory in Fremont, California that we send all of the equipment to. So we're not
waiting for the equipment to arrive on the floor in the data center, trying to put all the pieces
together, figure out what we're missing and piece it together at that point. We actually compile the racks and everything in our center in Fremont, and we get everything
running at that place. And the benefit of that is that when we send it out to the data center,
we're not trying to put it together and figure out what's working there. We're actually making
sure the interconnectivity at that point is working correctly. The third phase of that, I think, would be the deployment phase. And in deployment,
we actually send people on site to the data center, put everything together, make sure the
hardware and the software is working as it should, and it's turned over to our managed services.
This is where we talked about having the right people to administer it.
We use what we call Scyld ClusterWare, which is developed by Penguin. And this allows us to
get up to speed much faster, make sure we have alerts running, make sure we can control the
clusters. They're deployed, they're drained, and this is all done automatically through our system
software. When you have so many GPUs in one place doing either one thing or more than one thing,
and then you have to upgrade either because something failed or because you now are expanding the configuration.
And it's not the same chips. It's not the same technology.
It's like two years later and you have something that's better, faster, cheaper.
How do you manage that? Because you lose homogeneity, you lose consistency, and you're still supposed to
deliver the same SLAs. How do you get around that complexity? Yeah, good question. So what I'm
hearing is we could have different technologies, most likely in different clusters, right? Maybe
you have NVIDIA GPUs in one and AMD GPUs in the other.
Well, even if it is the same vendor, if you're going from A100s to H100, that's still a change.
Yeah, really good question. In general, we keep those in separate clusters, right? So even if
you're just going to different processors from the same maker, you're probably going to create
those in different clusters, and that would allow you to separate them. There is a lot of complexity if you try to merge them. In fact, we just tend to stay away from mixing different technologies like that.
I see. So you enforce consistency through partitioning and allocating the new configuration
to a new set of applications or new instances of the application. Yeah?
Yes, that's correct. Yeah.
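To make that partitioning approach concrete, here is a minimal sketch (node names and GPU models are hypothetical) of grouping a mixed fleet by GPU generation so each generation can be exposed as its own partition or cluster rather than being mixed together:

```python
from collections import defaultdict

# Hypothetical inventory: node name -> GPU model installed in that node.
inventory = {
    "node001": "A100", "node002": "A100", "node003": "A100",
    "node101": "H100", "node102": "H100",
    "node201": "MI300X",
}

def partitions_by_gpu(nodes: dict[str, str]) -> dict[str, list[str]]:
    """Group nodes by GPU model so each generation lands in its own partition."""
    groups: dict[str, list[str]] = defaultdict(list)
    for node, gpu in sorted(nodes.items()):
        groups[gpu].append(node)
    return dict(groups)

if __name__ == "__main__":
    for gpu, nodes in partitions_by_gpu(inventory).items():
        # One partition (or separate cluster) per GPU generation keeps
        # scheduling, images, and SLAs homogeneous within each group.
        print(f"partition gpu_{gpu.lower()}: {','.join(nodes)}")
```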
All right. So the other thing is really availability. How do you manage availability
of these systems? You mentioned the Scyld software that does a bunch of analysis. Maybe
it can do some predictive analysis. And at these high scales, like you said, there's always
something that's liable to fail or something. It's just because you've got so many moving parts,
right? Yeah. It seems like something's always failing, that's for sure.
So that is just simply a question of the sheer number of things, right?
Yeah. In fact, I think there's a couple of ways to go about that. One, let's just take a very
simple approach. Let's say you need 100 nodes up and running. You may want to consider having 110, right? Having a little bit
extra so that you can meet that 100. That's a really good way to go about it. And if we have
an SLA that's to keep 100 nodes up and running, we have 110, that's a really nice cushion. And
that helps you meet that SLA. However, sometimes the extra cost doesn't allow you to get 110 nodes when you've only maybe budgeted for 100.
And so what you do in that case is you have to really plan for those outages.
So spare part inventory would be a really big part of that, making sure your, let's just say, memory is going out.
Make sure when you go to that bucket to pull memory from it, that there's memory in that box.
And there's a people component to that as well.
So each one of these vendors has a part replacement guarantee, but that might take a little bit of time to get there.
So when you pull that part, you need to have a process in place that allows you to submit that RMA as quickly as possible. I would recommend same day because that part, whether you have next
day or two day, or maybe it's over a weekend, it may take a little bit of time to get that back.
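As a back-of-the-envelope illustration of that 100-versus-110 cushion, here is a minimal sketch; the per-node availability and SLA target are made-up figures for illustration, not Penguin numbers:

```python
from math import comb

def prob_meeting_sla(total: int, needed: int, node_availability: float) -> float:
    """Probability that at least `needed` of `total` nodes are up at any moment,
    treating node failures as independent with the given per-node availability."""
    return sum(
        comb(total, k) * node_availability**k * (1 - node_availability)**(total - k)
        for k in range(needed, total + 1)
    )

def spares_for_sla(needed: int, node_availability: float, sla_target: float) -> int:
    """Smallest number of spare nodes so the SLA target is met."""
    spares = 0
    while prob_meeting_sla(needed + spares, needed, node_availability) < sla_target:
        spares += 1
    return spares

if __name__ == "__main__":
    # Assumed numbers: 100 nodes required, 97% per-node availability,
    # and a 99% chance of having at least 100 nodes up at any given time.
    print(spares_for_sla(needed=100, node_availability=0.97, sla_target=0.99))
```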
Right. To what extent is this proactive, preventive maintenance? To what extent can you
predict that something is liable to go offline and replace it before it happens? Is that still sort of after the fact, or are we
now building enough telemetry or whatever capabilities we have to turn it into something
more predictive? Yeah, exactly right. So those 2 billion GPU runtime hours that we collect really help us proactively determine whether these components are moving towards failure. Is the heat of your GPU really your primary detector? Well, probably not, because
they're almost always within a certain range. But if you do head outside of that
range and let's just say you have 20,000 GPUs and one or two of them are heading
outside of that range, now you've got something that you can look at. And
that's a very simple approach but those 2 billion GPU runtime hours that we've analyzed
really help us understand what's happening in the GPU and whether it falls outside of the normal range for what should be running.
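As a toy illustration of that kind of fleet-wide check (a sketch with made-up readings and thresholds, not Penguin's actual event handler), one simple approach is to flag GPUs whose temperature drifts well outside the fleet's normal band:

```python
import statistics

def flag_outlier_gpus(temps: dict[str, float], z_threshold: float = 4.0) -> list[str]:
    """Flag GPUs whose temperature sits far outside the fleet-wide distribution.

    temps maps a GPU identifier to its latest temperature reading in Celsius.
    The z-score threshold is an assumed, illustrative value.
    """
    mean = statistics.fmean(temps.values())
    stdev = statistics.pstdev(temps.values())
    if stdev == 0:
        return []
    return [
        gpu for gpu, t in temps.items()
        if abs(t - mean) / stdev > z_threshold
    ]

if __name__ == "__main__":
    # Hypothetical readings: thousands of GPUs in a narrow band, one drifting out.
    readings = {f"gpu{i:05d}": 62.0 + (i % 5) * 0.5 for i in range(20_000)}
    readings["gpu00042"] = 88.0   # heading outside the normal range
    print(flag_outlier_gpus(readings))
```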
In addition to this rapid cadence, this very quick cadence, of new GPUs
coming out from the three big chip vendors. And I'm sure, obviously,
you guys stay on top of all this. But talk about maybe other areas of technology that's coming to
bear on big AI that maybe you're particularly interested in or excited about. I don't know if
it might be on the interconnect side or memory. Yeah, I want to be careful diving too deep.
I'm not really the deep technology person.
I'm not our chief technology officer.
But there's three areas that I'm really excited about.
One, certainly the new GPUs that are coming out, not just from the big two or the big
three producers, but there's other individuals trying to get into this space and being very
innovative in the space.
It'll help
drive down the costs. We're already seeing, even for the big boys like NVIDIA, the replacement GPUs
are coming much faster than they have. There was a time when you could easily be waiting more than
just a few months to get a simple replacement GPU. Those are seeming to come very quickly now.
So I'm really excited about the GPUs that are coming out and
some of their specific capabilities. Storage is another really big one. So as we're moving to
potentially all flash storage and the cost of that coming down, the ability to restore and to read data at a much faster rate is very beneficial. So that's a really fun one.
And then I think the network is our third bottleneck so often, right? We're always trying
to keep these servers as close as we can. Even how you build the servers, the components have
to be very close together to avoid latency. And network is a really big one. And there's some really neat technologies
out there coming. I think the big ones right now are IBE. And I'm just looking forward to
the next technology in those areas to really help speed these up. As a customer, if I want to get
engaged and get your help with managed services, what sort of process do I need to do on my end?
What sort of training do I need? What sort of tasks do I need
to be prepared to do to make sure that it works well for everybody? So we've done the installation,
we've sort of handed it over for managed services. As a customer, am I done and I'm just turning into
an end user or do I still need to do a bunch of things to make sure that your group can do its
job? Yeah, I think there's a lot we can both do to make sure that we're collectively successful.
One is to, going back,
make sure you understand your use case
and how you want to present your compute nodes
to your customers or your end user,
whoever that is.
If we understand that,
our chances of being successful with you
go up dramatically, right?
And help us understand your use case. And even if you already have built a cluster and it's up and
running, we'd like to sit down and understand that design with you and the purpose for why you've set
it up the way that you have before you ask us to step in and start running it. Because if we don't
understand the use case or we don't understand the purpose for the design, we may be starting to make recommendations to move it to something that
we do understand in a different way. Definitely come to the table. Let's talk about that first
phase, the design, even if you think you're already through the design. And I think for
managed services, we need to understand a few things. Of course, where your servers are located helps us
take care of them better. Do you want us to do on-site break-fix? Do you want us to only take
care of the remote cluster itself? Those things are all very helpful information. I guess what
I'd say you don't need to do, perhaps, you don't need to become a clusterware expert. You don't
need to understand AI clusters. You don't need to go out and do that.
It's really about the use case.
And then we can sit down together and help figure out the rest.
And what is the duration of time, the typical length of engagement for managed services for these large installations?
Yeah, we see a couple different things.
It seems like most people these days have an intent, at least eventually, to manage it themselves. And we're more than happy to take a
shorter term with you. The shortest term, I think, would be a year. And I haven't seen anybody move
it into their own management within the year. But we'd be more than happy to help you move
in that direction. So normally, it's a three to five year engagement for us, and we're happy to work with you on your needs. So if you come to us and you say,
hey, this is our plan, we'll help build a plan to move in that direction.
So that means also training is included in your capabilities?
Yeah. Yeah. So we can build that right into it. So there is cross-training. We have several
customers who want training on a regular basis, and we do provide that. You're also going to see, coming up on our website in the near future, that we'll just provide classes for training. You wouldn't necessarily have to go through managed services to get that training. We'll provide HPC and other trainings on our website.
Ryan, we've heard about the AI supercluster at Meta that you've implemented.
I'm just curious, what was it like working with Meta where you already have people in place who
are, I assume, incredibly capable? So what was that relationship like? And maybe share an anecdote
or two about standing up that system. Yeah, you're aware, but they're pioneers in the area.
Some of the first people that really ventured into this. And just like any venture like that, you've got to figure
out things and develop them as you're going. A lot of it can be through trial and error,
but it's really been a pleasure to work with some top minds in the industry, both on the meta side
and on the Penguin side, architects and engineers who will come together and help figure
out what is the best solution, how much hardware to provide based on the demand that is projected,
what do we need to do, locations of those, and who's going to manage it. I think there have been some challenges of figuring out, sure, we can put some people on site for break fix, but they're not necessarily the same skill sets or the right people who are going to solve other issues, right? Operating
system or cluster availability, building a monitoring system and reporting capabilities
so that we can send out alerts and get quick responses to those. Some of the things we've
already talked about as well, making sure your spare part inventory is at the appropriate size. That's taken some significant work and science,
if you will, to get that to a point where it's appropriate, so that we don't have a lot of parts
sitting in a bucket or on a shelf somewhere that aren't being used and turned over, as opposed to
running out of parts that we need right away. One other question, really a requirement, is the kind of supply chain intimacy that you need to have to be able to take it all the way upstream, wherever the need might be. What sort of processes do you put in place to make sure that
things get escalated properly and backline support can be escalated back to the original
manufacturer, vendor? To what extent is that a
challenge, especially in the AI world where you do have a lot of rapidly changing technology?
Yeah. So we've actually had to have dedicated people for inventory management, for one.
Something that we thought maybe we could handle just through a normal process or perhaps through a tool, and we do use tools, don't get me wrong, but we found that we need a person tracking that and making sure that the inventory turnover is correct, making predictions
of changing needs. Maybe for a while, it's DIMMs that you need to keep on the shelf. But at some
point that turns over, based on manufacturer capabilities or whatever it is, and we need to keep another part at a higher rate. And so we
need somebody really tracking that. That same group of individuals also help make sure that
the RMA process is working as expected. It's actually our data center techs that will complete
the RMAs as they replace parts in servers, but they have to make sure that they fill all of
those steps out. They can't just
take a part out and put a new part in and walk away. We've got to make sure the part gets replenished.
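To make that replenishment loop concrete, here is a minimal sketch of the bookkeeping a tech might trigger when pulling a spare; the part names, stock levels, and reorder threshold are all hypothetical:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SpareBin:
    part: str
    on_hand: int
    reorder_point: int
    open_rmas: list[str] = field(default_factory=list)

    def pull_part(self, failed_serial: str) -> None:
        """Pull a spare for a failed component and log the RMA the same day."""
        if self.on_hand == 0:
            raise RuntimeError(f"No {self.part} spares on the shelf")
        self.on_hand -= 1
        # Same-day RMA submission keeps the vendor's replacement clock running.
        self.open_rmas.append(f"RMA-{date.today().isoformat()}-{failed_serial}")
        if self.on_hand < self.reorder_point:
            print(f"ALERT: {self.part} below reorder point ({self.on_hand} left)")

if __name__ == "__main__":
    dimms = SpareBin(part="DIMM 64GB", on_hand=4, reorder_point=4)
    dimms.pull_part("SN-12345")   # hypothetical failed-part serial number
    print(dimms.open_rmas)
```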
Ryan, one topic or point of discussion Shahin and I have shared several times is that big AI is, as Shahin says, really a form of HPC. And I think for some of us in the HPC community,
it's a little frustrating that HPC these days seems to be subsumed by AI. And now,
of course, Penguin comes out of this long HPC heritage, and you've parlayed that into this
HPC AI strategy. I mean, that must have been maybe almost a natural evolution for the company.
Yeah, I think it was. And a lot of it goes by demand, right? So it's really what are your
customers asking for? And the great big technology wave at the time right now is AI. And so it
certainly is a natural move for us. Yeah. I think Doug wants you to say we could not have done this
without our HPC background. Yes. Yeah, that's absolutely true. Yeah. You're right, Shahin. That's what I wanted
them to say. Well, actually, along those lines, I do have a question. And is there any difference
in how customers manage their applications, their SLAs, their approach to their infrastructure between the HPC sites
and the full-on AI sites?
Yeah, good question.
We're so involved in the AI at this point that you almost forget what the differences
may have been before AI.
And so I'd say GPUs is a big one, learning to deal with the GPUs, their failure rates,
how to care for those.
Sometimes some of our cooling techniques have had to change. A lot of people are starting to
use liquid cooling instead of air cooling. And I say a lot, that's not true. It's a very
small number that are doing that. But definitely growing. Yeah.
Yeah, absolutely. Well, one thing I noticed: my understanding is you all have four major components to your services, and you did go over design, build, deploy.
Yes.
And then there's manage.
Yeah.
That's the specific area that my department runs.
Design, build, deploy, and manage.
Those are four separate areas within Penguin that we do.
My specific area is the management.
And so I think there's one area that we didn't get into that might be worth going over a
little bit.
How do we stay up on top of the technologies?
How do we stay up on the latest trends and the best practices in the industry?
And I think this would be good and true for anybody out there in the industry.
But for Penguin, there's two things that have happened in my time here that have been really valuable to us.
One, we hold conferences and we just had
one last month where we invited a bunch of our vendors out and these included the big boys,
NVIDIA, AMD, Dell, of course, and then a lot of storage, processors, service providers.
We all got together and we had presentations where we actually spoke about where we're seeing
the industry going, what our customers are asking for, what's coming down the pipes, even if it's not available to sell right
now, that was invaluable. There were a couple of panels, one on processors and one on storage,
where we got to hear different companies talk about where their value proposition is and where
it's headed, what they see their customers needing. And from a Penguin point
of view, especially as we consult with customers, it's nice to be able to offer choices. We don't just have
one solution. So we can sit down and again, listen to your use case and help you determine
where is it that you're going to go and what would be the best technology for your needs at this time.
The second one for us is we've organized ourselves
into centers of excellence. And the managed services department is about 89 people at Penguin.
So it's growing at a significant rate. And the question is, how do we position ourselves to
continue to grow? And the centers of excellence has been our answer. So we have centers of excellence for storage, for networking, for system engineering. We have one called engagement managers. These are the people who help make sure that those engagements run well. And so we have leaders of each of these centers that really focus on best practices for their specific group. We might have, for example, a storage engineer who is assigned to a customer,
and they're the only storage engineer.
But we don't want to limit the technology and the processes that are being provided to a customer
just because they only need one person in a particular field.
So they are part of a center of excellence where they can draw back to a much larger group
and get information and feedback,
run problems by somebody else in their center of excellence to say, hey, here's the problem I'm
running into. What do we think? Because you can open tickets, you can go out and search the web,
but sometimes having experienced people that you can bounce things off of is invaluable.
And so we've created processes where they can onboard
and train with one another. They provide regular meetings with one another to talk about the latest
things going on. And I think that's been really valuable. Great stuff. Yeah. One final question
for me. How far up the stack do you go in managed services? What level of software do you stop at?
Yeah. So usually we go up to the orchestration tool. So generally,
what we do is we provide a system that is fully cluster available, and users are able to jump on
and start running jobs against the cluster. When you ask that question, I always have to say,
what don't we provide, right? And so really, it is the user who submits those jobs. We generally don't necessarily know what they're running.
We can see the jobs are running on the system, but we're not actually running the job.
We're not helping the users configure their jobs.
And so there is a software layer of what we provide that is actually running those jobs.
They run on the system. So we'll install whatever your orchestration tool is.
We'll install whatever your scheduler is so that you're ready to run them.
But that's where we stop.
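As a small illustration of where that hand-off sits, here is a sketch of the user-side step, assuming a generic Slurm-style scheduler; the job script, resource requests, and file names are hypothetical, while the scheduler underneath is what the managed layer installs and keeps healthy:

```python
import subprocess
import textwrap

# A user-authored batch script: the workload inside it is the user's concern,
# while the scheduler it is submitted to is installed and kept healthy by the
# managed-services layer.
job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH --gres=gpu:8
    #SBATCH --time=04:00:00
    srun python train.py
    """)

with open("train.sbatch", "w") as f:
    f.write(job_script)

# Submit to the cluster's scheduler (assumes Slurm's sbatch is on the PATH).
subprocess.run(["sbatch", "train.sbatch"], check=True)
```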
Got it.
Excellent.
Thank you, Ryan.
This is a really evolving, interesting area with lots of complexity and a moving target.
So it's wonderful to have this kind of a service that absorbs a lot of those changes for you.
Yeah, it's been really fun speaking with you guys about this.
Hopefully you can tell it's a passion, not only for me, but for our company.
And we enjoy doing this.
Absolutely.
No, I think we've learned a lot from you and your colleagues and your experiences have
been very important to illuminate all of these things for us.
Yeah, wonderful.
Thank you so much.
Thanks so much, Ryan.
That's it for this episode of the @HPCpodcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comments section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The @HPCpodcast is a production of OrionX in association
with Inside HPC. Thank you for listening.