@HPC Podcast Archives - OrionX.net - @HPCpodcast-94: Penguin Solutions on HPC-AI Managed Services – Industry View
Episode Date: December 10, 2024
Special guest Ryan Smith joins Shahin and Doug to discuss the vexing challenges of implementing HPC-class AI systems in a managed services model, the landmines organizations need to avoid, and the opportunities for seizing success. This episode is part of the @HPCpodcast's Industry View feature, which takes on major issues in the world of HPC, AI, and other advanced technologies through the lens of industry leaders. [audio mp3="https://orionx.net/wp-content/uploads/2024/12/094@HPCpodcast_IV_Penguin-Solutions_Ryan-Smith_AI-HPC-Managed-Services_20241209.mp3"][/audio] The post @HPCpodcast-94: Penguin Solutions on HPC-AI Managed Services – Industry View appeared first on OrionX.net.
Transcript
There's a lot of other things inside, storage, connectivity between the servers, getting
everything installed correctly, the cluster interconnected.
So comprehensive planning is probably the primary thing that I think that people really
need to spend more time on before moving into implementing AI.
We're already seeing, even for the big boys like NVIDIA,
the replacement GPUs are coming much faster.
It's really been a pleasure to work with some top minds in the industry,
both on the Meta side and on the Penguin side.
From OrionX in association with InsideHPC, this is the @HPCpodcast.
Join Shahin Khan and Doug Black as they discuss supercomputing technologies and the applications,
markets, and policies that shape them.
Thank you for being with us.
Hi, everyone.
I'm Doug Black at Inside HPC. I'm with Shahin Khan of OrionX.net.
And today we have with us our special guest, Ryan Smith.
He is Director of Managed Services at Penguin Solutions.
Ryan is a 25-year veteran of the technology industry.
He was previously a Senior Services Manager at Stratus, among other companies.
Ryan, welcome. We're here today
to talk about the vexing challenges of implementing HPC class AI systems,
the landmines organizations need to avoid, and the opportunities for seizing success.
Ryan, organizations are struggling to implement big AI so that it delivers ROI as quickly as possible. What are
some of the major challenges they're dealing with? And if there are maybe three or four major
pitfalls that undermine projects, what would you say they are? Yeah, thanks for having me on. It's
a pleasure to be with you. I like to imagine implementing big AI similar to building a home or maybe a large office.
You're going to spend hundreds of thousands of dollars on this home, but there's a lot of things that need to be done right up front to get it going.
You'd never go into building something of this nature without considering all of the plumbing, the electrical materials and the labor that are needed to go into building that home. In fact, for me,
there's a lot of fear I have of building a home from scratch and not just buying a used one.
Primarily, scope creep. What's going to happen when we decide we need another room added onto
this house? Or what's going to happen if I decide I want to move a bathroom to a different area?
These are things that add significant cost and time onto the completion of your home. It's similar in the AI environment.
A lot of companies jump into this. They have a pretty good use plan. And I think they have a
good understanding of what they need to get. They even go work with a large OEM, perhaps,
get a bill of materials. And this company can even help them figure out maybe some networking
equipment. And then they have that equipment arrive and they have to figure out how to put
everything together. And as you can imagine, there's actually a lot more that goes into it.
Even if you have a company that helps get the network set up correctly to run AI, there are a lot of other things involved: storage, connectivity between the servers,
getting everything installed correctly, the cluster interconnected. So comprehensive planning
is probably the primary thing that I think that people really need to spend more time on before
moving into implementing AI. The second one that I would say is somewhat similar, and that's
effective cluster management. Similar to a home, I'm in a home that's
20 years old, and you can only imagine there's always something that needs to be fixed. And a
cluster is somewhat similar to that as well, in that you're always having to fix something. And
a lot of companies go into this thinking, I've got an IT department, why don't we just have them
do this? And I would equate this similar to a virtual environment that I run in the past where we had people that could jump in.
They could read about the virtual software that they're going to use.
The install went pretty well, and it was running, for the most part, pretty well until we had a problem.
And it wasn't until we had a problem that we realized the cluster didn't fail over the way that we wanted it to.
Or during an upgrade, we didn't have settings quite right.
And during that upgrade, we changed something and suddenly the cluster is not working correctly.
Whereas if we had a good virtual cluster administrator or an engineer, somebody who really understood it, we would have avoided a lot of those pitfalls.
And I think if you're walking into this and you
want to have good cluster management, you need to do the same thing. You need to not just find
somebody who understands the Linux operating system. You need to have somebody who actually understands
cluster management and specifically HPC cluster management. And then maybe the third one,
if we were to throw on there is hardware challenges. Similarly to the other two, companies may already be accustomed to replacing disks and memory and keeping hardware up and running inside of a data center.
These HPC clusters tend to use a lot more electricity, they have a much bigger footprint, and they seem to fail at a higher rate, probably because they're used at a higher rate. We ought to think of them more the way you would consider a car or a boat. And that's
in the number of hours or miles driven, as opposed to how long it sits in the data center. They will
seem to fail at a higher rate because they're running at a higher rate.
Yeah. Now the technology in AI, HPC class AI,
is changing so rapidly. Talk about some of the characteristics, the unique challenges
that poses to organizations implementing this newer class of technology. I believe Penguin has,
well, how many GPUs under management do you have now? Yeah, we've got just over 75,000. Yeah. Let's get into a
little more detail on the unique challenges these enormous GPU-driven deployments are posing.
Sure. Maybe just back up just a second. So not only do we manage 75,000 GPUs, but we actually
have a lot more than that in hours recorded from troubleshooting them. In fact, we have been able to estimate that we
have 2 billion GPU runtime hours dedicated to root cause analysis. And so one of the things
that does for us is that when we see errors come in, we have an event handler that can really
analyze what's happening behind the scenes. And I think that's really helpful for us.
So to come back to the question, it was,
what are they running into? Is that the question? Yeah. So maybe I can answer it best by how we
attack it. There's the design phase, right? So if you're designing a new cluster, there's some
things that you need to consider. And these would be the architects that have experience in designing
the cluster.
So we talked about you can go sit down with your OEM and they can say, yeah, based on this, your typical cluster will be about this size.
We can give you the network equipment and sell you everything that goes with it.
But a good architect will sit down with you and help you talk about what's your initial purpose.
What is your use case for building this? Are you going to resell it? Are you going to use it in an educational institution? Or are you going to try to
use it for some other, I don't know, pharmaceuticals or some other purpose?
And the purpose of your design will greatly impact what you design and how you use it.
Maybe the second area of that would be the build phase. So let's say
we've got all of our equipment and you're ready to go. What Penguin likes to do is we have a
state-of-the-art factory in Fremont, California that we send all of the equipment to. So we're not
waiting for the equipment to arrive on the floor in the data center, trying to put all the pieces
together, figure out what we're missing and piece it together at that point. We actually compile the racks and everything in our center in Fremont, and we get everything
running at that place. And the benefit of that is that when we send it out to the data center,
we're not trying to put it together and figure out what's working there. We're actually making
sure the interconnectivity at that point is working correctly. The third phase of that, I think, would be the deployment phase. And in deployment,
we actually send people on site to the data center, put everything together, make sure the
hardware and the software is working as it should, and it's turned over to our managed services.
This is where we talked about having the right people to administer it.
We use what we call Scyld ClusterWare, which is developed by Penguin. And this allows us to
get up to speed much faster, make sure we have alerts running, make sure we can control the
clusters. They're deployed, they're drained, and this is all done automatically through our system
software. When you have so many GPUs in one place doing either one thing or more than one thing,
and then you have to upgrade either because something failed or because you now are expanding the configuration.
And it's not the same chips. It's not the same technology.
It's like two years later and you have something that's better, faster, cheaper.
How do you manage that? Because you lose homogeneity, you lose consistency, and you're still supposed to
deliver the same SLAs. How do you get around that complexity? Yeah, good question. So what I'm
hearing is we could have different technologies, most likely in different clusters, right? Maybe
you have NVIDIA GPUs in one and AMD GPUs in the other.
Well, even if it is the same vendor, if you're going from A100s to H100, that's still a change.
Yeah, really good question. In general, we keep those in separate clusters, right? So even if
you're just going to different processors from the same maker, you're probably going to create
those in different clusters, and that would allow you to separate them. There is a lot of complexity if you try to merge them. In fact, we just tend to stay away from mixing different technologies like that.
I see. So you enforce consistency through partitioning and allocating the new configuration
to a new set of applications or new instances of the application. Yeah?
Yes, that's correct. Yeah.
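To make that partitioning approach concrete, here is a minimal sketch (node names and GPU models are hypothetical) of grouping a mixed fleet by GPU generation so each generation can be exposed as its own partition or cluster rather than being mixed together:

```python
from collections import defaultdict

# Hypothetical inventory: node name -> GPU model installed in that node.
inventory = {
    "node001": "A100", "node002": "A100", "node003": "A100",
    "node101": "H100", "node102": "H100",
    "node201": "MI300X",
}

def partitions_by_gpu(nodes: dict[str, str]) -> dict[str, list[str]]:
    """Group nodes by GPU model so each generation lands in its own partition."""
    groups: dict[str, list[str]] = defaultdict(list)
    for node, gpu in sorted(nodes.items()):
        groups[gpu].append(node)
    return dict(groups)

if __name__ == "__main__":
    for gpu, nodes in partitions_by_gpu(inventory).items():
        # One partition (or separate cluster) per GPU generation keeps
        # scheduling, images, and SLAs homogeneous within each group.
        print(f"partition gpu_{gpu.lower()}: {','.join(nodes)}")
```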
All right. So the other thing is really availability. How do you manage availability
of these systems? You mentioned the Scyld software that does a bunch of analysis. Maybe
it can do some predictive analysis. And at these high scales, like you said, there's always
something that's liable to fail or something. It's just because you've got so many moving parts,
right? Yeah. It seems like something's always failing, that's for sure.
So that is just simply a question of the sheer number of things, right?
Yeah. In fact, I think there's a couple of ways to go about that. One, let's just take a very
simple approach. Let's say you need 100 nodes up and running. You may want to consider having 110, right? Having a little bit
extra so that you can meet that 100. That's a really good way to go about it. And if we have
an SLA that's to keep 100 nodes up and running, we have 110, that's a really nice cushion. And
that helps you meet that SLA. However, sometimes the extra cost doesn't allow you to get 110 nodes when you've only maybe budgeted for 100.
And so what you do in that case is you have to really plan for those outages.
So spare part inventory would be a really big part of that, making sure your, let's just say, memory is going out.
Make sure when you go to that bucket to pull memory from it, that there's memory in that box.
And there's a people component to that as well.
So each one of these vendors has a part replacement guarantee, but that might take a little bit of time to get there.
So when you pull that part, you need to have a process in place that allows you to submit that RMA as quickly as possible. I would recommend same day because that part, whether you have next
day or two day, or maybe it's over a weekend, it may take a little bit of time to get that back.
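As a back-of-the-envelope illustration of that 100-versus-110 cushion, here is a minimal sketch; the per-node availability and SLA target are made-up figures for illustration, not Penguin numbers:

```python
from math import comb

def prob_meeting_sla(total: int, needed: int, node_availability: float) -> float:
    """Probability that at least `needed` of `total` nodes are up at any moment,
    treating node failures as independent with the given per-node availability."""
    return sum(
        comb(total, k) * node_availability**k * (1 - node_availability)**(total - k)
        for k in range(needed, total + 1)
    )

def spares_for_sla(needed: int, node_availability: float, sla_target: float) -> int:
    """Smallest number of spare nodes so the SLA target is met."""
    spares = 0
    while prob_meeting_sla(needed + spares, needed, node_availability) < sla_target:
        spares += 1
    return spares

if __name__ == "__main__":
    # Assumed numbers: 100 nodes required, 97% per-node availability,
    # and a 99% chance of having at least 100 nodes up at any given time.
    print(spares_for_sla(needed=100, node_availability=0.97, sla_target=0.99))
```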
Right. To what extent is this proactive, preventive maintenance? To what extent can you
predict that something is liable to go offline and replace it before it happens? Is that still sort of after the fact, or are we
now building enough telemetry or whatever capabilities we have to turn it into something
more predictive? Yeah, exactly right. So those 2 billion GPU runtime hours that we collect really help us proactively determine whether these components are moving towards failure. Is the heat of your GPU really your primary detector? Well, probably not, because
they're almost always within a certain range. But if you do head outside of that
range and let's just say you have 20,000 GPUs and one or two of them are heading
outside of that range, now you've got something that you can look at. And
that's a very simple approach but those 2 billion GPU runtime hours that we've analyzed
really help us understand what's happening in the GPU and whether it falls outside of the normal range for what should be running.
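As a toy illustration of that kind of fleet-wide check (a sketch with made-up readings and thresholds, not Penguin's actual event handler), one simple approach is to flag GPUs whose temperature drifts well outside the fleet's normal band:

```python
import statistics

def flag_outlier_gpus(temps: dict[str, float], z_threshold: float = 4.0) -> list[str]:
    """Flag GPUs whose temperature sits far outside the fleet-wide distribution.

    temps maps a GPU identifier to its latest temperature reading in Celsius.
    The z-score threshold is an assumed, illustrative value.
    """
    mean = statistics.fmean(temps.values())
    stdev = statistics.pstdev(temps.values())
    if stdev == 0:
        return []
    return [
        gpu for gpu, t in temps.items()
        if abs(t - mean) / stdev > z_threshold
    ]

if __name__ == "__main__":
    # Hypothetical readings: thousands of GPUs in a narrow band, one drifting out.
    readings = {f"gpu{i:05d}": 62.0 + (i % 5) * 0.5 for i in range(20_000)}
    readings["gpu00042"] = 88.0   # heading outside the normal range
    print(flag_outlier_gpus(readings))
```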
In addition to this rapid cadence, this very quick cadence, of new GPUs
coming out from the three big chip vendors. And I'm sure, obviously,
you guys stay on top of all this. But talk about maybe other areas of technology that's coming to
bear on big AI that maybe you're particularly interested in or excited about. I don't know if
it might be on the interconnect side or memory. Yeah, I want to be careful diving too deep.
I'm not really the deep technology person.
I'm not our chief technology officer.
But there's three areas that I'm really excited about.
One, certainly the new GPUs that are coming out, not just from the big two or the big
three producers, but there's other individuals trying to get into this space and being very
innovative in the space.
It'll help
drive down the costs. We're already seeing, even for the big boys like NVIDIA, the replacement GPUs
are coming much faster than they have. There was a time when you could easily be waiting more than
just a few months to get a simple replacement GPU. Those are seeming to come very quickly now.
So I'm really excited about the GPUs that are coming out and
some of their specific capabilities. Storage is another really big one. So as we're moving to
potentially all flash storage and the cost of that coming down, the ability to restore and to read data at a much faster rate is very beneficial. So that's a really fun one.
And then I think the network is our third bottleneck so often, right? We're always trying
to keep these servers as close as we can. Even how you build the servers, the components have
to be very close together to avoid latency. And network is a really big one. And there's some really neat technologies
out there coming. I think the big ones right now are IBE. And I'm just looking forward to
the next technology in those areas to really help speed these up. As a customer, if I want to get
engaged and get your help with managed services, what sort of process do I need to do on my end?
What sort of training do I need? What sort of tasks do I need
to be prepared to do to make sure that it works well for everybody? So we've done the installation,
we've sort of handed it over for managed services. As a customer, am I done and I'm just turning into
an end user or do I still need to do a bunch of things to make sure that your group can do its
job? Yeah, I think there's a lot we can both do to make sure that we're collectively successful.
One is to, going back,
make sure you understand your use case
and how you want to present your compute nodes
to your customers or your end user,
whoever that is.
If we understand that,
our chances of being successful with you
go up dramatically, right?
And help us understand your use case. And even if you already have built a cluster and it's up and
running, we'd like to sit down and understand that design with you and the purpose for why you've set
it up the way that you have before you ask us to step in and start running it. Because if we don't
understand the use case or we don't understand the purpose for the design, we may be starting to make recommendations to move it to something that
we do understand in a different way. Definitely come to the table. Let's talk about that first
phase, the design, even if you think you're already through the design. And I think for
managed services, we need to understand a few things. Of course, where your servers are located helps us
take care of them better. Do you want us to do on-site break-fix? Do you want us to only take
care of the remote cluster itself? Those things are all very helpful information. I guess what
I'd say you don't need to do, perhaps, you don't need to become a clusterware expert. You don't
need to understand AI clusters. You don't need to go out and do that.
It's really about the use case.
And then we can sit down together and help figure out the rest.
And what is the duration of time, the typical length of engagement for managed services for these large installations?
Yeah, we see a couple different things.
It seems like most people these days have an intent, at least eventually, to manage it themselves. And we're more than happy to take a
shorter term with you. The shortest term, I think, would be a year. And I haven't seen anybody move
it into their own management within the year. But we'd be more than happy to help you move
in that direction. So normally, it's a three to five year engagement for us, and we're happy to work with you on your needs. So if you come to us and you say,
hey, this is our plan, we'll help build a plan to move in that direction.
So that means also training is included in your capabilities?
Yeah. Yeah. So we can build that right into it. So there is cross-training. We have several
customers who want training on a regular basis, and we do provide that. You're also going to see, coming up on our website in the near future, that we'll just provide classes for training. You wouldn't necessarily have to go through managed services to get that training. We'll provide HPC and other trainings on our website.
Ryan, we've heard about the AI supercluster at Meta that you've implemented.
I'm just curious, what was it like working with Meta where you already have people in place who
are, I assume, incredibly capable? So what was that relationship like? And maybe share an anecdote
or two about standing up that system. Yeah, you're aware, but they're pioneers in the area.
Some of the first people that really ventured into this. And just like any venture like that, you've got to figure
out things and develop them as you're going. A lot of it can be through trial and error,
but it's really been a pleasure to work with some top minds in the industry, both on the meta side
and on the Penguin side, architects and engineers who will come together and help figure
out what is the best solution, how much hardware to provide based on the demand that is projected,
what do we need to do, locations of those, and who's going to manage it. I think there have been some challenges of figuring out, sure, we can put some people on site for break fix, but they're not necessarily the same skill sets or the right people who are going to solve other issues, right? Operating
system or cluster availability, building a monitoring system and reporting capabilities
so that we can send out alerts and get quick responses to those. Some of the things we've
already talked about as well, making sure your spare part inventory is at the appropriate size. That's taken some significant work and science,
if you will, to get that to a point where it's appropriate, so that we don't have a lot of parts
sitting in a bucket or on a shelf somewhere that aren't being used and turned over, as opposed to
running out of parts that we need right away. One other question, really a requirement, is the kind of supply chain intimacy that you need to have to be able to take it all the way upstream, wherever the need might be. What sort of processes do you put in place to make sure that
things get escalated properly and backline support can be escalated back to the original
manufacturer, vendor? To what extent is that a
challenge, especially in the AI world where you do have a lot of rapidly changing technology?
Yeah. So we've actually had to have dedicated people for inventory management, for one.
Something that we thought maybe we could handle just through a normal process or perhaps through a tool, and we do use tools, don't get me wrong, but we found that we need a person tracking that and making sure that the inventory turnover is correct, making predictions
of changing needs. Maybe for a while, it's DIMMs that you need to keep on the shelf. But at some
point that turns over, based on manufacturer capabilities or whatever it is, and we need to keep another part at a higher rate. And so we
need somebody really tracking that. That same group of individuals also help make sure that
the RMA process is working as expected. It's actually our data center techs that will complete
the RMAs as they replace parts in servers, but they have to make sure that they fill all of
those steps out. They can't just
take a part out and put a new part in and walk away. We've got to make sure the part gets replenished.
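To make that replenishment loop concrete, here is a minimal sketch of the bookkeeping a tech might trigger when pulling a spare; the part names, stock levels, and reorder threshold are all hypothetical:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SpareBin:
    part: str
    on_hand: int
    reorder_point: int
    open_rmas: list[str] = field(default_factory=list)

    def pull_part(self, failed_serial: str) -> None:
        """Pull a spare for a failed component and log the RMA the same day."""
        if self.on_hand == 0:
            raise RuntimeError(f"No {self.part} spares on the shelf")
        self.on_hand -= 1
        # Same-day RMA submission keeps the vendor's replacement clock running.
        self.open_rmas.append(f"RMA-{date.today().isoformat()}-{failed_serial}")
        if self.on_hand < self.reorder_point:
            print(f"ALERT: {self.part} below reorder point ({self.on_hand} left)")

if __name__ == "__main__":
    dimms = SpareBin(part="DIMM 64GB", on_hand=4, reorder_point=4)
    dimms.pull_part("SN-12345")   # hypothetical failed-part serial number
    print(dimms.open_rmas)
```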
Ryan, one topic or point of discussion Shahin and I have shared several times is that big AI is, as Shahin says, really a form of HPC. And I think for some of us in the HPC community,
it's a little frustrating that HPC these days seems to be subsumed by AI. And now,
of course, Penguin comes out of this long HPC heritage, and you've parlayed that into this
HPC AI strategy. I mean, that must have been maybe almost a natural evolution for the company.
Yeah, I think it was. And a lot of it goes by demand, right? So it's really what are your
customers asking for? And the great big technology wave at the time right now is AI. And so it
certainly is a natural move for us. Yeah. I think Doug wants you to say we could not have done this
without our HPC background. Yes. Yeah, that's absolutely true. Yeah. You're right, Shahin. That's what I wanted
them to say. Well, actually, along those lines, I do have a question. And is there any difference
in how customers manage their applications, their SLAs, their approach to their infrastructure between the HPC sites
and the full-on AI sites?
Yeah, good question.
We're so involved in the AI at this point that you almost forget what the differences
may have been before AI.
And so I'd say GPUs is a big one, learning to deal with the GPUs, their failure rates,
how to care for those.
Sometimes some of our cooling techniques have had to change. A lot of people are starting to
use liquid cooling instead of air cooling. And I say a lot, that's not true. It's a very
small number that are doing that. But definitely growing. Yeah.
Yeah, absolutely. Well, one thing I noticed: my understanding is you all have four major components to your services, and you did go over design, build, deploy.
Yes.
And then there's manage.
Yeah.
That's the specific area that my department runs.
Design, build, deploy, and manage.
Those are four separate areas within Penguin that we do.
My specific area is the management.
And so I think there's one area that we didn't get into that might be worth going over a
little bit.
How do we stay up on top of the technologies?
How do we stay up on the latest trends and the best practices in the industry?
And I think this would be good and true for anybody out there in the industry.
But for Penguin, there's two things that have happened in my time here that have been really valuable to us.
One, we hold conferences and we just had
one last month where we invited a bunch of our vendors out and these included the big boys,
NVIDIA, AMD, Dell, of course, and then a lot of storage, processors, service providers.
We all got together and we had presentations where we actually spoke about where we're seeing
the industry going, what our customers are asking for, what's coming down the pipes, even if it's not available to sell right
now, that was invaluable. There were a couple of panels, one on processors and one on storage,
where we got to hear different companies talk about where their value proposition is and where
it's headed, what they see their customers needing. And from a Penguin point
of view, especially as we consult with customers, it's nice to be able to offer choices. We don't just have
one solution. So we can sit down and again, listen to your use case and help you determine
where is it that you're going to go and what would be the best technology for your needs at this time.
The second one for us is we've organized ourselves
into centers of excellence. And the managed services department is about 89 people at Penguin.
So it's growing at a significant rate. And the question is, how do we position ourselves to
continue to grow? And the centers of excellence has been our answer. So we have centers of excellence for storage, for networking, for system engineering. We have one called engagement managers. These are the people who help make sure that those engagements run well. And so we have leaders of each of these centers that really focus on best practices for their specific group. We might have, for example, a storage engineer who is assigned to a customer,
and they're the only storage engineer.
But we don't want to limit the technology and the processes that are being provided to a customer
just because they only need one person in a particular field.
So they are part of a center of excellence where they can draw back to a much larger group
and get information and feedback,
run problems by somebody else in their center of excellence to say, hey, here's the problem I'm
running into. What do we think? Because you can open tickets, you can go out and search the web,
but sometimes having experienced people that you can bounce things off of is invaluable.
And so we've created processes where they can onboard
and train with one another. They provide regular meetings with one another to talk about the latest
things going on. And I think that's been really valuable. Great stuff. Yeah. One final question
for me. How far up the stack do you go in managed services? What level of software do you stop at?
Yeah. So usually we go up to the orchestration tool. So generally,
what we do is we provide a system that is fully cluster available, and users are able to jump on
and start running jobs against the cluster. When you ask that question, I always have to say,
what don't we provide, right? And so really, it is the user who submits those jobs. We generally don't necessarily know what they're running.
We can see the jobs are running on the system, but we're not actually running the job.
We're not helping the users configure their jobs.
And so there is a software layer of what we provide that is actually running those jobs.
They run on the system. So we'll install whatever your orchestration tool is.
We'll install whatever your scheduler is so that you're ready to run them.
But that's where we stop.
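As a small illustration of where that hand-off sits, here is a sketch of the user-side step, assuming a generic Slurm-style scheduler; the job script, resource requests, and file names are hypothetical, while the scheduler underneath is what the managed layer installs and keeps healthy:

```python
import subprocess
import textwrap

# A user-authored batch script: the workload inside it is the user's concern,
# while the scheduler it is submitted to is installed and kept healthy by the
# managed-services layer.
job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=train
    #SBATCH --gres=gpu:8
    #SBATCH --time=04:00:00
    srun python train.py
    """)

with open("train.sbatch", "w") as f:
    f.write(job_script)

# Submit to the cluster's scheduler (assumes Slurm's sbatch is on the PATH).
subprocess.run(["sbatch", "train.sbatch"], check=True)
```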
Got it.
Excellent.
Thank you, Ryan.
This is a really evolving, interesting area with lots of complexity and a moving target.
So it's wonderful to have this kind of a service that absorbs a lot of those changes for you.
Yeah, it's been really fun speaking with you guys about this.
Hopefully you can tell it's a passion, not only for me, but for our company.
And we enjoy doing this.
Absolutely.
No, I think we've learned a lot from you and your colleagues and your experiences have
been very important to illuminate all of these things for us.
Yeah, wonderful.
Thank you so much.
Thanks so much, Ryan.
That's it for this episode of the @HPCpodcast.
Every episode is featured on InsideHPC.com and posted on OrionX.net.
Use the comments section or tweet us with any questions or to propose topics of discussion.
If you like the show, rate and review it on Apple Podcasts or wherever you listen.
The @HPCpodcast is a production of OrionX in association
with Inside HPC. Thank you for listening.