In The Arena by TechArena - AI-Era Computing Drives Innovation in Memory Interconnects
Episode Date: December 18, 2024. Explore how OCP’s Composable Memory Systems group tackles AI-driven challenges in memory bandwidth, latency, and scalability to optimize performance across modern data centers.
Transcript
Welcome to the Tech Arena,
featuring authentic discussions between
tech's leading innovators and our host, Alison Klein.
Now, let's step into the arena.
Welcome to In the Arena.
My name is Alison Klein, and today we are revisiting a topic
from the OCP Summit. I've got Nilesh Shah from ZeroPoint, Manoj Wadekar from Meta, and Reddy
Chagam from Intel with me. They work on a really important project within the Open Compute Project
Consortium. Guys, welcome to the program. Thank you. Why don't we just go ahead and start with introductions.
Nilesh, why don't you go first? Why don't you introduce your role and how you're related to
the topic we're talking about today? Sure, Alison. So I'm with ZeroPoint Technologies.
We primarily focus on hardware-accelerated compression technologies for memory.
My role here is primarily to drive business development and to engage hyperscalers, processor manufacturers, and memory manufacturers across the ecosystem to address this memory wall problem.
One of the ways to address it, of course, is by compressing the data that's stored in
memory. I also engage the Open Compute Project (OCP) ecosystem to partner with different folks,
develop solutions, and come up with new frameworks to apply these technologies.
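To make the compression idea concrete, here is a rough software analogue of what Nilesh describes: measuring how much an in-memory buffer shrinks under a general-purpose compressor. This is a minimal sketch using Python's zlib, purely for illustration; ZeroPoint's products are hardware accelerators with different algorithms and latency characteristics, and the buffers below are made-up examples.

    import os
    import zlib

    def compression_ratio(buf: bytes, level: int = 1) -> float:
        """Return original_size / compressed_size for one buffer."""
        return len(buf) / len(zlib.compress(buf, level))

    # A page of repetitive data compresses well; random data barely compresses at all.
    repetitive_page = b"\x00" * 3072 + b"pattern" * 146   # roughly 4 KiB, highly compressible
    random_page = os.urandom(4096)                        # effectively incompressible

    print(f"repetitive page: {compression_ratio(repetitive_page):.1f}x")
    print(f"random page:     {compression_ratio(random_page):.2f}x")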
That's awesome. Manoj, why don't you go next?
Hi, this is Manoj Wadekar. I'm at Meta. I work in a group that focuses on AI systems
and the technology that enables those systems. So I focus primarily on memory and memory
technologies, but also the memory interconnects, the various ways memory connects into our systems.
I also look at the scale-up and scale-out interconnects as we look into building the
AI systems of the future. So I tend to focus on how the AI systems' requirements are growing,
what pain points we foresee, and how we really solve those problems
as AI systems expand and increase their requirements
for performance and capacity very fast.
We need to work with the ecosystem to make sure that we have the solutions
ready before the problems become critical.
That's part of the work I do with standards bodies like JEDEC, as well as with these gentlemen in the Open Compute Project, OCP.
We have a group called Composable Memory Systems,
where we try various solutions and explore different opportunities to address the challenges
in AI systems.
That's my focus.
And Reddy, why don't you go ahead and introduce yourself?
Yeah.
Hi, this is Reddy Chagam.
I work for Intel Corporation.
I am in the data center AI group.
I'm currently focusing on the large-scale AI and HPC system architecture.
I'm primarily looking at different technologies and how we enable large-scale AI
solutions using Intel technologies.
As Manoj mentioned, I also co-lead the Composable Memory Systems project in OCP with Manoj.
So that essentially is the focus for today.
Fantastic.
Now, the topic today is a really important one, and it's getting a lot of attention across
the industry, especially as we move full steam ahead with AI-era computing. It's a focus on interconnects,
memory, and fabric capability. You guys collectively led a discussion on this at the recent OCP
conference in the Bay Area. Why is this topic so vitally important right now?
I can take that and start. First of all, in general, AI systems are pushing the envelope on the amount of computation we require.
This goes beyond a single accelerator.
It needs a large number of accelerators to work together and cooperate on a job, and that requires a large amount of memory.
So this is driving the memory capacity and bandwidth that needs to be accessible to each accelerator.
Without proper memory and network capabilities, AI jobs can take an inordinately long time.
If you think of a training job that runs across thousands of GPUs,
if each GPU did not have enough memory, you would basically keep expanding the
number of GPUs you need to deploy. The job will take longer, but you also start stranding
the expensive resource, which is the GPU. So memory and the memory fabric in general are a very critical function,
and so is the networking.
So overall, the topic of the future of interconnects is very critical for AI,
as we spread a single job across multiple accelerators
and they need to collaborate over that interconnect
with very high bandwidth and very low latency,
and sometimes specific requirements on how you access the memory.
So this is very important, and these requirements continue to grow very fast.
That is why you'll see so much discussion in the industry about how to address it
quickly.
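A back-of-envelope sketch of the point Manoj is making: for a fixed training-state footprint, less usable memory per GPU directly inflates the minimum number of GPUs needed just to hold the job. All figures below are hypothetical, chosen only to illustrate the scaling.

    import math

    def min_gpus_for_capacity(total_state_gb: float, usable_gb_per_gpu: float) -> int:
        """Smallest GPU count whose combined memory can hold the training state."""
        return math.ceil(total_state_gb / usable_gb_per_gpu)

    total_state_gb = 8000                # weights + optimizer state + activations (assumed)
    for usable_gb in (40, 80, 120):      # assumed usable memory per GPU after overheads
        gpus = min_gpus_for_capacity(total_state_gb, usable_gb)
        print(f"{usable_gb} GB usable per GPU -> at least {gpus} GPUs just for capacity")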
I was just going to add to what Manoj said.
With these new AI types of use cases, what we're seeing is, like Manoj said, it's not
just a single compute element,
it's multiple elements that need to coordinate. And that's where the interconnects become
extremely critical. A lot of these interconnects are feeding the memory
systems of these compute accelerators. So they need to be power efficient and to scale reliably,
and the reliability of these systems,
what Manoj mentioned in terms of completion times, becomes critical.
And for completion times, one of the key questions is:
can these systems coordinate in a reliable fashion without a lot of overhead?
And that goes into how these memory interconnects are designed.
Yeah, I agree with Nilesh and Manoj as well.
With the AI workloads, like Manoj mentioned,
as well as Nilesh,
it's not one GPU, right?
It is essentially a collection of GPUs working together to execute
a specific set of training jobs.
It could be one of them
or it could be multiple of them.
So any slow-running GPU
can actually slow down the entire training execution time.
From that perspective, it's not only about having an interconnect; it also needs to be high bandwidth, low latency, and reliable, as Nilesh pointed out.
Reliability is critical because any jitter or failure in the interconnect communication, whether transient or permanent,
can actually cause the training job to stop.
And then you have to restart from the last checkpoint.
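The restart-from-checkpoint pattern Reddy describes can be sketched minimally as follows. This is a generic illustration in plain Python; the file name, checkpoint interval, and the train_step placeholder are assumptions, not any particular framework's API.

    import os
    import pickle

    CKPT_PATH = "checkpoint.pkl"
    CHECKPOINT_EVERY = 100

    def load_or_init_state():
        if os.path.exists(CKPT_PATH):
            with open(CKPT_PATH, "rb") as f:
                return pickle.load(f)          # resume from the last checkpoint
        return {"step": 0, "model": {}}        # fresh start

    def save_state(state):
        tmp = CKPT_PATH + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CKPT_PATH)             # atomic swap so a crash mid-write is safe

    state = load_or_init_state()
    while state["step"] < 10_000:
        # train_step(state)                    # placeholder for the real training work
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_state(state)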
From the OCP CMS perspective, our goal is to actually look at open interconnect solutions.
This includes Compute Express Link (CXL) as well as UALink.
And then, of course, on the scale-out side, Ultra Ethernet Consortium solutions
for the Ethernet-based fabrics. So we need to look at all those elements and bring
them together as a holistic system view through the CMS work. That's amazing. When I'm
listening to you guys talking, you know, one thing that I think about is how everything has changed
with AI in terms of fundamentally changing the requirements
of the platform. What is driving the change required in system interconnects and why is this
so critical? I think Nilesh started on this a little bit, right? In hyperscalers,
or actually in the HPC world in general, we have lots of systems that are interacting together.
A job is getting distributed across
multiple systems, but there is a significant difference between how that works versus AI. In
hyperscale workloads, we typically have a large number of small jobs that run in a stateless fashion on
millions of independent systems, and they could fail. In fact, failure is part of the
design process: there is a whole robustness built into the software itself that allows redundancy and availability, so
jobs will continue when something fails, because that's how those jobs are run. But in AI, if you take a large
training job, you're going to need the compute power of lots of accelerators simultaneously
to solve that training problem. So when you're running simultaneously across these
multiple accelerators, you really need very high bandwidth among them.
We want to have low latency,
so it is driving the interconnect requirements very high.
But at the same time,
the high reliability that Nilesh and Reddy mentioned
is extremely critical,
because these jobs sometimes run for days and weeks,
and any kind of failure across any component,
any kind of link,
would mean that the job needs to restart.
It may not be right from the beginning, but at least from the last checkpoint.
But that is a significant share of the overall time it takes, as well as the cost and power
it requires.
So this really requires very detailed consideration for each component in the cluster.
We're talking about high bandwidth, low latency, as well as
very high reliability at the cluster level.
So this really requires fundamental consideration as we look at the interconnect.
This is why interconnects are divided into multiple concepts like scale-up and scale-out, and what their specific roles are.
Before we get into that, my point is basically that this is something unique to AI, driven by the way accelerator jobs for training, or even for inference, run.
Yeah, exactly.
I think as Manoj pointed out, with traditional scale-out workloads like caching and
big data types of workloads, all the software stacks have built-in failure recovery
and high availability mechanisms.
So if there is any specific node that goes down, you're not
interrupting the operations that are being serviced through that scale-out
instantiation. Whereas in AI, any single point of failure can stop the entire training
job running on thousands of GPUs. From that perspective, it is very
important to look at the reliability,
as well as to look at the AI workload in its own unique way compared to the traditional
scale-out workloads. I was just going to add to what Manoj and Reddy said: they focused on the
training aspect, but there is also this whole inference side that happens on these AI systems.
And really the key metric for inference
is how many tokens you can get out of the system per second. And there's actually been a lot of
published work from OpenAI and even from Meta and others, which points to the fact that tokens
per second is limited by memory capacity and bandwidth. Because essentially every time you want to make an inference,
let's say a user enters a sentence,
the AI model needs to be reloaded from memory into the accelerator.
And that then is the bottleneck for your inference tokens per second.
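A rough way to quantify that bottleneck, under the simplifying assumption that generating each token streams the model weights through the accelerator once: tokens per second is bounded by memory bandwidth divided by the bytes of weights. The numbers below are illustrative assumptions and ignore KV-cache traffic and batching.

    def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                           mem_bw_gb_s: float) -> float:
        weight_gb = params_billion * bytes_per_param   # GB of weights read per generated token
        return mem_bw_gb_s / weight_gb

    # 70B-parameter model in 8-bit weights on a ~3 TB/s HBM part (illustrative figures):
    print(f"~{max_tokens_per_sec(70, 1.0, 3000):.0f} tokens/s upper bound per accelerator")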
So when you look at all these new accelerator companies coming out, even a lot of the GPU
companies, or XPUs, I'll just call them, they are all limited by the same challenge,
which is how to get more tokens per second. And that is actually limited by the interconnect
bandwidth connecting to memory,
and even further down to storage. Excellent point. Yep. Now, you guys talked about scale-up and
scale-out fabrics. And obviously, there's a tremendous amount of standards work going on
in that space. Can you talk about why this is so critical? And when should we expect standards like UALink to start
showing up in the market with products? Yeah, why is it critical, right? So I think
Nilesh kind of touched upon that a little bit. So irrespective of the type of workload for AI,
whether it is GenAI or traditional AI workloads, they actually depend quite a bit on memory
bandwidth and latency, whether it is an inference workload or a training workload. So from that perspective, we have to
look at what is the best way to address the memory bandwidth limitations. If you have high bandwidth
memory or LPDDR within the GPU or the accelerator, you essentially get a certain amount of bandwidth.
But more and more, these workloads are looking for higher and higher bandwidth.
So the question is going to be, how do we enable tiered architecture solutions
through the open interconnect work that is happening in UALink and CXL?
That will be the key focus: not only looking at the limitations within the accelerator,
but also figuring out the best way to augment the accelerator's memory bandwidth through the interconnect as well.
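One way to picture the interconnect's role here is a simple, pessimistic two-tier bandwidth model: if a fraction of the working set spills over the interconnect to a far memory tier, the effective bandwidth depends heavily on that link's speed. The numbers below are assumptions for illustration, not measurements of any product.

    def effective_bandwidth(total_gb: float, far_fraction: float,
                            hbm_gb_s: float, link_gb_s: float) -> float:
        """Effective GB/s for a working set split across HBM and a far tier,
        assuming the two transfers do not overlap (a pessimistic simplification)."""
        near_gb = total_gb * (1.0 - far_fraction)
        far_gb = total_gb * far_fraction
        time_s = near_gb / hbm_gb_s + far_gb / link_gb_s
        return total_gb / time_s

    for frac in (0.0, 0.1, 0.3):
        bw = effective_bandwidth(total_gb=100, far_fraction=frac,
                                 hbm_gb_s=3000, link_gb_s=200)
        print(f"{int(frac * 100)}% of traffic to the far tier -> ~{bw:.0f} GB/s effective")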
And how do you position CXL? It just came out with a new spec today.
How do you position CXL within this cornucopia of standards and technologies that are addressing the interconnects required for this type of performance?
Let me also expand a little bit just to set the context for the different types of fabrics we have in different environments.
For the AI world, if you think of the large problem that you're trying to solve,
at a very high level, you can take that large amount of data and divide it in a data-parallel way, and then have those high-bandwidth, high-performance small clusters connected
to a large network through what is called a scale-out network.
And within each cluster you have very high bandwidth among accelerators, which is where you're running
tensor parallel or pipeline parallel, where accelerators are really communicating
with each other with very high bandwidth and low latency expectations.
So this is the scale-up part of it.
So when we talk about CXL
or UALink or NVLink for that matter, we are mostly talking about accelerators working with
each other, tightly connected, in a scale-up network. The scale-out network tends to be largely
Ethernet; there are a few examples of InfiniBand also. This is also high bandwidth and low latency, but
relatively, the scale-up bandwidth expectations are much, much higher. Today CXL has the right
capabilities for enabling memory expansion, like what Nilesh was mentioning.
There is a limitation on the amount of memory you can have on a general-purpose
compute node, and CXL provides the capability to expand it and make it a more heterogeneous memory expansion.
That's where the use case is being driven primarily, and it
has a strong value there. As far as the AI interconnect is concerned, which allows you to
connect accelerators together to access each other's memory or memory expansion,
in a bandwidth-wise comparison, if you consider UALink or NVLink versus CXL, I think CXL has a
long way to go, because it fundamentally depends on PCIe, which lags in per-lane and overall aggregate speed compared to most of these other state-of-the-art technologies like NVLink or UALink.
But of course, that may only be a matter of time.
So we'll have to see; CXL has the right memory semantics, as do NVLink and UALink.
But I think overall, AI systems are in their infancy; even though they are very large right
now, they're going to grow very fast. So I think there are a lot of technologies that are going
to come along and enable this: UEC or Ethernet-based technologies, whether it's RoCE v2 or
RDMA solutions in general, all of these are coming into the same space. What Reddy mentioned is the
important part: basically, we want to see some standardization, because when we get into hundreds of thousands or millions
of accelerators talking to each other in a cluster,
it's extremely difficult to imagine that in each data center we'll have exactly the same
accelerators working with their own proprietary technologies. That would basically mean
you're going to start seeing heterogeneous clusters, which would require at least some level of interoperability, at least at the scale-out network.
For scale-out networks, standards are critical, because that is where you can get standards-based solutions and start focusing on really building the top-level solutions, rather than having each independent component designed around a very proprietary technology.
So I think that's why it's very important that we have standard solutions.
Yeah.
The other thing is having a standard
scale-up interconnect solution.
What it does is,
if a specific set of companies
actually tests this at volume,
the rest of the industry
can benefit from it,
primarily through the contributions
back into the open source communities:
publishing the collateral
on how to actually run
these things at scale and in a reliable fashion. So the rest of the industry can benefit significantly as well,
because it's not only open, but a few companies actually delivering this at scale, showcasing
that it is viable, and sharing those best practices back with the industry can go a long way in
speeding up AI deployment in the broader industry context.
Yeah, just to add to what Reddy and Manoj said,
CXL is a pretty versatile interconnect.
It can be used for both scale-up and scale-out scenarios.
Now, in order to do that, of course,
there are some legacy elements that CXL builds on,
namely PCIe.
So it does provide a coherent interface for both memory and accelerator access, where you have a flexible and efficient way to connect these
heterogeneous components together. Then, when you look at UALink, which is designed primarily for
scale-up, you also have Ultra Ethernet, for example, which is for scale-out. So yeah,
like Manoj said, you know, there will be diverse deployment scenarios and it's still early. So
at least for now, it seems like all of these standards will coexist and solve for
the different use cases and the different TCO factors that need to be considered when enterprises or even hyperscalers are deploying these AI systems at scale.
Yeah, just to add one additional thing: CXL having been
in the market for the last several years or so, you see an ecosystem that
is actually very well established, at least from a simple memory
expansion capability perspective. So we are going to see a bit of
traction in that space, not
only for the traditional scale-out types of workloads, where we can benefit from additional
memory capacity and bandwidth, but even for AI workloads we will see the traction. Once UALink
actually comes on board and we have the ecosystem readily providing those solutions, then it's going
to be essentially about looking at those two implementations and creating what would be the complementary system architecture view that we need to drive
towards through CMS. That's what we will focus on. Now, what else are we doing in terms of
technology development when it comes to memory and getting the most performance out of memory?
Is there anything else in the landscape of standards definitions that people should be
aware of?
One thing I can say is, at the component level, as you know, GPUs in general made
high bandwidth memory, HBM, most popular.
And HBM is growing in bandwidth and capacity a lot.
We had seen papers published some time back, which are very well known, about
the AI memory wall.
But if you look at the way flops are increasing versus the way HBM bandwidth is increasing,
HBM bandwidth is increasing at a pretty good pace now.
If you look at HBM4 onwards, where the data bus has doubled, the overall
bandwidth has gone up significantly, and even the capacities are going up significantly
as you go from 8-high stacks to 12 and 16 in the future.
So I think HBM in general is a very interesting development from the memory technology perspective.
Having said that, model sizes are increasing at such a fast pace that HBM is not going to be enough for the overall capacity, at least from the inference view for large
language models. That means we may be able to satisfy the bandwidth needs with HBM for the GPUs, but we may come up short on the capacity needs, which requires us to have tiered memory solutions in that case.
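A quick back-of-envelope for that capacity argument, with purely illustrative figures: the inference footprint is roughly the weights plus the KV cache, and once it exceeds the HBM on a single accelerator, the remainder has to come from more accelerators or a second memory tier.

    weights_gb = 400 * 2.0        # assumed 400B parameters at 2 bytes per parameter
    kv_cache_gb = 200             # assumed KV-cache budget for in-flight requests
    footprint_gb = weights_gb + kv_cache_gb
    hbm_per_accelerator_gb = 144  # assumed per-accelerator HBM capacity

    shortfall_gb = footprint_gb - hbm_per_accelerator_gb
    print(f"inference footprint ~{footprint_gb:.0f} GB vs {hbm_per_accelerator_gb} GB of local HBM")
    print(f"~{shortfall_gb:.0f} GB has to come from more accelerators or a second memory tier")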
And this is where we are looking at different solutions that allow you to tier memory, and have memory accessed as a second tier, as an expansion of the memory that the GPU sees. Grace Hopper has been a classic example,
where memory is tiered and connected into the host memory, so the GPU accesses host memory
as a tier-2 memory. But more solutions are going to come along, and people are going to continue to
innovate on where that second-tier memory sits. In the Composable Memory Systems, CMS, group in OCP,
there is a lot of healthy discussion, not only discussion but also how we prepare
ourselves, how we demonstrate these solutions, and how we make sure that all the software
plumbing will be ready when we are ready for it, that orchestration and fabric orchestration will be
ready. So this is going to be important for memory, especially for AI systems but also
for general-purpose computing: what are these tier-2 memory solutions, and how does continuity between tier-1 and tier-2 memory solutions
work? And before I hand it off to the next one, I think the key aspect that is also going to be
important here is basically how the overall RAS for this memory is going to be handled,
especially as we look into higher-speed and higher-capacity memories.
We are going to see the reliability challenges for AI systems becoming key. So we want to make sure
that at a system level, we have solutions with the right RAS capabilities,
taking care of even silent data corruption. So JEDEC and CMS and all the standards bodies
are very actively working on it, especially JEDEC from the standards perspective.
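The tier-1/tier-2 placement idea discussed above can be sketched, very loosely, as a hot/cold policy: keep recently touched pages in the fast tier and demote the coldest to a larger, slower tier such as CXL-attached memory. The sketch below is a conceptual illustration only, not the CMS group's design or any operating system's actual tiering policy.

    from collections import OrderedDict

    class TwoTierCache:
        def __init__(self, fast_capacity: int):
            self.fast = OrderedDict()   # page_id -> data, kept in LRU order (fast tier)
            self.slow = {}              # overflow tier (larger, slower memory)
            self.fast_capacity = fast_capacity

        def access(self, page_id, data=None):
            if page_id in self.fast:                  # hot hit: refresh recency
                self.fast.move_to_end(page_id)
            else:                                     # promote from slow tier, or insert new
                value = self.slow.pop(page_id, data)
                self.fast[page_id] = value
                if len(self.fast) > self.fast_capacity:
                    cold_id, cold_val = self.fast.popitem(last=False)
                    self.slow[cold_id] = cold_val     # demote the coldest page
            return self.fast[page_id]

    tiers = TwoTierCache(fast_capacity=2)
    for page in ["a", "b", "c", "a"]:
        tiers.access(page, data=f"contents of {page}")
    print("fast tier:", list(tiers.fast), "| slow tier:", list(tiers.slow))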
Yeah, totally agree. I think from the perspective of the OCP CMS work, we will be assuming
that tiering is needed, whether it is for speeding up the training execution by giving
additional capacity, or maybe bandwidth, or a combination of those two with a second tier
or even SSD storage capacity. Our goal is to assume that tiering is needed
for AI workloads,
training and inference; even for inference, you want to support multiple jobs running on the
same pool of accelerators, and we do need to have that capability. So in general, the
starting assumption of our operational model in CMS is that tiering is needed, irrespective of what
open fabric protocol we end up having. And the goal is to make sure that we support the plumbing
to tap into multiple tiers of memory,
whether it is the CPU or the accelerator
or both of them needing that memory
to essentially execute the job in a much more efficient way.
And that's where we are currently focusing,
on being able to provide tiering solutions
with the assumption that AI workloads do require them. Just a note to add, Alison, on memory innovation. Innovation
is happening, of course, along the lines of interconnects, like what we are discussing.
There's innovation, like Manoj said, along the lines of RAS for higher-capacity and tiered systems. There's also other innovation now
possible because you can now have newer types of media. For example, flash media: there are a lot
of companies experimenting with mixing their DRAM technology with flash media, for example,
to get to the capacity and cost profile that's possible. There are other companies that are innovating with emerging,
or re-emerging, memory technologies like MRAM, magnetoresistive RAM,
to supplement the limitations that Manoj mentioned,
for example, that HBM memories might have.
As you keep stacking these memories for capacity,
the yield starts to get impacted.
And what that translates
into is cost. If your yield goes down with every additional memory layer that you add to the stack,
that in the end translates to a higher sticker price. So yeah, there are several innovations
possible, and they would be targeted towards these different use cases. One is,
of course, the hyperscale use case. Then there's the enterprise use case, and then there's the rest of the world. So yeah,
we have to consider how we are solving, through standards, for all of these different markets.
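A small worked example of the stacking-yield point Nilesh makes: if each added layer yields independently at some rate, the yield of the whole stack compounds downward, and the cost per good part rises accordingly. The 95% per-layer figure below is purely illustrative.

    per_layer_yield = 0.95                     # assumed independent yield per stacked layer
    for layers in (8, 12, 16):
        stack_yield = per_layer_yield ** layers
        relative_cost = 1.0 / stack_yield      # cost per good part, normalized to perfect yield
        print(f"{layers}-high stack: yield ~{stack_yield:.0%}, "
              f"relative cost per good part ~{relative_cost:.2f}x")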
You guys are running an important work group within OCP. How do folks get involved in the
work that you guys are driving and engage with you to find out more?
Yeah, we have weekly meetings.
It's essentially open for everyone to join.
When you go to the OCP website and look for CMS, we have a wiki page and we have actually the calendar that is open.
So anyone can actually join.
We normally see anywhere between 40 and 50 attendees on a weekly basis.
So it's a fairly active working group.
Do you agree, Nilesh, that we essentially get a very good crowd every Friday,
with lots of interesting discussions? And it's all open.
I think it's very important that people are bringing forward multiple contributions.
For example, CXL, and we talked about UALink.
People are bringing forward contributions there to make it an open collaboration.
OCP saw a significant number of products demonstrating CXL
capability and interoperability, which shows the readiness of the technology;
compression was demonstrated by companies as well.
So I think all these things are being discussed in CMS.
So if people are interested in understanding where the technologies
are going from a memory and memory interconnect perspective, and want to contribute
to the directions but also bring in their product-level interactions, I think CMS is
a very open, collaborative environment that they should definitely join and contribute to.
Yeah, I just want to add that within the CMS, or Composable Memory Systems, workgroup,
there are many subgroups as well. So for example,
there's a workload-focused workgroup and there's a computational programming workgroup. These
are great communities, because it's not easy to find people on a weekly basis. So Alison,
think of these OCP workgroups as literally having a mini conference every week, and where else would you get that?
So I think the real benefit is that you get this feedback from this broad ecosystem,
which includes hyperscalers, processor and memory manufacturers, and startups, and it's a very inclusive
community. So it doesn't matter if you're a large company or a small startup; you get a chance to
present your case and then get feedback from the broad community.
So I would really encourage people to consider attending and joining the Composable Memory Systems workgroup, but also these other subgroups, where they can come in and really make a difference and get the benefit of peer-driven contributions and collective effort.
That's awesome.
Thanks, guys.
This has been a fantastic discussion.
I've learned so much, and I know that this is a topic that is going to be a center of
focus in 2025.
Thank you for laying down the foundation for everyone listening.
I'd love to have you back on the program again.
Thanks a lot, Alison.
Thanks for having us.
Thank you.
Thank you, Alison. Thanks for joining the Tech Arena. Subscribe and engage at our website,
thetecharena.net. All content is copyright by the Tech Arena.