In The Arena by TechArena - AI-Era Computing Drives Innovation in Memory Interconnects

Episode Date: December 18, 2024

Explore how OCP’s Composable Memory Systems group tackles AI-driven challenges in memory bandwidth, latency, and scalability to optimize performance across modern data centers....

Transcript
Starting point is 00:00:00 Welcome to the Tech Arena, featuring authentic discussions between tech's leading innovators and our host, Alison Klein. Now, let's step into the arena. Welcome to the arena. My name is Alison Klein, and today we are revisiting a topic from the OCP Summit. I've got Nilesh Shah from ZeroPoint, Manoj Wadekar from Meta, and Reddy Chagam from Intel with me. They work on a really important project within the Open Compute Project
Starting point is 00:00:43 Consortium. Guys, welcome to the program. Thank you. Why don't we just go ahead and start with introductions. Nilesh, why don't you go first? Why don't you introduce your role and how you're related to the topic we're talking about today? Sure, Alison. So I'm with ZeroPoint Technologies. We primarily focus on compression, hardware-accelerated compression technologies for memory. And my role here is primarily to drive business development and engage hyperscaler companies, processor manufacturers, and memory manufacturers across the ecosystem to address this memory wall problem. And one of the ways to address it, of course, is by compressing the data that's stored in memory. I also engage the Open Compute Project, OCP, ecosystem to partner with different folks, develop solutions, and come up with new frameworks to apply these technologies.
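To make the compression point concrete, here is a minimal sketch of how a compression ratio translates into effective memory capacity; the ratio, capacity, and price below are hypothetical placeholders, not ZeroPoint product figures.

```python
# Minimal sketch: effective capacity from inline memory compression.
# The compression ratio and prices are hypothetical placeholders,
# not figures for any specific product.

def effective_capacity_gb(physical_gb: float, compression_ratio: float) -> float:
    """Usable capacity the host sees when resident data is stored compressed."""
    return physical_gb * compression_ratio

physical_gb = 512            # DRAM actually installed per node (assumed)
ratio = 1.5                  # average compressibility of resident data (assumed)
price_per_gb = 4.0           # $/GB of DRAM (assumed)

usable = effective_capacity_gb(physical_gb, ratio)
print(f"{physical_gb} GB physical -> ~{usable:.0f} GB effective")
print(f"Cost per usable GB drops from ${price_per_gb:.2f} "
      f"to ~${price_per_gb * physical_gb / usable:.2f}")
```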
Starting point is 00:01:37 That's awesome. Manoj, why don't you go next? Hi, this is Manoj Vadhikar. I'm at Meta. I work in a group that focuses on the AI systems and the technology that enables the AI system. So I focus primarily around memory and memory technologies, but also the memory interconnects, various ways it connects into our systems. I look at also the scale-up and scale-out interconnects as we look into building the AI systems of the future. So I tend to focus on how we see the AI systems requirements are growing, what pinpoints that we foresee, and how do we really solve those problems as we see that AI systems are expanding and increasing their requirements
Starting point is 00:02:15 and performance and capacities very fast. So we need to work with the ecosystems to make sure that we have the solutions before the problems become very critical and solve them before that. So that's part of the work that I do with JDAG bodies, as well as work with all these gentlemen in open compute platforms, OCP. We have a group called Composable Memory Systems. So we try various solutions and explore different opportunities that we have, just the challenges in the AI systems. That's my focus.
Starting point is 00:02:43 And Reddy, why don't you go ahead and introduce yourself? Yeah. Hi, this is Reddy Chagom. I work for Intel Corporation. I am in the data center AI group. I'm currently focusing on the large-scale AI and HPC system architecture. Primarily looking at different technologies and how do we enable large-scale AI solutions using Intel technologies.
Starting point is 00:03:03 As Manoj mentioned, I also co-lead the Composable Memory Systems in OCP project with Manoj. So that essentially is the focus for today. Fantastic. Now, the topic today is a really important one, and it's getting a lot of attention across the industry, especially as we move full steam ahead with AI era computing. It's a focus on interconnects memory and fabric capability. You guys collectively led a discussion on this at the recent OCP conference in the Bay Area. Why is this topic so vitally important right now? I can take it and I can start with, first of all, in general, AI systems are pushing the envelope on the amount of computation we require.
Starting point is 00:03:48 This goes beyond a single accelerator. It needs a large number of accelerators to work together, cooperate on a job, and that requires a large amount of memory. So this is driving the memory capacity and bandwidth needs accessible for each accelerator. Without proper memory and network capabilities ai jobs can take inordinately large time so if you think of a training job that runs across thousands of gpus if each gpu did not have enough amount of memory it would basically keep on expanding the need for number of gpus that we need to deploy the job will take longer time but also we start standing the expensive resource which is the gpu so the memory and memory fabric in general is a very critical functionality
Starting point is 00:04:28 and so is also the networking. So overall the topic of future of interconnects is very critical for AI as we spread a single job across multiple accelerators and it needs to collaborate on that interconnect with very high bandwidth and very low latency and sometimes a specific requirement that you have to access the memory. So this is very important, and these requirements continue to grow very fast. That is why you'll see so much discussion in the industry to how to address it very
Starting point is 00:04:53 quickly. I was just going to add to what Manoj said. With these new AI types of use cases, what we're seeing is, like Manoj said, it's not just a single compute element, it's multiple elements that need to coordinate. And that's where the interconnects actually become extremely critical. And a lot of these interconnects, they are feeding the memory systems of these compute accelerators. So in order to be power efficient and to scale reliably, and actually the reliability of these systems,
Starting point is 00:05:29 what Manoj mentioned in terms of completion times becomes critical. And for completion times, one of the key things is, can these systems actually coordinate in a reliable fashion without a lot of overhead? And that goes into how these memory interconnections are designed. Yeah, I agree with Nilesh and Manoj as well. The AI workloads like Manoj mentioned, as well as Nilesh, that it's not one GPU, right?
Starting point is 00:05:53 It is essentially a collection of GPUs working together to execute a specific set of training jobs. It could be one of them or it could be multiple of them. So any slow running GPU can actually slow down the entire training execution time. So from that perspective, it's not only having an interconnect, but it also needs high bandwidth, low latency and reliable, as Nilesh pointed out. Reliability is critical because if there is any jitter, any failure in the interconnect in the communication failures, whether transient or permanent,
Starting point is 00:06:25 can actually cause the training job to stop. And then you have to restart from the last checkpoint. From the OCP CMS perspective, our goal is to actually look at open interconnect solutions. This includes Compute Express Link, as well as the UA Link. And then, of course, on the scale out, Ultra Ethernet Consortium solutions as well for the UA link. And then of course, on the scale out, UltraEthernet Consortium solutions as well for the Ethernet based. So we need to look at all those elements and we need to bring all those elements as a holistic system view through CMS work. That's amazing. When I'm listening to you guys talking, you know, one thing that I think about is how everything has changed
Starting point is 00:07:02 with AI in terms of fundamentally changing the requirements of the platform. What is driving the change required in system interconnects and why is this so critical? I think Nilesh started with this kind of a little bit, right? We have in hyperscalers or actually in general in HPC world, we have lots of systems that are interacting together. A job is getting distributed across multiple systems but there is some significant difference between how it is versus ai in hyperscale workloads typically we have large number of small jobs that run in a stateless fashion on millions of independent systems and they could fail and in fact the failure is a part of the
Starting point is 00:07:41 design process that you can fail there is a whole robustness built into the software itself that allows redundancy and availability of the jobs and allows you to fail jobs will continue because that's how the whole jobs are run but in ai if you take a large training job if you will you're going to need compute power of lots of accelerators simultaneously to solve that training problem so when you're running across simultaneously across these multiple accelerators you really need to have very high bandwidth among them. We want to have low latency, so it is driving the interconnect requirements very high. But at the same time,
Starting point is 00:08:12 the high reliability that Nilesh and Reddy mentioned is extremely critical because all the jobs run for days and weeks sometimes, and any kind of a failure across any component, any kind of link, would mean that job needs to restart. It may not be right from the beginning, but at least from the last checkpoint. But it is a significant set of the overall time it takes as well as the cost and power
Starting point is 00:08:33 it requires. So this really requires a very detailed consideration for each component in the cluster. So we're talking about high bandwidth, we were talking about low latency as well as a very high reliability at a cluster level. So this really requires fundamental consideration as we look at the interconnect. This is why interconnects are divided into multiple concepts like scale up and scale out and what their specific role is there. And before we get into that, my point is basically, this is something that is unique for AI that is being driven because of the way, accelerator jobs for training would run or even for inference would run. Yeah, exactly.
Starting point is 00:09:09 I think as Manoj pointed out, traditional workloads, scale-out workloads like caching, big data type of workloads, all the software stacks have inbuilt failure recovery mechanism and high availability mechanism. So if there is any specific node that actually goes down, you're not actually interrupting the actual operations that are being actually serviced through that scale-out instantiation. Whereas in the AI, any single point of failure can actually stop the entire training job running on thousands of GPUs, any single point of failure. From that perspective, it is very important to look at the reliability
Starting point is 00:09:45 as well as essentially looking at the AI workload in its own unique way compared to the traditional scale-out workloads. I was just going to add to what Manoj and Reddy said is they focused on the training aspect, but there is also this whole inference that happens on these AI systems. And really the key metric for entrance is how many tokens can you get out of the system per second. And there's actually been a lot of published work from OpenAI and even from Meta and others, which points to the fact that the tokens per second is limited by the memory capacity and bandwidth. Because essentially every time you want to make an inference on, let's say, a single embedding that, let's say, a user enters a sentence.
Starting point is 00:10:34 So each time the AI model needs to be reloaded from the memory into the accelerator. And that actually then is the bottleneck for your inference tokens per second. So when you look at all these new accelerator companies coming out, even a lot of the GPU companies coming out or XPU, I'll just call them, they all are limited by the same challenge, which is how to get more tokens per second. And that is actually limited by their interconnect, bandwidth, connecting, memory, and even further down storage. Excellent point. Yep. Now, you guys talked about scale up and scale out fabrics. And obviously, there's a tremendous amount of standards works going on
Starting point is 00:11:17 in that space. Can you talk about why this is so critical? And when should we expect standards like UA-Link to start showing up in the market with products? Yeah, why is it critical, right? So I think Dilish kind of touched upon that a little bit. So irrespective of the type of workload for AI, that it is Gen AI or traditional AI workloads, they actually depend quite a bit on the memory bandwidth and latency, whether it is inference workload or training workload. So from that perspective, we have to look at what is the best way to address the memory bandwidth limitations. If you have high bandwidth memory or LPDDR within the GPU or the accelerator, you essentially get certain amount of bandwidth. But more and more, these workloads are looking for higher and higher bandwidth requirement.
Starting point is 00:12:08 So the question is going to be, how do we enable with the existing tiered architecture solutions through the existing open interconnect work that is happening through UA-Link and CXO? That will be the key focus, not only looking at the limitations within the accelerator, but also figuring out the best way to augment accelerator memory bandwidth through the interconnect, the memory bandwidth as well. And how do you position CXL? It just came out with a new spec today. How do you position CXL within this cornucopia of standards and technologies that are addressing interconnects required for this type of performance. Let me expand also a little bit just to set the context for different types of fabrics we have in different environments. For the AI world, we do have, if you think of the large problem that you're trying to solve,
Starting point is 00:12:59 at a very high level, you can take that large amount of data, divide it parallelly, data parallel, and then have those high bandwidth, high performance, small clusters connected to a large network through what is called a scale-up network. And you have a very high bandwidth among accelerators, which is where you're running tensor parallel or pipeline parallel, where accelerators are really communicating with each other with very high bandwidth, low latency expectation. So this is the scale-up part of it. So when we talk about cxl or ua link or in willing for that matter we are mostly talking about accelerators working with
Starting point is 00:13:31 each other in a tightly connected in a scale of network scale network tends to be largely ethernet there are few examples of infinite band also this is a high bandwidth low latency but it is relatively the scale of bandwidth expectations are much, much higher, but today the CXL has the right capabilities for enabling the memory expansion, like what Nilesh was mentioning. There is a limitation of the amount of memory you can have on a general purpose compute and how do we expand it and make it more heterogeneous memory expansion. That capability CXL is providing. That's where the use case is being driven primarily and
Starting point is 00:14:05 has a strong value there. As far as the AI interconnect is concerned, which allows you to connect interconnects accelerators together to access each other's memory or memory expansion, the bandwidth-wise comparison, if you consider UA-Link or NV-Link or CXL, I think CXL has a long way to go because it fundamentally depends on PCIe, which lacks in the per lane or overall surrogate speed as compared to most of these other state-of-the-art technologies like NVLink or UALink. But of course, it's only in their time. So we'll have to see CXL has the right memory semantic as does NVLink or UALink. But I think overall AI systems are in its infancy, even though it is very large right now, it's going to grow very fast. So I think there are a lot of technologies that are going
Starting point is 00:14:50 to come across and then enable UEC or Ethernet-based technologies, whether it's a RockUV2 or in general RDMA solutions, all of these are coming in the same space. What Brady mentioned was the important part is basically we want to see some standardization because it's very difficult to imagine when we get into hundreds of thousands of millions of accelerators talking to each other in a cluster. It's extremely difficult to imagine that in each data center we'll have exactly the same accelerators working with their own proprietary technologies, which would basically mean you're going to start seeing heterogeneous clusters, which would require at least some level of interoperability at a scale-out network. Scale-out networks are standard-wise, it is critical because this is where you can get very standard-based solutions, which you can start focusing on really building the top-level solutions rather than having each independent components being designed to very proprietary technology.
Starting point is 00:15:41 So I think it's very important to this point why we need standard solutions. Yeah. The other thing is having a standard scale-up interconnect solution. very proprietary technology. So I think it's very important to this point why we need standard solutions. Yeah. The other thing is having a standard scale-up interconnect solution. What it does is if a specific set of companies actually test this under volume,
Starting point is 00:15:54 the rest of the industry can benefit from it, primarily from the contributions back into the open source communities, through open source communities, publishing the collateral on how to actually run
Starting point is 00:16:03 these things under scale and in a reliable fashion. So the rest of the industry can benefit significantly as well because it's not only open, but few companies actually delivering this at scale and showcasing that it is viable and sharing those best practices back into the industry can go a long way in actually speeding up the AI deployment in the broader industry context. Yeah, just to add to what Reddy and Marot said, CXL is a pretty versatile interconnector. It can be used for both scale-up and scale-out scenarios. Now, in order to do that, of course,
Starting point is 00:16:39 there were some legacy elements that CXL builds on, namely PCIe. So it does provide a coherent interface for both Lend-Re and Accelerator access, where you can have a flexible and efficient way to connect these heterogeneous components together. Now, when you look at UA-Link, which is designed primarily for scale-up, and then you have UltraEthernet, for example, which is for scale out. So yeah, like Manoj said, you know, there will be diverse deployment scenarios and it's still early. So the question is, at least for now, it seems like all of these standards will coexist and solve for the different use cases and different PCO factors that need to be considered when enterprises or even hyperscale are deploying these AI systems at steel.
Starting point is 00:17:28 Yeah, just to add on to one additional thing, CXL being in the market for the last several years or so, you see an ecosystem that is actually very well established, at least from a simple memory expansion capability perspective. So we are going to see a bit of attraction in that space for not only the traditional scale-out type of workloads where we can actually benefit from additional memory capacity and bandwidth, but even for AI workloads, we will see the traction. Once ULink actually comes on board and we have the ecosystem readily providing those solutions, then it's going
Starting point is 00:18:02 to be essentially looking at those two implementations and creating what would be the complementary system architecture view that we need to drive towards through CMS. That's what we will focus on. Now, what else are we doing in terms of technology development when it comes to memory and getting the most performance out of memory? Is there anything else in the landscape of standards definitions that people should be aware of? One thing I can say is basically the component level, as you know, general GPUs made the high bandwidth memory HBMs most popular. And HBMs are growing in the bandwidth and capacity a lot.
Starting point is 00:18:38 We had seen the papers published at some point of time back, which is very well known about the AI memory wall. But if you look at it, actually, the way flops are increasing versus HBM bandwidth is increasing. HBM bandwidth is increasing at a pretty good pace now. So if you look at HBM 4 onwards, where the data bus has doubled and then the overall bandwidth has gone up significantly, even the capacities are going up significantly as you go from the 8 high stack to 12 to 16 in future. So I think HBM in general is a very interesting development in the memory technology perspective.
Starting point is 00:19:17 Having said that, the model sizes are increasing at such a fast pace that HBMs are not going to be enough for the overall capacity, at least from the inference view for the large language models, which would mean that I think we may be able to satisfy the bandwidth needs with HPM for the GPUs, but we may come short on the capacity needs, which requires us to have in that case, tiered memory solutions. And this is where we are looking at different solutions that allow you to tier memory and have memory accessed as a second tier memory as an expansion of the memory that gpu sees grace hopper has been a classic example where memory is tiered across and it is connected into the host memory so gpu's access host memory as a tier 2 memory but more solutions are going to come across people are going to continue to innovate where that second tier memory sets composable memory systems cms group in ocp there is a lot of healthy discussion not only discussion but also how do we prepare ourselves and how do you demonstrate this solution and how do we make sure that all the software
Starting point is 00:20:10 plumbing will be ready when we are ready for it orchestration and fabric orchestration would be ready so this is something going to be important for memory especially for the ai system but also for general purpose computing that what are these tier 2 memory solutions how tier 1 memory solutions continuity works and before i just head it off to the next one i think the key aspect that is going computing that what are these tier 2 memory solutions how tier 1 memory solutions continuity works and before i just head it up to the next one i think the key aspect that is going to be also important here is basically how the overall rest for this memory is going to be handled especially as we look into the higher speed memories and higher capacity memories we are going to see the reliability challenges for ai systems becoming key. So we want to make sure
Starting point is 00:20:45 that at a system level, we have the solution that makes sure we have the right vast capabilities taking care of even the silent data correction. So JEDEC and CMS and all the standards bodies are very actively working on it, especially the JEDEC part from the standards perspective. Yeah, totally agree. I think from the perspective of the CMS, OCP CMS work, we will be assuming the fact that tiering is needed, whether it is speeding up the training execution part by giving additional capacity or maybe bandwidth or a combination of those two with the second tier or even SSD storage capacity. Our goal is that we essentially assume that there is tiering needed for AI workloads,
Starting point is 00:21:25 training, inference, even inference, you want to actually support multiple jobs running on the same pool of accelerators. We do need to have that capability. So in general, the assumption starting point of our operational model in CMS is that tiering is needed, irrespective of what open fabric protocol we end up having. And the goal is to make sure that we support the plumbing to tap into multiple tiers of memory, whether it is the CPU or the accelerator or both of them needing to have that memory to essentially execute the job in a much more efficient way.
Starting point is 00:21:58 And that's where we are actually focusing on currently to be able to provide tiering solutions with the assumption that AI workloads do require that. Just a note to add, Alison, on the memory innovation. So innovation is happening along the lines, of course, interconnects like what we are discussing. There's innovation like Manoj said along the lines of RAS for higher capacity and tiered systems. There's also other innovation now possible because you can now have newer types of media. For example, FlashMedia, there was a lot of companies experimenting with mixing their DRAM technology with FlashMedia, for example,
Starting point is 00:22:39 to get to the capacity and the cost profile as possible. There are other companies that are innovating with emerging memories or re-emerging memory technologies like MRAM, meter-resistant RAM, to supplement the limitations that Manoj mentioned. For example, HBN memories might have. As you keep stacking these memories for capacity, the yield starts to get impactive. And what that translates into is cost. If your yield goes down with every additional stack or memory layer that you add on,
Starting point is 00:23:12 that in the end translates to a higher sticker price. So yeah, there are several innovations possible. And then that would go be targeted towards these different use cases. One is, of course, the hyperscale use case. Then there's the enterprise use case, and then there's the rest of the world. So yeah, we have to consider how are we solving, through standards, all of these different markets. You guys are running an important work group within OCP. How do folks get involved in the work that you guys are driving and engage with you to find out more? Yeah, we have weekly meetings. It's essentially open for everyone to join.
Starting point is 00:23:52 When you go to the OCP website and look for CMS, we have a wiki page and we have actually the calendar that is open. So anyone can actually join. We normally see around anywhere between 40 to 50 attendees on a weekly basis. So it's a fairly active working group. Do you agree, Anilash, that we essentially get very good crowd every Friday, lots of interesting discussions, but it's all open. I think it's very important that multiple contributions people are bringing forward. For Aqabat, CXL, we talked about ULing.
Starting point is 00:24:24 People are bringing forward contributions there to make it open collaboration. OCP saw significant number of products demonstrating the capability, CXL capability and interoperability, which shows the readiness of the technology compression was demonstrated by companies. So I think all these things are being discussed in CMS. So if people have interested in understanding where the technologies are going for memory and memory interconnect perspective, we want to contribute to the directions, but also bring in their product level interactions. I think CMS is
Starting point is 00:24:53 very open, collaborative environment that they should definitely join and contribute. Yeah, I just want to applaud within the CMS or composable memory systems work group, there are many subgroups as well. So for example, there's a workload-focused workgroup and there's a computational programming workgroup. So these are great communities because it's not easy to find people on a weekly basis. So Alison, think of these OCP workgroups as literally having a mini conference every week and where else would you get that? So I think that's been the real benefit is you get this feedback from this broad ecosystem, which includes hyperscalers, processor, memory, manufacturer, startups, and it's a very inclusive
Starting point is 00:25:38 community. So it doesn't matter if you're a large company or a small startup, you get a chance to present your case and then get feedback from the broad community. So I would really encourage people to consider attending, joining their Composable Memory World Group, but then also these other subgroups where they can come in and really make a difference and get the benefit of peer-driven contributions on collective intentions. That's awesome. Thanks, guys. This has been a fantastic discussion. I've learned so much, and I know that this is a topic that is going to be a center of focus in 2025.
Starting point is 00:26:15 Thank you for laying down the foundation for everyone listening. I'd love to have you back on the program again. Thanks a lot, Melissa. Thanks for having us. Thank you. Thank you, Alistair. Thanks for joining the Tech Arena. Subscribe and engage at our website, thetecharena.net. All content is copyright by the Tech Arena.
