Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 4x19: Enabling Mass Adoption of CXL-Attached Memory with Rambus

Episode Date: March 13, 2023

Moving memory and other resources off the system bus to CXL is exciting, but how do we ensure that these systems will be reliable, available, and serviceable? This episode of Utilizing Tech features Mark Orthodoxou, VP of Strategic Marketing for Datacenter Products at Rambus, discussing with Stephen Foskett and Craig Rodgers the technology and standards required for mass adoption of CXL-attached memory. Rambus brings decades of experience and a breadth of technology to the deployment of memory in high-performance and highly available systems. As a CXL Consortium member, Rambus is bringing this experience to CXL, enabling the technology across the ecosystem. Memory expansion with CXL is being deployed today, and memory pooling over CXL fabrics is coming, but it is disaggregation and rack-scale architecture that will ultimately be the result of the adoption of CXL.

Hosts:
Stephen Foskett: https://www.twitter.com/SFoskett
Craig Rodgers: https://www.twitter.com/CraigRodgersms

Rambus Representative:
Mark Orthodoxou, VP of Strategic Marketing - Datacenter Products: https://www.linkedin.com/in/mark-orthodoxou-94b189/

Follow Gestalt IT and Utilizing Tech:
Website: https://www.UtilizingTech.com/
Website: https://www.GestaltIT.com/
Twitter: https://www.twitter.com/GestaltIT
LinkedIn: https://www.linkedin.com/company/1789

Tags: #UtilizingCXL, #CXLAttachedMemory, #Datacenter, @RambusInc, @UtilizingTech

Transcript
Starting point is 00:00:00 Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT. This season of Utilizing Tech focuses on Compute Express Link, or CXL, a new technology that promises to revolutionize enterprise computing. I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT. Joining me today as co-host is Craig Rodgers. Hi, Stephen. Good to see you again. Are you looking forward to talking about what Rambus are doing in the CXL space? Excellent. Yes, indeed I am.
Starting point is 00:00:37 Many of us are very familiar with Rambus over the years. We've encountered the company, and I think a lot of us know that Rambus is responsible for some of the basic structure of how server memory is constructed these days. But I think some people might have some questions about sort of where that goes into CXL. But I mean, if you think about it, it makes a lot of sense that a company like Rambus would be in the CXL space, because in order to make this stuff real, there's going to need to be a lot of focus on the things that they have dealt with in the past, reliability, availability, serviceability, security, basically making this thing enterprise grade, right?
Starting point is 00:01:19 Absolutely. Given Rambus's historical exposure and experience with memory and CXL first leveraging memory to enter the market, it makes absolute sense. So let's then bring in our guest. Welcome, Mark Orthodoxou, VP of Strategic Marketing for Rambus. Welcome to the show. It's good to have you here. Thanks, Stephen. Yeah, like you said, my name is Mark Orthodoxy, VP of Strategic Marketing for Data Center Products here at Rambus. It's nice to talk to you guys today. So, Mark, you heard what we had to say there. You know, what are the, let's start right off the bat with the sort of things that might hold back memory across CXL from really taking root in, well, basically big production environments? Yeah, that's a great question. And one that I feel like, you know, as an ecosystem, we don't necessarily talk enough about because let's face it, for the first time ever, we're talking about taking system memory, sticking it off a serial interface, moving the memory controller off the CPU.
Starting point is 00:02:30 And what are the implications of that? I mean, the biggest implication, of course, is that that has to be reliable. And you said it in your intro there. I think the industry as a whole needs to kind of really focus in that area. If you think about the main CPU players, the Intels, the AMDs of the world that have for many, many years honed their memory controllers to work with system memory, mostly in the form of RDMs, that's a lot of man years of effort that have gone into ensuring that there's the right ECC technology, that there's no silent data corruption under most every use case, things of that nature. A lot of that work now has to be
Starting point is 00:03:13 recreated by third parties. And that is not a easy task. Now, companies like Rambus, who have been in the memory business for a long time, are uniquely capable of doing that with some of the insights we've gained working with the memory suppliers. But that's one major element. The second piece of it is security. As you also mentioned in your intro, I mean, now we're moving memory a little further from the CPU. Security becomes a concern and that has to be that has to be dealt with appropriately. And then I think more broadly, you talk about just the standards work that's required to ensure everybody is aligned on the foundational requirements for this type of memory attachment. A lot of work needs to be done there too. And RAMBUS is heavily involved in all those areas. You mentioned there about standards. It's so important for something
Starting point is 00:04:07 this big and this open. With so many members of the CXL consortium, it's so important that this is all properly standardized. And for adoption to take place, it's going to have to be standardized. Companies won't simply throw it into a data center on the assumption that it's going to work. They're going to want to know how it works, how reliably it works, and how well they can maintain that with serviceability moving forward. The standards activities are quite involved now around CXL attachment of memory in particular, not strictly speaking memory, but a lot of focuses in that area. I mean, everybody knows the CXL consortium, of course, which really is the foundational standards body that's essentially delivering the protocol.
Starting point is 00:05:02 Rambus, heavily active in that area. We actually, as of January of this year, we sit on the board of directors for the CXL consortium and actually are the only memory company that's building memory silicon that does so. And that's one obviously major thrust. So foundationally, all the ecosystem players have to come together to make sure that the protocol itself is suitable for the application. But that's just one part of it. A lot of other standards bodies now are very involved by necessity, because as you pointed out, how do we use this stuff in production as things really ramp up over the next couple of years. And that has at least two other major components.
Starting point is 00:05:47 One is what are the form factors and the standard, the most important requirements for the controllers that will implement this type of technology. So, you know, Jetec is very active in that and, you know, Rambus has been working closely with Jetec for many, many years. That standard, just like it's important in the RDM industry, is new work group was formed formally, the Composability Memory Systems Work Group within OCP. And what is that group doing? Well, that group is essentially building an open source fabric manager for the purposes of managing all this tiered memory, which is going to be foundational for its use as we go forward. In the Linux community, there's kernel work to ensure open source kernels.
Starting point is 00:06:47 So all this standards activity has to come together. The ecosystem has to agree on all these things for this to be deployed in any kind of meaningful volume in a ubiquitous way. And, you know, Rambus is heavily involved in all those groups, as are many of our customers and partners. So it's a very exciting time. It's really positive, I think, to see such broad activity across all these standards bodies. It speaks to the importance of the technology. Yeah, I think that's what's been interesting to us here on the podcast is just the number of companies, the breadth of the companies that are involved, and all of the different resources and ideas and technologies that are being brought together. And it's really, I think, important that companies like Rambus are there as
Starting point is 00:07:39 well because of the long history, not just in terms of patents and technology, but even just understanding of this market. I mean, Rambus is a company that has worked in the enterprise industry, enterprise server industry and cloud servers, well, Nintendo 64, everything for so long. And yet having that kind of background and bringing that to the table is important because, you know, for me as a storage guy, one of the things that kills me is, you know, people kind of come in and sit down and they're kind of like, oh, how hard could storage be? We'll just, you know, we'll just store some data. It'll be fine. Well, it's the same thing with memory. I mean, people come in, you know, if they don't know the memory business and you don't have the background in this and you don't understand what it takes to make memory reliable and to handle the issues that are going to happen
Starting point is 00:08:33 and to predict what issues are going to happen, because we've seen this before and we'll see it again. You don't ask the right questions and then you don't have something that's reliable. You don't have something that's highly available. And so it's good to have that kind of background being brought to the table. So what is it that specifically that Rambus is bringing in terms of understanding of system memory to CXL? Yeah, that's a great question, Stephen. Thank you.
Starting point is 00:09:00 I mean, well, first of all, I can't claim any responsibility for what we were doing back with Nintendo as many years ago. I'm about a year and a half new to Rambus. But I mean, the reason why I joined the team was because I feel that Rambus brings something very unique to this ecosystem. in the serialization of memory for many years now, all the way from back when we were talking about things like C6 and OpenCAPI or OMI and Gen Z. And CXL clearly has sort of taken the lead from an ecosystem adoption standpoint.
Starting point is 00:09:35 And what Rambus brings is rather unique. First of all, as you pointed out, we have a great deal of historical contribution to the memory ecosystem through inventions and involvement with the memory suppliers by supplying chipsets that live in the RDIMs that today ship ubiquitously in the data center. And we also have this rather unique pool of IP to draw from that we ourselves develop and sell to third parties in the form of CXL controller IP, PCIe controller IP, CertEZ, security blocks, and then a very strong SOC design team within
Starting point is 00:10:16 Rambus that's able to take all these pieces and put them together to do something I think pretty special. And so when you look at the memory ecosystem and you think about companies that really kind of know the nuts and bolts of what it is to take things like DRAM and make them consumable on a mass scale, Rambus is rather unique. Now, when you think about the different ways that memory is going to be deployed with CXL, there's a whole bunch of different variations on that. And that's also important to consider in answering your question, because there's a great deal of complexity, as I mentioned, in recreating the memory controller, taking it off the CPU and putting it external, and then attaching our DIMMs to that. You have to consider all the reliability things
Starting point is 00:11:05 that any CPU would need to consider. But there's another type of use case as well, which is where DRAM is populated directly down on a board next to something like a CXL memory controller that necessitates an understanding of capabilities that are very analogous to what today you find in things like an RCD that live on an RDIM. And very few companies have the understanding and the background to make that a productizable thing as well.
Starting point is 00:11:34 There's a lot of acquisitions leading up to this point, and a lot of them were focused on enabling us in this area. One of the more recent ones is an acquisition we did of a company called Hardened. They're based in Montreal, Canada, and their expertise is in the areas of ECC, in the areas of compression. And it kind of hints to the types of things that we feel are necessary to have a very deep understanding of to make this thing successful. So those are a few things in answer to your question. Yeah. And it's interesting too, that you bring up sort of non-memory thing, non-memory topics as well. Because of course, the differentiator for CXL memory
Starting point is 00:12:28 is that it's going over PCI Express. It's probably going to be going over fabrics. There's probably going to be other types of IO passing alongside it. There's a lot of questions that go beyond sort of what needed to be asked when you're talking about memory on a system bus isn't there. So I think it's important to bring in those kind of things,
Starting point is 00:12:51 bring in that kind of understanding, maybe it's storage-based or it's interconnect GPU-based, those kinds of understandings and meld them with how memory needs to work, right? Right. Well, I mean, exactly, 100%. I mean, when you take a step back and you look at the big picture here, I mean, what is CXL offering? Well, it's offering a new attach point for memory. And one of it's often talked about use cases. Why is that important? Well, that's important because today, I mean, we'll be just lots of documentation or papers and, and, and, and, and talks about how important memory is from a cost standpoint, how the need for memory is growing, and what are the tools that are available in the absence of CXL to the system architect. It's really, you have HBM and you have various types of direct
Starting point is 00:13:35 attached DDR. And then you have storage, right? There was an attempt to bridge that gap a little bit with things like 3D Crosspoint. That's not really an option now. So what are the tools that we have in the toolbox? Well, CXL creates a whole bunch of those tools by creating these memory tiers. And as soon as you have these memory tiers, which, to your point, is very analogous to storage in the past, you have a whole bunch of different things you need to think about. You need to think about whether or not and what the use case is from a latency standpoint, for example, on whether or not you encrypt that memory. You need to think about
Starting point is 00:14:12 things like storage techniques that are not entirely different from an FTL style layer you might see in flash SSDs today when you want to do some kind of post-processing on a cooler tier of memory in order to use that memory more efficiently. There's a whole bunch of enterprise storage concepts that become very applicable to the memory space. Just pure data integrity through silicon and end-to-end in a system is an art that was really, in many ways, perfected in storage that is very applicable to the memory space. And I think carrying all those ideas forward in a deliberate manner, right, we can't bite off too much too quickly, is going to be how this market actually is realized,
Starting point is 00:15:05 how people actually start to find reliable ways to use these new types of memory tier to solve some of these problems, given, again, today, their only option is throw more HBM at it or throw more direct-attached DDR at it, which is fraught with all kinds of limitations. There's been huge parallels between where
Starting point is 00:15:26 we were with storage and where we're going with memory. And one of the first things we did, you know, we had huge amounts of wasted storage and physical servers, you know, that just weren't using it. And the solution was pooling it in virtualization. Now we're doing the same with memory. We're pooling the memory to let us mitigate that wasted resource. And it's an expensive resource to waste. But, you know, where we started off with storage with basic RAID and then to erasure coding, you know, storage has evolved over time. And, you know, SSDs and NVMEs, it'd be interesting to see where memory goes now. That pooling is going to be a foundation that other services, things that we haven't
Starting point is 00:16:12 even thought of yet. Nobody was thinking of erasure coding on a physical server. It'd be very interesting to see where it goes. Yeah, I fully agree. And there's a lot of conversations being had, both in closed door meetings and in standards discussions around exactly that universe of application space. And there are a lot of different thoughts on the right way to architect such a system. I mean, as you pointed out in storage, we saw that evolution happen in the form of different types of storage media with different types of interfaces, migrating to things like JBODs
Starting point is 00:16:56 or JBOFs that attach to a number of servers. Will we see similar things for memory? I think we will. Will they live on the other end of a fabric? I think eventually they will. There's a lot of things to work through to get to that point. So I think that's an inevitable place that we will get to. I do think we need to crawl, walk, run a little bit as an industry. And again, I mean, just sort of the way we started this conversation. I mean, there are some foundational things to address in order to really convince the ecosystem that we have a reliable way to use this memory
Starting point is 00:17:32 with the correct performance profile and the right software infrastructure to manage it. And then that will continue to grow. And we will eventually see, we will eventually reach that end of the rainbow where we're able to effectively utilize memory directly and remotely and with reliability so that the architects have truly all the tools in their toolbox to architect servers and racks and data centers the way that they would like to, depending on the workloads that they have, which are themselves widely varied. We're starting off with memory pooling, and we believe that that's going to be quite rapidly and widely adopted. We don't know about, say, the future of CXL,
Starting point is 00:18:18 you know, features in 3, 4, 5, et cetera. And I'm getting these visions in my head of memes, you know, design and user experience, you know and i'm getting these visions in my head of memes you know uh design and user experience you know you have companies like rambus now that are building these that that are creating the the tools and then we're going to have companies who are actually building the cxl solutions where do you see those solutions going long longer term what after memory pooling what do you believe would be the next adoption down the cxl roadmap well there's a number of ways that we can um we can look at a few different directions um so as you point, memory pooling is broadly talked about. It seems to deliver a pretty obvious value proposition, which is use that expensive resource more efficiently, free up capital to purchase more memory, to also eventually over time just more broadly become a foundational new component of what has often been talked companies would love to get to is a truly sort of heterogeneous composable rack where they can plug and play things wherever they like.
Starting point is 00:19:56 And that means also the attachment of GPUs and accelerators, the ability to do direct transfers from memory to memory, from cache to cache across those heterogeneous components. So that is, I think, one certain direction that the ecosystem is trying to move towards after, strictly speaking, pooling of memory. I think the pooling of memory is talked about most nearest term in terms of that sort of far memory because it is such an expensive resource there's never been a solution for this before whereas we do have some solutions right now for disaggregation of things like gpus and accelerators over pcie and there's also some proprietary solutions out there as most people know. But putting that all together kind of is then the next step. You can't really do that effectively, strictly speaking, over, for example, PCIe. You need something like CXL and the protocol enhancements that it brings to make that work.
Starting point is 00:20:59 But the other thing that I'll say is, you know, going back to direct attached memory, right now we talk a lot about, well, you have this many, you know, DRAM channels on a CPU from an Intel or an AMD or pick your favorite CPU company. And CXL, which runs over PCIe, which is a ubiquitous interface, there's increasing number of lanes per CPU to allow for it to be used to do interesting things. You can take that CXL memory and you can add more memory bandwidth or more memory capacity or introduce new types of memory over that additional interface. But I would suggest also that if you look far enough out, the CPUs themselves today, what are they growing to? More than 6,000 pins, something like that.
Starting point is 00:21:48 Like I've talked to customers that they can't conceivably go two DIMMs per channel on a socket and a CPU because they physically are running out of space in a server if it's a two socket server, for example. In fact, maybe even a two socket server becomes hard to build because these things get so large. So I would say that there's also a possibility down the road that the serial memory, as opposed to delivering a new memory tier, sort of is just the mechanism to attach memory to a CPU, whether it's hot, whether it's warm, whether it's cool, whether it's cold. And I would point to a company like IBM, who's demonstrated, has pioneered sort of this thought experiment, which is serializing memory on their power series. So is that a direction that we could also go? And that's going to require new standards work in terms of form factors of memory components, as well as perhaps the protocol itself. So there's a number of different directions that we could go beyond the direct memory attachment with CXL that we talk about quite a bit as an industry right now and pooling, right?
Starting point is 00:22:55 There's those two sort of different paths, which I think are very powerful and very interesting and speak to why we're all making these investments right now. Yeah. And speaking of that, I think that we can all kind of understand where, you know, that Rambus has, is able to bring IP to the market in terms of helping, you know, basically having, and I think you've already maybe announced some product in that area, or maybe there, you know, we've heard about it, but where does the product picture go for you? Are you basically going to be enabling with IP, or are there going to be Rambus products? Yeah, great question. As you know, Stephen, we haven't made any announcements yet in the area of specific CXL products. We do have lots of information available about the IP
Starting point is 00:23:48 that we make available to the ecosystem to build products of their own. All I can tell you right now is that we're doing some pretty interesting stuff in this area. And hopefully we'll be talking about it a little bit more openly soon. You do get a bit of a hint for the direction we're headed if you look at some of the publications that we've issued on our CXL initiative. But the rest of it, we'll have to wait for a future discussion before I can talk more about it. Well, I can't wait to hear what that is. But, but in terms of IP, at least it's, it's great to know that that you all are there to basically support these products that are coming to market with you know, with the Rambus technology. And, and, and I guess for now you know, that is one of the approaches that a lot of companies are making to to
Starting point is 00:24:41 basically some of them are providing IP and some of them are providing, you know, a product. And they're all meeting together in the CXL market. Another question that I would have relating to Rambus, though, is, you know, there's a whole thicket of intellectual property here. How does that work? I mean, we're outsiders. We don't see how it goes. How does it work to bring technology from 30 different companies together into a standard? Yeah, that's an interesting problem space, but one that we've navigated successfully in the past, and again, I'll go back to examples like storage, right? I mean, the situation was no different with SaaS or SATA.
Starting point is 00:25:32 It was no different with NVMe, even PCIe itself. for technology innovation and the need for product to enable it becomes so clear that you kind of have to sit at the table with your partners and your customers to ensure that we're all investing R&D dollars in a manner that we can deliver some ROI. And then the only way you can really achieve that is consensus at some level. And so, yes, is there an art in enabling standardization while leaving an opportunity for innovation? Absolutely. But, you know, it certainly is possible. It's been done before. And I think actually to the credit of, you know, the end users of these products and that's the cloud service providers, that's the server OEMs. They are very, they put a great deal of importance on participation and driving in a collaborative way that work.
Starting point is 00:26:38 To the point where, you know, they will probably say that, you know, they don't want to partner with a company that, that, you know, is fighting or resisting that. Nobody wants to put something proprietary that, you know, is sort of defined in a dark room into a system that gets broadly deployed. And so it's really almost a mandatory part of the pre-sale cycle and of the industry enablement cycle. I mean, if you want to make product to address these things, you just need to collaborate in that way. And everybody sees it. It makes for some very interesting and heated discussions sometimes in standards bodies. But at the end of the day, it lands in a constructive place. And we'll navigate it just like we navigated in the past for some of those standards that I mentioned, by example.
Starting point is 00:27:31 There's one thing you mentioned there that actually set off a light bulb moment in my head, and that was around the serialization of memory access. And I hadn't really thought of it that way. And again, I'm getting storage flashbacks you know of parallel ATA and ultra two wide CTV with 80 pins and you know now it's down to a few pins on a on a sas cable and obviously the performance and latency and everything from where parallel was to now how it is in serial has improved so it'd be very interesting to see if memory access took that trend. I mean, at the end of the day, what is a CXL interface but a memory interface, at least in one incarnation,
Starting point is 00:28:18 a memory interface that consumes one third the pins of, for example, a DDR5 channel? How is that? How is that not a good thing for the compute ecosystem? And in the long run, if we can get the latency down to where it needs to be, the reliability down to where it needs to be, it seems to me that that could be quite ubiquitous. And I would also point out that when you couple it with standards like UCIE and other things, it also contributes to the disaggregation of the CPU itself at a chiplet level, because now you have new ways to
Starting point is 00:28:52 plug and play different architectural components right at a package level, which again, introduces all kinds of interesting possibilities. Absolutely. And I think that, like you said, Mark, one of the things that I'm going to bring back in here to kind of look forward to where CXL goes, I think that one of the biggest problems that CXL is solving is what you mentioned earlier about basically the physical constraints that we're facing due to the system architecture choices that we made a decade ago. And so we've got too many pins, we've got too many lines, we've got too many memory channels,
Starting point is 00:29:31 and we've been trying to overcome this. CXL allows us to overcome that in a way that not only means that the physical footprint of the motherboard can be redesigned and re-engineered, but in a way that allows us to basically blow up the whole concept of a server and maybe get to this kind of rack scale architecture. And I think that the coolest thing is that in the long term, that opens the door to basically more compute, more memory, more storage. It opens the door to building systems that are
Starting point is 00:30:06 basically impractical given the constraints of current architecture. And that, I think, is really where everything changes. Because there's all these dominoes that are going to fall down. As soon as we don't have all these ranks and ranks of memory stacked across that server, we can move that stuff around. We can adjust pooling. We can adjust power. We can, you know, basically adjust, you know, how we place the CPUs. We can do all these things. And a lot of that is being held back by memory, by memory channels. And I think that this changes it. Yeah, I mean, absolutely. And I mean, if anybody is wondering if the requirement for more memory is going to go away, I encourage them to type that question into chat GPT and see what it tells you. The point being that, you know, AI is not going to be slowing down anytime soon. And that's a
Starting point is 00:31:02 big memory driver. I think on one of your other episodes, to highlight the point that you just made, you were talking to Dan Ernst at Microsoft and he said something like, computer architecture isn't for it. So there you go. I fully agree with him. And I think that supports your point.
Starting point is 00:31:18 Absolutely not. And I think that that's really exciting to see where this is going. Well, this has been a wonderful conversation. I really appreciate having you on here. Also, I'm really glad to hear that you're a listener to Utilizing Tech. Where can people keep up the conversation with you? Where can they connect with you?
Starting point is 00:31:37 Yeah, thanks, Steve. And I certainly hope that I can connect with some of your viewers. The best opportunity coming up is I'll be delivering a keynote at MemCon 23 down in Mountain View, California on March 29th. So feel free to grab me at the event. Please come to the session and talk to me afterwards. Also, you can reach me on LinkedIn. And for what Rambus is up to, please do visit rambus.com.
Starting point is 00:32:02 Thanks again, Steve and Craig, for having me today. It was a lot of fun. Hope to do it again. Great. Well, thanks for joining us. And Craig and I will be a part of the Tech Field Day event that's happening. Actually, it will be happened by the time this episode is published. So please check out YouTube slash Tech Field Day for video recordings. And that features a presentation from this about CXL Consortium and a few of the other companies in the CXL space. So we look forward to having you join us there.
Starting point is 00:32:32 Thank you for listening to Utilizing CXL, part of the Utilizing Tech podcast series. If you enjoyed this discussion, please do subscribe in your favorite podcast application or on YouTube. Also, please do give us a rating review because, of course, that's always welcome. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more episodes, go to utilizingtech.com or find us on Twitter or last time at Utilizing Tech. Thanks for listening and we'll see you next week.
