Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 4x17: Memory and More over CXL with UnifabriX
Episode Date: February 27, 2023
Memory expansion is the first application for CXL, and memory pooling is coming next, but this technology will eventually support storage, Ethernet, and more. This episode of Utilizing CXL brings Ronen Hyatt, CEO of UnifabriX, to discuss memory and more over CXL with Nathan Bennett and Stephen Foskett. UnifabriX demonstrated high-performance compute benchmarks using their Smart Memory Node at Supercomputing 22 and claims to be the first to show a CXL 3.0 fabric. But the company is also promising NVMe storage and Ethernet connectivity over the CXL fabric. This enables each server to have the right type and capacity of connectivity, from basic Ethernet to DPU, with dynamic reconfiguration. The fabric can also contain an NVMe storage target that combines DRAM and flash and can be dynamically allocated. Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Nathan Bennett: https://www.twitter.com/vNathanBennett Guest: Ronen Hyatt, CEO and Founder of UnifabriX: https://www.linkedin.com/in/execute/ Follow Gestalt IT and Utilizing Tech Website: https://www.UtilizingTech.com/ Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/1789 Tags: #UtilizingCXL #CXLStorage #Ethernet @UtilizingTech @UnifabriX
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on Compute Express Link, or CXL,
a new technology that promises to revolutionize enterprise computing.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining me today as my co-host is Nathan Bennett. Welcome, Nathan.
Hey, Nathan.
Hey, Stephen. Good to be back with you all again, talking about this awesome technology and kind of peering into the future of all things CXL and this Compute Express Link. I think we've beat
memory to death at this point. Let's see what we can talk about today.
Absolutely. We've talked in the last few episodes with folks from, well, all sorts of different companies that are building platforms like AMD, Intel, ARM, building software to support CXL. In fact, one of the things I said was it's a good thing they're starting with memory because it's a great application. It's a great kind of slam dunk
application that people can come in and they can say, you know, I want to get some benefit from
this technology. We implement memory expansion. We have got bigger memory on the system. We've
got more memory bandwidth than we thought we would. Applications run faster? Cool. Where do we go from here?
And that's why we've invited Ronen Hyatt from UnifabriX to join us today so that we can
learn a little bit, not only about memory expansion and memory pooling, but sort of
what comes next.
So welcome to the show, Ronen.
Thank you, Stephen.
Excited to be here.
Give us a little bit about your background and where UnifabriX came from.
Excellent. So I'm the CEO of UnifabriX and I'm one of the co-founders. I come originally from Intel DCG, from the Data Center Group. So I've been playing with CXL since almost six years ago, when it was internal to Intel. It was called Flexbus internally, later IAL, Intel Accelerator Link.
Only in 2019, it became public.
So I'm one of the very few people out there that have multi-year experience with CXL.
And UnifabriX is all around CXL and the benefits that CXL can bring to the industry and data centers.
We provide the ultimate memory scaling solution,
solving the roadblocks that local DRAM puts into compute today,
solving both the bandwidth scalability and capacity scalability of memory and the mismatch that exists today between memory and compute.
And actually, not only that, from our name, UnifabriX, you can infer that our vision going forward is having a single unified fabric within the rack.
A hint: it's not going to be Ethernet, it's going to be CXL.
So at Supercomputing, you showed a memory expansion device, a memory expansion capability,
the smart memory node that allowed systems to scale much bigger, run much faster.
I do want to talk about things beyond memory, but I guess first, let's start with memory.
Tell us what you're showing now and what you're selling now, and then we'll move beyond memory.
So at Supercomputing, we were actually showing a lot more than memory expansion.
Memory expansion is quite simple.
It's a form factor that gets into the server that expands memory within the server.
We were showing memory pooling, and that was a demonstration of a memory pool product where we provision not only memory capacity but also memory bandwidth to servers, CXL-enabled servers.
This server was Sapphire Rapids, the Intel Sapphire Rapids.
It was a pre-launch version of Sapphire Rapids.
And we were running a real HPC application, HPCG, over that server.
HPCG is quite a notorious benchmark in the HPC world. It consumes a lot of memory
bandwidth. And what we showed is that once you start engaging more and more CPU cores into the
workload, into HPCG, at some point you exhaust all the local memory bandwidth and then you get stuck. Then your compute is getting stranded. You can use only
around 50% of the CPU cores that you have on the socket, but no more than that.
And with the extra bandwidth that we provisioned through our smart memory node, we were able to use all the CPU cores on the socket, the total 100%. Like you get double the compute density instead of half.
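As a rough illustration of the scaling effect Ronen describes, here is a back-of-envelope, roofline-style sketch in Python. All of the numbers in it (core count, per-core bandwidth demand, local and CXL bandwidth) are assumptions chosen for the example, not measured Sapphire Rapids or UnifabriX figures.

    # Illustrative model of bandwidth-limited core scaling; all numbers are assumed.
    def usable_cores(total_cores, per_core_bw_gbs, local_bw_gbs, cxl_bw_gbs=0.0):
        """Cores that can be kept busy before memory bandwidth saturates."""
        total_bw = local_bw_gbs + cxl_bw_gbs
        return min(total_cores, int(total_bw // per_core_bw_gbs))

    CORES = 56        # cores per socket (assumed)
    PER_CORE = 10.0   # GB/s each core demands on a bandwidth-bound kernel like HPCG (assumed)
    LOCAL = 300.0     # GB/s of local DRAM bandwidth (assumed)
    CXL = 280.0       # GB/s of extra bandwidth provisioned over CXL (assumed)

    print(usable_cores(CORES, PER_CORE, LOCAL))       # 30 cores, roughly half the socket
    print(usable_cores(CORES, PER_CORE, LOCAL, CXL))  # 56 cores, the full socket

The point is only the shape of the curve: once local bandwidth is exhausted, adding cores gains nothing until more bandwidth arrives from somewhere.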
So I think that's very good for us to hear.
We talk about memory, and we've talked about it a lot on this podcast.
I don't mean to bash it because at the end of the day, there's a lot of great things
that can come from memory.
But just what you were discussing in terms of the fabric and memory being able to be pooled, from only being able to utilize maybe 50 or 60% of the CPU to a hundred percent of the CPU. I think there's a lot of different ways we could
go in that conversation. Like, oh, well the CPU, the way they manufacture CPUs will probably change
because they got to figure out how to make sure that they can withstand that type of bandwidth
and all those different things as well. But one thing that you said in terms of the fabric, I'm just curious in terms of, and
if we want to stay on memory, we can go back to it.
But the idea that I keep hearing about is this idea of a master controller.
And we all kind of know this if we've studied our architecture in terms of how a computer works, where the master controller kind of worked in between RAM and the CPU.
Is that something that kind of is functional in what we would see in CXL, where there's
like an extra component that works in between them?
Or is this something that would be kind of more streamlined into like an actual, the
chiplet type of situation where there's smaller chips that just streamline the data
pipelines from one place to the next?
So CXL is many things. CXL is an interface that provides you the ability to connect to a server at the rack level. At the chiplet level, we have today UCIe, where you can design CXL-based logic and then attach it at a chip level, at a socket level, at a package level.
So everything is there. So in terms of the master controller, you can get everything over CXL.
We talked about memory, we talked about storage, we talked about networking and all that connects over the same interface.
Yeah. And I think that's the exciting part, right?
And, you know, again, I think I came down hard on memory at the beginning of the podcast.
That's a great thing for us to utilize.
But what's really exciting is the idea of moving past that.
And you mentioned networking. What can CXL really do in that networking arena?
So CXL is much more than just memory. Of course, everybody starts with the memory. In SC22,
we demonstrated NVMe, actually the fastest NVMe device over CXL and we actually have Ethernet over CXL running at our labs today,
totally replacing the top of rack switch.
So what we have, our configuration is have multiple servers within the rack
and our appliance sitting at the top and having CXL cables going from our appliance to each server
and each server sees an Ethernet service, meaning it sees an Ethernet NIC.
But it's not a physical NIC.
It's not that something is installed within the server.
This is something that we expose on demand through our appliance, our smart memory node.
And we can expose that as a service.
And this is nice.
You don't have to do that.
You don't have to expose an Ethernet service to each server. You don't have to expose an NVMe device to each server. This is done on demand.
And we can expose different types of NICs. It could be a simple NIC, like just moving data, and it also could be a DPU, like a NIC that also does network overlays, all the vSwitch processing that exists in modern DPUs today. And the nice thing is, everything here is on demand.
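To make the "device as a service" idea concrete, here is a purely hypothetical sketch of what provisioning might look like from an orchestrator's point of view. The class and method names are invented for illustration only; they are not a UnifabriX product API or anything defined by the CXL specification.

    # Hypothetical orchestration sketch; all names are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class VirtualNic:
        host: str           # server the device is exposed to over the CXL cable
        speed_gbps: int     # 10, 100, 200, 400 ...
        dpu_offload: bool   # plain NIC vs. DPU-style vSwitch/overlay processing

    class FabricAppliance:
        def __init__(self):
            self.exposed = []

        def expose_nic(self, host, speed_gbps, dpu_offload=False):
            nic = VirtualNic(host, speed_gbps, dpu_offload)
            self.exposed.append(nic)
            return nic          # the host simply enumerates an Ethernet NIC

        def reconfigure(self, nic, speed_gbps=None, dpu_offload=None):
            if speed_gbps is not None:
                nic.speed_gbps = speed_gbps
            if dpu_offload is not None:
                nic.dpu_offload = dpu_offload

    appliance = FabricAppliance()
    nic = appliance.expose_nic("server-07", speed_gbps=10)
    appliance.reconfigure(nic, speed_gbps=400, dpu_offload=True)  # upgraded on demand

The design point being illustrated is simply that the device's personality lives in the appliance, so it can be created, resized, or removed without touching the server hardware.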
So would the, in this case, let's talk about these NICs. Would these be physical NICs that
are in a remote box, essentially away from the server, and you're using CXL Fabric as a multiplexer
to attach it to a server, or would they be virtual NICs?
I mean, is it an actual DPU or is it a virtual DPU
that's using some kind of pooled compute resource
and IO in that remote box?
How does that work?
It's a virtual DPU and a virtual NIC.
It's not that we use CXL to multiplex like other players are doing today with PCIe boxes that connect to shelves of NICs or shelves of storage.
We actually provide the service itself within our appliance, within our smart memory node.
So this is something that we provide on demand and we can change the characteristics and the personality
of that device according to its use.
For instance, we focus today on the HPC market.
The HPC market is using a lot of RDMA, whether over InfiniBand or over RoCE.
And one of the key important things is passing messages, like the MPI interface, which needs to be very low latency, ultra-low latency. So one of the things that we can expose using our Ethernet service is an MPI message-passing device, which is ultra-low latency because it's running over CXL, and CXL is much lower latency than PCIe. And this is how we accelerate HPC applications,
not just by providing more memory bandwidth,
not just by providing more IO bandwidth,
but also by providing much faster network and fabric.
So I guess that means then that UnifabriX
would be essentially a competitor in the DPU space as well,
since you would have to be developing your own DPU capability,
not just a CXL capability.
The form factor would be our smart memory node. So we would not be like selling DPU cards. This
is not our business, but this is something that we can natively provide as a service on demand through our appliance. And it's a very natural development
of CXL services that can run over the CXL cabling that we envision already exists within the rack.
Yeah, yeah. And I know that this is early times still, and it's not like we're talking
about a specific product or something. So just trying to get my head around sort of what this
thing would look like and what the benefits would be. So as you mentioned, I heard one of the
benefits might be that you could have sort of special purpose NICs. So if the server needs a
DPU or if it needs just a basic throughput, I imagine you could have different levels of
performance available as well. That's one benefit. Maybe some dynamism. So if the server changes purpose, you could
repurpose the type of connectivity that that server is going to. Also, I imagine that you
could have maybe aggregate higher performance. So maybe you've got 400 gigabits out of the top
of the rack or something, and that you could then deploy that to all the servers in the rack
without having to spring for expensive high-end network
adapters on every server or something like that, right?
That's correct.
If you think of CXL in the generation of PCIe Gen 5, then every CXL x16 link is one terabit
per second.
And we use that one terabit per second to pass memory transactions.
But in the context of networking, think that we can expose a virtual NIC of like 10 gigabit per second and then upgrade it on demand to 100, 200 and 400 gigabit per second with or without special processing.
So we can scale everything.
We can scale the amount of bandwidth that the NIC provides.
We can provide RDMA or not. We can provide the vSwitch processing like a DPU or not.
This is completely flexible.
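The "one terabit per second" figure follows from simple link arithmetic, sketched below: PCIe Gen 5 signals at 32 GT/s per lane, so a x16 link carries roughly 512 Gb/s in each direction, or about 1 Tb/s counting both directions, before protocol and encoding overhead. The virtual NIC sizes used are just the ones mentioned in the conversation.

    # Raw-link arithmetic for a CXL x16 link on PCIe Gen 5 electricals (overhead ignored).
    LANE_RATE_GBPS = 32                      # PCIe Gen 5: 32 GT/s per lane
    LANES = 16

    per_direction = LANE_RATE_GBPS * LANES   # 512 Gb/s each way
    bidirectional = 2 * per_direction        # ~1 Tb/s, the figure quoted above

    for nic in (10, 100, 200, 400):          # virtual NIC sizes mentioned in the episode
        print(f"{nic:>3} GbE uses about {nic / per_direction:.1%} of one direction")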
And I guess maybe you could kind of run us through what the storage connectivity looks like as well,
since NVMe storage has maybe a different form factor, different protocol, and takes up more space than a network adapter might.
I don't know.
What would a storage-connected CXL device theoretically look like?
So what we learned while working with the HPC customers is that they have memory bottlenecks.
So we started with memory.
That was our first step as a company, and we solved it.
We already have a product that provides memory.
So what we learned from these customers is that memory is only one bottleneck.
So it's like an onion.
You peel the first layer, and then you find that once you solve your memory bottleneck, you get to an IO bottleneck.
And then, how do we solve the IO bottleneck? So we say, okay, we already have a very high bandwidth link, the CXL link that goes out of the server. Let's look at how we can reuse it,
not just for memory transactions, but also for storage transactions. And this is where we came up with exposing NVMe over the same cabling and using a very fast storage system that uses a hybrid combination of DRAM and NAND flash.
And this is where you can run very large data sets over memory at extremely high performance, like more than 25, 30 gigabytes per second per NVMe interface,
and around one, two microseconds of average latency.
The tail latency is three microseconds.
It's extremely good compared to real NVMe devices.
And this is where we solve the IO bottlenecks of HPC players.
And then we move to the next bottleneck, which is the fabric and networking.
So we provide a comprehensive solution in that space.
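For a sense of scale, the throughput and latency figures quoted above can be turned into IOPS and queue-depth terms with Little's law. The 4 KiB block size and the arithmetic below are illustrative assumptions, not benchmark results.

    # Back-of-envelope from the quoted ~30 GB/s and ~2 microsecond figures; block size assumed.
    THROUGHPUT_GBS = 30.0      # ~30 GB/s per NVMe interface (quoted)
    AVG_LATENCY_S = 2e-6       # ~1-2 microseconds average latency (quoted)
    BLOCK_BYTES = 4 * 1024     # assumed 4 KiB I/O size

    iops = THROUGHPUT_GBS * 1e9 / BLOCK_BYTES    # ~7.3 million IOPS
    in_flight = iops * AVG_LATENCY_S             # Little's law: ~15 outstanding I/Os
    print(f"{iops / 1e6:.1f}M IOPS with only ~{in_flight:.0f} I/Os in flight")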
So Ronan, it sounds like what we're discussing here and what you're trying to accomplish
is not a fabric of one particular thing, but a fabric of many different things.
Is that what your company is targeting?
Is that the goalpost that y'all are going towards?
So, the way we look at it, CXL is going to happen, right? You will have memory pools within
the rack. And if you have a memory pool, you'll have CXL cabling in the rack. And then we say,
okay, this thing is already there. What else can you do with it? And then we come with the storage, with the NVMe, and with
the Ethernet, and with other things that are currently in development. So our goal is to make
CXL as valuable as possible at the rack level. And finally, replacing Ethernet. Like, you will not need Ethernet cabling per se
within the rack just when going outside the rack.
The myriads of Cat6 cables I have in my home office
will cry tears.
And I will cry tears of joy if that happens,
but that sounds very future forward
and I'm excited about that prospect.
But it sounds like the goal initially for starting out is memory, and then moving past that to a fabric of all the different peripherals that we want to connect to it.
Is that about correct?
Yes, everything will be replaced.
By the way, you talked about Cat6 cabling.
We are using CDFP cabling.
CDFP is a standard form factor.
Actually, Google is using it in their TPU clusters. So it's running CXL at PCIe Gen 5 speeds. It's also compatible with the future PCIe Gen 6 speeds. So the same form factor of
cabling and connectors will stay with us going forward for at least several years.
I always think it's funny how in technology we throw these letters in like an alphabet soup and
we understand its context. But at the end of the day, it's about that point that you just brought
up, about how it works right now with Gen 5 but is forward compatible with Gen 6. At the end, the main point is we will always be able to use this type of cabling and have these types of speeds. You're talking about like one terabit per second instead of like the 100 or 400 gig per second that we're normally used to. And us in the home labs at like 10 or one gig, we live that life. But we're talking insane speeds at this point.
Let's go back to the idea of moving from a networking overlay to all these individual components utilizing PCIe. Would we actually start seeing additional gains with the DPU structures that we're seeing right now, as opposed to moving from that to PCIe?
Because at the end of the day, what we're actually seeing in terms of what customers are using right now is that the DPU is kind of in its infancy, but we're seeing a lot of
customers starting to use it in the commercial as well as, you know, public sectors and other
areas as well.
But the speed is very fast because it's very close.
Right.
And I tend to bring this up every podcast, which is that the components are getting closer and closer together.
And it sounds like CXL is starting to spread them out.
And that's a worry that I have in terms of like, well, what are we going to see?
What are the architectures going to look like? And what are people like me, architects, going to have
to deal with? When we're looking at this in terms of the fabric, how are we going to be able to
manage it, monitor it, and upgrade it? All of those day-to-day things that we're going to have
to deal with. What do you see in the future for CXL in those areas?
Very good point. And one of the examples is like, let's look at something that exists both in the PCIe domain and the CXL domain, like an NVMe device. You were talking about DPU, but let's take an NVMe drive. Like, an NVMe drive
can reside within the server. It has a certain performance, and now you take it away. And we demonstrated in SuperCompute that we provide an NVMe device
through our smart memory node.
So from the perspective that you were saying, it's farther from the server,
and the concern is whether it will make it slower.
But we were actually showing the fastest NVMe device that exists today
in the world with 30 gigabytes per second and really fascinating latency.
So CXL solves that.
CXL solves that.
It provides both the bandwidth, extreme level of bandwidth,
but also a very, very low latency, much lower than PCIe.
So for every PCIe device that you can think of, whether it's a network device, DPU, or NVMe, the latencies that you get over CXL are much lower.
So you're actually getting a really, really fast network adapter
and people are adopting, for example,
OCP NICs and high performance DPUs and so on in servers,
but that starts adding quite a lot to the cost.
It also takes up a lot of the space within the chassis.
Same thing with storage.
Most chassis have basically the whole front of the system
is devoted to storage,
even with these new
EDSFF next generation drives. Really, the whole front of the server is storage. The whole back
of the server is IO and networking. Very little space in there for GPU or XPU. Theoretically,
the CXL fabric offloads that. And this leads me to my next point, which is,
how will the physical architecture of the system change once these things are well and truly
adopted and implemented? Because if there's no need for a whole row of NVMe drives, and there's
no need for a bunch of big clunky NICs or a GPU or so on,
then why does a server look the way it does anymore? Will servers change?
Yes, obviously you will have more space, more volume for compute. And this is what you want.
We talked about disaggregation. So the server will be more around compute, having the CPU cores inside the chassis that today we call a server, and most of the other services would be external. Like, you will have a certain amount of memory within the server itself, close to the compute. We are not eliminating local DRAM completely. It's not a good thing architecturally. You need to have local DRAM as well.
But you will have a memory pool outside.
You will have the networking abstraction outside.
It would be native Ethernet because the operating system will still see an Ethernet NIC, although it will not be physical.
But the experience would be the same, completely
identical. And same for the storage. So you would have the disaggregation. You'd have the storage on
separate shelves. The networking would be provided through a replacement for a top-of-rack switch that would be CXL-based, and practically everything else. The server itself would
become this center point for compute,
where it should be.
And it's interesting to sort of puzzle about this
because there's a bunch of power and cooling
and packaging questions that would be raised.
So immediately, if you say the server is about compute,
then it makes me think,
oh, does that mean we're going to get more quad
or eight socket CPUs?
If we have some memory outside the system,
memory is one of the things that takes a lot of cooling.
So does that mean that we would have maybe smaller systems
with less memory channels?
Maybe we're going to prioritize more sockets and less channels.
But then you start thinking, wait a second,
then what do we have all this space for?
And what about flow through?
And what about allocation of power within the rack?
And how come this component here in the middle that has all the CPU sockets in it is going to be a, you know, terawatt rack, and then this one up here is going to need all the cooling, and this one down here. And it just really boggles the mind. Are people working on that? Is that an active area of focus?
Absolutely. One of the key advantages of
CXL is that it allows you to move things around. Because today, when you build a server, you have
to put the memory and compute side by side. And then you have the challenges of taking all that
heat outside the server. And also the NVMe drives that block your faceplate at the front and disrupt the airflow.
And a lot of other components, even at the back of the server.
And what CXL provides you is the ability to move all these components around, to disaggregate
and put them in different places.
And this is how you can actually solve also the thermal challenges that you have today.
And you talked about four-socket and eight-socket servers.
One of the nice things that CXL provides is actually getting rid of those
because with CXL, you can create ad hoc coherency domains between different CPU sockets
so you can actually build a server with one socket, two sockets, or four sockets on demand.
You don't have to physically install
a four-socket server for that.
This is the more advanced use cases of CXL.
It's on the roadmap.
We will get to that.
Yeah, so it sounds more like we might see more
big twin, big quad, I don't know, smaller CPU plus memory plus CXL units
spread across a rack unit. And then those can be aggregated together to create a virtual 8, 12, 20 socket, whatever CPU or processor you wanted, rather than... Okay. Yeah, it just
boggles the mind about exactly what makes sense. What does it make sense to move these chess pieces
when you've got to worry about things like power and cooling and packaging and placement and all
those other things? And I completely agree. I don't think memory is going to be gone from the
server. I don't think some storage, some IO is going to be gone from the server. It's just a matter of, you know, basically setting up a basic amount that's needed on the physical unit and then connecting the rest over fabrics.
Maybe even a fabric, a uni-fabric, if I may.
Exactly.
One thing that we envision with the memory, the locally attached memory to the server, today it's parallel.
The DIMM interface, it's like 300 pins for each DIMM.
It's huge. Think about a socket, a CPU socket, a modern one with eight or 12 DIMM channels.
It's like thousands and thousands of pins.
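The pin-count contrast Ronen is pointing at can be sketched with rough numbers. Both figures below are approximations for illustration: a DDR5 DIMM connector is on the order of 288 pins, and a x16 serial link needs on the order of 64 high-speed signal pins, ignoring clocks, sideband, and power.

    # Rough parallel-vs-serial pin comparison; counts are approximate, for illustration.
    PINS_PER_DIMM = 288        # DDR5 DIMM connector, the "~300 pins" mentioned above
    DIMM_CHANNELS = 12         # a modern server socket with 12 channels (assumed)

    SIGNAL_PINS_PER_LANE = 4   # one differential pair each for TX and RX
    CXL_LANES = 16

    print(PINS_PER_DIMM * DIMM_CHANNELS)       # ~3,456 pins for the parallel DIMM interface
    print(SIGNAL_PINS_PER_LANE * CXL_LANES)    # 64 high-speed pins for a serial x16 link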
We believe that memory, even the local memory,
would become serial going forward in the next several years.
So we will not have the existing legacy DIMM interfaces.
We will have serial memory attached to the socket.
And then it would be probably CXL as well.
IBM played with OMI.
It's a different serial memory interface
that you're probably
familiar with. So we envision locally attached CXL memory to the CPU and then CXL links going
from the CPU outside to the rack.
Yeah, this is something that I wanted to chime in here because
as we keep talking about CPU architectures, I like to bring up things that are already out there that people are already
using. And I think what Apple does with system on a chip, like the whole CPU structure, as well as
the data lanes connected directly into the DRAM that's there locally, we could see that type of
architecture more and more readily out there in terms of like, here's your CPU and your compute
in multiple different sockets.
And that could be maybe like a caching layer, or maybe that's the performant layer for your RAM and your memory, but then a CXL connection outward into the fabric.
This type of architecture is what really excites me in terms of CXL itself, because, I mean, we saw Apple take a chip and literally just glue another chip on top of it, and it actually worked. And that was like fantastic. Now let's see where
they can take that type of architecture in terms of the server interface.
Yeah. And that's pretty much what Intel's doing with the Max, with the Xeon Max, with the HBM on board and their tiered memory.
Yeah, exactly. That's wonderful news for people that want to
do CXL for this type of architecture that we're talking about, because now we have a much smaller
footprint that we have to cool, and then we can kind of create that architecture around it.
But at the same time, this discussion is about moving from that CPU architecture into a memory network, with one terabit per second type connection speeds outside of the box.
That right there tells me and excites me about the future of CXL.
Ronen, what do you have to say in terms of all of these things that we're talking about?
Like how far away do you think this really is?
Are we talking a decade? Are we talking five years?
How close could this happen? So yes, there is a roadmap and there is a lot of market education to be done
because people are used to certain models of a rack, of a server. So everybody starts with memory.
Now people understand that memory pooling is good. It provides a lot of value and we will start
seeing memory pool devices out there.
Of course, the UnifabriX Smart Memory Node is already there and shipping, and we demonstrated
it working. That's the first step. We will start seeing innovations even in the silicon arena where
CXL gets into UCIe chiplets, and there would be a kind of marketplace for chiplets, standard interface chiplets with UCIe, where silicon companies and startups could focus on building their own IP into a small chiplet and getting the rest of the functionality from other chiplets from other vendors. And then you get your whole SoC, your whole chip, like Apple is doing with their SoCs,
and the NRE would be much lower.
Like the barriers of getting this innovation out into silicon
would be much lower.
So this is at a smaller scale.
And going to the larger scale,
the architecture of the server will change.
The architecture of CPUs will change.
Today, the microarchitecture within the CPU is totally tuned to locally attached memory.
Like the CPU architects that designed the server CPUs
assume that the latency to the DRAM is very low
because it's locally attached DRAM.
And they would have to change their design
because now you have locally attached DRAM,
but you also have further memory.
So we will see the pipelines of CPUs changing going forward.
We will see new types of innovations
around CXL at the rack level,
like we discussed getting more services over CXL,
like storage and networking
and different types of accelerators.
And CXL practically extends the CPU itself.
Like think about how CPU cores communicate with each other over the mesh within the CPU.
CXL extends that because you keep the same type of semantics, like the cache semantics,
not just within the CPU mesh, but also outside the CPU.
So you can scale to much higher diameter.
It really is interesting to think about how these things are going to change.
And it's going to take a few generations, I think.
The current generation from Intel and AMD and ARM are certainly embracing CXL.
But it will take time to see ultimately where rack-scale architecture changes everything. And I agree with you. I think it's going to absolutely change the design,
the fundamental design of CPUs. And I think that we're not going to see that necessarily in the
next generation, but maybe in a generation further than that, we'll start seeing CPUs that
really embrace this whole concept,
as you said, of extending these caching semantics beyond the traditional memory channels and the
traditional configuration of PC-compatible architecture. So it's going to be really
exciting to see where this goes. Well, thank you so much for joining us today on Utilizing CXL.
I think that this was a great discussion.
And it's great to talk about moving beyond memory because we've been talking about memory basically every episode.
We've been talking about memory expansion.
We've been talking about the future of memory pooling.
And now we're talking about more.
We're talking about futures beyond that.
So it's really exciting to hear this.
And it's great to have folks working to bring this future to life.
Where can we connect with you and continue this conversation with you, Ronen?
So you're welcome to visit unifabrix.com.
We have a video that we published a couple of weeks ago about the demo we did in Super Compute,
where you can see the memory pooling, the expansion of bandwidth and capacity,
but also the NVMe device.
And we will have a new announcement soon, so stay tuned.
Can't wait to hear it.
We're gonna hear some more announcements as well.
Nathan and I will be at our Tech Field Day event here,
March 8th and 9th.
So check out techfieldday.com
to learn a little bit more about that,
where we're
going to have some CXL companies, some of the guests here on Utilizing CXL presenting at that,
and you'll find the videos of that on YouTube as well. Thank you everyone for listening to
the Utilizing CXL podcast, part of the Utilizing Tech podcast series. If you enjoyed this discussion,
please subscribe. You can find us in every podcast application as well as on YouTube.
Also, while you're there, please give us a like, give us a rating, give us a review.
It really helps visibility.
This podcast was brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, go to our special website, utilizingtech.com.
Find us on Twitter or Mastodon at Utilizing Tech.
Thanks for listening, and we'll see you next time.