Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 4x17: Memory and More over CXL with UnifabriX
Episode Date: February 27, 2023
Memory expansion is the first application for CXL, and memory pooling is coming next, but this technology will eventually support storage, Ethernet, and more. This episode of Utilizing CXL brings Ronen Hyatt, CEO of UnifabriX, to discuss memory and more over CXL with Nathan Bennett and Stephen Foskett. UnifabriX demonstrated high-performance compute benchmarks using their Smart Memory Node at Supercomputing 22 and claims to be the first to show a CXL 3.0 fabric. But the company is also promising NVMe storage and Ethernet connectivity over the CXL fabric. This enables each server to have the right type and capacity of connectivity, from basic Ethernet to DPU, with dynamic reconfiguration. The fabric can also contain an NVMe storage target that combines DRAM and flash and can be dynamically allocated. Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Nathan Bennett: https://www.twitter.com/vNathanBennett Guest: Ronen Hyatt, CEO and Founder of UnifabriX: https://www.linkedin.com/in/execute/ Follow Gestalt IT and Utilizing Tech Website: https://www.UtilizingTech.com/ Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/1789 Tags: #UtilizingCXL #CXLStorage #Ethernet @UtilizingTech @UnifabriX
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on Compute Express Link, or CXL,
a new technology that promises to revolutionize enterprise computing.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining me today as my co-host is Nathan Bennett. Welcome, Nathan.
Hey, Nathan.
Hey, Stephen. Good to be back with you all again, talking about this awesome technology and kind of peering into the future of all things CXL and this Compute Express Link. I think we've beat
memory to death at this point. Let's see what we can talk about today.
Absolutely. We've talked in the last few episodes with folks from, well, all sorts of different companies that are building platforms like AMD, Intel, ARM, building software to support CXL. In fact, one of the things I said was it's a good thing they're starting with memory because it's a great application. It's a great kind of slam dunk
application that people can come in and they can say, you know, I want to get some benefit from
this technology. We implement memory expansion. We have got bigger memory on the system. We've
got more memory bandwidth than we thought we would. Applications run faster? Cool. Where do we go from here?
And that's why we've invited Ronen Hyatt from UnifabriX to join us today so that we can
learn a little bit, not only about memory expansion and memory pooling, but sort of
what comes next.
So welcome to the show, Ronen.
Thank you, Stephen.
Excited to be here.
Give us a little bit about your background and where UnifabriX came from.
Excellent. So I'm the CEO of UnifabriX and I'm one of the co-founders. I come originally from Intel DCG, from the Data Center Group. So I've been playing with CXL since almost six years ago, when it was internal to Intel. It was called Flexbus internally, later IAL, Intel Accelerator Link.
Only in 2019, it became public.
So I'm one of the very few people out there that have multi-year experience with CXL.
And UnifabriX is all around CXL and the benefits that CXL can bring to the industry and data centers.
We provide the ultimate memory scaling solution,
solving the roadblocks that local DRAM puts into compute today,
solving both the bandwidth scalability and capacity scalability of memory and the mismatch that exists today between memory and compute.
And actually, not only that, from our name, UnifabriX, you can infer that our vision going forward is having a single unified fabric within the rack.
A hint: it's not going to be Ethernet, it's going to be CXL.
So at Supercomputing, you showed a memory expansion device, a memory expansion capability,
the smart memory node that allowed systems to scale much bigger, run much faster.
I do want to talk about things beyond memory, but I guess first, let's start with memory.
Tell us what you're showing now and what you're selling now, and then we'll move beyond memory.
So at Supercomputing, we were actually showing a lot more than memory expansion.
Memory expansion is quite simple.
It's a form factor that gets into the server that expands memory within the server.
We were showing memory pooling, and that was a demonstration of a memory pool product where we provision not only memory capacity but also memory bandwidth to servers, CXL-enabled servers.
This server was Sapphire Rapids, the Intel Sapphire Rapids.
It was a pre-launch version of Sapphire Rapids.
And we were running a real HPC application, HPCG, over that server.
HPCG is quite a notorious benchmark in the HPC world. It consumes a lot of memory
bandwidth. And what we showed is that once you start engaging more and more CPU cores into the
workload, into HPCG, at some point you exhaust all the local memory bandwidth and then you get stuck. Then your compute is getting stranded. You can use only
around 50% of the CPU cores that you have on the socket, but no more than that.
And with the extra bandwidth that we provisioned through our smart memory node, we were able to use all the CPU cores on the socket, the total 100%. Like you get double the compute density instead of half.
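As a rough illustration of the scaling effect Ronen describes, here is a back-of-envelope, roofline-style sketch in Python. All of the numbers in it (core count, per-core bandwidth demand, local and CXL bandwidth) are assumptions chosen for the example, not measured Sapphire Rapids or UnifabriX figures.

    # Illustrative model of bandwidth-limited core scaling; all numbers are assumed.
    def usable_cores(total_cores, per_core_bw_gbs, local_bw_gbs, cxl_bw_gbs=0.0):
        """Cores that can be kept busy before memory bandwidth saturates."""
        total_bw = local_bw_gbs + cxl_bw_gbs
        return min(total_cores, int(total_bw // per_core_bw_gbs))

    CORES = 56        # cores per socket (assumed)
    PER_CORE = 10.0   # GB/s each core demands on a bandwidth-bound kernel like HPCG (assumed)
    LOCAL = 300.0     # GB/s of local DRAM bandwidth (assumed)
    CXL = 280.0       # GB/s of extra bandwidth provisioned over CXL (assumed)

    print(usable_cores(CORES, PER_CORE, LOCAL))       # 30 cores, roughly half the socket
    print(usable_cores(CORES, PER_CORE, LOCAL, CXL))  # 56 cores, the full socket

The point is only the shape of the curve: once local bandwidth is exhausted, adding cores gains nothing until more bandwidth arrives from somewhere.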
So I think that's very good for us to hear.
We talk about memory, and we've talked about it a lot on this podcast.
I don't mean to bash it because at the end of the day, there's a lot of great things
that can come from memory.
But just what you were discussing in terms of the fabric and memory being able to be pooled, from only being able to utilize maybe 50 or 60% of the CPU to a hundred percent of the CPU. I think there's a lot of different ways we could
go in that conversation. Like, oh, well the CPU, the way they manufacture CPUs will probably change
because they got to figure out how to make sure that they can withstand that type of bandwidth
and all those different things as well. But one thing that you said in terms of the fabric, I'm just curious in terms of, and
if we want to stay on memory, we can go back to it.
But the idea that I keep hearing about is this idea of a master controller.
And we all kind of know this if we've studied our architecture in terms of how a computer works, where the master controller kind of worked in between RAM and the CPU.
Is that something that kind of is functional in what we would see in CXL, where there's
like an extra component that works in between them?
Or is this something that would be kind of more streamlined into like an actual, the
chiplet type of situation where there's smaller chips that just streamline the data
pipelines from one place to the next?
So CXL is many things. CXL is an interface that provides you the ability to connect to a server at the rack level. At the chiplet level, we have today UCIe, where you can design CXL-based logic and then attach it at a chip level, at a socket level, at a package level.
So everything is there. So in terms of the master controller, you can get everything over CXL.
We talked about memory, we talked about storage, we talked about networking and all that connects over the same interface.
Yeah. And I think that's the exciting part, right?
And, you know, again, I think I came down hard on memory at the beginning of the podcast.
That's a great thing for us to utilize.
But what's really exciting is the idea of moving past that.
And you mentioned networking. What can CXL really do in that networking arena?
So CXL is much more than just memory. Of course, everybody starts with the memory. In SC22,
we demonstrated NVMe, actually the fastest NVMe device over CXL and we actually have Ethernet over CXL running at our labs today,
totally replacing the top of rack switch.
So what we have, our configuration is have multiple servers within the rack
and our appliance sitting at the top and having CXL cables going from our appliance to each server
and each server sees an Ethernet service, meaning it sees an Ethernet NIC.
But it's not a physical NIC.
It's not that something is installed within the server.
This is something that we expose on demand through our appliance, our smart memory node.
And we can expose that as a service.
And this is nice.
You don't have to do that.
You don't have to expose an Ethernet service to each server. You don't have to expose an NVMe device to each server. This is done on demand.
And we can expose different types of NICs. It could be a simple NIC, like just moving data, and it also could be a DPU, like a NIC that also does network overlays, all the vSwitch processing that exists in modern DPUs today. And the nice thing is, everything here is on demand.
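To make the "device as a service" idea concrete, here is a purely hypothetical sketch of what provisioning might look like from an orchestrator's point of view. The class and method names are invented for illustration only; they are not a UnifabriX product API or anything defined by the CXL specification.

    # Hypothetical orchestration sketch; all names are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class VirtualNic:
        host: str           # server the device is exposed to over the CXL cable
        speed_gbps: int     # 10, 100, 200, 400 ...
        dpu_offload: bool   # plain NIC vs. DPU-style vSwitch/overlay processing

    class FabricAppliance:
        def __init__(self):
            self.exposed = []

        def expose_nic(self, host, speed_gbps, dpu_offload=False):
            nic = VirtualNic(host, speed_gbps, dpu_offload)
            self.exposed.append(nic)
            return nic          # the host simply enumerates an Ethernet NIC

        def reconfigure(self, nic, speed_gbps=None, dpu_offload=None):
            if speed_gbps is not None:
                nic.speed_gbps = speed_gbps
            if dpu_offload is not None:
                nic.dpu_offload = dpu_offload

    appliance = FabricAppliance()
    nic = appliance.expose_nic("server-07", speed_gbps=10)
    appliance.reconfigure(nic, speed_gbps=400, dpu_offload=True)  # upgraded on demand

The design point being illustrated is simply that the device's personality lives in the appliance, so it can be created, resized, or removed without touching the server hardware.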
So would the, in this case, let's talk about these NICs. Would these be physical NICs that
are in a remote box, essentially away from the server, and you're using CXL Fabric as a multiplexer
to attach it to a server, or would they be virtual NICs?
I mean, is it an actual DPU or is it a virtual DPU
that's using some kind of pooled compute resource
and IO in that remote box?
How does that work?
It's a virtual DPU and a virtual NIC.
It's not that we use CXL to multiplex like other players are doing today with PCIe boxes that connect to shelves of NICs or shelves of storage.
We actually provide the service itself within our appliance, within our smart memory node.
So this is something that we provide on demand and we can change the characteristics and the personality
of that device according to its use.
For instance, we focus today on the HPC market.
The HPC market is using a lot of RDMA, whether over InfiniBand or over RoCE.
And one of the key important things is passing messages, like the MPI interface, which needs to be very low latency, ultra-low latency. So one of the things that we can expose using our Ethernet service is an MPI message-passing device, which is ultra-low latency because it's running over CXL, and CXL is much lower latency than PCIe. And this is how we accelerate HPC applications,
not just by providing more memory bandwidth,
not just by providing more IO bandwidth,
but also by providing much faster network and fabric.
So I guess that means then that UnifabriX
would be essentially a competitor in the DPU space as well,
since you would have to be developing your own DPU capability,
not just a CXL capability.
The form factor would be our smart memory node. So we would not be like selling DPU cards. This
is not our business, but this is something that we can natively provide as a service on demand through our appliance. And it's a very natural development
of CXL services that can run over the CXL cabling that we envision already exists within the rack.
Yeah, yeah. And I know that this is early times still, and it's not like we're talking
about a specific product or something. So just trying to get my head around sort of what this
thing would look like and what the benefits would be. So as you mentioned, I heard one of the
benefits might be that you could have sort of special purpose NICs. So if the server needs a
DPU or if it needs just a basic throughput, I imagine you could have different levels of
performance available as well. That's one benefit. Maybe some dynamism. So if the server changes purpose, you could
repurpose the type of connectivity that that server is going to. Also, I imagine that you
could have maybe aggregate higher performance. So maybe you've got 400 gigabits out of the top
of the rack or something, and that you could then deploy that to all the servers in the rack
without having to spring for expensive high-end network
adapters on every server or something like that, right?
That's correct.
If you think of CXL in the generation of PCIe Gen 5, then every CXL x16 link is one terabit
per second.
And we use that one terabit per second to pass memory transactions.
But in the context of networking, think that we can expose a virtual NIC of like 10 gigabit per second and then upgrade it on demand to 100, 200 and 400 gigabit per second with or without special processing.
So we can scale everything.
We can scale the amount of bandwidth that the NIC provides.
We can provide RDMA or not. We can provide the vSwitch processing like a DPU or not.
This is completely flexible.
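The "one terabit per second" figure follows from simple link arithmetic, sketched below: PCIe Gen 5 signals at 32 GT/s per lane, so a x16 link carries roughly 512 Gb/s in each direction, or about 1 Tb/s counting both directions, before protocol and encoding overhead. The virtual NIC sizes used are just the ones mentioned in the conversation.

    # Raw-link arithmetic for a CXL x16 link on PCIe Gen 5 electricals (overhead ignored).
    LANE_RATE_GBPS = 32                      # PCIe Gen 5: 32 GT/s per lane
    LANES = 16

    per_direction = LANE_RATE_GBPS * LANES   # 512 Gb/s each way
    bidirectional = 2 * per_direction        # ~1 Tb/s, the figure quoted above

    for nic in (10, 100, 200, 400):          # virtual NIC sizes mentioned in the episode
        print(f"{nic:>3} GbE uses about {nic / per_direction:.1%} of one direction")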
And I guess maybe you could kind of run us through what the storage connectivity looks like as well,
since NVMe storage has maybe a different form factor, different protocol, and takes up more space than a network adapter might.
I don't know.
What would a storage-connected CXL device theoretically look like?
So what we learned while working with the HPC customers is that they have memory bottlenecks.
So we started with memory.
That was our first step as a company, and we solved it.
We already have a product that provides memory.
So what we learned from these customers is that memory is only one bottleneck.
So it's like an onion.
You peel the first layer, and then you find that once you solve your memory bottleneck, you get to an IO bottleneck.
And then, how do we solve the IO bottleneck? So we say, okay, we already have a very high bandwidth link, the CXL link that goes out of the server. Let's look at how we can reuse it,
not just for memory transactions, but also for storage transactions. And this is where we came up with exposing NVMe over the same cabling and using a very fast storage system that uses a hybrid combination of DRAM and NAND flash.
And this is where you can run very large data sets over memory at extremely high performance, like more than 25, 30 gigabytes per second per NVMe interface,
and around one, two microseconds of average latency.
The tail latency is three microseconds.
It's extremely good compared to real NVMe devices.
And this is where we solve the IO bottlenecks of HPC players.
And then we move to the next bottleneck, which is the fabric and networking.
So we provide a comprehensive solution in that space.
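For a sense of scale, the throughput and latency figures quoted above can be turned into IOPS and queue-depth terms with Little's law. The 4 KiB block size and the arithmetic below are illustrative assumptions, not benchmark results.

    # Back-of-envelope from the quoted ~30 GB/s and ~2 microsecond figures; block size assumed.
    THROUGHPUT_GBS = 30.0      # ~30 GB/s per NVMe interface (quoted)
    AVG_LATENCY_S = 2e-6       # ~1-2 microseconds average latency (quoted)
    BLOCK_BYTES = 4 * 1024     # assumed 4 KiB I/O size

    iops = THROUGHPUT_GBS * 1e9 / BLOCK_BYTES    # ~7.3 million IOPS
    in_flight = iops * AVG_LATENCY_S             # Little's law: ~15 outstanding I/Os
    print(f"{iops / 1e6:.1f}M IOPS with only ~{in_flight:.0f} I/Os in flight")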
So Ronan, it sounds like what we're discussing here and what you're trying to accomplish
is not a fabric of one particular thing, but a fabric of many different things.
Is that what your company is targeting?
Is that the goalpost that y'all are going towards?
So, the way we look at it, CXL is going to happen, right? You will have memory pools within
the rack. And if you have a memory pool, you'll have CXL cabling in the rack. And then we say,
okay, this thing is already there. What else can you do with it? And then we come with the storage, with the NVMe, and with
the Ethernet, and with other things that are currently in development. So our goal is to make
CXL as valuable as possible at the rack level. And finally, replacing Ethernet. Like, you will not need Ethernet cabling per se
within the rack just when going outside the rack.
The myriads of Cat6 cables I have in my home office
will cry tears.
And I will cry tears of joy if that happens,
but that sounds very future forward
and I'm excited about that prospect.
But it sounds like the goal initially for starting out is memory, and then moving past that to a fabric of all the different peripherals that we want to connect to it.
Is that about correct?
Yes, everything will be replaced.
By the way, you talked about Cat6 cabling.
We are using CDFP cabling.
CDFP is a standard form factor.
Actually, Google is using it in their TPU clusters. So it's running CXL at PCIe Gen 5 speeds. It's also compatible with the future PCIe Gen 6 speeds. So the same form factor of
cabling and connectors will stay with us going forward for at least several years.
I always think it's funny how in technology we throw these letters in like an alphabet soup and
we understand its context. But at the end of the day, it's about that point that you just brought
up, about how it works right now with Gen 5 but is forward compatible with Gen 6. At the end, the main point is we will always be able to use this type of cabling and have these types of speeds. You're talking about like one terabit per second instead of like the 100 or 400 gig per second that we're normally used to. And us in the home labs at like 10 or one gig, we live that life. But we're talking insane speeds at this point.
Let's go back to the idea of moving from a networking overlay to all these individual components utilizing PCIe. Would we actually start seeing additional gains with the DPU structures that we're seeing right now, as opposed to moving from that to PCIe?
Because at the end of the day, what we're actually seeing in terms of what customers are using right now is that the DPU is kind of in its infancy, but we're seeing a lot of
customers starting to use it in the commercial as well as, you know, public sectors and other
areas as well.
But the speed is very fast because it's very close.
Right.
And I tend to bring this up every podcast, which is that the components are getting closer and closer together.
And it sounds like CXL is starting to spread them out.
And that's a worry that I have in terms of like, well, what are we going to see?
What are the architectures going to look like? And what are people like me, architects, going to have
to deal with? When we're looking at this in terms of the fabric, how are we going to be able to
manage it, monitor it, and upgrade it? All of those day-to-day things that we're going to have
to deal with. What do you see in the future for CXL in those areas?
Very good point. And one of the examples is like, let's look at something that exists both in the PCIe domain and the CXL domain, like an NVMe device. You were talking about DPU, but let's take an NVMe drive. Like, an NVMe drive
can reside within the server. It has a certain performance, and now you take it away. And we demonstrated in SuperCompute that we provide an NVMe device
through our smart memory node.
So from the perspective that you were saying, it's farther from the server,
and the concern is whether it will make it slower.
But we were actually showing the fastest NVMe device that exists today
in the world with 30 gigabytes per second and really fascinating latency.
So CXL solves that.
CXL solves that.
It provides both the bandwidth, extreme level of bandwidth,
but also a very, very low latency, much lower than PCIe.
So for every PCIe device that you can think of, whether it's a network device, DPU, or NVMe, the latencies that you get over CXL are much lower.
So you're actually getting a really, really fast network adapter
and people are adopting, for example,
OCP NICs and high performance DPUs and so on in servers,
but that starts adding quite a lot to the cost.
It also takes up a lot of the space within the chassis.
Same thing with storage.
Most chassis have basically the whole front of the system
is devoted to storage,
even with these new
EDSFF next generation drives. Really, the whole front of the server is storage. The whole back
of the server is IO and networking. Very little space in there for GPU or XPU. Theoretically,
the CXL fabric offloads that. And this leads me to my next point, which is,
how will the physical architecture of the system change once these things are well and truly
adopted and implemented? Because if there's no need for a whole row of NVMe drives, and there's
no need for a bunch of big clunky NICs or a GPU or so on,
then why does a server look the way it does anymore? Will servers change?
Yes, obviously you will have more space, more volume for compute. And this is what you want.
We talked about disaggregation. So the server will be more around compute, having the CPU cores inside the chassis that today we call a server, and most of the other services would be external. Like, you will have a certain amount of memory within the server itself, close to the compute. We are not eliminating local DRAM completely. It's not a good thing architecturally. You need to have local DRAM as well.
But you will have a memory pool outside.
You will have the networking abstraction outside.
It would be native Ethernet because the operating system will still see an Ethernet NIC, although it will not be physical.
But the experience would be the same, completely
identical. And same for the storage. So you would have the disaggregation. You'd have the storage on
separate shelves. The networking would be provided through a replacement for a top-of-rack switch that would be CXL-based, and practically everything else. The server itself would
become this center point for compute,
where it should be.
And it's interesting to sort of puzzle about this
because there's a bunch of power and cooling
and packaging questions that would be raised.
So immediately, if you say the server is about compute,
then it makes me think,
oh, does that mean we're going to get more quad
or eight socket CPUs?
If we have some memory outside the system,
memory is one of the things that takes a lot of cooling.
So does that mean that we would have maybe smaller systems
with less memory channels?
Maybe we're going to prioritize more sockets and less channels.
But then you start thinking, wait a second,
then what do we have all this space for?
And what about flow through?
And what about allocation of power within the rack?
And how come this component here in the middle that has all the CPU sockets in it is going to be a, you know, terawatt rack, and then this one up here is going to need all the cooling, and this one down here. And it just really boggles the mind. Are people working on that? Is that an active area of focus?
Absolutely. One of the key advantages of
CXL is that it allows you to move things around. Because today, when you build a server, you have
to put the memory and compute side by side. And then you have the challenges of taking all that
heat outside the server. And also the NVMe drives that block your faceplate at the front and disrupt the airflow.
And a lot of other components, even at the back of the server.
And what CXL provides you is the ability to move all these components around, to disaggregate
and put them in different places.
And this is how you can actually solve also the thermal challenges that you have today.
And you talked about four-socket and eight-socket servers.
One of the nice things that CXL provides is actually getting rid of those
because with CXL, you can create ad hoc coherency domains between different CPU sockets
so you can actually build a server with one socket, two sockets, or four sockets on demand.
You don't have to physically install
a four-socket server for that.
This is the more advanced use cases of CXL.
It's on the roadmap.
We will get to that.
Yeah, so it sounds more like we might see more
big twin, big quad, I don't know, smaller CPU plus memory plus CXL units
spread across a rack unit. And then those can be aggregated together to create a virtual 8, 12, 20 socket, whatever CPU or processor you wanted, rather than... Okay. Yeah, it just
boggles the mind about exactly what makes sense. What does it make sense to move these chess pieces
when you've got to worry about things like power and cooling and packaging and placement and all
those other things? And I completely agree. I don't think memory is going to be gone from the
server. I don't think some storage, some IO is going to be gone from the server. It's just a matter of, you know, basically setting up a basic amount that's needed on the physical unit and then connecting the rest over fabrics.
Maybe even a fabric, a uni-fabric, if I may.
Exactly.
One thing that we envision with the memory, the locally attached memory to the server, today it's parallel.
The DIMM interface, it's like 300 pins for each DIMM.
It's huge. Think about a socket, a CPU socket, a modern one with eight or 12 DIMM channels.
It's like thousands and thousands of pins.
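The pin-count contrast Ronen is pointing at can be sketched with rough numbers. Both figures below are approximations for illustration: a DDR5 DIMM connector is on the order of 288 pins, and a x16 serial link needs on the order of 64 high-speed signal pins, ignoring clocks, sideband, and power.

    # Rough parallel-vs-serial pin comparison; counts are approximate, for illustration.
    PINS_PER_DIMM = 288        # DDR5 DIMM connector, the "~300 pins" mentioned above
    DIMM_CHANNELS = 12         # a modern server socket with 12 channels (assumed)

    SIGNAL_PINS_PER_LANE = 4   # one differential pair each for TX and RX
    CXL_LANES = 16

    print(PINS_PER_DIMM * DIMM_CHANNELS)       # ~3,456 pins for the parallel DIMM interface
    print(SIGNAL_PINS_PER_LANE * CXL_LANES)    # 64 high-speed pins for a serial x16 link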
We believe that memory, even the local memory,
would become serial going forward in the next several years.
So we will not have the existing legacy DIMM interfaces.
We will have serial memory attached to the socket.
And then it would be probably CXL as well.
IBM played with OMI.
It's a different serial memory interface
that you're probably
familiar with. So we envision locally attached CXL memory to the CPU and then CXL links going
from the CPU outside to the rack.
Yeah, this is something that I wanted to chime in here because
as we keep talking about CPU architectures, I like to bring up things that are already out there that people are already
using. And I think what Apple does with system on a chip, like the whole CPU structure, as well as
the data lanes connected directly into the DRAM that's there locally, we could see that type of
architecture more and more readily out there in terms of like, here's your CPU and your compute
in multiple different sockets.
And that could be maybe like a caching layer, or maybe that's the performant layer for your RAM and your memory, but then a CXL connection outward into the fabric.
This type of architecture is what really excites me in terms of CXL itself, because, I mean, we saw Apple take a chip and literally just glue another chip on top of it, and it actually worked. And that was like fantastic. Now let's see where
they can take that type of architecture in terms of the server interface.
Yeah. And that's pretty much what Intel's doing with the Max, with the Xeon Max, with the HBM on board and their tiered memory.
Yeah, exactly. That's wonderful news for people that want to
do CXL for this type of architecture that we're talking about, because now we have a much smaller
footprint that we have to cool, and then we can kind of create that architecture around it.
But at the same time, this discussion is about moving from that CPU architecture into a memory network, with one terabit per second type connection speeds outside of the box.
That right there tells me and excites me about the future of CXL.
Ronen, what do you have to say in terms of all of these things that we're talking about?
Like how far away do you think this really is?
Are we talking a decade? Are we talking five years?
How close could this happen? So yes, there is a roadmap and there is a lot of market education to be done
because people are used to certain models of a rack, of a server. So everybody starts with memory.
Now people understand that memory pooling is good. It provides a lot of value and we will start
seeing memory pool devices out there.
Of course, the UnifabriX Smart Memory Node is already there and shipping, and we demonstrated
it working. That's the first step. We will start seeing innovations even in the silicon arena where
CXL gets into UCIe chiplets, and there would be a kind of marketplace for chiplets, standard interface chiplets with UCIe, where silicon companies and startups could focus on building their own IP into a small chiplet and getting the rest of the functionality from other chiplets from other vendors. And then you get your whole SoC, your whole chip, like Apple is doing with their SoCs,
and the NRE would be much lower.
Like the barriers of getting this innovation out into silicon
would be much lower.
So this is at a smaller scale.
And going to the larger scale,
the architecture of the server will change.
The architecture of CPUs will change.
Today, the microarchitecture within the CPU is totally tuned to locally attached memory.
Like the CPU architects that designed the server CPUs
assume that the latency to the DRAM is very low
because it's locally attached DRAM.
And they would have to change their design
because now you have locally attached DRAM,
but you also have further memory.
So we will see the pipelines of CPUs changing going forward.
We will see new types of innovations
around CXL at the rack level,
like we discussed getting more services over CXL,
like storage and networking
and different types of accelerators.
And CXL practically extends the CPU itself.
Like think about how CPU cores communicate with each other over the mesh within the CPU.
CXL extends that because you keep the same type of semantics, like the cache semantics,
not just within the CPU mesh, but also outside the CPU.
So you can scale to much higher diameter.
It really is interesting to think about how these things are going to change.
And it's going to take a few generations, I think.
The current generation from Intel and AMD and ARM are certainly embracing CXL.
But it will take time to see ultimately where rack-scale architecture changes everything. And I agree with you. I think it's going to absolutely change the design,
the fundamental design of CPUs. And I think that we're not going to see that necessarily in the
next generation, but maybe in a generation further than that, we'll start seeing CPUs that
really embrace this whole concept,
as you said, of extending these caching semantics beyond the traditional memory channels and the
traditional configuration of PC-compatible architecture. So it's going to be really
exciting to see where this goes. Well, thank you so much for joining us today on Utilizing CXL.
I think that this was a great discussion.
And it's great to talk about moving beyond memory because we've been talking about memory basically every episode.
We've been talking about memory expansion.
We've been talking about the future of memory pooling.
And now we're talking about more.
We're talking about futures beyond that.
So it's really exciting to hear this.
And it's great to have folks working to bring this future to life.
Where can we connect with you and continue this conversation with you, Ronen?
So you're welcome to visit unifabrix.com.
We have a video that we published a couple of weeks ago about the demo we did in Super Compute,
where you can see the memory pooling, the expansion of bandwidth and capacity,
but also the NVMe device.
And we will have a new announcement soon, so stay tuned.
Can't wait to hear it.
We're gonna hear some more announcements as well.
Nathan and I will be at our Tech Field Day event here,
March 8th and 9th.
So check out techfieldday.com
to learn a little bit more about that,
where we're
going to have some CXL companies, some of the guests here on Utilizing CXL presenting at that,
and you'll find the videos of that on YouTube as well. Thank you everyone for listening to
the Utilizing CXL podcast, part of the Utilizing Tech podcast series. If you enjoyed this discussion,
please subscribe. You can find us in every podcast application as well as on YouTube.
Also, while you're there, please give us a like, give us a rating, give us a review.
It really helps visibility.
This podcast was brought to you by gestaltit.com, your home for IT coverage from across the enterprise.
For show notes and more episodes, go to our special website, utilizingtech.com.
Find us on Twitter or Mastodon at Utilizing Tech.
Thanks for listening, and we'll see you next time.