Utilizing Tech - 4x15: Enabling the CXL Device Ecosystem with Marvell
Episode Date: February 13, 2023
Bringing CXL to market requires a wide variety of components, and Marvell is a key supplier to datacenter and cloud. This episode of Utilizing CXL features Shalesh Thusoo of Marvell, who discusses the many solutions they are creating to bring CXL to market. Current products are directly connecting memory to CPUs via CXL, but soon we will see products enabling the sharing of memory between hosts to enable new applications.
From Marvell: This podcast contains forward-looking statements within the meaning of the federal securities laws that involve risks and uncertainties. Forward-looking statements include, without limitation, any statement that may predict, forecast, indicate or imply future events or achievements. Actual events or results may differ materially from those contemplated in this podcast. Forward-looking statements speak only as of the date they are made. Listeners are cautioned not to put undue reliance on forward-looking statements, and no person assumes any obligation to update or revise any such forward-looking statements, whether as a result of new information, future events or otherwise.
Hosts:
Stephen Foskett: https://www.twitter.com/SFoskett
Craig Rodgers: https://www.twitter.com/CraigRodgersms
Guest:
Shalesh Thusoo, Vice President CXL Product Development, Marvell: https://www.linkedin.com/in/shaleshthusoo/
Follow Gestalt IT and Utilizing Tech:
Website: https://www.UtilizingTech.com/
Website: https://www.GestaltIT.com/
Twitter: https://www.twitter.com/GestaltIT
LinkedIn: https://www.linkedin.com/company/1789
Tags: #UtilizingCXL #CPU #CXL #Memory @Marvell @GestaltIT @UtilizingTech
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on CXL, a new technology that promises to revolutionize enterprise computing.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining me today as my co-host is Craig Rodgers.
Hi, Stephen. It's good to be here again.
How have you been taking all of the recent news around CXL?
Yeah, it's pretty exciting.
We now have AMD and Intel on the market
with host platforms that support CXL.
We were very excited to see Intel's Sapphire Rapids launch
on January 10th,
and we're definitely going to be diving into all of that.
But of course it takes more than a platform,
a host platform to deliver CXL.
It's gonna take a lot of other assorted devices,
right Craig?
It does.
Obviously we have a lot of other companies
that need to contribute to CXL for it to properly grow.
And now they're not going to be hindered
by manufacturer engineering sample availability
and they'll have access to the wider market.
We should see an increased pace here on product releases.
Yeah, I would expect so.
It's going to be very, very exciting.
I mean, we've been seeing CXL in demos up and running all last year on the Intel Sapphire Rapids platform.
And now that it's been launched, I imagine that we're going to see a lot more products coming out.
One of the companies that we talked to last year at FMS, at the CXL Forum, and at the OCP Summit was Marvell. And we actually talked to Shalesh Thusoo from Marvell
back in New York at the CXL Forum. But I wanted to invite him back to get a little bit of a deep
dive into what Marvell is doing in CXL. So Shalesh, welcome to the show.
Thank you. Happy to be here. And I'm the VP of engineering running the CXL product line with Marvell.
Yeah. And I know that Marvell is obviously one of those companies that produces, well, a component that's in almost everything.
And so I wasn't at all surprised to see that y'all are producing CXL components as well.
Maybe we can start off by sort of, I know that it's still new. Like I said,
we just have our first host platforms that support it. But what is Marvell going to be making in the
CXL world? Yeah, Marvell sees CXL as a very enabling technology for the next generation
cloud and data center market. And we are focused on creating solutions that will allow our customers
to create, sort of, optimized deployments of CXL memory, CXL pooling, CXL
expansion. So it's going to be a wide range of products coming out in the future from Marvell.
And the first products, then, that we would expect to hear would be hitting the market would be around memory expansion in servers, taking advantage of those platforms?
What we are doing is we are focused on providing solutions that are broader than just expansion, and the first of those products, in the order in which we'll be getting them out,
will be coming out in the next few months.
Very good. We've seen with other companies that there's a lot of development work
on these products being done using FPGA prior to committing to ASIC.
Is that a similar approach that you're taking?
Yes.
So we have actually developed an architecture that will allow us
to scale from a very small utilization of CXL to a very large utilization.
And then we have taken that same architecture and design that we are putting into silicon
into an FPGA platform.
And you'll see we have done multiple public demonstrations on some of the features that
our architecture allows.
And this allows our end customers also to try out the technology
As you know, it's a brand new technology, and the use cases are actually growing every year,
or every month; actually, every few weeks people are talking about a slightly different way of using it.
Our solution space with the FPGA, of course, allows them to experiment, but on the other hand,
even our silicon and ASICs in the
future will allow a bunch of flexibility so that as use cases evolve, we'll be able to adapt
to most of them. But of course, these aren't going to be products that end users are buying
in the market. You guys are going to be enabling other people's products, which is pretty much what
we've seen from Marvell all along. Is that right? That's right. We are focused on providing the silicon that will then be going into
various products that are developed by other users.
And the first products that we've seen in the market are memory expansion. I imagine that
we're going to see a lot of memory expansion. Maybe not what Marvell is going to be working on,
but what do you suspect is going to be coming down the road from CXL in 2023?
Now that we have the host platforms, you know,
what do you think is going to be coming to market this year overall in 2023?
Yeah, I think in the market, as you've seen,
there are people coming out with memory expansion and direct connecting memory
to the processors, which is a very critical technology.
We do need to get that out there in order to get better utilization of the cores that are being put out
by both Intel and AMD, and we're working with both of them to enable our technology for that.
But I see, even though the first products will start with memory expansion,
it will move into more advanced things like sharing memory between processors and processor cores,
better increased utilization of memory while you're also increasing the bandwidth of the
cores going forward.
So in the latter half of '23, as the technology matures, the scale at which CXL will be
used will keep growing.
But the starting point is going to be direct connect memory expansion
from a market perspective.
And what type of performance then are you seeing
from those initial memory expansion units?
Would you compare it to something similar
to accessing RAM through another NUMA node?
Right.
I think first generation of product,
everyone has been talking about it as basically equal to a NUMA node.
And I think that's the right way of starting to think about it.
But we have to get better.
So in order for it to really be deployed at scale,
the performance or latency, let's call it
latency, there's a difference between latency and bandwidth, of course.
From a latency point of view, the market, in our mind, is looking for even
better than a NUMA node.
So we are very focused on latency and performance.
We look at CXL, a lot of people are looking at CXL as saying it's a far memory. Sure,
there's a utilization of far memory around the technology. On the other hand, the closer you can
make the CXL memory to the processor memory, the better. I mean, it's always going to have slightly higher
latency than direct-connect memory. The question is, how can it be somewhere between a NUMA node and direct connect? So I see, as the next generation products start coming out, that the latencies will be getting closer, better than a NUMA node.
And we are focused towards that.
Wow, that's pretty exciting. We talked to Anders from Microsoft, and he was saying that
in testing they were seeing, you know, sort of remote NUMA node performance,
and they were pretty happy with that. But, you know, the idea that it could be even better than
regular system memory on another node is incredible. But of course, it makes sense because there are PCIe links with CXL support
that go directly into each processor complex,
depending on the architecture of the CPU.
And so I can see that that's the case,
that it could indeed be even better
than a remote NUMA node.
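For listeners who want to try this on a CXL-capable box: Linux typically exposes a directly attached CXL memory expander as a CPU-less NUMA node, so ordinary NUMA tooling can already target it. Here is a minimal sketch using libnuma, assuming the expander shows up as the highest-numbered node; check numactl --hardware on your system, since node numbering varies by platform.

```c
/* Minimal sketch: place a buffer on a CXL-attached NUMA node via libnuma.
 * Assumes the expander appears as the highest-numbered (CPU-less) node;
 * confirm with `numactl --hardware`. Build: cc demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    int cxl_node = numa_max_node();   /* assumption: CXL memory is the last node */
    size_t len = 1UL << 30;           /* 1 GiB */
    void *buf = numa_alloc_onnode(len, cxl_node);
    if (!buf) {
        fprintf(stderr, "allocation on node %d failed\n", cxl_node);
        return 1;
    }
    memset(buf, 0, len);              /* touch pages so they fault in on that node */
    printf("1 GiB resident on NUMA node %d\n", cxl_node);
    numa_free(buf, len);
    return 0;
}
```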
I imagine, of course, that
there's going to be added latency once you do have these memory expansion products. But it seems to
me that there's a software landscape emerging as well that's going to support that. And I imagine
that you all are working closely with that too, to make sure that your products are going to be
supported everywhere?
Yes, we are. So we are working with multiple partners to make sure that we are
sort of addressing the different use cases. There is going to be a use case for far memory that
maybe even a distance greater than NUMA node is still useful. There are applications like that
for memory capacity expansion, for example. I mean, there are definitely applications
that don't care as much about latency
that may be using other technologies today
that are higher latency than a NUMA node.
And CXL is suited for those places also
where you can provide a much larger capacity memory
to a single core.
So we are, as I said, in our mind,
the use cases are very broad and wide.
And so we want to go for applications
that are more general purpose,
that are better than a NUMA node,
and then also applications that are really more targeted
towards large memory deployments,
in which case they can be far memory
and associated with that.
Have you been exposed or heard about some of these applications that are emerging that are going to be able to make use of the sort of memory expansion we're talking about?
What are customers telling you?
So the first focus of customers is to be able to use their expensive processor cores, let's put it that way,
to be able to get performance out of them. There are two aspects of this thing.
As the core counts per socket have increased, the amount of memory that we attach to
that socket has effectively reduced per core, both in how much memory you can put in there and also in the
bandwidth. So yeah, the first deployment is, okay, how do we efficiently use the cores that are already
there from the various host providers?
But really, the other part of it is, how do you effectively use memory?
Because when you look at a rack, I mean, you look at multiple servers in a rack, in a cloud
data center, for example,
you might see that the amount of memory utilization at any given point in time across the rack
is low, relatively low, right?
The memory is not at the right spot.
The cores are in some other spot.
So better utilizing the overall resources at the rack level is where I see
the market starting to move.
And that's where CXL will shine even more and have a much better use case than just directly connecting to one processor.
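The arithmetic behind that utilization point is easy to sketch. Here is a toy model, with made-up numbers, comparing fixed per-server DIMM provisioning against a design with a small local floor per server plus a shared CXL pool sized to the aggregate overflow:

```c
/* Toy model of the rack-level utilization argument: memory stranded in
 * fixed per-server DIMMs vs. a shared CXL pool. All numbers illustrative. */
#include <stdio.h>

int main(void) {
    enum { SERVERS = 16 };
    /* hypothetical instantaneous memory demand per server, in GiB */
    const int demand[SERVERS] = {90, 40, 310, 55, 120, 70, 480, 30,
                                 65, 200, 45, 85, 150, 60, 95, 75};
    const int dimm_per_server = 512;  /* every box sized for its worst case */
    const int local_floor     = 128;  /* pooled design: small local DRAM,
                                         borrow the rest over CXL */
    int total_demand = 0, overflow = 0;
    for (int i = 0; i < SERVERS; i++) {
        total_demand += demand[i];
        if (demand[i] > local_floor)
            overflow += demand[i] - local_floor;  /* served from the pool */
    }
    int fixed_total  = SERVERS * dimm_per_server;
    int pooled_total = SERVERS * local_floor + overflow;

    printf("fixed  : %5d GiB installed, %4d GiB in use (%.0f%% utilized)\n",
           fixed_total, total_demand, 100.0 * total_demand / fixed_total);
    printf("pooled : %5d GiB installed, %4d GiB in use (%.0f%% utilized)\n",
           pooled_total, total_demand, 100.0 * total_demand / pooled_total);
    return 0;
}
```

With these illustrative figures, the fixed design sits around 24% utilized while the pooled design lands above 70%; the actual gain depends on how bursty and uncorrelated the per-server demand really is.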
Yeah, it's actually interesting to hear about that because I think so far much of the discussion has been around memory expansion to meet the needs of applications.
But what you're saying is really true. If we can share memory in a sort of a rack scale
architecture, then we can have a much better, much more efficient use of those resources. So
systems that need more memory now can have it now, and then they can give it up and another system can use it.
That's very, very difficult to do currently
with how servers are architected,
but that'll be something that will be made possible
once CXL extends outside the server
and into this shared memory space.
And I think that that has to be pretty exciting
for everybody in the
industry because it is, you know, it's something we've never been able to do before in terms of,
you know, building this kind of rack scale architecture. And I imagine that that's the
kind of space that you all are pretty excited about as well, because as I said, I mean,
already any kind of device in the rack is using a ton of different
Marvell chips. And you guys will be right there, right? I mean, you'll be in the server, in the
memory expansion, pretty much everywhere with these chips, right? That's correct. And if you
see, you know, the demos that we are showing with our FPGA are at the rack level, right? We have
been running at the rack level for the last few months publicly.
And so, yes, we do want to optimize for both cases,
basically running at the full rack level and how the resources are
optimized there, and then also directly connecting to a host,
because for some applications it's still very useful to be directly connected.
So the initial server platforms support CXL 1.1.
That's what we've heard from both AMD and Intel.
Should buyers be concerned about support
for different CXL versions,
since 1.1 doesn't technically support
any of the things that we're talking about?
Or should they be reassured
that there's going to be compatibility there
as new features and capabilities are brought to market?
Right. So they should be assured that there'll be compatibility.
Of course, as the CXL 2.0 host processors get deployed,
there'll be more features that we can enable around that.
On the other hand, even with the 1.1 CPUs,
you know,
our architecture, as it's modeled in an FPGA, is already doing some of the things that are really for 2.0, right?
Now, there are limitations in that.
I'm not suggesting that it goes without limitations, but it brings 1.1 hosts closer to 2.0 features, even
on their older processors, as the industry matures around that.
And then as 2.0 processors roll out, it'll get even better with the same devices.
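One concrete way to see what a device actually advertises: CXL components identify themselves in PCIe extended configuration space through DVSEC structures carrying the CXL consortium's vendor ID, 0x1E98. The following is a bare-bones sketch that scans a single device's config space via sysfs; reading the full 4 KiB generally requires root, and the device path is one you supply.

```c
/* Bare-bones sketch: scan one PCIe device's extended config space for CXL
 * DVSEC structures (capability ID 0x0023, DVSEC vendor ID 0x1E98).
 * Usage: sudo ./scan /sys/bus/pci/devices/0000:xx:yy.z/config */
#include <stdint.h>
#include <stdio.h>

static uint32_t rd32(const uint8_t *cfg, size_t off) {
    return (uint32_t)cfg[off] | (uint32_t)cfg[off + 1] << 8 |
           (uint32_t)cfg[off + 2] << 16 | (uint32_t)cfg[off + 3] << 24;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pci-config-path>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    uint8_t cfg[4096] = {0};
    size_t n = fread(cfg, 1, sizeof cfg, f);
    fclose(f);
    if (n <= 0x100) {
        fprintf(stderr, "no extended config space read (need root?)\n");
        return 1;
    }
    /* walk the extended capability list starting at offset 0x100 */
    for (size_t off = 0x100; off && off + 12 <= n;) {
        uint32_t hdr = rd32(cfg, off);
        if (hdr == 0) break;                 /* empty capability list */
        if ((hdr & 0xFFFF) == 0x0023) {      /* DVSEC capability */
            uint16_t vid = rd32(cfg, off + 4) & 0xFFFF;
            uint16_t id  = rd32(cfg, off + 8) & 0xFFFF;
            if (vid == 0x1E98)               /* CXL consortium vendor ID */
                printf("CXL DVSEC id 0x%04x at offset 0x%03zx\n", id, off);
        }
        off = hdr >> 20;                     /* next capability offset */
    }
    return 0;
}
```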
You mentioned that you were building these products to scale with the APIification of everything in this day and age.
What are you doing to help your customers actually scale, manage, grow, monitor, maintain and secure your products then?
Marvell are a huge company that's very well resourced, and I'm sure you're putting in considerable effort there too.
Right. So we, you know, there are two things that we think,
there are multiple things, but there are two important things
that we have to keep in mind as we are scaling.
Security is a given nowadays, and we have to, of course,
make it much, much more secure, especially as you start connecting multiple hosts together with the same device.
Right. And then there's RAS. So there's reliability.
So our devices need to be actually more reliable than connecting a device directly to one host.
Because if we go down, we take down multiple hosts. Right. Not just a single.
So our blast radius is much higher.
So we have done a lot of work within our device and architecture to ensure that our reliability is
equal to or better than, and trying to be much better than, a single-host device, so that we don't take down multiple hosts,
or the probability reduces significantly.
Apart from that, it's all about
how do you actually enable that?
How do you show that it actually works?
So there's, you know, CXL defines a fabric manager,
for example.
So we are going, we are providing a reference
fabric manager, right?
For our customers to play around with it, try it out.
We have, you know, you will see a new release
that we have done about a platform
around an FPGA solution that will also become the ASIC solution, as a reference platform,
so our customers and partners can quickly try it, so that they can create new
ideas on how they deploy it and what their infrastructure should actually look like
around that. There's no substitute for actually
seeing things running and working, and we are already seeing good traction and good ideas
from our end customers once they use our platforms to try different things.
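For a sense of what a fabric manager actually manages: in CXL 2.0, a pooled multi-logical device is carved into logical devices (LDs) that a fabric manager binds to, and unbinds from, host ports. The toy bookkeeping below illustrates only that concept; it is hypothetical, not Marvell's reference fabric manager, and a real one issues the spec-defined FM API commands over a management interface.

```c
/* Hypothetical toy model of fabric-manager bookkeeping: binding the logical
 * devices (LDs) of a pooled CXL memory device to host ports and back. */
#include <stdio.h>

enum { NUM_LDS = 8, UNBOUND = -1 };
static int ld_owner[NUM_LDS];        /* host port owning each LD, or UNBOUND */

static int fm_bind(int ld, int host_port) {
    if (ld < 0 || ld >= NUM_LDS || ld_owner[ld] != UNBOUND)
        return -1;                   /* invalid LD or already bound */
    ld_owner[ld] = host_port;
    return 0;
}

static int fm_unbind(int ld) {
    if (ld < 0 || ld >= NUM_LDS || ld_owner[ld] == UNBOUND)
        return -1;
    ld_owner[ld] = UNBOUND;          /* capacity returns to the pool */
    return 0;
}

int main(void) {
    for (int i = 0; i < NUM_LDS; i++) ld_owner[i] = UNBOUND;
    fm_bind(0, 7);                   /* host 7 gets a slice while demand is high */
    fm_bind(1, 7);
    fm_bind(2, 3);                   /* host 3 gets one slice */
    fm_unbind(1);                    /* host 7's demand drops; slice goes back */
    for (int i = 0; i < NUM_LDS; i++) {
        if (ld_owner[i] == UNBOUND) printf("LD %d -> pool (unbound)\n", i);
        else                        printf("LD %d -> host port %d\n", i, ld_owner[i]);
    }
    return 0;
}
```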
Yeah, that reflects what we've seen, I think, at some of these shows. As you mentioned, I mean,
a lot of the pre-production stuff is being demonstrated now.
And it seems to work pretty well.
I wonder if you can look into your crystal ball.
When do you think that this is going to happen?
When do you think that we're going to get rack scale
and so on?
I mean, you're more connected with this,
with the development process.
You've been in the industry a long time.
You know that it takes time
to bring these things to market. And we just got our first host platform literally in the last
month. When will we start seeing rack-scale memory sharing? And when will we see future versions
of CXL? Right. So the speed at which it will get deployed will partially depend
on the solutions that people are providing, right? Like ourselves, right? So
I think what will happen is that as the market and our customers
see what our solutions do and get comfortable that
they can be deployed at scale,
then it sort of will accelerate when it will show up in the market.
Now, looking at the crystal ball, I would think that the deployment at the mass level,
at a large scale, is probably still about two to three years away. So it's more in the 2025 range
when it gets to a level that is used at a larger scale. Yeah, that seems reasonable and realistic.
I mean, there's going to be quite a bit of time before rack scale comes into place. But that
being said, I would believe that memory expansion
internal to the chassis is probably gonna come this year,
2023. Would you agree with that?
I would say initial deployments
will definitely start showing up this year.
You're seeing demo, you're seeing some silicon in the market
and people are trying it out.
As far as the large-scale deployment, even for that connectivity, we'll
have to see, because people, again, are still sort of getting used to, getting confidence in,
this technology and how it works. And my personal opinion is it'll really be 2024 before you
start seeing wide deployment,
even for the expander starting use case and direct connect.
And then from there, it'll move relatively quickly to 2025, where we can go multi-host. And given the sensitivity, as you mentioned with reliability, the sensitivity of these systems,
my understanding, and I wonder if you've heard this as well, is that the system
vendors, the system integrators, everybody is being very, very cautious before they qualify
this for use. I know that Intel and AMD are doing a ton of testing to make sure that the systems,
the platforms are really ready and that this stuff is really, really going to work.
Is that what you're seeing as well, that people are being very cautious before approving and
qualifying these?
Yeah, that's what I'm seeing.
And as I said, it's not only the reliability and system testing, it's also security.
Now, for the first time, you're taking something that is usually inside a
caged box, right, directly connected, that you can't get access to, and putting it on a serial link, a PCIe
type of link effectively, that has the ability to get out of the box.
So you have to make sure that all your security aspects are covered around that.
So that's the other part of it, why I think it will probably take more than 2023 for real
deployments.
It'll really be 2024.
I think I would fall in line. Well, I've said a couple of times, I have a personal prediction
that we'll see an increase in cadence in terms of releases of PCI Express and now CXL versions,
and I think even by 2025 we could potentially be seeing CXL 2.0-enabled servers.
And that would coincide very well with the bedrock of testing
and adoption from the CXL 1.1 devices.
So it would be great to see if that was the case
when things really started picking up
and products like yours then across multiple hosts were coming into play.
That's about the time that the next generation of server platforms
will be in the market as well, Craig. So I think that's a good thought.
Yeah, agreed.
I guess given all of this, is it too early for people
to be excited about CXL,
or should they be starting to dream about what they can do with this technology when it does come to market?
No, I think it's not too early for people to be excited.
But on the other hand, there's a lot of caution along with the excitement, so it has to be monitored.
One of the things, also, is that there's a lot of new
terminology being thrown out there in the market, and, you know, even
understanding what is expansion versus pooling versus a NUMA node. As I see some of the demos, the
definitions are sort of, you would think, a little vague, right, as far as what people call expansion versus pooling.
And so there is a lot of excitement, but at some point, we got to make sure what are the real products and what are the real use cases that are deployable at scale.
Because, you know, that's what I think people are working through
in the next 12 to 24 months in the market.
Yeah, that's a real good point.
And I'm glad you brought up the question of security as well,
because it's gonna be very interesting to see
where that goes, especially with external memory.
Because as you said,
essentially this is to allow memory access
to escape from the system chassis for the first time ever.
And that could cause some problems
if things aren't properly nailed down.
Right.
So that's why it's been one of the big focuses for us
is also making sure that it's all secure
because we don't see a large-scale deployment
without ensuring the security that comes with it.
Yeah.
And on the security side, you know,
a lot of the larger companies
are going to want this integrated with their SIEM.
You know, they're going to want to keep an eye
and a track on this.
And as Stephen alluded to,
it'll be the first time RAM is sitting outside the chassis.
Yeah.
Okay.
So the way I look at it is that
a CXL type of enabling
technology comes every decade or so in the industry. It's not something that
comes out all the time. And it's really good to see that everyone has now
sort of consolidated around CXL. There's been a lot of
different technologies
over the last, I would say, five years
that have sort of demoed
or people have been going towards
that are similar in nature
or similar capabilities
or functionality of promise.
Now that there's a big consolidation around CXL
with both the host and the devices
and industry as a whole,
I see a lot of innovation
that's going to be enabled
in the coming
one to three years and beyond.
And so I think this is a technology
that for the next decade,
we should be able to see
a lot of innovations around that
coming through.
There'd be a lot to integrate.
If I put my engineer hat on, CXL is amazing.
We get memory expansion.
We get memory outside of hosts soon enough.
If I put my architect hat on,
we have a whole new way of designing solutions.
If I put my product manager hat on,
we need to start thinking soon,
is this going to be a viable product?
Is it gonna be adopted?
There's a whole lot of things to do in a company
before they even think about architecting
any kind of solution.
What's our MVP?
Have you been discussing with any customers
who have already maybe even started that process
who know they're going to adopt it?
They have a need, and CXL addresses that need for them, and they've started the process
with a view to maybe implementing in 2025. You know, early adopters can often
make leaps and bounds ahead of companies that wait.
Yes, we have, as Marvell. You know, we're not ready to announce the exact customers, but we definitely have design wins in order for us to really get our first early adopters.
So we know who our early adopters are, and we're working closely with them to ensure that our silicon meets their requirements at scale.
One of the other things that you have to keep in mind that I just thought about is that so far in the industry,
when new memory technologies show up, all the testing is usually done by the host processor companies, because that was the direct connection.
Now, with these devices we are coming out,
we actually have to do the same type of testing that the processor companies do to the memories
ourselves. And you need a certain scale, a certain capability in your company in order for
that to be enabled. So we are working both with the memory vendors from that perspective,
and then with the host providers, because they are familiar with it, and also, you know, there are slightly different capabilities within the hosts as CXL is evolving.
And then also our end customers who then use the various hosts and the various memories.
And we are sort of the, you know,
the silicon in the middle in some sense now.
And so we have been working with all, you know,
all three types of groups of companies
to ensure that our solution actually works.
So we see that you need to do that
to make a successful business out of it,
as opposed to, you know, something in the lab or something that just shows it working.
So, yes, we are taking the approach of ensuring that when we come out, we have all those things already thought through, and it doesn't take a long time from there to get to full production.
Yeah, absolutely.
And that's a really, really good point.
I hadn't really thought about that.
I really appreciate you bringing that up.
It's a new world for folks like yourselves
to be in that position with memory qualification
and so on.
Well, thank you so much for this conversation.
It's been really, really interesting.
I appreciate it.
As we wrap the episode,
where can people connect with you
and continue the conversation about CXL
and other topics, Shalesh?
Yeah, so we'll send a few links to be published as part of this podcast.
But you can also get to www.marvell.com.
And through that, you'll find a few videos around that.
But we'll have an explicit link being sent out.
Yeah, thanks a lot. And Craig, I think you and I are going to be headed to Tech Field Day
on March 8th. Maybe we'll see some of these CXL companies there.
What else is going on with you? Looking forward to
hopefully seeing some CXL companies there.
I am on LinkedIn as Craig Rodgers, and you can find me on
Twitter at CraigRodgersMS.
For me, you'll find me at SFoskett on most social media networks.
And if you'd like to learn more about that tech field day I mentioned, just look me up and drop me a line.
Thank you for listening to the Utilizing CXL podcast, part of the Utilizing Tech series.
If you enjoyed this discussion, please do subscribe.
You'll find us in your favorite podcast application.
And also, please do give us a rating or a review.
This podcast is brought to you by gestaltit.com,
your home for IT coverage from across the enterprise.
For show notes and more episodes, go to utilizingtech.com
or find us on Twitter at Utilizing Tech.
Thanks for listening, and we'll see you next time.