Utilizing Tech - Season 7: AI Data Infrastructure Presented by Solidigm - 4x06: Enabling CXL with VMware with Arvind Jagannath of VMware
Episode Date: December 5, 2022. A lot of VMware's initiatives are in the same sphere as CXL, and now everybody is wondering if VMware is officially camp CXL. In this episode of Utilizing CXL, Stephen Foskett and co-host Craig Rodgers join Arvind Jagannath from VMware to hear it straight from the horse's mouth. Learn if CXL is on the horizon for VMware as it is for a lot of IT companies of its stature, and what VMware's role is in the enablement of the technology. Watch the full episode to find out how VMware is participating in the CXL revolution and the way its most recent projects support and enable implementation of CXL. Hosts: Stephen Foskett: https://www.twitter.com/SFoskett Craig Rodgers: https://www.twitter.com/CraigRodgersms Guest: Arvind Jagannath, Product Management Lead at VMware https://www.linkedin.com/in/arvindjagannath/ Follow Gestalt IT and Utilizing Tech Website: https://www.UtilizingTech.com/ Website: https://www.GestaltIT.com/ Twitter: https://www.twitter.com/GestaltIT LinkedIn: https://www.linkedin.com/company/1789 #UtilizingCXL #CXL #VMware
Transcript
Welcome to Utilizing Tech, the podcast about emerging technology from Gestalt IT.
This season of Utilizing Tech focuses on CXL, a new technology that promises to revolutionize enterprise computing.
I'm your host, Stephen Foskett, organizer of Tech Field Day and publisher of Gestalt IT.
Joining today as my co-host is Craig Rodgers.
Hi, I'm Craig Rodgers. You can find me on Twitter at CraigRodgersMS.
So, Craig, we were talking quite a lot about CXL,
and we, of course, were part of a panel at the recent OCP Summit
where we did an all-day CXL forum presentation.
One of the things that occurred to us then,
and that we were excited to see that day, was a presentation from VMware, which, of course, is an important prospective part of the CXL world, wouldn't you say?
I wholeheartedly agree.
For CXL to succeed, we need companies like VMware on board supporting that software layer.
Absolutely. And I feel like if CXL is well
supported in the hypervisor, then it basically makes all of this stuff possible because
everything from memory expansion to really all the way up to composability and rack scale
architecture, all of that, it makes a lot of sense to support it in the hypervisor,
maybe even more than the OS,
though, of course, supporting the OS is still really important.
So that's why we decided to invite as our guest in this episode,
the presenter from the CXL forum, Arvind Jagannath from VMware.
Arvind, welcome to the podcast.
Thank you, Stephen, Craig.
It's my pleasure to be here.
Well, it's our pleasure to have you because, you know, although we haven't yet heard exactly what VMware is going to be doing in terms of CXL support officially, having you there really speaks volumes in my mind.
Even just having you, you know, join us on the podcast and join us at CXL Forum.
It says that this is on the VMware roadmap and radar, at least.
Yes, absolutely. So VMware has been working on a lot of memory initiatives and has been enabling a lot of the newer platforms, such as Sapphire Rapids and Genoa. And we definitely are totally into enabling CXL as part of some of those platform enablements.
And what would that support look like? I mean, obviously, if it's supported, sort of passed through to the OS, that's one thing. But would it be the case where this would basically be a tier of memory available for virtual machines to use, just kind of like
what you guys were doing with Optane as well? Yeah, yeah. So we are thinking about several
different plays here. So one is, you know, just a basic enablement, like the one we demoed
during VMware Explore. So there, you know, the CXL-based memory is presented as a separate NUMA node.
And then, you know, we access it as flat memory.
It's basically a memory expansion use case where we showed that, you know, the applications could use additional bandwidth.
But at the same time, we are definitely looking into some other use cases, such as the memory tiering use case, where we could provide better TCO for our customers and reduce overall cost.
And we also have other use cases, such as with accelerators, that I can definitely talk about. It's interesting to think that VMware
are going to be a disaggregated memory hypervisor.
Tiering is one of the obvious things you see.
Obviously, VMware has been able to tier with vSAN,
but now we'll have these potential memory tiers
within the hypervisor.
Are VMware doing anything or can you share anything about tiering specifically within workloads?
Yeah, yeah, definitely.
So VMware has been in this journey over the last few years where we have worked on Intel Optane. So we have enabled the various hardware modes available within
Intel Optane, such as the memory mode and the app direct mode. But at the same time,
VMware and Intel actually supported us and collaborated with us in the software-based
memory tiering architecture, which means that we could actually make use of, I mean, we could be
more aware of some of the latency characteristics of Optane. And we could use, for example, the DAX
mode and sort of treat Optane as a separate, slower, higher-latency tier, and start doing some intelligence around page movements, such as, you know, hot-page and cold-page replacements, which means that you could move pages between the hot and cold tiers and actually make this seamless,
which means that a lot of the workloads start getting sort of a uniform-like performance across the board,
and they don't see a whole lot of performance degradation.
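VMware has not published the internals of this tiering engine, so as a purely illustrative aside: the mechanism Arvind describes, classifying pages as hot or cold from observed accesses and migrating them between a fast DRAM tier and a slower tier, can be sketched roughly like this. The class names, window-based access counting, and threshold below are assumptions for the sketch, not ESXi code.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    page_id: int
    tier: str = "slow"      # "fast" = DRAM; "slow" = Optane/CXL-like tier (illustrative labels)
    accesses: int = 0       # accesses observed in the current sampling window

@dataclass
class TieringEngine:
    fast_capacity: int      # how many pages fit in the fast tier
    hot_threshold: int = 8  # accesses per window to count as "hot" (assumed value)
    pages: dict = field(default_factory=dict)

    def touch(self, page_id: int) -> None:
        # In a real hypervisor this signal would come from access-bit scans or sampling.
        self.pages.setdefault(page_id, Page(page_id)).accesses += 1

    def rebalance(self) -> None:
        # Promote the hottest pages into the fast tier and demote everything else.
        ranked = sorted(self.pages.values(), key=lambda p: p.accesses, reverse=True)
        for i, page in enumerate(ranked):
            is_hot = i < self.fast_capacity and page.accesses >= self.hot_threshold
            page.tier = "fast" if is_hot else "slow"
        for page in self.pages.values():
            page.accesses = 0  # reset the window so old history does not dominate

engine = TieringEngine(fast_capacity=2)
for _ in range(10):
    engine.touch(0)
    engine.touch(1)
engine.touch(2)
engine.rebalance()
print({p.page_id: p.tier for p in engine.pages.values()})  # pages 0 and 1 land in "fast"
```

The point of making the movement periodic and counter-based is exactly the "seamless" behavior described above: workloads keep a uniform view of memory while the engine quietly shuffles pages underneath them.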
So kind of like DRS for memory?
Yeah, it's interesting you say that, because one of the things we did do is sort of make DRS aware of these different
tiers.
Which means that as we start using one or more of these tiers, DRS is aware across
the cluster and we can make use of that component to make sure that the right host is chosen whenever, you know, applications get into performance kind of degradations.
So, yeah, it is kind of DRS, but this can be taken as something that is built into the hypervisor, which means that software tiering is completely part of our ESX kernel hypervisor operating system.
And it can also make sure that, you know, there are other benefits such as, you know,
we can make sure that across the board when we run multiple VMs,
we can place some guarantees around fairness of the VMs, which
means that we don't see necessarily a case where, you know, one VM can be starved or,
you know, a rogue VM occupies the hottest tier all the time.
And we can sort of control and ensure that, you know, all the workloads across the board
run consistently.
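The fairness guarantee is easy to picture as a per-VM quota on the hottest tier. A minimal sketch, assuming a simple proportional-share policy (the function and share values are hypothetical, not VMware's actual scheduler):

```python
def fast_tier_quota(fast_tier_pages: int, vm_shares: dict) -> dict:
    """Split the fast tier among VMs in proportion to their shares, so that one
    rogue VM with many hot pages still cannot monopolise the hottest tier."""
    total = sum(vm_shares.values())
    return {vm: (fast_tier_pages * share) // total for vm, share in vm_shares.items()}

# Equal shares: the rogue VM is capped at a third of the fast tier, like everyone else.
print(fast_tier_quota(1000, {"vm-a": 100, "vm-b": 100, "rogue-vm": 100}))
```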
Yeah. So on that note, I think that's one of the most important aspects here: basically everything
in vSphere is going to be fundamentally a shared resource.
And we're used to having memory be a really unified homogenous thing. Essentially,
any memory on the system is the same for any process running.
But that's not necessarily the case, right? Because, I mean, VMware vSphere is designed
to recognize that under NUMA, there could be some differences already in terms of memory access.
And of course, as you mentioned, with Optane persistent memory, there's quite a big difference
between different memory regions. I think we're assuming that CXL memory expansion will look somewhere in between NUMA
and Optane, in that it'll still be maybe not as high latency as Optane, maybe a little higher
performance, but not quite system memory. Is that right? Yeah, yeah, that's a good point, Stephen.
So the VMware philosophy is to make sure
that the underlying operating system
handles the different latency characteristics
of these different devices
and just presents a plain sort of uniform address space
across the board.
So basically what we do is, you know,
and this philosophy is followed
for several of our hardware enablement kind of,
you know, efforts.
So what we are doing is, you know,
we are taking all these different types of memory
and it's not necessary that we use, you know,
a strict software tiering mechanism for different memories.
I mean, there are still certain memory expanders that can actually fall within, like you said, between the DRAM and Optane buckets, which we could handle by doing smaller optimizations with NUMA. But for overall software tiering, we are targeting that for a memory-semantic SSD,
for example, that kind of use case where you might see
a whole lot of larger latencies.
And I think we also might get into a use case
where we see pooling as one of the use cases where
you know, hosts share memory.
So that could also become one of the things that we start handling where you know, multiple
hosts start sharing memory and accessing memory simultaneously from the same pool.
So that is another thing that we are looking into. There is yet another
thing that we want to look into, which is, you know, what we have seen across our
customer base is a lot of stranded memory. So what happens is, you know,
when our customers provision hosts, there is a lot of memory that's sitting idle on these hosts.
So what we want to be able to do is even look into schemes
where we could potentially access this memory from other hosts
and share memory from these other hosts.
So this is pooling of a different kind, if you see.
So we are also looking into such use cases.
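As a rough illustration of the stranded-memory point, the sketch below computes idle memory per host and naively plans to borrow it for a host that needs more. The host names, capacities, and the planning function are made up for illustration; this is not how ESXi or a CXL pooling device actually works.

```python
def stranded_gib(hosts: dict) -> dict:
    """Memory installed but not consumed on each host; hosts maps name -> (installed, used) in GiB."""
    return {name: installed - used for name, (installed, used) in hosts.items()}

def plan_borrowing(hosts: dict, demand_gib: int, requester: str):
    """Naively borrow stranded memory, largest donors first, until the request is met."""
    plan, remaining = [], demand_gib
    for name, free in sorted(stranded_gib(hosts).items(), key=lambda kv: kv[1], reverse=True):
        if name == requester or remaining <= 0:
            continue
        take = min(free, remaining)
        if take > 0:
            plan.append((name, take))
            remaining -= take
    return plan, remaining

hosts = {"esx-01": (1024, 900), "esx-02": (1024, 300), "esx-03": (512, 200)}
print(plan_borrowing(hosts, demand_gib=600, requester="esx-01"))
# ([('esx-02', 600)], 0): esx-02 has 724 GiB sitting idle and can cover the whole request
```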
Yeah, I think that there's a whole spectrum of things from the very simplest initial use case,
which, essentially, just for those who haven't kind of caught on to this, has already been introduced
by SK Hynix and Samsung and is already coming to market with the next generation server platforms. And
that is basically adding a memory expansion card essentially on the CXL slash PCIe bus that would
allow you to, for example, add a half a terabyte of RAM to a system that already has all the memory
channels filled up, just so that you have a little bit more RAM. It's going to be a little higher latency, but you can use it. And obviously that would be a very useful thing.
But what you're describing is sort of what comes next after that. So one of the next things was,
well, what about an external chassis with RAM that could be shared among different hosts?
And then the next thing was, what about pooling memory, where you have different hosts dynamically sharing memory?
And then the next thing was, well, what if you have hosts that can access each other's memory?
And beyond that was this idea of what if you can cooperatively access memory from host to host?
Now, that's a really interesting use case in a vSphere cluster.
I assume that's on your radar as some future enhancements. Absolutely. And I can talk
about that in some form of a disclaimer or a disclosure, I mean, under NDA, which is, you know,
we are looking at several different use cases. While, you know, we see massive proliferation
with respect to, you know, just the basic use cases with memory expansion and perhaps the lower TCO option with
memory semantic SSD. But the DRAM host sharing is something that we already do in some respect.
For example, if you look at vMotion, vMotion is sort of a subset of this use case where we
actually track pages across two hosts and make sure that, you
know, pages can be brought back from the other host if need be, right? And so there is actually
a lot of IP within VMware engineering where, you know, VMware has worked on some of this stuff,
these memory issues. There is another use case that we are targeting, which is, you know,
using an accelerator. So with an accelerator, it opens up yet another kind of interesting use case,
which is, you know, we will be able to sort of look into doing vMotion enhancements,
which means that, you know, for example, if you see,
like I mentioned, vMotion is really a subset of memory access or, you know, memory tracking.
So when we actually do vMotion and we actually actively track pages and start moving the pages
to the target host, you will see faster vMotions happening.
And accelerators are, you know, sort of,
this requires special IP within the accelerator,
which means that, you know,
these are some technologies that serve VMware customers really well
because, you know, vMotion is a huge pain point
and host evacuation is a pain point for a lot of the VMware customers.
So what we can do is, you know, we can leverage accelerators and we can actually see if, you know,
we can track this memory and choose good target hosts into which we could easily vMotion.
And so vMotion becomes less of a pain point for these larger-memory hosts, which is something a lot of our customers experience.
The other use cases that we can look into with the accelerator are, you know, really,
it opens up a whole new gamut of use cases. Things like, you know, encryption or dedupe,
you know, or, you know, tracking memory resiliency. Because, you know, this is all about memory.
We could look into the accelerator sort of, you know, acting on behalf of the CPU and
looking into and tracking, you know, different memory regions of applications, making sure
that it's resilient and protecting against, for example, even hardware failures. So we can look at this as a complete hardware virtualization,
memory hardware virtualization.
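The vMotion internals referenced here are not spelled out, but the page-tracking idea maps onto the classic iterative pre-copy pattern, which is the kind of work an accelerator could take off the CPU. A minimal model, assuming a fixed redirty ratio per round (an assumption for the sketch, not measured behavior):

```python
def precopy_migration(total_pages: int, redirty_ratio: float = 0.05,
                      switchover_threshold: int = 64, max_rounds: int = 30):
    """Textbook iterative pre-copy: send every page once, then keep re-sending whatever
    the guest dirtied while the previous round was in flight. When the copy rate beats
    the dirty rate, the set shrinks each round until it is small enough to pause the VM
    briefly and transfer the remainder (the switchover)."""
    dirty = total_pages   # round 1 has to send everything
    copied = 0
    for round_no in range(1, max_rounds + 1):
        copied += dirty
        dirty = int(dirty * redirty_ratio)  # pages re-dirtied during this round's copy
        if dirty <= switchover_threshold:
            break
    return {"rounds": round_no, "pages_copied": copied, "switchover_pages": dirty}

print(precopy_migration(total_pages=1_000_000))
# Converges in a handful of rounds because a 5% redirty ratio shrinks the set geometrically.
```

Faster page tracking, for example offloaded dirty-page detection, shortens each round, which is why Arvind frames accelerators as a way to make host evacuation less painful.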
VMware have always been very good at giving insight
into the underlying hardware,
and they've also helped very well
so we can track it and monitor it, and build operations around it.
And have you faced any challenges around diving into CXL?
Because there's no, I don't know, are there standards there?
How are you monitoring and diving into that then?
Yeah, that's a good point, Craig. So VMware has been heavily focused on sort of the operational aspects of, you know, any new hardware technology that we bring.
We believe that operations is one of the core pain points that a lot of our customers need to solve, which we need to look into.
In terms of making sure that, you know,
we get the smoothest possible operations for our customers, we want to make sure that we provide the appropriate
monitoring and measurement of,
you know, memory across the board. So,
for example, I mentioned DRS as, you know, one of the places where we look to actually do proper measurement and load balancing across the cluster. But monitoring is useful as well
because even with our current latest release, we introduced something called vMMR, which actually
makes sure that we can track the different tiers' memory usage, which means we can track
bandwidth and miss rates and current latencies for the different tiers.
Customers are able to make a better judgment of how they want to use these different
tiers.
And they could even build some sort of, you know, proactive alerting and proactive load
balancing capabilities with DRS using these measured kind of monitoring statistics.
And that, I believe, is going to help customers a lot.
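The metrics vMMR actually exposes are not enumerated here, so the following is only a hypothetical illustration of the pattern described: sample per-tier bandwidth, miss rate, and latency, then raise proactive alerts that something like DRS or an operator could act on. The tier labels and thresholds are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TierSample:
    tier: str              # e.g. "DRAM" or "CXL" (illustrative labels)
    bandwidth_gbps: float
    miss_rate: float       # fraction of accesses falling through to a slower tier
    latency_ns: float

THRESHOLDS = {"miss_rate": 0.20, "latency_ns": 400.0}  # hypothetical alert limits

def proactive_alerts(samples):
    """Flag tiers whose stats suggest workloads should be rebalanced before users notice."""
    alerts = []
    for s in samples:
        if s.miss_rate > THRESHOLDS["miss_rate"]:
            alerts.append(f"{s.tier}: miss rate {s.miss_rate:.0%} above target")
        if s.latency_ns > THRESHOLDS["latency_ns"]:
            alerts.append(f"{s.tier}: latency {s.latency_ns:.0f} ns above target")
    return alerts

print(proactive_alerts([
    TierSample("DRAM", bandwidth_gbps=180.0, miss_rate=0.02, latency_ns=90.0),
    TierSample("CXL",  bandwidth_gbps=45.0,  miss_rate=0.31, latency_ns=250.0),
]))  # -> ['CXL: miss rate 31% above target']
```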
Absolutely. And VMware will be the people to deliver that for sure, given the sheer scale
of VMware deployments worldwide. People have had that insight on everything else; they're
going to want it on CXL, if not more. If we look at adding memory, say, to a host,
are you working with any vendors
to let that happen dynamically,
where you could move memory from one host to another
based on current scenarios?
How dynamic do you think that will be?
Yeah, so currently we are not
really working with a specific vendor.
So VMware is really an ecosystem player.
We view all vendors the same.
There isn't a specific vendor we are working with on this specific architecture.
I mean, we are looking at it as an infrastructure piece, which we could enable in the sense that, you know, when memory has to be moved dynamically, we will make those changes that can enable this in our hypervisor, independent of any specific vendor.
So these types of innovations should be available as we progress in the CXL journey.
One of the things you mentioned there was, and I hadn't really thought about this, the fact that vMotion helped sort of set the table a little bit for some more advanced memory manipulation. Another thing that I know is in vSphere is the Bitfusion product, which would allow you to use external
GPUs dynamically within vSphere. And of course, that's very valuable in a lot of use cases.
But it strikes me that that also overlaps with the future CXL mission, which would be to share
devices like GPUs as well in a more dynamic way.
Does that also, is that same technology going to be able to kind of port forward into the CXL world?
Yeah, we definitely are, you know, excited about the possibilities with the GPUs and sharing memory for the GPUs, etc. And, you know, we already have
technologies, like you mentioned, Bitfusion already sort of enables this composability
and disaggregation of resources. So definitely, when CXL comes in, and then we also have, you know, GPUDirect RDMA,
which is sort of another version where we can offload
some of the resources from the CPU.
And we have the SmartNIC play where we can use SmartNIC to offload some of the processing
that CPUs normally would do.
So definitely, like you mentioned, Stephen, Bitfusion is an area we want to look into
in the future. Just in general, IO devices themselves could, you know, become more and more
sort of, you know, aligned with the CXL kind of roadmap. I mean, we already enable, you know,
a lot of the networking offloads, for example, I mean,
flow-based offloads, more intelligent offloads into the network, so on the NICs, for example. So we can
definitely look into CXL as, you know, sort of offering another level of, you know, doing
better offloads and better flows, and doing more of a proactive kind of offload.
And Bitfusion could definitely be part of that.
I guess another sort of future here is, of course, the big question,
which is composable infrastructure.
And I know that there have been many proposals
on how that might be enabled.
It is pretty exciting to think that we could deploy really a custom hardware platform to
match our custom software platform, our custom virtual machines.
I think some people might say, oh, well, composable infrastructure doesn't need vSphere.
But I think that's really not the case.
In fact, I think one of the coolest things about composability is the idea that you can put together a big server that has a configuration you could never achieve in sort of off-the-shelf
hardware. In other words, maybe I need, I don't know, 39 CPUs. And maybe I need, I don't know, 724 gigabytes of memory and maybe I need a half a GPU.
Go build me that guy. Well, with composability, theoretically, you could build me that guy.
And that would actually be really cool with vSphere because you could basically deploy
a piece of physical infrastructure that is impossible in the physical world.
That's pretty cool.
Is that how VMware sees it?
Or might we see actually a different kind of situation where maybe you guys are the
arbiters of composability?
Yeah, absolutely.
So we see these use cases, Stephen, already with hyperscalers, right?
So a lot of the cloud providers, or even on-prem customers,
are trying to provide sort of the cloud-like operating model, you know, trying to provide sort of a DevOps
kind of deployment. So we definitely see that, you know, there is a lot of value in sort of doing
such composability and disaggregation of resources. So this takes
sort of the hardware abstraction to a completely different level. When you think about bare metal
provisioning, you can actually think about creating a bare metal instance or a server
dynamically. And CXL actually enables this; memory used to be sort of the last frontier here, you know.
Of course, CPU is also one of the issues here that we need to tackle with composability,
but memory is one of the bigger sort of challenges.
And CXL definitely helps with that, where you could dynamically create this
hardware abstraction. And then you could dynamically create a server for your specific
customer or a tenant or a use case that such customers can use. So I guess it's probably too
early to guess what exactly this is going to look like, because of course none of this hardware even exists yet. But it is pretty cool to think about a future where
the hardware is as dynamic and configurable as vSphere, you know. Yeah, yeah. And I think,
in terms of the hardware architecture and, you know, the fabric and how things can get assembled,
it could be a combination of, you know, using specialized devices like accelerators and, you know, special switches and, you know, fabric managers.
On the software side, we already have, you know, DRS sort of, you know, in the same mode or method as a fabric manager. And we could definitely
think about enhancing the capabilities of DRS to actually act like a fabric manager across
multiple hosts in a cluster. But at the same time, we are seeing that it's easy to build
the infrastructure pieces like the switches
or the pooled, shared appliances.
But at the same time,
how do you operationalize it
and make it very simple for customers
to deploy, configure, manage, and use,
that still has to be solved.
And I think VMware can play a significant role
in making sure the operations problem is solved.
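To make Stephen's earlier "39 CPUs, 724 gigabytes, half a GPU" example concrete, here is a minimal sketch of what a fabric-manager-style composition request might look like. The pool sizes, resource names, and function are hypothetical, not any VMware or CXL Consortium API.

```python
def compose_server(pools: dict, request: dict) -> dict:
    """Carve an oddly shaped 'server' out of disaggregated rack pools, the way a CXL
    fabric manager plus something DRS-like might, reserving capacity as it goes."""
    if any(request.get(r, 0) > pools.get(r, 0) for r in request):
        raise ValueError("not enough free capacity in the rack for this composition")
    for resource, amount in request.items():
        pools[resource] -= amount   # reserve it out of the shared pools
    return dict(request)

rack = {"cpus": 256, "memory_gib": 4096, "gpus": 8.0}
custom = compose_server(rack, {"cpus": 39, "memory_gib": 724, "gpus": 0.5})
print(custom)  # a 39-CPU / 724 GiB / half-a-GPU instance you could never buy off the shelf
print(rack)    # the rack pools shrink by exactly what was composed
```

The operational problem Arvind points to is everything around this call: inventory, placement, failure handling, and making the composed instance look like an ordinary host to the rest of the stack.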
And have you been looking at any specific use cases
within the VMware ecosystem? A couple that spring to mind would be Tanzu,
VDI, Horizon. Do you see CXL making any impact there or adding any operational gains in terms
of efficiency? Yeah, absolutely, Craig. It's interesting you bring this up because when we were actually working on Project Capitola and
Intel Optane and the software tiering, VDI emerged as
one of the prime use cases because
customers want to scale up their memory and they want
to consolidate their servers. They don't want extra hosts
to be deployed necessarily. And,
you know, it also helps them with some of their green initiatives and power cooling and other
management. So definitely, you know, in terms of such manageability, it definitely helps when VDI-like workloads come into the picture.
It helps when you have, you know, different tiers and you have different types of memory being used.
And, you know, VMware providing this single access or a uniform access across these different tiers and scaling up that memory definitely helps for such workloads.
That's exciting because, you know,
technologies like this that force you to change how you architect solutions
are going to have huge impact, massive impact.
And if VMware are on board, letting companies make better use of their equipment and giving them the control that they're used to having and the insight to run it operationally well, I think it could be successful.
So I really think it's great that VMware is involved here.
And I think it's actually reassuring to the industry,
just to have you here say this. Now, we know that you haven't announced this support. I imagine
that'll come at a VMware Explore event pretty soon. But, you know, I really appreciate you
being willing to come on here and talk to the world and just share your enthusiasm for the
technology personally, as well as,
you know, from a company perspective. One of the things that occurred to me as soon as I saw the
first CXL announcements was that this is going to require especially VMware, Microsoft, Linux
support. We've seen some Linux support. We've seen some third-party drivers and software
supporting it. But the fact that VMware is there, that you are going to be
eagerly working on this stuff is really reassuring. You can't give us any clues,
I know, about when this stuff might happen. But I guess, is this a major thing?
Does this have to wait for some major revision of vSphere?
Or is this the kind of thing we might see come along sooner than that?
What we are currently doing is, we are actually starting our enablement journey with CXL with 1.1.
So we are already talking to a lot of vendors,
including the CPU vendors,
and testing CXL on some of their platforms. But really, what we see as, you know, a productized version is, you know,
more on the CXL 2.0 kind of timeframe, with Granite Rapids, because we feel that, you know, the ecosystem will evolve and, you know, it
will become more mature, and use cases will become more clear. Currently we do
have certain target use cases in mind that we are focusing on and that engineering
is working on. But, you know, as we mature and as these technologies become more and more sort of baked, I think CXL 2.0 seems to be the right kind of fit, or the environment where we fully start supporting CXL.
And just as a reminder to the audience as well, so you mentioned, you know, so the first
CPU platforms that really support CXL are coming or have been announced by AMD and are widely
expected to be announced by Intel very soon. The next generation, as you mentioned, is widely
expected to follow that fairly quickly. So,
you know, next year at this time, we'll probably see this, at least from a hardware perspective,
being rolled out. And as for software, of course, you need software to make
the hardware run. And I think it would be really exciting to see an announcement from somebody like VMware in coordination with those things.
So this sounds great.
And of course, beyond that, you know, we've got more and more coming.
So we heard again at the CXL forum from the head of the CXL Consortium a lot of excitement about PCI Express 5, PCI Express 6, CXL 3, and CXL beyond 3.
And there is a huge roadmap of support in the industry from a software and hardware perspective for basically everything we've been talking about.
So for the purposes of this podcast, this may sound a little pie in the sky, a little futuristic, but I wouldn't bet on that.
This is really happening. This is really coming out and we're going to start seeing people deploying this stuff now in the
first quarter of 2023. And I think that there's a good chance that we're going to see more support
from companies like VMware in the following year. So thank you for that update. Before we go,
where can people connect with you and continue this conversation, Arvind? You can all reach me on LinkedIn, and you can just search for me and
please feel free to reach out and I'll be able to provide more information on our CXL journey. And at the same time, we are always
doing more blogs and white papers on CXL. We already did a few on memory in general, but
going forward, please look out for more blogs about our CXL enablement and happy to share more when you connect with me.
I'll also point out that the presentation you mentioned from Flash Memory Summit is available
on YouTube. If people look that up, it's called Towards a CXL Future with VMware,
and they could probably find it. We'll put it in the show notes here too.
So is the OCP one. Yeah. Great. Craig, you and I just published a big white paper that we
contributed to for Intel. If you go to gestaltit.com, you'll see that in the sidebar.
What else is going on? Geez, too much. Lots of podcasts, lots of writing. I have a lot of writing to do this week, actually. But the Intel white paper was fantastic to see coming out. Obviously we touched on VMware a lot in that white paper, and we made some observations of where we think some things might be going,
and we called out the perfect configuration.
Yeah, it was great to work on. It was a good team.
Thank you very much, Craig, for that.
And, of course, also you did a presentation for Tech Field Day on CXL,
and we'll include that in the show notes too.
So thanks for joining us for the Utilizing CXL podcast,
part of the Utilizing Tech podcast series. If you enjoyed this discussion, please do subscribe in your favorite podcast application and give us a review. This podcast is brought to you by gestaltit.com, your home for IT coverage from across the enterprise. For show notes and more episodes, go to utilizingtech.com or find us on Twitter at UtilizingTech. Thanks for listening, and we'll see you next week.