Storage Developer Conference - #175: SNIA SDXI Roundtable
Episode Date: September 28, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 175.
Hello, my name is Shyamkumar Iyer, and welcome to the session SNIA SDXI Roundtable: Towards Standardizing a Memory-to-Memory Data Movement and Acceleration Interface.
I am the chair of the SNIA SDXI Technical Work Group.
I'm also a distinguished engineer at Dell. I am very pleased to also share this
platform with my esteemed panelists from the companies below. So starting with Philip Ng,
who's a senior fellow at AMD. Philip. Hi, everyone. Alexander Romana, who's a principal architect at ARM.
Hi, Shyam.
Jason Wohlgemuth, who's a partner software engineering lead at Microsoft.
Thanks, Shyam. It's a pleasure to be here.
Donald Dutile, who's a principal software engineer at Red Hat.
Hey, folks.
Richard Brunner, who's a principal engineer and CTO of server platform technologies at VMware.
Hello, everyone. Welcome to our session.
And finally, Paul Hartke, who's a principal engineer in the Xilinx CTO team.
Thanks, Shyam.
Thanks, everybody.
So I'm warming up to the discussion that I'm going to have with these gentlemen here.
But before I do that, I would like to talk to you a little bit about what SDXI is.
SDXI stands for Smart Data Accelerator Interface.
And some of us in the industry have been thinking about
a problem that we've been facing.
So currently, software-based memcpy is the current data movement standard. And while it has served us well because of a stable instruction set architecture, it takes away from application performance, and it also incurs software overhead to provide context isolation. While people have experimented with offload DMA engines in the past, their interfaces have been vendor-specific, which means they're not standardized for user-level software. So the Smart Data Accelerator Interface
is a proposed standard for a memory-to-memory data movement
and acceleration interface,
which is extensible, forward-compatible,
and independent of IO interconnect technology.
The SNIA SDXI Technical Work Group was formed in June 2020, and it was tasked to work on this proposed standard.
So far, we have 25 member companies and 60 plus individual members,
and we expect this to be a good standard that is applicable for a lot of people.
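To ground that problem statement, here is a minimal, hypothetical C sketch contrasting the software memcpy baseline with the kind of vendor-specific offload call SDXI aims to replace. The vendor request structure below is invented for illustration and is not any real driver's API.

```c
#include <string.h>
#include <stdint.h>

/* Baseline: the CPU itself performs the move and pays for every byte. */
static void sw_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len); /* consumes CPU cycles proportional to len */
}

/* Today's offload alternative is typically vendor-specific. This request
 * struct is purely hypothetical; the submission call that would accompany
 * it (e.g., an ioctl with a vendor-defined request code) differs per
 * vendor and OS, which is exactly the "not standardized for user-level
 * software" problem described above. */
struct vendor_dma_req {
    uint64_t src; /* source address, in whatever form the vendor wants */
    uint64_t dst; /* destination address */
    uint64_t len; /* bytes to move */
};

int main(void)
{
    char src[64] = "hello", dst[64];
    sw_copy(dst, src, sizeof src);
    return 0;
}
```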
So what is the grand vision of SDXI, and how does it want to solve the memory-to-memory data movement and acceleration problem?
Imagine you have an application, maybe two applications,
more than two applications.
You need an accelerator interface that can move data
from a DRAM context A to DRAM context B
without exercising the CPU cycles,
because that's the whole point of the accelerator.
But you also want to eliminate a lot of
the software context isolation layers that are there so that now you can enable applications
directly from user mode. And while we do that, we don't want to just solve this problem for
DRAM memory address spaces. We also want to solve this problem for memory classes like storage class memory, memory behind I/O devices, or memory pools attached over CXL links and fabrics. At the same time, we also want this accelerator
interface to be CPU family agnostic or architecture agnostic. And while CPU integrated implementations
are certainly envisioned here, we also think an accelerator interface should be something that can be implemented on discrete interfaces like GPUs, FPGAs, and smart IO devices like NICs and drives.
Once we build a standard specification, we can innovate around the spec and add incremental data acceleration features.
With this vision in mind,
the STXI technical work group
has had some design tenets in mind.
We want to enable data movement
between different address spaces,
which includes user address spaces
and also address spaces in different virtual machines.
We want to enable data movement
without mediation by privileged software, once, of course, the connection has been established by privileged software. So we want to allow abstraction and virtualization by privileged software, which brings me to the capability to quiesce, suspend, and resume the architectural state of a per-address-space data mover.
Why is it important?
Because now we can do live workload
or virtual machine migration between servers.
We also want the standard to be forwards
and backwards compatible
across future specifications and revisions.
By doing that,
now we can create interoperable software and hardware.
And because we're creating a standard framework,
we want to design it in such a way
that additional offloads can leverage the same interface.
Finally, we want the DMA to be a concurrent DMA model. It means multiple parallel DMA transactions should be possible between different address spaces and data movers. With that in mind, I'm very pleased to announce that the SDXI Technical Work Group has published version 0.9, revision 1 of the specification for public review.
It's available at this link and you can take a look at it to provide feedback.
If you are interested in knowing more about the specification, then we will also engage in a Birds of a Feather session tomorrow, Wednesday, September 29th, from 3 to 4 p.m. Pacific time. Please come and join us, and we will be happy to share some of the internals of the specification. Also, if you want to join the TWG, you can go to the SNIA SDXI page and find information on how to join and influence this new upcoming standard.
With that in mind, I am now fully ready to talk to my panel and ask them some of the interesting questions that I've always been wanting to ask.
So let me start with Phil. Phil, we've been partners here since the beginning of this journey.
And now we have the spec out for public review.
First of all, you know, thank you for bearing with me this long,
because I remember the time when we were going through the initial drafts of the specification,
and we had done a POC using FPGAs at Dell. And I remember feeling sort of nervous about asking a CPU vendor saying,
hey, can you work with us on a standard specification for data movement and offloads
that offloads the CPU and is CPU architecture independent
and works on a broad set of use cases?
I kind of hope you guys didn't think we were crazy.
Oh, not crazy at all, Shyam.
I think the concept you originally pitched, which has now become SDXI, a standard data mover leveraging modern virtualization technologies, was very interesting and compelling.
A standards-based approach is very important.
It allows for more certainty and stability for software developers at both the OS infrastructure and driver level, but also for
end-user applications, depending on this type of acceleration. We expect this will lead to greater
innovation within the overall SDXI ecosystem. So what are some usage models of interest that
SDXI can enable in your business? For me, the ability for system software or applications
to easily move pages of data between different tiers of memory
in a multi-tiered memory system is interesting.
Also, the intra-host data movement,
that is offload of data movement between virtual machines
running on the same host is also, I think, new and unique.
Offloading these types of transfers can potentially speed up not just the data transfer,
but can also free up CPU cycles for application usage.
I see. So do you see the SNIA SDXI Technical Work Group's efforts to standardize a memory-to-memory data movement and acceleration interface helping future industry trends and directions?
Yeah, I mean, standards provide a very solid foundation for innovation.
It provides a stable base upon which you can develop applications,
whether that's end-user applications or usages embedded inside system software.
The technical working group, I'm sure you know, we're very committed to ensuring backwards compatibility
with new specification revisions so that we maintain that stability. In terms of
memory-to-memory acceleration, one can see there's a broad industry trend towards new innovations in
the memory system. Multi-tiered memory with varying capacity and performance, memory expansion via new buses, memory disaggregation,
accelerators with memory, persistent memory. This is a big change from traditional systems where
maybe you just have DDR memory and the biggest challenge is really just dealing with NUMA
effects. So there's a lot of innovation happening in this space and a standards-based method for
moving data around between all of these different types of memory or memory pools is one of the
missing pieces of the puzzle and SDXI helps fill that role. Cool. Are there some specific features
of the specification that really excite you? I mean, the ability to naturally transfer data
between different address spaces,
which I touched on previously,
I think that's very interesting.
For our audience,
here's a short version of how it works
in the SDXI specification.
SDXI defines the ability to reference
different address spaces for each memory access.
So each address space can be mapped
using different IOMMU page tables.
On a memory copy, for example, you can use different address spaces for the descriptor,
the source buffer, and the destination buffer. So you could have, for example, a kernel mode driver
generating descriptors in kernel space that's going to perform a data copy from kernel space
to user space. Now, you're not limited to using a single page table
for everything as you might be in other device models.
And if you can reference existing page tables
like application page tables,
you don't need to spend CPU cycles
mapping your data buffers
into some common virtual space
or manually translating addresses into physical space.
The same capabilities can be used
to transfer data between virtual machines running on the same host. This can be used to accelerate
intra-host networking, which can be important for a variety of usage models, such as in some
hyper-converged infrastructure systems. The other feature I personally like is the expandability.
The current version of the specification is focused on memory
copy, but it provides a strong framework to include additional offloads in the future.
I think we'll see a lot of useful innovation in this area going forward.
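To make Phil's description concrete, here is a rough, hypothetical sketch of a copy descriptor that carries a separate address-space key for each memory reference, so the descriptor, source, and destination can each sit behind different IOMMU page tables. The field names and layout are illustrative only; the published v0.9 specification defines the real descriptor encoding.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of an SDXI-style copy descriptor: each memory
 * reference carries its own address-space key, so the source and
 * destination can live behind different IOMMU page tables. Not the
 * spec's actual layout. */
struct copy_descriptor {
    uint16_t opcode;   /* e.g., COPY                                */
    uint16_t src_akey; /* selects the address space of the source   */
    uint16_t dst_akey; /* selects the address space of the dest     */
    uint64_t src_addr; /* virtual address, translated via src_akey  */
    uint64_t dst_addr; /* virtual address, translated via dst_akey  */
    uint64_t length;   /* bytes to move                             */
};

int main(void)
{
    /* Example: a kernel-mode producer (address space key 0) directing
     * a copy from kernel space into a user process's space (key 7). */
    struct copy_descriptor d = {
        .opcode = 1, .src_akey = 0, .dst_akey = 7,
        .src_addr = 0x1000, .dst_addr = 0x2000, .length = 4096,
    };
    printf("copy %llu bytes: akey %u -> akey %u\n",
           (unsigned long long)d.length,
           (unsigned)d.src_akey, (unsigned)d.dst_akey);
    return 0;
}
```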
Well, you know, I'm really glad that we made that journey together. That intra-host data movement use case is really exciting. And, you know, Rich, you have also been on this journey with me and Phil
since the very beginning.
And, you know, we couldn't have asked for a better friend and mentor here
in guiding us in making this specification.
And as someone who represents VMware as a major virtualization
and enterprise software provider,
you've been a key collaborator in making the specification virtualization-friendly and enterprise software-friendly.
Can you elaborate more on how the specification breaks down these barriers?
Certainly.
At VMware, we were very concerned that the industry was trending towards every accelerator requiring a
specific software stack for every combination of CPU vendor, accelerator, operating system,
and hypervisor. As a leading innovator in enterprise software and virtualization,
we did not like this trend. And initially, when I was part of the effort, I was a little worried that maybe not everyone else saw this problem in the same way.
But the good news, at least, judging by the number and diversity of folks joining our work group, is that we're not alone in that concern.
We seemed to be missing a standard or specification for hypervisor and operating system management of offload hardware accelerators,
nor did we seem to have any agreement on the models that end-user applications use to submit work.
SDXI addresses these issues.
It provides an open and common hardware accelerator programming model
that spans the gamut of accelerators, hypervisors, and operating systems.
It also defines a standard user space work submission model that, again, is independent
of the actual accelerator. And by the way, this is a standard model that works across
all CPU architectures.
Interesting you say that, because that leads me to ask my next question to Alexander from Arm. Alexander, do you see SDXI implementations being possible on an Arm architecture? Why did Arm decide to join SDXI? You know, there must be some usage models of interest that SDXI can enable for Arm and/or your partners?
Hey, thank you, Shyam, for having me here and for those great questions.
So most of you know ARM, but for those who don't,
ARM is the leading technology provider of processor,
interconnect and GPU IP,
offering the widest range of cores
to address the performance, power and cost requirements of every device,
from IoT sensors to supercomputers and from smartphones and laptops to autonomous vehicles.
Interestingly, our IP is targeted towards both host and device side,
making it critical for ARM to support the SDXI ecosystem, not only from a host vendor perspective, but also from a device IP vendor perspective.
So why did ARM join SDXI?
Well, ARM joined SDXI to help standardize the accelerator's architecture such that future devices, which can possibly be coherent to the CPU,
are secure, easy to program, and provide effective capabilities for offload.
Above all, we want really to support our partners
by developing cutting-edge technology and driving standards
to make sure they reach their full potential in ARM-based systems.
So in my view,
SDXI standard is foundational for accelerator virtualization
in complex systems.
It enables advanced capabilities
like device assignments
and live migration.
So having a standard
is important for our partners
in order to get a better integration
of accelerators into SOCs
and avoid some device-specific code in privileged layers,
which is often a source of bugs and vulnerabilities.
On top of that, I believe accelerators connected to the host
via technologies like PCI Express or CXL are really ripe for standardization
and that SDXI can play a critical role
by making composable computing platforms pervasive
and easy to use.
So why now? Why is that trending now?
Yeah, so why do I believe now is the right time?
Well, heterogeneous computing and virtualization
will dominate in the future.
So it seems natural to standardize a multi-tenancy accelerator interface to enable accelerator usage and sharing, especially, but not only, for cloud environments.
I think standards like SDXI or CXL are fast becoming the innovation vector that will enable true heterogeneous computing.
This disaggregation of resources to extract the most value and reduce the TCO is critical for innovation moving forward.
I think SDXI's initial approach to standardize the accelerator's programming model is really a positive step forward. Now, when I look at the specification,
I think also that there are many features in there
which make it very attractive.
I particularly like, for example, the fact that the specification made sure the page size could be something that an Arm architecture could support. The specification also works on user-level address spaces, cleverly utilizing the PASID feature of IOMMUs. Arm calls them SMMUs, and the fact that PASID-based address translation is already supported by Arm SMMU architectures makes it really icing on the cake.
So, will we be able to put an SDXI device on an ARM
platform and get all those cool features with high performance?
My answer is yes.
Cool, Alexander.
As you can observe, I'm alternating between hardware and software providers here.
So let me now turn to a major software provider.
Let's hear from Don at Red Hat.
Don, you have contributed to a lot of things in this space, like fast zeroing of memory and class-level device recognition for SDXI.
How do you see these features benefiting customer applications for you?
Thanks, Shyam, for having me here.
Standardized features for cloud applications are a key focus for Red Hat products, and SDXI enables broad accelerator utilization in cloud applications without unique device-specific operators. Operations like fast zeroing of memory for secure virtual machine startup and shutdown are a key focus for improved cloud support. Other areas for user space applications, like accelerated snapshotting, could
prove to be a significant speed up in high performance computing environments
and database applications.
DAX and the block translation table, both examples of user space persistent memory storage, could receive significant acceleration, not only because there's a DMA engine to copy data into and out of persistent memory, but also because these DMA operations can use an open standard method from user space, without another syscall into the kernel. Providing an open standard enables these open source user space applications
to make simple updates to take advantage of these accelerations.
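As an illustration of that syscall-free path, here is a minimal, hedged sketch of user-mode work submission: privileged software is assumed to have mapped a descriptor ring and a doorbell into the process once at setup time, after which each submission is just memory writes. All names here are hypothetical rather than taken from the specification.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-mode submission ring: after one-time privileged
 * setup, submitting work is plain stores plus a doorbell write, with
 * no kernel transition on the data path. */
struct desc { uint64_t src, dst, len; };

struct ring {
    struct desc       *slots;     /* shared with the device        */
    uint32_t           mask;      /* size - 1, power of two        */
    _Atomic uint32_t   write_idx; /* producer index                */
    volatile uint32_t *doorbell;  /* device MMIO in a real system  */
};

static void submit_copy(struct ring *r, uint64_t src, uint64_t dst,
                        uint64_t len)
{
    uint32_t idx = atomic_load_explicit(&r->write_idx,
                                        memory_order_relaxed);
    r->slots[idx & r->mask] = (struct desc){ src, dst, len };
    /* Publish the descriptor before advancing the index. */
    atomic_store_explicit(&r->write_idx, idx + 1, memory_order_release);
    *r->doorbell = idx + 1; /* tell the device there is new work */
}

int main(void)
{
    /* Stand-ins for what privileged software would map at setup time. */
    static uint32_t fake_doorbell;
    struct ring r = {
        .slots    = calloc(16, sizeof(struct desc)),
        .mask     = 15,
        .doorbell = &fake_doorbell,
    };
    atomic_init(&r.write_idx, 0);
    submit_copy(&r, 0x1000, 0x2000, 4096);
    free(r.slots);
    return 0;
}
```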
That's very cool. You know, let me now shift gears and ask someone who's sort of been there and done that in the accelerator space. Paul, you've really seen the evolution of FPGAs, and programmable FPGAs have now become very mainstream. What do you see as the benefit here for you all at Xilinx?
Thanks, Shyam. Yes, standards are built on a great ecosystem
of partners and component vendors. We at Xilinx are very interested in building communities
around those open standards.
The SDXI specification is not tied to any particular architecture and is agnostic to both CPU-integrated and discrete implementations.
This is great news for accelerators like FPGAs.
We see heterogeneous compute usage models where an SDXI data mover could transfer data to and from an accelerator's memory space, and additional data transforms would be performed on the data by leveraging that same interface. So while SDXI was first focused on memory-to-memory transfers,
we greatly support how the standard has been defined so additional use cases can leverage
that same infrastructure as well. Facilitating commonality in application software interfaces will accelerate the heterogeneous computing adoption that many
people have talked about today. That's very interesting. Now it's my turn to talk to a
software-focused person. Let me talk to Jason from Microsoft here. Jason, Microsoft is also a contributor to
the specification, and you've made a valuable contribution to how error reporting needs to look.
And I know you guys don't take this lightly, but will you tell our viewers why you think SDXI is a good idea?
Certainly. Thanks, Shyam. As others have already commented here, we believe that
heterogeneous computing is going to be especially important as we move together as an ecosystem of
computing solution providers. So data movers are one class of accelerators currently covered by
SDXI. And I feel that they represent what I like to refer to as a next-generation device that
really has the potential to impact a broad number of market segments. So that could range from
servers and data centers all the way down to embedded and IoT devices. So we think it's going
to be really important to have robust system architecture in place to support accelerators
as they're integrated into these computing platforms.
We joined SDXI because we also believe in open standards.
As accelerators proliferate, and as Rich mentioned previously,
we absolutely would like to see common protocols that
can be used to efficiently offload work
from general purpose CPUs to
dedicated hardware and potentially vice versa.
This model allows new hardware implementations to be offered to our customers more quickly
as standard drivers and software abstraction layers can be reused.
These common standards also help define a baseline level of support
with optional extensions for additional features. That permits silicon vendors to differentiate
based on the quality of their implementation and the richness of the features supported in their
offerings. That's interesting. You guys really want open standards to also foster competition among hardware vendors on features, performance and implementation using common software implementations.
Let me ask Don about this. Is there more to open standards for you all at Red Hat? How does it align with open source?
Well, it's pretty easy to see that open
standards are very important for open source.
SDXI is taking an acceleration feature and making it a standard, so its adoption will be simpler and faster, and it will be deemed a worthwhile investment that provides a broad payback. Not a corner case, not a one-off feature for one device by one system or one vendor. This enables much broader use, a much broader ecosystem that improves the ROI.
That's very cool.
You know, I want to pick another point from Jason and actually pose that question to Rich
because I'm having fun here.
You know, competition from hardware vendors
on implementation sounds like a fine idea.
But Rich, you know, isn't the intent also to encourage competition among software solution vendors on better SDXI-based implementations?
Well, what do you think?
Absolutely.
A more tightly integrated SDXI software solution will be far more effective in targeting various
segments of the market, and fostering that kind of environment will be critical to the SDXI
technical work group. But going back to hardware competition, the SDXI system architecture allows
a hardware vendor to choose the right level of integration, and it is interconnect independent.
That means implementations can span the gamut from dedicated integrated CPU devices, external PCI devices, to remote devices over a compute express link.
Nothing in the system architecture requires PCI Express-only devices.
Now I'm going to take the SDXI data mover question to Jason. Jason, what do you make of it when we say that SDXI implementations are interconnect-independent and that non-PCI implementations are possible?
It's a good question.
So as I mentioned earlier, you know, we see data movers as the first
in a wave of next-generation devices.
They're a really good litmus test, as you tend to offload anything from potentially really small, latency-sensitive operations to really large background tasks. And, you know, you want to improve throughput and/or efficiency in both cases. But as other accelerators, you know, come
online, we want to ensure they're also supported by standard system architecture,
including the hardware and software protocols used to interact with them.
So as, you know, Alexander and Rich both mentioned CXL a bit earlier,
and to answer your question more directly,
we're also looking at CXL as an attach point for accelerators.
Unlike PCI Express, you know, and as most of you know, CXL allows for device side caches, which may be beneficial to certain classes of accelerators, especially as we look forward.
We want to collaborate in industry forums like SDXI to ensure that the system architecture
and protocols used to support accelerators allow safe and secure attachment to emerging standard interconnects
like CXL that bring these valuable new features.
In final closing comments, I would love if each of you would share in a few words where
and with what aspects you think the STXI specification should develop further.
Jason, maybe let me start with you first.
Sure. Yeah. I mean, you mentioned earlier that we submitted some changes through the TWG to deal with error reporting. We want to push on that a little bit further and talk more about RAS in general, especially as these accelerators are used by different tenants. We want to focus in on containing the blast radius of any failures.
We'd also like to talk about quality of service.
So in cloud service providers such as Azure, we deal with multiple tenants.
We can consider them hostile to each other.
We really want to drill in and look at, you know, potential noisy neighbor problems and, you know, eliminate them.
We also want to talk about, you know, latency improvements, you know, as we decide where accelerators are integrated into the SOC.
We want to reduce latency wherever possible for requests as well as, you know, determining when the offload is complete.
Paul?
Yeah, thank you, Shyam.
Right.
I think from our perspective, I talked about how SDXI facilitates heterogeneous environments and how the protocol and spec support it. But there are a lot of details to flesh out as we think about heading towards implementations: figuring out where it works best, what systems are ideal, and further adding in those details. Building proofs of concept and development systems is a key next step.
Alexander?
Yes, like Paul, in the future I'd love to see how SDXI can address even more use cases in next-generation heterogeneous systems, especially including standardization for new types of accelerators.
Don?
I think I'm looking forward to seeing it in the CXL space. I'm going to guess that it'll be very important in the CXL 2.0 timeframe, which is more geared for virtualization environments. And I'm looking forward to some of the additional offloads that we're all toying with in the back of our minds, that we keep thinking we want to introduce.
So I'm looking forward to that too.
Rich?
Well, I've got a few, but I'll just mention three of them.
One is I want to wring out the virtualization model.
I think it's complete, but I want to make certain
that we did not leave any stone unturned.
The second is I really want to drive the cross-function collaboration across different processes and virtual machines. I think that's also going to
be pretty important. And also privately, I would love to be able to shorten our acronym from SDXI
to something that comes off my tongue a little easier,
but stay tuned on that. The stone has been cast. Phil?
Yeah, for me, new data transforms and additional offloads, as well as fleshing out the software infrastructure required to enable these VM-to-VM data transfers.
Thanks, everybody.
You know, I must say this has been a great conversation.
You know, I've been talking to you all for a long time,
but getting insights on how you would like to see a standard like SDXI be used is super critical.
And viewers, you know, if this interests you to learn more about this,
then like I mentioned in the beginning of the panel,
the SDXI TWG has posted a version of the specification for public review. Please send us your feedback there. And in case you want to join the TWG to influence the future of the specification, we would encourage you to do so. As you can see, we are a fun group, and we would love to hear back from you.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.