Storage Developer Conference - #175: SNIA SDXI Roundtable

Episode Date: September 28, 2022

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 175. Hello, my name is Shyamkumar Iyer and welcome to the session SNIA SDXI Roundtable: Towards Standardizing a Memory-to-Memory Data Movement and Acceleration Interface. I am the chair of the SNIA SDXI Technical Work Group. I'm also a distinguished engineer at Dell. I am very pleased to also share this
Starting point is 00:01:06 platform with my esteemed panelists from the companies below. So starting with Philip Ng, who's a senior fellow at AMD. Philip. Hi, everyone. Alexander Romana, who's a principal architect at ARM. Hi, Shyam. Jason Wohlgemuth, who's a partner software engineering lead at Microsoft. Thanks, Shyam. It's a pleasure to be here. Donald Dutile, who's a principal software engineer at Red Hat. Hey, folks. Rich Brunner, who's a principal engineer and CTO of server platform technologies at VMware.
Starting point is 00:01:55 Hello, everyone. Welcome to our session. And finally, Paul Hartke, who's a principal engineer in the Xilinx CTO team. Thanks, Sean. Thanks, Sean. Thanks, everybody. So I'm warming up to the discussion that I'm going to have with these gentlemen here. But before I do that, I would like to talk to you a little bit about what is STXI. STXI is Smart Data Accelerator Interface. And some of us in the industry have been thinking about
Starting point is 00:02:29 a problem that we've been facing. So currently, software-based mem copy is the current data movement standard. And while it has served us well because of a stable instruction set architecture, it takes away from application performance and it also incurs software overhead to provide context isolation. While people have experimented with offload DMA engines in the past, their interfaces have been vendor specific,
Starting point is 00:02:59 which means they're not standardized for userlevel software. So the smart data accelerator interface is a proposed standard for a memory-to-memory data movement and acceleration interface, which is extensible, forward-compatible, and independent of IO interconnect technology. SNEA STXI technical workgroup was formed in June 2020, and it was tasked to work on this proposed standard. So far, we have 25 member companies and 60 plus individual members,
Starting point is 00:03:30 and we expect this to be a good standard that is applicable for a lot of people. So what is the grand vision of SDXI and how it wants to solve the memory-to-memory data improvement and acceleration problem. Imagine you have an application, maybe two applications, more than two applications. You need an accelerator interface that can move data from a DRAM context A to DRAM context B without exercising the CPU cycles,
Starting point is 00:04:01 because that's the whole point of the accelerator. But you also want to eliminate a lot of the software context isolation layers that are there so that now you can enable applications directly from user mode. And while we do that, we don't want to just solve this problem for DRAM memory address spaces. We also want to solve this problem for memory classes like storage class memory, memory behind I.O. devices, or memory pools attached to CXL links and patterns. At the same time, we also want this accelerator interface to be CPU family agnostic or architecture agnostic. And while CPU integrated implementations are certainly envisioned here, we also think an accelerator interface should be something that can be implemented on discrete interfaces like GPUs, FPGAs, and smart IO devices like NICs and drives.
Starting point is 00:04:54 When we built a standard specification, now we can innovate around the spec and add incremental data acceleration features. With this vision in mind, the STXI technical work group has had some design tenets in mind. We want to enable data movement between different address spaces, which includes user address spaces and also address spaces in different version machines.
Starting point is 00:05:20 We want to enable data movement without mediation by privileged software. Of course, once the connection without mediation by privileged software. Of course, once the connection has been established by privileged software. So we want to allow abstraction and virtualization by privileged software, which brings me to the capability to quest, suspend, and resume the architectural state of a per-attribute space data model. Why is it important? Because now we can do live workload
Starting point is 00:05:47 or virtual machine migration between servers. We also want the standard to be forwards and backwards compatible across future specifications and revisions. By doing that, now we can create interoperable software and hardware. And because we're creating a standard framework, we want to design it in such a way
Starting point is 00:06:04 that additional offloads can leverage the same interface. Finally, we want the DMA to be a concurrent DMA model. It means multiple parallel DMA transactions should be possible between different address spaces and data modes. With that in mind, I'm very pleased to announce to you that the STXI Technical Workgroup has published a version 0.9, revision 1 for public review. It's available at this link and you can take a look at it to provide feedback. If you are interested to know more about the specification, then we will also engage in a Birds of a Feather session tomorrow, Wednesday, September 29th from 3 to 4 p.m. Pacific time. Please come and join us and we will be happy to share some of the internals of the specification.
Starting point is 00:06:55 Also, if you wanted to join the TWIG, you can go to the CISDXI page and find the information on how to join the three and influence this new upcoming standard. With that in mind, I am now fully ready to talk to my panel and ask them some of the interesting questions that I've always been wanting to ask. So let me start with Phil. Phil, we've been partners here since the beginning of this journey. And now we have the spec out for public review. First of all, you know, thank you for bearing with me this long, because I remember the time when we were going through the initial drafts of the specification, and we had done a POC using FPGAs at Dell. And I remember feeling sort of nervous about asking a CPU vendor saying, hey, can you work with us on a standard specification for data movement and offloads
Starting point is 00:07:49 that offloads the CPU and is CPU architecture independent and works on a broad set of use cases? I kind of hope you guys didn't think we were crazy. Oh, not crazy at all, Shyam. I think the concept you originally pitched, what has now become SDXI for a standard data mover leveraging modern virtualization technologies, was very interesting and compelling. A standards-based approach is very important. It allows for more certainty and stability for software developers at both the OS infrastructure and driver level, but also for end-user applications, depending on this type of acceleration. We expect this will lead to greater
Starting point is 00:08:31 innovation within the overall SDXI ecosystem. So what are some usage models of interest that SDXI can enable in your business? For me, the ability for system software or applications to easily move pages of data between different tiers of memory in a multi-tiered memory system is interesting. Also, the intra-host data movement, that is offload of data movement between virtual machines running on the same host is also, I think, new and unique. Offloading these types of transfers can potentially speed up not just the data transfer,
Starting point is 00:09:09 but can also free up CPU cycles for application usage. I see. So do you see SNEA STXI technical workgroups efforts to standardize a memory to memory data movement and acceleration interface help future industry trends and directions? Yeah, I mean, standards provide a very solid foundation for innovation. It provides a stable base upon which you can develop applications, whether that's end-user applications or usages embedded inside system software. The technical working group, I'm sure you know, we're very committed to ensuring backwards compatibility with new specification revisions so that we maintain that stability. In terms of memory-to-memory acceleration, one can see there's a broad industry trend towards new innovations in
Starting point is 00:09:57 the memory system. Multi-tiered memory with varying capacity and performance, memory expansion via new buses, memory disaggregation, accelerators with memory, persistent memory. This is a big change from traditional systems where maybe you just have DDR memory and the biggest challenge is really just dealing with NUMA effects. So there's a lot of innovation happening in this space and a standards-based method for moving data around between all of these different types of memory or memory pools is one of the missing pieces of the puzzle and SDXI helps fill that role. Cool. Are there some specific features of the specification that really excite you? I mean, the ability to naturally transfer data between different address spaces,
Starting point is 00:10:47 which I touched on previously, I think that's very interesting. For our audience, here's a short version of how it works in the SDXI specification. SDXI defines the ability to reference different address spaces for each memory access. So each address space can be mapped
Starting point is 00:11:02 using different IMU page tables. On a memory copy, for example, you can use different address spaces for the descriptor, the source buffer, and the destination buffer. So you could have, for example, a kernel mode driver generating descriptors in kernel space that's going to perform a data copy from kernel space to user space. Now, you're not limited to using a single page table for everything as you might be in other device models. And if you can reference existing page tables like application page tables,
Starting point is 00:11:32 you don't need to spend CPU cycles mapping your data buffers into some common virtual space or manually translating addresses into physical space. The same capabilities can be used to transfer data between virtual machines running on the same host. This can be used to accelerate intra-host networking, which can be important for a variety of usage models, such as in some hyper-converged infrastructure systems. The other feature I personally like is the expandability.
Starting point is 00:12:02 The current version of the specification is focused on memory copy, but it provides a strong framework to include additional offloads in the future. I think we'll see a lot of useful innovation in this area going forward. Well, you know, I'm really glad that we made that journey together. That intra-house data movement use case is really exciting. And, you know, Rich, you have also been in this journey with me and Phil since the very beginning. And, you know, we couldn't have asked for a better friend and mentor here in guiding us in making this specification.
Starting point is 00:12:39 And as someone who represents VMware as a major virtualization and enterprise software provider, you've been a key collaborator in making the specification virtualization-friendly and enterprise software-friendly. Can you elaborate more on how the specification is breaking down? Certainly. At VMware, we were very concerned that the industry was trending towards every accelerator requiring a specific software stack for every combination of CPU vendor, accelerator, operating system, and hypervisor. As a leading innovator in enterprise software and virtualization,
Starting point is 00:13:17 we did not like this trend. And initially, when I was part of the effort, I was a little worried that maybe not everyone else saw this problem in the same way. But the good news, at least, judging by the number and diversity of folks joining our work group, is that we're not alone in that concern. We seem to be missing a standard or specification for hypervisor and operating system management of offload hardware accelerators, nor did we seem to have any agreement on the models that end-user applications use to submit work. SDXI addresses these issues. It provides an open and common hardware accelerator programming model that spans the gamut of accelerators, hypervisors, and operating systems. It also defines a standard user space work submission model that, again, is independent
Starting point is 00:14:12 of the actual accelerator. And by the way, this is a standard model that works across all CPU architectures. Interesting you say that because, you know, that leads me to ask my next question to Alexander from Arm. Alexander, do you see STXi implementations possible on an Arm architecture? Why did Arm decide to join the STXi? You know, there must be some usage models of interest that STXiI can enable for ARM and or your partners? Hey, thank you, Shem, for having me here and for those great questions. So most of you know ARM, but for those who don't,
Starting point is 00:14:54 ARM is the leading technology provider of processor, interconnect and GPU IP, offering the widest range of cores to address the performance, power and cost requirements of every device, from IoT sensors to supercomputers and from smartphones and laptops to autonomous vehicles. Interestingly, our IP is targeted towards both host and device side, making it critical for ARM to support the SDXI ecosystem, not only from a host vendor perspective, but also from a device IP vendor perspective. So why did ARM join SDXI?
Starting point is 00:15:35 Well, ARM joined SDXI to help standardize the accelerator's architecture such that future devices, which can possibly be coherent to the CPU, are secure, easy to program, and provide effective capabilities for offload. Above all, we want really to support our partners by developing cutting-edge technology and driving standards to make sure they reach their full potential in ARM-based systems. So in my view, SDXI standard is foundational for accelerator virtualization in complex systems.
Starting point is 00:16:10 It enables advanced capabilities like device assignments and live migration. So having a standard is important for our partners in order to get a better integration of accelerators into SOCs and avoid some device-specific code in privileged layers,
Starting point is 00:16:28 which is often a source of bugs and vulnerabilities. On top of that, I believe accelerators connected to the host via technologies like PCI Express or CXL are really ripe for standardization and that SDI can play a critical role by making composable computing platforms pervasive and easy to use. So why now? Why is that training now?
Starting point is 00:16:57 Yeah, so why do I believe now is the right time? Well, heterogeneous computing and virtualization will dominate in the future. So it seems natural to standardize a multi-tenancy accelerator interface to enable accelerator usage and sharing, especially, but not only, for cloud environments. I think standards like SDXI or CXL are fast becoming the innovation vector that will enable true heterogeneous computing. This disaggregation of resources to extract the most value and reduce the TCO is critical for innovation moving forward. I think SDI's initial approach to standardize the accelerator's programming model is really a positive step forward. Now, when I look at the specification, I think also that there are many features in there
Starting point is 00:17:50 which make it very attractive. I particularly like, for example, the fact that the specification made sure the page width could be something that an ARM architecture could support. The specification also works on user-level address spaces, cleverly utilizing the PAC feature on IOMMUs. ARM calls them SMMUs, and the
Starting point is 00:18:12 fact that PAC-based address translation is already supported by ARM SMMU architectures makes it really icing on the cake. So, will we be able to put an SDXI device on an ARM platform and get all those cool features with high performance? My answer is yes. Cool, Alexander. As you can observe, I'm doing this hardware software provider combination. So let me now turn to a major software provider.
Starting point is 00:18:40 Let's hear from Don at Red Hat. Don, you have contributed to a lot of things in this space. Fast zeroing of memory and class-level device recognition of SDXI. How do you see these features benefiting customer applications for you? Thanks, Jim, for having me here. Standardized features for cloud applications are are key focus for Red Hat products. And SDI enables broad accelerated utilization in cloud applications without unique device-specific operators. Options like fast error on your memory for secure virtual machines start up and shut down as a key focus for improved cloud support. Other areas for user space applications like accelerated snapshotting could
Starting point is 00:19:30 prove to be a significant speed up in high performance computing environments and database applications. DAX and block translation table, both examples of user space persistent memory storage could receive significant acceleration, not only because there's a DMA engine to copy data into and out of persistent memory, but also because these DMA operations can be an open standard method from user space without another syscall into the kernel. Providing an open standard enables these open source user space applications to make simple updates to take advantage of these accelerations. That's very cool. You know, let me now shift gears and ask someone who's sort of been there and done that in the accelerator space. Paul, you've kind of really seen the evolution of FPGAs and programmable FPGAs have now become very mainstream. What do you see as
Starting point is 00:20:34 the benefit here for you all at Xilinx? Thanks, Sean. Yes, standards are built on a great ecosystem of partners and component vendors. We at Xilinx are very interested in building communities around those open standards. The STXI specification is not tied to any particular architecture and is agnostic to both CPU integrated and discrete implementations. This is great news for accelerators like FPGAs.
Starting point is 00:20:59 We see heterogeneous compute usage models where an STXI data mover could transfer data to, from, and accelerate a memory space, and additional data transforms would be performed on the data by leveraging that same interface. So while SDXI was first focused on memory-to-memory transfers, we greatly support how the standard has been defined so additional use cases can leverage that same infrastructure as well. Facilitating commonality and application software interfaces will accelerate the heterogeneous computing adoption that many people have talked about today. That's very interesting. Now it's my turn to talk to a
Starting point is 00:21:39 software-focused person. Let me talk to Jason from Microsoft here. Jason, Microsoft is also a contributor to the specification, and you've made a valuable contribution to how error reporting needs to look. And I know you guys don't kind of take this lightly, but will you tell our viewers why you think STXI is a good idea? Certainly. Thanks, Shyam. As others have already commented here, we believe that heterogeneous computing is going to be especially important as we move together as an ecosystem of computing solution providers. So data movers are one class of accelerators currently covered by STXI. And I feel that they represent what I like to refer to as a next generation device that really has the potential to impact a broad number of market segments. So that could range from
Starting point is 00:22:34 servers and data centers all the way down to embedded and IoT devices. So we think it's going to be really important to have robust system architecture in place to support accelerators as they're integrated into these computing platforms. We joined SDXI because we also believe in open standards. As accelerators proliferate, and as Rich mentioned previously, we absolutely would like to see common protocols that can be used to efficiently offload work from general purpose CPUs to
Starting point is 00:23:06 dedicated hardware and potentially vice versa. This model allows new hardware implementations to be offered to our customers more quickly as standard drivers and software abstraction layers can be reused. These common standards also help depict a baseline level of support with optional extensions for additional features. That permits silicon vendors to differentiate based on the quality of their implementation and the richness of the features supported in their offerings. That's interesting. You guys really want open standards to also foster competition among hardware vendors on features, performance and implementation using common software implementations. Let me ask Don about this. Is there more to open standards for you all at Red Hat? How does it align with open source?
Starting point is 00:24:08 Well, it's pretty easy to see that open standards are very important for open source. FCSI has taken an acceleration feature and making it a standard so its adoption will be simpler, faster, and it will be deemed a worthwhile investment and provides a broad payback. Not a corner case, not a one-plus feature for one device by one system or one vendor. This enables a broader use of NASA, a much broader ecosystem that improves the ROI. That's very cool.
Starting point is 00:24:42 You know, I want to pick another point from Jason and actually pose that question to Rich because I'm having fun here. You know, competition from hardware vendors on implementation sounds like a fine idea. But Rich, you know, isn't the intent to encourage competition among software solution vendors on a better SDXI-based implementation also the idea?
Starting point is 00:25:10 Well, what do you think? Absolutely. A more tightly integrated SDXI software solution will be far more effective in targeting various segments of the market, and fostering that kind of environment will be critical to the SDXI technical work group. But going back to hardware competition, the SDXI system architecture allows a hardware vendor to choose the right level of integration, and it is interconnect independent. That means implementations can span the gamut from dedicated integrated CPU devices, external PCI devices, to remote devices over a compute express link. Nothing in the system architecture requires PCI express only devices.
Starting point is 00:25:58 Now I'm going to take the STXI data mover to Jason. Jason, what do you make of when we say the STXI implementations are like interconnect, independent, and non-PCI implementations are possible? It's a good question. So as I mentioned earlier, you know, we see data movers as the first in a wave of next-generation devices. There are really good litmus tests as you, you know, tend to offload potentially really small latency sensitive operations to really large background tasks. And, you know, you want to
Starting point is 00:26:32 improve throughput and or efficiency in both cases. But as other accelerators, you know, come online, we want to ensure they're also supported by standard system architecture, including the hardware and software protocols used to interact with them. So as, you know, Alexander and Rich both mentioned CXL a bit earlier, and to answer your question more directly, we're also looking at CXL as an attach point for accelerators. Unlike PCI Express, you know, and as most of you know, CXL allows for device side caches, which may be beneficial to certain classes of accelerators, especially as we look forward. We want to collaborate in industry forums like SDXI to ensure that the system architecture
Starting point is 00:27:19 and protocols used to support accelerators allow safe and secure attachment to emerging standard interconnects like CXL that bring these valuable new features. In final closing comments, I would love if each of you would share in a few words where and with what aspects you think the STXI specification should develop further. Jason, maybe let me start with you first. Sure. Yeah. I mean, you mentioned earlier that we, you know, we submitted some, you know, changes through the TWIG to deal with error reporting. We want to, you know, push on that a little bit further and talk more about RAS in general, especially as these, you know,
Starting point is 00:28:01 accelerators are used by different tenants. We, you focus in on containing a blast radius of any failures. We'd also like to talk about quality of service. So in cloud service providers such as Azure, we deal with multiple tenants. We can consider them hostile to each other. We really want to drill in and look at, you know, potential noisy neighbor problems and, you know, eliminate them. We also want to talk about, you know, latency improvements, you know, as we decide where accelerators are integrated into the SOC. We want to reduce latency wherever possible for requests as well as, you know, determining when the offload is complete. Paul?
Starting point is 00:28:47 Yeah, thank you, Sean. Right. I think from our perspective, I talked about how SDXI facilitates heterogeneous environments and protocol and specs supports it. But there's a lot of details to flesh out as we think about heading towards implementations, figuring out where it works best, what systems are ideal, and further adding in those details and building proof of concepts and development systems is a key next step.
Starting point is 00:29:21 Alexander? Yes, I would like Paul, so in the future, I'd love to see how the XI can address even more use cases in the next generation heterogeneous systems, including especially sterilization for new types of accelerators. Don?
Starting point is 00:29:40 I think I'm looking forward to seeing it in the CXL space. I'm going to guess that it'll be very important in the 2.0 timeframe, which is more geared for virtualization environments. And I'm looking forward to some of the additional uploads that we're all toying in the back of our minds that we keep thinking we want to introduce. So I'm looking forward to that too.
Starting point is 00:30:05 Rich? Well, I've got a few, but I'll just mention three of them. One is I want to wring out the virtualization model. I think it's complete, but I want to make certain that we did not leave any stone unturned. The second is I really want to drive the cross-function collaboration across different processors and virtual machines. I think that's also going to be pretty important. And also privately, I would love to be able to shorten our acronym from SDXI
Starting point is 00:30:42 to something that comes off my tongue a little easier, but stay tuned on that. Stone has been cast. Phil? Yeah, like new, for me, new data transforms and additional offloads, as well as fleshing out sort of the software infrastructure required to enable this VM-to-VM data transfers. Thanks, everybody. You know, I must say this has been a great conversation. You know, I've been talking to you all for a long time, but getting insights on how you would like to see a standard like SCSI be used is super critical. And viewers, you know, if this interests you to learn more about this,
Starting point is 00:31:28 then like I mentioned in the beginning of the panel, STXI Twig has posted a version of the specification for public review. Please send us your feedback there. And in case you want to join the Twig to influence the future of the specification, we would love to encourage you. As you can see, we are a fun group and we would love to hear back from you.
Starting point is 00:31:51 Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
