Storage Developer Conference - #136: Introducing SDXI

Episode Date: December 1, 2020

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode 136. Hello, good morning, good afternoon, good evening from everywhere around the world. What an amazing way to chat today. Technology has never been more important, and that's why it gives me great pleasure to introduce Smart Data Acceleration Interface,
Starting point is 00:00:58 a new technology that we are bringing to SNIA. It's a new technical working group to standardize the interface for memory-to-memory data movement and acceleration. I am Shyam Iyer. I'm a distinguished member of the technical staff at Dell's Office of the CTO for Servers. I'm a co-founder and interim chair for this new technical working group at SNIA. I'm also joined here by my esteemed colleague, Richard Brunner, who is a CTO and principal engineer at VMware. He's also a co-founder for this new technical working group. He's been a co-conspirator, co-collaborator, and gentle good friend and mentor in this process with us. Let's take a quick look at the agenda that we want to get you through.
Starting point is 00:01:51 So before I get into the details, we wanted to get you through the journey of how we landed here. So I absolutely want to talk to you about the problem and the need for the solution that we're trying to address. Then I'll introduce SDXI to you. And then finally, Rich will explain
Starting point is 00:02:11 some of the finer details and the concepts behind SDXI. So let me paint the story here. And this is a story that many of you are aware of. There is an increasing need for higher core counts to enable compute scaling. Compute density is also on the rise. Our hardware partners have been very generous with us: they've increased core counts for us so that our applications can scale. Increasingly, converged and hyper-converged storage appliances are enabling new workloads on server-class systems.
Starting point is 00:02:57 You might be wondering why a server guy is talking at a storage conference. Now I don't have to explain that anymore. Data locality is important. Single-threaded performance is under pressure. The laws of physics limit how much better a single core can do. Also, IO-intensive workloads are becoming noticeable in terms of how they take away CPU cycles. Network and storage workloads are part of the same class of work that can take away compute cycles. Data movement, encryption, decryption, compression, the list goes on. Let me draw your attention to a picture that I'm showing you here. Imagine the cores that you run in your server are the parallel lanes,
Starting point is 00:03:51 and the cars are the speeds that you get out of your cores. It does not matter how fast you run your car. If you've got cross traffic like that bus going through there, you're going to get a bottleneck. That's what happens when you have cross-traffic taking over your server performance. So let's take another look at some of the case studies behind why we need to accelerate intra-host data movement.
Starting point is 00:04:18 Let's see how the congestion builds up. You might be running some of your compute infrastructure. Many of them are VMs. We know how to scale them using network stacks, and application demands can be met. The storage may not necessarily come from your compute infrastructure. You might rely on your hypervisor to get you the storage, or in many cases, you might have a separate storage infrastructure in the form of a storage VM. Or, if you really need a lot of storage, you might have an entire storage network behind it to serve all of your storage needs. We've been doing a great job here over the years. We've been bringing a lot of innovation in the form of TCP/IP, RoCE, and iWARP, which are some of the ways that we can increase the bandwidth
Starting point is 00:05:06 and reduce the latency. And the new increased speeds and feeds with 10 gig, 25 gig, 40 gig, 100 gig, and 400 gig Ethernet are just going to make it better. What else is happening? Local storage is also undergoing a revolution. With the advent of non-volatile memory express, local storage latencies have considerably reduced.
Starting point is 00:05:31 With falling flash prices, the capacity also has been increasing. With persistent memory technologies, starting with NVDIMMs, new memory technologies are going to further reduce the latency envelope and expand the capacity. So, accelerating intra-host traffic is now very critical to server performance. Why is this important now? Because generally, intra-host memory exchange consists of multiple buffer copies. You can see that there are multiple layers of software stack here that do software-to-software copies. Kernel-to-IO and IO-to-hardware copies can leverage hardware, but software-to-software copies rely on the CPU. They're more synchronous.
Starting point is 00:06:27 So what is our current data movement standard? Our current data movement standard is a stable CPU instruction architecture. Don't get me wrong, this has served us really well, because that's how we can layer applications on top which don't have to change with every new CPU generation. However, there's another thing that happens here. Because we now rely on the CPU instructions, they take away from the application performance. They also introduce overhead when we need to provide context isolation. So what can we do? We tried something. We tried going with an offload DMA engine, and to do that we created
Starting point is 00:07:08 a simple experiment where we placed two Linux VMs, interfaced to use a virtual network vSwitch. As you can realize, there are software copies in the stack here. To test our theory, we prototyped in an FPGA two physical functions going as a pass-through to each of these Linux VMs. And then we stitched together some back-to-back DMAs so we can do fast hardware DMA copies between the Linux VM on the left and the Linux VM on the right. Results? They were very astonishing for us. Mind you, this is 2016. This was still a prototype and still is a prototype. The point of this was just to prove that hardware can be very beneficial. If I take your attention to the graph here, this is the blue line, where software copy latency starts increasing exponentially as the packet size increases, whereas
Starting point is 00:08:06 the hardware copy scales pretty well. What's wrong with this? They can be hardware vendor-specific implementations. They will come with their own drivers. Direct access with user-level software, while it can be designed, needs to get implemented for every hardware vendor that we have to work with. They will come with different kinds of APIs. And generally speaking, this can have very limited usage models.
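To make that software-copy baseline concrete, here is a minimal, hypothetical micro-benchmark sketch, not the 2016 prototype code, that simply times a plain CPU memcpy across growing buffer sizes; the sizes and iteration count are arbitrary choices for illustration.

```c
/* Minimal sketch: time CPU memcpy for growing buffer sizes.
 * Illustrative only -- not the prototype code described in the talk. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t sizes[] = { 4096, 65536, 1048576, 16777216 };
    const int iters = 100;

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        size_t n = sizes[i];
        char *src = malloc(n), *dst = malloc(n);
        if (!src || !dst)
            return 1;
        memset(src, 0xA5, n);               /* touch the source pages */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int k = 0; k < iters; k++)
            memcpy(dst, src, n);            /* synchronous CPU copy */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                     (t1.tv_nsec - t0.tv_nsec)) / 1e3 / iters;
        printf("%8zu bytes: %10.2f us/copy\n", n, us);

        free(src);
        free(dst);
    }
    return 0;
}
```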
Starting point is 00:08:32 We were trying to aim for something big. So what kind of solution requirements are necessary to build a standard interface? One, we have to absolutely offload IO from compute CPU cycles. My previous picture just pointed to that benefit. Number two, we have to build architectural stability, the kind of stability that we have enjoyed with our CPU instructions, but not by using the CPU cycles for it. We have to enable application and VM acceleration, but we also have to help migration
Starting point is 00:09:08 from existing software stacks. So we have to create good abstractions in the control path for scale and management, then very surgically enable performance in the data path with offloads. Everyone with me? Something else that we have to think about: as architects, principal engineers, and the CTO office, we look to the horizon for the different architectures that are going to become important for us. We're looking increasingly at memory-centric architectures, which means memory is at the center of the universe so that different
Starting point is 00:09:45 kinds of compute elements can play. To do this, we're looking at different kinds of memory interconnects like CXL and Gen Z to help serve those needs. We are also trying to enable different types of memory so that the innovation potential can be unlocked. Heterogeneous architectures are suddenly becoming mainstream. And now you will agree with me that there is a need to democratize data movement
Starting point is 00:10:12 between these different memory tiers, bridging the different memory interconnects, and letting all of these heterogeneous compute elements play. So let's see how we should build this. If you want to build an accelerator, where would you first think about building an accelerator? Now, I told you I don't want to use my CPU cycles, but why not build it in the CPU itself so that we don't have to have a lot of hardware
Starting point is 00:10:41 associated with an accelerator? Yes, that's desirable. We absolutely want to enable data memory copies from a context address space A to a context address space B, because my applications could be in two completely different address spaces. I want to enable direct user mode access for my applications with that accelerator interface, so that we can eliminate a lot of the software context isolation layers that are in the way and make this more performant. Next, we want to be able to serve different types of memory with this accelerator.
Starting point is 00:11:21 Why? Because storage-class memory is coming to us, and that brings us persistence characteristics. So we will need different types of data services with that. Increasingly, a lot of memory is going to be behind IO devices. We need to be able to address data movement to that as well. And then finally, with memory interconnects like CXL and Gen-Z, the system physical address space is going to increase. And therefore, this data mover or accelerator needs to be able to target all these different memory address spaces. We can't keep doing innovation with one CPU family and not replicate it with another CPU family. So it has to be a very standard, CPU-agnostic interface. Also, different types of compute elements coming to the data
Starting point is 00:12:09 center means that the same accelerator interface needs to be available for a GPU, for an FPGA, or a smart I/O device like a NIC or a drive. When we do that, now we can leverage a very standard specification, innovate around the spec, and then add incremental data acceleration features. And that's how we can solve an increasingly tiered memory world. So now let me introduce the concept of SDXI. Before I talk about SDXI,
Starting point is 00:12:46 I want to acknowledge Philip Ng, who is an AMD Senior Fellow, co-founder, and co-author of the spec. AMD, Dell, and VMware are contributing the starting spec for this technical working group. And we are very proud and excited to partner with all of SNIA's TWG members.
Starting point is 00:13:07 So what is SDXI? SDXI is trying to develop and standardize a memory-to-memory data movement and acceleration interface that is, one, extensible; it's forward compatible; and it's absolutely independent of IO interconnect technology. And may I add, different kinds of implementations should be possible with this. From a design point of view, we tried to think of some design tenets that would be useful for a standard interface like this. For example, we want to enable data movement between different address spaces, including user address spaces, both within and across VMs, and new address spaces that get defined.
Starting point is 00:13:52 We want to enable data movement without mediation by privileged software. If privileged software comes in the way, then performance can suffer. Of course, we want the connection to get established first. That's where privileged software comes in, which is why we want to allow abstraction or virtualization by privileged software. And there's something else that we want to aim for. A lot of technologies get defined where virtualization is more of an afterthought. We've done this groundwork with virtualization in mind, so that now we have the capability to quiesce, suspend, and resume the architectural state of a per-address-space data mover,
Starting point is 00:14:33 which means all of our architectural states are open and standard, no hidden states there. And what does this do? It enables live workload or virtual machine migration between servers and other types of benefits that come with it. We want to enable forward and backward compatibility across future specifications. So now we can have interoperable software and hardware, a key ingredient to make
Starting point is 00:14:59 a standard a success. Then we want to incorporate additional offloads in the future. I just don't want to copy from A to B. I want to be able to do some data transforms inline. And then we want to enable a concurrent DMA model. That means multiple parallel DMAs should happen all the time without one obstructing the other. With these design tenets in mind, like I said, we decided we would try to do a spec, and we even tried to implement a small prototype with it. Let me explain our prototype. So we implemented this SDXI prototype with an FPGA, with a driver and a kernel application. To compare, we used the same driver and wrote a kernel application that did memcopies with this driver. On my right here, I have this picture where the gray and the blue show me different kinds of software memcopies. I can use a synchronous API with the driver or an asynchronous API with the driver to get my software copy results. With our SDXI prototype, we can only do asynchronous
Starting point is 00:16:12 because remember, this is hardware. And with SDXI, this is a Gen3 FPGA prototype, we're hitting near the line rate. Remember, we're not even an FPGA company. The reason for this is to show that enabling the ecosystem is key to our success. And therefore, we're saying that there are good benefits to doing this spec and implementing with it. That's why we want to invite one and all to come up with their implementations here.
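As an aside on the synchronous-versus-asynchronous distinction mentioned a moment ago, here is a hedged sketch of the two API shapes; sdxi_copy_sync, sdxi_copy_async, and sdxi_wait are hypothetical names invented for illustration, not the prototype driver's API, and the "hardware" here is faked inline.

```c
/* Hedged sketch of sync vs. async copy API shapes. All names are hypothetical. */
#include <stdio.h>
#include <string.h>

typedef struct { int done; } copy_handle;

/* Synchronous path: the caller's thread performs (or waits for) the copy. */
static void sdxi_copy_sync(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Asynchronous path: submit work and return; completion is signaled later.
 * A real engine would DMA in the background; here we fake it inline. */
static void sdxi_copy_async(void *dst, const void *src, size_t len,
                            copy_handle *h)
{
    memcpy(dst, src, len);   /* stand-in for hardware doing the work */
    h->done = 1;
}

static void sdxi_wait(copy_handle *h)
{
    while (!h->done)
        ;                    /* poll until the engine signals completion */
}

int main(void)
{
    char src[64] = "hello from the source buffer";
    char dst[64] = { 0 };
    copy_handle h = { 0 };

    sdxi_copy_sync(dst, src, sizeof(src));      /* blocks the CPU          */
    sdxi_copy_async(dst, src, sizeof(src), &h); /* CPU free to do other work */
    sdxi_wait(&h);

    printf("%s\n", dst);
    return 0;
}
```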
Starting point is 00:16:42 Come and partner with us. Also partner with us at a TWG level. We're calling on other SNIA technical working groups, like the Persistent Memory technical working group, because we want to target persistent memory with this data mover. We also want to talk to the computational storage groups, because storage is going to be an interesting space that we're going to target with this. Networking and storage applications can run with this data mover. We also want to partner with the Compute, Memory, and Storage Initiative,
Starting point is 00:17:14 part of SNIA. Also, looking externally, we want to partner with PCI-SIG, CXL, OFA, UFI, Gen-Z, because, like I said, it's interconnect independent. This data mover can be implemented with PCIe, CXL, any implementation or interconnect that you want to bring. That's how we make it a standard.
Starting point is 00:17:38 So what is it that we're going to do in this TWG? The base spec that we contributed, we're going to take it to 1.0. After 1.0, we're going to define new kinds of operations that will enable this data mover interface,
Starting point is 00:17:52 including persistent memory targets. We want to create cache coherency models. We want to be able to address security features that involve these data movers. We want to create a connection management architecture.
Starting point is 00:18:03 And we absolutely want to encourage one and all to come join us in this effort. OS vendors, hypervisors, OEMs, applications, data acceleration vendors, IP vendors, come and join us and make this a great success. With that, I would like to introduce Rich Brunner to come and talk about more of the SDXI concepts. He has a penchant for rings, and I certainly think that you will like his talk. Thank you, Shyam. Let's dig into it.
Starting point is 00:18:39 So solutions to provide scalable data movement require not only acceleration, but a standard interface that supports software reuse and virtualization. And so I'm showing again the list that Shyam shared earlier about the requirements for such an interface. And I'm just going to highlight a few of these on this slide. So a standard interface for data movement needs to work both within an OS instance or a virtual machine, such as we've shown here, or between different virtual machines. And the key concept here is that all the accesses from the hardware to the actual memory space that might live in a VM or process go through the IOMMU as appropriate. So that's a key point of the architecture.
Starting point is 00:19:43 The data movement, as Shyam said, we want that to work without mediation or interception by the hypervisor or privileged software. The minute you do that, your performance suffers. We want to be able to allow live workload or virtual machine migration between servers using this technology across different kinds of hardware, all implementing the same standard set of operations. So that's very important to truly be able to deploy this at scale. So we're going to define a data movement engine as some number of functions, which easily corresponds to a PCI Express physical or virtual function. And regardless of the acceleration, there should really just be one way to set up and control a function and just one standard descriptor format to submit work. Now a function has some number of contexts, and each context is an independent descriptor ring and its associated data structures.
Starting point is 00:21:10 Now all SDXI context state resides in memory. There are no device-dependent specialized mechanisms to serialize state. We don't need them. And as a result, this makes this very easy to virtualize. Now the context state is partitioned between privileged software, for setup and control, and actual user mode access of the ring itself. So we think that this lends itself well to a model where there is no mediation or interception of a user mode process putting data right into the ring, so that the function can begin its operations on it. Now, there can be multiple contexts, and this allows independent threads to issue and manage operations and state without any blocking or coordination with other threads. It also makes
Starting point is 00:22:15 the ability to quiesce, suspend, and resume much easier and much more granular, per producer-consumer pair, rather than a broad hammer that affects everything. And lastly, we have one standard way to log errors. So you don't need N different ways. We believe what we've defined here will work across lots of different classes of accelerators. So like any data movement architecture, we need the concept of a circular ring, and we don't need to keep redefining it for each and every class of accelerator. So we have a simple ring, and of course it is a finite number of memory locations and it will wrap.
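As a rough mental model of the function/context/ring layering just described, here is a hedged C sketch; the field names and layout are invented for illustration and are not the structure or register layout defined by the SDXI specification.

```c
/* Rough mental model of an SDXI-style function with multiple contexts.
 * Field names and layout are invented for this sketch, not taken from the spec. */
#include <stdint.h>
#include <stdio.h>

#define MAX_CONTEXTS 16

/* One fixed-size work descriptor slot (its contents are sketched later). */
struct sdxi_descriptor {
    uint8_t bytes[64];
};

/* Per-context state: an independent descriptor ring plus its indices.
 * Everything lives in ordinary memory, which is what makes the state easy
 * to save, restore, migrate, and virtualize. */
struct sdxi_context {
    struct sdxi_descriptor *ring;       /* circular descriptor ring            */
    uint64_t ring_entries;              /* number of slots before wrapping     */
    volatile uint64_t *read_index;      /* advanced by the function            */
    volatile uint64_t *write_index;     /* advanced by producer software       */
    void *akey_table;                   /* address-key table (covered later)   */
    int user_mapped;                    /* ring + indices mapped to user mode  */
};

/* A function (e.g., a PCIe physical or virtual function) owns some number of
 * contexts; privileged software sets contexts up, user mode fills the rings. */
struct sdxi_function {
    struct sdxi_context contexts[MAX_CONTEXTS];
    volatile uint64_t *doorbell;        /* MMIO doorbell register              */
};

int main(void)
{
    printf("illustrative per-slot descriptor size: %zu bytes\n",
           sizeof(struct sdxi_descriptor));
    printf("contexts per illustrative function:    %d\n", MAX_CONTEXTS);
    return 0;
}
```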
Starting point is 00:23:11 But the ring is managed by position-independent indices in user-mode memory that point to the beginning and the end of the range of valid descriptors. Now, by using ring indexes, this allows easy suspend, resume, and relocation of the ring without breaking major context data structures or applications. It makes virtualizing the ring a lot easier as well. Now, the SDXI function starts reading descriptors at the read index pointer, and it stops reading at the write index. And software starts writing new descriptors beginning at the write index, and it can write until the ring is exhausted. Pretty straightforward mechanism.
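A minimal sketch of that producer/consumer index bookkeeping, assuming structures along the lines of the previous sketch, is below. Memory ordering, error handling, and the exact index semantics are simplified for illustration; the doorbell shown here is only a simulation of the MMIO kick that is discussed a little further on.

```c
/* Sketch of ring-index bookkeeping: the producer writes descriptors at the
 * write index, the function consumes from the read index up to (but not
 * including) the write index. Simplified and illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RING_ENTRIES 8

struct sdxi_descriptor { uint8_t bytes[64]; };

struct sdxi_context {
    struct sdxi_descriptor ring[RING_ENTRIES];
    volatile uint64_t read_index;    /* advanced by the function              */
    volatile uint64_t write_index;   /* advanced by producer software         */
    volatile uint64_t doorbell;      /* simulated MMIO doorbell; the doorbell
                                        write is what architecturally ensures
                                        new descriptors are noticed (below)   */
};

/* Producer side: returns 0 on success, -1 if the ring is currently full. */
static int sdxi_submit(struct sdxi_context *ctx, const struct sdxi_descriptor *d)
{
    uint64_t wr = ctx->write_index;
    uint64_t rd = ctx->read_index;

    if (wr - rd >= RING_ENTRIES)          /* ring exhausted: caller must wait */
        return -1;

    ctx->ring[wr % RING_ENTRIES] = *d;    /* valid range is [rd, wr);
                                             wr - 1 is the last valid slot    */
    ctx->write_index = wr + 1;            /* publish the new descriptor       */
    ctx->doorbell    = wr + 1;            /* kick the function                */
    return 0;
}

/* Function side (simulated): read descriptors in order, one at a time. */
static void sdxi_function_poll(struct sdxi_context *ctx)
{
    while (ctx->read_index != ctx->write_index) {
        /* ...decode and start executing ring[read_index % RING_ENTRIES]... */
        ctx->read_index += 1;             /* bumped once a descriptor is issued */
    }
}

int main(void)
{
    struct sdxi_context ctx;
    struct sdxi_descriptor d;

    memset(&ctx, 0, sizeof(ctx));
    memset(&d, 0, sizeof(d));

    for (int i = 0; i < 3; i++)
        sdxi_submit(&ctx, &d);
    sdxi_function_poll(&ctx);

    printf("read=%llu write=%llu\n",
           (unsigned long long)ctx.read_index,
           (unsigned long long)ctx.write_index);
    return 0;
}
```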
Starting point is 00:24:12 And so write index minus one is always the place where the last valid descriptor exists. So descriptors are read in order by the function. They can be executed out of order. They can be completed out of order. And the read index is incremented after each descriptor has been issued. Now recall that the pointers can be mapped to write-back DRAM. And that actually works out really nice. So for a function that is actually implemented,
Starting point is 00:24:51 such that it can be aware of any changes to the read and write index location pointers, it is able to immediately start accessing valid descriptors without the need to wait for a synchronizing doorbell write to an MMIO location. And so we avoid the need for a special instruction defined in any instruction set in order to kick off this process. However, from an architectural perspective, because not all
Starting point is 00:25:29 hardware implementations of functions can necessarily work this way, and to allow virtualization, the architecture requires that the doorbell be written to ensure that new descriptors are recognized. But you can plan that in such a way that it does not limit your performance. And that's the key point that we want to make here. So this then gives us the ability to have a maximum number of operations executing in parallel without waiting for a serializing write, as well as nice, well-defined boundaries
Starting point is 00:26:14 for quiescing and serializing the state and for error reporting. That is very important in this architecture. So we also defined a descriptor format that we believe easily lends itself to all sorts of different classes of accelerators. Again, we don't need to keep reinventing things per accelerator or per device. We can stick with just one. So our descriptor is 64 bytes. Now there's room for future operations. I'm going to talk about them for a moment. And it's even possible to describe descriptors that need multiple 64-byte blocks. So we have that future extensibility there,
Starting point is 00:27:09 as well as the opcode space itself has room for lots of future operations. Now, we have defined already in our spec some basic operation groups, and the details are there on the slide. We have sort of a basic DMA operation group. We have a group that allows us to do atomics. Also, the way we've defined this architecture, an admin context or an admin queue can be used to manage all of the other contexts,
Starting point is 00:27:49 all the other rings within the function. So again, even the management model lends itself to the same properties of well-defined boundaries and mechanisms that easily allow virtualization. So you can see that many of those operations have to do with stopping and starting a context and being able to manage its state and ensure that proper operation is carried out. And then lastly, we've put in the hooks to define a connection set of operations that will allow virtual machines to connect to each other through SDXI functions in a server.
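To make the single 64-byte descriptor format and the operation groups just described concrete, here is a hedged sketch of what such a fixed-size descriptor might carry; the field names, widths, and opcode values are invented for illustration and do not reproduce the layout defined in the contributed spec.

```c
/* Illustrative 64-byte descriptor. This is NOT the SDXI spec's field layout;
 * it only shows how one fixed-size format can cover several operation groups. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Operation groups along the lines discussed: basic DMA, atomics, admin,
 * and (eventually) connection management. Values are made up. */
enum example_opcode {
    OP_NOP       = 0,
    OP_COPY      = 1,   /* memory-to-memory copy                          */
    OP_WRITE_IMM = 2,   /* write an immediate value to a target location  */
    OP_ATOMIC    = 3,   /* atomic read-modify-write at the target         */
    OP_ADMIN     = 4,   /* start/stop/manage another context              */
};

struct example_desc {
    uint16_t opcode;         /* which operation                              */
    uint16_t flags;          /* e.g., "more 64-byte blocks follow"           */
    uint32_t length;         /* transfer length in bytes                     */
    uint16_t src_akey;       /* address-key table index for the source space */
    uint16_t dst_akey;       /* address-key table index for the destination  */
    uint32_t reserved0;
    uint64_t src_addr;       /* 64-bit address within the source space       */
    uint64_t dst_addr;       /* 64-bit address within the destination space  */
    uint64_t completion_ptr; /* points at a completion status block          */
    uint64_t reserved[3];    /* room for future operations and attributes    */
};

int main(void)
{
    /* The fixed size is the point: one format across accelerator classes. */
    static_assert(sizeof(struct example_desc) == 64, "must be 64 bytes");
    printf("illustrative descriptor: %zu bytes\n", sizeof(struct example_desc));
    return 0;
}
```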
Starting point is 00:28:47 So in our standard descriptor, we also obviously need some kind of completion status. And so we've defined a completion status pointer to a completion status block. This is initialized by software, and the completion signal portion is decremented by the function on success. Now this block can be shared across multiple descriptors. So if you had some logical grouping of multiple descriptors and you wanted to use the same completion status block, you could do that, and you could even tell which ones have finished and which ones haven't through this mechanism. Now, the completion status mechanism also has the ability to generate interrupts. But we're showing here right now the model where you can poll this in memory to see the result. Now errors are also flagged, and that's by writing a signature to the completion signal
Starting point is 00:29:50 and putting miscellaneous status information about the error as appropriate. So the other thing we should talk about is how memory locations are specified. So we've defined a fairly straightforward way for this. Memory locations are always specified as a triple. So there is something that points to what the address space ID is for the operation, a 64-bit address, and any appropriate cacheability attributes. Now again, as we've said before, the addresses that are generated through this mechanism are always run through the IOMMU as appropriate. That's a key concept.
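Concretely, a hedged sketch of one such memory-location triple might look like the following; the field names and widths are illustrative, not the spec's encoding.

```c
/* Illustrative encoding of one memory-location "triple": which address space
 * (named indirectly through an AKey table index), which 64-bit address inside
 * it, and what access/cacheability attributes apply. Not the spec's layout. */
#include <stdint.h>
#include <stdio.h>

struct example_mem_loc {
    uint16_t akey;     /* index into the context's address-key (AKey) table;
                          the entry names an address space / PASID that the
                          function may touch, always checked via the IOMMU   */
    uint16_t attrs;    /* cacheability / access attributes                   */
    uint32_t reserved;
    uint64_t addr;     /* address within that address space                  */
};

int main(void)
{
    /* A copy operation carries one such triple per location it references. */
    struct example_mem_loc src = { .akey = 1, .attrs = 0, .addr = 0x1000 };
    struct example_mem_loc dst = { .akey = 2, .attrs = 0, .addr = 0x2000 };
    printf("src: akey %u addr 0x%llx -> dst: akey %u addr 0x%llx\n",
           src.akey, (unsigned long long)src.addr,
           dst.akey, (unsigned long long)dst.addr);
    return 0;
}
```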
Starting point is 00:30:46 And so if you have multiple memory locations, then you will have multiple of these triples as defined in the standard descriptor format. Now let's talk a little bit more about the address space ID. So this is actually an index into a per-context address key table, which you might have seen in my earlier picture. And an address key table encodes all the valid address spaces, PASIDs (that's process address space IDs), and interrupts available to a function context. Any descriptor within a context can reference an AKey table entry. And so all the locations that our SDXI function can access on behalf of the software submitting work are captured in the address key table. So let's look a little bit more at how that address key table comes into play for doing multi-address-space data movement. So we'll start with a simple example. We have an address space B where we have a producer, and that producer has a descriptor ring where he is submitting commands to the SDXI function. And he also has a source buffer that he wants to copy over to another address space. Now the allowed address spaces that he wants to be able to
Starting point is 00:32:26 access are, of course, his own, as well as address space C, where the destination buffer is. So from an SDXI perspective, the AKey table has a mapping for address space C. So it allows the producer in address space B to specify that he wants to do an access to address space C. The other thing to note here is that the IOMMU is going to be programmed for each of these address spaces in order to allow the appropriate SDXI function access: there is a function that is bound tightly to address space B and a function that is bound tightly to address space C. So the IOMMU is programmed so that SDXI function B can access address space B, and SDXI function C is programmed into the IOMMU of address space C in order to allow the same kinds of accesses. So the copy descriptor operation is issued.
Starting point is 00:34:14 The SDXI function reads the actual operation, and it will read from the buffer in address space B. And so that's the DMA read completion. It will then hand it off to SDXI function C, which will in turn, using the IOMMU mappings, write into address space C for the destination buffer. So the SDXI DMA engine has this sort of backdoor ability, if you will, to redirect the data flow through the appropriate function to ensure that the permissions are set up properly. And all of the permission stuff is done at a privileged software level. And of course, it can also be intercepted by the hypervisor as appropriate. Now we're going to look at one more, more complex multi-address-space data movement.
Starting point is 00:35:10 In this case, address space B is actually still the one that is submitting the work. But now we want to put the source buffer in address space A and orchestrate copying it over to address space C. So address space A is, of course, also mapped into the IOMMU in the sense that the SDXI function A is allowed to be able to read that address space.
Starting point is 00:35:47 So first, we fetch the actual operation of interest from the descriptor ring for address space B. And the function returns the descriptor, and then the DMA engine knows that it has to use what appears as function A through the IOMMU to be able to address address space A. The buffer is read, the result is returned, and then it's handed over to what appears as SDXI function C, which has been permitted to do the write into address space C. So this can be very powerful for certain scenarios. Many usages may live just within one address space, or use multiple address spaces with just one function, but the architecture permits something as complex as this when appropriate. Now, I don't talk about it in detail here, but certainly we have on our minds how we can chain not only different software components, but how we could even potentially chain different hardware components with SDXI. But that's something that will work itself out through the TWG.
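Putting the pieces together, here is a hedged, simulated end-to-end sketch of the scenario just described: a producer in address space B builds one copy descriptor whose source points (through an AKey entry) into address space A and whose destination points into address space C, and then polls a completion status block. All structures, names, and the in-process "hardware" below are invented for illustration; a real SDXI function would resolve these references through the IOMMU-checked functions bound to each address space, not through host pointers.

```c
/* End-to-end sketch (simulated in one process): a producer "in address space B"
 * submits a copy whose source lives in "address space A" and whose destination
 * lives in "address space C", identified through AKey table entries. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct mem_loc    { uint16_t akey; uint64_t addr; };  /* (space key, address) */
struct akey_entry { const char *space; char *base; }; /* stand-in mapping     */

struct completion_block {
    volatile uint32_t signal;   /* set by software, decremented on success  */
    uint32_t          status;   /* error signature/details would land here  */
};

struct copy_desc {
    struct mem_loc           src, dst;
    uint64_t                 length;
    struct completion_block *completion;
};

/* Simulated function: "reads" the descriptor, performs the move through the
 * per-address-space mappings, then signals completion. */
static void sdxi_function_execute(const struct copy_desc *d,
                                  const struct akey_entry *akeys)
{
    const struct akey_entry *s = &akeys[d->src.akey];
    const struct akey_entry *t = &akeys[d->dst.akey];
    memcpy(t->base + d->dst.addr, s->base + d->src.addr, d->length);
    d->completion->signal -= 1;                  /* success */
}

int main(void)
{
    char space_a[64] = "payload that lives in address space A";
    char space_c[64] = { 0 };

    /* AKey table for the producer's context in address space B:
     * entry 1 -> space A (source), entry 2 -> space C (destination). */
    struct akey_entry akeys[3] = {
        { "B", NULL }, { "A", space_a }, { "C", space_c },
    };

    struct completion_block done = { .signal = 1, .status = 0 };
    struct copy_desc d = {
        .src = { .akey = 1, .addr = 0 },
        .dst = { .akey = 2, .addr = 0 },
        .length = sizeof(space_a),
        .completion = &done,
    };

    sdxi_function_execute(&d, akeys);            /* stand-in for the DMA engine */

    while (done.signal != 0)                     /* poll the completion block   */
        ;
    printf("copied from space %s into space %s: %s\n",
           akeys[d.src.akey].space, akeys[d.dst.akey].space, space_c);
    return 0;
}
```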
Starting point is 00:37:14 In summary, as CPU cores scale, the usage of and demand for ever larger and faster data exchanges scales. It scales among kernels, applications, virtual machines, and I/O devices. And we see that future network and storage technologies will especially require this scaling. Now, solutions to provide data movement scaling and transformation require not only acceleration, but, we believe, a standard interface that
Starting point is 00:37:53 supports software reuse and virtualization. As Shyam said, Dell, AMD, and VMware are contributing a proposed starting point for this interface to SNIA. And in this session, we've discussed some of the key concepts of this proposed interface, but there are a lot of details. And that's why SNIA has authorized our Smart Data Acceleration Interface Technical Work Group to begin work on fleshing out this actual interface.
Starting point is 00:38:24 We believe this interface proposal is of broad value to the industry. We are really trying to make this non-accelerator-specific and also future-proof, so that the underlying IO interconnect technology can be comprehended in the way we've defined the architecture. So, as Shyam said, come join us in the TWG. The details are
Starting point is 00:38:55 much, much deeper than what we've been able to touch here, but we hope that you get the spirit of what we're trying to do. Thank you. Thank you. We look forward to all of your questions.
Starting point is 00:39:09 Thanks for listening. If you have questions about the material presented in this podcast, be sure to join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
