Storage Developer Conference - #180: SNIA SDXI Internals
Episode Date: February 1, 2023...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode number 180.
So I'm William Moyes. I'm here today to talk about SDXI, the Smart Data Accelerator Interface
and its internals. We'll be getting into the nuts and bolts of what makes this thing work.
But before we do that, we'll go over a couple other topics. If you've been following closely,
you may have noticed that we've had a bit of a change around in the schedule, a bit of a change
around in the presenter even. I was not the originally planned presenter. Originally, Shyam Iyer from Dell was going to come. He is the chairperson for the SDXI technical
work group. Unfortunately, due to circumstances beyond his control, he was not able to attend
in person today. So he asked me to come in and present on his behalf, and I'll be going through
the presentation. Any mistakes, problems, whatever, please blame me. Don't blame him. Okay. So we're going to go through kind of the story of SDXI, where it came from, what it's
about, and where it's going. First, we're going to cover some of the things you already know,
but may not have necessarily put together, the motivations of why we wanted to do SDXI,
why we felt this
was important, and how the systems are evolving.
We all see the systems evolving today, but we're going to talk about what's changed and
how these pieces come together, which is what motivates SDXI.
We'll then get into some of the use cases, some of the software application use cases
that further motivate why SDXI is interesting.
And after we talk about what we would have from a motivation standpoint,
we'll start getting into the goals and tenets of SDXI.
We'll go into an introduction of the internals of SDXI.
We'll start getting into the nuts and bolts.
It wouldn't be as detailed as reading the actual specification itself,
but it will give you a good overview of how it works, why it works,
so you can go back and know what you need to learn
if you want to learn more about this technology.
We'll also talk a bit about the SDXI community
that's developing the specification, as well as the future plans for SDXI.
And then, of course, the obligatory list of references,
so that when, after this conference, you have questions about SDXI
and want to learn more about it, you'll be able to see where you can go
to educate yourself. Okay, so let's start off talking
about what we already know, just to kind of act as a bit of a reminder. Let's talk about an
application and the resources an application needs to execute. And again, this is traditional, and
I'll be describing it using a number of bubbles.
Again, very high-level architecture, just by way of motivation. So today, we have an application.
The application itself, classically, legacy-wise, will be running on some form of compute,
usually a general-purpose CPU executing the operations of the application.
The compute engine would then be interfacing with system memory, traditionally DRAM.
And then, a little more modernly, the application may be multi-threaded, so we get multiple
threads of execution running on multiple compute cores.
The compute and the memory are coherent with each other.
The memory itself is for data in use,
the data actively being used,
and of course the application can directly access
that data content.
It's kind of a direct one-to-one correspondence
between what's in the source code
and what's actually taking place.
Now for persistent storage,
persistent durable storage,
there's going to be some type of disk device
or storage device connected via an I/O channel. That I/O channel itself is not directly accessible from the CPU, as in you just
can't go and take a pointer and point to a data blob on the disk. We'll ignore memory-mapped I/O
for right now for simplicity. Instead, the compute needs to instruct the I/O device to transfer the data from I/O to memory where the CPU can then do work on it.
So, again, we all know this.
We all understand this.
We've been doing this for decades upon decades upon decades.
And, of course, the connection between the I/O and memory has been optimized for bandwidth and latency.
Now, let's take a look at where we're
starting to head in the future when looking at the industry. And we're seeing a lot of emerging
efforts. Some of them are here, some of them are in development right now, etc. So, we have the
application, but instead of just having general purpose CPUs, we're seeing a proliferation of
new compute technologies, CPUs, GPUs, application-specific processors.
We're also seeing FPGAs being used for doing compute operations.
For the memory technology, and again, I'm talking about the online, active, direct interaction type of memory,
we have volatile memory.
DRAM is still around.
It's going to be with us for a long time still, I believe.
We have non-volatile memory.
There were some talks earlier today about that,
about some of the technology that's coming into play.
People are looking at memory that's
cheaper, that's more about capacity.
Memories that have different performance characteristics,
that are cold and hot, either determined by the
underlying technology or by the way you access the
memory. Again, if you read between the
lines in some of the talks that were given,
you can see these patterns emerging.
And then these technologies are all being brought together by different memory or fabric
or link technologies. CXL, Gen-Z, CCIX,
etc. These new technologies are coming into play. And so we're seeing
the industry start to gravitate towards these approaches.
These, of course, also have shared design constraints
around latency, around bandwidth, and about coherency, and about control and security as we move there.
When it comes to moving data between these different elements, there has been one way pretty much everyone has done it, and that is software-based memcopy.
It's there.
It exists.
It's stable.
It is well understood.
The security models are well understood
about how memcopy works.
Now, the challenge is,
if you're in a compute-constrained environment,
this can take away from application performance.
If it's not compute-constrained, it's a different story.
But if you're in a compute-constrained environment,
it can take away from application performance.
Also, when you're looking at context isolation, this can end up involving software overheads, context switches, that again can distract. Offload DMA engines have been around
for a long time. Many systems have these. However, the engines that have been created previously
had vendor-specific interfaces on them, which meant that you would have to have the application or the software stack developed
to specifically work with these devices.
And the adoption has been fairly low on using these vendor-specific DMA engines.
And then, again, looking at another trend in the industry,
there's been a push to try to cut back on some of these layers. There's been a big push towards user mode access to hardware. And we start getting
into a domain like that, having an interface that's proprietary to the vendor becomes especially
challenging, as opposed to something that is standardized, where the interface is universally
going to be adopted, like your memcopy operation.
When you think about this standardization,
think for a moment about what SATA did,
what AHCI did, or NVMe,
again, a programming interface that's standardized,
where you can have one piece of software
that works with a multitude of different hardware
from different vendors.
Think in terms of that when we talk about standardization.
And then further think about user mode applications and what would be involved there.
Now let's take a quick moment to go through application usage patterns.
And again, not to bore you, the first example is really simple, really straightforward.
Let's take a look at transferring data.
You have an application running.
It's a user mode application. It wants to move data from point A to point B, and it can just call memcopy today.
And again, if it's constrained in terms of compute power, this could take away some of its performance.
Whereas, if you think about what could happen if you had an accelerator to offload this,
instead you could have a circular FIFO buffer, a ring buffer, to accept the request for
transfer. The accelerator can then be alerted that there's something out there for it to do by
ringing a doorbell, sending a signal to the hardware to activate it. And then the accelerator
can move the data from its source buffer to the destination buffer. And upon completion, it'll
then send the completion back to software. Again, a pretty conventional pattern. But by using this pattern, if you have lots of data to move,
you can then offload those operations from having the general-purpose compute handle them.
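To make that producer flow a little more concrete, here is a minimal sketch of the pattern in C. Everything here, the structure layout, the names, the doorbell and completion handling, is invented for illustration; it is not the SDXI descriptor format, just the generic enqueue, doorbell, and complete shape described above.

    #include <stdint.h>
    #include <stdatomic.h>

    #define RING_ENTRIES 64                     /* illustrative ring size */

    struct copy_request {                       /* hypothetical work descriptor */
        uint64_t src;
        uint64_t dst;
        uint32_t len;
        uint32_t flags;
    };

    struct copy_ring {                          /* hypothetical producer-visible state */
        struct copy_request slots[RING_ENTRIES];
        _Atomic uint64_t    write_index;        /* advanced by the producer            */
        _Atomic uint64_t    read_index;         /* advanced by the accelerator         */
        volatile uint64_t  *doorbell;           /* posted MMIO doorbell register       */
        _Atomic uint32_t    completion;         /* status the accelerator writes back  */
    };

    static void submit_copy(struct copy_ring *r, uint64_t dst, uint64_t src, uint32_t len)
    {
        /* A real producer must also check the read index so it never overruns
         * the consumer; that bookkeeping is omitted here for brevity. */
        uint64_t idx = atomic_load(&r->write_index);
        struct copy_request *d = &r->slots[idx % RING_ENTRIES];

        d->src   = src;
        d->dst   = dst;
        d->len   = len;
        d->flags = 0;

        /* Publish the descriptor before advancing the index and ringing the bell. */
        atomic_store_explicit(&r->write_index, idx + 1, memory_order_release);
        *r->doorbell = idx + 1;                 /* alert the accelerator */
    }

    static void wait_for_completion(struct copy_ring *r)
    {
        /* Poll a status word in memory instead of taking an interrupt. */
        while (atomic_load_explicit(&r->completion, memory_order_acquire) == 0)
            ;                                   /* spin; a real application might yield */
    }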
Other use cases, other patterns, I should say.
Think through the case of where you have an application that's trying to store data to persistent durable storage, so to disk, et cetera.
Again, working through a conventional pattern of behavior, you have the application needs to copy the data into a kernel memory buffer.
The kernel itself needs to activate a driver for the, say, NVMe disk or some other storage device.
It would then need to copy that data into a DMA-able buffer, again using a memcopy.
Then the device will actually transfer the data using a DMA read into persistent storage, durable storage.
To get the data back out, the whole process has to basically be run in reverse, back up and through.
Now, if we look at what has taken place in going from this model, and again, I recognize
there are optimizations here, zero copy, et cetera. I'm not going to go into those. I'm just describing the
traditional path. If you start looking at what's been done with persistent memory technologies,
where the actual memory is available online in the system, where it's side by side with your
in-use memory. Think NVDIMM as an example that's
been around for a couple of years. In this case, a user application can simply memcopy the data
from point A to point B, and then do the couple of extra steps required to ensure persistency.
But it's a more straightforward process. But again, if you're compute constrained, it can take away.
Then if you were to have an accelerator, the data could be moved by the accelerator,
again, by enqueuing the necessary commands into that ring buffer and asking the device to do the transfer.
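For the "couple of extra steps required to ensure persistency," here is a minimal sketch on an x86 system with memory-mapped persistent memory. The flush-and-fence sequence is one common way to do it and is not something SDXI defines; the function and buffer names are invented for this sketch.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <immintrin.h>        /* _mm_clflushopt, _mm_sfence (x86) */

    #define CACHE_LINE 64

    /* Copy into memory-mapped persistent memory, then make it durable. */
    static void copy_to_pmem(void *pmem_dst, const void *src, size_t len)
    {
        memcpy(pmem_dst, src, len);                       /* the ordinary memcopy      */

        /* Extra step one: flush every cache line that was written ...               */
        uintptr_t line = (uintptr_t)pmem_dst & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end  = (uintptr_t)pmem_dst + len;
        for (; line < end; line += CACHE_LINE)
            _mm_clflushopt((void *)line);

        _mm_sfence();             /* ... extra step two: fence so the flushes complete */
    }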
To me, one of the most interesting case patterns is this third pattern.
Imagine you have two applications, each running inside a different virtual machine.
In this scenario,
these two applications wish to communicate with each other.
In this particular scenario,
the user application would then need to memcopy the data to the kernel.
The kernel would then have to activate I/O to perform the I/O transfer.
The I/O would take place moving the data around the system.
The data would come back into the other side's kernel,
and then from there it would be then copied back into the user application on the other side.
If there's a hypervisor running here and you have a virtualized hardware,
there may be even more memory transfers involved.
This structure is necessary because of, again, the need for isolation between the two VMs.
Now let's take a look at what could be done if you had an accelerator
which was aware of the system architecture, if you will.
You could go in and have the first VM queue up the transfer, alert the accelerator,
have the accelerator directly grab the data from its memory buffers,
perform the transfer, and write them back into the other application's memory buffers.
Here, not only are you getting the advantage of transferring the data
from point A to point B without having the software overhead,
you're also able to skip over the context switches into the kernel.
There's no need to invoke the hypervisor, no need to invoke the local guest kernels on these two sides of the equation
to perform this transfer, which can save quite a bit.
Don't worry, there is security here.
A bunch of my slides will get into the security aspects of this.
But you can start to see some of the benefits of what can be achieved here.
Again, thinking through, and wow, that doesn't really show up well on this slide, does it?
Thinking through again, with this proliferation of memory technologies,
an ideal accelerator will be able to handle the data transfer, not just for DRAM,
but for the whole gamut of technologies that are upcoming,
regardless of source, peer-to-peer transfers, for example,
transferring things between persistent memory and regular DRAM, or CXL-connected memory, for example.
Now let's take one more minute, from a motivation standpoint and reminding you about things that you are probably already aware of, to talk through how, in an ideal world, the software stack would look
and how you would build the software stacks up.
So starting off with, again, you have an accelerator.
The accelerator would then have a kernel-mode driver that would be in charge of discovering
the device, discovering its capabilities, and initializing this controller.
That kernel-mode driver could then allow a kernel mode application
to establish its ring buffer,
and that would allow kernel mode applications to transfer data for kernel purposes.
It could also be used, for example, for zeroing memory
or other activities such as that.
Then the next step would be to provide APIs and libraries so that a user mode application could interact with this kernel mode driver and obtain its own ring buffer.
So in this case, we have the kernel mode driver and the user mode application working together,
where the application can get its own ring buffer
so that it's able to enqueue its own set of commands.
And then these two can be isolated.
One part is taking place in kernel mode, the other inside the user mode application,
inside the user mode's address space.
And again, in an ideal environment,
this would be operating entirely
inside the virtual address space of the application.
No pinning, no needing to worry about those other details,
and it would be able to work with native addresses
within this application's own address space.
The application could then use the accelerator at will
to transfer things within its own address space.
Again, the same approach can be taken crossing these different memory technologies, current and upcoming.
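By way of illustration only, the layering described here might surface to the application as something like the following. These function names are invented for this sketch; they are not an SDXI-defined or vendor-defined API.

    #include <stddef.h>

    struct dma_context;                                     /* opaque per-application ring */

    /* The library calls below would be backed by the kernel-mode driver, which
     * handles discovery, privilege, and handing each application its own ring. */
    struct dma_context *dma_open(const char *device_path);
    int  dma_copy(struct dma_context *ctx, void *dst, const void *src, size_t len);
    int  dma_wait(struct dma_context *ctx);
    void dma_close(struct dma_context *ctx);

    /* Application view: plain virtual addresses within its own address space,
     * no pinning, no physical-address bookkeeping visible at this level. */
    static int accelerated_copy(struct dma_context *ctx,
                                void *dst, const void *src, size_t len)
    {
        if (dma_copy(ctx, dst, src, len) != 0)
            return -1;
        return dma_wait(ctx);
    }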
Now, if you have this ability, you could also allow multiple applications to start up,
each application having its own ring buffer, and they can live in different system address spaces.
In this situation here, the two applications can be separate,
but again, given a proper accelerator design,
the accelerator could move data from one address space into the other address space,
providing the appropriate security checks are in place,
and they pass successfully.
The same concept could then be extended beyond just two applications running inside one operating system.
You could have two virtual machines running side-by-side on one system.
And again here, you would have a hypervisor kernel mode driver.
That could then allow each one of the guests to have its own kernel mode driver.
And then within each one of these VMs,
different user mode applications could start up.
Guest-to-guest data transfer is possible,
as well as user mode application-to-user mode application
data transfer would be possible.
Okay, so that's the stuff you probably already know.
Let's start getting to the parts that you might not know.
Let's start talking about SDXI itself.
So SDXI is a Smart Data Accelerator Interface.
It's a proposed standard currently in development
that's designed to be extensible, forward-compatible,
independent of I/O interconnect technology,
and its purpose is memory-to-memory data transfer.
This was started as a SNIA TWG back in June 2020, and it was tasked to work on this proposed standard.
Right now, there's 28 member companies and 80-plus individual contributors to the specification.
The goal of the specification is to provide memcopy-like functionality, but be able to do this
memcopy-like functionality with the same security guarantees that you have with memcopy,
yet have this offload where the offload is independent of the ISA, independent of the
form factor, independent of the bus. In some of my examples, I will be talking about PCI Express as an example.
We have defined a PCI Express binding.
However, it is not limited to just PCI Express.
Other underlying buses could be used.
It is agnostic.
And when it comes to PCI Express, there is a PCIe class code defined, so that would allow for generic operating system drivers to be developed that can then work with these classes of devices from multiple vendors.
Some of the SDXI design tenets.
Of course, data movement between different application spaces, address spaces.
This includes different virtual address spaces for applications.
It also includes guest address spaces, host address spaces, et cetera. So the whole spectrum of guest physical, guest virtual, host physical, host virtual, it's all covered. User mode address
spaces, again, are included in the concept; that's motivating this.
This has the ability to handle data motion without mediation by privileged software.
Once you've established the privileges, the data transfers can start.
To set those up, yes, you need privileged software to handle security and checking, et cetera. But once that's been done, you're free to go.
The specification itself was designed to be very virtualization-friendly.
There's quite a bit of input from VMware, for example,
on how to make this a virtualization-friendly approach.
Even though a lot of people working on this are looking at hardware implementations,
it was also designed in such a way that you could do a software,
a completely software implementation of SDXI.
It has the ability to quiesce, suspend, and resume the architectural state
of the exposed functions on a per-address basis.
So an orchestrator of some form, hypervisor, whatever you might have,
can cleanly suspend an application that's running
or cleanly suspend a whole virtual machine that's running
and then restart it successfully
without major blips or major disruptions.
It can even potentially switch
from a hardware-backed implementation
to a completely software-based version if necessary.
Effort was put in, of course,
for forward and backwards compatibility.
And one of the key things, too,
is a lot of thought was put into making sure this was extensible.
There is room for other offloads
to be added into this,
leveraging the architectural interface.
One of the things I find most exciting
about SDXI is we've defined an interface,
and once that ecosystem's built up,
there can be extensions to that built
that then
provide for other additional offloads that still fit within the same framework. And then you could
then leverage the ecosystem work to extend it. So let's get into the details of SDXI. You start off with SDXI functions.
Just to help you visualize this, think of this as, this isn't 100% true, but just think of it this way.
Imagine a PCI Express adapter card.
Treat that as a PCI Express function.
Treat that as an SDXI function.
That is one of the defined bindings.
Whether that be a physical function, whether that be a virtual function as specified by SR-IOV, it doesn't really matter,
but you have one of these entities.
One of these entities could be assigned to a particular virtual machine.
That function then will have a small amount of MMIO space.
Almost all the data related to an SDXI function in the system
is held in system memory.
Only a small amount of data is actually held inside this MMIO. It's like a dozen 64-bit registers, and most of those aren't
fully populated. The function itself points to context tables. You can have multiple contexts,
which I'll get into in a moment, as well as the control. And this then eventually points to the
descriptor ring. This is the circular FIFO buffer that a producer can then enqueue information into. Each one of the descriptors, which we'll
talk about in more detail. It also has read and write pointers to help maintain this. And
there's also the doorbell. The doorbell is an MMIO register, a posted MMIO register. It's there for the purpose of alerting the
SDXI function. There's data waiting
there for it. There is not a... how should I make sure I phrase this properly?
It's a posted transaction.
It's an alert. There are
opportunities for the function to aggressively
check things. The function doesn't have to wait
for it to occur.
Some of the
ordering is a little bit loose, again, for performance reasons.
I won't go much further into that right here, though.
When descriptors are put into the descriptor ring, which describe the transfer taking place,
you'll have your source and your destination buffers there,
and then there's also a pointer to a completion status.
This is how you can get feedback.
One thing about SDXI: with many conventional DMA engines,
you submit work and then an interrupt happens.
SDXI actually does not mandate that or require that.
The generation of interrupts is actually optional for applications.
So there are multiple modes, and software can choose
what mode it's going to use to receive notifications.
One example is you can have this status come back, indicating that the operation is completed,
through content in memory.
There's also this A key and R key table.
Very briefly, we'll go into this more in a moment.
The A key is the permission for the function to,
sorry, the permission for the context
to go out and access someone else's address space.
The R key is actually on the receiver side
to say you're allowed to make this remote request.
Basically, it's how you handle different address spaces.
So A key, think of it as address space key.
It's an index to a table.
Everything I've highlighted here in this dashed line,
this right here is what is available to the producer function.
So if it's a user mode application,
this is what the user mode application is going to see,
these data structures.
Everything you see on the second circle
is what's going to be visible to privileged software.
So the privileged software can control the A key table content
to say,
this application can go touch this, and then it can also work with the R key, which we'll go into soon. In addition, there's an error log to provide more detailed error information.
If during the processing of a data transfer, something goes wrong, an illegal request is made,
there's a hardware error, et cetera, The producer would get the notification that things stopped, but the error
log will provide that more detailed information, the debugging information, if you will, about what went
wrong, to allow for diagnostics to take place.
Again, all the states are in memory.
The descriptor format is standardized, and it's easy to virtualize.
When it comes to these context tables,
one SDXI function is able to stand up many, many different contexts.
The actual number of contexts is implementation defined,
the maximum that it will support.
However, the spec allows up to 65,536.
So you can imagine multiple applications all sharing one SDXI function,
and each application can be given by the operating system
its own independent ring buffer
so that those individual threads can directly interact with it.
And so they can submit work.
I should probably note there's also support in this for multiple producers,
where multiple threads can actually put data into a single queue without locking.
Some of the SDXI magic.
Read the spec if you want to.
Look into the details of how that's done.
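The specification defines its own multi-producer mechanism, so read it for the real details; purely as an illustration of the general idea, one common lock-free pattern is to let each thread claim a unique slot by atomically incrementing the write index, as in this sketch with invented names.

    #include <stdint.h>
    #include <stdatomic.h>

    #define RING_ENTRIES 64                     /* illustrative, power of two */

    struct entry {
        uint64_t    payload;
        _Atomic int ready;                      /* consumer waits for this flag */
    };

    struct mp_ring {
        struct entry     slots[RING_ENTRIES];
        _Atomic uint64_t write_index;           /* monotonically increasing */
    };

    /* Each producer thread claims a unique slot without taking a lock.
     * (A real implementation must also respect the read index so producers
     * never overrun the consumer; omitted here for brevity.) */
    static void mp_enqueue(struct mp_ring *r, uint64_t payload)
    {
        uint64_t idx = atomic_fetch_add(&r->write_index, 1);
        struct entry *e = &r->slots[idx % RING_ENTRIES];

        e->payload = payload;
        atomic_store_explicit(&e->ready, 1, memory_order_release);  /* publish */
    }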
And again, the error log and the R key are there as well.
The descriptor ring itself, just for brevity and time, I'm not going to spend too much time on this.
It's a ring buffer.
You put information in.
It's a circular FIFO. The data
between the read index and the write index minus one is the valid data. The indexes that are used
in SDXI are monotonic. They constantly increase. The wraparound is handled.
You don't numerically do the wraparound; it's just handled by the index wrapping around.
The operations here, you insert your requests into the queue in order.
They're consumed from the queue in order by the hardware.
However, they can be executed out of order, and they can be completed out of order,
again, for performance reasons to optimize performance.
There are controls in place where you can say run these, and then stop and wait till these finish. You can have a fence to ensure that if you need consistency, such as
write this, you know, transfer this data, stop, then signal the completion. That's supported by
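A small sketch of how those monotonically increasing indexes behave: the slot is derived by masking, so the wraparound just falls out of the arithmetic. The ring size and helper names here are invented; the actual descriptor ring rules come from the specification.

    #include <stdint.h>

    #define RING_ENTRIES 256                    /* illustrative, power of two */

    /* Indexes only ever increase; the slot within the ring is derived by masking. */
    static inline uint64_t ring_slot(uint64_t index)
    {
        return index & (RING_ENTRIES - 1);
    }

    /* Valid entries sit between the read index and the write index; this count
     * stays correct even when the 64-bit indexes eventually roll over. */
    static inline uint64_t ring_occupancy(uint64_t read_index, uint64_t write_index)
    {
        return write_index - read_index;
    }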
SDXI. Okay, the actual descriptors. Inside this ring buffer,
each descriptor that gets submitted into the ring buffer is 64 bytes,
standardized size.
Inside the descriptor itself,
do I have a mouse?
No, I will use the pointer.
Okay, inside the descriptor itself,
first thing, notice, there's a lot of room and space available for expansion in
the future. Today, what we have is we have the valid bit saying the new descriptor is valid,
it's available. The control field, this is where you put the information about fencing
and other concurrency controls. Information is located here. Then we have the operation group
and the operation. All of the defined SDXI operations are put into different operation groups.
The groups are your DMA base, basic transfers. There's some atomic operations that take place,
algebraic operations. There's a whole set of administrative commands that only the administrative
context is allowed to run. One of the contexts, context zero, is reserved, and it's only for
starting, stopping, commanding things to take place.
There's space for vendor-defined commands.
Also not listed here, there's an interrupt operation group.
Within the DMA commands, there is the copy operation, obviously.
There's also rep copy.
This is basically like a memset, if you will.
And so this can then be used to zero memory.
There's also the write immediate, which is more just take this value,
put it right here, used more for signaling.
Same thing with the atomic operations.
These provide bitwise operations, atomic add, atomic subtract,
various compare and swap type operations.
Again, these are useful for signaling applications or whatever you dream up.
I already spoke to the admin operations.
Then at the end, there's that completion pointer, which is actually optional,
and this will then be either decremented or zeroed, depending upon the hardware and what's been requested, in order to signal this particular request has been completed.
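To keep those fields straight, here is a hypothetical 64-byte layout carrying the pieces just described. The real field offsets, widths, and encodings are defined by the SDXI specification and will not match this sketch; only the 64-byte size and the general set of fields follow the talk.

    #include <stdint.h>
    #include <assert.h>

    /* Hypothetical layout only; consult the SDXI specification for the real one. */
    struct sketch_descriptor {
        uint32_t valid_and_control;   /* bit 0: valid; other bits: fencing, etc.    */
        uint8_t  operation_group;     /* DMA base, atomic, admin, vendor, interrupt */
        uint8_t  operation;           /* e.g. copy, rep-copy, write-immediate       */
        uint16_t reserved0;
        uint64_t source_address;      /* 64-bit address, qualified by an A key      */
        uint64_t destination_address;
        uint16_t source_akey;         /* index into the A key table                 */
        uint16_t destination_akey;
        uint32_t transfer_size;
        uint64_t completion_pointer;  /* optional completion-status location        */
        uint8_t  reserved1[24];       /* room and space for future expansion        */
    };

    static_assert(sizeof(struct sketch_descriptor) == 64,
                  "descriptors are a standardized 64 bytes");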
Furthermore, the addresses.
Obviously, when doing a memory copy, you need more than one address,
but for brevity in the slide, it just showed one.
All the addresses are 64-bit addresses,
but an address by itself isn't very useful,
especially if you're thinking about devices like this
that can handle multiple address spaces.
So we have the A key.
The A key is an index into this A key table
that I think was green on the previous slide. And this specifies which address space I'm trying
to access. The table is controlled by privileged software. So privileged software can say,
here's your application. I've set up a queue for you. You can only access yourself, your own memory space.
Or it can start creating the bridges into other address spaces as appropriate.
And again, the addresses.
So here's the A key table entry itself.
It has the process address space ID.
It can have the steering tag information.
And I'll talk about this now. In addition to having the ability to go in
and talk to another address space on the same function, it is actually possible to reach over
to a different SDXI function and source data from it, or store data into another one of the SDXI
functions that reside in a function group. I'll get to function groups here in a second.
And this is how you can handle VM-to-VM communication,
by bridging that way.
Steering tags, standard PCI Express type stuff.
In addition, there's also the attributes.
This is information about, effectively, cacheability,
the attributes of the underlying memory, if it's MMIO or not.
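Again purely as a mental model, a hypothetical A key table entry might carry the fields just mentioned. The names and widths are invented for this sketch; what matters is that privileged software, not the application, populates these entries.

    #include <stdint.h>

    /* Hypothetical A key table entry; the real format is in the specification. */
    struct sketch_akey_entry {
        uint8_t  valid;               /* populated by privileged software          */
        uint16_t target_function;     /* which SDXI function in the function group */
        uint32_t pasid;               /* process address space identifier          */
        uint16_t steering_tag;        /* standard PCI Express-style steering hint  */
        uint8_t  attributes;          /* cacheability, MMIO-ness of the memory     */
    };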
Okay, now getting to this concept of function groups,
about transferring data from across VMs.
So here I'm going to use the example of a PCI Express SR-IOV device.
As an example, again, with SDXI, this is not a hard binding,
just one of the easier things to visualize.
Imagine I have two cards, each plugged into the system,
and I'm not saying this is how someone's actually going to build one of these things.
I have two cards in a system.
Each card has a physical function on it, a PCI Express physical function,
as well as a number of virtual functions on it.
A hypervisor could then assign one of these VFs to each one of its different guest operating systems.
In addition, you have these physical functions.
These two cards are then bridged and can communicate with each other over some magical interface.
It's not actually defined, but just understand that these devices are cooperating with each other.
Each one of these devices, in turn, can have multiple contexts and also have different pieces of data.
Now, each one of these devices, of course, is going to be communicating through the platform's IOMMU.
So the security is actually a combination of the feature set provided by SDXI and the feature set provided by the platform's IOMMU for doing address control.
Now let's take a look at a case where we want to actually move data from one function to another.
This is an example taken to an extreme.
In this scenario, I have the function here,
function B, which is going to initiate the transfer of data.
And in this case, we're actually going to take the data
from this function as a source of the data,
and this function is the target of the data.
So what's going to happen is the producer is going to put into its descriptor ring a request,
please move something from address space A, address blah,
into address space C on function C, address blah blah.
And then it will then signal the doorbell on the function B.
This function is going to come in.
It's going to access the descriptor ring,
get the information and realize what the request is.
It is then going to go and access the A key.
Again, this is the initiator's access control.
Who are you actually trying to talk to?
And this was set up by privileged software on this function.
So this function's privileged software.
It'll then come back and get the information about both sides of the communication,
about what it wants to do.
And it will know the identifier of the function over here.
This function can then go in and it'll check with its R key table and say,
are you allowed to actually make this request? It's an anti-spoofing mechanism. And if it comes
back saying, yes, we allow the function over here to access us, this is all authorized, and the
parameters match up appropriately, it'll then be allowed to get the confirmation.
The same thing can happen over here on address space C,
where it gets the request back.
And then once this is done, this function can then read the data,
transport it back over, again, this magical interface.
You can, again, visualize this being cabled.
I don't think people actually do it this way.
We can visualize it being cabled,
and then the data being transferred to destination buffer C.
So by doing this, you have a secure communication path and a checked communication path between two virtual machines,
but there was never a need during this whole process to signal the hypervisor
or even the guest kernels about what's taking place.
Once these configurations are set up, the communication can run through seamlessly.
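Stitching that sequence together, the control flow on the initiating function looks roughly like the following sketch. Every structure and helper here is invented for illustration; the actual check semantics are defined by the SDXI specification and enforced by the functions together with the platform IOMMU.

    #include <stdbool.h>
    #include <stdint.h>

    struct akey_entry {                       /* initiator side: whom may I reach? */
        bool     valid;
        uint16_t target_function;             /* a function in the same group      */
        uint16_t rkey_index;                  /* what to present to the target     */
    };

    struct rkey_entry {                       /* target side: who may reach me?    */
        bool     valid;
        uint16_t allowed_requester;
    };

    /* Roughly the sequence function B walks through for one cross-function request. */
    static int authorize_transfer(const struct akey_entry *akey_table,
                                  uint16_t akey,
                                  const struct rkey_entry *target_rkey_table,
                                  uint16_t my_function_id)
    {
        /* Step 1: A key lookup, set up by this function's privileged software. */
        const struct akey_entry *ak = &akey_table[akey];
        if (!ak->valid)
            return -1;                        /* producer never had permission     */

        /* Step 2: anti-spoofing R key check on the target function's side. */
        const struct rkey_entry *rk = &target_rkey_table[ak->rkey_index];
        if (!rk->valid || rk->allowed_requester != my_function_id)
            return -1;                        /* target refuses this requester     */

        /* Step 3: only now does the data move, translated through the IOMMU. */
        return 0;
    }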
Now, of course, this can only take place when you have two or more functions
in what we call the SDXI function group. A function group is one or more
SDXI functions that can communicate together, and there's a mechanism
defined in the SDXI specification which lays out how you discover
what the SDXI functions are, as well as which functions are part of the same function groups and which ones are not.
So, SDXI itself.
SDXI is the result of an active community of technical work group members.
They contribute in many different ways towards the specification.
You can see, after the initial contribution of the spec to SNIA, you can read here the list of the different contributions that were made.
When you look at the team itself,
it's actually representing a fairly wide swath of the industry.
We have a couple of CPU vendors present in it.
We have several OEMs.
We have software operating system vendors present and hypervisor vendors.
We have hyperscalers as well.
So it's a pretty diverse group of individuals that have been actively contributing to the specification.
As far as what to expect, expect the SDXI 1.0 spec to become available imminently. I'm afraid I can't say exactly when, but just say soon, real, real soon.
As far as post 1.0 activities go, there's a number of things.
This is not a committed list, but there's a number of things.
We've already begun the conversations on what we're looking at for a post 1.0,
including some new data mover acceleration options,
some cache coherency options, management, architecture, et cetera,
quality of service improvements, CXL-related items, et cetera.
So quite a few areas that we have an interest in.
If you have any particular area, please come talk to me
or come talk to someone else who's involved in SDXI, and we can listen to the feedback. Here's a number of links to
information. Again, starting off with the current public SDXI spec, the 0.9. As a note, like I said,
SDXI 1.0 should be occurring very soon.
You can find more information on the SDXI information page here.
We've begun conversations with the persistent memory and computational storage communities about how computational storage, persistent memory, and SDXI can work together.
There have been a number of presentations also available that you can reference.
And there's also, like I said, a computational storage and SDXI subgroup that
has been formed. To participate in this, you need to be a member of both organizations.
Okay, so that being said,
let me just go ahead and open the floor.
I'm sure you have tons of questions.
Go ahead, right here in the middle.
Yeah, so I'm Tom Kaufman.
I had a question.
Sure.
So the whole idea of accelerators
is to get a small computational device
that you can offload.
You know, the CPU is a general processor, right?
But if you've got a whole bunch of people,
especially if you're doing virtual machines and you're trying to address accelerators, accelerators get
overloaded. What functions are there for people to accelerate or manage?
Ah, that's actually an excellent question. So the question
was... What's that?
Yes, that's a fair question. So the question was, for those
that might be listening to this later,
the question was what facilities exist to manage the SDXI functions as accelerators
given the possibility of virtualization or a shared environment?
Good question, really good question.
In fact, this is something that we talked about in the SDXI group about a month or so ago.
We had some conversations with one of the hyperscalers.
Right now in SDXI 1.0, there is very limited...
Okay, peeling a few layers back.
In SDXI 1.0, there is a degree of quality of service, but it's fairly limited.
One of the areas that we're going to be looking at in the future,
especially when we start putting these devices into a more hostile environment,
in a hostile environment, there's going to be a need to have better quality of service controls.
And that's one thing we're going to be looking at in post-SDXI 1.0.
That would be the goal, yes.
It's kind of, you think in terms of like crawl, walk, run.
So at first there's going to be a number of applications that can use it,
and then afterwards as we get more quality of service capability.
Now that's not to say there isn't any, there is some,
but the level of control and how fairness is handled,
especially when you're dealing in a hostile environment,
might not meet all customer desires.
Why hostile environment?
More than contention for resources?
I'm talking about something where you have someone who might try, for example,
bouncing from machine to machine trying to find one that's lightly loaded.
So we're not just talking simple noisy neighbor problems.
We're talking about, well, maybe a noisy neighbor,
but something where the users might be trying to find what's best for them, if you will.
So you can kind of think through scenarios like that.
Excellent question.
Other questions?
Go ahead.
I have a question.
Sure.
...which is a good thing because you are saving the memory and its speed.
Now, the other challenge that we had was
it was not very good for small data transfers.
And I think that challenge will also come here
because there is some overhead when you are trying to pass the messages or
you are checking for APs on the website.
Have you done any analysis of what is the threshold of the message size or the data size,
beyond which SDXI really shows a big benefit?
Right, so the question you're asking is you're comparing RDMA versus SDXI,
and you raise the concern over, well, you mentioned that when it comes to pinned memory,
not requiring pinning in this is definitely an advantage.
But you raise the concern over data transfer size and as far as efficiency.
So that's a good question.
The characteristics of using any accelerator is going to be implementation dependent.
Different implementations may have different profiles,
different points where the tradeoff goes one way or the other.
Today here, I'm here to talk about the SDXI specification,
not any particular implementation.
So that's not a question I can answer right now.
Like I said, the specification is going to be coming out imminently,
but this is more of a specification discussion,
not an implementation question.
But that's a very good question, though. So SDXI does not do the transfer itself, or it's only PCI, if it is any? It does not, like RDMA does, like, you know...
So the way I think about it,
we actually spent a fair amount of time
working on the SDXI specification
to, even though we use PCI as an example,
to not tie it to PCI explicitly.
We actually had to go back a few times and take things out or reword things
to reduce the binding to PCI Express.
I personally think of it more as a software interface.
This is the architecture.
It doesn't necessarily prescribe how implementations do their work.
The way data is moved within an implementation isn't actually specified,
but it's exposing what would those software interfaces look like.
If you're going to create a driver, be able to create a driver that works generically across all the implementations.
If you're going to create software that needs to move data, this is how you can move data.
This is how you can communicate, and this is how you can build your libraries and then your frameworks on top of those to move data.
So it's not tying itself.
For example, you look at this.
Nowhere does it, other than the few examples of PCI Express binding, it's not really calling for any particular bus-level wiggling of things.
It's all software architectural. That could be a disadvantage also, right?
Like in InfiniBand, like you said,
if you're going to do a remote pinning of the memory keys
and all those things, it's all built into the InfiniBand
transport headers, you said, and it goes out and all that.
But I think it's like NVMe also, right?
You map NVMe, which is a transport binding,
and SDXI does not talk about data transfer outside of the coherent domain.
So the way I think about it personally now is I think of it almost like RDMA,
but in the box, if you will.
However, that isn't actually technically speaking required.
It is possible, but that would be future work.
So if there was bindings to other things where you're talking about going from
vendor to vendor, that would be something else.
But like today, I kind of showed this and said, don't need any of this to communicate,
and I left that kind of as magical.
Right now, I would view that as something that's vendor-defined, and I would expect
that if you had a vendor, they would work potentially with their own devices or a subset
of their own devices for handling the function-to-function transport.
But if you start talking vendor-to-vendor or machine-to-machine across something,
that starts getting more involved.
Within a box, then, maybe SPDK
is the closest one if you want to move data
efficiently? If you wanted to go further up in the stack.
So if you will, this is kind of the lower level.
This is how you would go about saying
here's the hardware,
here's the interface so you can have
generic drivers, and here's how you can create things
with SPDK that's universal
so you don't have to port to every single system implementation.
Hopefully that answered your question.
Other questions?
Go ahead.
So, fast forward two years from now, when coherent switching may become more common, how would that work with SDXI? That's a fair question.
Okay, the question was, going forward in time,
when we start getting into larger coherent domains that are perhaps connected via CXL, do you see SDXI?
How would that work?
What would be involved there?
The first answer I would give is, even with what's defined today, in certain usage models, it would work seamlessly.
What I mean by that is if the functions that were intercommunicating were all appropriately hosted in the same location, they could certainly use the fabric to reach out. Like I showed on the earlier slide, SDXI was designed with the thought in mind of what's coming next, about all these different address spaces that could exist,
all these different memory technologies, different tiers of memory, different characteristics,
and how that would work. So if the initiator was properly located, just like if you had a system
with a CPU or a GPU
or some other device today that's connected to this fabric and it can then do work,
you could have a SDXI engine connected similarly and do work,
and again, it would just work seamlessly.
So the short answer today is that's fine.
If you end up having a situation where you have
multiple functions that are doing
distributed work and you're trying to do cross-function communication,
that's another layer. And that would be a question of, is that first necessary? And then if it is,
there'd have to be some work to figure out the ins and the outs of what's possible.
We only have a few minutes left. Any other questions?
Okay. Well, thank you very much for your time.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss
this topic further with your peers in the storage developer community. For additional information
about the Storage Developer Conference, visit www.storagedeveloper.org.