Storage Developer Conference - #180: SNIA SDXI Internals
Episode Date: February 1, 2023...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode number 180.
So I'm William Moyes. I'm here today to talk about SDXI, the Smart Data Accelerator Interface
and its internals. We'll be getting into the nuts and bolts of what makes this thing work.
But before we do that, we'll go over a couple other topics. If you've been following closely,
you may have noticed that we've had a bit of a change around in the schedule, a bit of a change
around in the presenter even. I was not the originally planned presenter. Originally, Shyam Iyer from Dell was going to come. He is the chairperson for the SDXI technical
work group. Unfortunately, due to circumstances beyond his control, he was not able to attend
in person today. So he asked me to come in and present on his behalf, and I'll be going through
the presentation. Any mistakes, problems, whatever, please blame me. Don't blame him. Okay. So we're going to go through kind of the story of SDXI, where it came from, what it's
about, and where it's going. First, we're going to cover some of the things you already know,
but may not have necessarily put together, the motivations of why we wanted to do SDXI,
why we felt this
was important, and how the systems are evolving.
We all see the systems evolving today, but we're going to talk about what's changed and
how these pieces come together, which is what motivates SDXI.
We'll then get into some of the use cases, some of the software application use cases
that further motivate why SDXI is interesting.
And after we talk about what we would have from a motivation standpoint,
we'll start getting into the goals and tenets of SDXI.
We'll go into an introduction of the internals of SDXI.
We'll start getting into the nuts and bolts.
It wouldn't be as detailed as reading the actual specification itself,
but it will give you a good overview of how it works, why it works,
so you can go back and know what you need to learn
if you want to learn more about this technology.
We'll also talk a bit about the SDXI community
that's developing the specification, as well as the future plans for SDXI.
And then, of course, the obligatory list of references,
so that when, after this conference, you have questions about SDXI
and want to learn more about it, you'll be able to see where you can go
to educate yourself. Okay, so let's start off talking
about what we already know, just to kind of act as a bit of a reminder. Let's talk about an
application and the resources an application needs to execute. And again, this is traditional, and
I'll be describing it using a number of bubbles.
Again, very high-level architecture, just by way of motivation. So today, we have an application.
The application itself, classically, legacy-wise, will be running on some form of compute,
usually a general-purpose CPU executing the operations of the application.
The compute engine would then be interfacing with system memory, traditionally DRAM.
And then, a little more modernly, the application may be multi-threaded, so we get multiple
threads of execution running on multiple compute cores.
The compute and the memory are coherent with each other.
The memory itself is for data in use,
the data actively being used,
and of course the application can directly access
that data content.
It's kind of a direct one-to-one correspondence
between what's in the source code
and what's actually taking place.
Now for persistent storage,
persistent durable storage,
there's going to be some type of disk device
or storage device connected via an I/O channel. That I/O channel itself is not directly accessible from the CPU, as in you just
can't go and take a pointer and point to a data blob on the disk. We'll ignore memory-mapped I/O
for right now for simplicity. Instead, the compute needs to instruct the I/O device to transfer the data from I/O to memory where the CPU can then do work on it.
So, again, we all know this.
We all understand this.
We've been doing this for decades upon decades upon decades.
And, of course, the connection between the I/O and memory has been optimized for bandwidth and latency.
Now, let's take a look at where we're
starting to head in the future when looking at the industry. And we're seeing a lot of emerging
efforts. Some of them are here, some of them are in development right now, etc. So, we have the
application, but instead of just having general purpose CPUs, we're seeing a proliferation of
new compute technologies, CPUs, GPUs, application-specific processors.
We're also seeing FPGAs being used for doing compute operations.
For the memory technology, and again, I'm talking about the online, active, direct interaction type of memory,
we have volatile memory.
DRAM is still around.
It's going to be with us for a long time still, I believe.
We have non-volatile memory.
There were some talks earlier today about that,
about some of the technology that's coming into play.
People are looking at memory that's
cheaper, that's more about capacity.
Memories that have different performance characteristics,
that are cold and hot, either determined by the
underlying technology or by the way you access the
memory. Again, if you read between the
lines in some of the talks that were given,
you can see these patterns emerging.
And then these technologies are all being brought together by different memory or fabric
or link technologies. CXL, Gen-Z, CCIX,
etc. These new technologies are coming into play. And so we're seeing
the industry start to gravitate towards these approaches.
These, of course, also have shared design constraints
around latency, around bandwidth, and about coherency, and about control and security as we move there.
When it comes to moving data between these different elements, there has been one way pretty much everyone has done it, and that is software-based memcopy.
It's there.
It exists.
It's stable.
It is well understood.
The security models are well understood
about how memcopy works.
Now, the challenge is,
if you're in a compute-constrained environment,
this can take away from application performance.
If it's not compute-constrained, it's a different story.
But if you're in a compute-constrained environment,
it can take away from application performance.
Also, when you're looking at context isolation, this can end up involving software overheads, context switches, that again can distract. Offload DMA engines have been around
for a long time. Many systems have these. However, the engines that have been created previously
had vendor-specific interfaces on them, which meant that you would have to have the application or the software stack developed
to specifically work with these devices.
And the adoption has been fairly low on using these vendor-specific DMA engines.
And then, again, looking at another trend in the industry,
there's been a push to try to cut back on some of these layers. There's been a big push towards user mode access to hardware. And we start getting
into a domain like that, having an interface that's proprietary to the vendor becomes especially
challenging, as opposed to something that is standardized, where the interface is universally
going to be adopted, like your memcopy operation.
When you think about this standardization,
think for a moment about what SATA did,
what AHCI did, or NVMe,
again, a programming interface that's standardized,
where you can have one piece of software
that works with a multitude of different hardware
from different vendors.
Think in terms of that when we talk about standardization.
And then further think about user mode applications and what would be involved there.
Now let's take a quick moment to go through application usage patterns.
And again, not to bore you, the first example is really simple, really straightforward.
Let's take a look at transferring data.
You have an application running.
It's a user mode application. It wants to move data from point A to point B, and it can just call memcopy today.
And again, if it's constrained in terms of compute power, this could take away some of its performance.
Whereas, if you think about what could happen if you had an accelerator to offload this,
instead you could have a circular FIFO buffer, a ring buffer, to accept the request for
transfer. The accelerator can then be alerted that there's something out there for it to do by
ringing a doorbell, sending a signal to the hardware to activate it. And then the accelerator
can move the data from its source buffer to the destination buffer. And upon completion, it'll
then send the completion back to software. Again, a pretty conventional pattern. But by using this pattern, if you have lots of data to move,
you can then offload those operations from having the general-purpose compute handle them.
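To make that producer flow a little more concrete, here is a minimal sketch of the pattern in C. Everything here, the structure layout, the names, the doorbell and completion handling, is invented for illustration; it is not the SDXI descriptor format, just the generic enqueue, doorbell, and complete shape described above.

    #include <stdint.h>
    #include <stdatomic.h>

    #define RING_ENTRIES 64                     /* illustrative ring size */

    struct copy_request {                       /* hypothetical work descriptor */
        uint64_t src;
        uint64_t dst;
        uint32_t len;
        uint32_t flags;
    };

    struct copy_ring {                          /* hypothetical producer-visible state */
        struct copy_request slots[RING_ENTRIES];
        _Atomic uint64_t    write_index;        /* advanced by the producer            */
        _Atomic uint64_t    read_index;         /* advanced by the accelerator         */
        volatile uint64_t  *doorbell;           /* posted MMIO doorbell register       */
        _Atomic uint32_t    completion;         /* status the accelerator writes back  */
    };

    static void submit_copy(struct copy_ring *r, uint64_t dst, uint64_t src, uint32_t len)
    {
        /* A real producer must also check the read index so it never overruns
         * the consumer; that bookkeeping is omitted here for brevity. */
        uint64_t idx = atomic_load(&r->write_index);
        struct copy_request *d = &r->slots[idx % RING_ENTRIES];

        d->src   = src;
        d->dst   = dst;
        d->len   = len;
        d->flags = 0;

        /* Publish the descriptor before advancing the index and ringing the bell. */
        atomic_store_explicit(&r->write_index, idx + 1, memory_order_release);
        *r->doorbell = idx + 1;                 /* alert the accelerator */
    }

    static void wait_for_completion(struct copy_ring *r)
    {
        /* Poll a status word in memory instead of taking an interrupt. */
        while (atomic_load_explicit(&r->completion, memory_order_acquire) == 0)
            ;                                   /* spin; a real application might yield */
    }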
Other use cases, other patterns, I should say.
Think through the case of where you have an application that's trying to store data to persistent durable storage, so to disk, et cetera.
Again, working through a conventional pattern of behavior, you have the application needs to copy the data into a kernel memory buffer.
The kernel itself needs to activate a driver for the, say, NVMe disk or some other storage device.
It would then need to copy that data into a DMA-able buffer, again using a memcopy.
Then the device will actually transfer the data using a DMA read into persistent storage, durable storage.
To get the data back out, the whole process has to basically be run in reverse, back up and through.
Now, if we look at what has taken place in going from this model, and again, I recognize
there are optimizations here, zero copy, et cetera. I'm not going to go into those. I'm just describing the
traditional path. If you start looking at what's been done with persistent memory technologies,
where the actual memory is available online in the system, where it's side by side with your
in-use memory. Think NVDIMM as an example that's
been around for a couple of years. In this case, a user application can simply memcopy the data
from point A to point B, and then do the couple of extra steps required to ensure persistency.
But it's a more straightforward process. But again, if you're compute constrained, it can take away.
Then if you were to have an accelerator, the data could be moved by the accelerator,
again, by enqueuing the necessary commands into that ring buffer and asking the device to do the transfer.
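For the "couple of extra steps required to ensure persistency," here is a minimal sketch on an x86 system with memory-mapped persistent memory. The flush-and-fence sequence is one common way to do it and is not something SDXI defines; the function and buffer names are invented for this sketch.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <immintrin.h>        /* _mm_clflushopt, _mm_sfence (x86) */

    #define CACHE_LINE 64

    /* Copy into memory-mapped persistent memory, then make it durable. */
    static void copy_to_pmem(void *pmem_dst, const void *src, size_t len)
    {
        memcpy(pmem_dst, src, len);                       /* the ordinary memcopy      */

        /* Extra step one: flush every cache line that was written ...               */
        uintptr_t line = (uintptr_t)pmem_dst & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end  = (uintptr_t)pmem_dst + len;
        for (; line < end; line += CACHE_LINE)
            _mm_clflushopt((void *)line);

        _mm_sfence();             /* ... extra step two: fence so the flushes complete */
    }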
To me, one of the most interesting case patterns is this third pattern.
Imagine you have two applications, each running inside a different virtual machine.
In this scenario,
these two applications wish to communicate with each other.
In this particular scenario,
the user application would then need to memcopy the data to the kernel.
The kernel would then have to activate I/O to perform the I/O transfer.
The I/O would take place moving the data around the system.
The data would come back into the other side's kernel,
and then from there it would be then copied back into the user application on the other side.
If there's a hypervisor running here and you have a virtualized hardware,
there may be even more memory transfers involved.
This structure is necessary because of, again, the need for isolation between the two VMs.
Now let's take a look at what could be done if you had an accelerator
which was aware of the system architecture, if you will.
You could go in and have the first VM queue up the transfer, alert the accelerator,
have the accelerator directly grab the data from its memory buffers,
perform the transfer, and write them back into the other application's memory buffers.
Here, not only are you getting the advantage of transferring the data
from point A to point B without having the software overhead,
you're also able to skip over the context switches into the kernel.
There's no need to invoke the hypervisor, no need to invoke the local guest kernels on these two sides of the equation
to perform this transfer, which can save quite a bit.
Don't worry, there is security here.
A bunch of my slides will get into the security aspects of this.
But you can start to see some of the benefits of what can be achieved here.
Again, thinking through, and wow, that doesn't really show up well on this slide, does it?
Thinking through again, with this proliferation of memory technologies,
an ideal accelerator will be able to handle the data transfer, not just for DRAM,
but for the whole gamut of technologies that are upcoming,
regardless of source, peer-to-peer transfers, for example,
transferring things between persistent memory and regular DRAM, or CXL-connected memory, for example.
Now let's take one more minute, from a motivation standpoint and reminding you about things that you are probably already aware of, to talk through how, in an ideal world, the software stack would look
and how you would build the software stacks up.
So starting off with, again, you have an accelerator.
The accelerator would then have a kernel-mode driver that would be in charge of discovering
the device, discovering its capabilities, and initializing this controller.
That kernel-mode driver could then allow a kernel mode application
to establish its ring buffer,
and that would allow kernel mode applications to transfer data for kernel purposes.
It could also be used, for example, for zeroing memory
or other activities such as that.
Then the next step would be to provide APIs and libraries so that a user mode application could interact with this kernel mode driver and obtain its own ring buffer.
So in this case, we have the kernel mode driver and the user mode application working together,
where the application can get its own ring buffer
so that it's able to enqueue its own set of commands.
And then these two can be isolated.
One part is taking place in kernel mode, the other inside the user mode application,
inside the user mode's address space.
And again, in an ideal environment,
this would be operating entirely
inside the virtual address space of the application.
No pinning, no needing to worry about those other details,
and it would be able to work with native addresses
within this application's own address space.
The application could then use the accelerator at will
to transfer things within its own address space.
Again, the same approach can be taken crossing these different memory technologies, current and upcoming.
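By way of illustration only, the layering described here might surface to the application as something like the following. These function names are invented for this sketch; they are not an SDXI-defined or vendor-defined API.

    #include <stddef.h>

    struct dma_context;                                     /* opaque per-application ring */

    /* The library calls below would be backed by the kernel-mode driver, which
     * handles discovery, privilege, and handing each application its own ring. */
    struct dma_context *dma_open(const char *device_path);
    int  dma_copy(struct dma_context *ctx, void *dst, const void *src, size_t len);
    int  dma_wait(struct dma_context *ctx);
    void dma_close(struct dma_context *ctx);

    /* Application view: plain virtual addresses within its own address space,
     * no pinning, no physical-address bookkeeping visible at this level. */
    static int accelerated_copy(struct dma_context *ctx,
                                void *dst, const void *src, size_t len)
    {
        if (dma_copy(ctx, dst, src, len) != 0)
            return -1;
        return dma_wait(ctx);
    }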
Now, if you have this ability, you could also allow multiple applications to start up,
each application having its own ring buffer, and they can live in different system address spaces.
In this situation here, the two applications can be separate,
but again, given a proper accelerator design,
the accelerator could move data from one address space into the other address space,
providing the appropriate security checks are in place,
and they pass successfully.
The same concept could then be extended beyond just two applications running inside one operating system.
You could have two virtual machines running side-by-side on one system.
And again here, you would have a hypervisor kernel mode driver.
That could then allow each one of the guests to have its own kernel mode driver.
And then within each one of these VMs,
different user mode applications could start up.
Guest-to-guest data transfer is possible,
as well as user mode application-to-user mode application
data transfer would be possible.
Okay, so that's the stuff you probably already know.
Let's start getting to the parts that you might not know.
Let's start talking about SDXI itself.
So SDXI is a Smart Data Accelerator Interface.
It's a proposed standard currently in development
that's designed to be extensible, forward-compatible,
independent of I/O interconnect technology,
and its purpose is memory-to-memory data transfer.
This was started as a SNIA TWG back in June 2020, and it was tasked to work on this proposed standard.
Right now, there's 28 member companies and 80-plus individual contributors to the specification.
The goal of the specification is to provide memcopy-like functionality, but be able to do this
memcopy-like functionality with the same security guarantees that you have with memcopy,
yet have this offload where the offload is independent of the ISA, independent of the
form factor, independent of the bus. In some of my examples, I will be talking about PCI Express as an example.
We have defined a PCI Express binding.
However, it is not limited to just PCI Express.
Other underlying buses could be used.
It is agnostic.
And when it comes to PCI Express, there is a PCIe class code defined, so that would allow for generic operating system drivers to be developed that can then work with these classes of devices from multiple vendors.
Some of the SDXI design tenets.
Of course, data movement between different application spaces, address spaces.
This includes different virtual address spaces for applications.
It also includes guest address spaces, host address spaces, et cetera. So the whole spectrum of guest physical, guest virtual, host physical, host virtual, it's all covered. User mode address
spaces, again, are included in the concept; that's motivating this.
This has the ability to handle data motion without mediation by privileged software.
Once you've established the privileges, the data transfers can start.
To set those up, yes, you need privileged software to handle security and checking, et cetera. But once that's been done, you're free to go.
The specification itself was designed to be very virtualization-friendly.
There's quite a bit of input from VMware, for example,
on how to make this a virtualization-friendly approach.
Even though a lot of people working on this are looking at hardware implementations,
it was also designed in such a way that you could do a software,
a completely software implementation of SDXI.
It has the ability to quiesce, suspend, and resume the architectural state
of the exposed functions on a per-address basis.
So an orchestrator of some form, hypervisor, whatever you might have,
can cleanly suspend an application that's running
or cleanly suspend a whole virtual machine that's running
and then restart it successfully
without major blips or major disruptions.
It can even potentially switch
from a hardware-backed implementation
to a completely software-based version if necessary.
Effort was put in, of course,
for forward and backwards compatibility.
And one of the key things, too,
is a lot of thought was put into making sure this was extensible.
There is room for other offloads
to be added into this,
leveraging the architectural interface.
One of the things I find most exciting
about SDXI is we've defined an interface,
and once that ecosystem's built up,
there can be extensions to that built
that then
provide for other additional offloads that still fit within the same framework. And then you could
then leverage the ecosystem work to extend it. So let's get into the details of SDXI. You start off with SDXI functions.
Just to help you visualize this, think of this as, this isn't 100% true, but just think of it this way.
Imagine a PCI Express adapter card.
Treat that as a PCI Express function.
Treat that as an SDXI function.
That is one of the defined bindings.
Whether that be a physical function, whether that be a virtual function as specified by SR-IOV, it doesn't really matter,
but you have one of these entities.
One of these entities could be assigned to a particular virtual machine.
That function then will have a small amount of MMIO space.
Almost all the data related to an SDXI function in the system
is held in system memory.
Only a small amount of data is actually held inside this MMIO. It's like a dozen 64-bit registers, and most of those aren't
fully populated. The function itself points to context tables. You can have multiple contexts,
which I'll get into in a moment, as well as the control. And this then eventually points to the
descriptor ring. This is the circular FIFO buffer that a producer can then enqueue information into. Each one of the descriptors, which we'll
talk about in more detail. It also has read and write pointers to help maintain this. And
there's also the doorbell. The doorbell is an MMIO register, a posted MMIO register. It's there for the purpose of alerting the
SDXI function. There's data waiting
there for it. There is not a... how should I make sure I phrase this properly?
It's a posted transaction.
It's an alert. There are
opportunities for the function to aggressively
check things. The function doesn't have to wait
for it to occur.
Some of the
ordering is a little bit loose, again, for performance reasons.
I won't go much further into that right here, though.
When descriptors are put into the descriptor ring, which describe the transfer taking place,
you'll have your source and your destination buffers there,
and then there's also a pointer to a completion status.
This is how you can get feedback.
One thing about SDXI: with many conventional DMA engines,
you submit work and then an interrupt happens.
SDXI actually does not mandate that or require that.
The generation of interrupts is actually optional for applications.
So there are multiple modes, and software can choose
what mode it's going to use to receive notifications.
One example is you can have this status come back, indicating that the operation is completed,
through content in memory.
There's also this A key and R key table.
Very briefly, we'll go into this more in a moment.
The A key is the permission for the function to,
sorry, the permission for the context
to go out and access someone else's address space.
The R key is actually on the receiver side
to say you're allowed to make this remote request.
Basically, it's how you handle different address spaces.
So A key, think of it as address space key.
It's an index to a table.
Everything I've highlighted here in this dashed line,
this right here is what is available to the producer function.
So if it's a user mode application,
this is what the user mode application is going to see,
these data structures.
Everything you see on the second circle
is what's going to be visible to privileged software.
So the privileged software can control the A key table content
to say,
this application can go touch this, and then it can also work with the R key, which we'll go into soon. In addition, there's an error log to provide more detailed error information.
If during the processing of a data transfer, something goes wrong, an illegal request is made,
there's a hardware error, et cetera, The producer would get the notification that things stopped, but the error
log will provide that more detailed information, the debugging information, if you will, about what went
wrong, to allow for diagnostics to take place.
Again, all the states are in memory.
The descriptor format is standardized, and it's easy to virtualize.
When it comes to these context tables,
one SDXI function is able to stand up many, many different contexts.
The actual number of contexts is implementation defined,
the maximum that it will support.
However, the spec allows up to 65,536.
So you can imagine multiple applications all sharing one SDXI function,
and each application can be given by the operating system
its own independent ring buffer
so that those individual threads can directly interact with it.
And so they can submit work.
I should probably note there's also support in this for multiple producers,
where multiple threads can actually put data into a single queue without locking.
Some of the SDXI magic.
Read the spec if you want to.
Look into the details of how that's done.
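The specification defines its own multi-producer mechanism, so read it for the real details; purely as an illustration of the general idea, one common lock-free pattern is to let each thread claim a unique slot by atomically incrementing the write index, as in this sketch with invented names.

    #include <stdint.h>
    #include <stdatomic.h>

    #define RING_ENTRIES 64                     /* illustrative, power of two */

    struct entry {
        uint64_t    payload;
        _Atomic int ready;                      /* consumer waits for this flag */
    };

    struct mp_ring {
        struct entry     slots[RING_ENTRIES];
        _Atomic uint64_t write_index;           /* monotonically increasing */
    };

    /* Each producer thread claims a unique slot without taking a lock.
     * (A real implementation must also respect the read index so producers
     * never overrun the consumer; omitted here for brevity.) */
    static void mp_enqueue(struct mp_ring *r, uint64_t payload)
    {
        uint64_t idx = atomic_fetch_add(&r->write_index, 1);
        struct entry *e = &r->slots[idx % RING_ENTRIES];

        e->payload = payload;
        atomic_store_explicit(&e->ready, 1, memory_order_release);  /* publish */
    }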
And again, the error log and the R key are there as well.
The descriptor ring itself, just for brevity and time, I'm not going to spend too much time on this.
It's a ring buffer.
You put information in.
It's a circular FIFO. The data
between the read index and the write index minus one is the valid data. The indexes that are used
in SDXI are monotonic. They constantly increase. The wraparound is handled.
You don't numerically do the wraparound; it's just handled by the index wrapping around.
The operations here, you insert your requests into the queue in order.
They're consumed from the queue in order by the hardware.
However, they can be executed out of order, and they can be completed out of order,
again, for performance reasons to optimize performance.
There are controls in place where you can say run these, and then stop and wait till these finish. You can have a fence to ensure that if you need consistency, such as
write this, you know, transfer this data, stop, then signal the completion. That's supported by
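A small sketch of how those monotonically increasing indexes behave: the slot is derived by masking, so the wraparound just falls out of the arithmetic. The ring size and helper names here are invented; the actual descriptor ring rules come from the specification.

    #include <stdint.h>

    #define RING_ENTRIES 256                    /* illustrative, power of two */

    /* Indexes only ever increase; the slot within the ring is derived by masking. */
    static inline uint64_t ring_slot(uint64_t index)
    {
        return index & (RING_ENTRIES - 1);
    }

    /* Valid entries sit between the read index and the write index; this count
     * stays correct even when the 64-bit indexes eventually roll over. */
    static inline uint64_t ring_occupancy(uint64_t read_index, uint64_t write_index)
    {
        return write_index - read_index;
    }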
SDXI. Okay, the actual descriptors. Inside this ring buffer,
each descriptor that gets submitted into the ring buffer is 64 bytes,
standardized size.
Inside the descriptor itself,
do I have a mouse?
No, I will use the pointer.
Okay, inside the descriptor itself,
first thing, notice, there's a lot of room and space available for expansion in
the future. Today, what we have is we have the valid bit saying the new descriptor is valid,
it's available. The control field, this is where you put the information about fencing
and other concurrency controls. Information is located here. Then we have the operation group
and the operation. All of the defined SDXI operations are put into different operation groups.
The groups are your DMA base, basic transfers. There's some atomic operations that take place,
algebraic operations. There's a whole set of administrative commands that only the administrative
context is allowed to run. One of the contexts, context zero, is reserved, and it's only for
starting, stopping, commanding things to take place.
There's space for vendor-defined commands.
Also not listed here, there's an interrupt operation group.
Within the DMA commands, there is the copy operation, obviously.
There's also rep copy.
This is basically like a memset, if you will.
And so this can then be used to zero memory.
There's also the write immediate, which is more just take this value,
put it right here, used more for signaling.
Same thing with the atomic operations.
These provide bitwise operations, atomic add, atomic subtract,
various compare and swap type operations.
Again, these are useful for signaling applications or whatever you dream up.
I already spoke to the admin operations.
Then at the end, there's that completion pointer, which is actually optional,
and this will then be either decremented or zeroed, depending upon the hardware and what's been requested, in order to signal this particular request has been completed.
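To keep those fields straight, here is a hypothetical 64-byte layout carrying the pieces just described. The real field offsets, widths, and encodings are defined by the SDXI specification and will not match this sketch; only the 64-byte size and the general set of fields follow the talk.

    #include <stdint.h>
    #include <assert.h>

    /* Hypothetical layout only; consult the SDXI specification for the real one. */
    struct sketch_descriptor {
        uint32_t valid_and_control;   /* bit 0: valid; other bits: fencing, etc.    */
        uint8_t  operation_group;     /* DMA base, atomic, admin, vendor, interrupt */
        uint8_t  operation;           /* e.g. copy, rep-copy, write-immediate       */
        uint16_t reserved0;
        uint64_t source_address;      /* 64-bit address, qualified by an A key      */
        uint64_t destination_address;
        uint16_t source_akey;         /* index into the A key table                 */
        uint16_t destination_akey;
        uint32_t transfer_size;
        uint64_t completion_pointer;  /* optional completion-status location        */
        uint8_t  reserved1[24];       /* room and space for future expansion        */
    };

    static_assert(sizeof(struct sketch_descriptor) == 64,
                  "descriptors are a standardized 64 bytes");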
Furthermore, the addresses.
Obviously, when doing a memory copy, you need more than one address,
but for brevity in the slide, it just showed one.
All the addresses are 64-bit addresses,
but an address by itself isn't very useful,
especially if you're thinking about devices like this
that can handle multiple address spaces.
So we have the A key.
The A key is an index into this A key table
that I think was green on the previous slide. And this specifies which address space I'm trying
to access. The table is controlled by privileged software. So privileged software can say,
here's your application. I've set up a queue for you. You can only access yourself, your own memory space.
Or it can start creating the bridges into other address spaces as appropriate.
And again, the addresses.
So here's the A key table entry itself.
It has the process address space ID.
It can have the steering tag information.
And I'll talk about this now. In addition to having the ability to go in
and talk to another address space on the same function, it is actually possible to reach over
to a different SDXI function and source data from it, or store data into another one of the SDXI
functions that reside in a function group. I'll get to function groups here in a second.
And this is how you can handle VM-to-VM communication,
by bridging that way.
Steering tags, standard PCI Express type stuff.
In addition, there's also the attributes.
This is information about, effectively, cacheability,
the attributes of the underlying memory, if it's MMIO or not.
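Again purely as a mental model, a hypothetical A key table entry might carry the fields just mentioned. The names and widths are invented for this sketch; what matters is that privileged software, not the application, populates these entries.

    #include <stdint.h>

    /* Hypothetical A key table entry; the real format is in the specification. */
    struct sketch_akey_entry {
        uint8_t  valid;               /* populated by privileged software          */
        uint16_t target_function;     /* which SDXI function in the function group */
        uint32_t pasid;               /* process address space identifier          */
        uint16_t steering_tag;        /* standard PCI Express-style steering hint  */
        uint8_t  attributes;          /* cacheability, MMIO-ness of the memory     */
    };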
Okay, now getting to this concept of function groups,
about transferring data from across VMs.
So here I'm going to use the example of a PCI Express SR-IOV device.
As an example, again, with SDXI, this is not a hard binding,
just one of the easier things to visualize.
Imagine I have two cards, each plugged into the system,
and I'm not saying this is how someone's actually going to build one of these things.
I have two cards in a system.
Each card has a physical function on it, a PCI Express physical function,
as well as a number of virtual functions on it.
A hypervisor could then assign one of these VFs to each one of its different guest operating systems.
In addition, you have these physical functions.
These two cards are then bridged and can communicate with each other over some magical interface.
It's not actually defined, but just understand that these devices are cooperating with each other.
Each one of these devices, in turn, can have multiple contexts and also have different pieces of data.
Now, each one of these devices, of course, is going to be communicating through the platform's IOMMU.
So the security is actually a combination of the feature set provided by SDXI and the feature set provided by the platform's IOMMU for doing address control.
Now let's take a look at a case where we want to actually move data from one function to another.
This is an example taken to an extreme.
In this scenario, I have the function here,
function B, which is going to initiate the transfer of data.
And in this case, we're actually going to take the data
from this function as a source of the data,
and this function is the target of the data.
So what's going to happen is the producer is going to put into its descriptor ring a request,
please move something from address space A, address blah,
into address space C on function C, address blah blah.
And then it will then signal the doorbell on the function B.
This function is going to come in.
It's going to access the descriptor ring,
get the information and realize what the request is.
It is then going to go and access the A key.
Again, this is the initiator's access control.
Who are you actually trying to talk to?
And this was set up by privileged software on this function.
So this function's privileged software.
It'll then come back and get the information about both sides of the communication,
about what it wants to do.
And it will know the identifier of the function over here.
This function can then go in and it'll check with its R key table and say,
are you allowed to actually make this request? It's an anti-spoofing mechanism. And if it comes
back saying, yes, we allow the function over here to access us, this is all authorized, and the
parameters match up appropriately, it'll then be allowed to get the confirmation.
The same thing can happen over here on address space C,
where it gets the request back.
And then once this is done, this function can then read the data,
transport it back over, again, this magical interface.
You can, again, visualize this being cabled.
I don't think people actually do it this way.
We can visualize it being cabled,
and then the data being transferred to destination buffer C.
So by doing this, you have a secure communication path and a checked communication path between two virtual machines,
but there was never a need during this whole process to signal the hypervisor
or even the guest kernels about what's taking place.
Once these configurations are set up, the communication can run through seamlessly.
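Stitching that sequence together, the control flow on the initiating function looks roughly like the following sketch. Every structure and helper here is invented for illustration; the actual check semantics are defined by the SDXI specification and enforced by the functions together with the platform IOMMU.

    #include <stdbool.h>
    #include <stdint.h>

    struct akey_entry {                       /* initiator side: whom may I reach? */
        bool     valid;
        uint16_t target_function;             /* a function in the same group      */
        uint16_t rkey_index;                  /* what to present to the target     */
    };

    struct rkey_entry {                       /* target side: who may reach me?    */
        bool     valid;
        uint16_t allowed_requester;
    };

    /* Roughly the sequence function B walks through for one cross-function request. */
    static int authorize_transfer(const struct akey_entry *akey_table,
                                  uint16_t akey,
                                  const struct rkey_entry *target_rkey_table,
                                  uint16_t my_function_id)
    {
        /* Step 1: A key lookup, set up by this function's privileged software. */
        const struct akey_entry *ak = &akey_table[akey];
        if (!ak->valid)
            return -1;                        /* producer never had permission     */

        /* Step 2: anti-spoofing R key check on the target function's side. */
        const struct rkey_entry *rk = &target_rkey_table[ak->rkey_index];
        if (!rk->valid || rk->allowed_requester != my_function_id)
            return -1;                        /* target refuses this requester     */

        /* Step 3: only now does the data move, translated through the IOMMU. */
        return 0;
    }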
Now, of course, this can only take place when you have two or more functions
in what we call the SDXI function group. A function group is one or more
SDXI functions that can communicate together, and there's a mechanism
defined in the SDXI specification which lays out how you discover
what the SDXI functions are, as well as which functions are part of the same function groups and which ones are not.
So, SDXI itself.
SDXI is the result of an active community of technical work group members.
They contribute in many different ways towards the specification.
You can see, after the initial contribution of the spec to SNIA, you can read here the list of the different contributions that were made.
When you look at the team itself,
it's actually representing a fairly wide swath of the industry.
We have a couple of CPU vendors present in it.
We have several OEMs.
We have software operating system vendors present and hypervisor vendors.
We have hyperscalers as well.
So it's a pretty diverse group of individuals that have been actively contributing to the specification.
As far as what to expect, expect the SDXI 1.0 spec to become available imminently. I'm afraid I can't say exactly when, but just say soon, real, real soon.
As far as post 1.0 activities go, there's a number of things.
This is not a committed list, but there's a number of things.
We've already begun the conversations on what we're looking at for a post 1.0,
including some new data mover acceleration options,
some cache coherency options, management, architecture, et cetera,
quality of service improvements, CXL-related items, et cetera.
So quite a few areas that we have an interest in.
If you have any particular area, please come talk to me
or come talk to someone else who's involved in SDXI, and we can listen to the feedback. Here's a number of links to
information. Again, starting off with the current public SDXI spec, the 0.9. As a note, like I said,
SDXI 1.0 should be occurring very soon.
You can find more information on the SDXI information page here.
We've begun conversations with the persistent memory and computational storage communities about how computational storage, persistent memory, and SDXI can work together.
There have been a number of presentations also available that you can reference.
And there's also, like I said, a computational storage and SDXI subgroup that
has been formed. To participate in this, you need to be a member of both organizations.
Okay, so that being said,
let me just go ahead and open the floor.
I'm sure you have tons of questions.
Go ahead, right here in the middle.
Yeah, so I'm Tom Kaufman.
I had a question.
Sure.
So the whole idea of accelerators
is to get a small computational device
that you can offload.
You know, the CPU is a general processor, right?
But if you've got a whole bunch of people,
especially if you're doing virtual machines and you're trying to address accelerators, accelerators get
overloaded. What functions are there for people to accelerate or manage?
Ah, that's actually an excellent question. So the question
was... What's that?
Yes, that's a fair question. So the question was, for those
that might be listening to this later,
the question was what facilities exist to manage the SDXI functions as accelerators
given the possibility of virtualization or a shared environment?
Good question, really good question.
In fact, this is something that we talked about in the SDXI group about a month or so ago.
We had some conversations with one of the hyperscalers.
Right now in SDXI 1.0, there is very limited...
Okay, peeling a few layers back.
In SDXI 1.0, there is a degree of quality of service, but it's fairly limited.
One of the areas that we're going to be looking at in the future,
especially when we start putting these devices into a more hostile environment,
in a hostile environment, there's going to be a need to have better quality of service controls.
And that's one thing we're going to be looking at in post-SDXI 1.0.
That would be the goal, yes.
It's kind of, you think in terms of like crawl, walk, run.
So at first there's going to be a number of applications that can use it,
and then afterwards as we get more quality of service capability.
Now that's not to say there isn't any, there is some,
but the level of control and how fairness is handled,
especially when you're dealing in a hostile environment,
might not meet all customer desires.
Why hostile environment?
More than contention for resources?
I'm talking about something where you have someone who might try, for example,
bouncing from machine to machine trying to find one that's lightly loaded.
So we're not just talking simple noisy neighbor problems.
We're talking about, well, maybe a noisy neighbor,
but something where the users might be trying to find what's best for them, if you will.
So you can kind of think through scenarios like that.
Excellent question.
Other questions?
Go ahead.
I have a question.
Sure.
...which is a good thing because you are saving the memory and its speed.
Now, the other challenge that we had was
it was not very good for small data transfers.
And I think that challenge will also come here
because there is some overhead when you are trying to pass the messages or
you are checking for APs on the website.
Have you done any analysis of what is the threshold of the message size or the data size,
beyond which SDXI really shows a big benefit?
Right, so the question you're asking is you're comparing RDMA versus SDXI,
and you raise the concern over, well, you mentioned that when it comes to pinned memory,
not requiring pinning in this is definitely an advantage.
But you raise the concern over data transfer size and as far as efficiency.
So that's a good question.
The characteristics of using any accelerator is going to be implementation dependent.
Different implementations may have different profiles,
different points where the tradeoff goes one way or the other.
Today here, I'm here to talk about the SDXI specification,
not any particular implementation.
So that's not a question I can answer right now.
Like I said, the specification is going to be coming out imminently,
but this is more of a specification discussion,
not an implementation question.
But that's a very good question, though. So SDXI does not do the transfer itself, or it's only PCI, if it is any? It does not, like RDMA does, like, you know...
So the way I think about it,
we actually spent a fair amount of time
working on the SDXI specification
to, even though we use PCI as an example,
to not tie it to PCI explicitly.
We actually had to go back a few times and take things out or reword things
to reduce the binding to PCI Express.
I personally think of it more as a software interface.
This is the architecture.
It doesn't necessarily prescribe how implementations do their work.
The way data is moved within an implementation isn't actually specified,
but it's exposing what would those software interfaces look like.
If you're going to create a driver, be able to create a driver that works generically across all the implementations.
If you're going to create software that needs to move data, this is how you can move data.
This is how you can communicate, and this is how you can build your libraries and then your frameworks on top of those to move data.
So it's not tying itself.
For example, you look at this.
Nowhere does it, other than the few examples of PCI Express binding, it's not really calling for any particular bus-level wiggling of things.
It's all software architectural. That could be a disadvantage also, right?
Like in InfiniBand, like you said,
if you're going to do a remote pinning of the memory keys
and all those things, it's all built into the InfiniBand
transport headers, you said, and it goes out and all that.
But I think it's like NVMe also, right?
You map NVMe, which is a transport binding,
and SDXI does not talk about data transfer outside of the coherent domain.
So the way I think about it personally now is I think of it almost like RDMA,
but in the box, if you will.
However, that isn't actually technically speaking required.
It is possible, but that would be future work.
So if there was bindings to other things where you're talking about going from
vendor to vendor, that would be something else.
But like today, I kind of showed this and said, don't need any of this to communicate,
and I left that kind of as magical.
Right now, I would view that as something that's vendor-defined, and I would expect
that if you had a vendor, they would work potentially with their own devices or a subset
of their own devices for handling the function-to-function transport.
But if you start talking vendor-to-vendor or machine-to-machine across something,
that starts getting more involved.
Within a box, then, maybe SPDK
is the closest one if you want to move data
efficiently? If you wanted to go further up in the stack.
So if you will, this is kind of the lower level.
This is how you would go about saying
here's the hardware,
here's the interface so you can have
generic drivers, and here's how you can create things
with SPDK that's universal
so you don't have to port to every single system implementation.
Hopefully that answered your question.
Other questions?
Go ahead.
So, fast forward two years from now, when coherent switching may become more common, how would that work with SDXI? That's a fair question.
Okay, the question was, going forward in time,
when we start getting into larger coherent domains that are perhaps connected via CXL, do you see SDXI?
How would that work?
What would be involved there?
The first answer I would give is, even with what's defined today, in certain usage models, it would work seamlessly.
What I mean by that is if the functions that were intercommunicating were all appropriately hosted in the same location, they could certainly use the fabric to reach out. Like I showed on the earlier slide, SDXI was designed with the thought in mind of what's coming next, about all these different address spaces that could exist,
all these different memory technologies, different tiers of memory, different characteristics,
and how that would work. So if the initiator was properly located, just like if you had a system
with a CPU or a GPU
or some other device today that's connected to this fabric and it can then do work,
you could have a SDXI engine connected similarly and do work,
and again, it would just work seamlessly.
So the short answer today is that's fine.
If you end up having a situation where you have
multiple functions that are doing
distributed work and you're trying to do cross-function communication,
that's another layer. And that would be a question of, is that first necessary? And then if it is,
there'd have to be some work to figure out the ins and the outs of what's possible.
We only have a few minutes left. Any other questions?
Okay. Well, thank you very much for your time.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss
this topic further with your peers in the storage developer community. For additional information
about the Storage Developer Conference, visit www.storagedeveloper.org.