Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNEA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snea.org
slash podcasts. You are listening to STC Podcast, episode 136.
Hello, good morning, good afternoon, good evening from everywhere around the world.
What an amazing way to chat today.
Technology has never been more important, and that's why it gives me great pleasure to introduce Smart Data Acceleration Interface,
a new technology that we are bringing to SNIA.
It's a new technical working group to standardize the interface for
memory-to-memory data movement and acceleration. I am Shyam Iyer. I'm a distinguished member of the
technical staff at Dell's Office of the CTO for Servers. I'm a co-founder and interim chair for
this new technical working group at SNIA. I'm also joined here by my esteemed colleague, Richard Brunner,
who is a CTO, principal engineer at VMware. He's also a co-founder for this new technical
working group. He's been a co-conspirator, co-collaborator, and gentle good friend and
mentor in this process with us. Let's take a quick look at the agenda that we want to get you through.
So before I get into the details,
we wanted to get you through the journey
of how we landed here.
So I absolutely want to talk to you about the problem
and the need for the solution
that we're trying to address.
Then I'll introduce SDXI to you.
And then finally, Rich will explain
some of the finer details and the concepts behind SDXI.
So let me paint the story here.
And this is a story that many of you are aware of.
There is an increasing need for higher core counts to enable compute scaling.
Compute density is also on the rise.
We've been working with our hardware partners, and they have been very generous with us.
They've increased core counts for us so that our applications can scale.
Increasingly, converged and hyper-converged storage appliances are enabling new workloads on server-class systems.
You might be wondering why a server guy is talking in a storage conference.
Now I don't have to explain that anymore.
Data locality is important. Single-threaded performance is under pressure.
The laws of physics stand in our way when we try to make a single core do better.
Also, IO-intensive workloads are becoming noticeable in terms of how they take away CPU cycles.
Network and storage workloads are also part of the same picture and can take away compute cycles.
Data movement, encryption, decryption, compression, the list goes on.
Let me draw your attention to a picture that I'm showing you here. Imagine the cores that you run in your server are the parallel lanes,
and the cars are the speeds that you get out of your cores.
It does not matter how fast you run your car.
If you've got cross traffic like that bus going through there,
you're going to get a bottleneck.
That's what happens when you have
cross-traffic taking over your server performance.
So let's take another look at a case study
behind why we need to accelerate intra-host data movement.
Let's see how the congestion builds up.
You might be running some of your compute infrastructure.
Many of them are VMs.
We know how to scale them using network stacks so that application demands can be met.
The storage may not necessarily come from your compute infrastructure.
You might rely on your hypervisor to get you the storage, or in many cases, you might have a separate storage infrastructure in the form of a storage VM. Or, if you really need a lot of storage, you might have an entire storage network behind you
to get all of your storage needs. We've been doing a great job here over the years. We've
been bringing a lot of innovation in the form of TCP/IP, RoCE, and iWARP, which are some of the ways that we can increase the bandwidth
and reduce the latency.
And the new increased speeds and feeds with 10 gig, 25 gig,
40 gig, 100 gig, and 400 gig Ethernet
are just going to make it better.
What else is happening?
Local storage is also undergoing a revolution.
With the advent of non-volatile memory express,
local storage latencies have considerably reduced.
With the reducing flash prices,
the capacity also has been increasing.
With persistent memory technologies,
starting with NVDIMM,
new memory technologies are going to further reduce
the latency envelope and expand the capacity. So, accelerating intra-host traffic is now very
critical to server performance. Why is this important now? Because generally, intra-host memory exchange consists of multiple buffer copies. You can see that there are multiple layers of software stack here that do software-to-software copies.
Kernel-to-IO and IO-to-hardware exchanges can leverage hardware copies, but software-to-software copies rely on the CPU. They're also more synchronous.
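To make that concrete, here is a minimal sketch in C of the difference between a synchronous CPU copy and an asynchronous submit-and-poll offload model. The function names below are hypothetical stand-ins, stubbed in software so the example is self-contained; they are not the SDXI API or any real driver interface.

```c
/* Illustrative sketch only (not the SDXI API): contrasts a synchronous CPU
 * copy with an asynchronous submit-and-poll model. dma_submit_copy() and
 * dma_poll() are hypothetical stand-ins for a real offload-engine driver. */
#include <string.h>
#include <stdbool.h>
#include <stddef.h>

static int dma_submit_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);   /* a real engine would DMA this in hardware */
    return 1;                /* pretend job id */
}

static bool dma_poll(int job_id)
{
    (void)job_id;
    return true;             /* a real engine would report actual completion */
}

void copy_sync(void *dst, const void *src, size_t len)
{
    /* The CPU executes every byte and is blocked until the copy finishes. */
    memcpy(dst, src, len);
}

void copy_async(void *dst, const void *src, size_t len)
{
    /* The CPU only submits a request; it is free to do other work
     * until the engine signals completion. */
    int job = dma_submit_copy(dst, src, len);
    while (!dma_poll(job)) {
        /* ... overlap other useful work here ... */
    }
}
```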
So what is our current data movement standard? Our current data movement standard is a stable CPU instruction set architecture.
Don't get me wrong, this has served us really well because that's how we can layer applications on
top which don't have to change with every next CPU generation.
However, there's another thing that happens here.
Because we now rely on the CPU instructions, they take away from the application performance. They also introduce overhead when we need to provide
context isolation. So what can we do?
We tried something. We said we'll go with
an offload DMA engine, and to do that we created
a simple experiment where we placed two Linux VMs interfaced to use a virtual network vSwitch.
As you can realize, there are software copies in the stack here.
And to test our theory, we tried to prototype in an FPGA two physical functions going as a pass-through to each of these Linux VMs.
And then we stitched together some back-to-back DMAs so we could do fast hardware DMA copies between the Linux VM on the left and the Linux VM on the right.
Results? They were very astonishing for us. Mind you, this is 2016. This was still a
prototype and still is a prototype. The point of this was just to prove that hardware can be very
beneficial. If I draw your attention to the graph here, the blue line is where software copies
start increasing exponentially in latency as the packet size increases, whereas
the hardware copy scales pretty well.
What's wrong with this?
They can be hardware vendor-specific implementations.
They will come with their own drivers.
Direct access with user-level software, while it can be designed, needs to get implemented
for every hardware vendor that we have to work with.
They will come with different kinds of APIs.
And generally speaking, this can have very limited usage models.
We were trying to aim for something big.
So what kind of solution requirements
are necessary to build a standard interface?
One, we have to absolutely offload IO from compute CPU cycles. My previous
picture just pointed to that benefit. Number two, we have to build architectural stability,
the kind of stability that we have enjoyed with our CPU instructions,
but not by using the CPU cycles for it. We have to enable application and VM acceleration,
but we also have to help migration
from existing software stacks.
So we have to create good abstractions in the control
path for scale and management, then very surgically
enable performance in the data path with offloads.
Everyone with me?
Something else that we have to think about as architects,
principal engineers, CTOs, and the CTO office: we look at the horizon for the different architectures that are going to become important for us. We're looking increasingly at memory-
centric architectures, which means memory is at the center of the universe so that different
kinds of compute elements can play.
To do this, we're looking at different kinds of memory interconnects like CXL and Gen-Z
to help serve those needs.
We are also trying to enable different types of memory so that the innovation potential
can be unlocked.
Heterogeneous architectures are suddenly becoming mainstream.
And now you will agree with me
that there is a need to democratize data movement
between these different memory tiers,
bridging the different memory interconnects
and let all of these heterogeneous compute elements play.
So let's see how we should build this.
If you want to build an accelerator,
where would you first think about building an accelerator?
Now, I told you I don't want to use my CPU cycles,
but why not build it in the CPU itself so that we don't have to have a lot of hardware
associated with an accelerator?
Yes, that's desirable. We
absolutely want to enable data memory copies from a context address space A to
a context address space B, because my applications could be in two completely
different address spaces. I want to enable direct user mode access for my
applications with that accelerator interface so that we can
eliminate a lot of the software context isolation layers that are there, to make this more performant.
Next, we want to be able to serve different types of memory for this accelerator.
Why? Because storage class memory is coming to us that brings us persistent characteristics.
So we will need different types of data services with that.
Increasingly, a lot of memory is going to be behind IO devices.
We need to be able to address data movement to that as well.
And then finally, with memory interconnects like CXL and Gen-Z, the system physical address space is going to increase.
And therefore, this data mover or accelerator needs to be able to target all these different memory address spaces.
We can't keep doing innovation with one CPU family and not replicate with another CPU family.
So it has to be a very standard, CPU-agnostic interface. Also, different types of compute elements coming to the data
center means that the same accelerator interface
needs to be available for a GPU, for an FPGA,
or a smart I/O device like a NIC or a drive.
When we do that, now we can leverage
a very standard specification, innovate around
the spec, and then add incremental data acceleration features.
And that's how we can solve an increasingly tiered memory world.
So now let me introduce the concept of SDXI. Before I talk about SDXI,
I want to acknowledge Philip Ng,
who is an AMD Senior Fellow,
co-founder and co-author for the spec.
AMD, Dell, and VMware
are contributing the starting spec
for this technical working group.
And we are very proud and excited
to partner with all of SNIA's TWG members.
So what is SDXI?
SDXI is trying to develop and standardize
a memory-to-memory data movement and acceleration interface
that is, one, extensible, it's forward compatible,
and it's absolutely independent of IO interconnect technology.
And may I add, different kinds of implementations should be possible with this. From a design point of view, we tried to think of some design tenets that
would be useful for a standard interface like this. For example, we want to enable
data movement between different address spaces, including user address spaces, both within and across VMs, and new address spaces that get defined.
We want to enable data movement without mediation by privileged software.
If privileged software comes in the way, then performance can suffer.
Of course, we want to make the connection get established first.
That's where privileged software comes in, which is why we want to allow abstraction or virtualization by privileged software. And there's something else that we want to aim for.
A lot of technologies get defined where virtualization is more of an afterthought.
We've done this groundwork with virtualization in mind so that now we can have the capability
to quiesce, suspend, and resume the architectural state
of per-address-space data movement,
which means all of our architectural states
are open and standard, no hidden states there.
And what does this do?
It enables live workload
or virtual machine migration between servers
and other types of
benefits that come with it. We want to enable forward and backward compatibility across future
specifications. So now we can have interoperable software and hardware, a key ingredient to make
a standard a success. Then we want to incorporate additional offloads in the future. I just don't want to copy from A to B. I want to be able to do some data transforms in line.
And then we want to enable a concurrent DMA model. That means multiple parallel DMAs should happen all the time without one obstructing the other. With these design tenets in mind, like I said, we decided we will try to do
a spec and we even tried to implement a small prototype with this. Let me explain our prototype.
So we implemented this SDXI prototype with an FPGA, with a driver and a kernel application. To compare, we used the same driver
and wrote a kernel application that did mem copies with this driver. On my right here,
I have this picture where the gray and the blue show me different kinds of software mem copies.
I can use a synchronous API with the driver or an asynchronous API with the driver
to get my software copy results. With our SDXI prototype, we can only do asynchronous,
because remember, this is hardware. And with SDXI, this is a Gen3 FPGA prototype,
we're hitting near the line rate. Remember, we're not even an FPGA company.
The reason for this is to show
that enabling the ecosystem is key to our success.
And therefore, we're saying that there are good benefits
to doing this spec and implementing with it.
That's why we want to invite one and everyone
to come up with their implementations here.
Come and partner with us.
Also, partner with us at a TWG level.
We're calling on other SNIA technical working groups, like the Persistent Memory working group,
because we want to target persistent memory with this data mover. We also want to talk to the computational storage groups, because storage is going to be an interesting space that we're going to target with this.
Networking and storage applications
can run with this data mover.
We also want to partner
with the Compute, Memory, and Storage Initiative,
part of SNIA.
Also, looking externally,
we want to partner with PCI-SIG,
CXL, OFA, UEFI, Gen-Z,
because, like I said, it's interconnect independent.
This data mover can be implemented with PCI, CXL,
any implementation or interconnect that you want to bring.
That's how we make it a standard.
So what is it that we're going to do in this TWG?
The base spec that we contributed,
we're going to take it to 1.0.
After 1.0,
we're going to define
new kinds of operations
that will enable
this data mover interface,
including persistent memory targets.
We want to create
cache coherency models.
We want to be able to
solve security features
that involve these data movers.
We want to create
a connection management architecture.
And we absolutely want to encourage
one and all to come join us in this effort. OS vendors, hypervisors, OEMs, applications,
data acceleration vendors, IP vendors, come and join us and make this a great success.
With that, I would like to introduce Rich Brunner to come and talk about more of the
SDXI concepts.
He has a penchant for rings, and I certainly think that you will like his talk.
Thank you, Shyam.
Let's dig into it.
So solutions to provide scalable data movement require not only acceleration, but a standard interface that supports software reuse and virtualization.
And so I'm showing again the list that Shyam shared earlier about the requirements for such an interface.
And I'm just going to highlight a few of these on this slide. So a standard interface
for data movement needs to work both within an OS instance or a virtual machine,
such as we've shown here, or between different virtual machines. And the key concept here is that all the accesses from the hardware
to the actual memory space that might live in a VM or process
is through the IOMMU as appropriate.
So that's a key point of the architecture.
The data movement, as Shyam said, we want that to work without mediation or interception by hypervisor or privileged software.
The minute you do that, your performance suffers. And we want to be able to allow live workload or virtual machine migration between servers using this technology across different kinds of hardware, all implementing the same standard set of operations.
So that's very important to truly be able to deploy this at scale.
So we're going to define a data movement engine as some number of functions,
which easily corresponds to a PCI Express physical or virtual function. And regardless of the
acceleration, there should really just be one way to set up and control a function
and just one standard descriptor format to submit work. Now a function has some
number of contexts and each context is an independent descriptor ring and its associated data structures.
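As a rough mental model of that layering, here is a small C sketch. The names and fields are purely illustrative assumptions, not the SDXI register or structure layout from the spec.

```c
/* Illustrative only: one function exposes multiple contexts, and each
 * context owns a descriptor ring plus its associated state, all in memory. */
#include <stdint.h>

#define RING_ENTRIES 256                       /* illustrative ring size */

struct sdxi_descriptor {                       /* one 64-byte unit of work */
    uint8_t bytes[64];
};

struct sdxi_context {                          /* one independent producer/consumer pair    */
    struct sdxi_descriptor ring[RING_ENTRIES]; /* circular descriptor ring                  */
    volatile uint64_t read_index;              /* where the function reads from             */
    volatile uint64_t write_index;             /* where software writes new descriptors     */
    /* ... completion blocks, address-key table, error log, all in memory ...               */
};

struct sdxi_function {                         /* maps onto a PCIe physical or virtual function */
    struct sdxi_context *contexts;             /* independent contexts, e.g. one per thread     */
    uint32_t num_contexts;
};
```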
Now all SDXI context state resides in memory.
There are no device-dependent specialized mechanisms to serialize state.
We don't need them.
And as a result, this makes this very easy to
virtualize. Now the context state is partitioned between privileged software for setup and control
and actual user mode access of the ring itself.
So we think that this lends itself well to a model where there is no mediation or interception of a user-mode process putting data right into the ring so that the function can begin its operations on it. Now, there can be multiple contexts, and this allows independent threads to issue and manage
operations in states without any blocking or coordination with other threads. It also makes
the ability to quiesce, suspend, and resume much easier and much more granular per producer-consumer pair,
rather than a broad hammer that affects everything.
And lastly, we have one standard way to log errors.
So you don't need n different ways.
We believe what we've defined here will work across lots of different classes of accelerators.
So like any data movement architecture, we need the concept of a circular ring,
and we don't need to keep redefining it for each and every class of accelerators.
So we have a simple ring, and of course it is a finite number of memory locations and it will wrap.
But the ring is managed by position-independent indices in user-mode memory that point to the beginning range of valid descriptors and the ending range
of valid descriptors. Now by using ring indexes, this allows easy suspend, resume, and relocation
of the ring without breaking major context data structures or applications. It makes virtualizing the ring a lot easier as well.
Now, the SDXI function starts reading descriptors at the read index pointer,
and it stops reading at write index.
And software starts writing new descriptors beginning at write index,
and it can write until the ring is exhausted.
Pretty straightforward mechanism.
And so write index minus one is always the place where the last valid descriptor exists.
So descriptors are processed in order by the function.
They can be executed out of order.
They can be completed out of order.
And the read index is incremented after each descriptor has been issued.
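Here is a small C sketch of that submission flow. The ring size, the 64-bit indices, and the modulo wrap are illustrative choices for the sketch, not the exact encoding from the spec.

```c
/* Sketch of submitting one descriptor with the read/write index scheme. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_ENTRIES 256

struct sdxi_descriptor { uint8_t bytes[64]; };

struct sdxi_context {
    struct sdxi_descriptor ring[RING_ENTRIES];
    volatile uint64_t read_index;   /* advanced by the function as it issues work */
    volatile uint64_t write_index;  /* advanced by software as it adds work       */
};

bool ring_submit(struct sdxi_context *ctx, const struct sdxi_descriptor *desc)
{
    uint64_t wr = ctx->write_index;

    /* The ring is full when the next write would catch up with read_index. */
    if (wr - ctx->read_index >= RING_ENTRIES)
        return false;

    /* Software writes new descriptors beginning at write_index ... */
    memcpy(&ctx->ring[wr % RING_ENTRIES], desc, sizeof(*desc));

    /* ... then advances write_index, so (write_index - 1) always holds the
     * last valid descriptor. The function reads from read_index up to, but
     * not including, write_index, incrementing read_index as work issues. */
    ctx->write_index = wr + 1;
    return true;
}
```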
Now recall that the pointers can be mapped to write-back DRAM.
And that actually works out really nice.
So for a function that is actually implemented,
such that it can be aware of any changes to the read and write
index location pointers, it is able to immediately start
accessing valid descriptors without the need to wait for a synchronizing
doorbell write to an MMIO location.
And so we avoid the need for a special instruction defined in any instruction set in order to
kick off this process.
However, from an architectural perspective,
because not all hardware implementations of functions can necessarily work this way,
and to allow virtualization,
the architecture requires that the doorbell be written
to ensure that new descriptors are recognized.
But you can plan that in such a way that it does not limit your performance. And that's the key point that we want to make here.
So this allows us then the ability to have a maximum number of operations that can be executing in parallel
without waiting for a serializing write,
as well as we have nice, well-defined boundaries
for quiescing and serializing the state and for error reporting.
That is very important in this architecture.
So we also defined a descriptor format that we believe easily lends itself to all sorts of different classes of accelerators.
Again, we don't need to keep reinventing things per accelerator or per device.
We can stick with just one. So our descriptor
is 64 bytes. Now there's room for future operations. I'm going to talk about them for a
moment. And it's even possible to describe descriptors that need multiple 64-byte blocks.
So we have that future extensibility there,
as well as the opcode space itself has room
for lots of future operations.
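To illustrate the idea of a single 64-byte descriptor, here is a hypothetical layout in C. The field names, widths, and ordering are assumptions for the sketch, not the bit-level format defined in the SDXI specification.

```c
/* An illustrative 64-byte descriptor, just to make "one standard descriptor
 * format" concrete. Not the actual SDXI layout. */
#include <stdint.h>

struct sdxi_desc_sketch {
    uint16_t opcode;         /* operation group + operation (DMA copy, atomic, admin, ...) */
    uint16_t flags;          /* e.g. "more 64-byte blocks follow" for larger descriptors   */
    uint32_t length;         /* transfer size in bytes                                     */
    uint32_t src_akey;       /* index into the address-key table for the source space      */
    uint32_t dst_akey;       /* index into the address-key table for the destination space */
    uint64_t src_addr;       /* 64-bit source address in the source address space          */
    uint64_t dst_addr;       /* 64-bit destination address                                 */
    uint64_t completion_ptr; /* points at a completion status block                        */
    uint8_t  reserved[24];   /* room for future operations and attributes                  */
};

/* The whole point: one fixed-size unit of work that any class of accelerator
 * can parse, with opcode space left free for future operation groups. */
_Static_assert(sizeof(struct sdxi_desc_sketch) == 64, "descriptor is one 64-byte block");
```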
Now, we have defined already in our spec some basic operation
groups.
And the details are there on the slide.
We have sort of a basic DMA operation group.
We have a group that allows us to do atomics.
Also, the way we've defined this architecture is that an admin context or an admin queue can be used to manage all of the other contexts,
all the other rings within the function.
So again, even the management model lends itself to the same properties of
well-defined boundaries and mechanisms that easily allow virtualization.
So you can see that many of those functions have to do with stopping and starting a context
and being able to manage its state and ensure that proper operation is carried out. And then lastly, we've put in the hooks to define
a connection set of operations that will allow virtual machines to connect
to each other through SDXI functions in a server.
So in our standard descriptor, we also obviously need some kind of completion status.
And so we've defined a completion status pointer to a completion status block.
This is initialized by software, and the completion signal portion is decremented by a function on success. Now this block
can be shared across multiple descriptors. So if you had some logical grouping of multiple
descriptors and you wanted to use the same completion status block, you could do that and
you could even tell which ones have finished and which ones haven't through this mechanism.
Now, the completion status mechanism also has the ability to generate interrupts.
But we're showing here right now the model where you can poll this in memory to see the result. Now errors are also flagged, and that's by writing a signature to the completion signal
and putting miscellaneous status information about the error as appropriate.
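Here is a minimal C sketch of that polled completion model. The structure fields and the error signature value are illustrative assumptions, not the spec encoding.

```c
/* Sketch: software initializes the completion signal to the number of
 * descriptors sharing this block; the function decrements it on each
 * success, or writes an error signature plus status on failure. */
#include <stdint.h>
#include <stdbool.h>

#define ERROR_SIGNATURE 0xFFFFFFFFu   /* hypothetical "error" marker */

struct completion_block {
    volatile uint32_t signal;   /* count of outstanding descriptors        */
    volatile uint32_t status;   /* miscellaneous error information, if any */
};

void completion_init(struct completion_block *cb, uint32_t num_descriptors)
{
    cb->status = 0;
    cb->signal = num_descriptors;   /* shared across a logical group of work */
}

bool completion_done(const struct completion_block *cb)
{
    return cb->signal == 0;         /* decremented by the function per success */
}

bool completion_failed(const struct completion_block *cb)
{
    return cb->signal == ERROR_SIGNATURE;  /* function flagged an error */
}
```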
So the other thing we should talk about is how memory locations are specified.
So we've defined a fairly straightforward way for this.
Memory locations are always specified as a triple.
So there is something that points to what the address space ID is for the operation, a 64-bit address,
and any appropriate cacheability attributes. Now again, as we've said
before, the addresses that are generated through this mechanism are always run
through the IOMMU as appropriate. That's a key concept.
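As a sketch, that triple might look like this in C, with illustrative field names and attribute encoding rather than the exact spec format.

```c
/* The "triple" form of a memory location, as an illustration only. */
#include <stdint.h>

struct sdxi_mem_location {
    uint32_t akey;        /* index naming the address space (AKey table entry)  */
    uint32_t attributes;  /* cacheability and similar attributes                */
    uint64_t address;     /* 64-bit address, translated through the IOMMU       */
};
```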
And so if you have multiple memory locations, then you will have multiple of these triples
as defined in the standard descriptor format. Now let's talk a little bit more about the address space ID. So this is actually an index into a context's address key table, the AKey table, which you might have seen in my earlier picture.
And an address key table encodes all the valid address spaces, PASIDs, that's process address space IDs, and interrupts available to a function context.
Any descriptor within a context can reference an AKey table entry. And so all the locations that our SDXI function can access on behalf of the software submitting work are captured in the address key table.
So let's look a little bit more about how that address key table comes into play for doing multi-address space data movement.
So we'll start with a simple example. We have an address space B where we have a producer,
and that producer has a descriptor ring where he is submitting commands to the SDXI function.
And he also has a source buffer that he wants to copy over to another address space. Now the allowed address spaces that he wants to be able to
access are, of course, his own, as well as address space C, where the
destination buffer is. So from an SDXI perspective, the AKey table has a mapping for address space C.
So it allows the producer in address space B to specify that he wants to do an access to address space C.
The other thing to note here is that the IOMMU is going to be programmed for each of these address spaces in order to allow the appropriate SDXI function access. There is a function that is bound tightly to address space B and a function that
is bound tightly to address space C. So SDXI function B is programmed into the IOMMU of address space B so that it can access address space B.
And SDXI function C is programmed into the IOMMU of address space C
in order to allow the same kinds of accesses.
So the copy descriptor operation is issued.
The SDXI function reads the actual operation, and it will read from the buffer in address space B.
And so that's the DMA read completion. The DMA engine will then hand it off to SDXI function C, which will in turn, using the IOMMU mappings, write into address space C for the destination buffer. So the
SDXI DMA engine has this sort of backdoor ability, if you will, to redirect the data flow through the appropriate function
to ensure that the permissions are set up properly.
And all of the permission stuff is done at a privileged software level.
And of course, it can also be intercepted by the hypervisor as appropriate.
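Here is a small sketch of how that simple example might be captured in an AKey table. The entry layout and setup are illustrative assumptions, not the spec's definition; the point is that the user-mode producer only names table indices, while privileged software and the IOMMU decide what those indices are allowed to reach.

```c
/* Illustrative AKey table for the example above. */
#include <stdint.h>

struct akey_entry {
    uint32_t target_function;   /* SDXI function bound to the target address space */
    uint32_t pasid;             /* process address space ID checked by the IOMMU   */
};

struct akey_table {
    struct akey_entry entries[16];   /* illustrative size; set up by privileged software */
};

/* Hypothetical setup by privileged software for this example:
 *   entries[0] -> address space B (the producer's own space)
 *   entries[1] -> address space C (where the destination buffer lives)
 * The user-mode producer then submits a copy descriptor whose source uses
 * AKey 0 and whose destination uses AKey 1; the DMA engine routes each access
 * through the function, and therefore the IOMMU mapping, for that space. */
```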
Now we're going to look at one more, more complex multi-address
space data movement.
In this case, address space B is actually still the one
that is submitting the work.
But now we want to put the source buffer in address space A
and orchestrate copying it over to address space C.
So address space A is, of course, also
mapped into the IOMMU in the sense
that the SDXI function A is allowed to be
able to read that address space.
So first, we fetch the actual operation of interest from the descriptor ring for address space B.
And the function returns the descriptor, and then the DMA engine knows that it has to use what appears as
function A through the IOMMU to be able to address address space A. The result of
the buffer copy is returned, and then it's handed over to what appears as SDXI function C, which has been permitted to do the write into address space C.
So this can be very powerful for certain scenarios. Many scenarios may live just within one address space model, or within multiple address space models with just one function.
But the architecture permits something as complex as this when appropriate.
Now, I don't talk about it in detail here, but certainly we have on our minds how we can chain not only different software components,
but how we could even potentially chain different hardware components with SDXI.
But that's something that will work itself out through the TWG.
In summary, as CPU cores scale, the usage and demand for ever larger and faster data
exchanges scale.
It scales among kernels, applications, virtual machines, and I/O devices.
And we see that future network and storage technologies will especially require this
scaling. Now, solutions to provide data movement scaling
and transformation require not only acceleration,
but we believe a standard interface that
supports software reuse and virtualization.
As Shyam said, Dell, AMD, and VMware
are contributing a proposed starting point
for this interface to SNIA.
And in this session, we've discussed some of the key concepts of this proposed interface,
but there are a lot of details.
And that's why SNIA has authorized our Smart Data Acceleration Interface Technical Workgroup
to begin work on fleshing out this actual interface.
We believe this
interface proposal is of broad value to the industry. We are really trying to
make this non-accelerator-specific and also future-proof so that the underlying
IO connection technology can be comprehended in the way
we've defined the architecture.
So, as Shyam
said, come join us in the TWG.
The details are
much,
much deeper than what we've been able to
touch here, but we hope
that you get the spirit of what we're trying
to do. Thank you.
Thank you.
Look forward to all of your questions.
Thanks for listening. If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.