Storage Developer Conference - #18: Donard: NVM Express for Peer-2-Peer between SSDs and other PCIe Devices
Episode Date: August 20, 2016...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
Today you are listening to SDC Podcast Episode 18.
We hear from Stephen Bates, Technical Director with PMC, as he presents
Donard: NVM Express for Peer-to-Peer between SSDs and Other PCIe Devices
from the 2015 Storage Developer Conference.
Hi, I'm Steve Bates.
I'm a Senior Technical Director at PMC.
We do enterprise storage, and I'll talk a little bit about that in a second.
I'm going to talk today about how we're leveraging
some of our NVM Express product portfolio
and IP and technology to look at how do we combine things like storage
and networking and compute that can all live on the PCIe bus
and make them all work very interestingly together
without necessarily having to go through the central processing
unit, or as a good friend of mine likes to call it, the computational periphery unit,
and do some interesting things that might tie into NVM over Fabrics. Amber hits me when
I say that because it's not standard yet, so I'd say NVMe over RDMA instead. I fear the wrath of Amber. It ties into hyperconverged,
so what happens if I take a box and I put some fast storage, maybe some processing,
like a GPGPU or an FPGA card, and then something like an RDMA NIC, and then put some software
on top and then tile that out like a billion times
or however many times these huge hyperscale companies are doing it.
Looking also at things like next generation all-flash arrays, which maybe don't use SATA or SAS.
Instead, they use RDMA to get into the box.
And then inside we have a PCIe fabric and NVMe drives that are built out of things like NAND, and maybe things
that aren't NAND, like 3D XPoint or resistive RAM or whatever like that. So this work is
all CTO office work; it's proof of concept, it's exploratory. I see some really friendly faces
in here, and we've been working with a lot of interesting, wonderful companies and people
and kind of seeing what all this does.
I don't have all the answers,
so this is really very much a technology discussion
and trying to work out what's interesting,
how do we make an ecosystem out of it,
what standards might have to be, you know,
worked on or impacted
in order to go make some of this happen.
So feel free to jump in with comments or questions as we go.
I'm going to try and move reasonably fast
and then have some time for discussions afterwards.
I have to do this bit.
So PMC, actually we have a dilemma.
My boss would probably kill me if he sees this slide,
but we're very much an old school, new school kind of company right now.
We traditionally have done very well over here.
And we've enabled storage companies, HP is a big customer, EMC, NetApp,
and also some of the newer storage companies in the public cloud,
some of the hyperscalers, to connect up a lot of drives,
a lot of hard drives typically.
And we manage those drives very well and provide that connectivity. And I'll be honest, we make a pretty nice sum
of money, not a huge sum of money, but a pretty nice sum of money doing that. And we've solved
hard problems for people over the course of the evolution of SAS quite nicely. But we've
also realized that things are changing,
and this business is not going to go away overnight,
and I don't think it will go away anytime soon.
But there's a lot of really interesting things that are happening,
and we need to take part of that.
So we have a new school division as well,
which is much more focused on some of these new things that we've been talking about at this conference and other places.
NVMe SSD controllers, that's something that we sell.
It's public knowledge that companies like HGST
and OCZ, which is part of Toshiba,
use us in their NVMe offerings
for their high-end enterprise-slash-datacenter-type SSDs.
And again, we also sell to the hyperscalers
who do weird and wonderful things
with our controllers
that are nothing like NVMe. To be honest, sometimes I don't even know what they're doing
with them. PCIe switching is a new business that we've got into. It's very nice because
it ties in very nicely with NVMe. NVMe runs over PCIe today, and NVMe over Fabrics is
all about providing distance and scale
to NVM Express by letting it run over other protocols like RDMA and Fibre Channel.
We also have, as well as simple fan-out switches, we have some what we call our switch tech
storage switches, which have some weird and wonderful things in them that they can do,
like non-transparent bridging, so multiple hosts can share drives. And anyone who was at IDF last month, we did a demo with
Intel that showed how multiple hosts could share a pool of NVMe drives and semi-dynamically
repartition those drives between multiple hosts at the same time, which is not something
that really anyone else has been able to do until now. But that's a problem we've got
to solve, because in the future,
you're all going to want to deploy lots of NVMe drives,
you're probably going to want to virtualize some layer over the top,
and you're going to want to be able to go, this guy doesn't need all this data,
or doesn't need all these drives, and I want to move some of them over here.
And these are problems that we're having to solve.
This is legacy, but it makes tons of money,
and this is new and groovy.
So what have we been looking at?
We've been building some ideas around this very straightforward architecture.
So this is going to be familiar to everybody.
We have a host CPU.
I don't really care what architecture it is,
but if you're in the world of enterprise and data center,
I think it's public knowledge that Intel have 108% market share.
What a joke.
We obviously connect DRAM.
There's not that many systems that will work that well without DRAM.
Unless, in fact, your workload is totally sequential.
Then you don't actually need DRAM because you can just page it in,
because you know in advance the data you want.
But that's a really weird problem that you're trying to solve, if you already know the data.
Then we have some kind of in-server fabric.
That's pretty much exclusively PCIe today.
That makes a lot of sense.
Intel, as one example,
gives us about a billion PCIe lanes for free
out of each processor.
And, you know, they will continue to do that.
And we connect wonderful devices
on these PCIe channels
that come directly out of the processor.
So I can put my NVMe SSD,
I can put some compute,
whether that's a GPGPU or an FPGA card
or something like a Knights Landing,
the Xeon Phi from Intel.
You know, we don't really care
as long as we have some way of making it do interesting data manipulation for us.
And then we obviously want to probably pull data in from the outside world,
otherwise we're just working in a vacuum.
And so, you know, we have networking devices which may or may not support RDMA, and may
or may not be based on Ethernet. So that's kind of the model that we're working in.
Excuse me. You know, the work that we've been doing is building on top of the standard NVM Express. Linux has been great in the sense that it's supported NVMe with an inbox driver for quite some time.
Can you pronounce the word at the top for me?
Donard.
So that's a good point.
Thank you for raising that.
So all the projects that I start at PMC are named after mountains in my home country of Ireland.
So Slieve Donard. Slieve is the Gaelic word for mountain.
Slieve Donard is a mountain just south of where I was born, in Belfast.
Thank you.
So, you know, I'm in the CTO office, so I want to be able to collaborate with other people.
This project is pretty much all open source.
I have blogs and white papers that tell you this is the hardware we use.
You can go order that or something that you think will be compatible with that
or something completely different if that's what you want to work on.
We have GitHubs that have forks of the Linux kernel.
We have branches off that for some of our different work.
And I'll talk about that in a little bit.
We have user space libraries that me and my team
have put together that tie some of this together
and do things like performance benchmarking.
And all of that is open source. It's licensed under GPL
where necessary because of the kernel;
the Linux kernel is GPL, so you have to keep that.
But anywhere else, the user space code is Apache licensed,
so you can pretty much do whatever the heck you like with it.
We don't really care.
And the idea is that we build this as a sandbox for people to play with.
And I know for a fact that some people in this room
have recreated some semblance of this
and done some work with us on that.
The one thing that I haven't got in this diagram that I
should probably put in that is optional, but I'll talk about why I think it should be a
little more mandatory than just optional, is a PCIe switch. Obviously with Intel giving
us a billion PCIe lanes, we might not need a PCIe switch because we can just connect
everything up to the CPU. And I'll talk a little about the pros and cons of that a little later.
So what are some of the goals and objectives of what we're trying to do here?
A lot of what got me started on Project Donard was thinking about, you know,
we build an NVMe SSD controller that when you configure it correctly, can
do a million IOPS. Now, if you configure it incorrectly, it will do minus seven IOPS.
But it's pretty easy to get it wrong. But, you know, there's products that are in the
market today, you know, like I said, HGST and OCZ, they're doing between three quarters of a
million and a million 4K random IOPS.
So we don't do the low-end stuff.
And in fact, some people might claim
our stuff is a little too high up,
but that's the data.
You can argue over that.
Companies like Mellanox and Chelsio,
I see what way I'm going here,
and I see you down over there.
These guys are doing these awesome RDMA NICs
that can do dual port 40
gig, 56 gig. They're now doing 25, 50, 100 gig. You convert that to IOPS, those are millions,
literally millions of IOPS. So this guy can do a million IOPS. This guy can do a couple
of million going even higher as we transition to 100 gig. These poor guys are trying to
do some hyperconverged, let's process some of this data,
it's literally drinking from a fire hose.
No processor on the planet is going to be able to do
any significant amount of data manipulation
on a data stream that's traveling that fast.
So if you're trying to do image detection,
or you're trying to do some kind of searching algorithm,
the reality
is that these guys are not the bottleneck. The bottleneck's over here. It's either processing
on the core, it's either on the fabric itself, the PCIe subsystem, or it's going to be something
to do with either the DRAM bandwidth or the volume of DRAM at your disposal. Something in here is probably going to get it.
So with Donard, I wanted to explore what happens when we repartition that working set.
So let's say, for example,
we introduce some element of computation here.
And rather than having data flows
that always have to go through the computational periphery,
maybe we can direct traffic from the networking device
directly to storage, and vice versa.
Is there a framework that we can build that allows this to happen?
This guy is still here.
He's still running.
You pretty much still need an operating system somewhere.
Somebody somewhere has to have some kind of,
even just for error handling and things like that, but maybe this guy,
I like to think of it more as the conductor of the orchestra rather than
someone playing the instruments. So this guy is managing flows, providing quality of
service metrics, responding to things in the outside world. But a lot of the data path, well, all of the data path, preferably,
is going what I call east-west.
On this diagram, it's north-south.
This slide is in the deck, so I'm going to kind of skip it.
It's just the hardware platform.
If somebody really cares, we can talk a little about that.
I'm going to talk about a couple of the pieces that we
use to build our puzzle.
And it's also like a plug for the company.
This slide,
you see it in a...
What's that?
You gave that as part of the...
Oh, yeah.
Oh, my goodness. What am I doing?
Sorry. You mean this one here?
Yeah, so what this is trying to show is that maybe if all this is doing is management,
I don't need a Xeon class and I maybe don't need as much DRAM.
Because the problem with NVMe normally is that you stage something in a DRAM buffer. So if I'm using standard NVMe,
if I wanted to pull an RDMA,
in fact Intel did a lovely kind of discussion on this
just earlier today,
but let's say I wanted to do an NVMe write to this drive
from somewhere else using RDMA.
Right now, what I'd have to do is RDMA in here
and then do that.
So you're effectively double buffering. But another
path is to maybe go direct, straight into the drive. And that's part of what we were looking at with
some of this work. Which means you might not need as much DRAM or as much DRAM bandwidth.
So I'll skip that. So we have this product.
PMC is not in the business of making solid-state drives. We enable our customers to do that.
With that being said, we do build board-level products.
So this started as something a customer asked us to do,
and then we ended up turning it into a generally available product.
So under this heat sink is our NVMe controller.
And this thing appears as an NVMe drive,
but it's not backed by flash.
It's backed by DRAM.
So it's basically very low capacity,
incredibly low latency.
You know, you think 3D cross-point is fast.
Well, DRAM's faster.
You know, it just costs more.
And it's got really, really good endurance
because it's DRAM.
I can just keep writing to it.
So this sounds crazy.
Why would I use this?
This is great for next generation
all-flash-array write caching
because I can write this guy
over and over and over and over
and over again in a log structure.
And as long as I have at least enough capacity
to store all my writes that are in flight, I'm good.
I don't need terabytes, I need gigabytes.
So this is not for everybody, but it is quite useful.
The other thing that's really, really good about it
is our controller presents to the operating system
as a block NVMe device, but
it has a second access methodology. We can expose memory as a PCIe BAR, and we can mmap
that into user space, and then I can basically do cache line accesses that change cache lines
on the DRAM. So this, for us, is, even if I told my CEO, I don't care if we sell one
of these, because I am learning so much about what next generation SSDs are going to look like from this, because
I have cache line accessibility, I have a memory semantic way of talking to this, I
still have my NVMe if I want it, which I do still want, because I want DMA engines, and
I get all of that on a PCIe interface.
So this is part of
you don't have to use this in
Donard, you can use anything you like
but this was very useful for me
because it exposes both
a cache line or memory
semantic access methodology
and also a block based methodology
and there's not really a lot of devices
out there right now that do both
block and memory
kind of access semantics like that.
Very useful for research.
And like I said, we are actually selling some of them.
We sell seven.
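As a rough illustration of that second access methodology, a minimal user-space sketch might look like the following; the sysfs path, BAR index, and mapping size are placeholder assumptions, not the actual product's layout.

```c
/* Sketch: map a PCIe BAR into user space and touch it with loads/stores.
 * The sysfs path and BAR index below are hypothetical placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* resource0 corresponds to BAR0 of the device at this (made-up) address */
    const char *bar = "/sys/bus/pci/devices/0000:04:00.0/resource0";
    size_t len = 4096;                 /* map one page of the exposed window */

    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    /* Plain loads and stores now turn into PCIe memory read/write TLPs
     * that the controller services against its DRAM. */
    mem[0] = 0xdeadbeefULL;            /* a cache-line-sized write           */
    printf("read back: 0x%llx\n", (unsigned long long)mem[0]);

    munmap((void *)mem, len);
    close(fd);
    return 0;
}
```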
So I understand you access the same memory,
the same cells, both by the block interface and the memory?
So you can set it up either that way, or...
slap the BAR over the whole device.
You could make the BAR bigger than you have space for, as long as you have some way of handling page holes.
So, it has a PMEM device style behavior as well.
We know exactly what...
So, everything that you've done in PMEM on an NVDIMM, we've done on NVDIMMs
and this. And in fact, on the I/O bus instead of the memory bus. Exactly. And so a lot of
the work that, you know, Intel and you guys and the SNIA work, all the NVDIMM work that's
happening, it has been fantastic because it's really helped us understand this as a concept
as well. So it's a good point. I mean, this is
basically an NVDIMM on the PCIe bus. When I expose it as a memory device, it's just
like having an NVDIMM.
Well, until you try to read.
What's that?
Read performance.
Well, sorry. So let's ignore performance. I'm talking about as an entity in the system.
There's always performance issues.
Writing performance will be pretty good, but reading performance will be remarkably low.
But then we have DMA engines, which are very, very fast.
Sure, but the block interface is probably a better way to read this.
Exactly. Very good point that you've got.
Sorry, just the gentleman in the back, and then Terry, I'll get to you.
Yeah, no, sorry, you were there.
How low is the low latency?
So I've got some slides on that.
So if you want to read a cache line of DRAM on this thing,
if you just assume we're up in user space,
we've mmapped this in, and we do a read,
that read gets serviced
in about,
it's around
between 600 and 800 microseconds.
But it's architecture-dependent.
So,
that's the kind of number you get.
No, did I say microseconds?
I meant nanoseconds.
Sorry, holy crap.
Otherwise this would be called a hard drive.
So basically, in that memory mode,
what happens is the operating system goes,
hey, I want to read this cache line on this mmapped file.
I'm talking Linux, and I will talk Linux semantics
all the way through, because that's the one I understand.
Apologies to anybody who works for another operating
system. But yeah, so if I try to read that mmap, what essentially happens is that falls
through the driver stack. It turns into a PCIe TLP, a memory read, basically. It comes
down, works out, oh, the memory read request is not for DRAM. I'm actually going out through an IOMMU, out onto the PCIe subsystem, enumerating my buses.
I'm going out to this device.
Basically, that memory read PCIe TLP hits this guy, hits our controller.
We service that TLP.
We pass that back as a completion.
That goes back through the stack and then into the OS.
There you go.
So then why do you use both of those things?
Why not always do them?
Because load-store semantics take up CPU cycles.
The reason we have DMA engines is because I can have a thread that goes,
I want to read a megabyte of data, and I don't want to sit here waiting. I'm going to switch to a different thread,
and you, DMA engine,
you're going to service that,
put the data in a buffer,
and you're going to raise an interrupt when you're done.
And so basically you switch threads,
you go and do something that's going to earn you some money,
hopefully, right?
And then at some point in the future,
the MSI-X interrupt triggers,
interrupts the operating system, and says,
hey, that thread that asked for that data,
it's time to go back and do your work because your data's now there.
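To make the trade-off concrete, here is a hedged sketch of the two access styles side by side; the device path is an assumption for illustration, and a real O_DIRECT buffer would need to be block-aligned (for example via posix_memalign).

```c
/* Sketch: the two access styles the card offers. Device paths are
 * illustrative assumptions, not the product's real names. */
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

/* Block path: the kernel builds an NVMe command, the drive's DMA engine
 * moves the megabyte, and this thread sleeps until the MSI-X interrupt
 * completes it -- the core is free to run something else meanwhile.
 * With O_DIRECT, dst must be block-aligned (e.g. from posix_memalign). */
static ssize_t read_via_dma(void *dst)
{
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, dst, CHUNK, 0);
    close(fd);
    return n;
}

/* Memory path: every byte is moved by CPU load instructions that become
 * PCIe memory-read TLPs -- very low latency for small accesses, but it
 * occupies the core for the whole transfer. */
static void read_via_loads(const void *bar, void *dst, size_t len)
{
    memcpy(dst, bar, len);
}
```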
There's a really, I have some data a little later that looks at what happens
when we raise an interrupt in an operating system.
I hear some people laughing. This is data that I had to go measure
and I couldn't believe nobody had put this online.
It's really very interesting.
It's not so interesting when you're reading a hard drive
where the access times are so long
that you could literally read a newspaper.
But as we move to Optane-based drives
from Intel and Micron,
the time taken for the driver to do its work
and the time taken for the hardware to do an MSI-X,
I think Jim
showed some of this, you know, the media access time
is really absolutely nothing.
Everything else is taking up
the bulk of the latency. And I think that's a problem
we need to go think about.
Either we move to the memory channel
in entirety and we give up
on super fast SSDs,
or we go, yeah, I see.
I think I have an idea, so, you know, this is, sorry, go ahead.
I just want to follow up on that.
So it's 600 to 800 nanoseconds for a read of a cache line.
And when I want to flush that, when I push that cache line back out.
Remember, this is a PCIe device, so this is a cache line going across PCIe.
This is not mapped into your normal memory space,
so it's not necessarily cache backed
by your L2, L3, or whatever cache.
That's architecture dependent,
and you're starting to get into some of the finer details
of what happens in your memory subsystem
on the given processor that you're working on.
We can start talking Intel specific, but I got to wonder when we hit NDA
problems and stuff like that. But you're right, Terry, there are things
you need to think about. Very similar to some of the NVDIMM stuff. Is it in a cache?
If I write it, can I guarantee it's got to this device and it's not stuck
in some stupid cache that somebody put there to make performance better, but now it's hurting me
because I'm trying to go to persistence.
Very, very, very similar problems
to what SNIA is working on
with the NVDIMM work.
But slightly different
because we're on the I/O bus.
Mind you, SNIA isn't solving those problems.
It's just waving red flags, saying,
somebody solve these problems.
Yeah, yeah.
That's fair enough.
Okay.
So, really, really interesting device.
And like I said, if anyone wants one,
well, if you're a customer of ours,
I'm sure we'll get you a loaner.
But if you want to buy one,
I'm sure we can sell you one for sure.
Very interesting device.
The other piece of the puzzle,
and this is the block I didn't have
in my architecture diagram,
but we have a PCIe switch product.
I'm not going to go into a lot of detail,
but it basically allows me to connect some of the things
and some of the devices.
And it's common knowledge.
GPU Direct, I don't know if anyone has worked with GPU Direct
or heard of it, but it's worth looking up.
I mean, basically, everything I'm doing here, I didn't invent.
I basically stole it from GPU Direct,
which is RDMA into the memory,
the I.O. memory on a graphics card,
and they use it in HPC all the time.
But when they first started doing that,
they realized, if I have my graphic card
and my NIC connected directly to the X86,
and I try to do direct traffic between the two,
something in the I.O. memory controller
in the Intel architecture
screws up east-west traffic.
Surprise, surprise.
The optimal performance path
is to go from the I.O. device
up into the memory system,
and then down from the memory system.
That's probably the path
that they cared the most about.
I apologize if anyone missed it.
But the problem is that if I want to start doing east-west traffic, as I call it,
I do hit performance issues as I try to move large PCIe memory transactions east-west through that subsystem.
So that is going to change as we go.
I have no control over that.
Some of the people in the room probably have more influence with Intel than I do.
Go tell them about it.
Go see if they're going to do anything about it.
Or if you slap a PCI switch in front of it, you're good.
Yeah, exactly.
So the switch...
So Terry works for Everspin,
but is also a member of our sales team,
I'd like to point out.
And he works for Microsoft,
who is also part of our sales team.
The point I'd like to make is that that east-west problem goes away
if you have a switch here.
For CPUs from whatever vendor, I can't say;
I only know that east-west traffic is treated badly by an Intel root complex.
I hope to do this work on an OpenPOWER server very soon
and compare, and I haven't tested on ARM,
so I can't comment on their east-west, but I can on Intel,
certainly on the current architecture.
What's the latency of the switch?
The switch from one port to another is about 160 microseconds.
Nanoseconds.
I'm getting a message.
Oh my gosh.
160 nanoseconds.
So it's not zero, but it's not huge either.
And obviously that's per hop.
So putting a lot of those in is going to cost you more.
That's just part of the thing.
So that's another piece of the puzzle we'll talk about in a minute.
So the great thing about having the switch is that the east-west traffic goes there.
I'm not going to hang around too long on this slide.
GitHub, Donard; not too many things are called Donard in the world.
You'll get a picture of a mountain
in Ireland, and you'll probably get
this GitHub site. We have
user space code, we have a fork of the
kernel. Right now, I think we're
rebased off 3.19.
Since then, there's actually been all the great
NVDIMM work that's gone in.
We need to rebase off 4.3
or 4.2-rc2
or something, because a lot of the new patches
from the
Intel folk and other NVDIMM
people are actually going to be quite useful
for this so it's quite an exciting time
for some of this work.
So some actual results.
We started off by doing some experiments
and these are quite old experiments
with GPUs.
These were Kepler class NVIDIA cards.
These don't have any kind of graphic port on the back.
They're not for that.
These are designed to crunch numbers.
Then we have an NVMe SSD.
This is one of our eval cards.
That's why it looks like a complete piece of crap.
It's not complete.
But this is an NVMe device.
That's what I had when I did this testing.
So what we were trying to do is,
let's just go directly from the storage device
to the memory on a GPU.
And NVIDIA have a driver.
Now, NVIDIA are well known for being
a-holes, for want of a better word,
in the Linux community,
because their drivers are somewhat proprietary,
in binary form,
and binary blobs are pretty much anathema
to the Linux community.
But they have to expose the symbols
for the functions in their driver,
because otherwise, how do you know
where their functions live in memory space?
We know what those functions do,
so you can actually tie in
to the functions that they're providing.
So basically, we rode on the coattails of GPU Direct,
and we basically invented something
that I called NVMe Direct,
except I knew that
Amber would hit me, so I changed it to Donard. And we basically are able to say, the operating
system can say, hey, I have a file on a file system or I have a region of LBAs on this
device, Mr. DMA engine on this NVMe SSD, can you please send it to these cache line addresses or these bus addresses? And this guy
will go, yes, I certainly will do that. And it starts pumping out data.
The PCIe system realizes that the destination for those TLPs
is actually in memory. It's exposed by this guy. And so
the data comes directly through here. If they're both connected directly to the
CPU, it'll have to go up, hit the CPU, go through the IO subsystem, and then out. If you have a
PCIe switch in there, it has the enumeration tables. It just throws the traffic directly
from one device to the other. When we did that, we measured two different things. This
is classical, and this is the Donard method. We measure just
raw bandwidth. How quickly can I do this? And then we also measure how much DRAM am
I consuming on the central processor? Because in the original approach, you're actually
double buffering. Because in normal NVMe, you would have to pull the data off the drive, put it in a DRAM
buffer and push that DRAM
buffer down to the graphics card.
In the new version you don't need
that. So this one is just
raw throughput. This one
is how much DRAM volume
I save. So it's kind of a figure
of merit.
So both figures of merit got better.
I'll be honest, I haven't...
I did these experiments before we actually
had a PCIe switch at PMC.
We only just got that chip back
from Fab a couple of months ago,
and this work was done probably almost a year ago now.
I need to go back and redo these
now that I have a switch,
because I think I can make this number even better.
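Reduced to pseudocode, the flow being described looks roughly like the sketch below: pin GPU memory so it has stable PCIe bus addresses, then aim the SSD's DMA engine at those addresses instead of a DRAM bounce buffer. The helper functions are hypothetical stand-ins, not the real Donard or GPUDirect entry points.

```c
/* Hypothetical sketch of the peer-to-peer read path described above.
 * None of these helpers are the real Donard/GPUDirect entry points;
 * they just name the steps. */
#include <stddef.h>
#include <stdint.h>

struct pinned_gpu_buf {
    uint64_t *bus_addrs;   /* PCIe bus addresses of the pinned GPU pages */
    size_t    npages;
};

/* 1. Ask the GPU driver (GPUDirect-style) to pin device memory and hand
 *    back bus addresses that other PCIe devices can DMA to.            */
extern int gpu_pin_pages(void *gpu_ptr, size_t len, struct pinned_gpu_buf *out);

/* 2. Build an NVMe read whose data pointers reference those bus
 *    addresses, and hand it to the drive's submission queue.           */
extern int nvme_read_to_bus_addrs(int nvme_fd, uint64_t slba, uint32_t nlb,
                                  const struct pinned_gpu_buf *dst);

int donard_style_read(int nvme_fd, void *gpu_ptr, size_t len,
                      uint64_t slba, uint32_t nlb)
{
    struct pinned_gpu_buf buf;

    if (gpu_pin_pages(gpu_ptr, len, &buf))
        return -1;

    /* The SSD's DMA engine now emits TLPs whose destination is the GPU's
     * BAR; with a PCIe switch in the path they never touch host DRAM.  */
    return nvme_read_to_bus_addrs(nvme_fd, slba, nlb, &buf);
}
```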
Why do you need any DRAM at all in the Donard case?
In this particular case?
Yeah, I mean, what's it doing?
Yeah, well, I mean, in this particular case,
I don't need any DRAM
except somebody's running an operating system.
Yeah, some OS orchestrated the two guys to talk to each other.
Yeah, exactly.
He doesn't even know it's happening?
No, he doesn't even know it's happening.
Well, he initiated it. He said, please move this from here to here.
But I have some cases later, which are RDMA-based,
where he doesn't even know it's happening.
So today, the GPU doesn't know how to run NVMe drivers.
This is why the CPU has to be involved here.
Yeah, it doesn't know how to run NVMe drivers.
Yeah. Yeah.
Yeah. Because theoretically you could have the GPU please...
Well, yeah, on a hardware level you could get rid of everything and just say, but, you know...
Yeah, and I know NVMe over Fabrics is kind of going down that path, but...
Yeah, in theory these guys could do this by just not even having a central processing unit at all.
Not at all.
It's just a PCI rack at that point, right?
And this guy is a DMA.
I mean, basically the rule is if one of them is a DMA master and the other one is a slave,
things are going to work.
If you've got two slaves, they can't do anything very interesting.
If you've got two masters, they can't do anything interesting because they don't have a destination to go to.
You can't DMA into another DMA's mailbox. You want an exclusive.
Yes?
Is that at a constant queue depth or IO depth?
This was large. This was a, how do I get the biggest number here, so this looks good. So this was pretty big queue depth, pretty big IO.
I don't have a latency number here, but we can go back and measure that.
Yeah, and I got a, you know, some of these results are definitely, you know, I would
like to do, redo a lot of this.
I have a switch, I want to measure latency, I want to measure how busy is the
CPU working from a perf point of view when I'm doing it this way versus that way. To
be honest, I can't do everything I want to do. I'm hoping some people in the room will
find it interesting enough to come and ride on this roller coaster.
So, pretty interesting. For this particular,
and all the code to generate these numbers
is all available in the open source repositories,
so people can certainly try and recreate them
and help with that.
So, for that GPU example,
we actually decided to write something
that's a little more interesting.
This is for a demo that we were doing for somebody.
So, we did a needle in a haystack.
So what we did is we went to, I think it's MIT,
they have this big image database
that's used by academics for certain image recognition.
And what we did is a needle in a haystack problem.
We took the PMC logo and we randomly...
Where's Wally?
It's where's Wally.
Yeah, exactly.
So we basically took this PMC logo
and we basically buried it in these images,
a small set of the huge database.
And then we basically put that entire database on the NVMe device
and we wrote some code using CUDA
to go do convolution on the images with our needle,
which is our logo,
and go, please, you know, out of these 10,000 images,
find the ones that had this logo in them.
And basically that's what we did.
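For reference, the CPU baseline of that needle-in-a-haystack search is essentially a sliding-window comparison like this illustrative sketch (grayscale images, sum of absolute differences rather than a true convolution; it is not the CUDA code from the demo).

```c
/* Illustrative CPU-only needle-in-haystack match: slide the logo over the
 * image and report the best-matching offset. Purely illustrative. */
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

void find_needle(const uint8_t *img, int iw, int ih,
                 const uint8_t *logo, int lw, int lh,
                 int *best_x, int *best_y)
{
    long best = LONG_MAX;

    for (int y = 0; y + lh <= ih; y++) {
        for (int x = 0; x + lw <= iw; x++) {
            long sad = 0;   /* sum of absolute differences for this offset */
            for (int j = 0; j < lh; j++)
                for (int i = 0; i < lw; i++)
                    sad += labs((long)img[(y + j) * iw + (x + i)] -
                                (long)logo[j * lw + i]);
            if (sad < best) {
                best = sad;
                *best_x = x;
                *best_y = y;
            }
        }
    }
}
```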
We did some comparisons between
how many pixels per second can I do
when I'm just using a CPU.
We did it using CUDA,
but without the Donard methodology of DMAing directly.
And then we also did it with Donard.
And we compared a hard drive,
a solid-state drive,
obviously the NVMe drive,
and we got a pretty good speed-up with the SSD.
Obviously, we can't do the Donard path with the HDD
because there isn't an NVMe-compliant HDD on the market.
I think somebody might be working on one.
Anyway.
But nothing that we did on this slide is non-standard NVMe.
Right?
It's just NVMe.
This should work with any NVMe 1.1 compliant drive.
The performance numbers will change, obviously,
depending on the drive, but that's what we got. And then we also did some analysis to
work out where's the bottleneck. So, interestingly, the bottleneck was different in each of the
three places. For the case where the CPU was doing the image convolution, it was processor
cycles that were the bottleneck. Even
with multiple threads, it just takes time to go through all those images and say, does
this convolution of these two generate some kind of impulse? That's not surprising, in my
opinion. With the CUDA version, because we were now pushing the inner core of that image
recognition onto the graphics card, the problem actually became DRAM bandwidth.
That's the problem.
Because we're moving data all the way into DRAM
and all the way out again,
so it's basically an in-and-out that we don't want.
And then with Donard,
we were actually able to make
the limiting factor what we want it to be:
the graphics card.
So we were able to repartition the problem
and push the bottleneck to where I think I wanted it to be,
which is on the graphics card.
Because I can always deploy more graphic cards.
I can have one SSD and two Teslas
or whatever kind of machine.
And I am going to run way over if I keep going at this pace,
so I'm going to jump in.
We did some demonstration work.
We did some work with Chelsio.
I see you over there.
Thank you for that.
And we did some work
with Mellanox as well. So we're definitely looking more
at RDMA. These results are now looking at RDMA and NVMe. This is a very interesting
area as well. This was a problem that we wanted to solve. My background is enterprise storage
and write commits are really, really important. We talked about that as well in the end. If
I'm a remote client and I want to write some data, I can't let go of that data until the
write has been acknowledged. Because I don't know if I'm going to get the act back
until I get it.
The response time for an
acknowledgement on a right is normally
a pretty interesting thing
to know in a storage system.
It doesn't matter whether it's direct attached
storage or network storage or some other
kind of storage. Typically
you're interested in how quickly
can I get my write to persistence
and get an acknowledgement back to whoever initiated the write. Okay, pretty, pretty
important stuff. Now, you know, in an RDMA meets NVMe kind of world, a standard write
path for a write would be this blue line. So the write would come in through the RDMA
connection. You would have an MR, an RDMA memory region, declared somewhere in DRAM. And the data would end up there.
You would basically notify this processor that there's new data, or it would be polling.
And it would basically then do an NVMe command to move data from that buffer out to the NVMe
device. And then you have to, once that write is acknowledged
back to the driver, it then has to generate an acknowledgement over the RDMA
network back to the initiator.
That's your delay path. So network delay, DRAM delay,
context switch due to the interrupt generated by the SSD,
and then the acknowledgement back to the client.
Lots of steps there. So what we were doing in this work is looking at, can I actually
just push directly into some kind of persistent zone here? Now, this is a DMA master. This
is an RDMA NIC. Typically RDMA NICs can only be masters.
I'm sure these guys can correct me, but typically they're masters.
This guy in normal NVMe mode is a master. I already said two masters
can't talk. But our NVRAM card can be a master and a slave.
Slave is the DMI mode, or the direct memory interface mode. So we actually basically used our card as a proxy for a next generation NVMe SSD
that is capable of exposing some memory type semantics on it.
So we're now getting into the world of non-standard, or NVMe as it doesn't exist today,
but maybe NVMe might exist sometime in the not too distant future.
We did some comparisons on, you know, what was my bandwidth?
I don't have latency numbers here.
I really should have latency numbers here.
Sorry, I apologize for that.
It's pretty important.
But I don't have them right here.
And we can certainly work to get those numbers for anyone who's interested
and then we also looked at
the
DRAM utilization
and what we did there is there's a program,
I think it's called NPR or MBR,
and what it does is it runs a background
process on available threads
and it basically hammers DRAM
and then sees
how quickly it can get to DRAM.
Back to something again we talked about
in one of the talks earlier.
If I'm using my DRAM
bandwidth to do all this blue path,
how much do I have left over
for processes and threads that might be trying
to make me some money? Because I'm not making
money moving data. I'm making money
telling Terry
what kind of car he wants to buy next week
or working out what kind of advert
to put in front of Stephen
as he's having his dinner.
So moving data is important,
but we don't directly make money
by moving data.
And now we start to get into
the interesting world of NVMe over Fabrics,
NVMe over RDMA.
So we did some work with our friends at Mellanox.
And this work is from demonstrations that we did at Flash Memory Summit.
And there's lots of blog information on that.
But we did an example, a prototype of NVMe over RDMA, or NVMe over Fab fabrics. So what we did is we have a server here. This
server actually had a PCIe switch. We got our PCIe switch back from TSMC two and a half
weeks before Flash Memory Summit. And I went to my guys and went, I'm buying everybody
in that team a beer if I can have it in the demo. And they were like, two and a half weeks?
One beer?
They thought it was one beer each.
Oh, dear.
So we have the awesomeness of having a PCIe switch here.
And that's not so relevant for this slide,
but the next slide is very relevant.
And what we did is we took the standard inbox NVMe driver,
and one of our guys basically worked it.
He wrote a client version, and he wrote a server version.
Now, you could do it as one module
and then use module parameters or sysfs or whatever to do it that way, but we just split it in two.
The client driver is here on the client.
The client device has no direct attached storage.
It doesn't have an SSD.
But what happens is we expose a virtual device, a /dev/nvme0n1.
As far as this guy's operating system is concerned,
he has a direct attached NVMe SSD connected to him. He can do admin commands, he can do
command line interface, he can do FIO. It's just a block device. The performance, obviously,
is going to be a little different because he doesn't really have real hardware there. When he issues, for example, an NVMe read, what happens is it goes into
the NVMe driver. We intercept that. We go, hang on, this is NVMe over RDMA. I'm going
to take this IO. I'm going to pass it over to the RDMA part of the kernel. It gets RDMA
encapsulated by the RDMA NIC. We've already established a connection beforehand, so assume that bit's already been taken care of.
The data basically gets
RDMA'd over to a buffer
that we've already set up in the memory region
here in DRAM.
This side still has DRAM, so just bear with me.
At that point,
the driver on this side kicks in.
We have a polling operation
on a completion queue.
And the driver on this side goes, I've got new data.
He takes that data.
He works out his NVMe command.
He basically passes that command to essentially the standard NVMe driver.
And that performs the operation on the NVMe drive.
And then the response comes back the same way,
and the acknowledgement goes back to the initiator.
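Conceptually, the target side described here is a polling dispatch loop. The sketch below is a heavily simplified illustration with hypothetical structures and hooks, not the actual prototype driver.

```c
/* Hypothetical sketch of the target-side dispatch loop described above. */
#include <stdint.h>

struct fabric_cmd {
    volatile uint32_t ready;     /* set by the RDMA NIC when a command lands */
    uint8_t  nvme_cmd[64];       /* encapsulated NVMe submission entry       */
};

/* Illustrative hooks, not real kernel interfaces. */
extern int  local_nvme_submit(const uint8_t *sqe);      /* hand to NVMe driver */
extern void rdma_send_completion(int slot, int status); /* ack the initiator   */

void target_poll_loop(struct fabric_cmd *ring, int depth)
{
    int head = 0;

    for (;;) {
        struct fabric_cmd *c = &ring[head];

        if (!c->ready)
            continue;            /* spin: polling instead of taking interrupts */

        int status = local_nvme_submit(c->nvme_cmd);
        rdma_send_completion(head, status);

        c->ready = 0;
        head = (head + 1) % depth;
    }
}
```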
We did this in... We basically were seeing about 6 microseconds of additional latency
through the RDMA stack.
So if I do a read on the direct-attached drive,
I got 40 to 50 microseconds.
It's a NAND-based SSD. With an Optane drive,
I think Intel are publicly quoting something around 10 in their IDF demo.
But we're adding about 6 microseconds.
I think NVMe over Fabrics is targeting under 10 as a good place to be.
And obviously the individual vendors, of which there are several
in this room who are active in this space,
obviously compete with each other and try to get
those numbers to where they think makes
sense for the market.
Well, if you have Ethernet here,
it's going to take you a few microseconds
just to go back and forth.
So, you did pretty well.
Yeah, so we did this, this was RoCE v2,
so we're running over Ethernet.
We did direct attach, so we didn't actually have a switch off here.
Switches, they have latency too, right?
But, you know, this gives you a ballpark of where we are.
We actually also ran it, the Mellanox NICs we had also run InfiniBand,
so we also ran it in IB mode.
We didn't see a big difference between the two,
but we didn't have any switches in there.
That's going to, like I said, that will change going forward.
I'm going to skip this for the sake of time and kind of get to the good stuff. So this is basically
the interesting modification that we made
when we realized that our NVMe drive
can be both a DMA master and slave.
And here we have some interesting repercussions
for latency and for access semantics
across a network
tied into, again, the NVDIMM work.
But in this case, what we did is we basically said,
rather than having to kind of go all the way through DRAM,
is there a way that we can do it so we can go directly into this device?
And because this device is, like I said,
both a block-based device and a memory access mode.
We were able to do that.
And so in this particular example,
we weren't doing NVMe anymore,
certainly not standard NVMe,
but we were basically accessing persistent memory
on an I.O. memory device
behind a PCIe switch using RDMA.
So this guy doesn't need to do that.
This guy ain't doing jack. Well Well he's initiating the connection, he's going to manage any error
events, he's going to run an OS, but because it's RDMA, he doesn't even get notified of
anything in this particular scenario. I can come in and I can make cache line modifications
on this PCIe card. The traffic is all behind this switch.
And it's all happening from this client control.
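A hedged sketch of the key step that makes this possible: mmap the card's BAR and register it as an RDMA memory region so a remote client can RDMA-write straight into it. The device path is a placeholder, queue pair and connection setup are omitted, and registering I/O memory like this generally depends on the Peer-Direct patches mentioned later in the talk.

```c
/* Sketch: expose a PCIe BAR as an RDMA-writable memory region.
 * Paths are placeholders; QP/connection setup is omitted for brevity. */
#include <fcntl.h>
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int expose_bar_for_rdma(struct ibv_pd *pd)
{
    const size_t len = 2 * 1024 * 1024;   /* window size: illustrative only */

    int fd = open("/sys/bus/pci/devices/0000:05:00.0/resource0",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open BAR"); return -1; }

    void *bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap BAR"); return -1; }

    /* Register the mapping so the NIC may DMA into it; remote peers then
     * RDMA-write using this MR's rkey, and the traffic flows NIC -> switch
     * -> card without ever touching host DRAM.                           */
    struct ibv_mr *mr = ibv_reg_mr(pd, bar, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return -1; }

    printf("rkey for the remote writer: 0x%x\n", mr->rkey);
    return 0;
}
```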
Now, we get into some interesting things like,
if I'm getting cache line writes,
atomicity becomes a problem.
If I want to do a 4K write,
but I want to have the rule that that 4K write either happens in its entirety or not at all,
pretty standard feature of a block I.O.
But what if something
crashes halfway through the transaction?
I don't have a time machine.
This device could provide that.
We could provide that as a service by
doing some kind of double buffering and going
I'll take your data,
I'm not going to commit it to the memory until I've got it all,
and then I'm going to commit it.
At which point it looks a little like a block device.
Because that's what block devices do.
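The atomicity service being sketched here is essentially a staged commit: land the incoming cache-line-sized pieces in a scratch buffer and only copy the full block to its home location once the last piece has arrived. An illustrative sketch, not firmware:

```c
/* Illustrative staged-commit scheme for making sub-block writes atomic. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLK 4096u              /* 4K block, staged as 64 x 64-byte chunks */

struct staged_write {
    uint8_t  scratch[BLK];     /* incoming pieces accumulate here          */
    uint64_t received;         /* bitmap of 64-byte chunks seen so far     */
};

/* Accept one 64-byte piece; returns true when the whole 4K is present. */
static bool stage_piece(struct staged_write *w, unsigned chunk,
                        const uint8_t *data)
{
    memcpy(&w->scratch[chunk * 64], data, 64);
    w->received |= 1ull << chunk;
    return w->received == ~0ull;              /* all 64 chunks arrived     */
}

/* Commit the block in one shot; if we crash before this point the old
 * contents are untouched -- either the whole write happens or none of it. */
static void commit_block(struct staged_write *w, uint8_t *home)
{
    memcpy(home, w->scratch, BLK);
    w->received = 0;
}
```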
But it is interesting for a couple of reasons.
One is that we're basically completely isolating ourselves from this part of the system.
And another is that we're basically allowing this guy to continue to do whatever it is that makes us the money
that we want him to make.
Whereas this guy is just really moving data.
We've almost split the system into the money-making part
and the part doing the heavy lifting.
So I think there's some interesting ideas there.
And that's something that we're working through with...
What was your end-to-end latency?
What's that?
End-to-end latency.
It's about...
If you want an acknowledgement,
it's about the same as before.
It's about five to six additional microseconds.
Five microseconds to do this.
Because remember, we're a write-back cache,
and we're a region of guaranteed persistence
because we have a capacitor system.
So all we have to do
is get the data in us. We don't
have to put it in the nonvolatile memory.
As soon as we've got that TLP,
the onus is on us
as this device manufacturer
to make a quality of service
statement that we will guarantee
that data is safe.
That's what a power fail storage device does.
So we can make that kind of...
So we don't need to commit to memory,
so we can acknowledge the right very quickly.
So we were seeing the 5 to 6 microsecond round trip.
But we're now able to do small IO.
We can do 64 bytes.
We can do 128 bytes.
We can do much smaller modifications.
And I think with DRAM, it's somewhat interesting,
but it's pretty expensive.
With 3D cross-point or resistive RAM
or whatever other wonderful memory technologies that are coming,
I think the cost point of this per gigabyte will come down, and the performance
won't necessarily get any worse,
because I don't think the memory
access times are particularly
the problem here, given that we already have
microseconds of network time.
So, yeah.
I don't... I see a lot of
interesting things I can do with this.
I think there's people in the room who can probably pretty quickly think up things
that I have not even thought about yet.
So does this describe writes only, or also reads?
So, I mean, the reads will be very similar to the writes in the sense that, you know...
Then the master will be the NIC.
Yeah, yeah, exactly.
Well, yeah, in all these diagrams, yes.
Yeah, I mean, in all of these, the last two slides,
this NIC is always a master.
He's always a master in these slides.
In the graphic card slides,
the graphics card was the slave.
In this mode, we're the slave.
And the previous slide, the NVMe over Fabrics one:
They were both masters, and we were using the DRAM off the processor.
So in this example, he's a master, he's a slave.
He's a master, he's a slave.
But we're having to double buffer, basically.
So where are we going?
This is one idea of where you go with this.
We go to one of our friendly RDMA vendors.
We put an RDMA NIC here.
Maybe we put our NVRAM cards I showed you earlier here,
the DMA masters and DMA slaves.
Maybe we put some NVMe SSDs here.
Maybe some of these are NAND based.
Maybe some of them are not NAND based, which makes them ANN based.
We've got a PCIe switch here because we want to have good east-west traffic flow.
Maybe because we have to keep the legacy side of PMC happy,
we have an HBA so we can go out to a bunch of rusty disks.
I put it in there.
And maybe we have this guy, but at this point, what's this guy really doing?
Maybe this guy can be here.
Maybe he can be here.
Maybe he can be in here.
Maybe he can be here.
He's not doing a lot of heavy lifting,
so maybe he doesn't have to be the biggest processor in the world.
We can use this NVRAM as essentially a shared pool of persistence
for write caching,
and then later on move the data using the DMA engines in these guys
out to the storage.
So I can acknowledge rights quickly.
Maybe I can provide quality
of service by taking
these drives, some of which are Optane-based,
some of which are NAND-based,
and
dividing that up as a pool
and providing quality of service metrics.
You want to pay for it? You get the Optane.
You don't want to pay for it? You get the 3D
TLC.
We can do all kinds of things there.
We're using peer-direct,
it's maybe not a word I've used in the presentation before.
Again, borrowing off the shoulders of giants,
peer-direct is a Mellanox-based patch set
in the Linux kernel that enables this east-west flow.
It's part of what makes GPUDirect work,
and it's part of what makes
Donard work. And
Mellanox have that publicly available.
It's not upstream, so you've got to pull it in
to your kernel and recompile.
But you're all smart people, I'm sure.
So, I've probably run way over.
Most of you are still here, so hopefully
you're still interested. Where
do I think... We're out of time.
That's all right.
So, since we're out of time, I'm going to keep talking unless they have to kick me out.
Yeah.
Where do I think we want this to go?
I think NVMe SSDs would be very interesting if we have a standard way of exposing some kind of memory access into them.
So taking what we do in a proprietary fashion with our NVRAM product,
but integrate that into the NVM Express standard.
There's actually already something that kind of does that.
It's called controller memory buffers.
It's not quite where I think it needs to be,
but I think with the right people in the right rooms,
we can take it from where it is today to where I think it needs to go.
All these themes are tying into why I think that's the case.
In RDMA and NVMe over Fabrics,
they need a DMA slave as a destination.
Why can't that be on the drive itself?
We've got NVMe 1.2.
It's already a starting point.
We can go from there.
We've got next-gen NVM that doesn't need to have erases. It's
byte-addressable, or somewhat byte-addressable. It's faster. It looks more like memory. So
having a memory semantic way of accessing it makes a lot of sense, even if it's not
on the DDR interface. It still makes sense because I can mmap that and give it to my applications.
I can take advantage of all the NVDIMM work that SNIA and others have been doing, and
we can leverage that.
But also have a block device access
methodology as well. Maybe they overlap.
Maybe they don't overlap.
Thinking about
non-volatile memory as memory rather
than storage. Do I need to have
MSI-X interrupts? Do I need
to have atomicity? Do I need
to have some of the other services? Maybe
there's ways I can solve those problems in my applications or in my operating system so that I can take
better advantage of every dollar I invest in non-volatile memory. PCIe switches allow
east-west traffic flow and get the CPU up to a higher level where it can make me my money.
And I can only take advantage of that if I have a memory region on the PCIe device.
I need to have somewhere to put the data.
So I have some more slides, but we've gone way over.
That's really my call to arms.
Very happy, as always, to talk about this.
As you can tell, I'm pretty passionate about it.
And I hope you guys enjoyed it.
I'm out of time, and I'm going to go for a drink.
Thank you.
See you guys soon.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with
your peers in the developer community. For additional information about the
Storage Developer Conference, visit storagedeveloper.org.