Storage Developer Conference - #36: Enabling Remote Access to Persistent Memory on an IO Subsystem Using NVM Express
Episode Date: March 13, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 36.
Today we hear from Stephen Bates, Senior Technical Director at Microsemi,
as he presents Enabling Remote Access to Persistent Memory on an IO subsystem using NVM Express and RDMA
from the 2016 Storage Developer Conference.
My name is Steve Bates. I work for Microsemi,
and I have the longest, most incoherent title of any talk at Storage Developers Conference.
I think I submitted this abstract just before I went on parental leave and I put in every single keyword that I possibly could.
So enabling remote access to persistent memory on an IO subsystem using NVM
express and RDMA. I decided that I was going to change the title.
So, iopmem: PMEM for MMIO.
Is that any better?
Maybe not any better.
So PMem I will talk about a little bit.
The last speaker, in fact, all the speakers today
have been kind of a great segue into what
I'm going to talk about today, because I'm
going to touch on NVMe, NVMe over Fabrics,
and also persistent memory, i.e.
accessing memory using load store type semantics
as opposed to block type semantics.
And I'm also going to touch on some of the stuff that's been happening in the Linux kernel.
So the last talk, if you were here, was quite interesting because we were talking about
how do we write high level languages like C and Java and how do we put in place tool
chains and libraries that are
architecture and OS independent that are going to let us basically get to code that can run
on any architecture and provide us with very nice memory semantics.
We're looking at this and we've been looking at this for a long time and it's kind of scary.
So it's interesting that one of the things that's happened in the Linux kernel is they
went this load store stuff is really, really
interesting, but let's take a step back
and let's do something with persistent memory
that we can take advantage of right now.
And one of the things they've come up with
is something called PMEM, which is a driver that turns an NVDIMM
into a block device, which makes it very easy for applications
to talk to it because it's talking to a block device.
We know how to talk to block devices.
We've been doing it for a very long time.
So that takes away some of the challenges.
But of course it doesn't give us all the optimizations.
So I'm going to talk a little bit about that and how we can extend that to talk about not
just memory that's hanging off the DDR interface or off the memory channel as we call it, because
it might not be DDR, right?
It could be Hybrid Memory Cube or some other kind
of protocol.
And how we might use it to talk to memory that's
sitting on the I.O. system, for example, PCIe memory,
which is something that we're all pretty familiar with.
And if you work with NVMe drives,
you actually do that kind of thing
all the time through the driver, though not necessarily
with persistence.
So anyway, that's what we're gonna get to.
Before I jump into that, a few of us,
for some reason my name is the lead name on it,
but a few of us, including Tom who's here,
are giving a keynote later this week.
This slide was a slide I really wanted
to put in the keynote.
We didn't have enough space for it,
so I'm gonna do it now because I really like this slide.
I spent a lot of time thinking about it.
One of the things that somebody told me a very, very long time ago, and it's kind of
stuck with me, is that throughput is easy, latency is hard.
Now, there will be those among us who work on things like storage controllers and whatever.
They go, you know, throughput is not that easy.
But I think the way to summarize it, the way to think about it is throughput's an engineering problem, right?
And the way I think about throughput is
I'm sitting on the highway,
and I'm counting how many cars per hour go past me, right?
And if this is a very wide highway,
the cars don't have to be going super fast,
and I can see a lot of car, you know,
a lot of vehicles every hour go past me, right?
That's throughput.
I can get more throughput
by increasing the width of my highway, by adding more lanes, adding more parallelism. That's what we do
with storage controllers for years. We fan out to many, many drives. The latest NVMe
controllers, they use a lot of flash die and they bring in those flash die and then present
it as an NVMe. Parallelism gives you throughput. Latency is something very different.
Latency is me getting in my car in San Francisco
and going, I wanna be in Los Angeles in three hours.
Is that possible?
I don't know.
It is if you've got one of these.
You know what?
I was half expecting you to go, that's my car.
Do you have a Chiron?
No, an old Volkswagen.
They were faster, right?
Yeah, quality German engineering.
So latency is basically trying to get from a location to another location.
One specific IO, right?
It's not one of many I/Os, it's like I need
this I/O to get here as quick as possible.
And that is a physics problem because time doesn't go backwards.
In fact unless you're willing to travel very close to the speed of light it doesn't even
slow down.
So the problem with time is that once you spend a bit of time accessing the media, you don't get that back.
Once you spend a bit of time going through the SCSI stack,
you don't get that back.
They add together.
And Christoph showed some slides on that.
Intel have a very famous slide they've used for a very long time, showing that the total latency is the sum of all the parts,
and there's no minus sign in that equation.
If there was, we'd be in a very different space.
So throughput is easy.
Latency is hard.
Persistent memory is all about latency.
And that means that everything we do
has to be thought about very carefully,
because you cannot take it back.
Once you've spent a microsecond implementing 20, 40 functions
because you want to have some kind of safeguard,
you don't get that time back.
So we've got to think very carefully because it only goes one way. One thing I was thinking is,
and I might put a patent on this, you could issue an I.O. and travel away from the computer at almost
the speed of light and then slow down and then come back. And you have 100 microseconds here.
You've only gone 10 microseconds. I got a 10 microsecond I.O. out of 100 microseconds.
I've expended an awful lot of energy to accelerate to the speed of light.
And I can't even see the IO cuz now I'm mush,
cuz the human body would have collapsed under the gravitational forces.
Shit, but anyway.
Anyway, there'll be a lot more slides like that on Wednesday, so come along.
It's gonna be a good session. The motivation behind today's talk though is really around PCIe I.O. devices.
All right, Sagi used to work for Mellanox, used to work for Libet, Christoph, and, you
know, everyone in this room probably either works for a company that
makes a high-speed I/O device or takes advantage of them in some way
in their software stack.
RDMA NICs, graphic cards, FPGA accelerator PCIe cards, NVMe devices, right? They're all PCIe detached. They all tend to have incredibly high throughputs, right? We have a product at
Microsemi that's capable of doing, you know, up to five gigabytes per second of PCIe I/O. So we're getting an awful lot of bandwidth out of these devices.
These devices tend to have DMA engines.
Someone mentioned it earlier. I love DMA
because DMA is basically the CPU going, I gotta move a lot of data.
I could sit here for the next 7,000 instructions doing load stores. I could
do that. That's a waste of a very expensive processor that's capable of running much more
complicated instructions than just loads and stores. I actually want to do something where
I issue a request to this stupid DMA engine that's a lot simpler than I am and say, hey,
you go and move the data.
I'm going to go generate some revenue by running some analytics. And you tell me when you're done,
either by me polling for a completion, which takes CPU cycles, or by you raising an interrupt
through some interrupt scheme. And then I'll come back, wake up, because the OS tells me to,
and service the I/O. Right, that's kind of the way all these PCIe devices want to work. So, you know, most
NVMe devices including ours most all RDMA devices will have very high
performance DMA engines they'll support scatter gather lists they'll support
multiple contexts right and we can take advantage of those and program those and so forth.
The other thing PCIe devices are getting is exposed bars,
right, regions of memory, some of which we mmap
for the driver functionality, right?
So we mmap them in the driver and then we peek and poke
at the registers and they do things, right?
They issue I/O, they reset the device, or whatever.
But other bars are more generic.
And a classic example of that is, for example,
something like a frame buffer in a graphics card,
where you can actually have a region of memory.
And if you write to that region, something
appears on a monitor.
It's like, woo.
We were doing that for a long time.
But we're starting to look at ways
we can take advantage of that again.
So PCI devices tend to have very good DMA engines,
and they also can expose bars.
Those bars don't have to be mapped or, sorry,
backed by actual memory.
I can write some firmware on a device
that exposes a humongous bar.
It might crash the server, because the server
tries to enumerate, and it's like, whoa.
But there's nothing stopping me exposing
multiple terabytes of bar space.
Now, if you write to that bar in some random region,
can I guarantee I'm going to keep that data for you forever?
Probably not, unless I have a whole bunch of hard drives
behind me.
But maybe there's games we can play there.
Interesting things could be done there.
Something to think about.
This is where Christoph's going to jump on me. Until this work, pretty much
any high-performance transfer of information between two PCIe devices has required the
use of a buffer and system memory. And Sagi and Christoph put up some very detailed slides
earlier showing how data flows in the NVMe over Fabrics driver. And if you weren't here for that, the basic idea is that we move some data from a remote
host to the target's DRAM, the target's system memory, and then we move the data from the
DRAM to the NVMe devices through the block layer.
Or a piece of DRAM in the NVMe?
Potentially, assuming it has some kind of DRAM in the NVMe device,
which most of the high-performance ones will.
And it comes from something like DRAM.
Exactly.
So we're using DRAM in three different places.
We're using it on the host initiator,
we're using it on the target,
and we're using it on the drive itself, typically.
And the problem, of course, is on the target side,
it could be talking to hundreds of hosts at the same time. Scalability becomes a problem. I have to reserve memory for all of them. Memory bandwidth is not infinite,
right? You know, people think it is, but it's not, right? So you also have
to worry about not only how much volume of memory am I using, but am I also cutting or eating into my DRAM bandwidth, which
again could compromise performance
for other parts of my system.
If I'm a hyper-converged system, I'm not just a storage device.
I may also be doing analytics, processing, et cetera.
Every gigabyte per second of DRAM bandwidth
I lose because of my storage stack, I don't get back.
Again,
it's not additive. So we potentially want to try and take that buffering step out of
that and that's kind of one of the main things that we look at in this paper, removing that.
We're working in the Linux space so this is actually results that are based on a patch
to the Linux kernel. I'm going to get to it at the end but there's actually other ways
of solving this problem.
This one I'm just going to talk about because it's
quite pertinent.
But I'm not necessarily saying this
is the way we have to do it in the Linux kernel,
because Christoph would definitely
hit me if I said that.
So I'm not going to say that.
There's definitely other ways of doing it.
I don't even necessarily think this is the right way,
but it's a very interesting way to look at.
And while we're there, we're also
going to touch on some fun things around NVMe.
So before I go any further, people are going to maybe get a little bored with
this, but I wanted to give you a very quick 101 on the PCIe version of NVM Express.
Until recently we didn't have to qualify it with PCIe, but now, thanks to the work
by the NVMe working group, we also have NVMe
over RDMA, and pretty soon we'll have NVMe over Fibre Channel as well.
The PCIe version, it works reasonably straightforwardly in the sense that, and this is one of the
things I really like about NVMe, the NVMe group defined a boundary down here at the
PCIe layer.
They went, if you want to make an NVMe device, you have to do this at the PCIe layer.
You must present a certain region of memory, a PCIe bar essentially, and the contents of
that bar have to follow this spec.
And that was great because unlike RDMA, it meant we could have a common low-level driver
for every NVMe device.
So it's not like we need a hardware driver shim,
like we do in RDMA for the Mellanox products,
for the Chelsio products, for somebody else's.
We can have a common NVMe driver for any NVMe device.
And I love that.
I love the fact that HGST sent me a sample,
I plug it in the server, I boot it up, it just works.
It doesn't matter, I get one from Memblaze,
I get one from Intel, it just works.
It's fantastic.
Basically the driver works by memory mapping
or IO remapping a region of memory,
and then it knows, well basically now
I can control this device
because I'm basically going to issue reads and writes against that memory region on the
device. Very standard Linux driver kind of work.
Admin commands are used at load and probe time to configure things like queues and doorbells.
And because we probe it, you can do things like hot plug. So I can plug in a new drive even when the system's up, assuming the hardware supports it, and it should just
go, hey, there's a new NVMe device, and please not do a kernel oops.
Sagi's laughing.
In theory, that's how it works.
I actually find this very, very interesting and I kind of wanted to show it just to give a little bit of background.
And I don't know how many people have really done this,
but I put a protocol analyzer between the CPU and the NVMe drive,
and I issued some individual I.O.
I just wanted to see, you know, what do you actually see on the PCIe bus
if you do one NVMe command, right?
You know, it seems like it should be obvious, but it's kind of fun to actually go and go, well, the
spec says this should happen, but what actually does happen?
So here, I'm sorry, it's probably not super clear, but here we have a single NVMe command.
It kind of starts here and it ends down here.
And this is considering a system that's quiescent. So take a system that's completely quiet and just do one NVMe read.
Okay?
That's it.
Don't worry about trying to interleave commands or anything.
And so what happens is, you actually, the first thing that happens is the host
rings a doorbell on the drive by issuing a simple TLP to a very specific address.
It knows where it is. It was decided during startup.
So basically the host starts and goes, hey, NVMe drive, wake up.
What happens next is the NVMe drive goes and pulls in the head of that submission queue, 64 bytes of data.
That's going to tell the drive what to do.
So the first thing that happened is we rang a doorbell on the drive.
The next thing that happens is the drive issues a memory read essentially to main memory because
the queues are stored in main memory right now.
Pulls it in, 64 bytes, and somewhere in that drive there's a little piece of hardware or
firmware or software, depending on your implementation, that goes, this is a 64 byte NVMe command.
This is an NVMe read.
They've asked to read this particular LBA.
Maybe there's some other things that I need to do with that.
But that's pretty much it.
What happens then depends on the media
that that NVMe drive has.
But somehow, you're going to go find those 512 bytes,
in this case, or maybe 4K in this case.
You're going to go find them.
You're going to energize a NAND die.
You're going to talk to resistive RAM.
You're going to go to DRAM.
Whatever it takes, depending on the media,
it's kind of irrelevant to NVMe.
You could, in theory, go to a spinning disk.
NVMe doesn't care.
What happens next is you DMA the data
to the address that was given to you in the NVMe command.
So part of the NVMe command said,
here's a region of physical memory, probably in system memory.
Can you please DMA to that address?
It's not a virtual address because DMAs have no idea what virtual addresses are, right?
So it has to be a physical address,
and we have to do something in the driver, get user pages or whatever,
to make sure that if we're talking virtual addresses in our program, we're talking physical addresses
by the time the information gets to the drive.
Once the DMA is done, we write to the completion queue.
We have to tell the system the read is complete.
And then, depending on whether we're polling or interrupting or doing whatever, in this case we actually issued an MSI-X interrupt,
which is a type of PCI interrupt which basically tells the processor that something's happened,
there's hardware in there that works out which handler is associated with this interrupt. That handler is tied to a thread, that thread is woken up, and it goes, oh, we're done.
And the data is now in my system memory.
I can pass it back through the page cache or if it's direct IO, I can copy it.
You can do all kinds of things.
It depends on the OS and so forth.
So that's the anatomy of an NVMe command.
It's worth keeping that in mind.
And in a little bit, we're actually going to look at how long does that take and what is the kind of thing that
can affect the jitter on this kind of command.
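Just to make that flow concrete, here is a rough C sketch of the two host-side pieces: the 64-byte submission queue entry and the doorbell write. The layout follows the public NVMe spec for a read command, but the names and the helper are simplified illustrations rather than actual driver code.

```c
/* Simplified sketch of an NVMe read submission, per the public NVMe
 * spec (not real driver code).  The host builds a 64-byte submission
 * queue entry in host memory, then rings the doorbell with a single
 * 32-bit write to BAR 0; everything after that is driven by the
 * drive's own DMA engine. */
#include <stdint.h>

struct nvme_rw_command {            /* 64 bytes total */
    uint8_t  opcode;                /* 0x02 = read                  */
    uint8_t  flags;
    uint16_t command_id;            /* echoed back in the completion */
    uint32_t nsid;                  /* namespace identifier          */
    uint64_t rsvd2;
    uint64_t metadata;
    uint64_t prp1;                  /* physical address for the data */
    uint64_t prp2;
    uint64_t slba;                  /* starting LBA to read          */
    uint16_t length;                /* number of LBAs, zero-based    */
    uint16_t control;
    uint32_t dsmgmt;
    uint32_t reftag;
    uint16_t apptag;
    uint16_t appmask;
};

/* Ring the submission queue tail doorbell: one MMIO write telling the
 * drive "the queue tail is now here, go fetch the new entries". */
static inline void ring_sq_doorbell(volatile uint32_t *sq_tail_db,
                                    uint32_t new_tail)
{
    *sq_tail_db = new_tail;
}
```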
The other thing is this is a block-based interface.
We are not talking memory semantics here.
So this is not load store.
This is DMA, ask for something to be done, have it done.
There are some nice things about this.
One, it uses DMAs.
DMAs save you using CPUs, and CPUs are expensive.
DMA engines are not.
You can do things like data integrity.
You can do atomics.
You can say, either this write will complete in full,
or it won't happen at all.
There's good things about the block layer. Let's not
forget that. So some of the hardware that I've been using to look at some of this. I
wanted a very low latency NVMe device. Just so happens Micro Semi makes one. I actually
have one here. If anyone wants to take a look at it. So this is shipping in volume today.
Underneath this heat sink is what's called Princeton.
It's one of our NVMe controllers.
Some of you may be familiar with it.
Some of you may even ship product with it.
Rather than using Flash, we use a whole bunch of DRAM for the principal store.
And then this is a little Flash card.
And there's also, not shown here, of course,
because marketing, there's a big, chunky capacitor thing.
Because it's here.
And the idea is that if you lose power,
all this DRAM gets vaulted into this flash.
And then on reboot, it comes back.
But the great thing is that basically, I
have a super low latency.
I had an Optane SSD before Intel did. Albeit at a higher price. I haven't seen the price yet, Jim.
It's going way, way faster than all these other things.
That's right. And also I have infinite endurance. So I can write this puppy all day long. It's
DRAM. Doesn't matter.
Power fail every two seconds.
Yeah, that's true. This guy only has so many cycles. Exactly. So the interesting thing also about this device
is, for reasons that will remain unclear,
we didn't just present this as an NVMe device to target.
Actually, a customer asked for it.
One of our customers was like, hey,
this is cool that this is an NVMe device.
Capacity is not great, but for a write cache,
it's a really good thing. But
can you expose that memory or some part of it as an additional bar? So can we have it
as both a block device and also a memory device? And we went, yeah, we can do that. And the
interesting thing is that that's actually a controller memory buffer. Now, this was before controller memory buffers were even in the spec, so it's not standard, but it is a controller memory buffer.
So the interesting thing is if you stick one of those in your server,
and you do, look how geeky is this,
if you do an lspci -vvv, so you get very verbose, and you look at the device, you'll even see we still have the old PCI signature.
So PMC is now Microsemi, but there's a PMC-Sierra ID there.
And you'll see that there's two memory regions, two bars that
are exposed.
There's one here, which is non-prefetchable and 16k.
And then there's this really big one here.
This one is the standard NVMe one.
Any NVMe device has to show this one, right?
Otherwise it's not NVMe. And this is the one the driver will ioremap or whatever so
that it can talk to it. This is basically one gig of our DRAM capacity that we've exposed
essentially as a controller memory buffer. This can be ioremapped, this can be mmapped, this can be anything you like, right?
And now this region is memory addressable,
albeit on the I-O subsystem, not in standard DRAM, right?
So it's not system memory, it's not coherent,
it's got an L3 cache in front of it.
Performance in certain instances will suck,
and I'll talk about that in a minute.
But it's interesting.
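As a side note, if you want to poke at a bar like that from user space, one way is to mmap the matching sysfs resource file. This is only a hedged sketch; the PCI address and the resource index are made up for illustration, and you would pick whichever bar lspci shows for your device.

```c
/* Hedged user-space sketch: map an exposed PCIe bar through sysfs and
 * write to it directly.  The device address and 'resource4' index are
 * illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:04:00.0/resource4";
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    size_t len = 1UL << 20;          /* map the first 1 MiB of the bar */
    void *bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Load/store semantics straight into device memory.  Remember the
     * caveats from the talk: this path is not coherent and is slow
     * compared with letting a DMA engine move the data. */
    memcpy(bar, "hello bar", 10);

    munmap(bar, len);
    close(fd);
    return 0;
}
```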
And the question is, given that we have that,
what do we do with it?
So one of the things you want to do with it is standardize it.
So just very quickly, I won't go into too much detail.
This is all in the public NVMe spec,
which anyone here can go and download from nvmexpress.org.
This specifies a couple of registers that are in that NVMe region of the bar, that first
bar region zero, that tells the driver about the CMB capabilities of the drive.
So basically somewhere in the driver eventually there's going to be a line that goes, if the NVMe version of this drive is 1.2 or later and if the CMB size register is not equal to zero, let's go and do something.
And that code isn't necessarily really there today.
There's some initial work on the CMB in the Linux driver.
I don't know about the status of the Windows and VMware and other drivers.
But in the Linux driver we have some initial CMB support.
Hopefully more of it is coming soon.
And basically this really is just going to describe the size of the bar, the location
of the bar, which you can get from the PCIe data anyway, but it's also going to tell you
about the capabilities that this drive is willing to support for that bar.
And right now we have a couple of different capabilities, and they're not actually on
this slide, but they basically say what kind of things would I like this bar to be used for?
Can you use it for submission queue entries and you use it for completion queue entries?
Can you use it for write data?
Can you use it for read data and then one of the things we don't have in there right now
But we're thinking about adding is a flag that says this is a persistent bar. If you write to this bar,
I will keep your data for you, right? Which actually turns it into kind of an NVDIMM on the PCIe system.
So that's kind of where we're going with that.
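To give a feel for what that driver check could look like, here is a rough sketch against the register layout in the public NVMe 1.2 spec (CMBLOC at offset 0x38, CMBSZ at 0x3C). It is an illustration, not the actual Linux NVMe driver code.

```c
/* Rough sketch, not the actual Linux NVMe driver: a probe-time check of
 * the Controller Memory Buffer registers defined in NVMe 1.2.  'bar0'
 * is assumed to be the ioremap()ed BAR 0 of the controller. */
#define NVME_REG_VS      0x08   /* controller version                   */
#define NVME_REG_CMBLOC  0x38   /* CMB location: which bar, what offset */
#define NVME_REG_CMBSZ   0x3c   /* CMB size and capability bits         */

static void check_cmb(void __iomem *bar0)
{
	u32 vs     = readl(bar0 + NVME_REG_VS);
	u32 cmbloc = readl(bar0 + NVME_REG_CMBLOC);
	u32 cmbsz  = readl(bar0 + NVME_REG_CMBSZ);

	/* CMB registers only exist from NVMe 1.2 on, and CMBSZ == 0 means
	 * the controller does not implement a CMB. */
	if (vs < 0x00010200 || cmbsz == 0)
		return;

	/* Low bits of CMBSZ advertise what the CMB may be used for:
	 * submission queues, completion queues, PRP/SGL lists,
	 * read data, write data. */
	u32 sqs = cmbsz & (1 << 0);	/* submission queue support */
	u32 cqs = cmbsz & (1 << 1);	/* completion queue support */
	u32 wds = cmbsz & (1 << 4);	/* write data support       */

	/* CMBLOC says which bar holds the CMB and the offset into it,
	 * in units of the size unit (SZU) from CMBSZ. */
	u32 bir  = cmbloc & 0x7;
	u32 ofst = cmbloc >> 12;

	/* ... map that bar region and decide how to use it ... */
	(void)sqs; (void)cqs; (void)wds; (void)bir; (void)ofst;
}
```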
So before we jump right into that,
I just kind of want to take a quick step back and look at some latency numbers so that we can compare latency from the NVMe device with latency from some of the other
things. So, you know, as Intel and Christoph and Sagi mentioned earlier, very fast NVMe devices have latency down around the 10 microsecond mark, so it's pretty
interesting to go and work out where's that latency coming from?
How much of it is dependent on the SSD itself?
How much of it is NVMe protocol overhead?
How much of it is driver?
How much of it is interrupt-based?
And like I said to Christoph earlier,
if you have an outlier,
so I'm a statistician by background,
and of course average is interesting,
but distribution is also incredibly important,
right? Thank you. Yes, I did appreciate that. Because a lot of people say average, and that
doesn't help you if every time, you know, one in a hundred is way out here, right? Your application
cares about here, especially if you're doing multiple reads, and the result that you give
back to the user is dependent on the result of all those reads. If I issue a hundred reads
and I have to add the results together to give someone an answer,
my performance is bounded by the worst case read out of those 100.
So if your average is 1 but your 1 in 100 is 10, then you're stuck with the 10, right?
Average is not that useful to you there.
So we did, like I said, we put the LeCroy analyzer here.
And we did some I.O. and we did some measurements.
This graph is probably not super clear.
But basically, we broke it down into the I.O. contribution
time.
And what we find is that the drive itself,
our drive is a little slower than your drive.
But the NVRAM device, as we call it, was pretty consistent and had
very good consistency in terms of its outliers as well, around eight to nine
microseconds. But what we actually noticed was there was occasional reads
that were taking much much longer than that and we tracked that down to the MSIX
interrupt times and you can kind of see that on this graph where I've actually
broken the latency up.
The blue is the total latency measured by FIO up in user
space.
The red is the NVRAM times measured by the protocol
analyzer.
And then the green is basically the difference,
which we ended up working out was certainly the variability
was mainly due to MSIX interrupt times.
And from that we got into polling.
Christoph's been doing quite a bit of work on polling in the block layer.
So I've been doing quite a bit of testing of that work,
looking at the impact of applying polling.
So very interestingly, if you actually plot the latency
of the NVRAM component of the
distribution it's very nicely bell-shaped. Central limit theorem is
obviously working for us here. So that tells me I'm submitting lots of random
variables together which tells me that's probably pretty good. It tells me that
there's probably not something deterministic in there. This is the
addition of multiple random variables which always converges to the central
limit theorem. So that's kind of nice. It's always converges to the central limit theorem. So that's kind of nice
It's always nice to get the central limit theorem into a presentation like that
So so pretty fun and for anyone who cares
this is really just the measurement of the time between seeing the submission queue doorbell and the time that we raise the MSI-X interrupt. All right, so, yeah.
All right, so that's kind of the NVMe background.
Now I wanna start getting into PMEM
and IO-PMEM a little bit more.
So I'm gonna have to use
some Linux kernel terminology here.
There's been a lot of contributions
around the NVDIMM region of the Linux kernel
in the last little while.
And it may or may not be because one of the large CPU vendors has some interesting technology that
they want to hang off the memory bus. You know who I'm talking about. But the great thing is that
this is really important work. And I think the entire industry benefits from it. So there may be some ulterior motive for it,
but I think we can all take advantage of the fact
that this is being done.
And there's a lot of very good programmers who are contributing
to that space.
It gives us a really good insight
into what's coming down the pipe, which
is one of the reasons why I really
enjoy tracking the Linux kernel development,
because it really tells you what kind of devices are
going to start appearing depending on the activity factor.
And I have a little script that basically just runs off the commit logs and works out
what areas of the kernel are being worked on the most based on submissions.
And that tells me where I need to go work.
So in order to prepare for this memory channel, and this is not the only work, but this is
some of the work, there's a few things that have been going along.
We have zone device.
Zone device has been around for a while,
but really it's a way of saying,
I have a lot of memory in my system.
I have a lot of stuff hanging off my DRAM bus.
I wanna split it up into zones.
So this is saying I've got this much physical memory,
and I'm gonna split it into regions.
We kind of do that with zone DMA, but that's more for historical reasons, about certain memory regions having
certain properties and not being accessible. This is really saying I'm going to have regions
of memory that maybe have different access characteristics than others. Maybe it's slower.
Maybe it doesn't have infinite endurance. Maybe it's something I want to reserve for use with a driver. And that's the second
thing. PMEM is a driver in the Linux kernel that basically takes a zone device region that's
allocated by the user at start time and says, rather than throwing this into the big pool of
memory that's part of the memory subsystem that you can give out through, you know, get_user_pages or kzalloc or whatever, I would like you to reserve this
memory region for a driver and that driver will be defined by the P-MEM driver.
So, actually, zone device is always reserved and will never go into the generic
memory.
Thank you.
Very good. So zone device basically says, this is a certain region.
Everything else, go in the standard pool.
Thank you, Christoph.
It's good having an expert in the room.
So basically what it means is that memory is not
available for your system memory.
So if you put, for example, if you've got a 64 gig system
and you put a 16 gig NVDIMM in and say I'm going to reserve
that 16 gig as a NVDIMM PMEM region, you don't have 64 gig for everything else.
It doesn't work that way.
Sorry, guys.
You're giving up memory capacity in order to have this particular region and service
it.
If you're going to do it, make sure you keep that region busy because you've taken memory
away from the system.
Now, obviously, NVDIMM is only so interesting because it's DRAM cost and so forth.
There's stuff coming from certain CPU vendors that shall remain nameless, Intel, that may have much
larger capacities and have properties where you really don't want to treat it like DRAM. So this makes a lot of sense for that. DAX, Christoph mentioned earlier,
DAX is really direct access. It's a way of basically telling the operating system, you
know, I may look like a file system. I may, you know, you may think that I'm basically
backed by a block device, but I'm a special kind of block device. I have special properties
in the sense that I'm kind of addressable at lower
than LBA level. I'm going to see if Christoph nods.
So basically what that means is you can,
you can optimize because you know the underlying system supports cache line addressability.
But at the downside is there's certain problems with doing that.
You no longer necessarily have the atomicity that you want.
And also what happens to things like the RAID stack when you start being able to change
things at byte or cache line level as opposed to block level?
Right, some issues there.
So the DAX Framework is definitely something you can take advantage of.
It's something that's provided as a service to file systems.
So file systems can take advantage of it.
Right now, ext4, which apparently is broken, and XFS are certainly two that I know of.
Ext2, but don't touch it either.
XFS, Jeff.
And then the last thing, which is kind of important, this is something that got added
a little later.
There was some discussion around struct page support.
So what's important about struct page support?
Well the interesting thing is if you want to do a DMA, right now in the Linux kernel
if you do a DMA you basically at some point probably do a get user page.
And somewhere in the lines
of code for get user page, there's a little bit of code that says basically if the memory
doesn't have struct page support, just bork and don't do the DMA because we're not going
to do it. We're not going to let you do it. So you can't DMA to a physical location that
doesn't have struct page backing essentially. Now initially, this was a problem even for the NVDIMM type
work because there were people who were saying we're going to have such large memory attached
devices, do you really want a struct page piece of metadata for every 4K page? At the
time we're talking 64 bytes for 4K. You can do huge pages and gigantic pages and I'm sure
at some point we'll do super gigantic
pages to reduce the overhead.
We already got super gigantic pages.
Can we call them super awesome?
Anyway, so what we did is we've actually added an option in the kernel config that says if
you want struct page backing for this memory,
you can have it.
And that's a kernel config.
And that's useful for things like direct IO and DMA.
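As a rough illustration of why struct page backing matters for DMA, the pin-then-map step a driver does looks something like the sketch below. It is approximate; get_user_pages_fast() in particular has changed signature across kernel versions.

```c
#include <linux/mm.h>           /* get_user_pages_fast(), put_page()   */
#include <linux/dma-mapping.h>  /* dma_map_page(), dma_mapping_error() */

/* Sketch of the usual pin-then-map dance a driver does before a DMA.
 * get_user_pages_fast() can only hand back struct page pointers, so if
 * the user mapping points at memory without struct page backing (for
 * example a plain ioremap()ed bar), this path fails and no DMA happens.
 * The signature shown is the 4.x-era one; it has changed since. */
static int pin_and_map_one_page(struct device *dev, unsigned long uaddr,
				struct page **page, dma_addr_t *dma)
{
	int got = get_user_pages_fast(uaddr, 1, 1 /* write */, page);

	if (got != 1)
		return got < 0 ? got : -EFAULT;

	/* Turn the struct page into a bus address the device can DMA to. */
	*dma = dma_map_page(dev, *page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, *dma)) {
		put_page(*page);
		return -EIO;
	}
	return 0;
}
```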
So just to summarize, how does PMEM work?
Everyone can go do this.
You don't actually need an NVDIMM.
You can practice with just normal DRAM.
When you boot your kernel, you want to basically have a boot option which says memmap=, and then the size of the PMEM region, and where you want it to start; for example, memmap=8G!16G reserves 8 gig of physical address space starting at 16 gig as a PMEM region.
And the idea is that that's supposed to line up with where an NVDIMM plugs into your system.
So if you have an 8 gig NVDIMM, you'll have an 8G there.
And just make sure it sits where you want it to sit.
And you're going to have to do that to make sure that that's the case.
Otherwise, you've basically taken some DRAM and turned it into an NVDIMM, which doesn't work, because when you take the power away it's not going to keep your data. Or you've got an NVDIMM that could store your data for you, but you're just throwing it into the generic memory pool. So please make sure it lines up right. What happens is the PMEM driver will bind to this reserved region and it will
register a device in the device registration structure of the Linux kernel, and that will appear as /dev/pmem0 or 1 or 2, depending on how many of these things you have.
And it's just a block device.
And the great thing is all the block device goodness that lives in the kernel and user
space applies to this.
You can put a file system on it.
You can put a DAX file system on it.
You can put a non-DAX file system on it.
You're going to lose some performance by doing that, but you can do it.
You can make it part of a multi-disk structure.
Plexistor, I don't know if anyone from Plexistor is here, but they are talking about something
called M1FS, which takes PMEM devices and NVMe devices and puts a file system over the two and does
auto tiering and caching between the two.
Uses the fast memory where it can, uses the slow memory where it can't, hides that all
from you, the user.
You just see an awesome fast file system.
Put databases on there, you can do whatever.
It makes it easy to work
with persistent memory. We can do this today. Everyone knows how to use a block device.
So don't get me wrong, I do think the future is load store memory access. I think that's
kind of where it's going, but this is something we can do today and we get pretty interesting
performance. So let's talk about performance. PMEM performance.
So I have here some results.
The first number is latency in microseconds.
The second number is bandwidth in megabytes per second.
For all you geeks out there, the gory details are in the bottom.
I'm not going to go through them.
I'll send you that slide.
It should be online for anyone who wants it.
And then we have three different columns. I have QDepth 1, number of threads 1. So this is my
QDepth 1 results. You can see our NVMe device is going pretty high there. I'll talk about that in
a second. It's pretty bad. Never. Don't buy our product. QDepth 128. I thought this one was interesting because these numbers are, latency-wise,
these are kind of the same. And I think one of the things that's happening here is the PMEM driver
is servicing one I.O. at a time per thread. Maybe someone can correct me if I'm wrong,
but I thought that was kind of interesting. And then what's also interesting, I find, is that
as you basically get up to the higher QDepths, you definitely still get a lot of benefits.
But all this talk about sub-microsecond latency or very, very fast devices, it only applies
at QDepth 1.
And are we really going to be running things at QDepth 1?
Because at QDepth 1, the bandwidth is kind of sucky.
So it's like all this talk about ultra-low latency and whatnot, true at QDepth 1, but the number of I/Os at QDepth 1, even with
very small amounts of latency, it's not huge. Yes?
I suffer from my... yeah, I deserve that question. These are averages. Yeah, I don't have outliers.
I should, I should, but I don't have them here.
Yeah, I'll leave that as an exercise to the user.
All these scripts are online.
I'm pretty good at putting stuff on GitHub.
You can ping me if you wanna try running these
on your own systems.
Yes?
When you run these,
is it with MSI-X interrupts or with the polling?
Yeah, that's a good question.
So these results are with MSIX interrupts,
and we should go back and look at polling.
Yes, I agree.
Again, blame me for that.
And like I said, I'm a big fan of just giving you one set of data
and all the tools you need to generate your own.
So that's my excuse.
But I'll let you do that.
Interestingly, I mean, the PMEM driver, it's pretty fast, three microseconds.
That's pretty fast, right, for a block device.
You do a little faster.
Oh, Chris does.
Yeah.
I was surprised, actually, that you get that much latency.
Yeah, I was a little surprised.
I kind of dug into it a little bit.
As I briefly mentioned, right now we get reproducible four microsecond latency with NVMe.
Yeah, yeah, yeah. I think you have a godlike system or something.
Yeah, I think one of the things I've noticed is there's a lot of variability, right? You change
your system, you change the OS, you change the time of day sometimes, right? There's a lot of
stuff that's going on. This really low latency performance testing,
it's definitely a bit of a fine art right now.
Intel processors and other processors
put themselves in low power states.
You've got to be careful about that, careful about clocking,
all these things that you've got to keep an eye on.
And it means sometimes you get data,
and you're like, this data is insane.
So you have to apply a little bit of judgment.
So this is definitely a snapshot.
Don't treat it as de facto. I recommend people go measure it for themselves. Linux kernel is free.
You can have it on your laptop in 20 minutes. You can even do measurements like this on a VM.
You're not going to get great data, but you can do it. I've done it. It's kind of fun. You SSH,
and you think you're on a bare metal machine. I've SSH into my VM before. Done the measurements.
Gone, oh, that doesn't look right.
You're like, oh, I'm on my VM.
That's why.
Like a QEMU KVM.
It has a virtio emulation that might be quite slow.
Yeah, yeah.
So what did we do to change PMEM?
PMEM is in the kernel.
You can go grab it.
We had to make some changes.
And I want to talk a little about those changes.
Like I said, we're a great believer in putting the code
that we talk about online.
So there's a GitHub repository at the bottom
with a fork of the Linux kernel.
It's a little out of date.
I think it's 4.5.
We're now almost at 4.8. Um, but that's, uh, something you can go pull.
We did 88 lines of changes that basically enabled struct page support for zone
device memory that resides on IO memory space.
Our change was really this part here.
We already had struct page support for zone device.
We just wanted to make sure that even if the zone device sits on IO memory, it gets struct page backing.
Okay, that's the change.
I've got some good news for you.
What?
The ARM people are getting an operation that allows you to map the PCIe bar through the IOMMU.
Excellent.
So just implement support for all the x86 IOMMUs and you'll get there.
How many of those are there?
I think it's just Intel and AMD, basically.
Yeah, yeah, okay.
Legacy, IBM and SGI.
I have a slide on this, Christoph.
We'll get there.
Now, pmem.c won't work with our change, because pmem.c is trying to look at system memory.
We need a PCIe driver, right?
It turns out PCIe drivers are pretty easy to write because the PCIe bus subsystem has a lot of functionality you can
just tie into.
So we wrote an example driver.
So we didn't change the NVMe driver, but I'll talk about
that in a minute.
What we did is we went, if you want your device to be an IO
PMEM device, then use this driver.
This driver basically says, if there's an IO bar, let's turn it into a block device. So let's do what we did for PMEM, for iopmem memory. So if you go back to one of the earlier slides where I showed two bars, basically what it will do is it will take that one gig bar and turn it into a one gig block device called /dev/iopmem0, right? That you can start hammering with IO, okay?
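The core of a driver like that is pretty small. Here is a hedged sketch of the idea, not the actual iopmem.c from our RFC; devm_memremap_pages() has changed signature over kernel versions, and iopmem_attach_disk() is a made-up placeholder for the gendisk and request queue setup.

```c
#include <linux/pci.h>
#include <linux/memremap.h>

/* Hedged sketch of the iopmem idea, not the actual RFC patch: claim a
 * PCIe bar, give it ZONE_DEVICE struct page backing, and register it as
 * a block device much as pmem.c does for NVDIMM regions. */
static int iopmem_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct resource *bar;
	void *addr;
	int err;

	err = pcim_enable_device(pdev);
	if (err)
		return err;

	/* Which bar to expose was a module parameter in our driver;
	 * bar 4 here is only an example. */
	bar = &pdev->resource[4];

	/* Map the bar and create struct pages for it, so that
	 * get_user_pages() and DMA mapping can target this I/O memory
	 * much like ordinary RAM.  Signature varies by kernel version. */
	addr = devm_memremap_pages(&pdev->dev, bar);
	if (IS_ERR(addr))
		return PTR_ERR(addr);

	/* From here on it is ordinary block device plumbing: a request
	 * handler that memcpy()s to and from 'addr', surfaced as
	 * /dev/iopmem0 with DAX enabled.  iopmem_attach_disk() is a
	 * placeholder for that setup. */
	return iopmem_attach_disk(pdev, addr, resource_size(bar));
}
```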
This is my disclaimer,
so I don't get beaten up by the Linux kernel people.
We submitted this to the kernel as an RFC
intentionally to generate discussion.
We were not necessarily saying in its current form,
this must be accepted
because I think there's other ways of solving this problem.
Christoph mentioned one.
So this is an interesting idea that moves us down a path,
but it's not necessarily, even in my opinion,
the best way that we want to solve this problem.
That said, I do want to talk about some of the things
that it lets us do.
So the example driver, iopmem.c, it's
a self-contained PCIe driver.
You could take parts of it and put it in the NVMe PCIe host
driver and take that functionality.
You could do that.
In our case, what we were doing is unbinding the NVMe driver
from our card that I showed you earlier,
and then binding this driver to it.
We had a module parameter to identify
which bar we were exposing.
In theory, though, you would actually tie
it into that CMB part of the NVMe driver. Thank you. At the back. For now, we use the
entire bar. We have a DAX enabled block device and basically you can put a file system on
top of that. One of the DAX enabled file systems. We also, just for fun, put in an mmap operation in that driver.
So now it starts to get funky, because we are essentially ioremapping something to turn it into a block device that we can then mmap.
So worlds within worlds.
But that is interesting.
It did work.
And it would actually let you essentially mmap the bar into a virtual process space
and do things in that.
The other thing that we can do is if we put a DAX-enabled file system on there, you can
put files on it and m-map those files.
That's totally legitimate.
We do that with block devices all the time.
And that gives you basically cache line accessibility
into files on a file system,
which is often easier to work with
than the LBAs of a raw block device, right?
We like, you know, file systems are good for a reason.
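For a feel of what that looks like from an application, here is a minimal user-space sketch. The mount point and file name are assumptions for illustration, and real code would likely use libpmem or explicit cache flushes rather than plain msync().

```c
/* Minimal user-space sketch: map a file that lives on a DAX-mounted
 * filesystem and store to it with plain load/store semantics.  The
 * mount point and file are assumptions; durability handling is
 * simplified. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* /mnt/iopmem is an assumed DAX mount on /dev/iopmem0 (or
	 * /dev/pmem0); the file is assumed to exist and be at least
	 * one page long. */
	int fd = open("/mnt/iopmem/log.bin", O_RDWR);
	if (fd < 0)
		return 1;

	char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Byte and cache line granular access, no block I/O in the path. */
	memcpy(buf, "hello, persistent world", 24);

	/* msync() is the portable way to ask for durability; on DAX this
	 * flushes the relevant CPU cache lines rather than writing blocks. */
	msync(buf, 4096, MS_SYNC);

	munmap(buf, 4096);
	close(fd);
	return 0;
}
```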
So all that done, let's talk about some performance data.
Anyone recognize this?
It looks a lot like an NVMe over Fabrics deployment.
So let's imagine that we have a couple of processors.
Maybe one of them has an IOP mem attached.
Ultimately, maybe we can take IOP mem
and replace it with NVMe with a controller memory buffer.
Here we have an RDMA NIC.
And in a standard NVMe over Fabrics flow, for a write,
what happens is I write some data
or I write a request for a command,
assuming it doesn't go in-capsule.
There's going to be a buffer set up here in DRAM.
Basically, at some point, this RDMA NIC
is going to come and request the data.
It gets DMAed to this NIC, encapsulated in your favorite fabric protocol, whether it's RoCE, iWARP, InfiniBand, you can do RDMA over anything.
We could do RDMA over squirrels if you wanted.
Performance would suck, but the squirrels would...
It's just a way of carrying a message, right?
So it comes over here, the DRAM or the data ends up here,
and then the block device layer basically issues an I.O. which goes down to the device.
Now though, there's memory here. Can I register this memory against this as a memory region?
Yes, I can. Okay? With our change, you can. So now, this is where the data ends up. So
my NVMe over Fabrics write actually becomes here, here, here, through a PCIe switch straight
into here.
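In user-space verbs terms, registering that mapped bar so the NIC can DMA straight into it looks like the sketch below. Take it as an analogy only; the in-kernel NVMe over Fabrics target uses the kernel RDMA API rather than libibverbs, and the cmb pointer is assumed to be the mmapped bar from earlier.

```c
/* User-space analogy for registering a mapped CMB/bar region as an RDMA
 * memory region, so the NIC can place incoming write data directly into
 * device memory rather than into a DRAM bounce buffer. */
#include <infiniband/verbs.h>

struct ibv_mr *register_cmb(struct ibv_pd *pd, void *cmb, size_t len)
{
	/* Registration pins the memory and hands the NIC a key for it.
	 * With struct page backing in place for the bar, that pinning
	 * can succeed even though this isn't system DRAM. */
	return ibv_reg_mr(pd, cmb, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_REMOTE_READ);
}
```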
Do I need the switch?
No you don't.
Are there some performance issues with Intel's peer-to-peer PCIe transfers?
Yes there is.
All right?
By the way, this company sells PCIe.
By the way, thank you, Christoph.
We just happen to sell very, very good PCIe switches as well.
I have one in my pocket.
So we can move data.
We were able to transfer data into this device at about four gigabytes per second, so over
the whole network.
Not super fast, but the limiting factor was actually this device.
It's just a limitation of its capabilities.
If you have more than one device, you can scale that up
until this RDMA NIC or the Fabric or something else becomes the limitation.
And, of course, competitors or other people could come up with other devices
that have much better performance numbers.
Reads, we were doing 1.2 gigabytes per second.
And again, that's just the limitation
of this particular device.
Other devices should be able to do much better, worse,
depending on how you implement it.
And of course, remember, latency is hard.
Throughput is easy.
Throughput is just IOP mem, IOP mem, IOP mem, IOP mem.
Do it in parallel.
All right.
Latency was kind of interesting.
We're not talking additive latency.
We were getting sub-three microsecond latency for RDMA reads off that IOP mem.
So maybe it's worth going back.
For this guy to access this bar, performance sucks.
Everybody knows that, right? You don't treat a bar on
a device as something to load store into from your local CPU. You've got L3 caches. There's
a reason we have an L3 cache. It's because writes suck. You want to actually cache the
writes and flush them when they get bigger. So small writes kind of suck. Reads are also
not great. So local access typically is not great, which is why we put in DMA engines
Interestingly because this guy has a DMA engine performance here is actually very very good
Alright, so the interesting thing about IOP mem is it serves best as a peer-to-peer communicator not a peer to the host communicator
So we want to look at applications which are more peer-to-peer. I'll get to those.
I don't know if I have that with me today. Sub-three-microsecond read times, depending on the block size. Pretty low latency access. We also did transfers between the iopmem device and an NVMe SSD. Think data replication.
This could be an NVMe SSD.
That could be an NVMe SSD.
So very quickly, because of time,
I want to get into some use cases.
So background copying between NVMe devices.
We could write something in the host,
tie it into the RAID stack or the multi-device layer,
the device mapper layer, the MD layer,
where we could basically get lazy data replication
in the background. These guys communicate with each other and they basically take copies
of each other's data and tile them out to give you some form of lazy data replication.
The host OS is still in control. It's still issuing all the I.O. It's not like I'm getting devices to talk to each other without the OS's knowing. The OS is the conductor.
The drives are the orchestra.
So they communicate with each other. The other great thing is, like I said earlier, there's no
data traffic flow between the memory subsystem here. This is staying pretty idle.
Even though this could be
scrapped, it's almost like a duck, right?
You look at the legs under the water, they're going crazy.
But this guy is moving gracefully.
Just.
Let's imagine you don't want to put CMBs on your NVMe SSD.
Let's just say the market isn't there yet or whatever,
there's nothing to stop you having standard non-CMB enabled SSDs and an accelerator device
with IOP mem capabilities, like an FPGA card. Then that FPGA device could be doing all kinds
of things, erasure coding, RAID parity generation, background deduplication, maybe security scrubbing,
maybe doing some kind of analytics on the data,
looking for certain structures, right?
These are all things that can be done, right?
And you could even expose this device
as another type of NVMe device if you so wanted, right?
You could tie it into the NVMe system.
You can tie it into the software stacks
that are running on the processor.
This one, obviously, NVMe over Fabrics,
tie it in to the RDMA NIC.
So rather than having all the data hit DRAM and go down,
it just goes to the specific drive,
and then the I-O execution pulls it into the NVMe devices.
And then the last one is, again, if you
want to save some money, potentially you only
have one NVMe SSD with a CMB. It acts like a write cache, and then you lazily copy it out to
other devices later in time.
And then another one, which I'm not going to get into.
So the Linux kernel has been changing.
It always changes and it's lovely to track it
and see where it's changing.
Lots of new ways to attach NVM.
PMEM is a really easy one for us to target,
because it treats PMEM as a block device.
It's not optimal, but it's something we can work with today.
We did some extensions to the Linux kernel
that turn on iopmem, which is related to controller memory buffers in the NVMe spec.
And I think there's a lot of interesting use cases
we can take advantage of, where we offload both DMA traffic
and functionality from the CPU.
Next steps.
So right now, nothing upstream supports this DMA
between PCIe devices.
There's been a few proposals. We've had PeerDirect for a long time from Mellanox for GPU RDMA.
IOPmem, we also have a recent submission based around DMA buffers.
And Christoph just mentioned another one tied into the IOMMU of ARM cores that we need to go take a look at.
I think as a whole, though, I mean,
the community in Linux works best when everyone gets
together, sits down, and goes, there's
lots of ways to solve this problem,
but which one's the right one?
Which one is the one that gives us a good API that consumers
can enjoy and consume that's going
to be best long-term going forward, that addresses
the majority of use cases, the majority
of concerns, the majority of issues.
I don't think individual companies throwing things at Linux RDMA is going to be the right
way to solve it.
I think the right way is to get the right people talking about it.
And that's something that we're trying to do.
There are issues around this.
There's security.
You're going to allow devices to DMA to places they haven't typically been allowed to DMA to. There's routing issues if you have complicated switching
and bridging. Can the devices actually see each other the way you think they
can? There's coherency, right? PCIe memory is not coherent necessarily, or at all.
And there's architecture specifics, right? ARM is different to x86, is different to MIPS, is different to anything else.
But I think there's a lot of potential behind this idea.
I think the fact that it ties into NVMe CMBs,
ties into NVMe over fabrics, ties into acceleration,
makes it very timely.
And, you know, I'm looking forward to getting the industry to have a discussion around it and moving forward.
Thank you very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.