Storage Developer Conference - #38: SPDK - Building Blocks for Scalable, High Performance Storage Application
Episode Date: March 28, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 38.
Today we hear from Benjamin Walker, software engineer with Intel Corporation,
as he presents SPDK, Building Blocks for Scalable High Performance Storage Application, from the 2016 Storage
Developer Conference.
This is the agenda.
The Storage Performance Development Kit: we're going to cover first what it is, and very briefly how we got started. I think that's less interesting to this crowd than some of the other topics. We're going to talk about our NVMe polled mode user space driver, which was really the starting point of the whole thing. We're going to talk about our NVMe over Fabrics target, we're going to talk about where we're going and the future things we're doing, and then some information about how to get involved.
So first thing we're going to cover is what is the storage performance development kit.
Before I explain what it is, let me outline the problem.
And this is a problem I'm sure you're all very familiar with. But a number of years ago, the I/Os per second of a hard drive
was about 500, something like that.
These are all read numbers, 4K reads.
Latency was somewhere around 2 milliseconds.
And that persisted for a decade or more.
Then we got SATA NAND SSDs.
These could do 25,000 IOPS.
I saw some go up to 50,000 occasionally.
You're talking about maybe 100 microseconds of latency.
More recently, we get the NVMe SSDs.
These don't really improve latency
because we're still talking about NAND media,
the same backing media,
but we get huge bandwidth numbers. Particularly
we can do much, much
more in parallel. And that's really the difference
between NVMe and
SATA SSDs. NVMe
SSDs can do wildly more work
in parallel. And the NAND itself
is highly parallel.
So the real problem we saw, and this was two or three years ago, was that we knew there was a new SSD coming.
We knew the numbers were going to be much better.
So we went and figured out: what is the cost of the software running on top of these SSDs that we think we're going to be building sometime in the distant future? 2017, now, so it is the distant future. We wanted to make sure the software was ready when the device came to market, and SPDK is really our attempt to solve that, or at least start the solution to that problem. So SPDK is fundamentally a set of
C libraries, user space
static libraries
with headers.
It's nothing magical.
We will call parts of it
a driver. That's weird to call a user
space library a driver.
It's a driver in the sense
that, yes, it controls the hardware directly.
It reads and writes registers, it maps a BAR, but it's also just a C library.
It's not part of the kernel, nothing like that.
All the code is completely open source.
It is three-clause BSD licensed, so very permissive.
You can take it and pretty much do whatever you want.
It's all on GitHub.
We try to document as thoroughly as we can. The driving principles behind SPDK are that it is user space, as I mentioned. It is completely lockless, at least through the I/O path. Sometimes we will implement code that has locks in it, when it does not need to be performant and it would be convenient for you. But all of the I/O paths, anything that needs to be performant, is completely lockless.
We do not rely on any interrupts from the SSD. It is polled mode only. That works out because
it is both far more efficient, but also because we don't have interrupts in user space. It can do millions of IOPS per core
in terms of 4K reads and writes to SSDs,
which I'll get into a bit.
And it was really intended always for Intel Optane SSDs
before we even had the name Optane three years ago.
That's what we were doing this whole time.
We released it before we announced
Optane because it does benefit
NAND SSDs, but the purpose
of this project was for Optane.
I'm going to talk a little bit about, and I don't
know if that blue is going to come through or not, but there's
four blue boxes up there.
I'm going to talk a little bit about the different types of things that go into SPDK.
Again, it's a collection of C libraries.
The idea is that you can take the ones you want, mix and match, integrate them into your application.
We have a number of example applications and supported applications as well, official applications.
But they're all based on these composable libraries.
So we break the libraries up for organizational sake into four different categories.
The first of which is hardware drivers.
These are the things that actually talk to the hardware.
They map the PCI address space of a PCI device into a user space process,
talk directly to the bar, they do MMIO, those sorts of things.
The next type of library or component we call storage services. These are the things right above the drivers. The most common one is a block device abstraction layer, where you can use a common protocol to do reads and writes and it will forward the I/O to the NVMe driver or the kernel or something like that. These are simple abstractions on top of the base drivers.
And I'll go through examples of all these. Then we have libraries and applications even
that are for implementing storage protocols.
There are two of those right now, iSCSI and NVMe over Fabrics.
These are ways to export block devices over the network.
And then the fourth type of library
are things designed to be used on the client side.
So libraries for writing an NVMF initiator
or something like that.
All right, so let's fill in some of these big boxes.
The two drivers we have released right now
are an NVMe SSD driver.
It is NVM Express 1.2.1 compliant.
We stay up to date with the spec.
We are in regular contact with the spec writers, work very closely with them, give a lot of
feedback, report a lot of bugs.
So all of the features of the spec are essentially supported by SPDK and will continue to be indefinitely.
We also have a driver for what Intel calls its quick data technology,
sometimes called IOAT.
Also sometimes the internal code name was Crystal Beach.
This is a DMA engine present on some Xeon platforms.
And so this gives you user space access directly
to the DMA engine on the platform
and you can do asynchronous copies and fills.
And it has other features as well,
but we really only expose copy and fill.
And again, polled mode access makes that fairly efficient.
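As a rough illustration of that copy-and-poll model, here is a minimal sketch using names from SPDK's public ioat header (spdk/ioat.h); exact signatures can vary between releases, and the channel is assumed to have already been attached via spdk_ioat_probe().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#include "spdk/ioat.h"

static bool g_copy_done;

/* Invoked from spdk_ioat_process_events() when the DMA copy finishes. */
static void
copy_done_cb(void *arg)
{
	g_copy_done = true;
}

/* 'chan' is assumed to have been attached earlier via spdk_ioat_probe(). */
static int
async_copy(struct spdk_ioat_chan *chan, void *dst, const void *src, uint64_t len)
{
	g_copy_done = false;

	/* Queue the copy on the DMA engine; this call returns immediately. */
	if (spdk_ioat_submit_copy(chan, NULL, copy_done_cb, dst, src, len) != 0) {
		fprintf(stderr, "failed to submit copy\n");
		return -1;
	}

	/* Polled mode: no interrupts, so we keep asking for completions. */
	while (!g_copy_done) {
		spdk_ioat_process_events(chan);
	}
	return 0;
}
```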
Right now, the biggest thing we've just released
is support for NVMe over
Fabrics. This is both a library
that you could incorporate into your
application, your
stack, as well as an application that
uses the library that just implements
an NVMe over Fabrics target.
We also a number of years ago,
wrote an iSCSI target
based on top of the NVMe driver,
and we just recently open sourced it.
So this component is actually
one of the oldest things,
but it's the most recent thing
we've open sourced.
But it's an iSCSI target
based on all the SPDK principles. There's
a SCSI library which does all sorts of SCSI processing. There's an iSCSI library which
does the network requests and things like that. There's a block device abstraction layer
which is a common block interface. It has operations like read, write, flush, trim, and you can do those and it will forward them to different backends. It could be the kernel libaio, it could be our NVMe driver, which is obviously the one that's most important to us, but we have a malloc backend as well.
And I'll get into this a little bit in a bit. So let's go into, I just have one slide here about how SPDK got started.
I'd like to cover very quickly. And that really requires a discussion of DPDK, the Data Plane Development Kit.
I think DPDK was started nine or ten years ago at this point.
This is a toolkit, much like SPDK,
designed for networking, network technologies, so that basically vendors could build switches
and networking products on standard Intel silicon.
They don't need specialized platforms anymore.
And it uses a lot of the same concepts that we're going to use in SPDK.
It is free and open source, of course, and it is polled mode, user space drivers,
zero copy through the stack, all these sort of things.
SPDK does depend on DPDK,
and I just want to talk about that a little bit.
We don't depend on this huge DPDK project.
We depend on a very tiny subset of DPDK,
mostly around the part that abstracts out
how you unbind devices from the kernel and rebind them to user space because we support Linux and FreeBSD and it's
different. We also use their lockless ring and they have a really slick
buffer pool, right, and that's it. So those things could all be taken out of DPDK and copied into SPDK, and then we wouldn't depend on DPDK anymore. So we share basically the core parts, the things that figure out how to map huge pages and all these sort of core operations that you need to write a user space driver. But we don't actually use any of their NIC drivers or any of the networking part of it.
Feel free to ask questions.
You guys can jump in any time.
Just stop me or I'll just keep going and going and going.
Okay, so next I'm going to talk about our NVMe driver. I think this is probably the most interesting component we've had for a while
until we released the NVMe over Fabrics target.
the NVMe driver
supports the 1.2.1 spec now, we've just added that.
we will continue to follow the specification very closely
It is a C library with a single header, completely and thoroughly documented,
statically linked into your
application.
You use it by basically initializing the device: you give it a PCI BDF, and it will go make sysfs calls on Linux and say, unbind your driver, bind myself, map the BAR, run through the NVMe spec initialization process, and then it gives you simple functions that you can call, like create an I/O queue or submit a read, things like that.
It is entirely asynchronous and it's zero copy and it's lockless.
So you give it basically a buffer and you say, go read into my buffer and you give it
a function pointer that you want it to call when it's done and then you call a function
to say, are you done yet?
Are you done yet?
Are you done yet?
And when it's done, it'll call your function and your data will be in the buffer you gave
it and it gets DMA'd from the device into that buffer directly.
There are no copies.
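As a minimal sketch of that flow, assuming the SPDK environment (hugepages, PCI access) has already been initialized with spdk_env_init(), and using names from the public spdk/nvme.h header (the probe and attach callback prototypes have changed a bit across releases, so treat the exact signatures as approximate):

```c
#include <stdbool.h>
#include <stdio.h>

#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *g_ctrlr;

/* Return true to tell the library we want to attach to this controller. */
static bool
probe_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	 struct spdk_nvme_ctrlr_opts *opts)
{
	return true;
}

/* Called after the device has been taken from the kernel, its BAR mapped,
 * and the NVMe initialization sequence completed. */
static void
attach_cb(void *cb_ctx, const struct spdk_nvme_transport_id *trid,
	  struct spdk_nvme_ctrlr *ctrlr, const struct spdk_nvme_ctrlr_opts *opts)
{
	g_ctrlr = ctrlr;
	printf("attached to %s\n", trid->traddr);
}

int
attach_local_nvme(void)
{
	/* A NULL transport ID means: enumerate the local PCIe bus. */
	if (spdk_nvme_probe(NULL, NULL, probe_cb, attach_cb, NULL) != 0) {
		return -1;
	}
	return g_ctrlr != NULL ? 0 : -1;
}
```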
You're the app?
Go ahead.
When this user space driver is loaded, how does it coexist with the kernel driver?
So they can coexist,
just not loaded on the same device at the same time.
So if I have 8 NVM...
Well, so this is just a C library linked into an application.
So this is only running when your application is running.
So the kernel module is always loaded.
But I thought the kernel module driver is the same one for any vendor.
Yeah.
How do you get rid of the NVMe driver, which is existing? Yeah. Yeah, but...
Oh, in Linux, in sysfs, you can tell it, you know, stop, you know, unbind from this device by PCI
you know, unbind from this device by PCI
bus device function, and then say
don't bind to this one again.
And the kernel will just leave it alone.
And it just shows up as an unknown PCI device.
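For illustration only, this is roughly what that unbind step looks like if you do it by hand; the BDF is a placeholder, and in practice SPDK's setup scripts take care of this, along with binding uio_pci_generic or vfio-pci afterwards.

```c
#include <stdio.h>

/* Illustrative helper: detach a PCI device from its current kernel driver by
 * writing its bus:device.function to the driver's sysfs unbind file. */
static int
unbind_kernel_driver(const char *bdf)
{
	char path[256];

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", bdf);

	FILE *fp = fopen(path, "w");
	if (fp == NULL) {
		return -1;	/* no driver currently bound, or no permission */
	}
	fputs(bdf, fp);		/* the kernel driver releases the device */
	fclose(fp);
	return 0;
}

int
main(void)
{
	/* Placeholder BDF; afterwards bind uio_pci_generic or vfio-pci so a
	 * user space process can map the device's BAR. */
	return unbind_kernel_driver("0000:01:00.0");
}
```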
And then we use either UIO
or VFIO in the kernel to say
we want to map your bar
of this device. And with UIO, we put
a placeholder driver in so the kernel knows somebody
owns it. And VFIO just
has an ioctl and you say, I want this one.
And then the kernel knows
somebody owns this.
In VFIO, it's even nicer. They've added
a lot of great features
where when you say, I want to take this device,
it'll do a security check, make sure you have permission,
so you can run as a non-privileged user while you're doing this.
And then also, it's tied into the IOMMU on Xeon platforms
so that the IOMMU will understand
when you map that device into your user space process
that you can only DMA within your process.
Because when you program the NVMe device, the DMA engine on it,
you give it physical addresses,
which could be anywhere, right?
But the IOMMU will limit you
to only DMAing inside your process.
Go ahead.
So the user process configures the IOMMU so the device can DMA into its memory. How does the user process keep that memory pinned?
Yeah, so there's challenges around pinning the memory, right, so it doesn't disappear.
Right now we solve that by requiring all the data be allocated from huge pages.
And huge pages are static and don't move.
So that's a quick and easy solution.
It also actually makes the performance a lot better,
which is a nice side effect.
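Concretely, I/O buffers in an SPDK application come from that pinned, hugepage-backed memory rather than plain malloc; a minimal sketch with spdk_dma_zmalloc() from the spdk/env.h header, assuming the environment has already been initialized:

```c
#include "spdk/env.h"

/* Allocate a zeroed 4 KiB buffer, 4 KiB aligned, from hugepage-backed memory.
 * Hugepages are statically mapped and never move, so the buffer is safe to
 * hand to the device for DMA for the life of the process. */
void *
alloc_io_buffer(void)
{
	return spdk_dma_zmalloc(4096, 4096, NULL);
}

void
free_io_buffer(void *buf)
{
	spdk_dma_free(buf);
}
```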
So that's one way to do it.
There are other strategies,
some of which would require kernel support.
But yeah, basically we say don't kill the process. But it's allocated out of a huge page, so even if a DMA is still in flight after the process has ended, it's not corrupting your system. And the IOMMU is preventing you from DMAing over critical kernel structures or something. Yeah. Go ahead.
So I think that when I looked at the code, I didn't see any locks, except a standard one around writing to the admin queue.
Yeah.
And also on the completion queue, you'd mark it by the...
Only on the admin path.
So the admin path, those operations
are not performance sensitive
and there's only one admin queue globally.
Yeah, but I think the idea is you dedicate a core to the device, right? So I'm thinking...
Well, so we actually access one device simultaneously
from many different cores,
each one having their own I/O queue.
And then each of those threads, or cores, for us, those are the same thing.
You pin one thread to a core.
Each of those may want to do an admin operation at some time,
and so they coordinate with a lock.
And because admin stuff is not in the I/O path, we thought it would be more convenient
to make the admin operations thread safe
because they all have to operate on the same queue.
IO queues, you create one per thread,
and you only use it from that thread, and there's no lock.
Okay, so this is just a quick table.
I'm sure you've all read it by now
of features we support,
which is kind of nice.
This actually makes it a fantastic testing platform
for new SSD features
because, again, it's user space.
If it crashes, you don't blow up.
Go ahead.
Do you support DIF that is in line with the data block? So where you have the protection information with the data, and it gives that PI information basically right back to you?
I would have to look at the header
to see if we've added that.
I believe we have.
I believe we have.
I could go confirm.
The problem is I don't know if we have a device
that can do that.
So, like in our lab, you know, it's a practical concern, right? So even if I claim it works, and even if the code is there, I don't believe that's something that's been tested.
Yeah that's been tested, I'm pretty sure.
But if you move the PI out, right, I don't know if we...
I don't know. I'd have to ask our validation team to see.
My question was, do you support where PI is with the block, not separate?
We should support both is what I'm saying.
And I'm not confident that we've tested either.
It's more likely that we have tested in-line as opposed to separated, but
I can only think of one device that we would have
access to that supports that and I don't know if that
test has been run.
But if it doesn't work
we'll fix it. You're asking, is it asynchronous or non-blocking?
Let me explain a little more, expand on what we mean by asynchronous or non-blocking.
Yeah, so the general flow for an I/O is you call a function, let's say it's called read,
and you give it an LBA start
a length, a buffer
that is the size of however
many blocks you want to read.
And you also give it a callback
a function pointer.
And you call that function. That function
will never block under any circumstances.
It may fail.
You try to do too many operations at once,
it may come back and say,
you need to try this again. I couldn't do it.
It will never block.
Then you call another function
that says, check for completions.
And you can bound it on how many it will complete at a time.
And you're supposed to poll
that function.
You keep calling it. And when it finds completions,
it will match them up with the callbacks you gave
originally when you submitted them,
and it will call your callback.
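Here is a minimal sketch of that submit-then-poll flow; the function names come from spdk/nvme.h, the namespace and queue pair are assumed to have been created during initialization, and the buffer should come from pinned hugepage memory as discussed earlier.

```c
#include <stdbool.h>
#include <stdint.h>

#include "spdk/nvme.h"

struct read_ctx {
	bool done;
};

/* Completion callback, invoked from spdk_nvme_qpair_process_completions(). */
static void
read_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
	struct read_ctx *ctx = arg;

	ctx->done = true;	/* the data has already been DMA'd into our buffer */
}

/* Read 'lba_count' blocks starting at 'lba' into 'buf' and poll until done.
 * 'ns' and 'qpair' are assumed to have been created during initialization. */
int
read_blocks(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qpair,
	    void *buf, uint64_t lba, uint32_t lba_count)
{
	struct read_ctx ctx = { .done = false };

	/* Never blocks; it may fail if the queue is full, in which case you
	 * poll for completions and try again. */
	if (spdk_nvme_ns_cmd_read(ns, qpair, buf, lba, lba_count,
				  read_complete, &ctx, 0) != 0) {
		return -1;
	}

	/* Polled mode: keep asking the queue pair for completions. In a real
	 * application you would go do other work between polls. */
	while (!ctx.done) {
		spdk_nvme_qpair_process_completions(qpair, 0 /* no limit */);
	}
	return 0;
}
```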
Is there a poll semantics where you can post some sort of descriptors for the submissions, and then when it returns, you get a vector of the ones that have completed?
Summarize the question? Oh yeah, I apologize. The question is, is there something like a select, or, you know, where you can say poll all of these?
NVMe queue get reordered
the completions don't come in the right order
so we just call
whatever callbacks they match up to
right when you pull us.
So there's no blocking.
There's no, you know, like, call me when something happens.
No, you have to constantly poll,
and it just calls the functions as we find them,
basically in a loop.
So I wonder what I'm supposed to do. Am I supposed to spin around until my read is finished, or am I supposed to do something else, like issue new reads?
Yeah, so the question there is
I submit a read.
I have to call my poll function
to wait for it to finish.
What am I supposed to do?
Should I sit in a busy loop
and wait for it to finish
or should I go do something else?
New reads. Yeah, and so the answer is never spin in a busy loop, don't do that. A lot of people do that, and that's absolutely not our intention. Go do something else productive, right? You know, you sometimes have to rethink how you're doing your application, but you need to be thinking like a pipeline, you know, like I submitted this, so now I can go do something else.
Keep going. Do as
much as you can. Get that queue depth up.
Submit as many reads as you possibly can.
Don't just sit and wait
for completions unless you really have nothing else
to do.
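In code, that pipelining advice amounts to filling the queue up front and resubmitting from the completion callback, rather than waiting on each read individually; a sketch building on the previous one, with the queue depth and the io_unit bookkeeping being illustrative rather than part of the SPDK API:

```c
#include <stdint.h>

#include "spdk/nvme.h"

/* Illustrative per-I/O bookkeeping; not part of the SPDK API. */
struct io_unit {
	struct spdk_nvme_ns    *ns;
	struct spdk_nvme_qpair *qpair;
	void                   *buf;
	uint32_t                blocks;
};

static uint64_t g_next_lba;

static void io_complete(void *arg, const struct spdk_nvme_cpl *cpl);

static void
submit_read(struct io_unit *io)
{
	if (spdk_nvme_ns_cmd_read(io->ns, io->qpair, io->buf, g_next_lba,
				  io->blocks, io_complete, io, 0) == 0) {
		g_next_lba += io->blocks;
	}
}

/* Each completion immediately resubmits its slot, keeping the queue full. */
static void
io_complete(void *arg, const struct spdk_nvme_cpl *cpl)
{
	struct io_unit *io = arg;

	/* ...consume io->buf here, then refill the pipeline... */
	submit_read(io);
}

/* Fill the pipeline to the chosen queue depth, then just keep polling;
 * in a real application this loop would be doing other work too. */
void
pump_reads(struct io_unit ios[], int queue_depth, struct spdk_nvme_qpair *qpair)
{
	for (int i = 0; i < queue_depth; i++) {
		submit_read(&ios[i]);
	}
	for (;;) {
		spdk_nvme_qpair_process_completions(qpair, 0);
	}
}
```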
What feeds the poll?
You only get the callback in response to the poll.
There's no interrupts.
Oh.
No other threads.
There's no threads in SPDK, only your threads.
When you call that check for completions,
that looks at it.
So you decide when it checks for completions.
So SPDK will never spawn a background thread for you in any of our applications ever.
Right?
And we'll never use an interrupt.
So we only do things in response to your request.
That's why we're just a C library, right?
We have no threading model.
We're not enforcing anything like that.
You decide when the work happens.
And you can put bounds on check completions.
You can say, only complete five.
If you're worried about quality of service
and things like that.
Go ahead. Are you asking if you can message pass?
No, no.
Malicious attacks.
With DMA directed at physical memory, is it possible to overwrite something you don't have permission for?
If you're running as root, yeah.
If you're using VFIO as a non-privileged user,
the IOMMU will stop you from DMAing outside your process.
Yeah.
So the hardware on Intel Xeon platforms,
if you set it up correctly, and this is some extra work,
you know, it's not how our examples run,
the hardware is capable of preventing you from,
preventing that device, that PCI
device, its DMA engine, from
writing to addresses outside of its process space.
So
if somebody
doesn't do that...
If you don't set all that up, and it takes
considerable effort, but if you don't set all that
up, it can DMA wherever it wants.
Alright,
let me get back. So, here's
a quick performance slide.
This is just sort of
comparing Linux
libAIO
on one core. This is all for one core
versus SPDK on one core.
We're opening up a block device directly
with libAIO. We're doing
asynchronous reads and writes.
It's sort of a similar model with libAIO as it is to SPDK.
You pull for completions, these sort of things.
But it just shows you the magnitude of the performance difference.
On a single core, and I'm sorry if that gray doesn't come through very well,
but on a single core, the kernel in our testing gets between 300,000 and 400,000 4K random reads per second using LibAIO on an Intel P3700 SSD.
SPDK, as we add SSDs, just adds the max IOPS of that SSD.
And it keeps going.
Now, I should warn you that it does not go past 8 on a single core.
If you add another core, it'll go to 16.
But good luck finding PCI lanes.
So there's some trouble measuring this scaling too far
because I can't get enough PCI lanes.
But definitely up to 8 is just linear.
And then
we finally hit the CPU bottleneck.
About 300 nanoseconds.
Total.
Total.
Round trip.
This is round trip. So we have one core
with one
queue pair per SSD, and there's no benefit doing more queue pairs for an SSD in terms of performance. So we have 8 queue pairs and we're trying to submit 128 I/Os per queue pair. So we submit them all and then we're polling them all,
and then every time we find a completion, we submit another,
and it takes eight of those to finally get to the point
where the CPU is not just busy waiting.
But if you want to take advantage of that reduction in latency, you do have to sit there and poll, right?
You do.
You do.
So your application presumably would do more work than just submit a pre-generated read, right?
So you would have other things you'd have to figure out.
I mean, you'd have to work through the whole architecture
of your application, right, to figure out,
you know, okay, you freed me from blocking.
Now I've got to go find something productive to do.
Okay, so this is the overhead,
and we measured this not too long ago.
It's kernel 4.7 RC1.
Again, you see like the 300 nanosecond software latency.
I'm subtracting out the device.
I'm subtracting out the PCI flight time.
The Linux kernel, you know, we're at 5,500 nanoseconds, which is not bad, you know, 5.5 microseconds, it's not terrible. So these are some of the things we're doing to
be fast. Some of these things the kernel will do over time, I have no doubt. The kernel
will get faster. For instance, I know they're implementing polling, and they already have it in NVMe.
A lot of these techniques are going to be employed in the kernel and the kernel will get better.
Some of them you can't do in an operating system kernel because they're specific things.
You know, you're making assumptions. With SPDK, your process owns the whole device.
Right? So you don't have to share it with anybody. And you're also controlling
the threading model. So you don't have
random threads trying to figure out
what hardware queue to use.
You set that up ahead of time.
So there's no locks.
There's other things like that that you just can't
do in an operating system kernel.
Now the price is you have to design your application for that.
Which is
definitely significant optimization work.
But the kernel will get faster than this, I have no doubt.
Okay, so let's talk about NVMF,
because I think that's what a lot of people want to hear,
and I don't want to run out of time.
But before I talk about NVMF, I'm going to talk about iSCSI.
And this will lead into it clearly.
We wrote the iSCSI target about three years ago
using all these techniques just to prove if this was reasonable,
to prove if this would help
and all these sort of lockless polled mode things.
And it turns out it does help significantly.
So here's a comparison of Linux LIO versus SPDK.
The kernel here,
LIO kernel, is using
32 CPU cores
to do 2.25
million 4K random
reads. Every time I say IOPs,
it's 4K random reads.
SPDK's taking
21 cores to do
2.8 million?
That's better, right?
But that's not dramatic.
So we're using non-blocking TCP sockets.
We're pinning iSCSI connections.
When a new connection is established, we look up what device is it for,
and we migrate it to that
core and then we only use it from that core so we can do all sorts of modifying state
completely locklessly.
Every device has one NVMe hardware queue that's on the same core as the sockets, the connections
that are trying to use it so that you can pull off the TCP stack,
translate to NVMe, dump it on the queue all without a lock in line, just keep going.
It turns out that of these 21 cores, more than 70% of their work is doing TCP. We could have maybe thought about a user space polled mode TCP stack. I don't know how much benefit we'd get there. I think that's a viable strategy for improvement, people could think about that, but we kind of just left it there for now. So if you work out the math here and you say, how many IOPS per core
is this?
It turns out SPDK got
about 2x.
You know, that's great.
It's not incredible.
But again, most of that time
left is TCP.
So this is setting me up for why
NVM express over fabrics.
So first of all, it does eliminate a layer of translation from SCSI to NVMe, because we can send the same NVMe command, or close to the same NVMe command, with a modified SGL over the fabric as we actually send to the device.
That overhead is tiny.
The real advantage here is RDMA. The RDMA cards effectively offload TCP. So for iWARP, it really is TCP. For RoCE, it's UDP.
And that's responsible for a huge performance increase.
So here's what we get with NVMe over Fabrics.
And I apologize for the max at 1.3 million or whatever.
That's just the fastest NIC we had.
So they're both going to cap the NIC.
But the kernel took 11 cores and SPDK took one.
And you can make the kernel do it a little bit more efficiently than that if you really disable cores.
Some of the performance testers have done that.
But that's not a realistic scenario. You have to be able to use those cores for something else to actually gain value.
So disabling them is not saying anything. So if you turn it back on, it takes 11,
and I'm sure they'll get
a little bit better over time
So any questions on the NVMe over Fabrics? Go ahead.
Do you have an NVMe over Fabrics initiator?
I have it on a later slide.
But we don't have it written.
How many initiators did you have?
How many? What does it say? It's a lot. This is using the in-kernel initiator for both.
it's the same one
we just spin up as many initiators
as we need
Our target should be able to do the full performance over a single connection, the way it's set up.
Because for us, it's all being processed on one core.
You can use more cores, you know,
it's just we only need one to saturate the NIC.
But since we're all on one core,
there's no real difference for SPDK
if you have 10 connections or one.
We're looping over the list of connections, right?
So it doesn't ultimately matter.
It's a lot like the internals of an SSD
where typically it doesn't matter how many
queue pairs you have. It doesn't change the max performance
of that SSD.
The queue pairs are really for the client.
Okay, so I'm going to go through
some of the other stuff in SPDK because we're running a little short on time.
One thing we have is a block device abstraction layer, which came out of our iSCSI target.
This is a generic block layer that really just is taking what look like bios from the kernel or something like that, a little different structure. But they have, you can do like read, write, trim, flush.
You can submit them asynchronously to this thing and it will translate them to the appropriate
thing underneath. So we have three modules open source underneath right now.
Our NVMe driver, Linux lib AIO, and a RAM disk.
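As an illustration of that common interface, here is a minimal sketch of an asynchronous read through the block device abstraction layer; the names come from SPDK's current public bdev header (spdk/bdev.h), which has evolved since this talk, and the descriptor and I/O channel are assumed to have been opened already (for example with spdk_bdev_open_ext() and spdk_bdev_get_io_channel()).

```c
#include <stdbool.h>
#include <stdint.h>

#include "spdk/bdev.h"

/* Completion callback for the bdev read. */
static void
bdev_read_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
	bool *done = cb_arg;

	*done = success;
	spdk_bdev_free_io(bdev_io);	/* release the I/O descriptor */
}

/* Submit one asynchronous read. Whether the backend is the NVMe driver,
 * Linux libaio, or the malloc RAM disk, the call looks the same; the bdev
 * layer forwards it to the right module underneath. */
int
bdev_read_example(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
		  void *buf, uint64_t offset, uint64_t nbytes, bool *done)
{
	*done = false;
	return spdk_bdev_read(desc, ch, buf, offset, nbytes, bdev_read_done, done);
}
```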
All right, I'm going to talk about the future a little bit.
Excuse me.
It's catching up to me.
All right, so I showed this picture a little bit earlier.
So I'm going to add some things to it.
We added,
we're thinking about, in the second half of this year,
doing an NVMF
initiator with the same principles.
So it would be user space
polled mode, based on RDMA
and on our NVMe driver, of course.
It will have
the same interface as our local NVMe driver.
It's the same strategy we currently use. We just swap out the PCI part underneath the NVMe driver and put in a fabric.
It'll be the same header file.
So you can use a remote device and a local device with SPDK using the same API.
We're going to put the block device abstraction layer into our NVMe over
Fabrics target. I think that's already there.
Excuse me. I believe that's
already there.
I think it went out just a week or two
ago.
So it'll be similar to iSCSI.
And then we've got some performance issues
when you scale to large numbers of devices.
32, 64 SSDs, polling in the loop,
especially on platforms with small caches, CPU caches.
We know how to solve this.
We need to leverage some of the RDMA features a little bit better.
But basically, it's adding fixed latency based on the number of SSDs when you're looping over the list and polling.
Because we're missing the cache too much, the CPU cache too much.
But we have a clear line of sight how to fix that.
So we'll get that in the second half of this year. And then as we get more RDMA devices to test with and higher performance
RDMA devices, we'll continue to make sure this scales up. Okay, so the next thing we're adding,
I'd like to talk briefly about, and we'll see how much interest there is in this, what I'm calling a blob store. And this, I don't know.
We'll see about the name.
But most people say,
so if you work for like a SAN vendor or whatever and you're exporting iSCSI devices today, it's obvious how you'd use SPDK, right?
If you're a database author
or something like that, key value store, you say, oh, you're
so much more CPU efficient than the kernel. That's great. I need a file system. So what do you do?
So most applications want some level of file semantics. That's just a fact. There's no way
around that. Fortunately, most
applications need a tiny fraction
of what a file system can do.
And that was a realization we had.
Particularly databases and key value stores
optimized, heavily optimized storage
applications are using as little
from the file system as they can get away with.
Because file systems are slow.
So they use a small number of files.
They use a flat hierarchy.
They don't care about permissions.
They own the whole disk.
Something like that, right?
So you can't use a kernel file system
on top of a user space driver, of course.
The other problem is POSIX.
And POSIX is slow
and there's nothing you can do about it
except for make the media so fast
that it's like RAM
which we're trying to do
but on an NVMe device
where you have 3 microseconds of flight time
POSIX is just not going to work
at high performance
because it's blocking
and assumes certain copies and all these sorts of things. So we really need to move away from that, and fortunately many databases already have. They're using libaio.
okay so I'm calling this thing that I would like to build a blob store.
And I'm choosing the word blob with some thought,
despite that I don't like the way it sounds.
I don't want to call it a file system.
It's not anywhere close to a file system.
I don't want to call them objects,
because objects is taking on a life of its own.
That sort of assumes like a get, put, delete interface.
It's not going to have that.
This thing is going to be,
and we're working on this now,
it's going to be asynchronous, polled mode,
lockless, event-driven, just like the rest of
SPDK. Essentially, you allocate
a blob on the disk, and it's a
block allocator. You can read and
write blobs in units of blocks,
logical blocks.
You can delete them, you can resize them, whatever you need to do, and they are persistent. You
shut down, you come back up, it finds them all again. They have a name, you don't even
give it the name, they have a GUID essentially, you say I need a new blob, and it gives you
a GUID. And then you say make my blob this many blocks, and it does it.
And it turns out this is probably the lowest level abstraction that you need on top of a block device.
It's the lowest level abstraction that I can possibly fathom that would be valuable to someone.
And this is actually probably enough of an abstraction to port a number of databases on top of SPDK.
This is very similar to Bluestore in functionality.
It's probably enough to port RocksDB on it.
It's probably enough to port MySQL's InnoDB engine, which
is using libaio direct to the block device.
But this is intended to be a framework, rather,
for building higher-order services if you need them.
You could build a mostly POSIX-compliant file system on top of this
if you wanted to go slower, for instance, if your system needed it.
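To make the shape of that interface concrete, here is a purely hypothetical sketch; none of these names are real SPDK functions (the blob store had not been written at the time of this talk), they only illustrate the asynchronous, callback-driven create, resize, and read-write-in-blocks model being described.

```c
#include <stdint.h>

/* Hypothetical types and declarations, for illustration only. */
typedef struct blob_store blob_store_t;
typedef struct blob       blob_t;
typedef uint8_t           blob_id_t[16];   /* the GUID the store hands back */

typedef void (*blob_op_cb)(void *cb_arg, int status);
typedef void (*blob_create_cb)(void *cb_arg, const blob_id_t id, int status);

/* Ask for a new blob; the store picks the GUID and calls you back with it.
 * Like the rest of SPDK, everything here would be asynchronous and polled. */
void blob_create(blob_store_t *bs, blob_create_cb cb, void *cb_arg);

/* Grow or shrink a blob to a given number of logical blocks. */
void blob_resize(blob_t *blob, uint64_t num_blocks, blob_op_cb cb, void *cb_arg);

/* Read and write in units of logical blocks, zero copy, with a callback. */
void blob_read(blob_t *blob, void *buf, uint64_t start_block,
	       uint64_t num_blocks, blob_op_cb cb, void *cb_arg);
void blob_write(blob_t *blob, const void *buf, uint64_t start_block,
		uint64_t num_blocks, blob_op_cb cb, void *cb_arg);
```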
Now, I don't know how close to POSIX you get before this is not worth using.
And you should just use the kernel.
There's some cutoff. I don't know where that
is.
But everybody
will find that, I'm sure, at some point.
So you may have questions about that.
That'll be second half this year.
Excuse me as I lose my voice in the middle of my talk.
Okay, so the last thing I want to talk about is vhost-scsi,
which is another little side project we have that I think is pretty cool.
This is essentially my last slide here.
We are implementing right now
a mechanism to take
the BERT IOS SCSI IOS
out of QEMU
and route it directly to a user space process
using VHOS SCSI
where that user space process is accessing
the device using SPDK.
And this centralizes all the I/O work,
essentially from a set of VMs onto a single storage service running on the hypervisor,
all in user space.
So then a lot of companies can innovate in user space,
do the proprietary stuff there in that one process.
It also would be possible to forward to our NVMF initiator,
for instance, from that user space process.
That would be another choice.
Instead of going to the local NVMe device,
go to a remote one. Since my voice is not cooperating,
I made it.
Join us at sbdk.io.
We'd like to be an open source community.
Join our mailing list.
We're active on there.
Please send us feedback, report bugs.
We're open to
contributions.
With that, I'll take questions.
Go ahead.
Do you support multiple different NICs?
For the iSCSI?
So for iSCSI, for SPDK, we just use the kernel TCP/IP stack.
So whatever the kernel supports, right, which is everything.
If you were to do a user space TCP/IP stack on top of like DPDK or something,
yeah, you'd have to worry about does DPDK have a driver for my NIC?
You know, but the one we're using is just the kernel for the TCP. But again, it's 70% of the CPU.
So it's the bottleneck.
Go ahead. I have two comments.
Okay.
About the blob stores, very interesting thing.
And I know at least two teams that have been trying to invent these things, you know. But I think each one is just doing something that they really consider, right now, the thing they should do, like inventing blob stores of their own. And yeah, if it's coming, it's very cool.
And the second thing is, I will rephrase my first point. Can you give us some recommendations, based on your own experience, about the saturation point? Like, should a thread issue as many reads and writes as my transactional logic allows, or does the device saturate sooner?
Yeah, so the question...
There it goes.
Okay, so the question was,
can I give some advice
or feel about
where the saturation point is
for how many I/Os
you really need to submit?
And it totally depends
on the device, right,
which is sort of unfortunate.
For many devices, the answer is as many I/Os as you can, right?
You really need to get to max queue depth to saturate these things. The Optane line of devices saturates at much, much lower queue depths, you know, four or eight or something, I don't know.
They're quoted numbers.
So it's a huge advantage to that.
You know, this next generation of media makes it much easier
to hit the full performance of the device
because you can do it at such low queue depth.
But for today's NAND devices, even the fastest ones,
128, 256, whatever you can throw at it,
they need all the queue depth they can get.
So according to your experience, and it can sound strange, it can saturate at a much lower point?
Random things happen.
Yeah, it can also depend
on the I/O size.
And also, you know,
depending on background work
that the device is doing,
it's really tricky to benchmark writes on NAND
because the device is always cleaning up in the background
and making them appear better than they are.
So it's a real challenge to get accurate write saturation numbers.
To piggyback on that,
what kind of instrumentation metrics do we have in terms of, from the library, like the number of reads of certain sizes, the distribution, when did they complete?
Yeah, so the toolkit is still a bit immature
in terms of full instrumentation.
We have this tracing library, which will,
you know, you can set up trace points and log
with accurate timestamps at any point in the whole code.
There's a number of those in the IO path.
We are currently in talks with the VTune team
to provide a full GUI solution on tracing latency
using their existing disk IO tool that works with the kernel.
I don't know where that will go.
I'm not making any promises or announcements. Right, but that would be my ideal. Say that again for me? For the virtualization example, why not use DPDK for the data plane? Like the data plane, like the NICs?
Yeah, the vhost-scsi solution is based on what DPDK did in this space with their vhost-net.
It's the same idea, where they route the I/O to a user space process to do virtual switching.
So we're doing the same thing they are.
Anything else?
All right, then I think we're good.
Oh, never mind.
Well, I'll just talk to you.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.