Storage Developer Conference - #70: SPDK Blobstore: A Look Inside the NVM Optimized Allocator
Episode Date: May 30, 2018...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts.
You are listening to SDC Podcast Episode 70.
Hi, my name is Paul and this is my colleague Vishal. We're going to take you on a little
tour of Blobstore today.
All right, so unless you've been living under a rock, you probably have heard of SPDK.
So we're going to briefly go over what that is, why it is, and sort of how you can get involved, how it came about.
And then I'll go through an overview of what Blobstore is, just a couple of high-level slides,
give you some idea for the motivation, a little bit on the on-disk layout.
And then I'm going to purposely insult your intelligence
by walking through an incredibly simple example.
This is our Hello Blob example that is just straight through,
run to completion, create a blob,
read and write to a blob, do a comparison,
delete the blob, close the thing down.
So it'll give you an idea of some of the other elements
of the SPD framework as well. So I'll take a few opportunities to sidetrack from Blobstore
as we go through that code example. And then Vishal is going to walk you through some performance
numbers with Blobstore, specifically with RocksDB. Okay, so what is SPDK? Well, it is a bunch of different software components.
It's not a library.
It's a set of libraries.
It's a set of tools, a set of building blocks.
It's a framework.
It's kind of whatever you want to call it.
And we'll see a block diagram come up on the next slide.
You can sort of optionally choose almost all of them and have them work together,
or you can just pick one thing out of it and use it on its own.
And there are implementations of it from A to Z,
from folks just using one of the NVMe drivers
to the entire framework to create an iSCSI target
or an NVMe-oF target.
There's just all sorts of things you can do with it.
It's open source and it's BSD licensed.
I think all that's clear enough.
I'm going to talk a lot more about the community
here on the next slide,
but that's a really important part of it.
It's not the case that we just took some source code
and threw it out there, pointed you at GitHub, and said, good luck.
We're building a big and a strong community around it
with lots of different players besides Intel.
And, of course, the theme, again, unless you've been living under a rock,
you already know this.
The theme is user space polled mode stuff, right?
Okay, so the colors are bleeding out just a little bit.
But I said I'm not going to go through all of these.
I think there was a talk here last year that probably went through the majority of these
or at least the ones that existed at the time.
The idea here is just to give you sort of a high level
what are the big picture components that you can pick and choose from.
So at the top, we've got our storage protocols.
So you can see we've got the NVMe-oF target and the iSCSI target.
And then sort of our middle layer is storage services.
And the real foundation of storage services is our block device abstraction layer, very analogous to what's in the kernel.
You can actually come along and write your own block device driver
that plugs into this framework.
You can do that to support a block device,
or you can do it in sort of a filter driver model
to add some sort of value to the IOPath.
So you would suck blocks in at the top and spit them out the bottom,
and then they would end up talking with one of the polled mode drivers
and out onto disk.
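As a toy illustration of that filter-driver idea, here's a compilable sketch; the types here are hypothetical stand-ins, not the real SPDK bdev module API (which is a table of function pointers you register with the bdev layer):

    #include <stdio.h>

    struct toy_io {
        unsigned long lba;   /* which block, plus a data buffer */
        void *buf;
    };

    struct toy_bdev {
        /* every bdev exposes the same entry points... */
        void (*submit)(struct toy_bdev *self, struct toy_io *io);
        /* ...and a filter bdev also points at the bdev below it */
        struct toy_bdev *lower;
    };

    static void nvme_submit(struct toy_bdev *self, struct toy_io *io)
    {
        (void)self;
        printf("polled-mode driver got LBA %lu\n", io->lba);
    }

    static void filter_submit(struct toy_bdev *self, struct toy_io *io)
    {
        /* add value in the I/O path here: compress, encrypt, count... */
        self->lower->submit(self->lower, io);   /* spit it out the bottom */
    }

    int main(void)
    {
        struct toy_bdev nvme = { nvme_submit, NULL };
        struct toy_bdev filter = { filter_submit, &nvme };
        struct toy_io io = { 42, NULL };

        filter.submit(&filter, &io);   /* blocks in at the top */
        return 0;
    }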
And then, of course, you can see Blobstore,
the focus of this talk,
sitting off here to the side,
this optional thing
that we will talk about here in just a second.
And, of course, why did we do this?
Why did we start this?
This was started as kind of a science project
in the lab at Intel many years ago
and then grew into something bigger and eventually grew into open source
and now is growing into a much larger open community.
And the reason is because it delivers, right?
It doesn't deliver in all use cases.
It's not meant to be a general purpose storage accelerator for whatever happens to be out there.
And that's the beauty of these open communities
and these open source projects,
there's something to fit everybody's needs.
And this was fitting enough needs
that a lot of people started showing up
and started gaining interest and started contributing
and really realized the performance gains
that you see up here.
So 10x against NVMe-oF with the kernel target,
8x against NVMe through the kernel in IOPS.
Better efficiency in RocksDB workloads.
You know, very specific, of course, anything with performance is specific to the workload.
Vishal will go through some of that with you.
And then efficient use of development resources is kind of the cornerstone, again, of this being an open source project. The idea with the SPDK building blocks
is that they are all of the
not-so-fun shared
stuff that everybody needs if they're going
to write a high-speed user space
storage application.
So why should everybody rewrite the entire
thing every single time?
Just nature of open source. Let's all contribute to
the common stuff, provide enough
hooks for vendor-unique stuff and for people to add value, and then
march forward and we all get better use of resources.
So yeah, SPDK is as fast as a cheetah, apparently.
Okay, so the community.
This is pretty exciting. This has been growing a lot;
really, a lot of change in the last six months.
So initially, like I said,
it really was one of those typical...
source code was developed internally.
Let's put it out on GitHub and call it open source and be done.
There was enough interest after that happened
that it kind of... things had to change
or the project wasn't going to go anywhere.
So it really is an open community now,
and it continues to grow.
And so, for example, besides, of course,
GitHub being our source repo,
we've got a Trello board.
If you're not familiar with Trello,
it's just real, real fundamental
project-managing kind of stuff,
but easy enough to use that us developers can use it also.
So we've got a bunch of Trello boards, actually, on that link.
That's where we keep not
only the backlog of things that we're interested in working on with our resources, but things that
other community members are interested in working on too. So they can be anything from,
we've got a board just for low-hanging fruit, we call it. So anything that a developer wants to do
sort of a drive-by, figure out how to go make a contribution to SPDK,
have a little bit of fun.
There's a whole list of really simple things that just nobody's gotten around to,
all the way to a big ideas board
that is just big, wild, and crazy stuff.
Wouldn't it be great if,
and then somebody can come in and fill in the blanks
if they're interested in doing that.
So our whole backlog is visible.
There's no behind-the-scenes stuff going on anywhere.
All of our reviews
are on GerritHub now, and this is a reasonably recent change. So everybody is encouraged
in the community to get involved with reviews. You can see all the patches. You can see all
the comments. You can vote on the reviews. And you can see what's coming down the pipeline.
You don't have to just sort of wait for a pull request to get merged to see what's going
on. We've got a channel on Freenode that is starting to pick up a lot more use,
so you can pretty much catch any of the SPDK experts almost around the clock.
It's still a little bit skewed towards U.S. time with the way the maintainers are set up,
but hopefully there's somebody there 24 by 7 to engage in discussion or to answer questions or whatever.
And then our homepage is spdk.io,
where you can get links to all this stuff.
And there's some documentation and fun stuff like that.
Okay, and then coming up in November,
we've got our first developer meetup, or hackathon.
It's actually full now.
We might be able to squeeze in one or two more if there's somebody who's
experienced with SPDK and you really want to come in
and participate and add some value,
we can definitely look at trying to make that happen.
Really, the only rule
to our first
annual hackathon, we might actually do
two a year, is
no presentations like this.
There's no PowerPoint.
These are developer meetups where we are planning small group breakout sessions
to either tackle big code reviews, push a patch through that has taken forever
because it's maybe too big or too complicated,
so we use that time face-to-face to move that through.
We can do some design discussions explaining existing modules,
talk about tactical features coming up,
or brainstorm what we're going to do with those,
and who wants to work on what.
So we're excited about this.
About two and a half days of activity in Phoenix.
So if you're interested, let me know, and we'll see if we can squeeze you in.
Okay, so let's talk about Blobstore.
So probably the best way to introduce it is to talk about the motivation.
Why did we go do this?
Why did we invent something when there's already so many different things invented out there?
And part of it is because we can, right?
Anybody can.
You can come in here and develop something and throw it out there and see what kind of feedback you get.
But really this was because of the success of SPDK and the popularity of SPDK.
If you look through the block diagram that I showed earlier, almost everything up there is
block-focused, right? Just geared around serving blocks, consuming blocks, speeding up block
storage. So really, the question came to mind, I guess it was last year sometime, what about other
applications that don't know what to do with blocks?
The most obvious answer is, well, they want files.
They want POSIX. They want a file system.
The last thing we wanted to try and tackle was bolting an SPDK-specific file system into this thing.
File systems do, as probably everybody in this room knows,
just a ton of shit that we didn't want to even begin to bite off.
So we started thinking about specific use cases like RocksDB,
which has a very well-defined interface to plug things into with its Env module,
and some analysis that we had done showed that they really didn't use all of the features in XFS.
They used very few features, so it seemed like a good candidate to say,
well, let's do something that's kind of file system-y-like,
but doesn't have file semantics, and is really geared towards next-generation media.
And that's kind of where Blobstore came from, was trying to put something behind RocksDB
that would accelerate it for super-fast drives.
So that led to the next step, of course, starting with some design goals.
What do we want this thing to do?
Well, we started, like I said, with the use case.
And I want to talk about logical volumes, too.
If I forget before the end of the presentation, I'll mention that.
That's a feature, a new block on the block diagram that's coming up soon that relies on Blobstore.
But, again, the original design goal was to write something for
RocksDB and to keep it super simple and not add anything that isn't needed to accelerate RocksDB.
And there literally isn't. We could almost walk through the entire code today. That's how
short it is. It's, I don't know, 2,500 lines of code maybe. So it doesn't do a whole lot
on purpose. And again, designed for fast
storage media. And we'll see some of the examples of how that was done as we go through these.
Okay, super, super high level. What's the design look like? We really only have three abstractions
in the design. There's, of course, the blob;
hence the name Blobstore. And the names were picked in an attempt not to put some sort
of mindset around file or object or chunk or whatever, although it's impossible to find
words in storage that aren't overloaded. They're all taken, so it doesn't really matter what we chose.
But blob is our largest level of abstraction, most analogous to a file, right? It's just a
collection of blocks. We've also got two other abstractions. I'll show you a picture of those
coming up on the next slide. But at the lowest level, we define a page, and a page is the
smallest level of access to the media. So it's typically a block size or larger than a block.
And actually, right now, it's fixed at 4K. So it would take a little bit of work to do something different.
But that's a page unit of access.
And then our next level of abstraction is a cluster.
And a cluster is just a collection of contiguous pages.
And that's much larger.
The default size is a megabyte.
And that's the unit of allocation.
So that's, of course, geared towards SSDs: when we do a resize of the blob and
shrink or grow, we want to do it in large contiguous chunks to optimize what's going on at the SSD.
The whole thing is asynchronous. I will explain a little bit more exactly how those asynchronous
calls work when combined with other units within SPDK, part of what we call our application framework,
which is really just that: a framework that was kind of born out of necessity.
We started with the NVMe polling driver,
started building higher-level application-type things
like the iSCSI target, the NVMe-oF target,
and discovered that there were different elements
in building that application
that were common amongst the applications that were using the polling mode driver.
So, for example, an event scheduler and a poller
and non-blocking rings for communication between threads.
All those kind of things were common,
so they got sort of merged into this thing called the application framework.
And you'll see an example of that as part of the Hello Blob.
Interestingly enough,
and of course it's open source,
we don't know exactly how everybody's using
the different modules and what they're doing with them
unless they come out and talk about it.
I'm not sure anybody's using the application framework
except for the SPDK modules.
So most large applications
and large customers, consumers of this,
already have mechanisms in their own software to do these things.
So it's kind of the glue that ties together things as an example
or if you're going to use the applications that we provide.
So asynchronous also implies there's no blocking, no queuing, no waiting.
Blobstore either runs through to completion or gives you an error, one of the two.
There's no mechanisms for queuing.
It'll never block, and it's fully parallel.
We've got a concept of channels.
Again, another word; you can't avoid overloading these terms.
But for us, a channel is essentially associated with a thread or a core
and designed to do a different stream of IO.
So with NVM Express, of course, we would associate a channel with a queue pair.
So you can set up your application,
you can set it up so that you've got a channel per blob
or a channel per series of blobs that you know require no coordination
between their reads and writes,
and then all of this just happens as fast as it can possibly happen.
Okay, so I didn't draw, like, a whole blob layout on this.
I probably should have done that and made it a little easier to see,
but this is one blob, to give you an idea of the three abstractions I talked about.
These things at the top are the pages, the contiguous series of 4K pages.
And each one of those pages,
or each one of those chunks of pages,
as they are contiguous, line up in a cluster.
And then a cluster, therefore,
is just a contiguous set of pages.
But clusters themselves don't have to be contiguous.
So a blob itself is really nothing more than a linked list of cluster identifiers
or cluster IDs.
So you can see the different colored clusters
show that down in physical address range
they can be anywhere.
It doesn't really matter.
It's just those little chunks of clusters
that are contiguous.
So it's a really super simple
and basic layout scheme
that allows really fast lookups
for virtual to physical address
with really no chance for error.
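As a toy model of that lookup (not SPDK's actual code), with the default sizes just mentioned, 4K pages and 1 MB clusters, the virtual-to-physical translation is one divide, one modulo, and one array index:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGES_PER_CLUSTER 256   /* 1 MB cluster / 4K page */

    /* clusters[i] holds the physical cluster ID of the blob's i-th
     * cluster; the clusters need not be contiguous on disk. */
    static uint64_t blob_page_to_disk_page(const uint64_t *clusters,
                                           uint64_t blob_page)
    {
        uint64_t cluster_idx = blob_page / PAGES_PER_CLUSTER;
        uint64_t page_in_cluster = blob_page % PAGES_PER_CLUSTER;

        return clusters[cluster_idx] * PAGES_PER_CLUSTER + page_in_cluster;
    }

    int main(void)
    {
        uint64_t clusters[] = { 17, 3, 42 };   /* scattered on disk */

        /* page 300 of the blob lives in its second cluster (ID 3) */
        printf("%llu\n", (unsigned long long)
               blob_page_to_disk_page(clusters, 300));
        return 0;
    }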
That make sense?
Any questions on any of this so far?
I'm sorry, what is managed where?
Yeah, they're in the Blobstore implementation.
So all of this is Blobstore.
At the top of Blobstore is the API you're going to see
that the application talks to,
which is, you know, read, write, open, close, delete, resize,
and that's probably it off the top of my head.
And then at the back of Blobstore is Blocks.
Okay, and of course we've got metadata to keep the blobs persistent and their locations persistent.
The metadata design is probably by far the most complicated portion of the code base,
and I'm not even going to attempt to get into all of the different aspects of it. At the simplest
level, each blob has one or more pages of metadata in a pre-reserved region, and each one of those pages
is strategically designed to be 4K. To really guarantee the atomicity of Blobstore metadata
updates, we have a requirement that you're running with an NVM Express device that provides atomic updates
at the block level, or in this case, 4K. That's why it's hard-coded at 4K right now.
But the metadata is all isolated
in units of pages, and they're not shared between each other, so you can do
updates on different blobs and not be
dealing with any locks or contention on the disk.
If the metadata for a single blob gets larger than 4K,
then it will bleed into a second sector
or a second page in our Blobstore terminology.
And once you go up over two and three pages,
any updates from that point are allocate-then-write, and atomic.
So that's all protected,
which makes it the more complicated of the two scenarios,
writing to metadata versus writing to disk.
And really, the metadata is nothing more than an extent list,
a list of clusters that have been allocated for that blob,
and a series of, we call them xattrs,
and they're similar to xattrs in file system land.
It's a very crude implementation, very simple, just a way to store key value pairs with blobs
to help solve little problems that applications might have in trying to keep track of things.
The only identifier between an application and the blob store of any blob is an ID number,
which happens to be an offset into the metadata
of where that blob's metadata is.
So that's all you get.
So we'll see when we talk about how we bolted into RocksDB
how we use xattrs in that case to store file names
so that the application can use something that it knows.
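For flavor, here's roughly what that looks like with the blob xattr calls; the names follow current SPDK headers and have shifted across releases, so treat it as a sketch:

    #include <string.h>
    #include "spdk/blob.h"

    /* Store a file name on a blob that otherwise only has an ID;
     * a sync or graceful close afterwards persists the metadata. */
    static void tag_blob_with_name(struct spdk_blob *blob, const char *fname)
    {
        spdk_blob_set_xattr(blob, "name", fname, strlen(fname) + 1);
    }

    static const char *lookup_blob_name(struct spdk_blob *blob)
    {
        const void *val = NULL;
        size_t len = 0;

        if (spdk_blob_get_xattr_value(blob, "name", &val, &len) == 0) {
            return val;   /* NUL-terminated because we stored it that way */
        }
        return NULL;
    }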
Okay, and then I think this is the last slide with words on it,
and then it's all, like I said, painfully insulting C code,
but I'll highlight some of the cooler features that aren't so obvious about the SPDK framework as we go through it.
But, yeah, okay, so I mentioned all those except for sync.
So sync is for synchronizing the metadata, right?
So along with the idea of making this as lightning fast as humanly possible,
the metadata is kept in RAM, the working copy,
and is only synchronized to disk on an explicit call by the application
that says synchronize to disk
or when you gracefully close down the blob.
So if it's not obvious, that means if you're plugging along
and you resize your blob, make it larger
and write a bunch of data,
and you lose power without syncing,
well, you're screwed, right?
You're going to come back up.
You'll be consistent, but you'll be consistently
wrong because your blob will be the wrong size
and you've lost all that data.
It's kind of up to the application what
their requirements are
for sync. For example,
if you were doing reads and writes
and intermingling a bunch of metadata
operations and
you really need the performance and you
can deal with that
area where you're
not consistent, then maybe you don't want to pay the penalty for going to disk with
metadata writes every single operation.
So you can batch them up in large groups and then do a sync of the metadata at the end.
And then I mentioned the reason for the two abstractions.
We read and write in units of pages, right? So it's not byte
addressable, it's page addressable. That's the smallest level of
granularity for read and write, and then allocation for shrinking or growing your
blob is done in clusters, again optimized for SSDs. Data is direct. I mentioned the
metadata is kept in memory. There's not actually, there's no, like, cache structure
or anything. It's just kept in memory.
That's what we mean by cached.
But the data is completely direct.
You'll see in the example code where we allocate a DMA-safe buffer
and just say read or write, and it's direct in and out of the buffer.
Blobstore never touches the data.
And, again, minimal support for xattrs.
Again, what the idea is
keep it super simple
nothing complicated about it other than
like I said some of the metadata strategies get a little
complicated because they have to
but that's it
okay so this is our
hello world for blobs
and I mentioned already what it does
and I'll take you through it
I can't see it from over here so I'm going to
step in front with my blind eyes.
So this example uses two other pieces of the SPDK framework.
One of them, it uses the application framework,
and the other one is it uses the block device abstraction layer.
You don't have to use either one of those to use Blobstore.
We just chose to in this example.
There's actually an example that is in review right now.
There's a patch up to show how to use it
without the block device abstraction layer,
where you can talk directly from the Blobstore to NVM Express.
But the block device abstraction layer
is incredibly thin and small
and gives us about a dozen APIs
to make the application writing just easier without paying much penalty.
So this first part here,
this SPDK opts init structure,
is a structure that controls the application framework.
It gives it various parameters.
The only two that we're using here are the name
and a config file.
Config file is for use by the block device framework.
So that's actually how you configure
and assign a block device for this application.
So the config file for this one,
even though Blobstore is geared towards NVMe,
the idea with this is you can run it anywhere
just to see how the code flows.
So this is actually a RAM disk.
It's defined in this config file.
And then we allocate a context structure that we need
because this is all event-driven, callback-driven code.
So every time we call a Blobstore API, or almost every time,
we'll see a few examples the other way around,
we pass it a callback function and two parameters.
So I'm always going to pass the callback function I want,
and I'm going to give it this context structure
so that I've got context of what I'm doing and where interesting things are
for me. So this first call here we call app start. That's the app framework. We're telling the
application framework, which is a collection of functions. The key point is it's got an event
scheduler. We call it a reactor. So this call is telling the reactor on the same thread,
I've got a function that I would like you to schedule for execution.
And that function's name is hello start.
And when you call it, give it these two parameters,
my context and I'm not using the second parameter.
So we're setting that up on the scheduler.
Now the way the scheduler works,
this is all running on the same thread.
The scheduler obviously is not in control right now.
This program is in control.
But as soon as I call app start, which is a special function because it's the start of the thing,
the application framework is going to immediately call my hello start function
and basically is going to be blocking on execution of this function.
So from here, we're going to jump straight over to the hello start function,
and then I'll explain what happens when this guy completes and returns control
and who picks up and how that happens next.
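For reference, here's a condensed sketch of that main(), patterned on examples/blob/hello_world/hello_blob.c in the SPDK tree; exact signatures and option fields vary a little by release, and error checking is omitted throughout these sketches:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include "spdk/event.h"
    #include "spdk/blob.h"

    struct hello_context_t {             /* carried through every callback */
        struct spdk_blob_store *bs;
        struct spdk_blob *blob;
        spdk_blob_id blobid;
        struct spdk_io_channel *channel;
        uint8_t *read_buff;
        uint8_t *write_buff;
        uint64_t page_size;
        int rc;
    };

    static void hello_start(void *arg1, void *arg2);   /* shown below */

    int main(int argc, char **argv)
    {
        struct spdk_app_opts opts = {};
        struct hello_context_t *ctx = calloc(1, sizeof(*ctx));
        int rc;

        spdk_app_opts_init(&opts);
        opts.name = "hello_blob";
        opts.config_file = "hello_blob.conf";  /* defines the Malloc0 RAM disk */

        /* Blocks here while the reactor owns the thread; the reactor
         * immediately runs hello_start(ctx, NULL) -- note the unused
         * second parameter. */
        rc = spdk_app_start(&opts, hello_start, ctx, NULL);

        free(ctx);
        return rc;
    }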
Okay, so now the app framework, the reactor,
has pulled this function off of the event list
and said, I've got to run this.
So he calls my hello start function with my parameters,
and I cast them into my context structure,
and there's a couple things I've got to do here
before I initialize the blob store.
The first one is I have to get my block device,
because remember I said this is built
on the block device abstraction layer.
There's a couple of different APIs for this. I chose the one because I know the name of it because I defined it in my config file. So I say give me my block device and the name is malloc0
because I'm using a RAM disk. And then the next thing I'm going to need is a block storage version
of a device that I'm going to talk to. So this is why I said, remember,
we don't have to use the block device abstraction layer.
I could create one of these things manually if I wanted to,
and it's not that difficult.
There's eight or nine function pointers and a few other things.
I don't know what they are.
But because I'm using the block device abstraction layer,
I've got some helper functions I can use and just say,
okay, create me a block store or a blob store version,
and I'm using the block device abstraction layer, and here's my pointer to my block device,
and that gives me the necessary structure the blob store needs to initialize to get up and running.
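Continuing the sketch, here's hello_start as just described, including the spdk_bs_init call that comes up next:

    #include "spdk/bdev.h"
    #include "spdk/blob_bdev.h"

    static void bs_init_complete(void *cb_arg, struct spdk_blob_store *bs,
                                 int bserrno);          /* shown below */

    static void hello_start(void *arg1, void *arg2)
    {
        struct hello_context_t *ctx = arg1;
        struct spdk_bdev *bdev;
        struct spdk_bs_dev *bs_dev;

        bdev = spdk_bdev_get_by_name("Malloc0");   /* the RAM disk */
        if (bdev == NULL) {
            spdk_app_stop(-1);
            return;
        }

        /* helper that fills in the eight or nine function pointers */
        bs_dev = spdk_bdev_create_bs_dev(bdev, NULL, NULL);

        /* async: returns immediately; bs_init_complete fires later,
         * from a poller, once the on-disk init I/O completes */
        spdk_bs_init(bs_dev, NULL, bs_init_complete, ctx);
    }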
Okay, now the fun part.
This is fun, right?
This is SDC, right, developers conference?
Make sure I'm in the right place.
Okay, so spdk_bs_init.
So now we're going to initialize the blob store.
And remember I said this is an asynchronous callback-driven framework.
And I tell you, the first time I heard that,
I made some assumptions about what that meant.
They turned out to be way wrong, and I didn't learn that until much later.
But I've worked on different frameworks like this
where asynchronous callback-driven means
that as soon as I call this,
I get immediate control back on my thread,
and some other thread is running off doing something,
and I don't really care what or how or when,
but I get immediate control, and I do whatever I want,
and then on that other thread's context,
I'm going to eventually get a callback
to this function called bs init complete
with the parameters
I pass, and then I know that that operation
is done and I can keep going.
And that's kind of, sort of how it works, but not
really. And
almost all of the functions work this way. What's going to
happen is, as soon as I call this function,
this is all
single-threaded. This thread is going to
be executing this line of code
and go down into Blobstore and just
synchronously execute all the way up
until Blobstore can't do something again.
Until it would need to block if we were blocking.
Instead of blocking, what
it does is it takes
our callback and
puts it on and passes it into
the function that needs to do
something that's going to take time, like talk to a disk.
We're not going to poll in the framework for that completion.
We're certainly not going to wait for an interrupt.
But what Blobstore is then going to do
is going to schedule
a poller to periodically check that
and tell the poller, as soon as you get a
completion, here's who I want you to call.
And then
I'm going to get control back on the same thread.
So after I get a return back from here,
my application's in control.
Nobody else is doing anything, right?
Well, the hardware should be doing something, right?
The NVMe drive should be processing the request.
But until I relinquish control from this function,
until I return, the event scheduler,
the reactor that is on the same thread,
isn't going to get control again
and isn't going to be able to schedule a polling function
and isn't going to notice that the I/O is done.
So it is definitely asynchronous,
but it's a little funny like that.
We don't have to set it up this way.
We could set up multiple threads
and do all sorts of other fancy things.
But that's essentially what happened.
So I get immediate control back from here.
As soon as this function exits,
the reactor on this thread runs,
picks up his event loop,
and says, oh, who's next?
And it will most likely be the poller
who's checking to see if that NVMe I/O is completed.
And if it is,
he's going to call this function with these parameters.
Everybody with me?
Okay, so now our blob store has been initialized,
and now it gets even simpler and simpler as we go.
Now it's almost kindergarten simple.
All right, initialization is complete.
So at this point, I know that my blob store structure is done, and I can confirm that it's loaded,
and I can set up a pointer to the blob store
in my context structure
and go do whatever I want to do next,
knowing that whatever callback I'm going to get next
is going to pass me back a pointer to my blob store
so I can continue to use it.
So in this example,
the next thing we're going to do is create a blob.
And I go ahead and grab the page size,
and this is an example of a synchronous function in
blob store. This is just going to return as soon as it's done. There's no reason for it to
quote-unquote disconnect. It's got all this information handy. The next thing I want to do
is just call a function called create blob. It's really only a two-line function; the only reason it's not inline
here is because it's supposed to be an example, a simple example, trying to keep things logical and chunked together. Another thing that one could do with
the app framework is if, let's say, there was a bunch of crap here and there's a whole bunch of
work to be done, and this function was pretty large too, and we're trying to be really fair
about this, I could actually call a function called, I think, SPDK schedule event or something
like that.
That's what it does.
And I could give it this function name and this parameter and tell the reactor, next
time you go through your loop, put this guy on the list and call him next.
And then as soon as I exit here, I really give control to the event scheduler, and he
can be a little bit more fair about how we're utilizing the CPU time for this application.
But I'm just calling the function called create blob here
because there's no reason to get fancy.
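Continuing the sketch, here's the init-complete callback; the commented-out lines show the spdk_event_allocate/spdk_event_call pair, which appears to be the "schedule event" mechanism being described (create_blob_event would be a hypothetical two-argument wrapper):

    static void create_blob(struct hello_context_t *ctx);   /* shown below */

    static void bs_init_complete(void *cb_arg, struct spdk_blob_store *bs,
                                 int bserrno)
    {
        struct hello_context_t *ctx = cb_arg;

        ctx->bs = bs;                                  /* keep for later   */
        ctx->page_size = spdk_bs_get_page_size(bs);    /* synchronous call */

        create_blob(ctx);   /* called inline; no reason to get fancy */

        /* The fairer-to-the-CPU alternative: hand it to the reactor.
         *   struct spdk_event *e = spdk_event_allocate(
         *       spdk_env_get_current_core(), create_blob_event, ctx, NULL);
         *   spdk_event_call(e);
         */
    }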
All right, and starting from the bottom,
and if you've written code in this kind of event framework type thing before,
everything kind of looks upside down because you tend to write it
and you write where your callback is
and then you go right above that function and write the callback
so everything looks backwards.
But here's the callback, or here's the function
for create blob.
And like I said, all I do is create blob.
I call it one of the asynchronous functions
with the callback.
Say, okay, I want to create a blob
and call create complete when you're done.
Notice I didn't give it a size.
So every time I create a blob, it comes up as size zero.
And then I have to resize it to whatever I want it to be.
And that's actually on purpose.
And I probably won't talk about it in these slides.
But, for example, when you bolt it together with our shim to run under RocksDB,
we'll just continually allocate off the end and resize the blob as we go,
so you don't have to guess at what your size is going to be.
Okay, so after I've created the blob, I get this callback,
my create complete callback.
I save off my blob ID.
Remember I said we only have an ID.
There's no such thing as file names with blobs or handles or anything like that.
Next thing I'm going to do is open the blob because I want to do some stuff to it.
So same thing, right?
I give it a callback function and some parameters I want to pass to that callback.
And we open it up.
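Here are the two-line create function and its completion callback, continuing the sketch:

    static void open_complete(void *cb_arg, struct spdk_blob *blob,
                              int bserrno);             /* shown below */

    static void blob_create_complete(void *arg1, spdk_blob_id blobid,
                                     int bserrno)
    {
        struct hello_context_t *ctx = arg1;

        ctx->blobid = blobid;  /* the ID is the blob's only durable handle */

        /* async open; the in-memory blob handle arrives in the callback */
        spdk_bs_open_blob(ctx->bs, ctx->blobid, open_complete, ctx);
    }

    static void create_blob(struct hello_context_t *ctx)
    {
        /* async create; the new blob starts out at size zero */
        spdk_bs_create_blob(ctx->bs, blob_create_complete, ctx);
    }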
Okay, so here's our open complete.
It's done.
It's opened.
I now have a handle to the blob that I save off in my context
because I might need to use that as a parameter for upcoming functions.
I'm going to go ahead and make a synchronous call into the blob store
to find out how many free clusters I have in the entire blob store.
I probably wasn't explicit about this,
but a blob store takes up an entire NVMe disk,
and then you create blobs on that one blob store per NVMe disk.
So this is essentially asking me how many blobs
or how many free clusters I have across the entire disk.
And for this example, we're going to say,
okay, now let's resize it to use the entire blob store.
We're just going to create one blob just for the fun of it.
And again, this is a synchronous call.
And these are both synchronous calls.
Maybe you can guess why:
because metadata is kept in RAM,
like I mentioned earlier,
so there's no activity on the back end,
so we just make them synchronous.
And like I said, we'll explicitly sync up here,
so this will take our dirty copy
and our clean copy
and merge them together
and make ourselves clean with disk,
so that our metadata is updated.
And this is an asynchronous call,
so we're going to say after the sync is complete,
here's my callback and my parameter,
and there's sync complete.
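The open callback, continuing the sketch; free-cluster count and resize are the two synchronous calls just mentioned (resize was synchronous in SPDK releases of this era; later releases made it asynchronous with a callback):

    static void sync_complete(void *arg1, int bserrno);   /* shown below */

    static void open_complete(void *cb_arg, struct spdk_blob *blob,
                              int bserrno)
    {
        struct hello_context_t *ctx = cb_arg;
        uint64_t free_clusters;

        ctx->blob = blob;

        /* synchronous: metadata lives in RAM, nothing to wait on */
        free_clusters = spdk_bs_free_cluster_count(ctx->bs);

        /* grow this one blob to cover the whole blobstore; also
         * synchronous, it just updates the in-RAM metadata */
        spdk_blob_resize(ctx->blob, free_clusters);

        /* now make that metadata durable: an async sync to disk */
        spdk_blob_sync_md(ctx->blob, sync_complete, ctx);
    }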
And then what our example does first is we're just going to go write a data pattern.
So it'll show you how to write to a blob.
And you can see it's also mind-numbingly simple.
We do have to use the SPDK memory allocation routines
because this needs to be DMA-safe memory.
We're going to be giving this buffer directly to the hardware,
so it obviously can't be paged out in the middle of the DMA.
That would be bad.
I'm just using the size of the page here for the size of my buffer.
You can make it any increment of pages you want,
and this is an alignment requirement for the third parameter,
second parameter, whatever that is.
And now I'm just going to set that memory up with a known pattern,
just, again, something kind of brain dead,
so that when we read it back
we can make sure we don't get something different, right?
Next thing we're going to do is allocate a channel.
Remember I mentioned earlier channels are these abstract notions of
here's a parallel path that I can do I.O. on that I, as an application, know
doesn't require any synchronization with any other channels.
They're all free to just operate independently as fast as we can make them run.
So I'm going to allocate a channel from SPDK,
and I'm going to store that in my context so I've got access to that later.
I'm going to use the same channel to do the read,
because there's no reason not to, but I mean, I could have used a separate one.
And now we're just going to do a write.
So a write is writing to the blob, give it the channel,
give it the write buffer that holds the data we want to write.
I'm just going to start it at page 0 and write for one page.
And when Blobstore is done,
I want them to call my write complete function
and give me my context back.
Okay, there's my write complete.
So in my write complete, I'm going to go read it.
And again, I could do this in line here.
I could schedule it as an event.
Or for the sake of the example,
I'm going to put it in a nice little compartmentalized function for reading,
which, again, has to allocate a buffer.
For obvious reasons, we can't use the same buffer,
or we're not going to be able to compare the two.
So we get another page size buffer,
and now we're going to pass pretty much the identical function,
except the read is the name of the API,
and we're going to go to our read complete function
with our same callback argument.
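And the read side, a mirror image of the write, continuing the sketch:

    static void read_complete(void *arg1, int bserrno);    /* shown below */

    static void read_blob(struct hello_context_t *ctx)
    {
        /* a second buffer, so the compare later actually proves something */
        ctx->read_buff = spdk_dma_malloc(ctx->page_size, 0x1000, NULL);

        /* same channel, page 0, one page -- only the API name changes */
        spdk_blob_io_read(ctx->blob, ctx->channel, ctx->read_buff,
                          0, 1, read_complete, ctx);
    }

    static void write_complete(void *arg1, int bserrno)
    {
        struct hello_context_t *ctx = arg1;

        read_blob(ctx);   /* could also have been scheduled as an event */
    }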
All right, free cup of coffee.
Anybody can guess what the read complete does?
Anybody going to take me up on a free cup of coffee?
All it's going to do is compare the memory, right?
So the read is done.
We know it's done, and we know the data buffers have got,
they better have what we put in them,
so we just do a quick mem compare
and make sure all is good.
And since it's an example app,
we're going to show how to use the other APIs.
First one is going to be closing the blob,
and then immediately on closure,
we want to go into our function called delete blob.
So we're going to tell it to jump right into that.
Okay, this is our callback for closure,
and here's where we're just going to delete it.
Pretty simple.
Another asynchronous call into blob store
telling it to delete.
And when you're done deleting,
call this function up here.
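The compare, close, and delete steps, continuing the sketch:

    static void delete_complete(void *arg1, int bserrno);  /* shown below */

    static void delete_blob(void *arg1, int bserrno)
    {
        struct hello_context_t *ctx = arg1;

        /* the blob is closed; now remove it from the blobstore entirely */
        spdk_bs_delete_blob(ctx->bs, ctx->blobid, delete_complete, ctx);
    }

    static void read_complete(void *arg1, int bserrno)
    {
        struct hello_context_t *ctx = arg1;

        /* the free-cup-of-coffee step: did we read back what we wrote? */
        if (memcmp(ctx->write_buff, ctx->read_buff, ctx->page_size)) {
            ctx->rc = -1;
        }

        /* close the blob, jumping straight into delete from its callback */
        spdk_blob_close(ctx->blob, delete_blob, ctx);
    }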
And this function is just going to call
another function internal to the Hello Blob app, called unload blobstore,
which is simply going to
free our I/O channel that we got
and then call the Blobstore API for unload, which is also going to do a
synchronization in the background. I think I mentioned that: if you don't explicitly
sync, we always sync when you unload. That makes this an asynchronous function that's going to
require a callback. There's a callback for unloading, so once we've unloaded, we, you know, set a little
flag for unloading that helps us with cleanup. And then we have this other magic app framework API called app stop,
and that's just given a return code for success, hard-coded there.
And actually what that does, this is another sort of magic one like the start.
This will tell the framework this application is finished,
and when you're done, relinquish, stop blocking from main.
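And the unload path, finishing the sketch:

    static void unload_complete(void *arg1, int bserrno)
    {
        struct hello_context_t *ctx = arg1;

        /* hands control back: spdk_app_start() in main() finally returns */
        spdk_app_stop(ctx->rc);
    }

    static void delete_complete(void *arg1, int bserrno)
    {
        struct hello_context_t *ctx = arg1;

        spdk_bs_free_io_channel(ctx->channel);

        /* unload syncs metadata behind the scenes if we hadn't already */
        spdk_bs_unload(ctx->bs, unload_complete, ctx);
    }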
So if you remember when I showed you main,
we called app start, and I said,
now this is blocked here,
and the application and the SPDK framework
and the event scheduler has taken over
and is in charge of what runs on this thread.
As soon as he gets this call,
it's like, okay, you're done,
and he returns control back to hello world main,
and I didn't bring that back up
because it's so uninteresting,
but all it does is free up the memory that we allocated and, you know, just cleans up
nicely, and that's the end. So that's about as, as I said, insultingly simple as you can get,
but, you know, hopefully that was sort of interesting. You can see how the API actually works instead of
just sort of hearing about it and getting some hand-wavy stuff, and especially seeing and hopefully
understanding the nature of these asynchronous calls
and how all of this can happen on a single thread, right,
with the help of our reactor event scheduler running in the background there.
So before I turn over to Vishal, who's going to go over some performance numbers with RocksDB,
I'm going to give you a little bit more context instead of just jumping right into the performance data
on what components of SPDK are involved
in creating the numbers he's got there.
So we obviously know what the driver is.
Talked about the BDEV, the block device abstraction layer.
I didn't draw it up here, but the event framework
was a big part of that example we just saw.
And, of course, Blobstore.
And then you may have seen BlobFS or heard of BlobFS.
It's by no means a file system. It's the minimum file system-like stuff that RocksDB needs.
For example, Blobstore only has IDs; RocksDB needs file names. Blobstore's unit of read and write
is a page; RocksDB needs byte addressability. So this layer provides that kind of stuff, right?
And then we've got our environment layer
for SPDK, linked into RocksDB.
So this is actually what Vishal is going to be talking about next.
I can still work with this off.
This is fine.
All right, thanks, Paul.
So we discussed how the SPDK Blobstore is designed.
Now we want to look at how it performs against the traditional approach, using the Linux kernel XFS file system. Since Paul actually
discussed on the RocksDB side
how Blobstore was integrated
with RocksDB,
we would be using that, and
a benchmark called DBBench
is used.
In this configuration,
we have an Intel Xeon E5 v4 CPU.
It's a Broadwell CPU, on an OS config based on Ubuntu 16.04.
We took the latest kernel at the time of testing, 4.12.4.
And the key thing is the NVMe device we are using:
it's an Intel Optane SSD,
which was released just two to three months back,
so our database would be running on this Intel Optane.
So the data set on the right side
says 242 gigabytes. This comprises 250
million uncompressed records, with 16-byte keys and 1,000-byte values. That's the
DBBench config we used; we have quite predominantly seen people using that kind of key and value sizes.
So what about the workloads?
So there are three predominantly used workloads for DBBench,
which is synthetically reading or writing the keys and values.
The first one is read-write, where we have one writer thread and we have
n threads which are actually doing the reads. And we have fill-sync, which is
filling the drive, where each operation includes a sync call for each I/O. And then there is overwrite, where you are updating your keys.
So let me go through the performance data.
So workload number one, which is read-write,
it's like 90% read, 10% write.
So we have four threads doing reads,
which are doing point queries,
and then one thread which is actually writing simultaneously.
For the performance side, on the left side,
there's operations per second, different metrics we gathered.
There's read ops per second: how fast my point queries were actually performing.
So Linux kernel versus SPDK,
we see that there is like 280% performance improvement using SPDK Blobstore under the same RocksDB workload.
Average latency dropped, which is a good thing.
Like you have more IOPS, you can perform at higher speed with lower response times.
CPU utilization was 20% lower, and then there's the P99, that's the biggest one.
I will talk about that.
The tail latency benefits from this SPDK asynchronous-driven approach.
When the Linux kernel does a flush operation,
all the threads which are actually reading are blocked while the sync operation is happening. But in the SPDK case, since it's asynchronous driven, there are no queuing mechanisms there.
So we see that there is a 53% improvement in tail latencies with read-write,
which is a pretty common workload in DBBench.
So workload number two, fill-sync.
This is a workload where you're writing values in a random key order,
but each IO is actually getting synced in a sync mode.
So when we run this workload, we see there is a 420% jump using SPDK Blobstore against the traditional Linux kernel in terms of performance,
and the latency dropped from kernel to SPDK by 77%.
And overwrite. This is my third workload,
where the database
is already set up and you're trying to update your keys. We don't see a lot of
improvement here because this workload comprises
large block I/Os. There is compaction
and flush activity which happens in the background,
because of which the chances of improvement
with this workload tend to be a little bit lower than
you saw with the previous workloads.
And so that was the performance section
where we compared Linux kernel versus SPDK.
And now I would like Paul to talk,
just to summarize things and end this talk.
Thanks.
Thanks, Vishal.
Yeah, SPDK's fast. There you go.
There's the summary.
SPD-f'ing-K is fast.
That's a reference to Stephen's talk from this morning,
just for those of you that are wondering what I'm saying up here and why.
But, yeah, SPDK is not a one-size-fits-all solution. But for those that it fits, it fits well.
And the community growth and the growth of the project itself are a testament to that.
And we are always looking for more people to come on board and to help out and provide their insight
to help move things forward both in SPDK and in the entire ecosystem.
So we have a couple of minutes for questions, or if we get booted out of here,
I'm going to be sticking around here for the next talk, so I'll be in this room.
And Vishal has another talk at 3 o'clock
on key value data structures and their impact on performance.
You'll want to go check him out.
Any questions right now?
Yes?
Yes.
Is there any mapping between DRAM and SPDK
so you can use it like a cache?
Not in your code, but between Blobstore
and maybe, like, a DRAM cache?
No, we haven't done anything like that.
It is...
It's really nothing more than what I went through.
You feed Blobstore buffers
and it interacts directly back and forth with us.
Are there any problems with people?
I haven't heard any. That doesn't mean there aren't any.
I'm actually fairly new to the project,
so there very well could have been some discussion on that.
Certainly, if you have some ideas, like I said,
IRC is the best way to get involved with the community.
We have our distribution list, too, that you can bring up ideas on.
But there's always somebody on IRC, and if you spark some interest,
whether it's from an Intel person
or somebody else from one of the other big companies that are out there,
it can get a life of its own pretty quick.
Yes?
How do you handle fragmentation on that?
Do you have some connection?
No, we don't handle any fragmentation in Blobstore.
We kind of leave all of that to the media.
Because we're using our cluster size and allocation of one megabyte,
we don't have much of a problem with fragmentation with our usage so far.
Yes, back here.
I have a question related to how you manage free space in the Blobstore.
Free space in the Blobstore?
Blobstore doesn't manage it.
So it's managed by the application.
That's one of many things that when you look closer in the Blob Store,
you're like, wow, it doesn't do that either.
Oh, it doesn't do that either.
And it ties back to that slide I said on simplicity.
It only does what it needed to do to enable the benchmarks that Vishal just did.
And I'm sure it will grow over time as people will see different value for adding different things.
But the idea is kind of let's keep it that simple,
and if you need something,
we can put another layer on top of it,
like the blobFS thing that I mentioned earlier
that ties in file names and byte addressability.
Yes?
Is there really work to extend this
to NVDIMM?
Yeah, there is.
It's actually up on Trello,
and the question for those of you who didn't hear it was
if there's work to extend this to NVDIMM
or generically to persistent memory.
Being from Intel, I'd have to say
the most efficient way to use persistent memory
would not be to go through this, right?
It would be to actually use one of the low-level NVML libraries
that you maybe heard about earlier this year or yesterday.
I was actually out here last year and gave a talk on NVML for Windows.
So NVML straight up, just using the lightweight library, is probably the best way to do that.
But for those that are already in the SPDK world or consume blocks in user space, however,
and want to consume persistent memory, one of the NVML libraries is called libpmemblock.
And an effort was just kicked off, I think, last week.
And it's up on Trello.
I think there's a whole board on Trello
that talks about bolting libpmemblock in
as a block device in SPDK,
which would give user space applications
block access to persistent memory,
albeit not the most efficient way to do it.
But it's there.
And for those that want to make that quick leap,
they can do it.
Great question, though.
Very carefully.
So SPDK, there is support for multiple application sharing.
I wouldn't say sharing SPDK, but sharing the resources that SPDK is using.
And I believe most of that is going to be through DPDK.
So I didn't really mention DPDK at all,
but DPDK is an optional resource for SPDK to provide services like memory management
and some communication services and some other things that I'm not super familiar with.
We have an abstraction layer between us and DPDK that is fairly new.
It's just called env.h.
We have a lot of customers that do this, a lot of consumers of SPDK that do this,
that allows them to use whatever their own application mechanism is,
for example, memory management.
But with DPDK, and we actually use this feature in one of our test
components, you can have DPDK manage shared memory. And the SPDK application, when it starts up,
we've got a flag plumbed in where you give it a shared memory identifier, and you can have two
different SPDK-based applications sharing a piece of memory that was allocated and is managed by
one DPDK instance. So, yeah, you can do it.
There's a bunch of them, I'm sure, a bunch of different other ways,
but that's the one I know of.
Yes?
Why did you kill the channel priorities?
Why did we kill the channel priorities?
You know, I don't have the history to know that,
but I can definitely find the answer for you.
Yeah.
Yeah.
Oh, I have no idea. I know what the commit message says.
Yeah. The postings say it's an issue with the hardware?
Oh, I have no idea.
Honestly, I'm not familiar with it.
Yeah, I know something just got started in the community on,
I think it's just generically titled QoS, and it involves IO-based priorities as opposed to channel-based priorities.
There might have been some debate and discussion before my time
about which was more efficient, which was going to be more usable.
I'm not sure, but I've never heard of it being tied to a hardware issue of any kind.
I can find out if you want.
You can drop me your name and get on IRC and ask.
Use the distribution list and ask.
Come to the hackathon and ask.
Okay, I think that's it.
If you have more questions, like I said, I'm going to stay in this room for a while,
so I'll be around.
Thanks very much.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list by sending an email to
developers-subscribe at snia.org. Here you can ask
questions and discuss this topic further with your peers in the storage developer community.
For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.