Storage Developer Conference - #70: SPDK Blobstore: A Look Inside the NVM Optimized Allocator

Episode Date: May 30, 2018

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast Episode 70. Hi, my name is Paul and this is my colleague Vishal. We're going to take you on a little tour of Blobstore today. All right, so unless you've been living under a rock, you probably have heard of SPDK.
Starting point is 00:00:51 So we're going to briefly go over what that is, why it is, and sort of how you can get involved, how it came about. And then I'll go through an overview of what Blobstore is, just a couple of high-level slides, give you some idea for the motivation, a little bit on the on-disk layout. And then I'm going to purposely insult your intelligence by walking through an incredibly simple example. This is our Hello Blob example that is just straight through, run to completion, create a blob, read and write to a blob, do a comparison,
Starting point is 00:01:20 delete the blob, close the thing down. So it'll give you an idea of some of the other elements of the SPDK framework as well. So I'll take a few opportunities to sidetrack from Blobstore as we go through that code example. And then Vishal is going to walk you through some performance numbers with Blobstore, specifically with RocksDB. Okay, so what is SPDK? Well, it is a bunch of different software components. It's not a library. It's a set of libraries. It's a set of tools, a set of building blocks.
Starting point is 00:01:52 It's a framework. It's kind of whatever you want to call it. And we'll see a block diagram come up on the next slide. You can sort of optionally choose almost all of them and have them work together, or you can just pick one thing out of it and have it work together. And there are implementations of it from A to Z, from folks just using one of the NVMe drivers to the entire framework to create an iSCSI target
Starting point is 00:02:13 or an NVMe-oF target. There's just all sorts of things you can do with it. It's open source and it's BSD licensed. I think all that's clear enough. I'm going to talk a lot more about the community here on the next slide, but that's a really important part of it. It's not the case that we just took some source code
Starting point is 00:02:28 and threw it out there and point you at GitHub and say, good luck. We're building a big and a strong community around it with lots of different players besides Intel. And, of course, the theme, again, unless you've been living under a rock, you already know this. The theme is user space polled mode stuff, right? Okay, so the colors are bleeding out just a little bit. But I said I'm not going to go through all of these.
Starting point is 00:02:55 I think there was a talk here last year that probably went through the majority of these, or at least the ones that existed at the time. The idea here is just to give you, at a high level, the big-picture components that you can pick and choose from. So at the top, we've got our storage protocols. So you can see we've got the NVMe-oF target, the iSCSI target, and so on. And then sort of our middle layer is storage services. And the foundation of storage services is really our block device abstraction layer, very analogous to what's in the kernel.
Starting point is 00:03:26 You can actually come along and write your own block device driver that plugs into this framework. You can do that to support a block device, or you can do it in sort of a filter driver model to add some sort of value to the I/O path. So you would suck blocks in at the top and spit them out the bottom, and then they would end up talking with one of the polled mode drivers and out onto disk.
Starting point is 00:03:47 And then, of course, you can see Blobstore, the focus of this talk, sitting off here to the side, this optional thing that we will talk about here in just a second. And, of course, why did we do this? Why did we start this? This was started as kind of a science project
Starting point is 00:04:02 in the lab at Intel many years ago and then grew into something bigger and eventually grew into open source and now is growing into a much larger open community. And the reason is because it delivers, right? It doesn't deliver in all use cases. It's not meant to be a general purpose storage accelerator for whatever happens to be out there. And that's the beauty of these open communities and these open source projects,
Starting point is 00:04:28 there's something to fit everybody's needs. And this was fitting enough needs that a lot of people started showing up and started gaining interest and started contributing and really realized the performance gains that you see up here. So 10x against NVMe-oF with the kernel target, 8x against NVMe through the kernel, in IOPS.
Starting point is 00:04:50 Better efficiency in RocksDB workloads. You know, very specific, of course, anything with performance is specific to the workload. Vishal will go through some of that with you. And then efficient use of development resources is kind of the cornerstone, again, of this being an open source project. The idea with the SPDK building blocks is that they are all of the not-so-fun shared stuff that everybody needs if they're going to write a high-speed user space
Starting point is 00:05:14 storage application. So why should everybody rewrite the entire thing every single time? Just nature of open source. Let's all contribute to the common stuff, provide enough hooks for vendor-unique stuff and for people to add value, and then march forward and we all get better use of resources. So yeah, SPDK is as fast as a cheetah, apparently.
Starting point is 00:05:38 Okay, so the community. This is pretty exciting. This has been growing a lot, really, with a lot of change in the last six months. So initially, like I said, it really was one of those typical... source code was developed internally. Let's put it out on GitHub and call it open source and be done.
Starting point is 00:05:57 There was enough interest after that happened that it kind of... things had to change or the project wasn't going to go anywhere. So it really is an open community now, and it continues to grow. And so, for example, right besides, of course, GitHub being our source repo, we've got a Trello board.
Starting point is 00:06:13 If you're not familiar with Trello, it's just real, real fundamental project-managing kind of stuff, but easy enough to use that us developers can use it also. So we've got a bunch of Trello boards, actually, on that link. That's where we keep not only the backlog of things that we're interested in working on with our resources, but things that other community members are interested in working on too. So they can be anything from,
Starting point is 00:06:36 we've got a board just for low-hanging fruit, we call it. So anything that a developer wants to do sort of a drive-by, figure out how to go make a contribution to SPDK, have a little bit of fun. There's a whole list of really simple things that just nobody's gotten around to, all the way to a big ideas board that is just big, wild, and crazy stuff. Wouldn't it be great if, and then somebody can come in and fill in the blanks
Starting point is 00:06:57 if they're interested in doing that. So our whole backlog is visible. There's no behind-the-scenes stuff going on anywhere. All of our reviews are on GerritHub now, and this is a reasonably recent change. So everybody is encouraged in the community to get involved with reviews. You can see all the patches. You can see all the comments. You can vote on the reviews. And you can see what's coming down the pipeline. You don't have to just sort of wait for a pull request to get merged to see what's going
Starting point is 00:07:22 on. We've got a channel on Freenode that is starting to pick up a lot more use, so you can pretty much catch any of the SPDK experts almost around the clock. It's still a little bit skewed towards U.S. time with the way the maintainers are set up, but hopefully there's somebody there 24 by 7 to engage in discussion or to answer questions or whatever. And then our homepage is spdk.io, where you can get links to all this stuff. And there's some documentation and fun stuff like that. Okay, and then coming up in November,
Starting point is 00:08:02 we've got our first developer meetup, or hackathon. It's actually full now. We might be able to squeeze in one or two more if there's somebody who's experienced with SPDK and you really want to come in and participate and add some value, we can definitely look at trying to make that happen. Really, the only rule to our first
Starting point is 00:08:17 annual hackathon, we might actually do two a year, is no presentations like this. There's no PowerPoint. These are developer meetups where we are planning small group breakout sessions to either tackle big code reviews, push a patch through that has taken forever because it's maybe too big or too complicated, so we use that time face-to-face to move that through.
Starting point is 00:08:40 We can do some design discussions explaining existing modules, talk about tactical features coming up, or brainstorm what we're going to do with those, and who wants to work on what. So we're excited about this. About two and a half days of activity in Phoenix. So if you're interested, let me know, and we'll see if we can squeeze you in. Okay, so let's talk about Blobstore.
Starting point is 00:09:03 So probably the best way to introduce it is to talk about the motivation. Why did we go do this? Why did we invent something when there's already so many different things invented out there? And part of it is because we can, right? Anybody can. You can come in here and develop something and throw it out there and see what kind of feedback you get. But really this was because of the success of SPDK and the popularity of SPDK. If you look through the block diagram that I showed earlier, almost everything up there is
Starting point is 00:09:31 block-focused, right? Just geared around serving blocks, consuming blocks, speeding up block storage. So really, the question came to mind, I guess it was last year sometime, what about other applications that don't know what to do with blocks? The most obvious answer is, well, they want files. They want POSIX. They want a file system. That's the last thing we wanted to try and tackle was bolting in an SPDK-specific file system into this thing. File systems do, as probably everybody in this room knows, just a ton of shit that we didn't want to even begin to bite off.
Starting point is 00:10:04 So we started thinking about specific use cases like RocksDB, which has a very well-defined interface to plug things into with the Env module, and some analysis that we had done showed that they really didn't use all of the features in XFS. They used very few features, so it seemed like a good candidate to say, well, let's do something that's kind of file system-y-like, but doesn't have file semantics, and is really geared towards next-generation media. And that's kind of where Blobstore came from, was trying to put something behind RocksDB that would accelerate it for super-fast drives.
Starting point is 00:10:43 So that led to the next step, of course, starting with some design goals. What do we want this thing to do? Well, we started, like I said, with the use case. And I want to talk about logical volumes, too. If I forget before the end of the presentation, I'll mention that. That's a feature, a new block on the block diagram that's coming up soon that relies on Blobstore. But, again, the original design goals was to write something for RocksDB and to keep it super simple and not add anything that isn't needed to accelerate RocksDB.
Starting point is 00:11:12 And there literally isn't anything extra in there. We could almost walk through the entire code today. That's how short it is. It's, I don't know, maybe 2,500 lines of code. So it doesn't do a whole lot, on purpose. And again, designed for fast storage media. And we'll see some of the examples of how that was done as we go through these. Okay, super, super high level. What's the design look like? We really only have three abstractions in the design. There's, of course, the blob; hence the name Blobstore. And the names were picked, although it's impossible to find words in storage that aren't overloaded, in an attempt not to put some sort of mindset around file or
Starting point is 00:11:56 object or whatever, chunk. They're all taken, so it doesn't really matter what we chose. But blob is our largest level of abstraction, most analogous to a file, right? It's just a collection of blocks. We've also got two other abstractions. I'll show you a picture of those coming up on the next slide. But at the lowest level, we define a page, and a page is the smallest level of access to the media. So it's typically a block size or larger than a block. And actually, right now, it's fixed at 4K. So it would take a little bit of work to do something different. But that's a page unit of access. And then our next level of abstraction is a cluster.
Starting point is 00:12:33 And a cluster is just a collection of contiguous pages. And that's much larger. The default size is a megabyte. And that's the unit of allocation. So that's, of course, geared towards SSDs, that when we do a resize of the blob and shrink or grow, we want to do it in large contiguous chunks to optimize what's going on at the SSD. The whole thing is asynchronous. I will explain a little bit more exactly how those asynchronous
Starting point is 00:12:58 calls work when combined with other units within SPDK, part of our application framework, we call it, which is really that. It's a framework that was kind of born out of necessity. We started with the NVMe polling driver, started building higher-level application-type things like the iSCSI target and the NVMe-oF target, and discovered that there were different elements in building those applications that were common amongst the applications that were using the polling mode driver.
Starting point is 00:13:29 So, for example, an event scheduler and a poller and non-blocking rings for communication between threads. All those kind of things were common, so they got sort of merged into this thing called the application framework. And you'll see an example of that as part of the Hello Blob. Interestingly enough, and of course it's open source, we don't know exactly how everybody's using
Starting point is 00:13:51 the different modules and what they're doing with them unless they come out and talk about it. I'm not sure anybody's using the application framework except for the SPDK modules. So most large applications and large customers, consumers of this, already have mechanisms in their own software to do these things. So it's kind of the glue that ties together things as an example
Starting point is 00:14:11 or if you're going to use the applications that we provide. So asynchronous also implies there's no blocking, no queuing, no waiting. Blobstore is run through to completion or give you an error, one of the two. There's no mechanisms for queuing. It'll never block, and it's fully parallel. We've got a concept of channels. Again, another word you can't avoid overloading these terms. But for us, a channel is essentially associated with a thread or a core
Starting point is 00:14:41 and designed to do a different stream of IO. So with NVM Express, of course, we would associate a channel with a queue pair. So you can set up your application so that you've got a channel per blob, or a channel per series of blobs that you know require no coordination between the reads and writes between those, and then all of this just happens as fast as it can possibly happen. Okay, so this diagram. I didn't draw a legend on this.
Starting point is 00:15:16 I probably should have done that and made it a little easier to see, but this is one blob to give you an idea of the three abstractions I talked about. These things at the top are the pages, the contiguous series of 4K pages. And each one of those pages, or each one of those chunks of pages, as they are contiguous, line up in a cluster. And then a cluster, therefore, is just a contiguous set of pages.
Starting point is 00:15:41 But clusters themselves don't have to be contiguous. So a blob itself is really nothing more than a linked list of cluster identifiers or cluster IDs. So you can see the different colored clusters show that down in physical address range they can be anywhere. It doesn't really matter. It's just those little chunks of clusters
Starting point is 00:15:56 that are contiguous. So it's a really super simple and basic layout scheme that allows really fast lookups for virtual to physical address with really no chance for error. That make sense? Any questions on any of this so far?
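To make that lookup concrete, here is a toy sketch in C of the page-to-cluster translation. This is illustrative only, not SPDK's actual structures; it assumes the default 4K pages and one-megabyte clusters described above.

```c
#include <stdint.h>

#define PAGE_SIZE          4096u
#define CLUSTER_SIZE       (1024u * 1024u)
#define PAGES_PER_CLUSTER  (CLUSTER_SIZE / PAGE_SIZE)   /* 256 */

/* Toy model: a blob is just an ordered list of cluster IDs. */
struct toy_blob {
	uint64_t *clusters;      /* cluster IDs in blob order; physically anywhere */
	uint64_t  num_clusters;
};

/* Translate a blob-relative page number to a physical page on the disk:
 * one divide to pick the cluster, one modulo for the offset inside it. */
static uint64_t
blob_page_to_disk_page(const struct toy_blob *b, uint64_t blob_page)
{
	uint64_t idx    = blob_page / PAGES_PER_CLUSTER;
	uint64_t within = blob_page % PAGES_PER_CLUSTER;

	return b->clusters[idx] * PAGES_PER_CLUSTER + within;
}
```

Because the lookup is a single array index, there is no tree to walk and no allocation to go wrong; that is the simplicity being described.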
Starting point is 00:16:21 I'm sorry, where is all that managed? Yeah, they're in the Blobstore implementation. So all of this is Blobstore. At the top of Blobstore is the API you're going to see that the application talks to, which is, you know, read, write, open, close, delete, resize, and that's probably it off the top of my head. And then at the back of Blobstore is blocks.
Starting point is 00:16:51 Okay, and of course we've got metadata to keep the blobs persistent and their locations persistent. The metadata design is probably by far the most complicated portion of the code base, and I'm not even going to attempt to get into all of the different aspects of it. At the simplest level, each blob has one or more pages of metadata in a pre-reserved region, and each one of those pages is strategically designed to be 4K. And to guarantee the atomicity of Blobstore metadata updates, we have a requirement that you're running with an NVM Express device that provides atomic updates
Starting point is 00:17:28 at the block level, or in this case, 4K. That's why it's hard-coded right now in 4K. But the metadata is all isolated in units of pages, and they're not shared between each other, so you can do updates on different blobs and not be dealing with any locks or contention on the disk. If the metadata for a single blob gets larger than 4K, then it will bleed into a second sector or a second page in our blob store terminology.
Starting point is 00:17:57 And once you go up over two and three, any updates from that point are write, allocate, and atomic. So that's all protected, which makes it the more complicated of the two scenarios, writing to metadata versus writing to disk. And really, the metadata is nothing more than an extent list, a list of clusters that have been allocated for that blob, and a series of what we call xattrs,
Starting point is 00:18:23 which are similar to xattrs in file system land. It's a very crude implementation, very simple, just a way to store key-value pairs with blobs to help solve little problems that applications might have in trying to keep track of things. The only identifier between an application and the blob store for any blob is an ID number, which happens to be an offset into the metadata region of where that blob's metadata is. So that's all you get.
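As a sketch of what you can do with that, here is roughly how a consumer could hang a human-readable name on a blob, the way the RocksDB shim described below stores file names. spdk_blob_set_xattr and spdk_blob_get_xattr_value are the names in current SPDK; treat the exact signatures as version-dependent.

```c
#include "spdk/blob.h"
#include <string.h>

/* Store a name as a key/value pair in the blob's metadata page(s). */
static int
tag_blob_with_name(struct spdk_blob *blob, const char *name)
{
	/* +1 keeps the NUL so the value reads back as a C string. */
	return spdk_blob_set_xattr(blob, "name", name, strlen(name) + 1);
}

static const char *
blob_name(struct spdk_blob *blob)
{
	const void *val = NULL;
	size_t len = 0;

	if (spdk_blob_get_xattr_value(blob, "name", &val, &len) == 0) {
		return (const char *)val;
	}
	return NULL;    /* no such xattr */
}
```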
Starting point is 00:18:53 So we'll see, when we talk about how we bolted into RocksDB, how we use xattrs in that case to store file names, so that the application can use something that it knows. Okay, and then I think this is the last slide with words on it, and then it's all, like I said, painfully insulting C code, but I'll highlight some of the cooler features that aren't so obvious about the SPDK framework as we go through it. But, yeah, okay, so I mentioned all those except for sync. So sync is for synchronizing the metadata, right?
Starting point is 00:19:31 So along with the idea of making this as lightning fast as humanly possible, the metadata is kept in RAM, the working copy, and is only synchronized to disk on an explicit call by the application that says synchronize to disk, or when you gracefully close down the blob. So if it's not obvious, that means if you're plugging along and you resize your blob, make it larger and write a bunch of data,
Starting point is 00:19:59 and you lose power without syncing, well, you're screwed, right? You're going to come back up. You'll be consistent, but you'll be consistently wrong, because your blob will be the wrong size and you've lost all that data. It's kind of up to the application what their requirements are
Starting point is 00:20:13 for sync. For example, if you were doing reads and writes and intermingling a bunch of metadata operations, and you really need the performance and you can deal with that window where you're not consistent, then maybe you don't want to pay the penalty for going to disk with
Starting point is 00:20:31 metadata writes on every single operation. So you can batch them up in large groups and then do a sync of the metadata at the end.
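A minimal sketch of that batch-then-sync pattern, assuming the current API name spdk_blob_sync_md (early releases named these calls differently):

```c
#include "spdk/blob.h"

static void
md_sync_done(void *cb_arg, int bserrno)
{
	/* Only now is the metadata durable. A power loss before this callback
	 * fires rolls the blob back to its last-synced state. */
}

static void
batch_then_sync(struct spdk_blob *blob)
{
	/* ... a run of resizes, writes, xattr updates, all against the
	 * in-RAM working copy of the metadata: no metadata disk I/O yet ... */

	/* One asynchronous metadata sync pays for the whole batch. */
	spdk_blob_sync_md(blob, md_sync_done, NULL);
}
```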
Starting point is 00:21:04 And then, I mentioned the reason for the two abstractions. We read and write in units of pages, right? So it's not byte addressable, it's page addressable; that's the smallest granularity for read and write. And then allocation, for shrinking or growing your blob, is done in clusters, again optimized for SSDs. Data is direct. I mentioned the metadata is kept in memory; there's no cache structure or anything, it's just kept in memory. That's what we mean by cached. But the data is completely direct. You'll see in the example code where we allocate a DMA-safe buffer and just say read or write, and it's direct in and out of that buffer. Blobstore never touches the data. And, again, minimal support for xattrs.
Starting point is 00:21:25 Again, the idea is to keep it super simple. Nothing complicated about it, other than, like I said, some of the metadata strategies, which get a little complicated because they have to. But that's it. Okay, so this is our hello world for blobs, and I mentioned already what it does,
Starting point is 00:21:41 and I'll take you through it. I can't see it from over here, so I'm going to step in front with my blind eyes. So this example uses two other pieces of the SPDK framework. One of them is the application framework, and the other one is the block device abstraction layer. You don't have to use either one of those to use Blobstore. We just chose to in this example.
Starting point is 00:22:04 There's actually an example that is in review right now. There's a patch up to show how to use it without the block device abstraction layer, where you can talk directly from the Blobstore to NVM Express. But the block device abstraction layer is incredibly thin and small and gives us about a dozen APIs to make the application writing just easier without paying much penalty.
Starting point is 00:22:27 So this first part here, this spdk_app_opts init structure, is a structure that controls the application framework. It gives it various parameters. The only two that we're using here are the name and a config file. The config file is for use by the block device framework. So that's actually how you configure
Starting point is 00:22:49 and assign a block device for this application. So for the config file for this one, even though Blobstore is geared towards NVMe, the idea is you can run it anywhere just to see how the code flows. So this is actually a RAM disk, and it's defined in this config file.
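As a hypothetical example, that config file would look roughly like this legacy INI-style SPDK config; the section and key names here are from memory, so treat them as an assumption (newer SPDK releases use JSON configuration instead):

```
# hello_blob.conf (hypothetical): create one RAM-disk bdev, named Malloc0.
[Malloc]
  NumberOfLuns 1
  LunSizeInMB 16
```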
Starting point is 00:23:08 And then we allocate a context structure that we need, because this is all event-driven, callback-driven code. So every time we call a Blobstore API, or almost every time (we'll see a few examples the other way around), we pass it a callback function and two parameters. So I'm always going to pass the callback function I want, and I'm going to give it this context structure so that I've got context of what I'm doing and where interesting things are for me. So this first call here is app start. That's the app framework. We're telling the
Starting point is 00:23:34 application framework, which is a collection of functions. The key point is it's got an event scheduler. We call it a reactor. So this call is telling the reactor on the same thread: I've got a function that I would like you to schedule for execution, and that function's name is hello_start. And when you call it, give it these two parameters: my context, and I'm not using the second parameter. So we're setting that up on the scheduler. Now, the way the scheduler works,
Starting point is 00:24:04 this is all running on the same thread. The scheduler obviously is not in control right now; this program is in control. But as soon as I call app start, which is a special function because it's the start of the thing, the application framework is going to immediately call my hello_start function and basically block on execution of this function. So from here, we're going to jump straight over to the hello_start function.
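Paraphrased from memory, main() has roughly this shape; examples/blob/hello_world/hello_blob.c in the SPDK tree is the authoritative version. Note that the two-parameter callback style matches the SPDK of this era, while newer releases pass a single context argument to spdk_app_start.

```c
#include "spdk/blob.h"
#include "spdk/event.h"
#include <stdlib.h>

/* Context handed to every callback so we know where our state lives. */
struct hello_context_t {
	struct spdk_blob_store *bs;
	struct spdk_bs_dev *bs_dev;
	struct spdk_blob *blob;
	spdk_blob_id blobid;
	struct spdk_io_channel *channel;
	uint8_t *write_buff;
	uint8_t *read_buff;
	uint64_t page_size;
};

static void hello_start(void *arg1, void *arg2);   /* scheduled by the reactor */

int
main(int argc, char **argv)
{
	struct spdk_app_opts opts;
	struct hello_context_t *ctx;
	int rc;

	spdk_app_opts_init(&opts);
	opts.name = "hello_blob";
	opts.config_file = argv[1];    /* the file defining the Malloc0 RAM disk */

	ctx = calloc(1, sizeof(*ctx));

	/* Blocks here: the reactor takes over this thread, immediately runs
	 * hello_start(ctx, NULL), and only returns after spdk_app_stop(). */
	rc = spdk_app_start(&opts, hello_start, ctx, NULL);

	free(ctx);
	spdk_app_fini();
	return rc;
}
```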
Starting point is 00:24:40 I'll explain what happens when this guy completes and returns control, who picks up, and how that happens next. Okay, so now the app framework, the reactor, has pulled this function off of the event list and said, I've got to run this. So it calls my hello_start function with my parameters, and I cast them into my context structure, and there are a couple things I've got to do here before I initialize the blob store.
Starting point is 00:24:59 The first one is I have to get my block device, because remember I said this is built on the block device abstraction layer. There's a couple of different APIs for this. I chose the one because I know the name of it because I defined it in my config file. So I say give me my block device and the name is malloc0 because I'm using a RAM disk. And then the next thing I'm going to need is a block storage version of a device that I'm going to talk to. So this is why I said, remember, we don't have to use the block device abstraction layer. I could create one of these things manually if I wanted to,
Starting point is 00:25:31 and it's not that difficult. There's eight or nine function pointers and a few other things. I don't know what they are. But because I'm using the block device abstraction layer, I've got some helper functions I can use and just say, okay, create me a block store or a blob store version, and I'm using the block device abstraction layer, and here's my pointer to my block device, and that gives me the necessary structure the blob store needs to initialize to get up and running.
Starting point is 00:25:56 Okay, now the fun part. This is fun, right? This is SDC, right, developers conference? Make sure I'm in the right place. Okay, so spdk_bs_init. So now we're going to initialize the blob store. And remember I said this is an asynchronous callback-driven framework. And I tell you, the first time I heard that,
Starting point is 00:26:19 I made some assumptions about what that meant. They turned out to be way wrong, and I didn't learn that until much later. But I've worked on different frameworks like this where asynchronous callback-driven means that as soon as I call this, I get immediate control back on my thread, and some other thread is running off doing something, and I don't really care what or how or when,
Starting point is 00:26:36 but I get immediate control, and I do whatever I want, and then on that other thread's context, I'm going to eventually get a callback to this function called bs init complete with the parameters I pass, and then I know that that operation is done and I can keep going. And that's kind of, sort of how it works, but not
Starting point is 00:26:51 really. And almost all of the functions work this way. What's going to happen is, as soon as I call this function, this is all single-threaded. This thread is going to be executing this line of code and go down into Blobstore and just synchronously execute all the way up
Starting point is 00:27:08 until Blobstore can't do something again. Until it would need to block if we were blocking. Instead of blocking, what it does is it takes our callback and puts it on and passes it into the function that needs to do something that's going to take time, like talk to a disk.
Starting point is 00:27:24 We're not going to poll in the framework for that completion. We're certainly not going to wait for an interrupt. But what Blobstore is then going to do is schedule a poller to periodically check that, and tell the poller: as soon as you get a completion, here's who I want you to call.
Starting point is 00:27:41 And then I'm going to get control back on the same thread. So after I get a return back from here, my application's in control. Nobody else is doing anything, right? Well, the hardware should be doing something, right? The NVMe drive should be processing the request. But until I relinquish control from this function,
Starting point is 00:27:58 until I return, the event scheduler, the reactor that is on the same thread, isn't going to get control again and isn't going to be able to schedule a polling function and isn't going to notice that the I.O. is done. So it is definitely asynchronous, but it's a little funny like that. We don't have to set it up this way.
Starting point is 00:28:16 We could set up multiple threads and do all sorts of other fancy things. But that's essentially what happened. So I get immediate control back from here. As soon as this function exits, the reactor on this thread runs, picks up his event loop, and says, oh, who's next?
Starting point is 00:28:29 And it will most likely be the poller that's checking to see if that NVMe IO is completed. And if it is, it's going to call this function with these parameters. Everybody with me? Okay, so now our blob store has been initialized, and now it gets even simpler and simpler as we go. Now it's almost kindergarten simple.
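Put into code, the init call and its completion look roughly like this sketch, reusing the context structure from the main() sketch above; the bs_dev member is assumed to have been created from the bdev as just described, and the API names follow later SPDK releases, so they may differ slightly from the version on the slides.

```c
/* Completion: the reactor's poller noticed the I/O finished and calls us
 * back on the same thread, handing over the now-initialized blobstore. */
static void
bs_init_complete(void *cb_arg, struct spdk_blob_store *bs, int bserrno)
{
	struct hello_context_t *ctx = cb_arg;

	if (bserrno) {
		spdk_app_stop(-1);      /* bail out of the whole app */
		return;
	}
	ctx->bs = bs;                   /* keep the handle for later callbacks */
	/* ...next step: create a blob... */
}

static void
init_blobstore(struct hello_context_t *ctx)
{
	/* Returns immediately; the completion arrives later via the poller. */
	spdk_bs_init(ctx->bs_dev, NULL, bs_init_complete, ctx);
}
```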
Starting point is 00:28:56 All right, initialization is complete. So at this point, I know that my blob store structure is done, and I can confirm that it's loaded, and I can set up a pointer to the blob store in my context structure and go do whatever I want to do next, knowing that whatever callback I'm going to get next is going to pass me back a pointer to my blob store so I can continue to use it.
Starting point is 00:29:17 So in this example, the next thing we're going to do is create a blob. And I go ahead and grab the page size, and this is an example of a synchronous function in blob store. This is just going to return as soon as it's done. There's no reason for it to quote-unquote disconnect; it's got all this information handy. The next thing I want to do is just call a function called create_blob. It's only a two-line function; the only reason it's not inline here is because it's supposed to be an example, a simple example, and we're trying to keep things logical and chunked together. Another thing that one could do with
Starting point is 00:29:51 the app framework is if, let's say, there was a bunch of crap here and there's a whole bunch of work to be done, and this function was pretty large too, and we're trying to be really fair about this, I could actually call a function called, I think, SPDK schedule event or something like that. That's what it does. And I could give it this function name and this parameter and tell the reactor, next time you go through your loop, put this guy on the list and call him next. And then as soon as I exit here, I really give control to the event scheduler, and he
Starting point is 00:30:20 can be a little bit more fair about how we're utilizing the CPU time for this application. But I'm just calling the function called create blob here because there's no reason to get fancy. All right, and starting from the bottom, and if you've written code in this kind of event framework type thing before, everything kind of looks upside down because you tend to write it and you write where your callback is and then you go right above that function and write the callback
Starting point is 00:30:47 so everything looks backwards. But here's the callback, or rather, here's the function for create_blob. And like I said, all I do is create the blob. I call one of the asynchronous functions with the callback: okay, I want to create a blob, and call create_complete when you're done.
Starting point is 00:31:05 Notice I didn't give it a size. So every time I create a blob, it comes up as size zero. And then I have to resize it to whatever I want it to be. And that's actually on purpose. And I probably won't talk about it in these slides. But, for example, when you bolt it together with our shim to run under rocks.db, we'll just continually allocate off the end and resize the blob as we go, so you don't have to guess at what your size is going to be.
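A sketch of that create step, again reusing the earlier context structure; spdk_bs_create_blob and spdk_bs_open_blob are the current names (early releases used spdk_bs_md_* variants):

```c
static void open_complete(void *cb_arg, struct spdk_blob *blob, int bserrno);

static void
create_complete(void *cb_arg, spdk_blob_id blobid, int bserrno)
{
	struct hello_context_t *ctx = cb_arg;

	ctx->blobid = blobid;           /* the ID is the blob's only identity */
	spdk_bs_open_blob(ctx->bs, ctx->blobid, open_complete, ctx);
}

static void
create_blob(struct hello_context_t *ctx)
{
	/* No size argument: every blob starts at zero clusters. */
	spdk_bs_create_blob(ctx->bs, create_complete, ctx);
}
```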
Starting point is 00:31:32 Okay, so after I've created the blob, I get this callback, my create complete callback. I save off my blob ID. Remember I said we only have an ID. There's no such thing as file names with blobs or handles or anything like that. Next thing I'm going to do is open the blob because I want to do some stuff to it. So same thing, right? I give it a callback function and some parameters I want to pass to that callback.
Starting point is 00:31:58 And we open it up. Okay, so here's our open complete. It's done. It's opened. I now have a handle to the blob that I save off in my context because I might need to use that as a parameter for upcoming functions. I'm going to go ahead and make a synchronous call into the blob store to find out how many free clusters I have in the entire blob store.
Starting point is 00:32:19 I probably wasn't explicit about this, but a blob store takes up an entire NVMe disk, and then you create blobs on that one blob store per NVMe disk. So this is essentially asking me how many blobs or how many free clusters I have across the entire disk. And for this example, we're going to say, okay, now let's resize it to use the entire blob store. We're just going to create one blob just for the fun of it.
Starting point is 00:32:41 And again, this is a synchronous call. And these are both synchronous calls. Maybe you can guess why: because the metadata is kept in RAM, like I mentioned earlier, there's no activity on the back end, so we just make the calls. As I said, we're going to
Starting point is 00:32:54 explicitly sync them up here. So this will take our dirty copy and our clean copy, merge them together, and make ourselves clean with disk, so that our metadata is updated. And this is an asynchronous call,
Starting point is 00:33:09 so we're going to say after the sync is complete, here's my callback and my parameter, and there's sync complete. And then what our example does first is we're just going to go write a data pattern. So it'll show you how to write to a blob. And you can see it's also mind-numbingly simple. We do have to use the SPDK memory allocation routines because this needs to be DMA-safe memory.
Starting point is 00:33:32 We're going to be giving this buffer directly to the hardware, so it obviously can't be paged out in the middle of the DMA. That would be bad. I'm just using the size of the page here for the size of my buffer. You can make any increment of pages as you want, and this is an alignment requirement for the third parameter, second parameter, whatever that is. And now I'm just going to set that memory up with a known pattern,
Starting point is 00:33:56 just, again, something kind of brain dead, so that when we read it back, we can make sure we don't get something different, right? Next thing we're going to do is allocate a channel. Remember I mentioned earlier, channels are these abstract notions of: here's a parallel path that I can do I.O. on that I, as an application, know doesn't require any synchronization with any other channels. They're all free to just operate independently as fast as we can make them run.
Starting point is 00:34:24 So I'm going to allocate a channel from SPDK, and I'm going to store that in my context so I've got access to that later. I'm going to use the same channel to do the read, because there's no reason not to, but I mean, I could have used a separate one. And now we're just going to do a write. So a write is: write to the blob, give it the channel, give it the write buffer that holds the data we want to write. I'm just going to start it at page 0 and write for one page.
Starting point is 00:34:47 And when Blobstore is done, I want them to call my write complete function and give me my context back. Okay, there's my write complete. So in my write complete, I'm going to go read it. And again, I could do this in line here. I could schedule it as an event. Or for the sake of the example,
Starting point is 00:35:08 I'm going to put it in a nice little compartmentalized function for reading, which, again, has to allocate a buffer. For obvious reasons, we can't use the same buffer, or we're not going to be able to compare the two. So we get another page size buffer, and now we're going to pass pretty much the identical function, except the read is the name of the API, and we're going to go to our read complete function
Starting point is 00:35:30 with our same callback argument. All right, free cup of coffee: anybody can guess what the read complete does? Anybody going to take me up on a free cup of coffee? All it's going to do is compare the memory, right? So the read is done. We know it's done, and we know the data buffers had better have what we put in them,
Starting point is 00:35:53 so we just do a quick mem compare and make sure all is good. And since it's an example app, we're going to show how to use the other APIs. The first one is going to be closing the blob, and then immediately on closure, we want to go into our function called delete_blob. So we're going to tell it to jump right into that.
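Before the teardown, here is a condensed sketch of the write-read-verify sequence just walked through, with the same caveats: current API names (spdk_blob_io_write/read; early releases called these spdk_bs_io_*_blob), offsets and lengths in pages, and the context fields from the earlier sketch.

```c
#include "spdk/blob.h"
#include "spdk/env.h"
#include <string.h>

static void
read_complete(void *cb_arg, int bserrno)
{
	struct hello_context_t *ctx = cb_arg;

	/* The free-cup-of-coffee step: verify we got back what we wrote. */
	if (memcmp(ctx->write_buff, ctx->read_buff, ctx->page_size) != 0) {
		spdk_app_stop(-1);
		return;
	}
	/* ...close the blob, then delete it, as shown next... */
}

static void
write_complete(void *cb_arg, int bserrno)
{
	struct hello_context_t *ctx = cb_arg;

	ctx->read_buff = spdk_malloc(ctx->page_size, 0x1000, NULL,
				     SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
	/* Same channel, page 0, one page. */
	spdk_blob_io_read(ctx->blob, ctx->channel, ctx->read_buff,
			  0, 1, read_complete, ctx);
}

static void
blob_write(struct hello_context_t *ctx)
{
	/* DMA-safe buffer: the hardware reads it directly, no paging allowed. */
	ctx->write_buff = spdk_malloc(ctx->page_size, 0x1000, NULL,
				      SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
	memset(ctx->write_buff, 0x5a, ctx->page_size);    /* known pattern */

	/* One channel per independent I/O stream; reused for the read. */
	ctx->channel = spdk_bs_alloc_io_channel(ctx->bs);

	spdk_blob_io_write(ctx->blob, ctx->channel, ctx->write_buff,
			   0, 1, write_complete, ctx);
}
```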
Starting point is 00:36:16 Okay, this is our callback for closure, and here's where we're just going to delete it. Pretty simple. Another asynchronous call into blob store telling it to delete, and when you're done deleting, call this function up here. And this function is just going to call
Starting point is 00:36:30 another function internal to the Hello Blob app, called unload_blobstore, which is simply going to free the I/O channel that we got and then call the Blobstore API for unload, which is also going to do a synchronization in the background. I think I mentioned that: if you don't explicitly sync, we always sync when you unload. That makes this an asynchronous function that's going to require a callback. There's a callback for unloading, so once we've unloaded, we, you know, set a little
Starting point is 00:37:01 flag for unloading that helps us with cleanup. And then we have this other magic app framework call, app stop, and that's just a return code for success, hard-coded there. And actually, this is another sort of magic one, like the start. This will tell the framework this application is finished: when you're done, relinquish, stop blocking from main. So if you remember when I showed you main, we called app start, and I said, now this is blocked here,
Starting point is 00:37:31 and the application and the SPDK framework and the event scheduler have taken over and are in charge of what runs on this thread. As soon as it gets this call, it's like, okay, you're done, and it returns control back to hello world's main, which I didn't bring back up because it's so uninteresting:
Starting point is 00:37:44 all it does is free up the memory that we allocated and just cleans up nicely, and that's the end. So that's about as, I said, insultingly simple as you can get, but hopefully that was sort of interesting. You can see how the API actually works instead of just sort of hearing about it and getting some hand-wavy stuff, and especially seeing and hopefully understanding the nature of these asynchronous calls and how all of this can happen on a single thread, right, with the help of our reactor event scheduler running in the background there.
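And, under the same naming assumptions, a sketch of that teardown chain:

```c
static void
bs_unload_complete(void *cb_arg, int bserrno)
{
	/* Unblocks spdk_app_start() back in main() with this return code. */
	spdk_app_stop(bserrno);
}

static void
unload_blobstore(struct hello_context_t *ctx)
{
	if (ctx->channel) {
		spdk_bs_free_io_channel(ctx->channel);
	}
	/* Unload does a final metadata sync in the background. */
	spdk_bs_unload(ctx->bs, bs_unload_complete, NULL);
}
```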
Starting point is 00:38:22 So before I turn it over to Vishal, who's going to go over some performance numbers with RocksDB, I'm going to give you a little bit more context, instead of just jumping right into the performance data, on what components of SPDK are involved in creating the numbers he's got there. So we obviously know what the driver is. Talked about the BDEV, the block device abstraction layer. I didn't draw it up here, but the event framework was a big part of that example we just saw. And, of course, Blobstore.
Starting point is 00:38:42 And then you may have seen BlobFS or heard of BlobFS. It's by no means a file system. It's the minimum file system-like stuff that RocksDB needs. For example, Blobstore only has IDs; RocksDB needs file names. Blobstore's unit of read and write is a page, and it needs to be byte addressable for RocksDB. So this layer provides that kind of stuff, right? And then we've got our environment layer for SPDK linked into RocksDB. So this is actually what Vishal is going to be talking about next. I can still work with this off. This is fine.
Starting point is 00:39:22 All right, thanks, Paul. So we discussed how the SPDK Blobstore is designed. Now we want to look at how it performs against the traditional approach using the Linux kernel XFS file system. Since Paul actually discussed, on the RocksDB side, how Blobstore was integrated with RocksDB,
Starting point is 00:39:54 we would be using that, and a benchmark called db_bench is used. In this configuration, we have an Intel Xeon E5 v4 CPU. It's a Broadwell CPU, on an OS config based on Ubuntu 16.04.
Starting point is 00:40:17 We took the latest kernel at the time of testing, 4.12.4. And the key thing is the NVMe device we are using: it's an Intel Optane SSD, which was released just two to three months back, so our database would be running on this Intel Optane. So the data set, on the right side, it says 242 gigabytes. This comprises 250 million uncompressed records, with 16-byte keys and 1,000-byte values. That's the
Starting point is 00:40:57 db_bench config we used; we have quite predominantly seen people using those kinds of key and value sizes. So what about the workloads? There are three predominantly used workloads for db_bench, which synthetically read or write the keys and values. The first one is read-write, where we have one writer thread and n threads which are actually doing the reads. Then we have fill-sync, which is filling the drive where, for each operation, there is a sync call for each IO. And then there is overwrite, where you are updating your keys. So let me go through the performance data.
Starting point is 00:41:52 So workload number one, which is read-write, is like 90% read, 10% write. So we have four threads doing reads, which are doing point queries, and then one thread which is actually writing simultaneously. For the performance side, on the left side, there are operations per second, the different metrics we gathered. There's read ops per second: my point queries,
Starting point is 00:42:18 how fast they were actually performing. So, Linux kernel versus SPDK, we see that there is about a 280% performance improvement using SPDK Blobstore under the same RocksDB workload. Average latency dropped, which is a good thing: you have more IOPS, and you can perform at higher speed with lower response times. CPU utilization was lower, 20%, and then the P99, that's the biggest one. I will talk about that: the tail latency, using this SPDK asynchronous-driven approach.
Starting point is 00:43:00 There is this thing the Linux kernel does: when it does a flush operation, all the threads which are actually reading are blocked while the sync operation is happening. But in SPDK's case, since it's asynchronously driven, there are no queuing mechanisms there. So we see that there is a 53% improvement in tail latencies with read-write, which is a pretty common workload in db_bench. So workload number two, fill-sync. This is a workload where you're writing values in a random key order, but each IO is actually getting synced in sync mode. So while we run this workload, we see there is a 420% jump using SPDK Blobstore against the traditional Linux kernel in terms of performance,
Starting point is 00:43:58 and the latency dropped from kernel to SPDK by 77%. And overwrite. This is my third workload, where the database is already set up and you're trying to update your keys. We don't see a lot of improvement here, because this workload comprises large block IOs; there is compaction and flush activity which happens in the background, because of which the chances of improvement
Starting point is 00:44:35 with this workload tend to be a little bit lower than you saw with the previous workloads. And so that was the performance section, where we compared the Linux kernel versus SPDK. And now I would like Paul to just summarize things, to end this talk. Thanks. Thanks, Vishal.
Starting point is 00:45:02 Yeah, SPDK is fast. There you go. There's the summary: SPD-fucking-K is fast. That's a reference to Stephen's talk from this morning, just for those of you that are wondering what I'm saying up here and why. But, yeah, SPDK is not a one-size-fits-all solution. But for those that it fits, it fits well. And the community growth and the growth of the project itself are a testament to that. And we are always looking for more people to come on board and to help out and provide their insight
Starting point is 00:45:29 to help move things forward both in SPDK and in the entire ecosystem. So we have a couple of minutes for questions, or if we get booted out of here, I'm going to be sticking around for the next talk, so I'll be in this room. And Vishal has another talk at 3 o'clock on key-value data structures and their impact on performance. You'll want to go check that out. Any questions right now? Yes?
Starting point is 00:45:53 Yes. [Audience question, partly inaudible: is there any mapping between DRAM and Blobstore, so you can use DRAM like a cache?] No, we haven't done anything like that. It is... It's really nothing more than what I went through.
Starting point is 00:46:18 You feed Blobstore buffers and it interacts directly back and forth with the disk. [Has that been a problem for people?] I haven't heard of any. That doesn't mean there aren't any. I'm actually fairly new to the project, so there very well could have been some discussion on that. Certainly, if you have some ideas, like I said, IRC is the best way to get involved with the community.
Starting point is 00:46:38 We have our distribution list, too, that you can bring up ideas on. But there's always somebody on IRC, and if you spark some interest, whether it's from an Intel person or somebody else from one of the other big companies that are out there, it can get a life of its own pretty quick. Yes? How do you handle fragmentation on that? Do you have some compaction?
Starting point is 00:47:01 No, we don't handle any fragmentation in Blobstore. We kind of leave all of that to the media. Because we're using our cluster size and allocation of one megabyte, we don't have much of a problem with fragmentation with our usage so far. Yes, back here. I have a question related to how you manage free space in the Blobstore. Free space in the Blobstore? Blobstore doesn't manage it.
Starting point is 00:47:26 So it's managed by the application. That's one of many things where, when you look closer at the Blobstore, you're like, wow, it doesn't do that either. Oh, it doesn't do that either. And it ties back to that slide on simplicity. It only does what it needed to do to enable the benchmarks that Vishal just showed. And I'm sure it will grow over time as people see different value in adding different things. But the idea is kind of, let's keep it that simple,
Starting point is 00:47:47 and if you need something, we can put another layer on top of it, like the BlobFS thing that I mentioned earlier that ties in file names and byte addressability. Yes? Is there really work to extend this to NVDIMM? Yeah, there is.
Starting point is 00:48:04 It's actually up on Trello. And the question, for those of you who didn't hear it, was whether there's work to extend this to NVDIMM or generically to persistent memory. Being from Intel, I'd have to say the most efficient way to use persistent memory would not be to go through this, right? It would be to actually use one of the low-level NVML libraries
Starting point is 00:48:20 that you maybe heard about earlier this year or yesterday. I was actually out here last year and gave a talk on NVML for Windows. So NVML straight up, just using the lightweight library, is probably the best way to do that. But for those that are already in the SPDK world or consume blocks in user space, however, and want to consume persistent memory, one of the NVML libraries is called libpmemblock. And an effort was just kicked off, I think, last week. And it's up on Trello. I think there's a whole board on Trello
Starting point is 00:48:49 that talks about bolting libpmemblock in as a block device in SPDK, which would give user space applications block access to persistent memory, albeit not the most efficient way to do it. But it's there. And for those that want to make that quick leap, they can do it.
Starting point is 00:49:05 Great question, though. [Audience: can multiple applications share SPDK?] Very carefully. So with SPDK, there is support for multiple applications sharing; I wouldn't say sharing SPDK, but sharing the resources that SPDK is using. And I believe most of that is going to be through DPDK. So I didn't really mention DPDK at all, but DPDK is an optional resource for SPDK to provide services like memory management and some communication services and some other things that I'm not super familiar with.
Starting point is 00:49:45 We have an abstraction layer between us and DPDK that is fairly new. It's just called env.h. We have a lot of consumers of SPDK that do this, which allows them to use whatever their own application's mechanism is for, for example, memory management. But with DPDK, and we actually use this feature in one of our test components, you can have DPDK manage shared memory. And in the SPDK application, when it starts up, we've got a flag plumbed in where you give it a shared memory identifier, and you can have two
Starting point is 00:50:16 different SPDK-based applications sharing a piece of memory that was allocated and is managed by one DPDK instance. So, yeah, you can do it. There are a bunch of different other ways too, I'm sure, but that's the one I know of. Yes? Why did you kill the channel priorities? Why did we kill the channel priorities? You know, I don't have the history to know that,
Starting point is 00:50:37 but I can definitely find the answer for you. Yeah. Oh, I have no idea; I know what the commit message says. [Audience: was it an issue with the hardware?] Oh, I have no idea. Honestly, I'm not familiar with it. Yeah, I know something just got started in the community on, I think it's just generically titled QoS, and it involves IO-based priorities as opposed to channel-based priorities.
Starting point is 00:51:03 There might have been some debate and discussion before my time about which was more efficient, which was going to be more usable. I'm not sure, but I've never heard of it being tied to a hardware issue of any kind. I can find out if you want. You can drop me your name and get on IRC and ask. Use the distribution list and ask. Come to the hackathon and ask. Okay, I think that's it.
Starting point is 00:51:24 If you have more questions, like I said, I'm going to stay in this room for a while, so I'll be around. Thanks very much. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the storage developer community.
Starting point is 00:51:56 For additional information about the Storage Developer Conference, visit www.storagedeveloper.org.
