Storage Developer Conference - #52: An Enhanced I/O Model for Modern Storage Devices

Episode Date: July 25, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. My name is Martin Petersen. I'm a storage architect in Oracle's Linux engineering group, and I'm also one of the authors of the Linux I/O stack and maintainer of the Linux SCSI subsystem.
Starting point is 00:00:43 I'm going to talk about an enhanced I/O model for modern storage devices. And maybe what I should have said was an enhanced I/O model for legacy applications on modern storage devices, because it appears the topic at this conference this year is hyperscaler and cloud storage and so on, where you have lots of flexibility in being able to modify your application to adapt to the storage hardware. But there's a ton of old applications out there, and that's what I'm trying to address with the changes I've been making.
Starting point is 00:01:17 So, modern storage. For decades we have been optimizing our I/O stack to work well on old-school spinning hard drives. And it's worked very well, and that particular design choice of trying to optimize for spinning media and reducing seeks and whatnot hasn't adversely affected any of the other classes of storage that we're driving, anything from USB sticks to high-end storage arrays. Pretty much all of them have benefited from our general approach of issuing I/O to the storage.
Starting point is 00:01:52 But we're getting to the point now where many of the devices coming out have very specific needs in terms of being driven effectively. And so I've been looking into what we can do to the existing I/O interfaces to make these devices perform well and predictably. And as it turns out, despite the fact that the media, flash and SMR hard drives and so on, is vastly different in terms of characteristics, the things that we need to communicate to these devices to make them perform well are pretty much the same regardless of which type of media is backing the device. And that's lucky for us, because we would like our
Starting point is 00:02:33 I/O stack to have common semantics and be generic and not tied to a particular type of device. And it also means we don't need to throw away decades and decades, thousands of man-years, of file system and database development. Now, if we're looking at optimizing for the storage, there have been, and are, different approaches in the standards groups. One such thing was OSD, the SCSI protocol for object storage. It never really took off in the industry.
Starting point is 00:03:07 Right now there are the zoned access commands (ZAC/ZBC) for SMR drives, which are very, very specifically modeled on how those drives are physically implemented in practice. And they're good for certain workloads where your application is very conducive to that type of data access pattern: they're mostly reads, you rarely overwrite stuff, and it's very rare that you actually need to go scrub or reclaim the data. Whereas most of the legacy applications we have are inherently a random read-write workload. That's databases, normal file system I/O.
Starting point is 00:03:44 And overwrites are very frequent. So the common theme of this conference has been we're moving to, you know, write once, never write again, and we're moving to sequential write workloads because that's what SMR drives want. But we can't throw away all the applications that need to be able to do random writes. So somewhere, somehow, somebody will need to convert those random writes into something that works in a way that's good for the device, typically a sequential workload. So random writes are not going away, no matter how much we wish for that, it's not going
Starting point is 00:04:23 to happen. So back many years ago, the storage protocols were heavily tied to the physical implementation of the device. When you had to talk to a hard drive, you needed to know cylinders, heads, and sectors and that kind of thing; it was very intimately tied to the implementation of the device. And then about 20 years ago, people decided, you know what, this is a really bad idea. It's hard to move drives from system to system.
Starting point is 00:04:52 You need to update disk tabs and whatnot. So let's move to a model where we use logical blocks. The device presents an LBA range from zero to capacity and completely hides the internal implementation of how that data is organized on the device. And that was done in the days of hard drives and we're reaping the benefits of that transition from the physical to the logical representation of the storage. So today we can take a $1 USB key or a $1 million storage array and we drive them using exactly the same protocol. That wouldn't have been possible if the physical representation had been inherent in the protocols like it used to be.
Starting point is 00:05:34 Another reason that we don't like the media-specific access protocols is that when we have an I/O request sitting in the application, we need to send it out to a bunch of different devices at the same time. We're increasingly moving to a model where we have a locally attached flash cache of some sort, but the same data needs to go over IB, over the network, onto a different type of storage, or onto locally attached spinning media storage. So semantically, our way of doing I/O needs to have some commonality regardless of which type of storage protocol it's talking to. Also, the read-write I/O model that application programmers use is heavily ingrained in our standards interfaces. We can't just throw
Starting point is 00:06:26 that away without breaking a ton of applications out there. The OS is already, to a large extent, completely simulating that interface. We're faking it, but we're honoring the guarantees. But the read-write model, as seen by the application that uses the C library to do read-write I/O, is implemented very, very differently inside the kernel. We don't really do I/O the way you think we do I/O. We just present you with that interface, but we can't break that interface. That's our promise to the application people, and we can't do it easily without some cooperation from the storage device.
Starting point is 00:07:13 So what are the characteristics that distinguish modern types of storage like Flash and SMR from the old school legacy hard disk drives? Well, for Flash, it turns out that one of the most important things, here's Bill, maybe some of you were at Bill's talk, one of the most important things on Flash devices is that you need to be able to put data that belongs together in a logical fashion together in the same physical media spot. That really helps you to reduce write amplification and it helps you to avoid doing extensive garbage collection in the future. And another thing that's important
Starting point is 00:07:54 to flash classes of devices is to be able to tell a device, now is a good time for you to do garbage collection. So those are sort of the two main characteristics. Being able to group logical units matters because, as a result of controller constraints and device constraints and protocol constraints, when we write a 40 gigabyte file, we can't send that as one I/O. That 40 gig file gets chopped up into many, many, many pieces, and by the time it hits the device,
Starting point is 00:08:23 the notion of those pieces forming a logical unit has been lost. So we need a way to communicate to the device all these requests that we just queued on you actually belong to the same thing and we think you should put them together in the same physical location to ease your workload in the future. If we move on to SMR drives, it turns out it's pretty much exactly the same problem. If you have an SMR drive, you need to be able to co-locate things that belong together in the same zone to ease the zone management. Zones on an SMR drive are typically much bigger than on a flash device, but the problem is essentially the same. You want things that belong together to be grouped together physically on the media to
Starting point is 00:09:08 ease future garbage collection. And the SMR drives, under the assumption that the drives in question have some sort of translation layer between logical blocks and the physical zones backing those logical blocks, also need to be able to do garbage collection, exactly like a flash device does. So there's lots of commonality between these two cases. If we move on to storage arrays, the arrays also benefit from the notion of being able to identify different I/O streams from different applications.
Starting point is 00:09:43 The rationale is slightly different from why it makes sense for SMR and Flash, but the mechanism is the same. You want to be able to prioritize I.O. from different databases on your system so that your test database doesn't impact the performance of your production system or your backup jobs that you want to run in the background don't interfere at all with the performance of the system in general. Turns out storage arrays also have the need to be able to do background tasks, data migration to reclaim space if it's a thin provisioned device, that kind of thing. So just like SMR and Flash, storage arrays also have a need to be able to perform background operations
Starting point is 00:10:23 and would like to know when is a good time to do so. So to summarize, modern storage devices would like to know additional information about each I/O. They would like to know the nature of the I/O: what is this thing that you're sending to me, and what should I do with it? They would like to know how the I/Os that you're queuing on the device at the same time are related, if they're related. And they would like to know when it's a good time
Starting point is 00:10:52 to do background operations. So first, the I/O classification. Two years ago, Jim Williams and I from the database team gave a presentation about I/O hinting and how it's used in the database. The Oracle database internally uses over 70 hints to classify I/Os as they flow through the system. And until now there's been no standards-based way for the database to communicate that out to the storage. But we're now using a SCSI feature to make that communication path possible.
Starting point is 00:11:28 And so we talked to the database folks and said, okay, you have your 70 hints; which of these make sense, and what would you like to tell the storage? We also did two surveys with a bunch of different application people and asked, which hints make sense for your application? And the result of that was we ended up with a long, long table of many different hints
Starting point is 00:11:49 and it was basically impossible to try and accommodate everybody. But there turned out to be a common pattern, namely that some I/O requests are transactional, and it's very important for the application that they complete quickly and that, in the future, they are low-latency accesses. You know, the file system journal or the database journal.
Starting point is 00:12:15 And it's very probable that you are going to access those blocks in the future. So we condensed all those long tables into these six I.O. classes. They're not set in stone. It's just an example of the hints we found actually worked when communicating to the storage. So we have these six classes.
Starting point is 00:12:34 Transaction for file system journals or database redo log. Metadata to identify file system metadata. Swap. We have a real-time I.O. class for things where you don't want hiccups in the I.O. stream. We have the normal data stream, and then we have background tasks such as backups or resyncing a RAID. So for each I.O. we send down to the storage, we classify
Starting point is 00:12:58 the I/O with one of these six categories, and then the storage device can decide, based on that information, whether it makes sense, for instance in the case of an SMR drive, to put the file system metadata in a conventional zone rather than a sequential zone. In the case of a storage array, you might want to pin stuff in the cache or put it in a flash tier as opposed to out on spinning media, depending on what type of classification we assign to the I/O. In T10, in this case... For that, mostly, in the I/O there is no differentiation between reading and writing, am I correct?
Starting point is 00:13:38 Because it's, you know, based on the specific I/O characteristics; for example, if you want to get frequent read access instead of mixing them, then the device might do something else. Right. So the way it's defined in T10 here is that the actual hints you send to the device are coded in the group number in the read and write commands. And it's an index into a mode page that looks like this, where you can define different characteristics for the I/O.
Starting point is 00:14:15 You can say, you know, I'm going to access this very frequently. I'm going to be mostly sequentially writing and rarely reading, if it's a file system transaction log, for instance. And you could set the subsequent I/O hint to indicate that other I/O on the system is backed up behind this I/O, so you want this I/O to complete quickly. That presumes that the device is doing internal scheduling of some sort, but the hint is there. In the Linux case, most of these hints that we set on the I/O are actually discrete hints. They're flags that describe a certain thing.
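As a rough sketch of what such discrete, per-I/O flags might look like, here is an illustrative C fragment; the identifiers are hypothetical stand-ins, not the actual kernel or T10/NVMe names:

```c
/* Hypothetical per-I/O hint flags, one bit per property, mirroring the
 * "discrete flags" idea described above. Illustrative names only; the real
 * kernel and the T10/NVMe specs use their own identifiers. */
enum io_hint_flags {
    IO_HINT_ACCESS_FREQUENT   = 1 << 0, /* blocks likely to be accessed again soon */
    IO_HINT_WRITE_SEQUENTIAL  = 1 << 1, /* writes arrive in ascending LBA order    */
    IO_HINT_READ_RARELY       = 1 << 2, /* e.g. a journal: written, seldom re-read */
    IO_HINT_LATENCY_SENSITIVE = 1 << 3, /* other I/O is backed up behind this one  */
};
```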
Starting point is 00:14:50 But given how SCSI specified this to be flexible and expandable, we have to map that into a fairly short list of static hints. So we can't be quite as expressive in SCSI as we can in NVMe because in NVMe each of these are distinct flags. And you can see that here, here's the NVMe DSM field which is the corresponding thing in the NVMe spec. And you can set access latency, access frequency, and so on here. So these correspond pretty much, not entirely, but almost one-to-one with the descriptor
Starting point is 00:15:26 in T10. So in both SCSI and NVMe we have a way to communicate this information to the device. There are two flags that are different between Read and Write in NVMe, I think. Something to that effect. But they're almost the same. OK. OK.
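For reference, a small helper that packs these hints into the DSM byte carried in an NVMe read/write command; the bit positions reflect my reading of the NVM command set (access frequency in bits 3:0, access latency in bits 5:4, sequential request and incompressible as single bits) and should be checked against the spec revision you target:

```c
#include <stdint.h>

/* Pack NVMe Dataset Management (DSM) hints for a read/write command.
 * Assumed layout: access frequency in bits 3:0, access latency in bits 5:4,
 * sequential request in bit 6, incompressible in bit 7. */
static inline uint8_t nvme_dsm_byte(uint8_t access_freq, uint8_t access_lat,
                                    int sequential, int incompressible)
{
    return (uint8_t)((access_freq & 0x0f) |
                     ((access_lat & 0x03) << 4) |
                     ((sequential ? 1 : 0) << 6) |
                     ((incompressible ? 1 : 0) << 7));
}

/* Usage: uint8_t dsm = nvme_dsm_byte(freq, lat, 1, 0);
 * The meaning of each frequency/latency value is defined by the spec's tables. */
```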
Starting point is 00:15:54 So the IO classification that we set is on a per IO basis. And we tried static labeling when this originally came about in T10 and T13. The intent was to make it applied statically to an LPA range. But there are several problems with that. We actually tried and we got horrible results because a common use case is you have a database file, you're running your database application doing random read write to that file, and at the same time you start your backup job which is doing a sequential read of the same
Starting point is 00:16:23 blocks. So the classification is a property of the issuer of the I/O rather than the contents of the LBAs. Another thing we saw was an MP3 collection. Somebody had set the real-time streaming property on his MP3 collection so that he wouldn't get hiccups when he was playing MP3s. But every time his backup ran, the system came to a grinding halt because the drive thought he was streaming his entire MP3 collection. So to us the classification of the nature of an I/O is a per-I/O property
Starting point is 00:16:58 and it's a result of who issued the I/O and the intent they had in issuing the I/O, rather than an inherent property of the contents of that file. At the implementation level, we tie into posix_fadvise, which is a system call an application can use to set certain properties on the file. So an application can say, this file is something, and it will set, like, write sequential, or there are some distinct flags defined, and we'll map those into the relevant I/O class in the case of SCSI or set the flags on an NVMe command.
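A minimal user-space sketch of that call; the file name is a made-up example, the POSIX_FADV_* flag is the standard one, and how a given kernel maps it onto SCSI or NVMe hints is implementation-specific:

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "redo.log" is just an illustrative file name. */
    int fd = open("redo.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int err;

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Advise the kernel that access to this file will be mostly sequential.
     * Note that posix_fadvise() returns the error number directly. */
    err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    return 0;
}
```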
Starting point is 00:17:36 So that was the classification: why are we doing I/O, and how do we communicate that information to the device. Now to the affinity. So I mentioned earlier that when we're writing a 40 gig file, we're chopping it up into 128K or 512K pieces, and the notion of that as a logical unit goes away. So with the affinity we want to be able to convey to the storage device that all these things belong together and should go in the same zone or in the same write channel.
Starting point is 00:18:05 In T10 we use streams, with modifications. I was very glad to hear that Bill was planning parity between T10 and NVMe here, because we're actually not explicitly opening and closing the stream IDs in T10. We bent the spec a little bit for our internal implementation. And in NVMe, we're using directives. Right now there's been some kerfuffle in NVMe about how to go about implementing streams, and two groups of companies disagree strongly about how to do it.
Starting point is 00:18:43 I tried to stay out of the politics very unsuccessfully, I think. But the reality is we can get pretty much all we need with the streams and so I'm probably going to change my implementation from the current stuff to what ended up being the final streams proposal. The stream ID we set on a per file basis is based on the inode number of the file and the partition number hashed together to try and create something that is unique that we can send to the storage. The affinity in SCSI is communicated in the stream ID field here, and there's a similar field in the directives in NVMe
Starting point is 00:19:31 to convey the information about which file this is. So it's a 16-bit value, and when we started looking at this, I thought, oh my God, 16-bit, that's nothing, because I have a million files on my laptop, right? How am I going to work with this? But in reality, the hash collisions that we see with the 16-bit value depends on the workload. And we're sort of saved by the fact that the devices will have 16 write channels or they will have 128 concurrently open zones. And both of those numbers are significantly lower
Starting point is 00:20:08 than the 64K that we can communicate to the device. On our end, we only really see hash collisions if you have a workload that, obviously, if you write more than 64K files, you're more likely to have a hash collision. But the hash collision also needs to happen in a short amount of time because you only care about collisions at the time where that I.O. is queued on the device. So in reality, when we simulated this 64K 16-bit stream ID, it didn't turn out to be as bad as it seemed. And we did a bunch of
Starting point is 00:20:46 checks on a variety of systems running a variety of applications to identify how many files are open and how many files are open for writing. So it turns out on a typical system you have somewhere between 1k and 100k files that are open at the same time. Interestingly enough it needs to be a really big server to compete with your web browser in terms of number of files open. Your web browser is often the craziest example of how many files can a process possibly have opened at the same time. So your average desktop and a very, very high-end server are up near the 100k mark, whereas a lot of server workloads are actually down in the low thousands. As far as number of files that are open for write, which is really what we care about when we communicate the stream ID out to the device, it's typically between 100 and 10k on a large server. So 10k is still significantly lower than the 64k
Starting point is 00:21:48 namespace that we have to work with. So if we combine the I/O classification that we get using the hinting and the affinity that we get using the stream ID, we are able to convey to the storage device the nature of the I/O and how the outstanding I/Os that are queued or sitting in the queue to the device are related to each other. For devices that are big and have internal scheduling, such as storage arrays with large caches and whatnot, the separation of the I/O streams also helps to provide QoS.
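To make the per-file stream ID idea from above concrete, here is a hedged sketch of hashing a file's inode number and the device/partition holding it down to a 16-bit value; the mixing constant and the fold are arbitrary illustrative choices, not the actual implementation:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative only: mix the inode number and the device/partition holding
 * the file, then fold the result down to the 16-bit stream ID namespace. */
static uint16_t stream_id_for(const struct stat *st)
{
    uint64_t key = (uint64_t)st->st_ino ^
                   ((uint64_t)st->st_dev * 0x9E3779B97F4A7C15ULL);

    key ^= key >> 32;   /* fold 64 bits ... */
    key ^= key >> 16;   /* ... down to 16   */
    return (uint16_t)key;
}

int main(int argc, char **argv)
{
    struct stat st;
    int fd = open(argc > 1 ? argv[1] : "/etc/hostname", O_RDONLY);

    if (fd < 0 || fstat(fd, &st) < 0) {
        perror("open/fstat");
        return 1;
    }
    printf("stream id: %u\n", stream_id_for(&st));
    close(fd);
    return 0;
}
```

Collisions in this 16-bit space only matter while two colliding files both have writes queued on the device at the same time, which is why it works out better in practice than the raw numbers suggest.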
Starting point is 00:22:28 And on devices that really don't like frequent overwrites, such as SMR and flash devices, they have natural enemies, and those natural enemies are circular logs, metadata, and swap files. They're some of the worst things you can throw at a flash device or an SMR device. And being able to identify those allows the device to make smart data placement decisions when we consistently tag those particular types of I.O. And in case of arrays, again, with internal scheduling,
Starting point is 00:23:00 it allows you to do I/O prioritization so that your backup job won't steal I/Os from any of the business-critical applications running on the array. So the last component in my three-pronged approach to making devices perform better is the background operations. All the device types I've talked about have a need to do background operations for whatever reason: flash management, or reclaiming space on a thin-provisioned device, or destaging zones on an SMR device. Devices typically do this when they're idle. Devices would like us
Starting point is 00:23:46 to tell them when it's a good time to do it. We have no idea, because we don't do the I/O; the applications do the I/O. So what I've done is, just like we've done for our discard/TRIM/UNMAP infrastructure in Linux, I've created a protocol-agnostic implementation of background control, so an application can call an ioctl and say, now's a good time to do it, and we'll translate that to the background control command in SCSI, and we'll see where it lands in NVMe, but we'll issue the relevant command in NVMe when that spec is finalized. So results.
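A hedged sketch of what such an application-facing knob could look like; the ioctl name and request code below are placeholders invented for illustration, not an actual upstream interface, with the kernel assumed to translate the call into the protocol-specific command (for example the SCSI background control command mentioned above):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Placeholder request code made up for this sketch; not a real kernel ioctl. */
#define BLK_BACKGROUND_START _IO(0x12, 200)

int main(void)
{
    /* The application decides the system is idle, opens the block device,
     * and tells the kernel that now is a good time for background work. */
    int fd = open("/dev/sda", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (ioctl(fd, BLK_BACKGROUND_START, 0) < 0)
        perror("ioctl (hypothetical BLK_BACKGROUND_START)");

    close(fd);
    return 0;
}
```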
Starting point is 00:24:28 So we've been doing this for a while on both hybrid drives and on storage arrays and have gotten really good results, and we would like to move it closer to flash and SMR drives in the future. And the benefit of this approach to doing things, as opposed to using ZAC or ZBC directly for SMR drives, or using something like LightNVM where you do parts of the flash translation workload in the OS, is that this is a common interface that is completely agnostic to the physical storage medium that you're talking to. We can use exactly the same plumbing
Starting point is 00:25:13 throughout the I/O stack and facing the applications, regardless of whether we're talking to a flash device, an SMR drive, or a storage array, or anything else for that matter. So it's a generic infrastructure for communicating these things to the storage through the common I/O stack. So as I mentioned, we've done some results on a simulated target. We simulated a 10-channel flash device and, I think, a 64-zone SMR drive, trying to establish what the co-location rate is, like how good are we at tagging things so that they end up in the right zone, given that the device is constrained by being able to do 10 different
Starting point is 00:26:03 files or logical groupings at a time. And our collocation rates are really good. As I said, I was worried about the 16-bit hash being a constraining factor. But in reality, there are cases where you're doing things like, for instance, when you do the OS install, where you create tens of thousands of files on the system at the same time. That's a case where you're very likely to hit a collision. But once the system is installed and up and running and doing a normal workload,
Starting point is 00:26:31 it's very, very rare that we get a hash collision. And it's also, depending on the workload, reasonably rare that we exceed the capacity of what the device can do. So what I'm saying is it's relatively rare, again depending on workload, it's relatively rare that we're doing writes to more than 16 files at the same time. You know what we'll do is we'll often fill the entire queue with I.O. from a handful of files at the same time. But obviously it depends very heavily on your particular workload. On storage arrays, the hinting and the stream IDs really help the array do cache management.
Starting point is 00:27:12 For instance, you don't want your backup job to be cached. You want that sequential read of your entire database files to just go through the cache and not interfere with what the array is doing for the runtime workload. So storage arrays work really well. We're looking to engage with the disk drive vendors and the flash drive vendors to implement this feature. And I already mentioned the streams versus LBA affinity thing in the spec. I think from a functional perspective there's really no difference.
Starting point is 00:27:51 I think the main difference is the anticipated usage; it's not so much the wording, it's more the anticipated usage model. Can we get to a point where we can just turn it on by default and not completely mess up the drive? That would be the ideal case for me. I don't think from a protocol perspective that we're missing anything, really. And then for T10 SBC, Bill said he was going to get us parity with implicit open/close semantics from NVMe. That would be wonderful. I would like to entertain the idea of a read stream command to complement write stream.
Starting point is 00:28:38 It doesn't really make much of a difference on devices like SMR and flash drives where we just want the data as fast as possible. But in the array case, it would be nice to be able to do QoS on the read streams as well as the write streams. So that was it. Yes? My question is about storage arrays and tiering. How would people be able to specify in what tier to characterize the data? So the tiering, when we first did this, we actually had an explicit tiering. We used the group number field and the read and write commands
Starting point is 00:29:20 to just specify tiering and pretty much nothing else. But what ended up in the spec ended up being very different from that. So there's no explicit tiering, although the spec doesn't say how to interpret the values. And I define the IO classes that Linux is setting in the mode page to match the anticipated behavior. So for instance, for the file system or database log, it's
Starting point is 00:29:51 like write sequential: it's write only, we never read it unless we're doing recovery, and keep it on something fast because other I/O depends on the writes completing to the log. So there's no way you can say, I want to put this on the fastest tier. But based on the properties that I set on the I/O, saying I would like this to be read fast in the future, that's the type of thing you can communicate using the T10 descriptors.
Starting point is 00:30:23 You could then decide based on that to put it in the fastest tier. So there's a level of indirection there that makes it a little bit blurry in this case. It's a little bit easier to convey it in NVMe. So you did this modification here at the protocol level, but on the application side, how do you specify what kind of performance you want? Do I do it as part of the open call? You do it using the fadvise call. So using fadvise, you can set a property on a file handle.
Starting point is 00:30:56 And you can say, you know, I expect this to be mostly sequential. And, you know, I don't think I have the... I can show you the POSIX I/O flags, but there's just 10, 15 flags that are defined by the POSIX spec that you can set on the file handle. And we translate those into something that makes sense in the SCSI disk driver. So basically a recompile of the application is needed to be able to specify this? If you want to specify something, you have to use the fadvise call. To some extent. Actually, one thing you can do is you can,
Starting point is 00:31:30 we have the ionice command. You can set the I/O priority on an existing application. So if you set the I/O priority for your application to be idle, our backup I/O class is called the idle I/O class because it only uses idle I/O time to actually submit I/O. So if you run ionice with the idle class on your application, that will cause it to set the background hint on the SCSI device.
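The same effect can be had from inside a program; a minimal sketch using Linux's ioprio_set system call to enter the idle I/O class (the constants are redefined locally here for illustration and mirror linux/ioprio.h):

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Redefined locally for the sketch; see linux/ioprio.h for the real values. */
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_CLASS_IDLE   3
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_PRIO_VALUE(cls, data) (((cls) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
    /* Put the calling process in the idle I/O class (what `ionice -c 3` does);
     * I/O it submits afterwards can then be treated as background work. */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) == -1) {
        perror("ioprio_set");
        return 1;
    }

    /* ... issue backup or maintenance I/O here ... */
    return 0;
}
```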
Starting point is 00:31:59 And you can do the same thing with the real-time class. If you want to boost your application, you can set the real-time task priority, and that will cause that to be communicated out to the storage. So that's really the only granularity you have. The Linux I/O scheduler allows you to set a relative priority within each I/O class,
Starting point is 00:32:21 but I have no way to communicate that to the storage because there's no relative priority. There's a wrinkle in the sense that the transport protocols, Fibre Channel and SAS, actually have a priority field. And we did look into using that at some point, but that becomes challenging because, for that field, the protocol stack is usually running in the HBA firmware, so that means we need a way to talk to the HBA firmware to set that priority on a per-I/O basis.
Starting point is 00:32:51 And a lot of them don't allow you to do that, so that project didn't really go anywhere, even though in theory it would be possible to do it. Yes? Do you see any flexibility of the use with organization? With? Did Dior determine how you want your stuff? Did you see any kind of, any time you said organization, any kind of flexibility? Did you see any type of flexibility? Well, again, the SCSI spec as written, well, I should modify that, because technically the group number field is opaque, and there are certain cases under which it is defined, one of those being the I/O hinting that we're using. But relative priority for applications doesn't really exist in the spec as written.
Starting point is 00:33:51 The way it typically works, or has in the past, is that you've been setting a group number and then you've used your array management interface to say, when I see this group number, I should give this application or database relative priority. But that's how the group number field in the read/write commands was originally defined: to be outside the scope of the spec. So it was a handshake between you and your storage array vendor, what you put in there and how they reacted to it. And then a couple of years ago, the hinting stuff came about and used the same group number field in the read/write commands to have a certain meaning. So there's nothing that's preventing us from defining a new meaning and sticking it in there.
Starting point is 00:34:44 Yes? So what kernel version are we talking about? Most of this work was done on 4.1 because that's our current production kernel, and I needed something I could run database workloads on and so on. But the code itself is not upstream yet. The code itself is quite simple, and I plan on putting it upstream. Any other questions?
Starting point is 00:35:21 Yes? Is it possible to specify the I/O class for every read and write, without the...? Only by way of the I/O priority that you set on the process rather than the file. So in case I develop a system on top of the device, I know what the user is writing to me, right? But they can't communicate this to the device.
Starting point is 00:35:51 They need to do something else. So it doesn't know the meaning of this. If you're a file system, you have access to the bio flags that we use to convey this information inside the kernel. Correct. And if it's user space... Yeah. So what are you using? What kind of I/O interface are you using? Are you using FUSE, or...?
Starting point is 00:36:17 Okay. So that would be an obvious place where we would want to add those flags. And with the Oracle database hat on, we would definitely like those to be in libaio as well. The reason I haven't added them to libaio is that I mostly use Oracle ASM, which is our custom Oracle database AIO interface, for development, because I happen to be the one that wrote it, so it's my go-to thing when I need to do something quick and dirty, and it plugs in easily to the database. But from a pragmatic perspective, yes, we definitely
Starting point is 00:36:55 want those as flags in the AIO. No, they're a generic part of the bio. Actually, the hints I'm using to convey most of this information are a result of CFQ, but they're always in the bio. It's just that the other IO schedulers are not using them. Any other questions? All right. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list
Starting point is 00:37:46 by sending an email to developers-subscribe@snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
