Storage Developer Conference - #52: An Enhanced I/O Model for Modern Storage Devices
Episode Date: July 25, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
My name is Martin Petersen.
I'm a storage architect in Oracle's Linux Engineering Group,
I'm one of the authors of the Linux I/O stack,
and I'm the maintainer of the Linux SCSI subsystem.
I'm going to talk about an enhanced I/O model for modern storage devices.
And maybe what I should have said was an enhanced I/O model
for legacy applications on modern storage devices,
because it appears the topic at this conference this year is hyperscale
and cloud storage and so on, where you have lots of flexibility
to modify your application to adapt to the storage hardware. But there's a ton of old
applications out there, and that's what I'm trying to address with the changes
I've been making.
So, modern storage. For decades we have been optimizing our I/O stack to work well on old-school spinning
hard drives.
And it's worked very well. That particular design choice, optimizing for spinning media
and reducing seeks and whatnot, hasn't adversely affected any
of the other classes of storage that we're driving,
anything from USB sticks to high-end storage arrays.
Pretty much all of them have benefited from our general approach of issuing I/O to the
storage.
But we're getting to the point now where many of the devices coming out have very specific
needs in terms of being driven effectively.
And so I've been looking into what we can do to the existing I/O interfaces
to make these devices perform well and predictably.
And as it turns out, despite the fact that the media,
flash and SMR hard drives and so on, is vastly different in terms of characteristics,
the things that we need to communicate to these devices to make them perform well are pretty much the same regardless of which type of media
is backing the device. And that's lucky for us, because we would like our
I/O stack to have common semantics and be generic and not tied to a particular
type of device. And it also means we don't need to throw away decades and
decades, thousands of man-years, of file system and database development.
Now, if we're looking at optimizing the storage, there have been, and are, different approaches
in the standards groups.
One such thing was OSD, the SCSI protocol for object storage.
It never really took off in the industry.
Right now there are the zoned access commands for SMR drives that are very, very specifically
modeled on how those drives work in practice, how they're physically implemented. And they're good
for certain workloads where your application is very conducive to that type of data access pattern:
mostly reads, you rarely overwrite stuff,
and it's very rare that you actually need to go scrub or reclaim the data.
Whereas most of the legacy applications we have are inherently random read-write workloads:
databases, normal file system I/O.
And overwrites are very
frequent. So the common theme of this conference has been that we're moving to, you
know, write once, never write again, and we're moving to sequential
write workloads, because that's what SMR drives want. But we can't throw
away all the applications that need to be able to do random writes.
So somewhere, somehow, somebody will need to convert those random writes into something
that works in a way that's good for the device, typically a sequential workload.
So random writes are not going away. No matter how much we wish for that, it's not going
to happen.
So back many years ago, the storage protocols were heavily tied to the physical implementation
of the device. When you had to talk to a hard drive, you needed to know cylinders, heads,
and sectors and that kind of thing; it was very intimately tied to the implementation
of the device.
And then about 20 years ago, people decided, you know what,
this is a really bad idea.
It's hard to move drives from system to system.
You need to update disktabs and whatnot.
So let's move to a model where we use logical blocks.
The device presents an LBA range from zero to capacity
and completely hides the internal implementation of how that
data is organized on the device. And that was done in the days of hard drives and we're reaping the
benefits of that transition from the physical to the logical representation of the storage.
So today we can take a $1 USB key or a $1 million storage array and we drive them using exactly the same protocol.
That wouldn't have been possible if the physical representation had been inherent in the protocols like it used to be.
Another reason that we don't like the media-specific access protocols is that when we have an I/O request sitting in the application,
we need to send it out to a bunch of different devices at the same time.
We're increasingly moving to a model where we have a locally attached flash cache of some sort, but the same data needs to go over InfiniBand, over the network, onto a different type of storage,
or onto locally attached spinning media. So our way of doing I/O needs to have some semantic commonality
regardless of which type of storage protocol it's talking.
Also, the read-write I/O model that application programmers use
is heavily ingrained in our standards interfaces. We can't just throw
that away without breaking a ton of applications out there. The OS is already, to a large extent,
completely simulating that interface. We're faking it, but we're honoring the guarantees.
But the read-write model, as seen by the application that uses the C library to do read-write I/O,
is implemented very, very differently inside the kernel.
We don't really do I/O the way you think we do I/O.
We just present you with that interface, but we can't break that interface.
That's our promise to the application people,
and we can't change it easily without some cooperation from the storage device.
So what are the characteristics that distinguish modern types of storage like Flash and SMR from the old school legacy hard disk drives?
Well, for Flash, it turns out that one of the most important things,
here's Bill, maybe some of you were at Bill's talk,
one of the most important things on Flash devices is that you need to be able to
put data that belongs together in a logical fashion
together in the same physical media spot.
That really helps you to reduce write amplification and it helps you to avoid
doing extensive garbage collection in the future. And another thing that's important
to flash classes of devices is to be able to tell a device now is a good time for you
to do garbage collection. So those are sort of the two main characteristics:
grouping related data, and signaling idle time.
As a result of controller constraints, device constraints,
and protocol constraints, when we write a 40-gigabyte file,
we can't send that as one I/O.
That 40-gig file gets chopped up into many, many, many pieces,
and by the time it hits the device,
the notion of those pieces forming a logical unit has been lost. So we need a way to communicate to the
device all these requests that we just queued on you actually belong to the
same thing and we think you should put them together in the same physical
location to ease your workload in the future. If we move on to SMR drives, it turns out it's pretty much
exactly the same problem. If you have an SMR drive, you need to be able to
co-locate things that belong together in the same zone to ease the zone
management. Zones on an SMR drive are typically much bigger than on a
flash device, but the problem is essentially the same. You want things that belong together to be grouped together physically on the media to
ease future garbage collection.
And SMR drives, under the assumption that the drives in question have some sort of translation
layer between logical blocks and the physical zones backing those logical blocks, also need
to be able to do
garbage collection, exactly like a flash device does.
So there's lots of commonality between these two cases.
If we move on to storage arrays, the arrays also benefit from the notion of being able
to identify different IO streams from different applications.
The rationale is slightly different from why it makes sense for SMR and Flash,
but the mechanism is the same.
You want to be able to prioritize I.O. from different databases on your system
so that your test database doesn't impact the performance of your production system
or your backup jobs that you want to run in the background don't interfere at all with the performance of the system in general.
Turns out storage arrays also have the need to be able to do background tasks,
data migration to reclaim space if it's a thin provisioned device, that kind of thing.
So just like SMR and Flash, storage arrays also have a need to be able to perform background operations
and would like to know when is a good time to do so.
So to summarize, modern storage devices would like to know additional
information about each I/O. They would like to know the nature of the I/O: what
is this thing that you're sending me, and what should I do with it? They would like to know how the I/Os that you're
queuing on the device at the same time
are related, if they're related.
And they would like to know when it's a good time
to do background operations.
So first, the I/O classification.
Two years ago, Jim Williams from the database team and I
gave a presentation about I/O hinting and how it's used in the database.
The Oracle database internally uses over 70 hints to classify I/Os as they flow through the system.
And until now there's been no standards-based way for the database to communicate that out to the storage.
But we're now using a SCSI feature
to make that communication path possible.
And so we talked to the database folks and said,
okay, you have your 70 hints,
which of these make sense
and what would you like to tell the storage?
We also did two surveys
with a bunch of different application people, asking,
which hints make sense for your application?
And the result of that was, we ended up with a long, long table of many different hints,
and it was essentially impossible to try and accommodate everybody.
But there turned out to be a common pattern, namely that some I/O requests are transactional,
and it's very important for the application that they complete quickly
and that future accesses to them
are low latency.
You know, the file system journal
or the database journal.
And it's very probable
that you are going to access
those blocks in the future.
So we condensed all those long tables
into these six I.O. classes.
They're not set in stone.
It's just an example of the hints we found actually worked when communicating to the storage.
So we have these six classes:
transaction, for file system journals or database redo logs;
metadata, to identify file system metadata;
swap;
a real-time I/O class for things where you don't want
hiccups in the I/O stream;
the normal data class; and then background,
for tasks such as backups or resyncing a RAID.
So for each I.O. we send down to the storage, we classify
the I.O. with one of these six categories, and then the
storage device can decide based on that information if
it makes sense. For instance, in the case of an SMR drive to put the file system metadata
in a conventional zone rather than a sequential zone. In the case of a storage array, you
might want to pin stuff in the cache or put it in a flash tier as opposed to out on spinning
media depending on what type of classification we assign to the IO.
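As a rough sketch, the per-I/O decision described above could look something like this. The class names and the selection logic here are illustrative only, not the actual kernel implementation:

```python
from enum import Enum

# Illustrative model of the six I/O classes from the talk.
# The names, values, and precedence order are assumptions,
# not the kernel's or T10's actual encoding.
class IOClass(Enum):
    TRANSACTION = 1  # file system journal, database redo log
    METADATA = 2     # file system metadata
    SWAP = 3         # swap I/O
    REALTIME = 4     # latency-sensitive streaming I/O
    DATA = 5         # normal data I/O
    BACKGROUND = 6   # backups, RAID resync

def classify(is_journal=False, is_metadata=False, is_swap=False,
             realtime=False, background=False):
    """Pick a class based on who issued the I/O and why."""
    if is_journal:
        return IOClass.TRANSACTION
    if is_metadata:
        return IOClass.METADATA
    if is_swap:
        return IOClass.SWAP
    if realtime:
        return IOClass.REALTIME
    if background:
        return IOClass.BACKGROUND
    return IOClass.DATA
```

Note that the classification is keyed off the issuer's intent, never off the block contents, which matters for the backup-versus-database example discussed later.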
In T10, in this case... for that, there's mostly no differentiation in the I/O between reading and writing, am I correct?
Because it's, you know, based on the specific I/O characteristics; for example, if you want to get frequent
read access instead of mixing them, then the device might do something else.
Right.
So the way it's defined in T10 here is that the actual hints you send to the device are
coded in the group number in the read and
write commands.
And it's an index into a mode page that looks like this where you can define different characteristics
for the I.O.
You can say, you know, I'm going to access this very frequently.
I'm going to be mostly sequentially writing and rarely reading if it's a file system transaction
log for instance. And you could set the subsequent I.O. hint to indicate that other I.O. on the
system is backed up behind this I.O. so you want this I.O. to complete quickly.
That presumes that the device is doing internal scheduling of some sort but the
hint is there. So we map in the Linux case most of these hints that we set on the I.O. are actually
discrete hints.
They're flags that describe a certain thing.
But given how SCSI specified this to be flexible and expandable, we have to map that into a
fairly short list of static hints.
So we can't be quite as expressive in SCSI as we can in NVMe because in NVMe each of
these are distinct flags.
And you can see that here, here's the NVMe DSM field which is the corresponding thing
in the NVMe spec.
And you can set access latency, access frequency, and so on here.
So these correspond pretty much, not entirely, but almost one-to-one with the descriptor
in T10. So we have these in both SCSI and NVMe, we have a way to communicate this information
to the device.
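To make the NVMe side concrete, here's a hedged sketch of packing that DSM byte. The bit positions follow my reading of the NVMe spec (access frequency in bits 3:0, access latency in bits 5:4, sequential request in bit 6, incompressible in bit 7); verify against the current spec revision before relying on them:

```python
def pack_dsm(access_freq=0, access_latency=0,
             sequential=False, incompressible=False):
    # Pack the NVMe Dataset Management (DSM) byte carried in
    # read/write commands. Bit layout per my reading of the
    # NVMe spec; treat as an assumption, not gospel.
    assert 0 <= access_freq <= 0xF      # 4-bit frequency code
    assert 0 <= access_latency <= 0x3   # 2-bit latency code
    return (access_freq
            | (access_latency << 4)
            | (int(sequential) << 6)
            | (int(incompressible) << 7))
```

So a transaction-log write might set a high access frequency, low latency, and the sequential bit, all in one byte per command.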
There are two flags that are different between read and
write in NVMe, I think.
Something to that effect.
But they're almost the same.
OK.
OK.
So the I/O classification that we set is on a per-I/O basis.
We tried static labeling when this originally came
about in T10 and T13.
The intent was to make it apply statically to an LBA range.
But there are several problems with that.
We actually tried it, and we got horrible results, because a common use case is: you have a database
file, you're running your database application doing random read-write to that file, and
at the same time you start your backup job, which is doing a sequential read of the same
blocks. So the classification is a property of the issuer of the I/O rather than the contents
of the LBAs.
Another thing we saw was with an MP3 collection.
Somebody had set the real-time streaming property on his MP3 collection so that he wouldn't
get hiccups when he was playing MP3s.
But every time his
backup ran, the system came to a grinding halt, because the drive thought he was streaming
his entire MP3 collection. So to us, the classification of the nature of an I/O is a per-I/O property,
and it's a result of who issued the I/O and the intent they had in issuing it, rather than an inherent
property of the contents of that file.
At the implementation level, we tie into posix_fadvise, which is a system call an application
can use to set certain properties on a file.
So an application can say, this file is something, and it will set, like, write-sequential;
there are some distinct flags defined, and
we'll map those into the relevant I/O class in the case of SCSI, or set the flags on an
NVMe command.
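For illustration, here's a minimal user-space example of that posix_fadvise hook, using Python's standard binding. The advice values are standard POSIX ones; how a given kernel maps them to SCSI hints or NVMe flags is up to the implementation, and the kernel is always free to ignore the advice:

```python
import os
import tempfile

# Create a scratch file standing in for, say, a journal.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"journal data")
    # Declare that we expect to access this file sequentially.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    # Declare that we don't expect to reuse the data soon.
    # (Offset 0, length 0 means "to end of file".)
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.unlink(path)
```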
So that was the classification: why we are doing the I/O, and how we communicate that information
to the device.
Now on to the affinity.
So I mentioned earlier
that when we're writing a 40-gig file, we're chopping it up into 128K or 512K
pieces, and the notion of it as a logical unit goes away. So with the
affinity, we want to be able to convey to the storage device that all these things
belong together and should go in the same zone or in the same write channel.
In T10, we use streams, with modifications. I was very glad to hear that Bill is planning parity between T10 and NVMe here,
because we're actually not explicitly opening and closing the stream IDs in T10.
We bent the spec a little bit for our internal implementation.
And in NVMe, we're using directives.
Right now, there's been some kerfuffle in NVMe
about how to go about implementing streams.
And two groups of companies disagree strongly
about how to do it.
I tried to stay out of the politics very unsuccessfully, I think.
But the reality is we can get pretty much all we need with the streams
and so I'm probably going to change my implementation from
the current stuff to what ended up being the final streams proposal.
The stream ID we set on a per-file basis is based on the inode number of the file and
the partition number, hashed together, to try and create something unique
that we can send to the storage. The affinity in SCSI is communicated in the stream ID field here,
and there's a similar field in the directives in NVMe
to convey the information about which file this is.
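As an illustration, a 16-bit stream ID derived from the inode and partition number might look like this. The multiplicative mixing constants are arbitrary values I picked; this is not the kernel's actual hash function:

```python
def stream_id(inode, partition):
    # Fold the inode and partition numbers into a 16-bit
    # stream ID, as the talk describes. The constants are
    # common multiplicative-hash mixers, chosen here purely
    # for illustration.
    h = (inode * 0x9E3779B1) ^ (partition * 0x85EBCA6B)
    return h & 0xFFFF
```

The point is that the ID is deterministic per file, so all the chopped-up pieces of one file carry the same tag, while different files usually land on different tags.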
So it's a 16-bit value,
and when we started looking at this,
I thought, oh my God, 16-bit, that's nothing,
because I have a million files on my laptop, right? How am I going to work with this?
But in reality, the hash collisions that we see with the 16-bit value depends on the workload.
And we're sort of saved by the fact that the devices will have 16 write channels or they will have 128 concurrently open zones.
And both of those numbers are significantly lower
than the 64K that we can communicate to the device.
On our end, we only really see hash collisions
with certain workloads; obviously, if you write more than 64K files,
you're more likely to have a hash collision.
But the collision also needs to happen within a short window of time,
because you only care about collisions while that I/O is queued on the device.
So in reality, when we simulated this 64K 16-bit stream ID,
it didn't turn out to be as bad as it seemed. And we did a bunch of
checks on a variety of systems running a variety of applications to identify how
many files are open and how many files are open for writing. So it turns out on
a typical system you have somewhere between 1k and 100k files that are open
at the same time. Interestingly enough it needs to be a really big server to compete with your web browser in terms of number of files open. Your web browser is often the craziest example of how many files can a process possibly have opened at the same time. So your average desktop and a very, very high-end server are up near the 100k mark,
whereas a lot of server workloads are actually down in the low thousands.
As far as number of files that are open for write, which is really what we care about
when we communicate the stream ID out to the device,
it's typically between 100 and 10k on a large server. So 10k is still significantly lower than the 64k
namespace that we have to work with.
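The "collisions only matter within a short window" argument can be checked with a quick Monte Carlo sketch. The file counts are the ones quoted above; everything else (uniform hashing, trial count) is an assumption:

```python
import random

def collision_prob(n_files, space=1 << 16, trials=500, seed=0):
    # Estimate the chance that n files written concurrently
    # collide somewhere in a 16-bit stream-ID space, assuming
    # the hash behaves like a uniform random draw.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        ids = [rng.randrange(space) for _ in range(n_files)]
        if len(set(ids)) < n_files:
            hits += 1
    return hits / trials

# With ~16 concurrent writers, collisions are rare; with
# thousands (e.g. an OS install), the birthday effect makes
# them near-certain.
```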
So if we combine the I/O classification that we get from the hinting with the affinity that we
get from the stream ID, we are able to convey to the storage device the nature of the I/O,
and how the outstanding I/Os sitting in the queue to the device
are related to each other.
For devices that are big and have internal scheduling,
such as storage arrays with large caches and whatnot,
the separation of the I.O. streams also helps to provide QoS.
And on devices that really don't like frequent overwrites,
such as SMR and flash devices, they have natural enemies,
and those natural enemies are circular logs, metadata, and swap files.
They're some of the worst things you can throw at a flash device or an SMR device.
And being able to identify those
allows the device to make smart data placement decisions
when we consistently tag those particular types of I.O.
And in case of arrays, again, with internal scheduling,
it allows you to do I.O. prioritization
so that your backup job won't steal IOs from
any of the business critical applications running on the array.
So the last component in my three-pronged approach to making devices perform better is background
operations. All the device types I've talked about have a need to do background operations for whatever reason:
flash management, or reclaiming space on a thin-provisioned device, or destaging zones on an SMR device.
Devices typically do this when they're idle.
Devices would like us
to tell them when it's a good time to do it. We have no idea, because we don't do
the I/O; the applications do the I/O. So what I've done, just like we've
done for our discard/TRIM/UNMAP infrastructure in Linux, is create a
protocol-agnostic implementation of background control,
so an application can call an ioctl to say, now's a good time to do it, and we'll translate
that to the Background Control command in SCSI. We'll see where it lands in NVMe,
but we'll issue the relevant command in NVMe when that spec is finalized.
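The host-device handshake can be modeled as a toy, purely for illustration: the host grants the device a budget of background work when it knows the system is idle. None of these names correspond to the real SCSI Background Control interface or the Linux ioctl:

```python
class Device:
    # Toy model of the background-control handshake: the host
    # tells the device when it is idle, and the device defers
    # garbage collection until then. Entirely illustrative.
    def __init__(self):
        self.pending_gc = 5  # units of deferred background work
        self.done_gc = 0

    def background_ok(self, budget):
        # Host says: you may do up to `budget` units of
        # background work now. Returns how much was done.
        work = min(budget, self.pending_gc)
        self.pending_gc -= work
        self.done_gc += work
        return work
```

The real value of the interface is that the host, which sees the application workload, decides the timing, while the device keeps full control over what the background work actually is.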
So results.
So we've been doing this for a while on both hybrid drives and on storage arrays and have
gotten really good results and we would like to move it closer to flash and SMR drives
in the future.
And the benefit of this approach, as opposed to using ZBC or ZAC directly
for SMR drives, or using something like LightNVM, where you run parts
of the flash translation workload on the OS, is that
this is a common interface that is completely agnostic to the physical
storage medium you're talking to. We can use exactly the same plumbing
throughout the I/O stack, and facing the applications, regardless of whether we're
talking to a flash device, an SMR drive, a storage array, or anything else for
that matter. So it's a generic
infrastructure for communicating these things to the storage through the common
I/O stack. So as I mentioned, we've gathered some results on a simulated target.
We simulated a 10-channel flash device and, I think, a 64-zone SMR drive, trying to establish
the co-location rate: how good are we at tagging things so that they end
up in the right zone, given that the device is constrained to handling 10 different
files or logical groupings at a time.
And our co-location rates are really good.
As I said, I was worried about the 16-bit hash being a constraining factor.
But in reality, there are cases, for instance when you do an OS install,
where you create tens of thousands of files on the system at the same time.
That's a case where you're very likely to hit a collision.
But once the system is installed and up and running with a normal workload,
it's very, very rare that we get a hash collision.
And it's also, depending on the workload, reasonably rare
that we exceed the capacity of what the device can do.
So what I'm saying is that, again depending on workload, it's relatively rare that we're doing writes
to more than 16 files at the same time. What will often happen is that we
fill the entire queue with I/O from a handful of files.
But obviously it depends very heavily on your particular workload.
On storage arrays, the hinting and the stream IDs really help the array do cache management.
For instance, you don't want your backup job to be cached. You want that sequential read
of your entire database files to just go through the cache and not interfere with what
the array is doing for the runtime workload.
So storage arrays work really well.
We're looking to engage with disk drive vendors and the flash drive vendors to implement this
feature.
And I already mentioned the streams versus LBA affinity thing in the spec.
I think from a functional perspective there's really no difference.
I think the main difference is not so much the wording;
it's more the anticipated usage model.
Can we get to a point where we can just turn
it on by default and not completely mess up the drive? That would be the ideal case for
me. I don't think from a protocol perspective that we're missing anything really.
And then for T10 SBC, Bill said he was going to get us parity with the implicit open-close semantics from NVMe.
That would be wonderful.
I would like to entertain the idea of the read stream command to complement write stream.
It doesn't really make much of a difference on devices like SMR and flash drives
where we just want the data as fast as possible.
But in the array case, it would be nice to be able to do QoS on the read streams as well
as the write streams.
So that was it.
Yes? My question is about storage arrays and tiering. How would people be able to specify in what tier to characterize the data?
So the tiering, when we first did this, we actually had an explicit tiering.
We used the group number field and the read and write commands
to just specify tiering and pretty much nothing else.
But what ended up in the spec
ended up being very different from that.
So there's no explicit tiering,
though the spec doesn't say how to interpret the values.
And I define the I/O classes that Linux sets in the mode page
to match
the anticipated behavior. So, for instance, for the file system or database log: it's
write-sequential, it's write-only, we never read it unless we're doing recovery,
and you keep it on something fast, because other I/O depends on the writes completing to the
log. So there's no way you can say,
I wanna put this on the fastest tier.
But based on the properties that I set on the I/O, saying,
I would like this to be read fast in the future,
that's the type of thing you can communicate
using the T10 descriptors.
You could then decide based on that to put it in the
fastest tier. So there's a level of indirection there that makes it a little bit blurry in
this case. It's a little bit easier to convey it in NVMe.
So you did the modification here at the protocol level, but from the application,
how do you specify what kind of performance you want?
Do I do it as part of the open call?
You do it using the fadvise call.
So using fadvise, you can set a property on a file handle.
And you can say, you know, I expect this to be mostly sequential.
I can show you the POSIX I/O flags; there are just 10 or 15 flags defined by the POSIX spec that you can set on the file handle.
And we translate those into something that makes sense in the SCSI disk driver.
So basically the application needs to be recompiled to be able to specify this?
If you want to specify something,
you have to use the fadvise call.
To some extent.
Actually, one thing you can do is,
we have the ionice command.
You can set the I/O priority on an existing application.
Our backup I/O class maps to the idle I/O class,
because it only uses idle I/O time to actually submit I/O.
So if you run ionice with the idle class on your application,
that will cause it to set the background
hint on the SCSI device.
And you can do the same thing with real time.
If you want to boost your application,
you can set the real-time task priority,
and that will cause that to be communicated out
to the storage.
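From the shell, that looks like util-linux's ionice, assuming it's installed; the backup command itself is just an example workload:

```shell
# Start a backup in the idle I/O scheduling class (class 3) so
# it only gets disk time when nothing else wants it; per the
# talk, idle-class I/O is what picks up the background hint.
# (Illustrative backup command, commented out:)
#   ionice -c 3 tar czf /backup/home.tar.gz /home

# Harmless demonstration: run a no-op in the idle class, then
# show the I/O class of the current shell.
ionice -c 3 true
ionice -p $$
```

Boosting the other way (real-time class, `ionice -c 1`) typically requires root privileges.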
So that's really the only granularity you have.
The Linux I/O scheduler allows you
to set a relative priority within each I/O class,
but I have no way to communicate that to the storage,
because there's no relative
priority.
There's a wrinkle in the sense that the transport protocols, Fibre Channel and SAS, actually
have a priority field.
And we did look into using that at some point, but that becomes challenging because, for that
field, the protocol stack is usually running in the HBA firmware,
so we'd need a way to talk to the HBA firmware to set that priority on a per-I/O basis.
And a lot of them don't allow you to do that, so that project didn't really go anywhere,
even though in theory it would be possible to do it.
Yes?
Do you see any kind of flexibility in how this gets used? [question partially inaudible]
Well, again, the SCSI spec as written... well, I should qualify that, because technically the
group number field is opaque, and there are certain cases in which it's defined; one of those
is the I/O hinting that we're using.
But relative priority for applications doesn't really exist in the spec as written.
The way it typically works, or has worked in the past, is that you set a group number, and
then you used your array management interface to say: when I see this group number, I should give this application or database relative priority.
That's because the group number field in the read-write commands was originally defined to be outside the scope of the spec.
So it was a handshake between you and your storage array vendor: what you put in there and how they reacted to it. And then a couple of years ago, the hinting stuff came about
and used the same group number field in the read-write
commands to have a certain meaning.
So there's nothing that's preventing us from defining a
new meaning and sticking it in there.
Yes?
So what kernel version are we talking about?
Most of this work was done in 4.1 because that's our current production kernel.
And I needed something I could run database workloads on and so on.
But the code itself is not tied to that.
The code itself is quite simple,
and I plan on putting it upstream.
Any other questions?
Yes?
Is it possible to specify the I/O class for every read and write,
without the...
Only by way of the I/O priority that you set on the process
rather than the file.
So in case I develop a system on top of the device,
I know what the user is writing to me, right?
But they can't communicate this to the device.
They need to do something else.
So it doesn't know the meaning of this.
If you're a file system, you have access to the bio flags
that we use to convey this information inside the kernel.
Correct.
And if it's user space...
Yeah. So what are you using? What kind of I/O interface are you using?
Are you using FUSE, or...?
[inaudible]
Okay. So that would be an obvious place where we would want to add those flags.
And with the Oracle Database hat on, we would definitely like those to be in libaio as well.
The reason I haven't added them to libaio is that I mostly use Oracle ASM,
which is our custom Oracle Database AIO interface, for development,
because I happen to be the one that wrote it, so
it's my go-to thing when I need to do something quick and dirty,
and it plugs in easily to the database. But from a pragmatic perspective, yes, we definitely
want those as flags in the AIO interface. No, they're a generic part of the bio.
Actually, the hints I'm using to convey most of this information are a result of CFQ,
but they're always in the bio.
It's just that the other IO schedulers are not using them.
Any other questions?
All right. Thank you.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure to join our developers mailing list
by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.