Storage Developer Conference - #48: Optimizing Every Operation in a Write-Optimized File System

Episode Date: June 22, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 48. Today we hear from Rob Johnson, Research Assistant Professor with Stony Brook University, as he presents Optimizing Every Operation in a Write-Optimized File System from the 2016 Storage Developer Conference.
Starting point is 00:00:51 My name is Rob Johnson. I'm somewhere in there. I'm from Stony Brook University. And I have three goals with this talk. One is, of course, I'm just excited to talk about BetterFS, which is a high-performance file system that we've been building. But I also hope to talk a little bit
Starting point is 00:01:13 about write-optimized data structures, which we use inside this file system, and explain what they are, what their performance opportunities are, and some of the challenges. And then the third goal is to show so to use better FS as an example of how to design an application or a system on top of right optimized data structures so that you
Starting point is 00:01:33 take advantage of their strengths and you avoid pitfalls in the design of right optimized applications okay so file systems if you would try to build a general purpose file system, there's a lot of different performance demands that you've got to satisfy. Some applications really want great, you know, sequential read and write performance, some want really good random writes, some want file and directory operations like renames and deletes to go really fast, some applications want to do recursive scans like a backup. You've got metadata updates to files and directories. So there's a lot of different things that applications need. And building file systems that excel at all of these has been a long
Starting point is 00:02:16 challenge. So let's take EXT4 as an example. EXT4 is a classic design of what we call an update in place file system. So it tries to lay out the content of files in a logical way with good locality and then when you update those files it updates it in place so it maintains that locality. And so as a result on a disk which has like about 125 megabytes per second of bandwidth, when EXT4 is doing sequential reads, it's able to get about 110 megabytes of bandwidth. So it's basically getting almost all of the resource out of the disk. But when you go to do random writes,
Starting point is 00:02:57 all those random writes require seeks, and so the performance is terrible. And in case you're wondering, no, I did not leave the line off the graph. You just can't see it. And so when we look at these performance objectives, we see that update-in-place file systems are good at sequential reads and writes and file and directory renames, or at least EXT4 is. But things like random writes and metadata updates,
Starting point is 00:03:21 which translate often into random writes, are not so good. And we'll also see that recursive scans are not as good as they could be, but for a different reason. And the other design extreme are log-structured file systems, which are optimized for random writes because all these random writes just are appends to a log. And so log-structured file systems translate that random write workload into a sequential write workload. And so they're really, really fast at random writes. But then the problem is that when you go back to read that data later on, it's just in the log in whatever order it happened to get written in. And so if you're trying to read your file sequentially,
Starting point is 00:04:00 you may have to be seeking all over the disk to get the blocks in whatever order and location that happen to be stuffed into the log. And so log-structured file systems are really just a different trade-off between these performance objectives. They can be quite bad at sequential reads, but they are really good at sequential and random writes. They can also be good at file and directory operations. Just for the same reason that they might not be good at sequential reads, they can also be good at file and directory operations. Just for the same reason that they might not be good at sequential reads,
Starting point is 00:04:26 they can also be bad at recursive scans. So there have been these trade-offs in file system design for many, many years. One is the one I just described between sequential reads and random writes, where update-in-place file systems are optimized for reads and log-shrifted file systems are optimized for the random writes. But this is not the only one. I'll tell you later on about a trade-off
Starting point is 00:04:50 between rename performance and directory scans, which is exemplified by inode-based file systems versus full path indexing file systems. So these trade-offs have been very enduring in file system design. And there's probably more, but these are the two I'm going to talk about today. And the main idea of this talk is that by using new data structures, randomized data structures, we can build file systems
Starting point is 00:05:18 that completely overcome these long-standing trade-offs. And so this is also the outline of the talk. So I'm going to begin by describing the background on right optimized data structures and then I'll talk about how betterFS is designed to exploit those right optimized data structures and then we'll look at performance evaluation and dive a little deeper into some of these trade-offs. Okay, so route-optimized data structures are a new class of data structures that were discovered starting in the mid-90s with the LSM tree. And the LSM tree is a data structure, you may have heard of it before,
Starting point is 00:06:03 it's in a lot of web-scale databases like HBase and Cassandra and LevelDB. There's been a steady stream of other ones since then, BD Epsilon trees, cache oblivious lookahead arrays, xdict, and so on. All these data structures are key value stores, so they're more or less like a drop-in replacement for a B-tree. And the main feature they have is that they can do inserts or updates and deletes orders of magnitude faster than a B-tree. But point queries are basically as fast as in a B-tree. So they don't get any faster, but they don't get any slower asymptotically either.
Starting point is 00:06:47 And betterFS uses B to the epsilon trees, which is where it gets its name. And so I'm going to teach you B to the epsilon trees as an example of a broad optimized data structure and that'll help us understand why they have this neat
Starting point is 00:07:03 performance profile, and then we'll they have this neat performance profile. And then we'll look at what that performance profile, the challenges and opportunities it creates. But before we do that, let me just give you a quick, simple computational model that we're going to use. So this is called the Disk Access Machine Model. machine model. And it's a very simple model of a computer for analyzing algorithms and data structures that are I.O. bound rather than CPU bound. And so it's a really, really simple model and it's surprisingly effective given how simple it is. So in this model we have a computer with a RAM of size M and the units here are kind of vague. Typically, M is going to be M things, so M key value pairs. And we have a disk, and in order to operate on any piece of data, all the data starts on disk, you have to shuffle it into RAM,
Starting point is 00:07:58 and you can do that in blocks of size B. Of course, if the RAM is full, then you either have to just kill a clean page or write back a dirty page to disk. And since we're not worried about computation, we're only concerned about the IO efficiency of our algorithms, the performance metric in this model is just how many block transfers do you have to do to get your job done. So we don't care about if you've got some complicated data structure in this thing, it doesn't matter. One IO.
Starting point is 00:08:28 Just load it in memory, do whatever you need to do to figure out what is going on inside there. And so these are going to be the parameters that appear in our performance evaluations, B and M, and the size of the data set, M. Now I know the model has deficiencies.
Starting point is 00:08:47 Like, for example, it assumes that every I-O has the same cost. Of course, on a disk, two I-Os that are adjacent are typically cheaper than two that are very far apart. But we'll just ignore that stuff for now and charge ahead with this model. OK, let me quickly review a B-tree in this model. So a B-tree is a search tree in which each node is sized
Starting point is 00:09:16 so that it fits in one block of storage. And so that means that a node holds B thingies, or a block holds B thingies. That means we can have B pointers to our children and the B pivot keys so that we know where to go when we're descending in the tree searching for an element. And since the fan out of the tree is roughly B, that means the height of the tree is going to be log base B of N, where N is the number of elements in the tree. And so that means that a lookup is basically what we have to do, we have to just walk a
Starting point is 00:09:53 root to leaf path, and that's going to be log base B of N IOs. And an insert is the same, because the way you do an insert in a B tree is you just search for the leaf, and then you insert it there. So there's one extra iota to write it back, but that's not really a big deal. It's just a log-based B event. So in a B-tree, it's interesting to point out that the cost of lookups and inserts
Starting point is 00:10:18 are basically the same. Now let me show you the B to the epsilon tree. This is a right optimized data structure. It's a variation on a B tree. So again, we're going to have nodes of size B, but we're going to use that space very differently. We're only going to have a fan out. Here I'm doing epsilon equals a half. We're only going to have a fan out, here I'm doing epsilon equals a half,
Starting point is 00:10:45 we're only going to have a fan out of square root of b. So the tree is going to be a little bit taller as a result. But what that means is that we only need to use square root of b of the space to store our pivot keys and pointers to the child nodes. And square root of b space is basically compared to b, it's nothing. Like if b were a million, square root of b is like 1,000. So we've got 999,000 space left over. What do we do with it?
Starting point is 00:11:15 It's going to be buffer space. And keep in mind, this buffer, a common confusion we see is people think this is like an in-memory cache. When this node is in memory, the buffer is in memory. But if this node gets evicted to disk, the buffer goes with it. It's part of the node. And since the fanout is now square to B, then a tree with N items in it will have a height of log base square to B of N.
Starting point is 00:11:45 But if you remember your high school logarithms rules, that's basically two log base B of n. So in terms of asymptotics, that two is just going to disappear into the constants of the big O notation. So the height is still log base B of n asymptotically. So let me tell you how we do an insertion into the B to the epsilon tree. Whenever you insert a new item,
Starting point is 00:12:10 you just put it in the buffer at the root. Insert another one, put it in the buffer at the root. Put them in the buffer at the root. And just keep doing that until the buffer fills up. So obviously that's pretty fast. You don't have to do this root-to-leaf walk. And when the buffer fills up, then you just flush those elements down
Starting point is 00:12:32 to the buffers of the children. And if this buffer is now full, then we might have to do another flush. There could be some cascading flushes going on here. But that's the insertion algorithm in a B to the epsilon tree. When we flush nodes from here to here, we always flush that, not nodes, elements. When we flush elements from here to here, we always flush them along the path that you would follow if you were searching for that element.
Starting point is 00:13:00 And that means later on, when we're searching for this element, as we do our root-to-leaf walk as part of the query, we will pass this node, and we can look inside its buffer and go, aha, there it is. So searches are still going to work correctly. Let's analyze the cost of doing an insert into this data structure. Well, each item might get written to disk once per level of the data structure. And we know there's log base B of n levels to the data structure. But whenever we do a flush, we have to do square root of B IOs to load in all these
Starting point is 00:13:41 children and flush everything down and write them back. So it's order square root of b. But in the process, we get to move b minus square root of b things down. So there's this huge batching factor. In fact, on average, for each I.O. that we perform, we move b over square root of b items
Starting point is 00:14:04 down one level. And so as a result, the IO cost per insert is the number of times each item gets written to disk divided by the batching factor. So every time we write something to disk, we actually write square root of B things to disk
Starting point is 00:14:20 but they're down. And so now take a moment and stop and think about that compared to the B-tree. The B-tree insertion cost was this log base B event. Here we now have an insertion cost of log base B divided by square root of B. If B is like a thousand, this is 30 times faster. If B is 100,000, this is 300 times faster. And when you actually go and implement this data structure and benchmark it, these numbers bear out. You really will find that you're one, two, sometimes three orders of magnitude faster
Starting point is 00:14:58 than a B-tree. Lookups are essentially the same. You just have to do a root to leaf path, you just have to check the buffers along the way. And range queries, I want to point out, range queries when you say I've got key A and key B, tell me everything in the database that's between them. And essentially that, you just have to do a query for the first item and then you can read the items off of disk, essentially at disk bandwidth.
Starting point is 00:15:24 So those are the three takeaways from this data structure. item and then you can read the items off of disk essentially at disk bandwidth. So those are the three takeaways from this data structure. The inserts are orders of magnitude faster than in a B-tree. Queries are, point queries are the same speed as in a B-tree and range queries run at disk bandwidth. So these are the performance strengths of a write optimized data structure like the B to the epsilon tree. But this strength actually does create something that we call the search insert asymmetry. And the problem here is inserts got way faster, but point queries didn't get any faster. And so now, if you have a workload where you're doing, for each insert,
Starting point is 00:16:09 you proceed that insert with a query, then you're going to be bottlenecked on the query. So, like, take this example workload. You know, in a B-tree, if you wanted to add $10 to my bank account balance, which I hope you do, then you would have to query for the old balance, then you'd have to compute the new balance, and then you'd have to insert the new balance. Now, in a B to the epsilon tree, this is lightning fast.
Starting point is 00:16:31 This is computation, but this is no faster than it was in the B tree. And so even though this data structure supposedly offers inserts orders of magnitude faster, maybe, maybe you'll get a 2x speedup, and probably not even that. So this is the search insert asymmetry and it's something that you have to keep in mind when you're designing your applications or other systems on top of a right optimized data structure. And there's a really cool trick
Starting point is 00:16:59 that data structures can provide like a right optimized data structure like the bdepsilon tree can provide to applications to help them overcome the search-insert asymmetry. And this trick is what we call an upsert, which is an unfortunate name collision with this upsert concept from the SQL database world. It's something slightly different.
Starting point is 00:17:17 An upsert is essentially a message that you insert into the tree to encode the change that you want to make. So, for example, if I want to add $10 to my bank account balance and my bank account balance is $5 stored down in here in some leaf, I don't want to do a query. Queries are slow. So what I do is I just put a message into the tree saying, add $10 to Rob's bank account balance. And this can be done as fast as an insert. So I can do tens to hundreds of thousands of these per second.
Starting point is 00:17:48 And this message is like a delta now. It's not an actual value. It's a delta on an old value. It'll eventually get flushed down to this leaf. When it reaches the leaf and meets up with its original message, they'll get merged together, and the new value will be written out to disk. And in the meantime,
Starting point is 00:18:04 when someone queries for my bank account balance, what they do is they do a root-to-leaf path walk, find the old value, and then they walk back up applying these deltas to the result before they return the actual result. And so even though it may take a while for this delta to get merged in with the old value, from a semantic point of view, the database or the key value store is updated instantaneously. So upserts are a really
Starting point is 00:18:31 powerful technique for overcoming the search-insert asymmetry in write-optimized data structures, and they work particularly well in BDF's entries. Okay. That's more or less the background I wanted to go over on right-optimized data
Starting point is 00:18:48 structures. We're actually going to talk about file systems in a minute. But just to wrap it up, point queries are as fast as in a B-tree. Inserts, upserts, and deletes. Deletes are just implemented like an upsert, just delete the message. They're very fast, able to do tens to hundreds of thousands per second.
Starting point is 00:19:04 And range queries, the important thing here is they can have a disk bandwidth. And so what that means is that when we design systems on top of data structures, we want to as much as possible avoid point queries and map as many application or higher level system level operations onto inserts, deletes, upserts, and range queries. And so that's what the design schema of BetterFS tries to accomplish. So let me tell you how BetterFS works.
Starting point is 00:19:37 So what it does is it has to map file system operations onto key value operations by storing file and directory data in a right optimized data structure we use the BDFs on tree. In fact we have two indexes one of them is the metadata index and the metadata index essentially stores well your file metadata and it maps full paths within that file system to information that you might call like struct stack so you know the ownership of the file the last time it was modified the size of the file permissions on the file stuff like that and then the data index is where we store file
Starting point is 00:20:16 contents and so the data index maps keys which are paths, and then the block number within the file to the actual 4K block of data. The reason we have these two indices is because the metadata index, both of them are sorted in such a way that when you do a recursive directory scan or a directory listing, it translates into a range query in the index. If you're doing something like a find, which doesn't look at the content of files, that will be a range query in the metadata index.
Starting point is 00:20:50 If you're doing a grep, that'll be a range query in the data index. And so the implications of this design, since we don't use inodes, we actually index everything by full path, is that we always maintain things in directory traversal order so we have really fast directory scans. And also in the data index, the file data is laid out in a logically sequential order as well, so we should have good sequential read performance.
Starting point is 00:21:17 So now that I've shown you the schema for betterFS, it's relatively straightforward to just figure out how each high-level file system operation gets mapped down to a key value store operation. So file reads are just range queries, writes are inserts, or actually we use upserts. If an application does a blind write of like four bytes to a block, then rather than reading the old block, we just use an upsert and just stick the new value for those four bytes into the tree. Metadata updates are upserts, readers are range queries, directory operations are inserts and deletes, and then the two problematic operations are unlink and rename. Since betterfs is 0.1, indexes everything by full path and has an entry in the key value store for each block
Starting point is 00:22:07 of the file. When you delete a file, we have to issue deletes for each block of that file. And you're going to see this is very painful in a minute. And also, if you rename a directory, the way betterfs 0.1 implements that is internally it just copies a directory to the way betterfs 0.1 implements that is internally it just copies a directory to the new name recursively. So these are going to be very slow, but we're going to fix them. Because of this mapping, we're going to have really fast metadata updates, we'll have really
Starting point is 00:22:37 fast directory traversals, and we're going to have a really big problem here that we'll fix later. All right, so that's it. That's better FS 0.1 in a nutshell. Let me stop and do a performance sanity check to see if we're actually getting anything out of this
Starting point is 00:22:54 construction. So, going to fix later means... In this talk. You have no idea how to do it until you get new graduate students oh no no no no okay so so so I mean this this is this is this is our fast 2015 paper yes and we're gonna we fix it in our fast 2016 paper and there's a lot of moving parts going on here. So I decided rather than try to just present the current version,
Starting point is 00:23:31 I want to build it piece by piece so we can focus on one part of it at a time. Yeah? You have a lot of upsurge staged through the tree. Does that imply there's a bunch of pending computation has to get done when you walk through and reverse them all? Yes. So the question is, if you have a lot of upserts staged in the tree, might you just be delaying computation and then suddenly it comes back to bite you in the end, right?
Starting point is 00:23:56 So there's some heuristics in the BD Epson tree where if it sees that you are applying the same upsert message over and over again, then even if it's not really worth the I.O. to flush that message down, it'll go ahead and flush that message down. So it sort of bounds the amount of computation you waste on reapplying upcerts in the meantime before it reaches the leaf. I think since then we've come up with some better heuristics, but we haven't implemented them yet.
Starting point is 00:24:25 But that's a really good question. Okay, so let's look. So you know one of the things we were hoping to get was really good random write performance. So this is a benchmark where we do a thousand four byte writes that random offsets in a file, in a 1 gigabyte file. We're comparing betterFS 0.1 to butterFS, exe4, xfs, and zfs. The y-axis is the time required to complete the benchmark. Things are already looking good here, where this is the time where betterFS required to do it, and that's the time that everyone else required to do it. But notice this is actually a log scale.
Starting point is 00:25:07 So we actually completed the benchmark. BetterFS completes the benchmark in about 0.17 seconds, whereas all the other file systems took over 10 seconds to do this random write benchmark. And so the takeaway here is that BetterFS is successfully exploiting the insert performance of the BDF slotting tree to improve random write performance. Here's another benchmark for creating millions of files in the file system as rapidly as
Starting point is 00:25:37 possible. So this is an abalance directory tree with a fan out of 128. And what we've plotted here is the cumulative throughput in terms of files created over time. So this is like the throughput per second after we've created the millionth file. And we're comparing with the same file systems. And so betterFS is up here, where the throughput is so higher is better but again note this is a log scale which means that it's about 10x faster than the next closest file system and about 100 next faster than other than the other file systems and some of the file systems
Starting point is 00:26:16 didn't even complete the benchmark so again we're able to exploit BD Epsilon Tree insertion performance to improve these operations. Sequential IO, one of the advantages of the write optimization structure like BD Epsilon Tree is that you should be able to do range queries at disk bandwidth. And so sequential file reads in our schema translate to range queries in the tree. So we would hope that we're getting pretty close
Starting point is 00:26:43 to disk bandwidth. And in fact, we are. Maybe we'd like to engineer this to get a little bit closer. But about 80 megabytes per second versus 110, there's not a fundamental algorithmic problem here. It's just some engineering work. So we're getting pretty close to disk bandwidth. Now, the problem with writes is we
Starting point is 00:27:00 should also be able to get sequential writes at disk bandwidth, and we don't. But this is mainly a problem that the BD epsilon tree that we use in BetterFS logs everything before it puts it in the tree. So everything gets written twice. And so right off the bat, we're at half disk bandwidth. And so this also is not quite a fair comparison because we're doing full data journaling here, and all these file systems are doing metadata-only journaling.. Later in this talk I'll tell you how we fixed this and we got sequential writes at disk bandwidth with full data journaling. Here's the recursive directory traversal performance benchmark. You might kind of be surprised
Starting point is 00:27:41 how much performance there is on the table for recursive directory scans. This is doing a find and this is doing a grep in Linux source code. And this is time so lower is better. Here's better fs versus everyone else, better fs versus everyone else. We're somewhere in the ballpark of three to eight times faster
Starting point is 00:28:00 than the other file systems at this benchmark. And so this is showing that by having a schema that translates recursive directory traversals into range queries, we can get a lot of performance. OK, now for the bad news. This is the time to delete a file. And the x-axis is the size of the file.
Starting point is 00:28:25 And the y-axis is time. And this line is basically order in, scaling on file deletion. So this is very embarrassing. I didn't even bother to compare with other file systems because there'd just be a flat line along the bottom. We'll fix it. Don't worry. We'll fix it.
Starting point is 00:28:44 Hey, at least we'll put this light it. Well, it motivates the talk. The second part of the talk is fixing this as part of the second part of the talk. And rename, okay, here's the other embarrassing slide. Renaming the Linux source tree in EXT4, that's just, you know, make a new pointer at the new location, delete the old pointer pointer so it's nearly instantaneous it took us over 20 seconds in better fs 0.1 because it's a copy we just copied it from the old name to the new name so orders of magnitude because it's an order in operation okay so where are we at with betterFS 0.1? Sequential reads, good enough.
Starting point is 00:29:29 Maybe we'd like a little bit better, but it's not really a fundamental problem. Random writes, really great. Recursive scans, really great. I didn't really focus any benchmarks on metadata updates, but they're also quite great. But then we've got sequential writes are about a factor of four slow, three or four. Renames and deletes are embarrassingly slow. And what I'm going to do in the second part of this talk is go through these issues, and we're just going to fix them.
Starting point is 00:30:01 And I hope that at the end you'll be convinced these issues had nothing to do with write optimization. They just had to do with the schema that we used to encode the file system in a key value store. And so we're just going to change the schema, and we're going to tweak the data structure a little bit here and there, but I don't think any of the tweaks are anything that are really specific to write optimization. So let's go fix these problems. Let's go through them one at a time.
Starting point is 00:30:21 All right. Accelerating rename without slowing down directory traversals. Let me explain, let me just review quickly and graphically why we use the full path schema. So here is a typical directory hierarchy slash home, doc, LaTeX, whatever. I guess this is in academics because it's got tech files. And in a full path schema, this gets laid out in a DFS order on disk. And so what that means is that if you do a recursive grep,
Starting point is 00:30:57 then as grep is traversing this tree, the disk is going to be doing sequential reads off of disk. And so that's why betterFS's recursive directory traversal performance on things like grep is so good. But the problem with a full path indexing scheme is that when you go to rename a directory, you have to really move it on disk to its new location in order to maintain the locality that you used during the grep. So even though logically it's a very simple change, just delete this arrow, add this arrow,
Starting point is 00:31:33 what you have to do on disk is sit there and copy the keys from one location to the other. And it takes a while. And I think we're actually going to have to stop and wait for it. Okay. all right. So that's why renaming the Linux source took like 20 seconds. And so sort of a cartoon version of what's going on here is that if you have a file system with lots of indirection,
Starting point is 00:31:59 like inodes with pointers, then your scan cost can be high because stuff can get scattered on disk and doing a recursive directory traversal may require chasing pointers, but your rename cost can be high because stuff can get scattered on disk and doing a recursive directory traversal may require chasing pointers, but your rename cost is really low. On the other hand, if you have a file system that maintains locality, your scan cost is low but your rename cost can be very high. And so betterFS 0.2 uses what we call a zone-based schema which tries to find a happy medium where we have low rename costs and
Starting point is 00:32:27 enough locality to get good scans. Let me tell you how zones work. In a zone, we basically take this directory structure and we break it up into contiguous regions of the tree, which we call zones. And then within each zone, we actually lay out these files and directories in the zone using a full path schema. And between zones, we use indirection like an inode. And so what that means is that if you're doing a recursive directory scan,
Starting point is 00:32:59 then as long as the scan is staying within the zone, then that's going to be sequential disk I the zone, then that's going to be sequential disk I.O. and that's going to be very, very fast. And the only time it will have to do a disk seek or issue a secondary I.O. is when it crosses zones. So if we have large zones, then those crossings will be rare and we're going to get most of our performance on our directory scans that we got from the full path schema. So we want large zones.
Starting point is 00:33:27 Now let's look at the cost of doing a rename. There's two cases. If we move a file like 1.mp4 that is the only object in this zone or is the top object in this zone, then there's a pointer to it. So we can move it by just updating some pointers. So that's gonna be a very cheap operation. The other case is when we want to move something
Starting point is 00:33:53 like this directory here, that is not the top element of its own. And so in that case, we are gonna have to copy that data over to its new home in order to maintain the locality that we use so that we can do um you know full path indexing within a zone so there is going to be a cost there and so what this means though is that if we bound the size of zones then we never have to copy more than one zone's worth of data as part of a rename so we can upper bound the size of zones, then we never have to copy more than one zone's worth of data as part
Starting point is 00:34:25 of a rename. So we can upper bound the worst case cost of a rename by keeping zones small. So for sequential scans, we want big zones. For renames, we want small zones. So when we do the Goldilocks solution, well, we get those zones to be just right, and we keep them in the right size range by doing splitting and merging of zones just like if you know splits and merges in B trees, it's a very similar algorithm so when a zone gets too big, split it
Starting point is 00:34:53 when a zone gets too small, merge it with one of its neighboring zones yeah how do you define what is a big zone what is a small zone, how many files are there so, great question. The question was, how big should a zone be? And so we just determined that experimentally.
Starting point is 00:35:12 That is your question, right? Or? OK. So here is a benchmark where we show the recursive directory scan performance as a function of zone size. And as you can see, as zones get bigger, the time to do the scan goes down, which is what we predicted. Here is a benchmark showing the rename cost as a function of zone size.
Starting point is 00:35:35 But there's two cases. If you're renaming an object which is the root of its zone, then the cost is flat. That doesn't depend on zone size, because that's just pointer updates. If you're renaming an object that is not the cost of its own, then you have to do a copy, and that grows. It's a log scale here,
Starting point is 00:35:50 so this is basically linear in the size of the zone. And by choosing a max zone size, we say we're not going to copy more than that much data. And so what we get is a rename curve that ends up looking like that. And in fact, that's what you'll see in the next slide. that much data. And so what we get is a rename curve that ends up looking like that. And in fact, that's what you'll see in the next slide. And so we chose 512 kilobyte zones,
Starting point is 00:36:10 which gives us most of the scan performance that we wanted and gives us a reasonable upper bound on the rename cost. And so here is an end of the day benchmark for renames with the zone based BetterFS. And BetterFS is this, here's the time to do renames of a single file of different sizes. And you know some file systems have a flat curve. BetterFS and ButterFS both have a bump right here. I don't know why ButterFS does.
Starting point is 00:36:42 We do because until you hit the zone limit it starts to get a little bit expensive to do the copying but then once you hit the zone limit it becomes cheap and here is the time to rename the linux source so better if 0.1 is way off the graph it took 21 seconds but better if 0.2 is now basically right in band with everyone else. In fact, it's the second fastest except for XFS. Yeah? What would we be saying to say is?
Starting point is 00:37:12 We just did it experimentally. We found a good trade-off. So that is a parameter you might have to tune for different media. OK. Yeah. All right, so now our rename performance is comparable to other file systems.
Starting point is 00:37:28 So if you tune that parameter, how does that affect the portability of the file system? Well, of the data. Oh. So it's an MKFS time parameter. So you can tune it to the media on which you're building the file system. If you copy files,
Starting point is 00:37:53 zones are an internal thing. They're completely hidden to the semantics of the file system. So if you just do a CP of some files, they don't make a difference. The only time I can see it coming up is if you DD'd a hard drive onto an SSD.
Starting point is 00:38:10 And then you might have a wrong code size. Yeah? In your example, do you have all the file metadata designs for a single directory that resides in one zone? I think the answer is yes. The answer to the question was, does all the file metadata for a single directory reside in one zone?
Starting point is 00:38:37 Yes. So this sort of assumes a relatively reasonable distribution of files and directories. If I had reasonable distribution of files in your entries. If I had one set of files that was huge, and the one file that was lots of data, and I had a whole separate set of files
Starting point is 00:38:55 that said there was lots of little files, you might not want to see that. Actually, it does. And you've hit on a so I will. I will. You've discovered a subtlety here that I wanted to gloss over. But it's a cool subtlety. It's like, oh, I get to brag about something, not like, oh, crap, I have to tell them.
Starting point is 00:39:16 So the question is, does this system depend on the distribution of file sizes or say, like, not having a directory with lots of little files in it, I think is kind of your question. Well, if a directory has lots of little files in it, then that directory gets counted as a big directory and it's going to be made into the root of its own zone. And so any rename of that directory will only be a pointer update. So, and then all the individual files within it, if they're all small, if I rename a single file, if it's a small file, I can copy it. It's cheap to copy. The actual rules for what... Essentially the rule is that if you're the root of your own
Starting point is 00:39:54 zone, you can be as big as you want because moving you just is a pointer update independent of your size. If you're not the root of your own zone, then you are constrained by the zone size. A small file, initially it's going to be just in the zone of its size. If you're not the root of your own zone, then you are constrained by the zone size. And so like a small file, as initially it's gonna be just in the zone of its directory and as it grows, when it hits a certain size, it'll be moved into its own zone, for example. And then after that, any reading of that file will be cheap. So really, really cool question.
Starting point is 00:40:19 Thanks. Okay. Someone else? I don't know how I'm doing on time. 15 minutes? That's worst case. All right. So we'll bang this out.
Starting point is 00:40:37 Okay. So how do we solve the sequential write performance problem without giving up on full data journaling. Let's review the problem. The problem is there's this tree and there's this log. Everything gets written twice. Whenever you insert a big value into the tree, it gets written in the log. Then when you insert a small value, it gets written in the log. That's not such a big deal. Then later on, these values get put into the tree and the tree gets, the nodes of the tree get made durable by running them back to disk.
Starting point is 00:41:08 And so now these items got written twice each. Bam, we're at half this bandwidth. Whoops. Because everything's just getting written twice. So what we use is a system that we call late binding journal. And the idea behind a late binding journal is, we're going to write these big values to disk as part of the tree anyway, why don't we just put a pointer to that data in the log
Starting point is 00:41:31 rather than writing it a second time in the log. But we still want to maintain the semantics and the temporal ordering of what happened. So when the user inserts key 1 with a big value v1, we say that insert happened now but we don't include the value so this is just a small message for small values we're going to handle them like normal and I'll explain why in a minute so they just continue to go in the log and then time passes and now this in-memory log
Starting point is 00:42:00 buffer we want to make this durable so we want to write that out the disk but we can't make it durable on its own because it's referencing some value in the tree and so what we first do is we write out this tree node containing this big value and then we put an entry in the log saying this message its value is stored here on disk and once we've written this node out, then we can make the log durable. Now the log is consistent on disk. This does make it a little bit more complicated to do recovery. I'm not going to go into that. It's still entirely possible. It's just a too fast recovery. Now you might ask, oh, and of course we handle the small values like normal. So you might ask, if this is so great, why do I only do it for big values,
Starting point is 00:42:46 why not small values? And there's sort of two reasons going on. When we use a late binding journaling technique, making a value durable requires writing out a whole node. And those are big, like four megabytes. So if a node only contains a single thing that we want to make durable, like a four kilobyte write, that's incredibly wasteful. It's just, it's cheaper to make that four kilobyte write durable by just appending it to the log. But for a big value, if you're writing out 4 megabytes of data and it's sitting in a 4 megabyte node, then it's very efficient to just write the node out to make that data durable.
Starting point is 00:43:17 Secondly, in the BD epsilon tree with caching, it turns out that, well, backing up, let's talk about small writes first. Small random writes in a B to the epsilon tree slowly percolate down the tree and they may end up getting written to disk several times anyway as they make their way down the tree towards the leaves. So what's one extra write to the log? It's not really going to harm performance. If you're already getting written to disk four times, why not make it five? But for big values, like what occur
Starting point is 00:43:48 during a sequential write, they actually are going to end up opening up in-memory cached root to leaf path down that tree. And what's going to happen is the first few writes might be a little bit slow, but the second writes, all those later writes as a part of that sequential write, are going to go straight to a leaf, and they're going to get rid of to disk one time in the tree.
Starting point is 00:44:07 And so then writing them a second time in the log is a big overhead, it doubles the cost. So it's worth doing it for big writes but not for small ones. And I think this is a really cool point about write optimization and the interaction of write optimized data structures with things like logs. Okay, so let's check that we actually accomplish our goal.
Starting point is 00:44:26 Here is BetterFS 0.2's sequential write performance, and it's in the high 90s. We've even tuned it where we can get it over 100 if we want, compared to BetterFS 0.1, which is a little under 30. So we have achieved our goal
Starting point is 00:44:42 without having to give up full data journaling semantics. The last problem we faced was range cast delete. And the problem here is that we just had to issue a whole bunch of delete commands for each block of the file. The solution is quite simple. We just added a command to the BD epsilon tree to insert a single message which says, delete all this stuff in this range. Simple message.
Starting point is 00:45:12 And so that means that in order to delete a file, you just drop that message in the tree and you're done. And then that message offline can go through and garbage collect all the nodes of the subtree that got deleted. But in fact, there are some opportunities to optimize stuff. When this delete message is sitting here, based on these pivot keys, we can actually tell this whole subtree is now garbage
Starting point is 00:45:35 and we can garbage collect it without even having to read it in and look at it. And so we can, these messages improve both latency and the throughput of deletes in betterfs so now here's a quick benchmark of doing deletions in betterfs 0.2 and 0.1 so this is the size of the file being deleted this is the time betterfs 0.1 starts here and goes into space betterest 0.2 has a flat deletion curve, and it's actually about 30% faster than EXT4. OK, so we fixed the problems that we had.
Starting point is 00:46:15 Did we break anything that we had before? Let's look at random writes. BetterFest 0.1, this is throughput in terms of random write throughput. BetterFest 0.1, this is throughput in terms of random write throughput. BetterFS 0.1 was already orders of magnitude faster than a file system like EXT4. BetterFS 0.2, through tuning and other fixes that we made in the code, is even faster. Here's the file creation benchmark. BetterFS 0.1, again, was orders of magnitude faster than EXT4 in terms of creating files.
Starting point is 00:46:48 BetterFS 0.2 is even faster although it has this interesting artifact here which actually what happened there, maybe you guys can figure it out, as we make more files there's initially one zone in the file system and it gets bigger and bigger and bigger and then it comes to a certain point when every new file that we create causes a new zone to get split off. And so it goes through this phase where every new file for a while is causing zone splits, splits,
Starting point is 00:47:17 splits, splits, splits, splits. And so performance drops. But then after a while, everything's split off nicely, and the file system continues to chug along at a good rate. Given time, I think I'll skip the real application performance. You can see it's just
Starting point is 00:47:36 we actually show that there's many real applications like Git and rsync and mailder server where you get real benefit from the performance that we have in this file system. Coming back to our original goal, we now have a file system that's good at sequential I.O. It beats update and place file systems on random writes by orders of magnitude.
Starting point is 00:48:06 It has renames and deletes that are just as fast as a conventional file system. It's got recursive scans that are somewhere between 3 and 10 times faster than other file systems. And it's got metadata updates, orders of magnitude faster than other file system. And so I hope now I hope you're convinced that with these new data structures that came online in the 90s and the 2000s we can now
Starting point is 00:48:37 build file systems that overcome design trade-offs that we have been stuck with for decades in the file system community. And by using these new data structures, we can build file systems that offer across-the-board, top-of-the-line performance. So when someone says they're building
Starting point is 00:48:57 a write-optimized file system, this doesn't mean it's like a niche file system just for small random writes or something like that. It's a file system that does well. It can be, something like that. It's a file system that does well, it can be done right, it's a file system that does well at all these basic operations. And these new data structures create both a need and an opportunity. There's an opportunity here to revisit lots of these file system design decisions that we've been living with for decades. But there's also a need because, you know, the right optimized data structure's performance profile is so radically different from things like B-trees that we've been used to working with that
Starting point is 00:49:36 if you often just take a right optimized data structure and drop it in to replace the B-tree, you will not get the performance you might hope for. You need to redesign the rest of the system around the performance of these new data structures. So mainly it's an opportunity, but it's also a need. And BetterFS is an open-source project. You can get the code at betterfs.org. Thank you. Thank you.
Starting point is 00:50:38 Yeah. Thank you. So, okay, so your question is, how do we handle concurrency in, like, directory scans? And POSIX is either wonderful or terrible, depending on your point of view on this. So POSIX basically says that when someone does a re-dir, the operating system can do whatever the hell it wants in terms of consistency with other applications. So stuff can show up that's been deleted, stuff that's been added since you started your re-dir cannot show up. POSIX has basically no consistency guarantees at all. The way, you know, I didn't get into it, but this B to the epsilon tree implementation that we use actually supports transactions.
Starting point is 00:51:19 And so Reader, you know, every system call that is essentially implemented inside of a single transaction by itself. That means that when you do a read dir, you are going to get a consistent view of the part of the directory that you read. For big directories, I think you actually have to end up calling read dir multiple times, and so someone else could sneak in with a different transaction and insert something and you might miss it. But that's what POSIX allows us to do.
Starting point is 00:51:45 Otherwise what we would have to do is we could, we could in a single transaction pull out the entire directory listing and store it for you and hold onto it or something like that. But like, you know, we efficiently support directories with millions or billions of files in them, so we're not going to materialize that whole directory in RAM for you. Have you considered some of the other things that POSIX and various unique systems allow, such as handling those hard links and maintaining the file handle open after the file is deleted, and things like NFS exports where you need a persistently stable, small, unique identifier for a file,
Starting point is 00:52:29 and some of those other things that a regular user application doesn't need to deal with, but an actual production policy frequently does. Yep. Okay, so you're asking about hard links, keeping a file around if it's open, and NFS. Yeah, NFS file handles, which are traditionally on Unix, are based on ICO numbers. Yes. So hard links with the zone-based file system, whenever you make a hard link to a file, we just put it in a zone by itself.
Starting point is 00:53:08 Keeping files around as long as they're open, we implemented, that was years ago, I forgot how. NFS IDs, I think we actually give every file a bogus inode number, but I don't think we've ever actually tried NFS exporting a better FS file system. But we just throw it in, you can just have a new counter and you just put the counter in there and make it the INODE number for the file. You need to do lookup ID. Pardon?
Starting point is 00:53:41 Lookup ID is going to be expensive in your system because the B tree is not ordered to support that. Right. So we could add another index, I guess. One of the beauties of having these B epsilon trees is indexes are cheap. Fine. You want to make that lookup cheap? Okay.
Starting point is 00:53:57 I'll add another index. Like I think we have at this point, dev bootstrap basically works on better FS. Although I'm not sure we actually booted it, but we this point dev bootstrap basically works on BetterFS. Although I'm not sure we actually booted it, but we did get dev bootstrap to actually install a file system for Debian on a BetterFS system. Yeah, that would require at least a handful of hardware.
Starting point is 00:54:15 Exactly. That was actually a blocker. Someone up here had their hand up. Sure. Do those transactions get you snapshots and do you have any sense of checks on them? The transactions we've... So the BFs on tree we're using also has snapshots,
Starting point is 00:54:35 which are implemented in a distinct way from the transactions. The transactions are done using VCC, and I don't think it's the right way to do snapshots, but it does have snapshots. We don't have snapshots the right way to do snapshots but it does have snapshots. We don't have snapshots exported to the file system layer nor do we have
Starting point is 00:54:48 transactions exported to the file system yet. It's something that would be nice to do. Check summing? Check summing, there is check summing on the nodes
Starting point is 00:54:57 of the BDF sign tree but it's not in any way, like, that's just all internal to the BDF sign tree code. Yeah? Do transactions lock the entire tree, or it seems like you could do something cool with having them chase each other down the tree? So, okay, let me tell you about the BDF Sanfrico.
Starting point is 00:55:20 The BDF Sanfrico was actually developed by a company called Tokutech, which was founded by several of the co-authors on this paper. And it was a commercial product. It's now owned by Percona. And it's, I mean, I'm blown away with the quality of engineering. It's extremely highly concurrent. So transactions do not lock the entire tree at all. So they proceed in parallel quite well. So the one thing I noticed was not on your chart of comparison was random read performance.
Starting point is 00:55:55 That's because everyone's screwed on random reads. Random reads are hopeless because unless you somehow replicate the data 500 times in different orders, and now you just kind of hope that my next sequence of random requests just happen to be in order in one of your replicas, once you commit to the data on disk in some order, I can always come up with a sequence of read operations that are going to force you to do seeks on a rotating disk. So everyone is hopeless for random reads.
Starting point is 00:56:29 And so we didn't even bother. Flash. Flash. So Flash actually, I don't want to go over time here, but Flash, so all these benchmarks were on a rotating disk. We have done a few benchmarks on SSD. And you might think, well,
Starting point is 00:56:48 since the strength of this data structure is good random write performance and Flash is good at random writes, why do we need this data structure? And it turns out you still get something pretty cool. So here's random write performance of betterFS 0.2 versus E64 on an SSD. It's not the 100 times faster that we were before, but it's still something like six or seven times faster. But yeah, we use more complicated data structures. Maybe
Starting point is 00:57:12 there's more computation. It's a little bit of a bottleneck. So our sequential write performance on Flash is not quite up to snuff. But I think this is an engineering problem. This is because of better data structures. Yes. So... So mostly, in general, we need to read and modify 4 bytes and then write 4K again. Yes. Okay, so your question is about the random write benchmark I showed was 4 bytes. How about 4K? 4K, I think the absolute numbers go up just because the bandwidth.
Starting point is 00:58:03 Now you're accomplishing 4 kilobytes of write per IO, but essentially the gap is the same. But the, well, it's not quite the same. It closes a little bit because in EXT4, when you do a 4-byte write to a block, it has to read in the block. And we don't because we just issue an upsert saying, change this 4 bytes of this block to this new value.
Starting point is 00:58:25 So we get a big advantage there. We didn't know if this would actually be a very useful thing to do because we didn't know if any applications ever did random four-byte writes to uncached blocks. But there was another paper at FAST the same year that we published our first BetterFS paper where they modified the Linux page cache to actually support sub-block writes.
Starting point is 00:58:44 And they got quite a substantial performance improvement on some applications. So there are applications that do this weird thing where they just blindly write some bytes into the middle of a block of a file. All right. Time's up. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list
Starting point is 00:59:04 by sending an email to developers-subscribe at sneha.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
