Storage Developer Conference - #48: Optimizing Every Operation in a Write-Optimized File System
Episode Date: June 22, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 48.
Today we hear from Rob Johnson, Research Assistant Professor with Stony Brook University,
as he presents Optimizing Every Operation in a Write-Optimized File System
from the 2016 Storage Developer Conference.
My name is Rob Johnson.
I'm somewhere in there.
I'm from Stony Brook University.
And I have three goals with this talk.
One is, of course, I'm just excited to talk about BetterFS,
which is a high-performance file system
that we've been building.
But I also hope to talk a little bit
about write-optimized data structures,
which we use inside this file system,
and explain what they are,
what their performance opportunities are,
and some of the challenges.
And then the third goal is to use BetterFS as an example of how to design an application or a system on top of write-optimized data structures, so that you take advantage of their strengths and avoid pitfalls in the design of write-optimized applications.
Okay, so file systems. If you try to build a general-purpose file system, there's
a lot of different performance demands that you've got to satisfy.
Some applications really want great, you know, sequential read and write performance, some
want really good random writes, some want file and directory operations like renames
and deletes to go really fast, some applications want to do recursive scans like a backup.
You've got metadata updates to files and directories. So there's a lot of different things that applications need. And building file systems that excel at all of these has been a long
challenge. So let's take EXT4 as an example. EXT4 is a classic design of what we call an update-in-place file system.
So it tries to lay out the content of files in a logical way with good locality and then
when you update those files it updates it in place so it maintains that locality. And
so as a result on a disk which has like about 125 megabytes per second of bandwidth,
when EXT4 is doing sequential reads,
it's able to get about 110 megabytes per second of bandwidth.
So it's basically getting almost all of the resource out of the disk.
But when you go to do random writes,
all those random writes require seeks,
and so the performance is terrible.
And in case you're wondering,
no, I did not leave the line off the graph.
You just can't see it.
And so when we look at these performance objectives,
we see that update-in-place file systems are good at sequential reads and writes and file and directory renames, or at least EXT4 is.
But things like random writes and metadata updates,
which translate often into random writes, are not so good.
And we'll also see that recursive scans are not as good as they could be, but for a different reason.
And the other design extreme are log-structured file systems, which are optimized for random writes
because all these random writes just are appends to a log.
And so log-structured file systems translate that random write workload into a sequential
write workload. And so they're really, really fast at random writes. But then the problem
is that when you go back to read that data later on, it's just in the log in whatever
order it happened to get written in. And so if you're trying to read your file sequentially,
you may have to be seeking all over the disk to get the blocks in whatever order and location
that happen to be stuffed into the log.
And so log-structured file systems are really just a different trade-off between these performance
objectives.
They can be quite bad at sequential reads, but they are really good at sequential and
random writes.
They can also be good at file and directory operations.
Just for the same reason that they might not be good at sequential reads,
they can also be bad at recursive scans.
So there have been these trade-offs in file system design for many, many years.
One is the one I just described between sequential reads and random writes,
where update-in-place file systems are optimized for reads
and log-structured file systems are optimized
for the random writes.
But this is not the only one.
I'll tell you later on about a trade-off
between rename performance and directory scans,
which is exemplified by inode-based file systems
versus full path indexing file systems.
So these trade-offs have been very enduring
in file system design.
And there's probably more, but these are the two I'm going to talk about today.
And the main idea of this talk is that by using new data structures,
write-optimized data structures, we can build file systems
that completely overcome these long-standing trade-offs.
And so this is also the outline of the
talk. So I'm going to begin by describing the background on write-optimized data structures, and then I'll talk about how BetterFS is designed to exploit those write-optimized data structures, and then we'll look at performance evaluation and dive a little deeper into some of these trade-offs.
Okay, so write-optimized data structures are a new class of data structures
that were discovered starting in the mid-90s with the LSM tree.
And the LSM tree is a data structure, you may have heard of it before,
it's in a lot of web-scale databases like HBase and Cassandra and LevelDB.
There's been a steady stream of other ones since then: B-epsilon trees, cache-oblivious lookahead arrays, the xDict, and so on.
All these data structures are key value stores, so they're more or less like a drop-in replacement
for a B-tree.
And the main feature they have is that they can do inserts or updates and deletes orders of magnitude faster than a B-tree.
But point queries are basically as fast as in a B-tree.
So they don't get any faster, but they don't get any slower asymptotically either.
And betterFS
uses B to the epsilon trees, which is
where it gets its name.
And so I'm going to teach you B to the epsilon trees as an example of a write-optimized data structure, and that'll help us understand why they have this neat performance profile.
And then we'll look at the challenges and opportunities that performance profile creates.
But before we do that, let me just give you a quick, simple computational model that we're going to use.
So this is called the disk access machine model. And it's a very simple model of a computer for analyzing algorithms and data structures that are I/O bound rather than CPU bound. And so it's a really, really simple model, and it's surprisingly effective given how simple it is. So in this model we have a computer with a RAM of size M, and the units here are kind of vague. Typically, M is going to be M things, so M key-value pairs.
And we have a disk, and in order to operate on any piece of data,
all the data starts on disk, you have to shuffle it into RAM,
and you can do that in blocks of size B.
Of course, if the RAM is full, then you either have to just kill a clean page
or write back a dirty page to disk.
And since we're not worried about computation, we're only concerned about the IO efficiency
of our algorithms, the performance metric in this model is just how many block transfers
do you have to do to get your job done.
So we don't care if you've got some complicated data structure inside a block; it doesn't matter. It's one I/O. Just load it in memory, and do whatever you need to do to figure out what is going on inside there.
And so these are going to be the parameters that appear in our performance evaluations: B and M, and the size of the data set, N.
Now I know the model has deficiencies.
Like, for example, it assumes that every I/O has the same cost.
Of course, on a disk, two I/Os that are adjacent
are typically cheaper than two that are very far apart.
But we'll just ignore that stuff for now
and charge ahead with this model.
OK, let me quickly review a B-tree in this model.
So a B-tree is a search tree in which each node is sized
so that it fits in one block of storage.
And so that means that a node holds B thingies, or a block holds B thingies.
That means we can have B pointers to our children and the B pivot keys
so that we know where to go when we're descending in the tree searching for an element.
And since the fan out of the tree is roughly B,
that means the height of the tree is going to be log base B of N,
where N is the number of elements in the tree.
And so that means that for a lookup, we basically just have to walk a root-to-leaf path, and that's going to be log base B of N I/Os.
And an insert is the same, because the way you do an insert in a B tree is you just search
for the leaf, and then you insert it there.
So there's one extra I/O to write it back,
but that's not really a big deal.
It's still log base B of N.
So in a B-tree, it's interesting to point out
that the cost of lookups and inserts
are basically the same.
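To put rough numbers on that, here is a tiny back-of-the-envelope sketch in Python; the values of B and N are made-up illustration numbers, not figures from the talk.

import math

# Hypothetical illustration numbers (not from the talk):
B = 1024     # key-value pairs (or child pointers) that fit in one block-sized node
N = 10**9    # total number of key-value pairs in the tree

# Height of a B-tree with fanout ~B is about log base B of N.
height = math.ceil(math.log(N, B))

# In the disk access machine model, a lookup walks one root-to-leaf path,
# touching one node (one block transfer) per level; an insert does the same
# walk plus one write-back of the leaf.
lookup_ios = height
insert_ios = height + 1

print(f"height ~ {height} levels, lookup ~ {lookup_ios} I/Os, insert ~ {insert_ios} I/Os")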
Now let me show you the B to the epsilon tree.
This is a write-optimized data structure.
It's a variation on a B tree.
So again, we're going to have nodes of size B,
but we're going to use that space very differently.
Here I'm doing epsilon equals a half, so we're only going to have a fanout of square root of B. So the tree is going to be a little
bit taller as a result. But what that means is that we only need to use square root of
b of the space to store our pivot keys and pointers to the child nodes. And square root
of b space is basically compared to b,
it's nothing.
Like if b were a million, square root of b is like 1,000.
So we've got 999,000 space left over.
What do we do with it?
It's going to be buffer space.
And keep in mind, this buffer, a common confusion
we see is people think this is like an in-memory cache.
When this node is in memory, the buffer is in memory.
But if this node gets evicted to disk, the buffer goes with it.
It's part of the node.
And since the fanout is now square root of B,
then a tree with N items in it will have a height of log base square root of B of N.
But if you remember your high school logarithms rules,
that's basically two log base B of n.
So in terms of asymptotics,
that two is just going to disappear into the constants of the big O notation.
So the height is still log base B of n asymptotically.
So let me tell you how we do an insertion
into the B to the epsilon tree.
Whenever you insert a new item,
you just put it in the buffer at the root.
Insert another one, put it in the buffer at the root.
Put them in the buffer at the root.
And just keep doing that until the buffer fills up.
So obviously that's pretty fast.
You don't have to do this root-to-leaf walk.
And when the buffer fills up,
then you just flush those elements down
to the buffers of the children.
And if this buffer is now full,
then we might have to do another flush.
There could be some cascading flushes going on here.
But that's the insertion algorithm
in a B to the epsilon tree.
When we flush elements from here to here, we always flush them along the path that you would follow if you were searching for that element.
And that means later on, when we're searching for this element, as we do our root-to-leaf walk as part of the query,
we will pass this node, and we can look inside its buffer and go,
aha, there it is.
So searches are still going to work correctly.
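To make the insert-and-flush mechanics concrete, here is a minimal Python sketch of buffered insertion. It is a toy under assumptions of my own (a tiny fixed buffer capacity, a prebuilt two-level tree, no node splits or durability), not the real BetterFS or Tokutek code.

import bisect

BUFFER_CAPACITY = 4   # tiny made-up capacity so flushes are easy to see

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []        # sorted keys separating children
        self.children = children or []    # len(children) == len(pivots) + 1
        self.buffer = {}                  # pending key -> value messages
        self.items = {}                   # resident key -> value pairs (leaves)

    def is_leaf(self):
        return not self.children

    def insert(self, key, value):
        # New items always land in this node's buffer (the root's, in practice).
        self.buffer[key] = value
        if len(self.buffer) > BUFFER_CAPACITY:
            self.flush()

    def flush(self):
        # Push buffered messages down along the same path a search would take.
        if self.is_leaf():
            self.items.update(self.buffer)
        else:
            for key, value in self.buffer.items():
                child = self.children[bisect.bisect_right(self.pivots, key)]
                child.insert(key, value)   # may cascade further flushes
        self.buffer.clear()

    def lookup(self, key):
        # A query checks the buffers along the root-to-leaf path.
        if key in self.buffer:
            return self.buffer[key]
        if self.is_leaf():
            return self.items.get(key)
        child = self.children[bisect.bisect_right(self.pivots, key)]
        return child.lookup(key)

# Two-level example: a root with two leaf children split at pivot "m".
root = Node(pivots=["m"], children=[Node(), Node()])
for k in ["q", "a", "z", "b", "c"]:   # the fifth insert triggers a flush
    root.insert(k, k.upper())
print(root.lookup("a"), root.lookup("z"))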
Let's analyze the cost of doing an insert into this data structure.
Well, each item might get written to disk once per level of the data structure.
And we know there's log base B of n levels to the data structure.
But whenever we do a flush, we have to do square root of B IOs to load in all these
children and flush everything down and write them back.
So it's order square root of b.
But in the process,
we get to move b minus square root of b things down.
So there's this huge batching factor.
In fact, on average,
for each I/O that we perform,
we move b over square root of b items
down one level.
And so as a result,
the IO cost per insert
is the number of times
each item gets written to disk
divided by the batching factor.
So every time we write something to disk,
we actually move square root of B things down a level.
And so now take a moment and stop and think about that compared to the
B-tree. The B-tree insertion cost was log base B of N. Here we now have an insertion
cost of log base B of N divided by square root of B. If B is like a thousand, this is 30
times faster. If B is 100,000, this is 300 times faster.
And when you actually go and implement this data structure and benchmark it, these numbers
bear out.
You really will find that you're one, two, sometimes three orders of magnitude faster
than a B-tree.
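As a quick sanity check on those numbers (using the speaker's example values of B and an arbitrary data-set size N of my own choosing), the speedup per insert works out to roughly the square root of B:

import math

def btree_insert_ios(B, N):
    return math.log(N, B)                  # roughly one I/O per level

def b_epsilon_insert_ios(B, N):
    # Same number of levels, but each I/O moves about sqrt(B) items down a level.
    return math.log(N, B) / math.sqrt(B)

N = 10**9                                  # arbitrary illustration value
for B in (1_000, 100_000):
    speedup = btree_insert_ios(B, N) / b_epsilon_insert_ios(B, N)
    print(f"B = {B}: about {speedup:.0f}x fewer I/Os per insert")
# Prints roughly 32x and 316x, matching the ballpark 30x and 300x figures above.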
Lookups are essentially the same.
You just have to do a root to leaf path, you just have
to check the buffers along the way.
And range queries, I want to point out, range queries when you say I've got key A and key
B, tell me everything in the database that's between them.
And essentially, for that you just have to do a query for the first item, and then you can
read the items off of disk, essentially at disk bandwidth.
So those are the three takeaways from this data structure. Inserts are orders of
magnitude faster than in a B-tree. Point queries are the same
speed as in a B-tree and range queries run at disk bandwidth. So these are the
performance strengths of a write optimized data structure like the B to the epsilon tree.
But this strength actually does create something that we call the search insert asymmetry.
And the problem here is inserts got way faster, but point queries didn't get any faster. And so now, if you have a workload
where, for each insert,
you precede that insert with a query,
then you're going to be bottlenecked on the query.
So, like, take this example workload.
You know, in a B-tree, if you wanted to add $10
to my bank account balance, which I hope you do,
then you would have to query for the old balance, then you'd
have to compute the new balance, and then you'd have to insert the new balance.
Now, in a B to the epsilon tree, this is lightning fast.
This is computation, but this is no faster than it was in the B tree.
And so even though this data structure supposedly offers inserts orders of magnitude faster,
maybe, maybe you'll get a 2x speedup, and probably not even that.
So this is the search insert asymmetry
and it's something that you have to keep in mind
when you're designing your applications
or other systems on top of a write-optimized data structure.
And there's a really cool trick that a write-optimized data structure like the B to the epsilon tree can provide to applications to help them overcome the search-insert asymmetry.
And this trick is what we call an upsert,
which is an unfortunate name collision with this upsert concept
from the SQL database world.
It's something slightly different.
An upsert is essentially a message that you insert into the tree
to encode the change that you want to make.
So, for example, if I want to add $10 to my bank account balance and my bank account balance is $5 stored down in here in some leaf, I don't want to do a query.
Queries are slow.
So what I do is I just put a message into the tree saying, add $10 to Rob's bank account
balance.
And this can be done as fast as an insert.
So I can do tens to hundreds of thousands of these per second.
And this message is like a delta now.
It's not an actual value.
It's a delta on an old value.
It'll eventually get flushed down to this leaf.
When it reaches the leaf and meets up with the original value,
they'll get merged together,
and the new value will be written out to disk.
And in the meantime,
when someone queries for my bank account balance,
what they do is they do a root-to-leaf path walk, find the old value,
and then they walk back up applying these deltas to the result
before they return the actual result.
And so even though it may take a while for this delta to get merged in with the old value,
from a semantic point of view, the database or the key value
store is updated instantaneously.
So upserts are a really
powerful technique for overcoming
the search-insert asymmetry
in write-optimized data structures,
and they work particularly well in B to the epsilon trees.
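Here is a minimal sketch, in the same toy Python style as the earlier sketches, of how upsert-style delta messages can be layered on a key-value store. The message representation (a Python function as the delta) is my own assumption for illustration, not the actual BetterFS or Tokutek encoding.

from collections import defaultdict

class UpsertStore:
    """Toy model of upserts: blind delta messages merged lazily on read/flush."""

    def __init__(self):
        self.base = {}                       # durable key -> value (the "leaves")
        self.pending = defaultdict(list)     # key -> list of delta functions

    def insert(self, key, value):
        self.base[key] = value
        self.pending.pop(key, None)          # a full insert supersedes old deltas

    def upsert(self, key, delta):
        # No read, no root-to-leaf search: just record the change. This is why
        # upserts can run at insert speed rather than query speed.
        self.pending[key].append(delta)

    def query(self, key):
        # Semantically the update is already applied: fetch the old value,
        # then replay the pending deltas on the way back up.
        value = self.base.get(key)
        for delta in self.pending.get(key, []):
            value = delta(value)
        return value

    def flush(self, key):
        # Eventually the deltas reach the leaf and get merged in for real.
        self.base[key] = self.query(key)
        self.pending.pop(key, None)

accounts = UpsertStore()
accounts.insert("rob", 5)
accounts.upsert("rob", lambda balance: balance + 10)   # "add $10", no query needed
print(accounts.query("rob"))                           # 15, even before the flush
accounts.flush("rob")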
Okay.
That's more or less the
background I wanted to go over on write-optimized data
structures. We're actually going to talk about file systems in a minute.
But just to wrap it up,
point queries are as fast as
in a B-tree. Inserts,
upserts, and deletes. Deletes are just
implemented like an upsert; you just insert a delete message.
They're very fast, able to
do tens to hundreds of thousands per second.
And range queries,
the important thing here is they can run at disk bandwidth.
And so what that means is that when we design systems on top of these data structures, we want to avoid point queries as much as possible and map as many application or higher-level system operations as we can onto inserts, deletes, upserts, and range queries.
And so that's what the design schema of BetterFS
tries to accomplish.
So let me tell you how BetterFS works.
So what it does is it has to map file system operations
onto key value operations
by storing file and directory
data in a write-optimized data structure; we use the B to the epsilon tree. In fact, we have
two indexes. One of them is the metadata index, and the metadata index essentially
stores, well, your file metadata: it maps full paths within that file system to information that you might call a
struct stat, so the ownership of the file, the last time it was modified, the size of
the file, permissions on the file, stuff like that. And then the data index is where we store file
contents, and so the data index maps keys, which are a path and then the block number within the file, to the actual
4K block of data.
The reason we have these two indices is because the metadata index, both of them are sorted
in such a way that when you do a recursive directory scan or a directory listing, it
translates into a range query in the index.
If you're doing something like a find, which
doesn't look at the content of files, that will be a range
query in the metadata index.
If you're doing a grep, that'll be a range query in
the data index.
And so the implications of this design, since we don't
use inodes, we actually index everything by full path, is
that we always maintain things in directory traversal order so we have really
fast directory scans.
And also in the data index, the file data is laid out in a logically sequential order
as well, so we should have good sequential read performance.
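As a rough illustration of this schema, here is a toy Python sketch of the two indexes; the exact key encoding and sort order in BetterFS differ, so treat the key formats below as assumptions for illustration.

import bisect

# Two sorted indexes, keyed so that related entries are adjacent on disk.
# Metadata index: full path            -> stat-like metadata
# Data index:     (full path, block #) -> 4K block of file contents
metadata_index = {
    "/home/rob": {"type": "dir", "mode": 0o755},
    "/home/rob/doc": {"type": "dir", "mode": 0o755},
    "/home/rob/doc/paper.tex": {"type": "file", "size": 9000},
}
data_index = {
    ("/home/rob/doc/paper.tex", 0): b"\\documentclass...",
    ("/home/rob/doc/paper.tex", 1): b"...more of the file...",
}

def range_query(index, lo, hi):
    """All entries with lo <= key < hi, in sorted (on-disk) order."""
    keys = sorted(index)
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return [(k, index[k]) for k in keys[start:end]]

# A recursive scan like `find /home/rob` becomes one range query on the
# metadata index: every path under the directory sorts into one range.
print(range_query(metadata_index, "/home/rob", "/home/rob" + "\xff"))

# Reading paper.tex sequentially becomes one range query on the data index:
# its blocks are adjacent because they share the path prefix.
print(range_query(data_index,
                  ("/home/rob/doc/paper.tex", 0),
                  ("/home/rob/doc/paper.tex", 2**32)))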
So now that I've shown you the schema for betterFS, it's relatively straightforward
to just figure out how each high-level file system operation gets mapped down to a key value store operation.
So file reads are just range queries, writes are inserts, or actually we use upserts.
If an application does a blind write of like four bytes to a block, then rather than reading
the old block, we just use an upsert and just stick the new value for those four bytes into
the tree.
Metadata updates are upserts, readdirs are range queries, directory operations are inserts and deletes, and then the two problematic operations are unlink and rename. Since BetterFS 0.1
indexes everything by full path and has an entry in the key value store for each block
of the file.
When you delete a file, we have to issue deletes for each block of that file.
And you're going to see this is very painful in a minute.
And also, if you rename a directory, the way BetterFS 0.1 implements that is internally it just copies the directory
to the new name recursively.
So these are going to be very slow, but we're going to fix them.
Because of this mapping, we're going to have really fast metadata updates, we'll have really
fast directory traversals, and we're going to have a really big problem here that we'll
fix later.
All right, so that's it.
That's BetterFS 0.1 in a nutshell.
Let me stop and do
a performance
sanity check to see if we're actually getting
anything out of this
construction.
So, "going to fix later" means...
In this talk.
You have no idea how to do it until you get new graduate students?
Oh, no, no, no. Okay, so I mean, this is our FAST 2015 paper, yes, and we fixed it in our FAST 2016 paper. And there's a lot of moving parts going on here.
So I decided rather than try to just present the current version,
I want to build it piece by piece so we can focus on one part of it at a time.
Yeah?
You have a lot of upserts staged through the tree.
Does that imply there's a bunch of pending computation that has to get done when you walk through and resolve them all?
Yes.
So the question is, if you have a lot of upserts
staged in the tree, might you just be delaying computation
and then suddenly it comes back to bite you in the end, right?
So there's some heuristics in the B to the epsilon tree
where if it sees that you are applying the same upsert message
over and over again, then even if it's not really worth the I/O to flush that message down,
it'll go ahead and flush that message down.
So it sort of bounds the amount of computation you waste on reapplying upserts
in the meantime before it reaches the leaf.
I think since then we've come up with some better heuristics,
but we haven't implemented them yet.
But that's a really good question. Okay, so
let's look. So you know one of the things we were hoping to get was really good random
write performance. So this is a benchmark where we do a thousand 4-byte writes at random offsets in a 1-gigabyte file.
We're comparing BetterFS 0.1 to Btrfs, ext4, XFS, and ZFS.
The y-axis is the time required to complete the benchmark.
Things are already looking good here, where this is the time where betterFS required to
do it, and that's the time that everyone else required to do it.
But notice this is actually a log scale.
So we actually completed the benchmark.
BetterFS completes the benchmark in about 0.17 seconds,
whereas all the other file systems took over 10 seconds
to do this random write benchmark.
And so the takeaway here is that BetterFS
is successfully exploiting the insert performance of the B to the epsilon tree to improve random write performance.
Here's another benchmark for creating millions of files in the file system as rapidly as
possible.
So this is a balanced directory tree with a fanout of 128. And what we've plotted here is the cumulative throughput in terms of files created over
time.
So this is like the throughput per second after we've created the millionth file.
And we're comparing with the same file systems.
And so BetterFS is up here. This is throughput, so higher is better. But again, note
this is a log scale, which means that it's about 10x faster than the next closest file system
and about 100x faster than the other file systems. And some of the file systems
didn't even complete the benchmark. So again, we're able to exploit B to the epsilon tree insertion performance to improve these operations.
Sequential I/O: one of the advantages
of a write-optimized data structure like the B to the epsilon tree
is that you should be able to do range queries
at disk bandwidth.
And so sequential file reads in our schema
translate to range queries in the tree.
So we would hope that we're getting pretty close
to disk bandwidth.
And in fact, we are.
Maybe we'd like to engineer this to get a little bit closer.
But about 80 megabytes per second versus 110,
there's not a fundamental algorithmic problem here.
It's just some engineering work.
So we're getting pretty close to disk bandwidth.
Now, the problem with writes is we
should also be able to get sequential writes at disk
bandwidth, and we don't.
But this is mainly because the B to the epsilon tree that we use in BetterFS logs everything before it puts it in the tree.
So everything gets written twice.
And so right off the bat, we're at half disk bandwidth.
And so this also is not quite a fair comparison because we're doing full data journaling here,
and all these file systems are doing metadata-only journaling. Later in this talk I'll tell you how we fixed this and we got sequential writes at disk bandwidth with full data journaling.
Here's the recursive directory traversal performance benchmark. You might kind of be surprised
how much performance there is on the table for recursive directory scans. This is doing a find
and this is doing a grep in Linux source
code. And this is time
so lower is better. Here's
BetterFS
versus everyone else, BetterFS versus everyone else.
We're somewhere in the
ballpark of three to eight times faster
than the other file systems at this benchmark.
And so this is
showing that by having a schema that
translates recursive directory traversals into range queries,
we can get a lot of performance.
OK, now for the bad news.
This is the time to delete a file.
And the x-axis is the size of the file.
And the y-axis is time.
And this line is basically order N scaling on file deletion.
So this is very embarrassing.
I didn't even bother to compare with other file systems
because there'd just be a flat line along the bottom.
We'll fix it.
Don't worry.
We'll fix it.
Hey, at least we put this slide in. Well, it motivates the talk.
Fixing this is part of the second part of the talk.
And rename, okay, here's the other embarrassing slide.
Renaming the Linux source tree in EXT4, that's just, you know, make a new pointer at the
new location, delete the old pointer, so it's nearly instantaneous. It took us over 20 seconds in BetterFS 0.1 because it's a
copy; we just copied it from the old name to the new name. So it's orders of magnitude slower, because it's an
order N operation.
Okay, so where are we at with BetterFS 0.1?
Sequential reads, good enough.
Maybe we'd like a little bit better, but it's not really a fundamental problem.
Random writes, really great.
Recursive scans, really great.
I didn't really focus any benchmarks on metadata updates, but they're also quite great.
But then we've got sequential writes that are slow by about a factor of three or four.
Renames and deletes are embarrassingly slow.
And what I'm going to do in the second part of this talk is go through these issues,
and we're just going to fix them.
And I hope that at the end you'll be convinced these issues had nothing to do with write optimization.
They just had to do with the schema that we used to encode the file system in a key value store.
And so we're just going to change the schema,
and we're going to tweak the data structure a little bit here and there,
but I don't think any of the tweaks
are anything that are really specific to write optimization.
So let's go fix these problems.
Let's go through them one at a time.
All right.
Accelerating rename without slowing down directory traversals.
Let me explain, let me just review quickly and graphically why we use the full path schema.
So here is a typical directory hierarchy slash home, doc, LaTeX, whatever.
I guess this is an academic's, because it's got TeX files.
And in a full path schema, this gets laid out
in a DFS order on disk.
And so what that means is that if you do a recursive grep,
then as grep is traversing this tree,
the disk is going to be doing sequential reads off of disk.
And so that's why betterFS's recursive directory traversal performance on things like grep
is so good.
But the problem with a full path indexing scheme is that when you go to rename a directory,
you have to really move it on disk to its new location in order to maintain the locality that you used during the grep.
So even though logically it's a very simple change,
just delete this arrow, add this arrow,
what you have to do on disk is sit there and copy the keys
from one location to the other.
And it takes a while.
And I think we're actually going to have to stop and wait for it.
Okay. All right.
So that's why renaming the Linux source took like 20 seconds.
And so sort of a cartoon version of what's going on here
is that if you have a file system with lots of indirection,
like inodes with pointers,
then your scan cost can be high
because stuff can get scattered on disk and doing a recursive directory traversal may require chasing pointers, but your rename
cost is really low.
On the other hand, if you have a file system that maintains locality, your scan cost is
low but your rename cost can be very high.
And so betterFS 0.2 uses what we call a zone-based schema which tries to find a happy medium where we have low rename costs and
enough locality to get good scans. Let me tell you how zones work. We basically
take this directory structure and we break it up into contiguous regions of the tree,
which we call zones. And then within each zone, we actually
lay out these files and directories in the zone
using a full path schema.
And between zones, we use indirection like an inode.
And so what that means is that if you're
doing a recursive directory scan,
then as long as the scan is staying within the zone,
then that's going to be sequential disk I/O, and that's going to be very, very fast.
And the only time it will have to do a disk seek or issue a secondary I/O is when it
crosses zones.
So if we have large zones, then those crossings will be rare and we're going to get most of
our performance on our directory scans that we got from the full path schema.
So we want large zones.
Now let's look at the cost of doing a rename.
There's two cases.
If we move a file like 1.mp4 that is the only object in
this zone or is the top object in this zone, then there's a
pointer to it.
So we can move it by just updating some pointers.
So that's gonna be a very cheap operation.
The other case is when we want to move something
like this directory here,
that is not the top element of its own.
And so in that case,
we are gonna have to copy that data over to its new home
in order to maintain the locality
that we use so that we can do um you know full path indexing within a zone so there is going to
be a cost there and so what this means though is that if we bound the size of zones then we never
have to copy more than one zone's worth of data as part of a rename so we can upper bound the size of zones, then we never have to copy more than one zone's worth of data as part
of a rename. So we can upper bound the worst case cost of a rename by keeping zones small.
So for sequential scans, we want big zones. For renames, we want small zones.
So we do the Goldilocks solution, where we get those zones to be just right,
and we keep them in the right size range by doing
splitting and merging of zones
just like, if you know, splits and merges
in B-trees; it's a very similar algorithm.
So when a zone gets too big, split it.
When a zone gets too small, merge it with one of its neighboring zones.
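Here is a toy Python sketch of that zone bookkeeping. The thresholds, the flat path-to-size table, and the way a split candidate is chosen are all simplifying assumptions of mine; the real BetterFS tracks zones inside the key-value store itself.

MAX_ZONE_BYTES = 512 * 1024            # the zone-size cap discussed below
MIN_ZONE_BYTES = MAX_ZONE_BYTES // 4   # made-up merge threshold for illustration

# A zone: one root object reachable through a pointer (inode-style indirection),
# plus everything stored under it with full-path keys (so it has locality).
zones = {
    "/": 300 * 1024,
    "/home/rob/linux": 480 * 1024,     # a big directory that became a zone root
}

def zone_of(path):
    # The zone whose root is the longest prefix of the path.
    return max((z for z in zones
                if path == z or path.startswith(z.rstrip("/") + "/")),
               key=len)

def grow(path, nbytes, split_candidate=None, candidate_size=0):
    """Account new data to the containing zone; split it if it gets too big."""
    z = zone_of(path)
    zones[z] += nbytes
    if zones[z] > MAX_ZONE_BYTES and split_candidate:
        # Carve a large subtree out into its own zone: future renames of that
        # subtree become pointer updates instead of copies.
        zones[z] -= candidate_size
        zones[split_candidate] = candidate_size

def rename_cost_bound(path):
    """Zone roots rename by pointer swap. Anything else copies its own data,
    which is at most one zone's worth, so the bound is the zone's size
    (itself capped at MAX_ZONE_BYTES by the split rule)."""
    return 0 if path in zones else zones[zone_of(path)]

grow("/home/rob/linux/fs", 100 * 1024,
     split_candidate="/home/rob/linux/fs", candidate_size=200 * 1024)
print(rename_cost_bound("/home/rob/linux"))          # 0: it is a zone root
print(rename_cost_bound("/home/rob/doc/paper.tex"))  # bounded by its zone's size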
Yeah?
How do you define what is a big zone and what is a small zone? How many files are there?
so, great question.
The question was, how big should a zone be?
And so we just determined that experimentally.
That is your question, right?
Or?
OK.
So here is a benchmark where we show the recursive directory
scan performance as a function of zone size.
And as you can see, as zones get bigger, the time to do the scan goes down, which is what
we predicted.
Here is a benchmark showing the rename cost as a function of zone size.
But there's two cases.
If you're renaming an object which is the root of its zone, then the cost is flat.
That doesn't depend on zone size, because that's just pointer updates.
If you're renaming an object
that is not the root of its own zone,
then you have to do a copy,
and that grows.
It's a log scale here,
so this is basically linear in the size of the zone.
And by choosing a max zone size,
we say we're not going to copy more than that much data.
And so what we get is a rename curve
that ends up looking like that.
And in fact, that's what you'll see in the next slide.
And so we chose 512 kilobyte zones,
which gives us most of the scan performance that we wanted
and gives us a reasonable upper bound on the rename cost.
And so here is an end of the day benchmark
for renames with the zone based BetterFS.
And BetterFS is this, here's the time to do renames of a single file of different sizes.
And you know some file systems have a flat curve.
BetterFS and Btrfs both have a bump right here.
I don't know why Btrfs does.
We do because, until you hit the zone limit, it starts to get a
little bit expensive to do the copying, but then once you hit the zone limit it becomes cheap.
And here is the time to rename the Linux source. BetterFS 0.1 is
way off the graph; it took 21 seconds. But BetterFS 0.2 is now basically
right in band with everyone else.
In fact, it's the second fastest except for XFS.
Yeah?
What made you decide on that size?
We just did it experimentally.
We found a good trade-off.
So that is a parameter you might have
to tune for different media.
OK.
Yeah.
All right, so now our rename performance
is comparable to other file systems.
So if you tune that parameter,
how does that affect the portability of the file system?
Well, of the data.
Oh.
So it's an MKFS time parameter.
So you can tune it to the media
on which you're building the file system.
If you copy files,
zones are an internal thing.
They're completely hidden
to the semantics of the file system.
So if you just do a CP of some files,
they don't make a difference.
The only time I can see it coming up
is if you
dd'd a hard drive onto an SSD.
And then you might have the wrong zone size.
Yeah?
In your example, does all the file metadata
for a single directory reside in one zone?
I think the answer is yes.
The answer to the question was,
does all the file metadata for a single directory
reside in one zone?
Yes.
So this sort of assumes
a relatively reasonable distribution
of files and directories.
If I had one set of files that was huge, like one file with lots of data, and I had a whole separate set of files where there were lots of little files, might that not work so well?
Actually, it does.
And you've hit on a so I will.
I will.
You've discovered a subtlety here that I wanted to gloss over.
But it's a cool subtlety.
It's like, oh, I get to brag about something, not like, oh, crap, I have to tell them.
So the question is, does this system depend on the distribution of file sizes or say,
like, not having a directory with lots of little files in it, I think is kind of your question. Well, if a directory has lots of little files in
it, then that directory gets counted as a big directory and it's going to be made into
the root of its own zone. And so any rename of that directory will only be a pointer update.
So, and then all the individual files within it, if they're all small, if I rename a single file,
if it's a small file, I can copy it.
It's cheap to copy.
The actual rules for what... Essentially the rule is that if you're the root of your own
zone, you can be as big as you want because moving you just is a pointer update independent
of your size.
If you're not the root of your own zone, then you are constrained by the zone size. And so a small file,
initially it's gonna be just in the zone of its directory,
and as it grows, when it hits a certain size,
it'll be moved into its own zone, for example.
And then after that, any renaming of that file will be cheap.
So really, really cool question.
Thanks.
Okay.
Someone else?
I don't know how I'm doing on time.
15 minutes?
That's worst case.
All right.
So we'll bang this out.
Okay.
So how do we solve the sequential write performance problem
without giving up on full data journaling?
Let's review the problem. The problem is there's this tree and there's this log. Everything
gets written twice. Whenever you insert a big value into the tree, it gets written in
the log. Then when you insert a small value, it gets written in the log. That's not such
a big deal. Then later on, these values get put into the tree and the tree gets, the nodes of the tree get made durable
by running them back to disk.
And so now these items got written twice each.
Bam, we're at half this bandwidth.
Whoops.
Because everything's just getting written twice.
So what we use is a system that we call late binding journal.
And the idea behind a late binding journal is, we're going to write these big values to disk
as part of the tree anyway,
why don't we just put a pointer to that data in the log
rather than writing it a second time in the log.
But we still want to maintain the semantics
and the temporal ordering of what happened.
So when the user inserts key 1 with a big value v1,
we say that insert happened now, but we don't include the value, so this is just a small message. For small values, we're going to handle them like normal, and I'll explain why in a minute, so they just continue to go in the log. And then time passes, and now we want to make this in-memory log buffer durable, so we want to write it out to disk. But we can't make it durable on its own, because it's referencing some value in the tree. And so what we first do is write out this tree node containing this big value, and then we put an entry in the log saying: this message, its value is stored here on disk. And once we've written this node out, then we can make the log durable. Now the
log is consistent on disk. This does make it a little bit more complicated to do recovery.
I'm not going to go into that. It's still entirely possible. It's just a two-pass recovery.
Oh, and of course we handle the small values like normal. So you might
ask, if this is so great, why do I only do it for big values,
why not small values? And there's sort of two reasons going on. When we use a late binding
journaling technique, making a value durable requires writing out a whole node. And those
are big, like four megabytes. So if a node only contains a single thing that we want to make
durable, like a four kilobyte write, that's incredibly wasteful. It's just, it's cheaper
to make that four kilobyte write durable by just appending it to the log.
But for a big value, if you're writing out 4 megabytes of data and it's sitting in a
4 megabyte node, then it's very efficient to just write the node out to make that data
durable.
Secondly, in the B to the epsilon tree with caching, it turns out that, well, backing up, let's talk about small writes first.
Small random writes in a B to the epsilon tree slowly percolate down the tree and they
may end up getting written to disk several times anyway as they make their way down the
tree towards the leaves.
So what's one extra write to the log?
It's not really going to harm performance.
If you're already getting written to disk four times, why not make it five?
But for big values, like what occurs
during a sequential write, they actually
are going to end up opening up an in-memory cached root-to-leaf
path down that tree.
And what's going to happen is the first few writes
might be a little bit slow, but the subsequent writes,
all those later writes as a part of that sequential write,
are going to go straight to a leaf,
and they're going to get written to disk one time in the tree.
And so then writing them a second time in the log
is a big overhead, it doubles the cost.
So it's worth doing it for big writes
but not for small ones.
And I think this is a really cool point
about write optimization and the interaction
of write optimized data structures with things like logs.
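Here is a toy Python sketch of that late-binding idea. The size threshold, record formats, and node placement are assumptions of mine for illustration; the real implementation binds values at the granularity of tree nodes and needs a more careful recovery procedure.

LARGE_VALUE_THRESHOLD = 64 * 1024     # made-up cutoff between "small" and "big" values

log = []            # the journal: records appended in order
tree_nodes = {}     # stand-in for durable B-to-the-epsilon-tree nodes on disk
bytes_logged = 0    # how much value data went through the log
bytes_in_tree = 0   # how much value data was written as tree nodes

def write_node(node_id, payload):
    global bytes_in_tree
    tree_nodes[node_id] = payload
    bytes_in_tree += len(payload)

def append_log(record, payload=b""):
    global bytes_logged
    log.append(record)
    bytes_logged += len(payload)

def journal_insert(key, value):
    if len(value) < LARGE_VALUE_THRESHOLD:
        # Small values go into the log as usual: forcing out a whole multi-megabyte
        # node just to make 4 KiB durable would cost far more than logging it.
        append_log(("insert", key, value), payload=value)
    else:
        # Big values: record that the insert happened now, but bind the value later
        # by pointing at the tree node that has to be written anyway.
        append_log(("insert-unbound", key))
        node_id = "leaf-for-" + key          # hypothetical node placement
        write_node(node_id, value)           # the value hits disk once, as tree data
        append_log(("bind", key, node_id))   # the log gets a pointer, not a second copy

journal_insert("small.txt", b"x" * 4096)
journal_insert("video.mp4", b"y" * (4 * 1024 * 1024))
print(bytes_logged, bytes_in_tree)   # 4096 bytes logged; the 4 MiB value written once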
Okay, so let's check that we actually accomplish our goal.
Here is BetterFS 0.2's
sequential write performance,
and it's in the high 90s.
We've even tuned it
where we can get it over 100 if we want,
compared to BetterFS 0.1,
which is a little under 30.
So we have achieved our goal
without having to give up
full data journaling semantics.
The last problem we faced was file deletion.
And the problem here is that we just had to issue a whole bunch of delete commands for each block of the file.
The solution is quite simple.
We just added a command to the B to the epsilon tree to insert a single message which says,
delete all this stuff in this range.
Simple message.
And so that means that in order to delete a file,
you just drop that message in the tree and you're done.
And then that message offline can go through and garbage
collect all the nodes of the subtree that got deleted.
But in fact, there are some opportunities to optimize stuff.
When this delete message is sitting here,
based on these pivot keys,
we can actually tell this whole subtree is now garbage
and we can garbage collect it
without even having to read it in and look at it.
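Here is a toy Python sketch of that range-delete message, in the same style as the earlier sketches; the key and message formats are assumptions for illustration, not the real BetterFS encoding.

data_index = {("big.iso", i): b"\0" * 4096 for i in range(1000)}   # (path, block #) -> 4K block
range_deletes = []                                                 # pending [lo, hi) delete messages

def unlink(path):
    # One message covers every block of the file: constant work at delete time,
    # instead of one delete per 4K block.
    range_deletes.append(((path, 0), (path, float("inf"))))

def is_deleted(key):
    return any(lo <= key < hi for lo, hi in range_deletes)

def read_block(path, block):
    key = (path, block)
    return None if is_deleted(key) else data_index.get(key)

def garbage_collect():
    # Later, off the critical path, reclaim everything the message covers. In the
    # real tree, a whole subtree whose pivot range falls inside the message can be
    # freed without even being read in.
    for key in [k for k in data_index if is_deleted(k)]:
        del data_index[key]

unlink("big.iso")                 # returns immediately
print(read_block("big.iso", 5))   # None: semantically the file is already gone
garbage_collect()
print(len(data_index))            # 0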
And so these messages improve both the latency and the throughput of deletes in BetterFS.
So now here's a quick benchmark of doing deletions in BetterFS 0.2 and 0.1. This is the size of the file being deleted, and this is the time.
BetterFS 0.1 starts here and goes off into space. BetterFS 0.2 has a flat deletion curve,
and it's actually about 30% faster than EXT4.
OK, so we fixed the problems that we had.
Did we break anything that we had before?
Let's look at random writes.
This is random write throughput.
BetterFS 0.1 was already orders of magnitude faster than a file system like EXT4.
BetterFS 0.2, through tuning and other fixes that we made in the code, is even faster.
Here's the file creation benchmark.
BetterFS 0.1, again, was orders of magnitude faster than EXT4 in terms of creating files.
BetterFS 0.2 is even faster, although it has this interesting artifact here. What actually
happened there? Maybe you guys can figure it out. As we make more files, there's initially
one zone in the file system and it gets bigger and bigger and bigger and then it comes to
a certain point
when every new file that we create
causes a new zone to get split off.
And so it goes through this phase
where every new file for a while is causing zone splits, splits,
splits, splits, splits, splits.
And so performance drops.
But then after a while, everything's split off nicely,
and the file system continues to chug along at a good rate.
Given time, I think
I'll skip the real
application performance. You can see
it's just that we actually show that there are many real applications, like Git and rsync and a Maildir mail server,
where you get real
benefit from the performance that we have in this file system.
Coming back to our original goal, we now have a file system that's good at sequential I/O.
It beats update-in-place file systems on random writes by orders of magnitude.
It has renames and deletes that are just as fast as a conventional file system.
It's got recursive scans that are somewhere between 3 and 10 times faster than other file systems.
And it's got metadata updates orders of magnitude faster than other file systems. And so now I hope you're convinced
that
with these new data structures
that came online
in the 90s and the 2000s
we can now
build file systems that overcome
design trade-offs
that we have been stuck with for decades
in the file system community.
And by using these new data structures,
we can build file systems that offer
across-the-board, top-of-the-line performance.
So when someone says they're building
a write-optimized file system,
this doesn't mean it's like a niche file system
just for small random writes or something like that.
It's a file system that, done right, does well at all these basic operations.
And these new data structures create both a need and an opportunity. There's an opportunity here
to revisit lots of these file system design decisions that we've been living with for decades.
But there's also a need because, you know, the write-optimized data structure's performance
profile is so radically different from things like B-trees that we've been used to working with that,
often, if you just take a write-optimized data structure and drop it in to replace the B-tree,
you will not get the performance you might hope for. You need to redesign the rest of the system
around the performance of these new data structures.
So mainly it's an opportunity, but it's also a need.
And BetterFS is an open-source project.
You can get the code at betterfs.org.
Thank you.
Thank you.
Yeah. Thank you. So, okay, so your question is, how do we handle concurrency in, like, directory scans?
And POSIX is either wonderful or terrible, depending on your point of view on this. So POSIX basically says that when someone does a readdir, the operating system can do whatever the hell it wants
in terms of consistency with other applications.
So stuff can show up that's been deleted, and stuff that's been added since you started your readdir might not show up.
POSIX has basically no consistency guarantees at all.
The way, you know, I didn't get into it,
but this B to the epsilon tree implementation that we use
actually supports transactions.
And so readdir, you know, every system call
is essentially implemented inside of a single transaction by itself.
That means that when you do a read dir, you are going to get a consistent view of the
part of the directory that you read.
For big directories, I think you actually have to end up calling read dir multiple times,
and so someone else could sneak in with a different transaction and insert something
and you might miss it.
But that's what POSIX allows us to do.
Otherwise what we would have to do is we could,
we could in a single transaction pull out the entire directory listing
and store it for you and hold onto it or something like that.
But like, you know, we efficiently support directories
with millions or billions of files in them,
so we're not going to materialize that whole directory in RAM for you. Have you considered some of the other things that POSIX and various Unix systems
allow, such as handling those hard links and maintaining the file handle open after the
file is deleted, and things like NFS exports where you need a persistently stable, small, unique identifier for a file,
and some of those other things that a regular user application doesn't need to deal with,
but an actual production file system frequently does.
Yep.
Okay, so you're asking about hard links, keeping a file around if it's open, and NFS.
Yeah, NFS file handles, which traditionally on Unix are based on inode numbers.
Yes.
So hard links with the zone-based file system, whenever you make a hard link to a file, we
just put it in a zone by itself.
Keeping files around as long as they're open, we implemented, that was years ago, I forgot
how.
NFS IDs, I think we actually give every file a bogus inode number, but I don't think we've ever actually tried NFS
exporting a BetterFS file system.
But we just throw it in, you can just have a new counter and you just put the counter
in there and make it the inode number for the file.
You need to do lookup by ID.
Pardon?
Lookup by ID is going to be expensive in your system because the B-tree is not ordered to support that.
Right.
So we could add another index, I guess.
One of the beauties of having these B epsilon trees
is indexes are cheap.
Fine.
You want to make that lookup cheap?
Okay.
I'll add another index.
Like I think at this point,
debootstrap basically works on BetterFS.
Although I'm not sure we actually booted it,
but we did get debootstrap
to actually install a file system
for Debian on a BetterFS system.
Yeah, that would require at least a handful of hardware.
Exactly.
That was actually a blocker.
Someone up here had their hand up.
Sure.
Do those transactions get you snapshots,
and do you have any checksumming on them?
The transactions we've...
So the B to the epsilon tree we're using also has snapshots,
which are implemented in a distinct way from the transactions.
The transactions are done using MVCC,
and I don't think it's the right way to do snapshots,
but it does have snapshots.
We don't have
snapshots exported
to the file system layer
nor do we have
transactions exported
to the file system yet.
It's something
that would be nice to do.
Checksumming?
Checksumming, there is checksumming on the nodes of the B to the epsilon tree, but it's not exposed in any way; like, that's just all internal to the B to the epsilon tree code.
Yeah? Do transactions lock the entire tree, or it seems like you could do something cool with having them chase each other down the tree?
So, okay, let me tell you about the B to the epsilon tree code.
The B to the epsilon tree code was actually developed by a company called Tokutek, which was founded by several of the co-authors on this paper.
And it was a commercial product.
It's now owned by Percona.
And it's, I mean, I'm blown away with the quality of engineering.
It's extremely highly concurrent.
So transactions do not lock the entire tree at all.
So they proceed in parallel quite well.
So the one thing I noticed was not on your chart of comparison was random read performance.
That's because everyone's screwed on random reads. Random reads are hopeless
because unless you somehow replicate the data 500 times in different orders,
and now you just kind of hope that my next sequence of random requests
just happen to be in order in one of your replicas,
once you commit to the data on disk in some order,
I can always come up with a sequence of read operations
that are going to force you to do seeks on a rotating disk.
So everyone is hopeless for random reads.
And so we didn't even bother.
Flash.
Flash.
So Flash actually, I don't want to go over time here,
but Flash, so all these benchmarks
were on a rotating disk.
We have done a few benchmarks on SSD.
And you might think, well,
since the strength of this data structure is good random write performance
and Flash is good at random writes,
why do we need this data structure?
And it turns out you still get something pretty cool.
So here's random write performance
of BetterFS 0.2 versus ext4 on an SSD.
It's not the 100 times faster that we were before, but it's still something
like six or seven times faster. But yeah, we use more complicated data structures. Maybe
there's more computation. It's a little bit of a bottleneck. So our sequential write performance
on Flash is not quite up to snuff. But I think this is an engineering problem, not something fundamental to the data structures.
Yes.
So... So mostly, in general, we need to read and modify 4 bytes and then write 4K again.
Yes.
Okay, so your question is about the random write benchmark I showed was 4 bytes.
How about 4K?
4K, I think the absolute numbers go up just because the bandwidth.
Now you're accomplishing 4 kilobytes of write per I/O,
but essentially the gap is the same.
But the, well, it's not quite the same.
It closes a little bit because in EXT4,
when you do a 4-byte write to a block,
it has to read in the block.
And we don't because we just issue an upsert saying,
change this 4 bytes of this block to this new value.
So we get a big advantage there.
We didn't know if this would actually be a very useful thing to do
because we didn't know if any applications ever did
random four-byte writes to uncached blocks.
But there was another paper at FAST the same year
that we published our first BetterFS paper
where they modified the Linux page cache
to actually support sub-block writes.
And they got quite a substantial performance improvement on some applications.
So there are applications that do this weird thing
where they just blindly write some bytes into the middle of a block of a file.
All right.
Time's up.
Thanks for listening.
If you have questions about the material presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe@snia.org.
Here you can ask questions and discuss this topic further with your peers in the developer community.
For additional information about the Storage Developer Conference, visit storagedeveloper.org.