Storage Developer Conference - #7: FS Design Around SMR: Seagate’s Journey and Reference System with EXT4

Episode Date: May 16, 2016

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast. You are listening to SDC Podcast Episode 7. Today we hear from Adrian Palmer, Drive Development Engineer with Seagate Technologies, as he presents File System Design Around SMR from the 2015 Storage Developer Conference.
Starting point is 00:00:47 My name is Adrian Palmer. I'm from Seagate and I'm going to be sharing with you our experiences in working with SMR. Not so much designing it, because that's what we do at Seagate, but allowing software to work with it. This is a SNIA global tutorial, so other than this time and one other slide, I won't be talking specifically about Seagate, but more about our open source project with ext4 called SMR FFS. So this will apply to the standard that we work with for SMR, as opposed to being designed directly for our drives.
Starting point is 00:01:29 So it would be an effort that we give to the industry to enable this technology to be used. This is our obligatory legal notice, because it is a tutorial and it will be used everywhere else. You can read that at your leisure later. But on to our SMR-friendly file system. SMR is such a fundamental drive technology that it is being embraced by everybody, because this is a major change. It is a game-changing change, it is a disruptive change: the fundamental assumptions of file system management are being changed. The notions of random writes now have to be addressed in the same way that we would address them for tape. We are leading the way in providing the standards-compliant I/O stack that resides below the file system, and a reference file system, so that we can enable the adoption and use of these drives for you to use in your businesses.
Starting point is 00:02:38 Using these new commands, we make and maintain the file system for performant operation. This allows us to issue writes that are in the correct order, and allows us to do I/O that is, from the drive's perspective, beneficial and compliant. We're sharing lessons learned because this is a reference design. As I'm the lead maintainer for this EXT4 system, I'm not going to say it's perfect, but I am saying I'm going to share my experiences with you. I want your feedback, and I want you to know that this is the way that we suggest that file systems be modified such that SMR can be adopted. It doesn't have to be exactly the way we do it, but these are the concepts that we came up with to present to you. There are several assumptions that I'm going to make during this presentation.
Starting point is 00:03:41 The first of which is that you're familiar with file systems in general. I expect some of you in here are file system developers, which is a great place to be. And others are going to be interested. I'm also going to assume that you're going to have a strong familiarity with the SMR standards, the ZAC and the ZBC standards, which were presented in the last couple of presentations.
Starting point is 00:04:10 Our reference code is on GitHub at this address here. And also, if you're not familiar with either ZAC or ZBC or SMR, there are some tutorials that I've referenced here, as well as the official sites for the standards. So the first thing that we're going to start off with is just a basic talk about what is a file system. A file system is an essential piece of software in the system, and some would even argue the most important. It is unlike any other piece of software: with other software you may lose a little bit of data; however, if a file system crashes, you may lose all your data. So this is a fundamental
Starting point is 00:05:04 piece of software for many operating systems. It organizes and stores the structured data on the unstructured media. Right now, you just have a bathtub full of blocks. But what we're doing with these new commands is we're organizing them a little bit so that there's a little bit more structure to the media. That way we can use that to allow our data to be structured in such a way that we can both write it efficiently and read it efficiently. We take all of this space and manage it and allocate it as we need. A file system stores the metadata and the data.
Starting point is 00:05:51 This differs from a database. I mean, conceptually a lot of people would guess that a file system and a database have similar functions. However, a database in and of itself does not store data. For example, I can store every metric about me, but I cannot store me in that database. However, a file system, while storing every piece of metric about a file, also stores the file in it. And it provides maintenance and usability functions in order to consume and use that data inside the files. We have some basic file system requirements that we need to be aware of if we're going to
Starting point is 00:06:40 start writing for SMR. First of all, the super block. It's the mount point. It's a point where the mount function looks at the disk and says, this is what type of file system this is. This is how I should handle it. Unfortunately, it has to be in a known location on disk because the mount system cannot take the time to look through the entire disk or the entire partition to search for what it's supposed to be doing.
Starting point is 00:07:06 So that's one problem we have to deal with. On the complete other end of the spectrum, things like the journal, which will allow for a data recovery or allow for different scenarios like disasters, power failures, etc. can easily be written as a circular log. And in some cases, that's almost expected because the data in the journal is temporary. As soon as it gets written on the main disk, it can be discarded from the journal. And such that it can write over its tail. Depending on different file system requirements, the other types of information inside the file system can be random or sequential. File records, some file systems block them in one certain location as a table at the front of the disk,
Starting point is 00:08:02 and other file systems write them individual blocks all over the disk in various places. Block maps, how do we tell where those files have their data? How do we update those? Is it going to be a set of number of blocks that we have to set aside for the data no matter how big the file is, or is it going to be a dynamic size? How do we structure that? Same thing for our indexes and queries. Like, for instance, in POSIX, we're very familiar with the file and folder system, which is an index, which we need to be able to maintain and read at the same time. And, of course, the data.
Starting point is 00:08:55 If your data is set in a constant size and you know exactly what you're going to be writing, then you can easily have it all in one place, all allocated at the same time. Or if it's going to be a mix of data, like in most systems, most user systems especially, we don't know how long it is, we don't know if something is being modified in the middle. We need to be able to account for all of that. For SMR drives, as we've seen in the previous presentations, this introduces the concept of the zone. This is different than the sector. The sector is usually either 512 bytes or 4K, depending on which type of drive you have. And that is the atomic unit of rewrite access for the drive,
Starting point is 00:09:39 and that's what we refer to a lot of times as the block. Each sector is independently addressed in the LBA space and we can do standard things with it like read and write. Other than that, we don't really care, as long as we can do read and write. Zones, on the other hand, allow us to have an atomic performance rewrite unit, such that we can issue the commands of reset write pointer and reset zones. Typically, this is 256 megabytes in size. And it is not directly accessed. Because it is made up of LBAs,
Starting point is 00:10:19 we actually address it by the first LBA or the first sector inside the zone. So, for example, sector 0 would be the beginning of zone 0, and sector 40,000-something-other would be the beginning of zone 1, the second zone. So each one, for the purpose of the commands, is addressed by the first LBA in the zone. And as such, because it's a management unit, each zone has its own state. We look at the write pointer. We look at the condition of the zone, whether it's full or empty or open or closed. We look at the size of the zone, which is 256
Starting point is 00:11:12 megabytes, and we look at the type of zone, whether it's a write pointer zone, which is sequential write preferred or sequential write required, or whether it's a conventional zone. For comparison to other disks, this graphic shows that the write profile of SMR is actually very similar to some other media, and so the ZAC and ZBC standards can be applied across a whole range of devices. In conventional drives with CMR, you have random access. You can write the data anywhere you would like on the disk and rewrite it. For tape, flash erase blocks, or SMR zones, it's all sequential write for optimal performance. The data has to be written at the beginning, and it progresses through the zone, or the erase block,
Starting point is 00:12:12 or the tape from the beginning to the end in a very optimal way. And so this standard applies to not just SMR disks, but also many other types of devices, such that the whole stack can eventually be changed and still work with legacy disks. So what we see is not just a second path down from the I/O stack into the kernel, but a path that can supersede and augment the traditional I/O stack. And so as we've taken steps to modify the stack, we have shown in our tests that it can work and it does work with conventional drives also. It's just that some of the data that's not on conventional drives, the zone information,
Starting point is 00:13:06 for example, may need to be synthesized depending on the conventional drive. The 256 meg zone with SMR could very well be applied to sections of the conventional drive, because those could be presented as conventional zones without a write pointer. Usually sectors have other information like sector number, identification, whatever, right? Where is that in your chart there? I mean, is it the same throughout all of them? The sectors are each of those blocks right there. Each of the broken-apart sections would be the zones. So each of those, like for instance, the first one
Starting point is 00:13:56 would be LBA 0, the second one would be LBA 1, etc. So yes, the sectors are numbered. The sectors are synonymous with the LBA space. It's just that those LBAs are grouped within a zone. Is there any specific reason why in your SMR you have the unwritten and the sequential sections in the order that you have them in? I have the sequential data at the beginning to show that it has to be at the beginning of the zone. The unwritten section has to be
Starting point is 00:14:32 at the end because that's after the write pointer. The write pointer starts at the beginning and marches through the zone. Is the order not necessarily the number? Correct. Yes, the zones don't have to be a specific size. It's just typically 256 megs.
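As a rough illustration of the zone model being described, here is a minimal C sketch of the per-zone state a host might cache after querying the drive. The field and type names, and the fixed 256 MiB zone size, are assumptions for the sketch, not the on-wire ZAC/ZBC format or Seagate's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative fixed zone size: 256 MiB of 512-byte sectors (not mandated by the standard). */
#define ZONE_LBAS ((256ULL * 1024 * 1024) / 512)

enum zone_type {
    ZONE_CONVENTIONAL,        /* random writes allowed, no write pointer */
    ZONE_SEQ_WRITE_PREFERRED, /* host aware                              */
    ZONE_SEQ_WRITE_REQUIRED,  /* host managed                            */
};

enum zone_cond { ZONE_EMPTY, ZONE_OPEN, ZONE_CLOSED, ZONE_FULL };

/* Per-zone state as a host might cache it after a REPORT ZONES query. */
struct zone_state {
    uint64_t start_lba;      /* zones are addressed by their first LBA  */
    uint64_t write_pointer;  /* next LBA that may be written            */
    uint64_t len;            /* zone length in LBAs                     */
    enum zone_type type;
    enum zone_cond cond;
};

/* With fixed-size zones, the zone start is just the LBA rounded down. */
static inline uint64_t zone_start_of(uint64_t lba)
{
    return lba - (lba % ZONE_LBAS);
}

/* Everything before the write pointer is data; only the write pointer itself
 * (or any LBA, for a conventional zone) is a legal place to write next.     */
static inline bool lba_is_writable(const struct zone_state *z, uint64_t lba)
{
    if (z->type == ZONE_CONVENTIONAL)
        return true;
    return lba == z->write_pointer;
}
```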
Starting point is 00:14:49 If we were to apply this to something like flash, then because the erase blocks are a lot smaller, the zone size would be a lot smaller. Question on the 256 meg. That probably doesn't really line up with an end-of-track condition, right? So does zone 2 follow zone 1 on the very next sector, or at the end of the track? That is a good observation, especially when you get from outer disk to inner disk. That will not be true.
Starting point is 00:15:19 And so it's the – and this goes into what the manufacturer chooses to do more so than what the file system chooses to do. But it's the manufacturer's choice whether they want to start at the next sector or push that out to the next track. But since the zones are meant to manage and break up the SMR, one of the things that happens with SMR physically is that there's a gap track inside the disk that allows it to reset that particular zone without affecting the other zones. So that does start on the track boundary.
Starting point is 00:16:00 So one of the things that was mentioned in one of the previous presentations is that the idea of LBA space to physical space isn't exactly what we expect anymore. Because previous generations of file systems long ago actually lined up to the cylinders, heads, and sectors, and lined up to the LBAs. But now we don't do that anymore. We're lining up to zones, which is one of the things that we'll need to do as I talk about what's on this slide: actually, the partition. Since the write pointers are in the zones, and each starts at the beginning, we have to do a couple of things. One, we have to align our partitions to the beginning of a zone, and we have to make our allocation units inside the file system,
Starting point is 00:16:59 whether they're called groups or allocation units or whatever in various file systems, the same size as the zone. I'm not saying that is an absolute truth, but it's a good idea. For example, in EXT4, with a group size of 128 megs, if I were to put two of those in one zone, one of them would start in the middle. And if I were to write to the second group, then all of a sudden I would move the write pointer past the first group and I would be unable to write to the first group. So if I keep the group size matched with the zone size, I'm able to allow a forward
Starting point is 00:17:39 write throughout the allocation unit of a file system that matches the management unit of the disk. This slide is just simply showing that that's what we want to do. We want to remap the block device as a block device. We're not doing a whole lot different other than managing it. So, the zone is like a sector and data? Not quite. The sector is still 512 bytes or 4K. It's a very small amount of data. The zone is 256 megs. Sending 256 megs to the drive at one time will take a very long time
Starting point is 00:18:29 to write. So we don't want to break that up and have the requirement of writing a whole zone at one time. We still have the LBA space, we still have the sectors, but in the order that we write them, we want to march forward. No, in drive managed there are ways to do that. However, in host aware and host managed, especially in host managed, if an I/O arrives out of order at the drive interface, it's returned as an error. It is not corrected or changed. No, because, for instance, if I wanted to write LBA 3 and then LBA 2, that's a simple example.
Starting point is 00:19:31 If 3 is sent to the disk first and is written, I cannot go behind it and rewrite 2. So they have to arrive in order, which means that in the stack, in the Linux world where I've been working, we need to look at the I/O scheduler, and we need to look at the SCSI layer, and we need to look at the SD driver, in order to ensure that the in-order writes that were sent from the file system maintain their order inside the stack. And that's not always guaranteed, which is why we need to have some reordering algorithms inside, especially the SCSI layer, so that they're sent down to the drive in order.
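A hedged sketch of the kind of ordering check the lower layers have to respect: a write to a sequential-write-required zone only succeeds if it starts exactly at the zone's write pointer, which is the rule the drive itself enforces. This is illustrative host-side logic, not the actual Linux SCSI/SD code.

```c
#include <errno.h>
#include <stdint.h>

struct zone {
    uint64_t start;          /* first LBA of the zone */
    uint64_t write_pointer;  /* next writable LBA     */
    uint64_t len;            /* zone length in LBAs   */
};

/*
 * Accept a queued write only if it starts exactly at the write pointer.
 * A host-managed drive enforces the same rule itself: out-of-order I/O
 * (say, LBA 2 arriving after LBA 3 was written) comes back as an error;
 * it is never silently reordered by the drive.
 */
static int submit_sequential_write(struct zone *z, uint64_t lba, uint64_t nr_lbas)
{
    if (lba < z->start || lba + nr_lbas > z->start + z->len)
        return -EINVAL;              /* write does not fit inside this zone */
    if (lba != z->write_pointer)
        return -EIO;                 /* out of order: would be rejected     */

    /* ...issue the actual WRITE command to the device here (omitted)...    */

    z->write_pointer += nr_lbas;     /* advance past what was just written  */
    return 0;
}
```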
Starting point is 00:20:20 So for writes that land deep in the zone, can't you just post those, and you guys will go scavenge the rest of the zone, do a read-modify-write, and lay down the whole zone again? That is certainly an option; however, it is a rather expensive option, which is why, in a file system that we want to be natively performant, we don't want to allow that. We don't want to create that situation. You have the first line there.
Starting point is 00:20:57 Two different block sizes. I'm assuming that this complicates the SMR a little bit, okay, for two different block sizes. From our perspective, no, because they're still LBAs. It doesn't complicate it any more than it complicates the LBA space. All right. I'm surprised that this is a relatively new technology that's being applied. I'm surprised that you're looking at 512-byte blocks because the industry is moving towards 4K. Am I correct?
Starting point is 00:21:34 It is moving, but extremely slowly. There are still people out there who insist on 512 blocks. Maybe this will accelerate that. It very well might. But then again, they may not want SMR either for a while. Well, they want capacity. Yes, but capacity is a selling point. If you want capacity, it's going to be SMR,
Starting point is 00:22:00 which is more than likely going to be 4K. You had a question. There was an article a few years back from Seagate itself regarding the KB descriptor, which provided some idea about how SMR could be used. So essentially, that idea meant that whenever I write a sector, you need to know which subsequent sector was overwritten
Starting point is 00:22:28 because of that. Is that provided by the drive? That was more of a conceptual-level article. Yes, that still needs to happen, but it's abstracted in such a way that we can actually use the LBA space to order that. For instance, you cannot write anything before the write pointer in LBA space from the beginning of the zone. For instance, for the first zone,
Starting point is 00:22:53 if I have written from LBA 0 to LBA 100, then that means in order to write something between those LBAs, I have to rewrite the zone. But LBA 101 is at the write pointer and can be written to. But we cannot know that, vertically below LBA 0, is LBA 10, and that it is getting overwritten also, or that I can write to LBA 20, skipping that. We don't know exactly where physically they line up on this. We just know that in the LBA space that is true. Because in all likelihood, LBA 1 is right next to LBA 0
Starting point is 00:23:38 instead of on top, because it doesn't span a whole track. But we don't know exactly how long those tracks would be, because it varies between the outer diameter and the inner diameter. There is a good reason for that. One is track interference. Because if you're always resetting back to the same point, eventually you're going to corrupt the track right next to it, and it's already very small as it is. And they just keep getting smaller and smaller, generation after generation.
Starting point is 00:24:25 The second reason for that is that on some zone types that may come in the future, that may be a physical impossibility. For example, if this is applied to flash and flash groups, flash physically cannot rewind to the middle of the zone. It's all or nothing. So we wanted to make sure that it was consistently applied inside the standard. Any other questions?
Starting point is 00:25:04 Starting to solve the problem of the file system, I have three steps that I've broken this into. The first one is I simply want to separate the problem space. In this case, I'm dedicating a zone to each problem subspace. The user data needs to have its own solution to be copy-on-write. The file records also need to be copy-on-write. The indexes need to be copy-on-write. The superblock can't be copy-on-write, but I need an algorithm to trick the drive into thinking it is. The trees for lookups need to be copy-on-write, and the allocation containers themselves need to be copy-on-write if I change them. So instead
Starting point is 00:25:54 of having one solution that I can apply to all of these, especially in the time that I have to work on ext4, I'm separating the problem space. The first one to look at is the GPT and the superblocks. If we're looking at the first partition on a disk and we're starting at zone 0: LBA 0 up to LBA 33 is taken up by the GPT. The superblock is written right after that. The file system must know where the superblock is at mount time, so it has to be a known location. It's updated infrequently, more or less at the dismount of the drive. The superblock in various file systems is written in different places, but in EXT4 it's written in the first group and in groups that are powers of 3, 5, and 7.
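For readers who don't live in ext4: the rule being referred to is ext4's sparse_super layout, where backup superblocks sit in group 0, group 1, and groups whose number is a power of 3, 5, or 7. A small illustrative check:

```c
#include <stdbool.h>
#include <stdint.h>

/* Is n a power of base, i.e. n = base^k for some k >= 1? */
static bool is_power_of(uint32_t n, uint32_t base)
{
    if (n < base)
        return false;
    while (n % base == 0)
        n /= base;
    return n == 1;
}

/*
 * With ext4's sparse_super feature, superblock (and group descriptor)
 * backups are kept only in group 0, group 1, and groups whose number
 * is a power of 3, 5 or 7 -- the "3, 5 and 7" rule mentioned above.
 */
static bool group_has_super(uint32_t group)
{
    if (group <= 1)
        return true;
    return is_power_of(group, 3) || is_power_of(group, 5) || is_power_of(group, 7);
}
```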
Starting point is 00:26:58 And also the GPT is written to the last zone. The copy-on-write update needs to happen at the write pointer, because that is a requirement of the zone commands. The scheme that I came up with is that the superblock is written at the first zone. But I also have other metadata that I need to put in there. In this case it will be the trees for the group descriptors that I need to put in there too. So I will write the superblock at its known location such that the mount can read it. And then I'll write a section of the trees, and then I'll write the superblock right after that at the write pointer, so that at mount time
Starting point is 00:27:50 I can look at the known location for where the superblock is, and I can look at the write pointer, back up one LBA, and read the current superblock. That way I have a check built in for the superblock and the updated version. So what I've just done with that is I've given a known location. I've given an algorithm for it to be copy-on-write. I've not violated any I/O rules, but I still have the most current information that I can get when I mount the file system. The wipe algorithm, whenever that gets full, is that I'm going to have to have a copy in the last zone that does the same thing,
Starting point is 00:28:35 but at that time I will wipe the first zone with a reset write pointer and immediately write the GPT and the superblock back there, so that at that point it is consistent and it is the primary location for finding that data when the drive is mounted and the file system is mounted, and it matches the information at the last zone on the last part of the disk. There's a fixed size per zone, which implies a fixed number of sectors. The standard does not require that any of the zones on a drive be the same, but the
Starting point is 00:29:31 manufacturers are making the zones on a drive to be the same. So at this point, all the zones on known drives are 256 megabytes. Can that be recast? Not in the field. That is a hardware format on the media. Do other file systems embrace this, like XFS? Is that in their development pipeline? XFS has looked at this. XFS is a bit more friendly, and the XFS maintainers are pushing for more of a drive-managed solution,
Starting point is 00:30:17 more so than a change in XFS. XFS is relatively mature. It is actually slated for its last round of updates within the next few years. Are these the same zones? Sorry? Are these the same zones that have been used in the past for data rate? Is it different zones?
Starting point is 00:30:39 Yes. These are zones specific to the ZAC/ZBC standard. So you have two zones in a drive today. You have the zone that gives you a standard data rate per zone, and you have the shingle zone, too. Is that correct? As far as SMR is concerned, we only have one type of zone, which is the shingle zone. The ZAC/ZBC does not put any specifications on the transfer rate. As far as the drive goes, we can't guarantee that either, especially in relation to SMR.
Starting point is 00:31:20 In the past, all drives operated with this technology called zoned bit recording, in which each zone had a specific data rate. Do you know what I mean by that? I think so. It was able to increase the capacity per disk by allocating specific regions for bits to be stored, and I'm assuming that that's still the case today with every drive. It is. So you do get different bandwidth depending on the zone you access. I'm sorry? It is the same, it doesn't change from PMR. It's the same zone?
Starting point is 00:32:11 It's different throughput depending on the zone you're accessing. Okay. Depending on where the zone is relative to the OD or ID. So you don't have two zones. You don't have two types of zones. You have one single zone. So it depends on the model of the drive. It's host aware. You have a choice of having only write pointer zones. All right, I understand. Thank you.
Starting point is 00:32:36 Yeah, the different transfer rates depend on the diameter. You're going to get a lot higher transfer rate at the OD as opposed to the ID because of the difference in linear velocity. Any other questions? Journals. Journals are updated very frequently. The solution I came up with is to have a circular buffer made up of several different zones, depending on how big you want the journal.
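A rough sketch of that circular journal, assuming it spans a few whole zones: on crossing into the next zone of the ring, the host resets that zone's write pointer before appending, which is what lets the log wrap around and overwrite its own tail. The function names here are placeholders for the sketch, not a real API.

```c
#include <stdint.h>

#define JOURNAL_ZONES 4            /* illustrative: journal spans four zones */

struct jzone {
    uint64_t start;                /* first LBA of the zone */
    uint64_t write_pointer;        /* next writable LBA     */
    uint64_t len;                  /* zone length in LBAs   */
};

struct journal {
    struct jzone zone[JOURNAL_ZONES];
    int cur;                       /* zone currently being appended to */
};

/* Placeholder device operations for the sketch. */
static void reset_zone(struct jzone *z) { z->write_pointer = z->start; /* RESET WRITE POINTER */ }
static void device_append(struct jzone *z, const void *rec, uint64_t nr_lbas)
{ (void)z; (void)rec; (void)nr_lbas; /* WRITE at z->write_pointer */ }

static void journal_append(struct journal *j, const void *rec, uint64_t nr_lbas)
{
    struct jzone *z = &j->zone[j->cur];

    if (z->write_pointer + nr_lbas > z->start + z->len) {
        /* Crossing a zone boundary: move to the next zone in the ring and
         * reset it first, so it can be written sequentially again. Any
         * still-valid entries there must already have been migrated out
         * (the cleanup step discussed later in the talk).                */
        j->cur = (j->cur + 1) % JOURNAL_ZONES;
        z = &j->zone[j->cur];
        reset_zone(z);
    }

    device_append(z, rec, nr_lbas);
    z->write_pointer += nr_lbas;
}
```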
Starting point is 00:33:09 And each time you cross a zone boundary, first of all, it has to be a new write because of the standards, but second of all, it does an immediate reset write pointer on the zone that it's going into, such that it can write to that zone in a sequential fashion. That also gives us the benefit of a checkpoint, and it gives us the ability for the journal to overwrite its tail. We do have a trade in this, though. It does require a lot of storage space. It requires the size of your journal plus one zone at the least in order to have the efficient memory usage and non-volatility that comes with the idea of a journal. There are various things that can be done in this journal. One of the ideas that we've had for EXT4 is to take the metadata and put it into the journal such that it lives there and it never has to live in another place on disk,
Starting point is 00:34:10 which will essentially turn the journal into a log-structured file system that works on the zones. Group descriptors in EXT4: they change infrequently, but they can also be modified to add a little bit more information about the zones that they're representing. Since they're supposed to be synonymous with the zones, I can put the condition inside the group descriptor. If we ever get to the point where we're using open, close, and finish, I can put those in there. I can say whether it's full, I can say whether it's empty, I can say whether it's a conventional zone or a sequential write preferred zone or a sequential write required zone. All that information can go into group descriptors. Also, the group descriptor is responsible for saying what blocks are free inside
Starting point is 00:35:12 its area of management. That works really well with the write pointer, because everything before the write pointer is not free, and everything after the write pointer is free, which gives us the basis for a copy-on-write file system. Right now in ext4, these are laid out as an index table, an array on the disk at a certain location spanning a certain number of blocks. This will need to be reworked to change that into a B-tree,
Starting point is 00:35:51 which will give us a dynamic allocation of our group descriptors, so that we can store them in a copy-on-write manner inside the file system. The B-tree itself will also need to be stored on disk. And this is the piece of metadata that I was talking about earlier, the one that goes with and in between the superblock updates. File records: they change frequently. ctime, mtime, atime, and size, just to name a few,
Starting point is 00:36:23 as well as various file system attributes, change every time you touch a file. These are currently on disk as a table, but now, because of copy-on-write requirements, we need to put in a B+ tree in order to allow them to exist anywhere on the disk, which will also have the capability in ext4 of allowing us to have dynamic inodes rather than having them fixed at creation time. Because we don't want these written immediately after every change, we want to keep them in memory for a little while, let them age, until, in the ZFS sense of the concept, a transaction happens that allows them to be flushed to disk.
Starting point is 00:37:10 Whether it's to the journal or whether it's directly to disk, we gather them and write the new blocks at the write pointer and update the B-tree and write the B-tree. And if we need to, write the other metadata. And then extents are handled in a very similar way. A file is ideally written as a single chunk, which is great. It's one I/O, but if we need to change something in the middle, it becomes fragmented. So we need multiple pointers for it, so we need to break it up into an extent tree,
Starting point is 00:37:52 which will need to be another tree, which is handled in a very similar way. Is there a limit to the size of the file? It would be whatever the current restrictions in the file system are; in our case, that's not changed. In some file systems it's 4 gigs, and EXT4 I believe can go up to 16 petabytes, but
Starting point is 00:38:17 I'm not sure about that. 16 terabytes, some number, but... So, in fact, for the block allocation that you mentioned, with ext4's delayed allocation, all you need to do is just get the blocks
Starting point is 00:38:37 at the write pointer in some zone and get on with the allocation. I don't really understand the need for the tree. The delayed allocation is for data, it's not for the metadata. For instance, in this particular case, if I want to update an inode, because that is laid out as a linear index on the disk, a linear array, I'm expected to be able to write that particular inode in place. But I cannot do that, so I will need to pull it out, update the location that's pointed to in this tree, and put it at the write pointer. The tree is what tracks it for anything other than writing.
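Sketching that flow in C: a dirty inode is appended at some zone's write pointer, the tree entry is repointed at the new copy, and the old copy becomes a hole for later cleanup. All of the helper names are hypothetical stubs, not real ext4 interfaces.

```c
#include <stdint.h>

struct zone {
    uint64_t start, write_pointer, len;        /* all in LBAs */
};

/* Hypothetical helpers for the sketch (stubs, not real ext4 interfaces). */
static void write_blocks_at(uint64_t lba, const void *buf, uint64_t nr_lbas)
{ (void)lba; (void)buf; (void)nr_lbas; }
static void btree_repoint(uint64_t ino, uint64_t new_lba) { (void)ino; (void)new_lba; }
static void mark_hole(uint64_t lba, uint64_t nr_lbas) { (void)lba; (void)nr_lbas; }

/*
 * Flush one aged, dirty inode copy-on-write style: append the new copy at
 * some zone's write pointer, repoint the lookup tree at the new location,
 * and remember the stale copy as a hole for later cleanup. Nothing is
 * rewritten in place.
 */
static void flush_inode(struct zone *z, uint64_t ino, uint64_t old_lba,
                        const void *buf, uint64_t nr_lbas)
{
    uint64_t new_lba = z->write_pointer;

    write_blocks_at(new_lba, buf, nr_lbas);    /* sequential write at the WP  */
    z->write_pointer += nr_lbas;

    btree_repoint(ino, new_lba);               /* the tree itself is also CoW */
    if (old_lba)
        mark_hole(old_lba, nr_lbas);           /* garbage for a later pass    */
}
```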
Starting point is 00:39:32 However, as the file system moves along and its other files change, the write pointer is going to move. So for any particular file, I will need to know where its inode is, which may or may not be at the write pointer at any given time. That's another problem. I was talking about block allocation, be it data or metadata. The only place where you can write your new blocks is the write pointer. For any of them, that's the only place where you can write your new blocks. For all of these things, that's the only place where you can do it.
Starting point is 00:40:10 So, where do you use the B-tree? I really don't see... The B-tree is for the metadata. But in this case, I think what you're asking about is the extents. Whatever block — I mean, an extent is just a sequence of blocks. And it has to start at the write pointer. Right. So, tracking just all the zones' write pointers should be enough for implementing your block allocator. For writing new data, yes.
Starting point is 00:40:36 However, the structure has to be modified to track the history and where your new version is. So that's another problem. Right. That's the way that it's structured. However, ext4 does something that does not align with that. If you change a block in the middle of a file, what we've actually found is that it deallocates that one block and then immediately reallocates the same block, which is not at the write pointer.
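A minimal sketch of the allocator behavior being discussed: new blocks can only come from the write pointer of a zone with room, so allocation reduces to reserving space there, and a block freed mid-zone cannot simply be handed straight back. Structures and names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct zone {
    uint64_t start, write_pointer, len;   /* all in LBAs                        */
    int sequential;                       /* write pointer zone vs conventional */
};

/*
 * In a zone-aware allocator, the only legal place for fresh blocks in a
 * sequential zone is the write pointer, so "allocating" is just reserving
 * space there. A block freed in the middle of a zone is not immediately
 * reusable -- unlike stock ext4, which may hand the same block right back.
 */
static uint64_t alloc_at_write_pointer(struct zone *zones, size_t nr_zones,
                                       uint64_t nr_lbas)
{
    for (size_t i = 0; i < nr_zones; i++) {
        struct zone *z = &zones[i];
        uint64_t used = z->write_pointer - z->start;

        if (z->len - used >= nr_lbas) {
            uint64_t lba = z->write_pointer;
            z->write_pointer += nr_lbas;  /* reserve; the write itself must follow in order */
            return lba;
        }
    }
    return UINT64_MAX;                    /* no zone has room: time for cleanup */
}
```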
Starting point is 00:41:14 Okay. Yeah, we need to move fast. All right. I just got the 10-minute warning, so let's move a little faster. Problem number two. As we can no longer write in place, we now have to do cleanup as a separate step, which means that after I've solved the copy-on-write problem for each of those subspace problems,
Starting point is 00:41:43 the data, the superblock, the group descriptors, the trees, the extents, etc., now I need to go through and be able to clean them up: garbage collection. In the GC, in the journal, holes will be created, which means that it needs to be compacted, and old entries that are still valid need to be moved to the front before the tail overwrites that data. They form non-sequentially, but the oldest have to be handled first and moved either to disk or to the front of the journal. We need some pointers in the journal so we can do this, and I'll show you a picture of how I've done the journal in a little while. And we should have a trigger for how this works, which could be either time-based or it could be
Starting point is 00:42:34 event-based. One of the things that we're looking at in ext4 is the idea of the kernel warning that says when the inodes are out of sync with the block allocations. In the zones, we have the same type of thing. As zones get used and modified, we're going to leave holes behind. So we need to trigger those in the same way. We need to remove the holes, we need to compact them into a scratch pad, and we need to have trees that say not just what zones my inode has data in, but what inodes have data in my zone. So it's a reverse lookup tree, to be able to do things like compaction and defragmentation. Is there a minimum and a maximum for the scratch pad as far as percentage of the area of the disk?
Starting point is 00:43:34 We have not actually determined that yet. We do know it needs to be at least one because the data that's moving into it will be less than the zone size or up to the zone size, at which point we'll free the previous zone. So a scratch pad can be rotating. But to determine the optimal number of zones, whether it's one or whether it's a multiple of that, we have not actually determined that, because we have not got to the point where we can actually load test the file
Starting point is 00:44:04 system at this point. And that decision is made at the host level? It will be made at the host level, yes, but I'm not sure if it will be at run time or compile time, because it may be tuned per individual disk or it may be a static number that we feel is good for all disks. I apologize for asking this question, but there seems to be a lot of yet-to-be-defined parts of the specification, yet we know that drives are available today that are fully shingled. Is that correct? The specification is defined. It just doesn't define every number
Starting point is 00:44:48 that we would like it to define, such as the size of the zones. That's left up to the manufacturer. For garbage collection and defragmentation, I'm looking at doing those as user utilities first, to make it easier to determine the correctness of the algorithms, before I move this into the kernel so that the file system can continue to run in real time for long periods of time. Problem number three, advanced features. Indexes and queries. Depending on the use of the file system, there may be a lot of indexes for searching, whether it's a POSIX path or whether it's just the name of a file or a certain attribute on the file. Each of those indexes has to be maintained, and each of them has to be written to disk.
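Returning for a moment to the compaction step described just above, a simplified sketch of what that cleanup might look like: still-valid extents are copied out of a fragmented zone to a scratch zone's write pointer (using the reverse-lookup tree to know what lives there), and the emptied zone is then reset. The names are hypothetical and the scratch zone is assumed to have room.

```c
#include <stddef.h>
#include <stdint.h>

struct extent { uint64_t ino, lba, nr_lbas; int valid; };

struct zone {
    uint64_t start, write_pointer, len;   /* all in LBAs                      */
    struct extent *ext;                   /* reverse lookup: what lives here  */
    size_t nr_ext;
};

/* Stubs standing in for the real copy / tree-update / reset operations. */
static void copy_lbas(uint64_t from, uint64_t to, uint64_t n) { (void)from; (void)to; (void)n; }
static void btree_repoint(uint64_t ino, uint64_t new_lba) { (void)ino; (void)new_lba; }
static void reset_zone(struct zone *z) { z->write_pointer = z->start; }

/* Move everything still valid out of a holey zone to the scratch zone's
 * write pointer, repoint the trees, then reset the old zone so it is free. */
static void compact_zone(struct zone *victim, struct zone *scratch)
{
    for (size_t i = 0; i < victim->nr_ext; i++) {
        struct extent *e = &victim->ext[i];
        if (!e->valid)
            continue;                            /* a hole: simply dropped   */

        copy_lbas(e->lba, scratch->write_pointer, e->nr_lbas);
        btree_repoint(e->ino, scratch->write_pointer);
        scratch->write_pointer += e->nr_lbas;    /* forward writes only      */
    }
    reset_zone(victim);                          /* zone can be reused       */
}
```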
Starting point is 00:46:02 Hash tables and non-static linked arrays are not compatible with SMR unless they're small enough that you can rewrite them to the disk every time you update them, which is the reason that I'm using trees in a lot of places. Queries: same thing. Indexes allow us to move to the location that we need really quickly, but the drive does need to be cleaned to be efficient at those reads too.
Starting point is 00:46:39 Because as we get more and more fragmented, we can have pieces all over the disk and increase the seek times. Extended attributes: we need to think about where they want to be stored. Are they going to be stored as a tree or a list hanging off the file record or the inode, or are they going to be put in the inode itself? That is a design question that we need to think about for the extended attributes. Snapshots, a very nice feature, require trees, which as a tree can be stored on the
Starting point is 00:47:20 disk, but it needs to be maintained in such a way that it's copy-on-write. These advanced features are not features that I'm looking to put into ext4 to begin with, but they are part of the design choices that we need to make when we're designing file systems around SMR. JBOD, RAID, checksums and parity. I can do RAID 0 and RAID 1 with SMR without much of a problem. At least conceptually. Parity, on the other hand, is an open question for me. When we have stripes, we can
Starting point is 00:48:08 easily write a few LBAs to one drive and a few LBAs to another drive, advancing the write pointer in each one of those, keeping them in sync. However, parity requires one of several solutions. You could either use it in a journal and keep it in memory, you can write partial parity, or you can close off the entire stripe and write the parity. But any of those are going to be an open question because of the way SMR has forward writes. It acts like, and has the same write profile as, tape.
Starting point is 00:48:56 How many of us have ever thought about putting tape in a RAID array? This is one of the questions that we have to solve. This is a little bit beyond the file system, but it's still something that we need to take into consideration about how it would behave. Thank you. No, this is not done in the drive itself. This is done at the host. Correct. Correct. It organizes the symbols for code words, the stripes, and sends them to different drives.
Starting point is 00:49:46 Correct. Why do you worry about which one is which? Inside the file system itself, you don't, because it's presented as one volume. However, I present it here in advanced features for completeness, for what to think about when designing a file system, which may or may not include the RAID layers for SMR. For instance, in EXT4 it does include stripe
Starting point is 00:50:16 information; however, it is not actively used, because that's actually handled at a lower level. And then once that's aggregated, it is sent to the individual disks. That is all the slides I had. I hope that we've had a good discussion here. We've learned a lot. We've brought up a lot of questions about what we need to do in order to use SMR in our systems in the future.
Starting point is 00:50:53 Feel free to ask any questions, but I will take these last two minutes to go through some of the things about EXT4 that I've done, if there are no questions. In the idea of separating the problem space, what I've done is I've set up a meta-group such that I will have a different zone that's responsible for each of the subproblems. For example, in zone 0, I'll be doing GPT and superblock updates, but in zone 32, I'll be doing group descriptor and bitmap and inode B-tree updates. The last zone of the groups I'm always dedicating toward inodes; inodes for an 8 terabyte disk account for about 260 gigs. It's a lot. So that proportion is about right for ext4 and the amount of metadata it needs.
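A tiny sketch of the meta-group layout being described here (and summarized in the next sentence), assuming 32 zones per meta-group; the role names are made up for the sketch.

```c
#include <stdint.h>

#define ZONES_PER_METAGROUP 32            /* the 32-zone layout described here */

enum zone_role { ROLE_METADATA, ROLE_DATA, ROLE_INODES };

/* Map an absolute zone number to its role within its meta-group:
 * the first zone of each group for GPT/superblock/descriptor/tree updates,
 * the last zone for inodes, and everything in between for data.            */
static enum zone_role zone_role(uint64_t zone_no)
{
    uint64_t idx = zone_no % ZONES_PER_METAGROUP;

    if (idx == 0)
        return ROLE_METADATA;
    if (idx == ZONES_PER_METAGROUP - 1)
        return ROLE_INODES;
    return ROLE_DATA;
}
```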
Starting point is 00:52:00 So zone 0 will be used for metadata, zone 31 will be used for the inodes, and zones 1 through 30 in each metagroup will be used for data. That way I'll just march through the disk, allowing some locality, as well as allowing each individual piece of metadata to have its own solution inside the file system. This is the plan that I've put together for changing ext4. First of all, with command line arguments. Second of all, adding some handling. Three, adding some changes within the I/O stack at the SCSI and SD layers. Four, adding some ioctl and some sysfs handlers. And in version five, starting to do the major enhancements, including the copy-on-write and the garbage
Starting point is 00:52:52 collection. Stage six, I'm looking at all the userland utilities that need to be updated in order to talk to this. Seven, we're looking at some multi-disk enhancements, including RAID, JBOD, and LVM. And once we work out all the kinks and get all the corner cases out, then we can work towards host-managed compliance. The data path for SMR EXT4 versus EXT4 is going to be broken up into the idea of trees, rather than the idea of a linear array that is written to the disk. In particular, the black line up there from the superblock to the group descriptors means
Starting point is 00:53:32 that the group descriptors are written on disk immediately after the superblock and span a certain number of blocks that are specified in the superblock. And any time I want to change those, they have to be written exactly in place. Whereas in the new path, I'm breaking that out, putting that into a tree, so whenever there's an update, it can be written to another location on disk. The journal circular buffer just marches through, overwriting and resetting each zone as it goes. Because of this it cannot be a regular file anymore in EXT4; we need to give it a special type. And then I have some information about how to find the zones in JBOD, RAID 0, RAID 1, 5, and going forward. And I just got the message that I'm out of time, so if there are any other questions, I will certainly answer them.
Starting point is 00:54:33 But we probably have to do it offline. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
