Storage Developer Conference - #33: Modern Erasure Codes for Distributed Storage Systems

Episode Date: February 21, 2017

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast. You are listening to SDC Podcast Episode 33. Today we hear from Srinivasan Narayanamurthy, member technical staff with NetApp, as he presents Modern Erasure Codes for Distributed Storage Systems from
Starting point is 00:00:46 the 2016 Storage Developer Conference. Good afternoon, everybody. So my name is Srini. It's a long name, but you can call me Srini; I'm from NetApp. So I'm going to talk about modern erasure codes. So why is it modern? Why should I even talk about modern erasure codes? What's wrong with the old codes? The problem is we've been using codes like RAID, or the underlying codes inside RAID like Reed-Solomon,
Starting point is 00:01:11 which were designed back in the 1960s. That's a good 55-year-old code, and we're still using these Reed-Solomon codes day in and day out. So the idea is to look at things from the ground up, thinking from first principles, and see what sorts of codes could be useful not just for disks: what codes can we use for flash, what codes can we use for distributed storage systems, and a lot more. So everything around us is changing. I mean, we're talking about the data deluge; we have huge capacities, densities of disks increasing, disk transfer rates increasing. And essentially the problem with RAID and old techniques is that the time required to reconstruct a failed disk is going to be a lot more as capacity increases.
Starting point is 00:02:09 And if you're going to talk about increased time for recovery, that's more exposure to failure. There's a window where you can have more failures. So that's the problem with existing reconstruction techniques. And we also know that storage technologies are changing. We're talking about scale-out, distributed systems, cloud, converged, hyper-converged, and whatnot. Media is changing, and we're also looking for features like geo-distribution, security, use of commodity hardware, and things like that. So how can all of these be served with old techniques like Reed-Solomon? Okay, so though we try to
Starting point is 00:02:41 construct different forms of erasure codes using Reed-Solomon, they're still not up to the mark. And we definitely can't use RAID for all of these various requirements. So we do have certain newer dimensions of erasure codes. Essentially, if you're talking about an erasure code, you're talking about a failure. And to be prepared for that failure, we add more parities. That way, we add a little bit of storage overhead. I mean, we're giving up a little bit on storage efficiency.
Starting point is 00:03:12 Thereby, we gain reliability. But that's not the only trade-off that we're talking about anymore. Now, when we talk about distributed systems, we have network coming in. So what happens to the network bandwidth utilization? What happens to the number of nodes that I'll have to reach out to if I have to reconstruct? So I don't want to reach out to every little disk out there to reconstruct like one byte of data. So that's something that we can reconsider. So the optimality dimensions are really changing.
Starting point is 00:03:38 So we're not talking about just the disk world where we have like 60 disks packed in an array, with RAID on top of it, and you have a RAID group and say one disk failed, I'm going to read five other disks and reconstruct that one disk. That's not the case anymore. We've got the entire scheme of things changing altogether. So I've organized the talk this way. I'm going to give you a little background, an Erasure Codes 101. It's more like what an (n, k) code is; we'll talk about (6, 4), (9, 6), what an (n, k) code is, and then get into the modern codes, because those are like the
Starting point is 00:04:09 basics before I even get into some modern codes like locally repairable codes, regenerating codes and things like that. So essentially, locally repairable codes means they put codes on top of codes. Okay, so you have a Reed-Solomon code, just one code, and then you have another code on top, and another code on top. So how does this hierarchy of codes solve problems? That's the locally repairable code. The other one is regenerating codes, which essentially came from network codes; basically it's from the networking community, and regenerating codes are actually codes that are designed for storage systems, okay, with the newer dimensions in play.
Starting point is 00:04:42 So those are the modern codes that I'm going to talk about. There was a little bit of a deviation; there were some fountain codes and other stuff. I'm going to touch upon it, but that's not important for my presentation because I'm more concentrated on the distributed storage system side of things, and I'm not talking about LDPC codes and other codes that are underneath the FTL, because those aren't for distributed systems as of yet. Okay, so this is more like the 101, this is what the current state is, and the next part is me trying to analyze all of these codes and see what's relevant where. Okay, so if I have a flash array, which of these codes will be
Starting point is 00:05:16 useful? If I have a geo-distributed storage system, which will be useful? So it's a little bit of technical analysis of whatever codes we learn, and then we'll talk about some literature and who the players in this area are. So as I said, I'm going to start with the background. So essentially we started with classical codes, which includes Hamming codes, Reed-Solomon codes, all those that we read in books, which was essentially the 1950s and 60s. And after that, this was about 2000. Like 40 years, nothing happened, and then in 2000 fountain codes came, and then these locally repairable codes came somewhere in 2010 or 2011. That's like a 50-year gap, and suddenly from here to here there are like hundreds of papers on
Starting point is 00:05:59 locally repairable codes and regenerating codes. There are products on LRC: Microsoft uses LRC, Amazon uses LRC, Facebook uses LRC, everybody uses LRC these days, and there are these regenerating codes where people are trying to venture in as well. Okay, so what's the difference here? So typically, classical codes were actually talking about storage overhead versus reliability as the trade-off. You give up on storage overhead, you gain reliability. Okay, that's the essential trade-off. Fountain codes brought in this concept of increased performance. So you give up on storage overhead, you add a little bit more parity, and I'm going to give you a lot more performance. Okay, performance was another trade-off. And then locally repairable codes brought in
Starting point is 00:06:40 this other angle of repair degree. Like places like SMR. So I don't want to go touch every little disk and every little block if I'm going to reconstruct just one byte of data. So repair degree, where the degree means the number of nodes I reach out to for reconstructing a block of data. So the number of nodes is the repair degree. So that's another dimension. And the last one, obviously the most talked about, is the repair bandwidth. So if I have to reconstruct 1 MB of data, how many MB of data am I going to flood on the network? Is it going to be 6 MB? Is it going to be 1 GB? I don't know. Okay, so that's repair bandwidth, and this is the most talked about code, the regenerating codes, because it's very relevant to
Starting point is 00:07:19 distributed systems over the network, over WAN and things like that. Okay, so the timeline. As I said, the classical codes started in the 50s and 60s. RAID: the last RAID code that was constructed, RAID 6 in Linux, was in 2002, RAID-DP was in 2004, and that was probably the end of it. And then in 2015 there was this little work, I think from UCLA, on repairing Reed-Solomon, how to improve the repair performance of Reed-Solomon. So basically the problem with Reed-Solomon is this; actually, I'll go to the next slide and explain that problem. So after these classical codes came the fountain codes. So 1998 to 2002 is when we had all these codes. So there was this Tornado code, by a person called Luby, and he came up with
Starting point is 00:08:09 this idea of a fountain code, which was essentially a rateless code. So when you say (n, k), we have a rate there. Rate essentially means I'm talking about a 6,4 code where 6 is the total number of disks and 4 is going to be my data disks, with two parity disks. The rate is nothing but the amount of storage overhead you have, okay. But these fountain codes are rateless codes, basically, so you don't have a rate. They came in from networks, where you have a fountain of code symbols coming in; you have a bucket, you capture the amount of data that you want to reconstruct, whatever you need, and you leave it. The fountain basically just throws away your code symbols;
Starting point is 00:08:52 that's the name, fountain code. You don't have a rate at all, you don't have a fixed dimension to the code. So that was the fountain code and the implementation of it. So basically the ones at the top are theoretical stuff and the ones at the bottom are more systems and practical codes. So from the fountain code theory they built Raptor codes, which essentially is the rapid tornado code, and now I think it's owned by Qualcomm. And these are only a little relevant to our case, because mostly people tried experimenting with them underneath the FTL for flash ECCs and not much more. So then came this interesting idea of locally repairable codes. So basically, people tried to experiment on what happens if I put codes on
Starting point is 00:09:41 top of codes, and how does it even work? So in 2011, there was this pioneering work from Microsoft Research. What they did was establish a common theory based on these two codes, pyramid and hierarchical codes, that they constructed. And they said, what if I have a local group and a global group? And these local groups can actually reconstruct within that group: I have a parity inside, and I can reconstruct there, and if I have a larger number of failures I'll probably go to the global group and actually reach out for more parities and read. So that was the idea, and this was just the
Starting point is 00:10:16 theory, and they went ahead and implemented it in Azure, and they claim that the code essentially saved millions of dollars for them. Okay, so I'm just calling it MLRC, which stands for Microsoft's LRC, Microsoft's locally repairable code. And Facebook did the same idea on Facebook's photo storage; they published a paper called XORing Elephants (Xorbas) in VLDB 2013, and that was also a local code. So I mean, now people are trying to do regenerating, and they're trying to combine regenerating codes with these locally repairable codes. So as I said, the regenerating codes came from the idea of network codes. So in 2010,
Starting point is 00:11:00 a person from UT Austin, the University of Texas at Austin, came up with this idea of, what if I draw some ideas from network codes and put them in storage? I'll explain the actual concepts later. But the idea is to draw some ideas from networks and put them in storage, and it actually helped save the amount of bandwidth you utilize for repairing failed nodes. So that was in 2010, and people have slowly started implementing these codes these days, and another work was to try and combine locally repairable codes with regenerating codes. So how do I combine this idea with that idea? That was local regeneration. Yeah, and up to 2015 there was a FAST paper, in 2016 there was a FAST paper; it was all on
Starting point is 00:11:47 regenerating codes. Yeah, so these are just a few 101 kind of slides; you can doze off if you want after the lunch. So, classical (n, k) codes. I'm taking an example of (6, 4). So we have a file or an object, you chunk it into k blocks and then you encode it, you have some parity blocks, and that's essentially n encoded blocks, and k out of them are the actual data and n minus k are your parity blocks. And when you have a failure, so essentially when you say you have two parity blocks, it means that you can repair two failures. So if you have two failures you end up reading
Starting point is 00:12:27 the rest of them and then construct the actual data. So you have the four surviving blocks, you decode them, that's essentially decoding, and in order to repair, what you typically do is, now that you've got the data back, you re-encode it, find those missing blocks and then reinsert them. Essentially the problem is that repairs are expensive if you think of this in a distributed system. If this is going to be happening inside a storage array, that's fine. Now if you're talking about reading, so if I have to reconstruct these two blocks, I'm essentially going to read 1, 2, 3, 4 blocks into one machine and actually reconstruct the whole thing and distribute all of these all over the network.
Starting point is 00:13:10 And you can now think of each of these blocks as like a gigabyte or something. Now you're talking about a 2-gigabyte failure, and you're having 4 gigabytes over the network, and you're computing on 4 gigabytes to compute another 4 gigabytes, and then you're transferring 2 more gigabytes back into place, and that's huge. You don't want to do that on the network for just a 2-node failure. Essentially it might not even be a 2-gigabyte failure; it could be just 1 MB inside the entire gigabyte file. So yeah, the point is that in distributed systems such repair is not going to help. And this essentially is the basics of Reed-Solomon codes. So (n, k) basically is a code where these two are going to be two parities, out of which one parity is probably an XOR.
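To make that repair cost concrete, here is a toy sketch of such a (6, 4) layout. The two parities below, a plain sum and a weighted sum over ordinary integers, are only stand-ins for the real Reed-Solomon arithmetic over a Galois field; the point is just that rebuilding two lost data blocks means reading all four surviving blocks.

```python
# Toy (6, 4) layout: 4 data blocks plus 2 parities, with integer arithmetic
# standing in for Galois-field math. Block values and names are illustrative.

def encode(data):                       # data = [A, B, C, D]
    a, b, c, d = data
    p1 = a + b + c + d                  # "XOR-like" parity: plain sum
    p2 = a + 2*b + 3*c + 4*d            # second, independent linear parity
    return data + [p1, p2]              # 6 blocks total

def repair_two_data_blocks(blocks, lost):
    # Classical repair: read ALL four surviving blocks to solve for the two
    # missing data blocks -- this is the repair-traffic problem from the talk.
    i, j = lost                         # indices (0..3) of the lost data blocks
    surviving = {x: blocks[x] for x in range(6) if x not in lost}
    known = [x for x in range(4) if x not in lost]
    # Remove the known data contributions from both parity equations.
    r1 = surviving[4] - sum(surviving[x] for x in known)
    r2 = surviving[5] - sum((x + 1) * surviving[x] for x in known)
    # Solve the 2x2 system:  xi + xj = r1,  (i+1)*xi + (j+1)*xj = r2
    xj = (r2 - (i + 1) * r1) / (j - i)
    xi = r1 - xj
    return {i: xi, j: xj}

blocks = encode([10, 20, 30, 40])
print(repair_two_data_blocks(blocks, lost=(1, 3)))   # -> {1: 20.0, 3: 40.0}
```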
Starting point is 00:13:57 So you probably do A plus B plus C plus D, or A XOR B XOR C XOR D, as this first code. And the next one is probably a linear equation from these. It could be 2A plus 3B plus 4C plus 5D, and that could be that. Okay, and that also requires some compute, because you'll have to solve the linear equation; it's not mere XORs. So the problem is repair is not efficient. So then came these modern codes, essentially LRCs and regenerating codes. So LRCs: the idea was, as I said, they went with these hierarchical and pyramid codes, and they came up with this concept of, what if I have two data
Starting point is 00:14:35 blocks and then I construct a parity, and then I have a few more parities at this level, and I construct parities out of these parities, and parities on top of that. Okay, what's the advantage of this? If I have one failure, I don't have to go read all of these blocks to reconstruct that one failed block. All that I'll have to do is go read this parity and this, and I'll get this one. Okay, so if it was Reed-Solomon, if it was a plain one level of Reed-Solomon where you had, what is this, 8, so a (9, 8) code: imagine there is just one extra parity, and that one block failed. I
Starting point is 00:15:09 probably will be reading from 8 blocks to reconstruct that one node, right? So instead of those 8 blocks, they're going to have just 2 blocks being read to reconstruct that one. Now if I have a larger number of failures, so if I have two failures, this is not helping, because this essentially is a (3, 2) code, and with a (3, 2) code I can only reconstruct one failure. I can't reconstruct two failures. So when I have two failures, I go one level up and use that parity and then get it. Now if I have three failures, then you probably have to go one more level up to actually get that. But the essential idea is that just by organizing these parities in an interesting way,
Starting point is 00:15:52 you can solve the same problem of three failures, but a lot more efficiently, because you're not going to read all the data. You're not putting all of it on the network. And an interesting point to make is that the people at Microsoft and Facebook did not just do this idea of hierarchy or pyramid. So hierarchy is just going bottom-up, constructing parities bottom-up, and pyramid is just constructing parities top-down; it doesn't really matter. But the interesting point here is that this construction could be reorganized a little differently, and they came up with slightly more efficient constructions with that. So one of them is MLRC, which was by the Microsoft Azure team. So what they did was, they said these blue lines are the linear
Starting point is 00:16:37 equations. So you have six data blocks: data, data, data, data, data, data. This is a parity constructed out of these three blocks, and this is a parity constructed out of those three blocks, and then this is a parity constructed out of all of these, and this is another parity constructed out of all of these. Okay, so now if you have a failure, it's one failure, and you use these three to reconstruct that one failure. Essentially this could be an XOR, okay; it's just one parity here, and this is another parity, it could be an XOR too, which means it's a lot of performance. Okay, you're not talking
Starting point is 00:17:12 about solving linear equations and taking time on the compute and I mean there's no latency there. So that's one node failure. Now you have two node failures. You have one linear equation here, another one here, maybe two XORs here and you're going to solve that. Okay, it's just two failures. You have one linear equation here, another one here, maybe two XORs here. And you're going to solve that. It's just two failures. Now if you have four failures, worst case. So if you have two failures, one here and one here, it's again simple. It's the same logic because there is a local code and there are global codes.
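As a minimal sketch of that single-failure case, assuming the layout described above (two local groups of three data blocks, each with an XOR local parity; the two global parities are left out since they only matter for multi-failure repair), local repair touches only three blocks:

```python
# Minimal local-repair sketch for an Azure-style LRC(k=6, l=2, r=2) layout:
# two local groups of three data blocks, each with an XOR local parity.
# Global parities, needed only for multi-failure repair, are omitted.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

group_a = [b"aaaa", b"bbbb", b"cccc"]        # local group 1 (contents made up)
group_b = [b"dddd", b"eeee", b"ffff"]        # local group 2
local_parity_a = xor_blocks(group_a)
local_parity_b = xor_blocks(group_b)

# Single failure: lose group_a[1]. Repair touches only the two surviving
# members of that group plus its local parity (repair degree 3), not all
# six data blocks as a flat Reed-Solomon code would need.
recovered = xor_blocks([group_a[0], group_a[2], local_parity_a])
assert recovered == b"bbbb"
```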
Starting point is 00:17:37 You're essentially using only two local codes. Now if you have four failures, that could also be solved. Same idea, you can have four. I mean, the worst case is what Sreed Solomon is offering. Worst case is you have four parities, you can solve four failures. And the four parities are only distributed in a different way. That's all. That's the only change here. So then Facebook came up with this idea where Facebook ran some analytics on their data center
Starting point is 00:18:04 and figured out that about 98% of failures are single-node failures. Okay, 98% of it is just single-node failures; only 2% of it is more than one node failing. Now why do I optimize so much for a case which is only 2% of my probability? Okay, like what Jeff has been talking about in the morning: you have so many parities, you can solve so many of them. But the problem is that's not going to happen every day. That's probably going to happen like once in five years, worst case. So why do I have to have so much update complexity, so much storage overhead and so many of these things for that once in a blue
Starting point is 00:18:39 moon failure, right? So what they did was: I'm going to optimize the code for one-node failures, I'm just going to optimize the heck out of it. Okay, so what they did was, I have a (6, 5) code. Okay, there's a (6, 5) code; it's a long code. Unlike here, where you have a (3, 2) code, you have a (6, 5); the length of the code is quite long for a code. And then you have another (6, 5), and then you have enough parities: one could be an XOR and a few could be linear equations, and you have a global parity on top of the global parities. That's like 1, 2, 3, 4, 5, 6, 7; it's like 7 failure cases could be
Starting point is 00:19:17 handled here. It's just that they've organized it a little bit differently. So essentially you could think of this as a (14, 10) code, if you forget all of this, in a regular Reed-Solomon way. But you could also be thinking, because it's reorganized so much: if I have one node failure, I can reconstruct here; another node failure here; two node failures independently, still within local groups; and if you have global failures you can take the global part. So essentially your (n, k) logic of erasure codes, now they've changed it to
Starting point is 00:19:53 some (k, l, r), where k is the number of data blocks, l is the local parities and r is the global parities. So six data blocks, 2 local parities and 2 global parities. So the equations are changing a little bit, and the other problem that comes with these LRC codes is this. There's this concept called MDS, which is a maximum distance separable code. Like, if you have a (6, 4) code, I have 2 nodes' worth of parities being stored, which means I should essentially map it to 2 nodes' worth of failures. So the amount of storage overhead I incur should be the same amount of failure that I could repair. Okay, that's called MDS. Now that doesn't hold here, because
Starting point is 00:20:37 if I have any node failure, if I have one node failure here, this is MDS. If I have this one and this one, this is MDS and this is MDS, because that's Reed-Solomon. If I have a two-node failure here, that's not MDS, because I'm actually using one from here and one from here, which essentially is not MDS with respect to this, right? So this is not a straightforward MDS code. For the past 45 years, people were stuck on this idea that a code should be MDS, which means if I've got 4 MB worth of parities, I should be able to repair 4 MB worth of failures. A code should be XOR: RAID 5 was always stuck with this XOR-based code. You have one parity, I'll always use XOR, because that's the best performance that I could get.
Starting point is 00:21:21 Then people actually got out of it and said, okay, RAID 6: if I have one failure I'll use the XOR, but if I have a second failure I have a linear equation to solve. Okay, so people want XOR, people want MDS, but people are okay to relax now. They're okay with non-MDS codes, they're okay with non-XOR codes, it's all fine, because there is some other way of optimizing, there is some other parameter they're trying to optimize on. Okay, so these are locally repairable codes, and they are not trying to optimize only on storage overhead versus reliability as trade-offs. They're also talking about repair degree, which is the number of nodes you reach out to for a repair; maximum recoverability for as many failures as
Starting point is 00:21:58 you want; locality or minimum distance, which is where you optimize for that one-node failure case, which is the minimum distance there. Okay, so the trade-offs are a little different; it's not just storage overhead anymore. Now come regenerating codes. As I said, these were inspired by network codes, from the concept of information flow graphs and the minimum cut bound. So what they said was: if you have a source and you have a data collector. This is a network coding diagram, and I essentially don't want you to understand the whole thing; this is just to give you a picture of it, and I'm going to map it to storage terminology, which will be easy for us to understand. So you have a source and you have some symbols
Starting point is 00:22:39 generated from it, and these symbols are transferred somewhere, and you have a loss, like this node, or rather this node failed. Now if you have to reconstruct this data back, you read out some of them and then reconstruct it for the data collector to read, which means this data essentially gets transferred, translated into a different form of data that you gather from the remaining surviving nodes, and you give it to the data collector. But the idea here is that's essentially what your regular erasure code also does. If one node fails, you read from the rest of them and you actually give it to a guy so that he can reconstruct it.
Starting point is 00:23:17 But the difference here is they have this thing called a minimum cut bound, where I'm not going to read from everywhere; I'm going to define a boundary saying that I only need one node to be read to actually get you the data that this one lost, but in a translated form. Okay, I don't want to have it in the same form. So x4 is what they lost; I don't want to reconstruct x4 exactly in the way it was lost by having to read all of these. I'm going to translate this into x5, which I can gather from just x1 by reading only one symbol instead of reading all three symbols, and then get the data back. So the idea is to not read all the data, but to read only as much data as is required for that reconstruction. Okay, so I don't know if it's visible, is it? Okay, so if you have a file, the same
Starting point is 00:24:11 way that we tried to understand Reed-Solomon: if you have a file and I have four blocks A, B, C, D, you essentially have the data blocks A, B and C, D stored in two nodes, and then you have some parities here. This is the XOR, A plus B plus C plus D, and then you have a linear equation, and the other three are linear equations. You have data blocks, parity blocks. Now imagine you lost one node here; this node is lost. Now, essentially, in the Reed-Solomon way, what we would be doing is read all the surviving ones, and you reconstruct the data back into the form in which it was lost.
Starting point is 00:24:45 The lost data is a plus 2b plus 3c plus d, and you would try to reconstruct that particular equation in the same way. That's not what these guys said. I'm going to repair it to something called a functional repair form. A functional repair will let me reconstruct this data back, but I'm not going to repair it to the exact form in which I lost it. I'm going to reconstruct it into a totally different form, but I know that those equations will let me calculate this one back. So you don't read 1, 2, 3, 4, 5, 6 equations or 6 data blocks now; you only read 1, 2, 3 of them, but you actually use these nodes
Starting point is 00:25:25 to compute something inside. So you have A and B; you compute A plus 2B for me and give me a P1. So instead of transferring 2 MB worth of data, 2 blocks' worth of data, I'm going to transfer only 1 block's worth of data from this node, 1 block, 1 block; it's just going to be 3 blocks' worth of data. So comparing against Reed-Solomon, it would be 1, 2, 3, 4, 5, 6 blocks' worth of data; now it's going to be only 3 blocks' worth of data, thereby a 50% reduction in your network bandwidth, and then it's reconstructed into a form completely different from this. So this reads as a plus 2b plus
Starting point is 00:25:59 something, this reads as 5a plus 7b plus something; I don't care about it. It's in a different parity form, but I know that these equations will let me calculate my lost data. That's called functional repair. The earlier one, where I tried to reconstruct it back to the original form, is called exact repair. But my problem is this. The erasure codes that we use in storage systems are supposed to be what's called systematic. Systematic means your data blocks will have to remain as data blocks. They should not be in some encoded form. Because you store A, and you want A on your disk, not some A plus B on the disk. Because if I have to read it back, I just have to read A
Starting point is 00:26:38 instead of recomputing what that A plus B comes out to. So because of my read throughput, I need systematic blocks, which means my blocks will have to be stored in their original form, not in some encoded form. So this functional repair helps if I have to calculate parities, but it doesn't help if I have to calculate data blocks. So imagine this one is lost, and because of all these gimmicks that we just added, you have some parity calculation and you transformed A and B into some 4A plus B and 6A plus C or whatever. That's not
Starting point is 00:27:13 going to help, because my read throughput is not going to be the same. Okay, your latency is going to be bad. So the idea is you want to have a systematic regenerating code, where these are exact repairs and these are functional repairs. You got it? Are you with me? So you want a code with systematic exact repair of data and functional repair of parities, and that's essentially what people are building now. Okay, so the advantage is your data blocks still remain in their original form, which means if I have a read coming, I just have to go read the data. I don't have to recompute with all the parities that are there.
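Putting rough numbers on the bandwidth argument from the four-node example above (two blocks per node, a file of four blocks), in the usual regenerating-code notation, which the talk itself doesn't spell out, the accounting looks something like this:

```latex
% Assumed parameters for the four-node example above:
% n = 4 nodes, k = 2 data nodes, d = 3 repair helpers, file size M = 4 blocks.
\begin{align*}
\text{per-node storage: } & \alpha = \tfrac{M}{k} = 2 \text{ blocks} \\
\text{naive repair (read all survivors, as described): } & 3\alpha = 6 \text{ blocks over the network} \\
\text{regenerating (functional) repair: } & \beta = \tfrac{M}{k(d-k+1)} = 1 \text{ block per helper},
\quad \gamma = d\beta = 3 \text{ blocks}
\end{align*}
```

That 6-versus-3 is exactly the 50% bandwidth reduction claimed above.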
Starting point is 00:27:55 But when it comes to parity blocks, because of this idea of reducing the network bandwidth, I'll have them in some other form. So that's just a gist of regenerating codes. Let me go one more step, double-click, and see what the types of regenerating codes are and how they differ a little bit or not. So I'm going to take one code as an example, something called repair-by-transfer codes, where I was also part of the team which constructed this.
Starting point is 00:28:31 So we came up with this idea called a pentagon code. It's a simple pentagon. And the idea is like this. You have an edge, and you represent an edge with a data block symbol; I've named it 1. And the edge is incident on two vertices, and on these vertices I'll have a copy of 1 and a copy of 1. That's all. The same way, you can see if I have 6 here, there's a 6 here and there's a 6 here. That's all.
Starting point is 00:28:57 So I'm able to actually fill this in with 9 blocks, with 2 copies of each. So I have 9 blocks, so 2 times 9, 18 blocks, and I have 2 spaces left, which I'll fill with parity. Yeah, are you with me? Yeah, cool. So this is a pentagon construction. Now if I have 1 node loss, any 1 node loss, the idea is so simple: I just have to copy data from the edges that are incident. So if I lost this one, I probably have 1, 5, 7, and 9. 1, 7, 9, and 5. Just copy those blocks from there. So single node failure is super.
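Here is a toy sketch of that pentagon layout as described: the five nodes are the vertices of a complete graph, each of the ten edges carries one block stored on both of its endpoints, nine edges carry data and the tenth carries a parity. The block contents below are made up; the point is that repairing a node is nothing but copying.

```python
# Toy repair-by-transfer (pentagon) sketch: 5 nodes = vertices of K5, each of
# the 10 edges carries one block stored on both endpoints (9 data + 1 parity).
from itertools import combinations

vertices = range(5)
edges = list(combinations(vertices, 2))            # the 10 edges of K5
data = {e: bytes([65 + i]) * 4 for i, e in enumerate(edges[:9])}   # 9 data blocks

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

data[edges[9]] = xor_blocks(list(data.values()))   # 10th edge holds a parity

# Each node stores the four blocks on its incident edges (a bit over 2x overhead).
node_store = {v: {e: data[e] for e in edges if v in e} for v in vertices}

def repair(failed):
    # Repair by transfer: every surviving node hands over the one block it
    # shares with the failed node -- four plain copies, no decoding at all.
    return {e: node_store[v][e]
            for v in vertices if v != failed
            for e in node_store[v] if failed in e}

assert repair(2) == node_store[2]
```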
Starting point is 00:29:35 It's just a copy. But my problem is storage efficiency. It can't go any less than 2x. It's a replica. It's a real replica; it's not just an erasure code there. So a Reed-Solomon can go to 1.5x, 1.4x, 1.6x, but this can't; it's going to be some 2-point-something x, okay. So this is 2 already, and I have a parity there, right? So this is called repair by transfer: you repair a node by just transferring data. A repair-by-
Starting point is 00:30:00 transfer code, okay. And this falls under the category called MBR, Minimum Bandwidth Regenerating codes, which I'll talk about in a while. So this doesn't stop with a pentagon; I can have a heptagon, I can have an octagon, anything; it's just a framework. And in a pentagon, now, the concept is: this is a regenerating code,
Starting point is 00:30:20 a minimum bandwidth regenerating code. What if I add locality to this? That is the LRC idea here. I can have this not just as a Reed-Solomon parity, but as a local parity. A local parity for this heptagon. And I can have a global parity across these heptagons. Okay, it's a local regenerating code. You have a local code in each of these octagons or heptagons or whatever, and you have a global code, like how you have an LRC code. Okay, so that was a local regenerating code. So now that kind of comes to the question of what this MBR
Starting point is 00:30:55 code is, and how do we even understand it? So we're always talking about storage overhead here, which is 2-point-something x, and what is called repair bandwidth now, because here your repair bandwidth for a repair is essentially 4 times the amount of data, like 4 times the symbol. So if I have 4 blocks' worth of data being lost, I'm going to copy 4 blocks' worth of data, which means my repair bandwidth is also 4 blocks. So for MBR codes, the minimum bound is basically the amount of data you lost. You can't get any lower than that.
Starting point is 00:31:28 So you're talking about Reed-Solomon, where for one node's worth of data, you're talking about reading four nodes' worth of data, which will essentially be 16 blocks. Now you're talking about reading only four blocks, which is the minimum bound. So you can optimize on the bandwidth you put on the network for a repair, so you have a repair bandwidth versus storage trade-off. MBR stands here because you have the minimum amount of bandwidth you can read from, okay, because it's just a copy; you don't
Starting point is 00:31:58 have anything extra. So that's here, but the problem is your storage overhead is a lot, because your storage overhead is 2-point-some x, which is not even close to Reed-Solomon. So all MBR codes are somewhere here. And there's a form of code called an MSR code, which is a minimum storage regenerating code, where your storage overhead is exactly the same as Reed-Solomon. Which means you're not giving up on anything from what Reed-Solomon offers you. So RAID 6, you have Reed-Solomon. Take out the Reed-Solomon, put MSR there: no different, okay? It's going to give you the same amount of storage efficiency, but it's going to give you this bandwidth efficiency that you've got here, which Reed-
Starting point is 00:32:36 Solomon doesn't give you, okay? So that's the beauty of MSR. So a lot of people have flocked here to develop MSR codes, to construct MSR codes. Not as much in MBR codes, though they're easier to construct. MSR codes are like the talk of the town these days. Yeah, and there was some theoretical work which said, why don't we have some codes here, and we can construct, we can have like a knob which says storage efficiency or bandwidth. Okay, I'll have a code here.
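For reference, the two end points of that storage-versus-bandwidth curve are usually written as follows in the regenerating-codes literature, for an (n, k, d) code storing a file of size M and repairing from d helper nodes; the talk gestures at these but doesn't write them out:

```latex
% MSR point: Reed-Solomon-like storage, minimum repair bandwidth for that storage.
\alpha_{\mathrm{MSR}} = \frac{M}{k}, \qquad
\gamma_{\mathrm{MSR}} = \frac{d\,M}{k\,(d-k+1)}

% MBR point: minimum possible repair bandwidth, paid for with extra storage.
\alpha_{\mathrm{MBR}} = \gamma_{\mathrm{MBR}} = \frac{2\,d\,M}{k\,(2d-k+1)}
```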
Starting point is 00:33:03 So theoretically they proved that you can't have codes here. It was in 2015 that people proved that you can't have codes other than in this area or in this area, and not elsewhere. Yeah, so the trade-off, again, to repeat, is storage overhead versus repair bandwidth. Right, so we've got a bunch of codes. Now let's try to analyze them: what are the trade-offs, what are we giving up on, what's even meaningful for, let's say, a hyper-converged system to use, which code makes sense for, let's say, flash to use, okay, something like that. So I've split it according
Starting point is 00:33:39 to the types of code that I spoke about: essentially the classical codes, the fountain codes, the locally repairable codes, and the regenerating codes. So when people were talking about MDS codes, we were always talking about storage overhead versus reliability. And so was the case with all your replication and parity codes. Replication is your Hadoop-like replication; parity is your regular Reed-Solomon and RAID and everything. Yeah, Reed-Solomon and RAID.
Starting point is 00:34:03 So we were always talking about storage overhead and reliability. And when we came to fountain codes, as I said, people were worried about computational complexity, right? You wanted high-performance codes. We're talking about rateless codes. When you say rateless, they don't care about the storage overhead, okay? So we were talking about rate and probability of correction. The higher the rate, the higher the probability of correction. Okay, so that was a different trade-off altogether. So when we came to LRCs, in general we were talking about storage overhead and repair degree, which is the
Starting point is 00:34:33 fan-in, the number of nodes you reach out to to read. But several codes, specific types of codes, actually optimize on slightly different parameters. Like they said, locality versus minimum distance in Facebook's LRC, right? Yeah. In regenerating, as I said, storage overhead versus repair bandwidth, and in local regenerating you have the overall reconstruction cost, right, the number of nodes you reach out to and the amount of bandwidth you utilize in it; that's probably the holy grail, right? Yeah. So that's the idea. So if you talk about reliability analysis, like, we're used to all these MTTDLs and numbers that we can compare with, right? So this is the locally repairable codes
Starting point is 00:35:13 and this is regenerating. Let me take one step at a time. So in locally repairable codes, I'm comparing against a three-replica system. We're talking distributed systems like Hadoop, Ceph, Cassandra, which use three replicas. So that's like the standard. Now we're trying to compare that with the good old classical codes, if I were to put a Reed-Solomon code instead of a three-replica system in Hadoop, which we actually did, and we saw that, yeah, this is storage overhead: so for a three-replica system your storage overhead is 2x extra, like you have one copy of data and you have 2 extra copies of data. And this is your repair traffic. Essentially, your repair traffic is almost nothing, because it's just a copy.
Starting point is 00:35:53 You're just going to copy. You're not going to have, like Reed-Solomon, where you have a lot of data to copy. If I have 1 block of data being lost, I'm going to copy some 1-point-however-much, okay; it depends. On a (14, 10), if I have like a one-block loss, I'm going to read 10 nodes' worth of data, right? That's a lot, which comes somewhere here. So essentially what we're trying to do is to get this one closer: you want to have minimum storage overhead, and you also want to bring the repair traffic down, and obviously you want to increase the MTTDL, because the MTTDL is a factor which actually saves you; the life of your entire system is going to go for a toss if you keep reading so much data, like in Reed-Solomon, right? In Reed-Solomon,
Starting point is 00:36:35 though you're going to have a nice storage overhead, you have 1.2x storage overhead, that's sweet, but the problem is I'm going to every disk so many times that it's going to fail so fast, right? You want to increase the MTTDL as well, and that's what your repair degree and repair bandwidth offer, because when you say repair degree, I'm actually going to reach out to those nodes not so many times; unlike in Reed-Solomon, I'm going to reach out only like once instead of four times, okay? So that way it increases the MTTDL. So your LRCs drastically improve your MTTDL. So this was
Starting point is 00:37:09 Facebook's, this was Microsoft's, and the intention was to get these closer, and these guys actually did, right? So coming to regenerating codes, it's almost the same: it's the same storage overhead and your code length here, so the idea is to get them closer and increase this. So, triple replica, pentagon and heptagon are the MBR codes I spoke about. The heptagon local is the local regenerating code, which actually lets you increase this, right? It increases because I'm actually reaching out to that node many fewer times than I would have to with other codes. So this boost actually resulted from my LRC properties. The repair degree is low. So because of the heptagon local, because it's local, this one is improved so much.
Starting point is 00:38:00 In general, your regular regenerating code will let you get this one closer. If it's an MBR code, your bandwidth will come down a lot faster. If it's an MSR code, your storage overhead will get closer a lot faster. But the idea is, you get what I'm trying to say. If I have to compare it with RAID plus mirror, your MTTDL still goes for a toss. Which means that, so my intention in putting up this slide with RAID is that these regenerating codes are not just applicable to distributed systems like Hadoop and Ceph and stuff; they are also applicable to your regular storage arrays, because your MTTDL is still
Starting point is 00:38:45 a lot better than your regular RAID. But you're talking about this coming closer; you're not talking about 2x and all, because your MSR code will actually get your storage overhead low, but you're reading a smaller number of blocks every time, because it's a regenerating code; you're actually reading only a small number of blocks, and that way the lifetime actually improves. But I must accept that it's not as easy as I'm making it sound. So the thing is, Reed-Solomon codes have been there for like 45 years; they're so optimized your regular CPUs can get the work done. But your MSR codes or regenerating codes are not computationally efficient, which means that we are operating at a higher Galois field; your field size is a lot more than what your regular CPUs can
Starting point is 00:39:28 offer. So in terms of performance it's not so good, I must accept, okay, but it's there, it's almost there. The theory is ready; people know that you'll have to optimize on the field size, people know that they'll have to get the rates down, the high-rate codes should come to low rate, and things like that. So people are working on it, but the idea is that regenerating codes are kind of trying to replace Reed-Solomon codes altogether. All right, so I'm not going to go through the whole thing; let me take this one specific row, and you will get an idea of what I'm trying to do here. So take a system, a general-purpose storage array, a geo-distributed system or whatever, and you take the properties of
Starting point is 00:40:04 the system: what's my most important property, what's the least important property, what do the codes offer, and what are the requirements from a code? Okay, so if I take a geo-distributed system, my most important case is that it's geo-distributed; any repair over WAN is going to be so expensive. I can't say my data center went down, I'm going to copy all my data over WAN, right? That makes no sense at all. So repair over WAN is damn expensive. That's the most important property. And the least important property: storage overhead across DR sites. Actually, what I'm doing is replication.
Starting point is 00:40:35 I have two copies of data. Anything better than that is good for me. But still, that's the least important property I would be worried about. And what are the requirements from a code? I want local repairability. If I can repair within my data center, that's beautiful. Today I don't have a solution; I do Reed-Solomon, or maybe I'm just replicating. But now I have the option of LRC. So if I want local repairability, I have LRC; I don't have to go to Reed-Solomon or replication. So I can use a locally repairable code. The least important property, again, is storage overhead. I don't care about it, because the worst case is going to be my Reed-Solomon or replication. So anything better than that is good, and that's my expectation of a code, and if these are the properties, it's just local repair and storage overhead; sweet, it's just LRC there. Okay, so my goal is that.
Starting point is 00:41:15 properties it's just local repair and storage overhead, sweet it's just LRC there. Okay so my goal is that. So you take a system and see what's the most important property, what's the least important property and what's the expectation from a code and then fit it right in. Earlier we didn't have this choice because we don't have these many trade-offs to work with. We had only storage overhead and reliability, right. You give up on one of these but now we have so many of these and naturally you can choose one of them. So who works on all of these areas?
Starting point is 00:41:48 That way you know whom to reach out to if you have to. So again, the same logic: anything above the line is theoretical work, you might not be really interested; anything below the line is systems. A lot of folks work on the theory. Here's a guy who actually came up with this idea of regenerating codes, who got inspired by network codes and brought it to storage: Alex Dimakis. And he works closely with Facebook to get Facebook photos and Facebook's data centers optimized with all these beautiful erasure codes. And he's an expert in regenerating
Starting point is 00:42:18 codes. And Berkeley; they were the ones who actually came up with, they together came up with this idea of regenerating codes. Most of these are trying to optimize on the repair traffic, the amount of data you need to read for a repair. And I mean, I'm not talking about 20-year-old work; these are all like the past three years, '13, '14, '15. And here's the guy whom I work with in Bangalore, from IISc. So he was the one who proved that you can't have codes anywhere other than the MSR and the MBR points. You can't have codes anywhere else. You either get an MSR code or you get an MBR code. And the MBR code that we constructed, the double replication code; we called it double replication because you have two pentagons or two heptagons. You have two replicas already for Hadoop, so we
Starting point is 00:43:07 essentially use this double replica code for Hadoop. So we know Hadoop uses a minimum of three replicas; you can go all the way to seven replicas or whatever, based on your data center. So in Hadoop, for ease of reconstruction and for ease of running an MR job on top, you want data in the plain, right? You want a systematic code. You want two plain copies of data, not just for availability, for reconstruction; it's also for parallelism in your MR job. If you have an MR job and you want two copies, you need two real copies,
Starting point is 00:43:37 not some parity encoded form, right? So you need two copies, and we used Pentagon there. You have two real copies there. And you have a parity to actually back your failure scenario. So it's better than three replicas for failure, but it's not as good for parallelism. We figured that not many MR jobs are actually run for three copies. It's fine.
Starting point is 00:43:56 So we used that double replica on HDFS. We tried to implement it. And we got some beautiful results. We published it at HotStorage. Yeah, so that's the work that we do with him. And here's a guy who works, I think he's still with Microsoft; he was in the Bay Area, so I don't know if he's still around with Microsoft. But yeah, he was the one who constructed the pyramid and locality codes. He's a pioneer in LRCs. Those are guys who work on regenerating. He works on
Starting point is 00:44:23 regenerating. He works on LRC. She works on network coding, but she also claims she works on storage. Folks at NTU Singapore, they also work on something called self-repairing codes, which is like self-healing codes within a disk. Not much of it is in systems; a lot of it is in theory. In systems, James Plank, I'm sure many of you know, he's the one who developed this thing called Jerasure, a library for erasure codes. He also tried to optimize the SIMD instructions underneath for erasure codes.
Starting point is 00:44:59 So he got a library called GF-Complete in 2013, where he tried to increase the field size so that all those beautiful codes can be implemented. So he works a lot closer with Intel to get these erasure codes into practice. And he works with a lot of OpenStack projects, EMC, and a lot of companies. So another guy, from the Chinese University of Hong Kong, is also a pioneer in getting this theoretical stuff into practice. And STAIR codes is one example, very similar to what we heard in the morning.
Starting point is 00:45:33 The STAIR codes are something where he had row codes and column codes. You have a row parity and a column parity, and you have some parities out of the row and column parities, and you use these STAIR codes essentially for trying to repair a disk-level failure and a sector-level failure with one code. Isn't that interesting? So you have sector failures normally being taken care of by the disk, and you have disk failures normally being taken care of by RAID. Now you don't talk about two different systems working separately in silos. You're talking about one code which can actually reconstruct one node-level failure and two sector-level failures. And one of these sector-level failures could be a three-sector failure
Starting point is 00:46:13 together. Right? So STAIR codes, that was in 2014; he tried to bring these row and column parities together for a slightly different purpose. And there are a bunch of startups and companies who work on this, most of them on Reed-Solomon and some of them on fountain codes and such. And that's an interesting one: Cleversafe, I'm sure some of you would have heard of it. Cleversafe is a company who does information dispersal; essentially, dispersal is your erasure codes, and when you do not put a systematic code inside it, if it's non-systematic, it's going to be in some other form, which means you get security there, right? So Cleversafe,
Starting point is 00:46:55 so it's secure storage where you use information dispersal, erasure coded with a non-systematic erasure code, where your original data is not in its original form but in a different form, thereby getting theoretical security, because you don't really use an encryption algorithm there. You just use an erasure code algorithm, but it's a non-systematic code, so it's not in its original form. And you've dispersed it. So you lose one disk out of 100 disks.
Starting point is 00:47:16 but it's a non-systematic code, so it's not in its original form. And you've dispersed it. So you lose one disk out of 100 disks. You're not essentially going to gather any information out of it. That was clever. Some of them work on it. Related areas, I told you about the sector disk level failure. People are working on it, trying to get that improved.
Starting point is 00:47:36 This is interesting. PMDS stands for partial MDS codes. So people are willing to accept partially MDS codes; they're not strict on that MDS requirement. SD was the previous work to STAIR; it stands for sector-disk failure codes. And it came out of the same Chinese University of Hong Kong. And yeah, for different media: flash is the most spoken about in this space as well.
Starting point is 00:47:59 So, how to improve LDPC codes. There was a proposal in FAST 2014, I believe, called WOM codes, multi-write codes, NVM. So all of these are being looked at, trying to optimize these codes for flash; not essentially taking Reed-Solomon and putting it inside, but trying to get all these LRCs and turbo codes and fountain codes and regenerating codes optimized for the media that we're talking about. Security: I told you about dispersal, yeah, for cloud. And yeah, there are codes called transformational codes. When data
Starting point is 00:48:31 moves, I mean, you talk about data lifecycle management, where your data moves from a hot tier to a cold tier, okay. Now when the data moves from the hot tier to the cold tier, you have a lot of other properties to worry about. You don't really care about storage efficiency; I mean, you care a lot about storage efficiency in the cold tier, you don't care about storage efficiency in the hot tier, but you care a lot about performance. So assuming I have three copies of data in my HDFS cluster for MR, as the data gets cold I'll have to somehow take these three copies and put them in Reed-Solomon and keep it at 1.2x or
Starting point is 00:49:03 whatever. I need some level of reliability there, but I can't afford to keep it at 3x, right? But the problem is, if I reduce from, let's say, 1.8x to 1.2x, these days what we'll have to do is take that 1.8x code, decode all the data, petabytes of data, imagine, and then re-encode it all into that 1.2x code and then feed it in. So that's not beautiful, right? You have to kind of decode all of it and encode all of it. Imagine the amount of time and data and effort that's required. So these codes are looking at: how do I just delete some parities and get 1.2x?
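As a rough illustration of what that saves, using the talk's 1.8x and 1.2x figures and an assumed stripe of k = 10 data blocks (the block count is made up just to make the arithmetic concrete):

```latex
% Illustrative only: k = 10 data blocks per stripe is an assumption, not from the talk.
k = 10:\qquad
\frac{10 + 8}{10} = 1.8\times \;(\text{hot tier})
\;\longrightarrow\;
\frac{10 + 2}{10} = 1.2\times \;(\text{cold tier})
```

Re-encoding means decoding and re-encoding all 18 blocks of every stripe; the transformational-code idea is that simply dropping 6 of the 8 parities should be all it takes.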
Starting point is 00:49:43 That way, improve on storage efficiency, get a little reliability. So how do I work on all these parameters without much of a decode-encode kind of phase? That's pretty much what I had. I'm going to take questions. No questions? Yes. So you mentioned the Galois field is different. For Reed-Solomon? Yes. The symbol, the alphabet. So your SSE, it was supporting 2^8; these days, AVX2 supports 2^8 too. They're extending it to 2^16. But what we are talking about is 2^64. Yeah, but there are codes that are constructed within 2^16,
Starting point is 00:50:28 but they don't give you the same level of freedom that a proper regenerating code will give. Okay, so currently it's 2^16. We're trying to operate, I mean trying to get all these codes to 2^16 too. Questions? Yeah? Was everything you talked about GF(2)?
Starting point is 00:50:54 Were they looking at higher order? Yeah, the XOR codes are GF(2). And people are okay not working with XOR codes. So these codes that we're talking about, regenerating codes, they're operating at 2^64. Yeah. And we are okay with 2^16, because the AVX2 or AVX instructions that Intel provides underneath your regular Haswell processors give you AVX2 instructions.
Starting point is 00:51:15 So if you can get it to 2^16, that's good enough. And I mean, RAID 6 is not XOR, so we don't really care. So nobody is stuck with XOR anymore. Other questions? So LRC is basically a... you can put Reed-Solomon together? Yeah, it's just Reed-Solomon, which means existing libraries, existing systems just work. It just has to be reorganized in a different way.
Starting point is 00:51:45 Yeah. So it's hierarchical. So essentially, every write, when you actually need to update a parity... Yeah. Your write updates are... Yeah, but it's no different from regular Reed-Solomon. If you were to arrange all these parities in a row form, it's the same, right? Yeah, so I get that. Yeah. So people are fine with the penalty; people are fine even with going for a non-MDS code. LRC is a non-MDS code. That was a deal-breaker at the time. But after getting all these benefits, when Azure claimed that they'd saved millions of dollars just by shifting the code from Reed-Solomon to an LRC, now people are willing to believe it.
Starting point is 00:52:32 More questions? Yeah. You talked a lot about different codes and the X and Y dimensions of bandwidth versus efficiency of storage. What about latency? So the new generation of, you know, non-volatile memory is going to be faster, and it can look on as possibly,
Starting point is 00:52:54 and so it's somewhere in between, and you would want error correction codes that can be closer to those kinds of latencies. So are there such codes? So these codes have not been looked at for anything underneath your FTL. So, underneath the FTL, we have LDPC codes and whatnot, right? So these codes are not looked at from that dimension yet. But, I mean, these are not any worse than RS when it comes to latency from a media perspective.
Starting point is 00:53:27 Yeah, we're done. And we need to stop. Sure. Thanks a lot. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list
Starting point is 00:53:41 by sending an email to developers-subscribe at snia.org. Here you can ask questions and discuss this topic further with your peers in the developer community. For additional information about the Storage Developer Conference, visit storagedeveloper.org.
