Storage Developer Conference - #33: Modern Erasure Codes for Distributed Storage Systems
Episode Date: February 21, 2017
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcast.
You are listening to SDC Podcast Episode 33.
Today we hear from Srinivasan Narayanamurthy,
member technical staff with NetApp,
as he presents Modern Erasure Codes for Distributed Storage Systems from
the 2016 Storage Developer Conference.
Good afternoon, everybody.
So my name is Srini.
It's a long name, but you can call me Srini. I'm from NetApp.
So I'm going to talk about Modern Erasure Codes.
So why is it modern?
So why should I even talk about modern erasure codes?
So what's wrong with the old codes? So the problem is we've been using codes like RAID, or the underlying codes inside RAID like Reed-Solomon,
which was designed back in the 1960s. That's a good 55-year-old code,
and we're still using these Reed-Solomon codes day in and day out.
So the idea is to look at things ground up, thinking from first principles,
and we wanted to see what sorts of codes could be useful not just for disks, what are the
codes that we can use for flash, what are the codes that we can use for distributed
storage systems and a lot more. So everything around is changing. I mean we're talking about
the data deluge, we have huge capacities, densities of disks increasing, disk transfer rates are increasing.
And essentially the problem with RAID and old techniques is that the time that's required to reconstruct a failed disk is going to be a lot more for increasing capacity.
And if you're going to talk about increased time for recovery,
that means more exposure to failure,
a longer window in which you can have more failures.
So that's the problem with existing reconstruction techniques.
And we also know that the storage technologies are changing.
We're talking about scale-out, distributed systems, cloud,
converged, hyper-converged, and whatnot. Media is changing, and we're also looking for features like geo-distribution, security, use of commodity hardware, and things like that. So how can all of these be served with old techniques like Reed-Solomon? So though we try to use Reed-Solomon and construct different forms of erasure codes out of it, they're still not up to the mark.
And we definitely can't use RAID for all of these various requirements.
So we do have certain newer dimensions of erasure codes.
Like, essentially, if you're talking about an erasure code, we're talking about failures. And to be prepared for those failures, we add more parities. That way, we add a little bit of storage overhead. I mean, we give up a little bit of storage efficiency, and thereby we gain reliability.
But that's not the only trade-off that we're talking about anymore.
Now, when we talk about distributed systems, we have network coming in.
So what happens to the network bandwidth utilization?
What happens to the number of nodes that I'll have to reach out to if I have to reconstruct?
So I don't want to reach out to every little disk out there to reconstruct like one byte of data.
So that's something that we can reconsider.
So the optimality dimensions are really changing.
So we're not talking about just the disk world where we have like 60 disks packed in an array,
have RAID on top of it and you kind
of have a RAID group and say one disk failed, I'm going to read five other disks and reconstruct
that one disk. That's not the case anymore. We've got the entire scheme of things changing
altogether. So I've organized the talk this way. I'm going to give you a little background, an Erasure Code 101. It's more like what an (n, k) code is; we'll talk about (6, 4), (9, 6), what an (n, k) code is, because those are the basics before I even get into some modern codes like locally repairable codes, regenerating codes
and things like that. So essentially locally repairable codes means they put codes on top of
codes. Okay, so you have a Reed-Solomon code, just one code, and then you have another code on top, and another code on top. So how does this hierarchy of codes solve problems? That's the locally repairable code. The other one is
regenerating codes, which essentially came from network codes, basically from the networking community, and regenerating codes are actually codes that are designed for storage systems with the newer dimensions in play. So those are the modern codes that I'm going to talk about. There was a little bit of a deviation; there were
some fountain codes and other stuff. I'm going to touch upon it but that's not
important for my presentation because I'm more concentrated towards the
distributed storage system side of things and I'm not talking about LDPC
codes and other codes that are underneath FTL and things because
that's not for distributed systems as of yet. Okay so and this is more like 101, this is like what's the current state and the
next one is I'm trying to analyze all of these codes and see what's relevant
where. Okay so like if I have a flash array, which of these codes will be
useful? If I have a geo distributed storage system, which will be useful? So
it's a little bit of technical analysis of whatever codes we learn and then talk about some literature and who are the players in this area.
So as I said, I'm going to start with the background. So essentially we started with classical codes,
which includes Hamming code, Reed-Solomon codes, all those that we read in books,
which were essentially from the 1950s and 60s. And after that, for about 40 years, nothing happened. Then in 2000, fountain codes came, and then these locally repairable codes came somewhere around 2010 or 2011. That's like a 50-year gap, and suddenly there are hundreds of papers on locally repairable codes and regenerating codes. There are products on LRC: Microsoft uses LRC, Amazon uses LRC, Facebook uses LRC, everybody uses LRC these days. And there are these
regenerating codes where people are trying to venture into as well. Okay, so what's the
difference here? So typically classical codes were actually talking about storage overhead
versus reliability as the trade-off. So you give up on storage overhead, you gain reliability. Okay, that's the essential trade-off. Fountain codes brought in this
concept of increased performance. So you give up on storage overhead, you add a
little bit more parity, I'm going to give you a lot more performance. Okay,
performance was another trade-off. And then locally repairable codes brought in
this other angle of repair degree. Like places like SMR. So I don't want to go touch on
every little disk and every little block if I'm going to reconstruct just one byte of data. So
repair degree where the degree means the number of nodes I reach out to for reconstructing a block
of data. So the number of nodes is the repair degree. So that's another dimension. And the last
one, obviously the most talked about is the repair bandwidth. So if I have to reconstruct 1 MB of data, how much MB of
data am I going to flood on the network? Is it going to be 6 MB? Is it going to be
1 GB? I don't know. Okay so that's repair bandwidth and this is the most
talked about code, the regenerating codes because it's very relevant to
distributed systems over the network, over WAN and things like that. Okay, so the timeline, so as I
said, the classical codes started in the 50s and 60s. For RAID, the last RAID code that was constructed, RAID 6 in Linux, was in 2002; RAID-DP was in 2004, and that was probably the end of it. And then in 2015 there was this little work, I think from
UCLA on repairing Reed-Solomon, how to improve
the repair performance of Reed-Solomon. So basically the problem with Reed-Solomon is this,
actually I'll go to the next slide and explain that problem.
So after this classical codes came the fountain codes. So 1998-2002 is when we had all these codes. So there was this Tornado code, the person called Luby and he came up with
this idea of fountain code which was essentially a rateless code. So when you
say (n, k), we have a rate there. Rate essentially means, if I'm talking about a (6, 4) code, 6 is the total number of disks and 4 is going to be my data disks, with two parity disks. The rate is nothing but a measure of the amount of storage overhead you have. But these fountain codes are rateless codes, basically, so you don't have a rate. They came from networks, where you have a fountain of code symbols coming in; you have a bucket, you capture the amount of data that you want to reconstruct, whatever you need, and you leave the rest. The fountain basically just keeps throwing out code symbols; that's the name, fountain code. You don't have a rate at all, you don't have a fixed dimension to
the code. So that was the fountain code and implementation of it. So basically the ones on
the top are theoretical stuff and the ones at the bottom are more systems and practical codes. So from the fountain code theory they built Raptor codes, which essentially stands for rapid tornado code, and now I think it's owned by Qualcomm. And these are only a little relevant to our case, because people mostly tried experimenting with them underneath the FTL for flash ECCs, and not much more. So then came this interesting idea of locally repairable
codes. So basically, people tried to experiment with what happens if I put codes on top of codes, and how does it even work? So in 2011, there was this pioneering work from Microsoft Research.
So what they did was they established a common theory based on these two codes,
pyramid and hierarchical codes that they constructed.
And they said, what if I have a local group and a global group? And these local groups can actually reconstruct within that group; I have a parity inside and I can reconstruct there. And if I have more failures, I'll probably go to the global group and actually reach out for more parities and read. So that was the idea, and this was just the theory, and they went ahead and implemented it in Azure, and they claim that the code essentially saved millions of dollars for them. So I'm just calling it MLRC, which stands for Microsoft's LRC, Microsoft's locally repairable code. And Facebook did the same idea on Facebook's photo storage; they published a paper called XORing Elephants, about Xorbas, in VLDB 2013, and that was also a local code. So I mean, now people are trying to do regenerating
and they're trying to combine regenerating codes with these locally repairable codes. So, as I said, the regenerating codes came from the idea of network codes. In 2010, a person from UT Austin, the University of Texas at Austin, came up with this idea: what if I draw some ideas from network codes and put them in storage? I'll explain the actual concepts later. But the idea is to draw some ideas from networks and put them in storage, and it actually helped save the amount of bandwidth you utilize for repairing failed nodes.
So that was in 2010 and people slowly started
implementing these codes these days and another work was to try and combine
locally repairable codes with regenerating codes. So how do I combine
this idea with that idea and that was local regeneration.
Yeah, and then in 2015 there was a FAST paper, in 2016 there was a FAST paper, all on
regenerating codes.
Yeah, so just a bit of 101 kind of slides, you can doze off if you want after the lunch.
So classical n, k codes, so I'm taking an example of 6, 4.
So we have a file or an
object, you chunk it into k blocks and then you encode it; you have some parity blocks, and that's essentially n encoded blocks, where k of them are the actual data and n minus k are your parity blocks. And when you have a failure, so essentially when you say you have two parity blocks, it means
that you can repair two failures. So if you have two failures you end up reading
the rest of them and then construct the actual data. So you have the four
surviving blocks you decode it, that's essentially decoding and in order to
repair what you typically do is now that you got the data back you again
re-encode it, find those missing blocks and then reinsert it.
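To make that (6, 4) flow concrete, here is a minimal sketch in Python, assuming one XOR-style parity and one toy linear parity over plain integers; a real Reed-Solomon implementation would use Galois-field arithmetic and a proper generator matrix, but the encode / lose / repair shape is the same.

```python
# Toy (n, k) = (6, 4) systematic code: 4 data blocks, 2 parity blocks.
# Illustrative only: real Reed-Solomon works over a Galois field, not plain ints.

def encode(data):                       # data = [A, B, C, D]
    a, b, c, d = data
    p1 = a + b + c + d                  # first parity: the "XOR-like" sum
    p2 = 2*a + 3*b + 4*c + 5*d          # second parity: a linear equation
    return data + [p1, p2]              # n = 6 blocks, the first k = 4 are data

def repair_two(surviving):
    # Suppose blocks A and B are lost; C, D, P1, P2 survive.
    c, d, p1, p2 = surviving
    # Solve:   A +  B = p1 - c - d
    #         2A + 3B = p2 - 4c - 5d
    rhs1 = p1 - c - d
    rhs2 = p2 - 4*c - 5*d
    b = rhs2 - 2*rhs1                   # eliminate A
    a = rhs1 - b
    return a, b

blocks = encode([7, 1, 4, 9])           # [7, 1, 4, 9, 21, 78]
a, b = repair_two(blocks[2:])           # repair reads the 4 surviving blocks
print(a, b)                             # 7 1
```

The point to notice is the cost: repairing even one lost block this way still means reading k surviving blocks and redoing the arithmetic.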
Essentially the problem is repairs are expensive if you think of this in a distributed system.
If this is going to be happening inside a storage array, that's fine.
Now if you're talking about reading, so if I have to reconstruct these two blocks,
I'm essentially going to read 1, 2, 3, 4 blocks into one machine and actually reconstruct the whole thing and distribute all of these all over network.
And you can now think each of these blocks is like a gigabyte or something.
Now you're talking about a 2-gigabyte failure, you're putting 4 gigabytes over the network, you're computing on 4 gigabytes to get the missing data back, and then you're transferring 2 more gigabytes to the actual nodes, and that's huge. You don't want to do that on the network.
For just a 2 node failure. Essentially it might not even be a 2 gigabyte failure, it
could be just 1 MB inside the entire gigabyte file. So yeah, the point is in distributed
systems such repair is not going to help. And this essentially is the basics of Reed-Solomon codes.
So n, k basically is a code where these two are going to be two parities,
out of which one parity is probably an XOR.
So you probably do A plus B plus C plus D, or A XOR B XOR C XOR D, as the first parity. And the next one is probably a linear equation over these; it could be 2A plus 3B plus 4C plus 5D. And that one also requires some compute, because you'll have to solve linear equations; it's not mere XORs. So the problem is that repair is not efficient.
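Written out, those two illustrative parity equations are just the extra rows of a generator matrix applied to the data vector; a real Reed-Solomon code would draw these rows from a Vandermonde or Cauchy matrix over a Galois field, but the picture is the same:

$$
\begin{pmatrix} A \\ B \\ C \\ D \\ P_1 \\ P_2 \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 \\
2 & 3 & 4 & 5
\end{pmatrix}
\begin{pmatrix} A \\ B \\ C \\ D \end{pmatrix}
$$

Decoding after two erasures means keeping the four surviving rows and inverting that 4x4 submatrix, which is exactly the linear-equation solving being described.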
So then came these modern codes, essentially LRCs and regenerating codes. So for LRCs, the idea was, as I said, they went with these hierarchical
and pyramid codes, they came up with this concept of what if I have two data
blocks and then I construct a parity and then I have few more parities at this
level, I construct parities out of these parities and parities on top of it.
Okay, what's the advantage of this? If I have one failure, I don't have to go read all of
these blocks to construct that one failed block. All that I'll have to do is
go read this parity and this block, and I'll get that one back. Okay, so if it was Reed-Solomon, if it was a plain one-level Reed-Solomon where you had, what is this, 8 data blocks, so a (9, 8) code with one extra parity, and that one block failed, I probably would be reading from 8 blocks to reconstruct that one node, right? Instead of those 8 blocks, they're going to have just 2 blocks being read to reconstruct that one. Now if I have more failures, so if I have
two failures now this is not helping because this essentially is a 3,2
code and with the 3,2 code I can only reconstruct one failure. I can't reconstruct two failures. So
when I have two failures, I go one level up and use that parity and then get it. Now if I have
three failures, then you probably have to go one more level up to actually get that.
But the essential idea is just by organizing these parities in an interesting way,
you can solve the same problem of three failures,
but a lot more efficiently because you're not going to read all the data.
You're not putting all of them in the network.
And an interesting point to make is that people at Microsoft and Facebook did not just take this idea of hierarchy or pyramid as-is. So hierarchical is just constructing parities bottom up, and pyramid is just constructing parities top down; it doesn't really matter. But the interesting point here is that this construction could be reorganized a little differently, and they came up with somewhat more efficient constructions from that. So one of them is MLRC,
which was by Microsoft Azure team. So what they did was they said these blue lines are the linear
equations. So you have data, data, data, data, data, data. This is a parity constructed out of these three blocks, this is a parity constructed out of those three blocks, and these two are parities constructed out of all of them. Okay, so now if you have a failure, one failure, you use these three to reconstruct that one failure. Essentially this could be an XOR; it's just one parity here, and this is another parity, and it could be an XOR too, which means it's a lot of performance. Okay, you're not talking about solving linear equations and taking time on the compute; I mean, there's no latency there. So that's one node failure. Now you have two node failures. You have one linear equation here, another one here, maybe two XORs here, and you're going to solve that. It's just two failures.
Now if you have four failures, worst case.
So if you have two failures, one here and one here, it's again simple.
It's the same logic because there is a local code and there are global codes.
You're essentially using only two local codes.
Now if you have four failures, that could also be solved.
Same idea, you can have four.
I mean, the worst case is what Reed-Solomon is offering.
Worst case is you have four parities, you can solve four failures.
And the four parities are only distributed in a different way.
That's all. That's the only change here.
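Here's a minimal sketch of that locally repairable layout, assuming a (k, l, r) = (6, 2, 2) arrangement with XOR local parities and toy linear global parities; it shows the shape of the idea, not Azure's exact construction.

```python
# Toy LRC layout: 6 data blocks in two local groups of 3, each group with an
# XOR local parity, plus 2 global parities over all 6 blocks (illustrative).

def xor_parity(blocks):
    p = 0
    for b in blocks:
        p ^= b
    return p

data = [11, 22, 33, 44, 55, 66]
group1, group2 = data[:3], data[3:]
lp1, lp2 = xor_parity(group1), xor_parity(group2)   # local parities
gp1 = sum(data)                                     # toy global parity
gp2 = sum((i + 2) * d for i, d in enumerate(data))  # second toy global parity

# Single failure: rebuild data[1] by touching only its local group (3 reads),
# instead of the 6 reads a flat (8, 6) Reed-Solomon layout would need.
lost = 1
survivors = [b for i, b in enumerate(group1) if i != lost] + [lp1]
rebuilt = xor_parity(survivors)
print(rebuilt == data[lost])            # True -> repair degree 3, and pure XOR
```

The global parities (gp1, gp2) only come into play for the rarer multi-failure cases, which is exactly the reorganization being described.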
So then Facebook came up with this idea where Facebook ran some analytics on their data center
and figured out that about 98% of failures are single-node failures. Okay, 98% of it is just single-node failure; only 2% of it is more than one node failing. Now why do I optimize so much for a case which is only 2% of my probability? Okay, like what Jeff has been talking about in the morning: you have so many parities,
you can solve so many of them. But the problem is that's not going to happen every day. That's
probably going to happen like once in five years, worst case. So why do I have to have so much of
update complexity, so much of storage overhead and so many of these things for that once in a blue
moon failure, right? So what they did was say, I'm going to optimize the code for the one-node failure; I'm just going to optimize the heck out of it. Okay, so what they did was, I have a (6, 5) code. Okay, there's a (6, 5) code; it's a long code, unlike this, where you have a (3, 2) code. You have a (6, 5); the length of the code is quite long for a code. And then you have another (6, 5), and then you have enough parities: one could be an XOR and a few could be linear equations, and you have global parities on top. That's like 1, 2, 3, 4, 5, 6, 7; it's like 7 failure cases could be handled here. It's just that they've organized it a little bit differently. So essentially you
could think of this as a (14, 10) code, if you forget all of this reorganization and look at it in a regular Reed-Solomon way. But you could also think about it this way: because it's reorganized so much, if I have one node failure I can reconstruct here, another node failure there, two node failures independently, still within local groups, and if you have global failures you can take the global part. So essentially the (n, k) logic of erasure codes, they've now changed it to some (k, l, r), where k is the number of data blocks, l is the local parities and r is the global parities. So six data blocks, 2 local parities and 2 global parities.
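As a quick sanity check on those parameters (my arithmetic, using the talk's (k, l, r) = (6, 2, 2) example):

$$
\text{storage overhead} = \frac{k + l + r}{k} = \frac{6 + 2 + 2}{6} \approx 1.67\times,
\qquad
\text{single-failure repair degree} = \frac{k}{l} = 3 \quad (\text{vs. } k = 6 \text{ for a flat MDS code})
$$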
So the equations are changing a little bit. And the other problem that comes with these LRC codes is this. There's this concept called MDS, which is maximum distance separable. Like if you have a (6, 4) code, I have 2 nodes' worth of parities being stored, which means I should essentially map it to 2 nodes' worth of failure. So the amount of storage overhead I incur should match the amount of failure that I can repair. Okay, that's what's called MDS. Now that doesn't hold here, because
if I have any node failure, if I have one node failure here, this is MDS. If I have this one and this one, this is
MDS, and this is MDS, because that's Reed-Solomon. If I have two node failures here, that's not MDS, because I'm actually using one parity from here and one from there, which essentially is not MDS with respect to this group, right? So this is not a straightforward MDS code. So people were stuck on this idea for the past 45 years, this idea that a code should be MDS, which means if I've got 4 MB worth of parities, I should be able to repair 4 MB worth of failure. And that a code should be XOR: RAID 5 was always stuck on this XOR-based code.
You have one parity, I'll always use XOR because that's my best performance that I could get.
Then people actually got out of it and said, okay, RAID 6: if I have one failure I'll use the XOR, but if I have a second failure I have a linear equation to solve. Okay, so people want XOR, people want MDS, but people are okay to relax now. They're okay with non-MDS codes, they're okay with non-XOR codes; it's all fine, because there is some other axis of optimization, some other parameter they're trying to optimize on.
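For reference, the MDS property being relaxed here has a compact statement (standard coding theory, not specific to any one construction): the Singleton bound says an (n, k) code has minimum distance

$$
d_{\min} \le n - k + 1,
$$

and a code is MDS exactly when it meets this bound with equality, which is what guarantees that any k of the n blocks are enough to recover the data, i.e. any n - k erasures can be tolerated.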
Okay, so these are locally repairable codes, and they are not trying to optimize only on storage overhead versus reliability as the trade-off. They're also talking about repair degree, which is the number of nodes you reach out to for a repair; maximal recoverability, for as many failures as you want; and locality or minimum distance, where you optimize for that one-node-failure case. Okay, so the trade-offs are a little different; it's not just storage overhead anymore.
Now come regenerating codes. As I said, these were inspired by network codes, from the concept of information flow graphs and the min-cut bound. So what they said was, if you have a source and you have a data
collector, this is a network coding diagram and I essentially
don't want you to understand the whole thing, this is just to give you a picture
of it and I'm going to map it to a storage terminology which will be easy
for us to understand. So you have a source and you have some symbols generated from it, and these symbols are transferred somewhere, and you have a loss, like this node, or rather this node failed. Now if you have to reconstruct this data back, you read out some of them and then reconstruct it for the data collector to read, which means this data essentially gets translated into a different form of data that you gather from the remaining surviving nodes, and you give it to the data collector.
But the idea here is that's essentially what your regular erasure code also does.
If one node fails, you read from the rest of them and you actually give it to a guy so that he can reconstruct it.
But the difference here is they have this thing called a minimum cut bound, where I'm not going to read from everywhere; I'm going to define a boundary saying that I only need one node to be read to actually get you the data that was lost, but in a translated form. Okay, I don't want to have it in the same form. So x4 is what was lost; I don't want to reconstruct x4 exactly in the way it was lost by having to read all of these. I'm going to translate it into x5, which I can gather from just x1, by reading only one symbol instead of reading all three symbols, and then get
much data that's required for that reconstruction. Okay, so I don't know if it's visible, is it? Okay, so if you have a file the same
way that we tried to understand Reed-Solomon: if you have a file and I have four blocks A, B, C, D, you essentially have the data blocks A, B and C, D stored on two nodes, and then you have some parities here. This one is an XOR, A plus B plus C plus D, and then you have a linear equation and three other linear equations. You
have data blocks, parity blocks. Now imagine you lost one node here,
this node is lost. Now essentially in the Reed-Solomon way what we will be
doing is to read all the surviving ones, you reconstruct the data back into the
form that it was lost.
The lost data is a plus 2b plus 3c plus d and you will try to reconstruct that particular
equation in the same way. That's not what these guys said. I'm going to repair it to
something called as a functional repair form. A functional repair will let me reconstruct this
data back but I'm not going to repair it to the exact form
that I lost it. I'm going to reconstruct it to a totally different form but I
know that those equations will let me calculate this one back. So you
don't read 1, 2, 3, 4, 5, 6 equations or 6 data blocks now, you only read 1, 2, 3
out of it but you actually use these nodes
to compute something inside. So you have A and B, you compute me A plus 2B and
give me a P1. So instead of transferring 2 MB worth of data,
2 blocks worth of data, I'm going to only transfer 1 block worth of data from
this node, 1 block, 1 block, it's just going to be 3 blocks worth of data. So
comparing against Reed-Solomon, it would be 1, 2, 3, 4, 5, 6 blocks worth of
data, now it's going to be only three blocks worth of data, thereby 50%
reduction on your network bandwidth and then reconstructed
to a form completely different from this. So this reads as a plus 2b plus
something, this reads as 5a plus 7b plus something, I don't care about it.
It's in a different parity form, but I know that
these equations will let me calculate my lost data. That's called functional repair. The earlier one, where I tried to reconstruct it back to the original form, is called exact repair.
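Putting numbers on that example (my arithmetic, following the layout above of 2 blocks per node and 3 surviving helper nodes):

$$
\gamma_{\text{Reed-Solomon}} = 3 \text{ nodes} \times 2 \text{ blocks} = 6 \text{ blocks},
\qquad
\gamma_{\text{regenerating}} = 3 \text{ nodes} \times 1 \text{ combined block} = 3 \text{ blocks},
$$

which is the 50% reduction in repair traffic mentioned above.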
But my problem is this. The erasure codes that we use in storage systems are supposed to be what's called systematic. Systematic means your data blocks have to remain as data blocks. They should not be in some encoded form.
Because you store A and you wanted to store A in your disk, not
some A plus B in my disk. Because if I have to read back, I just have to read A
instead of recomputing what that A plus B comes out to. So because of my
read throughput, I need
systematic blocks, which means my blocks have to be stored in their original form, not in some encoded form. So this functional repair helps if I have to recalculate parities, but it doesn't help if I have to recalculate data blocks. So imagine this one is lost, and because of all these gimmicks that we just added, you have some parity calculation and you transformed A and B into some 4A plus B and 6A plus C or whatever. That's not going to help, because my read throughput is not going to be the same. Okay, your latency is going to be bad. So the idea is you want to have a systematic regenerating code, where these are exact repairs and these are functional repairs. You got it? Are you with me? So, I mean, you want a code with systematic exact repair for data and functional repair for parity, and that's essentially what people are building now.
Okay, so the advantage is your data blocks still remain in your original form,
which means if I have a read coming, I just have to go read the data.
I don't have to recompute with all the parities that's there.
But when it comes to parity blocks, because of this,
the idea of reducing the network bandwidth,
I'll have it in some other form.
So that's just a
gist of regenerating codes. Let me go one more step, double-click and see what the types of regenerating codes are, and how they differ a little bit or not. So I'm going to take one code as an example, something called repair-by-transfer codes, where I was also part of the team which constructed this.
So we came up with this idea called a pentagon code.
It's a simple pentagon.
And the idea is like this.
You have an edge.
And you represent an edge with a data block symbol.
I've named it as 1.
And the edge is incident on two vertices, and on each of these vertices I'll have a copy of 1. That's all.
Same way you can see if I have 6 here, there's a 6 here, there's a 6 here. That's all.
So I'm able to actually fill this in with 9 blocks, with 2 copies of it. So I have 9 blocks, so 2 into 9, 18 blocks
and I have 2 spaces left which I'll fill with parity. Yeah, are you with me? Yeah, cool.
So this is a pentagon construction. Now if I have 1 node loss, any 1 node loss, the idea
is so simple that I just have to copy data from the edges that are incident.
So if I lost this one, I probably have 1, 5, 7, and 9 on it.
Just copy those blocks from there.
So single node failure is super.
It's just copy.
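Here's a minimal sketch of the complete-graph placement behind that repair-by-transfer idea, assuming five nodes (the pentagon drawn with all its diagonals is the complete graph K5); in the construction described, nine of the ten edge symbols carry data and the last slot pair carries parity, but the copy-only repair path sketched here is the same either way.

```python
from itertools import combinations

nodes = range(5)
edges = list(combinations(nodes, 2))        # the 10 edges of the complete graph K5

# The symbol on edge (u, v) is stored on both u and v, so every node ends up
# holding the 4 symbols of its incident edges (20 slots = 5 nodes x 4).
storage = {v: {} for v in nodes}
for sym, (u, v) in enumerate(edges):
    storage[u][sym] = f"block-{sym}"
    storage[v][sym] = f"block-{sym}"

def repair(failed):
    """Repair by transfer: every surviving node hands over the one symbol it
    shares with the failed node. No decoding, no arithmetic, just a copy."""
    rebuilt = {}
    for helper in nodes:
        if helper == failed:
            continue
        for sym, block in storage[helper].items():
            if failed in edges[sym]:        # the edge shared with the failed node
                rebuilt[sym] = block
    return rebuilt

assert repair(2) == storage[2]              # an exact copy of what node 2 held
print(len(repair(2)), "blocks moved")       # 4 -> repair traffic = data lost
```

Because the repair traffic equals exactly what the failed node stored, this sits at the minimum-bandwidth (MBR) point; the price is that the storage overhead is stuck above 2x.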
But my problem is storage efficiency.
It can't go any less than 2x.
It's a replica.
It's a real replica. It's not just an erasure code there.
So a Reed-Solomon can go to 1.5x, 1.4x, 1.6x, but this can't; it's going to be some 2-point-something x. Okay, this is 2 already, and I have a parity there, right? So this is called repair by transfer: you repair a node by just transferring data, a repair-by-transfer code. And this falls under the category called MBR, Minimum Bandwidth Regenerating code, which I'll talk about in a while.
So
this doesn't stop with pentagon,
I can have a heptagon, I can have an octagon,
anything, it's just a framework.
And with the pentagon, now, the concept is: this is a regenerating code, a minimum-bandwidth regenerating code. What if I add locality to this? That's the LRC idea here. I can have this not just as a Reed-Solomon parity, but as a local parity, a local parity for this heptagon, and I can have a global parity across these heptagons. You have a local code in each of these octagons or heptagons or whatever, and you have a global code, like how you have an LRC code. Okay, so that was a local regenerating code. So now that kind of brings us to what this MBR code is, and how we even understand it. So we're always talking about storage overhead here, which is 2-point-something x, and what's called repair bandwidth, now
because here your
repair bandwidth for a repair here essentially is 4 into the amount of data
that's like 4 times the symbol. So if I have 4 blocks worth of data being lost
I'm going to copy 4 blocks worth of data which means my repair bandwidth is also
4 blocks. So for MBR codes, the minimum bound basically is the amount of data you lost.
You can't get any lower than that.
So you're talking about Reed-Solomon where for one node worth of data,
you're talking about reading four nodes worth of data,
which will be essentially 16 blocks.
Now you're talking about only reading four blocks, which is the minimum bound.
So you can optimize on the bandwidth you put on the network for a repair; you can have a repair bandwidth versus storage trade-off. MBR stands here because you have the minimum amount of bandwidth you can read; it's just a copy, you don't have anything extra. So that's here. But the problem is your storage overhead is a lot, because your storage overhead is 2-point-something x, which is not even close to Reed-Solomon. So all MBR codes are somewhere
here. And there's a form of code called MSR code, which is a minimum storage regenerating
code, where your storage overhead is exactly the same as Reed-Solomon. Which means you're
not giving up on anything from what Reed-Solomon offers you. So RAID 6, you have Reed-Solomon.
Take out the Reed Solomon,
I put MSR there, no different, okay? It's going to give you the same amount of storage efficiency,
but it's going to give you this bandwidth efficiency that you've got here, which Reed
Solomon doesn't give you, okay? So that's the beauty of MSR. So a lot of people have flocked
here in developing MSR codes or constructing MSR codes.
Not much in MBR codes, though it's easy to construct.
MSR codes are like the talk of the town these days.
Yeah, and there was a theoretical question which asked, why don't we have some codes in between; can we have like a knob which trades storage efficiency against bandwidth, and construct a code there? And it was in 2015 that people proved that you can't have such codes; you can only have codes in this area or in this area, and not elsewhere.
Yeah, so the trade-off again to repeat is storage overhead versus repair bandwidth.
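For reference, the two endpoints of that trade-off are usually written like this in the regenerating-codes literature (a file of size B spread over n nodes, each storing α, with repairs that contact d helpers and move γ in total); these are the standard formulas, not something specific to this talk:

$$
\left(\alpha_{\mathrm{MSR}},\ \gamma_{\mathrm{MSR}}\right)
= \left(\frac{B}{k},\ \frac{B}{k}\cdot\frac{d}{d-k+1}\right),
\qquad
\left(\alpha_{\mathrm{MBR}},\ \gamma_{\mathrm{MBR}}\right)
= \left(\frac{2Bd}{k(2d-k+1)},\ \frac{2Bd}{k(2d-k+1)}\right)
$$

MSR keeps the Reed-Solomon storage overhead of B/k per node while moving far less than the whole file on a repair; MBR drives the repair traffic down to exactly what the failed node stored, at the cost of extra storage.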
Right, so we've got a bunch of codes.
Now let's try to analyze these: what are the trade-offs, what are we giving up on, what's even meaningful for, let's say, a hyper-converged system to use, which code makes sense for, let's say, flash to use, okay, something like that. So I've split it according to the types of codes that I spoke about: essentially the classical codes, the fountain codes, the
locally repairable codes, and the regenerating codes.
So when people are talking about MDS codes, we were always talking about storage overhead
versus reliability.
And so was the case with all your replication and parity codes.
Replication is your Hadoop-like replication, parity is your regular Reed-Solomon and RAID
and everything.
Yeah, Reed-Solomon and RAID.
So we were always talking about storage overhead and reliability. And when we came to fountain
code, as I said, people were worried about computational complexity, right? You
wanted high performance codes. We're talking about rateless codes. When you
say rateless, they don't care about the storage overhead, okay? So we were talking
about rate and probability of correction. Higher the rate, higher probability of
correction. Okay, so that was a different trade-off
altogether. So when we came to LRCs, in general we were talking about storage overhead and repair degree, which is the fan-out, the number of nodes you reach out to for a read. But several specific types of codes actually optimize on slightly different parameters, like locality versus minimum distance in Facebook's LRC, right? In regenerating codes, as I said, it's storage overhead versus repair bandwidth, and in local regenerating codes you have the overall reconstruction cost, right, the number of nodes you reach out to and the amount of bandwidth you utilize in it; that's probably the holy grail, right?
Yeah, so that's the idea. So if you talk about reliability analysis, we're used to all these MTTDL numbers that we can compare with, right? So this chart is for the locally repairable codes
and this is regenerating. Let me take one step at a time. So in locally repairable codes, I'm
comparing against a three replica system. We're talking distributed systems like Hadoop, Ceph,
Cassandra, which uses three replicas. So that's like the standard. Now we're trying to compare that with the good old
classical codes: if I were to put a Reed-Solomon code in place of the three-replica system in Hadoop, which we actually did, we saw this. So this axis is storage overhead: with a three-replica system your storage overhead is 2x extra, like you have one copy of data and you have 2 extra copies of data. And this is your repair traffic. Essentially, with replication your repair traffic is almost nothing, because it's just a copy. You're just going to copy; you're not going to have, like in Reed-Solomon, a lot of data to read. With Reed-Solomon, how much I copy for one lost block depends on the code; if it's a (14, 10) and I have one block lost, I'm going to read 10 nodes'
worth of data, right? That's a lot, which comes somewhere here. So essentially what we're trying to do is get this one closer: you want minimum storage overhead, and you also want to bring the repair traffic down, and obviously you want to increase the MTTDL, because the MTTDL is the factor which actually saves you; the life of your entire system is going to go for a toss if you keep reading so much data, like in Reed-Solomon, right? In Reed-Solomon, though you're going to have a nice storage overhead, 1.2x, which is sweet, the problem is I'm going to every disk so many times that it's going to fail so fast, right? You want to increase the MTTDL as well, and that's what repair degree and repair bandwidth offer, because when you say repair degree, I'm actually going to reach out to those nodes not so many times; unlike in Reed-Solomon, I'm going to reach out only once instead of four times. So that way it increases the MTTDL. So your LRCs drastically improve your MTTDL. So this was Facebook's, this was Microsoft's, and the intention was to get these closer,
and these guys actually did, right? So coming to regenerating codes, it's almost the same: it's the same storage overhead, and your code length here, and the idea is to get them closer and increase this. So triple replica, pentagon, and heptagon are the MBR codes I spoke about. The heptagon-local is the local regenerating code, which actually lets you increase this, right? It increases because I'm actually reaching out to that node many fewer times than I would have to with other codes. So this boost actually
resulted because of my LRC properties. The repair degree is low. So
because of the heptagon local, because it's local, this one is improved so much.
In general, your regular regenerating code will let you get this one closer. If it's an MBR code, your bandwidth will come down a lot faster; if it's an MSR code, your storage overhead will get closer a lot faster. But you get what I'm trying to say. If I have to compare it with RAID plus mirror, your MTTDL still goes for a toss.
So my intention in putting up this slide with RAID is that these regenerating codes are not just applicable for distributed systems like Hadoop and Ceph and stuff; they are also applicable to your regular storage arrays, because your MTTDL is still a lot better than with your regular RAID. But you're talking about this coming closer; you're not talking about 2x and all, because your MSR code will actually keep your storage overhead low, and you're reading a smaller number of blocks every time, because it's a regenerating code, you're actually reading only a few blocks, and that way your life actually improves. But I must accept that it's not as
easy as I'm just talking about. So the idea is Reed-Solomon codes have been there for like 45
years; they're so optimized your regular CPUs can get the work done. But your MSR codes or regenerating codes are not yet computationally efficient, which means that we are operating over a larger Galois field; your field size is a lot more than what your regular CPUs can offer, so in terms of performance it's not so good, I must accept. But it's there, it's almost there. The theory is ready; people know that they'll have to optimize on the field size, people know that they'll have to work on the code rates, and things like that. So people are working on it, but the idea is that regenerating codes are kind of trying to replace Reed-Solomon codes altogether. All right, so I'm not going to go through the whole thing; let me take
this one specific row and you will get an idea of what I'm trying to do here. So take a system,
a general purpose storage array, a geo-distributed system or whatever and you take the properties of
the system: what's my most important property, what's the least important property, and what are the requirements from a code? Okay, so if I take a geo-distributed system, my most
important consideration is that it's geo-distributed; any repair over WAN is going to be so expensive. I can't say my data center went down and I'm going to copy all my data over WAN, right? That makes no sense at all.
So repair over WAN is damn expensive. That's the most important property. And the least
important property, storage overhead across DR sites. Actually, what I'm doing is replication.
I have two copies of data. Anything better than that is good for me. But still, that's
the least important property I would be worried about. And what are the requirements from
a code? I want local repairability. If I can repair within my data center, that's beautiful. Today I don't have a solution; I do Reed-Solomon, or maybe I'm just replicating. But now I have the option of LRC. So if I want local repairability, I have LRC; I don't have to go to Reed-Solomon or replication. I can use a locally repairable code. The least important property, again, is storage overhead. I don't care about it, because the worst case is going to be my Reed-Solomon or replication, so anything better than that is good, and that's my expectation from a code. And if these are the properties, just local repair and storage overhead, sweet, it's just LRC
there. Okay so my goal is that. So you take a system and see what's the most
important property, what's the least important property and what's the expectation from
a code, and then fit it right in. Earlier we didn't have this choice, because we didn't have these many trade-offs to work with. We had only storage overhead and
reliability, right. You give up on one of these but now we have so many of these
and naturally you can choose one of them.
So who works on all of these areas?
That way you know whom to reach out to if you have to.
So again the same logic, anything above the line is theoretical work, you might not be
really interested.
Anything below the line is systems.
A lot of folks work on the theory.
Here's a guy who actually came up with this idea of regenerating codes, who got inspired by network codes and brought it to storage: Alex Dimakis. And
he works closely with Facebook to get this Facebook photos and Facebook's data center
being optimized with all these beautiful erasure codes. And he's an expert in regenerating
codes. And Berkeley, they were the ones who actually came up with, they together came up with this idea of
regenerating codes. So most of these are trying to optimize on the repair traffic, the amount of data you need to read for a repair. And I'm
not talking about 20-year-old work; these are all from the past three years, '13, '14, '15. And here's the guy whom I work with in Bangalore, from IISc. So he was the one who
proved that you can't have codes anywhere other than the MSR and the MBR point. You can't have
codes anywhere else. You either get an MSR code or you get an MBR code. And the MBR code that we
constructed, the double replication code, we called it double replication because you have two pentagons
or two heptagons. You have two replicas already for Hadoop, so we
essentially use this double replica code for Hadoop. So we know Hadoop uses a
minimum of three replicas, you can go all the way to seven replicas or whatever
based on your data center. So in Hadoop, for ease of reconstruction and for ease of running an MR job on top, you want the data in plain form, right?
You wanted a systematic code.
You want two plain copies of data, not just for availability for reconstruction.
It's also for parallelism on your MR job.
If you have an MR job and you want two copies, you need real two copies,
not some parity encoded form, right?
So you need two copies, and we used Pentagon there.
You have two real copies there.
And you have a parity to actually back your failure scenario.
So it's better than three replicas for failure,
but it's not as good for parallelism.
We figured that not many MR jobs are actually run for three copies.
It's fine.
So we used that double replica on HDFS.
We tried to implement it.
And we got some beautiful results.
We published it at HotStorage.
Yeah, so that's the work that we do with him.
And here's a guy who, I think, is still with Microsoft; he was in the Bay Area, so I don't know if he's still around with Microsoft. But yeah, he was the one who constructed the pyramid and locality codes. He's a pioneer in LRCs. Those are guys who work on regenerating codes. He works on regenerating codes. He works on LRC. She works on network coding, but she also claims she works on storage. Folks at NTU Singapore also work on something called self-repairing codes, which are like self-healing codes within a disk. Not much of it is in systems, a lot of it
is in theory. In systems,
James Plank, I'm sure many of you know,
he's the one who developed this thing called Jerasure, a library for erasure codes. He also tried to optimize the SIMD instructions underneath for erasure codes. So he put out a library called GF-Complete in 2013, where he tried to increase the field size so that all those beautiful codes can be implemented. So he works very closely with Intel to get these erasure codes into practice.
So he works with a lot of OpenStack projects, EMC, and a lot of companies.
So another guy, from the Chinese University of Hong Kong, is also a pioneer in getting this theoretical stuff into practice. And STAIR codes are one example, very similar to what we heard in the morning. STAIR codes are something where he had row codes and column codes: you have a row parity and a column parity, and you have some parities built out of the row and column parities, and you use these STAIR codes essentially for repairing a disk-level failure and a sector-level failure with one code. Isn't that interesting? So you have sector failures normally being taken care of by the disk, and disk failures normally being taken care of by RAID. Now you don't talk about two different systems working separately in silos.
You're talking about one code which can actually reconstruct one node level failure and two
sector level failures. And one of these sector level failures could be a three sector failure
together. Right? So STAIR codes, that was in 2014; he tried to bring these row and column parities together for a slightly different purpose.
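Here's a toy sketch of that row-plus-column parity idea, assuming rows are disks and columns are sector positions, with plain XOR parities; the actual STAIR construction additionally layers parities of parities so that disk and sector failures are covered together by one code, which this sketch does not attempt.

```python
# Toy grid: rows ~ disks, columns ~ sector positions on each disk.
# Row parities (per disk) cover a lost sector; column parities (across disks)
# cover a whole-disk failure. Illustrative only, not the STAIR construction.

grid = [
    [3, 14, 15],
    [9, 26, 5],
    [35, 8, 97],
]

def xor(values):
    out = 0
    for v in values:
        out ^= v
    return out

row_parity = [xor(row) for row in grid]               # intra-disk parity
col_parity = [xor(col) for col in zip(*grid)]         # cross-disk parity

# Whole disk 1 dies: rebuild it column by column from the other disks.
rebuilt_disk = [col_parity[c] ^ xor(grid[r][c] for r in range(3) if r != 1)
                for c in range(3)]
assert rebuilt_disk == grid[1]

# A single bad sector at (0, 2): rebuild it from its own disk's row parity.
rebuilt_sector = row_parity[0] ^ xor(grid[0][c] for c in range(3) if c != 2)
assert rebuilt_sector == grid[0][2]
print("disk and sector repaired")
```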
And there are a bunch of startups and companies who work on this, most of them on Reed-Solomon and some of them on fountain codes. And here's an interesting one: Cleversafe, I'm sure some of you would have heard of it. Cleversafe is a company that does information dispersal; essentially,
dispersal is your erasure codes and when
you do not put a systematic code inside it, if it's non-systematic, it's going to
be in some other form, which means you get security there, right? So Cleversafe, it's secure storage where you use information dispersal: the data is erasure coded with a non-systematic erasure code, so your original data is not in its original form but in a different form, thereby getting theoretical security, because you don't really use an encryption algorithm there. You just use an erasure code algorithm, but it's a non-systematic code, so the data is not in its original form, and you've dispersed it. So if you lose one disk out of 100 disks, you're not essentially going to gather any information out of it. That was Cleversafe. Some others work on that as well.
Related areas, I told you about the sector disk level failure.
People are working on it, trying to get that improved.
This is interesting. PMDS stands for partial MDS codes.
So people are willing to accept partial MDS codes.
They're not strict on that MDS requirement.
SD was the previous work to STAIR; those are sector-disk failure codes. And they came out of the same Chinese University of Hong Kong.
And yeah, for different media,
Flash is the most spoken about in this space as well.
So how to improve LDPC codes.
There was a proposal in FAST 2014, I believe,
called WOM codes, multi-write codes,
NVM. So all of these are being looked at, trying to optimize these codes for flash; not essentially taking Reed-Solomon and putting it inside, but trying to take all these LRCs and turbo codes and fountain codes and regenerating codes and optimize them for the media that
we're talking about. Security, I told you about dispersal for the cloud. And yeah, there are codes called transformational codes. When data
moves, I mean you talk about a data lifecycle management where your data
moves from hot tier to a cold tier, okay. Now when the data moves from hot tier to
cold tier, you have a lot of other properties to worry about. You don't
really care about storage efficiency, I mean you care a lot about storage
efficiency in the cold tier, you don't care about storage efficiency in the hot tier,
but you care a lot about the performance. So assuming I have three copies of data
in my HDFS cluster for MR, as the data gets cold I'll have to somehow take
these three copies and put it in the Reed-Solomon and keep it in 1.2x or
whatever. I need some level of reliability there, but I can't afford to keep it at 3x, right? But the problem is, if I reduce from, let's say, 1.8x to 1.2x, what we'd have to do these days is take that 1.8x code, decode all the data, petabytes of data, imagine, and then re-encode it all into that 1.2x code
and then feed it in.
So that's not beautiful, right?
You have to kind of decode all of it and encode all of it.
Imagine the amount of time and data and effort that's required.
So these codes are looking at how do I just delete some parities and get 1.2x?
That way, improve on storage efficiency, get a little reliability.
So how do I work on all these parameters without much of a decode, encode kind of a phase?
That's pretty much what I had.
I'm going to take questions.
No questions?
Yes. So you mentioned the Galois field is different. For Reed-Solomon? Yes, the symbol, the alphabet. So your SSE was supporting 2 power 8; AVX2 supports 2 power 8 too, and they're extending it to 2 power 16. But what we are talking about is 2 power 64. Yeah, but there are codes that are constructed within 2 power 16; they just don't give you the same level of freedom that a proper regenerating code will give. Okay, so currently it's 2 power 16; we're trying to get all these codes to operate at 2 power 16 too.
Questions?
Yeah?
Was everything you talked about GF(2)? Are they looking at higher-order fields?
Yeah, the XOR codes are GF(2), and people are okay not working with XOR codes. So these codes that we're talking about, regenerating codes, they're operating at 2 power 64. And we are okay with 2 power 16, because the AVX or AVX2 instructions that Intel provides underneath your regular Haswell processors give you AVX2 instructions. So if you can get it to 2 power 16, that's good enough.
And I mean, RAID 6 is not XOR, so we don't really care.
So nobody is stuck with XOR anymore.
Other questions?
So LRC is basically a... You can put Reed-Solomon together.
Yeah, it's just Reed-Solomon,
which means existing libraries, existing systems just work.
It just has to be reorganized in a different way.
Yeah. So it's a hierarchical.
So, essentially every write, when you actually need to update a parity...
Yeah.
Your write updates are...
Yeah, but it's no different from a regular Reed-Solomon.
If you were to arrange all these parities in a row form, it's the same.
Right?
Yeah, so I get that. Yeah. So, people are fine with the penalty; people are fine even with going for a non-MDS code. LRC is a non-MDS code. That was a deal-breaker at that time. But after getting all these benefits, when Azure claimed that they saved millions of dollars just by shifting the code from Reed-Solomon to an LRC, now people are willing to believe it.
More questions?
Yeah.
You talked a lot about different codes and the X and Y dimensions of bandwidth versus storage efficiency. What about latency? The new generation of non-volatile memory is going to be faster, sitting somewhere in between, and you would want error correction codes whose latency can keep up with those kinds of media. So are there such codes? So these codes have not been looked at for anything underneath your FTL.
So, underneath FTL, we have LDPC codes and whatnot, right?
So, these codes are not looked at from that dimension yet.
But, I mean, these are not any worse than RS when it comes to latency from a media perspective.
Yeah, we're done.
And we need to
stop. Sure.
Thanks a lot.
Thanks for listening.
If you have questions about the
material presented in this podcast, be sure
and join our developers mailing list
by sending an email to
developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the developer community.
For additional information about the Storage Developer Conference,
visit storagedeveloper.org.