Storage Developer Conference - #173: Facts, Figures and Insights from 250,000 Hard Drives
Episode Date: August 2, 2022...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode number 173.
Hi, my name is Andrew Klein from Backblaze, and I'm here to talk to you about hard drives.
Specifically, facts, figures, and insights from over 250,000 hard drives that we have here at Backblaze.
Lots of interesting things.
Why don't we get started?
So, as I said, we're going to talk to you about hard drives and lots of different things we've learned about hard drives.
Specifically, we'll start out with where these hard drives come from, right? Where they live,
where they work, and so on and so forth. Talk about how large they are, the environment,
and everything like that. Then we'll dig into drive failure because, hey, that's what it's
all about here today, right? How drives fail, why drives fail, and all of that. Then we're going to
talk to you about specifics. Once you understand how we think about drive failure,
then we're going to talk about how, for example, turning the systems off and on affects drives,
whether drives are affected over time, temperature, and things of that nature.
And finally, we're going to finish up the presentation with a discussion on predicting drive failure.
Can you actually do it? So let's get started. So we keep those drives, those 250,000 drives
currently in four different data centers around the world, three in the US, one in the Netherlands.
Today, we have about 178,000 active drives, but over the lifespan, it's about 260,000, as you can see there,
right? And that's what we're actually talking about today when we talk about drives,
that 260,000 drive set. So we'll look at those all over the place.
Now, the drives actually are stored in storage servers, typically 60 to a server, right? Not all cases, but 60 to a
server is pretty typical. And we actually take 20 of those servers and we put them together into
something called a vault. Now, a vault is just 20 servers physically connected together. Within
there, drive number one in each one of those storage servers, taken together, is called a tome.
And the drive number twos form another tome, and so on.
They're all individual drives,
but they're logically linked together.
So when you send some files in, they're stored in a tome over those 20 different drives.
And we use some Reed-Solomon encoding algorithms to do that and then retrieve them back as well.
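For reference, the actual scheme described here is Reed-Solomon erasure coding across the drives of a tome; the toy sketch below is not that, just a single-parity XOR stand-in with made-up shard counts and data, but it shows the same basic idea: a stripe of shards plus parity lets you rebuild a lost drive's shard from the survivors.

```python
# Toy illustration of shard-style redundancy across a tome. This is single-parity
# XOR, a much simpler cousin of the Reed-Solomon coding described in the talk;
# it can rebuild exactly one lost shard.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_shards(data: bytes, n_data: int) -> list:
    """Split data into n_data equal-length shards plus one XOR parity shard."""
    shard_len = -(-len(data) // n_data)              # ceiling division
    padded = data.ljust(shard_len * n_data, b"\0")   # pad so all shards are equal
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(n_data)]
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return shards + [parity]

def rebuild(shards: list) -> list:
    """Rebuild at most one missing shard (marked None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    assert len(missing) <= 1, "single parity can only rebuild one lost shard"
    if missing:
        survivors = [s for s in shards if s is not None]
        rebuilt = survivors[0]
        for s in survivors[1:]:
            rebuilt = xor_bytes(rebuilt, s)
        shards[missing[0]] = rebuilt
    return shards

# A "file" striped across a hypothetical 4-drive tome (3 data shards + 1 parity).
file_bytes = b"files are stored as shards spread across a tome"
tome = make_shards(file_bytes, n_data=3)
tome[1] = None                                        # pretend drive 2 failed
recovered = rebuild(tome)
print(b"".join(recovered[:-1]).rstrip(b"\0") == file_bytes)   # True
```

Real Reed-Solomon coding uses multiple parity shards, so a tome can tolerate the loss of several drives at once rather than just the single loss the sketch above can handle.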
So that gives us the redundancy that we need if a drive goes down and all of those kinds of things, right? But the important part about that is that this is the
way the entire farm works. So all of the drives that are set up this way, they all get the same
kind of load, okay? So that's the best part about all of that. So when we look at the data,
they're all treated about the same,
whether they're a four terabyte drive or a 16 terabyte drive. Now we collect data from these
drives. So we've been doing that since 2013. We collect it using the smartmontools application,
right? And we store the data and I'll show you the format in a little bit. You can actually
download the data. When we produce our quarterly and annual reports on hard drive stats, we use the same data.
So at any time, you can go in and download it.
There's the URL for you to do that and test me out.
The data that we collect looks like this.
There's a row of data for every drive for every day.
So any given day today, there's 178,000 rows, right?
And so on.
And then tomorrow there'll be another 178,000 and so on.
So it's a fairly large data set.
The interesting thing is that we collect not only the basics,
the serial number, model, and so on,
we also collect all of the SMART stats as well.
And many of you are familiar with that, but that's information that the drive actually produces,
right? And tells us about the health of the drive, if you will. Now, there are 255 possible attribute pairs. We actually
have places for all of them, but different drives report different stats and so on and so forth.
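To get a feel for that one-row-per-drive-per-day layout, here is a minimal sketch of loading one day's file from the downloadable data set with pandas. The file name and column names (serial_number, model, capacity_bytes, failure, smart_*_raw) follow the published CSVs, but verify them against the files you actually download, since the set of SMART columns changes over time.

```python
import pandas as pd

# One CSV per day; each row is one drive on that day.
day = pd.read_csv("2021-06-30.csv")

print(len(day))   # roughly one row per active drive on that date
print(day[["serial_number", "model", "capacity_bytes", "failure"]].head())

# SMART attributes come as raw/normalized column pairs. Most drives report only
# a subset, so many of the possible attribute columns are NaN for any given model.
smart_cols = [c for c in day.columns if c.startswith("smart_")]
print(day[smart_cols].notna().mean().sort_values(ascending=False).head(10))
```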
And you'll see that as we go along. The interesting thing about this is we have something we call a drive day. And you'll see
that in our formula when we talk a little bit later. A drive day is basically the data that
we collect for one drive for one day. So for that drive, if we collect seven days worth of data,
that is seven drive days. And there's a reason we do that.
We'll get to that in just a second.
Now, what's a failure?
There are two easy ones, right?
Turn on the drive and it doesn't spin up.
That's pretty easy, or it makes some really ugly noise, or whatever the case may be.
It also won't stay synced in that array that we set up, in that tome.
It's got 19 friends and it's looking for its 20th one and it can't find it because it won't come up for some reason.
Electronically, it's spinning and everything like that, but it won't talk to anybody.
We also do it based on the statistics that we track, so the smart stats that we track.
Now, that means we have two different types of failures.
We have a reactive failure, right? One that, hey, something just broke and we need to get it out of there.
And then a proactive failure, right? The proactive one is where we use SMART stats and file system checks and so on to tell us that there's potentially a problem here.
So there's a quick example there on the right-hand side
that you can see.
And you notice in the notes,
it starts giving us various different data points
to take a look at.
And it gives us a recommendation.
Hey, consider replacing this drive
because in this case,
the high offline uncorrectable smart stat
is particularly high.
We get all of these things,
but to this date,
because the workload isn't that high, we still actually review each one of those and have a
human being in the loop. So the automation produces all of this, and then we have a human check it.
And it also helps us validate how good, if you will, these algorithms are in predicting failure as we go along.
In either case, the drives are taken out and they're quarantined.
We will actually check drives to make sure that they are operationally bad.
So it wasn't just a spurious thing that happened.
But eventually those drives are wiped and cleaned and sent away.
Here's the formula that we use to compute drive failure.
And we've been using this since the beginning. And it's all based on a very simple thing, which is
you take a cohort, a model, for example, I would like all of model ABC123 over a given period of
time, in this case, Q2 2021. We gather up the number of drive days, how many drives were operating
over that period and how many days they were working, and then the number of drive failures. And we put those in that formula
down there in step three, and we get an answer. In this case, it's 1.52%. Now that's an annualized
failure rate. And that's important. It's not an annual failure rate. It's annualized because
we're only talking about a period, which here is Q2,
right? So it's a quarter, but we've annualized the number so we can compare that number to any
other number. And we can do it over any period of time. We could do it over a month. We could
do it over five years. It doesn't matter. The reason we do this, the reason we do it with this
formula is because our environment is very dynamic, right? There's drives coming in
all of the time. So for example, model ABC123 might have started the quarter with a thousand
drives. At the end of the first month, we put in another thousand drives. At the end of the second month,
we put in another thousand drives. So we had 3,000 drives by the time we got to the end.
We wouldn't want to use that 3,000 number to compute the failure rate because some of those drives
were only there for a month.
Drive days accounts for all of that.
It also accounts for taking drives out and doing migrations and so on.
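The arithmetic behind that formula is easy to sketch: failures divided by drive days, scaled to a year and expressed as a percentage. The cohort numbers below are invented for illustration.

```python
def annualized_failure_rate(drive_days: int, failures: int) -> float:
    """AFR (%) for a cohort over any observation window, normalized to a year."""
    return failures / drive_days * 365 * 100

# Hypothetical cohort: a model observed for one quarter, with drives added along
# the way, which is exactly why drive days (not drive count) is the denominator.
drive_days = 120_000    # total days all drives of this model were in service
failures = 5            # drives of this model that failed in the quarter

print(f"AFR = {annualized_failure_rate(drive_days, failures):.2f}%")   # 1.52%
```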
So drive failure.
So we look at drive failure and all kinds of different attributes
that we get from those smart stats.
So, for example, one of the
smart stats we'll talk about is power cycling. So how many times does the drive get turned off
and then back on again? Okay. Now, the reason this is cool, and I always thought it was pretty cool,
was how many of you have one of those relatives at home that likes to turn the system off at the end of
the day, for example, and then turn it back on the next day. And then every time they walk away,
they turn the computer off and then they turn it back on. You got this feeling in your gut that
that's probably not good. Okay. But we don't know. No one's ever really proven anything one way or
the other. And we have lots of drives and they get turned off and
they get turned on from time to time. And we wanted to dig into that, see if
there was any relationship between doing that action of turning things off and on versus leaving
them operational. Okay. And we just, you just can't get there. We can't get there because quite
frankly, we don't fail enough drives. We don't
turn them off enough. All right. You can see some little differences here. Over a given year,
we might take a good drive and turn it off and on three times, and we might take
a drive that's failed, and it's only been turned off and on four times. There's just no way you
can say, oh my goodness, that's bad.
Particularly, and we did it across the entire lifetime for all of those drives that we've ever had, right?
And we still couldn't get big enough numbers.
We just don't turn off the drives much.
The other way to look at it is time, right? So is there any correlation to turning them off and on?
And does that build up over time? And do you get a higher failure rate over time because you keep
turning them off and on? So in year one, you turn it off 10 times. In year two, you turn it off 10
times. By the time you get to the end, maybe it's 50 times, right? Is that something, right? And the answer is we don't know, you know? You can take a look
at this and you can see the plot of time versus power cycles. And maybe in year three,
there are a lot fewer power cycles, but those are the failed drives being plotted at that point, right?
So what's going on, right? We look at the line that we drew through there, the regression. We calculate the R-squared value. There's just not enough data there.
I'm not willing to take all the drives in my data center and start turning them off and on every day
just to see. If you are, let me know what you find out.
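For anyone who does want to poke at this on their own fleet, the regression step itself is simple: plot a failure measure against power-cycle counts and check the R-squared. The sketch below uses scipy and made-up numbers; it's the shape of the analysis, not Backblaze's actual tooling.

```python
import numpy as np
from scipy import stats

# Invented example data: annualized failure rate (%) and mean power cycles per
# drive, bucketed by drive age in years.
power_cycles = np.array([3.0, 3.4, 2.1, 4.2, 4.8])
afr_percent  = np.array([1.8, 0.9, 1.1, 1.6, 2.4])

fit = stats.linregress(power_cycles, afr_percent)
print(f"slope={fit.slope:.2f}  r^2={fit.rvalue**2:.2f}  p={fit.pvalue:.2f}")

# With counts this small and this noisy (a handful of cycles per drive per year),
# a weak R-squared and a large p-value is the "we just can't tell" result
# described above.
```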
Now, this is a cool one, okay?
This is related to the bathtub curve, right?
Failure over time.
The thought is that, just like a lot of other industrial products, there's infant mortality, if you will, that the drives fail early at a fairly high rate.
And then they kind of come down the curve and they settle in, right?
They settle into a nice low failure rate.
And then as they go along, maybe in year three or so, they start to rise back up and you get the other side of the bathtub curve.
And when we looked at this back in 2016, 2017, that was basically what we had, a bathtub curve that didn't die.
We only went to four years then and we kind of projected out.
And we figured that the failure rate would continue to
go up. And it does, you can see that. But what's interesting here is the front end of the curve,
the left side, if you will, of that curve, that year one isn't very high. You got a lot of water
falling out of that bathtub on that side over there, right? And I think that has a lot to do
with the way that drives are being tested now, you know, over the last several years in particular, last two or three years in particular.
I took a tour, for example, of the Seagate factory, the prototyping factory in Longmont, Colorado, and they put the drives through a pretty rigorous frontend process. They actually put them in, they monitor every drive they make, they put it in a
little oven, they kind of run it through some systems and things of that nature, just to make
sure that they get as many of those early failures out of the system as possible, and also to make
sure that the quality of the drives is good. And I think that process, those types of processes by
the manufacturers have pushed down that early part of the curve,
that infant mortality, if you will, so that we don't have as many early failures. It doesn't
mean, by the way, that you won't buy a drive that's DOA, but it probably means it's going to
happen a little bit less than it has. So the bathtub curve is starting to look a
little lopsided, leaking a little water, but it still kind of settles down in the middle there.
You get a nice period in that second year and third year in particular where it's nice and low, and then it starts to creep back up.
We replace our drives someplace between four and five years, typically, in dealing with this kind of thing.
There's also the whole fact that you want to do migrations and those types of things, right?
So you want to go from a two terabyte to an eight terabyte or four terabyte to a 16 in order to get more density.
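If you want to redraw that curve from the public data, one way is to bucket the daily rows by drive age and compute an annualized rate per bucket. The sketch below assumes a DataFrame of drive-day rows plus an age_years column you derive yourself (for example, from each drive's first appearance in the data); that column is not in the raw files.

```python
import pandas as pd

def afr_by_age(df: pd.DataFrame) -> pd.Series:
    """Annualized failure rate (%) per year of drive age.

    Expects one row per drive per day, with:
      - 'failure'   : 1 on the day a drive fails, else 0 (as in the public data)
      - 'age_years' : integer age bucket (0, 1, 2, ...) computed by the caller
    """
    grouped = df.groupby("age_years").agg(
        drive_days=("failure", "size"),    # every row is one drive day
        failures=("failure", "sum"),
    )
    return grouped["failures"] / grouped["drive_days"] * 365 * 100

# usage sketch, once df has been assembled from the daily CSVs:
# print(afr_by_age(df).round(2))
```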
Now, the last bit, the last little one here is all about temperature, right? Temperature's been talked about since way back in 2007 when Google
started that conversation. They said, you know, you can turn up the air conditioning a little bit.
You don't have to have a meat locker in your data center. You can run your drives a little warmer.
And we did some work on that a few years ago, and we found the same kind of thing. And so we just
wanted to check back and see how that's holding up, right?
So here we are,
we have all of the operational drives on one side, right?
And the average temperature of those drives
is about 29.1 degrees Celsius, 84 degrees Fahrenheit or so, right?
And you can see the mean and the mode there
and all of that for the operational drives.
Now, if temperature really affected drives, if it really
correlated to failure, we would probably see a slightly different graph than we do on the second
one. Now, you can see there are not as many data points, so you get a little bit of sloppiness
there, but you still have something that looks like a bell curve at the end of the day. And the average temperature
is only a couple of degrees warmer on those failed drives. It's probably not heat, okay,
that's causing those things. It's probably a byproduct of that. We actually went back a couple
of days to validate this and the failed drives seem to run a little bit warmer as they get closer
to the end. Kind of interesting.
But that's probably a byproduct of whatever is causing them to fail.
It's not the heat that's causing them to fail.
The only place that you can start to see a little bit of difference is on the right-hand side there when you start to get to like 40 degrees Celsius or so, 38 degrees Celsius, where you see a little bump that you
don't see on the good side. So maybe that's too high. Now the specs on drives are typically 60
or 70 degrees Celsius that you can run them up to. So we're not even close to that. But the point is
that you can still run a data center, in our case, with the drives at an average internal temperature of about 85, 86 degrees Fahrenheit. And remember, these are being measured inside the drive, and this isn't
the external thing. When you walk down a cold aisle, it's still cold in a data center, believe me.
But we're running a fairly high temperature and no real difference, if you will, in failure rates.
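To rerun that comparison yourself, pull the drive-reported temperature (SMART 194) for failed rows versus everything else and compare the distributions. The sketch below assumes the same daily-row DataFrame and the smart_194_raw column from the public files.

```python
import pandas as pd

def temperature_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Compare drive-reported temperature (SMART 194, deg C) for failed vs. operational rows."""
    temp = df.dropna(subset=["smart_194_raw"])
    groups = temp.groupby(temp["failure"] == 1)["smart_194_raw"]
    summary = groups.agg(["mean", "median", "count"])
    summary.index = summary.index.map({False: "operational", True: "failed"})
    return summary

# usage sketch:
# print(temperature_summary(df))
# If heat itself drove failures, the failed distribution would sit well to the
# right; in the talk it averages only a couple of degrees warmer.
```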
Also decided to plot them this way, temperature versus time in years to see if, once again, maybe that repeated temperature of being at 85 or 86 degrees or something like that Fahrenheit would be a problem over time.
And you just don't really see that in the data.
You see a little bit there in year four
where some of the drives start to get a little warm, right?
But interestingly enough,
the two charts really look about the same.
It's just that with the failed drives, there aren't as many,
so there's a whole lot of holes in it.
But if you take a good gander at it, right, they look very similar. In other words, you really can't say there's any
correlation between temperature and time and years. It doesn't seem to, you can run a drive at
30, at 28 degrees, 29 degrees Celsius, and it doesn't seem to be much of a problem over time, and it doesn't seem to create more failures.
Now, I kind of did this one on a whim, and I was a little surprised by the results.
Because what it's basically telling you is that as the drives got bigger, they got hotter.
Now, that's a really interesting thing. You could say, well, look, there's more
parts in there. There's more disks or platters in there. There's more actuators, all of that,
right? Remember, all of these drives are in the same environment. We're not twisting anything
here. This is what's going on inside the data center for all of the different drives.
The 6s and the 10s there look a little low.
Those are one model.
Each one of those is one model.
They're both Seagates, and they run a little cooler than everything else.
And so maybe that's why that's a little low.
The 8s are kind of an interesting story.
Those were some of the early helium drives,
and so we were kind of curious as to whether or not that had an effect on it.
Everything from the 12s, 14s, 16s, and so on are all helium inside there.
There's helium.
And so one of the things that I didn't know the answer to,
and it would be great to find out at
some point, is when you're measuring that temperature inside, does it matter, does it make a difference,
whether there's helium or air inside there? Okay, because helium is obviously significantly lighter.
That's the reason they're using it: it creates less friction, less electricity used, all of those
kinds of things. But they seem to be running a little warmer. So it'd be interesting to understand that dynamic from a
gas point of view. I'm not a chemist that way. So it'd be nice to understand that,
see if that explains why these drives seem to be getting warmer as they get bigger. Now, predicting drive failure, right? We do this. We're doing that when we
take those drive stats and that little thing, that little sign comes up and says, hey,
consider replacing this drive. That's predicting drive failure. And how do we do that? We monitor
these particular statistics, right? We've been doing this for a while.
This is actually the slide that I gave back in 2017 on exactly this topic, right? And we took the five different stats that we were tracking and we said, hey, how well do they correlate to failure?
And then, in each case, what's the false positive rate?
Because if operational drives are showing a high number, we don't want to get fooled into yanking out a drive, right?
So let's compare 2017 to 2021 and see how consistent we are.
Smart 5, reallocated sector count.
Yeah, pretty consistent.
A few more false positives there, but still pretty consistent.
Smart 187, there's something going on there.
Something's a little odd there because the false positive, the operational failure rate in 2021 is just really high.
So let's take a look at that in just a second.
188, command timeout, that's hanging in there. We're not detecting it as often, but
a zero false positive rate is pretty cool. So if you see that number show up there, and a number
being greater than zero in this case, so if we suddenly see command timeouts start happening,
it's probably a pretty good indicator that failure is coming on that.
197, those numbers are really weird now, right? The failed rate is correct, but the operational rate, the false positive rate is just insane.
And I'll explain that in a minute why it's happening.
And then 198 looks pretty good, okay?
It looks fairly consistent.
We can still kind of use that.
So 187, we talked about that, right?
What happened?
Well, it was 0.5% and then it went up.
The false positive rate went up to 23%.
Well, the first thing to know about that particular stat
is that it's only in our Seagate 4 terabyte drives.
And we still have some of those, all right, that we have in the system, although we've been taking them out over time.
The number, though, doesn't decline. So it goes from a zero to a one, okay, and then it happens
again, there's suddenly another reported uncorrectable error that has to be remapped,
and then it becomes two. It doesn't go backwards when the remapping occurs.
All right. So it's never going to go down. So as soon as it has a one, okay,
it's just going to have a two and then it's going to have a three. And again, these drives are
getting old, about six years old in some cases. So having these happen, right, even just having one happen, is probably a reasonable
expectation. The system takes care of it, maps around it, everybody's happy, but that number
keeps going up. So it's a good indicator that there's a problem, okay, but it's a bad indicator
because even really good drives may have one reallocated sector that we had to work around, so to speak, sitting there.
So it's a problem.
So 197, on the other hand, is kind of a weird one.
Most larger drives report it. But there's a set of values that comes in where
197 has the same values that are in SMART 1. And that doesn't make any sense when you look at the
definitions of those two different attributes, okay? And the other thing is, for current pending sector
count, you'd look at the SMART 197 raw value and you'd get a number that
would be 1.9 million. And that just doesn't make any sense because if you had that many
pending sectors, the drive would just be a pile of garbage at that point. So we don't know why,
but some drives seem to be misreporting that particular stat.
And what ends up happening is we have to take out both 187 and 197, okay, in trying to do our
calculations and figure it out. So it kind of leaves us with three. We're looking at some of
the others, for example, 196 looks pretty good.
And there's a couple of others that we also look at.
But it's getting to be a little bit harder these days to do those kinds of things.
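Given that the usable short list ends up being roughly SMART 5, 188, and 198 (with 196 as a candidate), the simplest screening rule is just to flag any drive whose raw value for one of those goes above zero. The sketch below is that rule and nothing more; the actual process described above also folds in file system checks and a human review.

```python
import pandas as pd

# Attributes that stayed useful in the 2017-vs-2021 comparison:
#   smart_5_raw    reallocated sector count
#   smart_188_raw  command timeout
#   smart_198_raw  offline uncorrectable
WATCH_COLS = ["smart_5_raw", "smart_188_raw", "smart_198_raw"]

def drives_to_review(day: pd.DataFrame) -> pd.DataFrame:
    """Return today's rows where any watched raw attribute is nonzero."""
    raw = day[WATCH_COLS].fillna(0)
    flagged = day[(raw > 0).any(axis=1)]
    return flagged[["serial_number", "model"] + WATCH_COLS]

# usage sketch, given one day's CSV:
# day = pd.read_csv("2021-06-30.csv")
# print(drives_to_review(day))
```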
So to kind of wrap up, we do some predicting, right? And way back in 2016, Mirela Botezatu, thank you,
that's a hard one, and her friends at IBM did a nice paper, Predicting Disk Replacement towards Reliable Data Centers. They used our data to do that. And using what I would call fairly traditional statistical
multi-regression analysis, they went through and they tried to see
if you could use smart stats to predict drive failure.
And they took two particular models, and they actually did it.
They did it with a 4-terabyte drive and a 2-terabyte drive,
4-terabyte Seagate and a two terabyte HGST.
They used our data for, I think it was three years;
they collected the data,
and then they used that to train their model.
And then they went forward
and tried to predict, and the predictions were pretty darn accurate.
In some cases, you know, at 30 days out,
they were 85% accurate. And by
the time you get within a couple of days, they were up in the 95 to 97% accuracy rate, which is
pretty darn good. And they were doing all of that. The trouble, of course, is you need three years
worth of data. And as many of you know, manufacturers seem to make models for two or three
years, and then they change to another model and then another model.
And the question of transference is really good.
Can you really transfer what you learned on model 1, 2, 3 to model 4, 5, 6?
And the answer is nobody knows at this point, right?
People have tried, because that's a lot of what artificial intelligence is about, taking something you've learned, understanding it, and then applying it somewhere else.
Now, there's been other papers in between, but the most recent one I've seen is the interpretable predictive maintenance for hard drives.
And what they did is take the data and they applied a technique called optimized decision trees. They did it for one particular model of ours. And they were able to
also get, you know, the ability to predict drive failure in the 97, 98, 99 percent range
with a few days' notice, which is all you really need, right? If I have some time, I can
take care of that drive. In other words, if I can get that drive out before it fails, I can do all kinds of other interesting things.
I could put it into read-only mode so it doesn't take on any more data.
I could even clone the data off on the fly.
I can do lots of different things to save time and not degrade my system in any way, shape, or form.
They used that technique on that particular data set.
Again, optimized decision trees, which basically uses what we all know as decision trees, but does it all at once.
And it takes the learnings that it does and it applies it again and applies it again and applies it again.
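As a rough stand-in for that approach, not the paper's optimized trees and not Backblaze's process, here is the general shape of the training step with a plain scikit-learn decision tree: SMART raw values as features and a "fails within the next few days" label that you would have to construct yourself from the failure dates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Features: the SMART attributes discussed earlier. For Seagate models you may
# want to drop 187/197 given the reporting quirks described above.
FEATURES = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
            "smart_197_raw", "smart_198_raw"]

def train_failure_tree(df: pd.DataFrame) -> DecisionTreeClassifier:
    """Train a small, interpretable tree on drive-day rows for a single model.

    Assumes a 'fails_soon' column (1 if the drive fails within N days of the row,
    else 0) that the caller has derived from the failure dates.
    """
    X = df[FEATURES].fillna(0)
    y = df["fails_soon"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    # Shallow tree plus class weighting: failures are rare, and a small tree stays
    # readable, which is the appeal of tree-based approaches here.
    tree = DecisionTreeClassifier(max_depth=4, class_weight="balanced", random_state=0)
    tree.fit(X_train, y_train)
    print(classification_report(y_test, tree.predict(X_test)))
    return tree
```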
I'm going to have one of the authors of that, Daisy Zhao, on a webcast in October with me.
And we're going to go through this model because it sounds really interesting to see.
Because the great part about it was that they didn't need that many observations.
They had about, what they said, about 50,000 observations.
So about 50,000 drive days, if you will, with that particular model.
So if I have 20,000 drives,
I'm doing pretty good, right? I can take a look at a fairly small
set of data, and maybe with even as little as six months' worth of information, I can begin
making predictions and getting that accuracy. And that's where this gets to be exciting.
So that's something I'm looking forward to
and seeing if we can learn a whole lot more about that.
So what did we talk about?
We obviously talked about where our drives live and work, right?
And they're all in our data center.
They're all treated the same.
They're all in air-conditioned, nice, stable places.
There's no dust bunnies running around.
There are no little kids throwing them across the room and, you know, things of that nature. We talked about drive
failure, right? And how we compute it and why we use things called drive days and annualized
failure rate. We looked at a couple of different ways that drive failure can work out, right?
We looked at power cycling; I'm sorry, we can't get any conclusions there. We looked over time and said, yeah, the bathtub curve still exists, but it's leaking a little.
And then we looked at temperature and we said, well, as much as we'd like to think the temperature affects it,
at normal operating temperatures of a drive, it just doesn't; you know, turn the air conditioning down a little and you'll be fine.
And then finally, we talked about predicting drive failure, how we do it.
And then some of the ways that we've seen over the years, other people try to do that.
So that's the presentation.
I look forward to any questions that may come down the road.
I'm looking forward to hearing from you.
And I wish to thank the folks at SDC, the Storage Developer Conference, for
inviting me and have a great day. Thanks for listening. If you have questions about the
material presented in this podcast, be sure and join our developers mailing list by sending an
email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the storage developer community.
For additional information about the Storage Developer Conference,
visit www.storagedeveloper.org.