Storage Developer Conference - #44: What Can One Billion Hours of Spinning Hard Drives Tell Us?
Episode Date: May 9, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 44.
Today we hear from Gleb Budman, CEO of Backblaze,
as he presents What Can One Billion Hours of Spinning Hard Drives Tell Us?
from the 2016 storage developer conference.
My name is Gleb Budman. I'm the co-founder and CEO of Backblaze. And today we're going to talk
about some of the things we've learned from running hard drives for about a billion hours.
So we're going to start off talking about the environment that the hard drives are in so that
it gives some context. It's not a research project. It's actually operational.
We're going to talk about what do we consider a dead or sick drive
and how do we make that determination.
Then we're going to talk about some of the findings.
So how long do they last?
What do we see about enterprise versus consumer drives,
effects of power cycling and temperature on some of the drives,
and then hopefully we'll have some time for questions.
So, just for context: sometimes people ask, okay, how was this research or analysis lab set up? And it's not a research lab. This is an operational environment, so we are simply collecting data from running the drives in our environment.
That means both good and bad things. On one hand, it's a real environment, so it might be more similar to what some of you are running. On the other hand, we don't have the flexibility of saying, well, this is the perfect test that we would love to set up, let's do that. It's more a function of whether this was something that we needed for the business.
So the environment that they ran in, for the most part over the last decade, is a backup-style environment.
So we provide this $5 a month unlimited backup service.
That's what the drives have been doing for most of their life.
If you think of a backup environment, sometimes people think, oh, that means that you put all the data in them,
they just filled up and then they sat there completely idle for all the rest of the time, almost never used.
It's not actually the environment that we have, because the drives are constantly getting written to, read from, checked, deleted.
So they are actually operational all the time. And then more recently we have this B2 service, which is a cloud storage offering, and again that is a real-time service, so these drives are being operated on in production 24/7. So that's a little bit of the context. And then let's talk about what the drives are actually living in.
So in a plywood box.
So obviously all the drives are not in the plywood box anymore,
but this is where they started.
The very first version was a plywood server to store the drives in.
The whole idea originally was just, we needed as inexpensive as possible
a way to connect drives to the internet.
So this was a prototype plywood server.
It actually was deployed in a data center,
and actually ran. And
then we did a bunch of testing and analysis
on power and cooling and the like,
different ways of connecting drives to motherboards, built the chassis
and made what we call a storage pod.
And so a storage pod is just an inexpensive box to store hard drives and connect them
to the internet.
Some of the things that happen because it's an operational environment is that the drives
are purchased over long periods of time, right?
So some of the drives were purchased 10 years ago.
Some of the drives were purchased more recently.
Sometimes we dealt with things like the Thailand drive crisis, which many of you are probably
familiar with.
When Thailand flooded, all of a sudden it was very difficult to get drives. And so one of the ways that we bought drives is we bought
consumer USB external hard drives, cracked them out of
their shells, and put the drives in the servers.
So a little unorthodox, but we bought thousands of drives
that way during a time when drives were unavailable.
So the drives in the environment are a whole mix.
They started off being one-terabyte drives. We now have up to eight terabytes. It's a combination of
the two. But the production is now pretty standardized for these storage pod deployments.
And the model that they're mostly running in is what we call storage pod version 6.0.
It's a 60 drive chassis with 60 drives on the front, compute on the back,
and this is the base unit that the drives are in. And then, so the one level up from that is,
and early on we used RAID inside of the environment. We still have RAID in some of the boxes
that are kind of legacy, of course. So when I talk a little bit about how we treat drives
and decide whether they are dead or not,
one of the things I talk about is RAID.
These are actually RAIN and RAID.
The next level up is what we've switched to.
What are the drives now?
Like, is it SAS or SATA?
They're SATA drives.
So they're all consumer SATA drives.
And so the next level up right now is actually what we call Vault.
So Vault is 20 of these units, these storage pods, each put into a different rack.
And when you write one file, we chop it into 20 pieces.
It gets erasure coded across all 20, one shard per drive, per pod, per rack.
And then you can reassemble any of them from any 17 of those pieces.
So that's kind of the base unit today and what most of the drives are running in.
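To make that 17-of-20 idea concrete, here is a minimal, self-contained sketch of a 20-shard erasure code that survives any three losses. It is a toy (a polynomial code over a prime field, with made-up shard placement), not Backblaze's production Vault code, which the talk only describes at this level of detail.

```python
# Toy 17-of-20 erasure code: encode 17 data symbols into 20 shards (one per
# pod), then rebuild from any 17 survivors. Illustration only.

P = 65537  # prime modulus; every byte value 0..255 fits as a field element

def poly_mul_linear(poly, xm):
    """Multiply a polynomial (coefficients, low degree first) by (x - xm) mod P."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - xm * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out

def encode(chunk):
    """17 byte values -> 20 shard symbols (shard i would live on pod i)."""
    assert len(chunk) == 17
    # Treat the bytes as coefficients of a degree-16 polynomial and evaluate
    # it at x = 1..20; each evaluation is one shard symbol.
    return [sum(c * pow(x, i, P) for i, c in enumerate(chunk)) % P
            for x in range(1, 21)]

def decode(points):
    """Any 17 (x, shard_symbol) pairs -> the original 17 byte values."""
    assert len(points) == 17
    coeffs = [0] * 17
    for j, (xj, yj) in enumerate(points):
        # Lagrange interpolation: rebuild the polynomial from 17 evaluations.
        basis, denom = [1], 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                basis = poly_mul_linear(basis, xm)
                denom = (denom * (xj - xm)) % P
        scale = yj * pow(denom, P - 2, P) % P  # modular inverse via Fermat
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * b) % P
    return coeffs

if __name__ == "__main__":
    original = [ord(c) for c in "17 bytes of data!"]    # exactly 17 symbols
    shards = list(enumerate(encode(original), start=1))  # (pod number, symbol)
    del shards[19], shards[9], shards[2]                 # lose any three pods
    assert decode(shards) == original
    print("rebuilt the data from 17 of 20 shards")
```

The point is just the property the talk describes: write 20 pieces, one per pod per rack, and read the file back from any 17 of them.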
And then the next level up from that is you assemble these vaults into clusters, and then the clusters get assembled into the overall Backblaze file system.
And so today we store about 250 petabytes of data for customers.
We add about 10 petabytes every month.
And all of that is spinning drives.
So we get asked periodically, what about Flash and SSD?
We love SSD. SSD is awesome. I hope we switch to it at some point.
It's completely unaffordable for the kind of use cases that we're looking at today.
So this is all spinning drives and mostly, almost entirely, consumer-grade spinning drives.
So that's the environment. Let's talk about how do we actually consider
a sick drive. So there are three things. The first is that the drive doesn't spin up. It
doesn't connect to the OS. Okay, so obviously if you can't turn the drive on, the drive
is dead. That's kind of binary, fairly straightforward. The second one is it won't stay synced in
an array. Now, I say RAID here; as I said, in some of the pods it's actually RAID the way we kind of think of RAID. In most of them now it's this erasure-coded algorithm across 20 different storage pods, but conceptually it's the same thing: if the drive won't stay in that array, it's unusable in this kind of environment, and we consider that a sick drive.
Both of those are fairly binary in nature. The third one is using SMART stats. And so we use SMART stats to give us a sense of, do we think the drive is going bad? It's still a little bit in
that gray area of, this is not a clearly dead drive, but it might not be quite a good drive either.
So let's talk more about the smart stat side of it.
So there are over 50 different smart stats you could use.
These are the five that we actually look at to get a sense of whether the drives are going
bad. And these are the ones that seem to be correlated to failure
versus being correlated sometimes to things that look like failure.
The most common thing where you can look at smart stats
that look like they're correlated to failure but aren't
is age or time running.
So there are some SMART stats you look at and go, wow, you know, this SMART stat was high when the drive died. And on most of the drives that are dying, the stat is higher.
Well, if it measures hours of time that the drive has been in production, obviously it's going to be correlated: the older the drive is, the more likely it was to die.
So there are some that are misleading in that way.
So you have to actually correlate out age.
But these are five that we've found that actually do correlate with failure specifically.
Three of them you'll notice are reported only by Seagate. So one of the things is that some of the smart stats are reported by all drive manufacturers and models.
Some of them are only reported by one vendor or another.
It would be great, obviously, if all drive vendors reported all of them, and if all of them standardized.
That would be nice.
It's not the world we live in.
So we use the data that we can pull from them.
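As a rough sketch of what a "watch these attributes" check can look like: the snippet below reads SMART attributes with smartmontools and flags non-zero raw values. The specific attribute IDs (5, 187, 188, 197, 198) come from Backblaze's public write-ups rather than from this talk, and the JSON parsing assumes a reasonably recent smartctl with --json support, so treat both as assumptions.

```python
# Flag non-zero raw values among the watched SMART attributes for one drive.
# Attribute list and JSON layout are assumptions, not stated in the talk.

import json
import subprocess

WATCHED = {5: "Reallocated Sectors", 187: "Reported Uncorrectable",
           188: "Command Timeout", 197: "Current Pending Sectors",
           198: "Offline Uncorrectable"}

def sick_indicators(device="/dev/sda"):
    """Return {attr_id: raw_value} for watched attributes with raw > 0."""
    out = subprocess.run(["smartctl", "-A", "--json", device],
                         capture_output=True, text=True, check=False)
    data = json.loads(out.stdout)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    return {a["id"]: a["raw"]["value"]
            for a in table
            if a["id"] in WATCHED and a["raw"]["value"] > 0}

if __name__ == "__main__":
    flagged = sick_indicators()
    for attr_id, raw in flagged.items():
        print(f"SMART {attr_id} ({WATCHED[attr_id]}): raw={raw}")
    print("drive looks sick" if flagged else "no watched attribute above zero")
```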
So, detection rates.
So of the failed drives that we see, about 22%
have one of those five having a reading above zero.
So 24% have two of those SMART stats that have a reading above zero.
So if you add all of the five, about 80%
of the drives that failed had a reading above zero on at least one of those five SMART stats.
So fairly high correlation.
Very, very few of the operational drives
actually have a reading on those smart stats.
So it really is a significant indicator.
It's not the kind of thing where "the drive doesn't spin up" is 50% of the drives, "the drive doesn't stay synced in the RAID" is 40%, and 10% comes through SMART stats. It's really a large percentage that comes from the SMART stat data. And so overall, only 4% of the operational drives show anything at all on any of those five SMART stats
versus almost 80% of the failed drives.
So this is another way to look at this data, which is that each one
of the indicators has shown up in about 40% of the drives for failed drives,
mostly sub-1% for the operational. So, you know, the SMART stats really do give good guidance that drives are going to fail.
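For anyone who wants to reproduce these detection-rate style numbers, here is one rough way to do it against Backblaze's public drive-stats CSVs (one row per drive per day). The column names and file layout are my recollection of the published dataset, so check them against the files you actually download.

```python
# Rough reproduction of the "share of drives with any watched attribute above
# zero, failed vs. operational" numbers from the public daily CSVs.

import glob

import pandas as pd

# Raw-value columns for the five watched attributes, as named in the public
# dataset (assumption -- verify against the CSV headers).
RAW_COLS = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
            "smart_197_raw", "smart_198_raw"]

def load_days(csv_glob="data_Q1_2016/*.csv"):
    """Concatenate daily CSVs (one row per drive per day) into one frame."""
    cols = ["date", "serial_number", "failure"] + RAW_COLS
    frames = [pd.read_csv(path, usecols=cols) for path in glob.glob(csv_glob)]
    return pd.concat(frames, ignore_index=True)

def detection_rates(df):
    """Collapse to one row per drive, then split by failed vs. operational."""
    # Max raw values over the period, and whether the drive ever failed.
    per_drive = df.groupby("serial_number").max(numeric_only=True)
    per_drive["any_indicator"] = (per_drive[RAW_COLS].fillna(0) > 0).any(axis=1)
    rates = per_drive.groupby(per_drive["failure"] > 0)["any_indicator"].mean()
    return per_drive, rates

if __name__ == "__main__":
    per_drive, rates = detection_rates(load_days())
    print(rates)  # index False = operational drives, True = failed drives
```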
This is a correlation chart between the different SMART stats.
So one of the questions is, okay, so there are 50-ish SMART stats.
We've chosen five.
But if you could just tell based on one of these five that a drive is going to die,
why even bother tracking the other four?
And so what you'll see here is that, you know, obviously smart five correlates with smart five.
It's the same one.
But if a drive sees smart five, the chances that it sees any of the other smart errors is actually very small on a correlation basis.
So just because a drive is exhibiting one type of error does not mean it's exhibiting
all of the types of errors.
It's actually a very slight correlation.
So it leans towards showing that a drive may die for different reasons.
It's not just that as soon as it has some issue with it, all things end up breaking.
The only pair that does have a fairly high correlation is 197 and 198.
We still use both for two reasons.
One is the correlation is not perfect.
It's not one to one.
And the other is one of those is only reported by Seagate,
so for the other drives, we'll use the other one.
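A sketch of the cross-correlation check just described, building on the per-drive table from the previous snippet: turn each watched attribute into a 0/1 "ever non-zero" flag for the failed drives and correlate the flags. Again, the column names are assumed from the public CSVs.

```python
def indicator_correlation(per_drive):
    # Failed drives only; 1 if the attribute's raw value ever exceeded zero.
    failed = per_drive[per_drive["failure"] > 0]
    flags = (failed[RAW_COLS].fillna(0) > 0).astype(int)
    # Pearson correlation of the 0/1 indicators. The diagonal is 1.0 by
    # definition; the talk's observation is that off-diagonal values stay
    # small except for the 197/198 pair.
    return flags.corr()
```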
So, other SMART stats. So of the 50-ish SMART stats, a lot of them don't correlate to anything related to failure
or don't seem to.
Some of them have issues. The drive manufacturers don't publish and say, here's exactly the stat you should be looking at inside of the SMART data to tell you whether the drive is going to fail. So a lot of this is anecdotal. And so some of them, if you're looking
at the failures versus the data, they're completely random and mismatched, and there's no putting it
together. We do look at some others. So, high fly writes: you look at it and you go, wow, you know, 47% of the drives that failed had high fly writes. That might be a strong leading indicator of drive failure. But 16% of operational drives had it, so, you know, if you're pulling a drive out because it has high fly writes, you might be pulling a perfectly valid, perfectly good drive out. So it's not a clear indicator. One of the things with high fly writes that we've seen is that, whereas on the five SMART stats that we use, they should all be zero.
If they have a value of one, that is an indicator that something is not going great.
On high fly writes, it's very possible that you'll see it go, you know, it should be zero, but it'll say one. And then for a month, nothing will happen. Then it'll have two. And for a month, nothing will happen. Then it'll have three.
And that drive might be fine for a long, long time.
So this one is more of a trending.
So we'll see some drives with 113 high fly writes. If they have accumulated 113 high fly writes over four years, that drive is probably still fine. If they've accumulated 113 high fly writes in one day, that's probably an indicator things are going off the rails and
it's going to go bad. So whereas with the five that we use consistently, while it's not a perfect science, it's pretty black and white, with some of them it's more of a directional guidance.
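The high-fly-writes point is about trend rather than absolute value, so a check for it looks less like "is the raw value non-zero" and more like the toy below: flag a drive whose counter jumps too much inside a short window. The window and threshold here are invented for illustration, not Backblaze's actual policy.

```python
# Trend check for a slowly accumulating SMART counter: 113 over four years is
# probably fine, 113 in a day is not. Window and threshold are made up.

from datetime import timedelta

def rising_too_fast(samples, window=timedelta(days=7), max_delta=10):
    """samples: list of (datetime, raw_value) pairs, sorted by time."""
    for (t0, v0) in samples:
        for (t1, v1) in samples:
            if t0 < t1 <= t0 + window and (v1 - v0) > max_delta:
                return True   # counter jumped too quickly inside the window
    return False
```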
Spin retry count. So spin retry count is, you know, you power on the drive and it tries to spin up and it can't, so it tries again until it spins up, and it tells you how many times it took to do that. Of the
failed drives it's a small percentage, but of the operational drives it's a very, very small percentage. So, relatively speaking, this should be a good one to
give us guidance that the drives might be getting ready to fail.
There are a couple issues for us with this.
Primarily the fact that we don't power cycle drives
that often, so the data here is somewhat limited.
And the experience for us to say,
hey, this drive might be something that we should pull,
isn't as valuable, because it's not the kind of thing
where we reboot the entire 250 petabytes every day
and can look at this number every single day and go,
oh, these 20 drives are good.
And these 20 drives have high spin retry counts,
and we should pull them.
So it is both one that we have somewhat less data on
and also one that is somewhat operationally more difficult
to kind of make decisions based off of.
Okay, so failure rates of drives.
So, you know, obviously one of the big questions
that people ask is, how long is my drive going to last?
And, you know, on the internet,
you'll see lots of pundits saying,
oh my, you know, my drive died,
so, you know, I'll never buy this drive vendor again.
It's like, well, a survey sample of one.
So right now, we have about 70,000 drives in operation.
We have had about 5,000 drives die over the history of time.
So operational hours is about a billion
of drive hours in production.
So about 7% of the drives have failed,
which as it is, I think,
is a very low number of drives that have failed.
But that's over time,
and that takes into account various things
like the number of drives going up and everything else.
The way we think about it is an annualized failure rate.
So for one year that a drive is running, what's the failure rate?
And it's sub-4%, which to me is astounding.
You know, you're talking about these are consumer drives under constant load 24-7.
They are storing, most of our drives at this point are 4-terabyte drives,
so they're storing 4 terabytes of data, constantly getting written, read, deleted,
and spinning at 7200 RPM with a hair between the head and the drive,
and yet 96% of them at the end of the year are
still going to be fine.
To me, we don't make drives.
I'm actually incredibly impressed by the drive manufacturers for pulling that off.
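For reference, the annualized failure rate mentioned here is just failures per drive-year of operation. A back-of-the-envelope version with the talk's round numbers (about 5,000 failures over roughly a billion drive-hours):

```python
# Back-of-the-envelope annualized failure rate: failures per drive-year.
# The inputs are the round figures quoted in the talk, not exact published data.

def annualized_failure_rate(failures, drive_hours):
    drive_years = drive_hours / (24 * 365)
    return failures / drive_years

if __name__ == "__main__":
    afr = annualized_failure_rate(failures=5_000, drive_hours=1_000_000_000)
    print(f"AFR ~ {afr:.1%}")   # roughly 4-5% with these round numbers
```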
So this is a couple years old at this point, but drive failure rates over time.
So 4% is an annualized failure rate, so on average 4%.
But that doesn't stay consistent throughout the life of a drive.
It changes from the beginning to the middle to the end.
So on the left is a theoretical curve.
It's called the bathtub curve.
And the idea is that at the very beginning you buy a drive and it might die
for a variety of reasons: manufacturing defects, you know, various things that aren't quite in alignment, etc., that cause the failure rate early on in the drive's life to be higher.
Then it gets into this middle state where if it survived through
that early part, it doesn't have any of these kind of manufacturing defects. It's just going
to have kind of a standard level of failure. And then toward the end of its life, it starts
wearing out. The parts start getting old, they start wearing out, and so the failure rate starts to climb.
In the middle, throughout that entire time frame, there are random errors that are constantly happening.
And so you put all that together and you get this bathtub curve.
So that's the theoretical.
This is the real data that we've seen in our environment, and it actually does match the bathtub curve, though the back end of the bathtub curve is a little steeper than the front end. Now, this one, like I said, is a few years old. We are actually working on reanalyzing this curve. Every quarter we publish statistics on drive failure rates on our blog.
In the Q4 analysis, we're planning on having an updated version of that
with all the new history over the last few years.
So if you're interested in seeing that, you know, subscribe to the blog
and we'll send that to you when it's ready.
But, you know, it does follow this kind of general bathtub distribution.
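One way to re-derive a curve like this yourself from the public daily CSVs is to bucket drive-days by drive age (SMART 9 is power-on hours) and compute a failure rate per bucket. A rough sketch, with the same schema assumptions as the earlier snippets:

```python
# Failure rate by drive age, reconstructed from the daily rows. df is the
# concatenated per-drive-per-day frame with failure and smart_9_raw columns.

import pandas as pd

def failure_rate_by_age(df, bucket_months=3):
    df = df.dropna(subset=["smart_9_raw"]).copy()
    age_months = df["smart_9_raw"] / (24 * 30)            # hours -> rough months
    df["age_bucket"] = (age_months // bucket_months).astype(int)
    grouped = df.groupby("age_bucket").agg(
        drive_days=("failure", "size"),
        failures=("failure", "sum"))
    # Failures per drive-day in each age bucket, scaled up to a year.
    grouped["annualized_rate"] = grouped["failures"] / grouped["drive_days"] * 365
    return grouped
```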
So one of the things that we hear a lot is you can't use consumer drives
in an enterprise environment or you can't use consumer drives
in a data center environment.
So in our case, we needed to use consumer drives at the very beginning for the simple
reason that providing a $5 a month unlimited service or half a penny a gig storage service,
it's very hard to do with the prices of enterprise drives.
So if we could make consumer drives work, we would obviously prefer to do that.
This was an analysis we did at one point.
To be realistic, we have very few enterprise drives in our environment.
So compared to the consumer drives where we feel like we actually have good statistics
on their failure rates, we just don't have that many enterprise drives in the environment.
But we had some, so we were able to do some amount of analysis on it.
So the key takeaway for us was that at least in our environment, the enterprise drives and the
consumer drives performed about the same. The enterprise drives were actually slightly worse,
but pretty close to the same.
I think one of the key things for us was that the question is, when are you going to swap the drives anyway?
And does the, you know, if you could get better than 4%, does it matter?
And, you know, if you could get down from 4% to 2% or 1%, are you going to make a different decision? And if the drives cost twice as much, the ROI isn't going to be there between a four percent and a three percent.
So it's a good question. There was a mix; I don't remember exactly what the whole mix was. I mean, I think that they tended to be the higher RPM ones
because some of them were for different use cases,
but I don't remember exactly.
So, you know, I think the key thing for us, though,
was that if the consumer drives are failing at 20%, then even if you don't have a straight ROI for the benefit of getting down from 20% to 4%, you may do it for other reasons.
Operationally, replacing 20% of the drives every year that fail can be taxing on the organization.
But down at 4%, you're going to swap out the drives anyway.
I mean, right now, we've migrated off of all of the 1 terabytes,
and we've just finished migrating off of all the 2 terabytes.
We still have tons of them that are perfectly fine in production,
but they just don't make sense anymore from a density perspective in the data center.
Yeah?
Do you guys consider upgrading firmware at any point? And are your pods like the same vendor or are you guys using different vendors in the
same pod?
Is it capacity based or is it...
The same vendor you mean the same drive vendor?
Right.
So typically the pods start out being the same vendor in a pod, but they don't necessarily stay that way.
So as drives get swapped and replaced, they will get swapped and replaced with drives from other manufacturers, potentially.
In practice, a large percentage of them are fairly homogenous.
So we'll have 50 pods of Seagate drives,
50 pods of Western Digital drives, et cetera.
But as drives get pulled, we don't
force keeping homogeneity.
And the firmware question was related to,
because if you have a firmware bug in a large population
and you're trying
to update, the consumer-level drives do not have some of the capabilities that the enterprise-level ones have while trying to do that update as the I/O is being serviced. So that's why I wanted to know.
So, I don't remember what we do with firmware.
I'd have to double-check on it.
Shoot me an email, I'll double-check on it and get back to you on it.
One of the things that we do with drives is after,
if we pull drives out that are not completely bricked,
we have a device that then runs through and checks the drive to see, could it be reused? Is it still valid?
And in almost all cases, the drives actually get a, you know, no, you can't use this drive anymore kind of thing.
Power cycling.
So we don't, like I said, we don't power cycle drives a lot,
but one of the questions that we get asked is, you know,
should I leave my drive running all the time,
or should I turn it off when I'm not using it?
This tends to be more of a consumer-level question, obviously,
than a data center question.
So SMART 12 gives us the number of power cycle counts.
And statistically speaking, the failed drives
do have more power cycles. However, 27 versus 10, you know, it's not super clear. It is more. The 27 somewhat
correlates to the drives also being in service longer because that's why they've been power
cycled more. We don't intentionally power-cycle the drives. It's just a function of, if the pod needs
to be power-cycled for some reason,
then we will power-cycle it.
And as such, the drives will go that way.
So this is one where, again, the data
is a little shy on whether power-cycling actually has much of an effect or not.
So this is one where I put this out there.
There's a little tiny bit of data on it.
Take it for what you will.
This one we have a little bit more data on.
So one of the questions is, if I keep the drives really hot, do they die or do they perform better because the lubricants move more smoothly through them?
Should I keep them cool or is that going to seize them up?
So this is a chart of the operational temperatures.
This chart has nothing to do with failures. This is just the drives
operating in the environment telling us not ambient temperature, but the drives themselves
internally, what is the temperature as they're operating. So most of the drives are somewhere
from here to here. So some drives get quite hot, some drives stay quite cold, but for the most part,
they're kind of in this, you know, middle range. And, you know, this represents about 70,000 drives.
So there are various reasons for why the drives are at different temperatures. Obviously, they're inside of a data center, and the data center is kept, you know, cool from the cold aisle, etc. But drives that are toward the top of a rack are more likely to have the ambient temperature be warmer there. Drives
that are in the middle of a pod are more likely to have the ambient temperature be a little warmer,
etc. So there are some ambient temperature reasons for
the drives to be warmer or colder. In addition to that, there are other reasons for drives to be
warmer or colder. Some drives just run warmer. So you take two drives, you plug them in, you do
nothing with them, you put them in the exact same environment, and some manufacturers and some drive models run warmer than others.
Load is also a factor.
So if the drives are in their initial load phase, where
we're putting all of the data on them as fast as possible,
they tend to run warmer than when
they're a little more static.
But this is the overall flow.
So, failed drives versus operational.
So I thought that this was an entertaining chart.
And my first response to it when I saw it
was to call up our chief cloud officer and say,
should I buy you a bunch of space heaters?
So, you know, clearly we have far fewer drives that have died,
that are hotter, which is an interesting takeaway.
You know, we're in a colo facility, so unfortunately we can't ask them, hey, turn off the AC; I would love to do that. But we do have some control over the ambient temperature. Every one of our Backblaze storage pods has fans in it. When we started the company, we had six fans: three in the front, three in the middle.
And at some point, one of the things that we saw was that we had a row of, a set of fans that all died in fairly short order
because they had a manufacturing issue with the lubricant.
And so we found that a whole bunch of the storage pods were down to one fan.
And when we looked at the drive temperatures and drive failure rates at that point,
this was a number of years ago,
the temperatures hadn't gone way up,
and they weren't outside of any of the drive specs.
And the drive failure rates didn't seem to be different.
And so we said, why are we spending extra money on buying fans and
extra power on powering fans? Certainly, maybe
we don't go down to one, but going down to three seems
sufficient. So all of the storage pods that we've been deploying
now for a number of years
have had three fans, not six. But this would say that we should stop putting fans into the pods at
all, and maybe we should put little heaters into the fans, into the storage pods, right?
So it's an interesting analysis. There are some things, obviously, to be careful of.
This is correlation, not causation.
So the question is, are there other things that would correlate failure with cooler temperatures, other than the temperature itself? And so there are other analyses that we want to do. For example, one of the things I mentioned was that drives in the middle of a pod might get warmer than drives on the edges. Is it possible the drives in the middle of a pod are, you know, better insulated from vibration or other things than drives on the edges, and so the two would be correlated to each other as opposed to causal? Right, so there are things that might be causing this.
Yeah? Let's see, it's slightly different: since you have that curve where 70%, most of the drives in this picture, most of the drives are on the left side, does this take into account the fact that most of the drives were on that side? So what's the relation with temperature, more of that type, more of this?
So this is a percentage of failure versus operational. So it could have been that, even though the chart of operational temperatures was over here, it could have been 100% like this and very low that way, right? Because it's not number of failures, it's percentage of failures.
And this is a chart based on the last 90 days of temperature
for that drive, for all the drives.
So, there was an initial analysis that
was done that looked even a little wackier than this one, but it was based
on the temperature of the drive on the last day that it reported. And we looked
at that and said that might not be a good way to look at it because who knows
what happened the day the drive actually died. You know, like it might have caught
fire and then died, right? And so it's like, yay, hot temperature is good, or something. I mean, so this is looking back 90 days.
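A sketch of that trailing-90-day temperature analysis against the public CSVs, assuming SMART 194 holds the drive-reported temperature and the same column naming as before:

```python
# Failure rate by average drive temperature over each drive's last 90 days of
# reports. df is the concatenated daily frame with date, serial_number,
# failure, and smart_194_raw columns (schema assumed from the public dataset).

import pandas as pd

def failure_vs_temperature(df):
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    # Keep only each drive's trailing 90 days of reports.
    last_seen = df.groupby("serial_number")["date"].transform("max")
    recent = df[df["date"] > last_seen - pd.Timedelta(days=90)]

    per_drive = recent.groupby("serial_number").agg(
        avg_temp=("smart_194_raw", "mean"),
        failed=("failure", "max"))
    per_drive["temp_bucket"] = per_drive["avg_temp"].round()
    # Fraction of drives in each one-degree bucket that failed.
    return per_drive.groupby("temp_bucket")["failed"].mean()
```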
One of the slices on this data is by manufacturer.
So one of the things to note is, you know,
things vary, you know, by manufacturer, right?
So it's not 100% the same curve for all drive makes models,
but this still is consistent.
So warmer temperature is definitely correlated with lower percentage of failures.
So again, before you deploy space heaters in your environment (or feel free to do that, and let me know how it goes, I'd love to know), we're also going to need to run various other theories on what could possibly be causal other than temperature.
But short of finding some of those things to be true,
I actually think that we probably will start looking at,
at the very least, tearing fans out of pods
and seeing if that improves reliability over time.
The metrics, again, on all of this,
I mean, even with most of the drives being in this middle realm and most of the drives having
higher failures in this middle realm than they do up here, the overall failure rate
is still sub 4% per year.
So the drives are still quite reliable at any temperature.
I mean, obviously, if you cook them to 200 degrees Celsius, they're probably going to
be less reliable.
But on the whole, they're quite reliable drives.
So the last thing on here was just to correlate temperature
versus the five smart stats that we use currently for checking failure.
So we've said that there are different reasons for drive failures. The five SMART stats that we use, for the most part, don't correlate much with each other, so the failure of
any given one of those things isn't necessarily an indicator that
everything is going wrong inside of it. Temperature seems to be the same, which is to say that high or low temperatures have little correlation with any of those five SMART stats.
So if low temperatures are causing failures,
they don't seem to be causing failures in the ways
that these smart stats are accounting for it.
And again, all of the temperature
is based on the internal drive temperature.
This is not ambient environmental temperature. This is what the drive tells us its temperature
is at. So that was a little bit of an overview of the environment the drives are living in
and how they're getting operated. What we think of as both obviously a formal black
and white drive dying, as well as the more gray areas around the
smart stats and then the variety of findings around reliability over time and some of the
slices on that data.
If you go to backblaze.com slash hard dash drive, we publish lots of analysis
on the drives.
We also publish all of this SMART stat data for all of our drives publicly. So if there's any analysis where you're like, ooh, I wish they'd run XYZ analysis on the data, certainly you can email me and go, hey, can you guys run this? And if it's something where we go, yeah, actually that would be a good one to run, we'll try to run it.
And if we do, we'll publish a blog post with the data.
If you want, you can run all of your own analyses on all of the data.
So it is all downloadable.
It is a fairly significant amount of data.
I think there are 6 billion data points in there.
So you're probably not going to load it for the most part in Excel.
You know, maybe some subset or slice of it, but the data is available there.
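For pulling a manageable slice out of the published files without loading everything, something like this works; the glob pattern, column names, and the example model prefix are placeholders you would adjust to the files you download:

```python
# Stream the daily CSVs and keep only one model family and a few columns,
# so the result stays small enough to work with interactively.

import glob

import pandas as pd

def load_slice(csv_glob="data_*/*.csv", model_prefix="ST4000",
               columns=("date", "serial_number", "model", "failure")):
    parts = []
    for path in sorted(glob.glob(csv_glob)):
        day = pd.read_csv(path, usecols=list(columns))
        parts.append(day[day["model"].str.startswith(model_prefix)])
    return pd.concat(parts, ignore_index=True)
```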
And also you can sign up there for getting updates on these various different things.
That's it.
Thank you.
I'll try to answer questions.
What do you do for the interconnection between the vault? Like, when you have the vault of drives, what's the interconnection between each pod?
Ethernet?
Ethernet.
Yeah, so 10 gig.
So it used to be one gig.
So the question was the interconnection between all the pods on the vault.
So the pods used to be, when they were arrayed, one-gigabit connections, because all the communications were internal to the pod, and external to the pod, all that would happen is we would send up to one gigabit of traffic to that pod, and if that one gigabit was full,
we'd send it to a different pod.
In the case of the vaults, because they have to rebuild the drives throughout the different pods in the vault,
we upgraded all the machines to 10 gig between them.
But it's just Ethernet.
This is a very interesting thing about the pod.
You guys have made your own storage pod.
Can you talk more about that?
Yeah, you can actually.
So we've actually open sourced the design.
So if you Google Backblaze Storage Pod, you know, version 6.0,
we've open sourced every one of the designs, 1, 2, 3, 4, 5, and 6.
It gives the specs for all the components inside of it.
It also gives the build books.
We actually just with version 6.0, we took the build book and gave it to a different contract manufacturer and said,
don't talk to us, read this build book and make sure you can build the pod to production based on the build book
so that we can do a full round trip to make sure the
build book actually works.
Wow.
So the external internet connection, is that a
file interface or a block interface?
Is it a file or block
interface between them?
So it's a file interface between them because we chop files into basically small versions
of the files.
So on every pod, the drives themselves do run a file system.
So they each run ext4 on the drive itself. The only thing that that file system does
is read a file, write a file.
All of the intelligence is above that layer.
But the drives do have file systems on them.
Did you?
Yeah.
So I wonder about range.
So I mean, drives in the middle, right?
So they have an average middle temperature, but we don't see the extremes, like the lowest and the highest, because it's always in cycles of heating and cooling, right? Mm-hmm. So the ones somewhere on the inside, they generate more heat, right?
But their lowest temperature, of course, is higher, right?
So the range is shorter.
So maybe it's the range of highest and lowest
temperature through normal cycling.
So if I understand your question,
the chart shows the average temperature for the drive, but...
Not even average, but an immediate measurement, right?
Because we just pick up some average temperature at some moment, right?
But not the coldest and hottest temperature.
So you don't know.
Maybe sometimes it goes really cool, but it warms up after that, right?
So it goes down cool.
So it's maybe like a bigger range.
So, you know, because your iron, your metal, is always expanding and shrinking because of temperature.
So what happens?
Yeah, so on the temperature side,
you're absolutely right. We are only capturing the data from these SMART stats, from the drives, on some basis. We're not doing it constantly, every millisecond. And so certainly it is an average, in the sense that it was based off of 90 days' worth of data.
So it's a sample on day one, a sample on day two, a sample on day three, a sample on day four.
It's certainly possible, to your point, that during that day, the drive heated up, cooled down, heated up, cooled down, et cetera, right?
So it's certainly possible. It would be one of the theories we would probably need to cross-check against
before deciding that this was definitely causal and not correlated.
There are some reasons to believe that there's not that much variation throughout a given day.
Things around ambient temperature, like the location of the drive and the air flow through it, etc.
doesn't change throughout the day.
The drive was put into this location in this rack at this point.
It's pretty consistent and the load that the drives are under tends to stay fairly consistent. Now it varies from location
to location. So when we deploy a bunch of new pods, those pods are empty. They have no data
on them whatsoever. We aim all of the data to fill those up. So those 20 new pods are certainly
under load that's higher than the load of
the rest of the farm. But once they're full, their load is kind of consistent from day
to day to day. And so I would say that it's most likely they're not wildly varying in
temperatures during the day, but it's certainly a sampling of data points of the drives.
All of the drives are kept in the manufacturer's range.
So we have had a couple drives report things like,
I'm 250 degrees Celsius or I'm zero.
It's like, I'm pretty sure no one is shoving ice cubes around you right now, and
I'm pretty sure that the drive hasn't melted. So some of them, when they fail, the temperature
sensors give bad data, but on the whole, across 70,000 drives, they're all within the range.
It's more of a question of where within the manufacturer's acceptable range do they seem
to perform better or worse?
Any other questions?
Yeah.
The temperature is fascinating.
I was wondering if you happen to know, across different manufacturers, if they're actually measuring the same thing, where they're seeing the metric when they're recording temperature, or if there's consistency between the different manufacturers?
So I'm pretty sure that we're getting the same temperature
because if you put the two drives next to each other.
Now, like I said, the drives, when you power two drives up
from different models or manufacturers,
the drive temperatures they'll report is somewhat different.
Now, there are two possibilities for that.
One is the drive is hotter.
One is the temperature sensor is set up differently, right?
So I don't know the answer between those two.
We do know that certain drives seem to run hotter.
It's only by a couple degrees.
So it's not the kind of thing like that chart, where obviously going from 17 degrees Celsius to 46 is a humongous span of temperature. The difference between what we think of as a drive that runs hot and a drive that runs cool is a few degrees.
But they are somewhat different.
Uh-huh?
Do you see any surprises
with other components?
For example, if one pod is running at 95 degrees,
suddenly power supply is dropping off.
Some other hardware component?
Any other correlations?
It's a really interesting question.
I don't think we've done that correlation.
One of the nice things about hard drives is they have a bunch of sensors. They
spit out data through their interface, and we track all that inside of a large system.
Things that are pod-related and with different components, we don't necessarily have all that
kind of data as directly. So it's an interesting question, but I don't think I could say with enough data.
The other thing about it is from just a volume of data perspective,
we have 70,000 hard drives, but we only have 1,500 power supplies.
And so whereas with the temperatures, when you get out to either extreme of the temperatures... like, there was one thing that I was looking at on here. This one. So if you look over here, it starts looking like, that's interesting. So heat is really good, but then when you get up around 42 or 43 degrees, heat starts getting bad. The problem is, you know, the number of drives in that 43-degree temperature range: if you look at the chart above here, you're way down here in terms of the number of drives in the environment. I mean, you're talking about, you know, a hundred drives running at that temperature. So yeah, that's not data.
So same with the number of power supply components.
The hardware is suddenly...
A whole bank.
Yeah, exactly. A whole bank.
Any other questions I can answer?
Yeah.
Where are you going next, as far as what other metrics or data would you like to see people capture, and what other information do you want to get?
You know, so one of the things we do periodically is, of all these different other SMART stats, we try and recheck whether any of them seem like they're correlated to failure.
Because when we only had a thousand drives and, let's say, 20 failed drives, some of the other SMART stats
looked like they were random correlations to failure.
With 70,000 drives and 5,000 drive failures,
we can smooth out some of the randomness
and figure out whether they may be actually correlated or not.
So periodically, what I want to do is go back to some of those. There are also
new SMART stats periodically. So, for example, helium drives have a helium sensor. So they check whether the amount of helium in the drive, you know, is leaking out. And so we now track that in the SMART stats.
And if you download the SMART stat data from our website,
you'll see at some point that was a new metric.
Not a lot of data there yet, because you
had to buy the drive, have it in production,
and have enough of them fail to be
able to do any backwards calculation.
But at some point, with some of these new SMART stats, we'll be interested in looking at those failures.
What we're also trying to do is look at what other things can we look at unrelated to smart
stats for failure. So by getting rid of RAID and writing all of our own underlying erasure
coding algorithms from scratch, we now have direct access to the drives
and are able to see things like this drive was
able to write this file, this drive was not
able to write this file, this drive
seems to be waiting five seconds or eight seconds or one second
when we ask it to do something.
And so we're able to capture a lot more data than just
what the drive tells us now.
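That kind of "watch how the drive behaves, not just what it reports" monitoring can be as simple as timing each operation per drive and flagging outliers. A toy sketch, with invented thresholds rather than anything Backblaze has described:

```python
# Keep a rolling latency window per drive and flag drives that start taking
# seconds to respond. Structure and thresholds are invented for illustration.

import time
from collections import defaultdict, deque

class DriveLatencyWatch:
    def __init__(self, window=1000, slow_seconds=5.0):
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.slow_seconds = slow_seconds

    def timed(self, drive_id, operation, *args, **kwargs):
        start = time.monotonic()
        result = operation(*args, **kwargs)       # e.g. a shard read or write
        self.samples[drive_id].append(time.monotonic() - start)
        return result

    def suspicious(self, drive_id):
        window = self.samples[drive_id]
        return bool(window) and max(window) > self.slow_seconds
```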
And so one of the things we'll be looking at is, based on our own interactions with the drive, do we see other correlations. And then, you know, ten years from now, I'm hoping that
all this will be completely irrelevant and useless because we will be on SSDs
but ten years ago we looked at it and said, huh.
In 2007, we were like, we're probably two years away
from SSDs making sense in this environment.
So we'll see.
So what kind of method is behind the correlation relationship?
Was it by assuming some clustering, some organizing scheme, something?
I'm sorry. I couldn't hear you.
So what are the kinds of correlations and other kinds
of math behind it?
So we have two people in the company
who deal with re-analyzing the data in various ways.
I don't know all the different types of methods
that they've tried. I know
periodically they have looked at things around, did we have, for example, so we had,
there were a set of Seagate three terabyte drives that had high failure rates. And at some point,
we actually needed to yank out drives that weren't dead, because the rate of failure on those was so high it was making it operationally challenging.
So how do you calculate all this stuff, taking into account that we have this set of drives that are not failed but look like they're going to fail?
And so there were various compensations for those kinds of drives
and those behaviors.
But I, you know,
I don't know all of the math
behind everything
that they've done inside of it.
Any other last questions?
I think we have time
for one more.
All right.
Thank you guys so much.
Appreciate it.
Enjoy the rest of the conference.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join...