Storage Developer Conference - #44: What Can One Billion Hours of Spinning Hard Drives Tell Us?
Episode Date: May 9, 2017...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Chair. Welcome to the SDC
Podcast. Every week, the SDC Podcast presents important technical topics to the developer
community. Each episode is hand-selected by the SNIA Technical Council from the presentations
at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org slash podcast.
You are listening to SDC Podcast Episode 44.
Today we hear from Gleb Budman, CEO of Backblaze,
as he presents What Can One Billion Hours of Spinning Hard Drives Tell Us?
from the 2016 storage developer conference.
My name is Gleb Budman. I'm the co-founder and CEO of Backblaze. And today we're going to talk
about some of the things we've learned from running hard drives for about a billion hours.
So we're going to start off talking about the environment that the hard drives are in so that
it gives some context. It's not a research project. It's actually operational.
We're going to talk about what do we consider a dead or sick drive
and how do we make that determination.
Then we're going to talk about some of the findings.
So how long do they last?
What do we see about enterprise versus consumer drives,
effects of power cycling and temperature on some of the drives,
and then hopefully we'll have some time for questions.
So, just for context: sometimes people ask, okay, how was this research or analysis lab set up? And it's not a research lab. This is an operational environment, so we are simply collecting data from running the drives in our environment.
That means both good and bad things. On one hand, it's a real environment, so it might be more similar to what some of you are running. On the other hand, we don't have the flexibility of saying, well, this is the perfect test that we would love to set up, let's do that. It's more a function of whether this was something that we needed for the business.
So the environment that they ran in, for the most part over the last decade, is a backup-style environment.
So we provide this $5 a month unlimited backup service.
That's what the drives have been doing for most of their life.
If you think of a backup environment, sometimes people think, oh, that means that you put all the data in them,
they just filled up and then they sat there completely idle for all the rest of the time, almost never used.
It's not actually the environment that we have, because the drives are constantly getting written to, read from, checked, deleted.
So they are actually operational all the time. And then more recently we have this B2 service, which is a cloud storage offering, and again that is a real-time service, so these drives are being operated on in production 24/7. So that's a little bit of the context. And then let's talk about what the drives are actually living in.
So in a plywood box.
So obviously all the drives are not in the plywood box anymore,
but this is where they started.
The very first version was a plywood server to store the drives in.
The whole idea originally was just, we needed as inexpensive as possible
a way to connect drives to the internet.
So this was a prototype plywood server.
It actually was deployed in a data center,
and actually ran. And
then we did a bunch of testing and analysis
on power and cooling and the like,
different ways of connecting drives to motherboards, built the chassis
and made what we call a storage pod.
And so a storage pod is just an inexpensive box to store hard drives and connect them
to the internet.
Some of the things that happen because it's an operational environment is that the drives
are purchased over long periods of time, right?
So some of the drives were purchased 10 years ago.
Some of the drives were purchased more recently.
Sometimes we dealt with things like the Thailand drive crisis, which many of you are probably
familiar with.
When Thailand flooded, all of a sudden it was very difficult to get drives. And so one of the ways that we bought drives is we bought
consumer USB external hard drives, cracked them out of
their shells, and put the drives in the servers.
So a little unorthodox, but we bought thousands of drives
that way during a time when drives were unavailable.
So the drives in the environment are a whole mix.
They started off being one-terabyte drives. We now have up to eight terabytes. It's a combination of
the two. But the production is now pretty standardized for these storage pod deployments.
And the model that they're mostly running in is what we call storage pod version 6.0.
It's a 60 drive chassis with 60 drives on the front, compute on the back,
and this is the base unit that the drives are in. And then, so the one level up from that is,
and early on we used RAID inside of the environment. We still have RAID in some of the boxes
that are kind of legacy, of course. So when I talk a little bit about how we treat drives
and decide whether they are dead or not,
one of the things I talk about is RAID.
These are actually RAIN and RAID.
The next level up is what we've switched to.
What are the drives now?
Like, is it SAS or SATA?
They're SATA drives.
So they're all consumer SATA drives.
And so the next level up right now is actually what we call Vault.
So Vault is 20 of these units, these storage pods, each put into a different rack.
And when you write one file, we chop it into 20 pieces.
It gets erasure coded across all 20, one shard per drive, per pod, per rack.
And then you can reassemble any of them from any 17 of those pieces.
So that's kind of the base unit today and what most of the drives are running in.
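To make that 17-of-20 idea concrete, here is a minimal, self-contained sketch of a 20-shard erasure code that survives any three losses. It is a toy (a polynomial code over a prime field, with made-up shard placement), not Backblaze's production Vault code, which the talk only describes at this level of detail.

```python
# Toy 17-of-20 erasure code: encode 17 data symbols into 20 shards (one per
# pod), then rebuild from any 17 survivors. Illustration only.

P = 65537  # prime modulus; every byte value 0..255 fits as a field element

def poly_mul_linear(poly, xm):
    """Multiply a polynomial (coefficients, low degree first) by (x - xm) mod P."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - xm * c) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out

def encode(chunk):
    """17 byte values -> 20 shard symbols (shard i would live on pod i)."""
    assert len(chunk) == 17
    # Treat the bytes as coefficients of a degree-16 polynomial and evaluate
    # it at x = 1..20; each evaluation is one shard symbol.
    return [sum(c * pow(x, i, P) for i, c in enumerate(chunk)) % P
            for x in range(1, 21)]

def decode(points):
    """Any 17 (x, shard_symbol) pairs -> the original 17 byte values."""
    assert len(points) == 17
    coeffs = [0] * 17
    for j, (xj, yj) in enumerate(points):
        # Lagrange interpolation: rebuild the polynomial from 17 evaluations.
        basis, denom = [1], 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                basis = poly_mul_linear(basis, xm)
                denom = (denom * (xj - xm)) % P
        scale = yj * pow(denom, P - 2, P) % P  # modular inverse via Fermat
        for i, b in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * b) % P
    return coeffs

if __name__ == "__main__":
    original = [ord(c) for c in "17 bytes of data!"]    # exactly 17 symbols
    shards = list(enumerate(encode(original), start=1))  # (pod number, symbol)
    del shards[19], shards[9], shards[2]                 # lose any three pods
    assert decode(shards) == original
    print("rebuilt the data from 17 of 20 shards")
```

The point is just the property the talk describes: write 20 pieces, one per pod per rack, and read the file back from any 17 of them.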
And then the next level up from that is you assemble these vaults into clusters, and then the clusters get assembled into the overall Backblaze file system.
And so today we store about 250 petabytes of data for customers.
We add about 10 petabytes every month.
And all of that is spinning drives.
So we get asked periodically, what about Flash and SSD?
We love SSD. SSD is awesome. I hope we switch to it at some point.
It's completely unaffordable for the kind of use cases that we're looking at today.
So this is all spinning drives and mostly, almost entirely, consumer-grade spinning drives.
So that's the environment. Let's talk about how do we actually consider
a sick drive. So there are three things. The first is that the drive doesn't spin up. It
doesn't connect to the OS. Okay, so obviously if you can't turn the drive on, the drive
is dead. That's kind of binary, fairly straightforward. The second one is it won't stay synced in
an array. Now, I say RAID here; as I said, in some of the pods it's actually RAID the way we kind of think of RAID. In most of them now it's this erasure-coded algorithm across 20 different storage pods, but conceptually it's the same thing: if the drive won't stay in that array, it's unusable in this kind of environment, and we consider that a sick drive.
Both of those are fairly binary in nature. The third one is using SMART stats. And so we use SMART stats to give us a sense of, do we think the drive is going bad? It's still a little bit in
that gray area of, this is not a clearly dead drive, but it might not be quite a good drive either.
So let's talk more about the smart stat side of it.
So there are over 50 different smart stats you could use.
These are the five that we actually look at to get a sense of whether the drives are going
bad. And these are the ones that seem to be correlated to failure
versus being correlated sometimes to things that look like failure.
The most common thing where you can look at smart stats
that look like they're correlated to failure but aren't
is age or time running.
So there are some SMART stats you look at and go, wow, you know, this SMART stat was high when the drive died. And on most of the drives that are dying, the stat is higher.
Well, if it measures hours of time that the drive has been in production, obviously it's going to be correlated: the older the drive is, the more likely it was to die.
So there are some that are misleading in that way.
So you have to actually correlate out age.
But these are five that we've found that actually do correlate with failure specifically.
Three of them you'll notice are reported only by Seagate. So one of the things is that some of the smart stats are reported by all drive manufacturers and models.
Some of them are only reported by one vendor or another.
It would be great, obviously, if all drive vendors reported all of them, and if all of them standardized.
That would be nice.
It's not the world we live in.
So we use the data that we can pull from them.
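As a rough sketch of what a "watch these attributes" check can look like: the snippet below reads SMART attributes with smartmontools and flags non-zero raw values. The specific attribute IDs (5, 187, 188, 197, 198) come from Backblaze's public write-ups rather than from this talk, and the JSON parsing assumes a reasonably recent smartctl with --json support, so treat both as assumptions.

```python
# Flag non-zero raw values among the watched SMART attributes for one drive.
# Attribute list and JSON layout are assumptions, not stated in the talk.

import json
import subprocess

WATCHED = {5: "Reallocated Sectors", 187: "Reported Uncorrectable",
           188: "Command Timeout", 197: "Current Pending Sectors",
           198: "Offline Uncorrectable"}

def sick_indicators(device="/dev/sda"):
    """Return {attr_id: raw_value} for watched attributes with raw > 0."""
    out = subprocess.run(["smartctl", "-A", "--json", device],
                         capture_output=True, text=True, check=False)
    data = json.loads(out.stdout)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    return {a["id"]: a["raw"]["value"]
            for a in table
            if a["id"] in WATCHED and a["raw"]["value"] > 0}

if __name__ == "__main__":
    flagged = sick_indicators()
    for attr_id, raw in flagged.items():
        print(f"SMART {attr_id} ({WATCHED[attr_id]}): raw={raw}")
    print("drive looks sick" if flagged else "no watched attribute above zero")
```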
So, detection rates.
So of the failed drives that we see, about 22%
have one of those five having a reading above zero.
So 24% have two of those SMART stats that have a reading above zero.
So if you add all of the five, about 80%
of the drives that failed had a reading above zero on at least one of those five SMART stats.
So fairly high correlation.
Very, very few of the operational drives
actually have a reading on those smart stats.
So it really is a significant indicator.
It's not the kind of thing where "the drive doesn't spin up" is 50% of the drives, "the drive doesn't stay synced in the RAID" is 40%, and 10% comes through SMART stats. It's really a large percentage that comes from the SMART stat data. And so overall, only 4% of the operational drives show anything at all on any of those five SMART stats
versus almost 80% of the failed drives.
So this is another way to look at this data, which is that each one
of the indicators has shown up in about 40% of the drives for failed drives,
mostly sub-1% for the operational. So, you know, the SMART stats really do give good guidance that drives are going to fail.
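For anyone who wants to reproduce these detection-rate style numbers, here is one rough way to do it against Backblaze's public drive-stats CSVs (one row per drive per day). The column names and file layout are my recollection of the published dataset, so check them against the files you actually download.

```python
# Rough reproduction of the "share of drives with any watched attribute above
# zero, failed vs. operational" numbers from the public daily CSVs.

import glob

import pandas as pd

# Raw-value columns for the five watched attributes, as named in the public
# dataset (assumption -- verify against the CSV headers).
RAW_COLS = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
            "smart_197_raw", "smart_198_raw"]

def load_days(csv_glob="data_Q1_2016/*.csv"):
    """Concatenate daily CSVs (one row per drive per day) into one frame."""
    cols = ["date", "serial_number", "failure"] + RAW_COLS
    frames = [pd.read_csv(path, usecols=cols) for path in glob.glob(csv_glob)]
    return pd.concat(frames, ignore_index=True)

def detection_rates(df):
    """Collapse to one row per drive, then split by failed vs. operational."""
    # Max raw values over the period, and whether the drive ever failed.
    per_drive = df.groupby("serial_number").max(numeric_only=True)
    per_drive["any_indicator"] = (per_drive[RAW_COLS].fillna(0) > 0).any(axis=1)
    rates = per_drive.groupby(per_drive["failure"] > 0)["any_indicator"].mean()
    return per_drive, rates

if __name__ == "__main__":
    per_drive, rates = detection_rates(load_days())
    print(rates)  # index False = operational drives, True = failed drives
```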
This is a correlation chart between the different SMART stats.
So one of the questions is, okay, so there are 50-ish SMART stats.
We've chosen five.
But if you could just tell based on one of these five that a drive is going to die,
why even bother tracking the other four?
And so what you'll see here is that, you know, obviously smart five correlates with smart five.
It's the same one.
But if a drive sees smart five, the chances that it sees any of the other smart errors is actually very small on a correlation basis.
So just because a drive is exhibiting one type of error does not mean it's exhibiting
all of the types of errors.
It's actually a very slight correlation.
So it leans towards showing that a drive may die for different reasons.
It's not just that as soon as it has some issue with it, all things end up breaking.
The only pair that does have a fairly high correlation is 197 and 198.
We still use both for two reasons.
One is the correlation is not perfect.
It's not one to one.
And the other is one of those is only reported by Seagate,
so for the other drives, we'll use the other one.
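A sketch of the cross-correlation check just described, building on the per-drive table from the previous snippet: turn each watched attribute into a 0/1 "ever non-zero" flag for the failed drives and correlate the flags. Again, the column names are assumed from the public CSVs.

```python
def indicator_correlation(per_drive):
    # Failed drives only; 1 if the attribute's raw value ever exceeded zero.
    failed = per_drive[per_drive["failure"] > 0]
    flags = (failed[RAW_COLS].fillna(0) > 0).astype(int)
    # Pearson correlation of the 0/1 indicators. The diagonal is 1.0 by
    # definition; the talk's observation is that off-diagonal values stay
    # small except for the 197/198 pair.
    return flags.corr()
```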
So, other SMART stats. So of the 50-ish SMART stats, a lot of them don't correlate to anything related to failure
or don't seem to.
Some of them have issues. The drive manufacturers don't publish and say, here's exactly the stat you should be looking at inside of the SMART data to tell you whether the drive is going to fail. So a lot of this is anecdotal. And so some of them, if you're looking
at the failures versus the data, they're completely random and mismatched, and there's no putting it
together. We do look at some others. So, high fly writes: you look at it and you go, wow, you know, 47% of the drives that failed had high fly writes. That might be a strong leading indicator of drive failure. But 16% of operational drives had it, so, you know, if you're pulling a drive out because it has high fly writes, you might be pulling a perfectly valid, perfectly good drive out. So it's not a clear indicator. One of the things with high fly writes that we've seen is that, whereas on the five SMART stats that we use, they should all be zero.
If they have a value of one, that is an indicator that something is not going great.
On high fly writes, it's very possible that you'll see it go, you know, it should be zero, but it'll say one. And then for a month, nothing will happen. Then it'll have two. And for a month, nothing will happen. Then it'll have three.
And that drive might be fine for a long, long time.
So this one is more of a trending.
So we'll see some drives with 113 high fly writes. If they have accumulated 113 high fly writes over four years, that drive is probably still fine. If they've accumulated 113 high fly writes in one day, that's probably an indicator things are going off the rails and
it's going to go bad. So whereas with the five that we use consistently, while it's not a perfect science, it's pretty black and white, with some of them it's more of a directional guidance.
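The high-fly-writes point is about trend rather than absolute value, so a check for it looks less like "is the raw value non-zero" and more like the toy below: flag a drive whose counter jumps too much inside a short window. The window and threshold here are invented for illustration, not Backblaze's actual policy.

```python
# Trend check for a slowly accumulating SMART counter: 113 over four years is
# probably fine, 113 in a day is not. Window and threshold are made up.

from datetime import timedelta

def rising_too_fast(samples, window=timedelta(days=7), max_delta=10):
    """samples: list of (datetime, raw_value) pairs, sorted by time."""
    for (t0, v0) in samples:
        for (t1, v1) in samples:
            if t0 < t1 <= t0 + window and (v1 - v0) > max_delta:
                return True   # counter jumped too quickly inside the window
    return False
```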
Spin retry count. So spin retry count is, you know, you power on the drive and it tries to spin up and it can't, so it tries again until it spins up, and it tells you how many times it took to do that. Of the
failed drives it's a small percentage, but of the operational drives it's a very, very small percentage. So, relatively speaking, this should be a good one to
give us guidance that the drives might be getting ready to fail.
There are a couple issues for us with this.
Primarily the fact that we don't power cycle drives
that often, so the data here is somewhat limited.
And the experience for us to say,
hey, this drive might be something that we should pull,
isn't as valuable, because it's not the kind of thing
where we reboot the entire 250 petabytes every day
and can look at this number every single day and go,
oh, these 20 drives are good.
And these 20 drives have high spin retry counts,
and we should pull them.
So it is both one that we have somewhat less data on
and also one that is somewhat operationally more difficult
to kind of make decisions based off of.
Okay, so failure rates of drives.
So, you know, obviously one of the big questions
that people ask is, how long is my drive going to last?
And, you know, on the internet,
you'll see lots of pundits saying,
oh my, you know, my drive died,
so, you know, I'll never buy this drive vendor again.
It's like, well, a survey sample of one.
So right now, we have about 70,000 drives in operation.
We have had about 5,000 drives die over the history of time.
So operational hours is about a billion
of drive hours in production.
So about 7% of the drives have failed,
which as it is, I think,
is a very low number of drives that have failed.
But that's over time,
and that takes into account various things
like the number of drives going up and everything else.
The way we think about it is an annualized failure rate.
So for one year that a drive is running, what's the failure rate?
And it's sub-4%, which to me is astounding.
You know, you're talking about these are consumer drives under constant load 24-7.
They are storing, most of our drives at this point are 4-terabyte drives,
so they're storing 4 terabytes of data, constantly getting written, read, deleted,
and spinning at 7200 RPM with a hair between the head and the drive,
and yet 96% of them at the end of the year are
still going to be fine.
To me, we don't make drives.
I'm actually incredibly impressed by the drive manufacturers for pulling that off.
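For reference, the annualized failure rate mentioned here is just failures per drive-year of operation. A back-of-the-envelope version with the talk's round numbers (about 5,000 failures over roughly a billion drive-hours):

```python
# Back-of-the-envelope annualized failure rate: failures per drive-year.
# The inputs are the round figures quoted in the talk, not exact published data.

def annualized_failure_rate(failures, drive_hours):
    drive_years = drive_hours / (24 * 365)
    return failures / drive_years

if __name__ == "__main__":
    afr = annualized_failure_rate(failures=5_000, drive_hours=1_000_000_000)
    print(f"AFR ~ {afr:.1%}")   # roughly 4-5% with these round numbers
```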
So this is a couple years old at this point, but drive failure rates over time.
So 4% is an annualized failure rate, so on average 4%.
But that doesn't stay consistent throughout the life of a drive.
It changes from the beginning to the middle to the end.
So on the left is a theoretical curve.
It's called the bathtub curve.
And the idea is that at the very beginning you buy a drive and it might die
for a variety of reasons: manufacturing defects, you know, various things that aren't quite in alignment, etc., that cause the failure rate early on in the drive's life to be higher.
Then it gets into this middle state where if it survived through
that early part, it doesn't have any of these kind of manufacturing defects. It's just going
to have kind of a standard level of failure. And then toward the end of its life, it starts
wearing out. The parts start getting old, they start wearing out, and so the failure rate starts to climb.
In the middle, throughout that entire time frame, there are random errors that are constantly happening.
And so you put all that together and you get this bathtub curve.
So that's the theoretical.
This is the real data that we've seen in our environment, and it actually does match the bathtub curve, though the back end of the bathtub curve is a little steeper than the front end. Now, this one, like I said, is a few years old. We are actually working on reanalyzing this curve. Every quarter we publish statistics on drive failure rates on our blog.
In the Q4 analysis, we're planning on having an updated version of that
with all the new history over the last few years.
So if you're interested in seeing that, you know, subscribe to the blog
and we'll send that to you when it's ready.
But, you know, it does follow this kind of general bathtub distribution.
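One way to re-derive a curve like this yourself from the public daily CSVs is to bucket drive-days by drive age (SMART 9 is power-on hours) and compute a failure rate per bucket. A rough sketch, with the same schema assumptions as the earlier snippets:

```python
# Failure rate by drive age, reconstructed from the daily rows. df is the
# concatenated per-drive-per-day frame with failure and smart_9_raw columns.

import pandas as pd

def failure_rate_by_age(df, bucket_months=3):
    df = df.dropna(subset=["smart_9_raw"]).copy()
    age_months = df["smart_9_raw"] / (24 * 30)            # hours -> rough months
    df["age_bucket"] = (age_months // bucket_months).astype(int)
    grouped = df.groupby("age_bucket").agg(
        drive_days=("failure", "size"),
        failures=("failure", "sum"))
    # Failures per drive-day in each age bucket, scaled up to a year.
    grouped["annualized_rate"] = grouped["failures"] / grouped["drive_days"] * 365
    return grouped
```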
So one of the things that we hear a lot is you can't use consumer drives
in an enterprise environment or you can't use consumer drives
in a data center environment.
So in our case, we needed to use consumer drives at the very beginning for the simple
reason that providing a $5 a month unlimited service or half a penny a gig storage service,
it's very hard to do with the prices of enterprise drives.
So if we could make consumer drives work, we would obviously prefer to do that.
This was an analysis we did at one point.
To be realistic, we have very few enterprise drives in our environment.
So compared to the consumer drives where we feel like we actually have good statistics
on their failure rates, we just don't have that many enterprise drives in the environment.
But we had some, so we were able to do some amount of analysis on it.
So the key takeaway for us was that at least in our environment, the enterprise drives and the
consumer drives performed about the same. The enterprise drives were actually slightly worse,
but pretty close to the same.
I think one of the key things for us was that the question is, when are you going to swap the drives anyway?
And does the, you know, if you could get better than 4%, does it matter?
And, you know, if you could get down from 4% to 2% or 1%, are you going to make a different decision? And if the drives cost twice as much, the ROI isn't going to be there between a four percent and a three percent.
So it's a good question. There was a mix; I don't remember exactly what the whole mix was. I mean, I think that they tended to be the higher RPM ones
because some of them were for different use cases,
but I don't remember exactly.
So, you know, I think the key thing for us, though,
was that if the consumer drives are failing at 20%, then even if you don't have a straight ROI for the benefit of getting down from 20% to 4%, you may do it for other reasons.
Operationally, replacing 20% of the drives every year that fail can be taxing on the organization.
But down at 4%, you're going to swap out the drives anyway.
I mean, right now, we've migrated off of all of the 1 terabytes,
and we've just finished migrating off of all the 2 terabytes.
We still have tons of them that are perfectly fine in production,
but they just don't make sense anymore from a density perspective in the data center.
Yeah?
Do you guys consider upgrading firmware at any point? And are your pods like the same vendor or are you guys using different vendors in the
same pod?
Is it capacity based or is it...
The same vendor you mean the same drive vendor?
Right.
So typically the pods start out being the same vendor in a pod, but they don't necessarily stay that way.
So as drives get swapped and replaced, they will get swapped and replaced with drives from other manufacturers, potentially.
In practice, a large percentage of them are fairly homogenous.
So we'll have 50 pods of Seagate drives,
50 pods of Western Digital drives, et cetera.
But as drives get pulled, we don't
force keeping homogeneity.
And the firmware question was related to,
because if you have a firmware bug in a large population
and you're trying
to update, the consumer-level drives do not have some of the capabilities that the enterprise-level ones have while trying to do that update as the I/O is being serviced. So that's why I wanted to know.
So, I don't remember what we do with firmware.
I'd have to double-check on it.
Shoot me an email, I'll double-check on it and get back to you on it.
One of the things that we do with drives is after,
if we pull drives out that are not completely bricked,
we have a device that then runs through and checks the drive to see, could it be reused? Is it still valid?
And in almost all cases, the drives actually get a, you know, no, you can't use this drive anymore kind of thing.
Power cycling.
So we don't, like I said, we don't power cycle drives a lot,
but one of the questions that we get asked is, you know,
should I leave my drive running all the time,
or should I turn it off when I'm not using it?
This tends to be more of a consumer-level question, obviously,
than a data center question.
So SMART 12 gives us the number of power cycle counts.
And statistically speaking, the failed drives
do have more power cycles. However, 27 versus 10, you know, it's not super clear. It is more. The 27 somewhat
correlates to the drives also being in service longer because that's why they've been power
cycled more. We don't intentionally power-cycle the drives. It's just a function of, if the pod needs
to be power-cycled for some reason,
then we will power-cycle it.
And as such, the drives will go that way.
So this is one where, again, the data
is a little shy on whether power-cycling actually has much of an effect or not.
So this is one where I put this out there.
There's a little tiny bit of data on it.
Take it for what you will.
This one we have a little bit more data on.
So one of the questions is, if I keep the drives really hot, do they die or do they perform better because the lubricants move more smoothly through them?
Should I keep them cool or is that going to seize them up?
So this is a chart of the operational temperatures.
This chart has nothing to do with failures. This is just the drives
operating in the environment telling us not ambient temperature, but the drives themselves
internally, what is the temperature as they're operating. So most of the drives are somewhere
from here to here. So some drives get quite hot, some drives stay quite cold, but for the most part,
they're kind of in this, you know, middle range. And, you know, this represents about 70,000 drives.
So there are various reasons for why the drives are at different temperatures. Obviously, they're inside of a data center, and the data center is kept, you know, cool from the cold aisle, etc. But drives that are toward the top of a rack are more likely to have the ambient temperature be warmer there. Drives
that are in the middle of a pod are more likely to have the ambient temperature be a little warmer,
etc. So there are some ambient temperature reasons for
the drives to be warmer or colder. In addition to that, there are other reasons for drives to be
warmer or colder. Some drives just run warmer. So you take two drives, you plug them in, you do
nothing with them, you put them in the exact same environment, and some manufacturers and some drive models run warmer than others.
Load is also a factor.
So if the drives are in their initial load phase, where
we're putting all of the data on them as fast as possible,
they tend to run warmer than when
they're a little more static.
But this is the overall flow.
So, failed drives versus operational.
So I thought that this was an entertaining chart.
And my first response to it when I saw it
was to call up our chief cloud officer and say,
should I buy you a bunch of space heaters?
So, you know, clearly we have far fewer drives that have died,
that are hotter, which is an interesting takeaway.
You know, we're in a colo facility, so unfortunately we can't ask them, hey, turn off the AC; I would love to do that. But we do have some control over the ambient temperature. Every one of our Backblaze storage pods has fans in it. When we started the company, we had six fans: three in the front, three in the middle.
And at some point, one of the things that we saw was that we had a row of, a set of fans that all died in fairly short order
because they had a manufacturing issue with the lubricant.
And so we found that a whole bunch of the storage pods were down to one fan.
And when we looked at the drive temperatures and drive failure rates at that point,
this was a number of years ago,
the temperatures hadn't gone way up,
and they weren't outside of any of the drive specs.
And the drive failure rates didn't seem to be different.
And so we said, why are we spending extra money on buying fans and
extra power on powering fans? Certainly, maybe
we don't go down to one, but going down to three seems
sufficient. So all of the storage pods that we've been deploying
now for a number of years
have had three fans, not six. But this would say that we should stop putting fans into the pods at
all, and maybe we should put little heaters into the fans, into the storage pods, right?
So it's an interesting analysis. There are some things, obviously, to be careful of.
This is correlation, not causation.
So the question is, are there other things that would correlate failure with cooler temperatures, other than the temperature itself? And so there are other analyses that we want to do. For example, one of the things I mentioned was that drives in the middle of a pod might get warmer than drives on the edges. Is it possible the drives in the middle of a pod are, you know, better insulated from vibration or other things than drives on the edges, and so the two would be correlated to each other as opposed to causal? Right, so there are things that might be causing this.
Yeah? Let's see, it's slightly different: since you have that curve where 70%, most of the drives in this picture, most of the drives are on the left side, does this take into account the fact that most of the drives were on that side? So what's the relation with temperature, more of that type, more of this?
So this is a percentage of failure versus operational. So it could have been that, even though the chart of operational temperatures was over here, it could have been 100% like this and very low that way, right? Because it's not number of failures, it's percentage of failures.
And this is a chart based on the last 90 days of temperature
for that drive, for all the drives.
So, there was an initial analysis that
was done that looked even a little wackier than this one, but it was based
on the temperature of the drive on the last day that it reported. And we looked
at that and said that might not be a good way to look at it because who knows
what happened the day the drive actually died. You know, like it might have caught
fire and then died, right? And so it's like, yay, hot temperature is good, or something. I mean, so this is looking back 90 days.
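A sketch of that trailing-90-day temperature analysis against the public CSVs, assuming SMART 194 holds the drive-reported temperature and the same column naming as before:

```python
# Failure rate by average drive temperature over each drive's last 90 days of
# reports. df is the concatenated daily frame with date, serial_number,
# failure, and smart_194_raw columns (schema assumed from the public dataset).

import pandas as pd

def failure_vs_temperature(df):
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    # Keep only each drive's trailing 90 days of reports.
    last_seen = df.groupby("serial_number")["date"].transform("max")
    recent = df[df["date"] > last_seen - pd.Timedelta(days=90)]

    per_drive = recent.groupby("serial_number").agg(
        avg_temp=("smart_194_raw", "mean"),
        failed=("failure", "max"))
    per_drive["temp_bucket"] = per_drive["avg_temp"].round()
    # Fraction of drives in each one-degree bucket that failed.
    return per_drive.groupby("temp_bucket")["failed"].mean()
```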
One of the slices on this data is by manufacturer.
So one of the things to note is, you know,
things vary, you know, by manufacturer, right?
So it's not 100% the same curve for all drive makes models,
but this still is consistent.
So warmer temperature is definitely correlated with lower percentage of failures.
So again, before you deploy space heaters in your environment (or feel free to do that, and let me know how it goes, I'd love to know), we're also going to need to run various other theories on what could possibly be causal other than temperature.
But short of finding some of those things to be true,
I actually think that we probably will start looking at,
at the very least, tearing fans out of pods
and seeing if that improves reliability over time.
The metrics, again, on all of this,
I mean, even with most of the drives being in this middle realm and most of the drives having
higher failures in this middle realm than they do up here, the overall failure rate
is still sub 4% per year.
So the drives are still quite reliable at any temperature.
I mean, obviously, if you cook them to 200 degrees Celsius, they're probably going to
be less reliable.
But on the whole, they're quite reliable drives.
So the last thing on here was just to correlate temperature
versus the five smart stats that we use currently for checking failure.
So we've said that there are different reasons for drive failures. The five SMART stats that we use, for the most part, don't correlate much with each other, so the failure of
any given one of those things isn't necessarily an indicator that
everything is going wrong inside of it. Temperature seems to be the same, which is to say that high or low temperatures have little correlation with any of those five SMART stats.
So if low temperatures are causing failures,
they don't seem to be causing failures in the ways
that these smart stats are accounting for it.
And again, all of the temperature
is based on the internal drive temperature.
This is not ambient environmental temperature. This is what the drive tells us its temperature
is at. So that was a little bit of an overview of the environment the drives are living in
and how they're getting operated. What we think of as both obviously a formal black
and white drive dying, as well as the more gray areas around the
smart stats and then the variety of findings around reliability over time and some of the
slices on that data.
If you go to backblaze.com slash hard dash drive, we publish lots of analysis
on the drives.
We also publish all of this SMART stat data for all of our drives publicly. So if there's any analysis where you're like, ooh, I wish they'd run XYZ analysis on the data, certainly you can email me and go, hey, can you guys run this? And if it's something where we go, yeah, actually that would be a good one to run, we'll try to run it.
And if we do, we'll publish a blog post with the data.
If you want, you can run all of your own analyses on all of the data.
So it is all downloadable.
It is a fairly significant amount of data.
I think there are 6 billion data points in there.
So you're probably not going to load it for the most part in Excel.
You know, maybe some subset or slice of it, but the data is available there.
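For pulling a manageable slice out of the published files without loading everything, something like this works; the glob pattern, column names, and the example model prefix are placeholders you would adjust to the files you download:

```python
# Stream the daily CSVs and keep only one model family and a few columns,
# so the result stays small enough to work with interactively.

import glob

import pandas as pd

def load_slice(csv_glob="data_*/*.csv", model_prefix="ST4000",
               columns=("date", "serial_number", "model", "failure")):
    parts = []
    for path in sorted(glob.glob(csv_glob)):
        day = pd.read_csv(path, usecols=list(columns))
        parts.append(day[day["model"].str.startswith(model_prefix)])
    return pd.concat(parts, ignore_index=True)
```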
And also you can sign up there for getting updates on these various different things.
That's it.
Thank you.
I'll try to answer questions.
What do you do for the interconnection between the vault? Like, when you have the vault of drives, what's the interconnection between each pod?
Ethernet?
Ethernet.
Yeah, so 10 gig.
So it used to be one gig.
So the question was the interconnection between all the pods on the vault.
So the pods used to be, when they were arrayed, one-gigabit connections, because all the communications were internal to the pod, and external to the pod, all that would happen is we would send up to one gigabit of traffic to that pod, and if that one gigabit was full,
we'd send it to a different pod.
In the case of the vaults, because they have to rebuild the drives throughout the different pods in the vault,
we upgraded all the machines to 10 gig between them.
But it's just Ethernet.
This is a very interesting thing about the pod.
You guys have made your own storage pod.
Can you talk more about that?
Yeah, you can actually.
So we've actually open sourced the design.
So if you Google Backblaze Storage Pod, you know, version 6.0,
we've open sourced every one of the designs, 1, 2, 3, 4, 5, and 6.
It gives the specs for all the components inside of it.
It also gives the build books.
We actually just with version 6.0, we took the build book and gave it to a different contract manufacturer and said,
don't talk to us, read this build book and make sure you can build the pod to production based on the build book
so that we can do a full round trip to make sure the
build book actually works.
Wow.
So the external internet connection, is that a
file interface or a block interface?
Is it a file or block
interface between them?
So it's a file interface between them because we chop files into basically small versions
of the files.
So on every pod, the drives themselves do run a file system.
So they each run ext4 on the drive itself. The only thing that that file system does
is read a file, write a file.
All of the intelligence is above that layer.
But the drives do have file systems on them.
Did you?
Yeah.
So I wonder about range.
So I mean, drives in the middle, right?
So they have an average middle temperature, but we don't see the extremes, like the lowest and the highest, because it's always in cycles of heating and cooling, right? Mm-hmm. So the ones somewhere on the inside, they generate more heat, right?
But their lowest temperature, of course, is higher, right?
So the range is shorter.
So maybe it's the range of highest and lowest
temperature through normal cycling.
So if I understand your question,
the chart shows the average temperature for the drive, but...
Not even average, but an immediate measurement, right?
Because we just pick up some average temperature at some moment, right?
But not the coldest and hottest temperature.
So you don't know.
Maybe sometimes it goes really cool, but it warms up after that, right?
So it goes down cool.
So it's maybe like a bigger range.
So, you know, because your iron, your metal, is always expanding and shrinking because of temperature.
So what happens?
Yeah, so on the temperature side,
you're absolutely right. We are only capturing the data from these SMART stats, from the drives, on some basis. We're not doing it constantly, every millisecond. And so certainly it is an average, in the sense that it was based off of 90 days' worth of data.
So it's a sample on day one, a sample on day two, a sample on day three, a sample on day four.
It's certainly possible, to your point, that during that day, the drive heated up, cooled down, heated up, cooled down, et cetera, right?
So it's certainly possible. It would be one of the theories we would probably need to cross-check against
before deciding that this was definitely causal and not correlated.
There are some reasons to believe that there's not that much variation throughout a given day.
Things around ambient temperature, like the location of the drive and the air flow through it, etc.
doesn't change throughout the day.
The drive was put into this location in this rack at this point.
It's pretty consistent and the load that the drives are under tends to stay fairly consistent. Now it varies from location
to location. So when we deploy a bunch of new pods, those pods are empty. They have no data
on them whatsoever. We aim all of the data to fill those up. So those 20 new pods are certainly
under load that's higher than the load of
the rest of the farm. But once they're full, their load is kind of consistent from day
to day to day. And so I would say that it's most likely they're not wildly varying in
temperatures during the day, but it's certainly a sampling of data points of the drives.
All of the drives are kept in the manufacturer's range.
So we have had a couple drives report things like,
I'm 250 degrees Celsius or I'm zero.
It's like, I'm pretty sure no one is shoving ice cubes around you right now, and
I'm pretty sure that the drive hasn't melted. So some of them, when they fail, the temperature
sensors give bad data, but on the whole, across 70,000 drives, they're all within the range.
It's more of a question of where within the manufacturer's acceptable range do they seem
to perform better or worse?
Any other questions?
Yeah.
The temperature is fascinating.
I was wondering if you happen to know, across different manufacturers, if they're actually measuring the same thing, where they're seeing the metric when they're recording temperature, or if there's consistency between the different manufacturers?
So I'm pretty sure that we're getting the same temperature
because if you put the two drives next to each other.
Now, like I said, the drives, when you power two drives up
from different models or manufacturers,
the drive temperatures they'll report is somewhat different.
Now, there are two possibilities for that.
One is the drive is hotter.
One is the temperature sensor is set up differently, right?
So I don't know the answer between those two.
We do know that certain drives seem to run hotter.
It's only by a couple degrees.
So it's not the kind of thing like that chart, where obviously going from 17 degrees Celsius to 46 is a humongous span of temperature. The difference between what we think of as a drive that runs hot and a drive that runs cool is a few degrees.
But they are somewhat different.
Uh-huh?
Do you see any surprises
with other components?
For example, if one pod is running at 95 degrees,
suddenly power supply is dropping off.
Some other hardware component?
Any other correlations?
It's a really interesting question.
I don't think we've done that correlation.
One of the nice things about hard drives is they have a bunch of sensors. They
spit out data through their interface, and we track all that inside of a large system.
Things that are pod-related and with different components, we don't necessarily have all that
kind of data as directly. So it's an interesting question, but I don't think I could say with enough data.
The other thing about it is from just a volume of data perspective,
we have 70,000 hard drives, but we only have 1,500 power supplies.
And so whereas with the temperatures, when you get out to either extreme of the temperatures... like, there was one thing that I was looking at on here. This one. So if you look over here, it starts looking like, that's interesting. So heat is really good, but then when you get up around 42 or 43 degrees, heat starts getting bad. The problem is, you know, the number of drives in that 43-degree temperature range: if you look at the chart above here, you're way down here in terms of the number of drives in the environment. I mean, you're talking about, you know, a hundred drives running at that temperature. So yeah, that's not data.
So same with the number of power supply components.
The hardware is suddenly...
A whole bank.
Yeah, exactly. A whole bank.
Any other questions I can answer?
Yeah.
Where are you going next, as far as what other metrics or data would you like to see people capture, and what other information do you want to get?
You know, so one of the things we do periodically is, of all these different other SMART stats, we try and recheck whether any of them seem like they're correlated to failure.
Because when we only had a thousand drives and, let's say, 20 failed drives, some of the other SMART stats
looked like they were random correlations to failure.
With 70,000 drives and 5,000 drive failures,
we can smooth out some of the randomness
and figure out whether they may be actually correlated or not.
So periodically, what I want to do is go back to some of those. There are also
new SMART stats periodically. So, for example, helium drives have a helium sensor. So they check whether the amount of helium in the drive, you know, is leaking out. And so we now track that in the SMART stats.
And if you download the SMART stat data from our website,
you'll see at some point that was a new metric.
Not a lot of data there yet, because you
had to buy the drive, have it in production,
and have enough of them fail to be
able to do any backwards calculation.
But at some point, with some of these new SMART stats, we'll be interested in looking at those failures.
What we're also trying to do is look at what other things can we look at unrelated to smart
stats for failure. So by getting rid of RAID and writing all of our own underlying erasure
coding algorithms from scratch, we now have direct access to the drives
and are able to see things like this drive was
able to write this file, this drive was not
able to write this file, this drive
seems to be waiting five seconds or eight seconds or one second
when we ask it to do something.
And so we're able to capture a lot more data than just
what the drive tells us now.
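That kind of "watch how the drive behaves, not just what it reports" monitoring can be as simple as timing each operation per drive and flagging outliers. A toy sketch, with invented thresholds rather than anything Backblaze has described:

```python
# Keep a rolling latency window per drive and flag drives that start taking
# seconds to respond. Structure and thresholds are invented for illustration.

import time
from collections import defaultdict, deque

class DriveLatencyWatch:
    def __init__(self, window=1000, slow_seconds=5.0):
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.slow_seconds = slow_seconds

    def timed(self, drive_id, operation, *args, **kwargs):
        start = time.monotonic()
        result = operation(*args, **kwargs)       # e.g. a shard read or write
        self.samples[drive_id].append(time.monotonic() - start)
        return result

    def suspicious(self, drive_id):
        window = self.samples[drive_id]
        return bool(window) and max(window) > self.slow_seconds
```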
And so one of the things we'll be looking at is, based on our own interactions with the drive, do we see other correlations. And then, you know, ten years from now, I'm hoping that
all this will be completely irrelevant and useless because we will be on SSDs
but ten years ago we looked at it and said, huh.
In 2007, we were like, we're probably two years away
from SSDs making sense in this environment.
So we'll see.
So what kind of method is behind the correlation relationship?
Was it by assuming some clustering, some organizing scheme, something?
I'm sorry. I couldn't hear you.
So what are the kinds of correlations and other kinds
of math behind it?
So we have two people in the company
who deal with re-analyzing the data in various ways.
I don't know all the different types of methods
that they've tried. I know
periodically they have looked at things around, did we have, for example, so we had,
there were a set of Seagate three terabyte drives that had high failure rates. And at some point,
we actually needed to yank out drives that weren't dead, because the rate of failure on those was so high it was making it operationally challenging.
So how do you calculate all this stuff, taking into account that we have this set of drives that are not failed but look like they're going to fail?
And so there were various compensations for those kinds of drives
and those behaviors.
But I, you know,
I don't know all of the math
behind everything
that they've done inside of it.
Any other last questions?
I think we have time
for one more.
All right.
Thank you guys so much.
Appreciate it.
Enjoy the rest of the conference.
Thanks for listening.
If you have questions about the material presented in this podcast, be sure and join...