Podcast #127 – Backblaze Drive Report – StorageReview.com
Episode Date: May 24, 2024. Brian welcomes Andy Klein to the Podcast this week. Andy is the Principal Storage…
Transcript
So after potentially a minor hiccup to get started, let's start over. Welcome to the
podcast everybody and this is what happens when you try to do it live from a lab
after six o'clock eastern standard time. Anyway, I've got Andy Klein from Backblaze with me today.
He's the cloud storage storyteller over at Backblaze and we're going to get a handle on that and how good his stories are here in a second.
But Andy is one of the guys behind the quarterly storage report that Backblaze puts together.
I know home labbers love to see the data from all those glorious hard drives and a
couple SSDs that they have in the ecosystem there.
But in terms of the data that
gets put out there into the industry about annual failure rates, Backblaze is one of the best
in being open and communicative about that. Now, they don't always necessarily use data center
product, and we'll talk about all that, I'm sure. But Andy, that's a pretty big run-up and
some technical difficulties, but thanks for joining
us today. Well, thank you for having me. I appreciate being here. Looking forward to it.
Well, yeah, you say that now, but wait till our Discord gets through with you. So we've
invited a bunch of people from our Discord to join in live on YouTube and share their
comments. And I've promised that if one of them makes me proud, I will send them a box of goodies,
which could include stuff on this table.
We don't know yet.
But the odds of them impressing me are kind of low,
but we'll see.
I'm going to keep an open mind.
So your study, your Q1 study,
just came out in the last couple of days.
You are indexing like 300,000 hard drives at this point.
Give a high level of what that looks like
and why that's important to Backblaze.
Well, it's important to us for two things.
One, I think it's part of our DNA to be transparent
about what we can and what we do.
And when we started DriveStats back in 2015 or so, that was kind of the reason.
We had done some other things before that, talked about our storage servers, our storage pods,
and how we built them and so on and so forth and open sourced the design of all of that.
We talked about our experiences doing something called drive farming, which was
what happened when they had the floods in Thailand and it shut down production of about half the
world's hard drives. We've talked about going through the possible acquisition, we almost got acquired
back in 2011. We walked through that with folks. So Drive Stats was initially just another way of communicating
about the things we do internally and saying, hey, this is how it works in here.
So that's always the number one objective. The other one is it's information that's just not out there much.
And, you know, it helps.
Hopefully it helps people gain insights into us and gain insights into the drives that they're using. Well, in terms of organizations that provide failure rate information, there's none that I'm aware of that's bigger than you guys, or even close.
Is anyone else being this open with this kind of data, other than what they report back to WDC, Seagate, Toshiba, etc.? You know, we see from time to time something surfaces. Like in part of this last one, we talked a little bit about
another outfit that does some analysis on the drives that they collect and
had gotten back and failed, and so it's always good to see that kind of information make it out there.
But, you know, the other side of why we do this, too, is from an operational point of view, it's great information
internally to understand.
We really work hard to understand our environment inside because we have to optimize the way
our environment works.
I don't want to go into products and all of that kind of stuff, but we're on the lower end of the price spectrum, if you will, for what we charge for our services.
And so there's only so many ways you can do that.
And one of them is to be incredibly efficient about what you do.
And so that's what we try to do from an operational point of view.
And this data is part of all of that.
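As an aside for anyone following along with the report's numbers: Backblaze's published Drive Stats posts describe the annualized failure rate as failures per drive-day, scaled up to a year. Here is a minimal sketch of that arithmetic in Python; the drive counts and failure totals below are made-up placeholders, not figures from the Q1 2024 report.

```python
# Rough sketch of the annualized failure rate (AFR) math Backblaze describes
# in its Drive Stats posts: failures per drive-day, scaled up to a year.
# The sample numbers below are illustrative, not from any actual report.

def annualized_failure_rate(drive_days: int, failures: int) -> float:
    """Return AFR as a percentage: (failures / drive_days) * 365 * 100."""
    if drive_days == 0:
        return 0.0
    return failures / drive_days * 365 * 100

# Example: a hypothetical model with 20,000 drives in service for a 91-day quarter.
drive_days = 20_000 * 91
failures = 25
print(f"AFR: {annualized_failure_rate(drive_days, failures):.2f}%")  # ~0.50%
```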
Well, talk a little bit about that operational efficiency.
I know you've retired some drives and we'll talk about that a little bit when they get to too small of a sample size, it looks like.
But you still have a ton of, it looks like 16,000 or so, four terabyte drives and another, I don't know, 20 or a little, yeah, about 20,000, eight terabyte drives. How long can a small hard drive like that
be efficient in your system when you know that we're sitting out there with 24s and now with
Hammer, you know, 30 plus out there, how do those little ones stay relevant to you or why do they
stay relevant to your operation? Well, they stay relevant because, one, they continue
to function. Okay.
So being alive is a good thing, but they continue to function
at a rate of failure
that's, I'll just say, below a threshold.
And I don't want to put a number out there because every drive has a different one and
the circumstances are different and what chassis they're in and so on and so forth. A lot of the four terabyte drives are in the 45 drive chassis that we had way back when.
And so now that we've gone past that and we're using 60 drives in a chassis, the amount of
storage density you can get by upgrading to 16s or 20s or whatever is enormous.
But you also need that price point.
You also need those drives to be at the right price point in their curve.
They all come down a little bit.
They start out high.
And over time, you start out something at three, four, five cents a terabyte.
Sure.
Gigabyte.
And gigabyte.
And then they come down and you're paying a penny and a half, right?
And so someplace, you don't want to be at the top of the curve.
You don't want to wait for the bottom.
So when you start to do all of that math,
having those drives function and fail at relatively low rates
and then picking the time to move them is the way we look at it.
So if you can get a drive to last six or seven years, based on the life cycle of drives today.
Now, if the manufacturers started doubling those rates, and so two years from now we have 50 terabyte drives, the math all changes.
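To make the migration math above a little more concrete, here is a rough back-of-the-envelope sketch. The chassis sizes come from the conversation (45-drive pods of 4 TB drives versus 60-drive chassis of 16 TB drives); the cents-per-gigabyte price points are purely illustrative assumptions, not Backblaze's actual procurement numbers.

```python
# Back-of-the-envelope sketch of the "when do we migrate" math discussed above.
# Chassis sizes mirror the conversation; the prices are illustrative assumptions.

def chassis_capacity_tb(drives_per_chassis: int, drive_size_tb: int) -> int:
    return drives_per_chassis * drive_size_tb

def drive_cost_dollars(drive_size_tb: int, cents_per_gb: float) -> float:
    """Cost of one drive at a given price per gigabyte."""
    return drive_size_tb * 1000 * cents_per_gb / 100

old = chassis_capacity_tb(45, 4)    # 180 TB in an old 45-drive pod
new = chassis_capacity_tb(60, 16)   # 960 TB in a 60-drive chassis
print(f"Density uplift per chassis: {new / old:.1f}x")               # ~5.3x

# A 16 TB drive early in its price curve vs. later on:
print(f"16 TB at 3.0 cents/GB: ${drive_cost_dollars(16, 3.0):,.0f}")  # $480
print(f"16 TB at 1.5 cents/GB: ${drive_cost_dollars(16, 1.5):,.0f}")  # $240
```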
But that hasn't been the case. It took the industry a
long time to recover from the whole, you know, Thailand drive crisis mess, and really
get back to innovation. There was a whole bunch of consolidation that happened, and, you know, I
think finally they're starting to get their feet underneath them. But they're also running into
just physical limitations of what you can cram onto a spinning platter. Well, yeah, for sure. We did a podcast with Seagate not long
ago about HAMR technology. So if you guys are curious in terms of what's coming to enable
higher density hard drives, that's a good one to check out. And for anyone that's playing along
with this or even checking this pod out later, we've got linked in the description the Q1 2024 failure report that I'm referencing,
that we're talking about this evening or afternoon.
So you can check that out
and see what we're talking about in terms of drive coverage.
So I asked you about the little ones,
but you've got a bunch of 22s in there now from WD,
and it looks like you guys sort of skated over the 18 and 20s.
So is that all part of that mathematics that you were talking about,
or your cost per terabyte?
Yeah, we got, and we continue to get, to be honest,
really good prices on 16 terabytes,
and the cost numbers kind of broke really well for the 22s.
It doesn't mean that we won't have 18s or 20s in.
It's just that right now, the math doesn't work.
Right.
And we saw the same thing like when they went from four to six, and there was fives in there for about 15 minutes.
And so we got some fives and we put them in. And then what ends up happening is, if you don't do it in bulk, you have a problem
replacing them and so on and so forth.
So kind of lessons learned a little bit.
Well, a lot of those fives, I don't know what era of five you're talking about, but I know
a lot of the fives, those guys tried to sneak out there as SMR in branded products and others.
And I use sneak sort of to make it sound
more nefarious than it probably was. But I mean, that's got to matter to you guys when you're
profiling these drives, if you end up with something that, you know, we'll talk about procurement in a
little bit, that maybe you don't expect, or delivers a performance profile that's unexpected.
Yeah. We test them all. First of all, we don't use SMR drives,
plain and simple. It doesn't work for the architecture and you pay a pretty good penalty
to have to delete and rewrite. And that's the nature of the business here. You know,
we don't discourage people from trying to delete files.
You can put a file out and delete it the next day and you pay for one day, right?
So if that's going to be your model and that's the way you're going to work,
then you can't use a technology underneath which penalizes you for that. So we don't use SMRs.
That said, any drive that comes in the door, we have a testing protocol that we go through.
And we run it in what's called a mini vault. Normally a vault is 20
servers hooked together, and a mini vault is six. And all we're really doing there is putting the
drives through their paces and seeing how well they scale up, what happens when we fill them up, how do they, how's their performance, all of that.
And a little bit about, you're only going to do this for like three months or so, so you're not going to get a lot of performance data.
But sometimes you can see a drive that's, for whatever reason, just doesn't fit in our environment. And every once in a while we catch one like that
where we put it in and I don't know what it is.
I don't know if the stars aligned right,
but the thing starts failing almost as soon as we put it in our systems.
And so now you know, skip that one, don't use that model.
But we do test them before we put them in.
We had some Toshiba 20s in,
for example, and there's even a couple still left just hanging around in there. And they
were pretty good, but they weren't, I'll just say they weren't worth the bump,
so to speak, off the 16. So, you know, there's a bunch of really smart guys
doing all of that kind of stuff. And I just get to talk to them about it and sound like I
know what I'm talking about. So about this, we've got a question that's come in about
rolling up the data. So right now you show the data on the actual model, the SKU for lack of a
better term, of the drive itself, so that, you know, we're buying this
part from WD, 22 terabytes, we can track that. Do you ever consider rolling together families, or
do you see affinity there? If I look at, like, the new, whatever it is, 22 or 24 terabyte
platform, like the Exos X24 there, when they do that family, they'll do new versions of higher capacity drives
in that same family.
So you might have, you know, eight or 10 drives
and SED, non, whatever.
So maybe you've got 18 or 24 variations of that family.
Why not aggregate the family
and then maybe even aggregate the brands
as part of that communication?
Well, we do aggregate brands.
It's fun to do that.
The reality is that every drive model seems to have its own personality.
It just does.
And sometimes you can get a family, to your point, a grouping that kind of looks the same.
We saw that way back when we had the HGST four terabyte drives.
It didn't matter what model we bought or whatever.
They all just worked.
Okay.
But the other thing that changes is model numbers and families.
They're not consistent between the manufacturers on how they get advanced.
Sure.
Seagate, you know, for a given Exos type of drive,
they'll have, you know, five or six different models.
They'll have an 001, an 002,
and then a 1G and a 2G
and so on and so forth.
And those are, they don't behave the same.
They don't in our environment, you know,
and knowing what they changed is like secret sauce for them.
So, you know, so sometimes,
and sometimes they even change things within a model,
like they'll update or they'll rev the firmware
and things of that nature.
Whereas some of the other manufacturers never change a model number, but there's different
iterations inside there.
So even within a given model number, just managing that and calling that a group is
okay.
But as soon as you step outside of it, the ability to
transfer what you learned from the first to the second and the second to the third and
so on, not that easy.
I saw that a couple of years ago.
I was talking with some guys from MIT who had taken all of that data and done a bunch
of analysis, and their hope was to
be able to do exactly that with some AI-like software to do transference from one model
of what they learned from one model to another and so on.
Their answer was, they just couldn't get it.
They couldn't get comfortable with how that worked. And I've seen that in some of the other papers that have
been written using the data as well, where you can do really well predicting a given model,
but trying to make that work someplace else, not so much.
Well, I mean, that'll be an absolute nightmare too with Flash as you try to think about rolling
those up. And we'll talk about that a little bit.
You guys don't have much Flash conversation on this report,
but we've seen from every vendor,
they've got drives that they sell to hyperscalers,
which is different than the drive to enterprise.
And it might just be a little tweak in the product numbering
that's talking about a different firmware
or NAND configuration
or provisioning or whatnot. So in that case, it gets even more fragmented. So you really focus
on a single SKU, which helps you guys stay targeted and focused on the individual drives.
So that makes sense. So with all of this data that you put out there, you must have a favorite.
Our guy Orion's wanting to know: what is your preferred drive and manufacturer if you're having to buy for your environment today?
I'm going to say something. I generally buy the least expensive one. Okay. And the reason is because I'm a backup guy, and so I'm never
one drive deep in anything, right? Sure, and that's the right answer. If you're buying something
and you're one drive deep, you're rolling the dice, because nobody's 100%. No, we give that same
advice every time. You know, some home labber says, I need a drive for a system, or four drives for a QNAP or Synology or whatever.
And it's like, at that volume, you could have four great drives where you get a billion hours on, or they could all be dead in the afternoon.
I mean, you just don't know, right?
You just don't know, right? You just don't know in those quantities, yeah. The only thing I can tell you is, over the
last few years it looks like the DOA rate, you know, the dead-on-arrival rate, has gone down for
drives, at least for us. Yeah. And, you know, that seems to hold. We plot it out,
we've done this a couple of different times. We looked at the life expectancy with the bathtub curve, when drives fail, and they used to fail in that wonderful U pattern, right,
with early failures, settling, late failures. And the front end of that curve has come down a bunch.
It's not quite toward the middle, but the front end, the early failures or infant mortality, if you will, has come down for all of the manufacturers.
Every once in a while there's a model that seems to not do that,
but in aggregate across all of them.
Yeah.
Is the wall of failure doing anything in terms of lowering or coming in or out then also?
So the rates seem to, even the rates have gotten better now,
you know, over the years.
And the length of time it seems to take to get
to where the curve bends back up
seems to have pushed out a little bit.
But I will also say there's some,
there's some of us in that data.
Because when we first started out, we bought, everything we bought was consumer drives because we could go to Costco if we had to. And we did.
Wait, wait. I want to think about that shopping trip. I could see you and the guys rolling
through Costco in your flip-flops and your jam shorts or whatever
on a Saturday afternoon with a shopping cart. Yep. You're just filling it up with whatever loose
drives they had, and maybe a little bourbon from the back. A little bourbon in the back. Well,
we normally didn't, even back then we normally didn't do it. Although, I mentioned drive farming
much earlier, when all of the
normal channels dried up, you couldn't buy a drive through normal channels, and
this was back in 2011, late 2011, early 2012, right? That was the only way
you could get drives. And when you say farming, do you mean shucking, like buying
externals and tossing the enclosures? Buying externals and shucking them, yes. Okay. And so we would do exactly that in our flip-flops on a Saturday afternoon,
go out and hit all the Costcos and Best Buys and whatever,
and collect them up and bring them back and then turn them over to our guys,
and they would put them together, and we'd have a storage unit a few weeks later.
You must have a sick amount of Best Buy rewards points to cash in for Beats headphones or whatever it is you need, right? I don't think
they gave them to us at that point. Actually, a couple of our guys got banned
from, like, shopping at Costco and stuff. It was all these weird little things that happened because,
you know, we were taking out of a consumer channel at that point. I might also have a Best Buy ban. I can't shop there online anymore, so they got
salty with us for doing something not too dissimilar from what you were up to at one point
in time. Hey, but look, I mean, you've got to do what you've got to do. And really, in any of these scenarios, they were directing the product to their big customers, even back then, which would have still been the likes of Facebook and early days of AWS, or whomever else in the cloud might be dominating with their checkbook. We did. But back to the
original point, though: over time we transitioned from consumer drives to enterprise
drives. And consumer drives, interestingly, you know, have one, two, three year warranties, kind of depends
on when you got them. Enterprise drives typically have five-year warranties. And all a warranty is, is just a projection
by the manufacturer based on failure; it has nothing to do with anything else, right? And so,
but when you buy an enterprise drive, you expect it to last five years, versus something
that you expect to last two years. Now, that's not always the case. I have some great eight terabyte consumer drives that are failing a whole lot less than the eight terabyte enterprise drives.
But let's just say that's the case.
And so that, you're back.
Yeah, your RMA returns really should be pretty creative.
I'd like to hear about that. You've got to have some good stories about the early days of trying to get service on these things, or replacements. Yeah. So obviously, once we,
at that point when we were cracking them out of the case, the warranty was void, right?
You lose that on the shucking. Yeah. Yeah. But yeah, I'll use the word shucking.
But for the most part, you know, it was fairly straightforward. And it was more appropriate
for us to RMA things if they broke early. Because again,
operationally efficient, I got somebody there, and they can
work and they can, you know, put stuff in a box and send it back.
The thing we learned very early on is, is you don't get a new
one back. Right? You know, so you just get something that
they rebuilt or refurbished or whatever the case may be.
You still get the remaining warranty life though, right? So
if you had two and a half years used, there's still two and a half left, is
that how that works?
You do, or you're supposed to. The reality is, you don't know what condition that drive is in.
You know, sometimes you'll get them and you'll still have the hours on them.
You know, sometimes you'll have, you know, hey, look, it has 12,000 hours on it.
Oh, my God.
Hey, look, a classic car has 12,000 miles on it.
It's just, you know, burn-in.
You should be thankful.
You should be paying them for that.
I should be paying them.
The bits are well worn.
Yeah.
They know what to do.
But, you know, so we had, but we've gotten past that.
Now when we get returns like that, sometimes it's credits.
Sometimes it just depends on the manufacturer.
And we get returns and they end up going into what I'll call a replacement pool.
You know, so they're not used new for new things.
They just go in and they're substituting in for something else that's failed.
All right. Well, that's that makes sense.
I still want to, man, I'd like to go on one of those shopping trips with you guys next time you hit Costco.
I want to see what that looks like. I know I brought up Flash a little bit. We've got a question from Blaze who wants to dive into that a little bit
more. I think you're using Flash, probably some little M.2s for boot. What else, if anything,
are you guys using Flash for in your ecosystem? And if so, or depending on the answer there,
why do we not see that show up in this report?
Yeah, that's a good question.
So we obviously use, you know, SSDs for boot drives.
And we've talked about that.
Boot drives are a little more than just booting the system
in our case.
They record daily data and all kinds of things like that.
So they're actually functioning, reasonable drives.
And they come in different form factors.
Some are M.2 and some are just the standard ones, the standard two and a half inch for
us.
So it doesn't really matter.
A lot of them are just the cheapest thing we could get because quite frankly, it's not
doing something that's going to kill us.
If it fails, we just, you know, we'll swap it out and keep moving.
So to pay a lot for something like that doesn't make much sense.
We also use them in several other different types of servers.
So as you can imagine, we have data servers, which this is where all of this data comes from, right?
And the smart stats and all of that kind of good stuff. And then we have a whole group of other servers which help, you know,
API servers, restore servers, and so on and so forth.
And various different ones of those use SSDs or Flash inside of those.
The reason you don't see the data for those is because we don't have them
instrumented for that to report it out.
The amount of data that they spit out and how often we monitor those things,
we're, of course, monitoring and we have everything instrumented and everything like that.
But to actually store it, even though we're a data storage company, it's just too much. We could snap
it once a day or something like that. The other thing is the operational considerations.
Those things are working. And so that whole little process of just writing something to
a quote-unquote disk and storing it off and putting it somewhere, it gets in the
way. I've asked our guys to do it, and they keep telling me they'll get to
it. No, because they're at Costco all day. How are they supposed to get anything done? That's right,
they're buying rotisserie chickens, bourbon, and hard drives. There you go.
So we don't have them instrumented for that purpose, for the drive stats, if you will, for doing that.
We do have them instrumented for monitoring, which is a daily activity, not a recording of monthly or annual data or anything like that.
I will tell you.
I was going to say, do you have any anecdotal experience
then with the failures on those drives?
Yeah, yeah.
So interestingly enough,
we burned them out before they fail.
You killed the endurance?
Yeah, yeah.
That's what catches us first
because a lot of the things I just spoke about
are write something,
use it, delete it, write something, use it, delete it, in just a continuous thing. So if you think
of like a restore server, where we're going to build something to hand back to somebody, right?
That data is on there only as long as it needs to be.
You put everything there, you
package it up, you deliver it, and you get rid of it, because the source is still in the data
files. You know, if you're responding to API calls, information comes in, you record it, blah blah blah,
you log everything, and then you flush it back out and you go on to the next one. And so you're doing how many reads and writes a day, you know, to that
thing. So that's what usually catches them first: they just
get to the end of their endurance life cycles.
We're not replacing them at a high rate because of
anything. We're replacing them because, like I said,
they get to the end of their life.
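A quick sketch of the endurance arithmetic behind burning those drives out before they fail: a drive's rated terabytes-written (TBW) budget divided by its daily write volume gives a rough lifetime. The rating and workload below are hypothetical examples, not Backblaze's actual numbers.

```python
# Sketch of the endurance math behind "we burn them out before they fail":
# rated TBW divided by daily writes gives an expected lifetime.
# The rating and workload below are hypothetical examples.

def days_until_worn_out(rated_tbw: float, writes_tb_per_day: float) -> float:
    """Days until the drive's rated terabytes-written budget is exhausted."""
    return rated_tbw / writes_tb_per_day

# Example: a 1.92 TB SSD rated for 3,500 TBW, in a restore/API role that
# rewrites roughly its full capacity three times a day.
rated_tbw = 3500.0
writes_tb_per_day = 1.92 * 3
days = days_until_worn_out(rated_tbw, writes_tb_per_day)
print(f"~{days:.0f} days (~{days / 365:.1f} years) to reach rated endurance")
```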
Well, with Intel cutting off Optane,
those little M.2s that are still running around,
they're real low capacity.
I don't know if that would work for you,
but the write endurance on those is pretty wild.
And to your budgetary constraints,
they're relatively inexpensive.
Have you played with Optane at all?
That, I must admit, I don't know. I don't know the answer to that. I haven't heard of it,
but it doesn't mean that there's not some guys in the data center or whatever.
It would be a travesty if you decided now that you like it because they're gone.
They're gone.
Yeah, you could buy up the inventory, I suppose. But that would be sad.
So in your architecture, then, do you not have any all flash nodes or everything for data is hard drive based?
Everything for data currently is hard drives.
We do use flash for what we call shard stashing, which is the transition between
receiving the data and getting it to disk.
Okay. And so, again, the data comes in, it gets written to flash
locally, and then it gets pushed down to disk. But that means we can respond back to
the end user quicker, that we have the data ready to go if need be.
But that's the only place.
And like I said, it's transitory for the data.
So we have looked at it.
We've talked about it, about building it.
The truth of the matter is, is for most of the applications that we serve data for, that's not the bottleneck.
You know, I mean, if you really want your data, you know, quickly spread around the world, your gaming company, whatever the case may be, you use different services, CDN services, whatever the case may be. We can be the origin store. We're fine with that. That's a good model, but
doesn't mean we won't go there. But once again, we want to be operationally efficient. We want
to be able to, you know, charge a fair price for the product and move forward. So right now,
the most efficient way to do that is just using hard drives.
So further on that, we've got another question
about something that we spend a lot of time on,
maybe you not so much with a lot of the new form factors,
especially in enterprise flash,
the EDSFF form factors, be it E1S, E1L, E3,
or even the standard U.2 that we're also accustomed to,
the capacities in Flash have
gone through the roof. And I know with that, you'll gasp at the pricing. But with 61.44
terabytes from Solidigm, for instance, shipping now, which we use a ton of, actually, the server
you guys can probably hear over my shoulder is cranking through something right now full of those drives, the density
is just absolutely enthralling. But maybe you guys don't have a density challenge. Is that
part of the balance with your investment in flash? Well, density is
something right now we can manage with just hard drives and improving, like going from 4s to 16s, for example.
Oh, yeah.
You know, but you can see it coming, from the point of view of not getting 16, but, to your point, substantially more in the same space.
In 2.5-inch versus 3.5-inch or whatever the form factor is you're going to use,
you can cram it into the space. There's a couple of other little things you've got to think about there:
cooling and electricity. And, you know, data centers are being pointed
out more and more for using a whole lot of both, you know, being dirty and being electrical users.
And so you've got to balance that out, too.
I mean, we have not only a, you know, a responsibility to hold people's data and all of that.
We have a responsibility from a sustainability point of view and from, you know, and so on.
So, you know, could we build one and could we use one? And could we
fill a whole data center full of them and get, you know, quite frankly, 90%, 95%, 99% of our users
no better experience, because they're never going to use the capability, and use three, four,
five X the amount of electricity? Yeah, I suspect we could.
Is that the right thing to do?
You would need another tier at that point.
You'd have to have customers asking for a faster S3
or something, and maybe even then more,
maybe it's more of a fabric challenge
than a pure storage IO challenge.
But yeah, I mean, the density on flash
is pretty wild, right?
I mean, you got it,
even though it's expensive, even though it may not solve a lot of problems that you guys have
currently, 61 terabytes in a drive is just crazy to think about.
Yep. Yeah, and you know, we certainly look at it. We dream about it. We do some math about it.
And, you know, but right now, you know, for our company and what we do, there's plenty of opportunity to do what we do.
We like to keep things simple, a single tier of storage. You know, starting to get into multi-tiers, all of a sudden you start introducing all of the things that make dealing
with the bigger guys complex. Okay, hey, you need this management software to sit on top
and decide how often your data is utilized and which tier you should put it on,
and that's only going to cost you a million a year, because we can save you two. You know, oh, by the way, we
kind of made a mistake and we put this over in this tier and it was supposed to
be in that tier, so you're going to owe us another hundred thousand dollars for moving it between the
different tiers. And it gets to be a pain, it just really does. Now, I'm not saying that's not a necessity
in certain, in many companies, but there are many, many more out there that don't need that kind of complexity.
And that's what we deliver.
And we're kind of good at it.
I'm not saying we won't ever go down that way.
And we won't ever invent something like that because it's really, really awesome.
But our DNA is keeping it simple.
Well, I mean, when the prices come down on these HAMR drives, you must
be chomping at the bit a little at the idea of dropping 30s straight into your infrastructure
with probably no changes, as long as they behave as expected. That remains to be seen. Anytime you
have a new tech, even though they've been working on it for two decades or whatever, a very long time, nothing's proven
until it is, right? Well, we saw the same thing, by the way, with helium. I mean, they worked on
helium for 20, 25 years before they finally got it. Same thing with HAMR, and MAMR as well.
We work with the different vendors. I mean, we have a good working relationship with all of the different manufacturers.
With Seagate, you know,
we've talked to them about bringing the HAMR technology on board.
We actually had it scheduled a couple of times and then they pulled it back,
you know, because they're trying to get it right too,
for all of those reasons. And, you know, maybe they didn't want us writing about it, I don't know. But
probably, I don't know, should I give my experimental tech to some dude that
writes a blog every three or four months about it? That's right, maybe not.
Maybe not, you know. And so, but you know, we know that technology is
coming. We're certainly looking forward to it.
You know, it'll be interesting to see the real pricing points.
You know, what's the street price really going to be?
What's the manufacturer's price going to be underneath that?
What can we get it for?
So I don't know if we'll have it on day one.
We may decide to invest in it on day one just so we can play with it and see what it's
like and beat it up, so as the price comes down we know what we're getting into, right? You
know, we did that with helium. We went out and we bought a full vault of helium drives when
they were pretty pricey at the time, but we got our feet wet, so to speak.
What's the quantity for a vault? I'm used to a case
that has 20. I assume, and I don't want to assume too much,
the vault is larger than a box of 20.
Slightly. Yeah, it's 1,200 drives.
And that's a logical collection.
There are 20 servers, each with 60 drives, right now.
And then we stripe what's called a tome.
A tome is a drive in each one of those 20 servers.
So there's 60 tomes in a vault.
You can do the math, right?
And that's how we store data,
striping it across those 20 systems.
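Doing the math he alludes to: 20 servers times 60 drives is 1,200 drives per vault, and a tome is the same drive slot taken from each of the 20 servers, so there are 60 tomes of 20 drives each. A tiny sketch of that layout in Python, with labels that are purely for illustration:

```python
# Sketch of the vault layout described above: 20 servers x 60 drives,
# with a "tome" being the same drive slot taken from each of the 20 servers.
# The constants mirror the numbers in the conversation.

SERVERS_PER_VAULT = 20
DRIVES_PER_SERVER = 60

def tome(slot):
    """Tome `slot` is the (server, slot) pair for that slot on every server."""
    return [(server, slot) for server in range(SERVERS_PER_VAULT)]

total_drives = SERVERS_PER_VAULT * DRIVES_PER_SERVER
tomes = [tome(slot) for slot in range(DRIVES_PER_SERVER)]

print(f"Drives per vault: {total_drives}")   # 1200
print(f"Tomes per vault:  {len(tomes)}")     # 60
print(f"Drives per tome:  {len(tomes[0])}")  # 20
```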
Well, so this is interesting
because this has come up a couple times in the questions.
And by the way, we're getting some awesome questions
from YouTube, so you keep pumping those over.
Jordan's the question bulldog who's working through them
and sharing them with us over here.
But when you take a drive like that,
there's maybe a legacy concern, and you tell me if it's still legit or not, about people being
worried, enterprises, organizations, yourself maybe, being worried about a bunch of drives,
1,200, all from the same production batch. Would you take those and put them into your vault, or would you divvy them
up a little bit to try to avoid that sort of cohort risk? How do you view that? Is that still
a relevant topic in today's technology? It was, I'll say it was before. Now what's happening is,
we actually don't build our own storage servers anymore.
We did way back when, and then we contracted out for it.
Now we buy Supermicro, for example.
And we contract in the drives as well.
And so we'll get a mix of a couple of different vendors inside that particular vault, all of those different systems that come in.
And one, and two, I think, like I said, over the last several years, the infant failure rate and the general failure rate have gone down.
And again, the quality of using enterprise class drives versus consumer drives has probably had something to do with that.
So less concerned about that.
The other one, of course, is that we monitor this stuff. We pay attention to it on a daily basis. We're looking at the failure rates of things.
The chances of having a mass failure across all of the drives at the same time are probably infinitesimal. We also do erasure coding across the vault, and we factor all of that into the durability of the data that's out there. So, for example, on 16 terabyte drives, there are 15 data drives and
five parity drives. And so we can lose five drives and still recover anything out of that particular
tome. And ideally you're set up,
I presume, so that if you get close to that limit, maybe you can, you know, offline that node and
gently re-silver the data or whatever you've got to do? Right, they have a whole protocol
they go through. You know, so you fail a drive, okay, that's good. You fail a second drive, what are
we going to do now? Well, let's attempt to, for example, clone the two failed drives, because cloning is a lot quicker than rebuilding,
you know, and so on. Oh, we got a third one? Let's drop the thing into read-only mode, so now all
it's doing is servicing reads, it's not writing, and so on and so forth. And, you know, if it ever got down
to four or five, for example, we'd be working with companies like DriveSavers and all of those to get data off the drives.
And you start to lock things down and try to back up things as quickly as possible.
So we've never gotten down to five failed drives in a tome.
It's just, you know, statistically, it's a small number.
It's the infamous 11 nines number, right?
And the 12th nine is a guy, an intern perpetually on staff at Costco ready to buy drives.
Ready to buy drives.
If necessary.
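To put rough numbers behind "statistically, it's a small number": with data striped 15+5 across a 20-drive tome, data is only at risk if six or more drives in that tome fail before they can be repaired. Below is a toy binomial estimate with assumed inputs (roughly a 1% annual failure rate and a two-week repair window); it is only an illustration, not how Backblaze actually models its eleven nines of durability.

```python
# Toy durability estimate for a 15-data + 5-parity tome: data is lost only if
# 6 or more of the 20 drives fail within the same repair window. This is a
# simplistic independent-failure binomial model with assumed inputs, not
# Backblaze's actual durability calculation.
from math import comb

def prob_data_loss(n_drives: int, parity: int, p_fail: float) -> float:
    """P(more than `parity` of `n_drives` fail), assuming independent failures."""
    return sum(
        comb(n_drives, k) * p_fail**k * (1 - p_fail) ** (n_drives - k)
        for k in range(parity + 1, n_drives + 1)
    )

# Assume ~1% AFR and a two-week window to detect and rebuild a failed drive.
p_fail_in_window = 0.01 * (14 / 365)
print(f"P(losing a tome in one window): {prob_data_loss(20, 5, p_fail_in_window):.3e}")
```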
So let's not leave this topic because I want to make this really clear for the younger audience or any of these SIs that are putting systems together. In your opinion then, not that you're issuing advice here, but in your opinion, if someone's putting together a 12 drive standard 2U chassis,
Supermicro, whatever, going to install TrueNAS or Windows or whatever, and the box of drives
shows up from Newegg or CDW and they're all from the same lot, your concern is where from zero to 100 on that?
It's low.
It's on the lower end of things.
I just think it is.
Again, you'd obviously want to have some level of erasure coding in there, you know, or RAID, whatever you want to call it.
Right.
And make sure that's working.
And then, you know, if it's truly precious, have it someplace else.
Right.
You know, but I do think it's on the low end of things. I'm not going to, you know... okay, so is it possible with that crate of drives that fell off the back of the truck?
And, you know, yeah, okay.
But chances are they're not going to work real well when you put them in.
Well, yeah, I mean, the odds of my golden doodle murdering you are very, very low.
Maybe not zero, but they're very, very low.
Especially if you have a... Try taking away a chew toy.
Yeah, well, even then she'd probably just roll over.
She's not, doesn't have that bone in her body.
So one of the things that we're experimenting a lot with right now is alternative cooling.
And that means for us, we've got a little liquid loop
that's probably out of frame that we're working on.
We just retrofitted a PowerEdge with cold plates.
We're talking to, that's a CoolIT solution.
We're talking to Chilldyne.
They've got a negative pressure rig.
We're looking at Submer and all these other alternate ways, Iceotope, ZutaCore.
I mean, there's probably a dozen incredible companies out there looking at alternative cooling.
And then we're also talking to guys like Eaton about getting an in-rack cooling setup.
We've got some portable chillers.
We've got all sorts of stuff going on here as we wrestle with thermals.
What does it look like in your data
centers? Are you guys all air? Are you able to get a little warmer with hard drives? Do you have
batteries in the room that make you fight the cooling battle a little bit more? What does that
look like for Backblaze? So there's two things. One, you can run data center temperatures a little
warmer than you might think. And we monitor that, but you can easily run them up to about 80 degrees.
And then the second thing is
we actually have a whole data center that doesn't use AC.
It's a facility out in
Stockton. It's run by a company called Nautilus.
Wasn't that a collab with Microsoft or something
a while ago?
Didn't they start that on the water there?
It is, it's on the water, it's on the naval base there.
And so we were the first tenant in,
and we have a full up one of our data centers there.
The only air conditioner in the whole thing is this one little swamp cooler over the central corridor of stairs because that was required by California for ADA.
Of course.
But everything else is what happens is the water comes down the river.
They have a system at the bottom which takes the river water out.
They have some magic that occurs.
They do transfer cooling
to cool the water that runs through our systems.
And then we have mats on the back
that have fans in them
and they blow the air through
or they actually suck the air through.
Because what they're doing
is they're just keeping
an ambient data center temperature.
And, you know, so you run that at,
let's pick a number of 78 degrees.
You say, okay, that's what, that's going to be our mark.
And then they make the adjustments as to, you know,
how much water comes in, how quickly things gets cooled
or not cooled as the case may be.
And, you know, no AC units, no cooling stacks.
And we've got a missed opportunity.
The Ohio River here in Cincinnati is like two and a half miles that way.
I could bring a garden hose up here and try to rig that up.
I'm going to assign that to the intern tomorrow.
There you go.
You got an intern.
Yeah, yeah.
Send them on down there.
See if they can get themselves two and a half miles.
Connect a 50-foot hose and a pump all the way out to the river and back up here.
So do you guys colo everything or do you own any of your own data center footprint?
No, we colo everything.
And, you know, at some point in our life, I'm sure we'll get to that point where we start building our own of them.
But right now there's ample opportunity to get into other folks to do it.
Obviously, everything's in a cage and a locked environment.
Like Nautilus, it's a suite.
Over in Amsterdam it's a suite, and places like that. The Nautilus one, though.
There's got to be a guy with a trident standing out front running security.
There is not, unfortunately.
This is missed opportunity.
This is what happens when you give nerds marketing budget.
They don't know what to do with it.
That was the first thing I looked for when I showed up there to take a tour.
I don't know.
I don't know how you missed that.
It's obvious.
Yeah, carry on.
No, so the facility is great.
They've been great to work with, by the way, because, you know, we're all learning a little bit as we go through this.
You know, what the kind of capabilities are and everything like that.
Because we instrument everything internally, we can tell, you know, if something's getting a little warmer than it should be and making the appropriate adjustments. You know, one of the things that you find out
is the different chassis that you have
circulate air differently, you know, in their environment.
The other thing you find out is location,
and not only up and down, but near a wall, near an aisle.
You know, what other servers
are in that rack with it.
I mean, there are all kinds of things that, you know,
you get to understand about your environment
when you start to change the parameters like that.
And, you know, the estimate right now is it's about,
it saves about 30% on electricity,
which is a pretty substantial amount.
And then again, no chemicals, no Freon,
nothing like that running through the place. So pretty cool. No, that's a good reminder. I remember when they launched that,
gosh, it's probably been four or five years at this point, maybe even longer, the Neptune play.
So I'm going to have to dig back in there and check that out again. You gave some guys on YouTube probably a good amount of fear
over their own labs
and preparing for failure
when you're talking about
the benefits of erasure coding
or multiple drive failure zones
or however you're set up.
Have you guys ever lost a whole vault?
Is there, and maybe you don't want to talk about it, but there started to get to be some conversation about the crisis level and what
that means in your world. So, no, we've never lost a whole vault. We've lost a server.
Sure. You know, where a server will come down, but then that's only one of 20, and that's part of the algorithm, right?
And that happens all of the time.
You've got to replace a drive or a memory board or whatever the case may be.
You have to take a server down.
You have to bring it back up.
The system, the software is built to understand that it was doing something, and now that system needs to catch up a little bit,
you know, in order to have the right data on it and all of that, and all the software is smart
enough to do that. So, two, as we're thinking about resiliency,
I'm sure you know Seagate's Corvault product, they've built a 102-drive chassis.
It does its own internal erasure coding, more or less,
and it can recover from failed platters and that sort of thing
and just reduce the capacity of that drive.
Have you looked into any more advanced telemetry like that?
Maybe it's not something they would
expose to you without buying that product but it seems to me if someone's going to fail a platter
but not fail a drive and soldier on you guys would be a good candidate for something like that
Yeah, yeah. So there's some things I can't quite say, but yes, we've talked to different
manufacturers and tried different things from them, you know, just working with them. And then
the real issue always comes down to, okay, so what would you charge for this?
Wait, you got to pay to run their drive at a lower capacity instead of RMAing it? It seems like, again, they should be paying you. This is easy.
It doesn't quite work. I'm showing up for your next negotiation.
Clearly something's going wrong here. Well, I'll put you with
our new guy who runs ops, Chris. I'll bring the
Trident. Don't worry.
You really run into, you know, again,
you want to use solid, proven technologies. That's what we're up to here. There's
no reason to go reinvent the wheel, so to speak. You just need to do it well, because that's where,
again, the price points are: delivering really good, solid technology to people at a price point that's affordable.
You know, but like I said, we do get exposed to all of these different things and we get to see them.
When we were looking for different servers to move from our red storage pods, which we built for the longest time,
we tried a couple of different models, and one of them was, I believe, at least one of the type
you talked about, a 100-drive system from Seagate, with a couple of spares built on the
back. Because that's the other fun little thing you can do, right? If something starts to go wrong, you
start to rebuild on the fly, so you're never out, you know, so you're never really out of sequence,
and you have a couple of spares and all of that. And we tried that in our environment. It was an
interesting one. But like I said, at the time it wasn't price appropriate for us.
But we'll probably come back to it again, because we're always looking for ways of doing things a
little more efficiently, you know, so we can manage the cost now. Well, speaking of efficiency versus cost, we've got a guy, Ty, who wants to know a little
bit more about your network fabric.
I'm going to guess you're pretty standard 1 gig management, 10 gig fabric, but you tell
me, do you benefit from faster fabrics than 10 gig?
Yeah, yeah, they run, it kind of depends on where you're going.
If you're going between data centers
and network internet connects and all that,
you can run 100 gig connections.
Sure.
You know, multiple ones actually.
In your 20 gig or in your 20 system clusters,
what does that look like?
That's a minimum 10.
They may be up to 25 these days.
I thought they put hundreds in there and then they split them because they have different functions.
Some of them are management functions and so on.
But they're minimally 10.
I think the older drives, the older systems,
the red storage pods might be still
10.
Because they
just don't have the
capacity to do anything
more than that, to be honest,
because of the hardware that's inside of them.
Some of them are from the 2013, 2012 kind of timeframe. So still eons ago, still cranking them. Oh, yeah. Yeah. So a lot of what
you guys do is store things for a comparatively very long time. So, you know, for people that
might not know you real well, you've got a Veeam
practice, for instance, NAS backup, PC backup, Mac backup, a lot of backup, a lot of hold my data
and hope I don't need it. I'm sure there's plenty more, but that's got to be a big bulk of it.
For data like that, where I might send it to you on a scheduled basis or update it or refresh it
with you maybe every so often, but it doesn't ever change much.
Do you worry about data sitting around?
Do you worry about things like bit rot
or are you guys always balancing, shifting,
you know, redoing the erasure coding
and moving things around?
How does that work in terms of just
a warehousing of data standpoint?
It's a good question. We do worry about it, obviously. One of the things we do is we regularly run what we call shard integrity checks.
A shard is just a chunk of data sitting on one of those drives, and it's been checksummed. And we go
and run that again and see if we get the
same thing. And if it's not, then we just go out and say, hey, the rest of you 19 guys, we're kind
of missing something. Let's rebuild this. And that gets rebuilt. And that's the primary
mechanism, you know, that we use to keep track of things.
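For readers curious what a shard integrity check could look like mechanically, here is a minimal sketch: re-hash the chunk on disk, compare it with the checksum recorded at write time, and flag the shard for rebuild on a mismatch. The hash choice and file layout are assumptions for illustration, not Backblaze's implementation.

```python
# Minimal sketch of a shard integrity check: re-hash the shard on disk and
# compare it to the checksum recorded at write time; on a mismatch, ask the
# rest of the tome to rebuild it. The hash and layout are illustrative
# assumptions, not Backblaze's actual implementation.
import hashlib
from pathlib import Path

def shard_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the shard through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def check_shard(path: Path, recorded_checksum: str) -> bool:
    """Return True if the shard still matches the checksum recorded at write time."""
    ok = shard_checksum(path) == recorded_checksum
    if not ok:
        # In a real system this would kick off reconstruction from the other
        # drives in the tome via the erasure code.
        print(f"Mismatch on {path}: flagging shard for rebuild")
    return ok
```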
The other thing we're finding now, for example, we're migrating data from four terabytes to 16s, for example.
And as we go through that process, we actually use our normal read and writes.
We don't do anything fancy.
We're actually grabbing a file.
So if, in fact,
when we go to get that file, there was something flipped, a bit flipped or whatever,
we only need, again, you know, 17, 16, whatever number of pieces, to reconstruct it. And we get
it back. And then when we write it, it's pristine again. So we constantly are going through and taking a look at, you know, exactly that particular
topic. Because, you know, you can rely so much on erasure coding and all of that, but you don't
want to be caught short because you didn't bother to pay attention to something. Well, and maybe
that'll be the answer to my follow-up there is if it's a four terabyte drive in systems that are as
large as yours, why
not just yank it and chuck it, put a new one in and let the system rebuild itself? Is it
really worth being gentle with that data when you're retiring a drive?
So the only thing with four terabyte drives, it's not as bad, but the rebuild time on a
drive in a system that's working, it can be days. Okay. And
the larger the drive, the longer the rebuild time. And so you don't want to
have to reduce the performance of your vault, for example, just so you can get
your rebuilds done. You don't want to create more of a problem than you need. The
other thing that's important in our system is that all of those drives in that tome have to be the same size.
Or at least the same logical size.
So I can't put in a 16 and suddenly start using the capacity above 4 when everybody else is 4.
So it becomes an exercise more in migration than just substitution.
So let's talk about that for a minute, too.
And I know we're close to our time, but this one's an important one, I think.
In this sort of era we're in of digital consumption, for lack of a better word, we're all creating data.
Everyone's seeing the charts of billion things, phones, GPS, AI, whatever. We're creating a lot of data, but you're still
in a position where you're turning over components, server systems, switches, all this stuff,
you know, at a time-to-time basis. The industry, I think, could do better in terms of
taking working gear and instead of throwing it into a shredder, doing something else
with it so it has another life. Maybe it's less a problem for you guys because you get so much
out of these individual devices. But how do you guys think about your responsibility
from just an ecological perspective or some sort of woo-woo green perspective about what we do with all of this
hardware at the end, and can any of it be repurposed and see another life in some other way?
It's a really good question. So you'll enjoy this just as a little bit of a story.
When we first started building our own servers and all of that, painting them red and all of that,
we finally, those got outdated and we had to move up.
And we actually sent out a note to all of our community
saying, hey, we're going to give these away.
All you have to do is sign up and come pick them up.
And it came down to our data center in Sacramento.
We had a line of people.
We had people driving from Canada with trucks and stuff. We had another one where we had a contest. First of all, that's absolutely amazing,
because that one right there to me and to the guys, a lot of guys listening here and that'll
check this out later, giving back to the community of people that are out there trying to solve
problems. Absolutely fantastic. Love that. All right. So the other one. Yeah. So that's part of the exercise that
we went through. Now we're kind of a little too big to do that. We do try to get as much mileage
as we can out of the equipment, like we explained. We also do recycling at the end when things finally get to that stage. There's also
services now, oh goodness, I can't remember the name of it, that will take that older hardware
and then they just kind of reassemble it and then they put it in a data center themselves and sell
time off of it. So it's not the fastest stuff in the world or whatever the case may be,
but you're just looking to solve a problem, and it's affordable.
And that's a really nice business that people can work with.
The only issue with them, of course, is there's a lot more junk, if you will,
than they could ever deploy again, right? You
know, so the best answer from our perspective is to get as much life out of the things you have,
and then when you do have to recycle, try to do that as responsibly as possible.
And that's where we are right now. We're still using storage servers, like I said, that we built back in
2013, 2012, version 3 servers are still in there, you know, still using drives that we bought back
in 2013. Hey, look, don't sell yourself short about being too big. I'm sure you guys are
big with your 300,000 drives and all, but if it comes up again, we'd be
happy to help you make sure that we find homes for that. I'll even station Jordan with his beard
and a trident outside of each data center to decommission and make sure an orderly line is
formed, and we'll even have them bring you barbecue, and maple syrup for the Canadians, to make
that happen. Man, this is an amazing conversation.
I'm really glad we got this in.
What are you looking forward to next quarter?
So what's, and maybe you don't even chunk it up like this,
but I know you're watching those AFRs slowly,
really slowly, drip down,
but maybe not even,
maybe I'll give you the rest of the year.
What are you excited about for the rest of 24
in terms of how your business can be impacted?
Well, I really liked the fact that we're starting
to up the density of things, you know,
and I do see the 22s coming in.
I hopefully will see something a little larger
than that coming down the pipe as well.
I'm kind of really looking forward to that. I want to see how the life expectancy of some of the 8s
and 12s do, because they're the ones that are going to have to be, you know, around for a little while
longer, for all of the reasons we just talked about. And, you know, it's funny, people ask that question and I say,
well, I don't see anything. When I get the data, okay, then it does what it does.
It does what it does. I can't make it up. We publish the data. I can't make up stuff. You
know, you can disagree with my interpretation, and people have,
and that's okay. That's what this is all about. But that's kind of the overall trend I
see as we go forward. I do see the number drifting down a little bit on average,
and I am really looking forward to seeing how the 22s and larger size drives do over time, because that's where the future is.
That's the next five years right there.
Well, yeah.
And I think when you look at, even if we just look back at the consumer pricing, which is the easiest thing to visualize,
we're seeing a ton of deals or discounts or good prices per terabyte in that 16 to 20 band. As you say, as that
new tech comes out, the 22s and 24s start to put a little more pressure on the midsection there,
and those tend to be pretty good values. So, look, I know you've hung with us for a long time,
we've gone over a little bit, but the one last question that the guys are dying to ask is,
you've got a bookshelf behind you.
Do you have something good on the bookshelf,
a favorite book or a recommendation?
Oh, boy.
A lot of them are, you know, John Grisham,
Michael Connelly type of stuff.
Perfect. How do you beat that?
I have all of those collected up.
I just shelved my copy of the Stephen Hawking one, the one he
wrote initially, because I had read it again just a couple of months ago, which I really enjoy
reading, because every time I read it, I understand it a little bit more.
See, I feel like that with watching WarGames. I watch it over and over and over again,
and each time I learn a little something new.
It is fun to go back and watch some of those.
I did that with my boys where we watched WarGames and, I think it was, Short Circuit and a couple of other goofy ones like that.
And they were just so much fun.
Spent a whole lot of time explaining old technology, unfortunately.
Yes, yes, absolutely.
Yeah, I recommend sticking with Fast and the Furious.
It's much easier to get along with for automotive enthusiasts and children alike.
Similar.
All you have to do is just sit there and watch with your mouth going,
did they just do that?
They took a Fiero to space, okay?
What do you want? That's
groundbreaking cinema right there. Yep. Well, this has been great. Really appreciate the time.
For anyone that missed it, we'll have the notes put together on this. We'll have the audio
track out, so you can check that out. We'll scrub this up and put it out as well. But until
then, Andy, thank you so much for participating.
Love it. And can't wait to do this again with you in a couple quarters and see what you learn.
Yeah, appreciate the opportunity. And thank you very much.