The Changelog: Software Development, Open Source - Hard drive reliability at scale (Interview)
Episode Date: April 26, 2023
This week Adam talks with Andy Klein from Backblaze about hard drive reliability at scale....
Transcript
Discussion (0)
What up, nerds? This is Adam. And this week on the show, I went solo and I'm talking to
Andy Klein, the principal cloud storyteller over at Backblaze. I'm a big fan of Backblaze
because, well, they have an awesome backup service, but also because they have these
quarterly and annually delivered drive stats that I've been reading religiously for many years.
As you may know, I'm also a home labber.
So that means I run ZFS with 15-plus storage drives, and I just love that kind of stuff.
So getting him on the show and digging into hard drives and drive stats and all these different details, and particularly how Backblaze's service operates,
from the storage pod to their methodology of how they swap out hard drives
to how they buy hard drives.
Their story is fascinating to me.
If I wasn't podcasting, I'd probably be racking and stacking storage servers
and swapping out hard drives and running analytics and all this fun stuff.
Hey, I'll stop dreaming. I'll keep podcasting. But this show was fun. I hope you enjoy it.
And I want to give a massive thank you to our friends and our partners at Fastly and Fly, because this podcast got to you fast. Because Fastly, they're fast globally. They're our CDN of choice and we love them. Check them out at fastly.com. And our friends over at Fly, well, they can help you like they help us put our application and our database globally close to our users.
And we love that with no ops.
So check them out at Fly.io.
What's going on, friends? Before the show kicks off, I'm here with Jonathan Norris, co-founder and CTO of DevCycle, a feature management platform that's designed to help you build maintainable code at scale. So Jonathan, one cool thing about your platform is your pricing.
It's usage-based. Tell me why that's important.
One of our core principles is like, we believe that everyone on your development team
should have access to your feature flagging platform,
whether it's your developers or your product managers or designers and basically the whole team, right?
And so that's one of the core reasons why we've gone with usage-based pricing, not seat-based pricing. So we align sort of our pricing to the
cost to serve your traffic, not sort of how many team members and seats you might need to fill
and to bill on that. So that's a core differentiator for us. And it really allows us to
get ingrained into developers' workflow. If every developer has access to their feature management platform,
then it just makes the seamless integrations
into the rest of your dev tools
make a lot more sense
because there's no gates.
There's no these three people over here
who only have access
to your feature flagging platform
for budgeting reasons
because it's seat-based.
You don't have to play games like that.
We want every member of the team
to have access to the platform
and really make it a core part of their development workflow. want every member of the team to have access to the platform and really
make it a core part of their development workflow. Okay, so it's great to have usage-based pricing
for accessibility. But what about those folks who are saying we have lots of traffic? So that's just
as scary as paying for seats. Yeah, it really depends on the use case, right? And so there's a
bunch of customers who may just use you on their server side. And because our SDK is doing all the hard work there, the costs to us are pretty minimal.
And so the cost to our customers is pretty minimal.
But generally for those larger customers, they also want a support contract and uptime
guarantees and all that type of stuff.
And so, yeah, you can easily get into the 100k plus range for larger deals.
But because of some of our architectural differences with how we designed
our client side SDKs using edge APIs and a lot of those things compared to some of the larger
companies in the space, we're probably able to undercut the competition by about 50% on the
client side usage. So even if you're a company at that scale, like, for example, we have experience working with very large media organizations and large mobile apps like Grubhub and things like that with Taplytics.
So we know how to do scale well.
But yeah, for those large client side use cases, we can still undercut by about 50% the bill that people are getting from the large competitors.
Okay, so everyone on your team can play a role.
That's awesome.
They have a forever free tier that gets you started.
So try that out at devcycle.com slash changelog.
Again, devcycle.com slash changelog.
So I'm here with Andy Klein from Backblaze. Andy, I've been following your work and your posts over the years. The Backblaze Drive Stats blog posts have been crucial for me, because I'm a home labber as well as a podcaster and a developer and all these things.
So I pay attention to which drives I should buy. And in many ways, I may not buy the drives that
you're suggesting, but it's an indicator of which brands fail more often. But I've been following
your work for a while. And the pre-call you mentioned your title, at least currently,
is Principal Cloud Storyteller. But what do you do at Backblaze? Give me a background. Well, I started out as the first marketing hire a long time ago, 11 years or so ago.
And it's kind of changed over the years as we've added people and everything. And these days,
I spend most of my time worrying about drive stats, the drive stats themselves,
the data that we've collected now for 10 years. So we have 10 years worth of drive data that we
look at.
And I spend a lot of time analyzing that and taking a look at it. And then also spending
some time with infrastructure, things like how does our network work or how do our systems work
or how do our storage pods work? So a lot of writing these days, a lot of analysis of the
data that we've collected over the years. So that's what I do. I think Storyteller might be
fitting then because that's kind of what you do. If you write a lot and you dig into the data,
the analysis, I mean, that's the fun part. That's why I've been following your work. And it's
kind of uncommon for us to have a, in quotes, marketer on this show. You're more technical
than you are marketing, but you are in the marketing realm, the storytelling realm of
Backblaze. Yeah. I mean, a million years ago, I was an engineer. I mean, I wrote code for a
living. And then I got into IT and the IT side of the world, got my MBA degree because I thought
that would be useful, and then crossed over to the dark side. But I've always appreciated the
technical side of things, and that if you're a developer, you know what it is, right? You got
to dig in. You got to dig in,
you got to find out what's going on. You just don't take something at face value and go,
oh, that was great. Move, let's go. And so that's been, I think what drives me is that curiosity,
that analytical point of view. So it's helped me a lot, especially doing what I'm doing now.
This recent blog post, I feel like it's almost preparatory for this podcast because you just
wrote a post called 10 Stories from 10 Years of Drive Stats Data.
And this kind of does let us drive a bit, but there's a lot of insights in there.
What's some of your favorite insights from this post?
What were you most surprised by, I suppose?
I think the thing I'm most surprised with is that we're still doing it.
You know, it's great to collect the data.
It's great to tell stories from things.
But after 10 years of it, it's amazing that people find this interesting, you know, after 10 years.
So that's the coolest part of it all.
And we'll keep doing it for as long as people find it interesting.
I think that's the big deal about it.
But there wasn't anything, any particular insight that just drove me, that made me say,
oh, man, I didn't realize that.
It's the whole data set together.
And every time I dig into it, I find something that's kind of interesting and new.
You know, we're getting ready to do the new drive stats posts for Q1.
And so I'm spending a lot of time right now going through that data.
And you suddenly see something you hadn't seen before.
Or what I really like is others start to ask questions about it.
People start asking questions, saying, hey, what about this?
Or I did this, what do you think?
And so we're taking a particular article that was written a few weeks ago on the average
life of a hard drive. And we're just applying that, what they did to our data and seeing if
we come up with the same answer, how good is that answer and so on. So there's always a fun
insight or two. And I kind of learned something every time I go through this. So the 10 years,
I could have probably put another 10
or 20 or 30 or 40 on there, but I think after about 10, they get boring.
For sure. 10 insights in 10 years does make sense. It makes a good title for a blog post.
That's sticky enough. I guess, help me understand, since you say you're surprised by the 10 years of
this data collection, how does Backblaze use it internally to make it worth it from a business endeavor?
Then obviously it has some stickiness to attract folks to the Backblaze brand and what you all do.
I may not use your services, but I may learn from your storytelling.
You're in the trenches with all these different hard drives over the years.
How does this data get used internally? How does it encompass for you?
So that's a really good question. I mean, almost from the beginning, we were tracking the smart
stats. And there were a handful of them, I think five or six that we really looked at. And we were
doing that since whatever, 2008, 2009, when we first started the company. We weren't really saving the data. We were just looking at it and saying, okay, are there things interesting here and moving forward?
And that helped, okay?
That helped.
The methodology we worked with was, you know, if something throws an error, like an fsck error or an ATA error or some other monitoring system throws an error,
then you can use the smart stats that you're looking at to decide if this
really is a problem or not. ATA errors are a great example. They can be a disk problem, but they can
also be a backplane problem or a SATA card problem or any number of different other things that could
be part of the real issue. So if it identifies something, okay, great, let's take a minute.
Let's see what it's saying about that drive. Let's take a look at the smart stats and see if there's any real data there
that helps back this up. Are there media errors? Are we getting command timeouts and so on?
And so that's the way we've used it over the years. And when we started saving it,
what we could do with that was get patterns on a macro level. So not just on a single drive,
but on a model level. And so you start looking at things at a model level and you go, hmm,
that particular model of drive doesn't seem to be doing well for us. And then it allowed us to begin
to understand the concept of testing. So we didn't have to wait until drives started failing. We could
start bringing in a small subset of them, run them for a period of time, observe their behavior in our environment, and then if that passed, then we would buy more of them, for example. And if it didn't pass, then we would remove those, as the case may be, and move on to a different model. But we always wanted to have a wide berth, a wide number of different models in a given size and so on.
Because if you get too narrow, you get too dependent on a given model.
And if you suddenly have a problem with that model, you're stuck.
So the data that we collect helps us make decisions along those lines.
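As a rough illustration of that confirmation step, here's a minimal Python sketch. The five attribute IDs are SMART stats Backblaze has said it pays attention to; the thresholds and the "two hits" rule are invented purely for illustration, not their actual logic.

```python
# Hypothetical sketch of the "confirm with SMART" pass described above: an
# external trigger (fsck, ATA error) flags a drive, then a handful of SMART
# attributes are checked to decide whether it is really the disk and not,
# say, the backplane or the SATA card.

WATCHED_ATTRIBUTES = {
    5:   ("Reallocated_Sector_Ct", 100),   # remapped sectors
    187: ("Reported_Uncorrect", 0),        # uncorrectable read errors
    188: ("Command_Timeout", 0),           # commands stacking up
    197: ("Current_Pending_Sector", 0),    # sectors waiting to be remapped
    198: ("Offline_Uncorrectable", 0),
}

def drive_looks_bad(raw_values: dict[int, int]) -> bool:
    """Given {attribute_id: raw_value} for one suspect drive, count how many
    watched attributes exceed their (illustrative) thresholds."""
    hits = sum(
        1
        for attr_id, (_, threshold) in WATCHED_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > threshold
    )
    # Treat a trigger error plus two or more SMART hits as "probably the disk".
    return hits >= 2

# Example: a drive that threw an ATA error and shows remapped sectors,
# uncorrectable reads, and pending sectors.
print(drive_looks_bad({5: 340, 187: 12, 188: 0, 197: 8, 198: 0}))  # True
```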
And now what people are doing, we've talked to companies that are doing it, they're starting to use that data in more of a machine learning or AI, if you want to go that
far type of way to analyze it and predict failure moving forward. And so, and I've seen some studies
and we've even talked about that in a blog post or two about the AI, the use of AI or machine learning. That's the more proper one here.
It's really not AI.
And you see how you can make predictions on things like,
hey, based on the stats,
the smart stats stacked up this way,
this drive will die,
has a 95% chance of dying in the next 30 days.
That's a really good piece of data to have in your hand
because then you can prepare for it. You can clone that drive, get it running, you know, get a new drive, put the
new drive back in, remove the one that's going to fail, and you're done. You don't have issues
with durability. And I'll get to that in a second, okay? But, you know, that kind of capability is
really kind of cool. It also does the other way, where you can say,
a drive with this kind of characteristics has a 70% chance of lasting the next two years.
Okay, cool.
That means that from a planning point of view, that model,
I now understand its failure profile, and I can move forward.
As I buy things and consider replacements and do
migrations and, you know, move from two- to four- to eight- to 12-terabyte drives and so on. I mentioned durability earlier. Durability is that notion of, you know, is my
data still there? Did you lose it, right? And all of the factors that go into durability, you know, that people write down
how many nines there are, right? But the thing that's important is to have everything in your system spinning all of the time. Well, that's not a
reality. So anytime something stops spinning, okay, becomes non-functional, you have a potential
decrease in your durability.
So what you want to do is get that data, that drive back up to speed and running as quickly as possible. So if I take out a drive and I have to rebuild it, so I take out a drive that's failed
and I put in a new drive and it has to rebuild in the array it's in effectively, that might take
days or even weeks, all right? But if I can clone that drive and get it back in and get back to service and say, let's
say 24 hours, I just saved myself all of that time.
Yeah.
And that impact on durability.
So that data, okay, that we've been collecting all of this time gives us that ability to
see those types of patterns, understand how our data center is behaving,
understand how particular models are behaving,
and make good decisions from a business point of view about what to buy, maybe not what not to buy, and so on.
Yeah.
It's a very complex business to manage this, I'm sure.
Can you tell me more about the file system
or stuff at the storage layer that you're doing because you
mentioned cloning. I'm wondering, like, if you clone rather than replace and resilver, which is a term that ZFS uses, I'm not sure if it's a term that crosses the chasm to other file systems or storage, uh, things like Ceph or others. But, you know, to clone a drive, does that mean that array, you know, gets removed from activity, so to speak? Of course, you clone it so that there's no new data or data written, so that that clone is true. It's parity plus data on there and a bunch of other stuff. Can you speak to the technical bits of the storage layer, the file system, etc.?
Yeah, so we didn't build our
own file system. I don't remember right off the top of my head which one we
actually used, but it's a fairly standard one.
What we did do is we built our own
Reed-Solomon encoding algorithms
to basically do the array.
And we can do it in 17+3, 16+4, whatever the model is of data to parity.
And it depends on the drive size.
So when you take a drive out that's failed,
if you have to replace it,
that thing has to talk to the other drives
in what we call a tome.
A tome is 20 drives that basically create
that 16+4 or 17+3 setup.
And that drive has to talk to all the rest of them
to get its bits back, so to be rebuilt.
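To make the data-to-parity arithmetic concrete, here's a small sketch, illustrative only and not Backblaze's actual Reed-Solomon code, of what a 20-drive tome looks like in those terms (the drive size is an assumed value for the example):

```python
# Illustrative numbers for a 20-drive "tome" split into data and parity shards,
# as described above (17+3 for some configurations, 16+4 for others).

def tome_profile(data_shards: int, parity_shards: int, drive_tb: float) -> dict:
    total = data_shards + parity_shards        # drives in the tome (20 here)
    return {
        "drives_in_tome": total,
        "usable_tb": data_shards * drive_tb,   # capacity holding customer data
        "parity_overhead": f"{parity_shards / total:.0%}",
        "drive_failures_tolerated": parity_shards,  # Reed-Solomon survives this many
    }

print(tome_profile(17, 3, drive_tb=16))
print(tome_profile(16, 4, drive_tb=16))
```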
And that process takes a little
bit of time. That's what takes the days or weeks, right? If I clone it, if I remove that drive,
all right, the system keeps functioning, okay? That's part of the parity mechanism, right? So
no worries there. And then when I put the clone back in, the clone goes, wait a minute, I'm not
quite up to speed. Okay, the drive does. It says, but I got a whole bunch of stuff. So let me see what I got.
And that's part of our software that says, let me know where I am.
Okay.
Oh, I have all of these files.
I have all of these pieces.
It does a bunch of things called shard integrity checks and so on and so forth to make sure it's all in one piece.
And then it says, okay, I still need everything from yesterday at 3:52 p.m., you know, blah. And then it starts to collect all
of the pieces from its neighbors and rebuild those pieces it needs on its system. In the meantime,
the system's still functioning. Okay, people are adding new data or reading data from it,
and they're picking it up from the other 19 in this case. And that one drive kind of stays in
what we call read-only mode until it's rebuilt. And then once it's rebuilt, it's part of the system. So you cut down that process of replacing that one drive from, like I said, weeks perhaps into a day or two.
Right. And the software that you mentioned that does the SMART reading, etc., to kind of give you that predictive analysis of
this drive may fail within the next 90 days,
which gives you that indicator to say, okay, let me pull that drive, do the clone versus just simply replace it
and let it resilver or whatever terminology that you all use to sort of recreate that disk
from all the other drives in its tome or its array.
You wrote that software. Is that available as open source, or is that behind-the-scenes proprietary?
Right now it's ours. It's, if I were to say, very inelegant, and I'm sure those developers are going to hear this and go, my guys are going to come yell at me, but it hasn't been open sourced. And a lot of that has to do with, well, a lot of it just has to do, like I said, with the
fact that the edges aren't very
clean. So it just kind of works in our system and goes from there. What it does today is it's more
of a confirmation using the smart stats system. So in other words, it's looking for, I mentioned
earlier, ATA errors and so on as the first identifier. And once it does that, then the
smart stats get applied to see if it's truly a failure or if it's some other problem that we need to go investigate.
Just to clarify, too, for the listeners, if you're catching up to this: Self-Monitoring, Analysis, and Reporting Technology, that is what SMART is when Andy refers to SMART.
It's a feature in the drive, but it's software that lives, I suppose,
on the box itself, right?
So it's between the operating system and the hard drive having the smart capability.
No, this smart system is built into each drive.
Right, okay.
And so it gets, what happens is, we run a program called smartctl that interrogates
that data and it's just captured into each drive.
Some manufacturers also keep another layer of data that they also have.
So the drives are kind of self-monitoring themselves and reporting data, and then we can ask it, hey, please give us that data.
And that's what we do once a day.
We say, actually, we run the
smart checks on a routine basis. It's usually multiple times a day, but once a day we record
the data. That's what makes up our drive stats. And so it's each drive just holding it and saying,
this is my current status right now of SMART X and SMART Y. Some of the values update on a regular basis, like, um, hours. There's a power-on hours, so I assume once an hour that thing gets updated. Then there's temperature, which I think is probably something that's continually updated. I don't know if it's once a minute, once every five minutes, or whatever, but it has the temperature of the drive. So there are a lot of other things in there besides, you know, how much trouble I had writing or reading from a particular sector.
Sectors, as the case may be.
Command timeout is a really good one because that indicates that the thing is really busy trying to figure out what it was supposed to be doing.
And it's starting to stack up commands. And then there are some others
that are interesting indicators, like high fly writes, which is how far the head flies from the disk. And that number is, the tolerance on that is so thin these days. I mean, when you're talking,
I mean, nine platters in a drive, that head is really, really close. And so if it comes up even
just a little bit, it's getting in everybody's way. So that's another thing that gets monitored, and so on.
So I was looking at a drive while you were talking through that bit there. I have an 18 terabyte, one of many in this array, and I was looking, and you'd be happy to know that my command timeout is zero. I don't know what a bad number would be other than zero. So at what point does
the command timeout of that or of a disk get into the bad zone? It's a good question. It does vary
and it usually comes, that particular one happens to come with usually some other error. Okay. One
of the things we found when we analyzed smart stats individually is we couldn't find any single smart stat which was indicative
by itself of failure. Okay. Until it got to be really, really weird. Like, I'm finding bad sectors. And so having a few bad sectors on a drive is just a thing that happens over time, and they get remapped and everybody's happy. But having, you know, what sounds like a lot, well, maybe that's not a lot on an 18 terabyte drive, because the sector size is the same basically. But it is a lot on a 500 meg drive or 500 gig drive, you know. So it's somewhat relative, but
no individual one generally is indicative of failure. It usually is a combination
of it. And then some drives just fail. They don't give you any indication at all. And then they just
go, I'm done. I'm out of here. And we've seen that. And roughly 25% of the drives we see,
at least the last time I looked at all of this, just failed with no indication in smart stats at
all. They just rolled over and died. And there doesn't seem to be anything in relation to a
brand or a model or a size or how long they were in. It just seems to be they get there. Now,
inevitably, what happened is before they failed, something went wrong, okay? And maybe the smart stats got recorded, but we don't record them every minute, because it would just get away from us. So maybe we missed it, okay? So I'm open to that as a possible explanation. But most of them, you do get a combination of the five or six different smart stats that we really pay attention to. A combination of those, you'll get those showing up about 75% of the time.
And like I said, there are some, you know, command timeouts is a good one where,
hey, I'm having a little trouble.
I'm having a little trouble.
Okay, I caught up and it goes back down to zero.
Okay.
And then there are others like bad sector counts.
They just continue to go up because, go up because they don't get better.
They only get worse.
Once they get mapped out, they're gone.
And you have to understand that about each of the stats, as to whether or not it's static, an always-up number, or whether it can get better.
Things like high fly writes, we see that happen. Almost every drive has that at some point or another. But the favorite way to look at this is, you look at it over a year, and there's 52 of them. 52 sounds like a high number, but that's once a week. If they all happened in an hour, I have a problem. And so there's a lot of that going on with the smart stats as well.
What causes a high fly write?
Is that like movement in the data center, in the physical hardware movement that makes it?
Or is it?
Could be.
It could be.
It could just be the tolerances are so fine now that the actuator moving through there and you get a
little vibration to your point, or maybe there's just something mechanical in that actuator where
it occasionally has a little bit of wobble in it for no particular reason. But it usually has to do
with some movement. It's never a good idea to have a spinning object,
you know, 7,200 or 15,000
or whatever RPMs
and a little thing stuffed in there,
you know, less than a human hair
and start jiggling it around.
So, you know.
Yeah, for sure.
Bad things happen.
Bad things happen.
Let's talk about the way you collect
and store this smart data.
Let me take a crack at this.
This may not be it at all.
If I were engineering the system, I would consider maybe a time series database, collect that data over time, graph it, et cetera.
How do you all do that kind of mapping or that kind of data collection and storing?
Yeah, so once a day, we record the data for posterity's sake.
Like I mentioned earlier, we do record it.
Like take a whole snapshot
of what gets spit out from SmartCTL.
You just grab all that?
We grab a particular, we call them pods, okay? The original storage pod, 45 or 60 drives, right? And we go pod by pod. That's the
way we think about it. So we go to that pod and we run smartctl on the drives in there. We pull
out the data, we store that data, and then we add a few things to it. We keep the date, we keep a
serial number, a model number, and the size of the drive, and some other basic information that we
record from the device itself.
So we know what storage pod it's in, we know which location it's in, and so on and so forth.
At that point, we have the data, okay?
And then we store it into, I'll say, a database for a backup thing. It actually stores locally and then gets transferred up overnight. That's part of what the boot drives get to do, that fun stuff, for us. Okay. And
so we take, we snapshot all of that data, we store it locally, and then it gets brought up overnight.
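A minimal sketch of what that once-a-day snapshot could look like, assuming smartctl's JSON output (smartmontools 7.0 or later) and device names invented for the example; this is just the shape of the record (date, serial, model, capacity, raw SMART attributes), not Backblaze's actual collection code:

```python
import json
import subprocess
from datetime import date

def snapshot_drive(device: str) -> dict:
    """Run smartctl against one drive and keep the fields that go into a daily record."""
    out = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,  # smartctl's exit code encodes status bits
    )
    data = json.loads(out.stdout)
    return {
        "date": date.today().isoformat(),
        "serial_number": data.get("serial_number"),
        "model": data.get("model_name"),
        "capacity_bytes": data.get("user_capacity", {}).get("bytes"),
        # one entry per SMART attribute, raw value as reported by the drive
        "smart": {
            attr["id"]: attr["raw"]["value"]
            for attr in data.get("ata_smart_attributes", {}).get("table", [])
        },
    }

# Walk the drives in one (hypothetical) pod and write the day's records locally;
# in the pipeline described above, that file would be shipped upstream overnight.
records = [snapshot_drive(f"/dev/sd{letter}") for letter in "abcd"]
with open(f"drive_stats_{date.today().isoformat()}.json", "w") as f:
    json.dump(records, f)
```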
Then there's a processing system which goes through and determines the notion of failure.
Okay. So if a drive reported something to us, it didn't fail yet. All right. The next day we go in
and we go back to that same pod, for example, and we notice that one of the drives is missing.
We look for 60.
We only got 59.
What happened?
So that gets reported, and then that becomes part of what the software on our side processes over the next few days.
Tell me about that missing drive.
What happened to it? All right. And at that point, we interact with some of our other systems, our maintenance and our inventory systems, to see what actions might have been taken on that drive. We also have some smarts built into the software itself to help identify those things. And if all of those things make sense, then we go, it failed or it didn't. You know, it didn't, because it was a temp drive that was in for a few days,
and then it got taken out and replaced by the clone.
So it didn't really fail.
It just didn't get a chance to finish, right?
So we shouldn't fail it, right?
Or we migrated that system.
We took all of the drives out, okay?
And we went looking for them, and they weren't there.
But we don't want to fail 60 drives. And so that's not what happened.
So the system figures all of that kind of stuff out.
It looks, like I said, it can connect up to the inventory and maintenance systems to help validate that, because we have to track all of those things, for obvious reasons, by serial number.
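Very roughly, the decision he's describing looks something like the sketch below. The action names and lookups are invented stand-ins for the real maintenance and inventory systems, and the real software is more involved and can take days to settle on an answer.

```python
# A rough sketch of the "is a missing drive actually a failed drive?" decision
# described above. Action names and data structures are hypothetical.

def classify_missing_drive(serial: str,
                           maintenance_actions: dict[str, list[str]],
                           migrated_pods: set[str],
                           pod: str) -> str:
    actions = maintenance_actions.get(serial, [])
    if pod in migrated_pods:
        return "not_failed"       # whole pod migrated; don't fail 60 drives at once
    if "replaced_by_clone" in actions:
        return "not_failed"       # temp drive swapped out, never really failed
    if "pulled_for_failure" in actions:
        return "failed"           # matches the maintenance record
    return "pending_review"       # keep checking over the next few days

print(classify_missing_drive("ZL2ABC123",
                             {"ZL2ABC123": ["pulled_for_failure"]},
                             set(), pod="pod-0042"))   # failed
```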
So it's fairly complex software that goes in and does that analysis. And it takes a few, sometimes a few days for it to
kind of figure out whether a missing drive is really a failed drive or whether a missing drive
is a good drive and just got removed for a valid reason. And then once that happens, then we record
it. Once a quarter, I go in and I pull the data out and look at it. And I'm looking at it for
the most recent quarter. I actually go back in and validate all of the failures as well by hand
against the maintenance records in particular, just because we want that data to be as accurate
as possible. And then we confirm all of that. And almost always we get a really solid confirmation.
If we find anything funny, we keep looking.
And that's the data we publish, and that's the data we base the reports on.
So in the sponsor minisode here in the breaks, I'm with Tom Hu, dev advocate at Sentry on the CodeCov team.
So Tom, tell me about Sentry's acquisition of CodeCov.
And in particular,
how is this improving the Sentry platform?
When I think about the acquisition,
when I think about how does Sentry use CodeCov
or conversely, how does CodeCov use Sentry?
Like I think of CodeCov
and I think of the time of deploy.
When you're a software developer,
you have your lifecycle,
you write your code, you test your code, you deploy, and then your code goes into production, and then you sort of fix bugs. And I sort of think of that split in time as like when
you actually do a deploy. Now, where CodeCov is really useful is before deploy time. It's when
you are developing your code. It's when you're saying, hey, like, I want to make sure this is
going to work. I want to make sure that I have as few bugs as possible. I want to make sure that I've thought of all the errors and all the edge
cases and whatnot. And Sentry is the flip side of that. It says, hey, what happens when you hit
production, right? When you have a bug and you need to understand what's happening in that bug,
you need to understand the context around it. You need to understand where it's happening,
what the stack trace looks like, what other local variables, you know, exist at that time so that
you can debug that. And hopefully you don't see that error case again. When I think of, like, oh, what can Sentry do with CodeCov? What can CodeCov do with Sentry? It's sort of taking that entire spectrum
of the developer lifecycle of, hey, what can we do to make sure that you ship the least buggy code
that you can? And when you do come to a bug that is unexpected, you can fix it as quickly as possible, right?
Because, you know, as developers,
we want to write good code.
We want to make sure that people can use
the code that we've written.
We want to make sure that they're happy with the product,
they're happy with the software,
and it works the way that we expect it to.
If we can build a product, you know,
the Sentry plus CodeCov thing,
to make sure that you are de-risking your code changes
and de-risking your software
then we've hopefully done the developer community a service.
So Tom, you say bring your tests and you'll handle the rest.
Break it down for me.
How does a team get started with CodeCov?
You know, what you bring to the table is like testing and you bring your coverage reports.
And what CodeCov does is we say, hey, give us your coverage reports, give us access to your code base so that we can,
you know, overlay code coverage on top of it and give us access to your CICD. And so with those
things, what we do and what CodeCov is really powerful at is that it's not just, hey, like,
this is your code coverage number. It's, hey, here's a code coverage number, and your reviewer also knows,
and other parts of your organization know as well.
So it's not just you dealing with code coverage
and saying, I don't really know what to do with this.
Because we take your code coverage, we analyze it,
and we throw it back to you into your developer workflow.
And by developer workflow, I mean your pull request,
your merge request.
And we give it to you as a comment so that you can see,
oh, great, this was my code coverage change. But not only do you see this sort of information,
but your reviewer also sees it and they can tell, oh, great, you've tested your code or you haven't
tested your code. And we also give you a status check, which says, hey, like you've met whatever
your team's decision on what your code coverage should be, or you haven't met that goal, whatever
it happens to be. And so CodeCov is particularly powerful in making sure that code coverage is not just a thing that you're doing on your own
island as a developer, but that your entire team can get involved with and can make decisions.
Very cool. Thank you, Tom. So hey, listeners, head to Sentry and check them out. Sentry.io
and use our code CHANGELOG. So the cool thing is, is our listeners,
you get the team plan for free for three months.
Not one month, not two months, three months.
Yes, the team plan for free for three months.
Use the code changelog.
Again, sentry.io.
That's S-E-N-T-R-Y.io.
And use the code changelog.
Also check out our friends over at CodeCov.
That's codecov.io.
Like code coverage, but just shortened to CodeCov.
codecov.io.
Enjoy. In terms of your data centers, do you have many of them throughout the world?
I assume the answer is yeah.
Yeah, we have five right now.
Five, four in the U.S. and one in Amsterdam.
Okay.
And they all run the same software, and the process is the same.
And the automation all occurs in the front end.
That's all fun and stuff like that.
The validation, if you will, is me.
And a little bit of that comes from me putting my name on this thing,
so I want to make sure the data's right.
So I don't want to automate myself yet.
Not yet.
We'll have an Andy Klein AI at some point.
Yeah, exactly.
Well, you know, I'm not quite ready to turn drive stats over to ChatGPT yet.
So, and I think, I don't know how long I can continue.
I mean, we're up to almost a quarter of a million drives right now.
Luckily, we get, you know.
In service currently?
In service now, yeah.
That's a lot of drives.
And, uh, and so in any given quarter we got, you know, in the last quarter we had 900 and something
drives that failed. That sounds like a lot, except we have 250,000. So no, but it is getting to be,
it is an intensive kind of work, a bit of work for me to do the validation. But I do think it's worth it, and yes, we are looking at systems which will help improve that bit of validation as well. But like I said, this just comes historically from eight years of me putting my name on this, of wanting to make sure that the stuff that we publish is as good as it can be.
Doing some quick math here, it sounds like maybe 99.8% of your drives remain in service, like 0.2% is what fail in a quarter, roughly.
It could be. That's a fair number. We actually do
an interesting calculation because that basic assumption there assumes that all of those drives
have been in service for the
length of time, the same length of time.
And that's not the case, of course, right?
And so we actually count what we call drive days.
And so that's just a day.
A drive is in service for one day.
That's one drive day.
So if you have a drive model ABC and there are 10 drives and those 10
drives have been in service for 10 days, okay, that's 100 drive days for that model. I think, you know, it's the simplest way to do it. And so we count that. And then we count failures,
of course, for that model or all of the drives or whatever the case may be. Model is the most
logical way to do all of this stuff.
Any other combination and you're,
I'll be honest, you're cheating a little.
You know, we do it for our,
we do it quarterly for all of our drives
and then we do it quarterly for a lifetime
for all of our drives, you know, each quarter.
But we also publish them by model number.
And the model number is the more appropriate
way to look at it.
Yeah.
Yeah, and not just the macro number. The macro number we're going to come up with, for example, might be like
1.4%, 1.5%. And that's a solid number. Okay. And it's a good number, but it's for all of the different models, and they vary in their failure rates over a given period of time. So drive days
is the way we do it. When we first started down this road back in
2013, we spent some time with the folks at UC Santa Cruz who are really smart about things
like this. And they gave us a very simple formula, which was based on drive days to do that
computation of the failure rates. And then we explain it.
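For reference, that drive-days formula is simple enough to sketch in a few lines of Python; the numbers below are invented just to show the shape of the calculation, not real fleet data.

```python
def annualized_failure_rate(drive_days: int, failures: int) -> float:
    """AFR (%) from drive days and failures, per the drive-days method above."""
    drive_years = drive_days / 365
    return 100 * failures / drive_years

# Made-up numbers: 10 drives of one model in service 10 days each is 100 drive
# days; one failure in that tiny window annualizes to a huge rate, which is why
# small sample sizes get called out in the quarterly reports.
print(round(annualized_failure_rate(100, 1), 1))         # 365.0 (%)
print(round(annualized_failure_rate(2_000_000, 75), 2))  # 1.37 (%)
```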
Almost every quarter, we have to explain it, because most people do exactly what you did. How many drives you got? How many failed? And you do the division. And it's the most logical thing to do, but it doesn't account for the fact that, like I said, at any given moment we have new drives coming in, we're taking out old drives, and so on. So it changes. All of that changes. And the drive days method does account for that.
Do you do much preparation for a drive to go into service?
Like, do you do a burn-in test?
Do you worry about bad sectors before you put them in?
Or you just roll the dice because you've got so much history that you kind of have confidence?
Like, how do you operate when you add new drives to the system?
That's a really good question.
When we first started out, we would put all of the drives in into a storage pod, okay? And we'd burn it in, so to speak. We'd run it for a week or so.
We still do that to a degree, but that burn-in period's a whole lot less. But when we replace
a single drive, we don't burn it in, if you will. They put it in and it obviously has to run through a series of F6 and so on in order
to even, you know, did it come up?
What does it look like?
What does the smart stats look like?
And if it passes all of those basic things, then it's in service.
I think one of the things that's really helped us with that over the years, I've been in,
my goodness, it's probably four or five years now. I was at the Seagate facility in Longmont, Colorado,
where they do their prototyping for all of the drive builds and so on and so forth.
And one of the things that they do, and they do it at all of their factories at some point,
is once the drives come off the line, so to speak,
is they actually put them in a testing box.
And they run that test, some tests on it
for a few hours, days, whatever their period of time is.
And you can see that when you get a, quote,
brand new drive, it has power on hours,
16, 20, 25, whatever.
And so it's not zero.
So they did some banging on it
to make sure you don't get a DOA drive.
And so I think that has helped. And I'm
relatively sure all of the manufacturers do something like that, although Seagate's the
only one I've actually ever seen. Yeah. Well, that's my drive of choice. I use Seagate's.
I was on IronWolf for a bit there, then IronWolf Pro in some cases. I think mainly for the warranty. With the Pro label you get a longer warranty, which is nice. Not necessarily a better drive, but definitely a better warranty. And then the newest drive I've gotten from them was the, I think it's called the Exos? I'm not sure. Do you know how to pronounce that, by any chance?
Uh, that's as good a chance as any. I'll go with that one. I don't know.
Exos is Exos. There you go. We'll call it Exos then. I think that probably sounds better.
I think that's the ones we actually use as well.
Yeah.
Yeah.
So it's interesting.
We trade off, and we have an opportunity to do something which I'll say Joe Consumer doesn't have.
We can trade off failure rates for dollars, right? So, and I'm not going to pick on any drive manufacturer, but if a particular drive has a failure rate that sits at 2% and a different drive has a failure rate of 1%, all right?
We can say, we look at the cost and we can say,
well, wait, the one with 2% cost us $100 less.
And the lifetime cost of that and replacing these drives
over a five or seven year period or whatever it is, we're going to save a half a million dollars if we just buy those.
Yeah.
So we can do that.
Okay.
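To make that trade-off concrete, here's a back-of-the-envelope sketch. The fleet size, prices, and rates are made up for illustration; the point is only that a cheaper drive with a higher failure rate can still win on lifetime cost.

```python
def lifetime_cost(fleet_size: int, unit_price: float, afr: float, years: int) -> float:
    """Purchase cost plus replacement drives over the service life.
    afr is the annualized failure rate as a fraction (0.02 == 2%)."""
    replacements = fleet_size * afr * years
    return fleet_size * unit_price + replacements * unit_price

# Made-up numbers: 10,000 drives over a five-year life.
cheaper_but_flakier = lifetime_cost(10_000, unit_price=250, afr=0.02, years=5)
pricier_but_solid = lifetime_cost(10_000, unit_price=350, afr=0.01, years=5)
print(cheaper_but_flakier, pricier_but_solid)  # 2750000.0 3675000.0
# The 2% drive saves roughly $900k here despite failing twice as often.
```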
And people at home with one drive don't really have that.
Maybe that's not the decision they want to make.
And that's why we always tell them, hey, you know, there's this company that backs up things.
Maybe.
But anyway.
That's right.
Backblaze.
Yeah.
So that's cool, though, that you get to do that kind of trade-off.
As you said, you know, dollars per failure, things like that.
I think that's really interesting.
Do you have some program or formula that you use behind the scenes that lets you make those choices?
And then, too, I guess when you're buying the drives, can you use your data as leverage? Well, hey, you know, HGST, you know, based on our stats from the last 10 years, your drives fail more often. So we want to pay less for your drives because we'll have to buy more of them sooner.
We're happy to keep buying them.
However, they're going to fail more often and more frequently based on our data.
Does that give you leverage?
So I'm not the buyer, but I do know that the data gets brought up from time to time
in conversations with the various different companies. Now, inevitably, they have a different
number. All right. They always do. They publish it on their data sheets. And every drive I've
ever seen has either a 1% annual failure rate or a 0.9% failure rate. So that's the range.
It's like 0.9 to 1.
And so that's what they believe is their number.
And they do that through calculations of mean time between failures and so on and so forth
of all of the components.
And so that's the number they believe.
Okay.
Now, you know, whether or not we influence that, we say, well, look, we'll go buy these
and we'll do this trade-off.
You never know what numbers you're going to get from a manufacturer at a given time.
The other thing that we do is I don't need the latest and greatest drives in my data center because why would I overpay for them?
So we're going to use drives that are somewhat down the price curve and have
a proven capability in the market already. And so you're better off negotiating from a point of view
of where you are in that price curve than your drives fail more or your drives fail less kind
of thing. One. And two, model by model is so much different. You may get one model of a 16 terabyte
drive that, you know, let's just say Seagate makes and its failure rate is 0.5%. That's,
it's great. 0.5, you know, half a percent. And then you may get another 16 terabyte
drive from Seagate and it fails at 2%. Okay. So, you know, what do you do, right? You just negotiate, you
know, based on where they are in the curve. That's the best thing to do. If you're going to buy,
you know, 22 terabyte drives right now, you're paying a pretty good premium for it. So I don't
want to buy 22 terabyte drives right now. I'll wait until the price comes down, you know, and
then we can buy 22s or we can buy
24s or whatever the case may be. And we'll know a lot more about the history. And, you know,
so we're buying into, we're buying into the pricing curve as it drops.
We talk a bit about your storage pods themselves. I know that there's some history there from
Protocase, and I've read up on the history because I'm a 45Drives fan, 45Drives being the brand. I kind of know some of the storage pod history, where you all had a prototype and a desire for a design.
You went to Protocase and collaboratively you came up with what was StoragePod 1.0.
I think you even open sourced the design and stuff like that.
So people can go and buy it, which really drove a lot of the early demand for the 45Drives spinoff from Protocase to become a legitimate brand. And then there were needs of folks who wanted high density storage, but not the storage pod Backblaze version, because you had different circumstances and different choices you were making, because you had different business measures you were basing choices off. Like you said, you don't want the latest, greatest drive; you want something that actually proved itself in the market. You know, you had different demand curves you were operating on, so you're not
the same as everyone else. Long story short, give me the story of the storage
pod. Help me understand the design of it, so to speak.
15 drives, 30 drives, 45, 60. I know that there are 15
per row. I don't know what you call them, but give me the layout of the
storage pod. Tell me more about it. Sure. So the 15, just to start with, is actually the size of the first array we used. And
we used RAID 6 when we first started. And so we did it in, I think it was a 13+2 arrangement.
And so 45 just came from, you know, three rows effectively. Now we actually just mechanically, we didn't lay them
out like an array in each row. We actually moved them around and that had to do with the fact that
the back planes that we use were five drives each. And so you didn't want to overload a given back
plane with all of the commands going on. So you, you wanted to move it around. It was just a whole
lot more efficient that way. It also had
to do with the fact that if you lost a backplane, okay, you would lose five drives and suddenly that
array, you couldn't get any data out of it. So that was a way to improve durability.
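A tiny sketch of why that spreading matters, using the original layout as described (three 15-drive RAID 6 arrays, 5-drive backplanes) and a simple round-robin placement that is assumed here for illustration; the actual placement logic was theirs.

```python
from collections import Counter

DRIVES, ARRAYS, PER_BACKPLANE = 45, 3, 5

# Round-robin drives across the three arrays, five slots per backplane.
layout = [(slot // PER_BACKPLANE, slot % ARRAYS) for slot in range(DRIVES)]
#           ^ backplane index        ^ array index

# Worst case: how many drives of any single array sit on one backplane?
worst = max(
    Counter(array for bp, array in layout if bp == b).most_common(1)[0][1]
    for b in range(DRIVES // PER_BACKPLANE)
)
print(worst)  # 2 -- losing one backplane costs an array at most 2 drives,
              # which a 13+2 RAID 6 array can still rebuild from.
```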
But we started out building those, and you're exactly right. We had a design. We sketched it out in our head. We actually built it out of wood. Okay, and someplace, I think in a blog post somewhere, there's a picture of a wooden storage pod with the slats and everything. And we built it out of wood and we said, hey, we don't know how to bend metal. We don't know how to do anything. But what we understood was that the design would work.
Because before we built it out of wood, we actually plugged together a bunch of five-unit
drobo-like systems and did all of this analysis and said, this will work.
And if we do it right, we can get the cost down.
Because if we were going to use, for example, even at that time, S3 as our backend, instead of doing
it ourselves, we couldn't offer our product at the price point we wanted to. We would actually
have to 10X it. So rather than getting unlimited storage for your PC for five bucks a month at the
time, you were going to have to pay a lot more. So we decided to build our own, right? And design
our own. And then we went to the folks at Protocase and I don't know how we found them, to be honest with
you, but they helped build that and build all, and they were, they're really good at it. You know,
they really understand how to bend metal and they can produce all of the designs and they,
and that's exactly what we did. And then we turned around and said, okay, well, this is great. And we
like it. Let's open source it. Let's tell the world about this. And that's what we did way back in 2009 now.
And then we changed the design over the years and added some things. But to your point, at some
point, the folks at Protocase said, well, this is really cool. Maybe we should be making these for
other folks because we had made the decision that we wanted to be a software company and not a hardware
company. And people were asking us to make storage pods for them. And we went, well, there's like
nine of us who work here. I don't think we really, we don't have a lot of time.
That's not our business model.
And so let's, no, we're not going to make it. Now, the truth is we actually did make a few of them
because somebody was going to give us a whole bunch of money for them
who shall remain nameless.
And so we took the money and made a couple of storage pods,
but that wasn't going to be our business.
And Protocase stepped forward and said,
well, I think this is a really cool idea.
Maybe we should start doing this.
And that's where they kind of started.
And then they could customize them.
We needed them for a very specific purpose.
We used commodity parts in them.
When we published it, you could build your own.
You can go down and buy your own Intel cards
and your own Supermicro motherboards.
And the only thing you had to do that was funny was the power supply cable had to be made, because it went to two different power supplies and came into the
motherboard. But other than that, you know, everything else was basically do it yourself.
Even the backplanes at the time you could buy. So it was really, really cool that they could do
that. And a lot of folks actually, once we published it, actually started building their
own storage pods, which is even cooler, right?
But the 45 drive guys took it and they said, you know, if we could let people customize
this, or maybe we'll produce some different versions of it.
Let's make a really fast version.
Yay.
You know, and they could upgrade it.
And that's where they started to differentiate themselves.
Then they went to direct connect drives instead of using backplanes.
And I don't
know exactly when they made that decision, but that's kind of where we parted with them, because they wanted to go down the direct connect drive path, which was great. And I think to this day,
that's the technology that they use. And we stayed with backplanes. And so we eventually went and
used the other manufacturers. These days, to be quite honest with you, we actually buy storage pods, if you will, from Supermicro.
And they are Supermicro servers.
They are not ours.
They're not even painted red.
And, you know, and we just buy them off the rack, so to speak, you know, because they give you the ability to pick a model and put a few options on it.
And we say, give me 500 of those.
And they ship them.
And we're happy as clams with those.
And we don't have to worry about storage pods and updating the design or anything like that.
And the 45 drive guys, they're doing great.
They're really, I like them because they're the true customization place.
You can go over there and say, hey, I want one of these that kind of looks like this and paint it blue.
And oh, by the way, I like that piece of software.
So let's put that on there, put our clone on it, blah, blah, blah, blah, blah.
And you get what you want and then they support it, which is even better.
So, so cool.
I think it's interesting, because I have an AV-15, is what they call it. That's the model number for their Storinator, 10 feet to the left of me over there, with 15 drives in it.
And so mine is an AV-15.
That's what the 15 is.
It's a 15-drive array.
It's based on this original storage pod
that you all started out with.
I think that's just so cool how, you know,
I never knew you.
I didn't know the full Backblaze story.
I had come to learn of 45 drives.
I was trying to build a high density storage array for myself for
our own backups and a bunch of other fun things and just a home lab scenario.
And it's just so cool to have a piece of hardware over there
that's based on early ideas of you all.
And now you've actually abandoned that idea because you're like, you know what? We want even
more commodity. We had a great design, sure, but we actually just want to get it directly from Supermicro
and just literally take it from there and put it into our racks.
Now, can we get into your data center a bit?
Because I got to imagine these things get pretty heavy to like lift.
I read the blog post that you had out there, which was like a kind of a behind the scenes
of your US East data center.
And I actually just noticed this.
I'm glad you mentioned the change of heart when it comes to your storage pod that you no longer use a custom version for yourselves, that you just buy it directly from Supermicro.
So it's still a 4U server, which is a great size for that. And you have them stacked 12 high in a cabinet, and you leave 4U at the top for a 1U core server and an IPMI switch interface.
Can you talk to me about that design, the entire stack?
How much work did it go into designing that 12 high cabinet, for example? Well, the first things you
have to start thinking about, obviously, are how much space it is. But the next thing you have to
think about is electricity and how much electricity you can get to a rack. Because let's face it,
you're spinning that many drives, it takes up a little bit of juice. And so some of the earlier
places we were in from a data center point of view, they said, okay, so here's a rack, and here you get, you know, here's 25 amps, have a good time. And oh, by the way, you can only use 80 percent of that. And so you suddenly go, I can only stack storage pods so high, especially as the drives got bigger and started soaking up more and more electricity. And so now you go, well, I can put four terabyte drives here, but I can't put anything with eight, because, okay. So, but that's changed over time as people actually realized,
one, that these things use electricity. So you go into a facility like that and you say, okay,
so do we have enough? How much electricity have we got? Okay, we have plenty. Great.
For the drives today, the drives tomorrow, and so on. And then it becomes a floor layout issue.
How do you optimize the space?
How much air cooling do you get?
Because these things will definitely produce a little bit of heat.
So you could put all of the racks really, really close, okay, if you wanted to.
But then you're not getting the proper circulation.
It's really difficult to do maintenance and all of that.
And there are a lot of really smart people out there who kind of specialize in that.
Once you decide on where you're going to put them, then it's not only your storage pods,
but all of the ancillary equipment, the other servers that go in.
So, for example, restore servers or API servers.
So now that we do the S3 integration,
the B2 cloud storage piece,
we had our own APIs.
Now we also support the S3 APIs.
Well, they don't work the same.
So when you make an S3 call,
it actually has to kind of be
turned into our call on the back end.
And we had to have a group of servers to do that.
And so we have those kinds of servers. And then you have just, you know, utility boxes and monitoring systems and so on
and so forth that all have to be built into that. So we may have an entire rack of nothing but
support servers. You know, we have the architecture as such that there's a, you know, you have to
have, you have to know where all of the data is. And so we have servers in there, that's their job.
They know where the data is, which storage pod it is, and so on and so forth.
So you go and say, hey, I would like to have a file, or, you know, and you ask that device, you know, assuming you've been authenticated, blah, blah, blah, blah, blah, right?
And it says, okay, you'll find it over here.
And here you go.
Have a good time.
And the same thing when you want to write something. Okay. The way we write things is pretty straightforward, which is we literally connect you
to a tome, actually to a particular pod and a tome. Once you say, hey, I have some data and I
want to write it. And you say, great, here's where you're going to put it. And you're right there.
And then we have to record all of that, of course, and keep track of it for future reference so you
can get your data back. So that
whole process of laying things out, like I said, the biggest one starts with what's your footprint
and then how many racks can you get in there, but how much electricity can you support,
how much cooling is there, and so on. And then, of course, you just have to deal with the fact
that these things are big. So going up is really, really cool because we can get it.
Okay.
The only issue ever became one of does the lift guy go high enough?
Good old Luigi there.
Go high enough so that we can get them out, so we can get them back down.
What do we have to do?
If I have to bring a ladder with me every time to service a storage pod, maybe that slows me down a little bit.
If you can lift it.
They are heavy, but you can get them on the lift.
Well, I mean, even my 15 drive array, if I have it fully stacked, to put it back in the rack or to pull it out, and it's got rack rails, I mean, it's heavy. I didn't weigh it, but I mean, it's effort. It's not like a little thing. It's thick, and it's just 15 drives. Now, if you get 60 in there.
Yep. And they come bigger. You can get them as large as, I think I've seen 104 now in there, but, um, you know. And so with 60, yes. Okay.
You don't want to drop it either. Right. I mean, that would be the worst thing ever.
No, you don't want to drop it. When we first started the company, myself and Yev,
who's one of the other guys in marketing, a bit of a character, him and I used to go to
local trade shows and stuff and we'd bring a storage pod with us, but we only brought it
with five drives in it because quite frankly, we had to lug it like through streets and upstairs and all kinds of things like that.
So, yes, they do get quite heavy.
And that's why we have the rack in place.
And no, we don't let people cart them around and all of that.
But, yeah, we do want to optimize the space.
But we do need to get in them from time to time to replace a drive.
So you don't want them to be right at the top of the rack.
So you put in some of the other equipment,
which doesn't require as much hands-on maintenance up there.
So a 52U server rack, you're stacking them 12 high. They weigh roughly 150 pounds each, about 68 kilograms. And that's just an assumption on my part.
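Just to put numbers on that, using only the figures quoted here and ignoring the rack itself and any support gear:

```python
# Back-of-the-envelope rack weight from the numbers mentioned above.
pods_per_rack = 12
pod_weight_lb = 150            # roughly 68 kg each
total_lb = pods_per_rack * pod_weight_lb
total_kg = total_lb * 0.4536
print(f"{total_lb} lb (~{total_kg:.0f} kg) of storage pods in one rack")
# -> 1800 lb (~816 kg), before counting the rack, switches, or support servers
```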
And then to lift that, I think, in the details here in your blog post, there's Guido.
Guido, yeah.
Guido's mentioned, and I think that's like a server lift. It's like a thing. Like, how'd that come about?
So that was, that started with the very first ones, the 45-drive pods. Our first rack that we built was like a half-height rack, and it only went up four high. That was our first setup. And as soon as it went higher than four, people went, this is really heavy, we need to figure this out. So you can get a server lift, and that's what we did. We actually had to raise money way back when to buy a server lift, because they're not cheap.
And that was Guido, the server lift, who was named after the lift in Cars, by the way, the movie Cars.
And then later on, we added Luigi.
I know all of the data centers have their own.
I don't think the rest of them have funny names for them, although I'll have to ask, I guess.
Yeah, I was thinking that was, like, the name from that one. Wasn't Luigi the character that sold the tires, and Guido was his sidekick? Is that correct? I think so. It's been a few years since I watched the movie.
I like that though. That does make sense. Yeah. Okay. So, yeah, I'm looking here quickly. Guido was the kind of blue lift-looking thing,
and I believe Luigi was the Ferrari lover.
There we go.
Italian.
Yeah, so that was my buddy Sean,
who ran our data centers for a number of years
before moving over to another job within Backblaze.
But he was the one who named those things.
So he has a bit of a sense of humor. What's up?
This episode is brought to you by Postman.
Our friends at Postman help more than 25 million developers to build, test, debug, document, monitor, and publish their APIs.
And I'm here with Arnaud Lauret, API Handyman at Postman.
So Arnaud, Postman has this feature called API governance, and it's supposed to help teams unify their API design rules, and it gets built into their tools to provide linting and feedback about API design and adopted best practices.
But I want to hear from you.
What exactly is API governance and why is it important for organizations and for teams?
I think it's a little bit different from what people are used to
because for most people, API governance is a kind of API police.
But I really see it otherwise.
API governance is about helping people create the right APIs in the right way.
Not just for the beauty of creating right APIs, beautiful APIs, but in order to have them do that quickly, efficiently, without even thinking about it,
and ultimately help their organization achieve what they want to achieve.
But how does that manifest?
How does that actually play out in organizations?
The first facet of API governance will be having people look at your APIs and ensure
they are sharing the same look and feel as all of our APIs in the organization.
Because if all of your APIs look the same, once you have learned to use one, you move
to the next one.
And so you can use it very quickly because you know every pattern of action and behavior. But people always focus
too much on that. And they forget that API governance is not only about designing things
the right way, but also helping people do that better and also ensuring that you are creating the right API. So you can go beyond that very dumb API design review
and help people learn things by explaining,
you know, you should avoid using that design pattern
because it will have bad consequences on the consumer
or implementation or performance or whatsoever.
And also, by the way, why are you creating this API? What is it supposed to do?
And then through the conversation, help people realize that maybe they're not having the right perspective creating their API.
They're just exposing complexity and inner workings instead of providing a valuable service that will help people.
And so I've been doing API design reviews for quite a long time, and slowly but surely people shift their mind from, oh, I don't like API governance because they're here to tell me how to do things, to, hey, actually I've learned things and I'd like to work with you, but now I realize that I'm designing better APIs and I'm able to do that alone, so I need less help, less support from you. So yeah, it's really
about having that progression from people seeing governance as I have to do things that way to I
know how to do things the correct way. But before all that, I need to really take care about what
API I'm creating, what is its added value, how it helps people.
Very cool.
Thank you, Arnaud.
Okay, the next step is to check out
Postman's API governance feature for yourself.
Create better quality APIs
and foster collaboration
between development teams and API teams.
Head to postman.com slash changelogpod.
Sign up and start using Postman for free today.
Again, postman.com slash changelogpod.
So we kind of touched a little bit on this to some degree, but tell me, here's two questions
I want to ask you that I want to get to at least.
I want to know how you all buy hard drives, and then I want your advice for how consumers should buy hard drives.
So we touched on it a little bit, but give me a deeper detail.
Like how do you choose which manufacturers?
It seems based on your data, you have four that you choose from.
I believe HGST, Seagate was one we talked about already.
Western Digital, of course, is always in that mix.
And then I think sometimes Micron's in there.
It depends.
Those are the SSD stats.
Toshibas.
Toshibas are the fourth.
Okay, so you primarily map around four different manufacturers.
How do you, like, do you buy them in batches?
Do you have a relationship with the manufacturer?
Do you have to go to distributors?
How does it work for you all to buy?
Like, how much of a lift is it for you to buy drives?
And when you do buy them, I'm assuming it's once a quarter
because you're tracking failures at once a quarter.
How often are you buying new
and how many do you buy when you do buy them?
So it's actually a very variable process.
And HGST, just to kind of fill in the gap there, HGST as a company got bought by Western Digital. It got split up between Western Digital and, I think, Toshiba years ago. And so
we have HGST drives, but they're historical in nature. And so now we deal with WD, Western
Digital, to get what effectively are HGST drives. But the process is you maintain relationships with
either the manufacturer directly or the manufacturer will send you to a distributor.
You almost never buy directly. We don't buy directly from the manufacturer. You always buy
through a distributor. We always buy through one. Now, maybe Google or somebody like that goes and
can change that, okay? But companies of our size, we've always bought through a distributor.
It's just the way it works.
That's where the contract is with and so on and so forth.
We don't buy drives.
Well, originally, we used to buy drives as we could afford them.
But those days are over.
And now we buy them based on need. The first thing you want to do is figure out what your storage needs are out over, let's say, the next year and a half, two years. How much do you think you're going to need? How much growth in storage? And then you start dividing by where you are in that cost curve. Remember, we talked about that earlier. So if I'm trying to buy something, I want to buy in the middle to the bottom end of the curve, but sometimes you can't get quantity down there through a distributor, so it goes back and forth. We also say, okay, let's decide that we're going to get 8-terabyte drives, and we want to buy 5,000 of them.
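A toy version of that sizing exercise might look like the following. Every number here is invented for illustration; the real process obviously folds in the cost curve, distributor quotes, and the buffer described below.

```python
import math

# Hypothetical inputs: none of these are Backblaze's real figures.
growth_tb_per_month = 4000      # projected new data
horizon_months = 18             # plan roughly 1.5 years out
drive_tb = 8                    # candidate drive size picked off the cost curve
overhead = 1.25                 # extra capacity for redundancy, headroom, spares
buffer = 0.10                   # cushion in case a shipment slips

raw_tb_needed = growth_tb_per_month * horizon_months * overhead
drives_needed = math.ceil(raw_tb_needed / drive_tb * (1 + buffer))
print(f"~{drives_needed:,} x {drive_tb} TB drives to cover the plan")
```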
And then we'll go out to the manufacturers
or the distributors in this case,
but the manufacturers and say,
hey, we're looking for some of these.
We're looking for 5,000 of these 8-terabyte drives.
What do you got?
And they'll come back with,
well, I don't have that model.
I have this model.
It's an older model or it's a newer model.
And I can sell you not 5,000,
I can sell you 7,000 at this price.
And so you get all of these things that come back and you negotiate back and forth until you finally
get to somebody that you can buy from. And then you place the order. So how often do we do it? We like to buy them so we have a nice cushion. But if you buy so many at a
given price, and six months later, they're down 20%, that's extra money you just had basically
sitting on your data center floor. So you want to be efficient in how you buy them,
but you always want to have a buffer. And a good example was
the supply chain problems that happened over the pandemic, right? And we had that buffer.
The first thing we did, as it started to look like things were getting tight, is we placed a bunch of orders for a bunch of equipment, not just drives, but all of the support equipment
and everything like that. But we had a buffer in place.
And as prices went up, because they did,
we were unaffected by that, or minimally affected by it.
So it really is just a matter of what's available.
We know what we need.
We ask the manufacturers, hey, this is what we need,
and this is the timing we need it in.
They come back with bids, basically, this is what we need and this is the timing we need it in. They come back with bids basically and say, we can deliver this year, this many at this price at this time. And
that's also important. So, you know, just-in-time manufacturing, you know, or just-in-time
warehousing, whatever you want to call it, is part of that math that goes together, you know.
And sometimes manufacturers don't deliver. It happens. Or the distributor doesn't deliver. They say, hey, I was going to get you, you know, 3,000 drives by Friday.
I can't make it. I don't have them. Okay. And at that point, you know, that's why you have a buffer.
Okay. And then you have to make a decision. Well, okay. When can you have them? I can have them
in 30 days. Okay. That'll work. I can't have them for six months.
Then you better find a substitute.
And you want to maintain good relationships, of course,
with all of these folks.
And I think we do have good relationships with all of them.
You know, the Seagate one has been a solid one over the years.
You know, the Western Digital one has gotten a lot better,
you know, over the last three or four years with them.
And Toshiba has always been a pretty good one for us.
You know, we meet with them on a regular basis so they understand our needs and can help us decide what to do next because they're part of that.
You know, they may have something coming down the road that we're not aware of.
Okay.
And they go, you know, hey, we're going to, we have an overproduction of, you know of 12 terabyte drives out of this factory in here.
I'll tell you what we can do.
Those kinds of things come up from time to time.
For sure.
How do you determine, it may not be an exact answer, but how do you determine 8 terabyte, 12 terabyte, 16 terabyte?
Is it simply based on cost at the end of the day or is it based upon capability?
How do you really map around the different spectrums? Is it just simply what's available at the cheapest, or that curve? Is it always about that cost curve?
That's where you want to start, but it's not only about that.
Okay. So do you limit it within that range, though? Like, anything above that curve is out of the question unless there's a reason for it?
We bought some new drives way back. I remember the time we did it. We bought some, I think it was 12-terabyte HGSTs or something at the time, and they were almost 2X what we would normally have paid for that drive. So we do it from time to time if it matters from a timing point of view or something like that. We also do it from an experiment point of view.
Sell me 1,200 drives.
That's a vault.
And we'll pay a little bit extra for it to see how they really work.
Are these going to meet our needs, for example?
You also do it a little bit for goodwill.
There's some of that still out in the world,
you know, and do that. And then the other side of that, the flip side of that is somebody may
come back and say, hey, I have a bunch of eights. We're at the bottom of the curve. You know,
basically here, they're almost free. And you buy them and you use them for substitutes or something
like that. Or you may be using them for testing purposes.
Or, you know, we have a mini setup for a vault that we use for testing purposes and testing software,
and sometimes they go in there.
You know, so there's all of these different things
that play into that decision.
The logical thing to say is,
well, always give me the biggest drive
because that's the best, most efficient use of space.
And that's important.
But all of the other things start to factor in like,
well, that 16 terabyte uses four times the electricity
of that four terabyte.
Wow, how much is that going to cost us?
Or it produces so much more heat.
Or when we use it, it's slower
because it's such a bigger drive.
It's not as fast.
It doesn't give us the data quick enough.
And I'm using that as an example.
Right.
Even though it's a 7200 RPM drive,
it's still slower on the data out standpoint.
The IOPS is slower.
The IOPS is lower.
So you trade off those kinds of things as well.
The other one, which most people don't recognize,
is when you get into this whole game of rebuilding a drive,
I can rebuild a four-terabyte drive in a couple of days.
Way faster.
What does it take me to rebuild a full 16-terabyte drive?
Weeks.
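The scaling there is roughly linear in capacity: if the effective rebuild throughput stays about the same, four times the capacity means about four times the rebuild window. A hedged illustration, with an assumed rate:

```python
# Illustrative only: the effective rebuild rate is an assumption, not a
# measured Backblaze number. Real rebuilds compete with live traffic.
def rebuild_days(capacity_tb: float, effective_mb_per_s: float = 30.0) -> float:
    total_mb = capacity_tb * 1_000_000
    return total_mb / effective_mb_per_s / 86_400

for size in (4, 16):
    print(f"{size:>2} TB -> ~{rebuild_days(size):.1f} days at the assumed rate")
# 4 TB -> ~1.5 days, 16 TB -> ~6.2 days; slower effective rates push 16 TB into weeks.
```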
So does that change your durability calculations?
What do you have to do in order to make sure
that you're staying at the
level you want to stay at for that? Well, something you just said there made me think about saturation.
So while you may use a 16 terabyte drive, is there a capacity limit? Do you go beyond a 50%
threshold? For example, if you have an array of 16-terabyte drives in your tome, and I assume a tome is one single pod, or is a tome...?
It's actually spread across 20 pods. It's one drive in each of 20 different storage pods.
Yeah. Okay. So given that, do you fill those disks fully up before you move
along? Do you operate at full capacity? It's a good question. We do run them at above 80%,
okay? And the reason has to do with the fact that there's a notion of filling them up. The way our system works is you're given a URL to the data, okay? To your location, to your particular tome, if you will, your particular drive. So we fill those drives up to about 80%.
And then at that point, there are no new people coming in.
Okay.
What happens then is that existing customers, they say, I have some more data for you.
I have some more data for you.
And they continue to fill that up until we get to, I think it's like 90, 95% or something
like that.
At that point, then we hand them off and we say, go someplace else.
Here's a new place to write your data.
So we have this process where we get to
where we slow it down, slow it down,
stop writing new people,
let the existing people write in to there
to fill it back up.
Then we also have a whole mechanism in place
for data recovery, space recovery.
Because we don't charge extra for any of that kind of stuff, because we use PMR drives, or CMR drives. That's just a normal process.
Deletion and reuse of space is an easy thing. It's not like an SMR drive, which is expensive
to do that. And so we delete data and recover the space and reuse it. So maybe we get to 95%, but then, you know, people delete files and we come back down, and then you can add some more, and so on and so forth.
So, you know, so that seems to be about the right range for us.
But they are definitely chock full of data.
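The fill policy he is describing can be summarized in a few lines of logic. The thresholds are the ones stated above (about 80%, and roughly 90 to 95%); everything else is illustrative, not Backblaze's real code.

```python
# Sketch of the fill policy described above: stop placing new customers at
# ~80% full, let existing writers continue to ~95%, then redirect them.
NEW_CUSTOMER_CUTOFF = 0.80
REDIRECT_CUTOFF = 0.95

def placement_decision(fill_fraction: float, is_new_customer: bool) -> str:
    if is_new_customer and fill_fraction >= NEW_CUSTOMER_CUTOFF:
        return "send to a different tome"
    if fill_fraction >= REDIRECT_CUTOFF:
        return "hand back a new write location"
    return "accept the write here"

print(placement_decision(0.83, is_new_customer=True))    # send to a different tome
print(placement_decision(0.83, is_new_customer=False))   # accept the write here
print(placement_decision(0.96, is_new_customer=False))   # hand back a new write location
```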
Yeah.
So the point that you're making there, and the reason why I asked you that, to get clarity, is that I may have, in my example, an 18-terabyte drive in an array, but that entire array, or that entire vdev, is not full of data. Like, every 18-terabyte drive is not chock full, because that's not the way I'm using it. Backblaze is way different. You're trying to be, you know, inexpensive storage that's reliable and easy to access, fast to access, et cetera, fast to back up to. Then you also have your B2 service and a bunch of other reasons for how you use your hardware. But, you know, my use case is different.
So now dovetailing into the way you buy drives, which is very unique and very different. You know,
I don't have a, I guess I'm at the whim of the retailer.
So I would typically buy from B&H, maybe Amazon, maybe Newegg, you know, maybe CDW, for example.
These are places I might go to buy consumer level hard drives.
And I'm buying six, eight, maybe 10 if I'm really feeling lucky. Maybe if I'm buying for the full range of 15, maybe I'm buying 15 plus a couple for parity, to have replacement disks. But even then, that's, like, super expensive for someone like me, not someone like you, because you buy, you know, 5,000 8-terabyte drives at a time. Massive check, right?
Yep.
Me, way different. Or the people like me, way different.
So let's dovetail into how can you leverage what you know about the way you buy drives
to give some guidance to consumers out there that are home labbers,
that are building out four drive arrays, six drive arrays, eight drive arrays,
12 drive arrays, whatever it might be.
Give me some examples of what you know from what you know with all these drive stats,
these 10 years of drive stats, to how you buy drives. What are some of your recommendations for consumers or home labbers
buying hard drives? So that's a really good question. And it does come up, and you're
absolutely right. Somebody with a handful of drives or a small number of drives has to think
differently. And I think one of the reasons why the data, what we do, has been
popular, if you will, for the last number of years is because there's such a dearth of information
out there. Other than that, you go to the manufacturer and you could take every data
sheet produced by every single manufacturer and just change the name and they look identical and they almost have the same numbers on them.
And so they're of very little use from that perspective.
But there are some things you can do as a consumer.
One is you can, manufacturers try to match the use of the drive
to the mechanics inside of it a little bit
and the firmware that's inside of it and so on.
And so you might look at that.
So if you're doing video recording, you know, you're just recording your video systems or
something like that, that's a different use case than you might be using it where you're
storing media on it, you know, and you want to access your movies and you created a Plex
server or whatever the case may be versus, you know, Joe person who's looking for an
external drive because they have a
computer and they want to put some data on an external unit.
So I think what we give people from our perspective is at least data to help make a decision.
Now, where else do you get it from?
There's various sites that do produce it.
There's folks like yourself who work in a home lab
thing and say, hey, I had success with this. And I think you need all of that information in order
to make a solid decision. I don't think it's a monetary one, although I completely understand
why people make a monetary decision. You know, gee, I can buy that for $179, but that one costs me $259 and they're the same size.
And I don't really have $179, much less $259, so I guess I'm going to buy that one.
So I understand those decisions and you cross your fingers from that perspective.
The other little thing, and it's just the wild card in all of this, is you never know when you're going to get a really good drive model or, conversely, a really bad drive model.
And you could buy a drive
and it's the, let's just say,
DX000 model, right?
And you bought it and it's been great.
It's been running for years.
And your friend says,
what do you use there?
And I said, I'm using the DX000.
And he goes, great.
And he goes to the store and he can't get that, but he can get a DX001, and it's pretty close, right? And it fails three days out of the box. Okay. So you have to be somewhat precise, but you also see the personalities
of the different manufacturers. Okay. And I'll go back to Seagate. Seagate makes good solid drives
that are a little less expensive. Okay. And so, do they maybe fail more often? Maybe. Okay.
But there are certainly some good models out there and it doesn't necessarily correlate to price,
by the way. We've seen that and it doesn't correlate to enterprise features. It seems to
just be, they made a really good model.
The other thing I would do is if you're buying a drive is I would buy a drive that's been in production a bit, maybe a year, maybe six months at least, and then look and see what people are saying on websites, various consumer sites.
Don't go to the pay-for-play review sites because, you know, you just buy your way up the list.
But, hey, I'm thinking of using this particular model.
And then pay attention to the model they're using.
And then when you go to buy it, make sure you get the same one because, again, they don't have to be the same.
Use our data wherever it's helpful, to help maybe guide you a little bit towards where you might find the right ones, and maybe the ones to stay away from a little bit. But at the end of the day,
that's just one piece of the information that you're going to have to dig up.
And there just isn't a testing agency for drives out there.
We get people begging us for that.
We get people literally saying.
Spin off a division or something like that.
That's right, right?
Wouldn't it be fun?
Yeah, I mean, realistically, I mean, you've done a lot of the hard work in quantifying the value of the data.
And you've been consistent with the ability to, one, capture it and then report on it at a quarterly and yearly basis, which I just commend you on.
Thank you for that because that's been like,
and you give it out for free.
You don't say, hey, for Backblaze customers,
you can see the data.
It's free for everybody to see.
And I think you even have like downloads of the raw data,
if I recall correctly.
Like I've, I didn't know what to do with it,
but I'm like, great, it's there.
You know, that if I wanted to dig into it further, then I could.
But yeah, there should be some sort of drive testing, but what
a hard thing to do. I mean, especially as you probably know, models change so quickly, and the model numbers don't seem to have any sort of rhyme or reason to them. They just seem to be like, okay, we're done with building that one and now we're going here. And it's also based on geography. It might be made in Taiwan. It might be made in Vietnam. It might be made somewhere else.
And these things also play a role into it.
It could have been something geographically in that area.
There could have been a storm.
There could have been an earthquake or a hurricane or something catastrophic or who knows what.
There's things that happen in these manufacturing plants
when they make these drives to get consistency.
I've even heard to buy not in the same batch.
So don't buy more than X drives from,
you know, let's say B&H,
you know, buy two from B&H, two from CDW.
Obviously buy the same model if you can
to try and, you know, keep the model number parity.
But, you know, I've heard like all these different,
you know, essentially old wives' tales on how to buy hard drives as a consumer. And it really seems to be cargo-culted, or learned from somebody else, or just fear, essentially. This is why I do it: because it's a fear.
And the way I've kind of done it is based on the capacity first. So I think, how big do I need? I begin with my capacity, because I'm different, and I want to get to the price curve eventually, but my deal is how much do I want to have, how many drives can I actually handle, you know? And then at that level, what's my parity level? Can I afford to have a couple extra, so if two drives fail in that parity, let's say a RAID-Z2 given a ZFS file system array as an example, if those two drives fail, can I replace them? Do I have two more drives to replace them if two did fail? I hadn't considered your cloning idea, which I think is super smart. I'm going to have to consider that. I might just do some hard drive failure tests just to see how that could work. That seems so smart, to clone versus resilver, although I don't know how that would work with ZFS, if that's a thing or not. But capacity is where I begin. Then it's like, okay, for price, did I get that? And then the final thing I do once I actually get the drives, and I hadn't considered running the SMART test right away to see how many power-on hours it had, because I didn't consider they were doing testing there. But, well, hey, if Seagate is doing a burn-in of sorts on my drives or some sort of test beforehand, let me know. Like, hey, I would buy a model that has burn-in testing beforehand. Save me the week if I'm going to burn in an 18-terabyte drive. So when I bought this new array recently, the burn-in test lasted seven full days. I used a,
I don't know if you use this software or not,
it's called BadBlocks,
but you can run a series of tests.
It writes three different test patterns
and then a final one, which is the zeros across it.
But for each write, there's a read comparison.
So it's a write across the whole disk in one pattern,
then a read, another write, then a read,
another write, then a read,
and then finally a zero pass write and then a read, another write, then a read, and then finally a
zero pass write, and then a re-comparison to confirm that the drive is actually clean.
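For anyone who wants to reproduce that kind of burn-in, the commands below are the usual way to do it on Linux. The device path is a placeholder, and the badblocks write-mode test is destructive, so only run it on a drive with nothing on it.

```python
# Pre-deployment drive check: read SMART attributes (e.g. power-on hours),
# then run a destructive badblocks write/read pass over the whole disk.
# DESTRUCTIVE: wipes the drive. /dev/sdX is a placeholder; run as root.
import subprocess

DEVICE = "/dev/sdX"

# 1. Dump SMART attributes; look at Power_On_Hours to see prior use or factory testing.
subprocess.run(["smartctl", "-A", DEVICE], check=True)

# 2. Write-mode badblocks: writes test patterns across the disk and reads each
#    back (-w write mode, -s show progress, -v verbose). Takes days on big drives.
subprocess.run(["badblocks", "-wsv", "-b", "4096", DEVICE], check=True)
```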
But for an 18 terabyte drive, six drives took an entire week, you know, and that's just a
tremendous amount of time for somebody who's like, you know, I just want to get on to building my
thing. Come on now. But that's the way I look at it. That's how I've learned to buy: what capacity do I want to have, and then can I afford it, just the drives alone, and then can I afford the extras if I need parity and replacement for that parity (of course you want parity), and then finally doing a final burn-in before I actually put the drives in service, which I feel is a little overkill to some degree. But, like, you know what? The worst thing to do
is to build this full array.
I'm not a business.
I have limited time.
And then I got to deal with failures a week or so later.
Now, that burn-in test may not predict a failure a week later, but it might mitigate it. Because, like, well, if drive four of six did not pass the sector test in badblocks, well then, let's send that one back for an RMA or just a simple straight-up return kind of thing.
And you know, before you even build the array, you've got a problem child essentially.
And the other thing is, running that kind of software, if there is a media error, which happens, it just does, you let the drive rebuild around it. And so you don't even know it, other than it might tell you that. But if you put your system in play before you do that and it finds it, the same thing can happen, but now your system runs a little slower for a period of time until it figures out how to map around that.
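On the ZFS question that came up a minute ago: replacing a failed or failing disk in a RAID-Z2 pool is a supported, fairly routine operation, and ZFS resilvers onto the new disk on its own. The pool and device names below are placeholders, not a recommendation of a specific layout.

```python
# Minimal sketch of swapping a disk in a ZFS pool (placeholder names).
# ZFS resilvers the new disk automatically after `zpool replace`.
import subprocess

POOL = "tank"
OLD_DISK = "/dev/disk/by-id/ata-OLD_SERIAL"
NEW_DISK = "/dev/disk/by-id/ata-NEW_SERIAL"

subprocess.run(["zpool", "status", POOL], check=True)    # confirm which disk is degraded
subprocess.run(["zpool", "replace", POOL, OLD_DISK, NEW_DISK], check=True)
subprocess.run(["zpool", "status", POOL], check=True)    # watch the resilver progress
```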
For sure. The only other thing I want to talk to you about is, I think, a newer player in your drive stats, which is SSDs. Now, I think you only use them as boot drives, not as, like, storage drives in your B2 service or your at-large data services. And I think the reason why you make these choices is you're very pragmatic with the hardware you buy. You only buy the things you need, and you keep your expenses, your cost of goods sold, low
because you want to have the lowest cost storage services out there, whether it's B2 or whatnot.
That's how I understand Backblaze's business model and thought process when you all spend
money.
So with SSDs, obviously you're replacing older hard drives that may be the boot drive, which
as you know, the boot drive is the drive that's running the operating system itself
on the thing. Now, I've got to imagine in this 52U rack you have, you've only got one server in there, but you've got, what was it, eight storage pods? And then you've got one actual server. So is all that hooked back to that server? And then tell me about the SSDs.
Yeah, so actually, just to kind of set the scene: a storage pod is actually more than just storage. It's actually its own storage unit. It's a server. So there is a CPU in there, there's all of the other components and everything like that. So it's not like a JBOD kind of thing; each storage pod is its own server unit. It's got its own intelligence, its own processor, its own 10G network connections
and whatever else, right? And so each one has its own boot drive as well. So that's where we get
them all from.
The boot drive for us does more than just boot the system.
Like I mentioned earlier, it stores some of the SMART stats on it for a period of time.
It actually stores access logs and a few other things that we keep track of on a regular basis.
Because one, there's a compliance aspect to doing it. And then two, there's just a maintenance aspect
and a debugging aspect
when something goes a little wonky
and you want to be able to look through various logs.
So we keep those for a period of time
and they get stored on that SSD as well
or the boot drive as well.
The SSDs, to be honest, we started using those
because the price point came down to the point
where we wanted to pay for it.
Yeah.
Performance probably made sense, too.
And then price made sense.
Yeah.
And we've tried different ones over the course of it and the data.
We've talked about building a storage pod out of SSDs, okay?
And, in fact, some of our competitors are even talking about and doing some of those things.
The cost point just doesn't make sense yet.
And the reality is the speed, which is what most people would think they would be getting: it's not the drive where any slowness happens. It's not even, quite frankly, in our systems. I mean, we're dropping 100-gigabit NIC cards in these things, right? I mean, or 25.
And a lot of it is, it just takes a while to get the data from you, even just next door.
Forget about getting it, you know.
And so the SSDs are a pretty cool idea.
And I guarantee you when the price point gets to the right spot, we'll do it.
Backing up somebody's data, whether it takes 10 milliseconds or whether it takes 12 milliseconds,
is not an interesting thing.
And you shouldn't have to pay a premium to get that last little tiny bit.
And that's to your point.
That's where we live.
We want to provide a solid, good, well-performing service at an economical price. That's what we do. SSDs don't fit into that as data servers at this point. They're still too expensive. And the use cases could be interesting. The read-write, the number of writes and stuff could be interesting. Do they wear out under that environment? People have been using them in what we call caching servers, you know, in order to do things. And the reads and
writes on those are enormous. And so you could literally burn through those servers and those
SSDs in six months. And so is that economical? Did you really gain anything from a cost perspective? No, you didn't. So the analysis for all of that is still ongoing from our perspective, but I can see a day when we're there. I can see a day when, you know, we're using something besides hard drives to store customer data, but we will do it in a way that's economical and practical.
Yeah. You said that sometimes you swap out 8TB drives. I've got to imagine the largest SSDs out there tend to be 4 to 8TB.
But if you compare the cost to an 8TB hard drive, it's probably double.
Especially the 8TB SSD is probably at least maybe four times the cost of an 8-terabyte hard drive. So, I mean, yeah, I'm not going to,
when I buy Backblaze for backup services
or even B2 services, for example,
which is like a similar S3 replacement
and you even support the API,
as you mentioned before,
I'm not looking for the caching and the speed necessarily.
I mean, I want the speed to be there,
but it's not like, well, I will pay double for Backblaze because you give me SSD backups. Like it's just not
something I'm trying to buy as a consumer from my perspective. And that totally makes sense,
you know, for your business model. And that makes a lot of sense. That's why I want to talk to you
about all these details because the way you buy hard drives and the way you manage these systems
is so unique in comparison. And I mean, we've just never examined the behind-the-scenes of, you know, a data center, or a storage data center like you all operate. Like, what does it take to build that? I know we barely scratched the surface here. I've got probably 30,000 other questions that might go deeper and more technical, on different software and stuff like that. So we'll leave it there for the most part.
But it has been fun talking to you.
Is there anything else that I haven't asked you that you wanted to cover
in your 10-year stat history?
Anything whatsoever that's just on your mind that we can leave with
before we wrap up the show?
Well, I will say the next DriveStats report is coming up.
That's always fun.
I think it's due out May 4th.
Okay.
May the 4th, yes.
That's Star Wars Day.
It's my anniversary, too.
There you go.
That's even better.
Last year, we wrote one up on May the 4th, and we did it all with Star Wars themes and stuff like that.
But I've dialed that back this year.
So maybe one or two Star Wars references and that'll be it.
But, uh, congratulations on the anniversary though.
Thank you.
But yeah, so that's coming.
I encourage folks who do read those things, if they have questions or comments, feel free, we'll answer them.
We try to do the best we can.
We try to be open with how we do things and why we do the things we do. And so I always look forward to that. And ask the hard ones. We'll give you the best answer we can. We are these days a public company, so I don't know how many things we can disclose at certain points, but we'll do the best we can in getting you the information you're
asking for. Yeah, I always appreciate digging into this. I don't always dig as deep as I
would love to because of time or just, you know, just too much data, so to speak,
because it is kind of overwhelming. And the way you have to look at even like your drive failures
by manufacturer, for example, is like, well, that number may be higher for Seagate, but you also have a lot more Seagate drives in service. There are a lot of correlations you have to look at. You can't just
say, okay, let me go to Backblaze's data and say, okay, these are the drives I'm going to buy. Well,
it might be an indicator to manufacturer, maybe not model or size particularly, but it might mean
like, okay, you seem to favor Seagate. You mentioned that your relationship was pretty good there.
I like Seagate. I've had great experiences.
I almost switched recently when I was buying my newest array.
And I was thinking about building.
I was like, I'm going to go to Western Digital.
I almost did that, but I'm like, well, I've got these drives in service for this many years.
Knock on wood.
With zero failures, right?
With zero failures.
When you say that, things happen.
So I'm sorry to say that. But I've been having Seagate drives in service for, I mean, as long as I've been running data stores, which has been a long time, probably eight-plus years, maybe 10 years, or longer than that, you know, 13 years. So over time, I've always only ever used Seagate drives. I don't know why I chose Seagate.
Cool name.
I liked Iron Wolf.
Cool brand name.
All that good stuff.
They got some good stuff there.
But the things I read about them was pretty strong.
The warranty was good.
And I've had good services with Western Digital as well in terms of warranty.
I've had some drives from them fail and have to be RMA'd.
And the RMA process is pretty simple.
That's what you want to buy for.
You want to buy a brand that's reliable. You want to buy for parity. You want to buy for
replacements of that parity and to be able to swap it out easily. And then also, do you have
a worthwhile warranty and can you RMA pretty easily? RMA is just simply sending the drive
back that's got a failure, you know, and they will replace that drive with the drive that you
got or something that is equivalent.
There's always circumstances that make that different.
But I've only had good responses from Seagate
as well as Western Digital.
So those are the brands I stick with.
But that could be an old wives' tale, right, Andy?
That's Adam's old wives' tale of how I buy hard drives.
It's okay.
People have to be comfortable with whatever decision they make.
But the most important thing, and you built it into your system, right, is to what? Have a backup.
And I don't care what kind of backup it is. You don't have to use our service.
It isn't a backup. It's just parity. But yeah, definitely have a backup.
You know, and because if you lose it, it's gone. I mean, so, you know, have a backup. And again,
we've said this before. I don't care if you use us.
I don't care if you use anybody.
I don't care how you do it.
Just get your stuff backed up so that if something happens, you don't lose the things that are precious to you.
It's as simple as that.
And again, I don't care who you do it with or how you do it.
Just get it done.
Very cool.
Well, Andy, thank you so much for taking an hour or so of your time to geek out with me on hard drives.
Not always the... you know, I'm curious how many people come to this episode, honestly, and are excited about this topic. It's not the most fun topic unless you're a hard drive nerd like you and I might be.
I'd rather enjoy it.
I think this kind of stuff is pretty fun.
I'm kind of curious what audience we bring to this one because this is a unique topic for us to have on the changelog.
Yeah, I appreciate the opportunity, and I hope some folks listen.
It's always fun to have folks, you know, listen to what you say
and then make comments on it and, you know, and all of that.
There are some places where geeks hang out, you know,
and hard drive geeks in particular hang out,
so maybe we'll get a whole bunch of them together and listen to it.
But just the education of what goes on.
I mean, you understand the complexity of a hard drive and what's going on inside there, right?
And I understand that to some degree as well.
And it's miraculous that that thing works.
It does what it does, and it does it at the price points that they do it at.
So we just need to have that appreciation for that technology, you know, for as long as it's around.
For sure. I agree. I mean, we definitely underappreciate and take for granted the mechanics of a hard drive, as simple as it might be. Like, wow. I mean, on my MacBook Pro I don't care, because I'm using an SSD. It's actually probably an NVMe SSD, or just straight-up NVMe, in the M.2 format or whatever it might be. You know, at that point, I'm not caring. But in other cases, yes, you know, the hard drive.
I mean, that's what the cloud is built upon.
Your cloud is built upon spinning rusty hard drives that eventually fail.
That's not always the coolest topic, but it is crucial.
It's like almost
mainframe level crucial, right? Like we don't think about mainframes too often. We had an episode
about that, but how often do you talk about hard drives and the simple thing that they are, but
also the very complex thing they are. And like you had said, the miraculousness of the fact that it
actually works. But yeah, thanks so much, Andy. It's been awesome talking to you. Appreciate you.
Thank you, Adam. It was great.
Okay, so our friends over at Backblaze have basically made it a science to buy and swap out hard drives to operate at scale.
A quarter of a million hard drives in service with, I think, what the back-of-the-napkin math put at 0.1-ish percent failure per quarter.
Not bad.
Not bad for the manufacturers and not bad for the service to keep the uptime and keep it affordable.
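For anyone playing along with the napkin, here is one common way an annualized failure rate gets computed from a fleet, using drive-days. The numbers are made up for illustration and are not Backblaze's actual figures.

```python
# Toy annualized failure rate (AFR) calculation from drive-days.
# All inputs are invented for illustration.
drives_in_service = 10_000
days_in_quarter = 90
failures_in_quarter = 35

drive_days = drives_in_service * days_in_quarter
afr_percent = failures_in_quarter / drive_days * 365 * 100
print(f"AFR ~= {afr_percent:.2f}%")   # ~1.42% annualized for these made-up numbers
```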
Once again, a big thanks to Andy for sharing his time with us today. And guess what?
There is a bonus for our Plus Plus subscribers.
So if you're not a Plus Plus subscriber, it's easy: changelog.com slash plus plus. Ten bucks a month, a hundred bucks a year. No ads, directly support us, get closer to the metal, get bonus content, and of course a shout-out on Changelog News on Mondays. You can't beat that. $10 a month, $100 a year to support us. Wow, that's easy,
and we appreciate it. Speaking of appreciation, a big thank you to our friends and our partners
at Fastly, Fly, and Typesense. We love them. You should check them out: Fastly.com, Fly.io, and Typesense.org. They support us; you should support them.
And of course, a big thank you to our Beats Master
in residence, Breakmaster Cylinder.
Those beats, bangin', bangin', bangin'.
Love them.
And of course, last but not least,
thank you to you.
Thank you for listening to the show.
Thank you for tuning in all the way to the
very end. Thank you for coming back every single week. If you haven't heard, there is a new spin on the Monday news show. It is now combined with Changelog Weekly, so on Mondays it's now called Changelog News officially. And Changelog News is a podcast and a newsletter in the same vein.
When Jared ships that show, he ships a podcast and a newsletter, and you can subscribe to both.
Check it out at changelog.com slash news.
If you're subscribed to this show already, to get the audio podcast of that show, do nothing.
Just keep subscribing to this show.
You get it anyways.
But that's it. The show's done. Thank you again. We will see you on Monday.