The Changelog: Software Development, Open Source - Hard drive reliability at scale (Interview)
Episode Date: April 26, 2023
This week Adam talks with Andy Klein from Backblaze about hard drive reliability at scale....
Transcript
Discussion (0)
What up, nerds? This is Adam. And this week on the show, I went solo and I'm talking to
Andy Klein, the principal cloud storyteller over at Backblaze. I'm a big fan of Backblaze
because, well, they have an awesome backup service, but also because they have these
quarterly and annually delivered drive stats that I've been reading religiously for many years.
As you may know, I'm also a home labber.
So that means I run ZFS with 15-plus storage drives, and I just love that kind of stuff.
So getting him on the show and digging into hard drives and drive stats and all these different details, and particularly how Backblaze's service operates,
from the storage pod to their methodology of how they swap out hard drives
to how they buy hard drives.
Their story is fascinating to me.
If I wasn't podcasting, I'd probably be racking and stacking storage servers
and swapping out hard drives and running analytics and all this fun stuff.
Hey, I'll stop dreaming. I'll keep podcasting. But this show was fun. I hope you enjoy it.
And I want to give a massive thank you to our friends and our partners at Fastly and Fly, because this podcast got to you fast. Because Fastly, they're fast globally. They're our CDN of choice and we love them. Check them out at fastly.com. And our friends over at Fly, well, they can help you like they help us put our application and our database globally close to our users.
And we love that with no ops.
So check them out at Fly.io.
What's going on, friends? Before the show kicks off, I'm here with Jonathan Norris, co-founder and CTO of DevCycle, a feature management platform that's designed to help you build maintainable code at scale. So Jonathan, one cool thing about your platform is your pricing.
It's usage-based. Tell me why that's important.
One of our core principles is like, we believe that everyone on your development team
should have access to your feature flagging platform,
whether it's your developers or your product managers or designers and basically the whole team, right?
And so that's one of the core reasons why we've gone with usage-based pricing, not seat-based pricing. So we align sort of our pricing to the
cost to serve your traffic, not sort of how many team members and seats you might need to fill
and to bill on that. So that's a core differentiator for us. And it really allows us to
get ingrained into developers' workflow. If every developer has access to their feature management platform,
then it just makes the seamless integrations
into the rest of your dev tools
make a lot more sense
because there's no gates.
There's no these three people over here
who only have access
to your feature flagging platform
for budgeting reasons
because it's seat-based.
You don't have to play games like that.
We want every member of the team
to have access to the platform
and really make it a core part of their development workflow. want every member of the team to have access to the platform and really
make it a core part of their development workflow. Okay, so it's great to have usage-based pricing
for accessibility. But what about those folks who are saying we have lots of traffic? So that's just
as scary as paying for seats. Yeah, it really depends on the use case, right? And so there's a
bunch of customers who may just use you on their server side. And because our SDK is doing all the hard work there, the costs to us are pretty minimal.
And so the cost to our customers is pretty minimal.
But generally for those larger customers, they also want a support contract and uptime
guarantees and all that type of stuff.
And so, yeah, you can easily get into the 100k plus range for larger deals.
But because of some of our architectural differences with how we designed
our client side SDKs using edge APIs and a lot of those things compared to some of the larger
companies in the space, we're probably able to undercut the competition by about 50% on the
client side usage. So even if you're a company at that scale, like, for example, we have experience working with very large media organizations and large mobile apps like Grubhub and things like that with Taplytics.
So we know how to do scale well.
But yeah, for those large client side use cases, we can still undercut by about 50% the bill that people are getting from the large competitors.
Okay, so everyone on your team can play a role.
That's awesome.
They have a forever free tier that gets you started.
So try that out at devcycle.com slash changelog.
Again, devcycle.com slash changelog.
So I'm here with Andy Klein from Backblaze. Andy, I've been following your work and your posts over the years. The Backblaze Drive Stats blog posts have been crucial for me, because I'm a home labber as well as a podcaster and a developer and all these things.
So I pay attention to which drives I should buy. And in many ways, I may not buy the drives that
you're suggesting, but it's an indicator of which brands fail more often. But I've been following
your work for a while. And the pre-call you mentioned your title, at least currently,
is Principal Cloud Storyteller. But what do you do at Backblaze? Give me a background. Well, I started out as the first marketing hire a long time ago, 11 years or so ago.
And it's kind of changed over the years as we've added people and everything. And these days,
I spend most of my time worrying about drive stats, the drive stats themselves,
the data that we've collected now for 10 years. So we have 10 years worth of drive data that we
look at.
And I spend a lot of time analyzing that and taking a look at it. And then also spending
some time with infrastructure, things like how does our network work or how do our systems work
or how do our storage pods work? So a lot of writing these days, a lot of analysis of the
data that we've collected over the years. So that's what I do. I think Storyteller might be
fitting then because that's kind of what you do. If you write a lot and you dig into the data,
the analysis, I mean, that's the fun part. That's why I've been following your work. And it's
kind of uncommon for us to have a, in quotes, marketer on this show. You're more technical
than you are marketing, but you are in the marketing realm, the storytelling realm of
Backblaze. Yeah. I mean, a million years ago, I was an engineer. I mean, I wrote code for a
living. And then I got into IT and the IT side of the world, got my MBA degree because I thought
that would be useful, and then crossed over to the dark side. But I've always appreciated the
technical side of things, and that if you're a developer, you know what it is, right? You got
to dig in. You got to dig in,
you got to find out what's going on. You just don't take something at face value and go,
oh, that was great. Move, let's go. And so that's been, I think what drives me is that curiosity,
that analytical point of view. So it's helped me a lot, especially doing what I'm doing now.
This recent blog post, I feel like it's almost preparatory for this podcast because you just
wrote a post called 10 Stories from 10 Years of Drive Stats Data.
And this kind of does let us drive a bit, but there's a lot of insights in there.
What's some of your favorite insights from this post?
What were you most surprised by, I suppose?
I think the thing I'm most surprised with is that we're still doing it.
You know, it's great to collect the data.
It's great to tell stories from things.
But after 10 years of it, it's amazing that people find this interesting, you know, after 10 years.
So that's the coolest part of it all.
And we'll keep doing it for as long as people find it interesting.
I think that's the big deal about it.
But there wasn't anything, any particular insight that just drove me, that made me say,
oh, man, I didn't realize that.
It's the whole data set together.
And every time I dig into it, I find something that's kind of interesting and new.
You know, we're getting ready to do the new drive stats posts for Q1.
And so I'm spending a lot of time right now going through that data.
And you suddenly see something you hadn't seen before.
Or what I really like is others start to ask questions about it.
People start asking questions, saying, hey, what about this?
Or I did this, what do you think?
And so we're taking a particular article that was written a few weeks ago on the average
life of a hard drive. And we're just applying that, what they did to our data and seeing if
we come up with the same answer, how good is that answer and so on. So there's always a fun
insight or two. And I kind of learned something every time I go through this. So the 10 years,
I could have probably put another 10
or 20 or 30 or 40 on there, but I think after about 10, they get boring.
For sure. 10 insights in 10 years does make sense. It makes a good title for a blog post.
That's sticky enough. I guess, help me understand, since you say you're surprised by the 10 years of
this data collection, how does Backblaze use it internally to make it worth it from a business endeavor?
Then obviously it has some stickiness to attract folks to the Backblaze brand and what you all do.
I may not use your services, but I may learn from your storytelling.
You're in the trenches with all these different hard drives over the years.
How does this data get used internally? How does it encompass for you?
So that's a really good question. I mean, almost from the beginning, we were tracking the smart
stats. And there were a handful of them, I think five or six that we really looked at. And we were
doing that since whatever, 2008, 2009, when we first started the company. We weren't really saving the data. We were just looking at it and saying, okay, are there things interesting here and moving forward?
And that helped, okay?
That helped.
The methodology we worked with was, you know, if something throws an error, like an fsck error or an ATA error or some other monitoring system throws an error,
then you can use the smart stats that you're looking at to decide if this
really is a problem or not. ATA errors are a great example. They can be a disk problem, but they can
also be a backplane problem or a SATA card problem or any number of different other things that could
be part of the real issue. So if it identifies something, okay, great, let's take a minute.
Let's see what it's saying about that drive. Let's take a look at the smart stats and see if there's any real data there
that helps back this up. Are there media errors? Are we getting command timeouts and so on?
And so that's the way we've used it over the years. And when we started saving it,
what we could do with that was get patterns on a macro level. So not just on a single drive,
but on a model level. And so you start looking at things at a model level and you go, hmm,
that particular model of drive doesn't seem to be doing well for us. And then it allowed us to begin
to understand the concept of testing. So we didn't have to wait until drives started failing. We could
start bringing in a small subset of them, run them for a period of time, observe their behavior in our environment, and then if that passed, then we would buy more of them, for example. And if it didn't pass, then we would remove those, as the case may be, and move on to a different model. But we always wanted to have a wide berth, a wide number of different models in a given size and so on.
Because if you get too narrow, you get too dependent on a given model.
And if you suddenly have a problem with that model, you're stuck.
So the data that we collect helps us make decisions along those lines.
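As a rough illustration of that confirmation step, here's a minimal Python sketch. The five attribute IDs are SMART stats Backblaze has said it pays attention to; the thresholds and the "two hits" rule are invented purely for illustration, not their actual logic.

```python
# Hypothetical sketch of the "confirm with SMART" pass described above: an
# external trigger (fsck, ATA error) flags a drive, then a handful of SMART
# attributes are checked to decide whether it is really the disk and not,
# say, the backplane or the SATA card.

WATCHED_ATTRIBUTES = {
    5:   ("Reallocated_Sector_Ct", 100),   # remapped sectors
    187: ("Reported_Uncorrect", 0),        # uncorrectable read errors
    188: ("Command_Timeout", 0),           # commands stacking up
    197: ("Current_Pending_Sector", 0),    # sectors waiting to be remapped
    198: ("Offline_Uncorrectable", 0),
}

def drive_looks_bad(raw_values: dict[int, int]) -> bool:
    """Given {attribute_id: raw_value} for one suspect drive, count how many
    watched attributes exceed their (illustrative) thresholds."""
    hits = sum(
        1
        for attr_id, (_, threshold) in WATCHED_ATTRIBUTES.items()
        if raw_values.get(attr_id, 0) > threshold
    )
    # Treat a trigger error plus two or more SMART hits as "probably the disk".
    return hits >= 2

# Example: a drive that threw an ATA error and shows remapped sectors,
# uncorrectable reads, and pending sectors.
print(drive_looks_bad({5: 340, 187: 12, 188: 0, 197: 8, 198: 0}))  # True
```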
And now what people are doing, we've talked to companies that are doing it, they're starting to use that data in more of a machine learning or AI, if you want to go that
far type of way to analyze it and predict failure moving forward. And so, and I've seen some studies
and we've even talked about that in a blog post or two about the AI, the use of AI or machine learning. That's the more proper one here.
It's really not AI.
And you see how you can make predictions on things like,
hey, based on the stats,
the smart stats stacked up this way,
this drive will die,
has a 95% chance of dying in the next 30 days.
That's a really good piece of data to have in your hand
because then you can prepare for it. You can clone that drive, get it running, you know, get a new drive, put the
new drive back in, remove the one that's going to fail, and you're done. You don't have issues
with durability. And I'll get to that in a second, okay? But, you know, that kind of capability is
really kind of cool. It also does the other way, where you can say,
a drive with this kind of characteristics has a 70% chance of lasting the next two years.
Okay, cool.
That means that from a planning point of view, that model,
I now understand its failure profile, and I can move forward.
As I buy things and consider replacements and do
migrations and, you know, move from two- to four- to eight- to 12-terabyte drives and so on. I mentioned durability earlier. Durability is that notion of, you know, is my
data still there? Did you lose it, right? And all of the factors that go into durability, you know, that people write down
how many nines there are, right? But the thing that's important is to have everything in your system spinning all of the time. Well, that's not a
reality. So anytime something stops spinning, okay, becomes non-functional, you have a potential
decrease in your durability.
So what you want to do is get that data, that drive back up to speed and running as quickly as possible. So if I take out a drive and I have to rebuild it, so I take out a drive that's failed
and I put in a new drive and it has to rebuild in the array it's in effectively, that might take
days or even weeks, all right? But if I can clone that drive and get it back in and get back to service and say, let's
say 24 hours, I just saved myself all of that time.
Yeah.
And that impact on durability.
So that data, okay, that we've been collecting all of this time gives us that ability to
see those types of patterns, understand how our data center is behaving,
understand how particular models are behaving,
and make good decisions from a business point of view about what to buy, maybe not what not to buy, and so on.
Yeah.
It's a very complex business to manage this, I'm sure.
Can you tell me more about the file system
or stuff at the storage layer that you're doing because you
mentioned cloning. I'm wondering, like, if you clone rather than replace and resilver, which is a term that ZFS uses, I'm not sure if it's a term that crosses the chasm to other file systems or storage, uh, things like Ceph or others. But, you know, to clone a drive, does that mean that array, you know, gets removed from activity, so to speak? Of course, you clone it so that there's no new data or data written, so that that clone is true. It's parity plus data on there and a bunch of other stuff. Can you speak to the technical bits of the storage layer, the file system, etc.?
Yeah, so we didn't build our
own file system. I don't remember right off the top of my head which one we
actually used, but it's a fairly standard one.
What we did do is we built our own
Reed-Solomon encoding algorithms
to basically do the array.
And we can do it in 17+3, 16+4, whatever the model is of data to parity.
And it depends on the drive size.
So when you take a drive out that's failed,
if you have to replace it,
that thing has to talk to the other drives
in what we call a tome.
A tome is 20 drives that basically create
that 16+4 or 17+3 setup.
And that drive has to talk to all the rest of them
to get its bits back, so to be rebuilt.
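To make the data-to-parity arithmetic concrete, here's a small sketch, illustrative only and not Backblaze's actual Reed-Solomon code, of what a 20-drive tome looks like in those terms (the drive size is an assumed value for the example):

```python
# Illustrative numbers for a 20-drive "tome" split into data and parity shards,
# as described above (17+3 for some configurations, 16+4 for others).

def tome_profile(data_shards: int, parity_shards: int, drive_tb: float) -> dict:
    total = data_shards + parity_shards        # drives in the tome (20 here)
    return {
        "drives_in_tome": total,
        "usable_tb": data_shards * drive_tb,   # capacity holding customer data
        "parity_overhead": f"{parity_shards / total:.0%}",
        "drive_failures_tolerated": parity_shards,  # Reed-Solomon survives this many
    }

print(tome_profile(17, 3, drive_tb=16))
print(tome_profile(16, 4, drive_tb=16))
```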
And that process takes a little
bit of time. That's what takes the days or weeks, right? If I clone it, if I remove that drive,
all right, the system keeps functioning, okay? That's part of the parity mechanism, right? So
no worries there. And then when I put the clone back in, the clone goes, wait a minute, I'm not
quite up to speed. Okay, the drive does. It says, but I got a whole bunch of stuff. So let me see what I got.
And that's part of our software that says, let me know where I am.
Okay.
Oh, I have all of these files.
I have all of these pieces.
It does a bunch of things called shard integrity checks and so on and so forth to make sure it's all in one piece.
And then it says, okay, I still need everything from yesterday at 3:52 p.m., you know, blah. And then it starts to collect all
of the pieces from its neighbors and rebuild those pieces it needs on its system. In the meantime,
the system's still functioning. Okay, people are adding new data or reading data from it,
and they're picking it up from the other 19 in this case. And that one drive kind of stays in
what we call read-only mode until it's rebuilt. And then once it's rebuilt, it's part of the system. So you cut down that process of replacing that one drive from, like I said, weeks perhaps into a day or two.
Right. And the software that you mentioned that does the SMART reading, etc., to kind of give you that predictive analysis of
this drive may fail within the next 90 days,
which gives you that indicator to say, okay, let me pull that drive, do the clone versus just simply replace it
and let it resilver or whatever terminology that you all use to sort of recreate that disk
from all the other drives in its tome or its array.
You wrote that software. Is that available as open source, or is that behind-the-scenes proprietary?
Right now it's ours. It's, if I were to say, very inelegant, and I'm sure those developers are going to hear this and go, my guys are going to come yell at me, but it hasn't been open sourced. And a lot of that has to do with, well, a lot of it just has to do, like I said, with the
fact that the edges aren't very
clean. So it just kind of works in our system and goes from there. What it does today is it's more
of a confirmation using the smart stats system. So in other words, it's looking for, I mentioned
earlier, ATA errors and so on as the first identifier. And once it does that, then the
smart stats get applied to see if it's truly a failure or if it's some other problem that we need to go investigate.
Just to clarify, too, for the listeners, if you're catching up to this: Self-Monitoring, Analysis, and Reporting Technology, that is what SMART is when Andy refers to SMART.
It's a feature in the drive, but it's software that lives, I suppose,
on the box itself, right?
So it's between the operating system and the hard drive having the smart capability.
No, this smart system is built into each drive.
Right, okay.
And so it gets, what happens is, we run a program called smartctl that interrogates
that data and it's just captured into each drive.
Some manufacturers also keep another layer of data that they also have.
So the drives are kind of self-monitoring themselves and reporting data, and then we can ask it, hey, please give us that data.
And that's what we do once a day.
We say, actually, we run the
smart checks on a routine basis. It's usually multiple times a day, but once a day we record
the data. That's what makes up our drive stats. And so it's each drive just holding it and saying,
this is my current status right now of SMART X and SMART Y. Some of the values update on a regular basis, like, um, hours. There's a power-on hours, so I assume once an hour that thing gets updated. Then there's temperature, which I think is probably something that's continually updated. I don't know if it's once a minute, once every five minutes, or whatever, but it has the temperature of the drive. So there are a lot of other things in there besides, you know, how much trouble I had writing or reading from a particular sector.
Sectors, as the case may be.
Command timeout is a really good one because that indicates that the thing is really busy trying to figure out what it was supposed to be doing.
And it's starting to stack up commands. And then there are some others
that are interesting indicators, like high fly writes, which is how far the head flies from the disk. And that number is, the tolerance on that is so thin these days. I mean, when you're talking,
I mean, nine platters in a drive, that head is really, really close. And so if it comes up even
just a little bit, it's getting in everybody's way. So that's another thing that gets monitored, and so on.
So I was looking at a drive while you were talking through that bit there. I have an 18 terabyte, one of many in this array, and I was looking, and you'd be happy to know that my command timeout is zero. I don't know what a bad number would be other than zero. So at what point does
the command timeout of that or of a disk get into the bad zone? It's a good question. It does vary
and it usually comes, that particular one happens to come with usually some other error. Okay. One
of the things we found when we analyzed smart stats individually is we couldn't find any single smart stat which was indicative
by itself of failure. Okay. Until it got to be really, really weird. Like, I'm finding bad sectors. And so having a few bad sectors on a drive is just a thing that happens over time, and they get remapped and everybody's happy. But having, you know, what sounds like a lot, well, maybe that's not a lot on an 18 terabyte drive, because the sector size is the same basically. But it is a lot on a 500 meg drive or 500 gig drive, you know. So it's somewhat relative, but
no individual one generally is indicative of failure. It usually is a combination
of it. And then some drives just fail. They don't give you any indication at all. And then they just
go, I'm done. I'm out of here. And we've seen that. And roughly 25% of the drives we see,
at least the last time I looked at all of this, just failed with no indication in smart stats at
all. They just rolled over and died. And there doesn't seem to be anything in relation to a
brand or a model or a size or how long they were in. It just seems to be they get there. Now,
inevitably, what happened is before they failed, something went wrong, okay? And maybe the smart stats got recorded, but we don't record them every minute, because it would just get away from us. So maybe we missed it, okay? So I'm open to that as a possible explanation. But most of them, you do get a combination of the five or six different smart stats that we really pay attention to. A combination of those, you'll get those showing up about 75% of the time.
And like I said, there are some, you know, command timeouts is a good one where,
hey, I'm having a little trouble.
I'm having a little trouble.
Okay, I caught up and it goes back down to zero.
Okay.
And then there are others like bad sector counts.
They just continue to go up because, go up because they don't get better.
They only get worse.
Once they get mapped out, they're gone.
And you have to understand that about each of the stats, as to whether or not it's static, an always-up number, or whether it can get better.
Things like high fly writes, we see that happen. Almost every drive has that at some point or another. But the favorite way to look at this is, you look at it over a year, and there's 52 of them. 52 sounds like a high number, but that's once a week. If they all happened in an hour, I have a problem. And so there's a lot of that going on with the smart stats as well.
What causes a high fly write?
Is that like movement in the data center, in the physical hardware movement that makes it?
Or is it?
Could be.
It could be.
It could just be the tolerances are so fine now that the actuator moving through there and you get a
little vibration to your point, or maybe there's just something mechanical in that actuator where
it occasionally has a little bit of wobble in it for no particular reason. But it usually has to do
with some movement. It's never a good idea to have a spinning object,
you know, 7,200 or 15,000
or whatever RPMs
and a little thing stuffed in there,
you know, less than a human hair
and start jiggling it around.
So, you know.
Yeah, for sure.
Bad things happen.
Bad things happen.
Let's talk about the way you collect
and store this smart data.
Let me take a crack at this.
This may not be it at all.
If I were engineering the system, I would consider maybe a time series database, collect that data over time, graph it, et cetera.
How do you all do that kind of mapping or that kind of data collection and storing?
Yeah, so once a day, we record the data for posterity's sake.
Like I mentioned earlier, we do record it.
Like take a whole snapshot
of what gets spit out from SmartCTL.
You just grab all that?
We grab a particular, we call them pods, okay? The original storage pod, 45 or 60 drives, right? And we go pod by pod. That's the
way we think about it. So we go to that pod and we run smartctl on the drives in there. We pull
out the data, we store that data, and then we add a few things to it. We keep the date, we keep a
serial number, a model number, and the size of the drive, and some other basic information that we
record from the device itself.
So we know what storage pod it's in, we know which location it's in, and so on and so forth.
At that point, we have the data, okay?
And then we store it into, I'll say, a database for a backup thing. It actually stores locally and then gets transferred up overnight. That's part of what the boot drives get to do, that fun stuff, for us. Okay. And
so we take, we snapshot all of that data, we store it locally, and then it gets brought up overnight.
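A minimal sketch of what that once-a-day snapshot could look like, assuming smartctl's JSON output (smartmontools 7.0 or later) and device names invented for the example; this is just the shape of the record (date, serial, model, capacity, raw SMART attributes), not Backblaze's actual collection code:

```python
import json
import subprocess
from datetime import date

def snapshot_drive(device: str) -> dict:
    """Run smartctl against one drive and keep the fields that go into a daily record."""
    out = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True, text=True, check=False,  # smartctl's exit code encodes status bits
    )
    data = json.loads(out.stdout)
    return {
        "date": date.today().isoformat(),
        "serial_number": data.get("serial_number"),
        "model": data.get("model_name"),
        "capacity_bytes": data.get("user_capacity", {}).get("bytes"),
        # one entry per SMART attribute, raw value as reported by the drive
        "smart": {
            attr["id"]: attr["raw"]["value"]
            for attr in data.get("ata_smart_attributes", {}).get("table", [])
        },
    }

# Walk the drives in one (hypothetical) pod and write the day's records locally;
# in the pipeline described above, that file would be shipped upstream overnight.
records = [snapshot_drive(f"/dev/sd{letter}") for letter in "abcd"]
with open(f"drive_stats_{date.today().isoformat()}.json", "w") as f:
    json.dump(records, f)
```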
Then there's a processing system which goes through and determines the notion of failure.
Okay. So if a drive reported something to us, it didn't fail yet. All right. The next day we go in
and we go back to that same pod, for example, and we notice that one of the drives is missing.
We look for 60.
We only got 59.
What happened?
So that gets reported, and then that becomes part of what the software on our side processes over the next few days.
Tell me about that missing drive.
What happened to it? All right. And at that point, we interact with some of our other systems, our maintenance and our inventory systems, to see what actions might have been taken on that drive. We also have some smarts built into the software itself to help identify those things. And if all of those things make sense, then we go, it failed or it didn't. You know, it didn't, because it was a temp drive that was in for a few days,
and then it got taken out and replaced by the clone.
So it didn't really fail.
It just didn't get a chance to finish, right?
So we shouldn't fail it, right?
Or we migrated that system.
We took all of the drives out, okay?
And we went looking for them, and they weren't there.
But we don't want to fail 60 drives. And so that's not what happened.
So the system figures all of that kind of stuff out.
It looks, like I said, it can connect up to the inventory and maintenance systems to help validate that, because we have to track all of those things, for obvious reasons, by serial number.
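Very roughly, the decision he's describing looks something like the sketch below. The action names and lookups are invented stand-ins for the real maintenance and inventory systems, and the real software is more involved and can take days to settle on an answer.

```python
# A rough sketch of the "is a missing drive actually a failed drive?" decision
# described above. Action names and data structures are hypothetical.

def classify_missing_drive(serial: str,
                           maintenance_actions: dict[str, list[str]],
                           migrated_pods: set[str],
                           pod: str) -> str:
    actions = maintenance_actions.get(serial, [])
    if pod in migrated_pods:
        return "not_failed"       # whole pod migrated; don't fail 60 drives at once
    if "replaced_by_clone" in actions:
        return "not_failed"       # temp drive swapped out, never really failed
    if "pulled_for_failure" in actions:
        return "failed"           # matches the maintenance record
    return "pending_review"       # keep checking over the next few days

print(classify_missing_drive("ZL2ABC123",
                             {"ZL2ABC123": ["pulled_for_failure"]},
                             set(), pod="pod-0042"))   # failed
```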
So it's fairly complex software that goes in and does that analysis. And it takes a few, sometimes a few days for it to
kind of figure out whether a missing drive is really a failed drive or whether a missing drive
is a good drive and just got removed for a valid reason. And then once that happens, then we record
it. Once a quarter, I go in and I pull the data out and look at it. And I'm looking at it for
the most recent quarter. I actually go back in and validate all of the failures as well by hand
against the maintenance records in particular, just because we want that data to be as accurate
as possible. And then we confirm all of that. And almost always we get a really solid confirmation.
If we find anything funny, we keep looking.
And that's the data we publish, and that's the data we base the reports on.
So in the sponsor minisode here in the breaks, I'm with Tom Hu, dev advocate at Sentry on the CodeCov team.
So Tom, tell me about Sentry's acquisition of CodeCov.
And in particular,
how is this improving the Sentry platform?
When I think about the acquisition,
when I think about how does Sentry use CodeCov
or conversely, how does CodeCov use Sentry?
Like I think of CodeCov
and I think of the time of deploy.
When you're a software developer,
you have your lifecycle,
you write your code, you test your code, you deploy, and then your code goes into production, and then you sort of fix bugs. And I sort of think of that split in time as like when
you actually do a deploy. Now, where CodeCov is really useful is before deploy time. It's when
you are developing your code. It's when you're saying, hey, like, I want to make sure this is
going to work. I want to make sure that I have as few bugs as possible. I want to make sure that I've thought of all the errors and all the edge
cases and whatnot. And Sentry is the flip side of that. It says, hey, what happens when you hit
production, right? When you have a bug and you need to understand what's happening in that bug,
you need to understand the context around it. You need to understand where it's happening,
what the stack trace looks like, what other local variables, you know, exist at that time so that
you can debug that. And hopefully you don't see that error case again. When I think of, like, oh, what can Sentry do with CodeCov? What can CodeCov do with Sentry? It's sort of taking that entire spectrum
of the developer lifecycle of, hey, what can we do to make sure that you ship the least buggy code
that you can? And when you do come to a bug that is unexpected, you can fix it as quickly as possible, right?
Because, you know, as developers,
we want to write good code.
We want to make sure that people can use
the code that we've written.
We want to make sure that they're happy with the product,
they're happy with the software,
and it works the way that we expect it to.
If we can build a product, you know,
the Sentry plus CodeCov thing,
to make sure that you are de-risking your code changes
and de-risking your software
then we've hopefully done the developer community a service.
So Tom, you say bring your tests and you'll handle the rest.
Break it down for me.
How does a team get started with CodeCov?
You know, what you bring to the table is like testing and you bring your coverage reports.
And what CodeCov does is we say, hey, give us your coverage reports, give us access to your code base so that we can,
you know, overlay code coverage on top of it and give us access to your CICD. And so with those
things, what we do and what CodeCov is really powerful at is that it's not just, hey, like,
this is your code coverage number. It's, hey, here's a code coverage number, and your reviewer also knows,
and other parts of your organization know as well.
So it's not just you dealing with code coverage
and saying, I don't really know what to do with this.
Because we take your code coverage, we analyze it,
and we throw it back to you into your developer workflow.
And by developer workflow, I mean your pull request,
your merge request.
And we give it to you as a comment so that you can see,
oh, great, this was my code coverage change. But not only do you see this sort of information,
but your reviewer also sees it and they can tell, oh, great, you've tested your code or you haven't
tested your code. And we also give you a status check, which says, hey, like you've met whatever
your team's decision on what your code coverage should be, or you haven't met that goal, whatever
it happens to be. And so CodeCov is particularly powerful in making sure that code coverage is not just a thing that you're doing on your own
island as a developer, but that your entire team can get involved with and can make decisions.
Very cool. Thank you, Tom. So hey, listeners, head to Sentry and check them out. Sentry.io
and use our code CHANGELOG. So the cool thing is, is our listeners,
you get the team plan for free for three months.
Not one month, not two months, three months.
Yes, the team plan for free for three months.
Use the code changelog.
Again, sentry.io.
That's S-E-N-T-R-Y.io.
And use the code changelog.
Also check out our friends over at CodeCov.
That's codecov.io.
Like code coverage, but just shortened to CodeCov.
codecov.io.
Enjoy. In terms of your data centers, do you have many of them throughout the world?
I assume the answer is yeah.
Yeah, we have five right now.
Five, four in the U.S. and one in Amsterdam.
Okay.
And they all run the same software, and the process is the same.
And the automation all occurs in the front end.
That's all fun and stuff like that.
The validation, if you will, is me.
And a little bit of that comes from me putting my name on this thing,
so I want to make sure the data's right.
So I don't want to automate myself yet.
Not yet.
We'll have an Andy Klein AI at some point.
Yeah, exactly.
Well, you know, I'm not quite ready to turn drive stats over to ChatGPT yet.
So, and I think, I don't know how long I can continue.
I mean, we're up to almost a quarter of a million drives right now.
Luckily, we get, you know.
In service currently?
In service now, yeah.
That's a lot of drives.
And, uh, and so in any given quarter we got, you know, in the last quarter we had 900 and something
drives that failed. That sounds like a lot, except we have 250,000. So no, but it is getting to be,
it is an intensive kind of work, a bit of work for me to do the validation. But I do think it's worth it, and yes, we are looking at systems which will help improve that bit of validation as well. But like I said, this just comes historically from eight years of me putting my name on this, of wanting to make sure that the stuff that we publish is as good as it can be.
Doing some quick math here, it sounds like maybe 99.8% of your drives remain in service, like 0.2% is what fail in a quarter, roughly.
It could be. That's a fair number. We actually do
an interesting calculation because that basic assumption there assumes that all of those drives
have been in service for the
length of time, the same length of time.
And that's not the case, of course, right?
And so we actually count what we call drive days.
And so that's just a day.
A drive is in service for one day.
That's one drive day.
So if you have a drive model ABC and there are 10 drives and those 10
drives have been in service for 10 days, okay, that's 100 drive days for that model. I think, you know, it's the simplest way to do it. And so we count that. And then we count failures,
of course, for that model or all of the drives or whatever the case may be. Model is the most
logical way to do all of this stuff.
Any other combination and you're,
I'll be honest, you're cheating a little.
You know, we do it for our,
we do it quarterly for all of our drives
and then we do it quarterly for a lifetime
for all of our drives, you know, each quarter.
But we also publish them by model number.
And the model number is the more appropriate
way to look at it.
Yeah.
Yeah, and not just the macro number. The macro number we're going to come up with, for example, might be like
1.4%, 1.5%. And that's a solid number. Okay. And it's a good number, but it's for all of the different models, and they vary in their failure rates over a given period of time. So drive days
is the way we do it. When we first started down this road back in
2013, we spent some time with the folks at UC Santa Cruz who are really smart about things
like this. And they gave us a very simple formula, which was based on drive days to do that
computation of the failure rates. And then we explain it.
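For reference, that drive-days formula is simple enough to sketch in a few lines of Python; the numbers below are invented just to show the shape of the calculation, not real fleet data.

```python
def annualized_failure_rate(drive_days: int, failures: int) -> float:
    """AFR (%) from drive days and failures, per the drive-days method above."""
    drive_years = drive_days / 365
    return 100 * failures / drive_years

# Made-up numbers: 10 drives of one model in service 10 days each is 100 drive
# days; one failure in that tiny window annualizes to a huge rate, which is why
# small sample sizes get called out in the quarterly reports.
print(round(annualized_failure_rate(100, 1), 1))         # 365.0 (%)
print(round(annualized_failure_rate(2_000_000, 75), 2))  # 1.37 (%)
```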
Almost every quarter, we have to explain it, because most people do exactly what you did. How many drives you got? How many failed? And you do the division. And it's the most logical thing to do, but it doesn't account for the fact that, like I said, at any given moment we have new drives coming in, we're taking out old drives, and so on. So it changes. All of that changes. And the drive days method does account for that.
Do you do much preparation for a drive to go into service?
Like, do you do a burn-in test?
Do you worry about bad sectors before you put them in?
Or you just roll the dice because you've got so much history that you kind of have confidence?
Like, how do you operate when you add new drives to the system?
That's a really good question.
When we first started out, we would put all of the drives in into a storage pod, okay? And we'd burn it in, so to speak. We'd run it for a week or so.
We still do that to a degree, but that burn-in period's a whole lot less. But when we replace
a single drive, we don't burn it in, if you will. They put it in and it obviously has to run through a series of F6 and so on in order
to even, you know, did it come up?
What does it look like?
What does the smart stats look like?
And if it passes all of those basic things, then it's in service.
I think one of the things that's really helped us with that over the years, I've been in,
my goodness, it's probably four or five years now. I was at the Seagate facility in Longmont, Colorado,
where they do their prototyping for all of the drive builds and so on and so forth.
And one of the things that they do, and they do it at all of their factories at some point,
is once the drives come off the line, so to speak,
is they actually put them in a testing box.
And they run that test, some tests on it
for a few hours, days, whatever their period of time is.
And you can see that when you get a, quote,
brand new drive, it has power on hours,
16, 20, 25, whatever.
And so it's not zero.
So they did some banging on it
to make sure you don't get a DOA drive.
And so I think that has helped. And I'm
relatively sure all of the manufacturers do something like that, although Seagate's the
only one I've actually ever seen. Yeah. Well, that's my drive of choice. I use Seagate's.
I was on IronWolf for a bit there, then IronWolf Pro in some cases. I think mainly for the warranty. With the Pro label you get a longer warranty, which is nice. Not necessarily a better drive, but definitely a better warranty. And then the newest drive I've gotten from them was the, I think it's called the Exos? I'm not sure. Do you know how to pronounce that, by any chance?
Uh, that's as good a chance as any. I'll go with that one. I don't know.
Exos is Exos. There you go. We'll call it Exos then. I think that probably sounds better.
I think that's the ones we actually use as well.
Yeah.
Yeah.
So it's interesting.
We trade off, and we have an opportunity to do something which I'll say Joe Consumer doesn't have.
We can trade off failure rates for dollars, right? So, and I'm not going to pick on any drive manufacturer, but if a particular drive has a failure rate that sits at 2% and a different drive has a failure rate of 1%, all right?
We can say, we look at the cost and we can say,
well, wait, the one with 2% cost us $100 less.
And the lifetime cost of that and replacing these drives
over a five or seven year period or whatever it is, we're going to save a half a million dollars if we just buy those.
Yeah.
So we can do that.
Okay.
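To make that trade-off concrete, here's a back-of-the-envelope sketch. The fleet size, prices, and rates are made up for illustration; the point is only that a cheaper drive with a higher failure rate can still win on lifetime cost.

```python
def lifetime_cost(fleet_size: int, unit_price: float, afr: float, years: int) -> float:
    """Purchase cost plus replacement drives over the service life.
    afr is the annualized failure rate as a fraction (0.02 == 2%)."""
    replacements = fleet_size * afr * years
    return fleet_size * unit_price + replacements * unit_price

# Made-up numbers: 10,000 drives over a five-year life.
cheaper_but_flakier = lifetime_cost(10_000, unit_price=250, afr=0.02, years=5)
pricier_but_solid = lifetime_cost(10_000, unit_price=350, afr=0.01, years=5)
print(cheaper_but_flakier, pricier_but_solid)  # 2750000.0 3675000.0
# The 2% drive saves roughly $900k here despite failing twice as often.
```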
And people at home with one drive don't really have that.
Maybe that's not the decision they want to make.
And that's why we always tell them, hey, you know, there's this company that backs up things.
Maybe.
But anyway.
That's right.
Backblaze.
Yeah.
So that's cool, though, that you get to do that kind of trade-off.
As you said, you know, dollars per failure, things like that.
I think that's really interesting.
Do you have some program or formula that you use behind the scenes that lets you make those choices?
And then, too, I guess when you're buying the drives, can you use your data as leverage? Well, hey, you know, HGST, you know, based on our stats from the last 10 years, your drives fail more often. So we want to pay less for your drives because we'll have to buy more of them sooner.
We're happy to keep buying them.
However, they're going to fail more often and more frequently based on our data.
Does that give you leverage?
So I'm not the buyer, but I do know that the data gets brought up from time to time
in conversations with the various different companies. Now, inevitably, they have a different
number. All right. They always do. They publish it on their data sheets. And every drive I've
ever seen has either a 1% annual failure rate or a 0.9% failure rate. So that's the range.
It's like 0.9 to 1.
And so that's what they believe is their number.
And they do that through calculations of mean time between failures and so on and so forth
of all of the components.
And so that's the number they believe.
Okay.
Now, you know, whether or not we influence that, we say, well, look, we'll go buy these
and we'll do this trade-off.
You never know what numbers you're going to get from a manufacturer at a given time.
The other thing that we do is I don't need the latest and greatest drives in my data center because why would I overpay for them?
So we're going to use drives that are somewhat down the price curve and have
a proven capability in the market already. And so you're better off negotiating from a point of view
of where you are in that price curve than your drives fail more or your drives fail less kind
of thing. One. And two, model by model is so much different. You may get one model of a 16 terabyte
drive that, you know, let's just say Seagate makes and its failure rate is 0.5%. That's,
it's great. 0.5, you know, half a percent. And then you may get another 16 terabyte
drive from Seagate and it fails at 2%. Okay. So, you know, what do you do, right? You just negotiate, you
know, based on where they are in the curve. That's the best thing to do. If you're going to buy,
you know, 22 terabyte drives right now, you're paying a pretty good premium for it. So I don't
want to buy 22 terabyte drives right now. I'll wait until the price comes down, you know, and
then we can buy 22s or we can buy
24s or whatever the case may be. And we'll know a lot more about the history. And, you know,
so we're buying into, we're buying into the pricing curve as it drops.
We talk a bit about your storage pods themselves. I know that there's some history there from
Protocase, and I've read up on the history because I'm a 45Drives fan, 45Drives being the brand. I kind of know some of the storage pod history, where you all had a prototype and a desire for a design.
You went to Protocase and collaboratively you came up with what was StoragePod 1.0.
I think you even open sourced the design and stuff like that.
So people can go and buy it, which really drove a lot of the early demand for the 45Drives spinoff from Protocase to become a legitimate brand. And then there were needs of folks who wanted high density storage, but not the storage pod Backblaze version, because you had different circumstances and different choices you were making, because you had different business measures you were basing choices off. Like you said, you don't want the latest, greatest drive; you want something that actually proved itself in the market. You know, you had different demand curves you were operating on, so you're not
the same as everyone else. Long story short, give me the story of the storage
pod. Help me understand the design of it, so to speak.
15 drives, 30 drives, 45, 60. I know that there are 15
per row. I don't know what you call them, but give me the layout of the
storage pod. Tell me more about it. Sure. So the 15, just to start with, is actually the size of the first array we used. And
we used RAID 6 when we first started. And so we did it in, I think it was a 13+2 arrangement.
And so 45 just came from, you know, three rows effectively. Now we actually just mechanically, we didn't lay them
out like an array in each row. We actually moved them around and that had to do with the fact that
the back planes that we use were five drives each. And so you didn't want to overload a given back
plane with all of the commands going on. So you, you wanted to move it around. It was just a whole
lot more efficient that way. It also had
to do with the fact that if you lost a backplane, okay, you would lose five drives and suddenly that
array, you couldn't get any data out of it. So that was a way to improve durability.
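A tiny sketch of why that spreading matters, using the original layout as described (three 15-drive RAID 6 arrays, 5-drive backplanes) and a simple round-robin placement that is assumed here for illustration; the actual placement logic was theirs.

```python
from collections import Counter

DRIVES, ARRAYS, PER_BACKPLANE = 45, 3, 5

# Round-robin drives across the three arrays, five slots per backplane.
layout = [(slot // PER_BACKPLANE, slot % ARRAYS) for slot in range(DRIVES)]
#           ^ backplane index        ^ array index

# Worst case: how many drives of any single array sit on one backplane?
worst = max(
    Counter(array for bp, array in layout if bp == b).most_common(1)[0][1]
    for b in range(DRIVES // PER_BACKPLANE)
)
print(worst)  # 2 -- losing one backplane costs an array at most 2 drives,
              # which a 13+2 RAID 6 array can still rebuild from.
```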
But we started out building those, and you're exactly right. We had a design. We sketched it out in our head. We actually built it out of wood. Okay, and someplace, I think in a blog post somewhere, there's a picture of a wooden storage pod with the slats and everything. And we built it out of wood and we said, hey, we don't know how to bend metal. We don't know how to do anything. But what we understood was that the design would work.
Because before we built it out of wood, we actually plugged together a bunch of five-unit
drobo-like systems and did all of this analysis and said, this will work.
And if we do it right, we can get the cost down.
Because if we were going to use, for example, even at that time, S3 as our backend, instead of doing
it ourselves, we couldn't offer our product at the price point we wanted to. We would actually
have to 10X it. So rather than getting unlimited storage for your PC for five bucks a month at the
time, you were going to have to pay a lot more. So we decided to build our own, right? And design
our own. And then we went to the folks at Protocase and I don't know how we found them, to be honest with
you, but they helped build that and build all, and they were, they're really good at it. You know,
they really understand how to bend metal and they can produce all of the designs and they,
and that's exactly what we did. And then we turned around and said, okay, well, this is great. And we
like it. Let's open source it. Let's tell the world about this. And that's what we did way back in 2009 now.
And then we changed the design over the years and added some things. But to your point, at some
point, the folks at Protocase said, well, this is really cool. Maybe we should be making these for
other folks because we had made the decision that we wanted to be a software company and not a hardware
company. And people were asking us to make storage pods for them. And we went, well, there's like
nine of us who work here. I don't think we really, we don't have a lot of time.
That's not our business model.
And so let's, no, we're not going to make it. Now, the truth is we actually did make a few of them
because somebody was going to give us a whole bunch of money for them
who shall remain nameless.
And so we took the money and made a couple of storage pods,
but that wasn't going to be our business.
And Protocase stepped forward and said,
well, I think this is a really cool idea.
Maybe we should start doing this.
And that's where they kind of started.
And then they could customize them.
We needed them for a very specific purpose.
We used commodity parts in them.
When we published it, you could build your own.
You can go down and buy your own Intel cards
and your own Supermicro motherboards.
And the only thing you had to do that was funny was the power supply cable had to be made, because it went to two different power supplies and came into the
motherboard. But other than that, you know, everything else was basically do it yourself.
Even the backplanes at the time you could buy. So it was really, really cool that they could do
that. And a lot of folks actually, once we published it, actually started building their
own storage pods, which is even cooler, right?
But the 45 drive guys took it and they said, you know, if we could let people customize
this, or maybe we'll produce some different versions of it.
Let's make a really fast version.
Yay.
You know, and they could upgrade it.
And that's where they started to differentiate themselves.
Then they went to direct connect drives instead of using backplanes.
And I don't
know exactly when they made that decision, but that's kind of where we parted with them, because they wanted to go down the direct connect drive path, which was great. And I think to this day,
that's the technology that they use. And we stayed with backplanes. And so we eventually went and
used the other manufacturers. These days, to be quite honest with you, we actually buy storage pods, if you will, from Supermicro.
And they are Supermicro servers.
They are not ours.
They're not even painted red.
And, you know, and we just buy them off the rack, so to speak, you know, because they give you the ability to pick a model and put a few options on it.
And we say, give me 500 of those.
And they ship them.
And we're happy as clams with those.
And we don't have to worry about storage pods and updating the design or anything like that.
And the 45 drive guys, they're doing great.
They're really, I like them because they're the true customization place.
You can go over there and say, hey, I want one of these that kind of looks like this and paint it blue.
And oh, by the way, I like that piece of software.
So let's put that on there, put our clone on it, blah, blah, blah, blah, blah.
And you get what you want and then they support it, which is even better.
So, so cool.
I think it's interesting, because I have an AV-15, is what they call it. That's the model number for their Storinator, 10 feet to the left of me over there, with 15 drives in it.
And so mine is an AV-15.
That's what the 15 is.
It's a 15-drive array.
It's based on this original storage pod
that you all started out with.
I think that's just so cool how, you know,
I never knew you.
I didn't know the full Backblaze story.
I had come to learn of 45 drives.
I was trying to build a high density storage array for myself for
our own backups and a bunch of other fun things and just a home lab scenario.
And it's just so cool to have a piece of hardware over there
that's based on early ideas of you all.
And now you've actually abandoned that idea because you're like, you know what? We want even
more commodity. We had a great design, sure, but we actually just want to get it directly from Supermicro
and just literally take it from there and put it into our racks.
Now, can we get into your data center a bit?
Because I got to imagine these things get pretty heavy to like lift.
I read the blog post that you had out there, which was like a kind of a behind the scenes
of your US East data center.
And I actually just noticed this.
I'm glad you mentioned the change of heart when it comes to your storage pod that you no longer use a custom version for yourselves, that you just buy it directly from Supermicro.
So it's still a 4U server, which is a great size for that. And you have them stacked 12 high in a cabinet, and you leave 4U at the top for a 1U core server and an IPMI switch interface.
Can you talk to me about that design, the entire stack?
How much work did it go into designing that 12 high cabinet, for example? Well, the first things you
have to start thinking about, obviously, are how much space it is. But the next thing you have to
think about is electricity and how much electricity you can get to a rack. Because let's face it,
you're spinning that many drives, it takes up a little bit of juice. And so some of the earlier
places we were in from a data center point of view, they said, okay, so here's a rack, and here you get, you know, here's 25 amps, have a good time. And oh, by the way, you can only use 80 percent of that. And so you suddenly go, I can only stack storage pods so high, especially as the drives got bigger and started soaking up more and more electricity. And so now you go, well, I can put four terabyte drives here, but I can't put anything with eight, because, okay. So, but that's changed over time as people actually realized,
one, that these things use electricity. So you go into a facility like that and you say, okay,
so do we have enough? How much electricity have we got? Okay, we have plenty. Great.
For the drives today, the drives tomorrow, and so on. And then it becomes a floor layout issue.
How do you optimize the space?
How much air cooling do you get?
Because these things will definitely produce a little bit of heat.
So you could put all of the racks really, really close, okay, if you wanted to.
But then you're not getting the proper circulation.
It's really difficult to do maintenance and all of that.
And there are a lot of really smart people out there who kind of specialize in that.
Once you decide on where you're going to put them, then it's not only your storage pods,
but all of the ancillary equipment, the other servers that go in.
So, for example, restore servers or API servers.
So now that we do the S3 integration,
the B2 cloud storage piece,
we had our own APIs.
Now we also support the S3 APIs.
Well, they don't work the same.
So when you make an S3 call,
it actually has to kind of be
turned into our call on the back end.
And we had to have a group of servers to do that.
And so we have those kinds of servers. And then you have just, you know, utility boxes and monitoring systems and so on
and so forth that all have to be built into that. So we may have an entire rack of nothing but
support servers. You know, we have the architecture as such that there's a, you know, you have to
have, you have to know where all of the data is. And so we have servers in there, that's their job.
They know where the data is, which storage pod it is, and so on and so forth.
So you go and say, hey, I would like to have a file, or, you know, and you ask that device, you know, assuming you've been authenticated, blah, blah, blah, blah, blah, right?
And it says, okay, you'll find it over here.
And here you go.
Have a good time.
And the same thing when you want to write something. Okay. The way we write things is pretty straightforward, which is we literally connect you
to a tome, actually to a particular pod and a tome. Once you say, hey, I have some data and I
want to write it. And you say, great, here's where you're going to put it. And you're right there.
And then we have to record all of that, of course, and keep track of it for future reference so you
can get your data back. So that
whole process of laying things out, like I said, the biggest one starts with what's your footprint
and then how many racks can you get in there, but how much electricity can you support,
how much cooling is there, and so on. And then, of course, you just have to deal with the fact
that these things are big. So going up is really, really cool because we can get it.
Okay.
The only issue ever became one of does the lift guy go high enough?
Good old Luigi there.
Go high enough so that we can get them out, so we can get them back down.
What do we have to do?
If I have to bring a ladder with me every time to service a storage pod, maybe that slows me down a little bit.
If you can lift it.
They are heavy, but you can get them on the lift.
Well, I mean, even my 15 drive array, if I have it fully stacked, to put it back in the rack or to pull it out, and it's got rack rails, I mean, it's heavy. I didn't weigh it, but I mean, it's effort. It's not like a little thing. It's thick, and it's just 15 drives. Now, if you get 60 in there.
Yep. And they come bigger. You can get them as large as, I think I've seen 104 now in there, but, um, you know. And so with 60, yes. Okay.
You don't want to drop it either. Right. I mean, that would be the worst thing ever.
No, you don't want to drop it. When we first started the company, myself and Yev,
who's one of the other guys in marketing, a bit of a character, him and I used to go to
local trade shows and stuff and we'd bring a storage pod with us, but we only brought it
with five drives in it because quite frankly, we had to lug it like through streets and upstairs and all kinds of things like that.
So, yes, they do get quite heavy.
And that's why we have the rack in place.
And no, we don't let people cart them around and all of that.
But, yeah, we do want to optimize the space.
But we do need to get in them from time to time to replace a drive.
So you don't want them to be right at the top of the rack.
So you put in some of the other equipment,
which doesn't require as much hands-on maintenance up there.
So a 52U server rack, you're stacking them 12 high. They weigh roughly 150 pounds each, about 68 kilograms. And that's just an assumption on my part.
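Just to put numbers on that, using only the figures quoted here and ignoring the rack itself and any support gear:

```python
# Back-of-the-envelope rack weight from the numbers mentioned above.
pods_per_rack = 12
pod_weight_lb = 150            # roughly 68 kg each
total_lb = pods_per_rack * pod_weight_lb
total_kg = total_lb * 0.4536
print(f"{total_lb} lb (~{total_kg:.0f} kg) of storage pods in one rack")
# -> 1800 lb (~816 kg), before counting the rack, switches, or support servers
```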
And then to lift that, I think, in the details here in your blog post, there's Guido.
Guido, yeah.
Guido's mentioned, and I think that's like a server lift. It's like a thing. Like, how'd that come about?
So that was, that started with the very first ones, the 45-drive pods. Our first rack that we built was like a half-height rack, and it only went up four high. That was our first setup. And as soon as it went higher than four, people went, this is really heavy, we need to figure this out. So you can get a server lift, and that's what we did. We actually had to raise money way back when to buy a server lift, because they're not cheap.
And that was Guido, the server lift, who was named after the lift in Cars, by the way, the movie Cars.
And then later on, we added Luigi.
I know all of the data centers have their own.
I don't think the rest of them have funny names for them, although I'll have to ask, I guess.
Yeah, I was thinking that was, like, the name from that one. Wasn't Luigi the character that sold the tires, and Guido was his sidekick? Is that correct? I think so. It's been a few years since I watched the movie.
I like that though. That does make sense. Yeah. Okay. So, yeah, I'm looking here quickly. Guido was the kind of blue lift-looking thing,
and I believe Luigi was the Ferrari lover.
There we go.
Italian.
Yeah, so that was my buddy Sean,
who ran our data centers for a number of years
before moving over to another job within Backblaze.
But he was the one who named those things.
So he has a bit of a sense of humor. What's up?
This episode is brought to you by Postman.
Our friends at Postman help more than 25 million developers to build, test, debug, document, monitor, and publish their APIs.
And I'm here with Arnaud Lauret, API Handyman at Postman.
So Arnaud, Postman has this feature called API governance, and it's supposed to help teams unify their API design rules, and it gets built into their tools to provide linting and feedback about API design and adopted best practices.
But I want to hear from you.
What exactly is API governance and why is it important for organizations and for teams?
I think it's a little bit different from what people are used to
because for most people, API governance is a kind of API police.
But I really see it otherwise.
API governance is about helping people create the right APIs in the right way.
Not just for the beauty of creating right APIs, beautiful APIs, but in order to have them do that quickly, efficiently, without even thinking about it,
and ultimately help their organization achieve what they want to achieve.
But how does that manifest?
How does that actually play out in organizations?
The first facet of API governance will be having people look at your APIs and ensure
they are sharing the same look and feel as all of our APIs in the organization.
Because if all of your APIs look the same, once you have learned to use one, you move
to the next one.
And so you can use it very quickly because you know every pattern of action and behavior. But people always focus
too much on that. And they forget that API governance is not only about designing things
the right way, but also helping people do that better and also ensuring that you are creating the right API. So you can go beyond that very dumb API design review
and help people learn things by explaining,
you know, you should avoid using that design pattern
because it will have bad consequences on the consumer
or implementation or performance or whatsoever.
And also, by the way, why are you creating this API? What is it supposed to do?
And then through the conversation, help people realize that maybe they're not having the right perspective creating their API.
They're just exposing complexity and inner workings instead of providing a valuable service that will help people.
And so I've been doing API design reviews for quite a long time, and slowly but surely people shift their mind from, oh, I don't like API governance because they're here to tell me how to do things, to, hey, actually I've learned things and I'd like to work with you, but now I realize that I'm designing better APIs and I'm able to do that alone, so I need less help, less support from you. So yeah, it's really
about having that progression from people seeing governance as I have to do things that way to I
know how to do things the correct way. But before all that, I need to really take care about what
API I'm creating, what is its added value, how it helps people.
Very cool.
Thank you, Arnaud.
Okay, the next step is to check out
Postman's API governance feature for yourself.
Create better quality APIs
and foster collaboration
between development teams and API teams.
Head to postman.com slash changelogpod.
Sign up and start using Postman for free today.
Again, postman.com slash changelogpod.
So we kind of touched a little bit on this to some degree, but tell me, here's two questions
I want to ask you that I want to get to at least.
I want to know how you all buy hard drives, and then I want your advice for how consumers should buy hard drives.
So we touched on it a little bit, but give me a deeper detail.
Like how do you choose which manufacturers?
It seems based on your data, you have four that you choose from.
I believe HGST, Seagate was one we talked about already.
Western Digital, of course, is always in that mix.
And then I think sometimes Micron's in there.
It depends.
Those are the SSD stats.
Toshibas.
Toshibas are the fourth.
Okay, so you primarily map around four different manufacturers.
How do you, like, do you buy them in batches?
Do you have a relationship with the manufacturer?
Do you have to go to distributors?
How does it work for you all to buy?
Like, how much of a lift is it for you to buy drives?
And when you do buy them, I'm assuming it's once a quarter
because you're tracking failures at once a quarter.
How often are you buying new
and how many do you buy when you do buy them?
So it's actually a very variable process.
And HGST, just to kind of fill in the gap there, HGST as a company got bought by Western Digital. It got split up between Western Digital and, I think, Toshiba years ago. And so
we have HGST drives, but they're historical in nature. And so now we deal with WD, Western
Digital, to get what effectively are HGST drives. But the process is you maintain relationships with
either the manufacturer directly or the manufacturer will send you to a distributor.
You almost never buy directly. We don't buy directly from the manufacturer. You always buy
through a distributor. We always buy through one. Now, maybe Google or somebody like that goes and
can change that, okay? But companies of our size, we've always bought through a distributor.
It's just the way it works.
That's where the contract is with and so on and so forth.
We don't buy drives.
Well, originally, we used to buy drives as we could afford them.
But those days are over.
And now we buy them based on need. The first thing you want to do is figure out what your storage needs are out over, let's say, the next year and a half, two years. How much do you think you're going to need? How much growth in storage? And then you start dividing by where you are in that cost curve. Remember, we talked about that earlier. So if I'm trying to buy something, I want to buy in the middle to the bottom end of the curve, but sometimes you can't get quantity down there through a distributor, so it goes back and forth. We also say, okay, let's decide that we're going to get 8-terabyte drives, and we want to buy 5,000 of them.
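A toy version of that sizing exercise might look like the following. Every number here is invented for illustration; the real process obviously folds in the cost curve, distributor quotes, and the buffer described below.

```python
import math

# Hypothetical inputs: none of these are Backblaze's real figures.
growth_tb_per_month = 4000      # projected new data
horizon_months = 18             # plan roughly 1.5 years out
drive_tb = 8                    # candidate drive size picked off the cost curve
overhead = 1.25                 # extra capacity for redundancy, headroom, spares
buffer = 0.10                   # cushion in case a shipment slips

raw_tb_needed = growth_tb_per_month * horizon_months * overhead
drives_needed = math.ceil(raw_tb_needed / drive_tb * (1 + buffer))
print(f"~{drives_needed:,} x {drive_tb} TB drives to cover the plan")
```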
And then we'll go out to the manufacturers
or the distributors in this case,
but the manufacturers and say,
hey, we're looking for some of these.
We're looking for 5,000 of these 8-terabyte drives.
What do you got?
And they'll come back with,
well, I don't have that model.
I have this model.
It's an older model or it's a newer model.
And I can sell you not 5,000,
I can sell you 7,000 at this price.
And so you get all of these things that come back and you negotiate back and forth until you finally
get to somebody that you can buy from. And then you place the order. So how often do we do it? We like to buy them so we have a nice cushion. But if you buy so many at a
given price, and six months later, they're down 20%, that's extra money you just had basically
sitting on your data center floor. So you want to be efficient in how you buy them,
but you always want to have a buffer. And a good example was
the supply chain problems that happened over the pandemic, right? And we had that buffer.
The first thing we did, as it started to look like things were getting tight, is we placed a bunch of orders for a bunch of equipment, not just drives, but all of the support equipment
and everything like that. But we had a buffer in place.
And as prices went up, because they did,
we were unaffected by that, or minimally affected by it.
So it really is just a matter of what's available.
We know what we need.
We ask the manufacturers, hey, this is what we need,
and this is the timing we need it in.
They come back with bids, basically, this is what we need and this is the timing we need it in. They come back with bids basically and say, we can deliver this year, this many at this price at this time. And
that's also important. So, you know, just-in-time manufacturing, you know, or just-in-time
warehousing, whatever you want to call it, is part of that math that goes together, you know.
And sometimes manufacturers don't deliver. It happens. Or the distributor doesn't deliver. They say, hey, I was going to get you, you know, 3,000 drives by Friday.
I can't make it. I don't have them. Okay. And at that point, you know, that's why you have a buffer.
Okay. And then you have to make a decision. Well, okay. When can you have them? I can have them
in 30 days. Okay. That'll work. I can't have them for six months.
Then you better find a substitute.
And you want to maintain good relationships, of course,
with all of these folks.
And I think we do have good relationships with all of them.
You know, the Seagate one has been a solid one over the years.
You know, the Western Digital one has gotten a lot better,
you know, over the last three or four years with them.
And Toshiba has always been a pretty good one for us.
You know, we meet with them on a regular basis so they understand our needs and can help us decide what to do next because they're part of that.
You know, they may have something coming down the road that we're not aware of.
Okay.
And they go, you know, hey, we're going to, we have an overproduction of, you know of 12 terabyte drives out of this factory in here.
I'll tell you what we can do.
Those kinds of things come up from time to time.
For sure.
How do you determine, it may not be an exact answer, but how do you determine 8 terabyte, 12 terabyte, 16 terabyte?
Is it simply based on cost at the end of the day or is it based upon capability?
How do you really map around the different spectrums? Is it just simply what's available at the cheapest, or that curve? Is it always about that cost curve?
That's where you want to start, but it's not only about that.
Okay. So do you limit it within that range, though? Like, anything above that curve is out of the question unless there's a reason for it?
We bought some new drives way back. I remember the time we did it. We bought some, I think it was 12-terabyte HGSTs or something at the time, and they were almost 2X what we would normally have paid for that drive. So we do it from time to time if it matters from a timing point of view or something like that. We also do it from an experiment point of view.
Sell me 1,200 drives.
That's a vault.
And we'll pay a little bit extra for it to see how they really work.
Are these going to meet our needs, for example?
You also do it a little bit for goodwill.
There's some of that still out in the world,
you know, and do that. And then the other side of that, the flip side of that is somebody may
come back and say, hey, I have a bunch of eights. We're at the bottom of the curve. You know,
basically here, they're almost free. And you buy them and you use them for substitutes or something
like that. Or you may be using them for testing purposes.
Or, you know, we have a mini setup for a vault that we use for testing purposes and testing software,
and sometimes they go in there.
You know, so there's all of these different things
that play into that decision.
The logical thing to say is,
well, always give me the biggest drive
because that's the best, most efficient use of space.
And that's important.
But all of the other things start to factor in like,
well, that 16 terabyte uses four times the electricity
of that four terabyte.
Wow, how much is that going to cost us?
Or it produces so much more heat.
Or when we use it, it's slower
because it's such a bigger drive.
It's not as fast.
It doesn't give us the data quick enough.
And I'm using that as an example.
Right.
Even though it's a 7200 RPM drive,
it's still slower on the data out standpoint.
The IOPS is slower.
The IOPS is lower.
So you trade off those kinds of things as well.
The other one, which most people don't recognize,
is when you get into this whole game of rebuilding a drive,
I can rebuild a four-terabyte drive in a couple of days.
Way faster.
What does it take me to rebuild a full 16-terabyte drive?
Weeks.
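The scaling there is roughly linear in capacity: if the effective rebuild throughput stays about the same, four times the capacity means about four times the rebuild window. A hedged illustration, with an assumed rate:

```python
# Illustrative only: the effective rebuild rate is an assumption, not a
# measured Backblaze number. Real rebuilds compete with live traffic.
def rebuild_days(capacity_tb: float, effective_mb_per_s: float = 30.0) -> float:
    total_mb = capacity_tb * 1_000_000
    return total_mb / effective_mb_per_s / 86_400

for size in (4, 16):
    print(f"{size:>2} TB -> ~{rebuild_days(size):.1f} days at the assumed rate")
# 4 TB -> ~1.5 days, 16 TB -> ~6.2 days; slower effective rates push 16 TB into weeks.
```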
So does that change your durability calculations?
What do you have to do in order to make sure
that you're staying at the
level you want to stay at for that? Well, something you just said there made me think about saturation.
So while you may use a 16 terabyte drive, is there a capacity limit? Do you go beyond a 50%
threshold? For example, if you have an array of 16-terabyte drives in your tome, and I assume a tome is one single pod, or is a tome...?
It's actually spread across 20 pods. It's one drive in each of 20 different storage pods.
Yeah. Okay. So given that, do you fill those disks fully up before you move
along? Do you operate at full capacity? It's a good question. We do run them at above 80%,
okay? And the reason has to do with the fact that there's a notion of filling them up. The way our system works is you're given a URL to the data, okay? To your location, to your particular tome, if you will, your particular drive. So we fill those drives up to about 80%.
And then at that point, there are no new people coming in.
Okay.
What happens then is that existing customers, they say, I have some more data for you.
I have some more data for you.
And they continue to fill that up until we get to, I think it's like 90, 95% or something
like that.
At that point, then we hand them off and we say, go someplace else.
Here's a new place to write your data.
So we have this process where we get to
where we slow it down, slow it down,
stop writing new people,
let the existing people write in to there
to fill it back up.
Then we also have a whole mechanism in place
for data recovery, space recovery.
Because we don't charge extra for any of that kind of stuff, because we use PMR drives, or CMR drives. That's just a normal process.
Deletion and reuse of space is an easy thing. It's not like an SMR drive, which is expensive
to do that. And so we delete data and recover the space and reuse it. So maybe we get to 95%, but then, you know, people delete files and we come back down, and then you can add some more, and so on and so forth.
So, you know, so that seems to be about the right range for us.
But they are definitely chock full of data.
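The fill policy he is describing can be summarized in a few lines of logic. The thresholds are the ones stated above (about 80%, and roughly 90 to 95%); everything else is illustrative, not Backblaze's real code.

```python
# Sketch of the fill policy described above: stop placing new customers at
# ~80% full, let existing writers continue to ~95%, then redirect them.
NEW_CUSTOMER_CUTOFF = 0.80
REDIRECT_CUTOFF = 0.95

def placement_decision(fill_fraction: float, is_new_customer: bool) -> str:
    if is_new_customer and fill_fraction >= NEW_CUSTOMER_CUTOFF:
        return "send to a different tome"
    if fill_fraction >= REDIRECT_CUTOFF:
        return "hand back a new write location"
    return "accept the write here"

print(placement_decision(0.83, is_new_customer=True))    # send to a different tome
print(placement_decision(0.83, is_new_customer=False))   # accept the write here
print(placement_decision(0.96, is_new_customer=False))   # hand back a new write location
```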
Yeah.
So the point that you're making there, and the reason why I asked you that, to get clarity, is that I may have, in my example, an 18-terabyte drive in an array, but that entire array, or that entire vdev, is not full of data. Like, every 18-terabyte drive is not chock full, because that's not the way I'm using it. Backblaze is way different. You're trying to be, you know, inexpensive storage that's reliable and easy to access, fast to access, et cetera, fast to back up to. Then you also have your B2 service and a bunch of other reasons for how you use your hardware. But, you know, my use case is different.
So now dovetailing into the way you buy drives, which is very unique and very different. You know,
I don't have a, I guess I'm at the whim of the retailer.
So I would typically buy from B&H, maybe Amazon, maybe Newegg, you know, maybe CDW, for example.
These are places I might go to buy consumer level hard drives.
And I'm buying six, eight, maybe 10 if I'm really feeling lucky. Maybe if I'm buying for the full range of 15, maybe I'm buying 15 plus a couple for parity, to have replacement disks. But even then, that's, like, super expensive for someone like me, not someone like you, because you buy, you know, 5,000 8-terabyte drives at a time. Massive check, right?
Yep.
Me, way different. Or the people like me, way different.
So let's dovetail into how can you leverage what you know about the way you buy drives
to give some guidance to consumers out there that are home labbers,
that are building out four drive arrays, six drive arrays, eight drive arrays,
12 drive arrays, whatever it might be.
Give me some examples of what you know from what you know with all these drive stats,
these 10 years of drive stats, to how you buy drives. What are some of your recommendations for consumers or home labbers
buying hard drives? So that's a really good question. And it does come up, and you're
absolutely right. Somebody with a handful of drives or a small number of drives has to think
differently. And I think one of the reasons why the data, what we do, has been
popular, if you will, for the last number of years is because there's such a dearth of information
out there. Other than that, you go to the manufacturer and you could take every data
sheet produced by every single manufacturer and just change the name and they look identical and they almost have the same numbers on them.
And so they're of very little use from that perspective.
But there are some things you can do as a consumer.
One is you can, manufacturers try to match the use of the drive
to the mechanics inside of it a little bit
and the firmware that's inside of it and so on.
And so you might look at that.
So if you're doing video recording, you know, you're just recording your video systems or
something like that, that's a different use case than you might be using it where you're
storing media on it, you know, and you want to access your movies and you created a Plex
server or whatever the case may be versus, you know, Joe person who's looking for an
external drive because they have a
computer and they want to put some data on an external unit.
So I think what we give people from our perspective is at least data to help make a decision.
Now, where else do you get it from?
There's various sites that do produce it.
There's folks like yourself who work in a home lab
thing and say, hey, I had success with this. And I think you need all of that information in order
to make a solid decision. I don't think it's a monetary one, although I completely understand
why people make a monetary decision. You know, gee, I can buy that for $179, but that one costs me $259 and they're the same size.
And I don't really have $179, much less $259, so I guess I'm going to buy that one.
So I understand those decisions and you cross your fingers from that perspective.
The other little thing, and it's just the wild card in all of this, is you never know when you're going to get a really good drive model or, conversely, a really bad drive model.
And you could buy a drive
and it's the, let's just say,
DX000 model, right?
And you bought it and it's been great.
It's been running for years.
And your friend says,
what do you use there?
And I said, I'm using the DX000.
And he goes, great.
And he goes to the store and he can't get that, but he can get a DX001, and it's pretty close, right? And it fails three days out of the box. Okay. So you have to be somewhat precise, but you also see the personalities
of the different manufacturers. Okay. And I'll go back to Seagate. Seagate makes good solid drives
that are a little less expensive. Okay. And so, do they maybe fail more often? Maybe. Okay.
But there are certainly some good models out there and it doesn't necessarily correlate to price,
by the way. We've seen that and it doesn't correlate to enterprise features. It seems to
just be, they made a really good model.
The other thing I would do is if you're buying a drive is I would buy a drive that's been in production a bit, maybe a year, maybe six months at least, and then look and see what people are saying on websites, various consumer sites.
Don't go to the pay-for-play review sites because, you know, you just buy your way up the list.
But, hey, I'm thinking of using this particular model.
And then pay attention to the model they're using.
And then when you go to buy it, make sure you get the same one because, again, they don't have to be the same.
Use our data wherever it's helpful, to help maybe guide you a little bit towards where you might find the right ones, and maybe the ones to stay away from a little bit. But at the end of the day,
that's just one piece of the information that you're going to have to dig up.
And there just isn't a testing agency for drives out there.
We get people begging us for that.
We get people literally saying.
Spin off a division or something like that.
That's right, right?
Wouldn't it be fun?
Yeah, I mean, realistically, I mean, you've done a lot of the hard work in quantifying the value of the data.
And you've been consistent with the ability to, one, capture it and then report on it at a quarterly and yearly basis, which I just commend you on.
Thank you for that because that's been like,
and you give it out for free.
You don't say, hey, for Backblaze customers,
you can see the data.
It's free for everybody to see.
And I think you even have like downloads of the raw data,
if I recall correctly.
Like I've, I didn't know what to do with it,
but I'm like, great, it's there.
You know, that if I wanted to dig into it further, then I could.
But yeah, there should be some sort of drive testing, but what
a hard thing to do. I mean, especially as you probably know, models change so quickly, and the model numbers don't seem to have any sort of rhyme or reason to them. They just seem to be like, okay, we're done with building that one and now we're going here. And it's also based on geography. It might be made in Taiwan. It might be made in Vietnam. It might be made somewhere else.
And these things also play a role into it.
It could have been something geographically in that area.
There could have been a storm.
There could have been an earthquake or a hurricane or something catastrophic or who knows what.
There's things that happen in these manufacturing plants
when they make these drives to get consistency.
I've even heard to buy not in the same batch.
So don't buy more than X drives from,
you know, let's say B&H,
you know, buy two from B&H, two from CDW.
Obviously buy the same model if you can
to try and, you know, keep the model number parity.
But, you know, I've heard like all these different,
you know, essentially old wives' tales on how to buy hard drives as a consumer. And it really seems to be cargo-culted, or learned from somebody else, or just fear, essentially. This is why I do it: because it's a fear.
And the way I've kind of done it is based on the capacity first. So I think, how big do I need? I begin with my capacity, because I'm different, and I want to get to the price curve eventually, but my deal is how much do I want to have, how many drives can I actually handle, you know? And then at that level, what's my parity level? Can I afford to have a couple extra, so if two drives fail in that parity, let's say a RAID-Z2 given a ZFS file system array as an example, if those two drives fail, can I replace them? Do I have two more drives to replace them if two did fail? I hadn't considered your cloning idea, which I think is super smart. I'm going to have to consider that. I might just do some hard drive failure tests just to see how that could work. That seems so smart, to clone versus resilver, although I don't know how that would work with ZFS, if that's a thing or not. But capacity is where I begin. Then it's like, okay, for price, did I get that? And then the final thing I do once I actually get the drives, and I hadn't considered running the SMART test right away to see how many power-on hours it had, because I didn't consider they were doing testing there. But, well, hey, if Seagate is doing a burn-in of sorts on my drives or some sort of test beforehand, let me know. Like, hey, I would buy a model that has burn-in testing beforehand. Save me the week if I'm going to burn in an 18-terabyte drive. So when I bought this new array recently, the burn-in test lasted seven full days. I used a,
I don't know if you use this software or not,
it's called BadBlocks,
but you can run a series of tests.
It writes three different test patterns
and then a final one, which is the zeros across it.
But for each write, there's a read comparison.
So it's a write across the whole disk in one pattern,
then a read, another write, then a read,
another write, then a read,
and then finally a zero pass write and then a read, another write, then a read, and then finally a
zero pass write, and then a re-comparison to confirm that the drive is actually clean.
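For anyone who wants to reproduce that kind of burn-in, the commands below are the usual way to do it on Linux. The device path is a placeholder, and the badblocks write-mode test is destructive, so only run it on a drive with nothing on it.

```python
# Pre-deployment drive check: read SMART attributes (e.g. power-on hours),
# then run a destructive badblocks write/read pass over the whole disk.
# DESTRUCTIVE: wipes the drive. /dev/sdX is a placeholder; run as root.
import subprocess

DEVICE = "/dev/sdX"

# 1. Dump SMART attributes; look at Power_On_Hours to see prior use or factory testing.
subprocess.run(["smartctl", "-A", DEVICE], check=True)

# 2. Write-mode badblocks: writes test patterns across the disk and reads each
#    back (-w write mode, -s show progress, -v verbose). Takes days on big drives.
subprocess.run(["badblocks", "-wsv", "-b", "4096", DEVICE], check=True)
```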
But for an 18 terabyte drive, six drives took an entire week, you know, and that's just a
tremendous amount of time for somebody who's like, you know, I just want to get on to building my
thing. Come on now. But that's the way I look at it. That's how I've learned to buy: what capacity do I want to have, and then can I afford it, just the drives alone, and then can I afford the extras if I need parity and replacement for that parity (of course you want parity), and then finally doing a final burn-in before I actually put the drives in service, which I feel is a little overkill to some degree. But, like, you know what? The worst thing to do
is to build this full array.
I'm not a business.
I have limited time.
And then I got to deal with failures a week or so later.
Now, that burn-in test may not predict a failure a week later, but it might mitigate it. Because, like, well, if drive four of six did not pass the sector test in badblocks, well then, let's send that one back for an RMA or just a simple straight-up return kind of thing.
And you know, before you even build the array, you've got a problem child essentially.
And the other thing is, running that kind of software, if there is a media error, which happens, it just does, you let the drive rebuild around it. And so you don't even know it, other than it might tell you that. But if you put your system in play before you do that and it finds it, the same thing can happen, but now your system runs a little slower for a period of time until it figures out how to map around that.
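On the ZFS question that came up a minute ago: replacing a failed or failing disk in a RAID-Z2 pool is a supported, fairly routine operation, and ZFS resilvers onto the new disk on its own. The pool and device names below are placeholders, not a recommendation of a specific layout.

```python
# Minimal sketch of swapping a disk in a ZFS pool (placeholder names).
# ZFS resilvers the new disk automatically after `zpool replace`.
import subprocess

POOL = "tank"
OLD_DISK = "/dev/disk/by-id/ata-OLD_SERIAL"
NEW_DISK = "/dev/disk/by-id/ata-NEW_SERIAL"

subprocess.run(["zpool", "status", POOL], check=True)    # confirm which disk is degraded
subprocess.run(["zpool", "replace", POOL, OLD_DISK, NEW_DISK], check=True)
subprocess.run(["zpool", "status", POOL], check=True)    # watch the resilver progress
```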
For sure. The only other thing I want to talk to you about is, I think, a newer player in your drive stats, which is SSDs. Now, I think you only use them as boot drives, not as, like, storage drives in your B2 service or your at-large data services. And I think the reason why you make these choices is you're very pragmatic with the hardware you buy. You only buy the things you need, and you keep your expenses, your cost of goods sold, low
because you want to have the lowest cost storage services out there, whether it's B2 or whatnot.
That's how I understand Backblaze's business model and thought process when you all spend
money.
So with SSDs, obviously you're replacing older hard drives that may be the boot drive, which
as you know, the boot drive is the drive that's running the operating system itself
on the thing. Now, I've got to imagine in this 52U rack you have, you've only got one server in there, but you've got, what was it, eight storage pods? And then you've got one actual server. So is all that hooked back to that server? And then tell me about the SSDs.
Yeah, so actually, just to kind of set the scene: a storage pod is actually more than just storage. It's actually its own storage unit. It's a server. So there is a CPU in there, there's all of the other components and everything like that. So it's not like a JBOD kind of thing; each storage pod is its own server unit. It's got its own intelligence, its own processor, its own 10G network connections
and whatever else, right? And so each one has its own boot drive as well. So that's where we get
them all from.
The boot drive for us does more than just boot the system.
Like I mentioned earlier, it stores some of the SMART stats on it for a period of time.
It actually stores access logs and a few other things that we keep track of on a regular basis.
Because one, there's a compliance aspect to doing it. And then two, there's just a maintenance aspect
and a debugging aspect
when something goes a little wonky
and you want to be able to look through various logs.
So we keep those for a period of time
and they get stored on that SSD as well
or the boot drive as well.
The SSDs, to be honest, we started using those
because the price point came down to the point
where we wanted to pay for it.
Yeah.
Performance probably made sense, too.
And then price made sense.
Yeah.
And we've tried different ones over the course of it and the data.
We've talked about building a storage pod out of SSDs, okay?
And, in fact, some of our competitors are even talking about and doing some of those things.
The cost point just doesn't make sense yet.
And the reality is the speed, which is what most people would think they would be getting: it's not the drive where any slowness happens. It's not even, quite frankly, in our systems. I mean, we're dropping 100-gigabit NIC cards in these things, right? I mean, or 25.
And a lot of it is, it just takes a while to get the data from you, even just next door.
Forget about getting it, you know.
And so the SSDs are a pretty cool idea.
And I guarantee you when the price point gets to the right spot, we'll do it.
Backing up somebody's data, whether it takes 10 milliseconds or whether it takes 12 milliseconds,
is not an interesting thing.
And you shouldn't have to pay a premium to get that last little tiny bit.
And that's to your point.
That's where we live.
We want to provide a solid, good, well-performing service at an economical price. That's what we do. SSDs don't fit into that as data servers at this point. They're still too expensive. And the use cases could be interesting. The read-write, the number of writes and stuff could be interesting. Do they wear out under that environment? People have been using them in what we call caching servers, you know, in order to do things. And the reads and
writes on those are enormous. And so you could literally burn through those servers and those
SSDs in six months. And so is that economical? Did you really gain anything from a cost perspective? No, you didn't. So the analysis for all of that is still ongoing from our perspective, but I can see a day when we're there. I can see a day when, you know, we're using something besides hard drives to store customer data, but we will do it in a way that's economical and practical.
Yeah. You said that sometimes you swap out 8TB drives. I've got to imagine the largest SSDs out there tend to be 4 to 8TB.
But if you compare the cost to an 8TB hard drive, it's probably double.
Especially the 8TB SSD is probably at least maybe four times the cost of an 8-terabyte hard drive. So, I mean, yeah, I'm not going to,
when I buy Backblaze for backup services
or even B2 services, for example,
which is like a similar S3 replacement
and you even support the API,
as you mentioned before,
I'm not looking for the caching and the speed necessarily.
I mean, I want the speed to be there,
but it's not like, well, I will pay double for Backblaze because you give me SSD backups. Like it's just not
something I'm trying to buy as a consumer from my perspective. And that totally makes sense,
you know, for your business model. And that makes a lot of sense. That's why I want to talk to you
about all these details because the way you buy hard drives and the way you manage these systems
is so unique in comparison. And I mean, we've just never examined the behind-the-scenes of, you know, a data center, or a storage data center like you all operate. Like, what does it take to build that? I know we barely scratched the surface here. I've got probably 30,000 other questions that might go deeper and more technical, on different software and stuff like that. So we'll leave it there for the most part.
But it has been fun talking to you.
Is there anything else that I haven't asked you that you wanted to cover
in your 10-year stat history?
Anything whatsoever that's just on your mind that we can leave with
before we wrap up the show?
Well, I will say the next DriveStats report is coming up.
That's always fun.
I think it's due out May 4th.
Okay.
May the 4th, yes.
That's Star Wars Day.
It's my anniversary, too.
There you go.
That's even better.
Last year, we wrote one up on May the 4th, and we did it all with Star Wars themes and stuff like that.
But I've dialed that back this year.
So maybe one or two Star Wars references and that'll be it.
But, uh, congratulations on the anniversary though.
Thank you.
But yeah, so that's coming.
I encourage folks who do read those things, if they have questions or comments, feel free, we'll answer them.
We try to do the best we can.
We try to be open with how we do things and why we do the things we do. And so I always look forward to that. And ask the hard ones. We'll give you the best answer we can. We are these days a public company, so I don't know how many things we can disclose at certain points, but we'll do the best we can in getting you the information you're
asking for. Yeah, I always appreciate digging into this. I don't always dig as deep as I
would love to because of time or just, you know, just too much data, so to speak,
because it is kind of overwhelming. And the way you have to look at even like your drive failures
by manufacturer, for example, is like, well, that number may be higher for Seagate, but you also have a lot more Seagate drives in service. There are a lot of correlations you have to look at. You can't just
say, okay, let me go to Backblaze's data and say, okay, these are the drives I'm going to buy. Well,
it might be an indicator to manufacturer, maybe not model or size particularly, but it might mean
like, okay, you seem to favor Seagate. You mentioned that your relationship was pretty good there.
I like Seagate. I've had great experiences.
I almost switched recently when I was buying my newest array.
And I was thinking about building.
I was like, I'm going to go to Western Digital.
I almost did that, but I'm like, well, I've got these drives in service for this many years.
Knock on wood.
With zero failures, right?
With zero failures.
When you say that, things happen.
So I'm sorry to say that. But I've been having Seagate drives in service for, I mean, as long as I've been running data stores, which has been a long time, probably eight-plus years, maybe 10 years, or longer than that, you know, 13 years. So over time, I've always only ever used Seagate drives. I don't know why I chose Seagate.
Cool name.
I liked Iron Wolf.
Cool brand name.
All that good stuff.
They got some good stuff there.
But the things I read about them was pretty strong.
The warranty was good.
And I've had good services with Western Digital as well in terms of warranty.
I've had some drives from them fail and have to be RMA'd.
And the RMA process is pretty simple.
That's what you want to buy for.
You want to buy a brand that's reliable. You want to buy for parity. You want to buy for
replacements of that parity and to be able to swap it out easily. And then also, do you have
a worthwhile warranty and can you RMA pretty easily? RMA is just simply sending the drive
back that's got a failure, you know, and they will replace that drive with the drive that you
got or something that is equivalent.
There's always circumstances that make that different.
But I've only had good responses from Seagate
as well as Western Digital.
So those are the brands I stick with.
But that could be an old wives' tale, right, Andy?
That's Adam's old wives' tale of how I buy hard drives.
It's okay.
People have to be comfortable with whatever decision they make.
But the most important thing, and you built it into your system, right, is to what? Have a backup.
And I don't care what kind of backup it is. You don't have to use our service.
It isn't a backup. It's just parity. But yeah, definitely have a backup.
You know, and because if you lose it, it's gone. I mean, so, you know, have a backup. And again,
we've said this before. I don't care if you use us.
I don't care if you use anybody.
I don't care how you do it.
Just get your stuff backed up so that if something happens, you don't lose the things that are precious to you.
It's as simple as that.
And again, I don't care who you do it with or how you do it.
Just get it done.
Very cool.
Well, Andy, thank you so much for taking an hour or so of your time to geek out with me on hard drives.
Not always the... you know, I'm curious how many people come to this episode, honestly, and are excited about this topic. It's not the most fun topic unless you're a hard drive nerd like you and I might be.
I'd rather enjoy it.
I think this kind of stuff is pretty fun.
I'm kind of curious what audience we bring to this one because this is a unique topic for us to have on the changelog.
Yeah, I appreciate the opportunity, and I hope some folks listen.
It's always fun to have folks, you know, listen to what you say
and then make comments on it and, you know, and all of that.
There are some places where geeks hang out, you know,
and hard drive geeks in particular hang out,
so maybe we'll get a whole bunch of them together and listen to it.
But just the education of what goes on.
I mean, you understand the complexity of a hard drive and what's going on inside there, right?
And I understand that to some degree as well.
And it's miraculous that that thing works.
It does what it does, and it does it at the price points that they do it at.
So we just need to have that appreciation for that technology, you know, for as long as it's around.
For sure. I agree. I mean, we definitely underappreciate and take for granted the mechanics of a hard drive, as simple as it might be. Like, wow. I mean, on my MacBook Pro I don't care, because I'm using an SSD. It's actually probably an NVMe SSD, or just straight-up NVMe, in the M.2 format or whatever it might be. You know, at that point, I'm not caring. But in other cases, yes, you know, the hard drive.
I mean, that's what the cloud is built upon.
Your cloud is built upon spinning rusty hard drives that eventually fail.
That's not always the coolest topic, but it is crucial.
It's like almost
mainframe level crucial, right? Like we don't think about mainframes too often. We had an episode
about that, but how often do you talk about hard drives and the simple thing that they are, but
also the very complex thing they are. And like you had said, the miraculousness of the fact that it
actually works. But yeah, thanks so much, Andy. It's been awesome talking to you. Appreciate you.
Thank you, Adam. It was great.
Okay, so our friends over at Backblaze have basically made it a science to buy and swap out hard drives to operate at scale.
A quarter of a million hard drives in service with, I think, what the back-of-the-napkin math put at 0.1-ish percent failure per quarter.
Not bad.
Not bad for the manufacturers and not bad for the service to keep the uptime and keep it affordable.
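For anyone playing along with the napkin, here is one common way an annualized failure rate gets computed from a fleet, using drive-days. The numbers are made up for illustration and are not Backblaze's actual figures.

```python
# Toy annualized failure rate (AFR) calculation from drive-days.
# All inputs are invented for illustration.
drives_in_service = 10_000
days_in_quarter = 90
failures_in_quarter = 35

drive_days = drives_in_service * days_in_quarter
afr_percent = failures_in_quarter / drive_days * 365 * 100
print(f"AFR ~= {afr_percent:.2f}%")   # ~1.42% annualized for these made-up numbers
```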
Once again, a big thanks to Andy for sharing his time with us today. And guess what?
There is a bonus for our Plus Plus subscribers.
So if you're not a Plus Plus subscriber, it's easy: changelog.com slash plus plus. Ten bucks a month, a hundred bucks a year. No ads, directly support us, get closer to the metal, get bonus content, and of course a shout-out on Changelog News on Mondays. You can't beat that. $10 a month, $100 a year to support us. Wow, that's easy,
and we appreciate it. Speaking of appreciation, a big thank you to our friends and our partners
at Fastly, Fly, and Typesense. We love them. You should check them out: Fastly.com, Fly.io, and Typesense.org. They support us; you should support them.
And of course, a big thank you to our Beats Master
in residence, Breakmaster Cylinder.
Those beats, bangin', bangin', bangin'.
Love them.
And of course, last but not least,
thank you to you.
Thank you for listening to the show.
Thank you for tuning in all the way to the
very end. Thank you for coming back every single week. If you haven't heard, there is a new spin on the Monday news show. It is now combined with Changelog Weekly, so on Mondays it's now called Changelog News officially. And Changelog News is a podcast and a newsletter in the same vein.
When Jared ships that show, he ships a podcast and a newsletter, and you can subscribe to both.
Check it out at changelog.com slash news.
If you're subscribed to this show already, to get the audio podcast of that show, do nothing.
Just keep subscribing to this show.
You get it anyways.
But that's it. The show's done. Thank you again. We will see you on Monday.