Storage Developer Conference - #187: More Than Just a Bucket of Bits: Cloud Object Storage turns Sweet Sixteen

Episode Date: April 11, 2023

...

Transcript
Starting point is 00:00:00 Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage Developer Conference. The link to the slides is available in the show notes at snia.org/podcasts. You are listening to SDC Podcast, episode number 187. Good morning. Thank you for getting up early, making it to this first session of Storage Developer Conference. I'm Pat Patterson. I'm the Chief Technical Evangelist at Backblaze. I've been at Backblaze since the beginning of this year. In the past, I've worked at Sun Microsystems, Salesforce, a startup called StreamSets,
Starting point is 00:01:10 and Citrix. So I've worked with developer communities for about 17 years now, but I'm still very much a developer myself. So what I'm going to be talking about this morning is doing a little bit of a review of cloud object storage, which kind of had an anniversary this year, turning 16 years since Amazon introduced S3, and then explaining how our cloud object storage platform actually works at Backblaze. So this is not a Backblaze pitch. This is about how the layers in our cloud object storage platform fit together.
Starting point is 00:01:55 So who's heard of Backblaze? Anyone? Oh, yeah. Who's heard of DriveStats? No one. Yes? No? So I'll mention DriveStats later.
Starting point is 00:02:08 DriveStats: we collect statistics on the, let me think now, 200,000 drives that we spin, and we release those stats every quarter. But I'll talk about that more later. Anyway, so March 2006, and Amazon, up till then known primarily as an online retailer, a bookstore, launched Amazon Web Services. And the very first service was S3. And that kind of speaks to the foundational nature of cloud object storage. And it's easy to forget just how different this was at the time for developers. Now, this is from that press release.
Starting point is 00:03:03 And developers reading this noticed new words. So there's no mention of files. We're writing, reading, and deleting objects. No mention of file names. Objects are stored and retrieved via a developer-assigned key. And there was no mention of APIs in the way that developers had understood them until then. REST and SOAP interfaces designed to work with any internet development toolkit. So this was very, very different from working with a POSIX file system. And if we look at that, this is very typical code in C from the time, and it still is if you're working with local storage.
Starting point is 00:04:02 So you might make a directory with particular permissions. You might create a file, which gives you a handle or a file descriptor, and then write some data and close that file.
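As a rough equivalent of that C code, here is the same sequence sketched with Python's os module, which wraps the same POSIX calls; the directory name, file name, and permissions are just placeholders.

```python
import os

# The classic local-storage sequence: make a directory, create a file
# (getting back a file descriptor), write some data, close the descriptor.
os.mkdir("data", 0o755)                         # mkdir("data", 0755)
fd = os.open("data/report.txt",
             os.O_WRONLY | os.O_CREAT, 0o644)   # open()/creat() returns a file descriptor
os.write(fd, b"hello, local storage\n")         # write() advances the kernel's file offset
os.close(fd)                                    # close() releases the kernel-side state
```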
Starting point is 00:04:29 Now, the problem with this, from an internet-scale perspective, is that the operating system has to maintain state. That file descriptor points to the location of the file, so its path, its inode, and also the position in the file that the client is reading or writing. Now, this is okay for an operating system on a single machine, but it clearly cannot scale to the internet. So S3 came along, and this is probably familiar to everybody in the room. You've got this HTTP interface, and you say, okay, I want a bucket: I'll do a PUT to /my-bucket and include some XML payload saying I want that bucket in a particular region.
Starting point is 00:05:15 And you would put an object in that bucket by saying, okay, PUT /my-bucket/my-key, and you would just enclose the body to upload. Now, the only state here is in that list of buckets and the objects in storage themselves. Clients can write and read, and all S3 has to do is go and write data and return data; it doesn't have to maintain a pointer into a particular object.
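A minimal sketch of those two requests using the AWS boto3 SDK, which issues exactly this kind of PUT under the hood; the bucket name, key, and region are placeholders.

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# PUT /my-bucket, with an XML payload asking for a particular region
s3.create_bucket(
    Bucket="my-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# PUT /my-bucket/my-key, with the object bytes as the request body
s3.put_object(Bucket="my-bucket", Key="my-key", Body=b"a bucket of bits")

# GET /my-bucket/my-key returns the same bytes; there is no server-side
# session or file handle involved at any point.
body = s3.get_object(Bucket="my-bucket", Key="my-key")["Body"].read()
```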
Starting point is 00:06:00 Now, there are some clear semantics here in the way that cloud object storage works. Objects are immutable. You can't append to an existing object. You can't seek within it and overwrite some data. Objects can be versioned, so you can replace an object with a new version of itself. And you can work with very, very large files.
Starting point is 00:06:31 This is true of S3 and many other cloud object storage platforms, and S3 became the de facto API. You can upload a single part up to about five gigabytes, and above that, you have to use a multipart strategy, and that's really driven by HTTP. You know, it's hard to keep an HTTP connection going indefinitely to upload 10 terabytes of data. And there are other features, like you can set an object lock to prevent deletion for a given time period. And now this is quite interesting. You can set a lock on a file for so many days
Starting point is 00:07:19 and it cannot be deleted. So you can't delete it through the API, you can't ask the cloud provider to delete it. The platform literally makes it impossible. And this is a feature that was introduced over the past few years to guard against ransomware. So ransomware that would attack your applications, maybe get access to your credentials for accessing that cloud storage, and go start deleting things. But this makes it absolutely impossible.
Starting point is 00:07:48 In fact, certainly on Backblaze B2, if you really want to delete that file, the only way is to close your account. There is just no way. There's no path through the API or through support to delete a file that's been locked.
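A hedged sketch of setting that kind of lock at upload time through the S3-compatible API with boto3; it assumes the bucket was created with object lock enabled, and the bucket name, key, and retention date are placeholders.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Upload an object that cannot be deleted or overwritten until the
# retain-until date passes: not through the API, and not by support.
s3.put_object(
    Bucket="my-locked-bucket",      # a bucket created with object lock enabled
    Key="backups/2023-04-01.tar.gz",
    Body=b"...backup bytes...",
    ObjectLockMode="COMPLIANCE",    # compliance mode: the retention period cannot be shortened
    ObjectLockRetainUntilDate=datetime(2024, 4, 1, tzinfo=timezone.utc),
)
```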
Starting point is 00:08:17 And another thing you'll see: these objects that we manipulate in cloud storage are accessible via HTTP. So developers started uploading images and JavaScript and other web resources and serving them to browsers. And when you do that, nowadays, you need to deal with the cross-origin resource sharing rules that browsers enforce. So typically, if you serve up a web page with a script, that script can only access resources from the origin the page came from, unless you set up this cross-origin resource sharing. So you're basically saying it's okay to access this object from web pages in this domain.
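A minimal sketch of such a CORS rule set through the S3-compatible API with boto3, saying that pages served from one origin may read objects in this bucket; the domain and bucket name are placeholders.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_cors(
    Bucket="my-assets-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://www.example.com"],  # pages from this origin...
                "AllowedMethods": ["GET", "HEAD"],              # ...may fetch objects
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3600,  # browsers may cache the preflight response
            }
        ]
    },
)
```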
Starting point is 00:09:13 So how does S3 work? Well, Amazon certainly isn't telling. They kind of give hints and release research papers. But Backblaze isn't Amazon. We have a history of being very open with what we do and how we do things. And this session is actually based on completely publicly released information. So you could have put all this together yourself if you'd gone and done some research over the past few weeks. So we built a storage cloud.
Starting point is 00:09:42 Backblaze was founded in 2007, and the original product was computer backup. So the initial thinking was, we'll build a backup client that customers just download and start running, and we'll store the data on S3. S3 had been around for a year by then. And the target price for Backblaze backup was five bucks a month for unlimited data. And it became clear almost instantly
Starting point is 00:10:20 that this was not possible on S3, that it was not possible to provide that service, because a petabyte on S3 cost $2.8 million in 2009. And what we actually ended up doing was figuring out, okay, what's the lowest cost at which we can put storage online? And, you know, the raw drives for a petabyte cost $81,000, and we got a petabyte online for $117,000. And this is how.
Starting point is 00:10:56 So step one, we put drives online as cheaply as we could. So we designed these custom chassis to hold 60 drives. So back in the day, 60 4-terabyte drives might have been, what's that, 240 terabytes in a pod, something like that. So we didn't build the motherboards. We bought in the motherboards and power supplies,
Starting point is 00:11:31 had the cases made, and basically assembled these ourselves as a storage pod. And we released the specification. So in a pod, there's a custom web application running on Apache Tomcat. There's Debian Linux with ext4. And yeah, nowadays, with 60 16-terabyte drives, you can get close to a petabyte of storage. Am I doing my math correctly? Petabyte,
Starting point is 00:12:09 yeah. It's early in the morning. I've only had one cup of tea. Close to a petabyte into one of these boxes. And Tomcat's running HTTPS. The pods are accessible from the internet, and there's nothing fancy there. The idea here was to build to a particular price target, with robustness and reliability. There's no iSCSI, NFS, SQL, or Fibre Channel. It's basic technology. And you can build your own. We released the design, the bill of materials, and people have actually done this. So, you know, you can go out and buy the parts. This is the cost nowadays. I looked it up.
Starting point is 00:13:00 60 Seagate 16-terabyte drives would cost you just over $17,000. The case and the other bits, about three and a half thousand. So for $21,000, give or take, you can put together a pod holding about 960 terabytes, and your cost per gigabyte is just over two cents. So this is like our base price, before we pay anybody any wages or pay for the electricity and so on. Now, right now, we have about two exabytes of storage online and, like I mentioned earlier, 200,000 disks. And every day, for every disk,
Starting point is 00:13:47 we collect the drive serial number, the model number, the date, obviously, whether it failed that day, and all of the SMART attributes. So every day, we collect 200,000 rows, if you like, of information. And then every quarter we release that publicly. And Andy Klein, our god of DriveStats,
Starting point is 00:14:20 he does a webinar, and we release the raw data, and we also talk about our interpretation of that data. So which drive models we've introduced this quarter, how they're shaping up for reliability, the annual failure rates for the different drives we're spinning, any lessons we can learn. So in the last DriveStats, which was in July, he was talking about the bathtub reliability curve we're all familiar with: devices tend to fail more often when they're brand new, then the failure rate flattens out and they have a working life, and towards the end of life the failure rate rises. And what we've noticed
Starting point is 00:15:05 from the statistics we've gathered is that that curve is now very asymmetrical. So we're just not seeing as many drives fail in the early part of their lifetime. The curve starts far lower down and obviously rises towards the end of their working life. So we share this information with the community. And yeah, just go to backblaze.com/drivestats. All the data is there. It's there as zipped CSV files, one CSV file per day, zipped up into a zip file per quarter.
Starting point is 00:15:46 I'm working on a project to actually give you direct access to that data in a B2 bucket in a form that is queryable. So I've already written one blog post about this. If you search for Backblaze Parquet, P-A-R-Q-U-E-T, you'll see that. And my goal is to make that data available so you can literally run SQL queries against, let's see, we started in 2014, so nearly eight years of data for a whole range of drives. It's very cool.
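As a rough sketch of where that's heading, this is roughly what a query could look like with DuckDB once the data is sitting as Parquet files in a bucket behind an S3-compatible endpoint. The bucket name, path, endpoint, and credentials below are hypothetical placeholders; model and failure are real columns in the published DriveStats data.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # S3-compatible access over HTTP

# Point DuckDB at an S3-compatible endpoint (placeholder values).
con.execute("SET s3_endpoint='s3.us-west-004.backblazeb2.com';")
con.execute("SET s3_access_key_id='<applicationKeyId>';")
con.execute("SET s3_secret_access_key='<applicationKey>';")

# Failure counts per drive model, computed straight against object storage.
print(con.execute("""
    SELECT model,
           COUNT(*)     AS drive_days,
           SUM(failure) AS failures
    FROM read_parquet('s3://drivestats-bucket/drivestats/*.parquet')
    GROUP BY model
    ORDER BY failures DESC
    LIMIT 10
""").fetchdf())
```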
Starting point is 00:16:26 Very cool for storage geeks, actually. If I talk about this in any other audience, people are just like. Right, so we've got our pod. What's the next layer? Well, 20 pods is a vault. So vaults are software. We just rack up these pods, plug them into the network,
Starting point is 00:16:47 and some software creates the concept of a vault. And a vault is the way that we store data with redundancy. So a vault is 20 pods. And then we've got this concept of a tome, which is 20 drives, each in the same position in its storage pod. So drive one in each of those 20 pods forms tome one. And what we do when you upload data is split the data into 20 shards and then use a number of those shards for redundancy, essentially parity. So we use something called Reed-Solomon erasure coding that I'll talk about in a moment. And what we found was we used to use 17 data shards and three parity, but drives are getting bigger.
Starting point is 00:17:49 And that means that rebuilding the data on a failed drive now takes longer. It can take several days, even over a week, to rebuild a 16-terabyte drive from those parity bits. So, to keep the durability, you have to have more parity shards. So now, with 16-terabyte or greater drives, we use 16 data and 4 parity.
Starting point is 00:18:18 So we write those shards across a tome, and we designed this for 11 nines of data durability. So with 960 terabytes per pod and 16 data pods per vault (the math works out at 4 parity, 16 data), that's about 15 petabytes in a vault. Now, this is that Reed-Solomon encoding, and there's actually a video where Brian Beach, one of our architects, explains this way better than I can, but it's based on matrix algebra. What you have is this encoding matrix,
Starting point is 00:19:03 and we've just done it with four data shards and two parity shards here. So what you have is the identity matrix at the top, and then you have these special encoding rows. And the way the math works out is that when you multiply the encoding matrix by the data, you end up with the data rows and then two parity rows. And as Brian explains way better than I can,
Starting point is 00:19:39 if you lose any one or two of the rows, including data rows, you can reconstruct them by basically multiplying what you've got left by the inverse of the corresponding part of the encoding matrix, and you get the original data back. It's really, really elegant matrix algebra.
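Here's a toy numerical sketch of that matrix idea using NumPy and ordinary floating-point arithmetic; a real implementation, such as Backblaze's open-sourced JavaReedSolomon library, does the same thing over the Galois field GF(2^8) so that shards stay exact bytes. The 4-data/2-parity layout matches the slide's example rather than the 16+4 used in a vault.

```python
import numpy as np

DATA, PARITY = 4, 2          # the slide's toy layout; a vault uses 16 data + 4 parity
TOTAL = DATA + PARITY

# Build a TOTAL x DATA Vandermonde matrix from distinct points, then normalize
# it so its top DATA rows become the identity. Any DATA rows of the result are
# still invertible, which is exactly what makes recovery possible.
V = np.array([[float(i) ** j for j in range(DATA)] for i in range(TOTAL)])
E = V @ np.linalg.inv(V[:DATA])       # encoding matrix: identity on top, parity rows below

# Four data shards (small vectors of numbers standing in for shard bytes).
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [10, 11, 12]], dtype=float)

shards = E @ data                     # six shards: the four data rows plus two parity rows

# Lose any two shards, say shard 1 (a data shard) and shard 4 (a parity shard).
surviving = [0, 2, 3, 5]
recovered = np.linalg.inv(E[surviving]) @ shards[surviving]

assert np.allclose(recovered, data)   # the lost data is reconstructed exactly
```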
Starting point is 00:20:36 So we've got our vault full of pods and tomes. Now we build them into a cluster. This is the one part of this that you don't see a lot of in our blog posts. But clusters, again, are software. We're just bringing these things together into the concept of a cluster. So a cluster has around 100 vaults. It's not a fixed number. So you could have 2,000 pods, 120,000 hard drives. And this is what makes it scalable. So you could start your cluster off with, say, 50 vaults and then build it out. A cluster is what we think of as an instance of Backblaze. When you create a bucket, if you go into the Backblaze interface, you'll notice that bucket has an API endpoint. Part of that endpoint identifies that cluster. So we might have one or two clusters in a physical data center, and there are cluster-wide services.
Starting point is 00:21:14 So again, Tomcat for that API, Cassandra for metadata. So it would be great if we could store files and they just had an internal opaque identifier that said, okay, they're on this pod at this location, whatever, but developers want to have keys and file names and human-readable stuff. So we actually use Cassandra to make an index of those file names against our internal locations of the objects.
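Backblaze hasn't published that schema, so the following is a purely illustrative sketch of the kind of mapping such an index holds: a customer-visible bucket and file name on one side, and internal placement (vault, pod, tome, shard positions) on the other. Every name and field here is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardLocation:
    """Hypothetical: where one shard of a file lives."""
    vault_id: int
    pod_id: int       # which of the vault's 20 pods
    tome_id: int      # which drive position across those pods
    shard_index: int  # e.g. 0-15 data, 16-19 parity in a 16+4 layout

@dataclass(frozen=True)
class FileEntry:
    """Hypothetical index row: human-readable name mapped to internal placement."""
    bucket: str
    file_name: str    # the developer-assigned key
    size: int
    sha1: str
    shards: tuple     # tuple of ShardLocation

# Conceptually, the metadata store is a lookup keyed by (bucket, file name), so
# a request like GET /my-bucket/photos/cat.jpg can be resolved to the shards
# that have to be read (and, if necessary, reconstructed) to serve the object.
index: dict = {}
index[("my-bucket", "photos/cat.jpg")] = FileEntry(
    bucket="my-bucket",
    file_name="photos/cat.jpg",
    size=102_400,
    sha1="da39a3ee5e6b4b0d3255bfef95601890afd80709",  # placeholder checksum
    shards=(ShardLocation(vault_id=7, pod_id=0, tome_id=12, shard_index=0),),  # ...and so on
)
```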
Starting point is 00:21:48 And, oh, I was supposed to, okay, I didn't say HashiCorp Vault. I said off-the-shelf standard software for secrets, because our CISO said, don't tell people what we use. Okay, so about those APIs, how do developers get to their files? Now, we have, as every other storage vendor has,
Starting point is 00:22:20 an S3-compatible API. Now, this implements 38 of the 97 operations. And really, I'd say it's the 80-20 rule, but it's more like a 95-40 rule: these are the roughly 40 operations that 95% of applications use for creating buckets, deleting buckets, creating objects, setting CORS rules, all of that stuff. We only don't implement the really AWS-specific stuff. But it's great. This works with the off-the-shelf AWS CLI.
Starting point is 00:23:01 All of the S3 SDKs, so pull down the Python Boto3 SDK, and most off-the-shelf client apps. So you use the application key ID, and you set the endpoint URL. And this is in common with pretty much every S3-compatible service: you have this endpoint URL that you have to use to tell the software, hey, this is not the default Amazon. And one of the qualities of this API is that, again, it's stateless. There's no concept of a session that the client has with S3. The client must sign every API call. So if you look at what's happening on the wire when you're using S3, you'll see that there's an authorization header, and it's a SHA-256 HMAC. So it basically signs the headers and the body of the request with a timestamp so it can't be replayed, so that when S3 or an S3-compatible
Starting point is 00:24:12 service receives it, it can just verify it against the application key and see that it's from a known client and hasn't been tampered with in transit. So this HMAC gives you message integrity and message authentication.
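A hedged sketch of pointing the stock boto3 SDK at an S3-compatible service; the endpoint is patterned after Backblaze's documented s3.<region>.backblazeb2.com form, and the key ID and application key are placeholders. The SDK computes that HMAC signature for every call, so there is no session to manage.

```python
import boto3

# Same SDK, same calls as against Amazon; the differences are the endpoint URL
# and the credentials (here, a Backblaze application key pair).
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # pattern: s3.<region>.backblazeb2.com
    aws_access_key_id="<applicationKeyId>",
    aws_secret_access_key="<applicationKey>",
)

# Each of these calls is individually signed (an HMAC over the headers, the
# body hash, and a timestamp); there is no session to establish first.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hello from the S3-compatible API")
```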
Starting point is 00:24:53 Now, when Backblaze B2 launched back in 2016, it only had its B2 native API; the S3-compatible API was introduced a few years later. And we made different choices here. We elected to have a simpler RESTful API. And this API actually does have state. So your client exchanges credentials for an authorization token and an API URL, and you have to include that token in subsequent calls to the API. So a different design choice back in 2016, when it wasn't clear that S3 was the de facto standard and that you kind of had to implement it to play. Back then, different storage platforms had their different APIs.
Starting point is 00:25:30 And another really interesting design difference: if you think back to when S3 started, you could do everything against a global domain. I think it's s3.amazonaws.com. So if all of your buckets are behind that one domain, that implies the existence of a really big load balancer to service those requests coming in from anywhere and going out to buckets in a number of data centers. And we didn't want to build, or buy would be more accurate, a massive load balancer in 2016.
Starting point is 00:26:18 So the way the native API works is a little bit different. The client, when it wants to upload data, requests an upload URL. So you say, hey, I want an upload URL, and that gives you the address of one of those individual pods. There is no load balancer. You go to a service and say, hey, I want to upload some data, and it says, oh, I think I'll give you this pod. And you can start posting data directly to that red box on a rack somewhere. And so every pod is accessible from the internet, and that holds today. So we still support the B2 native API.
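A sketch of that flow with Python's requests library, based on the documented B2 native API calls (b2_authorize_account, b2_get_upload_url, and the upload itself); the bucket ID and credentials are placeholders.

```python
import hashlib

import requests

KEY_ID, APP_KEY, BUCKET_ID = "<applicationKeyId>", "<applicationKey>", "<bucketId>"

# 1. Exchange credentials for an authorization token and an API URL. This is
#    the state the B2 native API carries, unlike S3's per-request signing.
auth = requests.get(
    "https://api.backblazeb2.com/b2api/v2/b2_authorize_account",
    auth=(KEY_ID, APP_KEY),
).json()

# 2. Ask for an upload URL; the response points at a specific storage pod.
upload = requests.post(
    auth["apiUrl"] + "/b2api/v2/b2_get_upload_url",
    headers={"Authorization": auth["authorizationToken"]},
    json={"bucketId": BUCKET_ID},
).json()

# 3. POST the file bytes straight to that pod; no load balancer in the path.
data = b"hello from the B2 native API"
requests.post(
    upload["uploadUrl"],
    headers={
        "Authorization": upload["authorizationToken"],
        "X-Bz-File-Name": "hello.txt",
        "Content-Type": "text/plain",
        "X-Bz-Content-Sha1": hashlib.sha1(data).hexdigest(),
    },
    data=data,
)
```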
Starting point is 00:27:06 And that actually gives you some better, lower latency, I think, and some different scalability characteristics from S3. And it's interesting to look at S3 and how they went from a global domain to region domains. And then they encouraged people to put the bucket as the first part of the domain name. So they kind of backed away from the big honking load balancer and realized, the more we can get the client to tell us in that domain about
Starting point is 00:27:46 which bucket they're working with, the easier it is for us. And of course, we implemented the S3-compatible API and had to do a similar thing. But by the time we did that, we had figured out how to do the load balancing in software and didn't have to buy a big expensive box. So we have our clusters. They're sitting in data centers. And then the next layer of abstraction is the region. So this is the only abstraction layer that's actually exposed to customers. So when you create your Backblaze account, you choose whether to have US West or EU Central.
Starting point is 00:28:37 And that's the only part of it that you're exposed to. So there are typically several clusters in a region. And today, right now, in the US, we have four clusters across Sacramento, Phoenix, and now Stockton, just up the road. And we have one in Europe, in Amsterdam. And the Stockton one is really interesting. That's this one here. So you can go to the blog post. We only released this information last week. So we're actually in a data center in Stockton on the San Joaquin River called Nautilus. And the data center actually floats on the river. So you can see those piles there. It goes up and down on those piles. And what you can't see, well, what you're seeing is basically the halls with the servers. And then
Starting point is 00:29:42 underneath here is the cooling system. Now, what happens is that it draws in river water, but the river water doesn't flow around the data center. There's a sealed water cooling system that goes around the data center, and there's a heat exchanger. So this data center is cooled by the river water. The river water flows through, spends about 15 seconds in the system, and is discharged four degrees Fahrenheit warmer than the ambient temperature of the river, in a location where we consulted with the wildlife agencies to minimize any effect.
Starting point is 00:30:27 So very, very cool, the Nautilus data center. We're quite proud of that one. And brand new, so we're literally moving data into it as I speak. And then we've just started talking about our first cluster in the US East region. So we'll have three regions. And this is an actual photo of the data center with empty racks ready to get those storage pods. And by the way, you know, talking about those pods, when we built them, that was the only way to get that drive density into a rack, like those custom pods.
Starting point is 00:31:19 We don't actually have them custom made anymore because the market caught up. So you can buy a chassis from Supermicro with the box and the motherboard and the SATA connectors for the drives and the power supplies and populate it yourself with drives. And they have bigger economies of scale than we do. So we're thrilled now that we don't have to custom manufacture those pods and we can actually buy them off the shelf.
Starting point is 00:31:53 So unfortunately, as beautiful as those red chassis look, this won't have a sea of red, it'll just be gray, which is sad. Okay, so then once you've got your regions, customers can replicate data between regions. So they might be doing this for compliance, continuity. There might be regulations that say you must have your data in two geographically separate locations.
Starting point is 00:32:24 You might be saying, hey, we want a copy of this so it's closer to particular customers. Or you might even just be replicating locally to say, we want a test version of this data, so we set up replication from our production buckets so we always have a live, distinct set for software testing and staging. And Backblaze paid for my gas to get here, so I'll put up a little slide saying, you know, you can get started. We give you 10 gigabytes to play around with
Starting point is 00:33:12 and you don't have to provide a credit card. But yeah, with that, I'm happy to answer any questions to the best of my ability. And here I'll come clean. As the technical evangelist, I understand how this works, and I've spoken to the engineers. I didn't build this, but I do help developers actually use it and use our APIs and take advantage of it.
Starting point is 00:33:45 Yeah, go ahead. Talk a little bit about the hardware and the CPU? Yeah, yeah. So, I was on a call with the engineers on Friday, actually, asking them some of these questions. So it's an Intel CPU.
Starting point is 00:34:08 The CPU doesn't have to be that beefy. The only thing the CPU really does of any consequence is that Reed-Solomon encoding. Really, I/O is the limiting factor. So they have, I'm trying to remember, this is all on the website. So we've been through those six generations of pods, been through various ways of actually connecting those drives to the motherboard. And I think what we do now, I can't remember.
Starting point is 00:34:50 We've been through iterations of SATA connections on the motherboard and basically fanning out through daughter cards. And I think the daughter cards are older, and we went to the connections on the motherboard. But, I mean, literally, if you Google Backblaze pod blog, you will get to the descriptions of the different iterations. I believe we don't do compression, but we do that erasure coding, and we also do a level of deduplication. So we'll hash at a certain level and have pointers rather than storing duplicates.
Starting point is 00:35:40 Yeah. And you can grab a card. I'm happy to answer any questions. Yeah, so that's a really good question. We use SSDs for boot drives. We don't use them for the data drives, although we are always continually evaluating,
Starting point is 00:36:19 re-evaluating that decision. And I think right now they just don't have the, you know, the data per dollar that we would be looking for. And we've been dealing with these hard drives since, what are we now, 2007, so 15 years. So we know their characteristics really, really well. And we actually have now a DriveStats report for the SSDs that we do run, and we publish that data quarterly as well. So believe me, at the point where it makes sense for us, if that tipping point comes,
Starting point is 00:37:07 we'll start using SSDs for data. Yeah? Well, customers don't get a choice. This is a service. So we rack up the drives, and you create your account and start uploading data. So it works just the same as S3. So, I mean, theoretically, a customer could take the pod blueprint and build themselves a pod. Well, they wouldn't even be a customer. Somebody out there could take the pod blueprint and build themselves a pod with 60 SSDs and it could work great. I suspect it would
Starting point is 00:37:52 be quite expensive to put a petabyte of SSD in a box. Yeah. Yeah, absolutely. Can I? No, there we go. Helps if I turn the clicker back on. Which one? This one? Oh, yeah, yeah, yeah. And this is, that comparison is old.
Starting point is 00:38:24 We were talking about, like, doing a new one, but this is from September 2009. So this would be very different now. But this was the point at which we decided, oh, hey, we can't put this on S3. Well, this is the point at which we wrote the blog post explaining why we built our own storage cloud and didn't just use an existing one. But yeah, these numbers are very, very different today. And in fact, some vendors don't even exist anymore, which is sad. Yeah? Oh, right, right.
Starting point is 00:39:14 Yeah, the time period for this? I don't remember. It might be like a three-year life of the hardware kind of thing. That's a really good question. And it would be in that blog post. So again, if you Google Backblaze Petabyte S3, something like that, you'll find that blog post, or take my card, and I can email it to you or give me your card. But yeah, I would guess this is like the three-year service life of the hardware.
Starting point is 00:39:54 Yeah, the cloud works for a lot of people, but you get to a scale where you have to do things yourself for it to be economic. Sorry, yeah. Yeah, good question. I would have to go and look at the drive stats to see. I know we've got a lot of 16s and that's our standard build. We may have some 20s that we're putting in there to evaluate. I'm just curious about how quickly things are adopted.
Starting point is 00:40:34 If your website is a website, you've got to wonder, okay, how far is that going to go? Right, right. Yeah, it's a good question. And to be honest, if you went and listened to Andy Klein's webinar from July, he'll probably explicitly say, you know, whether 16s are the biggest we're using or we've moved, you know, we started using 20s. I could go and ferret around and get that information, but the next speaker will be up. No, and that's,
Starting point is 00:41:18 so that's another really, really good question. Yeah. So we have two basic products. The computer backup that we introduced in 2007, still there. And we have the B2 cloud object storage. Now, storage is what we do. That's all we do. And we partner with other independent cloud vendors to provide a solution. So for CDN in particular, CloudFlare, Fastly, and Bunny.net.
Starting point is 00:41:54 And with those partners, part of that partnership is that we give you free egress of your data from Backblaze to that partner. So in many cases, we have dedicated links inside the data center. Because we would love to not charge egress fees, you know, per gigabyte. You know, I think you get a certain number of gigabytes per month free,
Starting point is 00:42:24 and then you start paying a penny per gigabyte per month. Don't quote me on that. But we would love to not have to charge that, but as soon as that data flows out over the public internet, those providers charge us, and, you know, we have to cover that. Now, with our partners, so CDN with CloudFlare, Fastly, and Bunny.net, we have compute partners in Vultr and Equinix Metal. Rising Cloud is another one, a serverless platform. So in many cases, we do have that direct interconnect.
Starting point is 00:43:10 And so data flows across Ethernet cables within a data center and doesn't cost us anything to serve it up. I will give you a card, or you give me your card, because we're getting into solution engineer territory. And I know a lot of what we offer, but I don't know the details on what it costs and who pays what. But I know we certainly do. So our biggest, if you look at the user agent logs for uploads, Veeam is the biggest by far. So in terms of use cases, archive and backup, media and entertainment.
Starting point is 00:44:10 So we have a lot of media companies storing video files. So 4K video is a terabyte per hour. And I haven't talked about cost, but we are a fifth of the price of S3. So if you've got terabytes of video that you just need to archive, we do very well there. And then there's what we call the origin store use case, which is companies that want to upload content and serve it out of object storage. So we've got a really interesting customer called NodeCraft that does online game servers.
Starting point is 00:44:51 So you can be playing Minecraft and decide, oh, I'm done playing Minecraft on my rented server, I want to play Terraria. And so they grab the state of your game, your world, and put it into B2, and grab your Terraria world and serve it up. So in that use case, we're actually a part of their application rather than just, oh, I put in the key and data goes there. They're actually
Starting point is 00:45:20 writing code against our APIs. And I'm also doing some work with my colleague in evangelism around the data lake use case of storing structured data. So CSV files, and I mentioned Parquet, and using tools to run queries against those directly in object storage. All right, well, any more questions? I could literally talk about this for the rest of the day, but they do have another speaker.
Starting point is 00:45:52 So thank you very much for your attention and have a great rest of the conference. Thank you. Thanks for listening. If you have questions about the material presented in this podcast, be sure and join our developers mailing list by sending an email to developers-subscribe at snia.org.
Starting point is 00:46:14 Here you can ask questions and discuss this topic further with your peers in the storage developer community. For additional information about the storage developer conference, visit
