Storage Developer Conference - #187: More Than Just a Bucket of Bits: Cloud Object Storage turns Sweet Sixteen
Episode Date: April 11, 2023...
Transcript
Hello, everybody. Mark Carlson here, SNIA Technical Council Co-Chair. Welcome to the
SDC Podcast. Every week, the SDC Podcast presents important technical topics to the storage
developer community. Each episode is hand-selected by the SNIA Technical Council from the presentations at our annual Storage
Developer Conference. The link to the slides is available in the show notes at snia.org
slash podcasts. You are listening to SDC Podcast, episode number 187.
Good morning. Thank you for getting up early and making it to this first session of the Storage Developer Conference.
I'm Pat Patterson. I'm the Chief Technical Evangelist at Backblaze.
I've been at Backblaze since the beginning of this year. In the past, I've worked at Sun Microsystems, Salesforce, a startup called StreamSets,
and Citrix. So I've worked with developer communities for about 17 years now, but I'm
still very much a developer myself. So what I'm going to be talking about this morning
is doing a little bit of a review of cloud object storage,
which had an anniversary this year,
turning 16 years since Amazon introduced S3,
and then explaining how our cloud object storage platform actually works at Backblaze.
So this is not a Backblaze pitch.
This is about how the layers in our cloud object storage platform fit together.
So who's heard of Backblaze?
Anyone?
Oh, yeah.
Who's heard of DriveStats?
No one. Yes? No? So I'll mention DriveStats later.
DriveStats, we collect statistics on the, let me think now, 200,000 drives that we spin, and we release that data every quarter.
But I'll talk about that more later. Anyway, so March 2006, and Amazon, up till then,
known primarily as an online retailer, bookstore,
launched Amazon Web Services.
And the very first service was S3.
And that kind of talks to the foundational nature of cloud object storage.
And it's easy to forget just how different this was at the time for developers.
Now, this is from that press release.
And developers reading this noticed new words. So there's no mention of files.
We're writing, reading, and deleting objects. No mention of file names.
Objects are stored and retrieved via a developer-assigned key.
And there was no mention of APIs in the way that developers had understood them until then.
REST and SOAP interfaces designed to work with any internet development toolkit.
So this was very, very different from working with a POSIX file system.
And if we look at that, this is very typical code in C from the time,
and it still is if you're working with local storage.
So you might make a directory with particular permissions.
You might create a file, which gives you a handle or a file descriptor,
and then write some data and close that file.
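The slide shows this in C; just to make the point about state concrete, here is the same flow sketched with Python's low-level os calls. The paths are made up, but the shape of the code is the same.

```python
import os

# A rough sketch of the stateful POSIX flow described above.
# The key point: the file descriptor is operating-system state -- it tracks
# which file is open and the current offset for this particular client.

os.makedirs("/tmp/demo", mode=0o755, exist_ok=True)          # make a directory with permissions

fd = os.open("/tmp/demo/data.bin", os.O_CREAT | os.O_WRONLY, 0o644)  # returns a file descriptor
os.write(fd, b"hello, local storage")                         # the OS tracks the write offset for us
os.close(fd)                                                   # release that per-client state
```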
Now, the problem with this,
from an internet scale perspective,
is that the operating system has to maintain state.
That file descriptor points to the location of the file,
so its path, its inode,
and also the location in the file
that the client is reading or writing.
Now, this is okay for an operating system
on a single machine, but clearly cannot scale to the internet. So S3 came along, and this is probably familiar to everybody in the room.
So you've got this HTTP interface, and you said, okay, I want a bucket. I'll do a put to slash my bucket, and
include some XML payload saying, I want
that in a particular region.
And you would put an object in that bucket by
saying, okay, put slash my bucket, my key, and
you would just enclose the body to upload.
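In SDK terms, those two calls look something like this with boto3; the bucket, key, and region names are placeholders. Under the hood these are exactly the PUT requests just described: a PUT to /my-bucket with an XML location payload, then a PUT to /my-bucket/my-key with the object body.

```python
import boto3

# A minimal sketch of creating a bucket in a region and putting an object in it.
s3 = boto3.client("s3", region_name="us-west-2")

s3.create_bucket(
    Bucket="my-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
s3.put_object(Bucket="my-bucket", Key="my-key", Body=b"hello, object storage")
```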
Now, the only state here is in that list of buckets
and the objects in storage themselves.
Clients can write and read,
and all S3 has to do is go and write data and return data, and doesn't have to
maintain that pointer into a particular object. Now, there's some clear semantics here in the way
that cloud object storage works. So, objects are immutable.
You can't append to an existing object.
You can't seek within it and overwrite some data.
Objects can be versioned.
So, you can replace an object with a new version of itself.
And you can work with very, very large files.
This is true of S3 and many other cloud object storage platforms;
S3 became the de facto API.
So you can upload a single part up to about five gigabytes, and above that, you have to use a
multipart strategy, and that's really driven by HTTP. You know, it's hard to keep an HTTP
connection going indefinitely to upload 10 terabytes of data.
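As a sketch, the standard SDK's transfer helper handles the multipart strategy for you once a file crosses a size threshold; the file name, bucket name, and numbers here are just illustrative, not anything Backblaze-specific.

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Above multipart_threshold the SDK splits the upload into parts and sends them
# as separate HTTP requests, so no single connection has to stay open for the
# whole transfer. Names are placeholders.
s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=5 * 1024**3,    # switch to multipart above ~5 GB
    multipart_chunksize=256 * 1024**2,  # 256 MB parts
)
s3.upload_file("big-video.mp4", "my-bucket", "big-video.mp4", Config=config)
```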
And there are other features, like object lock: you can set a lock to prevent deletion for a given time period.
And now this is quite interesting.
You can set a lock on a file for so many days
and it cannot be deleted.
So you can't delete it through the API,
you can't ask the cloud provider to delete it.
The platform literally makes it impossible. And this is a feature that was introduced over the
past few years to guard against ransomware. So ransomware that would attack your applications,
maybe get access to your credentials for accessing that cloud storage,
and go start deleting things.
But this makes it absolutely impossible.
In fact, certainly on Backblaze B2,
the only way, if you really want to delete that file,
you have to close your account.
There is just no way.
There's no path through the API or through support
to delete a file that's been locked.
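For flavor, here is roughly what writing a locked object looks like through the S3-compatible API with boto3; it assumes a bucket that was created with object lock enabled, and the names and retention period are made up.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Write an object that cannot be deleted or overwritten until the retention
# date passes -- not through the API, and not by asking support.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-locked-bucket",            # bucket created with object lock enabled
    Key="backups/2023-04-11.tar.zst",
    Body=b"backup bytes go here",
    ObjectLockMode="COMPLIANCE",          # compliance mode: nobody can shorten it
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=30),
)
```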
And another thing you'll see, these objects that we manipulate in cloud storage are accessible
via HTTP.
So developers started uploading images and JavaScript and other web resources and serving them to browsers.
And when you do that, nowadays, you need to be aware that browsers enforce cross-origin resource sharing rules. So typically, if you serve up a web page with a script,
it can only access resources from the origin it came from
unless you set up this cross-origin resource sharing.
So you're basically saying it's okay to access this object
from web pages in this domain.
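As a sketch, here is what setting such a rule looks like through an S3-compatible API with boto3; the bucket name and the origin are placeholders.

```python
import boto3

# Allow scripts served from one specific origin to GET objects from this bucket;
# everything else stays blocked by the browser's same-origin policy.
s3 = boto3.client("s3")
s3.put_bucket_cors(
    Bucket="my-assets-bucket",
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedOrigins": ["https://www.example.com"],
                "AllowedMethods": ["GET", "HEAD"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3600,
            }
        ]
    },
)
```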
So how does S3 work?
Well, Amazon certainly isn't telling.
They kind of give hints and release research papers.
But Backblaze isn't Amazon.
We have a history of being very open with what we do and how we do things.
And this session is actually based on completely publicly released information.
So you could have put all this together
if you'd gone and done some research
over the past few weeks.
So we built a storage cloud.
Backblaze was founded in 2007.
And it was originally, the original product was computer backup.
So the initial thinking was, we'll build a backup client that customers just download, start running,
and we'll store the data on S3.
So S3 had been around for a year by then.
And the target price for Backblaze backup was five bucks a month for unlimited data.
And it became clear almost instantly
that this was not possible on S3.
That it was not possible to provide that service
because a petabyte cost $2.8 million in 2009.
And what we actually ended up doing was figuring out,
okay, what's the lowest cost we can put storage online?
And, you know, the raw drives for a petabyte cost $81,000,
and we got it to $117,000.
And this is how.
So step one, we put drives online as cheaply as we could. So we designed these custom chassis
to hold 60 drives.
So back in the day,
60 4-terabyte drives might have been,
what's that, 240 terabytes in a pod,
something like that.
So we didn't build the motherboards.
We bought in the motherboards, power supplies,
had the cases made,
and basically assembled these ourselves as a storage pod.
And we released the specification.
So in a pod, there's a custom web application
running on Apache Tomcat.
There's Debian Linux with EXT4.
And yeah, nowadays with 60 16-terabyte drives,
you can get close to a petabyte of storage. Am I doing my math correctly? Petabyte,
yeah. It's early in the morning. I've only had one cup of tea. Into one of these boxes.
And the Tomcat's running HTTPS. The pods are accessible from the internet, and there's nothing fancy there.
The idea here was to build to a particular price target for robustness and reliability.
There's no iSCSI, NFS, SQL, Fibre Channel. It's basic technology. And you can build your own.
We released the design, the bill of materials, and people have actually done this.
So, you know, you can go out, buy.
This is the cost nowadays.
I looked it up.
60 Seagate 16-terabyte drives would cost you just over $17,000.
The case and the other bits, about three and a half.
So for $21,000, give or take, you can put together a pod,
and your cost per gigabyte is just over two cents.
So this is like our base price before we pay anybody any wages
or pay for the electricity and so on.
Now, right now, we have about two exabytes of storage online.
And like I mentioned earlier, 200,000 disks. And every day, for every disk,
we collect the drive serial number,
the model number, the date, obviously,
whether it failed that day,
and all of the smart attributes.
So every day, we collect 200,000 rows,
if you like, of information.
And then every quarter we release that publicly.
And Andy Klein, our god of drive stats,
he does a webinar and we release the raw data
and we also talk about our interpretation of that data.
So which drive models we've introduced this quarter, how they're shaping up for reliability,
the annual failure rates for the different drives we're spinning, any lessons we can learn.
So in the last DriveStats webinar, which was in July, he was talking about how, so we're all familiar
with the bathtub reliability curve: devices tend to fail more
often when they're brand new, then the failure rate flattens out and they
have like a working life, and towards the end of life, the failure rate rises. And what we've noticed
from the statistics we've gathered is that that curve is now very asymmetrical. So we're just not
seeing as many drives fail in the early part of their lifetime. The curve starts far lower down
and then, obviously, rises towards the end of their working life.
So we share this information with the community.
And yeah, just go to backblaze.com slash drivestats.
All the data is there.
It's there as zipped CSV files,
so one CSV file per day, zipped up, with a zip file per quarter.
I'm working on a project to actually
give you direct access to that data
in a B2 bucket in a form that is queryable.
So I've already written one blog post about this.
If you search for Backblaze Parquet, P-A-R-Q-U-E-T, you'll see
that. And my goal is to make that data available so you can literally run SQL queries against,
let's see, we started in 2014, so nearly eight years of data for a whole range of drives.
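For a flavor of what that enables, here is a hypothetical query with DuckDB, assuming the data has landed as Parquet files and using column names from the published CSV schema; the path is a placeholder.

```python
import duckdb

# Drive-days and failures per model, straight off the Parquet files.
# Each row of DriveStats is one drive on one day; 'failure' is 0 or 1.
duckdb.sql("""
    SELECT model,
           COUNT(*)     AS drive_days,
           SUM(failure) AS failures
    FROM 'drivestats/*.parquet'
    GROUP BY model
    ORDER BY failures DESC
    LIMIT 10
""").show()
```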
It's very cool.
Very cool for storage geeks, actually.
If I talk about this in any other audience,
people are just like.
Right, so we've got our pod.
What's the next layer?
Well, 20 pods is a vault.
So vaults are software.
We just rack up these pods, plug them into the network,
and some software creates the concept of a vault.
And a vault is the way that we store data with redundancy.
So a vault is 20 pods. And then we've got this concept of a tome,
which is 20 drives where each is in the same position in each storage pod. So drive one
in each of those 20 forms tome one. And what we do when you upload data is split the data into 20 shards and then use a number of those shards for redundancy, essentially parity.
So we use something called Reed-Solomon erasure coding that I'll talk about in a moment.
And what we found was we used to use 17 data shards and three parity,
but drives are getting bigger.
And that means that rebuilding the data
on a failed drive now takes longer.
It can take several days, over a week,
to rebuild a 16 terabyte drive from those parity bits.
So therefore, to keep the durability,
you have to have more parity shards.
So now with 16-terabyte or greater drives,
we use 16 data and 4 parity.
So we write those shards across a tome,
and we designed this for 11 nines of data durability. So with 960 terabytes
per pod, 16 data pods per vault, okay, so the math works out, 4 parity, 16 data, so that's about
15 petabytes in a vault. Now this is that Reed-Solomon encoding,
and there's actually a video where Brian Beach,
one of our architects, explains this way better than I can,
but it's based on matrix algebra.
And what you have is you have this encoding matrix,
and you'll notice that the top part of the matrix, so we've just done it with four by four here,
so four data shards and two parity shards.
So what you have is the identity matrix here,
and then you have these special encoding rows.
And the way the math works out is that when you multiply these two matrices,
you end up with the data and then two parity rows.
And as Brian explains here way better than I can,
if you lose any one or two of the rows, of the data rows,
you can reconstruct them by basically multiplying what you've got left by the inverse of this matrix, and you get back this one. It's really, really elegant matrix algebra.
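To make that concrete, here is a small numerical sketch of the same matrix trick using ordinary real arithmetic in NumPy. Production Reed-Solomon implementations do this over a Galois field, GF(2^8), so that shards stay bytes, but the linear algebra is the same idea; the shard values here are made up.

```python
import numpy as np

n, k = 6, 4                                   # 4 data shards + 2 parity shards
x = np.arange(n).reshape(-1, 1)               # distinct evaluation points 0..5
V = (x ** np.arange(k)).astype(float)         # 6x4 Vandermonde matrix
E = V @ np.linalg.inv(V[:k])                  # normalize so the top 4x4 is the identity

data = np.array([[1.0], [2.0], [3.0], [4.0]]) # the 4 data shards
shards = E @ data                             # 6 rows: 4 data rows + 2 parity rows

# Simulate losing shard 1 and shard 3 -- any two rows can go missing.
surviving = [0, 2, 4, 5]
recovered = np.linalg.inv(E[surviving]) @ shards[surviving]
assert np.allclose(recovered, data)           # multiply by the inverse, get the data back
```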
So we've got our vault full of pods and tomes. Now we build them into a cluster.
This is the one part of this that you don't see a lot of in our blog posts. But clusters, again,
are software. We're just bringing these things together into the concept of a cluster.
So a cluster has around 100 vaults.
It's not a fixed number.
So you could have 2,000 pods, 120,000 hard drives.
And this is what makes it scalable.
So you could start your cluster off with, say, 50 vaults and then build it out.
A cluster is what we think of as an instance of Backblaze.
When you create a bucket, if you go into the Backblaze interface, you'll notice that bucket has an API endpoint.
Part of that endpoint identifies that cluster.
So we might have one or two clusters
in a physical data center,
and there are cluster-wide services.
So again, Tomcat for that API,
Cassandra for metadata.
So it would be great if we could store files
and they just had an internal opaque identifier
that said, okay, they're on this pod
and this location, whatever,
but developers want to have keys and file names
and human-readable stuff. So we actually use Cassandra to make an index of those file names
against our internal location of the objects.
And, oh, I was supposed to, okay, I didn't say HashiCorp Vault.
I said off-the-shelf standard software for secrets.
Because our CISO said,
don't tell people what we use.
Okay, so about those APIs,
how do developers get to their files?
Now, we have, as every other storage vendor has,
an S3-compatible API. Now, this implements 38 of the 97 operations.
And really, I'd say 80-20 rule,
but it's more like a 40-95 rule.
These are the roughly 40 operations that 95% of applications use
for creating buckets, deleting buckets,
creating objects, setting CORS rules, all of that stuff. We just don't implement the really AWS-specific stuff.
But it's great.
This works with off-the-shelf AWS CLI.
All of the S3 SDKs, so pull down the Python Boto3 SDK, and most off-the-shelf client apps.
So you use the application key ID, and you set the endpoint URL. And this is in common with
pretty much every S3-compatible service. You have this endpoint URL you have to use
to tell the software, hey, this is not the default Amazon.
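As a rough sketch with the Python SDK: the endpoint shown follows the Backblaze S3 endpoint naming, but treat it, along with the bucket name, file name, and credentials, as placeholders to be replaced with the values from your own bucket.

```python
import boto3

# Point the stock S3 SDK at an S3-compatible service by overriding the endpoint.
b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # example endpoint format
    aws_access_key_id="<application key ID>",
    aws_secret_access_key="<application key>",
)
b2.upload_file("drivestats.zip", "my-bucket", "drivestats.zip")
```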
And one of the qualities of this API is that, again, it's stateless. There's no concept of a session
that the client has with S3. The client must sign every API call. So if you look at what's happening on
the wire when you're using S3, then you'll see that there's an authorization header,
and it's a SHA-256 HMAC. So it basically signs the headers and the body of the request with a timestamp so it can't be replayed, so that when S3 or an S3-compatible
service receives this, they can just verify this against the application key and see that
it's from a known client and it hasn't been tampered with in transit.
So this HMAC gives you message integrity
and message authentication.
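Here is the core of that idea sketched in Python. The real Signature Version 4 scheme canonicalizes the request and derives the signing key through several chained HMAC steps, so this is a simplified illustration of the principle, not the actual algorithm; the key and request details are made up.

```python
import hashlib
import hmac
from datetime import datetime, timezone

secret_key = b"application-key-goes-here"
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

# Sign the request method, path, timestamp, and a hash of the body.
string_to_sign = "\n".join([
    "PUT",                                       # method
    "/my-bucket/my-key",                         # path
    timestamp,                                   # stops an old request being replayed
    hashlib.sha256(b"object body").hexdigest(),  # hash of the payload
])

signature = hmac.new(secret_key, string_to_sign.encode(), hashlib.sha256).hexdigest()
# The signature travels in the Authorization header; the service recomputes it
# from the same inputs and the stored key, so tampering in transit breaks the match,
# and no server-side session is needed.
```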
Now, when Backblaze B2 launched back in 2016,
it only had its B2 native API.
And the S3 compatible API was introduced a few years later.
And we made different choices here.
We elected to have a simpler RESTful API.
And this API actually does have state.
So your client exchanges credentials
for an authorization token and an API URL. So you
have to include that in subsequent calls to the API. So a different design choice back in 2016,
when it wasn't clear that S3 was the de facto standard and you kind of had to do that to play.
Back then, different storage platforms
had their different APIs.
And another really interesting design difference is
if you think back to when S3 started,
you could do everything against a global domain.
So I think it's s3.amazonaws.com. So if all of your
buckets are behind that one domain, that implies the existence of a really big load balancer to
service those requests coming in from anywhere and going out to buckets in a number of data centers.
And we didn't want to build, or buy would be more accurate,
a massive load balancer in 2016.
So the way the native API works is a little bit different.
The client, when it wants to upload data, requests
an upload URL. So you say, hey, I want an upload URL. And that gives you the address of one of
those individual pods. There is no load balancer. You go to a service, say, hey, I want to load some data,
and it says, oh, I think I'll give you this pod.
And you can start posting data directly to that red box on a rack somewhere.
And so every pod is accessible from the Internet, and that holds today.
So we still support the B2 native API.
And that actually gives you some better,
lower latency, I think, and some different
scalability characteristics from S3.
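For illustration, the native flow looks roughly like this with plain HTTP calls (v2 API paths assumed); the key ID, key, and bucket ID are placeholders.

```python
import hashlib
import requests

# 1. Exchange credentials for an auth token and an account-specific API URL.
auth = requests.get(
    "https://api.backblazeb2.com/b2api/v2/b2_authorize_account",
    auth=("<application key ID>", "<application key>"),
).json()

# 2. Ask for an upload URL -- this is the step that hands you a specific pod.
upload = requests.post(
    auth["apiUrl"] + "/b2api/v2/b2_get_upload_url",
    headers={"Authorization": auth["authorizationToken"]},
    json={"bucketId": "<bucket ID>"},
).json()

# 3. POST the data straight to that pod; no big load balancer in the path.
body = b"hello from the native API"
requests.post(
    upload["uploadUrl"],
    headers={
        "Authorization": upload["authorizationToken"],
        "X-Bz-File-Name": "hello.txt",
        "Content-Type": "text/plain",
        "X-Bz-Content-Sha1": hashlib.sha1(body).hexdigest(),
    },
    data=body,
)
```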
And it's interesting to look at S3
and how they went from a global domain to region domains.
And then they encouraged people to put the bucket as the first part of the domain name.
So they kind of backed away from the big honking load balancer and realized,
the more we can get the client to tell us in that domain about
which bucket they're working with, the easier it is for us. And of course, we implemented S3
compatible and had to do a similar thing. But by the time we did that, we had figured out how to
do the load balancing in software and didn't have to buy a big expensive box.
So we have our clusters.
They're sitting in data centers.
And then the next layer of abstraction is the region.
So this is the only abstraction layer that's actually exposed to customers.
So when you create your Backblaze account, you choose whether to have US West or EU Central.
And that's the only part of it that you're exposed to. So a cluster, there are typically several clusters in a region.
And today, right now, in the US, we have four clusters across Sacramento, Phoenix, and now
Stockton, just up the road. And we have one in Europe in Amsterdam. And the Stockton one is really
interesting. That's this one here. So you can go to the blog post. We only posted this last week.
We only released this information last week. So we're actually in a data center in Stockton
on the San Joaquin River called Nautilus. And the data center actually floats
on the river. So you can see those piles there. It goes up and down on those piles.
And what you can't see, so what you're seeing is basically the halls with the server. And then
underneath here is the cooling system.
Now, what happens is that it draws in river water,
but the river water doesn't flow around the data center. There's a sealed water cooling system that goes around the data center,
but there's a heat exchanger.
So this data center is cooled by the river water.
The river water flows through, spends about 15 seconds in the system,
and is discharged four degrees Fahrenheit warmer than the ambient temperature of the river
in a location where we consulted with the wildlife agencies to minimize any effect.
So very, very cool, the Nautilus data center.
We're quite proud of that one.
And brand new, so we're literally moving data into it as I speak.
And then we've just started talking about our first cluster in the US East region.
So we'll have three regions.
And this is an actual photo of the data center with empty racks ready to get those storage pods. And by the way, you know, talking about those pods,
when we built them, that was the only way to get that drive density into a rack,
like those custom pods.
We don't actually have them custom made anymore because the market caught up.
So you can buy a chassis from Supermicro with the box and the motherboard
and the SATA connectors for the drives and the power supplies
and populate it yourself with drives.
And they have bigger economies of scale than we do.
So we're thrilled now that we don't have to
custom manufacture those pods
and we can actually buy them off the shelf.
So unfortunately, as beautiful as those red chassis look,
this won't have a sea of red, it'll just be gray,
which is sad.
Okay, so then once you've got your regions,
customers can replicate data between regions.
So they might be doing this for compliance, continuity.
There might be regulations that say you must have your data
in two geographically separate locations.
You might be saying, hey, we want a copy of this
so it's closer to particular customers.
Or you might be even just replicating locally to say,
we want a test version of this data.
So we want to set up replication from our production buckets so we always have a live, distinct set for software testing and staging. And, you know, Backblaze paid for my gas to get here.
So I'll put up a little slide saying,
you know, you can get started.
We give you 10 gigabytes to play around with
and you don't have to provide a credit card.
But yeah, with that, I'm happy to answer
any questions to the best of my ability.
And here I'll come clean.
As the technical evangelist, I understand how this works,
and I've spoken to the engineers.
I didn't build this, but I do help developers actually use it
and use our APIs and take advantage of it.
Yeah, go ahead.
Talk a little bit about the hardware test
and CPU.
Yeah, yeah.
So, yeah, yeah.
So, it's, and I was on a call with the engineers
on Friday actually asking them some of these questions.
So it's an Intel CPU.
CPU doesn't have to be that beefy.
The only thing the CPU really does of any consequence
is that Reed-Solomon encoding.
It's really IO is the limiting factor.
So they have, trying to remember, this is all on the website.
So we've been through like those six generations of pods,
been through various ways of actually connecting those drives to the motherboard.
And I think what we do now, I can't remember.
We've been through iterations of SATA connections on the motherboard and basically fanning out through daughter cards.
And I think the daughter cards are older,
and we went to the connections on the motherboard.
But it's, I mean, literally, if you Google Backblaze pod blog,
you will get to the descriptions of the different iterations.
I believe we don't do compression, but we do that erasure coding, and we also do a level of deduplication.
So we'll hash at a certain level
and have pointers rather than storing duplicates.
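Just to illustrate the general idea being described, and not Backblaze's actual implementation, chunk size, or hash choice, content-hash deduplication looks something like this toy sketch:

```python
import hashlib

# Store each distinct chunk once, keyed by its content hash; callers keep
# the digest as a pointer rather than a second copy of the data.
store: dict[str, bytes] = {}

def put_chunk(chunk: bytes) -> str:
    digest = hashlib.sha256(chunk).hexdigest()
    store.setdefault(digest, chunk)   # only the first copy is actually stored
    return digest                     # the pointer the caller keeps

refs = [put_chunk(b"same data"), put_chunk(b"same data"), put_chunk(b"other")]
assert len(store) == 2               # the duplicate chunk was stored only once
```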
Yeah.
And you can grab a card.
I'm happy to answer any questions.
Yeah.
Yeah, so that's a really good question.
We use SSDs for boot drives.
We don't use them for the data drives,
although we are always continually evaluating,
re-evaluating that decision.
And I think right now they just don't have the, you know,
the data per dollar that we would be looking for.
And we've been dealing with these hard drives for, what are we now, 2007, 15 years.
So we know their characteristics really, really well. And so we actually have now a DriveStats report for SSDs with the SSDs that we do run.
And we publish that data quarterly now as well.
So at the point, believe me, at the point where it makes sense for us,
if that tipping point comes,
we'll start using SSDs for data.
Yeah?
Well, customers don't get a choice.
This is a service.
So we rack up the drives, and you create your account and start uploading data.
So it works just the same as S3. So, I mean, theoretically, a customer could take the pod blueprint and build themselves a pod.
Well, they wouldn't even be a customer.
Somebody out there could take the pod blueprint and build themselves a pod with 60 SSDs and it could work great. I suspect it would
be quite expensive to put a petabyte of SSD in a box. Yeah. Yeah, absolutely.
Can I?
No, there we go.
Helps if I turn the clicker back on.
Which one?
This one?
Oh, yeah, yeah, yeah.
And this is, that comparison is old.
We were talking about, like, doing a new one, but this is from September 2009.
So this would be very different now.
But this was the point at which we decided, oh, hey, we can't put this on.
Well, this is the point at which we wrote the blog post explaining why we built our own storage cloud
and we didn't just use an existing one.
But yeah, these numbers are very, very different today.
And in fact, some vendors don't even exist anymore, which is sad.
Yeah? Oh, right, right.
Yeah, the time period for this?
I don't remember.
It might be like a three-year life of the hardware kind of thing.
That's a really good question. And it would be in that blog post.
So again, if you Google Backblaze Petabyte S3, something like that,
you'll find that blog post, or take my card, and I can email it to you
or give me your card.
But yeah, I would guess this is like the three-year service life of the hardware.
Yeah, the cloud works for a lot of people,
but you get to a scale where you have to do things yourself for it to be economic.
Sorry, yeah.
Yeah.
Yeah, good question. I would have to go and look at the drive stats to see.
I know we've got a lot of 16s and that's our standard build.
We may have some 20s that we're putting in there to evaluate.
I'm just curious about how quickly things are adopted.
If your website is a website, you've got to wonder, okay, how far is that going to go?
Right, right.
Yeah, it's a good question. And to be honest, if you went and listened to Andy Klein's webinar from July,
he'll probably explicitly say, you know, whether 16s are the biggest we're using
or we've moved, you know, we started using 20s.
I could go and ferret around and get that information,
but the next speaker will be up.
No, and that's,
so that's another really, really good question.
Yeah.
So we have two basic products.
The computer backup that we introduced in 2007, still there.
And we have the B2 cloud object storage.
Now, storage is what we do.
That's all we do. And we partner with other independent cloud vendors to provide a solution.
So for CDN in particular, CloudFlare, Fastly, and Bunny.net.
And with those partners, part of that partnership is that we give you
free egress of your data from Backblaze to that partner.
So in many cases, we have dedicated links
inside the data center.
So that, because we would love to not charge
egress fees, you know, per gigabyte.
You know, I think you get so many gigabytes free,
and then you start paying a penny per gigabyte.
Don't quote me on that.
But we would love to not have to charge that,
but as soon as that data flows out over the public Internet,
like those providers charge us, and, you know, we have to cover that. Now, with our partners, so CDN, CloudFlare, Fastly, Bunny.net,
we have compute partners in Vultr and Equinix Metal.
Rising Cloud is another one, a serverless platform.
So in many cases, we do have that direct interconnect.
And so data flows across Ethernet cables within a data center and doesn't cost us anything to serve it up. I will give you a card, or you give me your card,
because we're getting into solution engineer territory.
And I know a lot of what we offer, but I don't know the details on what it costs
and who pays what.
But I know we certainly do.
So our biggest, if you look at the user agent logs
for uploads, Veeam is the biggest by far.
So in terms of use cases, archive and backup, media and entertainment.
So we have a lot of media companies storing video files.
So 4K video is a terabyte per hour.
And I haven't talked about cost, but we are a fifth of the price of S3. So if you've got terabytes of video that you just need to archive,
we do very well there.
And what we call the origin store use case,
which is companies where they want to upload.
So we've got a really interesting customer called NodeCraft
that does online game servers.
So you can be playing Minecraft and decide,
oh, I'm done playing Minecraft on my rented server.
I want to play Terraria.
And so they grab your state of your game, your world,
and put it into B2
and grab your Terraria and serve it up. So in that use case, we're actually
a part of their application rather than just, oh,
I put in the key and data goes there. They're actually
writing code against our APIs. And I'm also doing some work with my colleague in evangelism
around the data lake use case of storing structured data.
So CSV files, and I mentioned Parquet,
and using tools to run queries against those
directly in object storage.
All right, well, any more questions?
I could literally talk about this for the rest of the day,
but they do have another speaker.
So thank you very much for your attention
and have a great rest of the conference.
Thank you.
Thanks for listening.
If you have questions about the material
presented in this podcast,
be sure and join our developers mailing list
by sending an email to developers-subscribe at snia.org.
Here you can ask questions and discuss this topic further
with your peers in the storage developer community.
For additional information about the storage developer conference, visit