Screaming in the Cloud - Engineering Around Extreme S3 Scale with R. Tyler Croy
Episode Date: January 13, 2026

R. Tyler Croy, a principal engineer at Scribd, joins Corey Quinn to explain what happens when simple tasks cost $100,000. Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don't work anymore. Tyler shares how, with this much data, you can't just throw money at the problem; you have to engineer your way out.

About R. Tyler: R. Tyler Croy leads infrastructure architecture at Scribd and has been an open source developer for over 14 years. His work spans the FreeBSD, Python, Ruby, Puppet, Jenkins, and Delta Lake communities. Under his leadership, Scribd's Infrastructure Engineering team built Delta Lake for Rust to support a wide variety of high performance data processing systems. That experience led to Tyler developing the next big iteration of storage architecture to power large-scale fulltext compute challenges facing the organization.

Show Highlights:
01:48 Scribd's 18-Year History
04:00 One Document Becomes Billions of Files
05:47 When Normal Physics Stop Working
08:02 Why S3 Metadata Costs Too Much
10:50 How AI Made Old Documents Valuable
13:30 From 100 Billion to 100 Million Objects
15:05 The Curse of Retail Pricing
19:17 How Data Scientists Create Growth
21:18 De-Normalizing Data Problems
25:29 Evolving Old Systems
27:45 Billions Added Since Summer
29:29 Underused S3 Features
31:48 Where to Find Tyler

Links:
Scribd: https://tech.scribd.com
Mastodon: https://hacky.town/@rtyler
GitHub: https://github.com/rtyler
Sponsored by: duckbillhq.com
Transcript
And so even the simple, like S3 will gladly compute checksums for you, but to do so,
it's a batch operation.
And a batch operation costs $1 per million objects.
If you have a hundred billion objects, all of a sudden you're faced with, if I need
checksums for all of this data, I've got to go drop 100K just to compute checksums, because
a billion is just so astronomically large that just the simple act of getting checksums for
the data you already have in S3 becomes a very serious pricing discussion,
if you aren't ready to drop that kind of coin.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
I am joined today by an early Duckbill Group customer,
a recent speaker at the inaugural San Francisco FinOps meetup,
I suppose we'll call it.
R. Tyler Croy is an infrastructure architect over at Scribd.
Tyler, how are you?
I'm doing all right.
This episode is sponsored in part by my day job, Duck Bill.
Do you have a horrifying AWS bill?
That can mean a lot of things.
Predicting what it's going to be.
Determining what it should be.
Negotiating your next long-term contract with AWS.
Or just figuring out why it increasingly resembles a phone number,
but nobody seems to quite know why that is.
To learn more, visit duckbillhq.com.
Remember, you can't duck the Duckbill bill,
which my CEO reliably informs me is absolutely not our slogan.
You've been over at Scribd for a while, six years and change.
The company boldly answering the question,
what if S3 had a good user interface?
Yeah, I mean, I've joined the most recent incarnation of Scribd.
Scribd has been a lot of things.
We've been around for 18 years.
I think Scribd has existed longer than most other companies.
So, I mean, we've got the lack of a vowel in our name.
We're from the Flickr era, the Scribd era, the world before the dot-LYs.
Well, as we look at them, vowels are expensive.
Why buy them if you don't have to?
So your talk was fascinating because it, of course, focused heavily on economics,
and it also focused on S3, the reason that many of us don't sleep anymore.
And it was an interesting story, just at the level of scale that you're talking about:
things that people don't consider to be expensive got expensive.
Specifically, request charges when you're doing things in buckets that, for those who are unfamiliar,
effectively have a crapton of uploaded text documents at that kind of scale.
I think crapton is the metric that Storage Lens shows you at our scale.
So, for the uninitiated, Scribd has user-uploaded content, documents typically, and presentations through our SlideShare product,
going back 18 years.
And so every day, thousands and thousands of new documents,
legal documents, study guides, et cetera, get uploaded.
And those have been quietly accumulating
in our S3 storage layer for a long time
without anybody really paying attention to it.
And so a year or so ago, I started to really look at,
like, where's a big dent I can make in Cost Explorer?
Like if I'm gonna take on something big,
what's the biggest thing?
And I saw a storage cost.
Cloud economics:
instead of pulling up an AWS bill and
going alphabetically, if you start with the big numbers first, that tends to have impact.
For years, I was asked about people's random Alexa for business spend.
It is $3 a month.
What are you doing?
Yeah, I mean, most companies, I think, have like EC2, S3, Aurora.
Like, those are the big things.
But once I started to look into our actual S3 spend, I knew we had a lot of content.
Like, we talk about the hundreds of millions of documents that have been uploaded over the years.
But when I actually looked into what was stored in S3,
we're talking hundreds of billions of objects
because every single document that you upload,
we have like format conversion,
we have accessibility changes that get made.
And so every single document became this diaspora
of related objects in S3.
And suddenly, like, batch operations,
intelligent tiering,
anything that has a per-object charge associated with it
becomes wildly expensive in a way that requires
you to step back and think about how we should be doing this. How should we be storing
this data? Because that shotgun of objects into S3 only works for the first billion. And then after that,
you might have to think about what you're doing. The ergonomics of the request charges are very different
too. I think philosophically, we tend to see, you know, on some level, oh, if I stuff an exabyte of data
into S3, that's going to be expensive. But I think it's hard for humans to wrap their
heads around the idea of hundreds of billions of objects, just because the difference between
a million and a billion is about a billion. Past a certain point of scale, you do an s3 ls to see
what objects you have there and it'll complete right around the time the earth crashes into the
sun. It's just not something that makes sense. But on the other side of it, I overoptimized
for a lot of this stuff because I think at the Duckbill Group now, our total S3 bill is something like
$110 a month right now. And we can do basically
anything we want to S3. It doesn't materially move the needle on our business because we are not
Scribd. The S3, like, you can abuse S3 for terabytes, petabytes even, really. Like, you can put
so much into S3. It's so incredibly cheap. It's so incredibly reliable. And then there's this,
like, something happens. And I don't know when it happened at Scribd because I wasn't paying attention.
I'm only looking back in history. Something happened where we went from like the first billion to the
next billion to the next billion. And once you're in the tens or hundreds of billions of objects,
like it's like quantum physics. Like all of a sudden, all of the physics that you've learned
no longer applies, you're in a completely different ballgame and you've got to figure out how does
this world work? Because the world I thought I had doesn't exist anymore. Yes. Oh, very much so.
You also have the problem where, especially when we're talking about all things billing,
it's a lot of hurry up and wait. Okay, we're going to make some transitions. We're going to try
something here and see how it goes. And then you have to, in some cases, wait for objects to
age out into the next tier or there's a bunch of request charges that suddenly mean for this
month your S3 bill is just in the stratosphere and you get the angry client screaming phone call.
Like, what have you done? Yes, there is an amortization story here. Give it time. Don't move it back.
Yeah. And actively, as we record this, I am waiting for the first 30 days on some reclassing
to occur for intelligent tiering.
And I can't wait until next month
because I'm hoping for a big drop.
Yeah.
This stuff has become magic,
but you have to speak the right incantations around it.
And past a certain point of scale,
a lot of things,
just in the way that AWS talks about this,
no longer make sense.
I asked you about using S3 metadata or S3 tables
for a lot of this stuff.
And your response was the polite businessperson
equivalent of, kid, that's adorable.
Do you have any idea what that would cost,
just on the sheer number of objects?
Because that's not
usually the first dimension
we tend to think about historically.
Now, metadata and tables are changing that
and vector buckets and directory buckets
and Lord knows what else.
But that changes the way that we think
about a service that honestly is older
than some people listening to this show.
I mean, yeah, a lot of these things really
break down in ways that are challenging.
The metadata and
inventory and other, like, operations are things that have happened in S3 over the last few years that are really interesting.
They're really, really great when you're sub-billion, sub-billion objects.
But like S3 Metadata, the questions I have, or want to ask, of these buckets are not worth the amount of money it would take to ingest into S3 Metadata and then to continue storing, because they're just so astronomically huge.
I was looking at a problem with some of these older objects.
Sometime in 2024, you probably talked about this,
every upload to S3 got a checksum automatically.
You don't have to do anything.
Before that, you may or may not have a checksum.
If you were using a proper, you know, AWS SDK, you did.
But if you weren't, who knows?
And at some point...
Who would use anything but the latest correct SDK?
Why would you interact with S3 except through the official SDK?
But when you go back to, like, Scribd's S3
bucket, it was created the year S3 was created and announced. And so when we're going back
that far in time, we have billions of objects that don't have checksums. And so even the simple,
like S3 will gladly compute checksums for you. But to do so, it's a batch operation. And a batch
operation costs $1 per million objects. If you have a hundred billion objects, all of a sudden,
you're faced with, if I need checksums for all of this data, I've got to go drop 100K just
to compute checksums, because a billion is just so astronomically large that just the simple
act of getting checksums for the data you already have in S3 becomes a very serious pricing
discussion if you aren't ready to drop that kind of coin.
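[For a rough sense of the arithmetic Tyler is describing, here is a minimal back-of-the-envelope sketch in Python, assuming only the $1-per-million-objects batch operations rate quoted above; any per-job or per-request charges for the underlying operations would be on top of this.]

```python
# Back-of-the-envelope cost of checksumming a whole bucket with S3 Batch Operations,
# using only the per-object rate quoted in the episode.
objects = 100_000_000_000            # ~100 billion objects
rate_per_million_usd = 1.00          # $1 per million objects processed
cost = objects / 1_000_000 * rate_per_million_usd
print(f"${cost:,.0f}")               # -> $100,000
```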
I have to think, just, again, you have lived in this space.
I only dabble from time to time.
But my default approach when I start thinking about this sort of problem is the idea of lazy
checksumming, lazy conversions.
But when I was at Expensify back in 2012, something that we learned was that the typical
life cycle of a receipt was it would be written once and it would be read either one time
or never, except for the end of the long tail, where suddenly things get read a lot years later,
during an audit of some sort or when there's a question of malfeasance. So you could never
get rid of the data. You had to have it with the expectation it would never get read. I have to imagine
the majority of stuff that gets uploaded to Scribd, in many cases, it's there, but it's not accessed.
I would say that's a pretty good assumption.
The interesting thing about user-uploaded content and user-uploaded documents in particular
is the long tail is years and years long.
A study guide that was created for Catcher in the Rye in 2010 is still probably just as useful in 2025 as it was in 2010,
because Catcher in the Rye is a classic and people still need to access it.
Or someone will post a link to it on Reddit one year and the entire internet piles in on it.
Yeah, it's impossible to predict what's going to hit.
It's impossible to predict, but one of the things that's been really interesting about Scrib's
particular flavor of content is in the last couple of years, large language models have
become really useful for what Scrib does. I won't speak to the utility of large language
models in other domains, but the utility in what Scrib does in particular has made old
documents suddenly much more useful, much more interesting, much more relevant to
users today than they have ever been before.
Because before this, you did have to sort of rely on a Reddit post or something to,
like, reinvigorate a document, right?
But now, if we look at all of this long, almost 20-year history of Scribd, if you look
at that like a knowledge base, then all of a sudden we're looking at a very broad, horizontal
access pattern that we might want to be doing for data science use cases or large language
model based applications, and that again flips the access patterns that you might
have in a traditional user-generated content site on its head and makes the storage discussion
so much more challenging, but, like, in a fun way.
One of the more horrifying parts of your talk was when you mentioned that you did a lot of digging
into various file formats. You were talking about, even looking at ISOs at one point.
I'm like, oh, hey, someone knows what Joliet standard is in this day and age. Imagine that.
But you picked Parquet and then started using S3 byte offsets on reads to be able to just grab the end
of an object and then figure out where exactly you'd go and grab things from exploded document
views. It was a very in-depth approach. It sounds like, not to be unkind, you sort of rebuilt S3 Tables and S3
Metadata from first principles, because those things didn't exist back then. And now that they do,
they're not even close to being economically viable. Yeah, I think if they were economically
viable, I mean, S3 tables would be really interesting for this use case. The really novel thing
about Apache Parquet files is a lot of what we are doing at Scribd with Apache Parquet is not new
territory. It's not necessarily novel. It's how the quote-unquote lakehouse architectures of, you know,
Delta tables and Iceberg tables and things like that are doing really, really fast queries and
things like that on top of S3 object stores or, you know, Azure blah, blah, blah. So, like, the
infrastructure for picking needles out of these Parquet file haystacks already exists. Some of the work that I've been doing
is reusing some of the same principles,
but bringing it to a wildly different domain
of this sort of very, very large content library that Scribd has,
and using that as a way to reduce the object count.
The whole thing that I was trying to get across
at the FinOps Meetup that you all invited me down for
was the problem of X is really expensive at 100 billion objects.
My solution has been, okay, not like to go negotiate with the team
or try to find a way to make that cheaper,
but try to get the object count actually lower.
Because if you bring it from 100 billion to 100 million,
then we're in a ballpark where you can take advantage of intelligent tiering,
and batch operations become much easier to do.
All sorts of things become simpler if you can reduce that object count.
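[To make the per-object math concrete, a minimal sketch, assuming S3 Intelligent-Tiering's monitoring charge of roughly $0.0025 per 1,000 objects per month; that rate is not from the episode and should be checked against current pricing. It shows why dropping from 100 billion to 100 million objects changes the conversation.]

```python
# Illustrative only: Intelligent-Tiering's per-object monitoring cost at two object counts.
# The ~$0.0025 per 1,000 objects/month rate is an assumption; verify against current AWS pricing.
rate_per_1000_per_month = 0.0025

for objects in (100_000_000_000, 100_000_000):   # 100 billion vs. 100 million objects
    monthly = objects / 1_000 * rate_per_1000_per_month
    print(f"{objects:>15,} objects -> ~${monthly:,.0f}/month just to monitor")
# ~$250,000/month at 100 billion objects, ~$250/month at 100 million.
```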
And when I was looking at other things like ISOs,
I mean, the classics are a classic for a reason.
They never die.
When I was looking at that, like zip, tar, etc.,
I wasn't able to find a way to get random file selections within S3 objects
to work nearly as effectively as I can with Parquet.
Whereas with Parquet, if I know what file I'm looking for,
I can get it extremely quickly from within,
let's say, a 100 megabyte file;
I can go grab you 25 kilobytes with the same level
of, I would say, performance as most other S3 object accesses.
Because S3 has supported range requests for a long time.
And one of the bits of trivia
that I was very pleased to discover, which really, really made
this work well, is you can do negative range reads on S3. So you can look at the tail of a file.
You can look at the middle of the file. You can grab any part of an object that exists in S3
if you just know where in the file it exists.
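[As a rough illustration of the ranged reads Tyler is describing, a minimal boto3 sketch; the bucket, key, and byte offsets are hypothetical, and real offsets would come from the Parquet footer metadata.]

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "library/part-00000.parquet"  # hypothetical names

# Suffix ("negative") range: fetch only the last 64 KiB of the object, which is
# where a Parquet file keeps its footer with row group and column offsets.
footer = s3.get_object(Bucket=bucket, Key=key, Range="bytes=-65536")["Body"].read()

# Once the footer says where a given document's bytes live, a second ranged GET
# pulls just that slice, e.g. ~25 KB out of a 100 MB object.
start, length = 4_194_304, 25_000   # illustrative offsets
chunk = s3.get_object(
    Bucket=bucket, Key=key, Range=f"bytes={start}-{start + length - 1}"
)["Body"].read()
```

In practice a Parquet reader parses the footer (the last 8 bytes hold the footer length plus the PAR1 magic) to find the right row group; at the S3 level, though, it is just these two ranged GETs.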
Which is magic. And there's a...
It is magic.
The downside, of course, is you have to know it's there. You have to have an in-depth
understanding of your workload, which you folks do. This is also, I think, the curse of retail
pricing. It's no secret at this point that at scale, nobody's paying
retail. It's like, well, of course we're going to work and negotiate with you on this;
in return for long-term commitments, we'll wind up giving you various degrees of discounts. But when you're
sitting there just doing the napkin math to figure out, okay, if I have a hundred billion objects,
what are you going to charge me? Okay, never mind. We're going to move on to the next part
of the conversation, because it doesn't occur to you to go up to them and say, so I need at minimum
a 98% discount on this particular dimension, even as an ask.
It sounds ludicrous. Like, there's no way you would even be able to say that with a straight face. Like, I'm not going to go into a car dealership and ask for a car for 20 bucks, because it's just wasting everyone's time. Same principle applies here. They have priced themselves out of some very interesting conversations. The same principle applies. I think the problem domain that we are faced with sits on top of S3, which, as I think you've claimed a number of times, is the eighth wonder of the world. It is a
fantastic piece of infrastructure. Building on top of it enables so many different use cases,
but when you've got a large enough scale, you've got really interesting problems. And being in
engineering, this is certainly a bias, right? Like, I don't want to look away from those problems.
Getting things cheaper is sometimes easier to do just with paper, you know, just signing a contract.
In other times, stepping back far enough to look at what we're trying to accomplish and coming
up with an interesting technology solution is also a perfectly reasonable way to solve the problem.
And I think the way that I really am trying to approach what we're doing with S3 at Scribd is it's
not just about getting the bill lower.
Like, nobody is going to give me the time and the money to make the bill lower.
But if I can give us new capabilities by expanding what we can do with these 100 billion
objects within the organization, that is a capability change that you get from a technology-based
solution as opposed to a policy or, you know, a contract-based solution. Both are equally valid, right?
But I'm much better at one than the other. That may be a different story for you, but I'm better
with the "let's write some code that's going to solve some big problems" approach. And hopefully
that'll make the chart go down in a way that makes the folks in finance happy. This episode is sponsored
by my own company, Duckbill. Having trouble with your AWS bill? Perhaps it's time to
negotiate a contract with them.
Maybe you're just wondering how to predict what's going on in the wide world of
AWS.
Well, that's where Duckbill comes in to help.
Remember, you can't duck the Duckbill bill, which I am reliably informed by my business
partner is absolutely not our motto.
To learn more, visit DuckbillHQ.com.
It feels like half the time you look deep at the bill, like every different usage item,
there's a reason for it,
there are ways to optimize around it.
But at small to mid-sized scale,
it feels like it's just a tax on not knowing those intricacies.
It's also, frankly, why the bigger bills get less interesting
because you can have a weird misconfiguration that's, you know,
a significant portion of an $80,000 monthly bill.
But by the time you're like, you know, we spend a hundred million bucks a year,
no one's going to spend $3 million of that on NAT gateway data processing,
because someone's going to ask where the hell the money's going long before it gets to that point.
So it starts to normalize.
You see the usual suspects as far as
services go. S3, of course, being one of the top three every time. But in your case,
it's not just about the fact that it's S3. It's what is the usage type? What is the dimension
breakdown? What is the ratio of requests to bytes stored? That's where it starts to get really
interesting. And there's still no really good way of saying, oh, 99% of it is this one S3 bucket,
because you have to go diving even to get that out. It really is; you have to go diving to get the
specifics on where data is being stored, especially as it starts to get more and more costly.
But the use cases that I see more and more, and this is sort of because of the time that we're in
right now, is if you give a data scientist a bucket, if you give a data scientist a table or
an engineer a table, they're going to start to put data in it. And it starts to explode over
time to where we start to have data sizes that get large enough to where you're like, okay,
should this be in S3? We need it to be online. Should it be in Aurora? Should it be in
ElastiCache? Like, there's all of these very interesting data scale problems that are starting to
creep up because data has become so much more intrinsic in, you know, the product value or what
we can do that's really interesting.
And everybody wants the data all the time in every surface possible for as little latency
as possible.
And, normalizing for everything else, S3 is so, like, incredibly fast.
It is incredibly fast.
It is incredibly cheap.
You just have to know how to store data in it to take advantage of those
two properties. And that's sort of the thrust of the work that I've been doing over the
last year: if you know how to wield S3, it is probably the most powerful tool in the toolbox,
but you have to know how to wield it. Yeah. It's magic. It is infinite storage, definitionally.
It's fast and you can fill it. I know that because I've done some
stupid loops. Yeah. That's why my test environment is other people's production accounts.
It also changes the nature of behaviors.
If I were to go into a data center environment and say, great, I need to just store another 10 petabytes over in that rack next week.
The response would be, right.
Is there someone smarter I can speak with?
Do you have a goldfish, perhaps?
Whereas working with S3 is effectively just one gigabyte at a time.
And there is no forcing function where, well, now we need to get a whole bunch more power, cooling, hardware, etc.
You can just keep incrementally adding and growing forever.
There is no upper bound to speak of,
at least not anything you or I are ever going to encounter
because the money is the bound there.
And there's no forcing function that makes us clean up after ourselves.
It is an unbounded growth problem.
It is an unbounded growth problem.
I think there's an industry change that has happened that has influenced this.
I was having a chat with my friend Denny from Databricks about this.
When I first came up in the industry, how you store data,
whether it was online or offline,
was in a relational data store of some form
or the data warehouse,
and the goal of all of these foreign key constraints
and the relations between them
was to only store any one piece of data once.
In the last 10 years,
we've said, to hell with that, denormalize the data.
It's faster to just denormalize it
and to create a copy plus one column of this table
rather than to try to manage
all of these relationships between data.
And so we have excessive denormalization
of data happening across the data layer, in theory, for good reasons.
But what that means is this unbounded growth has happened because we have infinitely cheap storage,
you know, right?
And then we also have this push of denormalizing data, which leads to crazy data growth as time goes
on because most new data sets are not net new original data sets.
They are that old table or that old data set, plus some new properties that I've added, which is now
an entirely new, you know, prefix, an entirely new bucket of stuff, right?
And so rather than trying to find a way to get the least amount of storage used,
we said to hell with it, S3 is cheap enough, just copy the data a bunch.
And that works great until it stops working.
At the billions is where it stops working.
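[A tiny illustration of the copy-plus-one-column pattern Tyler describes, sketched in Python with pandas; the dataset, column, and output path are hypothetical, and in a data lake this lands as a whole new S3 prefix of Parquet files rather than a local file.]

```python
import pandas as pd

# Hypothetical source dataset.
docs = pd.DataFrame({"doc_id": [1, 2, 3], "title": ["a", "b", "c"]})

# The denormalization pattern: rather than joining a small lookup table at query
# time, write a full copy of the data with one derived column added. Every row is
# duplicated, so each "new" dataset is roughly the old one's size all over again.
docs_with_lang = docs.assign(language="en")
docs_with_lang.to_parquet("docs_with_lang.parquet")  # in practice: a new S3 prefix
```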
And also, do you realize, okay, you had a data scientist copy away five petabytes to do a quick
experiment for two weeks.
She left the company in 2012,
and oops-a-doozy, we probably should have cleaned that up before now.
Yeah, that's where intelligent tiering could really help.
Intelligent tiering and object access logs have been used quite aggressively by myself
and some other folks that I work with to identify exactly those data sets that were orphaned
that were suspiciously huge.
Object access logs are great.
CloudTrail data events are terrible.
I've done the math on this.
It's something like 25 times more expensive than the S3 request to write the CloudTrail
data event recording that request.
So, professionally speaking, don't do that.
The access logs are a lot more reasonable.
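[For anyone who wants the cheaper option Tyler is pointing at, a minimal boto3 sketch of turning on S3 server access logging; bucket names are hypothetical, and, per the joke later in this episode, the target bucket must not be the bucket being logged.]

```python
import boto3

s3 = boto3.client("s3")

# Enable server access logging on a source bucket, delivering logs to a separate
# bucket/prefix. The log-delivery bucket needs a policy allowing the
# logging.s3.amazonaws.com service principal to write to it; names are hypothetical.
s3.put_bucket_logging(
    Bucket="example-data-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "example-access-logs",   # never the bucket being logged
            "TargetPrefix": "example-data-bucket/",
        }
    },
)
```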
It reminds me of the good old days of web server logs from Apache and whatnot.
I set up Webalizer.
I know exactly where things are going.
Exactly.
And my log analytics suite, that was a very convoluted one-line awk script.
When you have underpowered systems, what else are you going to do?
It's either that, or making it fall over, or tail -f and trying to strain the tea leaves with your teeth.
That's how I've been doing it, actually.
There are worse choices you could make.
Unfortunately... I really wish you could give a version of your talk at re:Invent, but it doesn't involve AI,
so there's no way in the world that it would ever fit on the schedule.
Someone just did the analytics: something like 400 and change of the roughly 500 talks in the catalog so far mention AI
at least once in the title or description.
Yeah, I mean, that's how you get on stage.
That's how you get some funding.
I mean, you've got to have some hashtag AI.
I think the thing about the AI use cases,
there's a lot of really interesting things that people are doing,
that's, you know, quote unquote AI on AWS.
Some of those are with AWS's AI tools.
A lot of them are kind of conventional
SageMaker, conventional vector stores,
essentially the tools of three or four years ago. The stuff that's coming out now or that
has been announced in the last year or so, I think we're going to see a couple more years
before that's really used in anger for production products and things like that.
This is the problem, too, as you're building things out.
Like, it's easy as hell for me to go to you and point at some of these new features.
Like, well, why didn't you idiots just build this thing, build Scribd on top of this?
Because, this may shock you, Scribd is older than three months.
Who knew?
Yeah, I mean, the thing that's fascinating about script in that regard is there's scripts that evolve, right?
Like we've had very different business models depending on which, you know, era of script that you look at.
And the design constraints that influence systems change over time.
And so I think it's a great thing for most engineers to work for a company that has had to, like, join a company that doesn't have greenfield problems.
join a company that's been around for five, 10 years because there's really interesting engineering
challenges when you look at a system that was built for an error that no longer exists and have to
figure out how do I convert this? How do I make this work in where we are today? Because you've got
every problem has constraints, but the constraints of somebody's solution yesterday, bringing that
to today, is a very interesting mental challenge because you can't throw it out and rebuild it. You've
got to find a way to evolve the system as opposed to burning it to the ground.
I make this observation from time to time in meetings that it doesn't take a particularly
bright solutions engineer to be able to fill an empty whiteboard with an architecture that largely
solves any given problem you want to throw at them. The thing is that we are not starting
from square one with a greenfield approach for almost anything these days. It's: great, we already
have an existing business and existing systems that are doing this, and no, we don't get to just
turn it all off for 18 months while we decide to
start over from scratch and then do a slow, leisurely migration.
There has to be a path forward here.
Yes, there has to be a path forward.
And in Scribd's case, what makes it interesting is we still get uploads.
Like, we're getting uploads.
Thousands and thousands of uploads happen every day.
And so any storage solution that we come up with on top of S3 has to slip in or follow
behind something that's getting thousands and thousands of uploads every single day
and all of the objects that that creates.
I was looking at it at one point to use as a free database.
Thank you.
Thank you for the engagement, I guess.
It's really helpful.
Anything's a database if you hold it wrong.
Please don't use Scribd as a database.
I mean, you personally, you have information.
We can ask you about it and get answers.
You're just a slow database.
Done.
I am a very slow database.
Lossy as well.
I was looking at some assets that I had put together for a discussion with AWS at the
beginning of the summer. And on the whiteboard, I had put one number, you know, this many billion
objects. When I went and I looked at that in the last two weeks, there are another couple
billion objects in that bucket compared to what I had put on the whiteboard. And so when you're
looking at a system that is in flight, what exists? Additional use of the service and the site,
not the classic architectural pattern called Lambda invokes itself? It was not Lambda
invokes itself, though I am a fan of that one.
Don't put your logs into the same bucket you're recording the object access in.
That's a, yeah.
I don't think I've done that one.
I have done the recursive Lambda.
Well, haven't we all?
But like the challenge of architecting or designing something that has to handle massive scale,
but also work through 18 years of massive scale, it's such a fascinating problem.
Like, there is no bigger problem at Scribd that has me this excited than figuring out how
we take 18 years of the largest document library that exists, as far as I'm aware,
and figuring out how to make that useful, you know, high-performance, easily accessible,
and give new capabilities.
Like, the S3 service team is doing this now as well.
Like, they're trying to find new ways to get new capabilities out of S3.
Well, they'll never break the old capabilities, except by turning them off.
Well, never break.
None of us want, like, SOAP endpoints for their API, or
worse. And I recently discovered the seeding of a bucket. Not that you should do that. Sounds like I'm
kidding. I love that. I love that SOAP endpoints are still supported until October of 2025.
It makes me laugh so much that I saw that. Oh, next month? I missed that. Okay.
No, no, no. They sent those deprecation warnings. I discovered SOAP recently for S3, and I discovered
that it was going away, both within about two days of each other.
Not that I was planning on using SOAP.
There's a lot of really cool tools in the S3 toolbox that are underutilized.
I mean, Object Lambda I've chatted with you about before.
I think Object Lambda is super cool, but I don't think it's seeing a lot of attention.
I think S3 Select was a good idea that maybe didn't materialize in any particular way.
Over the years, the S3 service team has been doing interesting things.
And I think only now in the last two years have they found their footing with the metadata tables,
the different types of buckets, and then vector buckets,
and found a way to move up from object storage
in a way that is really fascinating to see
how the industry around it is going to change.
Because as you've pointed out,
like S3 is backwards compatible with just the classic S3 API,
but S3 means a lot of different things now.
It's like SageMaker.
Like, what do you mean by S3?
Like, what part of S3 are you talking about?
In my case, I'm just talking about objects.
I'm not talking about this other magic stuff.
Yeah, don't worry.
Those will continue to work.
They kind of have to. But it's weird; you almost wonder, if you go on a cruise for the next five years and come back,
how little or how much of the then-current stuff would you recognize?
They keep dripping out feature enhancements, but they add up to something meaningful.
They do.
They do.
There's a very clear data strategy from the S3 service team that's working in concert with some of the other parts, like the Bedrock team
and the SageMaker teams, to where, to me, I mean, I'm all about data.
I love data.
I work on data products and data tools.
Like, this is my jam.
But there's never been a more exciting time, in my opinion, to be building on top of
S3 as a platform, because the platform itself is rock solid, super fast, super cheap, and
getting more capabilities every re:Invent, which is super cool.
Yeah.
It's really neat to see.
I want to thank you for taking the time to speak to me about
all of this. If people want to learn more, and I strongly suggest they do, where's the best
place for them to find you? That is a great question. You can find my shitposts on Mastodon.
So I'm just, hacky dot... My server got eaten when I hit 10,000, and I don't have the energy
to do it again. Yeah, I noticed that QuinnyPig was on Mastodon for a minute and then I didn't
know what happened. Yeah, I basically didn't.
Oh, no.
It is definitely the do-it-yourself social network.
Yeah, which is great if you want to talk to a very specific subset archetype of a person and absolutely no one else.
Fair.
Absolutely fair.
I'd say the Scribd tech blog, tech.scribd.com, is a good place to find some stuff that we periodically share.
Mastodon is probably the easiest place to find me, or GitHub.
I'm just rtyler on GitHub.
Like, for the last 20 years that GitHub's been around,
GitHub's been my primary social network.
So that's where people can find me.
Awesome.
Mine historically has been,
my primary social network is basically notepad and text files
and terrible data security.
But that's beside the point.
Thank you so much for taking the time to speak with me.
I appreciate it.
Thanks, Corey.
R. Tyler Croy, infrastructure architect at Scribd.
I'm Cloud Economist Corey Quinn,
and this is Screaming in the Cloud.
If you've enjoyed this podcast,
please leave a five-star
review on your podcast platform of choice.
Whereas if you hated this podcast, please leave a five-star review on your podcast platform
of choice, along with an angry, insulting comment that completely transposes a few numbers,
and you'll have no idea what the hell it's going to cost to retrieve that from Glacier Deep Archive.
