Screaming in the Cloud - Engineering Around Extreme S3 Scale with R. Tyler Croy

Episode Date: January 13, 2026

R. Tyler Croy, a principal engineer at Scribd, joins Corey Quinn to explain what happens when simple tasks cost $100,000. Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don't work anymore. Tyler shares how, with this much data, you can't just throw money at the problem; you have to engineer your way out.

About R. Tyler: R. Tyler Croy leads infrastructure architecture at Scribd and has been an open source developer for over 14 years. His work spans the FreeBSD, Python, Ruby, Puppet, Jenkins, and Delta Lake communities. Under his leadership, Scribd's Infrastructure Engineering team built Delta Lake for Rust to support a wide variety of high performance data processing systems. That experience led to Tyler developing the next big iteration of storage architecture to power large-scale fulltext compute challenges facing the organization.

Show Highlights:
01:48 Scribd's 18-Year History
04:00 One Document Becomes Billions of Files
05:47 When Normal Physics Stop Working
08:02 Why S3 Metadata Costs Too Much
10:50 How AI Made Old Documents Valuable
13:30 From 100 Billion to 100 Million Objects
15:05 The Curse of Retail Pricing
19:17 How Data Scientists Create Growth
21:18 De-Normalizing Data Problems
25:29 Evolving Old Systems
27:45 Billions Added Since Summer
29:29 Underused S3 Features
31:48 Where to Find Tyler

Links:
Scribd: https://tech.scribd.com
Mastodon: https://hacky.town/@rtyler
GitHub: https://github.com/rtyler

Sponsored by: duckbillhq.com

Transcript
Starting point is 00:00:00 And so even the simple, like, S3 will gladly compute checksums for you, but to do so, it's a batch operation. And a batch operation costs $1 per million objects. If you have a hundred billion objects, all of a sudden you're faced with, if I need checksums for all of this data, I've got to go drop 100K just to compute checksums, because a billion is just so astronomically large that just the simple act of getting checksums for the data you already have in S3 becomes a very serious pricing discussion if you aren't ready to drop that kind of coin.
Starting point is 00:00:36 Welcome to Screaming in the Cloud. I'm Corey Quinn. I am joined today by an early Duckbill Group customer, a recent speaker at the inaugural San Francisco FinOps meetup, I suppose we'll call it. R. Tyler Croy is an infrastructure architect over at Scribd. Tyler, how are you? How are you?
Starting point is 00:00:58 I'm doing all right. This episode is sponsored in part by my day job, Duck Bill. Do you have a horrifying AWS bill? That can mean a lot of things. Predicting what it's going to be. Determining what it should be. Negotiating your next long-term contract with AWS. Or just figuring out why it increasingly resembles a phone number,
Starting point is 00:01:21 but nobody seems to quite know why that is. To learn more, visit duckbillh.com. Remember, you can't duck the duck bill bill, which my CEO reliably informs me is absolutely not our slogan. For a while over at Scrib, six years in change. The company boldly answering the question, what if S3 had a good user interface? Yeah, I mean, I've joined the most recent incantation of script.
Starting point is 00:01:51 Script has been a lot of things. We've been around for 18 years. I think Scrib has existed longer than most other companies. So, I mean, we've got the lack of vowel in our name. We're from the flicker, the scripts, the word before the dot LYs. Well, as we look at them, vowels, vowels are expensive. Why buy them if you don't have to? So your talk was fascinating because it, of course, focused heavily on economics,
Starting point is 00:02:16 and it also focused on S3, the reason that many of us don't sleep anymore. And that's been, it was an interesting story, just at the level of scale that you're talking about, Things that people don't consider to be expensive, got expensive. Specifically, request charges when you're doing things in buckets that, for those who are unfamiliar, effectively have a crap ton of uploaded text documents at head of out scale. I think crapton is the metric that storage lens shows you at our scale. So for the uninitiated script has user documented or user uploaded content, documents typically presentations through our slideshare product. going back 18 years.
Starting point is 00:02:59 And so every day, thousands and thousands of new documents, legal documents, study guides, et cetera, get uploaded. And those have been quietly accumulating in our S3 storage layer for a long time without anybody really paying attention to it. And so a year or so ago, I started to really look at, like, where's a big debt I can make in Clust Explorer? Like if I'm gonna take on something big,
Starting point is 00:03:21 what's the biggest thing? And I saw a storage cost. Cloud Economics, instead of pulling an AWS billing org, going alphabetically, if you start with a big numbers first, that tends to have impact. For years, I was asked about people's random Alexa for business spend. It is $3 a month. What are you doing?
Starting point is 00:03:42 Yeah, I mean, most companies, I think, have like EC2, S3, Aurora. Like, those are the big things. But once I started to look into our actual S3 spend, I knew we had a lot of content. Like, we talk about the hundreds of millions of documents that have been uploaded over the years. But when I actually looked into what was stored in S3, we're talking hundreds of billions of objects because every single document that you upload, we have like format conversion,
Starting point is 00:04:09 we have accessibility changes that get made. And so every single document became this diaspora of related objects in S3. And suddenly like batch operations, intelligent tiering, anything that has a per object charge associated with it becomes wildly expensive in a way that requires is you to like step back and think about how how should we be doing this? How should we be storing
Starting point is 00:04:35 this data? Because that shotgun of objects into S3 only works for the first billion. And then after that, you might have to think what you're doing. The ergonomics of the request charges are very different too. I think philosophically, we tend to see, you know on some level, oh, if I stuff an exabyte of data into S3, that's going to be expensive. But when we start, I think it's hard for humans to wrap their head around the idea of hundreds of billions of objects, just because it's the difference between a million and a billion is about a billion. If you pass a point of scale, you do an S3LS to see what objects you have there and it'll complete right around the time the earth crashes into the sun. It's a, it's just not something that makes sense. But on the other side of it, I overoptimized
Starting point is 00:05:18 for a lot of this stuff because I think at the Duck Bill group now, our total S3 bill is something like $110 a month right now. And we can do basically any. anything we want to S3. It doesn't materially move the needle on our business because we are not script. The S3, like, you can abuse S3 for terabytes, petabytes even, really. Like, you can put so much into S3. It's so incredibly cheap. It's so incredibly reliable. And then there's this like something happens. And I don't know when it happened at Scrib because I wasn't paying attention. I'm only looking back in history. Something happened where we went from like the first billion to the next billion to the next billion. And once you're in the tens or hundreds of billions of objects,
Starting point is 00:05:59 like it's like quantum physics. Like all of a sudden, all of the physics that you've learned no longer applies, you're in a completely different ballgame and you've got to figure out how does this world work? Because the world I thought I had doesn't exist anymore. Yes. Oh, very much so. It's you also have the problem where, especially we're talking about all things billing. It's a lot of hurry up and wait. Okay, we're going to make some transitions. We're going to try something here and see how it goes. And then you have to, in some cases, wait for objects to age out into the next tier or there's a bunch of request charges that suddenly mean for this month your S3 bill is just in the stratosphere and you get the angry client screaming phone call.
Starting point is 00:06:38 Like, what have you done? Yes, there is an amortization story here. Give it time. Don't move it back. Yeah. And I'm actively, as we record this, I am waiting for a first 30 day on some reclassing to occur for intelligent tiering. And I can't wait until next month because I'm hoping for a big draw. Yeah. This stuff has become magic, but you have to speak the right incantations around it.
Starting point is 00:07:05 And past a certain point of scale, a lot of things, just in the way that AWS talks about this, are no longer makes sense. I asked you about using S3 metadata or S3 tables for a lot of this stuff. And your response was the polite business person equivalent of a kid, that's adorable.
Starting point is 00:07:23 You have any idea what that would cost just on the sheer number of objects because that's not usually the first dimension we can to think about historically. Now, metadata and tables are changing that and vector buckets and directory buckets and Lord knows what else.
Starting point is 00:07:38 But that changes the way that we think about a service that honestly is older than some people listening to this show. I mean, yeah, a lot of these things really break down in ways that are challenging. The metadata and invidavit inventory and other like operations or things that have happened in S3 over the last few years that are really interesting. They're really, really great when you're sub billion, sub billion objects.
Starting point is 00:08:00 But like S3 metadata, the questions I don't have or I want to ask of these buckets are not worth the amount of money it would take to ingest into S3 metadata and then to continue storing because they're just so astronomically huge. I was looking at, I was looking at a problem with some of these older objects. sometime in 2024, you probably talked about this. Every upload to S3 got a check-sum automatically. You don't have to do anything. Before that, you may or may not have a check-sum. If you're using in a proper, you know, AWS SDK, you did. But if you weren't, who knows?
Starting point is 00:08:36 And at some point... Who would use anything the latest correct SDK? Why would you interact with S3 except through the official SDK? But when you go back to like Scrib's S3 bucket came, was created the year S3 was created and announced. And so when we're going back that far in time, we have billions of objects that don't have check sums. And so even the simple, like S3 will gladly compute checks sums for you. But to do so, it's a batch operation. And a batch operation costs $1 per million. If you have a hundred billion objects, all of a sudden,
Starting point is 00:09:14 you're faced with, if I need checks sums for all of this data, I've got to go drop 100K just to compute check sums because a billion is just so astronomically large that just the simple act of getting checksums for the data you already have in S3 becomes a very serious pricing discussion if you aren't ready to drop that kind of coin. I have to think just, again, you have lived in this space. I only dabble from time to time. But my default approach when I start thinking about this sort of problem is the idea of lazy check-summing, lazy conversions.
Starting point is 00:09:46 But when I was at Expensify back in 2012, something that we learned was that the typical life cycle of a receipt was it would be written once and it would be read either one time or never, except for the end of the long tail where suddenly things get read a lot years later during an audit of some sort or when there's a question of mouth the essence. So you could never get rid of the data. You had to have it with the expectation it would never get read. I have to imagine the majority of stuff that gets uploaded in the script. In many cases, it's there, but it's not accessed. I would say that's a pretty good assumption. The interesting thing about user-uploaded content and user-uploaded documents in particular
Starting point is 00:10:26 is the long tail is years and years long. A study guide that was created for Catcher in the Rye in 2010 is still probably just as useful in 2025 as it was in 2010 because the Catcher in the Rye is a classic and people still get no access. We'll post a link to it on Reddit one year when the entire guy started in that. Yeah, it's impossible to predict what's going to hit. It's impossible to predict, but one of the things that's been really interesting about Scrib's particular flavor of content is in the last couple of years, large language models have become really useful for what Scrib does. I won't speak to the utility of large language
Starting point is 00:11:06 models in other domains, but the utility in what Scrib does in particular has made old documents suddenly much more useful, much more interesting, much more relevant to use. users today than they have ever been before. Because before this, you didn't have to sort of rely on a Reddit post or something to like reinvigorate a document, right? But now that if we look at all of this long, almost 20 year history of Scrib, if you look at that like a knowledge base, then all of a sudden we're looking at a very broad horizontal access pattern that we might want to be doing for data science use cases or large language
Starting point is 00:11:43 model based applications, that again flips the access patterns that you might have in a traditional user-generated content site on his head and makes the storage discussion so much more challenging, but like in a fun way. One of the more horrifying parts of your talk was when you mentioned that you did a lot of digging into various file formats. You were talking about, even looking at ISOs at one point. I'm like, oh, hey, someone knows what Joliet standard is in this day and age. Imagine that. But you picked Parquet and then started using S3 byte offsets on reeds to be able to just grab the end of of an object and then figure out where exactly you'd go and grab things from exploded document
Starting point is 00:12:22 views. It was a very in-depth approach. It sounds like not to be the mind, S3 tables, rest of metadata from first principles, because those things didn't exist back then. And now that they do, they're not even close to being economically viable. Yeah, I think if they were economically viable, I mean, S3 tables would be really interesting for this use case. The really novel thing about Apache Parquet files is a lot of what we are doing at Scrib with Apache Parquet is not new territory. It's not necessarily novel. It's how the quote-unquote lakehouse architectures of, you know, delta tables and iceberg tables and things like that are doing really, really fast queries and things like that on top of S3 object stores or, you know, Azure blah, blah, blah, blah. So like the
Starting point is 00:13:04 infrastructure for picking needles out of these parquet file haystacks already exist. One of the work that I've been doing is reusing some of the same principles, but bringing it to a wildly different domain of this sort of very, very large content library that script has, and using that as a way to reduce object size. The whole thing that I was trying to get across at the FinOps Meetup that you all invited me down for was the problem of X is really expensive at 100 billion objects.
Starting point is 00:13:36 My solution has been, okay, not like to go negotiate with the team or try to find a way to make that cheaper, but try to get the object count actually lower. Because if you bring it from $100 billion to $100 million, then we're in a ballpark to where you can take advantage of intelligent tearing, batch operations become much more easy to do. All sorts of things become simpler if you can reduce that object count. And when I was looking at other things like ISOs,
Starting point is 00:13:59 I mean, the classics are a classic for a reason. They never die. When I was looking at that, like zip, tar, etc., I wasn't able to find a way to get random fight selections within S3 objects. to work nearly as effectively as I can with Parquet. As with Parquet, if I know what file I'm looking for, I can get it extremely quickly from within, let's say, a 100 megabyte file,
Starting point is 00:14:24 I can go grab you 25 kilobytes with the same level of, I would say, performance as most other S3 object accesses work. Because S3 supported this range request for a long time. And one of the bits of trivia that I was very pleased to discover, which really, really made this work well is you can do negative range reads on S3. So you can look at the tail of a file. You can look at the middle of the file. You can grab any part of an object that exists in S3 if you just know where it in the file that it exists.
Starting point is 00:14:55 Which is magic. And there's a... It is magic. The downside, of course, is you have to know it's there. You have to have an in-depth understanding of your workload, which you folks do. This is also, I think, the curse of retail price. It's no secret at this point that at scale, nobody's been. in retail. But and like, well, of course we're going to work and negotiate with you on this in return for long-term commitments. We'll wind up giving you various degrees of discounts. But when you're sitting there just doing the napkin math to figure out, okay, if I have a hundred billion objects
Starting point is 00:15:26 and what are you going to charge me? Okay, never mind. We're going to move on to the next part of the conversation because it doesn't occur to you to go up there. It's like, so I need at minimum a 98% discount on this particular dimension, even if that's a tax. It sounds ludicrous. Like there's no way you would even be able to say that with a straight face. Like I'm not going to go into a car dealership and ask for a car for 20 bucks because it's just wasting everyone's time. Same principle applies here. They have priced themselves out of some very interesting conversations. The same principle applies. I think the problem domain that we are faced with on top of S3. I think S3, as I think you've claimed a number of times, is the eighth wonder of the world. It is a fantastic piece of infrastructure. Building on top of it enables so many different use cases, but when you've got a large enough scale, you've got really interesting problems. And being in engineering, this is certainly a bias, right? Like, I don't want to look away from those problems. Getting things cheaper is sometimes easier to do just with paper, you know, just signing a contract.
Starting point is 00:16:34 In other times, stepping back far enough to look at what we're trying to accomplish and coming up with an interesting technology solution is also a perfectly reasonable way to solve the problem. And I think the way that I really am trying to approach what we're doing with S3 at Scrib is it's not just about getting the bill lower. Like nobody is going to give me the time and the money to make the bill lower. But if I can give us new capabilities by expanding what we can do with this 100 billion objects within the organization, that is a capability change that you get from a technology-based solution as opposed to a policy or, you know, a contract-based solution. Both are equally valid, right?
Starting point is 00:17:15 But I'm much better at one than the other. That may be a different story for you, but I'm better with the, let's build some, write some code that's going to solve some big problems. And hopefully that'll make the chart go down in a way that makes time in finance happy. This episode is sponsored by my own company, Duck Bill. Having trouble with your AWS bill, perhaps it's time to negotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duck Bill comes in to help.
Starting point is 00:17:49 Remember, you can't duck the Duck Bill bill, which I am reliably informed by my business partner is absolutely not our motto. To learn more, visit DuckbillHQ.com. It feels like half the time you look deep at the bill, like every different usage item, there's a reason for it. there's ways to optimize and market it. But it's small to mid-sized scale. It feels like it's just a tax on not knowing those intricacies.
Starting point is 00:18:14 It's also, frankly, why the bigger bills get less interesting because you can have a weird misconfiguration that's, you know, a significant portion of an $80,000 monthly bill. But by the time you're like, you know, we spent $100 million bucks a year, no one's going to spend $3 million of that on NAC gateway data processing because someone's going to ask where the hell the money's going long before it gets to that point. So it starts to normalize. You see the usual suspects and stuff.
Starting point is 00:18:37 services. S3, of course, being one of the top three every time. But in your case, it's, it's not just about the fact that it's S3. It's what is the usage type? What is the dimension breakdown? What is the ratio of requests to bite stored? That's where it starts to get really interesting. And there's still no really good way of saying, oh, 99% of it is this one S3 bucket, because you have to go diving even to get that out. It's not really you have to go diving to get the specifics on where data is being stored, especially as it starts to get more and more costly. But the use cases that I see more and more, and this is sort of because of the time that we're in right now, is if you give a data scientist a bucket, if you give a data scientist a table or
Starting point is 00:19:22 an engineer at table, they're going to start to put data in it. And it starts to explode over time to where we start to have data sizes that get large enough to where you're like, okay, should this be an S3? We need it to be online. It should it be an Aurora? Should it be an elastic the cache, like there's all of these very interesting data scale problems that are starting to creep up because data has become so much more intrinsic in, you know, the product value or what we can do that's really interesting. And everybody wants the data all the time in every surface possible for as little latency as possible.
Starting point is 00:19:53 And all of normalizing for everything else, S3 is so like incredibly fast. It is incredibly fast. It is incredibly cheap. You just have to know to store data in it to take advantage of those. two properties. And that's that's sort of the the thrust of the work that I've been doing over the last year is if you know how to wield S3, it is probably the most powerful tool in the toolbox, but you have to know how to wield it. Yeah. It's magic. It is infinite storage. Definition it's. It can fast and you can fill it. I know that because I've done some
Starting point is 00:20:27 super some loops. Yeah. That's why my test environment is other people's product counts. it also changes the nature of behaviors. If I were to go into a data center environment and say, great, I need to just store another 10 petabytes over in that rack next week. The response would be, right. Is there someone smarter I can speak with? Do you have a goldfish, perhaps? Whereas work with S3 is just effectively one gigabyte at a time. And there is no forcing function where, well, now we need to get a whole bunch more power, cooling, hardware, etc.
Starting point is 00:21:01 you can just keep incrementally adding and growing forever. There is no more bound to speak of, at least not anything you or I are ever going to encounter because the money is the bound there. And there's no forcing function that makes us clean up after ourselves. It is an unbounded growth problem. It is an unbounded growth problem. I think there's an industry change that has happened that has influenced this.
Starting point is 00:21:22 I was having a chat with my friend Denny from Databricks about this. When I first came up in the industry, how you store data, whether it was online or offline, was in a relational data store of some form or the data warehouse, and the goal of all of these foreign key constraints and the relations between them was to only store any one piece of data once.
Starting point is 00:21:43 In the last 10 years, we've said, to hell with that, denormalize the data. It's faster to just denormalize it and to create a copy plus one column of this table rather than to try to manage all of these relationships between data. And so we have excessive denormalization of data happening across the data layer for good reasons, in theory, but for good reasons.
Starting point is 00:22:05 But what that means is this unbounded growth has happened because we have infinitely cheap storage, you know, right? And then we also have this push of denormalizing data, which leads to crazy data growth as time goes on because most new data sets are not net new original data sets. They are that old table or that old data set plus some new properties that I've added is now, an entirely new, you know, prefix, an entirely new bucket of stuff, right? And so rather than trying to find a way to get the least amount of storage used, we said to hell with it, S3 is cheap enough, just copy the data a bunch.
Starting point is 00:22:43 And that works great until it stops working. At the billions is where it stops working. And also, do you realize, okay, you had a data scientist copy away five petabytes to do a quick experiment for two weeks. She left the company in 2012. and oops, a doozy, we probably should have cleaned that up before now. Yeah, that's where intelligent tiering could really help. Intelligent tiering and object access logs have been used quite aggressively by myself
Starting point is 00:23:09 and some other folks that I work with to identify exactly those data sets that were orphaned that were suspiciously huge. Object access logs are great. Cloudfield data events are terrible. I've done the math on this. It's something like 25 times more expensive than the S3 request to write the cloud Data event recording that request. So professionally speaking, don't do that.
Starting point is 00:23:34 The actual are a lot more reasonable. It reminds us the old good old days of server web logs from Apache and whatnot. I set up webelizer. I know exactly where things are going. Exactly. And my log analytics suite, that was a very convoluted one-line ock script. When you have underpowered systems, what else are you going to do? It's not hitting myself or making it fall over or tail dash F and try and strain the tea leaves with your teeth.
Starting point is 00:24:03 That's how I've been doing it, actually. There are worse choices you could make. Unfortunately, I really wish you could give a version of your talk at Reinvent, but it doesn't involve AI. So there's no way in the world that it would ever fit on the schedule. Someone just did the analytics, something like 400 and change of the roughly 500 talks in the catalog so far are about, I mentioned. at least once in the title or description. Yeah, I mean, that's how you get on stage. That's how you get some funding.
Starting point is 00:24:33 I mean, you've got to have some hashtag AI. I think the thing about the AI use cases, there's a lot of really interesting things that people are doing, that's, you know, quote unquote AI on AWS. Some of those are with AWS's AI tools. A lot of them are kind of conventional, sagemaker, conventional vector stores, eventually, like the tools of three or four years ago, the stuff that's coming out now or that
Starting point is 00:25:01 that has been announced in the last year or so, I think we're going to see a couple more years before that's really used in anger for production products and things like that. This is the problem, too, is there you're building things out. Like, it's easy as hell for me to go to you and point at some of these new features. Like, well, why didn't you idiots just build this thing, build scribed on top of this? Because this may shock you, script is older than three months. Who knew? Yeah, I mean, the thing that's fascinating about script in that regard is there's scripts that evolve, right?
Starting point is 00:25:32 Like we've had very different business models depending on which, you know, era of script that you look at. And the design constraints that influence systems change over time. And so I think it's a great thing for most engineers to work for a company that has had to, like, join a company that doesn't have greenfield problems. join a company that's been around for five, 10 years because there's really interesting engineering challenges when you look at a system that was built for an error that no longer exists and have to figure out how do I convert this? How do I make this work in where we are today? Because you've got every problem has constraints, but the constraints of somebody's solution yesterday, bringing that to today, is a very interesting mental challenge because you can't throw it out and rebuild it. You've
Starting point is 00:26:22 got to find a way to evolve the system as opposed to burning it to the ground. I make this observation time to time in meetings that it doesn't take a particularly right solutions engineer to be able to fill an empty whiteboard with an architecture that largely solves any given problem you want to throw at them. The thing is is that we are not starting from square one with a greenfield approach for almost anything these days. It's great. We already have an existing business and existing systems that are doing this and no, we don't get to just turn it all off for 18 months while we decide to start start over from scratch and then do a slow leisurely migration.
Starting point is 00:26:54 There has to be a path forward here. Yes, there has to be a path forward. And in script's case, what makes it interesting is we still get uploads. Like, we're getting uploads. Thousands and thousands of uploads happen every day. And so any storage solution that we come up with on top of S3 has to slip in or follow behind something that's getting thousands and thousands of uploads every single day and all of the objects that that created.
Starting point is 00:27:20 I was looking at a waypoint. do as a free database. Thank you. Thank you for the engagement, I guess. It's really helpful. Anything of the database when you hold it wrong? Please don't use script as a database. I mean, you personally, you have information.
Starting point is 00:27:36 We can ask you about it and get answers. You're just a slow database. Done. I am a very slow database. Lossi as well. I was looking at some assets that I had put together for a discussion with AWS at the beginning of the summer. And on the whiteboard, I had put one number, you know, this many billion objects. When I went and I looked at that in the last two weeks, there are another couple
Starting point is 00:28:01 billion objects in that bucket compared to what I had put on the whiteboard. And so when you're looking at a system that is in display, what exists? Additional use of the service and the site, not the classic architectural pattern called Lambda invokes itself? It was not Lambda invokes itself, though I am a fan of that one. I don't put your logs into the same bucket you're recording the object access in. That's a, yeah. I don't think I've done that one. I have done the recursive language.
Starting point is 00:28:29 Well, aren't we sense? But like the challenge of architecting or designing something that has to handle massive scale, but also work through 18 years of massive scale, it's such a fascinating problem. Like, there is no bigger problem at Scrib that has me this excited than figuring out how we take 18 years of the largest document library that exists, as far as I'm aware, and figuring out how to make that useful, you know, high performance, easily accessible, and give new capabilities. Like, S3 is due, S3, the service team is doing this now as well.
Starting point is 00:29:07 Like, they're trying to find new ways to get new capabilities out of S3. Well, they'll never break the old capabilities, except turning up. Well, never breaks. None of us want, like soap endpoints for their API. It's like, you're one of them. worse. And I recently discovered the seating as a bucket. But you should do that. Sounds like I'm kidding. I love that. I love that soap and points are still supported until October of 2025. It makes me laugh so much that I saw that. Oh, next month. I missed that. Okay.
Starting point is 00:29:40 No, no, no. They said those deprecation warnings. I discovered so recently for S3 and I discovered that it was going away recently, both within about two days of each other. Not that I was planning on using so. There's a lot of really cool tools in the S3 toolbox that are underutilized. I mean, Object Lambda I've chatted with you about before. I think Object Lambda is super cool, but I don't think it's seeing a lot of attention. I think S3 Select was a good idea that maybe didn't materialize in any particular way. Over the years, the S3 service team has been doing interesting things.
Starting point is 00:30:15 And I think only now in the last two years have they found their footing with the metadata table as different types of buckets and then vector, vector buckets, and found a way to move up from object storage in a way that is really fascinating to see how the industry around it is going to change. Because as you've pointed out, like S3 is backwards compatible with just the classic S3 API, but S3 means a lot of different things now.
Starting point is 00:30:38 It's like SageMaker. Like, what do you mean by S3? Like, what part of S3 are you talking about? In my case, I'm just talking about objects. I'm not talking about this other magic stuff. Yeah, don't worry. Those will continue to work. They kind of have to, but it's weird to almost wonder if you go on a cruise to the next five years and come back,
Starting point is 00:30:56 how little or how much of the then current stuff would you recognize? They keep dripping out feature enhancements, but they add up to something meaningful. They do. They do. There's a very clear data strategy from the S3 service team that's working in concert with some of the other parts like the bedrock team. and the Sagemaker teams to where it's to me, I mean, I'm all about data. I love data. I work on data products and data tools.
Starting point is 00:31:26 Like, this is my jam. But there's never been a more exciting time, in my opinion, to be building on top of S3 as a platform because the platform itself, rock solid, super fast, super cheap, and getting more capabilities every reinvent, which is super cool. Yeah. It's really neat to see. I want to thank you for taking the time to speak to me about. all of this. If people want to learn more, and I'm strongly suggest they do, where's the best
Starting point is 00:31:51 place them to find you this? That is a great question. You can find my shit posts on Macedon. So I'm just, hacky dot... My server got eaten when I had a 10,000. I don't know the energy to do it again. Yeah, I noticed that Quinny Pig was on Macedon for a minute and then I didn't know what happened. Yeah, I basically didn't. I missed the deprecation warning and suddenly, nope, no backups, you're done. Oh, no. It is definitely the do-it-yourself social network. Yeah, which is great for all to talk to a very specific subset archetype of a person and absolutely no one else.
Starting point is 00:32:35 Fair. Absolutely fair. I'd say the ScribTech blog, tech.com is a good place to find some stuff that we periodically share. Macedon is probably the easiest place to find me or GitHub. I'm just R. Tyler on GitHub. Like for the last 20 years that GitHub's been around, GitHub's been my primary social network. So that's where people can find me.
Starting point is 00:32:54 Awesome. Mine historically has been, my primary social network is basically notepad and text files and terrible data security. But that's beside the point. Thank you so much for taking the time to speak with me. I appreciate it. Thanks, Corey.
Starting point is 00:33:07 Our Tyler Croy, Infrastructure Architect at Scrib, I'm cloud economist Cory Quinn, and this is screaming in the cloud. If you've enjoyed this podcast, please leave a five-stop review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that completely transposes a few numbers,
Starting point is 00:33:29 and you'll have no idea what the hell it's going to cost to retrieve that from Glacier Deep Archive.
