Software Huddle - Rewriting in Rust + Being a Learning Machine with AJ Stuyvenberg

Episode Date: May 6, 2025

Today's guest is AJ Stuyvenberg, a Staff Engineer at Datadog working on their Serverless observability project. He had a great article recently about how they rewrote their AWS Lambda extension in Rust. It's a really interesting look at a big, hard project, from thinking about when it's a good idea to do a rewrite to talking about their focus on performance and reliability above all else and what he thinks about the Rust ecosystem. Beyond that, AJ is just a learning machine, so I got his thoughts on all kinds of software development topics, from underrated AWS services and our favorite databases to the AWS Free Tier and the annoyances of a new AWS account. Finally, AJ dishes out some career advice for curious, ambitious developers.

Transcript
Starting point is 00:00:00 One of the challenges we had with Go is that it's really not meant to be an ultra-lightweight runtime. It's a very full-featured runtime. You have a garbage collector, you have goroutines, you have all these different language features, which are very, very nice to work with, but they can get heavy. Had you written any Rust before going on this project? No, not a line. And I think that's a big part of why I was so resistant. Languages in general, I think once you really learn and know one, in my opinion it's been easy enough to pick up a second and a third with time. What did you think of the Rust ecosystem? Go has an army of very talented developers
Starting point is 00:00:36 led by Rob Pike at Google. All that said, Rust is pretty mature. There's a lot of support, and you are probably going to spend more time reading library code than you were expecting. I don't think that's a bad thing. You pick up a lot of the patterns and a lot of the idiomatic strategies that you're going to want to use anyway. But it is different. I think it is harder than Go. You're like a big fan of AppRunner, right? You're like one of the two people that loves AppRunner. I can't recommend that anyone use AppRunner anymore.
Starting point is 00:01:03 One of my biggest complaints about AWS is it can be really hard to know which services are, like, soft deprecated. What's up, everybody? This is Alex, and we have a great show for you today because AJ Stuyvenberg is on the show. AJ is one of my favorite people. We worked together at Serverless. Now he's doing awesome things at Datadog. He just wrote this sweet blog post on rewriting their Lambda extension from Go to Rust and some of the reasons around that. Very measured approach to doing a rewrite, which is a risky endeavor, and to learning Rust and doing Rust.
Starting point is 00:01:34 So like lots of cool things there. He also was like really good, I think just like general AWS serverless ecosystem thoughts and so we go back and forth on some of those things and always learn a lot from AJ and enjoy chatting with him. So, you know, if you have any questions, comments, people you want to be on the show, feel free to reach out to me or to Sean. And with that, let's get to the show. AJ, welcome to the show.
Starting point is 00:01:58 Thanks, Alex. It's great to be here. Yeah, man. I'm excited to have you because you're one of my favorite people. You know, we worked together at Serverless and had a great time there. And it's just been super fun to watch what you're doing since then. You're now staff engineer at DataDog, AWS Serverless hero, re-invent speaker,
Starting point is 00:02:14 all kinds of expert in all these different areas. I guess that's the high-level stuff, but for people that don't know about you, maybe you wanna introduce yourself. Ah, thanks. That's actually, that's too kind. But yeah, I'm a staff engineer for the serverless group here at Datadog. For the listeners, before this, Alex and I did work together at Serverless Inc.
Starting point is 00:02:35 He's a pretty good boss, so if you get the opportunity, you should try and work for him. I've been in the AWS and serverless space for a number of years now. to function in 2016 or 2017. over into the managed data store services, things like DynamoDB, Alex, quite familiar with, of course, and SNS, SQS, kind of the other ancillary cloud products too, like Google Cloud Run, Azure App Services, that kind of thing. So our group kind of encompasses that. And it's a lot of fun. I think it's a really cool space to be in. It's been a really fun ride and we've had a lot of hits.
Starting point is 00:03:21 So I'm excited to talk about it. Yeah, yeah, I know. It's been really fun to watch your journey at DataDog because I'm just so jealous of all the stuff you're learning because you're always just sharing all this interesting stuff. And part of it is you're just a big time learner, always wanting to dig into stuff and share it. But I think it's such a perfect fit for you because you get this amazing scope and scale at DataDog and all the low level stuff that's's going on there and you're just gonna see so much data and it's like I'm just really jealous. I think it's cool what you're
Starting point is 00:03:50 doing. So I do want to talk about your journey at some point but I want to start with like why we brought you on on the first place we've been sort of talking about this for a while and you just wrote this blog post about like rewriting Data Dog's Lambda extension. Rewrote it in Rust because rewrites are always an awesome idea, and Rust especially is a great idea. I guess maybe give me, yeah, tell me about that.
Starting point is 00:04:12 I guess I want to say first, just some background on why this is so hard and scary, not just the rewrite, but the extension itself, because Lambda is this dynamic compute environment where it's just like spinning up on demand. And when it does spin up, usually it wants something super quickly a response.
Starting point is 00:04:30 So you need to be able to spin up super quickly because you're like sort of wrapping this, the, you know, the customer's code. You can never fail. You can never fail no matter what happens. But you also have like these extremely variable workloads because there's like a million different languages and everyone can bring their own language.
Starting point is 00:04:46 You know, the execution time can be a few milliseconds. It can be 15 minutes. That's like super varied. Just like all this hard stuff that's going on that just like scares the crap out of me if I was doing any of this stuff. But anyway, with that sort of background, I guess tell me about this rewrite.
Starting point is 00:05:01 Tell me about the problem and like in what's going on there, and we'll dig in. Yeah, absolutely. So I think you hit on a lot of the interesting notes and to kind of intro the fact is lambda on the surface level seems very simple, and it is. But to make something simple, you have to solve a lot of hard problems
Starting point is 00:05:17 and then kind of give people pretty firm boundaries. And a lot of the things I think people complain about with lambda are the result of hard-won distributed systems lessons. So I'm talking about things like hard-capping the duration that a Lambda function can run. It's like no more than 15 minutes. And then when you put a limit on that, that process is killed at that time. Whatever you set, maybe it's five seconds or something. Capping the amount of memory and with it in Lambda, the amount of vCPU cores you get.
Starting point is 00:06:05 that a single processor application can consume. In a typical load balanced workload, in a typical big server kind of infrastructure, which you have is a load balancer and then a number of worker cores or servers behind that load balancer. And that load balancer is kind of sitting in front and receiving all the requests, all the traffic. And it will not send a request to a server if that server is not responding to health checks is a millisecond that the customer feels, frequent. to look at a very specific rewrite in Rust. Initially, the Lambda extension was built on Go. It was a fork of our main Datadog agent.
Starting point is 00:07:55 I think Go is generally very well suited to Lambda. I would suggest if you're looking to write a Lambda extension, explore Go and Rust. is that it's really not meant to be like an ultra lightweight runtime. It's a very full featured runtime. You have a garbage collector. You have massive go routines and I shouldn't say massive. You have go routines. You have all these different language features which are very, very nice to work with. But they can get heavy.
Starting point is 00:08:16 And especially when the code base gets very large, you end up with just like it's much more difficult to take a really large thing and pair it, pair it all the way down to the core essentials versus start from the beginning and go up. At some point in the last year, we were exploring all the different paths we could to make the Go agent work in Lambda and we weren't able to hit our performance goals. We were ripping out anything within a knit call and go that was blocking
Starting point is 00:09:04 at Rust. And the big, I think, big benefit there is of course, like, you get memory safety too. So it's like two things, right? Get out of the hot, get out of the hot path as fast as you can and get some memory safety. Yeah. Yeah. And you're talking about like, you know, all these being on the hot path, it's something your customer feels. And I'm guessing that people are like less tolerant of like third party introduced slowness to then it's like, Hey, if I wrote this crappy code and it slows down our thing, it's like, well, you know, I got, I have other priorities. But if it's like, if I'm pulling in this thing and it's being so it's like, hey, if I wrote this crappy code your product and then it starts getting slow. Nobody wants to have that.
Starting point is 00:09:50 You mentioned that the original one was, there's the Datadog agent that runs on a server in the background collecting metrics and forwarding them along. Basically the Lambda extension was a forked version of that? stripping things that we didn't need out as best we could. We had explored using Go plugins, which are a feature of the language which allows you to ship separate binaries and then load it. It's similar to like dynamic library load, like deal ID. The downside there is that one of the goals was to get the size of the compiled binary down as low as we could. That impacts your cold start time. One of the goals was to get the size of the compiled binary down as low as we could. That impacts your cold start time.
Starting point is 00:10:50 When you do that with Go, it doesn't know which features of the standard library are going to be included. It doesn't know which features of the standard library are going to be included on any of the plugins you load. As a result, every single plugin includes a full copy of the standard library. That was where we abandoned that project and started looking at a rewrite. And that's something that the main agent in a server the whole function crashes right there. And that's something that the main agent in a server
Starting point is 00:11:30 doesn't contend with, because it's either a totally separate pod, and then it's using HTTP to network and receive telemetry data from other machines or other pods. And inside of Lambda, that's not the case. necessary for us to have like a crash proof system or as much as possible. Gotcha. Yep. And what was sort of like the timeline on this road? Like when you release the initial extension, was it right away like, you know, this is good enough and this can work, but we realized there's some cold start issues and things like that. And we spent a lot of time improving that in Go as best you could.
Starting point is 00:11:59 And then at some point it you were just like, hey, we're not going to hit what we want. Or like, what did that sort of look like? Well, that's a great question. I think with the benefit of hindsight, maybe this story is told a little differently. But I think it's important to call out that when extensions were launched in 2020, cold starts over the board were still very bad.
Starting point is 00:12:14 Like across the board, cold starts were a problem everywhere. And lambdas continue to invest engineering time in improving that, including things like Java Snap Start or the container caching and loading we've talked about in the past. And all these important developments the best engineering time in improving that, doing everything we could to get that cold start down. But a certain element of gravity exists when you have all these thousands of lines of code and trying to remove them while still being completely compatible is a very difficult challenge. And that's where we landed on this. Well, we have the API boundaries that we need for this. And actually, I think if we really talk about ideal state from the ground up,
Starting point is 00:13:05 we don't really need a garbage collector as a Lambda extension. And a big part of that is because you have one function execution process per extension. It's one-to-one in the sandbox, which means I don't have to balance fairness across a bunch of different clients in the way that the main Go agent does. So when we looked at that carefully, it was like, well, we're gonna have to rewrite it all anyway, basically. Right, to like, to make the optimizations necessary to have the cold starts we need, we would basically rewrite the whole thing. And then we're like, well, we don't need
Starting point is 00:13:33 the garbage collector. So why don't we just manage our own memory? And Rust became a very apparent tool. Yep, interesting. Okay, so you actually thought, hey, potentially we could rewrite this in Go and make it significantly faster, but probably not to the rest levels that you got down to, but faster than like
Starting point is 00:13:49 sort of the original Go agent that you had. Yeah, exactly. And we had gone a really long way down that path. I think the binary we were producing was like 10s or 20s of megabytes smaller than the agent at the time produced. We had stripped out a ton of things which were kind of like irrelevant for within Lambda. And at the same time, it was like very difficult without kind of rewriting the core of the aggregation and the client fairness and the balancing and all these app like different components. And then we were like, well, we're
Starting point is 00:14:18 gonna have to rewrite this anyway. And if we don't need a garbage collector, and we would rather have crash safety, there's better tools for that. And it's been a great experience. I think it took us, we started the process in March 2024. We were live for beta users by, oh man, it was like late summer, late August 2024. We were live for beta users and then November at reInvent we went GA. So it was actually pretty quick. Yeah. And like, tell me about that rollout, especially like with the beta users. How did you find these?
Starting point is 00:14:48 Were these people that were having issues or maybe like clients that you're in touch with a lot? And how do they even roll it out? Do they pick like some low value functions where they can at least start testing it? Or like, what did that rollout look like? Yeah, that really varies by the customer. So we did a number of things. We identified, first off, we have teams inside of Datadog that rely on Lambda heavily for different features and different capabilities. Yeah, that really varies by the customer. and traces working. had a number of customers that had voiced their dissatisfaction with the initial cold start time that we had. So we were paying them and going directly and saying, hey, would you want to try this new thing out? Would you want to try this next generation extension out?
Starting point is 00:15:53 Yeah, of course, a lot of people were just willing to try it in staging. We asked them to, obviously, try this in a very low risk environment, in a safe way, and then you can rule it out. When it came time to release it into GA, and in fact, actually, when we released it into beta, the way we did this was kind of clever, I think. The bottle cap process, the Rust process, boots first, and it reads the configuration file in the environment, and it decides if anything there is unsupported at that time.
Starting point is 00:16:23 So we had a bare workload of, I think it was, metrics logs and traces for some runtimes, but not all, and it decides if anything there is unsupported at that time. So we had a bare workload of, I think it was metrics logs and traces for some run times, but not all. And we were like, well, just to be safe, we're going to just boot into the main Go agent. And that failover, we can fail over. And then we were able to boot the main Go agent and they were two different processes deployed in the same Lambda extension. And the Go agent was dormant until the Rust agent booted it.
Starting point is 00:17:00 And we had that for months. Sorry. Initially, you had to also opt in. So it would read the environment variable and then it also would make sure that you would opt into this next generation beta. And then eventually we switched it and then you have to opt out. And then the next step is going to be, of course, we're going to remove the opt out and then people are going to have to migrate to the next gen fully. Yep, gotcha. And okay, so you mentioned like it's only at first, it was only compatible with a few runtimes. I guess like what did what did that look like? Do you have to write sort of custom instrumentation for every runtime or like how does that even work with all these different runtimes and especially bring your own runtime? Like what does that look like
Starting point is 00:17:37 to actually write that instrumentation? That has been the joy of my career these last four years here at Datadog. That's very specific to that language. about the encoding supported and the encryption and things like that. popular runtimes. What about like, talk me through, I don't wanna say like the political, but like the organizational aspects of it where. Oh, it's political. I think that's fair. Well, yeah. I mean, like, I don't, yeah.
Starting point is 00:19:12 But like, I know like in the post, you mentioned how like, hey, I think you're a pretty wise engineer and you're like, hey, I'm kind of resistant to rewrites, but at some point you became convinced that like, hey, this is actually a good idea to do this in Rust. Who, did you have to get on board? Like what kind of process was that like, hey, this is actually a good idea to do this in Rust. Who, did you have to get on board? Like what kind of process was that like to convince,
Starting point is 00:19:29 I don't know how many effective people that like, hey, this actually is a good idea, make that case to them. Did you have a lot of control and autonomy in that on the serverless team or did you have to convince a lot of outside teams or what did that look like? Yeah, that's a, I mean, it is political. That's a great question. This goes back so that the namesake I mean it is political.
Starting point is 00:20:04 mouse traps. And it's all about building the organizational buy in to pursue large ambitious projects with like a high chance of failure. And one of the things they talked about was making sure that you compare the scale of the thing you have to the thing you need to build. And the point that that Mark brought up was ostensibly a bottling factory that manufactures and bottles like beer at the scale of millions and millions of bottles a day does the exact same thing as a bottle capper that you can just you know, take a bottle put a cap on and pinch and crimp Crimp the bottle they have the same
Starting point is 00:20:32 Functionality, but they have vastly different scales and that's kind of if you have a new problem You're looking to pursue a rewrite That's one of the things to consider and I think that was that was a big Story that we were telling. We have this fork of the go agent. We've effectively already created a second binary within it. The idea that we have one shared code base is already sort of a myth. It's not able to serve our purposes for these various reasons that our customers
Starting point is 00:21:25 Obviously the company wanted us to explore every path and make sure that we had crossed our T's and dotted our I's. I think we had done that, and at that point, once we had a really good internal document laying out the vision for BottleCap, everyone was really encouraging about it. Once they saw our perspective of, look, our scale is very different, and the actual deliverable features are very different. So we should just have a different thing for this use case. And that, I think, was able to sell the vision. Yeah, interesting. How long did you talk about having an interesting demo that sort of proves that point? How long did you have to spend on just that of making something workable enough to prove that out without building the know, building the entire thing.
Starting point is 00:22:05 How far down the road did you have to go there? Yeah, a demo is worth a thousand words. If anyone is looking for a career cheat code, I think the two things I offer is always go a level deeper than your peers, always building to like learn the behind-the-scenes reasons of why something works the way it works. And the second is have a good demo, have a good, you know, do the work to build a proof of concept along with your document. why something works the way it works. that blows people away when you're like, I think they were talking about like a P99 query time, 100 little 10 millisecond improvements and get that P99 down, you have to fundamentally rethink and design for that use case. And that was key to the fundamental design of the project was every single PR that we merged, we benchmarked every single PR even to this day, before we do every
Starting point is 00:23:36 single release, we look exactly at that not only the cold start time, but the runtime duration, like do we add overhead? Do we add more overhead when we're shipping data? Do we add, you know, additional network bytes? We had we're testing different compression algorithms as well to ship time duration, like do we add overhead? for them. And it's just a huge part of that work is the perspective of performance first at every step of the way. And that's how you get these, you know, you have to hold performance really close to the heart. And that's a feature. It's not, you know, not everyone needs it. But when you do need it, the only way to get it is to take a microscope at every step of the way. Yep. Yep. That was like one of my favorite parts of that. I guess like, what were you using to even measure all those different things to make sure,
Starting point is 00:24:25 hey, on every pull request, we don't do, what does that look like for someone that has not done that deep of performance work before? Well, we use Datadoc. Yeah, there we go, there's a pitch. This episode's sponsored by Datadoc. No, I'm just kidding, I'm just kidding. No.
Starting point is 00:24:40 No, no, no, it's true, it's true. Datadoc has this whole suite of observability and performance tools. And one thing that we used heavily was the native profiler. And if you haven't used a profiler like a CPU or memory profiler, you should absolutely try it. Pick one. I don't care which one. But I love hearing stories about people who have been experimenting with profilers and found bugs in their programs.
Starting point is 00:25:02 And that's a big challenge in Lambda. We have profilers that work in every major language and runtime for Lambda. and found bugs in their programs. So what I did was I rebuilt Lambda on top of EC2. And I used the Lambda runtime interface emulator, and then I wrote a custom layer for Lambda's telemetry API. And then I created these hell tests where this function would allocate massive, massive strings and then send them to the extension or create many, many, many spans or huge amounts of metric context or metrics and push them through the pipe.
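A toy version of such a hell test, just to show the shape of the idea; the sizes, counts, and payload format here are invented, and there is no real telemetry listener on the other end:

```rust
fn main() {
    // One pathologically large string, allocated up front.
    let giant = "x".repeat(10 * 1024 * 1024); // 10 MiB

    for i in 0..1_000 {
        // In a real harness this payload would be pushed to the
        // extension's local listener; here we only exercise the
        // allocation and formatting pressure the extension would see.
        let span = format!("{{\"trace_id\":{i},\"payload\":\"{giant}\"}}");
        std::hint::black_box(&span); // keep the optimizer honest
    }
}
```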
Starting point is 00:25:50 So what I did was I created these hell programs, these hell Lambda functions that would push the extension to the limit. And then every time we were testing a major change, Wait, you were using what? I ran it outside of Firecrackers because I needed access to these system calls. And from there, I created all of these test cases that I could run the program through before we shipped big changes. And one simple example was when we started the project, native threads or hardware threads. Tokio is the popular asynchronous runtime.
Starting point is 00:26:50 It gives you features similar to what you would get in Node.js where you can kind of async and await. Most of the libraries you use in Rust are going to be TokIO compatible. And when we added that, we wanted to be very sure it was a net win. So we were profiling it on these various CPU sizes, various memory sizes, and making sure that it wasn't causing us to block down. And only when we were able to prove that with a profiler and show that the runtime wasn't spending unnecessary time moving tasks around threads or that sort of thing, then we were comfortable releasing it.
Starting point is 00:27:23 When you say release it, like bring Tokio in, is that what you're saying? Correct. Gotcha, so you wrote it originally without Tokio, sort of native stuff, built this performance baseline, and then you're like, hey, bring this in, see if it makes any changes, it does not, we're fine here. Yeah, and that's still kind of a fluid thing
Starting point is 00:27:39 because the efficacy of Tokio is going to kind of vary, depending on the number of vio is going to vary depending on the number of vCPUs you provision your Lambda function. So in some cases it's better probably to have hardware threads and then just have it dispatch a different thread for everything when you have extra CPU overhead. But especially if you're running in those 128 megabyte Lambda functions, those really tiny ones with an eighth of a vCPU per second or whatever, then Tokyo becomes really helpful because it's able to kind of swap stuff out when needed. Yeah, yeah, interesting. Had you written any rust before going on this project? No, not a line. And I think that's a big part of why I was so resistant.
Starting point is 00:28:13 Not a line. That's amazing. Yeah. Yeah. Yeah, it... Languages in general, I think once you really learn and know one, it's easy, I think in my opinion, it's been easy enough to kind of pick up a second and a third with time. I think this has been a really fun time because LLMs became very popular. So like we had access to like Copilot and ChatGPT and that became super useful because I had a Go program that worked kind of and I was able to like copy and paste 1,000 lines of Go for a question of, an LLM is like a really great way to get a quick answer for a question of like, is this correct? Rust also has an incredibly good compiler. So in your editor, it will kind of just beat you, beat you over the head until the program is correct.
Starting point is 00:29:14 And I think that was also really useful because that combined with an LLM, you can generally, even without a lot of experience, produce technically correct software using that. Yep. Yep. Interesting. Okay. You mentioned that like picking up, like once you know a language, you can pick up other ones. And I, I think of you as like a Rubyist first, right? Is that, is that right? And then like when you joined a serverless, like a lot of JavaScript type stuff, had you done before going to Datadog, had you done any like, you know, more compiled, like very strong, strongly typed languages? Um, I've written a little bit of C++ previously and I have kind of recurring nightmares from it. Okay. Um, but I think my, yeah, my, my first kind of large experience with, with like distributed
Starting point is 00:29:58 systems was with serverless is, um, like backend, uh, Lambda functions written in Go. And that was kind of my, my first, um, I would even call that systems programming and I would like backend, Lambda functions written in Go. I would even call that systems programming. You just use some of the same systems programming techniques and approaches. We're not building a database here, but we do want safety. We do want to make sure we're using the correct locking mechanisms and we're not spending too much time waiting for a mutex, for example. So those principles still apply.
Starting point is 00:30:27 But yeah, I did, I spent a lot of time reading books and trying out programs and both Go and Rust have really good benchmarking, like micro benchmarking tools. So if you have like 12 lines of code and you want to say like, is this better than that, it'll give you this whole spectrum of tests. I mean, like Go Bench, I think is you this whole spectrum of tests. the the the when you have and not giving up too much latency because that's every every line of code is like giving up a little bit. And you need to be very very careful about adding new software on top of existing software without. Potentially blowing your your performance as always so it's it's makes it like it's really important to create those like micro benchmarks on HPR on each you know on each, you know, Is it the same for the lifetime of a string? from React and Redux, and we have these kind of
Starting point is 00:32:05 more functional processes that go into our thinking nowadays. And I think a lot of those are just exposed for you bare with Rust. It's like, oh, you want this string to be usable over here? Well, now you have to copy it. What did you think of the Rust ecosystem? Are there big gaps there? Does it feel pretty full featured and fleshed out? How did that feel? That has been tough, if I'm totally honest. Go has an army of very talented developers led by Rob Pike at Google.
Starting point is 00:32:35 And every tiny corner of- Is he still working on Go at Google? I can't remember. I'm not actually sure. Okay. But the language is mature enough. Yeah, for sure, yeah. And it's clear. There's an army of people working on it. I'm not actually sure. spend $300 million a year on this language until it's perfect. And I found Go, any single,
Starting point is 00:33:10 we wanted to use Unix domain sockets, Go has them perfectly implemented. And the interface works and it works with every abstraction. and you got to kind of time together, and then you might be on this older version. So it is more difficult. That's for sure. There were many times I wish that I had like the ecosystem of Go. But I think all that said, Rust is pretty mature. There's a lot of support and you are going to probably spend more time reading library code than you were expecting. I don't think that's a bad thing. You pick up a lot of the patterns and a lot of the the idiomatic strategies that you're gonna want to use anyway so but it is different I think it is harder than
Starting point is 00:33:48 then go yeah interesting is rust like mature as a runtime like are they still like changing stuff version to version or is it pretty stable and mature that way it's extremely stable well they have like a go one to two type thing or like where yeah where they out there they could and I think that would be definitely a battle. I don't expect it to change that much. I think due to the more limited nature. Again, the runtime doesn't handle that much for you. Not really in the grand scheme of things.
Starting point is 00:34:16 You have to do your own memory management. So there's less surface area for them to break. But of course, that can always change. Yep. surface area for them to break. Benchmark. AWS is either a secrets manager or KMS or what have you. To do that, you typically would use the AWS SDK. We imported it and we found immediately it was our largest crate, or the largest dependence we had by far, was the SDK. It impacted something like 10 or 20 milliseconds
Starting point is 00:35:45 Now making these API calls is pretty straightforward in a sense. They're old, like, SOAP APIs, so they have passed the arguments you want in the header, and you then sign those headers with this AWS SIGV4 or SIGV4A, which is the newer version. And then you can make the request. I did have some issues with it here and there. I think I had not initially added support for Java's Snapstart because the credentials can expire, because they are stored in the snapshot. But for the most part we've been pretty happy with that decision. And the biggest reason is we needed two API calls and we didn't need the entire SDK.
Starting point is 00:36:25 And so is that just like a function of, yeah, you have different requirements than the AWS SDK team has where they need to support everything and because of that they have to make some different choices and like a little bit of latency and size is acceptable for them. Whereas like, you know, you don't need that much and it's easier to do it yourself? It's been painful and they're just not as, they're simply not as performant as hand created bespoke SDKs from every language. But, but Amazon, I think really wanted a way to, to have, you know, an automated release. They could kind of write a bunch of metadata and then push this release out for
Starting point is 00:37:17 everybody. And, um, it would be the same for every runtime. And that was the priority. So for us, it's just, we have different priorities. And I think- Interesting. Go ahead. I was gonna say, I think you'll find a lot of agreement with different people who talk about, why do we pull in so many dependencies,
Starting point is 00:37:34 just take the parts you need. I think that's kind of a growing chorus in the software world these days, which is stop just grabbing random node libraries or Rust crates or Go packages, like just take the little bit you need and go from there. And I think that's kind of a part of the ethos we took for
Starting point is 00:37:49 was like, we really just need this slice and we don't really need the 10 or 20 millisecond hit, especially because not all users use those API key storage mechanisms, right? Like some people, whatever, uploaded it with an environment variable, which are encrypted at rest or chose other options. And as a result, they didn't need this.
Starting point is 00:38:07 And we didn't want them to kind of pay the overhead as well. Yeah, interesting. Were LLMs useful at generating that particular code? Do you remember? I think that is yes, because kind of like a Rust compiler, when you're dealing with signing a request with AWS SIG4 and kind of like a Rust compiler, when you're dealing with signing a request with AWS SIG 4 and kind of making the request,
Starting point is 00:38:35 it only works if everything's perfect. So you know you don't remember that being that helpful for that portion. But I do, like I said, I think I remember being, LLMs were most helpful when we're parsing large config files that we needed to write custom deserializers for this Rust library called seerday, which stands for serializer deserializer. And those get a little gritty. Well, hairy. Yeah, and there's also a lot of examples of it.
Starting point is 00:39:02 So LLMs became like a really natural fit. Yep, yep. I had a similar, like the reason I asked about the LLM is I had something similar recently where I'm receiving emails on Amazon SES and you can choose to have those encrypted when they come in and put them to S3. And if you do that, they encrypt it with... They encrypt the client side with the S3 encryption client is what it's called. And then you decrypt it with that, but they only have the S3 encryption client for like six languages and JavaScript is not one of them. And it's like, I didn't want to have a,
Starting point is 00:39:33 like one other, you know, function in Go or something like that that's using that. So I just like went to the Go library. I found out which sort of encryption algorithm they're using. I went to the Go library and I said, make this, but JavaScript, no, I said, make this but JavaScript. LLM just made the whole thing and you know,
Starting point is 00:39:46 you test it and it works and yeah, good to go. So it's like. See, and that's like a great example of going a level deeper on a problem, right? Cause like a number of people would either just give up there or they would maybe, whatever they would like add a sub process or like call out to, you know, a C program
Starting point is 00:40:00 or like a binary and make it do the work. And instead you're like, well, I'm just going to read the code and then I'm going to figure out how to generate it myself. And yeah, of course, like the LLM helped you, C program or like a binary and make it do the work. are at their best when you give them a very specific direction, you kind of know the shape of the answer you want. But I do find it very useful and we did use it heavily throughout this process. So that was a lot of fun. It's something that's like, yeah, you could do it a lot faster and you can see if it works. projects like BottleCap where you have this purpose-built thing for the specific niche, are going to just grow.
Starting point is 00:41:04 That's the expectation, is that this thing should be as minable as it can be across any dimension we care about. It needs to be kind of bespoke for that. That's kind of tough because SaaS businesses were created around this guise of, well, you'll write the software once and you sell it over and over again with zero additional marginal cost of goods sold. I think that is going to start to change a little bit. and you sell it over and over again considered this and we ruled it out early on. But now with this new performance improvement,
Starting point is 00:41:45 we signed up and it's been great. Yep, that's awesome. Tell me more about the bespoke software stuff. Do you think we'll see that in, this is like a developer tool, which is like one sort of thing where we're sort of used to using LLMs and, or we know our specific requirements quite well and evaluate those things. I guess like, do you see a lot of bespoke software happening in B2B SaaS or
Starting point is 00:42:13 prosumer or even consumer type stuff? Like do you think software is going to be like pretty bespoke across the board for like all types of consumers going forward? Yeah, I do think so. I mean, obviously, the beating heart of systems, the backend distributed systems are going to be limited in what they can necessarily adapt to or clone. Although I do think the cost of running software is approaching a commodity price. These managed services are getting close to the limit of what you can really do in a general sense. But I do think that as far as like people using your libraries or interacting with your APIs, there's going to be this expectation of it being pretty highly customized. Now, whether that happens on your end on the API level or via, you know, like a model context protocol kind of interacting where an agent on their end takes your API, consumes it and modifies it in the way the customer wants.
Starting point is 00:43:05 context protocol interacting where an agent on their end takes your API, consumes it, and modifies it in the way the customer wants. I'm not sure. I think it's going to be a little bit of both. But I do think the age of, oh, you want to interact with Stripe and here's their crazy, well, Stripe is a great API. like Zora. people are going to want more of it. Yep. Yeah. I know the interesting thing is like, you know, I think we can create software a lot faster. It's hard for me to imagine like a world with a hundred times or a thousand times as much software as we have now. But I think that is like coming. I just can't really picture what that's what that's going to look like. But yeah, I think it's going
Starting point is 00:44:02 to be yeah, I think it's going to be pretty wild. I do think it'll be the dimension of personal software is one aspect of that, where you're going to be able to customize apps on your phone or on your computers a little bit better. The journey, if you've watched or you've experimented with any of the home brew labs or home hosted solutions has come a really long way. The different toolkits for starting up your own web server or running a server on a bare metal box in your basement have gotten way better.
Starting point is 00:44:31 I'm thinking about things like tail scale, which have just been incredible, where if you wanted to create your own VPN previously, you have to deal with open VPN and all these very challenging tools to set up and run. And now it's just two off clicks, one on your phone and one on your home server, and then you can connect securely anywhere in the world. challenging tools to set up and run. maybe isn't the right way I think about it. I think about it more of like the,
Starting point is 00:45:04 if you compare binaries of software between each other, there's gonna be, you know, 80% variance versus previously it was like configurable and it was like a very rigid. Yeah, yeah, interesting. I guess, so on that same point of AI usage, like how are you using AI day to day? I think there's like a broad spectrum of like, you know, copilot, tab complete,
Starting point is 00:45:24 there's cursor, more agent mode, there's like a broad spectrum of like, you know, copilot, tap complete, there's cursor, more agent mode, there's cloud code. I guess like where, what are you sort of using day to day? I'm trying it all. I'm just trying to stay super curious. Again, like my big ethos is like, kind of try everything and go a level deeper on everything. So my main workflow is still, I can't get out of Vim.
Starting point is 00:45:43 I'm still on Neo Vim. I've been a Vim user for a couple of decades and I just did nothing, everything feels slow. Once you know Vim motions, everything feels slow. So I use VS Code for a little bit. I do have Cursor, I have used Cursor. There's a version of a Cursor-like experience called Avante for Neo Vim that I'm really pleased with.
Starting point is 00:46:01 I'm having a lot of fun. I also like Claude Coden a lot. That's been a very cool tool. It allows me to chat back and forth with it. that I'm really pleased with. that drives the number of mem copies to minimum. Like whatever it is, however many times you have to copy the string, make it as little as possible. And introduce lifetimes and if you need to, static lifetimes or even like a memory arena to hold some data.
Starting point is 00:46:35 And like we haven't had to do that yet, but obviously if you throw that at an LLM, it totally crashes and burns right now. Yeah, oh yeah, yeah, for sure. Like that sort of hard thing. That's I'll be curious like when that becomes truly doable. But like right now I do, I like a full, a few like full stack apps I'm helping with.
Starting point is 00:46:55 And that it's just like, so like you have these patterns. So it's like, Hey, go make this new data access pattern. Now write the route for it. Now write like the front end service to consume it, now write the display logic, and it can just do that so easily once you have a few patterns in your application. It's like, yeah, it's not super hard code,
Starting point is 00:47:15 but it does save you a lot of time and doesn't drain you from the monotony of that sort of stuff, which is fun. I do think it's been an absolute boon for application, like web application development, and I think a big part of that which is fine. And as a result, it's sort of regressing to the mean one shot it. Now at the same time, I really, just yesterday, I fixed a bug where I used the LLM to parse a YAML file. And it inadvertently broke an environment variable.
Starting point is 00:48:15 And I didn't know it. And it was my own gap. I didn't have a test for it. But the LLM was like, oh, no, this is it. This is all you need. And I tried it, and it worked. So I was like, OK, we'll do this. And then I get a bug report. And sure enough, it turns out, like, oh no, this is it, this is all you need. And I tried it and it worked. So I was like, okay, we'll do this. And then I get a bug report.
Starting point is 00:48:26 And yeah, sure enough, it turns out like, oh, this no longer processes the environment variable version of this correctly and it needs to. And it's just, yeah. So. It is hard because like you start to trust it. You're like, man, it nailed that last thing. This seems pretty similar.
Starting point is 00:48:38 Like in that same wheelhouse, like I'm sure it'll do it. And it's like, you scan it over, but you're not looking quite as closely as you can. So it's like, you gotta figure out, you not looking quite as closely as you can so it's like you got to figure out you know test coverage slash eyeballing slash manual clicking around like what what you sort of need for it and then based on your level of seriousness for like how important it is if there is an error and things like that it's gonna vary so at the risk of sounding too old I I think it's a boon for early career software developers
Starting point is 00:49:05 because you basically get a mid-senior engineer with unlimited patience that you can just ask questions to. And I think for me, I was always getting the feedback that I was too needy early on in my career and asking for help too much and not spending enough time trying to run things down myself. And I think the LM is like such a great tool for that. But you do still get I mean, I'm sure you've experienced this. And I think everybody has where it gets caught in a loop where it can't the initial approach it chose wasn't correct, for whatever reason.
Starting point is 00:49:36 And maybe it was my fault, I didn't give it enough context. But then it runs down an unsustainable path where it's like, deleting code or you know, leaving methods around and that sort of thing. And I think that we're just not there yet but it is a really exciting time and I'm enjoying it and I you know I don't really think it's going to break about like mass unemployment I think it's just going to create kind of more demand for software. Yep yep so you're not in the AI 2027 camp? I mean I guess I don't have enough information to make that decision if we're being totally pragmatic. I think it's obviously a possibility especially if it gets to the point where it can do all versions of independent
Starting point is 00:50:08 thought independently, then of course, yeah, that's going to change my opinion on it. All of the companies that are working on this technology that are out there are expanding, they're taking what they have, like the models they have, and they're deploying them in different modalities and different integrations and that sort of thing. They're not saying, okay, we have like 20 years of advancements ready for you next week. Right. And that's sort of where I'm starting to indicate that we may be, um, maybe unable to solve some of these really core problems, but that doesn't mean the tools are useless. It's super useful. Yeah, it's super. Yeah, I completely agree. Okay. I want to switch and do some, some AWS
Starting point is 00:50:42 info takes, cause I think you, or AWS slash Infra takes a lot of AWS stuff, but I feel like you always have some good stuff on that. So first of all, I have a few that I've heard from you before, but I wanna just hear you defend publicly. Number one, you're like a big fan of AppRunner, right? You're like one of the two people that loves AppRunner. I guess, are you still, are you, you and Jeremy Daly
Starting point is 00:51:02 and like Chris Munns, those are like the three that I think of for it. So like, I guess, are you still there? And if so, sell me on it. I can't recommend that anyone use AppRunner anymore because I think just as evidenced by the entire lack of changes in the last five years, I don't think they're working on it. I don't have any special information to indicate that, but that's just my gut feeling on it. So I don't want anyone to wake up tomorrow and they're like, oh. I don't have any special information to indicate that, but that's just my gut feeling on it. So I don't want anyone to wake up tomorrow
Starting point is 00:51:28 and they're like, oh, it's gonna go into life. Like you have to move everything again. What did you love about it? Oh man, I think Amazon was able to somehow solve a two piece of team problem for the first time in the history of the company. And they were like, we're gonna solve this problem vertically end to end.
Starting point is 00:51:43 And it was a really cool experience. If you hadn't used app runner, it was it gave you this really managed experience where you could connect it straight to GitHub, it would build a container out of your application, and then it would run it as a like a managed container service. So more similar to Fargate, where you have kind of a long running container, but it could scale up for you, it could scale down for you, it could like turn it off. It never really scaled to zero super well, but it had you know, scale up for you, it could scale down for you,
Starting point is 00:52:25 we want this experience, but for Java 8 and we needed to be compatible with these SOC 2 or bank regulations or whatever and all these different hard problems to solve where they weren't quite sure the addressable market was there. and kind of ignore some of those really hard problems. The product was good, it was fun to use. They had intractable problems that they had to solve. For example, if your application failed a health check, it would roll back to the previous version, but it wouldn't roll back your CloudFormation deploy. So it wasn't like the full feature you get from Lambda, with some indication on cloud formation and it just didn't do that for AppRunner. And there's just like a number of cases of that where things kind of languished
Starting point is 00:53:26 and they weren't able to solve it, unfortunately. Yeah. I think sort of on this note, one of my biggest complaints about AWS is it can be really hard to know which services are like soft deprecated, you know, or like set up for deprecation. And it's like, of course they can't come out and say,
Starting point is 00:53:45 we're deprec, but it's like, man, like, you're trying to make decisions on what things to invest in. And it's hard to know what they're investing in, you know? And it's nice in some sense that they're starting to deprecate some stuff and be a little more clear on that. But even like, even like, yeah, you have to like read the, read the tea leaves and try and figure out, oh man, they haven't done any updates
Starting point is 00:54:05 for app runner in a long time. There might not be a team there. Well, not even that like the app runner got has gotten so bad that the creator of Java who was a distinguished engineer at Amazon, flamed the app runner team in a GitHub issue and said, you're on Java, like, I think it was 17 or something like, why are you so so far behind? Why is this project so far behind?
Starting point is 00:54:24 And this is the guy that wrote Java. So, you know, then I think that's something, like, why are you so far behind? Why is this project so far behind? But yeah, I agree, you do have to kind of read the tea leaves. And like everything in life, if you go with the flow, it's a little easier for you. So you're going to find more success on EKS or ECS or Lambda than you are on like an app runner. So I think, you know, write a container, use a container based Lambda function when you're outscaling that, move it to Fargate and then be happy.
Starting point is 00:54:54 That was my next question. Same thing with container images over zip files. You've been team container for a while now. And I've been right. Tell me, yeah, yeah. Tell me why. Tell me why I shouldn't use a zip. Yeah, I mean, it's just clear that container and I've been right. Yeah, I mean, it's just clear that the open container
Starting point is 00:55:11 initiative and OCI standard has become the standard for packaging applications. It's true that containers are objectively worse packaging mechanisms than zip files. The idea of a zip, and it's a beautiful idea and I think it's pure in an academic sense. So if you want me to defend the lambda zip base function, the reason is you zip up exactly only the bits you need, only the tiny components you need, the dependencies you need, and then lambda provides the rest. And as a result, they're able to cache those base images really, really well. So you get faster cold charts for the bits there than you maybe would have provided and so on and so forth. The problem is people get it wrong all the time. to cache those base images really, really well,
Starting point is 00:56:05 like a copy of, I don't know, like Elf in here for some reason. So that's one aspect of it. I think containers, while people still are confused by them and make elementary mistakes with them, they've kind of won the packaging war in the cloud ecosystem. And as a result, it's easier to just kind of go with the flow there. Lambda also put in a ton of work to make them really fast. I did a blog post and a video about it
Starting point is 00:56:33 when I benchmarked it, but they do something really cool. They, every time you deploy a container-based Lambda function, Lambda will go to elastic container registry. It'll pull the container, and then it will create a hash of each 512 kilobyte chunk in your container image, and then it will look and compare and see if it's already seen those chunks, and if it has it just drops them. And then it creates a manifest file
Starting point is 00:56:56 for your Lambda function, and then it creates keys based on each of those chunks that are specific to your function, and then it creates kind of a main key that encrypts all of the keys for all the chunks, and it stores that. Then when you have a cold start, when it has to bootstrap a new Lambda sandbox, it goes out and it uses this concept of content addressable keys where it says,
Starting point is 00:57:14 okay, find me all of these chunks based on this hash. And the only way that you can decrypt all of those chunks is with that primary key, that main key, that encrypted all of the other keys that are used for each of those 512 kilobyte chunks. So that means that if you and I have the same chunk of a container, we can just reuse it and we can share it. And that means that those are often for very rudimentary files, those are probably already cached out there somewhere. So you're able to just have really, really fast cold starts, even though they support
Starting point is 00:57:44 up to 10 gigabyte images, which compared to a 250 megabyte Lambda function is a nice win. Yeah, yeah. Yeah, definitely check out Avery's write up and video on that and like the original stuff by Mark. But like that whole thing, like Mark's stuff on it, the way you explained it, I was just,
Starting point is 00:58:02 But that whole thing, Mark's stuff on it, the way you explained it, that was one of those things where it's like, man. Yeah, and now you can just use a container-based Lambda function. It's as fast as or faster than the zip function, and it's easier to develop. You get more consistent, reproducible builds. You don't have to worry about these crazy bash scripts that have to run on your MacBook and cross-compile, or any of that. All of it just works right now. And I think that's the benefit. Yep, yep. Okay, next question. This is on AWS. What is a more annoying problem: that there's not a way to sort of enforce the free tier, where it's just like, hey, I don't want to spend any money on this,
Starting point is 00:58:51 So I want this to be like a free account, you know? Cause you sometimes see those people that like, they're a student, they get like a $3,000 bill. That one, the free tier problem. Or what I call like the new production account problem, where when you create a new account, you have to request like Lambda limit increases because you get like 10 concurrent functions or like to get out of SES sandbox mode is like a total pain in the butt or SMS sandbox mode and like all
Starting point is 00:59:15 these like permission things that you have to go through just to like get through basic stuff which which of those is more annoying? I wish both. Can I press the button for both? I think that's what I'm talking about. The free tier billing thing is such a miss on their end, but it's just not meant for students. Every time someone says, there are ways to avoid it, but it's just not designed.
Starting point is 01:00:05 that problem does like the reason they do that is to prevent the second problem or the reason they do the second problem is to prevent the first problem, which is, you know, if they gave everybody uncapped accounts all the time, the abuse would be rampant. So that's why they lock it down. I do think if I were to start a consultancy consultancy around this, I would just like I know the magic words to talk to SES and get them to lift the limit and get you out of sandbox mode. You know, I know the same for Lambda and all those kind of places you're going to run into it. That is tough. I wish that when you're using an Amazon organizational unit, an OU, when you create a new account
Starting point is 01:00:33 within that organization, they just gave you limits from a predefined set that's allowed for your org. I think that would be really, really nice. So, yeah. Or if you could pre-verify an account in some way — validate your bank account or something like that, something that would be hard for a bad actor that's just going to come in, steal compute, and leave. The truth is that even the students getting these bills, and people like you and I that want
Starting point is 01:01:00 to spend, say, a hundred bucks a month on Amazon — it's not designed for that either, right? The actual value you get out of Amazon and AWS is when you're on these committed-spend accounts, where you're going to commit to spend millions of dollars a month, and from there you can kind of pick whatever compute you need and those types of things. It has to work at that scale. And I think, as a result, because that's where the revenue is, that's where it's optimized for. It's unfortunate. There's people in the community doing great work
Starting point is 01:01:29 to try and help everyone avoid those sharp edges — I count you in that group. But it is tough. And I don't know how you solve that problem, because the scale at which Amazon can give you compute, and thus bill you, is so fast that it's very, very difficult to say, okay, now you have this limit. I would say there's an educational account available for students where what you can do is very limited, but you don't even type in a credit card. Oh, that's interesting.
Starting point is 01:01:53 I didn't know about that. That's cool. Yeah, but it is very, very limited — it's a learning-path system. This second issue is my hobby horse right now, because I was helping set up a new account recently, and it took me multiple emails just to get out of SES sandbox mode. And I'm like, no, I'm just sending transactional emails, not marketing emails. It's like, this is exactly the thing that's in the way.
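For what it's worth, some of that new-account ritual can at least be scripted. A hedged boto3 sketch — the quota code shown is illustrative, so look up the real one for your account first, and the website URL is a placeholder:

```python
import boto3

# Ask Service Quotas to raise Lambda concurrency for a fresh account.
# NOTE: the quota code below is illustrative -- list the real one first:
#   aws service-quotas list-service-quotas --service-code lambda
quotas = boto3.client("service-quotas")
quotas.request_service_quota_increase(
    ServiceCode="lambda",
    QuotaCode="L-B99A9384",  # "Concurrent executions" (verify for your account)
    DesiredValue=1000,
)

# The "get me out of SES sandbox mode" email, as an API call.
ses = boto3.client("sesv2")
ses.put_account_details(
    MailType="TRANSACTIONAL",
    WebsiteURL="https://example.com",  # assumption: your real site here
    UseCaseDescription="Transactional email only: receipts and password resets.",
    ProductionAccessEnabled=True,
)
```

A human still reviews the SES production-access request on Amazon's side, so this doesn't remove the back-and-forth — it just puts the magic words in a repeatable place.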
Starting point is 01:02:13 Yeah, it's like, come on. You're great. So, okay, next one: AWS network costs. You are one of the biggest opponents — one of the people I think of who says these are total bunk. I think they actually have some validity to them. I guess, tell me what you think on AWS network costs.
Starting point is 01:02:32 And I'm mostly talking inter-AZ costs, also egress costs. You know, they charge you a cent, a couple cents to get out. Like yeah. Yeah. Per gigabyte. It's not cheap. I forget all the numbers and it will change so I don't wanna misquote it, but it's crazy how fast that becomes
Starting point is 01:02:52 your highest bill line item. You can spend millions of dollars a month on AWS and then all of a sudden, EC2 networking is your number one bill. And I think there are some valid points. I mean, it's a limited resource and it's sort of a public good, so you have to defend against the tragedy of the commons. But then you look at WarpStream, whose whole premise is: we're going to give you a binary, and by running Kafka backed by S3 you were able to realize something like a 10 or 20x cost improvement.
Starting point is 01:03:47 So that tells me that networking costs can't be that much of a hit, because you're able to do that and save so much money, and it's built into the price of S3, which is already a commodity product. So it's very clear to me that you and I are paying a much higher bandwidth rate than the S3 team pays. So now we have all these cases where they can sell their company for $220 million because they found this kind of cheat
Starting point is 01:04:11 in the bandwidth billing. That's truly what it is at its core. And good for them. Both those guys, Ryan and Richie — I know they've been on your podcast. I've really appreciated talking with them over the years. I think they're really sharp people. Yeah, they're sharp.
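To put rough numbers on that cheat — everything here is an assumption (list prices change, and the cross-AZ fan-out depends on your topology), so treat it as a back-of-envelope sketch, not a quote:

```python
# Back-of-envelope only -- prices and fan-out factors are assumptions.
INTER_AZ_PER_GB = 0.02          # ~$0.01/GB out + $0.01/GB in when crossing AZs
S3_PUT_PER_REQ = 0.005 / 1000   # approx S3 Standard PUT price per request
gb_per_day = 10_000             # assumption: ~10 TB/day through the cluster

# Classic 3-AZ Kafka: producer->leader (~2/3 of writes cross AZs),
# leader->2 followers, plus consumers reading cross-AZ ~2/3 of the time.
az_crossings = 2 / 3 + 2 + 2 / 3
kafka_network_monthly = gb_per_day * az_crossings * INTER_AZ_PER_GB * 30

# S3-backed design: batch records into 8 MB objects; cross-AZ replication is
# already inside S3's price. (Storage and GET costs omitted for brevity.)
puts_per_day = gb_per_day * 1024 / 8
s3_requests_monthly = puts_per_day * S3_PUT_PER_REQ * 30

print(f"inter-AZ replication: ~${kafka_network_monthly:,.0f}/month")  # ~$20,000
print(f"S3 PUT requests:      ~${s3_requests_monthly:,.0f}/month")    # ~$190
```

Under those assumptions the replication traffic costs roughly two orders of magnitude more than the equivalent S3 request volume, which is the arbitrage being described.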
Starting point is 01:04:24 They're smart guys. Yeah. But it is exploiting the billing model. And they should — it's a good thing, and I want more people to do that. I think more data systems are going to do that kind of thing. But my question on that — and I think it's the strongest point against my argument — is what's going on with S3 there. And I don't know if that's like a billing mistake that they're locked into now, like they sort of mispriced that in some way
Starting point is 01:04:47 and didn't account for it very well. I think there are sort of three possibilities. The first is that it's a billing mistake. The second is that inter-AZ networking actually does not cost that much, or is not that scarce, and they're overcharging for it. And the third one is,
Starting point is 01:05:03 hey, there are some efficiencies the S3 team is able to get, or some amount of predictability there, compared to me just coming in and throwing a bunch of traffic at them that they can't plan for — so they need to charge a different rate for that sort of thing. So, yeah. Yeah, so it's a combination of two and three.
Starting point is 01:05:25 I don't really think it's one. And I think two is a high margin for them. Especially within an availability zone — they've already talked about this. AWS has given talks, like one from Colm MacCárthaigh and another engineer whose name I forget. There was this great talk at re:Invent about the hollow-core fiber they were using between AZs in US
Starting point is 01:05:43 East 1, so they were able to actually get something like double the range that the buildings could be away from each other. As a result, they were able to just buy more real estate and have more capacity in us-east-1. And that tells me that once they have the trench dug for the fiber, running huge backhaul cables — massive links doing hundreds and hundreds of gigabits per second across many, many redundant paths — is very cheap for them to add. I don't think that bandwidth is too saturated,
Starting point is 01:06:20 but because those links operate on peak bandwidth — or I shouldn't say that; they have to account for the peak of transmission between two AZs, two physical buildings — it becomes a very big opportunity to optimize when you can do things off-peak. So I'm thinking of Elastic Block Storage: EBS snapshot replication is pretty cheap. I think it can actually be cheaper than S3. My guess is they're just replicating it at night, when there's not very much load, and as a result they get a lower quoted price internally or whatever. So I think there's opportunity for those types of things.
Starting point is 01:06:50 But I don't think those links are super taxed. I'm sure they are at certain times, and the idea is, we don't want people to abuse that — I get that. But for things like Kafka, where best practice is to run multi-AZ, it kind of sucks,
Starting point is 01:07:04 because you're just going to pay this huge tax. Yeah. And the last one I'd throw out is that other clouds don't bill for that — so it's really hard for me to make the point. Well, some don't. Okay, so Google does, right? Google does. Microsoft says they do, but they don't, which is weird, I think. Yep. And Oracle doesn't at all. Oracle's trying to catch up in some sense, so. Yeah, of course. And then Cloudflare has this whole game,
Starting point is 01:07:34 they're like, we don't — but then once you get big enough, they do, is what it sounds like. They get a little pious-sounding on that sort of thing, and I don't know what the story is. They also just have a very different traffic profile than AWS does, I think, given all the ingress stuff. It reminds me of the strategy credit
Starting point is 01:08:01 that Ben Thompson at Stratechery talks about, where sometimes, just because of who your customer is or the shape of your product, you can claim some things that are actually much cheaper for you than for everyone else. You can position it as a value, like we're doing it out of good. Yeah, arbitrage that. Yeah — we're doing it out of the goodness of our heart, or because we really believe in this, but actually it's just a lot cheaper for you than for someone else.
Starting point is 01:08:26 So it's like a talking point, or an argument you can use as a sword against them. And yeah, I feel like I really want to get to the bottom of what's going on with Cloudflare there, because they talk one game, and then it sounds like the reality is a little bit different on some of that stuff. Yeah, I just think that their business is structured in such a way where they can, like
Starting point is 01:08:46 you said, they can kind of use that as a marketing tool against others. I think a similar one would be: they advertise that they only charge you for CPU time, so if you're waiting on the results of something over the network, you don't get charged for that in Workers. I think that's great. But even for the largest-scale Lambda deployments, Lambda itself is typically not the number one bill item for a serverless application. It's S3 or Dynamo or API Gateway or CloudWatch or SQS or whatever. SQS, yeah, exactly. Those bills add up. Typically, I think Lambda is number three. So sure, that's true, and I think it's an opportunity for Cloudflare to kind of hit AWS with it every chance they get.
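A hedged back-of-envelope for that comparison — the prices are approximate list prices at the time of writing, and the workload shape is an assumption, so verify before relying on any of it:

```python
# Duration billing vs CPU-time billing for an I/O-bound handler.
requests = 10_000_000         # assumption: 10M requests/month
wall_s, cpu_s = 0.200, 0.010  # assumption: 200 ms waiting on I/O, 10 ms on CPU
mem_gb = 0.125                # a 128 MB Lambda function

# Lambda: ~$0.20 per 1M requests + ~$0.0000166667 per GB-second of *duration*.
lambda_cost = requests / 1e6 * 0.20 + requests * wall_s * mem_gb * 0.0000166667

# Workers: ~$0.30 per 1M requests + ~$0.02 per 1M *CPU* milliseconds.
workers_cost = requests / 1e6 * 0.30 + requests * (cpu_s * 1000) / 1e6 * 0.02

print(f"Lambda (bills wall time): ~${lambda_cost:.2f}/month")   # ~$6.17
print(f"Workers (bills CPU time): ~${workers_cost:.2f}/month")  # ~$5.00
```

Under those assumptions the gap is real but small — and either number is a rounding error next to the S3, Dynamo, and CloudWatch line items just mentioned, which is the point.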
Starting point is 01:09:26 But if you actually look at what that would cost you at the end of the day, maybe it's not as much as you'd think. Yeah, yeah. So I still want to get to the bottom of this. I feel like no one has a great handle on the networking stuff, but the things I would say are in my favor are: Google does actually charge for it.
Starting point is 01:09:44 Microsoft says they do. Cloudflare is kind of cagey on it. And then also — I can't remember if you were there — I was talking to someone at re:Invent that works on EBS, and he was saying they spend a bunch of time trying to optimize around those network limits, a lot of different things like that. And it is a factor for them. Someone on CloudWatch was saying, hey, we get billed for network costs, and we look at it and think about it. And again, like you're saying, they're preventing a tragedy of the commons, where if you don't think about that,
Starting point is 01:10:14 then you're going to use this inter-AZ traffic just willy-nilly, without thinking about it. So at least putting a cost on it makes you think about it and at least consider that trade-off. Yeah, I think the issue I take with it is that it started that way — we just need to make sure people don't abuse it — and it's sort of ended up somewhere else now.
Starting point is 01:10:35 That's true — it is like a profit center now that they can't give up. That's a good point. So I think if you start to see an actual slashing of prices and competition across clouds in a huge way, I think you'll see that cost start to come down. And if you see them cutting that cost, then it's kind of a strong indicator that it was a big part of their margin. Yep. Yep. Interesting.
Starting point is 01:10:56 Okay. While we're on this point — Cloudflare. Are you using Cloudflare? I've played with it. I've really been trying to use their container service, and they haven't added me to the beta, and they announced it like a year ago. So I'm a little annoyed there. Come on, Boris, Michael — somebody get him in there. Yeah, yeah, I'm trying. I'm hoping to try it out. I haven't really used Workers
Starting point is 01:11:14 because I don't write a lot of JavaScript anymore, so adapting to that hasn't been my primary focus. It hasn't been a huge focus of mine, but I do admire their work. I know and respect them — a lot of friends over there. I think they're a really great group, and I think they're kind of going in the right direction. Yep, yep. Yeah, the big thing for me, I'd say, is:
Starting point is 01:11:35 I do like a lot of the people there — very sharp. But the concepts are weird, right? You really have to change how you think. It's different than containers in certain ways, or servers in different ways, but it's also very similar in a lot of ways. And I feel like Workers and Durable Objects are just way different than some
Starting point is 01:11:53 of the other concepts. You really have to go all in and understand those sorts of things in a different way. And then still, the surrounding ecosystem — permissions, logging, infrastructure as code — is just not as robust as the other stuff. So that's what's held me back on Cloudflare. But I do like the people there and what they're doing.
Starting point is 01:12:12 So yeah. Yeah, I do think they need a better infrastructure-as-code story. I think that's definitely a big gap. I know SST has been pushing in that direction — yeah, they've got some good stuff for it — and if that cracks it, I think it's kind
Starting point is 01:12:26 of going to be a big growth opportunity for them. Yeah, for sure. Okay. Last one on the infra area: the database ecosystem. What are you seeing out there? What do you like, what are you excited about? What do you think about it? Man, I dabble in all of it. My favorite part of programming right now is database programming and systems programming.
Starting point is 01:12:40 So I've been following Sam Lambert, of course — talking about PlanetScale Metal has been really exciting. And I totally agree with him on the free tier. I think free tiers are just a massive cost sink, and they don't really convert super well, from my experience — mostly at Serverless. I think it's just that you need
Starting point is 01:13:00 incredible margins at scale to be able to subsidize a free tier. And I think if I'm building a SaaS today — if I'm going to quit and do my own thing — I would do a two-week trial, and that's it. If you like it, you pay for it. If you don't, you don't. Get serious or not. Yeah. And it's more about who I want to partner with as a customer, and not so much about talking up my user-base numbers. I think the growth of users on free tiers became a talking point for raising money, and that's why you see it — it's the tail wagging the dog. That
Starting point is 01:13:27 was a metric they could put on the screen and say, look at all these users we're getting — but none of them would ever pay for the product. So if you want to build something sustainable — and I know Sam talked about that on your podcast — I think that's a much better route. So yeah. I think Aurora DSQL is a very interesting piece of technology. Not everybody needs multi-region — and I think that's where Sam would push back — but it's a really cool tool. So if you have that need, it's awesome. If you don't, yeah, there are some rough edges. But if you really just want a relational data store with Lambda, Aurora is pretty awesome — you should check out DSQL. And then I follow kind of all of the key-value store stuff too. Internally, we use FoundationDB quite
Starting point is 01:14:17 heavily. That's become very popular — well, it was nascent, and now it's becoming a popular key-value store. And of course Dynamo, Cassandra, and so on. Yeah. It's interesting on the free tier stuff, because you have RDS and Aurora, kind of the elephants in the room. You have PlanetScale Metal, which doesn't have a free tier.
Starting point is 01:14:40 Then there's so many competing for the rest, and Supabase is a pretty big one in that category. And then you have so many other ones. And man, that's just a tough area to be competing in. I like a lot of those folks and companies, but that's a tough area to be competing in. Yeah, it'll be interesting. It really is.
Starting point is 01:15:01 And I think with databases in general, as much apprehension as I have around rewriting software, I have around picking a third-party database provider. Yeah, it'll be interesting. But I love reading the papers, I love seeing the benchmarks, and of course I love the drama, so I do like following it. Yeah. Yeah, and it's interesting seeing all the OLTP-on-object-storage stuff that's happening, which I never thought would be a thing. And it's the same reason WarpStream did it. I mean, the answer is similar to what we did in Bottlecap, right? Not everybody needs read-after-write consistency, or even, you know, serialized snapshot consistency. Maybe they just need eventual consistency, where the eventuality is on the order of seconds. And all of a sudden it becomes a very interesting and compelling use case, where your costs go from maybe your number one cloud cost to a rounding error on your bill.
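The pattern being described looks something like this toy sketch — illustrative only, not Bottlecap's or WarpStream's actual code: acknowledge writes immediately, and let a background flush make them durable within seconds.

```python
import io, time, threading

class BatchingWriter:
    """Toy sketch: trade read-after-write for cost by buffering records and
    flushing one object per interval, so "eventual" means a few seconds."""

    def __init__(self, store, flush_interval_s: float = 2.0):
        # store is any callable taking (key, data); with boto3 it could be
        # lambda key, data: s3.put_object(Bucket="my-bucket", Key=key, Body=data)
        self.store = store
        self.flush_interval_s = flush_interval_s
        self.buf = io.BytesIO()
        self.lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def append(self, record: bytes) -> None:
        # Acknowledge immediately; the record becomes durable on the next flush.
        with self.lock:
            self.buf.write(record + b"\n")

    def _flush_loop(self) -> None:
        while True:
            time.sleep(self.flush_interval_s)
            with self.lock:
                data, self.buf = self.buf.getvalue(), io.BytesIO()
            if data:  # one PUT per interval instead of one request per record
                self.store(f"batch-{time.time_ns()}", data)
```

The whole cost win comes from that one comment: a PUT per interval instead of a network round trip per record, with replication bundled into the object store's price.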
Starting point is 01:16:36 And, you know, everything's fungible, man. Yep, yep, it's true. So yeah, I'll be curious, because that's a hard technology problem and a lot of people are working on it. I'm sure some cool stuff is going to shake out of that. It's going to take a little while, but yeah, some cool stuff there. Closing out here, I want to talk about maybe some career advice, just because I think your path is so interesting. You were a Rubyist.
Starting point is 01:17:06 You had done some deep stuff — you'd done this whole Elasticsearch deep dive and gave a talk at re:Invent, came to Serverless, did a bunch of JavaScript and some Go, and now you're doing the deeper, more systems-level stuff. I guess, what career advice would you give? I guess you already said the two pieces of career advice — at least, go a level deeper, right?
Starting point is 01:17:26 Yeah, go a little deeper. I can expand on that. If I'm going to give career advice, especially to myself in the past, the first is: calm down, it's going to be okay. And that's something I even tell myself now as a new dad — calm down, it's going to be okay. But that aside, your Java, your Ruby, your Go, your Rust,
Starting point is 01:18:05 your JavaScript, your Python — they're all abstractions on top of the same concrete underlying CPU and architecture. So if you learn a little bit about how that works, it can actually take you a really long way across all languages and kind of all dimensions. So yeah, go a level deeper and learn, and that's true no matter what your niche is. Learn how HTTP works — it's a text protocol. You should, at some point in your career, write an actual HTTP request in a text editor, send it, and try it. And realize — then I have a curl command where I'm like, oh look, I just have this text file and I can tell curl to
Starting point is 01:18:47 send it, and it's valid because it has headers, then a carriage return and line break — there's two of them, right — and then there's the body. Now HTTP/2, of course, is a binary, framed protocol, so there's a little difference, but you build on that. Similarly with networking: there's been a discussion of servers being blocked in Spain, because La Liga, the Spanish football league, is blocking IP addresses at a wholesale level that are associated with, I guess, illegal reproductions of football games. And because they're doing this at the IP level, and not at the domain level using the Server Name Indication that's passed over the TCP connection when you upgrade to TLS, it blocks anybody that shares an IP. And it's the same reason you were talking about with SES: why is it so hard to get out of
Starting point is 01:19:34 email jail? Well, it's very, very difficult to have a good-reputation IP address that sends email, so Amazon protects that very closely. And what I'm saying, as far as career advice, is that all of these things relate to the same principles, because IP addressing and IP networks make up the backbone of what we're all doing. So if you understand a little bit about it, it can actually take you a really long way. So learn how HTTP works, learn how TCP works, learn UDP — try it out, write your own servers and requests, and get down in the nitty-gritty.
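If you want to try that exercise, here's a minimal Python version. The request really is just text, and the `server_hostname` argument is the SNI field from the La Liga story — example.com is a stand-in for whatever host you point it at:

```python
import socket, ssl

# An HTTP/1.1 request is just text: headers, then a blank line (two CRLFs
# in a row), then an optional body.
raw_request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)

ctx = ssl.create_default_context()
with socket.create_connection(("example.com", 443)) as tcp:
    # server_hostname is the SNI field -- the hostname sent in the clear
    # during the TLS handshake, which is what IP-level blocking ignores.
    with ctx.wrap_socket(tcp, server_hostname="example.com") as tls:
        tls.sendall(raw_request.encode())
        response = b""
        while chunk := tls.recv(4096):
            response += chunk

print(response.split(b"\r\n")[0].decode())  # e.g. "HTTP/1.1 200 OK"
```

(Over plain port 80 you can do the same with netcat — `nc example.com 80 < request.txt` — and watch the response come back as text.)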
Starting point is 01:19:56 I think it's just a huge, huge step forward in your career that you can have compared to peers. Yep, yep. And one thing you said earlier — don't be afraid to dig into the library code — I think is so useful. When I was learning to code, honestly, what I did is I would answer questions on Stack Overflow in the Django
Starting point is 01:20:17 topic — things I hadn't done, but someone would ask a question, and I would go read the Django docs, read the source code, try to figure out how it worked, and try to explain it to people. And that just helped me so much with reading code and figuring out what's happening. And yeah, going a little deeper, I think, is super helpful.
Starting point is 01:20:34 So, I don't know. Yeah, if I could sum it up, I think that's it. And, you know, it's a brave new world out there — don't be scared of the LLMs. Use them; they're very helpful. For every second you're scared that one is going to take your job, you'll find a second where you're like, oh, this is totally falling apart. Yeah. Yeah, exactly. AJ, always great to talk to you. Thanks for coming on.
Starting point is 01:21:04 If people want to find you, what's the easiest place to find you? I guess for now, you can still find me on Twitter, at astuyve. You can send me an email — I'm just aj at datadoghq.com. You can find me on LinkedIn. Just search for AJ.
Starting point is 01:21:18 I post under my real name everywhere, including on places like Reddit. I think it just kind of builds your integrity — or sorry, it builds your reputation in a positive way. So just search for my name. You'll find me everywhere I'm active. Yeah, for sure. We'll link it in the show notes as well.
Starting point is 01:21:31 But yeah, thanks for coming on, AJ. Always great to talk to you. Absolutely. Thanks so much, Alex. Have a good one.
