Software Huddle - Rewriting in Rust + Being a Learning Machine with AJ Stuyvenberg
Episode Date: May 6, 2025. Today's guest is AJ Stuyvenberg, a Staff Engineer at Datadog working on their serverless observability project. He had a great article recently about how they rewrote their AWS Lambda extension in Rust. It's a really interesting look at a big, hard project, from thinking about when it's a good idea to do a rewrite, to their focus on performance and reliability above all else, to what he thinks about the Rust ecosystem. Beyond that, AJ is just a learning machine, so I got his thoughts on all kinds of software development topics, from underrated AWS services and our favorite databases to the AWS Free Tier and the annoyances of a new AWS account. Finally, AJ dishes out some career advice for curious, ambitious developers.
Transcript
One of the challenges we had with Go is that it's really not meant to be like an ultra lightweight runtime.
It's a very full featured runtime. You have a garbage collector, you have Go routines, you have all these different language features, which are very, very nice to work with, but they can get heavy.
Had you written any rust before going on this project?
No, not a line. And I think that's a big part of why I was so resistant.
Languages in general, I think once you really learn and know one, in my opinion it's been easy enough to kind of pick up a second and a third with time.
What did you think of the Rust ecosystem?
Go has an army of very talented developers
led by Rob Pike at Google.
All that said, Rust is pretty mature.
There's a lot of support, and you are gonna probably spend more time reading library code than you were expecting. I don't think that's a bad thing. You pick up a lot of the patterns and a lot of the idiomatic strategies that you're going to want to use anyway.
But it is different. I think it is harder than Go.
You're like a big fan of AppRunner, right? You're like one of the two people that loves AppRunner.
I can't recommend that anyone use AppRunner anymore.
One of my biggest complaints about AWS is it can be really hard to know which services are, like, soft deprecated.
What's up, everybody? This is Alex, and we have a great show for you today because AJ Stuyvenberg is on the show.
AJ is one of my favorite people. We worked together at Serverless. Now he's doing awesome things at Datadog.
He just wrote this sweet blog post on rewriting their Lambda extension from Go to Rust
and some of the reasons around that.
Very like measured approach to doing a rewrite,
which is a risky endeavor,
learning Rust and doing Rust.
So like lots of cool things there.
He also has, I think, just really good general AWS serverless ecosystem thoughts,
and so we go back and forth on some of those things
and always learn a lot from AJ and enjoy chatting with him.
So, you know, if you have any questions, comments, people you want to be on the show, feel free to reach out to me or to Sean.
And with that, let's get to the show.
AJ, welcome to the show.
Thanks, Alex. It's great to be here.
Yeah, man. I'm excited to have you because you're one of my favorite people.
You know, we worked together at Serverless
and had a great time there.
And it's just been super fun
to watch what you're doing since then.
You're now a staff engineer at Datadog,
AWS Serverless Hero, re:Invent speaker,
all kinds of expert in all these different areas.
I guess that's the high level stuff,
but for people that don't know about you,
maybe you wanna introduce yourself.
Ah, thanks.
That's actually, that's too kind. But yeah, I'm a staff engineer for the serverless group here at Datadog.
For the listeners before this, Alex and I did work together at Serverless Inc.
He's a pretty good boss, so if you get the opportunity, you should try and work for him.
I've been in the AWS and serverless space for a number of years now, since my first Lambda function in 2016 or 2017.
Our scope extends over into the managed data store services, things like DynamoDB, which Alex is quite familiar with, of course, and SNS, SQS, and kind of the other ancillary cloud products too, like Google Cloud Run, Azure App Services, that kind of thing.
So our group kind of encompasses that.
And it's a lot of fun.
I think it's a really cool space to be in.
It's been a really fun ride and we've had a lot of hits.
So I'm excited to talk about it.
Yeah, yeah, I know.
It's been really fun to watch your journey at Datadog because I'm just so
jealous of all the stuff you're learning because you're always just sharing all this interesting
stuff. And part of it is you're just a big time learner, always wanting to dig into stuff and
share it. But I think it's such a perfect fit for you because you get this amazing scope and scale
at Datadog and all the low level stuff that's going on there and you're just gonna see so much
data and it's like I'm just really jealous. I think it's cool what you're
doing. So I do want to talk about your journey at some point but I want to
start with like why we brought you on in the first place. We've been sort of
talking about this for a while and you just wrote this blog post about like
rewriting Datadog's Lambda extension. Rewrote it in Rust because rewrites are
always an awesome idea,
and Rust especially is a great idea.
I guess maybe give me, yeah,
tell me about that.
I guess I want to say first,
just some background on why this is so hard and scary,
not just the rewrite,
but the extension itself,
because Lambda is this dynamic compute environment
where it's just like spinning up on demand.
And when it does spin up,
usually it wants a response super quickly.
So you need to be able to spin up super quickly
because you're like sort of wrapping this,
the, you know, the customer's code.
You can never fail.
You can never fail no matter what happens.
But you also have like these extremely variable workloads
because there's like a million different languages
and everyone can bring their own language.
You know, the execution time can be a few milliseconds.
It can be 15 minutes.
That's like super varied.
Just like all this hard stuff that's going on
that just like scares the crap out of me
if I was doing any of this stuff.
But anyway, with that sort of background,
I guess tell me about this rewrite.
Tell me about the problem
and like in what's going on there, and we'll dig in.
Yeah, absolutely.
So I think you hit on a lot of the interesting notes
and to kind of intro it, the fact is Lambda
on the surface level seems very simple, and it is.
But to make something simple,
you have to solve a lot of hard problems
and then kind of give people pretty firm boundaries.
And a lot of the things I think people complain about
with lambda are the result of
hard-won distributed systems lessons.
So I'm talking about things like hard-capping the duration that a Lambda function can run.
It's like no more than 15 minutes. And then when you put a limit on that, that process is killed at that time.
Whatever you set, maybe it's five seconds or something.
Capping the amount of memory, and with it in Lambda, the amount of vCPU cores that a single application can consume. In a typical load-balanced workload, a typical big-server kind of infrastructure, what you have is a load balancer and then a number of worker servers behind that load balancer. And that load balancer is kind of sitting in front and receiving all the requests, all the traffic. And it will not send a request to a server if that server is not responding to health checks. Inside Lambda, there's nothing like that sitting in front of the extension: it's right in the hot path, so every millisecond we spend is a millisecond that the customer feels, and cold starts are frequent.
So that brings us to this very specific rewrite in Rust. Initially, the Lambda extension was built in Go.
It was a fork of our main Datadog agent.
I think Go is generally very well suited to Lambda.
I would suggest if you're looking to write a Lambda extension, explore Go and Rust. One of the challenges we had with Go is that it's really not meant to be, like, an ultra-lightweight runtime. It's a very full-featured runtime.
You have a garbage collector.
You have goroutines.
You have all these different language features
which are very, very nice to work with.
But they can get heavy.
And especially when the code base gets very large,
you end up with just like it's much more difficult to take
a really large thing and pare it all the way down
to the core essentials
versus start from the beginning and go up.
At some point in the last year, we were exploring all the different paths we could to make the Go agent work in Lambda
and we weren't able to hit our performance goals.
We were ripping out anything within an init call in Go that was blocking, and eventually we started looking at Rust. And the big, I think, big benefit there is of course, like, you get memory safety too. So it's like two things, right? Get out of the hot path as fast
as you can and get some memory safety.
Yeah. Yeah. And you're talking about like, you know, all these being on the hot path,
it's something your customer feels. And I'm guessing that people are like less tolerant
of, like, third-party-introduced slowness. If I wrote this crappy code and it slows down our thing, it's like, well, you know, I have other priorities. But if I'm pulling in this thing into your product and then it starts getting slow, nobody wants to have that.
You mentioned that the original one was,
there's the Datadog agent that runs on a server in the background collecting metrics and forwarding them along.
Basically the Lambda extension was a forked version of that?
Yeah, basically. We were stripping things that we didn't need out as best we could. We had explored using Go plugins, which are a feature of the language which allows you to ship separate binaries and then load them.
It's similar to, like, dynamic library loading, like dlopen.
The downside there is that one of the goals was to get the size of the compiled binary down as low as we could. That impacts your cold start time.
When you do that with Go, it doesn't know which features of the standard library are going to be included in any of the plugins you load.
As a result, every single plugin includes a full copy
of the standard library.
That was where we abandoned that project
and started looking at a rewrite.
Also, if the extension crashes inside the Lambda sandbox, the whole function crashes right there. And that's something that the main agent on a server doesn't contend with, because it's either a totally separate pod, or it's using HTTP to network and receive telemetry data from other machines or other pods. Inside of Lambda, that's not the case, so it was necessary for us to have, like, a crash-proof system, or as much as possible.
Gotcha. Yep.
And what was sort of like the timeline on this?
Like when you release the initial extension, was it right away like, you know, this is
good enough and this can work, but we realized there's some cold start issues and things
like that.
And we spent a lot of time improving that in Go as best you could.
And then at some point you were just like, hey, we're not going to hit what we want.
Or like, what did that sort of look like?
Well, that's a great question.
I think with the benefit of hindsight,
maybe this story is told a little differently.
But I think it's important to call out
that when extensions were launched in 2020,
cold starts over the board were still very bad.
Like across the board, cold starts were a problem everywhere.
And Lambda has continued to invest engineering time
in improving that, including things like Java Snap Start
or the container caching and loading we've talked about in the past. And all these were important developments. On our side, we were investing our best engineering time in improving that,
doing everything we could to get that cold start down. But a certain element of gravity exists when you have all these thousands of lines of code and trying to remove them
while still being completely compatible is a very difficult challenge.
And that's where we landed on this. Well, we have the API boundaries that we need for this.
And actually, I think if we really talk about ideal state from the ground up,
we don't really need a garbage collector as a Lambda extension.
And a big part of that is because you have one function execution process per extension.
It's one-to-one in the sandbox, which means I don't have to balance fairness
across a bunch of different clients in the way that the main Go agent does. So when we looked at that carefully, it was like, well, we're gonna have to rewrite it all anyway, basically.
Right, to like, to make the optimizations necessary
to have the cold starts we need,
we would basically rewrite the whole thing.
And then we're like, well, we don't need
the garbage collector.
So why don't we just manage our own memory?
And Rust became a very apparent tool.
Yep, interesting.
Okay, so you actually thought,
hey, potentially we could rewrite this in Go
and make it significantly
faster, but probably not to the Rust levels that you got down to, but faster than, like,
sort of the original Go agent that you had.
Yeah, exactly.
And we had gone a really long way down that path.
I think the binary we were producing was like tens of megabytes smaller than what the agent at the time produced.
We had stripped out a ton of things which were kind of irrelevant within Lambda. And at the same time, it was very difficult without kind of rewriting the core of the aggregation and the client fairness and the balancing and all these different components. And then we were like, well, we're
gonna have to rewrite this anyway. And if we don't need a garbage collector, and we would rather have
crash safety, there's better tools for that. And it's been a great experience.
I think it took us, we started the process in March 2024.
We were live for beta users by, oh man, it was like late summer, late August 2024.
We were live for beta users and then November at reInvent we went GA.
So it was actually pretty quick.
Yeah.
And like, tell me about that rollout, especially like with the beta users. How did you find these?
Were these people that were having issues or maybe like clients that you're in touch with a lot?
And how do they even roll it out? Do they pick like some low value functions where they can at least
start testing it? Or like, what did that rollout look like?
Yeah, that really varies by the customer. So we did a number of things. We identified, first off, we have teams
inside of Datadog that rely on Lambda heavily for different features and different capabilities, and we ran it with them until we had metrics, logs, and traces working. We also had a number of customers that had voiced their dissatisfaction with the initial cold start time that we had. So we were pinging them and going directly and saying, hey, would
you want to try this new thing out? Would you want to try this next generation extension
out?
Yeah, of course, a lot of people were just willing to try it in staging. We asked them
to, obviously, try this in a very low risk environment, in a safe way, and then you can roll it out.
When it came time to release it into GA,
and in fact, actually, when we released it into beta,
the way we did this was kind of clever, I think.
The bottle cap process, the Rust process, boots first,
and it reads the configuration file in the environment,
and it decides if anything there is unsupported at that time. So we had a bare workload of, I think it was, metrics, logs, and traces for some runtimes, but not all.
And we were like, well, just to be safe, we're going to just boot into the main Go agent.
And we had that failover: we were able to boot the main Go agent,
and they were two different processes deployed in the same Lambda extension.
And the Go agent was dormant until the Rust agent booted it.
And we had that for months. Sorry. Initially, you had to also opt in.
So it would read the environment variable and then it also would make sure that you would opt into this next generation beta.
And then eventually we switched it and then you have to opt out.
And then the next step is going to be, of course, we're going to remove the opt out and then people are going to have to migrate to the next gen fully.
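To picture that boot-time failover, here's a minimal Rust sketch; the config shape, the supported-runtime and feature lists, and the fallback binary path are all hypothetical stand-ins, not Datadog's actual code.

```rust
use std::os::unix::process::CommandExt; // exec() replaces the current process
use std::process::Command;

/// Hypothetical view of the extension's config, read at boot.
struct Config {
    runtime: String,       // e.g. "nodejs"
    features: Vec<String>, // e.g. ["metrics", "logs", "traces"]
    opt_in: bool,          // the beta-era opt-in flag
}

fn supported(cfg: &Config) -> bool {
    let runtimes = ["nodejs", "python"]; // illustrative subset, not the real list
    let features = ["metrics", "logs", "traces"];
    cfg.opt_in
        && runtimes.contains(&cfg.runtime.as_str())
        && cfg.features.iter().all(|f| features.contains(&f.as_str()))
}

fn main() {
    let cfg = Config {
        runtime: "dotnet".into(), // unsupported in this toy example
        features: vec!["metrics".into()],
        opt_in: true,
    };

    if !supported(&cfg) {
        // Fail over: replace this process with the dormant Go agent bundled
        // in the same extension (path here is made up for illustration).
        let err = Command::new("/opt/extensions/datadog-agent-go").exec();
        eprintln!("failed to exec fallback agent: {err}");
        std::process::exit(1);
    }
    // ...otherwise, continue booting the Rust extension...
}
```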
Yep, gotcha. And okay, so you mentioned at first it was only compatible with a few runtimes. I guess, like, what did that look like? Do you have to write
sort of custom instrumentation for every runtime or like how does that even work with all these
different runtimes and especially bring your own runtime? Like what does that look like
to actually write that instrumentation? That has been the joy of my career these last four
years here at Datadog.
Each one is very specific to that language; you have to know about the encodings supported and the encryption and things like that.
So we started with the most popular runtimes.
What about, like, talk me through, I don't wanna say the political, but like the organizational aspects of it, where...
Oh, it's political.
I think that's fair.
Well, yeah.
I mean, like, I don't, yeah.
But like, I know like in the post,
you mentioned how like, hey,
I think you're a pretty wise engineer and you're like,
hey, I'm kind of resistant to rewrites,
but at some point you became convinced that like,
hey, this is actually a good idea to do this in Rust.
Who did you have to get on board?
Like what kind of process was that like to convince,
I don't know how many effective people that like,
hey, this actually is a good idea, make that case to them.
Did you have a lot of control and autonomy in that
on the serverless team or did you have to convince
a lot of outside teams or what did that look like?
Yeah, that's a, I mean, it is political.
That's a great question.
This goes back to the namesake of the project. There's a talk about building better mousetraps, and it's all about building the organizational buy-in to pursue large, ambitious projects with, like, a high chance
of failure. And one of the things they talked about was
making sure that you compare the scale of the thing you have to
the thing you need to build. And the point that Marc brought
up was ostensibly a bottling factory that manufactures and
bottles like beer at the scale of millions and millions of
bottles a day does the exact same thing as a hand bottle capper where you can just, you know, take a bottle, put a cap on, and crimp the bottle. They have the same functionality, but they have vastly different scales. And if you have a new problem and you're looking to pursue a rewrite, that's one of the things to consider. And I think that was a big story that we were telling.
We have this fork of the Go agent.
We've effectively already created a second binary within it.
The idea that we have one shared code base is already sort of a myth.
It's not able to serve our purposes for these various reasons that our customers care about.
Obviously the company wanted us to explore every path and make sure that we had crossed our T's and dotted our I's.
I think we had done that, and at that point,
once we had a really good internal document laying out the vision for BottleCap,
everyone was really encouraging about it. Once they saw our perspective of, look, our scale is very different, and the actual deliverable features are very different.
So we should just have a different thing for this use case. And that, I think, was able to sell the
vision. Yeah, interesting. You talked about having an interesting demo that sort of proves that point. How long did you have to spend on just that, making something workable enough to prove that out without, you know, building the entire thing?
How far down the road did you have to go there?
Yeah, a demo is worth a thousand words.
If anyone is looking for a career cheat code, I think the two things I offer is always go a level deeper than your peers,
always trying to learn the behind-the-scenes reasons of why something works the way it works. And the second is have a good demo; do the work to build a proof of concept along with your document.
A demo blows people away. I think they were talking about, like, a P99 query time: you can't just stack a hundred little 10-millisecond improvements and get that P99 down, you have to fundamentally rethink and design for that use case. And that was key to the fundamental design of the project: every
single PR that we merged, we benchmarked every single PR even to this day, before we do every
single release, we look exactly at that not only the cold start time, but the runtime duration,
like do we add overhead? Do we add more overhead when we're shipping data? Do we add, you know,
additional network bytes? We were testing different compression algorithms as well for shipping data. And a huge part of that work is the perspective of performance first at every step of the way. And that's how you get these,
you know, you have to hold performance really close to the heart. And that's a feature. It's
not, you know, not everyone needs it. But when you do need it, the only way to get it is to take a
microscope to every step of the way. Yep. Yep. That was like one of my favorite parts of that.
I guess like, what were you using to even measure all those different things to make sure,
hey, on every pull request, we don't regress? What does that look like for someone
that has not done that deep of performance work before?
Well, we use Datadog.
Yeah, there we go, there's a pitch.
This episode's sponsored by Datadog.
No, I'm just kidding, I'm just kidding.
No.
No, no, no, it's true, it's true.
Datadog has this whole suite of observability
and performance tools.
And one thing that we used heavily was the native profiler.
And if you haven't used a profiler like a CPU or memory profiler,
you should absolutely try it. Pick one. I don't care which one.
But I love hearing stories about people who have been experimenting with profilers
and found bugs in their programs.
And that's a big challenge in Lambda.
We have profilers that work in every major language and runtime for Lambda.
So what I did was I rebuilt Lambda on top of EC2. And I used the Lambda runtime interface emulator,
and then I wrote a custom layer for Lambda's telemetry API.
And then I created these hell tests where this function would allocate massive, massive strings
and then send them to the extension or create many, many, many spans
or huge amounts of metric context
or metrics and push them through the pipe.
So what I did was I created these hell programs,
these hell Lambda functions that would push the extension to the limit.
And then every time we were testing a major change...
Wait, you were using what?
I ran it outside of Firecracker because I needed access to these system calls.
And from there, I created all of these test cases
that I could run the program through before we shipped big changes.
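To make the "hell test" idea concrete, here's a toy Rust sketch that floods an extension's DogStatsD listener (8125 is the standard DogStatsD port) with unique metric contexts and oversized allocations; the volumes and metric names are arbitrary, not the actual test suite.

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    socket.connect("127.0.0.1:8125")?; // the extension's DogStatsD port

    // Push a huge number of unique metric contexts (tag sets) through the pipe.
    for i in 0..1_000_000u64 {
        let metric = format!("hell.test.counter:1|c|#iteration:{i}");
        socket.send(metric.as_bytes())?;
    }

    // Allocate a massive string to stress memory handling downstream.
    let big = "x".repeat(50 * 1024 * 1024); // 50 MiB
    std::hint::black_box(&big);
    Ok(())
}
```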
And one simple example was, when we started the project, we just used native threads, or hardware threads.
Tokio is the popular asynchronous runtime.
It gives you features similar to what you would get in Node.js
where you can kind of async and await.
Most of the libraries you use in Rust are going to be Tokio-compatible. And when we added that, we wanted to be very sure it was a net win.
So we were profiling it on these various CPU sizes, various memory sizes, and making sure
that it wasn't causing us to bog down.
And only when we were able to prove that with a profiler and show that the runtime wasn't
spending unnecessary time moving tasks around threads or that sort of thing, then we were
comfortable releasing it.
When you say release it, like bring Tokio in,
is that what you're saying?
Correct.
Gotcha, so you wrote it originally without Tokio,
sort of native stuff, built this performance baseline,
and then you're like, hey, bring this in,
see if it makes any changes, it does not, we're fine here.
Yeah, and that's still kind of a fluid thing
because the efficacy of Tokio is going to kind of vary,
depending on the number of vCPUs you provision for your Lambda function.
So in some cases it's better probably to have hardware threads and then just have it dispatch a different thread for everything
when you have extra CPU overhead.
But especially if you're running in those 128 megabyte Lambda functions, those really tiny ones with like an eighth of a vCPU or whatever, then Tokio becomes really helpful because it's able to kind of swap stuff out when needed.
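As a rough sketch of that trade-off, this is one way you might size Tokio's worker threads to the vCPU share a Lambda memory setting buys you (roughly 1,769 MB corresponds to one full vCPU); the heuristic is illustrative, not the extension's actual logic.

```rust
fn main() {
    // Lambda allocates vCPU proportionally to memory; derive an approximate
    // core count from the standard memory-size environment variable.
    let mem_mb: usize = std::env::var("AWS_LAMBDA_FUNCTION_MEMORY_SIZE")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(128);
    let workers = (mem_mb / 1769).max(1); // ~1769 MB is about 1 vCPU

    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers)
        .enable_all() // IO drivers + timers
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async {
        // ...the extension's event loop would run here...
    });
}
```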
Yeah, yeah, interesting. Had you written any rust before going on this project?
No, not a line. And I think that's a big part of why I was so resistant.
Not a line. That's amazing. Yeah.
Yeah.
Yeah, it...
Languages in general, I think once you really learn and know one, it's been easy enough, in my opinion, to kind of pick up a second and a third with time.
I think this has been a really fun time because LLMs became very popular.
So, like, we had access to Copilot and ChatGPT, and that became super useful because I had a Go program that kind of worked, and I was able to copy and paste 1,000 lines of Go. An LLM is a really great way to get a quick answer for a question like, is this correct?
Rust also has an incredibly good compiler. So in your editor, it will kind of just beat you over the head until the program is correct.
And I think that was also really useful because, that combined with an LLM, you can generally, even without a lot of experience, produce technically correct software.
Yep. Yep. Interesting. Okay. You mentioned that, like, picking up, like once you know a language,
you can pick up other ones. And I, I think of you as like a Rubyist first, right? Is that,
is that right? And then like when you joined a serverless, like a lot of JavaScript type stuff,
Before going to Datadog, had you done any, like, you know, more compiled, very strongly typed languages?
Um, I've written a little bit of C++ previously and I have kind of recurring nightmares from it.
Okay.
Um, but I think my first kind of large experience with, like, distributed systems was at Serverless, with backend Lambda functions written in Go. And that was kind of my first... I would even call that systems programming.
You just use some of the same systems programming techniques and approaches.
We're not building a database here, but we do want safety.
We do want to make sure we're using the correct locking mechanisms and we're not spending too much time waiting for a mutex,
for example.
So those principles still apply.
But yeah, I did, I spent a lot of time reading books and trying out programs and both Go
and Rust have really good benchmarking, like micro benchmarking tools.
So if you have like 12 lines of code and you want to say like, is this better than that,
it'll give you this whole spectrum of tests.
I mean, like Go bench, I think, is excellent. That matters when you're trying not to give up too much latency, because every line of code is giving up a little bit. And you need to be very, very careful about adding new software on top of existing software without potentially blowing your performance. So it's really important to create those micro-benchmarks on each PR, on each release.
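In Rust, the usual micro-benchmarking tool is the criterion crate; a minimal sketch of the per-PR style of benchmark being described might look like this, with a hypothetical function under test.

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical function under test: join tags into one string.
fn join_tags(parts: &[&str]) -> String {
    parts.join(",")
}

fn bench_join_tags(c: &mut Criterion) {
    let parts = ["service:web", "env:prod", "version:1"];
    c.bench_function("join_tags", |b| {
        // black_box keeps the compiler from optimizing the work away.
        b.iter(|| join_tags(black_box(&parts)))
    });
}

criterion_group!(benches, bench_join_tags);
criterion_main!(benches);
```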
Is it the same for the lifetime of a string? We've learned a lot from React and Redux, and we have these kind of more functional processes that go into our thinking nowadays. And I think a lot of those are just exposed for you, bare, with Rust. It's like, oh, you want this string to be usable over here? Well, now you have to copy it.
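That friction shows up in just a few lines; a minimal sketch:

```rust
fn consume(s: String) {
    println!("consumed: {s}");
}

fn main() {
    let tag = String::from("service:web");
    consume(tag);           // `tag` is moved into the function...
    // println!("{tag}");   // ...so using it here would not compile

    let tag2 = String::from("env:prod");
    consume(tag2.clone());  // explicit copy keeps `tag2` usable over here
    println!("still usable: {tag2}");
}
```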
What did you think of the Rust ecosystem? Are there big gaps there? Does it feel pretty full featured and fleshed out?
How did that feel?
That has been tough, if I'm totally honest.
Go has an army of very talented developers
led by Rob Pike at Google.
And every tiny corner of-
Is he still working on Go at Google?
I can't remember.
I'm not actually sure.
Okay.
But the language is mature enough.
Yeah, for sure, yeah. And it's clear. There's an army of people working on it.
It's like they decided, we're going to spend $300 million a year on this language until it's perfect. And I found with Go, any single thing we needed, like,
we wanted to use Unix domain sockets,
Go has them perfectly implemented.
And the interface works and it works with every abstraction. With Rust, you have to pull in different crates and kind of tie them together, and then you might be on this older version. So it is more difficult.
That's for sure. There were many times I wish that I had like the ecosystem of Go.
But I think all that said, Rust is pretty mature.
There's a lot of support and you are going to probably spend more time reading library code than you were expecting.
I don't think that's a bad thing. You pick up a lot of the patterns and a lot of the idiomatic strategies that you're gonna want to use anyway. But it is different. I think it is harder than Go.
Yeah, interesting. Is Rust, like, mature as a runtime? Like, are they still changing stuff version to version, or is it pretty stable and mature that way?
It's extremely stable.
Well, would they have, like, a Go 1-to-2 type thing, or...
They could, and I think that would definitely be a battle.
I don't expect it to change that much.
I think due to the more limited nature.
Again, the runtime doesn't handle that much for you.
Not really in the grand scheme of things.
You have to do your own memory management.
So there's less surface area for them to break.
But of course, that can always change. Yep.
One thing we did benchmark: fetching secrets from AWS, either from Secrets Manager or KMS or what have you.
To do that, you typically would use the AWS SDK.
We imported it and we found immediately it was our largest crate,
or the largest dependency we had by far, was the SDK. It impacted something like 10 or 20 milliseconds of cold start time.
Now making these API calls is pretty straightforward in a sense.
They're old, like, SOAP-style APIs, so you pass the arguments you want in the headers,
and you then sign those headers with this AWS SIGV4 or SIGV4A, which is the newer version.
And then you can make the request. I did have some issues with it here and there.
I think I had not initially added support for Java SnapStart, because the credentials can expire,
because they are stored in the snapshot.
But for the most part we've been pretty happy with that decision. And the biggest reason is we needed two API calls and we didn't need the entire SDK.
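For reference, the SigV4 signing he's describing boils down to a chain of HMACs over the date, region, and service; a hedged sketch using the hmac and sha2 crates, not Datadog's actual implementation:

```rust
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

fn hmac(key: &[u8], msg: &[u8]) -> Vec<u8> {
    let mut mac = HmacSha256::new_from_slice(key).expect("HMAC accepts any key length");
    mac.update(msg);
    mac.finalize().into_bytes().to_vec()
}

/// Derive the SigV4 signing key: HMAC-chain the secret through the
/// date, region, and service, per the SigV4 spec.
fn signing_key(secret: &str, date: &str, region: &str, service: &str) -> Vec<u8> {
    let k_date = hmac(format!("AWS4{secret}").as_bytes(), date.as_bytes());
    let k_region = hmac(&k_date, region.as_bytes());
    let k_service = hmac(&k_region, service.as_bytes());
    hmac(&k_service, b"aws4_request")
}

fn main() {
    // Example values; the final request signature is HMAC(signing_key, string_to_sign).
    let key = signing_key("EXAMPLE_SECRET", "20240301", "us-east-1", "secretsmanager");
    println!("signing key is {} bytes", key.len());
}
```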
And so is that just like a function of, yeah, you have different requirements than the AWS SDK team has
where they need to support everything and because of that they have to make some different choices and like a
little bit of latency and size is acceptable for them. Whereas like, you know, you don't need that much and it's
easier to do it yourself?
It's been painful, and the generated SDKs are simply not as performant as hand-created, bespoke SDKs for every language. But Amazon, I think, really wanted a way to have,
you know, an automated release.
They could kind of write a bunch of metadata and then push this release out for
everybody. And, um, it would be the same for every runtime.
And that was the priority. So for us, it's just, we have different priorities.
And I think-
Interesting.
Go ahead.
I was gonna say, I think you'll find a lot of agreement
with different people who talk about,
why do we pull in so many dependencies,
just take the parts you need.
I think that's kind of a growing chorus
in the software world these days,
which is stop just grabbing random node libraries
or Rust crates or Go packages,
like just take the little bit you need
and go from there.
And I think that's kind of a part of the ethos we took here,
was like, we really just need this slice
and we don't really need the 10 or 20 millisecond hit,
especially because not all users use those API key storage
mechanisms, right?
Like some people, whatever, uploaded it
with an environment variable, which is encrypted at rest,
or chose other options.
And as a result, they didn't need this.
And we didn't want them to kind of pay the overhead as well.
Yeah, interesting.
Were LLMs useful at generating that particular code?
Do you remember?
I think that is a yes, because kind of like the Rust compiler, when you're dealing with signing a request with AWS SigV4 and kind of making the request, it only works if everything's perfect. So, you know, I don't remember it being that helpful for that portion. But I do, like I said, I think I remember,
LLMs were most helpful when we're parsing large config files
that we needed to write custom deserializers for this Rust
library called seerday, which stands for serializer deserializer.
And those get a little gritty.
Well, hairy.
Yeah, and there's also a lot of examples of it.
So LLMs became like a really natural fit.
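As a flavor of what those custom deserializers look like, here's a minimal serde sketch that accepts a config field arriving as either a bool or a string; the struct and field names are hypothetical.

```rust
use serde::{Deserialize, Deserializer};

#[derive(Debug, Deserialize)]
struct ExtensionConfig {
    #[serde(default, deserialize_with = "bool_or_string")]
    logs_enabled: bool,
}

// Accept either a YAML/JSON bool or a string like "true"/"false".
fn bool_or_string<'de, D>(deserializer: D) -> Result<bool, D::Error>
where
    D: Deserializer<'de>,
{
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum Raw {
        Bool(bool),
        Str(String),
    }

    match Raw::deserialize(deserializer)? {
        Raw::Bool(b) => Ok(b),
        Raw::Str(s) => Ok(s.eq_ignore_ascii_case("true")),
    }
}

fn main() {
    let cfg: ExtensionConfig = serde_yaml::from_str("logs_enabled: \"true\"").unwrap();
    println!("{cfg:?}");
}
```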
Yep, yep. I had a similar, like the reason I asked about the LLM is I had something similar recently
where I'm receiving emails on Amazon SES and you can choose to have those encrypted when they
come in and put them to S3. And if you do that, they encrypt it client-side with the S3 Encryption Client, is what it's called. And then you decrypt it with that,
but they only have the S3 encryption client
for like six languages and JavaScript is not one of them.
And it's like, I didn't want to have a,
like one other, you know, function in Go
or something like that that's using that.
So I just like went to the Go library.
I found out which sort of encryption algorithm
they're using.
I went to the Go library and I said, make this, but JavaScript.
LLM just made the whole thing and you know,
you test it and it works and yeah, good to go.
So it's like.
See, and that's like a great example
of going a level deeper on a problem, right?
Cause like a number of people would either
just give up there or they would maybe,
whatever they would like add a sub process
or like call out to, you know, a C program
or like a binary and make it do the work.
And instead you're like, well,
I'm just going to read the code and then I'm going to figure out how to generate it myself. And yeah, of course, like, the LLM helped you, but LLMs
are at their best when you give them a very specific direction, you kind of know the shape of the answer you want.
But I do find it very useful and we did use it heavily throughout this process.
So that was a lot of fun.
It's something that's like, yeah, you could do it a lot faster and you can see if it works.
I think projects like BottleCap, where you have this purpose-built thing for a specific niche, are going to just grow. That's the expectation, is that this thing should be as minimal as it can be across any
dimension we care about.
It needs to be kind of bespoke for that.
That's kind of tough because SaaS businesses were created around this guise of, well, you'll
write the software once and you sell it over and over again with zero additional marginal
cost of goods sold.
I think that is going to start to change a little bit. We've had customers tell us: we considered this and we ruled it out early on, but now with this new performance improvement,
we signed up and it's been great.
Yep, that's awesome.
Tell me more about the bespoke software stuff.
Do you think we'll see that in,
this is like a developer tool,
which is like one sort of thing where we're sort of used to
using LLMs and, or we know our specific requirements quite well and evaluate those
things. I guess like, do you see a lot of bespoke software happening in B2B SaaS or
prosumer or even consumer type stuff? Like do you think software is going to be like
pretty bespoke across the board for like all types of consumers going forward?
Yeah, I do think so. I mean, obviously, the beating heart of systems,
the backend distributed systems are going to be limited in what they can necessarily adapt to or clone.
Although I do think the cost of running software is approaching a commodity price.
These managed services are getting close to the limit of what you can really do in a general sense.
But I do think that as far as like people using your libraries or interacting with your APIs, there's going to be this expectation of it being pretty highly customized.
Now, whether that happens on your end at the API level, or via, you know, like a model context protocol kind of interaction, where an agent on their end takes your API, consumes it, and modifies it in the way the customer wants.
I'm not sure. I think it's going to be a little bit of both.
But I do think the age of, oh, you want to interact with Stripe and here's their crazy API, well, Stripe is a great API, a better example is something like Zuora, that's going to change. People are going to want more of it tailored to them.
Yep. Yeah. You know, the interesting thing is, like, I think we can create software
a lot faster. It's hard for me to imagine like a world with a hundred times or a thousand
times as much software as we have now. But I think that is like coming. I just can't
really picture what that's what that's going to look like. But yeah, I think it's going
to be yeah, I think it's going to be pretty wild.
I do think it'll be the dimension of personal software
is one aspect of that, where you're going to be able to customize apps on your phone or on your
computers a little bit better. The journey, if you've watched or you've
experimented with any of the home brew labs or home hosted solutions has come a really long way.
The different toolkits for starting up your own web server
or running a server on a bare metal box in your basement
have gotten way better.
I'm thinking about things like Tailscale,
which have just been incredible,
where if you wanted to create your own VPN previously,
you have to deal with open VPN
and all these very challenging tools to set up and run.
And now it's just two auth clicks, one on your phone and one on your home server, and then you can connect securely anywhere in the world. So "bespoke" maybe isn't the right way I think about it. I think about it more like, if you compare binaries of software between each other, there's gonna be, you know, 80% variance, versus previously it was configurable but very rigid.
Yeah, yeah, interesting.
I guess, so on that same point of AI usage,
like how are you using AI day to day?
I think there's like a broad spectrum of, you know, Copilot tab-complete, there's Cursor, more agent mode, there's Claude Code.
I guess like where, what are you sort of using day to day?
I'm trying it all.
I'm just trying to stay super curious.
Again, like my big ethos is like, kind of try everything
and go a level deeper on everything.
So my main workflow is still, I can't get out of Vim.
I'm still on Neo Vim.
I've been a Vim user for a couple of decades
and now everything else feels slow.
Once you know Vim motions, everything feels slow.
So I use VS Code for a little bit.
I do have Cursor, I have used Cursor.
There's a version of a Cursor-like experience
called Avante for Neovim that I'm really pleased with.
I'm having a lot of fun.
I also like Claude Code a lot.
That's been a very cool tool. It allows me to chat back and forth with it. But the really hard stuff, like, write me code that drives the number of memcopies to a minimum, like, whatever it is, however many times
you have to copy the string, make it as little as possible.
And introduce lifetimes and if you need to,
static lifetimes or even like a memory arena
to hold some data.
And like we haven't had to do that yet,
but obviously if you throw that at an LLM,
it totally crashes and burns right now.
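The "minimize copies" idea he's pointing at looks roughly like this in Rust: borrow slices out of one buffer instead of allocating a new String per field. A simplified sketch, not the extension's code:

```rust
/// Parse "key:value" without copying: the returned slices borrow from
/// `input`, so they live exactly as long as the original buffer ('a).
fn split_tag<'a>(input: &'a str) -> Option<(&'a str, &'a str)> {
    input.split_once(':')
}

fn main() {
    let raw = String::from("service:web");
    if let Some((k, v)) = split_tag(&raw) {
        println!("key={k} value={v}"); // no allocation, no memcopy
    }
}
```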
Yeah, oh yeah, yeah, for sure.
Like that sort of hard thing.
That's I'll be curious like when that becomes truly doable.
But like right now, I have a few full-stack apps I'm helping with. And there it's just, like, you have these patterns.
So it's like, Hey, go make this new data access pattern.
Now write the route for it.
Now write like the front end service to consume it,
now write the display logic,
and it can just do that so easily
once you have a few patterns in your application.
It's like, yeah, it's not super hard code,
but it does save you a lot of time
and doesn't drain you from the monotony
of that sort of stuff, which is fun.
I do think it's been an absolute boon for application, like web application, development. And I think a big part of that is there's just so much of that code out there, which is fine. And as a result, it's sort of regressing to the mean, and you can one-shot it.
Now at the same time, I really, just yesterday,
I fixed a bug where I used the LLM to parse a YAML file.
And it inadvertently broke an environment variable.
And I didn't know it.
And it was my own gap.
I didn't have a test for it.
But the LLM was like, oh, no, this is it.
This is all you need.
And I tried it, and it worked.
So I was like, OK, we'll do this.
And then I get a bug report.
And yeah, sure enough, it turns out like,
oh, this no longer processes the environment variable
version of this correctly and it needs to.
And it's just, yeah.
So.
It is hard because like you start to trust it.
You're like, man, it nailed that last thing.
This seems pretty similar.
Like in that same wheelhouse, like I'm sure it'll do it.
And it's like, you scan it over,
but you're not looking quite as closely as you should. So it's like, you gotta figure out, you know, test coverage slash eyeballing slash manual clicking around, like what you sort of need for it. And then based on your level of seriousness, like how important it is if there is an error and things like that, it's gonna vary.
So at the risk of sounding too old, I think it's a boon for early career software developers
because you basically get a mid-senior engineer
with unlimited patience that you can just ask questions to.
And I think for me, I was always getting the feedback
that I was too needy early on in my career
and asking for help too much and not spending enough time
trying to run things down myself. And I think the LLM is such a great tool for that. But you do still get, I mean, I'm sure you've experienced this, and I think everybody has, where it gets caught in a loop where the initial approach it chose wasn't correct, for whatever reason.
And maybe it was my fault, I didn't give it enough context. But then it runs down an unsustainable
path where it's like, deleting code or you know, leaving methods around and that sort
of thing. And I think that we're just not there yet but it is a really exciting time and
I'm enjoying it, and, you know, I don't really think it's going to bring about, like, mass unemployment. I think it's just going to create kind of more demand for software.
Yep yep so you're not in the AI 2027 camp?
I mean I guess I don't have enough information to make that decision if we're being totally
pragmatic. I think it's obviously a possibility, especially if it gets to the point where it can do all kinds of independent thought. Then, of course, yeah, that's going to change my opinion on it.
All of the companies that are working on this technology that are out there are expanding,
they're taking what they have, like the models they have, and they're deploying them in different
modalities and different integrations and that sort of thing. They're not saying, okay, we have like 20 years of advancements
ready for you next week. Right. And that sort of starts to indicate to me that we may be unable to solve some of these really core problems, but that doesn't
mean the tools are useless. It's super useful.
Yeah, it's super useful. Yeah, I completely agree. Okay, I want to switch and do some AWS, or AWS slash infra, takes. Mostly AWS stuff, but I feel like you always have
some good stuff on that.
So first of all, I have a few that I've heard from you
before, but I wanna just hear you defend publicly.
Number one, you're like a big fan of AppRunner, right?
You're like one of the two people that loves AppRunner.
I guess, are you still, are you, you and Jeremy Daly
and like Chris Munns, those are like the three
that I think of for it. So like, I guess, are you still there?
And if so, sell me on it.
I can't recommend that anyone use AppRunner anymore because I think just as evidenced
by the entire lack of changes in the last five years, I don't think they're working
on it. I don't have any special information to indicate that, but that's just my gut feeling
on it.
So I don't want anyone to wake up tomorrow
and they're like, oh, it's going end-of-life.
Like you have to move everything again.
What did you love about it?
Oh man, I think Amazon was able to somehow solve
a two-pizza team problem for the first time
in the history of the company.
And they were like, we're gonna solve this problem
vertically end to end.
And it was a really cool experience. If you hadn't used AppRunner, it gave
you this really managed experience where you could
connect it straight to GitHub, it would build a container out of
your application, and then it would run it as a like a
managed container service. So more similar to Fargate, where
you have kind of a long running container, but it could scale up
for you, it could scale down for you, it could like turn it off.
It never really scaled to zero super well, but it had a lot going for it. But then customers would come and say, we want this experience, but for Java 8, and we need to be compatible with these SOC 2 or bank regulations or whatever. All these different hard problems to solve, where they weren't quite sure the addressable market was there, so they kind of ignored some of those really hard problems.
The product was good, it was fun to use. But they had intractable problems that they had to solve.
For example, if your application failed a health check,
it would roll back to the previous version, but it wouldn't roll back your CloudFormation deploy.
So it wasn't like the full feature you get from Lambda, where there's some indication on CloudFormation; it just didn't do that for AppRunner. And there's just a number of cases like that where things kind of languished
and they weren't able to solve it, unfortunately.
Yeah.
I think sort of on this note,
one of my biggest complaints about AWS
is it can be really hard to know
which services are like soft deprecated, you know,
or like set up for deprecation.
And it's like, of course they can't come out and say,
we're deprecating it, but it's like, man, like,
you're trying to make decisions on what things to invest in.
And it's hard to know what they're investing in, you know?
And it's nice in some sense that they're starting
to deprecate some stuff and be a little more clear on that.
But even then, yeah, you have to read the tea leaves and try and figure out, oh man,
they haven't done any updates
for app runner in a long time.
There might not be a team there.
Well, not even that. Like, AppRunner has gotten so bad that the creator of Java, who was a distinguished engineer at Amazon, flamed the AppRunner team in a GitHub issue and said, you're on Java, I think it was 17 or something, like, why are you so far behind? Why is this project so far behind? And this is the guy that wrote Java.
But yeah, I agree, you do have to kind of read the tea leaves.
And like everything in life, if you go with the flow, it's a little easier for you.
So you're going to find more success on EKS or ECS or Lambda than you are on like an app runner. So I think, you know, write a container,
use a container based Lambda function
when you're outscaling that, move it to Fargate
and then be happy.
That was my next question.
Same thing with container images over zip files.
You've been team container for a while now.
And I've been right.
Tell me, yeah, yeah.
Tell me why.
Tell me why I shouldn't use a zip.
Yeah, I mean, it's just clear that the open container
initiative and OCI standard has become the standard
for packaging applications.
It's true that containers are objectively worse packaging mechanisms than zip files. The idea of a zip is a beautiful idea, and I think it's pure in an academic sense.
So if you want me to defend the lambda zip base function, the reason is you zip up exactly
only the bits you need, only the tiny components you need, the dependencies you need, and then
lambda provides the rest.
And as a result, they're able to cache those base images really, really well.
So you get faster cold starts for the bits there than you maybe would have provided, and so on and so forth. The problem is people get it wrong all the time, and they end up zipping in things they don't need, like a copy of, I don't know, Elf in here for some reason.
So that's one aspect of it.
I think containers, while people still are confused by them
and make elementary mistakes with them, they've kind of won the packaging war in the cloud ecosystem.
And as a result,
it's easier to just kind of go with the flow there.
Lambda also put in a ton of work to make them really fast.
I did a blog post and a video about it
when I benchmarked it,
but they do something really cool.
Every time you deploy a container-based Lambda function, Lambda will go to Elastic Container Registry.
It'll pull the container,
and then it will create a hash of
each 512 kilobyte chunk in your container image, and then it will look and compare and see if it's
already seen those chunks, and if it has it just drops them. And then it creates a manifest file
for your Lambda function, and then it creates keys based on each of those chunks that are
specific to your function, and then it creates kind of a main key that encrypts all of the keys
for all the chunks,
and it stores that.
Then when you have a cold start,
when it has to bootstrap a new Lambda sandbox,
it goes out and it uses this concept
of content addressable keys where it says,
okay, find me all of these chunks based on this hash.
And the only way that you can decrypt all of those chunks
is with that primary key, that main key,
that encrypted all of the other keys
that are used for each of those 512 kilobyte chunks. So that means that if you and I have the same
chunk of a container, we can just reuse it and we can share it. And that means that those
are often for very rudimentary files, those are probably already cached out there somewhere.
So you're able to just have really, really fast cold starts, even though they support
up to 10 gigabyte images,
which compared to a 250 megabyte Lambda function
is a nice win.
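A toy version of that content-addressable chunking, using the sha2 and hex crates, might look like the following; the real pipeline also encrypts each chunk with the per-function key scheme described above, which is omitted here.

```rust
use sha2::{Digest, Sha256};
use std::collections::HashSet;

const CHUNK_SIZE: usize = 512 * 1024; // the 512 KiB chunks described above

/// Split an image into fixed-size chunks, key each chunk by its hash,
/// and count how many chunks actually need to be stored (first sight).
fn dedupe_chunks(image: &[u8]) -> (Vec<String>, usize) {
    let mut seen: HashSet<String> = HashSet::new();
    let mut manifest = Vec::new(); // ordered chunk keys for this image
    let mut stored = 0;

    for chunk in image.chunks(CHUNK_SIZE) {
        let key = hex::encode(Sha256::digest(chunk));
        if seen.insert(key.clone()) {
            stored += 1; // new content, would be uploaded
        }
        manifest.push(key); // repeats are just referenced, not re-stored
    }
    (manifest, stored)
}

fn main() {
    let image = vec![0u8; 3 * CHUNK_SIZE]; // three identical zero chunks
    let (manifest, stored) = dedupe_chunks(&image);
    println!("{} chunks in manifest, {} stored", manifest.len(), stored);
}
```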
Yeah, yeah.
Yeah, definitely check out AJ's write-up and video on that, and like the original stuff by Marc. But that whole thing, like Marc's stuff on it, the way you explained it, that was like one of those things where it's like, man, that's clever.
Yeah, and now you can just use a container-based Lambda function. It's as fast as or faster than the zip function, and it's easier to develop.
You get more consistent, reproducible builds. You don't have to worry about these crazy bash scripts
that have to run on your MacBook and cross-compile or any of that. All of it just works right now.
And I think that's the benefit.
Yep, yep. Okay, next question. This is on AWS. What is a more annoying problem?
That there's not a way to sort of like enforce the free tier or where it's just like,
hey, I don't want to spend any money on this.
So I want this to be like a free account, you know?
Cause you sometimes see those people that like,
they're a student, they get like a $3,000 bill.
That one, the free tier problem.
Or what I call like the new production account problem,
where when you create a new account,
you have to request like Lambda limit increases because you get like 10 concurrent functions or like to get
out of SES sandbox mode is like a total pain in the butt or SMS sandbox mode and like all
these like permission things that you have to go through just to like get through basic
stuff. Which of those is more annoying?
I wish I could pick both. Can I press the button for both?
The free tier billing thing is such a miss on their end,
but AWS is just not meant for students. Every time someone says there are ways to avoid it: sure, but it's just not designed for that. And the reason they have the second problem is to prevent the first problem, which is, you know, if
they gave everybody uncapped accounts all the time, the abuse would be rampant. So that's
why they lock it down. I do think if I were to start a consultancy around
this, I would just like I know the magic words to talk to SES and get them to lift the limit
and get you out of sandbox mode. You know, I know the same for Lambda and all those kind of places you're going to run into
it.
That is tough.
I wish that when you're using an Amazon organizational unit, an OU, when you create a new account
within that organization, they just give you limits from like a predefined set that are
allowed for your org.
I think that would be really, really nice.
So, yeah.
Or if you could, like, pre-verify an account in some way, I don't know, like validate your bank account or something like that. Something that would be hard for, you know, a bad actor that's just gonna come in and, like, steal compute and leave.
The truth is that, like, beyond the students that are getting these bills, even people like you and I that want to spend, say, a hundred bucks a month on Amazon, it's not designed for that either, right? The actual value you get out of Amazon and AWS is when you're on these commit-spend accounts where you're going to commit to spend millions of dollars a month. And from there you can kind of pick whatever compute you need and those types of things. And, you know, it has to work at that scale. And I think as a result, because that's where the revenue is,
that's sort of where it's optimized for, it's
unfortunate. There's people in the community doing great work
to try and help everyone avoid those sharp edges, I think, you
know, I count you in that group. But it is tough. And I don't
know, I don't know how you solve that problem, because the scale
at which Amazon can, can give you compute, and thus bill you is
so fast that it's very, very difficult to say, okay, now you
have this limit.
I would say there's an educational account available for students where it's very limited in what you can do, but you don't even type a credit card in.
Oh, that's interesting. I didn't know about that. That's cool.
Yeah, but it is very, very limited. So it's like a learning-path system.
This, uh,
the second issue is my hobby horse right now,
because I was helping set up a new account recently,
and it took me multiple emails just to get out of SES sandbox mode.
And I'm like, no, I'm just sending transactional emails.
Yeah, not marketing emails.
It's exactly in the way.
Yeah, it's like, come on.
So, okay, next one: AWS network costs.
You are one of the biggest opponents,
one of the people I think of
when it comes to saying these charges are total bunk.
I think they actually have some validity to them.
Tell me what you think on AWS network costs.
And I'm mostly talking inter-AZ costs, also egress costs.
You know, they charge you a cent, a couple cents to get out.
Yeah.
Per gigabyte.
It's not cheap.
I forget all the numbers and it will change so I don't wanna misquote it,
but it's crazy how fast that becomes
your highest bill line item.
You can spend millions of dollars a month on AWS
and then all of a sudden, EC2 networking
is your number one bill.
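To put hedged numbers on that, here's a back-of-the-envelope sketch; the per-GB rates are assumptions based on commonly cited list prices, and they vary by region and over time:

```python
# Back-of-the-envelope inter-AZ cost; all rates are assumptions, not quotes.
RATE_PER_GB = 0.01 * 2          # ~$0.01/GB billed on each side of the hop
GBIT_PER_SEC = 10               # sustained cross-AZ throughput
SECONDS_PER_MONTH = 86_400 * 30

gb_per_month = GBIT_PER_SEC / 8 * SECONDS_PER_MONTH
print(f"{gb_per_month:,.0f} GB/month -> ${gb_per_month * RATE_PER_GB:,.0f}/month")
# -> 3,240,000 GB/month -> $64,800/month, before any compute costs
```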
And I think that there are some valid points. I mean, it's a limited resource and it's sort of a public good, so you have to defend against the tragedy of the commons.
But then you look at WarpStream, whose whole premise is, we're going to give you a binary, and people were able to realize something like a 10 or 20x cost improvement running
Kafka backed by S3.
So that tells me that networking costs can't be so much of a hit because you're able to
do that and save so much money and it's built into the price of S3, which is already a commodity
product.
So it's very clear to me that you and I are paying a much higher bandwidth rate than the S3
team pays.
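And a hedged sketch of why that arbitrage works: cross-AZ traffic is billed per gigabyte, while S3 within a region bills per request, not per byte transferred. The prices below are assumptions, not quotes:

```python
# Rough unit economics of shipping 1 GB across AZs directly vs. via S3.
# All prices are assumptions loosely based on list prices; check your bill.
inter_az = 1 * 0.02                # ~$0.01/GB out + ~$0.01/GB in
puts = 1024 / 8                    # 1 GB written as 8 MB batched objects
via_s3 = puts * (0.005 / 1000)     # ~$0.005 per 1,000 PUTs; in-region transfer free
print(f"direct: ${inter_az:.4f}/GB, via S3: ${via_s3:.4f}/GB")
# -> direct: $0.0200/GB, via S3: $0.0006/GB (plus storage and GET costs)
```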
So now we have all these cases where it's like,
they can sell their company for $220 million
because they found this kind of cheat
in the bandwidth billing.
That's truly and really at its core what it is.
And good for them.
Both those guys, Ryan and Richie,
I know they've been on your podcast,
and I've really appreciated talking with them over the years.
I think they're really sharp people.
Yeah, they're sharp.
They're smart guys. Yeah.
But yeah, it is exploiting the billing model, and they should. You know,
it's a good thing, and I want more people to do that. And I think more data systems are
going to do that kind of thing.
My question on that, and I think it's the strongest point against my argument, is
what's going on with S3 there. And I don't know if that's a billing mistake that they're locked into now,
like they sort of mispriced it in some way
and didn't account for it very well.
Or, well, I think there are sort of three possibilities.
The first is that it's a billing mistake.
The second is that inter-AZ networking
actually does not cost that much, or is not that scarce,
and they're overcharging for it.
And the third is that there are some efficiencies the S3 team is able to get,
some amount of predictability or something there,
compared to me just coming in
and throwing a bunch of traffic at it
that they can't plan for,
so they need a different rate for that sort of thing.
So, yeah.
Yeah, so it's a combination of two and three.
I don't really think it's one.
And with two, I think it's a high margin for them,
especially within an availability zone.
AWS has given talks about this,
Colm MacCárthaigh and another person whose name I forget.
There was this great talk at re:Invent
about the hollow-core fiber they were using between AZs in US East 1,
where they were able to roughly double the distance the buildings
could be from each other. As a result, they were able to buy more real estate and
have more capacity in US East 1. And that tells me that once they have
the trench dug for the fiber, running huge backhaul cables, massive links
doing hundreds and hundreds of gigabits per second
across many, many redundant paths, is very cheap for them to add.
I don't think that bandwidth is too saturated,
but because they have to provision those links for the peak of transmission between two AZs, two physical buildings, it becomes a very big opportunity to optimize when you
can do things off peak. So I'm thinking of Elastic Block Storage: EBS snapshot replication
is pretty cheap. I think it can actually be cheaper than S3. And my
guess is they're just replicating at night, when there's not very much load. And as a
result, they get a lower quoted price
internally or whatever.
So I think there's opportunity
for those types of things.
But I don't think those links are super taxed.
I'm sure they are at certain times,
and the idea is they don't want people
to abuse that, I get that.
But for things like Kafka,
where best practice is to go multi-AZ,
it kind of sucks,
because you're just going to pay this huge tax.
Yeah. And I would say the last point I'd throw out is that
other clouds don't bill for that. So it's really hard for me to make the point. Well, okay,
so Google does, right?
Google does. Microsoft says they do, but they don't, which is weird, I think.
Yep and Oracle doesn't at all.
Oracle's trying to catch up in some sense, so it's like.
Yeah, of course.
And then Cloudflare has this whole game
where they're like, we don't charge for it, but then once you get big enough,
it sounds like they do.
They get a little pious-sounding
on that sort of thing, and I don't know what the story is.
They also just have a very different traffic profile
than AWS does, I think, given all the ingress stuff.
It's just different. It reminds me of the strategy credit
that Ben Thompson at Stratechery talks about,
where sometimes, just
because of who your customer is or the shape of your product, you can claim some
things that are actually much cheaper for you than for everyone else.
You can sort of position it as a value, like we're doing this out of good.
Yeah, arbitrage that.
Yeah, we're doing it out of the goodness of our heart, or because we really believe in
this, but actually it's just a lot cheaper for you than for someone else.
So it's like a talking point or an argument
you can use as a sword against them.
And yeah, I feel like I really want to get to the bottom
of what's going on with Cloudflare there,
because they talk one game, and then it sounds like
the reality is a little bit different
on some of that stuff.
Yeah, I just think their business is structured in such a way where, like
you said, they can use that as a marketing tool against others. I think a similar
one would be, they advertise that they only charge you for CPU time. So if you're waiting
on the results of something over the network, you don't get charged for that in Workers. I think
that's great. But also, for even the largest-scale
Lambda deployments, Lambda itself is typically not the number one bill item for a serverless application.
It's S3 or Dynamo or API Gateway or CloudWatch, SQS or whatever.
SQS, yeah, exactly. Those bills add up.
Typically, I think Lambda is number three. So sure, that's true. And I think it's an opportunity for Cloudflare to kind of hit AWS with it every chance they get.
But if you actually look at what that would cost you
at the end of the day,
maybe it's not as much as you'd think.
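For a rough, assumed-numbers illustration of that billing difference (Lambda bills wall-clock duration times memory, Workers bill CPU milliseconds; the rates below are approximations of list prices, and per-request fees are omitted):

```python
# Hypothetical handler: 10 ms of CPU, 200 ms waiting on the network.
# Lambda bills wall-clock duration x memory; Workers bill CPU time only.
# Rates are assumptions loosely based on list prices; check current pricing.
cpu_ms, wait_ms, invocations = 10, 200, 100_000_000

lambda_gb_s = (cpu_ms + wait_ms) / 1000 * 0.128 * invocations  # 128 MB function
lambda_cost = lambda_gb_s * 0.0000166667                       # ~$ per GB-second
workers_cost = cpu_ms * invocations * (0.02 / 1_000_000)       # ~$ per million CPU-ms
print(f"Lambda ~${lambda_cost:,.0f} vs Workers ~${workers_cost:,.0f} per 100M calls")
# Duration billing charges ~21x the CPU time here, yet the absolute dollars
# stay modest, which is the point: Lambda is rarely the top bill line item.
```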
Yeah, yeah.
Yeah, so I still want to get to the bottom of this.
I feel like no one has a great handle on the networking stuff,
but the things that I would say are in my favor is
Google does actually charge for it.
Microsoft says they do. Cloudflare is kind of cagey on it. And then also, I can't remember if you were there, but I was talking to someone at re:Invent that works on EBS, and he was saying they spend a bunch of time trying to optimize around those network costs,
a lot of different things like that. And it is a factor for them.
Someone on the CloudWatch team was saying,
hey, we get billed for network costs internally,
and we look at it and think about it.
And again, like you're saying,
they're preventing a tragedy of the commons,
where if you don't think about it,
you're going to use inter-AZ traffic
willy-nilly.
So at least putting a cost on it
makes you think about it
and consider that trade-off.
Yeah, I think the issue I take with it is that it started that way,
as in, we just need to make sure people don't abuse it, and it has ended up somewhere else.
That's true, it is like a profit center now that they can't get away from.
That's a good point. So I think if you start to see an actual slashing of prices and competition across clouds
in a huge way, I think you'll see that cost start to come down.
And if you see them cutting that cost, then it's kind of a strong indicator that it was
a big part of their margin.
Yep.
Yep.
Interesting.
Okay.
While we're on this point, CloudFlare, are you using CloudFlare?
I've played with it.
I've really been trying to use their container service and they haven't added me to the beta
and they announced it like a year ago.
So I'm a little annoyed there.
Come on, Boris, Michael, somebody get him in there.
Yeah, I'm trying, and I'm hoping to try it out. I haven't really used Workers
because I don't write a lot of JavaScript anymore, so adapting to that hasn't
been my primary focus. It hasn't been a huge focus of mine,
but I do admire their work.
I know and respect them, a lot of friends over there.
I think they're a really great group
and I think they're kind of going the right direction.
Yep, yep.
Yeah, the big thing for me is,
I do like a lot of the people there, very sharp.
But the concepts are weird, right?
You really have to relearn things.
It's different from containers in certain ways
and from servers in other ways, but it's also
very similar in a lot of ways. Workers
and Durable Objects are just way different from some
of the other concepts,
and you really have to go all in and understand them
in a different way.
And then the surrounding ecosystem,
permissions, logging, infrastructure as code,
is just not as robust as the other stuff.
So that's what held me back on Cloudflare.
But I do like the people there and what they're doing.
So yeah.
Yeah, I do think they need a better infrastructure as code story.
I think that's definitely a big gap.
I know SST has been pushing that direction and I think if that really...
Yeah, they've got some good stuff for it.
Yeah.
Yeah.
If that cracks it, I think it's going to be a big growth opportunity for them.
Yeah, for sure.
Okay.
Last one on the infra area, the database ecosystem.
What are you seeing out there?
What do you like, excited about?
What do you think about it?
Man, I dabble in it all.
This is my favorite part of programming right now: database programming and systems programming.
So I've been following Sam Lambert, of course.
Him talking about PlanetScale Metal has been really exciting.
I totally agree with him on getting rid of the free tier.
I think free tiers are just
a massive cost sink, and they don't really
convert super well, in my experience,
mostly from Serverless.
You need incredible margins at scale
to be able to subsidize a free tier.
And if I'm building a SaaS today, like if I'm going to quit and do my own thing,
I would do a two-week trial, and that's it. If you like it, you pay for it.
If you don't, you don't.
Get serious or not. Yeah.
Yeah. And it's more about who I want to partner with as a customer, and not so much do I want to talk
up my user base numbers. I think the growth of users on free tiers became a talking
point for raising money, and that's the tail wagging the dog. That
was a metric they could put on the screen and say, look at all these users we're
getting, but none of them would ever pay for the product. So if you want to
build something sustainable, and I know Sam talked about that on your podcast, I think
that's a much better route. So yeah. I think Aurora DSQL is a very interesting
piece of technology. Not everybody needs multi-region, and I think that's where Sam would push back,
but it's a really cool tool. If you have that need, it's awesome. If you don't, yeah, there are some rough edges. But if you really just want a relational data store in Lambda, Aurora DSQL
is pretty awesome; you should check it out. And then I follow all of the
key-value store stuff too. Internally, we use FoundationDB quite
heavily; it was nascent and is now becoming a popular
key-value store. And of course Dynamo, Cassandra, and so on.
Yeah, yeah.
It's interesting on the free tier stuff,
because you have RDS and Aurora, kind of the elephants in the room.
You have PlanetScale Metal, which doesn't have a free tier.
Then there are so many competing for the rest,
and Supabase is a pretty big one in that category.
And then you have so many other ones.
Man, that's just a tough area to be competing in.
I like a lot of those folks and companies and things like that.
But that's a tough area to be competing in.
Yeah, it'll be interesting.
It really is.
And with databases in general, as much apprehension as I have around rewriting software, I have around picking a third-party database provider.
So I love reading the papers, I love seeing the benchmarks, and I love, of course, the drama. So I do like following it.
Yeah, yeah. And it's interesting seeing all the OLTP-on-object-storage stuff that's happening, which I never thought would be a thing. And it's the same reason WarpStream
did it.
So the answer is the same, similar to what we did with Bottlecap, right? Not everybody needs read-after-write consistency, or even, you know, serializable snapshot consistency.
Maybe they just need eventual consistency, where the eventuality is on the order of seconds.
And all of a sudden it becomes a very interesting and compelling use case, where your cost
goes from maybe your number one cloud cost to a rounding error on your bill.
Everything's fungible, man.
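A toy sketch of that trade-off: if readers can tolerate staleness on the order of seconds, writes can be buffered and flushed to object storage in batches, which is what turns per-GB transfer into a handful of PUT requests. The names and the flush interval here are purely illustrative:

```python
import time

class BufferedSegmentWriter:
    """Toy write buffer: durability and visibility lag by up to flush_secs.

    Readers only see data after a segment lands in object storage, so
    "eventual" here is on the order of seconds, which is the trade described.
    """

    def __init__(self, store, flush_secs: float = 2.0):
        self.store = store            # anything with a put(key, bytes) method
        self.flush_secs = flush_secs
        self.buffer: list[bytes] = []
        self.last_flush = time.monotonic()

    def append(self, record: bytes) -> None:
        self.buffer.append(record)
        if time.monotonic() - self.last_flush >= self.flush_secs:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            key = f"segments/{time.time_ns()}.log"
            self.store.put(key, b"".join(self.buffer))  # one PUT per batch
            self.buffer.clear()
        self.last_flush = time.monotonic()
```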
Yep, yep, it's true. I'll be curious, because that's a hard technology
problem and a lot of people are working on it. I'm sure some cool stuff is going to
shake out of that, though it's going to take a little while. But
yeah, some cool stuff there. Closing out here, I want to talk about
maybe career advice, just because, again, I think your path is so interesting.
You were a Rubyist.
You had done some deep stuff,
because you'd done this whole Elasticsearch deep dive
and gave a talk at re:Invent,
then came to Serverless, did a bunch of JavaScript and some Go,
and now you're doing the deeper, more systems stuff.
I guess you already said one piece of career advice, at least: go a level deeper, right?
Yeah, go a little deeper.
I can expand on that.
If I'm going to give career advice, especially to myself in the past, the first
is: calm down, it's going to be okay.
And that's something I even tell myself now, as a new dad: calm down, it's going to
be okay.
And beyond that, your Java, your Ruby, your Go, your Rust,
your JavaScript, your Python, they're all abstractions on top of the same concrete CPU and architecture underneath.
So if you learn a little bit about how that works,
it can take you a really long way across all languages and all dimensions.
So yeah, go a level deeper and learn. And that's true no matter what your niche is.
Learn how HTTP works.
It's a text protocol.
You should, at some point in your career, write an actual HTTP request in a text editor, send it, and try it.
And realize there's a curl command where you can say, oh look, I just have this text file and I can tell curl to
send it, and it's valid because it has headers, then a line break and a carriage return,
there's two of them, right, and then the body. Now HTTP/2, of course,
is a binary, framed protocol, so there's a little difference, but you build on that.
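To make that concrete, here's a minimal sketch of sending a hand-written HTTP/1.1 request over a raw TCP socket; example.com is just a stand-in host:

```python
import socket

# An HTTP/1.1 request is plain text: a request line, headers, then a blank
# line. Each line ends with carriage return + line feed, i.e. \r\n.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"  # two \r\n pairs in a row mark the end of the headers
)

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    chunks = []
    while data := sock.recv(4096):
        chunks.append(data)

print(b"".join(chunks).decode("utf-8", errors="replace")[:500])
```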
Similar to networking, there's been a discussion about servers being blocked in Spain, because La Liga, the Spanish football league, is blocking IP addresses wholesale
that are associated with, I guess, illegal rebroadcasts of football games.
And because they're doing this at the IP level, and not at the domain level using the Server
Name Indication (SNI), which passes through
the TCP connection when you upgrade to TLS, it blocks anybody that shares an IP. And it's the same reason you were talking about with SES. Why is it hard to get out of
email jail? Well, it's very, very difficult to have a good-reputation IP address that sends email.
So Amazon protects that very closely. And what I'm saying, as far as career advice,
is that all of these things relate to the same principles,
because IP addressing and IP networks
form the backbone of what we're all doing.
So if you understand a little bit about it,
it can take you a really long way.
So learn how HTTP works, learn how TCP works,
learn UDP. Try it out,
write your own servers and requests,
and get down in the nitty-gritty.
And I think it's a huge, huge step forward
in your career compared to your peers.
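In that nitty-gritty spirit, here's a small sketch of the SNI detail from the La Liga story: the hostname you pass travels in cleartext in the TLS ClientHello, which is exactly the information IP-level blocking throws away. Again, example.com is a stand-in:

```python
import socket
import ssl

hostname = "example.com"
ctx = ssl.create_default_context()

with socket.create_connection((hostname, 443)) as raw:
    # server_hostname sets the SNI field, sent unencrypted in the ClientHello
    # so that one shared IP can serve certificates for many different domains.
    with ctx.wrap_socket(raw, server_hostname=hostname) as tls:
        print(tls.version())                 # e.g. TLSv1.3
        print(tls.getpeercert()["subject"])  # certificate selected via SNI
```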
Yep, yep. And one thing you said earlier about not being
afraid to dig into library code, I think that's so useful. When I was learning to code,
honestly, what I did is I would answer questions on Stack Overflow in the Django
topic, about things I hadn't done. Someone would ask a question, and I would go
read the Django docs, read the source code, try to figure out how it worked, and try to
explain it to people.
And that just helped me so much with reading code
and figuring out what's happening.
And yeah, going a little deeper,
I think, is super helpful.
So, I don't know.
Yeah, if I could sum it up, I think that's it.
And, you know, it's a brave new world out there.
Don't be scared of the LLMs;
use them, they're very helpful.
And for every second you're scared that one is going to take your job, you'll find a second where you're like, oh, this is totally falling apart.
Yeah. Yeah, exactly. AJ, always great to talk to you.
Thanks for coming on.
If people want to find you,
where, yeah, what's the easiest place to find you?
I guess for now, you can still find me on Twitter,
at astuyve.
You can send me an email;
I'm just aj at datadoghq.com.
You can find me on LinkedIn.
Just search for AJ.
I post under my real name everywhere,
including on places like Reddit.
I think it just builds your reputation in a positive way.
So just search for my name.
You'll find me everywhere I'm active.
Yeah, for sure.
We'll link it in the show notes as well.
But yeah, thanks for coming on, AJ.
Always great to talk to you.
Absolutely. Thanks so much, Alex.
Have a good one.