The Changelog: Software Development, Open Source - Kaizen! Let it crash (Friends)
Episode Date: January 17, 2026. Gerhard is back for Kaizen 22! We're diving deep into those pesky out-of-memory errors, analyzing our new Pipedream instance status checker, and trying to figure out why someone in Asia downloads a single episode so much.
Transcript
Welcome to Changelog & Friends, a weekly talk show about how good systems become bad systems.
Thanks as always to our partners at Fly.io, the platform for devs who just want to ship. Build fast, run any code fearlessly at Fly.io.
Okay, let's Kaizen.
Well, friends, I don't know about you, but something bothers me about GitHub Actions.
I love the fact that it's there.
I love the fact that it's so ubiquitous.
I love the fact that agents that do my coding for me believe that
my CI/CD workflow begins with drafting YAML files for GitHub Actions.
That's great.
It's all great.
Until, yes, until your builds start moving like molasses.
GitHub Actions is slow.
It's just the way it is.
That's how it works.
I'm sorry.
But I'm not sorry because our friends at Namespace, they fix that.
Yes, we use Namespace
to do all of our builds so much faster.
Namespace is like GitHub Actions, but faster. I mean, like, way faster.
It caches everything smartly.
It caches your dependencies, your Docker layers, your build artifacts, so your CI can run super fast.
You get shorter feedback loops, happy developers because we love our time, and you get fewer
"I'll be back after this coffee and my build finishes" moments.
So that's not cool.
The best part is it's drop in.
It works right alongside your existing GitHub Actions with almost
zero config. It's a one-line change. So you can speed up your builds, you can delight your team,
and you can finally stop pretending that build time is focus time. It's not. Learn more. Go to
namespace.so. That's namespace dot S-O, just like it sounds. Go there, check them out.
We use them. We love them. And you should too. Namespace.so.
How else would you learn? Let it crash. Exactly. The best things happen when
things fail. Seriously. If it's in a controlled way, right? I think that's like something which
isn't said. It's implied. It has to be a controlled failure where you have the boundary and things
will not blow up. I mean, they'll blow up, but like, you know, like the fireworks sort of blowing up
where it's a controlled explosion. Yeah. Right. Tiny little crashes to learn from. Welcome
everyone to Kaizen 22 with the incomparable Gerhard Lazu.
He's here to let us know how he lets it crash.
It's like that song, let it snow, let it snow, let it snow.
Only, you know, you replace the words.
Hey, Gerhard, how are you?
Hey, Jared.
I'm good.
Thank you.
Thank you.
Had a great holiday.
It was a great couple of weeks where I've managed to finally disconnect.
It's been, I know, like 20 years since I had two weeks completely off.
Even my holidays are only a week.
So this was very different, very enjoyable, and I feel so refreshed.
So I'm firing on all cylinders.
You unplugged and now you're plugged back in pretty much.
Plug it in.
I stopped it and I started it.
And it's like brand new.
It's like Glade, man.
I'm like Glade over here, man.
Plug it in, plug it in, you know what I'm saying?
Smell the scent, the fresh New Year's scent called 2026.
Some people are going to say this is going to be the
biggest, best year ever. I've heard it said. What do you think? They keep saying that. I'm excited about it.
They said that about 2020. We have to admit it was off to a killer start. I mean, it was really going
well. Right. Pun intended. Killer start. That was
2020. 2020 was the year of COVID and everyone's like, oh, this is going to be like the best year ever. And
then we had three years of misery.
So I think,
I think,
I just want like an easygoing year.
You know what I mean?
Last year,
2025, 1st of January,
we were building shelves.
We were like redoing like studies and whatnot.
And the whole year was full on.
Like it was like,
it was nonstop.
Every week there was something significant happening.
And this year would like just like to,
for it to be a bit more chill,
maybe a bit more meaningful.
So that's one way we're thinking.
But how about you, Adam?
How are your holidays?
My holidays were filled with barbecue and good times.
Wow.
Even in winter.
So barbecue never stops.
It does no seasons.
Never stops in Texas.
Actually, just to shower you all with a few of my picks from my most recent
barbecue adventures.
If you're in Zulip, go to the general channel.
Look for barbecue with three bangs after it because why do one bang when you can do three?
Bang, bang.
Some recent ribs.
My gosh.
My ribs method is on point.
My spatch cock chicken method is on point.
No one is disappointed at my barbecue joint.
Very nice.
Look at it.
Adam added some meat to this slide.
That's what happened in real time.
Wow.
Real time meat added.
This is like,
yeah.
This is intense.
And that's again,
just to be clear,
it's Adam's barbecue.
Okay.
So like no joking aside.
We're talking about barbecue.
I mean,
I think we have to leave it there.
Let's move on.
I think we have to leave it there.
I didn't show a burger, but I do make a mean burger too.
Thank you, Gerhard for assuming that is something I do rock really good.
My smash burgers are on point.
Very nice. Very nice.
I'm looking forward to that.
One day.
My favorite Christmas tree.
This is what it looked like.
What is that?
And for those that are listening, it's a networking cabinet.
There's lots of blue lights.
flashing. This is happening in the loft. You have many terabits of network throughput. There's some
switches. There's UniFi. There's MikroTik. This is maybe five years in the works. And every Christmas,
I take time to improve it little by little. So this year I went really crazy. I redid like the
whole thing. I redid, for example, DHCP, the network, the VLANs. Oh, man, it's beautiful.
Your VLANs are beautiful.
They are.
I want to be a guest on your network, man.
I'm going to get blocked from everything, okay?
Yeah.
Well, well, well, there's like a big story happening in the background.
And it is, it is going to be, I think this is amazing.
This is, this will be the best network that I have run, like in my life.
But the blue and the darkness and it's like, that was like one more Christmas tree in our house and this was it.
where I would just go and tinker for a few hours
in between Christmas dinner and all the Christmas festivities.
So it was nice just to spend a bit of time tinkering with hardware.
And I'm sure that many of you listening,
when it comes Christmas time, when things start quieting down,
you get like the little projects that you didn't have time for throughout the year
and then you, you know, have some fun.
So I'm wondering, did any of you, did anything fun this Christmas,
But nerdy fun, that's what I mean by that.
Nerdy fun.
Well, I got upset with something.
And so I decided to just let it roll.
You know, I'm trying to say.
I got upset with the amount of RAM usage on my machine.
And while I like the application, I was like, you know what?
I'm just kind of tired of having 4 gig.
I think it was, you know, it was like 1.2 gigs of RAM being used by clean my Mac.
fancy little utility application helps you tune and pay attention and stuff like that.
And I decided to remake it.
That was it.
So I remade it.
It's called MacTuner.
I know there used to be a MacTuner.com,
which was, I think, a Mac Magazine, I believe.
But MacTuner fit.
I might change it.
Who knows?
But for now it's called Mac Tuner.
It does all the things, all the things.
Analyze, clean up, uninstall.
And not just that fake uninstall;
the real one where you get the dirty dirties out.
You know what I'm saying?
The dirties, all the dirties are out.
Okay.
My mind is still on the dirty burger that you mentioned earlier.
Yeah.
I mean, that's about as nerdy as it can get.
I mean, I made a little utility for me, for now.
Soon to be open source, though.
Soon to be.
It will be soon.
Yeah.
I mean, why not, right?
Share it with the world.
Well, I didn't create a Mac tuner, but I found one.
I also was thinking, CleanMyMac, you know, like, how long am I going to run this thing?
And the answer is, as long as I ran it, because I'm done now.
I found a tool called Mole, M-O-L-E, which is a command line macOS cleaner that does like everything.
So you may have some competition here at it.
Maybe you can come out and like throw some blows down.
Like here's why I'm better than Mole.
It's got a TUI.
It's all command line based.
It does cleaning, optimizing, uninstalling, a DaisyDisk-style disk explorer.
Oh gosh.
All from, yeah.
I'm feeling it.
I'm feeling intimidated over here.
He's starting to sweat.
He's starting to sweat.
I think he just changed his mind about open sourcing it.
Here's your,
here's your domain name idea, Adam.
BetterThanMole.com.
You know,
better than grep.
That's good.
I could do that.
So I've been using that.
I'm very excited because,
uh,
who doesn't want to just have all the things right there in their command line?
And yeah.
I didn't spend any tokens on it.
Adam's got some tokens involved,
but his also works the exact way he wants it to.
Yeah, yeah, absolutely.
Mine leverages some recast stuff as well.
It's kind of cool.
Sweet. Open source that sucker.
One day.
Which day is that?
Not today.
Definitely not right now.
But it's going to be one day.
One day.
There's a bigger launch awaiting, let's say.
There's a bigger launch awaiting till I'm going to open source some things.
I've been using AppCleaner for many, many years.
There's no TUI.
There's no CLI.
It's just like a regular app.
It's a really old one.
It's like drag and drop onto it, right?
Pretty much, yeah.
And you also have a list of applications.
But it's so old, it is difficult to find it these days.
And it hasn't had an update in a very long time.
So I will check Mole out.
Mole's really cool.
`brew install mole` and you're done.
So you can check it out right here while we're talking.
And I liked AppZapper.
And I think AppZapper doesn't exist anymore.
But the cool thing about that was that it would literally make the zap sound as it deleted.
Yeah.
You drop your app on and it zapped it.
And I just like that sound.
That's the only feature that your application needs to have.
If it zaps.
Mole does not zap.
So there you have it.
Make it zap.
It's our tagline.
I actually make it zap.
Make it zap.
There you go.
I think that's a very good debate actually.
I know,
and everything.
What about you?
Cool.
Besides your Christmas tree,
did you?
I will come back to that.
I will come back to the Christmas tree.
This guy's got stories,
man.
Oh,
oh, yes.
It's like,
I have to,
I have to tease them and be very disciplined
because there's too much stuff.
So I have to be very careful
because it will be an hour
and I will not shut up
talking about this, this thing.
I mean, it's just like, anyway.
So we will come back to that, I promise.
Okay.
Last time, when you finished Kaizen 21,
this was one of the last thoughts
that we shared,
which is what's next.
So bam, remember bam, that happened live.
OOM crashes, out-of-memory crashes,
and a bunch of other things.
The good news is that only one thing happened.
OOM crashes.
So we have only one thing to talk about.
But this rabbit hole is really, really deep.
Okay.
All right.
Take us down the rabbit hole.
The O-O-M.
Out of Memory.
Who remembers this book?
Erlang in Anger.
Stuff Goes Bad: Erlang in Anger, by Fred Hébert.
Ferd.ca.
Now, I remember Learn You Some Erlang for Great Good!,
but I do not remember
this one in particular.
So I'm not sure why only the other one is on
my radar, because he wrote both of them, it seems.
But when did this one come out?
Wow. So this one, if I look (I just
switched to the browser), came out in 2016,
2017, while he was still at Heroku.
Remember Heroku? Those were the days.
So about 10 years ago.
And Fred, I mean, he's just like,
if you don't know his blog, I mean, it's just amazing.
I'll just click it very quickly,
just to have a look.
Oh, I think it's one of the best blogs out there.
There's so much goodness here, so much.
But one of my favorites is queues and queuing, and how queues don't protect from overload.
So queues don't fix overload.
And this is so relevant to today's conversation as well.
But there's a lot of stuff in the Erlang ecosystem.
And there's many, many things that Ferd wrote over the years
that are so relevant to today.
So if I click on download PDF, right, by the way, this is amazing.
This book is open source.
You can download it, freely available, under a Creative Commons license.
And I'm going to make this a little bit bigger so we can see what's happening.
And if I search for Let It Crash, it's page number one.
It's in the introduction.
Page one.
Page one.
And this idea of Let It Crash really comes from the
Erlang ecosystem.
It's very well-renowned there because of how the Erlang VM works, with all the processes
and the supervision trees.
It was built this way.
And we know a thing or two about Erlang, Jared, right?
Because the application, Elixir, the Phoenix framework, runs on the same principle.
I know a thing and you know too.
So that's how we get to a thing or two.
And Adam, I'm sure he knows the big one.
But we don't know whether he's going to share it.
The point is, the point is, when you think about let it crash, Jared,
Yes.
In your, like from your development experience with Erlang, with Elixir, Phoenix, is there any
situation, any moment where you could experience it and you realized, huh, that's nice?
When I let it crash.
When you let it crash.
Well, it's nice that the beam seems to handle a lot of the problems with letting it crash.
You know, it just goes again or there's a supervision tree and things watching each other.
Yeah.
And I don't have to think about it very much.
I can't think of like an instance in development
where I was like, this is really useful,
but I'm sure you could come up with one.
Yeah.
So you know when you write code,
we tend to write code very defensively.
Typically with try/catch.
So you feel like you need to account
for every single scenario.
And the let it crash philosophy
is about not preventing failure,
but learning from it.
What that means is you need to have a context
where it's safe for things to crash
and the overall system will still remain
stable. So how can you build a resilient system? Really this is about resiliency, where the core of the
system will remain running and the system as a whole will remain running even though parts of it
may experience failures. But those failures will not bring everything down. And that's really important.
So fewer try, catch blocks, don't code defensively, let it crash and separate the code that solves the
problem from the code that fixes the failures. And the more you can lean into the framework or the
VM or whatever you have, the system to deal with failures, the better off you are to focus on
the things that are unique to your application. Yeah. And Erlang is well renowned for that.
Kind of the opposite philosophy that Go took as I write some Go code and I write some elixir code
where with Go it's like handle every error condition right after you potentially raise one and make sure
There's no error.
And if you're not dealing with it, then you're not writing robust software.
And the other philosophy is let it crash and deal with it elsewhere.
I think they're both legitimate, depending on what you're building.
Agreed.
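[Editor's note] The two philosophies being contrasted can be sketched in Go, one of the languages mentioned here. This is an illustrative toy (all function names invented): the first style checks the error right where it happens; the second lets the worker panic and keeps all failure handling in one supervising place.

```go
package main

import (
	"errors"
	"fmt"
)

// flaky is a stand-in worker that fails on some inputs.
func flaky(n int) (int, error) {
	if n%3 == 0 {
		return 0, errors.New("bad input")
	}
	return n * 2, nil
}

// Go style: handle the error inline, right after it can occur.
func goStyle(n int) int {
	v, err := flaky(n)
	if err != nil {
		return -1 // deal with the failure here, next to the call
	}
	return v
}

// Let-it-crash style: the worker just panics; a tiny "supervisor"
// recovers and decides what happens next, so the happy path stays clean.
func supervised(n int) (result int) {
	defer func() {
		if r := recover(); r != nil {
			result = -1 // fallback/restart policy lives here, not in the worker
		}
	}()
	v, err := flaky(n)
	if err != nil {
		panic(err)
	}
	return v
}

func main() {
	fmt.Println(goStyle(3), goStyle(4))       // -1 8
	fmt.Println(supervised(3), supervised(4)) // -1 8
}
```

The point is not that `recover` equals a supervision tree (the BEAM restarts whole processes with fresh state), only that the code solving the problem is separated from the code handling the failure.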
Well, in our case, we had a lot of crashes to deal with.
Yeah, you're telling me. Gosh. Erlang style.
So what we are going to have a look at is all the times that the Pipedream has been crashing
since our last Kaizen.
So since Kaizen 21, which was October 17th,
we had a lot of crashes.
And there's a certain property about the system
and this is Varnish specifically
that made these crashes pretty okay.
And the property which I'm referring to
is when you start varnishd, the daemon,
Varnish itself runs as a thread
and you have many, many threads that do different things.
So when we had these out-of-memory crashes,
All that happened was the thread was killed, which means that the system as a whole didn't crash.
The VM, the Firecracker VM, didn't crash.
The application needed to restart.
It was just a thread that was using too much memory and it restarted within seconds.
As in maybe two seconds and everything was back to normal.
Obviously, the cache was cold, but it was good.
I mean, and that's why the memory looked a bit interesting, in that it doesn't release all the memory;
the VM doesn't restart.
There are no hangs.
It crashes and restarts really, really quickly.
So that's a nice property.
Well, that confuses me.
So how does Fly know about it then if it's just happening inside of Varnish?
So it's looking at the process.
It's looking at the process ID.
Which process uses the most memory?
And it's the same process that's asking for more memory.
So it basically will just send a signal to that process and kill that process.
But that is just a process ID that maps to a thread.
So Varnish itself didn't crash.
It's just a thread that maps to a process ID that crashed,
and then it was restarted by the varnishd daemon.
Okay, so where is fly involved in that?
Because fly is aware because I see all these fly notices
and I get the fly emails.
Right.
So fly is aware that there is a process on the machine
that is using too much memory
and more memories being requested.
And then it looks like, okay, which process do I kill?
And in this case, a process with the most memory will get shot
and we'll get killed.
So Fly as a platform can actually reach it and kill that process without killing the machine,
rebooting the VM or the Firecracker or whatever.
So the Fly platform, it integrates with that functionality, which is a kernel,
it's a Linux functionality.
That's why like an out-of-memory crash would happen even like if you have a single machine,
you have too much memory, you don't have any swap.
How do you basically give more memory when there's no memory left?
and when the system is becoming unstable.
So then you get like just a single process
which gets killed.
In Fly's case, they surface that.
They surface the fact that there was like an out-of-memory crash.
There was an out-of-memory event.
And they send you an email when that happens.
It doesn't mean that the machine had to restart.
It doesn't mean that it stopped serving traffic.
It just means there was like something that just had to go away
because it was using too much memory.
And I say too much memory.
Obviously, it's a bit more complicated than that
because something was asking for memory.
The kernel didn't have any more memory to allocate,
so it just had to look at what needs to be killed
so that I can allocate more memory
because something is using too much memory.
And it just so happens,
it would be this process and this thread.
So how many crashes do you think
that the Pipedream had since Kaizen 21,
since October?
So we're talking about three months,
maybe a bit more than that.
So Gerhard has presented us
a multiple-choice quiz. A is 20, B is 40, C is 80, D is 160. Now I know that I personally receive an
email every time this happens. And so I have a little bit of a feeler into this. I delete them,
so I can't go do a quick search. Adam, do you get emails when these fly things crash?
I don't. Okay, good for you. They don't come to mine. I know. It's enough that I do. They're in a box that
doesn't get looked at. You've been saving on some email bandwidth. But they do know, because when we send
the email, so let's go back to this one.
if I click on this one,
let's take this one,
and you can see everyone that gets the emails,
I'm just going to make this a little bit bigger.
So you can see: services, Jared,
Adam, and Gerhard.
I do get it.
So there must be a filter.
He just doesn't look at it.
Superhuman saving me.
Nice.
That's okay.
So what do we think?
Good thing other people are looking at it.
It's not an Adam problem.
That's the thing.
So that's a good thing.
He's doing the right thing.
He's just saving his inbox for more important messages.
Anyway, moving that to the side.
So I feel like 160 is too many.
I don't think I've gotten 160 emails since October on this particular thread.
20s feels like not enough.
I've certainly gotten more than 20 emails.
So I'm between 40 and 80.
And I'm going to think that, gosh, that's a tough one.
I'm going to go with 40.
Adam, what do you think?
I'd go at 40 as well.
Oh, I got it.
Yes.
43, exactly.
The price is right.
That price is right.
All right.
Cool.
Yeah.
43 crashes from October through the end of December.
Yeah.
And obviously there were like periods when we had quite a few.
So if we were to think about what could be happening in Varnish that it's running out of memory and crashing.
So this is us trying to think
about the sort of traffic that we serve,
trying to think about
everything, I mean, now
we see every single request that hits
change log, the CDN
as well, and it's a lot of requests.
Yeah. So there's something
in the system, there was something in the system
that was using
way too much memory.
And as a result,
the process or the thread in this case
was crashing.
I mean, I could guess it, but I might even
have some insight. So
should I just say it, or do you want to add to the guess?
I mean, my guess based on also I saw some emails flying through,
but already I would have suspected that we just have too many large files.
These 60 to 80 to 100 megabyte MP3 files loaded into memory, you know,
flying every which direction.
And you just can't load up that much memory without some sort of fancy freeing mechanism.
And it's just trying to hold all these MP3s in RAM, I think.
and it just can't do it.
So that's my guess.
Yeah.
That was a good guess.
And I think the next question is going to be to the audience.
Because we know too much.
How are they going to answer?
It's not real time.
Well, just think about it.
Like, we will give some time for people to think.
Okay, we'll be like a delay here.
So if they have, what's it called?
The feature that skips silences. If you have skip-silence on,
they're not going to have any time to think about this.
Right.
Okay.
So quickly, turn that feature off.
Give yourself some time to think.
Go ahead.
Or pause.
we can also say pause. Now it's a good time to pause. And then what could be the problem? So you're
right. All those large files, we had all the MP3 files, many, many MP3 files. They're large,
all trying to be cached in memory. And that was a problem. So what is many? Well, we have thousands
at this point of MP3 files across all the podcasts, like since the beginning of time. Large. Large means anywhere
from 30 to 40 megabytes to 100 plus megabytes.
So that's, I mean, just think if you had to load the 1,000 files
that take 100 megabytes,
that's a lot of memory that you need to have available.
And the problem is that once you store these large files,
as we discovered, you get memory fragmentation.
In that, imagine that you have all the memory available,
you keep storing all these files,
and at some point there's no more memory left.
So what do you do?
Well, you need to see what can evict from memory so that you can store the new file.
So imagine that you evict a few of those objects, but maybe they aren't big enough,
and you haven't evicted them fast enough.
So then you have this big file that can't fit anywhere because the sizes, like the holes that you have in memory,
aren't big enough for this file to fit.
And there's no defragmentation or nothing like that that runs in the background,
which means that even though technically you kind of would have space in the memory
for the specific file, you may not.
And then it can't be stored in memory.
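[Editor's note] The hole-doesn't-fit problem described here is easy to see in a toy model. A hedged Go sketch (the sizes are invented and real allocators are far more subtle, but the arithmetic is the point):

```go
package main

import "fmt"

// fits reports whether an object of the given size can be stored when
// free memory exists only as fixed holes, with no compaction running
// in the background (as in a cache that never defragments).
func fits(holes []int, size int) bool {
	for _, h := range holes {
		if h >= size {
			return true
		}
	}
	return false
}

func main() {
	holes := []int{40, 35, 30} // MB free, but scattered
	total := 0
	for _, h := range holes {
		total += h
	}
	fmt.Println("total free MB:", total)          // 105: "technically" enough
	fmt.Println("fits 100 MB file:", fits(holes, 100)) // false: no single hole is big enough
}
```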
Now, the thing in Varnish is actually called,
I kid you not, n_lru_nuked.
So I think the connection to the nuke and to the book
and to let it crash is right there.
So LRU Nuked basically, it's like a forced eviction.
So it's an event where an object
has to be evicted from the cache
just to make room for a new one
because the storage is full.
So you can see how many times this has happened.
And that's like an important metric
that if we look at, we can see
we had too many of these events, right?
Like many objects were being nuked from memory
to make room for new objects,
but sometimes they wouldn't fit.
So how badly did it nuke?
Because we can measure this,
we can look at this.
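[Editor's note] For anyone running open-source Varnish who wants to check this themselves, the counter lives under `MAIN` in varnishstat; a sketch of the invocation (assumes Varnish is installed and running):

```shell
# Cumulative count of objects force-evicted (nuked) to make room.
# A steadily climbing value means the cache is churning.
varnishstat -1 -f MAIN.n_lru_nuked
```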
And this is what it looks like
from a memory perspective.
So you can see that
the instance was running about maybe four gigs of memory. And then we had a massive spike within
minutes, like one or two minutes to 16 gigabytes. So that's a lot of data that had to be fit in
memory. And you can already see where this is going. Scrapers and bots and LLMs. And then we have
so many things happening. And then you can see the memory, it went up. The thread was killed.
The child was killed. Like the varnish once the memory came down again. And then it went up again.
So the graph that we see here, we can see the first spike, just like maybe a minute apart.
The second spike, another crash, it took a little while for it to restore.
We're talking maybe 10 seconds.
And then we stabilized around 10 gigabytes.
From a CPU perspective, we got like 100% CPU utilization when this happens.
Like everything is full on, everything.
Like the instance is really struggling to allocate and deallocate and free up memory.
and more importantly, we have a lot of traffic flowing through.
So how much?
2.29 gigabits,
or gigabits,
specifically, 2.29 gigabits per second.
Per second, exactly.
And these happen so quickly,
you have like a huge rush of traffic coming in,
and then nothing.
Well, friends, I'm here again with a good friend of mine,
Kyle Goldbreth,
co-founder and CEO of depot.dev.
Slow builds suck.
Depot knows it.
Kyle, tell me, how do you go about making builds faster?
What's the secret?
When it comes to optimizing build times,
to drive build times to zero,
you really have to take a step back
and think about the core components
that make up a build.
You have your CPUs, you have your networks,
you have your disks,
all of that comes into play
when you're talking about reducing build time.
And so some of the things that we do with Depot,
We're always running on the latest generation of Arm CPUs and AMD CPUs from Amazon.
Those in general are anywhere between 30 and 40% faster than GitHub's own hosted runners.
And then we do a lot of cache tricks. Way back in the early days,
when we first started Depot, we focused on container image builds.
But now we're doing the same types of cache tricks inside of GitHub Actions,
where we essentially multiplex uploads and downloads of the GitHub Actions cache
inside of our runners so that we're going directly to blob storage with as high of throughput as
humanly possible. We do other things inside of a GitHub Actions Runner, like we cordon off portions of
memory to act as disk so that any kind of integration tests that you're doing inside of CI
that's doing a lot of operations to disk, think like you're testing database migrations in CI.
By using RAM disks instead inside of the runner, it's not going to a physical drive,
it's going to memory. And that's orders of magnitude faster. The other part of build performance,
is the stuff that's not the tech side of it.
It's the observability side of it.
You can't actually make a build faster if you don't know where it should be faster.
And we look for patterns and commonalities across customers,
and that's what drives our product roadmap.
This is the next thing we'll start optimizing for.
Okay.
So when you build with Depot, you're getting this.
You're getting the essential goodness of a relentless pursuit of very, very fast builds, near-zero-time builds.
And that's cool. Kyle and his team are relentless on this pursuit.
You should use them. Depot.dev.
Free to start. Check it out.
A one-line change in your GitHub Actions: depot.dev.
So why is more traffic coming into the instance than going out?
So this is the traffic that the instance is receiving.
So we're receiving 2.29 gigabits, while we're only sending 145 megabits.
Now it's a good time to pause and think about why this is happening.
Yeah, don't skip silence.
So when we say the instance, we mean the Varnish instance.
The Varnish instance, yeah.
Which sits between our end user, whatever that is, or users and our application.
Yeah.
Well, actually, and our Cloudflare, not our application.
All our backends.
And we have a couple of backends.
Yes.
But in the case of MP3 files, it's our Cloudflare R2.
Origin.
That's correct.
Varnish is receiving a bunch of data and sending back
an order of magnitude less data.
And what's it receiving?
I don't know, man.
I mean, my guess would be like we're uploading MP3s.
Now, that's going to go straight through the app to R2.
Just a DDoS?
I mean, what is it?
I don't know.
Yeah.
So it is a DDoS, but it's specifically downloading MP3 files
or starting to download MP3 files, but never finishing.
Hanging.
Right?
So you get like all these requests for MP3 files for large files.
Varnish is going.
and fetching them as quickly as it can.
So pulling all this data in so it has in memory,
but the client is never around long enough.
Yeah.
Exactly.
So they basically abort.
But Varnish is still pulling in all the data.
Now, there is a property.
It's called beresp.do_stream,
set to true.
So what this does, a very weird thing,
it tells Varnish not to buffer the entire backend response
if the client is slow, right?
So I'm not going to fetch the
entire MP3 file if you only want the first, I don't know, minute or two, or a range or something like that.
Now, this is on by default.
So by default, that's how Varnish behaves.
So we wouldn't need to enable this.
But if the object is uncacheable, it cannot be stored in cache.
You see where I'm going with this?
Memory: you can't store it in memory.
So you keep pulling these files over and over again and maybe even just fragments of them.
So even though the client never receives them, you may be pulling hundreds of files and the client just goes away.
So you're not pulling the entire file, but you're still pulling enough, and not able to fit it anywhere, and it just becomes a mess.
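[Editor's note] For reference, the setting being described is a one-liner in VCL. This is a generic open-source Varnish sketch, not the actual Pipely configuration:

```vcl
sub vcl_backend_response {
  # On by default in Varnish: start delivering to the client while the
  # backend body is still being fetched, instead of buffering it first.
  set beresp.do_stream = true;
}
```

The wrinkle discussed above: when an object ends up uncacheable, or simply can't fit in storage, the streamed bytes are thrown away once the client aborts, so each aborted request for a big MP3 still costs a backend fetch.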
This reminds me of the 90s when you used to go jean shopping, right?
And you'd go into, do tell us, which I would never, you know, never shopped at.
But let's just imagine I did, right?
You go in there and be like, I like all these jeans, get them all.
I'm trying them all on.
And I just bounce.
Yeah, the person goes to collect them all.
They come back and you're not there.
Here's the dress room full of jeans and Adam's gone.
Bye-bye. See you.
That's really sounds like you're speaking from experience.
Was this like a, it was just a prank?
I just made it up just now, you know, I'm just creative like that, you know, on the fly.
Creativity.
Wow.
That's a good one.
That's a good one.
On the fly, yes.
So.
It is on the fly.
There it is.
On the fly.
It is.
On the fly.com.
Boom.
Well, what could we do then? What's going on here?
Exactly. So this was one of the things which I had to deep dive into and understand what on earth is going on.
Like, where do we store? Like, what's happening? So there's a lot, lot more that went into this pull request. It's pull request 44. I'm calling it the elephant in the room. I'm going to switch to the browser just to have a look at that. So the title of the pull request is storing MP3 files in the
file cache. But that's like the tip, right? Like the most obvious thing is, well, you either have
lots and lots of memory to give varnish, which honestly would be impractical in the sense that
would be way too expensive to store all these files in memory. The next best thing is to have
something like a file cache. And by the way, we're talking about open source varnish. That's really
important. Like anyone can use this, anyone can configure this. You can configure a file cache,
which will basically preallocate a file on disk. And then,
that's where these large files will be stored.
Pull request 44, the one that we're looking at,
it's in the Pipely repository.
That's what this adds.
But there's significantly more stuff.
And if I'm going to, let me go,
there's quite a few files.
I highlighted the few, so I'm going to look at this one.
So it's not just that.
You also need to tune, for example, thread pools.
You need to tune the minimum, the maximum.
You need to tune the workspace backend,
like how many memory structures get allocated.
you need to configure the nuke limit
and there's a couple more things that we had to go through
just to make things stable.
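All of those knobs map to varnishd startup options. As a rough sketch only, not Pipely's actual configuration (the paths and values below are made up; the real settings are in pull request 44):

```shell
# Hypothetical varnishd invocation -- values are illustrative.
# -s malloc: small/hot objects stay in RAM.
# -s file: large MP3s go to a preallocated file on disk.
# -p thread_pool_min/max: tune worker threads per pool.
# -p workspace_backend: scratch memory per backend request.
# -p nuke_limit: how many objects LRU eviction may remove to make room for a large one.
varnishd -a :8080 \
    -s malloc,1g \
    -s file,/var/cache/varnish/cache.bin,45G \
    -p thread_pool_min=200 -p thread_pool_max=2000 \
    -p workspace_backend=128k -p nuke_limit=1000 \
    -f /etc/varnish/default.vcl
```

The `file` storage backend preallocates that file on disk up front, which is what lets the large MP3s live outside of RAM.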
Now, I'm just going to very quickly mention these things.
You can go and have a look at pull requests
to see what else went into it.
So this was the one file.
The other one was the regions.
That's another thing.
Not all regions would suffer from this.
So you don't want to allocate too much memory
or too much CPU to regions
where maybe they don't get a lot of traffic.
And you would think that this thing is easy.
But oh man, I have a surprise for you.
You can't mix and match sizes easily in fly.
So you can't say create like application groups
and this group will be like the small group
and that group will be the big group
and this is just one application.
It's not straightforward.
So you have to, again, this is how I solved it.
Maybe someone listening to this will tell me,
Hey, Gerhard, you're wrong.
I would love to know that, seriously.
So the way I solved it is we deploy in all the regions, right,
because you specify the size once.
So you say, my starting size is the large instance type.
It has a certain number of cores, certain number of memory.
And by the way, the disc is the same in all of them
because that's like another problem.
So we will sidebar that or put a pin in that.
So when it comes to the initial deployment,
you deploy the one size across all the application instances,
and then you go and need to check to see which instances should be scaled down,
so that you have the capacity,
but the regions that don't need the capacity, you can just bring them down.
And you do like a rolling deploy in that you replace one for one.
You have plenty of capacity to handle the traffic while instances are being rolled,
all that good stuff.
But we have hot regions and we have cold regions.
And there's quite a few things here.
Again, if someone knows how to do this better, I would love to hear about that.
And we have the TOML, we have the primary region.
There's a couple of things here.
We'll come back to services and HTTP services.
That's a fun one.
We leave that for a little bit later.
Flyjust.
We can see how we do the flyctl deploy.
We disable HA because we want only one instance per region.
We have 15 regions in total.
We specify the CPUs, the memory, all the good stuff.
including environment variables.
Oh, that's another thing.
We need to adjust the varnish size based on the memory that the instance has, right?
We need to say like, hey, varnish, you get 70%.
And that's the other thing that this does.
Same thing for the file size.
You can't take up the entire disk.
We tell you, based on the disk that we provision,
how much space you should use from the disk that gets created.
There's a scaling there.
So that's another good one.
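As a back-of-the-envelope sketch of that scaling logic (the instance specs and the 90% disk share here are made-up illustrative numbers; the 70% RAM share is the figure mentioned above, and the real logic lives in the pull request):

```shell
# Sketch: derive Varnish store sizes from the instance specs.
# INSTANCE_MEMORY_MB and DISK_SIZE_GB are illustrative, not real Pipely values.
INSTANCE_MEMORY_MB=2048
DISK_SIZE_GB=50

# Varnish gets ~70% of RAM for the malloc store...
MALLOC_SIZE_MB=$(( INSTANCE_MEMORY_MB * 70 / 100 ))
# ...and most, but not all, of the disk for the file store (90% is an assumption).
FILE_SIZE_GB=$(( DISK_SIZE_GB * 90 / 100 ))

echo "-s malloc,${MALLOC_SIZE_MB}m -s file,/var/cache/varnish/cache.bin,${FILE_SIZE_GB}G"
```

The point is that the sizes are computed from whatever the region's instance actually has, instead of being hardcoded.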
I'm going through the PR to see if there's anything else.
Oh, man, this was, this was a pain.
So recreating, like writing tests for this.
Everything is tested in the sense that which requests would go or which,
basically which files would get cached in the file store and which files would be
cached in the memory store.
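In VCL terms, that routing decision looks roughly like this. This is a sketch, assuming Varnish 5+ named storage backends (started with something like `-s malloc=malloc,...` and `-s file=file,...`); the storage names and the Content-Type match are assumptions, not Pipely's actual VCL:

```vcl
sub vcl_backend_response {
    # Large audio objects go to the preallocated file stevedore;
    # everything else stays in the in-memory store.
    if (beresp.http.Content-Type ~ "audio/mpeg") {
        set beresp.storage = storage.file;
    } else {
        set beresp.storage = storage.malloc;
    }
}
```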
So how do you write the tests?
Some varnish logging is included.
You have to have anchors.
There's quite a few things.
So that's assets backend.vtc.
And part of this, it was a huge refactoring.
So if you look at the lines of code, I wouldn't say it's that.
many, 1,500 were added and 1,470 were deleted. So not much changed. I mean, the net is
30 new lines were added. But there was like a huge massive refactoring part of this. So there's
again, this was I think two, three days of like figuring it out, trying things, refactoring
things. And if you think that an LLM can help you, well, you try this.
And it takes longer to go through those iterations than if you know what you're looking for.
It tends to be easier.
Anyway, it's very dense, very specific, very difficult to make sure that it's doing the right thing.
But it's all there.
We have the mock backends.
We're reusing things.
We split the VCLs.
By the way, once you split the VCLs like that,
It's easier to reuse them.
So there's quite a few things there.
Now, this is Kaizen.
So we are wondering what improved.
After all this work, right, we rolled it out what improved.
And to answer this question, we need to figure out which region is the busiest one.
So out of all the regions that we serve, we have 15 in total, which ones get the most traffic, i.e. those hot regions?
We're looking at the fly, the Grafana dashboard for our fly application, the instance of
the pipe dream, the current one. And we can see that SJC, San Jose, California is a red, nice big red
circle, which means it has the most traffic. And also NRT, which is Tokyo.
Hmm. Apparently. We're big in Japan. Yeah. And Europe, there's quite a few. If I'm going to
pull this down a little bit. Let's see. No, I wanted to go here. What about that new continent? Are we big
there? The new continent? Australia. Oh, there's a new one. There's a new new new one.
Well, what's it called?
Which is a new one?
I don't know.
There's a headline I heard.
I thought y'all would get the joke.
No.
Over the holiday, there was speculation.
There was a new continent being announced.
Narnia?
Maybe.
Could have been Narnia.
No, no, no.
So the closet.
Anyway, even this list, if you think about it,
it kind of makes sense, right?
It's U.S. East, U.S. West, Europe.
But we have quite a few instances in Europe.
We have four.
It's more geographically spread in Europe.
And we have Asia.
So these are like the big ones.
Australia, Africa and South America, they're not as busy.
They aren't like these busy regions.
Cool.
So which instance would you like us to have a look at?
So I have a cue right here.
SJC, baby.
Let's go.
Let's go, baby.
All right, let's see that.
So I'm running flyctl ssh console.
I'm using two flags: dash s, which is the short one for dash dash
select, it will prompt me which instance I want to select.
And then I have dash C, capital C.
It's different than lowercase c.
They do different things.
I give it the command to run.
And it's varnishstat dash 1, which will give me all the statistics from Varnish at a point in time.
So since this instance was running, I will select SJC.
There you go.
And it will give me all the data, which is like all the counters that Varnish is incrementing,
keeping track of different things, of the origins,
backends, the memory pool, the disk pool, the lock counters.
There's so much stuff.
I'm really, really impressed how many things Varnish has.
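To give a flavor of what that output looks like, and the kind of arithmetic the LLMs then do with it, here's a tiny sketch over made-up counters in the shape of `varnishstat -1` (the field names are real Varnish counters; the numbers are fabricated for illustration):

```shell
# Fake varnishstat -1 excerpt: counter name, value, rate, description.
cat > /tmp/varnishstat-sample.txt <<'EOF'
MAIN.cache_hit            930000   2.15 Cache hits
MAIN.cache_miss            70000   0.16 Cache misses
MAIN.n_lru_nuked             132   0.00 Number of LRU nuked objects
EOF

# Pull out the hit/miss counters and compute the hit ratio.
hits=$(awk '$1 == "MAIN.cache_hit"  { print $2 }' /tmp/varnishstat-sample.txt)
misses=$(awk '$1 == "MAIN.cache_miss" { print $2 }' /tmp/varnishstat-sample.txt)
hit_ratio=$(( 100 * hits / (hits + misses) ))
echo "hit ratio: ${hit_ratio}%"
```

With these fake numbers the ratio comes out to 93%, which happens to match the figure discussed below; the real output has hundreds more counters.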
So this is what we're going to do.
We, because AI, right, we're going to copy all of this,
and we're going to ask AI what it thinks of this.
Okay?
It's just too much data here.
So let's be serious about it.
So question to you, which is your favorite AI, Jared?
which ones do you use?
Oh, I don't like any of them.
I would probably start with Claude and then I would go to Grok and then I would go to ChatGPT.
Third.
Okay.
So Claude, which one, which version?
Opus, man.
Give us the opus.
Opus.
Okay.
So we're looking at Abacus.AI,
which is something I've been using for a long, long time.
It allows you, I'm only paying $10 per month for it, not sponsored, you know, not
affiliate in any way.
It's just something that I've picked for myself.
I can basically pick any model and I can just run this.
So I have something prepared.
So I'm going to drop this.
It's all the data.
And we're going to read through something that I prepared ahead of time.
You pre-prompted this.
I pre-prompted this, exactly.
Engineering this prompt for weeks.
Exactly.
Not really, but.
Oh, that's a long prompt.
So we're going to read it.
And in the meantime, Adam will think about his favorite LLM to try.
and I have mine. So we'll try three LLMs to see what they say.
So I need to read the prompt now while everybody thinks,
no, we should be using whatever LLM you should be using.
You are a Varnish 7 expert.
You need to prepare four distinct responses and be explicit about the person that you're addressing.
One, a seasoned sysadmin that has been living and breathing infrastructure for the last 20 years.
Be precise, think deeply, and approach the setup from a hardware perspective.
2. An Elixir application developer that embraces Erlang's let-it-crash concept.
You need to give it straight, give it fast, and keep it relevant to their application.
It's the app and the nightly backends.
Assets and feeds are important but less relevant, Cloudflare too.
3. The business person that is selling this thing, they care about costs, efficiency and simplicity.
Keep it high level and relevant for someone that doesn't care about the tech, but cares about the outcomes.
And four, the audience of a podcast where this is being discussed.
Make it general, relatable, and fun.
Make analogies, keep it light and engaging.
I have fun too many times.
We don't want to make it too fun.
That's a lot of fun.
So, yeah, there's one too many funs.
That's right.
Wow.
Now that you understand your audience,
please analyze the following varnishstat output for the SJC.
Look, I already knew that you would pick.
How do you don't know?
I go for the big one.
I have no idea.
Focus on things that work well, things that could be improved and anything else that you find interesting.
And by the way, ignore the synthetic requests.
It will keep mentioning these.
Like, I get so fed up with this.
We have health checks that run every five seconds.
So they are normal.
So,
I'm going to copy this.
I'm going to run this.
And I'm also going to open a new window for Adam.
So which LLM should we pick Adam?
Which is your favorite?
You mean model?
Model, yeah.
Which model?
We just used it.
But I'd probably back up to like Codex.
Codex.
Which is like GPT-5.
Latest.
GPT-5.1.
5.1.
5.2.
There you go.
So GPT Codex.
My favorite one is Gemini.
So I'm going to drop it.
Now let's see how do they compare.
Gemini, you're in a different tab now.
So Abacus can't do Gemini?
It might, but I have like my own pro account.
So that's something else.
Like I use Veo quite a bit.
I use Nano Banana,
quite a few things. Transcripts, it's all like part of the package. Sure. So it can, but that's
what I prefer. Cool. So Claude Opus 4.5. For the seasoned, for the seasoned sysadmin. This is you.
This is me. This is me exactly. Thank you for noting, like, who's who.
Following. So what's working well. Rock solid stability. So by the way, the instance has been running
for 5.4 days. We had like all these improvements shipped and we are able to observe how our busy
instance works, and that's what this is, basically.
That was a,
the window moved. Cool.
So after,
5.4 days, zero child panics, zero crashes.
Zero threat failures. This is important.
It means no threads died. No threads had to be
restarted. Everything is healthy on this instance.
It didn't crash. So this instance didn't crash.
Zero lock contention across all subsystems.
Your CPU cache lines are happy.
Excellent.
Hit ratio, 93%.
We like that.
We really like that.
We have backend connection pooling with a 2-1 reuse ratio,
and memory pressure is minimal.
132 LRU nukes in the last five days.
So very few objects had to be removed from memory.
Thread pools: properly sized, 300 threads, zero queuing, zero drops.
That's perfect.
Areas to investigate.
Disk storage allocator failures:
we have disk c_fails.
We are hitting storage fragmentation.
The disc is 97% full.
We have 48 gigabytes used.
That's how many MP3 files are stored.
By the way, how many MP3 files total do you think we have?
Size or files count?
Size.
Size.
Well, if we had a thousand episodes at 100 megs each,
which neither of those things are true,
that'd be 100 gigs, right?
So 100's too big, but a thousand's too small.
I'm going to say 80 gigs.
Adam, do you know, guess?
That math checks out.
I was going to say, like, a terabyte, but that's probably raw WAV files versus not.
All the files that we store in R2, and this includes all the assets, but we know that the MP3 files are the biggest.
It's close to 250 gigabytes.
We may have some duplicates.
I don't know.
I haven't checked.
But that's how much data we have in R2.
Yeah, well, we also have plus plus for the last couple of years,
which means every episode has two files, not just one.
So that makes sense.
So we should go higher.
Now, we use this in every single region.
So maybe we want to reduce number of regions.
But I think...
You know, there's a third category called Super Hot.
Super Hot, yes.
Which is like SJC and Tokyo, right?
That's possible. Yeah, there's four, which we know they're really, really hot. Yeah. Yeah. But honestly, this is happening across multiple regions. And it is. We'll get to some interesting things. So, okay, synthetic responses, grace hits, all good. For the Elixir developer, and I think this is you, Jared. Do you want to read it out? Oh, well, the TLDR is Varnish is doing its job. Your app backend is well protected. You want to read the whole thing? If you want, I mean, how it's shielded. It's 95% shielded. No fetch
failures, zero backend failures.
That's because of, you know, my code doesn't really let it crash very often.
Exactly.
Your code is, yeah, it crashes internally, not externally.
That's right.
My thing is doing its thing.
Mm-hmm.
It is generating some uncacheable responses, but, you know, we do have some that we just don't want to be cached.
Ooh, one fetch failure, negligible.
Yeah, I agree.
You don't worry about that.
And in the end, it says, whoever wrote this is really good at what they do.
I agree.
And they should be proud of themselves.
and congratulations on such a great hire.
Yeah, I agree.
I agree.
I think the hire needs a promotion and a bonus.
There you go.
All right, for the business person,
the caching layer is performing excellently.
Adam, do you recognize yourself,
or shall I continue with this?
You can read it.
93% of requests never touch your servers,
massive cost savings on compute.
Do you know how many requests per second
the application is serving?
Like maximum, by the way.
What's the maximum RPS for this amazing
Elixir Phoenix application for the homepage?
Probably a lot.
Gosh.
Thousands?
Tens of thousands?
Maximum.
Okay. Jared?
100,000?
The database connection is involved.
Concurrently?
Concurrently, yes.
I don't know.
I'd say a lot, not very many.
To our homepage?
I'd be like 12.
12 requests a second.
Yeah.
17.
17.
I'm right in there, maybe.
Someone knows our code.
So 17 requests per second.
So if all these requests were hitting the application, we need so much compute to serve
that.
You know, so much caching.
Obviously, we've removed all the caching.
Now we're joking about this because we purposefully removed all the caching from the
application.
Right.
I remember that a couple of years back because we said this has no place in the application.
The application gets restarted.
Yeah.
We need to store this somewhere.
we need a cluster.
It was just really messy to handle it at that layer,
which is why we introduced this.
Five plus days running without any issues.
By the way, this is like the last deploy.
So maybe by the next Kaizen,
if we do no more deploys,
we'll be able to see how it handles.
Zero failures on the infrastructure side.
And three terabytes of data served to users.
Three terabytes.
So in five days,
this one instance served three terabytes.
without your application servers breaking a sweat.
Storage is getting full, so we need basically more storage.
For the podcast audience.
Oh, yeah, that's be fun.
Imagine a really good receptionist at a busy office.
This Varnish server is like having someone at the front desk who remembers everything.
Out of 100 people who walk in asking questions, 93 of them get their answers immediately from the receptionist without ever bothering the
experts in the back office.
What's cool?
It's been running for over five days straight
without a coffee break
or a single mistake.
That sounds cruel to me,
but let's go with it.
It served three terabytes of data
that's like streaming about
a thousand HD movies.
This one instance
streamed a thousand HD movies
in five days.
And the experts
only had to answer
seven percent of the questions.
The one quirk.
The filing cabinet is getting full.
It's like,
when your receptionist's desk drawers are stuffed
and they occasionally have to throw away
old notes to make room for new ones.
Not a crisis,
just time to get a bigger cabinet.
Okay,
I think there's one last fun stat about the 300 workers.
I think that's too detailed.
That's good fun there.
Good job, Belmont.
Do we care about GPT or Gemini?
We can only use one.
We can only pick one.
Gemini's getting some good hotness.
Let's check Gemini.
We'll see how it adds up.
Oh, it's still thinking.
Let's see.
I think it's finished.
Maybe that's, let me just close that.
Did it finish?
I think it did.
All right.
So let's go up.
Slow thinking.
I did like the thinking.
Show thinking.
Show thinking.
Show thinking.
Slow thinking.
I thought it said slow thinking.
I was like, oh,
come on, speed up, Gemini.
Show thinking.
There's quite a lot there.
Anyway, we're not going to look into that.
So the instance has been up for 5.3 days.
The MGT uptime.
I like it.
telling me which of those, that long list of counters is important.
From a system perspective, the threading model is perfectly dialed in 300 threads across
two pools with a zero thread limited and zero thread queue length.
The kernel isn't context switching unnecessarily and we aren't starving for execution units.
However, we have a critical I/O and storage allocation bottleneck in the SMF,
the file stevedore. I'm not sure what stevedore means.
That's a weird one.
I'm not sure Gemini knows what a stevedore is either.
The problem is SMF disk c_fail is at 19,000-plus against the allocation requests; that's a 73% failure rate for disk allocations.
The g_space: we have only 145 megabytes free while we have 48 gigabytes used.
You're effectively out of contiguous blocks or hitting the ceiling of the pre-allocated file.
That sounds right.
Varnish is forced to fall back to.
to SMA RAM or simply not cache large objects.
While SMA has 2.1 gigabytes free, the disk is thrashing.
The hardware: if this is spinning rust, your I/O wait is likely spiking.
It's not.
It's NVME.
So, yeah, there's quite a few things there.
Interesting.
It's low, relatively, suggesting the lock contention on the shared memory log is negligible.
Nice.
You have a low lock contention.
The bottleneck is strictly at a storage layer.
We just need basically a bigger disk.
The Elixir developer, Let It Crash Proponent.
I like it.
Your app backend is rock solid.
Well done, Jared.
Varnish hasn't seen a single backend fail or an unhealthy event in five days.
The let it crash philosophy is working.
If the app is crashing, it's recovering so fast, Varnish doesn't even notice.
I don't think it is.
I think the application is good.
The good news is we're seeing an effective 75% hit rate.
Now, this is different, right?
We had 93% there.
So I would like to understand the discrepancy.
That's another reason why you should use more than one, by the way.
And then obviously double check the data because sometimes they give you things which aren't true.
The crash concern, the disk cache is crashing.
Again, this is not for the application developer.
I think this is for the sysadmin.
And me, I think. The straight talk.
Your MAIN s_synth is high.
So we have health checks.
There's a delta responses.
So we have a lot of synthetic requests.
Again, sorry, synthetic responses.
is again like a varnish thing.
The business person,
efficiency, currently serving
this much of our traffic from Varnish,
I think I know what's happened.
I don't think it's taking into account
the synthetic requests.
Those should be removed
from the total number of requests.
They think Claude has the right number.
I think so, yeah.
Yeah, I think so.
This means for customers,
we have cost efficiency, that's good,
the risk, there's the bottom line.
I think this was the fun one.
But I think this is a library one.
I think we can stop it here.
The library analogy versus the secretary analogy.
I think that was a better one.
I got a barista one.
I thought it was like a very good one.
Oh yeah.
For queuing or for what?
For queuing, yeah.
Yeah.
Like the barista analogy, I thought it was very good.
Yeah.
This is using books and whatnot.
The library hasn't burned now.
That's a good thing.
That is fun.
So I think Gemini is getting a bit funnier.
The nightly feeds and the app are still humming along.
Nice.
So that's what we have.
And, and that was only half the problem.
Well, friends, this episode is brought to you by Squarespace,
the all-in-one platform for building your online presence.
Whether that's a portfolio, a consulting business,
or finally shipping that side project landing page,
you just have been meaning to do, but never get to.
Here's the thing.
You mass-produced code on the daily.
You deploy new services, new infrastructure, new hardware,
You're versioning your APIs.
You're simvering all over the place.
But when someone asks you about your own personal website, it's like, ah, I'm still working on it.
Does that sound familiar?
Squarespace exists so you don't have to treat your personal site like a weekend project that never ships.
Pick a template and drag and drop your way to something that actually looks good and move on with your life.
No wrestling with CSS.
No, I'll just build my own static site generator again.
It's just done.
If you do consulting or freelance work on the side, Squarespace,
handles the whole entire workflow.
Showcase your services,
let clients book time directly in your calendar,
send professional invoices, and get paid online.
It's the boring infrastructure
that you don't want to build for yourself.
And for those of you out there who are doing courses
or gated content or educational stuff,
tutorials, workshops, that intro to whatever series
you keep talking about,
you can set up a membership area with a paywall
and start earning recurring revenue.
Set your price, gate the content,
and you're done.
And they've also added,
Blueprint AI. This generates a custom site based on your industry, your goals, your style preferences.
It's not going to replace your design skills by any means, but it'll get you about 80% of the way
there in about five minutes. Here's the call to action. This is what I want you to do. Go to
Squarespace.com slash changelog for a free trial. And when you're ready to launch,
use our offer code, change log, and save 10% off your first purchase of a website or a domain.
again, squarespace.com
slash change log.
That was only half the problem.
So we were like at the midpoint.
I was feeling good.
I feel like we had it all fixed.
What else is the problem?
Oh, wow.
This is like when all the fun begins.
So you remember this, Jared?
Yes.
MP3 requests intermittently hang in Newark, New Jersey.
This was our good friend John Spurlock, who's been on the show before and is a podcast nerd.
In fact, he runs op3.dev and other podcast
nerdery things. And so he really knows this stuff. And so when he reports issues,
you know, I don't say, did you try rebooting? I take it seriously. So I shared it with you.
And he actually did some additional digging for us. Go ahead. So in terms of you tested this.
And I think you had issues as well. So we've confirmed this for sure. I did. Like certain times,
certain files. Actually, it would be all requests at
certain times. I assume that that was that particular pop, as we could call them, or pipe
in the Pipedream, that was hanging. And then it would go away. And he actually had the same problem.
He had a Friday night deploy of friends. And he was trying to listen to it on Friday. Couldn't
get to it. By Saturday morning, he can get to it. So it's intermittent hanging, very difficult
to diagnose, very difficult, I assume, to debug. And then it just comes back to normal. I thought
maybe the out of memory thing, like it's just in some sort of fugue state until it reboots
and then it works again. But you go ahead. That's what I thought. That's why like a deep dive on
this. This was November, end of November, beginning of November. So November, I was just trying to
figure out what on earth was going on. Just like, you know, like from the sides. I didn't have too
much time. But if you look at this response, there's quite a few things there. This is like
my initial one, like an investigation, trying to understand what's happening, giving a couple of
debug headers, like a couple of extra headers that the request can be made with.
Sorry, can be run with.
So we just get a bit more details.
Forcing regions as well.
So there's quite a few things there.
I was checking into that.
This is Don McKinnon.
He also had issues today.
So he pasted some results.
So thank you.
Thank you, Don, for adding this.
This was helpful.
So,
I'm still scrolling, I'm still scrolling, there you go.
It's super helpful. I have confirmed that the requests have been hanging.
You are getting the hangs this afternoon as well.
This was only three weeks ago.
So this has been going on for a while.
I dug deeper and I found the problem.
The problem was that in the fly config,
we had the concurrency set to connections, not requests.
So it's possible to configure an application.
Again, you're configuring the fly proxy that sits in front of the application to limit how much traffic hits your application.
So requests: how many concurrent requests should the fly proxy forward to your application before it stops, because you don't want to get overloaded.
So before that, it starts throttling, it starts slowing clients down,
and then,
that's when you start seeing fly edge errors.
Connections, you would use for something that has long running connections,
like a database, for example.
In our case, it's not a database, right?
It's an HTTP application.
So requests would have been the right concurrency.
I have no idea why I picked connections.
It was the wrong one.
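In fly.toml terms, the fix is essentially a one-word change in the concurrency section. A sketch, assuming the standard `[http_service.concurrency]` block (the limit numbers below are placeholders, not our actual values):

```toml
[http_service.concurrency]
  # "connections" counts long-lived client connections -- right for something
  # like a database, wrong for an HTTP app, where slow clients pin connections
  # and starve everyone else. "requests" counts in-flight requests instead.
  type = "requests"
  soft_limit = 200
  hard_limit = 250
```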
But the effect was, as you can see here,
we had 2,700 long running connections
on that edge, so on that region.
So in this case, it was, I think, the orange one, I think, EWR.
Right.
So EWR was getting, had like all these connections opened.
Clients weren't closing the connection.
The proxy was full.
No more connections could be forwarded to the application.
Long-running connections.
Those are usually clients which are not doing the right thing.
Right.
You shouldn't have that many long-running connections.
So the problem was a misconfiguration on our side,
which meant that connections like slow connections,
long-running connections,
were basically blocking other connections from coming through.
So that was a problem there.
And I thought that was it, but there was more.
So this was the last comment last week.
We now have a check that runs every hour.
And what was interesting,
and I'll talk about the check as well.
We had response bodies timing out in two regions.
So 13 regions were fine.
But even after this configuration, there were two regions.
I had an EWR where when we were using HTTP 2,
and for some reason this is important,
when we're using HTTP 2 and the proxy,
the fly proxy would see this.
It would not forward the connection correctly.
As in it would start, it would like serve the response.
like we could see the headers coming back from our instances,
what we wouldn't get is the body.
So the body would always be like zero bytes served.
And we could see this happening.
We could see the connections that, by the way, they were opened.
They shouldn't have been opened because the application changed.
So these connections should have been dropped.
There was something not quite right.
My suspicion is with a fly proxy layer.
Because when we were forcing HTTP 1, everything was working fine.
And by the way, the fly proxy, when it talks to our varnish instance,
it's using HTTP 1 and you can see that in the headers.
So the proxy to the varnish was fine,
but the client to the proxy was not fine.
And HTTP/2 is a very complex protocol.
There's so many things which just don't work the way people would expect.
So anyway, the issue fixed itself.
That's the important thing.
So opening this.
Not super satisfying.
Yeah, that was very nice to see.
And there was something,
Myaelrus.
How would you read this?
Maya Illeros.
Illeris.
There you go.
So someone on the fly community forum that was very helpful,
they noticed that we had a misconfiguration in our fly.toml.
And we were using services as well as HTTP service.
And this is bad.
By the way, this is very, very bad.
So everything was happy.
Like we could push this config.
You know, the applications were running.
everything was fine.
But because we had these two things together,
it was apparently creating some issues.
And all we did, we were explicitly setting the idle timeout.
And the idle timeout, that's the one where if after 60 seconds,
the connection isn't doing anything,
it will be forcefully terminated by the proxy.
So that part was important.
So anyway, we made the change, we pushed the change.
But even before we pushed the change,
the proxy started behaving.
And now there's pull request 49.
We right-sized it.
We made a few changes.
I captured like all the details, the configuration, the commands.
It's all there if you want to read it.
But most importantly, now we have a check that runs against all regions every hour on the hour.
CICD.
It's using Hurl.
And what I'm thinking is, shall we try running that locally to see how it behaves?
Because that's how it started.
Like, I started doing it locally.
So on the left-hand side, I'm back in the terminal.
the left-hand side, I am monitoring my internet connection. Remember that Christmas tree? This is related
that Christmas tree. So I'm at the top of the Christmas tree. I'm at the gateway, the core router. It's a
MikroTik CCR2004. Pretty good. 10 gigabits per second maximum. Now, my internet connection isn't 10
gigabits, but it's 2.5, which is plenty for this test. So, every second it's showing me how many
packets and how many bits we're receiving and transmitting. Okay. And again, we are recording.
Everything's happening live. So you can see it jumping, right, that's Riverside. We're pushing more data to
Riverside. Cool. So I'm going to run now just check. And just check, by default, it's one of the
commands, the just command that we have in the Pipely repository.
And check, all it does, it runs Hurl with a couple of flags.
It downloads an MP3 file.
It downloads feeds.
It basically connects to all the different backends,
and it sees how quickly it can get data back.
That was quick, those eight seconds.
I'm going to run it again.
As I run this, pay attention to the left-hand side.
it will go to
120 megabits per second
so that's that MP3 file being downloaded
so every single time this runs
a full MP3 file gets downloaded
alongside a few other things
okay I can open the reports
we're not going to look into that because
we're going to run something more interesting now
we do check all and what check all does
it runs the same command against all the regions
I'm at 2.3 gigabits per second
we're downloading all the files
we can see the response is coming back.
EWR just sped by.
IAD sped by.
So all the different endpoints are returning.
Now, I'm based in London.
Obviously, the further away you are, the slower it gets.
So for example, this was South America.
That was LAX.
So a couple of instances are slower to respond.
And all this happens via headers.
So you can, when you connect to fly,
you can tell it, hey, I want to connect to a specific region.
And then that's what routes the request to that, to that region.
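Fly's dynamic request routing honors a fly-prefer-region request header, which is one way to pin a check to a specific region; a hedged Hurl sketch, with a placeholder URL:

```hurl
# Ask Fly's proxy to route this request to Tokyo (NRT); placeholder URL.
GET https://cdn.example.com/feed.xml
fly-prefer-region: nrt
HTTP 200
```

Repeating the same request once per region code is essentially what a check-all run amounts to.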
That's cool.
And again, it's all captured in that pull request and you can see what it looks like.
The check all one, Johannesburg, that's usually slow.
And the slowest one is Tokyo for me.
Sydney as well can be slow.
So we still haven't received the responses from there.
We should get that shortly.
You can see I'm pulling now 50 megabits, 20 megabits.
It's just slowing down.
And it's just the connections between here and there.
The last one, there it goes, Tokyo.
In 60 seconds, I pulled about 2 gigs, roughly.
It's a lot of data that gets pulled down.
The feeds, and all of that.
And anyone can run it.
I would recommend you not run this
because we have to pay for this bandwidth.
But our CI runs it just to make sure that everything works.
And if we look at every hour,
I think I'm going to tune this down.
You can see there were no more connections hanging.
So we go to the bottom of that as well.
It went away on its own, but if it ever comes back,
if it comes back on its own, we're going to know about it.
Exactly.
Now we have a system that is able to inform us when there's a problem.
So let's go to three.
We're on page number three.
This one, for example, took more than five minutes.
Right.
So sometimes when the connectivity is a bit slow, some regions can be slow.
That's when you get these timeouts.
So this is capped at five minutes.
The last one that failed was a while ago.
So you can see we're January 5th.
There we go.
There's one that failed.
January 4th.
Check all instances.
So let's see the run, and we'll see exactly which region failed.
Execution, NRT, that's Tokyo.
And as you can see, we have 100 seconds, right?
So if after 100 seconds, it doesn't download, it just times out.
And we were pulling data, but it didn't finish downloading the entire MP3.
And we're downloading 100 and something megabytes.
Very cool.
So, I mean, not cool that it didn't finish, but cool that was a while ago.
And we can actually test this.
Now, do we need to be doing such a large file?
Is that part of the test?
Or could we test a smaller file and still get the same results?
We could.
Yes.
This was a file that was reported.
So we need to find an MP3 file.
Absolutely.
I think we can also reduce the frequency.
We don't have to run it every hour.
This was always in preparation for this conversation.
What about episode 456?
Of course that one.
That's coming up.
That's the deepest rabbit hole.
So I'm leaving that.
I'm leaving that for last.
That's coming, Adam.
One thing I suggested, though, in our Zulip, but I don't think this is, I didn't check to see if this is even a thing, but to validate, you know, if the fly CLI could validate the TOML file for you. Because you could have checked the TOML file for syntax errors, or just...
Yeah.
Dos and don'ts, essentially.
And it didn't.
It does have a validation subcommand.
Syntactically it is correct.
The config is valid.
I mean, it was applied.
But because it combines two things it shouldn't, at least I would expect a warning.
Like, hey, you're using both HTTP services.
Yeah, validate syntax and validate, you know, expected, you know, a true TOML file config.
You know, don't combine or conflate two values, or overwrite one, or, you know,
just that kind of thing.
I would defensively do something like that in a CLI to protect my user from a poor config.
They could have just not been holding it wrong for so long.
Yep, I agree.
So it's the impact of that configuration indeed.
Yeah.
So this is something, we can see, again, the same logs. We can see this one, here we go, like 50 megabytes per second. That's 400 megabits. When we have these peaks, when we see this in the Fly metrics, it's usually when the benchmarks run, or when the checks run, because they put significant pressure on the instances, and we can pick them up straight away.
So that's what this is. All right, so remember this guy? This guy was saying, March 29th, so it's almost two years ago, this guy was saying we will run into all sorts of issues that we end up sinking all kinds of time into. So this guy had a good hunch. This is Jared. March 29th. And we just went through a couple of examples of issues that we had to deal with as part of this. But because of this, we understand the traffic, and we understand how the application behaves and the backends behave at a very deep level. So you're right, Jared. We did sink all sorts of time into it. How many lines?
Let's see how many lines do we have now.
So how many lines?
20 lines?
We have 590 lines in total of Varnish config.
It's more than 20 lines.
By the way, we have like the roadmap to 2.0.
This is 1.0 that we tagged and shipped.
It's solved like a lot of issues.
But that was the easy stuff.
Okay.
So for everyone that stuck with us, something really good is coming up.
And Adam was already mentioning it, episode 4, 5, 6.
There's something special about episode 4.
456.
So what is special about it?
What stands out to you, Jared?
Oh, it's just getting rocked with downloads.
So episode 456, OAuth, it's complicated.
By the way, this was recorded in 2021.
It was published, again, August 2021.
For some reason, it's been downloaded a lot in recent months.
It has over one million downloads.
This is the most popular episode.
on the change log ever.
The most downloaded episode.
It's crazy.
It's crazy.
So.
Oh, so you guys looked into this.
We did.
Yes.
We dug into this.
Okay.
I didn't know.
You guys were doing this.
So we just had a quick look to understand what is happening here.
So we have Honeycomb opened up.
Remember, every single request which comes through the Pipedream, through Pipely, every single request, we send to Honeycomb.
We're able to look at it.
This is the last 60 days.
And I have filtering done in such a way that I'm only looking at this one file: how many times has this file been downloaded in the last two months?
And you can see the peaks, right?
You can see, and by the way, this is gigabytes.
And the period is four hours.
So we are peaking at about a hundred.
Actually, this peak was here.
We had 200, almost 300, 300, 300, 400,
anyway, close to 400 gigabytes in a four hour period.
That's just too much.
All right.
I think so. Like, I know this is a great episode, a great conversation. I remember that conversation. It was good. But who is downloading this file 400, or actually more than 400, times every four hours, consistently, for months on end? A super fan. So we can see the different regions. Now, this is spread across the entire world. It's not just one region. This is really, really big. I think if this was a DDoS attack, this is what it would look like. And in the last six months, sorry, in the last two months, 60 days, we served 30 terabytes in San Jose, California alone. In Tokyo, we served 15 terabytes. This is a big number. And if you look in this column, the distinct IPs, the client IPs, we had over 10,000 IPs downloading this file. So this is not one or two IPs. This is thousands and thousands of IPs, which keep downloading this file over and over and over again. So I don't know how we would block 10,000 IPs. Right. The VCL would be crazy. Well, that episode was starring Aaron Parecki, who is a very talented person. And he is the co-founder of IndieWebCamp and a big fan of the IndieWeb, as well as OAuth, obviously.
So my hunch is Aaron's very interested in being the most downloaded episode ever.
And he controls a fleet of machines from all around the world.
And he points them wherever he wishes.
And he thinks, you know what I'm going to do?
I'm going to get at the number one spot on these guys' download charts.
And so I'm thinking Aaron Parecki is, you know, the man with the mask.
We pull the mask off,
and it's him this whole time.
What do you think, Gerhard?
I think that we need to speak.
You see, I don't want to name the specific place.
I think we need to go to Asia.
I think we need to visit a couple of cities in Asia.
Okay.
Find the IPs, which are responsible for this,
because this is a crazy amount of traffic.
Asia, it just so happens, if we look at it,
Asia is basically the continent
where we are getting the most downloads from,
because of this one episode.
And this is actually traffic being served.
This is not like head requests or get requests.
These are bytes being sent to thousands and thousands of machines in Asia every single hour.
So whoever is doing this, please stop.
Please.
It's a cycle.
So we need to like knock on doors.
We need to go over there and knock on some doors and say, excuse me, is this IP address at this home?
And then they might say yes and say, would you please stop?
And what's going on over here?
What can they possibly benefit from this?
Like, what could they be getting?
Maybe, maybe we're the speed test.
Someone is using us to speed test their connection.
Who knows?
Yeah, maybe.
That's the only thing I can imagine.
But that's a lot of IP addresses.
It is.
And it's across multiple regions.
Which?
Multiple data centers, yes.
So multiple regions, fly regions are serving these IPs, yes.
They're all coming from Asia, by the way.
Again, I don't want to mention any names, because there's no one, there's no bad guys here, right?
We just want to assume that someone left the oven on.
It's like the blinker being on when you're driving.
It's like, hey, you're not turning.
It's time to turn that blinker off.
So the way I can see us mitigating this, and this is a hard problem because of the number of IPs which are hitting us, is we can basically start blocking entire net blocks, entire network blocks. Unfortunately, some genuine listeners might be caught in this, and basically the changelog will not be available, or at least the MP3s will not be available, to a portion of users. The other one is, obviously, we can and we should, right, this is like the next problem, we should enable some throttling, because there's more stuff happening here. We don't have any sort of throttling. We assume fairness, we're assuming goodwill, we're assuming decency, and we're not seeing that here. That's the internet. So to be honest, whoever is doing this, and it's not LLMs, I had to look, we have that problem as well, but in this case, it's not LLMs. This is something completely different. So my hope is that someone that listens to this episode, maybe we put this in the intro, whoever's downloading episode 456, please stop, because otherwise we'll need to take the next step. I don't know, it's a bit of a cat and mouse game, but that's what will need to happen.
Because we need to pay for this bandwidth.
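For reference, netblock blocking in Varnish is an ACL plus a check in vcl_recv; a minimal sketch, using a documentation CIDR range as a placeholder rather than any real offender:

```vcl
# Hedged sketch; the CIDR below is a placeholder (TEST-NET-3 documentation
# range), not a real offender, and not what Pipely actually ships.
acl blocked_netblocks {
    "203.0.113.0"/24;
}

sub vcl_recv {
    if (client.ip ~ blocked_netblocks) {
        # 403 for anyone in the blocked ranges; genuine listeners inside
        # the netblock get caught too, which is the downside discussed here.
        return (synth(403, "Forbidden"));
    }
}
```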
This is only Varnish, right?
This is only the cache layer where this is happening.
This is only the cache layer, yes.
Yeah.
And so what mechanisms are in Varnish to do throttling or rate limiting, or just anything like that whatsoever?
There's VMODs, which are basically modules that Varnish loads that give it extra functionality.
One such VMOD, and I've looked at this, it is free and open source, is vmod_throttle.
Now, that means that we need to start keeping track of IPs, and it will use a bit more memory.
That's okay, we have more memory.
And then we need to start basically applying limits to how many downloads specific IPs can do.
And we can limit it to MP3 files only.
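A sketch of what that might look like, assuming vmod_throttle exposes an is_allowed(key, rates) call roughly like this; the exact signature and rate syntax should be checked against the module's own docs:

```vcl
import throttle;

sub vcl_recv {
    # Hedged sketch: only MP3s are throttled, so feeds and well-behaved
    # RSS aggregators are unaffected. The "10req/min" rate is illustrative.
    if (req.url ~ "\.mp3$") {
        # Assumed API: is_allowed returns how long the client must wait,
        # 0s meaning the request is within its per-IP budget.
        if (throttle.is_allowed("ip:" + client.ip, "10req/min") > 0s) {
            return (synth(429, "Too Many Requests"));
        }
    }
}
```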
So if we have a bot, or if we have, for example, like, I don't know, an RSS aggregator or something like that, we're okay serving those requests.
Because, again, that's what Varnish is meant to do.
The problem here is that we're serving a lot of bytes for MP3s, the same MP3, and that cannot be real traffic.
Yeah, I mean, even in this case, you can potentially tie it just to this MP3, like you just said, which is not an all-MP3s scenario. Like, if you request this MP3 with this kind of request signature of X per whatever, I mean, I didn't examine the actual signature of the requests, but that's how I'd probably investigate it: begin to isolate. Does that require us to write a lot of defensive code against
that kind of scenario? I don't think so. It's just some configuration. We just need to add more
configuration. And back to Jared's point, we're just chasing now new problems that we didn't even
think we would have.
But we have what looks to me like an actor that's not very, I want to say this in a nice way,
an unfriendly actor that is not very happy.
And they are very angrily downloading our MP3 over and over again, thousands of times
across thousands of IPs.
And this is not cool because ultimately we end up paying for this bandwidth.
That is not helping anyone.
But that's one. It's not the only one. So we have one more. So you can see here, for example, this is the last seven days. We have seven terabytes that were transferred in the last seven days. Seven terabytes? Maybe it's more than that. It needs to be more. And actually, the geo code does not exist. Okay. I was expecting to see more than that. Anyway, Asia is the one where we can see that pattern. But we also have, like in Europe, sometimes we have these spikes, and it's this spike which I wanted to focus on. We know that someone in Frankfurt, that connects to Frankfurt, downloaded the static favicon 170,000 times in the span of, I don't know, like an hour or two. So they downloaded this over like two, three hours. So you get, you know, requests like this that are putting stress on these instances.
I mean, potentially, that was like a pass request as well.
Yeah.
It went through the cache, which means that they must have had, like, a cookie set or something like that that was basically preventing the cache from working in this case, which, again, is how it's supposed to work. So anyway, I think that was unfortunately not the best thing that we could have ended on, but it's a thing, and it's food for thought, like more work to be done. There's many things that we didn't get to talk about, that we didn't have time for. For example, we didn't talk about the nightly. By the way,
Nightly now is being served by the Pipedream as well.
And the reason why we had to do this is because that sometimes would get scraped, would get hit really heavily. It's a very small app, it's NGINX, but if I open it, so let's just click on that one, and that's pull request 46. Before, it was basically topping out at 141 requests per second. Now it's 1,300, so it's almost like a 10x, an order of magnitude faster. The latency went way, way down. And the only thing we had to do is basically put Varnish in front of it.
Nice. Well, that's nice.
Yeah, that's one more thing there.
And you can go and have a look at how it works.
There'll be a benchmark there, a small benchmark.
That's it.
We have, we have last one for the road.
Before we do that, anything else we want to talk about before I share one last thought?
I suppose, what do we do? You know, if we know these downloads are happening, and we're here on the podcast just politely asking them to stop, do we just let it keep happening?
Well, we could set up some sort of throttling.
I think it'll be the easiest thing.
Now, that will impact everyone.
I don't want to start blocking, again, IP ranges, net blocks, because we don't know
who's going to be caught there.
They may change to other IP blocks.
So that's entirely possible.
We don't know how this will work.
We can't block an entire country, an entire continent, especially if it's a big one. I don't think that's
reasonable. So really throttling is, I think, the fairest thing. And then we can throttle MP3
specifically. Because we do have, for example, I see them, like, we have a Python client and a Go client that every week come and download all our MP3s.
I don't know why they do that, but every seven days, they basically request every single MP3 that we have. So they're scraping the website and then pulling everything down.
I don't know why.
Yeah.
Again, the more I was looking at it,
and because I was working so deep in this,
I started noticing these behaviors
that you would normally not see.
So it's one of the advantages, I suppose,
of being, of working so close.
Yeah.
With the traffic, with all the requests,
and having this level of understanding and visibility
into every single request.
So it really helps, down to the IP level.
Something like that, though,
like the Go client and the Python client,
Where would you, would that be a Honeycomb thing?
Where would that be?
Yeah.
It's Honeycomb, yeah.
You can filter by user agent, for example.
And you can see that there will be, for example... say no, I don't want to show any IPs or anything like that.
So that's why I'm wary of screen sharing that.
Yeah.
But once we start digging into that, you can say group by client agent. And you can say filter by MP3s, so, like, URL contains MP3, and that will be able to group. And you can say, oh, and by the way, only show me where there's more than, for example, 100 downloads. And then you'll start seeing the outliers, which are the clients that are downloading certain MP3s, or MP3s in general, excessively.
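Expressed against Honeycomb's Query API, that query would look roughly like this; a hedged sketch, where the column names (request.user_agent, request.path) are assumptions about how Pipely names its fields, and time_range is in seconds (60 days here):

```json
{
  "time_range": 5184000,
  "breakdowns": ["request.user_agent"],
  "calculations": [{ "op": "COUNT" }],
  "filters": [
    { "column": "request.path", "op": "contains", "value": ".mp3" }
  ],
  "havings": [{ "calculate_op": "COUNT", "op": ">", "value": 100 }],
  "orders": [{ "op": "COUNT", "order": "descending" }]
}
```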
Now, that can be spoofed. That's the other thing. Like, we have, for example, the request agent, the user agent, as an empty string. That also happens, right? Because you don't have to send the header if you don't want to.
Yeah, you can also send whatever you want to.
So that can be spoofed.
Yeah.
Yeah, it's like whenever you build systems like this,
and then even when you observe them,
I guess you don't expect,
that's what I originally thought,
but you kind of hope that clients,
aka people, behave.
You know, they're going to use the system
for the system's purpose,
not to once a week download and scrape the entire thing.
And I mean, in that case, somebody could have their own web archiver, and they could have altruistic reasons for it.
I think that's kind of silly.
But, you know, once per week downloading the entire contents onto somebody's disk seems like: I want your thing, and I want to keep getting your thing, and if it ever changes, I want to make sure I have that snapshot.
I don't understand it.
It doesn't make any sense.
Like, what would make anybody do that?
What is the purpose and motivation to keep doing that, to even commit the compute, or the script, or the time to do that?
Like, what are they getting from it?
I don't know.
We need to go over there and knock on some doors, man.
I'm going to ask them.
Yeah.
Why are you doing this?
Every door.
It's hard for you.
In Asia.
Do you listen to the change log?
Yes, I do.
How many times?
Yeah.
Tell us about four, five, six.
You know what four, five, six means, don't you?
Yeah.
It's just, see, this is how, so this is, I think, a really delicate and a really important point to discuss, because this is how good systems become bad systems.
It's true. Yeah. You have to treat everybody bad. Exactly. Like, we don't want to be doing this, but we are forced to do something against something which is good. So it's not benefiting anyone, and we have to step in and do something about it. Now we have to do it. I was expecting this to stop, but it's still going, even to this day. We made Varnish, I mean, now that it's stable,
it's able to serve more traffic.
It's able to, like, we just had like the biggest spike
because now the system is more stable.
But it means that bad actors, again,
I shouldn't be using that.
Unhappy people.
Unhappy clients.
Use it.
Unhappy clients.
Yeah.
The only person you can offend is the one who's doing this
and I'm fine with it.
Yeah.
They need to knock it off.
Here's a thought.
This might be a cudgel.
But if we're trying to solve the problem of them taking our bandwidth for something that's no longer relevant or interesting and has been out there for years...
What if we could just toggle certain episodes?
And this might be a cat and mouse game as well.
But at a certain point, it's like, well, just give them the R2 URL and not the CDN URL.
And just let Cloudflare deal with it.
You know, like just let them download it directly from Cloudflare.
And we're just out of the equation then.
We don't care about the stats.
We don't care about anything.
We're just like, you know, we serve this file plenty of times through our CDN.
Now we're going to just let R2 serve it.
What do you think about that idea?
I can see this being a very simple fix for this specific episode, right?
Because we can just serve basically a Location header, we just do a redirect, and that's it.
We're done with it.
So it'd be like another synthetic response.
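That synthetic-response redirect is a small, well-worn VCL pattern; a sketch with placeholder paths and a made-up R2 URL, since the real ones aren't shown here:

```vcl
sub vcl_recv {
    # Hedged sketch; the episode path is a placeholder, not the real URL.
    if (req.url ~ "^/path/to/episode-456\.mp3") {
        # 750 is an arbitrary private status used to signal vcl_synth.
        return (synth(750, ""));
    }
}

sub vcl_synth {
    if (resp.status == 750) {
        # Hand the client straight to object storage; placeholder URL.
        set resp.status = 302;
        set resp.http.Location = "https://r2.example.com/episode-456.mp3";
        return (deliver);
    }
}
```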
The question is if they're actually malicious, then they switch to a new episode and start doing that one.
Exactly.
Exactly.
And we have other clients which are, for example, we've seen that, pass, right? They're basically busting the cache and purposefully going to R2 directly, and Varnish almost acts like a proxy in this case. Right. So we have that as well.
Every now and then we have this random client that comes and downloads all the episodes, and that's not the problem. So I think that some sort of a throttle would make sense, which would keep the system fair to everybody. But the throttle will need to be high enough so that it doesn't impact anyone else.
Now, if, for example, our audience grows, or we become more popular and we get more requests, obviously we would need to be aware of where the limit is and start increasing the limits, right, once we are throttling too much, maybe.
That seems more long-term, and it seems like a more, I don't know, well-engineered approach, in a way.
But certainly the simplest thing would be just like, take this one URL.
I mean, that could be done in minutes, roll it out.
And then we would stop this abuse for this specific MP3.
That would be the easiest thing, for sure.
So yeah, I can see how pragmatic that approach is.
And I like the pragmatism.
Well, it's at least worth checking to see if, you know, the mouse is still alive over there.
Right.
You know?
Yep.
And if they are, well, then we'll know that this is a cat and mouse game.
But if it's just like somebody left the blinker on, we're just going to turn their blinker off for them.
Yeah.
And see if it just, the problem goes away.
And if it changes to a new MP3, then yeah, we need more generic solutions.
Yeah.
We may not need that at all.
I do have to say that the internet these days is very different from the internet even like a year
ago.
With the rise of LLMs and AIs, I'm starting to see patterns in our traffic, which are unlike
any other time.
We have these very big spikes when a lot of data is being requested in very short periods of time from, I mean, the user agents, they don't make much sense.
I mean, I know they're spoofed.
There are many IPs which are being used.
So it's almost like there's, I don't know, some system which wants a lot of our content. And I'm seeing silly things, because some requests just don't make sense.
Like, for example, what benefit does the static favicon have?
Like, what's up with that?
That just makes no sense.
It's a small file.
Maybe it's a heartbeat.
Or a version of a heartbeat?
Maybe.
But this is the first time I've seen this specific file being downloaded this many times.
I haven't seen this before.
Which makes me think it's a trend, that we'll start seeing more and more requests that don't make sense.
And then you start having to set up like some form of protection for all sorts of clients that are just doing the wrong thing.
You need like a defensive layer by default.
Exactly.
Exactly.
Yeah.
And something that would be fair to regular clients.
Like, for example, when I want to do a benchmark. I mean, sure, it's me, and I wouldn't want other people to do that, because I'm testing the system, making sure the real-world, production system, everywhere in the world, is working correctly. And I'm aware of what that means and what it costs. And by the way, my IPs are removed from all the stats, because otherwise you'd see those massive benchmarks. So we account for that, but we can't account for all these weird clients.
it's a challenge.
I think it's a good one,
but it just sets us up to,
you know,
when you become older,
it feels like this is more like,
like an adult problem.
Right?
So we like got the thing barely working.
We got it out there.
We know we made it stable,
reliable,
all that.
Now we're hitting almost like,
feels like a new layer of problems.
And then this to me is like a hint
as to like the next phase.
Oh,
to be a kid again.
Yeah.
Well,
right.
One positive thing I think is the
robustness of our observability.
Like being able to have this visibility is great.
Because otherwise we're like, you know, wow.
Pat ourselves on the back. Aaron Parecki, let's get you back on the pod, because, man, you are big all over the world.
That's amazing.
Look at them downloads, you know?
So what's your one last thing for the road ahead?
I agree.
What's my one last thing?
Yeah.
So keep it short.
We'll keep it fun.
I mentioned the Christmas tree.
I mentioned the various things which I had going over the holidays.
So make it work club.
That's the place.
You're there.
Both of you are there.
So you can join whenever you want.
Next Thursday.
Yeah, next Thursday.
I'm going to talk about a hundred gigabit WAN.
The hundred gigabit WAN.
So why would I need such a thing?
Smoking.
It's smoking, for sure.
So the CCR2004 has, like, four CPUs, has multiple 10 gigabit SFP+ ports, it even has two SFP28 ports, but it doesn't have a switch chip. And people that know a little bit about hardware, you want a switch chip for the hardware offloading, L3 and even L4. So after I bought the CCR2004, it was almost like a Christmas present, I thought surely this will be enough for the rest of my life,
and no, I had to get the flagship.
So I'll be talking about that, the land, the setup, quite a few things coming up.
And it just goes to show how much I enjoy the hardware side of things as well, the networking
side of things.
Like, I shaved two milliseconds off my WAN.
It's amazing.
Like little things like that, you know, like, it was already good.
It was like sub five milliseconds, but I wanted sub three milliseconds. It is now 2.4 milliseconds. And what does it mean, like, why would I do this? So first of all, I'm all about improving, and every winter I improve the network. In this specific instance, I wanted the pages just to be snappier, things to load a lot quicker, to handle a bit more traffic, but also to not have an impact. Like, I was running that benchmark, like, 2.5. Look, I'm going to do another one right now. Let's see. I have a speed test right here. Speed test, London, let's go for this one.
So we're recording, we're streaming, right?
And I'm just pulling 2.5, 2.6 gigabits down.
And there's no interruption on my network.
Right. So it's just my bread and butter.
You know, that's how I work.
And by the way, if you see any buffering or any slowing down, let me know.
I see Adam a bit more pixelated.
Maybe you can see you pixelated too. I don't know.
But yeah, I just pulled six gigabytes, three down and three up.
And it's just
what I do every day.
I work with this stuff and yeah,
enjoy it.
And by the way,
this is the slower gateway router.
So I'm getting the proper one set up.
And I'll talk about that.
And there's so many things there.
Like, VLANing is quite a thing.
I have a new IPV4 block, by the way.
So some would say that I'm preparing for hosting something.
And maybe I am.
I don't know.
We'll see how that works.
But I just realized that my home connection, obviously I couldn't serve all the MP3s that were being downloaded.
Like that would really cripple my connection if that was happening.
But I'm at 2.5 gigabits.
The next one will be 5 gigabits.
And the hardware can do it.
And the 5 gigabits, I mean, that's like a decent server.
Sure.
And if you can do 5 gigabits all day every day.
Sorry.
Yeah, gigabits per second.
That's pretty decent.
So I'm just waiting for more internet.
I'm going to say you're going to have a 100 gigabit WAN, but you're not going to have a connection for it, right?
Right.
So very few places in the world have that.
So if I was in Switzerland, I would get 25 gigabits.
Now, would you move?
Would you move for this?
Of course.
The only reason to be.
Yeah, it's the 25 gigabit connection.
But I know that 100 gig is coming.
So we'll see.
Either they ship it by the time I move, or I move and then they ship it.
So it's one or the other.
Okay.
The important thing is I have the router to handle that.
You'll be ready.
You will be ready.
Exactly.
So I'm a prepper.
I'm prepping for that.
Prepping for a good internet.
And this is just like a... And interestingly, five years ago, when I got the previous router, I did the same thing. It's like a forum post, with like a follow-up. So I just did a follow-up recently, at this milestone.
So I've been at this for some number of years.
And I like optimizing my network
and making sure that it's in tip-top condition.
Relentless.
I love it.
So relentless.
Good stuff.
Gerhard.
Well, that's a happy note to end on, right?
That's a happy note to end on.
Observability in 100 gigabit?
No.
There you go.
All right.
Well, the good news for Kaysen is we have a lot to work on.
Always.
Always.
That's what it seems.
Yeah.
We know how to pick them, don't we?
Oh, my gosh.
The rabbit hole goes deep and we keep going in.
Kaizen.
My friends.
Kaizen.
Kaizen.
All right, Kaizen 22 is in the bag.
Join the discussion in our community Zulip, head to changelog.com/community to sign up for $0.
And of course, check out all of Gerhard's passions at makeitwork.com.
Thanks again to our partners at fly.io and to our beat freak in residence, Breakmaster Cylinder.
Next week on the pod,
News on Monday, Damien Tanner from Layercode on Wednesday,
and Techno Tim catches Adam up with the state of home lab tech on Friday.
Have a great weekend.
Recommend us to a friend if you like the show, and let's talk again real soon.
