The Changelog: Software Development, Open Source - Kaizen! Let it crash (Friends)

Episode Date: January 17, 2026

Gerhard is back for Kaizen 22! We're diving deep into those pesky out-of-memory errors, analyzing our new Pipedream instance status checker, and trying to figure out why someone in Asia downloads a single episode so much.

Transcript
Discussion (0)
Starting point is 00:00:15 Welcome to Changelog & Friends, a weekly talk show about how good systems become bad systems. Thanks as always to our partners at Fly.io, the platform for devs who just want to ship, build fast, run any code fearlessly at Fly.io. Okay, let's Kaizen. Well, friends, I don't know about you, but something bothers me about GitHub Actions. I love the fact that it's there. I love the fact that it's so ubiquitous. I love the fact that agents that do my coding for me believe in it. My CI/CD workflow begins with drafting TOML files for GitHub Actions.
Starting point is 00:00:58 That's great. It's all great. Until, yes, until your builds start moving like molasses. GitHub Actions is slow. It's just the way it is. That's how it works. I'm sorry. But I'm not sorry, because our friends at Namespace, they fix that.
Starting point is 00:01:14 Yes, we use Namespace to do all of our builds so much faster. Namespace is like GitHub Actions, but faster, I mean, like, way faster. It caches everything smartly. It caches your dependencies, your Docker layers, your build artifacts, so your CI can run super fast. You get shorter feedback loops, happy developers because we love our time, and you get fewer of those "I'll be back after this coffee when my build finishes" moments. That's not cool.
Starting point is 00:01:43 The best part is it's drop-in. It works right alongside your existing GitHub Actions with almost zero config. It's a one-line change. So you can speed up your builds, you can delight your team, and you can finally stop pretending that build time is focus time. It's not. Learn more. Go to namespace.so. That's namespace dot S-O, just like it sounds. Go there, check them out. We use them. We love them. And you should too. Namespace.so. How else would you learn? Let it crash. Exactly. The best things happen when things fail. Seriously. If it's in a controlled way, right? I think that's like something which
Starting point is 00:02:31 isn't said. It's implied. It has to be a controlled failure where you have the boundary and things will not blow up. I mean, they'll blow up, but like, you know, like the fireworks sort of blowing up, where it's a controlled explosion. Yeah. Right. Tiny little crashes to learn from. Welcome everyone to Kaizen 22 with the incomparable Gerhard Lazu. He's here to let us know how he lets it crash. It's like that song, let it snow, let it snow, let it snow. Only you know how to replace it. Hey, Gerhard, how are you?
Starting point is 00:03:07 Hey, Jared. I'm good. Thank you. Thank you. Had a great holiday. It was a great couple of weeks where I've managed to finally disconnect. It's been, I know, like 20 years since I had two weeks completely off. Even my holidays are only a week.
Starting point is 00:03:22 So this was very different, very enjoyable, and I feel so refreshed. So I'm firing on all of the cylinders. You unplugged and now you're plugged back in, pretty much. Plug it in. I stopped it and I started it. And it's like brand new. It's like Glade, man. I'm like Glade over here, man.
Starting point is 00:03:41 Plug it in, plug it in, you know what I'm saying? Smell the scent, the fresh New Year's scent called 2026. Some people are going to say this is going to be the biggest, best year ever. I've heard it said. What do you think? They keep saying that. I'm excited about that. They said that about 2020. We have to admit it was off to a killer start. I mean, it was really going well. Right. Pun intended, killer start. It was COVID. Pun intended. Killer start. That was 2020. 2020 was the year of COVID and everyone's like, oh, this is going to be like the best year ever. And then we had three years of misery.
Starting point is 00:04:18 So I think, I think, I just want like an easygoing year. You know what I mean? Last year, 2025, 1st of January, we were building shelves. We were like redoing like studies and whatnot.
Starting point is 00:04:31 And the whole year was full on. Like it was like, it was nonstop. Every week there was something significant happening. And this year would like just like to, for it to be a bit more chill, maybe a bit more meaningful. So that's one way we're thinking.
Starting point is 00:04:46 But how about you, Adam? How are your holidays? My holidays were filled with barbecue and good times. Wow. Even in winter. So barbecue never stops. It does no seasons. Never stops in Texas.
Starting point is 00:04:58 Actually, just to just to shower you all with a few of my picks from my most recent barbecue adventures. If you're in Zulip, go to the general channel. Look for barbecue with three bangs after it because why do one bang when you can do three? Bang, bang. Some recent ribs. My gosh. My ribs method is on point.
Starting point is 00:05:18 My spatch cock chicken method is on point. No one is disappointed at my barbecue joint. Very nice. Look at it. Add some meat on this slide. That's what happened in real time. Wow. Real time meat added.
Starting point is 00:05:34 This is like, yeah. This is intense. And that's again, just to be clear, it's Adam's barbecue. Okay. So like no joking aside.
Starting point is 00:05:42 We're talking about barbecue. I mean, I think we have to leave it there. Let's move on. I think we have to leave it there. I didn't show a burger, but I do make a mean burger too. Thank you, Gerhard for assuming that is something I do rock really good. My smash burgers are on point.
Starting point is 00:05:59 Very nice. Very nice. I'm looking forward to that. One day. My favorite Christmas tree. This is what it looked like. What is that? And for those that are listening, it's a networking cabinet. There's lots of blue lights.
Starting point is 00:06:14 flashing. This is happening in the loft. You have many terabits of network throughput. There's some switches. There's unified. There's micro-tick. This is maybe five years in the works. And every Christmas, I take time to improve it little by little. So this year I went really crazy. I read it like the whole thing. I did, I read it like the whole, for example, DHCP network, VLAN. Oh, man, it's beautiful. Your V-Lens are beautiful. They are. I want to be a guest on your network, man. I'm going to get blot from everything, okay?
Starting point is 00:06:50 Yeah. Well, well, well, there's like a big story happening in the background. And it is, it is going to be, I think this is amazing. This is, this will be the best network that I have run, like in my life. But the blue and the darkness and it's like, that was like one more Christmas tree in our house and this was it. where I would just go and tinker for a few hours in between Christmas dinner and all the Christmas festivities. So it was nice just to spend a bit of time tinkering with hardware.
Starting point is 00:07:22 And I'm sure that many of you listening, when it comes Christmas time, when things start quieting down, you get like the little projects that you didn't have time for throughout the year and then you, you know, have some fun. So I'm wondering, did any of you, did anything fun this Christmas, But nerdy fun, that's what I mean by that. Nerdy fun. Well, I got upset with something.
Starting point is 00:07:47 And so I decided to just let it roll. You know, I'm trying to say. I got upset with the amount of RAM usage on my machine. And while I like the application, I was like, you know what? I'm just kind of tired of having 4 gig. I think it was, you know, it was like 1.2 gigs of RAM being used by clean my Mac. fancy little utility application helps you tune and pay attention and stuff like that. And I decided to remake it.
Starting point is 00:08:16 That was it. So I remade it. It's called MacTuner. I know there used to be a MacTuner.com, which was, I think, a Mac Magazine, I believe. But MacTuner fit. I might change it. Who knows?
Starting point is 00:08:27 But for now it's called Mac Tuner. It does all the things, all the things. Analyze, clean up, uninstall. And not just that fake uninstall. the real one where you get the dirty dirties out. You know what I'm saying? The dirties, all the dirties are out. Okay.
Starting point is 00:08:41 My mind is still on the dirty burger that you mentioned earlier. Yeah. I mean, that's about as nerds that can get. I mean, I made a little utilities for me for now. Soon to be open source, though. Soon to be. It will be soon. Yeah.
Starting point is 00:08:55 I mean, why not, right? Share it with the world. Well, I didn't create a Mac tuner, but I found one. I also was thinking, clean my Mac, you know, like, how long am I going to run this thing? And the answer is, as long as I ran it, because I'm done now. I found a tool called Mull, M-O-L-E, which is a command line macOS cleaner that does like everything. So you may have some competition here at it. Maybe you can come out and like throw some lows down.
Starting point is 00:09:23 Like here's why I'm better than mole. It's got 2E. It's all command line based. It does cleaning, optimizing, uninstalling, daisy disk, you know, Explorer. Oh gosh. All from, yeah. I'm feeling it. I'm feeling intimidated over here.
Starting point is 00:09:39 He's starting to sweat. He's starting to sweat. I think he just changed his mind about open sourcing it. Here's your, here's your domain name idea, Adam. Better Than Mull.com. You know, better than grip.
Starting point is 00:09:50 That's good. I could do that. So I've been using that. I'm very excited because, uh, who doesn't want to just have all the things right there in their command line? And yeah. I didn't spend any tokens on it.
Starting point is 00:10:00 Adam's got some tokens involved, but his also works the exact way he wants it to. Yeah, yeah, absolutely. Mine leverages some recast stuff as well. It's kind of cool. Sweet. Open source that sucker. One day. Which day is that?
Starting point is 00:10:14 Not today. Definitely not right now. But it's going to be one day. One day. There's a bigger launch awaiting, as I'll say. There's a bigger launch awaiting till I'm going to open source some things. Been using app cleaner for many, many years. There's no TUI.
Starting point is 00:10:33 There's no CLI. is just like a regular app. It's a really old one. It's like drag and drop onto it, right? Pretty much, yeah. And you also have a list of applications. But it's so old, it is difficult to find it these days. And it has an update in a very long time.
Starting point is 00:10:45 So I will check Moll out. Mole's really cool. Rue install Moll and you're done. So you can check it out right here while we're talking. And I liked App Zapper. And I think App Zapper doesn't exist anymore. But the cool thing about that was that it would literally make the zap sound as it. Yeah.
Starting point is 00:11:01 You drop your app on and it zapped it. And I just like that sound. That's the only feature that your application needs to have. If it zaps. Mold is not zaps. So there you have. Make it zap. It's our tagline.
Starting point is 00:11:12 I actually make it zap. Make it zap. There you go. I think that's a very good debate actually. I know, and everything. What about you? Cool.
Starting point is 00:11:19 Besides your Christmas tree, did you? I will come back to that. I will come back to the Christmas tree. This guy's got stories, man. Oh, oh, yes.
Starting point is 00:11:26 It's like, I have to, I have to tease them and be very disciplined because there's too much stuff. So I have to be very careful because it will be an hour and I will not shut up talking about this, this thing.
Starting point is 00:11:38 I mean, it's just like, anyway. So we will come back to that, I promise. Okay. Last time, when you finished Kaizen 21, this was one of the last thoughts that we shared, which is what's next. So bam, remember bam, that happened live.
Starting point is 00:12:00 OOM crashes, out-of-memory crashes, and a bunch of other things. The good news is that only one thing happened. O.M. Crashes. I don't know one thing to talk about. But this rabbit hole is really, really deep. Okay. All right.
Starting point is 00:12:14 Take us down the rabbit hole. The O-O-O-M. Out of Memory. Who remembers this book? Erlang in Anger. Erlaine in Anger. Stuff Goes Bad by Fred Hebert. Ferd.ca.
Starting point is 00:12:28 Now, I remember learn you some Erlang for great good, but I do not remember. this one in particular. So I'm not sure why the other one in my radar because he wrote both of them at the seams. But when did this one come out? Wow. So this one, if I look, I just switched to the browser in 2016,
Starting point is 00:12:48 2017 while he was still at Heroku. Remember Heroku? Those were the days. So about 10 years ago. And Fred, I mean, he's just like, if you don't know his blog, I mean, it's just amazing. I'll just click it very quickly. just to have a look. Oh, I think it's one of the best blogs out there.
Starting point is 00:13:07 There's so much goodness here, so much. But one of my favorites is queues and queuing and how cues don't protect from overload. So queues don't fix overload. And this is so relevant to today's conversation as well. But there's a lot of stuff in the Erlang ecosystem. And there's many, many things that Ferd wrote over the years. that are so relevant to today. So if I click on download PDF, right, by the way, this is like a, it's amazing.
Starting point is 00:13:40 This book is open source. You can download it, open source freely available, creative comments license. And I'm going to make this a little bit bigger so we can see what's happening. And if I search for Let It Crash, it's page number one. It's in the introduction. Page one. Page one. And this idea of Let It Crash really comes from the,
Starting point is 00:14:02 Erlang ecosystem. It's very well-renowned there because of how the Erlang VM works and how all the processes and the supervision trees. It was built this way. And we know a thing or two about Erland, Jared, right? Because the application, Elixir, the Phoenix framework, runs on the same principle. I know a thing and you know too. So that's how we get to a thing or two.
Starting point is 00:14:25 And Adam, I'm sure he knows the big one. But we don't know whether he's going to share it. The point is, the point is, when you think about let it crash, Jared, Yes. In your, like from your development experience with Erlang, with Elixir, Phoenix, is there any situation, any moment where you could experience it and you realized, huh, that's nice? When I let it crash. When you let it crash.
Starting point is 00:14:48 Well, it's nice that the beam seems to handle a lot of the problems with letting it crash. You know, it just goes again or there's a supervision tree and things watching each other. Yeah. And I don't have to think about it very much. I can't think of like an instance in development where I was like, this is really useful, but I'm sure you could come up with one. Yeah.
Starting point is 00:15:06 So you know when you write code, we tend to write code very defensively. Typically try a catch. So you feel like you need to account for every single scenario. And the let it crash philosophy is about not preventing failure, learning from it.
Starting point is 00:15:22 What that means is you need to have a context where it's safe for things to crash and the overall system will still remain stable. So how can you build a resilient system? Really this is about resiliency, where the core of the system will remain running and the system as a whole will remain running even though parts of it may experience failures. But those failures will not bring everything down. And that's really important. So fewer try, catch blocks, don't code defensively, let it crash and separate the code that solves the problem from the code that fixes the failures. And the more you can lean into the framework or the
Starting point is 00:16:03 VM or whatever you have, the system to deal with failures, the better off you are to focus on the things that are unique to your application. Yeah. And Erlang is well renowned for that. Kind of the opposite philosophy that Go took as I write some Go code and I write some elixir code where with Go it's like handle every error condition right after you potentially raise one and make sure There's no error. And if you're not dealing with it, then you're not writing robust software. And the other philosophy is let it crash and deal with it elsewhere. I think they're both legitimate, depending on what you're building.
Starting point is 00:16:40 Agreed. Well, in our case, we had a lot of crashes to deal with. Yeah, we're telling me, gosh, Erling style. So what we are going to have a look at is all the times that the pipe dream has been crashing since our last Kaizen. So since Kaizen 2021, which is October 17th, we had a lot of crashes. And there's a certain property about the system
Starting point is 00:17:07 and this is Varnish specifically that made these crashes pretty okay. And the property which I'm referring to is when you start the Varnish D, the demon, Varnish itself runs as a thread and you have many, many threads that do different things. So when we had these out-of-memory crashes, All that happened, the thread was killed, which means that the system as a whole didn't crash.
Starting point is 00:17:33 The VM didn't, the firecracker, VM didn't crash. The application needed to restart. It was just a thread that was using too much memory and it restarted within seconds. As in maybe two seconds and everything was back to normal. Obviously, the cache was cold, but it was good. I mean, the VM, and that's why the memory looked a bit interesting in that it doesn't release all the memory. the VM doesn't restart. There's not many hangs.
Starting point is 00:17:58 It restarts and it crashes really, really quickly. So that's a nice property. Well, that confuses me. So how does Fly know about it then if it's just happening inside of Varnish? So it's looking at the process. It has looking at the process ID. Which process uses the most memory? And it's the same process that's asking for more memory.
Starting point is 00:18:13 So it just basically sell, it will just send a signal to that process and kill that process. But that is just a thread that maps to a thread. So Varnish itself didn't crash. is just a thread that maps to a process ID that crashed and then it was restarted by the varnish demon. Okay, so where is fly involved in that? Because fly is aware because I see all these fly notices and I get the fly emails.
Starting point is 00:18:35 Right. So fly is aware that there is a process on the machine that is using too much memory and more memories being requested. And then it looks like, okay, which process do I kill? And in this case, a process with the most memory will get shot and we'll get killed. So Fly as a platform can actually reach it and kill that process without killing the machine,
Starting point is 00:18:58 rebooting the VM or the Firecracker or whatever. So the Fly platform, it integrates with that functionality, which is a kernel, it's a Linux functionality. That's why like an out-of-memory crash would happen even like if you have a single machine, you have too much memory, you don't have any swap. How do you basically give more memory when there's no memory left? and when the system is becoming unstable. So then you get like just a single process
Starting point is 00:19:24 which gets killed. In Fly's case, they surface that. They surface the fact that there was like an out-of-memory crash. There was an out-of-memory vent. And they send you an email when that happens. It doesn't mean that the machine had to restart. It doesn't mean that he stopped serving traffic. It just means there was like something that just had to go away
Starting point is 00:19:42 because it was using too much memory. And I say too much memory. Obviously, it's a bit more complicated than that because something was asking for memory. The kernel didn't have any more memory to allocate, so it just had to look at what needs to be killed so that I can allocate more memory because something is using too much memory.
Starting point is 00:19:59 And it just so happens, it would be this process and this thread. So how many crashes do you think that the Pied Dream had since Kaysen 2020, 2021, since October? So we're talking about three months, maybe a bit more than that. So Gerhard has presented us,
Starting point is 00:20:17 a multiple choice quiz. A is 20, B is 40, C is 80, D is 160. Now I know that I personally receive an email every time this happens. And so I have a little bit of a feeler into this. I delete them, so I can't go do a quick search. Adam, do you get emails when these fly things crash? I don't. Okay, good for you. Not to mine. I know. It's enough I do. They're in a box that doesn't get looked at. You've been saving on some email bandwidth. But they do know, because when we send the email, so let's go back to this one. if I click on this one, let's take this one,
Starting point is 00:20:51 and you can see everyone that gets an emails, I'm just going to make this a little bit bigger. So you can see services, Jared, the Adam and Gerhard. I do get it. So there must be a filter. He just doesn't look at it. Superhuman saving me.
Starting point is 00:21:03 Nice. That's okay. So what do we think? Good thing other people are looking at it. It's not an Adam problem. That's the thing. So that's a good thing. He's doing the right thing.
Starting point is 00:21:14 He's just saving his inbox for more important messages. They ran out of L. I'm on that to the side. So I feel like 160 is too many. I don't think I've gotten 160 emails since October on this particular thread. 20s feels like not enough. I've certainly gotten more than 20 emails. So I'm between 40 and 80.
Starting point is 00:21:36 And I'm going to think that, gosh, that's a tough one. I'm going to go with 40. Adam, what do you think? I'd go at 40 as well. Oh, I got it. Yes. 43, exactly. The price is right.
Starting point is 00:21:48 That price is right. All right. Cool. Yeah. 43 crashes from October to December through the end of the year. Yeah. And obviously there were like periods when we had quite a few. So if we were to think about what could be happening in Varnish that it's running out of memory and crashing.
Starting point is 00:22:14 So this is us trying to think about. about the sort of traffic that we serve, trying to think about everything, I mean, now we see every single request that hits change log, the CDN as well, and it's a lot of requests. Yeah. So there's something
Starting point is 00:22:31 in the system, there was something in the system that was using way too much memory. And as a result, the process or the thread in this case was crashing. I mean, I could guess it, but I might even have some insight. So
Starting point is 00:22:47 Should I just say it or do you want to add on the guess? I mean, my guess based on also I saw some emails flying through, but already I would have suspected that we just have too many large files. These 60 to 80 to 100 megabyte MP3 files loaded into memory, you know, flying every which direction. And you just can't load up that much memory without some sort of fancy freeing mechanism. And it's just trying to hold all these MP3s in RAM, I think. and I just can't do it.
Starting point is 00:23:18 So that's my guess. Yeah. That was a good guess. And I think the next question is going to be to the audience. Because we know too much. How are they going to answer? It's not real time. Well, just think about it.
Starting point is 00:23:30 Like, we will give some time for people to think. Okay, we'll be like a delay here. So if they have, what's it called? The feature where you pause, silence, you skip silences on, they're not going to have any time to think about this. Right. Okay. So quickly, turn that feature off.
Starting point is 00:23:44 Give yourself some time to think. Go ahead. Or pause. we can also say pause. Now it's a good time to pause. And then what could be the problem? So you're right. All those large files, we had all the MP3 files, many, many MP3 files. They're large, all trying to be cached in memory. And that was a problem. So what is many? Well, we have thousands at this point of MP3 files across all the podcasts, like since the beginning of time. Large. Large means anywhere from 30 to 40 megabytes to 100 plus megabytes.
Starting point is 00:24:22 So that's, I mean, just think if you had to load the 1,000 files that take 100 megabytes, that's a lot of memory that you need to have available. And the problem is that once you store these large files, as we discovered, you get memory fragmentation. In that, imagine that you have all the memory available, you keep storing all these files, and at some point there's no more memory left.
Starting point is 00:24:47 So what do you do? Well, you need to see what can evict from memory so that you can store the new file. So imagine that you evict a few of those objects, but maybe they aren't big enough, and you haven't evicted them fast enough. So then you have this big file that can't fit anywhere because the sizes, like the holes that you have in memory, aren't big enough for this file to fit. And there's no defragmentation or nothing like that that runs in the background, which means that even though technically you kind of would have space in the memory
Starting point is 00:25:15 for the specific files you may not. And then it's, it can't be stored in memory. Now, the thing in Varnish is actually called, I kid you not, N underscore LRU underscore Nuked. So I think the connection to the nuke and to the book and to let it crash is right there. So LRU Nuked basically, it's like a forced eviction. So it's an event where an object,
Starting point is 00:25:45 has to be evicted from the cache just to make room for a new one because the storage is full. So you can see how many times this has happened. And that's like an important metric that if we look at, we can see we had too many of these events, right? Like many objects were being nuked from memory
Starting point is 00:26:00 to make room for new objects, but sometimes they wouldn't fit. So how badly did it nuke? Because we can measure this, we can look at this. And this is what it looks like from a memory perspective. So you can see that
Starting point is 00:26:15 instance was running about maybe four gigs of memory. And then we had a massive spike within minutes, like one or two minutes to 16 gigabytes. So that's a lot of data that had to be fit in memory. And you can already see where this is going. Scrapers and bots and LLMs. And then we have so many things happening. And then you can see the memory, it went up. The thread was killed. The child was killed. Like the varnish once the memory came down again. And then it went up again. So the graph that we see here, we can see the first spike, just like maybe a minute apart. The second spike, another crash, it took a little while for it to restore. We're talking maybe 10 seconds.
Starting point is 00:26:57 And then we stabilized around 10 gigabytes. From a CPU perspective, we got like 100% CPU utilization when this happens. Like everything is full on, everything. Like the instance is really struggling to allocate and deallocate and free up memory. and more importantly, we have a lot of traffic flowing through. So how much? 2.29 gigabytes, or gigabytes,
Starting point is 00:27:23 specifically, 2.29 gigabytes per second. Per second, exactly. And these happen so quickly, have like a huge rush of traffic coming in, and then nothing. Well, friends, I'm here again with a good friend of mine, Kyle Goldbreth, co-founder and CEO of depot.dev.
Starting point is 00:27:47 Slow builds suck. Depot knows it. Kyle, tell me, how do you go about making builds faster? What's the secret? When it comes to optimizing build times, to drive build times to zero, you really have to take a step back and think about the core components
Starting point is 00:28:01 that make up a build. You have your CPUs, you have your networks, you have your disks, all of that comes into play when you're talking about reducing build time. And so some of the things that we do with Depot, We're always running on the latest generation for ARMCpues and AMD CPUs from Amazon. Those in general are anywhere between 30 and 40% faster than GitHub's own hosted runners.
Starting point is 00:28:24 And then we do a lot of cache tricks, both for way back in the early days, when we first started depot, we focused on container image builds. But now we're doing the same types of cash tricks inside of GitHub Actions, where we essentially multiplex uploads and downloads of GitHub Actions cache inside of our runners so that we're going directly to blob storage with as high of throughput as humanly possible. We do other things inside of a GitHub Actions Runner, like we cordon off portions of memory to act as disk so that any kind of integration tests that you're doing inside of CI that's doing a lot of operations to disk, think like you're testing database migrations in CI.
Starting point is 00:29:01 By using RAM disks instead inside of the runner, it's not going to a physical drive, it's going to memory. And that's orders of magnitude faster. The other part of build performance, is the stuff that's not the tech side of it. It's the observability side of it. You can't actually make a build faster if you don't know where it should be faster. And we look for patterns and commonalities across customers, and that's what drives our product roadmap. This is the next thing we'll start optimizing for.
Starting point is 00:29:30 Okay. So when you build with Depot, you're getting this. You're getting the essential goodness of relentless pursuit of very, very fast builds near zero speed builds. And that's cool. Kyle and his team are relentless on this pursuit. You should use them. Depot.dev. Free to start. Check it out. One liner change in your get-up actions, depot.dev. So why is more traffic coming into the instance than going out?
Starting point is 00:30:03 So this is the traffic that the instance is receiving. So we're receiving 2.29 gigabits, which we're only sending 145 megabits. It's, now it's a good time to pause and think about why this is happening. Yeah, don't skip silence. So when we say the instance, we mean the Varnish instance. The Varnish instance, yeah. Which sits between our end user, whatever that is, or users and our application. Yeah.
Starting point is 00:30:34 Well, actually, and our Cloudflare, not our application. All our backends. And we have a couple of backends. Yes. But in the case of MP3 files, it's our Cloudflare R2. Origin. That's correct. Varnish is receiving a bunch of data and sending back significant.
Starting point is 00:30:47 in order of magnitude less data. And what's it receiving? I don't know, man. I mean, my guess would be like we're uploading MP3s. Now, that's going to go straight through the app to R2. Just a DDoS? I mean, what is it? I don't know.
Starting point is 00:31:04 Yeah. So it is a DDoS, but it's specifically downloading MP3 files or starting to download MP3 files, but never finishing. Hanging. Right? So you get like all these requests for MP3 files for large files. Varnish is going. and fetching them as quickly as it can.
Starting point is 00:31:20 So pulling all this data in so it has in memory, but the client is never around long enough. Yeah. Exactly. So they basically abort. But Varnish is still pulling it in all the data. Now, there is a property. It's called Bresp.
Starting point is 00:31:35 DoStream True. So what this does, a very weird thing, it tells Varnish not to buffer the entire backend response if the client is slow, right? So I'm not going to fetch the... entire MP3 file if you only want the first, I know, minute or two or a range or something like that. Now, this is on by default. So by default, that's how Varnish behaves.
Starting point is 00:31:59 So we wouldn't need to enable this. But if the object is uncashable, it cannot be stored in cache. You see where I'm going with this? Memory, you don't can't store it in memory. So you keep pulling these files over and over again and maybe even just fragments of them. So even though the client never receives them, you may be pulling hundreds of files and the client just goes away. So you're not pulling the entire file, but you're still pulling enough and not able to fit it anywhere and just becomes a mess. This reminds me of the 90s when you used to go jean shopping, right?
Starting point is 00:32:34 And you'd go into, do tell us, which I would never, you know, never shopped at. But let's just imagine I did, right? You go in there and be like, I like all these jeans, get them all. I'm trying them all on. And I just bounce. Yeah, the person goes to collect them all. They come back and you're not there. Here's the dress room full of jeans and Adam's gone.
Starting point is 00:32:51 By the, by the see you. That's really sounds like you're speaking from experience. Was this like a, it was just a prank? I just made it up just now, you know, I'm just creative like that, you know, on the fly. Creativity. Wow. That's a good one. That's a good one.
Starting point is 00:33:03 On the fly, yes. So. It is on the fly. There it is. On the fly. It is. On the fly.com. Boom.
Starting point is 00:33:09 Well, what could we do then? What's going on here? Exactly. So this was one of the things which I had to deep dive and understand what on earth is going on. Like, where do we store? Like what's happening? So there's a lot, lot more that went into this pull request. It's pull request 44. I'm calling you the elephant in the room. I'm going to switch to the browser just to have a look at that. So the title of the pull request is storing MP3 files in the first. file cache. But that's like the tip, right? Like the most obvious thing is, well, you either have lots and lots of memory to give varnish, which honestly would be impractical in the sense that would be way too expensive to store all these files in memory. The next best thing is to have something like a file cache. And by the way, we're talking about open source varnish. That's really important. Like anyone can use this, anyone can configure this. You can configure a file cache,
Starting point is 00:34:07 which will basically preallocate a file on disk. And then, that's where these large files will be stored. Pull request 44, the one that we're looking at, it's in the Pipley repository. That's what this adds. But there's significantly more stuff. And if I'm going to, let me go, there's quite a few files.
Starting point is 00:34:24 I highlighted the few, so I'm going to look at this one. So it's not just that. You also need to tune, for example, thread pools. You need to tune the minimum, the maximum. You need to tune the workspace backend, like how many memory structures get allocated. you need to configure the new limit and there's a couple more things that we had to go through
Starting point is 00:34:44 just to make things stable. Now, I just going to very quickly mention these things. You can go and have a look at pull requests to see what else went into it. So this was the one file. The other one was the regions. That's another thing. Not all regions would suffer from this.
Starting point is 00:35:03 So you don't want to allocate too much memory or too much CPU to regions where maybe they don't get a lot of traffic. And you would think that this thing is easy. But oh man, I have a surprise for you. You can't mix and match sizes easily in fly. So you can't say create like application groups and this group will be like the small group
Starting point is 00:35:25 and that group will be the big group and this is just one application. It's not straightforward. So you have to, again, this is how I solved it. Maybe someone listening to this will tell me, Hey, Gerhardt, you're wrong. I would love to know that, seriously. So the way I solved it is we deploy in all the regions, right,
Starting point is 00:35:43 because you specify the size once. So you say, my starting size is the large instance type. It has a certain number of cores, certain number of memory. And by the way, the disc is the same in all of them because that's like another problem. So we will sidebar that or put a pin in that. So when it comes to the initial deployment, you deploy the one size across all the application instances,
Starting point is 00:36:10 and then you go and need to check to see which instances should be scaled down, so that you have the capacity, but the regions that don't need the capacity, you can just bring them down. And you do like a rolling deploy in that you replace one for one. You have plenty of capacity to handle the traffic while instances are being rolled, all that good stuff. But we have hot regions and we have cold regions. And there's quite a few things here.
Starting point is 00:36:38 Again, if someone knows how to do this better, I would love to hear about that. And we have the Tomol, we have the primary region. There's a couple of things here. We'll come back to services and HTTP services. That's a fun one. We leave that for a little bit later. Flyjust. We can see how we do the Fly CTL deploy.
Starting point is 00:36:56 We disable HA because we want only one instance per region. We have 15 regions in total. We specify the CPUs, the memory, all the good stuff. including environment variables. Oh, that's another thing. We need to adjust the varnish size based on the memory that the instance has, right? We need to say like, hey, varnish, you get 70%. And that's the other thing that this does.
Starting point is 00:37:17 Same thing for the file size. You can't take up the entire disk. We tell you, based on the disk that we provision, how much space you should use from the disk that gets created. There's a scaling there. So that's another good one. I'm going through PULB because there's anything else. Oh, man, this was, this was a pain.
Starting point is 00:37:34 So recreating, like writing tests for this. Everything is tested in the sense that which requests would go or which, basically which files would get cached in the file store and which files would be cashed in the memory store. So how do you write the tests? Some varnish logging is included. You have to have anchors. There's quite a few things.
Starting point is 00:37:56 So that's assets backend.vtc. And part of this, it was a huge refactoring. So if you look at the lines of code, I wouldn't say it's that. many, 1,500 were added and 1,470 were deleted. So not much changed. I mean, the net is 30 new lines were added. But there was like a huge massive refactoring part of this. So there's again, this was I think two, three days of like figuring it out, trying things, refactoring things. And if you think that an LLM can help you, well, you try this. And it takes longer to go through those iterations than if you know what you're looking for.
Starting point is 00:38:40 It tends to be easier. Anyway, it's very dense, very specific, very difficult to make sure that it's doing the right thing. But it's all there. We have the mock backends. We're reusing things. We split the VCLs. By the way, you finish like the splits. It's easier to reuse them.
Starting point is 00:38:58 So there's quite a few things there. Now, this is Kaizen. So we are wondering what improved. After all this work, right, we rolled it out what improved. And to answer this question, we need to figure out which region is the busiest one. So out of all the regions that we serve, we have 15 in total, which ones get the most traffic is those hot regions? We're looking at the fly, the Grafana dashboard for our fly application, the instance of the pipe dream, the current one. And we can see that SJC, San Jose, California is a red, nice big red
Starting point is 00:39:41 circle, which means it has the most traffic. And also NRT, which is Tokyo. Hmm. Apparently. We're big in Japan. Yeah. And Europe, there's quite a fuse. If I'm going to pull this down a little bit. Let's see. No, I wanted to go here. What about that new continent? Are we big there? The new continent? Australia. Oh, there's a new one. There's a new new new one. Well, what's it called? Which is a new one? I don't know. There's a headline I heard.
Starting point is 00:40:05 I thought y'all would get the joke. No. Over the holiday, there was speculation. There was a new continent being announced. Narnia? Maybe. Could have been Narnia. No, no, no.
Starting point is 00:40:15 So the closet. Right now, if even this loss, even this list is basically, if you think about it, it kind of makes sense, right? It's U.S. East, U.S. West, Europe. But we have quite a few instances in Europe. We have four. It's more geographically spread in Europe. And we have Asia.
Starting point is 00:40:31 So these are like the big ones. Australia, Africa and South America, they're not as busy. They are like these busy regions. Cool. So which instance would you like us to have a look at? So I have a queue right here. SJC, baby. Let's go.
Starting point is 00:40:50 Let's go, baby. All right, let's see that. So I'm running FlyCTL SSH console. I'm using two flags dash s, which is a short one for dash. select, it will prompt me which instance I want to select. And then I have dash C, capital C. It's different than lowercasey. They do different things.
Starting point is 00:41:10 I give it the command to run. And it's Varnish stat dash 1, which will give me all the statistics from Varnish at a point in time. So since this instance was running, I will select SJC. There you go. And it will give me all the data, which is like all the counters that Varnish is incrementing, keeping track of different things, of the origins, back hands, the memory pool, the disc pool, the lock counters. There's so much stuff.
Starting point is 00:41:40 I'm really, really impressed how many things Varnish has. So this is what we're going to do. We, because AI, right, we're going to copy all of this, and we're going to ask AI what it thinks of this. Okay? It's just too much data here. So let's be serious about it. So question to you, which is your favorite AI, Jared?
Starting point is 00:42:00 which ones do you use? Oh, I don't like any of them. I would probably start with Claude and then I would go to GROC and then I would go to chat GPT. Third. Okay. So Claude, which one, which version? Opus, man. Give us the opus.
Starting point is 00:42:17 Opus. Okay. So we're looking at abacus. com. A.I is something I've been using for a long, long time. It allows you, I'm only paying $10 per month for it, not sponsored, you know, not affiliate in any way. It's just something that I've picked for myself.
Starting point is 00:42:30 I can basically pick any model and I can just just run this. So I have something prepared. So I'm going to drop this. It's all the data. And we're going to read through something that I prepared ahead of time. You pre-prompted this. I pre-prompted this, exactly. Engineering this prompt for weeks.
Starting point is 00:42:51 Exactly. Not really, but. Oh, that's a long prompt. So we're going to read it. And in the meantime, Adam will think about his favorite LLM to try. and I have mine. So we'll try three LLMs to see what they say. So I need to read the prompt now while everybody thinks, no, we should be using whatever LLM you should be using.
Starting point is 00:43:10 You are a Varnish 7 expert. You need to prepare four distinct responses and be explicit about the person that you're addressing. One, a seasoned sysadmin that has been living and breathing infrastructure for the last 20 years. Be precise, think deeply an approach to set up from a hardware perspective. 2. An elixir application developer that embraces Erlanks, let it crash concept, you need to give it straight, give it fast, and keep it relevant to their application. It's the app and the nightly backends. Assets and fees are important but less relevant, Cloudflare 2.
Starting point is 00:43:45 3. The business person that is selling this thing, they care about costs, efficiency and simplicity. Keep it high level and relevant for someone that doesn't care about the tech, but cares about the outcomes. And four, the audience of a podcast where this is being discussed. Make it general, relatable, and fun. Make analogies, keep it light and engaging. I have fun too many types. We don't want to make it too fun. That's a lot of fun.
Starting point is 00:44:13 So, yeah, there's one too many funds. That's right. Wow. Now that you understand your audience, please analyze the following Varnestadt output for the SJC. Look, I already knew that you would pick. How do you don't know? I go for the big one.
Starting point is 00:44:26 I have no idea. Focus on things that work well, things that could be improved and anything else that you find interesting. And by the way, ignore the synthetic request. It will keep mentioning these. Like, I get so fed up with this. We have health checks that run every five seconds. So they are normal. So,
Starting point is 00:44:42 I'm going to copy this. I'm going to run this. And I'm also going to open a new window for Adam. So which LLM should we pick Adam? Which is your favorite? You mean model? Model, yeah. Which model?
Starting point is 00:44:57 We just used it. But I'd probably back up to like Codex. Codex. Which is like GPD5. Latest. GPT51. 51. 52.
Starting point is 00:45:07 There you go. So GPD codex. My favorite one is Gemini. So I'm going to drop it. Now let's see how do they compare. Gemini, you're in a different tab now. So Abacus can't do Gemini? It might, but I have like my own pro account.
Starting point is 00:45:21 So that's something else. Like I use quite a Vio. I use nanobanana. quite a few things. Transcripts, it's all like part of the package. Sure. So it can, but that's what I prefer. Cool. So Claude Opus 4.5. For the season, for the season system. This is you. This is me. This is me exactly. Thank you for not to say. For like who's who. Following. So what's working well. Rock solid stability. So by the way, the instance has been running for 5.4 days. We had like all these improvements shipped and we are able to observe how are busy
Starting point is 00:45:56 instance works and that's what this is basically. That was a, the window moved. Cool. So after four, 5.4 days, zero child panics crashes. Zero threat failures. This is important. It means no threads died. No threats had to be restarted. Everything is healthy on this instance.
Starting point is 00:46:16 It didn't crash. So this instance didn't crash. Zero lock contention across all subsystems. Your CPU cash lines are happy. Excellent. Hit ratio, 93%. We like that. We really like that. We have backend connection pooling with a 2-1 reuse ratio,
Starting point is 00:46:36 and memory pressure is minimal. 132 LRUs in the last five days, LRE nukes. So very few objects had to be removed from memory. Threadpool property, 300 threads, zero queuing, zero drops. That's perfect. Areas to investigate. Disc storage allocator failures. we have discs he fails.
Starting point is 00:46:56 We are hitting storage fragmentation. The disc is 97% full. We have 48 gigabytes used. That's how many MP3 files are stored. By the way, how many MP3 files total do you think we have? Size or files count? Size. Size.
Starting point is 00:47:12 Well, if we had a thousand episodes at 100 megs each, which neither of those things are true, that'd be 100 gigs, right? So 100's too big, but a thousand's too small. I'm going to say 80 gigs. Adam, do you know, guess? That math checks out. I wasn't say like a terabyte, but that's probably raw wave files versus not.
Starting point is 00:47:40 All the files that we store in R2, and this includes all the assets, but we know that the MP3 files are the biggest. It's close to 250 gigabytes. We may have some duplicates. I don't know. I haven't checked. But that's how much files we have in R2. Yeah, well, we also have plus plus for the last couple of years, which means every episode has two files, not just one.
Starting point is 00:48:03 So that makes sense. So we should go higher. Now, we use this in every single region. So maybe we want to reduce number of regions. But I think... You know the third category called Super Hot. Super Hot, yes. Which is like SJC and Tokyo, right?
Starting point is 00:48:20 That's possible. Yeah, there's four, which we know they're really, really hot. Yeah. Yeah. But honestly, this is happening across multiple regions. And it is. We'll get to some interesting things. So, okay, synthetic responses, Grace hits, all good. For the Elixir developer, and I think this is you, Jared. Do you want to read it out? Oh, well, the TLDR is varnished to do its job. Your app back end is well protected. You want to read the whole thing? If you want, I mean, how it's shielded. It's 95% shielded. No, fit. failure, zero back end failures. That's because of, you know, my code doesn't really let it crash very often. Exactly. Your code is, yeah, it crashes internally, not externally. That's right. My thing is doing its thing. Mm-hmm.
Starting point is 00:49:02 It is generating some uncasurable responses, but, you know, we do have somes that we just don't want to be cached. Ooh, one fetch failure, negligible. Yeah, I agree. You don't worry about that. And in the end, it says, whoever wrote this is really good at what they do. I agree. And she's proud of themselves. and congratulations on such a great hire.
Starting point is 00:49:23 Yeah, I agree. I agree. I think the hire needs a promotion and a bonus, I think. There you go. All right, for the business person, the caching layer is performing excellently. Adam, do you recognize yourself, or shall I continue with this?
Starting point is 00:49:39 You can read it. 93% of requests never touch your servers, massive cost savings on compute. Do you know how many requests per second the application is serving? Like maximum, by the way. What's the maximum RPS for this amazing Elixir Phoenix application for the homepage?
Starting point is 00:49:59 Probably a lot. Gosh. Thousands? Tens of thousands? Maximum. Okay. Jared? 100,000? The database connection is involved.
Starting point is 00:50:09 Concurrently? Concurrently, yes. I don't know. I'd say a lot, not very many. To our homepage? I'd be like 12. 12 requests a second. Yeah.
Starting point is 00:50:18 17. 17. I'm right in there, maybe. Someone knows are code. So 17 requests per second. So if all these requests were hitting the application, we need so much compute to serve that. You know, so much caching.
Starting point is 00:50:33 Obviously, we've removed all the caching. Now we're joking about this because we purposefully removed all the caching from the application. Right. I remember that a couple of years back because we said this has no place in the application. The application gets restarted. Yeah. We need to store this somewhere.
Starting point is 00:50:48 we need a cluster. It was just really messy to handle it at that layer, which is why we introduced this. Five plus days running without any issues. By the way, this is like the last deploy. So maybe by the next Kheisen, if we do no more deployes, we'll be able to see how it handles.
Starting point is 00:51:07 Zero failures on the infrastructure side. And three terabytes of data served to users. Three terabytes. So in five days, this one instance served three terabytes. without your application server is breaking a sweat. Storage is getting full, so we need basically more storage. For the podcast audience.
Starting point is 00:51:27 Oh, yeah, that's be fun. Imagine a really good receptionist at a busy office. This VARG server is like having someone at the front desk who remembers everything. Out of 100 people who walk in asking questions, 93 of them get their answers immediately from the receptionist without ever bothering the experts in the back office. What's cool? It's been running for over five days straight without a coffee break
Starting point is 00:51:54 or a single mistake. That sounds cruel to me, but let's go with it. It served three terabytes of data that's like streaming about a thousand HD movies. This one instance streamed a thousand HD movies
Starting point is 00:52:08 in five days. And the experts had only had to answer seven percent of the questions. The one quirk. The filing cabinet is getting full. It's like, when your receptionist's desk drawers are stuffed
Starting point is 00:52:20 and they occasionally have to throw away old notes to make room for new ones. Not a crisis, just time to get a bigger cabinet. Okay, I think their last of fun stats of 300 workers. I think that's too detailed. That's good fun there.
Starting point is 00:52:34 Good job, Belmont. Do we care about GPT or Gemini? We can only use one. We can only pick one. Gemini's getting some good hotness. Let's check Gemini. We'll see how it adds up. Oh, it's still thinking.
Starting point is 00:52:47 Let's see. I think it's finished. Maybe that's, let me just close that. Did it finish? I think it did. All right. So let's go up. Slow thinking.
Starting point is 00:52:57 I did like the thinking. Show thinking. Show thinking. Show thinking. Slow thinking. I thought I said slow thinking. I was like, oh, on speed up, Chevin.
Starting point is 00:53:07 Show thinking. There's a squad lot there. Anyway, we're not going to look into that. So the instance has been up for 5.3 days. The MGT uptime. I like it. telling me which of those, that long list of counters is important. From a system perspective, the threading model is perfectly dialed in 300 threads across
Starting point is 00:53:26 two pools with a zero thread limited and zero thread queue length. The kernel is in context switching unnecessarily and we aren't starting for execution units. However, we have a critical IO and storage allocation bottleneck in the SMF. The file, I'm not sure what Steve Doar means. That's a weird one. I'm not sure Gemini knows what Steve Norman is either. The problem is SMF, disc, C, fail is at 19,000 plus against only requests at a 73% failure rate for disc allocations. The G-space, in the G-space we have only 145 megabytes while we have 48 megabytes, 48 gigabytes used.
Starting point is 00:54:08 You're effectively out of contiguous blocks or hitting the ceiling of the pre-allocated file. That sounds right. Varnish is forced to fall back to. to SMA RAM or simply not cache large objects. While SMA has 2.1 gigabytes free, the disk is thrashing. The hardware, if this is spinning glass, your I-O-8 is likely spinning. It's not. It's NVME.
Starting point is 00:54:29 So, yeah, there's quite a few things there. Interesting. It's low relative suggesting the lock contention on the shared memory log is negligible. Nice. You have a low lock contention. The bottleneck is strictly at a storage layer. We just need basically a bigger disk. The Elixir developer, Let It Crash Proponent.
Starting point is 00:54:49 I like it. Your app backend is rock solid. Well done, Jared. Varnish hasn't seen a single backend fail or an unhealthy event in five days. The let it crash philosophy is working. If the app is crashing, it's recovering so fast, Varnish doesn't even notice. I don't think it is. I think the application is good.
Starting point is 00:55:06 The good news is we're seeing an effective 75% hit rate. Now, this is different, right? We had 93% there. So I would like to understand the discrepancy. That's another reason why you should use more than one, by the way. And then obviously double check the data because sometimes they give you things which aren't true. The crash concern, the disk cache is crashing. Again, this is not for the application developer.
Starting point is 00:55:30 I think this is for the SIS admin. In me, I think, the straight talk. Your main synth is high. So we have health checks. There's a delta responses. So we have a lot of synthetic requests. Again, sorry, synthetic responses. is again like a varnish thing.
Starting point is 00:55:48 The business person, efficiency or currently serving of our traffic from varnish, I think I know what's happened. I don't think it's taking into account the synthetic requests. Those should be removed from the total number of requests.
Starting point is 00:56:01 They think Claude has the right number. I think so, yeah. Yeah, I think so. This means for customers, we have cost efficiency, that's good, the risk, there's the bottom line. I think this was the fun. But I think this is a library.
Starting point is 00:56:15 I think we can stop it here. The library analogy versus the secretary analogy. I think that was a better one. I got a barista one. I thought it was like a very good one. Oh yeah. For queuing or for what? For queuing, yeah.
Starting point is 00:56:28 Yeah. Like the barista analogy, I thought it was very good. Yeah. This one is using books and whatnot. The library hasn't burned down. That's a good thing. That is fun. So I think Gemini is getting a bit funnier.
Starting point is 00:56:40 The nightly feeds and the app are still humming along. Nice. So that's what we have. And, and that was only half the problem. Well, friends, this episode is brought to you by Squarespace, the all-in-one platform for building your online presence. Whether that's a portfolio, a consulting business, or finally shipping that side project landing page,
Starting point is 00:57:05 you've just been meaning to do, but never get to. Here's the thing. You mass-produce code on the daily. You deploy new services, new infrastructure, new hardware. You're versioning your APIs. You're semvering all over the place. But when someone asks you about your own personal website, it's like, ah, I'm still working on it. Does that sound familiar?
Starting point is 00:57:24 Squarespace exists so you don't have to treat your personal site like a weekend project that never ships. Pick a template and drag and drop your way to something that actually looks good and move on with your life. No wrestling with CSS. No, I'll just build my own static site generator again. It's just done. If you do consulting or freelance work on the side, Squarespace, handles the whole entire workflow. Showcase your services,
Starting point is 00:57:47 let clients book time directly in your calendar, send professional invoices, and get paid online. It's the boring infrastructure that you don't want to build for yourself. And for those of you out there who are doing courses or gated content or educational stuff, tutorials, workshops, that intro to whatever series you keep talking about,
Starting point is 00:58:04 you can set up a membership area with a paywall and start earning recurring revenue. Set your price, gate the content, and you're done. And they've also added, Blueprint AI. This generates a custom site based on your industry, your goals, your style preferences. It's not going to replace your design skills by any means, but it'll get you about 80% of the way there in about five minutes. Here's the call to action. This is what I want you to do. Go to
Starting point is 00:58:30 Squarespace.com slash changelog for a free trial. And when you're ready to launch, use our offer code, change log, and save 10% off your first purchase of a website or a domain. again, squarespace.com slash change log. That was only half the problem. So we were like at the midpoint. I was feeling good. I feel like we had it all fixed.
Starting point is 00:58:53 What else is the problem? Oh, wow. This is like when all the fun begins. So you remember this, Jared? Yes. MP3 requests intermittently hang in Newark, New Jersey. This was our good friend John Spurlock, who's been on the show before and is a podcast nerd. In fact, he runs op3.dev and other podcast
Starting point is 00:59:14 nerdery things. And so he really knows this stuff. And so when he reports issues, you know, I don't say, did you try rebooting? I take it seriously. So I shared it with you. And he actually did some additional digging for us. Go ahead. So, you tested this, and I think you had issues as well. So we've confirmed this for sure. I did. Like certain times, certain files. Actually, it would be all requests at certain times. I assumed that that was that particular POP, as we could call them, or pipe in the Pipedream, that was hanging. And then it would go away. And he actually had the same problem. He had a Friday night deploy of Friends. And he was trying to listen to it on Friday. Couldn't
Starting point is 01:00:00 get to it. By Saturday morning, he can get to it. So it's intermittent hanging, very difficult to diagnose, very difficult, I assume, to debug. And then it just comes back to normal. I thought maybe the out of memory thing, like it's just in some sort of fugue state until it reboots and then it works again. But you go ahead. That's what I thought. That's why like a deep dive on this. This was November, end of November, beginning of November. So November, I was just trying to figure out what on earth was going on. Just like, you know, like from the sides. I didn't have too much time. But if you look at this response, there's quite a few things there. This is like my initial one, like an investigation, trying to understand what's happening, giving a couple of
Starting point is 01:00:40 debug headers, like a couple of extra headers that the request can be made off. Sorry, can be can be run with. So we just get a bit more details. Forcing regions as well. So there's quite a few things there. I was checking into that. This is Don McKinnon. He also had issues today.
Starting point is 01:00:59 So he pasted some results. So thank you. Thank you, Don, for adding this. This was helpful. So, I'm still scrolling, I'm still scrolling, there you go. It's super helpful. I have confirmed that the requests have been hanging. You are getting the hangs this afternoon as well.
Starting point is 01:01:21 This was only three weeks ago. So this has been going on for a while. I dug deeper and I found the problem. The problem was that in the fly config, we had the concurrency set to connections, not requests. So it's possible to configure an application. Again, you're configuring the fly proxy that sits in front of the application to limit how much traffic hits your application. So requests: how many requests should the fly proxy forward to your application before it stops, because you don't want to get overloaded.
Starting point is 01:02:01 So before it gets to that, it starts throttling, it starts slowing clients down, and then that's when you start seeing Fly edge errors. Connections you would use for something that has long-running connections, like a database, for example. In our case, it's not a database, right? It's an HTTP application. So requests would have been the right concurrency type.
Starting point is 01:02:23 I have no idea why I picked connections. It was the wrong one. But the effect was, as you can see here, we had 2,700 long running connections on that edge, so on that region. So in this case, it was, I think, Orange One, I think EWR. Right. So EWR was getting, had like all these connections opened.
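For readers following along at home, here is roughly what that knob looks like in a fly.toml; the port and the limits below are made up for illustration, not Pipely's real values.

```toml
# Illustrative fly.toml fragment -- numbers and port are placeholders.
[http_service]
  internal_port = 9000
  force_https   = true

  # "connections" counts open TCP connections, which slow or idle clients can
  # hold forever; "requests" counts in-flight HTTP requests, the right unit
  # for an ordinary HTTP app behind the Fly proxy.
  [http_service.concurrency]
    type       = "requests"   # this was set to "connections" -- the misconfiguration
    soft_limit = 200
    hard_limit = 250
```

Past the soft limit the proxy prefers other instances, and at the hard limit it stops forwarding to this one entirely, which is how a few thousand idle connections ended up starving everyone else.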
Starting point is 01:02:50 Clients weren't closing the connection. The proxy was full. No more connections could be forwarded to the application. Long-running connections. There are usually clients which are not doing the right thing. Right. You shouldn't have that many long-running connections. So the problem was a misconfiguration on our side,
Starting point is 01:03:08 which meant that connections like slow connections, long-running connections, were basically blocking other connections from coming through. So that was a problem there. And I thought that was it, but there was more. So this was the last comment last week. We now have a check that runs every hour. And what was interesting,
Starting point is 01:03:33 and I'll talk about the check as well. We had response bodies timing out in two regions. So 13 regions were fine. But even after this configuration, there were two regions, IAD and EWR, where when we were using HTTP 2, and for some reason this is important, when we're using HTTP 2, the proxy, the fly proxy, would see this.
Starting point is 01:03:56 It would not forward the connection correctly. As in it would start, it would like serve the response. like we could see the headers coming back from our instances, what we wouldn't get is the body. So the body would always be like zero bytes served. And we could see this happening. We could see the connections that, by the way, they were opened. They shouldn't have been opened because the application changed.
Starting point is 01:04:19 So these connections should have been dropped. There was something not quite right. My suspicion is with a fly proxy layer. Because when we were forcing HTTP 1, everything was working fine. And by the way, the fly proxy, when it talks to our varnish instance, it's using HTTP 1 and you can see that in the headers. So the proxy to the varnish was fine, but the client to the proxy was not fine.
Starting point is 01:04:41 And HTTP 2.0 is a very complex protocol. There's so many things which just don't work the way people would expect. So anyway, the issue fix itself. That's the important thing. So opening this. Not super satisfying. Yeah, that was very nice to see. And there was something,
Starting point is 01:05:01 Myaelrus. How would you read this? Maya Illeros. Illeris. There you go. So someone on the fly community forum that was very helpful, they noticed that we had a misconfiguration in our fly.toml. And we were using services as well as http_service.
Starting point is 01:05:23 And this is bad. By the way, this is very, very bad. So everything was happy. Like we could push this config. You know, the applications were running. everything was fine. But because we had these two things together, it was apparently creating some issues.
Starting point is 01:05:37 And all we did, we were explicitly setting the idle timeout. And the idle timeout, that's the one where if after 60 seconds, the connection isn't doing anything, it will be forcefully terminated by the proxy. So that part was important. So anyway, we made the change, we pushed the change. But even before we pushed the change, the proxy started behaving.
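The shape of the fix, as described, is to stop declaring the same service twice; a rough sketch of the before and after (keys abbreviated, and the exact spelling of the idle-timeout setting is deliberately left out, since that belongs to Fly's fly.toml reference rather than a guess here):

```toml
# BEFORE (problematic): two stanzas both exposing the same app.
# [[services]]
#   internal_port = 9000
#   protocol      = "tcp"
#
# [http_service]
#   internal_port = 9000

# AFTER: a single http_service stanza owns the port, with the explicit
# 60-second idle timeout set alongside it per the Fly docs.
[http_service]
  internal_port = 9000
  force_https   = true
```

The TOML is syntactically valid either way, which is exactly why the validation subcommand passes it and why Jared's point about a dos-and-don'ts warning comes up a few minutes later.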
Starting point is 01:05:58 And now there's pull request 49. We right-sized it. We made a few changes. I captured all the details, the configuration, the commands. It's all there if you want to read it. But most importantly, now we have a check that runs against all regions every hour on the hour. CI/CD. It's using Hurl.
Starting point is 01:06:19 And what I'm thinking is, shall we try running that locally to see how it behaves? Because that's how it started. Like I was starting to do it locally. So I'm back in the terminal. On the left-hand side, I am monitoring my internet connection. Remember that Christmas tree? This is related to that Christmas tree. So I'm at the top of the Christmas tree. I'm at the gateway, the core router. It's a MikroTik CCR2004. Pretty good. 10 gigabits per second maximum. Now, my internet connection isn't 10 gigabits, it's 2.5, which is plenty for this test. So, every second it's showing me how many
Starting point is 01:06:59 packets and how many bits we're receiving and transmitting. Okay. And again, we are recording. Everything's happening live. So you can see jumping right as Riverside. We're pushing more data to Riverside. Cool. So I'm going to run now. Just check. And just check by default. It's one of the commands, the just command that we have in the Pipely repository. And check, all it does, it runs Hurl with a couple of flags. It downloads an MP3 file. It downloads feeds. It basically connects to all the different backends,
Starting point is 01:07:37 and it sees how quickly it can get data back. That was quick, those eight seconds. I'm going to run it again. As I run this, pay attention to the left-hand side. It will go to 120 megabits per second, so that's that MP3 file being downloaded. So every single time this runs,
Starting point is 01:07:56 a full MP3 file gets downloaded alongside a few other things. Okay. I can open the reports; we're not going to look into that, because we're going to run something more interesting now: check all. And what check all does, it runs the same command against all the regions. I'm at 2.3 gigabits per second,
Starting point is 01:08:13 we're downloading all the files we can see the response is coming back. EWR just sped by. I had sped by. So all the different endpoints are returning. Now I'm based in London. Obviously the further way you are. So for example, this was South America.
Starting point is 01:08:28 That's LAX. So a couple of instances are slower to respond. And all this happens via headers. So when you connect to Fly, you can tell it, hey, I want to connect to a specific region, and that's what routes the request to that region. That's cool. And again, it's all captured in that pull request, and you can see what it looks like.
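If you're curious what one of those Hurl checks looks like, here is a minimal sketch. The URL is a placeholder, and the region-pinning header name is an assumption on my part; the real checks live in the Pipely repository and the pull request mentioned above.

```hurl
# Minimal sketch of a per-region check. Placeholder URL; header name assumed.
GET https://cdn.changelog.com/uploads/some-episode.mp3
fly-prefer-region: ewr

HTTP 200
[Asserts]
# the transfer has to finish inside the window (duration is in milliseconds)
duration < 100000
# and we actually received a body, not just headers -- the HTTP/2 failure
# mode described earlier was exactly "headers yes, zero body bytes"
bytes count > 1000000
```

Run a file like this with `hurl --test` once per region and you get the hourly signal described here: either every region returns the full body in time, or the run fails and names the region that didn't.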
Starting point is 01:08:49 The check-all one: Johannesburg, that's usually slow. And the slowest one is Tokyo for me. Sydney as well can be slow. So we still haven't received the responses from there. We should get that shortly. You can see I'm pulling now 50 megabits, 20 megabits. It's just slowing down. And it's just the connections between here and there.
Starting point is 01:09:08 The last one, there it goes, Tokyo. In 60 seconds, I pulled about 2 gigs roughly. It's a lot of data that gets pulled down. The feeds and everything in between. And anyone can run it. I would recommend you not run this, because we have to pay for this bandwidth. But our CI runs it just to make sure that everything works.
Starting point is 01:09:31 And if we look at every hour, I think I'm going to tune this down. You can see there were no more connections hanging. So we got to the bottom of that as well. If it ever comes back, because it went away on its own, if it comes back on its own, we're going to know about it. Exactly. Now we have a system that is able to inform us when there's a problem.
Starting point is 01:09:49 So let's go to three. We're on page number three. This one, for example, took more than five minutes. Right. So sometimes when the connectivity is a bit slow, some regions can be slow. That's when you get these time out. So this is capped at five minutes. The last one that failed was a while ago.
Starting point is 01:10:04 So you can see we're January 5th. There we go. There's one that failed. January 4th. Check all instances. So let's see run and we'll see exactly which region failed. Execution, NRT, that's Tokyo. And as you can see, we have 100 seconds, right?
Starting point is 01:10:21 So if after 100 seconds, it doesn't download, it just times out. And we were pulling data, but it didn't finish downloading the entire MP3. And we're downloading 100 and something megabytes. Very cool. So, I mean, not cool that it didn't finish, but cool that was a while ago. And we can actually test this. Now, do we need to be doing such a large file? Is that part of the test?
Starting point is 01:10:49 Or could we test a smaller file and still get the same results? We could. Yes. This was a file that was reported. So we need to find an MP3 file. Absolutely. I think we can also reduce the frequency. We don't have to run it every hour.
Starting point is 01:11:03 This was always in preparation for this conversation. What about episode 456? Of course that one. That's coming up. That's the deepest rabbit hole. So I'm leaving that. I'm leaving that for last. That's coming, Adam.
Starting point is 01:11:18 One thing I suggested, though, in our Zulip, but I don't think this is, I didn't check to see if this is even a thing, but to validate, you know, if the fly CLI could validate the TOML file for you. Because you could have been,
Starting point is 01:11:31 you could have checked the TOML file for syntax errors or just, yeah, dos and don'ts essentially, and it didn't. It does have a validation subcommand. Syntactically it is correct. The config is valid.
Starting point is 01:11:45 I mean, it was applied. But because it combines two things, it shouldn't. So at least I would expect a warning. Like, hey, you're using both services and http_service. Yeah, validate syntax and validate, you know, expected, you know, true TOML file config. You know, don't combine or conflate two values or overwrite one or, you know, just that kind of thing.
Starting point is 01:12:07 I would defensively do something like that in a CLI to protect my user from a poor config. They could have just not been holding it wrong for so long. Yep, I agree. So it's the impact of that configuration indeed. Yeah. So this is something we can see again the same logs. We can see this one here we go like to 50 megabytes per second. That's 500 megabits.
Starting point is 01:12:31 When we have these peaks, when we see this in the Fly graphs, it's usually when the benchmarks run or when the checks run, because they put significant pressure on the instances, and we can see them, we can pick them up straight away. So that's what this is. All right, so remember this guy? This guy was saying, March 29, so it's almost two years ago, this guy was saying we will run into all sorts of issues that we end up sinking all kinds of time into. So this guy had a good hunch. This is Jared. March 29th.
Starting point is 01:13:11 And we just went through a couple of examples of issues that we had to deal with as part of this. But because of this, we understand the traffic, and we understand how the application behaves and the backends behave, at a very deep level. So you're right, Jared. We did sink all sorts of time into this. How many lines? Let's see how many lines we have now. So how many lines? 20 lines.
Starting point is 01:13:39 590 lines we have in total Varnish config. It's more than 20 lines. By the way, we have like the roadmap to 2.0. This is 1.0 that we tagged and shipped. It's solved like a lot of issues. But that was the easy stuff. Okay. So for everyone that stuck with us, something really good is coming up.
Starting point is 01:14:00 And Adam was already mentioning it, episode 456. There's something special about episode 456. So what is special about it? What stands out to you, Jared? Oh, it's just getting rocked with downloads. So episode 456, OAuth, It's Complicated. By the way, this was recorded in 2021.
Starting point is 01:14:21 It was published, again, August 2021. For some reason, it's been downloaded a lot in recent months. It has over one million downloads. This is the most popular episode. on the change log ever. The most downloaded episode. It's crazy. It's crazy.
Starting point is 01:14:41 So. Oh, so you guys looked into this. We did. Yes. We dug into this. Okay. I didn't know. You guys were doing this.
Starting point is 01:14:47 So we just had a quick look to understand what is happening here. So we have Honeycomb opened up. Remember, every single request which comes through the Pipedream, through Pipely, every single request, we send to Honeycomb. We're able to look at it. This is the last 60 days. And I have filtering done in such a way so that I'm only looking at this one file. How many times has this file been downloaded in the last two months?
Starting point is 01:15:11 And you can see the peaks, right? You can see, and by the way, this is gigabytes. So, and this is, the period is four hours. So we are peaking at about a hundred. Actually, this peak was here. We had 200, almost 300, 300, 300, 400, anyway, close to 400 gigabytes in a four hour period. That's just too much.
Starting point is 01:15:35 All right. I think so. Like I know this, I know this is a great episode, great conversation. I remember that conversation. It was good. Like, who is downloading this file 400, I don't know, times, or actually more than 400 times, every four hours, consistently, for months on end? A super fan. So we can see the different regions. Now, this is spread across the entire world. It's not just one region. This is really, really big. I think if there was a DDoS attack, this is what it would look like. And in the last six months, sorry, in the last two months, 60 days, we served 30 terabytes in San Jose, California alone. In Tokyo, we served, what, 15 terabytes. This is a big number. And if you look in
Starting point is 01:16:28 this column, the distinct IPs, the client APs, we had over 10,000. IPs downending this file. So this is not one or two IPs. This is thousands and thousands of IPs, which keep downloading this file over and over and over again. So I don't know how we would block 10,000 IPs. Right. That would be, that will be, the VCL would be crazy. Well, that episode was starring Aaron Perretti, who is a very talented person. And he is the co-founder of Indyweb camp and a big fan of the Indyweb as well as Oath, obviously. So my hunch is Aaron's very interested in being the most downloaded episode ever. And he controls a fleet of machines from all around the world.
Starting point is 01:17:14 And he points them wherever he wishes. And he thinks, you know what I'm going to do? I'm going to get the number one spot on these guys' download charts. And so I'm thinking Aaron Parecki is, you know, the man with the mask. We pull the mask off. And it's him this whole time. What do you think, Gerhard? I think that we need to speak.
Starting point is 01:17:32 I see, I don't want to say the specific language. I think we need to go to Asia. I think we need to visit a couple of cities in Asia. Okay. Find the IPs, which are responsible for this, because this is a crazy amount of traffic. Asia, it just so happens if we look at, so Asia is basically the continent,
Starting point is 01:17:52 which is where we are getting the most downloads from, because of this one episode. And this is actually traffic being served. This is not just HEAD requests or GET requests; these are bytes being sent to thousands and thousands of machines in Asia every single hour. So whoever is doing this, please stop. Please. It's a cycle.
Starting point is 01:18:13 So we need to like knock on doors. We need to go over there and knock on some doors and say, excuse me, is this IP address at this home? And then they might say yes and say, would you please stop? And what's going on over here? What can they possibly benefit from this? Like, what could they be getting? Maybe, maybe we're the speed test. Someone is using us to speed test their connection.
Starting point is 01:18:37 Who knows? Yeah, maybe. That's the only thing I can imagine. But that's a lot of IP addresses. It is. And it's across multiple regions. Which? Multiple data centers, yes.
Starting point is 01:18:47 So multiple regions, fly regions are serving these IPs, yes. They're all coming from Asia, by the way. Again, I don't want to mention any names because there's, there's no, there's no one. There's no bad guys here, right? We just want to assume that someone left the oven on. It's like the blinker on when you're driving. I was like, hey, you're not turning. It's time to turn that blinker off.
Starting point is 01:19:13 So the way I can see us mitigating this, and this is a hard problem because of the sheer number of IPs which are hitting us, is we can basically start blocking entire netblocks, entire network blocks. Unfortunately, some genuine listeners might be caught in this, and basically the changelog will not be available, or at least the MP3s will not be available, to a portion of users. The other one is, obviously, we can and we should, right, this is like the next problem, we should enable some throttling, because there's more stuff happening here. So we don't have any
Starting point is 01:19:46 sort sort of throttling we assume fairness we're assuming goodwill we're assuming decency and we're not seeing that here that's the internet so to be honest like whoever is doing this and it's not lLM I had to look, we have, we have that problem as well, but in this case, it's not LLMs. This is something completely different. So my hope is by someone that listens to this episode, maybe we put this in the intro, whoever's downloading episode 4, 5, 6, please stop, because we'll need to take the next step. I don't know it's a bit of a cat and a mouse game, but that's what we'll need to happen. Because we need to pay for this bandwidth.
Starting point is 01:20:21 This is only Varnish, right? This is only the cache layer where this is happening. This is only the cache layer, yes. Yeah. And so what mechanisms are in Varnish to do throttling or rate limiting or just anything like that whatsoever? There are VMODs, which are basically modules that Varnish loads that just give it extra functionality. One such VMOD, and I've looked at this, it is free and open source, is the throttle VMOD. Now that means that we need to start keeping track of IPs, and it will use a bit more memory.
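As a sketch of the idea only: the episode doesn't pin down which module, but vsthrottle from the freely available varnish-modules collection is one option, and the numbers below are invented.

```vcl
vcl 4.1;

import vsthrottle;

backend default none;

sub vcl_recv {
    # Hypothetical per-IP limit on MP3 downloads only: at most 30 MP3
    # requests per client IP per hour; feeds and pages are untouched.
    if (req.url ~ "\.mp3$" &&
        vsthrottle.is_denied("mp3:" + client.ip, 30, 1h)) {
        return (synth(429, "Too Many Requests"));
    }
}
```

The counters are keyed per client IP, which is the "keep track of IPs, use a bit more memory" trade-off mentioned here; well-behaved bots and feed readers that stay under the limit never notice it.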
Starting point is 01:20:50 That's okay, we have more memory. And then we need to start basically applying limits to how many downloads specific IPs can do. And we can limit it to MP3 files only. So if we have a bot, or if we have, for example, like, I don't know, an RSS aggregator or something like that, we're okay serving those requests. Because again, that's what Varnish is meant to do. The problem here is that we're serving a lot of bytes for MP3s, the same MP3, and that
Starting point is 01:21:18 cannot be real traffic. Yeah, I mean, even in this case, you can like tie it potentially just to this MP3, like you just said, which is not an all MP3. scenario. Like if you request this MP3 with this kind of like request signature of X per whatever, I mean, I didn't examine the actual signature of the request, but that's how I'd probably investigate it is begin to isolate. Does that require us to write a lot of defensive code against that kind of scenario? I don't think so. It's just some configuration. We just need to add more configuration. And back to Jared's point, we're just chasing now new problems that we didn't even
Starting point is 01:21:54 think we would have. But we have what looks to me like an actor that's not very, I want to say this in a nice way, an unfriendly actor that is not very happy. And they are very angrily downloading our MP3 over and over again, thousands of times across thousands of IPs. And this is not cool because ultimately we end up paying for this bandwidth. That is not helping anyone. but that's one. It's not the only one. So we have one more. So you can see here, for example,
Starting point is 01:22:29 this is the last seven days. We have seven terabytes that were transferred in the last seven days. Seven terabytes? Maybe that's more than that. It needs to be more. And this is actually where the geo code does not exist. Okay. I was expecting to see more than that. Anyway, Asia is the one where we can see that pattern. But we also have, like, in Europe, sometimes we have these spikes, and it's this spike which I wanted to focus on. We know that someone in Frankfurt, that connects to Frankfurt, downloaded the static favicon 170,000 times in the span of, I don't know, an hour or two. So they downloaded this over, like, two, three hours.
Starting point is 01:23:11 So you get, you know, requests like this that are putting stress on these instances. I mean, potentially, that was like a pass request as well. Yeah. It went through the cache, which means that they must have had like a cookie set or something like that that basically was preventing the cache from working in this case, which, again, that's how it's supposed to work. So anyway, I think that was unfortunately not the best thing that we could have ended on, but it's a thing and it's food for thought, like more work to be done. There's many things that we didn't get
Starting point is 01:23:46 to talk to. We didn't have time for. For example, we didn't talk about the nightly. By the way, Nightly now is being served by the pipe dream as well. And the reason why we had to do this, because that sometimes will get scraped, would get hit really heavily. It's a very small app, it's EngineX, but if I open it, so let's just click on that one,
Starting point is 01:24:07 and that's pull request 46. Before it was basically topping up at 141 request per second. Now it's 1,300, so it's almost like a 10x, in order of magnitude faster. The latency went way, way down. So, and it's just the only thing we had to do is basically put varnish in front of it. Nice. Well, that's nice.
Starting point is 01:24:30 Yeah, that's one more thing there. And you can go and have a look at how it works. There will be like a benchmark here, a small benchmark here. That's it. We have, we have one last one for the road. Before we do that, anything else we want to talk about before I share one last thought? I suppose, what do we do?
Starting point is 01:24:49 You know, if we know these downloads are happening, we're here on the podcast, just politely asking them to stop, do we just let it keep happening? Well, we could set up some sort of throttling. I think it'll be the easiest thing. Now, which will impact everyone. I don't want to start blocking, again, IP ranges, netblocks, because we don't know who's going to be caught there. They may change to other IP blocks.
Starting point is 01:25:12 So that's entirely possible. We don't know how this will work. We can't block an entire country, an entire continent, especially if it's a big one. I don't, I don't think that's reasonable. So really throttling is, I think, the fairest thing. And then we can throttle MP3s specifically. Because we do have, for example, I see them, like, for example, we have a Python client and a Go client that every week they come and they download all our MP3s. I don't know why they do that, but every seven days, they basically request every single MP3 that we
Starting point is 01:25:43 have. So they're like scraping the website and then pulling everything down. I don't know why. Yeah. Again, the closer, like the more I was looking at, and again, because I was working so deep in this, I started noticing like these behaviors that you would normally not see. So it's one of the advantages, I suppose,
Starting point is 01:26:02 to being to working so close. Yeah. With the traffic, with all the requests, and having this level of understanding and visibility into every single request. So it really helps, down to the IP level. Something like that, though, like the Go client and the Python client,
Starting point is 01:26:16 Where would you, would that be a honeycomb thing? Where would that be? Yeah. It's honeycomb, yeah. You can filter by user agent, for example. And you can see that like there will be on, for example, say, no, I don't want to show any IPs or anything like that. So that's why I'm looking to screen share that. Yeah.
Starting point is 01:26:33 But once we start digging into that, you can say group by user agent. And you can say filter by MP3s, so like URL contains .mp3, and that will be able to group, and you can say, oh, and by the way, only show me where there's more than, for example, 100 downloads,
Starting point is 01:26:54 and then you'll start seeing the outliers, which are the clients that are downloading certain MP3s, or MP3s in general, excessively. Now, that can be spoofed, that's the other thing. Like we have, for example, the request agent, like the user agent,
Starting point is 01:27:10 it's an empty string. That also happens, right? Because you don't have to send the header if you don't want to. Yeah, you can also send whatever you want to. So that can be spoofed. Yeah. Yeah, it's like whenever you build systems like this,
Starting point is 01:27:24 and then even when you observe them, I guess you don't expect, that's what I originally thought, but you kind of hope that clients, aka people, behave. You know, they're going to use the system for the system's purpose, not to once a week download and scrape the entire thing.
Starting point is 01:27:42 And I mean, in that case, somebody could have their own web archiver and they could have altruistic reasons for it. I think that's kind of silly. But, you know, once per week, download the entire contents and somebody's disk seems like, I want your thing. And I want to keep getting your thing. And if it ever changes, I want to make sure I have that snapshot. I don't understand it.
Starting point is 01:28:05 It doesn't make any sense. Like, what would make anybody do that? What is the purpose and motivation to keep doing that to even commit the compute or the script? are the time to do that. Like, what are they getting from it? I don't know. We need to go over there and knock on some doors, man. I'm going to ask them.
Starting point is 01:28:23 Yeah. Why are you doing this? Every door. It's hard for you. In Asia. Do you listen to the change log? Yes, I do. How many times?
Starting point is 01:28:33 Yeah. Tell us about four, five, six. You know what four, five, six means, don't you? Yeah. It's just, see, this is how, so this is, I think, a really delicate and a really important. point to discuss because this is how good systems become bad systems. It's true. Yeah. You have to treat everybody bad. Exactly. Like we don't want to be doing this, but we are forced to do something against something which is good. So it's not benefiting anyone
Starting point is 01:29:02 and we have to step in and do something about it. Now we have to do it. It's been like I was expecting this to stop, but it's still even to this day. We made varnish. I mean now that it's stable, it's able to serve more traffic. It's able to, like, we just had like the biggest spike because now the system is more stable. But it means that bad actors, again, I shouldn't be using that. Unhappy people.
Starting point is 01:29:26 Unhappy clients. Use it. Unhappy clients. Yeah. The only person you can offend is the one who's doing this and I'm fine with it. Yeah. They need to knock it off.
Starting point is 01:29:35 Here's an idea. This might be a cudgel. But if we're trying to solve the problem of them taking our bandwidth for something that's no longer relevant or interesting and has been out there for years, what if we could just toggle certain episodes? And this might be a cat and mouse game as well. But at a certain point, it's like, well, just give them the R2 URL and not the CDN URL.
Starting point is 01:30:00 And just let Cloudflare deal with it. You know, like just let them download it directly from Cloudflare. And we're just out of the equation then. We don't care about the stats. We don't care about anything. We're just like, you know, we serve this file plenty of times through our CDN. Now we're going to just let R2 serve it. What do you think about that idea?
Starting point is 01:30:19 I can see this being a very simple fix for this specific episode, right? Because we can just serve basically a Location header, we just do a redirect, and that's it. We're done with it. So it'd be like another synthetic response. The question is, if they're actually malicious, then they switch to a new episode and start doing that one. Exactly. Exactly. And we have other clients which are, for example,
Starting point is 01:30:39 we've seen that pass, right? They're basically busting the cash and purposefully going to R2 directly and just varnish is like a, almost like acts like a proxy in this case. Right. So we have that as well. We have every now and then we have like this, this random client that comes and downloads all the episodes and that's not the problem. So I think that some sort of a throttle would make sense, which would keep the system fair to everybody. But the throttle will need to be high enough so that it doesn't impact anyone else. Now, if, for example, our requests or like, if our audience grows or we become more popular and we get more requests, obviously would need to be aware of where the limit is and start
Starting point is 01:31:21 increasing the limits, right, once we are throttling too much, maybe. That seems to be more like long term and it seems a more, I know, like a well-engineered approach in a way. But certainly the simplest thing would be just like, take this one URL. I mean, that could be done in minutes, roll it out. And then we would stop this abuse for this specific MP3. That would be the easiest thing, for sure. So yeah, I can see how pragmatic that approach is.
Starting point is 01:31:50 And I like the pragmatism. Well, it's at least worth checking to see if, you know, the mouse is still alive over there. Right. You know? Yep. And if they are, well, then we'll know that this is a cat and mouse game. But if it's just like somebody left. the blinker on, we're just going to turn their blinker off for them.
Starting point is 01:32:09 Yeah. And see if it just, the problem goes away. And if it changes to a new MB3, then yeah, we need more generic solutions. Yeah. We may not need that at all. I do have to say that the internet these days is very different from the internet even like a year ago. With the rise of LLMs and AIs, I'm starting to see patterns in our traffic, which are unlike
Starting point is 01:32:31 any other time. We have these very big spikes when a lot of data has been requesting. in very short periods of time from, I mean, the user agents, they don't make much sense. I mean, I know they're spoofed. There are many IPs which are being used. So it's almost like there's like a, I know, some system which wants a lot of our content is seeing silly things because some requests, they just don't make sense. Like, for example, what benefit does static fav icon have?
Starting point is 01:33:01 Like, what's up with that? That just makes no sense. It's a small file. Maybe it's a heartbeat. Or a version of a heartbeat? Maybe. But this is the first time I've seen this specific file being downloaded this many times. I haven't seen this before.
Starting point is 01:33:14 Which makes me think it's a trend: we'll start seeing more and more requests that don't make sense. And then you start having to set up some form of protection for all sorts of clients that are just doing the wrong thing. You need like a defensive layer by default. Exactly. Exactly. Yeah. And something that would be fair to regular clients. Like, for example, when I want to do a benchmark, I mean, sure, it's me, and I wouldn't want other people to do that, because I'm testing the system, making sure the real-world, the production system, everywhere in the world, is working correctly, and I'm aware of what that means and how much it costs. And by the way, my IPs are removed from all the stats, because otherwise you'd see those massive benchmarks. So we account for that, but we can't account, can't account for all these weird clients.
Starting point is 01:34:04 it's a challenge. I think it's a good one, but it just sets us up to, you know, when you become older, it feels like this is more like, like an adult problem. Right?
Starting point is 01:34:14 So we like got the thing barely working. We got it out there. We know we made it stable, reliable, all that. Now we're hitting almost like, feels like a new layer of problems. And then this to me is like a hint
Starting point is 01:34:26 as to like the next phase. Oh, to be a kid again. Yeah. Well, right. One positive thing I think is the robustness of our observability.
Starting point is 01:34:36 Like being able to have this visibility is great. Because otherwise we're like, you know, wow. Pat ourselves on the back, Aaron Parecki. Let's get you back on the pod, because, man, you are big all over the world. That's amazing. Look at those download numbers, you know? So what's your one last thing for the road ahead? I agree.
Starting point is 01:34:56 What's my one last thing? Yeah. So keep it short. We'll keep it fun. I mentioned about the Christmas tree. I mentioned about the various things. which I had going over the holidays. So make it work club.
Starting point is 01:35:10 That's the place. You're there. Both of you are there. So you can join whenever you want. Next Thursday. Yeah, next Thursday. I'm going to talk about a hundred gigabit wan. The hundred gigabit wan.
Starting point is 01:35:27 So why would I need such a thing? Smoking. It's smoking, for sure. So, I thought the CCR2004, it has like four CPU cores, it has multiple 10 gigabit SFP+ ports, it even has two SFP, sorry, SFP28 ports, but it doesn't have a switch chip. And people that know a little bit about hardware: you want a switch chip to do hardware offloading, L3 and even L4. So after I bought the CCR2004, it was almost like a Christmas present, I thought surely this will be enough for the rest of my life,
Starting point is 01:36:04 and no, I had to get the flagship. So I'll be talking about that, the land, the setup, quite a few things coming up. And it just goes to show how much I enjoy the hardware side of things as well, the networking side of things. Like I shaved two milliseconds of my wand. It's amazing. Like little things like that, you know, like, it was already good. It was like sub five milliseconds.
Starting point is 01:36:33 But I wanted like sub three milliseconds. It is now 2.4 milliseconds. So the next, and what it means, like, why would I do this? So first of all, I'm all about improving. And every winter, I improve the network. In this specific instance, I wanted the pages just to be snappier, like, things to load a lot quicker, to handle a bit more traffic, but also to not have an impact. Like, I was running that benchmark, like 2.5. Look, I'm going to do another one. Right now. Let's see. I have a speed test right here. Speedtest, London, let's go for this one. So we're recording, we're streaming, right? And I'm just pulling 2.5, 2.6 gigabits down. And there's no interruption on my network. Right. So it's just my bread and butter. You know, that's how I work.
Starting point is 01:37:20 And by the way, if you see any buffering or any slowing down, let me know. I see Adam a bit more pixelated. Maybe you can see him pixelated too. I don't know. But yeah, I just pulled six gigabits, three down and three up. And it's just what I do every day. I work with this stuff and, yeah,
Starting point is 01:37:41 enjoy it. And by the way, this is the slower gateway router. So I'm getting the proper one set up. And I'll talk about that. And there's so many things there. Like VLanning is quite a thing. I have a new IPV4 block, by the way.
Starting point is 01:37:57 So some would say that I'm preparing for hosting something. And maybe I am. I don't know. We'll see how that works. But I just realized that my home connection, obviously I couldn't serve all the MP3s that were being downloaded. Like that would really cripple my connection if that was happening. But I'm at 2.5 gigabits. The next one will be 5 gigabits.
Starting point is 01:38:20 And the hardware can do it. And the 5 gigabits, I mean, that's like a decent server. Sure. And if you can do 5 gigabits all day every day. Sorry. Yeah, gigabits per second. That's pretty decent. So I'm just waiting for more internet.
Starting point is 01:38:37 I'm going to say you're going to have 100 gigabit wan, but you're not going to, you're not going to have a connection for it, right? Right. So very few places in the world have that. So if I was in Switzerland, I would get 25 gigabytes. Now, would you move? Would you move for this? Of course.
Starting point is 01:38:52 The only reason to be. Yeah, it's the 25 gigabit connection. But I know that 100 gig is coming. So we'll see. Are they either ship it by the time I move or I come, I move and then they ship it. So it's one or the other. Okay. The important thing is I have the router to handle that.
Starting point is 01:39:14 You'll be ready. You will be ready. Exactly. So I'm a prepper. I'm prepping for that. Prepping for a good internet. And this is just like a... And interestingly, five years ago when I got like the previous router, I did the same
Starting point is 01:39:27 thing. It's like a forum post. They'd like a follow up. So I just did a follow up recently. at this milestone. So I've been at this for some number of years. And I like optimizing my network and making sure that it's in tip, top condition.
Starting point is 01:39:39 Relentless. I love it. So relentless. Good stuff, Gerhard. Well, that's a happy note to end on, right? That's a happy note to end on. Observability in 100 gigabit?
Starting point is 01:39:50 No. There you go. All right. Well, the good news for Kaizen is we have a lot to work on. Always. Always. That's what it seems. Yeah.
Starting point is 01:40:00 We know how to pick them, don't we? Oh, my gosh. The rabbit hole goes deep and we keep going in. Kaizen. My friends. Kaizen. Kaizen. All right, Kaizen 22 is in the bag.
Starting point is 01:40:15 Join the discussion in our community Zulip. Head to changelog.com/community to sign up for $0. And of course, check out all of Gerhard's passions at makeitwork.com. Thanks again to our partners at fly.io and to our beatfreak in residence, Breakmaster Cylinder. Next week on the pod: News on Monday, Damien Tanner from Layercode on Wednesday, and Techno Tim catches Adam up with the state of home lab tech on Friday.
Starting point is 01:40:40 Have a great weekend. Recommend us to a friend if you like the show, and let's talk again real soon.
