The Changelog: Software Development, Open Source - Kaizen! S3 R2 B2 D2 (Friends)

Episode Date: August 11, 2023

Gerhard joins us for the 11th Kaizen and this one might contain the most improvements ever. We're on Fly Apps V2, we've moved from S3 to R2 & we have a status page now, just to name a few....

Transcript
Starting point is 00:00:00 Welcome to Changelog & Friends, a weekly talk show about R2-D2. Thank you to our partners for helping us bring you awesome pods each and every week: Fastly.com, Fly.io, and Typesense.org. Okay, let's talk. All right, we are here to Kaizen once again for the 11th time. Second time on Changelog & Friends, and we're joined by our old friend Gerhard Lazu. Welcome back, Gerhard.
Starting point is 00:00:50 Hi. It's so good to be back. It feels like I'm back home. You're home here with us. We have the comfy couch over there for you that you can just sit back and relax. You've got the mic boom arm, which you can bring the mic really close. Yep. Got comfortable headphones, and of course your favorite drink. Yeah, feels great. For you too, I hope, dear listener. If not, pause it, go and do your thing, and then resume. That's right. So Kaizen: always
Starting point is 00:01:18 be improving. That's it. Is that what we're doing? Are we improving? Are we just making progress? My perspective, we are improving. Okay. What are we improving? Let's talk about that. What and how? Okay. Yes, we're trying to, for sure.
Starting point is 00:01:36 That is the aim: to continuously improve. And in order to do that, I guess you change things, right? You're like, well, we've been doing X. Let's try Y. And this is our new two month cadence. It's been roughly two months since we last recorded. So we're good there. It's the summer months, which for me at least means a little bit more time to work on things, because less news, less events, less things going on, more vacations, which does slow us down. I think we've all taken a little bit of time. But Kaizen 11, if you look at the discussion we have,
Starting point is 00:02:11 which is discussion 469 on our changelog.com repository on GitHub, we'll link it up, of course. But Gerhard does a great job of outlining each Kaizen with its own discussion. This one's got a bunch of stuff in it. Gerhard, this is maybe like the best Kaizen ever. I think so. I do. Like, so many things changed, like I couldn't believe it. Because when you work on it, and it's like week in, week out, you maybe add one thing, or maybe even half a thing. There were weeks when nothing happened. Yeah. But then two months are up, and you look at it, and you have like seven things, and some of them are really big.
Starting point is 00:02:48 And you think, whoa, that's a lot of things that changed. Did they improve? Back to Adam's question. Well, let's figure it out. I mean, I obviously have my perspective and I can share the things which I think improved. But for the listeners, what improved? And for you, Jared, what improved? And Adam as well.
Starting point is 00:03:06 When it comes to the app, when it comes to the service, did you notice anything improving? I would say the biggest improvement, oh, I had to think about it. Front-facing features, maybe not so much. I mean, this has all kind of been infrastructure back end. Of course, I'm always tweaking the admin and improving it for us. The biggest thing for me has been the change of how we deliver all transactional email through the application, which did introduce a very difficult to debug bug, which I haven't actually quite figured out yet. I just worked around, which we can talk about.
Starting point is 00:03:51 But we're using Oban for all email delivery, which includes all of our newsletter delivery. So literally for a while, the first step was just like, okay, when we send out Changelog News, we need to send that out with a persistent background queue. We can't just ephemerally do that, because it's just a lot of emails, and you don't want to have something die midway through the process and, like, half of our readers don't get their newsletters. Like, we need to make sure that's robust. And so I put that through Oban. We also don't want it to be sending
Starting point is 00:04:20 duplicates, which actually is kind of the bug. It's doing it anyways. But his face, y'all. You missed his face in the video. So I saw his face. It was just hilarious. I had to laugh, sorry. What did my face look like? Defeat? Was it utter defeat? What was my face? It was just like... it was disgust and defeat and humor at the same time. Well, let's look at it this way: we're sending emails twice as hard. Twice is better than none. We laugh so that we don't cry, Adam. We laugh so we don't cry. The weird thing about this is that it's not like, hey, everybody gets two emails. Like that would kind of make more sense.
Starting point is 00:04:57 It's just you, right? It's not just me. I wish it was just me. It's a handful of people that get like 30, 40 of the exact same email. It's like, it's requeueing them. I can't figure it out, but I just reduced it down to a single worker, because I had five workers going and somehow it was just like requeueing. I'd love to figure that out.
Starting point is 00:05:16 But right now we're back to, okay. But yeah, it was very embarrassing. It's like certain, a handful. And one time it was our guest. Trying to think who the guest was. He was very gracious. Solomon? I'm like, it wasn't Solomon, no.
Starting point is 00:05:28 But maybe it was, and he didn't tell me about it. It's just like, you get 35 of this thank you email versus one. But everybody else just gets one. Very strange. You lucky lot. You don't have enough Changelog in your inbox. Or DDoS in your inbox. So that was a big change.
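For anyone curious what the persistent, deduplicated delivery setup being described could look like, here is a minimal Oban sketch. The module and function names (Changelog.Workers.NewsletterMailer, Changelog.Mailer.deliver_newsletter/2) and the uniqueness window are illustrative assumptions, not the actual changelog.com code:

```elixir
defmodule Changelog.Workers.NewsletterMailer do
  # Hypothetical module; the real app's worker and mailer modules will differ.
  use Oban.Worker,
    queue: :email,
    max_attempts: 5,
    # Treat a job as a duplicate if one with the same worker + args was already
    # enqueued in the last 24 hours; one guard against double sends.
    unique: [period: 60 * 60 * 24, fields: [:worker, :args]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"issue_id" => issue_id, "subscriber_id" => subscriber_id}}) do
    # If a node dies mid-delivery, Oban retries the persisted job on whichever
    # instance picks it up next, instead of silently losing half the sends.
    case Changelog.Mailer.deliver_newsletter(issue_id, subscriber_id) do
      {:ok, _email} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Limiting concurrency the way Jared describes is then a one-line queue setting in the Oban config (for example, queues: [email: 1]), though as Gerhard points out later, that is one worker per running instance, not one worker total.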
Starting point is 00:05:44 Also a bit of a headache, but it did prompt me to finally do what I had said I was going to do last year, which is we signed up for Oban Web, which is supporting the Oban Pro cause. And we don't have that quite in place because I want more visibility into our background jobs, basically.
Starting point is 00:05:59 The way I'm getting visibility right now is I proxy the Postgres server so I can access it. And I literally am looking at the table of Oban jobs and doing things. Like a boss, straight in prod. Of course. As you do when you're the boss.
Starting point is 00:06:18 Real developers develop in production. Let's just be honest about it. You got to do what you got to do, man. I got to see what's going on here. You got to improve on that. That's where the action's at. You know, I would rather have the nice web UI, but right now all I have is a UI into the database table. Now, because you mentioned that, that's the one thing which I didn't get to. It's on my list. So close. What else is coming up? It will not take long, but I had, I had a good reason for it.
Starting point is 00:06:43 What's that? I did all the other eight things. Oh, I see. I guess that's good enough. You accomplished all these other things instead. So the weird thing about that, I guess the reason why I had to pull you in on it, because otherwise... so Oban Web is just distributed as an Elixir package, right?
Starting point is 00:07:02 And the thing is, because it's effectively an open core kind of a thing, where he has Oban, which is a package which is open source and free and all that, and then he has a subscription. It's distributed via its own hex repository that he hosts and has some sort of credentials, which is fine locally, but then you have to somehow get that
Starting point is 00:07:24 so that your CI, when it goes to install all the things to deploy, it can actually authenticate to his hex server. And that's why I was like, eh, Gerhard should probably do this because I'm not sure exactly how much Go code has to be written in order to get that done. And so that's why I pass it off to you. But yeah, I would love to have Oban Web for the next Kaizen
Starting point is 00:07:43 because it'll help me figure out exactly what's going on with these duplicate emails. That being said, aside from that particular bug, it is really nice to have the persistence and the ostensible opportunity just to have a single send versus if we have 15 nodes running the app,
Starting point is 00:08:00 who knows who's going to grab that and just send it, right? So that's a big one for me. And the replayability of the emails, to resend in case one bounces and stuff, we didn't have that before. So that's cool. Right. So by the way, I hope this isn't a surprise.
Starting point is 00:08:19 I would hate for it to be, but let's see what happens. We are running two instances of the changelog, which means that even though you scale down the worker to one, there will be two workers running, right? That I do know. Okay. But it fixed it anyways, so I don't care at this point. It's working. Amazing. Maybe, listener, if you've gotten more than one email from us,
Starting point is 00:08:38 maybe 15, maybe 73 emails, do let us know. We want to know these things. Oh, gosh. That's a lot of emails. We don't want to. Yes. So we're finally distributed then? We have two versions of the app. So we are finally telling the truth
Starting point is 00:08:52 in terms of we put our app and our database close to our users? So it's all in Ashburn, in Virginia. Okay. So it's all in US East, in that data center. Now we have the option of obviously spreading them
Starting point is 00:09:07 across multiple locations and we should do that. That's like the next step. So from one, you go to two. That is like a nice improvement. And then if that works and we're happy with that,
Starting point is 00:09:16 we can go to more, and that's the plan. But before we can do that, we should also obviously replicate the database. So we should have multiple followers, like one leader, multiple followers, and then obviously all the local apps... sorry, all the apps
Starting point is 00:09:30 which are not running in Virginia, in Ashburn, they should connect to their local Postgres follower, so they're connecting closer to themselves. Exactly. Versus back to the one in Virginia. Exactly. And then we can even, like, remove shielding, for example, from Fastly. But that's a change which I didn't want to do before we had multiple locations. So right now we're like a single region, but more than one, which is already an improvement. Right. Can we do a quick TLDR, TLDL of why it's finally happening? It's caching, basically, right?
Starting point is 00:10:03 That's the reason why we haven't been able to replicate the application was because of caching issues? Yes, but no. Okay. Yes and no. Okay. So not a TLDL then, give the longer, short, long version of that. Why?
Starting point is 00:10:19 So the Oban workers were an important part of that, right? So knowing that we're not basically missing important operations was essential. So because the job's now going to the database, when, let's say, the app stops halfway through doing something, that's okay. The other instance can pick up the job and resume it and send the email. In this case, 38 times. I mean, there's something there. Well, they're just overachievers. I always liked my code to go above and beyond. So that was one. The other one, the caching side of things, I think it's okay if multiple backends
Starting point is 00:10:53 have like different caches. I think that's okay. Obviously we'll need to look into that as well. And this is like, back to you, Jared, where are we with that? Because I still don't know what is the plan. After all, we've been back and forth for a year now, two years. We're getting closer, but we're not there yet. So one thing that we should point out with regards to this, I think the Fly machine upgrade to Apps v2 is pushing us into this new world. We did not, at this point in our lives,
Starting point is 00:11:31 choose to go into this new world. And maybe, Gerhard, you knew the date better than I did, but we knew this migration was coming, where Fly was saying you have to upgrade to Apps v2. I just didn't realize it was going to happen when it did. And so I wasn't ready. The code wasn't ready. And the migration went through just
Starting point is 00:11:52 fine. You did some work there. You can talk about the details of that work. We're now on Apps v2. And that allows us to run these multiple nodes at the same time. It requires it, really. Doesn't it require it? So, I mean, you can run just a single one. Okay. So a couple of things. First of all, this migration was like a progressive rollout.
Starting point is 00:12:10 So certain apps of ours, we received a notification like, hey, these apps will be migrated from V1 to V2 within, like, the next week. And then our app, the changelog app, it required a blue-green deployment strategy. Blue-green was not implemented for Apps V2 until maybe a month ago, maybe two months ago. Actually, it was like a month ago, because two months ago they didn't have this option. So a month ago this was enabled. Shortly after, I think a week after, we received this notification: hey, this app will be migrated to V2. But the problem for us is that our deployment configuration, I wasn't sure whether it would work with V2, because what they say is, hey, if you have a chance, try and save the Fly config after the upgrade, because some things may have
Starting point is 00:12:57 changed in the configuration. So in our case, we didn't have to do that. Like, everything continued working, which was a nice surprise. But when you go to V2, when you go to Machines, or Apps V2, their strong recommendation is to run more than one. And the reason for that is the app will not be automatically placed on another host if a host was to have a physical failure. It doesn't happen that often, so on and so forth. But actually, the bigger the provider is, the more often it happens. So in our case, I think we would have been fine, but I wanted to make sure that we're
Starting point is 00:13:31 running on two hosts, just to prevent the app going down and then me having to basically jump into action and fix it. So that was, I think, pull request 475, and you can check it out on GitHub. So I was ensuring that the app deploys work on Fly.io machines. I did, like, a few small changes, a few small improvements. The flyctl, the CLI, was updated, a couple of things like that, and then everything worked as it should have. The warning which you get in the dashboard, in the Fly dashboard, if you're on a single instance, they say we strongly recommend that you run more than one. And they explain in their documentation why.
Starting point is 00:14:09 So it's basically a strong recommendation, and we did it. I see. Okay, so we did it. Yep. But I will now confess that we were not ready to do it. I didn't know we were doing it yet. And I did not fix the caching problem. Right.
Starting point is 00:14:23 Which we did experience. So a few people mentioned, hey, this GoTime episode appears in my podcast app, and then it disappears, and then it appears again. And I call that flapping. I'm not sure exactly what it is. But basically, depending on which version of the app you're hitting, the cache may or may not be up to date. And so the reason for this is because the way the code works is,
Starting point is 00:14:45 it's after you publish or edit or whatever, we go and clear the cache, which is just right there in memory in the application server. And we do not clear the cache across all of our application servers, because we aren't good enough to write that code yet. I do have what I think is the best case fix for this, which I learned from Lars Wikman, but I'm not going to use exactly what he built.
Starting point is 00:15:09 I think we just should use the Phoenix PubSub implementation. But in the meantime, I was like, well, this isn't cool. So I'm just going to reduce our cache times. These are response caches. So you hit changelog.com slash podcast slash feed. We deliver you an XML file of multiple megabytes. We cache that right there inside the application because it's not going to change.
Starting point is 00:15:34 and we'll watch Honeycomb and see if that puts ridiculous amounts of strain on things and slows down our response times, etc., etc., etc. This is behind Fastly, by the way. It's just that Fastly has a lot of points of presence, and every single one of them is going to ask for that file, and so there's still a lot of requests. That was just kind of good enough. It's working,
Starting point is 00:16:03 right? Things are fast enough, they're good enough, doggone it, people like us. So that's an old Stuart Smalley line for those who missed it. Because I'm good enough. I'm smart enough. And doggone it, people like me. But that's not really a fix. It's just a workaround. It stopped the flapping, because basically if you're out of date, you're only going to be out of date for 120 seconds and then you're going to get the new file. And so that's what I'm doing right now. I'm just clearing the caches every two minutes, and so every app server is going to be eventually consistent every two minutes.
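As a rough illustration of the workaround Jared is describing (serve from an in-memory cache, rebuild anything older than two minutes), here is a minimal sketch. It uses a bare ETS table and made-up names like FeedCache and render_feed/1; the actual changelog.com cache code is certainly different:

```elixir
defmodule FeedCache do
  # Illustrative only; not the real changelog.com cache module.
  @ttl_ms :timer.minutes(2)

  def start do
    :ets.new(:feed_cache, [:named_table, :public, read_concurrency: true])
  end

  # Serve the cached response if it's younger than the TTL; otherwise rebuild
  # it, store it with a fresh timestamp, and return it.
  def fetch(key, build_fun) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(:feed_cache, key) do
      [{^key, value, inserted_at}] when now - inserted_at < @ttl_ms ->
        value

      _missing_or_stale ->
        value = build_fun.()
        :ets.insert(:feed_cache, {key, value, now})
        value
    end
  end
end

# Hypothetical usage at request time:
# FeedCache.fetch("/podcast/feed", fn -> render_feed(:podcast) end)
```

The trade-off is exactly the one discussed here: every node becomes eventually consistent within the TTL, at the cost of regenerating a multi-megabyte XML document every two minutes whether or not anything changed.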
Starting point is 00:16:34 I'd much rather have the solution that actually makes sense, which is clustered app servers that are pub-subbing an opportunity to clear their caches and we can go back to infinity because there really is no reason to regenerate that until there's an update. But for now, we're just doing every two minutes. That was my quick fix,
Starting point is 00:16:49 and I was hoping that it wouldn't provide or require too much extra processing on our app servers. From what I can tell in Honeycomb, and I'm no expert, it seems like everything's okay. Obviously not operating at full potential at the moment. Yeah. So this is a very interesting thing that you mentioned because one of the improvements that we did,
Starting point is 00:17:08 we set up both SLOs that we were allowed to set up in Honeycomb. We can come back to why there's only two. So the first SLO is: we want to make sure that 95% of the time, podcast feeds should be served within one second. So in the last 30 days, that is our SLO. The second one is: 98% of all responses should be either 200s or 300s. So these are the two SLOs. Now, what we can do is dig into the first SLO, which is 95% of the time the podcast feeds should
Starting point is 00:17:41 be served within one second and see how that changed since you've made the caching change. What I've seen is no change when I was looking at this. Oh, good. So I'm going to share my screen now. Make sure that... That I audibly describe what we're looking at for our listener. Exactly, and that I'm not missing anything important.
Starting point is 00:18:01 So I will try to do this as well as I can. We're looking at Gerhard's screen. Yes. Thank you. This is a Honeycomb view of podcast feed response latency. Go ahead. That's it. So, 95%. As configured right now, the budget burndown... we are at negative 5.68%. Is that good or bad? Well, we've burned our budget, which means that less than 95%... so you can see 94.72. So we're failing. 94.72% of the time,
Starting point is 00:18:32 we are serving our feeds in less than one second. Oh, I can fix that. We just change the budget so that we pass it. Right, exactly. So that would be... exactly. So let's just agree on a new budget. That's it. Yeah, exactly.
Starting point is 00:18:44 Yeah, but this is just like supposed to give us like an idea of how well we are serving our podcast, sorry, our feeds. Now these feeds, as you know, they're across all the podcasts. And all clients all around the world, right? Exactly. Okay, so now I'm sharing all of it,
Starting point is 00:18:59 like the entire Brave browser. So I can basically open multiple tabs. So I was looking at the first one, 95% of the time, so regardless whether cached or uncached. I have, like, some saved queries here, so now we can see what is the latency of cached versus uncached feeds. These are the last 28 days. We're gonna... Hits and misses? Exactly. When we hit them, it's more or less the elapsed time. So I can just drill down in these, and I can see, last 28 days, they are served roughly within 0.55 seconds. So between half a second... 500 milliseconds, 550 milliseconds. Yeah, we'll take it. So let's go to the last 60 days and see if that changed. That
Starting point is 00:19:37 shouldn't have changed, by the way. We have a few big spikes, and by the way, this is Fastly, so Fastly serving this. We can see some spikes all the way to seven seconds. But overall, we're serving within half a second. No big change. So let's flip this and let's say, show me all the misses, because this means it goes to the app, right? 2.5 seconds. That's it.
Starting point is 00:20:00 So what we're seeing here, we can see how many of these requests went through. So we have about 2,000 in a four-hour period, two and a half thousand. And we can see that, latency-wise, we are at 2.3 seconds, 2.8 seconds. It varies. But over the last 60 days, we have obviously, like, an increase here, up to four seconds, 4.6 seconds, July 9th. But otherwise, it hasn't changed. Right. So my change was no big deal.
Starting point is 00:20:27 A nothing burger. Which is nice. Yeah. So here's the other thing. Let's have a look at the URL. I think this is like an interesting one. So let's group them by URL because what will be interesting is to see
Starting point is 00:20:38 which feeds take the longest, right? Like the P95. And you can see all the misses. So this is the podcast feed. The P95 is 2.9 seconds, Practical AI 1.5 seconds. But the one which has the highest latency is the master feed. And I don't think that's surprising. Not at all. If you actually just go download the master feed, I just did the other day, it's about 11 megabytes. I mean, it's a gigantic file. Yep. And we're recalculating the contents of that once every two minutes. Every other time, it's just sending the file. But even just sending that file from, I guess it's from Fly to Fastly, that's just going to take some time, and then sending it from Fastly, obviously, to
Starting point is 00:21:21 wherever it goes, well, that's up to Fastly. But the only way we can make that faster, there's two ways I can think of. One is you cache it forever, so you just get rid of that calculation time, which happens once every two minutes. Then all you have is send time. We're already doing gzip and whatever else you can possibly do in terms of just HTTP stuff. The only way you can make it faster is take stuff out of it, I think.
Starting point is 00:21:41 The only way you can make it faster is take stuff out of it, I think. Yeah, limiting it. And we used to do that. We used to limit it, because I think we have over 1,100 episodes in there, and there's everything, pretty much. Not the transcripts, thankfully.
Starting point is 00:21:55 Show notes, links, descriptions, all the stuff for every episode we've ever shipped. The only way to make that smaller is you just limit it to N episodes, where N is some sort of number like 500 or 100.
Starting point is 00:22:13 We used to do that. I would happily continue doing that if the podcast indexes would just keep our history, but they won't, they'll purge it. And then you'll go to our master feed and you'll see 100 episodes. And you'll be like, cool, they have 100 episodes. No, we've put out 1,100 plus episodes.
Starting point is 00:22:28 We want people to know that. We want people to be able to listen to them. We used to have complaints, hey, why won't you put the full feed in there? There is a feature called paginated feeds. It's a non-standard RSS thing that we used to do. And we paginated that and it was a much smaller thing. And then it was great, except Apple Podcast didn't support it. Spotify didn't support it.
Starting point is 00:22:52 Blah, blah, blah. It's that old story. So I was like, screw it. I'm just going to put everything back in there and it's just going to be expensive and slow. And that's what it is. What do you guys think about that? Like, is that a good trade-off? Just leave it, because it's an 11 megabyte file.
Starting point is 00:23:04 I don't know. What do you guys think about that? Is that a good trade-off? Just leave it because it's an 11 megabyte file. I don't know. What do you do? Well, I think that serving the full file is important for the service to behave the same. So I wouldn't change that. If you change the file, it will appear differently in these players. So I don't think we should change that. If they all supported pagination, I would happily paginate it
Starting point is 00:23:24 every 100 episodes or so. And we have a bunch of smaller files that are all faster responses. And like, I would love to do that, just like you do with your blog. Yeah. But you know, you can't make these big players do the cool stuff.
Starting point is 00:23:38 But I think that's okay. So if we look at how long it takes to serve this master feed from Fastly, from the CDN directly, versus our app: when it's a cache hit, when it's served directly from Fastly, we are seeing a P95 of 2.7 seconds. Okay. When our app serves it directly, it's 9.9 seconds. So it's roughly three times slower. I don't think that's so bad. Our app is three times slower than Fastly. I think that's okay.
Starting point is 00:24:11 I also think with this kind of content, it's okay. If this was our homepage, this would not be okay. If we had humans consuming this, it would not be okay. But podcast apps, crawlers... like, oh, sorry, you had to wait three seconds to get our feed. Who cares? You're a crawler. You wait around until there's the next one. Yeah. So the fact that we have slow clients, and clients that aren't people, they're actually just more machines consuming it, I have less of a problem with that being just not super fast. If this was our homepage, it'd be all hands on deck until you get it fixed. There's no way I'd make people wait around for this kind of stuff. What I'm wondering, if we can improve this, is do we care about the cache hits, or do we care about
Starting point is 00:24:53 the cache misses? Because based on the one that we care about, we can maybe see if there are some optimizations we can do. About cache hits, I'm not sure what we can improve, because it's not us, it's basically Fastly. I'm not either. But cache misses, and I think this is something really interesting: if I dig into Honeycomb, into the cache misses, you will notice something interesting. You'll notice that there was quite some variability. So it would take anywhere from four seconds to 15 seconds, right? Like, we see the squiggly lines.
Starting point is 00:25:20 But then from August the 4th, we're seeing five seconds, eight seconds, six seconds. So it's less spiky, and it seems to keep within 10 seconds. There seems to be an improvement. So I'm wondering, what changed there? Did something change on August the 4th? I have a hunch. I'm going to see if it's that.
Starting point is 00:25:50 I'm not sure. August 4th... I'm looking at the commits on August 4th. Run Dagger Engine as a Fly machine. That was your commit on August 4th. Yeah. And that was the only one in terms of things that might've gone live. Well, I think August 2nd, that's when I commented, this was merged last week. We upgraded PostgreSQL.
Starting point is 00:26:03 Upgrade Postgres, on August 2nd. Yep, went from Postgres 14 to Postgres 15. It's possible, because the app is hitting the database, not on every miss, but once every two minutes, and that's going to be really slow, right? And so that could slow down other requests.
Starting point is 00:26:37 So again, like all these other pull requests, I mean, it went from AWS S3 to Cloudflare. Again, I don't think that's related. No. And that happened on... It was just over the weekend. Three days ago, August 7th.
Starting point is 00:26:50 August 5th, actually. So that happened after our improvement. But that wouldn't have anything to do with the feeds. That's just the, you know,
Starting point is 00:26:56 that's the MP3s themselves, but not the feed files. So, well, cool. So Kaizen, right? So we upgraded Postgres and we got a little bit faster in our cache miss responses on our feed endpoints. Well, it's just the master feed that's improved. I mean,
Starting point is 00:27:29 that's the one where there's, like, some improvement there. But then, like, in this view, it's difficult to see the P95... miss, miss, Go Time feed... I mean, maybe we can just zoom in on that. I mean, do we want to continue doing this, or shall we switch topics? I don't know. Adam, how bored are you at this point? One more layer. One more layer. Okay. Oh, he's never bored. He's always going to go one more. All right. Peel the onion. So this one seems to not have changed. Like, the Go Time feed hasn't changed, right? If anything, it looks like slightly worse here, but again, like, we would need to continue digging to see, like, hey, what client, where, which data center it's coming from, things like that. Yeah. So maybe it's location specific, right? It's a client, for example, from Asia, which is going to Ashburn.
Starting point is 00:28:09 Obviously that's going to be slow, or we have a few clients from, you know, a certain geographical region. Getting routed to the wrong place. Yeah. Which would add to this latency, exactly, routing differently or whatever. But the master feed improving, which is, by the way, the one which takes the longest, I think that's a good one.
Starting point is 00:28:27 And the improvement was around eight seconds, roughly, plus or minus? Yeah, roughly eight seconds. Yeah, I mean, in terms of percentages, it's more than 2x, 3x, almost 3x faster. Yeah. So that's a huge one. And so to kind of zoom back out a little bit, Jared, you're saying the perfect world would be a PubSub multi-geo application to know if the update should happen and do it indefinitely. Right. Or infinitely, I think you said, versus this temporary, you weren't ready necessarily to, you know, version from V1 to V2 apps. And then you made it update every two minutes instead of infinitely, because that would obviously have the cache issues we have with clients.
Starting point is 00:29:08 Yeah, exactly. There's no reason not to cache forever because the file doesn't change until we trigger a change by updating something. Right. When it updates, we update it and then it caches again forever. Exactly. And right now when we update it, we clear the cache, but it only clears it on the app server that's running the instance of the admin that you hit. The other app servers that aren't running that request
Starting point is 00:29:32 don't know that there's a new thing. Well, with Phoenix PubSub and clustering, you can just PubSub it. You don't even have to use Postgres as a backend, which is what Lars Wikman's solution does. And you can just say, hey, everybody clear your caches. And they'll just clear it that one time.
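To make the Phoenix PubSub idea concrete, here is a minimal sketch of what cross-node cache invalidation could look like. The pubsub server name (Changelog.PubSub), the LocalCache module, and the topic are all assumptions for illustration, not the actual implementation being planned:

```elixir
defmodule CachePurger do
  # Sketch only: broadcast "purge" messages so every clustered node clears its
  # own in-memory response cache, allowing entries to otherwise live forever.
  use GenServer

  @topic "response_cache"

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Called from the admin after a publish/edit; reaches all connected nodes.
  def purge(key) do
    Phoenix.PubSub.broadcast(Changelog.PubSub, @topic, {:purge, key})
  end

  @impl true
  def init(:ok) do
    :ok = Phoenix.PubSub.subscribe(Changelog.PubSub, @topic)
    {:ok, %{}}
  end

  @impl true
  def handle_info({:purge, key}, state) do
    LocalCache.delete(key)   # stand-in for the node-local response cache
    {:noreply, state}
  end
end
```

The piece that makes the broadcast reach every instance is clustering the nodes (for example with libcluster over Fly's private network), which is the part that isn't in place yet.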
Starting point is 00:29:49 And then we never have to compute it until we're actually publishing or editing something. And so that's darn near as good as a pre-compute. Because I know there's a lot of people out there thinking, why don't you just pre-compute these things? This is why static site generators exist, et cetera. Because that is just a static XML file effectively until we change something.
Starting point is 00:30:07 That's a different infrastructure. That's a different architecture that we don't currently have. And so it's kind of like easy button versus hard button. I've definitely considered it, but if we can just cache forever and have all of our nodes just know when to clear their caches, then everything just works hunky-dory. For now, we'll just go ahead and take the performance hit,
Starting point is 00:30:26 recalculate every two minutes. That seems to be not the worst trade-off in the world, looking at these stats. But that would be a way of improving these times. But now that you mention that, when we used to cache, when we used to have a single app instance, I didn't see much better times. The feed was being served in more or less
Starting point is 00:30:42 like the same time, right? So if I look at... this is the Go Time feed when we had misses. So let's go with 60 days. In a 60-day window, right, it was just under two seconds, and it hasn't changed much, even when it was cached. That's weird. So what I'm wondering is, again, going back to generating these files: could we upload them to our CDN? And our CDN, by the way, we have two, right? Of course.
Starting point is 00:31:09 Now there's another thing to talk about. We have Cloudflare R2. So could we upload the file to R2 and serve it directly from there? Yes. That was one of my other architecture options, is doing that. You have a lot of the same problems in terms of updates and blowing things away and all that.
Starting point is 00:31:27 It's definitely a route that we could go. We have a dynamic web server that is pretty fast and is already working. And so just running the code at the first request, to me, makes a lot of sense. But we can certainly, at the time of update or publish or whatever it is, go ahead and run all the feeds, pre-compute them, and upload them somewhere. I'm thinking Oban. We want more Oban, right? We could. And it doesn't matter which instance
Starting point is 00:31:56 picks up the job and which instance uploads the feed, ultimately one of them will upload the feed on our CDN and that will be that. So one thing we don't want to change is the URLs to our feeds. And our URLs to our feeds currently go to the application.
Starting point is 00:32:13 No, because we, remember, we have Fastly in front. So we can add some rules in Fastly to say, hey, if this is a feed request, forward it to R2, don't forward it to the app.
Starting point is 00:32:33 What if R2 doesn't have the file for some weird reason? Well, it will, right? It will have like an old one. That's what you think. What if it doesn't? Well, you've already uploaded the file once. You're just updating. Well, we have to blow it away.
Starting point is 00:32:46 Maybe it's just a race condition at that point. Well, why can't you just re-upload it? You're doing a put. Well, that's true. Is that atomic? I assume, I guess that's the point, atomic. It should be. I mean, the file's already there.
Starting point is 00:32:55 It's just basically updating an existing object because it's an object store. Right. You're saying, hey, there's like this new thing. Just take this new thing and then the new thing will be served. Okay. Yeah, we could definitely try that route
Starting point is 00:33:06 and then we just turn off caching at the... Well, our application server would never... Never even see those requests. So we would lose some telemetry that way because we are watching those requests from the crawlers because some crawlers will actually report subscriber counts. I see.
Starting point is 00:33:22 And so our application's logging subscribers, which is a number that we like to see, from those feed crawler requests. So we lose that visibility. Maybe we can get it at the Fastly layer somehow. Fastly logs everything to S3. We're putting more and more stuff into Fastly at this point as well. So I just, I'm tentative. I like to have everything in my code base if possible. But the folks at Cloudflare right now are really upset by this conversation.
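For what it's worth, here is a rough sketch of what the "pre-compute the feeds and push them to R2" option could look like as an Oban job. It assumes ExAws is configured to point its S3 client at an R2 endpoint; FeedBuilder, the bucket name, and the queue are made-up stand-ins rather than real changelog.com code:

```elixir
defmodule UploadFeed do
  # Sketch of the publish-time upload idea; illustrative names throughout.
  use Oban.Worker, queue: :feeds, max_attempts: 3, unique: [period: 60]

  @bucket "changelog-feeds"   # hypothetical R2 bucket

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"slug" => slug}}) do
    # Render the feed once, then store the static XML where the CDN can be
    # pointed at it.
    xml = FeedBuilder.render(slug)

    @bucket
    |> ExAws.S3.put_object("#{slug}/feed.xml", xml, content_type: "application/xml")
    |> ExAws.request()
    |> case do
      {:ok, _response} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end
end
```

The open questions raised in the conversation still apply: Fastly would need a rule to route feed paths to the R2 origin, and the subscriber-count telemetry the app currently collects from crawler requests would have to come from somewhere else, such as Fastly's logs.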
Starting point is 00:33:37 Well, we're using Cloudflare behind, so that's there. I know, but we're not using their stats. We're not using... The more entrenched we are in Fastly's
Starting point is 00:33:54 way of things. They're like, no, that's the dark side. Right. And the Fastly folks are probably thinking, you're using Cloudflare? No, that's the dark side. Yeah, well, we're using both. So we have two CDNs. But we're using them differently though, aren't we? Aren't side. Yeah, well, we're using both. So we have two CDNs. But we're using them differently though, aren't we? Aren't we using them differently? Like we're using R2 simply as object storage, not CDN necessarily. That's right. We replaced AWS
Starting point is 00:34:13 with Cloudflare, not Fastly with Cloudflare. So far, who knows where we go from here. But let's talk about this migration because this was a big chunk of work that we accomplished. Yeah. Well, the first thing which I want to mention is that you've made this list, which really, really helped me. It was like a great one. I wasn't expecting it to be this good. No, I was. I'm joking. I'm just pulling your leg. No, no, no. I was. I was. This is better than I normally do. I was like, you know what? Let's open a pull request. Let's do this the right way. Yeah. I was surprised just by how accurate this list was. Wow, Jared knows a lot of things, like how this fits together.
Starting point is 00:34:48 I'm impressed. So I genuinely appreciated you creating this. By the way, this is pull request 468 if you want to go and check it out. I mean, you created like the perfect, like, hey, this is what I'm thinking. Like, what am I missing? And actually you didn't miss a lot of things.
Starting point is 00:35:03 Good. So we went from S3 to R2, where as you know, we're using AWS S3 to store all our static assets, all the MP3s, all the files, all like all the JavaScript and the CSS and all those things and SVGs. And we migrated, might I say with no downtime,
Starting point is 00:35:23 like zero downtime. Zero downtime. Yeah, on a weekend, as you do, right no downtime, like zero downtime. Zero downtime. Yeah, on a weekend, as you do, right? I was like sipping a coffee. Okay, so what should I do this weekend? How about migrating hundreds of gigs from S3 to R2? And it was a breeze. It was a real breeze.
Starting point is 00:35:39 That's awesome. And your list played a big part in that, Jared. So thank you for that. That's good stuff. Wow. Let's put a little clap in there. Thanks, guys. Appreciate that. Applause. I'm looking at number
Starting point is 00:35:52 six. Make sure we can upload new files. And you didn't do that yourself, but you can check it off, because I just uploaded Changelog News yesterday. Everything worked swimmingly. We published a new MP3 file without any issues whatsoever. Amazing. Where should we start? Should we start with the why on this one? I mean, I think... well, the why is easy, right? Maybe Adam can queue up with a why here.
Starting point is 00:36:11 Well, I just pay attention to how much money we spend. That's right. Every dollar comes out of our bottom line, pretty much, right? And I was like, why is this doubling, you know, every so many months? And then it was just like... it had been very, very small for so long, like sub 10 bucks for a very long time. In the last year, it's gotten to be like 20, 30, 40, 50, and then recently it was over 100. Yeah, I think about six months ago, a few Kaizens ago. And I'm like, why? And we couldn't really explain exactly what, but then we explained some of it, but then it only went down a little bit. Then it went back down to like 120 bucks. But that's a lot of money to spend on object storage, right? I mean, it's just, it's more than we want. And when you can get free
Starting point is 00:36:52 egress, well, you take free egress. Yeah. So one theory you mentioned... I think we actually got to 150 at one point, maybe the last time I recorded a Kaizen, which really was like, okay, let's make some moves here. Because if it goes to 300, if it doubles from 150 to 300, that's an issue. So I knew that it would be a bigger lift to migrate our entire application, which is the bulk of it, because of all of our MP3s.
Starting point is 00:37:18 Which Fastly, of course, is serving them, but we are putting this as the origin for Fastly, and so it's requesting them from S3 for us. And that was the major cost. It was like outbound traffic, major cost on S3. And so we knew with R2 we'd have zero on that. This took two months-ish from then, like we actually landed this,
Starting point is 00:37:35 it was almost two months from us realizing that we should do this. However, changelog.social, which is a Mastodon app server, was also on S3. And I immediately switched that one over to R2 just to try out R2. And it was super fast and easy to do that. And I think we went from 150 down to 120. It started to drop precipitously after that. And I think it's because of the way the Fediverse works. When we upload an image to Mastodon, as I do with my
Starting point is 00:38:06 stupid memes and stuff, right? And we put it out on our Mastodon server. Well, that image goes directly from S3. Oh, I put Fastly in front of it too, I thought. I might have. But somehow that image is getting propagated around because all the Mastodon instances that have people that follow us have to download that image for them to be able to see it. And so this architecture of the Fediverse, where it's all these federated servers, they're all having to download all those assets. And so I think maybe that was a big contributor to that cost,
Starting point is 00:38:38 was just changelog.social. And once I switched that, it started to come down. And now it's going to go to pretty much zero because of this change. Yeah, it should just be a few dollars. And I think we have a few things to clean. So I was looking... I basically enabled Storage Lens, which is an option in S3, and you can dig down. So I'm just going to, again, share my screen.
Starting point is 00:38:58 I'm going to come back to 469. Obviously, you won't have access to this, but if you're using AWS S3, you can enable Storage Lens and have a play around with it. So what I want to see is here: extended Storage Lens. Okay, and now it loads up, and we can see where the cost goes. So we can see the total storage, we can see the object count, we can see the growth and how things are changing, and, you know, how many more things we're adding. This was in the last, like, day to day, all requests, like month to month. So you can see we have like a one percent change in total storage month to month. So we're, like, approaching the one terabyte mark,
Starting point is 00:39:35 not there yet, but getting there quickly. And if we see, like, which are the buckets that contribute... and I have to remember, where was it... oh, there you go. So you can see changelog assets, which are the static ones, they contribute 22%. Changelog uploads Jared, they contribute 21%. This is the storage costs. And changelog.com backups, which is mostly nightly, they contribute, again, 20%. So they're, like, roughly evenly spread. So I'm wondering, is there anything here that we can clean up? Anything here that we don't need? Well, we can get rid of changelog uploads Jared, because that was my dev environment. Basically, I would mirror production with the assets. Right.
Starting point is 00:40:14 So that I had the most current assets, because I like to do that when I'm developing, have it look real. And so I just had this AWS S3 sync command that would just sync from slash assets to mine, which is why they're roughly the same amount
Starting point is 00:40:26 of gigabytes. Probably haven't run it in a while. I see. And so that's all moved over to R2. So that whole bucket could just get blown away.
Starting point is 00:40:32 Okay. Should we do that now, live? Yeah, let's do it. What's going to happen? Right? What's the worst thing that can happen?
Starting point is 00:40:38 Let's do it live! Like some sort of ta-da sound? Right, boom, everything explodes. So I think we won't be able to do that. We'll need to delete
Starting point is 00:40:44 the individual objects, by the way. Ah, you can't just delete a bucket? What's wrong with these people? It's too dangerous. I remember this, again, not being possible. So let's... again, let me search for Jared. I found the bucket. So we select it, let's say delete. And to confirm, buckets must be empty before they can be deleted.
Starting point is 00:41:00 And to confirm, buckets must be empty before they can be deleted. You know what? R2 has the same exact thing because I created a test bucket. I tried to move our logs over there as well. That failed, maybe we can talk about that. But I couldn't delete it without emptying the bucket first. And I'll say this, R2 does not have the ergonomic tooling that's built up around S3.
Starting point is 00:41:21 And so in order to delete all the objects inside the R2 bucket, we're talking about you're writing JavaScript, basically. There's the GUI apps, the tools, all that stuff isn't there. And it's API compatible with S3, but not really. It kind of goes back to our conversation, Adam, with Craig Kerstiens, about Postgres compatible
Starting point is 00:41:43 isn't actually Postgres compatible. Cloudflare's S3 compatible API is not 100% compatible. It's like mostly, but enough that certain tools that should just work don't. So like Transmit, for instance, which is a great FTP. It started off as an FTP client,
Starting point is 00:42:00 has S3 support. I think I complained about this last time we were on the show, so I'll make it short, but it doesn't support R2 because of like streaming uploads or some sort of aspect of S3's API that R2 doesn't have yet. So anyways, I haven't deleted a bucket from R2 because you have to actually click like highlight all and then delete and it paginates and they're like, okay. And there's like thousands of files. How do you delete them from S3? Just open up an app and select all and hit delete or what?
Starting point is 00:42:31 Well, I think I would try and use the AWS CLI for this. You would? Yeah, that's how I'd approach it. And I think just like that, I would like maybe script it, like list and delete things as a one-off. Now I would try Transmit to see if that works. I mean, we're talking S3 now, right? I just open it up in Transmit and hit delete.
Starting point is 00:42:51 I'm doing the same thing now, see if I can delete it from Transmit. Oh, it's going to be gone already. I already did it. That's why nice GUI apps are just for the win, you know? I just open it up in Transmit, select all, hopefully I did the right bucket, that was pretty fast. All you had was just the uploads folder in changelog uploads Jared, right?
Starting point is 00:43:12 Nice. Better look at it quick because it's going away. That's why it'd be great to have a transmit for R2. So somebody out there should build a little Mac GUI for R2. You can call it D2. I believe somebody said they wanted to call it D2. Is that in Slack or Twitter? That was on Twitter.
Starting point is 00:43:30 Jordi Mon Companys, who's a listener and one of the hosts of Software Engineering Daily. We know Jordi. He's the one who said, call it D2. I was like, that's a good idea. That would be a good one. Is it Jordi? Yes.
Starting point is 00:43:42 In my brain, I've had it mapped to Jory. It could be Jordi. It's a J name, and you know, whenever someone's potentially around the world, Js are pronounced differently. But he's from the UK. I don't know, I'm gonna go with Jordi. Okay. Yeah, call it D2, you know. Write it in Tauri. We'll cover it here in Changelog News, of course. But yeah, R2 is just too new to have all the great tools. I mean, S3, yeah, right, just has everything. It's been around for a while, for sure. So what I wouldn't delete... I wouldn't delete the changelog assets on S3. I mean, we can consider that our backup if something... Backup, yeah. ...was to go catastrophically wrong with R2. Again, I don't expect it to happen, but you know,
Starting point is 00:44:21 better be safe than sorry. I mean, we can keep those 100-and-whatever, 200, or however many gigs we have in S3 for this. We won't be doing any operations against them, so it shouldn't cost us much other than storage space, and continue using R2. Maybe even set up a sync between R2 and S3, so that we have a backup to the backup, or like a backup to our new CDN, in a way. So that would be good. But yeah, I think that's a good idea. So we are on R2. We did it. And it was a breeze. Why not consider B2? Backblaze B2 versus S3. So I know... I listened to the episode, by the way. Great episode, loved it. Backblaze episode?
Starting point is 00:45:04 The Backblaze episode. And I'm using them and have been using them for many, many years. When I set up my Kubernetes backup strategy... by the way, I have a Kubernetes cluster in production. That's a thing. And all my workloads are now running on Kubernetes. We can talk about it later.
Starting point is 00:45:17 In your home lab? No, no, no, in production. Okay, for Dagger or? Well, what that means, I mean, I've been hosting a bunch of websites for decades. Oh, that's right. Right. So it's like mostly WordPress websites,
Starting point is 00:45:28 some static websites. We're talking 20 websites. I won't be giving any names. Again, they're like longtime customers. BBC, NYtimes.com. That's right. That's it. That's it.
Starting point is 00:45:39 BBC, all of them. All of them. Yeah, exactly. So I've set up this production cluster. I mean, this was the second one. I set it up in June and I've been hosting these workloads. I was using a lot of DigitalOcean droplets.
Starting point is 00:45:52 I had about 10. So all of these I consolidated in two bare metal servers and they're running Talos and it's all production. So obviously production needs backups and it needs restores. So what I did when I was migrating between Kubernetes clusters, these workloads, the backups were going to B2.
Starting point is 00:46:09 And B2 was okay, but slow, like sometimes unexpectedly slow. I had the same feedback from Transistor FM. I had them on ShipIt and they were saying some operations on B2, sometimes they're slow. So they can take minutes instead of seconds. And that was my experience as well. Restoring things from B2 was incredibly slow. So it took me 30 minutes to restore, I don't know, like 10 gigs roughly. And that's not normal. So what I did, I said, okay, I have to try R2. I tried R2, same restore, three minutes. So there's a 10X difference between B2 and R2 in my experience. Again, it's limited to me.
Starting point is 00:46:54 So that's why, for big restores, I'm restoring from R2. But of course, I'm using both B2 and R2, because I have two backup mechanisms in place. Of course. The reason why I suggest, or even ask, B2 versus S3, is if it's only for backup: B2, based on their pricing page, is $0.005 per gigabyte per month.
Starting point is 00:47:17 And S3, if this is accurate, is $0.026, which is five times the cost per gig. Yeah. So if it's just backup, we can deal with slowness, right? I mean, if it's a restore, we can deal with slowness. We can just buffer that into our mental space and then, you know, keep five times our dollars. So here's a question related to Kaizen infrastructure and whatnot. If we were to say, okay,
Starting point is 00:47:43 what we want is a backup service that takes our R2 things and puts them on B2 once a day or once a week, or even just a mirror, just constantly mirroring. Where would we put that? Where would it run? How would it work? We could write an Oban worker for it. True.
Starting point is 00:48:00 What I would do is solve it as a CI/CD job, yeah. So it would be a custom robotic arm inside of our Dagger. That's it. And it would pick it up and it would move it and drop it. It would be GitHub Actions, you know, and it would just run there. We could have an Oban worker as well. I mean, whatever we're more comfortable with.
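If the Oban route were chosen, one shape it could take is sketched below: a job that copies a single object from R2 to B2 over their S3-compatible APIs, with a parent job (or the CI pipeline Gerhard describes) enqueuing one per object. Bucket names, the B2 endpoint, and the config override keys are illustrative assumptions, not real settings:

```elixir
defmodule MirrorObjectToB2 do
  # Sketch only; not an actual changelog.com worker.
  use Oban.Worker, queue: :backups, max_attempts: 3

  @r2_bucket "changelog-assets"   # hypothetical source bucket on R2
  @b2_bucket "changelog-backup"   # hypothetical destination bucket on B2

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"key" => key}}) do
    # Download from R2 (the default ExAws config is assumed to point at R2)...
    {:ok, %{body: body}} =
      @r2_bucket |> ExAws.S3.get_object(key) |> ExAws.request()

    # ...then upload to B2, overriding the endpoint and credentials per request.
    @b2_bucket
    |> ExAws.S3.put_object(key, body)
    |> ExAws.request(b2_config())
    |> case do
      {:ok, _} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end

  # Hypothetical overrides; real values would come from runtime config or env.
  defp b2_config do
    [
      host: "s3.us-west-000.backblazeb2.com",
      access_key_id: System.get_env("B2_KEY_ID"),
      secret_access_key: System.get_env("B2_APP_KEY")
    ]
  end
end
```

Whether something like this lives in the app or in CI mostly comes down to where the credentials and the scheduling already are, which is exactly the trade-off being weighed here.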
Starting point is 00:48:19 Yeah. But then should our app know about this? Maybe it should. I mean, it has access, right, to all the credentials. It's easy in terms of secrets and stuff. It's already there. I mean, we obviously have to add the B2 stuff. I think the question is, do you want to do it or should I?
Starting point is 00:48:34 That's what it comes down to. This is a good question. I'd rather you do it. There you go. So that settles it. I don't want to do it. Do you like how I had to act like I thought about it? I made it dramatic.
Starting point is 00:48:51 Yeah, you did. You really acted that out good, Jared. I did. What's interesting to think about, just kind of almost separating this conversation a bit, is you mentioned Dagger and you mentioned GitHub Actions. And I'm just curious if Dagger's a potential acquisition target for GitHub. Because if you are complementary and you're improving,
Starting point is 00:49:13 and every time we have a problem like this, your solution is a background job built into CI using code, Go code, whatever code you want, because that's what the move from CUE to everything else went to for Dagger. And you're so complementary in terms of Dagger to GitHub Actions. You're not cannibalizing, you're only complementing.
Starting point is 00:49:36 I can definitely see it. So a year from now, Gerhard will work for GitHub. Anything is possible. So that's a good idea there. For now, on that subject... again, I didn't want to talk too much about Dagger in this Kaizen, but I'll just take a few minutes. So I noticed that we had... again, this is Fly Apps V2 migration related... we used to run a Docker instance in Fly, and that's where Dagger would run.
Starting point is 00:50:07 We'd have all the caching, everything, so our jobs would be fast. Our CI jobs would be fast. And as part of that migration, the networking stopped working. So I was thinking, okay, well, we have all this resiliency in all these layers, but we don't have resiliency in our CI. So if our primary setup stops working, on Fly in this case, then nothing works. So I thought, well, why don't you use the free GitHub runners?
Starting point is 00:50:30 That's exactly what we did. So now if you look in our CI, and there is a screenshot in one of these pull requests, let me try and find it. It's called make our ship it YAML GitHub workflow resilient, 476. So the TLDR looks like this: when Dagger on Fly stops working, there's a fallback job where we go on the free GitHub runners. It takes longer, it takes almost three times as long, all the way up to maybe 10 minutes, but if the primary one fails, we fall back to GitHub. We are also running on Kubernetes, Dagger on Kubernetes.
Starting point is 00:51:06 So we have three runtimes now, Fly, GitHub, and Kubernetes. And the common factor is Dagger. It made it really simple to have this sort of resiliency because at its core, it's the same thing. We just vary the runtime. But we didn't have to do much. You can go and check our Shippit YAML workflow,
Starting point is 00:51:28 GitHub workflow, to see how that's wired up. Again, it's still running. It's still kicked off by GitHub Actions. But then the bulk of our pipeline runs in one of these places. Where's the Kubernetes stuff? That's the production Kubernetes
Starting point is 00:51:42 which I told you about. Oh, it's at your house. Well, no, it's not. I have an experimental Kubernetes cluster in my house. This is a real production one, right? Running like in a real data center, not my house. Hey man. Yeah.
Starting point is 00:51:58 It's not like you never run any of our production stuff from your house. Exactly, I did. And it worked really well, I have to say, for a while. And then obviously we improved on that. It was a stopgap solution. Hey, you know, we've had the work from home movement, you know, everybody's taking their work from home
Starting point is 00:52:15 and it's like, well, why not bring your work to your house? You know, take your CI home with you. Exactly, I take the CI. Okay, so this is a production Kubernetes thing of yours that this is running on. This is like, that's a third fallback in case, or? That one's slightly special in the sense that that one doesn't deploy yet.
Starting point is 00:52:34 So it runs and it builds, but it doesn't deploy. So there is this limitation in GitHub Actions. And again, if someone from GitHub is listening to this, I would really, really appreciate it if some thought was given to this. When you say runs-on in GitHub Actions, all the labels have to match. What that means is that you can't have a fallback. You can't say runs on this, or that, or that. You can't define a nice fallback. So then what you do, you have to basically say this job needs the other job, and only run it if that one failed. It's just a mess. So if, for example, Kubernetes was not available,
Starting point is 00:53:12 how do we specify a fallback? And when I say not available, I mean it can't pick up a job. So it won't fail, it's just not available. A job will basically wait to be picked up for a certain amount of time, and then it will time out, most likely. Again, I haven't tested this fully, but that mechanism, the runs-on mechanism, is pretty inflexible in GitHub Actions. Now, in the case of Fly and Docker, that's fairly straightforward: it basically starts on GitHub and then eventually it hands over to Fly, because we start another engine anyway. I mean, you can go and check the workflow, I don't want to go too much into the details. But that's a simpler proposition. When you have a third one which may or may not be there, it's a bit more complicated. Gotcha.
Starting point is 00:53:55 like running it as an experiment to see how it behaves to see you know if it is reliable long term and if it is then maybe you know make a decision in a month's time or two months time but for now it's fly with the guitar fallback cool resiliency for the win always have two yes and now we have three just in case well i didn't even consider that we would keep s3 for a backup or consider b2 as a lower cost backup because i thought, well, we'll just, you know, cut our ties, keep our dollars and move to R2 and that's done. But that does make sense because what if R2 poops the bed? You know, we're going to have some issues. We got all of our, almost a terabyte of assets we've been collecting over these years, our JavaScript, our feeds, whatever we're going to put there ever. If we have no business continuity, which is a phrase I learned 20 years ago now, business continuity, right?
Starting point is 00:54:49 That's key in backups, right? You can't just put the backup over there. You've got to get it back to keep doing business. So that does make sense. I didn't consider that and I'm glad you did. And the cost will go down, right? Because again, we are using R2, which is free for egress.
Starting point is 00:55:03 S3 isn't, so we're not pulling anything from S3. I mean, if anything, we can move the bits over to B2 so that the storage costs will be lower. But again, there'll be one-off operations, and by the way, when you write, you actually pay for the operations. But anyway, the point is, it will cost us something to migrate off S3, but it's like a one-off cost. We've already done that though, haven't we? So we just move from R2 to B2. Oh, that's right. Actually, that's a good point. Yes, exactly. We migrate from R2 to B2. That's correct. So maybe delete S3 after we migrate to B2. That's there. Cool. Well, you can delete my bucket now, because all the files are gone. So go ahead and get that done at your leisure. It doesn't have to be... Okay, so let me refresh.
Starting point is 00:55:46 Yeah, that's right. So let's get this thing done. Yep, confirmed. It's gone. Bucket's there. Files are gone. Got an emotional attachment to this bucket though. I've been using this for a long time.
Starting point is 00:55:56 You have another one in R2, by the way. That's not the same. Which is assets dev that you can use. That's right. But that's shared across multiple people. So it's not as personal. Like this was my bucket, man. This was my bucket. I see. We can create one for you. It's okay. But that's shared across multiple people. So it's not as personal. Like this was my bucket, man. This was my bucket.
Starting point is 00:56:06 I see. We can create one for you. It's okay. We can create one for you. It's free. I appreciate you consoling me as you delete my bucket. Change log uploads, Jarrett.
Starting point is 00:56:20 No fat fingering, delete bucket. Boom. Spell that. Boom. It's gone forever. All right. Cool. What about the backups?
Starting point is 00:56:27 What about the nightly backups? Is that something that we can clear? Because by the way, there's a lot of backups going all back to 2000 and something, I even forget what it was. Those are assets backups or database? I think it's a database. No, this is another one.
Starting point is 00:56:42 This is a small one. This is like from our pre-Fly migration. We can come back to this a bit later, because it has just a few and it costs us nothing. Backups, nightly... they start in 2015. Oh, it's Nightly. This is a backup of Changelog Nightly. Yep. We don't need that. We don't need that? No, man, I don't think so. But this is still happening. Yeah, it is. Because my code works. Right. I wrote this years ago.
Starting point is 00:57:11 2015. I forgot about this. We've been backing up Changelog Nightly every night. Might be some of the first code you wrote for this company, Jared. Might have been. The last backup is 76 megabytes. Do we want to delete the old ones? Like, what's the plan here? Yeah, man, there's no reason to have them, because each one has the entire contents of
Starting point is 00:57:31 the previous ones. Oh, I see. It's not like differentials or anything. It's the entire folder structure of Changelog Nightly, which is all static files. Right. Every night we add two static files and send an email, and then we back it up. And so that's just been happening for years. Wow. And so I forgot about it. So yeah, this can, like... okay,
Starting point is 00:57:49 so we'll fix this. Just keep the most recent one. Just keep one. In fact, tonight we'll have a new one so you can delete them all. We'll create a new one tonight. Cool. What do we do about tomorrow though?
Starting point is 00:57:59 Well, we could make it run less often. I think that would, I think it run like weekly. No, no, hang on. I think that's fine. I think that's fine.
Starting point is 00:58:05 I think that's fine. What we can do is like set some sort of an expiration or like auto purging on the objects. Oh yeah, let's do that. Okay, that's the best idea. Good, okay, cool. So we'll fix that as well. Great.
Starting point is 00:58:16 Keep the last 10. Great. I can't believe the Nightly folder structure of just HTML files is 76 megabytes of HTML. It seems like a lot. Well, maybe it's worth something. It's a tar, so there's no compression of any kind.
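For reference, the auto-purging idea maps onto an object lifecycle rule on whichever S3-compatible bucket ends up holding these backups; with one backup per night, a days-based expiry roughly approximates "keep the last N". A minimal sketch, where the nightly/ prefix and the 30-day window are assumptions rather than our actual settings:

```json
{
  "Rules": [
    {
      "ID": "expire-old-nightly-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "nightly/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Applied with something like `aws s3api put-bucket-lifecycle-configuration --bucket <backups-bucket> --lifecycle-configuration file://lifecycle.json` against the provider's S3-compatible endpoint, the old tarballs age out on their own instead of piling up since 2015.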
Starting point is 00:58:31 I'm wondering if we can make it smaller. Where does Nightly run, by the way? It's a production Kubernetes cluster in my closet. In your closet, right? It's on an old DigitalOcean droplet. I'd actually like to get rid of that, but you know. Don't say where it is, because I'm sure it's like so unpatched, I think.
Starting point is 00:58:51 It's like a honeypot at this point. Like, there's so many... the exploits have exploits at this point. Yeah, thankfully it's just straight-up static files served by, I think, Nginx, but... No SSH. Oh yeah, no SSH. No SSH, good. No FTP, nothing. You can't connect to it, right? Completely firewalled. It's actually air-gapped. Yeah, I don't even know how it does what it does, because you have to walk over with the thumb drive...
Starting point is 00:59:19 For every request, we have somebody go plug it in. Yeah. Every night we plug it in, it runs, and then we unplug it. Should we put it on Fly? What do you think? We certainly could. Honestly, Changelog Nightly is an entire subject. The quality has been degrading lately because of the rise of malware authors just attacking GitHub constantly.
Starting point is 00:59:42 And so there's a lot of malware stuff. I'm like, the only change that I've made to Changelog Nightly in the last couple of years is just fighting off malware. We just don't want malware repos showing up. And they're constant; it's been cat and mouse. I think about just shutting it down. It still provides a little bit of value for about 4,000 people. Yeah, it really does.
Starting point is 01:00:00 And me, I still read it. I still find cool stuff in there. It's just harder. You have to scan through some of the crappy stuff. There's just some crappier repos in there, just because GitHub's so big now. It's become a little bit rigid, because it's like an old Ruby code base,
Starting point is 01:00:14 and sometimes I get Gemfile problems on my local machine. It's just like you can't run it locally. I can only run it from that DigitalOcean server. So I go in there and vi and edit stuff. So you don't want me to see it. That's what you're saying. No, I don't want you to look at it.
Starting point is 01:00:28 Gerhard is not allowed anywhere near this thing. You'd just flip out. This is legacy code. This is legacy code. I've thought about rewriting it in Elixir and just bringing it in, having like a monorepo deal, and then we would have our...
Starting point is 01:00:43 and then I'm like, why would I put any time into this? There are so many things I can work on. I see. So Nightly is just kind of out there. We could definitely put it on Fly. I think that would definitely help our security story. But it might be tough
Starting point is 01:00:55 because it's like Ruby 2, it's like old gems, stuff like that. If there's a container for it, it doesn't really matter. It really doesn't. That's what I'm telling you. It's Ruby 2, it's old gems. There's no container, man. it doesn't really matter. It really doesn't. That's what I'm telling you. It's Ruby 2. It's old gems.
Starting point is 01:01:07 There's no container, man. This is like pre-Docker. No, no. I mean, there is a Ruby 2 container. Oh, yes. I'm saying there's no Dockerfile for Changelog Nightly, is my point. We don't need a Dockerfile. If there is a Docker container that we can start off of, that's okay.
Starting point is 01:01:22 We can keep it exactly as it was. So I'm looking now at the official Ruby image on Docker Hub. Ruby 2.3.3, patch 222. 2.3, yeah, there you go. Six years ago; it exists. We can pull it. We can base it off of this. I learned this, kind of learned this, with ChatGPT recently. I didn't want to set up a dev environment.
Starting point is 01:01:44 I was actually, just for fun, trying to run Jekyll without having to actually install Jekyll or anything, because Jekyll's notoriously just kind of hard to maintain, because it's Ruby and Gemfiles and all the reasons. And so I'm like, I want to just run the entire thing in a Docker container, but still hit it from a typical web browser, like I would to develop a blog. And so my Jekyll blog lives in, I think, a Ruby 2.7. I don't even remember what exactly. But it was something that was safe for ARM,
Starting point is 01:02:10 because I'm on an M1 Mac and all that good stuff. And it was like a special Dockerfile there that I could just run and build off of. So similar to what you're saying here, you just kind of go back in time to a Docker image that was out there for Ruby 2.3.3 and call it a day. That's 2.3.3, patch 222. We can totally do this.
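To give a flavor of it, a Dockerfile for Nightly could be as small as the sketch below. This assumes Nightly is a plain Ruby app with a Gemfile and a single entry script; the file names and entry point are guesses, not the actual repo layout:

```dockerfile
# Pin to the old Ruby so the existing gems keep working unchanged
FROM ruby:2.3.3

WORKDIR /app

# Install gems first so this layer is cached between builds
COPY Gemfile Gemfile.lock ./
RUN bundle install

# Then bring in the rest of the app
COPY . .

# Hypothetical entry point: whatever script generates and sends the nightly email
CMD ["bundle", "exec", "ruby", "nightly.rb"]
```

From there it's a docker build plus a scheduled run on Fly (or anywhere else that can run a container), and the host's Ruby version stops mattering.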
Starting point is 01:02:28 Challenge accepted. Show me Nightly. Show me yours. That would save us $22 a month, Gerhard, I think. Something like that. That's how much we spend on this Nightly server at DO.
Starting point is 01:02:44 It's about that much a month. And that's literally the only thing on there. Yeah, you have hundreds of gigabytes of backups. Hundreds of gigabytes of backups. Really redundant. But we'll fix that. Since we're mentioning Changelog Nightly though, and the spamminess of it,
Starting point is 01:02:58 I do want to highlight a spam situation in the most recent one. But I think it's actually a student, and the person's handle on GitHub is rsriram9843. Okay. He has, or he has... they have, I'm not sure their gender: desktop tutorial, project three, project one, project four, develop. So check those out. They seem to be pretty popular, because they're in the latest Nightly's top new repositories. There you go. You don't think
Starting point is 01:03:32 it's spam, or you do think it's spam? Well, I mean, it looks like a normal person. Maybe they did that. I don't know. It could be a... I don't know. It seemed like a normal person. Would that qualify as spam? That it doesn't belong there? Yeah, like it's a bot or it's malware.
Starting point is 01:03:48 They very well might be a bot. I mean, in that case, if it is, don't go there. I've just identified a bot to not check out. Here's how far I've gotten, but I haven't pulled the trigger yet, on trying to actually have a malware slash spam detection system for Nightly that's actually good. I take a list of a bunch of good repos. Here's what we have: owner, which is like the GitHub handle; repo, which is the name of the repo, right; and then the description. That's what we have. And I took like 20 good ones, like, these are legit, but they're diverse, you know? Because you can put emoji in there, some people write in different languages, et cetera.
Starting point is 01:04:26 And I pass it off to ChatGPT. And I say, here's an example of 20 good projects on GitHub. And then I pass it some bad ones. And then I say, is this one good or bad? And it's about 60% accurate.
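For anyone who wants to pick this up, the few-shot approach described here is roughly the sketch below. It's illustrative only: the example repos, prompt wording, and model name are made up, and the OpenAI client call is the current Python SDK rather than anything wired into Nightly's Ruby code base:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A handful of labeled examples: (owner, repo, description, label).
# Placeholders -- the real list would be ~20 diverse, legitimate repos plus known spam.
EXAMPLES = [
    ("rails", "rails", "Ruby on Rails web framework", "good"),
    ("someuser123", "leetcode-java-solutions", "crack the coding interview!!!", "spam"),
]

def classify(owner: str, repo: str, description: str) -> str:
    """Ask the model whether a repo looks legitimate or spammy, few-shot style."""
    shots = "\n".join(f"- {o}/{r}: {d} -> {label}" for o, r, d, label in EXAMPLES)
    prompt = (
        "You label GitHub repositories as 'good' or 'spam'.\n"
        f"Labeled examples:\n{shots}\n\n"
        f"Label this one, answering with a single word:\n- {owner}/{repo}: {description} ->"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name, not what the episode used
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()
```

As the conversation notes, a plain few-shot prompt like this only got to roughly 60% accuracy, which is why fine-tuning or a richer feature set is the more interesting follow-up.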
Starting point is 01:04:43 Really? It's slightly better than a coin toss. And I thought, well, that's not good enough because I can't, I mean, this is all automated. I'm not going to act on 60% confidence, you know, or 60% accuracy. I can't just be like, nah, not good. I think you'd have to fine tune.
Starting point is 01:04:58 It gets above my pay grade of being like, okay, let's take Llama and fine-tune it. I would love for somebody who's interested in such things to try it. For now, I'm doing a bunch of fuzzy matching on just common things that spammers do in their names. There's duplication, there's these words, there's LeetCode, and it's inevitably cat and mouse.
Starting point is 01:05:18 But I would love, I think you have to almost go to a GPT to actually have a decent system. That's as far as I got. And I thought, well, not only is this not accurate enough with my current implementation, I'm on an old, rigid Ruby 2 code base that I can't really, what am I going to do, pull in the OpenAI gem? I'm never going to be able to get modern tooling into this system
Starting point is 01:05:39 until Gerhard saves us with a Dockerfile or whatever he's going to do. A Dagger pipeline, but yes, close enough. Yeah, sorry, wrong company. I'll Daggerize it. That's what's going to happen. We need to Daggerize this sucker. That will be Kaizen: slightly better.
Starting point is 01:05:56 That'll be the next one, cool. So the last thing which I want to mention before we start thinking about wrapping up and thinking about the next Kaizen is mention that now we have status.changelog.com. Oh, yeah. Yeah, that's another thing that happened. So when we are down, hopefully never.
Starting point is 01:06:13 We've got 100% uptime on changelog.com. Now, the checks, they don't run every 30 seconds. We are still on the free tier. This is Better Stack. And I think the checks are like every three minutes. So if there's downtime which is less than three minutes, it won't even be picked up by this tool, by this system. However, if there is an incident, we will be communicating it via status.changelog.com. So if changelog.com was to be down, again, not going to happen on my watch, but you know, it has happened, like many
Starting point is 01:06:42 years ago, and it wasn't us, it was Fastly. Remember that episode? I forget which one it was. Yes. But the BBC was down too. So again, after I say this, boom, everything crashes and burns. No, not gonna happen. I'm not gonna even tempt it. But yeah, so that's, I think, the one thing which I wanted to mention: we have a status page. Very cool. And for those of us on my side of the pond, you go to status.changelog.com. If you're in the UK, you go to status... yes, status, that's it. Both will get you there. It just depends on how you like to say it. S-T-A-T-U-S. Like POTUS. We can agree on that. Like POTUS, you'll say. You got the U.S. in there. I like it. I appreciate it. So what are we thinking for the next Kaizen? What would we like to see? Oh my goodness.
Starting point is 01:07:29 I would like to see ChangeLog Nightly upgraded in the ways that we just discussed off of DigitalOcean specifically. I would like to see... Clustering working? Clustering. I think we need to get clustering working so we can use Phoenix PubSub. I think we have to do Elixir releases to do that.
Starting point is 01:07:47 I was reading about it a little bit. That's there. So that's when I stopped and was like, releases. This is outside of my wheelhouse. So I looked into that, by the way, but then I decided to leave them out of scope for the migration that was like, I think, for the previous Kaizen. But there is like some code in our pipelines to do that.
Starting point is 01:08:06 Okay. I would like to see Oban Web installed, so we can have that observability. Top of my list. That one should be easy enough. Adam was mentioning Middleware.io, trying it out maybe. What's Middleware.io?
Starting point is 01:08:19 Did I mention that? I did. Oh yeah. AI powered cloud observability platform. Oh, shiny. That's a cloud observability platform. Oh, shiny. That's a nice headline. I do like that. It gets me in there because it's AI-powered.
Starting point is 01:08:31 Right. That's how you raise money today: it's your AI-powered stuff. And it's also cloud observability. It's also a platform. It has all the buzzwords. Was this generated, by any chance? Is it a real website? Yes. They reached out.
Starting point is 01:08:45 You know, I asked you if you saw it. So usually we get lots of inbound requests from people. Some are legitimate. Some are whatever. But my smell test is: Gerhard, did you hear this? Would you try this out? Would you want to try it out? And I don't think I've spoken to them yet, but we do have something in the works to get connected, so I will escalate that up my list to make sure I do so. And then I think you
Starting point is 01:09:11 said you wanted to play with it, right? So we can probably get an account, to see if you like it and go from there, kind of thing. Cool. One was trying the wildcard. Yeah, there you go. What about something that gives us more than two SLOs? I mean, that's one which we didn't talk about. I mean... Yeah, we didn't talk about that, but come on, Honeycomb. What's the deal with that? Two? I know.
Starting point is 01:09:33 I will tell you, here's what's happening. While we were talking on this podcast, I was emailing Christine Yen, because she's going to come on a future episode of Founders Talk. And I like her. I like the whole team there. And I think they do amazing work. And obviously we reference and leverage Honeycomb as critical infrastructure. Like, I don't think we could do what we do, quite the way we do it, without it. The listeners didn't get to see you share your screen, but Jared and I did. They'll hear what you said about what was on your screen and they'll follow along, hopefully. But we were, like, knee-deep into layers
Starting point is 01:10:05 and layers of observability data that's inside Honeycomb, and we don't have to, like, program or, what do you call it, instrument these things to do it. It just captures it and we just ask the questions. Obviously, I think it has a length of time of logs it can go through, right? Like, there's six weeks or eight weeks or a couple months. I'm not sure what the... Yeah, it's two months, 60 days. Two months, 60 days. Traces and everything.
Starting point is 01:10:30 Enough for us. Maybe we can get more. I don't know. We're hitting the limit, by the way. We have 100 million events per month and we're exhausting that because we're sending all the traces. Yes, we're getting emails about it.
Starting point is 01:10:42 They keep telling us like, hey, you've gone over X again this month. Right. Threatening. And by the way, we are paying for it. We are paying for it. Yeah, we are paying for it because we haven't made this connection yet.
Starting point is 01:10:53 So my hope is, and Christine may be listening to this right now, because I sent the email: hey, Christine, literally we're talking about Honeycomb as I type this, because we're on the podcast. We're talking about you right now.
Starting point is 01:11:02 And it goes out this Friday and here's an echo because I'm now talking to her and everybody else in this very moment here and just suggesting like, hey, we're big fans of Honeycomb. We want to partner with them.
Starting point is 01:11:12 We want to find ways to speak more about them, but more importantly, improve. Like, two SLOs on the free plan... I'm curious, why is that limit there? It's the pro plan. It's the paid plan.
Starting point is 01:11:22 It's the pro plan. The free one, you don't get any. Gosh. Even... there you go. So there you go. So if you're paying for the pro plan, you should get more than two SLOs. And if you don't, why? What's the cost to enable an SLO? Well, here's a quick question before we go: there are also now triggers. And I was in there poking around, and I see the SLOs and I see the triggers, and triggers seem to be based on similar things that SLOs are based on. It's like, if this happens, trigger. Do these
Starting point is 01:11:50 work together? Are they separate features, Gerhard? Do you understand triggers better than I do inside of Honeycomb? So triggers is almost like alarms, right? So it's like an alert. Right. But isn't SLO also like an alarm? Like, hey, you haven't reached your objective. Kind of, but it gives you like the perspective of like the last 30 days, right? So when you click on one. Does it email you? Yes, I do get emails and you can.
Starting point is 01:12:12 This one says triggered right there. It says it's been triggered. I mean, this basically gives you almost like a graph and you can do like comparisons, like to start understanding like, when does this SLO fail? And by the way, some of these things aren't that helpful. And again, that's like to Adam's point,
Starting point is 01:12:26 there's like more to discuss about this. But what's important, we have a budget and it tracks the budget and we see whereabouts we are. A trigger will not have that. A trigger will say, hey, this thing just happened. So an SLO, I think it goes further.
Starting point is 01:12:40 You have obviously an SLI and it keeps track of that. And then you receive emails when you're just about like 24 hours from exhausting your budget. And that makes it really helpful. Right. Okay. Fair enough. They're deeper. There's more things to track. Seems a bit redundant to me, but I can see how you might just have some one-off triggers that don't need to be like full on SLOs. I wonder if we could use those to get around some of our two SLO maximum, maybe.
Starting point is 01:13:10 So it's almost like when something is slow. But again, can it take into account... maybe it can, maybe we just need to write a query that takes that into account. But then, apart from the dashboard view and the comparison view, there must be something else about SLOs as well. I mean, why not just call them the same thing, if it's just that? Because I think SLO is, like, buzzword-compatible at this point. Like, it sounds like a thing that you could charge money for. I see. Queries run every 15 minutes.
Starting point is 01:13:32 So maybe. Anyways, let's look into triggers a little bit. But yeah, we definitely want to get some more SLOs. Yeah, more SLOs. And we spell it M-O-A-R. Because Gerhard says, look, you should have two of everything. Except for wives and SLOs. You should have more.
Starting point is 01:13:47 Less than two wives. Less than two wives. Definitely. And more than two. Yes. SLOs. Absolutely. Absolutely.
Starting point is 01:13:55 Two of everything else. Right. Right. So, hi, Christine, if you're listening. Can we talk? Stoked. Love, Honeycomb. More SLOs. More SLOs.
Starting point is 01:14:06 More SLOs, please. Yeah. This has been a fun Kaizen, though. I mean, I think, you know... I've been quiet quite a bit during this show, because y'all do the work and I just get to pundit as necessary. It's great to see all this work done. I mean, it's great to see us now improving, yes, but I think paying attention to how we spend money with S3, making changes, and leveraging other players in the space... Mad respect
Starting point is 01:14:32 for Cloudflare. We love to find ways to work with them in any way, shape, or form. And the same with Better Stack. I think the status page is something we haven't really looked further into, in terms of working with them. But part of this journey with Kaizen is improving, but also finding the right tools out there that we like, that we can trust, in terms of, you know, who's behind the business, the way they treat the community, and
Starting point is 01:14:54 the way they frame and build their products. Finding those folks out there that we can work with ourselves and leverage, but then also promote to our listener base, saying, hey, these are things that we're using, and in these ways, and all of our code is open source on GitHub, where you can see these integrations. I think it's beautiful, right? Like, to have an open source code base and to integrate with Dagger since, you know,
Starting point is 01:15:15 0.1 or whatever the release was initially when you first got us on there. And then having that conversation with Solomon on the change log and kind of going into all that. And like all this stuff is out there in the open. And we just invite everybody listening to the show to just follow along as you'd like to, to see where we go and then how it works when we put it into place.
Starting point is 01:15:33 So that's kind of fun. I like doing that with you all. It's a lot of fun. Yeah, same here. I mean, this really is unique. I mean, to be able to discuss so openly and to share the code... we're not just talking about ideas or what we did. This is like a summary, and, hey, by the way, there's a changelog, there is a GitHub repo, you can go and check all these things out, and if there's something that you
Starting point is 01:15:53 like, use it, try it out, and let us know how it works for you. So yes, we're doing it for us, of course, but also a lot of effort goes in to share this, so it's easy to understand, it's easy to try it out and see if it works for you. And we're open about the things that didn't work out, because a bunch of things didn't. Right, precisely.
Starting point is 01:16:18 I would say if you made it this far and you haven't gone here to this particular webpage yet and joined the community, you should do so now because we are just as open and welcoming in Slack in person as we can be. Go to changelog.com slash community. Free to join. Love to talk to you in there. Lots of people in Slack.
Starting point is 01:16:38 It's not overly busy, but it's definitely active. And there's a place for you there. So if you don't have a home or a place to put your hat or hang your coat or your scarf or whatever you might be wearing or take your shoes off and hang out for a bit, that is a place for you to join. You're invited and everyone's welcome
Starting point is 01:16:53 no matter where you're at on your journey. So hope to see you there. What else is left? What can we look forward to? One last thing. If you join the Dev Channel in Slack, please don't archive it. What the heck?
Starting point is 01:17:07 I just noticed that. Obed Frimpong joined and he archived it. And then Maros Kuchera joined and archived it. So that just messes things up for our clients. So please don't do that. Don't archive channels. I don't know why people can do that. I mean, maybe there's some fix that we should do.
Starting point is 01:17:23 Yeah, maybe. You'd think that would be a setting, like, no. That's the limit of our invitation, okay? We are very open and very inviting until you archive our channels. And then we don't want it to happen. So don't do that. That's like coming into our house and being like,
Starting point is 01:17:37 oh, I threw away your kitchen table. I hope you didn't need that. Yeah. I got rid of that. Neighbor needed a table. Yeah. Play nice. Be nice. That's right. That's right. Otherwise never needed a table. Yeah. Play nice. Be nice.
Starting point is 01:17:45 Play nice. That's right. That's right. Otherwise, welcome. Otherwise, welcome. All right. Kaizen. Looking for us with the next one.
Starting point is 01:17:53 Kaizen. Kaizen. Always. This changelogging friends features a changelog plus plus bonus. That's what the crowd wants. Gerhard's boss, Solomon Hikes from Dagger was on the changelog a few weeks back and we just had to get Gerhard's review of that episode. I didn't listen to it.
Starting point is 01:18:15 What? Come on. I just wanted to see the reaction. How rude. Actually, he listened to it twice and he has opinions, of course. If you aren't on the plus plus bandwagon yet, now's a great time to sign up. Directly support our work, ditch the ads from all of our pods, and get in on fun bonuses like the one that Plus Plus subscribers are about to hear.
Starting point is 01:18:37 Check it out at changelog.com slash plus plus. Thanks again to our partners, Fasty.com, Fly.io, and Typesense.org. And to our beat-freaking residents, the mysterious Breakmaster Cylinder. Next week on The Changelog, news on Monday, Debian's 30th birthday party on Wednesday, and Justin Searles right here on Changelog and Friends on Friday.
Starting point is 01:19:02 Have a great weekend, and we'll talk to you again real soon.
