The Changelog: Software Development, Open Source - Kaizen! S3 R2 B2 D2 (Friends)
Episode Date: August 11, 2023
Gerhard joins us for the 11th Kaizen and this one might contain the most improvements ever. We're on Fly Apps V2, we've moved from S3 to R2 & we have a status page now, just to name a few...
Transcript
Welcome to Changelog & Friends, a weekly talk show about R2-D2.
Thank you to our partners for helping us bring you awesome pods each and every week.
Fastly.com, Fly.io, and Typesense.org.
Okay, let's talk.
All right, we are here to Kaizen once again for the 11th time.
Second time on Changelog & Friends, and we're
joined by our old friend Gerhard Lazu.
Welcome back, Gerhard.
Hi.
It's so good to be back.
It feels like I'm back home.
You're home here with us.
We have the comfy couch over there for you that you can just sit back and relax.
You've got the mic boom arm, which you can bring the mic really close.
Yep. Got comfortable headphones, and of course, your favorite drink.
Yeah, feels great. For you too, I hope, dear listener. If not, pause it, go and do your thing, and then resume.
That's right. So Kaizen: always be improving.
That's it. Is that what we're doing? Are we improving? Are we just making progress?
My perspective, we are improving.
Okay.
What are we improving?
Let's talk about that.
What and how?
Okay.
Yes, we're trying to for sure.
That is the aim: to continuously improve.
And in order to do that, I guess you change things, right?
You're like, well, we've been doing X.
Let's try Y.
And this is our new two month cadence. It's been roughly two months since we last recorded. So we're good
there. It's the summer months, which for me, at least a little bit is more time to work on things
because less news, less events, less things going on, more vacations, which does slow us down. I think we've all
taken a little bit of time. But Kaizen 11, if you look at the discussion we have,
which is discussion 469 on our changelog.com repository on GitHub, we'll link it up,
of course. But Gerhard does a great job of outlining each Kaizen with its own discussion. This one's got a bunch of stuff in it. Gerhard, this is maybe, like,
the best Kaizen ever.
I think so, I already do. Like, so many things changed,
like I couldn't believe it. Because when you work on it, and it's like a week in,
week out, you maybe add one thing, or maybe even half a thing. There were weeks
when nothing happened. Yeah. But then two months are up and you look at it
and you have like seven things
and some of them are really big.
And you think, whoa, that's a lot of things that changed.
Did they improve?
Back to Adam's question.
Well, let's figure it out.
I mean, I obviously have my perspective
and I can share the things which I think improved.
But for the listeners, what improved?
And for you, Jared, what improved? And Adam as well.
When it comes to the app, when it comes to the service, did you notice anything improving?
I would say the biggest improvement, oh, I had to think about it.
Front-facing features, maybe not so much. I mean, this has all kind of been infrastructure back end.
Of course, I'm always tweaking the admin and improving it for us.
The biggest thing for me has been the change of how we deliver all transactional email
through the application, which did introduce a very difficult to debug bug,
which I haven't actually quite figured out yet.
I just worked around, which we can talk about.
But we're using Oban for all email delivery,
which includes all of our newsletter delivery.
So literally for a while, the first step was just like,
okay, when we send out Changelog News,
we need to send that out with a persistent background queue. We can't
just ephemerally do that, because it's just a lot of emails, and you don't want to have something
die midway through the process and, like, half of our readers don't get their newsletters. Like, we
need to make sure that's robust, and so I put that through Oban. We also don't want it to be sending
duplicates, which actually is kind of the bug: it's doing it anyways. But...
His face, y'all. You missed his face in the video. So I saw his face, it was just hilarious. I had to
laugh. Sorry. What did my face look like? Defeat? Was it utter defeat? What was my face? It was just like...
it was disgust and defeat and humor at the same time. Well, let's look at it this way: we're sending
emails twice as hard. Twice is better than none.
We laugh so that we don't cry, Adam. We laugh so we don't cry.
The weird thing about this is that it's not like, hey, everybody gets two emails.
Like that would kind of make more sense.
It's just you, right?
It's not just me.
I wish it was just me.
It's a handful of people that get like 30, 40 of the exact same email.
It's like, it's re-queueing them.
I can't figure it out, but I just reduced it down to a single worker, because I had five
workers going and somehow it was just, like, re-queueing.
I'd love to figure that out.
But right now we're back to, okay.
But yeah, it was very embarrassing.
It's like certain, a handful.
And one time it was our guest.
Trying to think who the guest was.
He was very gracious.
Solomon?
I'm like, it wasn't Solomon, no.
But maybe it was, and he didn't tell me about it.
It's just like, you get 35 of this thank you email versus one.
But everybody else just gets one.
Very strange.
You like your lot.
You don't have enough change log in your inbox.
Or DDoS in your inbox.
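For reference, a minimal sketch of what this kind of persistent, dedupe-aware delivery job can look like with Oban -- the module, queue, and mailer function here are hypothetical stand-ins, not the actual changelog.com code; the `unique` option and the single-worker queue limit are the Oban features being discussed:

```elixir
defmodule Example.Workers.NewsletterDelivery do
  # Jobs live in Postgres, so a deploy or crash mid-send doesn't lose them --
  # another instance picks the job back up. The `unique` option asks Oban to
  # drop duplicate inserts for the same issue/subscriber within a day, one way
  # to guard against the "35 copies of the same email" failure mode.
  use Oban.Worker,
    queue: :email,
    max_attempts: 3,
    unique: [period: 86_400, keys: [:issue_id, :subscriber_id]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"issue_id" => issue_id, "subscriber_id" => subscriber_id}}) do
    # Hypothetical mailer call, standing in for the real delivery function.
    :ok = Example.NewsletterMailer.deliver(issue_id, subscriber_id)
    :ok
  end
end

# Enqueue one job per subscriber; `config :example, Oban, queues: [email: 1]`
# is the "single worker" setting mentioned above.
# %{issue_id: 42, subscriber_id: 7}
# |> Example.Workers.NewsletterDelivery.new()
# |> Oban.insert()
```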
So that was a big change.
Also a bit of a headache,
but it did prompt me to finally do
what I had said I was going to do last year,
which is we signed up for Oban Web,
which is supporting the Oban Pro cause.
And we don't have that quite in place
because I want more visibility
into our background jobs, basically.
The way I'm getting visibility right now
is I proxy the Postgres server
so I can access it.
And I literally am looking at the table of Oban jobs
and doing things.
Like a boss, straight in production.
Of course.
As you do when you're the boss.
Real developers develop in production.
Let's just be honest about it.
You got to do what you got to do, man.
I got to see what's going on here.
You got to improve on that. That's where the action's at. You know,
I would rather have the nice web UI, but right now all I have is a UI into the database table.
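Lacking Oban Web, the "UI into the database table" amounts to querying the oban_jobs table by hand. A rough equivalent from an IEx session, assuming the repo is named Changelog.Repo (the Oban.Job schema fields are real; the queue name is a guess):

```elixir
import Ecto.Query

# Peek at pending/retrying email jobs, newest first.
Changelog.Repo.all(
  from j in Oban.Job,
    where: j.queue == "email" and j.state in ["available", "executing", "retryable"],
    order_by: [desc: j.inserted_at],
    limit: 50,
    select: map(j, [:id, :state, :attempt, :worker, :args, :inserted_at])
)
```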
Now, because you mentioned that, that's the one thing which I didn't get to. It's on my list.
So close. What else is coming up? It will not take long, but I had, I had a good reason for it.
What's that? I did all the other eight things.
Oh, I see.
I guess that's good enough.
You accomplished all these other things instead.
So the weird thing about that,
I guess the reason why I had to pull you in on it,
because otherwise, so Oban Web
is just distributed as an Elixir package, right?
And the thing is, is because it's a,
effectively an open core kind of a thing where he has Oban,
which is a package which is open source and free and all that,
and then he has a subscription.
It's distributed via its own hex repository
that he hosts and has some sort of credentials,
which is fine locally,
but then you have to somehow get that
so that your CI,
when it goes to install all the things to deploy,
it can actually authenticate to his hex server.
And that's why I was like, eh, Gerhard should probably do this
because I'm not sure exactly how much Go code has to be written
in order to get that done.
And so that's why I pass it off to you.
But yeah, I would love to have Oban Web for the next Kaizen
because it'll help me figure out exactly what's going
on with these duplicate emails. That being
said, aside from that particular bug,
it is really nice to have the
persistence and the
ostensible opportunity just to have a single
send versus if we
have 15 nodes running the app,
who knows who's going to grab that and
just send it, right? So
that's a big one for me.
And the replayability of the emails, to resend in case it bounces and stuff,
we didn't have that before.
So that's cool.
Right.
So by the way, I hope this isn't a surprise.
I would hate for it to be, but let's see what happens.
We are running two instances of the changelog,
which means that even though you scaled down the worker to one, there will be two workers running, right?
That I do know.
Okay.
But it fixed it anyways, so...
I don't care at this point. It's working. Amazing.
Maybe, listener, if you've gotten more than one email from us, maybe 15, maybe 73 emails, do let us know. We want to know these things.
Oh, gosh. That's a lot of emails. We don't want to...
Yes.
So we're finally distributed then?
We have two versions of the app.
So we are finally telling the truth
in terms of we put our app and our database
close to our users?
So it's all in Ashburn in Virginia.
Okay.
So it's all in US East,
so in that data center.
Now we have the option
of obviously spreading them
across multiple locations
and we should do that.
That's like the next step.
So from one,
you go to two.
That is like a nice improvement.
And then if that works
and we're happy with that,
we can go to more
and that's the plan.
But before we can do that,
we should also obviously
replicate the database.
So we should have
multiple followers,
like one leader, multiple followers. And then obviously all the local apps, sorry, all the apps
which are not running in Virginia, in Ashburn, they should connect to their local Postgres follower,
so they're connecting closer to themselves. Exactly. Versus back to the one in Virginia. Exactly. And
then we can even, like, remove shielding, for example, from Fastly.
But that's a change which I didn't want to do before we had like multiple locations.
So right now we're like a single region, but more than one, which is already an improvement.
Right.
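In Ecto terms, the "one leader, followers near each app instance" plan usually looks like a second read-only repo pointed at the local follower -- a sketch with made-up module names, not what is deployed today:

```elixir
defmodule Example.Repo do
  # Primary repo: all writes go to the Postgres leader in Ashburn.
  use Ecto.Repo, otp_app: :example, adapter: Ecto.Adapters.Postgres
end

defmodule Example.Repo.Replica do
  # Read-only repo pointed at the nearest follower; each region would
  # configure its own replica URL (hypothetical env var DATABASE_REPLICA_URL).
  use Ecto.Repo,
    otp_app: :example,
    adapter: Ecto.Adapters.Postgres,
    read_only: true
end

# Lag-tolerant reads (feeds, public pages) go through Example.Repo.Replica;
# writes and read-your-own-writes paths stay on Example.Repo.
```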
Can we do a quick TLDR, TLDL of why it's finally happening?
It's caching, basically, right?
That's the reason why we haven't been able to replicate the application
was because of caching issues?
Yes, but no.
Okay.
Yes and no.
Okay.
So not a TLDL then, give the longer, short, long version of that.
Why?
So the OBAN workers were an important part of that, right?
So knowing that we're not basically missing important operations was essential.
So because the jobs are now going to the database,
when, let's say, the app stops halfway through doing something, that's okay.
The other instance can pick up the job and resume it and send the email.
In this case, 38 times. I mean, there's something there.
Well, they're just overachievers. I always liked my code to go above and beyond.
So that was one. The other one, the caching side of things, I think it's okay if multiple backends
have like different caches. I think that's okay. Obviously we'll need to look into that as well.
And this is like, back to you, Jared, where are we with that? Because I still don't know what is the plan.
After all, we've been back and forth for a year now, two years.
We're getting closer, but we're not there yet.
So one thing that we should point out with regards to this,
I think the Fly machine upgrade to Apps v2
is pushing us into this new world.
We did not, at this point in our lives,
choose to go into this new world.
And maybe, Gerhard, you knew the date better than I did,
but we knew this migration was coming,
where Fly was saying you have to upgrade to Apps v2.
I just didn't realize it was going to happen when it
did. And so I wasn't ready.
The code wasn't ready.
And the migration went through just
fine. You did some work there. You can talk about the
details of that work. We're now on Apps v2.
And that allows us to run these
multiple nodes at the same time.
It requires it, really. Doesn't it require it?
So, I mean,
you can run just a single one.
Okay. So a couple of things. First of all, this migration was like a progressive rollout.
So certain apps of ours, we received a notification like, hey, these apps will be migrated from
V1 to V2 within like the next week. And then our app, the changelog app, it required a blue-green deployment strategy.
Blue-green was not implemented for apps v2 until maybe a month ago, maybe two months ago.
It actually was, like, a month ago, because two months ago, they didn't have this option.
So a month ago, this was enabled.
Shortly after, I think a week after, we received this notification: hey, this app will be migrated to V2. But the problem for us is that
our deployment configuration, I wasn't sure whether it would work with V2, because what they say is, hey,
if you have a chance, try and save the Fly config after the upgrade, because some things may have
changed in the configuration. So in our case, we didn't have to do that, like everything continued
working, which was a nice surprise.
But when you go to V2 and when you go to machines or apps V2,
their strong recommendation is to run more than one.
And the reason for that is the app will not be automatically placed on another host if a host was to have a physical failure.
It doesn't happen that often, so on and so forth.
But actually, the bigger the provider is, the more often it happens.
So in our case, I think we would have been fine, but I wanted to make sure that we're
running on two hosts just to prevent the app going down and then me having to basically
jump into action and fix it.
So that was, I think, pull request 475, and you can check it out on GitHub. So I was ensuring that the
app deploys, that they work on Fly.io machines. I did, like, a few small changes, a few small improvements;
flyctl, the CLI, was updated, a couple of things like that. And then everything
worked as it should have. The warning which you get in the dashboard, in the Fly dashboard, if you're on
a single instance, they say: we strongly recommend that you run more than one.
And they explain in their documentation why.
So it's basically a strong recommendation, and we did it.
I see.
Okay, so we did it.
Yep.
But I will now confess that we were not ready to do it.
I didn't know we were doing it yet.
And I did not fix the caching problem.
Right.
Which we did experience.
So a few people mentioned, hey, this GoTime episode appears in my podcast app,
and then it disappears, and then it appears again.
And I call that flapping.
I'm not sure exactly what it is. But basically, depending on which version of the app you're hitting,
the cache may or may not be up to date.
And so the reason for this is because the way the code works is,
it's after you publish or edit or whatever,
we go and clear the cache,
which is just right there in memory in the application server.
And we do not clear the cache across all of our application servers
because we aren't good enough to write that code yet.
I do have what I think is the best case fix for this,
which I learned from Lars Wikman,
but I'm not going to use exactly what he built.
I think we just should use the Phoenix PubSub implementation.
But in the meantime, I was like, well, this isn't cool.
So I'm just going to reduce our cache times.
These are response caches.
So you hit changelog.com slash podcast slash feed.
We deliver you an XML file of multiple megabytes.
We cache that right there inside the application
because it's not going to change.
And we'll cache it for infinity until we have an update.
Well, I just changed that to two minutes.
And I was like, well, we'll just cache it for two minutes,
and every two minutes we'll go ahead and just regenerate
and we'll watch Honeycomb and see if that puts ridiculous amounts of strain and slows down our
response times, etc., etc., etc. This is behind Fastly, by the way. It's just that Fastly has a
lot of points of presence, and every single one of them is going to ask for that file,
and so there's still a lot of requests. That was just kind of good enough. It's working,
right? Things are fast enough, they're good enough, doggone it, people like us. So that's an old Stuart Smalley line for those
who missed it. Because I'm good enough. I'm smart enough. And doggone it, people like me.
But that's not really a fix. It's just a workaround. It stopped the flapping because
basically if you're out of date, you're only going to be out of date for 120 seconds and then you're going to get the new file.
And so that's what I'm doing right now.
I'm just clearing the caches every two minutes
and so every app server is going to be eventually consistent
every two minutes.
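To make the workaround concrete, here's a bare-bones version of an in-memory response cache with a short TTL -- plain ETS, nothing changelog-specific, with the two-minute expiry standing in for the old cache-forever behavior:

```elixir
defmodule Example.ResponseCache do
  # Per-instance cache: each app server has its own table, which is exactly
  # why two instances can briefly serve different versions of the same feed.
  @table :response_cache
  @ttl_ms :timer.minutes(2)

  def init do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  def fetch(key, generate_fun) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires_at}] when expires_at > now ->
        value

      _ ->
        # Expired or missing: regenerate (e.g. render the ~11 MB master feed)
        # and keep it for another two minutes.
        value = generate_fun.()
        :ets.insert(@table, {key, value, now + @ttl_ms})
        value
    end
  end
end

# Example.ResponseCache.fetch("/podcast/feed", fn -> render_feed(:podcast) end)
```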
I'd much rather have the solution that actually makes sense,
which is clustered app servers that are pub-subbing
an opportunity to clear their caches
and we can go back to infinity
because there really is no reason to regenerate that
until there's an update.
But for now, we're just doing every two minutes.
That was my quick fix,
and I was hoping that it wouldn't provide
or require too much extra processing on our app servers.
From what I can tell in Honeycomb,
and I'm no expert, it seems like everything's okay.
Obviously not operating at full potential at the moment.
Yeah.
So this is a very interesting thing that you mentioned
because one of the improvements that we did,
we set up both SLOs that we were allowed to set up in Honeycomb.
We can come back to why there's only two.
So the first SLO is we want to make sure that 95% of the time
podcast feeds should be served within one second.
So in the last 30 days, that is our SLO.
The second one
is that 98% of all responses should be either 200s or 300s. So these are the two SLOs. Now,
what we can do now is dig into the first SLO, which is 95% of the time the podcast feeds should
be served within one second and see how that changed since you've made the caching change.
What I've seen is no change when I was looking at this.
Oh, good.
So I'm going to share my screen now.
Make sure that...
That I audibly describe what we're looking at
for our listener.
Exactly, and that I'm not missing anything important.
So I will try to do this as good as I can.
We're looking at Gerhard's
screen. Yes. Thank you. This is a honeycomb browser of podcast feed response latency. Go ahead.
That's it. So 95%. So, all configured, right now the budget burn-down, we are at negative 5.68%.
Is that good or bad? Well, we've burned our budget, which means that we're under 95%,
so you can see 94.72.
So we're failing.
94.72% of the time,
we are serving our feeds in less than one second.
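As a rough sanity check on where that negative number comes from (assuming Honeycomb measures the burn against the 5% of requests the SLO allows to miss the one-second target):

\[
\text{budget remaining} \approx \frac{(1 - 0.95) - (1 - 0.9472)}{1 - 0.95} = \frac{0.05 - 0.0528}{0.05} \approx -5.6\%
\]

So at 94.72% compliance, slightly more than the whole error budget has been spent, which is roughly the -5.68% the dashboard shows.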
Oh, I can fix that.
We just changed the budget so that we pass it.
Right, exactly.
So that would be exactly.
So let's just agree on a new budget.
That's it.
Yeah, exactly.
Yeah, but this is just like supposed to give us
like an idea of how well we are serving our podcast,
sorry, our feeds.
Now these feeds, as you know,
they're across all the podcasts.
And all clients all around the world, right?
Exactly.
Okay, so now I'm sharing all of it,
like the entire Brave browser.
So I can basically open multiple tabs.
So I was looking at the first one: 95% of the
time, so regardless whether cached or uncached. I have, like, some saved queries here, so now we can
see what is the latency of cached versus uncached feeds. These are the last 28 days. We're gonna...
Hits and misses? Exactly. When we hit them, it's more or less the elapsed time. So I can just drill down in these, and I can see, last 20 days,
they are served roughly within 0.55 seconds. So between half a second, 500 milliseconds, 550
milliseconds. Yeah, we'll take it. So let's go to the last 60 days and see if that changed. That
shouldn't have changed, by the way. We have a few big spikes, and by the way, this is Fastly, so Fastly
serving this, we can see some spikes all the way to seven seconds.
But overall, we're serving within half a second.
No big change.
So let's flip this and let's say, show me all the misses,
because this means it goes to the app, right?
2.5 seconds.
That's it.
So what we're seeing here,
we can see how many of these requests went through.
So we have about 2,000 in a four hour period, two and a half thousand. it so what we're seeing here we can see how many of these requests right where it went through so
we have about 2 000 in a four hour period two and a half thousand and we can see that the latency
wise we are at 2.3 seconds 2.8 seconds it varies but over the last 60 days we have obviously like
an increase here up to four seconds 4.6 seconds July 9th. But otherwise, it hasn't changed.
Right.
So my change was no big deal.
A nothing burger.
Which is nice.
Yeah.
So here's the other thing.
Let's have a look at the URL.
I think this is like an interesting one.
So let's group them by URL
because what will be interesting is to see
which feeds take the longest, right?
Like the P95.
And you can see all the misses.
So this is the podcast feed.
The P95 is 2.9 seconds, Practical AI 1.5 seconds. But the one which has the highest latency is the master feed. And I don't think that's surprising. Not at all. If you actually just go download the
master feed, I just did the other day, it's about 11 megabytes. I mean, it's a gigantic file. Yep. And we're recalculating the contents of that once every two minutes.
Every other time it's just sending the file. But even just sending that file from, I guess it's
from Fly to Fastly, that's just going to take some time. And then sending it from Fastly, obviously, to
wherever it goes, well, that's up to Fastly. But the only way we can make that faster,
there's two ways I can think of.
One is you cache it forever,
so you just get rid of that calculation time,
which happens once every two minutes.
Then all you have is send time.
We're already doing gzip and whatever else you can possibly do
in terms of just HTTP stuff.
The only way you can make it faster
is take stuff out of it, I think.
Yeah, limiting it.
And we used to do that. We used to
limit it, because I think we have over
1,100 episodes in there, and there's
everything, pretty much.
Not the transcripts, thankfully.
That would really balloon it up.
But chapters, etc., etc.
There's lots in there. Show notes,
links, descriptions,
all the stuff for every episode we've ever shipped.
The only way to make that smaller
is you just limit it to N episodes
where N is some sort of number like 500 or 100.
We used to do that.
I would happily continue doing that
if the podcast indexes would just keep our history,
but they won't, they'll purge it.
And then you'll go to our master feed
and you'll see 100 episodes.
And you'll be like, cool, they have 100 episodes.
No, we've put out 1,100 plus episodes.
We want people to know that.
We want people to be able to listen to them.
We used to have complaints, hey, why won't you put the full feed in there?
There is a feature called paginated feeds.
It's a non-standard RSS thing that we used to do.
And we paginated that and it was a much smaller thing.
And then it was great, except Apple Podcast didn't support it.
Spotify didn't support it.
Blah, blah, blah.
It's that old story.
So I was like, screw it.
I'm just going to put everything back in there and it's just going to be expensive and slow.
And that's what it is.
What do you guys think about that?
Like, is that a good trade-off?
Just leave it because it's an 11 megabyte file.
I don't know.
What do you do?
Well, I think that serving the full file is important
for the service to behave the same.
So I wouldn't change that.
If you change the file, it will appear differently in these players.
So I don't think we should change that.
If they all supported pagination, I would happily paginate it
every 100 episodes or so.
And we have a bunch of smaller files
that are all faster responses.
And like, I would love to do that,
just like you do with your blog.
Yeah.
But you know, you can't make these big players
do the cool stuff.
They never do the cool stuff.
They always do what they want to do, so.
But I think that's okay.
So if we look at how long it takes to serve this master feed
from Fastly, from the CDN directly, versus our app. So when it's a cache hit, when it's served directly
from Fastly, we are seeing a P95 of 2.7 seconds. Okay. When our app serves it directly, it's 9.9
seconds. So it's roughly three times slower. I don't think that's so bad.
Our app is three times slower than Fastly. I think that's okay.
I also think with this kind of content, it's okay. If this was our homepage, this would not be okay.
If we had humans consuming this, it would not be okay. But podcast apps, crawlers, like,
oh, sorry, you had to wait three seconds to get our feed? Like,
who cares? You're a crawler, you wait around until there's the next one. Yeah. So the fact that we have
slow clients, and clients that aren't people, they're actually just more machines consuming it,
I have less of a problem with that being just not super fast. If this was our home page, like,
it'd be all hands on deck until you get it fixed. There's no way I'd make people wait around for this kind of stuff.
What I'm wondering, if we can improve this, is: do we care about the cache hits, or do we care about
the cache misses? Because based on the one that we care about, we can maybe see if there are some
optimizations we can do. About cache hits, I'm not sure what we can improve, because it's not us,
it's basically Fastly. I'm not either.
But cache misses, and I think this is something really interesting.
If I dig into Honeycomb, into the cache misses,
you will notice something interesting.
You'll notice that there was quite some variability.
So it would take anywhere from four seconds to 15 seconds, right?
Like we see the squiggly lines.
But then from August the 4th, we're seeing five seconds,
eight seconds, six seconds. So it's less spiky, and it seems to keep within 10 seconds. There seems to
be an improvement. So I'm wondering, what changed there? Did something change on August the 4th?
I have a hunch. I'm going to see if it's that. I'm not sure.
August 4th, run.
I'm looking at the commits on August 4th.
Run Dagger Engine as a fly machine.
That was your commit on August 4th.
Yeah.
And that was the only one
in terms of things that might've went live.
Well, I think August 2nd,
that's when I commented,
this was merged last week.
We upgraded PostgreSQL.
Upgrade Postgres, on August 2nd. Yep.
Went from Postgres 14 to Postgres 15. It's possible, because the app is hitting the database
not on every request, but on every miss, once every two minutes, and that's going to be
really slow, right? And so that could slow down other requests. So if Postgres is getting our data back out faster somehow
because of some sort of optimization that they did,
which I could certainly see,
then that might be what explains that
because there's no application changes from us.
So again, like all these other pull requests,
I mean, it went from AWS S3 to Cloudflare.
Again, I don't think that's related.
No.
And that happened on...
It was just over the weekend.
Three days ago,
August 7th.
August 5th, actually.
So that happened
after our improvement.
But that wouldn't have
anything to do
with the feeds.
That's just the,
you know,
that's the MP3s themselves,
but not the feed files.
So,
well, cool.
So Kaizen, right?
So we upgraded Postgres
and we got a little bit faster
in our cache miss responses on our feed endpoints. Well, it's just the master feed that's improved. I mean,
that's the one where there's, like, some improvement there. But then, like, in this view it's difficult to
see the P95. Um... MISS, MISS Go Time feed. I mean, maybe we can just zoom in on that. I mean, do we want to
continue doing this, or shall we switch topics? I don't know. Adam, how bored are you at this point? One more layer. One more layer. Okay. Oh, he's never bored. He's
always going to go one more. All right. Peel the onion. So this one seems to have not changed.
Like, the Go Time feed hasn't changed, right? If anything, it looks slightly worse here,
but again, like we would need to continue digging to see like, Hey, what client,
where, which data center it's coming from, things like that. Yeah. So maybe it's location specific, right?
It's a client, for example, from Asia, which is going to Ashburn.
Obviously that's going to be slow, or we have a few clients from, you know, a certain geographical
region.
Getting routed to the wrong place.
Yeah.
Which would add to this latency, exactly, routing differently or whatever.
So, but the master feed improving, which is by the way, the one which takes the longest,
I think that's a good one.
And the improvement was around eight seconds, roughly, plus or minus?
Yeah, roughly eight seconds.
Yeah, I mean, in terms of percentages, it's more than 2x, 3x, almost 3x faster.
Yeah.
So that's a huge one.
And so to kind of zoom back out a little bit, Jared, you're saying the perfect world would be a PubSub multi-geo application to know if the update should happen and do it indefinitely.
Right. Or infinitely, I think you said, versus this temporary, you weren't ready necessarily to, you know, version from V1 to V2 apps.
And then you made it update every two minutes instead of infinitely, because that would obviously have the cache issues we have with clients.
Yeah, exactly.
There's no reason not to cache forever because the file doesn't change until we trigger a change by updating something.
Right.
When it updates, we update it and then it caches again forever.
Exactly.
And right now when we update it, we clear the cache, but it only clears it on the app server
that's running the instance of the admin that you hit.
The other app servers that aren't running that request
don't know that there's a new thing.
Well, with Phoenix PubSub and clustering,
you can just PubSub it.
You don't even have to use Postgres as a backend,
which is what Lars Wikman's solution does.
And you can just say,
hey, everybody clear your caches.
And they'll just clear it that one time.
And then we never have to compute it
until we're actually publishing or editing something.
And so that's darn near as good as a pre-compute.
Because I know there's a lot of people out there thinking,
why don't you just pre-compute these things?
This is why static site generators exist, et cetera.
Because that is just a static XML file effectively
until we change something.
That's a different infrastructure.
That's a different architecture that we don't currently have.
And so it's kind of like easy button versus hard button.
I've definitely considered it,
but if we can just cache forever
and have all of our nodes just know when to clear their caches,
then everything just works hunky-dory.
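A sketch of that pattern with Phoenix.PubSub, once the app instances are clustered -- the PubSub server name, topic, and cache table are placeholders, and this is the general shape rather than whatever changelog.com eventually ships:

```elixir
defmodule Example.CacheInvalidator do
  # One of these runs on every app instance. Whichever node handles the admin
  # publish/edit broadcasts, and every node (sender included) purges its own
  # local response cache, so the TTL can go back to "forever".
  use GenServer

  @topic "cache:purge"

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Called from the publish/edit code path on whichever node took the request.
  def purge(key), do: Phoenix.PubSub.broadcast(Example.PubSub, @topic, {:purge, key})

  @impl true
  def init(:ok) do
    :ok = Phoenix.PubSub.subscribe(Example.PubSub, @topic)
    {:ok, nil}
  end

  @impl true
  def handle_info({:purge, key}, state) do
    # Local, per-node delete -- e.g. the ETS-backed response cache sketched earlier.
    :ets.delete(:response_cache, key)
    {:noreply, state}
  end
end
```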
For now, we'll just go ahead and take the performance hit,
recalculate every two minutes.
That seems to be not the worst trade-off in the world,
looking at these stats.
But that would be a way of improving these times.
But now that you mention that, when we used to cache,
when we used to have a single app instance,
I didn't see much better times.
The feed was being served in more or less
like the same time, right?
So if I look at like this is
this is the Go Time feed when we had misses. So let's go with 60 days, in a 60-day window, right? It was
just under two seconds, and it hasn't changed much, even when it was cached. That's weird. So what I'm
wondering is, again, going back to, like, generating these files: could we upload them to
our CDN?
And our CDN, by the way, we have two.
Of course.
Now there's another thing to talk about.
We have Cloudflare R2.
So could we upload the file to R2 and serve it directly from there?
Yes.
That was one of my other architecture options,
is doing that.
You have a lot of the same problems
in terms of updates and blowing things away and all that.
It's definitely a route that we could go.
We have a dynamic web server that is pretty fast and is already working.
And so just running the code at the first request, to me, makes a lot of sense, but we can certainly at the time of update or publish or whatever it is,
go ahead and run all the feeds,
pre-compute them, and upload
them somewhere. I'm thinking Oban.
We want more Oban, right? We could.
And it doesn't matter which instance
picks up the job and which
instance uploads the feed, ultimately
one of them will upload the feed
on our CDN and
that will be that.
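If the pre-compute-and-upload route ever wins out, one rough shape for it is an Oban job that renders a feed and puts it on R2 over the S3-compatible API. Bucket name, endpoint, and the render function are placeholders here, and credentials would come from the usual ExAws config:

```elixir
defmodule Example.Workers.FeedUploader do
  use Oban.Worker, queue: :feeds, max_attempts: 5

  @bucket "changelog-feeds"  # placeholder bucket name
  # R2 speaks the S3 API at <account-id>.r2.cloudflarestorage.com,
  # so ExAws can be pointed at it with a host override.
  @r2 [host: "ACCOUNT_ID.r2.cloudflarestorage.com", region: "auto"]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"slug" => slug}}) do
    xml = Example.Feeds.render(slug)  # hypothetical feed renderer

    @bucket
    |> ExAws.S3.put_object("feeds/#{slug}.xml", xml, content_type: "application/xml")
    |> ExAws.request(@r2)
    |> case do
      {:ok, _response} -> :ok
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Whichever instance grabs the job does the upload, which is the point made above.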
So one thing we don't want to change
is the URLs to our feeds.
And our URLs to our feeds currently go to the application.
And so the application would have to be wise to say,
does this file exist already on R2?
And if it does, serve it from there.
If not, serve it myself.
No, because we, remember, we have Fastly in front.
So we can add some rules in Fastly to say,
hey, if this is a feed request, forward it to R2,
don't forward it to the app.
What if R2 doesn't have the file for some weird reason?
Well, it will, right?
It will have like an old one.
That's what you think.
What if it doesn't?
Well, you've already uploaded the file once.
You're just updating.
Well, we have to blow it away.
Maybe it's just a race condition at that point.
Well, why can't you just re-upload it?
You're doing a put.
Well, that's true.
Is that atomic?
I assume, I guess that's the point, atomic.
It should be.
I mean, the file's already there.
It's just basically updating an existing object
because it's an object store.
Right.
You're saying, hey, there's like this new thing.
Just take this new thing
and then the new thing will be served.
Okay.
Yeah, we could definitely try that route
and then we just turn off caching at the...
Well, our application server would never...
Never even see those requests.
So we would lose some telemetry that way
because we are watching those requests
from the crawlers
because some crawlers will actually report subscriber counts.
I see.
And so our application's logging subscribers,
which is a number that we like to see,
from those feed crawler requests.
So we lose that visibility.
Maybe we can get it at the Fastly layer somehow.
Fastly logs everything to S3.
We're putting more and more stuff in the Fastly at this point as well.
So I just, I'm tentative.
I like to have everything in my code base if possible.
But the folks at Cloudflare right now
are really upset by this conversation.
Well, we're using Cloudflare behind, so
that's there. I know, but we're not
using their stats. We're not using...
The more entrenched we are in Fastly's
way of things, they're like, no, that's
the dark side. Right. And the Fastly
folks are probably thinking, you're using
Cloudflare? No, that's the dark side.
Yeah, well, we're using both.
So we have two CDNs.
But we're using them differently though, aren't we? Aren't we using them differently? Like we're
using R2 simply as object storage, not CDN necessarily. That's right. We replaced AWS
with Cloudflare, not Fastly with Cloudflare. So far, who knows where we go from here. But let's
talk about this migration because this was a big chunk of work that we accomplished. Yeah. Well,
the first thing which I want to mention is that you've made this list,
which really, really helped me. It was like a great one. I wasn't expecting it to be this good.
No, I was. I'm joking. I'm just pulling your leg. No, no, no. I was. I was.
This is better than I normally do. I was like, you know what? Let's open a pull request. Let's do this the right way. Yeah. I was surprised just by how accurate this list was.
Wow, Jared knows a lot of things,
like how this fits together.
I'm impressed.
So I genuinely appreciated you creating this.
By the way, this is pull request 468
if you want to go and check it out.
I mean, you created like the perfect,
like, hey, this is what I'm thinking.
Like, what am I missing?
And actually you didn't miss a lot of things.
Good.
So we went from S3 to R2,
where as you know,
we're using AWS S3 to store all our static assets,
all the MP3s, all the files,
all like all the JavaScript and the CSS
and all those things and SVGs.
And we migrated, might I say with no downtime,
like zero downtime.
Zero downtime. Yeah, on a weekend, as you do, right no downtime, like zero downtime. Zero downtime.
Yeah, on a weekend, as you do, right?
I was like sipping a coffee.
Okay, so what should I do this weekend?
How about migrating hundreds of gigs from S3 to R2?
And it was a breeze.
It was a real breeze.
That's awesome.
And your list played a big part in that, Jared.
So thank you for that.
That's good stuff.
Wow.
Let's put a little clap in there.
Thanks, guys. Appreciate that.
Applause. I'm looking at number
six. Make sure we can upload new files. And you
didn't do that yourself, but you can check it off because
I just uploaded ChangeLog News yesterday.
Everything worked swimmingly. We published a new
MP3 file without any
issues whatsoever.
Amazing. Where should we start? Should we start with the why
on this one? I mean, I think, well, the why is easy, right? Maybe Adam can cue up with a why here.
Well, I just pay attention to how much money we spend.
That's right. Every dollar comes out of our bottom line, pretty much, right? And I was like,
why is this doubling, you know, every so many months? And then it was just like...
It had been very, very small for so long, like sub 10 bucks, for a very long time. In the last year
it's gotten to be like 20, 30, 40, 50, and then recently it was over 100. Yeah, I think about six
months ago, a few Kaizens ago. And I'm like, why? And we couldn't really explain exactly what, but then
we explained some of it, but then it only went down a little bit. Then it went back down to, like, 120 bucks. But that's a lot of money
to spend on object storage, right? I mean, it's just, it's more than we want. And when you can get free
egress, well, you take free egress. Yeah. So one theory you mentioned... I think we actually got to
150 at one point, uh, maybe the last time we recorded a Kaizen, which really was like, okay, let's make
some moves here.
Because if it goes to 300,
if it doubles from 150 to 300, that's an issue.
So I knew that it would be a bigger lift to migrate
our entire application, which is
the bulk of it, because of all of our MP3s.
Which Fastly, of course, is serving them,
but we are putting this as the origin
for Fastly, and so it's requesting them from
S3 for us. And that was the major cost.
It was like outbound traffic, major cost on S3.
And so we knew with R2 we'd have zero on that.
This took two months-ish from then,
like we actually landed this,
it was almost two months from us realizing
that we should do this.
However, changelog.social,
which is a Mastodon app server,
was also on S3. And I immediately switched that one
over to R2 just to try out R2. And it was super fast and easy to do that. And I think we went from
150 down to 120. It started to drop precipitously after that. And I think it's because of the way
the Fediverse works. When we upload an image to Mastodon, as I do with my
stupid memes and stuff, right? And we put it out on our Mastodon server. Well, that image goes
directly from S3. Oh, I put Fastly in front of it too, I thought. I might have. But somehow that
image is getting propagated around because all the Mastodon instances that have people that follow
us have to download that image for them to be able to see it.
And so this architecture of the Fediverse,
where it's all these federated servers,
they're all having to download all those assets.
And so I think maybe that was a big contributor to that cost,
was just changelog.social.
And once I switched that, it started to come down.
And now it's going to go to pretty much zero because of this change.
Yeah, it just should be a few dollars.
And I think we have a few things to clean.
So I was looking, I was basically enabled storage lens,
which is an option in S3, and you can dig down.
So I'm just going to, again, sharing my screen.
I'm going to click around for a few things.
I'm going to come back to 469.
Obviously, you won't have access to this,
but if you're using AWS S3, you can enable Storage Lens and have a play around with it. So what I want to
see is here, extended Storage Lens. Okay, and now it loads up, and we can see where the cost goes.
So we can see the total storage, we can see the object count, we can see the growth and how things
are changing, and, you know, how many more things we're adding. This was in the last, like, day to day, all requests, like month to month. So you can see we have, like, a one
percent change in total storage month to month, so we're, like, approaching the one terabyte mark.
Not there yet, but getting there quickly. And if we see, like, which are the buckets that contribute,
and I have to remember, where was it... oh, there you go. So you can see changelog assets, which are the static ones, they contribute 22%. Changelog uploads Jared, they contribute 21%.
This is the storage cost. And changelog com backups, which is mostly nightly,
they contribute again, 20%. So they're like roughly evenly spread. So I'm wondering,
is anything here that we can clean up? Anything here that we don't need? Well, we can get rid of changelog uploads, Jared,
because that was my dev environment.
Basically, I would mirror production with the assets.
Right.
So that I had the most current assets,
because I like to do that when I'm developing,
have it look real.
And so I just had this AWS S3 sync command
that would just sync from slash assets to mine,
which is why
they're roughly
the same amount
of gigabytes.
Probably haven't run
it in a while.
I see.
And so that's all
moved over to R2.
So that whole bucket
could just get blown away.
Okay.
Should we do that now,
live?
Yeah, let's do it.
What's going to happen?
Right?
What's the worst thing
that can happen?
Let's do it live!
Like some sort of
ta-da sound?
Right, boom,
everything explodes.
So I think we won't
be able to do that.
We'll need to delete
the individual objects, by the way.
Ah, you can't just delete a bucket? What's wrong with these people?
delete a bucket? What's wrong with these people?
It's too dangerous. I remember this again,
this not being
possible. So let's, again, let me search for
Jared. I found the bucket.
So we select it, let's say delete.
And to confirm, buckets must be
empty before they can be deleted.
You know what?
R2 has the same exact thing because I created a test bucket.
I tried to move our logs over there as well.
That failed, maybe we can talk about that.
But I couldn't delete it without emptying the bucket first.
And I'll say this, R2 does not have the ergonomic tooling that's built up around S3.
And so in order to delete all the objects inside the R2 bucket,
we're talking about you're writing JavaScript, basically.
There's the GUI apps, the tools,
all that stuff isn't there.
And it's API compatible with S3,
but not really.
It kind of goes back to our conversation, Adam,
with Craig Kirsteins about Postgres compatible
isn't actually Postgres compatible.
Cloudflare's S3 compatible API
is not 100% compatible.
It's like mostly, but enough
that certain tools that should just work
don't. So like Transmit,
for instance, which is a great
FTP. It started off as an FTP client,
has S3 support.
I think I complained about this last time
we were on the show, so I'll make
it short, but it doesn't support R2 because of like streaming uploads or some sort of aspect
of S3's API that R2 doesn't have yet. So anyways, I haven't deleted a bucket from R2 because you
have to actually click like highlight all and then delete and it paginates and they're like,
okay. And there's like thousands of files. How do you delete them from S3?
Just open up an app and select all and hit delete or what?
Well, I think I would try and use the AWS CLI for this.
You would?
Yeah, that's how I'd approach it.
And I think just like that, I would like maybe script it,
like list and delete things as a one-off.
Now I would try Transmit to see if that works.
I mean, we're talking S3 now, right?
I just open it up in Transmit and hit delete.
I'm doing the same thing now, see if I can delete it from Transmit.
Oh, it's going to be gone already. I already did it.
That's why nice GUI apps
are just for the win, you know?
I just open it up in Transmit, select all,
hopefully I did the right bucket, that was pretty fast.
All you had was just the uploads folder in changelog uploads Jared, right?
There was a static folder, but it's already gone because I just deleted it.
Nice.
Better look at it quick because it's going away.
That's why it'd be great to have a transmit for R2.
So somebody out there should build a little Mac GUI for R2.
You can call it D2.
I believe somebody said they wanted to call it D2.
Is that in Slack or Twitter?
That was on Twitter.
Jordy Mon Companies, who's a listener
and one of the hosts of Software Engineering Daily.
We know Jordy.
He's the one who said, call it D2.
I was like, that's a good idea.
That would be a good one.
Is it Jordy?
Yes.
In my brain, I've had it mapped to Jory.
It could be Jordy. It's a J name that, you know, whenever someone's potentially around the world, Js are pronounced differently. But
he's from the UK. I don't know, I'm gonna go with Jordy. Okay. Yeah, call it D2, you know, write it in
Tauri. We'll cover it here in Changelog News, of course. But yeah, R2 is just too new to have all
the great tools. I mean, S3,
Yeah, right, just has everything. It's been around for a while, for sure. So, what I wouldn't delete,
I wouldn't delete the changelog assets on S3. I mean, we can consider that our backup, if something...
Backup, yeah. ...was to go catastrophically wrong with R2. Again, I don't expect it to happen, but you know,
better be safe than sorry. I mean, we can keep those 100 and
whatever, 200, or however many gigs we have in S3 for this. We won't be doing any operations against
them, so it shouldn't cost us much other than storage space. And continue using R2, maybe even
set up a sync between R2 and S3, so that we have a backup to the backup, or like a backup to our new CDN, in a way. So that would be good. But yeah,
I think that's a good idea. So we are on R2. We did it. And it was a breeze.
Why not consider B2? Backblaze B2 versus S3.
So I know, I listened to the episode, by the way, great episode, loved it.
Backblaze episode?
The Backblaze episode.
And I'm using them
and have been using them for many, many years.
When I've set up my Kubernetes backup strategy,
by the way, I have a Kubernetes cluster in production.
That's a thing.
And all my workloads now running on Kubernetes.
We can talk about it later.
In your home lab?
No, no, no, in production.
Okay, for Dagger or?
Well, what that means, I mean,
I've been hosting a bunch of websites for decades.
Oh, that's right.
Right.
So it's like mostly WordPress websites,
some static websites.
We're talking 20 websites.
I won't be giving any names.
Again, they're like longtime customers.
BBC, NYtimes.com.
That's right.
That's it.
That's it.
BBC, all of them.
All of them.
Yeah, exactly.
So I've set up this production cluster.
I mean, this was the second one.
I set it up in June
and I've been hosting these workloads.
I was using a lot of DigitalOcean droplets.
I had about 10.
So all of these I consolidated
in two bare metal servers
and they're running Talos
and it's all production.
So obviously production needs backups
and it needs restores.
So what I did when I was migrating between Kubernetes clusters, these workloads, the backups were going to B2.
And B2 was okay, but slow, like sometimes unexpectedly slow.
I had the same feedback from Transistor FM.
I had them on ShipIt and they were saying some operations on B2, sometimes they're slow. So they can take minutes instead of seconds. And that was my experience as
well. Restoring things from B2 was incredibly slow. So it took me 30 minutes to restore,
I don't know, like 10 gigs roughly. And that's not normal. So what I did, I said, okay, I have to try R2. I tried R2, same restore, three minutes.
So there's a 10X difference between B2 and R2
in my experience.
Again, it's limited to me.
So that's why for big restores, I'm restoring for R2.
But of course, I'm using both B2 and R2
because I have two backup mechanisms in place.
Of course.
The reason why I suggest or even ask B2 versus S3
is if it's only for backup,
B2, based on their pricing page, is $0.005
per gigabyte per month.
And S3, if this is accurate, is $0.026,
which is five times the cost per gig.
Yeah. So if it's just backup, we can deal with slowness, right?
I mean, if it's a restore, we can deal with slowness.
We can just buffer that into our mental space and then, you know,
keep five times our dollars.
So here's a question related to Kaizen infrastructure and what not.
If we were to say, okay,
what we want is a backup service that takes our R2 things and puts them in B2
once a day or once a week, or even just a mirror,
just constantly mirroring.
Where would we put that?
Where would it run?
How would it work?
We could write an OBAN worker for it.
True.
What I would do is solve it as a CI/CD job, yeah.
So it would be a custom robotic arm inside of our Dagger.
That's it.
And it would pick it up and it would move it and drop it.
It would be GitHub Actions, you know,
and it would just run there.
We could have an O-band worker as well.
I mean, whatever we're more comfortable with.
Yeah.
But then should our app know about this?
Maybe it should.
I mean, it has access, right, to all the credentials.
It's easy in terms of secrets and stuff.
It's already there.
I mean, we obviously have to add the B2 stuff.
I think the question is, do you want to do it or should I?
That's what it comes down to.
This is a good question.
I'd rather you do it.
There you go.
So that settles it.
I don't want to do it.
Do you like how I had to act like I thought about it?
I made it dramatic.
Yeah, you did.
You really acted that out good, Jared.
I did.
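For the record, the "Oban worker" option could be shaped like this: both R2 and B2 expose S3-compatible APIs, so a nightly cron job can stream keys from one and copy them to the other. Endpoints, bucket names, the schedule, and the credentials handling are all placeholders, and a real version would skip unchanged objects and stream large files instead of holding them in memory:

```elixir
# Scheduled via Oban's cron plugin, e.g. in config:
# config :example, Oban,
#   plugins: [{Oban.Plugins.Cron, crontab: [{"0 3 * * *", Example.Workers.MirrorToB2}]}]

defmodule Example.Workers.MirrorToB2 do
  use Oban.Worker, queue: :backups, max_attempts: 3

  @bucket "changelog-assets"  # placeholder
  # Per-provider config overrides; access keys for each would also go here.
  @r2 [host: "ACCOUNT_ID.r2.cloudflarestorage.com", region: "auto"]
  @b2 [host: "s3.us-west-004.backblazeb2.com", region: "us-west-004"]

  @impl Oban.Worker
  def perform(_job) do
    @bucket
    |> ExAws.S3.list_objects()
    |> ExAws.stream!(@r2)
    |> Enum.each(fn %{key: key} ->
      # Naive copy: download from R2, upload to B2 under the same key.
      %{body: body} = @bucket |> ExAws.S3.get_object(key) |> ExAws.request!(@r2)

      @bucket
      |> ExAws.S3.put_object(key, body)
      |> ExAws.request!(@b2)
    end)

    :ok
  end
end
```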
What's interesting to think about,
just kind of almost separating this conversation a bit,
is you mentioned Dagger and you mentioned GitHub Actions.
And I'm just curious if Dagger's a potential acquisition target for GitHub.
Because if you are complementary and you're improving,
and every time we have a problem like this,
your solution is a background job built into CI using code,
Go code, whatever code you want,
because that's what the move from CUE
to everything else went to for Dagger.
And you're so complementary in terms of Dagger
to GitHub Actions.
You're not cannibalizing, you're only complementing.
I can definitely see it.
So a year from now, Gerhard will work for GitHub.
Anything is possible.
So that's a good idea there. For now, on that subject, again, I didn't want
to talk too much about Dagger in this Kaizen, but I'll just take a few minutes. So I noticed that
we had, again, this is Fly Apps V2 migration related, where we run, uh, we used to run a Docker
instance in Fly,
and that's where Dagger would run.
We'd have all the caching, everything,
so our jobs would be fast.
Our CI jobs would be fast.
And part of that migration, the networking stopped working.
So I was thinking, okay, well, we have all this resiliency
in all these layers, but we don't have resiliency in our CI.
So if our primary setup stops working on Fly, in this case, then nothing works.
So I thought, well, why don't you use the free GitHub runners?
That's exactly what we did.
So now if you look in our CI,
and there is a screenshot in one of these pull requests,
let me try and find it.
It's called "Make our ship it YAML GitHub workflow resilient", pull request 476. So the TLDR looks like this: when Dagger on
Fly stops working, there's a fallback job where we go on the free GitHub runners. It takes longer,
it takes almost three times as long, all the way up to maybe 10 minutes. But if the primary one fails,
we fall back to GitHub. We are running on Kubernetes, Dagger on Kubernetes.
So we have three runtimes now, Fly, GitHub, and Kubernetes.
And the common factor is Dagger.
It made it really simple to have this sort of resiliency
because at its core, it's the same thing.
We just vary the runtime.
But we didn't have to do much.
You can go and check
our Ship It YAML workflow,
GitHub workflow,
to see how that's wired up.
Again, it's still running.
It's still kicked off by GitHub Actions.
But then the bulk of our pipeline
runs in one of these places.
Where's the Kubernetes stuff?
That's the production Kubernetes
which I told you about.
Oh, it's at your house.
Well, no, it's not.
I have an experimental Kubernetes cluster in my house.
This is a real production one, right?
Running like in a real data center, not my house.
Hey man.
Yeah.
It's not like you never run
any of our production stuff from your house.
Exactly, I did.
And it worked really well, I have to say, for a while.
And then obviously we improved on that.
It was a stopgap solution.
Hey, you know, we've had the work from home movement,
you know, everybody's taking their work from home
and it's like, well, why not bring your work to your house?
You know, take your CI home with you.
Exactly, I take the CI.
Okay, so this is a production Kubernetes thing of yours
that this is running on.
This is like, that's a third fallback in case, or?
That one's slightly special in the sense
that that one doesn't deploy yet.
So it runs and it builds, but it doesn't deploy.
So there is this limitation in GitHub actions.
And again, if someone from GitHub is listening to this,
I would really, really appreciate
if some thought was given to this. So when you select runs-on, when you say, GitHub, runs on... all the labels have
to match. So what that means is that you can't have a fallback. You can't say runs-on this, or that,
or that, or that. You can't define, like, a nice fallback. So then what you do, you have to, like, basically say this job needs the other job. And if that job... it's just a mess. So if, for example, Kubernetes was not available,
how do we specify a fallback? And I say not available: it can't pick up a job. So it won't
fail, it's just not available. So a job will basically wait to be picked up for a certain
amount of time, and then it will time out, most likely. Again, I haven't
tested this fully, but that mechanism, the runs-on mechanism, is pretty inflexible in GitHub Actions.
Now, in the case of Fly and Docker, that's, like, fairly straightforward. It basically starts on
GitHub, and then eventually it hands over to Fly, because we start another engine anyways. I mean,
you can go and check the workflow, I don't want to go too much into the details. But that's like a simpler proposition. When you have a
third one, which may or may not be there, it's a bit more complicated. Gotcha. So right now I'm just,
like, running it as an experiment, to see how it behaves, to see, you know, if it is reliable long
term. And if it is, then maybe, you know, make a decision in a month's time or two months' time. But for now, it's Fly with the GitHub fallback. Cool. Resiliency for the win. Always have two.
Yes. And now we have three, just in case. Well, I didn't even consider that we would keep S3
for a backup, or consider B2 as a lower-cost backup, because I thought, well, we'll just, you know, cut our ties,
keep our dollars and move to R2 and that's done. But that does make sense because what if R2 poops the bed? You know, we're going to have some issues. We got all of our, almost a terabyte of
assets we've been collecting over these years, our JavaScript, our feeds, whatever we're going
to put there ever. If we have no business continuity, which is a phrase I learned 20 years ago now,
business continuity, right?
That's key in backups, right?
You can't just put the backup over there.
You've got to get it back to keep doing business.
So that does make sense.
I didn't consider that and I'm glad you did.
And the cost will go down, right?
Because again, we are using R2,
which is free for egress.
S3 isn't, so we're not pulling anything
from S3. I mean, if anything, we can move the bits over to B2, so that the storage costs will be lower.
But again, there'll be one-off operations, and by the way, when you write, actually, the operations, you pay for them.
But anyways, the point is, it will be, well, it will cost us something to
migrate off S3, but it's like a one-off cost. We've already done that though,
haven't we? So when we just move from R2 to B2. Oh, that's right. Actually, that's a good point.
Yes, exactly. We migrate from R2 to B2. That's correct. So maybe delete S3 after we migrate to B2. That's there. Cool. Well, you can delete my bucket now, because all the files are gone. So go
ahead and get that done at your leisure. It doesn't have to be... Okay, so let me refresh.
Yeah, that's right.
So let's get this thing done.
Yep, confirmed.
It's gone.
Bucket's there.
Files are gone.
Got an emotional attachment to this bucket though.
I've been using this for a long time.
You have another one in R2, by the way.
That's not the same.
Which is assets dev that you can use.
That's right.
But that's shared across multiple people.
So it's not as personal.
Like this was my bucket, man.
This was my bucket.
I see.
We can create one for you.
It's okay.
We can create one for you.
It's free.
I appreciate you consoling me
as you delete my bucket.
Changelog uploads, Jared.
No fat fingering, delete bucket.
Boom.
Spell that.
Boom.
It's gone forever.
All right.
Cool.
What about the backups?
What about the nightly backups?
Is that something that we can clear?
Because by the way, there's a lot of backups
going all the way back to 2000-and-something,
I even forget what it was.
Those are assets backups or database?
I think it's a database.
No, this is another one.
This is a small one.
This is like from our pre-Fly migration. We can come back to this a bit later, because it has just a few, and this costs us nothing. Backups nightly, they start in 2015.
Oh, it's Nightly. This is a backup of Changelog Nightly.
Yep. We don't need that.
We don't need that?
No, man, I don't think so.
But this is still happening.
Yeah, it is. Because
my code works. Right.
I wrote this years ago.
2015. I forgot about this.
We've been backing up changelog nightly
every night. Might be the first,
some of the first code you wrote for this company, Jared.
Might have been.
The last backup is 76 megabytes.
Do we want to delete the old ones? Like, what's the plan here?
Yeah, man, there's no reason to have them, because each one has the entire contents of the previous ones.
Oh, I see. It's not like differentials or anything.
It's like the entire folder structure of Changelog Nightly, which is all static files, right? Every night we add two static files and send an email, and then we back it up. And so that's just been happening for years.
Wow.
And so I forgot about it.
So yeah,
this can like,
okay,
so we'll fix this.
Just keep the most recent one.
Just keep one.
In fact,
tonight we'll have a new one so you can delete them all.
We'll create a new one tonight.
Cool.
What do we do about tomorrow though?
Well,
we could make it run less often.
I think that would...
I think it could run, like, weekly.
No,
no,
hang on.
I think that's fine. I think that's fine.
I think that's fine.
What we can do is like set some sort of an expiration
or like auto purging on the objects.
Oh yeah, let's do that.
Okay, that's the best idea.
Good, okay, cool.
So we'll fix that as well.
Great.
Keep the last 10.
Great.
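For reference, a sketch of that kind of auto-expiry using the S3-compatible lifecycle API; the prefix and retention window are assumptions, and R2 and B2 each have their own equivalent:

```json
{
  "Rules": [
    {
      "ID": "expire-old-nightly-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "nightly/" },
      "Expiration": { "Days": 10 }
    }
  ]
}
```

It would be applied with something like: aws s3api put-bucket-lifecycle-configuration --bucket <backup-bucket> --lifecycle-configuration file://lifecycle.json. Since every nightly tarball contains the full site anyway, expiring objects older than ten days is roughly the same as keeping the last ten.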
I can't believe the nightly folder structure
of just HTML files is 76 megabytes of HTML.
It seems like a lot.
Well, maybe worth something.
It's a tar, so there's archiving,
but no compression of any kind.
I'm wondering if we can make it smaller.
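If it really is a plain tar, compressing on the way out would be an easy win; a sketch, with a hypothetical path:

```shell
# gzip while creating the archive instead of shipping a raw .tar;
# 76 MB of mostly-HTML should shrink considerably.
tar -czf "nightly-$(date +%F).tar.gz" /path/to/changelog-nightly/site
```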
Where does Nightly run, by the way?
It's a production Kubernetes cluster in my closet.
In your closet, right?
It's on an old DigitalOcean droplet.
I'd actually like to get rid of that, but you know.
Don't say where it is
because I'm sure it's like so unpatched, I think.
It's like a honeypot to this point.
Like there's so many,
the exploits have exploits to this point.
Yeah, thankfully it's just straight up static files
served by I think Nginx, but...
No SSH.
Oh yeah, no SSH.
No SSH, good. No FTP, nothing.
No, can't connect to it, right? Completely firewalled. It's actually air-gapped.
Yeah, I don't even know how it does what it does, because you have to walk over and plug the drive in.
For every request, we have somebody go plug it in.
Yeah. Every night we plug it in, it runs, and then we unplug it.
Should we put it on the fly?
What do you think?
We certainly could.
Honestly, ChangeLog Nightly is like it's an entire subject.
The quality has been degrading lately
because of the rise of malware authors
just attacking GitHub constantly.
And so there's a lot of malware stuff.
I'm like, the only change that I've made to Changelog Nightly
in the last couple of years is just fighting off malware.
We just don't want malware repos showing up.
And they're constantly at it; it's been cat and mouse.
I think we just shut it down.
It still provides a little bit of value for about 4,000 people.
Yeah, it really does.
And me, I still read it.
I still find cool stuff in there.
It's just harder.
You have to scan through some of the crappy stuff.
There's just some crappier repos in there
just because GitHub's so big now.
It's become a little bit rigid, because it's like an old Ruby code base, and sometimes I get Gemfile problems on my local machine. It's just like you can't run it locally. I can only run it from that DigitalOcean server. So I go in there and vim and edit stuff.
So you don't want me to see it.
That's what you're saying.
No, I don't want you to look at it.
Gerhard is not allowed anywhere near this thing.
You just flip over.
This is legacy code.
This is legacy code.
I've thought about rewriting it in Elixir
and just like bringing it in
and having like a monorepo deal.
And then we would have our,
like, and then I'm like,
why would I put any time into this?
There's so many things I can work on.
I see.
So Nightly is just kind of out there.
We could definitely put it on fly.
I think that would definitely help our security story.
But it might be tough
because it's like Ruby 2,
it's like old gems, stuff like that.
If there's a container for it,
it doesn't really matter.
It really doesn't.
That's what I'm telling you.
It's Ruby 2, it's old gems. There's no container, man. it doesn't really matter. It really doesn't. That's what I'm telling you. It's Ruby 2.
It's old gems.
There's no container, man.
This is like pre-Docker.
No, no.
I mean, there is a Ruby 2 container.
Oh, yes.
I'm saying there's no Dockerfile for Changelog Nightly, is my point.
We don't need a Dockerfile.
If there is a Docker container that we can start off, that's okay.
We can keep it exactly as it was. So I'm looking now at the official Ruby image on Docker Hub.
Ruby 2.3.3, patch 222.
2.3, yeah, there you go.
Six years ago, it exists.
We can pull it.
We can base it off on this.
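Roughly speaking, the Dockerfile being imagined here could look like the sketch below; the file layout and the entry command are assumptions, not the actual Nightly repo:

```dockerfile
# Pin the last Ruby 2.3 image published on Docker Hub; the app predates modern Ruby.
FROM ruby:2.3.3

WORKDIR /app

# Install the gems exactly as they were locked years ago.
COPY Gemfile Gemfile.lock ./
RUN bundle install

COPY . .

# Hypothetical entry point: whatever script builds and emails Nightly.
CMD ["bundle", "exec", "ruby", "nightly.rb"]
```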
I learned this with, kind of learned this with ChatGPT recently
with running, I didn't want to set up a dev environment.
I was just actually just for fun, trying to run Jekyll without having to actually install
Jekyll or anything. Because Jekyll's notoriously just kind of hard to maintain because it's Ruby
and gem files and all the reasons. And so I'm like, I want to just run the entire
thing in a Docker container, but still hit it from a typical web browser
like I would to develop a blog. And so my Jekyll blog lives in
I think a Ruby 2.7.
I don't even remember what exactly.
But it was something that was safe for ARM
because I'm on an M1 Mac and all that good stuff.
And it was like a special Dockerfile there
that I could just run and build off of.
So similar to what you're saying here,
you just kind of go back in time to a Dockerfile that was out there for Ruby 2.3.3 and call it a day.
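Presumably something along these lines; the image tag, port, and mount path are assumptions rather than Adam's exact setup:

```shell
# Run Jekyll entirely inside a container and browse it from the host on port 4000.
docker run --rm -it \
  -v "$PWD":/srv/jekyll \
  -p 4000:4000 \
  jekyll/jekyll:3.8 \
  jekyll serve --host 0.0.0.0
```

On Apple Silicon you may also need an arm64-friendly image or an explicit platform flag, which is presumably the safe-for-ARM wrinkle Adam mentions.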
That's patch 222.
We can totally do this, and it will...
Challenge accepted.
Show me nightly.
Show me yours.
That would save us
$22 a month, Gerhard, I think.
Something like that.
That's how much we spend on this
nightly server for DO.
It's about $ bucks a month.
And that's literally the only thing on there.
Yeah, you have hundreds of gigabytes of backups.
Hundreds of gigabytes of backups.
Really redundant.
But we'll fix that.
Since we're mentioning Changelog Nightly, though, and the spamminess of it, I do want to highlight a spam situation in the most recent one. But I think it's actually a student, and the person's handle on GitHub is rsriram9843. Okay. He has, or he has, they have, I'm not sure their gender: desktop tutorial, project three, project one, project four, develop. So check those out. They seem to be pretty popular, because they're in the latest Nightly's top new repositories.
There you go. You don't think
it's spam or you do think it's spam?
Well, I mean, it looks like a normal
person. Maybe they did that. I don't know.
It could be a... I don't know.
It seemed like a normal person. What would qualify as spam?
That it doesn't belong there?
Yeah, like it's a bot or it's malware.
They very well might be a bot.
I mean, in that case, if it is, don't go there.
I've just identified a bot to not check out.
Here's how far I've gotten, but I haven't pulled the trigger yet, on trying to actually have a malware slash spam detection system for Nightly that's actually good. I take a list of a bunch of good repos. Here's what we have: owner, which is like the GitHub handle; repo, which is the name of the repo, right? And then, like, the description. That's what we have. And I took like 20 good ones, like these are legit, but they're diverse, you know. Because you can put emoji in there, some people write in different languages,
et cetera.
And I pass it off to ChatGPT.
And I say,
here's an example
of 20 good projects on GitHub.
And then I pass it some bad ones.
And then I say,
is this one good or bad?
And it's about 60% accurate.
Really?
It's slightly better than a coin toss.
And I thought, well, that's not good enough
because I can't, I mean, this is all automated.
I'm not going to act on 60% confidence,
you know, or 60% accuracy.
I can't just be like, nah, not good.
I think you'd have to fine tune.
It gets above my pay grade of being like,
okay, let's take Llama and fine-tune it.
I would love for somebody who's interested
in such things to try it.
For now, I'm doing a bunch of fuzzy matching
on just common things that spammers do in their names.
There's duplication, there's these words, there's leak codes, and it's inevitably cat and mouse.
But I would love, I think you have to almost go to a GPT
to actually have a decent system.
That's as far as I got.
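For anyone who wants to pick this up, here is a rough sketch of the few-shot idea Jared describes, written in Elixir with the Req HTTP client against the OpenAI chat completions endpoint; the example rows, model name, and prompt are assumptions, not the actual Nightly code (which is Ruby):

```elixir
defmodule Nightly.SpamCheck do
  @moduledoc "Few-shot 'legit repo or spam?' check. A sketch, not production code."

  @endpoint "https://api.openai.com/v1/chat/completions"

  # Hypothetical few-shot examples; the real list would be the ~20 curated ones.
  @good ~s(owner: phoenixframework, repo: phoenix, desc: "Peace of mind from prototype to production")
  @bad ~s(owner: xx-leak-xx, repo: leaked-codes-2023, desc: "FREE leaked codes, download now!!!")

  def spam?(owner, repo, description) do
    prompt = """
    Examples of legitimate GitHub repositories:
    #{@good}

    Examples of spam or malware repositories:
    #{@bad}

    Is the following repository legitimate or spam? Answer with exactly one word: GOOD or SPAM.
    owner: #{owner}, repo: #{repo}, desc: "#{description}"
    """

    resp =
      Req.post!(@endpoint,
        auth: {:bearer, System.fetch_env!("OPENAI_API_KEY")},
        json: %{model: "gpt-4o-mini", messages: [%{role: "user", content: prompt}]}
      )

    # Treat anything containing "SPAM" in the one-word answer as spam.
    answer = get_in(resp.body, ["choices", Access.at(0), "message", "content"]) || ""
    String.contains?(String.upcase(answer), "SPAM")
  end
end
```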
And I thought, well, not only is this not accurate enough
with my current implementation,
I'm on an old, rigid Ruby 2 code base that I can't really,
what am I going to do, pull in the OpenAI gem?
I'm never going to be able to get modern tooling into this system
until Gerhard saves us with a Dockerfile
or whatever he's going to do.
A Dagger pipeline, but yes, close enough.
Yeah, sorry, wrong company.
I'll daggerize it.
That's what's going to happen.
We need to daggerize this sucker.
That will be Kaizen, slightly better.
That'll be the next one, cool.
So the last thing which I want to mention
before we start thinking about wrapping up
and thinking about the next Kaizen
is to mention that now we have status.changelog.com.
Oh, yeah.
Yeah, that's another thing that happened.
So when we are down, hopefully never.
We've got 100% uptime on changelog.com.
Now the checks, they don't run every 30 seconds.
We are still on the free tier.
This is Better Stack.
And I think the checks are like every three minutes.
So if there's downtime which is less than three minutes, it won't even be picked up by this system. However, if there is an incident, we will be communicating it via status.changelog.com. So if changelog was to be down, again, not going to happen on my watch, but you know, it has happened, like, many years ago, and it wasn't us, it was Fastly. Remember that episode? I forget which one it was.
Yes.
But the BBC was down too. So again, after I say this, boom, everything crashes and burns. No, not gonna happen. I'm not gonna even tempt it. But yeah, that's, I think, the one thing which I wanted to mention. We have a status page.
Very cool. And for those of us on my side of the pond, you go to status.changelog.com. If you're in the UK, you go to status.
Yes, status. That's it.
Both will get you there, just depends on how you like to say it. S-T-A-T-U-S.
Like POTUS.
We can agree on that. Like POTUS, you'll say. You've got the U.S. in there.
I like it. I appreciate it.
So what are we thinking for the next Kaizen? What would we like to see?
Oh my goodness.
I would like to see ChangeLog Nightly upgraded in the ways that we just discussed
off of DigitalOcean specifically.
I would like to see...
Clustering working?
Clustering.
I think we need to get clustering working
so we can use Phoenix PubSub.
I think we have to do Elixir releases to do that.
I was reading about it a little bit.
That's there.
So that's when I stopped and was like, releases.
This is outside of my wheelhouse.
So I looked into that, by the way,
but then I decided to leave them out of scope for the migration
that was like, I think, for the previous Kaizen.
But there is like some code in our pipelines to do that.
Okay.
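For reference, clustering on Fly plus Phoenix.PubSub usually looks something like the sketch below, using libcluster's DNS polling over Fly's private network; the module names are placeholders, and as noted it assumes the app runs as an Elixir release with the node name derived from the Fly private IP:

```elixir
# Somewhere in the application's start/2 (module and supervisor names are hypothetical):
topologies = [
  fly6pn: [
    strategy: Cluster.Strategy.DNSPoll,
    config: [
      polling_interval: 5_000,
      query: "#{System.get_env("FLY_APP_NAME")}.internal",
      node_basename: System.get_env("FLY_APP_NAME")
    ]
  ]
]

children = [
  # Discovers the other Fly machines via internal DNS and connects the nodes.
  {Cluster.Supervisor, [topologies, [name: Changelog.ClusterSupervisor]]},
  # Once nodes are connected, PubSub messages fan out across the cluster.
  {Phoenix.PubSub, name: Changelog.PubSub}
  # ...the rest of the supervision tree
]
```

On the release side, that typically means exporting RELEASE_DISTRIBUTION=name and RELEASE_NODE set to the app name at the Fly private IP in rel/env.sh.eex, so the nodes can actually see and name each other.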
I would like to see Oban web installed
so we can have that observability.
Top of my list.
That one should be easy enough.
Adam was mentioning Middleware.io, trying it out maybe.
What's Middleware.io?
Did I mention that?
I did.
Oh yeah.
AI powered cloud observability platform.
Oh, shiny.
That's a nice headline.
I do like that.
It gets me in there because it's AI-powered.
Right.
That's how you raise money today is your AI-powered stuff.
And it's also cloud observability.
It's also a platform.
It has all the buzzwords.
Was this generated, by any chance? It's a real website?
Yes.
They reached out.
You know, I asked you if you saw it.
So usually we get lots of inbound requests from people.
Some are legitimate.
Some are whatever.
But my smell test is, Gerhard, did you hear this?
Would you try this out?
Would you want to try it out?
And I don't think I've spoken to them yet, but we do have something in the works to get connected. So I will escalate that up my list to make sure I do so. And then I think you said you wanted to play with it, right? So we can probably get an account, to see if you like it and go from there, kind of thing.
Cool. That one was kind of the wild card.
Yeah, there you go.
What about something that gives us more than two SLOs? I mean, that's something which we didn't talk about.
I mean.
Yeah, we didn't talk about that, but come on, Honeycomb.
What's the deal with that?
Two?
I know.
I will tell you, here's what's happening.
While we were talking on this podcast, I was emailing Christine Yen because she's going to come on a future episode of Founders Talk.
And I like her.
I like the whole team there.
And I think they do amazing work. And obviously we reference and leverage Honeycomb as critical infrastructure. Like, I don't think we could do what we do, quite the way we do it. The listeners didn't get to see you share your screen, but Jared and I did. They'll hear what you said about what was on your screen and they'll follow along, hopefully. But we were, like, knee-deep into layers and layers of observability data that's inside Honeycomb, and we don't have to, like, program or, what do you call it, instrument these things to do it. It just captures it and we just ask the questions. Obviously, I think it has a length of time of logs it can go through, right? Like there's six weeks or eight weeks or a couple months.
I'm not sure what the...
Yeah, it's two months, 60 days.
Two months, 60 days.
Traces and everything.
Enough for us.
Maybe we can get more.
I don't know.
We're hitting the limit, by the way.
We have 100 million events per month
and we're exhausting that
because we're sending all the traces.
Yes, we're getting emails about it.
They keep telling us like,
hey, you've gone over X again this month.
Right.
Threatening.
And by the way, we are paying for it.
We are paying for it.
Yeah, we are paying for it
because we haven't made this connection yet.
So my hope is,
and Christine may be listening to this right now
because I sent in the email,
hey, Christine,
literally we're talking about Honeycomb
as I type this
because we're on the podcast.
We're talking about you right now.
And it goes out this Friday
and here's an echo
because I'm now talking to her
and everybody else
in this very moment here
and just suggesting like,
hey, we're big fans of Honeycomb.
We want to partner with them.
We want to find ways
to speak more about them,
but more importantly, improve.
Like two SLOs on the free plan, man.
I'm curious,
why is that limit there?
It's the pro plan.
It's the paid plan.
It's the pro plan.
The free one,
you don't get any.
Gosh. There you go. So if you're paying for the pro plan,
you should get more than two SLOs. And if you don't, why? What's the cost to enable an SLO?
Well, here's a quick question before we go. There are also now triggers. And I was in there poking
around and I see the SLOs and I see the triggers and triggers seem to be
based on similar things that SLOs are based on. It's like, if this happens, trigger. Do these
work together? Are they separate features, Gerhard? Do you understand triggers better than I do
inside of Honeycomb? So triggers is almost like alarms, right? So it's like an alert.
Right. But isn't SLO also like an alarm? Like, hey, you haven't reached your objective.
Kind of, but it gives you like the perspective
of like the last 30 days, right?
So when you click on one.
Does it email you?
Yes, I do get emails and you can.
This one says triggered right there.
It says it's been triggered.
I mean, this basically gives you almost like a graph
and you can do like comparisons,
like to start understanding like,
when does this SLO fail?
And by the way, some of these things aren't that helpful.
And again, that's like to Adam's point,
there's like more to discuss about this.
But what's important,
we have a budget and it tracks the budget
and we see whereabouts we are.
A trigger will not have that.
A trigger will say,
hey, this thing just happened.
So an SLO, I think it goes further.
You have obviously an SLI
and it keeps track of that.
And then you receive emails
when you're just about like 24 hours from exhausting your budget. And that makes it really helpful.
Right. Okay. Fair enough. They're deeper. There's more things to track.
Seems a bit redundant to me, but I can see how you might just have some one-off triggers that
don't need to be like full on SLOs. I wonder if we could use those to get around some of our two
SLO maximum, maybe.
So it's almost like when something is slow. But again, can it take that into account? Maybe it can, maybe we just need to write a query that takes it into account. But then, apart from the dashboard view and the comparison view,
there must be something else about SLOs as well.
I mean, why not just call them the same thing if it's just that?
Because I think SLO is like buzzword compatible at this point.
Like it sounds like a thing that you could charge money for.
I see.
Query runs every 15 minutes.
So maybe.
Anyways, let's look into triggers a little bit.
But yeah, we definitely want to get some more SLOs.
Yeah, more SLOs.
And we spell it M-O-A-R.
Because Gerhard says, look, you should have two of everything.
Except for wives and SLOs.
You should have more.
Less than two wives.
Less than two wives.
Definitely.
And more than two.
Yes.
SLOs.
Absolutely.
Absolutely.
Two of everything else.
Right.
Right.
So, hi, Christine, if you're listening.
Can we talk?
Stoked.
Love, Honeycomb.
More SLOs. More SLOs.
More SLOs, please.
Yeah.
This has been a fun Kaizen, though.
I mean, I think, you know... I've been quiet quite a bit during this show, because y'all do the work and I just get to pundit as necessary.
It's great to see all this work done. I mean, it's great to see us now improving, yes, but I think paying attention to how we spend money
with S3 and making changes
and leveraging other
players in the space. Mad respect
for Cloudflare. We love to find
ways to work with them in any way,
shape, or form. And the same with
BetterStack. I think the status page is something we
haven't really looked further into
with working with them. But part of
this journey with Kaizen is improving, but also finding the right tools out there that we like, that we can trust in terms of, you know, who's behind the business, and the way they treat the community, and the way they frame and build their products. Finding those folks out there that we can work with ourselves and leverage, but then also promote to our listener base, saying, hey, these are things that we're using, in these ways. And all of our code is open source on GitHub, where you can see these integrations.
I think it's beautiful, right?
Like to have an open source code base
and like to integrate with Dagger since, you know,
0.1 or whatever the release was initially
when you first got us on there.
And then having that conversation with Solomon on the Changelog
and kind of going into all that.
And like all this stuff is out there in the open.
And we just invite everybody listening to the show
to just follow along as you'd like to,
to see where we go and then how it works when we put it into place.
So that's kind of fun.
I like doing that with you all.
It's a lot of fun.
Yeah, same here.
I mean, this really is unique.
I mean, to be able to discuss so openly and to share the code,
like, we're not just talking about ideas or, like, what we did. This is like a summary, and hey, by the way, there's a changelog, there is a GitHub repo. You can go and check all these things out, and if there's something that you like, use it, try it out, and let us know how it works for you. So yes, we're doing it for us, of course, but also a lot of effort goes in to share this. So it's easy to understand.
It's easy to try using it and try it out
and see if it works for you.
And we're open about the things that didn't work out
because a bunch of things didn't.
Right, precisely.
To close the loop on the invitation,
I would say if you made it this far
and you haven't gone here to this particular webpage yet
and joined the community, you should do so now
because we are just as open and welcoming in Slack in person as we can be.
Go to changelog.com slash community.
Free to join.
Love to talk to you in there.
Lots of people in Slack.
It's not overly busy, but it's definitely active.
And there's a place for you there.
So if you don't have a home or a place to put your hat
or hang your coat or your scarf
or whatever you might be wearing
or take your shoes off and hang out for a bit,
that is a place for you to join.
You're invited and everyone's welcome
no matter where you're at on your journey.
So hope to see you there.
What else is left?
What can we look forward to?
One last thing.
If you join the Dev Channel in Slack,
please don't archive it.
What the heck?
I just noticed that.
Obed Frimpong joined and he archived.
And then Maros Kuchera joined and archived.
So that just messes up our clients.
So please don't do that.
Don't archive channels.
I don't know why people can do that.
I mean, maybe there's some fix that we should do.
Yeah, maybe.
You'd think that would be a setting, like, no.
That's the limit of our invitation, okay?
We are very open and very inviting
until you archive our channels.
And then we don't want it to happen.
So don't do that.
That's like coming into our house and being like,
oh, I threw away your kitchen table.
I hope you didn't need that.
Yeah.
I got rid of that.
Neighbor needed a table.
Yeah.
Play nice.
Be nice. That's right. That's right. Otherwise never needed a table. Yeah. Play nice. Be nice.
Play nice.
That's right.
That's right.
Otherwise, welcome.
Otherwise, welcome.
All right.
Kaizen.
Looking for us with the next one.
Kaizen.
Kaizen.
Always.
This Changelog & Friends features a Changelog Plus Plus bonus.
That's what the crowd wants.
Gerhard's boss, Solomon Hykes from Dagger, was on The Changelog a few weeks back and
we just had to get Gerhard's review of that episode.
I didn't listen to it.
What?
Come on.
I just wanted to see the reaction.
How rude.
Actually, he listened to it twice and he has opinions, of course.
If you aren't on the plus plus bandwagon yet, now's a great time to sign up.
Directly support our work, ditch the ads from all of our pods,
and get in on fun bonuses like the one that Plus Plus subscribers are about to hear.
Check it out at changelog.com slash plus plus.
Thanks again to our partners, Fastly.com, Fly.io, and Typesense.org.
And to our beat freak in residence,
the mysterious Breakmaster Cylinder.
Next week on The Changelog,
news on Monday,
Debian's 30th birthday party on Wednesday,
and Justin Searls right here on Changelog and Friends on Friday.
Have a great weekend,
and we'll talk to you again real soon.