The Changelog: Software Development, Open Source - Kaizen! The day half the internet went down (Interview)
Episode Date: August 5, 2021. This week we're sharing a special episode of our new podcast called Ship It. This episode is our Kaizen-style episode where we point our lens inward to Changelog.com to see what we should improve next... The plan is to do this episode style every 10 episodes. Gerhard, Adam, and Jerod talk about the things that we want to improve in our setup over the next few months. We talk about how the June Fastly outage affected changelog.com, how we responded that day, and what we could do better. We discuss multi-cloud, multi-CDN, and the next sensible and obvious improvements for our app.
Transcript
This week on The Changelog, we're sharing a special episode of our new podcast called Ship It.
Subscribe at changelog.com slash ship it.
This is our Kaizen style episode where we point our lens inward to changelog.com to see what we should improve next.
The plan is to do this episode style every 10 episodes.
Ship It launched back in May and now has 13 episodes in the feed to enjoy.
You'll hear stories from Tom Wilkie on Grafana's big tent idea,
Charity Majors on Honeycomb's secret to high-performing teams,
Dave Farley on the foundations of continuous delivery,
and coming soon you'll hear from Uma and Karthik from Chaos Native
on resiliency being born from chaos,
Justin Searls on software testing and automation,
Gerhard is bringing it with this show.
It is awesome.
You should subscribe.
Check it out at changelog.com slash ship it
and anywhere you listen to podcasts.
Big thanks to our partners,
Linode, Fastly, and LaunchDarkly.
We love Linode.
They keep it fast and simple.
Get $100 in credit at linode.com slash changelog.
Our bandwidth is provided by Fastly.
Learn more at fastly.com
and get your feature flags,
powered by LaunchDarkly.
Get a demo at launchdarkly.com.
This episode is brought to you by Influx Data, the makers of InfluxDB,
a time-series platform for building and operating time-series applications.
In this segment, Marian Bija from NodeSource shares how InfluxDB plays a critical role in delivering the core value of the APM tool they have called NSolid.
It's built specifically for Node.js apps to collect data from the application and stack in real time.
At NodeSource, we wanted to lean into a time-series database, and InfluxDB quickly rose to the top. One of the unique value propositions of NSolid is real-time data.
And there is a lot of APM tools out there, but there is a variance in terms of how available the data is.
It's not really real-time.
There is actually a staging period to the data.
And InfluxDB is magical and allows us to deliver on our unique value proposition of real-time data with NSolid. To get started, head to influxdata.com slash changelog.
Again, that's influxdata.com slash changelog.
We are going to ship it in 3, 2, 1.
So I really wanted to talk to you about this topic of Kaizen. So Kaizen, for those hearing it for the first time, is the concept of the art of self-improving, specifically. And that is really powerful, because it's the best way that you have to improve yourself, and to always think about, how can I do this better? It all starts with, how can I do this better?
So with that in mind, what I wanted us to do every 10th episode was to reflect on what we can improve
for the application, for our setup,
but also the show.
Because isn't that the best way of improving?
I think it is.
Kaizen, I love it.
Always be improving, ABI.
ABI, yeah.
Always be something.
ABS, always be something, you know.
I'm pretty sure that means something else for others, ABS.
But yes, always be something.
Automatic system.
That's what it refers to for me.
The reason why I care so much about this is that having been part of Pivotal,
it's a company which isn't anymore.
It was acquired by VMware a year or two ago, whatever,
is that one of the core principles was to always be improving.
Be kind was there as well.
But always be improving was something that was embodied in the retrospectives that we used to
have every single week at the end of the week. And this was good because what worked well? What
didn't work so well? Anything that people want to discuss? And that made sure that everybody was in
sync with the problems, but also the wins. I think that's important.
So having done it for five, six, seven years, it's so deeply ingrained in me. I cannot not do it.
It's part of me. And I do it continuously. And I think the infrastructure setup that we've been
rolling for some number of years has been an embodiment of that. Every year it has been improving. It was rooted in this principle.
Now, one thing that we did in the past differently
is that we improved,
or at least we share those improvements once per year.
It was a yearly thing.
And one of the ideas for the show was to do it more often,
to improve more often.
So we can improve and take smaller steps,
but also figure things out a lot, lot quicker.
What works, what doesn't work rather than once a year.
It works out to about every two, two and a half, every two-ish months, essentially, we get a response, a blip, a feedback loop.
Whereas before it was like once and then more recently twice in the year.
If it's every 10 shows Kaizen, every 10 shows, then we get around, you know, four or five-ish per year
if you're shooting for 50 shows a year.
So I think in May, beginning of May,
end of April, beginning of May,
we switched on the 2021 setup
and we had a show, we had the intro,
we did a couple of things.
Episodes, do you still remember
which episode that was from Changelog, Adam?
4.3.
No, but I have the internet
and I will look it up.
So give me a moment while I look it up.
That is a good one.
so that was
meant to be part
of the ship it
but then some timelines
got moved around
and then that
went on changelog
and then the ship it
we did the intro to the show
so that's
that's how that happened
that was an interesting
maneuver
last minute maneuver
from us too
which I'm not sure
really matters to the
listeners but I think
it was kind of
we had a plan
and then at the last minute,
we changed the first 10 yards of running down the field, so to speak.
That was episode 441 on The Changelog's feed.
So changelog.com slash 441 will get you there.
Inside 2021's infrastructure for changelog.com,
which is like a throwback to the year prior,
inside 2020's infrastructure for changelog.com.
So we've been doing that every year now for the past couple of years.
I think that change made a lot of sense.
And that change just led to a couple of other things.
And now we're finally in the point to talk about the next improvement.
So you don't have to wait another year.
Not only that, we're doing things slightly differently.
We're going to share the
things that we're thinking about improving, maybe why we're thinking about improving them,
so that maybe you have better ideas. Maybe you know about things that we don't,
that you would like us to try out, maybe part of the same thing. So Fastly, I would like to
mention that because Fastly, our partner, amazing CDN, had an outage a couple of weeks back.
Unexpected, of course.
Right after you said 100% uptime.
Exactly.
It was like a week after, wasn't it?
That show shipped, and the very next week, Fastly outage.
It was a global outage too.
It was global.
Half the internet broke.
It was the biggest Fastly outage that I am aware of.
So what that made me realize is that Fastly is great when it works.
And when it doesn't, it doesn't affect just us.
It affects everybody.
Everybody.
BBC was down.
That's a big one.
BBC being down.
Emojis were down.
On the whole internet.
That was unexpected.
Wait, wait, wait.
Tell me more.
How were emojis down for the whole internet?
Does it make sense?
Well, apparently the assets that were served by AWS had something to do with it.
I don't know exactly which capacity, but AWS was serving certain emoji assets.
And Fastly was part of that.
And emojis stopped working for Slack.
So I think in the Slack setup somewhere,
I mean, everybody uses Slack, right?
To communicate these days because everybody's at home these days
or most of us are at home these days.
So you couldn't use emojis in Slack anymore.
They stopped working.
That makes more sense than emojis just stopping to work globally across the entire, you know, world of devices. But yeah, it's sensational. The news has to be sensational.
Well, most importantly, we were down. So, most importantly to us. BBC being down, tragic, terrible for lots of people, but for us specifically, we were down.
And that was the worst part about it, wasn't it?
For us, yes.
And for all the listeners.
Right.
and interestingly
during this time
our origin
the back end to Fastly
was up
it didn't have an issue
so this month
I got the report
we were down for 21 minutes
because of that
so 99.96% uptime.
So you had a cutover though.
You turned off Fastly basically, right?
Yes.
Jumped in, switched Fastly out, basically rerouted traffic.
So a DNS update, and changelog.com would start resolving directly to the Linode host, to the Linode NodeBalancer, and, uh, Fastly was basically taken out of the picture. But because of how DNS is cached, it took a couple more minutes to propagate. But, uh, that was it, and the CDN was rerouted around.
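For anyone curious what that cutover looks like mechanically, here is a rough sketch of the idea: repoint the apex record at the origin's load balancer via the DNSimple API. The account ID, record ID, IP and even the exact endpoint shape are illustrative assumptions, not the real configuration.

```python
# Hypothetical sketch of the cutover: repoint the apex record at the Linode
# NodeBalancer via the DNSimple v2 API. Account ID, record ID and IP are
# placeholders, and the endpoint shape should be checked against the docs.
import os
import requests

DNSIMPLE_TOKEN = os.environ["DNSIMPLE_TOKEN"]
ACCOUNT_ID = "12345"                  # placeholder account
ZONE = "changelog.com"
RECORD_ID = "67890"                   # placeholder apex A record
NODEBALANCER_IP = "203.0.113.10"      # placeholder NodeBalancer IP

resp = requests.patch(
    f"https://api.dnsimple.com/v2/{ACCOUNT_ID}/zones/{ZONE}/records/{RECORD_ID}",
    headers={"Authorization": f"Bearer {DNSIMPLE_TOKEN}"},
    json={"content": NODEBALANCER_IP, "ttl": 60},  # short TTL to limit caching
    timeout=10,
)
resp.raise_for_status()
print("apex now resolves to the origin; Fastly is out of the request path")
```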
I was basically chilling. It was like a day off. It was a great one. I was, like, in the garden. It was sunny. It was perfect.
Yeah, chilling. There's nothing you can do, right?
Yeah, as you do. Exactly. Then the phone started, like, going off like crazy.
Yeah.
So what I got, because we have, like, multiple systems, right? When something is down, you really want to know about it. So I got, like, texts, I got SMS messages, I got Pingdom alerts, I got... oh, what I didn't get is Telegram notifications, because guess who else was down? Grafana Cloud.
No! You didn't let me guess. I was gonna guess it. I thought you were saying you had the day off because of all the downtime, nothing to do.
Yes, Grafana... sorry, Adam, what were you saying?
I was saying I thought you said you were taking the day off because you had nothing to do, because the internet was down, essentially.
That's what I thought you were saying.
I was just chilling. It was like a gorgeous day, sunny. It was like a day off. I was, like, sunbathing.
I will go into more details with that.
let me say two things
first of all thanks for springing into action
and bringing us back up
21 minutes, nothing wrong with that compared to the BBC. Those suckers, they were down for much longer. But the bummer side, so let me tell you the bummer side, which I haven't told you before: what you did is you cut Fastly out and you put Linode directly in, right? And so all of our traffic was served from Linode during that time. Well, it just so happened to be timed directly when we shipped our episode of The Changelog with Ryan Dahl.
And because we do all of our analytics through our Fastly logs and we served all of that traffic directly from Linode, we have no idea how popular that episode is.
In fact, it looks like it's not in our admin. It looks like it's not a very good episode of the changelog, but I'm quite sure it's
pretty popular. So I was, I was bummed. I was like, Oh no, we missed out on the stats for the
show, which is one of our bigger shows of the year. But I'd rather have that happen and let
people listen to it than have it be down and you know, nobody gets to listen to it. So that was a
bummer, but pick your poison, I guess, or the better of two evils.
Yeah, I remember that, actually. I remember that because I remember looking at the stats, and the stats were, like, down. Yeah. And I was thinking, I want to talk to Jared about this. So if there's one lesson to learn from this, we need to double up. So everything that we do, we need to do two of that thing. Like monitoring, we have two
monitoring systems. And then because sometimes Grafana Cloud has an issue. And then we want to
still know, and this is when I say Grafana Cloud, I mean the black box, all the exporters. And there
was like a recent one as well, when they push updates, sometimes things are offline for a few
minutes. And it makes you think that a website is offline, but it's not. Or when it is offline,
you don't get anything.
So we use Pingdom as a backup.
And that helps.
So stats, I think it's great to have stats from Fastly,
but I don't think we can rely
only on those stats.
I think we need more.
Well, it's one of those
ROI kind of conversations.
And I think this is a good conversation
for ShipIt.
Like what's worth doing?
And the fact is that
in our five years of being on Fastly, this is the first conversation for ship it like what's worth doing and the fact is is that in our five years of being on fastly this is the first incident they've had and if it didn't happen
to be right when we released an a popular episode of the changelog like it was just like a saturday
and we missed some downloads i wouldn't care all that much you know and at the end of the day i
know that show's popular so i still don't it's not really changing my life. I just know it was popular because people reacted that way
versus like looking at the download stats. So the question becomes like, what does it take
to get that redundancy, right? What does, what does that redundancy cost and what does it gain?
Yeah. And in the case of stats, I'm not sure if, you know, what side of the teeter-totter we
actually end up on because the way it works now is Fastly streams the logs of all of the requests
to the MP3 files over to S3. And then we take those logs, which are formatted in a specific way,
parse them, and then bring them locally into our database. And it's reproducible in that way off of S3.
So we can just suck down the same logs from S3 whenever we want, reparse them, read down,
you know, recalculate.
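As a rough sketch of that stats job (the bucket, prefix and log line layout below are made up for illustration; the real format is whatever the Fastly logging config emits):

```python
# Sketch of the stats job: pull the Fastly request logs that were streamed
# to S3, parse each line, and count MP3 downloads. Bucket, prefix and the
# log line layout are illustrative placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "changelog-fastly-logs"   # placeholder bucket
PREFIX = "mp3/2021/06/08/"         # placeholder prefix, one day of logs

downloads = {}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        for raw in body.iter_lines():
            # assumed layout: "<timestamp> <client_ip> <status> <path> <bytes>"
            parts = raw.decode("utf-8").split()
            if len(parts) < 5:
                continue
            _, _, status, path, _ = parts[:5]
            if status == "200" and path.endswith(".mp3"):
                downloads[path] = downloads.get(path, 0) + 1

for path, count in sorted(downloads.items(), key=lambda kv: -kv[1]):
    print(f"{count:8d}  {path}")
```

Because the raw logs stay in S3, the same job can be re-run any time to recompute the numbers, which is the reproducibility being described here.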
But what would it take to get Linodes doing the same thing or changing the way we do our
stats so that we're either redundant or do it differently?
I don't know the answer to that off the top of my head.
In the case of something like Grafana, though, I would put that back on them.
Like we shouldn't have two Grafanas.
Like Grafana, I think this is probably the case for multi-cloud, right?
Wouldn't it make sense then to be, let's say, on GCP, Azure, or essentially multi-cloud?
And maybe that's an issue with cloud at large.
Maybe it's like, well, the cloud has to be multi-cloud so that if part of their cloud goes down, then there's still some sort of like redundancy in them.
I would rather them do that kind of stuff than us have to, you know, have essentially two Grafanas or Linode and Fastly and like deal with that.
Maybe that's just more and maybe that's the unique scenario where it's like we do have to deal with that M plus whatever. But I would say on a service level, push that onto the service to be smarter about the way they roll out their own cloud and their potential downtime, what that means for internet at large.
Now, obviously, as you would expect, I think about this differently.
Please tell us.
Please tell us. Please tell. The way I think about this is that we are in a unique
position to try out all these providers. So we have the know-how and really our integrations
are fairly simple. So I know that it wouldn't take that much more to integrate Cloudflare.
So how about you use Cloudflare and Fastly,
the two biggest CDN providers, at the same time? What if, for example, we decouple assets from
local storage, we store them in an S3 object store, we, for a database, we use maybe CockroachDB,
a hosted one, and then the database is global. And then we are running changelog, one instance on Linode,
one instance on Render,
one instance on Fly.
And then we use different types of services,
not just Kubernetes.
We try a platform because we try it out.
And at the same time, we are fully redundant.
Now, the pipeline that orchestrates all of that
will be interesting,
but this is not something that's going to happen
even like in a year.
It's like slowly, gradually.
It's maybe a direction that we choose to go towards.
And maybe we realize, you know what?
Actually, in practice, Cloudflare and Fastly,
it's just too complicated.
Because only once you start implementing,
you realize just how difficult it is.
Yeah, that's the cost that Jerod was talking about. How much does the redundancy cost and how
How much does the redundancy cost and how
much does it gain you? So from a CDN perspective, you just basically have multiple DNS entries,
right? You point both Fastly and, what do you call it, Cloudflare to the same origin or origins
in this case. Let's just start with the one origin. The configuration is maybe slightly
different, but we don't have too many rules in Fastly. How do they map to Cloudflare?
I don't know.
But again, there's not that much stuff.
I think the biggest problem is around stats, right?
We keep hitting that.
Yes.
And I looked at Cloudflare, it's probably two years ago now, with regards to serving
our MP3s.
And where I ran into problems was their visibility into the logs and getting that information out. It paled in comparison to what Fastly provides. And so we would lose a lot of fidelity in those logs, like with regard to IP addresses. Fastly will actually resolve, you know, with their own MaxMind database or whatever their GeoIP database is, the state and the country of the request, stuff that we don't have to do ourselves. And Cloudflare, at least at the time, now this is a couple of years ago, just didn't provide any sort of that visibility. And so it was like, I would lose a lot of what I have in my stats using Cloudflare. And if I was going to go multi-CDN, which is kind of like multi-cloud, I would have to go lowest common denominator with my analytics
in order to do that.
And so it really didn't seem worth it at the time.
But maybe it's different now.
Yeah, if they've improved their logs,
then it's back on the table, let's say.
So that's maybe a long-term direction.
What's some stuff that is more immediate
that you have on the hit list,
things that we should be doing with the platform?
Yeah.
I think multi-CDN makes sense to me.
Because just for those reasons.
If you've got one that goes down,
then you've got another resolver.
But once in five years, how often is Fastly down?
Okay, I'm thinking about this
from the perspective of the experience
and sharing these things like right a few years back we were missing this but we don't know what
they have or don't have this year or maybe what are missing maybe they don't even know what we
would like for them to have and listeners to the show of the show they can think you know what
this show is really interesting because they
are using multi-cloud and these are all the struggles that they have. So maybe we can learn
from them and not do some of these mistakes ourselves. So in a way, we're just producing
good content that is very relevant to us. So we say, you know what? We are informed and we have
made an informed decision to not use Cloudflare because of these reasons, which may or may not
apply to you, by the way.
It's like there's a brand new hammer, you know, and we grab hold of it and everyone gathers around. We put our hand out and we strike it right on our thumb, and then everybody knows that hammer really hurts when you strike it on your thumb.
I'm glad those guys did it. I've learned something.
Instead, yeah. And I don't have to do that myself.
I think that's a very interesting perspective,
but I don't see it that way.
Okay.
It's an amazing analogy,
but I'm not sure that applies here.
But yeah, it's great fun.
That's for sure.
Okay, good.
This episode is brought to you by our friends at LaunchDarkly. If a feature isn't ready to be released to users, wrapping code with feature flags gives you the safety to test new features and infrastructure in your production environments without impacting the wrong end users.
When you're ready to release more widely, update the flag status and the changes are made instantaneously by the real-time streaming architecture.
Eliminate risk, deliver value, get started for free today at LaunchDarkly.com.
Again, launchdarkly.com.
So you're asking, Jared, what is next on our hill?
One of the things I learned from the Fastly incident is that we don't have anything to manage incidents. When something is down, how do we let users know
what is going on? How do we learn from it in a way that we can capture and then share
amongst ourselves and also others? A document is great. Slack, just to write some messages is great,
but it feels very ad hoc. So one of the things that I would really, really like is a way to manage
these types of incidents.
And guess what?
There's another incident
that we have right now.
Right now?
Right now, right now.
Like the website's down right now?
No.
The incident,
this is a small incident.
No, the website is 100% up.
100% of time.
Thank you.
Yeah.
So fastly, it's your responsibility to keep it up, right?
That's what it boils down to.
It's someone else's problem.
It's Fastly's problem.
That's right.
Pass the buck.
Right.
So right now, one of the DNSimple tokens that we use to renew certificates has been deleted.
So it's either Adam or Jared, because I haven't.
Wasn't me.
Anyways, the point is, I'm not pointing any fingers.
I don't touch DNS.
So in the account of the DNS...
It's looking like maybe it was me, but I didn't touch anything.
So I don't know what's going on.
It could be worse than we said.
It could be a bit flimsy.
So we had two DNS tokens.
One was for the old setup and one was for the new setup. The one for the old setup,
I have deleted because we just didn't need it. And then we had three DNS tokens left. One of them
disappeared, is no longer there. And that was the one that was used by CertManager to renew
certificates. So certificates are now failing to renew. We passed the 30-day threshold
and we have, I think, another 25 days to renew the certificate. But because the token is not there,
the certificate will never be renewed. And then eventually the certificate will stop being valid.
This is the same one that we use in Fastly, so a lot of stuff is going to break for many people. Now, I found out about this just by looking through K9s at what is happening with the different jobs. There are jobs which are failing that are meant to renew things. It's not the best setup. So the first thing which I've done, I've set up an alert in Grafana Cloud: when the certificate expires in less than, I think, two weeks,
or actually three weeks, whatever,
some number of seconds, because that's how they count them,
I get an alert.
So it should automatically renew within 30 days.
If within 25 days it hasn't been renewed, I get an alert.
So I have 25 days to fix it, roughly.
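The alert itself lives in Grafana Cloud (the blackbox probe reports expiry as a number of seconds, hence "that's how they count them"), but the check is simple enough to sketch standalone, assuming nothing beyond the Python standard library:

```python
# Standalone sketch of the certificate check: read the cert served by the
# site and warn if renewal should already have happened. The thresholds
# mirror the ones discussed above.
import socket
import ssl
import time

HOST = "changelog.com"
ALERT_IF_FEWER_THAN_DAYS = 25   # cert-manager should renew ~30 days out

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
days_left = int((expires_at - time.time()) // 86400)

if days_left < ALERT_IF_FEWER_THAN_DAYS:
    print(f"ALERT: cert for {HOST} expires in {days_left} days and has not been renewed")
else:
    print(f"OK: {days_left} days until expiry")
```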
So what I would like to do is, first of all,
capture this problem in a way that we can refer back to it,
and also fix it in a way that we also can refer back to it,
like how did we fix it?
What went into it? What was added?
So that this doesn't happen again.
And adding that alert was one of the actions that I took even before we created an incident.
So that's one of the top things on my list.
How does that sound to you both?
Was it called an access token?
So on June 19th, they have an activity log.
This is actually kind of important for,
I think this is super important for services
that have multiple people doing things
that are important that could break things, essentially.
Have an activity log of things that happen,
deletions, additions, and DNSimple does have that,
except, to have more than 30 days of activity,
you have to upgrade to a pro plan that costs 300 bucks a year.
It's kind of pricey.
So we don't know what happened.
Well,
we do for the past 30 days.
And so on June 19th,
cause I'm the only user,
it says Adam deleted it.
So I guess I did.
It was not me.
No,
no,
that was actually me,
but that was...
So the token which I deleted was one for the old infrastructure.
There were two tokens.
I see.
Okay.
So this happened, you know, a long...
Do you know when, roughly?
June 19th sounds about right.
Can you assume at least?
June 19th sounds right.
But a single token was deleted and we had two.
Yeah.
Okay.
So it shows a single token being deleted June 19, at an abnormal time for me to do any
deletions.
I think Jared as well.
That was me.
If this is central time zone, because that's where I'm at and I'm in the site, it's 7 in
the morning.
You know, 7:16 in the morning.
I'm definitely not deleting things at that time besides Zs in my brain.
So I don't get up that early.
That's all we know.
Maybe you accidentally deleted two.
It was a two-for-one deal that morning.
It doesn't show on the activity log, though, so that's the good thing.
Right.
I would maybe push back on DNSimple support, and they can dig into it
and, one, get a true git blame on this,
and then, two, see if it was maybe just an error on the platform side
yeah I don't think I've done anything with tokens
aside from maybe one of our
github access tokens
was expiring or they made a
new one and I think I rotated one
token but nothing to do with DNS
not in the last
month or
six months it'd be cool if like certain
things like this require consensus.
You can delete it if Jared also
deletes it. Oh, it's like the nuclear codes. You gotta
have two hands on the button.
You'd have to do it at the same time, so
you could do it async by saying, okay,
Gerhard at his 7 in the morning time
frame, because he's in London,
deleted it. You get an email, Jared,
saying, Gerhard deleted this. Do you want to
also, you know, have consensus on this deletion and you have to go and also delete it too where it's like
two people coming together to agree on the deletion of an access token or it's awfully
draconian for a dns access token myself that's why i think the nuclear codes make sense you know
like you're about to send a nuclear bomb. You've got to have consent. Yeah.
I think an access log is good enough.
It would help in the DNSimple log to see which token has been deleted,
like the name of the token.
It doesn't say that.
It's not very thorough.
It just says access token delete.
That would have helped.
That's the event name.
And so some of the items in DNSimple have text associated with them,
but this does not.
It doesn't showcase the token or the first six characters or anything like that.
It's just simply the event name.
In this case, everything else is pretty thorough.
Well, I think we're ratholing on this particular incident.
But the bigger picture thing, in addition to this (we've got to figure out what happened here and fix it), is how do we handle incidents in a better way? So I think this is a place where I would love to have listeners let us know how you handle incidents. What are some good options? I know, Gerhard, you've been looking at a few platforms and solutions. Surely there's open source things. There's lots of ways that you can go about this. You could use existing tools. I mean, you set up kind of a notice for this particular thing, but that's not what you're talking about. Like, how do we track and manage incidents in, like, a historical, communicable way?
Exactly. I don't know. We don't know the best way of doing this, or a good way. So what's a good way for listeners, if they have a great incident solution, or maybe they have one that they use at work but they hate it, like, avoid this one? Is it Slack? Is it email? Is it tweets? What's the best way for listeners to feed back?
Comments on the episode page, perhaps, on the website.
Yeah, that is an excellent point. Yeah. So
however you want to communicate via Slack or, you know, even via like Twitter, we are everywhere
these days, everywhere that works and still available. Everywhere where you can get an emoji
rendered, we're there. Exactly. The idea being that, I mean, there are a couple of things here.
For example, one thing which this reminded me is that we do not declare,
and this is like a bit chicken and egg situation
where we should absolutely manage the tokens
on the DNSimple side with something like,
for example, Kubernetes, why not?
Which continuously declares those.
Now, obviously, you still need the token
that creates tokens.
But if you have that,
we should have the token that it needs to create.
Now, I think that's a bit interesting because then what do you do from the perspective of security?
It can't give itself access to everything and then delete all the DNS records.
I mean, that's not good.
So some thought needs to go there.
But the idea being is that even with Fastly, for example, when we integrate, we still have manual things, manual integrations. We don't
declare the configuration. That's something which I would like us to do more of. And maybe also have
some checks that would... I mean, if you don't have DNS or something isn't right, like in this case,
you don't have access to DNS, that's a problem. And you would like to know about it as soon as
possible. So the token being deleted
on the 19th and the failure only happening like two weeks later, almost end of June, that's not great, because it removes you from the moment that you've done something. Maybe it was me, maybe I deleted the wrong token by mistake, but I remember there were two. Who knows, maybe I saw two tokens and there was just one. And then when that happened, it makes sense, right? That two weeks later,
this thing starts failing. But because it took so long for this failure to start happening,
it was really difficult to reconcile the two and to link the two together.
Yeah. So where do those checks live in the system? Where would they live? I mean,
not in Grafana, I wouldn't think. I don't know.
I think it depends.
So in Kubernetes, right,
like you declare the state of your system,
not just the state of your system,
but the state of the systems
that the system integrates with.
So you can have like providers.
I know that Crossplane has that concept of providers.
It integrates with AWS, GCP.
I don't think that it has a DNS simple provider,
but we should have something that periodically makes sure that everything is the way it should
be. And Kubernetes has like those reconciling loops. It's central to how it operates.
So to me, that sounds like a good place. Monitoring, from a monitoring perspective,
you can check things, that things are the way you expect them to be. But that is more like
when there's a problem,
you need to, like, work backwards from that. Where is the problem? Well, if you try to
continuously create things and if it doesn't exist, it will be recreated. If it exists,
there's nothing to do. So that's more proactive. So I quite like that model.
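The reconcile-towards-declared-state idea is easy to sketch outside Kubernetes too. The provider below is an in-memory stand-in rather than a real API; the point is only the shape of the loop:

```python
# Shape of a reconciliation loop: declare what should exist, compare against
# reality on a schedule, and recreate anything missing. The provider below
# is an in-memory stand-in, not a real API.
import time

DESIRED_TOKENS = {"cert-manager", "external-dns", "app-deploy"}  # declared state

_provider_tokens = {"app-deploy"}   # pretend someone deleted the other two

def list_existing_tokens():
    return set(_provider_tokens)

def create_token(name):
    print(f"token {name!r} is missing, recreating it")
    _provider_tokens.add(name)

def reconcile_once():
    for name in DESIRED_TOKENS - list_existing_tokens():
        create_token(name)
    # tokens that already exist: nothing to do

for _ in range(3):    # a real controller would loop forever
    reconcile_once()
    time.sleep(1)     # drift gets repaired in minutes, not noticed weeks later
```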
What does incident management give a team, though? Because I think this came about whenever you said, well, hey, Fastly was down, we didn't expect it to be down. A majority, if not all, of the responsibility tends to fall on your shoulders for resuming uptime, which is incident management, right? Like, a disruption in a service that requires an emergency response. You're there, you're our first and only responder. I suppose Jared
and I can step in in most cases, but you
really hold the majority of the knowledge. Does incident management give you the ability to sort
of share that load with other people that may not have to know everything you do, but can step in?
What does incident management, I guess, break down to be? Is it simply monitoring and awareness?
Is it action taking? Is there multiple facets
of incident management? So it has a couple of elements, but the element that I'm thinking about
based on your initial question was having the concept of a runbook. So I know I have a problem.
Great. I'm going to communicate my problem. So what do I do? And you codify those steps
in something which is called a runbook. So for example, if Jared had to roll the DNS, what would he do?
How does he approach it?
It didn't have to be me.
But the problem is, as you very well spotted,
is that I am the one that has the most context in this area,
and it would take Jared longer to do the same steps.
Makefiles, plural, we have how-tos.
So how to rotate credentials,
or how to rotate a credential. And it's a step-by-step process, like seven steps or four steps, however
many it is now, how to basically rotate a specific credential. So we need something similar to that,
but codified in a way that first of all, there's an incident. These people need to know about it,
maybe including our listeners: like, hey, we are down, we know we're down, we're working on it, we'll be back shortly. And then one of us, whoever is around, because maybe one of us is on holiday. And if I am on holiday, well, what do you do? What are the steps that you follow to restore things? And as automated as things are, there are still elements, right? Not everything is automated
because it's not worth automating everything or it's impossible. So what are the steps that
Jared or even you can follow to restore things or anyone for that matter that has access to things,
anyone trusted? Yeah. And if it's that simple, then maybe we can automate that. Some things
aren't worth automating because if you run it once every five years,
well, why automate it?
The ROI just doesn't make sense.
It seems like it's pretty complex
to define for a small team.
Maybe easier for larger teams,
but more challenging for smaller teams.
But I know that there are
incident management platforms out there.
Can we name names?
I have two.
Name names. So one of them is FireHydrant.
The other one is Incident.io. I looked at both. And I know that FireHydrant for a fact has the concept of runbooks. So we could codify these steps in the runbook. I don't know about Incident.io,
but if they don't have one, or if they don't have this feature, I think they should,
because it makes a lot of sense. If we had this feature, we wouldn't need to basically find a way to do this or work around the system. The system exists and facilitates these types of
approaches, which makes sense across the industry, not just for us. So even though we're a small team,
we still need to communicate these sorts of things somehow
and in a way that makes sense.
And if we use a tool...
What's an example of a runbook then?
Let's say for our case, Fastly, the Fastly outage,
which is a once in five...
They're not going to do that in the next five years.
I'm knocking on wood over here.
Remember my certainty?
It would be smarter than...
100% uptime?
Next week, Fastly goes down.
Exactly. Don't jinx it.
Well, you know, given their
responsibility and size, they're probably
going to be less likely to do that again
anytime soon, is kind of what I mean by that.
So, but even that,
would you codify in a runbook
a Fastly outage? I think I would.
Now you might, because you
have this hindsight, you know, of recent events, but, you know, prior to this, you probably wouldn't.
So what's a more common runbook for a team like us?
I think I would codify the incidents that happen.
So, for example, if we had an incident management platform, when the fastly incident happened, I would have used whatever the platform
or whatever this tool offered me
to manage that incident.
And then as an outcome of managing the incident,
we would have had this runbook.
So I wouldn't preemptively add this.
I see.
So it's retrospective.
An incident happens, it doesn't happen again.
Well, it may.
Gotcha.
Yeah, this is what I've done to fix it, right?
And anyone can follow those steps.
And maybe if something, for example,
happens a couple of times,
then we create a runbook.
But at least Jared can see,
oh, this happened like six months ago.
This is what Gerhard did.
Maybe I should do the same.
I don't know.
Like, for example, in the case of this DNS token,
what are the steps which I'm going to take to fix it?
So capturing those steps somewhere in a simple form, right? Like literally, as I do it, I do this and I do that. And that is
stored somewhere and can be retrieved at a later date. I guess then the question is, when the
incident happens again, how does somebody know where to go look for these runbooks? I suppose
if you're using one of these services, it gets pretty easy because like, hey, go to the service,
right? And there's a runbooks dashboard
for so to speak.
I think it's just specific
to the service, but yeah.
And you go there,
you're like, oh man,
there's never been a runbook
for this.
I'm screwed.
Call Gerhard
or call so-and-so, you know?
Yeah, I suppose.
But I think
if you operate a platform
long enough
or a system long enough,
you see many, many things.
And then you try to,
I mean, it just progresses to the point that,
let's imagine that we did have multi-cloud.
Let's imagine that, I know, Linode was completely down
and the app was running elsewhere.
We wouldn't be down.
And okay, we would restore, we'd be like in a degraded state,
but things would still be working.
If we had multi-CDN, Fastly's down, well, Cloudflare's up.
It rarely happens that both are down at the same time. So then it's degraded, but it still works. So it's not completely down. In this case, for
example, we didn't have this, but right now, if the backend goes away, if everything disappears,
we can recreate everything within half an hour. Now, how would you do that? It's simple for me,
but if I had to do it maybe once and codify it,
which is actually what I have in mind for the 2022 setup,
I will approach it as if we've lost 2021
and I have to recreate it.
So what are the steps that I'll perform to recreate it?
And I'll go through them, I'll capture them.
Because 2021 is kind of a standard
and you're codifying the current golden standard.
The steps that that would take, yes, to set up a new one.
Yeah, exactly.
To get to zero where you're at right now.
This is ground zero.
And 2021, when I set up, was fairly easy to stand up
because I changed these things inside the setup
so that, for example, right now, the first step, which it does,
it downloads from backup everything it doesn't have.
So if you're standing this up on a fresh setup, it obviously has no assets, no database. So the
first thing which it does, it will pull down the backup, like from the backup, it will pull
everything down. And that's how we test our backups. Which is smart because the point of a
backup is restoration, not storage. Exactly. So we test it at least once a year now. You know,
what's important, I think, to mention here is that this may not be what every team
should do.
Like in many cases, this is exploration on our part.
This is not so much what every team should do in terms of redundancy.
We're doing it in pursuit of one, knowledge, and two, content to share.
So we may go forge new ground on the listener's behalf.
And hey, that's why you listen to the show.
And if you're not subscribed, you should subscribe.
But this, we're doing this not so much because one,
our service is so important that it must be up at all times.
It's because the pursuit of uptime is kind of fun
and we're doing it as content and knowledge.
So that's, I think, kind of cool.
Not so much that everyone should eke out
every ounce of possible runtime.
It's just, in some cases, it's probably not wise because you have product to focus on or other things.
Maybe you have a dedicated team of SREs. And in that case, that's their sole job is literally uptime.
And that totally makes sense. But for us, we're a small team. And so maybe our seemingly, you know, unwavering focus on uptime is not because we're so important,
but because it's fun for content and knowledge to share.
And it makes us think about things in a different way.
So if you try something out, why are you trying something out?
Well, we have a certain problem to address and it may be a fun one, but we will learn.
So it's this curiosity, this built-in curiosity.
How does Incident.io work?
How does FireHydrant work?
What is different?
What about render?
What about fly?
They look all cool.
Let's try it out.
What would it mean to run changelog on these different platforms?
Some are hard, some are that simple.
And sometimes you may even be surprised, say, you know what?
I would not have guessed this platform is so much better. So why are we complicating things using this
other thing? But you don't know until you try it. And you can't be trying these things all the time.
So you need those innovators that are out there. And if, for example, we have something stable that
we depend on, something that serves us well, we can try new things out in a way that doesn't disrupt us completely. And I think we have a very
good setup to do those things. This reminds me of Sesame Street. Either of you watch Sesame Street?
Not that I remember. Of course, everybody knows Sesame Street. But my son is a year and a half
old, so he watches Sesame Street. But something that Hailee Steinfeld sings on the show is,
I wonder what if, let's try, right?
And that's kind of what we're doing here.
It's like, I wonder how this would work out if we did this.
What if we did that?
Let's try.
I think that's how all great ideas start.
The majority may fail.
The majority of the ideas may fail.
But how are you going to find the truly remarkable
ideas that work well in practice? Because on paper, everything is amazing. Everything is new.
Everything is shiny. How well does it work in practice? And that's where we come in, right?
Because if it works for a simple app that we have, which serves a lot of traffic,
it will most probably work for you too. Because I think the majority of our listeners,
I don't think they are the Googles or the Amazons. Maybe you work for those companies, but let's be honest,
it's everybody part of that company that contributes to some massive systems that very few have.
It's all about gleaning really. Like we're doing some of this stuff and the entire solution or the
way we do it may not be pertinent to the listener in every single case, but it's about gleaning what makes sense for your case. The classic "it depends" comes into play. Like, this makes sense to do in some cases. Does it work for me? It depends. Maybe, maybe not.
What's up, shippers?
This episode is brought to you by Sentry.
You already know working code means happy customers,
and that's exactly why teams choose Sentry.
From error tracking to performance monitoring,
Sentry helps teams see what actually matters,
resolve problems quicker, and learn continuously about their applications from the front end to the back end.
Over a million developers and 70,000 organizations already ship better software faster with Sentry.
That includes us.
And guess what?
You can too.
Ship it listeners new to Sentry get the team plan for free for three months.
Use the code SHIPIT when you sign up.
Head to Sentry.io and use the
code SHIPIT.
So I would like us to talk about the specifics, three areas of improvement for the changelog.com
setup, not for the whole year 2022, but just like over the next couple of months.
Top of my list is the incident management. So I have some sort of incident management,
but that seems like an on-the-side sort of thing. And we've already discussed that at some length. The next thing
is I would like to integrate Fastly logging. This is the origin, the backend logging with
Grafana Cloud. The reason why I think we need to have that is to understand how our origin,
in this case, Linode, LKE, where changelog.com runs, how does the origin behave from a Fastly perspective,
from a CDN perspective? Because that's something that we have no visibility into. So what I mean by
that is like when a request hits Fastly and that request has to be proxied to a node balancer
running in Linode, and that has to be proxied to ingress-nginx running in Kubernetes, and that has to be proxied eventually to our instance of changelog.
How does that work?
How does that interaction work?
How many requests do we get?
How many fail?
When are they slow?
Stuff like that.
So have some SLOs, uptime as well, but also how many requests fail and what is the 99th
percentile for every single request.
That's what I would like to have.
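However the logs end up in Grafana Cloud, the headline numbers are simple to compute; here is a sketch over a batch of already-parsed records, with a made-up record shape:

```python
# Sketch of the two headline origin numbers, computed from a batch of
# already-parsed log records. The record shape here is made up; in practice
# it is whatever fields the Fastly logging config emits.
import math

records = [
    # (status_code, origin_response_time_seconds)
    (200, 0.041), (200, 0.038), (502, 2.100), (200, 0.055), (304, 0.012),
]

def percentile(values, pct):
    ordered = sorted(values)
    index = math.ceil(pct / 100 * len(ordered)) - 1
    return ordered[max(index, 0)]

total = len(records)
failed = sum(1 for status, _ in records if status >= 500)
latencies = [duration for _, duration in records]

print(f"error rate: {failed / total:.2%}")
print(f"p99 origin latency: {percentile(latencies, 99) * 1000:.0f} ms")
```

In Grafana terms these would presumably become recording rules or log queries over the ingested stream, but the arithmetic is the same.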
How hard is that to set up?
Not too hard.
The only problematic area is that Fastly doesn't support
sending logs directly to Grafana Cloud.
So I looked into this a couple of months ago,
and the problem is around authenticating the HTTPS origin
where the logs will be sent, right?
Because it needs to push logs, HTTP requests.
So how do we verify that we own the HTTPS origin,
which is Grafana Cloud?
Well, we don't.
So we don't want to DDoS any random HTTPS endpoint
because that's what we would do if we were to set this up.
So we need to set up,
and again, this is like in the support ticket with Fastly,
what they recommend is you need to set up a proxy.
So imagine you have NGINX,
it receives those requests,
which are the Fastly logs,
it'll be HTTPS,
and then it proxies them to Grafana Cloud.
So that would work.
Where would we put our proxy?
Well, we would use the ingress-nginx on Kubernetes,
the one that serves all the traffic,
all the changelog traffic.
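A minimal sketch of that relay, assuming Grafana Cloud's Loki push API on the receiving end; the endpoint, credentials and labels are placeholders, and TLS plus authentication would sit in front of it at the ingress:

```python
# Minimal sketch of the relay: accept Fastly's HTTPS log POSTs and forward
# each line to Grafana Cloud's Loki push API. URL, credentials and labels
# are placeholders; TLS and auth would be terminated at the ingress.
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

LOKI_URL = os.environ.get(
    "LOKI_URL", "https://logs-prod-example.grafana.net/loki/api/v1/push")
LOKI_USER = os.environ.get("LOKI_USER", "123456")       # placeholder user ID
LOKI_KEY = os.environ.get("LOKI_KEY", "api-key-here")   # placeholder API key

class FastlyLogRelay(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        now_ns = str(time.time_ns())
        values = [[now_ns, line] for line in body.decode("utf-8").splitlines() if line]
        payload = {"streams": [{"stream": {"source": "fastly"}, "values": values}]}
        requests.post(LOKI_URL, json=payload, auth=(LOKI_USER, LOKI_KEY), timeout=10)
        self.send_response(204)
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), FastlyLogRelay).serve_forever()
```

Which is exactly where the next question comes from, since the relay would share the same ingress as the site itself.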
Well, couldn't we DDoS ourself then?
We could.
If Fastly sends a large amount of logs, yes, we could.
Now, would we set another?
It's not a DDoS if it's ourself.
It's just a regular DOS.
It's not going to be distributed.
It's just us. Yeah, it's just, well, it will come from all Fastly endpoints, Iself. It's just a regular DOS. It's not going to be distributed. It's just us.
Yeah, it's just, well, it will come from all Fastly endpoints, I imagine.
That's true. It could come from a lot of Fastly points of presence. Yeah.
We could run it elsewhere, I suppose, but I like things being self-contained. I like things being
declared in a single place, right? So to me, it makes more sense to use the same setup. I mean,
it is in a way a Fastly limitation, right? And actually specifically Fastly in Grafana Cloud, that lack of integration that we have to work around.
But speaking of that, I know that Honeycomb supports Fastly logging directly. And one of
the examples that Honeycomb has is the RubyGems org traffic, which is also proxied by Fastly. So in there, like try Honeycomb
out, you can play with
the dataset, which is the
RubyGems.org traffic.
So I know that that integration works out of the box.
And that's why maybe that would be
an easier place to start.
Just a place to start, yeah.
But then we're using Grafana Cloud for everything else.
So that's an interesting
moment. Like do we start moving
stuff across to Honeycomb or do we have things
in two systems right
that's like a like a
little break in the dam you know like a
little bit of water just starts to pour out and it's
not a big deal right now on Grafana Cloud
right yeah well they got
just a little thing over here
Honeycomb yeah turns out
pretty nice over there.
It starts to crack a little bit and more water starts to and all of a sudden just bursts and
Grafana loses a customer.
That stuff happens.
We could also parallelize
this and we could
simultaneously try to get
Fastly and Grafana sitting
in a tree.
K-I-S-S-I-N-G.
Their integrations, you know.
Because that would be great, right?
Yeah, that would be great.
That is actually a request from us.
And that would probably be in the benefit of both,
I think both Fastly and Grafana,
that would be in both entities to their benefit.
So maybe that's already in the works.
Who knows?
I would guess that it is.
Well, I would like to know,
because then we could be not doing a bunch of work. We could just procrastinate until it's there.
Right. Yeah.
It's stuff like this, right? Let's put an email feeler
out. We got some people we can talk to
to know for sure.
And then if it is
in the works and it's maybe on the back burner,
we can put some
fire under the burner because we
need it to. Well, then we've hit another interesting point, in that I really want to try Honeycomb out. I've signed up, and I want to start sending some events their way
and just start using Honeycomb to see what insights we can derive from things that we do. One of the things that I really want to track with Honeycomb, and this is like, I wasn't expecting to discuss this, but it seems to be related, so why not: I want to visualize how long it takes us from git push to deploy, because there are many things that happen in that pipeline. And from the past episodes, this is really important. This is something that teams are either happy or unhappy about. The quicker you can see your code out in production, the happier you will be. Does it work? Well, you want to get it out there quickly.
Right now, it can take anywhere between 10 and 17, 18 minutes, even 20, because it depends on so many parts. Like CircleCI, sometimes the jobs are queued. The backups that run, well, sometimes they can run 10 seconds more. The caches that we hit in certain parts, like images being pulled, whatever, they can be slower, or they can be cold and they have to be warmed up. So we don't really know. First of all, I mean, in my head I know what they are, all the steps, but you and Jared don't know. What does the git push have to go through before it goes out into prod? And what are all the things that may go wrong? And then which is the area, or which is the step, which takes the longest amount of time, and also is the most variable? Because that's how we focus on reducing this time to prod. And Honeycomb, I mean, they're championing this left, right and center. I mean, Charity Majors, I don't know which episode, but she will be on the show very, very soon.
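To make that concrete, the rough idea is one event per deploy with a field per step; the dataset name, field names and stand-in steps below are invented for illustration, but Honeycomb's events API really is just an authenticated JSON POST.

```python
# Rough sketch: time one deploy end-to-end and send a single event to
# Honeycomb's events API so the slow, variable steps become visible.
# Dataset name, field names and the stand-in steps are illustrative.
import os
import time

import requests

HONEYCOMB_API_KEY = os.environ["HONEYCOMB_API_KEY"]
DATASET = "changelog-deploys"   # hypothetical dataset name

def run_step(name, fn, event):
    started = time.monotonic()
    fn()
    event[f"{name}_duration_s"] = round(time.monotonic() - started, 2)

event = {"service": "changelog", "trigger": "git-push"}
pipeline_start = time.monotonic()

run_step("circleci_queue", lambda: time.sleep(0.1), event)   # stand-in steps
run_step("test", lambda: time.sleep(0.1), event)
run_step("image_build", lambda: time.sleep(0.1), event)
run_step("deploy", lambda: time.sleep(0.1), event)

event["total_duration_s"] = round(time.monotonic() - pipeline_start, 2)

requests.post(
    f"https://api.honeycomb.io/1/events/{DATASET}",
    headers={"X-Honeycomb-Team": HONEYCOMB_API_KEY},
    json=event,
    timeout=10,
)
```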
15 minutes or bust. That's what it means. Like you're either in production, your code is either
in production 15 minutes or you're bust. There was an unpopular opinion shared on GoTime.
I can't remember who shared it, but he said, if it's longer than 10 minutes, you're bust.
There you go. So that 15 minutes is going to be moving, I think.
It will be moving. As the industry pushes forward, it's going to keep going lower and lower, right?
Exactly. Well, what is it that does every
I suppose, every
Git push, which is from local
to presumably GitHub
in our case, could be another
code host. Is there a way to
scrutinize like, oh, this is just
this is just views and CSS
changing to like
make that deployment faster.
If it's not involving images or a bunch of other stuff,
why does a deployment of, let's just say it's a typo change on an HTML
and a dark style to the page for some reason, whatever.
If it's just simply CSS or an EX file change in our case,
could that be faster?
Is there a way to have a smarter pipeline?
These are literally just an HTML and CSS update.
Of course, you're going to want to minimize or minify that CSS
that Sass produces in our case, etc., etc.,
but 15 minutes is way long for something like that.
You're right.
So the steps that we go through, they're always the same. We could make the pipeline smarter in that, for example, if the
code doesn't change, you don't need to run the tests. The tests themselves, they don't take long
to run, but to run the tests, you need to get the dependencies. And we don't distinguish like if the
CSS changed, you know what? You don't need to get dependencies. So we don't distinguish between the
type of push that it was
because then you start putting smarts.
I mean, you have to declare that somehow.
You have to define that logic somewhere.
And then maybe that logic becomes, first of all,
difficult to declare, brittle to change.
What happens if you add another path?
What happens if, for example, I don't know,
you've changed a Node.js dependency,
which right now we use,
and then we remove Node.js,
and then we compile assets differently.
And then, by the way, now you need to watch that
because the paths, I mean, the CSS you just generated
actually depends on some Elixir dependencies.
I don't know.
I think esbuild, we were looking at that, or thinking about it.
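The kind of "smarts" being described would look roughly like this; it is a sketch of the idea rather than the actual pipeline, and the hand-written path rules are exactly the brittle part being pointed at:

```python
# Sketch of path-based pipeline smarts: look at what the push touched and
# skip stages that can't be affected. The hand-written path rules are the
# brittle part; every new dependency between areas has to be encoded here.
import subprocess

changed = subprocess.run(
    ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

only_assets = bool(changed) and all(
    path.startswith("assets/") or path.endswith((".css", ".scss"))
    for path in changed
)

stages = ["fetch_deps", "test", "build_image", "deploy"]
if only_assets:
    # a pure CSS change doesn't need the full dependency-and-test cycle
    stages = ["build_image", "deploy"]

print("changed:", changed)
print("running stages:", stages)
```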
You effectively introduce a big cache invalidation problem.
Yes, that's what you do.
Yeah, cache invalidation is one of the hard things in computer science. So it's slow, but it's simple. It's like, we just rebuild it every time. It's like, why does React re-render the entire DOM every time? Well, it doesn't anymore, because that was too slow. So it does all this diffing and stuff. But there's like millions and millions of dollars in engineering spent on figuring out how React is going to smartly re-render the DOM, right? It's the same thing. There's so many little what-ifs once you start only doing... and this is why Gatsby spent years on their feature, which is partial builds. Because building a Gatsby site, which is a static site generator, right, building a 10,000-page static site with Gatsby was slow. I just made up the number, 10,000, but you know, 100,000, whatever the number is, it was slow. And so it's like, well, couldn't we just only build the parts that changed, right? Like what Adam just said.
It's like, yeah, we could.
But then they go and spent two years building that feature
and VC money and everything else to get that done.
So it's like a fractal of complexity.
Yeah.
I'm saying there's small things you can do.
You can get like the 80% thing and it works mostly
and doesn't get you all of it, it doesn't squeeze out every bit of performance,
so there's probably some low hanging fruit we could do,
but it's surprisingly complicated to do that kind of stuff.
And the first step really is trying to understand
these 15 minutes.
First of all, how much they vary,
because as I said, sometimes they can take 20 minutes.
Why does it vary by that much?
Like maybe, for example, it's test jobs being queued up in CircleCI. A lot of the time that
happens and they are queued up for maybe five minutes. So maybe that is the biggest portion
of those 20 minutes or 15 minutes, and that's what we should optimize first. Yeah, that's why I said
there's probably some low hanging fruit. We can probably do a little bit of recon and knock that down quite a bit. And that's exactly why I'm thinking,
like, use Honeycomb, just like to try and visualize those steps, what they are, how they work,
stuff like that. Exactly. Good idea. Second thing is, and I think this can either be a managed
PostgreSQL database, so that either CockroachDB or anyone that manages
PostgreSQL, like one of our partners, one of our sponsors, I would like us to offload that problem
and we just get the metrics out of it to understand how well it behaves, what can we
optimize, stuff like that in our queries. But otherwise, I don't think we should continue hosting PostgreSQL.
I mean, we have a single instance. I mean, it's simple, really, really simple. It backs up. I mean,
it's no different than SQLite, for example, the way we use it right now, but it works. We didn't
have any problems since we switched from a clustered PostgreSQL to single node PostgreSQL,
not one. We used to have countless problems before when we had a cluster.
So it's hard, is what I'm saying.
What we have now works,
but what if we remove the problem altogether?
I remember slacking,
how can our Postgres be out of memory?
It's like, well, wasn't it with the backup?
The backup got something happened with the backup.
Or the WAL file.
The WAL file.
The replication got stuck and it was like broken.
It just wouldn't resume.
And the disk would fill up.
Crazy, crazy, crazy.
And that's the reason you would want to use a managed one,
because they handle
a lot of that stuff.
Exactly.
And if it can be distributed,
then that means that we can run
multiple instances of our app.
Were it not for the next point,
which is an S3 object store
for all the media assets
instead of local disk.
Right now,
when we restore from backups,
that's actually what takes the most time, because we have like 90 gigs at this point. So restoring that will take some number of minutes. And I think with a move to S3 and a managed PostgreSQL, neither of which we have today, we can have multiple instances of changelog. We can run them in
multi-cloud. I mean, it opens up so much possibility if we did that
that would be like putting all of our assets in S3. It'd be like, welcome to the 2000s, guys.
It would be, right. That's exactly right. Yeah, you've now left the '90s. Maybe I should explain why we're using local storage. Some of it's actually just technical debt. This was a decision I made when building the platform back in 2015, around how we handle uploads: not image uploads, but MP3 uploads, which is one of the major things that we upload and process. And these MP3s are anywhere... we also want to do post-processing, like post-upload processing on the MP3s,
because we go about rewriting ID3 tags and doing fancy stuff
based on the information in the CMS, not a pre-upload thing.
So it's nice for putting out a lot of podcasts
because if Gerhard names the episode and then uploads the file to the episode,
the MP3 itself is encoded with the episode's information without having to duplicate yourself.
So because of that reason, and because I was new to Elixir, and I didn't really know exactly the
best way to do it in the cloud, I just said, let's keep it simple. We're just going to upload the files to the local
disk. We had a big VPS
with a big disk on it and
don't complicate things.
And so that's what we did.
And knowing full well, even back then
I had done client work where I would put
their assets on S3. It's just because of this MP3 thing and the ID3s: we run FFmpeg against it, and how do you do that in the cloud, et cetera. So that was the initial decision-making, and we've been
kind of bumping up against that ever since. Now, the technical debt part is that our image uploader, our assets uploader library in Elixir that I use, is pretty much unmaintained at this point. It's a library called Arc, and in fact the last release, version 0.11, was cut in October of 2018. So it hasn't changed, and it's a bit long in the tooth. Is that a saying, long in the tooth? I think it is. And I know its warts pretty well, I've used it very successfully, so it serves us very well.
But there's technical debt there.
And so as part of this, well, let's put our assets on S3 thing,
I'm like, let's replace Arc when we're doing this,
because I don't want to retrofit Arc.
It does support S3 uploads,
but the way it goes about shelling out for the post-processing stuff,
it's kind of wonky.
I don't totally trust it, and so I would want to replace it as part of this move. And I haven't found that replacement, or do I write one, et cetera. So it's kind of like that, where it's just a slightly bigger job than, you know, reconfiguring Arc to push to S3, doing one upload, and being done with it. But it's definitely time.
It's past time.
So I'm with you.
I think we do it.
Yeah, I think that makes a lot of sense.
And this just basically highlights the importance
of discussing these improvements constantly.
So stuff that keeps coming up, not once,
but like two years in a row,
it's the stuff that really needs to change, right?
Unless you do this
constantly, you don't realize exactly what the top item is, because some things just change,
right? It stops being important. But the persistent items are the ones that I think
will improve your quality of software, your quality of system, service, whatever you have
running. And it's important to keep coming back to these things. Is this still important? It is.
Okay, so let's do it. But you know what? Let's just wait another cycle. And then eventually
you just have to do it. So I think this is one of those cases and we have time to think about this
and what else will it unlock? If we do this, then we can do that. And is it worth it? Maybe it is.
And I think in this case, this S3 and the database, which is not managed, have the potential
of unlocking so many things for us.
Simplifying everything.
Well, the app becomes effectively stateless, right?
It does. How amazing is that?
And then you're basically in the cloud world
where you can just do whatever you want.
That's exactly it.
Life is good.
That's exactly it.
And then face all new problems that you didn't know existed.
True.
Does this Arc thing,
does it also impact the chaptering stuff
we've talked about in the past year?
Wasn't that also part of it?
There is an angle into that. So for the listener, the chaptering: the MP3 spec, actually it's the ID3 version 2 spec, which is part of the way MP3s work, it's all about the headers, supports chaptering. ID3v1 does not. ID3v1 is very simple; it's like a fixed-frame kind of a thing. And ID3v2 is more complicated, but has a lot more features, one of which is chaptering. And chapters are totally cool. You know, Ship It is roughly three segments. Well, we could throw a chapter into the MP3 for each segment, and if you want to skip to segment three real fast, you could. We would love to build that into our platform, because then we could also represent those chapters on the webpage, right? So you can have timestamps and click around, lots of cool stuff. Unfortunately, there's not an ID3v2 Elixir library. And the way that we do our ID3 tags right now, by way of Arc, is with FFmpeg. So we shell out to FFmpeg and we tell FFmpeg what to do to the MP3 file, and it does all the magic, the ID3 magic, and then we take it from there.
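As a rough illustration of that shell-out, and only a sketch rather than the actual Arc pipeline, this is the general shape of rewriting ID3 tags with FFmpeg; the paths and tag values are made up, and in production they would come from the CMS.

```python
# Hedged sketch of the "shell out to FFmpeg to rewrite ID3 tags" step.
# File names and metadata values are hypothetical.
import subprocess

def write_id3_tags(src: str, dest: str, title: str, artist: str, album: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src,
            "-codec", "copy",               # copy the audio stream, only rewrite metadata
            "-metadata", f"title={title}",
            "-metadata", f"artist={artist}",
            "-metadata", f"album={album}",
            dest,
        ],
        check=True,                         # raise if FFmpeg exits non-zero
    )

write_id3_tags("upload.mp3", "tagged.mp3",
               "Kaizen! The day half the internet went down",
               "Changelog Media", "Ship It")
```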
So the idea was, well, if we could not depend on FFmpeg, first of all, that simplifies our deploys because we don't have a dependency that's like a Linux binary.
Oh, it's a small thing.
But we'd be able to also do chaptering.
So we get some features as well as simplify the setup.
And that is only partially to do with Arc.
Really, that has to do with the lack of that ID3v2 library in Elixir.
Like that functionality does not exist in native Elixir.
And so, if it did, I could plug that into Arc's pipeline and get that done currently. Now, if FFmpeg supported the feature, we wouldn't need it anyway. We would just do it in FFmpeg. But it does not, and it doesn't seem like it's something that they're interested in, because MP3 chaptering is not like a new whiz-bang feature. It's been around for a decade, maybe more. So the fact that it doesn't exist in FFmpeg, which, if you've ever seen it, is like one of the most featureful tools in the world... I mean, FFmpeg is an amazing piece of software that does so many things, but it doesn't support MP3 chaptering. So that's kind of a slightly
related but different initiative that I've also
never executed on.
I just wondered if we had to bite the Arc tail
off or whatever that might seem like
to also get a win,
you know, along with that. And the win we've wanted for years, essentially, was being able to bake some sort of chapter maker into the CMS backend, so that we can display this on pages, as you said, or in clients that support it, because that's a big win for listeners. And for obvious reasons that Jerod just mentioned,
that's why we haven't done it.
It's not because we don't want to.
It's because we haven't technically been able to.
So if this made us bite that off,
then it could provide some team motivation.
Like we get this feature too,
and we get this stateless capability for the application.
It just provides so much traction.
Yeah, and one way I thought that we could tackle that,
which doesn't work with our current setup,
is we could, I mean, we render the MP3s
or we mix down the MP3s locally on our machines.
Then we upload them to the site, right?
We could pre-process the chapters locally.
We could add the chapters locally to the MP3,
then upload that file.
And if we could just write something that reads ID3v2,
it doesn't have to write it.
We could pull that out of the MP3
and display it on the website.
And that would be like a pretty good compromise.
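To sketch that read-only compromise: the snippet below pulls ID3v2 chapter (CHAP) frames out of an MP3 so that titles and timestamps could be rendered on an episode page. It uses Python's mutagen library purely for illustration; the file name is hypothetical, and the real implementation would need an Elixir equivalent.

```python
# Hedged sketch: read ID3v2 chapter (CHAP) frames from an MP3 using mutagen.
from mutagen.id3 import ID3

tags = ID3("episode.mp3")                      # hypothetical file name
for chap in tags.getall("CHAP"):
    # Chapter titles conventionally live in a TIT2 sub-frame of each CHAP frame.
    title_frame = chap.sub_frames.get("TIT2")
    title = title_frame.text[0] if title_frame else chap.element_id
    start_s = chap.start_time / 1000           # CHAP times are in milliseconds
    end_s = chap.end_time / 1000
    print(f"{start_s:8.1f}s - {end_s:8.1f}s  {title}")
```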
However, when we do upload the file,
when you pass it to FFmpeg and tell it to do the title
and the authors and the dates and all that,
well, it just completely blows away your local ID3 tags.
So it overwrites it.
As I was listening to you talking about this,
one of the things that it reminded me of is the segments in YouTube videos,
which sometimes I really like because I can skip to specific topics really easily. So rather than
having fixed like beginning, middle and end, you can have topic by topic and you can skip to the specific parts.
I would love to see that in changelog audio files.
That's the feature right there.
Like you use it however you want to use it.
So like the obvious way is like,
well, there's three segments, I'll put three chapters in.
But if you were in charge of doing your own episode details
and you could put the chapters in the way you'd want to,
yeah, you could make it really nice just like that.
And for clients that support it, it is a spectacular feature. Now, a lot of the popular podcast apps don't care. Like, Spotify is not going to use it; Apple Podcasts historically has not used it. So, like, to them they basically don't exist. But the indie devs tend to put those kinds of features in, like the Overcasts, the Castros; I'm not sure if Pocket Casts is into it anymore. But those people who really care about the user experience of their podcast clients, they support chaptering, and for the ones that do, it's a really nice feature.
Yeah, I love that.
The other thing that I would really like is, when I write blog posts, I could just drag and drop files as I do in GitHub and just get them automatically uploaded to S3.
Because right now I have to manually upload them.
You and me both.
And then reference them.
It's so clunky.
I would love that feature.
You're exposing our ad hoc-ness.
Come on now.
We literally open up Transmit
or whatever you use to manage S3 buckets,
and we drag and drop them,
and then we copy URL.
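For contrast, the automated version of that manual flow is fairly small. A hedged sketch with boto3; the bucket and key names are made up, and it assumes the bucket policy already makes uploaded objects publicly readable.

```python
# Hypothetical sketch: upload a blog-post asset to S3 and print a public URL.
import mimetypes
import boto3

def upload_asset(path: str, bucket: str = "changelog-assets", prefix: str = "uploads/") -> str:
    key = prefix + path.rsplit("/", 1)[-1]
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    boto3.client("s3").upload_file(path, bucket, key, ExtraArgs={"ContentType": content_type})
    return f"https://{bucket}.s3.amazonaws.com/{key}"     # assumes a public-read bucket policy

print(upload_asset("screenshot.png"))
```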
But first you have to make it readable by the world. Don't forget that part. And then put the link into your blog post.
You can configure that on the bucket, so that all files are readable.
Really? We do have that.
We do have that on the bucket.
I didn't know about that. But it still sucks.
It does suck. But one thing which I do for these episodes, for the Ship It ones: I take a screenshot. By the way, I took very good screenshots of all three of us.
And I put them in the show notes.
I saw that.
You're the first one to do that.
So, again, you're pushing the envelope of Changelog Podcasts and probably pushing us towards features that I would normally just completely put off over and over again.
See what happens when people come together and talk about what could improve?
Well said. So what I propose now is that we go and improve those things and come back in 10 episodes. How does that sound?
Sounds good. Kaizen!
Kaizen!
That's it for this episode of Ship It. Thank you for tuning in.
we have a bunch of podcasts for developers
at Changelog that you should check out. Subscribe to the master feed at changelog.com forward slash
master to get everything we ship. I want to personally invite you to join your fellow
changeloggers at changelog.com forward slash community. It's free to join and stay. Leaving,
on the other hand, will cost you some happiness credits.
Come hang with us in Slack.
There are no imposters.
Everyone is welcome.
Huge thanks again to our partners, Fastly, LaunchDarkly, and Linode.
Also, thanks to Breakmaster Cylinder for making all our awesome beats.
That's it for this week.
See you next week. Game on