The Changelog: Software Development, Open Source - Kaizen! The day half the internet went down (Interview)

Episode Date: August 5, 2021

This week we're sharing a special episode of our new podcast called Ship It. This episode is our Kaizen-style episode where we point our lens inward to Changelog.com to see what we should improve next... The plan is to do this episode style every 10 episodes. Gerhard, Adam, and Jerod talk about the things that we want to improve in our setup over the next few months. We talk about how the June Fastly outage affected changelog.com, how we responded that day, and what we could do better. We discuss multi-cloud, multi-CDN, and the next sensible and obvious improvements for our app.

Transcript
This week on The Changelog, we're sharing a special episode of our new podcast called Ship It. Subscribe at changelog.com slash ship it. This is our Kaizen-style episode, where we point our lens inward to changelog.com to see what we should improve next. The plan is to do this episode style every 10 episodes. Ship It launched back in May and now has 13 episodes in the feed to enjoy. You'll hear stories from Tom Wilkie on Grafana's big tent idea, Charity Majors on Honeycomb's secret to high-performing teams, Dave Farley on the foundations of continuous delivery, and coming soon you'll hear from Uma and Karthik from ChaosNative on resiliency being born from chaos, and Justin Searls on software testing and automation. Gerhard is bringing it with this show. It is awesome. You should subscribe. Check it out at changelog.com slash ship it and anywhere you listen to podcasts.
Big thanks to our partners, Linode, Fastly, and LaunchDarkly. We love Linode. They keep it fast and simple. Get $100 in credit at linode.com slash changelog. Our bandwidth is provided by Fastly. Learn more at fastly.com. And get your feature flags powered by LaunchDarkly. Get a demo at launchdarkly.com. This episode is brought to you by InfluxData, the makers of InfluxDB, a time-series platform for building and operating time-series applications. In this segment, Marian Bija from NodeSource shares how InfluxDB plays a critical role in delivering the core value of the APM tool they have called NSolid. It's built specifically for Node.js apps to collect data from the application and stack in real time. At NodeSource, we wanted to lean into a time series database, and InfluxDB quickly rose to the top. One of the unique value propositions of NSolid is real-time data. And there are a lot of APM tools out there, but there is a variance in terms of how available the data is.
It's not really real-time. There is actually a staging period to the data. And InfluxDB is magical and allows us to deliver on our unique value proposition of real-time data with NSolid. To get started, head to influxdata.com slash changelog. Again, that's influxdata.com slash changelog. We are going to ship it in 3, 2, 1. So I really wanted to talk to you about this topic of Kaizen. Kaizen, for those hearing it for the first time, is the concept of the art of self-improvement. And that is really powerful, because it's the best way that you have to improve yourself and to always think about, how can I do this better? It all starts with: how can I do this better?
So with that in mind, what I wanted us to do every 10th episode was to reflect on what we can improve for the application, for our setup, but also the show. Because isn't that the best way of improving? I think it is. Kaizen, I love it. Always be improving, ABI. ABI, yeah.
Always be something. ABS, always be something, you know. I'm pretty sure that means something else for others, ABS. But yes, always be something. Automatic system, that's what it refers to for me. The reason why I care so much about this is having been part of Pivotal, a company which isn't around anymore. It was acquired by VMware a year or two ago, whatever. One of the core principles there was to always be improving. Be kind was there as well. But always be improving was something that was embodied in the retrospectives that we used to have every single week, at the end of the week. And this was good: what worked well? What didn't work so well? Anything that people want to discuss? And that made sure that everybody was in sync with the problems, but also the wins. I think that's important. So having done it for five, six, seven years, it's so deeply ingrained in me, I cannot not do it.
It's part of me, and I do it continuously. And I think the infrastructure setup that we've been rolling for some number of years has been an embodiment of that. Every year it has been improving. It was rooted in this principle. Now, one thing that we did differently in the past is that we improved, or at least we shared those improvements, once per year. It was a yearly thing. And one of the ideas for the show was to do it more often, to improve more often.
So we can improve and take smaller steps, but also figure things out a lot, lot quicker. What works, what doesn't work, rather than once a year. It works out to about every two, two and a half, two-ish months, essentially; we get a response, a blip, a feedback loop. Whereas before it was like once, and then more recently twice, in the year. If it's Kaizen every 10 shows, then we get around, you know, four or five-ish per year if you're shooting for 50 shows a year. So I think in May, beginning of May,
Starting point is 00:05:10 end of April, beginning of May, we switched on the 2021 setup and we had a show, we had the intro, we did a couple of things. Episodes, do you still remember which episode that was from Changelog, Adam? 4.3. No, but I have the internet
and I will look it up. So give me a moment while I look it up. That is a good one. So that was meant to be part of Ship It, but then some timelines got moved around, and then that went on the Changelog, and then on Ship It we did the intro to the show. So that's how that happened. That was an interesting maneuver, a last-minute maneuver from us too, which I'm not sure really matters to the listeners, but I think it was kind of... we had a plan, and then at the last minute, we changed the first 10 yards of running down the field, so to speak. That was episode 441 on the Changelog's feed. So changelog.com slash 441 will get you there. Inside 2021's infrastructure for changelog.com, which is like a throwback to the year prior, Inside 2020's infrastructure for changelog.com. So we've been doing that every year now for the past couple of years.
I think that change made a lot of sense. And that change just led to a couple of other things. And now we're finally at the point to talk about the next improvement. So you don't have to wait another year. Not only that, we're doing things slightly differently. We're going to share the things that we're thinking about improving, maybe why we're thinking about improving them, so that maybe you have better ideas. Maybe you know about things that we don't,
that you would like us to try out, maybe part of the same thing. So Fastly, I would like to mention that, because Fastly, our partner, amazing CDN, had an outage a couple of weeks back. Unexpected, of course. Right after you said 100% uptime. Exactly. It was like a week after, wasn't it? That show shipped, and the very next week, Fastly outage. It was a global outage too.
Starting point is 00:07:01 It was global. Half the internet broke. It was the biggest Fastly outage that I am aware of. So what that made me realize is that Fastly is great when it works. And when it doesn't, it doesn't affect just us. It affects everybody. Everybody. BBC was down.
Starting point is 00:07:19 That's a big one. BBC being down. Emojis were down. On the whole internet. That was unexpected. Wait, wait, wait. Tell me more. How were emojis down for the whole internet?
Starting point is 00:07:30 Does it make sense? Well, apparently the assets that were served by AWS had something to do with it. I don't know exactly which capacity, but AWS was serving certain emoji assets. And Fastly was part of that. And emojis stopped working for Slack. So I think in the Slack setup somewhere, I mean, everybody uses Slack, right? To communicate these days because everybody's at home these days
or most of us are at home these days. So you couldn't use emojis in Slack anymore. They stopped working. That makes more sense than emojis just stopping working globally, across the entire, you know, world of devices. But yeah, it has to be sensational. It's news, it has to be sensational. Well, most importantly, we were down. Most importantly to us. So BBC being down, tragic, terrible for lots of people, but for us specifically, we were down, and that was the worst part about it, wasn't it? For us, yes, and for all the listeners, right. And interestingly, during this time our origin, the backend to Fastly, was up. It didn't have an issue. So this month I got the report: we were down for 21 minutes because of that. So 99.96% uptime. So you had a cutover though.
You turned off Fastly basically, right? Yes. Jumped in, switched Fastly off, basically rerouted traffic. So a DNS update, and changelog.com would start resolving directly to the Linode host, to the Linode NodeBalancer, and Fastly was basically taken out of the picture. But because of how DNS is cached, it took a couple more minutes to propagate, but that was it, and the CDN as well, rerouted. I was basically chilling. It was like a day off. It was a great one. I was in the garden, it was sunny, it was perfect. Yeah, chilling. There's nothing you do, right? Yeah. As you do. Exactly. Then the phone started going off like crazy. Yeah. So that was really when I got SMS messages, because we have multiple systems, right? When something is down, you really want to know about it. So I got texts, I got Pingdom alerts. What I didn't get is Telegram notifications, because guess who else was down? Grafana Cloud. No. You didn't let me guess, I was gonna guess it. I thought you were saying you had the day off because of all the downtime, nothing to do. Yes, Grafana.
Sorry, Adam, what were you saying? I was saying I thought you said you were taking the day off because you had nothing to do, because the internet was down, essentially. That's what I thought you were saying. I was just chilling. It was like a gorgeous day, sunny. It was like a day off. I was sunbathing. I will go into more details on that.
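For reference, the cutover described here boils down to a health check plus a single DNS record update: if the Fastly edge is failing but the Linode origin is healthy, repoint the apex record straight at the NodeBalancer. A minimal sketch, where `update_apex_record` is a hypothetical helper standing in for the DNS provider's API (DNSimple, in this setup), and the IP address is made up:

```python
import requests

SITE = "https://changelog.com/"
ORIGIN_IP = "203.0.113.10"  # hypothetical Linode NodeBalancer address

def healthy(url: str) -> bool:
    """True if the URL answers with a non-error status within a few seconds."""
    try:
        return requests.get(url, timeout=5).status_code < 400
    except requests.RequestException:
        return False

def update_apex_record(ip: str) -> None:
    """Hypothetical: call the DNS provider's API to point changelog.com at `ip`.
    A low TTL on the record is what keeps the propagation delay short."""
    print(f"would repoint changelog.com A record to {ip}")

if healthy(SITE):
    print("edge is fine, nothing to do")
elif healthy(f"http://{ORIGIN_IP}/"):    # simplified: a real check would also set the Host header
    update_apex_record(ORIGIN_IP)        # cut the CDN out of the picture
else:
    print("origin is down too; failover would not help")
```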
Let me say two things. First of all, thanks for springing into action and bringing us back up. 21 minutes, nothing wrong with that, compared to the BBC. Those suckers, they were down for much longer. But the bummer side, so let me tell you the bummer side, which I haven't told you before. What you did is you cut Fastly out and you put Linode directly in, right? And so all of our traffic was served from Linode during that time. Well, it just so happened to be timed directly when we shipped our episode of the Changelog with Ryan Dahl.
And because we do all of our analytics through our Fastly logs and we served all of that traffic directly from Linode, we have no idea how popular that episode is. In fact, in our admin it looks like it's not a very good episode of the Changelog, but I'm quite sure it's pretty popular. So I was bummed. I was like, oh no, we missed out on the stats for the show, which is one of our bigger shows of the year. But I'd rather have that happen and let people listen to it than have it be down and, you know, nobody gets to listen to it. So that was a bummer, but pick your poison, I guess, or the lesser of two evils. Yeah, I remember that, actually, because I remember looking at the stats and the stats were like down, and I was thinking, I want to talk to Jerod about this. So if there's one lesson to learn from this, we need to double up.
So everything that we do, we need to do two of that thing. Like monitoring: we have two monitoring systems, because sometimes Grafana Cloud has an issue and we still want to know. And when I say Grafana Cloud, I mean the blackbox exporter, all the exporters. There was a recent one as well: when they push updates, sometimes things are offline for a few minutes, and it makes you think that a website is offline, but it's not. Or when it is offline, you don't get anything. So we use Pingdom as a backup. And that helps.
Starting point is 00:12:11 So stats, I think it's great to have stats from Fastly, but I don't think we can rely only on those stats. I think we need more. Well, it's one of those ROI kind of conversations. And I think this is a good conversation for ShipIt.
Like what's worth doing? And the fact is that in our five years of being on Fastly, this is the first incident they've had. And if it didn't happen to be right when we released a popular episode of the Changelog, like if it was just a Saturday and we missed some downloads, I wouldn't care all that much, you know. And at the end of the day, I know that show's popular, so it's not really changing my life. I just know it was popular because people reacted that way, versus looking at the download stats. So the question becomes, what does it take to get that redundancy, right? What does that redundancy cost and what does it gain?
Yeah. And in the case of stats, I'm not sure which side of the teeter-totter we actually end up on, because the way it works now is Fastly streams the logs of all of the requests to the MP3 files over to S3. And then we take those logs, which are formatted in a specific way, parse them, and then bring them locally into our database. And it's reproducible in that way off of S3. So we can just suck down the same logs from S3 whenever we want, reparse them, you know, recalculate. But what would it take to get Linode doing the same thing, or to change the way we do our stats so that we're either redundant or do it differently?
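For context, the pipeline being described is roughly: Fastly streams request logs for the MP3s into S3, and a job later pulls those objects down, parses each line, and turns them into download counts in the database. A rough sketch of that replay step, with the bucket name, prefix, and space-delimited log layout as assumptions:

```python
from collections import Counter

import boto3

def replay_logs(bucket: str, prefix: str) -> Counter:
    """Re-read Fastly request logs from S3 and count MP3 downloads per path."""
    s3 = boto3.client("s3")
    downloads: Counter = Counter()
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            for line in body.decode().splitlines():
                # Assumed layout: <timestamp> <client_ip> <status> <bytes> <path>
                _ts, _ip, status, _nbytes, path = line.split(" ", 4)
                if status == "200" and path.endswith(".mp3"):
                    downloads[path] += 1
    return downloads

# Because the raw logs stay in S3, the stats are reproducible: re-run this at
# any time to rebuild the counts from scratch.
if __name__ == "__main__":
    print(replay_logs("changelog-logs", "fastly/2021/06/").most_common(10))
```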
Starting point is 00:13:40 I don't know the answer to that off the top of my head. In the case of something like Grafana, though, I would put that back on them. Like we shouldn't have two Grafanas. Like Grafana, I think this is probably the case for multi-cloud, right? Wouldn't it make sense then to be, let's say, on GCP, Azure, or essentially multi-cloud? And maybe that's an issue with cloud at large. Maybe it's like, well, the cloud has to be multi-cloud so that if part of their cloud goes down, then there's still some sort of like redundancy in them. I would rather them do that kind of stuff than us have to, you know, have essentially two Grafanas or Linode and Fastly and like deal with that.
The way I think about this is that we are in a unique position to try out all these providers. So we have the know-how, and really our integrations are fairly simple. So I know that it wouldn't take that much more to integrate Cloudflare. So how about we use Cloudflare and Fastly, the two biggest CDN providers, at the same time? What if, for example, we decouple assets from
Starting point is 00:15:12 local storage, we store them in an S3 object store, we, for a database, we use maybe CockroachDB, a hosted one, and then the database is global. And then we are running changelog, one instance on Linode, one instance on Render, one instance on Fly. And then we use different types of services, not just Kubernetes. We try a platform because we try it out. And at the same time, we are fully redundant.
Now, the pipeline that orchestrates all of that will be interesting. But this is not something that's going to happen even like in a year. It's like slowly, gradually. It's maybe a direction that we choose to go towards. And maybe we realize, you know what? Actually, in practice, Cloudflare and Fastly,
it's just too complicated. Because only once you start implementing do you realize just how difficult it is. Yeah, that's the cost that Jerod was talking about. How much does the redundancy cost and how much does it gain you? So from a CDN perspective, you just basically have multiple DNS entries, right? You point both Fastly and, what do you call it, Cloudflare to the same origin, or origins in this case. Let's just start with the one origin. The configuration is maybe slightly
Starting point is 00:16:21 different, but we don't have too many rules in Fastly. How do they map to Cloudflare? I don't know. But again, there's not that much stuff. I think the biggest problem is around stats, right? We keep hitting that. Yes. And I looked at Cloudflare, it's probably two years ago now, with regards to serving our MP3s.
And where I ran into problems was their visibility into the logs; getting that information out paled in comparison to what Fastly provides. And so we would lose a lot of fidelity in those logs. Like with regard to IP addresses, Fastly will actually resolve, you know, with their own MaxMind database, or whatever their GeoIP database is, and that will give you the state and the country of the request, stuff that we don't have to do ourselves. And Cloudflare, at least at the time, now this is a couple of years ago, just didn't provide any sort of that visibility. And so it was like, I would lose a lot of what I have in my stats using Cloudflare. And if I was going to go multi-CDN, which is kind of like multi-cloud, I would have to go lowest common denominator with my analytics
Starting point is 00:17:28 in order to do that. And so it really didn't seem worth it at the time. But maybe it's different now. Yeah, if they've improved their logs, then it's back on the table, let's say. So that's maybe a long-term direction. What's some stuff that is more immediate that you have on the hit list,
things that we should be doing with the platform? Yeah. I think multi-CDN makes sense to me, just for those reasons. If you've got one that goes down, then you've got another resolver. But once in five years? How often is Fastly down? Okay, I'm thinking about this
from the perspective of the experience and sharing these things. Like, right, a few years back we were missing this, but we don't know what they have or don't have this year, or maybe what they're missing. Maybe they don't even know what we would like for them to have. And listeners of the show, they can think, you know what, this show is really interesting, because they are using multi-cloud and these are all the struggles that they have. So maybe we can learn from them and not make some of these mistakes ourselves. So in a way, we're just producing good content that is very relevant to us. So we say, you know what? We are informed, and we have made an informed decision to not use Cloudflare because of these reasons, which may or may not apply to you, by the way. It's like there's a brand new hammer, you know, and we grab hold of it and everyone gathers around. We put our hand out and we strike it right on our thumb, and then everybody knows that hammer really hurts when you strike it on your thumb. I'm glad those guys did it, I've learned something. Instead, yeah. And I don't have to do that myself. I think that's a very interesting perspective, but I don't see it that way. Okay. It's an amazing analogy, but I'm not sure that applies here.
But yeah, it's great fun. That's for sure. Okay, good. This episode is brought to you by our friends at LaunchDarkly. If a feature isn't ready to be released to users, wrapping code with feature flags gives you the safety to test new features and infrastructure in your production environments without impacting the wrong end users. When you're ready to release more widely, update the flag status and the changes are made instantaneously by the real-time streaming architecture. Eliminate risk, deliver value, get started for free today at launchdarkly.com. Again, launchdarkly.com. So you're asking, Jerod, what is next on our hit list? One of the things I learned from the Fastly incident is that we don't have anything to manage incidents. When something is down, how do we let users know
Starting point is 00:20:27 what is going on? How do we learn from it in a way that we can capture and then share amongst ourselves and also others? A document is great. Slack, just to write some messages is great, but it feels very ad hoc. So one of the things that I would really, really like is a way to manage these types of incidents. And guess what? There's another incident that we have right now. Right now?
Right now, right now. Like the website's down right now? No. This is a small incident. No, the website is 100% up. 100% of the time. Thank you. Yeah. So Fastly, it's your responsibility to keep it up, right? That's what it boils down to. It's someone else's problem. It's Fastly's problem. That's right. Pass the buck. Right.
So right now, one of the DNSimple tokens that we used to renew certificates has been deleted. So it's either Adam or Jerod, because I haven't. Wasn't me. Anyways, the point is, I'm not pointing any fingers. I don't touch DNS. So in the account of the DNS... It's looking like maybe it was me, but I didn't touch anything. So I don't know what's going on.
It could be worse than we said. It could be a bit flip. So we had two DNS tokens, one was for the old setup and one was for the new setup. The one for the old setup I deleted, because we just didn't need it. And then we had three DNS tokens left. One of them disappeared, it's no longer there. And that was the one that was used by cert-manager to renew certificates. So certificates are now failing to renew. We passed the 30-day threshold, and we have, I think, another 25 days to renew the certificate. But because the token is not there, the certificate will never be renewed. And then eventually the certificate will stop being valid. This is the same one that we use in Fastly, so a lot of stuff is going to break for many people. Now, I found out about this by just looking through k9s at what is happening with the different jobs; there are jobs which are failing that are meant to renew things. It's not the best setup. So the first thing which I've done, I've set up an alert in Grafana Cloud: when the certificate expires in less than, I think, two weeks, or actually three weeks, whatever, some number of seconds, because that's how they count them, I get an alert. So it should automatically renew within 30 days, and if within 25 days it hasn't been renewed, I get an alert. So I have roughly 25 days to fix it.
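The alert itself lives in Grafana Cloud, but the underlying check is simple: read the certificate's expiry date and flag it when fewer than about 25 days remain, since cert-manager should already have renewed it at the 30-day mark. A sketch of that check, with the threshold as described here:

```python
import socket
import ssl
import sys
import time

HOST = "changelog.com"
ALERT_DAYS = 25  # cert-manager renews 30 days out; alert if it hasn't by 25

def days_until_expiry(host: str, port: int = 443) -> float:
    """Connect over TLS and return how many days remain on the certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

if __name__ == "__main__":
    remaining = days_until_expiry(HOST)
    print(f"{HOST}: certificate expires in {remaining:.1f} days")
    sys.exit(1 if remaining < ALERT_DAYS else 0)
```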
So what I would like to do is, first of all, capture this problem in a way that we can refer back to it, and also fix it in a way that we can refer back to it: how did we fix it? What went into it? What was added? So that this doesn't happen again. And adding that alert was one of the actions that I took even before we created an incident. So that's one of the top things on my list. How does that sound to you both? Was it called an access token? So on June 19th, they have an activity log. This is actually kind of important, I think this is super important, for services that have multiple people doing things that are important, that could break things, essentially: have an activity log of things that happen, deletions, additions. And DNSimple does have that, except to have more than 30 days of activity you have to upgrade to a pro plan that costs 300 bucks a year. It's kind of pricey.
So we don't know what happened. Well, we do for the past 30 days. And so on June 19th, because I'm the only user, it says Adam deleted it. So I guess I did. It was not me. No, no, that was actually me. But that was... So the token which I deleted was one for the old infrastructure. There were two tokens. I see. Okay. So this happened, you know, a long... Do you know when, roughly? June 19th sounds about right. Can you assume, at least? June 19th sounds right. But a single token was deleted and we had two. Yeah. Okay.
So it shows a single token being deleted June 19th, at an abnormal time for me to do any deletions. I think Jerod as well. That was me. If this is central time zone, because that's where I'm at, in the site it's 7 in the morning. You know, 7:16 in the morning. I'm definitely not deleting things at that time, besides Zs in my brain. So I don't get up that early. That's all we know. Maybe you accidentally deleted two. It was a two-for-one deal that morning. It doesn't show on the activity log, though, so that's the good thing. Right. I would maybe push back on DNSimple support, and they can dig into it and, one, get a true git blame on this,
and then, two, see if it was maybe just an error on the platform side. Yeah. I don't think I've done anything with tokens, aside from maybe one of our GitHub access tokens was expiring, or they made a new one, and I think I rotated one token, but nothing to do with DNS, not in the last month, or six months. It'd be cool if certain things like this required consensus. You can delete it if Jerod also deletes it. Oh, it's like the nuclear codes. You gotta have two hands on the button. You'd have to do it at the same time. Or you could do it async by saying, okay, Gerhard, at his 7 in the morning time frame, because he's in London, deleted it. You get an email, Jerod, saying, Gerhard deleted this. Do you want to also, you know, have consensus on this deletion? And you have to go and also delete it too. Where it's like two people coming together to agree on the deletion of an access token. That's awfully draconian for a DNS access token, myself. That's why I think the nuclear codes make sense, you know. Like you're about to send a nuclear bomb, you've got to have consent. Yeah.
Starting point is 00:26:05 I think an access log is good enough. It would help in the DNSimple log to see which token has been deleted, like the name of the token. It doesn't say that. It's not very thorough. It just says access token delete. That would have helped. That's the event name.
And so some of the items in DNSimple have text associated with them, but this does not. It doesn't showcase the token or the first six characters or anything like that. It's just simply the event name. In this case, everything else is pretty thorough. Well, I think we're rat-holing on this particular incident. But the bigger picture thing, in addition to this, we've got to figure out what happened here and fix it, is how do we handle incidents in a better way? So I think this is a place where
I would love to have listeners let us know how you handle incidents. What are some good options? I know, Gerhard, you've been looking at a few platforms and solutions. Surely there's open source things. There's lots of ways that you can go about this. You could use existing tools. I mean, you set up kind of a notice for this particular thing, but that's not what you're talking about. Like, how do we track and manage incidents in a historical, communicable way? Exactly. I don't know. We don't know the best way of doing this, or even a good way. So what's a good way for listeners, if they have a great incident solution, or maybe they have one that they use at work but they hate it, like avoid this one. Is it Slack? Is it email? Is it tweets? What's the best way for listeners to feed back? Comments on the episode page, perhaps, on the website. Yeah, that is an excellent point. Yeah. So however you want to communicate, via Slack or, you know, even via Twitter, we are everywhere
these days, everywhere that works and is still available. Everywhere where you can get an emoji rendered, we're there. Exactly. The idea being that, I mean, there are a couple of things here. For example, one thing which this reminded me of is that we do not declare... and this is a bit of a chicken-and-egg situation, but we should absolutely manage the tokens on the DNSimple side with something like, for example, Kubernetes, why not, which continuously declares those.
Starting point is 00:28:19 Now, obviously, you still need the token that creates tokens. But if you have that, we should have the token that it needs to create. Now, I think that's a bit interesting because then what do you do from the perspective of security? It can't give itself access to everything and then delete all the DNS records. I mean, that's not good. So some thought needs to go there.
But the idea being that even with Fastly, for example, when we integrate, we still have manual things, manual integrations. We don't declare the configuration. That's something which I would like us to do more of. And maybe also have some checks that would... I mean, if you don't have DNS, or something isn't right, like in this case you don't have access to DNS, that's a problem, and you would like to know about it as soon as possible. So the token being deleted on the 19th and the failure only happening like two weeks later, almost end of June, that's not great, because it removes you from the moment that you've done something. Maybe it was me, maybe I deleted the wrong token by mistake, but I remember there were two. Who knows, maybe I saw two tokens and there was just one. And then when that happened, it makes sense, right, that two weeks later this thing starts failing. But because it took so long for this failure to start happening, it was really difficult to reconcile the two and to link the two together. Yeah. So where do those checks live in the system? Where would they live? I mean, not in Grafana, I wouldn't think. I don't know. I think it depends. So in Kubernetes, right, like you declare the state of your system,
not just the state of your system, but the state of the systems that your system integrates with. So you can have providers. I know that Crossplane has that concept of providers; it integrates with AWS, GCP. I don't think that it has a DNSimple provider, but we should have something that periodically makes sure that everything is the way it should
be. And Kubernetes has those reconciling loops. It's central to how it operates. So to me, that sounds like a good place. Monitoring... from a monitoring perspective, you can check things, that things are the way you expect them to be. But that is more like, when there's a problem, you need to work backwards from that. Where is the problem? Well, if you try to continuously create things, and if it doesn't exist, it will be recreated. If it exists, there's nothing to do. So that's more proactive. So I quite like that model. What does incident management give a team, though? Because I think this came about whenever you said,
well, hey, Fastly was down, we didn't expect it to be down. A majority, if not all, of the responsibility tends to fall on your shoulders for resuming uptime, which is incident management, right? Like a disruption in a service that requires an emergency response. You're there, you're our first and only responder. I suppose Jerod and I can step in in most cases, but you really hold the majority of the knowledge. Does incident management give you the ability to sort of share that load with other people, that may not have to know everything you do, but can step in? What does incident management, I guess, break down to be? Is it simply monitoring and awareness?
Is it action-taking? Are there multiple facets of incident management? So it has a couple of elements, but the element that I'm thinking about, based on your initial question, was having the concept of a runbook. So I know I have a problem. Great. I'm going to communicate my problem. So what do I do? And you codify those steps in something which is called a runbook. So for example, if Jerod had to roll the DNS token, what would he do? How does he approach it? It doesn't have to be me. But the problem is, as you very well spotted, that I am the one that has the most context in this area, and it would take Jerod longer to do the same steps. In the Makefiles, plural, we have how-tos. So how to rotate credentials, or how to rotate a credential. And it's a step-by-step process, like seven steps or four steps, however many it is now, how to basically rotate a specific credential. So we need something similar to that, but codified in a way that, first of all, there's an incident, and these people need to know about it, maybe including our listeners. Like, hey, we are down, we know we're down, we're working on it, we'll be back shortly. And then one of us, whoever is around, because maybe one of us is on holiday. And if I am on holiday, well, what do you do? What are the steps that you follow to restore things? And as automated as things are, there are still elements, I mean, right? Not everything is automated, because it's not worth automating everything, or it's impossible. So what are the steps that Jerod, or even you, can follow to restore things? Or anyone, for that matter, that has access to things, anyone trusted. Yeah. And if it's that simple, then maybe we can automate that. Some things aren't worth automating, because if you run it once every five years,
Starting point is 00:33:05 well, why automate it? The ROI just doesn't make sense. It seems like it's pretty complex to define for a small team. Maybe easier for larger teams, but more challenging for smaller teams. But I know that there are incident management platforms out there.
Starting point is 00:33:22 Can we name names? I have two. Name names. So one of them is FireHydrant. The other one is Incident.io. I looked at both. And I know that FireHydrant for a fact has the concept of runbooks. So we could codify these steps in the runbook. I don't know about Incident.io, but if they don't have one, or if they don't have this feature, I think they should, because it makes a lot of sense. If we had this feature, we wouldn't need to basically find a way to do this or work around the system. The system exists and facilitates these types of approaches, which makes sense across the industry, not just for us. So even though we're a small team, we still need to communicate these sorts of things somehow
Starting point is 00:34:06 and in a way that makes sense. And if we use a tool... What's an example of a runbook then? Let's say for our case, Fastly, the Fastly outage, which is a once in five... They're not going to do that in the next five years. I'm knocking on wood over here. Remember my certainty?
It would be smarter than... 100% uptime? Next week, Fastly goes down. Exactly. Don't jinx it. Well, you know, given their responsibility and size, they're probably going to be less likely to do that again anytime soon, is kind of what I mean by that. So, but even that, would you codify a Fastly outage in a runbook? I think I would. Now you might, because you have this hindsight, you know, of recent events. But, you know, prior to this, you probably wouldn't. So what's a more common runbook for a team like us? I think I would codify the incidents that happen. So, for example, if we had an incident management platform, when the Fastly incident happened, I would have used whatever the platform
Starting point is 00:35:05 or whatever this tool offered me to manage that incident. And then as an outcome of managing the incident, we would have had this runbook. So I wouldn't preemptively add this. I see. So it's retrospective. An incident happens, it doesn't happen again.
Well, it may. Gotcha. Yeah, this is what I've done to fix it, right? And anyone can follow those steps. And maybe if something, for example, happens a couple of times, then we create a runbook. But at least Jerod can see,
Starting point is 00:35:32 oh, this happened like six months ago. This is what Gerhard did. Maybe I should do the same. I don't know. Like, for example, in the case of this DNS token, what are the steps which I'm going to take to fix it? So capturing those steps somewhere in a simple form, right? Like literally, as I do it, I do this and I do that. And that is stored somewhere and can be retrieved at a later date. I guess then the question is, when the
incident happens again, how does somebody know where to go look for these runbooks? I suppose if you're using one of these services, it gets pretty easy, because like, hey, go to the service, right? And there's a runbooks dashboard, so to speak. I think it's just specific to the service, but yeah. And you go there, and you're like, oh man, there's never been a runbook for this. I'm screwed. Call Gerhard, or call so-and-so, you know? Yeah, I suppose. But I think if you operate a platform
Starting point is 00:36:19 long enough or a system long enough, you see many, many things. And then you try to, I mean, it just progresses to the point that, let's imagine that we did have multi-cloud. Let's imagine that, I know, Linode was completely down and the app was running elsewhere.
We wouldn't be down. And okay, we would restore, we'd be in a degraded state, but things would still be working. If we had multi-CDN: Fastly's down, well, Cloudflare's up. It rarely happens that both are down at the same time. So then it's degraded, but it still works. So it's not completely down. In this case, for example, we didn't have this. But right now, if the backend goes away, if everything disappears, we can recreate everything within half an hour. Now, how would you do that? It's simple for me,
Starting point is 00:37:03 but if I had to do it maybe once and codify it, which is actually what I have in mind for the 2022 setup, I will approach it as if we've lost 2021 and I have to recreate it. So what are the steps that I'll perform to recreate it? And I'll go through them, I'll capture them. Because 2021 is kind of a standard and you're codifying the current golden standard.
Starting point is 00:37:24 The steps that that would take, yes, to set up a new one. Yeah, exactly. To get to zero where you're at right now. This is ground zero. And 2021, when I set up, was fairly easy to stand up because I changed these things inside the setup so that, for example, right now, the first step, which it does, it downloads from backup everything it doesn't have.
So if you're standing this up on a fresh setup, it obviously has no assets, no database. So the first thing which it does, it will pull down the backup; from the backup, it will pull everything down. And that's how we test our backups. Which is smart, because the point of a backup is restoration, not storage. Exactly. So we test it at least once a year now.
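In outline, the restore-on-first-boot behaviour is: if this instance has no data, pull the latest backup down and restore it before the app starts, which is also what exercises the backups. A sketch under assumed bucket names, paths, and a pg_restore-based database restore:

```python
import pathlib
import subprocess

import boto3

BUCKET = "changelog-backups"            # hypothetical backup bucket
UPLOADS = pathlib.Path("/uploads")      # assumed media directory
DB_URL = "postgres://localhost/changelog"

def latest_backup_key(s3) -> str:
    """Newest database dump under the assumed db/ prefix."""
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix="db/")["Contents"]
    return max(objects, key=lambda o: o["LastModified"])["Key"]

def restore_if_empty() -> None:
    if UPLOADS.exists() and any(UPLOADS.iterdir()):
        return  # data already present, nothing to do
    s3 = boto3.client("s3")
    s3.download_file(BUCKET, latest_backup_key(s3), "/tmp/backup.dump")
    # Restoring on every fresh stand-up is also how the backup gets tested.
    subprocess.run(
        ["pg_restore", "--no-owner", "--dbname", DB_URL, "/tmp/backup.dump"],
        check=True,
    )

if __name__ == "__main__":
    restore_if_empty()
```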
You know, what's important to mention here is that this may not be what every team should do. In many cases, this is exploration on our part; it's not so much what every team should do in terms of redundancy. We're doing it in pursuit of, one, knowledge, and two, content to share. So we may go forge new ground on the listener's behalf. And hey, that's why you listen to the show. And if you're not subscribed, you should subscribe. But we're doing this not so much because our service is so important that it must be up at all times. It's because the pursuit of uptime is kind of fun, and we're doing it as content and knowledge.
So that's, I think, kind of cool. Not so much that everyone should eke out every ounce of possible uptime. It's just, in some cases, it's probably not wise, because you have product to focus on, or other things. Maybe you have a dedicated team of SREs, and in that case, their sole job is literally uptime. And that totally makes sense. But for us, we're a small team. And so maybe our seemingly, you know, unwavering focus on uptime is not because we're so important, but because it's fun for content and knowledge to share. And it makes us think about things in a different way.
So if you try something out, why are you trying something out? Well, we have a certain problem to address, and it may be a fun one, but we will learn. So it's this curiosity, this built-in curiosity. How does Incident.io work? How does FireHydrant work? What is different? What about Render? What about Fly?
Starting point is 00:39:33 They look all cool. Let's try it out. What would it mean to run changelog on these different platforms? Some are hard, some are that simple. And sometimes you may even be surprised, say, you know what? I would not have guessed this platform is so much better. So why are we complicating things using this other thing? But you don't know until you try it. And you can't be trying these things all the time. So you need those innovators that are out there. And if, for example, we have something stable that
we depend on, something that serves us well, we can try new things out in a way that doesn't disrupt us completely. And I think we have a very good setup to do those things. This reminds me of Sesame Street. Either of you watch Sesame Street? Not that I remember. Of course, everybody knows Sesame Street. But my son is a year and a half old, so he watches Sesame Street. And something that Hailee Steinfeld sings on the show is: I wonder, what if, let's try, right? And that's kind of what we're doing here. It's like, I wonder how this would work out if we did this. What if we did that?
Starting point is 00:40:35 Let's try. I think that's how all great ideas start. The majority may fail. The majority of the ideas may fail. But how are you going to find the truly remarkable ideas that work well in practice? Because on paper, everything is amazing. Everything is new. Everything is shiny. How well does it work in practice? And that's where we come in, right? Because if it works for a simple app that we have, which serves a lot of traffic,
it will most probably work for you too. Because I think the majority of our listeners, I don't think they are the Googles or the Amazons. Maybe you work for those companies, but let's be honest, it's everybody part of that company that contributes to some massive systems that very few have. It's all about gleaning, really. Like, we're doing some of this stuff, and the entire solution, or the way we do it, may not be pertinent to the listener in every single case, but it's about gleaning what makes sense for your case. The classic "it depends" comes into play. Like, this makes sense to do in some cases. Does it work for me? It depends. Maybe, maybe not. What's up, shippers? This episode is brought to you by Sentry. You already know working code means happy customers, and that's exactly why teams choose Sentry. From error tracking to performance monitoring, Sentry helps teams see what actually matters, resolve problems quicker, and learn continuously about their applications from the front end to the back end. Over a million developers and 70,000 organizations already ship better software faster with Sentry. That includes us. And guess what? You can too.
Ship It listeners new to Sentry get the team plan for free for three months. Use the code SHIPIT when you sign up. Head to sentry.io and use the code SHIPIT. So I would like us to talk about the specifics, three areas of improvement for the changelog.com setup, not for the whole year 2022, but just over the next couple of months. Top of my list is the incident management. So I have some sort of incident management, but that seems like an on-the-side sort of thing. And we've already discussed that at some length. The next thing is, I would like to integrate Fastly logging, this is the origin, the backend logging, with
Grafana Cloud. The reason why I think we need to have that is to understand how our origin, in this case Linode LKE, where changelog.com runs, how does the origin behave from a Fastly perspective, from a CDN perspective? Because that's something that we have no visibility into. So what I mean by that is, when a request hits Fastly, that request has to be proxied to a NodeBalancer running in Linode, and that has to be proxied to ingress-nginx running in Kubernetes, and that has to be proxied eventually to our instance of changelog. How does that work? How does that interaction work? How many requests do we get?
Starting point is 00:43:52 How many fail? When are they slow? Stuff like that. So have some SLOs, uptime as well, but also how many requests fail and what is the 99th percentile for every single request. That's what I would like to have. How hard is that to set up? Not too hard.
Starting point is 00:44:09 The only problematic area is that Fastly doesn't support sending logs directly to Grafana Cloud. So I looked into this a couple of months ago, and the problem is around authenticating the HTTPS origin where the logs will be sent, right? Because it needs to push logs, HTTP requests. So how do we verify that we own the HTTPS origin, which is Grafana Cloud?
Starting point is 00:44:35 Well, we don't. So we don't want to DDoS any random HTTPS endpoint because that's what we would do if we were to set this up. So we need to set up, and again, this is like in the support ticket with Fastly, what they recommend is you need to set up a proxy. So imagine you have NGINX, it receives those requests,
which are the Fastly logs, it'll be HTTPS, and then it proxies them to Grafana Cloud. So that would work. Where would we put our proxy? Well, we would use the ingress-nginx on Kubernetes, the one that serves all the traffic, all the changelog traffic.
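A toy version of that proxy idea, to make the shape of it concrete: accept the HTTPS log pushes from Fastly and forward each line to Grafana Cloud's Loki push API. The endpoint, credentials, and one-line-per-row framing are assumptions; the real thing would be nginx (or the existing ingress) doing the forwarding, with proper TLS and authentication in front:

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

LOKI_URL = "https://logs-prod-us-central1.grafana.net/loki/api/v1/push"  # example host
LOKI_AUTH = ("123456", "glc_example_api_key")  # hypothetical Grafana Cloud credentials

class LogRelay(BaseHTTPRequestHandler):
    def do_POST(self):
        # Fastly pushes batches of log lines in the request body (framing assumed here).
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        now_ns = str(time.time_ns())
        payload = {
            "streams": [{
                "stream": {"source": "fastly", "service": "changelog"},
                "values": [[now_ns, line] for line in body.decode().splitlines() if line],
            }]
        }
        resp = requests.post(LOKI_URL, json=payload, auth=LOKI_AUTH, timeout=10)
        self.send_response(resp.status_code)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), LogRelay).serve_forever()
```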
Well, couldn't we DDoS ourselves then? We could. If Fastly sends a large amount of logs, yes, we could. Now, would we set up another? It's not a DDoS if it's ourselves, it's just a regular DoS. It's not going to be distributed, it's just us. Yeah, well, it will come from all Fastly endpoints, I imagine. That's true. It could come from a lot of Fastly points of presence. Yeah. We could run it elsewhere, I suppose, but I like things being self-contained. I like things being declared in a single place, right? So to me, it makes more sense to use the same setup. I mean, it is in a way a Fastly limitation, right? And actually, specifically Fastly and Grafana Cloud, that lack of integration that we have to work around. But speaking of that, I know that Honeycomb supports Fastly logging directly. And one of the examples that Honeycomb has is the RubyGems.org traffic, which is also proxied by Fastly. So in there, like, try Honeycomb out, you can play with the dataset, which is the RubyGems.org traffic. So I know that that integration works out of the box. And that's why maybe that would be an easier place to start. Just a place to start, yeah. But then we're using Grafana Cloud for everything else. So that's an interesting moment. Like, do we start moving stuff across to Honeycomb, or do we have things in two systems? Right. That's like a little break in the dam, you know. A little bit of water just starts to pour out, and it's not a big deal right now on Grafana Cloud, right? Yeah. Well, they've got just a little thing over here, Honeycomb. Yeah, turns out it's pretty nice over there. It starts to crack a little bit, and more water starts to come out, and all of a sudden it just bursts, and Grafana loses a customer. That stuff happens. We could also parallelize this, and we could simultaneously try to get Fastly and Grafana sitting in a tree, K-I-S-S-I-N-G. You know, their integrations. Because that would be great, right? Yeah, that would be great. That is actually a request from us.
Starting point is 00:47:10 And that would probably be in the benefit of both, I think both Fastly and Grafana, that would be in both entities to their benefit. So maybe that's already in the works. Who knows? I would guess that it is. Well, I would like to know, because then we could be not doing a bunch of work. We could just procrastinate until it's there.
Right. Yeah. It's stuff like this, right? Let's put an email feeler out. We've got some people we can talk to, to know for sure. And then if it is in the works and it's maybe on the back burner, we can put some fire under the burner, because we
need it too. Well, then we've hit another interesting point, in that I really want to try Honeycomb out. I've signed up, and I want to start sending some events their way, and just start using Honeycomb to see what insights we can derive from the things that we do. One of the things that I really want to track with Honeycomb, and this is like, I wasn't expecting to discuss this, but it seems to be related, so why not, is: I want to visualize how long it takes us from Git push to deploy, because there are many things that happen in that pipeline. And from the past episodes, this is really important. This is something that teams are either happy or unhappy about.
The quicker you can see your code out in production, the happier you will be. Does it work? Well, you want to get it out there quickly. Right now, it can take anywhere between 10 and 17, 18 minutes, even 20, because it depends on so many parts. Like CircleCI, sometimes the jobs are queued. The backups that run, well, sometimes they can run 10 seconds more. The caches that we hit in certain parts, like images being pulled, whatever, they can be slower, or they can be cold and have to be warmed up. So we don't really know. First of all, I mean, in my head I know what all the steps are, but you and Jerod don't know what the Git push has to go through before it goes out into prod, and what are all the things that may go wrong, and then which is the area, or which is the step, which takes the longest amount of time and is also the most variable. Because that's how we focus on reducing this time to prod. And Honeycomb, I mean, they're championing this left, right and center. I mean, Charity Majors, I don't know which episode, but she will be on the show very, very soon. 15 minutes or bust. That's what it means. Like, your code is either in production in 15 minutes, or you're bust. There was an unpopular opinion shared on Go Time. I can't remember who shared it, but he said, if it's longer than 10 minutes, you're bust. There you go. So that 15 minutes is going to be moving, I think. It will be moving. As the industry pushes forward, it's going to keep going lower and lower, right? Exactly. Well, what is it that does, I suppose, every Git push, which is from local to presumably GitHub in our case, could be another code host. Is there a way to scrutinize, like, oh, this is just views and CSS changing, to make that deployment faster? If it's not involving images or a bunch of other stuff... Why does a deployment of, let's just say it's a typo change in an HTML file, and a dark style added to the page for some reason, whatever. If it's just simply a CSS or an .ex file change in our case, could that be faster? Is there a way to have a smarter pipeline? These are literally just an HTML and CSS update. Of course, you're going to want to minimize or minify that CSS that Sass produces in our case, etc., etc., but 15 minutes is way long for something like that. You're right. So the steps that we go through, they're always the same. We could make the pipeline smarter in that, for example, if the
Starting point is 00:50:50 code doesn't change, you don't need to run the tests. The tests themselves, they don't take long to run, but to run the tests, you need to get the dependencies. And we don't distinguish like if the CSS changed, you know what? You don't need to get dependencies. So we don't distinguish between the type of push that it was because then you start putting smarts. I mean, you don't need to get dependencies. So we don't distinguish between the type of push that it was because then you start putting smarts. I mean, you have to declare that somehow. You have to define that logic somewhere. And then maybe that logic becomes, first of all,
Starting point is 00:51:15 difficult to declare, brittle to change. What happens if you add another path? What happens if, for example, I don't know, you've changed a Node.js dependency, which right now we use, and then we remove Node.js, and then we compile assets differently. And then, by the way, now you need to watch that
because the paths... I mean, the CSS you just generated actually depends on some Elixir dependencies. I don't know. I think esbuild, we were looking at that, or thinking about it. You effectively introduce a big cache invalidation problem. Yes, that's what you do. Yeah, cache invalidation is one of the hard things in computer science. So it's slow, but it's simple. It's like, we just rebuild it every time. It's like, why does React re-render the entire DOM every time? Well, it doesn't anymore, because that was too slow. So it's like
Starting point is 00:52:03 does all this diffing and stuff but there's like millions and millions of dollars in engineering spent and figuring out how react is going to smartly re-render the dom right it's the same thing it's like there's so many little what-ifs once you start only doing and this is why gatsby spent years on their feature which is what partial builds because building on gatsby site which is a static site generator right building a 10 000 page static site with gatsby was slow just i'm just made up the word 10 000 but you know 100 000 whatever the number is was slow and so it's like well couldn't we just only build the parts that changed, right? Like what Adam just said. It's like, yeah, we could. But then they go and spent two years building that feature
Starting point is 00:52:50 and VC money and everything else to get that done. So it's like a fractal of complexity. Yeah. I'm saying there's small things you can do. You can get like the 80% thing and it works mostly and doesn't get you all, it doesn't squeeze out every performance, but it's a big, so there's probably some low hanging fruit you can do. You can get like the 80% thing and it works mostly and doesn't get you all, it doesn't squeeze out every performance, but it's a big,
Starting point is 00:53:08 so there's probably some low hanging fruit we could do, but it's surprisingly complicated to do that kind of stuff. And the first step really is trying to understand these 15 minutes. First of all, how much they vary, because as I said, sometimes they can take 20 minutes. Why does it vary by that much? Like maybe, for example, it's test jobs being queued up in CircleCI. A lot of the time that
Starting point is 00:53:32 happens and they are queued up for maybe five minutes. So maybe that is the biggest portion of those 20 minutes or 15 minutes, and that's what we should optimize first. Yeah, that's why I said there's probably some low-hanging fruit. We can probably do a little bit of recon and knock that down quite a bit. And that's exactly why I'm thinking, like, use Honeycomb, just to try and visualize those steps, what they are, how they work, stuff like that. Exactly. Good idea. Second thing is a managed PostgreSQL database, so that's either CockroachDB or anyone that manages PostgreSQL, like one of our partners, one of our sponsors. I would like us to offload that problem, and we just get the metrics out of it to understand how well it behaves, what can we
Starting point is 00:54:19 optimize, stuff like that in our queries. But otherwise, I don't think we should continue hosting PostgreSQL. I mean, we have a single instance. I mean, it's simple, really, really simple. It backups. I mean, it's no different than SQLite, for example, the way we use it right now, but it works. We didn't have any problems since we switched from a clustered PostgreSQL to single node PostgreSQL, not one. We used to have countless problems before when we had a cluster. So it's hard, is what I'm saying. What we have now works, but what if we remove the problem altogether?
Starting point is 00:54:50 I remember slacking, how can our Postgres be out of memory? It's like, well, wasn't it with the backup? The backup got something happened with the backup. Or the wall file. The wall file. The replication got stuck and it was like broken. It just wouldn't resume.
Starting point is 00:55:04 And the disk would fill up. Crazy, crazy, crazy. And that's the reason you would want to use a Minj because they handle a lot of that stuff. Exactly. And if it can be distributed, then that means that we can run
Starting point is 00:55:15 multiple instances of our app. Was it not for the next point, which is an S3 object store for all the media assets instead of local disk. Right now, when we restore from backups, that's actually what takes the most time because we have like 90 gigs at this point. So restoring that will take some number of minutes. And I think moving to an S3 one and a managed PostgreSQL,
Starting point is 00:55:37 which we don't have, we can have multiple instances of changelog. We can run them in multi-cloud. I mean, it opens up so much possibility if we did that that would be like putting all of our assets in s3 it'd be like welcome to the 2000s guys it would be right that's exactly right yeah you're you've now left the 90s maybe i should explain why we're using local storage some of it's actually just technical debt this was a decision i made when building the platform back in 2015 around how we handle uploads, not image uploads, but MP3 uploads, which is one of the major things that we upload and process. And these MP3s are anywhere also want to do post-processing, like post-upload processing on the MP3s because we go about rewriting ID3 tags and doing fancy stuff
Starting point is 00:56:31 based on the information in the CMS, not a pre-upload thing. So it's nice for putting out a lot of podcasts because if Gerhard names the episode and then uploads the file to the episode, the MP3 itself is encoded with the episode's information without having to duplicate yourself. So because of that reason, and because I was new to Elixir, and I didn't really know exactly the best way to do it in the cloud, I just said, let's keep it simple. We're just going to upload the files to the local disk. We had a big VPS with a big disk on it and
Starting point is 00:57:09 don't complicate things. And so that's what we did. And knowing full well, even back then I had done client work where I would put their assets on S3. It's just because this MP3 thing and the ID3s we run FFmpeg against it and how do you do that in the cloud, etc. So that was the ID3s, we run FFmpeg against it. And like, how do you do that
Starting point is 00:57:25 in the cloud, et cetera. So that was the initial decision-making and we've been kind of bumping up against that ever since. Now the technical debt part is that our image upload image, our assets uploader library in Elixir that I use is pretty much unmaintained at this point it's a library called arc and in fact the last release was cut version 0.11 in october of 2018 so it hasn't changed and it's a bit long in the in the tooth is that a saying long in the tooth i think it is and um i know it's warts pretty well i've used very successfully, so it serves us very well. But there's technical debt there. And so as part of this, well, let's put our assets on S3 thing,
Starting point is 00:58:12 I'm like, let's replace ARC when we're doing this because I don't want to retrofit ARC. It does support S3 uploads, but the way it goes about shelling out for the post-processing stuff, it's kind of wonky. I don't totally trust it and so i would want to replace it as part of this move and i haven't found that replacement or do i write one etc so it's kind of like that where it's just slightly a bigger job than you know reconfiguring arc just to push to s3 and doing one upload and being done with it. But it's definitely time.
Starting point is 00:58:46 It's past time. So I'm with you. I think we do it. Yeah, I think that makes a lot of sense. And this just basically highlights the importance of discussing these improvements constantly. So stuff that keeps coming up, not once, but like two years in a row,
Starting point is 00:59:01 it's the stuff that really needs to change, right? Unless you do this constantly, you don't realize exactly what the top item is, because some things just change, right? It stops being important. But the persistent items are the ones that I think will improve your quality of software, your quality of system, service, whatever you have running. And it's important to keep coming back to these things. Is this still important? It is. Okay, so let's do it. But you know what? Let's just wait another cycle. And then eventually you just have to do it. So I think this is one of those cases and we have time to think about this
Starting point is 00:59:33 and what else will it unlock? If we do this, then we can do that. And is it worth it? Maybe it is. And I think in this case, this S3 and the database, which is not managed, have the potential of unlocking so many things for us. Simplifying everything. Well, the app becomes effectively stateless, right? It does. How amazing is that? And then you're basically in the cloud world where you can just do whatever you want.
Starting point is 00:59:54 That's exactly it. Life is good. That's exactly it. And then face all new problems that you didn't know existed. True. Does this Arc thing, does it also impact the chaptering stuff we've talked about in the past year?
Starting point is 01:00:04 Wasn't that also part of it there is an angle into that so for the listener that the chaptering so at the mp3 spec actually it's the id3 version 2 spec which is part of the way mp3s work it's all about the headers supports chaptering id3 v1 not. ID3V1 is very simple. It's like a fixed frame kind of a thing. And ID3V2 is complicated more so, but has a lot more features. One of which is chaptering, which chapters are totally cool. You know, ship it's roughly three segments. Well, we could like throw a chapter in into the MP3 for each segment. And if you want to skip to segment three real fast, you could, we would love to build that into our platform. Cause then we could also represent those chapters on the webpage, right? So you can have like
Starting point is 01:00:47 timestamps and click around lots of cool stuff. Unfortunately, there's not an ID three V two elixir library. And the way that we do our ID three tags right now by way of arc is with FFM peg. So we shell out to FFM peg and we tell FFFmpeg what to do to the mp3 file and it does all the magic, the ID3 magic, and then we take it from there. So the idea was, well, if we could not depend on FFmpeg, first of all, that simplifies our deploys because we don't have a dependency that's like a Linux binary. Oh, it's a small thing. But we'd be able to also do chaptering. So we get some features as well as simplify the setup. And that is only partially to do with Arc.
Starting point is 01:01:30 Really, that has to do with the lack of that ID3v2 library in Elixir. Like that functionality does not exist in native Elixir. And so I could, if it did, I could plug that into Arc's pipeline and get that done currently. Now, if FFmpeg supported the feature, we wouldn't need it anyway. and get that done currently now if FFmpeg supported the feature we wouldn't need it anyway we would just do it in the FFmpeg but it does not it doesn't seem like it's something that they're interested in because the mp3 chaptering is not like a new whiz bang feature like
Starting point is 01:01:56 it's been around for a decade maybe more so the fact that it doesn't exist in FFmpeg which is if you've ever seen is like one of the most featureful tools in the world I mean FFmpeg, which is, if you've ever seen, is like one of the most featureful tools in the world. I mean, FFmpeg is an amazing piece of software that does so many things, but doesn't support
Starting point is 01:02:11 MP3 chaptering. So that's kind of a slightly related but different initiative that I've also never executed on. I just wondered if we had to bite the arc tail off or whatever that might seem like to also get a win, you know, along with that. And the win we've wanted for years essentially was being able to bake in some sort of chaptering maker into the CMS backend
Starting point is 01:02:31 so that we can display this on pages, you said, or in clients that support it because that's a big win for listeners. And for obvious reasons that Jared just mentioned, that's why we haven't done it. It's not because we don't want to. It's because we haven't technically been able to. So if this made us bite that off, then it could provide some team motivation.
Starting point is 01:02:48 Like we get this feature too, and we get this stateless capability for the application. It just provides so much action. Yeah, and one way I thought we could tackle that, which doesn't work with our current setup, is, I mean, we render the MP3s, or we mix down the MP3s, locally on our machines. Then we upload them to the site, right?
Starting point is 01:03:07 We could pre-process the chapters locally. We could add the chapters locally to the MP3, then upload that file. And if we could just write something that reads ID3V2, it doesn't have to write it. We could pull that out of the MP3 and display it on the website. And that would be like a pretty good compromise.
Starting point is 01:03:29 However, when we do upload the file, when you pass it to FFmpeg and tell it to do the title and the authors and the dates and all that, well, it just completely blows away your local ID3 tags. It overwrites them. As I was listening to you talking about this, one of the things it reminded me of is the segments in YouTube videos, which sometimes I really like, because I can skip to specific topics really easily. So rather than
Starting point is 01:03:59 having fixed like beginning, middle and end, you can have topic by topic and you can skip to the specific parts. I would love to see that in changelog audio files. That's the feature right there. Like you use it however you want to use it. So like the obvious way is like, well, there's three segments, I'll put three chapters in. But if you were in charge of doing your own episode details and you could put the chapters in the way you'd want to,
Starting point is 01:04:21 yeah, you could make it really nice just like that. And for clients that support it it it is a spectacular feature now a lot of the popular podcast apps don't care like spotify is not going to use it apple podcast historically has not used it so like they're they basically don't exist but the indie devs tend to put those kind of features in like the overcasts the castros the i'm not sure pocket cast is into anymore but like those people who really care about the user experience of their podcast clients they support chaptering and for the ones that do it's a really nice feature yeah i love that the other thing that i would really like is when i write blog posts i could just drag and drop files
Starting point is 01:05:02 as i do in github and just get them automatically uploaded to S3. Because right now I have to manually upload them. You and me both. And then reference them. It's so clunky. I would love that feature. You're exposing our ad hoc-ness. Come on now.
Starting point is 01:05:16 We literally open up Transmit or whatever you use to manage S3 buckets, and we drag and drop them, and then we copy URL. But first you have to make it readable by the world. Don't that part and then put the link into your blog post that's that's you can go away and figure that on the bucket so that all files are readable really we do have that we do have that on the button i didn't know about that but it still sucks it does suck but one thing which i do for these for these episodes for the It ones, I take a screenshot, by the way, took very good screenshots of all three of us.
Starting point is 01:05:48 And I put them in the show notes. I saw that. You're the first one to do that. So, again, you're pushing the envelope of Changelog Podcasts and probably pushing us towards features that I would normally just completely put off over and over again. See what happens when people come together and talk about what could improve? Well said. So what I propose now is that we go and improve those things. see what happens when people come together and talk about what could improve well said so what i propose now is that we go and improve those things and come back in 10 episodes how does that sound sounds good kaizen kaizen that's it for this episode of ship it thank you for tuning in
Starting point is 01:06:22 we have a bunch of podcasts for developers at Changelog that you should check out. Subscribe to the master feed at changelog.com forward slash master to get everything we ship. I want to personally invite you to join your fellow changeloggers at changelog.com forward slash community. It's free to join and stay. Leaving, on the other hand, will cost you some happiness credits. Come hang with us in Slack. There are no imposters. Everyone is welcome.
Starting point is 01:06:52 Huge thanks again to our partners, Fastly, LaunchDarkly, and Minote. Also, thanks to Breakmaster Cylinder for making all our awesome beats. That's it for this week. See you next week. Game on
