The Changelog: Software Development, Open Source - Inside 2021's infrastructure for Changelog.com (Interview)
Episode Date: May 21, 2021. This week we're talking about the latest infrastructure updates we've made for 2021. We're joined by Gerhard Lazu, our resident SRE here at Changelog, talking about the improvements we've made to 10x our speed and be 100% available. We also mention the new podcast we've launched, hosted by Gerhard. Stick around for the last half of the show for more details.
Transcript
This week on the ChangeLog, it's that time again.
We're talking about the latest infrastructure updates we've made for 2021.
We're joined by Gerhard Lazu, our resident SRE here at Changelog,
talking about all the improvements we made to 10x our speed and be 100% available.
Also, we announced our newest podcast we're launching, hosted by Gerhard.
So, stick around the last half of the show for more details and how to subscribe.
Of course, huge thanks to our partners Fastly,
Linode, and LaunchDarkly.
We love Linode.
They keep it fast and simple.
Check them out at linode.com slash changelog.
Our bandwidth is provided by Fastly.
Learn more at fastly.com
and get your feature flags powered by LaunchDarkly.
Check them out at launchdarkly.com.
This episode is brought to you by Linode.
Gone are the days when Amazon Web Services was the only cloud provider in town.
Linode stands tall to offer cloud computing developers trust,
easily deploy cloud compute, storage, and networking in seconds
with a full-featured API, CLI, and cloud manager
with a user-friendly interface.
Whether you're working on a personal project
or managing your enterprise's infrastructure,
Linode has the pricing, scale, and support you need
to launch and scale in the cloud.
Get started with $100 in free credit
at linode.com slash changelog. Again, linode.com slash changelog.
We're back with Gerhard Lazu, our resident SRE.
What's up, Gerhard?
It's all good.
It's actually 10 times better.
Our website is, I hope so.
That's the title of the show.
It's 10 times better.
I like 10 times anything.
Are you a 10X SRE or what's going on here?
That's exactly what it is.
It's a 10X.
That was the theme for this setup. It has to be 10 times something. It doesn't matter what that
10 times is. It's 10 times something, like an order of magnitude better. And it is. Guess what?
It is. Nice. So it couldn't have been 10 times slower to deploy or 10 times longer response
times. None of that. It had to be 10 times better.
Well, for those who haven't listened
to the annual ChangeLog infrastructure episode,
welcome, you are here.
This hasn't been a whole year.
It's been a half a year,
so it's now, I guess, semi-annual.
But we worked faster this time around,
didn't we, Gerhard?
We did, because we had the basics covered really well, and the base was so good that iterating was super simple.
Yeah. And what we iterated on was basically what mattered the most: uptime and response latency. We had a couple of tricks up our sleeve. I think it was combined: I had one, you had one, we put them together, and yeah, we did it faster and we did it better this year.
Not much has changed, actually. So I think that's almost what everybody wants: introduce a little change, not much change, but make it so much better. Which we did. Fine-tuning; there's details in the fine-tuning that make things faster, and that's where you've got to optimize.
Yeah, I think it takes a while to learn your system, to properly learn all the components. And then, when you're comfortable with all the components, figure out which is the smallest change that you can make for the biggest improvement. And that's what we did.
Yeah. Shall we spoil it? I mean, if someone just wants to listen to five minutes, we can spoil it and they can...
No, let's tease it. Let's tease it. Let's hold it back.
Let's tease it, all right.
Hold it back.
We'll tease it.
Stick around, listener.
Yeah.
Let's start with this.
Not much changing this time around.
A lot changed last time around.
So our 2020 episode, which came out last October,
was a big change.
A lot going on.
And some of the reaction to that episode was,
and we're on Kubernetes now.
And it's like, hey, guys, you run a three-tier website, right? You have a database and an
application server and Nginx or whatever. Kubernetes is way overkill. So let's start there.
Gerhard, what do you think about that? Do you agree with that?
Not really. And this is like... that's a really controversial part.
I assume you're going to say that, because you're the one that set it up.
Right. So I think that's a very simplistic view, because you're right, when you boil it down, that's exactly what you have, right? It's just a Phoenix app, it's a web app, and your database, you have a proxy maybe, and that's about it, right? That's what you have. But it's almost like the iceberg,
right? It's like the thing that you see at the top and there's everything else behind or below
the sea level or the sea line. So what else do we have below? Well, you have certificates,
you have load balancers, you have DNS, you have code updates, you have tests, you have CI, CD, you have dependencies,
you have dependencies of dependencies, and the list goes on and on and on. And things are changing
all the time. So given you have so many things, how do you manage that? And usually what happens,
you don't. You just go with the flow, right? Let's say you don't care about your CDN integration.
Just tick a box and assume everything just works.
And most of the time it does, but when it breaks, do you even know that it broke?
What about the monitoring?
How do you manage the monitoring?
And again, it just goes from there because you're running a production system, a production
system that is serving a lot
of traffic, which changelog.com does. And even though it's a simple app, I think it's simple because we made deliberate choices. It could be a microservices architecture; we didn't choose that. But the fact that we don't have that doesn't mean we don't have all these things around it. Could you have one thing that manages all those things? Control plane is the term that many use today.
But that's what we kind of have.
We have a control plane that manages all the things.
And I say all the things, all the things that we could convert.
There's always more work that we could do.
And I think that's where the next improvements
are coming from for us.
We have a very solid base
and improving is really simple now.
And everything is like in a single place.
So you have this single thing,
which you can hold in your head.
Everything is automated.
Everything recovers.
And again, I don't want to spoil it too much,
but migrating from the 2020 setup to the 2021 setup, in terms of time, we could perform a live migration in 27 minutes,
from nothing to everything. How cool is that? Did you already know all the Kubernetes stuff?
Like, so when people think about setting up a Kubernetes cluster, they talk about the complexities of the API perhaps, or the tooling, or the ecosystem. I always think back to the CNCF's... it's not a roadmap, what is that? It's like a trail.
Yeah, the landscape.
And there's just all of these words, and I don't know any of them, and each one of those is like a complex piece of software, right? And I get overwhelmed. You got this rolled out. I'm just curious,
was there a Kubernetes learning curve for you or had you already done that previously? And so when
you started helping us, you already understood what you were doing. Because I think a lot of
the cost for people, they're like, well, is this worth doing for me or not? It's like, well, do I
have to learn all the Kubernetes things or do I have somebody who knows that I'm already?
So I'm just curious where you're coming from.
So I had some knowledge,
but it was mostly basic.
But the thing to understand
is that I have been doing infrastructure
for, I don't want to say decades
because that's like bragging,
but let's just say, a really long time.
So we were joking about webmasters.
I used to be one.
CGI bins... oh yes, baby. Those were the good old times.
I remember CGI bins. I wouldn't describe them as the good old times, but...
Well, they were better than... it's perspective, like pink glasses and all that, you know; you remember the past much better than it actually was, right?
There's an element of that. So I've been doing this for a really long time, and I can appreciate the cycles that we went through, and we had many, many cycles. I've learned to learn on the job, and if you optimize for that, there's nothing new that is too daunting. I mean, it's exciting, you'll make mistakes. But after you've been over,
I don't know, six, seven cycles, they come and go. Remember Ruby on Rails? Oh man, those were
the good old days. Phoenix, I think, captures some of that. The point being that even though
I didn't know, I kind of knew how to navigate that landscape. And you're right, if your baseline is like zero and you have little experience, it is daunting, and you would want a curated experience. But if you have seen these new technologies emerge, and you know kind of where you are in the cycle, like are you on the uptrend, whereabouts are you in that... um, the law of innovation of diffusions?
The law of diffusion of innovations.
Sounds better.
What is it?
Law of fusion innovation.
That's it.
What's that?
So, early adopters... it's basically any new thing. Whenever you're introducing it, you have to focus on the first 2.5%, the early adopters.
Oh, this is like the curve of people who are going to adopt, that starts with the enthusiasts and it goes to the...
Exactly, early majority. The spread of a new idea.
Exactly. And Kubernetes right now, I would say it's in the late majority. It's not laggards, you can still not do Kubernetes, but I think it's the late majority now. So we waited for it long enough before we went into Kubernetes. I would say we were towards the end of the early majority that adopted it. That's what I think. So a lot of the components were fairly mature, and while mistakes could be made, it was more difficult. And our hosting provider, right? You know, because that's how it all started. Let's get some VPSs, remember those days?
And then VMs and then cloud instances.
So they offer a managed Kubernetes service.
And that was the thing which we were waiting for
so that we wouldn't need to worry about the control plane,
about, you know, etcd and certificates and the integration with the IaaS.
So all that stuff was abstracted away from us.
Once we had that, we had the building blocks.
And we had to identify a couple of things,
but they were fairly well-defined.
cert-manager, ExternalDNS, ingress-nginx.
That was pretty much it.
And these were like fairly standard components
that have been improved over the course of a year, two years.
So we were just like after 1.0,
I think cert-manager was the only one which wasn't 1.0,
but then later on it was.
So the components were fairly mature.
There were so many blog posts and use cases
and mistakes that have already been made before us.
And what we wanted to do was fairly standard.
So there's nothing crazy.
Documentation was written.
We weren't those early adopters, or we were towards the late early adopters, and we were not the innovators, definitely not. So a lot of the stuff made sense and it was easy.
Now, having said that, it's not like there wasn't any pain, right?
Yeah, we still hit a couple of interesting things. Shall we go into that? What do you think?
Some interesting things that we've hit... okay. So some interesting things that we've hit were around the PostgreSQL operators.
We chose the Crunchy PostgreSQL operator first, and it was fairly hard to work with because of how complicated it is. It's doing so many things, has so many features, and the replication bit us, right? So we had a replicated PostgreSQL, and we had downtime because it was replicated.
You wouldn't expect that to happen.
Because it wasn't replicated, right?
Because it was replicated. We had downtime because it was replicated, but it stopped replicating.
Exactly, it stopped replicating. Okay, so it wasn't... which one was it?
No, no, no. So hang on. We had the replication in place, right? Replication stopped working, and it took down our primary system. It filled up the write-ahead log, filled up the disk, it went down, the secondary was way, way behind, so it couldn't be promoted to primary. And we had downtime.
Right.
And we had data loss.
And we had data loss.
Yeah, we did.
Oh, yes.
That's way worse than downtime, in my opinion.
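Editor's note: for readers who want to guard against this exact failure mode, here is a minimal sketch of the kind of check that would have flagged it early: standby replication lag and WAL size on the primary. The connection string, thresholds, and role are illustrative assumptions, not part of Changelog's actual setup.

```python
# Minimal sketch: warn when a PostgreSQL standby falls behind or WAL piles up on disk.
# DSN and thresholds are placeholders, not Changelog's real configuration.
import psycopg2

DSN = "host=primary.example.internal dbname=postgres user=monitor"  # hypothetical
MAX_LAG_BYTES = 512 * 1024 * 1024        # warn if a standby is more than 512 MB behind
MAX_WAL_BYTES = 8 * 1024 * 1024 * 1024   # warn if WAL on disk exceeds 8 GB

def check_replication() -> list:
    alerts = []
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # How far behind each standby is, in bytes of WAL not yet replayed.
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
            FROM pg_stat_replication
        """)
        standbys = cur.fetchall()
        if not standbys:
            alerts.append("no standbys connected: replication has stopped")
        for name, lag_bytes in standbys:
            if lag_bytes is not None and lag_bytes > MAX_LAG_BYTES:
                alerts.append(f"standby {name} is {lag_bytes} bytes behind")
        # How much WAL is sitting on the primary's disk (what filled up in this incident).
        cur.execute("SELECT coalesce(sum(size), 0) FROM pg_ls_waldir()")
        wal_bytes = cur.fetchone()[0]
        if wal_bytes > MAX_WAL_BYTES:
            alerts.append(f"WAL directory has grown to {wal_bytes} bytes")
    return alerts

if __name__ == "__main__":
    for alert in check_replication():
        print("ALERT:", alert)
```

Wired into an alerting system, a check along these lines turns "the WAL filled the disk overnight" into a page well before the primary goes down.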
We had a backup from, I think, six hours ago, was it?
It was like six hours.
Nine hours ago.
It was like a bunch of hours, and we've lost some data, yes.
Thankfully, it wasn't a ton of data,
but it was definitely data loss.
Because we had backups. That's the lucky part, because we had backups.
Yeah, we had good backups. But yeah, six hours back.
So thankfully... was there any podcast episode that was published during that time?
I don't think there was an episode; that would have been a bigger problem. But there were news items and comments, and a few things where I had edited a thing and I had to go back and edit it again. Thankfully, we caught it fast enough that I remembered it, and we're a small team, so we remembered our data loss. We're like, I know what I did yesterday, or the last six hours, so we fixed it up. But in a larger team, that would have been catastrophic.
Yep. Yeah, that was not cool. That was really not cool.
And you go through the documentation, right? And it's not like, do this or do that. You don't have a simple list of steps to follow, and then you're scrambling. It's like, I just need to get this thing back up, right? That's all we cared about. What would be the simplest thing? So I think two hours later we had this like, no, we just have to restore from backup, because resizing the disk was difficult. It was just a mess. It was just a mess.
And I think this goes to show that it has not matured that much. I mean, it's getting there, but it hasn't matured that much. And if you need that type of redundancy from PostgreSQL, then, well, you either have some DBA chops, especially when it comes to PostgreSQL, and know what you have to do, or you're just paying for that. Which, I think for us, if it really, really mattered, we would have just paid for that, for the problem to have been taken care of. But the interesting thing is, I always thought that maybe PostgreSQL, maybe Crunchy was too
complicated. And then we tried the other operator, the Zalando one, and the same thing happened, right?
So it wasn't an operator thing.
And here's the thing.
We still don't fully understand
where the latency is in the Kubernetes networking stack,
but we know that there is some latency
and we have some very high spikes.
So think that an operation that should take maybe up to 100 milliseconds will take five seconds. And then, if you have plenty of those things, in a certain series of events things will just get out of sync, and they will not be able to continue replicating correctly. And when that happens, the system will not be able to recover. It was a surprise to me, and I remember looking at this for a really long time, thinking, could it be Linode's private networking? And it wasn't, that wasn't the problem, even though it indicated there's some network latency. So we went down to a single Kubernetes node, everything was running on one node, and we still had the same latency problems. So there is something. And it wasn't CPU-bound, it wasn't high network throughput; we weren't hitting any sort of limit other than network latency. So how many metrics would we need to enable in the different layers of the stack, and how well would we need to know that stack, to debug this issue? Right? And I think that's where a lot of people that hit issues with Kubernetes, that's where they're coming from.
You wouldn't expect these.
These are normal problems.
These are just almost like specific to the stack that we are running, which in this case is Kubernetes.
So you kind of need to be an expert to kind of know how to look at this.
But I do hope that some technologies... I think they've been around for a while, but again, it goes back to how do you pick and choose your components? So what I'm wondering is, would Linkerd have helped with this?
Could Linkerd show us the latency between the different services?
What is Linkerd and how would it do that?
So it basically intercepts all the traffic between...
So imagine ingress-nginx when it talks to the app. Linkerd would place itself between ingress-nginx and, in this case, the app, so we'd see all the latency between the two components. Same way, it would intercept all the traffic between the app and the database, the PostgreSQL service, to show us when there's any sort of weird latency between the two services. Now, we could enable all the metrics for PostgreSQL, but then you need to find the dashboards, you need to understand those dashboards, whether you have Grafana or something else... then you're literally becoming a DBA, right?
That's the hard part, though. You talked about Crunchy...
what was the other one
you talked about we moved to?
And then...
Zalando.
What's it called?
Z, Zalando, PostgreSQL.
So you got those two
and then you consider
would Linkerd have helped us?
But that shows to me,
at least from someone
from this perspective,
which is not a Kubernetes operator,
I'm not an SRE,
is that you have to have some sort of understanding of the different tooling available in the ecosystem, which means you've got to pay attention.
Yes.
Right.
Very closely.
And even not just to know which tools are available to manage Postgres like we need to, and replicate and whatnot, but also a high degree of understanding of that tooling and how it'll actually help you. And so I think that can be... it's just a very daunting,
high-touch world that Kubernetes presents.
It may be the future.
And I'm not sure in terms of the law of diffusion
and innovation where we're at,
it's early majority, late majority,
in terms of adoption of Kubernetes at large,
but it seems like it's still iterating
and still getting better
because we thought it was Linode's networking. It wasn't.
And you suggest different tooling, but that to me says you've got to
have your ear close to the ground of Kubernetes and all its
intricacies to really deal with this kind
of problem or problems like it. We're dealing with it in Postgres. I'm sure there's other databases that are going to have issues.
But it's similar.
It's the same kind of issue, where it's a latency of some sort
that spikes and causes everything to slow down
and then haywire.
So they do say, and let me be specific,
Kelsey Hightower has been saying this for a long, long time.
Don't run your data services on Kubernetes
because things get complicated.
And I think this is a first-hand
experience of what he was referring to. Things may seem okay for a long, long time, but then
things start getting problematic. You have the combination of tooling that maybe wasn't meant
to run in these types of environments. And how do you basically evolve it so that it embraces this distributed everything can go and
come within milliseconds as containers do so i'm wondering if something like cockroach db
which is meant to be run as a distributed postgreSQL replacement would have helped i don't
know would we have benefited from a managed postgreSQL instance
maybe so maybe we should have listened to that advice and not run postgreSQL in kubernetes
but all these things first of all they made us just understand the stack a little bit better
and say us mostly me and it made me realize that simple is best so for the 2021 setup we're running just a very simple
StatefulSet, a single PostgreSQL instance that can restore from backup in less than one minute. So let's say that you lose everything, right? If you back up frequently, which we do every hour, by the way... and I have to change that setting. I've set it to be three hours, but I need to change it to one hour. It's super simple. And then the database will back itself up every hour; we can lose an hour's worth of data. We could back it up every 30 minutes, but it's very simple. And then you have backups, and you can self-expire them. By the way, we back up to S3.
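Editor's note: the episode doesn't go into the exact tooling behind these backups, so treat the following as a rough sketch of the idea only: dump the database every hour, push it to S3, and expire copies older than the retention window. The bucket, paths, database name, and retention are invented for illustration.

```python
# Rough sketch of an hourly "pg_dump to S3, expire old copies" job.
# Bucket, prefix, database name, and retention are invented for illustration.
import datetime
import subprocess

import boto3

BUCKET = "example-changelog-backups"   # hypothetical bucket name
PREFIX = "db/"
KEEP = datetime.timedelta(days=7)      # self-expire backups older than a week

def backup_once() -> None:
    s3 = boto3.client("s3")
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/changelog-{stamp}.dump"

    # Dump the database in PostgreSQL's compressed custom format.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", dump_path, "changelog"],
        check=True,
    )

    # Upload this hour's backup.
    s3.upload_file(dump_path, BUCKET, f"{PREFIX}changelog-{stamp}.dump")

    # Expire anything older than the retention window.
    cutoff = datetime.datetime.now(datetime.timezone.utc) - KEEP
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
        if obj["LastModified"] < cutoff:
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])

if __name__ == "__main__":
    backup_once()
```

In practice an S3 lifecycle rule can handle the expiry step on its own; the sketch keeps everything in one place for readability.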
And we back up the entire media as well.
And these backups, the reason why they were important
is because when we did the 2021 setup,
all I had to do, I had to let the system restore from backup
to pull all our media, which is 85 gigabytes right now: all the files, all the MP3s, all that stuff. So to download that from S3 is fairly fast, especially the MP3s; they download at like a few gigs per second.
But it's, uh, gigabits, not gigabytes, by the way. You have 85 gigabytes. That's an important distinction.
But it's when all those small files, like all the avatars... when you have to download them, they take slightly longer, because there are so many of them. But we can restore everything from scratch. So let's say we delete everything: within 27 minutes, because of all those small files, everything is restored, the database super fast, the media files, the whole lot.
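Editor's note: a quick back-of-the-envelope check on that 27-minute figure; the effective transfer rate below is an assumption for illustration, not a number from the episode.

```python
# Why "gigabits vs. gigabytes" matters for the 85 GB media restore.
# The 2 Gbit/s effective rate is an assumption, not a measured value.
media_bytes = 85 * 10**9        # roughly 85 GB of MP3s, avatars, etc.
rate_bits_per_s = 2 * 10**9     # assume ~2 Gbit/s effective from S3

bulk_seconds = media_bytes * 8 / rate_bits_per_s
print(f"bulk transfer alone: ~{bulk_seconds / 60:.1f} minutes")  # ~5.7 minutes

# The remainder of the 27 minutes is dominated by per-object overhead on the
# many small files, plus the database restore itself.
```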
And because it's so simple, do you need to have a distributed system? You can use these local SSDs. That's another problem which we had: disks not detaching, nodes not rebooting. We had another downtime because of that, and I know that all these issues have been fixed. I mean, we were early adopters in the case of Linode Kubernetes Engine, right? It shipped in November 2019.
We started using it, right?
It was a beta.
And just when it went live, I think in May,
we were already starting to switch some production workloads across.
And then by, was it August or September?
I can't remember.
Everything was across.
Something like that.
So did we need a multi-node Kubernetes cluster?
The answer is no.
What we needed was proper CDN integration.
And that's where the speed comes from.
So by properly integrating with the CDN,
in this case, Fastly,
the website is actually 15 times faster.
The latency.
Did you say 15 times?
One five, yeah.
15 times.
One five.
Actually, let's do this.
By the way, we are integrating with Grafana Cloud.
So we ship all the logs, all the metrics to Grafana Cloud and we have synthetic monitoring
set up there. And we have probes running all around the world. By the way, not all probes
are reliable, but we have plenty to show us what's happening. And we're monitoring our babies now.
We are, yes. The feeds, and we have alerts and reports... there's like so many things we have set up. So thank you, Grafana Cloud, that's a really cool thing. Behind the scenes, Jerod called our feeds our baby.
Yes, he did. A little joke there. But yes, we're monitoring our babies, which is our podcast feeds. And if a baby is crying, guess who gets the Telegram message? This is the Grafana Cloud integration. I do.
Right. So when ours... that's the way it should be.
Yeah.
Exactly right. That's how you stand by your infrastructure, if you're willing to be woken up at night.
And guess what? We're caching it, and so cache doesn't go down anymore.
Yeah. All of Fastly would have to be down before Changelog would be down.
So you have proper integration, which we didn't have before. We did some caching,
but not as much as we do now. Anyways, before we enabled caching the changelog.com website, the average latency... so we have San Francisco, Dallas, New York, London, Frankfurt, Bangalore, Sydney, and Tokyo, these are all our probes... the average latency across all probes was 880 milliseconds.
That's kind of embarrassing.
Before.
Yep.
Yeah.
Now it's 66 milliseconds.
So how much is that?
880 by 66, 13.3 times.
Not quite 15, but not 10 either.
It's more than 10.
We can round to 15. And guess what the uptime is? 100%.
Exactly, 100%. It's 100%, that's exactly right. We want all the nines.
This episode is brought to you by CloudZero.
They help teams monitor, control, and predict their cloud spend.
And I talked with Ben Johnson, co-founder and CTO at Obsidian Security.
They get tremendous value from using Cloud Zero.
Ben shared with me the challenges they face
driving innovation and customer value
while also trying to control and understand
their Amazon Web Services spend.
We want our engineers to move fast,
to innovate,
and to really focus on driving customer value.
Yet at the same time, reality is we have to pay for cloud compute and storage. And the challenge around AWS is often that you have
multiple accounts, you have lots of different services, you have some people who only have
access to development environments, not necessarily production. A lot of these different challenges across services,
across accounts that make it hard to understand the positive or negative impact to the costs
that the new feature, the scale, maybe the change in architecture are having. And so giving our team
more insight into the ramifications, again, positive or negative, of their changes, in order to say, maybe we need to really move fast, let's have less worry about cost right now. Or maybe now
we're in a more stable place. Let's drive down the cost so we can give those cost savings onto
our customers or improve our own margin. So a product like CloudZero can really help your team
get a handle on costs, get alerted to those spikes, feel good when you actually see the costs drop, and do all that without a whole lot of investment of your own time.
All right. If your organization shares similar struggles as Ben and Obsidian Security,
check out CloudZero today. Learn more and get a demo at cloudzero.com
slash changelog. Again, cloudzero.com slash changelog.
So this speaks to really geographic relocation of our assets, right? I mean, we had all of our images and MP3s and CSS and JavaScript assets served via CDN all the way back to when we set the system up.
That's right.
But we didn't serve the entire website via that CDN.
That's right.
And so even
though Phoenix is really fast, even though we're set up good, we even have in-memory caching in places where it makes sense, like the feeds. Who wants to recalculate The Changelog's feed of 400-some-odd items every time it gets requested? We cache that in the app. In addition to that, we now have it behind the CDN. And just the fact that that used to be served from, like, New York East... even if it was really fast, to answer in Bangalore, in Tokyo, it's never going to be under... well, it's going to be an average of 880 milliseconds around the world, right?
Yep.
There's not much we could do about that
while our responses were coming from a centralized,
you know, single pop, as they call it,
point of presence, which is the way it was.
So now every request goes through Fastly,
and we should have done that a long time ago.
We should have.
I'll take full responsibility on that one because I kind of slept on it for years.
I think you resisted it, actually.
Didn't you resist it for a little bit?
You were like, no, let's not do that.
Yeah, I think it was.
I'm not trying to call you out or anything. I'm just trying to be like, what were the circumstances for saying no, really?
I think it's because I didn't read the docs well enough.
I didn't realize how easy it is to just bypass that
if you have a cookie set.
So I thought, well, we have signed in users,
signed out users.
I guess I always had done it that way.
I just served the dynamic parts from the application
or behind NGINX,
and I served the static parts from a CDN,
and that was just what I was used to.
That's what we did.
I thought it would be hard to switch
because I didn't realize
that there's just a setting where it's like,
pass through fastly if you're
signed in.
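Editor's note: the behavior described here actually lives in Fastly's configuration, not in application code, and the cookie name below is hypothetical. This sketch only expresses the decision being talked about: requests carrying a session cookie bypass the cache, everything else is served from it.

```python
# Illustration of the "pass through the CDN if signed in" rule.
# The cookie name is hypothetical; the real rule is set in Fastly's config.
SESSION_COOKIE = "_changelog_session"

def cache_decision(request_headers: dict) -> str:
    cookies = request_headers.get("Cookie", "")
    if SESSION_COOKIE in cookies:
        return "pass"    # signed-in user: skip the cache, hit the Phoenix app
    return "lookup"      # anonymous user: serve the cached response
```

With only a few percent of requests signed in, almost everything ends up served from the edge.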
Probably a minuscule
percentage of our traffic is
signed in users.
Maybe a lucky 3%, maybe 1% of requests are signed-in users.
Maybe. That's right, a lucky 3%, maybe 1% of requests are signed-in people. So a little bit of ignorance, a little bit of just like old school, this is how I do it. And then, because we didn't have worldwide monitoring, we had single-point monitoring, it always seemed pretty fast. You know, we always got good scores. Is it good for you? It's good for me. Yeah, exactly.
Is it good for us?
Is it good for people in the States?
Once we set up the Grafana with the around the world monitoring,
then you start to realize,
holy cow, this is not fast for everybody, you know?
Yeah.
So I think it was less,
just less important
because I didn't realize how bad it was out there.
Well, that's interesting too
when you talk about observability.
What's it, you don't know what you don't know
until you know or something like that. Basically, you know,
observability provides a lot of data
to understand some of the
problems because either
you don't have time or you not necessarily
don't care, but you don't care because you
can't care. You don't have the
data to really understand the full rounded
picture of the problem or
the concern. And that's what's
interesting is that once you start to monitor something, you really start to understand the
real problems. And that's why I think, you know, there's a lot of pluses to, you know, it doesn't
require Kubernetes to use Grafana, right? We don't need Kubernetes to use Grafana. But the full, rounded picture of what cloud native asks of teams, or prescribes, is this picture of Kubernetes, a "simplified", in quotes, plane that everyone understands. You can go from our organization to a whole different team and they're using
Kubernetes. It's roughly the same API and all the same concerns. You've got
an understanding from team to team if you're someone who moves around or someone who SREs for many people, or it's just
a standardized way of doing things. I'm curious though, about the average, because you said 880
was the average. So share the highest, because that says average. What was the highest?
So this is the average latency, right? And you have all the different points. Can you see that?
Yes. Okay, cool.
So this is all probes. We'll pull a screenshot into the show notes for sure. But let's look, for example, at Dallas, right, which is closest to where Adam is. So in Dallas, what we're seeing is the average latency is 42.2 milliseconds.
Okay, that's pretty good. It's a pretty good latency.
You can see that you have a couple of high ones.
So the max goes to about 200 milliseconds.
This is now, not before.
This is last seven days.
Looking across the last seven days.
If your maximum response time is 200 milliseconds,
then you're sitting pretty.
200 milliseconds, exactly.
And that's where the average, and this is Dallas.
So let's take, I don't know, let's take London, for example, for me. So London is 87 milliseconds and the maximum is 400 milliseconds. Now, what we need to understand is that some of this is also related to the probes. So, do you see the uptime says it's 99.98%? Well, what that actually means is that some probes, some Grafana probes, are either
overloaded because they take more than five seconds, which is exactly what happened here.
It takes more than five seconds. And that's a timeout. If a response takes more than five
seconds to come back, it's considered an error. It may have taken longer, but it's considered,
nope, it didn't respond quickly enough. But maybe the probe was being overloaded.
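Editor's note: a tiny sketch of how that uptime number falls out of the probes: every sample that comes back within the timeout counts as "up". The sample data is invented; Grafana Cloud does this for real across its worldwide probes.

```python
# How a 99.98% uptime figure can come from a single timed-out probe sample.
TIMEOUT_S = 5.0   # responses slower than this count as errors

def uptime_percent(response_times_s: list) -> float:
    ok = sum(1 for t in response_times_s if t < TIMEOUT_S)
    return 100.0 * ok / len(response_times_s)

samples = [0.07] * 4999 + [5.2]               # 5,000 samples, one over the timeout
print(f"{uptime_percent(samples):.2f}%")      # 99.98%
```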
I know that when we were looking at Bangalore, I think that was the one.
This is Bangalore.
See, for example, these errors here.
This was the 4th of May.
The error rate was very high.
But all it meant is that the probe may have been overloaded.
Not necessarily the website, because I'm pretty sure Fastly was rock solid around this period. I mean, you just have to think how many POPs they have, how many points of presence. So once you get in the Fastly cache, any endpoint should be able to serve it. We have a shield in New York, and then every other point of presence basically distributes from there; it reads it from that cache and replicates across the whole world. And we have a micro-cache, so we cache every response for 60 seconds, and then if there are any cache misses, it will continue serving stale content while asynchronously going back to the origin and requesting an update. So you should always serve cached content, unless obviously the POP was down or overloaded or something like that,
which very rarely happens.
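Editor's note: one common way to express "cache for 60 seconds, keep serving stale while refreshing in the background" is with surrogate caching headers from the origin. The episode doesn't say whether changelog.com does this via headers or inside Fastly's own configuration, so treat the exact values as illustrative.

```python
# Illustrative caching headers for a 60-second micro-cache with
# stale-while-revalidate behavior at the CDN. Values are examples only.
def cache_headers() -> dict:
    return {
        # Fastly honors Surrogate-Control for its own cache...
        "Surrogate-Control": "max-age=60, stale-while-revalidate=86400, stale-if-error=86400",
        # ...while browsers fall back to a plain Cache-Control.
        "Cache-Control": "public, max-age=60",
    }
```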
And then we reroute traffic.
So typically when there are issues, it's the high latency.
It's most likely the probe.
Let's see.
Can I have one, for example, can I see one probe here that was not very healthy?
Look, for example, this one, this was Tokyo.
Do you see how the latency went slightly high? So Tokyo was having not a great day, the Tokyo probe. Same thing here in Bangalore; the Bangalore probe was all the way up to five seconds, so some requests were timing out. But which probe out of here looks most loaded? Let me just open this in a slightly bigger view. It's Frankfurt. Look at Frankfurt, how many spikes it has. Do you see these spikes? It goes all the way to three seconds, four seconds. Now, in the big scheme of things this is no big deal, right? You think, ah, this is okay, but the probe, I think, is overloaded.
What does that mean, to be overloaded? Like, the Grafana probe has a lot of load, it's doing this for not just us but others?
Similar to the way a noisy neighbor is on a VPS.
Exactly, right. Or whatever route this is taking, the route is overloaded, the networking, right? We don't know what route it takes. So however this probe runs, we can see it now. We never had this, and this is a really fascinating thing. Who knows what problems we had in the past, in the 2020 setup.
But because we never had this level of visibility, we didn't know.
We didn't know what we didn't know.
So now we know that, for example, users in Frankfurt,
maybe there's an interconnect that is slow.
Maybe it's not just that probe.
But still, we are able to serve most requests within seconds. So we monitor the NGINX logs, and we can see the response times, we can see the traffic served. This is, by the way, after the CDN cache; we still need to get the logs out of the CDN to be able to visualize the same thing. That's something which I wasn't able to set up just yet, but it's on the list. And we can see that the 99th percentile, the average 99th percentile, is 707 milliseconds, so we are under one second. This is NGINX to the app. But the time interval is 10 minutes. So if we go to, let's say, five minutes... it's a lot. One minute, we had, like... look at that.
Whoa, what happened here?
So when the time interval is one minute, the 99th percentile response time was one minute.
The 95th percentile was 300 milliseconds, and the 99th percentile was one minute.
So what the hell happened here?
I don't have the answer, but I would love to find out.
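Editor's note: a small sketch of why the same traffic can show a 707 ms p99 over ten-minute windows and a one-minute p99 over one-minute windows: with only on the order of a hundred requests per minute, a couple of stuck requests become the 99th percentile. All numbers below are invented for illustration.

```python
# Why the p99 depends so heavily on the aggregation window.
import numpy as np

one_minute = np.array([0.05] * 98 + [60.0] * 2)     # 2 stuck requests out of 100
ten_minutes = np.array([0.05] * 998 + [60.0] * 2)   # same 2 stuck requests out of 1,000

for label, window in [("1 min", one_minute), ("10 min", ten_minutes)]:
    p95, p99 = np.percentile(window, [95, 99])
    print(f"{label}: p95={p95:.3f}s  p99={p99:.3f}s")
# 1 min:  p95=0.050s  p99=60.000s
# 10 min: p95=0.050s  p99=0.050s
```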
Well, now you know there's a problem though.
There's a thing, right?
Because before you didn't know there was a problem.
And if we're dealing with replication of databases
and this was sort of like attached to that,
like as you begin to...
Here's the thing.
All this runs on a single massive host. We have 32 CPUs, AMD EPYC, 64 gigs of RAM or 128 gigs of RAM, SSDs, super fast. It's a single host. So how can the 99th percentile between ingress-nginx, running on that host, and the app, which is running on the same host, be this high?
Bitcoin miner.
Bitcoin miner.
It's not, but sure.
I assure you it's not.
I'm glad you shared the specs of that server, too,
because that does put it into context of... This should never happen.
...its capability, and that this shouldn't happen.
It shouldn't happen.
What do you surmise?
What's your gut?
Something in kube-proxy.
Something in kube-proxy.
I mean, that's the only thing.
It's not the database.
Yeah.
It's not the app.
It's something between all those components
that make up Kubernetes.
We have Calico for the CNI.
Maybe it's that.
Maybe it's the overlay network. But this is where it's almost like you want more observability. It's almost like, you know you have a problem; before, you didn't, you were so ignorant you didn't even have a problem. And if you look at external monitoring, everything looks good, everything is fine. From a CDN perspective, things are okay, and that is the experience that we want to give our users: the website is always available, it's super fast regardless of where we are in the world. And these are the things that we are now becoming aware of. So the question is, do we invest in this, or maybe do we do something else? And when I say something else: do we continue down Kubernetes, or do we take, I don't know, a platform as a service? Our problem has always been bandwidth, right? Because we need a lot of bandwidth, like think hundreds of terabytes of bandwidth.
It's not like in the detective shows, where they say zoom and enhance, you know?
That's what you're doing to us here. We zoom in, and at a certain point you zoom and enhance and it just can't enhance any further, and you're staring at a blob and you're like, I don't know what that is.
Yeah, that's kind of where we're at. So you need another level, you need another zoom or another enhance in order to dive down. And the smaller these problems are, the more of your time you spend, right, figuring out how to get that zoom done, and probably the lower your, you know, your ROI, so to speak, or the law of diminishing returns hits you, and you're sinking massive amounts of resources into solving this tiny little problem that may or may not be worth it. I mean, ignorance, I guess, was bliss... except for our users. For our users it wasn't bliss. We thought it was fast everywhere, and now we know that it wasn't. It's better, and yet we still have this little thing that's like, what's going on there?
Yeah. And it does happen fairly frequently, by the way. So there's something there. Would tracing help? I don't know. Like, look, if we look at the last six hours, we have a spike here, that was 7 p.m., and they're not periodic; they happen like, uh, 4 p.m. Could it be the database backups? I mean, they do run every three hours. You have four and you have seven. So maybe go to, like, the last 12 hours. But then you have all these smaller spikes. This is 1 p.m. So not really. All right. You had these spikes. And again, most of the stuff, like if you look at the traffic that we serve, it's nothing. The server is not even 1% loaded. CPU is not an issue, network is not an issue, nothing is an issue. All the components are healthy, very little memory use. So it's not a problem.
So this is a good thing. I think it refines your understanding. I think it makes you think about your setup in ways that you haven't thought before, so you really do feel like the master of your domain. And most things are easy to set up; I think it's just knowing which things to set up. And what I'm hoping that we'll do with this, and with Ship It, is we will share some of those stories. We'll share the things that worked out and the things that didn't work out, so that others wouldn't have to do this.
Wait, wait, wait, wait.
What's this ship it you just said?
What's this thing?
What's this ship it?
What are you talking about?
So I'm thinking about like,
it's like it's been five years in the making.
Okay, every year we have been improving
our infrastructure, our setup.
We've been shipping it, sharing it with you all.
So how about we do this more often? How about we do this every week? How about we do some interviews and some sharing of how to ship stuff, and what else there is other than shipping? Because getting it out in production, that's like such a small part of the story. I wouldn't say it's like the tip of the iceberg, it could be, but there's so much underneath. It's all the other things that you need to care about. So it's a new show that we would like to start, and this is the first episode. It's the first episode of that show.
I'm excited. I'm excited about this show. I think this is so awesome. I mean, I think that we've been asked, you know, why do we do this? Why do we even care about Kubernetes ourselves? Like, to use it, considering our three-tier application and not really needing, so to speak, that.
I think because we care.
Because we're explorers.
Because this is fun to dig into this kind of stuff.
And as you mentioned, Gerhard,
will Kubernetes be the solution for us forever?
Maybe.
Is it great?
Sure, in many ways, but it's got a lot of downfalls as well.
Will a PaaS make more sense? Will a Render, a Fly, something like that, or whatever Linode has in the future, or DigitalOcean, will that make sense? Maybe. I don't know.
For our application, you mentioned we need a high bandwidth. I think that's part of the journey.
And doing this show, sharing our story, like we had the last couple of years consistently,
naturally evolved into the need to want to share more and not just our story, which is
going to be one part of it, but other stories, other teams stories and how they ship things.
Like, wouldn't it be cool to learn how Kubernetes ships Kubernetes?
Oh, yes.
Or how different platforms ship their different platforms.
Do they use their platform to ship their platform, or do they do something different? You know, are they dogfooding, are they champagne-ing, whatever you call it? And that's gonna be the fun journey, you know. And I think that's what is really fun about this: do more, not just less. I think that's the one thing that we've learned. There's so much to this, there's so many good conversations that can be had, there's so many problems that others are sharing. Like, I was researching network latency in Kubernetes and I came across blog posts that were saying, like, how Kubernetes made my latency 10 times worse. I was thinking, that's my problem! But it wasn't, it was just clickbait. I clicked on it, like, oh damn, it just wanted me to click. So I wouldn't want that for others, right? I would genuinely want to dig into this with different people that have had similar problems, or that maybe have tooling that can help with this problem, to help us understand what the problem is, to help others understand, and maybe come up with a solution which works for more than just us.
So there's, again, a way to curate these problems,
a way to understand them and to see what makes sense.
Because Grafana Cloud may or it does make sense for us, but maybe it doesn't for others.
So what else is out there? We don't know.
And it's not a fixed thing. It's changing all the time.
Like every KubeCon, there's new tools, there's new approaches,
there's just new people, right? New efforts going on. So what are they? It is a full-time job just keeping up with
all the things. And it happens to be fun. Thank you.
...to deploy code at any time, even if a feature isn't ready to be released to users. Wrapping code with feature flags
gives you the safety to test new features
and infrastructure in your production environments
without impacting the wrong end users.
When you're ready to release more widely,
update the flag status,
and the changes are made instantaneously
by the real-time streaming architecture.
Eliminate risk, deliver value,
get started for free today at LaunchDarkly.com.
Again, LaunchDarkly.com. Again, LaunchDarkly.com.
So if you're listening to this in the ChangeLog podcast and you're interested in our new show, Ship It, you can go right now to changelog.com
slash ship it, subscribe there. If you happen to be subscribed to our master feed, which is your
one-stop shop for all ChangeLog podcasts, you're already going to get it. We're going to ship it
right into your feed. But if you're interested in coming along this journey with Gerhard and with us
and with our setup and with other people's setups and see where this thing goes, definitely subscribe to
ShipIt. Now, if you're listening to this on the ShipIt feed, hey, congratulations, you're already
here. Welcome. But I'm excited too. This should be a lot of fun. And I think I will learn a lot by
listening and maybe even participating a little bit. I think that that makes so much sense,
because there's so many good ideas out there. There's so many good ideas that are good ideas for a while,
and then they're terrible ideas, but that's okay.
Because ultimately, what do you care about?
How does this help you?
Does it make sense?
And what else is out there?
It's almost like the novelty factor,
that in itself is good enough to subscribe
and to just like what's around the corner.
Like, one thing which I would love to find out, I mean, I'm putting this out there in the universe, is that one of the guests on Ship It is none other than Elon Musk. Does he ship Kubernetes to Mars? I would want to know that.
Wait, wait, wait. What are you saying now?
Why not?
How does he ship those rockets?
That's like proper engineering, right?
We're just like playing here.
So this is an episode request.
This is not a promise.
This is a request. No, no, no.
Okay, good.
Because I about got very excited.
I was like, really?
Gerhard is dreaming and we are liking it.
Six years from now it will happen, I'm sure.
Now, in six years... that's how long this thing took, from an idea. It makes sense. He just did SNL; he should do Ship It.
Yeah, we're the next natural step from there.
I think so. And maybe we can help him curate the tech that will get shipped. Why not? I say we, it's like the royal we, the shipping group, right? So he doesn't ship the version that has all this downtime, right? Because I don't think that will be good for the mission.
I think we're just looking at the downtime that we had before. We had a lot of downtime, and now it's all green. 19 days, all green. Since we did this switch, the new setup, we didn't have any downtime. 100%.
That's awesome.
I say, okay, it's
a little window, but it should never
go down unless we mess something in the
CDN config. That's possible.
Because at one point I said, there goes them nines.
Oh, yes.
Because the last time we talked, we talked about the nines and how much they
cost and how much each
nine costs and the effort, not just the
cost, but the effort required to
get to those nines.
And that's kind of part of it, too, because we're going on this journey thinking this is improving.
And sometimes improving isn't just simply infrastructure and speed.
Sometimes it's knowledge.
Sometimes it's understanding.
And maybe the current version you've improved, but you've really just improved your understanding of the system and what's required, and the system you currently got might not fit the bill for what you really need, which means
something else, or you're
iterating towards that learning, and that's the interesting
part. Very well put.
Gerhard, do you expect a
community, or do you desire
a community around this show?
Do you think there'll be people
involved, helping guide
direction, ask for certain topics, certain interviews?
What's your thoughts on like who this is for and how involved they're going to be?
I think you can approach it from multiple angles.
I think a community would be nice, but a community, I think it just needs to make sense for the community rather than for us or for me.
So if the community would find that useful, sure thing.
But I think it's more around, I mean, the CNCF.
I'm just thinking, I just recently came back from, I say came back, it was right here in front of the computer.
Virtual.
The virtual KubeCon, CloudNativeCon 2021.
We have a good interview,
possibly one more or two.
Anyways, that's a fantastic community.
There are so many things happening there.
So, what I see as a Ship It community... a community is hard work.
And I think a community,
if it serves itself
and if it's like self-sustaining, maybe.
But I think if anything,
it's sharing interesting topics.
It's solving specific problems that others would find helpful and interesting.
And it's more like spreading ideas and approaches and perspectives that make sense to some.
That's what I'm hoping to get out of this.
Obviously, learn, right?
Learn new things and share those learnings. I think those episodes, I think they will be very time-specific. It's almost like there will be a journey, and in that journey that episode makes sense, and they build one on top of the other, and eventually you have like a nice journey. I mean, we used to do it like every six months, every 12 months, something like that, so I would like to do that a lot more often.
So like smaller steps, gain a lot more perspectives and share it a lot more often rather than once every six months or once every year.
That's what I'm hoping.
But what do you think?
I mean, I could imagine a world where there's a group of enthusiast shippers.
Maybe the act of running things in production is technology specific
so that you might have like the Kubernetes community
and the Ansible community or whatever.
But I think like people are interested in these things,
whether they're SREs or they're DevOps or they're sysadmins,
like I used to be back in the day,
I can imagine people rallying around and hanging out together
and talking about these topics,
similar to how JavaScript folks hang out
and talk about JavaScript in the JS party community of our Slack.
So that show is very community-oriented. We want the community to actually come up with ideas and challenge us and request the guests, so that's like a community-oriented show. I was just curious about your angle on that for this particular podcast.
I think that makes a lot of sense. Like, all those things make a lot of sense, to have engagement from the listeners, right? That's the way I would phrase that. Again, it's more about exploring and sharing, and that's what I'm really passionate about, and finding ways to improve Changelog in a way that is open source and others can benefit,
because that's one thing that we have always done, shared our approach publicly. Like if you look at the commit messages,
there's so much insight in them.
And I find that very interesting because...
Yeah, you write books in there.
Yeah, I did.
I did actually.
I think we could publish a book.
We could probably pull a book out of here.
There's a lot of text in there.
ASCII art and all those things, links.
There's a lot of stuff there.
Yeah, check it out.
Emoji.
Emojis? Oh, emojis are the best.
They convey so much emotion.
In regards to community, we can say that we have
a dev channel
in our community Slack.
And if I'm keying off of what Jerod's saying,
it's like, where can people hang out at?
So we already know that changelog.com
slash community is there. It's free to join.
It's open. We already have a dev channel.
But maybe, are you saying maybe a Ship It channel makes more sense, where we have, similar to JS Party... we've got a JS Party channel, and people hang out there and chat during live shows.
And maybe this show isn't live, but we can start to have, hey, I like this show.
I want to invite this person.
I want to suggest that person.
Well, where do people go and congregate?
Where can that happen?
And I think we've already paid for the price of admission, which is free,
and the infra's there thanks to free Slack and community and all that good stuff.
It's done.
So a matter of moving some of that conversation from dev to ship it
or just promoting dev to what could be ship it.
Either way, in terms of the logistics of that
getting done sounds good to me. But I think we should definitely have a Ship It channel where folks can hang out and talk and, yeah, you know, throw ideas out there and have a place to discuss the show and things around the show. Doesn't have to be about the show, but I think that would be rad.
Do we have comments enabled on episodes?
Yeah,
we do.
Okay.
So that's on, for now. If you listen to a recent Backstage,
we thought about turning them off.
You can go listen to that conversation.
And,
uh,
we actually agreed on turning them off and then I just didn't do it.
Okay.
So yeah, we might leave them on forever
because of laziness, or maybe it'll disappear, but I don't know. You go listen to Backstage episode, was it 16? All the emotions around comments. But for now they're there, and I don't know, I just leave them on, because people do seem to like them.
You know, since then, and this is a micro version of that conversation,
I've seen more adoption of our comments
and especially that recent blog post
you got there, Jared. I mean, like,
if it weren't for that, you wouldn't have people talking to you.
Yeah, I wonder if that episode spurred on more comments.
They're like, wait a second, these guys have a comment section?
I didn't know that until they
posted a show about it. And then even since,
I've looked at our design of it, and I think that, you know,
for a signed-out user, it could be, we could do better design to make a better effort to encourage discussion.
Oh yeah, like actually an emoji picker.
There's definitely some things we could do.
Reactions.
There's all sorts of stuff we could do.
Just guides to higher-value content, really.
Higher value comments.
But that recent post you did, you might as well timestamp it.
That got a lot of
comments itself. The backstage episode we're talking about is episode 16.
Accurately titled, Let Us Know in the Comments. So yes, let us know in the comments.
So yes, there are comments on each episode. So it's a great place to have
conversation. Especially, I like the permanence of those in terms of it's attached to the episode.
So if you have follow-up links or questions regarding the content,
it's a great place for that.
Whereas, of course, there's conversation that's going to happen on Twitter
and on Reddit and on Hacker News and on LinkedIn.
Do people have conversations on LinkedIn?
I don't know about that.
They do.
And elsewhere.
And in our Slack.
But there's some value to the comments on site.
It's worth it, in my opinion.
But if you're listening to this and you're thinking,
well, one, they've answered my questions around community.
Because clearly we just in time produced the future of things.
So we just determined that we're going to have a community.
And it'll potentially be the Ship It channel in Slack.
But if you have a request for an episode,
there's an easy way to do that, changelog.com slash request.
It's there for every show we have,
the changelog, Founders Talk, Ship It,
all the shows essentially.
So if you have a request for a guest or an idea,
that's the best way to share it with us.
If you want to join the community, it's there,
changelog.com slash community.
No debate about that.
And if you care about shipping it,
then you should ship it with us.
Also, if you care about all the other things
that happen before shipping it and after shipping it.
And while you're shipping it?
And while you're shipping it.
Oh, yes.
It's just, yeah, it's almost like that's a point in time, but there's so many things happening before and after. It's not like a single event, right? You find yourself shipping it, and you would like to think that every time is the same... that's what we aim for, it's the ideal, but it's not, right? Sometimes you ship it and you take production down and go, oh crap, what did I do?
Well, there's a great lesson to learn there. So I think it's those things which are really
interesting, right? How do you build systems where shipping is so easy and straightforward,
they don't even think about it? I think we were rather fortunate that that was the case for us. Just git push, and everything will take care of itself, or merge it if there's a PR.
Well, you heard it here first: Gerhard, our resident SRE for hire, has been promoted to podcast host, coming at you weekly, changelog.com slash ship it. And I'm excited, Gerhard. I mean, I've been a big fan of what you've been doing with us for so long. I'm glad to get to a weekly cadence, where it makes a more rounded sense to talk about what we're doing,
what others are doing, and all that fun stuff. But hey, listeners, you know what to do,
changelog.com slash ship it. All right, that's it for this episode of The Changelog. Thank you
for tuning in. We have a bunch of podcasts for you
at changelog.com.
You should check out.
Subscribe to the master feed.
Get them all at changelog.com slash master.
Get everything we ship in a single feed.
And I want to personally invite you
to join the community
at changelog.com slash community.
It's free to join.
Come hang with us in Slack.
There are no imposters
and everyone is welcome.
Huge thanks again to our partners,
Linode, Fastly, and LaunchDarkly. Also, thanks to Breakmaster Cylinder for making all of our
awesome beats. That's it for this week. We'll see you next week. Thank you. Bye.