The Changelog: Software Development, Open Source - Inside 2020's infrastructure for Changelog.com (Interview)
Episode Date: November 6, 2020
We're talking with Gerhard Lazu, our resident SRE, ops, and infrastructure expert about the evolution of Changelog's infrastructure, what's new in 2020, and what we're planning for in 2021. The most notable change? We're now running on Linode Kubernetes Engine (LKE)! We even test the resilience of this new infrastructure by purposefully taking the site down. That's near the end, so don't miss it!
Transcript
Kubernetes is everywhere. You can't avoid it. There's a lot of documentation, examples, guides,
but we go beyond that, right? We show you how to run a web application in production with Kubernetes,
which apparently everybody's doing these days or trying to figure out, and there's like so many
opinions. And so how do you actually do it? Well, we'll show you how. So changelog.com itself
runs on Linode Kubernetes Engine.
It's proof that it's easy, straightforward, and it works.
And we have all the commits to back this up.
We have all the code to back this up.
You can see what choices we've made.
And I really love what we have built.
And I really love that we can keep it real.
We can still deliver business value, right?
No one stopped anybody from shipping.
And it's just a bunch of us. It doesn't take a team of 10, 20, 30 people to do this.
Bandwidth for changelog is provided by Fastly. Learn more at fastly.com. Our feature flags are
powered by LaunchDarkly. Check them out at launchdarkly.com. And we're hosted on Linode
cloud servers. Get $100 in hosting credit at Linode.com.
What up, friends?
You might not be aware, but we've been partnering with Linode since 2016.
That's a long time ago.
Way back when we first launched our open source platform that you now see at ChangeLog.com, Linode was there to help us and we are so grateful.
Fast forward several years now and Linode is still in our corner behind the scenes helping us to ensure we're running on the very best cloud infrastructure out there.
We trust Linode.
They keep it fast and they keep it simple.
Get $100 in free credit at Linode.com slash changelog. Again, $100 in free credit at Linode.com slash changelog.
What's up?
Welcome back, everyone.
This is the Changelog Podcast featuring the hackers, the leaders, and the innovators in the software world.
I'm Adam Stacoviak, Editor-in-Chief here at Changelog. On today's show, we're talking with Gerhard Lazu, our resident SRE, ops, and infrastructure expert here at Changelog about the evolution of our infrastructure.
What's new in 2020? What are we planning to do in 2021? And what are we using today? The most notable change? Well, we're now running on Linode Kubernetes Engine, LKE.
We even test the resilience of this new infrastructure by purposely taking the site down live on the show.
But that's near the end, so don't miss it.
And for those longtime listeners out there, you may have noticed a change at the top of the show.
And I want to welcome LaunchDarkly as our newest partner here at ChangeLog.
They'll be powering our feature flags.
Check them out at launchdarkly.com.
All right, let's do the show.
So longtime listeners
of the Changelog
all know Gerhard Lazu.
Recent listeners,
maybe not so much.
If you've been listening back
to last December,
you've heard Gerhard's voice before
as he went to KubeCon
and had some awesome interviews
late last year.
If you've been around for a while, you've heard him on our
2018 infrastructure, our 2019 infrastructure,
and today on our
2020-21
infrastructure.
2020 never happened. If anybody asks me,
it never happened.
For those who haven't heard about you, Gerhard, from our perspective,
maybe we consider him our SRE for hire, our remote infra guy that we call when we need help.
And he's been helping us for many years. We appreciate you for that. For the brand new
listeners, Gerhard, what's your background? Where are you coming from? So the one thing that I really, really enjoy is infrastructure.
Even more so breaking it, understanding its limits, and then putting it back together.
It's just this need to understand how something works at a very deep level.
And then taking all the building blocks and putting them together much better than they were before.
That's what we've been doing with changelog.com infrastructure for many, many years.
Half the stuff you don't even know, right? That's been going on.
That's right.
It was all for the best, trust me. We took many systems apart, we put them together,
we tried different components over the years. And I feel that right now we are in a very good place.
I mean, as challenging as 2020 was,
we managed to complete our migration
to Linode Kubernetes Engine.
For the listeners from previous years,
we have been running on Linode for many years now.
They have an amazing infrastructure and amazing service,
and we have a great relationship with them.
And they somehow managed to keep things simple even with all this complexity. So over the years we had different setups, but right now we settled on Linode Kubernetes Engine. It's simple, it's
performant, it allowed us to do many things very quickly, and more importantly, it sets us up for a great future.
Yeah, so to go back just a little ways: back in 2018 we were running Ansible scripts and Concourse CI.
You can go back and listen to that episode; we've done one of these per year for the last three
years. This is our third annual infrastructure episode. In 2019 we replaced that stuff with Docker Swarm
and a few other goodies that I can't recall off the top of my head,
but Gerhard knows inside and out.
And these infrastructure setups all come with an accompanying blog post,
open source code, like how we did the decision making.
So also a companion to this episode
is Gerhard's annual blog post.
For 2020, we wanted to move from Docker Swarm into Kubernetes, which was really the goal and what we've accomplished here in October.
We accomplished it before October, but here we are in October talking about it.
So tell us about where we were last year and the things lacking from that setup, things that we wanted, and how this transition is accomplishing some of those goals.
I think that's a really good place to start.
Because last year, as exciting as it was to roll out that infrastructure for 2019, we were using Docker Swarm.
And the big difference was that we didn't have to install Docker.
We didn't have to do any of that management because it came with the operating system we were using,
CoreOS at the time, and CoreOS out of the box just had Docker, so we didn't have to install it.
So there were fewer things for our scripts, right, our Ansible, to do, and we could switch to something like Terraform, and we could worry about
basically managing not just the VM but also integrating with a load balancer, a NodeBalancer
in Linode speak. And it was a much simpler configuration, but it still meant that we had a single VM.
And some might frown upon that, like why a single VM?
But looking at our availability for the entire year, it wasn't that bad.
And any problems that we had were fixed relatively quickly, except one.
We may go into that later.
But for the entire year, our downtime was, um, just under four hours. Sorry,
we had downtime of less than four hours. That was pretty good for a single VM. So it just goes to show
that some simple things can work and you can push them really far. And I know that Jared is a big fan
of simple things because, you know, they're easy to understand; when something goes wrong, it's easy to
fix it. And I know that Adam was very excited
about us going to Kubernetes. We wanted to do that for a while, but the time wasn't right. And it wasn't
right because Linode didn't have a simple one-click Kubernetes story. You had to do a bunch of
things; you could do it if you really wanted to, but it wasn't easy. And then in 2019, at the end,
November, the magic happened: Linode Kubernetes Engine entered beta.
I was at KubeCon.
I met with Hilary Wilmoth and Mike Catrani from Linode.
We gained access to Linode Kubernetes Engine.
It was in beta.
And with one command later,
we had a three-node Kubernetes cluster.
And that was really simple.
That was like the experience that we wanted
and were waiting for.
And once we had that,
things kind of flowed from there.
It was really simple to add all these other components.
Now, compared to what we had before,
we had to worry about, I suppose,
the migration from CoreOS to Flatcar
because CoreOS became end-of-life, right?
With the acquisition of
CoreOS by Red Hat. So we had to do that migration, and we were approaching, we knew that the end of
life would come. So rather than doing that and continuing with the single-VM Docker Swarm
complications, we went to something simpler, which was Kubernetes, because we had this one API and we could provision everything, which meant less Terraforming. We didn't have to provision
NodeBalancers, we didn't have to create volumes and then, like, attach them to VMs using Terraform.
We didn't have to do any of that; this Kubernetes API would do all those things for us, which meant
that it was a much simpler system to work with.
Now, when you say something simpler, probably alarms go off in people's heads, because Kubernetes has a reputation of being very complex, not
simple.
Do you think that's not true?
Are you talking about from a different perspective?
It's simpler.
I think there is complexity in everything.
So even if you have
like a single VM, some things may be simpler, but other things will be harder. So the trade-offs
which you're making: about packages, how to install them, where to get them from; volumes, as I said,
formatting, how to format them, all those things you need to do. Load balancers, configuring them, TLS certificates.
I mean, these things are still required.
Now, you may be familiar with that approach and maybe that's why you think it's simpler.
But if you use something like, for example,
external DNS for automatic DNS management,
which is a component that you just deploy to Kubernetes,
you don't have to go and manage your DNS with Terraform or manually or Ansible or anything like that. And it's this combination of the different components which have matured over the years,
which you run in Kubernetes, and then they in turn integrate with everything around.
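As a rough illustration (hypothetical names, not Changelog's actual manifests), external-dns typically picks the hostname up from an Ingress like this and creates the matching record at the DNS provider:

```yaml
# Hypothetical Ingress; external-dns reads the host below and creates the DNS
# record at the configured provider (DNSimple in Changelog's case).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: changelog                 # hypothetical name
spec:
  ingressClassName: nginx
  rules:
    - host: changelog.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: changelog-app   # hypothetical Service name
                port:
                  number: 4000
```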
So for example, certificates, we used to pay for certificates before and we had to wire
that together and set it up in the load balancer and set it up in our CDN and do all those things.
Now with cert-manager it's much, much simpler. We are getting it via Let's Encrypt, it's all integrated,
it all plays nice together. So while what happens behind the scenes is still complex, these
components that you can pick and choose, and with the maturity that comes over the years,
it is simpler. It is a simpler setup. So cert-manager is a Kubernetes component.
What is it called? Is it a pod? Is it a kubelet? Okay. So I see where you're going with this.
So let me call it additional components.
It's an additional component
that you install in your Kubernetes cluster
that gives you,
it extends the Kubernetes API with extra knowledge.
So your Kubernetes API, by default,
you say, I want a deployment.
Let's say, let's just go with a deployment.
But then how do you ask it for a certificate?
So once you install CertManager,
it's a component that in a way
teaches your Kubernetes API about certificates.
So then you can say,
hey, Kubernetes, give me a certificate.
And CertManager, it has a bunch of components inside,
but let's say it's like one thing.
It knows how to make that happen.
Gotcha.
So it's like the complexity is on the inside.
All of the difficulties and the confusion
and the technical intricacies are on the inside.
And if you can get it set up and configured
and make use of it, your life is simpler.
Exactly. Once those components are hiding
the complexity, which will be there
no matter what you do, no matter what you use,
but they allow you to ask for things
via the single API.
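As a hedged illustration of asking the API for a certificate, a cert-manager Certificate resource looks roughly like this (the names and issuer are assumptions, not Changelog's actual config):

```yaml
# Hypothetical example: once cert-manager is installed, this custom resource
# asks for a Let's Encrypt certificate and keeps it renewed automatically.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: changelog-com-tls          # hypothetical name
spec:
  secretName: changelog-com-tls    # Secret where the issued certificate is stored
  dnsNames:
    - changelog.com
  issuerRef:
    name: letsencrypt-prod         # hypothetical ClusterIssuer
    kind: ClusterIssuer
```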
And the thing which gets me really excited about Kubernetes
is that everything gets standardized behind a single API.
And it goes to the point like you want a VM
or you want a resource,
you talk to the same API
and all you have to do is install the right components
of the API or like those components
know how to translate your request into an actual thing,
whatever you may want.
So load balancers, you no longer have to provision load balancers.
Based on your provider,
whoever you are getting your Kubernetes from,
it knows how to translate that request into a load balancer.
Certificate, same thing.
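As a concrete, hedged sketch (assumed names): on LKE, declaring a Service of type LoadBalancer is what causes a Linode NodeBalancer to appear behind the scenes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx              # hypothetical name
spec:
  type: LoadBalancer               # on LKE this provisions a Linode NodeBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```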
Why then do we have to wait for LKE to make it easier?
And I guess it's sort of a loaded question to some degree because every cloud needs to have its own Kubernetes engine. They have their own fork
of it or version of it that sort of runs natively in their environment. It's probably because
it needs to plug in to certain places. But we had to wait for LKE to make it possible.
Like you had mentioned, we can use Terraform beforehand and Ansible and sort of
do it ourselves,
but LKE made it, I suppose, easier on Linode.
Why is that?
So first of all, the most important thing
when you start off your Kubernetes journey in production,
you want to manage Kubernetes.
And what that means is that you want updates
to be applied automatically.
You want the control plane,
which is the API component itself, you want that to be set up separately from everything else,
and you just want to consume this API. Not only that, but you want your Kubernetes cluster to be
integrated with other things that that provider has. For example, NodeBalancers in the case of
Linode. And while you can install all
those components, it's like cobbling it together. So you want the vendor to give you an API that
is already pre-configured with a bunch of things. Not only that, but when there are updates, you want
your vendor to just take care of them. You can specify when you're okay to get updates, and you want to specify
maybe which versions of Kubernetes you want, like do you want the latest one or do you want to be
more conservative and stay behind. But you don't want to worry about updating the infrastructure
when it's, like, the core infrastructure, so to speak. So in our case, we had to update CoreOS,
right? That was like our responsibility, to update the VM.
But with LKE, we still have to do it in the sense that we have to run the command,
but that's it.
We run one command, it will do it for us.
And I'm hoping that not too far away from today,
not too far in the future,
LKE will be able to update itself based on a schedule.
I mean, that would be the dream, right?
So that the vendor will keep the API automatically updated,
all the right versions for us already deployed,
and then we are only responsible
for the components that we add on top
and inside of this API.
As I mentioned, cert-manager,
ingress-nginx, and a bunch of other things that we use.
It's like for any application,
there are N concerns that must be taken care of.
Like Gerhard said, these things have to happen.
Your DNS has to happen.
Your certificates have to happen.
And every application has its own number N.
Maybe ours is 100, it's a pretty simple application.
Maybe somebody else's is 1,000 things or 1,200 things,
whatever it is.
And the more of those you can take off your plate
and onto your hosting provider's plate is just a win.
It makes it more achievable for you to manage less
and then to manage more.
And if you were just building your own Kubernetes deployment
on top of a VPS or on top of something
that's not LKE or a Kubernetes engine, there's a whole
bunch of things that you have to take care of now that you'd rather not because maybe
it's not your domain expertise.
Maybe it's just a huge time sink.
And the more they can do, probably better than you can do it, the better off you are.
Makes sense.
That's right.
Another thing which, again, I'm making many assumptions here, but I'm going to mention
this, is the whole declarative nature of Kubernetes. You tell it what you want to happen, and it has this,
like, way of describing things, and it will just make it happen. So I don't have to tell it how it
needs to get the certificate; I just tell it, these are the credentials, you just make it happen. And
by the way, when it expires, I don't care, I just want
you to renew it, because I never want an expired certificate, right? So always keep my
certificate up to date. The same thing would be true for some of our services. I think that's where
we are going next, in that we want this automatically updating thing, like an updating system. So you want to
automatically, for example, update PostgreSQL.
Well, how can you do that if it's, like, not a managed service, or if the component doesn't
know how to update itself? So that's, like, another way that Linode, for example, could help, with,
you know, their managed database service, in that if we can provision those via the Kubernetes API,
which I'm really hoping we'll be able to, then we can offload that responsibility to Linode again.
And we always say, just give me a new one,
give me a new one, give me the latest one,
and do backups for me and do all those things.
But we are describing more of what we want
and doing less of how that thing happens
because there's no value in our case to spend
like, to basically reinvent how we do database backups, right, or monitoring, right? I mean, I would
have thought that by this point things would have standardized. That's another thing: Kubernetes
is like a standardization of how to do monitoring, how to do logging. And to begin with, I know there's
an explosion of ways, and there's so many ways you can achieve this,
but I'm hoping that over time things will, like, settle on a clear winner, so to speak.
So for example, we chose ingress-nginx to do the TCP routing, but there's so many other ways you
can achieve that. So how do you give all this choice, and how do you give all these options
to people, but at the same time have, like, a set of
building blocks that just kind of make sense? That's almost like the next frontier. And I think I see
providers that offer more than just Kubernetes; that's, like, the entry point, if you wish,
but you get, like, curated Kubernetes experiences which know how to do all these things more and
more: centralized logging, monitoring as I mentioned, security built in, policies, all those things.
Yeah, the declarative aspect is huge for me
because I like to just declare the way things should be,
and I just don't care about the details anymore.
I remember as a young man, I really cared about the details.
And I loved scripting.
And I'm like, I'm going to write the script,
A, and then B, and then C, and then maybe run D, maybe not. And then, like, I took joy. I still like to script things
sometimes, but I really took joy in like the details, the imperative details, the programming
of how to roll out a thing. I used to roll my own deploys with rsync and all that kind of stuff.
But I just don't have time for that. I just want to say, hey, I want an SSL certificate on this domain,
and I want it to always be fresh.
And I just wanted to configure it,
and the details of how that happens are just not my concern.
And it's really a shift.
It feels good to just be able to declare.
I mean, there's almost like a God complex.
Like, I declare this is going to happen, and then it happens.
It's like, oh, that feels pretty good, right?
So I think that's definitely a holy grail
and a shift from a time where everybody is writing code
to do their operations.
And now we're writing YAML to do our operations.
Whether you like YAML or not,
it's a lot simpler than a Turing-complete language.
Although, is YAML Turing complete?
It might be.
It's simpler than code, generally. Gerhard, you probably know, is YAML Turing complete? I don't know. Honestly,
I don't know. Mind blank, because I'm already thinking about something else. So I don't want
to lose my idea. Go ahead. Yes. Move on. So not only that you declare how you want things to be,
but if anything diverges from what you declared, it will automatically try to
reconverge back on that point. And that's the really cool thing about VMs going away, right?
You can lose a VM and it's okay because the system knows what you want. And if that's not true,
it will try to reconcile in that state. So you no longer have to worry about VMs going away and
your apps going down, right? Or your database going down or whatever. It will automatically spin up on one
of the healthy VMs. Not to mention resources, like finding where to put things, you don't have
that problem anymore. And I remember many years back when Kelsey Hightower gave a demo, the
Tetris demo, right? I mean, that was it. That was like Kubernetes in one very simple picture.
It will figure a bunch of things out
that you thought were important, but aren't.
And figuring out what your capacity is
and where you need to put things,
you need to go up or do you need to go down on scale,
all those things can be taken care of.
I think that's super powerful.
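That Tetris-style packing is driven by resource requests; as a hedged sketch (the image name and numbers are assumptions, not Changelog's real values), the scheduler places pods based on declarations like these:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                                        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: changelog
  template:
    metadata:
      labels:
        app: changelog
    spec:
      containers:
        - name: app
          image: thechangelog/changelog.com:latest # hypothetical image name
          resources:
            requests:                              # what the scheduler uses to place the pod
              cpu: 500m
              memory: 512Mi
            limits:                                # hard caps once it is running
              cpu: "2"
              memory: 2Gi
```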
This episode of The Changelog is brought to you by Teamistry. Teamistry is a podcast that tells the stories of teams who work together in new and unexpected ways to achieve remarkable things.
Each episode of Teamistry tells a story, and in each story, you'll find practical lessons for your team and your business.
I got a sneak preview of season two, and I couldn't stop listening.
I was once in the U.S. Army, and nothing gets me more excited than seeing teams achieve great things when they learn to work together.
And that's exactly what this show delivers.
This season, the show travels deep into the underwater caves of northern Thailand to discover how divers, medics, soldiers, and volunteers saved a group of trapped teenagers, explains
how a world-renowned watch company pitted their two factories against each other in
an attempt to become the best watchmaker in the world, and finds out how Iceland went
from having one of the highest COVID-19 death rates
to a model example of how to deal with the virus.
These are stories that entertain,
and they're packed with business cases you can actually use.
Season 2 of Teamistry is out right now.
Search for Teamistry anywhere you listen to podcasts.
Check the show notes for a link to subscribe,
and many thanks to our friends at Teamistry for their support. So it's worth noting that we don't really need what we have, I suppose, around Kubernetes.
Like this is for fun to some degree.
One, we love Linode, they're a great partner.
Two, we love you, Gerhard, and all the work we've done here.
We don't really need this setup.
It's about, one, it's about learning ourselves, but then also sharing that.
So obviously, changelog.com is open source.
All the code is open source.
So if you're curious how this is implemented, you can look at our code base.
But beyond that, I think it's important to sort of remind our audience that we don't really need this.
It's fun to have and actually a worthwhile investment for us because this does cost us money.
Gerhard does not work for free.
And it's part of this desire to sort of like learn for ourselves
and also to share it with everyone else.
So that's fun.
It's fun to do.
There's something which I'd like to add here.
And I would like to answer the question of how does this help you,
a ChangeLog listener?
So Kubernetes is everywhere.
You can't avoid it.
There's a lot of documentation, examples, guides.
But we go beyond that, right?
We show you how to run a web application in production with Kubernetes, which apparently
everybody's doing these days or trying to figure out and there's like so many opinions. And so how
do you actually do it? Well, we'll show you how. So changelog.com itself runs on Linode Kubernetes Engine.
It's proof that it's easy, straightforward, and it works.
And we have all the commits to back this up.
We have all the code to back this up.
You can see what choices we've made.
And I really love what we have built.
And I really love that we can keep it real.
We can still deliver business value.
No one stopped anybody from shipping.
And it's just a bunch of
us. It doesn't take a team of 10, 20, 30 people to do this. It takes a person, an hour here, an hour
there; when you add it all up, maybe it's a few weeks in, I don't know, six months, five months, however
long it was. It doesn't take that long. And we enjoy working with our partners. We enjoy working with Linode.
And I would like to give a shout out to Andrew Zauber,
the Linode engineer that has been with us through all this.
And we have not only been improving Linode Kubernetes Engine,
but we also had some discussions about the improvements that would make sense.
Maybe things that weren't as obvious until we started using it
or like a bunch of people
started using it and giving all this real-world feedback. So we want you to succeed with Kubernetes;
like, Changelog wants you to be successful with Kubernetes. And not only that, the entire ecosystem,
there's so much choice, and we haven't made the best choice, but we made the choice that makes
sense for us, given our constraints.
And it works.
We are transparent about it.
We share everything.
And yeah, it's all out there.
Yeah.
Let's talk some more about some of the choices that we made.
Like Gerhard said, these are choices that we made for our circumstance and our application.
They're not necessarily the ones that you should make, but it's an example of a choice
that you can make.
We can give our thoughts and opinions
on whether or not it's working out,
or was it a good choice, bad choice,
why did we choose that?
Part of that is what this show is for.
But also continuing forward after this show,
we'd love to have conversations with listeners
and everybody about these things.
We mentioned a few components of the Kubernetes API that you put
together. The cert-manager, you mentioned the NGINX ingress. There's also some DNS, external-dns
is another, DNS management. Is it the exact same thing as cert-manager, only it has a different
function? It's an extension. It's a simple extension so that we can provision wildcard
TLS certificates; we needed that to do the integration with our DNS provider, which is DNSimple.
And yeah, I mean, those are, like, the four core components, and I simply picked them based on
maturity, based on community, based on how things are going. And integrating them was fairly simple
and straightforward.
And you can see how we've done them.
So ingress-nginx, super simple for TCP routing.
And it automatically integrates with NodeBalancers.
We know NodeBalancers, so that was great.
I'm not going to go over all of them,
but what I'll mention is,
I'll mention kube-prometheus,
which is the operator.
It's an operator that we use
to set up Grafana and
Prometheus for Changelog. If you go to grafana.changelog.com, that's basically where we host
all the metrics for Kubernetes. What we don't have currently, but we would like to add,
is integrating Prometheus with all the services that we use. So for example, for a database,
we use the crunchy data PostgreSQL operator.
So you would like to integrate kube-prometheus with our PostgreSQL database.
Same thing for ingress-nginx,
which we currently don't have.
We're just looking at Kubernetes metrics and system metrics.
But it's relatively simple and straightforward
to add all those extra things.
And I suppose that's what's coming next
so we have better visibility into what happens
inside of changelog.com and all the services that we depend on.
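For the curious, wiring an extra service into kube-prometheus is usually just a ServiceMonitor; a hypothetical sketch (the labels and port names are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres                   # hypothetical name
  labels:
    release: prometheus            # must match the Prometheus serviceMonitorSelector (assumption)
spec:
  selector:
    matchLabels:
      app: postgres-exporter       # hypothetical label on the metrics Service
  endpoints:
    - port: metrics                # hypothetical port name exposing /metrics
      interval: 30s
```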
Another aspect of the setup you have is Keel,
which was news to me.
We also have K9s, which is the coolest part of the setup
from my perspective, so we should talk about that.
But as we get into Keel, it might be useful,
it's useful for me as well,
even as someone who's a part of this party,
to just understand what does a deployment look like?
So from I push a commit to GitHub,
our master branch on GitHub,
then to what happens next,
because we have a GitHub-based deploy, right?
We're pushing and it
deploys on our behalf. Can you walk through just like the, you know, this, then this, then this,
the nuts and bolts that I don't want to have to care about, but when things break down, we have to
care about. Okay, so let's just introduce Keel very briefly. Sure. Keel automates updates of Helm deployments or DaemonSets or StatefulSets or Deployments. So
when there's an update to an image or to something, it will automatically update, or it can update
based on certain rules, whatever is running in Kubernetes. So in our case, we use Keel to trigger
automatic updates for the app itself. And there's a bit of controversy here in that GitOps is up
and coming. And I don't want to go into that now, but that's like another approach. So one approach
is to do GitOps and use Flux or Argo CD, or use something like Keel, which goes against some of
the things that GitOps stands for. But I'm not going to go into that now. To your second question,
how does everything work?
In 2018, I made the decision
to separate building and publishing and testing
from the actual deployment.
So otherwise what you have is CIs
that deploy code into production.
And I think that is very dangerous and very wrong
because your CI has the keys to your production environment.
And I wouldn't do that.
So our CI stops at publishing images to Docker Hub.
And a push to GitHub triggers a build in CircleCI,
which runs tests, which compiles assets, and if everything is fine,
pulls dependencies, and it builds a Docker image. And the last step is to publish the artifact,
the Docker image, or the container image to Docker Hub. And that's it. That's where CI stops.
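For a sense of the shape of that pipeline, here's a hedged sketch of a CircleCI config; the job names, build image, and commands are assumptions rather than Changelog's actual .circleci/config.yml:

```yaml
version: 2.1
jobs:
  build:
    docker:
      - image: cimg/elixir:1.11                      # hypothetical build image
    steps:
      - checkout
      - run: mix deps.get && mix test                # run the test suite
      - setup_remote_docker                          # get a Docker engine for image builds
      - run: docker build -t thechangelog/changelog.com:latest .   # hypothetical image name
      - run: |
          echo "$DOCKERHUB_PASSWORD" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
          docker push thechangelog/changelog.com:latest
workflows:
  test-build-publish:
    jobs:
      - build
```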
Now, what we used to have before, we had this very simple loop that would continuously update
the Docker service. Super simple. If there's a new one... it's like a bash while-1 kind of a thing.
That's it. That's exactly it, it was like three lines of code. Super simple. Okay. So Keel is a bit more
complex than that. But the principle is very simple. Because why wouldn't you want to run the latest version of your app,
that passes all its tests, has all the dependencies, in production? I mean, why wouldn't you want that?
I can't... like, that's what we want, right? Like, you want your commits, if everything is fine, to go into
production, right? That's what you want. So, like, maybe the only time I think you wouldn't want
that is like what if it mismatched your database schema or something and that was unable to resolve and then you like want to roll it back,
but you wouldn't know that until you rolled it out. So of course you want that.
Yes. Yeah. You can do, like, things if you have migrations. By the way, every deploy runs
migrations. So when the new app starts... we do blue-green deploys, by the way; it's all
handled very nicely by the deployment model in Kubernetes, so we don't have to worry about any of that. So when the new version comes
up, you're right, you run, like, the migration, and maybe something can go wrong. So yes, but if the
app fails to start, you have readiness probes that will not put it in the load balancer, and if it
crashes, well, there you go, it crashes. What's a probe? Is it like a thing that says, hey, are you ready? Hey, are you ready?
So there's a startup probe, there's a liveness probe, and there's a readiness probe in Kubernetes.
There are like three types of probes. The readiness probe determines when the pod is ready.
And ready means, when is it ready to serve traffic, in the case of a web app. So you need to be
listening on the TCP socket that you say you'll be listening on, and maybe you can do checks, and we
determine if, like, you get 200 back. So is the HTTP response 200? And if it is, the app is ready
to be put in the load balancer. So you declare what ready looks like. Exactly. Gotcha. Exactly.
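To make "declaring ready" concrete, here is a hedged sketch of a readiness probe; the health path, port, and image name are assumptions, not Changelog's actual manifest:

```yaml
# Hypothetical container snippet: the pod only joins the load balancer once
# this check starts returning HTTP 200.
containers:
  - name: app
    image: thechangelog/changelog.com:latest   # hypothetical image name
    ports:
      - containerPort: 4000
    readinessProbe:
      httpGet:
        path: /health                          # hypothetical health endpoint
        port: 4000
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3
```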
So the app may keep crashing
and that's okay.
The old app will not be taken down.
And until the new app is ready,
it runs all the migrations
and everything is fine.
It won't promote it.
There is a risk of the new version
doing a migration
that the old version can't work with.
Right.
I was just thinking about
that, and you will have taken production down. Yeah, but in all the years that you've been working on
Changelog, how many times did that happen? I can't think of one. Exactly, zero. So in four years it
never happened, right? Like, Phoenix, 2016, that's what I remember, since we started developing in Phoenix,
we've never had that situation happen to us.
No.
Well, and our schema is pretty stable.
I mean, it's rare at this point
that we make massive changes to our schema.
These things are pretty well thought out
and in place and working.
And usually it's additive.
Every once in a while,
I'll decide I actually hate the name of something
I named four years ago.
And because I'm pedantic and a completionist, I can't merely rename it in the code.
I also have to rename the database table because there's a mismatch and I can't have that.
And that would be a major change.
I'm going to rename a database table, but it's very rare.
And so most of the changes then are additive.
I'm going to add a key.
I'm going to add a column.
We're going to add a new table because we're trying something new. These things rarely cause
data problems and migration problems.
Well, again, in four years, I don't remember one.
Right, me neither.
Yeah, we never had an issue with this.
But what if, here's where we can spend all the money, right? As engineers, what if?
I know, right? That's the danger.
And that's where you spend all your money
right there, on that what-if. Anyways, keep going. Exactly. So Keel does something very similar to
what our while loop was doing, but a bit more, in that Docker Hub now sends a webhook to Keel,
which is listening, like, on a public IP, there's a host, and it listens on these webhooks for when there's been an update to the image that we
publish. And if there has, Keel will trigger an update to our deployment. It all happens seamlessly,
automatically, the new version comes up and everybody's happy. It also does periodic polls;
it's polling Docker Hub to see if there's a new version. And if there is, maybe the webhook failed to be sent
or we missed it.
I haven't seen it happen.
What I did see happen is Keel locking up.
We just saw that before the show, by the way.
But we're not running the latest version of Keel.
So maybe that's something worth updating, I suppose.
But other than that,
whenever you do a commit,
a few minutes later, you have it in production.
And it's been like that for years for us.
So it works, and it's simple.
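For reference, Keel is typically wired up with annotations on the Deployment it watches; a hedged sketch (the policy and schedule shown are assumptions, not necessarily Changelog's settings):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # hypothetical name
  annotations:
    keel.sh/policy: force            # update whenever the tracked tag changes
    keel.sh/trigger: poll            # poll the registry as well as accepting webhooks
    keel.sh/pollSchedule: "@every 5m"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: changelog
  template:
    metadata:
      labels:
        app: changelog
    spec:
      containers:
        - name: app
          image: thechangelog/changelog.com:latest   # hypothetical image name
```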
Now, we can make it a lot more complex,
and I would like us to look at GitOps sooner or later.
Tell us what GitOps is, because you keep saying,
I'm not going to talk about that, but I don't know what it is.
Is this like you let your Git do your ops? What's GitOps?
Okay, so GitOps is a way of implementing deployments, so you have continuous deployments,
you're continuously deploying code, but it's a way of implementing continuous deployments
in cloud-native applications. So if you're using Kubernetes, or cloud native, or at least that's
the tagline. And what GitOps does, it allows you to define
everything about your application using Git,
including which version you should be running in production.
So if you were using GitOps with changelog,
there would be a commit for every single deploy,
which would need to be approved, merged somewhere,
so we would roll out the latest version.
So you're basically versioning
what runs in production.
To some extent,
we already are doing that
because all our YAML
that defines all the Changelog services
is in Git.
What we don't have,
we don't apply those changes by some sort of, like,
an automated system. It's either you or me that says apply, but we have make targets which apply
all those things, and that's how we roll out changes. But for the app, which changes a lot more
often, we don't run commands, we don't have a CI running commands every single time there's an
update. We don't do that. The app, we have Keel that automatically updates
whatever's running in production.
And why would the GitOps advocates say
that we're doing it wrong, quote-unquote?
It's because they want that history.
They want that to be like an atomic aspect
of their application, deployments,
to be like explicit, atomic, logged things.
Is that why?
Yes, that's one of the reasons.
The other reason, the more important one,
is you always know what you're running in production.
So if I asked you what version of the app
we're running now in production,
you say master.
But master always changes.
Sure.
So imagine if you're deploying a hundred instances of your app. Just imagine
that for a second. If you're deploying a hundred app instances, by the time the 90th
instance gets spun up, if it's looking at master, it may pull a different version, because master
may have changed during the deployment. And if you have many developers pushing lots of code
and master always keeps changing,
then you could have multiple versions of your application running
and you wouldn't even know it.
Gotcha.
Not to mention that when something crashes and master has changed,
again, the version that you thought you were running will change
because you pull the latest image.
And there's, like, a bunch of things. For
example, um, in Kubernetes you're advised to use versions, the exact version of what you're deploying,
like your image. Well, we're using latest, and latest means whatever is latest, and that changes.
So from that perspective, we are breaking, you know, the fully declarative, in a way,
because we can't recreate the same thing.
Multiple runs of the same thing.
Sorry, not declarative, idempotent.
We don't have idempotency because multiple runs will end up with different states
because latest is fluid.
It can be anything.
Gotcha.
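In manifest terms the difference is just the image reference; a hedged fragment of a pod spec (the tags are made up):

```yaml
# Fragment of a pod spec, for illustration only.
containers:
  - name: app
    # image: thechangelog/changelog.com:latest            # fluid: "latest" can change under you
    image: thechangelog/changelog.com:2020.11.06-abc1234  # immutable tag a GitOps tool would pin (hypothetical)
```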
So does Flux then, or
Argo CD, do they capture the
version? Yes. Essentially for us, that way
all the instances that roll out
or potentially roll out if we have more than
one, like we might? Exactly. And every single
change, yes, is versioned. Yes.
And tracked separately.
But then, like right now,
all our code, including the infrastructure
code, is in a single repository.
With Argo CD or Flux, you need another repository that tracks what gets deployed,
because if you think about it, if a commit triggers another commit, and the commit triggers another
commit, you have a continuous loop of commits triggering commits, and it never ends, right?
So you have an infinite commit loop. By capturing what you've deployed, you're bumping the version
of the artifact that's getting
deployed, and you just end up with that.
Just recursively does that.
So we would need to have another repository
which keeps track of what gets deployed.
And from our perspective, we wanted
to keep changelog self-contained and that
if you pull down one repository, you have everything
there. Yeah, we used to have two
repos. We had an infrastructure repo.
We had the source code repo. And we were happy to get rid of the other one and have just one place where everything lives so
simple yeah but maybe we can somehow i don't know configure the ci to ignore certain commits so it
won't build if certain paths change i mean that i know it's possible in some cis and then we can also maybe do argo or like flux whatever we choose to
maybe not deploy every commits maybe be a bit more selective i don't know i don't know but
maybe we can exclude like basically we can break this infinite loop yeah so does a a new version
get spun up every time there's a new push to master so if i'm working on something and jerry's
working on something and we just yes happen to push in similar time frames,
his push triggers a new version in this GitOps world.
His push initiates a new version, mine does too.
And obviously latest in would be the latest version
and eventually to get to my version if I'm after Jared, for example.
So if Jared commits, then I commit.
And there's two new versions that are going to roll out,
but mine being the latter one will be the latest version that's rolled out.
So it'll eventually just get to it.
So there's a time frame potentially even in there, right?
Because you have to sort of initiate or stand up two different versions,
roll that one out, and then roll the next one out.
Is that roughly a scenario? Is that how
that works? That's how ours works.
Right? Right. Ours works
exactly the same way, but basically
Keel will trigger multiple deploys. So
every push to go through the
pipeline takes a few minutes.
So even if they enter at the same time
and we don't have parallelism, so we
do one build at a time,
so you have, what, Jared's
build in your example goes out, gets deployed; a few minutes later your build arrives and goes out
again. So you'll have two deploys within the span of a few minutes. But right now we have a single
app instance, right? So we don't have, like, multiple apps running in production, and there's, like,
reasons for it we don't need to get into now. But we have only one app version, one app instance, running at any point. If we had a hundred app instances which were
running across, let's say, I don't know, 10 hosts, 10 VMs, for scalability reasons, which again we don't
have, but what if, to go back to what Jared was saying, then we may have problems. Yeah, we're
solving our problems, not everybody's, but we're
at least showing that it's possible, I suppose, and aware that we can, not so much that we need to.
Yeah, which is important. Yeah, I mean, we chose Keel because it's really simple.
I mean, a lot of the choices which we made are because it's simple and it suits us.
And I would argue that it would suit the majority, unless you're, like, a really big team
with, like, a really big Kubernetes deployment and investment and all that; then you may need to do
things differently, more certainly than not. But if you're, like, a small team of, let's say, up to 10
people that have a bunch of apps, this may work perfectly well for a long, long time. What's up, friends? Have you ever seen a problem and thought
to yourself, I bet I could do that better? Our friends at Equinix agree. Equinix is the world's
digital infrastructure company, and they've been connecting and powering the digital world for over
20 years now. They just launched a new product called Equinix Metal.
It's built from the ground up to empower developers with low latency,
high performance infrastructure anywhere.
We'd love for you to try it out and give them your feedback.
Visit info.equinixmetal.com slash changelog to get $500 in free credit to play with,
plus a rad t-shirt.
Again, info.equinixmetal.com slash changelog.
Get $500 in free credit.
Equinix Metal.com slash changelog. Get $500 in free credit. Equinix Metal.
Build freely. Let's talk about availability because one of the reasons why you even build this kind
of infrastructure is for resilience, for availability.
And I suppose to test that, let's take the site down.
I love that idea.
I think that's the best idea we've had all evening.
I think so, too.
Before we do it, what's going to happen?
What should happen?
Okay, so what should happen is
we have a three-node Kubernetes cluster.
The application and the database
are running on one node.
They're close together.
And when we take the VM down,
we'll just delete the node,
another VM should be created.
And in the meantime, the website, the app,
should migrate, and the database, from the VM that was deleted
onto one of the other two VMs
which are still running. And I say VMs because it's Kubernetes nodes, we have three. So we delete
one. We expect Linode to recreate the VM, reprovision, notice that, hey, there should be
three, there's only one. And in the meantime, we expect the website to be recreated on one of the other two VMs, and
the database, and everything to be back together. That's what we expect to happen. Okay, what are the
odds? What are the odds? Where are you sitting, like, and how many nines are you thinking this is going
to work? So I don't expect this to take more than 10 minutes, and last time when I tried this it was seven minutes. How
many nines? Less than 0.000-something, right? Well, you want to know, not the nines of availability,
how many nines are on your confidence level. Oh, I see, how confident are you? Uh, there's 99.9... yeah,
there's a lot of nines, let me put it that way. There's a lot of nines. All right, more than I have. Let's do this. Well,
if it doesn't work, then we can fix it, right? Right. And we can just edit the show and act like it
worked. Exactly. But no, I think we should leave, like, the real thing, right? With, like, proof, how long it
took and all that. So listeners, listen closely, there will be no edit stops here. There'll be no breaks. Okay. So the nodes, here we go.
Let's do this.
Okay.
B29D.
Let's go to nodes.
Nodes.
And I'm going to delete it.
29B.
There you go.
Delete node.
I can't, I can't delete the node from here.
Okay.
I need to go to the Linode console.
Oh, cloud console. That's okay.
Logistics.
Logistics, yeah, because I can drain it, I can do other things.
I can't just delete a node, right?
You're not supposed to do that. So you're trying to access it through
K9s, K9s, K9s,
I don't know how you pronounce it.
K9s. Which is a really cool
CLI, like a
awesome terminal app
for accessing all the information about your
Kubernetes clusters, but does not
give the ability to delete things, apparently.
Yeah. Not nodes.
For safety. You can't delete nodes for
safety reasons. We're going to delete the
VM. Not power off, not reboot, delete.
You're going to delete the VM.
See, I'm nervous. This is
now the Linode Cloud admin. Are you sure
you want to delete this?
I am.
This is permanent.
I want to take changelog.com down
and see how long it takes before it comes back up.
I'm starting to stopwatch too.
Here we go.
This is proof of the pudding, right?
If the pudding is good, this will work.
All right, here we go.
Ready, delete.
Boom.
The VM's gone.
Stopwatch started.
We'll see.
I expect to get an email from Pingdom here very shortly.
It will take a minute.
As well as a push notification to my watch.
Walk us through why the 10 minutes. What's the window there?
Why is it roughly 10 minutes? Why is it not more than 10 minutes?
What's going to happen behind the scenes now?
So behind the scenes, the VM is going away, being deleted, being stopped,
going away. The app will stop working. And it will take a while for Kubernetes to figure out that
the node is not healthy. So we can see that the node is still ready, according to K9s,
according to Kubernetes, but we know that we have deleted the VM. It will take a while for
that to be deleted. And when it's properly gone, when it's no longer there,
like, the physical VM has been powered down, we expect Kubernetes to try to re-spin or to
recreate the app on another node that's healthy and ready and ready to go. So it's still up, right?
We said delete VM, it's red. We can see it.
But has it actually been deleted?
I just got a notification from Pingdom that we are down.
There we go.
So now Kubernetes confirmed that.
We no longer have the node.
So now what's going to happen if we look at the deployment,
if you look at the app, there we go. It's down and it has not been created anywhere.
So what's the reason?
It's persistent volume claim,
a reference to persistent volume in the same namespace.
I think that's okay.
Minimum replicas unavailable.
Where are the events?
Let's see, let's go to event.
Pulled, everything's fine, 95 seconds ago.
Pulled all these things, still fine.
It's not a problem.
Let's see, let's see, maybe at pulse...
oh, there you go: Warning, FailedAttachVolume. That's what we're waiting for. So Multi-
Attach error for volume, PVC. So this volume is already attached to another VM, the one that we
deleted. So you've got to detach it, exactly, before you can reattach it to another VM. And that takes some time. That takes some time,
exactly. I don't know. Well, last time when I tested this, it took seven
minutes, end-to-end. In seven minutes, everything was back up.
We're still down, by the way. So answer this, then. I go to changelog.com right
now, and I get a 503. Service is down, essentially.
It's not available. But via ping in the terminal, I'm pinging changelog.com.
I'm still getting a ping.
Is that the NodeBalancer?
That is the NodeBalancer, exactly.
Not only that, but we're hitting NGINX.
So we have NGINX deployed on every single VM,
on every single Kubernetes node.
So we have three instances of NGINX ingress.
So you can get to the NodeBalancer, so you can get to NGINX,
which runs in Kubernetes, but you can't get to the application.
There's no application running, so it can't service these requests.
So you're getting 503s.
So this is a lot like chaos engineering here,
only we are manually introducing, we are the chaos monkey,
and we are monkeying with ourselves
while we record a podcast.
So it's like a step beyond,
like even more idiotic than chaos engineering
is what we're doing right now.
But so far we think it's working
and so that makes us feel good.
But it's just kind of hurry up and wait
to see if this thing can get reattached
and go from there.
Yeah.
We're three minutes and 40 seconds in, according to my stopwatch.
So Gerhard, we had a downtime.
So the difference between this and our old setup is this is going to auto-heal as long as it works as advertised.
Whereas last time, we had, last year, we had a downtime which lasted multiple hours, where it got into a state that it was never gonna... it doesn't auto-heal. Like, I had
to basically drive home and get a hold of you and figure it out. You want to tell that story a little
bit while I wait here? So last year, what happened: the Docker service, basically. We had
a single VM back in the day, and we were running Docker Swarm on a single VM. And the Docker service was not configured to automatically start.
I was expecting, to be honest, for the operating system
to have this essential service by default started,
but that was not the case.
So we had to manually start the Docker service
so that everything else would basically come back up.
And that was the problem.
Obviously, we fixed it since. But the Docker service, the Docker daemon in that case, was not running, meaning that there was no Changelog app, no database, none of those things, right? And that
Docker service wasn't managed, or supposed to be managed, by CoreOS but wasn't being, or something
like that? It wasn't configured to automatically restart when the operating system restarts. So what can we see here is, now we can
see that the volume failed to mount after it failed to attach. Now it attached the volume,
so we can see that both the database and the application have the volume attached.
And what I expect to happen very soon, the application to spin up. We can see the containers
creating, both the application and the database, and we can see they try to create it for three
minutes and 40 seconds. It tried to basically, it was aware, that's when it started being aware that,
hey, I have to recreate this pod, the app pod and the database pod. So four minutes, container creating. Let's see what's the state of
it. If we describe it: successfully assigned... the multi-attach was fixed. So let's see, if we look
at the logs, there's nothing there. Database backup still container creating. We're in container
creating. This one's ready now. I'm sorry, no, it just has the readiness probe.
Let's see what state it's in.
Still container creating.
The database is already running.
That's a good sign. There we go.
We have PostgreSQL.
I'm a secondary and I'm following a leader.
So this is the leader now.
The leadership changed.
We have, anyways, I'm not going to go into the details now.
No, I like this.
I feel like you could be a play-by-play commentator.
This is like sports announcer.
That's right.
This is like the radio.
When you're trying to listen to the game,
you have no idea what's going on,
and he's telling you which direction they're running.
You're over here telling us exactly what K9s is reporting back.
That's right.
You're on K9s radio.
And Adam and I have the advantage of the visuals here.
The listeners are like, what is going on over there?
Yeah, the website's still down.
Here's what's going on, listeners.
The website is still down
and Gerhard's trying to give us confidence here.
Yeah.
And it's coming back up.
It's coming.
I have confidence in this.
All I have to do is just basically, you know,
let it play out.
I know the right thing will happen.
It will reconcile.
Running.
There you go.
The app is up.
Yes, baby.
The app is running.
Five minutes later, according to this, how much do you have,
Adam, on your stopwatch?
Seven minutes. Are we back up?
I don't see it up on my side.
Pingdom hasn't told me yet.
Let me see if I can refresh the page here.
It's booting.
I'm still unavailable.
Ping never failed, so that's good. Load balancing happened.
And we're back, baby.
That's true. Official time is 7 minutes and 35 seconds,
according to at least my refresh.
There you go.
Cool.
Very cool.
So, overwhelmed or underwhelmed?
And the dashboard that provided all this was K9s.
It was K9s.
It's a great...
Yeah, so I mean, you can watch the play-by-play of failure, essentially.
Is this official observability, or are there better tools,
or is this just capable enough to be good for us for now?
Okay, so K9s, it's an ncurses interface to Kubernetes,
which means that you can do things really quickly,
really efficiently by just using shortcuts.
It runs in a terminal and you can do all sorts of amazing things
with Kubernetes without having to type all those commands,
without having to worry about shell auto-completion or whatever.
And if I remember correctly,
K9s actually won an award recently, a CNCF award.
I'm Googling this up as I speak.
What happened with K9s?
K9s award, K9s Kubernetes.
I wish I remember this.
It was 10 open source.
No, that's from last year, 2019.
There's something in 2020 where K9s was mentioned on Twitter.
2020, K9s grant.
No, that's something else.
I know Jared logged it at the tail end of 2019.
I logged it just recently again.
And then again this year,
whenever you had a chance to play with it.
Yeah, I had to log it again
because I started using it finally.
And I was like, oh yeah, this thing is awesome.
And so I logged it again after Gerhard showed it to me.
This is a great one to see pods being scheduled quite a bit.
Uses a lot of CPU.
There was something here that was mentioning K9s
when something was recognized, like the developer,
the K9s developer, for something.
I forget the exact detail.
Anyways, we can look it up and we can link in the show notes.
But K9s, it's a really easy way of just jumping around
your Kubernetes cluster, having a play with different resources,
tailing the logs.
For example, I'm on the app container right now.
If I press S, it asks me which container I want to open a shell in.
So right now, I open the shell in production on the production app,
like one command away.
And it makes stuff like this really simple.
Which is excellent.
What would you do beforehand?
Just completion foo?
What would you do?
Well, I used to SSH to the server
and then connect into the Docker container
and then go from there.
That's what I would do on the previous setup.
So the equivalent would be
if I do kubectl exec -it,
which pod, app, which container, app.
So this is what I need to run.
And then what do I want to exec?
Well, maybe it will... yeah,
you must specify at least one command. So, bash.
So this is what I would need to do.
Deprecated... exec pod...
So I need to do, there you go,
I need to do dash dash.
Pods not found.
So what's the name of the pod?
Let me go back out, because the pod, there you go, it's this one, basically. It has like a...
You've got to find the pod name.
Yeah, that's what I need to do.
It's a very visual dashboard to Kubernetes, essentially.
Oh no, hang on, I know what the problem is. The problem is that I need the namespace. There you go. It's stuff like this,
right? You need to be in the right namespace.
Heck with that.
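For anyone following along at home, the long-form dance Gerhard was fumbling through looks roughly like this; the pod name and namespace below are illustrative, not our actual ones:

```bash
# Find the pod first (and you have to know, or guess, the namespace).
kubectl get pods --namespace prod

# Then open a shell in a specific container of that pod.
# The "--" separates kubectl's own flags from the command run inside the container.
kubectl exec -it app-5f7c9d8b6-xk2lp \
  --namespace prod \
  --container app \
  -- /bin/bash
```

K9s collapses all of that into a single keystroke once you're looking at the pod.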
Shout-out to Fernand Galiana,
aka derailed on GitHub,
the author of K9s.
Super cool tool.
Thanks, Fernand.
We appreciate it.
Hey, it looks like he offers corporate training
for Go and Kubernetes.
So there's your shout out, Fernand.
Awesome, Gerhard.
Well, the availability is back.
Now, you mentioned last year
we had almost four hours of downtime.
We just experienced seven minutes
of downtime here.
Which we should deduct from our actual downtime, too.
But how are we doing so far this year?
We're doing much better.
Definitely. So last year, 2019,
we had that pretty bad downtime
due to the Docker service.
That was actually almost two hours of downtime we had, due to the misconfigured Docker service. So for the whole of 2019,
our availability SLI was 99.96. So three nines and a six, which means that we were down for 220 minutes for all of 2019.
We had 50 micro downtimes,
and that has to do with how the Docker does the promotion
for different, like when it does like blue-green.
So all in all, we had almost four hours of downtime.
This year, 2020, with the LKE migration,
which we started way back when, I don't know, January, February was
ongoing. April, we had a bunch of stuff already migrated. We had that use case with Rob Yogle,
and we had already a parallel deployment on LKE. And then we completed everything,
I think it was in July or August, when we were through. So while all of this was
happening, including the migration, and okay, there are two more months to go, we were down for 68 minutes
in all of 2020. So that means we're just below four nines. 99.988, something like that.
That's where we are.
And half of the downtime was because of the migration.
There's an interesting story there.
Right.
It had to do with the slow DNS propagation when we switched over.
We hit the Let's Encrypt cert request limit, because the DNS wasn't fully propagated, which
means Let's Encrypt was throttling us, and when the DNS did propagate, we couldn't get a certificate
fast enough.
I'll have to email Josh about that, Jared. Get that API limit... do something for us.
Yeah, come on, man, let us hit that thing more.
It was my mistake. The reason why it was my mistake
is because of the TTL in DNS Manager...
DNS Manager, external-dns...
the TTL in external-dns defaults to one hour.
I have since changed it to 60 seconds.
So had it been 60 seconds,
I could have flipped back within a few minutes.
But because it was an hour, once it goes out, you have to realize that DNS resolvers will
cache it for one hour.
Sure.
So even if you change it, they will keep serving the wrong IP address.
And that's why in some parts of the world, it was okay.
And I knew how to basically clear the DNS.
But for most people, it was down.
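For context, external-dns reads the TTL it publishes from an annotation on the Kubernetes resource it watches; here's a minimal sketch of that, with the Service name and hostname layout being illustrative rather than our exact manifest:

```yaml
# Sketch: ask external-dns to publish a 60-second TTL for this record.
# The two annotation keys are external-dns's standard ones; the rest is made up.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  annotations:
    external-dns.alpha.kubernetes.io/hostname: changelog.com
    external-dns.alpha.kubernetes.io/ttl: "60"  # seconds; provider default applies if omitted
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - port: 80
      targetPort: 80
```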
Don't you think a 60 second DNS TTL is pretty low for a normal scenario?
Aren't people going to be hitting their DNS root servers
more often when coming to our website
versus a higher number?
To me it makes more sense that you would crank it down
when you're going to make migrations
and then crank it back up when you're pretty stable. But is that just a thing that doesn't really
matter in practice?
In practice, it doesn't really matter at all. Hang on, am I still sharing my
screen? I am sharing my screen, yes. You sure? Yeah. So if you look at github.com, they have 60 seconds.
github.com, a single IP address, it's global. They have 60 seconds. So all the big names, they have a really low DNS TTL.
What about google.com?
60 seconds.
Single IP address.
Give me one more.
What about microsoft.com?
Try microsoft.
Microsoft.com.
They have a high one.
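If you want to check this yourself, dig prints the TTL (in seconds) as the second column of each answer; the values you see depend on what your resolver happens to have cached at that moment:

```bash
# The second column of each answer line is the remaining TTL in seconds.
dig +noall +answer github.com A
dig +noall +answer google.com A
dig +noall +answer microsoft.com A

# Query a specific resolver instead of whatever your OS is configured with.
dig +noall +answer @1.1.1.1 changelog.com A
```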
I wouldn't be comfortable with such a high TTL.
Because if you need to change it, right?
When something goes wrong and you need to change it,
well, the TTL is already out there, right?
Yeah, but couldn't you just say,
like, on your way up to a deployment,
like, we knew we were going to roll this thing out.
Like, part of your step is, like,
we're going to lower our TTLs all down to 60 seconds,
and then we're going to do our stuff,
and when it's all done, we're going to raise them back up
and let people hold on to that cache.
I mean, caching is nice.
It is, and that's why this is what
we recommend the TTL to be. But DNS resolvers will implement theirs, and ISPs will implement theirs.
Not to mention that if you have a router, like Adam has a very smart router, it has its own
caching setting.
Yeah.
So it's turtles all the way down.
And while you recommend 60 seconds,
who knows what the different DNS resolvers will use.
Some may use an hour,
regardless of what the upstream says.
But what we care about is that at least the DNS that respect those settings
will pick it up soon enough.
In the big scheme of things, it's so small.
I mean, the DNS requests are really a very small amount
of traffic that goes through the internet these days.
Sure.
And what I would ask is, why don't we have instantaneous updates? I think that's what some providers already have, like Cloudflare, for example. Like the big ones.
How... well, a rat hole. I was going to ask how they get that done, but we don't have to go there.
I don't know myself, to be honest. I mean, it changes, it's not something I keep up to date on. But I do know that instantaneous updates, especially when something goes wrong, they're so handy to have, right? What happens if the IP gets compromised, or whatever? So many things can happen. It's almost like a safety release mechanism, if you wish. Maybe you won't need it, but when you do, it's good to have it.
And there's nothing worse than when you really need it and you're like, well, we've got to wait 60 minutes.
And that's exactly what happened last time.
So that's how you kill your nines right there.
Not to mention that you need to wait, right? So you need to wait for an hour before the new TTL will be picked up.
Right.
So yeah, you need to say, okay, I'm going to upgrade in one hour
because that's how long it will wait for the TTL to expire
before the new one will be picked up.
And that's when your boss fires you.
Good stuff.
Well, with availability, there come some acronyms, right?
Oh, yeah. Some acronyms, some new acronyms at least.
SLO, SLI
did we have these before
and I just didn't know about it
or is this new to the world of
there's always a new TLA
three letter acronym
there's always a new three letter acronym
I see
I thought you meant like TLA, like the TLA plus.
Okay.
I was thinking about something completely different.
I got you.
So did we have this before?
I don't think we formally had it.
I know that when we talked about it, I asked, Jared, how much downtime would you be okay with? Because the less downtime you're okay with, the more complex your infrastructure
becomes. And he said, a few minutes here, a few minutes there, it's okay. Right? So in this case,
seven minutes, it's okay. It's not the end of the world. We can be down for seven minutes, it's not
a problem. So when you look across the whole year, how much downtime are we okay to have?
And that will become, in this case, our availability SLO, the service level objective.
Our service level objective is that we are unavailable for at most 50
minutes in any one year, and that's what four nines means. The SLI, the service level indicator, is
where are we against those 50 minutes. And right now, we are at 68 minutes this year, plus an extra seven minutes.
So that's almost like your budget.
But that was the rollout year, so 2021 will be the test.
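To put rough numbers on those nines, the yearly downtime budget is simply one minus the availability target, multiplied by the minutes in a year:

```latex
\begin{aligned}
\text{budget} &= (1 - A) \times 525{,}600\ \text{min/year}\\
\text{three nines: } & (1 - 0.999) \times 525{,}600 \approx 526\ \text{min} \approx 8.8\ \text{h}\\
\text{four nines: } & (1 - 0.9999) \times 525{,}600 \approx 52.6\ \text{min}\\
\text{five nines: } & (1 - 0.99999) \times 525{,}600 \approx 5.3\ \text{min}\\
\text{2020 so far: } & 1 - \tfrac{68}{525{,}600} \approx 99.987\%
\end{aligned}
```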
Of course, by 2021 we may be rolling out new stuff. So let's talk a little bit about things that we
didn't do, or things that we might do, and then we'll talk about how the listeners can get involved.
what haven't we done here? What haven't we done?
So the first thing, as you know, I'm very passionate about,
is about logs and metrics.
I think we need a better understanding of how the system works.
While we have grafana.changelog.com,
and we have all the metrics from the Kubernetes side,
from the infrastructure side,
we don't have metrics from PostgreSQL, for example.
We don't have metrics from our application, Phoenix.
It exposes a lot of metrics.
So we don't know what's happening inside of our app.
Why is it constantly using 8 gigabytes of memory?
We don't know that.
It would be good to know that so that maybe we can bring it down.
We can speed some things up. Another thing: we try to optimize, for example, queries, right? How fast
pages load when they're not cached. Well, when a page loads, can we see a trace? That's where
traces come in. Where's the most time spent in that request timeline? Is it the database? Is it
the app itself? What's going on? The other thing that we would like to have is to
centralize all the logs. So I have Kubernetes, and we can see logs for pods easily. K9s makes it
super simple, but even if we didn't use K9s, kubectl logs is there, but the logs will
be gone once the pod goes. So can we aggregate all those logs? Before, we were using Papertrail.
But this is something that I wrote about last year.
We'd like to try Loki out.
We'd like to send all the logs to Loki.
And when we send all the logs to Loki,
and we already have, for example, maybe IPs for requests,
browser agents, user agents,
could you maybe have some dashboards?
And I know we talked about this last year
to show you maybe where users are coming from,
like the stats
that the app currently does, remember?
Could we maybe use something else
and use the logs as they are
without having to process them ahead of time?
Could we do that?
So logging, metrics, integrating everything.
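As a taste of what that could look like if the logs did end up in Loki, LogQL can parse fields out of raw log lines at query time, so a "requests per user agent" panel needs no pre-processing. This is a sketch that assumes JSON-formatted ingress logs with a user_agent field, which may not match our actual log shape:

```logql
# Count requests per user agent over the last hour, extracting the label
# straight from the JSON log line at query time (no ingestion-side parsing).
sum by (user_agent) (
  count_over_time({app="changelog", container="nginx"} | json | __error__="" [1h])
)
```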
The one thing which I think it's everybody's dream
is to have an automatically updating system.
So could we roll updates to, for example,
our container image for the app,
the latest version of Erlang
or the latest version of Elixir
or the latest version of PostgreSQL automatically
with no intervention?
I think we could. So what does the setup look like?
What about automatic Kubernetes updates? We're not on the latest version, we want to upgrade,
but someone has to do that upgrade. Could we automate that so we're always running the latest?
And can we do it in such a way so that it causes minimal disruption to everything else? Because in
a way, you want
the most secure setup, right? The most efficient setup, with no effort, if possible. It just happens.
Right. It's the dream.
Yeah, exactly. I mean, I would love to just, again, declare an Elixir version, a
Postgres version, and not have to worry about how it goes from my current version to that version.
I know, especially with Postgres, that can be very tricky
with database backend migrations and stuff.
The actual format of the data storage by Postgres can cause issues.
But yeah, that would be super cool.
And who knows, maybe we'll get there someday.
Yeah, I mean, I think it's all within reach, to be honest. All these
systems, once you declare them, once they are self-updating, self-healing, it's almost like the
next step. We want an automatically updating cluster, if you wish. And how will those updates
happen? So for example, you may be okay to update to the latest patch of PostgreSQL
automatically, but maybe not to the latest major. Maybe you'd want to control when majors
get rolled out.
Totally.
But for other things, like Erlang for example, you may be okay, because it's
very backwards compatible and you always want to have the latest, right? And maybe you'd want to have canary updates,
so like updates happen, but not in situ.
So if something fails,
you would like to know what the failure is so that you know how complicated this upgrade will be.
It's almost like it will feel for you like,
hey, there's a problem with this.
You may not even have thought about this.
Not to mention that you can then start consuming beta versions, right?
And say, hey, I can't upgrade to this.
And then you can start feeding back
into different development cycles.
Like this is completely not working for us
or there's like this new feature which is amazing.
Can you ship it, please?
We would benefit from it greatly.
So you can start consuming the latest and greatest
in an automated fashion with little effort.
And then you can have automatic updates like your phone does.
Isn't that great?
When you don't have to update all the thousand apps
which you have installed, they automatically update.
I think that's pretty great.
Phones these days update themselves.
Until they update on their own and they change their icon.
And you're like, what happened?
Right.
Because that's the problem.
This new icon is ugly.
I like the old one.
Yeah.
And then you can't go back.
I know what you mean.
There's no going back either.
It's futile.
Well, I'd say one of the coolest things about this process this year, I think, has been
the contributions back to LKE and to various open source things surrounding Kubernetes and the
work that you did there, Gerhard, you know, getting us to where we are right now. I think
it's really cool that we can use this as a necessary excuse to help other people and help
open source software get pushed forward as well. So that excites me. I hope as we move forward,
we continue to look for opportunities like that
to work with open source projects and with providers
to level everybody up as we use their stuff.
I'm surprised it's taken this long to get Let's Encrypt in place.
I mean, I know we had an SSL cert and it was set to expire,
but it's been out for a while
and we finally have it in place in automation, which is awesome.
So we did have Let's Encrypt before manually.
Right.
As we were doing the migration, I remember our certificate was getting close to expiration, so we didn't want to wait.
So we did do it manually.
But it was still a bit like, okay, I have to run this command, and I have to save the certificate somewhere, like in the source of truth where we keep all our secrets, then I have to put the certificate on the CDN,
and I have to set it up in the load balancer. So there were like a couple of steps which we had to
take.
Yeah.
But now everything's automated for us, because of how the NodeBalancers integrate with
ingress-nginx, and how all the components know what needs to happen.
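For readers wondering what "the components know what needs to happen" can look like in practice, one common pattern on Kubernetes, and not necessarily the exact wiring in our cluster, is cert-manager watching an Ingress annotation and keeping the Let's Encrypt certificate renewed in a Secret:

```yaml
# Sketch of the common cert-manager + ingress-nginx pattern.
# Resource names and the issuer name are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: changelog
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts: [changelog.com]
      secretName: changelog-com-tls  # cert-manager creates and renews this Secret
  rules:
    - host: changelog.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```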
And to your point, Jared, too,
there's a lot of services out there now that we're in,
now that we're actually in cloud native land,
officially, since we're using Kubernetes,
there's a lot of different services out there,
both open source and, you know,
opportunities to partner with people that are available to us
that are pretty cool.
So there's probably some holes
where we haven't chosen a good solution, or we're not there yet. So this is us saying, hey, reach out. If you know
better ways we're capable of using, reach out and say hello. Pretty easy to get in touch: editors at
changelog.com. Say hi. It's too easy.
I would be very excited to hear about what the cool kids are doing these days in the
world of Kubernetes. It's moving so fast, so many things are changing, it's impossible to keep up
with it. But this way, you know, just as we are helping the community in our own specific way,
we hope that others will maybe have a look at what we do and suggest better ways. The best compliment is someone telling you
how crap your setup is.
That's right.
Because then you can learn something new.
So I'm looking forward to someone telling me
how crap the setup is and how it can be made better.
I'm really looking forward to that.
And I suppose you could use grafana.changelog.com
and do some hunting yourself if you wanted to. Also,
it's open source. You can look that way
as well. So it's
transparent. It's available to you to
dig into just like we can. No different.
Absolutely. Please do
check it out. All of the source code is
up on GitHub. All the links to all the things
are in the show notes.
As we stated earlier, Gerhard will
be publishing a detailed blog post covering many of the nitty
gritty details that we glossed over in this conversation.
So look forward to that.
We'll cross post it in the show notes as well.
We don't know the exact timing on when,
what goes out in which order,
but it'll all work itself out.
We'll just declare it and the system will just work out that content as it needs to.
Gerhard, anything we left on the table?
Anything left unsaid before we call the show?
No, we went into a lot of details, much more will be covered in the blog post.
I really enjoy this.
Every year, it's almost like I'm looking forward to this.
The results of combining all these awesome building blocks.
And in a way, that's what we do here.
We take the best or the simplest,
in some cases, out of the open source world,
and we combine them in a way that makes sense for us.
And yeah, I'm really excited about
where we will be this time next year.
Really looking forward to that.
And if you have questions or thoughts,
you can share in the comments on this show, of course,
or you can join the community, which is easy.
You can talk in real time in Slack
with us, among other things, but
changelog.com slash community is where
you can make that happen. It is a free
community to join. It's the cost of your
attention and your time. That's it.
And everyone's welcome.
No matter where you're at in your journey,
this is a place you can call your home and
hang your hat and hang out and ask questions.
Gerhard's there.
I'm there.
Jared's there.
And you could be too.
So hope to see you there.
Gerhard, thank you so much for all your hard work.
It's been amazing working with you all these years.
We appreciate every single year this leveling up.
We love to see this happen, this progress happening and achieving our SLO of four nines next year.
Four nines. Because five nines
is expensive. That's a different subject
but this has been awesome.
Once we've achieved four nines, we will
move to five nines, right? It's always
aspirational. Of course. It's what you've got to do.
It's the next layer.
The reason why it's not three nines, because three nines
would be easy. We already got that.
We already have that.
Yeah, we already have that. So, four nines.
We can fail and get three nines.
We could or we can be
completely awesome and
get five nines.
One day I'm sure we'll
get there.
I'm sure.
One day.
One can dream right?
One can dream.
I would like to give one
last shout-out to
Andrew Zauber.
It was great working
with you on some of
these things.
Thank you very much
for your detailed
explanation about the proxy protocol support.
That was great.
Thank you very much.
And everybody else at Linode
that is building a solid platform.
Yes, yes.
Definitely.
Thank you, Linode.
A hundred percent.
All right.
That's the show.
Thanks, guys.
Yep.
It's been awesome.
Thanks, Gerhard.
This was good.
Thank you.
That's it for this episode of The Changelog.
Thank you for tuning in.
If you haven't heard yet, we have launched Changelog++.
It is our membership program that lets you get closer to the metal, remove the ads, make them disappear, as we say, and enjoy supporting us.
It's the best way to directly support this show and our other podcasts here on changelog.com.
And if you've never been to changelog.com, you should go there now.
Again, join changelog++ to directly support our work and make the ads disappear.
Check it out at changelog.com slash plus plus.
Of course, huge thanks to our partners who get it, Fastly, Linode, and Rollbar.
Also, thanks to Breakmaster Cylinder for making all of our beats.
And thank you to you for listening. We appreciate you. That's it for this week. We'll see you next
week.
Real-time follow-up: YAML is not Turing complete, according to this Hacker News comment from
January 2018. So there you go.
Okay, so about YAML. I don't want to go into too much detail, but basically,
YAML is a big contention point. It's a big, big contention point right now. And there's something
called Starlark. It's a programming language, if you wish,
that you run inside YAML.
It's really crazy.
Wow.
But it's something that sticks
and there's like Jsonnet and ksonnet,
and there's like so many ways of doing this
and everybody has their own way.
But the one thing that really seems to be
coming through and through
is this Starlark templating language,
because you write code, you write functions
inside of your YAML.
They get interpreted and then YAML gets
combined and you end up
with a really nice final
big document. It's able to
manipulate YAML by
writing the functions inside of the YAML.
It's really weird.
But it's in a good way
because AWS does the opposite. AWS, I kid you
not, they make you write functions as YAML. So you do dash increment, dash one, dash one, and it will know
to increment one plus one. And that's just crazy. No, this thing does something else. You do
comments, and it knows, like you say, this is like a function call.
And if you look at ytt, go to get-ytt.io.
Okay, ytt.io.
Yeah, my internet, there's a problem with it.
I mean, even get-ytt didn't.
I'm on it.
All your YAML shaping in one tool.
Yeah.
Templates and patches needed to easily make your configuration
reusable and extensible. Works with your own and third-party YAML configuration. So click on Try in Playground... okay.
Oh, so this is it right here.
That's it. And guess what? We use ytt.
Oh, we do? Changelog uses ytt?
Yeah, to template all these things.
Oh, nice. So many things that we use, I don't even know.
Yep. Love it.
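For the curious, here's a tiny sketch of what "functions inside your YAML" means in ytt. The resource and values are made up, but the shape is the real thing: lines starting with #@ are Starlark, and they evaluate into plain YAML.

```yaml
#@ def labels(env):
app: changelog
env: #@ env
#@ end

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  labels: #@ labels("production")
data:
  #! "#!" is ytt's comment syntax; the value below is computed at template time.
  replicas: #@ str(2 + 1)
```

Run it through `ytt -f config.yml` and the annotations disappear, leaving ordinary, fully expanded YAML.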