The Changelog: Software Development, Open Source - Inside 2020's infrastructure for Changelog.com (Interview)
Episode Date: November 6, 2020
We're talking with Gerhard Lazu, our resident SRE, ops, and infrastructure expert about the evolution of Changelog's infrastructure, what's new in 2020, and what we're planning for in 2021. The most notable change? We're now running on Linode Kubernetes Engine (LKE)! We even test the resilience of this new infrastructure by purposefully taking the site down. That's near the end, so don't miss it!
Transcript
Kubernetes is everywhere. You can't avoid it. There's a lot of documentation, examples, guides,
but we go beyond that, right? We show you how to run a web application in production with Kubernetes,
which apparently everybody's doing these days or trying to figure out, and there's like so many
opinions. And so how do you actually do it? Well, we'll show you how. So changelog.com itself
runs on Linode Kubernetes Engine.
It's proof that it's easy, straightforward, and it works.
And we have all the commits to back this up.
We have all the code to back this up.
You can see what choices we've made.
And I really love what we have built.
And I really love that we can keep it real.
We can still deliver business value, right?
No one stopped anybody from shipping.
And it's just a bunch of us. It doesn't take a team of 10, 20, 30 people to do this.
Bandwidth for changelog is provided by Fastly. Learn more at fastly.com. Our feature flags are
powered by LaunchDarkly. Check them out at launchdarkly.com. And we're hosted on Linode
cloud servers. Get $100 in hosting credit at Linode.com.
What up, friends?
You might not be aware, but we've been partnering with Linode since 2016.
That's a long time ago.
Way back when we first launched our open source platform that you now see at ChangeLog.com, Linode was there to help us and we are so grateful.
Fast forward several years now and Linode is still in our corner behind the scenes helping us to ensure we're running on the very best cloud infrastructure out there.
We trust Linode.
They keep it fast and they keep it simple.
Get $100 in free credit at Linode.com slash changelog. Again, $100 in free credit at Linode.com slash changelog.
What's up?
Welcome back, everyone.
This is the Changelog Podcast featuring the hackers, the leaders, and the innovators in the software world.
I'm Adam Stacoviak, Editor-in-Chief here at Changelog. On today's show, we're talking with Gerhard Lazu, our resident SRE, ops, and infrastructure expert here at Changelog about the evolution of our infrastructure.
What's new in 2020? What are we planning to do in 2021? And what are we using today? The most notable change? Well, we're now running on Linode Kubernetes Engine, LKE.
We even test the resilience of this new infrastructure by purposely taking the site down live on the show.
But that's near the end, so don't miss it.
And for those longtime listeners out there, you may have noticed a change at the top of the show.
And I want to welcome LaunchDarkly as our newest partner here at ChangeLog.
They'll be powering our feature flags.
Check them out at launchdarkly.com.
All right, let's do the show.
So longtime listeners
of the Changelog
all know Gerhard Lazu.
Recent listeners,
maybe not so much.
If you've been listening back
to last December,
you've heard Gerhard's voice before
as he went to KubeCon
and had some awesome interviews
late last year.
If you've been around for a while, you've heard him on our
2018 infrastructure, our 2019 infrastructure,
and today on our
2020-21
infrastructure.
2020 never happened. If anybody asks me,
it never happened.
For those who haven't heard about you, Gerhard, from our perspective,
maybe we consider him our SRE for hire, our remote infra guy that we call when we need help.
And he's been helping us for many years. We appreciate you for that. For the brand new
listeners, Gerhard, what's your background? Where are you coming from? So the one thing that I really, really enjoy is infrastructure.
Even more so breaking it, understanding its limits, and then putting it back together.
It's just this need to understand how something works at a very deep level.
And then taking all the building blocks and putting them together much better than they were before.
That's what we've been doing with changelog.com infrastructure for many, many years.
Half the stuff you don't even know, right? That's been going on.
That's right.
It was all for the best, trust me. We took many systems apart, we put them together,
we tried different components over the years. And I feel that right now we are in a very good place.
I mean, as challenging as 2020 was,
we managed to complete our migration
to Linode Kubernetes Engine.
For the listeners from previous years,
we have been running on Linode for many years now.
They have an amazing infrastructure and amazing service,
and we have a great relationship with them.
And they somehow managed to keep things simple even with all this complexity. So over the years we had different setups, but right now we settled on Linode Kubernetes Engine. It's simple, it's
performant, it allowed us to do many things very quickly, and more importantly, it sets us up for a great future.
Yeah, so to go back just a little ways: back in 2018 we were running Ansible scripts and Concourse CI.
You can go back and listen to that episode; we've done one of these per year for the last three
years. This is our third annual infrastructure episode. In 2019 we replaced that stuff with Docker Swarm
and a few other goodies that I can't recall off the top of my head,
but Gerhard knows inside and out.
And these infrastructure setups all come with an accompanying blog post,
open source code, like how we did the decision making.
So also a companion to this episode
is Gerhard's annual blog post.
For 2020, we wanted to move from Docker Swarm into Kubernetes, which was really the goal and what we've accomplished here in October.
We accomplished it before October, but here we are in October talking about it.
So tell us about where we were last year and the things lacking from that setup, things that we wanted, and how this transition is accomplishing some of those goals.
I think that's a really good place to start.
Because last year, as exciting as it was to roll out that infrastructure for 2019, we were using Docker Swarm.
And the big difference was that we didn't have to install Docker.
We didn't have to do any of that management because it came with the operating system we were using,
CoreOS at the time, and CoreOS out of the box just had Docker, so we didn't have to install it.
So there were fewer things for our scripts, right, our Ansible, to do, and we could switch to something like Terraform, and we could worry about
basically managing not just the VM but also integrating with a load balancer, a NodeBalancer
in Linode speak. And it was a much simpler configuration, but it still meant that we had a single VM.
And some might frown upon that, like why a single VM?
But looking at our availability for the entire year, it wasn't that bad.
And any problems that we had were fixed relatively quickly, except one.
We may go into that later.
But for the entire year, our downtime was, um, just under four hours. Sorry,
we had downtime of less than four hours. That was pretty good for a single VM. So it just goes to show
that some simple things can work and you can push them really far. And I know that Jared is a big fan
of simple things because, you know, they're easy to understand; when something goes wrong, it's easy to
fix it. And I know that Adam was very excited
about us going to Kubernetes. We wanted to do that for a while, but the time wasn't right. And it wasn't
right because Linode didn't have a simple one-click Kubernetes story. You had to do a bunch of
things; you could do it if you really wanted to, but it wasn't easy. And then in 2019, at the end,
November, the magic happened: Linode Kubernetes Engine entered beta.
I was at KubeCon.
I met with Hilary Wilmoth and Mike Catrani from Linode.
We gained access to Linode Kubernetes Engine.
It was in beta.
And with one command later,
we had a three-node Kubernetes cluster.
And that was really simple.
That was like the experience that we wanted
and were waiting for.
And once we had that,
things kind of flowed from there.
It was really simple to add all these other components.
Now, compared to what we had before,
we had to worry about, I suppose,
the migration from CoreOS to Flatcar
because CoreOS became end-of-life, right?
With the acquisition of
CoreOS by Red Hat. So we had to do that migration, and we were approaching, we knew that the end of
life would come. So rather than doing that and continuing with the single-VM Docker Swarm
complications, we went to something simpler, which was Kubernetes, because we had this one API and we could provision everything, which meant less Terraforming. We didn't have to provision
NodeBalancers, we didn't have to create volumes and then, like, attach them to VMs using Terraform.
We didn't have to do any of that; this Kubernetes API would do all those things for us, which meant
that it was a much simpler system to work with.
Now, when you say something simpler, probably alarms go off in people's heads, because Kubernetes has a reputation of being very complex, not
simple.
Do you think that's not true?
Are you talking about from a different perspective?
It's simpler.
I think there is complexity in everything.
So even if you have
like a single VM, some things may be simpler, but other things will be harder. So the trade-offs
which you're making: about packages, how to install them, where to get them from; volumes, as I said,
formatting, how to format them, all those things you need to do. Load balancers, configuring them, TLS certificates.
I mean, these things are still required.
Now, you may be familiar with that approach and maybe that's why you think it's simpler.
But if you use something like, for example,
external DNS for automatic DNS management,
which is a component that you just deploy to Kubernetes,
you don't have to go and manage your DNS with Terraform or manually or Ansible or anything like that. And it's this combination of the different components which have matured over the years,
which you run in Kubernetes, and then they in turn integrate with everything around.
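As a rough illustration (hypothetical names, not Changelog's actual manifests), external-dns typically picks the hostname up from an Ingress like this and creates the matching record at the DNS provider:

```yaml
# Hypothetical Ingress; external-dns reads the host below and creates the DNS
# record at the configured provider (DNSimple in Changelog's case).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: changelog                 # hypothetical name
spec:
  ingressClassName: nginx
  rules:
    - host: changelog.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: changelog-app   # hypothetical Service name
                port:
                  number: 4000
```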
So for example, certificates, we used to pay for certificates before and we had to wire
that together and set it up in the load balancer and set it up in our CDN and do all those things.
Now with cert-manager it's much, much simpler. We are getting it via Let's Encrypt, it's all integrated,
it all plays nice together. So while what happens behind the scenes is still complex, these
components that you can pick and choose, and with the maturity that comes over the years,
it is simpler. It is a simpler setup. So cert-manager is a Kubernetes component.
What is it called? Is it a pod? Is it a kubelet? Okay. So I see where you're going with this.
So let me call it additional components.
It's an additional component
that you install in your Kubernetes cluster
that gives you,
it extends the Kubernetes API with extra knowledge.
So your Kubernetes API, by default,
you say, I want a deployment.
Let's say, let's just go with a deployment.
But then how do you ask it for a certificate?
So once you install CertManager,
it's a component that in a way
teaches your Kubernetes API about certificates.
So then you can say,
hey, Kubernetes, give me a certificate.
And CertManager, it has a bunch of components inside,
but let's say it's like one thing.
It knows how to make that happen.
Gotcha.
So it's like the complexity is on the inside.
All of the difficulties and the confusion
and the technical intricacies are on the inside.
And if you can get it set up and configured
and make use of it, your life is simpler.
Exactly. Once those components are hiding
the complexity, which will be there
no matter what you do, no matter what you use,
but they allow you to ask for things
via the single API.
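As a hedged illustration of asking the API for a certificate, a cert-manager Certificate resource looks roughly like this (the names and issuer are assumptions, not Changelog's actual config):

```yaml
# Hypothetical example: once cert-manager is installed, this custom resource
# asks for a Let's Encrypt certificate and keeps it renewed automatically.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: changelog-com-tls          # hypothetical name
spec:
  secretName: changelog-com-tls    # Secret where the issued certificate is stored
  dnsNames:
    - changelog.com
  issuerRef:
    name: letsencrypt-prod         # hypothetical ClusterIssuer
    kind: ClusterIssuer
```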
And the thing which gets me really excited about Kubernetes
is that everything gets standardized behind a single API.
And it goes to the point like you want a VM
or you want a resource,
you talk to the same API
and all you have to do is install the right components
of the API or like those components
know how to translate your request into an actual thing,
whatever you may want.
So load balancers, you no longer have to provision load balancers.
Based on your provider,
whoever you are getting your Kubernetes from,
it knows how to translate that request into a load balancer.
Certificate, same thing.
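As a concrete, hedged sketch (assumed names): on LKE, declaring a Service of type LoadBalancer is what causes a Linode NodeBalancer to appear behind the scenes:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx              # hypothetical name
spec:
  type: LoadBalancer               # on LKE this provisions a Linode NodeBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443
```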
Why then do we have to wait for LKE to make it easier?
And I guess it's sort of a loaded question to some degree because every cloud needs to have its own Kubernetes engine. They have their own fork
of it or version of it that sort of runs natively in their environment. It's probably because
it needs to plug in to certain places. But we had to wait for LKE to make it possible.
Like you had mentioned, we can use Terraform beforehand and Ansible and sort of
do it ourselves,
but LKE made it, I suppose, easier on Linode.
Why is that?
So first of all, the most important thing
when you start off your Kubernetes journey in production,
you want to manage Kubernetes.
And what that means is that you want updates
to be applied automatically.
You want the control plane,
which is the API component itself, you want that to be set up separately from everything else,
and you just want to consume this API. Not only that, but you want your Kubernetes cluster to be
integrated with other things that that provider has. For example, NodeBalancers in the case of
Linode. And while you can install all
those components, it's like cobbling it together. So you want the vendor to give you an API that
is already pre-configured with a bunch of things. Not only that, but when there are updates, you want
your vendor to just take care of them. You can specify when you're okay to get updates, and you want to specify
maybe which versions of Kubernetes you want, like do you want the latest one or do you want to be
more conservative and stay behind. But you don't want to worry about updating the infrastructure
when it's, like, the core infrastructure, so to speak. So in our case, we had to update CoreOS,
right? That was like our responsibility, to update the VM.
But with LKE, we still have to do it in the sense that we have to run the command,
but that's it.
We run one command, it will do it for us.
And I'm hoping that not too far away from today,
not too far in the future,
LKE will be able to update itself based on a schedule.
I mean, that would be the dream, right?
So that the vendor will keep the API automatically updated,
all the right versions for us already deployed,
and then we are only responsible
for the components that we add on top
and inside of this API.
As I mentioned, cert-manager,
ingress-nginx, and a bunch of other things that we use.
It's like for any application,
there are N concerns that must be taken care of.
Like Gerhard said, these things have to happen.
Your DNS has to happen.
Your certificates have to happen.
And every application has its own number N.
Maybe ours is 100, it's a pretty simple application.
Maybe somebody else's is 1,000 things or 1,200 things,
whatever it is.
And the more of those you can take off your plate
and onto your hosting provider's plate is just a win.
It makes it more achievable for you to manage less
and then to manage more.
And if you were just building your own Kubernetes deployment
on top of a VPS or on top of something
that's not LKE or a Kubernetes engine, there's a whole
bunch of things that you have to take care of now that you'd rather not because maybe
it's not your domain expertise.
Maybe it's just a huge time sink.
And the more they can do, probably better than you can do it, the better off you are.
Makes sense.
That's right.
Another thing which, again, I'm making many assumptions here, but I'm going to mention
this, is the whole declarative nature of Kubernetes. You tell it what you want to happen, and it has this,
like, way of describing things, and it will just make it happen. So I don't have to tell it how it
needs to get the certificate; I just tell it, these are the credentials, you just make it happen. And
by the way, when it expires, I don't care, I just want
you to renew it, because I never want an expired certificate, right? So always keep my
certificate up to date. The same thing would be true for some of our services. I think that's where
we are going next, in that we want this automatically updating thing, like an updating system. So you want to
automatically, for example, update PostgreSQL.
Well, how can you do that if it's, like, not a managed service, or if the component doesn't
know how to update itself? So that's, like, another way that Linode, for example, could help, with,
you know, their managed database service, in that if we can provision those via the Kubernetes API,
which I'm really hoping we'll be able to, then we can offload that responsibility to Linode again.
And we always say, just give me a new one,
give me a new one, give me the latest one,
and do backups for me and do all those things.
But we are describing more of what we want
and doing less of how that thing happens
because there's no value in our case to spend
like, to basically reinvent how we do database backups, right, or monitoring, right? I mean, I would
have thought that by this point things would have standardized. That's another thing: Kubernetes
is like a standardization of how to do monitoring, how to do logging. And to begin with, I know there's
an explosion of ways, and there's so many ways you can achieve this,
but I'm hoping that over time things will, like, settle on a clear winner, so to speak.
So for example, we chose ingress-nginx to do the TCP routing, but there's so many other ways you
can achieve that. So how do you give all this choice, and how do you give all these options
to people, but at the same time have, like, a set of
building blocks that just kind of make sense? That's almost like the next frontier. And I think I see
providers that offer more than just Kubernetes; that's, like, the entry point, if you wish,
but you get, like, curated Kubernetes experiences which know how to do all these things more and
more: centralized logging, monitoring as I mentioned, security built in, policies, all those things.
Yeah, the declarative aspect is huge for me
because I like to just declare the way things should be,
and I just don't care about the details anymore.
I remember as a young man, I really cared about the details.
And I loved scripting.
And I'm like, I'm going to write the script,
A, and then B, and then C, and then maybe run D, maybe not. And then, like, I took joy. I still like to script things
sometimes, but I really took joy in like the details, the imperative details, the programming
of how to roll out a thing. I used to roll my own deploys with rsync and all that kind of stuff.
But I just don't have time for that. I just want to say, hey, I want an SSL certificate on this domain,
and I want it to always be fresh.
And I just wanted to configure it,
and the details of how that happens are just not my concern.
And it's really a shift.
It feels good to just be able to declare.
I mean, there's almost like a God complex.
Like, I declare this is going to happen, and then it happens.
It's like, oh, that feels pretty good, right?
So I think that's definitely a holy grail
and a shift from a time where everybody is writing code
to do their operations.
And now we're writing YAML to do our operations.
Whether you like YAML or not,
it's a lot simpler than a Turing-complete language.
Although, is YAML Turing complete?
It might be.
It's simpler than code, generally. Gerhard, you probably know, is YAML Turing complete? I don't know. Honestly,
I don't know. Mind blank, because I'm already thinking about something else. So I don't want
to lose my idea. Go ahead. Yes. Move on. So not only that you declare how you want things to be,
but if anything diverges from what you declared, it will automatically try to
reconverge back on that point. And that's the really cool thing about VMs going away, right?
You can lose a VM and it's okay because the system knows what you want. And if that's not true,
it will try to reconcile in that state. So you no longer have to worry about VMs going away and
your apps going down, right? Or your database going down or whatever. It will automatically spin up on one
of the healthy VMs. Not to mention resources, like finding where to put things, you don't have
that problem anymore. And I remember many years back when Kelsey Hightower gave a demo, the
Tetris demo, right? I mean, that was it. That was like Kubernetes in one very simple picture.
It will figure a bunch of things out
that you thought were important, but aren't.
And figuring out what your capacity is
and where you need to put things,
you need to go up or do you need to go down on scale,
all those things can be taken care of.
I think that's super powerful.
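That Tetris-style packing is driven by resource requests; as a hedged sketch (the image name and numbers are assumptions, not Changelog's real values), the scheduler places pods based on declarations like these:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                                        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: changelog
  template:
    metadata:
      labels:
        app: changelog
    spec:
      containers:
        - name: app
          image: thechangelog/changelog.com:latest # hypothetical image name
          resources:
            requests:                              # what the scheduler uses to place the pod
              cpu: 500m
              memory: 512Mi
            limits:                                # hard caps once it is running
              cpu: "2"
              memory: 2Gi
```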
This episode of The Changelog is brought to you by Teamistry. Teamistry is a podcast that tells the stories of teams who work together in new and unexpected ways to achieve remarkable things.
Each episode of Teamistry tells a story, and in each story, you'll find practical lessons for your team and your business.
I got a sneak preview of season two, and I couldn't stop listening.
I was once in the U.S. Army, and nothing gets me more excited than seeing teams achieve great things when they learn to work together.
And that's exactly what this show delivers.
This season, the show travels deep into the underwater caves of northern Thailand to discover how divers, medics, soldiers, and volunteers saved a group of trapped teenagers, explains
how a world-renowned watch company pitted their two factories against each other in
an attempt to become the best watchmaker in the world, and finds out how Iceland went
from having one of the highest COVID-19 death rates
to a model example of how to deal with the virus.
These are stories that entertain,
and they're packed with business cases you can actually use.
Season 2 of Teamistry is out right now.
Search for Teamistry anywhere you listen to podcasts.
Check the show notes for a link to subscribe,
and many thanks to our friends at Teamistry for their support. So it's worth noting that we don't really need what we have, I suppose, around Kubernetes.
Like this is for fun to some degree.
One, we love Linode, they're a great partner.
Two, we love you, Gerhard, and all the work we've done here.
We don't really need this setup.
It's about, one, it's about learning ourselves, but then also sharing that.
So obviously, changelog.com is open source.
All the code is open source.
So if you're curious how this is implemented, you can look at our code base.
But beyond that, I think it's important to sort of remind our audience that we don't really need this.
It's fun to have and actually a worthwhile investment for us because this does cost us money.
Gerhard does not work for free.
And it's part of this desire to sort of like learn for ourselves
and also to share it with everyone else.
So that's fun.
It's fun to do.
There's something which I'd like to add here.
And I would like to answer the question of how does this help you,
a ChangeLog listener?
So Kubernetes is everywhere.
You can't avoid it.
There's a lot of documentation, examples, guides.
But we go beyond that, right?
We show you how to run a web application in production with Kubernetes, which apparently
everybody's doing these days or trying to figure out and there's like so many opinions. And so how
do you actually do it? Well, we'll show you how. So changelog.com itself runs on Linode Kubernetes Engine.
It's proof that it's easy, straightforward, and it works.
And we have all the commits to back this up.
We have all the code to back this up.
You can see what choices we've made.
And I really love what we have built.
And I really love that we can keep it real.
We can still deliver business value.
No one stopped anybody from shipping.
And it's just a bunch of
us. It doesn't take a team of 10, 20, 30 people to do this. It takes a person, an hour here, an hour
there; when you add it all up, maybe it's a few weeks in, I don't know, six months, five months, however
long it was. It doesn't take that long. And we enjoy working with our partners. We enjoy working with Linode.
And I would like to give a shout out to Andrew Zauber,
the Linode engineer that has been with us through all this.
And we have not only been improving Linode Kubernetes Engine,
but we also had some discussions about the improvements that would make sense.
Maybe things that weren't as obvious until we started using it
or like a bunch of people
started using it and giving all this real-world feedback. So we want you to succeed with Kubernetes;
like, Changelog wants you to be successful with Kubernetes. And not only that, the entire ecosystem,
there's so much choice, and we haven't made the best choice, but we made the choice that makes
sense for us, given our constraints.
And it works.
We are transparent about it.
We share everything.
And yeah, it's all out there.
Yeah.
Let's talk some more about some of the choices that we made.
Like Gerhard said, these are choices that we made for our circumstance and our application.
They're not necessarily the ones that you should make, but it's an example of a choice
that you can make.
We can give our thoughts and opinions
on whether or not it's working out,
or was it a good choice, bad choice,
why did we choose that?
Part of that is what this show is for.
But also continuing forward after this show,
we'd love to have conversations with listeners
and everybody about these things.
We mentioned a few components of the Kubernetes API that you put
together. The cert-manager, you mentioned the NGINX ingress. There's also some DNS, external-dns
is another, DNS management. Is it the exact same thing as cert-manager, only it has a different
function? It's an extension. It's a simple extension so that we can provision wildcard
TLS certificates; we needed that to do the integration with our DNS provider, which is DNSimple.
And yeah, I mean, those are, like, the four core components, and I simply picked them based on
maturity, based on community, based on how things are going. And integrating them was fairly simple
and straightforward.
And you can see how we've done them.
So ingress-nginx, super simple for TCP routing.
And it automatically integrates with NodeBalancers.
We know NodeBalancers, so that was great.
I'm not going to go over all of them,
but what I'll mention is,
I'll mention kube-prometheus,
which is the operator.
It's an operator that we use
to set up Grafana and
Prometheus for Changelog. If you go to grafana.changelog.com, that's basically where we host
all the metrics for Kubernetes. What we don't have currently, but we would like to add,
is integrating Prometheus with all the services that we use. So for example, for a database,
we use the crunchy data PostgreSQL operator.
So you would like to integrate kube-prometheus with our PostgreSQL database.
Same thing for ingress-nginx,
which we currently don't have.
We're just looking at Kubernetes metrics and system metrics.
But it's relatively simple and straightforward
to add all those extra things.
And I suppose that's what's coming next
so we have better visibility into what happens
inside of changelog.com and all the services that we depend on.
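For the curious, wiring an extra service into kube-prometheus is usually just a ServiceMonitor; a hypothetical sketch (the labels and port names are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres                   # hypothetical name
  labels:
    release: prometheus            # must match the Prometheus serviceMonitorSelector (assumption)
spec:
  selector:
    matchLabels:
      app: postgres-exporter       # hypothetical label on the metrics Service
  endpoints:
    - port: metrics                # hypothetical port name exposing /metrics
      interval: 30s
```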
Another aspect of the setup you have is Keel,
which was news to me.
We also have K9s, which is the coolest part of the setup
from my perspective, so we should talk about that.
But as we get into Keel, it might be useful,
it's useful for me as well,
even as someone who's a part of this party,
to just understand what does a deployment look like?
So from I push a commit to GitHub,
our master branch on GitHub,
then to what happens next,
because we have a GitHub-based deploy, right?
We're pushing and it
deploys on our behalf. Can you walk through just like the, you know, this, then this, then this,
the nuts and bolts that I don't want to have to care about, but when things break down, we have to
care about. Okay, so let's just introduce Keel very briefly. Sure. Keel automates updates of Helm deployments or DaemonSets or StatefulSets or Deployments. So
when there's an update to an image or to something, it will automatically update, or it can update
based on certain rules, whatever is running in Kubernetes. So in our case, we use Keel to trigger
automatic updates for the app itself. And there's a bit of controversy here in that GitOps is up
and coming. And I don't want to go into that now, but that's like another approach. So one approach
is to do GitOps and use Flux or Argo CD, or use something like Keel, which goes against some of
the things that GitOps stands for. But I'm not going to go into that now. To your second question,
how does everything work?
In 2018, I made the decision
to separate building and publishing and testing
from the actual deployment.
So otherwise what you have is CIs
that deploy code into production.
And I think that is very dangerous and very wrong
because your CI has the keys to your production environment.
And I wouldn't do that.
So our CI stops at publishing images to Docker Hub.
And a push to GitHub triggers a build in CircleCI,
which runs tests, which compiles assets, and if everything is fine,
pulls dependencies, and it builds a Docker image. And the last step is to publish the artifact,
the Docker image, or the container image to Docker Hub. And that's it. That's where CI stops.
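For a sense of the shape of that pipeline, here's a hedged sketch of a CircleCI config; the job names, build image, and commands are assumptions rather than Changelog's actual .circleci/config.yml:

```yaml
version: 2.1
jobs:
  build:
    docker:
      - image: cimg/elixir:1.11                      # hypothetical build image
    steps:
      - checkout
      - run: mix deps.get && mix test                # run the test suite
      - setup_remote_docker                          # get a Docker engine for image builds
      - run: docker build -t thechangelog/changelog.com:latest .   # hypothetical image name
      - run: |
          echo "$DOCKERHUB_PASSWORD" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
          docker push thechangelog/changelog.com:latest
workflows:
  test-build-publish:
    jobs:
      - build
```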
Now, what we used to have before, we had this very simple loop that would continuously update
the Docker service. Super simple. If there's a new one... it's like a bash while-1 kind of a thing.
That's it. That's exactly it, it was like three lines of code. Super simple. Okay. So Keel is a bit more
complex than that. But the principle is very simple. Because why wouldn't you want to run the latest version of your app,
that passes all its tests, has all the dependencies, in production? I mean, why wouldn't you want that?
I can't... like, that's what we want, right? Like, you want your commits, if everything is fine, to go into
production, right? That's what you want. So, like, maybe the only time I think you wouldn't want
that is like what if it mismatched your database schema or something and that was unable to resolve and then you like want to roll it back,
but you wouldn't know that until you rolled it out. So of course you want that.
Yes. Yeah. You can do, like, things if you have migrations. By the way, every deploy runs
migrations. So when the new app starts... we do blue-green deploys, by the way; it's all
handled very nicely by the deployment model in Kubernetes, so we don't have to worry about any of that. So when the new version comes
up, you're right, you run, like, the migration, and maybe something can go wrong. So yes, but if the
app fails to start, you have readiness probes that will not put it in the load balancer, and if it
crashes, well, there you go, it crashes. What's a probe? Is it like a thing that says, hey, are you ready? Hey, are you ready?
So there's a startup probe, there's a liveness probe, and there's a readiness probe in Kubernetes.
There are like three types of probes. The readiness probe determines when the pod is ready.
And ready means, when is it ready to serve traffic, in the case of a web app. So you need to be
listening on the TCP socket that you say you'll be listening on, and maybe you can do checks, and we
determine if, like, you get 200 back. So is the HTTP response 200? And if it is, the app is ready
to be put in the load balancer. So you declare what ready looks like. Exactly. Gotcha. Exactly.
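To make "declaring ready" concrete, here is a hedged sketch of a readiness probe; the health path, port, and image name are assumptions, not Changelog's actual manifest:

```yaml
# Hypothetical container snippet: the pod only joins the load balancer once
# this check starts returning HTTP 200.
containers:
  - name: app
    image: thechangelog/changelog.com:latest   # hypothetical image name
    ports:
      - containerPort: 4000
    readinessProbe:
      httpGet:
        path: /health                          # hypothetical health endpoint
        port: 4000
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3
```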
So the app may keep crashing
and that's okay.
The old app will not be taken down.
And until the new app is ready,
it runs all the migrations
and everything is fine.
It won't promote it.
There is a risk of the new version
doing a migration
that the old version can't work with.
Right.
I was just thinking about
that, and you will have taken production down. Yeah, but in all the years that you've been working on
Changelog, how many times did that happen? I can't think of one. Exactly, zero. So in four years it
never happened, right? Like, Phoenix, 2016, that's what I remember, since we started developing in Phoenix,
we've never had that situation happen to us.
No.
Well, and our schema is pretty stable.
I mean, it's rare at this point
that we make massive changes to our schema.
These things are pretty well thought out
and in place and working.
And usually it's additive.
Every once in a while,
I'll decide I actually hate the name of something
I named four years ago.
And because I'm pedantic and a completionist, I can't merely rename it in the code.
I also have to rename the database table because there's a mismatch and I can't have that.
And that would be a major change.
I'm going to rename a database table, but it's very rare.
And so most of the changes then are additive.
I'm going to add a key.
I'm going to add a column.
We're going to add a new table because we're trying something new. These things rarely cause
data problems and migration problems.
Well, again, in four years, I don't remember one.
Right, me neither.
Yeah, we never had an issue with this.
But what if, here's where we can spend all the money, right? As engineers, what if?
I know, right? That's the danger.
And that's where you spend all your money
right there, on that what-if. Anyways, keep going. Exactly. So Keel does something very similar to
what our while loop was doing, but a bit more, in that Docker Hub now sends a webhook to Keel,
which is listening, like, on a public IP, there's a host, and it listens on these webhooks for when there's been an update to the image that we
publish. And if there has, Keel will trigger an update to our deployment. It all happens seamlessly,
automatically, the new version comes up and everybody's happy. It also does periodic polls;
it's polling Docker Hub to see if there's a new version. And if there is, maybe the webhook failed to be sent
or we missed it.
I haven't seen it happen.
What I did see happen is Keel locking up.
We just saw that before the show, by the way.
But we're not running the latest version of Keel.
So maybe that's something worth updating, I suppose.
But other than that,
whenever you do a commit,
a few minutes later, you have it in production.
And it's been like that for years for us.
So it works, and it's simple.
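For reference, Keel is typically wired up with annotations on the Deployment it watches; a hedged sketch (the policy and schedule shown are assumptions, not necessarily Changelog's settings):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # hypothetical name
  annotations:
    keel.sh/policy: force            # update whenever the tracked tag changes
    keel.sh/trigger: poll            # poll the registry as well as accepting webhooks
    keel.sh/pollSchedule: "@every 5m"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: changelog
  template:
    metadata:
      labels:
        app: changelog
    spec:
      containers:
        - name: app
          image: thechangelog/changelog.com:latest   # hypothetical image name
```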
Now, we can make it a lot more complex,
and I would like us to look at GitOps sooner or later.
Tell us what GitOps is, because you keep saying,
I'm not going to talk about that, but I don't know what it is.
Is this like you let your Git do your ops? What's GitOps?
Okay, so GitOps is a way of implementing deployments, so you have continuous deployments,
you're continuously deploying code, but it's a way of implementing continuous deployments
in cloud-native applications. So if you're using Kubernetes, or cloud native, or at least that's
the tagline. And what GitOps does, it allows you to define
everything about your application using Git,
including which version you should be running in production.
So if you were using GitOps with changelog,
there would be a commit for every single deploy,
which would need to be approved, merged somewhere,
so we would roll out the latest version.
So you're basically versioning
what runs in production.
To some extent,
we already are doing that
because all our YAML
that defines all the Changelog services
is in Git.
What we don't have,
we don't apply those changes by some sort of, like,
an automated system. It's either you or me that says apply, but we have make targets which apply
all those things, and that's how we roll out changes. But for the app, which changes a lot more
often, we don't run commands, we don't have a CI running commands every single time there's an
update. We don't do that. The app, we have Keel that automatically updates
whatever's running in production.
And why would the GitOps advocates say
that we're doing it wrong, quote-unquote?
It's because they want that history.
They want that to be like an atomic aspect
of their application, deployments,
to be like explicit, atomic, logged things.
Is that why?
Yes, that's one of the reasons.
The other reason, the more important one,
is you always know what you're running in production.
So if I asked you what version of the app
we're running now in production,
you say master.
But master always changes.
Sure.
So imagine if you're deploying a hundred instances of your app. Just imagine
that for a second. If you're deploying a hundred app instances, by the time the 90th
instance gets spun up, if it's looking at master, it may pull a different version, because master
may have changed during the deployment. And if you have many developers pushing lots of code
and master always keeps changing,
then you could have multiple versions of your application running
and you wouldn't even know it.
Gotcha.
Not to mention that when something crashes and master has changed,
again, the version that you thought you were running will change
because you pull the latest image.
And there's, like, a bunch of things. For
example, um, in Kubernetes you're advised to use versions, the exact version of what you're deploying,
like your image. Well, we're using latest, and latest means whatever is latest, and that changes.
So from that perspective, we are breaking, you know, the fully declarative, in a way,
because we can't recreate the same thing.
Multiple runs of the same thing.
Sorry, not declarative, idempotent.
We don't have idempotency because multiple runs will end up with different states
because latest is fluid.
It can be anything.
Gotcha.
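In manifest terms the difference is just the image reference; a hedged fragment of a pod spec (the tags are made up):

```yaml
# Fragment of a pod spec, for illustration only.
containers:
  - name: app
    # image: thechangelog/changelog.com:latest            # fluid: "latest" can change under you
    image: thechangelog/changelog.com:2020.11.06-abc1234  # immutable tag a GitOps tool would pin (hypothetical)
```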
So does Flux then, or
Argo CD, do they capture the
version? Yes. Essentially for us, that way
all the instances that roll out
or potentially roll out if we have more than
one, like we might? Exactly. And every single
change, yes, is versioned. Yes.
And tracked separately.
But then, like right now,
all our code, including the infrastructure
code, is in a single repository.
With Argo CD or Flux, you need another repository that tracks what gets deployed,
because if you think about it, if a commit triggers another commit, and the commit triggers another
commit, you have a continuous loop of commits triggering commits, and it never ends, right?
So you have an infinite commit loop. By capturing what you've deployed, you're bumping the version
of the artifact that's getting
deployed, and you just end up with that.
Just recursively does that.
So we would need to have another repository
which keeps track of what gets deployed.
And from our perspective, we wanted
to keep changelog self-contained and that
if you pull down one repository, you have everything
there. Yeah, we used to have two
repos. We had an infrastructure repo.
We had the source code repo. And we were happy to get rid of the other one and have just one place where everything lives so
simple yeah but maybe we can somehow i don't know configure the ci to ignore certain commits so it
won't build if certain paths change i mean that i know it's possible in some cis and then we can also maybe do argo or like flux whatever we choose to
maybe not deploy every commits maybe be a bit more selective i don't know i don't know but
maybe we can exclude like basically we can break this infinite loop yeah so does a a new version
get spun up every time there's a new push to master so if i'm working on something and jerry's
working on something and we just yes happen to push in similar time frames,
his push triggers a new version in this GitOps world.
His push initiates a new version, mine does too.
And obviously latest in would be the latest version
and eventually to get to my version if I'm after Jared, for example.
So if Jared commits, then I commit.
And there's two new versions that are going to roll out,
but mine being the latter one will be the latest version that's rolled out.
So it'll eventually just get to it.
So there's a time frame potentially even in there, right?
Because you have to sort of initiate or stand up two different versions,
roll that one out, and then roll the next one out.
Is that roughly a scenario? Is that how
that works? That's how ours works.
Right? Right. Ours works
exactly the same way, but basically
Keel will trigger multiple deploys. So
every push to go through the
pipeline takes a few minutes.
So even if they enter at the same time
and we don't have parallelism, so we
do one build at a time,
so you have, what, Jared's
build in your example goes out, gets deployed; a few minutes later your build arrives and goes out
again. So you'll have two deploys within the span of a few minutes. But right now we have a single
app instance, right? So we don't have, like, multiple apps running in production, and there's, like,
reasons for it we don't need to get into now. But we have only one app version, one app instance, running at any point. If we had a hundred app instances which were
running across, let's say, I don't know, 10 hosts, 10 VMs, for scalability reasons, which again we don't
have, but what if, to go back to what Jared was saying, then we may have problems. Yeah, we're
solving our problems, not everybody's, but we're
at least showing that it's possible, I suppose, and aware that we can, not so much that we need to.
Yeah, which is important. Yeah, I mean, we chose Keel because it's really simple.
I mean, a lot of the choices which we made are because it's simple and it suits us.
And I would argue that it would suit the majority, unless you're, like, a really big team
with, like, a really big Kubernetes deployment and investment and all that; then you may need to do
things differently, more certainly than not. But if you're, like, a small team of, let's say, up to 10
people that have a bunch of apps, this may work perfectly well for a long, long time. What's up, friends? Have you ever seen a problem and thought
to yourself, I bet I could do that better? Our friends at Equinix agree. Equinix is the world's
digital infrastructure company, and they've been connecting and powering the digital world for over
20 years now. They just launched a new product called Equinix Metal.
It's built from the ground up to empower developers with low latency,
high performance infrastructure anywhere.
We'd love for you to try it out and give them your feedback.
Visit info.equinixmetal.com slash changelog to get $500 in free credit to play with,
plus a rad t-shirt.
Again, info.equinixmetal.com slash changelog.
Get $500 in free credit.
Equinix Metal.com slash changelog. Get $500 in free credit. Equinix Metal.
Build freely. Let's talk about availability because one of the reasons why you even build this kind
of infrastructure is for resilience, for availability.
And I suppose to test that, let's take the site down.
I love that idea.
I think that's the best idea we've had all evening.
I think so, too.
Before we do it, what's going to happen?
What should happen?
Okay, so what should happen is
we have a three-node Kubernetes cluster.
The application and the database
are running on one node.
They're close together.
And when we take the VM down,
we'll just delete the node,
another VM should be created.
And in the meantime, the website, the app,
should migrate, and the database, from the VM that was deleted
onto one of the other two VMs
which are still running. And I say VMs because it's Kubernetes nodes, we have three. So we delete
one. We expect Linode to recreate the VM, reprovision, notice that, hey, there should be
three, there's only one. And in the meantime, we expect the website to be recreated on one of the other two VMs, and
the database, and everything to be back together. That's what we expect to happen. Okay, what are the
odds? What are the odds? Where are you sitting, like, and how many nines are you thinking this is going
to work? So I don't expect this to take more than 10 minutes, and last time when I tried this it was seven minutes. How
many nines? Less than 0.000-something, right? Well, you want to know, not the nines of availability,
how many nines are on your confidence level. Oh, I see, how confident are you? Uh, there's 99.9... yeah,
there's a lot of nines, let me put it that way. There's a lot of nines. All right, more than I have. Let's do this. Well,
if it doesn't work, then we can fix it, right? Right. And we can just edit the show and act like it
worked. Exactly. But no, I think we should leave, like, the real thing, right? With, like, proof, how long it
took and all that. So listeners, listen closely, there will be no edit stops here. There'll be no breaks. Okay. So the nodes, here we go.
Let's do this.
Okay.
B29D.
Let's go to nodes.
Nodes.
And I'm going to delete it.
29B.
There you go.
Delete node.
I can't, I can't delete the node from here.
Okay.
I need to go to the Linode console.
Oh, cloud console. That's okay.
Logistics.
Logistics, yeah, because I can drain it, I can do other things.
I can't just delete a node, right?
You're not supposed to do that. So you're trying to access it through
K9s, K9s, K9s,
I don't know how you pronounce it.
K9s. Which is a really cool
CLI, like a
awesome terminal app
for accessing all the information about your
Kubernetes clusters, but does not
give the ability to delete things, apparently.
Yeah. Not nodes.
For safety. You can't delete nodes for
safety reasons. We're going to delete the
VM. Not power off, not reboot, delete.
You're going to delete the VM.
See, I'm nervous. This is
now the Linode Cloud admin. Are you sure
you want to delete this?
I am.
This is permanent.
I want to take changelog.com down
and see how long it takes before it comes back up.
I'm starting to stopwatch too.
Here we go.
This is proof of the pudding, right?
If the pudding is good, this will work.
All right, here we go.
Ready, delete.
Boom.
The VM's gone.
Stopwatch started.
We'll see.
I expect to get an email from Pingdom here very shortly.
It will take a minute.
As well as a push notification to my watch.
Walk us through why the 10 minutes. What's the window there?
Why is it roughly 10 minutes? Why is it not more than 10 minutes?
What's going to happen behind the scenes now?
So behind the scenes, the VM is going away, being deleted, being stopped,
going away. The app will stop working. And it will take a while for Kubernetes to figure out that
the node is not healthy. So we can see that the node is still ready, according to K9s,
according to Kubernetes, but we know that we have deleted the VM. It will take a while for
that to be deleted. And when it's properly gone, when it's no longer there,
like, the physical VM has been powered down, we expect Kubernetes to try to re-spin or to
recreate the app on another node that's healthy and ready and ready to go. So it's still up, right?
We said delete VM, it's red. We can see it.
But has it actually been deleted?
I just got a notification from Pingdom that we are down.
There we go.
So now Kubernetes confirmed that.
We no longer have the node.
So now what's going to happen if we look at the deployment,
if you look at the app, there we go. It's down and it has not been created anywhere.
So what's the reason?
It's persistent volume claim,
a reference to persistent volume in the same namespace.
I think that's okay.
Minimum replicas unavailable.
Where are the events?
Let's see, let's go to event.
Pulled, everything's fine, 95 seconds ago.
Pulled all these things, still fine.
It's not a problem.
Let's see, let's see, maybe at pulse...
oh, there you go: Warning, FailedAttachVolume. That's what we're waiting for. So Multi-
Attach error for volume, PVC. So this volume is already attached to another VM, the one that we
deleted. So you've got to detach it, exactly, before you can reattach it to another VM. And that takes some time. That takes some time,
exactly. I don't know. Well, last time when I tested this, it took seven
minutes, end-to-end. In seven minutes, everything was back up.
We're still down, by the way. So answer this, then. I go to changelog.com right
now, and I get a 503. Service is down, essentially.
It's not available. But via ping in the terminal, I'm pinging changelog.com.
I'm still getting a ping.
Is that the NodeBalancer?
That is the NodeBalancer, exactly.
Not only that, but we're hitting NGINX.
So we have NGINX deployed on every single VM,
on every single Kubernetes node.
So we have three instances of NGINX ingress.
So you can get to the NodeBalancer, so you can get to NGINX,
which runs in Kubernetes, but you can't get to the application.
There's no application running, so it can't service these requests.
So you're getting 503s.
So this is a lot like chaos engineering here,
only we are manually introducing, we are the chaos monkey,
and we are monkeying with ourselves
while we record a podcast.
So it's like a step beyond,
like even more idiotic than chaos engineering
is what we're doing right now.
But so far we think it's working
and so that makes us feel good.
But it's just kind of hurry up and wait
to see if this thing can get reattached
and go from there.
Yeah.
We're three minutes and 40 seconds in, according to my stopwatch.
So Gerhard, we had a downtime.
So the difference between this and our old setup is this is going to auto-heal as long as it works as advertised.
Whereas last time, we had, last year, we had a downtime which lasted multiple hours, where it got into a state that it was never gonna... it doesn't auto-heal. Like, I had
to basically drive home and get a hold of you and figure it out. You want to tell that story a little
bit while I wait here? So last year, what happened: the Docker service, basically. We had
a single VM back in the day, and we were running Docker Swarm on a single VM. And the Docker service was not configured to automatically start.
I was expecting, to be honest, for the operating system
to have this essential service by default started,
but that was not the case.
So we had to manually start the Docker service
so that everything else would basically come back up.
And that was the problem.
Obviously, we fixed it since. But the Docker service, the Docker daemon in that case, was not running, meaning that there was no Changelog app, no database, none of those things, right? And that
Docker service wasn't managed, or supposed to be managed, by CoreOS but wasn't being, or something
like that? It wasn't configured to automatically restart when the operating system restarts. So what can we see here is, now we can
see that the volume failed to mount after it failed to attach. Now it attached the volume,
so we can see that both the database and the application have the volume attached.
And what I expect to happen very soon, the application to spin up. We can see the containers
creating, both the application and the database, and we can see they try to create it for three
minutes and 40 seconds. It tried to basically, it was aware, that's when it started being aware that,
hey, I have to recreate this pod, the app pod and the database pod. So four minutes, container creating. Let's see what's the state of
it. If we describe it: successfully assigned... the multi-attach was fixed. So let's see, if we look
at the logs, there's nothing there. Database backup still container creating. We're in container
creating. This one's ready now. I'm sorry, no, it just has the readiness probe.
Let's see what state it's in.
Still container creating.
The database is already running.
That's a good sign. There we go.
We have PostgreSQL.
I'm a secondary and I'm following a leader.
So this is the leader now.
The leadership changed.
We have, anyways, I'm not going to go into the details now.
No, I like this.
I feel like you could be a play-by-play commentator.
This is like sports announcer.
That's right.
This is like the radio.
When you're trying to listen to the game,
you have no idea what's going on,
and he's telling you which direction they're running.
You're over here telling us exactly what K9s is reporting back.
That's right.
You're on K9s radio.
And Adam and I have the advantage of the visuals here.
The listeners are like, what is going on over there?
Yeah, the website's still down.
Here's what's going on, listeners.
The website is still down
and Gerhard's trying to give us confidence here.
Yeah.
And it's coming back up.
It's coming.
I have confidence in this.
All I have to do is just basically, you know,
let it play out.
I know the right thing will happen.
It will reconcile.
Running.
There you go.
The app is up.
Yes, baby.
The app is running.
Five minutes later, according to this, how much do you have,
Adam, on your stopwatch?
Seven minutes. Are we back up?
I don't see it up on my side.
Pingdom hasn't told me yet.
Let me see if I can refresh the page here.
It's booting.
I'm still unavailable.
Ping never failed, so that's good. Load balancing happened.
And we're back, baby.
That's true. Official time is 7 minutes and 35 seconds,
according to at least my refresh.
There you go.
Cool.
Very cool.
So, overwhelmed or underwhelmed?
And the dashboard that provided all this was K9s.
It was K9s.
It's a great...
Yeah, so I mean, you can watch the play-by-play of failure, essentially.
Is this official observability, or are there better tools,
or is this just capable enough to be good for us for now?
Okay, so K9s, it's an ncurses interface to Kubernetes,
which means that you can do things really quickly,
really efficiently by just using shortcuts.
It runs in a terminal and you can do all sorts of amazing things
with Kubernetes without having to type all those commands,
without having to worry about shell auto-completion or whatever.
And if I remember correctly,
K9s actually won an award recently, a CNCF award.
I'm Googling this up as I speak.
What happened with K9s?
K9s award, K9s Kubernetes.
I wish I remember this.
It was 10 open source.
No, that's from last year, 2019.
There's something in 2020 where K9s was mentioned on Twitter.
2020, K9s grant.
No, that's something else.
I know Jared logged it at the tail end of 2019.
I logged it just recently again.
And then again this year,
whenever you had a chance to play with it.
Yeah, I had to log it again
because I started using it finally.
And I was like, oh yeah, this thing is awesome.
And so I logged it again after Gerhard showed it to me.
This is a great one to see pods being scheduled quite a bit.
Uses a lot of CPU.
There was something here that was mentioning K9s
when something was recognized, like the developer,
the K9s developer, for something.
I forget the exact detail.
Anyways, we can look it up and we can link in the show notes.
But K9s, it's a really easy way of just jumping around
your Kubernetes cluster, having a play with different resources,
tailing the logs.
For example, I'm on the app container right now.
If I press S, it asks me which container I want to open a shell in.
So right now, I open the shell in production on the production app,
like one command away.
And it makes stuff like this really simple.
Which is excellent.
What would you do beforehand?
Just completion foo?
What would you do?
Well, I used to SSH to the server
and then connect into the Docker container
and then go from there.
That's what I would do on the previous setup.
So the equivalent would be
if I do kubectl exec -it,
which pod, app, which container, app.
So this is what I need to run.
And then what do I want to exec?
Well, maybe it will... yeah,
you must specify at least one command. So, bash.
So this is what I would need to do.
Deprecated... exec pod...
So I need to do, there you go,
I need to do dash dash.
Pods not found.
So what's the name of the pod?
Let me go back out, because the pod, there you go, it's this one, basically. It has like a...
You've got to find the pod name.
Yeah, that's what I need to do.
It's a very visual dashboard to Kubernetes, essentially.
Oh no, hang on, I know what the problem is. The problem is that I need the namespace. There you go. It's stuff like this,
right? You need to be in the right namespace.
Heck with that.
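For anyone following along at home, the long-form dance Gerhard was fumbling through looks roughly like this; the pod name and namespace below are illustrative, not our actual ones:

```bash
# Find the pod first (and you have to know, or guess, the namespace).
kubectl get pods --namespace prod

# Then open a shell in a specific container of that pod.
# The "--" separates kubectl's own flags from the command run inside the container.
kubectl exec -it app-5f7c9d8b6-xk2lp \
  --namespace prod \
  --container app \
  -- /bin/bash
```

K9s collapses all of that into a single keystroke once you're looking at the pod.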
Shout-out to Fernand Galiana,
aka derailed on GitHub,
the author of K9s.
Super cool tool.
Thanks, Fernand.
We appreciate it.
Hey, it looks like he offers corporate training
for Go and Kubernetes.
So there's your shout out, Fernand.
Awesome, Gerhard.
Well, the availability is back.
Now, you mentioned last year
we had almost four hours of downtime.
We just experienced seven minutes
of downtime here.
Which we should deduct from our actual downtime, too.
But how are we doing so far this year?
We're doing much better.
Definitely. So last year, 2019,
we had that pretty bad downtime
due to the Docker service.
That was actually almost two hours of downtime we had, due to the misconfigured Docker service. So for the whole of 2019,
our availability SLI was 99.96. So three nines and a six, which means that we were down for 220 minutes for all of 2019.
We had 50 micro downtimes,
and that has to do with how the Docker does the promotion
for different, like when it does like blue-green.
So all in all, we had almost four hours of downtime.
This year, 2020, with the LKE migration,
which we started way back when, I don't know, January, February was
ongoing. April, we had a bunch of stuff already migrated. We had that use case with Rob Yogle,
and we had already a parallel deployment on LKE. And then we completed everything,
I think it was in July or August, when we were through. So while all of this was
happening, including the migration, and okay, there are two more months to go, we were down for 68 minutes
in all of 2020. So that means we're just below four nines. 99.988, something like that.
That's where we are.
And half of the downtime was because of the migration.
There's an interesting story there.
Right.
It had to do with the slow DNS propagation when we switched over.
We hit the Let's Encrypt cert request limit, because the DNS wasn't fully propagated, which
means Let's Encrypt was throttling us, and when the DNS did propagate, we couldn't get a certificate
fast enough.
I'll have to email Josh about that, Jared. Get that API limit... do something for us.
Yeah, come on, man, let us hit that thing more.
It was my mistake. The reason why it was my mistake
is because of the TTL in DNS Manager...
DNS Manager, external-dns...
the TTL in external-dns defaults to one hour.
I have since changed it to 60 seconds.
So had it been 60 seconds,
I could have flipped back within a few minutes.
But because it was an hour, once it goes out, you have to realize that DNS resolvers will
cache it for one hour.
Sure.
So even if you change it, they will keep serving the wrong IP address.
And that's why in some parts of the world, it was okay.
And I knew how to basically clear the DNS.
But for most people, it was down.
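For context, external-dns reads the TTL it publishes from an annotation on the Kubernetes resource it watches; here's a minimal sketch of that, with the Service name and hostname layout being illustrative rather than our exact manifest:

```yaml
# Sketch: ask external-dns to publish a 60-second TTL for this record.
# The two annotation keys are external-dns's standard ones; the rest is made up.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  annotations:
    external-dns.alpha.kubernetes.io/hostname: changelog.com
    external-dns.alpha.kubernetes.io/ttl: "60"  # seconds; provider default applies if omitted
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - port: 80
      targetPort: 80
```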
Don't you think a 60 second DNS TTL is pretty low for a normal scenario?
Aren't people going to be hitting their DNS root servers
more often when coming to our website
versus a higher number?
To me it makes more sense that you would crank it down
when you're going to make migrations
and then crank it back up when you're pretty stable. But is that just a thing that doesn't really
matter in practice?
In practice, it doesn't really matter at all. Hang on, am I still sharing my
screen? I am sharing my screen, yes. You sure? Yeah. So if you look at github.com, they have 60 seconds.
github.com, a single IP address, it's global. They have 60 seconds. So all the big names, they have a really low DNS TTL.
What about google.com?
60 seconds.
Single IP address.
Give me one more.
What about microsoft.com?
Try microsoft.
Microsoft.com.
They have a high one.
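If you want to check this yourself, dig prints the TTL (in seconds) as the second column of each answer; the values you see depend on what your resolver happens to have cached at that moment:

```bash
# The second column of each answer line is the remaining TTL in seconds.
dig +noall +answer github.com A
dig +noall +answer google.com A
dig +noall +answer microsoft.com A

# Query a specific resolver instead of whatever your OS is configured with.
dig +noall +answer @1.1.1.1 changelog.com A
```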
I wouldn't be comfortable with such a high TTL.
Because if you need to change it, right?
When something goes wrong and you need to change it,
well, the TTL is already out there, right?
Yeah, but couldn't you just say,
like, on your way up to a deployment,
like, we knew we were going to roll this thing out.
Like, part of your step is, like,
we're going to lower our TTLs all down to 60 seconds,
and then we're going to do our stuff,
and when it's all done, we're going to raise them back up
and let people hold on to that cache.
I mean, caching is nice.
It is, and that's why this is what
we recommend the TTL to be. But DNS resolvers will implement theirs, and ISPs will implement theirs.
Not to mention that if you have a router, like Adam has a very smart router, it has its own
caching setting.
Yeah.
So it's turtles all the way down.
And while you recommend 60 seconds,
who knows what the different DNS resolvers will use.
Some may use an hour,
regardless of what the upstream says.
But what we care about is that at least the DNS that respect those settings
will pick it up soon enough.
In the big scheme of things, it's so small.
I mean, the DNS requests are really a very small amount
of traffic that goes through the internet these days.
Sure.
And what I would ask is, why don't we have instantaneous updates? I think that's what some providers already have, like Cloudflare, for example. Like the big ones.
How... well, a rat hole. I was going to ask how they get that done, but we don't have to go there.
I don't know myself, to be honest. I mean, it changes, it's not something I keep up to date on. But I do know that instantaneous updates, especially when something goes wrong, they're so handy to have, right? What happens if the IP gets compromised, or whatever? So many things can happen. It's almost like a safety release mechanism, if you wish. Maybe you won't need it, but when you do, it's good to have it.
And there's nothing worse than when you really need it and you're like, well, we've got to wait 60 minutes.
And that's exactly what happened last time.
So that's how you kill your nines right there.
Not to mention that you need to wait, right? So you need to wait for an hour before the new TTL will be picked up.
Right.
So yeah, you need to say, okay, I'm going to upgrade in one hour
because that's how long it will wait for the TTL to expire
before the new one will be picked up.
And that's when your boss fires you.
Good stuff.
Well, with availability, there come some acronyms, right?
Oh, yeah. Some acronyms, some new acronyms at least.
SLO, SLI
did we have these before
and I just didn't know about it
or is this new to the world of
there's always a new TLA
three letter acronym
there's always a new three letter acronym
I see
I thought you meant like TLA, like the TLA plus.
Okay.
I was thinking about something completely different.
I got you.
So did we have this before?
I don't think we formally had it.
I know that when we talked about it, I asked, Jared, how much downtime would you be okay with? Because the less downtime you're okay with, the more complex your infrastructure
becomes. And he said, a few minutes here, a few minutes there, it's okay. Right? So in this case,
seven minutes, it's okay. It's not the end of the world. We can be down for seven minutes, it's not
a problem. So when you look across the whole year, how much downtime are we okay to have?
And that will become, in this case, our availability SLO, the service level objective.
Our service level objective is that we are unavailable for at most 50
minutes in any one year, and that's what four nines means. The SLI, the service level indicator, is
where are we against those 50 minutes. And right now, we are at 68 minutes this year, plus an extra seven minutes.
So that's almost like your budget.
But that was the rollout year, so 2021 will be the test.
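To put rough numbers on those nines, the yearly downtime budget is simply one minus the availability target, multiplied by the minutes in a year:

```latex
\begin{aligned}
\text{budget} &= (1 - A) \times 525{,}600\ \text{min/year}\\
\text{three nines: } & (1 - 0.999) \times 525{,}600 \approx 526\ \text{min} \approx 8.8\ \text{h}\\
\text{four nines: } & (1 - 0.9999) \times 525{,}600 \approx 52.6\ \text{min}\\
\text{five nines: } & (1 - 0.99999) \times 525{,}600 \approx 5.3\ \text{min}\\
\text{2020 so far: } & 1 - \tfrac{68}{525{,}600} \approx 99.987\%
\end{aligned}
```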
Of course, by 2021 we may be rolling out new stuff. So let's talk a little bit about things that we
didn't do, or things that we might do, and then we'll talk about how the listeners can get involved.
what haven't we done here? What haven't we done?
So the first thing, as you know, I'm very passionate about,
is about logs and metrics.
I think we need a better understanding of how the system works.
While we have grafana.changelog.com,
and we have all the metrics from the Kubernetes side,
from the infrastructure side,
we don't have metrics from PostgreSQL, for example.
We don't have metrics from our application, Phoenix.
It exposes a lot of metrics.
So we don't know what's happening inside of our app.
Why is it constantly using 8 gigabytes of memory?
We don't know that.
It would be good to know that so that maybe we can bring it down.
We can speed some things up. Another thing: we try to optimize, for example, queries, right? How fast
pages load when they're not cached. Well, when a page loads, can we see a trace? That's where
traces come in. Where's the most time spent in that request timeline? Is it the database? Is it
the app itself? What's going on? The other thing that we would like to have is to
centralize all the logs. So I have Kubernetes, and we can see logs for pods easily. K9s makes it
super simple, but even if we didn't use K9s, kubectl logs is there, but the logs will
be gone once the pod goes. So can we aggregate all those logs? Before, we were using Papertrail.
But this is something that I wrote about last year.
We'd like to try Loki out.
We'd like to send all the logs to Loki.
And when we send all the logs to Loki,
and we already have, for example, maybe IPs for requests,
browser agents, user agents,
could you maybe have some dashboards?
And I know we talked about this last year
to show you maybe where users are coming from,
like the stats
that the app currently does, remember?
Could we maybe use something else
and use the logs as they are
without having to process them ahead of time?
Could we do that?
So logging, metrics, integrating everything.
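As a taste of what that could look like if the logs did end up in Loki, LogQL can parse fields out of raw log lines at query time, so a "requests per user agent" panel needs no pre-processing. This is a sketch that assumes JSON-formatted ingress logs with a user_agent field, which may not match our actual log shape:

```logql
# Count requests per user agent over the last hour, extracting the label
# straight from the JSON log line at query time (no ingestion-side parsing).
sum by (user_agent) (
  count_over_time({app="changelog", container="nginx"} | json | __error__="" [1h])
)
```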
The one thing which I think it's everybody's dream
is to have an automatically updating system.
So could we roll updates to, for example,
our container image for the app,
the latest version of Erlang
or the latest version of Elixir
or the latest version of PostgreSQL automatically
with no intervention?
I think we could. So what does the setup look like?
What about automatic Kubernetes updates? We're not on the latest version, we want to upgrade,
but someone has to do that upgrade. Could we automate that so we're always running the latest?
And can we do it in such a way so that it causes minimal disruption to everything else? Because in
a way, you want
the most secure setup, right? The most efficient setup, with no effort, if possible. It just happens.
Right. It's the dream.
Yeah, exactly. I mean, I would love to just, again, declare an Elixir version, a
Postgres version, and not have to worry about how it goes from my current version to that version.
I know, especially with Postgres, that can be very tricky
with database backend migrations and stuff.
The actual format of the data storage by Postgres can cause issues.
But yeah, that would be super cool.
And who knows, maybe we'll get there someday.
Yeah, I mean, I think it's all within reach, to be honest. All these
systems, once you declare them, once they are self-updating, self-healing, it's almost like the
next step. We want an automatically updating cluster, if you wish. And how will those updates
happen? So for example, you may be okay to update to the latest patch of PostgreSQL
automatically, but maybe not to the latest major. Maybe you'd want to control when majors
get rolled out.
Totally.
But for other things, like Erlang for example, you may be okay, because it's
very backwards compatible and you always want to have the latest, right? And maybe you'd want to have canary updates,
so like updates happen, but not in situ.
So if something fails,
you would like to know what the failure is so that you know how complicated this upgrade will be.
It's almost like it will feel for you like,
hey, there's a problem with this.
You may not even have thought about this.
Not to mention that you can then start consuming beta versions, right?
And say, hey, I can't upgrade to this.
And then you can start feeding back
into different development cycles.
Like this is completely not working for us
or there's like this new feature which is amazing.
Can you ship it, please?
We would benefit from it greatly.
So you can start consuming the latest and greatest
in an automated fashion with little effort.
And then you can have automatic updates like your phone does.
Isn't that great?
When you don't have to update all the thousand apps
which you have installed, they automatically update.
I think that's pretty great.
Phones these days update themselves.
Until they update on their own and they change their icon.
And you're like, what happened?
Right.
Because that's the problem.
This new icon is ugly.
I like the old one.
Yeah.
And then you can't go back.
I know what you mean.
There's no going back either.
It's futile.
Well, I'd say one of the coolest things about this process this year, I think, has been
the contributions back to LKE and to various open source things surrounding Kubernetes and the
work that you did there, Gerhard, you know, getting us to where we are right now. I think
it's really cool that we can use this as a necessary excuse to help other people and help
open source software get pushed forward as well. So that excites me. I hope as we move forward,
we continue to look for opportunities like that
to work with open source projects and with providers
to level everybody up as we use their stuff.
I'm surprised it's taken this long to get Let's Encrypt in place.
I mean, I know we had an SSL cert and it was set to expire,
but it's been out for a while
and we finally have it in place in automation, which is awesome.
So we did have Let's Encrypt before manually.
Right.
As we were doing the migration, I remember our certificate was getting close to expiration, so we didn't want to wait.
So we did do it manually.
But it was still a bit like, okay, I have to run this command, and I have to save the certificate somewhere, like in the source of truth where we keep all our secrets, then I have to put the certificate on the CDN,
and I have to set it up in the load balancer. So there were like a couple of steps which we had to
take.
Yeah.
But now everything's automated for us, because of how the NodeBalancers integrate with
ingress-nginx, and how all the components know what needs to happen.
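For readers wondering what "the components know what needs to happen" can look like in practice, one common pattern on Kubernetes, and not necessarily the exact wiring in our cluster, is cert-manager watching an Ingress annotation and keeping the Let's Encrypt certificate renewed in a Secret:

```yaml
# Sketch of the common cert-manager + ingress-nginx pattern.
# Resource names and the issuer name are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: changelog
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts: [changelog.com]
      secretName: changelog-com-tls  # cert-manager creates and renews this Secret
  rules:
    - host: changelog.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```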
And to your point, Jared, too,
there's a lot of services out there now that we're in,
now that we're actually in cloud native land,
officially, since we're using Kubernetes,
there's a lot of different services out there,
both open source and, you know,
opportunities to partner with people that are available to us
that are pretty cool.
So there's probably some holes
where we haven't chosen a good solution, or we're not there yet. So this is us saying, hey, reach out. If you know
better ways we're capable of using, reach out and say hello. Pretty easy to get in touch: editors at
changelog.com. Say hi. It's too easy.
I would be very excited to hear about what the cool kids are doing these days in the
world of Kubernetes. It's moving so fast, so many things are changing, it's impossible to keep up
with it. But this way, you know, just as we are helping the community in our own specific way,
we hope that others will maybe have a look at what we do and suggest better ways. The best compliment is someone telling you
how crap your setup is.
That's right.
Because then you can learn something new.
So I'm looking forward to someone telling me
how crap the setup is and how it can be made better.
I'm really looking forward to that.
And I suppose you could use grafana.changelog.com
and do some hunting yourself if you wanted to. Also,
it's open source. You can look that way
as well. So it's
transparent. It's available to you to
dig into just like we can. No different.
Absolutely. Please do
check it out. All of the source code is
up on GitHub. All the links to all the things
are in the show notes.
As we stated earlier, Gerhard will
be publishing a detailed blog post covering many of the nitty
gritty details that we glossed over in this conversation.
So look forward to that.
We'll cross post it in the show notes as well.
We don't know the exact timing on when,
what goes out in which order,
but it'll all work itself out.
We'll just declare it and the system will just work out that content as it needs to.
Gerhard, anything we left on the table?
Anything left unsaid before we call the show?
No, we went into a lot of details, much more will be covered in the blog post.
I really enjoy this.
Every year, it's almost like I'm looking forward to this.
The results of combining all these awesome building blocks.
And in a way, that's what we do here.
We take the best or the simplest,
in some cases, out of the open source world,
and we combine them in a way that makes sense for us.
And yeah, I'm really excited about
where we will be this time next year.
Really looking forward to that.
And if you have questions or thoughts,
you can share in the comments on this show, of course,
or you can join the community, which is easy.
You can talk in real time in Slack
with us, among other things, but
changelog.com slash community is where
you can make that happen. It is a free
community to join. It's the cost of your
attention and your time. That's it.
And everyone's welcome.
No matter where you're at in your journey,
this is a place you can call your home and
hang your hat and hang out and ask questions.
Gerhard's there.
I'm there.
Jared's there.
And you could be too.
So hope to see you there.
Gerhard, thank you so much for all your hard work.
It's been amazing working with you all these years.
We appreciate every single year this leveling up.
We love to see this happen, this progress happening and achieving our SLO of four nines next year.
Four nines. Because five nines
is expensive. That's a different subject
but this has been awesome.
Once we've achieved four nines, we will
move to five nines, right? It's always
aspirational. Of course. It's what you've got to do.
It's the next layer.
The reason why it's not three nines, because three nines
would be easy. We already got that.
We already have that.
Yeah, we already have that. So, four nines.
We can fail and get three nines.
We could or we can be
completely awesome and
get five nines.
One day I'm sure we'll
get there.
I'm sure.
One day.
One can dream right?
One can dream.
I would like to give one
last shout-out to
Andrew Zauber.
It was great working
with you on some of
these things.
Thank you very much
for your detailed
explanation about the proxy protocol support.
That was great.
Thank you very much.
And everybody else at Linode
that is building a solid platform.
Yes, yes.
Definitely.
Thank you, Linode.
A hundred percent.
All right.
That's the show.
Thanks, guys.
Yep.
It's been awesome.
Thanks, Gerhard.
This was good.
Thank you.
That's it for this episode of The Changelog.
Thank you for tuning in.
If you haven't heard yet, we have launched Changelog++.
It is our membership program that lets you get closer to the metal, remove the ads, make them disappear, as we say, and enjoy supporting us.
It's the best way to directly support this show and our other podcasts here on changelog.com.
And if you've never been to changelog.com, you should go there now.
Again, join changelog++ to directly support our work and make the ads disappear.
Check it out at changelog.com slash plus plus.
Of course, huge thanks to our partners who get it, Fastly, Linode, and Rollbar.
Also, thanks to Breakmaster Cylinder for making all of our beats.
And thank you to you for listening. We appreciate you. That's it for this week. We'll see you next
week.
Real-time follow-up: YAML is not Turing complete, according to this Hacker News comment from
January 2018. So there you go.
Okay, so about YAML. I don't want to go into too much detail, but basically,
YAML is a big contention point. It's a big, big contention point right now. And there's something
called Starlark. It's a programming language, if you wish,
that you run inside YAML.
It's really crazy.
Wow.
But it's something that sticks
and there's like Jsonnet and ksonnet,
and there's like so many ways of doing this
and everybody has their own way.
But the one thing that really seems to be
coming through and through
is this Starlark templating language,
because you write code, you write functions
inside of your YAML.
They get interpreted and then YAML gets
combined and you end up
with a really nice final
big document. It's able to
manipulate YAML by
writing the functions inside of the YAML.
It's really weird.
But it's in a good way
because AWS does the opposite. AWS, I kid you
not, they make you write functions as YAML. So you do dash increment, dash one, dash one, and it will know
to increment one plus one. And that's just crazy. No, this thing does something else. You do
comments, and it knows, like you say, this is like a function call.
And if you look at ytt, go to get-ytt.io.
Okay, ytt.io.
Yeah, my internet, there's a problem with it.
I mean, even get-ytt didn't.
I'm on it.
All your YAML shaping in one tool.
Yeah.
Templates and patches needed to easily make your configuration
reusable and extensible. Works with your own and third-party YAML configuration. So click on Try in Playground... okay.
Oh, so this is it right here.
That's it. And guess what? We use ytt.
Oh, we do? Changelog uses ytt?
Yeah, to template all these things.
Oh, nice. So many things that we use, I don't even know.
Yep. Love it.
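For the curious, here's a tiny sketch of what "functions inside your YAML" means in ytt. The resource and values are made up, but the shape is the real thing: lines starting with #@ are Starlark, and they evaluate into plain YAML.

```yaml
#@ def labels(env):
app: changelog
env: #@ env
#@ end

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  labels: #@ labels("production")
data:
  #! "#!" is ytt's comment syntax; the value below is computed at template time.
  replicas: #@ str(2 + 1)
```

Run it through `ytt -f config.yml` and the annotations disappear, leaving ordinary, fully expanded YAML.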