The Changelog: Software Development, Open Source - Inside 2021's infrastructure for Changelog.com (Interview)
Episode Date: May 21, 2021. This week we're talking about the latest infrastructure updates we've made for 2021. We're joined by Gerhard Lazu, our resident SRE here at Changelog, talking about the improvements we've made to 10x our speed and be 100% available. We also mention the new podcast we've launched, hosted by Gerhard. Stick around for the last half of the show for more details.
Transcript
This week on the ChangeLog, it's that time again.
We're talking about the latest infrastructure updates we've made for 2021.
We're joined by Gerhard Lazu, our resident SRE here at Changelog,
talking about all the improvements we made to 10x our speed and be 100% available.
Also, we announced our newest podcast we're launching, hosted by Gerhard.
So, stick around the last half of the show for more details and how to subscribe.
Of course, huge thanks to our partners Fastly,
Linode, and LaunchDarkly.
We love Linode.
They keep it fast and simple.
Check them out at linode.com slash changelog.
Our bandwidth is provided by Fastly.
Learn more at fastly.com
and get your feature flags powered by LaunchDarkly.
Check them out at launchdarkly.com.
This episode is brought to you by Linode.
Gone are the days when Amazon Web Services was the only cloud provider in town.
Linode stands tall to offer cloud computing developers trust,
easily deploy cloud compute, storage, and networking in seconds
with a full-featured API, CLI, and cloud manager
with a user-friendly interface.
Whether you're working on a personal project
or managing your enterprise's infrastructure,
Linode has the pricing, scale, and support you need
to launch and scale in the cloud.
Get started with $100 in free credit
at linode.com slash changelog. Again, linode.com slash changelog.
We're back with Gerhard Lazu, our resident SRE.
What's up, Gerhard?
It's all good.
It's actually 10 times better.
Our website is, I hope so.
That's the title of the show.
It's 10 times better.
I like 10 times anything.
Are you a 10X SRE or what's going on here?
That's exactly what it is.
It's a 10X.
That was the theme for this setup. It has to be 10 times something. It doesn't matter what that
10 times is. It's 10 times something, like an order of magnitude better. And it is. Guess what?
It is. Nice. So it couldn't have been 10 times slower to deploy or 10 times longer response
times. None of that. It had to be 10 times better.
Well, for those who haven't listened
to the annual ChangeLog infrastructure episode,
welcome, you are here.
This hasn't been a whole year.
It's been a half a year,
so it's now, I guess, semi-annual.
But we worked faster this time around,
didn't we, Gerhard?
We did, because we had the basics covered really well, and the base was so good that iterating was super simple.
Yeah. And what we iterated on was basically what mattered the most: uptime and response latency. We had a couple of tricks up our sleeve. I think it was combined: I had one, you had one, we put them together, and yeah, we did it faster and we did it better this year.
Not much has changed, actually. So I think that's almost what everybody wants: introduce a little change, not much change, but make it so much better. Which we did. Fine-tuning; there's details in the fine-tuning that make things faster, and that's where you've got to optimize.
Yeah, I think it takes a while to learn your system, to properly learn all the components. And then, when you're comfortable with all the components, figure out which is the smallest change that you can make for the biggest improvement. And that's what we did.
Yeah. Shall we spoil it? I mean, if someone just wants to listen to five minutes, we can spoil it and they can...
No, let's tease it. Let's tease it. Let's hold it back.
Let's tease it, all right.
Hold it back.
We'll tease it.
Stick around, listener.
Yeah.
Let's start with this.
Not much changing this time around.
A lot changed last time around.
So our 2020 episode, which came out last October,
was a big change.
A lot going on.
And some of the reaction to that episode was,
and we're on Kubernetes now.
And it's like, hey, guys, you run a three-tier website, right? You have a database and an
application server and Nginx or whatever. Kubernetes is way overkill. So let's start there.
Gerhard, what do you think about that? Do you agree with that?
Not really. And this is like... that's a really controversial part.
I assume you're going to say that, because you're the one that set it up.
Right. So I think that's a very simplistic view, because you're right, when you boil it down, that's exactly what you have, right? It's just a Phoenix app, it's a web app, and your database, you have a proxy maybe, and that's about it, right? That's what you have. But it's almost like the iceberg,
right? It's like the thing that you see at the top and there's everything else behind or below
the sea level or the sea line. So what else do we have below? Well, you have certificates,
you have load balancers, you have DNS, you have code updates, you have tests, you have CI, CD, you have dependencies,
you have dependencies of dependencies, and the list goes on and on and on. And things are changing
all the time. So given you have so many things, how do you manage that? And usually what happens,
you don't. You just go with the flow, right? Let's say you don't care about your CDN integration.
Just tick a box and assume everything just works.
And most of the time it does, but when it breaks, do you even know that it broke?
What about the monitoring?
How do you manage the monitoring?
And again, it just goes from there because you're running a production system, a production
system that is serving a lot
of traffic, which changelog.com does. And even though it's a simple app, I think it's simple because we made deliberate choices. It could be a microservices architecture; we didn't choose that. But the fact that we don't have that doesn't mean we don't have all these things around it. Could you have one thing that manages all those things? Control plane is the term that many use today.
But that's what we kind of have.
We have a control plane that manages all the things.
And I say all the things, all the things that we could convert.
There's always more work that we could do.
And I think that's where the next improvements
are coming from for us.
We have a very solid base
and improving is really simple now.
And everything is like in a single place.
So you have this single thing,
which you can hold in your head.
Everything is automated.
Everything recovers.
And again, I don't want to spoil it too much,
but migrating from the 2020 setup to the 2021 setup, in terms of time, we could perform a live migration in 27 minutes,
from nothing to everything. How cool is that? Did you already know all the Kubernetes stuff?
Like, so when people think about setting up a Kubernetes cluster, they talk about the complexities of the API perhaps, or the tooling, or the ecosystem. I always think back to the CNCF's... it's not a roadmap, what is that? It's like a trail.
Yeah, the landscape.
And there's just all of these words, and I don't know any of them, and each one of those is like a complex piece of software, right? And I get overwhelmed. You got this rolled out. I'm just curious,
was there a Kubernetes learning curve for you or had you already done that previously? And so when
you started helping us, you already understood what you were doing. Because I think a lot of
the cost for people, they're like, well, is this worth doing for me or not? It's like, well, do I
have to learn all the Kubernetes things or do I have somebody who knows that I'm already?
So I'm just curious where you're coming from.
So I had some knowledge,
but it was mostly basic.
But the thing to understand
is that I have been doing infrastructure
for, I don't want to say decades
because that's like bragging,
but let's just say, a really long time.
So we were joking about webmasters.
I used to be one.
CGI bins... oh yes, baby. Those were the good old times.
I remember CGI bins. I wouldn't describe them as the good old times, but...
Well, they were better than... it's perspective, like pink glasses and all that, you know; you remember the past much better than it actually was, right?
There's an element of that. So I've been doing this for a really long time, and I can appreciate the cycles that we went through, and we had many, many cycles. I've learned to learn on the job, and if you optimize for that, there's nothing new that is too daunting. I mean, it's exciting, you'll make mistakes. But after you've been over,
I don't know, six, seven cycles, they come and go. Remember Ruby on Rails? Oh man, those were
the good old days. Phoenix, I think, captures some of that. The point being that even though
I didn't know, I kind of knew how to navigate that landscape. And you're right, if your baseline is like zero and you have little experience, it is daunting, and you would want a curated experience. But if you have seen these new technologies emerge, and you know kind of where you are in the cycle, like are you on the uptrend, whereabouts are you in that... um, the law of innovation of diffusions?
The law of diffusion of innovations.
Sounds better.
What is it?
Law of fusion innovation.
That's it.
What's that?
So, early adopters... it's basically any new thing. Whenever you're introducing it, you have to focus on the first 2.5%, the early adopters.
Oh, this is like the curve of people who are going to adopt, that starts with the enthusiasts and it goes to the...
Exactly, early majority. The spread of a new idea.
Exactly. And Kubernetes right now, I would say it's in the late majority. It's not laggards, you can still not do Kubernetes, but I think it's the late majority now. So we waited for it long enough before we went into Kubernetes. I would say we were towards the end of the early majority that adopted it. That's what I think. So a lot of the components were fairly mature, and while mistakes could be made, it was more difficult. And our hosting provider, right? You know, because that's how it all started. Let's get some VPSs, remember those days?
And then VMs and then cloud instances.
So they offer a managed Kubernetes service.
And that was the thing which we were waiting for
so that we wouldn't need to worry about the control plane,
about, you know, etcd and certificates and the integration with the IaaS.
So all that stuff was abstracted away from us.
Once we had that, we had the building blocks.
And we had to identify a couple of things,
but they were fairly well-defined.
cert-manager, ExternalDNS, ingress-nginx.
That was pretty much it.
And these were like fairly standard components
that have been improved over the course of a year, two years.
So we were just like after 1.0,
I think cert-manager was the only one which wasn't 1.0,
but then later on it was.
So the components were fairly mature.
There were so many blog posts and use cases
and mistakes that have already been made before us.
And what we wanted to do was fairly standard.
So there's nothing crazy.
Documentation was written.
We weren't those early adopters, or we were towards the late early adopters, and we were not the innovators, definitely not. So a lot of the stuff made sense and it was easy.
Now, having said that, it's not like there wasn't any pain, right?
Yeah, we still hit a couple of interesting things. Shall we go into that? What do you think?
Some interesting things that we've hit... okay. So some interesting things that we've hit were around the PostgreSQL operators.
We chose the Crunchy PostgreSQL operator first, and it was fairly hard to work with because of how complicated it is. It's doing so many things, has so many features, and the replication bit us, right? So we had a replicated PostgreSQL, and we had downtime because it was replicated.
You wouldn't expect that to happen.
Because it wasn't replicated, right?
Because it was replicated. We had downtime because it was replicated, but it stopped replicating.
Exactly, it stopped replicating. Okay, so it wasn't... which one was it?
No, no, no. So hang on. We had the replication in place, right? Replication stopped working, and it took down our primary system. It filled up the write-ahead log, filled up the disk, it went down, the secondary was way, way behind, so it couldn't be promoted to primary. And we had downtime.
Right.
And we had data loss.
And we had data loss.
Yeah, we did.
Oh, yes.
That's way worse than downtime, in my opinion.
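Editor's note: for readers who want to guard against this exact failure mode, here is a minimal sketch of the kind of check that would have flagged it early: standby replication lag and WAL size on the primary. The connection string, thresholds, and role are illustrative assumptions, not part of Changelog's actual setup.

```python
# Minimal sketch: warn when a PostgreSQL standby falls behind or WAL piles up on disk.
# DSN and thresholds are placeholders, not Changelog's real configuration.
import psycopg2

DSN = "host=primary.example.internal dbname=postgres user=monitor"  # hypothetical
MAX_LAG_BYTES = 512 * 1024 * 1024        # warn if a standby is more than 512 MB behind
MAX_WAL_BYTES = 8 * 1024 * 1024 * 1024   # warn if WAL on disk exceeds 8 GB

def check_replication() -> list:
    alerts = []
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        # How far behind each standby is, in bytes of WAL not yet replayed.
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
            FROM pg_stat_replication
        """)
        standbys = cur.fetchall()
        if not standbys:
            alerts.append("no standbys connected: replication has stopped")
        for name, lag_bytes in standbys:
            if lag_bytes is not None and lag_bytes > MAX_LAG_BYTES:
                alerts.append(f"standby {name} is {lag_bytes} bytes behind")
        # How much WAL is sitting on the primary's disk (what filled up in this incident).
        cur.execute("SELECT coalesce(sum(size), 0) FROM pg_ls_waldir()")
        wal_bytes = cur.fetchone()[0]
        if wal_bytes > MAX_WAL_BYTES:
            alerts.append(f"WAL directory has grown to {wal_bytes} bytes")
    return alerts

if __name__ == "__main__":
    for alert in check_replication():
        print("ALERT:", alert)
```

Wired into an alerting system, a check along these lines turns "the WAL filled the disk overnight" into a page well before the primary goes down.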
We had a backup from, I think, six hours ago, was it?
It was like six hours.
Nine hours ago.
It was like a bunch of hours, and we've lost some data, yes.
Thankfully, it wasn't a ton of data,
but it was definitely data loss.
Because we had backups. That's the lucky part, because we had backups.
Yeah, we had good backups. But yeah, six hours back.
So thankfully... was there any podcast episode that was published during that time?
I don't think there was an episode; that would have been a bigger problem. But there were news items and comments, and a few things where I had edited a thing and I had to go back and edit it again. Thankfully, we caught it fast enough that I remembered it, and we're a small team, so we remembered our data loss. We're like, I know what I did yesterday, or the last six hours, so we fixed it up. But in a larger team, that would have been catastrophic.
Yep. Yeah, that was not cool. That was really not cool.
And you go through the documentation, right? And it's not like, do this or do that. You don't have a simple list of steps to follow, and then you're scrambling. It's like, I just need to get this thing back up, right? That's all we cared about. What would be the simplest thing? So I think two hours later we had this like, no, we just have to restore from backup, because resizing the disk was difficult. It was just a mess. It was just a mess.
And I think this goes to show that it has not matured that much. I mean, it's getting there, but it hasn't matured that much. And if you need that type of redundancy from PostgreSQL, then, well, you either have some DBA chops, especially when it comes to PostgreSQL, and know what you have to do, or you're just paying for that. Which, I think for us, if it really, really mattered, we would have just paid for that, for the problem to have been taken care of. But the interesting thing is, I always thought that maybe PostgreSQL, maybe Crunchy was too
complicated. And then we tried the other operator, the Zalando one, and the same thing happened, right?
So it wasn't an operator thing.
And here's the thing.
We still don't fully understand
where the latency is in the Kubernetes networking stack,
but we know that there is some latency
and we have some very high spikes.
So think that an operation that should take maybe up to 100 milliseconds will take five seconds. And then, if you have plenty of those things, in a certain series of events things will just get out of sync, and they will not be able to continue replicating correctly. And when that happens, the system will not be able to recover. It was a surprise to me, and I remember looking at this for a really long time, thinking, could it be Linode's private networking? And it wasn't, that wasn't the problem, even though it indicated there's some network latency. So we went down to a single Kubernetes node, everything was running on one node, and we still had the same latency problems. So there is something. And it wasn't CPU-bound, it wasn't high network throughput; we weren't hitting any sort of limit other than network latency. So how many metrics would we need to enable in the different layers of the stack, and how well would we need to know that stack, to debug this issue? Right? And I think that's where a lot of people that hit issues with Kubernetes, that's where they're coming from.
You wouldn't expect these.
These are normal problems.
These are just almost like specific to the stack that we are running, which in this case is Kubernetes.
So you kind of need to be an expert to kind of know how to look at this.
But I do hope that some technologies... I think they've been around for a while, but again, it goes back to how do you pick and choose your components? So what I'm wondering is, would Linkerd have helped with this?
Could Linkerd show us the latency between the different services?
What is Linkerd and how would it do that?
So it basically intercepts all the traffic between...
So imagine ingress-nginx when it talks to the app. Linkerd would place itself between ingress-nginx and, in this case, the app, so we'd see all the latency between the two components. Same way, it would intercept all the traffic between the app and the database, the PostgreSQL service, to show us when there's any sort of weird latency between the two services. Now, we could enable all the metrics for PostgreSQL, but then you need to find the dashboards, you need to understand those dashboards, whether you have Grafana or something else... then you're literally becoming a DBA, right?
That's the hard part, though. You talked about Crunchy...
what was the other one
you talked about we moved to?
And then...
Zalando.
What's it called?
Z, Zalando, PostgreSQL.
So you got those two
and then you consider
would Linkerd have helped us?
But that shows to me,
at least from someone
from this perspective,
which is not a Kubernetes operator,
I'm not an SRE,
is that you have to have some sort of understanding of the different tooling available in the ecosystem, which means you've got to pay attention.
Yes.
Right.
Very closely.
And even not just to know which tools are available to manage Postgres like we need to, and replicate and whatnot, but also a high degree of understanding of that tooling and how it'll actually help you. And so I think that can be... it's just a very daunting,
high-touch world that Kubernetes presents.
It may be the future.
And I'm not sure in terms of the law of diffusion
and innovation where we're at,
it's early majority, late majority,
in terms of adoption of Kubernetes at large,
but it seems like it's still iterating
and still getting better
because we thought it was Linode's networking. It wasn't.
And you suggest different tooling, but that to me says you've got to
have your ear close to the ground of Kubernetes and all its
intricacies to really deal with this kind
of problem or problems like it. We're dealing with it in Postgres. I'm sure there's other databases that are going to have issues.
But it's similar.
It's the same kind of issue, where it's a latency of some sort
that spikes and causes everything to slow down
and then haywire.
So they do say, and let me be specific,
Kelsey Hightower has been saying this for a long, long time.
Don't run your data services on Kubernetes
because things get complicated.
And I think this is a first-hand
experience of what he was referring to. Things may seem okay for a long, long time, but then
things start getting problematic. You have the combination of tooling that maybe wasn't meant
to run in these types of environments. And how do you basically evolve it so that it embraces this distributed everything can go and
come within milliseconds as containers do so i'm wondering if something like cockroach db
which is meant to be run as a distributed postgreSQL replacement would have helped i don't
know would we have benefited from a managed postgreSQL instance
maybe so maybe we should have listened to that advice and not run postgreSQL in kubernetes
but all these things first of all they made us just understand the stack a little bit better
and say us mostly me and it made me realize that simple is best so for the 2021 setup we're running just a very simple
StatefulSet, a single PostgreSQL instance that can restore from backup in less than one minute. So let's say that you lose everything, right? If you back up frequently, which we do every hour, by the way... and I have to change that setting. I've set it to be three hours, but I need to change it to one hour. It's super simple. And then the database will back itself up every hour; we can lose an hour's worth of data. We could back it up every 30 minutes, but it's very simple. And then you have backups, and you can self-expire them. By the way, we back up to S3.
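Editor's note: the episode doesn't go into the exact tooling behind these backups, so treat the following as a rough sketch of the idea only: dump the database every hour, push it to S3, and expire copies older than the retention window. The bucket, paths, database name, and retention are invented for illustration.

```python
# Rough sketch of an hourly "pg_dump to S3, expire old copies" job.
# Bucket, prefix, database name, and retention are invented for illustration.
import datetime
import subprocess

import boto3

BUCKET = "example-changelog-backups"   # hypothetical bucket name
PREFIX = "db/"
KEEP = datetime.timedelta(days=7)      # self-expire backups older than a week

def backup_once() -> None:
    s3 = boto3.client("s3")
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/changelog-{stamp}.dump"

    # Dump the database in PostgreSQL's compressed custom format.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", dump_path, "changelog"],
        check=True,
    )

    # Upload this hour's backup.
    s3.upload_file(dump_path, BUCKET, f"{PREFIX}changelog-{stamp}.dump")

    # Expire anything older than the retention window.
    cutoff = datetime.datetime.now(datetime.timezone.utc) - KEEP
    for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", []):
        if obj["LastModified"] < cutoff:
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])

if __name__ == "__main__":
    backup_once()
```

In practice an S3 lifecycle rule can handle the expiry step on its own; the sketch keeps everything in one place for readability.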
And we back up the entire media as well.
And these backups, the reason why they were important
is because when we did the 2021 setup,
all I had to do, I had to let the system restore from backup
to pull all our media, which is 85 gigabytes right now: all the files, all the MP3s, all that stuff. So to download that from S3 is fairly fast, especially the MP3s; they download at like a few gigs per second.
But it's, uh, gigabits, not gigabytes, by the way. You have 85 gigabytes. That's an important distinction.
But it's when all those small files, like all the avatars... when you have to download them, they take slightly longer, because there are so many of them. But we can restore everything from scratch. So let's say we delete everything: within 27 minutes, because of all those small files, everything is restored, the database super fast, the media files, the whole lot.
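Editor's note: a quick back-of-the-envelope check on that 27-minute figure; the effective transfer rate below is an assumption for illustration, not a number from the episode.

```python
# Why "gigabits vs. gigabytes" matters for the 85 GB media restore.
# The 2 Gbit/s effective rate is an assumption, not a measured value.
media_bytes = 85 * 10**9        # roughly 85 GB of MP3s, avatars, etc.
rate_bits_per_s = 2 * 10**9     # assume ~2 Gbit/s effective from S3

bulk_seconds = media_bytes * 8 / rate_bits_per_s
print(f"bulk transfer alone: ~{bulk_seconds / 60:.1f} minutes")  # ~5.7 minutes

# The remainder of the 27 minutes is dominated by per-object overhead on the
# many small files, plus the database restore itself.
```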
And because it's so simple, do you need to have a distributed system? You can use these local SSDs. That's another problem which we had: disks not detaching, nodes not rebooting. We had another downtime because of that, and I know that all these issues have been fixed. I mean, we were early adopters in the case of Linode Kubernetes Engine, right? It shipped in November 2019.
We started using it, right?
It was a beta.
And just when it went live, I think in May,
we were already starting to switch some production workloads across.
And then by, was it August or September?
I can't remember.
Everything was across.
Something like that.
So did we need a multi-node Kubernetes cluster?
The answer is no.
What we needed was proper CDN integration.
And that's where the speed comes from.
So by properly integrating with the CDN,
in this case, Fastly,
the website is actually 15 times faster.
The latency.
Did you say 15 times?
One five, yeah.
15 times.
One five.
Actually, let's do this.
By the way, we are integrating with Grafana Cloud.
So we ship all the logs, all the metrics to Grafana Cloud and we have synthetic monitoring
set up there. And we have probes running all around the world. By the way, not all probes
are reliable, but we have plenty to show us what's happening. And we're monitoring our babies now.
We are, yes. The feeds, and we have alerts and reports... there's like so many things we have set up. So thank you, Grafana Cloud, that's a really cool thing. Behind the scenes, Jerod called our feeds our baby.
Yes, he did. A little joke there. But yes, we're monitoring our babies, which is our podcast feeds. And if a baby is crying, guess who gets the Telegram message? This is the Grafana Cloud integration. I do.
Right. So when ours... that's the way it should be.
Yeah.
Exactly right. That's how you stand by your infrastructure, if you're willing to be woken up at night.
And guess what? We're caching it, and so cache doesn't go down anymore.
Yeah. All of Fastly would have to be down before Changelog would be down.
So you have proper integration, which we didn't have before. We did some caching,
but not as much as we do now. Anyways, before we enabled caching the changelog.com website, the average latency... so we have San Francisco, Dallas, New York, London, Frankfurt, Bangalore, Sydney, and Tokyo, these are all our probes... the average latency across all probes was 880 milliseconds.
That's kind of embarrassing.
Before.
Yep.
Yeah.
Now it's 66 milliseconds.
So how much is that?
880 by 66, 13.3 times.
Not quite 15, but not 10 either.
It's more than 10.
We can round to 15. And guess what the uptime is? 100%.
Exactly, 100%. It's 100%, that's exactly right. We want all the nines.
This episode is brought to you by CloudZero.
They help teams monitor, control, and predict their cloud spend.
And I talked with Ben Johnson, co-founder and CTO at Obsidian Security.
They get tremendous value from using Cloud Zero.
Ben shared with me the challenges they face
driving innovation and customer value
while also trying to control and understand
their Amazon Web Services spend.
We want our engineers to move fast,
to innovate,
and to really focus on driving customer value.
Yet at the same time, reality is we have to pay for cloud compute and storage. And the challenge around AWS is often that you have
multiple accounts, you have lots of different services, you have some people who only have
access to development environments, not necessarily production. A lot of these different challenges across services,
across accounts that make it hard to understand the positive or negative impact to the costs
that the new feature, the scale, maybe the change in architecture are having. And so giving our team
more insight into the ramifications, again, positive or negative, of their changes, in order to say, maybe we need to really move fast, let's have less worry about cost right now. Or maybe now
we're in a more stable place. Let's drive down the cost so we can give those cost savings onto
our customers or improve our own margin. So a product like CloudZero can really help your team
get a handle on costs, get alerted to those spikes, feel good when you actually see the costs drop, and do all that without a whole lot of investment of your own time.
All right. If your organization shares similar struggles as Ben and Obsidian Security,
check out CloudZero today. Learn more and get a demo at cloudzero.com
slash changelog. Again, cloudzero.com slash changelog.
So this speaks to really geographic relocation of our assets, right? I mean, we had all of our images and MP3s and CSS and JavaScript assets served via CDN all the way back to when we set the system up.
That's right.
But we didn't serve the entire website via that CDN.
That's right.
And so even
though Phoenix is really fast, even though we're set up good, we even have in-memory caching in places where it makes sense, like the feeds. Who wants to recalculate The Changelog's feed of 400-some-odd items every time it gets requested? We cache that in the app. In addition to that, we now have it behind the CDN. And just the fact that that used to be served from, like, New York East... even if it was really fast, to answer in Bangalore, in Tokyo, it's never going to be under... well, it's going to be an average of 880 milliseconds around the world, right?
Yep.
There's not much we could do about that
while our responses were coming from a centralized,
you know, single pop, as they call it,
point of presence, which is the way it was.
So now every request goes through Fastly,
and we should have done that a long time ago.
We should have.
I'll take full responsibility on that one because I kind of slept on it for years.
I think you resisted it, actually.
Didn't you resist it for a little bit?
You were like, no, let's not do that.
Yeah, I think it was.
I'm not trying to call you out or anything. I'm just trying to be like, what were the circumstances for saying no, really?
I think it's because I didn't read the docs well enough.
I didn't realize how easy it is to just bypass that
if you have a cookie set.
So I thought, well, we have signed in users,
signed out users.
I guess I always had done it that way.
I just served the dynamic parts from the application
or behind NGINX,
and I served the static parts from a CDN,
and that was just what I was used to.
That's what we did.
I thought it would be hard to switch
because I didn't realize
that there's just a setting where it's like,
pass through fastly if you're
signed in.
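Editor's note: the behavior described here actually lives in Fastly's configuration, not in application code, and the cookie name below is hypothetical. This sketch only expresses the decision being talked about: requests carrying a session cookie bypass the cache, everything else is served from it.

```python
# Illustration of the "pass through the CDN if signed in" rule.
# The cookie name is hypothetical; the real rule is set in Fastly's config.
SESSION_COOKIE = "_changelog_session"

def cache_decision(request_headers: dict) -> str:
    cookies = request_headers.get("Cookie", "")
    if SESSION_COOKIE in cookies:
        return "pass"    # signed-in user: skip the cache, hit the Phoenix app
    return "lookup"      # anonymous user: serve the cached response
```

With only a few percent of requests signed in, almost everything ends up served from the edge.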
Probably a minuscule
percentage of our traffic is
signed in users.
Maybe a lucky 3%, maybe 1% of requests are signed-in users.
Maybe. That's right, a lucky 3%, maybe 1% of requests are signed-in people. So a little bit of ignorance, a little bit of just like old school, this is how I do it. And then, because we didn't have worldwide monitoring, we had single-point monitoring, it always seemed pretty fast. You know, we always got good scores. Is it good for you? It's good for me. Yeah, exactly.
Is it good for us?
Is it good for people in the States?
Once we set up the Grafana with the around the world monitoring,
then you start to realize,
holy cow, this is not fast for everybody, you know?
Yeah.
So I think it was less,
just less important
because I didn't realize how bad it was out there.
Well, that's interesting too
when you talk about observability.
What's it, you don't know what you don't know
until you know or something like that. Basically, you know,
observability provides a lot of data
to understand some of the
problems because either
you don't have time or you not necessarily
don't care, but you don't care because you
can't care. You don't have the
data to really understand the full rounded
picture of the problem or
the concern. And that's what's
interesting is that once you start to monitor something, you really start to understand the
real problems. And that's why I think, you know, there's a lot of pluses to, you know, it doesn't
require Kubernetes to use Grafana, right? We don't need Kubernetes to use Grafana. But the full, rounded picture of what cloud native asks of teams, or prescribes, is this picture of Kubernetes, a "simplified", in quotes, plane that everyone understands. You can go from our organization to a whole different team and they're using
Kubernetes. It's roughly the same API and all the same concerns. You've got
an understanding from team to team if you're someone who moves around or someone who SREs for many people, or it's just
a standardized way of doing things. I'm curious though, about the average, because you said 880
was the average. So share the highest, because that says average. What was the highest?
So this is the average latency, right? And you have all the different points. Can you see that?
Yes. Okay, cool.
So this is all probes. We'll pull a screenshot into the show notes for sure. But let's look, for example, at Dallas, right, which is closest to where Adam is. So in Dallas, what we're seeing is the average latency is 42.2 milliseconds.
Okay, that's pretty good. It's a pretty good latency.
You can see that you have a couple of high ones.
So the max goes to about 200 milliseconds.
This is now, not before.
This is last seven days.
Looking across the last seven days.
If your maximum response time is 200 milliseconds,
then you're sitting pretty.
200 milliseconds, exactly.
And that's where the average, and this is Dallas.
So let's take, I don't know, let's take London, for example, for me. So London is 87 milliseconds and the maximum is 400 milliseconds. Now, what we need to understand is that some of this is also related to the probes. So, do you see the uptime says it's 99.98%? Well, what that actually means is that some probes, some Grafana probes, are either
overloaded because they take more than five seconds, which is exactly what happened here.
It takes more than five seconds. And that's a timeout. If a response takes more than five
seconds to come back, it's considered an error. It may have taken longer, but it's considered,
nope, it didn't respond quickly enough. But maybe the probe was being overloaded.
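Editor's note: a tiny sketch of how that uptime number falls out of the probes: every sample that comes back within the timeout counts as "up". The sample data is invented; Grafana Cloud does this for real across its worldwide probes.

```python
# How a 99.98% uptime figure can come from a single timed-out probe sample.
TIMEOUT_S = 5.0   # responses slower than this count as errors

def uptime_percent(response_times_s: list) -> float:
    ok = sum(1 for t in response_times_s if t < TIMEOUT_S)
    return 100.0 * ok / len(response_times_s)

samples = [0.07] * 4999 + [5.2]               # 5,000 samples, one over the timeout
print(f"{uptime_percent(samples):.2f}%")      # 99.98%
```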
I know that when we were looking at Bangalore, I think that was the one.
This is Bangalore.
See, for example, these errors here.
This was the 4th of May.
The error rate was very high.
But all it meant is that the probe may have been overloaded.
Not necessarily the website, because I'm pretty sure Fastly was rock solid around this period. I mean, you just have to think how many POPs they have, how many points of presence. So once you get in the Fastly cache, any endpoint should be able to serve it. We have a shield in New York, and then every other point of presence basically distributes from there; it reads it from that cache and replicates across the whole world. And we have a micro-cache, so we cache every response for 60 seconds, and then if there are any cache misses, it will continue serving stale content while asynchronously going back to the origin and requesting an update. So you should always serve cached content, unless obviously the POP was down or overloaded or something like that,
which very rarely happens.
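Editor's note: one common way to express "cache for 60 seconds, keep serving stale while refreshing in the background" is with surrogate caching headers from the origin. The episode doesn't say whether changelog.com does this via headers or inside Fastly's own configuration, so treat the exact values as illustrative.

```python
# Illustrative caching headers for a 60-second micro-cache with
# stale-while-revalidate behavior at the CDN. Values are examples only.
def cache_headers() -> dict:
    return {
        # Fastly honors Surrogate-Control for its own cache...
        "Surrogate-Control": "max-age=60, stale-while-revalidate=86400, stale-if-error=86400",
        # ...while browsers fall back to a plain Cache-Control.
        "Cache-Control": "public, max-age=60",
    }
```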
And then we reroute traffic.
So typically when there are issues, it's the high latency.
It's most likely the probe.
Let's see.
Can I have one, for example, can I see one probe here that was not very healthy?
Look, for example, this one, this was Tokyo.
Do you see how the latency went slightly high? So Tokyo was having not a great day, the Tokyo probe. Same thing here in Bangalore; the Bangalore probe was all the way up to five seconds, so some requests were timing out. But which probe out of here looks most loaded? Let me just open this in a slightly bigger view. It's Frankfurt. Look at Frankfurt, how many spikes it has. Do you see these spikes? It goes all the way to three seconds, four seconds. Now, in the big scheme of things this is no big deal, right? You think, ah, this is okay, but the probe, I think, is overloaded.
What does that mean, to be overloaded? Like, the Grafana probe has a lot of load, it's doing this for not just us but others?
Similar to the way a noisy neighbor is on a VPS.
Exactly, right. Or whatever route this is taking, the route is overloaded, the networking, right? We don't know what route it takes. So however this probe runs, we can see it now. We never had this, and this is a really fascinating thing. Who knows what problems we had in the past, in the 2020 setup.
But because we never had this level of visibility, we didn't know.
We didn't know what we didn't know.
So now we know that, for example, users in Frankfurt,
maybe there's an interconnect that is slow.
Maybe it's not just that probe.
But still, we are able to serve most requests within seconds. So we monitor the NGINX logs, and we can see the response times, we can see the traffic served. This is, by the way, after the CDN cache; we still need to get the logs out of the CDN to be able to visualize the same thing. That's something which I wasn't able to set up just yet, but it's on the list. And we can see that the 99th percentile, the average 99th percentile, is 707 milliseconds, so we are under one second. This is NGINX to the app. But the time interval is 10 minutes. So if we go to, let's say, five minutes... it's a lot. One minute, we had, like... look at that.
Whoa, what happened here?
So when the time interval is one minute, the 99th percentile response time was one minute.
The 95th percentile was 300 milliseconds, and the 99th percentile was one minute.
So what the hell happened here?
I don't have the answer, but I would love to find out.
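Editor's note: a small sketch of why the same traffic can show a 707 ms p99 over ten-minute windows and a one-minute p99 over one-minute windows: with only on the order of a hundred requests per minute, a couple of stuck requests become the 99th percentile. All numbers below are invented for illustration.

```python
# Why the p99 depends so heavily on the aggregation window.
import numpy as np

one_minute = np.array([0.05] * 98 + [60.0] * 2)     # 2 stuck requests out of 100
ten_minutes = np.array([0.05] * 998 + [60.0] * 2)   # same 2 stuck requests out of 1,000

for label, window in [("1 min", one_minute), ("10 min", ten_minutes)]:
    p95, p99 = np.percentile(window, [95, 99])
    print(f"{label}: p95={p95:.3f}s  p99={p99:.3f}s")
# 1 min:  p95=0.050s  p99=60.000s
# 10 min: p95=0.050s  p99=0.050s
```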
Well, now you know there's a problem though.
There's a thing, right?
Because before you didn't know there was a problem.
And if we're dealing with replication of databases
and this was sort of like attached to that,
like as you begin to...
Here's the thing.
All this runs on a single massive host. We have 32 CPUs, AMD EPYC, 64 gigs of RAM or 128 gigs of RAM, SSDs, super fast. It's a single host. So how can the 99th percentile between ingress-nginx, running on that host, and the app, which is running on the same host, be this high?
Bitcoin miner.
Bitcoin miner.
It's not, but sure.
I assure you it's not.
I'm glad you shared the specs of that server, too,
because that does put it into context of... This should never happen.
...its capability, and that this shouldn't happen.
It shouldn't happen.
What do you surmise?
What's your gut?
Something in kube-proxy.
Something in kube-proxy.
I mean, that's the only thing.
It's not the database.
Yeah.
It's not the app.
It's something between all those components
that make up Kubernetes.
We have Calico for the CNI.
Maybe it's that.
Maybe it's the overlay network. But this is where it's almost like you want more observability. It's almost like, you know you have a problem; before, you didn't, you were so ignorant you didn't even have a problem. And if you look at external monitoring, everything looks good, everything is fine. From a CDN perspective, things are okay, and that is the experience that we want to give our users: the website is always available, it's super fast regardless of where we are in the world. And these are the things that we are now becoming aware of. So the question is, do we invest in this, or maybe do we do something else? And when I say something else: do we continue down Kubernetes, or do we take, I don't know, a platform as a service? Our problem has always been bandwidth, right? Because we need a lot of bandwidth, like think hundreds of terabytes of bandwidth.
It's not like in the detective shows, where they say zoom and enhance, you know?
That's what you're doing to us here. We zoom in, and at a certain point you zoom and enhance and it just can't enhance any further, and you're staring at a blob and you're like, I don't know what that is.
Yeah, that's kind of where we're at. So you need another level, you need another zoom or another enhance in order to dive down. And the smaller these problems are, the more of your time you spend, right, figuring out how to get that zoom done, and probably the lower your, you know, your ROI, so to speak, or the law of diminishing returns hits you, and you're sinking massive amounts of resources into solving this tiny little problem that may or may not be worth it. I mean, ignorance, I guess, was bliss... except for our users. For our users it wasn't bliss. We thought it was fast everywhere, and now we know that it wasn't. It's better, and yet we still have this little thing that's like, what's going on there?
Yeah. And it does happen fairly frequently, by the way. So there's something there. Would tracing help? I don't know. Like, look, if we look at the last six hours, we have a spike here, that was 7 p.m., and they're not periodic; they happen like, uh, 4 p.m. Could it be the database backups? I mean, they do run every three hours. You have four and you have seven. So maybe go to, like, the last 12 hours. But then you have all these smaller spikes. This is 1 p.m. So not really. All right. You had these spikes. And again, most of the stuff, like if you look at the traffic that we serve, it's nothing. The server is not even 1% loaded. CPU is not an issue, network is not an issue, nothing is an issue. All the components are healthy, very little memory use. So it's not a problem.
So this is a good thing. I think it refines your understanding. I think it makes you think about your setup in ways that you haven't thought before, so you really do feel like the master of your domain. And most things are easy to set up; I think it's just knowing which things to set up. And what I'm hoping that we'll do with this, and with Ship It, is we will share some of those stories. We'll share the things that worked out and the things that didn't work out, so that others wouldn't have to do this.
Wait, wait, wait, wait.
What's this ship it you just said?
What's this thing?
What's this ship it?
What are you talking about?
So I'm thinking about like,
it's like it's been five years in the making.
Okay, every year we have been improving
our infrastructure, our setup.
We've been shipping it, sharing it with you all.
So how about we do this more often? How about we do this every week? How about we do some interviews and some sharing of how to ship stuff, and what else there is other than shipping? Because getting it out in production, that's like such a small part of the story. I wouldn't say it's like the tip of the iceberg, it could be, but there's so much underneath. It's all the other things that you need to care about. So it's a new show that we would like to start, and this is the first episode. It's the first episode of that show.
I'm excited. I'm excited about this show. I think this is so awesome. I mean, I think that we've been asked, you know, why do we do this? Why do we even care about Kubernetes ourselves? Like, to use it, considering our three-tier application and not really needing, so to speak, that.
I think because we care.
Because we're explorers.
Because this is fun to dig into this kind of stuff.
And as you mentioned, Gerhard,
will Kubernetes be the solution for us forever?
Maybe.
Is it great?
Sure, in many ways, but it's got a lot of downfalls as well.
Will a PaaS make more sense? Will a Render, a Fly, something like that, or whatever Linode has in the future, or DigitalOcean, will that make sense? Maybe. I don't know.
For our application, you mentioned we need a high bandwidth. I think that's part of the journey.
And doing this show, sharing our story, like we had the last couple of years consistently,
naturally evolved into the need to want to share more and not just our story, which is
going to be one part of it, but other stories, other teams stories and how they ship things.
Like, wouldn't it be cool to learn how Kubernetes ships Kubernetes?
Oh, yes.
Or how different platforms ship their different platforms.
Do they use their platform to ship their platform, or do they do something different? You know, are they dogfooding, are they champagne-ing, whatever you call it? And that's gonna be the fun journey, you know. And I think that's what is really fun about this: do more, not just less. I think that's the one thing that we've learned. There's so much to this, there's so many good conversations that can be had, there's so many problems that others are sharing. Like, I was researching network latency in Kubernetes and I came across blog posts that were saying, like, how Kubernetes made my latency 10 times worse. I was thinking, that's my problem! But it wasn't, it was just clickbait. I clicked on it, like, oh damn, it just wanted me to click. So I wouldn't want that for others, right? I would genuinely want to dig into this with different people that have had similar problems, or that maybe have tooling that can help with this problem, to help us understand what the problem is, to help others understand, and maybe come up with a solution which works for more than just us.
So there's, again, a way to curate these problems,
a way to understand them and to see what makes sense.
Because Grafana Cloud may or it does make sense for us, but maybe it doesn't for others.
So what else is out there? We don't know.
And it's not a fixed thing. It's changing all the time.
Like every KubeCon, there's new tools, there's new approaches,
there's just new people, right? New efforts going on. So what are they? It is a full-time job just keeping up with
all the things. And it happens to be fun. Thank you.
...to deploy code at any time, even if a feature isn't ready to be released to users. Wrapping code with feature flags
gives you the safety to test new features
and infrastructure in your production environments
without impacting the wrong end users.
When you're ready to release more widely,
update the flag status,
and the changes are made instantaneously
by the real-time streaming architecture.
Eliminate risk, deliver value,
get started for free today at LaunchDarkly.com.
Again, LaunchDarkly.com. Again, LaunchDarkly.com.
So if you're listening to this in the ChangeLog podcast and you're interested in our new show, Ship It, you can go right now to changelog.com
slash ship it, subscribe there. If you happen to be subscribed to our master feed, which is your
one-stop shop for all ChangeLog podcasts, you're already going to get it. We're going to ship it
right into your feed. But if you're interested in coming along this journey with Gerhard and with us
and with our setup and with other people's setups and see where this thing goes, definitely subscribe to
ShipIt. Now, if you're listening to this on the ShipIt feed, hey, congratulations, you're already
here. Welcome. But I'm excited too. This should be a lot of fun. And I think I will learn a lot by
listening and maybe even participating a little bit. I think that that makes so much sense,
because there's so many good ideas out there. There's so many good ideas that are good ideas for a while,
and then they're terrible ideas, but that's okay.
Because ultimately, what do you care about?
How does this help you?
Does it make sense?
And what else is out there?
It's almost like the novelty factor,
that in itself is good enough to subscribe
and to just like what's around the corner.
Like, one thing which I would love to find out, I mean, I'm putting this out there in the universe, is that one of the guests on Ship It is none other than Elon Musk. Does he ship Kubernetes to Mars? I would want to know that.
Wait, wait, wait. What are you saying now?
Why not?
How does he ship those rockets?
That's like proper engineering, right?
We're just like playing here.
So this is an episode request.
This is not a promise.
This is a request. No, no, no.
Okay, good.
Because I about got very excited.
I was like, really?
Gerhard is dreaming and we are liking it.
Six years from now it will happen, I'm sure.
Now, in six years... that's how long this thing took, from an idea. It makes sense. He just did SNL; he should do Ship It.
Yeah, we're the next natural step from there.
I think so. And maybe we can help him curate the tech that will get shipped. Why not? I say we, it's like the royal we, the shipping group, right? So he doesn't ship the version that has all this downtime, right? Because I don't think that will be good for the mission.
I think we're just looking at the downtime that we had before. We had a lot of downtime, and now it's all green. 19 days, all green. Since we did this switch, the new setup, we didn't have any downtime. 100%.
That's awesome.
I say, okay, it's
a little window, but it should never
go down unless we mess something in the
CDN config. That's possible.
Because at one point I said, there goes them nines.
Oh, yes.
Because the last time we talked, we talked about the nines and how much they
cost and how much each
nine costs and the effort, not just the
cost, but the effort required to
get to those nines.
And that's kind of part of it, too, because we're going on this journey thinking this is improving.
And sometimes improving isn't just simply infrastructure and speed.
Sometimes it's knowledge.
Sometimes it's understanding.
And maybe the current version you've improved, but you've really just improved your understanding of the system and what's required, and the system you currently got might not fit the bill for what you really need, which means
something else, or you're
iterating towards that learning, and that's the interesting
part. Very well put.
Gerhard, do you expect a
community, or do you desire
a community around this show?
Do you think there'll be people
involved, helping guide
direction, ask for certain topics, certain interviews?
What's your thoughts on like who this is for and how involved they're going to be?
I think you can approach it from multiple angles.
I think a community would be nice, but a community, I think it just needs to make sense for the community rather than for us or for me.
So if the community would find that useful, sure thing.
But I think it's more around, I mean, the CNCF.
I'm just thinking, I just recently came back from, I say came back, it was right here in front of the computer.
Virtual.
The virtual KubeCon, CloudNativeCon 2021.
We have a good interview,
possibly one more or two.
Anyways, that's a fantastic community.
There are so many things happening there.
So, what I see as a Ship It community... a community is hard work.
And I think a community,
if it serves itself
and if it's like self-sustaining, maybe.
But I think if anything,
it's sharing interesting topics.
It's solving specific problems that others would find helpful and interesting.
And it's more like spreading ideas and approaches and perspectives that make sense to some.
That's what I'm hoping to get out of this.
Obviously, learn, right?
Learn new things and share those learnings. I think those episodes, I think they will be very time-specific. It's almost like there will be a journey, and in that journey that episode makes sense, and they build one on top of the other, and eventually you have like a nice journey. I mean, we used to do it like every six months, every 12 months, something like that, so I would like to do that a lot more often.
So like smaller steps, gain a lot more perspectives and share it a lot more often rather than once every six months or once every year.
That's what I'm hoping.
But what do you think?
I mean, I could imagine a world where there's a group of enthusiast shippers.
Maybe the act of running things in production is technology specific
so that you might have like the Kubernetes community
and the Ansible community or whatever.
But I think like people are interested in these things,
whether they're SREs or they're DevOps or they're sysadmins,
like I used to be back in the day,
I can imagine people rallying around and hanging out together
and talking about these topics,
similar to how JavaScript folks hang out
and talk about JavaScript in the JS party community of our Slack.
So that show is very community-oriented. We want the community to actually come up with ideas and challenge us and request the guests, so that's like a community-oriented show. I was just curious about your angle on that for this particular podcast.
I think that makes a lot of sense. Like, all those things make a lot of sense, to have engagement from the listeners, right? That's the way I would phrase that. Again, it's more about exploring and sharing, and that's what I'm really passionate about, and finding ways to improve Changelog in a way that is open source and others can benefit,
because that's one thing that we have always done, shared our approach publicly. Like if you look at the commit messages,
there's so much insight in them.
And I find that very interesting because...
Yeah, you write books in there.
Yeah, I did.
I did actually.
I think we could publish a book.
We could probably pull a book out of here.
There's a lot of text in there.
ASCII art and all those things, links.
There's a lot of stuff there.
Yeah, check it out.
Emoji.
Emojis? Oh, emojis are the best.
They convey so much emotion.
In regards to community, we can say that we have
a dev channel
in our community Slack.
And if I'm keying off of what Jerod's saying,
it's like, where can people hang out at?
So we already know that changelog.com
slash community is there. It's free to join.
It's open. We already have a dev channel.
But maybe, are you saying maybe a Ship It channel makes more sense, where we have, similar to JS Party... we've got a JS Party channel, and people hang out there and chat during live shows.
And maybe this show isn't live, but we can start to have, hey, I like this show.
I want to invite this person.
I want to suggest that person.
Well, where do people go and congregate?
Where can that happen?
And I think we've already paid for the price of admission, which is free,
and the infra's there thanks to free Slack and community and all that good stuff.
It's done.
So a matter of moving some of that conversation from dev to ship it
or just promoting dev to what could be ship it.
Either way, in terms of the logistics of that
getting done sounds good to me. But I think we should definitely have a Ship It channel where folks can hang out and talk and, yeah, you know, throw ideas out there and have a place to discuss the show and things around the show. Doesn't have to be about the show, but I think that would be rad.
Do we have comments enabled on episodes?
Yeah,
we do.
Okay.
So that's on, for now. If you listen to a recent Backstage,
we thought about turning them off.
You can go listen to that conversation.
And,
uh,
we actually agreed on turning them off and then I just didn't do it.
Okay.
So yeah, we might leave them on forever
because of laziness, or maybe it'll disappear, but I don't know. You go listen to Backstage episode, was it 16? All the emotions around comments. But for now they're there, and I don't know, I just leave them on, because people do seem to like them.
You know, since then, and this is a micro version of that conversation,
I've seen more adoption of our comments
and especially that recent blog post
you got there, Jared. I mean, like,
if it weren't for that, you wouldn't have people talking to you.
Yeah, I wonder if that episode spurred on more comments.
They're like, wait a second, these guys have a comment section?
I didn't know that until they
posted a show about it. And then even since,
I've looked at our design of it, and I think that, you know,
for a signed-out user, it could be, we could do better design to make a better effort to encourage discussion.
Oh yeah, like actually an emoji picker.
There's definitely some things we could do.
Reactions.
There's all sorts of stuff we could do.
Just guides to higher-value content, really.
Higher value comments.
But that recent post you did, you might as well timestamp it.
That got a lot of
comments itself. The backstage episode we're talking about is episode 16.
Accurately titled, Let Us Know in the Comments. So yes, let us know in the comments.
So yes, there are comments on each episode. So it's a great place to have
conversation. Especially, I like the permanence of those in terms of it's attached to the episode.
So if you have follow-up links or questions regarding the content,
it's a great place for that.
Whereas, of course, there's conversation that's going to happen on Twitter
and on Reddit and on Hacker News and on LinkedIn.
Do people have conversations on LinkedIn?
I don't know about that.
They do.
And elsewhere.
And in our Slack.
But there's some value to the comments on site.
It's worth it, in my opinion.
But if you're listening to this and you're thinking,
well, one, they've answered my questions around community.
Because clearly we just in time produced the future of things.
So we just determined that we're going to have a community.
And it'll potentially be the Ship It channel in Slack.
But if you have a request for an episode,
there's an easy way to do that, changelog.com slash request.
It's there for every show we have,
the changelog, Founders Talk, Ship It,
all the shows essentially.
So if you have a request for a guest or an idea,
that's the best way to share it with us.
If you want to join the community, it's there,
changelog.com slash community.
No debate about that.
And if you care about shipping it,
then you should ship it with us.
Also, if you care about all the other things
that happen before shipping it and after shipping it.
And while you're shipping it?
And while you're shipping it.
Oh, yes.
It's just, yeah, it's almost like that's a point in time, but there's so many things happening before and after. It's not like a single event, right? You find yourself shipping it, and you would like to think that every time is the same... that's what we aim for, it's the ideal, but it's not, right? Sometimes you ship it and you take production down and go, oh crap, what did I do?
Well, there's a great lesson to learn there. So I think it's those things which are really
interesting, right? How do you build systems where shipping is so easy and straightforward,
they don't even think about it? I think we were rather fortunate that that was the case for us. Just git push, and everything will take care of itself, or merge it if there's a PR.
Well, you heard it here first: Gerhard, our resident SRE for hire, has been promoted to podcast host, coming at you weekly, changelog.com slash ship it. And I'm excited, Gerhard. I mean, I've been a big fan of what you've been doing with us for so long. I'm glad to get to a weekly cadence, where it makes a more rounded sense to talk about what we're doing,
what others are doing, and all that fun stuff. But hey, listeners, you know what to do,
changelog.com slash ship it. All right, that's it for this episode of The Changelog. Thank you
for tuning in. We have a bunch of podcasts for you
at changelog.com.
You should check out.
Subscribe to the master feed.
Get them all at changelog.com slash master.
Get everything we ship in a single feed.
And I want to personally invite you
to join the community
at changelog.com slash community.
It's free to join.
Come hang with us in Slack.
There are no imposters
and everyone is welcome.
Huge thanks again to our partners,
Linode, Fastly, and LaunchDarkly. Also, thanks to Breakmaster Cylinder for making all of our
awesome beats. That's it for this week. We'll see you next week. Thank you. Bye.