PurePerformance - How to build distributed resilient systems with Adrian Hornsby
Episode Date: August 19, 2019
Adrian Hornsby (@adhorn) has dedicated his last years to helping enterprises around the world build resilient systems. He wrote a great blog series titled "Patterns for Resilient Architectures" and has given numerous talks on the subject, such as Resiliency and Availability Design Patterns for the Cloud at DevOne in Linz earlier this year. Listen in and learn more about why resiliency starts with humans, why we need to version everything we do, why default timeouts have to be flagged, how to deal with retries and backoffs, and why every distributed architect has to start designing systems that provide different service levels depending on the overall system health state.
Links:
Adrian on Twitter: https://twitter.com/adhorn
Medium Blog Post: https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6
Adrian's DevOne talk: https://www.youtube.com/watch?v=mLg13UmEXlw
DevOne Intro video: https://www.youtube.com/watch?v=MXXTyTc3SPU
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always, my co-host Andy Grabner.
How are you doing today, Andy?
Hey Brian, pretty good. Summer is finally here.
I don't actually know exactly when this is going to air, but at the time of the recording...
Towards the end of the summer, yeah.
Yeah, exactly. At the time of the recording it is really good.
We are just hitting about 30 degrees Celsius,
which is almost getting to the edge of getting too hot,
but it's still good.
Yep, we're having some nice weather here in Denver as well.
And I just got a new fence put up,
so all the donations that our listeners give for this podcast
didn't help with the fence at all, because we get no money for it anyway.
We have an exciting guest today. You sent me over some of the presentations and documentation, and as I was reading through it, I was like, wow, this is great stuff. As you and I were discussing before we started, when I was reading through this I'm like, this all makes perfect sense and seems completely obvious. But who the heck knew about this until someone ran into these problems and figured these things out? It's so important, and it's probably not even thought of that much. I guess what I'm trying to say is it's very obvious stuff, but probably stuff that nobody even thinks about very much, right? It's hiding in plain sight.
Exactly. I mean, I wouldn't say it's too obvious; if it were, I would assume that our guest wouldn't have a real job spreading the word. But it's really interesting. I think the more people are moving into the cloud space, the more important it is for people to understand. And now coming to the topic,
resiliency and availability design, which was actually the topic that Adrian,
who is our guest today, spoke about at the DevOne conference here in Linz. And I wouldn't give it
as much credit, obviously, if I would kind of rehearse what he did.
That's why I want to, first of all, say welcome to the show, Adrian.
Thanks so much for taking time.
Thank you so much for actually making the way.
You made the way to Linz in April and enlightened a lot of the developers, I think, in the audience.
And now I really want to, first of all, pass it over to you and let you introduce yourself, what you do.
And then I want to talk really more about the stuff
that you are advocating for in the world of cloud native.
Hi, Andy. Hi, Brian.
First of all, thank you so much for inviting me.
It's such a pleasure to join you on the podcast.
And second, it was such a pleasure to actually visit Linz, and especially the conference, DevOne.
My feedback was that that conference was actually one of the best conferences I attended this year. I don't know, the whole day was just great, from the introduction music to the talks, the speakers, the people there. It actually blew my mind. So I think the pleasure was mine more than yours. But thanks again for having me here. And yes,
resiliency is actually quite an interesting topic. I wouldn't say no one is thinking about it; I think everyone is thinking about it. The problem is more how to do it, and those kinds of things.
Andy, before we dive in, you mentioned the intro music. What was your intro music that was so memorable?
I think it was special. So they actually had a team, I think it was a combination of the Dynatrace creative team and also a local
company here in Linz that created the video and the music, you know,
and actually we had a guy on stage also performing later on in the day.
So if you go on YouTube,
I believe they uploaded the intro videos recently to YouTube.
So if you go online and check out the DevOne Linz 2019 intro music, I think you will hear it. And especially if you consider the whole video being on, I don't know how big the stage was, but it was immense. It was blowing everybody's mind. I think that was really cool.
It was amazing, really amazing. I watched that video on YouTube in a loop at least a few times.
So, Adrian, can you, before we jump into the topic,
a little bit about yourself, you know, your background
and also what you do right now, and then let's jump into the topic.
Cool.
Yeah, so I'm working for AWS,
so Amazon Web Services, as a technical evangelist.
Technical evangelist is kind of the voice of the customers back to the service team.
So we do a lot of conferences.
We talk to a lot of customers.
And actually, we travel a lot to speak at conferences and meet our customers.
And then we feed back a lot of this discussion back to the service team so that we can improve our services, right?
Actually, a lot of the services we do
and the roadmap of our services is based on that feedback.
Actually, about 90, 95% of roadmap is customer feedback.
So, I mean, it's a big, big role and I love it.
So, currently, my focus is on architecture
and how to build resilient architectures,
how to practice chaos engineering
or resiliency engineering, safety engineering,
and all these kinds of things
that are really exciting for me
because I've been doing that for about 11 years now
and I joined AWS three years ago.
So it's like the cherry on top of the cake for me. I get to speak about the technology I love around the world to great customers, and learn a lot in the process as well.
Hey, so obviously we all know AWS, and I believe you are obviously one company that people look up to when it comes to scale and building resiliency.
I don't remember.
I mean, there's not many times that I remember when Amazon or AWS itself had any problems.
So obviously you're doing a good job.
Now, if you talk with developers, architects around the world,
and also coming back to what Brian said in the beginning about it all seems so obvious,
but it seems we don't know enough about it.
What are the things that you tell every architect, every developer?
What are the top three things that you believe everyone that is moving into the space of figuring out what is the next architecture?
What are the things you tell them that they definitely have to take care of,
that they have to read up on, that they have to understand? Because otherwise, it doesn't make
sense to build a system that can potentially scale and is resilient against all sorts of things.
That's a very important question, indeed. I'll go back a little bit in the question
and talk about the resiliency at AWS.
I mean, we launched AWS about in 2006.
So that's almost 12 years of operational excellence.
And where we have learned a lot,
we always say internally
that there's no compression algorithm for experience.
So we've experienced a lot of outage and, you know, at scale,
it's impossible not to have failures.
I mean, the probability to have failures
increased dramatically with scale.
So, I mean, for us, that experience is something
that we can give back to our customers in terms of services that we built and how they are built and what's under the hood.
And, you know, I think a lot of the, let's say, the good resiliency we're experiencing on AWS is based on this experience, right? And it starts how we build our regions,
how we build the infrastructures,
how we build the services.
And one of the first construct
is actually what we call an AWS region, right?
And we have 21 regions globally with AWS.
And a region is based on several availability zones, which are physically separated.
And usually a region has three availability zones, right?
So the fact that we decouple or have redundancy in our availability zones actually gives already,
out of the box, very high resiliency.
You know, it's, it's mathematic, right? If you put systems in parallel,
the overall availability of that system kind of increases.
So that's actually built in kind of in our infrastructure
so that in case there is an issue happening in one AZ,
one availability zone, you know, the rest of the availability zones can take over
and maintain, let's say, availability and
reliability for the service.
So, of course, for our users and for our customers, the first thing to understand is how we build
infrastructure because it really depends on the cloud providers.
I mean, cloud providers have different ways of building their regions, but on AWS, we use that region and availability zone construct to build this, let's say, out of the box,
multi-AZ redundancy.
And all our services are based on that, right?
So actually that's the first thing to understand.
Right.
And I think in one of your posts you had put in: if you had a single instance running
at 99% uptime, right?
That downtime gets you to three days and 15 hours a year, I assume.
And two in parallel brings you to about 52 minutes.
Yes.
But when you go to three in parallel, which is exactly what you're talking about.
Yeah.
Yeah, six nines.
Yeah, going to six nines is 31 seconds, which is, as you said, it's just, it's great.
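Just to make the arithmetic concrete, here is a minimal sketch of the availability math being discussed, assuming a 99% available instance and independent failures (the idealized case):

```python
# Yearly downtime for n independent instances in parallel, each 99% available.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability: float, n: int) -> float:
    """Return yearly downtime in seconds for n instances in parallel."""
    combined = 1 - (1 - availability) ** n   # probability that at least one is up
    return (1 - combined) * SECONDS_PER_YEAR

for n in (1, 2, 3):
    print(n, round(downtime_per_year(0.99, n)))
# 1 -> ~315,360 s (about 3 days 15 hours)
# 2 -> ~3,154 s   (about 52 minutes)
# 3 -> ~32 s      (about half a minute, i.e. six nines)
```

Real systems rarely fail fully independently, so these figures are an upper bound on what redundancy alone buys you.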
And it's funny because we also do the three,
we always talk about three nodes too as well
with our systems.
And to me, it was just like,
oh, we just, that's just what we do and what we needed.
This kind of highlighted exactly why
and why it's so important.
So yeah.
And it's mathematic, right?
It's actually how we build electronic devices as well
and how nuclear power plants as well are built, right?
There's high redundancy in every component that is used
so that in case of failure, the other one can take over, right?
So that's also kind of the idea of bulkheading, right? You create isolation so that you avoid having a very big blast radius. If you have one instance and that one fails, your blast radius is a hundred percent. So if you have three instances and one fails, or three AZs and one fails, basically your blast radius is reduced to 33 percent. So it's already kind of nice. Of course, on top of that, we do a lot of other
things, right? So our infrastructure is built using what we call cell-based architectures,
which is actually a construct that is taking the idea of multi-AZ to the next level, right?
So we don't want to spend too much time explaining this, but basically it's like really creating small individual cells
that basically limit the blast radius of failures.
And that's just for the infrastructure side.
Then you have on the software level,
you have to build redundancy as well and resiliency on the application.
You have on the operation and also on people.
You know, software people don't often think about it,
but in every team there's that very, very good software developer,
that magician that does everything.
And if you take him out of the equation,
basically you can have very problematic operations, right?
At least I always had that guy in our company that could do everything.
And the bus factor, I always say, is very high.
So you don't want that.
That's kind of like the whole point of building resiliency into your system
so that you don't have to rely or have that single point of failure in a single person.
I forget who your person's name was.
I had a Paul, but it really doesn't matter.
I also wanted to say, I think Adrian at DevOne,
didn't you also, I think in your presentation, or maybe it was at a
conversation we had later on, you actually talked about
testing the resiliency of the organization by actually taking people
out and seeing what is happening if critical people
or certain teams are simply
not available, taking away their phones, letting them not connect to the network and seeing
how does the organization behave as a whole to the fact that certain people are not there.
Yeah, exactly.
I mean, I think that's the bigger picture of resiliency engineering, or kind of business resiliency: you know,
identify key people in the organization, in your software team,
and just take them out of the equation.
So I often go to customers and help them, let's say,
implement what we call chaos engineering.
So the first thing I do is I don't test software.
I test people.
So basically I find those key people and I take their laptop
and tell them to go home, and see how basically the operation of their software people, or the whole software team, is behaving. And very often it's chaos, actually. Very often you'll see that people are not able to recover from the loss of a single person in their team, which is kind of obvious that you want to prevent or work around.
But I mean, many organizations can't handle that.
And that's just what it is.
And is your best practice, I know people don't like the word best practice,
but is it one of your guidances that you also kind of apply the factor of three here, that you should have, let's say, at least three systems, or three people in this case, that can do a particular job? Or does this work differently with people?
It's a good question. You could apply a factor of three, but I think it would maybe be far-fetched.
I think it's just important to be able to share knowledge.
And I think that's also a very good reason why we want to version everything in our software.
You version the code, you version the infrastructure, you version the documentation,
and especially, you know,
all the operations as well.
I think that's a big part of the problem
is when people do operation,
they don't leave trace of what they're doing.
So if you can put in place systems
that actually force people to open a ticket,
explain what they want to do,
and only after that ticket has been opened,
they are granted access to the operation.
So there's actually a living documentation of what has happened, right?
And everything is version control.
Everything is traceable.
You know, the logs, the actions, I mean, pretty much also the API, right?
That's why enabling CloudTrail, for example, is super important,
or AWS Config to verify configuration drifts
or things like this.
I think that's kind of the first thing
before having to employ three people.
Because if you do that, actually,
most of the team should be able to recover from that
or at least to follow up on that and kind of go
back up the stack of what's happened, right?
It's about the kind of documenting history, if you will.
It sounds like it's more than just documentation, right?
When we talk about documenting, you can consider releases as code and infrastructure as code as a way of documenting, you know,
not like writing paragraphs about what something is, but as you mentioned in your talks, you
want to do like, say your infrastructure as code or deployments as code.
So this way it is in a way documented, but more the processes so that if, if, you know,
Paul, I'm going to make a Beatles reference. If Paul is dead, you know, you have it all there, and someone just has to push that button to push that script out.
And it's more about maintaining that script then
as opposed to maintaining the person.
Yeah, exactly.
And you say it's about automations as well, right?
You don't want to have any humans involved in the deployment,
and the maintenance of your system, right?
You want humans, but supported by tools, right?
So human in the loop kind of things, right?
So that maybe you don't have 100% automation,
but you have humans in the loop.
So from looking back where we started the conversation,
I asked you, what are the things you always make sure you tell the people that you talk to? You obviously start with the infrastructure, having high-availability infrastructure. I really like the explanation with the availability zones, physically separated and so on and so forth, also the people aspect. What else is there? If we now come back maybe to, you know, the application or service layer, what are the things that people need to understand?
What are the key requirements that you say you need to understand this concept
because otherwise I wouldn't even start coding?
Right. Well, I think there are a few things, right?
But I would say the three most important that I see cause a lot of outages are timeouts, retries, and exception handling, right? Let's start with timeouts. A timeout is kind of the time you will wait for a request to succeed or fail, right? If your dependency doesn't answer, how long are you willing to wait?
And the problem with timeouts is if you wait too long,
then your system is just hanging, right?
It's just like you're using resources for nothing.
So the second problem is when you build software,
you often use libraries to do, let's say,
an HTTP call to a dependency or using an SDK.
And the problem is those libraries
are often configured with timeouts
that are out of this world.
I mean, either it's infinite timeouts
or relying on system defaults or 30 minutes, 100 minutes or 100 seconds.
I mean, in very few, especially on the backend side, very few libraries have a kind of meaningful timeout, right? And a meaningful timeout in the world of distributed system
is roughly 10 seconds, right?
Already in 10 seconds,
if you do 10 million requests per second,
that's a lot of hanging for nothing, right?
Yeah.
So the problem is people don't verify those libraries.
They don't do code introspection to figure out
what actually are the default timeouts of those libraries.
And often they figure it out
during an outage and in production.
And that's kind of happened
very, very, very, very regularly.
I think most of the outage I've experienced
before joining AWS were timeout related.
You download the
library to connect to a database,
a JDBC driver for MySQL,
and that's like a
minus one default, which
relies on system default.
So basically, you
have no idea what it is. And then
you realize in production that those
defaults are just really not good for
your system.
So first, figuring out how to manage your dependencies and understanding the timeouts, and not using default values, right? The most important thing is that you don't want to have a client-side timeout that is five seconds and then a backend timeout that is 30 minutes, because then you're going to use all the connection pools to your dependency and then run out of connection pools.
I think that's an error message that a lot of people have seen in their software is we
run out of connection pools, right?
So I think aligning the timeouts, and making sure that the timeout is passed from the client to the backend as an inherited timeout, is also good practice and should be done.
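To make that concrete, here is a minimal Python sketch of explicit, aligned timeouts on an HTTP call; the requests library, the URL, and the specific values are illustrative, not something from the episode:

```python
import requests

# Never rely on the library default (requests waits indefinitely if no timeout is set).
CONNECT_TIMEOUT_S = 3
READ_TIMEOUT_S = 10  # a "meaningful" backend timeout, not minutes

def call_dependency(url: str, remaining_budget_s: float) -> requests.Response:
    """Call a dependency without waiting longer than the caller's remaining time budget.

    remaining_budget_s is the deadline inherited from the upstream caller, so the
    backend never waits longer than the client that triggered the request.
    """
    read_timeout = min(READ_TIMEOUT_S, remaining_budget_s)
    return requests.get(url, timeout=(CONNECT_TIMEOUT_S, read_timeout))
```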
So is there an option?
Does it make sense or is there something that we can do to dynamically adjust timeouts?
Would this also make sense? Because obviously, if I download a library, the library creator would never know how his or her library is used in the real world, depending on, I don't know, end-to-end transaction behavior. Or is this something that doesn't make sense, and every architect should sit down and basically
say, well, this is what I believe should be the right timeout for this particular component
because we've done our capacity planning.
And so we know what level of load to expect.
That's a good question, actually.
I know two customers that are doing kind of dynamic timeouts. But if you introspect all your dependencies, and then you are able to alert an operator or a software developer if a timeout is not set before being deployed in production, that's already a massive step forward.
It will prevent or I think start small, right?
I always say start small and then see if you get that issue.
But just by first defining your timeouts and not use default values,
that's already a massive step forward.
So I would say keep it simple.
Static definition is great.
Yeah, and that's one of the ones, this and I think what we're going to talk about next, which is the retries and the retry jitter thing, which I hope we get to. Those are two of the ones that, when I read them, that's when I was like, oh, duh. But again, something you don't think about. Like, who thinks to look at the default timeout? Whereas we've seen problem patterns for years now, even with the connections you're talking about.
One of the ones we used to use in our demo environment was a default setup for Hibernate, which would cause all sorts of problems. And the idea is if you're going to use a framework or pull anything else in, you have to look at what it's doing and understand what it's doing before you go ahead and use it.
But most people just take, I'm just going to drop this in and go.
And one of the things I equate it to is like when you get a new phone,
you know, some people, when they get a new phone,
they just turn it on and use it and then wonder why everybody knows where
they are. And every time they post a picture, Oh, you were in, you know,
South of France. Oh, well, how did you know your geotags were turned on?
For me,
I go through every single setting in my phone and make sure it's what I want
to have turned on, you know, but that's differences in people. And I think developers have to really start looking into when I get a new framework,
look at all the settings. Andy, if we take a quick step back to when we had Stefano Dorian
with the machines optimizing all the settings, right? So there was a situation where like JVMs
have over 700 settings for configuration settings, right? And most people
don't know any of them, but they're doing cool things too, where they're using AI to help fine
tune those. And a lot of stuff that I think settings and default options are starting to
come back into the spotlight because most people ignore them. Yeah. I think the problem becomes
even more important as you build abstractions on the infrastructures,
on the software, because someone has made decisions, right? And you need to know those
decisions. Yeah. We actually talked about that in one of those other episodes: if you have a JVM
yourself and running it on your own host, you can control everything. If you're using Lambda,
I think we actually mentioned it in the podcast, you're relying on AWS to have the settings optimized
and there's only so much you can do
and you just have to trust in them.
And obviously you guys all do a great job,
but you have that, you're even further abstracted.
So the ones you can control,
you really have to look at even more closely.
Yeah, it's a trade-off, right?
Between operational details and flexibility.
But I think at least the default values that are exposed by the frameworks
should be understood and should be verified at all costs.
So we want to talk about retries.
I think that was another really interesting...
That's the second one you mentioned, I believe, right?
The retries?
Yeah, retries.
So retries are quite simple. I'll take an example: if you have kids, you've probably been traveling with kids in the back of the car, and they will say, are we there yet? Are we there yet? Are we there yet? Right? So that creates a lot of, kind of, let's say...
Tension.
Stress, tension.
And that's the same with a network, right?
And when your client is experiencing or when your backend is experiencing issue
and cannot answer queries or timeouts,
then the client very often will retry, right?
That's kind of a classic case of resiliency, right?
You want to retry after an error
because there are errors in system,
like transient errors and things like this, right?
And so classic, you retry.
The problem is, in a distributed system, if you have many clients and all of those experience kind of the same transient errors and everyone retries at the same time, you
overload the network with retry packets.
The problem is very often those retries are every second or few times per second, and
there's no max.
So these are kind of two default settings that you see is the retries are usually in
infinite loop with no max retries.
And that's kind of created massive issues.
It's kind of what we call a retry storm, right?
So you kind of DDoS yourself, basically, by using retries.
So one simple solution is actually, if a retry is failing, instead of retrying immediately, you wait a little bit before retrying again. It's like you back off, right? You realize daddy is not really happy to answer every second that we are not there yet, so you wait another two seconds, and then six, maybe ten seconds, and then fifteen, you know. And the classic way to do that is exponential backoff.
Now, the problem with exponential backoff,
and this is what is outlined in one of the papers we wrote about retries,
is in distributed systems, even if you back off with the same algorithm,
so exponential backoff in that case, you still have retry clusters
because it's natural that all the systems
will back off at the same time and then therefore retry at the same time. So you need to add
in your retry loop a jitter with the backoff algorithm. So you add a random jitter within
the backoff algorithm so that not everyone is retrying at the same time.
It can spread the retries across a larger time.
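As a rough sketch of the backoff-plus-jitter idea Adrian describes (loosely following the "full jitter" variant; the base delay, cap, and attempt count here are illustrative):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_s=0.1, cap_s=10.0):
    """Retry an operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: never retry forever
            # Full jitter: sleep a random amount between 0 and the capped backoff,
            # so that many clients do not retry in lockstep.
            backoff = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```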
And actually, that's also one of the big reasons why distributed systems take a long time to
recover from failures is because well-implemented distributed systems are in retry operations
and just take their time so that you can kind of slowly recover
from an outage and not have massive retry storms every time you bring the system back up.
Andy?
Yeah, what's up?
Think back to your old load testing days. What does that jitter versus no jitter piece remind you of?
Well, it's the random think times. Is that what you're talking about?
Yeah.
So just for anybody listening or whatever,
one of the early mistakes that used to be made in load testing
is you would just have a static think time.
So when you start your test, you'd have all these same time test hits,
which is the same exact thing that you're describing,
where it's all the tests every five seconds or every two seconds, another set of people would hit. So you'd have these false spikes going on all the time. And then as the load increased, it'd get even worse and worse, which is exactly why we'd
add the randomness into it, which is if you think about the idea of testers working with developers and vice versa,
that's the kind of thing where it might come up.
But since the testing teams had that on their radar
for a long time, it'd be interesting to see
if there was a collaboration between testing and dev,
if that would have come up in some sort of settings
if they were looking at that.
Experience, it's good that it reminds you of things, because I think those are the most important, right? When you can put an experience on top of a feature, it's great.
And Adrian, actually, the timeouts, the retries, and the backoffs were one of the things that I was also excited about when you showed it in Linz. And then I think I told you,
I kind of borrowed slash stole some of your slides.
I gave you credit for it in one of the presentations
I did a couple of weeks later
at a developer conference in Iasi, in Romania.
And it was extremely well-received as well.
You had a great animation where you actually showed
what happens if the timeouts are incorrectly
specified between backend and frontend and if the retries are not done well.
And I can just encourage everyone, we will put up the links to your blog posts, to your
papers where you actually talk about it and also to your slides and slide share.
So a lot of great material out there.
Great.
Good to hear.
Yeah, and I think also
I didn't talk about it, but the second
really bad effect of retries is actually the logging.
That's usually the second thing that happens is because
everybody retries and the backend kind of retries
to connect to dependencies in loop,
they write logs, and very often that will fill up the disk on the instance.
And then you don't have any more space left on the instance
and all of a sudden you take it out.
That has happened a lot, actually,
having systems that go in retry storms
and that are not killed because of the network but because
of the instances running out of space to write the logs due to those retries. So something that is important as well in your application is to actually monitor the disk space. It sounds stupid, right? But it's critical, because if you can't write your logs, basically your application is taken out of business. So you have to monitor that, and change the log level basically dynamically to adjust to this kind of stuff.
Yeah, but it's not stupid at all. We've actually been talking about this, you know, several times in our podcast, that when we advocate for monitoring your system,
not only in production, but also in pre-prod, we often tell people, if you're running your tests
against your new builds, then look at things like how many log messages are written by the test and
how did it change from the last test, from the last build to this build? Because maybe somebody just accidentally
is logging 10% more and the 10% will kill you later on
or somebody was accidentally turning on a log level
or changing the log level.
So these are all things we can detect much earlier.
We should.
Before joining AWS, actually,
we ran into that issue at scale.
And we got around it by actually stopping logging to a file on the disk itself. So we actually dynamically sent the logs to Elasticsearch through a stream, like either Kafka or Firehose on AWS, to be able to compensate for this kind of lack of logs.
And I think it was very good, actually.
We got rid of Logstash,
which was kind of a big problem to run on our instances.
Logstash is kind of a software that takes log
and sends it to a central logging system.
Yeah.
So instead of doing the log,
we actually wrote directly,
send the JSON object directly
to Elasticsearch for analysis.
And that was brilliant, actually,
because, first, we got rid of Logstash. So that's less software to install on our instances. And we got rid of large disks and of having to transfer the logs.
Of course, it brings another set of complexities.
You have to manage Elasticsearch at scale,
but there's a lot of offering out there that can help.
So at least for us, it was very, very good to move to do that.
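A minimal sketch of that idea, shipping structured log events to a Kinesis Data Firehose delivery stream instead of writing them to local disk; it assumes boto3 and an existing delivery stream, and the stream name is a placeholder:

```python
import json
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "application-logs"  # hypothetical delivery stream feeding Elasticsearch/S3

def log_event(level: str, message: str, **fields) -> None:
    """Ship a structured log record to Firehose instead of writing it to local disk."""
    record = {"level": level, "message": message, **fields}
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

# Usage (hypothetical fields): log_event("ERROR", "retry exhausted", dependency="payments", attempts=5)
```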
But it feels like it didn't really
address the problem of log floods, right?
Because if the system brought you,
if you were logging too much
and your current system couldn't handle it,
and you just replaced the backend
with a different system that can scale better,
but did you also address the situation
with just too much logs or too many log statements?
Yeah, of course, yeah.
I mean, that's definitely a big problem, right?
If you write too many logs,
and especially if you have the wrong log level in production.
But still, we got rid of writing logs to a file and having to transfer them after that. So it was a lot easier to bring systems up and down and kind of break that dependency on writing to disk.
Of course, I mean, you know,
if you solve a problem with a regex,
you have another problem, right?
So it's the same thing with logging, right?
If you don't write on the disk
and you write to Elasticsearch,
then you need to manage and scale Elasticsearch.
But we just found that a lot easier for us.
And of course, you can send it directly to S3 as well
if you want almost infinite scale.
Or to Firehose as well.
So I think those are great solutions to do this.
Actually, I've pushed some code on GitHub,
on my GitHub, that demos exactly that,
how to do that with your application.
So it's pretty cool.
You know, I think that brings up an interesting point too, that you don't always have to start with a full solution. If you go back to the default timeout versus a dynamic timeout, or this idea of changing the way you're treating your logs versus cleaning them up: obviously you want to go clean them up in the end, but if you have an immediate problem that you just need to get some more buffer zone around, there are steps you can take.
And a lot of times I think organizations are really stressed between, you know, release dates and schedules versus trying to do these maintenance types of tasks.
So just kind of putting into people's minds that sometimes there are smaller steps you can take that'll get you to that final path
which might be easier to take to start with, but it doesn't mean stop getting to your optimal state. But also, just from an overwhelming point of view, you don't have to think, oh my gosh, we have to clean up all the logs, it's going to be this huge project, we should just give up now. Right? There's something you could do in the meantime, and sometimes look at those, right?
Yeah, it's a good point. And especially, I think you have to think about the situation where you are experiencing an outage, and then you're out there trying to fix that outage. You want to have as simple things to do as possible; you don't want complicated things. So yeah, that's, you know, the idea that simple is beautiful, right?
And I think that actually pays off a lot
when you are trying to recover from outage.
So I think simple solutions work well.
It's just doing simple things is hard.
Yeah.
So the third thing, so you said timeouts, retries, and backoff,
which we kind of combined.
And then I believe you also said exception handling
was another thing you wanted to talk about.
Yeah. I mean, in applications, very often, especially in distributed systems where you have multiple dependencies, when one of those dependencies has an exception, it is usually wrapped inside another exception, and very often it's never recursively taken out and analyzed. So basically you don't really have visibility or observability into what's
really happened to your dependencies.
And so I think this is something that is very important,
handling exceptions and doing it well,
and not just a try except that doesn't do anything, right?
So just having a resilient system is great,
but it doesn't mean that you should have self-healing
without observability, right?
So it's good if you recover,
but it's always good to make sure that you understand
what was the root cause and you can actually go back into it
and not just recover without
any trace of what happened. And I think that's also kind of a big issue, right?
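As an illustration of what recursively unwrapping a nested exception could look like, here is a minimal Python sketch that walks the exception chain via __cause__ and __context__; the specific exceptions are just examples:

```python
def exception_chain(exc: BaseException):
    """Walk a wrapped exception down to its root cause, outermost first."""
    chain, seen = [], set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(exc)
        exc = exc.__cause__ or exc.__context__
    return chain

try:
    try:
        raise ConnectionError("database unreachable")
    except ConnectionError as inner:
        raise RuntimeError("order service failed") from inner
except RuntimeError as outer:
    for e in exception_chain(outer):
        # Log every layer, not just the outermost wrapper, so the root cause stays visible.
        print(f"{type(e).__name__}: {e}")
```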
Yeah, I mean, it's obviously a topic that we have been talking about a lot, especially with our history, where Brian and I work at Dynatrace, because we've always been strong advocates for end-to-end tracing and really figuring out what the root cause is. Now we just joined the OpenTelemetry group, with your colleagues and others, right? And we are really driving standards forward so we can get all of this information, the observability data,
the telemetry, the traces from all sorts of systems that our systems are running on.
Because if I write software, then let's assume I run it on AWS, then I can only manage or
control what I am kind of writing and tracing in my stack.
But then maybe I call into a third-party service.
And if you guys then are also applying that standard
or implementing that standard,
that gives everyone out there more visibility end-to-end,
what's actually happening in their end-to-end distributed system,
because we all know that eventually we will touch,
our transactions touch more third-party systems
or services than our own code, right?
Because if I run on AWS, I probably use DynamoDB,
I use some messaging, I use all sorts of things,
and then it would be great, obviously,
to have more root cause information through traces,
through telemetry, and so on and so forth.
Yeah, exactly. Yeah, makes sense.
Yeah. Hey, actually, it's interesting. When you said in the beginning timeouts, retries, and exceptions, when I heard exceptions, I almost thought you were going down that route of error handling, meaning: what if I'm a service and I'm calling another service, yes, I do my retries and all that stuff, but how do I
react in case there's actually a problem?
I thought you were talking about
error handling because I know there's a lot
of talk in your presentation.
It's part of it, right?
It's part of that error handling.
I think it's one
of the biggest issues in
software is naming your
functions in your class and then how do you
handle errors, right? And especially handle errors in a way that you're not, that you're still
providing a service that is accessible. Maybe it's not accessible in its full glory with all
the features and capabilities, but at least not showing the user a blank page that says,
sorry, we're out of connections in our database.
Yeah, it's degradation, right?
So it's basically how do you use errors to degrade your service to be resilient?
And kind of, I mean, that's the whole purpose of resiliency
is how do you offer a service
while experiencing an issue, an outage, right?
And the service doesn't have to be 100% full-blown,
like the full service itself.
It can be a degraded experience, and you said it well.
You can move from a write-read experience to read-only.
That's especially true for if you think about Prime Video, right?
They are in the business of serving videos, not writing to a master database, right?
So if you can show to your systems, to your application,
that actually a master database is down,
make it aware that, hey, it can still run.
Even if you're down, I'll just move into a read-only mode.
And then you handle that, and then you return that to the clients.
That can also degrade the experience or modify the UI so that the user hopefully doesn't notice much of what's happening.
Netflix does that as well very easily. You see their UI is actually made so that if a dependency on the,
let's say, a recommendation engine,
you have a lot of different recommendations on Netflix.
If one of those recommendations doesn't work,
they just remove that stack and kind of bump the rest up,
and you never see it.
And they heavily use the cache and things like this, you know. I think those are kind of the experiences that you can build, and you can get that through the health checks, right? And, you know, I'm opening a second Pandora's box here, which is health checks. How do you define health checks, between deep and shallow health checks? Those are also
complicated things that you need to research, you need to understand,
so that through a good health check, you can actually degrade the experience of the users for different things.
If you have dependencies to a mail server, it should react totally different than if you have a failure of a database, right?
So that's kind of the whole point of a health check system that kind of understand the situation of an outage, what's going on,
and hopefully degrade the experience so that users can still consume your service.
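A minimal sketch of a deep health check that maps dependency failures to a service mode rather than a binary up/down; the dependency names and the mode policy here are illustrative assumptions, not from Adrian's material:

```python
from typing import Callable, Dict

def _safe(check: Callable[[], bool]) -> bool:
    """Run a single dependency check, treating any exception as unhealthy."""
    try:
        return check()
    except Exception:
        return False

def check_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run dependency checks and decide on a service mode instead of just up/down."""
    results = {name: _safe(check) for name, check in checks.items()}
    if not results.get("primary_db", True):
        mode = "read-only"   # master database down: keep serving reads
    elif not all(results.values()):
        mode = "degraded"    # non-critical dependency (e.g. mail) down
    else:
        mode = "ok"
    return {"mode": mode, "dependencies": results}

# Usage (placeholder checks): check_health({"primary_db": ping_db, "mail": ping_mail})
```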
So I know we could probably go on and on,
because I'm just looking at your part two of your blog,
where you talk about avoiding cascading failures.
And then you actually had, in the wrapping up section,
you talk about, actually, continue reading,
because health checks are going to be the next thing
in the next part three of your blog series.
I have a couple of questions on this, because I think it's...
It's out there, actually. It is out there.
Sorry, yeah. Just out of curiosity: if I have a service A calling a service B, who is responsible for the health check, or for rechecking?
Is the caller responsible to deal with unhealthy systems? Or is the callee the one that says, well, hey, I reject you.
I mean, I assume there's different strategies.
And maybe you want to apply different strategies depending on your architecture. But is it typically I as the caller
am responsible for making sure
I'm not overloading an already unhealthy system?
Or is it more the other way around?
Even though I'm struggling,
I can still correctly tell my health state
and therefore I'm rejecting things.
What are the approaches here?
This is a very hard question, but a very good one.
From the backend side, so from the receiver side,
if you get into a war mode,
so if you get a lot of retry storms,
if you get in a state where you are being DDoSed,
it's basically your responsibility to protect the entrance
of the castle, right?
So you're going to do rate limiting, you're going to do load shedding, this kind of stuff,
right?
But of course, I love to see that responsibility also from the client side,
because at the end of the day,
it's the customer experience that is very important.
So if you look from a customer or from a client's point of view,
is how can you give back some visibility about the system
to the client so that it can help preserve
the infrastructure or the architectures
and knowingly degrade the experience.
I think it's much nicer to do it from that way
than to do it aggressively from the back end side
and trying to prevent aggressively being destroyed by your own clients.
So I think that's a good way to do it.
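As a rough illustration of load shedding on the receiving side, here is a minimal, framework-agnostic Python sketch that bounds in-flight requests and rejects the overflow with a Retry-After hint; the concurrency limit is illustrative:

```python
import threading

MAX_IN_FLIGHT = 100                      # illustrative concurrency budget
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(process):
    """Process a request, or shed it immediately when the service is saturated."""
    if not _slots.acquire(blocking=False):
        # Shed load instead of queueing forever; tell the client to back off.
        return 503, {"Retry-After": "2"}, "server busy, please retry later"
    try:
        return 200, {}, process()
    finally:
        _slots.release()
```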
Yeah. Well, I think if you can propagate the health state from the backend all the way to the front, and already tell the most front-end client, if that's the right word, to start backing off, that just saves a lot of pain throughout the whole architecture, right?
I mean, exactly. And especially, just take the example of a database, right? If you have a master and a read replica, and your master database goes down, your backend goes into read-only mode. You have to propagate this back to the client, so that it kind of moves the application into a mode where you don't
get the customer to try to write things. So for example, you wouldn't be able to update your
profile picture, or you wouldn't be able to write a status update if you would be a social app or
things like this, but you can still consume everything that is already being put into database. So these kind of things.
Or then you want to queue those requests.
Maybe some requests you're going to reject.
Some requests can be queued.
And that's the whole synchronous versus asynchronous.
That really depends on what you're trying to do.
If you have a bank transaction, you don't want to do that asynchronously; you want to make sure it's synchronous and that the system is handling it well, right? So basically you're going to go into a mode where your application won't let you do a money transfer if you have a master database going down, but you can maybe read something.
Yeah. So that also, I guess, comes in, I think, in the fourth part of your blog, where you talk about caching. And I assume when you talk about read-only, you could also say, well, I'm just reading from a cached version. And if we take the Netflix example, I think maybe you told me that, but maybe they don't turn the recommendations off completely, they just take a cached version. And that cached version might be at least, quote unquote, smart enough to show me recommendations for my geolocation, right? Like showing me, in this instance, European content because I'm from Europe. And then Brian, he would see content from the US.
And yeah.
Yeah, exactly.
I mean, of course, handling caches is not easy at all. It brings a lot of problems with cache eviction and things like this.
But you can at least use it to serve content that is maybe not that dynamic,
like the top movies of last month or something that
I don't know, there's tons of ways
to do that.
I think definitely cache plays a big
and important role in distributed systems
and should be used, but carefully as well, because it's like the RegEx as well.
It can create other sets of problems.
Very cool.
I know there will be a whole other topic probably around how we can test for it. Like, chaos engineering is another big topic of yours. But I believe, Brian, we probably want to invite Adrian back for a part two.
Yeah, I think so. I think we can fill up a whole other episode with you, Adrian, if you'd be willing to spend more time with us.
I'd love that. Yeah, chaos engineering is definitely my focus currently. So I'm traveling a lot and meeting customers
and helping them implement chaos engineering practices.
And that's super interesting as well.
Yeah, and it's something we haven't spent much time on either.
So I think it would make for a great topic.
So Andy, speaking of that though,
anything else or do you want to go ahead
and summon the Summarytor?
I think we can summon the Summarytor.
All right, go ahead.
So, Adrian, thank you so much, first of all, for again taking the time to kind of recap all the great content and stuff you've been working on for quite a while. Obviously, I was very happy that we met in Linz when you were speaking at DevOne. I think, what I took away from this session, there's always a lot of things.
I really like the, I don't know what the right term is, but the story of three.
So when you design a system for resiliency, always think about at least three nodes on the hardware side,
three services that can kind of, or also three, if it's possible, people that can kind of cover a
certain role.
Because I think what we also learned today is that resiliency goes through
the whole organization.
And that typically starts with people.
Absolutely.
And what I also took away from this chat is to version everything.
Coming back to the initial discussion we had
about making things transparent
and actually allowing other people
to see what other people have been doing
so that they can actually jump in.
From an architecture perspective,
when I asked you about,
so what are the things
that everybody needs to understand?
I believe to sum up the things
that we talked about, like timeouts, retries,
exception handlers, errors, and health check, I think that the thing that I take away most is
we need to start designing software for different service levels depending on the health state of
the overall system. So everything we do, every component we design needs to be able to adjust to a different service level mode, depending on what the overall state of the system is.
And we need to, I believe, shift left the health state, so that every service involved in the end-to-end transaction chain can immediately switch to a different service level.
And I think that's going to be a big thing
that differentiates people and organizations and services
if they can dynamically adapt the service level
on every level of their service chain
to deal with resiliency
Basically, I think that's what I took away.
Or at least on several levels, right? Maybe not all of them. Yeah, but at least a few levels that make sense for the customer. But hey, thanks a lot for inviting me, Andy and Brian. It's been such a pleasure actually to have this conversation. I didn't see the time pass, and that always means it's good.
Yeah, one last thing I just wanted to add briefly before we run is that, Andy, we've had a few shows on performance anti-patterns.
And if you recall, when we started looking at microservices,
most of the performance anti-patterns were basically
the same, just a microservice flavor of them.
What I really love about these concepts of the timeouts, the retries, and the exceptions,
although they apply to older monolithic things as well, they really become much stronger
anti-patterns.
I almost feel like we have some new anti-patterns that are becoming more common
in a cloud-native distributed kind of system,
which is, I don't want to say exciting,
but from a topical point of view
and cool things to look out for is exciting
because now there's these new things
where you can almost guarantee
nobody's looking at the timeout.
So when these situations occur,
you can be like, hey,
did you think about your timeouts? You know,
there's some of these new patterns
coming up that we can take
as quick hits to check
to make sure people are aware of
and addressing properly.
So anyway, that's all I got.
The low-hanging fruits, right?
Exactly.
Yeah.
The old ones used to be
N plus one query, right?
Or N plus one service calls between microservices
or the other ones, which still exist, right?
They're obviously still there.
But even when we moved from monolith to microservices,
they were kind of the same ones coming over.
We didn't have so many of these distributed system ones.
So it's cool to see some new ones finally entering.
Thanks again, Adrian.
This is... considering this is airing in August: are you going to be making any fall-season appearances?
Yeah, I'm going to... do you mean holidays, or...?
No, well, you can tell us about your holidays if you want, but I mean more like any conferences.
Oh, conferences. Yeah. Actually, I'm speaking tomorrow in Oslo about chaos engineering, but in August it's my holiday time. So when this is airing, I'm going to be on holiday.
Oh, perfect.
My family is coming over to Finland, and I'm going to see my niece and spend time with them, because I'm living in Finland, three thousand kilometers away. So we only see each other a couple of times a year. So it's going to be nice to have them over.
Great.
Awesome.
Well, thank you very much.
If people want to check out what you're doing,
uh,
you have some blogs on medium,
which we'll put up.
Do you also put things up on Twitter that people can follow you or any other
social media?
Yeah.
It's adorn on Twitter on medium can follow you or any other social media? Yeah, it's Adorn on Twitter, on Medium,
and pretty much all over the internet.
I think my days of privacy on the internet are over.
Great, awesome.
Well, we look forward to having you back. If any of our listeners have any questions or anything
or any topics they want us to cover, you can send us a tweet at pure underscore DT or an email at pureperformance at dynatrace.com.
Thanks, everyone, for listening.
You can follow me at Emperor Wilson, and Andy at Grabner Andi.
I got that right, Andy, right?
Correct?
All right.
It's been a while since we mentioned those guys.
So thanks, everyone, for listening.
Thank you, Adrian, for being on.
Andy, always a pleasure.
Thank you so much.