PurePerformance - How to build distributed resilient systems with Adrian Hornsby
Episode Date: August 19, 2019
Adrian Hornsby (@adhorn) has dedicated his last years to helping enterprises around the world build resilient systems. He wrote a great blog series titled "Patterns for Resilient Architectures" and has given numerous talks on the subject, such as Resiliency and Availability Design Patterns for the Cloud at DevOne in Linz earlier this year. Listen in and learn more about why resiliency starts with humans, why we need to version everything we do, why default timeouts have to be flagged, how to deal with retries and backoffs, and why every distributed architect has to start designing systems that provide different service levels depending on the overall system health state.
Links:
Adrian on Twitter: https://twitter.com/adhorn
Medium Blog Post: https://medium.com/@adhorn/patterns-for-resilient-architecture-part-1-d3b60cd8d2b6
Adrian's DevOne talk: https://www.youtube.com/watch?v=mLg13UmEXlw
DevOne Intro video: https://www.youtube.com/watch?v=MXXTyTc3SPU
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always, my co-host Andy Grabner.
How are you doing today, Andy?
Hey Brian, pretty good. Summer is finally here.
I don't actually know exactly when this is going to air, but at the time of the recording...
Towards the end of the summer, yeah.
Yeah, exactly. At the time of the recording it is really good.
We are just hitting about 30 degrees Celsius,
which is almost getting to the edge of getting too hot,
but it's still good.
Yep, we're having some nice weather here in Denver as well.
And I just got a new fence put up,
so all the donations that our listeners give for this podcast
didn't help with the fence at all, because we get no money for it anyway.
We have an exciting guest today. You sent me over some of the presentations and documentation, and as I was reading through it, I was like, wow, this is great stuff. As you and I were discussing before we started, when I was reading through this I'm like, this all makes perfect sense and seems completely obvious. But who the heck knew about this until someone ran into these problems and figured these things out? It's so important, and it's probably not even thought of that much. I guess what I'm trying to say is it's very obvious stuff, but probably stuff that nobody even thinks about very much, right? It's hiding in plain sight.
Exactly. I mean, I wouldn't say it's too obvious; if it were, I would assume that our guest wouldn't have a real job spreading the word. But it's really interesting. I think the more people are moving into the cloud space, the more important it is for people to understand. And now coming to the topic,
resiliency and availability design, which was actually the topic that Adrian,
who is our guest today, spoke about at the DevOne conference here in Linz. And I wouldn't give it
as much credit, obviously, if I would kind of rehearse what he did.
That's why I want to, first of all, say welcome to the show, Adrian.
Thanks so much for taking time.
Thank you so much for actually making the way.
You made the way to Linz in April and enlightened a lot of the developers, I think, in the audience.
And now I really want to, first of all, pass it over to you and let you introduce yourself, what you do.
And then I want to talk really more about the stuff
that you are advocating for in the world of cloud native.
Hi, Andy. Hi, Brian.
First of all, thank you so much for inviting me.
It's such a pleasure to join you on the podcast.
And second, it was such a pleasure to actually visit Linz, and especially the conference, DevOne.
My feedback was that that conference was actually one of the best conferences I attended this year. I don't know, the whole day was just great, from the introduction music to the talks, the speakers, the people there. It actually blew my mind. So I think the pleasure was mine more than yours. But thanks again for having me here. And yes,
resiliency is actually quite an interesting topic. I wouldn't say no one is thinking about it; I think everyone is thinking about it. The problem is more how to do it, and those kinds of things.
Andy, before we dive in, you mentioned the intro music. What was your intro music that was so memorable?
I think it was special. So they actually had a team, I think it was a combination of the Dynatrace creative team and also a local
company here in Linz that created the video and the music, you know,
and actually we had a guy on stage also performing later on in the day.
So if you go on YouTube,
I believe they uploaded the intro videos recently to YouTube.
So if you go online and check out the DevOne Linz 2019 intro music, I think you will hear it. And especially if you consider the whole video being on, I don't know how big the stage was, but it was immense. It was blowing everybody's mind. I think that was really cool.
It was amazing, really amazing. I watched that video on YouTube in a loop at least a few times.
So, Adrian, can you, before we jump into the topic,
a little bit about yourself, you know, your background
and also what you do right now, and then let's jump into the topic.
Cool.
Yeah, so I'm working for AWS,
so Amazon Web Services, as a technical evangelist.
Technical evangelist is kind of the voice of the customers back to the service team.
So we do a lot of conferences.
We talk to a lot of customers.
And actually, we travel a lot to speak at conferences and meet our customers.
And then we feed back a lot of this discussion back to the service team so that we can improve our services, right?
Actually, a lot of the services we do
and the roadmap of our services is based on that feedback.
Actually, about 90, 95% of roadmap is customer feedback.
So, I mean, it's a big, big role and I love it.
So, currently, my focus is on architecture
and how to build resilient architectures,
how to practice chaos engineering
or resiliency engineering, safety engineering,
and all these kinds of things
that are really exciting for me
because I've been doing that for about 11 years now
and I joined AWS three years ago.
So it's like the cherry on top of the cake for me. I get to speak about the technology I love around the world to great customers, and learn a lot in the process as well.
Hey, so obviously we all know AWS, and I believe you are obviously one company that people look up to when it comes to scale and building resiliency.
I don't remember.
I mean, there's not many times that I remember when Amazon or AWS itself had any problems.
So obviously you're doing a good job.
Now, if you talk with developers, architects around the world,
and also coming back to what Brian said in the beginning about it all seems so obvious,
but it seems we don't know enough about it.
What are the things that you tell every architect, every developer?
What are the top three things that you believe everyone that is moving into the space of figuring out what is the next architecture?
What are the things you tell them that they definitely have to take care of,
that they have to read up on, that they have to understand? Because otherwise, it doesn't make
sense to build a system that can potentially scale and is resilient against all sorts of things.
That's a very important question, indeed. I'll go back a little bit in the question
and talk about the resiliency at AWS.
I mean, we launched AWS about in 2006.
So that's almost 12 years of operational excellence.
And where we have learned a lot,
we always say internally
that there's no compression algorithm for experience.
So we've experienced a lot of outage and, you know, at scale,
it's impossible not to have failures.
I mean, the probability to have failures
increased dramatically with scale.
So, I mean, for us, that experience is something
that we can give back to our customers in terms of services that we built and how they are built and what's under the hood.
And, you know, I think a lot of the, let's say, the good resiliency we're experiencing on AWS is based on this experience, right? And it starts how we build our regions,
how we build the infrastructures,
how we build the services.
And one of the first construct
is actually what we call an AWS region, right?
And we have 21 regions globally with AWS.
And a region is based on several availability zones, which are physically separated.
And usually a region has three availability zones, right?
So the fact that we decouple or have redundancy in our availability zones actually gives already,
out of the box, very high resiliency.
You know, it's, it's mathematic, right? If you put systems in parallel,
the overall availability of that system kind of increases.
So that's actually built in kind of in our infrastructure
so that in case there is an issue happening in one AZ,
one availability zone, you know, the rest of the availability zones can take over
and maintain, let's say, availability and
reliability for the service.
So, of course, for our users and for our customers, the first thing to understand is how we build
infrastructure because it really depends on the cloud providers.
I mean, cloud providers have different ways of building their regions, but on AWS, we use that region and availability zone construct to build this, let's say, out of the box,
multi-AZ redundancy.
And all our services are based on that, right?
So actually that's the first thing to understand.
Right.
And I think in one of your posts you had put in: if you had a single instance running
at 99% uptime, right?
That downtime gets you to three days and 15 hours a year, I assume.
And two in parallel brings you to about 52 minutes.
Yes.
But when you go to three in parallel, which is exactly what you're talking about.
Yeah.
Yeah, six nines.
Yeah, going to six nines is 31 seconds, which is, as you said, it's just, it's great.
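Just to make the arithmetic concrete, here is a minimal sketch of the availability math being discussed, assuming a 99% available instance and independent failures (the idealized case):

```python
# Yearly downtime for n independent instances in parallel, each 99% available.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_per_year(availability: float, n: int) -> float:
    """Return yearly downtime in seconds for n instances in parallel."""
    combined = 1 - (1 - availability) ** n   # probability that at least one is up
    return (1 - combined) * SECONDS_PER_YEAR

for n in (1, 2, 3):
    print(n, round(downtime_per_year(0.99, n)))
# 1 -> ~315,360 s (about 3 days 15 hours)
# 2 -> ~3,154 s   (about 52 minutes)
# 3 -> ~32 s      (about half a minute, i.e. six nines)
```

Real systems rarely fail fully independently, so these figures are an upper bound on what redundancy alone buys you.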
And it's funny because we also do the three,
we always talk about three nodes too as well
with our systems.
And to me, it was just like,
oh, we just, that's just what we do and what we needed.
This kind of highlighted exactly why
and why it's so important.
So yeah.
And it's mathematic, right?
It's actually how we build electronic devices as well
and how nuclear power plants as well are built, right?
There's high redundancy in every component that is used
so that in case of failure, the other one can take over, right?
So that's also kind of the idea of bulkheading, right? You create isolation so that you avoid having a very big blast radius. If you have one instance and that one fails, your blast radius is a hundred percent. So if you have three instances and one fails, or three AZs and one fails, basically your blast radius is reduced to 33 percent. So it's already kind of nice. Of course, on top of that, we do a lot of other
things, right? So our infrastructure is built using what we call cell-based architectures,
which is actually a construct that is taking the idea of multi-AZ to the next level, right?
So we don't want to spend too much time explaining this, but basically it's like really creating small individual cells
that basically limit the blast radius of failures.
And that's just for the infrastructure side.
Then you have on the software level,
you have to build redundancy as well and resiliency on the application.
You have on the operation and also on people.
You know, software people don't often think about it,
but in every team there's that very, very good software developer,
that magician that does everything.
And if you take him out of the equation,
basically you can have very problematic operations, right?
At least I always had that guy in our company that could do everything.
And the bus factor, I always say, is very high.
So you don't want that.
That's kind of like the whole point of building resiliency into your system
so that you don't have to rely or have that single point of failure in a single person.
I forget who your person's name was.
I had a Paul, but it really doesn't matter.
I also wanted to say, I think Adrian at DevOne,
didn't you also, I think in your presentation, or maybe it was at a
conversation we had later on, you actually talked about
testing the resiliency of the organization by actually taking people
out and seeing what is happening if critical people
or certain teams are simply
not available, taking away their phones, letting them not connect to the network and seeing
how does the organization behave as a whole to the fact that certain people are not there.
Yeah, exactly.
I mean, I think that's the bigger picture of resiliency engineering, or kind of business resiliency: you know,
identify key people in the organization, in your software team,
and just take them out of the equation.
So I often go to customers and help them, let's say,
implement what we call chaos engineering.
So the first thing I do is I don't test software.
I test people.
So basically I find those key people and I take their laptop
and tell them to go home, and see how basically the operation of their software people, or the whole software team, is behaving. And very often it's chaos, actually. Very often you'll see that people are not able to recover from the loss of a single person in their team, which is kind of obvious that you want to prevent or work around.
But I mean, many organizations can't handle that.
And that's just what it is.
And is your best practice, I know people don't like the word best practice,
but is it one of your guidances that you also kind of apply the factor of three here, that you should have, let's say, at least three systems, or three people in this case, that can do a particular job? Or does this work differently with people?
It's a good question. You could apply a factor of three, but I think it would maybe be far-fetched.
I think it's just important to be able to share knowledge.
And I think that's also a very good reason why we want to version everything in our software.
You version the code, you version the infrastructure, you version the documentation,
and especially, you know,
all the operations as well.
I think that's a big part of the problem
is when people do operation,
they don't leave trace of what they're doing.
So if you can put in place systems
that actually force people to open a ticket,
explain what they want to do,
and only after that ticket has been opened,
they are granted access to the operation.
So there's actually a living documentation of what has happened, right?
And everything is version control.
Everything is traceable.
You know, the logs, the actions, I mean, pretty much also the API, right?
That's why enabling CloudTrail, for example, is super important,
or AWS Config to verify configuration drifts
or things like this.
I think that's kind of the first thing
before having to employ three people.
Because if you do that, actually,
most of the team should be able to recover from that
or at least to follow up on that and kind of go
back up the stack of what's happened, right?
It's about the kind of documenting history, if you will.
It sounds like it's more than just documentation, right?
When we talk about documenting, you can consider releases as code and infrastructure as code as a way of documenting, you know,
not like writing paragraphs about what something is, but as you mentioned in your talks, you
want to do like, say your infrastructure as code or deployments as code.
So this way it is in a way documented, but more the processes so that if, if, you know,
Paul, I'm going to make a Beatles reference. If Paul is dead, you know, you have it all there, and someone just has to push that button to push that script out.
And it's more about maintaining that script then
as opposed to maintaining the person.
Yeah, exactly.
And you say it's about automations as well, right?
You don't want to have any humans involved in the deployment,
and the maintenance of your system, right?
You want humans, but supported by tools, right?
So human in the loop kind of things, right?
So that maybe you don't have 100% automation,
but you have humans in the loop.
So from looking back where we started the conversation,
I asked you, what are the things you always make sure you tell the people that you talk to? You obviously start with the infrastructure, having high-availability infrastructure. I really like the explanation with the availability zones, physically separated and so on and so forth, also the people aspect. What else is there? If we now come back maybe to, you know, the application or service layer, what are the things that people need to understand?
What are the key requirements that you say you need to understand this concept
because otherwise I wouldn't even start coding?
Right. Well, I think there are a few things, right?
But I would say the three most important that I see cause a lot of outages are timeouts, retries, and exception handling, right? Let's start with timeouts. A timeout is kind of the time you will wait for a request to succeed or fail, right? If your dependency doesn't answer, how long are you willing to wait?
And the problem with timeouts is if you wait too long,
then your system is just hanging, right?
It's just like you're using resources for nothing.
So the second problem is when you build software,
you often use libraries to do, let's say,
an HTTP call to a dependency or using an SDK.
And the problem is those libraries
are often configured with timeouts
that are out of this world.
I mean, either it's infinite timeouts
or relying on system defaults or 30 minutes, 100 minutes or 100 seconds.
I mean, in very few, especially on the backend side, very few libraries have a kind of meaningful timeout, right? And a meaningful timeout in the world of distributed system
is roughly 10 seconds, right?
Already in 10 seconds,
if you do 10 million requests per second,
that's a lot of hanging for nothing, right?
Yeah.
So the problem is people don't verify those libraries.
They don't do code introspection to figure out
what actually are the default timeouts of those libraries.
And often they figure it out
during an outage and in production.
And that's kind of happened
very, very, very, very regularly.
I think most of the outage I've experienced
before joining AWS were timeout related.
You download the
library to connect to a database,
a JDBC driver for MySQL,
and that's like a
minus one default, which
relies on system default.
So basically, you
have no idea what it is. And then
you realize in production that those
defaults are just really not good for
your system.
So first, figuring out how to manage your dependencies and understanding the timeouts, and not using default values, right? The most important thing is that you don't want to have a client-side timeout that is five seconds and then a backend timeout that is 30 minutes, because then you're going to use all the connection pools to your dependency and then run out of connection pools.
I think that's an error message that a lot of people have seen in their software is we
run out of connection pools, right?
So I think aligning the timeouts, and making sure that the timeout is passed from the client to the backend as an inherited timeout, is also good practice and should be done.
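To make that concrete, here is a minimal Python sketch of explicit, aligned timeouts on an HTTP call; the requests library, the URL, and the specific values are illustrative, not something from the episode:

```python
import requests

# Never rely on the library default (requests waits indefinitely if no timeout is set).
CONNECT_TIMEOUT_S = 3
READ_TIMEOUT_S = 10  # a "meaningful" backend timeout, not minutes

def call_dependency(url: str, remaining_budget_s: float) -> requests.Response:
    """Call a dependency without waiting longer than the caller's remaining time budget.

    remaining_budget_s is the deadline inherited from the upstream caller, so the
    backend never waits longer than the client that triggered the request.
    """
    read_timeout = min(READ_TIMEOUT_S, remaining_budget_s)
    return requests.get(url, timeout=(CONNECT_TIMEOUT_S, read_timeout))
```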
So is there an option?
Does it make sense or is there something that we can do to dynamically adjust timeouts?
Would this also make sense? Because obviously, if I download a library, the library creator would never know how his or her library is used in the real world, depending on, I don't know, end-to-end transaction behavior. Or is this something that doesn't make sense, and every architect should sit down and basically
say, well, this is what I believe should be the right timeout for this particular component
because we've done our capacity planning.
And so we know what level of load to expect.
That's a good question, actually.
I know two customers that are doing kind of dynamic timeouts. But if you introspect all your dependencies, and then you are able to alert an operator or a software developer if a timeout is not set before being deployed in production, that's already a massive step forward.
It will prevent or I think start small, right?
I always say start small and then see if you get that issue.
But just by first defining your timeouts and not use default values,
that's already a massive step forward.
So I would say keep it simple.
Static definition is great.
Yeah, and that's one of the ones, this and I think what we're going to talk about next, which is the retries and the retry jitter thing, which I hope we get to. Those are two of the ones that, when I read them, that's when I was like, oh, duh. But again, something you don't think about. Like, who thinks to look at the default timeout? Whereas we've seen problem patterns for years now, even with the connections you're talking about.
One of the ones we used to use in our demo environment was a default setup for Hibernate, which would cause all sorts of problems. And the idea is if you're going to use a framework or pull anything else in, you have to look at what it's doing and understand what it's doing before you go ahead and use it.
But most people just take, I'm just going to drop this in and go.
And one of the things I equate it to is like when you get a new phone,
you know, some people, when they get a new phone,
they just turn it on and use it and then wonder why everybody knows where
they are. And every time they post a picture, Oh, you were in, you know,
South of France. Oh, well, how did you know your geotags were turned on?
For me,
I go through every single setting in my phone and make sure it's what I want
to have turned on, you know, but that's differences in people. And I think developers have to really start looking into when I get a new framework,
look at all the settings. Andy, if we take a quick step back to when we had Stefano Dorian
with the machines optimizing all the settings, right? So there was a situation where like JVMs
have over 700 settings for configuration settings, right? And most people
don't know any of them, but they're doing cool things too, where they're using AI to help fine
tune those. And a lot of stuff that I think settings and default options are starting to
come back into the spotlight because most people ignore them. Yeah. I think the problem becomes
even more important as you build abstractions on the infrastructures,
on the software, because someone has made decisions, right? And you need to know those
decisions. Yeah. We actually talked about that in one of those other episodes: if you have a JVM
yourself and running it on your own host, you can control everything. If you're using Lambda,
I think we actually mentioned it in the podcast, you're relying on AWS to have the settings optimized
and there's only so much you can do
and you just have to trust in them.
And obviously you guys all do a great job,
but you have that, you're even further abstracted.
So the ones you can control,
you really have to look at even more closely.
Yeah, it's a trade-off, right?
Between operational details and flexibility.
But I think at least the default values that are exposed by the frameworks
should be understood and should be verified at all costs.
So we want to talk about retries.
I think that was another really interesting...
That's the second one you mentioned, I believe, right?
The retries?
Yeah, retries.
So retries are quite simple. I'll take an example: if you have kids, you've probably been traveling with kids in the back of the car, and they will say, are we there yet? Are we there yet? Are we there yet? Right? So that creates a lot of, kind of, let's say...
Tension.
Stress, tension.
And that's the same with a network, right?
And when your client is experiencing or when your backend is experiencing issue
and cannot answer queries or timeouts,
then the client very often will retry, right?
That's kind of a classic case of resiliency, right?
You want to retry after an error
because there are errors in system,
like transient errors and things like this, right?
And so classic, you retry.
The problem is, in a distributed system, if you have many clients and all of those experience kind of the same transient errors and everyone retries at the same time, you
overload the network with retry packets.
The problem is very often those retries are every second or few times per second, and
there's no max.
So these are kind of two default settings that you see is the retries are usually in
infinite loop with no max retries.
And that's kind of created massive issues.
It's kind of what we call a retry storm, right?
So you kind of DDoS yourself, basically, by using retries.
So one simple solution is actually, if a retry is failing, instead of retrying immediately, you wait a little bit before retrying again. It's like you back off, right? You realize daddy is not really happy to answer every second that we are not there yet, so you wait another two seconds, and then six, maybe ten seconds, and then fifteen, you know. And the classic way to do that is exponential backoff.
Now, the problem with exponential backoff,
and this is what is outlined in one of the papers we wrote about retries,
is in distributed systems, even if you back off with the same algorithm,
so exponential backoff in that case, you still have retry clusters
because it's natural that all the systems
will back off at the same time and then therefore retry at the same time. So you need to add
in your retry loop a jitter with the backoff algorithm. So you add a random jitter within
the backoff algorithm so that not everyone is retrying at the same time.
It can spread the retries across a larger time.
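As a rough sketch of the backoff-plus-jitter idea Adrian describes (loosely following the "full jitter" variant; the base delay, cap, and attempt count here are illustrative):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_s=0.1, cap_s=10.0):
    """Retry an operation with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: never retry forever
            # Full jitter: sleep a random amount between 0 and the capped backoff,
            # so that many clients do not retry in lockstep.
            backoff = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))
```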
And actually, that's also one of the big reasons why distributed systems take a long time to
recover from failures is because well-implemented distributed systems are in retry operations
and just take their time so that you can kind of slowly recover
from an outage and not have massive retry storms every time you bring the system back up.
Andy?
Yeah, what's up?
Think back to your old load testing days. What does that jitter versus no jitter piece remind you of?
Well, it's the random think times. Is that what you're talking about?
Yeah.
So just for anybody listening or whatever,
one of the early mistakes that used to be made in load testing
is you would just have a static think time.
So when you start your test, you'd have all these same time test hits,
which is the same exact thing that you're describing,
where it's all the tests every five seconds or every two seconds, another set of people would hit. So you'd have these false spikes going on all the time. And then as the load increased, it'd get even worse and worse, which is exactly why we'd
add the randomness into it, which is if you think about the idea of testers working with developers and vice versa,
that's the kind of thing where it might come up.
But since the testing teams had that on their radar
for a long time, it'd be interesting to see
if there was a collaboration between testing and dev,
if that would have come up in some sort of settings
if they were looking at that.
Experience, it's good that it reminds you of things, because I think those are the most important, right? When you can put an experience on top of a feature, it's great.
And Adrian, actually, the timeouts, the retries, and the backoffs were one of the things that I was also excited about when you showed it in Linz. And then I think I told you,
I kind of borrowed slash stole some of your slides.
I gave you credit for it in one of the presentations
I did a couple of weeks later
at a developer conference in Iasi, in Romania.
And it was extremely well-received as well.
You had a great animation where you actually showed
what happens if the timeouts are incorrectly
specified between backend and frontend and if the retries are not done well.
And I can just encourage everyone, we will put up the links to your blog posts, to your
papers where you actually talk about it and also to your slides and slide share.
So a lot of great material out there.
Great.
Good to hear.
Yeah, and I think also
I didn't talk about it, but the second
really bad effect of retries is actually the logging.
That's usually the second thing that happens is because
everybody retries and the backend kind of retries
to connect to dependencies in loop,
they write logs, and very often that will fill up the disk on the instance.
And then you don't have any more space left on the instance
and all of a sudden you take it out.
That has happened a lot, actually,
having systems that go in retry storms
and that are not killed because of the network but because
of the instances running out of space to write the logs due to those retries. So something that is important as well in your application is to actually monitor the disk space. It sounds stupid, right? But it's critical, because if you can't write your logs, basically your application is taken out of business. So you have to monitor that, and change the log level basically dynamically to adjust to this kind of stuff.
Yeah, but it's not stupid at all. We've actually been talking about this, you know, several times in our podcast, that when we advocate for monitoring your system,
not only in production, but also in pre-prod, we often tell people, if you're running your tests
against your new builds, then look at things like how many log messages are written by the test and
how did it change from the last test, from the last build to this build? Because maybe somebody just accidentally
is logging 10% more and the 10% will kill you later on
or somebody was accidentally turning on a log level
or changing the log level.
So these are all things we can detect much earlier.
We should.
Before joining AWS, actually,
we ran into that issue at scale.
And we got around it by actually stopping logging to a file on the disk itself. So we actually dynamically sent the logs to Elasticsearch through a stream, like either Kafka or Firehose on AWS, to be able to compensate for this kind of lack of logs.
And I think it was very good, actually.
We got rid of Logstash,
which was kind of a big problem to run on our instances.
Logstash is kind of a software that takes log
and sends it to a central logging system.
Yeah.
So instead of doing the log,
we actually wrote directly,
send the JSON object directly
to Elasticsearch for analysis.
And that was brilliant, actually,
because, first, we got rid of Logstash. So that's less software to install on our instances. And we got rid of large disks and of having to transfer the logs.
Of course, it brings another set of complexities.
You have to manage Elasticsearch at scale,
but there's a lot of offering out there that can help.
So at least for us, it was very, very good to move to do that.
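A minimal sketch of that idea, shipping structured log events to a Kinesis Data Firehose delivery stream instead of writing them to local disk; it assumes boto3 and an existing delivery stream, and the stream name is a placeholder:

```python
import json
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "application-logs"  # hypothetical delivery stream feeding Elasticsearch/S3

def log_event(level: str, message: str, **fields) -> None:
    """Ship a structured log record to Firehose instead of writing it to local disk."""
    record = {"level": level, "message": message, **fields}
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

# Usage (hypothetical fields): log_event("ERROR", "retry exhausted", dependency="payments", attempts=5)
```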
But it feels like it didn't really
address the problem of log floods, right?
Because if the system brought you,
if you were logging too much
and your current system couldn't handle it,
and you just replaced the backend
with a different system that can scale better,
but did you also address the situation
with just too much logs or too many log statements?
Yeah, of course, yeah.
I mean, that's definitely a big problem, right?
If you write too many logs,
and especially if you have the wrong log level in production.
But still, we got rid of writing logs to a file and having to transfer them after that. So it was a lot easier to bring systems up and down and kind of break that dependency on writing to disk.
Of course, I mean, you know,
if you solve a problem with a regex,
you have another problem, right?
So it's the same thing with logging, right?
If you don't write on the disk
and you write to Elasticsearch,
then you need to manage and scale Elasticsearch.
But we just found that a lot easier for us.
And of course, you can send it directly to S3 as well
if you want almost infinite scale.
Or to Firehose as well.
So I think those are great solutions to do this.
Actually, I've pushed some code on GitHub,
on my GitHub, that demos exactly that,
how to do that with your application.
So it's pretty cool.
You know, I think that brings up an interesting point too, that you don't always have to start with a full solution. If you go back to the default timeout versus a dynamic timeout, or this idea of changing the way you're treating your logs versus cleaning them up: obviously you want to go clean them up in the end, but if you have an immediate problem that you just need to get some more buffer zone around, there are steps you can take.
And a lot of times I think organizations are really stressed between, you know, release dates and schedules versus trying to do these maintenance types of tasks.
So just kind of putting into people's minds that sometimes there are smaller steps you can take that'll get you to that final path
which might be easier to take to start with, but it doesn't mean stop getting to your optimal state. But also, just from an overwhelming point of view, you don't have to think, oh my gosh, we have to clean up all the logs, it's going to be this huge project, we should just give up now. Right? There's something you could do in the meantime, and sometimes look at those, right?
Yeah, it's a good point. And especially, I think you have to think about the situation where you are experiencing an outage, and then you're out there trying to fix that outage. You want to have as simple things to do as possible; you don't want complicated things. So yeah, that's, you know, the idea that simple is beautiful, right?
And I think that actually pays off a lot
when you are trying to recover from outage.
So I think simple solutions work well.
It's just doing simple things is hard.
Yeah.
So the third thing, so you said timeouts, retries, and backoff,
which we kind of combined.
And then I believe you also said exception handling
was another thing you wanted to talk about.
Yeah. I mean, in applications, very often, especially in distributed systems where you have multiple dependencies, when one of those dependencies has an exception, it is usually wrapped inside another exception, and very often it's never recursively taken out and analyzed. So basically you don't really have visibility or observability into what's
really happened to your dependencies.
And so I think this is something that is very important,
handling exceptions and doing it well,
and not just a try except that doesn't do anything, right?
So just having a resilient system is great,
but it doesn't mean that you should have self-healing
without observability, right?
So it's good if you recover,
but it's always good to make sure that you understand
what was the root cause and you can actually go back into it
and not just recover without
any trace of what happened. And I think that's also kind of a big issue, right?
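As an illustration of what recursively unwrapping a nested exception could look like, here is a minimal Python sketch that walks the exception chain via __cause__ and __context__; the specific exceptions are just examples:

```python
def exception_chain(exc: BaseException):
    """Walk a wrapped exception down to its root cause, outermost first."""
    chain, seen = [], set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(exc)
        exc = exc.__cause__ or exc.__context__
    return chain

try:
    try:
        raise ConnectionError("database unreachable")
    except ConnectionError as inner:
        raise RuntimeError("order service failed") from inner
except RuntimeError as outer:
    for e in exception_chain(outer):
        # Log every layer, not just the outermost wrapper, so the root cause stays visible.
        print(f"{type(e).__name__}: {e}")
```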
Yeah, I mean, it's obviously a topic that we have been talking about a lot, especially with our history, where Brian and I work at Dynatrace, because we've always been strong advocates for end-to-end tracing and really figuring out what the root cause is. Now we just joined the OpenTelemetry group, with your colleagues and others, right? And we are really driving standards forward so we can get all of this information, the observability data,
the telemetry, the traces from all sorts of systems that our systems are running on.
Because if I write software, then let's assume I run it on AWS, then I can only manage or
control what I am kind of writing and tracing in my stack.
But then maybe I call into a third-party service.
And if you guys then are also applying that standard
or implementing that standard,
that gives everyone out there more visibility end-to-end,
what's actually happening in their end-to-end distributed system,
because we all know that eventually we will touch,
our transactions touch more third-party systems
or services than our own code, right?
Because if I run on AWS, I probably use DynamoDB,
I use some messaging, I use all sorts of things,
and then it would be great, obviously,
to have more root cause information through traces,
through telemetry, and so on and so forth.
Yeah, exactly. Yeah, makes sense.
Yeah. Hey, actually, it's interesting. When you said in the beginning timeouts, retries, and exceptions, when I heard exceptions, I almost thought you were going down that route of error handling, meaning: what if I'm a service and I'm calling another service, yes, I do my retries and all that stuff, but how do I
react in case there's actually a problem?
I thought you were talking about
error handling because I know there's a lot
of talk in your presentation.
It's part of it, right?
It's part of that error handling.
I think it's one
of the biggest issues in
software is naming your
functions in your class and then how do you
handle errors, right? And especially handle errors in a way that you're not, that you're still
providing a service that is accessible. Maybe it's not accessible in its full glory with all
the features and capabilities, but at least not showing the user a blank page that says,
sorry, we're out of connections in our database.
Yeah, it's degradation, right?
So it's basically how do you use errors to degrade your service to be resilient?
And kind of, I mean, that's the whole purpose of resiliency
is how do you offer a service
while experiencing an issue, an outage, right?
And the service doesn't have to be 100% full-blown,
like the full service itself.
It can be a degraded experience, and you said it well.
You can move from a write-read experience to read-only.
That's especially true for if you think about Prime Video, right?
They are in the business of serving videos, not writing to a master database, right?
So if you can show to your systems, to your application,
that actually a master database is down,
make it aware that, hey, it can still run.
Even if you're down, I'll just move into a read-only mode.
And then you handle that, and then you return that to the clients.
That can also degrade the experience or modify the UI so that the user hopefully doesn't notice much of what's happening.
Netflix does that as well very easily. You see their UI is actually made so that if a dependency on the,
let's say, a recommendation engine,
you have a lot of different recommendations on Netflix.
If one of those recommendations doesn't work,
they just remove that stack and kind of bump the rest up,
and you never see it.
And they heavily use the cache and things like this, you know. I think those are kind of the experiences that you can build, and you can get that through the health checks, right? And, you know, I'm opening a second Pandora's box here, which is health checks. How do you define health checks, between deep and shallow health checks? Those are also
complicated things that you need to research, you need to understand,
so that through a good health check, you can actually degrade the experience of the users for different things.
If you have dependencies to a mail server, it should react totally different than if you have a failure of a database, right?
So that's kind of the whole point of a health check system that kind of understand the situation of an outage, what's going on,
and hopefully degrade the experience so that users can still consume your service.
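A minimal sketch of a deep health check that maps dependency failures to a service mode rather than a binary up/down; the dependency names and the mode policy here are illustrative assumptions, not from Adrian's material:

```python
from typing import Callable, Dict

def _safe(check: Callable[[], bool]) -> bool:
    """Run a single dependency check, treating any exception as unhealthy."""
    try:
        return check()
    except Exception:
        return False

def check_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run dependency checks and decide on a service mode instead of just up/down."""
    results = {name: _safe(check) for name, check in checks.items()}
    if not results.get("primary_db", True):
        mode = "read-only"   # master database down: keep serving reads
    elif not all(results.values()):
        mode = "degraded"    # non-critical dependency (e.g. mail) down
    else:
        mode = "ok"
    return {"mode": mode, "dependencies": results}

# Usage (placeholder checks): check_health({"primary_db": ping_db, "mail": ping_mail})
```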
So I know we could probably go on and on,
because I'm just looking at your part two of your blog,
where you talk about avoiding cascading failures.
And then you actually had, in the wrapping up section,
you talk about, actually, continue reading,
because health checks are going to be the next thing
in the next part three of your blog series.
I have a couple of questions on this, because I think it's...
It's out there, actually. It is out there.
Sorry, yeah. Just out of curiosity: if I have a service A calling a service B, who is responsible for the health check, or for rechecking?
Is the caller responsible to deal with unhealthy systems? Or is the callee the one that says, well, hey, I reject you.
I mean, I assume there's different strategies.
And maybe you want to apply different strategies depending on your architecture. But is it typically I as the caller
am responsible for making sure
I'm not overloading an already unhealthy system?
Or is it more the other way around?
Even though I'm struggling,
I can still correctly tell my health state
and therefore I'm rejecting things.
What are the approaches here?
This is a very hard question, but a very good one.
From the backend side, so from the receiver side,
if you get into a war mode,
so if you get a lot of retry storms,
if you get in a state where you are being DDoSed,
it's basically your responsibility to protect the entrance
of the castle, right?
So you're going to do rate limiting, you're going to do load shedding, this kind of stuff,
right?
But of course, I love to see that responsibility also from the client side,
because at the end of the day,
it's the customer experience that is very important.
So if you look from a customer or from a client's point of view,
is how can you give back some visibility about the system
to the client so that it can help preserve
the infrastructure or the architectures
and knowingly degrade the experience.
I think it's much nicer to do it from that way
than to do it aggressively from the back end side
and trying to prevent aggressively being destroyed by your own clients.
So I think that's a good way to do it.
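As a rough illustration of load shedding on the receiving side, here is a minimal, framework-agnostic Python sketch that bounds in-flight requests and rejects the overflow with a Retry-After hint; the concurrency limit is illustrative:

```python
import threading

MAX_IN_FLIGHT = 100                      # illustrative concurrency budget
_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def handle_request(process):
    """Process a request, or shed it immediately when the service is saturated."""
    if not _slots.acquire(blocking=False):
        # Shed load instead of queueing forever; tell the client to back off.
        return 503, {"Retry-After": "2"}, "server busy, please retry later"
    try:
        return 200, {}, process()
    finally:
        _slots.release()
```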
Yeah. Well, I think if you can propagate the health state from the backend all the way to the front, and already tell the most front-end client, if that's the right word, to start backing off, that just saves a lot of pain throughout the whole architecture, right?
I mean, exactly. And especially, just take the example of a database, right? If you have a master and a read replica, and your master database goes down, your backend goes into read-only mode. You have to propagate this back to the client, so that it kind of moves the application into a mode where you don't
get the customer to try to write things. So for example, you wouldn't be able to update your
profile picture, or you wouldn't be able to write a status update if you would be a social app or
things like this, but you can still consume everything that is already being put into database. So these kind of things.
Or then you want to queue those requests.
Maybe some requests you're going to reject.
Some requests can be queued.
And that's the whole synchronous versus asynchronous.
That really depends on what you're trying to do.
If you have a bank transaction, you don't want to do that asynchronously; you want to make sure it's synchronous and that the system is handling it well, right? So basically you're going to go into a mode where your application won't let you do a money transfer if you have a master database going down, but you can maybe read something.
Yeah. So that also, I guess, comes in, I think, in the fourth part of your blog, where you talk about caching. And I assume when you talk about read-only, you could also say, well, I'm just reading from a cached version. And if we take the Netflix example, I think maybe you told me that, but maybe they don't turn the recommendations off completely, they just take a cached version. And that cached version might be at least, quote unquote, smart enough to show me recommendations for my geolocation, right? Like showing me, in this instance, European content because I'm from Europe. And then Brian, he would see content from the US.
And yeah.
Yeah, exactly.
I mean, of course, handling caches is not easy at all. It brings a lot of problems with cache eviction and things like this.
But you can at least use it to serve content that is maybe not that dynamic,
like the top movies of last month or something that
I don't know, there's tons of ways
to do that.
I think definitely cache plays a big
and important role in distributed systems
and should be used, but carefully as well, because it's like the RegEx as well.
It can create other sets of problems.
Very cool.
I know there will be a whole other topic probably around how we can test for it. Like, chaos engineering is another big topic of yours. But I believe, Brian, we probably want to invite Adrian back for a part two.
Yeah, I think so. I think we can fill up a whole other episode with you, Adrian, if you'd be willing to spend more time with us.
I'd love that. Yeah, chaos engineering is definitely my focus currently. So I'm traveling a lot and meeting customers
and helping them implement chaos engineering practices.
And that's super interesting as well.
Yeah, and it's something we haven't spent much time on either.
So I think it would make for a great topic.
So Andy, speaking of that though,
anything else or do you want to go ahead
and summon the Summarytor?
I think we can summon the Summarytor.
All right, go ahead.
So, Adrian, thank you so much, first of all, for again taking the time to kind of recap all the great content and stuff you've been working on for quite a while. Obviously, I was very happy that we met in Linz when you were speaking at DevOne. I think, what I took away from this session, there's always a lot of things.
I really like the, I don't know what the right term is, but the story of three.
So when you design a system for resiliency, always think about at least three nodes on the hardware side,
three services that can kind of, or also three, if it's possible, people that can kind of cover a
certain role.
Because I think what we also learned today is that resiliency goes through
the whole organization.
And that typically starts with people.
Absolutely.
And what I also took away from this chat is to version everything.
Coming back to the initial discussion we had
about making things transparent
and actually allowing other people
to see what other people have been doing
so that they can actually jump in.
From an architecture perspective,
when I asked you about,
so what are the things
that everybody needs to understand?
I believe to sum up the things
that we talked about, like timeouts, retries,
exception handlers, errors, and health check, I think that the thing that I take away most is
we need to start designing software for different service levels depending on the health state of
the overall system. So everything we do, every component we design needs to be able to adjust to a different service level mode, depending on what the overall state of the system is.
And we need to, I believe, shift left the health state, so that every service involved in the end-to-end transaction chain can immediately switch to a different service level.
And I think that's going to be a big thing
that differentiates people and organizations and services
if they can dynamically adapt the service level
on every level of their service chain
to deal with resiliency
Basically, I think that's what I took away.
Or at least on several levels, right? Maybe not all of them. Yeah, but at least a few levels that make sense for the customer. But hey, thanks a lot for inviting me, Andy and Brian. It's been such a pleasure actually to have this conversation. I didn't see the time pass, and that always means it's good.
Yeah, one last thing I just wanted to add briefly before we run is that, Andy, we've had a few shows on performance anti-patterns.
And if you recall, when we started looking at microservices,
most of the performance anti-patterns were basically
the same, just a microservice flavor of them.
What I really love about these concepts of the timeouts, the retries, and the exceptions,
although they apply to older monolithic things as well, they really become much stronger
anti-patterns.
I almost feel like we have some new anti-patterns that are becoming more common
in a cloud-native distributed kind of system,
which is, I don't want to say exciting,
but from a topical point of view
and cool things to look out for is exciting
because now there's these new things
where you can almost guarantee
nobody's looking at the timeout.
So when these situations occur,
you can be like, hey,
did you think about your timeouts? You know,
there's some of these new patterns
coming up that we can take
as quick hits to check
to make sure people are aware of
and addressing properly.
So anyway, that's all I got.
The low-hanging fruits, right?
Exactly.
Yeah.
The old ones used to be
N plus one query, right?
Or N plus one service calls between microservices
or the other ones, which still exist, right?
They're obviously still there.
But even when we moved from monolith to microservices,
they were kind of the same ones coming over.
We didn't have so many of these distributed system ones.
So it's cool to see some new ones finally entering.
Thanks again, Adrian.
This is... considering this is airing in August: are you going to be making any fall-season appearances?
Yeah, I'm going to... do you mean holidays, or...?
No, well, you can tell us about your holidays if you want, but I mean more like any conferences.
Oh, conferences. Yeah. Actually, I'm speaking tomorrow in Oslo about chaos engineering, but in August it's my holiday time. So when this is airing, I'm going to be on holiday.
Oh, perfect.
My family is coming over to Finland, and I'm going to see my niece and spend time with them, because I'm living in Finland, three thousand kilometers away. So we only see each other a couple of times a year. So it's going to be nice to have them over.
Great.
Awesome.
Well, thank you very much.
If people want to check out what you're doing,
uh,
you have some blogs on medium,
which we'll put up.
Do you also put things up on Twitter that people can follow you or any other
social media?
Yeah.
It's adorn on Twitter on medium can follow you or any other social media? Yeah, it's Adorn on Twitter, on Medium,
and pretty much all over the internet.
I think my days of privacy on the internet are over.
Great, awesome.
Well, we look forward to having you back. If any of our listeners have any questions or anything
or any topics they want us to cover, you can send us a tweet at pure underscore DT or an email at pureperformance at dynatrace.com.
Thanks, everyone, for listening.
You can follow me at Emperor Wilson, and Andy at Grabner Andi.
I got that right, Andy, right?
Correct?
All right.
It's been a while since we mentioned those guys.
So thanks, everyone, for listening.
Thank you, Adrian, for being on.
Andy, always a pleasure.
Thank you so much.