PurePerformance - Service Meshes: From simple load balancing to securing planet scale architectures with Sebastian Weigand

Episode Date: July 20, 2020

Whether you are still researching whether you need a service mesh or can simply use a load balancer, or you are already deploying multi- and hybrid-cloud architectures where service meshes help you secure the location-aware routed traffic: in both cases, listen to this episode! We invited Sebastian Weigand (@ThatDevopsGuy) back to our podcast, who wrote papers such as Building a Planet-Scale Architecture the Easy Way. In our episode Sebastian walks us through why service meshes have gained so much popularity, what the main use cases are, how you should decide whether or not to use a service mesh, and which challenges you might run into as you expand into using more features.

https://twitter.com/thatdevopsguy
https://files.devnetwork.cloud/DeveloperWeekNewYork/presentations/2019/scalability/Sebastian_Weigand.pdf

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and joining me as always is the Commissar himself, Andy Grabner. Andy, how are you doing today? I'm very good. Shall I now sing the English version of the Commissar or the real version? Well, you're Austrian, so sing the real version. Or you can even,
Starting point is 00:00:48 the one thing I was trying to fit in before was, if you go one country over, we also had 99 Luftballons as well as 99 Red Balloons. Yeah, that's right, Nena. I'm sure it was very big in your country as well. So people probably have no clue what we're talking about right now. Probably not.
Starting point is 00:01:01 Probably not, yeah. Except for some of the older people. They might know those songs. Yeah. No, no, it's, uh, Der Kommissar, right? Did they also translate it? How did they write Kommissar, with a K or with a C? I don't know, with a K, I think, at least. I can't look it up right now, but you know. But in case people are familiar with the song Don't Turn Around: the English lyrics of Don't Turn Around, that's not the original version.
Starting point is 00:01:30 The original version is, And I will stop singing now because I don't get paid for that. And he also did Amadeus, Rock Me Amadeus, right? He was pretty big. David Hasselhoff, though, I think was bigger than... Who was bigger over in that area?
Starting point is 00:01:48 Was Hasselhoff bigger, even though he wasn't from there? Because I know Hasselhoff got really, really big over in Berlin. Overall, Falco is our hero. And he had some big things back in the 80s. So, yeah. All right. I think this podcast today is about Falco and Austro-pop. Uh, isn't there some kind of way you can segue from that? Come on, Andy, you're really good at these obscure segues. How do I segue from Austro-pop from the 80s to, uh, service meshes? Let me figure this out. I know, I know. Just go. You better go. Yeah, we'll just go. So, I brought up the term service meshes. Well, service meshes is a big topic, right? And it was actually suggested by a previous guest, Sebastian, who is, uh, back on the show today. Sebastian, are you there? Yes, thanks for having me back. It's great to be back. How good are you with lyrics of old Austrian songs?
Starting point is 00:02:58 I mean, I can Google it with the best of them, I suppose. You don't have that on your resume? Not really, no. 15 years experience with Kubernetes, and I know all the Falco songs. Um, no, but that would be an interesting role that that would be a requirement for. I can't imagine what it would look like, but it would be fascinating to watch as a bystander. You should check out the job descriptions on Dynatrace. Hey Sebastian, last time we talked, we talked about a topic that is obviously still hot: SLO adoption and usage in SRE. You gave some great examples on how SLIs, SLOs and SLAs, you gave the whole definition, can help an organization, especially, I think, in a cloud-native world. I mean, not only in cloud-native, but I think that's obviously where we see it more and more. And then you said, you know, it's great that we talked about this, but there's a topic that is really, really hot these days, and it's service meshes and everything that we can do with service meshes, like multi-cluster architectures. Service meshes can do a lot of things. And I think I know what a service mesh is. I think I know what it can do. We had episodes more than a year ago with Matt Turner, who talked a little bit about Istio.
Starting point is 00:04:19 We also had Alois Meyer, one of our technical product managers, talking a little bit about how we monitor service meshes like Istio. But you proposed a topic, so I assume you know much more than we've learned so far. So I would really like to push it over to you and say, what do we need to know about service meshes? And especially what is new? What's upcoming? What do people need to think about when they think about service meshes?
Starting point is 00:04:45 Yeah, no, that's a good question. I think it's an interesting topic just in general, because it's almost kind of daunting if you're a sysadmin or operator in the modern day and age, right? So people are telling you like, well, hey, we got to move out all of this stuff into microservices. We want to containerize stuff. We need to have a cloud native approach to the way that we do infrastructure design, application development, DevOps in general, you know, the whole shebang. And then there's this like other thing that's sort of existing in parallel, which is this concept of a service mesh. And there's a handful of different service meshes out there. I tend to have more familiarity with Istio. I think it's a little bit further along in terms of its feature set and just its maturity model. But it's interesting
Starting point is 00:05:29 that, you know, right, you know, as soon as people feel like they got to the cloud or they got to Kubernetes, now it's like, oh, cool. Now redo everything inside of a service mesh, because apparently that's like the newest hype du jour that we have to we have to implement. But I'm here to say that it's actually not it's not hype, right? It is actually a really interesting and really awesome set of tools that enable some really interesting and awesome patterns that you will see in in like where the cloud computing market and where sort of enterprises are going today. The big ones are like multi-cloud and hybrid architectures is really powered by the concept of a service mesh. I mean, there's ways of doing it outside of it, but then it's kind of janky and it doesn't quite work the way you would think it would work. And there's a lot of extra glue that you need to be able to provide in order for that to actually work properly. Whereas if you leverage a service mesh, you get so many advantages with not a whole lot of downside, which is pretty cool.
Starting point is 00:06:30 Again, the biggest issue is how do I manage it? How do I spin it up? How do I install it and maintain the thing? And luckily, the industry has now sort of turned a corner. Not only can you get really great managed Kubernetes engines from your provider of choice, but you can also get managed Istio, or managed Istio-like componentry, on top of things as well, which is really cool. So I think we're at a really awesome time right now to start taking a look at these things and to really start using them to their fullest extent, which is kind of fun. Um, it's been a while, um, since, I think, the last time you chatted about service meshes, so, uh, would it be helpful if I gave sort of a quick refresher on what a service mesh is, how it works, that sort of thing? Yeah, that would be great. Yeah, especially the main use cases, right? I mean, uh, I think we have an understanding, but as you said, a lot of things have changed in the last year or so, so yeah, a quick recap would be awesome. So, a service mesh, if you want a great definition, here's one that we've got: it's a platform that provides a uniform way to connect, manage, monitor and secure services. And that's kind of interesting. So you have this concept of connecting services, of managing the services in terms of the way that they work, monitoring the services and also securing the services.
Starting point is 00:07:57 So there's like kind of like four different big things that you kind of get as a result of that. Now, you might say to yourself, well, like I can manage services just fine inside of Kubernetes, or I can manage services just fine in my thing that exists that is not inside of Kubernetes, because technically a service mesh doesn't require Kubernetes to operate. It just tends to go hand in hand with that just by nature of it being a distributed system and Kubernetes offering primitives to run distributed systems. But if I were to ask the Kubernetes administrators out there, a really simple question with respect to services, if I said, you've got an app, let's say, and you've got service A that talks to service B, right? And you've got a bunch of other services that also talk to service B, and you've got a couple that service A talks to that aren't service B. So A might talk to C and that sort of thing,
Starting point is 00:08:48 you know, standard everyday microservice type thing. You know, they all talk to different backend services in different ways. If I were to say, what's the total throughput or the total amount of bandwidth that's being used when service A talks to service B, how would you answer that question? Like, think about that for a second. That's kind of challenging because while you can get metrics out of a system, you don't necessarily have service level metrics. You can get like pod level metrics, or you can get maybe application specific metrics, you know, like something, something Prometheus, you know, something like that, but not really from the service perspective, especially important when it comes to like dynamic services that will come and go and scale up and scale down as needed. Another great use
Starting point is 00:09:34 case would be something like Canary deployments. We talked about that in our, in our last episode together, where we talked about SLIs and SLOs, we had mentioned like, there's some really interesting things that you can do with, with Canaries to be able to get feedback from that and then be able to release things. And then once you're really good with releasing things, then you can migrate the different versions up and people can have an uninterrupted experience. If you want to leverage Canary deployments though, right now, how do you ensure that a specific percentage of traffic goes to the new version that you have versus the old version? Well, right now with Kubernetes, it's a little bit more, I don't know, it's not super polished. You'd have to just make sure that your replica count is the correct proportion larger than something else. It's kind of weird.
Starting point is 00:10:21 It's kind of complicated. A service mesh brings order to this chaos, right? It provides that network level functions to your application. And that works, it has a variety of different benefits that you've got on top of that. But those are the sort of the highlights. And I'm going to talk about a couple of the different benefits that we've got coming up here. But does that make, does that track in terms of everyone's understanding of a service mesh? Or did I say something that was weird, complicated, not in line with what you were thinking? No, I think from my perspective, great. But I have a question right to the example that you just brought up with the canaries. So does a service mesh then typically, and I know you know Istio
Starting point is 00:11:04 pretty well, does Istio then also not only take care of correct traffic routing but also scaling of the pods so that means i configure i'd say is still i want to run 80 of my traffic on my main canary and 20 on my new canary is still then automatically figuring out how many pods I need in order to do this? Or how does this work? Ah, see, that's really good. So Istio takes care of the traffic component, but it lets your underlying Kubernetes infrastructure take care of the scaling component, which is pretty cool.
Starting point is 00:11:39 So let's say you have an arbitrary number of pods on the backend. It could be just one of each, let's say one on version one and one on version two. And you want, you know, 90% or 80% of the traffic to go to one of those two. The way that Istio works: because all of the traffic that exists inside of the service mesh has to go through Istio's primary controllers, in other words, the gateway that you establish, it knows that every packet coming in is destined for a specific service that's under its jurisdiction. So it sort of acts as a gatekeeper, and then it figures out how much traffic should go to whichever different backend deployment. So in that circumstance, as more traffic comes in,
Starting point is 00:12:24 Istio will always maintain that exact proportion that you're looking for in that traffic splitting scenario that we talked about, and then it will deliver traffic to those appropriate pods. If those pods end up getting overloaded or need to scale because there's a lot of traffic coming in and you have the appropriate scaling principles in place, you've got good custom metrics and things like that, then Kubernetes will scale them in and of itself. So they're sort of like non-overlapping problems,
Starting point is 00:12:52 but at the same time, the ability to have all of the traffic come in be so expertly controlled, that's something that you only get with something like a service mesh, which is pretty nifty. But wouldn't it make sense? something that you only get with something like a service mesh, which is pretty new. So but wouldn't it make sense? Kubernetes by default would scale based on resource shortage, right? Let's scale up in case, let's say my Canary 2 is seeing so much traffic and now I'm spiking
Starting point is 00:13:18 CPU or exhausting CPU. Now I'm scaling up. But wouldn't it make more sense because Istio has all this great information to scale based on maybe other metrics like, hey, response time is going up, or we see an increase in failure rate and therefore we probably need to scale. Wouldn't it make sense to take some of these metrics that are specific to a service or to a canary, and then also take this into consideration, then maybe trigger something in Kubernetes and say, hey, we need to scale up? Yeah, and what you've just described is the concept of a custom metric that you can scale on inside of Kubernetes,
Starting point is 00:13:53 which is super awesome. So the horizontal pod autoscaler, by default, just scales on CPU load. It's not a particularly good metric. Usually there's a standard linear correlation with standard linear correlation with like the number of requests that are being processed and the amount of CPU that's used to process those requests, but not always. And that's not really the best SLI. Remember we talked about that before to, to go off of because the CPU is sort of tangential to the, uh, to, to the heart of the matter of the
Starting point is 00:14:24 service. Like I don't care how much CPU is being used. I care whether or not those requests are coming out and working appropriately. Just like we talked about before, like, I need to make sure that, like, let's say the response time, like, let's say that's a super critical thing that I need to ensure that my production service meets because I have this SLO or I even have this SLA because, you know, a B2B or if I'm providing like a software as a service or something like that, I have to make sure to hit that. Well, wouldn't it be great if I can scale based on these types of metrics? And because Istio understands these
Starting point is 00:14:57 types of metrics intrinsically, it provides mechanisms that tie into the horizontal pod autoscaler so that you can come up with better custom metrics on which to scale. So that way you can say the number of requests per second coming in or the average latency or the response time or something along those lines, whatever it is that you want to key off of, you've got this nice like duality between metrics that are produced inside of the service mesh and then your ability to do something with those metrics or take actions, which is super powerful, especially when it comes to production services
Starting point is 00:15:31 that span potentially multiple backends, which is another interesting topic. Yeah, cool. I know you said you have a lot of other stuff to talk, to tell about, but I think another use case, and correct me if I'm wrong, but this is also where service measures are great in when we come to fault injection. I think I've heard this right from chaos engineers
Starting point is 00:15:53 using service measures like Istio to say, let's inject 10% fault and see how systems behave. So I assume this is one of the other use cases we see upcoming, especially as chaos engineering is kind of growing in popularity. Absolutely. So in addition to being able to provide the security aspects and the observability aspects, you can also do a lot of traffic control stuff inside of your service mesh. And this can be something as really cool as chaos engineering and fault injection, but it can also be just like standard best practices like rate limiting.
Starting point is 00:16:31 You know, if you've got an incredibly, let's say, noisy or demanding client that's accessing these services inside of your mesh, then Istio can kind of say, okay, well, I know that this backend service can only do X requests per second. So while all the requests are coming in, Istio can kind of say, okay, well, I know that this backend service can only do X requests per second. So while all the requests are coming in, Istio will kind of buffer them for you and then make sure that that service that's, let's say, overwhelmed has a chance to kind of back off. You can also do things like circuit breaking. If a backend isn't working well in, in, in some definition of well, then you can, you can fail quickly.
Starting point is 00:17:07 And that way you're not overwhelming other services that might be sort of downstream of things. But the fault injection is super cool because of course you can do things like, you know, the proverbial or the, the, the analog to unplugging the cord and seeing what happens to some of the
Starting point is 00:17:23 packets. Usually it'll just like retransmit and it's fine. But the really cool thing you can do with Istio is you can do latency injection, which is really interesting. So rather than simply just like, you know, drop every fifth packet, which you should still do to make sure that your services can recover from that. But most of the time that's sort of handled, right? Like, you know, TCP is fairly good at dealing with that. But what if we just increase the latency by like three seconds, every random distribution of packets that come in? Like what happened to your service then? What happens if like you're sending a bunch of metrics or if you're sending a bunch of data or you've got, you know, like a message queue or something like that, where a message just takes really long to be delivered,
Starting point is 00:18:09 especially when you have a lot of stuff running in parallel, like you might get something delivered out of order. Like how does your application respond to that? It's really important to take a look at those things because you'll probably run into that in production and it's better to catch it beforehand rather than be reactionary after something you know goes down and you get paged at two o'clock in the morning because no one likes that and all this is obviously possible i think envoy is used as a proxy correct yep yep pretty nifty stuff uh it's really high performant proxy really works well very super programmable it's an awesome component of the istio and now also correct me if i'm wrong but the way that the magic behind the scenes is that you are
Starting point is 00:18:53 or east or any other service mesh probably similar is injecting itself into the containers to control in going and outgoing traffic right so that you can actually route it automatically without having to change the code of your app or your service to the proxies, right? Yep. It does that through this interesting concept of an admission controller and this really fun way of essentially rewriting whatever it is that you've requested. So this is where it's kind of interesting because Istio can work outside of, let's say, Kubernetes, but it makes the most amount of sense and the most amount of automated sort of workings that you get sort of out of the box if it has this nice, tightly coupled Kubernetes integration. So as a developer, I can just focus on making sure that my microservice works. And then when I say, go run this, Kubernetes intercepts what I'd like to run, rewrites it a little bit, and then adds the appropriate Istio components, and then attaches that sidecar that has that Envoy proxy into it.
Starting point is 00:19:57 So it's running another container inside of that pod and then wires everything else up in the back end so that traffic flows through Istio first, the Envoy proxy second, and then eventually your app third. Hey, can I ask a question? This is kind of a weird one because it dawned on me when you mentioned something about, you know, there's a lot of programming
Starting point is 00:20:19 you could do to Envoy. When listening to this, a lot of it sounds like you drop in Istio, it all runs perfect and great, and you just have to tell Kubernetes to do it. But how much in reality, right? I'm sure there's a lot of out of the box that it just works really good with it. In reality, though, how much maintenance, how much, let's say, old-fashioned style sysadmining do people have to do with Istio and going in and controlling configurations and settings you know completely unrelated but you know i'm thinking like just
Starting point is 00:20:50 the classic jvm right you got to set your memory settings threads all this kind of stuff is there equivalent kind of configurations and things that you tune performance tuning things that you have to really pay attention to on this do and do enough people, if there are, do enough people know about that? And is there any good practices around that? Yeah, that's a really good question. So it's the case of pretty much every sort of open source product where it's, hey, here's the ingredients, good luck.
Starting point is 00:21:20 And then you're kind of looking at it like, okay, this is going to be interesting. Luckily, because we keep moving up the abstraction stack, right. You know, we don't really concern ourselves with like big VMs anymore. Now we concern ourselves with containers and we really don't even concern ourselves with containers. We have, you know, Kubernetes package management and stuff like that. You know, you have like a helm chart to install something that's really complicated. You get to a point where a lot of the basics of what you would otherwise do with systems administration has been codified and is available to you. So for example, how do I scale the ingress that it uses, like that big gateway that Istio takes advantage of?
Starting point is 00:22:06 Well, it turns out that you scale it the same way that you would scale anything else. It has the appropriate resource requests, limits, and requests put in its definition. So if you take a look at some of the YAML that exists after you install Istio, if you just start going and taking a look at what's running in your cluster, it'll come back with a bunch of data and it will say like, this is how we scale this. This is how we add additional pods and so on and so forth. So from that perspective,
Starting point is 00:22:32 there's not a whole lot to do. You still have to maintain the Istio components. So like, you know, if there's new versions, you want to make sure that things get upgraded. If the, you know, the database schema changes, there's probably going to be an upgrade path and so on and so forth. A lot of that, though, is relegated to whoever is managing your Kubernetes environment or your Istio componentry built on top of it. So realistically, I would say, yes, there's a bunch of stuff that needs to be done. But depending on how much you want to take for yourself versus how much you want to leverage managed services, it's really not something to really concern yourself with, especially
Starting point is 00:23:13 if you're like a mom and pop shop or even a really large scale multinational conglomerate. You just need to focus on writing applications and then letting somebody else take care of all of this other stuff. It's kind of the same thing. Like, you know, we used to have a lot of, you know, DBAs that would focus on optimizing and performance tuning for a lot of, let's say, you know, like MySQL databases. Well now, I mean, if you're doing MySQL by yourself, then sure, you need to maintain that. You need to tune things in the kernel and so forth. But if I just go to a cloud provider and I say, give me a MySQL database, they've already done all that tuning for me. And I pay
Starting point is 00:23:49 a slight fee on top of whatever the cost of running the VM or the storage or the networking or whatnot would be, so that I don't have to think about that. So from a business perspective, it focuses you back on your applications that are actually providing business value as opposed to being really removed from the system and being super inefficient where you're dealing with all of this extra stuff like network functions, rate limiting, security, stuff like that. Like you shouldn't have to be focusing on this. That should be something else that's dealt with.
Starting point is 00:24:21 And then the thing that you're running it in should also be dealt with automatically. It's funny because a lot of this kind of sounds like it's moving back to where Cloud Foundry was trying to get to. Cloud Foundry doesn't seem like it picked up as much as it could have because it was a very opinionated platform and everything was set. You literally would just push your code and they were responsible for making sure that system ran well by designing that system. And then on the counter side, there was the complete DIY side of, let me set up my Docker, Kubernetes, all these pieces, but
Starting point is 00:24:53 now we're starting, it sounds like some of this stuff that's been figured out is being pushed in like Istio is doing some of the opinions for you and for the most part it's going to run as intended it's only going to be it sounds like it also it's only going to be in some more the edge cases where you're going to have to dive in
Starting point is 00:25:13 and tackle it as opposed to doing full network maintenance on your own which would be the other extreme interesting okay cool but of course it's important to remember that like your mileage may vary and anything you're introducing like a shim between all of your network requests Okay, cool. But of course, it's important to remember that your mileage may vary. And anytime you're introducing a shim between all of your network requests, you're going to want to do your appropriate load testing and things like that to make sure that you figured out how to scale it properly. And if you're doing advanced features like rate limiting, for example, inside of Istio, well, obviously, a cluster that has rate limiting enabled and a cluster that doesn't are going to behave very differently depending on how you load test them.
Starting point is 00:25:46 So there's still edge cases, or not necessarily edge cases. There's still testing and things like that that you would do. But I sort of take the stance that if you're doing DevOps properly, you're going to be doing that, oh, we need to adjust some Istio policies as opposed to, oh, we need to adjust some other thing or create some library that our coding language will support to basically implement that same functionality anyway.
Starting point is 00:26:16 So it's still sort of taken care of in terms of your standard testing. Just don't copy and paste from Stack Overflow and you're fine. Well, it depends. If it's Guilty High Tower writing a bunch of stuff, you'd probably be okay, but if you can't verify the source,
Starting point is 00:26:32 then be careful. I got two more questions, and the first one, the second one is on observability, but the first one is on if you have a large Kubernetes cluster and it's shared by different teams deploying different types of apps
Starting point is 00:26:47 that they want to get separated. They want to make sure things are separated. Yet there's Istio on it. And my question is, is Istio always a shared component on the complete Kubernetes cluster? Or can you also run Istio in, let's say, multiple instances of Istio
Starting point is 00:27:05 so that maybe version one takes care of these namespaces and version two takes care of the other namespaces? Instance one, instance two, to make sure that you have a clear separation. Or is this something where I'm completely missing something here and it's not a requirement at all? So this is an interesting question because the way that you're asking it presupposes the existence of a singular cluster, which while we can answer that question very easily, you run into
Starting point is 00:27:36 a little bit more complicated scenarios when you want to have more than one cluster and you want to have the same sort of controls that are rolled out across the board. So I want to have more than one cluster and you want to have the same sort of controls that are rolled out across the board. So I want to answer the question from both perspectives, right? So on the one hand, Istio is, you know, if you want to boil it down, it's a series of programmable network functions that allows you to not have to worry about any of the things that fall under its jurisdiction inside of your application code. A great example of that would be like, how do I enable authentication and authorization between a bunch of different microservices? How do I enable encryption?
Starting point is 00:28:15 Well, you could if you had a big monorepo, you could have every service inherit from the same code base that just imports the appropriate encryption library and go from there. Or you could let Istio do it for you, in which case all of the fantastic certificates and MTLS configurations will just get done for you. And then you're, you're, you're good. You're off to the races. Now, what's, what's funny about this is, um, when we have different namespaces, because we want to have fewer clusters in general. It's always better to have fewer clusters than a lot of random clusters out there. And there are different ways of achieving multi-tenancy, even amongst like other,
Starting point is 00:28:52 like software as a service providers and things like that. There's still ways of being able to do multi-tenancy in a singular cluster. But realistically, you'd want Istio to have access to all of the different namespaces because that's how you enable things like micro segmentation. So you might have thought like, well, how do I do like firewall rules to prevent, let's say, you know, team A from talking to team B? Not necessarily because we you know, they don't you know, the teams don't get along and they just hate each other or something like that. But more like, you know, team A is let's say a billing service
Starting point is 00:29:28 and team B is a, you know, a payment microservice, you know, framework or something like that. Like one does processing, the other one does like, you know, calculations for bills and things like that. Well, those should be able to talk to each other, but the front end, which is like another team doesn't really need to talk to like that specific billing backend. It just needs to put a request in once like the shopping cart service, let's say, does something. So in those circumstances, by definition, you want to be able to tell an overlay component that this should talk to this and this shouldn't be allowed to talk to this other thing. In which case, that thing that needs to enforce it, in this case Istio, needs to be aware of all of the things that are running inside of your
Starting point is 00:30:09 cluster. And it needs to be able to put up those barriers to either enable or disable communication between them, which is kind of at the heart of the question. Now, I said that this gets a little bit more complicated when you have multiple clusters. Now, that's interesting because when I talk to a lot of customers and I talk to a lot of people that are rolling out really large scale infrastructure, it's kind of fascinating because everyone that I talk to, because people are sort of like in the cloud now, I think the state of enterprises now is they have a cloud strategy, which is great. Five years ago or so, people were developing a cloud strategy. But they're still very regional. It's still very like, well, we're doing something in US East or US West, or we've got EU1 or something like that. But we're not thinking broader. We're not thinking bigger. Disaster
Starting point is 00:31:02 recovery is still a, I know, I push a button and then I deploy a bunch of infrastructure and then I, you know, I can recover. They're not thinking planet scale fault tolerance just yet. And it's very easy to be able to do that when you have a service mesh, which runs on top of multiple clusters or multiple pieces of infrastructure, because now you have the concept of you know quote-unquote production existing in more than one cluster and you have to bring order to chaos because you want to say that my production services need to do the following things i have the following slos that need to be maintained i have the appropriate slis in place to gather
Starting point is 00:31:41 whether or not or gather information on whether or not the service is operational. And how do I take, you know, take actions to to get that back to that known good state that abstraction stack so that you're not beholden to a single piece of infrastructure, a single region, a single VPC or a single network even. And that's where something like environs come into play, where you can define that this is my production cluster or excuse me, my production environment, which consists of multiple clusters. And then all of these services need to be able to talk to each other. So now it's a case where not only do my services need to communicate with themselves inside of the cluster, but it's possible that one of those services might be in production, but might exist, let's say on prem. So how do I make sure that those microservices up in the cloud can talk to those microservices on prem? And how do I make sure that those microservices up in the cloud can talk to those microservices on-prem? And how do I make sure that services that I don't want talking to each other can't talk to each other regardless of where they're deployed?
Starting point is 00:32:54 In which case, you have to have this centralized, I say centralized, but it's still distributed, but this one place where you define all of these things and that becomes your service mesh now how that's implemented there's a handful of different strategies and there's tons of different slides that talk about how to do control plane components and whether or not you want to duplicate the control plane components or if you want to have them all in one cluster and yada yada yada the the end goal though the only thing that i care about is that i have a singular mesh that spans all of these different back- and backend infrastructure so that I don't focus on infrastructure. I focus on environments, which are, this is really what I care more about. And cool. So that means if I understand you correctly, you can install the service mesh,
Starting point is 00:33:42 the control plane, at least somewhere, whether single or fault-tolerant, all this is possible. But you have execution plane components in every single cluster. One or multiple clusters belong to an environment. But obviously, and I assume this is also true, that when you have multiple environments that are distributed around the globe, that you always have something like a location awareness so that Easter is smart enough to say, Service A needs to talk to Service B and let's make sure they're geographically close, talk to each other, and not that we're sending packages all over the globe and things like that.
Starting point is 00:34:30 Exactly. Yeah. And that's, it's really interesting because when you have that infrastructure in place and you have like global capabilities that are provided either from your cloud provider or from, let's say like, you know, a CDN or like a large scale networking provider or something like that, you can do some really interesting like network flow type optimizations or like sectioning, or you can have cross region communication, or you can disable it depending on what you'd like it to do. So for example, you know, if I have this global network of stuff that I have this massive service mesh that covers the planet, and I'm in the US and I'm connecting to uh you know this back-end service you know i'm just going to you know example.com let's say you can have it set up
Starting point is 00:35:09 such that example.com will resolve to an ip address and that ip address could be an anycast ip address that would basically deposit you on whatever closest cluster happens to be near you based on whoever's providing the network infrastructure's technology. It's like geolocation-based sort of thing, that sort of thing. Or you could have it say, well, this particular service, please route me to the closest available backend, but then I have restrictions. So for example, it might be the case that let's say we don't want, let's say a European customer to, or a European end user to connect to this global network and then be routed outside of, let's say the EU, maybe for, you know, data privacy regulation
Starting point is 00:35:59 or something along those lines. So what's important is it's super programmable. So you can enable whatever it is that you would like to enable and you can disable whatever it is you would like to disable. But the one thing to kind of take away from this, like if you were to boil down all of the stuff I've been ranting about here, is that you have a singular interface that governs your entire production environment that doesn't care about infrastructure. Infrastructure is sort of a separate thing. So you just define services, you define the policies that govern the services, and then you let the mesh take care of the rest and let it communicate with the underlying components. And then you're all set all set you're good pretty cool um i i'm what we should do i
Starting point is 00:36:48 know you said there's a lot of uh slides and presentations out there we should make sure to get some of the links and put it into the podcast proceedings into the description so people can follow up yeah that would be great yeah absolutely. I got one more question. Now I want to go to observability, and I'm sure more questions will spark as we go along. But as far as I understand, basically Istio sees traffic between two services and therefore knows exactly who is talking to whom. But how about distributed tracing? Is it possible for Istio to actually follow transactions
Starting point is 00:37:27 across multiple hops or is this just i don't think it's possible right because you cannot put a trace tag on one entry point and then it gets automatically propagated through the different layers but i wanted to see if if if, if something is in the works, if it's possible, if true distributed tracing is also something that service meshes can, can provide. It can actually, which is pretty cool. Um, what's, what's important to remember with service meshes is it's not just East West traffic, but it's also North South traffic, right? So if I have something that's coming into the service mesh, which again, we talked about has a variety of different backend infrastructures, including things that are not clusters. You can just pop an Envoy proxy on a mainframe and then add it to the mesh.
Starting point is 00:38:15 And that way you've got that governed under the same policies that you have. And as a result, any of the traffic that's coming into the mesh can then have the appropriate, you know, tags and whatnot applied to it so that you can actually plug in whatever distributed tracing framework you would like to plug into it, which is pretty nifty, including, you know, standard open source stuff or, you know, managed offerings from a variety of different service providers. I'd imagine you have your preferred service provider to provide observability insights and things like that. But out of the box, it works with things like Zipkin, Jaeger, or whatnot. So it provides this type of capability. And Envoy has the capability of essentially adding the appropriate tags and things like that to it so that you can then pick up and then
Starting point is 00:39:06 go do something. Keep in mind, though, that like it's network function specific. That said, it's producing a bunch of different metrics that then can go into a variety of different backends. So this is where it gets like kind of complicated because Istio does a lot of different things and then other things can be built on top of it that can also plug into other things, for example. So if you need to do something that modifies the way that the network works, you have to have Envoy capable of being programmed to be able to inject that. So for example, like headers and things like that, like you need to be able to put like span context or trace IDs or something along those lines inside of it. But at the same time,
Starting point is 00:39:48 if all of that data is being fed somewhere, then you need to be able to essentially aggregate that, right? So it's kind of like a map reduce operation. You have to map all of the different things that have this particular span, but then you have to have something that processes it to be able to provide some meaningful insights. And that's where another part of that.
Starting point is 00:40:08 So your distributed systems tracing framework would have the ability to then pick up all of this information and show you or present the traces and spans to you in a way that makes the most amount of sense. So you need to implement it, but then you also need a way to present it of course so istio provides the implementation specifics but then you need something else that can read the data that's that's uh inside of the the mesh to then present stuff which is pretty cool yeah so i mean and again i think you confirm what i thought in order to do true distributed tracing you would need to instrument your individual services with you know some open source library or a commercial offering to truly get end-to-end tracing. So to take the tag that was put on the request by Istio and then, you know, take the same tag and basically push it out at the back end when you make back-end calls to get the tracing.
Starting point is 00:41:00 I thought maybe I don't, you know, maybe Istio has found a solution for that as well, but obviously it's not possible because, as you said, different technologies and you cannot just pass a tag through the different runtimes that we use to build our services with automatically. Yeah, cool. Exactly. Yeah. So you still got to have a backend service to receive all of these things. So one, you configure it and the other one, you then present the data and then you're good. Perfect. And yeah, you mentioned there's obviously a lot of standards and we are also proud on the Dynatrace side that we are part of OpenTelemetry and supporting the W3C trace context and all that stuff. So that's great.
Starting point is 00:41:41 Talking about standards, I believe there is some standard around service meshes or service mesh interface. Can you talk a little bit more about this, or is it just Istio everywhere? Well, so it depends on how does one define a service mesh, right? How do you actually say these are all of the components that as long as you check these check boxes now we have a service mesh like what does that look like and not only that but how do you then provide additional insights or feature capabilities on top of things right so like um let's say you just have a bunch of envoys sitting around theoretically that could be a service mesh like you're all set you're ready to go as matter of fact, a lot of people end up programming envoys without using the rest of
Starting point is 00:42:28 Istio. They just write a bunch of envoys, they deploy them, and they say, right, now I've got a service mesh. I'm pretty good. But then you run into the same sort of backend issues in terms of an automation perspective that would lead you to build something that kind of looks like Istio if you squint hard enough. Like for example, well, I don't really want to have to program all of these envoys manually for every service that I deploy. And on top of that, I really don't want to have to inject them or provide them with the appropriate certificates so that I can get security out of the box. So I should really build a thing that does that automatically. It's like, well, congrats, you've just like reinvented Citadel, which is one of the components inside of Istio.
Starting point is 00:43:11 Or I really need to make sure that all of the envoys that I have out there are programmed appropriately. How do I do that in a sane and rational manner? Well, I need to have something that understands all of the routes that we have and the way to get from one cluster to another cluster. If I have a multi-cluster match, well, congrats, you've just reinvented essentially what pilot does. So you're going to have some subset of functionality that is going to exist that is good enough for whatever usage that you, whatever use case that you have, but at the same time, be cognizant that like,
Starting point is 00:43:44 you're probably not the first person who's going to go down that path. And as a result, people that have spent a really long amount of time getting Istio to where it is today have already kind of thought about these things and have actually changed the capabilities that the mesh can provide and all of the requisite technology that goes into it and all of the stuff that it plugs into. So for example, is it a requirement that a service mesh meet, like be integrated a hundred percent with open telemetry? Well, I would hope because that's like the goal of the open telemetry project, but at the same time, you know, Istio predates it. So would you consider that to be a requirement of the service mesh or back and forth or whatever this tends to take a look like?
Starting point is 00:44:30 The bigger thing would be you want to have something that you can define that operates, that figures out how to manage traffic across a variety of different backends that's super pluggable. So whatever that looks like, I would say is the best, I don't want to say standard, but the best implementation strategy for something like a service mesh. The really fun part about is there's a handful of different foundations that are out there that are all doing some interesting work in this space. So like the Cloud Native Computing Foundation, for example, the Continuous Delivery Foundation, all of the different foundations that are kind of like Linux Foundation. And as a result, like Istio, I believe, is going to be donated over to one of those foundations. I think it's like an offshoot of them, which would be kind of cool. So once it's over in that type of foundation, then you can create
Starting point is 00:45:33 the appropriate standards body that says any one of these backend components can be implemented. But keep in mind, it's also a system of systems. So like, can I do Istio without using Envoy? Well, theoretically you could, you just need a thing that speaks the APIs or that can read the custom resource definitions that you have to implement the same type of functionality. So there's a lot of different moving parts and there's a lot of different standards
Starting point is 00:46:02 behind sort of all of them to kind of like build all of that stuff up. It's confusing. No, I was also referring to the service mesh interface. I think that came to mind. At least I think there is a spec out there that talks about, you know, what does a service mesh look like? And I think there's also the SMI spec IO page. And that's kind of what i was trying
Starting point is 00:46:26 to get to figure out oh yes there's something like this and that you know when people decide yes we obviously need service meshes but do we need to if we decide on easter today and or let's say we decide on something else today and we implement things do we lose everything if we switch over to another service mesh or is there some type of standardization that makes it easy for users to switch over gotcha yeah um so like the the smi spec uh would be one step but keep in mind that it's a system of systems so there's like a lot of other random uh specs that need to be a part of it beyond just like the service mesh. So like, yes, you want to have something that implements the appropriate like service mesh, like type logic, but at the same time you want to have things that implement, uh, all of the components that it needs to be able to run on top of. So like there's more than one standard that
Starting point is 00:47:21 needs to be kind of present in order for this to work. Great use case would be like a gateway interface. So like service meshes, one of the big things that they have is like a common ingress and a common egress point that are provided as part of the standard sort of bootstrapped installation of Istio. Now, in that case, we have this weird thing inside of Kubernetes called like an ingress, which is kind of fun. So as opposed to just sort of a standard that exists like from a governance body, how about like a standard that exists inside of an actual like API? So you get down into the nuts and bolts of like what the CRD should actually be, which I believe SMI actually does. It gets down into the weeds of that. But in that circumstance,
Starting point is 00:48:06 now we've got this weird, like we have this concept of like a load balancer. We have the concept of something like a gateway, which is not quite the same thing. It's plumbed, but it's not really plumbed in the same way. And then ingresses are powered by ingress controllers, which I guess would be kind of like what Istio provides. But at the same time, we have this weird concept of an Ingress that's not maybe L7. So then you get into this weird, like, we really should have a different interface for some of these components. And what's really cool about it is because it's all open and because anyone can sort
Starting point is 00:48:41 of hack on it, right now we're in that like beginning phase where like a bunch of open source hackers are sitting around trying to figure out like what's the best way forward and we're introducing new concepts and then taking away other concepts and then merging other concepts all into one interface so and i think that that's like the best of of open source right like you move super fast you're able to release really awesome software, but then you have that like Redux phase where you're like, okay, let's do a sanity check and make sure that we're on the same page. So for example, like the entire motivation behind open telemetry was just, we have way too many different ways of exposing like metrics and information.
Starting point is 00:49:21 We really ought to standardize this in one common set of APIs. Same thing with open policies, because you can have and enforce different types of policies inside of Istio. We really should have one common policy framework that everything sort of derives from, and we should have one common whatever framework to do something else. And then as a result, you'll start to see these different types of standards bodies come up with their definition. Now that, let's say, the pioneers, and I would consider Istio to be one of the front runners because it's just been in development for a really long time. You kind of take a look at where it is now, and then you derive the standard that anybody else can implement and then everybody ends up implementing that including let's say istio like they'll redo things to be conformant to the standard that they helped create so it brings everything nice
Starting point is 00:50:15 uh full circle and and together in the circle of software development life cycles cue the fun music hey um i want to kind of you know i know that we can probably chat a long long time about this because the more you talk the more questions come to mind but i want to end it with with one question which let's see how long how far this goes but a lot of people are, you know, starting with Kubernetes. And I think at some point they have to figure out, A, do I need a service mesh or can I just do it with, I don't know, some load balancers that can also route traffic or is a load balancer and service mesh anyway. But at what point do I seriously need to look into service meshes? And, and this is the big question,
Starting point is 00:51:08 what are the most common problems that people run into? So that, you know, let's talk about this and let's make sure that people understand, because not everything is, you just do a kubectl apply and everything is good, right? There are challenges with it too. And so, A, at which point do people need to look into service meshes? And what are going to be the challenges? That's a really good question to end on. I can see why you're doing a podcast. You're pretty good at it.
Starting point is 00:51:39 So to tackle that one, I think there's really a couple gut checks or a couple sanity checks that you want to be aware of. One of them is anytime you're doing anything where you have more than one piece of back-end infrastructure. So if you're kind of rolling stuff up yourself and you've got one cluster, know, one cluster, let's say up in the cloud or whatnot. As soon as you think to yourself, ah, I really want to run something on prem and I want to run something, you know, up in the cloud. How do I like get that working appropriately? In that capacity, you really need to be thinking about a service mesh because that's the only thing that abstracts that infrastructure away from your, your concern to the point where you can just focus on writing the application and making sure that things work properly. The other thing would be if you're
Starting point is 00:52:32 developing different components of a service mesh, or if you're working on, let's say, trying to do something with encryption across the board or something on these, these, these, you know, in those types of categories, you'll run into this. I'm not coding application logic, I'm coding something that I need the application logic to be beholden to. So in other words, like you know, the payment service, theoretically speaking, shouldn't really concern itself from a code perspective of how it needs to scale, how it needs to be, you know, shouldn't really concern itself from a code perspective of how it needs to scale, how it needs to be, you know, let's say rate limited, how it needs to do different types of like retry patterns and things like that. Like once you, you get down that path where you're like, wow, I need to, I need to like stop doing this. You're going to want to bring order
Starting point is 00:53:21 to chaos by implementing a service mash and keeping your applications nice and pure and not having a bunch of extra logic or whatnot that's not really fundamental to something. So that would be like the two big components that I would take a look at. And end-to-end encryption, that would be a big one. Or if you have a lot of micro-segmentation requests, like I need to make sure that these namespaces don't talk to each other or different things or something along those lines, then I think you really should take a look at a service mesh when it comes to challenges that you have when adopting it i think a lot of it is fully understanding the capabilities that the service mesh has to offer and understanding how it implements some of these things because it's a little bit different than I think people were traditionally aware of. It's a different set of custom resource definitions,
Starting point is 00:54:10 When it comes to challenges that you have when adopting it, I think a lot of it is fully understanding the capabilities that the service mesh has to offer, and understanding how it implements some of these things, because it's a little bit different from what I think people were traditionally aware of. It's a different set of custom resource definitions, right? Your standard service definition and your load balancer definition need to be rewritten, because now they're going to be different service definitions inside of Istio. And as a result, you're going to need to, not necessarily modernize, it's not as big of a gear change as going from the monolith to the microservice, but it's going to be: I need to take a look at this YAML and update a couple of things here and there to make sure that it works with the mesh. The other thing to keep in mind is that you can actually run a service mesh on top of your Kubernetes cluster that is also running services that are not part of the mesh, which is kind of cool. So you don't have to necessarily move everything at once. It'll be sort of a gradual type of procedure.
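That gradual migration usually hinges on sidecar injection being opt-in per namespace. A sketch, again with a hypothetical namespace name:

```yaml
# Sketch: only labeled namespaces join the mesh; everything else keeps
# running untouched. "payments" is a placeholder name.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    istio-injection: enabled   # sidecars are injected into pods created
                               # here after the label is applied
```

Namespaces without the label simply stay outside the mesh, which is what makes the one-service-at-a-time approach workable.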
Starting point is 00:55:06 The big complexity, though, like I mentioned before about leveraging managed services, is really reaching out to whoever it is that is packaging up your Kubernetes distribution, or the Istio installation on top of it, and working with them to see what the best practices are. Because even though we want to keep this as standardized as possible, there are little differences depending on how you're deploying things, and there are going to be little differences that start cropping up in different parts of your application. So it might be that one vendor's implementation really recommends that you do the following, especially when it comes to bringing the components up and running different parts of the control plane, maybe in different ways. So they might have their approach, which they're certifying as being nice and stable.
Starting point is 00:55:53 That might be a little bit different depending on which provider you go to, and another provider might have a different set of functionality for the way that something or other is implemented. So I think that would be kind of an important thing. The other thing to mention, in terms of a challenge, which is actually not a challenge, but people end up getting into this very weird mindset where they think it's going to be a challenge: if you get something running in the mesh, you don't have to turn on all of the advanced capabilities that the mesh has to offer from day one, like micro-segmentation, rate limiting, quota enforcement, policy here and there. You don't have to do that. If you just deploy Istio and then deploy a service inside of it, it's kind of like when you deploy a service inside of Kubernetes: if you don't have the appropriate resource requests and horizontal pod autoscaler parameters and whatnot, the service will run. It won't necessarily run great, but it'll run.
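For reference, the plain-Kubernetes baseline he's alluding to looks roughly like this; the name, the numbers, and the assumption that the deployment sets CPU requests are all placeholders for the sketch:

```yaml
# Sketch: basic autoscaling hygiene for a service, mesh or no mesh.
# Assumes the "payments" Deployment declares CPU resource requests,
# since utilization targets are measured against what is requested.
apiVersion: autoscaling/v2        # older clusters use autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75    # scale out above 75% of requested CPU
```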
Starting point is 00:57:13 So then, over time, you can start implementing the restrictions or the capabilities of Istio as you notice, for example, that maybe one of the services keeps falling over for some reason. Maybe take a look at that and go: you know what would solve this? Better rate limiting. Let's figure out how to do that inside of Istio, and let's go ahead and implement that. But up until that point, you didn't need to know anything about how Istio does rate limiting. You just needed to deploy your service, and then you're good. So don't bite off more than you can chew trying to turn every feature on all at once, because that gets super complicated, and it's very difficult to then try to debug what's going on and whether Istio is the problem or not. And you kind of take it from there.
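When that day does come, rate limiting is a good example of why he suggests deferring it: one way it has been done is Envoy's local rate limiter applied through an Istio EnvoyFilter. The workload label and the token-bucket numbers below are placeholders, and the mechanism has shifted across Istio releases, so check the documentation for your version before using anything like this:

```yaml
# A hedged sketch of Envoy local rate limiting via an Istio EnvoyFilter;
# "app: payments" and the bucket sizes are placeholder assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: payments-local-ratelimit
  namespace: payments
spec:
  workloadSelector:
    labels:
      app: payments
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100        # burst size
              tokens_per_fill: 100   # tokens added per interval
              fill_interval: 60s     # i.e. roughly 100 requests per minute
            filter_enabled:          # enable for 100% of requests
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:         # and actually enforce, not just log
              runtime_key: local_rate_limit_enforced
              default_value:
                numerator: 100
                denominator: HUNDRED
```

Compare that to the few lines of VirtualService earlier and it's clear why this is a feature to grow into rather than start with.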
Starting point is 00:57:38 Those, I think, would be, you know, Seb's tips for getting started with service meshes. Very cool. Sebastian, is there anything, any area of service meshes that we completely missed today, which means we would need another episode of Pure Performance? Because if I give you another topic or two, then I think we need another hour. But is there anything that we definitely need to mention? Well, I mean, service meshes are so broad
Starting point is 00:58:13 and they're so powerful in terms of what they're able to do, and there are different ways of talking about specific issues inside of them. I'm going to say no. I think this is a good... By the way, have you thought about all these other things that service meshes can do? I think for anything else, you really need a little bit more of a use case. So it needs to be more of a sort of customer-oriented
Starting point is 00:58:35 conversation, and then focusing on how to implement specific things. And there's a variety of different options that you can have, again, based on your vendor capabilities. So then it gets into, well, I can do these things using a service mesh that are provided by this other back-end component, which is going to be slightly better than somebody else's back-end component, which, again, is just a whole other layer of abstraction. And also it gets more esoteric and, dare I say, more boring as you go up that stack, so I think you'll lose audience members if we keep talking. So I think we're good for today. Great. Andy, I think there's quite a lot there if you're going to summarize, but did
Starting point is 00:59:19 you have any words of wisdom from your end? Well, no, I just want to quickly summarize it, as I always try to do in the end. I think what Sebastian said close to the end is really what also excites me about service meshes: it's taking away all the burden and all the stuff you need to take care of in order to run applications, but that you shouldn't have to take care of. I think, Sebastian, you said it correctly earlier: if it's something that you would need to put into your application logic but which is not part of your business logic, then this is something where you should find a managed service that does it for you. And that's what we've been talking,
Starting point is 01:00:05 what the industry has been talking about for so many years now. Focus on what you're good at, focus on what differentiates you, and don't focus on things that have been solved many, many times before, better than you can ever solve them, because there's somebody that offers this as a service. And I think that's also true for service meshes. I really like that you gave us the overview: it's not just traffic routing, but there's encryption, segregation, making sure that maybe application one is not allowed to talk to application two,
Starting point is 01:00:39 the geographical boundaries. There are a lot of cool things in there, fault injection, a lot of great things, and probably much more. As you said, there are so many more capabilities, so if you're looking into this, make sure you understand what service meshes can really do, so that you don't go down the trail of maybe building something yourself that a service mesh can already do. Sebastian, I really would love it if you could send us some links to material that people can look into to get started; you mentioned a couple of presentations.
Starting point is 01:01:16 Also challenges. I think this is very much appreciated by all listeners. Absolutely. Thanks very much for having me again. It's been super fun. Yeah, and thanks for coming back. You're part of an elite group of people who've been on twice. There are people who've been on more than twice, I just have to let you know, so that maybe you'll be like, hey, I got another idea, I want to join the next tier of, I want to say elitism, but that's the wrong word, this special club of Pure Performance repeat guests. We used to have a running tally that we gave up on a long time ago.
Starting point is 01:01:53 We've just been doing this for too long to keep track, really, I think, Andy. But thank you so much for joining us again, Sebastian. As with the last episode with SLIs and SLOs, this was extremely informative, especially on a newer topic like service meshes, which I think can be overwhelming for a lot of people, including myself. So I really learned a lot today. I really appreciate that.
Starting point is 01:02:16 And without further ado then, Andy, this might be one of our longest episodes. I think we've popped in at over an hour. So thank you to our listeners for staying with us and listening. And we will be back in a few weeks. And if you have any questions or any
Starting point is 01:02:34 topics you'd like to talk about, you can tweet them to at pure underscore DT on Twitter. But that's what it is, right? It's pure underscore DT. It's been a while since I said it.
Starting point is 01:02:49 Or you can send an old-fashioned email to pureperformance at dynatrace.com. And Sebastian, remind people where they can follow you on the socials. That DevOps guy. Nice and easy. That's right. And you have that across all platforms, right? Pretty much, yep.
Starting point is 01:03:04 Yeah, I remember that from last time. Yeah. Awesome. All right. Well, thanks, everyone, for listening. And Sebastian, thank you. And we'll see you all soon. Bye-bye.
