PurePerformance - An Introduction to Service Meshes and Istio with Matt Turner
Episode Date: July 22, 2019
To service mesh or not? That's a good question! Not every architecture and project needs a service mesh, but for running distributed microservices architectures, service meshes provide a lot of essential features such as service discovery, traffic routing, security, and observability. We invited Matt Turner (@mt165), CTO at Native Wave, to tell us all we need to know about service meshes. We get a deep dive into Istio, one of the most popular current service meshes, its architecture, and how the individual components such as Envoy, Pilot, Mixer, and Citadel work together. We also chat about the trade-offs between performance, latency, throughput, and service mesh capabilities. If you want to learn more, make sure to check out Matt's online content such as blogs and recorded conference presentations on https://mt165.co.uk/.
Native Wave: https://nativewave.io/
Istio vs. Linkerd CPU overhead benchmarks by Michael Kipper:
Initial observations: https://medium.com/@michael_87395/benchmarking-istio-linkerd-cpu-c36287e32781
Second analysis: https://medium.com/@michael_87395/benchmarking-istio-linkerd-cpu-at-scale-5f2cfc97c7fa
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always my co-host Andy Grabner is with me.
Hi Andy, how are you doing?
Not too bad, actually. Now that I'm talking in the right direction of the microphone, I believe people can actually hear me, because that was my problem in the beginning.
Yes, you're a performance genius, but not quite an audio maestro. You'll get there, Andy, you'll get there.
And your voice: I've got to say, when I'm editing the shows, I love hearing you on the microphone instead of the headset. I'm sure the listeners do as well. So that's awesome.
Yeah, I got that feedback from other listeners too. The audio quality definitely increased. They also said the quality of the content didn't get better, but at least the audio quality increased.
Well, I think today's content will be very good, because it's sort of a new topic, one we haven't really covered too much yet. And I think it could be confusing when you're first getting into it, depending on your level.
Why don't you go ahead and introduce the topic and then introduce our guest, Mr. Grabner.
Yeah, sure.
I mean, first of all, I want to say thanks to the folks at Dev Experience in Romania who introduced me to Matt Turner, our guest of today. I met him at Dev Experience in Iasi, a city in the northeast of Romania,
where he gave a talk on the life of a packet through Istio.
Now, that was very interesting for me because I learned a hell of a lot about network basics
and how routing works and how Istio works.
And that's when I reached out to Matt after the talk and said,
hey, there's so much
you know. And then he told me a little bit about his background. But instead of me repeating all
the stuff that he said, I just want to hand it over to Matt. Matt, I know you're with us online right now. If you could introduce yourself: your background, what you've been doing, what you're doing right now. And then we want to learn more about service meshes, Istio, what else is out there, and what people
have to know in case they just get started on that topic. Sure. Yeah, I'm here. Pleasure to be
with you. Can you hear me? Am I talking towards the right microphone? Yeah, it's good.
I passed the first test, first piece of equipment. Yeah, as I say, it's a pleasure to be here.
Thank you very much for having me.
Yeah, I met Andy in Iasi in Romania at Dev Experience,
which was a great little conference, actually.
Lots of really, really good advanced talks.
I actually sat in every session, you know, apart from the one I was presenting, and it was, yeah, really informative.
So that was a great place to be.
So, yeah, I'm Matt Turner from London in the UK.
I'm a software engineer and computer scientist by training.
And I guess for the past sort of five years,
I've always had an interest in systems.
I've always been building servers and Raspberry Pis and stuff at home.
That's, I guess, the hobby part of my experience with computers. And about five years ago, this DevOps thing started to happen, and the infrastructure and the platforms became more important. And I was like, oh, I know that stuff. I thought this was just a nerdy hobby, but this is really relevant now. So I guess I kind of got into that. I did a bit of old-school orchestration and management of VMs,
the sort of the early lift and shift to early cloud when that was OpenStack.
And I've kind of followed the technology since then.
So with Docker, with Kubernetes, and now with Istio and the other service meshes,
which are kind of at the forefront of the cloud native landscape.
Cool.
So, you know, the service mesh, you just brought it up.
And I think Brian actually asked about this when we kind of prepared for that talk.
Can you give us a quick 101 on service meshes: what problems we're trying to solve with them, and how they work?
Because I'm pretty sure, you know, part of the audience obviously is aware of it, but just to level set everyone.
Yeah, so I guess to turn to a service mesh
by way of a problem, if you think originally
that we would have a big monolith,
a big hunk of software that was probably written in C++,
or Java, or PHP, and that could be millions of lines of code.
And that thing worked. We became
quite good at writing those. We have dependency injection systems, we have modularization,
modular loading. And we became good at avoiding common anti-patterns, with things like inversion of control. But ultimately, all of your code was hosted in one big blob, and it ran in one big
process. So although you might have a domain-driven design aggregate that was almost a separate little
piece of software to another part of the system, to a different namespace, where one wanted to
call another, that would just be a function call, right? And any computer science 101 course will
tell you how arguments are actually passed, sort of in registers on the CPU. So that was really simple and really fast, and that never failed. But then our systems got bigger and bigger, we needed more and more scale, and so we broke that monolith up into microservices. And obviously there's a bunch of other reasons for that; I guess I don't need to explain microservices. You can then release every service independently, and whatever else.
But there was a tendency to take this monolith and to split it into all these pieces and run each one in a separate container without really fundamentally changing the way that the system worked and the way that communications across these boundaries happened.
So what you ended up with was, to give it a glib name, a distributed monolith. So you had all the same code as before,
with all the same failure cases that you could have before. But now things that were previously
a function call just between one class and another that could never fail now could because they're
going across a network and Kubernetes might have these two containers on different sides of the
planet. So we needed to cope with that.
We needed to add a bunch of resiliency to cope with this distributed system
that we've now built.
The early attempts at this were libraries like Hystrix and Finagle
from the Netflix open source stack that people have probably heard of.
And they gave your applications facilities on a sort of an
RPC call, like backoffs, like retries, like circuit breakers, timeouts with defaults,
that kind of thing. So each of these functions mitigated the fact that a network call now might
be unreliable. But they were in process, you know, you downloaded a library, you added it to your
Maven config, and that was a big hunk of code in each and every copy of the microservice.
And you needed to upgrade it, and you needed to do a rolling update of your thousand lines of microservice business logic every time this hundred thousand lines of Hystrix changed.
So what a service mesh does is it takes that functionality, it takes those concepts, and it moves them completely away from your application.
It moves them out of the process altogether.
So you can imagine if you need these kind of retries and timeouts,
then you might use a sort of advanced HTTP client on the one side,
and you'd set up an HTTP reverse proxy on the other side, like an Nginx.
You know, think of how your API gateway does a lot of these functions for you.
So what a service mesh does is it takes an HTTP proxy
and puts one next to every microservice.
So in Kubernetes speak, if you're on Kubernetes
and each service is in a pod,
then we have this HTTP proxy as a sidecar,
as another container in that pod.
And it intercepts all traffic on the way in and the way out.
On the way in, it'll enforce rate limiting and sort of parallelism constraints and stuff, and on the way out it'll do circuit breakers and retries and timeouts and everything else. So you get all of these network functions kind of for free, ambiently, from your infrastructure. You certainly don't have to write the code to do them, and you don't even have to vendor in the code to do them.
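For reference, the kind of mesh configuration Matt is describing looks roughly like this in Istio; a minimal sketch, assuming a hypothetical service called reviews, with made-up timeout and retry values:

```yaml
# Sketch: retries and timeouts configured in the mesh, not in the app.
# Service name and values are illustrative, not from the episode.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews                 # logical service name inside the mesh
  http:
  - route:
    - destination:
        host: reviews
    timeout: 2s             # overall deadline for the whole call
    retries:
      attempts: 3           # retry a failed call up to 3 times
      perTryTimeout: 500ms  # deadline for each individual attempt
```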
Oh, that's pretty impressive. In a couple of minutes, I think I finally understand the whole history and what problems we are really solving with this. And I just want to be honest here, because I'm definitely not the expert on this, but for me the big aha moment, from when you explained that in the beginning we had Hystrix and Finagle and now we have service meshes like Istio, is that the first approach was a library that you bake into your code. And then you made an interesting comment where you said if the library changed, you had to redistribute and recompile and redeploy all of your services. And obviously you're taking that burden away by extracting the functionality of a service mesh into its own entity. And therefore you're completely independent: you don't have to touch your code. The only thing you really do is inject a sidecar into the pods. And that obviously makes much more sense.
One question I wanted to get in for clarification too, because I know you two are about to run away with this. One of the parts that was explained to me, now that I have both of you on the... I was going to say on the phone. Now that I have both of you here: one of the other benefits that I heard about this,
and I just want to make sure I understand it correctly,
is that it's also like your services
will register themselves with the service mesh
so that you don't have to also tell your services
where all the other services are.
They just call into the service mesh,
the service mesh knows the map of that
and then routes it to where it's supposed to go.
Is that another one of the big benefits
of something like a service mesh,
or did I understand that incorrectly last time?
Yeah, yeah, that's another big advantage.
I mean, there's a few,
coming from the history and the problem,
there's a few things I missed out on.
We can talk about the rest, I guess, in due course,
but that service discovery is definitely a big piece.
The practicalities of, say, an Istio and Kubernetes system
are not quite that the service registers
with the service mesh, but it effectively works like that. The service mesh gains knowledge of
all the services. And as you say, a service that's making a call can use sort of a short name, a non-qualified domain name, or anything that's DNS-compatible, but it doesn't have to be a globally valid FQDN
that you will go to the top level name servers
and recurse to look up.
You sort of throw this request into the ether
with the correct host header
and the service mesh will get it to the right place.
Also, because it understands logical service names like that, and it understands that they're possibly comprised of several instances of the workload (in Kubernetes speak, you'd have one Service with a capital S, made of a couple of Deployments and lots and lots of pods), the service mesh can understand physically where they are: which region, which availability zone, which host they're on. And it can start to do things like route to the closest one for performance reasons.
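For a sense of what that per-service awareness looks like in configuration, here is a hypothetical DestinationRule; the host is a short, mesh-resolved name, and the outlier-detection values are invented (locality-aware routing itself is configured separately in Istio):

```yaml
# Sketch: a per-service policy attached to a short, logical host name.
# The mesh expands "reviews" to the full in-cluster service; values invented.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews             # short name, not a globally valid FQDN
  trafficPolicy:
    outlierDetection:       # eject instances the mesh observes failing
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```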
Hey, talking about performance, because I think this is a question that I heard people ask before when people presented about Istio and the service meshes in general.
So if you're injecting, let's say, an NGINX or a proxy into every pod, isn't that itself a huge overhead?
Isn't that itself very much complicating my architecture, even though I don't have to
take care of it?
But potentially with every microservice, I get another service that sits in front of
it?
You do.
You do.
I would hope that it doesn't complicate the architecture too much.
I mean, you're right, all of this code and infrastructure is being added.
And I guess a big thing I missed out in the introduction is that if you want to be pedantic, a service mesh is a mesh of services, right?
It's all of these services talking to each other through this proxy that gives extra features. In order to be able to configure those proxies, in order for them to be able to do the service discovery that we just talked about, in order for them to know what characteristics to apply, they need a control plane. So they need one or more other services that accept high-level configuration documents and then pass that out to the little sidecars.
So the architecture, if you look at every detail,
does get more complicated.
But as I say, you've dropped Hystrix from all your Java apps, you've dropped whatever advanced configuration you're doing to Python Requests to try to do the same thing, and it's never quite at parity because they're different libraries for different languages.
So yes, it is there.
But the architecture that the user sees, it should be completely transparent, is my point. The injection is transparent.
The application needs no configuration to know that the service mesh is there.
On the performance question, yes, there is an implication.
So there are a bunch of service meshes on the market.
One of the early ones was called Linkerd1. So Linkerd version one, it was basically a middle proxy.
So there wasn't actually a proxy per service.
There was one per Kubernetes host and they all shared the same one.
It was essentially the Finagle library wrapped in a little bit more code, so it was kind of a single bottleneck. It was on the JVM; it was actually written in Scala.
So it was kind of worst case JVM performance. So it had a lot of features, but it wasn't fast.
The newer service meshes do address that. Istio doesn't use Nginx as a proxy. Actually,
it uses a newer piece of software called
Envoy that came from Lyft, the ride-sharing company. So that's written in C++, very deliberately
to be a high-performance piece of software. It does its own thread scheduling. It's got its own
RCU subsystem. It's obviously a non-garbage collected language. Matt Klein, the main author,
has written a bunch of really interesting blogs on the performance tuning
and the trade-offs they've made actually between throughput,
because you can always just auto-scale more pods
to get more throughput.
They've actually traded throughput down
to get better latency
and to get tighter bounds on the latency.
So performance is definitely at the forefront of people's minds
and a lot of thoughts gone into it.
Linkerd is now also on version 2,
and their proxy is written in Rust,
which again is very close to the metal, with no garbage collection. Hopefully lots
of performance benefits. But it can't
be avoided. We did some empirical
measurements nine months
ago. Take this
with a pinch of salt because there's
a thousand variables in an experiment like this, but we saw about two milliseconds per service being added for an Istio system:
one millisecond roughly to traverse Envoy, the proxy itself,
and actually one millisecond jumping into the kernel
and out again a few times to jump through all the IP tables rules
that do the interception because that interception is transparent.
Now, you can mitigate that by telling your application
about the sidecar, having it send traffic directly there.
Then you don't have as many context switches
into sort of IP tables and back.
But yeah, there will always be a performance hit.
If you're trying to do high-frequency trading,
maybe it's not acceptable for you.
But for most other applications, I'd hope it's just a blip in human time.
Yeah. I mean, that was the idea of you can't get something for nothing, right? And it's always the
trade-off. One question about the performance, the overhead thing, though, I don't know if you saw
this. I came across this a little, actually not very long ago, because it came out in April.
It was an article by this guy, Michael Kipper, where he benchmarked Istio and Linkerd CPU,
and he found, not on response time,
but I think Envoy was about 50% higher
in CPU utilization than Linkerd.
I don't know if you saw that or not,
but it's just kind of interesting.
But again, that's not necessarily impacting your application.
That's more the service mesh usage.
But yeah, there's always going to be trade-offs, right?
And I think that's the thing. And you have to look at what those trade-offs are for what you're getting. If you're
going to go back to managing all those communications manually, now you're paying for all these people
to be able to know it, track it, and be able to keep that configuration up and running and maybe
even have software failures because it's not maintained versus suffering one or two milliseconds
extra on a transaction, which is really, as you said, a blip. Yeah, exactly.
And what is the cost of a code path traversing a thread on your CPU,
traversing all the way through Hystrix, which you can now remove?
Yeah, I think the answer is admit to yourself
that you're never going to get a free lunch.
Work out what your actual requirements are and just test it.
Yeah, I've been following Michael's stuff.
I'm a big fan.
He did that initial work,
and then he did something more recently, actually,
a higher scale, a much higher load.
And actually, it tipped.
I actually thought he found that Envoy was quicker
in the first set of experiments,
but then under a lot of load, it was Linkerd, or something.
I can't remember, but he's found some very interesting stuff.
He's very methodical about his sort of experimental conditions.
If they exactly match your environment, then that's great.
If not, they're just an indication, you know, spin up your own load test, work it out.
And then, yeah, exactly as you say, go and look at what your business value is from this.
Go and look at what the opportunity cost is of not doing this.
Look at how much it costs the people.
All of these are the questions that should factor into
a big tech decision like this.
Yeah, and you are right, he did come up with a follow-up on May 8th. So, definitely. Hey, so quick question here: you mentioned Envoy. And obviously, if you Google or Bing or whatever search engine you prefer, if you look for, let's say, Istio architecture, there are a lot of great overview pictures out there. And we'll add these to the links for the podcast proceedings. You have a lot of lectures out there, and articles that you did, and presentations on Istio, for beginners to advanced people.
But if you look at the architecture, then you see Envoy being obviously injected into the pods as a proxy to intercept all the traffic.
Now, you mentioned earlier there's a control plane kind of on top of it.
And could you explain a little bit more about what makes up the control plane
so people are a little bit more aware when they hear things like Pilot, Mixer, and Citadel?
Yeah, so I guess what I'll talk about is Istio in the setting of running in a Kubernetes cluster, because that's almost always what's written about, and almost always how it's used. Istio can run outside of Kubernetes, but you have to do a lot of stuff yourself; it gets quite complicated. In Kubernetes, as I'll talk about, some parts of Kubernetes are leveraged to help control things as well.
But so basically, yeah, you've got the Envoy proxy as a sidecar to every service, which means that every Kubernetes pod has another container running Envoy.
Envoy is quite a modern piece of software, so it takes configuration over an API,
not off a hot reloading file on disk.
So Envoy is a nice little thing that's sitting there waiting to be configured,
but something needs to tell it what to do.
And you don't want that to be you
because it would be super complicated.
So there's this control plane with which you interact
and you give it a high level configuration
and it tells all the Envoys what to do.
So there are sort of three or four components,
five, six, depends how you want to look at it.
But the three major
ones, the first one is this thing called Pilot. Pilot is basically the configuration system,
the sort of configuration compiler, if you like. So if I want Istio to implement a
fault injection, 10% 500s, right, for chaos, then I write a YAML file that looks a lot like a Kubernetes YAML file: an Istio YAML file, to an Istio schema, that tells Istio that I want a fault injection, and the return code should be 500, and it should happen 10% of the time, or whatever. I submit that to Istio. Pilot then effectively compiles that, transpiles that, into the configuration format that Envoy wants, which is a different document form.
And it sends that down to every Envoy to tell it what to do.
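That fault-injection document looks roughly like this; a sketch with an invented service name, using Istio's VirtualService schema:

```yaml
# Sketch of the "10% 500s" fault injection described above.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings             # invented service name
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        percentage:
          value: 10         # inject the fault on 10% of requests
        httpStatus: 500     # respond with HTTP 500 on those requests
    route:
    - destination:
        host: ratings
```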
Where the integration with something like Kubernetes comes in is what we were talking about earlier, say service discovery.
The sidecars don't actually sort of call in
and register with the mesh.
What actually happens is Pilot goes and talks to Kubernetes.
It says, well, I'm running in a Kubernetes cluster
and I want to know about all the different pods,
all the different workloads.
Well, all it really needs to do is effectively a kubectl get pods against the local Kubernetes API server.
So it does that.
It takes in service discovery information from a bunch of places, including Kubernetes,
and it takes in all the extra configuration documents that you give it to give it any
kind of non-default settings. And it compiles them and it pushes them out to all the envoys.
The next component is something called Mixer. So where Pilot is for sort of upfront configuration,
the kind of thing that you would write into a config file
if you were configuring it manually,
Mixer is like online decision-making.
So say I've got a rate limit.
Say I've got three copies of the pod for service A,
and I want a thousand QPS rate limit across all of them
because maybe they all call off to the same database behind the scenes.
So I can scale to as many as I want,
but that doesn't help the bottleneck in my system,
which is this one database.
So I can have one or three or 5,000 copies of the service A pod,
and they only really can take a thousand QPS between them,
because each time I touch one, it calls the database.
That kind of rate limit can't be pre-programmed into a configuration file.
Each envoy could get configuration saying, well, your local rate limit is 1,000.
But if you want that kind of global coordination, then you need effectively a global counter,
a global histogram bucket.
So that's the kind of thing that Mixer provides.
So there's a very tight communication loop between the envoys and Mixer.
So for something like a rate limit or for whitelist, blacklist policy, every time there's a request,
the envoy sidecar will call to mixer and say, hey, is this okay? And mixer will check its rate
limit bucket or its up-to-date policy list or something, and it'll give a reply.
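In the Mixer model, that kind of global quota was configured with Mixer's own resource kinds. The sketch below follows the general shape of the memquota adapter configuration from the Istio 1.1-era docs; names and values are invented, and a complete setup also needs a rule, QuotaSpec, and QuotaSpecBinding, omitted here:

```yaml
# Rough sketch of Mixer-side rate limiting (Istio 1.1 era, memquota adapter).
# Incomplete on purpose: the rule, QuotaSpec, and QuotaSpecBinding are omitted.
apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: quotahandler
  namespace: istio-system
spec:
  compiledAdapter: memquota
  params:
    quotas:
    - name: requestcountquota.instance.istio-system
      maxAmount: 1000       # the shared 1,000 QPS budget from the example
      validDuration: 1s
---
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcountquota
  namespace: istio-system
spec:
  compiledTemplate: quota
  params:
    dimensions:
      destination: destination.labels["app"] | destination.service.name | "unknown"
```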
The other, I guess, fairly big feature that I missed out in the
introduction to Istio is the observability, the telemetry that you get for free. So because
all of these proxies are on the wire and they're handling, you know, actually passing through every
network transaction, they can produce a log at each one and they can produce metrics about all
of the different characteristics and all of the rates. They can produce trace spans, if trace headers are being propagated. Again, all of that stuff that you'd have to import a Zipkin or a Jaeger client library to do, and then wire up your web framework's logging and all of that stuff: if you've got this universal proxy on the wire for everything, Istio can totally do that for free. And the way that works is through Mixer.
So Envoy tells Mixer sort of the raw data: it basically sends it the headers of the transaction that's gone through. And then Mixer will send that on.
Mixer is configurable, so you can say,
all right, I've got a Prometheus server over there
and an older Graphite server over there.
Both of those want metrics.
And my logging server is Elasticsearch over there
and so forth. So that's what Mixer does. It's a central aggregation point for real-time
policy and for observability. The next major component is, I think, called Citadel. So that
deals with the security. The Istio sidecars can give you mutual TLS between all your pods. So if you think about a normal microservices setup, you tend to either not do TLS,
you just do an HTTP, plain HTTP call between pods,
and you sort of make an argument about defense in depth saying,
well, I'm in a Kubernetes overlay network in a VPC, it's fine.
But MTLS certainly doesn't hurt.
Or you would do TLS by giving Netty some certificates, and maybe you get one-way TLS, and you kind of bake these things in when you build the application, and they expire after a year and never get rotated. All of those are sort of bad security practices. So Istio can set up the TLS tunnels for you: mutual TLS, so verification of both ends, with short-lived certificates that are regularly
cycled. All of this stuff is totally possible manually, right? You just have to write the code.
So Istio has done that for you. And Citadel is the component that mints those certificates
and issues and rotates them.
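At the time of this episode (Istio 1.1), turning that mutual TLS on mesh-wide looked roughly like this; a sketch, and note the authentication API has been reworked in later Istio versions:

```yaml
# Sketch: mesh-wide mutual TLS in Istio 1.1. Citadel mints and rotates
# the certificates behind the scenes; no application changes needed.
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default       # the mesh-wide policy must be named "default"
spec:
  peers:
  - mtls: {}          # require mTLS between sidecars
```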
There are another couple of things that are sort of down in the weeds of making the system work; they're probably not that important. I guess the only other thing I'd mention is the sidecar injector. This isn't really an Istio component. It's leaning
on a Kubernetes feature, a mutating webhook admission controller, if anybody's familiar
with those, which basically says, again, that the developer experience gets to be better.
So as a developer or an operator, I write a Kubernetes YAML file saying, this is my deployment,
and all of my pods have one container in them, which is my application code. You don't mention the sidecar,
you don't have to know it's going to be there. And when that YAML document is submitted to the
Kubernetes API server, this mutating webhook admission controller modifies that document and
says, I'm going to add another container to the containers list in the pod spec, which is the Istio sidecar container.
And as I said, that's free and that's transparent.
And that's done on a Kubernetes system by Istio hooking this powerful Kubernetes feature.
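Opting a namespace into that injection is just a label on the namespace; a minimal sketch with an invented namespace name:

```yaml
# Sketch: with this label, the mutating webhook adds the Envoy sidecar
# to every pod created in the namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app              # invented namespace name
  labels:
    istio-injection: enabled
```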
Pretty cool.
Hey, Matt, thank you so much for the overview.
I mean, that was, I think, at least for me, also looking at the architectural diagrams... I suggest people that are listening to this, maybe listening to this again, just open up the architectural diagram, because it really makes a lot of sense the way you explain it. I also like a lot the flexibility that Mixer gives you, obviously then implementing or enabling a lot of the features I think we all need to think about in large distributed systems: everything around traffic control, as you said, enforcing rate limits, and all that stuff.
Now, we didn't invite you just to give a quick overview of Istio. I think I also want to learn from you, especially with the work you do right now, because I think you help organizations with Istio and with microservice architectures. Can you maybe give us a little insight into what people are struggling with, what problems people face, and what people should be aware of when they go down that road of a service mesh, or in particular with Istio?
Why would they also maybe reach out to you again and ask for more feedback?
Yeah, it's an interesting question.
So, briefly, what I do is I'm CTO at a cloud-native consultancy in London called Native Wave. And as you say, what we do is we help organizations that are looking to become cloud native.
So organizations that are looking to take whatever software stack
they've got and move it to a public cloud.
And the reasons they do that are varied,
but they're always looking to get a hold of at least one
of the advantages of public cloud and cloud-native computing, right,
that we all know about.
The thing is it's very complicated.
There's a lot of it.
I mean, if you've seen the sort of CNCF landscape map recently,
you know, it's now at the point where you can't read the logos on one page.
There's so much there.
So people have kind of been tuned in
for the last few years. They've heard that there are all these massive advantages and we've just
talked about what Istio can do. And I think it's got a bunch of great features that people will
benefit from. But organizations, especially organizations that weren't born in the cloud,
they struggle to know which of these things they want and they struggle to know how to get there. And the kind of depth of knowledge that you need to run Istio or to run
Kubernetes or to run Vault or, you know, any and all of these systems at production sort of, you
know, scale and reliability is really deep. So I think we see organizations that don't want to go
and simply don't have the capacity to go and learn all of this stuff
for all of these products and then take a decision about what they should use.
So it's okay for me to go to a conference and say, you should use Istio. It's great.
But that means that in order to know that it's doing the right thing and not breaking your
application, you need the monitoring set up properly, right? If you run Prometheus at
scale, it's not that easy. And that, you know, relies on a working Kubernetes cluster, which
relies on all this other stuff. Yeah. So to your point about when people should use it, you know,
what problems people have adopting it, it is often seen as the last step in the adoption of all of
these sort of cloud native technologies, which can be a long
way down the road for a lot of people who are maybe just starting with Docker or just starting
with one of the cloud providers. And it could be quite a daunting thing to sort of build up to.
So we really go in and sort of help organizations cut through all of the vendor pitches maybe and
work out what technology they need. And then we help them design what the right stack for them would look like.
And obviously we can help build it. And actually we have a managed service platform
as well. So we can just help host it as well. You can outsource your IT function
to us, which is what we see a lot of companies really, really wanting to do.
So they get all the benefits of the latest cutting edge cloud native technologies
because we are experts in that.
Not because we're particularly clever,
but because these are the kind of conferences and podcasts
that we spend our time at.
And then the idea is that the developers in the companies
can just finally live that dream of focusing on their business logic.
They write these 1,000-line microservices
that don't have to care about where they run
or what their network is or whether things are on fire.
So yeah, that's kind of what we saw in the market.
And that's why we've decided to do what we do.
So it's great for me.
I get to learn about all this stuff,
bring it to the table, build the best platform I can.
And hopefully other people get the advantages
of this stuff without the pain.
Yeah.
And we also, I mean, I think we talked about this in Iasi at the conference: here within Dynatrace, we just started an open source project called Keptn, where we are also using Istio for traffic control. When we do blue-green deployments or any type of deployment strategy, we are using Istio, and Keptn is automatically configuring Istio and creating all the Helm charts and putting them into Git.
So there's also a lot of lessons learned when we played around
with this latest and greatest technology.
And really our hope is to provide a platform that really allows
these teams and organizations to really, let's say, benefit
from what cloud native promises, which is focus on your code, write your microservice,
deploy it, and let the cloud-native frameworks
that are out there handle all the tough work,
whether it's traffic routing,
whether it is the different type of deployment models,
whether it's scaling up, scaling down, and things like that.
But the way we learned it, and I'm sure you've learned it as well: if you go down into the weeds, there's a lot to it, and it's not as easy as it sometimes looks. But we try to make it easier by figuring out the best practices and then combining the right tools and providing a good service, or a good framework, on top of it.
Yeah, I hope so.
And everybody has their specialism, right?
I'm sure you've learned a lot about Istio,
a lot more than you claim to know. I'm sure you know loads from writing Keptn,
because obviously it has to lean heavily on Istio.
I'm super, super excited about the Keptn project.
I think it's great.
I think it's almost the last missing piece on top of that stack
that we've been talking about, right,
that actually gives developers an interface
where they can do what they want,
which is here are the three versions of my software
that I care about at the moment.
I want an A-B test between this,
and I want an automatic rollback
if it blows up during the middle of the night, right?
Istio provides all of the primitives for that.
But again, you'd be sitting there pushing a lot of configuration documents,
even at Istio's level of abstraction, if you wanted to do that.
So I'm super excited about Keptn bringing that to the table and automating it.
Yeah. Hey, so I know you're tight on time, I believe,
because you're actually right now somewhere in Europe and at a conference.
But I got a question for you.
So when is it maybe not a good idea to think about these things?
When is it, or what are the minimum requirements
from an architectural perspective from your app
to look into something like a service mesh?
When is it not smart to walk down that road?
Because I think knowing when it's not smart
is just as good as knowing when it is smart.
Yeah, it's a tricky one, right?
I guess I would say don't do science projects,
don't over-engineer more than you need to.
A number of people have come to me, because I talk about this stuff, or to Native Wave, and said, oh, we want Kubernetes, can you help? And I've said, why? And they said, oh, well, because I've heard of it, it's got all these advantages and all these features. And I say, right, how many services have you got? Oh, three. Okay, and what kind of load are you at? Oh, you're pre-release. Okay, so actually, you know, a Docker Compose file would be just fine.
Right.
And you use two EC2 instances.
So you've got a backup.
It's certainly not perfect.
We could all sit here and pick holes in that all day, but it's going to work and it's going
to work at that scale and it's going to be totally good enough.
The opportunity cost of sitting down for nine months and building a perfect platform is nine months where you're not writing your application code, where you're not going to
market and getting feedback and raising funding and all of that good stuff. So I think don't
build, as ever, don't build more than you need. The thing about service meshes is they are
really useful. I wouldn't necessarily go multi-region in your cloud provider
or even go to Kubernetes until it's ready, until you've got time.
I wouldn't even necessarily do microservices until you really have a need,
until you actually do have sort of real scale
or real development velocity problems.
And as I said at the beginning, we are actually quite good at,
you know, our IDEs and our tools and our frameworks make us quite good at writing fairly large pieces
of code. But if you are going to be calling across a network, I really do think you need
these kinds of features. Now, if you're in one language, yeah, you could use an in-process
library. There are, you know, libraries for Go and for Python and other languages like that.
If you're starting from scratch, if you're sort of born in the cloud,
then I would actually be really tempted to get a managed Kubernetes cluster.
It's a folly to run Kubernetes yourself.
I don't know why anybody does.
But get a managed Kubernetes cluster, install Istio or Linkerd2 into it.
It's really quite simple these days.
Turn the chaos on from day one. So turn chaoskube on and turn the Istio stuff on from day one.
And so, you know, do software development properly,
do continuous delivery from day one
under these simulated conditions of Chaos,
and then everything will be lovely.
And I really would go to a service mesh quite early
if you started from scratch.
I just think they're so valuable.
The one thing that does put people off is they build on top of this other stack of stuff that
you need, which I admit is a problem. I don't have a perfect solution for it other than to say that
now getting a managed Kubernetes cluster on one of the major cloud providers is really a case of
a few clicks. So hopefully it's not that hard.
As for when not to do it, you know, if you have a big brownfield legacy site,
trying to shove one of these things in may cause you, you know,
may cause you a bit of pain.
It may not be what you need right now.
Istio has a bunch of ways to mitigate that.
You can turn it on sort of Kubernetes namespace by Kubernetes namespace.
You can do what's called extending the mesh. So you can have an Istio mesh running in a Kubernetes cluster that also talks to services on VMs. So if you're on VMs and you're doing a lift
and shift into containers, you can do a little bit of your workload, put it in a couple of
containers, put that in a cluster, have Istio
in that cluster, giving the advantage to them, and then have it set up just so that it can
still call out to the old legacy VM stuff.
So there's ways to migrate incrementally.
But yeah, I can't really help but say, I think it's a great thing.
And I think you should try to get all of its benefits.
And hopefully it is more simple now than people think. You know, it's now at version 1.1. I know it got a bad reputation, maybe, but those were the 0.1 days. It was released very early, with a very clear 0.1 label on it, and I think it got so much hype, so much coverage, that everybody said, oh, it's great, but it is a bit buggy. Well, yes, it said 0.1 on the tin, you know, and people just got carried away and tried to use it in production. Hopefully now you should have a much better time.
Yeah, cool. Well, then the good news is, if people have questions on whether it's the right time, or are seeking some advice, we will definitely make sure to put all of your information in the podcast proceedings so that they can reach out to you. Because obviously you do have a lot of experience
in how to make companies cloud-native or cloud-native ready.
And so we definitely, if it's okay with you,
obviously we'll direct them your way.
Yeah, please.
I mean, I'm personally always happy to have my opinions challenged
and have debates with people and learn new information.
So come find me on Twitter.
I'm sure you'll put that information up. And yeah, Native Wave can help at a company level as well.
Yeah, perfect. Brian, is there anything else from your end?
No, I think it might be time to go ahead and do some of the good old Summarator. Do it now.
So, folks, what I've learned today: there's obviously a whole lot to Istio. What
helped me in the explanation from Matt, which was phenomenal, explaining the different pieces
of the architecture, is just look at the architectural diagram, see what Envoy is doing
as the proxy that sits in front of all of your services. And then the control plane on top,
which makes sure to propagate the configuration to the envoys,
the mixer that is doing all the telemetry
and then is doing real-time configuration changes and traffic changes,
and then also Citadel, which is one of the components for secure communication
and I'm sure a lot of other things too.
But, Matt, I think this was extremely useful for me
and also thanks for answering the questions
on when it might not be the right time
because we want to educate people on new technology,
but we also want to make sure they understand
when it might not be the right time.
And if they're still uncertain,
then obviously we will direct them
to some of the material that you put out there
and also make sure that people know where to find you.
I know you are traveling the world for different conferences.
We met in Romania.
I know you are in Barcelona at KubeCon,
and I think there's other things coming up.
By the time that this airs, it's going to be July,
and I know you enjoy probably a nice, quiet summer,
but in case we have people that want to reach out to you and disturb your summer, we'll still send them your way.
Yeah, I think luckily the conference season all came at once this year. I think luckily I'll have safely got back to London before this airs, so nobody can find me. But no, of course, reach out online instead. I love talking about this stuff; I find it super interesting. And I'd like to have an impact, as much as Native Wave would like to help people work out whether this stuff is right for them and implement it. I think a lot of it just comes down to education: if you understand the systems, what they're trying to do and how they work, then you can make an informed decision for yourself as to whether it sounds right for you now. Which is why I kind of go around talking about this stuff.
Yeah.
Well, yeah. And in terms of people finding you,
when Andy said we're going to put where people can contact you,
he did mean, you know, your home address, you know,
when you're typically going to be on the tube and all that.
So, yep, come find me on the Bank branch of the Northern line, 8:30 a.m. every day, if you can elbow your way through the crowds. I'm the guy in the Kubernetes t-shirt.
Yeah, exactly. I want to thank you a
ton for this, because it really solidified this all for me.
And I do want to mention again, just stress again for anybody who wants to completely wrap their head around this, do as Andy suggested.
Grab the architecture diagram, take the section where Matt talked about all that, and just look at that while he's talking because it really solidifies it all.
And I think it just also goes to speak again.
I brought this up on a previous podcast, Andy,
about the maturity of all this stuff.
You know, Kubernetes flourishing, taking off
and how now there are so many different services
around these things.
In the earlier days of all this, you know,
cloud native experimentations and all,
it was all build your own
and you had to find the time to do it all yourself as well. But as you say, Matt, even with your organization, there are services now around it; there are a lot of services around so many different aspects of it. It's really cool to see how mature this has become in so few years. It's exciting times, I think.
Yeah. And I could talk all day about how Istio can do protocol translation on the wire and transparent database sharding and
all of these things. And I think the technology's always been there, and people are just thinking of novel ways to use it. I don't think service meshes were ever designed to be transparent database sharding systems. But actually, if you think about it and you know how it works,
you can totally make it do that.
So I don't know whether anybody on the Istio team saw that coming two years ago,
but it's a thing.
So I think we're really just getting started with this kind of technology.
It's super exciting.
For me, it sounds like you're just introducing the next podcast.
Oh, I signed up.
Yeah, I think you are. We haven't mentioned it in a while, but we used to have a running competition on repeat guests. I think the most we got up to was three or four, I forget. But it's not too late to join the challenge.
Yeah, well, exactly. Well, there's a bunch more stuff where that came from.
Awesome. Let's see what people want to hear about.
I'd be really interested to see the feedback on this session.
I have one, hopefully a quick one.
What are you talking about at KubeCon?
At KubeCon, I'm talking about one of the new features in Istio 1.1, which is a way to basically make easy calls between service meshes.
So if you've got two Kubernetes clusters in different regions,
you absolutely want to run two copies of Istio with a separate control plane each.
So a service in cluster A needs to be able to talk to a service in cluster B
for whatever reason.
You always could do that with Istio, but it was super complicated and I wrote like a blog
and a config generator and stuff for it back in the day. Istio 1.1 makes a lot of that first class.
So I'll just kind of explain the theory behind that. And then I'll do a demo of two Kube clusters, each with an Istio mesh, and then a service in cluster A will be able to call mybackend.cluster-b.global, and it'll end up in the right place, routed via an ingress and an egress gateway that can whitelist each other's IPs, with end-to-end mTLS, all of the good stuff of the service mesh, but across the globe, you know, for actual globally distributed systems.
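Under the hood, that .global name is backed by a ServiceEntry pointing at the remote cluster's ingress gateway. A heavily simplified sketch, with placeholder names and addresses, loosely following the shape of the Istio 1.1 multicluster docs (the real setup also involves DNS for *.global and shared root certificates):

```yaml
# Heavily simplified sketch of cross-cluster routing (Istio 1.1 era).
# Names and addresses are placeholders.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: mybackend-cluster-b
spec:
  hosts:
  - mybackend.cluster-b.global   # the .global name the client calls
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
  addresses:
  - 240.0.0.2                    # virtual IP used only to intercept traffic
  endpoints:
  - address: 192.0.2.10          # placeholder: cluster B's ingress gateway
    ports:
      http: 15443                # the gateway's mTLS passthrough port
```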
Very cool. And I think the recordings of KubeCon will be on YouTube, so people can watch you. Probably by the time this airs, they will be able to see you live.
Yeah, they get them up quick. Live on tape.
Yeah. And maybe, I'll say, if the live demo doesn't work, I'll send you a frantic email and ask you to cut this section.
All right, well, thank you so much again, and
enjoy wherever you are right now. I mean, I know it's evening for you.
It is. I want to say I'm in Vilnius, Lithuania. It's my first time here, and it's really nice. So I just wanted to plug Vilnius.
Thanks to all the people in bars and restaurants who've been friendly to me.
It's a nice place.
Speaking of feedback, we would love to have Matt back on.
We can have him back with or without feedback.
But if any of our listeners have any feedback or other Istio topics they would like him to explore with us, please let us know.
You can tweet us at @pure_DT.
Or send an old-fashioned email to pureperformance@dynatrace.com.
Matt, thank you so very much.
Andy, great to be on with you as always.
Thanks for having me.
Thank you, everybody.