PurePerformance - An Introduction to Service Meshes and Istio with Matt Turner
Episode Date: July 22, 2019
To service mesh or not? That's a good question! Not every architecture and project needs a service mesh, but for running distributed microservices architectures, service meshes provide a lot of essential features such as service discovery, traffic routing, security, and observability. We invited Matt Turner (@mt165), CTO at Native Wave, to tell us all we need to know about service meshes. We get a deep dive into Istio, one of the most popular current service meshes, its architecture, and how the individual components such as Envoy, Pilot, Mixer, and Citadel work together. We also chat about the trade-offs between performance, latency, throughput, and service mesh capabilities. If you want to learn more, make sure to check out Matt's online content such as blogs and recorded conference presentations on https://mt165.co.uk/.
Native Wave: https://nativewave.io/
Istio vs. Linkerd CPU overhead benchmarks by Michael Kipper:
Initial observations: https://medium.com/@michael_87395/benchmarking-istio-linkerd-cpu-c36287e32781
Second analysis: https://medium.com/@michael_87395/benchmarking-istio-linkerd-cpu-at-scale-5f2cfc97c7fa
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always my co-host Andy Grabner is with me.
Hi Andy, how are you doing?
Not too bad, actually. Now that I'm talking in the right direction of the microphone, I believe people can actually hear me, because that was my problem in the beginning.
Yes, you're a performance genius, but not quite an audio maestro. You'll get there, Andy, you'll get there.
And your voice: I've got to say, when I'm editing the shows, I love hearing you on the microphone instead of the headset. I'm sure the listeners do as well. So that's awesome.
Yeah, I got that feedback from other listeners too. The audio quality definitely increased. They also said the quality of the content didn't get better, but at least the audio quality increased.
Well, I think today's content will be very good, because it's sort of a new topic, one we haven't really covered too much yet. And I think it could be confusing when you're first getting into it, depending on your level.
Why don't you go ahead and introduce the topic and then introduce our guest, Mr. Grabner.
Yeah, sure.
I mean, first of all, I want to say thanks to the folks at Dev Experience in Romania who introduced me to Matt Turner, our guest of today. I met him at Dev Experience in Iasi, a city in the northeast of Romania,
where he gave a talk on the life of a packet through Istio.
Now, that was very interesting for me because I learned a hell of a lot about network basics
and how routing works and how Istio works.
And that's when I reached out to Matt after the talk and said,
hey, there's so much
you know. And then he told me a little bit about his background. But instead of me repeating all
the stuff that he said, I just want to hand it over to Matt. Matt, I know you're with us online right now. If you could introduce yourself: your background, what you've been doing, what you're doing right now. And then we want to learn more about service meshes, Istio, what else is out there, and what people
have to know in case they just get started on that topic. Sure. Yeah, I'm here. Pleasure to be
with you. Can you hear me? Am I talking towards the right microphone? Yeah, it's good.
I passed the first test, first piece of equipment. Yeah, as I say, it's a pleasure to be here.
Thank you very much for having me.
Yeah, I met Andy in Iasi in Romania at Dev Experience,
which was a great little conference, actually.
Lots of really, really good advanced talks.
I actually sat in every session, you know, apart from the one I was presenting, and it was, yeah, really informative.
So that was a great place to be.
So, yeah, I'm Matt Turner from London in the UK.
I'm a software engineer and computer scientist by training.
And I guess for the past sort of five years,
I've always had an interest in systems.
I've always been building servers and Raspberry Pis and stuff at home.
That's, I guess, the hobby part of my experience with computers. And about five years ago, this DevOps thing started to happen, and the infrastructure and the platforms became more important. And I was like, oh, I know that stuff. I thought this was just a nerdy hobby, but this is really relevant now. So I guess I kind of got into that. I did a bit of old-school orchestration and management of VMs,
the sort of the early lift and shift to early cloud when that was OpenStack.
And I've kind of followed the technology since then.
So with Docker, with Kubernetes, and now with Istio and the other service meshes,
which are kind of at the forefront of the cloud native landscape.
Cool.
So, you know, the service mesh, you just brought it up.
And I think Brian actually asked about this when we kind of prepared for that talk.
Can you give us a quick 101 on service meshes: what problems we're trying to solve with them, and how they work?
Because I'm pretty sure, you know, part of the audience obviously is aware of it, but just to level set everyone.
Yeah, so I guess to turn to a service mesh
by way of a problem, if you think originally
that we would have a big monolith,
a big hunk of software that was probably written in C++,
or Java, or PHP, and that could be millions of lines of code.
And that thing worked. We became
quite good at writing those. We have dependency injection systems, we have modularization,
modular loading. And we became good at avoiding common anti-patterns, with things like inversion of control. But ultimately, all of your code was hosted in one big blob, and it ran in one big
process. So although you might have a domain-driven design aggregate that was almost a separate little
piece of software to another part of the system, to a different namespace, where one wanted to
call another, that would just be a function call, right? And any computer science 101 course will
tell you how arguments are actually passed, sort of in registers on the CPU. So that was really simple and really fast, and that never failed. But then our systems got bigger and bigger, we needed more and more scale, and so we broke that monolith up into microservices. And obviously there's a bunch of other reasons for that; I guess I don't need to explain microservices. You can then release every service independently, and whatever else.
But there was a tendency to take this monolith and to split it into all these pieces and run each one in a separate container without really fundamentally changing the way that the system worked and the way that communications across these boundaries happened.
So what you ended up with was, to give it a glib name, a distributed monolith. So you had all the same code as before,
with all the same failure cases that you could have before. But now things that were previously
a function call just between one class and another that could never fail now could because they're
going across a network and Kubernetes might have these two containers on different sides of the
planet. So we needed to cope with that.
We needed to add a bunch of resiliency to cope with this distributed system
that we've now built.
The early attempts at this were libraries like Hystrix and Finagle
from the Netflix open source stack that people have probably heard of.
And they gave your applications facilities on a sort of an
RPC call, like backoffs, like retries, like circuit breakers, timeouts with defaults,
that kind of thing. So each of these functions mitigated the fact that a network call now might
be unreliable. But they were in process, you know, you downloaded a library, you added it to your
Maven config, and that was a big hunk of code in each and every copy of the microservice.
And you needed to upgrade it, and you needed to do a rolling update of your thousand lines of microservice business logic every time this hundred thousand lines of Hystrix changed.
So what a service mesh does is it takes that functionality, it takes those concepts, and it moves them completely away from your application.
It moves them out of the process altogether.
So you can imagine if you need these kind of retries and timeouts,
then you might use a sort of advanced HTTP client on the one side,
and you'd set up an HTTP reverse proxy on the other side, like an Nginx.
You know, think of how your API gateway does a lot of these functions for you.
So what a service mesh does is it takes an HTTP proxy
and puts one next to every microservice.
So in Kubernetes speak, if you're on Kubernetes
and each service is in a pod,
then we have this HTTP proxy as a sidecar,
as another container in that pod.
And it intercepts all traffic on the way in and the way out.
On the way in, it'll enforce rate limiting and sort of parallelism constraints and stuff, and on the way out it'll do circuit breakers and retries and timeouts and everything else. So you get all of these network functions kind of for free, ambiently, from your infrastructure. You certainly don't have to write the code to do them, and you don't even have to vendor in the code to do them.
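For reference, the kind of mesh configuration Matt is describing looks roughly like this in Istio; a minimal sketch, assuming a hypothetical service called reviews, with made-up timeout and retry values:

```yaml
# Sketch: retries and timeouts configured in the mesh, not in the app.
# Service name and values are illustrative, not from the episode.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews                 # logical service name inside the mesh
  http:
  - route:
    - destination:
        host: reviews
    timeout: 2s             # overall deadline for the whole call
    retries:
      attempts: 3           # retry a failed call up to 3 times
      perTryTimeout: 500ms  # deadline for each individual attempt
```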
Oh, that's pretty impressive. In a couple of minutes, I think I finally understand the whole history and what problems we are really solving with this. And I just want to be honest here, because I'm definitely not the expert on this, but for me the big aha moment, from when you explained that in the beginning we had Hystrix and Finagle and now we have service meshes like Istio, is that the first approach was a library that you bake into your code. And then you made an interesting comment where you said if the library changed, you had to redistribute and recompile and redeploy all of your services. And obviously you're taking that burden away by extracting the functionality of a service mesh into its own entity. And therefore you're completely independent: you don't have to touch your code. The only thing you really do is inject a sidecar into the pods. And that obviously makes much more sense.
One question I wanted to get in for clarification too, because I know you two are about to run away with this. One of the parts that was explained to me, now that I have both of you on the... I was going to say on the phone. Now that I have both of you here: one of the other benefits that I heard about this,
and I just want to make sure I understand it correctly,
is that it's also like your services
will register themselves with the service mesh
so that you don't have to also tell your services
where all the other services are.
They just call into the service mesh,
the service mesh knows the map of that
and then routes it to where it's supposed to go.
Is that another one of the big benefits
of something like a service mesh,
or did I understand that incorrectly last time?
Yeah, yeah, that's another big advantage.
I mean, there's a few,
coming from the history and the problem,
there's a few things I missed out on.
We can talk about the rest, I guess, in due course,
but that service discovery is definitely a big piece.
The practicalities of, say, an Istio and Kubernetes system
are not quite that the service registers
with the service mesh, but it effectively works like that. The service mesh gains knowledge of
all the services. And as you say, a service that's making a call can use sort of a short name, a non-qualified domain name, or anything that's DNS-compatible, but it doesn't have to be a globally valid FQDN
that you will go to the top level name servers
and recurse to look up.
You sort of throw this request into the ether
with the correct host header
and the service mesh will get it to the right place.
Also, because it understands logical service names like that, and it understands that they're possibly comprised of several instances of the workload (in Kubernetes speak, you'd have one Service with a capital S, made of a couple of Deployments and lots and lots of pods), the service mesh can understand physically where they are: which region, which availability zone, which host they're on. And it can start to do things like route to the closest one for performance reasons.
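For a sense of what that per-service awareness looks like in configuration, here is a hypothetical DestinationRule; the host is a short, mesh-resolved name, and the outlier-detection values are invented (locality-aware routing itself is configured separately in Istio):

```yaml
# Sketch: a per-service policy attached to a short, logical host name.
# The mesh expands "reviews" to the full in-cluster service; values invented.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews             # short name, not a globally valid FQDN
  trafficPolicy:
    outlierDetection:       # eject instances the mesh observes failing
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```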
Hey, talking about performance, because I think this is a question that I heard people ask before when people presented about Istio and the service meshes in general.
So if you're injecting, let's say, an NGINX or a proxy into every pod, isn't that itself a huge overhead?
Isn't that itself very much complicating my architecture, even though I don't have to
take care of it?
But potentially with every microservice, I get another service that sits in front of
it?
You do.
You do.
I would hope that it doesn't complicate the architecture too much.
I mean, you're right, all of this code and infrastructure is being added.
And I guess a big thing I missed out in the introduction is that if you want to be pedantic, a service mesh is a mesh of services, right?
It's all of these services talking to each other through this proxy that gives extra features. In order to be able to configure those proxies, in order for them to be able to do the service discovery that we just talked about, in order for them to know what characteristics to apply, they need a control plane. So they need one or more other services that accept high-level configuration documents and then pass that out to the little sidecars.
So the architecture, if you look at every detail,
does get more complicated.
But as I say, you've dropped Hystrix from all your Java apps, you've dropped whatever advanced configuration you're doing to Python Requests to try to do the same thing, and it's never quite at parity because they're different libraries for different languages.
So yes, it is there.
But the architecture that the user sees, it should be completely transparent, is my point. The injection is transparent.
The application needs no configuration to know that the service mesh is there.
On the performance question, yes, there is an implication.
So there are a bunch of service meshes on the market.
One of the early ones was called Linkerd1. So Linkerd version one, it was basically a middle proxy.
So there wasn't actually a proxy per service.
There was one per Kubernetes host and they all shared the same one.
It was essentially the Finagle library wrapped in a little bit more code, so it was kind of a single bottleneck. It was on the JVM; it was actually written in Scala.
So it was kind of worst case JVM performance. So it had a lot of features, but it wasn't fast.
The newer service meshes do address that. Istio doesn't use Nginx as a proxy. Actually,
it uses a newer piece of software called
Envoy that came from Lyft, the ride-sharing company. So that's written in C++, very deliberately
to be a high-performance piece of software. It does its own thread scheduling. It's got its own
RCU subsystem. It's obviously a non-garbage collected language. Matt Klein, the main author,
has written a bunch of really interesting blogs on the performance tuning
and the trade-offs they've made actually between throughput,
because you can always just auto-scale more pods
to get more throughput.
They've actually traded throughput down
to get better latency
and to get tighter bounds on the latency.
So performance is definitely at the forefront of people's minds
and a lot of thoughts gone into it.
Linkerd is now also on version 2,
and their proxy is written in Rust,
which again is very close to the metal, with no garbage collection. Hopefully lots
of performance benefits. But it can't
be avoided. We did some empirical
measurements nine months
ago. Take this
with a pinch of salt because there's
a thousand variables in an experiment like this, but we saw about two milliseconds per service being added for an Istio system:
one millisecond roughly to traverse Envoy, the proxy itself,
and actually one millisecond jumping into the kernel
and out again a few times to jump through all the IP tables rules
that do the interception because that interception is transparent.
Now, you can mitigate that by telling your application
about the sidecar, having it send traffic directly there.
Then you don't have as many context switches
into sort of IP tables and back.
But yeah, there will always be a performance hit.
If you're trying to do high-frequency trading,
maybe it's not acceptable for you.
But for most other applications, I'd hope it's just a blip in human time.
Yeah. I mean, that was the idea of you can't get something for nothing, right? And it's always the
trade-off. One question about the performance, the overhead thing, though, I don't know if you saw
this. I came across this a little, actually not very long ago, because it came out in April.
It was an article by this guy, Michael Kipper, where he benchmarked Istio and Linkerd CPU,
and he found, not on response time,
but I think Envoy was about 50% higher
in CPU utilization than Linkerd.
I don't know if you saw that or not,
but it's just kind of interesting.
But again, that's not necessarily impacting your application.
That's more the service mesh usage.
But yeah, there's always going to be trade-offs, right?
And I think that's the thing. And you have to look at what those trade-offs are for what you're getting. If you're
going to go back to managing all those communications manually, now you're paying for all these people
to be able to know it, track it, and be able to keep that configuration up and running and maybe
even have software failures because it's not maintained versus suffering one or two milliseconds
extra on a transaction, which is really, as you said, a blip. Yeah, exactly.
And what is the cost of a code path traversing a thread on your CPU,
traversing all the way through Hystrix, which you can now remove?
Yeah, I think the answer is admit to yourself
that you're never going to get a free lunch.
Work out what your actual requirements are and just test it.
Yeah, I've been following Michael's stuff.
I'm a big fan.
He did that initial work,
and then he did something more recently, actually,
a higher scale, a much higher load.
And actually, it tipped.
I actually thought he found that Envoy was quicker
in the first set of experiments,
but then under a lot of load, it was Linkerd, or something.
I can't remember, but he's found some very interesting stuff.
He's very methodical about his sort of experimental conditions.
If they exactly match your environment, then that's great.
If not, they're just an indication, you know, spin up your own load test, work it out.
And then, yeah, exactly as you say, go and look at what your business value is from this.
Go and look at what the opportunity cost is of not doing this.
Look at how much it costs the people.
All of these are the questions that should factor into
a big tech decision like this.
Yeah, and you are right, he did come up with a follow-up on May 8th. So, definitely. Hey, so quick question here: you mentioned Envoy. And obviously, if you Google or Bing or whatever search engine you prefer, if you look for, let's say, Istio architecture, there are a lot of great overview pictures out there. And we'll add these to the links for the podcast proceedings. You have a lot of lectures out there, and articles that you did, and presentations on Istio, for beginners to advanced people.
But if you look at the architecture, then you see Envoy being obviously injected into the pods as a proxy to intercept all the traffic.
Now, you mentioned earlier there's a control plane kind of on top of it.
And could you explain a little bit more about what makes up the control plane
so people are a little bit more aware when they hear things like Pilot, Mixer, and Citadel?
Yeah, so I guess what I'll talk about is Istio in the setting of running in a Kubernetes cluster, because that's almost always what's written about, and almost always how it's used. Istio can run outside of Kubernetes, but you have to do a lot of stuff yourself; it gets quite complicated. In Kubernetes, as I'll talk about, some parts of Kubernetes are leveraged to help control things as well.
But so basically, yeah, you've got the Envoy proxy as a sidecar to every service, which means that every Kubernetes pod has another container running Envoy.
Envoy is quite a modern piece of software, so it takes configuration over an API,
not off a hot reloading file on disk.
So Envoy is a nice little thing that's sitting there waiting to be configured,
but something needs to tell it what to do.
And you don't want that to be you
because it would be super complicated.
So there's this control plane with which you interact
and you give it a high level configuration
and it tells all the Envoys what to do.
So there are sort of three or four components,
five, six, depends how you want to look at it.
But the three major
ones, the first one is this thing called Pilot. Pilot is basically the configuration system,
the sort of configuration compiler, if you like. So if I want Istio to implement a
fault injection, 10% 500s, right, for chaos, then I write a YAML file that looks a lot like a Kubernetes YAML file: an Istio YAML file, to an Istio schema, that tells Istio that I want a fault injection, and the return code should be 500, and it should happen 10% of the time, or whatever. I submit that to Istio. Pilot then effectively compiles that, transpiles that, into the configuration format that Envoy wants, which is a different document form.
And it sends that down to every Envoy to tell it what to do.
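That fault-injection document looks roughly like this; a sketch with an invented service name, using Istio's VirtualService schema:

```yaml
# Sketch of the "10% 500s" fault injection described above.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings             # invented service name
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        percentage:
          value: 10         # inject the fault on 10% of requests
        httpStatus: 500     # respond with HTTP 500 on those requests
    route:
    - destination:
        host: ratings
```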
Where the integration with something like Kubernetes comes in is what we were talking about earlier, say service discovery.
The sidecars don't actually sort of call in
and register with the mesh.
What actually happens is Pilot goes and talks to Kubernetes.
It says, well, I'm running in a Kubernetes cluster
and I want to know about all the different pods,
all the different workloads.
Well, all it really needs to do is effectively a kubectl get pods against the local Kubernetes API server.
So it does that.
It takes in service discovery information from a bunch of places, including Kubernetes,
and it takes in all the extra configuration documents that you give it to give it any
kind of non-default settings. And it compiles them and it pushes them out to all the envoys.
The next component is something called Mixer. So where Pilot is for sort of upfront configuration,
the kind of thing that you would write into a config file
if you were configuring it manually,
Mixer is like online decision-making.
So say I've got a rate limit.
Say I've got three copies of the pod for service A,
and I want a thousand QPS rate limit across all of them
because maybe they all call off to the same database behind the scenes.
So I can scale to as many as I want,
but that doesn't help the bottleneck in my system,
which is this one database.
So I can have one or three or 5,000 copies of the service A pod,
and they only really can take a thousand QPS between them,
because each time I touch one, it calls the database.
That kind of rate limit can't be pre-programmed into a configuration file.
Each envoy could get configuration saying, well, your local rate limit is 1,000.
But if you want that kind of global coordination, then you need effectively a global counter,
a global histogram bucket.
So that's the kind of thing that Mixer provides.
So there's a very tight communication loop between the envoys and Mixer.
So for something like a rate limit or for whitelist, blacklist policy, every time there's a request,
the envoy sidecar will call to mixer and say, hey, is this okay? And mixer will check its rate
limit bucket or its up-to-date policy list or something, and it'll give a reply.
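In the Mixer model, that kind of global quota was configured with Mixer's own resource kinds. The sketch below follows the general shape of the memquota adapter configuration from the Istio 1.1-era docs; names and values are invented, and a complete setup also needs a rule, QuotaSpec, and QuotaSpecBinding, omitted here:

```yaml
# Rough sketch of Mixer-side rate limiting (Istio 1.1 era, memquota adapter).
# Incomplete on purpose: the rule, QuotaSpec, and QuotaSpecBinding are omitted.
apiVersion: config.istio.io/v1alpha2
kind: handler
metadata:
  name: quotahandler
  namespace: istio-system
spec:
  compiledAdapter: memquota
  params:
    quotas:
    - name: requestcountquota.instance.istio-system
      maxAmount: 1000       # the shared 1,000 QPS budget from the example
      validDuration: 1s
---
apiVersion: config.istio.io/v1alpha2
kind: instance
metadata:
  name: requestcountquota
  namespace: istio-system
spec:
  compiledTemplate: quota
  params:
    dimensions:
      destination: destination.labels["app"] | destination.service.name | "unknown"
```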
The other, I guess, fairly big feature that I missed out in the
introduction to Istio is the observability, the telemetry that you get for free. So because
all of these proxies are on the wire and they're handling, you know, actually passing through every
network transaction, they can produce a log at each one and they can produce metrics about all
of the different characteristics and all of the rates. They can produce trace spans, if trace headers are being propagated. Again, all of that stuff that you'd have to import a Zipkin or a Jaeger client library to do, and then wire up your web framework's logging and all of that stuff: if you've got this universal proxy on the wire for everything, Istio can totally do that for free. And the way that works is through Mixer.
So Envoy tells Mixer sort of the raw data: it basically sends it the headers of the transaction that's gone through. And then Mixer will send that on.
Mixer is configurable, so you can say,
all right, I've got a Prometheus server over there
and an older Graphite server over there.
Both of those want metrics.
And my logging server is Elasticsearch over there
and so forth. So that's what Mixer does. It's a central aggregation point for real-time
policy and for observability. The next major component is, I think, called Citadel. So that
deals with the security. The Istio sidecars can give you mutual TLS between all your pods. So if you think about a normal microservices setup, you tend to either not do TLS,
you just do an HTTP, plain HTTP call between pods,
and you sort of make an argument about defense in depth saying,
well, I'm in a Kubernetes overlay network in a VPC, it's fine.
But MTLS certainly doesn't hurt.
Or you would do TLS by giving Netty some certificates, and maybe you get one-way TLS, and you kind of bake these things in when you build the application, and they expire after a year and never get rotated. All of those are sort of bad security practices. So Istio can set up the TLS tunnels for you: mutual TLS, so verification of both ends, with short-lived certificates that are regularly
cycled. All of this stuff is totally possible manually, right? You just have to write the code.
So Istio has done that for you. And Citadel is the component that mints those certificates
and issues and rotates them.
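At the time of this episode (Istio 1.1), turning that mutual TLS on mesh-wide looked roughly like this; a sketch, and note the authentication API has been reworked in later Istio versions:

```yaml
# Sketch: mesh-wide mutual TLS in Istio 1.1. Citadel mints and rotates
# the certificates behind the scenes; no application changes needed.
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default       # the mesh-wide policy must be named "default"
spec:
  peers:
  - mtls: {}          # require mTLS between sidecars
```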
There are another couple of things that are sort of down in the weeds of making the system work; they're probably not that important. I guess the only other thing I'd mention is the sidecar injector. This isn't really an Istio component. It's leaning
on a Kubernetes feature, a mutating webhook admission controller, if anybody's familiar
with those, which basically says, again, that the developer experience gets to be better.
So as a developer or an operator, I write a Kubernetes YAML file saying, this is my deployment,
and all of my pods have one container in them, which is my application code. You don't mention the sidecar,
you don't have to know it's going to be there. And when that YAML document is submitted to the
Kubernetes API server, this mutating webhook admission controller modifies that document and
says, I'm going to add another container to the containers list in the pod spec, which is the Istio sidecar container.
And as I said, that's free and that's transparent.
And that's done on a Kubernetes system by Istio hooking this powerful Kubernetes feature.
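Opting a namespace into that injection is just a label on the namespace; a minimal sketch with an invented namespace name:

```yaml
# Sketch: with this label, the mutating webhook adds the Envoy sidecar
# to every pod created in the namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app              # invented namespace name
  labels:
    istio-injection: enabled
```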
Pretty cool.
Hey, Matt, thank you so much for the overview.
I mean, that was, I think, at least for me, also looking at the architectural diagrams... I suggest people that are listening to this, maybe listening to this again, just open up the architectural diagram, because it really makes a lot of sense the way you explain it. I also like a lot the flexibility that Mixer gives you, obviously then implementing or enabling a lot of the features I think we all need to think about in large distributed systems: everything around traffic control, as you said, enforcing rate limits, and all that stuff.
Now, we didn't invite you just to give a quick overview of Istio. I think I also want to learn from you, especially with the work you do right now, because I think you help organizations with Istio and with microservice architectures. Can you maybe give us a little insight into what people are struggling with, what problems people face, and what people should be aware of when they go down that road of a service mesh, or in particular with Istio?
Why would they also maybe reach out to you again and ask for more feedback?
Yeah, it's an interesting question.
So, briefly, what I do is I'm CTO at a cloud-native consultancy in London called Native Wave. And as you say, what we do is we help organizations that are looking to become cloud native.
So organizations that are looking to take whatever software stack
they've got and move it to a public cloud.
And the reasons they do that are varied,
but they're always looking to get a hold of at least one
of the advantages of public cloud and cloud-native computing, right,
that we all know about.
The thing is it's very complicated.
There's a lot of it.
I mean, if you've seen the sort of CNCF landscape map recently,
you know, it's now at the point where you can't read the logos on one page.
There's so much there.
So people have kind of been tuned in
for the last few years. They've heard that there are all these massive advantages and we've just
talked about what Istio can do. And I think it's got a bunch of great features that people will
benefit from. But organizations, especially organizations that weren't born in the cloud,
they struggle to know which of these things they want and they struggle to know how to get there. And the kind of depth of knowledge that you need to run Istio or to run
Kubernetes or to run Vault or, you know, any and all of these systems at production sort of, you
know, scale and reliability is really deep. So I think we see organizations that don't want to go
and simply don't have the capacity to go and learn all of this stuff
for all of these products and then take a decision about what they should use.
So it's okay for me to go to a conference and say, you should use Istio. It's great.
But that means that in order to know that it's doing the right thing and not breaking your
application, you need the monitoring set up properly, right? If you run Prometheus at
scale, it's not that easy. And that, you know, relies on a working Kubernetes cluster, which
relies on all this other stuff. Yeah. So to your point about when people should use it, you know,
what problems people have adopting it, it is often seen as the last step in the adoption of all of
these sort of cloud native technologies, which can be a long
way down the road for a lot of people who are maybe just starting with Docker or just starting
with one of the cloud providers. And it could be quite a daunting thing to sort of build up to.
So we really go in and sort of help organizations cut through all of the vendor pitches maybe and
work out what technology they need. And then we help them design what the right stack for them would look like.
And obviously we can help build it. And actually we have a managed service platform
as well. So we can just help host it as well. You can outsource your IT function
to us, which is what we see a lot of companies really, really wanting to do.
So they get all the benefits of the latest cutting edge cloud native technologies
because we are experts in that.
Not because we're particularly clever,
but because these are the kind of conferences and podcasts
that we spend our time at.
And then the idea is that the developers in the companies
can just finally live that dream of focusing on their business logic.
They write these 1,000-line microservices
that don't have to care about where they run
or what their network is or whether things are on fire.
So yeah, that's kind of what we saw in the market.
And that's why we've decided to do what we do.
So it's great for me.
I get to learn about all this stuff,
bring it to the table, build the best platform I can.
And hopefully other people get the advantages
of this stuff without the pain.
Yeah.
And we also, I mean, I think we talked about this in Iasi at the conference: here within Dynatrace, we just started an open source project called Keptn, where we are also using Istio for traffic control. When we do blue-green deployments or any type of deployment strategy, we are using Istio, and Keptn is automatically configuring Istio and creating all the Helm charts and putting them into Git.
So there's also a lot of lessons learned when we played around
with this latest and greatest technology.
And really our hope is to provide a platform that really allows
these teams and organizations to really, let's say, benefit
from what cloud native promises, which is focus on your code, write your microservice,
deploy it, and let the cloud-native frameworks
that are out there handle all the tough work,
whether it's traffic routing,
whether it is the different type of deployment models,
whether it's scaling up, scaling down, and things like that.
But the way we learned it, and I'm sure you've learned it as well: if you go down into the weeds, there's a lot to it, and it's not as easy as it sometimes looks. But we try to make it easier by figuring out the best practices and then combining the right tools and providing a good service, or a good framework, on top of it.
Yeah, I hope so.
And everybody has their specialism, right?
I'm sure you've learned a lot about Istio,
a lot more than you claim to know. I'm sure you know loads from writing Keptn,
because obviously it has to lean heavily on Istio.
I'm super, super excited about the Keptn project.
I think it's great.
I think it's almost the last missing piece on top of that stack
that we've been talking about, right,
that actually gives developers an interface
where they can do what they want,
which is here are the three versions of my software
that I care about at the moment.
I want an A-B test between this,
and I want an automatic rollback
if it blows up during the middle of the night, right?
Istio provides all of the primitives for that.
But again, you'd be sitting there pushing a lot of configuration documents,
even at Istio's level of abstraction, if you wanted to do that.
So I'm super excited about Keptn bringing that to the table and automating it.
Yeah. Hey, so I know you're tight on time, I believe,
because you're actually right now somewhere in Europe and at a conference.
But I got a question for you.
So when is it maybe not a good idea to think about these things?
When is it, or what are the minimum requirements
from an architectural perspective from your app
to look into something like a service mesh?
When is it not smart to walk down that road?
Because I think knowing when it's not smart
is just as good as knowing when it is smart.
Yeah, it's a tricky one, right?
I guess I would say don't do science projects,
don't over-engineer more than you need to.
A number of people have come to me, because I talk about this stuff, or to Native Wave, and said, oh, we want Kubernetes, can you help? And I've said, why? And they said, oh, well, because I've heard of it, it's got all these advantages and all these features. And I say, right, how many services have you got? Oh, three. Okay, and what kind of load are you at? Oh, you're pre-release. Okay, so actually, you know, a Docker Compose file would be just fine.
Right.
And you use two EC2 instances.
So you've got a backup.
It's certainly not perfect.
We could all sit here and pick holes in that all day, but it's going to work and it's going
to work at that scale and it's going to be totally good enough.
The opportunity cost of sitting down for nine months and building a perfect platform is nine months where you're not writing your application code, where you're not going to
market and getting feedback and raising funding and all of that good stuff. So I think don't
build, as ever, don't build more than you need. The thing about service meshes is they are
really useful. I wouldn't necessarily go multi-region in your cloud provider
or even go to Kubernetes until it's ready, until you've got time.
I wouldn't even necessarily do microservices until you really have a need,
until you actually do have sort of real scale
or real development velocity problems.
And as I said at the beginning, we are actually quite good at,
you know, our IDEs and our tools and our frameworks make us quite good at writing fairly large pieces
of code. But if you are going to be calling across a network, I really do think you need
these kinds of features. Now, if you're in one language, yeah, you could use an in-process
library. There are, you know, libraries for Go and for Python and other languages like that.
If you're starting from scratch, if you're sort of born in the cloud,
then I would actually be really tempted to get a managed Kubernetes cluster.
It's a folly to run Kubernetes yourself.
I don't know why anybody does.
But get a managed Kubernetes cluster, install Istio or Linkerd2 into it.
It's really quite simple these days.
Turn the chaos on from day one. So turn chaoskube on and turn the Istio stuff on from day one.
And so, you know, do software development properly,
do continuous delivery from day one
under these simulated conditions of Chaos,
and then everything will be lovely.
And I really would go to a service mesh quite early
if you started from scratch.
I just think they're so valuable.
The one thing that does put people off is they build on top of this other stack of stuff that
you need, which I admit is a problem. I don't have a perfect solution for it other than to say that
now getting a managed Kubernetes cluster on one of the major cloud providers is really a case of
a few clicks. So hopefully it's not that hard.
As for when not to do it, you know, if you have a big brownfield legacy site,
trying to shove one of these things in may cause you, you know,
may cause you a bit of pain.
It may not be what you need right now.
Istio has a bunch of ways to mitigate that.
You can turn it on sort of Kubernetes namespace by Kubernetes namespace.
You can do what's called extending the mesh. So you can have an Istio mesh running in a Kubernetes cluster that also talks to services on VMs. So if you're on VMs and you're doing a lift
and shift into containers, you can do a little bit of your workload, put it in a couple of
containers, put that in a cluster, have Istio
in that cluster, giving the advantage to them, and then have it set up just so that it can
still call out to the old legacy VM stuff.
So there's ways to migrate incrementally.
But yeah, I can't really help but say, I think it's a great thing.
And I think you should try to get all of its benefits.
And hopefully it is more simple now than people think. You know, it's now at version 1.1. I know it got a bad reputation, maybe, but those were the 0.1 days. It was released very early, with a very clear 0.1 label on it, and I think it got so much hype, so much coverage, that everybody said, oh, it's great, but it is a bit buggy. Well, yes, it said 0.1 on the tin, you know, and people just got carried away and tried to use it in production. Hopefully now you should have a much better time.
Yeah, cool. Well, then the good news is, if people have questions on whether it's the right time, or are seeking some advice, we will definitely make sure to put all of your information in the podcast proceedings so that they can reach out to you. Because obviously you do have a lot of experience
in how to make companies cloud-native or cloud-native ready.
And so we definitely, if it's okay with you,
obviously we'll direct them your way.
Yeah, please.
I mean, I'm personally always happy to have my opinions challenged
and have debates with people and learn new information.
So come find me on Twitter.
I'm sure you'll put that information up. And yeah, Native Wave can help at a company level as well.
Yeah, perfect. Brian, is there anything else from your end?
No, I think it might be time to go ahead and do some of the good old Summarator. Do it now.
So, folks, what I've learned today: there's obviously a whole lot to Istio. What
helped me in the explanation from Matt, which was phenomenal, explaining the different pieces
of the architecture, is just look at the architectural diagram, see what Envoy is doing
as the proxy that sits in front of all of your services. And then the control plane on top,
which makes sure to propagate the configuration to the envoys,
the mixer that is doing all the telemetry
and then is doing real-time configuration changes and traffic changes,
and then also Citadel, which is one of the components for secure communication
and I'm sure a lot of other things too.
But, Matt, I think this was extremely useful for me
and also thanks for answering the questions
on when it might not be the right time
because we want to educate people on new technology,
but we also want to make sure they understand
when it might not be the right time.
And if they're still uncertain,
then obviously we will direct them
to some of the material that you put out there
and also make sure that people know where to find you.
I know you are traveling the world for different conferences.
We met in Romania.
I know you are in Barcelona at KubeCon,
and I think there's other things coming up.
By the time that this airs, it's going to be July,
and I know you enjoy probably a nice, quiet summer,
but in case we have people that want to reach out to you and disturb your summer, we'll still send them your way.
Yeah, I think luckily the conference season all came at once this year. I think luckily I'll have safely got back to London before this airs, so nobody can find me. But no, of course, reach out online instead. I love talking about this stuff; I find it super interesting. And I'd like to have an impact, as much as Native Wave would like to help people work out whether this stuff is right for them and implement it. I think a lot of it just comes down to education: if you understand the systems, what they're trying to do and how they work, then you can make an informed decision for yourself as to whether it sounds right for you now. Which is why I kind of go around talking about this stuff.
Yeah.
Well, yeah. And in terms of people finding you,
when Andy said we're going to put where people can contact you,
he did mean, you know, your home address, you know,
when you're typically going to be on the tube and all that.
So, yep, come find me on the Bank branch of the Northern line, 8:30 a.m. every day, if you can elbow your way through the crowds. I'm the guy in the Kubernetes t-shirt.
Yeah, exactly. I want to thank you a
ton for this, because it really solidified this all for me.
And I do want to mention again, just stress again for anybody who wants to completely wrap their head around this, do as Andy suggested.
Grab the architecture diagram, take the section where Matt talked about all that, and just look at that while he's talking because it really solidifies it all.
And I think it just also goes to speak again.
I brought this up on a previous podcast, Andy,
about the maturity of all this stuff.
You know, Kubernetes flourishing, taking off
and how now there are so many different services
around these things.
In the earlier days of all this, you know,
cloud native experimentations and all,
it was all build your own
and you had to find the time to do it all yourself as well. But as you say, Matt, even with your organization, there are services now around it; there are a lot of services around so many different aspects of it. It's really cool to see how mature this has become in so few years. It's exciting times, I think.
Yeah. And I could talk all day about how Istio can do protocol translation on the wire and transparent database sharding and
all of these things. And I think the technology's always been there, and people are just thinking of novel ways to use it. I don't think service meshes were ever designed to be transparent database sharding systems. But actually, if you think about it and you know how it works,
you can totally make it do that.
So I don't know whether anybody on the Istio team saw that coming two years ago,
but it's a thing.
So I think we're really just getting started with this kind of technology.
It's super exciting.
For me, it sounds like you're just introducing the next podcast.
Oh, I signed up.
Yeah, I think you are. We haven't mentioned it in a while, but we used to have a running competition on repeat guests. I think the most we got up to was three or four, I forget. But it's not too late to join the challenge.
Yeah, well, exactly. Well, there's a bunch more stuff where that came from.
Awesome. Let's see what people want to hear about.
I'd be really interested to see the feedback on this session.
I have one, hopefully a quick one.
What are you talking about at KubeCon?
At KubeCon, I'm talking about one of the new features in Istio 1.1, which is a way to basically make easy calls between service meshes.
So if you've got two Kubernetes clusters in different regions,
you absolutely want to run two copies of Istio with a separate control plane each.
So a service in cluster A needs to be able to talk to a service in cluster B
for whatever reason.
You always could do that with Istio, but it was super complicated and I wrote like a blog
and a config generator and stuff for it back in the day. Istio 1.1 makes a lot of that first class.
So I'll just kind of explain the theory behind that. And then I'll do a demo of two Kube clusters, each with an Istio mesh, and then a service in cluster A will be able to call mybackend.cluster-b.global, and it'll end up in the right place, routed via an ingress and an egress gateway that can whitelist each other's IPs, with end-to-end mTLS, all of the good stuff of the service mesh, but across the globe, you know, for actual globally distributed systems.
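Under the hood, that .global name is backed by a ServiceEntry pointing at the remote cluster's ingress gateway. A heavily simplified sketch, with placeholder names and addresses, loosely following the shape of the Istio 1.1 multicluster docs (the real setup also involves DNS for *.global and shared root certificates):

```yaml
# Heavily simplified sketch of cross-cluster routing (Istio 1.1 era).
# Names and addresses are placeholders.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: mybackend-cluster-b
spec:
  hosts:
  - mybackend.cluster-b.global   # the .global name the client calls
  location: MESH_INTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS
  addresses:
  - 240.0.0.2                    # virtual IP used only to intercept traffic
  endpoints:
  - address: 192.0.2.10          # placeholder: cluster B's ingress gateway
    ports:
      http: 15443                # the gateway's mTLS passthrough port
```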
Very cool. And I think the recordings of KubeCon will be on YouTube, so people can watch you. Probably by the time this airs, they will be able to see you live.
Yeah, they get them up quick. Live on tape.
Yeah. And maybe, I'll say, if the live demo doesn't work, I'll send you a frantic email and ask you to cut this section.
All right, well, thank you so much again, and
enjoy wherever you are right now. I mean, I know it's evening for you.
It is. I want to say I'm in Vilnius, Lithuania. It's my first time here, and it's really nice. So I just wanted to plug Vilnius.
Thanks to all the people in bars and restaurants who've been friendly to me.
It's a nice place.
Speaking of feedback, we would love to have Matt back on.
We can have him back with or without feedback.
But if any of our listeners have any feedback or other Istio topics they would like him to explore with us, please let us know.
You can tweet us at @pure_DT.
Or send an old-fashioned email to pureperformance@dynatrace.com.
Matt, thank you so very much.
Andy, great to be on with you as always.
Thanks for having me.
Thank you, everybody.