PurePerformance - Service Meshes: From simple load balancing to securing planet scale architectures with Sebastian Weigand
Episode Date: July 20, 2020
Whether you are still researching whether you need a Service Mesh or can simply use a load balancer, or you are already deploying multi and hybrid-cloud architectures where Service Meshes help you secure the location-aware routed traffic: in both cases, listen to this episode! We invited Sebastian Weigand (@ThatDevopsGuy) back to our podcast, who wrote papers such as "Building a Planet-Scale Architecture the Easy Way". In this episode Sebastian walks us through why Service Meshes have gained so much popularity, what the main use cases are, how you should decide whether or not to use Service Meshes, and which challenges you might run into as you expand into using more features.
https://twitter.com/thatdevopsguy
https://files.devnetwork.cloud/DeveloperWeekNewYork/presentations/2019/scalability/Sebastian_Weigand.pdf
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and joining me as always is the Commissar himself, Andy Grabner.
Andy, how are you doing today?
I'm very good. Shall I now sing the English version of the Commissar or the real version?
Well, you're Austrian, so sing the real version.
Or you can even,
the one thing I was trying to fit in before was if you go one country over,
we also had 99 Luftballons
as well as 99 Red Balloons.
Yeah, that's right, Nena.
I'm sure it was very big in your country as well.
So people probably have no clue
what we're talking about right now.
Probably not.
Probably not, yeah.
Except for some of the older people.
They might know those songs. Yeah. No, it's "Der Kommissar", right? Did they also translate... how did they write their "Kommissar", with a K or with a C? I don't know. With a K, I think, at least. I can't look it up right now. But in case people are familiar with the song Don't Turn Around,
and the English lyrics of Don't Turn Around,
that's not the original version.
The original version is,
And I will stop singing now
because I don't get paid for that.
And he also did Amadeus,
Rock Me Amadeus, right?
He was pretty big.
David Hasselhoff, though, I think was bigger than...
Who was bigger over in that area?
Was Hasselhoff bigger, even though he wasn't from there?
Because I know Hasselhoff got really, really big over in Berlin.
Overall, Falco is our hero.
And he had some big things when he was in the 80s.
So, yeah.
All right.
I think this podcast today is about Falco and Austropop.
Isn't there some kind of way you can segue from that? Come on, Andy, you're really good at these obscure segues.
How do I segue from Austropop from the 80s to, uh, service meshes? Let me figure this out. I know, I know...
Just go. You better go.
Yeah, we just go. Now we have to... serve it. So, I brought up the term service meshes. Well, service meshes is a big topic, right? And it was actually suggested by a previous guest, Sebastian, who is back on the show today. Sebastian, are you there?
Yes, thanks for having me back.
It's great to be back.
How good are you with lyrics of old Austrian songs?
I mean, I can Google it with the best of them, I suppose.
You don't have that on your resume?
Not really, no. "15 years of experience with Kubernetes, and I know all the Falco songs." No, but that would be an interesting role that that would be a requirement for. I can't imagine what it would look like, but it would be fascinating to watch as a bystander.
You should check out the job descriptions at Dynatrace.
Hey Sebastian, last time we talked, we talked about a topic that is obviously still hot: SLO adoption and usage in SRE.
You gave some great examples of SLIs, SLOs, SLAs, you gave the whole definition, and how this can help an organization, especially in a cloud-native world. I mean, not only in cloud native, but I think that's obviously where we see it more and more. And then you said, it's great that we talked about this, but there's a topic that is really, really hot these days, and that's service meshes and everything that we can do with service meshes, like multi-cluster architectures. Service meshes can do a lot of things.
And I think I know what a service mesh is.
I think I know what it can do.
We had episodes more than a year ago
with Matt Turner, who talked a little bit about Istio.
We also had Alois Mayr,
one of our technical product managers,
talking a little bit about how we monitor
service meshes like Istio.
But you proposed a topic, so I assume you know much more than we've learned so far.
So I would really like to push it over to you and say, what do we need to know about service meshes?
And especially what is new? What's upcoming?
What do people need to think about when they think about service meshes?
Yeah, no, that's a good question. I think it's an interesting topic just in general, because
it's almost kind of daunting if you're a sysadmin or operator in the modern day and age, right? So
people are telling you like, well, hey, we got to move out all of this stuff into microservices.
We want to containerize stuff.
We need to have a cloud native approach to the way that we do infrastructure design,
application development, DevOps in general, you know, the whole shebang.
And then there's this like other thing that's sort of existing in parallel, which is this concept of a service mesh. And there's a handful of different service meshes
out there. I tend to have more familiarity with Istio. I think it's a little bit further along in terms of its feature set and just its maturity model. But it's interesting
that, you know, right, you know, as soon as people feel like they got to the cloud or they got to
Kubernetes, now it's like, oh, cool. Now redo everything inside of a service mesh, because
apparently that's the newest hype du jour that we have to implement. But I'm here to say that it's actually not hype, right?
It is actually a really interesting and really awesome set of tools that enables some really interesting and awesome patterns in where the cloud computing market and where enterprises are going today. The big ones, like multi-cloud and hybrid architectures, are really powered by the concept of a service mesh. I mean, there's ways
of doing it outside of it, but then it's kind of janky and it doesn't quite work the way you would
think it would work. And there's a lot of extra glue that you need to be able to provide in order
for that to actually work properly. Whereas if you leverage a service mesh, you get so many advantages with not a whole lot of downside, which is pretty cool.
Again, the biggest issue is how do I manage it? How do I spin it up? How do I install it
and maintain the thing? And luckily, now the industry has sort of like kind of turned a corner.
And not only can you get really great managed Kubernetes engines from your provider of choice, but you can also get managed Istio, or managed Istio-like componentry, on top of things as well, which is really cool. So I think we're at a really awesome time right now to start taking a look at these things and to really start using them to their fullest extent, which is kind of fun. It's been a while since, I think, you last chatted about service meshes, so would it be helpful if I gave sort of a quick refresher on what a service mesh is, how it works, that sort of thing?
Yeah, that would be great. Yeah, especially the main use cases, right? I mean, I think we have an understanding, but as you said, a lot of things have changed in the last year or so. So yeah, a quick recap would be awesome.
So a service mesh, if you want a great definition, here's one that we've got: it's a platform that provides a uniform way to connect, manage, monitor, and secure services.
And that's kind of interesting.
So you have this concept of connecting services, of managing the services in terms of the way that they work, monitoring the services and also securing the services.
So there are kind of four big things that you get as a result of that. Now, you might say to yourself, well, I can manage services just fine inside of Kubernetes, or I can manage services just fine
in my thing that exists that is not inside of Kubernetes, because technically a service mesh
doesn't require Kubernetes to operate. It just tends to go hand in hand with that just by nature
of it being a distributed system and Kubernetes offering primitives to run
distributed systems. But if I were to ask the Kubernetes administrators out there,
a really simple question with respect to services, if I said, you've got an app, let's say,
and you've got service A that talks to service B, right? And you've got a bunch of other services
that also talk to service B, and you've got a couple that service A talks to that aren't service B. So A might talk to C and that sort of thing,
you know, standard everyday microservice type thing. You know, they all talk to different
backend services in different ways. If I were to say, what's the total throughput or the total
amount of bandwidth that's being used when service A talks to service B, how would you answer that question?
Like, think about that for a second. That's kind of challenging because while you can get
metrics out of a system, you don't necessarily have service level metrics. You can get like
pod level metrics, or you can get maybe application specific metrics, you know,
like something, something Prometheus, you know, something like that, but not really from the service perspective, especially important when it comes to like
dynamic services that will come and go and scale up and scale down as needed. Another great use
case would be something like Canary deployments. We talked about that in our, in our last episode
together, where we talked about SLIs and SLOs, we had mentioned like, there's some really
interesting things that you can do with, with Canaries to be able to get feedback from that and then be able to
release things. And then once you're really good with releasing things, then you can migrate the
different versions up and people can have an uninterrupted experience. If you want to leverage
Canary deployments though, right now, how do you ensure that a specific percentage of traffic goes to the new version that you have versus the old version? Well, right now with Kubernetes,
it's a little bit more, I don't know, it's not super polished. You'd have to just make sure that
your replica count is the correct proportion larger than something else. It's kind of weird.
It's kind of complicated. A service mesh brings order to this chaos,
right? It provides those network-level functions to your application. And it has a
variety of different benefits that you've got on top of that. But those are the sort of the
highlights. And I'm going to talk about a couple of the different benefits that we've got coming
up here. But does that make, does that track in terms of everyone's understanding of a service mesh? Or did I say
something that was weird, complicated, not in line with what you were thinking?
No, I think from my perspective, great. But I have a question about the example that you just brought up with the canaries. Does a service mesh then typically, and I know you know Istio pretty well, does Istio then not only take care of correct traffic routing but also scaling of the pods? That means I configure, let's say in Istio, I want to run 80% of my traffic on my main canary and 20% on my new canary. Is Istio then automatically figuring out how many pods I need in order to do this?
Or how does this work?
Ah, see, that's really good.
So Istio takes care of the traffic component,
but it lets your underlying Kubernetes infrastructure take care of the scaling component,
which is pretty cool.
So let's say you have an arbitrary number of pods on the backend.
It could just be one each, let's say one of version one and one of version two.
And you want, you know, 90% or 80% of the traffic to go to one of those two.
The way that Istio works, because all of the traffic that exists inside of the service mesh has to go through Istio's primary controllers.
In other words, like the gateway that you establish, it knows that
every packet coming in is destined for a specific service that's under its jurisdiction.
So it sort of acts as a gatekeeper, and then it figures out how much traffic should go to
whichever different backend deployment. So in that circumstance, as more traffic comes in,
Istio will always maintain that exact
proportion that you're looking for in that traffic splitting scenario that we talked
about, and then it will deliver traffic to those appropriate pods.
If those pods end up getting overloaded or need to scale because there's a lot of traffic
coming in and you have the appropriate scaling principles in place, you've got good custom
metrics and things like that,
then Kubernetes will scale them in and of itself.
So they're sort of like non-overlapping problems,
but at the same time, the ability to have all of the traffic come in
be so expertly controlled,
that's something that you only get with something like a service mesh,
which is pretty nifty.
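For reference, a minimal sketch of the traffic split being described, using Istio's VirtualService and DestinationRule APIs. The payments service and its v1/v2 subset labels are hypothetical; the weights hold regardless of how many pods back each subset:

```yaml
# Hypothetical example: route 80% of traffic to v1, 20% to v2.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
        subset: v1
      weight: 80
    - destination:
        host: payments
        subset: v2
      weight: 20
---
# Subsets map to pods by their Kubernetes labels.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payments
spec:
  host: payments
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```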
But wouldn't it make sense?
Kubernetes by default would scale based on resource shortage, right?
Let's scale up in case, let's say my Canary 2 is seeing so much traffic and now I'm spiking
CPU or exhausting CPU.
Now I'm scaling up. But wouldn't it make more sense because Istio has all this great information to scale based on maybe other metrics like, hey, response time is going up,
or we see an increase in failure rate and therefore we probably need to scale. Wouldn't
it make sense to take some of these metrics that are specific to a service or to a canary,
and then also take this into consideration, then maybe trigger something in Kubernetes and say, hey, we need to scale up?
Yeah, and what you've just described
is the concept of a custom metric
that you can scale on inside of Kubernetes,
which is super awesome.
So the horizontal pod autoscaler,
by default, just scales on CPU load.
It's not a particularly good metric.
Usually there's a standard linear correlation with the number of requests that are being processed and the amount of CPU that's used to process those requests, but not always. And that's not really the best SLI to go off of, remember we talked about that before, because the CPU is sort of tangential to the heart of the matter of the service. Like, I don't care how much CPU is being used.
I care whether or not those requests are coming out and working appropriately.
Just like we talked about before, like, I need to make sure that, like, let's say the
response time, like, let's say that's a super critical thing that I need to ensure that
my production service meets because I have this SLO or I even have this SLA because,
you know, a B2B or if I'm providing like
a software as a service or something like that, I have to make sure to hit that. Well, wouldn't
it be great if I can scale based on these types of metrics? And because Istio understands these
types of metrics intrinsically, it provides mechanisms that tie into the horizontal pod
autoscaler so that you can come up with better
custom metrics on which to scale. So that way you can say the number of requests per second coming
in or the average latency or the response time or something along those lines, whatever it is that
you want to key off of, you've got this nice like duality between metrics that are produced inside
of the service mesh and then your ability to do something with those metrics
or take actions, which is super powerful,
especially when it comes to production services
that span potentially multiple backends,
which is another interesting topic.
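To make that concrete, here is a sketch of what scaling on such a metric could look like on the Kubernetes side. It assumes a metrics adapter (for example, prometheus-adapter) already exposes Istio's request telemetry to the custom metrics API under the hypothetical name requests_per_second; the actual metric name depends entirely on your adapter configuration:

```yaml
# Hypothetical example: scale a canary deployment on request rate
# rather than CPU. The metric name and target value are assumptions.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-v2
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-v2
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # exposed by your metrics adapter
      target:
        type: AverageValue
        averageValue: "100"         # scale to keep ~100 req/s per pod
```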
Yeah, cool.
I know you said you have a lot of other stuff to talk, to tell about,
but I think another use case, and correct me if I'm wrong, but this is also where service meshes are great, is when we come to fault injection. I think I've heard this right from chaos engineers using service meshes like Istio to say, let's inject 10% faults and see how systems behave.
So I assume this is one of the other use cases we see upcoming,
especially as chaos engineering is kind of growing in popularity.
Absolutely. So in addition to being able to provide the security aspects and the observability
aspects, you can also do a lot of traffic control stuff inside of your service mesh.
And this can be something as really cool as chaos engineering
and fault injection, but it can also be just like standard best practices like rate limiting.
You know, if you've got an incredibly, let's say, noisy or demanding client that's accessing
these services inside of your mesh, then Istio can kind of say, okay, well, I know that this backend service can only do X requests per second. So while all the requests are coming in, Istio will kind of buffer them for you and then make sure that that service that's, let's say, overwhelmed has a chance to kind of back off.
You can also do things like circuit breaking. If a backend isn't working well, in some definition of well, then you can fail quickly. And that way you're not overwhelming other services that might be downstream of things.
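As a sketch of those traffic controls, an Istio DestinationRule can express both ideas: connection-pool limits cap what a demanding client can push at a backend, and outlier detection provides the circuit breaking. The orders host and all thresholds are illustrative, and exact field names vary somewhat across Istio versions:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: orders-circuit-breaker
spec:
  host: orders                       # hypothetical backend service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # cap concurrent connections
      http:
        http1MaxPendingRequests: 50  # queue depth before rejecting
        maxRequestsPerConnection: 10
    outlierDetection:                # the circuit breaker itself
      consecutive5xxErrors: 5        # eject a host after 5 straight 5xx
      interval: 30s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
```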
But the fault injection is super cool, because of course you can do things like the proverbial, or the analog to, unplugging the cord and seeing what happens to some of the packets.
Usually it'll just like retransmit and it's fine. But the really cool thing you can do with Istio is you can do
latency injection, which is really interesting. So rather than simply just like, you know, drop
every fifth packet, which you should still do to make sure that your services can recover from that.
But most of the time that's sort of handled, right? Like, you know, TCP is fairly good at dealing with that. But what if we just increase
the latency by like three seconds for some random distribution of packets that come in? What happens to your service then? What happens if you're sending a bunch of metrics or if
you're sending a bunch of data or you've got, you know, like a message queue or something like that, where a message just takes really long to be delivered,
especially when you have a lot of stuff running in parallel, like you might get something delivered
out of order. Like how does your application respond to that? It's really important to take
a look at those things because you'll probably run into that in production and it's better to
catch it beforehand rather than be reactionary after something goes down and you get paged at two o'clock in the morning, because no one likes that.
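For illustration, a minimal sketch of that kind of fault injection as an Istio VirtualService. The orders host is hypothetical, and the delay and percentages are just illustrative knobs:

```yaml
# Hypothetical chaos experiment: delay 10% of requests by 3 seconds
# and abort 5% with a 503, then watch how callers cope.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: orders-chaos
spec:
  hosts:
  - orders
  http:
  - fault:
      delay:
        percentage:
          value: 10.0
        fixedDelay: 3s
      abort:
        percentage:
          value: 5.0
        httpStatus: 503
    route:
    - destination:
        host: orders
```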
And all this is obviously possible because, I think, Envoy is used as a proxy, correct?
Yep, yep. Pretty nifty stuff. It's a really high-performant proxy, really works well, super programmable. It's an awesome component of Istio.
Now, also correct me if I'm wrong, but the magic behind the scenes is that Istio, or any other service mesh, probably similar, is injecting itself into the containers to control incoming and outgoing traffic, right? So that you can actually route it automatically through the proxies, without having to change the code of your app or your service?
Yep. It does that through this interesting concept of an admission controller and this
really fun way of essentially rewriting whatever it is that you've requested. So this is where
it's kind of interesting because Istio can work outside of, let's say, Kubernetes, but it makes the most amount of sense and the most
amount of automated sort of workings that you get sort of out of the box if it has this nice,
tightly coupled Kubernetes integration. So as a developer, I can just focus on making sure that
my microservice works. And then when I say, go run this, Kubernetes intercepts what I'd like to run, rewrites it a little bit, and then adds the appropriate Istio components, and then attaches that sidecar that has that Envoy proxy into it.
So it's running another container inside of that pod and then wires everything else up in the back end so that traffic flows through Istio first,
the Envoy proxy second,
and then eventually your app third.
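In the Kubernetes-coupled setup described here, that rewrite is typically switched on per namespace: labeling a namespace tells Istio's mutating admission webhook to inject the Envoy sidecar into every pod created there. A minimal sketch, with a hypothetical namespace name:

```yaml
# Any pod deployed into this namespace gets the Envoy sidecar
# injected automatically by Istio's admission controller.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    istio-injection: enabled
```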
Hey, can I ask a question?
This is kind of a weird one
because it dawned on me
when you mentioned something about,
you know, there's a lot of programming
you could do to Envoy.
When listening to this,
a lot of it sounds like you drop in Istio, it all runs perfectly, and you just have to tell Kubernetes to do it. But how much is that the reality? I'm sure a lot of it just works really well out of the box. In reality, though, how much maintenance, how much, let's say, old-fashioned-style sysadmining do people have to do with Istio, going in and controlling configurations and settings? Completely unrelated, but I'm thinking of the classic JVM, right? You've got to set your memory settings, threads, all this kind of stuff. Are there equivalent kinds of configurations, performance-tuning things, that you have to really pay attention to with this? And if there are, do enough people know about that?
And are there any good practices around that?
Yeah, that's a really good question.
So it's the case of pretty much
every sort of open source product
where it's, hey, here's the ingredients, good luck.
And then you're kind of looking at it like,
okay, this is going to be interesting.
Luckily, because we keep moving up the abstraction stack, right. You know, we don't really concern
ourselves with like big VMs anymore. Now we concern ourselves with containers and we really
don't even concern ourselves with containers. We have, you know, Kubernetes package management and
stuff like that. You know, you have like a helm chart to install something that's really complicated. You get to a point where a lot of the basics of what you would otherwise do with
systems administration has been codified and is available to you. So for example,
how do I scale the ingress that it uses, like that big gateway that Istio takes advantage of?
Well, it turns out that you scale it the same way that you would scale anything else. It has
the appropriate resource requests and limits put in its definition. So if you take
a look at some of the YAML that exists after you install Istio, if you just start going and taking
a look at what's running in your cluster, it'll come back with a bunch of data
and it will say like, this is how we scale this.
This is how we add additional pods
and so on and so forth.
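For instance, a trimmed, illustrative Deployment of the kind you might find: the ingress gateway is just another workload scaled via replicas and ordinary resource requests and limits. The values and image tag here are illustrative, not Istio's defaults:

```yaml
# Illustrative excerpt: the gateway scales like any other workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway
  namespace: istio-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: istio-ingressgateway
  template:
    metadata:
      labels:
        app: istio-ingressgateway
    spec:
      containers:
      - name: istio-proxy
        image: docker.io/istio/proxyv2:1.6.0   # version illustrative
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: "2"
            memory: 1Gi
```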
So from that perspective,
there's not a whole lot to do.
You still have to maintain the Istio components.
So like, you know, if there's new versions,
you want to make sure that things get upgraded.
If the, you know, the database schema changes, there's probably going to be an upgrade path and so on and so forth.
A lot of that, though, is relegated to whoever is managing your Kubernetes environment or your Istio componentry built on top of it.
So realistically, I would say, yes, there's a bunch of stuff that needs to be done. But depending on how much you want to take for yourself versus how much you want to leverage
managed services, it's really not something to really concern yourself with, especially
if you're like a mom and pop shop or even a really large scale multinational conglomerate.
You just need to focus on writing applications and then letting somebody else take care of
all of this other
stuff. It's kind of the same thing. Like, you know, we used to have a lot of, you know, DBAs
that would focus on optimizing and performance tuning for a lot of, let's say, you know, like
MySQL databases. Well now, I mean, if you're doing MySQL by yourself, then sure, you need to maintain
that. You need to tune things in the kernel and so forth. But if I just go to a cloud provider and I
say, give me a MySQL database, they've already done all that tuning for me. And I pay
a slight fee on top of whatever the cost of running the VM or the storage or the networking
or whatnot would be, so that I don't have to think about that. So from a business perspective,
it focuses you back on your applications that are actually providing business value
as opposed to being really removed from the system and being super inefficient where you're
dealing with all of this extra stuff like network functions, rate limiting, security,
stuff like that.
Like you shouldn't have to be focusing on this.
That should be something else that's dealt with.
And then the thing that you're running it in should also be dealt with automatically. It's funny because a lot of this kind of sounds like it's moving back to
where Cloud Foundry was trying to get to. Cloud Foundry doesn't seem like it picked up as much
as it could have because it was a very opinionated platform and everything was set. You literally
would just push your code and they were responsible for making sure that system ran well
by designing that system.
And then on the counter side, there was the complete
DIY side of, let me set
up my Docker, Kubernetes, all these pieces, but
now we're starting, it sounds like
some of this
stuff
that's been figured out is being pushed in
like Istio is
doing some of the opinions for you
and for the most part it's going to run as intended it's only going to be it sounds like
it also it's only going to be in some more the edge cases where you're going to have to dive in
and tackle it as opposed to doing full network maintenance on your own which would be the other
extreme interesting okay cool but of course it's important to remember that like your mileage may
vary and anything you're introducing like a shim between all of your network requests Okay, cool. But of course, it's important to remember that your mileage may vary.
And anytime you're introducing a shim between all of your network requests,
you're going to want to do your appropriate load testing and things like that to make sure that you figured out how to scale it properly.
And if you're doing advanced features like rate limiting, for example, inside of Istio,
well, obviously, a cluster that has rate limiting enabled and a cluster that doesn't
are going to behave very differently depending on how you load test them.
So there's still edge cases, or not necessarily edge cases.
There's still testing and things like that that you would do.
But I sort of take the stance that if you're doing DevOps properly, you're going to be doing that testing anyway. It'll just be, oh, we need to adjust some Istio policies as opposed to,
oh, we need to adjust some other thing
or create some library
that our coding language will support
to basically implement
that same functionality anyway.
So it's still sort of taken care of
in terms of your standard testing.
Just don't copy and paste
from Stack Overflow and you're fine.
Well, it depends.
If it's Kelsey Hightower writing a bunch of stuff,
you'd probably be okay, but
if you can't verify the source,
then be careful.
I got two more questions, and the first
one, the second one is on
observability, but the first one
is on if you
have a large Kubernetes cluster
and it's shared by different teams
deploying different types of apps
that they want to get separated.
They want to make sure things are separated.
Yet there's Istio on it.
And my question is,
is Istio always a shared component
on the complete Kubernetes cluster?
Or can you also run Istio
in, let's say, multiple instances of Istio
so that maybe version one takes care of these namespaces
and version two takes care of the other namespaces?
Instance one, instance two, to make sure that you have a clear separation.
Or is this something where I'm completely missing something here
and it's not a requirement at all?
So this is an interesting question
because the way that you're asking it presupposes the
existence of a singular cluster. While we can answer that question very easily, you run into a little bit more complicated scenarios when you want to have more than one cluster and you want to have the same sort of controls rolled out across the board. So I want to answer the question from both perspectives, right? So
on the one hand, Istio is, you know, if you want to boil it down, it's a series of programmable
network functions that allows you to not have to worry about any of the things that fall under
its jurisdiction
inside of your application code. A great example of that would be like, how do I enable authentication
and authorization between a bunch of different microservices? How do I enable encryption?
Well, you could if you had a big monorepo, you could have every service inherit from the same
code base that just imports the appropriate encryption library and go from there. Or you could let Istio do it for you, in which case all of the fantastic certificates
and mTLS configurations will just get done for you. And then you're good. You're off to the races.
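As a sketch, on Istio 1.5-era and later APIs, turning that on mesh-wide is a single PeerAuthentication resource; certificate issuance and rotation are handled by the mesh:

```yaml
# Require mutual TLS for all workload-to-workload traffic in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # mesh-wide when placed in the root namespace
spec:
  mtls:
    mode: STRICT
```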
Now, what's funny about this is, when we have different namespaces,
because we want to have fewer clusters in general. It's always better to have fewer clusters
than a lot of random clusters out there.
And there are different ways of achieving multi-tenancy,
even amongst like other,
like software as a service providers and things like that.
There's still ways of being able to do multi-tenancy
in a singular cluster.
But realistically, you'd want Istio to have access
to all of the different namespaces
because that's how you enable things like micro segmentation.
So you might have thought, well, how do I do firewall rules to prevent, let's say, team A from talking to team B? Not necessarily because the teams don't get along and just hate each other, but more like, team A is, let's say, a billing service and team B is a payment microservice or something like that. One does processing, the other one does calculations for bills and things like that. Well, those should be able to talk to each other, but the front end, which is another team, doesn't really need to talk to that specific billing backend. It just needs to put a request in once the shopping cart service, let's say, does something. So in those circumstances, by definition, you want to be
able to tell an overlay component that this should talk to this and this shouldn't be allowed to talk
to this other thing. In which case, that thing that needs to enforce
it, in this case Istio, needs to be aware of all of the things that are running inside of your
cluster. And it needs to be able to put up those barriers to either enable or disable communication
between them, which is kind of at the heart of the question.
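A minimal sketch of that micro-segmentation as an Istio AuthorizationPolicy. The billing and payments namespaces and the service account are hypothetical; once an ALLOW policy is in place, requests that match no rule are denied:

```yaml
# Only workloads running as the payments service account may call
# the billing backend; everything else (e.g. the front end) is denied.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: billing-allow-payments
  namespace: billing
spec:
  selector:
    matchLabels:
      app: billing
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/payments/sa/payments"]
```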
Now, I said that this gets a little bit more complicated when you have multiple clusters. That's interesting because when I talk to a lot of customers and I talk to a lot
of people that are rolling out really large scale infrastructure, it's kind of fascinating because
everyone that I talk to, because people are sort of like in the cloud now, I think the state of
enterprises now is they have a cloud strategy, which is great. Five years ago or so, people were developing a cloud strategy. But they're still very regional.
It's still very like, well, we're doing something in US East or US West, or we've got EU1 or
something like that. But we're not thinking broader. We're not thinking bigger. Disaster
recovery is still a, I know, I push a button
and then I deploy a bunch of infrastructure and then I, you know, I can recover. They're not
thinking planet scale fault tolerance just yet. And it's very easy to be able to do that when you
have a service mesh, which runs on top of multiple clusters or multiple pieces of infrastructure,
because now you have the concept
of, quote-unquote, production existing in more than one cluster, and you have to bring order to chaos. You want to say: my production services need to do the following things, I have the following SLOs that need to be maintained, I have the appropriate SLIs in place to gather information on whether or not the service is operational. And how do I take actions to get that back to that known good state? You want to move up that abstraction stack so that you're not beholden to a single piece of infrastructure, a single region, a single VPC, or a single network even.
And that's where something like environments comes into play, where you can define that this is my production cluster, or excuse me, my production environment, which consists of multiple clusters. And then all of these services need to be able to
talk to each other. So now it's a case where not only do my services need to communicate with
themselves inside of the cluster, but it's possible that one of those services might be
in production, but might exist, let's say on prem. So how do I make sure that those microservices up
in the cloud can talk to those microservices on-prem?
And how do I make sure that services that I don't want talking to each other can't talk to each other regardless of where they're deployed?
In which case, you have to have this centralized, I say centralized, but it's still distributed, this one place where you define all of these things, and that becomes your service mesh.
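One way Istio models such an on-prem or otherwise external backend is a ServiceEntry, which makes it routable and governable inside the mesh; a sketch with a hypothetical hostname and port:

```yaml
# Make an on-prem service addressable, and governable, inside the mesh.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: onprem-billing
spec:
  hosts:
  - billing.internal.example.com
  location: MESH_INTERNAL    # treat it as part of the mesh
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS
```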
Now, how that's implemented, there's a handful of different strategies, and there's tons of different slides that talk about how to do the control plane components, whether or not you want to duplicate the control plane components or have them all in one cluster, and yada yada yada. The end goal, though, the only thing that I care about, is that I have a singular mesh that spans all of these different backends and backend infrastructure, so that I don't focus on infrastructure. I focus on environments, which is really what I care more about.
And cool. So that means if I understand you correctly, you can install the service mesh,
the control plane, at least somewhere,
whether single or fault-tolerant, all this is possible.
But you have execution plane components in every single cluster.
One or multiple clusters belong to an environment.
But obviously, and I assume this is also true,
that when you have multiple environments that are distributed around the globe, you always have something like location awareness, so that Istio is smart enough to say, service A needs to talk to service B, and let's make sure instances that are geographically close talk to each other, and not that we're sending packets all over the globe and things like that.
Exactly. Yeah. And that's, it's really interesting because when you have that infrastructure in place and you have like global capabilities that are provided either from your
cloud provider or from, let's say like, you know, a CDN or like a large scale networking provider
or something like that, you can do some really interesting like network
flow type optimizations or like sectioning, or you can have cross region communication,
or you can disable it depending on what you'd like it to do. So for example, you know, if I
have this global network of stuff, this massive service mesh that covers the planet, and I'm in the US and I'm connecting to this backend service, I'm just going to, you know, example.com, let's say. You can have it set up such that example.com will resolve to an IP address, and that IP address could be an anycast IP address that would basically deposit you on whatever closest cluster happens to be near you, based on the technology of whoever's providing the network infrastructure.
It's like geolocation-based sort of thing, that sort of thing. Or you could have it say,
well, this particular service, please route me to the closest available backend,
but then I have restrictions. So for example, it might be the case that let's say we don't want,
let's say a European customer to, or a European end user to connect to this global network and
then be routed outside of, let's say the EU, maybe for, you know, data privacy regulation
or something along those lines. So what's important is it's super programmable. So you can enable whatever
it is that you would like to enable and you can disable whatever it is you would like to disable.
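A sketch of that kind of locality awareness in an Istio DestinationRule. The host and region names are hypothetical, and note that Istio only activates locality load balancing when outlier detection is also configured:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: example-com
spec:
  host: example.com          # hypothetical global service
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:            # keep EU traffic in the EU, fail over regionally
        - from: europe-west1
          to: europe-west4
    outlierDetection:        # required for locality LB to take effect
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```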
But the one thing to kind of take away from this, like if you were to boil down all of the
stuff I've been ranting about here, is that you have a singular interface that governs your entire
production environment that doesn't care about infrastructure.
Infrastructure is sort of a separate thing. So you just define services, you define the policies
that govern the services, and then you let the mesh take care of the rest and let it communicate
with the underlying components. And then you're all set, you're good.
Pretty cool. What we should do, I know you said there's a lot of slides and presentations out there, we should make sure to get some of the links and put them into the podcast description so people can follow up.
Yeah, that would be great. Yeah, absolutely.
I got one more question.
Now I want to go to observability, and I'm sure more questions will spark as we go along.
But as far as I understand, basically Istio sees traffic between two services
and therefore knows exactly who is talking to whom.
But how about distributed tracing?
Is it possible for Istio to actually follow transactions across multiple hops? Or is this just, I don't think it's possible, right? Because you cannot put a trace tag on one entry point and have it get automatically propagated through the different layers. But I wanted to see if something is in the works, if it's possible, if true distributed tracing is also something that service meshes can provide.
It can, actually, which is pretty cool. What's important to remember with service meshes is it's not just east-west traffic, but it's also north-south traffic, right? So if I have something
that's coming into the service mesh, which again, we talked about
has a variety of different backend infrastructures, including things that are not clusters.
You can just pop an Envoy proxy on a mainframe and then add it to the mesh.
And that way you've got that governed under the same policies that you have.
And as a result, any of the traffic that's coming into the mesh can then have the appropriate, you know, tags and
whatnot applied to it so that you can actually plug in whatever distributed tracing framework
you would like to plug into it, which is pretty nifty, including, you know, standard open source
stuff or, you know, managed offerings from a variety of different service providers. I'd
imagine you have your preferred service provider to provide observability insights and things like that. But out of the box, it works with things like
Zipkin, Jaeger, or whatnot. So it provides this type of capability. And Envoy has the capability
of essentially adding the appropriate tags and things like that to it so that you can then pick up and then
go do something. Keep in mind, though, that like it's network function specific. That said, it's
producing a bunch of different metrics that then can go into a variety of different backends.
So this is where it gets like kind of complicated because Istio does a lot of different things and then other things
can be built on top of it that can also plug into other things, for example. So if you need to do
something that modifies the way that the network works, you have to have Envoy capable of being
programmed to be able to inject that. So for example, like headers and things like that,
like you need to be able to put like span context or trace IDs or something along those lines inside of it.
But at the same time,
if all of that data is being fed somewhere,
then you need to be able to essentially aggregate that, right?
So it's kind of like a map reduce operation.
You have to map all of the different things
that have this particular span,
but then you have to have something that processes it
to be able to provide some meaningful insights.
So your distributed systems tracing framework would have the ability to then pick up all of this information and show you or present the traces and spans to you in a way that makes the most amount of sense.
So you need to implement it, but then you also need a way to present it, of course. So Istio provides the implementation specifics, but then you need something else that can read the data that's inside of the mesh to then present it, which is pretty cool.
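A sketch of switching tracing on, following the Istio 1.6-era IstioOperator API; field names have moved around between versions, and the Zipkin address is hypothetical. As discussed, Envoy starts and tags spans, but services must still forward trace headers themselves:

```yaml
# Have every sidecar report spans to a Zipkin-compatible collector.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0      # percentage of requests to trace
        zipkin:
          address: zipkin.istio-system:9411
# Envoy starts and tags spans, but to stitch hops together your
# services must still forward the trace headers they receive, e.g.:
# x-request-id, x-b3-traceid, x-b3-spanid, x-b3-parentspanid,
# x-b3-sampled (or the W3C traceparent header, depending on tracer).
```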
Yeah. So, I mean, again, I think you confirmed what I thought: in order to do true distributed tracing, you would need to instrument your individual services with some open source library or a commercial offering to truly get end-to-end tracing. So, to take the tag that was put on the request by Istio and then push the same tag out at the backend when you make backend calls to get the tracing. I thought maybe Istio had found a solution for that as well, but obviously it's not possible because, as you said, there are different technologies and you cannot just automatically pass a tag through the different runtimes that we use to build our services.
Yeah, cool.
Exactly.
Yeah.
So you still got to have a backend service to receive all of these things.
So one, you configure it and the other one, you then present the data and then you're good. Perfect. And yeah, you mentioned
there's obviously a lot of standards and we are also proud on the Dynatrace side that we are
part of OpenTelemetry and supporting the W3C trace context and all that stuff. So that's great.
Talking about standards, I believe there is some standard around service meshes or service mesh interface.
Can you talk a little bit more about this, or is it just Istio everywhere?
Well, so it depends on how one defines a service mesh, right? How do you actually say, these are all of the components, and as long as you check these checkboxes, we have a service mesh? What does that look like? And not only that, but how do you then provide additional insights or feature capabilities on top of things? So, let's say you just have a bunch of Envoys sitting around. Theoretically, that could be a service mesh. You're all set, you're ready to go. As a matter of fact, a lot of people end up programming Envoys without using the rest of
Istio. They just write a bunch of envoys, they deploy them, and they say, right, now I've got
a service mesh. I'm pretty good. But then you run into the same sort of backend issues in terms of
an automation perspective that would lead you to build something that kind of looks like Istio if you squint hard enough. Like for example, well, I don't really want to have to program
all of these envoys manually for every service that I deploy. And on top of that, I really don't
want to have to inject them or provide them with the appropriate certificates so that I can get
security out of the box. So I should really build a thing that does that automatically.
It's like, well, congrats, you've just like reinvented Citadel, which is one of the components
inside of Istio.
Or I really need to make sure that all of the envoys that I have out there are programmed
appropriately.
How do I do that in a sane and rational manner?
Well, I need to have something that understands all of the routes that we have and the way
to get from one cluster to another cluster, if I have a multi-cluster mesh. Well, congrats, you've just reinvented essentially what Pilot does. So you're going to have some subset of functionality that is good enough for whatever use case you have. But at the same time, be cognizant that you're probably not the first person who's going to go down that path.
And as a result, people that have spent a really long amount of time getting Istio to where it is today
have already kind of thought about these things and have actually changed the capabilities that the mesh can provide
and all of the requisite technology that goes into it
and all of the stuff that it plugs into. So for example, is it a requirement that a service mesh be integrated one hundred percent with OpenTelemetry? Well, I would hope so, because that's the goal of the OpenTelemetry project, but at the same time, Istio predates it. So would you consider that to be a requirement of a service mesh? And back and forth, or whatever this tends to look like.
The bigger thing would be you want to have something that you can define that operates, that figures out how to manage traffic across a variety of different backends that's super pluggable. So whatever
that looks like, I would say is the best, I don't want to say standard, but the best
implementation strategy for something like a service mesh. The really fun part is that there's a handful of different foundations out there that are all doing some interesting work in this space.
So like the Cloud Native Computing Foundation, for example, the Continuous Delivery Foundation,
all of the different foundations that are kind of like Linux Foundation.
And as a result, like Istio, I believe, is going to be donated over to one of those foundations. I think it's like an offshoot of them,
which would be kind of cool. So once it's over in that type of foundation, then you can create
the appropriate standards body that says any one of these backend components can be implemented.
But keep in mind, it's also a system of systems. So like, can I do Istio without using Envoy?
Well, theoretically you could,
you just need a thing that speaks the APIs
or that can read the custom resource definitions
that you have to implement the same type of functionality.
So there's a lot of different moving parts
and there's a lot of different standards
behind sort of all of them
to kind of like build all of that stuff up.
It's confusing.
No, I was also referring to the service mesh interface.
I think that came to mind.
At least I think there is a spec out there that talks about, you know, what does a service mesh look like?
And I think there's also the smi-spec.io page. And that's kind of what I was trying to figure out: oh yes, there's something like this. And when people decide, yes, we obviously need service meshes, if we decide on Istio today, or let's say we decide on something else today, and we implement things, do we lose everything if we switch over to another service mesh? Or is there some type of standardization that makes it easy for users to switch over?
Gotcha. Yeah. So the SMI spec would be one step, but keep in mind that it's a system of systems, so there's a lot
of other specs that need to be a part of it, beyond just the service mesh. So yes, you want to have something that implements the appropriate service mesh type logic, but at the same time you want to have things that implement all of the
components that it needs to be able to run on top of. So like there's more than one standard that
needs to be kind of present in order for this to work. Great use case would be like a gateway interface.
So like service meshes, one of the big things that they have is like a common ingress and a
common egress point that are provided as part of the standard sort of bootstrapped installation of
Istio. Now, in that case, we have this weird thing inside of Kubernetes called like
an ingress, which is kind of fun. So as opposed to just sort of a standard that exists like from
a governance body, how about like a standard that exists inside of an actual like API? So you get
down into the nuts and bolts of like what the CRD should actually be, which I believe SMI actually
does. It gets down into the weeds of that.
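For comparison, the SMI take on the earlier canary split looks roughly like this. SMI went through several alpha revisions, so the apiVersion and integer weights follow the v1alpha2 shape and may differ in other versions:

```yaml
# The same 80/20 split expressed against the vendor-neutral SMI API.
apiVersion: split.smi-spec.io/v1alpha2
kind: TrafficSplit
metadata:
  name: payments-split
spec:
  service: payments          # the root service clients call
  backends:
  - service: payments-v1
    weight: 80
  - service: payments-v2
    weight: 20
```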
But in that circumstance, now we've got this weird overlap: we have the concept of a load balancer, we have the
concept of something like a gateway, which is not quite the same thing. It's plumbed,
but it's not really plumbed in the same way. And then ingresses are powered by ingress controllers,
which I guess would be kind of like what Istio provides.
But at the same time, we have this weird concept of an Ingress that's not maybe L7.
So then you get into this weird, like, we really should have a different interface for
some of these components.
And what's really cool about it is because it's all open and because anyone can sort
of hack on it, right now we're in that like beginning phase
where like a bunch of open source hackers are sitting around trying to figure out like what's
the best way forward and we're introducing new concepts and then taking away other concepts and
then merging other concepts all into one interface. And I think that's the best of open source, right? You move super fast, you're able to release really awesome software,
but then you have that like Redux phase where you're like, okay, let's do a sanity check and
make sure that we're on the same page. So for example, like the entire motivation behind open
telemetry was just, we have way too many different ways of exposing like metrics and information.
We really ought to standardize this in one common set of APIs.
Same thing with open policies, because you can have and enforce different types of policies
inside of Istio. We really should have one common policy framework that everything sort of derives
from, and we should have one common whatever framework to do something else. And then as a
result, you'll start to see these different types of standards bodies come up with their definition.
Now, let's say the pioneers, and I would consider Istio to be one of the front runners because it's just been in development for a really long time, you take a look at where they are now, and then you derive the standard that anybody else can implement. And then everybody ends up implementing that, including, let's say, Istio: they'll redo things to be conformant to the standard that they helped create. So it brings everything nicely full circle and together in the circle of software development life cycles.
Cue the fun music.
Hey, I want to, I know that we can probably chat a long, long time about this, because the more you talk, the more questions come to mind, but I want to end it with one question. Let's see how far this goes. A lot of people are starting with Kubernetes.
And I think at some point they have to figure out,
A, do I need a service mesh, or can I just do it with, I don't know, some load balancers that can also route traffic? Or is a load balancer part of a service mesh anyway?
But at what point do I seriously need to look into service meshes?
And, and this is the big question,
what are the most common problems that people run into?
So let's talk about this and make sure that people understand, because it's not like you just do a kubectl apply and everything is good, right?
There are challenges with it too.
And so, A, at which point do people need to look into service meshes?
And what are going to be the challenges?
That's a really good question to end on.
I can see why you're doing a podcast.
You're pretty good at it.
So to tackle that one, I think there's really a couple gut checks or a couple sanity checks that you want to be aware of.
One of them is anytime you're doing anything where you have more than one piece of back-end infrastructure.
So if you're kind of rolling stuff up yourself and you've got one cluster, let's say up in the cloud or whatnot. As soon as you think to yourself, ah, I really want to run something on-prem and I want
to run something, you know, up in the cloud. How do I like get that working appropriately?
In that capacity, you really need to be thinking about a service mesh because that's the only
thing that abstracts that infrastructure away from your, your concern to the point where you
can just focus on writing
the application and making sure that things work properly. The other thing would be if you're
developing different components of a service mesh, or if you're working on, let's say, trying to do something with encryption across the board, or something in those types of categories, you'll run into this: I'm not coding application logic, I'm coding something that I need the application logic to be beholden to. So in other words, the payment service, theoretically speaking, shouldn't really concern itself from a code perspective with how it needs to scale, how it needs to be, let's say, rate limited, how it needs to do different types of retry patterns and things like that. Once you get down that path where you're like, wow, I need to stop doing this, you're going to want to bring order to chaos by implementing a service mesh and keeping your applications nice and pure, not having a bunch of extra logic or whatnot that's not really
fundamental to something. So that would be like the two big components that I would take a look at.
And end-to-end encryption, that would be a big one. Or if you have a lot of micro-segmentation
requests, like I need to make sure that these namespaces don't talk to each other or different
things or something along those lines, then I think you really should take a look at a
service mesh.
When it comes to challenges that you have when adopting it, I think a lot of it is fully understanding the capabilities that the service mesh has to offer and understanding how it implements some of these things, because it's a little bit different than what I think people were traditionally aware of. It's a different set of custom resource definitions,
right? Your standard service definition and your load balancer definition need to be rewritten
because now they're going to be different service definitions inside of Istio.
And as a result, you're going to need to not necessarily modernize. Like it's not as big of a gear change from going from, you know, the monolith to the microservice.
But it's going to be I need to take a look at this YAML and I need to just update a couple of things here and there to make sure that it works with with the mesh.
The other thing to keep in mind is you can actually run a service mesh on top of your Kubernetes cluster that also is running
services that are not part of the mesh, which is kind of cool. So you don't have to necessarily
move everything at once. It'll be sort of a gradual type of procedure. The big complexity
though is, like I mentioned before about like leveraging managed services is really reaching out to whoever it is
that is packaging up your Kubernetes distribution or the Istio installation on top of it,
and working with them to see what the best practices are. Because even though we want to
keep this to be as standardized as possible, there's little differences depending on how
you're deploying things. And there's going to be little differences that start cropping up
in different parts of your application. So it might be that one vendor's implementation
really recommends that you do the following, you know, especially when it comes to bringing
the components up and running different parts of the control plane, maybe in different ways.
So they might have their approach, which, which they're certifying as being like nice and stable. That might be a little bit different depending on which provider you go to that
might have a different set of functionality for the way that something or other is implemented.
So I think that would be kind of an important thing. The other thing to mention, um, I think
in terms of a challenge, which is actually not a challenge, but people end up getting into this very weird mindset where they think it's going to be a challenge: if you get something running in the mesh, you don't have to, from day one, turn on all of the advanced capabilities that the mesh has to offer, like micro-segmentation, rate limiting, quota enforcement, policy here and there. You don't have to do that. If you just deploy Istio and then deploy a service
inside of it, it's kind of like when you deploy a service inside of Kubernetes. If you don't have
the appropriate resource requests and horizontal pod autoscaler parameters and whatnot, the service
will run. It won't necessarily run great, but it'll run. So then over time, you can start implementing the restrictions or the capabilities of Istio. Let's say you notice that maybe one of the services keeps falling over for some reason.
Maybe take a look at that and go, you know what would solve this?
Better rate limiting.
Let's figure out how to do that inside of Istio and let's go ahead and implement that.
But up until that point, you didn't need to know anything about how Istio does rate limiting.
You just needed to deploy your service and then you're good.
So don't bite off more than you can chew trying
to turn every feature on all at once,
because that gets super complicated.
And it's very difficult to then try
to debug what's going on and where Istio is the problem
or not the problem.
And you kind of take it from there.
Those, I think, would be Seb's tips for getting started with service meshes.
Very cool. Sebastian, is there anything, any area of service meshes, that we completely missed today, which means you would need another episode of Pure Performance? Because if I give you another topic or two, then I think we need another hour. But is there anything that we definitely need to mention?
Well, I mean, service meshes are so broad
and they're so powerful in terms of what they're able to do.
And there's different ways of talking about
specific issues inside of things.
I'm going to say no.
I think this is a good...
By the way, have you thought
about all these other things that service meshes can do? I think anything else, you really need a
little bit more of a use case. So it needs to be more of like a sort of like a customer oriented
conversation and then focusing on how to like implement specific things. And there's like a
variety of different options that you can have, again, based on
your vendor capabilities. So then it gets into like, well, I can do these things using a service
mesh that are provided by this other backend component, which is going to be slightly better
than somebody else's backend component, which again, that's just a whole nother layer of
abstraction. And also, it gets more esoteric and, dare I say, more boring as you go up that stack. So I think you'll lose audience members if we keep talking. I think we're good for today.
Great. Andy, I think there's quite a lot there if you're going to summarize, but did you have any words of wisdom?
Well, no, I just want to quickly summarize it, as I always try to do in the end. I think what Sebastian said close to the end is really what also excites me about service meshes: it's taking away all the burden and all the stuff you need to take care of in order to run applications, but shouldn't have to take care of. I think, Sebastian, you said it correctly earlier:
it correctly earlier if it's something that you would need to put into your application logic
which is not part of your logic then this is something where you should find a managed service
that does it for you. And that's what we've been talking about, what the industry has been talking about, for so many years now.
Focus on what you're good at, focus on what differentiates you,
and don't focus on things that have been solved many, many times before
better than you can ever solve it,
because there's somebody that offers this as a service.
And I think that's also true for service meshes.
I really like that you gave us the overview that it's not just traffic routing, but there's encryption, segregation, making sure that maybe application one is not allowed to talk to application two, the geographical boundaries. There's a lot of cool things in there, fault injection, a lot of great things, and
probably much more, as you said, there's so many more capabilities so that if you're looking into
this, you know, make sure you understand what service meshes can really do so that you don't
go down the trail of maybe building something yourself that a service mesh can already do.
Sebastian, I would really love it if you could send us some links to material, you mentioned a couple of presentations, that people can look into to get started, also on challenges. I think this is very much appreciated by all listeners.
Absolutely. Thanks very much
for having me again. It's been super fun.
Yeah, and thanks for coming back. You're part of an elite group of people who've been on twice. There's people who've been on more than twice, I just have to let you know, so that maybe you'll be like, hey, I've got another idea. I want to join the next tier of, I want to say elitism, but that's the wrong word, this special club of Pure Performance repeat guests.
We used to have a running tally that we gave up on a long time ago.
We've just been doing this for too long to keep track, really, I think, Andy.
But thank you so much for joining us again, Sebastian.
As with the last episode with SLIs and SLOs,
this was extremely informative,
especially on a newer topic like service meshes,
which I think can be overwhelming for a lot of people, including myself.
So I really learned a lot today.
I really appreciate that.
And without further ado then, Andy,
this might be one of our longest episodes.
I think we've popped into over an hour.
So thank you for our listeners for staying with us and
listening. And
we will be back in
a few weeks. And
if you have any questions or any
topics you'd like to talk about,
you can tweet them to at pure underscore DT on Twitter.
That's what it is, right? It's been a while since I said it. It's @pure_dt.
Or you can send an old-fashioned email to pureperformance at dynatrace.com.
And Sebastian, remind people
where they can follow you on the socials.
That DevOps guy.
Nice and easy.
That's right.
And you have that across all platforms, right?
Pretty much, yep.
Yeah, I remember that from last time. Yeah. Awesome. All right. Well,
thanks everyone for listening and Sebastian. Thank you. And we'll see you all soon. Bye-bye.