PurePerformance - The Security and Resiliency Challenges of Cloud Native Authorization with Alex Olivier
Episode Date: November 11, 2024
Authentication (validating who you claim to be) and Authorization (enforcing what you are allowed to do) are critical in modern software development. While authentication seems to be a solved problem,... modern software development faces many challenges with secure, fast, and resilient authorization mechanisms. To learn more about those challenges, we invited Alex Olivier, Co-Founder and CPO at Cerbos, an Open Source Scalable Authorization Solution. Alex shared insights on attribute-based vs. role-based access control, the difference between stateful and stateless authorization implementations, why Broken Access Control is in the OWASP Top 10 Security Vulnerabilities, and how to observe the authorization solution for performance, security, and auditing purposes.
Links we discussed during the episode:
Alex's LinkedIn: https://www.linkedin.com/in/alexolivier/
Cerbos on GitHub: https://github.com/cerbos/cerbos
OWASP Broken Access Control: https://owasp.org/www-community/Broken_Access_Control
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson and as always I have my great friend and mockumentary in Andy Grabner. Hello Andy, how are you doing today?
I'm really good, I'm really good. I just wanted to know
how can I authenticate you?
How can I be sure that you are you?
Well, if I, you know what?
Well, first, before I answer that question,
it was really weird.
I called you my great friend,
which is like, I hate you.
So that's really bizarre that I did that.
But if I give you,
if I buy one of those heart pendants
that you can like snap in half,
and maybe if I give you the other half of it,
and then when you meet me, you can be like,
show me your heart pendant, and if we put them together, they match up.
They match up, yeah.
Maybe.
Maybe we could do that.
Yeah, yeah.
And then if I authenticate you, what would you be authorized to do?
Like record a podcast with me?
That I'd be authorized to go into your bank accounts.
I'd be authorized to do impersonations of you on stage.
I got nothing else good there.
It's dying.
It's dying.
You know what?
It would authorize us to get to our guest and save us from this pit we're falling into.
Exactly.
Let's do this.
Authorization, authentication, and so many more things today to learn from Alex Olivier.
Hopefully I pronounced the name correctly.
Alex, welcome to Pure Performance.
Thank you so much for being here today.
Thank you very much. Thanks for having me. Looking forward to demystifying some of those two words that sound very similar.
Yeah, they do sound very similar. And maybe, Alex, before we jump in, because I really want to kick it off with what's the difference so that we don't get it wrong in the future.
Maybe a couple of words about yourself first. I see on your LinkedIn profile you're the co-founder and CPO at Cerbos.
I'm pretty sure you have a long history of things you've done and things you've seen.
What do people need to know about you?
Yeah, absolutely.
So yeah, currently co-founder and CPO at Cerbos.
We're in the authorization space, which we're going to talk more about in a bit.
But I'm a software engineer at heart.
I spent my entire teenage years building bad software
for people. And still, there's CMSs running out there in
PHP 4 that I wrote in the early 2000s, which are still on the internet, which is
mildly terrifying. But my professional career was initially
Microsoft, working on the .NET stack, and then a string of startups
in various different industries and verticals from e-commerce to supply chain, fitness.
And one of the common things I keep having to build and fix and re-architect in these
various systems was access controls and permissions.
Hence, Cerbos is what I'm now spending all my time on.
Yeah.
Cool.
That's great to know, especially the background where you come from
working for these organizations
that everybody knows.
Let's jump into it.
Authorization and authentication.
What did we get wrong,
Brian and I,
when we did our little strange intro?
And by the way,
hopefully you will never
have to be on another podcast
where the two hosts
are trying to be that funny
and kind of mess it up.
I was trying to work out whether this heart necklace would be
a second factor, because your first factor is you
recognize each other. Your second factor is making sure
your necklaces meet. So you had
2FA there, kind of.
Yeah.
You could do voice, you could do gait
detection, all sorts of things.
So authentication and authorization
are annoyingly two words that sound very similar
but are actually two very different things.
And they are interchanged and used by accident
all over the place.
And the most, I think, obvious place
where it's kind of misused is actually in the HTTP spec.
If you've ever set an authorization header
and put a JWT sort of token in there,
that is actually authentication rather than authorization.
In the spec, it's called authorization.
It's
misused even in something
as foundational as the HTTP specification.
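To make the naming point concrete, here is a minimal Python sketch of how a bearer token travels in the `Authorization` header. The function names are illustrative assumptions. Note that the value carried in the header is an authentication credential; nothing in it decides what the caller may do.

```python
from typing import Optional


def build_auth_header(token: str) -> dict:
    """Build the header a client would send; the name says 'Authorization',
    but the token only asserts who the caller is."""
    return {"Authorization": f"Bearer {token}"}


def extract_bearer_token(headers: dict) -> Optional[str]:
    """Pull the credential back out on the server side.

    Returns None when the header is missing or uses another scheme.
    """
    value = headers.get("Authorization", "")
    scheme, _, credential = value.partition(" ")
    if scheme.lower() != "bearer" or not credential.strip():
        return None
    return credential.strip()
```

The server still has to verify the token and then make a separate authorization decision about the requested action.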
The way to think about it,
authentication is
the whole process that you go through when you
log into your email, log into some tool
where you are challenged to provide
some sort of credential
that identifies who you are.
So you'll be asked for a username password,
you'll be asked for the other half of the heart necklace,
you'll be asked for some sort of way of identifying yourself.
That authentication system will then verify that credential
and say, okay, yep, this person's gone through the right processes
and the right ceremonies, they've done their 2FA, they've done their one-time passcode, whatever mechanism they
have. And we can confidently say that this person is who they are and kind of issue that identity.
And nowadays, those are typically just JSON web tokens, but they can take all sorts of other
forms. But that's the authentication ceremony. You are authenticating that someone is who they
say they are. Authorization, on the other hand, is once you now know who someone is,
what are they actually allowed to do
inside of your system, inside your application?
So I know that you're Andy and you're Brian,
but should both of you be able to do the same actions
and do the same tasks
and perform the same ceremonies inside of a system?
That is where authorization comes in.
And really unhelpfully,
they are quite often reduced down to authN and authZ, which also doesn't really help things because they look very
similar as well. You just turn an N on the side and you get a Z, so it doesn't really help.
But authentication, ensuring that
someone is who they say they are, and then authorization is, okay, now I know who this person is,
can they do X, Y, Z action
inside of an application or inside of a service?
And I think another maybe example for a physical kind of analogy,
because I just traveled, came back yesterday,
but when I travel, then I go to the hotel and claim my room.
I obviously need to show my identification because they have my name there.
So I basically authenticate myself.
Then I get a token, which is a key.
And that key then tells the system which doors I'm allowed to enter.
Yep.
Andy, you're so good at that.
It just came out.
It just really is.
I know.
It's just, he does it all the time like this.
I mean, travel is a really good kind of analogy that we use all the time
you turn up at an airport, you show a passport
that is your identity document and then
the border person will decide whether
you're allowed in or not based on the identity
so that's authentication and then authorization
Hey, in preparation of this
call there were a couple of things that
piqued my interest when I read through some
of the other talks you've given and the content that you produced.
One is around security, and the other one is around performance and scalability.
I want to start with security, because what I didn't know is that authorization has always been, or has been at least recently, in the top 10 list of OWASP common challenges, security challenges.
Can you tell us a little bit more and also why that is?
Why is authorization such a big attack vector also for security,
for hackers, I guess, as well?
Yeah, obviously.
So OWASP top 10, for those who don't know,
is a standard report that's gone out every couple of years
that does big analysis and understanding what are the top issues, the top vulnerabilities,
predominantly in web applications, which is kind of what that particular report is focused on,
but the same patterns repeated in numerous kind of surveys.
And the top issue in the last round of that report, number one, was broken access controls.
So in a system, in an architecture, in a service, in an application,
a user was able to do something they shouldn't have been able to do,
or maybe they got access to an action that they shouldn't have been able to do.
And the access control logic, the literal code inside the application
that's determining if this user is an admin, allow the action,
if this user is an editor, and allow the action under these scenarios, etc.,
there was a flaw in that logic somewhere,
and the access control was ultimately broken.
Now, that could be down on the infrastructure side of things.
Not a day goes by where you see some S3 bucket leaked
because the keys were open or the bucket ACLs were misconfigured.
Or it could be down inside of an application.
So we're recording this through Zoom.
Let's say we're trying to go and set up a meeting,
and I've been able to maybe go and set up a meeting
when I shouldn't have been able to because I don't own the account.
That's account-level permissions.
Or it could be more end user experience
where I'm interacting with some app we're building
and I've been able to do an action I shouldn't have done.
And those kind of broken access controls
can be anywhere across the stack.
It could be an infrastructure level, it could be the API level,
it could be in the application layer,
and it could be on the end client
where these permissions and access controls
need to be set and defined.
And that logic is generally very fragile if it's done in kind of what we sort of classically have seen the way of doing it.
And also it's very, very complicated.
It's one thing doing it on one system and one microservice.
But if you're building a large application, you're going to have sprawling places across that stack
where you need to define your access control and define your permission logic.
And that is kind of the authorization problem, which has really come to the forefront in
recent years.
And really a lot of the reason why focus has gone on to that now is authentication was
the top problem before.
But we're kind of in a world now where that's sort of a solved quote unquote problem.
And, you know, there's great tools, great vendors out there that give you a full identity, an IAM IDP type system.
There's whole open source projects around it, and more importantly, there's actually an open standard.
So OAuth 2, OpenID Connect, these things that anyone that spends any time
building software will probably run into when trying to decide how to
create the system. And so the authentication problem is kind of solved to a certain
degree. There's great tooling for it. And now the focus has really shifted to
okay, now I know who this person is, what can I actually do?
And there's both kind of business needs for it, but also regulatory changes that are coming
through in recent years around how best to approach this. And you go and look at some of the standards
agencies like NIST, the DoD in the US,
they've been publishing white papers
over the last few years around how to build
the zero trust architecture.
And I say zero trust in air quotes
because it's a very overloaded term, I think, these days.
But now the best practice is now shifting from,
okay, your authentication is now solid
because of good standards, good architectures,
good projects out there. Now let's go and really nail down how to do authorization in a scalable,
auditable way. So maybe to try to extend my analogy earlier with the hotel room,
does it mean like I authenticate myself against the system, right?
For instance, I get my ticket or my key card to enter the building, right?
And normally I'm only allowed to go into my room.
But maybe I sneak, maybe I kind of, what's it called?
If I kind of follow somebody, if I tailgate somebody and all of a sudden,
you know, get into the executive lounge,
even though I'm not supposed to be there.
So that's kind of the idea that hackers are trying to,
let's say, log in with very low privileges,
what they can do into a system.
And then once they are in the system,
you try to explore and exploit ways to gain more access
than you're supposed to have.
Yeah, from an exploit point of view,
that sort of escalation attack where you've got one identity
and being able to jump up and escalate your permissions
definitely falls in that bucket.
And a lot of the time, it's not so much that they've found a weakness,
it's more like there's been a misconfiguration.
So broken access control, if you actually read what OWASP says,
is more about there's just been a misconfiguration
because this stuff is so complicated when you're at scale
or you have a large enough system.
Defining those rules is really a delicate piece of work
and then being able to test those rules
and to test those access policies ultimately
is a very fragile and important part of the system
you need to get right.
And if you look at how
systems have been built up until
fairly recently, a lot of this logic was
deeply coupled into the code base.
So you have a request come in,
we have an API call coming in,
that's authenticated, there's a token associated with it,
we now know that user's identity,
that request will make it down to some service that's going to handle it,
and then inside of that service,
based on what the requirements were of what that application needs to do,
you're now going to have this typically hard-coded logic that says, if user role is admin, then
allow the request. If user role is manager, then only allow the request if the particular
resource they're trying to access belongs to their team. If that access is just viewer,
only allow the view action.
And that's usually hard coded.
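A hard-coded check of the kind described here might look like the following Python sketch. The role names, the `team` field, and the `is_allowed` function are illustrative assumptions, not code from any real system.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: str
    role: str  # hypothetical roles: "admin", "manager", "viewer"
    team: str


@dataclass
class Resource:
    id: str
    team: str


def is_allowed(user: User, action: str, resource: Resource) -> bool:
    """Authorization logic baked directly into application code."""
    if user.role == "admin":
        return True
    if user.role == "manager":
        # Managers may act only on resources owned by their own team.
        return resource.team == user.team
    if user.role == "viewer":
        return action == "view"
    # Any role not listed above is denied by default.
    return False
```

Every new role or condition means another branch here, and the same branches tend to get copied into each service that needs the check.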
And that is where issues and holes sneak in, because you have to think of every single
permutation of what that access pattern could be and under which conditions, and to extend
the hotel analogy.
It's not just whether you have that key, it's whether that key belongs to that room, and
whether the lifespan of that key is still valid.
So the checkout is at midday, your key stops working at midday.
That's further logic that has to be hard-coded or defined inside of the system.
And if that logic isn't put in and tested properly, that's when you're going to start
running into these kind of access control issues.
Yeah, it almost sounds like because
authorization
no, yes
authorization, see?
See, it's pretty easy.
The menu
of authorization
capabilities for
a lot of software
and different platforms that you log in has become
so much more fine-grained or
so much more granular that this is opening the problem.
To extend it to the hotel analogy, if you go to a
large, named theme park type of place that
has the hotel, oftentimes you can start by saying,
I'm going to put my credit card on my key
card as well so that I can use it to buy stuff.
But then that might also become your park ticket.
It might become a bunch of other things.
And now you went from a simple key to a pass that you opt in and authorized to do a bunch
of stuff.
And then maybe your kid has one.
And what can your kid do?
And is it turned on for the kid or not for the kid?
And because of all these new options coming through, it's not just you're an admin, right?
Well, you're an admin for one portion of the tool now.
You're an admin for two portions of the tool.
Other ones you have viewers on.
Others you have no access to.
And we even see that, you know, the first time I saw that was, and this is not my field, obviously,
but, you know, when you start playing with the,
when you start looking at all the permissions in AWS,
it's like, oh my gosh, right?
And then we've expanded our own permissions
to have a lot more flexibility.
And it's almost like when going to service mesh, right?
To bring it from a coding side,
suddenly now you have this complete map
or this gigantic map of who can talk to who that you have to manage
carefully. Which also ties into the authorization, right?
Because you have to control who can talk.
This happens a lot on these podcasts, right? We start talking about this topic and suddenly
the gears start creaking slowly. You're like, oh my gosh, this is such an
intense, I don't want to say problem, but an intense, not concept, an intense thing to
manage, right? And to stay on top of, you know, from the teams who are looking at a
big picture.
Yeah, and if you kind of look at, on your point around sort of granularity and how fine-grained these checks are,
if you go back five, maybe ten years,
everyone's kind of familiar with the idea of RBAC,
Role-Based Access Control, where you're simply checking,
someone has a role, yes, they can do an action,
it's just like a Boolean yes or no.
But nowadays, what most people are doing,
either implicitly or explicitly, is Attribute-Based Access Control, ABAC,
where it's not only what role you have, it's whether there's attributes about yourselves or the particular instance
of a resource or a particular object you're interacting with, whether you should have
access or not.
And kind of the best example of this is, imagine like a blog, blogging system, you know, your
typical CMS thing that I think everyone, when they start out coding, builds at some point.
You want to have some rules in there that says
only the author of an article can edit an article, let's say.
So now it's not working out whether I have the role of author,
therefore I should be able to do everything.
It's, okay, this person making the request is the author.
Now I also need to check the attribute of author ID on the article
is equal to the ID of the person making the request.
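The difference between the role-only check and the attribute-based check can be sketched in a few lines of Python. The class and field names (`Principal`, `Article`, `author_id`) are assumptions chosen to mirror the blog example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    id: str
    roles: frozenset = frozenset()


@dataclass(frozen=True)
class Article:
    id: str
    author_id: str


def can_edit_rbac(principal: Principal) -> bool:
    """Role-only check: any author may edit any article."""
    return "author" in principal.roles


def can_edit_abac(principal: Principal, article: Article) -> bool:
    """Attribute-based check: the role must match AND the article's
    author_id attribute must equal the requester's id."""
    return "author" in principal.roles and article.author_id == principal.id
```

With two authors, the role-only check lets either one edit the other's article, while the attribute-based check allows only the owner.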
And defining that logic is when you start hard coding these rules
into your application code, which is all fine for a small system.
But we're now in the world of distributed microservices,
we're in the world of systems that are pulling data sets,
we're in the world of very large architectures, hybrid cloud,
you name it, we've seen it all.
And now we're in a world where as soon as you need to change that business logic,
which fundamentally will happen, you can set a rule,
but I guarantee you will change it.
And I say that from someone that's had to go through that pain
and go through that process at numerous companies.
There was one particular company I worked for where we built an access control system
and within the space of a year we had to rebuild it three times because the requirements kept
changing as the business evolved.
And you then have to go back and rework this matrix of logic that was distributed across
every request handler and every gateway and every middleware, etc.
So what's happened now in recent years is this movement towards what the analysts out
there, the analyst firms, call externalized authorization or decoupled
authorization, where essentially you're taking that logic, that
hard-coded if statement, that where clause on your SQL query, for example, and pulling it out
into a policy, a policy
file, and then you have what's called a policy decision point, a PDP.
And that centralizes the definition of who can do what
under which conditions inside of a system.
And then your policy decision point is just another service
that's running inside of your infrastructure,
inside of your service mesh, running as a sidecar,
run it wherever it makes most sense for your system.
And then in your application code,
at every point where you would have hard-coded that
if-else case switch style logic, you're now just making an API call out to that
sidecar or to that other service in your mesh saying, I have this user
with this identity trying to do this action on this particular resource.
Then that's evaluated against your policy decision point, which has your business logic defined as
policy loaded into it, and that will come up with a decision, either an allow or deny
in most cases, and then it goes back to your application code, and now your application code is a single
if statement. If the decision point says allow, do the action. If not, deny.
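The shape of this pattern can be sketched in Python as follows. This is a simplified in-process stand-in, not the API of Cerbos or any other product: in a real deployment `check` would be a call to a separate decision-point service, and the policy would live in versioned policy files rather than in a Python list. The `Rule` structure and the article policy are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    action: str
    roles: frozenset
    # Optional condition over principal and resource attributes.
    condition: Callable[[dict, dict], bool] = lambda principal, resource: True


# Hypothetical policy for an "article" resource, kept apart from app code.
ARTICLE_POLICY = [
    Rule("view", frozenset({"reader", "author", "admin"})),
    Rule("edit", frozenset({"author"}), lambda p, r: r["author_id"] == p["id"]),
    Rule("edit", frozenset({"admin"})),
]


def check(principal: dict, action: str, resource: dict, policy=ARTICLE_POLICY) -> str:
    """Stand-in for the policy decision point: returns ALLOW or DENY."""
    for rule in policy:
        if (
            rule.action == action
            and rule.roles & set(principal["roles"])
            and rule.condition(principal, resource)
        ):
            return "ALLOW"
    return "DENY"  # nothing matched, so deny by default


def edit_article(principal: dict, article: dict, new_body: str) -> dict:
    """Application code: a single if statement around the decision."""
    if check(principal, "edit", article) != "ALLOW":
        raise PermissionError("edit denied")
    article["body"] = new_body
    return article
```

Changing who may edit an article now means changing `ARTICLE_POLICY` only; `edit_article` stays the same.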
What that now means is there's a single source of truth for what your authorization logic
should be. It's those policy files. They can be versioned, they can be tested, they can
be fully audited, keep them in your source control system.
And then when you want to change your logic, there's one place to update it, and then every
microservice, every Lambda function, every component in your stack, as long as it's calling
a decision point to get its authorization checks, you're going to get a consistent answer
across your entire architecture without any kind of extra work or any extra effort for
the actual application teams who are building it.
It can almost now be offloaded to a SecOps
or even product teams to manage the authorization logic themselves
because ultimately they're the ones with the requirements.
This is a fascinating kind of segue now
over to the second big thing is performance and scalability
because this PDP, which thanks for that,
I did not know that it's called policy decision point.
So if everyone,
if you think about high volume transactions in an application
and every transaction needs to authenticate or call the PDP
and it's a centralized system, obviously there's a lot of load on it.
There's a lot of, if this system doesn't scale, if this system is not reliable,
then obviously it will either slow down things in my
regular transaction, or if it fails to produce,
then for the end user, even though they might be logged in and all of a sudden
they try to click the edit button, the author tries to edit his or her own post,
but all of a sudden that PDP check goes wrong for this particular request,
then I'm wondering, hey, what's happening now?
So how do we, I mean, is that a real big challenge
that these systems become so central and so critical
that they obviously have to have the highest standards
of availability and resiliency?
Yeah, so just to go a bit with my background again.
So one of the businesses that I was both an engineer at one point and then later
became the product architect for was a system that was ingesting clickstream data.
And we needed to authorize whether data should have come in
and whether someone could query it. We were doing about a quarter of a million requests a second.
So every mistake you can make in large-scale data processing
and large-scale distributed systems, I have made at some point.
And one of those systems that failed numerous times due to how we originally
architected it was around how we handled authorization. So the whole reason
we started Cerbos and kind of working in this space is coming from a pain point we had ourselves
of having to model this authorization logic and then also do it at large scale.
So from a performance perspective, if you look at the difference
again between authentication and authorization and how that fits into here.
Authentication is generally a one-off ceremony, an interactive
one-off ceremony that a user will do. So every 15 minutes, let's say
your sessions are, you'll be prompted, username, password,
whatever, and you'll get a token back that's valid for 15 minutes, an hour, 10 seconds,
whatever your requirements are.
Once you've got that token, you can verify it without having to go and hit that service
because of JSON key sets and all that kind of infrastructure that's been set up around it.
So authentication is kind of almost like a cache concern at that point.
There's obvious areas that you need to make sure you can invalidate tokens, etc.
And there's great work going on in the standards committees around that.
But you don't need to keep hitting that service.
You can do it essentially locally on the node using key sets.
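As a rough illustration of verifying a token locally, here is a Python sketch using only the standard library. It swaps in a shared-secret HMAC signature (HS256) for simplicity; the key-set approach mentioned above normally uses asymmetric keys published by the identity provider, which needs a cryptography library. The function names are assumptions, and a production verifier should also validate the header's algorithm and other claims.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, key: bytes) -> str:
    """Issue a compact JWT-style token (done once, by the auth service)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def verify_token(token: str, key: bytes, now: float) -> Optional[dict]:
    """Verify signature and expiry locally, with no call to the auth service."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    header, payload, sig = parts
    expected = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        return None
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) <= now:
        return None  # token has expired
    return claims
```

Each node holding the verification key can run `verify_token` on every request without a network round trip, which is what makes authentication behave like a caching concern.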
Authorization, on the other hand, if you're doing anything beyond simple role-based checks,
where you check the token whether someone has a claim and allow or not,
if you're doing anything that is contextual based on details of the user,
details of the resource and the action they're doing, or even just the request context,
then by definition it can't be cached. So if you're making an authorization decision
going back to that hotel keycard example,
we can't cache that decision because first we need to validate that keycard is active.
We need to validate that that key card is assigned to that room, and we also need to do a time
check.
And if we just cache a decision for an hour, well if the time that key expires within an
hour, it's going to let you in the room when you shouldn't.
So authorization needs to be done on every single call, and it needs to be evaluated
on every single call.
And that is in the blocking path of every single request.
It's in the key pipeline.
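The keycard example can be turned into a small Python sketch showing how a cached decision goes stale. The `KeyCard` fields and the deliberately naive `CachedDecider` are illustrative assumptions; times are seconds since midnight.

```python
from dataclasses import dataclass


@dataclass
class KeyCard:
    card_id: str
    room: str
    expires_at: float  # checkout time
    active: bool = True


def decide(card: KeyCard, room: str, now: float) -> bool:
    """Fresh evaluation: card is active, assigned to the room, not expired."""
    return card.active and card.room == room and now < card.expires_at


class CachedDecider:
    """Deliberately naive: reuses a decision for ttl seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._cache = {}  # (card_id, room) -> (decision, cached_at)

    def decide(self, card: KeyCard, room: str, now: float) -> bool:
        key = (card.card_id, room)
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]  # may be stale
        decision = decide(card, room, now)
        self._cache[key] = (decision, now)
        return decision


NOON = 12 * 3600
card = KeyCard("c1", "101", expires_at=NOON)
cached = CachedDecider(ttl=3600)
cached.decide(card, "101", now=NOON - 1800)                 # 11:30, allowed and cached
stale_answer = cached.decide(card, "101", now=NOON + 600)   # 12:10, served from cache
fresh_answer = decide(card, "101", now=NOON + 600)          # 12:10, evaluated fresh
```

At 12:10 the cached path still opens the door while the fresh evaluation refuses, which is why the decision has to be evaluated on every call.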
So performance is fundamental to how you need to design
and build and architect authorization.
There's different schools of thoughts for how to do this.
There's ways of doing that statefully.
There's ways of doing that stateless.
We at Cerbos lean on the stateless side of things,
but there's also architectural decisions to be made around what's best for your system.
From a performance perspective,
if you were to go down kind of the stateful route, what you're going to have to do is basically
replicate all the information that's relevant for authorization about your users or your resources
out of their underlying data stores and push it up into some sort of distributed cache
at your authorization layer that can handle the external load that you expect.
So if you're a consumer application with millions of users, you need to make sure this thing
can scale.
And then you need to also make sure that whenever there's a change to the underlying data, you
have to go and synchronize all that state continuously.
And distributed caching, cache misses, all those fun problems that come with distributed
things, authorization is a distributed data problem
at the end of the day, come into play.
The other approach, which is how we designed Cerbos,
is the stateless approach,
where the actual decision point
gets all its inputs at request time.
So it's down to the service that's doing the check
to provide, here's the identity,
because they've got it from the token
or they've got it from a request header, etc.
And here's the resource they're trying to interact with.
So to go back to the previous example
with a blog system, request comes in,
we verify the token, we've got the identity,
we go out to our database,
we go and grab the particular article
that the person's trying to interact with,
and then we pass that to the authorization layer.
The policy decision point, which is the formal terminology for these things,
receives a request from the policy enforcement point,
which is the service.
The policy decision point will then use that data that's been sent to it,
principal resource action, go and evaluate that against the policies,
come up with a decision, either allow or deny, and send back that decision
to the service,
which will then enforce that decision.
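The request flow just described can be sketched in Python as below. Everything here is a simplified assumption: `ARTICLES` stands in for the application's database, `pdp_check` stands in for a call to a stateless decision point, and the identity is assumed to have already been extracted from a verified token.

```python
# Stand-in for the application's own data store.
ARTICLES = {
    "a1": {"id": "a1", "author_id": "u1"},
}


def pdp_check(principal: dict, action: str, resource: dict) -> bool:
    """Stateless decision: uses only the arguments passed in, so it needs
    no database, no API calls, and no disk reads."""
    if action == "edit":
        return "author" in principal["roles"] and resource["author_id"] == principal["id"]
    return action == "view"


def handle_edit(identity: dict, article_id: str) -> int:
    """Policy enforcement point: gathers inputs, asks for a decision,
    and enforces it. Returns an HTTP-style status code."""
    article = ARTICLES.get(article_id)  # the service fetches the resource
    if article is None:
        return 404
    if not pdp_check(identity, "edit", article):  # principal, action, resource
        return 403
    return 200  # the edit would be performed here
```

Because `pdp_check` keeps no state between calls, any number of copies can run side by side and return the same answer for the same inputs.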
And with that architecture,
if you're running at large scale,
high throughput, distributed,
that model actually now scales with you
because the decision point itself
is fully stateless.
The only thing that's loaded into it
is the actual policies,
and everything else is provided at request time.
So from an architecture perspective,
if we all live in the Kubernetes world,
which most of us do nowadays, I think,
that means you can just put it as a sidecar
inside of every single service
that you need to do authorization checks in,
and then it's a local host, it's an on-node,
inside the same pod call from your application
to the decision point.
The decision point doesn't need to make any API calls,
doesn't need to go to any database,
doesn't need to read from disk.
It can make a decision in memory
using the context that's been given from the application layer
and make a decision, which is the scalable way of doing it.
Based on your architecture, it may be better to use a stateful approach,
but just from our own experience,
having had our feet burnt in a few situations, the stateless approach is the one that will
work from a high-throughput,
distributed architectural approach.
But it's always a decision based on what's best for you.
Thanks, first of all, for the explanation
and also this approach.
But that means as a sidecar,
if I have a large application
that runs hundreds and thousands and even more containers,
that means I need to factor in the additional, let's say, quote unquote, overhead of that sidecar
that I have in every single pod, basically, right? I mean, that's, I guess that's then the
decision that you need to make. I guess at some point it definitely pays off, as you said,
to have everything kind of locally enforced and locally validated.
But I guess it's an interesting trade-off.
It's an interesting trade-off, yeah.
And obviously now we have all these kind of mechanisms that the infrastructure gives you
or the orchestrator gives you, like things like Kubernetes.
Maybe a sidecar doesn't make sense, but really what you want to reduce is off-box calls.
So you could go like a daemon set to make sure there's at least one running on each
node.
Or for smaller applications, there's only a couple of nodes, you know they're going
to be co-located with each other.
You can just run it as a service inside of it and let the infrastructure figure out how
to make that request.
So it really is kind of trade-off, and we have users that do a bit of a hybrid approach.
But from an architectural standpoint, regardless of what decision engine and decision point you use, these are the
kind of decisions that need to be decided based on what's best for your application.
There may be some things that need it as a sidecar, others that can maybe survive as just
a daemon set, others that are happy to go and hit some other node somewhere. I would
just never recommend doing anything that requires a call over the internet because you're going
to be adding tens of milliseconds to every single call to your system.
I wanted to ask along the lines of this, Andy, this goes into adding another item that people
have to consider when they design their architecture. We've talked about, they have
to consider how we're going to be able to observe this, how are we going to be able to do security, other types of security, not this.
There are a lot of different factors of instead of just, let's do it, we now have to consider
how are we going to run this?
Are we going to run this as a sidecar?
A lot of times people do these as afterthoughts.
Who in an organization would be the one to be thinking about
how should we do this? I wouldn't think it's the security team because they're looking at a
different point of view. Where does that role fit to make that evaluation and decision
that you see at least?
Yeah, this is one of those ones where authorization
as a core concern, as a box on that diagram,
is still quite early, and there isn't a clear owner
inside of a business, is what we're seeing.
It will range, but ultimately it's whoever wears
the architecture hat inside of a company.
So we work with lots of small startups.
At that point, it's the CTO, it's the first engineer,
it's engineering managers,
maybe as things grow up,
through to large enterprises.
So we work with like
Utility Warehouse here in the UK,
one of the largest telecoms providers.
And they have like an architecture team
and they're responsible for defining
the security architecture
outside of the stack.
And it kind of fits with them.
And then we've seen kind of
everything in between.
Sometimes it's the product owner or the product manager owns the actual
logic for what the rules need to be. The infrastructure or DevOps team will
own the actual decision point service. And then a security
team member will be involved to basically provide oversight, to make sure that the
authorization logic meets all the regulatory requirements that that business sits under.
Because it's all very well us talking about technical requirements all the time and sort of functional
what the application needs to do requirements. A lot of the companies that are really taking
authorization seriously are also regulated businesses where you have things like
data locality requirements. You have things like auditability requirements.
You have to be able to prove it. One of my
fun experiences in a previous life:
at a previous company, we went through ISO and SOC 2 compliance.
Every year we'd get audited.
Every year I'd get dragged down to a dark basement by a lawyer
and I had to demonstrate what our access controls were
inside of our systems.
And the first couple of times I did it,
I was sitting there trying to grep through logs in S3
trying to prove we had access controls.
Not the best way of doing it, just a heads up.
And so the other benefits of taking
this kind of externalized approach,
you now have policy defined centrally that's version
controlled. You can write tests against it.
And then on the other side, on the observability
side, yes, you get all your nice
OpenTelemetry insights, Prometheus metrics, all
that kind of thing for observing the actual
service itself. But you have this other
major benefit of having this centralized service
for authorization, which is you get a definite log of exactly this user tried to do this
action at this time on this particular resource. And it was either allowed or denied
by this particular version of this particular policy, which is gold dust for any security
team. Imagine you're running a business, you have some sort of suspected breach, you know this identity was active inside of the system and you need to be able to pull out exactly
what that user did. Your authorization service is the
source of truth of really exactly what that person did inside of your application.
Which is another whole benefit and another kind of observability and logging that
are quite commonly a bit of an afterthought if you don't really consider this up front in your
architecture.
And there's a real kind of benefit as you're working in either a regulated business or working towards some of those standards and compliance certifications.
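To make the audit idea concrete, here is a minimal sketch of what one such decision record could look like and how a security team might pull out everything a given identity did. The `DecisionRecord` fields and the `activity_for` helper are hypothetical names chosen for illustration; they are not the log schema of Cerbos or of any other product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionRecord:
    """One authorization decision, as a hypothetical audit log entry."""
    timestamp: datetime
    principal: str       # who made the request
    action: str          # what they tried to do
    resource: str        # what they tried to do it to
    effect: str          # "ALLOW" or "DENY"
    policy: str          # which policy produced the decision
    policy_version: str  # which version of that policy


def activity_for(log, principal, since):
    """Return every decision for one principal at or after `since`, oldest first."""
    hits = [r for r in log if r.principal == principal and r.timestamp >= since]
    return sorted(hits, key=lambda r: r.timestamp)


log = [
    DecisionRecord(datetime(2024, 11, 1, 9, 0, tzinfo=timezone.utc),
                   "alice", "view", "invoice:17", "ALLOW", "invoice", "v3"),
    DecisionRecord(datetime(2024, 11, 1, 9, 5, tzinfo=timezone.utc),
                   "alice", "delete", "invoice:17", "DENY", "invoice", "v3"),
    DecisionRecord(datetime(2024, 11, 1, 9, 7, tzinfo=timezone.utc),
                   "bob", "view", "invoice:17", "ALLOW", "invoice", "v3"),
]

since = datetime(2024, 11, 1, 0, 0, tzinfo=timezone.utc)
for record in activity_for(log, "alice", since):
    print(record.timestamp.isoformat(), record.action, record.resource, record.effect)
```

Because every record carries the policy name and version, the same query also answers which rule set allowed or denied each action at the time.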
A quick architectural question to your deployment model.
That means you have, let's assume we go the sidecar route.
Let's say we have 100 containers, 100 sidecars. That means during startup,
they connect to a central policy, whatever, policy operator, what do you call it?
Yeah, so a policy store of some sort.
Policy store, yeah. And that's basically then, so that policy store makes sense that whenever
policies get changed, that the change also gets obviously distributed to all of
the different instances. I guess that's a push because you want to have it immediately pushed
any changes. Now from a centralized auditing, because auditing is obviously the key use case,
does this mean that every sidecar is then sending also that information back to the central operator?
It has to, right?
Yes.
Yeah, so there's a couple of things here.
Firstly, I'll speak of Cerbos, the open source project, and how we solved that, but it's
a very common pattern you'll see across all of us out there.
So the core of Cerbos, the policy decision point, the engine, is completely open source,
Apache 2 licensed, go and grab it off GitHub.
We have hundreds, if not
thousands of companies out there using it today
at various scales. And in that model
it's a container that you're running
and each instance of that container
as you correctly pointed out, has to go and fetch
those policies.
With the Cerbos approach we allow you to connect
to a Git repo, an S3 bucket,
a local database, or even just pull it
from disk.
But it's down to you to set up what CI pipeline you want to go and fetch that data.
But what you'll immediately run into when you have more than one of these instances is how do you coordinate all these changes and these things that roll out?
So we have a whole commercial product called Cerbos Hub,
which is like a management control plane for these policy decision points.
To use another P*P term for you, it's a policy administration point in the formal architecture, which is where you define your
policies, but it then also manages the rollout and distribution of those.
On the logging side of things, the exact same thing is true. So each decision point is making
decisions, be it Cerbos or some other decision point, they're all distributed, they're all
making decisions, and they'll generally all be generating some sort of audit log. Now
if you're in an infrastructure where you have
a team that's set up
a log collection, you can just have each pod
log out and go and send that off into
some
ELK stack or use Loki or something
like that.
But that is very much going to be in the hands and in the
realms of your infrastructure team, your DevOps team,
your engineering team, not that security
person that's trying to pull out those audit logs for them
to do a security audit, let's say.
So as part of the Cerbos offering
and our control plane, we also do that
reverse collection where we pull in
all the logs and give you a centralized view across
all your decision points, regardless of where
in your architecture and where in your stack
those decisions are made. But the
open core, the open source project that's out there,
you can configure that log sink to where you want to go. One of our community members contributed
a Kafka sink, for example, so you can write off all your logs to a Kafka topic somewhere
inside of your stack. So it's very pluggable from that sense. But really the use case you
need to solve for is not just as a developer, is this permission doing what it should do,
it's solving for the security team, the auditor, the person that needs to be able to get that view of exactly what happened for compliance reasons,
which is a different kind of viewpoint
on top of the same data
and needs to be thought about
from an architectural perspective as well
when you're designing that stack.
Yeah.
And especially, right,
I mean, in largely distributed systems
that are scalable and stateless,
it means if I get a token and I make one request, I get to one instance of that microservice.
I do another one, I might get to another.
So in order to see really what I did, what I was allowed to do, I need to look at the
whole picture.
So very curious now from an observability perspective, because this is where Brian and
I live in.
So I understand that the logs are coming in, so I can use this for
auditing purposes. I say, show me everything that this person with this token
did. Are there any other use cases?
Do you look into things like, hey, there was unusual activity
from one particular user? What are some of the other things you can extract out of the logs?
Yeah, absolutely.
So if we're looking at just the audit logs,
firstly, show me what actions the user did.
There's this whole world of SIEM systems,
which is Security Information and Event Management type tools,
which you can funnel those logs off to
and then start doing more of that
behavioral anomaly detection type work.
And there's some of that that's in Cerbos and some of it,
which we're kind of partnering with as well.
But from a business side of things,
being able to identify, okay, there's two groups.
This kind of behavior is unexpected.
This person is doing some action that hasn't been done before.
This role is maybe doing things that are unusual,
that it wasn't doing last week and is now doing this week in high volume.
And that's kind of group one.
The other side of things, which is more important from a
policy definition process is like, this role exists, but it's not being used, or this role
has a certain action, but it isn't commonly being accessed across your stack. You may want to
consider scoping down the privileges that a particular role has
and really reducing the possible kind of blast radius
and least privilege type approach to what the identities can do,
be it on identity level or role level.
And these are two things that you can get out of the authorization logs
that you wouldn't normally get from just plain old application logs.
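Both groups of insight can be sketched with a few lines over the decision log. This is a rough illustration only: it assumes each allowed decision has been reduced to a `(role, action)` pair, and `new_behaviour`, `unused_grants` and the sample data are hypothetical names and values, not part of any product.

```python
from collections import Counter


def new_behaviour(last_week, this_week, min_count=1):
    """(role, action) pairs seen this week at least `min_count` times but never last week."""
    seen_before = set(last_week)
    counts = Counter(this_week)
    return {pair: n for pair, n in counts.items()
            if pair not in seen_before and n >= min_count}


def unused_grants(granted, observed):
    """(role, action) pairs that policy allows but nobody has exercised."""
    return set(granted) - set(observed)


# Allowed decisions from the audit log, reduced to (role, action) pairs.
last_week = [("support", "view_ticket")] * 40
this_week = [("support", "view_ticket")] * 35 + [("support", "export_customers")] * 500

# What the policies currently grant.
granted = {("support", "view_ticket"),
           ("support", "export_customers"),
           ("support", "delete_ticket")}

print(new_behaviour(last_week, this_week, min_count=100))  # group one: unexpected activity
print(unused_grants(granted, last_week + this_week))       # group two: candidates to scope down
```

The first result is a signal to alert on; the second is input for a least-privilege review of the policies.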
Cool. That's interesting.
So how about, I guess, similar with what you just said with certain policies are hardly ever used.
You may want to re-scope. I think that's an interesting one.
Or I guess also when you have all of a sudden a high number of failed policy checks,
somebody tries to do certain things, but it doesn't allow them.
So again, it could be alerting because somebody tries to hack the system.
Or it could be that an end user believes they have a certain capacity
or a certain privilege, but they don't.
And then this could also then cause some policy change as well.
Yeah, and there's that end user getting that
permission denied message, they raise a help desk ticket,
that thing goes back to someone who needs
to be able to unpick what those rules are, and if that
rule or that logic is hard-coded in the application,
you're now all the way back to the developer's team
that needs to go and unpick some
case switch statement that was maybe
slightly incorrect when it was first implemented.
Whilst if you had externalized that into some policy format, which is human-readable,
you don't have to know Java, you don't have to know Go, you don't have to know TypeScript or whatever.
It's in some sort of human-readable format. We use YAML.
Love it or hate it, it is a bit more human-readable than application code.
There's great tooling around it as well. But now there's a clear definition of what these rules are
that someone that isn't deep
in a particular code base can actually understand.
You can reduce the amount of headaches that engineers may have to deal with or tickets
engineers have to deal with because of understanding of what the permissioning model needs to be.
And the last question on the observability piece.
So I understand that what we can do with the audit logs,
there's many different things we can read out of it.
You also mentioned that the policy system itself,
the policy operator or the policy admin,
exposes probably some metrics, some Prometheus metrics,
some OpenTelemetry.
What are some of the key metrics
or key indicators that performance engineers
should look out for?
Is there anything where you say,
hey, every time you go to one of your users,
you say, hey, do you look at this metric?
Like, I don't know, queue lengths, latency,
what are some of these core KPIs that you look at?
Yeah, it kind of goes back to the original point
when we started talking about performance.
Really, what the end user cares about here
is how quickly they're able to do an action.
And so you really need to be able to understand
that blocking path.
Part of it is going to be authorization.
So how long is that authorization check taking?
So making sure whatever policy decision point
you're using, the policy engine you're using,
is giving you that insight and giving that observability.
So at Cerbos, we're cloud native, born and bred, as it were. So we have baked in support
for OpenTelemetry, we pass it through all the tracing headers. So if you're having the
full OTel stack in your environment, you'll get that full in and out of Cerbos traces.
And not only do we just tell you we're now authorizing it, within the Cerbos traces we'll also tell
you at which phase of the authorization.
So first working out which policy is applicable, then down inside of the policy how long it
takes to go and evaluate each rule within that policy.
This is down to the microsecond level at this point, because it's very, very performant
at this point.
And then if you are getting particular request paths
where you're seeing slow behavior,
and one of the common ones is around
when you want to list items to a user
that they're allowed to access to,
kind of like an index or a filtering page,
the kind of old way of doing it
is you just whack a where clause in your database
which enforces your policy rules.
When you externalize authorization,
you run into a bit of a challenge
because now that where clause
that filters your results based on permissions
is dynamic based on what your policies are.
So with Cerbos, we can actually generate
what ultimately that where clause needs to be
dynamically based on policies.
And inside of the observability we give you,
we can show you how long that query plan
has taken to generate.
And a lot of the time,
it will highlight to someone that's looking at
performance, performance engineer, hey, there's a particular condition, there's a particular check going on inside
of this policy which isn't performing as we expected.
And it will highlight, okay, maybe it's going to look at how this particular condition,
this particular rule is being evaluated.
So you can get very, very granular insights inside of the traces, all your top level Prometheus
metrics. Cerbos is written in Go, so you get all the standard goroutine observability, and then also things like the policy cache size, how
many policies it's holding in memory, the evictions, and all those kind of best practices
in terms of how a system is behaving across your stack and then fitting nicely into the
tool chain, which everyone's kind of used to at this day and
age, hopefully. But yeah, that full request lifecycle insight is passed through
Cerbos and you'll see it inside of your spans, inside of your traces.
And one pattern that Brian and I have discussed since the beginning of this podcast.
I know what this is.
The N plus one query problem.
You brought up an example where, let's say,
in an app, I do a search,
I get a list of, let's say, 100 entries.
It's like people,
and I'm only allowed to see
the names of certain people
that might be within my group.
That's a great example, I guess,
of using policies.
Am I allowed to see the telephone number, for instance?
So as a developer, I could implement this in a way where I say, give me all the results and then I iterate through the 100 elements.
And then for every element, I basically make a policy call.
So this would be a classical data-driven performance problem because... Oh, yes. Is there anything in modern authorization,
like with Cerbos, for instance,
where you can also make batch calls,
where you can say,
hey, this is a list of things I want to validate?
Yeah.
So there's kind of two operations to solving
what's called kind of the listing problem.
One is kind of the brute force one,
where you go and query all your records out,
and then one by one you do that check.
And that could be a one by one or a batch.
You can configure a batch and do as many checks as you want in one go
into a single RPC and get a result back.
But from a performance perspective, that is not good.
Because firstly, that batch you're sending in to check
could be five records or it could be 10 million records,
depending on what your system does.
And particularly if you then go through and check every single one
and it's actually denied, denied, denied for all of them,
you just wasted all that compute
because that person has a role that means it's always denied
or doesn't have a role that means it's always denied.
Why have you gone through and checked all those records, etc.?
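The difference between checking one by one and checking in a batch can be sketched as follows. `FakeDecisionPoint` is a hypothetical in-memory stand-in that only counts round trips; its methods and its toy rule are assumptions for illustration, not the API of Cerbos or any other decision point.

```python
class FakeDecisionPoint:
    """Hypothetical stand-in for a policy decision point; counts round trips."""

    def __init__(self):
        self.round_trips = 0

    @staticmethod
    def _decide(principal, resource):
        # Toy rule for illustration: you may view records in your own team.
        return resource["team"] == principal["team"]

    def check(self, principal, resource):
        """One call per record: the N plus one pattern."""
        self.round_trips += 1
        return self._decide(principal, resource)

    def check_batch(self, principal, resources):
        """One call for the whole list of records."""
        self.round_trips += 1
        return [self._decide(principal, r) for r in resources]


andy = {"id": "andy", "team": "platform"}
records = [{"id": i, "team": "platform" if i % 4 == 0 else "sales"} for i in range(100)]

one_by_one = FakeDecisionPoint()
visible_a = [r for r in records if one_by_one.check(andy, r)]

batched = FakeDecisionPoint()
decisions = batched.check_batch(andy, records)
visible_b = [r for r, ok in zip(records, decisions) if ok]

print(one_by_one.round_trips, batched.round_trips, len(visible_a), len(visible_b))
```

Batching removes the per-record round trips, but every record is still fetched from storage and evaluated, including the ones that end up denied.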
So the solution to that problem, which is something that's pretty unique to Cerbos,
is what we call generating a query plan.
So when you look at your policies, you're basically defining all the rules
under which a particular action can be done by particular roles
and under particular conditions.
And inside of that request where you want to generate that list page
or that index page of resources, you can say to the decision
point service, I've got this user, I know who they are, I've got their roles, I've got
their attributes, I know they're trying to request something against a particular resource
type, and they're trying to view a resource.
So this user, you've got Andy, he's a manager, he's in this team in this region, et cetera,
and he's trying to view the employees.
What you can do is say to a system like Cerbos,
what are the conditions that must be true for Andy to be able to access this particular resource type?
And what the authorization system, in this case Cerbos, will do,
is rather than just giving you back an allow or deny,
it will go through and generate and look through all the policies.
What are the conditions that must be met for a record to be accessible by this particular
profile, by this particular principal?
And that condition tree is going to be dynamic based on every single request that comes in.
It's going to be driven by all the request context that's available.
So time of day, IP address, region, data center, your profile information, your principal information,
your groups, your teams, your roles
and then it will take those policies, it will evaluate
as many conditions as it can and will basically
return one of three answers. It will say back
to your system, based on those policies
Andy has access to all of them
therefore you don't need to filter it from
an authorization perspective, obviously you want to do pagination
and stuff
and so your where clause now
can just be a select all.
It may decide that
actually, based on your roles, you don't ever
have access to this resource type at all.
Therefore you don't even need to go and query the database. We'll just say,
based on your current policies, it's
always denied, therefore you don't even need to query
the database. But the most common answer is
a conditional response.
And in the Cerbos world, that conditional response
is an abstract syntax tree of conditions which will say, this attribute of the resource must be this
value, or this attribute must be true, or this attribute must be one of these four values,
or this attribute must be equal to the ID of the person making the request, for example.
So that comes back as an AST. It's in a standard grammar, as you'd expect with an AST. And
then from there, you can then take that and convert it to a SQL WHERE clause.
You can convert it to a Mongo filter.
You can convert it to an API filter,
API call filter,
or whatever's kind of relevant to your use case.
It's agnostic to any particular storage technology.
So we've published open source adapters
for things like Prisma, SQLAlchemy, Mongoose,
and a few others.
And what this now means is
you're essentially taking your authorization logic,
which is in your policies,
and then pushing it down into the database
and then letting the database do the filtering.
But the key thing is that filter is fully dynamic.
So whenever the business changes the policies,
on the next call,
that query is going to be slightly different
or that query is going to be different
based on the person making the call.
And that way it's optimized
because you're only ever going to pull out of the database
what that user actually will have access to.
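The three possible answers and the translation of a condition tree into a database filter can be sketched like this. The plan shape (`kind`, `op`, `field`, `args`) is a simplified, hypothetical format invented for this example; it is not the actual grammar Cerbos returns, and the published adapters mentioned above are not used here.

```python
import sqlite3

ALLOWED_COLUMNS = {"team", "region", "manager_id"}  # guard against unexpected column names


def to_where(node, params):
    """Turn a (hypothetical) condition tree into a parameterised WHERE fragment."""
    op = node["op"]
    if op in ("and", "or"):
        parts = [to_where(child, params) for child in node["args"]]
        return "(" + f" {op.upper()} ".join(parts) + ")"
    if node["field"] not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {node['field']}")
    if op == "eq":
        params.append(node["value"])
        return f"{node['field']} = ?"
    if op == "in":
        params.extend(node["values"])
        marks = ", ".join("?" for _ in node["values"])
        return f"{node['field']} IN ({marks})"
    raise ValueError(f"unsupported operator: {op}")


def plan_to_sql(plan, table="employees"):
    """Return (sql, params), or None when the database need not be queried at all."""
    if plan["kind"] == "ALWAYS_DENIED":
        return None
    if plan["kind"] == "ALWAYS_ALLOWED":
        return f"SELECT id FROM {table} ORDER BY id", []
    params = []
    where = to_where(plan["condition"], params)
    return f"SELECT id FROM {table} WHERE {where} ORDER BY id", params


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employees (id INTEGER, team TEXT, region TEXT, manager_id TEXT)")
db.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (1, "platform", "emea", "andy"),
    (2, "platform", "amer", "andy"),
    (3, "sales", "emea", "brian"),
])

# Conditional answer: same team as the requester AND in one of their regions.
plan = {"kind": "CONDITIONAL", "condition": {"op": "and", "args": [
    {"op": "eq", "field": "team", "value": "platform"},
    {"op": "in", "field": "region", "values": ["emea", "apac"]},
]}}

sql, params = plan_to_sql(plan)
visible_ids = [row[0] for row in db.execute(sql, params)]
print(sql, params, visible_ids)
```

Values travel as bound parameters rather than being spliced into the SQL string, and the filter is rebuilt on every request, so a policy change shows up on the next call without touching application code.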
Cool.
Yeah, that's perfect.
I mean, and I just hope that every developer
who is going to use Cerbos
understands that these different options are available
and not just fall into the easy trap
of just doing it in a way that, you know,
ends up with the N plus one query problem that we
have seen too many times
in our life.
Just requesting too much data
and then iterating over it and then making individual
calls.
It was always traditionally way back
it was the database.
And then as microservices came in,
I was like, oh, you can do that between microservices now.
It's like it finds a way to infest everything it can, you know.
Can I just round back really quickly?
Because I think you said something really important earlier that I want to confirm, at least important to me.
It sounds like you were saying that Cerbos comes pre-instrumented with OpenTelemetry for the key pieces, right?
I just want to first thank you for that.
This has been the promise of OpenTelemetry.
I don't have my finger on the pulse of all commercial offerings and all that,
but there was this idea that all these commercial offerings
and even open source offerings can come pre-instrumented with the key components of what
it takes because you all know what's important for troubleshooting this stuff.
Right.
And I don't think I've come across or heard of anyone doing that yet.
I don't know if that's the case in reality, but to hear it there,
I just think, you know,
definitely a hats off to you because that's what we
were, that was the goal. I mean, I don't want to say it's the goal, but that was one
of the dreams, right? Every vendor was going to bake it in, you pop it in, you get what you need, right? So
I just really got to say that's amazing. The other quick question I had on the side of that
is, obviously this code has to be performant. Are there certain languages that you see that are used?
I remember back when I used to work with a lot of trading people,
a lot of the back-end trading apps were written in C++
because they were dealing with half-millisecond response times
and things like that.
Are there code bases that lend themselves better to this these days?
Because I know a lot has changed performance-wise.
Yeah, if you look across the different authorization solutions out there, some are open source,
some are commercial products, some are libraries, some are packages, some are vendor specific,
etc.
The overriding commonality between them all is Go.
Okay, that makes sense.
There's other kinds of solutions out there in Go.
If you're in the cloud native space, that's obviously very prevalent.
And for us, that was our experience from previous businesses
where we had to build these sort of low latency,
high throughput systems.
Go was always kind of the go-to choice.
I'm sure someone will say rewrite it in Rust at some point.
But right now, we're sticking with Go for that.
And it also has the best-in-class support,
we think, at least for the observability side of things as well.
It just kind of fits nicely in the ecosystem.
We can do the ARM64 builds and all that as well at the same time to keep things, be able
to run it anywhere.
Plus, it's all open source, so you can go fork it and it's not in an obscure language.
We encourage community contributions as well, so we wanted to keep it something that's pretty
common in this tech stack.
Great.
Hey, Brian. I think we're running up at the end of time for you.
I think so too, yeah. But honestly, first of all, Alex, thank you so much for demystifying
the difference between authentication and authorization.
Thanks for playing along on that use case we came up with in the beginning with the hotel. So I think we
learned quite a bit. Also thanks for, as Brian just said, contributing back to
open source and following these standards. It's a huge topic, right, authorization. I guess,
you know, I've been a developer for many, many years, but when I
developed, it was actually never that big of a
concern for me because of the type of software
that I developed, because I just assumed
my code runs with
the right authorization.
But yeah, folks, if you listen to this,
I guess we have a lot of
links that we add to the description as always
we will also add your
LinkedIn profile link
so that people can follow up with you.
And yeah, I would love to have you back at some point
because this is a topic that will stay relevant
and probably even more relevant in the years to come.
Absolutely, yeah.
Very happy to, and thanks for having me.
It's been great chatting.
Yeah, I think this has been awesome.
I don't really deal with the code side as much,
but I think like a lot of people,
we all use permissions and set up who can do what
no matter what we're doing.
And I don't think I've ever thought of
what's behind that ever in my life.
And now it's like, oh my gosh,
there's a huge, huge topic behind that.
So thank you for sharing the information and just making me a smarter person or more knowledgeable.
At least I don't know if I'm any smarter, but I'm more knowledgeable.
We appreciate your time.
Thanks for everyone listening today.
We hope you had as much fun as we did.
And we will see you at the, or you'll hear us on the next episode, I guess.
Thanks, everybody.
Bye-bye.
Thank you. Bye-bye.