PurePerformance - The Security and Resiliency Challenges of Cloud Native Authorization with Alex Olivier

Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have my great friend and mockumentary in Andy Grabner. Hello Andy, how are you doing today? I'm really good, I'm really good. I just wanted to know how can I authenticate you? How can I be sure that you are you? Well, if I, you know what?

Starting point is 00:00:50 Well, first, before I answer that question, it was really weird. I called you my great friend, which is like, I hate you. So that's really bizarre that I did that. But if I give you, if I buy one of those heart pendants that you can like snap in half,

Starting point is 00:01:07 and maybe if I give you the other half of it, and then when you meet me, you can be like, show me your heart pendant, and if we put them together, they match up. They match up, yeah. Maybe. Maybe we could do that. Yeah, yeah. And then if I authenticate you, what would you be authorized to do?

Starting point is 00:01:24 Like record a podcast with me? That I'd be authorized to go into your bank accounts. I'd be authorized to do impersonations of you on stage. I got nothing else good there. It's dying. It's dying. You know what? It would authorize us to get to our guest and save us from this pit we're falling into.

Starting point is 00:01:50 Exactly. Let's do this. Authorization, authentication, and so many more things today to learn from Alex Olivier. Hopefully I pronounced the name correctly. Alex, welcome to Pure Performance. Thank you so much for being here today. Thank you very much. Thanks for having me. Looking forward to demystifying some of those two words that sound very similar. Yeah, they do sound very similar. And maybe, Alex, before we jump in, because I really want to kick it off with what's the difference so that we don't get it wrong in the future.

Starting point is 00:02:19 Maybe a couple of words about yourself first. I see on your LinkedIn profile you're the co-founder and CPO at Cerbos. I'm pretty sure you have a long history of things you've done and things you've seen. What do people need to know about you? Yeah, absolutely. So yeah, currently co-founder and CPO at Cerbos. We're in the authorization space, which we're going to talk more about in a bit. But I'm a software engineer at heart. I spent my entire teenage years building bad software

Starting point is 00:02:48 for people. And still, there are CMSs running out there in PHP 4 that I wrote in the early 2000s, which are still on the internet, which is mildly terrifying. But my professional career was initially Microsoft, working on the .NET stack, and then a string of startups in various different industries and verticals from e-commerce to supply chain, fitness. And one of the common things I kept having to build and fix and re-architect in these various systems was access controls and permissions. Hence, Cerbos is now what I'm spending all my time on.

Starting point is 00:03:21 Yeah. Cool. That's great to know, especially the background where you come from working for these organizations that everybody knows. Let's jump into it. Authorization and authentication. What did we get wrong,

Starting point is 00:03:33 Brian and I, when we did our little strange intro? And by the way, hopefully you will never have to be on another podcast where the two hosts are trying to be that funny and kind of mess it up.

Starting point is 00:03:46 I was trying to work out whether this heart necklace would be a second factor, because your first factor is you recognize each other. Your second factor is making sure your necklaces meet. So you had 2FA there, kind of. Yeah. You could do voice, you could do gait detection, all sorts of things.

Starting point is 00:04:01 So authentication and authorization are annoyingly two words that sound very similar but are actually two very different things. And they are interchanged and used by accident all over the place. And the most, I think, obvious place where it's kind of misused is actually in the HTTP spec. If you've ever set an authorization header

Starting point is 00:04:20 and put a JWT sort of token in there, that is actually authentication rather than authorization. In the spec, it's called authorization. It's misused even in something as foundational as the HTTP specification. The way to think about it, authentication is

Starting point is 00:04:38 the whole process that you go through when you log into your email, log into some tool where you are challenged to provide some sort of credential that identifies who you are. So you'll be asked for a username password, you'll be asked for the other half of the heart necklace, you'll be asked for some sort of way of identifying yourself.

Starting point is 00:04:56 That authentication system will then verify that credential and say, okay, yep, this person's gone through the right processes and the right ceremonies, they've done their 2FA, they've done their one-time passcode, whatever mechanism they have. And we can confidently say that this person is who they are and kind of issue that identity. And nowadays, those are typically just JSON web tokens, but they can take all sorts of other forms. But that's the authentication ceremony. You are authenticating that someone is who they say they are. Authorization, on the other hand, is once you now know who someone is, what are they actually allowed to do

Starting point is 00:05:28 inside of your system, inside your application? So I know that you're Andy and you're Brian, but should both of you be able to do the same actions and do the same tasks and perform the same ceremonies inside of a system? That is where authorization comes in. And really unhelpfully, they are quite often reduced down to authN and authZ, which also doesn't really help things because they look very

Starting point is 00:05:51 similar as well. You just turn an N on the side and you get a Z, so it doesn't really help. But authentication, ensuring that someone is who they say they are, and then authorization is, okay, now I know who this person is, can they do X, Y, Z action inside of an application or inside of a service? And I think another maybe example for a physical kind of analogy, because I just traveled, came back yesterday, but when I travel, then I go to the hotel and claim my room.

Starting point is 00:06:21 I obviously need to show my identification because they have my name there. So I basically authenticate myself. Then I get a token, which is a key. And that key then tells the system which doors I'm allowed to enter. Yep. Andy, you're so good at that. It just came out. It just really is.

Starting point is 00:06:38 I know. It's just, he does it all the time like this. I mean, travel is a really good kind of analogy that we use all the time. You turn up at an airport, you show a passport, that is your identity document, and then the border person will decide whether you're allowed in or not based on that identity. So that's authentication and then authorization.

Starting point is 00:06:56 Hey, in preparation for this call there were a couple of things that piqued my interest when I read through some of the other talks you've given and the content that you produced. One is around security, and the other one is around performance and scalability. I want to start with security, because what I didn't know is that authorization has always been, or has been at least recently, in the top 10 list of OWASP common challenges, security challenges. Can you tell us a little bit more and also why that is? Why is authorization such a big attack vector also for security,

Starting point is 00:07:37 for hackers, I guess, as well? Yeah, obviously. So OWASP top 10, for those who don't know, is a standard report that's gone out every couple of years that does big analysis and understanding what are the top issues, the top vulnerabilities, predominantly in web applications is kind of what that particular report is focused on, but the same patterns are repeated in numerous kind of surveys. And the top issue in the last round of that report, number one, was broken access controls.

Starting point is 00:08:00 So in a system, in an architecture, in a service, in an application, a user was able to do something they shouldn't have been able to do, or maybe they got an action that they shouldn't have been able to do. And the access control logic, the literal code inside the application that's determining if this user is an admin, allow the action, if this user is an editor, allow the action under these scenarios, etc., there was a flaw in that logic somewhere, and the access control was ultimately broken.

Starting point is 00:08:23 Now, that could be down on the infrastructure side of things. Not a day goes by where you don't see some S3 bucket leaked because the keys were open or the bucket ACLs were misconfigured. Or it could be down inside of an application. So you're trying to record it through Zoom. Let's say we're trying to go and set up a meeting, and I've been able to maybe go and set up a meeting when I shouldn't have been able to because I don't own the account.

Starting point is 00:08:43 That's an account-level permission. Or it could be more end user experience where I'm interacting with some app we're building and I've been able to do an action I shouldn't have done. And those kind of broken access controls can be anywhere across the stack. It could be at the infrastructure level, it could be the API level, it could be in the application layer,

Starting point is 00:08:59 and it could be on the end client where these permissions and access controls need to be set and defined. And that logic is generally very fragile if it's done in kind of what we sort of classically have seen the way of doing it. And also it's very, very complicated. It's one thing doing it on one system and one microservice. But if you're building a large application, you're going to have sprawling places across that stack where you need to define your access control and define your permission logic.

Starting point is 00:09:28 And that is kind of the authorization problem, which has really come to the forefront in recent years. And really a lot of the reason why focus has gone on to that now is authentication was the top problem before. But we're kind of in a world now where that's sort of a solved quote unquote problem. And, you know, there's great tools, great vendors out there that give you a full identity, an IAM IDP type system. There's whole open source projects around it, and more importantly, there's actually an open standard. So OAuth 2, OpenID Connect, these things that anyone that spends any time

Starting point is 00:10:00 building software will probably run into when trying to decide how to create the system. And so the authentication problem is kind of solved to a certain degree. There's great tooling for it. And now the focus has really shifted to okay, now I know who this person is, what can they actually do? And there's both kind of business needs for it, but also regulatory changes that are coming through in recent years around how best to approach this. And you go and look at some of the standards agencies like NIST, the DoD in the US, they've been publishing white papers

Starting point is 00:10:26 over the last few years around how to build the zero trust architecture. And I say zero trust in air quotes because it's a very overloaded term, I think, these days. But the best practice is now shifting from, okay, your authentication is now solid because of good standards, good architectures, good projects out there. Now let's go and really nail down how to do authorization in a scalable,

Starting point is 00:10:50 auditable way. So maybe to try to extend my analogy earlier with the hotel room, does it mean like I authenticate myself against the system, right? For instance, I get my ticket or my key card to enter the building, right? And normally I'm only allowed to go into my room. But maybe I sneak, maybe I kind of, what's it called? If I kind of follow somebody, if I tailgate somebody and all of a sudden, you know, get into the executive lounge, even though I'm not supposed to be there.

Starting point is 00:11:29 So that's kind of the idea that hackers are trying to, let's say, log in with very low privileges, what they can do into a system. And then once they are in the system, you try to explore and exploit ways to gain more access than you're supposed to have. Yeah, from an exploit point of view, that sort of escalation attack where you've got one identity

Starting point is 00:11:53 and being able to jump up and escalate your permissions definitely falls in that bucket. And a lot of the time, it's not so much that they've found a weakness, it's more like there's been a misconfiguration. So broken access control, if you actually read what OWASP has said, is more about there's just been a misconfiguration because this stuff is so complicated when you're at scale or you have a large enough system.

Starting point is 00:12:12 Defining those rules is really a delicate piece of work and then being able to test those rules and to test those access policies ultimately is a very fragile and important part of the system you need to get right. And if you look at how systems have been built up until fairly recently, a lot of this logic was

Starting point is 00:12:29 deeply coupled into the code base. So you have a request come in, we have an API call coming in, that's authenticated, there's a token associated with it, we now know that user's identity, that request will make it down to some service that's going to handle it, and then inside of that service, based on what the requirements were of what that application needs to do,

Starting point is 00:12:47 you're now going to have this typically hard-coded logic that says, if user role is admin, then allow the request. If user role is manager, then only allow the request if the particular resource they're trying to access belongs to their team. If that access is just viewer, only allow the view action. And that's usually hard coded. And that is where issues and holes sneak in, because you have to think of every single permutation of what that access pattern could be and under which conditions, and to extend the hotel analogy.

Starting point is 00:13:19 It's not just whether you have that key, it's whether that key belongs to that room, and whether the lifespan of that key is still valid. So the checkout is at midday, your key stops working at midday. That's further logic that has to be hard-coded or defined inside of the system. And if that logic isn't put in and tested properly, that's when you're going to start running into these kind of access control issues. Yeah, it almost sounds like because authorization

Starting point is 00:13:47 no, yes authorization, see? See, it's pretty easy. The menu of authorization capabilities for a lot of software and different platforms that you log in has become

Starting point is 00:14:04 so much more fine-grained or so much more granular that this is opening the problem. To extend it to the hotel analogy, if you go to a large, named theme park type of place that has the hotel, oftentimes you can start by saying, I'm going to put my credit card on my key card as well so that I can use it to buy stuff. But then that might also become your park ticket.

Starting point is 00:14:30 It might become a bunch of other things. And now you went from a simple key to a pass that you opt in and authorized to do a bunch of stuff. And then maybe your kid has one. And what can your kid do? And is it turned on for the kid or not for the kid? And because of all these new options coming through, it's not just you're an admin, right? Well, you're an admin for one portion of the tool now.

Starting point is 00:14:52 You're an admin for two portions of the tool. Other ones you have viewers on. Others you have no access to. And we even see that, you know, the first time I saw that was, and this is not my field, obviously, but, you know, when you start playing with the, when you start looking at all the permissions in AWS, it's like, oh my gosh, right? And then we've expanded our own permissions

Starting point is 00:15:13 to have a lot more flexibility. And it's almost like when going to service mesh, right? To bring it from a coding side, suddenly now you have this complete map or this gigantic map of who can talk to who that you have to manage carefully. Which also ties into the authorization, right? Because you have to control who can talk. This happens a lot on these podcasts, right? We start talking about this topic and suddenly

Starting point is 00:15:39 the gears start creaking slowly. You're like, oh my gosh, this is such an intense, I don't want to say problem, but an intense, not concept, an intense thing to manage, right? And to stay on top of, you know, from the teams who are looking at a big picture. Yeah, and if you kind of look at, on your point around sort of granularity and how fine-grained these checks are, if you go back five, maybe ten years, everyone's kind of familiar with the idea of RBAC, Role-Based Access Control, where you're simply checking,

Starting point is 00:16:14 someone has a role, yes, they can do an action, it's just like a Boolean yes or no. But nowadays, what most people are doing, either implicitly or explicitly, is Attribute-Based Access Control, ABAC, where it's not only what role you have, it's whether there's attributes about yourselves or the particular instance of a resource or a particular object you're interacting with, whether you should have access or not. And kind of the best example of this is, imagine like a blog, blogging system, you know, your

Starting point is 00:16:43 typical CMS thing that I think everyone, when they start out coding, builds at some point. You want to have some rules in there that says only the author of an article can edit an article, let's say. So now it's not working out whether I have the role of author, therefore I should be able to do everything. It's, okay, this person making the request is the author. Now I also need to check the attribute of author ID on the article is equal to the ID of the person making the request.
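
The difference between the role-only check and the attribute check described here can be sketched in a few lines of Python. This is only an illustration of the blog example; the `User` and `Article` types and both function names are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    roles: frozenset


@dataclass(frozen=True)
class Article:
    id: str
    author_id: str


def can_edit_rbac(user: User) -> bool:
    # RBAC: a plain yes/no on the role, regardless of which article it is.
    return "author" in user.roles


def can_edit_abac(user: User, article: Article) -> bool:
    # ABAC: the role must match AND the article's author_id attribute
    # must equal the id of the person making the request.
    return "author" in user.roles and article.author_id == user.id


alice = User(id="alice", roles=frozenset({"author"}))
bob = User(id="bob", roles=frozenset({"author"}))
post = Article(id="a1", author_id="alice")
```

With these values, the role-only check lets `bob` edit Alice's article because he also holds the author role, while the attribute check allows only `alice`.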

Starting point is 00:17:06 And defining that logic is when you start hard coding these rules into your application code, which is all fine for a small system. But we're now in the world of distributed microservices, we're in the world of systems that are pulling data sets, we're in the world of very large architectures, hybrid cloud, you name it, we've seen it all. And now we're in a world where as soon as you need to change that business logic, which fundamentally will happen, you can set a rule,

Starting point is 00:17:33 but I guarantee you will change it. And I say that as someone that's had to go through that pain and go through that process at numerous companies. There was one particular company I worked for where we built an access control system and within the space of a year we had to rebuild it three times because the requirements kept changing as the business evolved. And you'd have to go back and rework this matrix of logic that was distributed across every request handler and every gateway and every middleware, etc.

Starting point is 00:17:59 So what's happened now in recent years is this movement towards what the analysts out there, the analyst firms, call externalized authorization or decoupled authorization, where essentially you're taking that logic, that hard-coded if statement, that where clause on your SQL query, for example, and pulling it out into a policy, a policy file, and then you have what's called a policy decision point, a PDP. And that centralizes the definition of who can do what under which conditions inside of a system.

Starting point is 00:18:30 And then your policy decision point is just another service that's running inside of your infrastructure, inside of your service mesh, running as a sidecar, run it wherever it makes most sense for your system. And then in your application code, at every point where you would have hard-coded that if-else case switch style logic, you're now just making an API call out to that sidecar or to that other service in your mesh saying, I have this user

Starting point is 00:18:51 with this identity trying to do this action on this particular resource. Then that's evaluated against your policy decision point, which has your business logic defined as policy loaded into it, and that will come up with a decision, either an allow or deny in most cases, and then it goes back to your application code, and now your application code is a single if statement. If the decision point says allow, do the action. If not, deny. What that now means is there's a single source of truth for what your authorization logic should be. It's those policy files. They can be versioned, they can be tested, they can be fully audited, keep them in your source control system.
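
As a rough sketch of the externalized pattern described above: the rules live in one policy table, a decision function evaluates them, and the application code shrinks to a single if statement. Everything here is an assumption made for illustration. The policy format, `check`, and `edit_article` are hypothetical names, and in a real deployment `check` would be an API call to a separate decision point service rather than an in-process function.

```python
from typing import Any, Callable, Dict, List

Attrs = Dict[str, Any]

# Hypothetical policy table: the single source of truth for who can do what.
# In practice this would live in versioned policy files, not in app code.
POLICIES: List[Dict[str, Any]] = [
    {"resource": "article", "action": "view",
     "condition": lambda p, r: True},
    {"resource": "article", "action": "edit",
     "condition": lambda p, r: "author" in p["roles"] and r["author_id"] == p["id"]},
    {"resource": "article", "action": "delete",
     "condition": lambda p, r: "admin" in p["roles"]},
]


def check(principal: Attrs, resource: Attrs, action: str) -> str:
    """Stand-in for the policy decision point: returns ALLOW or DENY."""
    for rule in POLICIES:
        if (rule["resource"] == resource["kind"]
                and rule["action"] == action
                and rule["condition"](principal, resource)):
            return "ALLOW"
    return "DENY"  # nothing matched, so deny by default


def edit_article(principal: Attrs, article: Attrs, new_body: str) -> Attrs:
    # The application code is now a single if statement on the decision.
    if check(principal, article, "edit") != "ALLOW":
        raise PermissionError("edit denied")
    return {**article, "body": new_body}


alice = {"id": "alice", "roles": ["author"]}
bob = {"id": "bob", "roles": ["author"]}
article = {"kind": "article", "id": "a1", "author_id": "alice", "body": "draft"}
```

Changing who may edit then means changing one entry in `POLICIES`; `edit_article` and every other caller of `check` stay untouched.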

Starting point is 00:19:25 And then when you want to change your logic, there's one place to update it, and then every microservice, every Lambda function, every component in your stack, as long as it's calling a decision point to get its authorization checks, you're going to get a consistent answer across your entire architecture without any kind of extra work or any extra effort for the actual application teams who are building it. It can almost now be offloaded to a SecOps or even product teams to manage the authorization logic themselves because ultimately they're the ones with the requirements.

Starting point is 00:19:53 This is a fascinating kind of segue now over to the second big thing is performance and scalability because this PDP, which thanks for that, I did not know that it's called policy decision point. So if everyone, if you think about high volume transactions in an application and every transaction needs to authenticate or call the PDP and it's a centralized system, obviously there's a lot of load on it.

Starting point is 00:20:21 There's a lot of, if this system doesn't scale, if this system is not reliable, then obviously it will either slow down things in my regular transaction, or if it fails to produce, then for the end user, even though they might be logged in and all of a sudden they try to click the edit button, the author tries to edit his or her own post, but all of a sudden that PDP check goes wrong for this particular request, then I'm wondering, hey, what's happening now? So how do we, I mean, is that a real big challenge

Starting point is 00:20:55 that these systems become so central and so critical that they obviously have to have the highest standards of availability and resiliency? Yeah, so just to go a bit with my background again. So one of the businesses that I was both an engineer at one point and then later became the product architect for was a system that was ingesting clickstream data. And we needed to authorize whether data should have come in and whether someone could query it. We were doing about a quarter of a million requests a second.

Starting point is 00:21:24 So every mistake you can make at large-scale data processing and large-scale distributed systems, I have made at some point. And one of those systems that failed numerous times due to how we originally architected it was around how we handled authorization. So the whole reason we started Cerbos and kind of working in this space is coming from a pain point we had ourselves of having to model this authorization logic and then also do it at large scale. So from a performance perspective, if you look at the difference again between authentication and authorization and how that fits into here.

Starting point is 00:21:56 Authentication is generally a one-off ceremony, an interactive one-off ceremony that a user will do. So every 15 minutes, let's say your sessions are, you'll be prompted, username, password, whatever, and you'll get a token back that's valid for 15 minutes, an hour, 10 seconds, whatever your requirements are. Once you've got that token, you can verify it without having to go and hit that service because of JSON key sets and all that kind of infrastructure that's been set up around it. So authentication is kind of almost like a cache concern at that point.
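
To make the "verify it locally" point concrete, here is a minimal sketch of checking a JWT-shaped token on the node with no call back to the issuer. It deliberately swaps in a simplification: a shared HMAC secret (HS256) and only the Python standard library, whereas the key sets mentioned above would normally hold public keys for asymmetric signatures. The helper names `sign` and `verify` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    # Restore the padding that the JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims: dict, key: bytes) -> str:
    """Issue a token (normally done once by the authentication service)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"


def verify(token: str, key: bytes, now: float) -> Optional[dict]:
    """Check signature and expiry locally; return the claims or None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None
        expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode(),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
            return None
        claims = json.loads(b64url_decode(payload_b64))
    except ValueError:
        return None  # malformed token
    if now >= claims.get("exp", 0):
        return None  # expired, the user has to authenticate again
    return claims


SECRET = b"demo-secret"  # illustrative only
token = sign({"sub": "andy", "exp": 1000}, SECRET)
```

Every service holding the key can call `verify` as often as it likes without touching the authentication service, which is why the token behaves like a cached result of the login ceremony.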

Starting point is 00:22:23 There's obvious areas that you need to make sure you can invalidate tokens, etc. And there's great work going on in the standards committees around that. But you don't need to keep hitting that service. You can do it essentially locally on the node using key sets. Authorization, on the other hand, if you're doing anything beyond simple role-based checks, where you check the token whether someone has a claim and allow or not, if you're doing anything that is contextual based on details of the user, details of the resource and the action they're doing, or even just the request context,

Starting point is 00:22:52 then by definition it can't be cached. So if you're making an authorization decision going back to that hotel keycard example, we can't cache that decision because first we need to validate that keycard is active. We need to validate that that key card is assigned to that room, and we also need to do a time check. And if we just cache a decision for an hour, well, if that key expires within the hour, it's going to let you in the room when you shouldn't be let in. So authorization needs to be done on every single call, and it needs to be evaluated

Starting point is 00:23:21 on every single call. And that is in the blocking path of every single request. It's in the key pipeline. So performance is fundamental to how you need to design, build, and architect authorization. There's different schools of thought for how to do this. There's ways of doing that statefully. There's ways of doing that stateless.
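
The keycard example can be turned into a small sketch of why a cached decision goes stale. The card layout, the one-hour TTL, and the function names are all hypothetical; the dates are arbitrary.

```python
from datetime import datetime, timedelta


def can_open(card: dict, door: str, now: datetime) -> bool:
    """Fresh evaluation: card active, assigned to this room, not yet expired."""
    return card["active"] and card["room"] == door and now < card["expires_at"]


_cache: dict = {}


def can_open_cached(card: dict, door: str, now: datetime,
                    ttl: timedelta = timedelta(hours=1)) -> bool:
    """Naive version that reuses a previous decision for up to `ttl`."""
    key = (card["id"], door)
    hit = _cache.get(key)
    if hit is not None and now - hit[1] < ttl:
        return hit[0]  # may be stale: the time check is skipped entirely
    decision = can_open(card, door, now)
    _cache[key] = (decision, now)
    return decision


# Checkout is at midday, so the card stops working at 12:00.
card = {"id": "card-1", "active": True, "room": "402",
        "expires_at": datetime(2024, 1, 1, 12, 0)}
before_checkout = datetime(2024, 1, 1, 11, 30)
after_checkout = datetime(2024, 1, 1, 12, 15)
```

If the door is first opened at 11:30, the cached variant still answers "allow" at 12:15 because the entry is only 45 minutes old, while a fresh evaluation at 12:15 correctly denies.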

Starting point is 00:23:40 We at Cerbos lean on the stateless side of things, but there's also architectural decisions to be made around what's best for your system. From a performance perspective, if you were to go down kind of the stateful route, what you're going to have to do is basically replicate all the information that's relevant for authorization about your users or your resources out of their underlying data stores and push it up into some sort of distributed cache at your authorization layer that can handle the external load that you expect. So if you're a consumer application with millions of users, you need to make sure this thing

Starting point is 00:24:13 can scale. And then you need to also make sure that whenever there's a change to the underlying data, you have to go and synchronize all that state continuously. And distributed caching, cache misses, all those fun problems that come with distributed things, authorization is a distributed data problem at the end of the day, come into play. The other approach, which is how we designed Cerbos, is the stateless approach,

Starting point is 00:24:34 where the actual decision point gets all its inputs at request time. So it's down to the service that's doing the check to provide, here's the identity, because they've got it from the token or they've got it from a request header, etc. And here's the resource they're trying to interact with. So to go back to the previous example

Starting point is 00:24:53 with a blog system, request comes in, we verify the token, we've got the identity, we go out to our database, we go and grab the particular article that the person's trying to interact with, and then we pass that to the authorization layer. The policy decision point, which is the formal terminology for these things, receives a request from the policy enforcement point,

Starting point is 00:25:12 which is the service. The policy decision point will then use that data that's been sent to it, principal, resource, action, go and evaluate that against the policies, come up with a decision, either allow or deny, and send back that decision to the service, which will then enforce that decision. And with that architecture, if you're running at large scale,
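
The request flow just described (enforcement point gathers the data, decision point only evaluates) might look roughly like this. It is a sketch under assumptions, not Cerbos's actual API: `DecisionPoint`, `handle_edit`, the in-memory `ARTICLES` store, and the HTTP-style status codes are all made up for illustration.

```python
from typing import Any, Callable, Dict, Tuple

Attrs = Dict[str, Any]
Rule = Callable[[Attrs, Attrs], bool]

# Stand-in for the application's own database.
ARTICLES: Dict[str, Attrs] = {
    "a1": {"kind": "article", "id": "a1", "author_id": "alice"},
}


class DecisionPoint:
    """Stateless PDP: holds only policies, everything else arrives per request."""

    def __init__(self, policies: Dict[Tuple[str, str], Rule]) -> None:
        self._policies = policies

    def decide(self, principal: Attrs, resource: Attrs, action: str) -> str:
        # No database lookups, no outbound calls: purely in-memory evaluation.
        rule = self._policies.get((resource["kind"], action))
        allowed = rule is not None and rule(principal, resource)
        return "ALLOW" if allowed else "DENY"


POLICIES: Dict[Tuple[str, str], Rule] = {
    ("article", "view"): lambda p, r: True,
    ("article", "edit"): lambda p, r: r["author_id"] == p["id"],
}


def handle_edit(pdp: DecisionPoint, token_claims: Attrs, article_id: str) -> int:
    """Policy enforcement point: the service that handles the request."""
    principal = {"id": token_claims["sub"]}   # identity from the verified token
    resource = ARTICLES[article_id]           # the service fetches the resource
    decision = pdp.decide(principal, resource, "edit")
    return 200 if decision == "ALLOW" else 403  # enforce the decision


# Two replicas loaded with the same policies are interchangeable.
pdp_a = DecisionPoint(POLICIES)
pdp_b = DecisionPoint(POLICIES)
```

Because each `DecisionPoint` instance carries no per-user or per-resource state, any number of replicas can be started next to the services that need them and will return the same decision for the same request.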

Starting point is 00:25:32 high throughput, distributed, that model actually now scales with you because the decision point itself is fully stateless. The only thing that's loaded into it is the actual policies, and everything else is provided at request time. So from an architecture perspective,

Starting point is 00:25:45 if we all live in the Kubernetes world, which most of us do nowadays, I think, that means you can just put it as a sidecar inside of every single service that you need to do authorization checks in, and then it's a local host, it's an on-node, inside the same pod call from your application to the decision point.

Starting point is 00:25:59 The decision point doesn't need to make any API calls, doesn't need to go to any database, doesn't need to read from disk. It can make a decision in memory using the context that's been given from the application layer and make a decision, which is the scalable way of doing it. Based on your architecture, it may be better to use a stateful approach, but just from our own experience,

Starting point is 00:26:18 having had our feet burnt in a few situations, the stateless approach is the one that will work from a high-throughput, distributed architectural approach. But it's always a decision based on what's best for you. Thanks, first of all, for the explanation and also this approach. But that means as a sidecar, if I have a large application

Starting point is 00:26:42 that runs hundreds and thousands and even more containers, that means I need to factor in the additional, let's say, quote unquote, overhead of that sidecar that I have in every single pod, basically, right? I mean, that's, I guess that's then the decision that you need to make. I guess at some point it definitely pays off, as you said, to have everything kind of locally enforced and locally validated. But I guess it's an interesting trade-off. It's an interesting trade-off, yeah. And obviously now we have all these kind of mechanisms that the infrastructure gives you

Starting point is 00:27:17 or the orchestrator gives you, like things like Kubernetes. Maybe a sidecar doesn't make sense, but really what you want to reduce is off-box calls. So you could go like a daemon set to make sure there's at least one running on each node. Or for smaller applications, there's only a couple of nodes, you know they're going to be co-located with each other. You can just run it as a service inside of it and let the infrastructure figure out how to make that request.

Starting point is 00:27:39 So it really is kind of a trade-off, and we have users that do a bit of a hybrid approach. But from an architectural standpoint, regardless of what decision engine and decision point you use, these are the kind of decisions that need to be made based on what's best for your application. There may be some things that need it as a sidecar, others that can maybe survive as just a daemon set, others that are happy to go and hit some other node somewhere. I would just never recommend doing anything that requires a call over the internet because you're going to be adding tens of milliseconds to every single call to your system. I wanted to ask along the lines of this, Andy, this goes into adding another item that people

Starting point is 00:28:18 have to consider when they design their architecture. We've talked about, they have to consider how we're going to be able to observe this, how are we going to be able to do security, other types of security, not this. There are a lot of different factors of instead of just, let's do it, we now have to consider how are we going to run this? Are we going to run this as a sidecar? A lot of times people do these as afterthoughts. Who in an organization would be the one to be thinking about how should we do this? I wouldn't think it's the security team because they're looking at a

Starting point is 00:28:51 different point of view. Where does that role fit to make that evaluation and decision that you see at least? Yeah, this is one of those ones where authorization as a core concern, as a box on that diagram, is still quite early, and there isn't a clear owner inside of a business, is what we're seeing. It will range, but ultimately it's whoever wears the architecture hat inside of a company.

Starting point is 00:29:19 So we work with lots of small startups. At that point, it's the CTO, it's the first engineer, it's engineering managers, maybe things grow up, through to large enterprises. So we work with like Utility Warehouse here in the UK, one of the largest telecoms providers.

Starting point is 00:29:35 And they have like an architecture team and they're responsible for defining the security architecture outside of the stack. And it kind of fits with them. And then we've seen kind of everything in between. Sometimes it's the product owner or the product manager owns the actual

Starting point is 00:29:48 logic for what the rules need to be. The infrastructure or DevOps team will own the actual decision point service. And then a security team member will be involved to basically provide oversight, to make sure that the authorization logic meets all the regulatory requirements that that business sits under. Because it's all very well us talking about technical requirements all the time and sort of functional, what-the-application-needs-to-do requirements. A lot of the companies that are really taking authorization seriously are also regulated businesses where you have things like data locality requirements. You have things like auditability requirements.

Starting point is 00:30:20 You have to be able to prove it. One of my fun experiences in a previous life: at a previous company, we went through ISO and SOC 2 compliance. Every year we'd get audited. Every year I'd get dragged down to a dark basement by a lawyer and I had to demonstrate what our access controls were inside of our systems. And the first couple of times I did it,

Starting point is 00:30:37 I was sitting there trying to grep through logs in S3, trying to prove we had access controls. Not the best way of doing it, just a heads up. And so the other benefit of taking this kind of externalized approach is that you now have policy defined centrally that's version controlled. You can write tests against it. And then on the other side, on the observability

Starting point is 00:30:54 side, yes, you get all your nice OpenTelemetry insights, Prometheus metrics, all that kind of thing for observing the actual service itself. But you have this other major benefit of having this centralized service for authorization, which is you get a definitive log of exactly this: this user tried to do this action at this time on this particular resource. And it was either allowed or denied

Starting point is 00:31:15 by this particular version of this particular policy, which is gold dust for any security team. Imagine you're running a business, you have some sort of suspected breach, you know this identity was active inside of the system and you need to be able to pull out exactly what that user did. Your authorization service is the source of truth of really exactly what that person did inside of your application. Which is another whole benefit and another kind of observability and logging that are quite commonly a bit of an afterthought if you don't really consider this up front in your architecture. And there's a real kind of benefit as you're working in either a regulated business or working towards some of those standards and compliance certifications.

Starting point is 00:31:53 A quick architectural question to your deployment model. That means you have, let's assume we go the sidecar route. Let's say we have 100 containers, 100 sidecars. That means during startup, they connect to a central policy, whatever, policy operator, what do you call it? Yeah, so a policy store of some sort. Policy store, yeah. And that's basically then, so that policy store makes sense that whenever policies get changed, that the change also gets obviously distributed to all of the different instances. I guess that's a push because you want to have it immediately pushed

Starting point is 00:32:31 any changes. Now from a centralized auditing perspective, because auditing is obviously the key use case, does this mean that every sidecar is then also sending that information back to the central operator? It has to, right? Yes. Yeah, so there's a couple of things here. Firstly, I'll speak about Cerbos, the open source project, and how we solved that, but it's a very common pattern you'll see across all of us out there. So the core of Cerbos, the policy decision point in the engine, is completely open source,

Starting point is 00:33:02 Apache 2 licensed; go and grab it off GitHub. We have hundreds, if not thousands, of companies out there using it today at various scales. And in that model it's a container that you're running, and each instance of that container, as you correctly pointed out, has to go and fetch those policies.

Starting point is 00:33:18 With the Cerbos approach, we allow you to connect to a Git repo, an S3 bucket, a local database, or even just pull it from disk. But it's down to you to set up whatever CI pipeline you want to go and fetch that data. But what you'll immediately run into when you have more than one of these instances is: how do you coordinate all these changes as these things roll out? So we have a whole commercial product called Cerbos Hub, which is like a management control plane for these policy decision points.

Starting point is 00:33:41 To use another P-star-P term for you, it's a policy administration point in the formal architecture, which is where you define your policies, but which then also manages the rollout and distribution of those. On the logging side of things, the exact same thing is true. So each decision point is making decisions, be it Cerbos or some other decision point; they're all distributed, they're all making decisions, and they'll generally all be generating some sort of audit log. Now if you're in an infrastructure where you have a team that's set up log collection, you can just have each pod

Starting point is 00:34:09 log out and go and send that off into some ELK stack or use Loki or something like that. But that is very much going to be in the hands and in the realms of your infrastructure team, your DevOps team, your engineering team, not that security person that's trying to pull out those audit logs for them

Starting point is 00:34:26 to do a security audit, let's say. So as part of the Cerbos offering and our control plane, we also do that reverse collection where we pull in all the logs and give you a centralized view across all your decision points, regardless of where in your architecture and where in your stack those decisions are made. But the

Starting point is 00:34:41 open core, the open source project that's out there, you can configure that log sink to go wherever you want. One of our community members contributed a Kafka sink, for example, so you can write all your logs to a Kafka topic somewhere inside of your stack. So it's very pluggable in that sense. But really the use case you need to solve for is not just, as a developer, is this permission doing what it should do; it's solving for the security team, the auditor, the person that needs to be able to get that view of exactly what happened for compliance reasons, which is a different kind of viewpoint on top of the same data

Starting point is 00:35:10 and needs to be thought about from an architectural perspective as well when you're designing that stack. Yeah. And especially, right, I mean, in largely distributed systems that are scalable and stateless, it means if I get a token and I make one request, I get to one instance of that microservice.

Starting point is 00:35:31 I do another one, I might get to another. So in order to see really what I did, what I was allowed to do, I need to look at the whole picture. So very curious now from an observability perspective, because this is where Brian and I live in. So I understand that the logs are coming in, so I can use this for auditing purposes. I say, show me everything that this person with this token did. Are there any other use cases?
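
To make the "whole picture" point concrete: when requests from one user land on different instances, each instance only holds part of the story, and the per-instance logs have to be merged into a single timeline. This is a small Python sketch of that merge using the standard library's `heapq.merge`. The tuple layout, the pod names, and the helper names are assumptions made for illustration, and each input log is assumed to be already sorted by time.

```python
import heapq


def merged_timeline(*instance_logs):
    """Merge per-instance logs (each already sorted by time) into one timeline."""
    return list(heapq.merge(*instance_logs, key=lambda entry: entry[0]))


def actions_of(timeline, principal):
    """Pick out one principal's actions, keeping which instance decided each."""
    return [(ts, instance, action)
            for ts, instance, who, action, _allowed in timeline
            if who == principal]


# Hypothetical entries: (timestamp_seconds, instance, principal, action, allowed)
POD_A = [(100, "pod-a", "andy", "view", True),
         (250, "pod-a", "andy", "export", False)]
POD_B = [(180, "pod-b", "andy", "edit", True),
         (300, "pod-b", "brian", "view", True)]
```

With these inputs, `actions_of(merged_timeline(POD_A, POD_B), "andy")` interleaves the decisions from both instances in time order, which neither instance's log could show on its own.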

Starting point is 00:35:56 Do you look into things like, hey, there was unusual activity from one particular user? What are some of the other things you can extract out of the logs? Yeah, absolutely. So if we're looking at just the audit logs, firstly, show me what actions the user did. There's this whole world of SIEM systems, which are security information and event management tools, which you can funnel those logs off to

Starting point is 00:36:18 and then start doing more of that behavioral anomaly detection type work. And there's some of that that's in Cerbos and some of it which we're kind of partnering on as well. But from a business side of things, being able to identify, okay, there's two groups. This kind of behavior is unexpected. This person is doing some action that hasn't been done before.

Starting point is 00:36:38 This role is maybe doing things that are unusual, that it wasn't doing last week and is now doing this week in high volume. And that's kind of group one. The other side of things, which is more important from a policy definition process, is like, this role exists, but it's not being used, or this role has a certain action, but it isn't commonly being accessed across your stack. You may want to consider scoping down the privileges that a particular role has and really reducing the possible kind of blast radius

Starting point is 00:37:08 and taking a least-privilege type approach to what the identities can do, be it on identity level or role level. And these are two things that you can get out of the authorization logs that you wouldn't normally get from just plain old application logs. Cool. That's interesting. So how about, I guess, similar to what you just said about certain policies being hardly ever used. You may want to re-scope. I think that's an interesting one. Or I guess also when you have all of a sudden a high number of failed policy checks,

Starting point is 00:37:44 somebody tries to do certain things, but it doesn't allow them. So again, it could be alerting because somebody tries to hack the system. Or it could be that an end user believes they have a certain capacity or a certain privilege, but they don't. And then this could also then cause some policy change as well. Yeah, and there's that end user getting that permission denied message, they raise a help desk ticket, that thing goes back to someone who needs

Starting point is 00:38:12 to be able to unpick what those rules are, and if that rule or that logic is hard-coded in the application, you're now all the way back to the developer's team that needs to go and unpick some case switch statement that was maybe slightly incorrect when it was first implemented. Whilst if you had externalized that into some policy format, which is human-readable, you don't have to know Java, you don't have to know Go, you don't have to know TypeScript or whatever.

Starting point is 00:38:32 It's in some sort of human-readable format. We use YAML. Love it or hate it, it is a bit more human-readable than application code. There's great tooling around it as well. But now there's a clear definition of what these rules are that someone who isn't deep in a particular code base can actually understand. You can reduce the amount of headaches or tickets that engineers have to deal with because of misunderstandings of what the permissioning model needs to be. And the last question on the observability piece.
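
The contrast between a hard-coded switch statement and an externalized policy can be illustrated with rules held as plain data and evaluated by one generic function. This is a hedged Python sketch: the `POLICY` structure only mimics what a YAML document might parse into, and its keys, along with `is_allowed`, are hypothetical and do not follow the Cerbos policy schema.

```python
# Hypothetical policy, written as the kind of plain data a YAML file parses into.
POLICY = {
    "resource": "expense",
    "version": "v1",
    "rules": [
        {"actions": ["view"], "roles": ["employee", "manager"], "effect": "allow"},
        {"actions": ["approve"], "roles": ["manager"], "effect": "allow"},
    ],
}


def is_allowed(policy, roles, action):
    """Deny by default; the first rule matching the action and a role decides."""
    for rule in policy["rules"]:
        if action in rule["actions"] and set(roles) & set(rule["roles"]):
            return rule["effect"] == "allow"
    return False
```

Changing who may approve an expense now means editing one entry in the data rather than finding and rewriting branching logic in application code, and the rules can be read by someone who does not know the application's language.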

Starting point is 00:39:04 So I understand what we can do with the audit logs; there are many different things we can read out of them. You also mentioned that the policy system itself, the policy operator or the policy admin, probably exposes some metrics, some Prometheus metrics, some OpenTelemetry. What are some of the key metrics or key indicators that performance engineers

Starting point is 00:39:27 should look out for? Is there anything where you say, hey, every time you go to one of your users, you say, hey, do you look at this metric? Like, I don't know, queue lengths, latency, what are some of these core KPIs that you look at? Yeah, it kind of goes back to the original point when we started talking about performance.

Starting point is 00:39:44 Really, what the end user cares about here is how quickly they're able to do an action. And so you really need to be able to understand that blocking path. Part of it is going to be authorization. So how long is that authorization check taking? So make sure whatever policy decision point you're using, the policy engine you're using,

Starting point is 00:40:01 is giving you that insight and giving you that observability. So at Cerbos, we're cloud native, born and bred, as it were. So we have baked-in support for OpenTelemetry, and we pass through all the tracing headers. So if you have the full OTel stack in your environment, you'll get the full in-and-out Cerbos traces. And not only do we tell you we're now authorizing, within the Cerbos traces we'll also tell you which phase of the authorization it's in. So first working out which policy is applicable, then down inside of the policy how long it takes to go and evaluate each rule within that policy.
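
The idea of timing each phase of an authorization check separately (policy selection, then rule evaluation) can be sketched without any tracing library. The Python below is a standard-library stand-in for spans, not OpenTelemetry and not the Cerbos implementation; `PhaseTimer`, `check`, and the phase names are hypothetical.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Collects elapsed microseconds per named phase (a stand-in for spans)."""

    def __init__(self):
        self.durations_us = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            elapsed_us = (time.perf_counter_ns() - start) / 1000
            self.durations_us[name] = self.durations_us.get(name, 0.0) + elapsed_us


def check(timer, policies, resource_kind, roles, action):
    """Toy check: policies maps resource kind -> action -> set of allowed roles."""
    with timer.phase("select_policy"):
        policy = policies.get(resource_kind, {})
    with timer.phase("evaluate_rules"):
        allowed_roles = policy.get(action, set())
        return bool(allowed_roles & set(roles))
```

After a call such as `check(timer, {"report": {"view": {"analyst"}}}, "report", ["analyst"], "view")`, `timer.durations_us` holds one entry per phase, so a slow rule evaluation can be told apart from slow policy lookup.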

Starting point is 00:40:37 This is down to the microsecond level at this point, because it's very, very performant. And then if you are getting particular request paths where you're seeing slow behavior, and one of the common ones is around when you want to list items to a user that they're allowed to access, kind of like an index or a filtering page,

Starting point is 00:40:56 the kind of old way of doing it is you just whack a where clause in your database which enforces your policy rules. When you externalize authorization, you run into a bit of a challenge because now that where clause that filters your results based on permissions is dynamic based on what your policies are.

Starting point is 00:41:10 So with Cerbos, we can actually generate what ultimately that where clause needs to be, dynamically, based on policies. And inside of the observability we give you, we can show you how long that query plan has taken to generate. And a lot of the time, it will highlight to someone that's looking at

Starting point is 00:41:23 performance, a performance engineer, hey, there's a particular condition, there's a particular check going on inside of this policy which isn't performing as we expected. And it will highlight, okay, maybe go and look at how this particular condition, this particular rule is being evaluated. So you can get very, very granular insights inside of the traces, plus all your top-level Prometheus metrics. Cerbos is written in Go, so you get all the standard goroutine observability, and then also things like the policy cache size, how many policies it's holding in memory, the evictions, and all those kind of best practices in terms of how a system is behaving across your stack, and then fitting nicely into the

Starting point is 00:42:00 tool chain, which everyone's kind of used to in this day and age, hopefully. But yeah, that full request lifecycle insight is passed through Cerbos and you'll see it inside of your spans, inside of your traces. And one pattern that Brian and I have discussed since the beginning of this podcast. I know what this is. The N plus one query problem. You brought up an example where, let's say, in an app, I do a search,

Starting point is 00:42:29 I get a list of, let's say, 100 entries. It's like people, and I'm only allowed to see the names of certain people that might be within my group. That's a great example, I guess, of using policies. Am I allowed to see the telephone number, for instance?

Starting point is 00:42:47 So as a developer, I could implement this in a way where I say, give me all the results and then I iterate through the 100 elements. And then for every element, I basically make a policy call. So this would be a classical data-driven performance problem because... Oh, yes. Is there anything in modern authorization, like with Cerbos, for instance, where you can also make batch calls, where you can say, hey, this is a list of things I want to validate? Yeah.

Starting point is 00:43:17 So there's kind of two options for solving what's called kind of the listing problem. One is kind of the brute force one, where you go and query all your records out, and then one by one you do that check. And that could be one by one or a batch. You can configure a batch and do as many checks as you want in one go in a single RPC and get a result back.
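
The difference between checking records one by one and sending them as a single batch comes down to the number of round trips to the decision point. The following Python sketch uses a fake, in-process decision point that only counts calls; `FakeDecisionPoint`, its methods, and the team-based rule are all hypothetical and stand in for whatever client a real decision point provides.

```python
class FakeDecisionPoint:
    """Stand-in for a remote decision point; counts simulated round trips."""

    def __init__(self, viewer_team):
        self.viewer_team = viewer_team
        self.round_trips = 0

    def _decide(self, record):
        # Toy rule: a viewer may only see records belonging to their own team.
        return record["team"] == self.viewer_team

    def check_one(self, record):
        self.round_trips += 1
        return self._decide(record)

    def check_batch(self, records):
        self.round_trips += 1  # one call carries every record
        return [self._decide(r) for r in records]


RECORDS = [{"id": i, "team": "blue" if i % 4 == 0 else "red"} for i in range(100)]


def visible_one_by_one(pdp, records):
    return [r["id"] for r in records if pdp.check_one(r)]


def visible_batched(pdp, records):
    decisions = pdp.check_batch(records)
    return [r["id"] for r, ok in zip(records, decisions) if ok]
```

Both functions return the same 25 visible ids, but the first makes 100 round trips and the second makes one. Batching removes the per-call overhead, although every record is still fetched and evaluated.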

Starting point is 00:43:37 But from a performance perspective, that is not good. Because firstly, that batch you're sending in to check could be five records or it could be 10 million records, depending on what your system does. And particularly if you then go through and check every single one and it's actually denied, denied, denied for all of them, you just wasted all that compute because that person has a role that means it's always denied

Starting point is 00:43:55 or doesn't have a role, which means it's always denied. Why have you gone through and checked all those records, etc.? So the solution to that problem, something that's pretty unique to Cerbos, is what we call generating a query plan. So when you look at your policies, you're basically defining all the rules under which a particular action can be done by particular roles and under particular conditions. And inside of that request where you want to generate that list page

Starting point is 00:44:21 or that index page of resources, you can say to the decision point service, I've got this user, I know who they are, I've got their roles, I've got their attributes, I know they're trying to request something against a particular resource type, and they're trying to view a resource. So this user, you've got Andy, he's a manager, he's in this team in this region, et cetera, and he's trying to view the employees. What you can do is say to a system like Cerbos, what are the conditions that must be true for Andy to be able to access this particular resource type?

Starting point is 00:44:56 And what the authorization system, in this case Cerbos, will do is, rather than just giving you back an allow or deny, it will go through and look at all the policies and work out: what are the conditions that must be met for a record to be accessible by this particular profile, by this particular principal? And that condition tree is going to be dynamic based on every single request that comes in. It's going to be driven by all the request context that's available. So time of day, IP address, region, data center, your profile information, your principal information,

Starting point is 00:45:24 your groups, your teams, your roles. And then it will take those policies, it will evaluate as many conditions as it can, and will basically return one of three answers. It will say back to your system: based on those policies, Andy has access to all of them, therefore you don't need to filter it from an authorization perspective. Obviously you want to do pagination

Starting point is 00:45:40 and stuff, and so your where clause now can just be a select all. It may decide that actually, based on your roles, you don't ever have access to this resource type at all. Therefore you don't even need to go and query the database. We'll just say, based on your current policies, you're

Starting point is 00:45:55 always denied, therefore you don't even need to query the database. But the most common answer is a conditional response. And in the Cerbos world, that conditional response is an abstract syntax tree of conditions which says: this attribute of the resource must be this value, or this attribute must be true, or this attribute must be one of these four values, or this attribute must be equal to the ID of the person making the request, for example. So that comes back as an AST. It's in a standard grammar, as you'd expect with an AST. And

Starting point is 00:46:22 then from there, you can take that and convert it to a SQL WHERE clause. You can convert it to a Mongo filter. You can convert it to an API call filter, or whatever's relevant to your use case. It's agnostic to any particular storage technology. So we've published open source adapters for things like Prisma, SQLAlchemy, Mongoose,

Starting point is 00:46:41 and a few others. And what this now means is you're essentially taking your authorization logic, which is in your policies, and then pushing it down into the database and letting the database do the filtering. But the key thing is that filter is fully dynamic. So whenever the business changes the policies,

Starting point is 00:46:56 on the next call, that query is going to be slightly different, or that query is going to be different based on the person making the call. And that way it's optimized because you're only ever going to pull out of the database what that user actually will have access to. Cool.
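
The query plan idea described above (three possible answers, with the conditional one arriving as a tree of conditions that gets translated into a database filter) can be sketched in Python. Everything here is an assumption made for illustration: the plan dictionaries, the operator names `and`, `or`, `eq`, and `in`, the `employees` table, and the helper functions are hypothetical and do not reproduce the Cerbos response format or its published adapters. SQLite is used only so the example is self-contained.

```python
import sqlite3

# Hypothetical plan shapes for the three possible answers.
ALWAYS_ALLOWED = {"kind": "always_allowed"}
ALWAYS_DENIED = {"kind": "always_denied"}

# Only known column names may appear in generated SQL.
ALLOWED_COLUMNS = {"team", "region", "manager_id"}


def conditional(condition):
    return {"kind": "conditional", "condition": condition}


def column(name):
    if name not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {name}")
    return name


def to_sql(node):
    """Translate a condition tree into (sql_fragment, parameters)."""
    op = node["op"]
    if op in ("and", "or"):
        parts, params = [], []
        for child in node["args"]:
            sql, child_params = to_sql(child)
            parts.append(f"({sql})")
            params.extend(child_params)
        return f" {op.upper()} ".join(parts), params
    if op == "eq":
        return f"{column(node['field'])} = ?", [node["value"]]
    if op == "in":
        marks = ", ".join("?" for _ in node["values"])
        return f"{column(node['field'])} IN ({marks})", list(node["values"])
    raise ValueError(f"unsupported operator: {op}")


def list_employees(conn, plan):
    """Return the ids the plan permits, letting the database do the filtering."""
    if plan["kind"] == "always_denied":
        return []  # no need to touch the database at all
    sql, params = "SELECT id FROM employees", []
    if plan["kind"] == "conditional":
        where, params = to_sql(plan["condition"])
        sql += f" WHERE {where}"
    return [row[0] for row in conn.execute(sql + " ORDER BY id", params)]


def demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees "
                 "(id INTEGER PRIMARY KEY, team TEXT, region TEXT, manager_id TEXT)")
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
        (1, "perf", "emea", "andy"), (2, "perf", "apac", "brian"),
        (3, "sales", "emea", "andy"), (4, "sales", "amer", "carol"),
    ])
    return conn


# "Reports of andy, or anyone on the perf team in emea or apac."
PLAN = conditional({"op": "or", "args": [
    {"op": "eq", "field": "manager_id", "value": "andy"},
    {"op": "and", "args": [
        {"op": "eq", "field": "team", "value": "perf"},
        {"op": "in", "field": "region", "values": ["emea", "apac"]},
    ]},
]})
```

Values are passed as bound parameters and column names are checked against a whitelist, so the generated filter is not built from raw request data. With the demo table, the conditional plan selects three of the four rows in a single query, the always-allowed plan selects all four, and the always-denied plan never queries the database.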

Starting point is 00:47:08 Yeah, that's perfect. I mean, and I just hope that every developer who is going to use Cerbos understands that these different options are available and doesn't just fall into the easy trap of doing it in a way that, you know, ends up with the N plus one query problem that we have seen too many times

Starting point is 00:47:27 in our life. Just requesting too much data and then iterating over it and then making individual calls. It was always traditionally way back it was the database. And then as microservices came in, I was like, oh, you can do that between microservices now.

Starting point is 00:47:44 It's like it finds a way to infest everything it can, you know. Can I just round back really quickly? Because I think you said something really important earlier that I want to confirm, at least important to me. It sounds like you were saying that Cerbos comes pre-instrumented with OpenTelemetry for the key pieces, right? I just want to first thank you for that. This has been the promise of OpenTelemetry. I don't have my finger on the pulse of all commercial offerings and all that, but there was this idea that all these commercial offerings

Starting point is 00:48:22 and even open source offerings can come pre-instrumented with the key components of what it takes because you all know what's important for troubleshooting this stuff. Right. And I don't think I've come across or heard of anyone doing that yet. I don't know if that's the case in reality, but to hear it there, I just think, you know, definitely a hats off to you, because that's, that's what we were, that was, that was the goal. I mean, that was, I don't want to say it's the goal, but that was one

Starting point is 00:48:50 of the dreams, right? Every vendor was going to bake it in, you pop it in, you get what you need, right? So I just really got to say, that's amazing. The other quick question I had on the side of that is, obviously this code has to be performant. Are there certain languages that you see that are used? I remember back when I used to work with a lot of trading people, a lot of the back-end trading apps were written in C++ because they were dealing with half-millisecond response times and things like that. Are there code bases that lend themselves better to this these days?

Starting point is 00:49:17 Because I know a lot has changed in performance. Yeah, if you look across the different authorization solutions out there, some are open source, some are commercial products, some are libraries, some are packages, some are vendor specific, etc. The overriding commonality between them all is Go. Okay, that makes sense. There's other kinds of solutions out there in Go. If you're in the cloud native space, that's obviously very prevalent.

Starting point is 00:49:43 And for us, that was our experience from previous businesses where we had to build these sort of low latency, high throughput systems. Go was always kind of the go-to choice. I'm sure someone will say rewrite it in Rust at some point. But right now, we're sticking with Go for that. And it also has the best-in-class support, we think, at least for the observability side of things as well.

Starting point is 00:50:04 It just kind of fits nicely in the ecosystem. We can do the ARM64 builds and all that as well at the same time, to be able to run it anywhere. Plus, it's all open source, so you can go fork it, and it's not in an obscure language. We encourage community contributions as well, so we wanted to keep it something that's pretty common in this tech stack. Great.

Starting point is 00:50:28 Hey, Brian. I think we're running up at the end of time for you. I think so too, yeah. But honestly, first of all, Alex, thank you so much for demystifying the difference between authentication and authorization. Thanks for playing along on that use case we came up with in the beginning with the hotel, so I think we learned quite a bit. Also thanks for, as Brian just said, you know, contributing back to open source and following these standards. It's a huge topic, right, authorization. I guess, you know, I've been a developer for many, many years, but when I developed it was actually never that big of a

Starting point is 00:51:08 concern for me because of the type of software that I developed, because I just assumed my code runs in the right authorization. But yeah, folks, if you listen to this, I guess we have a lot of links that we add to the description. As always, we will also add your

Starting point is 00:51:24 LinkedIn profile link so that people can follow up with you. And yeah, I would love to have you back at some point because this is a topic that will stay relevant and probably even more relevant in the years to come. Absolutely, yeah. Very happy to, and thanks for having me. It's been great chatting.

Starting point is 00:51:41 Yeah, I think this has been awesome. I don't really deal with the code side as much, but I think like a lot of people, we all use permissions and set up who can do what no matter what we're doing. And I don't think I've ever thought of what's behind that ever in my life. And now it's like, oh my gosh,

Starting point is 00:52:00 there's a huge, huge topic behind that. So thank you for sharing the information and just making me a smarter person or more knowledgeable. At least I don't know if I'm any smarter, but I'm more knowledgeable. We appreciate your time. Thanks for everyone listening today. We hope you had as much fun as we did. And we will see you at the, or you'll hear us on the next episode, I guess. Thanks, everybody.

Starting point is 00:52:24 Bye-bye. Thank you. Bye-bye.
