Software at Scale - Software at Scale 46 - Authorization with Or Weis

Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey, welcome to another episode of the Software at Scale podcast. Joining me today is Or Weis, the founder and CEO of Permit.io, which is a permissions-as-a-service platform. Thank you for joining me. It's a pleasure to be here, Utsav. I'm really excited for our conversation. It's pretty early in the morning there, right, in Israel. Let me start with just asking about your background.

Starting point is 00:00:40 My background starts in the Intelligence Corps in the IDF. I had a long career in the IDF and then as a VP of R&D, and I worked in several startups and founded another company before this one, another DevTools company called Rookout. And throughout our careers, both myself and my co-founder, we've built access control for products that we've been building probably thousands of times. But the most annoying part is that it was more than once per product. So for example, in my previous company, Rookout, I ended up rebuilding access control five times for a product that wasn't even three years old.

Starting point is 00:01:17 It literally drove me insane. At each point, I thought, okay, I've built this, it's perfect, I'm done. And every time it surprised me anew with more challenges coming either from the customers, from security, compliance, from the infrastructure, or also from weird angles. So for example, we were working with Cisco as a biz dev partner. They were selling Rookout directly to market. And at some point, they came in and said, we want our own back office, we want to manage users on our own, we want to assign permissions on our own, we want our salespeople to be able to work with this. And I looked at what we'd built and said,

Starting point is 00:01:54 there's no freaking way that I can make this solution support two back offices. I have to, once again, throw it out the window and start from scratch. And I just thought, this is so silly. I don't want to do this. I want to focus on actually building my product. And I remembered feeling that sensation, that mindset, across my career. And I just thought there must be a better way. And that's what brought me to create a permissions service so developers can focus on building their products and not rebuilding this over and over. Maybe you can walk us through where the complexity in permission management comes in? As I think about this as a layperson who hasn't thought too much, you have different user

Starting point is 00:02:38 types, maybe they have different attributes or different roles. When does this get complicated? So I think actually the question that you're posing is the crux of the problem. At each point that we're building it, it's hard to see what we'll actually need down the road. I myself have also fallen for the same fallacy. At each point that I was building this, I thought, oh, I have the entire picture. I know what I need. I'm going to build this and I'm done. But things are constantly changing.

Starting point is 00:03:09 And so if we take a zoom back, look at broader strokes, we can see that almost every company starts with or every product starts with having admin and not admin. And then you move to admin, not admin and super admin. Then you move to access control lists. So the people on that list can do A, people on that list can do B. Then usually as you start working with customers and you need to have more structure in your permissions,

Starting point is 00:03:36 you move to role-based access control. That also often comes with compliance, because compliance standards like SOC 2 specifically talk about these kinds of controls. And then you realize, oh, actually, roles are not enough because I need more granular things. Like I need roles plus ownership. It's not enough that an editor can edit files. The editor needs to be able to edit only his or her own files. Or you need other attributes. I want to enable this only if a customer is paying or only if they're in a specific geolocation.
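
The progression described here (roles, then ownership, then attributes such as payment status or geolocation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ROLE_ACTIONS`, `User`, `Document`, and `can` are hypothetical and do not come from Permit.io or any particular library, and a real system would likely keep these rules outside the application code.

```python
from dataclasses import dataclass

# Hypothetical role-to-action table (the plain RBAC layer).
ROLE_ACTIONS = {
    "viewer": {"read"},
    "editor": {"read", "edit"},
    "admin": {"read", "edit", "delete"},
}


@dataclass(frozen=True)
class User:
    id: str
    role: str
    is_paying: bool  # attribute: is this a paying customer?
    country: str     # attribute: coarse geolocation


@dataclass(frozen=True)
class Document:
    id: str
    owner_id: str
    allowed_countries: frozenset  # attribute on the resource


def can(user: User, action: str, doc: Document) -> bool:
    """Return True only if the role, ownership, and attribute checks all pass."""
    # 1. Roles: the user's role must grant the action at all.
    if action not in ROLE_ACTIONS.get(user.role, set()):
        return False
    # 2. Ownership: editors may edit only their own documents.
    if action == "edit" and user.role == "editor" and doc.owner_id != user.id:
        return False
    # 3. Attributes: editing is limited to paying customers (assumed rule).
    if action == "edit" and not user.is_paying:
        return False
    # 4. Attributes: every action is limited to allowed geolocations.
    if user.country not in doc.allowed_countries:
        return False
    return True
```

Each new requirement adds another branch to `can`, which is one way to see why a model that starts as simple roles tends to drift toward attribute-based access control.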

Starting point is 00:04:06 So that's RBAC plus ownership plus some attributes. And as you start to add more attributes, you start to slide toward attribute-based access control. So everything that is an arbitrary element or property of either the identity, or the resources in the application itself, or the actions you perform on those resources is generally referred to as an attribute. And as you start to add those, you find yourself at either attribute-based access control or policy-based access control. And as those gain complexity, you try to simplify it with relationship-based access control that can

Starting point is 00:04:45 also work with graph-based access control. And nowadays, most of these would translate into some kind of policy as code. The challenge is not necessarily understanding each of those models. By the way, for each of those models that I described, we can probably have a discussion for five hours straight on the different structures and layouts and objects that you can have for that. And also how you create DB schemas for it, and how you create an update mechanism for it, and how you create an edit mechanism for it, and a versioning mechanism for it, and an auditing mechanism for it. And for each of them, it will be slightly different, but also none of them is like the correct answer. Every application is a snowflake. Each application is unique, otherwise it wouldn't need to exist

Starting point is 00:05:28 because there's another application like it. So you need to be able to adapt these mindsets or concepts to the concrete requirements of your product. And the most challenging part there is that your product is evolving, just like your company is evolving and you as a development team are moving forward and gaining more features, more capabilities, more infrastructure. And with that evolution, your permission model will change. It will also, as we said before, be affected by what the customers want and what the product managers want and what security, compliance, and infrastructure want. All of that will change your product constantly. On average, every company refactors or rebuilds their permissioning system every three to five. And

Starting point is 00:06:19 the change, depending on how they're shifting or what they're shifting to, can be between one and eight months of intensive labor for, on average, a three-person team. So the cost is also very high every time you readdress this. And the organizational friction is also very high because these often float in. So you've probably seen something like a product manager talking to a customer and the customer saying, oh, we need another role because we have this guy who's working on that department and we need them to have a slightly different set of permissions. And the product manager would go, oh, yeah, sure. Let's just go back and open up a ticket in Jira.

Starting point is 00:07:03 And some poor schlep of a developer receives that ticket. And the people that opened that ticket don't actually realize that there's a world of pain behind that simple requirement of adding another role, or making roles dynamic, or making roles more auditable, or whatever it is. So it ends up just out of the blue becoming a huge project. And as it gets delayed and has more friction, more people start to clamor around it. And that puts more pressure on R&D. And then it asks. And there's this gap between the requirements and the people floating them in (product, security, compliance) and the understanding that's actually needed to build this, which only resides with developers.

Starting point is 00:07:46 So a lot of the tension is there. So it's both about the organization understanding and the developers understanding where this is going, and also aligning the organization around the journey, because otherwise people are constantly being surprised about simple things that are actually interesting. And you mentioned so many different kinds of access control, like role-based, attribute-based, graph-based. I've certainly dealt with policy-based when it comes to cloud systems like AWS. So I'm just curious, off the top of your head, are there any particular cloud systems that you think do permissions well or

Starting point is 00:08:21 because permissioning in, like, IAM in AWS has always been like the biggest source of confusion for me. What's your opinion as someone who thinks about permissions day in and day out? Yeah, so I think, for example, AWS IAM is an amazing system. It's super powerful. But I see a lot of developers that are impressed by the power of that solution and consider that breadth of scope ideal. I think that's actually not the right way to think about it, because what's right for a big cloud infrastructure solution like AWS is not the same as what's right for a SaaS application or even a PaaS application. So there are definitely concepts that you can take from it. But you probably shouldn't take it one to one and

Starting point is 00:09:10 definitely not fall in love with it (sometimes I actually see that too) and expect it to work perfectly for your solution. At the end of the day, all of these models, RBAC, ABAC, the IAM system, policy-based systems, are concepts and tools. They're not a final product for a specific application. So you should look at them more as options and suggestions, and pick and choose what's right for you throughout a journey where you'll constantly be updating this. The reason that IAM is so powerful is that it needs to be. It's catering to a highly technical audience that requires super flexibility. It can allow itself to be less interfaceable or addressable by the average Joe, and it needs to cover a lot of different services. That's not your run of the

Starting point is 00:10:01 mill application. So definitely it's a very good example of a well-designed solution, but it's not the perfect example for every application. Yeah, no, I think it's definitely really flexible, but I do think the complexity is just inscrutable at times, like service accounts, like linked roles and stuff. It's just like the concepts keep piling on, but you definitely see how it allows you

Starting point is 00:10:24 to do a lot of interesting things. There's like solutions you can build on top of that, like access manager, getting all those access logs, ensuring you have auditability, then combining roles and stuff together. There's like a security audit role that combines a bunch of policies, so that's interesting. My question for you then is, how do you productize this problem, right? As you mentioned, there's a lot of complexity here and every solution,

Starting point is 00:10:49 every company probably needs a different set of answers to this problem based on how technical their audience is, what kind of flexibility they need. How do you build a product that encapsulates all of that? Great question. I'll start by saying that with the IAM and with AWS, there's a choice there to keep it

Starting point is 00:11:17 more complex and not addressable for the common Joe. And that's on purpose. That's not really what's relevant for most applications. But you did mention some other things there, like the auditing and logging of it and connecting this to higher level concepts like roles. That's something that you'd probably find in almost every application. So that's something that we can definitely look at in a positive light there. So every application should probably have audit logs at some point and should have versioning on its policies and should have the ability to combine roles and to combine attributes.

Starting point is 00:11:50 Those are, it's a matter of time until they chime into the conversation. And the way to think about it, I think, as in general with software is to think about it in a kind of modular stack. So you don't have to have everything at day one, but you want to have the right stack and components built in so you can add more capabilities as you go. So you want to start simple and you want to start with something that answers your needs now, but can grow and can add more interfaces to the other people, the other stakeholders that are involved. The way I like to think about it is in kind of three ways. And those are also the things that we offer people when they work with us. I like to think about it in best practices, infrastructure, which can ideally be open source, and then experiences and interfaces on top.

Starting point is 00:12:46 So with best practices, you have things like decoupling policy and code. Once you understand that things are going to change, that both your application and your authorization layer and policy are going to change, you understand that if you couple them together, every time you want to change one of them, you'll have to change everything. And that's going to be very painful and add a lot of friction, or basically slow you down and often force you to redo everything for every little thing. So by decoupling policy and code, which essentially in a modern application means creating a separate microservice for authorization, you can keep your application more simple and your authorization more simple, and both

Starting point is 00:13:30 can evolve separately, but side by side. So that's one of the key best practices. Another one would be keeping things event-driven. Permissions and access control is a critical experience. You want it to be quick, you want it to be performant, you want it to be consistent. So if you have something that updates with delays, you're going to have a bad time. For example, say you want a policy where only users that have paid for a feature can use it. The information there on who paid doesn't exist in your database today. That's a third party service like Stripe or

Starting point is 00:14:01 Chargebee or PayPal. So you need a way to synchronize with those services as they change. And the best way to do that is to listen to events. So you have events propagating in from different services, and you allow your authorization layer to be updated by those events in a real-time manner. There are more best practices, but we can circle back to them in a minute. The second part is building the right infrastructure. You want a pluggable infrastructure that is extensible once you want to add more interfaces on top. Everyone starts, as we said, with just having basic permissions, basic enforcement

Starting point is 00:14:38 and a really simple model. But on top of the model changing, you want more capabilities on top. So you probably want to add user management with the ability to assign roles. And you want to add API key management because you also provide some automation. You want secrets management, you want audit logs, you want to be able to see who did what in your system. You want multi-tenancy. You'd want impersonation, for the ability to see who did what within the system, logging in as that

Starting point is 00:15:05 user. You'd want approval flows, asking permissions from another user. And this list is, first of all, things you've seen a billion times, and also never ending. There's always another item to add to that list. So if we design the authorization layer so that those kinds of interfaces and experiences can plug in, we can grow gradually with the evolution of the application. So we don't need to have impersonation, for example, at day one. But we want to be able to easily add it without refactoring everything when we get to that point. And if we use the right best practices

Starting point is 00:15:38 and the right infrastructure, we'll be able to. And it's just a matter of either adopting the right tools or learning yourself how to work with those tools and best practices. And lastly, there are the experiences themselves. I think it's the recognition that you're not just delivering a feature here, you're delivering an organizational pattern here. So it's not just the developers being involved with this. It's all the other stakeholders: product managers, security, compliance. They'll need a modicum of self-determination, an ability to manage this on their own and at least chime in on the conversation. So we want to be able to provide them with interfaces early on, not necessarily at day one,

Starting point is 00:16:25 but we want to plug in those interfaces. Once we recognize that, we are ready for most patterns of that evolution. And once we have interfaces for ourselves, we can also offer interfaces for our customers, which is also something that arrives pretty early. The customers themselves want this democratized. They want to be able to control

Starting point is 00:16:44 who they're adding to their organization within your application, a subset of permissions that they can mutate on their own, maybe create a few roles on their own or attributes on their own, et cetera, et cetera. So generally, at what stage of company does someone approach you? Have they generally built an auth system or two and realized they should be outsourcing this? Is it different for B2B companies versus B2C companies? Because I can totally understand that kind of complexity. If you want to do geolocation-based permission checks, I don't want to build that on my own. Yeah, so we're seeing

Starting point is 00:17:23 companies of all sizes. All of them arrive to us at the point where they are actively working on this, they are actively thinking about this, because some requirement has come in and changed the way that they need to build this. We are seeing companies starting at square one, just saying, I have so much else to build, I don't want to deal with this at all. I think in general, that's the common thread. Developers often don't care about this. They want this to work well, but it's not a unique part of their product. Just like they don't want to build billing or authentication, no one really wants to build this, and definitely not build this and make errors

Starting point is 00:17:59 while building it. The other two types are either companies that have already built something in place and realized that they need to change it because of those incoming requirements, or companies going through an even more significant change. So we see big companies, for example, as they're going through an IPO process or an M&A process. There are a lot of demands coming in, also pushing a critical timeline on the changes that they need to apply. Or when they're doing a significant infrastructure change. So for example, we had several companies moving from monoliths to microservices. So when you're working with a monolith, you can often rely on the built-in access control mechanism. So in Django, in Python, or in the Spring Framework, there are some

Starting point is 00:18:41 basic RBAC admin panels baked in. The moment you move to microservices, that just stops working at all, especially if you're polyglot, if you have multiple languages. So that often brings players to the table. And the painful part is that if you arrive at this later rather than earlier, the amount of refactoring you have to do is where most of the pain is. And I think the most painful cases are people that have already learned, they've glimpsed that there's a different way to go about this, and they've decided, we don't want to put in the effort of changing this, we'll just tweak what we have.

Starting point is 00:19:18 And then they come back a year later and say, okay, we realized that didn't solve it. And now we have to completely revamp it. And we actually added more friction on the way. So bottom line, we're seeing companies of all sizes, but they come in with different requirements and different needs. And the idea, like I said before, is to enable them to find a quick solution for what they need now, and gradually evolve with it. Okay, that's interesting to know and doesn't match my intuition. I guess I assumed that as companies get bigger they would run into this, but I guess it makes sense that sometimes people just know that this is going to be a problem from, like, their

Starting point is 00:19:55 previous job and they're like, I'm just going to outsource this from day one so I just don't have to think about this at all. I think the difference there is that people are learning that this is an option, just like with authentication. If you go five, seven years back, most companies would say, why do I need to use an authentication vendor? I can just store passwords. What's the big deal? And now I think most developers would react to that and say, okay, that's insane. Storing passwords is really hard. It's the security and cryptographic aspects of it, like hashing and salting, and just tracking everything and doing SSO around that. That's

Starting point is 00:20:29 a huge pain point. And there's no unique value in implementing this again. And as people learned that authentication solutions are an option, and that they are readily available, the mindset shifted. I think the same thing is happening now with authorization. A lot of developers are learning that they don't have to build this. And most of them don't want to build this anyway. So if there's an alternative, they'll often stick to it. Some people are still struggling, saying (and it's actually with the bigger companies), oh, we've built this huge complex thing that we're really proud of. So what if it doesn't meet our requirements anymore? So what if it doesn't meet the modern standards anymore? I think I can make

Starting point is 00:21:14 this work. And they're right. But every time they make that statement, they're just postponing another point where they'll have to reconsider and actually adopt the modern patterns. Because again, it's not about having the right solution now. It's about having something that can evolve quickly. Okay, then one question that I have for you: you mentioned the Google Zanzibar paper in one of your documents. Maybe you can walk us through why, even behind the scenes, permission management is not easy to run in a nice and fast and consistent and scalable way.

Starting point is 00:21:52 Why have people written research papers about this? Isn't there just an access control list and you need to check whether a person's in the list or not in the list? Where does the performance challenge come in? First of all, you need to realize that the average microservice sends three authorization queries for every request it gets. So if your authorization layer is inefficient, you're going to have a bad time because

Starting point is 00:22:20 if it adds, let's say, 50 milliseconds, you're quickly getting to several hundreds of milliseconds before your application has done anything. On top of that, there are other hidden complexities in how you store the data that you need for authorization and how you fetch it. The data that you have for the application, first of all, is not all the data that you need for authorization. We already covered the third-party services and distributed data plane and data sinks you're working with. But even just the data for the application itself, the way you structure the schema of your database for the application is not the ideal way to structure it for the authorization layer because they're actually querying different things and they need to do different joins and different aggregates. And you see that often that pain point starts

Starting point is 00:23:10 when people are moving from RBAC to attribute-based. So they're piling in attributes, just adding more queries to the database, essentially. And initially, it's fine. But then at some point, the database chokes, because there are too many queries, they're too slow, and while the authorization layer might still be quick, the underlying data layer can't really support it, and everything screeches to a halt. And so there are complexities in how you store your data, how you propagate it, and how you manage its schemas. And lastly, and that's something that is actually unique

Starting point is 00:23:45 specifically to Zanzibar, is how you apply consistency. So one of the key challenges when you have a large complex system is that things can change while the system is handling a request. For example, you're sending a request to the service, and it starts at microservice one. And as that microservice is querying another microservice, during that transaction, the world picture, the data for authorization, has changed. That's often referred to as the new enemy problem, or a subset of the new enemy problem. So now, as you're running queries for your systems and handling requests, they're inconsistent. So now you can have a case where at one moment you're giving someone permissions and at the other one they

Starting point is 00:24:30 don't have, or they have a different set of permissions. And you end up either failing the request or providing the wrong result, or worse leaking data or access that you weren't supposed to. And that's something that's really hard to track, especially if you have a high-scale system. So in general, taking a step back, there are two camps today. What's interesting about the authorization landscape is it's still nascent. It's still evolving.
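
To make the mid-request inconsistency described a moment ago more concrete, here is a minimal Python sketch of one common mitigation: pinning a version of the authorization data when a request starts and evaluating every check in that request against the same version. This is an assumption chosen for illustration, not a description of how Zanzibar or any specific product works, and `PolicyStore` and its methods are hypothetical names.

```python
import threading


class PolicyStore:
    """Toy in-memory store that keeps immutable, versioned permission snapshots."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        # version -> {user: frozenset of permissions}; old versions are never
        # evicted here, which a real system would have to handle.
        self._snapshots = {0: {}}

    def update(self, user, permissions):
        """Publish a new snapshot with the user's permissions replaced."""
        with self._lock:
            snapshot = dict(self._snapshots[self._version])
            snapshot[user] = frozenset(permissions)
            self._version += 1
            self._snapshots[self._version] = snapshot
            return self._version

    def latest_version(self):
        """Return the newest version; a request pins this when it starts."""
        with self._lock:
            return self._version

    def check(self, version, user, permission):
        """Answer a check against the pinned version, not the latest data."""
        with self._lock:
            snapshot = self._snapshots[version]
        return permission in snapshot.get(user, frozenset())
```

In this sketch the first service would call `latest_version()` once and pass the result along with the request, so every downstream `check` sees the same picture even if `update` runs in between. The trade-off is that a change made mid-request only takes effect for later requests.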

Starting point is 00:24:57 As, I don't know, humanity, society, I don't know what you want to call it, we haven't decided on what are the best practices and standards. We have some of them, but it's not finalized yet. We're still writing that book. So unlike with authentication and with JSON Web Tokens and with SAML and OpenID Connect on the IAM side, things are still evolving in the authorization space. And currently there are two camps. There's the code-based camp and

Starting point is 00:25:22 the graph-based camp for implementing access control. In the code-based camp, you'd find things like Open Policy Agent, which essentially says: you should write policies, load them into an engine, and load data in the form of JSON documents into that engine. You can have that engine run as a sidecar or as a cluster next to your services, and then they can query it. It's really the equivalent of the policy decision point in the XACML methodology, for those who are familiar with it. And the graph-based camp says something different: there's a lot of data here, a lot of complexity, a lot of users, we need to manage it in a consistent picture and consistent graph,

Starting point is 00:26:00 and be able to query it all the time in an efficient manner. And these camps have pros and cons, and I'll try to run through some of them quickly. So with code, first of all, code is Turing complete. So you can describe any policy that you want. With a graph, you can't really have Turing completeness, because then navigation on the graph won't be efficient. The moment you make it cyclical, the moment it's not a DAG, not a directed acyclic graph, especially if the graph is large, you're going to have a really bad time navigating through it. And it will most likely fail. So with Zanzibar and most graph-based solutions, you can only have more simple policies,

Starting point is 00:26:42 mostly around relationship-based access control. It's really great for describing hierarchies like nested files or folders or organizational structures, but it fails when you start to do multiple attributes, for example, when you try to do more ABAC. Another thing is the ability to do reverse indices. So you often ask the question in authorization, can Utsav access this thing? But a lot of times you want the reverse of that, you want to ask who can access that thing. So if you have code, it's answering the question, can X do Y. It's basically impossible to get the reverse. Code only runs one way. You can try and maybe brute force it and enumerate all the options, but that's really a bad way to do that. With a graph, you have the advantage of navigating the other way

Starting point is 00:27:32 around. So basically we get reverse indices out of the box. That's what some people call the spice of Google Zanzibar. With the graph, because you're managing a big graph in the cloud, you get consistency. You control all of the picture, so you can make sure that picture is consistent. When you work with a distributed layout, that's harder to do. But if you work with a graph, and it's a big graph that is remote from the services themselves, you're paying for latency when you're querying it, as opposed to a small, efficient agent at the edge that you can query. So you can see that there are more pros and cons, but there are a lot of them that we've

Starting point is 00:28:10 already touched on. And another thing that I think is interesting to see is that they're complementary. So what code is good for is the complementary or opposite image, the mirror image, of what the graph is good for. So what I'm actually advocating for is using both: using the graph-based solution to manage a bigger picture in the cloud, and using the code-based solution to have efficient answers at the edge. And if you have a component in between that syncs the two, you can actually enjoy both options. And I think that's probably the ideal way to think about it. But it's still evolving. We'll still have to see

Starting point is 00:28:51 where things go. So the ideal case for a graph-based solution would be like a Google Drive or something, where you might mark that this person has access to this folder, therefore they have access to every file and recursive subdirectory. And that gets complicated really quickly because you could have tons of subdirectories and they all need to do it, so you need to traverse, and that's where a code-based solution is tricky. And you're advocating for keeping both of them because they have these different use cases, and then you have to figure out how to keep them consistent, which is like tricky. Yeah. And yeah, the more I think about it, it's not just a Google Drive that needs it. Like anybody who maintains things like here's

Starting point is 00:29:39 a collection of documents that maybe are not like don't have a lot of subdirectories, but you can add permissions to the collection, you can add permission to the document itself. So a lot of people are like building something in use case like Figma, or even like the company that I work at might have to think about this kind of stuff. And yeah, I guess I just didn't appreciate how complicated all of this could be. And to be sure, just to clarify, we're just scratching the surface here. Just on Google Zanzibar, we can talk easily for 10 hours and not get to all the concepts there. We didn't even touch on the main reason that Google Zanzibar was

Starting point is 00:30:16 created, which is great scale. So if you just have a few users and a few objects that you're interacting with, it doesn't really matter how you manage this. You can just shove it into a database, make most of the available data in cache, and it would just work. But as you start to move from hundreds of thousands to millions and above that, both managing all of that data and the continuous scaling up of that data, that's what's going to get you.

Starting point is 00:30:46 And so Google Zanzibar was built for those scales. It was built to maintain that constant huge picture for things like Google Drive and YouTube, which are running within Google on Google Zanzibar. I should probably mention also that there are open source implementations of Google Zanzibar. So Google hasn't released Zanzibar as open source. They just threw a white paper at us. But some cool folks at companies like AuthZed and Auth0 have taken up the mantle of implementing it.

Starting point is 00:31:16 They actually haven't implemented it fully, but it's getting there. But I think it's important to understand that for most companies, at least at the beginning, you don't need Zanzibar. You're not going to run things at Google scale. You might need to be able to grow into that scale down the road, and that's an important difference. So you want to create a modular solution with the interfaces that will later on enable you to change your data layer into something like Zanzibar, for example. You can definitely start with Zanzibar if you want, but you need to understand that there are trade-offs. So, for example, you'll have more latency and general performance

Starting point is 00:31:53 degradation, but you'll get a better picture, more consistent, and you'll have an easier time scaling. But I think if anyone takes anything out of this, it's that you should stick to the best practices. Decouple your policy and code, create a separate authorization layer, have an event-driven fashion to update it, and have it modular enough so you can layer interfaces on top. And then it doesn't matter. You can start with the stupidest thing. You can have a microservice that always returns true for any authorization query. That would be a good place to start because you can build on top of that,

Starting point is 00:32:24 as opposed to having something baked into some if statement in your code that later on, if you want to refactor, you have to do a full code review and change everything in the application itself. So start simple, start modular, grow gradually. You don't have to cover all of this on day one. It's also so hard to code review or check for correctness with authentication checks or authorization checks. Very few people write sufficient integration tests when they add things like permissions logic, or when they evolve it from admin, non-admin to something more involved. So refactoring that code is often like another whole project. That's also why, in the systems themselves, the way you manage the code, you rarely see in the modern

Starting point is 00:33:12 solutions, just functional code. You don't see Python or Java as the recommended language to write policies in, because it's hard to make sure that you cover all your bases when you're running. If you have a rule but you don't invoke that rule, you're basically screwed. But with OPA or Oso, for example, they are using logic programming languages. They're both derivatives of Prolog: Oso's language is a derivative of Prolog, and Rego, the language for OPA, is a derivative of Datalog, which is a derivative of Prolog. And the idea there is that you have a recursive engine that runs through all of the rules that are defined in a performant way. And that way it ensures that you cover all your bases. The same thing is true of the graph: you have an

Starting point is 00:34:03 engine that does the graph navigation for you. So as long as you structured the graph correctly, it's going to do what you're planning for. So it translates the problem from making sure that you cover all the bases within the logical layer of the policy to structuring the policy correctly and auditing the policy itself. It takes it one level higher and enables you to focus on what you actually want as opposed to how it should work. With Prolog, it really takes me back to college, thinking about data flow languages. I haven't thought about that in a while. But we've been talking about OPA, Open Policy Agent, right? So there's two separate permission conversations

Starting point is 00:34:45 that we're having. One is for the end user, when you want to build a system that lets a certain user access certain parts of the application or a certain document or whatever. There's also the microservice, can-this-service-call-this-other-service type of logic,

Starting point is 00:35:01 which I think OPA helps with, because OPA, you can put that into your Kubernetes, you can put that in as a sidecar, as you mentioned. But the more I think about it, you're basically trying to solve the same problem within your product and as an infrastructure component. Does that sound right to you? What do you think? Yeah, yeah. So both OPA and Oso are general-purpose decision engines. You can use them to make whichever decisions are relevant to you. They're focused on policy, but they're general-purpose decision engines. OPA got its real kick, its real control across the stack. You need physical access control. You need, like, locks on doors. And then you need network level access control,

Starting point is 00:35:49 like firewalls and zero trust networks. Then you have infrastructure level access control with admission control and service to service access control. And then you have application level access control. And then it evolves more and more in complexity within the application layer into more logical. And OPA really got its go in the infrastructure authorization layer. And it's actually quite difficult on its own to take it to the application layer.

Starting point is 00:36:13 The big problem there is how do you keep it in sync with the changing application? Like a new user has paid for the service. How do I make sure that OPA knows about that user? Or we change the policy, we added a new role and we did it from the UI. How do we make OPA know that there's a new role now? And that's actually solved by another open source project. I'm actually wearing the t-shirt for it now. So we created OPAL, Open Policy Administration Layer, that essentially takes that event-driven best practice and applies it to policy. You are able to subscribe to topics for both policy and data. And as events come in, they propagate into each of the instances at the edge, keeping them constantly up to date with both the policy and data that they

Starting point is 00:37:05 need, and only those that they need. And so you have a distributed administration layer for OPA, and you can have your different third party services that are changing with your applications webhook and notify Opal on what has changed. And you can have your Git repository webhook on policy changes to Opal, and it will pick those elements and trickle them down like rain to the various OPA agents through what we call the Opal client. Opal does two things through that. One, it solves that challenge of bringing OPA to the application layer. And two, it really helps you tackle the inconsistency problem because it really focuses on propagating events quickly.
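The event-driven idea described here (subscribe to topics, receive only the updates you need) can be illustrated with a toy publish/subscribe sketch. This is not OPAL's API or wire protocol; `UpdateBus`, `EdgeAgent`, and the topic names are hypothetical, and a real deployment would deliver updates over the network rather than through in-process calls.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class UpdateBus:
    """Toy stand-in for a central server that fans out updates by topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, update: dict) -> None:
        # Only agents subscribed to this topic ever see the update.
        for handler in self._subscribers.get(topic, []):
            handler(update)


class EdgeAgent:
    """Toy stand-in for a policy agent keeping a local copy of what it needs."""

    def __init__(self, name: str, bus: UpdateBus, topics: List[str]) -> None:
        self.name = name
        self.local_state: dict = {}
        for topic in topics:
            bus.subscribe(topic, self._apply)

    def _apply(self, update: dict) -> None:
        self.local_state.update(update)


bus = UpdateBus()
billing = EdgeAgent("billing", bus, ["data/users"])
docs = EdgeAgent("docs", bus, ["data/users", "policy/roles"])

# A new user pays for the service: every agent that cares about users hears it.
bus.publish("data/users", {"user:dana": {"plan": "paid"}})
# A role is added from the UI: only agents subscribed to role policy hear it.
bus.publish("policy/roles", {"roles": ["admin", "editor"]})
```

After these two events, `docs.local_state` holds both the user and the roles, while `billing.local_state` holds only the user, which is the "only those that they need" property.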

Starting point is 00:37:50 So the agents at the edge, even if they don't have the data, they know that they're missing data, that the picture has changed. And you can already start seeing this working with something like Zanzibar. So if you have a big graph in the cloud managing the bigger aspects, you can take subsets of it through Opal as the graph changes and propagate them in real time into each of the edge nodes. So each edge node has what it needs being supported by the bigger picture managed for everyone in the cloud. So that kind of also touches on the hybrid solution that we're seeing here, and also how we are literally moving towards the hybrid solution and implementing it.

Starting point is 00:38:30 So your company is not just working on end-user application security, but it's also working on tooling for, basically, permissions across the stack. Yeah, so we just try to solve this. Our notion is, developers don't want to build this. It's really hard to build, there are a lot of complexities, and it's really hard to be aware of all those complexities. We want to abstract those away. We always want to enable developers to have access to the code, to manage this with GitOps, to manage this with infrastructure that they control. But unless they want to do something, they shouldn't be forced to. They should have the option, but not the responsibility all the time. I don't think

Starting point is 00:39:12 most people care about the difference between RBAC and ABAC. And I don't think they should. I think a solution should abstract that and enable you to dive into that only when it's relevant. You should be able to start simple, build this, have it work and grow with you as you go. And the way to achieve this is by creating standards. It's by creating solutions that are inherently built to address the problem and are flexible enough to be extensible by the different snowflake solutions that need to use them. And that's really the mindset that we had with Opal. And I think that's also why, though it's a really young project, only a year old, it's seeing so much success.

Starting point is 00:39:53 It's already in use in production in companies like Tesla, Zapier, Accenture, and dozens of others. And it has a significant community in Slack of people asking questions on a daily basis. I think we were able to do that because we built something that is both powerful enough and flexible enough for developers to adjust it for what they're building. Yeah, I'm noticing this consolidation across the industry around standards and going up the stack. It's very similar to what AWS is doing, but more in the open source way. Like now you have OpenTelemetry. I was talking to the LightStep people a few years ago

Starting point is 00:40:29 and it really seemed like it's matured. And now I'm guessing there's more and more standards coming out on authorization, like how you should be doing this. People are converging on OPA and saying this is the way it should be done. It's interesting to see. Yeah, as the industry matures, you think less about the infrastructure that's running your systems and more about your end

Starting point is 00:40:52 use cases. You have to. It's basically the story of humankind, right? At the beginning, you'd just pick a stone and use that to hunt or to cut your meat or whatever. And then one day someone came in and said, oh, you should take stone from that guy, he make good stone. And then everyone said, you should take spear wood shaft from that guy, he makes good shaft. And then one day someone came in and offered you a shaft with a stone already tied to it, and you said, oh, this is much better than getting it and assembling it on my own. And we constantly spread out, create new solutions, then we consolidate, and then we build more layers on top. And every time we add a layer on top, we have to specialize.

Starting point is 00:41:37 We have to create people, or solutions, that are specialized in building that, so other people don't have to understand all of those complexities. And the same thing is happening here. The only difference is that we don't have the right answer yet. It's still evolving. So what we're trying to do as a vendor is to give you that promise of

Starting point is 00:41:55 no matter what spear or sling will come into existence, we'll wrap it for you and make it available for you. So you don't have to care about it. As you go, you can focus on building your product. And I also think it's our responsibility to chime in on the conversation

Starting point is 00:42:11 and make sure that together, through the open source we're offering and through integrations that are being built, we create the right standards. That's why we took this open source. So we can have a public conversation on how we can all together build the correct thing for, again, us as a society, humanity, whatever you want to

Starting point is 00:42:31 So then let me wrap up with: what are you most excited about in what you're building? What's the next big thing that you're excited about? What's the next feature or the next project? That's a good question. I'd say I'm most excited about the human interfaces, which is funny to say for a developer tools product, but I think that's really key. Because when we explored the space, we started with our own pain, but we wanted to see how it looks across the space. So we looked into the bigger organizations like Facebook and Google as a glimpse into the future. And what we realized there is that, A, they've invested a lot of time to build this. So for example, in Facebook, they invested a team of 30 people for half a decade to just build the infrastructure components

Starting point is 00:43:22 for their X. And what they did is two things. One, at some point, they had to move from just static rules, just policy you create, to an intelligent component, to a machine learning component that can react to the gray points between the policies. And B, that AI ends up translating the interactions back into organizational behaviors and flows. So for example, when an employee tries to access the Facebook database, or the metadata database, I should say, and they're querying more data than they probably should, or than they do on average, the AI can detect that as an anomaly. But because they

Starting point is 00:44:06 want business to continue, they don't just shut it down. Because you have thousands of employees doing thousands of things. If you just shut down everything that passes the anomaly threshold, things will just screech to a halt. So what they do instead is they translate that into human interactions. So for example, they ask the team lead for that person, is what they're doing okay? Is there an assignment around this? Should we throttle this? Should we limit this? Maybe you should talk to them. And by going back to conversations and having the people align back with the machine, they're able to both keep it secure and keep it fast enough for the business to run. And I think that's something that's coming up for all of us. Like, when we're building applications now,

Starting point is 00:44:52 it's mostly we're thinking about human users using our applications, but more and more, it's applications on behalf of applications on behalf of applications on behalf of applications using our application. And it's like with algo trading: in the past, it was just humans yelling at each other, buy, sell, buy. Nowadays, it's all automated at a speed that humans can't really work with. So we need a very quick layer that can react to those things, interpret them, and provide back interfaces for us as humans to manage it and have it work the way we want. So with what we're building today, we already covered a significant part of the basic infrastructure. And we're starting to look at more automation around it. But mostly and more

Starting point is 00:45:35 importantly, building interfaces, low code interfaces, no code interfaces, human conversation interfaces, so that all the stakeholders can come in and build this together in a way that can move quickly. Yeah, it seems very similar to IAM right sizing, right? This kind of stuff seems super chaotic, but it makes sense that if you notice a certain role is not using all the permissions that are assigned, AWS can tell you, you should reduce the set of permissions, increase the set of permissions, because if you see like an access denied,

Starting point is 00:46:08 but what if you do that in a more naturalistic way? I can also imagine you can actually use your permission system to understand whether somebody is worth upselling. Oh, this person keeps going to a feature that they don't have access to, maybe show them an ad saying buy the product. It's interesting to think about. One thing I remember from Dropbox is that the most popular way for them to make money was when a user was over quota and they got an error message, because once they got that error message, there was a click: do you want to buy more space? That's what made them the most cash. So it would be cool if we had an inbuilt feature-flagging, permission-based system. I know you all have, I remember looking at, OPToggles, which

Starting point is 00:46:56 does something like that, right? Yeah, so that's one of our other open source projects. So you want to be able to have one core place where you manage your policy and have all of your application feed from that. So with Opal, we already talked about how that propagates with Opal and OPA. We talked about how that propagates to the backend. But what about the front end? You want the front end experience to also adjust. So for example, if someone's going to get an error, like a 403 error when they query the API, you don't want that to just be thrown in the UI. You want to give them a different experience. If they can't click that button, don't show them that button. And the way to do that today in general is with feature flag solutions.
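The "if they can't click that button, don't show them that button" idea can be sketched as deriving UI flags from the same policy the backend enforces, so the front end never offers an action that would come back as a 403. This is a rough illustration only: the policy table, button names, and functions are hypothetical, and it is not how OPToggles or any feature flag vendor is implemented.

```python
from typing import Dict, Set

# Hypothetical single source of truth: role -> actions that role may perform.
POLICY: Dict[str, Set[str]] = {
    "admin": {"invoice:read", "invoice:refund"},
    "viewer": {"invoice:read"},
}

# Hypothetical mapping from UI elements to the action each one triggers.
BUTTON_ACTIONS: Dict[str, str] = {
    "view-invoice-button": "invoice:read",
    "refund-button": "invoice:refund",
}


def is_allowed(role: str, action: str) -> bool:
    """The check the backend would run before serving a request."""
    return action in POLICY.get(role, set())


def ui_flags(role: str) -> Dict[str, bool]:
    """Flags for the front end, computed from the same policy as the backend."""
    return {button: is_allowed(role, action)
            for button, action in BUTTON_ACTIONS.items()}


def api_status(role: str, action: str) -> int:
    """What the API would answer if the button were clicked anyway."""
    return 200 if is_allowed(role, action) else 403
```

For a viewer, `ui_flags("viewer")` turns the refund button off, which matches the 403 that `api_status("viewer", "invoice:refund")` would return. Because both answers come from one policy, changing the policy changes the backend decision and the UI together.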

Starting point is 00:47:39 That's the way front-end applications adjust their experience. So with OPToggles, you can sync your feature flag solution to your open policy. So you change your open policy and, through Opal, OPToggles listens in and then updates your LaunchDarkly, Split.io, etc. So you can have everything chime in the right way. But more importantly, and kind of touching on what I said before, everyone gets the right interface. So the backend engineers can work with the policy engine and the GitOps solutions. And the frontend engineers can work with what they're accustomed to, which is a feature

Starting point is 00:48:14 flag solution. So everyone chimes in on the same conversation, but with the right interface for them. Yeah, ideally, to me, you should just have the same thing. Feature flagging, permissioning, etc. should just be this one big product that manages all of that for you and helps you maybe upsell and block unless necessary. But anyways, thank you so much for joining. This was a lot of fun, and I hope you had a great time. I had a great conversation. Thank you so much. It was great talking to you, and I look forward to next time. Yeah, thank you. I will take you up on it.
