The Data Stack Show - 125: Authorization Is A Data Problem with Jeff Chao of Abbey Labs

Episode Date: February 8, 2023

Highlights from this week’s conversation include:
Jeff’s background at Netflix and Stripe leading him to Abbey Labs (2:22)
What Abbey is solving in the space (5:16)
Tackling permissions in an organization (7:30)
Opportunities to improve the availability of data (10:14)
The challenge of tackling a new problem area at a new company (14:59)
What are the most common challenges in the identity and security space (18:43)
Importance of identity and the ability to track it in data (22:46)
Connecting all the different platforms without frustrating the user (30:32)
What parts of access data need to be tracked (36:10)
Dealing with the varieties of data in security and managing permissions (40:26)
Final thoughts and takeaways (51:52)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week, we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Welcome to the Data Stack Show. Kostas, I think this may be our first three-time guest on the show. Jeff, we first talked with him when he was at Netflix. We talked with him again when he was at Stripe. And he has now co-founded his own company, Abbey. And what an amazing guy. We love having him on the show.
Starting point is 00:00:46 And we're going to talk with him about Abbey today, which is in the identity space, but focused on employee identities within a company with the emphasis on security, which is really fascinating. And what I want to know, this isn't going to surprise you, but he built all sorts of crazy streaming technologies at some of the most famous companies in the entire world across a number of different problem areas. And that's pretty different than what he's building at Abbey. And so when there's a change like that, I'm always interested in the story behind it and going to attack a new problem. And so that's what I'm
Starting point is 00:01:30 going to ask. How about you? Yeah. Actually, I think it's going to be like a great opportunity to understand why the industry believes that security is a data problem. And I think we have the right person to help us. So it's a very common theme, like a theme that we hear a lot lately, that security is a data problem. I think for people who are outside of security, it's hard to understand what does this mean, right? So here we have someone who comes from an incredible background in
Starting point is 00:02:07 building data infrastructure who decided to go and build a company in security. I think we have the right person there to help us understand why security is a data problem and how this is implemented as part of the vision of the company that he has founded. Yep. I'm so excited to chat with Jeff again. Let's dig in. Yeah, let's do it. Jeff, welcome back. You are at this point a multi-time repeat guest, and it's always such a pleasure to have you on the show. So thanks for joining. Hey, thanks for having me again. It's good to be back.
Starting point is 00:02:44 Okay, so for the listeners who didn't get a chance to catch your previous episodes, number one, if you're listening, you absolutely need to listen to the prior episodes with Jeff, all the individual ones and the panel ones. But can you just give us a brief background and then tell us what you're doing today? Because you started something new, which is very exciting. Yeah, sure thing. So the last time I was on here, well, the first time was when I was at Netflix working on streaming data systems. And that was a bit interesting
Starting point is 00:03:15 because the premise was that we wanted to be extremely cost-effective when working with this data. And it was specifically around the observability space where we wanted to help keep Netflix, the service up and running. And so I worked on a system called Mantis. We open sourced that
Starting point is 00:03:33 and it did some number of trillions of events per day and petabytes of data per day. After that, I went to Stripe where I led a data team. Stripe is really big on eventing systems. And so I led a data team around change data capture, worked with some folks on Debezium as a committer to the Debezium Vitess connector. And then this change data capture system worked with financial data. And it was mid-migration before I left.
Starting point is 00:04:00 And at this point, it's 100% migrated to the new system, which does about $640 billion in annual payment volume. And so I thought, hey, things were going too well. Let's go on to hard mode here. So I decided to leave and start a company in the identity security space. So I am now sitting as a co-founder and CTO of Abbey Labs, and we're tackling challenges around authorization. Very cool. Okay, I want to dig into Abbey and all the things about it. But one question actually for you: I know Stripe is big on eventing.
Starting point is 00:04:37 Was it always like that? Has the company always had sort of an event-driven architecture or do you know if that was a process that they went through? Definitely before my time, but looking through the Git commit history, the very first incarnation of the CDC pipeline was, I think, in 2014. So, wow. Not everything is evented, right? Stripe is a heavy MongoDB user. Yeah. But the idea is that developers want to use tooling that works well for them. And so the standard model is, hey, I have a stateless web app and I write to a database. Yeah. But rather than doing these distributed transactions or complicated joins, let's have these async systems receive the individual operations out of the database
Starting point is 00:05:27 change data capture. And then from there, you can fan it out and people can be as async as they want, people being services. Yeah, very cool. That's just a fun bit of history there. Okay, so Abbey, give us the breakdown. And what I'd love to hear is, you know, give us the brief explanation of what the product does. But then also, I'd love for you to go back and, you know, where did the idea come from? And how did you decide to start a company specifically focused on this problem? Yeah, so a couple questions there. Definitely we're in early days. So things are subject to change for sure, as you all know. But it turns out that as an organization gets larger, it's probably a pretty good idea to have an understanding of who has access to what. So you can improve security posture, try to enforce least privilege
Starting point is 00:06:25 and all those other buzzwords. But the idea is as the number of employees in an organization grows, you also see many more services and each of those services requires different permissions at different levels of granularity. And so you end up with this sort of N-by-N problem that makes it difficult to manage
Starting point is 00:06:46 and understand the state of access within your company. Quite simply put, who has access to what is a difficult problem to solve. And when you can answer that question in this environment that's fragmented and ever-changing, then you can do other things for your security or your compliance programs. And so really, the way my co-founder and I, Arvill, we've been thinking about it is that
Starting point is 00:07:11 authentication's pretty mature right now. You've got a lot of players there. But the authorization space is still early, early days. And there's a certain level of maturity that a company has to go through. There's like this maturity curve, like you want to get some single sign-on, you want to enforce passwords, and then eventually you're going to get
Starting point is 00:07:30 to this permissions level concern within your organization. So we thought this would be the good place to help people tackle that because the challenge really comes at scale. And the problem is that these teams that are responsible or accountable for ensuring that this stuff works, their headcount stays relatively flat. Super interesting. And when we were talking, catching up before the show,
Starting point is 00:07:56 you have an interesting approach to this problem. And you described it as fundamentally a data problem, which I think is really fascinating. Can you break that down for us? How is it, first of all, when people do not classify it as a data problem, how do they classify it? And then why do you classify it as a data problem? Yeah, it just comes from experience of all the different systems I've worked on. And, you know, this is one of my hot takes as a data person jumping to the security realm, right? But the idea is you have these different types of data sets. You have identity data, which is like human and machine identity.
Starting point is 00:08:36 And so those are like attributes on who you are, what you are. And then you have access data, which is what are the things you can access? And then you have activity data, which is what are the things you can access? And then you have activity data, like did you actually do something or access the thing? And so, you know, depending on the size of the organization, this could get pretty, pretty crazy for two reasons. One is the scale of data. In this world, you kind of want to have a view of everything. And so sampling is kind of a tricky situation there.
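To make the three data sets Jeff names (identity, access, activity) concrete, here is a minimal Python sketch. The class names, field names, and the `unused_grants` helper are illustrative assumptions, not Abbey's schema; the helper shows one question you can ask once all three data sets sit side by side.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Identity:
    # Who or what you are (hypothetical fields).
    identity_id: str
    kind: str  # "human" or "machine"


@dataclass
class AccessGrant:
    # What an identity is allowed to reach, and at which level.
    identity_id: str
    resource: str
    level: str


@dataclass
class ActivityEvent:
    # What an identity actually did.
    identity_id: str
    resource: str
    action: str
    at: datetime


def unused_grants(grants, events):
    """Return grants with no recorded activity on the same resource."""
    used = {(e.identity_id, e.resource) for e in events}
    return [g for g in grants if (g.identity_id, g.resource) not in used]
```

Grants that never show up in activity data are natural candidates for a least-privilege review, which is why sampling the activity data is risky: a sampled-out event would make a grant look unused.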
Starting point is 00:09:07 And then the other thing is the data itself is fragmented. If you think of external SaaS applications, there can be many. Like even as a startup, we already have so many. There's your accounting software, your business software, your engineering, and et cetera, HR. And then you have internal services, and then you have ephemeral things like workload. And so it gets pretty unmanageable or untenable very quickly. And so I thought, okay, all of these things are data sources.
Starting point is 00:09:37 You want to, they're generally raw data. Some are log-looking data. Some are more structured. And I want to derive insights on this. And then from those insights, I want to do some sort of automation. So a lot of this is like, get data in, store it somewhere in the right place, enrich the data, and do something with those enrichments. It sounds pretty familiar over here. Very cool. Yeah. So the corollary is like, well, if I think of these as like vertical a sound foundation that is built upon best practices that we've learned from in the data space and a little bit also from the
Starting point is 00:10:31 observability space, depending on the use case. Yeah, for sure. And I'm interested to know, as you look at the landscape of that data being created. Coming from the data space, do you see opportunities around improving the availability of data there? Because in the world of data, if you talk about CDC or eventing systems that we just discussed,
Starting point is 00:11:03 those are very established concepts, decades old, lots of technology, lots of established patterns. And I think a lot of times when you bring a paradigm of, okay, well, this is actually a data problem at the core into a discipline that, you know, heretofore hasn't really been described as a data problem, a lot of times there could be deficiencies on the actual data side of things. Is that an opportunity area or a challenge that you see? Yeah, definitely. Also definitely learn from what came before. There's always going to be nuances, right? You can't just say, oh, let's sprinkle on some software and call it a day. But so this is like the example of don't, oh, let's just sprinkle some data
Starting point is 00:11:46 technologies and call it a day. No, not like that. But yeah, so I think the challenge is that, at least for the companies that we're thinking of, they might be cloud native. They might not be. They might be bare metal, old school on-prem, or on-prem with their own cloud accounts. So depending on your prioritization, you kind of want to consider each of those differently. And so what I mean is I mentioned the word fragmentation earlier. So the ecosystem is fragmented. So you have API calls, or, if someone's more sophisticated, then yeah, sure, connect to the stream. Otherwise it'd be more snapshot based, or there's different protocols.
Starting point is 00:12:29 It's a bit tricky. So integration is a huge pain point. And there are a lot of players out there. I would like not to build yet another system that ingests data. So yeah, I would like to avoid that. But the problem is it's easy to ingest the data in my opinion, like relatively easy. But the problem is, okay, so I can get the data in, I can set it up in five minutes an hour or whatever.
Starting point is 00:12:55 But then I'm going to have the next level questions immediately after that, which is like, okay, well, how do I do it without blowing my budget? Oh, you're going to give me a full refresh every single ingest? I don't think so. Or how can I do this incrementally? Or the other thing is like, okay, so the data is good. Like, how good is it? Like, what about data quality?
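The contrast between a full refresh on every ingest and an incremental one can be sketched with a cursor. This is a minimal Python illustration under stated assumptions: `SOURCE`, `fetch_changes`, and the `updated_at` field are hypothetical stand-ins for whatever a real identity or SaaS API exposes, and many real sources offer only snapshots.

```python
# Hypothetical upstream records; a real source would be an API or a stream.
SOURCE = [
    {"id": "u1", "role": "admin", "updated_at": 1},
    {"id": "u2", "role": "viewer", "updated_at": 2},
]


def fetch_changes(since):
    """Return only the records modified after the given cursor value."""
    return [r for r in SOURCE if r["updated_at"] > since]


def incremental_sync(fetch, store, cursor):
    """Upsert records changed since `cursor` into `store`; return the new cursor."""
    new_cursor = cursor
    for record in fetch(since=cursor):
        store[record["id"]] = record
        new_cursor = max(new_cursor, record["updated_at"])
    return new_cursor
```

Each run touches only what changed since the previous cursor, so cost scales with the change volume instead of the size of the whole data set. Data quality checks would still need to sit on top of this.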
Starting point is 00:13:15 Because I want this to be actually correct. And so ingestion, like getting the data in, it's just the first initial problem. And I think it's still early there. There are a lot of players there, but I'm eagerly waiting because I do not want to do this again. Yeah.
Starting point is 00:13:32 Okay, I know Kostas has a bunch of questions, but I do have a question about the name of the company. Abbey is, I mean, I love the term, you think about almost like a monastery or something of that nature. Give us the thinking behind the name. It's such a unique name, especially for the type of, we were talking about identity or, you know, security breaches or other things like that. You wouldn't necessarily think about that as, you know.
Starting point is 00:14:01 Yeah, a lot of props to my co-founder on that one. But the idea is, we believe in bringing peace of mind to companies, especially in this crazy world where authorization or permissions are getting out of whack. And we believe in doing that without necessarily having to be so masculine about it. And Abbey just really came about as a place where you can congregate and be at peace. So our thinking is that, you know, we can congregate or get out and have this data and then make it available to people in the way they want it and give them the control. And then so they can eventually have peace of mind to build out their security and compliance
Starting point is 00:14:40 programs. Love it. Kostas. Yes. I have many questions. Yeah. Love it. I also want to tease Eric a little bit because it's a pretty common quote from him. I get it as a signal that my time is coming. You know, like when he says, I know that Kostas has like a lot of...
Starting point is 00:15:02 Oh, yeah. That's your cue. That's our secret signal. Yeah. So, okay.
Starting point is 00:15:12 Before we get into like more technology related questions, I want to ask you something a little bit more personal. As a person who you have made like a career so far, like in engineering, that it's like, let's say around some specific things, right, like you're talking about data infrastructure, events in many various different forms, and at some point you decide to go and enter like a new space, right? And yeah, sure. Like it is a data problem, but it's not only a data problem that you're solving here so what's your experience with that as an engineer right from going from something that you feel comfortable that you have done like a lot of
Starting point is 00:16:00 things there, with your confidence, and getting into a new problem area. Yeah, that's a great question. I will say that even as an engineer, I love the technical side and I always tinker. But for me, it's about solving a larger problem, a problem for someone, and one that matters to them. And so I've always been interested more in customer empathy. Even in data infrastructure, I always push for being like a full cycle developer where you really own the thing you're doing end to end.
Starting point is 00:16:38 And part of that is understanding that you're building something with not just the technical thing in mind. You tie the problem to the product to the technical. And so even in data infrastructure or infrastructure, it's like, who are your customers? Other engineers or machine learning engineers or et cetera, right? And so I've always been interested in that. And so part of it is customer empathy.
Starting point is 00:17:02 The other is product building. And lastly, I have a lot of things that I've learned over the years in terms of company building and building great teams, and I'd like to put that to the test, see how it goes. Yeah. Yeah, makes total sense. And I think what you're saying is also like a partial response to my next question, which has to do with the experience of going from being employed in a big company to starting your own company, right? Because obviously, like, it's a different experience.
Starting point is 00:17:31 So, again, you personally, like, answered that. But tell me a little bit also about this experience so far, how it feels from, you know, like, part of like this huge organization into like being you, your co-founder, and I don't know how many engineers you have right now, but still it's going to be a much smaller like environment compared to what was before. Yeah, I will say like as a founder, it takes a certain type of person to do that. But overall, I would say whether you're a founder or not, the fulfillment is a lot higher because there's so much accountability.
Starting point is 00:18:11 And so if you thrive on that accountability, that execution, then really this place or any other startup is really the way to go. And a lot of it is like going broad. So if you were trying to go deeper, at least for engineering, I would recommend going to a larger company. You'll get to see all the patterns, good or bad. And then you can try to, well, eventually you'd have to pick it, pick and choose pieces of that and distill them down into what could be useful to a startup. But fulfillment is the real big winner here.
Starting point is 00:18:41 Yeah. Not to say that it's without pain. It's tons of pain, but also very fulfilling. 100%. Yeah. But let's not focus on the pain today. We'll keep that for another episode. Actually, we'll do that after you IPO. When you IPO, you'll have an episode to talk about the pain. So, okay, let's talk a little bit more
Starting point is 00:19:04 about technology now. Talking about security, and security is like a broader thing, right? Like there's not just one thing in security. You are, from what I understand by reading like your landing page, for example, talking about identity. Can you give us a little bit of an overview of what security is, what are the parts that you most commonly see out there, and how identity fits into that? Yeah, what is security? Oh boy, that's a lot. I can make a joke for sure, but I won't. But security, I guess for me, is about tying the business value to the risk. And so obviously different companies have different risk tolerances. So that doesn't mean that they're less or more secure, right?
Starting point is 00:19:56 It's just tied to the risk model that they have. And that's tied to the business value that they want to preserve or generate or et cetera. Right. And so around identity, it's like, in this cloudy cloud environment, you have multi-cloud, you have hybrid cloud, there's, you know, on-prem as well, so that's what I mean by hybrid. But the days of being within this single walled network are no longer a thing. It hasn't been a thing for a while. And so, especially with the past couple of years, where you have employees who are not necessarily
Starting point is 00:20:32 within the confines of an office and the VPN in a single location, like they're free to go anywhere as well, it really becomes about identity, right? Like inadvertent access, or intentional or malicious access, it's done by a person or a thing which is backed by a person, right? So it all boils down to an identity. And so there's already a lot to that. So we're just thinking about the employee identity for now. So with identity, there's employee or human identity, there's service identity, and there's workload or machine. And so we're thinking of that in the confines of a company now. So what it means is like, imagine if there's a breach or something, it's like, okay, what is the impact of that breach?
Starting point is 00:21:19 Okay. Maybe an account got taken over. Okay. What levels of access does this account have and to which resources? And how can we begin to figure that out, like traverse that tree recursively, and then maybe do some communication or some mitigation, etc. Right. And how is identity established in,
Starting point is 00:21:43 let's say, the most traditional approach like in the industry right now? Yeah, I say by far, there's a maturity curve for sure. So identity is established through, I would say, through like your Google Workspace. You know, everyone has like a Gmail account for their company or something, right? Or maybe they're Microsoft or something like that, if they're a Microsoft shop. And so after that, they might do some simple things around authentication, like, okay, let's make sure there's a password rule. Like it must be this long with this number of characters. It might have to be refreshed every quarter or something. And then you go up the maturity curve.
Starting point is 00:22:25 There might be, okay, let's SSO everything. And then more people join more applications. They might have contractors, people are changing roles. Okay. We need like a single sign on like a, like an identity provider. So maybe try to do something with Google or maybe move to Okta or some other there. And then after that, it's like, okay, well, everyone has admin access to everything.
Starting point is 00:22:50 So now we need to lock that down for different compliance reasons. That's the stick. Or the carrot would be, okay, we actually want to improve our security posture or reduce like cost in managing this kind of stuff. We can actually have our employees be more productive and have a better experience. All right. So let's say I have like in my organization, I'm using Okta, right? So I have like a central repository of identity,
Starting point is 00:23:18 let's say like everyone needs to go through that like to identify themselves. And there is something in this system, right, that represents my identity, right? Now, this something, we'll see, like, yeah. The reason I'm saying something is because, like, I want to hear from you what this something is, actually, because in this way we can get, like, into the data side of things.
Starting point is 00:23:42 It has to travel around, travel around the different applications and systems that I'll be interacting with, right? How does this work and how important is it to trace that? And when it is important to trace that? Because if you think from the user perspective, like the employee perspective, right? For me, it's just something that I have to go through because I'm forced to do it. I need to access 10 different tools. I know I'll go to Wokta,
Starting point is 00:24:14 someone will add my applications there, I'll click on them, and suddenly I have access, and I go to Salesforce, and something happens. I can do my job there, but I don't really know what's happening between the systems there, right? And also, I don't know why... I mean, I have an idea of why it is important to do that, but what is tracked and how it is exposed and who cares about that is not something that I'm aware of.
Starting point is 00:24:40 And for a good reason, like that's not my job, right? Can you take us through the journey of the data there? Like this identity, how it is represented, how it moves like from one system to the other, what kind of traces it leaves behind? And from all that information, what do we need to do other things later on? Yeah. Yeah. That's a great question.
Starting point is 00:25:03 So there are two cases. One where a company is a bit more mature and they have everything pretty locked down going through an identity provider already. And then the other case is where they don't. Where they don't, then there's probably zero visibility into who has access to what. In the case where they do have things locked down through an identity provider,
Starting point is 00:25:27 and assuming it's all integrated and everything, then they can do some level of, let's say, like tree traversal, if you will, starting from that root. It's basically a GUID, and then traversing down to what access they have. The only thing there is it's not granular access. It's based off of groups
Starting point is 00:25:50 or whoever defined groups or roles. And so that's just the limitation there. But then the question comes in, what if some, well, the problem is like, it's not as centralized as it used to be. So for example, if someone in marketing decides to add a new marketing tool, they can with their corporate card, right? And then now they have access to this new thing that might not be in the view of the
Starting point is 00:26:21 team that's a security team or IT team that's responsible for that. Same thing for engineering. How many times in a large company have you been using a very big bug tracking product and then you're like, hey, let's go use this Trello thing or something like that. That happens all the time. So yeah,
Starting point is 00:26:39 even then it can still get out of hand. But there's the access to resources, but then there's also the levels of access to. So then there's quite a bit of work that goes into that. Then the thing is like, sure, you can have a team like your security or IT team build this stuff and relatively it's easy. Right. But then the problem is like, OK, what do you do on day two? How about me maintaining this thing? Who is on call for this and all that stuff?
Starting point is 00:27:04 And like, do you really want to do that? Because that's out of your core competency. Like you want to be furthering other parts of your security or compliance programs, not doing this sort of data engineering work, right? And so to go back to the other questions, like, why does this or when does this matter? So there's two parts to it. There's like, if you use the carrot and the stick analogy, right? So a lot of it is compliance driven, quite frankly. There's SOC 2, there's ISO, there's SOX, and many other types.
Starting point is 00:27:34 And these are just rules or controls that you have to abide by for whatever reason deemed necessary by your company, right? And so that's the first thing. And so the class of solutions that come out of that, that are born to solve those problems, would be like access reviews or compliance report generation or even like a request approval flow. But then after that, that still can be at different levels of manual. So then you want to automate that as much as you can, because as you said, like ICs, right?
Starting point is 00:28:14 People down the line might not have the context to work with this type of thing. Like imagine, you know, I'm a manager and I've been here, right? I've like, it's in the end of the quarter, quarterly planning is coming up. I have to attend a QBR. There's other things going on. Meanwhile, Slackbot yells at me with 60 permissions that I have to review and approve by the end
Starting point is 00:28:34 of the day. What do you think I'm going to do? I'm just going to hit yes, sadly. And so that might get me through the compliance, but it doesn't necessarily get you through the security part. And so at the end of the day, it becomes worrisome because, you know, then there's liabilities there, right? It could be fines or violations or et cetera, because it could be inaccurate or you eventually could end up getting breached or something like that. So it matters before, like kind of before breach, there's pre-breach and post-breach, I would say. So pre-breach is all of like the posture, the compliance, the companies are trying to be least privileged or zero trust.
Starting point is 00:29:11 And that's all cool, but like just making security better. And then post-breach is understanding the impact or the blast radius. So an account got compromised. What are all the things this account has access to and what levels of access and how do I go in and shut things off? The answer is I don't want to do any of that. I want a system that automatically does that for me and then tells me after, or depending on the risk of the company, it can have me approve it or not, but that's the idea. Okay. And where is like Abbey operating in this picture that you have described? Yeah, so right now we're thinking about this in a few ways.
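The post-breach "blast radius" question raised above (what can a compromised account reach, and at what level) is a graph traversal. Below is a minimal Python sketch under stated assumptions: the `GRAPH` adjacency map, the node naming, and the convention that membership edges carry a level of `None` are all hypothetical, not a description of Abbey's system.

```python
from collections import deque

# Hypothetical access graph: node -> list of (target, access level).
# Membership edges (user -> group) carry no level of their own.
GRAPH = {
    "user:jeff": [("group:eng", None), ("res:wiki", "read")],
    "group:eng": [("res:repo", "write"), ("res:wiki", "admin")],
}


def blast_radius(graph, start):
    """Breadth-first walk; returns {resource: set of access levels} reachable from start."""
    reached = {}
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target, level in graph.get(node, []):
            if level is not None:
                reached.setdefault(target, set()).add(level)
            if target not in seen:  # guards against cycles in nested groups
                seen.add(target)
                queue.append(target)
    return reached
```

Calling `blast_radius(GRAPH, "user:jeff")` collects both the direct grant and the grants inherited through the group, which is the input a mitigation or approval step would need.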
Starting point is 00:29:50 We're thinking about it in terms of like integrating. So the ecosystem of data sources is fragmented. So the integration, we're trying to solve that as well. But then in addition to that, you have this raw data. And so we're trying to build out a, let's say, like a unified view of an identity. So in other domains, this is called entity resolution. So we built out a little thing where you can see a graph starting from Jeff and looking at all the levels of access that I have to which resources.
Starting point is 00:30:28 And then there's like a little search and I can search for different resources and it will highlight parts of the graph. So integration, identity resolution, and then the last part is automation. So you have this foundation of data, you can integrate, you can enrich it, which is the identity normalization or resolution. And then after that, you take that data and then you automate it against some workflows. So then that would be around things like access reviews or request approvals. And okay, so we have the identity and this identity, let's say for each system that it has access to, most probably each system has its own
Starting point is 00:31:05 access controls, right? Salesforce has its own. Zendesk, whatever. Everybody's different. It's crazy. Exactly, yeah. And then, of course, you have everything in-house. Who knows what's going on there?
Starting point is 00:31:21 You have systems that can become super complex in terms of how access controls are managed. How do you connect and align all these things without creating just noise in front of the user? Because one thing is to aggregate all the data, and it's a completely different problem how you can make sense out of all this data. Right. So how do you do that? Like, give us a little bit of insight there because that's an interesting data problem. Oh yeah, man.
Starting point is 00:31:53 This is the funny thing because, you know, one could say we'll create a standard and then everyone follows it, but then you just end up with N plus one standards. Right. So we'll see how that goes. But there are existing protocols out there and standards and people that are trying to do good work on that. But I think for me, this is drawing from the data space, right? So there are three ways to do it. So, okay, let's speak around a concrete example. I want to understand who Kostas is.
Starting point is 00:32:26 Kostas is a GUID in Okta. Kostas is an email address in Google. And Kostas is, let's say, an IAM policy in AWS. Or
Starting point is 00:32:41 Kostas is a mapping in a YAML file on a service. So how do I understand what that is? There are three ways to do that. One is you can do a direct mapping if it's that easy, like email address to email address exact match. The second way is using a heuristic or rule-based matching. So let's say, you know, we have GitHub as well. Let's add that. GitHub usernames, those are usually personal accounts, right? If you had a GitHub account
Starting point is 00:33:09 that was prefixed with my company name hyphen username, you could apply that heuristic or that rule for other identity sources. The third one is where both of those fail. If there are zero attributes that you can look at to map them together, then that comes with inference. So inference is like, how do you infer who someone is? And you do that through their behavior, the things that they have access to, similar to their peers. And so now we're getting into a lot of like classification or some sort of graph clustering like that.
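The three matching strategies described here (direct mapping, rule-based heuristic, inference from behavior) can be sketched roughly as follows. This is a minimal illustration and not Abbey's implementation: the `Account` fields, the `acme-` company prefix, the Jaccard similarity measure, and the 0.6 threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Account:
    """One identity record from one source (all fields are illustrative)."""
    source: str
    handle: str
    email: Optional[str] = None
    resources: FrozenSet[str] = frozenset()


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Overlap between two sets of accessed resources, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(known: Account, candidate: Account,
          company_prefix: str = "acme-", threshold: float = 0.6) -> Optional[str]:
    """Return how the two accounts were linked, or None if they were not."""
    # 1. Direct mapping: exact (case-insensitive) email match.
    if known.email and candidate.email:
        if known.email.lower() == candidate.email.lower():
            return "direct"
    # 2. Heuristic: strip a hypothetical company prefix from the handle
    #    and compare it with the local part of the known email address.
    if known.email and candidate.handle.startswith(company_prefix):
        local_part = known.email.split("@")[0].lower()
        if candidate.handle[len(company_prefix):].lower() == local_part:
            return "heuristic"
    # 3. Inference: no shared attributes, so compare access behavior instead.
    if jaccard(known.resources, candidate.resources) >= threshold:
        return "inferred"
    return None


okta = Account("okta", "00u1", "kostas@example.com", frozenset({"rds", "s3", "repo"}))
github = Account("github", "acme-kostas")
print(match(okta, github))
```

A real system would likely replace step 3 with classification or graph clustering, as mentioned above; set overlap is only the simplest stand-in for "similar access to their peers".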
Starting point is 00:33:43 So those are the three ways that I see today without a standard. Yeah. That's super interesting because, okay, you know, one thing is matching on a syntactical level, which is hard on its own, right? Like you have the email and then you have the YAML file and then an XML document. I don't know why I like that. Like, how fun. But there's also, like, the semantic level, right?
Starting point is 00:34:07 Like, what's the meaning behind these things? How aligned can they be? And, like, you can see that even with, like, and I bring this up because, like, when it comes to access control, my experience is mainly with data. You have the role-based access controls and then you have
Starting point is 00:34:25 attribute-based access controls. And at the end, they're supposed to be doing the same things, but in a different way. But how do you transform one to the other? It's not that trivial, right? Even if they represent the same things. Exactly because of the way that we represent things or what we mean, or we implicitly mean in these things, it's not easy.
Starting point is 00:34:45 So that's why I find this super interesting. And by the way, it's not solely in security. I think Eric can talk about identity resolution in marketing, right? And like figuring out who is doing what and how to create this identity graph there. So you mentioned some applications and protocols. Can you tell us a little bit more about that? Like what are the standards out there, if there are any?
Starting point is 00:35:15 Yeah. So there are a couple of things that I'd like to address. One is Open Policy Agent and specifically the Rego language. So that's for like defining policies. So we're thinking of using that in a way that we can have some standard around defining policies in a sensible way. And then evaluating them as well. And then on the API side, there's SCIM, the SCIM protocol. So that's mostly like detecting changes upstream and then listening to them and then applying like permissions changes around users and applying them downstream.
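To make "applying permissions changes downstream" concrete, here is a rough sketch of what a SCIM 2.0 PATCH request body can look like, built as plain Python dictionaries. The schema URN and the `Operations` layout follow the SCIM protocol specification; the user ID is a made-up placeholder, the helper names are hypothetical, and whether a given SaaS provider supports these exact operations is an assumption to check per provider.

```python
import json

# Message schema URN defined by the SCIM 2.0 protocol for PATCH requests.
PATCH_OP_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def remove_group_member(user_id: str) -> dict:
    """Body for PATCH /Groups/{group_id}: drop one member from a group."""
    return {
        "schemas": [PATCH_OP_SCHEMA],
        "Operations": [
            {"op": "remove", "path": f'members[value eq "{user_id}"]'}
        ],
    }


def deactivate_user() -> dict:
    """Body for PATCH /Users/{user_id}: mark the user as inactive."""
    return {
        "schemas": [PATCH_OP_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": False}],
    }


# "user-123" is a placeholder ID used only for illustration.
print(json.dumps(remove_group_member("user-123"), indent=2))
```

The bodies would be sent with an HTTP client of choice to the provider's SCIM endpoint; that part is left out because endpoints and authentication differ between providers.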
Starting point is 00:35:53 There's also a read component. It's just CRUD on REST. And so there's a read component to that as well. There are a number of open source or source available, I would say, projects out there which are attempting to have like a standard around ingesting these types of sources. These types being like any external, any SaaS application, really, and then having some sort of like interface or API around that. And so, yeah, I would say those are the main ones. Okay. And then when it comes to...
Starting point is 00:36:32 Okay, these are the policies, right? And how we can... Let's say the formal part and where we define things, how they should ideally be, right? And you have to start tracking what's going on in these systems. So I guess there you have different types of data that you need to collect, probably logs or I don't know. So what's there?
Starting point is 00:36:52 What's the behavioral part of the identity that you are tracking? What does it look like and how do you collect that? Yeah, yeah. So there's three types of data: identity, access, and activity. And so identity, again, there's human and machine. And so that can come from any, you know, identity provider. Access data might also be coming from a resource itself. Because, you know, any OLAP or OLTP database might have, you know, permissions embedded in there. Right. And so you could get it from there.
Starting point is 00:37:38 And then activity data, that's just a fancy word for logs. So in the security space, there's SIEMs, S-I-E-M. And so that collects everything, or there's other flavors of SIEM like XDR, EDR, like extended detection and response, etc. And so basically, those are, you know, like Elasticsearch-esque looking things. And so the same patterns, right? You're ingesting from API, REST API, the schema, schema is just a schema, right? It might have different schema or envelopes. You're ingesting directly from data stores or data sources like an OLAP database or like an event queue or something like that.
Starting point is 00:38:18 And then you're also ingesting from logs or search indices. Yeah. It sounds like a lot of data. Is it a lot of data? Probably. I would say it depends on the size of the company, but I would say hopefully the number of groups, people would use a lot of role-based access. So hopefully those aren't too large, but we've seen them to be pretty large from our customers. Yeah, like there's twice as many admin roles or groups as employees, and
Starting point is 00:38:50 that's not a good idea. But so in terms of the number of items, it's not that much. But if you want to listen for changes on those, that could be a bit more, but the identities and access, that changes a bit more frequently than the sheer number of it. But then when you add in the activity data, that is the long tail. Yeah. Yeah. That's what triggered this reaction for me, because we're talking about
Starting point is 00:39:24 like logs, logs can be verbose, right? Yeah. There's a lot of data there, and there's a lot of processing that needs to happen, because they are semi-structured data. It's not necessarily like a JSON. Yeah. Logs are so funny. It's like, how do I say this? It's like fairly not valuable because it's raw, and the logs might be holding a lot of sources of data, you know, that may look
Starting point is 00:39:57 different, and yet it's still so valuable at the same time if you're able to structure them and extract the right insight that you need. Because it's kind of like you don't know what you don't know, you know? So in security, you kind of want to know as much as you can, obviously depending on your risk tolerance. But yeah. So okay, from all this different data, what kind of modeling do you do on top of that? Because somehow you need to connect all these things. You have different serializations. It's more low-level stuff that's just so different. You have, as we said, semi-structured logs, and then you have identities that are records, like records on a database, right? It's the opposite.
Starting point is 00:40:48 So how do you deal with that variety of data that needs to be homogenized somehow? Yeah, pretty standard way. You ingest the raw data and then you TTL it if you need. And then you have async systems that are able to process and reprocess the data to normalize it and do some sensible representation. Then that spits out, one of the data sets we spit out is the, you know, the resolved identity. And so that's just a single data set. And then, yeah, that's stored somewhere, and then that's it. It's pretty, pretty
Starting point is 00:41:35 standard here, I think. In terms of serialization, you know, on the ingest, lots of it is through REST. There might be different envelopes on it. We're able to handle that. And then on the egress, right now, we haven't done it, but we're looking to use... The whole idea is to not build this walled garden. We want to give control to our customers. And so you can bring your own tooling, bring your own database, bring your own BI tool. And the reason is because this data should be accessible by not only security engineers, but IT admins, or maybe data engineering with a security focus as well. And so why would we want to build a tool that you aren't using today? Like there's
Starting point is 00:42:27 already amazing tooling out there. And so we want to use this like specific table formats, fake query engines that are available, and you can just plug them in. We'll host the data for you. If you don't like that, then we can do things like bring your own encryption keys, or you can host the data yourself if you dare. And then yeah, so interoperability is pretty huge for us. Makes sense. And can you give us an example of, let's say, the first insights that someone can get from these homogenized and processed data sets that you create? It would be
Starting point is 00:43:08 nice if it's something that, for someone who was working in that space before using something like Abbey, would be hard to get. Yeah, some simple questions once you're connected. It's just like,
Starting point is 00:43:23 how many admins do I have to which systems in my company? And so that's the first question. The more interesting question on top of that is transitive access. How did Kostas get access to this RDS instance, this table within an RDS instance? And he got access because he's part of this group, which is part of a group before that.
Starting point is 00:43:50 And Eric had added Kostas to that group. And that's how he has access. And then the third thing is really around, so aside from analytics, we use the same thing to just run like a continuous query, so then you can basically throw an alert. So like, I know how many admins I have now, alert me on Slack if that goes beyond, yeah, we have 10 today, hopefully no more than 10, so alert me on that. So that's the beginnings of building automation. And one last question from me, and then I'll give the microphone back to Eric.
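The transitive access question and the admin-count alert can both be sketched over a small in-memory graph. This is only an illustration under assumptions: the `member_of` and `grants` dictionaries, the group names, and the limit of 10 are invented for the example, and a production version would presumably run as a continuous query against a real store rather than in Python.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def access_path(member_of: Dict[str, List[str]], grants: Dict[str, Set[str]],
                user: str, resource: str) -> Optional[List[str]]:
    """Breadth-first search from a user through nested groups.

    Returns the chain of principals that explains the access,
    or None if the user cannot reach the resource.
    """
    queue = deque([[user]])
    seen = {user}
    while queue:
        path = queue.popleft()
        principal = path[-1]
        if resource in grants.get(principal, set()):
            return path
        for group in member_of.get(principal, []):
            if group not in seen:  # guards against cyclic group nesting
                seen.add(group)
                queue.append(path + [group])
    return None


def admin_alert(admins: Iterable[str], limit: int = 10) -> Optional[str]:
    """Return an alert message when the distinct admin count exceeds the limit."""
    count = len(set(admins))
    if count > limit:
        return f"admin count {count} exceeds limit {limit}"
    return None


# Hypothetical data: kostas -> data-eng -> engineering -> RDS table.
member_of = {"kostas": ["data-eng"], "data-eng": ["engineering"]}
grants = {"engineering": {"rds:orders"}}
print(access_path(member_of, grants, "kostas", "rds:orders"))
```

Recording who added each membership edge (Eric, in the spoken example) would need edge metadata, which this sketch leaves out.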
Starting point is 00:44:28 From your experience so far with the customers and the users that you are talking with, what are the first and most, let's say, obvious systems that they bring in and they try to get insights from? Because, okay,
Starting point is 00:44:43 from what I understand, when we're talking about like identity, it's everything, right? It can be a SaaS application, it can be like your cloud infrastructure, it can be your database systems, it can be like, I don't know, like pretty much like everything.
Starting point is 00:44:58 So what's like the most common and the first, let's say, use case that you see there in terms of infrastructure that they're struggling today to have good monitoring of identity on? Yeah, I'll frame this in terms of user persona. So the first one is I'm a head of security, or someone that's responsible for IT, and I just joined the company. WTF, what's going on? I need to have some insight into who has access to what.
Starting point is 00:45:23 That's number one. Number two is we have an audit coming up, and I need to understand who has access to what so then I can do any remediations. And number three is, oh no, we've been breached. I want to understand. That's a bit bad because it's more time-bound, but I want to understand what the blast radius is. And so really it's about like,
Starting point is 00:45:47 number one is understanding the state of access, but then ultimately that honestly matters a lot less compared to actually doing the thing that comes after. Makes sense. All right, Eric, all yours. I'm sure you have also more questions. Well, this is so interesting. Kostas read my mind here, which makes sense because we've been doing this for a couple
Starting point is 00:46:09 years. And of course, I come from the world of marketing where we talk a lot about identity resolution. And going into this conversation, part of me thought, okay, an organization in some ways is a closed system, right? When I think about marketing, there are all these external touch points that I have zero visibility into, right? And I can only understand them in many ways via proxies of the way that people come into an interaction with my company and then sort of go through and all that.
Starting point is 00:46:42 But if you think about inside of a company, you know, even though there can be a lot of ambiguity, at least you kind of have, you know, somewhat of a closed system, right? But the more you talk about it, the more I thought, I mean, you could really just change out some terminology and be talking about identity resolution in general. Do you agree with that? And are there things to learn from customer identity resolution in the way that you solve that problem inside of an organization? Yeah, I think certainly the techniques will be the same. There will always be the nuances. But yeah, that's where we draw inspiration from. It's the identity resolution.
Starting point is 00:47:25 A lot of people have done a lot of work that came before this, so it's nothing new in that regard. Within the context of an organization, that is true, but I would just be careful in thinking of this organization as like a very static, walled thing. Organizations are by themselves amorphous in many ways. You know, how many reorgs have you been in in the past year for a large company? How many contractors come in and out, services that are built and torn down? How many employees join or leave the company? So it's really tricky because there's fragmentation, and change is really the only constant,
Starting point is 00:48:10 to throw that cliche out there. Yeah, yeah. No, that is actually very interesting to think about because when I think about it from a marketing perspective, there are all sorts of entry points and then certain pathways that you can go through, but there actually aren't that many paths through systems, interestingly enough, which is much, much more complicated inside of an organization,
Starting point is 00:48:31 right? Because you have, you know, an individual identity traversing hundreds of systems, right? Whereas with a customer, I mean, they may be in lots of systems, but their journey generally, you know, follows a fairly defined path. Yeah. And that would even be the better case too, because a lot of times there's not even an individual identity, like through an identity provider. A company might have done some number of M&As in the past year and those companies each brought their own identity provider.
Starting point is 00:49:04 And yet you're still under the same ticker symbol. Yeah, yeah. And then, so you've gotten a lot of that. Yeah, it is. It just does sound so funny thinking about concepts like fingerprinting for your own employees. I mean, not even in like a big brother way, but from a security standpoint, that's just an interesting concept.
Starting point is 00:49:25 Yeah, I hear you. That's crossed my mind as well. Yeah. Okay. Well, we're at time here, but I do have one more question for you. So you have built amazing technologies that have come out of some of the most interesting, awesome companies in the world, but now you're building a new company. And one thing that I like about going through that experience is you get a chance to explore a lot of different things that maybe you are more limited in just because of the scope of your role or the project or something at a larger company. In building Abbey, have you run across any interesting new or old or different technologies that have been intriguing to you? Yeah, you know, I'm an active follower of the streaming space.
Starting point is 00:50:14 That's my bread and butter. So eagerly awaiting the developer experience to get better there. It's funny, I just listened to the talk shop, or shop talk, with you and Kostas about the real time versus streaming debate. Got a lot of opinions there. But let's do a follow-up shop talk and have you on. Yeah, yeah, it's great. I'm really glad that the title was explicitly real time versus streaming because those are not necessarily mutually exclusive. Right. So thanks for that. But that aside, some streaming stuff, because we do have some streaming component on our side and I don't want to build yet another thing.
Starting point is 00:50:52 I don't have time for that. The other thing is we're thinking a lot about graphs. And so thinking about graph data stores or graph relational stores, and then also around more like standardization around some security metadata protocols, like SCIM, I would say, or other things that are like that. Also, permission stores are very interesting to me. There are a number of players out there as well. And so everything on like the control side. We're not so much thinking about enforcement, which is a different approach, but a different set of problems and technologies.
Starting point is 00:51:30 And so, yeah, anything around control, that's very interesting to me right now. Very cool. Well, I definitely want to, Brooks, let's make sure to get Jeff back on for a follow-on shop talk on streaming because we would love a hot take. And Jeff, thanks again for joining us. And Abbey sounds awesome and best of luck with it. Yeah.
Starting point is 00:51:50 Thanks. Thanks again for having me. And it's good to see you all again. Okay. What I loved about that, Kostas, was I think a couple of times Jeff said, I don't want to build another like streaming ingestion service, which I loved on a couple of levels because obviously if he's done that
Starting point is 00:52:09 at companies like Netflix and Stripe, he's sort of seen a lot of angles of that problem and solved it at a scale that many of us will never see. So I just loved it. It was in some ways like a humble way to acknowledge that he has solved a lot of those problems. And how cool is it that he's at a point where he's like, it's not intellectually stimulating for me to continue to focus on that problem area. I was like, man, what a place to be.
Starting point is 00:52:37 That was really cool. And then also, obviously, at the very end of the show, I was just super intrigued about the parallels in identity resolution from my background in marketing and how similar that is to the problem they're solving inside of a company. Now, obviously the security concerns are certainly very different, but that is really
Starting point is 00:52:58 fascinating and I'm sure that I'll be thinking a lot about that this week. Yeah, yeah, 100%. For me, it's always fascinating. When it's like... I'll say that. One of the greatest things around software engineering
Starting point is 00:53:15 and computer science and this whole industry is abstraction, right? And it's very interesting to see how the same abstractions apply to different problems and how you can implement, let's say, similar patterns to solve problems in very different areas, like from security to marketing. But at their core they are the same. That's always something that I find super fascinating. It's one of the reasons that I love the things that I'm doing and why I work in this space, and why I like computers and all that stuff.
Starting point is 00:53:57 So this is exactly one of these cases. Of course, the implications of the solution and the problem itself are very different when we are talking about security or we're talking about marketing or something else, but that's what makes it interesting, right? Like you can build something, and that's what you see with people like Jeff. You have someone who, okay, he builds data infrastructure and now he can take all this experience
Starting point is 00:54:25 and knowledge and apply it in a different domain. Yeah. That's beautiful. Yeah, I agree. I love it. And I definitely want to get him on a shop talk. I think that would be awesome. Yeah, let's do that.
Starting point is 00:54:37 Absolutely. All right. Well, thank you for listening. Subscribe if you haven't. Tell a friend. Jeff is a subscriber. So if you want to be like Jeff, subscribe to the show. And we'll catch you on the next one.
Starting point is 00:54:49 We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. Thank you.
