The Data Stack Show - 125: Authorization Is A Data Problem with Jeff Chao of Abbey Labs
Episode Date: February 8, 2023
Highlights from this week’s conversation include:
Jeff’s background at Netflix and Stripe leading him to Abbey Labs (2:22)
What Abbey is solving in the space (5:16)
Tackling permissions in an organization (7:30)
Opportunities to improve the availability of data (10:14)
The challenge of tackling a new problem area at a new company (14:59)
What are the most common challenges in the identity and security space (18:43)
Importance of identity and the ability to track it in data (22:46)
Connecting all the different platforms without frustrating the user (30:32)
What are the parts of access data that need to be tracked (36:10)
Dealing with the varieties of data in security and managing permissions (40:26)
Final thoughts and takeaways (51:52)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome to the Data Stack Show. Kostas,
I think this may be our first three-time guest on the show. Jeff, we first talked with him when he was at Netflix. We talked with him again when he was at Stripe. And he has now co-founded his own
company, Abbey. And what an amazing guy. We love having him on the show.
And we're going to talk with him about Abbey today, which is in the identity space, but
focused on employee identities within a company with the emphasis on security, which is really
fascinating.
And what I want to know, this isn't going to surprise you, but he built all sorts of
crazy streaming technologies at some of the most famous companies in the entire world across a
number of different problem areas. And that's pretty different from what he's building at
Abbey. And so when there's a change like that, I'm always
interested in the story behind it and in going to attack a new problem. And so that's what I'm
going to ask. How about you?
Yeah. Actually, I think it's going to be a great opportunity
to understand why the industry believes that security is a data problem. And I think we have the right person to help us.
So it's a very common theme,
like a theme that we hear a lot lately,
that security is a data problem.
I think for people who are outside of security,
it's hard to understand what this means, right?
So here we have someone who comes from an incredible background in
building data infrastructure who decided to go and build a company in security.
I think we have the right person there to help us understand why security is a data problem and
how this is implemented as part of the vision of the company that he has founded.
Yep. I'm so excited to chat with Jeff again. Let's dig in.
Yeah, let's do it.
Jeff, welcome back. You are at this point a multi-time repeat guest,
and it's always such a pleasure to have you on the show. So thanks for joining.
Hey, thanks for having me again. It's good to be back.
Okay, so for the listeners who didn't get a chance to catch your previous episodes,
number one, if you're listening, you absolutely need to listen to the prior episodes with
Jeff, all the individual ones and the panel ones. But can you just give us a brief background and
then tell us what you're doing today? Because you started something new, which is very exciting.
Yeah, sure thing.
So the last time I was on here,
well, the first time was when I was at Netflix working on streaming data systems.
And that was a bit interesting
because the premise was that
we wanted to be extremely cost-effective
when working with this data.
And it was specifically around the observability space
where we wanted to help keep Netflix,
the service up and running.
And so I worked on a system called Mantis.
We open sourced that
and it did about some number of trillions of events per day
and petabytes of data per day.
After that, I went to Stripe where I led a data team.
Stripe is really big on eventing systems.
And so I led a data team around change data capture, worked with some folks on Debezium
as a committer to the Debezium Vitess connector.
And then this change data capture system worked with financial data.
And it was mid-migration before I left.
And at this point, it's 100% migrated to the new system, which does about $640 billion
in annual payment volume. And so I thought, hey, things were going too well. Let's go on to hard
mode here. So I decided to leave and start a company in the identity security space. So I am
now sitting as a co-founder and CTO of Abbey Labs, and we're tackling challenges around
authorization.
Very cool.
Okay, I want to dig into Abbey and all the things about it.
But one question for you, actually:
I know Stripe is big on eventing.
Was it always like that?
Has the company always had sort of an event-driven architecture or do you know if that was a process that they went through?
Definitely before my time, but looking through the Git commit history, the very first incarnation of the CDC pipeline
was, I think, in 2014. So, wow. Not everything is evented, right? Stripe is a heavy MongoDB user.
Yeah. But the idea is that developers want to use
tooling that works well for them. And so the standard model is, hey, I have a stateless web
app and I write to a database. Yeah. But rather than doing these distributed transactions or
complicated joins, let's have these async systems receive the individual operations out of the database
through change data capture. And then from there, you can fan it out and people can be as async as they
want, people being services.
Yeah, very cool. That's just a fun bit of history there. Okay, so
Abbey, give us the breakdown. And what I'd love to hear is, you know, give us the brief explanation of what the product does.
But then also, I'd love for you to go back and, you know, how did you, where did the idea come from?
And how did you decide to start a company specifically focused on this problem?
Yeah, so a couple questions there.
Definitely, we're in early days.
So things are subject to change for sure, as you all know. But it turns out that as you grow in an organization, as an organization gets larger, it's probably a pretty good idea to have an understanding of who has access to what. So you can improve security posture, try to enforce least privilege
and all those other buzzwords.
But the idea is as the number of employees
in an organization grows,
you also see many more services
and each of those services require different permissions
at different levels of granularity.
And so you end up with this sort of N-by-N problem
that makes it difficult to manage
and understand the state of access within your company.
Quite simply put, who has access to what
is a difficult problem to solve.
And when you can answer that question
in this environment that's fragmented and ever-changing,
then you can do other things for your security
or your compliance programs.
And so really, the way my co-founder, Arvil, and I have been thinking about it is that
authentication's pretty mature right now.
You've got a lot of players there.
But the authorization space is still early, early days.
And there's a certain level of maturity that a company has to go through.
There's like this maturity curve,
like you want to get some single sign-on,
you want to enforce passwords,
and then eventually you're going to get
to this permissions level concern within your organization.
So we thought this would be a good place
to help people tackle that
because the challenge really comes at scale.
And the problem is that these teams
that are responsible or accountable for
ensuring that this stuff works, their headcount stays relatively flat.
Super interesting. And when we were talking, catching up before the show,
you have an interesting approach to this problem. And you described it as fundamentally
a data problem, which I think is really fascinating. Can you break that down for
us? First of all, when people do not classify it as a data problem, how do they classify
it? And then why do you classify it as a data problem?
Yeah, it just comes from experience
of all the different systems I've worked on. And, you know, this is one of my hot takes as a data person jumping into the security realm, right?
But the idea is you have these different types of data sets.
You have identity data,
which is like human and machine identity.
And so those are like attributes on who you are,
what you are.
And then you have access data,
which is what are the things you can access?
And then you have activity data, like did you actually do something or access the thing?
And so, you know, depending on the size of the organization, this could get pretty,
pretty crazy for two reasons. One is the scale of data. In this world, you kind of want to have
a view of everything. And so sampling is kind of a tricky situation there.
And then the other thing is the data itself is fragmented.
If you think of external SaaS applications, there can be many.
Like even as a startup, we already have so many.
There's your accounting software, your business software, your engineering, HR, et cetera.
And then you have internal services,
and then you have ephemeral things like workloads.
And so it gets pretty unmanageable or untenable very quickly.
And so I thought, okay, all of these things are data sources.
They're generally raw data.
Some are log-looking data.
Some are more structured.
And I want to derive insights on this.
And then from those insights, I want to do some sort of automation. So a lot of this is like,
get data in, store it somewhere in the right place, enrich the data, and do something with
those enrichments. It sounds pretty familiar over here.
Very cool.
Yeah. So the corollary is like, well, if I think of these as like vertical use cases, they need a sound foundation that is built upon best
practices that we've learned from in the data space and a little bit also from the
observability space, depending on the use case.
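As a rough illustration of the three data sets Jeff describes (identity, access, and activity) and the "get data in, enrich it, do something with it" flow, here is a minimal Python sketch. The record shapes, field names, and sample values are assumptions made for illustration; nothing here reflects Abbey's actual data model.

```python
from dataclasses import dataclass

# Hypothetical record shapes for the three data sets; all field names are assumptions.
@dataclass(frozen=True)
class Identity:
    identity_id: str
    kind: str   # "human" or "machine"
    email: str

@dataclass(frozen=True)
class AccessGrant:
    identity_id: str
    resource: str
    level: str  # e.g. "read", "admin"

@dataclass(frozen=True)
class ActivityEvent:
    identity_id: str
    resource: str

def who_has_access(grants):
    """Enrichment: group grants by resource -> sorted list of (identity_id, level)."""
    view = {}
    for g in grants:
        view.setdefault(g.resource, []).append((g.identity_id, g.level))
    return {resource: sorted(pairs) for resource, pairs in view.items()}

def unused_grants(grants, events):
    """Grants with no matching activity: candidates for a least-privilege cleanup."""
    used = {(e.identity_id, e.resource) for e in events}
    return [g for g in grants if (g.identity_id, g.resource) not in used]

identities = [
    Identity("u1", "human", "jeff@example.com"),
    Identity("svc1", "machine", "deploy-bot@example.com"),
]
grants = [
    AccessGrant("u1", "billing-db", "admin"),
    AccessGrant("u1", "wiki", "read"),
    AccessGrant("svc1", "billing-db", "read"),
]
events = [ActivityEvent("u1", "wiki"), ActivityEvent("svc1", "billing-db")]

access_view = who_has_access(grants)
stale = unused_grants(grants, events)
```

Here `access_view` answers "who has access to what" per resource, and `stale` lists grants that were never exercised, which is the kind of insight an automation step could act on.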
Yeah, for sure.
And I'm interested to know, as you look at the landscape of that data being created,
coming from the data space,
do you see opportunities around improving the availability of data there?
Because in the world of data,
if you talk about CDC or eventing systems
that we just discussed,
those are very established concepts, decades old,
lots of technology, lots of established patterns. And I think a lot of times when you bring a
paradigm of, okay, well, this is actually a data problem at the core into a discipline that,
you know, heretofore hasn't really been described as a data problem, a lot of times there could be
deficiencies on the actual data side of things. Is that an opportunity area or a challenge
that you see?
Yeah, definitely. Also definitely learn from what came before. There's always going to
be nuances, right? You can't just say, oh, let's sprinkle on some software and call it a day.
So this is like the example of don't say, oh, let's just sprinkle on some data
technologies and call it a day. No, not like that. But yeah, so I think the challenge is that, at
least for the companies that we're thinking of, they might be cloud native.
They might not be. There might be bare metal, old-school on-prem, or on-prem with their
own cloud accounts.
So depending on your prioritization, you kind of want to consider each of those differently.
And so what I mean is I mentioned the word fragmentation earlier. So the ecosystem is
fragmented. So you have API calls, you have, if someone's more sophisticated, then yeah, sure,
connect to the stream. Otherwise it'd be more snapshot based or there's different protocols.
It's a bit tricky.
So integration is a huge pain point.
And there are a lot of players out there.
I would like not to build yet another system that ingests data.
So yeah, I would like to avoid that.
But the problem is it's easy to ingest the data, in my opinion, like relatively easy.
But the problem is, okay, so I can get the data in, I can set it up in five minutes, an
hour, or whatever.
But then I'm going to have the next-level questions immediately after that, which is
like, okay, well, how do I do it without blowing my budget?
Oh, you're going to give me a full refresh every single ingest?
I don't think so.
Or how can I do this incrementally?
Or the other thing is like, okay, so the data is good.
Like, how good is it?
Like, what about data quality?
Because I want this to be actually correct.
And so ingestion, like getting the data in,
it's just the first initial problem.
And I think it's still early there.
There are a lot of players there,
but I'm eagerly waiting
because I do not want to do this again.
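The "next-level questions" Jeff lists (full refresh versus incremental ingest, and data quality) can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the `updated_at` field, the integer cursor, and the required-field check are all hypothetical choices, not a description of any particular ingestion tool.

```python
def incremental_sync(source, cursor):
    """Return (new_records, new_cursor) for records updated after `cursor`.

    A full refresh would re-read every record on every run; here only records
    newer than the last seen `updated_at` value (an assumed field) are returned.
    """
    fresh = [r for r in source if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in fresh), default=cursor)
    return fresh, new_cursor

def quality_issues(records, required=("id", "email")):
    """A very small data quality check: records missing any required field."""
    return [r for r in records if any(not r.get(field) for field in required)]

# Hypothetical upstream source: a list of account records.
source = [
    {"id": "u1", "email": "jeff@example.com", "updated_at": 1},
    {"id": "u2", "email": "", "updated_at": 2},
]

first_batch, cursor = incremental_sync(source, cursor=0)    # initial load reads everything
second_batch, cursor = incremental_sync(source, cursor)     # nothing changed, nothing re-read

source.append({"id": "u3", "email": "kostas@example.com", "updated_at": 3})
third_batch, cursor = incremental_sync(source, cursor)      # only the new record

bad_records = quality_issues(first_batch)
```

The trade-off is the usual one: a cursor keeps each ingest cheap, but it relies on the source exposing a trustworthy change marker, which is part of why Jeff calls ingestion only the first problem.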
Yeah.
Okay, I know Kostas has a bunch of questions,
but I do have a question about the name of the company.
Abbey is, I mean, I love the term,
you think about almost like a monastery or something of that nature.
Give us the thinking behind the name.
It's such a unique name, especially for the type of, we were talking about identity or,
you know, security breaches or other things like that.
You wouldn't necessarily think about that as, you know.
Yeah, a lot of props to my co-founder on that one.
But the idea is that we believe in
bringing peace of mind to companies, especially in this crazy world where authorization or
permissions are getting out of whack, and we believe in doing that without necessarily having to be so
masculine about it. And Abbey just really came from the idea that it's a place where you can congregate and
be at peace. So our thinking is that, you know, we can congregate or gather this data
and then make it available to people in the way they want it and give them the control,
so they can eventually have peace of mind to build out their security and compliance
programs.
Love it. Kostas?
Yes. I have many questions.
Yeah.
Love it.
I also want to tease Eric a little bit
because it's a pretty common quote from him.
I get it as a signal that my time is coming.
You know, like when he says,
I know that Kostas has like a lot of...
Oh, yeah.
That's your cue.
That's our secret signal.
Yeah.
So, okay.
Before we get into more technology-related questions, I want to ask you something a little bit more personal. As a person who has made a career so far in engineering
around some specific things, right, like you're talking about
data infrastructure, events in many different forms, at some
point you decided to go and enter a new space, right?
And yeah, sure,
it is a data problem, but it's not only a data problem
that you're solving here. So what's your experience with that as an engineer, right,
going from something that you feel comfortable with, where you have done a lot of
things and have your confidence, to getting into a new problem area?
Yeah, that's a great question. I will say that even as an engineer, I love the technical side and I always tinker.
But for me, it's about solving a larger problem, a problem for someone, and one that matters to them.
And so I've always been interested more
about customer empathy.
Even in data infrastructure,
I always push for being like a full cycle developer
where you really own the thing you're doing end to end.
And part of that is understanding
that you're building something
with not just the technical thing in mind.
You tie the problem to the product to the technical.
And so even in data infrastructure or infrastructure, it's like, who are your customers?
Other engineers or machine learning engineers or et cetera, right?
And so I've always been interested in that.
And so part of it is customer empathy.
The other is product building.
And lastly, I have a lot
of things that I've learned over the years in terms of company building and building great teams,
and I'd like to put that to the test, see how it goes.
Yeah, yeah, makes total sense. And I think what
you're saying is also a partial response to my next question, which has to do
with the experience of going from being employed in a big company
to starting your own company, right?
Because obviously, it's a different experience.
So, again, you personally answered that.
But tell me a little bit also about this experience so far,
how it feels going from, you know, being part of this huge organization to being
you, your co-founder, and I don't know how many engineers you have right now, but still
it's going to be a much smaller environment compared to what was before.
Yeah, I will say like as a founder, it takes a certain type of person to do that.
But overall, I would say whether you're a founder or not, the fulfillment is a lot higher
because there's so much accountability.
And so if you thrive on that accountability, that execution, then really this place or
any other startup is really the way to go.
And a lot of it is like going broad.
So if you were trying to go deeper, at least for engineering, I would recommend going to a larger company.
You'll get to see all the patterns, good or bad.
And then, well, eventually you'd have to pick and choose
pieces of that and distill them down into what could be useful to a startup.
But fulfillment is the real big winner here.
Yeah.
That's not to say that it's without pain. It's tons of pain, but also very fulfilling.
100%, yeah.
But let's not focus on the pain today.
We'll keep that for another episode.
Actually, we'll do that after you IPO.
When you IPO, you'll have an episode to talk about the pain.
So, okay, let's talk a little bit more
about technology now.
Talking about security, security is a broader thing, right? There's not just one
thing in security. From what I understand by reading your landing page, for example,
you are talking about identity. Can you give us a little bit of an overview of what security is, what are the parts that
you most commonly see out there, and how identity fits into that?
Yeah, what is security? Oh boy, that's
a lot. I can make a joke for sure, but I won't. But security, I guess, for me, is about tying the business value to the risk.
And so obviously different companies have different risk tolerances.
So that doesn't mean that they're less or more secure, right?
It's just tied to the risk model that they have.
And that's tied to the business value that they want to preserve or generate or et cetera.
Right.
And so around identity, it's like,
in this environment, in this cloudy cloud environment, you have multi-cloud, you have hybrid
cloud, there's, you know, on-prem as well, so that's what I mean by hybrid. But the days
of being within this single walled network are no longer a thing. It hasn't been a thing for a while.
And especially with the past couple of years, where you have employees who are not necessarily
within the confines of an office and the VPN in a single location, they're free to go anywhere
as well, it really becomes about identity, right? Inadvertent access, or intentional or malicious access, is done by a
person or a thing which is backed by a person, right? So it all boils down to an identity. And so
there's already a lot to that. So we're just thinking about the employee identity for now.
So with identity, there's employee or human identity, there's service identity, and there's workload or machine identity.
And so we're thinking of that in the confines of a company now.
So what it means is like, imagine if there's a breach or something, it's like, okay, what
is the impact of that breach?
Okay.
Maybe an account got taken over.
Okay.
What levels of access does this account have and to which resources?
And how can we begin to figure that out,
like traverse that tree recursively,
and then maybe do some communication or some mitigation, etc.
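The "traverse that tree recursively" step can be sketched as a small graph walk. This is a minimal sketch assuming access is modeled as a dictionary from a node (an account, group, or role) to the things it grants; the node names and the `blast_radius` helper are hypothetical, and a real system would also carry access levels on each edge.

```python
def blast_radius(graph, start, seen=None):
    """Recursively collect everything reachable from a compromised account.

    `graph` maps a node to the nodes it grants access to. The `seen` set
    guards against cycles, e.g. a role that points back at a group.
    """
    if seen is None:
        seen = set()
    for nxt in graph.get(start, ()):
        if nxt not in seen:
            seen.add(nxt)
            blast_radius(graph, nxt, seen)
    return seen

# Hypothetical access graph: account -> groups/roles -> resources.
access_graph = {
    "jeff": ["group:eng", "role:oncall"],
    "group:eng": ["repo:payments", "role:deployer"],
    "role:deployer": ["aws:prod-deploy"],
    "role:oncall": ["pagerduty:admin", "group:eng"],
}

impacted = blast_radius(access_graph, "jeff")
```

With the impacted set in hand, the follow-up steps Jeff mentions (communication or mitigation) become a matter of iterating over `impacted`.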
Right. And how is identity established in,
let's say, the most traditional approach like in the industry right now?
Yeah, I say by far, there's a maturity curve for sure.
So identity is established through, I would say, your Google Workspace.
You know, everyone has like a Gmail account for their company or something, right?
Or maybe something like that from Microsoft if they're a Microsoft shop. And so after that, they
might do some simple things around authentication, like, okay, let's make sure
there's a password rule: it must be this long with this number of characters, it might have to be
refreshed every quarter or something. And then you go up the maturity curve.
There might be, okay, let's SSO everything.
And then more people join, more applications.
They might have contractors, people are changing roles.
Okay.
We need a single sign-on, like an identity provider.
So maybe try to do something with Google or maybe move to
Okta or some other provider there.
And then after that, it's like, okay, well, everyone has admin access to everything.
So now we need to lock that down for different compliance reasons.
That's the stick. Or the carrot would be, okay, we actually want to improve our security posture or reduce cost in managing this kind of stuff.
We can actually have our employees be more productive
and have a better experience.
All right.
So let's say I have like in my organization,
I'm using Okta, right?
So I have like a central repository of identity,
let's say like everyone needs to go through that
like to identify themselves.
And there is something in this system, right,
that represents my identity, right?
Now, this something, we'll see. Yeah.
The reason I'm saying something is because
I want to hear from you what this something is, actually,
because in this way we can get into the data side of things.
It has to travel around the different applications and systems
that I'll be interacting with, right?
How does this work and how important is it to trace that?
And when is it important to trace that?
Because if you think from the user perspective, like the employee perspective, right?
For me, it's just something that I have to go through because I'm forced to do it.
I need to access 10 different tools.
I know I'll go to Okta,
someone will add my applications there,
I'll click on them, and suddenly I have access,
and I go to Salesforce, and something happens.
I can do my job there, but I don't really know
what's happening between the systems there, right?
And also, I don't know why...
I mean, I have an idea of why it is important to do that, but what is tracked
and how it is exposed and who cares about that is not something that I'm aware of.
And for a good reason, like that's not my job, right?
Can you take us through the journey of the data there?
Like this identity, how it is represented, how it moves like from one system to the other,
what kind of traces it leaves behind?
And from all that information, what do we need to do other things later on?
Yeah.
Yeah.
That's a great question.
So there are two cases.
One where a company is a bit more mature and they have everything pretty locked down
going through an identity provider already.
And then the other case is where they don't.
Where they don't, then there's probably zero visibility
into who has access to what.
In the case where they do have things locked down
through an identity provider,
and assuming it's all integrated and everything,
then they can do some level of,
let's say, like tree traversal, if you will,
starting from that root.
It's basically a GUID,
and then traversing down to what access they have.
The only thing there is it's not granular access.
It's based off of groups
or whoever defined groups or roles.
And so that's just the limitation there.
But then the question comes in,
what if some, well, the problem is like,
it's not as centralized as it used to be.
So for example, if someone in marketing decides to add a new marketing tool, they can with
their corporate card, right?
And then now they have access to this new thing that might not be in the view of the
security team or IT team that's responsible for that. Same thing for
engineering. How many times in a large
company have you been using
a very big bug tracking product
and then you're like, hey, let's go use this
Trello thing or something like that. That happens
all the time.
So yeah,
even then it can still get out of hand.
But there's the access to
resources, but then there's also the levels of access, too.
So then there's quite a bit of work that goes into that.
Then the thing is like, sure, you can have a team like your security or IT team build this stuff and relatively it's easy.
Right. But then the problem is like, OK, what do you do on day two?
How about me maintaining this thing?
Who is on call for this and all that stuff?
And like,
do you really want to do that? Because that's out of your core competency. Like you want to be
furthering other parts of your security or compliance programs, not doing this sort of
data engineering work, right? And so to go back to the other questions, like, why does this or
when does this matter? So there's two parts to it. If you use the analogy of
the carrot and the stick, right?
So a lot of it is compliance driven, quite frankly.
There's SOC 2, there's ISO, there's SOX, and many other types.
And these are just rules or controls that you have to abide by for whatever reason deemed
necessary by your company, right?
And so that's the first thing.
And so the class of solutions that come out of that, that are born to solve those,
would be like access reviews or compliance report generation or even a request approval flow.
But then after that, that can still be different levels of manual.
So then you want to automate that as much as you can,
because as you said, like ICs, right?
People down the line might not have the context
to work with this type of thing.
Like imagine, you know, I'm a manager
and I've been here, right?
It's the end of the quarter, quarterly planning is coming up.
I have to attend a QBR.
There's other things going on.
Meanwhile, Slackbot yells at me with 60 permissions that I have to review and approve by the end
of the day.
What do you think I'm going to do?
I'm just going to hit yes, sadly.
And so that might get me through the compliance, but it doesn't necessarily get you through
the security part.
And so at the end of the day, it becomes worrisome because, you know, then there's liabilities there, right?
It could be fines or violations or et cetera, because it could be inaccurate or you eventually could end up getting breached or something like that.
So it matters before, like kind of before breach, there's pre-breach and post-breach, I would say. So pre-breach is all of like the posture, the compliance, the companies are trying to be least privileged or zero trust.
And that's all cool, but like just making security better. And then post-breach is understanding the
impact or the blast radius. So an account got compromised. What are all the things this account
has access to and what levels of access and how do I go in
and shut things off? The answer is I don't want to do any of that. I want a system that automatically
does that for me and then tells me after, or depending on the risk of the company,
it can have me approve it or not, but that's the idea.
Okay. And where is Abbey
operating in this picture that you have described?
Yeah, so right now we're thinking about this in a few ways.
We're thinking about it in terms of like integrating.
So the ecosystem of data sources is fragmented.
So the integration, we're trying to solve that as well.
But then in addition to that, you have this raw data.
And so we're trying to build out a, let's say, like a unified view of an identity.
So in other domains, this is called entity resolution.
So we built out a little thing where you can see a graph starting from Jeff and looking at all the levels of access that I have to which resources.
And then there's like a little search and I can search for different resources and it will highlight parts of the graph.
So integration, identity resolution, and then the last part is automation.
So you have this foundation of data, you can integrate, you can enrich it, which is the
identity normalization or resolution.
And then after that, you take that data and then you automate it against some workflows.
So then that would be around things like access reviews or request approvals.
And okay, so we have the identity, and let's say for each system that it
has access to, most probably each system has its own
access controls, right?
Salesforce has its own.
Zendesk,
whatever.
Everybody's different. It's crazy.
Exactly, yeah. And then, of course,
you have everything in-house.
Who knows what's going on there?
You have systems that can become
super complex in terms of how
access controls are managed. How do you connect and align all these things without creating just
noise in front of the user? Because one thing is to aggregate all the data, and it's a completely
different problem on how you can make sense out of all this data. Right.
So how do you do that?
Like, give us a little bit of insight there, because that's an interesting data problem.
Oh yeah, man.
This is the funny thing, because, you know, one could say we'll create a standard and then everyone follows it, but then you just end up with N plus one standards.
Right.
So we'll see how that goes.
But there are existing protocols out there and standards and people that are trying to do good work on that.
But I think for me, this is drawing from the data space, right?
So there are three ways to do it.
So, okay, let's speak around a concrete example.
I want to understand who Kostas is.
Kostas is
a GUID in Okta.
Kostas is
an email address
in Google. And Kostas
is, let's say,
an IAM policy in
AWS. Or
Kostas is a mapping
in a YAML file on a service. So how do I understand what that is?
There are three ways to do that. One is you can do a direct mapping if it's so easy, like email
address to email address exact match. The second way is using a heuristic or rules based matching.
So let's say, you know, we have GitHub as well. Let's add that.
GitHub usernames,
those are usually personal accounts, right?
If you had a GitHub account
that was prefixed with my company name hyphen username,
you could apply that heuristic or that rule
for other identity sources.
The third one is where both of those fail.
If there are zero attributes that you can look at to map them together, then that comes down to inference.
So inference is like, how do you infer who someone is?
And you do that through their behavior, the things that they have access to, similar to their peers.
And so now we're getting into a lot of classification or some sort of graph clustering.
So those are the three ways that I see today without a standard.
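Editor's sketch: the first two of the three approaches Jeff describes (exact match, then a rule-based heuristic, with everything else left for inference) could look roughly like the following. The `Account` record, the `mycompany-` prefix, and every sample value are hypothetical and are not taken from Abbey's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Account:
    """One identity as seen by a single system (hypothetical shape)."""
    source: str                  # e.g. "okta", "google", "github", "aws"
    identifier: str              # GUID, email address, username, ARN, ...
    email: Optional[str] = None  # only set when the source exposes one


COMPANY_PREFIX = "mycompany-"  # assumed GitHub naming convention


def match_exact(a: Account, b: Account) -> bool:
    """Way 1: direct mapping on a shared attribute (email, case-insensitive)."""
    return (
        a.email is not None
        and b.email is not None
        and a.email.lower() == b.email.lower()
    )


def match_by_rule(github: Account, other: Account) -> bool:
    """Way 2: heuristic. Strip the company prefix from a GitHub username
    and compare it with the local part of the other account's email."""
    if github.source != "github" or other.email is None:
        return False
    if not github.identifier.startswith(COMPANY_PREFIX):
        return False
    username = github.identifier[len(COMPANY_PREFIX):]
    return username.lower() == other.email.split("@")[0].lower()


def resolve(a: Account, b: Account) -> str:
    """Return which strategy linked the two accounts, if any."""
    if match_exact(a, b):
        return "exact"
    if match_by_rule(a, b) or match_by_rule(b, a):
        return "rule"
    # Way 3: no shared attributes, so this pair is left for inference
    # (behavioral classification or graph clustering), not shown here.
    return "needs-inference"


okta = Account("okta", "00u1abcd", "kostas@example.com")
google = Account("google", "kostas@example.com", "Kostas@example.com")
github = Account("github", "mycompany-kostas")
aws = Account("aws", "arn:aws:iam::123456789012:role/analyst")

print(resolve(okta, google))    # exact
print(resolve(github, google))  # rule
print(resolve(aws, okta))       # needs-inference
```

The cheap strategies run first, so the more expensive inference step only sees the pairs that neither of them could link.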
Yeah.
That's super interesting because, okay, one thing is matching
on a syntactical level, which is hard on its own, right?
Like you have the email and then you have the YAML file and then an XML document.
I don't know why I like that.
Like, how fun.
But there's also, like, the semantic level, right?
Like, what's the meaning behind these things?
How aligned they can be?
And, like, you can see that even with, like,
and I bring this because, like,
when it comes, like, to access control,
my experience is mainly with data.
You have the role-based access controls, and then you have
attribute-based access controls.
And at the end, they're supposed to be doing the same things,
but in a different way.
But how do you transform one to the other?
It's not that trivial, right?
Even if they represent the same things.
Exactly because the way that we represent things, or what we mean, or what we
implicitly mean in these things, it's not easy.
So that's why I find this super interesting.
And by the way, it's not solely in security.
I think Eric can talk about identity resolution in marketing, right?
And like figure out like who is doing what and how to create
like this identity graph there.
So you mentioned some applications and protocols.
Can you tell us like a little bit more about that?
Like, what are the standards out there, if there are any?
Yeah.
So there are a couple of things that I'd like to address.
One is Open Policy Agent, and specifically the Rego language.
So that's for defining policies.
So we're thinking of using that in a way that we can have some standard around defining policies in a sensible way.
And then evaluating them as well.
And then on the API side, there's SCIM, the SCIM protocol. So that's mostly detecting changes upstream, listening to them, and then applying
permission changes around users downstream.
There's also a read component.
It's just CRUD on REST.
And so there's a read component to that as well.
There are a number of open source or source available, I would say, projects out there which are attempting to have like a standard around ingesting these types of sources.
These types being like any external, any SaaS application, really, and then having some sort of like interface or API around that.
And so, yeah, I would say those are the main ones.
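Editor's sketch: the read side of SCIM that Jeff mentions returns user resources with a standard core schema. A minimal way to flatten one into an identity record might look like this. The attribute names (`schemas`, `userName`, `emails`, `active`, `groups`) come from the SCIM 2.0 core User schema; the output shape and the sample payload are assumptions for illustration.

```python
from typing import Any

SCIM_USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"


def flatten_scim_user(resource: dict[str, Any]) -> dict[str, Any]:
    """Turn a SCIM 2.0 User resource into a flat record (hypothetical shape)."""
    if SCIM_USER_SCHEMA not in resource.get("schemas", []):
        raise ValueError("not a SCIM 2.0 User resource")

    emails = resource.get("emails", [])
    # Prefer the email flagged as primary, else fall back to the first one.
    primary = next((e["value"] for e in emails if e.get("primary")), None)
    if primary is None and emails:
        primary = emails[0]["value"]

    return {
        "id": resource["id"],
        "user_name": resource["userName"],
        "email": primary,
        "active": resource.get("active"),
        "groups": sorted(
            g["display"] for g in resource.get("groups", []) if "display" in g
        ),
    }


# Made-up payload in the shape a SCIM read of a single user would return.
sample = {
    "schemas": [SCIM_USER_SCHEMA],
    "id": "2819c223",
    "userName": "kostas",
    "emails": [
        {"value": "k@personal.example", "type": "home"},
        {"value": "kostas@example.com", "type": "work", "primary": True},
    ],
    "active": True,
    "groups": [{"value": "g-1", "display": "data-eng"}],
}

record = flatten_scim_user(sample)
print(record["email"])   # kostas@example.com
print(record["groups"])  # ['data-eng']
```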
Okay.
And then when it comes to...
Okay, these are the policies, right?
And how we can...
Let's say the formal part and where we define things,
how they should ideally be, right?
And you have to start tracking what's going on in these systems.
So I guess there you have different types of data
that you need to collect, probably logs or I don't know.
So what's there?
What's the behavioral part of the identity that you are tracking?
How it looks like and how do you collect that?
Yeah, yeah.
So there's three types of data: identity, access, and activity. And so identity, again, there's human and machine. And so that can come from any identity provider. Access data might also be coming from a resource itself.
Because any OLAP or OLTP database might have
permissions embedded in there.
Right.
And so you could get it from there.
And then activity data, that's just a fancy word for logs.
So in the security space, there's SIEMs, S-I-E-M. And so
that collects everything, or there's other flavors of SIEM like XDR, EDR, extended detection and
response, etc. And so basically, those are, you know, like Elasticsearch-esque looking things.
And so the same patterns, right? You're ingesting from API, REST API, the schema, schema is just a schema, right?
It might have different schema or envelopes.
You're ingesting directly from data stores or data sources like an OLAP
database or an event queue or something like that.
And then you're also ingesting from logs or search indices.
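Editor's sketch: the three kinds of data described above (identity, access, activity) can be modeled as three small record types. All field names here are assumptions, and the `unused_grants` question is just one illustrative way of joining them; it is not something discussed in the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Identity:
    id: str
    kind: str    # "human" or "machine"
    source: str  # identity provider the record came from


@dataclass(frozen=True)
class AccessGrant:
    identity_id: str
    resource: str    # e.g. a database or a table
    permission: str  # e.g. "read", "admin"
    source: str      # identity provider or the resource itself


@dataclass(frozen=True)
class ActivityEvent:
    identity_id: str
    resource: str
    action: str
    at: datetime     # taken from a log line or SIEM record


def unused_grants(grants, events):
    """Grants with no matching activity on the same resource."""
    used = {(e.identity_id, e.resource) for e in events}
    return [g for g in grants if (g.identity_id, g.resource) not in used]


grants = [
    AccessGrant("kostas", "warehouse.orders", "read", "warehouse"),
    AccessGrant("kostas", "warehouse.payroll", "admin", "warehouse"),
]
events = [
    ActivityEvent("kostas", "warehouse.orders", "SELECT",
                  datetime(2023, 2, 1, tzinfo=timezone.utc)),
]

for grant in unused_grants(grants, events):
    print(grant.resource, grant.permission)  # warehouse.payroll admin
```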
Yeah.
It sounds like a lot of data.
Is it a lot of data?
Probably. I would say it depends on the size of the company, but I would
say hopefully the number of groups... people would use a lot of role-based.
So hopefully those aren't too large, but we've seen them to be
pretty large from our customers. Yeah, like there's twice as many admin roles or groups than employees, and
that's not a good idea. But so in terms of the number of items, it's not that much.
But if you want to listen for changes on those,
that could be a bit more. The identities and access, that changes a bit more frequently
than the sheer number of it.
But then when you add in the activity data, that is the long tail.
Yeah.
Yeah.
That's what triggered this reaction for me, because we're talking about
like logs, logs can be verbose, right?
Yeah.
There's a lot of data there, and there's a lot of processing that needs to happen, because they are semi-structured data.
It's not necessarily like a JSON.
Yeah.
Logs are so funny.
It's like, how do I say this? It's fairly not valuable because it's raw, and
the logs might be holding a lot of sources of data that may look
different. And yet it's still so valuable at the same time if you're able to structure them and extract the right insight
that you need, because it's kind of like you don't know what you don't know, you know? So
in security, you kind of want to know as much as you can, obviously depending on your
risk tolerance.
Yeah. So, okay, from all this different data, what kind of modeling do you do on top of that?
Because somehow you need to connect all these things. You have different serializations.
It's more low-level stuff that's just so different. You have, as we said, semi-structured logs,
and then you have identities that are records like records on a database, right?
It's the opposite.
So how do you deal with that variety of data that needs to be homogenized somehow?
Yeah, pretty standard way.
You ingest the raw data and then you TTL it if you need.
And then you have async systems that are able to process and reprocess the data to normalize it and do some sensible representation.
Then one of the data sets that spits out is the
resolved identity.
And so that's just a single
data set. And then that's stored somewhere, and that's it. It's pretty
standard here, I think. In terms of serialization, on the ingest, lots of it is through
REST. There might be different envelopes on it. We're able to handle that.
And then on the egress, right now, we haven't done it, but we're looking to use...
The whole idea is to not build this walled garden.
We want to give control to our customers.
And so you can bring your own tooling, bring your own database, bring your own BI tool.
And the reason is because like, this data should be accessible by not only security engineers, but IT admins, or maybe data engineering with security focus as well.
And so why would we want to build a tool that you aren't using today? There's
already amazing tooling out there. And so we want to use these specific table formats and
query engines that are available, and you can just plug them in. We'll host the data for you.
If you don't like that, then we can do things like bring your own encryption keys, or you can host the data yourself if you dare.
And then yeah, so interoperability is pretty huge for us.
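Editor's sketch: the pattern Jeff outlines (ingest raw data, expire it with a TTL, then normalize asynchronously across differing REST envelopes) might look roughly like this. The envelope keys other than SCIM's `Resources`, the seven-day TTL, and the output fields are assumptions, and the async machinery is reduced to a plain function that can be re-run over the raw store.

```python
from typing import Any, Optional

RAW_TTL_SECONDS = 7 * 24 * 3600  # hypothetical retention for raw payloads


def unwrap(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Pull the records out of a few common response envelopes."""
    for key in ("data", "items", "Resources"):
        if isinstance(payload.get(key), list):
            return payload[key]
    return [payload]  # no envelope: the payload is the record


def normalize(source: str, record: dict[str, Any]) -> dict[str, Optional[str]]:
    """Map source-specific field names onto one representation."""
    email = record.get("email") or record.get("userName") or record.get("login")
    external_id = record.get("id")
    return {
        "source": source,
        "external_id": None if external_id is None else str(external_id),
        "email": email.lower() if email else None,
    }


def process(raw_store: list[dict[str, Any]], now: float) -> list[dict[str, Optional[str]]]:
    """Normalize every raw item that has not outlived its TTL.

    Safe to re-run: it only reads raw_store and builds a fresh output."""
    out = []
    for item in raw_store:
        if now - item["ingested_at"] > RAW_TTL_SECONDS:
            continue  # expired raw data is skipped
        for record in unwrap(item["payload"]):
            out.append(normalize(item["source"], record))
    return out


raw_store = [
    {"source": "idp", "ingested_at": 1_000_000.0,
     "payload": {"Resources": [{"id": "u1", "userName": "Kostas@Example.com"}]}},
    {"source": "saas", "ingested_at": 1_000_000.0,
     "payload": {"data": [{"id": 42, "email": "eric@example.com"}]}},
    {"source": "old", "ingested_at": 0.0,
     "payload": {"id": "stale", "email": "gone@example.com"}},
]

for row in process(raw_store, now=1_000_100.0):
    print(row)
```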
Makes sense.
And can you give us an example of like the first, let's say insights that
someone can get from these homogenized and processed data sets that you create?
It would be
nice if it's something that
someone was working in that space
before using
something like Abbey. It would be hard
to get this.
Yeah, some simple questions
once you're connected.
It's just like,
how many admins do I have to which systems in my company?
And so that's the first question.
The more interesting question on top of that
is transitive access.
How did Kostas get access to this RDS instance,
this table within an RDS instance?
And he got access because he's part of this
group, which is part of a group before that.
And Eric had added Kostas to that group.
And that's how he has access.
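Editor's sketch: the transitive-access question ("how did Kostas get access to this table?") is a path search over a membership graph. A minimal breadth-first version, with made-up group and resource names, might look like this. It does not model who added whom, which would need metadata on each edge.

```python
from collections import deque
from typing import Optional

# member -> groups it belongs to, or resources it grants (hypothetical data)
EDGES = {
    "kostas": ["group:data-eng"],
    "group:data-eng": ["group:analytics"],
    "group:analytics": ["rds:orders-table"],
    "eric": ["group:admins"],
    "group:admins": ["rds:orders-table"],
}


def access_path(edges: dict[str, list[str]], start: str, target: str) -> Optional[list[str]]:
    """Breadth-first search; returns the shortest chain from start to target,
    or None if there is no path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:  # guards against cyclic group nesting
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = access_path(EDGES, "kostas", "rds:orders-table")
print(" -> ".join(path))
# kostas -> group:data-eng -> group:analytics -> rds:orders-table
```

The simpler first question, how many admins exist per system, is a count over the same data once memberships have been expanded this way.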
And then the third thing is really around, so aside from analytics,
we use the same thing to just run a continuous query, so then
you can basically throw an alert. So, I know how many admins I have now; alert me on Slack
if that goes beyond that. We have 10 today, hopefully no more than 10, so alert me on
that. So that's the beginnings of building automation.
And one last question from me, and then
I'll give the microphone back to Eric.
From your experience
so far with the customers and the users that we
are talking with, what are the
first and most
let's say, obvious systems
that they bring
in and they try to get insights
from? Because, okay,
from what I understand,
when we're talking about like identity,
it's everything, right?
It can be a SaaS application,
it can be like your cloud infrastructure,
it can be your database systems,
it can be like, I don't know,
like pretty much like everything.
So what's like the most common and the first, let's say, use case
that you see there in terms of like infrastructure
that they're struggling today
to have like a good monitoring of identity on it.
Yeah, I'll frame this in terms of user persona.
So the first one is I'm a head of security or that's responsible for IT and I just joined the company.
WTF, what's going on?
I need to have some insight into who has access to what.
That's number one. Number two is we have an audit coming up,
and I need to understand who has access to what
so then I can do any remediations.
And number three is, oh no, we've been breached.
I want to understand.
That's a bit bad because it's more time-bound,
but I want to understand what the blast radius is.
And so really it's about like,
number one is understanding the state of access,
but then ultimately that honestly matters a lot less
compared to actually doing the thing that comes after.
Makes sense.
All right, Eric, all yours.
I'm sure you have also more questions.
Well, this is so interesting.
Kostas read my mind here, which makes sense because we've been doing this for a couple
years.
And of course, I come from the world of marketing where we talk a lot about identity resolution.
And going into this conversation, part of me thought, okay, an organization in some
ways is a closed system, right? When I think about
marketing, there are all these external touch points that I have zero visibility into, right?
And I can only understand them in many ways via proxies of the way that people come into
an interaction with my company and then sort of go through and all that.
But if you think about inside of a company, you know, even though there is a lot, there can be a lot of ambiguity,
at least you kind of have, you know, somewhat of a closed system, right?
But the more you talk about it, the more I thought, I mean, you could really
just change out some terminology and be talking about identity resolution in general.
Do you agree with that? And are there
things to learn from customer identity resolution in the way that you solve that problem inside of
an organization?
Yeah, I think certainly the techniques will be the same. There will always
be the nuances. But yeah, that's where we draw inspiration from. It's the identity resolution.
A lot of people have done a lot of work that came before this, so it's nothing new in that
regard.
Within the context of an organization, that is true, but I would just be careful in thinking
of this organization as a very static, walled thing. Organizations are by themselves amorphous
in many ways. You know, how many reorgs have you been in in the past year at a large company?
How many contractors come in and out, services that are built and torn down? How many employees
join or leave the company? So it's really tricky, because there's fragmentation,
and change is really the only constant,
to throw that cliche out there.
Yeah, yeah.
No, that is actually very interesting to think about
because when I think about it from a marketing perspective,
there are all sorts of entry points
and then certain pathways that you can go through,
but there actually aren't that many paths through
systems, interestingly enough, which is much, much more complicated inside of an organization,
right? Because you have, you know, an individual identity traversing hundreds of systems, right?
Whereas with a customer, I mean, they may be in lots of systems, but their journey generally,
you know, follows like a fairly defined path.
Yeah.
And that would even be the better case too, because a lot of times there's not even
an individual identity, like through an identity provider. A company might
have done some number of M&As in the past year, and those companies each brought their
own identity provider.
And yet you're still under the same ticker symbol.
Yeah, yeah.
And then, so you've gotten a lot of that.
Yeah, it is.
It does sound so funny thinking about concepts like fingerprinting your
own employees.
I mean, not even in a Big Brother way, but from a security standpoint, that's just an interesting
concept.
Yeah, I hear you. That's crossed my mind as well.
Yeah. Okay. Well, we're at time here, but I do have one more question for you. So
you have built amazing technologies that have come out of some of the most interesting,
awesome companies in the world, but now you're building a new company.
And one thing that I like about going through that experience is you get a chance to explore
a lot of different things that maybe you are more limited in just because of the scope of your role
or the project or something at a larger company. In building Abbey, have you run across any interesting new or old or different technologies
that have been intriguing to you?
Yeah, you know, I'm an active follower of the streaming space.
That's my bread and butter, so eagerly awaiting the developer experience to get better there.
It's funny, I just listened to the shop talk with you and Kostas about the real time versus streaming debate.
Got a lot of opinions there, but let's do a follow up shop talk and have you on.
Yeah, yeah, it's great.
I'm really glad that the title was explicitly real time versus streaming because those are not necessarily mutually exclusive.
Right. So thanks for that.
But that aside, some streaming stuff,
because we do have some streaming component on our side and I don't want to build yet another thing.
I don't have time for that.
The other thing is we're thinking about a lot of graphs.
And so thinking about graph data stores
or graph relational stores,
and then also more standardization around some
security metadata protocols like SCIM, I would say, or other things like that.
Also, permission stores are very interesting to me. There are a number of players out there as well.
And so everything on the control side. We're not so much thinking about enforcement, which is a different approach and a different set of problems and technologies.
And so, yeah, anything around control, that's very interesting to me right now.
Very cool.
Well, I definitely want to, Brooks, let's make sure to get Jeff back on for a follow
on shop talk on streaming because we would love a hot take.
And Jeff,
thanks again for joining us.
And Abbey sounds awesome and best of luck with it.
Yeah.
Thanks.
Thanks again for having me.
And it's good to see you all again.
Okay.
What I loved about that, Kostas, was I think a couple of times Jeff said,
I don't want to build another like streaming ingestion service,
which I loved on a couple of levels
because obviously if he's done that
at companies like Netflix and Stripe,
he's sort of seen a lot of angles of that problem
and solved it at a scale that many of us will never see.
So I just loved it.
It was in some ways like a humble way
to acknowledge that he has solved a lot of
those problems. And how cool is it that he's at a point where he's like, it's not intellectually
stimulating for me to continue to focus on that problem area. I was like, man, what a place to be.
That was really cool. And then also, obviously, at the very end of the show, I was just super
intrigued about the parallels in identity resolution from my background
in marketing and
how similar that is
to the problem they're solving
inside of a company. Now, obviously the security
concerns are certainly
very different, but that is really
fascinating, and I'm sure that I'll be thinking
a lot about that this week.
Yeah, yeah, 100%.
For me, it's always fascinating.
When it's like...
I'll say that.
One of the greatest things
around software engineering
and computer science
and this whole industry
is abstraction, right?
And it's very interesting to see
how the same abstractions apply to different problems and how you can implement, let's say, similar patterns to solve problems in very different areas, like from security to marketing.
But at their core are the same.
That's always something that I find super fascinating. It's one of the reasons that I love the things that I'm doing and why I work in this space,
and why I like computers and all that stuff.
So this is exactly one of these cases.
Of course, the implications of the solution and the
problem itself are very different when we are talking about security
or we're talking about marketing or something else, but that's
what makes it interesting, right?
You can build something, and that's what you see with people like Jeff.
You have someone who builds data infrastructure, and now
he can take all this experience
and knowledge and apply it in a different domain.
Yeah.
That's beautiful.
Yeah, I agree.
I love it.
And I definitely want to get him on a shop talk.
I think that would be awesome.
Yeah, let's do that.
Absolutely.
All right.
Well, thank you for listening.
Subscribe if you haven't.
Tell a friend.
Jeff is a subscriber.
So if you want to be like Jeff, subscribe to the show.
And we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com. Thank you.