The Data Stack Show - 11: Why Modern Cyber Security is a Data Problem with Jack Naglieri of Panther Labs
Episode Date: October 21, 2020

This week’s episode of The Data Stack Show features a conversation with hosts Kostas Pardalis and Eric Dodds and guest Jack Naglieri, founder and CEO of Panther Labs. Panther, a San Francisco based startup, is an open platform that helps security teams detect and respond to breaches in cloud-native environments, providing a modern alternative to traditional SIEMs.

Highlights from this week’s episode include:
Introduction to Jack and Panther Labs (2:33)
The different pillars of data security (10:24)
Onboarding process for a company using Panther (18:40)
Thinking of security as a data problem (24:55)
Using S3 and other infrastructure suggestions that will be helpful in the long run (32:16)
Use cases for analyzing past and real-time data (39:20)
Panther’s data stack (42:54)
Open source technology being helpful for the community (47:57)
The future for Panther (54:39)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome back to the Data Stack Show.
Today we are talking with Jack from Panther,
and they are a tool that helps security teams
automate the collection and analysis of security data.
And I'm really interested in this conversation.
Security is one of those subjects inside of companies that is kind of funny.
Everyone should and does care about it.
But, you know, it's one of those things that it can kind of be a pain to deal with.
So I'm interested to see what Panther's building and how they fit into the
stack. Kostas, from a technical perspective, what do you want to ask Jack?
Yeah, I'm very interested in this episode today for a couple of different reasons. One is, of course, the technical side
of things, and learning more about the amazing technology that they have built with Panther.
But also because I'm lucky enough, I mean, I met Jack about a year and a half ago
or something for the first time.
I mean, it was like a couple of months
after they started Panther.
And I had the pleasure to see how quickly they grew
and what the kind of team they managed to build there
and the product that they have today.
And it's amazing what kind of progress
they have managed to do,
which of course has to do a lot with the people involved,
but also with the kind of product and the problem that they are solving.
It's going to be extremely interesting to discuss today with Jack,
learn more about security.
Security is one of these things that everyone has heard about it,
but I don't think that many people really know what
security involves in companies today. And it's going to be extremely interesting to hear about
how security is performed today and what kind of problems there are there and how Panther is
actually addressing that and what kind of technologies they are using. So really looking
forward to it. Let's see what Jack has to say about it.
Sounds good.
Let's jump in.
Hi, Jack, and welcome to the Data Stack Show.
It's very nice to have you here today.
How are you?
I'm good.
I'm good.
How are you doing?
Very good.
We also have Eric here with us.
Hi, Eric.
Hello.
Good to be here after missing out on a couple episodes.
Yeah, it's great to have you back.
So, Jack, would you like to start with giving us a bit of a background about yourself and also Panther Labs?
Yeah, of course, of course.
So, I'm Jack Naglieri. I'm the founder and CEO of Panther Labs.
It's a cybersecurity company founded in 2018 based out of San Francisco, California. And what we do is we build software
to help security teams detect and prevent security breaches. My background is in security
engineering. I was an analyst. I've done forensics. I've done a little bit of everything,
some application security. And probably the last five years has just been spent all on detection.
So this involves taking a huge amount of security data that could live really anywhere in a company's environment,
putting it all into a single place
and then analyzing it programmatically,
looking for like quote unquote security threats
or some type of suspicious behaviors
that could lead to a breach.
That's great.
So can you share a little bit more information
about how's the team right now?
I mean, you said that you started here in San Francisco.
I know that you are growing very rapidly.
So can you tell a little bit more
about the people involved?
How is the team right now?
If you're hiring and all that stuff.
Yeah, yeah.
I guess I'll give you some more info on the origin of the company.
So we started back in 2018.
That same year, I was actually working at Airbnb as a security engineer.
So I had joined Airbnb in 2016.
And my whole task was really to just build the detection out with the other members of
the security team and really focus
it around Amazon and how do we protect Airbnb's production environment from attacks. But more
importantly, how do we get visibility and detect when really anything bad starts to happen?
So I had an open source project called StreamAlert and I was the lead engineer on that for several
years. We built it internally starting in 2016. I open sourced it at a conference called Enigma in 2017, in January. And then from there on, it was just continuing to develop the community, develop the project. We hired more engineers, and then I became a manager, and then, you know, had the idea really around
that same time of becoming a manager that, you know, this could be something that we just build
into a company. And there were a lot of problems that, you know, I wanted to continue to solve,
just working on it full-time, and Panther was really the chance to do that. So in 2018 I left to start the
company and, you know, we started with just a really small agile team of three people. And then later on, we had some great inbound candidates, like some who came from Amazon. So that ended up being great. And actually, a pattern that you'll see really often is a lot of security teams end up, a lot of advanced security teams end up building some
version of what Panther is internally. But the problem with that obviously is you have to
maintain it. It also takes a huge amount of engineering effort to really get a system like
this running and sustainable over time. So they had amazing experience. We're really happy
to have them in the team. And then, you know, since then, obviously we've grown considerably.
We're over, you know, we're over 20 people now. We just raised our Series A. It was a $15 million
Series A led by Lightspeed Venture Partners. And, you know, we've definitely come a long way since
the first three, just, you know, back in 2018. So it's been a really
awesome journey so far. Hey, Jack, interested to know, you know, you said you were doing
security work at Airbnb in 2016, you know, which they've been around, obviously longer than that.
But I'd be interested to know, just from your experience there, and really your experience
interacting with your customers at Panther, at what point of sort of scale and or complexity
does the security problem become material in the way that Panther solves it?
In terms, you know, I mean, there are lots of startups out there, right?
But in the early stages, of course, they're concerned about security, but not in a way that they
need sort of dedicated infrastructure, because their stack is probably incredibly simple. But
is there an inflection point? Or how do you think about that at Panther or just in general having
experience?
That's a really good question. I think when I generally answer this question, it comes down to
what's the industry that the company's in. So, like, highly regulated industries like finance and,
you know, anything else that has that emphasis on the data that you're collecting,
they probably need a program like this earlier because they need to have confidence that, you
know, the sensitive data that they're collecting is secure and it's not being accessed incorrectly or by some potential unwanted party.
So I think for them, they want a program like this much earlier.
For other teams, you generally start to see it as they start to scale.
So all in their growth stage as a company, or maybe to the point where they hire a CISO,
maybe it's organizational. Actually, that's probably the most common. We see that a company
has grown to a size that they hired a CISO. And then they start to bring in infrastructure people
who are like, okay, let's kind of just take our inventory of where we're at. And let's kind of
start to check these really table stakes boxes
off of our list.
Do we know what's happening in our production environment?
Do we know when people are making changes to our AWS account?
Do we know when people are logging in as admins?
Do we know when those admins are getting assigned?
It's really just going from zero to one.
And a product like Panther really helps them from the get-go to be
successful in that, because at the end of the day, you know, the company is going to continue
to grow and you need to be able to scale with that. And you need to be able to scale in a way that,
you know, isn't going to break the bank, which is the problem today, and also just operationally
isn't going to really overload the team, which is another problem today.
Yeah, I was going to say, you know,
security is one of those things where it can be a little bit funny to talk about it because even sort of bringing up the question around
the importance of security is weird because, of course, it's important, right?
You're dealing with customer data,
and you have to have security measures in there. But it seems like it's something that companies would want to start doing as early as possible,
especially if they don't have to build it internally, which seems just like a huge limiter.
Yeah, the thing with security, though, is you really need dedicated people.
It's not one of those things where you can just kind of buy a SaaS product and kind of call it a day, even though a lot of
executives would want that. It's just unfortunately not the reality today, and I don't know if it ever
will be. You know, it's really a thing that you have to look at constantly,
continually tune, and, you know, give a lot of support.
So Jack, quick question, because, uh, that's like
something that I also personally, uh, struggle a little bit to understand with security. Because,
okay, I'm aware of the complexity of it, but I'm not really into it in terms of
how it's usually implemented in companies. And based on also what you said about always having the need to have people there, can
you give us a quick, let's say, introduction of what security at the end is, how it is
materialized inside the company, and what are the basic components that make up the
security infrastructure of a company today?
Yeah, sure.
So, I mean, there's a lot of different pillars of security, right?
So you have application security,
which is looking at the application of your production-level product, right?
So a company like Airbnb, the application security team,
is looking at the changes that they're making to the Ruby on Rails app,
and I'm sure they've probably migrated to something more
advanced now, but they're really
responsible for that interaction between
the customer and our main
production application. So they handle
that layer, right? Then there
will sometimes be another team
that's dedicated to data security, where they're
just looking at the access to the
production data, which could have PII, it could have
financial data, it could be really whatever the company is storing. So making sure that it's
encrypted, making sure that the access is secure and there's enough auditing and accountability
there. And then there's generally like an incident response team. And that's more of the, this is
more of the area that I've worked in historically. So incident response
has a lot of different components as well. But incident response is basically, we want to be
able to detect breaches and respond to them. And responding to a breach is, we got an alert for
something bad happening. We need to go in and understand exactly what happened. We need to
contain the incident. We need to do a report. And then we need to put those controls back into our
environment to prevent that from happening again. So it really also depends on the organization and it really depends on the company
and what you're trying to secure. But I would say application security, data security team,
you'll also really often see a compliance team and their whole thing is just making sure that the
company is accomplishing whatever compliance frameworks they need to get in order to either
do certain types of business or go public, things like that. So there's a whole team dedicated to
that very normally. And then on IR like this, the teams that I'd worked in historically,
I was in an IR team at Yahoo and Airbnb and also at VeriSign. There's a lot of components there.
So you could have a forensics team that their whole job is just to take images off of systems and analyze what happened. This is really common if you have malware that
is like exfilling data and you need to understand what exactly the attackers did. There's a whole
team dedicated there. And then there's also infrastructure security, which is highly
related to IR. And this is the idea of collecting all this telemetry and doing this core detection.
And this is always the area that I've been in.
So they're responsible for deploying tools
to the production environment
to collect this really helpful telemetry,
like who's logging into our production systems,
or now it's more abstracted to like Kubernetes, right?
Like who's logging in and making API calls
to our Kubernetes or making API calls
to our GCP or AWS and making these changes.
And it's really like understanding the state
of that infrastructure, but also the activity within it.
And that's exactly what Panther is designed to collect.
So Panther is a platform that collects all that data,
puts it in a format that is very structured
and gives analysts and engineers an ability
to write Python and do analysis on the data
as it streams in.
And that's really like what we feel is the best
way to do that function today.
And we've seen
that a lot of teams, as they've
moved to the cloud, the amount
of data has increased so much that they
need a new platform. And that's really what
Panther's designed for.
That's kind of the lay of the land in
security teams. And I may have forgotten a couple other specialized teams, but I think those are
the main core ones that you would see in most companies, and obviously a CISO above them, kind
of managing and directing everything.
Jack, it's so interesting to hear you talk about sort of the,
like, the two separate disciplines of collection and detection. And this may not be true in Panther's
world, but just sort of thinking about a similar paradigm with the companies we work with at
RudderStack, people try to build sort of our, you know, customer behavior event streaming infrastructure
internally as well. And one of the challenges that we see is that the collection piece is,
is phenomenally difficult. And so most companies don't actually,
they spend so much time if they're trying to do it internally,
just getting the collection done well,
that they don't actually end up spending any time analyzing the data. Right.
So like the most valuable piece of the puzzle is ignored.
Is there a similar paradigm in Panther's world of security?
Like is the collection piece pretty difficult?
And so detection sort of gets neglected if people are trying to do this internally?
Good question.
So I think that collection and detection kind of go hand in hand today. They don't always,
but I think it also depends on the data source that you're collecting.
So there's certain data sources that are really well understood. Then there are others that are
fairly new. So the well understood ones are the ones that have just been around for longer. So
you have things like event logs off of systems, right? That historically, that's just been around
for so long. And we have so much signal to determine what a breach looks like there. But then as we transition to cloud, I think it kind of just goes with
how well do we understand the data source.
In terms of the complexity,
I think everything in Amazon is fairly easy to collect.
They've done a good job of centralizing into S3,
and that's actually one of the primary ways
that we recommend our customers getting data into Panther
because S3 is so highly scalable, it's very reliable,
and it's probably the most cost-effective way at scale.
And this is actually how they did it at Amazon,
so I trust my engineers
who spent over five years doing this at Amazon
to give us that sort of intuition.
Sure.
I think the challenge
actually comes when
you try to collect data that is
very unique
to the organization, so something that
is an internal application, for
example, they generally have their own
formats. And actually even collecting
syslog data can be really, really
difficult because you can arbitrarily format it.
So it makes it really hard for us to
say like, oh yeah, send your syslog data
because it could be in any
format possible. So that's also
a challenge sometimes.
Interesting, yeah.
I think overall, the huge advantage of just cloud and SaaS in general,
this trend that the whole industry is moving towards,
makes our job much easier because it's very highly structured and it's really predictable.
And then it's very different from on-prem where,
like I was just saying, syslog data can be in any format possible
and your internal application logs can be in any format possible.
But as we move more to the cloud,
there's just so much more predictability.
So it makes our job easier.
And then what we can do,
and this is actually what exists in Panther today,
but we have detections that ship with Panther
for a lot of these common log sources
that we understand really well
and the community understands really well.
And there's a lot of basic checks that everyone
should be doing like send me
an alert if an
admin gets created or something.
There's a lot of this really helpful signal
that could help find
privilege escalation or exfiltration.
There's these common patterns across everything
that we can identify. So we ship with
some base rules that look for all
of those basic things.
And then the teams can go through and write their own
based on their own internal logic.
Because also another thing that we haven't really discussed
is that every company is completely different.
So they have different threat models.
They have different infrastructure.
And as a result of that,
you need a system that can be highly customized
to be able to work for their own business logic.
And that's a huge value prop that you get with Panther
because you can define all these rules in Python,
which is highly accessible.
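
As a minimal sketch of the kind of Python rule described here (alert when an admin gets created), the following checks a CloudTrail-style event for an IAM AttachUserPolicy call that attaches the AWS-managed AdministratorAccess policy. The rule(event) and title(event) function shapes are assumptions made for illustration, not something confirmed in this conversation, and a real deployment would likely cover more ways of granting admin rights.

```python
# Illustrative sketch only. The rule()/title() shape is an assumption,
# not an API described in the episode.
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def rule(event: dict) -> bool:
    """Return True when a CloudTrail-style event grants admin rights to a user."""
    if event.get("eventName") != "AttachUserPolicy":
        return False
    params = event.get("requestParameters") or {}
    return params.get("policyArn") == ADMIN_POLICY_ARN


def title(event: dict) -> str:
    """Build a human-readable alert title from the same event."""
    actor = (event.get("userIdentity") or {}).get("arn", "unknown actor")
    target = (event.get("requestParameters") or {}).get("userName", "unknown user")
    return f"{actor} granted administrator access to {target}"
```

Because the rule is an ordinary function over a dictionary, a team could add its own business logic (for example, an allow-list of accounts that are expected to create admins) with a few extra lines.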
So Jack, can you give us a very quick description
of a typical onboarding or setup process for Panther?
Let's say a company comes today and they want to evaluate the product
and see how it can
fit in their
security processes
that they have. So usually how does this happen
with Panther? What's the typical
process that someone has to go through
to set it up and start
seeing the value out of it?
Yeah, that's a great question.
Like I was saying a second ago,
the main way that we recommend people to get their logs into Panther is to get it into S3 somehow. And it's not just limited to S3. It could be SQS or SNS or EventBridge or something similar. But I think the core thing is that it gets into Amazon somehow and then we can pick it up. So generally what you'll see is for a new team who's just really starting
from scratch, they want to get their CloudTrail data
first. That's usually one of the
what do I explain?
So it's probably the data source that
covers the most ground
and it's probably the most valuable one
they can do. It basically has a really high
return on their investment.
CloudTrail natively sends to S3, that's awesome.
What you do is you basically say,
you go into the Panther console and
you add a new data source, and
then you basically give your Panther installation
the ability to pull from that bucket.
And we do all this with
configuration as code. So we use CloudFormation
and Terraform. You can choose whichever to
run depending on your own internal workflows.
And then we create some IAM roles and
we set some things up so that when new data gets written to that bucket, Panther gets those notifications,
pulls it, and then parses it, normalizes, does all the detections, and then it puts it into a
data lake. And then also it will send alerts if anything happens that we find. So that's more on
the S3 side. And that covers things like VPC Flow Logs, like CloudTrail; GuardDuty can write to S3.
There's a lot of internal Amazon services
that can write to S3.
So that covers your groundwork, right?
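
The flow described above (get notified of a new object, pull it, parse, normalize, run detections, store, alert) can be sketched in plain Python. This is a hedged illustration, not Panther's code: the function names, the normalized field names, and the in-memory lists standing in for the data lake and the alert destination are all hypothetical, and the input is assumed to be newline-delimited JSON.

```python
import json
from typing import Callable, Dict, Iterable, List

Rule = Callable[[dict], bool]


def parse(raw_lines: Iterable[str]) -> List[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in raw_lines if line.strip()]


def normalize(event: dict) -> dict:
    """Copy a few CloudTrail-style fields into common names for the rules."""
    return {
        "action": event.get("eventName"),
        "actor": (event.get("userIdentity") or {}).get("arn"),
        "source_ip": event.get("sourceIPAddress"),
        "raw": event,
    }


def process_object(raw_lines: Iterable[str], rules: Dict[str, Rule],
                   data_lake: List[dict], alerts: List[dict]) -> None:
    """Run every rule over every event, then retain the event either way."""
    for event in map(normalize, parse(raw_lines)):
        for name, rule in rules.items():
            if rule(event):
                alerts.append({"rule": name, "actor": event["actor"]})
        data_lake.append(event)  # stored whether or not a rule matched
```

Keeping every event, not only the ones that matched a rule, is what makes later historical searches possible.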
And then you have SaaS.
So in our enterprise version of Panther,
we can pull down the SaaS logs periodically,
probably every minute I think we pull,
or maybe even sooner than that, depending on the source.
And we hit all these different APIs and pull all the data into Panther that way. So it could be your G Suite,
your Box, your Okta, your OneLogin, and the list goes on. And we're continually adding more and
more to this based off of the feedback that we're getting from customers and really what the most
highly valuable data sources that are needed. So that's sort of the second piece. And honestly, between both of those,
you cover most use cases. And then if you want to pull in your on-prem data or your data from
your laptops, all you really need to do is use a logging framework like Fluentd or Logstash or
anything that can write again out to S3. And then you get the ability to pull that into Panther too.
And actually a really exciting feature we just shipped
is the ability to define custom schemas.
So what you can do is you can have a YAML file
within the Panther UI that says,
this is the structure of my custom data,
like my internal application logs
or some internal tool that I wrote
that gives us really helpful telemetry.
You put that in and then you basically say,
this S3 bucket has
these logs, and then we classify it, uh, we put it into the data lake in a very structured format,
and then, you know, we can analyze it as normal. So that's a feature that just came recently
into Panther.
Yeah, the whole data aspect of Panther, I think, is also so interesting that,
you know, I would love to go deeper.
Yeah. We are definitely going to discuss more about this
because the whole feeling that I get all this time
that we are talking about this is that
actually security is turning into a data problem, actually.
I hear many, many terms that are usually used
like in other data-related products,
like schemas, like how you format the data
and all that stuff, and also like processing the data.
So we will get on that.
Before we go there, just a quick question.
If I understand correctly, Panther right now,
it's mainly working on AWS, is this right?
Yeah, so the infrastructure itself runs on AWS, that's correct.
But we also host it as a SaaS,
so it's kind of a separate way from most people.
Do you see the need there
to also support other cloud infrastructures
for whatever compliance reasons or whatever?
Because I understand moving the data around,
it's part of security and all that stuff.
And do you plan to do that?
Is that the plan for Panther,
to give the option to your customers
to store the data in other cloud infrastructures?
The goal really is around being able to run components of Panther in other clouds. I don't
know if we'll ever fully run Panther in other clouds just because of the complexity of
translating serverless into other clouds. That could be a little difficult. I think today
what I want to get to eventually
is the ability to say, okay, I have data
in Azure, I have data in GCP,
I have data in Amazon. We can
deploy Panther and we can have a single
instance of Panther that takes all that
data in. So maybe
it stores some of it in some of the other clouds
and maybe it puts it all into one.
It's really interesting. I think the multi-cloud strategy
is something that hasn't even really been hashed out yet.
I think there's a lot of different approaches.
So we're still figuring that out,
but I definitely have the aspiration
to support the other clouds.
Yeah, that's interesting.
That's one of the reasons that I was asking you about this
because I see all this kind of trend
around the multi-cloud deployments and all that stuff. So it's interesting to see how
this also affects security in general, but also like how
security related tools like have to operate in this environment.
All right, that's great. So let's go back like to the
security problem and how it relates like to data in general.
So yeah, I mean, it looks like more and more anything that has to do with security,
which for me as an outsider, to be honest, security was always something related to compliance,
like many rules, like things that we need to follow and all that stuff.
That's what I traditionally had in my mind around security, to be honest. Although I'm pretty sure that's not accurate, but I'm also sure that I'm not the only
person who thinks like this about security. But as you describe the product and in general,
also how security works, it sounds like at the end, it turns into a data problem.
Can you say a little bit more about that? How do you think that if this is true, first of all,
and why it is true and why we need to approach it as a data problem
in order to succeed with security?
Yeah, it's absolutely a data problem.
It's been a data problem for years.
I've been trying to solve this for over five, six years.
And we always felt that like the, you know,
the ideal solution for us is that all the data goes into some big data
warehouse, right? This is how BI teams have been doing it for years.
You know, they, they collect their production level data.
They put it into something like Hadoop or, you know,
they put it into some really scalable data warehouse and then they search over it that way.
They have workflows, tools like Airflow, things like that.
And really that was the North Star
that I always wanted to see in security.
What you see in reality is someone deploying
into Splunk or Elasticsearch
that has a very different way of handling the data.
And it's not to say that they're bad.
It's just to say that at the certain scale that we've hit,
they just become ineffective.
And what you really need is that structured,
scalable data warehouse, or data lake,
as is now more widely accepted, to do our security job.
Because in security, it's very common not to detect something
for maybe three to six months.
Maybe you get a letter from the FBI one day or an email saying, hey, you got breached.
And we're like, oh, okay.
We need to go back and look eight months ago.
But we really have 90 days of hot storage in our Splunk.
So we're kind of out of luck there.
So the scale of the cloud has really restricted a lot of people
from even just getting the most basic
monitoring done, just because the scale's so high. You know, we've heard of some people who want to
collect, you know, 50 terabytes of data per day or 100 terabytes of data per day. And, you know,
that's an astronomical scale that you just cannot do with Splunk and Elasticsearch, you know, unless
you're going to be willing to spend millions of dollars and have
a huge team of individuals.
But the beauty of
Panther is that because we're using
so much serverless and because Amazon
has built so many of these great tools,
we can take advantage of them.
And we get the byproduct of
big data right there.
Just for free, basically.
The only challenge is you have to know how
to set it up and configure it right. So to get into a data warehouse, you need to structure the data.
That really is the core problem, right? So I built this a long time ago, even in StreamAlert:
it was the ability to classify the data, guess its schema, and then force it into a
schema that you defined, and then you use that to put it into
a segmented part of your data lake in S3
in that exact format.
And then you have a table that you can use
to do schema on read,
and then you can actually read the data
and you can do huge searches over terabytes of data.
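
A rough sketch of the classify, coerce, and partition steps described here, under stated assumptions: the two schemas, the field names, and the key layout (a Hive-style year/month/day/hour prefix) are hypothetical examples rather than Panther's actual formats. A record that fits no schema returns None, which corresponds to the case where the data cannot be classified and has to be set aside.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical schemas: log type -> required field names and their casts.
SCHEMAS = {
    "app.login": {"ts": float, "user": str, "source_ip": str},
    "app.payment": {"ts": float, "user": str, "amount": float},
}


def to_row(record: dict) -> Optional[Tuple[str, dict]]:
    """Return (log_type, typed_row) for the first schema that fits, else None."""
    for log_type, fields in SCHEMAS.items():
        if not set(fields) <= set(record):
            continue  # a required field is missing
        try:
            return log_type, {name: cast(record[name]) for name, cast in fields.items()}
        except (TypeError, ValueError):
            continue  # a value could not be forced into the declared type
    return None


def partition_key(log_type: str, row: dict) -> str:
    """Build an object-store prefix that separates data by table and hour (UTC)."""
    t = datetime.fromtimestamp(row["ts"], tz=timezone.utc)
    table = log_type.replace(".", "_")
    return f"logs/{table}/year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"
```

Partitioning by time is one common design choice because most security searches are bounded by a date range, so a query engine can skip every prefix outside that range.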
So at the end of the day,
that's what we need as security teams.
We need really structured data.
We need it in a way that can go to petabytes.
And it's only going to continue to get worse, right?
In terms of data volumes,
it's just increasing more and more every year.
You know, so this problem is definitely not going anywhere.
But also, you know, the thing for processing data with Python,
you need this strict schema
because we're looking at certain fields.
And, you know, if we
don't have it in this format, then we can't do
our Python rules.
There's a lot of different reasons for why we
need this in this format, but I'd say the
core one is
really around the sustainability
and having a way to actually go through and search
your data six to 12 months
back and not pull your hair
out because a lot of teams have this problem
where it would take days to get a response. And then you realize you searched the wrong thing.
And now you have to go back and do it again. So we're just trying to avoid that pain that we felt
so much in the past. SQL adds a little bit of complexity. Like I said, I think those tools, Splunk and Elastic,
they're not bad products by any means.
They definitely helped at a time when they were really needed.
But I think just for the scale that we're at,
they're just not working anymore.
And the thing with SQL is that, yeah,
there's a little bit of a learning curve,
but it's extremely, extremely powerful.
And when you learn how to really use it
and do these joins and do these statistics,
it's going to change the game for security.
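
To make the point about joins and statistics concrete, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for a real data warehouse. The table names, columns, and sample rows are invented for illustration; the query joins historical login records against a list of known-bad IP addresses and counts hits per user over a chosen time window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (ts TEXT, user TEXT, source_ip TEXT);
CREATE TABLE known_bad_ips (ip TEXT, reason TEXT);
INSERT INTO logins VALUES
  ('2020-02-03T10:00:00Z', 'alice', '203.0.113.7'),
  ('2020-02-03T11:00:00Z', 'bob',   '198.51.100.2'),
  ('2020-06-15T09:30:00Z', 'alice', '203.0.113.7');
INSERT INTO known_bad_ips VALUES ('203.0.113.7', 'reported in breach notice');
""")

# Join months of login history against the indicator list, grouped per user.
QUERY = """
SELECT l.user, COUNT(*) AS hits, MIN(l.ts) AS first_seen
FROM logins AS l
JOIN known_bad_ips AS b ON b.ip = l.source_ip
WHERE l.ts >= ?
GROUP BY l.user
ORDER BY hits DESC
"""

rows = conn.execute(QUERY, ("2020-01-01T00:00:00Z",)).fetchall()
for user, hits, first_seen in rows:
    print(user, hits, first_seen)
```

The same query shape applies whether the window is one day or twelve months, as long as the historical data is still stored in a structured, queryable form.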
That's very interesting.
So from what I understand,
like a big part of the problem at this stage,
at least, is like a data modeling problem, right?
Like you have to take like all these different sources.
You might have like also very ad hoc data
that might be coming in,
and then you have to model them
into one model that Panther defines,
and then you can query the data
with very specific semantics.
So do you have,
I mean, in the data world in general,
and we're talking about here
data analytics and data science
and all that stuff,
the data quality is a big thing.
It's a big issue, right?
Especially when you have to work with many different data sources
and where you have many different people
who are actually affecting the data that will be transmitted.
And this is something that we see also at RudderStack.
Big issue is like, okay, we collect the data, that's one thing.
But then in order to operationalize this data,
it's a completely different story.
And things about the quality of the data,
how you can monitor the quality of the data,
how you can enforce rules around how the data should be transmitted
and how to react when things go wrong, it's a big thing.
So is this also like data quality is also something
that's important also in your case,
or it's a bit different because the data that you are collecting probably are coming like from,
let's say, more domain-specific data sources in a way.
Do you see that? Is this a problem that you have?
And if it is, like, how do you deal with it?
Yeah, that's a great question.
So it's definitely less of a problem for us because in Panther's nature,
we are forcing all the data into that schema.
And like I was saying before,
a lot of SaaS data or cloud-based data
is so highly predictable
that we don't have to really worry
about the format very often.
So generally, if there is a problem,
it's that we couldn't classify it.
And then it goes into a queue
and then we can reprocess it once we've fixed the issue.
But it's pretty rare.
I mean, honestly, once you onboard a data source
and you use it for about a week,
you can figure out all these little variations in the data
and then we can add that into our schema
and then call it a day.
But it's usually something we don't have to touch very often.
It's just kind of like a one-time cost.
And then we get the benefit of, you know,
just being able to repeat that across deployments.
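The classify-or-queue flow Jack describes can be sketched roughly as follows. This is a minimal illustration, not Panther's actual implementation: the schema names, the required-field sets, and the `classify` and `process` helpers are all hypothetical, and a plain in-memory deque stands in for whatever queue is really used.

```python
import json
from collections import deque

# Hypothetical schemas: log type -> set of required top-level fields.
SCHEMAS = {
    "ssh_login": {"user", "host", "timestamp"},
    "api_call": {"principal", "action", "timestamp"},
}


def classify(raw_line, schemas=SCHEMAS):
    """Return (log_type, event) for the first schema whose required fields
    are all present, or None if the line cannot be classified."""
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    for log_type, required in schemas.items():
        if required.issubset(event.keys()):
            return log_type, event
    return None


def process(lines, schemas=SCHEMAS):
    """Split raw lines into classified events and a retry queue of raw lines."""
    classified, retry_queue = [], deque()
    for line in lines:
        result = classify(line, schemas)
        if result is None:
            retry_queue.append(line)  # keep the raw line for later reprocessing
        else:
            classified.append(result)
    return classified, retry_queue


LINES = [
    '{"user": "kostas", "host": "prod-1", "timestamp": 1}',
    '{"user": "eric", "machine": "prod-2", "timestamp": 2}',
    "not json",
]

# First pass: one line matches, two land in the retry queue.
classified, retry_queue = process(LINES)

# After adding a schema for the newly seen variation, reprocess the queue.
UPDATED = dict(SCHEMAS, ssh_login_v2={"user", "machine", "timestamp"})
reclassified, still_failing = process(retry_queue, UPDATED)
```

The point of the sketch is the one-time cost: once the variation is added to the schema set, the queued lines classify on the next pass without touching the rest of the pipeline.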
Jack, I'm interested to know,
you talk a lot about sort of the, you know,
if you try to do this yourself,
there's severe infrastructure costs or there can be,
are there things that earlier stage companies could think about just
in terms of their architecture that would make sort of the jump to like a higher level of security
easier? I mean, it sounds like Panther makes that way easier just in and of itself, but you just
mentioned architecture multiple times. I'm just interested to know from your experience,
are there things where you say, you know,
I wish I would have done this that way or sort of looked more closely at this
type of infrastructure from a security standpoint?
I think there's a lot of elements to that. For one,
I would say at least collect your data and put it in S3 because then you at
least have a forensic record of it. And that's something that
is really easy to configure
that has a ton of value,
but really down the line.
Let's say you're just an infra
team and you're getting set up. You don't have a
security team yet, but once you do have
that security team, if all the data that they need
is already in S3, it makes their life so easy.
And let's say they deploy...
It's like a gift to them in the future.
Absolutely.
And then let's say they deploy to like Panther
that will actually use that data for detection,
for storage, for analytics, things like that.
Then they basically can go from zero to one very quickly,
like within an hour.
You know, it's really quick.
Wow, that's incredible.
Yeah.
And the fact that I honestly,
I'm always amazed
what we've created in Panther because this type of system was impossible for us to create as
security practitioners. We just never had the skillset or the time, you know, and that is
actually one of the things I'm really thankful for having the opportunity to do is to, you know,
be in a startup, and run a startup rather, that is building this for other security teams to get value out of. Like, that's a really
fulfilling part of like my job, right. And it's just, it's going to make a huge difference in the
next like five to 10 years, like in this next wave of security tooling. So yeah, I would definitely
say if you're if you're an infra team listening, like put your logs in S3, please.
Jack, you talked about your job being rewarding and I,
I want to acknowledge the sensitivity of, you know,
your customer relationships,
but is there any way you could share maybe a story around how Panther sort of
caught something that could have been bad for a company?
Obviously you don't need to use the company's name,
but just interested if there are any stories like that where, you know,
someone running your technologies was saved from a potentially, you know,
big pain funnel, you know, because of some sort of security breach.
So I can't speak to like specific alerting or intrusions because of privacy and respect for that.
I mean, honestly, what we hear often is just
it's very easy to get set up.
And teams love the fact that they can write Python.
And I think in a sense, the Python is what enables them
to detect new types of things that they couldn't otherwise do.
So that's a huge advantage for a system like Panther.
And we've heard that pretty repeatedly.
And also the time to value has been pretty quick.
Once you get the data in, of course,
which can be sometimes challenging in large organizations.
But again, if you have all your data in S3,
it's pretty easy just to get going.
It's really more around like the capabilities and like the platform that, you know, we get that feedback from versus like, oh, it detected this specific type of thing.
Sure.
So, Jack, I have a bit of like a more technical architectural question that I have. So in data infrastructure in general,
there are like two main, let's say, models
that we usually think of, right?
One is like more of the data warehouse
or data lake-centric system
where data is delivered like to the destination.
And then in a more of a batch mode,
let's say we go there and process and come up with insights,
we can analyze the data, et cetera, et cetera.
And then there's also like the streaming model, right?
With things like doing processing
on something like Kinesis or Apache Kafka
or things that also technologies
like Spark from Databricks is doing.
So in security, I mean, of course,
this kind of paradigms,
they are supplemental at the end, right?
I mean, in many cases, you will see them living together in a company.
So in the security context, is one of these paradigms more prevalent?
And are they both needed?
And how Panther associates to these paradigms, if it does?
I mean, funny enough,
Panther is one of the first tools that I've ever seen.
I mean, besides from StreamAlert,
my other project that I worked on that actually used both of these technologies together.
Every other solution in security
has been some homegrown
or index-based searching system.
So the status quo,
and I actually just wrote a blog post about this
that kind of goes into detail,
but really the status quo in security
is log analytics platforms
like Splunk, Elastic, Sumo Logic,
and they don't really utilize any of the tech
that we just talked about, right?
It's their own sort of internal proprietary
search engine, indexing engine, things like that.
So you have like Lucene for Elastic,
which is the query syntax.
And then for Splunk and Sumo,
you have their own search syntax that they created, right?
And under the hood, they're using, again,
their own proprietary solutions
for searching and handling the data.
But when we did StreamAlert,
that was really the first time we even had the
opportunity to use tools like Kinesis and Lambda and things like that and get that really highly
sophisticated data pipeline because it's serverless and because Amazon exposed the service to allow us
to do it without needing a huge ops team. You know, I went and did a talk at Netflix in like, I don't remember what year it was. I think
it was 2017 or 20. Yeah, 2017. And I got a question from them like, well, why don't you just do this
in Spark? And I'm like, because I'm a team of two people. And, you know, we don't have this massive
infrastructure that we can just manage, you know. But, you know, what serverless did is it made it
really accessible.
So now because of the accessibility,
and a lot of these things that I'm talking about are pretty recent.
Lambda came out in 2016, 2015.
And Kinesis also was fairly new as well.
So we've basically been living on the bleeding edge
for a while.
And we continue to do that at Panther.
So all the Lambdas are using Go.
And I think Go Lambda support even came out in 2018 or so. And that's the year that we like founded the company.
You know, we continue to do stuff like that. Like we just use the bleeding edge Amazon based
solution. And that's actually one of the things that makes it a little bit difficult for us to
go multi-cloud, but we get the advantage of, we can run it at, like, a crazy scale and we can do it
at a fairly low cost. So that's the
trade-off that we're making. But, you know, the fact that it makes it viable to do security at this
cloud scale, it's like, that's a good trade-off.
Oh, that's amazing. So with Panther, like,
what are the most common use cases when the teams are working with data? Is it more about, like,
digging in the past and trying to address something that might have happened
in the past, or it's
more about having real-time alerts
and trying to react
as fast as possible in a threat
or something that has happened.
Or both. I mean, what do you
see that's the most common practice right now?
Yeah, that's a great question.
There's a couple pieces to it.
I think the real-time is really, really great at finding things that are happening right now. And a big thing in security,
like I was saying before, is you sometimes don't know that something's happened, you know, until
months after, or maybe it takes a long time to issue a search on your data because it's so,
it's so much data, right? So real time really helps just get that really quick feedback loop, right?
The disadvantage with it right now is that it's looking at, you know,
a very limited window of time.
It's generally only looking at a handful of events, right?
And to really detect certain behavioral patterns,
you need to look at, like, batches of events together, right?
So that's where, like, the SQL comes into play.
So it's really interesting to combine both of them.
So maybe you have a real-time
detection that said, hey, Kostas
just logged into this prod box
and he is not
supposed to be doing that. This is a
really sensitive box that only this
one team should have access to.
That's suspicious. So what we can do is
we can go back in our data lake
and say, show me all the other boxes
that Kostas is logged into.
Maybe we don't have a rule for that,
but we have the data because we're collecting it.
And that's the other thing that makes
the data collection piece and the storage piece
so important is that you may not have a detection
for everything, but you have the data probably.
So if you can go and search the data really efficiently,
then you can answer the questions
that you couldn't answer in real time at that moment. So it definitely has, like, these
very complementary solutions. And, you know, I think it kind of goes back to like teams
want that like real time response. And they want to be able, within a minute, to go in and fix
something if it's needed. Like there's a lot of organizations that I work with
that they want to have everything fully automated.
They want to know literally within a minute.
If someone logs into a system,
they want to get an alert and remediate it as soon as possible.
And systems like Panther allow you to do that
because we can plug the alerts into things like SQS queues
or into webhooks or into SOAR platforms
and really take advantage of the automation.
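The combination described above, a real-time rule that fires on a single event followed by a SQL pivot over stored history, might look something like the sketch below. Everything here is an assumption for illustration: the `SENSITIVE_HOSTS` policy, the event fields, and the `logins` table are hypothetical, and an in-memory SQLite database stands in for the data lake.

```python
import sqlite3

# Hypothetical access policy: sensitive host -> the one team allowed on it.
SENSITIVE_HOSTS = {"prod-db-1": "payments"}


def realtime_rule(event):
    """Fire when a user outside the owning team logs in to a sensitive host."""
    owner = SENSITIVE_HOSTS.get(event["host"])
    return owner is not None and event["team"] != owner


def build_lake(events):
    """Load events into an in-memory SQLite table standing in for the data lake."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logins (user TEXT, team TEXT, host TEXT, ts INTEGER)"
    )
    conn.executemany(
        "INSERT INTO logins VALUES (:user, :team, :host, :ts)", events
    )
    return conn


def hosts_for_user(conn, user):
    """Historical pivot: every host the user has logged in to, with counts."""
    rows = conn.execute(
        "SELECT host, COUNT(*) FROM logins WHERE user = ? "
        "GROUP BY host ORDER BY host",
        (user,),
    )
    return rows.fetchall()


EVENTS = [
    {"user": "kostas", "team": "data", "host": "web-1", "ts": 1},
    {"user": "kostas", "team": "data", "host": "web-1", "ts": 2},
    {"user": "eric", "team": "payments", "host": "prod-db-1", "ts": 3},
    {"user": "kostas", "team": "data", "host": "prod-db-1", "ts": 4},
]

# Real-time path: evaluate each event on its own as it arrives.
alerts = [e for e in EVENTS if realtime_rule(e)]

# Historical path: no rule existed for this question, but the data did.
lake = build_lake(EVENTS)
history = hosts_for_user(lake, alerts[0]["user"])
```

The real-time rule sees one event at a time, so it cannot say where else the user has been. The query answers that afterwards, which is why storing the data matters even when no detection exists for it yet.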
It's amazing.
And it's also very interesting exactly because it sounds like security is the
kind of like context where real time and batch,
they really need to coexist.
And when you need them,
you need probably like to have like access to both of them,
which is a very interesting use case.
Usually like with other data problems, what we have seen,
that's like, I mean, you might need both,
but for very different needs,
and probably also different teams are involved there.
So you have this situation where, for example,
streaming processing is more about detecting something that's happened now,
and you would like to send alerts,
but your BI team, for example,
probably never mess with that.
They're probably working only on batch processing
and on your data warehouse.
So that's pretty unique, I think, in security.
And I find it from a data problem perspective,
let's say, very, very interesting.
So Jack, you have touched a little bit
of the kind of technology that you are using.
You mentioned serverless, you mentioned Lambdas
and how all these technologies,
they have allowed you to deliver the product
and the experience that you have right now.
Would you like to get a little bit more detail
of your technology, the stack that you are using?
And I mean, okay, it's on AWS, as we mentioned,
but I think it would be super interesting
to hear a little bit more about the architecture
of the product and the technology behind it.
I'd love to.
I'm always really impressed, actually,
with how the team has been able to evolve
the architecture as well.
So Panther started just as a set of Lambda functions
connected together with some queues,
some S3 buckets, things like that.
So the first thing I'll say is Panther is fully serverless,
meaning there's not a single virtual machine
that we manage when we deploy Panther.
It's all managed services.
So if we start with the web front end,
so the web front end runs as a Fargate
container. And if you're unfamiliar with Fargate, it's basically just like a quote unquote serverless
version of ECS. So you give Fargate a container or an image rather, and then it runs it as a
container and you don't have to manage the underlying VM. So that's the first piece.
Our front end is a React app and our middleware
is GraphQL. And then our
backends are Golang
Lambda functions and we
back them with things like DynamoDB
and S3 and things
like that. So again, the stack
is completely serverless, which is amazing
because we get the advantage of
scale, of low cost
and just low operational overhead.
So that's been really incredible,
especially with the GraphQL layer.
That is such a cool component to our architecture.
Yeah, yeah.
It's bleeding-edge technology, as you mentioned before.
That's very interesting.
Quick question.
Why did you choose Go as the language for your Lambdas?
I know that you can build Lambdas in many different languages.
Why Golang?
That's a good question.
So with Go, when you're stream processing,
you get so much benefit from the performance out of Go.
And also the type safety aspects, too.
I think just in general, it's better for the use case that we have,
and that's really the reason.
We'd written it in Python in StreamAlert,
and then we'd found at certain scales,
it just was a bottleneck.
So in order for us to run it at, like, a 10x scale,
which was one of the goals with Panther,
I decided that we really should write it
in a language that is more performant.
And that's been really helpful for us, actually.
Makes sense. That's good.
And then you support Python as the language for the rules, right?
Which is something that's also, like, a very interesting
differentiator compared, like, to other tools,
as you mentioned, like Splunk,
where you need like to learn their query language.
Then you have Elastic, which again,
I mean, they have their own language there.
And it's one of the things that I always disliked
about Elastic, to be honest. But why did you choose Python? And how does this interoperate with the rest
of the technologies that you have there?
Great question. So Python's probably the
most widely understood language in security. So when we wrote StreamAlert, that was one of the
huge value adds there, right? It was like, security
engineers love to code in Python, so
we're going to allow them to write the detections in
Python. And, I mean,
the second part of it is really you get a ton
of expressibility with Python.
You don't have to write some crazy
long proprietary
search in Splunk, right? You can
use things like classes, and you can
use things like helper functions and these
really widely understood
programming concepts you can apply into security
now, which is awesome. So that was the reason
that we kept that in Panther.
It's just such a highly powerful
feature. Eventually
I think what will end up happening is
we'll have an option to where
you can define the rules in a YAML
format or a simple format.
And then for the teams who don't write code,
it allows them to still do their job and do it effectively.
But I love the mantra of simple but powerful.
At the end of the day, we're allowing teams
to write Python on classified events,
like on events that are parsed and normalized,
I should say rather. And in its essence, it's pretty simple, right? We take some data, we put
it in a format, and then you can write some Python on it. But I want to keep that going forward as
well to where you could expand it to where you can choose how advanced you want to get. And that's
really like what I aspire to have in the product eventually.
Yeah, that's great. And actually, as you were talking about it,
I was thinking that because you said that Python
is probably the most widely understood language in the security space.
And I think that's another indication
that probably security is a data problem at the end
because Python is the de facto language
that every data professional is pretty much using.
So yeah, that's also interesting.
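To make the "write some Python on parsed and normalized events" idea concrete, here is a small sketch of a detection written as ordinary Python with a helper function. The event fields (`log_type`, `user`, `command`), the `ROOT_COMMANDS` list, and the function names are assumptions chosen for illustration, not Panther's documented API.

```python
# Hypothetical normalized event fields: "log_type", "user", "command".
ROOT_COMMANDS = ("sudo su", "sudo -i", "su -")


def is_root_shell(command):
    """Helper: True if the command starts an interactive root shell."""
    stripped = command.strip()
    return any(stripped.startswith(prefix) for prefix in ROOT_COMMANDS)


def rule(event):
    """Return True when the event should raise an alert."""
    if event.get("log_type") != "shell_command":
        return False
    return is_root_shell(event.get("command", ""))


def title(event):
    """Human-readable alert title built from the event."""
    return f"Root shell started by {event.get('user', 'unknown')}"


suspicious = {"log_type": "shell_command", "user": "kostas", "command": "sudo su -"}
benign = {"log_type": "shell_command", "user": "eric", "command": "ls -la"}
other = {"log_type": "api_call", "user": "eric", "command": "sudo su"}

fired = [e["user"] for e in (suspicious, benign, other) if rule(e)]
```

Because the rule is plain Python, the helper can be shared across many detections and unit tested like any other code, which is the expressiveness argument Jack makes against proprietary search syntaxes.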
So Jack, you mentioned at the beginning of our conversation,
your involvement in open source.
You mentioned like that back when you were at Airbnb,
you open source like a project.
I know that Panther is also like pretty active in open source
and you have open source as part of your product.
Why? Why do you think open source is important?
And what's the value that Panther gets from the open source community?
And what's the value that Panther gives back to the open source community?
Yeah, that's an awesome question.
So I think all in all, open source has been so helpful
for the security community in general.
A lot of tools, for example, like osquery,
and Google has a tool called Santa,
and there's a bunch of other really popular security tools.
There's one called ElastAlert for Elasticsearch.
And they've really just helped kind of push
the whole industry forward quite a lot.
And more importantly, it allows a security engineer
who's starting from nothing to go on GitHub
and pull a lot of their tool chain down for free
and try it out and get a ton of value out of it
from the get-go.
I think from there, it's extremely valuable for new teams.
But also just in general, open source is helpful
because one, it adds a lot of transparency into the projects.
It, I think, promotes better code quality too, because it's all out in the open.
And, you know, you have a lot of eyes on it. And also part of that is like,
you can have people who are very security-minded developers look at the code and say, oh,
actually, you know, this could be a problematic thing. This could be a vulnerability actually
in your code. You should fix this. Or this thing's too permissive. You should fix this or whatever. So you get that sort of that free kind of security consulting with it too, which is
awesome. So I think it strengthens the project overall, you know, with that added level of transparency.
And then you really get the opportunity to have a bunch of people test your software on all these
different types of infrastructures. And that actually makes it more resilient. So I love that aspect of it too.
So I think that's how I think about open source in general.
It's helpful for the community.
It's helpful for us who, you know,
we're working on the problem every day
and it helps us think about the problem
maybe in slightly new ways.
You get that diversity of feedback, I think,
which is amazing, right?
Because when you're a company,
you have all the same people with, you know,
all the same perspective looking at the problem every day.
But you get someone completely fresh and they look at it and they say, hey, have you thought about this?
Have you thought about doing it this way?
And we're like, actually, we didn't consider that, but now we will.
And that, I think, pushes the whole project forward more.
The last thing I think that's also really important is you can kind of standardize on a lot of different methods of doing security, right?
So, you know, osquery kind of became
the de facto for getting data
off of systems, right?
And ElastAlert became kind of the de facto
of, you know, searching
data in your Elasticsearch for security purposes.
And then StreamAlert became the de facto for
really people who were starting
with nothing and had AWS Infra. And now
Panther is kind of the successor of StreamAlert
with the UI and all these other added features
that teams really need.
So it just felt right.
And at the end of the day, we want the security engineer,
the person who's tinkering with the system,
to just be able to deploy it right away
and get that value immediately.
We also open source all our detections as well.
We used to have some packs that were internal
but then we just decided
we want to have everything out in the
open and we want to get feedback on all this.
We have
both Panther's open core
where you get the core cloud security and log analysis
features and then there's a couple things that we have
in our hosted SaaS that
allow you to search through the data easier. We have an
indicator search which is really cool. We have
our custom log support, our SaaS support,
our RBAC, all the normal
enterprise stuff is available
with a license from us.
That's really how we trade off open source
in our proprietary version.
Oh, that's great. So I know that you're like deploying a number of like alerts
and like ways of like processing the data.
And you also allow, as you said, like with Python,
like the customer to create their own rules there.
How do you come up with these rules?
I mean, is there like, okay, that's pretty much like
actually what's the work of security is.
And how is like this also,
how do you see this can work together with open source?
Do you see that people might be submitting rules publicly
and how you are reviewing these rules?
Because, okay, it's not something that you just take it
and run it and see what happens.
I mean, it's about security.
So how do you manage this?
So the detections are really based off of
really commonly accepted frameworks.
And that kind of goes back to one of the advantages
of open source in general that I was mentioning, right?
So there's frameworks published by MITRE.
MITRE ATT&CK is a very common one.
And there's others very similar to that.
And they basically lay out,
these are the common types of attacks that exist
in every environment or every type of environment.
So an example is like privilege escalation.
You know, maybe on, maybe there's some vulnerability on a system that you can get root on.
Like, that's a very common type of attack.
We can write a detection that's very generic that will detect that in a lot of different types of scenarios.
So following frameworks like that is really like the source of our detections, like how we are
informed and how we create.
And then it also really comes down
to just research, right? There's a lot of
new attacks that are happening every day. There's new breaches
all the time. So for example, like the
Capital One breach, we'll read reports
like that and then say, oh, okay, this
is how we would detect it in Panther. And I actually
did a webinar with Snowflake on that.
So I broke down, you know,
this is how you would find a
breach similar to that, and then
we write those detections, put it in open source,
and then anyone who deploys Panther gets
those rule sets. So stuff like that
is really helpful. In terms of
sharing them in open source,
I think it kind of goes back to
getting more eyes on the problem and
maybe having them review,
but also getting other teams to contribute their own. As security practitioners, we've all worked
in different companies and we've seen so many different things. And part of the goal of Panther
was really to democratize a lot of that knowledge. And the way you can do that is by committing
rules back to our source repo that are widely applicable to other companies as well.
We had, for example,
one of the engineers we know at HashiCorp had committed a rule back in that looks for a
signal from Amazon if you commit something into GitHub. So there's some scanning that happens
where Amazon can proactively, like, detect credentials that were leaked, and there's a
CloudTrail event you can get that would alert a team.
That stuff's really helpful to get
just, again, the different perspective, the thing I was
mentioning before, right?
We can get all that feedback into
Panther just by it being
open-sourced as a byproduct.
Yeah, it's super interesting. It's a very
interesting kind of community built
around that. I'm really
curious to see how it will
evolve in the future.
So Jack, one last question. I mean, I know we can keep, like, chatting about
that stuff, like, forever, but what's next for Panther? Like, is there anything exciting that you
would like to share?
Yeah, so it's been a pretty, uh, pretty interesting year, obviously, especially
with, like, the pandemic.
But, you know, we've had a really exciting year.
We've grown.
We've, I think, more than doubled as a company.
And we're continuing to hire engineers and non-engineers
and really just kind of grow the company,
both on the business and technical side.
So, I mean, as always, we're working on big features all the time
and, you know, we're trying to ship stuff that's super impactful that teams can jump in right away
and start using. So we just shipped, you know, some more log supports for things like Cloudflare,
Fastly, stuff like that. And then some bigger features are coming down the line that allow you
to do, you know, more of the behavioral based analysis that I was talking about before.
So being able to take advantage of this data lake that
we've created with
S3 and the
data classification process and
look in windows of time and do
more around automation. So that's really
what's coming down the pipe with Panther.
And
just continuing to support more use cases and
push these features into open source
and also get super helpful feedback from engineers.
That's great.
I'm really looking forward to chatting again
in the future with you
and also see what other amazing stuff
you're going to build at Panther.
Thank you so much, Jack.
Have a good day.
And as I said, I'm looking forward to chatting again.
Yeah, me too.
Well, that was a super interesting conversation.
I think security to me is, like you said,
when we were talking before we started the interview with Jack,
it's one of those things that can be a little bit ambiguous,
but amazing just to hear how he built on the work that he did at Airbnb and an open
source project to turn the technology into something that's a big and growing company.
That's really exciting. What do you think, Kostas?
It's amazing. I mean, what they have
managed to achieve so far in Panther and with the product they've built. I think what I found
extremely interesting and a bit surprising, to be honest, is how security has evolved all
these years. I mean, from the first time that I started working in technology, I remember chatting
about security, but how security was performed like a couple of years ago with what security
is today is something completely different. And what I find extremely interesting is that,
and that gets us also, like, to why it was interesting to have this podcast
on a show that's called the Data Stack Show,
is that increasingly the security problem is becoming a data problem.
It's all about how to collect the right data,
how to search into huge amounts of data,
and how you can react in real time when it's needed, but at the
same time have access to a huge amount of historical data that you can browse and query effectively.
And it looks like Panther has done an amazing job in like addressing these requirements of the
security problem of today. And I'm pretty sure that we will see
an amazing growth trajectory of this company.
And I'm looking forward to chat again with Jack in the future
and learn more about what they are doing
and what exciting stuff they're building.
Me too.
I think it's interesting thinking back on our conversation
with Slapdash and how they achieve so much with search by approaching the problem differently.
I think that Panther has done the same thing in many ways, thinking about a serverless
architecture that sort of allows you to manage petabytes of data.
It's just neat to see people approaching problems differently.
So be excited to catch up with Jack in another couple months, and we will catch you
next time on the Data Stack Show.