Risky Business - Soap Box: AI has entered the SOC, and it ain't going anywhere
Episode Date: June 16, 2025. In this sponsored Soap Box edition of the Risky Business podcast Patrick Gray chats with Dropzone AI founder Ed Wu about the role of LLMs in the SOC. The debate about whether AI agents are going to wind up in the SOC is over, they’ve already arrived. But what are they good for? What are they NOT good for? And where else will we see AI popping up in security? This episode is also available on Youtube. Show notes
Transcript
Hey everyone and welcome to this Soapbox edition of the Risky Business Podcast.
My name is Patrick Gray.
For those of you who are unfamiliar, these Soapbox editions of the show are wholly sponsored
and that means everyone you hear in one of these podcasts paid to be here.
And today we're speaking with Ed Wu, who is the founder of a company called Dropzone. Dropzone makes a really interesting AI platform
that you can deploy into your SOC
that basically acts as a tier one SOC analyst, right?
And it works really well.
I also should disclose at this point
that I'm an advisor to Dropzone,
which means I have an extra vested interest
in them doing well.
But yeah, I mean I regularly meet with today's guest, Ed Wu, and talk to him about all manner
of stuff and I can promise you he's a really sharp guy who understands this problem space
very, very well and has been in it longer than most.
In fact, before he was a founder of Dropzone, he worked at ExtraHop Networks, where he was part of the team,
or I think led the team, that took ExtraHop's platform from being a network-oriented product
into being a security-oriented product.
And if you want to see like how happy they were with his work when he was at ExtraHop,
one of the founders, well, I'm sorry, one of the investors in Dropzone is actually one of the founders of ExtraHop.
So, you know, that's a solid endorsement there. Ed, thank you for joining me. I thought today what we could really talk about is not just about Dropzone and what it does in the SOC.
Obviously, we'll, you know, touch on that. But I wanted to talk about like the use of AI in
cybersecurity more generally, what it's good for,
what it's not good for. But let's start with the SOC, right? Because I think it's one area
where not only is the use case clear, but people are already using it in the SOC, and not just Dropzone.
Like AI, when it comes to processing logs, looking at alerts, things like that, triaging,
I mean, people are using LLMs in a lot of SOCs already.
Do you think that's a fair statement?
Yeah, yeah, absolutely.
To best answer this, I think using Cursor
or AI coding tools is actually a very good analogy.
So a lot of us might remember a couple years ago
where if you are using ChatGPT to help you write code,
you get laughed at because the consensus back then is
if you are using ChatGPT to write code, you know,
vibe code, you are just creating more bugs
that will end up costing you more time.
So actually, you would have been better off just doing it yourself.
But now fast forward to today, I think it's pretty clear every single, you know, head
of engineering, every single CTO is strongly advocating for developers to use, you know, AI
coding tools, whether it's Cursor, whether it's, you know, GitHub Copilot.
And I think a lot of this is ultimately, you know, with any new technologies, there's always
like hesitation and skepticism.
But over time, as, you know, the early adopters start to see returns, word gets spread,
and then the rest of the community starts to pick up on all the success stories.
With AI in the SOC specifically, I think two years ago,
probably around this time, I remember the RSAC where Microsoft had just launched Security Copilot.
And all it was was a chatbot,
you can ask it to enrich a particular IP address.
You can ask it to summarize a particular log line.
But that's pretty much it.
But as I was saying, over the last two years,
the technology has matured extensively,
to where there are a number of organizations using AI agents within the SOC in production.
And as they see more actual real-world impact, word spreads across the community.
And I think nowadays, the percentage of people who are skeptical of the technology has dramatically
decreased compared to even a year ago.
I'm wondering though, like to what degree people feel comfortable using it, right?
In a SOC context. Cause as you point out, you know, stuff like Copilot, stuff like Cursor,
like that is just workaday now, right? Everybody kind of uses it, but they can dial up and dial down
like where they want to use it. Cause it's like, it's one of those sorts of tools, right? Where you
use it in a development environment and you can just say, well, I want to use it here,
but this bit I'll do manually. You know, SOC work is really sort of workflow-based, right?
So I'm guessing, you know, it's a little bit different in that you have to think ahead
of like, well, where do I want to use an LLM to do this? And where do I want it to step back and kick it to a human?
Like, is that part of the whole question
of how this stuff is winding up in the SOC at the moment?
Yeah, yeah, it is.
You're absolutely right.
Like with coding copilots, to some extent,
every time a developer is working on a project,
they are making a decision, a two-way door decision,
whether they want Cursor to give it a try first,
or they should just wing it themselves.
So they are making this decision,
whether I delegate this to Cursor to take a first step,
or I just do this myself manually.
But with SOCs specifically, most of the time, what we have seen is the human
analysts are not looking at each alert and making a dynamic decision. Oh, for this alert, I want to
delegate it to Dropzone. Well, but I mean, that's the problem you're trying to solve, right? Which
is there's too many alerts. So trying to, you know, if you're actually in a position where you have
to decide which alert you want to AI triage,
that's kind of useless, right? Yeah, absolutely. And that's kind of why, at least from what we have seen, the chatbots, the security chatbots of the world have not been tremendously successful.
Because again, the challenge is there is so much to do in security. If you have to micromanage a chatbot
and tell it exactly what to do, like every 30 seconds,
then that kind of more or less defeats the purpose.
And this is where, for AI agents,
the most common way is to treat them
as a new tier one.
So feed all your alerts to an AI agent.
The agent will perform the investigations.
It will dismiss or close the false positives
and then only escalate the suspicious
or the malicious alerts.
So that's kind of the most common workflow
like deployment model we have seen,
which is leveraging AI agents as the new tier one,
or you can say the AI filter or the AI meat shield
that shields the rest of the team
from the vast majority of the noise.
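To make that deployment model concrete, here is a minimal Python sketch of the tier-one filtering loop Ed describes; the Alert class and the investigate_alert helper are hypothetical stand-ins for an agent like Dropzone's, not its actual API.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    id: str
    raw: dict
    status: str = "open"

def investigate_alert(alert: Alert) -> tuple[str, str]:
    """Stand-in for the LLM-driven investigation; returns (verdict, summary)."""
    # A real agent would gather context from integrations and reason over it.
    return "benign", "no corroborating activity found"

def triage(alerts: list[Alert]) -> list[Alert]:
    """Auto-close benign alerts, escalate everything else to the human team."""
    escalated = []
    for alert in alerts:
        verdict, summary = investigate_alert(alert)
        if verdict == "benign":
            alert.status = f"closed: {summary}"   # the AI filter dismisses the noise
        else:
            escalated.append(alert)               # humans only see what's left
    return escalated
```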
It is interesting that most security products historically
really focus on true positives.
When you look at detection products,
most of them are showing you how they were able to detect a five-step sophisticated APT attack.
But in reality, for AI SOC agents,
the biggest value proposition is not detecting true positives
or sophisticated multi-month, multi-hop intrusions.
But instead, the biggest value proposition
is reducing false positives.
Because by the virtue of removing hay from the haystack,
it makes finding the needles much easier.
Now, I bet already some people are listening to this
and saying, well, well, hold on, buddy.
Because what happens if you start dismissing true positives and flagging them
as false positives, right? And that's always going to be the concern when you're looking at
plugging in an AI model and trusting it to sit at the top of your detection stack and you give it
authority to dismiss stuff. Like how can you assuage fears that there is some genuine attack
going on and the model just doesn't know about it,
doesn't think it's a big deal and just gets rid of it.
Like, you know, cause I'm guessing that's like a huge barrier
when you're trying to sell into a new place
is convincing them that it's actually accurate enough
that it's not gonna, you know, give you a bad result there.
Like what, you know, how can you assuage those fears?
Yeah, there's definitely a couple of components.
First and foremost, AI SOC agent vendors, including us, all prioritize minimizing false
negatives.
Meaning, when we say a security alert is benign, 99.9% of the time it is actually benign.
And this is where I will be transparent and frank.
At this moment, looking at the technology,
there will always be a degree of hallucination.
There is no way to completely remove
all hallucinations from large language models.
Any vendor who claims they have figured out
a magic way to remove all hallucinations
should be acquired by OpenAI for, like, $20 billion.
Because I'm sure OpenAI and Google
would love to know the magic sauce
to remove all hallucinations.
Yeah, so this is not a little problem
that a security startup is going to fix.
This is a fundamental large language model issue.
Correct.
But what security startups can do
is build processes, systems, and engineer modules
in a way where the level of hallucination
is controllable and manageable.
And this is where you ask, hey, how can I trust an AI SOC
agent to not make mistakes?
And our perspective is an AI SOC agent will make mistakes,
but it's not about like whether it will make mistakes or not,
but it's more about the probability of making mistakes.
And this is where I was introduced to a concept
recently about the trade-off
between leverage and uncertainty.
So some of us who have been like a manager
or business owner or tech lead
are very familiar with this concept,
which is sometimes you are given a project
and then you might have somebody else working for you.
And then you are doing this mental calculus in your head.
How long does it take me to do it?
How long will it take my employee to do it?
And how much can I trust my employee
on doing the right thing or solving this problem
in the same way that I want it to be solved?
And I think anytime, whether it's delegating tasks
to another human or delegating tasks to an AI agent,
there's always this trade-off:
anytime you want to increase leverage,
you're kind of increasing uncertainty
and increasing potential errors.
So from our perspective, our goal is to build a system,
and we have already achieved this consistently, that is
at or above the accuracy of a typical human tier one security analyst.
Yeah.
I mean, you can benchmark this, right?
Because SOCs are well logged, right?
Decisions are well recorded.
So you can actually benchmark an LLM-based product
against people.
Absolutely.
And with some of our customers, especially MSSP or MDR service providers,
when they were POCing our technology,
we would often get put into a bake-off.
So the service provider will gather 100 security alerts,
they will run 100 through our system,
and then they will build a spreadsheet.
One column is what Dropzone has found,
the other column is what their team has found,
and we will compare and contrast.
I will definitely tell you that oftentimes,
when you run through exercises like these,
the first thing you notice is
even different members of the team might mark
the same alert in different ways,
because there's always a difference in opinion.
But even beyond that,
the accuracy of an AI
like our AI SOC analyst is definitely on par with,
if not sometimes meaningfully better
than, the human team members.
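The bake-off Ed describes is easy to picture as a small script: the same alerts scored by the AI and by the analysts, then compared column against column. A rough sketch, with an assumed CSV layout and column names:

```python
import csv
from collections import Counter

def compare_verdicts(path: str) -> None:
    """Compare the AI's verdict column against the analysts' column, row by row."""
    agree = Counter()
    with open(path, newline="") as f:
        # assumed columns: alert_id, dropzone_verdict, analyst_verdict
        for row in csv.DictReader(f):
            agree[row["dropzone_verdict"] == row["analyst_verdict"]] += 1
    total = agree[True] + agree[False]
    print(f"agreement: {agree[True]}/{total} ({agree[True] / total:.0%})")
```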
Now, you just touched on something interesting there,
which is that you have to manage an LLM
or have expectations around an LLM
similarly to how you would have expectations
of a human staff member.
What I'm seeing, I'm seeing some interesting stuff in AI
around multi-agent sort of deployments,
where you almost have an AI that has a role,
that can play that role of being a supervisor
to the core LLM that's doing most of the work.
I mean, is that something that you've played with as well,
at Dropzone, which is having a supervisory model observing your sort of log processing and investigation model,
you know, and can you even have multiple models doing the investigations and then you can
evaluate like if there is some sort of disagreement between them, you might want to kick that
up to a human. So I guess my question is like, you know, what's the role of sort of
multi-agent in a tech stack like this?
Yeah, so it's kind of similar to how, you know, we operate as humans. I
think sometimes we feel like there are multiple voices in our head, right? As a
father, I should prioritize X over Y. As an entrepreneur, I should prioritize Z
over X, right?
Stuff like that.
So yes, absolutely.
And what we have seen with large language models
is that giving them different personas
really helps them to specialize.
And by doing that, you're able to build more complex
end-to-end workflows that you couldn't have
with like a single persona.
So we definitely leverage what we call multi-personas within our system, each
specialized in a specific function.
And things like self-reflection, which is you ask a model to do one thing and then you ask it,
or another module, to critique the output,
is a very common technique to increase
the accuracy of the output of specific functions.
One very common example is, for example,
you want a large language model to generate an SPL query, so a Splunk query.
The model might generate something,
and that query might or might not work.
And a very common technique to improve
the accuracy of that query is to use another module
to nitpick the query generated by the first module
to spot mistakes.
Hey, you misspelled a field, you wrote user
when it should be user_name, for example,
and stuff like that.
Very similar to when most of us
were in school taking exams,
especially math exams:
when we complete all the questions,
we go back and revisit our answers.
Again, critiquing ourselves.
This self-reflection is definitely a very common technique.
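As a concrete illustration of the self-reflection technique, here is a minimal sketch assuming an OpenAI-style chat API; the prompts, model name, and helper function are illustrative only, not Dropzone's implementation.

```python
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    """One model invocation; model choice and prompts are illustrative."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def generate_spl(task: str) -> str:
    draft = llm(f"Write a Splunk SPL query for: {task}. Return only the query.")
    # Second invocation plays the critic: it nitpicks field names and syntax.
    critique = llm(f"Review this SPL query for wrong field names or syntax errors:\n{draft}")
    # Third invocation applies the critique to produce the final query.
    return llm(f"Rewrite the query, applying this review.\nQuery:\n{draft}\nReview:\n{critique}")
```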
Then there's using multiple models,
like different prompts, different temperatures,
different models, to generate
the same output and then compare and contrast.
It's a little bit like voting.
When you ask three people about a certain topic
and you pick the most agreed upon answer,
that is going to further boost the accuracy of the outputs.
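And a sketch of the voting idea: the same question asked a few times at different temperatures (or to different models), keeping the most agreed-upon answer. It reuses the hypothetical client from the sketch above, so the parameters are again only illustrative.

```python
from collections import Counter

def vote(prompt: str) -> str:
    """Ask three times at different temperatures and keep the majority answer."""
    answers = []
    for temperature in (0.0, 0.4, 0.8):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content.strip().lower())
    return Counter(answers).most_common(1)[0][0]  # most agreed-upon answer wins
```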
Yeah, right.
So this is absolutely a thing that's happening.
Because a friend of mine, he went to some Microsoft demo, which he was blown away by
where they got it to build like a Scrabble game or something.
But it was the multi-model part that was incredible to him where there's like a model that's a
project manager that deals with the other models and yells at them
when they get stuff wrong.
And he just said it was incredible watching
all these little AIs going off and doing stuff.
I mean, are you currently doing this, are you,
with the multiple model approach?
Yeah, yeah, absolutely.
To give you another example, alert investigation,
it's a little bit like being a detective.
You can kind of technically go on forever.
You can investigate to the nth degree.
So one module we have is kind of like an accountant,
where it's keeping track of the progress made
by the investigator components and trying
to identify when the marginal utility of additional CPU cycles or additional time spent
on this alert starts to drop off.
Yeah, at the point that it's analyzing one gigabyte
crash dumps, it might be time to tell it to chill out, right?
Yeah, or looking at IPs associated with another username,
associated with other IPs that might correlate to the alert. Again, a lot of this, after a certain point, is decreasing marginal utility.
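A toy version of that "accountant" logic might look like the following; the step interface, cap, and thresholds are invented purely for illustration, not how Dropzone implements it.

```python
def investigate_with_budget(steps, max_steps=25, min_gain=0.02):
    """Run investigative steps until effort is capped or returns diminish."""
    findings, confidence = [], 0.0
    for i, step in enumerate(steps):
        if i >= max_steps:
            break                                   # hard cap on effort
        new_findings, new_confidence = step()       # one investigative action
        if i > 3 and new_confidence - confidence < min_gain:
            break                                   # diminishing marginal utility: stop digging
        findings.extend(new_findings)
        confidence = new_confidence
    return findings, confidence
```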
Yeah, no, that makes a lot of sense.
So when we start looking outside the SOC, right, which I know is not what you do, you
know, obviously I'm working now with Decibel, which is one of the backers of your company,
right?
And, you know, everybody's all looking for ways to invest in AI companies
that are doing interesting things.
I think it's got some applicability pretty much everywhere.
I think the clearest use case, day one, is the SOC stuff.
It's the type of stuff that you're doing.
But obviously, as someone who is running an AI startup,
you've got your finger on the pulse, I'm guessing,
of where people are making progress in other areas of cybersecurity. Where do you see the
exciting stuff happening there? Yeah, from what I've seen, obviously there are different ways
to prioritize different chunks of tasks. But from our perspective, what we have seen is most people are prioritizing the work
that's the most manual as well as highest quantity. Because if we were to build a module
or product that automates stuff, you might as well start with the most laborious and
the highest quantity tasks
within the security program.
So we have definitely seen pen testing.
That's one where throwing spaghetti at the wall
is not the most fun thing.
Or being a manual fuzzer is not the most fun thing
that somebody could do.
The other one where we have seen a lot of success so far
is in code reviews.
Again, code review, I don't think any of us
wakes up in the morning and gets excited about reviewing code.
But at the end of the day, for any fast growing application
or business,
there are a lot of code commits that would benefit
from security reviews.
Man, I got a friend who has just played around
with some generic models and figured out
how to prompt them in such a way that he thinks
it's the end of the SaaS industry.
And he says there's no moat, so it's not something
he's going to turn into a startup,
because it's all done with commodity models.
And he's like, if you know what you're doing,
you could throw code into them, and just all the bugs fall out.
It's coming.
That's definitely coming.
Yeah.
Yeah, like code analysis, like I did my PhD in program analysis.
Definitely spent a lot of time in a previous life,
looking at code, looking at syntax trees,
basic blocks, and stuff like that.
Yeah, large language models are very good at understanding code.
I do think there are still challenges,
especially where the code base is very large.
If you have a 100-line Python script,
I would not be surprised if ChatGPT already
does a tremendous job of spotting the issues.
But when you have a more complex code
base with complex interactions with internal libraries
or proprietary libraries or APIs.
Yeah, a million dependencies and dependencies on dependencies.
And yeah, you're just going to run out of space, aren't you?
Yeah, and also this requires the model
to really understand the different context of your code.
And this is where, even in the SOC, what we have seen is that initially
most of the AI SOC startups like us
focused first on building integrations.
But we are getting to a place where
most of the integrations are already built.
What we have seen is the difference between
a mature product and an immature product now comes
down to the ability to build context.
Because a mature AI SOC analyst will be able to come
into your environment through a combination of integrations
and other means, really understand your organizational policies,
preferences, and practices.
Versus a naive or immature AI SOC analyst or AI SOC product
will come in and be like, okay, I marked this alert as malicious because I saw it as malicious,
even though the company might have a policy saying this kind of logging activity is actually expected.
I mean, it's probably worth pointing out too that one of the issues that you've had running this business is,
I think, some people expect AI magic to fix their problems
when they just have a terrible detection stack, right? So you go in there and the
source data is patchy, like really patchy, so your agent can't collect the context
it needs to make decisions and whatnot. So just to be clear, an AI SOC
analyst is only gonna work well when you've got a detection stack that's
pulling in the right information to begin with.
I mean, people are, you know, some people
expect a little bit too much, right?
Which is that an AI agent is going to be able to infer things
without actually collecting good context.
Yeah, yeah.
I would say we have run into a number of cases
where, for example, you know, our technology is asked to investigate AWS alerts
when there are no AWS logs at all, either in AWS itself or within their SIEM. So obviously,
in that case, it's technically impossible to investigate those alerts if there are no logs at all. So yes, like an AI SOC agent is not going to fix the visibility problem. If you don't have logs in
certain parts of your business, then, you know, an AI agent is not going to be able to fix it for you.
With regards to patchy detections, we have seen cases like, for example, within our product,
when we see the same false positive happening over and over and over again,
our technology will propose recommendations, like tweaks on the detection rules,
to help tone down the noise.
So I would say that's actually a little bit easier to solve.
The opposite problem, yeah, yeah, yeah.
Than the opposite problem, which is you're asked to cook a dish when
you don't have any of the ingredients.
Yeah.
Now look, another thing I wanted to ask you about, and it's been quite the thing on social
media over the last week, is this paper that was written, I think, by an Apple intern looking
at large reasoning models and how they don't really appear to
be more accurate than large language models when asked to do reasoning tasks.
And in fact, when tasks get to a certain level of complexity,
both LLMs and LRMs are not all that useful, right?
Which I don't quite understand
why people are so surprised by this.
Because when we see where the wins are with LLMs,
it's the stuff that you're talking about,
like high volume,
kind of menial stuff that nobody wants to do that's sort of semi repetitive and requires
diligence.
You know, I mean, a lot of the reason people miss SOC alerts is because sitting in front
of a SIEM console all day is boring and mind-numbing.
And this isn't a problem experienced by computers.
Like it just isn't.
But I wanted to ask you what you made of that paper.
Like was there anything in there that was surprising to you?
Anything you agree with or disagree with
as someone who's using these sorts of models?
Yeah, I think there are different ways to look at it.
That was definitely an interesting paper.
Some say, you know, Apple is just jealous
of kind of being a little bit left behind by everybody else.
But yeah, I think from our perspective,
like part of the art of using large language models
is task decomposition.
And what I mean by that is, it's similar to asking
a single person to build a business,
which would be very difficult.
But most modern projects,
whether it's a modern business or the Manhattan Project, involve a large number of different
types of specialists doing their special thing, but working in unison
to solve a complex problem end-to-end.
So generally, if you expect a single large language model
invocation to be able to perform very complex tasks,
I think that's kind of a misaligned
expectation. Most of the large language model or AI agent
developers like us are decomposing complex tasks into small cognitive steps.
Each of them frankly should be trivially solvable
by a middle schooler.
So for example, when our AI SOC agent is looking at an alert
and trying to make sense of this alert and investigate it,
on average, our system makes close to 100
distinct large language model invocations.
Again, by breaking down alert investigation
into small cognitive steps.
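A toy sketch of that decomposition: rather than one giant prompt, the alert is broken into many small questions, each answered by its own model invocation. The question list and the llm helper (like the one sketched earlier) are illustrative, not Dropzone's actual steps.

```python
STEPS = [
    "Is the source IP internal or external?",
    "Has this user logged in from this geography before?",
    "Does the process name match a known admin tool?",
    # ...in practice, close to a hundred small questions like these
]

def investigate(alert_text: str) -> list[tuple[str, str]]:
    """One small, middle-schooler-sized model invocation per question."""
    results = []
    for question in STEPS:
        answer = llm(f"Alert:\n{alert_text}\n\nQuestion: {question}\nAnswer briefly.")
        results.append((question, answer))
    return results
```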
Yeah, I mean, this is, it's interesting.
When you said pen testing earlier,
like as something that's ripe for sort of disruption
with LLMs, I know that there's a lot of pen testers who would wince at that and say, no,
that's not possible. And look, I mean, I think to a degree they're right,
like real elite level sort of pen testing is going to require that pen tester
brain, which is a rare type of brain,
but there's so much of the pen testing workflow where the
tricky part is understanding which steps to do next and why.
But the steps themselves are actually quite simple.
So I think, you know, I think that we might wind up in a situation where a lot of the
cool technology work is actually teaching the LLMs how to do certain things, right?
Like I can see that as being something that, you know, like if you're a pen tester, you
might teach a model,
hey, there's this type of check that I figured out how to do.
You teach the model how to do it.
And then when you actually want to get around to doing the check, it's just as simple as
asking the model to do it.
And then of course, you know, with these multi-model approaches, you might be able
to have models which will understand better which checks you want to apply in which
context and whatnot.
So but I think you're right.
It's about breaking those things down, isn't it?
Into those simple steps and just thinking about those problems in
terms of I have an army of middle schoolers who will do whatever I
want at basically infinite speed.
Like how can I instruct these 14 year olds on how to do stuff?
Is that, you know, that's kind of the way I think about it.
Is it the way you think about it as well?
Yeah, yeah, absolutely.
I think a lot of people use phrases like force multiplication
or up-leveling.
Like one analogy we generally use
is we want to up-level the human security engineers
and human security analysts to be like the generals
and special forces, where they have an army of AI middle schoolers or AI foot soldiers
that listen to their commands and do whatever they're instructed.
This is also where one thing we have seen, as we work with
different organizations of different sizes and maturity, is that actually
making sure the AI agent is coachable,
that it listens to instructions, is quite important.
You and I have talked about that before,
because you actually had to do quite a lot of work there
to get that coachability into the models that you're using.
Yeah, and I also think it's a very key component of this trust
building.
I think I use the analogy that everybody
has experience working with smart jerks that
are very stubborn and do not take any inputs or feedback
or suggestions from team members.
But I think all of us probably also
have experience working with somebody who's junior,
but tremendously coachable.
And after a couple of months, that junior person is actually outperforming somebody
who is more senior because they are so coachable
and they're absorbing everything you taught them.
And we are kind of seeing something similar
within the AI SOC agent space,
where, because every environment is different,
sometimes a very coachable AI SOC agent can actually
become significantly more valuable to an organization than maybe a smarter out-of-the-box
agent that's very stubborn. Yeah, yeah. No, I mean, I think we're actually at the fun part
from my perspective when it comes to AI,
because we've got a better understanding of what
it's useful for.
And of course, that's going to change, right?
But yeah, we're getting a better idea of how to use it,
what it's good at, what it's not so good at yet.
Ed Wu, we're going to wrap it up there.
Always a pleasure to chat to you, my friend,
and pick your brain on this stuff.
We learn a lot.
So thanks a lot for your time, and I'll
be chatting to you again soon.
Thank you for having me.