Risky Business - Soap Box: AI has entered the SOC, and it ain't going anywhere

Episode Date: June 16, 2025

In this sponsored Soap Box edition of the Risky Business podcast Patrick Gray chats with Dropzone AI founder Ed Wu about the role of LLMs in the SOC. The debate about whether AI agents are going to wind up in the SOC is over, they’ve already arrived. But what are they good for? What are they NOT good for? And where else will we see AI popping up in security? This episode is also available on YouTube.

Transcript
Starting point is 00:00:00 Hey everyone and welcome to this Soapbox edition of the Risky Business Podcast. My name is Patrick Gray. For those of you who are unfamiliar, these Soapbox editions of the show are wholly sponsored and that means everyone you hear in one of these podcasts paid to be here. And today we're speaking with Ed Wu, who is the founder of a company called Dropzone. Dropzone makes a really interesting AI platform that you can deploy into your SOC that basically acts as a tier one SOC analyst, right? And it works really well.
Starting point is 00:00:36 I also should disclose at this point that I'm an advisor to Dropzone, which means I have an extra vested interest in them doing well. But yeah, I mean I regularly meet with today's guest, Ed Wu, and talk to him about all manner of stuff and I can promise you he's a really sharp guy who understands this problem space very, very well and has been in it longer than most. In fact, before he was a founder of Dropzone, he worked at ExtraHop Networks where he was a part of the team,
Starting point is 00:01:06 or I think led the team that took ExtraHop's platform from being a network oriented product into being a security oriented product. And if you want to see like how happy they were with his work when he was at ExtraHop, one of the founders, well, I'm sorry, one of the investors in Dropzone is actually one of the founders of ExtraHop. So, you know, that's a solid endorsement there. Ed, thank you for joining me. I thought today what we could really talk about is not just about Dropzone and what it does in the SOC. Obviously, we'll, you know, touch on that. But I wanted to talk about like the use of AI in cybersecurity more generally, what it's good for, what it's not good for. But let's start with the SOC, right? Because I think it's one area
Starting point is 00:01:51 where not only is the use case clear, but people are already using it in the SOC and not just drop zone. Like AI, when it comes to like processing logs, looking at alerts, things like that, triaging, I mean, people are using LLMs in a lot of socks already. Do you think that's a fair statement? Yeah, yeah, absolutely. To best answer this, I think actually using Cursor or AI coding tools, I think that's like a very good analogy. So a lot of us might remember a couple years ago
Starting point is 00:02:25 where if you are using ChatGPT to help you write code, you get laughed at because the consensus back then is if you are using ChatGPT to write code, you know, vibe code, you are just creating more bugs that will end up costing you more time. So actually, if you would have to do, have done this yourself. But now fast forward to today, I think it's pretty clear every single, you know, head
Starting point is 00:02:50 of engineering, every single CTO is strongly advocating developers to use, you know, AI coding tools, whether it's cursor, whether it's, you know, GitHub co-pilot. And I think a lot of this is ultimately, you know, with any new technologies, there's always like hesitation and skepticism. But over time, as you know, the early adopters start to see return, see words get spread out and then see the rest of the community start to pick up all the success stories. With AI in Sock specifically, I think two years ago, probably around this time I remember the RSAC two years ago where Microsoft just launched Security Copilot.
Starting point is 00:03:39 And all it was was a chat bot, you can ask it to enrich a particular IP address. You can ask it to summarize a particular log line. But that's pretty much it. But yeah, I was saying the last two years, the technology has matured extensively, where there are a number of organizations using AI agents within SOC in production. And as they see more actual real world impact, see words get spread across the community.
Starting point is 00:04:16 And I think nowadays, the percentage of people who are skeptical of the technology has dramatically decreased compared to even a year ago. I'm wondering though, like to what degree people feel comfortable using it, right? In a SOC context. Cause as you point out, you know, stuff like Copilot, stuff like Cursor, like that is just work a day now, right? Everybody kind of uses it, but they can dial up and dial down like where they want to use it. Cause it's like, it's one of those sorts of tools, right? Everybody kind of uses it, but they can dial up and dial down like where they want to use it. Cause it's like, it's one of those sorts of tools, right? Where you use it in a development environment and you can just say, well, I want to use it here, but this bit I'll do manually. You know, sock work is really sort of workflow based, right?
Starting point is 00:04:57 So I'm guessing, you know, it's a little bit different in that you have to think ahead of like, well, where do I want to use an LLM to do this? And where do I want it to step back and kick it to a human? Like, is that part of the whole question of how this stuff is winding up in the SOC at the moment? Yeah, yeah, it is. You're absolutely right. Like with coding, co-pilots, to some extent, every time a developer is working on a project,
Starting point is 00:05:22 they are making a decision, a two-way door decision, whether they want Cursor to give it a try first, or they should just wing it themselves. So they are making this decision, whether I delegate this to Cursor to take a first step, or I just do this myself manually. But with SaaS specifically, most of the time, what we have seen is the human analysts are not looking at each alert and making a dynamic decision. Oh, for this alert, I want to
Starting point is 00:05:53 delegate it to Dropzone. Well, but I mean, that's the problem you're trying to solve, right? Which is there's too many alerts. So trying to, you know, if you're actually in a position where you have to decide which alert you want to AI triage, that's kind of useless, right? Yeah, absolutely. And that's kind of why, at least from what we have seen, the chatbots, the security chatbots of the world has not been tremendously successful. Because again, the challenge is there are so much to do in security. If you have to micromanage a chat bot and tell it exactly what to do, like every 30 seconds, then that's kind of more or less the feeder purpose. And this is where like for AI agents,
Starting point is 00:06:36 what was the most common way is to treat it as a new tier one. So feed all your alerts to an AI agent. The agent will perform the investigations. It will dismiss or close the false positives and then only escalate the suspicious or the malicious alerts. So that's kind of the most common workflow
Starting point is 00:06:58 like deployment model we have seen, which is leveraging AI agents as the new tier one, or you can say the AI filter or the AI meat shield that shields the rest of the team from the vast majority of the noise. It is interesting that most security products historically really focus on true positives. When you look at the detection product,
Starting point is 00:07:21 most of them are showing you how they were able to detect a five-step sophisticated APT attacks. But in reality, for AI SOC agents, the biggest value proposition is not detecting true positives or sophisticated multi-months, multi-hop intrusions. But instead, the biggest value proposition is reducing false positives. Because by the virtue of removing hay from the haystack, it makes finding the needles much easier.
Starting point is 00:07:55 Now, I bet already some people are listening to this and saying, well, well, hold on, buddy. Because what happens if you start dismissing true positives and flagging them as false positives, right? And that's always going to be the concern when you're looking at plugging in an AI model and trusting it to sit at the top of your detection stack and you give it authority to dismiss stuff. Like how can you assuage fears that there is some genuine attack going on and the model just doesn't know about it, doesn't think it's a big deal and just gets rid of it.
Starting point is 00:08:27 Like, you know, cause I'm guessing that's like a huge barrier when you're trying to sell into a new place is convincing them that it's actually accurate enough that it's not gonna, you know, give you a bad result there. Like what, you know, how can you assuage those fears? Yeah, there's definitely a couple of components. First and foremost, AI SOC agent vendors are, including us, all prioritized minimizing false negatives.
Starting point is 00:08:54 Meaning, when we say a security alert is benign, 99.9% of the time, it is actually benign. And this is where I will be transparent and frank. At this moment, looking at the technology, Xero will always be a degree of hallucination. There is no way to completely remove all hallucinations from the large language models. Any vendor who claim they have figured out a magic way to remove all hallucinations,
Starting point is 00:09:23 they should be acquired by OpenAI for like $20 billion. Because I'm sure OpenAI and Google would love to know the magic sauce to remove all hallucinations. Yeah, so this is not a little problem that a security startup is going to fix. This is a fundamental large language model issue. Correct.
Starting point is 00:09:43 But what security startups can do is build processes, systems, and engineers in modules in a way where the level of hallucination is controllable and manageable. And this is where you ask, hey, how can I trust an AI SOC agent to not make mistakes? And our perspective is an AI SOC agent will make mistakes, but it's not about like whether it will make mistakes or not,
Starting point is 00:10:09 but it's more about the probability of making mistakes. And this is where like I was introduced to a concept recently that talks about the kind of the trade-off between leverage and uncertainty. So some of us who have been like a manager or business owner or tech lead are very familiar with this concept, which is sometimes you are given a project
Starting point is 00:10:32 and then you might have somebody else working for you. And then you are doing this mental calculus in your head. How long does it take me to do it? How long will it take my employee to do it? And how much can I trust my employee on doing the right thing or solving this problem in the same way that I want it to be solved? And anytime I think whether it's delegating tasks
Starting point is 00:10:58 to another human or delegating tasks to an AI agent, there's always this trade off, anytime you want to increase leverage, you're kind of sacrificing uncertainty or increasing uncertainty and increasing potential errors. So from our perspective, our goal is to build a system and we have already achieved it consistently, that is at or above the accuracy compared to a typical human tier one security analyst.
Starting point is 00:11:36 Yeah. I mean, you can benchmark this, right? Because SOCs are well logged, right? Decisions are well recorded. So you can actually benchmark an LLM-based product against people. Absolutely. And some of our, especially, MSSP or MDR service providers,
Starting point is 00:11:55 when they were POCing our technology, we often get put into a bake-off. So the service provider will gather 100 security alerts, they will run 100 through our system, and then they will build a spreadsheet. One column is what Dropzone has found, the other column is what their team has found, and we will compare and contrast.
Starting point is 00:12:19 I will definitely tell you that oftentimes, when you run through exercises like these, the first thing you noticed is even different members of the team might mark the same alert in different ways, because there's always a difference in opinion. But even beyond that, the accuracy of an AI,
Starting point is 00:12:43 like our AI stock analyst is definitely on par, if not sometimes meaningfully better than the human team members. Now, you just touched on something interesting there, which is that you have to manage an LLM or have expectations around an LLM similarly to how you would have expectations of a human staff member.
Starting point is 00:13:02 What I'm seeing, I'm seeing some interesting stuff in AI around multi-agent sort of deployments, where you almost have an AI that has a role, that can play that role of being a supervisor to the core LLM that's doing most of the work. I mean, is that something that you've played with as well, at Dropzone, which is having a supervisory model observing your sort of log processing and investigation model, you know, and can you even have multiple models doing the investigations and then you can
Starting point is 00:13:35 evaluate like if there is some sort of disagreement between them, you might want to kick that up to a human. So I guess my question is like, you know, what's the role of sort of multi-agent in a tech stack like this? Yeah, so kind of similar to to kind of how, you know, we operate as humans. I think sometimes we felt like there are multiple voices in our head, right? As a father, I should prioritize X over Y. As an entrepreneur, I should prioritize Z over X, right? Stuff like that.
Starting point is 00:14:06 So yes, absolutely. And what we have seen with large language models is giving some different personas really helps them to specialize. And by doing that, you're able to build more complex end-to-end workflows that you couldn't have with like a single persona. So we definitely leverage what we call multi-personas within our system to that's specialized, each
Starting point is 00:14:37 being specialized in a specific function. And things like self-reflection, which is you ask model to do one thing and then you ask it, or another module to critique itself, is a very common technique to increase the accuracy of the output of specific functions. One very common example is, for example, you want a large language model to generate an SPL query, so a Splunk query. The model might generate something,
Starting point is 00:15:09 and that query might or might not work. And a very common technique to improve the accuracy of that query is use another module to nitpick the query generated by the first module to spot kind of mistakes. Hey, you misspelled user instead of user underscore name, it should be user space name, for example, as a field and stuff like that.
Starting point is 00:15:33 Very similar to, I think most of us, when we were in schools, when we are taking exams, especially math exams, I think most of us, when we complete all the questions, we will go back and revisit our answers. Again, critiquing ourselves. This self-reflection is definitely a very common technique. Then using a multi-model,
Starting point is 00:15:53 like different prompts, different temperatures, different models to generate the same output and then compare and contrast. It's a little bit like voting. When you ask three people about a certain topic and you pick the most agreed upon answer, that is going to further boost the accuracy of the outputs. Yeah, right.
Starting point is 00:16:24 So this is absolutely a thing that's happening. Because a friend of mine, he went to some Microsoft demo, which he was blown away by where they got it to build like a Scrabble game or something. But it was the multi-model part that was incredible to him where there's like a model that's a project manager that deals with the other models and yells at them when they get stuff wrong. And he just said it was incredible watching all these little AIs going off and doing stuff.
Starting point is 00:16:51 I mean, are you currently doing this, are you, with the multiple model approach? Yeah, yeah, absolutely. To give you another example, alert investigation, it's a little bit like being a detective. You can kind of technically go on forever. You can investigate to the nth degree. So one module we have is kind of like an accountant,
Starting point is 00:17:13 where it's keeping track of the progress made by the investigator components and trying to identify when the marginal utility of additional CPU cycles or additional time spend on this alert. Yeah, at the point that it's analyzing one gigabyte crash dumps, it might be time to tell it to chill out, right? Yeah, or looking at IPs associated with another username, associated with another IPs that might correlate to alert. Again, a lot of these after certain points here is decreasing marginal utility.
Starting point is 00:17:50 Yeah, no, that makes a lot of sense. So when we start looking outside the SOC, right, which I know is not what you do, you know, obviously I'm working now with Dasebel, which is one of the backers of your company, right? And, you know, everybody's all looking for ways to invest in AI companies that are doing interesting things. I think it's got some applicability pretty much everywhere. I think the clearest use case, day one, is the SOC stuff.
Starting point is 00:18:17 It's the type of stuff that you're doing. But obviously, as someone who is running an AI startup, you've got your finger on the pulse, I'm guessing, of where people are making progress in other areas of cybersecurity. Where do you see the exciting stuff happening there? Yeah, from what I've seen is, obviously there are different ways to prioritize different chunk of tasks. But from our perspective, what we have seen is most people are prioritizing the work that's the most manual as well as highest quantity. Because if we were to build a module or product that automates stuff, you might as well start with the most laborious and
Starting point is 00:19:02 the highest quantity tasks within the security program. So we have seen definitely pen testing. That's one where throwing spaghetti at the wall is not the most fun thing. Or being a manual fuzzer is not the most fun thing that somebody could do. See, the other one we have seen a lot of success so far
Starting point is 00:19:26 is in code reviews. Again, code review, I don't think any of us wake up in the morning and gets excited about reviewing code. But at the end of the day, for any fast growing application or business, there are a lot of code commits that will love or benefit from security reviews. Man, I got a friend who has just played around
Starting point is 00:19:55 with some generic models and figured out how to prompt them in such a way that he thinks it's the end of the SaaS industry. And he says there's no moat, so it's not something he's going to turn into a startup, because it's all done with commodity models. And he's like, if you know what you're doing, you could throw code into them, and just all the bugs fall out.
Starting point is 00:20:15 It's coming. That's definitely coming. Yeah. Yeah, like code analysis, like I did my PhD in program analysis. Definitely spent a lot of times in a previous life, looking at code, looking at syntax trees, basic blocks, and stuff like that. Yeah, large language models are very good at understanding code.
Starting point is 00:20:36 I do think there are still challenges, especially where the code base is very large. If you have a 100-line Python script, I would not be surprised if ChatGPTSS already does a tremendous job of spotting the issues. But when you have a more complex code base with complex interactions with internal libraries or proprietary libraries or APIs.
Starting point is 00:20:58 Yeah, a million dependencies and dependencies on dependencies. And yeah, you're just going to run out of space, aren't you? Yeah, and also this requires the model to really understand the different context of your code. And this is where even in SOC, what we have seen is initially most of the AI SOC startups like us focus first on building integrations. But we are getting to a place where
Starting point is 00:21:29 most of the integrations are already built. What we have seen is the difference between a mature product and immature product now moves down to ability to build context. Because a mature AI SOC analyst will be able to come into your environment through a combination of integrations and other means, really understand your organizational policies, preferences, and practices.
Starting point is 00:21:58 Versus a naive or immature AI SOC analyst or AI SOC product will come in and be like, okay, I marked this alert as malicious because I saw it as malicious, even though the company might have a policy saying this kind of logging activity is actually expected. I mean, it's probably worth pointing out too that one of the issues that you've had running this business is, I think some people expect AI magic to fix their problems when they just have a terrible detection stack right so you go in there and the source data is patchy like really patchy so your agent can't collect the context it needs to make decisions and whatnot so just to be clear like an AI SOC
Starting point is 00:22:39 analyst is only gonna work well when you've got a detection stack that's pulling in the right information to begin with. I mean, people are, you know, some people expect a little bit too much, right? Which is that an AI agent is going to be able to infer things without actually collecting good context. Yeah, yeah. I would say we have run into a number of cases,
Starting point is 00:23:02 for example, you know, our technology is asked to investigate AWS alerts when there are no AWS logs at all, either in AWS itself or within their SIEM. So obviously, in that case, it's technically impossible to investigate those alerts if there are no logs at all. So yes, like an AI SOC agent is not going to fix the visibility problem. If you don't have logs in certain parts of your business, then, you know, an AI agent is not going to be able to fix it for you. With regards to patchy detections, we have seen cases like, for example, within our product, when we see the same false positive happening over and over and over again, our technology will propose recommendations, like tweaks on the detection rules, to help tone down the noise.
Starting point is 00:23:56 So I would say that's actually a little bit easier to solve. The opposite problem, yeah, yeah, yeah. Than trying to ask, it's the opposite problem, which is you're asked to cook a dish when you don't have any of our ingredients. Yeah. Now look, another thing I wanted to ask you about, and it's been quite the thing on social media over the last week, is this paper that was written, I think, by an Apple intern looking at large reasoning models and about how they're not actually, they don't really appear to
Starting point is 00:24:22 be more accurate than large language models when asked to do reasoning tasks. And in fact, when tasks get to a certain level of complexity, both LLMs and LRMs are not all that useful, right? Which I don't quite understand why people are so surprised by this. Because when we see where the wins are with LLMs, it's the stuff that you're talking about, like high volume,
Starting point is 00:24:46 kind of menial stuff that nobody wants to do that's sort of semi repetitive and requires diligence. You know, I mean, a lot of the reason people miss sock alerts is because sitting in front of a same console all day is boring and mind numbing. And this isn't a problem experienced by computers. Like it just isn't. But I wanted to ask you what you made of that paper. Like was there anything in there that was surprising to you?
Starting point is 00:25:07 Anything you agree with or disagree with as someone who's using these sorts of models? Yeah, I think there are different ways to, that was definitely an interesting paper. Some say, you know, Apple is just jealous of kind of being a little bit left behind by everybody else. But yeah, I think from our perspective, like part of the art of using large language models
Starting point is 00:25:31 is task-deep composition. And what I mean by that is similar to like asking a single person to build a business, that will be very difficult. But most modern projects, whether it's a modern business or a Manhattan project, involves a large number of different type of specialists doing their special thing, but working in unison to really achieve a very complex end-to-end, or solve a complex problem end-to-end.
Starting point is 00:26:08 So generally, if you expect a single large language model invocation to be able to perform very complex tasks, I think that's kind of misaligned. Expectation, most of the large language model or AI agent developers like us are decomposing complex tasks into small cognitive steps. Each of them frankly should be trivially solvable by a middle schooler. So for example, when our AI SOC agent is looking at an alert
Starting point is 00:26:40 and trying to make sense of this alert and investigate it, on average, our system makes close to 100 distinct large language model invocations. Again, by breaking down alert investigation into small cognitive steps. Yeah, I mean, this is, it's interesting. When you said pen testing earlier, like as something that's ripe for sort of disruption
Starting point is 00:27:03 with LLMs, I know that there's a lot of pen testers who would wince at that and say, no, that's not possible. And look, I mean, I think to a degree they're right, like real elite level sort of pen testing is going to require that pen tester brain, which is a rare type of brain, but there's so much of the pen testing workflow where the tricky part is understanding which steps to do next and why. But the steps themselves are actually quite simple. So I think, you know, I think that we might wind up in a situation where a lot of the
Starting point is 00:27:35 cool technology work is actually teaching the LLMs how to do certain things, right? Like I can see that as being something that, you know, like if you're a pen tester, you might teach a model, hey, there's this type of check that I figured out how to do. You teach the model how to do it. And then when you actually want to get around to doing the check, it's just as simple as asking the model to do it. So and then of course, you know, with these multi-model approaches, you might be able
Starting point is 00:27:59 to have models which will understand better which, which checks you want to apply in which context and whatnot. So but I think you're right. It's about breaking those things down, isn't it? Into those simple steps and just thinking about those problems in terms of I have an army of middle schoolers who will do whatever I want at basically infinite speed. Like how can I instruct these 14 year olds on how to do stuff?
Starting point is 00:28:21 Is that, you know, that's kind of the way I think about it. Is it the way you think about it as well? Yeah, yeah, absolutely. I think a lot of people use phrases like force multiplication or up-leveling. Like one analogy we generally use is we want to up-level the human security engineers and human security analysts to be like the generals
Starting point is 00:28:42 and special forces, where they have an army of AI middle schoolers or AI foot soldiers, that's listening to their commands and doing whatever they instructed. This is also where one thing we have seen as we work with different organizations of different sizes and maturity is actually making sure the AI agent is coachable, like listening to instructions is quite important. You and I have talked about that before, because you actually had to do quite a lot of work there
Starting point is 00:29:18 to get that coachability into the models that you're using. Yeah, and I also think it's a very key component of this trust building. I think I use analogy like everybody has experience working with smart jerks that are very stubborn and do not take any inputs or feedback or suggestions from team members. But I think all of us probably also
Starting point is 00:29:39 have experience working with somebody who's junior, but tremendously coachable. And after a couple of months, that junior person is actually outperforming somebody who is more senior because they are so coachable and they're absorbing everything you taught them. And we are kind of seeing something similar within the AI SOC agent space, where there are, every environment is different.
Starting point is 00:30:01 And sometimes a very coachable AI SOC agent can kind of actually become significantly more valuable to an organization than maybe a smarter out-of-the-box agent but that's very stubborn. Yeah, yeah. No, I mean it's I think we're actually at the fun part from my perspective when it comes to AI, because we've got a better understanding of what it's useful for. And of course, that's going to change, right? But yeah, we're getting a better idea of how to use it,
Starting point is 00:30:31 what it's good at, what it's not so good at yet. Ed Wu, we're going to wrap it up there. Always a pleasure to chat to you, my friend, and pick your brain on this stuff. We learn a lot. So thanks a lot for your time, and I'll be chatting to you again soon. Thank you for having me.
