The a16z Show - Enabling Agents and Battling Bots on an AI-Centric Web
Episode Date: July 4, 2025

Taken from the AI + a16z podcast, Arcjet CEO David Mytton sits down with a16z partner Joel de la Garza to discuss the increasing complexity of managing who can access websites and other web apps, and what they can do there. A primary challenge is determining whether automated traffic is coming from bad actors and troublesome bots, or perhaps AI agents trying to buy a product on behalf of a real customer.

Joel and David dive into the challenge of analyzing every request without adding latency, and how faster inference at the edge opens up new possibilities for fraud prevention, content filtering, and even ad tech.

Topics include:
Why traditional threat analysis won’t work for the AI-powered web
The need for full-context security checks
How to perform sub-second, cost-effective inference
The wide range of potential actors and actions behind any given visit

As David puts it, lower inference costs are key to letting apps act on the full context window: everything you know about the user, the session, and your application.

Follow everyone on social media: David Mytton, Joel de la Garza

Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts.

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
50% of traffic is already bots, it's already automated,
and agents are only really just getting going.
Most people are not using these computer use agents
because they're too slow right now.
They're still like previews,
but it's clear that's where everything is going.
Then we're going to see an explosion in the traffic that's coming from these tools
and just blocking them just because they're AI is the wrong answer.
You've really got to understand why you want them,
what they're doing, who they're coming from,
and then you can create these granular rules.
AI agents are changing how people interact with the web, but most sites still treat them like bots.
In this episode, taken from the AI + a16z podcast, a16z partner Joel de la Garza talks with Arcjet CEO David Mytton about building internet infrastructure for this new era.
Here's Derek to kick things off.
Thanks for listening to the a16z AI podcast.
If you've been listening for a while, or if you're at all plugged into the world of AI, you've no doubt heard of AI agents and all the amazing things they theoretically can do.
But there's a catch.
When it comes to engaging with websites,
agents are limited by what any given site allows them to do.
If, for example, a site tries to limit all non-human interactions
in an attempt to prevent unwanted bot activity,
it might also prevent an AI agent from working on a customer's behalf,
say, making a reservation, signing up for a service, or buying a product.
This broad-strokes approach to site security is incompatible
with the idea of what some call agent experience,
an approach to web and product design
that treats agents as first-class users.
In this episode,
a16z infra partner Joel de la Garza
dives into this topic with David Mytton,
the CEO of Arcjet,
a startup building developer-native security
for modern web frameworks,
including attack detection,
sign-up spam prevention, and bot detection.
Their discussion is short,
sweet, and very insightful.
And you'll hear it after these disclosures.
As a reminder,
please note that the content here is for informational
purposes only, should not be taken as legal, business, tax, or investment advice, or be used to
evaluate any investment or security, and is not directed at any investors or potential investors
in any a16z fund. For more details, please see a16z.com/disclosures.
It seems like what once was old is new again, and I would love to get your thoughts on
this new emergence of bots and how, while we know all the bad things that happen
with them, there's actually a lot of good and really cool stuff that's happening and how we can
maybe work towards enabling that.
Right.
Well, things have changed, right?
The DDoS problem is still there, but it's almost handled as a commodity these days.
The network provider, your cloud provider, they'll just deal with it.
And so when you're deploying an application, most of the time, you just don't have to think
about it.
The challenge comes when you've got traffic that just doesn't fit those filters.
It looks like it could be legitimate, or maybe it is
legitimate and you just have a different view about what kind of traffic you want to see.
And so the challenge is really about how do you distinguish between the good bots and the bad
bots? And then with AI changing things, it's bots that might even be acting on behalf of
humans, right? It's no longer a binary decision. And as the amount of traffic from bots increases,
like in some cases, the majority of traffic that sites are receiving is from an automated
source. And so the question for site owners is, well, what kind of traffic do you
want to allow?
And when it's automated,
what kind of automated traffic
should come to your site?
And what are you getting in return for that?
And in the old days,
I mean, I guess the old providers,
we'll say the legacy providers in this space,
like it was very much using a hammer, right?
So they would say, hey, if this IP address is coming in,
it's probably a bot.
Or they would say, if this user agent is coming in,
it's probably a bot.
Very imprecise.
And I think the downside of that is that you probably
blocked a lot of legitimate traffic,
along with illegitimate traffic.
And now there's very real consequences
because some of these AI bots
could be acting on behalf of
actual users
who are looking to purchase your products.
This is the challenge.
So a volumetric DDoS attack,
you just want to block that at the network.
You never want to see that traffic.
But everything else needs the context of the application.
You need to know where in the application
the traffic is coming to.
You need to know who the user is, the session,
and to understand in which case you want to allow or deny that.
And so this is the real issue for developers, for site owners, for security teams, is to make those really nuanced decisions to understand whether the traffic should be allowed or not.
And the context of the application itself is so important because it depends on the site.
If you're running an e-commerce operation, an online store, the worst thing you can do is block a transaction because then you've lost the revenue.
Usually you want to then flag that order for review, and a human customer support person is going to come in and determine, based on
various signals, whether to allow it.
And if you just block that at the network,
then your application will never see it.
You never even know that that order failed in some way.
There's been a lot of media releases about companies
that have released solutions in this space,
but largely they were based on sort of those old kind of approaches
using network telemetry.
Is that generally how they're working now?
Or are there some other capabilities that they've released?
Because they give them AI names and you just
immediately assume that they're doing something fancy.
That's right, yeah.
So blocking on the network is basically how the majority of these old-school products
work.
They do analysis before the traffic reaches your application, and then you never know
what the result of that was.
And that just doesn't fly anymore.
It's insufficient for being able to build modern applications, particularly with AI
coming in where something like OpenAI has four or five different types of bots, and
some of them you might want to make a more restrictive decision over.
but then others are going to be taking actions on behalf of a user's search.
And we're seeing lots of different applications getting more sign-ups,
businesses actually getting higher conversions as a result of this AI traffic.
And so just blocking anything that is called AI is too blunt of an instrument.
You need much more nuance.
And the only way you can do that is with the application context,
understanding what's going on inside your code.
I mean, I'd say we're seeing across the industry that AI is driving incredible amounts of new revenue
to companies. And if you use an old-world tool to just block any of that traffic, you're probably
hurting your business. That's right. Or you're putting it into some kind of maze where it's
seeing irrelevant content. And then by doing that, you are kind of downranking your site because
the AI crawler's never going to come back. It's kind of like blocking Google from visiting your
site. Yeah, Google doesn't get in, and then you're no longer
in Google's index. And so anyone searching is not going to find you as a result.
Well, and I believe we had sort of standards in the old days that developed, or quasi-standards like robots.txt, right, which would tell the crawlers, hey, don't crawl these directories. Are we doing something similar for this new-age, agentic world?
So, robots.txt is still the starting place, and it's kind of a voluntary standard. It evolved several decades ago now. It's been around a long time. Bots have been a problem for a long time. And the idea is that you describe the areas of your application
and tell any robot that's coming to your site
whether you want to allow that robot to access that area of the site or not.
And you could use that to control the rollout of new content.
You could protect certain pages of your site
that you just don't want to be indexed for whatever reason.
And you can also point the crawler to where you do want it to go.
You can use the site map for that as well.
But the robots.txt file format has evolved over time
to provide these signals to crawlers,
like search engines from Google,
and so on.
The challenge with that is it's voluntary,
and there's no enforcement of it.
And so you've got good bots like Googlebot
that will follow the standard,
and you'll be able to have full control over what it does.
But there are new bots that are ignoring it,
or even sometimes using it as a way to find the parts of your site
that you don't want it to access,
and they will just do that anyway.
And so this becomes a control problem for the site owner.
And you really want to be able to understand
not just what the list of rules are,
but how they are enforced.
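As a rough illustration of the robots.txt mechanism described here, the sketch below uses Python's standard urllib.robotparser to read a made-up robots.txt and ask whether a crawler may fetch a path. The file contents, the paths, and the example.com URLs are assumptions for illustration only. Note that this check runs inside the crawler: nothing on the server side enforces it, which is the control problem raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site. Paths and URLs are made up.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A well-behaved crawler asks before fetching. A badly behaved one simply
# skips this step, or reads the Disallow lines as a map of what to fetch.
print(rules.can_fetch("Googlebot", "https://example.com/blog/post"))
print(rules.can_fetch("Googlebot", "https://example.com/drafts/new"))
print(rules.can_fetch("SomeOtherBot", "https://example.com/admin/users"))
print(rules.site_maps())  # where the site would like crawlers to go
```

Because compliance is voluntary, a site that needs guarantees has to pair these published rules with server-side checks.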
Totally.
Maybe it'd be great to walk through what these agents are,
maybe get some more understanding of sort of how they operate,
what people are using them for,
perhaps go through a couple of the use cases.
And then it'd be great to understand sort of like how you do control it,
because it seems like a far more complicated problem
than just bad IP addresses.
Right.
So if we think about OpenAI as an example,
because they have four or five different crawlers,
and they all have different names,
and they will identify themselves in different ways.
So one actually is crawling to train the OpenAI models on your site.
And that's the one that probably everyone is thinking about
when they're thinking about I want to block AI, the training.
And you have different philosophical approaches
to how you want to be included in the training data.
The others are more nuanced and will require more thought.
So there's one that will go out when a user is typing something into the chat
and has asked a question and OpenAI will go out and
search. It's built up its own search index. And so that's the equivalent of Googlebot. You probably
want to be in that index because as we're seeing, sites are getting more sign-ups, are getting more
traffic. For the discovery process, being part of just another search index is super important.
Gotcha. So like when I ask OpenAI, when is John F. Kennedy's birthday? If it doesn't know the
answer, it goes out and searches the web. Yeah, that's right. Or if it's trying to get open hours
for something, it might go to a website for a cafe or whatever and parse it and then return the
results. So that's really just like a classic search engine crawler, except it's kind of happening
behind the scenes. The other one is something that's happening in real time. So you might give
the agent a specific URL and go and ask it to summarize it or to look up a particular question
in the docs for a developer tool or something like that. And then that's a separate agent that
will go out, it will read the website, and then it will return and answer the query. For both of
these two examples, OpenAI and others are now starting to cite those
sources and you'll regularly see, and this is kind of the recommendation, is you get the result
from the AI tool, but you shouldn't trust it 100%. You go and then verify and you look at the docs,
and maybe it's like when you used to go to Wikipedia and you'd read the summary, and then you'd
look at the references, and you'd go to all the references and check to make sure what had
been summarized is actually correct. But all three of those examples, you clearly could see
why you would want them accessing your site. Right. Like blocking all of OpenAI's crawlers is probably
a very bad idea. Yeah, it's too blunt. It's too blunt an instrument. You
need to be able to distinguish each one of these and determine which parts of your site you want
them to get into. And this then comes to the fourth one, which is the actual agent. And the agent,
the computer operator type feature that is... Headless web browsers. Yeah, but even a web browser,
a full web browser operating inside a VM. And those are the ones that require more nuance,
because maybe you're booking a ticket or doing some research and you do want the agent to take actions
on your behalf.
Maybe it's going through your email inbox and triaging things.
From the application builder's perspective, that's probably a good thing.
You want more transactions.
You want more usage of your application.
But there are examples where it might be a bad action.
So, for example, if you're building a tool that is going to try and buy all of the concert tickets
and then sell them on later, that becomes a problem for the concert seller because they don't
want to do that.
They want the true fans to be able to get access to those.
And again, you need the nuance.
Maybe you allow the bot to go to the homepage and sit in a queue,
but then when you get to the front of the queue,
you want the human to actually make the purchase
and you want to rate limit that so that maybe the human can only purchase,
let's say, five tickets.
You don't want them to purchase 500 tickets.
And so this gets into the real details of the context,
each one about what you might want to allow
and what you might want to restrict.
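The ticketing example can be sketched as a small application-level rule: automated clients may wait in the queue, but the purchase step requires a human and is capped per user. The class name, the return values, and the is_automated flag are hypothetical. How a real application sets that flag is the harder problem discussed in the rest of the conversation.

```python
from dataclasses import dataclass, field

MAX_TICKETS_PER_USER = 5  # the "five tickets" figure used in the conversation


@dataclass
class TicketLimiter:
    """Tracks tickets bought per user and enforces a per-user cap."""

    limit: int = MAX_TICKETS_PER_USER
    purchased: dict[str, int] = field(default_factory=dict)

    def try_purchase(self, user_id: str, quantity: int, is_automated: bool) -> str:
        # Assumption: an earlier layer has already classified the client.
        if is_automated:
            return "human_required"
        already = self.purchased.get(user_id, 0)
        if quantity <= 0 or already + quantity > self.limit:
            return "denied"
        self.purchased[user_id] = already + quantity
        return "allowed"


limiter = TicketLimiter()
print(limiter.try_purchase("fan-1", 3, is_automated=False))
print(limiter.try_purchase("fan-1", 3, is_automated=False))  # would exceed the cap
print(limiter.try_purchase("agent-7", 1, is_automated=True))
```

The point of the sketch is that the rule depends on the route and the user, which a network-level filter never sees.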
That's incredibly complicated.
I mean, if I remember back why we made a lot of the decisions we made
in blocking bots was strictly because of scale.
So, you know, you've got 450,000 IP addresses sending you terabits of traffic through a link that only can do gigabit,
and you've got to just start dropping stuff, right?
And you take, you know, it's the battlefield triage of the wounded, right?
It's like some of you aren't going to make it, and it becomes a little brutal.
That sounds incredibly sophisticated.
How do you do that sort of fine-grained control of traffic flow at Internet scale?
So this is about building up layers of protections.
So you start with robots.txt,
just managing the good bots, then you look at IPs and start understanding, well,
where's the traffic coming from? In an ideal scenario, you have one user per IP address,
but we all know that that doesn't happen. That never happens. And so you can start to build
up databases of reputation around the IP address. And you can access the underlying metadata
about that address, knowing which country it's coming from or which network it belongs to.
And then you can start building up these decisions thinking, well, we shouldn't really be getting
traffic from a data center for our signup page.
And so we could block that network.
But it becomes more challenging if we have that agent example.
The agent with a web browser or headless browser is going to be running on a server somewhere.
It's probably in a data center.
And then you have the compounding factor of the abusers will purchase access to proxies
which run on residential IP addresses.
So you can't easily rely on the fact that it's part of a home ISP block anymore.
And so you have to build up these patterns,
understanding the reputation of the IP address.
Then you have the user agent string.
That is basically a free text field
that you can fill in with whatever you like.
There is kind of a standard there,
but the good bots will tell you who they are.
It's been surprising, getting into the details
of this, how many bots actually tell you who they are.
And so you can block a lot of them just on that heuristic
combined with the IP address.
Or allow them.
Or allow them.
Yeah.
I'm the shopping bot from OpenAI.
Right.
Come on in and buy some stuff.
Exactly.
And Googlebot, OpenAI,
they tell you who they are, and then you can verify that by doing a reverse DNS lookup on the IP address.
So even though you might be able to pretend to be Googlebot, you can check to make sure that that's the case or not with very low latency lookups.
So we can verify that, yes, this is Google, I want to allow them.
Yes, this is the OpenAI bot that is doing the search indexing.
I want to allow that.
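The reverse DNS check mentioned here is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it yields the same IP. The sketch below assumes the googlebot.com and google.com suffixes that Google documents for Googlebot. Other operators may publish IP ranges or different hostnames instead, so treat the suffix list as an assumption. The function names are illustrative, and the lookups are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable

# Assumption: hostname suffixes documented by Google for Googlebot.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list[str]:
    """A-record lookup: hostname to IPv4 addresses (IPv6 is not covered here)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_bot(
    ip: str,
    suffixes: tuple[str, ...],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], list[str]] = forward_lookup,
) -> bool:
    """Return True only if reverse and forward DNS agree and the domain matches."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(suffixes):
            return False
        # Anyone can set a PTR record on their own IP, so the forward
        # lookup on the claimed hostname must point back to the same IP.
        return ip in forward(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

In practice the result would be cached per IP so the lookup does not add latency to every request.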
The next level from that is building up fingerprints and fingerprinting the characteristics of the request.
And this started with the JA3 hash, which was invented at Salesforce,
and has now been developed into JA4.
Some of them are open source, these algorithms,
some of them are not.
So essentially you take all of the metrics around a session
and you create a hash of it,
and then you stick it in a database.
Exactly.
And you look for matches to that hash.
You look for matches.
And then the idea is that the hash will change based on the clients
so you can allow or deny certain clients.
But if you have a huge number of those clients all spamming you,
then they all are the same.
They all have the same fingerprint,
and you can just block that fingerprint.
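A minimal sketch of the fingerprint idea, modeled on how JA3 is publicly described: selected TLS ClientHello fields are joined into a string and hashed with MD5. Parsing a real ClientHello and filtering out GREASE values are left out, and the numeric values and the threshold below are made up for illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable, Sequence


def ja3_string(
    version: int,
    ciphers: Sequence[int],
    extensions: Sequence[int],
    curves: Sequence[int],
    point_formats: Sequence[int],
) -> str:
    """Join ClientHello fields: commas between fields, dashes within a field."""
    lists = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in values) for values in lists]
    return ",".join(fields)


def ja3_hash(*fields) -> str:
    """MD5 of the JA3 string. Used as an identifier, not for security."""
    data = ja3_string(*fields).encode("ascii")
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def noisy_fingerprints(fingerprints: Iterable[str], threshold: int) -> set[str]:
    """Fingerprints seen at least `threshold` times in the observed window."""
    counts = Counter(fingerprints)
    return {fp for fp, n in counts.items() if n >= threshold}


# Hypothetical traffic: one client profile floods the site, another appears once.
spammy = ja3_hash(771, [4865, 4866], [0, 11, 10], [29, 23], [0])
normal = ja3_hash(771, [4866, 4865], [0, 10, 11], [29], [0])
blocklist = noisy_fingerprints([spammy] * 500 + [normal], threshold=100)
print(spammy in blocklist, normal in blocklist)
```

Because the hash depends on field order, a client that reorders its ciphers gets a new fingerprint, which is the weakness later versions of these algorithms try to address.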
So this is almost like, if you think of,
you know, I always
think of things in terms of the classic sort of network stack like, you know, layer zero up to layer seven.
Like this is almost like layer two level identity for devices, right?
Right. It's looking at the TLS handshake on the network level. And then you can go up the layers.
There's one called JA4H, which looks at the HTTP headers. And the earlier versions of this
would be working on the ordering of the headers, for instance. So an easy way to work
around it is just to shift the headers. The hashing has
improved over time so that even changing the ordering of the headers doesn't change the hash.
And the idea is that you can then combine all of these different signals to try and come to a
decision about who it is, basically, that is making the request.
And if it's malicious, you can block it based on that. And if it's someone that you want to allow,
then you can do so. And this is before you even get into kind of the user level, what's
actually happening in the application, right? That's right. Yeah. So this is the logic on top of that
because you have to identify who it is first
before you apply the rules about what you want them to do.
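The "identify first, then apply rules" ordering can be sketched as two small steps. The user-agent tokens below (GPTBot, OAI-SearchBot, ChatGPT-User, Googlebot) are ones the operators have published, but they may change, and the categories, paths, and actions are a hypothetical policy, not a recommendation. The verified flag is assumed to come from a separate check such as DNS or signature verification.

```python
# Assumption: published user-agent tokens mapped to made-up categories.
BOT_CATEGORIES = {
    "GPTBot": "ai_training",
    "OAI-SearchBot": "ai_search",
    "ChatGPT-User": "ai_user_fetch",
    "Googlebot": "search",
}

# Hypothetical policy: (category, path prefix, action). First match wins.
RULES = [
    ("ai_training", "/", "deny"),
    ("ai_search", "/", "allow"),
    ("ai_user_fetch", "/", "allow"),
    ("search", "/", "allow"),
    ("unverified_bot", "/", "deny"),
    ("human_or_unknown", "/checkout", "flag_for_review"),
    ("human_or_unknown", "/", "allow"),
]


def identify(user_agent: str, verified: bool) -> str:
    """Step 1: work out who is making the request."""
    ua = user_agent.lower()
    for token, category in BOT_CATEGORIES.items():
        if token.lower() in ua:
            # A claimed identity that fails verification is treated as spoofed.
            return category if verified else "unverified_bot"
    return "human_or_unknown"


def decide(user_agent: str, verified: bool, path: str) -> str:
    """Step 2: apply per-route rules for that identity."""
    category = identify(user_agent, verified)
    for rule_category, prefix, action in RULES:
        if rule_category == category and path.startswith(prefix):
            return action
    return "allow"


print(decide("Mozilla/5.0 (compatible; GPTBot/1.1)", True, "/blog"))
print(decide("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)", True, "/docs"))
print(decide("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0", False, "/checkout/pay"))
```

Keeping identification separate from policy means the same identity can be allowed on one route and restricted on another.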
Gotcha.
So it's almost like you're adding an authentication layer
or an identity layer to sort of the transport side.
That's right, yeah.
The application side, I guess you should say.
Yeah, the application, yeah.
But it's throughout the whole stack, the whole OSI model.
And the idea is you have this consistent fingerprint
that you can then apply these rules to.
And identity kind of layers on top of that.
And we've seen some interesting developments
in fingerprinting and
providing signatures based on who the request is coming from.
So a couple of years ago, Apple announced Privacy Pass,
which is a hash that is attached to every request you make.
If you're in the Apple ecosystem and using Safari on iPhone or on Mac,
then there is a way to authenticate that the request is coming from an individual
who has a subscription to iCloud.
And Apple has their own fraud analysis to allow you to subscribe to iCloud.
So it's an easy assumption to make that if you have a subscription,
and this signature is verified, then you're a real person.
There's a new one that Cloudflare recently published around doing the same thing for automated requests
and having a signature that's attached inside every single request,
which you can then use public key cryptography to verify.
These are all emerging as the problem of being able to identify automated clients increases
because you want to be able to know who the good ones are to allow them through
whilst blocking all the attackers.
Yeah, it's just like the old days with Kerberos, right?
Every large vendor is going to have their flavor.
Right.
And if you're a shop and you're trying to sell to everybody,
you've got to kind of work with all of them.
That's right.
And you just need to be able to understand,
is this a human and is our application built for humans?
And then you allow them.
Or is it that we're building an API?
Or do we want to be indexed and we want to allow this traffic?
It's just giving the site owner the control.
Yeah, I mean, I think it's what's really interesting to me is that in my own use
and in my own life,
like I interact with the internet less and less directly,
like almost every day.
And I'm going through some sort of AI type thing.
It could be an agent.
It could be a large language model.
It could be any number of things,
but I generally don't query stuff directly
as much as I used to.
And it seems like we're moving to a world
where almost the layer you describe,
the agent type activity you describe,
will become the primary consumer of everything on the internet.
Well, if 50% of the traffic is already bots,
is already automated, and agents are only really just getting going.
Most people are not using these computer use agents because they're too slow right now.
They're still like previews, but it's clear that's where everything is going.
Then we're going to see an explosion in the traffic that's coming from these tools
and just blocking them just because they're AI is the wrong answer.
You've really got to understand why you want them, what they're doing, who they're coming from,
and then you can create these granular rules.
I mean, I hate to use the analogy, but these things are almost like avatars.
They're running around on someone's behalf, and you need to figure out who that someone is and
what the objectives are, right, and control them very granularly.
And the old-school methods of doing that assume malicious intent, which isn't always the case,
and increasingly is going to be not the case because you want the agents to be doing things.
And the signals just no longer work when you're expecting traffic to come from a data center
or you're expecting it to come from an automated Chrome instance.
and being able to have the understanding of your application
to dig into the characteristics of the request
is going to be increasingly important in the future
of distinguishing how criminals are using AI.
What we've seen so far is either training
and people have that opinion of whether they want to train or not
or it's bots that maybe have got something wrong.
They're accessing the site too much
because they haven't thought about throttling
or they're ignoring robots.txt
rather than looking at agents.txt,
which is distinguishing between an agent you want to access your site and some kind of crawler.
And the examples that we've seen are just bots coming to websites and just downloading the content continuously.
There's no world where that should be happening.
And this is where the cost is being put on the site owner because they currently have no easy way to manage and control the traffic that's coming to their site.
Directionally, things are improving because if you looked back 18 months,
the bots had no rate limiting.
They were just downloading content all the time.
Today, we know that these bots can be verified.
They are identifying themselves.
They are much better citizens of the Internet.
They are starting to follow the rules.
And so over the next 18 months, I think we'll see more of that,
more of the AI crawlers that we want,
following the rules, doing things in the right way.
And it will start to split, making it a lot easier to detect the bots with criminal intent.
And those are the ones that we want to be blocking.
So with the transition of bots from being these entities on the Internet that represent third parties and organizations
to this new world where these AI agents could be representing organizations, they could be representing customers,
they could be representing any number of people, and this is probably the wave of the future.
It seems to me like detecting that it's AI or a person is going to be an incredibly difficult challenge.
And I'm curious, like, how are you thinking about proving humanness on the Internet?
Right. Proofing is a tale as old as time. There's a NIST working group on proofing identity that's been running, I think, for 35 years. And it still hasn't really gotten to something that's implementable. There's 15 companies out there, right? The first wave of rideshare services and gig economy type companies needed to have proofing, right, because you're hiring these people in remote places where you don't have an office. And it's still not a solved problem. I'm curious. Like, it feels like maybe AI can help get us there or maybe there's something that's happening in that space.
Right. Well, the pure solution is a digital signature, right? But we've been talking about that for so long. And the UX around it is basically impossible for normal people to figure out. And it's why something like email encryption, no one encrypts their email. You have encrypted chat because it's built into the app and it can do all the difficult things like the key exchange behind the scenes. So that solution isn't really going to work. But AI has been used in analyzing traffic for at least a decade. It's just that
it was called machine learning.
And so you start with machine learning,
and the question is,
well, what does the new generation of AI allow us to do?
The challenge with the LLM-type models
is just the speed at which they are doing analysis,
because you often want to take a decision on the network
or in the application within a couple of milliseconds,
otherwise you're going to be blocking the traffic
and the user's going to become annoyed.
And so you can do that with kind of classic machine learning models
and do the inference really quickly.
And I think the interesting thing
in the next few years is going to be how we take this new generation of generative AI
using LLMs or other types of LLM-like technology to do analysis on huge traffic patterns.
I think that can be done in the background initially, but we're already seeing new edge
models designed to be deployed to mobile devices and IoT that use very low amounts of system
memory and can provide inference responses within milliseconds. I think those are going to start to be deployed
to applications over the next few years.
I think you're exactly right.
I think so much of what we're seeing now
is just being restricted by the cost of inference.
And that cost is dropping incredibly fast, right?
We saw this with cloud where, like,
S3 went from being the most expensive storage
you could buy to being free, essentially free.
Glacier is essentially free, right?
Free as in beer, right? Whatever.
And so, like, we're seeing that even at a more accelerated rate
for inference, like the cost is just falling incredibly.
And then when you look at the capabilities,
of these new technologies, you can drop a suspicious email into ChatGPT and ask if it's suspicious,
and it's like 100% accurate.
If you want to find sensitive information, you ask the LLM, is this sensitive information,
and it's like 100% accurate.
Like, it's amazing.
Like, as you squint and look at the future, you can start to see these really incredible
use cases, right?
Like, to your point of inference on the edge, like, do you think we all end up eventually
with like an LLM running locally that's basically going to be
Clippy but for CISOs, like it pops up and says, hey, it looks like you're doing something
stupid. Like, is that kind of where you think we land? That's what we're working on is getting
this analysis into the process so that for every single request that comes through, you can
have a sandbox that will analyze the full request and give you a response. Whereas now you can wait
maybe two to five seconds to delay an email and do the analysis and decide whether to flag it
for review or send it to someone's inbox. Delaying an HTTP request for five seconds, that's
not going to work. And so I think the
trend that we're seeing with the
improvement in the
inference cost, but also the latency
in getting the inference decision,
that's going to be the key. So we can
embed this into the application. You've got
the full context window so you can add everything
you know about the user, everything about the session,
everything about your application,
alongside the request, and then come to a
decision entirely locally on your
web server, on the edge, wherever it happens to be running.
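The latency constraint described here can be sketched as a decision function with a time budget: run the classifier on the full context, and if it does not answer in time, fall back to a safe default rather than stalling the request. Everything below is hypothetical, including the function names, the context fields, and the rule-based stand-in where a local model would go.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

_executor = ThreadPoolExecutor(max_workers=4)


def build_context(request: dict, user: dict, session: dict) -> dict:
    """Bundle everything the application knows alongside the request."""
    return {"request": request, "user": user, "session": session}


def rule_based_classifier(context: dict) -> str:
    """Stand-in for a local model. A real one would score the whole context."""
    if context["session"].get("failed_logins", 0) > 5:
        return "deny"
    return "allow"


def decide_within_budget(
    classify: Callable[[dict], str],
    context: dict,
    budget_seconds: float,
    fallback: str = "flag_for_review",
) -> str:
    """Return the classifier's answer, or the fallback if it is too slow."""
    future = _executor.submit(classify, context)
    try:
        return future.result(timeout=budget_seconds)
    except FutureTimeout:
        # The slow call keeps running in the background; its result is ignored.
        return fallback


ctx = build_context({"path": "/signup"}, {"id": "u1"}, {"failed_logins": 0})
print(decide_within_budget(rule_based_classifier, ctx, budget_seconds=0.005))
```

Whether the fallback should allow, deny, or flag for review depends on the route, as in the e-commerce example earlier in the conversation.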
As I listen to you say that and describe
this process, all I can think is that advertisers
are going to love this.
It just seems like the kind of technology
built for sort of like, hey, he's looking at this product,
show him this one, right?
Yeah, super fast inference on the edge
coming to a decision.
And for advertisers, stopping click spam,
that's a huge problem.
And being able to come to that decision
before it even goes through your ad model
and the auction system.
Who would have ever thought
that non-deterministic, incredibly cheap compute
would solve these use cases, right?
We're in a weird world.
That's it for this episode.
Thanks again for listening.
And remember to keep listening
for some more great episodes.
As the AI space matures,
we need to start thinking more practically
about how the technology
coexists with the systems
and platforms we already use.
That's what we try to do here
and we'll keep examining
these questions in the weeks to come.
