The a16z Show - Enabling Agents and Battling Bots on an AI-Centric Web

Starting point is 00:00:00 50% of traffic is already bots, it's already automated, and agents are only really just getting going. Most people are not using these computer use agents because they're too slow right now. They're still like previews, but it's clear that's where everything is going. Then we're going to see an explosion in the traffic that's coming from these tools and just blocking them just because they're AI is the wrong answer.

Starting point is 00:00:24 You've really got to understand why you want them, what they're doing, who they're coming from, and then you can create these granular rules. AI agents are changing how people interact with the web, but most sites still treat them like bots. In this episode, taken from the AI + a16z podcast, a16z partner Joel de la Garza talks with Arcjet CEO David Mytton about building internet infrastructure for this new era. Here's Derek to kick things off. Thanks for listening to the a16z AI podcast. If you've been listening for a while or if you're at all plugged into the world of AI, you've no doubt heard of AI agents and all the amazing things they theoretically can do.

Starting point is 00:01:03 But there's a catch. When it comes to engaging with websites, agents are limited by what any given site allows them to do. If, for example, a site tries to limit all non-human interactions in an attempt to prevent unwanted bot activity, it might also prevent an AI agent from working on a customer's behalf, say, making a reservation, signing up for a service, or buying a product. This broad-strokes approach to site security is incompatible

Starting point is 00:01:28 with the idea of what some call agent experience, an approach to web and product design that treats agents as first-class users. In this episode, a16z Infra Partner Joel de la Garza dives into this topic with David Mytton, the CEO of Arcjet, a startup building developer-native security

Starting point is 00:01:45 for modern web frameworks, including attack detection, sign-up spam prevention, and bot detection. Their discussion is short, sweet, and very insightful. And you'll hear it after these disclosures. As a reminder, please note that the content here is for informational

Starting point is 00:02:00 purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com slash disclosures. It seems like what once was old is new again, and would love to get your thoughts on this new emergence of bots and how, while we know all the bad things that happen with them, there's actually a lot of good and really cool stuff that's happening and how we can maybe work towards enabling that. Right.

Starting point is 00:02:36 Well, things have changed, right? The DDoS problem is still there, but it's almost handled as a commodity these days. The network provider, your cloud provider, they'll just deal with it. And so when you're deploying an application, most of the time, you just don't have to think about it. The challenge comes when you've got traffic that just doesn't fit those filters. It looks like it could be legitimate, or maybe it is legitimate and you just have a different view about what kind of traffic you want to see.

Starting point is 00:03:04 And so the challenge is really about how do you distinguish between the good bots and the bad bots? And then with AI changing things, it's bots that might even be acting on behalf of humans, right? It's no longer a binary decision. And the amount of traffic from bots is increasing; in some cases, the majority of traffic that sites are receiving is from an automated source. And so the question for site owners is, well, what kind of traffic do you want to allow? And when it's automated, what kind of automated traffic

Starting point is 00:03:34 should come to your site? And what are you getting in return for that? And in the old days, I mean, I guess the old providers, we'll say the legacy providers in this space, like it was very much using a hammer, right? So they would say, hey, if this IP address is coming in, it's probably a bot.

Starting point is 00:03:51 Or they would say, if this user agent is coming in, it's probably a bot. Very imprecise. And I think the downside of that is that you probably blocked a lot of legitimate traffic, along with illegitimate traffic. And now there's very real consequences because some of these AI bots

Starting point is 00:04:05 could be actual users they're acting on behalf of who are looking to purchase your products. This is the challenge. So a volumetric DDoS attack, you just want to block that at the network. You never want to see that traffic. But everything else needs the context of the application.

Starting point is 00:04:20 You need to know where in the application the traffic is coming to. You need to know who the user is, the session, and to understand in which case you want to allow or deny that. And so this is the real issue for developers, for site owners, for security teams, is to make those really nuanced decisions to understand whether the traffic should be allowed or not. And the context of the application itself is so important because it depends on the site. If you're running an e-commerce operation, an online store, the worst thing you can do is block a transaction because then you've lost the revenue. Usually you want to then flag that order for review, a human customer support person is going to come in and determine based on various

Starting point is 00:04:59 signals whether to allow it. And if you just block that at the network, then your application will never see it. You never even know that that order failed in some way. There's been a lot of media releases about companies that have released solutions in this space, but largely they were based on sort of those old kind of approaches using network telemetry.

Starting point is 00:05:20 Is that generally how they're working now? Or are there some other capabilities that they've released? Because they give them AI names and you just immediately assume that they're doing something fancy. That's right, yeah. So blocking on the network is basically how the majority of these old school products work. They do analysis before the traffic reaches your application, and then you never know

Starting point is 00:05:40 what the result of that was. And that just doesn't fly anymore. It's insufficient for being able to build modern applications, particularly with AI coming in where something like OpenAI has four or five different types of bots, and some of them you might want to make a more restrictive decision over, but then others are going to be taking actions on behalf of a user's search. And we're seeing lots of different applications getting more sign-ups, businesses actually getting higher conversions as a result of this AI traffic.

Starting point is 00:06:11 And so just blocking anything that is called AI is too blunt of an instrument. You need much more nuance. And the only way you can do that is with the application context, understanding what's going on inside your code. I mean, I'd say we're seeing across the industry that AI is driving incredible amounts of new revenue to companies. And if you use an old world tool to just block any of that traffic, you're probably hurting your business. That's right. Or you're like putting it into some kind of maze where it's seeing irrelevant content. And then by doing that, you are kind of downranking your site because

Starting point is 00:06:42 the AI crawler's never going to come back. It's kind of like blocking Google from visiting your site. Yeah, if Google doesn't get in, you're no longer in Google's index. And so anyone searching is not going to find you as a result. Well, and I believe we had sort of standards in the old days that developed, or quasi-standards, like robots.txt, right, which would tell the crawlers, hey, don't crawl these directories. Are we doing something similar for this new age, agentic world? So robots.txt is still the starting place, and it's kind of a voluntary standard. It evolved several decades ago now. It's been around a long time. Bots have been a problem for a long time. And the idea is that you describe the areas of your application and tell any robot that's coming to your site whether you want to allow that robot to access that area of the site or not. And you could use that to control the rollout of new content.

Starting point is 00:07:37 You could protect certain pages of your site that you just don't want to be indexed for whatever reason. And you can also point the crawler to where you do want it to go. You can use the site map for that as well. But the robots.txt file format has evolved over time to provide these signals to crawlers, like search engines such as Google, and so on.
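
As a rough illustration of the robots.txt rules described here, the following is a minimal Python sketch using the standard library's `urllib.robotparser`. The file contents, the paths, and the sitemap URL are made up for the example; they are not from the episode. It shows only how a well-behaved crawler would read the rules, since nothing in the file itself enforces them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths and sitemap URL are illustrative only.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /drafts/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A cooperative crawler asks before fetching; an uncooperative one simply does not.
googlebot_ok = parser.can_fetch("Googlebot", "https://example.com/drafts/post")
other_ok = parser.can_fetch("SomeOtherBot", "https://example.com/drafts/post")
sitemaps = parser.site_maps()  # available in Python 3.8+
```

Here the named crawler is allowed everywhere, every other robot is asked to stay out of two directories, and the sitemap line points crawlers to where the site owner does want them to go.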

Starting point is 00:07:57 The challenge with that is it's voluntary, and there's no enforcement of it. And so you've got good bots like Googlebot that will follow the standard, and you'll be able to have full control over what it does. But there are new bots that are ignoring it, or even sometimes using it as a way to find the parts of your site that you don't want it to access,

Starting point is 00:08:17 and they will just do that anyway. And so this becomes a control problem for the site owner. And you really want to be able to understand not just what the list of rules are, but how they are enforced. Totally. Maybe it'd be great to walk through what these agents are, maybe get some more understanding of sort of how they operate,

Starting point is 00:08:36 what people are using them for, perhaps go through a couple of the use cases. And then it'd be great to understand sort of like how you do control it, because it seems like a far more complicated problem than just bad IP addresses. Right. So if we think about OpenAI as an example, because they have four or five different crawlers,

Starting point is 00:08:54 there's one, and they all have different names, and they will identify themselves in different ways. So one actually is crawling to train the OpenAI models on your site. And that's the one that probably everyone is thinking about when they're thinking about I want to block AI, the training. And you have different philosophical approaches to how you want to be included in the training data. The others are more nuanced and will require more thought.

Starting point is 00:09:18 So there's one that will go out when a user is typing something into the chat and has asked a question and OpenAI will go out and search. It's built up its own search index. And so that's the equivalent of Googlebot. You probably want to be in that index because as we're seeing, sites are getting more sign-ups, are getting more traffic. The discovery process, being part of just another search index, is super important. Gotcha. So like when I ask OpenAI, when is John F. Kennedy's birthday? If it doesn't know the answer, it goes out and searches the web. Yeah, that's right. Or if it's trying to get open hours for something, it might go to a website for a cafe or whatever and parse it and then return the

Starting point is 00:09:55 results. So that's really just like a classic search engine crawler, except it's kind of happening behind the scenes. The other one is something that's happening in real time. So you might give the agent a specific URL and go and ask it to summarize it or to look up a particular question in the docs for a developer tool or something like that. And then that's a separate agent that will go out, it will read the website, and then it will return and answer the query. For both of these two examples, OpenAI and others are now starting to cite those sources and you'll regularly see, and this is kind of the recommendation, is you get the result from the AI tool, but you shouldn't trust it 100%. You go and then verify and you look at the docs,

Starting point is 00:10:36 and maybe it's like when you used to go to Wikipedia and you'd read the summary, and then you'd look at the references, and you'd go to all the references and check to make sure what had been summarized is actually correct. But all three of those examples, you clearly could see why you would want them accessing your site. Right. Like blocking all of OpenAI's crawlers is probably a very bad idea. Yeah, it's too blunt. It's too blunt an instrument. You need to be able to distinguish each one of these and determine which parts of your site you want them to get into. And this then comes to the fourth one, which is the actual agent. And the agent, the computer operator type feature that is... Headless web browsers. Yeah, but even a web browser,
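
To make the distinction between these crawler types concrete, here is a minimal Python sketch that maps a user agent string to a purpose and then to a per-purpose policy. The three tokens (`GPTBot`, `OAI-SearchBot`, `ChatGPT-User`) are the names OpenAI publishes for its crawlers as far as I know, and should be checked against its current documentation; the purpose labels and the policy table are assumptions made for this example, not anything stated in the episode.

```python
from enum import Enum


class Purpose(Enum):
    TRAINING = "training"          # crawling to train models
    SEARCH_INDEX = "search_index"  # building a search index
    USER_FETCH = "user_fetch"      # fetching a page a user asked about
    UNKNOWN = "unknown"


# User agent tokens (lowercased). Verify these against the vendor's own docs.
KNOWN_TOKENS = {
    "gptbot": Purpose.TRAINING,
    "oai-searchbot": Purpose.SEARCH_INDEX,
    "chatgpt-user": Purpose.USER_FETCH,
}

# Hypothetical site policy: one decision per purpose instead of "block all AI".
POLICY = {
    Purpose.TRAINING: "deny",
    Purpose.SEARCH_INDEX: "allow",
    Purpose.USER_FETCH: "allow",
    Purpose.UNKNOWN: "inspect",
}


def classify(user_agent: str) -> Purpose:
    """Return the purpose implied by a self-declared user agent string."""
    ua = user_agent.lower()
    for token, purpose in KNOWN_TOKENS.items():
        if token in ua:
            return purpose
    return Purpose.UNKNOWN


def decide(user_agent: str) -> str:
    return POLICY[classify(user_agent)]
```

The fourth type, an agent driving a full browser, typically will not announce itself this way, which is why it falls through to `inspect` and needs the other signals discussed later.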

Starting point is 00:11:12 a full web browser operating inside a VM. And those are the ones that require more nuance, because maybe you're booking a ticket or doing some research and you do want the agent to take actions on your behalf. Maybe it's going through your email inbox and triaging things. From the application builder's perspective, that's probably a good thing. You want more transactions. You want more usage of your application. But there are examples where it might be a bad action.

Starting point is 00:11:40 So, for example, if you're building a tool that is going to try and buy all of the concert tickets and then sell them on later, that becomes a problem for the concert seller because they don't want to do that. They want the true fans to be able to get access to those. And again, you need the nuance. Maybe you allow the bot to go to the homepage and sit in a queue, but then when you get to the front of the queue, you want the human to actually make the purchase

Starting point is 00:12:01 and you want to rate limit that so that maybe the human can only purchase, let's say, five tickets. You don't want them to purchase 500 tickets. And so this gets into the real details of the context, each one about what you might want to allow and what you might want to restrict. That's incredibly complicated. I mean, if I remember back why we made a lot of the decisions we made
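
The per-buyer cap mentioned here can be sketched in a few lines of Python. This is an in-memory illustration only; the class name, the default limit of five, and the idea of keying on a user ID are assumptions for the example, and a real store would persist the counts and tie them to a verified account.

```python
from collections import defaultdict


class PurchaseLimiter:
    """Caps the total number of tickets a single user may buy."""

    def __init__(self, max_per_user: int = 5):
        self.max_per_user = max_per_user
        self._purchased = defaultdict(int)  # user_id -> tickets bought so far

    def try_purchase(self, user_id: str, quantity: int) -> bool:
        """Record the purchase and return True only if it stays within the cap."""
        if quantity <= 0:
            return False
        if self._purchased[user_id] + quantity > self.max_per_user:
            return False
        self._purchased[user_id] += quantity
        return True


limiter = PurchaseLimiter(max_per_user=5)
first = limiter.try_purchase("fan-1", 3)    # within the cap
second = limiter.try_purchase("fan-1", 3)   # would total 6, so refused
bulk = limiter.try_purchase("reseller", 500)
```

The limit applies to the purchase step only, which matches the idea of letting an agent wait in the queue while restricting what happens at the front of it.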

Starting point is 00:12:18 in blocking bots was strictly because of scale. So, you know, you've got 450,000 IP addresses sending you terabits of traffic through a link that only can do gigabit, and you've got to just start dropping stuff, right? And you take, you know, it's the battlefield triage of the wounded, right? It's like some of you aren't going to make it, and it becomes a little brutal. That sounds incredibly sophisticated. How do you do that sort of fine-grained control of traffic flow at Internet scale? So this is about building up layers of protections.

Starting point is 00:12:49 So you start with the robots.txt, just managing the good bots, then you look at IPs and start understanding, well, where's the traffic coming from? In an ideal scenario, you have one user per IP address, but we all know that that doesn't happen. That never happens. And so you can start to build up databases of reputation around the IP address. And you can access the underlying metadata about that address, knowing which country it's coming from or which network it belongs to. And then you can start building up these decisions thinking, well, we shouldn't really be getting traffic from a data center for our signup page.

Starting point is 00:13:22 And so we could block that network. But it becomes more challenging if we have that agent example. The agent with a web browser or headless browser is going to be running on a server somewhere. It's probably in a data center. And then you have the compounding factor of the abusers will purchase access to proxies which run on residential IP addresses. So you can't easily rely on the fact that it's part of a home ISP block anymore. And so you have to build up these patterns,
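
A minimal Python sketch of this kind of IP metadata rule, using the standard `ipaddress` module. The network ranges below are reserved documentation ranges standing in for a real data center list, and the `decide` function, the `/signup` path, and the `verified_agent` flag are hypothetical names for the example. Because residential proxies exist, a rule like this is one signal among several rather than a verdict.

```python
import ipaddress

# Placeholder "data center" ranges (RFC 5737 documentation networks).
# A real deployment would load these from an IP reputation or ASN database.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter(ip: str) -> bool:
    """True if the address falls inside any known data center range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_NETWORKS)


def decide(ip: str, path: str, verified_agent: bool = False) -> str:
    """Challenge data center traffic on signup unless it is a verified agent."""
    if path.startswith("/signup") and is_datacenter(ip) and not verified_agent:
        return "challenge"
    return "allow"
```

Returning `challenge` instead of a hard block leaves room for the agent case, where legitimate traffic really does originate from a server in a data center.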

Starting point is 00:13:49 understanding the reputation of the IP address. Then you have the user agent string. That is basically a free text field that you can fill in with whatever you like. There is kind of a standard there, but the good bots will tell you who they are. It's been surprising, getting into the details of this, how many bots actually tell you who they are.

Starting point is 00:14:06 And so you can block a lot of them just on that heuristic combined with the IP address. Or allow them. Or allow them. Yeah. I'm the shopping bot from OpenAI. Right. Come on in and buy some stuff.

Starting point is 00:14:15 Exactly. And Googlebot, OpenAI, they tell you who they are, and then you can verify that by doing a reverse DNS lookup on the IP address. So even though you might be able to pretend to be Googlebot, you can check to make sure that that's the case or not with very low latency lookups. So we can verify that, yes, this is Google, I want to allow them. Yes, this is the OpenAI bot that is doing the search indexing. I want to allow that. The next level from that is building up fingerprints and fingerprinting the characteristics of the request.
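
The reverse DNS check described here is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it includes the original IP. Below is a minimal Python sketch using the standard `socket` module. The allowed suffixes reflect the domains Google documents for Googlebot as far as I know and should be verified against its documentation; the injectable `reverse_lookup` and `forward_lookup` parameters are an assumption added so the logic can be exercised without network access.

```python
import socket

# Domains a genuine Googlebot hostname is expected to end with (verify in vendor docs).
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler_ip(ip, allowed_suffixes=ALLOWED_SUFFIXES,
                      reverse_lookup=None, forward_lookup=None) -> bool:
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include IP."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex is IPv4-only; IPv6 would need getaddrinfo instead.
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip).lower().rstrip(".")
        if not hostname.endswith(allowed_suffixes):
            return False  # PTR record points somewhere else entirely
        return ip in forward_lookup(hostname)  # guards against a spoofed PTR
    except OSError:
        return False  # lookup failed, so the claim cannot be verified
```

The forward step matters because whoever controls an IP range also controls its reverse DNS records, so the hostname alone proves nothing. In practice the results would be cached to keep the lookup latency low.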

Starting point is 00:14:42 And this started with the JA3 hash, which was invented at Salesforce, and has now been developed into JA4. Some of them are open source, these algorithms, some of them are not. So essentially you take all of the metrics around a session and you create a hash of it, and then you stick it in a database. Exactly.

Starting point is 00:14:57 And you look for matches to that hash. You look for matches. And then the idea is that the hash will change based on the clients so you can allow or deny certain clients. But if you have a huge number of those clients all spamming you, then they all are the same. They all have the same fingerprint, and you can just block that fingerprint.
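
A minimal Python sketch of a JA3-style fingerprint. JA3 joins five fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats) into a comma-separated string, with the values inside each field joined by dashes, and takes the MD5 of that string. The sketch assumes the ClientHello has already been parsed into lists of integers, and the sample values are made up. The `fingerprints_to_block` helper and its threshold are illustrative additions, not part of JA3.

```python
import hashlib
from collections import Counter


def ja3_style_hash(tls_version, ciphers, extensions, curves, point_formats) -> str:
    """Build the JA3-style string from ClientHello fields and return its MD5 hex digest."""
    fields = [
        str(tls_version),
        "-".join(str(v) for v in ciphers),
        "-".join(str(v) for v in extensions),
        "-".join(str(v) for v in curves),
        "-".join(str(v) for v in point_formats),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


def fingerprints_to_block(fingerprints, threshold):
    """Return fingerprints seen at least `threshold` times in the sample."""
    counts = Counter(fingerprints)
    return {fp for fp, seen in counts.items() if seen >= threshold}


# Made-up ClientHello values for two different client types.
client_a = ja3_style_hash(771, [4865, 4866, 4867], [0, 11, 10], [29, 23, 24], [0])
client_b = ja3_style_hash(771, [4865, 49195], [0, 10], [29, 23], [0])

# Many requests sharing one fingerprint stand out from a one-off client.
observed = [client_a] * 50 + [client_b]
blocked = fingerprints_to_block(observed, threshold=10)
```

Because the hash depends on how the client's TLS library builds its handshake, a fleet of identical automated clients collapses to a single value that can be allowed or denied as a unit.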

Starting point is 00:15:13 So this is almost like, if you think of, you know, I always think of things in terms of the classic sort of network stack, like, you know, layer zero up to layer seven. Like this is almost like layer two level identity for devices, right? Right. It's looking at the TLS handshake on the network level. And then you can go up the layers. There's one called JA4H, which looks at the HTTP headers. And the earlier versions of this would be working on the ordering of the headers, for instance. So an easy way to work around it is just to shift the headers. The hashing has

Starting point is 00:15:46 improved over time so that even changing the ordering of the headers doesn't change the hash. And the idea is that you can then combine all of these different signals to try and come to a decision about who you think it is, basically, making the request. And if it's malicious, you can block it based on that. And if it's someone that you want to allow, then you can do so. And this is before you even get into kind of the user level, what's actually happening in the application, right? That's right. Yeah. So this is the logic on top of that because you have to identify who it is first before you apply the rules about what you want them to do.
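
The point about header ordering can be shown with a small Python sketch. This is not the JA4H algorithm; it is only an illustration, under that stated assumption, of why hashing a sorted list of header names survives reordering while hashing them in arrival order does not. The function name, the truncation to 12 hex characters, and the sample headers are all made up for the example.

```python
import hashlib


def header_fingerprint(header_names, order_sensitive: bool = False) -> str:
    """Hash the set of HTTP header names a client sent.

    With order_sensitive=False the names are sorted first, so shuffling
    the headers does not change the fingerprint.
    """
    names = [name.lower() for name in header_names]
    if not order_sensitive:
        names = sorted(names)
    digest = hashlib.sha256(",".join(names).encode("utf-8")).hexdigest()
    return digest[:12]  # shortened for readability; the length is arbitrary


original = ["Host", "User-Agent", "Accept", "Accept-Language"]
shuffled = ["Accept", "Host", "Accept-Language", "User-Agent"]

# Order-sensitive hashing is easy to evade by shifting headers around.
evaded = header_fingerprint(original, True) != header_fingerprint(shuffled, True)

# Order-insensitive hashing yields the same value for both orderings.
stable = header_fingerprint(original) == header_fingerprint(shuffled)
```

A fingerprint like this would then be combined with the TLS fingerprint, the IP reputation, and the self-declared user agent before any rule is applied.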

Starting point is 00:16:17 Gotcha. So it's almost like you're adding an authentication layer or an identity layer to sort of the transport side. That's right, yeah. The application side, I guess you should say. Yeah, the application, yeah. But it's throughout the whole stack, the whole OSI model. And the idea is you have this consistent fingerprint

Starting point is 00:16:33 that you can then apply these rules to. And identity kind of layers on top of that. And we've seen some interesting developments in fingerprinting and providing signatures based on who the request is coming from. So a couple of years ago, Apple announced Privacy Pass, which is a hash that is attached to every request you make. If you're in the Apple ecosystem and using Safari on iPhone or on Mac,

Starting point is 00:16:58 then there is a way to authenticate that the request is coming from an individual who has a subscription to iCloud. And Apple has their own fraud analysis to allow you to subscribe to iCloud. So it's an easy assumption to make that if you have a subscription, and this signature is verified, then you're a real person. There's a new one that Cloudflare recently published around doing the same thing for automated requests and having a fingerprint, a signature, that's attached inside every single request, which you can then use public key cryptography to verify.

Starting point is 00:17:30 These are all emerging as the problem of being able to identify automated clients increases because you want to be able to know who the good ones are to allow them through whilst blocking all the attackers. Yeah, it's just like the old days with Kerberos, right? Every large vendor is going to have their flavor. Right. And if you're a shop and you're trying to sell to everybody, you've got to kind of work with all of them.

Starting point is 00:17:51 That's right. And you just need to be able to understand, is this a human and is our application built for humans? And then you allow them. Or is it that we're building an API? Or do we want to be indexed and we want to allow this traffic? It's just giving the site owner the control. Yeah, I mean, I think it's what's really interesting to me is that in my own use

Starting point is 00:18:10 and in my own life, like I interact with the internet less and less directly, like almost every day. And I'm going through some sort of AI type thing. It could be an agent. It could be a large language model. It could be any number of things, but I generally don't query stuff directly

Starting point is 00:18:26 as much as I used to. And it seems like we're moving to a world where almost the layer you describe, the agent type activity you describe, will become the primary consumer of everything on the internet. Well, if 50% of the traffic is already bots, is already automated, and agents are only really just getting going. Most people are not using these computer use agents because they're too slow right now.

Starting point is 00:18:49 They're still like previews, but it's clear that's where everything is going. Then we're going to see an explosion in the traffic that's coming from these tools and just blocking them just because they're AI is the wrong answer. You've really got to understand why you want them, what they're doing, who they're coming from, and then you can create these granular rules. I mean, I hate to use the analogy, but these things are almost like avatars. They're running around on someone's behalf, and you need to figure out who that someone is and what the objectives are, right, and control them very granularly.

Starting point is 00:19:19 And the old-school methods of doing that assume malicious intent, which isn't always the case, and increasingly is going to be not the case because you want the agents to be doing things. And the signals just no longer work when you're expecting traffic to come from a data center or you're expecting it to come from an automated Chrome instance. and being able to have the understanding of your application to dig into the characteristics of the request is going to be increasingly important in the future of distinguishing how criminals are using AI.

Starting point is 00:19:51 What we've seen so far is either training, and people have that opinion of whether they want to train or not, or it's bots that maybe have got something wrong. They're accessing the site too much because they haven't thought about throttling, or they're ignoring robots.txt rather than looking at agents.txt, which is distinguishing between an agent you want to access your site and some kind of crawler.

Starting point is 00:20:12 And the examples that we've seen are just bots coming to websites and just downloading the content continuously. There's no world where that should be happening. And this is where the cost is being put on the site owner because they currently have no easy way to manage and control the traffic that's coming to their site. Directionally, things are improving, because if you look back 18 months, the bots had no rate limiting. They were just downloading content all the time. Today, we know that these bots can be verified. They are identifying themselves.

Starting point is 00:20:45 They are much better citizens of the Internet. They are starting to follow the rules. And so over the next 18 months, I think we'll see more of that, more of the AI crawlers that we want, following the rules, doing things in the right way. And it will start to split, making it a lot easier to detect the bots with criminal intent. And those are the ones that we want to be blocking. So with the transition of bots from being these entities on the Internet that represent third parties and organizations

Starting point is 00:21:13 to this new world where these AI agents could be representing organizations, they could be representing customers, they could be representing any number of people, and this is probably the wave of the future. It seems to me like detecting whether it's AI or a person is going to be an incredibly difficult challenge. And I'm curious, like, how are you thinking about proving humanness on the Internet? Right. Proofing is a tale as old as time. There's a NIST working group on proofing identity that's been running, I think, for 35 years. And it still hasn't really gotten to something that's implementable. There's 15 companies out there, right? The first wave of rideshare services and gig economy type companies needed to have proofing, right, because you're hiring these people in remote places where you don't have an office. And it's still not a solved problem. I'm curious. Like, it feels like maybe AI can help get us there or maybe there's something that's happening in that space. Right. Well, the pure solution is a digital signature, right? But we've been talking about that for so long. And the UX around it is basically impossible for normal people to figure out. And it's why something like email encryption, no one encrypts their email. You have encrypted chat because it's built into the app and it can do all the difficult things like the key exchange behind the scenes. So that solution isn't really going to work. But AI has been used in analyzing traffic for over a decade. It's just, it was called machine learning. And so you start with machine learning,

Starting point is 00:22:37 and the question is, well, what does the new generation of AI allow us to do? The challenge with the LLM-type models is just the speed at which they are doing analysis, because you often want to take a decision on the network or in the application within a couple of milliseconds, otherwise you're going to be blocking the traffic and the user's going to become annoyed.

Starting point is 00:22:58 And so you can do that with kind of classic machine learning models and do the inference really quickly. And where I think the interesting thing in the next few years is going to be is how we take this new generation of generative AI using LLMs or other types of LLM-like technology to do analysis on huge traffic patterns. I think that can be done in the background initially, but we're already seeing new edge models designed to be deployed to mobile devices and IoT that use very low amounts of system memory and can provide inference responses within milliseconds. I think those are going to start to be deployed

Starting point is 00:23:33 to applications over the next few years. I think you're exactly right. I think so much of what we're seeing now is just being restricted by the cost of inference. And that cost is dropping incredibly fast, right? We saw this with cloud where, like, S3 went from being the most expensive storage you could buy to being free, essentially free.

Starting point is 00:23:52 Glacier is essentially free, right? Free as in beer, right? Whatever. And so, like, we're seeing that even at a more accelerated rate for inference, like the cost is just falling incredibly. And then when you look at the capabilities of these new technologies, you can drop a suspicious email into ChatGPT and ask if it's suspicious, and it's like 100% accurate. If you want to find sensitive information, you ask the LLM if it's sensitive information,

Starting point is 00:24:16 and it's like 100% accurate. Like, it's amazing. Like, as you squint and look at the future, you can start to see these really incredible use cases, right? Like, to your point of inference on the edge, like, do you think we all end up eventually with like an LLM running locally that's basically going to be Clippy but for CISOs, like it pops up and says, hey, it looks like you're doing something stupid. Like, is that kind of where you think we land? That's what we're working on is getting

Starting point is 00:24:41 this analysis into the process so that for every single request that comes through, you can have a sandbox that will analyze the full request and give you a response. Whereas now you can wait maybe two to five seconds to delay an email and do the analysis and decide whether to flag it for review or send it to someone's inbox. Delaying an HTTP request for five seconds, that's not going to work. And so I think the trend that we're seeing with the improvement in cost, the inference cost, but also the latency

Starting point is 00:25:09 in getting the inference decision, that's going to be the key. So we can embed this into the application. You've got the full context window so you can add everything you know about the user, everything about the session, everything about your application, alongside the request, and then come to a decision entirely locally on your

Starting point is 00:25:25 web server, on the edge, wherever it happens to be running. As I listen to you say that and describe this process, all I can think is that advertisers are going to love this. It just seems like the kind of technology built for sort of like, hey, he's looking at this product, show him this one, right? Yeah, super fast inference on the edge

Starting point is 00:25:40 coming to a decision. And for advertisers, stopping click spam, that's a huge problem. And being able to come to that decision before it even goes through your ad model and the auction system. Who would have ever thought that non-deterministic, incredibly cheap compute

Starting point is 00:25:55 would solve these use cases, right? We're in a weird world. That's it for this episode. Thanks again for listening. And remember to keep listening for some more great episodes. As the AI space matures, we need to start thinking more practically

Starting point is 00:26:09 about how the technology coexists with the systems and platforms we already use. That's what we try to do here and we'll keep examining these questions in the weeks to come.
