This Week in Startups - Is Chalk.ai the ‘Next Databricks’? + Tollbit’s Bot Paywall for AI Agents | E2167

Starting point is 00:00:00 Instead of these batch training jobs, these training data pipelines, Chalk is able to get fresh data to machine learning and AI models in context at inference time, which means when someone clicks a button, when they load a page, we can help a model fetch the data from the source. So it's really fresh and really relevant to what you're doing in that application. This Week in Startups is brought to you by .tech. Say it without saying it. Go to get.tech slash twist or your favorite registrar to get a clean, sharp .tech domain today. Lemon.io. Hire pre-vetted remote developers and get 15% off your first four weeks of developer time at lemon.io slash twist. And Squarespace, turn your idea into a beautiful website. Go to Squarespace.com slash twist for a free trial. When you're ready to launch, use offer code twist to save 10% off your first purchase of a website or domain.

Starting point is 00:01:02 Hey, everybody. Welcome back to Twist. This is Alex. And today we have two Twist 500 interviews for you. I did both of these, and I'm incredibly excited about both these companies. All right, first of all, Chalk. Chalk works with ML to bring the right data at the right time into production.

Starting point is 00:01:18 So essentially, if you want to run an ML workload and have the most up-to-date data, well, Chalk has built the data pipelines for that. As we all bring more data into our ML and AI context, this really matters, and what the company's building could become absolutely enormous. Then we're talking to Tollbit, the publisher AI data marketplace company. We had them on the show last August. They are back, and since then they have raised a $24 million Series A, led by Lightspeed, and also signed up way more than a thousand publishers to their platform. I want to know what's going on with RAG queries.

Starting point is 00:01:51 How much money can you make for licensing your data today? Are AI companies playing fair, and is robots.txt over? Well, all that and more is coming your way. First, Chalk, then Tollbit. Let's have some fun. When a venture capitalist leads a round and says that the company in question is the next Databricks and is one of the fastest growing data companies that they have ever seen, I have no choice but to sit up and take notice.

Starting point is 00:02:19 Why? Well, Databricks is one of the most valuable startups in the world, period. And there are some really quickly growing data startups out there that we've talked to. So the startup that we're talking about today, Chalk, must be doing something incredibly interesting. And to help us understand why we need new data pipelines for the AI inference era, please welcome to the show Marc Freed-Finnegan, co-founder and CEO of Chalk. Marc, hey, welcome to the show.

Starting point is 00:02:43 Hi, Alex. Thank you so much for having me. I'm psyched. Yeah, me too, man. So I want to start by laying some foundations for folks out there who are not familiar with the acronym ETL and don't know about data pipelines. So from a high level, how is AI adoption today driving a shift from training compute to what you guys call inference compute? Absolutely. Well, thank you so much for having me. Everyone's talking a lot about training

Starting point is 00:03:08 and for good reason. People are training incredible models. They're spending a lot of time and money developing that perfect model. But once you've got it, you want to run the model. And maybe you want to run it forever. And that's the inference part. And so even if you look at how compute is being used, and obviously people are using a lot of computers, a lot of chips, training compute is growing, but compute allocated to inference is exploding because that's when you run the model, that's when you incorporate new data, and that's when you get the really exciting answers for users who are interacting with AI. And that's the piece of the puzzle that we're focused on. Okay, so back in the day when OpenAI, you know, in the pre-ChatGPT era, they were getting a lot

Starting point is 00:03:49 of GPUs together, crunching a lot of numbers, making new models. That was intensive. But now, today, we have a lot of great models. Everyone's using them. So the amount of computation we're pushing through just the use of the models versus the training is now, I presume, the majority case. It's shifting. Absolutely. And part of what's happening is that people used to accept like, hey, why don't we train and process

Starting point is 00:04:11 and compute a lot of stuff in advance? But I think everyone's realizing that the answers get better, the experiences get better if you compute more stuff on the fly, if you do multi-step reasoning, if you actually incorporate new data, fresh data into the answer, into the experience, the thing that consumers experience is totally transformed. And to put this into kind of grokkable terms for folks out there, we all recall an era before LLMs were connected to the internet and they couldn't do search. And so their data just went up to like last September.

Starting point is 00:04:43 And if it happened after that, they didn't know. So data freshness really does help the end user get the most value out of a trained AI model. That fair? Absolutely agree. And maybe let's go even further back in time with some really concrete examples. If you think about stuff that's processed in batch, which is really what training is doing, that's what Databricks is doing. Everyone used to process everything in batch. Think about moving money. So I deposit a check. There's this nightly batch process for ACH. They move the money, and in the next day or two days, you see the money.

Starting point is 00:05:14 Cool. Yeah. But contrast that with debit. Right? With a debit card, you need to make this instant decision about approve or decline. You move the money instantly. And so you can't do an overnight process. You need to have a machine learning or an AI model that's evaluating what literally just happened. You might have retried that debit card five times in the past second. And so by shifting where you evaluate the risk, by doing it more quickly, by doing it with fresher data, you can produce a much better, faster decision and better experience.
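
To make the debit-card example concrete, here is a minimal Python sketch of a real-time "velocity" check: it counts how many times a card was tried within the last second and declines once that count passes a threshold. This is an illustration only, not Chalk's implementation. The names `VelocityCounter` and `decide`, the one-second window, and the threshold of five are all assumptions chosen to mirror the example in the conversation.

```python
from collections import defaultdict, deque


class VelocityCounter:
    """Counts recent attempts per card inside a sliding time window.

    Hypothetical helper for illustration; assumes timestamps arrive in
    non-decreasing order for each card.
    """

    def __init__(self, window_seconds: float = 1.0):
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # card_id -> attempt timestamps

    def record(self, card_id: str, ts: float) -> int:
        """Record an attempt at time ts and return the count in the window."""
        attempts = self._attempts[card_id]
        attempts.append(ts)
        # Drop attempts that have fallen out of the window.
        while attempts and ts - attempts[0] > self.window_seconds:
            attempts.popleft()
        return len(attempts)


def decide(counter: VelocityCounter, card_id: str, ts: float,
           max_attempts: int = 5) -> str:
    """Approve or decline using only what happened in the last window."""
    recent = counter.record(card_id, ts)
    return "decline" if recent > max_attempts else "approve"
```

A nightly batch job could not express this rule, because the input (attempts in the last second) only exists at the moment of the request.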

Starting point is 00:05:44 The same thing applies to consumer experiences like ranking and recommendations and search and logistics and operations, and really anything you can think of gets better when it's faster and fresher. Okay, and this brings us to kind of what Chalk does. And I think the best place to start is really, why does a shift from training to inference compute require different data pipelines? So perhaps you could explain to us how the batch system works

Starting point is 00:06:08 and then also maybe how Chalk approaches this to make it effectively real time as far as I can tell. Absolutely. So maybe I'll share another example. One of our customers, Whatnot, they're the largest live streaming marketplace in the United States. It's kind of like eBay and QVC mashed up a little bit. Oh, yeah, yeah.

Starting point is 00:06:23 I've seen that they're growing like mad right now. Yeah, yeah, yeah. They are growing like mad. They're a totally incredible company. And before us, when you open the home screen of the app, you would see a set of recommendations. They actually generated those recommendations on a nightly basis. So they would look at everything you browsed for, shopped for in the past. And let's say it was a lot of sneakers.

Starting point is 00:06:44 You come into the app the next day. You're going to see a lot of sneakers. But what if that morning you're browsing for Pokemon cards? It's not going to update. They weren't able to update. Or I'll take a different user scenario. Imagine it's a brand new user where there is no data. And all of a sudden, the home screen of the app has populated their For You feed with almost a random set of products because they don't have data because they haven't done that job in advance.

Starting point is 00:07:11 What we helped them to do was to instantly update the set of recommendations based on what you're browsing for in the moment. And so instead of these batch training jobs, these training data pipelines, Chalk is able to get fresh data to machine learning and AI models in context at inference time, which means when someone clicks a button, when they load a page, we can help a model fetch the data from the source. So it's really fresh and really relevant to what you're doing in that application. So fundamentally, what Chalk does is it responds very quickly to real-time requests and makes data available at the time of inference or the time of a request, so that way whatever is generated can be as fresh as possible. Now, that sounds relatively easy. I presume in practice it's not. So what's tricky, and therefore why does Chalk exist, about making data appear inside of an AI context or a machine learning context as you need it? That seems very obvious from my layperson's perspective.

Starting point is 00:08:11 I presume technically it's tricky. Yeah, well, here's the state of the world before Chalk. So you have things like Databricks and Snowflake, incredible companies that do huge batch training jobs, a billion rows at a time, billions of users, great, wonderful. But then in sort of the context of an application, you want to serve those answers really quickly. And so what people do today is they take the trained data and they'll cache it in something called a feature store, which is really just a cache. And then they can serve really low latency answers inside of an application. But the problem is...

Starting point is 00:08:44 In that context, there still won't be fresh data. It's still the data from the last batch job. That's the trick. So there's this tension because they can serve it in a low latency manner. There's effectively no math, no computation happening. So it's fast to serve, but it's old because it was pre-processed. And so companies found, well, hold on, maybe the pre-processed data isn't good enough, but I still care about latency. And that's kind of the tension.
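
The tension described here, fast but stale versus fresh but computed at request time, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Chalk or any particular feature store is built: `BatchFeatureStore`, `OnDemandResolver`, `top_category`, and the in-memory `source` dictionary are hypothetical names standing in for a batch job, a request-time lookup, a feature computation, and the system of record.

```python
from collections import Counter


def top_category(events):
    """The 'math': most frequent category among events, or None if empty."""
    if not events:
        return None
    return Counter(events).most_common(1)[0][0]


class BatchFeatureStore:
    """Precomputes features in a batch job; serving is a plain cache lookup."""

    def __init__(self):
        self._cache = {}

    def run_batch_job(self, source):
        # Snapshot every user's feature as of the moment the job runs.
        self._cache = {user: top_category(ev) for user, ev in source.items()}

    def get(self, user_id):
        # No computation at serving time, so this is fast but can be stale.
        return self._cache.get(user_id)


class OnDemandResolver:
    """Goes to the source and computes the feature at request time."""

    def __init__(self, source):
        self._source = source

    def get(self, user_id):
        return top_category(self._source.get(user_id, []))


def demo():
    source = {"user-1": ["sneakers", "sneakers"]}
    store = BatchFeatureStore()
    store.run_batch_job(source)            # the nightly job
    resolver = OnDemandResolver(source)

    # The next morning: user-1 browses something new, and user-2 signs up.
    source["user-1"] += ["pokemon cards"] * 3
    source["user-2"] = ["watches"]

    return {
        "batch_user_1": store.get("user-1"),
        "fresh_user_1": resolver.get("user-1"),
        "batch_user_2": store.get("user-2"),
        "fresh_user_2": resolver.get("user-2"),
    }
```

In this sketch the batch store still recommends sneakers to user-1 and has nothing for the new user, while the on-demand path reflects what happened after the last job. The cost, which the toy hides, is that the on-demand path has to do its lookup and computation inside the request's latency budget.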

Starting point is 00:09:11 And they wound up building their own custom solutions. So sometimes people ask, what do you compete against? The answer is often companies wind up trying to build something that's a little bit in between a feature store and these batch training jobs. That's what we're building. And it was inspired by real life experiences. So I'm lucky. My wonderful co-founder, Elliot, I've got two, Elliot and Andy. Elliot was one of the first 10 engineers at Affirm. So you can imagine buy now, pay later.

Starting point is 00:09:35 It's not an overnight job. You click. You have to decide, are we going to loan you $100? And they said, yo, Elliot, we have to do this on the fly. He couldn't buy something, so he built something. He built the same thing again at his startup with Andy, a neobank. They sold it to Credit Karma and built it again.

Starting point is 00:09:51 It's because there's a gap. If you want, no, seriously, it's a very tried and true way of starting a company. No, no, I'm just laughing that he had to do it twice. Yeah. And so now we're doing it as a company because it's a real gap in the market that we're able to fill. So going back in time to the early days of Affirm, I actually interviewed Max Levchin for TechCrunch, like a billion years ago, probably like 2013 or something. So way back when Affirm was a little baby thing. And at the time, he told me something

Starting point is 00:10:18 along the lines of when we do a credit decision or make a credit decision, we bring in tons of individual data points to help us be informed. And that wouldn't have been possible if one of your co-founders hadn't built the backend engine to allow for the quick ingestion and then computation against those things to make that decision. So essentially, without homebrewing this originally at Affirm, the Affirm that we know today probably wouldn't exist. If you want your business to succeed online, it has to stand out, and that's why you need a beautiful world-class website. The good news is you don't have to hire an expensive designer anymore, nor do you need a bunch of developers. No, Squarespace has all the tools you need to claim your domain

Starting point is 00:11:00 and start building your company. Hey, maybe you got a product to sell. Maybe you want to show off some samples of your work. Maybe you've got a new service you're providing. Well, Squarespace is the all-in-one tool you need for your business to grow and flourish. And they offer beautiful templates. Maybe you want to start an online course or you're scheduling appointments for people. Maybe you're generating client invoices. All of that is built into the product. Everything you create with Squarespace is pre-optimized to show up in search engines as well. So you don't need to hire some expensive SEO person. Nope. They're building your meta descriptions. They're setting you up with an auto-generated site map. And the SEO is done before you even get started. Plus, there's a new AI

Starting point is 00:11:40 powered feature. It's called Blueprint, which makes it easier than ever before to customize your website and make it really pop. So check out Squarespace.com slash twist for a free trial and when you're ready to launch, go to Squarespace.com slash twist to get 10% off your first website or domain purchase. That's Squarespace.com slash twist. Max is incredible and he's built a company with a ton of people, but certainly, Elliot had built a really essential piece that gave us the insight for Chalk. I didn't think. I didn't think. Max did not. Yeah. Yeah, well, credit where credit's due.

Starting point is 00:12:12 But yeah, Elliot was a really amazing part of that early team. I think the other sort of key thing here is really about latency and speed. The language of data scientists is really Python. And so Python is easy to write. It's very flexible, but it's also slow. And that's part of why there's a little bit of tension here, because data scientists come up with a model. You mentioned Affirm.

Starting point is 00:12:34 Great. You have thousands of pieces of data you want to incorporate in this decision. And if you write it in Python, you think, well, there's no way we can run that in a fast way at inference time, at an e-commerce checkout, when you load an app. So what do you do? What teams do today is they wind up having a separate data engineering team rewrite the Python in a language that's more efficient. Maybe it's C++. Maybe it's something else. Maybe it's Rust. But you can imagine you rewrite the same definition a second time. There's huge opportunity for error. It takes a lot of time to get something into production.

Starting point is 00:13:07 And part of the magic of Chalk, I'd say, we've bent the curve of what's possible. We change the way of thinking because you can write Python and we will automatically transpile it into C++ instantly. So we meet data science teams where they are, they can write Python, but it still executes so fast that they can use it in a production environment. You guys use the phrase idiomatic Python and I wasn't going to bring this up because it reflects my own ignorance, but break that down for me as someone who hasn't written code since high school. Yeah, what we mean, so Python, and the modifier is just saying simple, natural Python. I think some kind of tools come out and they have like crazy DSLs and they want you to learn a whole

Starting point is 00:13:49 new language. And the point we're really making there is you can write the language you already know. We have some ways that we ask you to structure the Python to talk to Chalk, but you just write Python and we transpile it into C++ and Rust. So it runs crazy fast in a way. You never thought it was possible that Python could be that fast or that you could really run Python in production. And also it means that a company could use Chalk without having a data science team that was prepped to do the transcription and compilation of Python into Rust or C++, so it makes this more available to more firms, is my read. Absolutely.
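
For readers who have not seen a transpiler, here is a deliberately tiny Python sketch of the general idea: parse ordinary Python into a syntax tree, then emit equivalent source text in another language. It handles only names, numeric constants, `+`, `-`, `*`, and single comparisons, and it emits a C++-style expression string. It says nothing about how Chalk's engine actually works; the function names `to_cpp_expr` and `transpile` are hypothetical, and a real system would also have to deal with types, division semantics, overflow, and function calls.

```python
import ast

# Operators whose textual form is the same in Python and C++.
# Division is left out on purpose: Python's "/" and C++'s "/" differ on ints.
_BIN_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}
_CMP_OPS = {
    ast.Gt: ">", ast.Lt: "<", ast.GtE: ">=",
    ast.LtE: "<=", ast.Eq: "==", ast.NotEq: "!=",
}


def to_cpp_expr(node: ast.AST) -> str:
    """Recursively turn a small subset of Python expression nodes into text."""
    if isinstance(node, ast.Expression):
        return to_cpp_expr(node.body)
    if isinstance(node, ast.Name):
        return node.id
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return repr(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        op = _BIN_OPS[type(node.op)]
        return f"({to_cpp_expr(node.left)} {op} {to_cpp_expr(node.right)})"
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMP_OPS):
        op = _CMP_OPS[type(node.ops[0])]
        right = to_cpp_expr(node.comparators[0])
        return f"({to_cpp_expr(node.left)} {op} {right})"
    raise NotImplementedError(f"unsupported syntax: {type(node).__name__}")


def transpile(py_expr: str) -> str:
    """Parse a Python expression and emit a C++-style expression string."""
    return to_cpp_expr(ast.parse(py_expr, mode="eval"))
```

For example, `transpile("amount * 2 > limit + 100")` returns `"((amount * 2) > (limit + 100))"`. The point of the toy is the workflow: the data scientist writes one definition in Python, and a tool, rather than a second engineering team, produces the fast version.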

Starting point is 00:14:27 We're making things possible that honestly people didn't think were possible unless they built huge custom solutions or had data engineering teams. translating a lot of code. We're making it possible to do fresher, faster things, which opens up entirely new product categories for our customers. So the change of Python

Starting point is 00:14:48 into a different language that's more machine learning appropriate, perhaps. How do you guys do that? Is that done via an AI model? Is that done via something else? I'm just curious about how you go from idiomatic Python into what you actually want.

Starting point is 00:15:03 Yeah. Honestly, Alex, a lot of hard work. I do it myself. I type all the code by hand. I love it. Yes. It actually comes to me first. No, I'm not writing our code.

Starting point is 00:15:15 But really, the core of Chalk is a compute engine. And if you really think about what Databricks is, Databricks says, hey, give us all of your data and we'll make your data accessible in a wide variety of ways. We do something actually very similar. We're tuned to a different set of use cases than Databricks. We are tuned to low latency applications so Chalk can sit in production. You can never kick off a Databricks job in response to a user click. You're never going to run one row at a time with Databricks. Their newest engine is called

Starting point is 00:15:48 Photon. It says on the docs page, don't use us. We don't work for anything that takes under two seconds. Chalk is optimized for the exact opposite set of things. We're optimized exactly for doing things really low latency right in the context of a production application. So two seconds is infinite time in regards to an actual application that a user touches. And if you don't believe me, go look at all the Google research, listeners and friends, about how important it is to load results very quickly, for example. So 100%. But Databricks, we all know Ali Ghodsi.

Starting point is 00:16:25 We know Databricks, lovely company, lots of smart people over there. 100%. How hard would it be for them to go from two seconds to something that would be competitive with you guys? Because on one hand, I don't think this is what they want to do. But on the other hand, I'm curious because you guys are doing so well, it's going to attract more competitive pressure as you scale. So I'm just curious if Databricks could, if they wanted to, stop by the barbecue. Absolutely. Well, you know, the age old question, Alex. We're an early stage company. Could Google, Amazon, Databricks go and build the exact same thing?

Starting point is 00:16:56 Maybe. But we know them well. And they are focused on a different set of problems. And I mentioned Photon, but they just spent years and billions of dollars coming up with the newest version of their engine that's optimized for jobs that take over two seconds. So essentially, they're not batch training jobs. So they're not coming in. But one thing, just to stick on this one for one more question. Yeah. The major hyperscalers are building out an increasingly impressive feature set for supporting AI broadly at the enterprise level. Do you think they're ever going to make a version of what you guys are working on that would be competitive, but perhaps less feature rich? Or am I, again,

Starting point is 00:17:35 kind of asking a question about a company that doesn't want to dig into what you're doing? Yeah, I think that most of our customers also use Databricks and or Snowflake and or Google and or Amazon. They might be on Vertex or SageMaker. And we complement those tools because we're doing something that's optimized for a different set of use cases. Could they focus on our use cases one day? Of course, that could happen. But explicitly, Amazon is really referring customers to us because SageMaker just doesn't do the things that we're optimized for. I know that because I was trying to figure out your guys's pricing information and I was bopping around the AWS website and it would inform me that it wouldn't tell me unless I

Starting point is 00:18:13 had a more serious AWS account. So speaking of that, I'm curious about what this actually costs because on one hand, SaaS is traditional. On the other hand, people have been moving a lot more towards a pay-for-what-you-use kind of approach. So how do you guys charge for Chalk today? Yeah. So we have some SaaS fees. But our core model is based on usage. So something that's really differentiated about Chalk, although other companies do similar things, we deploy our software into the cloud environments of our customers. We work with big customers. We work with venture-backed companies that have raised hundreds of millions of dollars. We work with big public companies. We work with Fortune 500 companies. They don't want their

Starting point is 00:18:49 data leaving their environment. And neither do we. So Chalk deploys into their cloud environments where they can retain complete control, privacy, security over their data. With that model, we then just look at how many computers are running Chalk. And we charge per computer, per hour, a lot like Databricks, so that it's really based on how much Chalk you're using. So this is a nerdy financial question. But if you guys are running Chalk on, let's say, Alex Incorporated's cloud instance, and you're keeping track of how many machines I'm spinning up and down that are running on Chalk, and I'm paying

Starting point is 00:19:24 you, does that mean that your gross margins are effectively like 99% because I'm paying for all the compute on my side? Yeah, thank you. So they're not 99%, but we do have great margins, and we're really proud of that. We also don't want to get... Do they start with an eight? We also don't want to get in the middle of our customers and their negotiated rates with AWS or GCP. They pay directly for their hosting costs based on whatever they've negotiated, and they pay us separately for software in a model that's per computer per hour. We have credits and it sort of works in a way that a lot of people are familiar with. But yeah, you know, we'll maybe host the metadata plane or our data dashboard or some alerting or other things like that. But really the data sits with them. And it does provide for a really great business for us. This is the best business ever. When you're a busy founder, finding a new developer, my God, that can become a full-time job. And you've got enough on your plate. I mean, you're running a startup. But lemon.io has done the

Starting point is 00:20:23 hard part for you already. They've got a crop of pre-vetted developers that they've ensured are experienced, results-oriented, and prepared to make an impact at your startup. And they can work right now at competitive rates. These are skilled, hand-picked devs with a minimum of three years of on-the-job experience. And just 1% of applicants are accepted into their program. Lemon.io isn't just recruiting you the top talent that's out there. They're helping you integrate them into your team. If anything goes wrong, Lemon will find you a replacement developer, ASAP.

Starting point is 00:20:58 And many of our launch founders and founder university companies have staffed up with lemon.io, and we always get the best feedback. So go to lemon.io slash twist and find the perfect developer or even a tech team in less than 48 hours. And Twist listeners get 15% off their first four weeks. Stop burning money. Hire developers smarter. Visit lemon.io slash twist.

Starting point is 00:21:21 You guys just get to hang out on someone else's cloud instance and you get paid for it. It's like going to a party and not bringing a present and then being sent home with $5. That's fantastic. Don't you worry, Alex. We do a lot of work and we are working with a lot of happy customers. I know this is very hard.

Starting point is 00:21:37 I'm just saying on the monetization side, that's genius and I freaking love it. Okay. So Databricks, the comparison that your investor, Aydin Senkut from Felicis, mentioned at the top of the show. You guys recently raised $50 million at a $500 million valuation. Databricks has a really long history with Apache Spark and open source in general. I'm kind of curious, does the comparison with Databricks extend to a future open source approach, or is it more of a, we both work in data and so forth? Yeah, absolutely.

Starting point is 00:22:07 So we have a lot of different components of Chalk that are already open source and lots of plans to extend there as well. So, you know, starting with you're just writing Python. So you can take your Python and run it somewhere else. Granted, the core of Chalk is a compute engine that helps you run it, but if you want to replace Chalk with a different engine, you certainly could. A big component of Chalk, obviously, we rely on a ton of open source projects,

Starting point is 00:22:29 including Facebook, now Meta, has a project called Velox that they've open sourced. We're big contributors to Velox, and we're big users of Velox, and we have lots more plans to open source different pieces of Chalk as we go. All right. Well, that's totally encouraging, because in my experience, watching companies that have a friendly approach to open source in whatever capacity tend to be worth about five times as much down the road.

Starting point is 00:22:49 I don't know why that's the case, but it seems to hold pretty well. Now, on the customer front, you mentioned Fortune 500 companies, startups that have raised lots of money. That's a pretty diverse plane. I'm kind of curious, are we seeing adoption of Chalk products by companies that we would not think of when we would consider industries that are going AI first? How wide is adoption of Chalk as opposed to just the companies that we know are going to be pretty technologically forward? Yeah. So where we got our start was in very mature venture-backed tech companies.

Starting point is 00:23:24 So I'd say they are kind of progressive companies. They are already worth maybe billions of dollars or more and have really strong technical teams, but something was missing. And we don't come in and try to boil the ocean and rip out everything. We come in to help launch a specific product. Hey, we used to apply for an account and get back to you two days later. Now we do instant account onboarding. Great.

Starting point is 00:23:45 And we've enabled that. So we're helping their businesses launch new products and new features. But we've moved up market from there. And so, as you mentioned, we're lucky to have a lot of great public companies, including Apartment List, Turo, Mission Lane, many more. We can go through the list. But also, we do have some early stage companies as well. Sometimes we have some really amazing Series A and B companies that get on Chalk early

Starting point is 00:24:12 and grow with us, and that's exciting too. But I think the core for us tends to be pretty mature, complex companies that already have a lot of data and really want to supercharge what they're doing. So this kind of tracks with what I was thinking, because I pulled a quote from your website. I thought it was illustrative of what I was thinking. You guys wrote that, quote, Chalk executes models end-to-end with an optimized engine, eliminating stale streams and ETL jobs by resolving features directly from the source. Which I know what that means, but I don't think most people out in the broader economy do. And I'm curious if there's a way to make the power of Chalk and the power of modern AI more accessible to companies that don't have

Starting point is 00:24:51 the ability to even understand that sentence. And so I guess, Marc, as you look towards the future and you want to have every company using Chalk, how do you help people that are less technical but still could benefit from what you guys are working on, get on board and get going? Yeah. So really what we're describing in that sentence you just quoted is there's an opportunity to go directly to the source of the data at the time you need it instead of pre-processing. And there are a lot of different ways to talk about this stuff, but I think the idea of pre-processing and batch versus on-demand processing in real-time is a great way to just summarize one of the things we're most excited about.

Starting point is 00:25:33 With pre-processing, you go to all of your sources of data, you pull them in, you run your math, you run your computation, and you wind up with a data set that you're able to then cache in a feature store and use in an application context. We think, we've seen, you mentioned Google, I was lucky to work at Google for a bunch of years, and Google talks about speed as a feature. When things are faster, people search more,

Starting point is 00:26:00 they click on more ads, and it's more magical. Everything gets better. And we think that applies to every possible application, both for businesses and consumers. So what we're helping people do, a lot of people say, okay, so what I need is a feature store. Chalk has a feature store.

Starting point is 00:26:14 We think of that as just a data cache. It's useful. But what if you didn't have to cache the data? What if you could go to the source and pull it on demand so it's super fresh and you still can do some math, some computation in real time? And people say, that's not possible. My data science team is writing Python. I only have 10 or 100 milliseconds. I can't possibly go to the data source in context at inference time. And we say, want to try it? And honestly, Alex, basically everyone who does a pilot with us winds up buying the software. So, you know, not every person

Starting point is 00:26:50 knows about us, although now that we're talking, I guess everyone's going to know. Not everyone thinks of us when they have problems we'd be really good at. But we're really proud that when people do find out, when they try the software, they really think about a different architecture for their applications and their business. Marc, an absolute pleasure. I never thought I'd have this much fun talking about data pipelines, but you made it an absolute treat. So thank you much.

Starting point is 00:27:12 And I always ask these two questions. One, where can people find the company if they want to? and two, what's a role you're looking to hire for where you're struggling to find the right candidate? Thank you so much. Well, we're based in San Francisco. We also just moved to our second office in New York. We're growing.

Starting point is 00:27:28 So we work together in person every day and would love to have you here to join us or just to visit us. And of course, at chalk.ai, just if you want to learn a bit more. We're hiring for everything. Obviously, we always want incredible engineers. It is the core of our company.

Starting point is 00:27:43 And we're always expanding our sales and go to market team. And so even if you're just starting out and want to be an SDR, we'd love to have you spread the word, or if you want to help write some code on some really technical problems, we'd love to have you help build out our engine. All right. Next up, we are talking with Tollbit, one of my favorite Twist 500 companies. I love all my children the same, but I do love some of them a little bit more than others.

Starting point is 00:28:10 Founder Toshit Panigrahi was last on the show on August 5th, 2024. That's episode 1989, if you want to go back and look it up. Now, since then, the market for AI data has changed. While we previously spent a lot of our time considering IP rights for data used to train AI models, in 2025, we are seeing user-led AI search queries rise in volume. In other words, the rise of AI bot traffic is straining publishers and other website owners and may demand a new method of monetization for the entire web. Now, if Tollbit wins its market, it could become a truly enormous and very valuable company. But it does face stiff competition from other startups and a raft of wealthy companies that would rather pay nothing for access to information that their AI bots and tools require to operate. Now, to help us untangle the current state of AI data licensing, please welcome Toshit back to the show.

Starting point is 00:29:03 Hey, how you doing? Hey, Alex, how's it going? Thanks for having me. I'm doing okay. I only made like three mistakes in that intro. I'm kind of myself. That's higher than my usual standard. All right.

Starting point is 00:29:13 So it's been about a year. You and I have emailed back and forth about a couple of things here and there. But I think the most important thing to start is to explain to people the difference between an AI training scrape and what's going on in the world of RAG. I think that'll give us a good foundation. So could you start there, please? Yeah, certainly. So training is the process by which we teach these models how to make sense of the world and then how to output an answer and sound human, right? And oftentimes with training, there's a few large companies that are training these what are called foundational models.

Starting point is 00:29:48 And the data access oftentimes for training might be one time. However, RAG is different. RAG stands for retrieval-augmented generation. It's just a fancy way of saying that when you ask SearchGPT a question, when you ask Perplexity a question, when you ask You.com a question, it has to go out, it has to read the web, load those articles into context, so that it can intelligently answer a question. So training is this sort of like one-time access

Starting point is 00:30:18 to sort of train the underlying model, whereas RAG, retrieval output, is fetching that article every single time so that it can be loaded into context so that it can actually make sure that it's answering your question accurately and not hallucinating and making mistakes. We all understand the importance of a crisp, memorable,

Starting point is 00:30:41 easy-to-spell domain name. One of those names you can say, over the phone and people know how to type it in without asking you the spelling. But let's get real. The good ones are either taken or there's some poacher who's holding it and waiting for some huge pay day and they don't reply to you. Even if you want to pay for a premium domain, you don't want to use up all your runway on a domain name. That's just the truth for a startup. You want to put that valuable cash back into your startup's operations. So you should consider this. A dot tech domain. You can get a clean, crisp, super memorable name for your website and company and signal out loud

Starting point is 00:31:17 to your customers and investors, we're a tech company. That's instant branding for you. That's why over 500,000 founders have collectively raised over $5 billion in investment, building their companies on dot tech. So skip the hassle. Head to www.get.tech slash twist. Or go to your favorite registrar and grab your dot tech domain today. Now, RAG often comes up in a search context, because if you're searching for something, you want the most recent information. AI models are trained at a certain period of time, and if you want to have more recent information, you have to go out there and get it.
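
As a rough illustration of the retrieval step described here, the following is a minimal sketch, not how any particular product works. It assumes a tiny in-memory stand-in for the web; `FAKE_WEB`, `retrieve`, and `build_prompt` are hypothetical names, and no real search engine or model API is called. The point it tries to show is that the pages are fetched again for every question and loaded into the prompt as context.

```python
# Toy illustration of retrieval-augmented generation (RAG).
# FAKE_WEB, retrieve, and build_prompt are hypothetical names chosen for
# this sketch; a real system would fetch live pages over HTTP instead.

FAKE_WEB = {
    "https://example.com/scores": "The home team won the cup final 2-1 last night.",
    "https://example.com/recipes": "A simple bread recipe needs flour, water, salt and yeast.",
    "https://example.com/transfers": "The club signed a new striker before the cup final.",
}


def retrieve(question, pages, top_k=2):
    """Rank pages by how many words they share with the question."""
    question_words = set(question.lower().split())
    scored = []
    for url, text in pages.items():
        page_words = set(text.lower().replace(".", "").split())
        overlap = len(question_words & page_words)
        if overlap:  # skip pages with nothing in common
            scored.append((overlap, url, text))
    scored.sort(reverse=True)  # highest overlap first
    return [(url, text) for _, url, text in scored[:top_k]]


def build_prompt(question, pages):
    """Load the retrieved pages into the context handed to a model."""
    sources = retrieve(question, pages)
    context = "\n".join(f"[{url}] {text}" for url, text in sources)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("who won the cup final", FAKE_WEB))
```

Every call to `build_prompt` repeats the retrieval, which is the contrast with a one-time training scrape.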

Starting point is 00:31:53 Is that the only time in which we see RAG scrapes come into play? Or are there other instances of use for RAG that we should also keep in mind as we talk? Yeah, I think if you look at the development of AI over the last two years, right, there's turned out to be a few foundational model companies. Then on top of those, you have orders of magnitude more applications that are made, right? So you take a model, you take a few models, you package them together, and you build a new application out of it, right? So think of Cursor, right?

Starting point is 00:32:25 Think of Perplexity, right? Those are good examples of applications. There's not much training happening there. That's more about the retrieval, right? And a lot of them do focus around either AI coding or, say, search use cases. Now, this year, what you've seen is the sort of explosion in agents, right? So especially enterprise agents. And when you hear the word agent, that is all RAG, because the agent needs to, when you're

Starting point is 00:32:52 automating something, it needs to go out, it needs to go read your email, it needs to automate a text message, it needs to go read your internal files, that's all retrieval. But when we think about RAG in the context of an AI data marketplace, like what Tollbit is building, we're thinking a lot about search and then when AI agents interact with the open web versus an internal database, for example. Yes. And it doesn't need to be an internal database, right? There's quite a few of these sort of agentic workflows, right, that still need access to

Starting point is 00:33:24 research reports, right? Premium data sources, things that are behind a paywall, right? I'm not sure if you have used it, but I think a few weeks ago, OpenAI introduced that agent mode into ChatGPT. And if you try to read Bloomberg articles, Bloomberg will actually block you and make the ChatGPT agent sort of, it'll put a challenge in front of it where the bot actually needs to click and then prove that it's human before it can proceed. And shockingly enough, the AI agent is not a human, so it probably fails that test. But that's a pretty good segue into just how much of this stuff we're seeing. You guys put out a really excellent Q1 State of the Bots report. I'm hoping,

Starting point is 00:34:07 by the way, there's a Q2, Q3, and Q4 edition coming out. But at the time that launched, you guys said that from the fourth quarter of 2024 to the first quarter of 2025, RAG bot scrapes per site were up 49%. That's two and a half times the rate at which training bot scrapes were then going up. And among websites with Tollbit Analytics set up before the start of the year, AI bot traffic nearly doubled in the first quarter. So that shows, I think, just the scale at which we're seeing these RAG queries go up. What's happened since then?

Starting point is 00:34:38 Are the trends still trending in the same direction? Absolutely. I think it's sort of shocking, right? Like when you think about the implication of what the arrival of both AI applications and AI agents mean, right? And the reason it's not surprising to us that these numbers continue to go up, right, through the quarter, is when you and I search, right? We're limited by our attention span. We're limited by our ability to read and type, right? We will click the first link.

Starting point is 00:35:10 We'll click the second link and we'll move on, right? AI doesn't do that. AI does not get tired. It doesn't have a short attention span. It reads phenomenally fast, right, because it's a computer. So when we search on any one of these tools or when we ask an agent to automate something, that agent doesn't get tired. It will go.

Starting point is 00:35:28 You ask it to do research to find, you know, the perfect Lego set for a gift for someone, right? It's going to keep going to those sites and it's not going to get tired, right? You try to use any one of these AI search applications. I'm not sure if you used AI mode on Google recently, but that will read 200 sites. I have, yes. That will read 200 sites to answer your question. But, Toshit, pause there. That's 200 times in which the Google AI agent is going out there and pinging the website

Starting point is 00:35:56 and requesting information from them, bringing it back and then digesting it and showing it to the user per query. So that means that instead of me going to two sites and maybe clicking on an ad, Google is going out 200 times and clicking on nothing. Exactly. And there is certainly, as with everything in computer science, there is some level of caching that does happen. But the research that we do continues to show that that's actually the primary driver of this explosion in bot traffic that's happening. Because all of these users, all of us that are asking these questions are forcing these bots to go out and keep, you know, you can't download the entirety of the internet. You can't cache a local copy of the entirety of the internet. That's what's causing this explosion in bot traffic. It's that retrieval use case.

Starting point is 00:36:38 I think actually when we consider the discussion in 2023, 2024 about AI training data as the focus and this year being a focus much more on RAG, it's really us just saying that a lot of time has been spent building better AI models and now people are using them. And so, of course, we're seeing a shift in the overall patterns. What I'm a little surprised by is how willing,

Starting point is 00:36:59 it seems, according to reports, some AI companies are to go around robots.txt. That's kind of a standard file people put into their website to say, hey, here are the rules for our site. And it was apparently more of a norm and a custom than a hard law or rule. But it was how the internet worked for a good period of time. It seems to have lost some of its bite. So what do you think is driving that today?
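
To make the "norm and a custom" point concrete, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the paths, and the `ExampleAIBot` user agent are all made up for illustration. Note that the check only tells a well-behaved crawler what the site asks for; nothing in it stops a crawler that never runs the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site. "ExampleAIBot" and the
# paths below are invented for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /messages/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """Return True if the site's robots.txt permits this agent to fetch url.

    This is purely advisory: a crawler has to choose to call it and to
    respect the answer.
    """
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SomeSearchBot", "https://example.com/articles/1"),
        ("SomeSearchBot", "https://example.com/messages/42"),
    ]:
        print(agent, url, may_fetch(agent, url))
```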

Starting point is 00:37:22 So I think, you know, if you look at the history of the robots.txt file, right, in many ways, it helped keep the AI search engines above board and not get in trouble. Small example. When Facebook has its robots.txt file, they will purposely make sure that, for example, profiles, messages, right, are not crawled, right? So that in case a search crawler accidentally finds that page, it knows not to index it because you and I don't want our private messages showing up on Google for anyone to search. This happened sort of recently, right?

Starting point is 00:37:58 Whether it was intentional or not, you saw that, you know, anytime you share a conversation on OpenAI, that link, if it gets picked up by a search crawler, can actually get indexed, right? Literally today, I think xAI's Grok was dinged for the same thing. So, yeah, that's happening. So now, right, so that was what the purpose of the robots.txt file was.

Starting point is 00:38:20 It was never, you know, it was definitely an honor policy, right? You trusted that the bots wouldn't see those pages, but there was no hard enforcement really of that, right? And I think now what the bigger conversation is going to be about is, and we see this happen in, you know, when you saw the Perplexity-Cloudflare sort of debacle two weeks ago, right? Yep. That was fun. That was a fun one. I think we're going to start coming to an ethical question, right? And you might certainly be able to figure out where we sit on this. But the question is going to be, if I ask my agent to do something and it goes out, right? Is that human? Should it be able to pass off as human?

Starting point is 00:39:02 And this matters quite a lot because some of the companies are saying we're not disregarding robots.txt. Instead, what we're doing is we're just following through what the user requests. And so all these RAG scrapes, RAG queries, RAG pings should be treated as if they were done by a human. That's a very friendly perspective for the Googles and Perplexities of the world. But if you're, you know, Bob.com and people are just scraping the hell out of your content, it doesn't feel much better, Toshit. No, and I think where this actually starts breaking down, and we're already starting to see this now, right,

Starting point is 00:39:35 amongst our publishers, is that a visit from that agent is not equivalent to a human, right? You are not monetizing that visit, for example, right? And actually, the other place where this starts breaking down, even the unit economics, is these bots are coming in droves, right? You can sit in any one of these, you know, whether it's OpenAI's agent platform, it's Perplexity's agent platform, or, you know, you use a platform like, say, MindStudio, right? These agents are going out in droves. They are hammering your site for information, right?

Starting point is 00:40:11 And that actually costs the website in CDN bills, right? Quite a lot. And so you're not able to monetize, but at the same time, these AI companies are saying, Well, it's equivalent to a user, right? The user might have clicked your link anyways, but the unit economics are not the same. Not the same at all. So they can search more websites faster, generating more total traffic,

Starting point is 00:40:31 but not actual human traffic. So it's all costs, no benefit to the website. Okay, this brings us to you guys' solution to the problem, which is the bot paywall. So my question is, this piece of technology that you guys have built to essentially, as far as I can understand it, erect a wall in front of bots to say, you may or may not pass.

Starting point is 00:40:50 Is it as hard a wall as I'm imagining, or is it more of like a hedge that you could maybe wiggle through, but it's a little bit harder? Yes, it's more of the latter, right? I think one of the things that we, when, you know, through the course of the company, right, made a decision very early on about

Starting point is 00:41:07 is that we are not a cybersecurity company, right? There are a dozen or more cybersecurity companies. All of them are really good at what they do, right? And the problem of figuring out whether a visitor was human or not is going to become very expensive, right? And the reason sort of even practically we made this decision, right? If you play out what's going to happen, right? And again, that, you know, Cloudflare-Perplexity situation two weeks ago is a great example of this. If we just keep maligning bots, if, for example, Tollbit just went down and said, you know, we're going to be a cybersecurity company.

Starting point is 00:41:42 We're just going to focus on blocking bots, right? what you're actually doing is you're creating an incentive structure where you're forcing the bots to just become better, right? Yeah, it's an arms race, essentially. Right. It becomes an arms race, and the bots are just going to become better. The scrapers are just going to become better. They're going to start passing off as more human. And what's going to end up happening?

Starting point is 00:42:04 Sorry, you were about to say it. No, no, no, no. I'm just excited. You keep going. Yeah. What's going to happen is cybersecurity itself will become prohibitively expensive in terms of cost and compute, right? because it will be very, very, it won't be impossible, but it will be prohibitively expensive for anyone, any website to figure out whether when a person lands, when a visitor lands, is this a bot

Starting point is 00:42:26 or not? So I'm starting to get vibes of the old piracy music debate. Back in the day, it's hard to buy stuff, people stole it, then we made it easier to pay for, and then people switched to Spotify, which now does billions of dollars a year in revenue. I presume you guys are going to offer not the world's hardest paywall nor the most permissive, but instead a, hey, look, here's some annoyances, but you can also pay a reasonable fee and get through without friction. There you go.
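
The "pay a reasonable fee and get through without friction" idea can be sketched as a simple decision function. This is only an illustration under stated assumptions, not Tollbit's implementation: the user-agent markers, the token set, and the use of HTTP 402 (Payment Required) are all choices made for this sketch, and the detection step is deliberately naive because, as discussed in the conversation, telling bots from humans is treated as someone else's job.

```python
from http import HTTPStatus
from typing import Optional

# Assumptions for this sketch only: a short list of user-agent substrings
# standing in for a real bot-detection service, and a hard-coded set of
# paid access tokens.
KNOWN_BOT_MARKERS = ("gptbot", "perplexitybot", "claudebot")
PAID_TOKENS = {"demo-token-123"}  # hypothetical


def respond(user_agent: str, token: Optional[str] = None) -> HTTPStatus:
    """Decide how a site might answer a request on the 'happy path'."""
    is_bot = any(marker in user_agent.lower() for marker in KNOWN_BOT_MARKERS)
    if not is_bot:
        return HTTPStatus.OK  # human visitors are served as usual
    if token in PAID_TOKENS:
        return HTTPStatus.OK  # a paying bot gets through without friction
    return HTTPStatus.PAYMENT_REQUIRED  # an unpaid bot is told to pay


if __name__ == "__main__":
    print(respond("Mozilla/5.0 (Macintosh)"))
    print(respond("GPTBot/1.0"))
    print(respond("GPTBot/1.0", token="demo-token-123"))
```

The design choice worth noticing is that the function never tries hard to unmask a bot; it only makes the cooperative route cheap and predictable.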

Starting point is 00:42:51 And so that's where we focus on the happy path, right? So let us assume that there are cybersecurity companies out there. There are CDNs out there who can try to solve that problem of determining who's a bot or not. However, our focus is always, let's assume you are a good actor, right? Suppose you want to go down the happy path. How can we make sure you can have that programmatic exchange of value with that underlying? website, right? How do we make it so that we can actually solve problems for you? Because actually, when you look at it through this lens, there are a lot of problems that AI companies

Starting point is 00:43:26 face and agents face on the open internet, which can actually be solved if you get the cooperation of websites, right? So you can actually make it easier for them to interact with the internet by providing them the equivalent of an API hook, I presume. Okay. So there's a carrot and a stick here. Yes. Okay. So I love this idea because I'm incredibly biased. I'm a writer online. I do this for part of my day job. And so I care a lot about the value of content because it's what's, you know, filled my 401k over the years. How much can websites charge? Because on one hand, this could be a de minimis amount of money that adds up to a molehill or it could be quite a large pile. But I have no idea how to value a RAG

Starting point is 00:44:13 scrape on a marginal basis. So can you talk to me a little bit about how pricing is looking so far? Yeah. So I'll give you one real example, right? So whenever we onboard a publisher on Tollbit, we usually set up a 45-day check-in after they go live so that we can come back and we can look at their analytics, right? And this is important because we try to get the publisher to sort of understand what does that volume look like? What are AI companies more interested in? How can we inform even content strategy for you, right? One of the things that is, you know, a publisher we sat down with in the UK, they have a lot of football content, soccer content, right, over in the UK.

Starting point is 00:44:52 Hugely popular. Yes, hugely popular. And this site is also very popular. They were crawled by Google about 16 million times in 30 days. And from that, they got millions of visitors, right? Google's their main source of traffic like it is for many publishers. In that same 30-day period, they were accessed, crawled by AI companies, scraped, crawled, queried, however we want to call that.

Starting point is 00:45:19 They were crawled by the AI companies 13 million times, from which they got only 650 visitors. So within spitting distance of the same number of crawls, but 99.999999% less traffic. Yes. And that's what caused the publisher to say, this isn't fair anymore. I am actually paying for your content access, right? I am paying the CDN bills for all of your bots accessing my content. So I'm going to start blocking you and I'm going to start setting a rate so that every time

Starting point is 00:45:50 you have to come in and scrape my content, you have to pay a fee. Because I know that you're coming in millions of times every month to access that. So you could, for example, take that website's total traffic for a month, total revenue, from advertising for a month, divide it, come up with a per visit average, if you will, and then you can apply that to each rag scrape. Is that where pricing is kind of landing or am I being too ambitious? That's exactly where we start. We look at their, you know, if you lost a page view to an AI reader, let's start there,

Starting point is 00:46:27 right? And then obviously we have a dialogue on the other side, right? And that pricing as, you know, a marketplace, our job is to sort of make that market and figure out what pricing the market will certainly bear, right? But I think that's a fair starting point and it allows people to visualize it. Now, the neat thing about this, though, right? Remember when we said earlier, when you use any one of these AI applications, you use cursor, you launch your own agents, they don't get tired.
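
The arithmetic behind this exchange can be worked through in a few lines. The crawl and visitor counts are the ones quoted in the conversation; the revenue and page-view figures are purely hypothetical placeholders, since no such numbers are given, and the function names are made up for this sketch.

```python
# Figures quoted in the conversation (30-day period, UK publisher).
AI_CRAWLS = 13_000_000
AI_REFERRED_VISITORS = 650

# Hypothetical figures, invented only to illustrate the pricing idea.
MONTHLY_AD_REVENUE_USD = 50_000.0
MONTHLY_HUMAN_PAGE_VIEWS = 10_000_000


def crawls_per_referral(crawls, visitors):
    """How many AI crawls it took to send back one human visitor."""
    return crawls / visitors


def per_visit_value(revenue, page_views):
    """Average advertising revenue earned per human page view."""
    return revenue / page_views


def starting_scrape_fee(revenue, page_views, scrapes):
    """Starting point from the conversation: price each scrape like a lost page view."""
    return per_visit_value(revenue, page_views) * scrapes


if __name__ == "__main__":
    ratio = crawls_per_referral(AI_CRAWLS, AI_REFERRED_VISITORS)
    fee = starting_scrape_fee(MONTHLY_AD_REVENUE_USD, MONTHLY_HUMAN_PAGE_VIEWS, AI_CRAWLS)
    print(f"{ratio:,.0f} AI crawls per referred visitor")
    print(f"${fee:,.2f} per month at the per-visit average (hypothetical inputs)")
```

With the quoted numbers, the ratio works out to 20,000 crawls for every referred visitor. The dollar figure depends entirely on the placeholder inputs and, as noted in the conversation, is only where the negotiation starts.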

Starting point is 00:46:52 They read a lot of content, right? So the neat thing that is going to unfold is that if you look at the number of pages accessed on the internet today, most of that is driven by human consumption, right? Now, as these sort of AI applications, agents, as more of us use them, more and more of that traffic is going to become autonomous. We call them autonomous visitors. It just means there's not a human on the other side, whether it's a bot, agent, application, whatever it is, it's autonomous.

Starting point is 00:47:22 So it's not just a shift that, you know, the sort of page views go from human to autonomous. The amount of pages, the pie itself gets bigger. It's going to be bigger. Much larger. Exactly. Yeah. Now, this all works on the publisher side for very obvious reasons. They have enormous incentives to sign up with Tollbit or one of the other awesome companies that's working in your space.

Starting point is 00:47:45 I really think this is a cool market. I'm not going to lie. But you have to have the other side of the marketplace show up. And so reading public pronouncements from AI companies, even the federal government recently, it hasn't shown, to me, a dramatic willingness to start to pay for what they have previously often gotten for free. So, apart from signing up publishers, how are you doing on getting deals with the Perplexities and OpenAIs of the world that are driving an enormous number of RAG scrapes? Yeah. So we've begun our integration with a few agent builder platforms, actually, right?

Starting point is 00:48:22 So there are enterprise agent builder platforms where their customers, right, are Fortune 1000 companies. They're public companies. They want to do the right thing. They want to make sure they have proper licenses in place. It makes scraping, they have money to do it, right? And that's one of the, I think, surprising place where we've found that actually that integration actually helps and works. And it solves a problem for the other side. That process of partnerships is actually very difficult, very expensive, and very time-consuming. And remember when you gave that analogy about Spotify and Netflix, right, and piracy, I think one of the things, surprising things that we've found is that it's not that there was an unwillingness to pay.

Starting point is 00:49:05 There was an unwillingness to do partnerships. They thought that the partnerships process would take too long. It's too human-centric. It involves too many lawyers, for example, right? And terms no one can agree on. And when we can do that up front, get the cooperation of those sites, it actually makes the offering, you know, today our catalog of sites is actually public on Tollbit as well, right?

Starting point is 00:49:28 And as people go live, we add more to that, so that network becomes even more valuable. That's one of the places I think is driving a good number of the transactions. We have some of the longer-tail, smaller AI companies that are getting started, the application layer companies that are also onboarding. We actually had two in the last week that onboarded and started transacting with publishers. So we're bringing the demand for sure. So I just pulled up that site, which, by the way, I should have clicked on before our chat, my bad.

Starting point is 00:49:59 I read the rest of your website. It's a new page. It's a relatively new page. Well, I'm just laughing. 1,400 publishers, including names like USA Today, AP, HuffPost, Forbes, Time, TED, Mumsnet, Futurism, Fast Company, G2, Stocktwits, Inc. I mean, this is a lot. It does seem that you've done a very good job landing the publishers. Well, we're trying.

Starting point is 00:50:23 I think as sort of, you know, the last two quarters, three quarters have unfolded, and even as people come in for the analytics, I think they then start realizing, wait a second. The future is RAG, right? There is real volume on the other side. Yeah. And there's two things you have to do. I mean, when I was just going back to my thinking about Tollbit, I was thinking, okay, there's two things that have to happen. You have to understand how often AI is pinging or scraping your site, generally speaking, and then you need to set and enforce terms for those scrapes and pings. And that's why you guys have the analytics service and then also the monetization marketplace. I didn't touch on the first one as much, but I do want to point out

Starting point is 00:50:57 to folks that you guys actually do both things. Yes, absolutely. So, since we last spoke, $24 million Series A led by Lightspeed. It actually came out not that long after we chatted. I know that venture capitalists are an impatient bunch at times, and you guys are building in a relatively nascent space, and that can be difficult. Has the growth rate of the company in monetary terms

Starting point is 00:51:22 matched your expectations and projections that you made at the time of that investment? Certainly, I think so. I think Lightspeed is pretty happy with that investment as well. The publisher onboarding has just, it's been insane. I think our customer,

Starting point is 00:51:40 if you look at our go-to-market team, our customer success sort of headcount has also grown. I know that the catalog says 1,400 publishers. That's simply who's live. We have, I think, another 1,000 that are in the process of getting activated, right, where they're switching

Starting point is 00:51:55 from analytics to actually then setting up that bot paywall. So only after they set that up do they go live and get listed there. How close are you to a critical mass of people that have signed up to use the bot paywall? Because at some point, it'll become just easier to say, okay, fine, okay, fine. But that's easier to reach if you have more people. Have you hit that tipping point yet, or are you not quite there? I think it's hard to tell. You can only tell that in hindsight. I think we certainly did, it seems, knock on wood, hit a tipping point, say, with publishers and the messaging. And I think there's something to be said about there being a growing network effect

Starting point is 00:52:41 every time we add one of these publishers, right? And then every time the AI companies even, right? So obviously we integrate with a wide variety of partners who do some of that detection, and they have that stick, obviously, to forward traffic to that bot paywall. And it is interesting on the AI company side, even, the sort of perception that, you know, every time you're going to try to access one of these sites, you keep getting redirected to a Tollbit subdomain on these sites, right? Which is actually sort of insane, the sort of mindshare that Tollbit gets as a result of that sort of enforcement, right? And it is even just cool for us to see on the other side, right, when you see subdomains existing on so many of these publisher sites like tollbit.ted.com, right?

Starting point is 00:53:24 tollbit.npr.org, right? And bots keep getting forwarded there and being told, hey, you have to pay. I think it shows how ubiquitous that paywall can become. Yeah, and they're going to have to start coughing up. So let's look forward to the future and let's play a fun little game called, let's say, when? When do you think that Tollbit will have

Starting point is 00:53:45 facilitated, let's say, a big number, like $100 million in value from RAG scrapers writ large to publishers on the Tollbit platform? Yeah, I think it is hard to project the future, right? But if you look at the sort of growth in AI, if you look at the segments in which the RAG demand is emerging, right? In 2024, it was a lot of these application layer companies. In 2025, it's been enterprise AI agents, right?

Starting point is 00:54:24 We do think 2026 is going to be the year of consumer AI agents, where I'm still excited to see what is that first truly agentic consumer app. Maybe it comes in from the browser angle. It's still unclear. But if you look at the sort of growth of the sort of autonomous traffic on the internet, the economic forces out there, cybersecurity is becoming more expensive, CDN bills are going through the roof, referrals from these AI platforms are not coming through, I think the world will look very different

Starting point is 00:54:56 in three years. The unit economics of the internet will look very different in three years. Of that much I am 95% confident, and Tollbit's going to be a big part of that. Are you guys going to announce milestones that are financial? Like we have facilitated X million, X, X, X, well, I'm not going to say that, but like, you know what I mean? Like it goes up. Like, will that be a thing you guys talk about publicly? Yes, I'm sure that will be included in the State of the Bots reports. I know every so often on LinkedIn, I talk about how many tens of millions of transactions we facilitated that quarter.

Starting point is 00:55:28 So we try to be as transparent as possible. And I think what the powerful thing will always be, you know, it's not just about the stick. It's about solving problems for the other side, right? If you bring something that is truly valuable, then the AI companies actually are receptive. Yeah. I mean, people need data for their rag agents or their products. And no one wants that. There's a lot of people betting on them to do well.

Starting point is 00:55:50 All right, just to wrap us up, I want to touch on a pet topic of mine. One thing that I adore, Toshit, is when companies, especially startups, take some of their internal data and find a way to break it out to share with the world what they're seeing. Because startups are very private. They don't share that much usually. And we've mentioned the Tollbit State of the Bots report, which I've read a number of times because it's been so useful to my own reporting work. How much of a driver in terms of publishers reaching out to you and

Starting point is 00:56:17 so forth, has that state of the bots report been? Yeah, I think the state of the bots report, I think, serves an important purpose for us. Obviously, we put it out every quarter because it helps us sort of lift the veil. I think we realize how little good data there is out there about this, right? So as we onboard publishers, right, it allows us to sort of lift the veil and say, hey, on an anonymized level, this is sort of what we're seeing, right, across the board, right? This is what you should care about. these are the things that, you know, this is what we see in terms of referrals.

Starting point is 00:56:49 Let's put things in perspective, right? And I think as sort of publishers, you know, over the last, say, quarter, right? In Q2 of this year, there was a big focus on, say, answer engine optimization, right, caring about how you show up in some of these AI platforms. Yeah, GEO. GEO, right? I think it was sort of useful for publishers to navigate their thoughts around that as they, again, practical example, right?

Starting point is 00:57:16 Imagine that publisher in the UK. They had actually wanted to get crawled and indexed, right, and show up in these AI search platforms. And then we showed them how many referrals they got from them. It was 650. Then they started doing the math of, well, what did that 650 cost us in terms of 13 million autonomous visitors every month

Starting point is 00:57:36 and our CDN bills that we have to pay as a result of that, right? That's when they did the math and said, no, that's not fair anymore, right? So the bot report is a way so that, you know, I think the entire industry can become a little wiser, right? So that instead of just looking at your own data, you can look at what your peers, what the industry is seeing, what the entire internet, well, not the entire internet, but a good portion of the publishers on Tollbit are seeing, right? So it helps us just make our publishers a little wiser about the decisions that they're making. Okay, I appreciate that. So, Toshit, when you do have the next milestone in terms of either a publisher count, an AI model partnership

Starting point is 00:58:12 or really just a revenue milestone, we'd love to have you back. But in the meantime, where can people find you online? And what is a role you're having a hard time hiring for? I think where people can find us online is definitely at tollbit.com or, you know, follow us on LinkedIn. We're posting snippets, excerpts, data points, thoughts there randomly throughout the quarter all the time. And then a role that I think we're actively interviewing for and talking to a lot of people for is a product partnerships manager, right? So this is our person, a rock star who can sort of manage our relationships with our CDN partners, right? So the Fastlys, the DataDomes, the HUMAN Securitys, right, with our cybersecurity partners, right?

Starting point is 00:59:02 And manage those integrations and make sure that both sides are successful. All right, Toshit, it's been an absolute pleasure. Best to you and the company, and we'll talk to you soon. Thanks.
