This Week in Startups - Orchestrating Smarter AI Systems with AI21 Labs’ Yoav Shoham | AI Basics with Google Cloud

Episode Date: July 10, 2025

In this episode of AI Basics, Jason sits down with Yoav Shoham, Stanford professor emeritus and co-founder of AI21 Labs, creators of Jurassic-2, Wordtune, and the new orchestration system Maestro.

They unpack:
Why enterprise AI struggles with reliability
What orchestration really means (and why LLMs alone aren't enough)
The pitfalls of “agent-washing”
Small vs large models, agent-to-agent protocols, and where real opportunities lie

This one is for founders building with AI: if you're navigating hallucinations, chasing automation, or exploring multi-agent workflows, this episode is a must.

Timestamps:
(0:00) Yoav Shoham joins Jason to discuss AI Basics.
(1:05) What AI21 Labs is building, from Jurassic-2 to Maestro
(5:49) The overuse and confusion of the seductive term “agent”
(8:22) Small models vs large models: what's best for enterprise?
(10:53) The verticalization of AI: legal, accounting, and beyond
(12:17) The challenge of agent-to-agent communication and shared semantics
(17:08) What’s overhyped vs underhyped in AI: Yoav’s advice for founders

Uncover more valuable insights from AI leaders in Google Cloud's 'Future of AI: Perspectives for Startups' report.
https://goo.gle/futureofai

Explore Further
Google Cloud’s Report: The Future of AI
Get insights from 23 leading experts on how startups can leverage AI for real business impact.
👉 Read the full report

More AI Basics Episodes
Founders, operators, and builders: catch up on our full Startup Basics series (Legal, Finance, Growth & AI):
👉 thisweekinstartups.com/basics

Links from episode:
Google Cloud: Build smarter with AI tools built for startups
AI21 Labs: Makers of Wordtune, Jurassic-2, and Maestro
Maestro: Orchestrate AI agents and tools across workflows
Jamba: AI21’s hybrid language model with breakthrough efficiency

Learn About Google’s A2A (Agent-to-Agent) Protocol
The next evolution in agent interoperability and coordination:
Official A2A Announcement
Hands-on A2A Tutorial (Medium)

Follow Yoav:
X: https://x.com/yshoham
LinkedIn: https://www.linkedin.com/in/yoavshoham/

Follow Jason:
X: https://twitter.com/Jason
LinkedIn: https://www.linkedin.com/in/jasoncalacanis

Follow TWiST:
Twitter: https://twitter.com/TWiStartups
YouTube: https://www.youtube.com/thisweekin
Instagram: https://www.instagram.com/thisweekinstartups
TikTok: https://www.tiktok.com/@thisweekinstartups
Substack: https://twistartups.substack.com

Transcript
Starting point is 00:00:00 All right, everybody, welcome back to This Week in Startups. It's time again for our AI Basics series. What is this? Man, founders ask us all the time. The same questions over and over again. So we have done basics for legal. We've done it for marketing and growth tactics, obviously accounting. And here we are in the age of AI. People need to understand the best practices. And hey, it's moving pretty quickly. If you want to get caught up on your AI basics, go ahead and download this fantastic report by our partner, Google Cloud. It's called The Future of AI: Perspectives for Startups, featuring insights from 23 top AI experts. And today, one of them is with us. Yoav Shoham is here. He's a Stanford professor emeritus, as you know, and he is the co-founder of AI21 Labs, the team behind Jurassic-2 and Wordtune, building large language models and tools for collaborative reasoning. Welcome to the show.
Starting point is 00:01:02 It's fun to be here. Thanks for having me. Let's talk a little bit about what we mean by reasoning, you know, here in 2025. Are these machines actually reasoning? And how do you get the best out of them? Right now, so many experiments are going on in corporate America, so many people are testing and starting to deploy AI technology built from, or on the basis of, large language models. But there is some concern about the reasoning. Are these making the best decisions,
Starting point is 00:01:33 hallucinations, et cetera? So what is the state today? And maybe tell us a little bit about your company, AI21 Labs. Maybe I'll start with the latter. AI21 Labs. It's about seven years old. We're definitely one of the main LLM builders, although we took a slightly different tack than most people recently with our Jamba family, which is not a pure transformer architecture, for efficiency reasons, and we can speak about that. Most of our effort now is around orchestration and planning for all these complex AI systems, in particular a product we call Maestro, which we released. But that's directly relevant to your question, Jason. So, you know, we in
Starting point is 00:02:17 AI are guilty of using terms that are so ill-defined that they come back to bite us. And we can draw a long list from AGI to agents to, you know, reasoning. But if I step back from the actual terms, the issue, as you pointed out, in the enterprise is that there's a ton of experimentation. Like, three years ago, maybe, nobody paid attention. You couldn't get a CEO or, you know, a chief innovation officer to pay attention. Now, everybody's on top of that. But for all the hundreds of use cases in a given company that you see, the number of deployments is very small. And the main reason is the issue of, well, there are many reasons. Issues of compliance and safety and, you know, use cases, new technology.
Starting point is 00:03:03 It's all good, but the main reason is reliability. And again, the term hallucination is maybe not the best term, but these are probabilistic machines. And therefore sometimes, often, they'll give you brilliant output. But if you're brilliant 95% of the time, and not just wrong but total garbage 5% of the time, that may be okay in consumer land but not in the enterprise. And so there are many studies that show that the ratio of experiments to deployments in the enterprise is like 10 to 1, 20 to 1. And so that is super relevant. What it means is there's an enthusiasm for the technology. But when it comes to mission-critical applications, if you're doing something in accounting, if you're doing customer support, obviously if you're a developer and you're pushing code to a server, it needs to be a lot more reliable.
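To put rough numbers on the 95% figure, here is a minimal sketch of what happens when several model calls are chained in one workflow. It assumes, purely for illustration, that each step succeeds independently with probability 0.95; the function name and the step counts are hypothetical and not from the episode.

```python
# Illustrative assumption: every step succeeds independently with the same probability.
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Probability that all steps in a chain of independent steps succeed."""
    return per_step ** steps


for steps in (1, 5, 10, 20):
    rate = workflow_success_rate(0.95, steps)
    print(f"{steps:>2} steps: {rate:.1%} chance that the whole workflow succeeds")
```

Under that assumption, a ten-step workflow finishes cleanly only about 60% of the time, which is one way to read why "brilliant 95% of the time" is not enough for mission-critical use.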
Starting point is 00:03:57 And that's why humans are in the loop. But with Maestro, I think the concept, and you'll correct me here if I'm wrong, is we have many different LLMs and having them, as crazy as it sounds, work in concert with each other. Interesting, concert, Maestro. And having them check each other's work, maybe trying to get them to justify their answer, as it were, can result in better output and more reliable systems, correct? Largely yes. Let me nuance this. Please. So some would like to take the LLMs
Starting point is 00:04:36 or, as the new version of them is sometimes called, LRMs, large reasoning models, which is really a misnomer, but systems like o1, o3, R1, and try to get them to behave through guardrails and alignment efforts and everything. And all this is good to do and we do that. But that will never iron out the variance in the models.
Starting point is 00:05:00 And so what you really need to do is to put logic on the outside, to orchestrate, and it's not just a matter of routing between, you know, this LLM or that LLM. Sometimes you run code, sometimes you use a tool. You know, you'll access the database. You'll call a weather API, what have you. And something needs to orchestrate all that, and you're absolutely right that you want to do not only system testing, but unit testing.
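The idea of putting logic outside the model and unit-testing each step can be sketched roughly as follows. This is a toy illustration, not how Maestro works: `Step`, `orchestrate`, `word_count_between`, and `fake_draft` are hypothetical names, and the "model call" is a stand-in function so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[str], str]        # hypothetical: an LLM call, a tool, or plain code
    validate: Callable[[str], bool]  # cheap check applied to this step's output


def word_count_between(low: int, high: int) -> Callable[[str], bool]:
    """Build a validator that counts words in plain code instead of asking a model."""
    def check(text: str) -> bool:
        return low <= len(text.split()) <= high
    return check


def orchestrate(plan: list[Step], task: str, max_retries: int = 2) -> str:
    """Run an explicit plan, validating every step before moving to the next."""
    data = task
    for step in plan:
        for _ in range(max_retries + 1):
            output = step.run(data)
            if step.validate(output):
                data = output
                break
        else:
            # No attempt passed validation: fail loudly rather than pass garbage on.
            raise RuntimeError(f"step {step.name!r} failed validation")
    return data


def fake_draft(prompt: str) -> str:
    """Stand-in for a model call; always returns a 700-word string."""
    return " ".join(["word"] * 700)


plan = [Step("draft", fake_draft, word_count_between(600, 800))]
result = orchestrate(plan, "Write a summary")
print(len(result.split()))
```

The design choice worth noting is that the validator is ordinary code. A judge model could be plugged into `validate` for checks that need language understanding, but deterministic checks stay deterministic.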
Starting point is 00:05:27 So every step of the way you have an explicit plan, and every step of the way you want to, as best you can, validate how well you're doing, and sometimes you'll do it with the language models,
Starting point is 00:05:39 what's called a judge language model. And often you'll just, for example, if you want an output to be 600 to 800 words long, just do the damn counting. Let's talk about agents specifically. This promise
Starting point is 00:05:53 has captured people's imagination because agents feel like a way for humans to stop doing chores. So much of what we do when we go to work, when we try to build a business, is the actual product and the service we're providing to people, but we all have to do our chores. And cleaning stuff up, normalizing data, just work that humans find monotonous, most humans,
Starting point is 00:06:20 and they don't like to do, like doing dishes. Agents feel like the proper solution for those. How close are we to having agents at scale doing these repetitive tasks? How often are people actually building an agent that is, in 2025 when we're recording this, getting the chore done reliably enough that humans can forget about that chore? Your mileage varies. Depends on the chore and depends on the quote-unquote agent. The problem is that people have been using the term agent now,
Starting point is 00:06:55 it's so seductive, for anything that smacks of any kind of automation. And it'll come back to bite us. I call this agent washing. And so you're absolutely right that the biggest bang for the buck is when you try to get the technology to take care of fairly simple, fairly mundane stuff, kind of like robotic process automation on steroids. More and more stuff can be automated. Is that an agent or is this simply a program that you wrote that maybe is an LLM?
Starting point is 00:07:26 We don't need to get anal about the definition. But typically, when we speak about agents, what do we have in mind? We have in mind a system that's not a transactional call to an LLM. There's something that's more ongoing. The AI system, the agent, can be proactive, not just respond to our prompt. It executes complicated flows, not just like a one-step thing. It uses multiple tools. And as you do this, it gets dicier.
Starting point is 00:08:00 Because if a single call to an LLM carries some uncertainty, when you start to compose them, at some point you get more noise than signal. And that's where I think maybe some people are getting a little ahead of themselves. So simple, rote, repetitive stuff, yes. More complicated stuff, we have work to do. It seems like there's been an investment in smaller, more narrow models. And some debate about that. Some people believe the large models will figure it all out eventually.
Starting point is 00:08:32 Other people think, hey, why not make a smaller model faster, cheaper, better, and more constrained? Maybe you could talk about the state of SLMs. I think the short answer is in consumer land, if you're looking for a very general-purpose chat, let's call a spade a spade, a ChatGPT-like experience, there's probably no replacement for a very large language model that is being aligned and tuned to cover a huge variety of cases, because you can't anticipate the variety of input you'll get from consumers. As you go to the enterprise and your needs are much narrower, there's several reasons to go narrow.
Starting point is 00:09:12 First of all is just cost. It's cost not only in terms of dollars, but also latency. And so, you know, and as you know, we came up with this new family of models called Jamba. It is a hybrid state-space model and transformer
Starting point is 00:09:30 that, in terms of the quality of the answer you get, is competitive with similarly sized models. In terms of latency and memory footprint, there's just no comparison, especially as the input, the so-called context length, increases. That kills you. The transformer architecture,
Starting point is 00:09:48 this is what, in 2017, the famous paper from Google, thank you, Google, really moved the needle. Suddenly stuff happened in language that hadn't happened before, which did happen in vision, and the reason is that the attention mechanism
Starting point is 00:10:02 of the transformer allows you to, as the term suggests, attend to very disparate parts of the input. In vision, it doesn't so much matter. To know that this here is a phone, it doesn't really matter what the pixel way over to the side is. But in language, there's nothing local. So that made the difference.
Starting point is 00:10:20 The problem is that it's expensive. It's quadratic complexity in the input, or the context length, as we call it. Now, when we had input of a thousand, a thousand squared is fine, but we're now pushing a million, and a million squared is not fine. And so you need to deal with that. Part of it is smaller language models. That doesn't quite deal with the context length side of things. But then rethinking the architecture.
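The "a thousand squared is fine, a million squared is not" point can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it counts token pairs as a proxy for attention cost and compares that with an idealised linear-cost baseline, ignoring constants, model width, and hardware. The function names are hypothetical.

```python
def pairwise_cost(context_length: int) -> int:
    """Token pairs that full attention compares: grows with the square of the length."""
    return context_length ** 2


def linear_cost(context_length: int) -> int:
    """Idealised linear baseline: one unit of work per token."""
    return context_length


for n in (1_000, 1_000_000):
    ratio = pairwise_cost(n) // linear_cost(n)
    print(f"n={n:,}: quadratic={pairwise_cost(n):,}  linear={linear_cost(n):,}  ratio={ratio:,}")
```

Going from a thousand to a million tokens multiplies the input by 1,000 but the quadratic term by 1,000,000, which is why context length, not just parameter count, drives the cost.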
Starting point is 00:10:46 So the state-space model, which is inherently linear and not quadratic, and mixing in just a little bit of transformer, it gives you the best of both worlds. What about verticalization of knowledge? Is there a movement to say, hey, this small language model is going to focus really on accounting,
Starting point is 00:11:02 to just put it in business categories. This one is just really amazing at legal concepts. And you know that when you throw this legal brief into it, or on the accounting side, you throw this RFP into it, these are huge context windows, dumping cases of case law and trying to process them. Is it just about the speed and the cost, or is it also about the accuracy because the model has been constrained to not have to worry about, oh, all the movies and songs ever written in the world, and every blog post about those songs and music in the world? Is it actually going to result in a higher fidelity of content?
Starting point is 00:11:41 Short answer, yes. I'm sorry, I'm always nuanced. But the longer answer is that there's some general common sense that the baseline model has learned that you want to retain, even if it's just, you know, mastering correct English, you know, language, grammar. And so as you attain more domain-specific language, it's okay to forget certain things but not the others. So the art here is to just remember the right stuff. Let's talk a little bit about agent-to-agent protocols. People don't know. A2A is a protocol for
Starting point is 00:12:23 interoperability between agents. You know, technologists are always looking at what's around the corner. If you do get your agents working well, let's say the accounting department agent doing purchase orders and paying bills, eventually you might want that agent to interact between companies, maybe to put an RFP out to get five companies to bid for, I don't know, the new shed you're building or the new software that you want written. An agent-to-agent protocol is going to solve for that. But this is, we're talking within the last 60, 90 days, this stuff is all starting to be publicly released by Google, other players. And we're starting to see some consensus from different technology companies and data sources that this is the next
Starting point is 00:13:14 big thing. Has anybody actually got this in deployment now, in your experience? What are the early results like? What are people thinking this will do? No, I think it's too early for anything to have been in production, even in the ideal scenario. So this is not a knock on A2A, just too early. So first of all, I think kudos to Google for shepherding this, and a good start. But it's just a start, and we have to realize the limitations. So the vision, there's several things that excite people when they hear about multiple agents coordinating. Part of it is something for nothing.
Starting point is 00:13:51 Oh, I don't need to think hard about the problem. I'll just build a bunch myself. We'll build a bunch of agents, and then magic will happen when they come together. That historically has led to disappointment. Often the magic is in the glue. It's good to compartmentalize and factor things out. That's always good. But often the magic is how you put together things
Starting point is 00:14:15 and what the algorithm is. But I think, as you said, the promise is that it's not only my agent speaking to my agents, it's my agent speaking to other agents in my company that I didn't build, but also outside my company. And here, I think not so fast.
Starting point is 00:14:35 There are two fundamental problems. One is, if you look at the protocol, there's a part of what the agent communicates in JSON, which is its capabilities, and other stuff, like the contract it has with other agents. The problem is it specifies the syntax, but not the semantics, not the meaning. And that historically has been the pitfall of distributed object systems, that objects advertise their capabilities, but there's no reason for me, for my agent, to understand what you meant when you put in language, I know how to find, you know, good flights. Well, what is good flights? Does it mean efficient in time, you know? So sharing semantics, it can be done, but it's a big undertaking, it's a community kind of activity. So that's number one. Number two is shared incentives. If you go outside the boundaries of even my own unit in the company, because we don't always share the same incentives,
Starting point is 00:15:40 even if we're in the same company, let alone. So for example, I have an agent that's trying to put together an itinerary for me and book a flight. And it's speaking with your agent, and you're maybe an agent from one of the airline companies. We do not have the same incentive. And so you need to put in some control for that. And actually, wearing my academic hat, I spent a good fraction of my academic work on the area of multi-agent systems.
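The two problems described here, shared syntax without shared semantics and differing incentives, can be illustrated with a toy example. Everything in it is hypothetical: the `capability` dictionary is an invented stand-in and is not taken from the A2A specification, and the flights and agent functions exist only for illustration.

```python
# Hypothetical capability advertisement. Field names are illustrative only
# and are NOT the A2A schema.
capability = {"skill": "find_good_flights", "input": "route", "output": "flight_id"}

flights = [
    {"id": "A", "price": 300, "hours": 9},
    {"id": "B", "price": 450, "hours": 5},
]


def cheapest_is_good(options):
    """One agent's reading of 'good': lowest price."""
    return min(options, key=lambda f: f["price"])["id"]


def fastest_is_good(options):
    """Another agent's reading of 'good': shortest travel time."""
    return min(options, key=lambda f: f["hours"])["id"]


def priciest_is_good(options):
    """A seller-side agent with a different incentive: highest fare."""
    return max(options, key=lambda f: f["price"])["id"]


# All three agents could advertise the same well-formed capability,
# yet they return different answers for the same request.
for interpret in (cheapest_is_good, fastest_is_good, priciest_is_good):
    print(capability["skill"], "->", interpret.__name__, "->", interpret(flights))
```

The advertisement parses identically in every case, so a syntax-only check cannot tell these agents apart. Agreeing on what "good" means, and on whose interests the answer serves, has to come from somewhere outside the message format.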
Starting point is 00:16:11 In fact, we have a standard textbook in the area. And a lot of it has to do with crafting protocols for multiple agents to get them to play nice together, even though left to their own devices, they wouldn't. A lot of game theory and stuff like that. If you think about what we went through when we tried to have a semantic web exist. Oh, hey, you're a chef, and, a really silly example,
Starting point is 00:16:37 but you're putting your recipes online. We'd like you to make your recipes semantic. These are ingredients, these are steps, you know, and here's what the output looks like, and here's the origin, and here's the temperature. We want all this stuff to be semantic. It was like, okay, yeah, I'll do all that for you. And then, you know, a bunch of people
Starting point is 00:16:52 just scrape all your recipes and you get less traffic to your website. And the promise was, oh, I would get more traffic. People would search for these three ingredients and, you know, my recipe might come up. So the devil is in the details there. And, you know, thinking it through is critical. Let's end on this. What's hyped?
Starting point is 00:17:09 What's overhyped? You've got a lot of founders listening here on This Week in Startups. They're building stuff. What's something they should be doing now that's obvious and going to pay dividends for their startup? What are things that, hey, maybe agent-to-agent falls into this category? You could become aware of. There's an opportunity here,
Starting point is 00:17:27 but it might be a bit too early to get significant gains from it. Where should they be focusing their time and effort in your mind? Well, presumptuous for me to give a definitive answer because there's so many kind of degrees of freedom here, but I think that if you look for the maybe under-hyped opportunities, maybe I'll mention two.
Starting point is 00:17:47 One is the boring stuff. You know, getting workflows to be reliable, and again, I'm thinking enterprise, that's kind of the lens I look through. That is the biggest blocker in the enterprise right now, getting the workflows to be reliable and customizing them per deployment. That's hard work, but that's where I think the real pain is. The thing is, it's not sexy.
Starting point is 00:18:15 If you give a demo that did something amazing once or maybe many times, that's sexy, but it's not sexy to show that things don't fail. But that's where the real value is. I think that's where I, you know, one area I would focus in. Yeah, the area where I don't know that it's, you know, under-hyped, but I think it's underserved, is education. You know, I was there in the early days of online courses, you know, Coursera and Udacity,
Starting point is 00:18:44 starting in my corridor at Stanford. And, you know, I have an online course on game theory that's been seen by over a million people, but it doesn't begin to scratch the surface of what the real opportunity is in proactive, per-student teaching. The issue is not how to get ChatGPT out of the classroom so people don't cheat.
Starting point is 00:19:12 The issue is how to get the technology in the classroom and rethink what education is really about and how we do it right with technology. So that's an area that I would really like to see kind of blossom. It is super interesting. The first step was getting all those courses online. And as somebody who went to Fordham University,
Starting point is 00:19:33 didn't quite hit the Ivies, I was always jealous, like, what's going on, you know, at MIT, at Stanford? Like, how different is it than my experience? And I was feeling particularly under-resourced in macroeconomics, right? And I was like, I really want to understand this. And I just went to YouTube,
Starting point is 00:19:52 and I found courses at Stanford, MIT. And I watched the courses and I was like, wow, this is like alchemy. They're trying to figure out how macroeconomics works. But the fact that it was available for free on YouTube, the same course that people were paying for and had to qualify to be in the 0.1% of people on the planet to get there, was available for free. And now you imagine if it was adaptive learning. And you could answer some questions up front and then it said, yeah, you know, you should really start with this third video, or actually you're not ready for the first two videos. You should do this pre-calculus and maybe the statistics course first before you get in there, learn some basics about statistics
Starting point is 00:20:34 so you can actually understand the material better. And that would be so amazing to have it be adaptive and not leave any, because as a professor, you wind up leaving some students behind. And at what fork in the road did they get disengaged? That's always the question, yeah? Absolutely. You know, it's typical. When a new technology comes on, you try to use it the way you used the old technology. So television initially was televised radio.
Starting point is 00:21:02 And then over time, you understood what the medium is really good for. I think the time is ripe now with, you know, the flat world and everybody has access to, you know, computers and networking, to get AI to rethink education. Yeah, absolutely. And you think about the role of professors, creating courses, creating quizzes, even. You can go into any LLM today,
Starting point is 00:21:26 ask it to take Great Expectations, Charles Dickens, and give you a series of Q&A and be your coach, and you just give it that tiny prompt. And you could sit there with your phone and have a personalized tutor. You would have had to spend, you know, whatever, days or weeks to find them
Starting point is 00:21:42 and pay hundreds of dollars, and it's just available to everybody for free today. What an amazing discussion. Thanks so much for joining us. Thank you. It's really fun. Thanks for having me. You can learn more
Starting point is 00:21:54 in Google Cloud's report, The Future of AI: Perspectives for Startups. Go to goo.gle/futureofai. That's goo.gle/futureofai. Go check out Gemini. I love that Deep Research. And everybody, if you want to get more of our Startup Basics series, from legal to accounting to marketing and now AI,
Starting point is 00:22:15 go to thisweekinstartups.com/basics. Thanks again for listening. We'll see you next time.
