The a16z Show - Remaking the UI for AI

Starting point is 00:00:01 The cost required to bring a new form factor to market is just insane. We've had a reasoning breakthrough with large language models and generative models, and they're really hungry for new kinds of input and context that the current generation of interfaces is not providing. The North Star of computing has always been to borrow like a Steve Jobs quote to be the bicycle for the mind. The history of hardware has been the history of computers. Hey, the rate limiter here is silicon. It's sand. And there's tons of sand in the world. And eventually, somebody else will figure out how to make sand that does the same thing. We have adapted to interfaces over the last 60 years. But the way we interact with computers today is actually dramatically unnatural.

Starting point is 00:00:44 Hello, everyone. Welcome back to the A16Z podcast. Now, if you're a regular around here, you probably have a sense for just how important the topic of artificial intelligence is to A16Z. In fact, it is so important that we decided to create a new AI-specific podcast, which, to no surprise, is called AI plus A16Z. And no, we did not consult AI on naming. Today, you'll get to hear one of the early episodes from that podcast, hosted by venture editor Derrick Harris. Derrick sits down with A16Z general partner Anjney Midha to discuss a very important question. Here it goes. If we have a new software platform through large language models, do we also need to rethink our hardware from the ground up?

Starting point is 00:01:25 Do our interfaces today keep pace with the data requirements across input, reasoning, and output? In other words, what does the UI look like for AI? Will it look like a phone or something completely different? And what can we learn from the millennia of human behavior, long before this technology existed, that might actually inform what's to come? Of course, there are many companies trying to solve this right now, but this conversation may actually give you clues as to why no one has quite solved this yet. Again, this episode comes straight from our new AI plus A16Z podcast.

Starting point is 00:01:55 So if you like this episode, don't forget to subscribe to get all of A16Z's latest AI content, including episodes around the software supply chain, open source, RAG, and a whole lot more. Of course, we'll include a link in our show notes. And with that, Derrick, take it away. Hi, this is Derrick Harris, and you're listening to the A16Z AI podcast, where we dig into all things artificial intelligence with our in-house team of experts, as well as the founders, engineers, and researchers working at the state of the art. In this episode, I speak with A16Z general partner Anjney Midha about how AI hardware will look in the years to come,

Starting point is 00:02:34 and why there's so much innovation yet to happen at the inference layer. Among other things, he explains how he sees wearable devices evolving to take advantage of improvements in sensors and workload-specific chips, and how the introduction of big company technology, like the Apple Vision Pro, can actually lay the foundation for startups. But because we recorded right after Nvidia's big GTC event, as with our previous episode with Naveen Rao, we kick off the conversation talking about training workloads versus inference workloads and how Nvidia came to dominate the former category. As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or

Starting point is 00:03:22 potential investors in any A16Z fund. For more details, please see A16Z.com slash disclosures. I saw recently, like, Ollama released support for AMD GPUs. I think I saw someone compare Nvidia to Sun Microsystems in the early days of the web, which seems like it might be wishful thinking. Are skeptics kind of underestimating the stranglehold that Nvidia has? There's the two schools of thought, one of which is that Nvidia's stronghold on this is completely transitory.

Starting point is 00:03:53 Their margins are overinflated because of supply chain crises over the last 24 months, where we had this explosion in demand, but production is going to catch up. And, you know, basically those folks, that school of thought, will tell you, hey, the rate limiter here is silicon. It's sand. And there's tons of sand in the world. And eventually, somebody else will figure out how to make sand that does the same thing. There's tons of it on planet Earth.

Starting point is 00:04:14 And I personally find that view provocative, but reductive, because it doesn't honor a first-principles analysis of the fact that training has these idiosyncratic needs, like a really robust software driver layer, right, that can orchestrate thousands of these chips acting in unison. I think, yes, on that side of the debate, I'm certainly one who believes that the developer experience that Nvidia started, by the way, investing in a decade ago is in its sort of later stages of compounding right now. And it's really hard to dislodge that. What we may see is margins compress over time because the budgets are shifting from training to inference. And I think that's actually where a lot of the exciting stuff is happening. And I know we're going to

Starting point is 00:04:54 spend the bulk of time today, hopefully, talking about inference, because that's open season right now. Right. Right. Every time you have a new compute, a new software primitive, it often results in new kinds of workloads that the incumbents have a harder time keeping up with. And I would argue the inference workloads, like you mentioned with Ollama, for example, are entirely new kinds of compute workloads we haven't seen before. And so that's a much more even playing field. I don't think there was a way for Nvidia to invest in that kind of workload 10 years ago, because it just didn't exist, whereas training fundamentally has in some shape or form been around for the better part of a

Starting point is 00:05:29 decade because deep learning has been around for that long. And actually, I think this is a good time to introduce what I think is a useful mental model that I have about the future of hardware. And I've found there are several ways to reason about hardware. One, you can work backwards from the customer. Who is the customer here and what do they need? What are their pain points? And then another way you can reason about it is reasoning from history, to see what the progression and evolution of compute has been over time. And I think the history of hardware has been the history of computers, right? And in my mind, if you look at the last 60 years or so of computers that we've had, and this is basically modern computing, one popular way to reason about it is often

Starting point is 00:06:04 the hardware versus software split. But I think there's another way to reason about it, which I'll give full credit to one of our founders, you know, Ankit Kumar, who spent a lot of time at Discord building the first AI bot there and was reasoning about how to expose language models to large numbers of users before a lot of people got the chance to experiment with these large language models. He basically believes there's two lineages of computing. There's reasoning or intelligence, and then there's interfaces. And you can kind of go back in time over the last 60 years

Starting point is 00:06:36 and basically break down every major computing revolution we've had to some fundamental progress in either the reasoning part of computing or the interface part of computing. And if you traverse that tree, if you go all the way back to the first neural networks in 1958, that's when we start to see reasoning start to happen through neural networks. And then that gave way to probabilistic graphical models in the 80s. That then led to this idea of GPU-accelerated deep learning in the 2000s, which then gave way to transformers in the late 2010s. And then now we're

Starting point is 00:07:12 at this next phase of, like, massive transformers. And you can say, okay, that's one lineage, which is the lineage of reasoning and intelligence. In parallel, we've had this other lineage in computing, which is interfaces, right? And you started with the command line and keyboards. And then that kind of gave way to the GUI with a mouse when Steve Jobs got inspired by Xerox PARC in the 80s. And then that has led ultimately to mobile interfaces with touch as an input mechanism. And then I think the question is where are we going next? And I think we have really good reasons to believe that the next interface will be an AI companion. That's some combination of text, voice, and vision that can understand the world. That's almost a better predictor

Starting point is 00:07:52 of where hardware is going, because the history of computing has so far shown that whichever one of those lineages is undergoing a moment of resonance with customers ends up dominating for the kinds of workloads that then get to scale for the next 10, 15 years. And I think we're in the middle of both a reasoning and an interface shift. And that's what's exciting right now. Right. It seems, if you look at it how you're explaining this, I would say the smartphone interface is pretty well established at this point, which runs some AI inference. Like, our computers have these inference chips in them, but those are standard, well-known interfaces at this point.

Starting point is 00:08:33 And what's new seems to be on the reasoning side: the LLMs and foundation models and this ability to interact with a model that way. So it stands to reason, if I'm hearing you correctly, right, this is where we are on the reasoning side of things. So the hardware interface now needs to take that step forward. That's exactly right. I think basically we've had

Starting point is 00:08:51 a reasoning breakthrough, like you're saying, with large language models and generative models. And they're really hungry for new kinds of input and context that the current generation of interfaces is not providing. And I think that's why we're seeing folks tinker with interfaces or experiment with new interfaces for the first time in ways that just weren't possible before, because the reasoning capability wasn't there. So what's the hurdle, I guess, of existing interfaces, in the sense that we walk around with smartphones that seem pretty powerful. We've had voice recognition for a while. We've had devices like, you know, these Amazon Echo devices in our homes or whatever that we could talk to. And, you know, they run some sort of model off

Starting point is 00:09:30 in the cloud somewhere. Where's that step function or that improvement from what we have today, which seems pretty capable in some ways, to what you're explaining, which is like a whole new way of, I mean, maybe data capture as a primary feature is kind of the jumping off point. The North Star of computing has always been, to borrow a Steve Jobs quote, to be the bicycle for the mind, right, which is to ultimately express or translate human thought into some set of actions in the world that then allow humans to accomplish what they want to do in ways that just aren't possible without the leverage of a tool. Right. And so computers in their most grand, romantic expression are tools for thought and action in a way that allows us to accomplish things we never could without those tools. And so if you ask, okay, well, why aren't computers able to help us accomplish that North Star today? There's a whole host of reasons. But I think if you start at the top of that list and work your way down, the first one is they're pretty dumb, right, at inferring our intent about the world. Humans, we have adapted to interfaces over the last 60 years. But the way we interact with computers today is actually dramatically unnatural, mostly because we are compensating for the lack of the computer's ability to understand our intent and translate that into action

Starting point is 00:10:47 proactively. There's this paradigm in computing, which is declarative versus imperative, right? The idea being, with imperative, you are extremely prescriptive about what you want the computer to do. You say, open up this file, and then, you know, here's a set of instructions I want you to do, almost like you were instructing a toddler, right? And declarative is when you just declare, essentially, a goal that you have in mind, the way you would interact with an adult, and say, hey, please go book a flight for me, or whatever your goal is. And then it goes and reasons about how to do that. And humans are phenomenal at doing that. And computers are nowhere close to translating thought into action.
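To make the imperative versus declarative distinction above concrete, here is a minimal Python sketch. It is only an illustration: the `Flight` type, the `FLIGHTS` data, and both function names are hypothetical and do not come from the conversation. The first function spells out every step; the second only states the goal (the cheapest flight to a destination) and leaves the "how" to the language runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    # Hypothetical record used only for this illustration.
    code: str
    destination: str
    price: int


FLIGHTS = [
    Flight("AA10", "NYC", 420),
    Flight("UA22", "NYC", 310),
    Flight("DL05", "SFO", 150),
]


def cheapest_imperative(flights, destination):
    """Imperative style: spell out each step the computer should take."""
    best = None
    for flight in flights:
        if flight.destination != destination:
            continue  # skip flights going elsewhere
        if best is None or flight.price < best.price:
            best = flight  # remember the cheapest match so far
    return best


def cheapest_declarative(flights, destination):
    """Declarative style: state the goal and let the runtime work out how."""
    matches = (f for f in flights if f.destination == destination)
    return min(matches, key=lambda f: f.price, default=None)
```

Both functions return the same answer; the difference is how much of the procedure the caller has to prescribe. The "book a flight for me" request in the conversation sits much further along the declarative end than either function here.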

Starting point is 00:11:21 And so I think the big limitation is that computers just aren't smart enough to translate thought into action. And so if you ask what's the big blocker there, if generative models are in fact capable of reasoning, then why haven't we seen computers cross that reasoning step, the intelligence step? And I think there's two big problems. There's a fundamental context problem. The way generative models work is they're only as good as you can prompt them, right? They're only as good as the questions you ask them. That's why it took ChatGPT, which is essentially just a slightly different context form factor to ask questions of GPT-3, and GPT-3.5, which had been around as a raw next token

Starting point is 00:12:04 prediction endpoint for like eight, nine months before ChatGPT came out, but nobody really did anything interesting with it. And then when they packaged it up as a chat interface, the rest is history. It turns out just allowing humans to prompt the model and talk to it and give it context in a way that's useful makes a massive difference in the value you can derive from these models. While we've seen how valuable that is for pure text generation, when it comes to taking action in the world and doing something, predicting the next action that you want to take and then taking that action for you, we're nowhere close to that interface being able to see what you're seeing, listen to what you're listening to, hear what others in the room are hearing, look at where your eyes are tracking, and infer all of that context about what the human wants to achieve and then proactively do that for you. Because right now, we're kind of basically forcing these models to try to understand us through a little straw, right, where all they can get is a tiny representation of reality through what we can type via text chat. But we're losing all the other context about reality. And so there's an

Starting point is 00:13:01 interface problem, where there's no interface today that seamlessly just captures the entire world that you're interacting with and then translates that into a prompt for these generative models. Now, the solution space is fascinating to think about, because you could argue, well, the smartphone is probably the most densely packed world sensor rig that we've ever had. Right. Just the one I have here has three rear-facing cameras, one forward-facing, a full depth sensor, RGB sensing on there. It has an accelerometer for motion. It has GPS. It knows where I am.

Starting point is 00:13:36 It's got microphones in there. It can see what I'm seeing. And so you might go, well, what are you talking about? Billions of people today now have a full sensor rig in their pocket. The issue is that when those sensors are all sitting in your pocket, in a way that actually can't seamlessly capture the world and insert itself proactively into your day, it's useless as an interface. You need an interface that can provide the inference endpoint sufficient continuous context about the world that the user is facing so that then the reasoning layer can start making

Starting point is 00:14:06 useful predictions about what you want to do. And I just think we haven't unlocked that form factor yet. But voice comes close. Vision, I think, will be a huge part of it, as we're seeing with multimodal models. And I think that there are, on the margins, huge gains to be had with precise inputs that mirror thought, like eye tracking. I don't know if you've tried the Apple Vision Pro out, but the entire premise of the Vision Pro interaction system is that you replace a mouse with your eyes. And the eyes are extraordinarily good at looking somewhere that your brain wants to go. It's actually the fastest-responding, you know, part of the human body that follows thought. And so if a computing interface can infer what you're about to do because it has access to your intent via your eyes,

Starting point is 00:14:52 now I think suddenly you start to shave several seconds off the latency between a computer understanding what you want to do and actually doing it. And the history of human-computer interfaces is rife with examples of how shaving off marginal amounts of latency that may seem trivial makes dramatic changes in user adoption. What does that look like? Because we have companies experimenting right now with, like, pendants and pins and that sort of thing. But also you look at, you know, sometimes with these wearable headsets, maybe you have to be corded in. Maybe you have to, you know, you have this giant thing on your head because, I mean, you have to pack these sensors and pack a powerful enough chip into the device itself. So where do we actually have to get to from a hardware perspective? That's a great jumping off point, right?

Starting point is 00:15:33 because I think you can break down the hardware into three major buckets. One is, what's the input that you need? What's the full context window that it has over your life? So there's just a bunch of hardware required to accurately sense the world. And that's a sensing problem. The second is the actual hardware required to then process all of that input and make sense of it. And then the last step is the output hardware, right? How do you relay the results of that reasoning step that,

Starting point is 00:16:03 middle step, back to the human and do it in a sufficiently integrated, seamless loop that you can have a conversation with your computer in the same way that when you see a programmer in flow state, where they're literally programming in their IDE but they make a syntax error and the IDE intelligently says, hey, here's your syntax error, the programmer then integrates that into their workflow. I'm a terrible programmer, and so I've had the privilege of working with some remarkable programmers in my life, and you can just tell when they're in flow state. They're almost bionic when they and their computer meld into one. And that's primarily a result of a bunch of really great decisions that we've made along the way about how the computer

Starting point is 00:16:43 talks back to you about the inference or the reasoning it's done. Big picture, you can think of the hardware we need as in three buckets: input, reasoning, and output. The human anatomy analogy here would be eyes, ears, and fingers to see what's going on in the world. That's step one, the sensing. Then you need a brain to make sense of all that input. And then ultimately, like we have our appendages, you need something to manipulate the environment around you and actually take action. Traditionally, most innovation in this area happened in robotics. It was a well-contained kind of discipline of computer science to reason about how to bring those three things together in a way that was research-friendly.
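The three buckets described above (input, reasoning, output) can be sketched as a simple loop. This is a toy Python sketch under stated assumptions, not a description of any real device: `Context`, `run_loop`, and the three `fake_*` stand-ins are hypothetical names invented for illustration, and a real system would replace them with sensor drivers, a model inference call, and an output device.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Context:
    # Hypothetical snapshot of what the device sensed (bucket 1: input).
    audio: str
    vision: str


def run_loop(
    sense: Callable[[], Context],
    reason: Callable[[Context], str],
    act: Callable[[str], str],
) -> str:
    """One pass through the loop: sense the world, reason about it, act."""
    context = sense()           # bucket 1: eyes and ears
    decision = reason(context)  # bucket 2: the "brain" (inference)
    return act(decision)        # bucket 3: output back to the human


# Stand-ins for real hardware and a real model; assumptions for illustration.
def fake_sense() -> Context:
    return Context(audio="what is this?", vision="a bicycle")


def fake_reason(context: Context) -> str:
    return f"describe {context.vision}"


def fake_act(decision: str) -> str:
    return f"speaking: {decision}"
```

Calling `run_loop(fake_sense, fake_reason, fake_act)` runs the three stages once. Keeping the stages as separate callables mirrors the point made in the conversation that each bucket is its own hardware problem and can evolve independently.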

Starting point is 00:17:23 Robotics research has traditionally happened in industrial labs where they don't have to interact with humans very much other than in constrained environments like warehouses and so on. Today, what we're seeing is a dramatic shift where the generative model breakthroughs have resulted in tons of research outside of robotics, in consumer land. The form factors are everything from pendants and little pins on people's collars to pairs of glasses that just blend seamlessly into everyday life and don't look any different from glasses that you or I would wear if we just had prescriptions. We are seeing companies that are thinking about more invasive but embedded devices that literally might be a brain-computer interface, like a chip, right?

Starting point is 00:18:03 Yeah. In situ. If you asked me, hey, where are we seeing the most, as things go from science to engineering, where it might actually be in people's hands, everyday people's hands, soon, certainly the most innovation is happening in the middle step, the brain step, which is where inference is today. There are companies who are making chips that are custom designed for specific models, where they're literally burning the weights. This was an idea that I first heard from David Holz, the founder of Midjourney.

Starting point is 00:18:33 Early on, his intuition was that diffusion models were so effective at image generation that eventually you'd have only a handful of models that were handling the bulk of inference workloads for image generation. And at that point, when you have sufficient scale and volume of a certain type of reasoning, the brain is just having to do one type of task all day long. You can actually make a chip that just does that kind of task, right? And so you burn the model weights into the chip. That dramatically reduces the flexibility of things that brain can reason about. You essentially turn from a big brain into a small brain,

Starting point is 00:19:07 but you can get extraordinary orders of magnitude of speedup: 100x, 150x, 200x. And that's generally been the history of computing as well. When you have sufficient maturity of a certain type of workload, usually the chip sector makes a chip just for your calculator and just for your fridge and so on. And so I think we're about to start seeing that with generative modeling as well. On the input and output side, that's historically been such a difficult place for startups to make a dent because the cost required to bring a new form factor to market is just insane. And so the closest that I think we've seen companies come are teams like Oculus, right, who brought one of the first really mass market virtual reality form factors to market.

Starting point is 00:19:51 And their hack was they piggybacked off of the PC and smartphone supply chain that had grown up in the preceding decade that had resulted in the cost of a bunch of individual components dramatically falling. That resulted in this huge manufacturing ecosystem in China, that they were then able to basically go do shopping, off-the-shelf shopping, duct tape a bunch of parts together. I literally think the first DK1 screen was a Samsung or an LG smartphone screen that if you ask the founders, I think the founders of Oculus have this story where I'm paraphrasing poorly, but the display manufacturers didn't even believe that you could refresh the pixels at a fast enough rate for virtual reality. And they actually hacked into the driver and said, no, look, we can do it.

Starting point is 00:20:29 So I think what we need to be paying attention to on the hardware side is, are there really low cost form factors that startups can innovate around because there's an existing supply chain that an incumbent like an Apple or a Google has essentially subsidized at scale over the last decade. And that's what's so exciting about things like the Apple Vision Pro. When Apple gets into the game, it results in second and third order effects, like new supply chains showing up for sensors like depth sensors and LIDARs and pass-through mixed reality displays, that then give startups the license to go experiment with new form factors. Whichever company figures out how to capture vision in a way that's sufficiently high throughput for a user to be able to prompt a model and say, this is what I'm looking at,

Starting point is 00:21:19 is able to hear what the user is listening to because audio is a very large amount of the context that drives decision making in our daily lives. That is make or break. So I think somebody who figures out the combination of audio and video, both sensing and output, in a way that's very easy to summon throughout your daily life so that you're not having to say, hey, Siri, hey Alexa. Like that, the wake word, trigger word thing has not worked, right? That paradigm, that interface is not working at scale.

Starting point is 00:21:46 It's ruthlessly disruptive to our day, like, to a natural conversation. So I think the interface will look natural. It will look like it's voice and audio heavy, with vision augmenting, I think, the reasoning layer. Yeah, it does seem like, not only is it disruptive, but you feel like it's unnatural in terms of just how you're engaging with the world around you. The same way, like, I think when Google Glass came out.

Starting point is 00:22:15 But yeah, it just seemed unnatural. I think it seemed unnatural to have someone walking around with a camera on their face. I mean, maybe that's going to look a lot more prescient than it did at the time. I'm generally pretty Lindy about interface design, which is that if there's no good form factor that proxies the device you're building that goes back hundreds of years, it is remarkably hard to change human behavior so that the new paradigm works. Arguably, you could say smartphones were not a new interface for humans. We had been used to putting things in our pockets for years, whether it was wallets or notebooks. And we had been used to, like, tapping and

Starting point is 00:22:56 interacting with notebooks, whether that was with a pencil and so on. I think the innovation there, of course, was that Steve Jobs figured out that styluses are pretty unnatural, even though they look like pens and pencils that humans have used forever. But it turns out what came even before a stylus was the human finger. And so I think the issue with Google Glass was it completely violated all the laws of social Lindy, right? There was nothing natural about a display floating in front of your eye with a camera. This is why I actually think it's extremely unlikely that, if there's a company out there experimenting with a wearable that is sensing the world, that if it records humans, that it will gain

Starting point is 00:23:37 mainstream adoption, because that's fundamentally not a Lindy, natural behavior. So that's a conundrum that is a trust and privacy issue, where you have to find a way to capture visual context about the world. And if you had a pair of glasses that had an outward-facing visual sensor that could tell the device what you're seeing, but not record, then I think that would be a breakthrough. And I think a big mistake that some of the big hardware manufacturers today have made is they've rushed to bring recording glasses to market. And people don't trust those. People haven't welcomed those at scale into their lives yet.

Starting point is 00:24:09 What do you think it takes to actually break through that privacy or that, I mean, security, you know, maybe even less so, but that privacy angle here and get people to embrace some of these technologies? Look, I think Apple is a great case study here. It's this constant dialogue between a technical solution and a consumer or end user promise, where Apple has basically said, look, we're going to basically not give ourselves the capability to look at certain kinds of data on your device. So Apple has a secure enclave on device on the iPhone. And a ton of the

Starting point is 00:24:44 smartphone camera processing actually just happens on device. Actually, almost 100% of Apple's native computer vision processing on your photos happens on device. It actually never leaves the device. Now, they've offered sort of cloud services on top, right, so that you can back up your stuff for storage, like to iCloud. But when it actually comes to raw inference, like intelligence about your data, that's all happening locally. And so when you brought up Ollama earlier as an example, I think why we're seeing so many developers flock to Ollama is because there is a lot of demand from consumers

Starting point is 00:25:19 to interact with language models in private ways. And that means they're going to have to figure out how to get the models to run locally without the user's context and data ever leaving the user's device. And that's going to result, I think, in a renaissance of new kinds of chips that are capable of handling massive workloads of inference on device. We are yet to see those unlocked. But the good news is open source models are phenomenal at unlocking efficiency.

Starting point is 00:25:45 The open source language model ecosystem is just so ravenous. Like when Mistral's new model came out, their mixture of experts open source model, a few months ago, I think it literally took less than 24 hours for somebody to quantize it and then add GGML support so that you could run it locally within the week, even though originally, just out of the box, that model was actually really hard to run on anything less than two 4090s. Like, I had to get it up and running on my, I have two gaming cards at home in this, like, heat machine by my desk. But, you know, by today you can run a

Starting point is 00:26:18 Mixtral model, purely because of software improvements, on much smaller, on single chips. You'll get a few gains from the open source ecosystem that then allow use cases to be sufficiently figured out that then the hardware guys go, oh, let's make special chips just for these workloads. And that's what's happening. I think there are people bringing diffusion model chips to market and transformer model chips to market that are just good at that workload, so that when a startup says, you can trust us, the user knows that they don't have to. It's engineered by design, right? So I think doing no evil by design results in a "can't do evil." And I think that's the strongest commitment you can make to your customers. I'm guessing it doesn't hurt when, or in fact,

Starting point is 00:26:57 helps would be the more affirmative statement way to state that. It probably helps when a company comes and doesn't have, let's say, a legacy business to support that's dependent on showing you ads or that's dependent on otherwise using your personal data in some ways. Is that fair? I mean? Yeah. So that, okay, that's a good question, which is, can you trust an AI companion that is being paid by someone other than you?

Starting point is 00:27:24 And I do believe advertising is going to be, like, basically, in its current shape and form, almost dead on arrival for most intelligent interfaces that people trust with daily actions. You know, it's one thing to trust a model with next token prediction, where you're asking for the next word you should use in an essay. When we make the leap to next action prediction, where you're giving it agency to act on your behalf and represent you in the world, then a fundamental misalignment between you and the agent who's doing things for you is that if that agent's not getting paid by you, then you may not

Starting point is 00:27:59 be able to trust it. Now, I think there may be flavors of advertising that work that don't look like the Google ad format today where like you stuff four paid links ahead of the seven or eight good ones. And I think we're actually seeing Google struggle with that a little bit. But you're right. I think the most aligned business model would just be the same way you hire a human to help you out with a task as an employee. You're hiring an agent. You're hiring a computer and you're paying for it, and you're really the employer in that case.

Starting point is 00:28:31 And that's the most aligned business model, I think, when it comes to this next wave of generative computers, is to not think about them as tools. While the ultimate impact that they can make on your life is that of a tool, the business model relationship should be much more akin to an employer and employee than I think just a third-party tool that somebody else is subsidizing for you to use.

Starting point is 00:28:53 The classic, like, if you're not paying, you're the product, I think, exists here. And the failure mode is more catastrophic when you're trusting it to do things on your behalf. I find it interesting right now when you look at, say, subscriptions to ChatGPT, subscriptions to any of these generative model services, Perplexity, whatever. I mean, you name it, like, people are willing to pay for search now that's better.

Starting point is 00:29:16 People are willing to pay a monthly subscription for these things. And maybe that does indicate, yes, that this is a thing people would pay for going forward. Like, we've crossed that bridge where we realized, to your point, like, free is not free. And if something truly adds value, you can come in and say, this is what we do and this is how we make our money and take it or leave it. But know the service you're getting for the price you're paying. Yeah. Look, I think the long arc of tech has shown that the marginal cost of compute over time just generally converges to zero. Right. And we're in this

Starting point is 00:29:50 really weird phase right now. Because generative models are so new, we haven't seen the dramatic reduction in cost happen that Moore's law should be driving for compute. And so as a result, the best generative model services in the world are premium services, because it's expensive to run inference, right? But what is crazy, to your point, is that people are willing to pay for that. That's how much economic value is being unlocked by generative models. At this point now, I've been personally involved either as an advisor or investor or as an operator with at least five companies, generative model companies, that have exploded past 30 to 50 million in subscription revenue in their first 12 months of monetizing.

Starting point is 00:30:33 And this has all happened in the last two years, right? It's insane. And I think that's because when the model actually accomplishes a task for you that you didn't think you could do on your own, whether that's generating an image with Midjourney or it's getting an answer from Perplexity that would have taken you hours to do by yourself or generating a podcast of your voice using ElevenLabs from your text, these were things you'd actually have to go hire people to do earlier. And it turns out when it's bundled up as compute and you can call on it 24/7, charging 20 bucks a month for it is not even a hard ask of customers because the comparable is,

Starting point is 00:31:11 like legitimately, I think, hiring a human to do that task for you. That cost varies anywhere from minimum wage per hour to some humans you literally can't find to accomplish a task you want on the timeline you may need. And so I don't think it's, I don't think, I want to be clear, I don't think these models are replacing humans. I think they're filling gaps in economic demand that weren't being filled, these niches that weren't being filled before. And they're creating new categories.

Starting point is 00:31:34 And a $20 a month subscription for that today is not a hard ask. But over time, I see those prices will decline because the marginal cost of compute will converge closer and closer to zero. To tie this back. So we started off talking about GPUs and kind of the training side. If the UI of AI becomes like multimodal data capture on device, what does a training process and system look like? And then what does the model development process

Starting point is 00:31:59 look like as we're just pumping in now, I don't even know how to quantify the volumes of data that we're talking about generating here. Yeah. So look, I think there was a moment in time about 24 to 36 months ago where everyone was like, oh,

Starting point is 00:32:15 of course bigger is better. The bigger the model, the better it's going to be and size is everything. And we're going to have GPT-10 be this insanely large, 100 trillion parameter model that will be this big God brain. And products should just be increasingly wrappers on top of that model. And that reality has not come to pass. What is instead happening is that the most useful products are combinations of different models. Matei Zaharia has a great paper on this recently from his lab at Berkeley that calls it compound systems. They did a pretty

Starting point is 00:32:46 systematic study of all the most used products today that use generative models. And it turns out they're not single monolithic models. They're combinations, these compound systems of different models acting in unison. So I'm a big believer in the idea that the future products are going to be swarms of small models working together to solve a task cheaper, faster, more efficiently than just one big megabrain that can do it. And then when those teams of models encounter tasks that they can't solve themselves, then they will call out to a larger model that might be in the cloud and then ask that model to do what might be a multi-day reasoning problem. Sometimes when you need to invent the theory of relativity, you do need to go ask Einstein for help.
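The routing pattern described here, small specialist models first and a larger cloud model only as a fallback, can be sketched roughly as follows. This is a minimal illustration, not how any particular product works: the `Specialist` type, the `route` function, the confidence threshold, and the three stub "models" are all hypothetical stand-ins for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a task string and returns
# (answer, confidence between 0 and 1). Real systems would call actual models.
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Specialist:
    name: str
    can_handle: Callable[[str], bool]  # cheap check: is this task in my niche?
    model: Model


def route(task: str, specialists: List[Specialist], cloud_model: Model,
          min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try small specialists first; escalate to the big model if none is confident."""
    for s in specialists:
        if s.can_handle(task):
            answer, confidence = s.model(task)
            if confidence >= min_confidence:
                return s.name, answer
    answer, _ = cloud_model(task)
    return "cloud", answer


# --- Stub models, purely for illustration ---
def tiny_adder(task: str) -> Tuple[str, float]:
    parts = task.split()
    try:
        return str(int(parts[1]) + int(parts[2])), 0.99
    except (IndexError, ValueError):
        return "", 0.0  # not confident: let the router escalate


def tiny_echo(task: str) -> Tuple[str, float]:
    return task[len("echo "):], 0.95


def big_cloud_model(task: str) -> Tuple[str, float]:
    return f"[cloud answer for: {task}]", 0.9


team = [
    Specialist("adder", lambda t: t.startswith("add "), tiny_adder),
    Specialist("echo", lambda t: t.startswith("echo "), tiny_echo),
]

print(route("add 2 3", team, big_cloud_model))
print(route("add two three", team, big_cloud_model))
print(route("invent the theory of relativity", team, big_cloud_model))
```

The design choice worth noticing is that escalation is driven by the specialist's own confidence, so the expensive model is only paid for when the cheap team declines or fails.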

Starting point is 00:33:30 But most things throughout my day-to-day, I don't need Einstein to help me with. And instead, what I do want is a really great, efficient team of people who are specialists at something, acting in close unison with each other, the way companies work together. And so I see a future where everybody's got a personal team of AIs working for us, just the same way companies are often in service of a customer.

Starting point is 00:33:50 And I think what's going to happen is these inference workloads are going to become combinations of quickly attacking a task with that team and then offloading the tasks that they can't solve themselves to bigger and bigger cloud-hosted inference workloads. At the same time, what that means for training is that actually bigger and bigger training runs may not be that important to our everyday lives. What might actually be really important is training and fine-tuning models, base models, on individual data. One of the biggest unlocks that ByteDance, which made TikTok, had was in this intense personalization of the algorithm, so that within three swipes of somebody opening up TikTok, they knew what Derek really wanted to see next.

Starting point is 00:34:35 And the concept behind the scenes is basically a personal embedding for everybody, right? Like every individual consumer has a personal embedding that understands their preferences so deeply that you're able to serve them with what they want next, whether that's searching for a restaurant that you want to go to or just like taking action for you and calling you a taxi if that's what you need. And I think that in that future,

Starting point is 00:34:59 a lot of the training, like what's currently happening in the pre-training phase of model development will start to make its way into post-training, what's currently called fine-tuning or customization, right? Where once you have a good enough base model for most tasks, then you can start to fine-tune it on what the individual wants on their individual user set. And so you don't need a massive model to keep reasoning about every individual user. You actually need just a good enough base model that then learns specifically about you.
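To make the "personal embedding" idea mentioned above a little more concrete, here is a toy sketch. It is an assumption-laden illustration, not TikTok's or anyone's actual algorithm, and it is not fine-tuning: it simply keeps one preference vector per user, nudges it toward each item the user engages with (an exponential moving average), and ranks a catalog by cosine similarity. The three-dimensional item vectors, the `update_profile` and `rank` helpers, and the 0.5 update rate are all hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def update_profile(profile: Vector, item: Vector, rate: float = 0.5) -> Vector:
    """Move the user's vector part of the way toward an item they engaged with."""
    return [(1 - rate) * p + rate * i for p, i in zip(profile, item)]


def rank(profile: Vector, catalog: Dict[str, Vector]) -> List[str]:
    """Order catalog items from most to least similar to the user's vector."""
    return sorted(catalog, key=lambda name: cosine(profile, catalog[name]),
                  reverse=True)


# Hypothetical item embeddings (real ones would come from a trained model).
catalog = {
    "cooking": [1.0, 0.0, 0.0],
    "soccer": [0.0, 1.0, 0.0],
    "chess": [0.0, 0.0, 1.0],
}

# A new user starts with no preferences, then engages with three items.
profile = [0.0, 0.0, 0.0]
for watched in ["soccer", "soccer", "chess"]:
    profile = update_profile(profile, catalog[watched])

print(profile)
print(rank(profile, catalog))
```

With a high update rate, a handful of interactions is enough to reorder the ranking, which is the flavor of "within three swipes" personalization; a real system would learn the item vectors and the update rule rather than hard-coding them.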

Starting point is 00:35:26 And that happens in the post-training step. It seems like we're reliably bad at predicting the future when it comes to certain things. I think most people jump to like sci-fi didn't seem to predict the internet or the smartphone, which might have been two of the biggest advances we actually had. So I'm going to ask you like to comment on that and also open yourself up to that same sort of mistake. But like, where do you think we might be missing some areas for improvement when we're thinking about how AI develops? Like, do we have blind spots based on what we're currently doing that might limit how we develop these things going forward? Humans are so good at reasoning by analogy, right?

Starting point is 00:36:01 And rightly so because millions of years of evolution have shown us that pattern matching is a really good skill in life. If your ancestors have seen a lion before and have associated that with danger, next time you see a thing that kind of looks like a lion, you should probably reason about it the way your brain, your ancestors have learned over millions of years. And I think, while that serves us well in most daily life, it's actually served us really poorly in computing because I think we keep looking to biological metaphors to guide computer design. And for a long time, AI was in this like weird research path where most of the AI research community just believed that the path to unlocking sort of general intelligence would be you had to first figure out how brains worked, human brains worked,

Starting point is 00:36:46 and then you can replicate that in silicon, right? And so there's just like decades of research at many DARPA and DOD and university-funded labs that was this sort of neuroscience-first approach to inventing computers. That has proven to be mostly a distraction. Like, it turns out that just predicting the next token or the next word a model should say is a remarkably useful way to attack intelligence and design computers, instead of getting computers to learn like human beings. I will say what is now happening is that because transformers are so remarkably effective at what they do, most major industrial labs have doubled down on that architecture.

Starting point is 00:37:28 It's not clear whether that will result in multi-step reasoning of a kind that is essentially unconstrained, right? It's not clear that the current architectures will get us all the way to the end goal. Everyone has a definition of what the end goal is, but let's say the end goal is a kind of computer that's able to do almost everything we want as humans and take all the drudgery out of our lives and allow us to be the ultimate bicycle for the mind. Forget bicycle. Let's say we want computers to be the interstellar travel for the mind. It's not clear that the current architectures we have of these models will get us there. But because they work so well,

Starting point is 00:38:09 the bulk of research dollars are going to fund optimizations in the current architecture. That's a blind spot because what we may actually need is a fundamentally new architecture to unlock the interstellar travel for the mind. While there are some promising startups trying to do that, it is a pretty capital-intensive game. It's not for the faint of heart. And so what we do risk, I think, as an industry is hitting a point where scaling laws don't hold. Our current architectures actually do plateau. And then we are back at the kind of slowdown that we had with AI for the past three winters over the last three decades where many of the loss curves, or the ability for models to predict reality, basically

Starting point is 00:38:52 hit walls. They started off being super promising and then they hit a wall. So far, signs are that's not happening. But if there's one blind spot in the future of computing, it's that our current architectures are insufficient and that we haven't invested enough in alternative backups to overtake those. Now, I'm optimistic that this time around, there's sufficient excitement in the ecosystem from everybody, from the hardware providers at the compute level like Nvidia, the cloud providers and startups and ultimately investors like us who are really excited for new architectures. And so, you know, if there are people out there who are tinkering and experimenting with those that unlock new interfaces, unlock the next phase in computing, yeah, that's what we're here to fund. I just wish

Starting point is 00:39:32 more people were working on those. Thanks for listening, everyone. I thought that was a super insightful discussion, and I hope you did too. We're just getting warmed up and we have more episodes to come shortly. But in the meantime, feel free to rate the show and let us know what you think so far.
