a16z Podcast - Remaking the UI for AI
Episode Date: May 16, 2024

Make sure to check out our new AI + a16z feed: https://link.chtbl.com/aiplusa16z

a16z General Partner Anjney Midha joins the podcast to discuss what's happening with hardware for artificial intelligence. Nvidia might have cornered the market on training workloads for now, but he believes there's a big opportunity at the inference layer, especially for wearable or similar devices that can become a natural part of our everyday interactions. Here's one small passage that speaks to his larger thesis on where we're heading:

"I think why we're seeing so many developers flock to Ollama is because there is a lot of demand from consumers to interact with language models in private ways. And that means that they're going to have to figure out how to get the models to run locally, without the user's context and data ever leaving the user's device. And that's going to result, I think, in a renaissance of new kinds of chips that are capable of handling massive workloads of inference on device. We are yet to see those unlocked, but the good news is that open source models are phenomenal at unlocking efficiency. The open source language model ecosystem is just so ravenous."

More from Anjney:
The Quest for AGI: Q*, Self-Play, and Synthetic Data
Making the Most of Open Source AI
Safety in Numbers: Keeping AI Open
Investing in Luma AI

Follow everyone on X:
Anjney Midha
Derrick Harris

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
The cost required to bring a new form factor to market is just insane.
We've had a reasoning breakthrough with large language models and generative models,
and they're really hungry for new kinds of input and context that the current generation of interfaces is not providing.
The North Star of computing has always been, to borrow like a Steve Jobs quote,
to be the bicycle for the mind.
The history of hardware has been the history of computers.
Hey, the rate limiter here is silicon. It's sand.
and there's tons of sand in the world
and eventually somebody else will figure out
how to make sand that does the same thing.
We have adapted to interfaces
over the last 60 years
but the way we interact with computers today
is actually dramatically unnatural.
Hello everyone, welcome back to the A16Z podcast.
Now, if you're a regular around here,
you probably have a sense of just how important the topic of artificial intelligence is to A16Z.
In fact, it is so important
that we decided to create a new AI-specific podcast
which, to no surprise, is called AI plus A16Z.
And no, we did not consult AI on naming.
Today, you'll get to hear one of the early episodes from that podcast,
hosted by venture editor Derrick Harris.
Derrick sits down with A16Z general partner Anjney Midha to discuss a very important question.
Here it goes.
If we have a new software platform through large language models,
do we also need to rethink our hardware from the ground up?
Do our interfaces today keep pace with the data requirements
across input, reasoning, and output?
In other words, what does the UI look like for AI?
Will it look like a phone or something completely different?
And what can we learn from the millennia of human behavior
long before this technology existed
that might actually inform what's to come?
Of course, there are many companies trying to solve this right now,
but this conversation may actually give you clues
as to why no one has quite solved this yet.
Again, this episode comes straight from our new AI plus A16Z feed.
So if you like this episode, don't forget to subscribe to get all of A16Z's latest AI content, including episodes around the software supply chain,
open source, RAG, and a whole lot more.
Of course, we'll include a link in our show notes, and with that, Derrick, take it away.
Hi, this is Derrick Harris, and you're listening to the A16Z AI podcast,
where we dig into all things artificial intelligence with our in-house team of experts,
as well as the founders, engineers, and researchers working at the state of the art.
In this episode, I speak with A16Z general partner Anjney Midha about how AI hardware will look in the years to come,
and why there's so much innovation yet to happen at the inference layer.
Among other things, he explains how he sees wearable devices evolving to take advantage of improvements in sensors and workload-specific chips,
and how the introduction of big company technology, like the Apple Vision Pro, can actually lay the foundation for startups.
But because we recorded right after Nvidia's big GTC event, as with our previous episode,
with Naveen Rao, we kick off the conversation talking about training workloads versus inference workloads
and how NVIDIA came to dominate the former category.
As a reminder, please note that the content here is for informational purposes only,
should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security,
and is not directed at any investors or potential investors in any A16Z fund.
For more details, please see A16Z.com slash disclosures.
I saw recently that Ollama released support for AMD GPUs.
I think I saw someone compare Nvidia to Sun Microsystems in the early days of the web,
which seems like it might be wishful thinking.
Are our skeptics kind of underestimating the stranglehold that Nvidia has?
There are two schools of thought. One is that Nvidia's stranglehold on this is completely transitory: their margins are overinflated because of supply chain crises over the last 24 months, where we had this explosion in demand, but production is going to catch up. And basically that school of thought will tell you, hey, the rate limiter here is silicon. It's sand. And there's tons of sand in the world. And eventually somebody else will figure out how to make sand that does the same thing. There's tons of it on planet Earth. And I personally find that view provocative but reductive, because it doesn't honor a first-principles analysis of the fact that training has these idiosyncratic needs, like a really robust software driver layer that can orchestrate thousands of these chips acting in unison. I think, yes, on that side of the debate, I'm certainly one who believes that
the developer experience that Nvidia started, by the way, investing in a decade ago is in its
sort of later stages of compounding right now. And it's really hard to dislodge that. What we may
see is margins compress over time because the budgets are shifting from training to
inference. And I think that's actually where a lot of the exciting stuff is happening. And I know we're
going to spend a bulk of time today, hopefully talking about inference because that's open season
right now. Right. Every time you have a new compute, a new software primitive, it often results in
new kinds of workloads that the incumbents have a harder time keeping up with. And I would argue the
inference workloads, like you mentioned with Ollama, for example, are entirely new kinds of compute workloads we haven't seen before. And so that's a much more even playing field.
I don't think there was a way for Nvidia to invest in that kind of workload 10 years ago
because it just didn't exist.
Whereas training fundamentally has in some shape or form been around for the better
part of a decade because deep learning has been around for that long.
And actually, I think this is a good time to introduce what I think is a useful mental
model that I have about the future of hardware.
And I found there's several ways to reason about hardware.
One, you can work backwards from the customer.
Who is the customer here and what do they need?
What are their pain points?
And then there's another way you can reason about is like reasoning from history
and see what the progression and evolution of compute has been over time.
And I think the history of hardware has been the history of computers, right?
And in my mind, if you look at the last 60 years or so of computers that we've had,
which is basically modern computing, one popular way to reason about it is the hardware versus software split.
But I think there's another way to reason about it, which I'll give full credit to one of our founders, Ankit Kumar, who spent a lot of time at Discord building the first AI bot there and was reasoning about how to expose language models to large numbers of users
before a lot of people got the chance to experiment with these large language models.
He basically believes there's two lineages of computing.
There's reasoning or intelligence, and then there's interfaces.
And you can kind of go back in time over the last 60 years and basically break down every
major computing revolution we've had to some fundamental progress in either the reasoning part
of computing or the interface.
part of computing. And if you traverse that tree, if you go all the way back to the first neural
networks in 1958, that's when we start to see reasoning happen through neural networks. And then that gave way to sort of probabilistic graphical models in the 80s. That then led to this idea of GPU-accelerated deep learning in the 2000s, which then gave way to transformers in the late 2010s. And now we're at this next phase of massive transformers. So you can say, okay, that's one lineage, which is the lineage of reasoning and intelligence. In parallel,
we've had this other lineage in computing, which is interfaces, right? And you started with
the command line and keyboards. And then that kind of gave way to the GUI with a mouse when Steve Jobs
got inspired by Xerox PARC in the 80s. And then that has led ultimately to mobile interfaces with
touch as an input mechanism. And then I think the question is where are we going next? And I think
we have really good reasons to believe that the next interface will be an AI companion that's
some combination of text, voice, and vision that can understand the world. That's almost a better
predictor of where hardware is going because the history of computing is so far shown that
whichever one of those lineages is undergoing a moment of resonance with customers ends up
dominating for the kinds of workloads that then get to scale for the next 10, 15 years. And I think
we're in the middle of both a reasoning and an interface shift.
And that's what's exciting right now.
Right.
It seems, the way you're explaining this, that the smartphone interface is pretty well established at this point, and it runs some AI inference.
Our computers have these inference chips in them, but those are standard, well-known interfaces at this point.
And what's new seems to be on the reasoning side: the LLMs and foundation models and this ability to interact with a model that way.
So it stands to reason, if I'm hearing you correctly, right?
Like, this is where we are in the reasoning side of things.
So the hardware interface now needs to take that step forward.
That's exactly right.
I think you're basically, we've had a reasoning breakthrough, like you're saying, with large
language models and generative models.
And they're really hungry for new kinds of input and context that the current generation
of interfaces is not providing.
And I think that's why we're seeing folks tinker with interfaces, or experiment with
new interfaces for the first time in ways that just weren't possible before because the reasoning
capability wasn't there.
So what's the hurdle, I guess, of existing interfaces, in the sense that we all carry around a smartphone that seems pretty powerful?
We've had voice recognition for a while.
We've had devices like, you know, these Amazon Echo devices in our homes or whatever that we
could talk to.
And, you know, they run some sort of model off in the cloud somewhere.
Right.
Where's that step function, or that improvement, from what we have today, which seems pretty capable in some ways, to what you're explaining, which is a whole new way of, I mean, maybe data capture as a primary feature is kind of the jumping off point.
The North Star of computing has always been, to borrow a Steve Jobs quote, to be the bicycle for the mind, right? To ultimately express or translate human thought into some set of actions in the world that then allow humans to accomplish what they want to do in ways that just aren't possible without the leverage of a tool.
And so computers in their most grand romantic reality or expression are tools for thought
and action in a way that allow us to accomplish things we never could without those tools.
And so if you ask, okay, well, why aren't computers able to help us accomplish that North Star
today?
There's a whole host of those reasons.
But I think you started at the top of that list and you worked your way down.
The first one is they're pretty dumb, right?
At inferring our intent about the world. As humans, we have adapted to interfaces over the last 60 years, but the way we interact with computers today is actually dramatically unnatural, mostly because we are compensating for the lack of the computer's ability to understand our intent and translate that into action proactively. There's this paradigm in computing, which is declarative versus imperative, right? The idea being, with imperative, you are extremely prescriptive about what you want the computer to do. You say, open up this file, and then here's a set of instructions I want you to follow, almost like you were instructing a toddler. And declarative is when you just declare essentially a goal that you have in mind, the way you would interact with an adult, and say, hey, please go book a flight for me, or whatever your goal is. And then it goes and reasons about how to do that. Humans are phenomenal at doing that, and computers are nowhere close to translating thought into action. And so I think the big limitation is that computers just aren't smart enough to translate thought into action.
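To make the contrast he's drawing concrete, here is a rough sketch in Python; the `browser` and `agent` objects and their methods are hypothetical, purely for illustration:

```python
# Imperative: the human spells out every step, like instructing a toddler.
def book_flight_imperative(browser):
    browser.open("https://airline.example.com")   # hypothetical booking site
    browser.fill("from", "SFO")
    browser.fill("to", "JFK")
    browser.fill("date", "2024-06-01")
    browser.click("search")
    browser.click("cheapest_nonstop")
    browser.click("confirm")

# Declarative: the human states the goal; the agent reasons out the steps itself.
def book_flight_declarative(agent):
    agent.act("Book me the cheapest nonstop SFO to JFK flight on June 1")
```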
So if you ask what's the big blocker there, if generative models are in fact capable of reasoning, then why haven't we seen computers cross that reasoning step, the intelligence step?
And I think there's two big problems there.
One, there's a fundamental context problem.
The way generative models work is they're only as good as you can prompt them, right?
They're only as good as the questions you ask them.
Take ChatGPT, which is essentially just a slightly different form factor for asking questions of GPT-3 and GPT-3.5. Those had been around as raw next-token prediction endpoints for like eight, nine months before ChatGPT came out, but nobody really did anything interesting with them. And then when they packaged it up as a chat interface,
the rest is history. It turns out just allowing humans to prompt the model and talk to it and
give it context in a way that's useful makes a massive difference in the value you can derive
from these models. While we've seen how valuable that is for pure text generation, when it comes to
taking action in the world and doing something, predicting the next action that you want to take and
then taking that action for you, there's no interface anywhere close to being able to see what you're seeing, listen to what you're listening to, hear what others in the room are hearing, look at where your eyes are tracking, and infer all of that context about what
the human wants to achieve and then proactively do that for you. Because right now, we're kind
of basically forcing these models to try to understand us through a little straw, right? Where
all they can get is like a tiny representation of reality through what we can type and via text
chat, but we're losing all the other context about reality. And so there's an interface problem
where there's no interface today that seamlessly just captures the entire world that you're interacting
with and then translates that into a prompt for these generative models. Now, the solution space is
fascinating to think about, because you could argue, well, the smartphone is probably the most densely packed world-sensor rig we've ever had. Right? Just the one I have here has three rear-facing cameras, one forward-facing, a full depth sensor, RGB sensing. It has an accelerometer for motion. It has GPS. It knows where I am.
It's got microphones in there. It can see what I'm seeing. And so you might go, well,
what are you talking about? Billions of people today now have a full sensor rig in their
pocket. The issue is that when those sensors are all sitting in your pocket, in a way that can't seamlessly capture the world and insert itself proactively into your day,
it's useless as an interface.
You need an interface that can provide the inference endpoint with
sufficient, continuous context about the world
that the user is facing,
so that then the reasoning layer can start making useful predictions
about what you want to do.
And I just think we haven't unlocked that form factor yet.
But voice comes close.
Vision, I think, will be a huge part of it,
as we're seeing with multimodal models.
And I think that there are, on the margins, huge gains to be had with precise inputs that mirror thought, like eye tracking.
I don't know if you've tried the Apple Vision Pro out,
but the entire premise of the Vision Pro interaction system
is that you replace a mouse with your eyes.
And the eyes are extraordinarily good
at looking somewhere that your brain wants to go.
It's actually the fastest part of the human body
that follows thought.
And so if a computing interface can infer
what you're about to do because it has access
to your intent via your eyes,
now I think suddenly you start to shave several seconds off latency between a computer
understanding what you want to do and actually doing it. And the history of human computer
interfaces is rife with examples of how shaving off marginal amounts of latency that may seem trivial makes dramatic changes in user adoption. What does that look like? We have companies experimenting right now with pendants and pins and that
sort of thing. But also, sometimes with these wearable headsets, maybe you have to be corded in, maybe you have this giant thing on your head, because you have to pack these sensors and a powerful enough chip into the device itself. So where do we actually have to get to from a hardware perspective? That's a great jumping off point,
right? Because I think you can break down the hardware into three major buckets. One is,
what's the input that you need? What's the full context window that it has over your
life? So there's just a bunch of hardware required to accurately sense the world. And that's a
sensing problem. The second is the actual hardware required to then process all of that input
and make sense of it. And then the last step is the output hardware, right? How do you relay
the results of that reasoning step, that middle step back to the human and do it in a
sufficiently integrated seamless loop that you can have a conversation with your computer
in the same way that when you see a programmer in flow state, where they're literally programming in their IDE, they make a syntax error and the IDE intelligently says, hey, here's your syntax error.
The programmer then integrates that into their workflow.
I'm a terrible programmer, and so I've had the privilege of working with some remarkable programmers in my life.
And you can just tell when they're in flow state, they're almost bionic, when they and their computer meld into one.
And that's primarily a result of a bunch of really great decisions that we've made along the way about how the computer talks back to you about the inference or the reasoning it's done.
Big picture, you can think of the hardware we need as three buckets: input,
reasoning, and output.
The human anatomy analogy here would be eyes, ears, and fingers to see what's going on in the world.
That's step one, the sensing.
Then you need a brain to make sense of all that input.
And then ultimately, like we have our appendages, you need something to manipulate the environment
around you and actually take action.
Traditionally, most innovation happened in this area in robotics.
It was a well-contained kind of discipline of computer science to reason about how to bring those three things together in a way that was research-friendly.
Robotics research has traditionally happened in industrial labs where they don't have to interact with humans very much other than constrained environments like warehouses and so on.
Today, what we're seeing is a dramatic shift where the generative model breakthroughs have resulted in tons of research outside of robotics in consumer land.
The form factors are everything from pendants and little pins on people's collars to pairs of glasses that just blend seamlessly into everyday life and don't look any different from glasses that you or I would wear if we just had prescriptions.
We are seeing companies that are thinking about more invasive but embedded devices that literally might be a brain computer interface like a chip, right?
Yeah.
In situ.
If you asked me, hey, where are we seeing the most progress as things go from science to engineering, where it might actually be in everyday people's hands soon? Certainly the most innovation is happening in the middle step, the brain step, which is
where inference is today. There are companies who are making chips that are custom designed for
specific models where they're literally burning the weights. This was an idea that I first heard
from David Holz, the founder of Midjourney. Early on, his intuition was that diffusion models
were so effective at image generation
that eventually you'd have only a handful of models
that were handling the bulk of inference workloads
for image generation.
And at that point, when you have sufficient scale
and volume of a certain type of reasoning,
the brain is just having to do one type of task all day long,
you can actually make a chip that just does that kind of task, right?
And so you burn the model weights into the chip,
that dramatically reduces the flexibility of things
that brain can reason about.
You essentially turn from a big brain into a small brain,
but you can get extraordinary orders-of-magnitude speedups: 100x, 150x, 200x.
And that's generally been the history of computing as well.
When you have sufficient maturity of a certain type of workload,
usually the chip sector makes a chip just for your calculator and just for your fridge and so on.
And so I think we're about to start seeing that with generative modeling as well.
On the input and output side, that's historically been such a difficult place for startups to make a dent
because the cost required to bring a new form factor to market is just insane.
And so the closest that I think we've seen companies come are teams like Oculus, right,
the company who brought one of the first really mass market virtual reality form factors to market.
And their hack was they piggybacked off of the PC and smartphone supply chain that had grown up in the preceding decade that had resulted in the cost of a bunch of individual components dramatically falling.
That resulted in this huge manufacturing ecosystem in China that they were then able to basically go shopping in, off-the-shelf shopping, and duct-tape a bunch of parts together.
I literally think the first DK1 screen was a Samsung or an LG smartphone screen. If you ask the founders of Oculus, they have this story, and I'm paraphrasing poorly, where the display manufacturers didn't even believe that you could refresh the pixels at a fast enough rate for virtual reality.
And they actually hacked into the driver and said, no, look, we can do it.
So I think what we need to be paying attention to on the hardware side is: are there really low-cost form factors that startups can innovate around because there's an existing supply chain that an incumbent like an Apple or Google has essentially subsidized at scale over the last decade?
And that's what's so exciting about things like the Apple Vision Pro: when Apple gets into the game, it results in second and third order effects, like new supply chains showing up
for sensors like depth sensors and LIDARs and pass through mixed reality displays that then
give startups the license to go experiment with new form factors.
Whichever company figures out how to capture vision in a way that's sufficiently high throughput for a user to be able to prompt a model and say, this is what I'm looking at, and is able to hear what the user is listening to, because audio is a very large amount of the context that drives decision making in our daily lives, that is make or break. So I think it's somebody who figures out the combination of audio and video, both sensing and output, in a way that's very easy to summon throughout your daily life, so that you're not having to say, hey Siri, hey Alexa. That wake-word, trigger-word thing has not worked, right? That paradigm, that interface, is not working at scale. It's ruthlessly disruptive to our day, to a natural conversation. So I think the interface will look natural. It will look like it's voice and audio heavy, with vision augmenting, I think,
the reasoning layer.
Yeah, it does seem like not only is it disruptive, but it feels unnatural in terms of just how you're engaging with the world around you.
The same way, I think when Google Glass came out, I think it was cool.
Right.
But, I mean, that was quite a while ago now if you think about it.
But yeah, it just seemed unnatural. It seemed unnatural to have someone walking around with a camera on their face. And maybe it's going to look a lot more prescient than it did at the time.
I'm generally pretty lindy about interface design, which is that if there's no good form factor
that proxies the device you're building that goes back hundreds of years, it is remarkably hard
to change human behavior so that the new paradigm works. Arguably, you could say smartphones were
not a new interface for humans. We had been used to putting things in our pockets for years,
whether it was wallets or notebooks. And we had been used to like tapping and interacting with
notebooks, whether that was with a pencil and so on. I think the innovation there, of course,
was that Steve Jobs figured out that styluses are pretty unnatural, even though they look
like pens and pencils that humans have used forever. But it turns out what came even before
a stylus was the human finger. And so I think the issue with Google Glass was it completely
violated all the laws of social Lindy, right? There was nothing natural about a display floating
in front of your eye with a camera. This is why I actually think it's extremely unlikely that, if there's a company out there experimenting with a wearable that is sensing the world, and it records humans, it will gain mainstream adoption, because that's fundamentally not a Lindy, natural behavior. So that's a conundrum
that is a trust and privacy issue where you have to find a way to capture visual context about
the world. And if you had a pair of glasses that had an outward-facing visual sensor that could
tell the device what you're seeing, but not record, then I think that would be a breakthrough.
And I think a big mistake that some of the big hardware manufacturers today have made is they've
rushed to bring recording glasses to market. And people don't trust those. People haven't welcomed
those at scale into their lives yet. What do you think it takes to actually break through
that privacy, or, I mean, maybe less so the security, but that privacy angle here, and get people to embrace some of these technologies? Look, I think Apple is a great case study
here. It's this constant dialogue between a technical solution and a consumer or end user promise
where Apple has basically said, look, we're going to basically not give ourselves the capability
to look at certain kinds of data on your device. So, Apple has a secure enclave on device
on the iPhone. And a ton of the smartphone camera processing actually just happens on device.
Actually, almost 100% of Apple's native computer vision processing on your photos happens on device.
It actually never leaves the device.
Now, they've offered sort of cloud services on top, right, that you can back up your stuff for storage to iCloud.
But when it actually comes to raw inference, like intelligence about your data, that's all happening locally.
And so when you brought up Ollama earlier as an example, I think why we're seeing so many developers flock to Ollama is because there is a lot of demand from consumers to interact with language models in private ways.
And that means they're going to have to figure out how to get the models to run locally
without the user's context and data ever leaving the user's device.
And that's going to result, I think, in a renaissance of new kinds of chips that are capable
of handling massive workloads of inference on device.
We are yet to see those unlocked.
But the good news is open source models are phenomenal at unlocking efficiency.
The open source language model ecosystem is just so ravenous.
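For context, here's roughly what that local, private inference loop looks like with Ollama today: a minimal sketch in Python that assumes the Ollama server is running on its default port and that a model such as `mistral` has already been pulled locally.

```python
import requests

# Prompt a locally served model through Ollama's HTTP API (default port 11434).
# The model weights and the user's context both stay on the device.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",   # any locally pulled model tag
        "prompt": "Summarize my meeting notes in three bullet points.",
        "stream": False,      # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(response.json()["response"])
```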
Like when Mistral's new model came out, their mixture-of-experts open source model, a few months ago, I think it literally took less than 24 hours for somebody to quantize it and then add GGML support so that you could run it locally within the week, even though originally, just out of the box, that model was actually really hard to run on anything less than two 4090s. I had to get it up and running on the two gaming cards I have at home, in this heat machine by my desk. But today you can run that mixture-of-experts model, purely because of software improvements, on much smaller hardware, on single chips. You get these gains from the open source ecosystem that then allow use cases to be sufficiently figured out
that then the hardware guys go, oh, let's make special chips just for these workloads.
And that's what's happening.
I think there are people bringing diffusion model chips to market and transformer model chips
to market that are just good at that workload so that when a startup says, you can trust us,
the user knows that they don't have to.
It's engineered by design, right?
So I think doing no evil by design results in a can't do evil.
And I think that's the strongest commitment you can make to your customers.
I'm guessing it doesn't hurt when, or in fact "helps" would be the more affirmative way to state that.
It probably helps when a company comes and doesn't have, let's say, a legacy business
to support that's dependent on showing you ads or that's dependent on otherwise using your
personal data in some ways.
Is that fair?
I mean.
Yeah.
So, okay, that's a good question, which is: can you trust an AI companion that is being paid by someone other than you? And I do believe advertising is going to be, basically in its current shape and form, almost dead on arrival for most intelligent interfaces that people trust with daily actions. You know, it's one thing to trust a model with next token prediction, where you're asking for the next word you should use in an essay. When we make the leap to next action prediction, where you're giving it agency to act on your behalf and represent you in the world,
then the fundamental misalignment between you and the agent who's doing things for you
is that if that agent's not getting paid by you, then you may not be able to trust it.
Now, I think there may be flavors of advertising that work,
that don't look like the Google ad format today where like you stuff four paid links
ahead of the seven or eight good ones.
And I think we're actually seeing Google struggle with that a little bit.
But you're right.
I think the most aligned business model would just be the same way you hire a human to
help you out with a task as an employee.
You're hiring an agent, you're hiring a computer and you're paying for it.
And you're really the employer in that case.
And that's the most aligned business model, I think, when it comes to this next wave of generative
computers, is to not think about them just as tools. While the ultimate impact that they can make on your life is that of a tool, the business model relationship should be much more akin to an employer and employee than, I think, just a third-party tool that somebody else is subsidizing for you to use.
The classic "if you're not paying, you're the product" exists here, I think. And the failure mode is more catastrophic when you're trusting it to do things on your behalf.
I find it interesting right now.
When you look at, say, subscriptions to ChatGPT, subscriptions to any of these generative model services, Perplexity, whatever, you name it, people are willing to pay for search now, search that's better. People are willing to pay a monthly subscription for these things. And maybe that does indicate, yes, that this is a thing people would pay for going forward. Like we've crossed that bridge where we realized, to your point, free is not free. And if something truly adds value, you can come in and say, this is what we do and this is how we make our money, take it or leave it. But know the service you're getting for the price you're paying.
Yeah. Look, I think the long arc of tech has shown that the marginal
cost of compute over time just generally converges to zero, right? And we're in this really
weird phase right now. Because generative models are so new, we haven't seen the dramatic
reduction in cost happen that Moore's law should be driving for compute. And so as a result,
the best generative model services in the world are premium services, because it's expensive to run inference, right? But what is crazy, to your point, is
that people are willing to pay for that. That's how much economic value is being unlocked by
generative models. At this point now, I've been personally involved either as an advisor or investor
or as an operator with at least five companies, generative model companies, that have exploded
past 30 to 50 million in revenue in subscription revenue in their first 12 months of monetizing.
And this has all happened in the last two years, right? It's insane. And I think that's because
when the model actually accomplishes a task for you that you didn't think you could do on your
own, whether that's generating an image with mid-journey, or it's getting an answer from
perplexity that would have taken you hours to do by yourself, or generating a podcast of your
voice using 11 labs from your text. These were things you'd actually have to go hire people to do
earlier. And it turns out when it's bundled up as compute on offer, and you can call on it 24/7, charging 20 bucks a month for it is not even a hard ask of customers, because
the comparable is legitimately, I think, hiring a human to do that task for you. That cost
varies anywhere from minimum wage per hour to some humans you literally can't find to accomplish the task you want on the timeline you want. And so,
I want to be clear, I don't think these models are replacing humans. I think they're filling
gaps in economic demand that weren't being filled, these niches that weren't being filled before
and they're creating new categories. And a $20 a month subscription for that today is not a hard
ask, but over time I see those prices will decline because the marginal cost of compute
will converge to closer and closer to zero.
To tie this back, so we started off talking about GPUs and kind of the training side,
if the UI of AI becomes multimodal data capture on device, what does the training process and system look like, and what does the model development process look like, as we're pumping in, I don't even know how to quantify, the volumes of data we're talking about generating here?
Yeah. So look, I think there was a moment in time about 24, 36 months ago where everyone was
like, oh, of course, bigger is better. The bigger the model, the better it's going to be and size is
everything. And we're going to have GPT 10 be this insanely large 100 trillion parameter model that
will be this big God brain. And products should just be increasingly wrappers on top of that
model. And that reality has not come to pass. What is instead happening is that the most useful
products are combinations of different models. Matei Zaharia has a great paper on this recently from his lab at Berkeley that calls these compound systems. They did a pretty systematic study of all
the most used products today that use generative models. And it turns out they're not single
monolithic models. They're combinations, these compound systems of different models acting in
unison. So I'm a big believer in the idea that the future products are going to be swarms of
small models working together to solve a task, cheaper, faster, more efficiently than just one
big megabrain that can do it. And then when those teams of models encounter tasks that they
can't solve themselves, then they will call out to a larger model that might be in the cloud
and then ask that model to do what might be a multi-day reasoning problem. Sometimes when you need
to invent the theory of relativity, you do need to go ask Einstein for help. But most things
throughout your day today, I don't need Einstein to help me with. And instead, what I do want
is a really great, efficient team of people who are specialists at something, acting in
close unison to each other, the way companies work together. And so I see a future where everybody's
got a personal team of AIs working for us, just the same way companies are often in service
of a customer. And I think what's going to happen is these inference workloads are going
to become combinations of quickly attacking a task with that team and then offloading the tasks
that they can't solve themselves
to bigger and bigger cloud-hosted inference workloads.
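Here's a rough sketch of that routing pattern: a small on-device specialist handles most requests and escalates only the hard ones to a larger hosted model. Everything in it is illustrative; the difficulty heuristic, the model tag, and the cloud call are stand-ins, not any specific product.

```python
import requests

LOCAL_URL = "http://localhost:11434/api/generate"  # e.g. a small model served locally via Ollama

def needs_deep_reasoning(task: str) -> bool:
    # Placeholder heuristic; a real system might use a learned router
    # or the small model's own confidence estimate.
    return len(task.split()) > 200 or "prove" in task.lower()

def ask_big_cloud_model(task: str) -> str:
    # Stand-in for a call to a large cloud-hosted model; provider details omitted.
    raise NotImplementedError("wire up a hosted model endpoint here")

def answer(task: str) -> str:
    """Try the small local specialist first; offload hard tasks to the cloud."""
    if needs_deep_reasoning(task):
        return ask_big_cloud_model(task)
    result = requests.post(
        LOCAL_URL,
        json={"model": "mistral", "prompt": task, "stream": False},
        timeout=60,
    )
    return result.json()["response"]
```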
At the same time, what that means for training
is that actually bigger and bigger training runs
may not be that important to our everyday lives.
What might actually be really important
is training and fine-tuning models,
base models on individual data.
One of the biggest unlocks at ByteDance, which made TikTok, was this intense personalization
of the algorithm so that within three swipes
of somebody opening up TikTok, they knew what Derrick really wanted to see next. And the concept
behind the scenes is basically a personal embedding for everybody, right? Like every individual
consumer has a personal embedding that understands their preferences so deeply that you're
able to serve them with what they want next, whether that's searching for a restaurant that you
want to go to, or just taking action for you and calling you a taxi if that's what you
need. And I think that in that future, a lot of the training, like what's
currently happening in the pre-training phase of model development will start to make its way into
post-training, what's currently called fine-tuning or customization, right? Once you have
a good enough base model for most tasks, then you can start to fine-tune it on what the individual
wants, on their individual data set. And so you don't need a massive model to keep reasoning about every individual user. You actually need just a good enough base model that then learns specifically about you. And that happens in the post-training step.
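Here's a toy sketch of the "personal embedding" idea he's describing: a user's preference vector built as a weighted aggregate of embeddings of items they've engaged with, then used to rank what to surface next. The library choice and weighting scheme are illustrative assumptions, not a description of any production system.

```python
import numpy as np

def personal_embedding(item_embeddings, engagement_weights):
    """Aggregate embeddings of items a user engaged with into one preference
    vector, weighted by how strongly they engaged (e.g. watch time)."""
    stacked = np.stack(item_embeddings)                  # (n_items, dim)
    weights = np.asarray(engagement_weights)[:, None]    # (n_items, 1)
    profile = (weights * stacked).sum(axis=0) / weights.sum()
    return profile / np.linalg.norm(profile)             # unit-normalize

def rank_candidates(profile, candidates):
    """Rank candidate items by cosine similarity to the user's profile."""
    def score(vec):
        return float(profile @ vec) / float(np.linalg.norm(vec))
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```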
It seems like we're reliably bad at predicting the future when it comes to certain things. I think most people jump to how sci-fi didn't seem to predict the internet or the smartphone, which might have been two of the biggest advances we actually had. So I'm going to ask you to comment on that, and also open yourself up to that same sort of mistake. Where do you think we might be missing some areas for improvement when we're thinking about how AI develops? Do we have blind spots based on what we're currently doing that might limit how we develop these things going forward?
Humans are so good at reasoning by
analogy, right? And rightly so because millions of years of evolution have showed us that pattern
matching is a really good skill in life. If your ancestors have seen a lion before and have
associated that with danger, next time you see a thing that kind of looks like a lion, you should
probably reason about it the way your brain, the way your ancestors, have learned over millions of years. And while that serves us well in most daily life, it's actually served us really poorly in computing, because I think we keep looking to biological metaphors to guide computer design.
And for a long time, AI was in this weird research path
where most of the AI research community just believed that the path to unlocking sort of general intelligence would be that you had to first figure out how human brains worked, and then you could replicate that in silicon, right?
And so there were decades of research at many DARPA, DOD, and university-funded labs that took this sort of neuroscience-first approach
to inventing computers.
That has proven to be mostly a distraction.
Like, it turns out that just predicting the next token or the next word a model should say
is a remarkably useful way to attack intelligence and design computers.
Instead of getting computers to learn like human beings, I will say what is now happening
is that because transformers are so remarkably effective at what they do,
most major industrial labs have doubled down on that architecture.
It's not clear whether that will result in multi-step reasoning of a kind that is essentially unconstrained, right?
It's not clear that the current architectures will get us all the way to the end goal; everyone has a definition of what the end goal is.
But let's say the end goal is a kind of computer that's able to do almost everything we want as humans and take all the drudgery out of our lives and allow us to be the ultimate bicycle for the mind.
Forget bicycle. Let's say we want computers to be the interstellar travel for the mind.
It's not clear that the current architectures we have of these models will get us there.
But because they work so well, the bulk of research dollars are going to fund optimizations in the current architecture.
That's a blind spot because what we may actually need is a fundamentally new architecture to unlock the interstellar travel for the mind.
While there's some promising startups trying to do that, it is a pretty capital intensive game.
It's not for the faint of heart.
And so what we do risk, I think, as an industry, is hitting a point where scaling laws don't hold.
Our current architectures actually do plateau.
And then we are back at the kind of slowdown that we had with AI for the past three winters over the last three decades where many of the loss curves or the ability for models to predict reality basically hit walls.
They started off being super promising and then they hit a wall.
So far, signs are that's not happening, but if there's one blind spot in the future of
computing, it's that our current architectures are insufficient and that we haven't invested
enough in alternative backups to overtake those.
Now, I'm optimistic that this time around, there's sufficient excitement in the ecosystem,
from everybody: the hardware providers at the compute level like Nvidia, the cloud providers and startups, and ultimately investors like us who are really excited for new architectures.
And so, you know, if there are people out there who are tinkering and experimenting with architectures that unlock new interfaces, unlock the next phase in computing.
Yeah, that's what we're here to fund.
I just wish more people were working on those.
Thanks for listening, everyone.
I thought that was a super insightful discussion, and I hope you did too.
We're just getting warmed up and we have more episodes to come shortly.
But in the meantime, feel free to rate the show and let us know what you think so far.
Thank you.