The a16z Show - Remaking the UI for AI
Episode Date: May 16, 2024

Make sure to check out our new AI + a16z feed: https://link.chtbl.com/aiplusa16z

a16z General Partner Anjney Midha joins the podcast to discuss what's happening with hardware for artificial intelligence. Nvidia might have cornered the market on training workloads for now, but he believes there's a big opportunity at the inference layer, especially for wearable or similar devices that can become a natural part of our everyday interactions. Here's one small passage that speaks to his larger thesis on where we're heading:

"I think why we're seeing so many developers flock to Ollama is because there is a lot of demand from consumers to interact with language models in private ways. And that means that they're going to have to figure out how to get the models to run locally without the user's context and data ever leaving the user's device. And that's going to result, I think, in a renaissance of new kinds of chips that are capable of handling massive workloads of inference on device.

"We are yet to see those unlocked, but the good news is that open source models are phenomenal at unlocking efficiency. The open source language model ecosystem is just so ravenous."

More from Anjney:
The Quest for AGI: Q*, Self-Play, and Synthetic Data
Making the Most of Open Source AI
Safety in Numbers: Keeping AI Open
Investing in Luma AI

Follow everyone on X:
Anjney Midha
Derrick Harris

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.
Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
The cost required to bring a new form factor to market is just insane.
We've had a reasoning breakthrough with large language models and generative models,
and they're really hungry for new kinds of input and context that the current generation of interfaces is not providing.
The North Star of computing has always been, to borrow like a Steve Jobs quote, to be the bicycle for the mind.
The history of hardware has been the history of computers.
Hey, the rate limiter here is silicon. It's sand. And there's tons of sand in the world.
And eventually, somebody else will figure out how to make sand that does the same thing.
We have adapted to interfaces over the last 60 years. But the way we interact with computers today is actually dramatically unnatural.
Hello, everyone. Welcome back to the A16Z podcast. Now, if you're a regular around here,
you probably have a sense of just how important the topic of artificial intelligence is to A16Z.
In fact, it is so important that we decided to create a new AI-specific podcast, which, to no surprise, is called AI plus A16Z.
And no, we did not consult AI on naming.
Today, you'll get to hear one of the early episodes from that podcast, hosted by venture editor Derrick Harris.
Derrick sits down with A16Z general partner Anjney Midha to discuss a very important question.
Here it goes.
If we have a new software platform through large language models, do we also need to rethink our hardware from the ground up?
Do our interfaces today keep pace with the data requirements across input, reasoning, and output?
In other words, what does the UI look like for AI?
Will it look like a phone or something completely different?
And what can we learn from the millennia of human behavior, long before this technology existed,
that might actually inform what's to come?
Of course, there are many companies trying to solve this right now,
but this conversation may actually give you clues as to why no one has quite solved this yet.
Again, this episode comes straight from our new AI plus A16Z feed.
So if you like this episode, don't forget to subscribe to get all of A16Z's latest AI content,
including episodes around the software supply chain, open source, RAG, and a whole lot more.
Of course, we'll include a link in our show notes.
And with that, Derrick, take it away.
Hi, this is Derrick Harris, and you're listening to the A16Z AI podcast, where we dig into all
things artificial intelligence with our in-house team of experts, as well as the founders,
engineers, and researchers working at the state of the art.
In this episode, I speak with A16Z general partner Anjney Midha about how AI hardware will look in the years to come,
and why there's so much innovation yet to happen at the inference layer.
Among other things, he explains how he sees wearable devices evolving to take advantage of improvements in sensors and workload-specific chips,
and how the introduction of big company technology, like the Apple Vision Pro, can actually lay the foundation for startups.
But because we recorded right after Nvidia's big GTC event, as with our previous episode
with Naveen Rao, we kick off the conversation talking about training workloads versus inference workloads
and how Nvidia came to dominate the former category. As a reminder, please note that the content
here is for informational purposes only, should not be taken as legal, business, tax, or investment
advice, or be used to evaluate any investment or security, and is not directed at any investors or
potential investors in any A16Z fund. For more details, please see a16z.com/disclosures.
I saw recently, like, Ollama release support for AMD GPUs.
I think I saw someone compare Nvidia to Sun Microsystems in the early days of the web,
which seems like it might be wishful thinking.
Are skeptics kind of underestimating the stranglehold that Nvidia has?
There are two schools of thought. One is that Nvidia's stronghold on this is completely
transitory.
Their margins are overinflated because of supply chain crises over the last 24 months,
where we had this explosion in demand, but like production is going to catch up.
And, you know, basically those folks will tell you, that school of thought
will tell you like, hey, the rate limiter here is silicon.
It's sand.
And there's tons of sand in the world.
And eventually, somebody else will figure out how to make sand that does the same thing.
There's tons of it on planet Earth.
And I personally find that view provocative but reductive, because it doesn't honor a first-
principles analysis of the fact that training has these idiosyncratic needs, like a really
robust software driver layer, right, that can orchestrate thousands of these chips acting in unison.
I think, yes, on that side of the debate, I'm certainly one who believes that the developer
experience that Nvidia started, by the way, investing in a decade ago is in its sort of later
stages of compounding right now. And it's really hard to dislodge that. What we may see is
margins compress over time because the budgets are shifting from training to inference.
And I think that's actually where a lot of the exciting stuff is happening. And I know we're going to
spend the bulk of time today, hopefully talking about inference, because that's open season right now.
Right.
Right. Every time you have a new compute, a new software primitive, it often results in new kinds of
workloads that the incumbents have a harder time keeping up with. And I would argue the
inference workloads, like you mentioned with Ollama, for example, are entirely new kinds
of compute workloads we haven't seen before. And so that's a much more even playing field. I don't
think there was a way for Nvidia to invest in that kind of workload 10 years ago, because it just didn't
exist, whereas training fundamentally has in some shape or form been around for the better part of a
decade because deep learning has been around for that long. And actually, I think this is a good
time to introduce what I think is a useful mental model that I have about the future of hardware.
And I found there's several ways to reason about hardware. One, you can work backwards from
the customer. Who is the customer here and what do they need? What are their pain points? And then there's
another way you can reason about is like reasoning from history and see what the progression and evolution
of compute has been over time. And I think the history of hardware has been the history
of computers, right? And in my mind, if you look at the last 60 years or so of computers that we've
had, and this is basically modern computing, one popular way to reason about it is often
the hardware versus software split. But I think there's another way to reason about it, for which I'll
give full credit to one of our founders, you know, Ankit Kumar, who spent a lot of time at
Discord building the first AI bot there and was reasoning about how to expose language models
to large numbers of users before a lot of people got the chance to experiment
with these large language models.
He basically believes there's two lineages of computing.
There's reasoning or intelligence, and then there's interfaces.
And you can kind of go back in time over the last 60 years
and basically break down every major computing revolution we've had
to some fundamental progress in either the reasoning part of computing
or the interface part of computing.
And if you traverse that tree, if you go all the way back to the first neural networks
in 1958, that's when we start to see
reasoning start to happen through neural networks. And then that gave way to
some of the probabilistic graphical models in the 80s. That then led to this idea of, like, GPU-accelerated
deep learning in the 2000s, which then gave way to transformers in the late 2010s. And then now we're
at this next phase of like massive transformers. And you can say, okay, that's one lineage,
which is the lineage of reasoning and intelligence. In parallel, we've had this other lineage in
computing, which is interfaces, right? And you started with
the command line and keyboards. And then that kind of gave way to the GUI with a mouse when Steve Jobs
got inspired by Xerox PARC in the 80s. And then that has led ultimately to mobile interfaces with
touch as an input mechanism. And then I think the question is where are we going next? And I think
we have really good reasons to believe that the next interface will be an AI companion. That's some
combination of text, voice, and vision that can understand the world. That's almost a better predictor
of where hardware is going, because the history of computing has so far shown that whichever
one of those lineages is undergoing a moment of resonance with customers ends up dominating
for the kinds of workloads that then get to scale for the next 10, 15 years. And I think we're
in the middle of both a reasoning and an interface shift. And that's what's exciting right now.
Right. It seems, if you look at it, like how you're explaining this, I would say, like,
the smartphone interface is pretty well established at this point, which runs some AI inference.
Like our computers have these inference chips in them, but like those are like standard,
well-known interfaces at this point.
And what's new seems to be on the reasoning side: the LLMs and foundation models
and this ability to interact with a model that way.
So it stands to reason.
If I'm hearing you correctly, right, like this is where we are in the reasoning side of
things.
So the hardware interface now needs to take that step forward.
That's exactly right.
I think basically we've had
a reasoning breakthrough, like you're saying, with large language models and generative models.
And they're really hungry for new kinds of input and context that the current generation of
interfaces is not providing. And I think that's why we're seeing folks tinker with interfaces
or experiment with new interfaces for the first time in ways that just weren't possible
before because the reasoning capability wasn't there. So what's the hurdle, I guess, of existing interfaces
in the sense that, like, we walk around with smartphones that seem pretty powerful. We've
had voice recognition for a while. We've had devices like, you know, these Amazon Echo devices
in our homes or whatever that we could talk to. And, you know, they run some sort of model off
in the cloud somewhere. Where's that step function or that improvement from what we have today?
That seems pretty capable in some ways to what you're explaining, which is like a whole new way of,
I mean, maybe data capture as a primary feature is kind of the jumping off point.
The North Star of computing has always been, to borrow like a Steve Jobs quote, to be the bicycle for the mind, right? It's to ultimately express or translate human thought into some set of actions in the world that then allow humans to accomplish what they want to do in ways that just aren't possible without the leverage of a tool.
Right.
And so computers, in their most grand, romantic reality or expression, are tools for thought and action in a way that allows us to accomplish things we never could without those tools.
And so if you ask, okay, well, why aren't computers able to help us accomplish that North Star today?
There's a whole host of reasons. But I think if you started at the top of that list and worked your way down, the first one is they're pretty dumb, right, at inferring our intent about the world. Humans, we have adapted to interfaces over the last 60 years. But the way we interact with computers today is actually dramatically unnatural, mostly because we are compensating for the lack of the computer's ability to understand our intent and translate that into action
proactively. There's this paradigm in computing, which is declarative versus imperative, right? The idea
being with imperative, you are extremely prescriptive about what you want the computer to do. You say,
open up this file and then, you know, here's a set of instructions I want you to do, almost like
you were instructing a toddler, right? And declarative is when you just declare essentially a
goal that you have in mind, the way you would interact with an adult, and say, hey, please go
book a flight for me, or whatever your goal is. And then it goes and reasons about how to do that.
And humans are phenomenal at doing that.
And computers are nowhere close to translating thought into action.
And so I think the big limitation is that computers just aren't smart enough to translate thought into action.
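To make the imperative versus declarative distinction described here concrete, the following is a minimal Python sketch. The flight data, field names, and function names are all hypothetical and invented for illustration; nothing here comes from the conversation. The first function spells out every step, while the second states the goal (cheapest option with at most one stop) and leaves the search to the language runtime.

```python
# Hypothetical data, invented purely for this illustration.
flights = [
    {"id": "A1", "price": 420, "stops": 1},
    {"id": "B2", "price": 380, "stops": 2},
    {"id": "C3", "price": 450, "stops": 0},
]


def cheapest_imperative(options):
    """Imperative style: prescribe each step of how to find the answer."""
    best = None
    for option in options:
        if option["stops"] > 1:
            continue  # skip anything with more than one stop
        if best is None or option["price"] < best["price"]:
            best = option
    return best


def cheapest_declarative(options):
    """Closer to declarative: describe what is wanted, not how to loop."""
    return min((o for o in options if o["stops"] <= 1), key=lambda o: o["price"])


imperative_choice = cheapest_imperative(flights)
declarative_choice = cheapest_declarative(flights)
```

Both calls pick the same flight. The difference is only in how much of the procedure the caller has to spell out, which is the gap the speaker is pointing at.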
And so if you ask what's the big blocker there, if generative models are in fact capable of reasoning, then why haven't we seen the computers cross that reasoning step, the intelligence step?
And I think there's two big problems.
There's a fundamental context problem.
The way generative models work is they're only as good as you can
prompt them, right? They're only as good as the questions you ask them. That's why it took
ChatGPT, which is essentially just a slightly different context form factor to ask questions
of GPT-3, which had been around, and GPT-3.5, which had been around as a raw next-token
prediction endpoint for like eight, nine months before ChatGPT came out, but nobody really did anything
interesting with it. And then when they packaged it up as a chat interface, the rest is history.
It turns out just allowing humans to prompt the model and talk to it and give it context in a way that's useful makes a massive difference in the value you can derive from these models.
While we've seen how valuable that is for pure text generation, when it comes to taking action in the world and doing something, predicting the next action that you want to take and then taking that action for you, we're nowhere close to that interface being able to see what you're seeing, listen to what you're listening to, hear what others in the room are hearing, look at where your eyes are tracking, and infer all of that context
about what the human wants to achieve and then proactively do that for you. Because right now,
we're kind of basically forcing these models to try to understand us through a little straw,
right, where all they can get is like a tiny representation of reality through what we can
type and via text chat. But we're losing all the other context about reality. And so there's an
interface problem where there's no interface today that seamlessly just captures the entire world
that you're interacting with and then translates that into a prompt for these generative models.
Now, the solution space is fascinating to think about because you could argue, well, the smartphone is probably the most densely packed world sensor rig that we've ever had.
Right.
We've got, just the one I have here has three rear-facing cameras, one forward-facing, a full depth sensor, RGB sensing on there.
It has an accelerometer motion.
It has GPS.
It knows where I am.
It's got microphones in there.
It can see what I'm seeing.
And so you might go, well, what are you talking
about? Billions of people today now have a full sensor rig in their pocket. The issue is that
when those sensors are all sitting in your pocket in a way that actually can't seamlessly
capture the world and insert itself proactively into your day, it's useless as an interface.
You need an interface that can provide the inference endpoint sufficient continuous context
about the world that the user is facing so that then the reasoning layer can start making
useful predictions about what you want to do. And I just think we haven't unlocked that
form factor yet. But voice comes close. Vision, I think, will be a huge part of it, as we're seeing
with multimodal models. And I think that there are, on the margins, huge gains to be had with
precise inputs that mirror thought, like eye tracking. I don't know if you've tried the Apple
Vision Pro out, but the entire premise of the Vision Pro interaction system is that you replace a mouse
with your eyes. And the eyes are extraordinarily good at looking somewhere that your brain wants to go.
It's actually the fastest-responding, you know, part of the human body that follows thought.
And so if a computing interface can infer what you're about to do because it has access to your intent via your eyes,
now I think suddenly you start to shave several seconds off latency between a computer understanding what you want to do and actually doing it.
And the history of human-computer interfaces is rife with examples of how shaving off marginal amounts of latency that may seem trivial makes dramatic changes in user adoption.
What does that look like?
Because we have companies experimenting right now with, like, pendants and pins and that sort of thing.
But also you look at, you know, sometimes with these wearable headsets, maybe you have to be corded in.
Maybe you have to, you know, you have this giant thing on your head because, I mean, you have to pack these sensors and pack a powerful enough chip into the device itself.
So where do we actually have to get to from a hardware perspective?
That's a great jumping off point, right,
because I think you can break down the hardware into three major buckets.
One is, what's the input that you need?
What's the full context window that it has over your life?
So there's just a bunch of hardware required to accurately sense the world.
And that's a sensing problem.
The second is the actual hardware required to then process all of that input and make sense of it.
And then the last step is the output hardware, right?
How do you relay the results of that reasoning step, that
middle step, back to the human and do it in a sufficiently integrated, seamless loop that you can
have a conversation with your computer in the same way that when you see a programmer in flow
state where they're literally programming in their IDE but they make a syntax error
and the IDE intelligently says, hey, here's your syntax error, the programmer then integrates that into
their workflow I'm a terrible programmer and so I've had the privilege of working with some
remarkable programmers in my life and that you can just tell when they're in
flow state, they're almost bionic when they and their computer meld into one. And that's primarily
a result of a bunch of really great decisions that we've made along the way about how the computer
talks back to you about the inference or the reasoning it's done. Big picture, you can think of
the hardware we need in three buckets: input, reasoning, and output. The human anatomy
analogy here would be eyes, ears, and fingers to see what's going on in the world. That's
step one, the sensing. Then you need a brain to make
sense of all that input. And then ultimately, like we have our appendages, you need something to
manipulate the environment around you and actually take action. Traditionally, most innovation
happened in this area in robotics. It was a well-contained kind of discipline of computer science
to reason about how to bring those three things together in a way that was research-friendly.
Robotics research has traditionally happened in industrial labs where they don't have to interact
with humans very much other than constrained environments like warehouses and so on. Today, what we're
seeing is a dramatic shift where the generative model breakthroughs have resulted in tons of
research outside of robotics in consumer land. The form factors are everything from pendants and
little pins on people's collars to pairs of glasses that just blend seamlessly into everyday life
and don't look any different from glasses that you or I would wear if we just had prescriptions
on. We are seeing companies that are thinking about more invasive but embedded devices that
literally might be a brain computer interface, like a chip, right?
Yeah. In situ.
If you asked me, hey, where are we seeing the most, as things go from like science to
engineering, where it might actually be in people's hands, everyday people's hands soon,
certainly the most innovation is happening in the middle step, the brain step, which is
where inference is today.
There are companies who are making chips that are custom designed for specific models,
where they're literally burning the weights.
This was an idea that I first heard from David Holz, the founder of Midjourney.
Early on, his intuition was that diffusion models were so effective at image generation
that eventually you'd have only a handful of models that were handling the bulk of inference
workloads for image generation.
And at that point, when you have sufficient scale and volume of a certain type of reasoning,
the brain is just having to do one type of task all day long.
You can actually make a chip that just does that kind of task,
right? And so you burn the model weights into the chip. That dramatically reduces the flexibility
of things that brain can reason about. You essentially turn from a big brain into a small brain,
but you can get, like, extraordinary orders of magnitude of speedup: 100x, 150x, 200x.
And that's generally been the history of computing as well. When you have sufficient maturity
of a certain type of workload, usually the chip sector makes a chip just for your calculator and just
for your fridge and so on. And so I think we're about to start seeing that with generative modeling
as well. On the input and output side, that's historically been such a difficult place for startups
to make a dent because the cost required to bring a new form factor to market is just insane.
And so the closest that I think we've seen companies come are teams like Oculus, right,
who brought one of the first really mass market virtual reality form factors to market.
And their hack was they piggybacked off of the PC and smartphone supply chain that had
grown up in the preceding decade that had resulted in the cost of a bunch of individual components
dramatically falling. That resulted in this huge manufacturing ecosystem in China, that they were then
able to basically go do shopping, off-the-shelf shopping, duct tape a bunch of parts together.
I literally think the first DK1 screen was a Samsung or an LG smartphone screen that if you
ask the founders, I think the founders of Oculus have this story where I'm paraphrasing poorly,
but the display manufacturers didn't even believe that you could refresh the pixels at a fast enough
rate for virtual reality. And they actually hacked into the driver and said, no, look, we can do it.
So I think what we need to be paying attention to on the hardware side is,
are there really low-cost form factors that startups can innovate around because there's an
existing supply chain that an incumbent like an Apple or a Google has essentially
subsidized at scale over the last decade. And that's what's so exciting about things like the
Apple Vision Pro: when Apple gets into the game,
it results in second- and third-order effects, like new supply chains showing up for sensors like depth sensors and LIDARs and pass-through mixed reality displays, that then give startups the license to go experiment with new form factors.
Whichever company figures out how to capture vision in a way that's sufficiently high
throughput for a user to be able to prompt a model and say, this is what I'm looking at,
is able to hear what the user is listening to because audio is a very large amount of the
context that drives decision making in our daily lives.
That is make or break.
So I think somebody who figures out the combination of audio and video, both sensing and
output, in a way that's very easy to summon throughout your daily life so that you're not
having to say, hey, Siri, hey Alexa.
Like that, the wake word, trigger word thing has not worked, right?
That paradigm, that interface is not working at scale.
It's ruthlessly disruptive to our day, like, to a natural conversation.
So I think the interface will look natural.
It will look like it's voice and audio heavy with vision augmenting, I think, the reasoning layer.
Yeah, it does seem like, not only is it disruptive, but you feel like it's unnatural in terms of just engaging with the world around you.
The same way, like, I think when Google Glass came
out, I think it was cool.
Right.
But I mean, that was quite a while ago when I think about it.
But yeah, it just seemed unnatural.
I think it seemed unnatural to have someone walking around with a camera on their face.
I mean, maybe that's going to look a lot more prescient than it did at the time.
I'm generally pretty Lindy about interface design, which is that if there's no good form factor
that proxies the device you're building that goes back hundreds of years, it is
remarkably hard to change human behavior so that the new paradigm works. Arguably, you could say
smartphones were not a new interface for humans. We had been used to putting things in our pockets
for years, whether it was wallets or notebooks. And we had been used to like tapping and
interacting with notebooks, whether that was with a pencil and so on. I think the innovation
there, of course, was that Steve Jobs figured out that styluses are pretty unnatural, even though
they look like pens and pencils that humans have used forever. But it
turns out what came even before a stylus was the human finger. And so I think the issue with
Google Glass was it completely violated all the laws of social Lindy, right? There was nothing
natural about a display floating in front of your eye with a camera. This is why I actually think
it's extremely unlikely that if there's a company out there experimenting
with a wearable that is sensing the world, that if it records humans, that it will gain
mainstream adoption, because that's fundamentally not a Lindy, natural behavior.
So that's a conundrum that is a trust and privacy issue where you have to find a way to
capture visual context about the world. And if you had a pair of glasses that had an
outward-facing visual sensor that could tell the device what you're seeing, but not record, then I
think that would be a breakthrough. And I think a big mistake that some of the big hardware
manufacturers today have made is they've rushed to bring recording glasses to market.
And people don't trust those. People haven't
welcomed those at scale into their lives yet.
What do you think it takes to actually break through that privacy or that, I mean, security,
you know, maybe less so, but like that privacy angle here and get people to embrace
some of these technologies?
Look, I think Apple is a great case study here.
It's this constant dialogue between a technical solution and a consumer or end user promise,
where Apple has basically said, look, we're going to basically not give ourselves the capability
to look at certain kinds of data
on your device. So Apple has a secure enclave on device on the iPhone. And a ton of the
smartphone camera processing actually just happens on device. Actually, almost 100% of Apple's native
computer vision processing on your photos happens on device. It actually never leaves the device.
Now, they've offered sort of cloud services on top, right, where you can back up your stuff for
storage like to iCloud. But when it actually comes to raw inference, like intelligence about your data,
That's all happening locally.
And so when you brought up Ollama earlier as an example,
I think why we're seeing so many developers flock to Ollama
is because there is a lot of demand from consumers
to interact with language models in private ways.
And that means they're going to have to figure out
how to get the models to run locally
without the user's context and data ever leaving the user's device.
And that's going to result, I think, in a renaissance of new kinds of chips
that are capable of handling massive workloads of inference on device.
We are yet to see those unlocked.
But the good news is open source models are phenomenal at unlocking efficiency.
The open source language model ecosystem is just so ravenous.
Like when Mistral's new model came out,
their mixture-of-experts open source model, a few months ago,
I think it literally took like less than 24 hours for somebody to quantize it
and then add GGML support so that you could run it locally within the week,
even though originally, like just out of the box, that model was actually really hard to run on
anything less than two 4090s. Like I had to get it up and running on my, I have like two gaming
cards at home in this like heat machine by my desk. But, you know, by today you can run a
Mixtral model, purely because of software improvements, on much smaller, on single chips.
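For readers unfamiliar with the term, quantization here means storing model weights at lower numeric precision so the model needs less memory. The sketch below shows the basic idea with simple symmetric 8-bit quantization in plain Python. It is an illustration only: the weight values are made up, the function names are hypothetical, and real tools such as GGML use more elaborate formats than this.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    # One shared scale for the whole list; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in quantized]


weights = [0.12, -0.5, 0.33, 0.91, -0.07]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The largest rounding error introduced by storing integers instead of floats.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now fits in one byte instead of the four bytes a 32-bit float would take, at the cost of a small rounding error bounded by half the scale. That trade of precision for memory is what lets a large model fit on smaller hardware.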
You'll get a few gains from the open source ecosystem that then allow use cases to be sufficiently
figured out that then the hardware guys go, oh, let's make special chips just for these
workloads. And that's what's happening. I think there are people bringing diffusion model chips
to market and transformer model chips to market that are just good at that workload, so that when a
startup says, "you can trust us," the user knows that they don't have to. It's engineered by design,
right? So I think "do no evil" by design results in "can't do evil." And I think that's the
strongest commitment you can make to your customers. I'm guessing it doesn't hurt when, or in fact,
helps would be the more affirmative statement way to state that. It probably helps when a company
comes and doesn't have, let's say, a legacy business to support that's dependent on showing you ads
or that's dependent on otherwise using your personal data in some ways.
Is that fair?
I mean?
Yeah.
So that, okay, that's a good question, which is, can you trust an AI companion that is being paid
by someone other than you?
And I do believe advertising is going to be,
like, basically, in its current shape and form, almost dead on arrival
for most intelligent interfaces that people trust with daily actions.
You know, it's one thing to trust a model with next token prediction, where you're asking for
the next word you should use in an essay.
When we make the leap to next action prediction, where you're giving it agency to act on your
behalf and represent you in the world, then a fundamental misalignment between you and the agent
who's doing things for you is that if that agent's not getting paid by you, then you may not
be able to trust it.
Now, I think there may be flavors of advertising that work that don't look like the Google ad format today where like you stuff four paid links ahead of the seven or eight good ones.
And I think we're actually seeing Google struggle with that a little bit.
But you're right.
I think the most aligned business model would just be the same way you hire a human to help you out with a task as an employee.
You're hiring an agent.
You're hiring a computer and you're paying for it,
and you're really the employer in that case.
And that's the most aligned business model, I think,
when it comes to this next wave of generative computers,
is to not think about them as tools.
While the ultimate impact that they can make on your life
is that of a tool,
the business model relationship should be much more akin to an employer and employee
than I think just a third-party tool
that somebody else is subsidizing for you to use.
The classic, like, "if you're not paying,
you're the product," I think, exists here.
And the failure mode is more
catastrophic when you're trusting it to do things on your behalf.
I find it interesting right now when you look at, say, subscriptions to ChatGPT,
subscriptions to any of these generative model services, Perplexity, whatever.
I mean, you name it. Like, people are willing to pay for search now
that's better.
People are willing to pay a monthly subscription for these things.
And maybe that does indicate, yes, that this is a thing people would pay for
going forward.
Like, we've crossed that bridge where we realized, to your point, like, free is
not free. And if something truly adds value, you can come in and say, this is what we do and
this is how we make our money, and take it or leave it. But know the service you're getting
for the price you're paying. Yeah. Look, I think the long arc of tech has shown that
the marginal cost of compute over time just generally converges to zero. Right. And we're in this
really weird phase right now. Because generative models are so new, we haven't seen the dramatic
reduction in cost happen that Moore's law should be driving
for compute. And so as a result, the best generative model services in the world are
premium services, because it's expensive to run inference, right? But what is crazy,
to your point, is that people are willing to pay for that. That's how much economic value is
being unlocked by generative models. At this point now, I've been personally involved either as
an advisor or investor or as an operator with at least five generative model companies
that have exploded past 30 to 50 million in subscription revenue in their first 12 months of monetizing.
And this has all happened in the last two years, right?
It's insane.
And I think that's because when the model actually accomplishes a task for you that you didn't think you could do on your own,
whether that's generating an image with Midjourney or it's getting an answer from Perplexity
that would have taken you hours to do by yourself or generating a podcast of your voice using ElevenLabs from your text,
these were things you'd actually have to go hire people to do earlier.
And it turns out when it's bundled up as compute and you can call on it 24/7,
charging 20 bucks a month for it is not even a hard ask of customers because the comparable is,
like legitimately, I think, hiring a human to do that task for you.
That cost varies anywhere from minimum wage per hour to some humans you literally can't find to
accomplish a task you want on the timeline you may need.
And so, I want to be clear,
I don't think these models are replacing humans.
I think they're filling gaps in economic demand that weren't being filled,
these niches that weren't being filled before.
And they're creating new categories.
And $20 a month subscription for that today is not a hard ask.
But over time, I think those prices will decline because the marginal cost of compute
will converge closer and closer to zero.
To tie this back.
So we started off talking about GPUs and kind of the training side.
If the UI of AI becomes, like, multimodal data capture on device,
what do the training process and system,
and then the model development process,
look like as we're just pumping in,
I don't even know how to quantify the volumes of data that we're talking
about generating here?
Yeah.
So look,
I think there was a moment in time about 24 to
36 months ago where everyone was like,
oh,
of course bigger is better.
The bigger the model,
the better it's going to be and size is everything.
And we're going to have GPT-10 be this insanely large,
100-trillion-parameter model that will be this big God brain. And products should just be increasingly
wrappers on top of that model. And that reality has not come to pass. What is instead happening is
that the most useful products are combinations of different models. Matei Zaharia has a great
paper on this recently from his lab at Berkeley that calls these compound systems. They did a pretty
systematic study of all the most used products today that use generative models. And it turns out
they're not single monolithic models.
They're combinations, these compound systems of different models acting in unison.
So I'm a big believer in the idea that the future products are going to be swarms of small models working together to solve a task cheaper, faster, and more efficiently than just one big megabrain that can do it.
And then when those teams of models encounter tasks that they can't solve themselves, then they will call out to a larger model that might be in the cloud and then ask that model to do
what might be a multi-day reasoning problem.
Sometimes when you need to invent the theory of relativity,
you do need to go ask Einstein for help.
But most things throughout your day today,
I don't need Einstein to help me with.
And instead, what I do want is a really great,
efficient team of people who are specialists at something,
acting in close unison to each other,
the way companies work together.
And so I see a future where everybody's got a personal team of AIs working for us,
just the same way companies are often in service of a customer.
And I think what's going to happen is these inference workloads are going to become combinations of quickly attacking a task with that team and then offloading the tasks that they can't solve themselves to bigger and bigger cloud-hosted inference workloads.
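The "team of specialists first, big cloud model as fallback" idea can be sketched as a simple routing loop. This is only an illustration under stated assumptions: `Specialist`, `route`, and the stand-in functions below are hypothetical names invented for this example, and a real compound system would decide when to escalate in far more sophisticated ways than a string check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Specialist:
    """A small local model, represented here by two plain functions."""
    name: str
    can_handle: Callable[[str], bool]
    run: Callable[[str], str]


def route(task: str, team: List[Specialist],
          cloud_model: Callable[[str], str]) -> Tuple[str, str]:
    """Try the local team first; escalate only if nobody can handle the task."""
    for specialist in team:
        if specialist.can_handle(task):
            return specialist.name, specialist.run(task)
    return "cloud", cloud_model(task)


# Stand-ins for real models, so the sketch runs without any dependencies.
def is_arithmetic(task: str) -> bool:
    return task.startswith("add ")


def add_numbers(task: str) -> str:
    _, a, b = task.split()
    return str(int(a) + int(b))


def cloud_model(task: str) -> str:
    return f"[escalated to cloud] {task}"


team = [Specialist("calculator", is_arithmetic, add_numbers)]
local = route("add 2 3", team, cloud_model)
remote = route("prove a new theorem", team, cloud_model)
```

Here the everyday task is answered on device by the small specialist, and only the "Einstein" task pays the cost of a round trip to the larger hosted model.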
At the same time, what that means for training is that actually bigger and bigger training runs may not be that important to our everyday lives.
What might actually be really important is training and fine-tuning models,
base models, on individual data.
One of the biggest unlocks for ByteDance, which made TikTok,
was this intense personalization of the algorithm,
so that within three swipes of somebody opening up TikTok,
they knew what Derrick really wanted to see next.
And the concept behind the scenes is basically a personal embedding for everybody, right?
Like every individual consumer has a personal embedding
that understands their preferences so deeply
that you're able to serve them with what they want next,
whether that's searching for a restaurant that you want to go to
or just like taking action for you
and calling you a taxi if that's what you need.
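To make the "personal embedding" idea concrete, here is a toy sketch: a user is a vector that drifts toward the content they engage with, and recommendations are whatever scores highest against that vector. This is not ByteDance's actual system. The names `update_profile`, `score`, and `recommend`, the two-dimensional vectors, and the update rate are all assumptions made for illustration.

```python
def update_profile(profile, item_vec, rate=0.5):
    """Move the user's vector part of the way toward an item they engaged with."""
    return [(1 - rate) * p + rate * x for p, x in zip(profile, item_vec)]


def score(profile, item_vec):
    """Dot product: higher means the item matches the user's preferences better."""
    return sum(p * x for p, x in zip(profile, item_vec))


def recommend(profile, catalog):
    """Return the name of the best-scoring item in the catalog."""
    return max(catalog, key=lambda name: score(profile, catalog[name]))


# Hypothetical two-topic catalog: [cooking-ness, sports-ness].
catalog = {"cooking": [1.0, 0.0], "sports": [0.0, 1.0]}

# A new user starts with no known preferences.
profile = [0.0, 0.0]

# Three swipes where the user lingers on sports content.
for _ in range(3):
    profile = update_profile(profile, catalog["sports"])

next_item = recommend(profile, catalog)
```

After only three interactions the profile already points strongly toward one topic, which is the intuition behind learning what someone wants "within three swipes."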
And I think that in that future,
a lot of the training,
like what's currently happening in the pre-training phase of model development
will start to make its way into post-training,
what's currently called fine-tuning or customization, right?
Where once you have a good enough base model for most tasks,
then you can start to fine-tune it on what the individual wants, on their individual data set.
And so you don't need a massive model to keep reasoning about every individual user.
You actually need just a good enough base model that then learns specifically about you.
And that happens in the post-training step.
It seems like we're reliably bad at predicting the future when it comes to certain things.
I think most people jump to like sci-fi didn't seem to predict the internet or the smartphone,
which might have been two of the biggest advances we actually had.
So I'm going to ask you like to comment on that and also open yourself up to that same sort of mistake.
But like, where do you think we might be missing some areas for improvement when we're thinking about how AI develops?
Like, do we have blind spots based on what we're currently doing that might limit how we develop these things going forward?
Humans are so good at reasoning by analogy, right?
And rightly so, because millions of years of evolution have shown us that pattern matching is a really good skill in life.
If your ancestors have seen a lion before and have associated that with danger, next time you
see a thing that kind of looks like a lion, you should probably reason about it the way
your ancestors have learned over millions of years. And I think while that serves us well
in most daily life, it's actually served us really poorly in computing, because I think we keep looking
to biological metaphors to guide computer design. And for a long time, AI was in this like weird
research path where most of the AI research community just believed that the path to unlocking sort of
general intelligence would be that you had to first figure out how brains worked, human brains worked,
and then you could replicate that in silicon, right? And so there were just, like, decades of
research at many DARPA- and DOD- and university-funded labs that was this sort of neuroscience-first
approach to inventing computers. That has proven to be mostly a distraction. Like,
it turns out that just predicting the next token or the next word a model should say is a remarkably
useful way to attack intelligence and design computers,
instead of getting computers to learn like human beings. I will say what is now happening
is that because transformers are so remarkably effective at what they do, most major industrial
labs have doubled down on that architecture.
It's not clear whether that will result in multi-step reasoning of a kind that is essentially
unconstrained, right?
It's not clear that the current architectures will get us all the way to
the end goal. Everyone has a definition of the end goal, but let's say the end goal
is a kind of computer that's able to do almost everything we want as humans and take all the drudgery
out of our lives and be the ultimate bicycle for the mind. Forget bicycle. Let's say
we want computers to be the interstellar travel for the mind. It's not clear that
the current architectures we have of these models will get us there. But because they work so well,
the bulk of research dollars are going to fund optimizations in the current architecture.
That's a blind spot because what we may actually need is a fundamentally new architecture to unlock
the interstellar travel for the mind. While there are some promising startups trying to do that,
it is a pretty capital-intensive game. It's not for the faint of heart. And so what we do risk,
I think, as an industry is hitting a point where scaling laws don't hold. Our current architectures
actually do plateau. And then we are back
at the kind of slowdown that we had with AI for the past three winters over the last three
decades where many of the loss curves or the ability for models to predict reality basically
hit walls. They started off being super promising and then they hit a wall. So far, signs are that's
not happening. But if there's one blind spot in the future of computing, it's that our current
architectures are insufficient and that we haven't invested enough in alternative backups to overtake
those. Now, I'm optimistic that this time around, there's sufficient excitement in the ecosystem,
from everybody, from the hardware providers at the compute level like Nvidia, to the cloud
providers and startups and ultimately investors like us who are really excited for new architectures. And so,
you know, if there are people out there who are tinkering and experimenting with those that unlock new
interfaces, unlock the next phase in computing, yeah, that's what we're here to fund. I just wish
more people were working on those. Thanks for listening, everyone.
I thought that was a super insightful discussion, and I hope you did too.
We're just getting warmed up and we have more episodes to come shortly.
But in the meantime, feel free to rate the show and let us know what you think so far.
