a16z Podcast - Remaking the UI for AI
Episode Date: May 16, 2024

Make sure to check out our new AI + a16z feed: https://link.chtbl.com/aiplusa16z

a16z General Partner Anjney Midha joins the podcast to discuss what's happening with hardware for artificial intelligence. Nvidia might have cornered the market on training workloads for now, but he believes there's a big opportunity at the inference layer, especially for wearable or similar devices that can become a natural part of our everyday interactions. Here's one small passage that speaks to his larger thesis on where we're heading:

"I think why we're seeing so many developers flock to Ollama is because there is a lot of demand from consumers to interact with language models in private ways. And that means that they're going to have to figure out how to get the models to run locally, without the user's context and data ever leaving the user's device. And that's going to result, I think, in a renaissance of new kinds of chips that are capable of handling massive workloads of inference on device. We are yet to see those unlocked, but the good news is that open source models are phenomenal at unlocking efficiency. The open source language model ecosystem is just so ravenous."

More from Anjney:
The Quest for AGI: Q*, Self-Play, and Synthetic Data
Making the Most of Open Source AI
Safety in Numbers: Keeping AI Open
Investing in Luma AI

Follow everyone on X:
Anjney Midha
Derrick Harris

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
Transcript
The cost required to bring a new form factor to market is just insane.
We've had a reasoning breakthrough with large language models and generative models,
and they're really hungry for new kinds of input and context that the current generation of interfaces is not providing.
The North Star of computing has always been, to borrow like a Steve Jobs quote,
to be the bicycle for the mind.
The history of hardware has been the history of computers.
Hey, the rate limiter here is silicon. It's sand.
and there's tons of sand in the world
and eventually somebody else will figure out
how to make sand that does the same thing.
We have adapted to interfaces
over the last 60 years
but the way we interact with computers today
is actually dramatically unnatural.
Hello everyone, welcome back to the A16Z podcast.
Now, if you're a regular around here,
you probably have a sense of just how important the topic of artificial intelligence is to A16Z.
In fact, it is so important
that we decided to create a new AI-specific podcast
which, to no surprise, is called AI plus A16Z.
And no, we did not consult AI on naming.
Today, you'll get to hear one of the early episodes from that podcast,
hosted by venture editor Derrick Harris.
Derrick sits down with A16Z general partner Anjney Midha to discuss a very important question.
Here it goes.
If we have a new software platform through large language models,
do we also need to rethink our hardware from the ground up?
Do our interfaces today keep pace with the data requirements
across input, reasoning, and output?
In other words, what does the UI look like for AI?
Will it look like a phone or something completely different?
And what can we learn from the millennia of human behavior
long before this technology existed
that might actually inform what's to come?
Of course, there are many companies trying to solve this right now,
but this conversation may actually give you clues
as to why no one has quite solved this yet.
Again, this episode comes straight from our new AI plus A16Z feed.
So if you like this episode, don't forget to subscribe to get all of A16Z's latest AI content, including episodes around the software supply chain,
open source, RAG, and a whole lot more.
Of course, we'll include a link in our show notes, and with that, Derrick, take it away.
Hi, this is Derrick Harris, and you're listening to the A16Z AI podcast,
where we dig into all things artificial intelligence with our in-house team of experts,
as well as the founders, engineers, and researchers working at the state of the art.
In this episode, I speak with A16Z general partner Anjney Midha about how AI hardware will look in the years to come,
and why there's so much innovation yet to happen at the inference layer.
Among other things, he explains how he sees wearable devices evolving to take advantage of improvements in sensors and workload-specific chips,
and how the introduction of big company technology, like the Apple Vision Pro, can actually lay the foundation for startups.
But because we recorded right after Nvidia's big GTC event, as with our previous episode,
with Naveen Rao, we kick off the conversation talking about training workloads versus inference workloads
and how NVIDIA came to dominate the former category.
As a reminder, please note that the content here is for informational purposes only,
should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security,
and is not directed at any investors or potential investors in any A16Z fund.
For more details, please see A16Z.com slash disclosures.
I saw recently that Ollama released support for AMD GPUs.
I think I saw someone compare Nvidia to Sun Microsystems in the early days of the web,
which seems like it might be wishful thinking.
Are our skeptics kind of underestimating the stranglehold that Nvidia has?
There are two schools of thought. One is that Nvidia's stranglehold on this is completely transitory: their margins are overinflated because of supply chain crises over the last 24 months, where we had this explosion in demand, but production is going to catch up. And basically that school of thought will tell you, hey, the rate limiter here is silicon. It's sand. And there's tons of sand in the world. And eventually somebody else will figure out how to make sand that does the same thing. There's tons of it on planet Earth. And I personally find that view provocative but reductive, because it doesn't honor a first-principles analysis of the fact that training has these idiosyncratic needs, like a really robust software driver layer that can orchestrate thousands of these chips acting in unison. I think, yes, on that side of the debate, I'm certainly one who believes that
the developer experience that Nvidia started, by the way, investing in a decade ago is in its
sort of later stages of compounding right now. And it's really hard to dislodge that. What we may
see is margins compress over time because the budgets are shifting from training to
inference. And I think that's actually where a lot of the exciting stuff is happening. And I know we're
going to spend a bulk of time today, hopefully talking about inference because that's open season
right now. Right. Every time you have a new compute, a new software primitive, it often results in
new kinds of workloads that the incumbents have a harder time keeping up with. And I would argue the
inference workloads, like you mentioned with Ollama, for example, are entirely new kinds of compute workloads we haven't seen before. And so that's a much more even playing field.
I don't think there was a way for Nvidia to invest in that kind of workload 10 years ago
because it just didn't exist.
Whereas training fundamentally has in some shape or form been around for the better
part of a decade because deep learning has been around for that long.
And actually, I think this is a good time to introduce what I think is a useful mental
model that I have about the future of hardware.
And I found there's several ways to reason about hardware.
One, you can work backwards from the customer.
Who is the customer here and what do they need?
What are their pain points?
And then there's another way you can reason about is like reasoning from history
and see what the progression and evolution of compute has been over time.
And I think the history of hardware has been the history of computers, right?
And in my mind, if you look at the last 60 years or so of computers that we've had,
which is basically modern computing, one popular way to reason about it is the hardware versus software split.
But I think there's another way to reason about it, which I'll give full credit to one of our founders, Ankit Kumar, who spent a lot of time at Discord building the first AI bot there and was reasoning about how to expose language models to large numbers of users
before a lot of people got the chance to experiment with these large language models.
He basically believes there's two lineages of computing.
There's reasoning or intelligence, and then there's interfaces.
And you can kind of go back in time over the last 60 years and basically break down every
major computing revolution we've had to some fundamental progress in either the reasoning part
of computing or the interface.
part of computing. And if you traverse that tree, if you go all the way back to the first neural
networks in 1958, that's when we start to see reasoning happen through neural networks. And then that gave way to sort of probabilistic graphical models in the 80s. That then led to this idea of GPU-accelerated deep learning in the 2000s, which then gave way to transformers in the late 2010s. And now we're at this next phase of massive transformers. So you can say, okay, that's one lineage, which is the lineage of reasoning and intelligence. In parallel,
we've had this other lineage in computing, which is interfaces, right? And you started with
the command line and keyboards. And then that kind of gave way to the GUI with a mouse when Steve Jobs
got inspired by Xerox PARC in the 80s. And then that has led ultimately to mobile interfaces with
touch as an input mechanism. And then I think the question is where are we going next? And I think
we have really good reasons to believe that the next interface will be an AI companion that's
some combination of text, voice, and vision that can understand the world. That's almost a better
predictor of where hardware is going because the history of computing is so far shown that
whichever one of those lineages is undergoing a moment of resonance with customers ends up
dominating for the kinds of workloads that then get to scale for the next 10, 15 years. And I think
we're in the middle of both a reasoning and an interface shift.
And that's what's exciting right now.
Right.
It seems, the way you're explaining this, that the smartphone interface is pretty well established at this point, and it runs some AI inference.
Our computers have these inference chips in them, but those are standard, well-known interfaces at this point.
And what's new seems to be on the reasoning side: the LLMs and foundation models and this ability to interact with a model that way.
So it stands to reason, if I'm hearing you correctly, right?
Like, this is where we are in the reasoning side of things.
So the hardware interface now needs to take that step forward.
That's exactly right.
I think you're basically, we've had a reasoning breakthrough, like you're saying, with large
language models and generative models.
And they're really hungry for new kinds of input and context that the current generation
of interfaces is not providing.
And I think that's why we're seeing folks tinker with interfaces, or experiment with
new interfaces for the first time in ways that just weren't possible before because the reasoning
capability wasn't there.
So what's the hurdle, I guess, of existing interfaces, in the sense that we all carry around a smartphone that seems pretty powerful?
We've had voice recognition for a while.
We've had devices like, you know, these Amazon Echo devices in our homes or whatever that we
could talk to.
And, you know, they run some sort of model off in the cloud somewhere.
Right.
Where's that step function, or that improvement, from what we have today, which seems pretty capable in some ways, to what you're explaining, which is a whole new way of, I mean, maybe data capture as a primary feature is kind of the jumping off point.
The North Star of computing has always been, to borrow a Steve Jobs quote, to be the bicycle for the mind, right? To ultimately express or translate human thought into some set of actions in the world that then allow humans to accomplish what they want to do in ways that just aren't possible without the leverage of a tool.
And so computers in their most grand romantic reality or expression are tools for thought
and action in a way that allow us to accomplish things we never could without those tools.
And so if you ask, okay, well, why aren't computers able to help us accomplish that North Star
today?
There's a whole host of those reasons.
But I think you started at the top of that list and you worked your way down.
The first one is they're pretty dumb, right?
At inferring our intent about the world. As humans, we have adapted to interfaces over the last 60 years, but the way we interact with computers today is actually dramatically unnatural, mostly because we are compensating for the lack of the computer's ability to understand our intent and translate that into action proactively. There's this paradigm in computing, which is declarative versus imperative, right? The idea being, with imperative, you are extremely prescriptive about what you want the computer to do. You say, open up this file, and then here's a set of instructions I want you to follow, almost like you were instructing a toddler. And declarative is when you just declare essentially a goal that you have in mind, the way you would interact with an adult, and say, hey, please go book a flight for me, or whatever your goal is. And then it goes and reasons about how to do that. Humans are phenomenal at doing that, and computers are nowhere close to translating thought into action. And so I think the big limitation is that computers just aren't smart enough to translate thought into action.
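To make the contrast he's drawing concrete, here is a rough sketch in Python; the `browser` and `agent` objects and their methods are hypothetical, purely for illustration:

```python
# Imperative: the human spells out every step, like instructing a toddler.
def book_flight_imperative(browser):
    browser.open("https://airline.example.com")   # hypothetical booking site
    browser.fill("from", "SFO")
    browser.fill("to", "JFK")
    browser.fill("date", "2024-06-01")
    browser.click("search")
    browser.click("cheapest_nonstop")
    browser.click("confirm")

# Declarative: the human states the goal; the agent reasons out the steps itself.
def book_flight_declarative(agent):
    agent.act("Book me the cheapest nonstop SFO to JFK flight on June 1")
```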
So if you ask what's the big blocker there, if generative models are in fact capable of reasoning, then why haven't we seen computers cross that reasoning step, the intelligence step?
And I think there's two big problems there.
One, there's a fundamental context problem.
The way generative models work is they're only as good as you can prompt them, right?
They're only as good as the questions you ask them.
Take ChatGPT, which is essentially just a slightly different form factor for asking questions of GPT-3 and GPT-3.5. Those had been around as raw next-token prediction endpoints for like eight, nine months before ChatGPT came out, but nobody really did anything interesting with them. And then when they packaged it up as a chat interface,
the rest is history. It turns out just allowing humans to prompt the model and talk to it and
give it context in a way that's useful makes a massive difference in the value you can derive
from these models. While we've seen how valuable that is for pure text generation, when it comes to
taking action in the world and doing something, predicting the next action that you want to take and
then taking that action for you, there's no interface anywhere close to being able to see what you're seeing, listen to what you're listening to, hear what others in the room are hearing, look at where your eyes are tracking, and infer all of that context about what
the human wants to achieve and then proactively do that for you. Because right now, we're kind
of basically forcing these models to try to understand us through a little straw, right? Where
all they can get is like a tiny representation of reality through what we can type and via text
chat, but we're losing all the other context about reality. And so there's an interface problem
where there's no interface today that seamlessly just captures the entire world that you're interacting
with and then translates that into a prompt for these generative models. Now, the solution space is
fascinating to think about, because you could argue, well, the smartphone is probably the most densely packed world-sensor rig we've ever had. Right? Just the one I have here has three rear-facing cameras, one forward-facing, a full depth sensor, RGB sensing. It has an accelerometer for motion. It has GPS. It knows where I am.
It's got microphones in there. It can see what I'm seeing. And so you might go, well,
what are you talking about? Billions of people today now have a full sensor rig in their
pocket. The issue is that when those sensors are all sitting in your pocket, in a way that can't seamlessly capture the world and insert itself proactively into your day,
it's useless as an interface.
You need an interface that can provide the inference endpoint with
sufficient, continuous context about the world
that the user is facing,
so that then the reasoning layer can start making useful predictions
about what you want to do.
And I just think we haven't unlocked that form factor yet.
But voice comes close.
Vision, I think, will be a huge part of it,
as we're seeing with multimodal models.
And I think that there are, on the margins, huge gains to be had with precise inputs that mirror thought, like eye tracking.
I don't know if you've tried the Apple Vision Pro out,
but the entire premise of the Vision Pro interaction system
is that you replace a mouse with your eyes.
And the eyes are extraordinarily good
at looking somewhere that your brain wants to go.
It's actually the fastest part of the human body
that follows thought.
And so if a computing interface can infer
what you're about to do because it has access
to your intent via your eyes,
now I think suddenly you start to shave several seconds off latency between a computer
understanding what you want to do and actually doing it. And the history of human computer
interfaces is rife with examples of how shaving off marginal amounts of latency that may seem trivial makes dramatic changes in user adoption. What does that look like? We have companies experimenting right now with pendants and pins and that
sort of thing. But also, sometimes with these wearable headsets, maybe you have to be corded in, maybe you have this giant thing on your head, because you have to pack these sensors and a powerful enough chip into the device itself. So where do we actually have to get to from a hardware perspective? That's a great jumping off point,
right? Because I think you can break down the hardware into three major buckets. One is,
what's the input that you need? What's the full context window that it has over your
life? So there's just a bunch of hardware required to accurately sense the world. And that's a
sensing problem. The second is the actual hardware required to then process all of that input
and make sense of it. And then the last step is the output hardware, right? How do you relay
the results of that reasoning step, that middle step back to the human and do it in a
sufficiently integrated seamless loop that you can have a conversation with your computer
in the same way that when you see a programmer in flow state, where they're literally programming in their IDE, they make a syntax error and the IDE intelligently says, hey, here's your syntax error.
The programmer then integrates that into their workflow.
I'm a terrible programmer, and so I've had the privilege of working with some remarkable programmers in my life.
And you can just tell when they're in flow state, they're almost bionic, when they and their computer meld into one.
And that's primarily a result of a bunch of really great decisions that we've made along the way about how the computer talks back to you about the inference or the reasoning it's done.
Big picture, you can think of the hardware we need as three buckets: input,
reasoning, and output.
The human anatomy analogy here would be eyes, ears, and fingers to see what's going on in the world.
That's step one, the sensing.
Then you need a brain to make sense of all that input.
And then ultimately, like we have our appendages, you need something to manipulate the environment
around you and actually take action.
Traditionally, most innovation happened in this area in robotics.
It was a well-contained kind of discipline of computer science to reason about how to bring those three things together in a way that was research-friendly.
Robotics research has traditionally happened in industrial labs where they don't have to interact with humans very much other than constrained environments like warehouses and so on.
Today, what we're seeing is a dramatic shift where the generative model breakthroughs have resulted in tons of research outside of robotics in consumer land.
The form factors are everything from pendants and little pins on people's collars to pairs of glasses that just blend seamlessly into everyday life and don't look any different from glasses that you or I would wear if we just had prescriptions.
We are seeing companies that are thinking about more invasive but embedded devices that literally might be a brain computer interface like a chip, right?
Yeah.
In situ.
If you asked me, hey, where are we seeing the most progress as things go from science to engineering, where it might actually be in everyday people's hands soon? Certainly the most innovation is happening in the middle step, the brain step, which is
where inference is today. There are companies who are making chips that are custom designed for
specific models where they're literally burning the weights. This was an idea that I first heard
from David Holz, the founder of Midjourney. Early on, his intuition was that diffusion models
were so effective at image generation
that eventually you'd have only a handful of models
that were handling the bulk of inference workloads
for image generation.
And at that point, when you have sufficient scale
and volume of a certain type of reasoning,
the brain is just having to do one type of task all day long,
you can actually make a chip that just does that kind of task, right?
And so you burn the model weights into the chip,
that dramatically reduces the flexibility of things
that brain can reason about.
You essentially turn from a big brain into a small brain,
but you can get extraordinary orders-of-magnitude speedups: 100x, 150x, 200x.
And that's generally been the history of computing as well.
When you have sufficient maturity of a certain type of workload,
usually the chip sector makes a chip just for your calculator and just for your fridge and so on.
And so I think we're about to start seeing that with generative modeling as well.
On the input and output side, that's historically been such a difficult place for startups to make a dent
because the cost required to bring a new form factor to market is just insane.
And so the closest that I think we've seen companies come are teams like Oculus, right,
the company who brought one of the first really mass market virtual reality form factors to market.
And their hack was they piggybacked off of the PC and smartphone supply chain that had grown up in the preceding decade that had resulted in the cost of a bunch of individual components dramatically falling.
That resulted in this huge manufacturing ecosystem in China that they were then able to basically go shopping in, off-the-shelf shopping, and duct-tape a bunch of parts together.
I literally think the first DK1 screen was a Samsung or an LG smartphone screen. If you ask the founders of Oculus, they have this story, and I'm paraphrasing poorly, where the display manufacturers didn't even believe that you could refresh the pixels at a fast enough rate for virtual reality.
And they actually hacked into the driver and said, no, look, we can do it.
So I think what we need to be paying attention to on the hardware side is: are there really low-cost form factors that startups can innovate around because there's an existing supply chain that an incumbent like an Apple or Google has essentially subsidized at scale over the last decade?
And that's what's so exciting about things like the Apple Vision Pro: when Apple gets into the game, it results in second and third order effects, like new supply chains showing up
for sensors like depth sensors and LIDARs and pass through mixed reality displays that then
give startups the license to go experiment with new form factors.
Whichever company figures out how to capture vision in a way that's sufficiently high throughput for a user to be able to prompt a model and say, this is what I'm looking at, and is able to hear what the user is listening to, because audio is a very large amount of the context that drives decision making in our daily lives, that is make or break. So I think it's somebody who figures out the combination of audio and video, both sensing and output, in a way that's very easy to summon throughout your daily life, so that you're not having to say, hey Siri, hey Alexa. That wake-word, trigger-word thing has not worked, right? That paradigm, that interface, is not working at scale. It's ruthlessly disruptive to our day, to a natural conversation. So I think the interface will look natural. It will look like it's voice and audio heavy, with vision augmenting, I think,
the reasoning layer.
Yeah, it does seem like not only is it disruptive, but it feels unnatural in terms of just how you're engaging with the world around you.
The same way, I think when Google Glass came out, I think it was cool.
Right.
But, I mean, that was quite a while ago now if you think about it.
But yeah, it just seemed unnatural. It seemed unnatural to have someone walking around with a camera on their face. And maybe it's going to look a lot more prescient than it did at the time.
I'm generally pretty lindy about interface design, which is that if there's no good form factor
that proxies the device you're building that goes back hundreds of years, it is remarkably hard
to change human behavior so that the new paradigm works. Arguably, you could say smartphones were
not a new interface for humans. We had been used to putting things in our pockets for years,
whether it was wallets or notebooks. And we had been used to like tapping and interacting with
notebooks, whether that was with a pencil and so on. I think the innovation there, of course,
was that Steve Jobs figured out that styluses are pretty unnatural, even though they look
like pens and pencils that humans have used forever. But it turns out what came even before
a stylus was the human finger. And so I think the issue with Google Glass was it completely
violated all the laws of social Lindy, right? There was nothing natural about a display floating
in front of your eye with a camera. This is why I actually think it's extremely unlikely that, if there's a company out there experimenting with a wearable that is sensing the world, and it records humans, it will gain mainstream adoption, because that's fundamentally not a Lindy, natural behavior. So that's a conundrum
that is a trust and privacy issue where you have to find a way to capture visual context about
the world. And if you had a pair of glasses that had an outward-facing visual sensor that could
tell the device what you're seeing, but not record, then I think that would be a breakthrough.
And I think a big mistake that some of the big hardware manufacturers today have made is they've
rushed to bring recording glasses to market. And people don't trust those. People haven't welcomed
those at scale into their lives yet. What do you think it takes to actually break through
that privacy, or, I mean, maybe less so the security, but that privacy angle here, and get people to embrace some of these technologies? Look, I think Apple is a great case study
here. It's this constant dialogue between a technical solution and a consumer or end user promise
where Apple has basically said, look, we're going to basically not give ourselves the capability
to look at certain kinds of data on your device. So, Apple has a secure enclave on device
on the iPhone. And a ton of the smartphone camera processing actually just happens on device.
Actually, almost 100% of Apple's native computer vision processing on your photos happens on device.
It actually never leaves the device.
Now, they've offered sort of cloud services on top, right, that you can back up your stuff for storage to iCloud.
But when it actually comes to raw inference, like intelligence about your data, that's all happening locally.
And so when you brought up Ollama earlier as an example, I think why we're seeing so many developers flock to Ollama is because there is a lot of demand from consumers to interact with language models in private ways.
And that means they're going to have to figure out how to get the models to run locally
without the user's context and data ever leaving the user's device.
And that's going to result, I think, in a renaissance of new kinds of chips that are capable
of handling massive workloads of inference on device.
We are yet to see those unlocked.
But the good news is open source models are phenomenal at unlocking efficiency.
The open source language model ecosystem is just so ravenous.
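For context, here's roughly what that local, private inference loop looks like with Ollama today: a minimal sketch in Python that assumes the Ollama server is running on its default port and that a model such as `mistral` has already been pulled locally.

```python
import requests

# Prompt a locally served model through Ollama's HTTP API (default port 11434).
# The model weights and the user's context both stay on the device.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",   # any locally pulled model tag
        "prompt": "Summarize my meeting notes in three bullet points.",
        "stream": False,      # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(response.json()["response"])
```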
Like when Mistral's new model came out, their mixture-of-experts open source model, a few months ago, I think it literally took less than 24 hours for somebody to quantize it and then add GGML support so that you could run it locally within the week, even though originally, just out of the box, that model was actually really hard to run on anything less than two 4090s. I had to get it up and running on the two gaming cards I have at home, in this heat machine by my desk. But today you can run that mixture-of-experts model, purely because of software improvements, on much smaller hardware, on single chips. You get these gains from the open source ecosystem that then allow use cases to be sufficiently figured out
that then the hardware guys go, oh, let's make special chips just for these workloads.
And that's what's happening.
I think there are people bringing diffusion model chips to market and transformer model chips
to market that are just good at that workload so that when a startup says, you can trust us,
the user knows that they don't have to.
It's engineered by design, right?
So I think doing no evil by design results in a can't do evil.
And I think that's the strongest commitment you can make to your customers.
I'm guessing it doesn't hurt when, or in fact "helps" would be the more affirmative way to state that.
It probably helps when a company comes and doesn't have, let's say, a legacy business
to support that's dependent on showing you ads or that's dependent on otherwise using your
personal data in some ways.
Is that fair?
I mean.
Yeah.
So, okay, that's a good question, which is: can you trust an AI companion that is being paid by someone other than you? And I do believe advertising is going to be, basically in its current shape and form, almost dead on arrival for most intelligent interfaces that people trust with daily actions. You know, it's one thing to trust a model with next token prediction, where you're asking for the next word you should use in an essay. When we make the leap to next action prediction, where you're giving it agency to act on your behalf and represent you in the world,
then the fundamental misalignment between you and the agent who's doing things for you
is that if that agent's not getting paid by you, then you may not be able to trust it.
Now, I think there may be flavors of advertising that work,
that don't look like the Google ad format today where like you stuff four paid links
ahead of the seven or eight good ones.
And I think we're actually seeing Google struggle with that a little bit.
But you're right.
I think the most aligned business model would just be the same way you hire a human to
help you out with a task as an employee.
You're hiring an agent, you're hiring a computer and you're paying for it.
And you're really the employer in that case.
And that's the most aligned business model, I think, when it comes to this next wave of generative
computers, is to not think about them just as tools. While the ultimate impact that they can make on your life is that of a tool, the business model relationship should be much more akin to an employer and employee than, I think, just a third-party tool that somebody else is subsidizing for you to use.
The classic "if you're not paying, you're the product" exists here, I think. And the failure mode is more catastrophic when you're trusting it to do things on your behalf.
I find it interesting right now.
When you look at, say, subscriptions to ChatGPT, subscriptions to any of these generative model services, Perplexity, whatever, you name it, people are willing to pay for search now, search that's better. People are willing to pay a monthly subscription for these things. And maybe that does indicate, yes, that this is a thing people would pay for going forward. Like we've crossed that bridge where we realized, to your point, free is not free. And if something truly adds value, you can come in and say, this is what we do and this is how we make our money, take it or leave it. But know the service you're getting for the price you're paying.
Yeah. Look, I think the long arc of tech has shown that the marginal
cost of compute over time just generally converges to zero, right? And we're in this really
weird phase right now. Because generative models are so new, we haven't seen the dramatic
reduction in cost happen that Moore's law should be driving for compute. And so as a result,
the best generative model services in the world are premium services, because it's expensive to run inference, right? But what is crazy, to your point, is
that people are willing to pay for that. That's how much economic value is being unlocked by
generative models. At this point now, I've been personally involved either as an advisor or investor
or as an operator with at least five companies, generative model companies, that have exploded
past 30 to 50 million in revenue in subscription revenue in their first 12 months of monetizing.
And this has all happened in the last two years, right? It's insane. And I think that's because
when the model actually accomplishes a task for you that you didn't think you could do on your
own, whether that's generating an image with mid-journey, or it's getting an answer from
perplexity that would have taken you hours to do by yourself, or generating a podcast of your
voice using 11 labs from your text. These were things you'd actually have to go hire people to do
earlier. And it turns out when it's bundled up as compute on offer, and you can call on it 24/7, charging 20 bucks a month for it is not even a hard ask of customers, because
the comparable is legitimately, I think, hiring a human to do that task for you. That cost
varies anywhere from minimum wage per hour to some humans you literally can't find to accomplish the task you want on the timeline you want. And so,
I want to be clear, I don't think these models are replacing humans. I think they're filling
gaps in economic demand that weren't being filled, these niches that weren't being filled before
and they're creating new categories. And a $20 a month subscription for that today is not a hard
ask, but over time I see those prices will decline because the marginal cost of compute
will converge to closer and closer to zero.
To tie this back, so we started off talking about GPUs and kind of the training side,
if the UI of AI becomes multimodal data capture on device, what does the training process and system look like, and what does the model development process look like, as we're pumping in, I don't even know how to quantify, the volumes of data we're talking about generating here?
Yeah. So look, I think there was a moment in time about 24, 36 months ago where everyone was
like, oh, of course, bigger is better. The bigger the model, the better it's going to be and size is
everything. And we're going to have GPT 10 be this insanely large 100 trillion parameter model that
will be this big God brain. And products should just be increasingly wrappers on top of that
model. And that reality has not come to pass. What is instead happening is that the most useful
products are combinations of different models. Matei Zaharia has a great paper on this recently from his lab at Berkeley that calls these compound systems. They did a pretty systematic study of all
the most used products today that use generative models. And it turns out they're not single
monolithic models. They're combinations, these compound systems of different models acting in
unison. So I'm a big believer in the idea that the future products are going to be swarms of
small models working together to solve a task, cheaper, faster, more efficiently than just one
big megabrain that can do it. And then when those teams of models encounter tasks that they
can't solve themselves, then they will call out to a larger model that might be in the cloud
and then ask that model to do what might be a multi-day reasoning problem. Sometimes when you need
to invent the theory of relativity, you do need to go ask Einstein for help. But most things
throughout your day today, I don't need Einstein to help me with. And instead, what I do want
is a really great, efficient team of people who are specialists at something, acting in
close unison to each other, the way companies work together. And so I see a future where everybody's
got a personal team of AIs working for us, just the same way companies are often in service
of a customer. And I think what's going to happen is these inference workloads are going
to become combinations of quickly attacking a task with that team and then offloading the tasks
that they can't solve themselves
to bigger and bigger cloud-hosted inference workloads.
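Here's a rough sketch of that routing pattern: a small on-device specialist handles most requests and escalates only the hard ones to a larger hosted model. Everything in it is illustrative; the difficulty heuristic, the model tag, and the cloud call are stand-ins, not any specific product.

```python
import requests

LOCAL_URL = "http://localhost:11434/api/generate"  # e.g. a small model served locally via Ollama

def needs_deep_reasoning(task: str) -> bool:
    # Placeholder heuristic; a real system might use a learned router
    # or the small model's own confidence estimate.
    return len(task.split()) > 200 or "prove" in task.lower()

def ask_big_cloud_model(task: str) -> str:
    # Stand-in for a call to a large cloud-hosted model; provider details omitted.
    raise NotImplementedError("wire up a hosted model endpoint here")

def answer(task: str) -> str:
    """Try the small local specialist first; offload hard tasks to the cloud."""
    if needs_deep_reasoning(task):
        return ask_big_cloud_model(task)
    result = requests.post(
        LOCAL_URL,
        json={"model": "mistral", "prompt": task, "stream": False},
        timeout=60,
    )
    return result.json()["response"]
```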
At the same time, what that means for training
is that actually bigger and bigger training runs
may not be that important to our everyday lives.
What might actually be really important
is training and fine-tuning models,
base models on individual data.
One of the biggest unlocks at ByteDance, which made TikTok, was this intense personalization
of the algorithm so that within three swipes
of somebody opening up TikTok, they knew what Derrick really wanted to see next. And the concept
behind the scenes is basically a personal embedding for everybody, right? Like every individual
consumer has a personal embedding that understands their preferences so deeply that you're
able to serve them with what they want next, whether that's searching for a restaurant that you
want to go to, or just taking action for you and calling you a taxi if that's what you
need. And I think that in that future, a lot of the training, like what's
currently happening in the pre-training phase of model development will start to make its way into
post-training, what's currently called fine-tuning or customization, right? Once you have
a good enough base model for most tasks, then you can start to fine-tune it on what the individual
wants, on their individual data set. And so you don't need a massive model to keep reasoning about every individual user. You actually need just a good enough base model that then learns specifically about you. And that happens in the post-training step.
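Here's a toy sketch of the "personal embedding" idea he's describing: a user's preference vector built as a weighted aggregate of embeddings of items they've engaged with, then used to rank what to surface next. The library choice and weighting scheme are illustrative assumptions, not a description of any production system.

```python
import numpy as np

def personal_embedding(item_embeddings, engagement_weights):
    """Aggregate embeddings of items a user engaged with into one preference
    vector, weighted by how strongly they engaged (e.g. watch time)."""
    stacked = np.stack(item_embeddings)                  # (n_items, dim)
    weights = np.asarray(engagement_weights)[:, None]    # (n_items, 1)
    profile = (weights * stacked).sum(axis=0) / weights.sum()
    return profile / np.linalg.norm(profile)             # unit-normalize

def rank_candidates(profile, candidates):
    """Rank candidate items by cosine similarity to the user's profile."""
    def score(vec):
        return float(profile @ vec) / float(np.linalg.norm(vec))
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```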
It seems like we're reliably bad at predicting the future when it comes to certain things. I think most people jump to how sci-fi didn't seem to predict the internet or the smartphone, which might have been two of the biggest advances we actually had. So I'm going to ask you to comment on that, and also open yourself up to that same sort of mistake. Where do you think we might be missing some areas for improvement when we're thinking about how AI develops? Do we have blind spots based on what we're currently doing that might limit how we develop these things going forward?
Humans are so good at reasoning by
analogy, right? And rightly so because millions of years of evolution have showed us that pattern
matching is a really good skill in life. If your ancestors have seen a lion before and have
associated that with danger, next time you see a thing that kind of looks like a lion, you should
probably reason about it the way your brain, the way your ancestors, have learned over millions of years. And while that serves us well in most daily life, it's actually served us really poorly in computing, because I think we keep looking to biological metaphors to guide computer design.
And for a long time, AI was in this weird research path
where most of the AI research community just believed that the path to unlocking sort of general intelligence would be that you had to first figure out how human brains worked, and then you could replicate that in silicon, right?
And so there were decades of research at many DARPA, DOD, and university-funded labs that took this sort of neuroscience-first approach
to inventing computers.
That has proven to be mostly a distraction.
Like, it turns out that just predicting the next token or the next word a model should say
is a remarkably useful way to attack intelligence and design computers.
Instead of getting computers to learn like human beings, I will say what is now happening
is that because transformers are so remarkably effective at what they do,
most major industrial labs have doubled down on that architecture.
It's not clear whether that will result in multi-step reasoning of a kind that is essentially unconstrained, right?
It's not clear that the current architectures will get us all the way to the end goal; everyone has a definition of what the end goal is.
But let's say the end goal is a kind of computer that's able to do almost everything we want as humans and take all the drudgery out of our lives and allow us to be the ultimate bicycle for the mind.
Forget bicycle. Let's say we want computers to be the interstellar travel for the mind.
It's not clear that the current architectures we have of these models will get us there.
But because they work so well, the bulk of research dollars are going to fund optimizations in the current architecture.
That's a blind spot because what we may actually need is a fundamentally new architecture to unlock the interstellar travel for the mind.
While there's some promising startups trying to do that, it is a pretty capital intensive game.
It's not for the faint of heart.
And so what we do risk, I think, as an industry, is hitting a point where scaling laws don't hold.
Our current architectures actually do plateau.
And then we are back at the kind of slowdown that we had with AI for the past three winters over the last three decades where many of the loss curves or the ability for models to predict reality basically hit walls.
They started off being super promising and then they hit a wall.
So far, signs are that's not happening, but if there's one blind spot in the future of
computing, it's that our current architectures are insufficient and that we haven't invested
enough in alternative backups to overtake those.
Now, I'm optimistic that this time around, there's sufficient excitement in the ecosystem,
from everybody: the hardware providers at the compute level like Nvidia, the cloud providers and startups, and ultimately investors like us who are really excited for new architectures.
And so, you know, if there are people out there who are tinkering and experimenting with architectures that unlock new interfaces, unlock the next phase in computing.
Yeah, that's what we're here to fund.
I just wish more people were working on those.
Thanks for listening, everyone.
I thought that was a super insightful discussion, and I hope you did too.
We're just getting warmed up and we have more episodes to come shortly.
But in the meantime, feel free to rate the show and let us know what you think so far.
Thank you.