The a16z Show - From Promise to Reality: Inside a16z's Data and AI Forum
Episode Date: May 2, 2023

Nvidia’s CEO Jensen Huang declared in a recent keynote, “we are in the iPhone moment of AI.” This special episode will give you an inside look into a16z’s Data and AI Forum, hosted the day GPT-4 came out, featuring many of the most influential builders in the space – from the companies building foundational models like OpenAI to those building the underlying infrastructure like AWS.

Resources:
Check out CoactiveAI: https://coactive.ai/
Check out CharacterAI: http://character.ai/
Check out Hex: https://hex.tech/
Follow Cody on Twitter: https://twitter.com/codyaustun
Follow Myle on Twitter: https://twitter.com/myleott
Follow Barry on Twitter: https://twitter.com/barrald

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. For more details please see a16z.com/disclosures.
Transcript
Hi everyone. Welcome back to the a16z podcast. This is your host, Steph Smith, and today we continue the conversation around AI.
Now, this conversation is ongoing, but today we have exclusive footage from our Data and AI Forum held last month.
So what we've done is we went through all of the presentations from that forum, and we've aggregated what we think are the best segments delivered right to your ears.
As a reminder, the content here is for informational purposes only. None of the following is investment,
business, legal, or tax advice,
and please note that a16z and its affiliates
may maintain investments in the companies discussed
in this podcast. Please see a16z.com/disclosures
for more important information,
including a link to a list of our investments.
So the promises of AI have echoed for decades,
starting as early as the 50s
when artificial intelligence blossomed into its own discipline.
And since then, the field has held so much promise.
Although AI skeptics would claim
that the technology has fallen short of expectations.
That is, of course, until recently.
A string of technology unlocks, from the deep learning renaissance around 2012,
the "Attention Is All You Need" paper in 2017,
and compute moving down the cost curve,
have resulted in numerous applications now in the hands of the masses,
including, of course, ChatGPT, which reached 100 million users this January.
As Nvidia's CEO Jensen Huang declared in a recent keynote,
"we are in the iPhone moment of AI."
In a matter of months, this technology has gone from what seemed like a distant promise
to the everyday internet user leveraging ChatGPT to write emails,
Midjourney to generate images for their next presentation,
or Runway ML to edit their videos.
We've become so accustomed to the rapid pace of innovation,
we may even forget that DALL-E 2 was teased just
one year ago, or the sheer sense of wonder that some of these early experiences brought.
As Arthur C. Clarke said, "any sufficiently advanced technology is indistinguishable from magic."
Well, the magic continues. Just a few weeks ago, on the precise date that GPT-4 was released,
a16z held a Data and AI Forum, featuring many of the most influential builders in the space,
from the companies building the foundational models like OpenAI, to those building the
underlying infrastructure like AWS. And in today's episode, you'll get a sneak peek into a few
of the most important conversations from the forum. So to kick things off, one of the most
interesting unlocks that AI surfaces is the ability to interface with unstructured data,
something that the average internet user happens to produce a lot of, from the pictures we take to
the emails we send. Here, Cody Coleman, CEO and co-founder of Coactive, a platform that helps
make sense of your unstructured image and video data, talks about the potential to leverage
the vast amount of data already being collected by companies who are already paying a small
fortune to retain it. You might even call this a data dividend.
Right now, 80% of internet traffic
is unstructured video data. And you can see this when you think about the number of Zoom
meetings that are on your calendar, or the rise of e-commerce, where we're making purchasing decisions
based off of images and videos.
And if you have kids, they're probably addicted to TikTok or Instagram.
And it's not just internet traffic.
It's projected that 80% of worldwide data
will be unstructured data.
So that's things such as video and audio.
85% of all worldwide data will be unstructured data by 2025.
So it's a massive, it's the dominant form,
the vast majority of data that's out there in the world,
and it actually influences not just the content that we consume,
but also the products that we buy.
So nearly 80% of people say that user-generated content, UGC,
impacts their decisions to purchase.
And this is just the beginning.
You can already see that the barrier to creating content,
whether it be text or visual, dropped dramatically.
And that means that there's going to be a waterfall or a fire hose
of the amount of content that can be generated.
So it's not a question anymore of, you know,
if or when content is going to be king,
content is king right now
and it's affecting and capturing every part of our life.
Now, the question that we should be asking is,
what type of king is your content?
So we all want the legendary content
where you have a good king.
You know, you have the right content
and it will lift your sales and engagement
by delivering the right content to the right person at the right time.
You can also use all this content that you might be creating in-house
or collecting from your users
to actually answer questions about customer behavior
and illuminate trends more broadly.
That's the ideal picture.
Unfortunately, most businesses can't actually realize that version
of their content and actually derive that much value out of it.
Instead, we see a lot of people
with maybe a little lazier king.
You know, when you think about unstructured data and content,
a lot of it sits underutilized on cloud storage right now,
so in S3 or Google Cloud Storage,
for serving or it's archived in backups.
And basically, this just costs you a small fortune
just because of the sheer volume when we think about image,
audio, and video data.
And that's basically just taxing your organizations,
taxing your business,
just to store this data,
to keep it there kind of archival, which is really expensive.
But things can get even worse when we think about what bad content looks like.
Bad content exposes your businesses and organizations to a wide variety of risks.
So one, there's the risk of violating user privacy.
When we think about this unstructured data, when we think about text, when we think about video,
when we think about audio, the additional context that it captures is so personal
and it potentially exposes you to the risk of violating privacy there.
Also, you might not have control over all the content that is shared on your platform
or kind of put out there.
And that risks corrupting the safety of the online communities that we are a part of.
And that, in turn, can end up meaning that we lose trust in those platforms that provide
those online communities or the brands or things like that that are talked about.
If we actually don't have the right content and the right message being delivered
with the content in our businesses.
Luckily, we found the key.
We found the way that we can actually make this data useful,
and that's with AI.
The piece that actually makes content king is actually AI.
By being able to actually process and understand this data at scale,
which before this moment really wasn't possible.
It was a very hard manual process
to actually go through all this content and understand it.
But thanks to the work of folks like OpenAI,
we can now actually understand, in general,
and appreciate this content.
And I've seen that value
and how that can generate so much value
firsthand from my own experiences.
So I've done my PhD at Stanford
at the intersection of machine learning and systems.
I've worked as a data scientist in industry
ranging from finance to education
to big tech companies like Pinterest and Meta.
And I've seen just how much,
how they can leverage AI to actually make their content
better to improve ads,
to improve recommendation, to improve search,
and so many other kind of vital and critical use cases to businesses.
But on the other side, you know, from being hands-on and from doing my PhD in this,
I know that it is really, really difficult to get it right.
There's like no such thing as a free lunch.
And I think that there are kind of four main challenges that prevent organizations and businesses
from really being able to unlock this data right now.
And the first is one of just scale.
When we think about unstructured text and visual data,
it's orders of magnitude larger than today's big data.
So to put that into perspective, if we were to think about tabular data,
so we had 10 million rows of tabular data, that's around 40 megabytes.
And to put that into perspective,
we can think of that as being like all of the water
and all of the area of Lake Tahoe in California,
which is around 496 square kilometers.
If we were to think about 10 million documents,
text documents, we go from 40 megabytes to 40 gigabytes.
And now we have something that's more on the scale of the Caspian Sea.
So 371,000 square kilometers of space, when we put that in perspective and scaling it up.
It's three orders of magnitude more data in terms of volume than when we think about tabular data.
And then when we think about visual data, if we had 10 million images, that would be 20 terabytes
of data. That's another three orders of magnitude bigger, and that's like the Pacific Ocean,
when we think about it, in terms of just the sheer scale of data that that is in terms of volume.
The Pacific Ocean is 168 million square kilometers. Now, right now, when we think about kind of big
data and our data lakes, we have kind of these tools and vehicles that kind of can process
that efficiently. But that's
kind of like having a rowboat or a canoe.
You know, it'll get you across the lake,
but I wouldn't trust that if you were trying
to cross the Pacific Ocean.
So in order to actually be able to unlock
kind of the value from this richer,
kind of more context that we get in content,
we actually need to create kind of tools
and infrastructure in order to process that.
Now it's gonna be probably a similar shape
in terms of like, just like how a seagoing boat
looks somewhat similar to a rowboat,
but the scale and the processing of it
will just have to be kind of completely different.
And we'll need to prepare ourselves just for the fact of the sheer volume and scale that we're thinking about
when we move from a tabular view of the world to more of a content view of the world.
Given how much data is being created every single minute,
you can imagine all the new infrastructure opportunities and challenges there will be in order to make use of it.
But you also may wonder, if we're collecting and processing so much data, will we ever run out,
both in terms of our ability to store it, but also to continuously upgrade new models,
with new data.
Here's a16z general partner Sarah Wang asking that question to Myle Ott,
a longtime AI researcher previously leading the LLM efforts at Facebook
and now part of the Character.AI founding team,
an AI platform seeking to give consumers access to personalized AI systems.
And fun fact, one of the other founders of Character
was one of the authors of the "Attention Is All You Need" paper from 2017,
a truly foundational piece of research
underpinning many of the AI advancements since.
There's sort of this question around
are we running out of data?
And I think what's really interesting for this room
is that there are a bunch of execs here
with access to a ton of proprietary data, right?
So this question may not pertain to that as much,
although I think it'd be interesting to loop that in.
But there's sort of this question of,
as these models get bigger,
they ingest more data,
are we actually running out of publicly available
static web data. And what do we do about that? How do you guys think about that at Character?
And how has that informed the approach that you've taken?
Yeah, it's a good question. I think, so obviously
most of the kind of AI systems that are being trained today are trained on these public
data sets, right? So, you know, mostly kind of data crawled from the web. I think there's actually
still like a decent amount of public data available. I think, you know, even if we're kind of
reaching the limit, say, of text, I think there's other modalities that, you know, folks are starting
to explore: audio, video, images.
I think there's a lot of really rich data sources out there
still on the web.
I think there's then, I don't know the exact magnitudes,
but I imagine roughly similar scale
of private data sets out there, right?
And I think that's gonna be really important
in certain applications.
You know, I imagine if you have a code generation system,
it's great that it's trained on all of public GitHub,
but it might be even more useful if it's trained
on my own code base, right, on my private code base.
So I think figuring out like how to blend
these public and private datasets
is going to be really interesting.
And I think it's going to open up a whole bunch of new applications, too.
From Character's perspective, and I guess more generally,
one of the things that we're starting to see that is pretty exciting,
is this move from, you know, you could call it like static data sets,
but data that kind of exists already out there,
independent of AI systems.
We're moving now, I think, towards data sets that
are being built with AI in the loop, right?
And so you have, you know, what people often refer to as these data flywheels,
but you basically can imagine, say, for Character,
we have all these rich interactions where a character is having a conversation with someone,
and we get feedback on that conversation from the user, either explicitly or implicitly.
And that's really like the perfect data to use to make that AI system better, right?
And so we have these loops that I think are going to be really kind of exciting
and provide both richer and perhaps much larger data sources for the sort of next generation of systems.
Yeah, very exciting.
Well, I think we've been talking a little bit about the future, but I actually want to bring us back, since you've been working in large language models for quite some time now, getting a little bit of a history lesson from you would actually be very interesting.
And I think even though Michael had listed a long list of accomplishments and things that you'd worked on, it was still, frankly, in my view, very humble.
And I think one of the most significant contributions of yours is the development of the RoBERTa model.
And rather than hearing me define it, if you could take us back to, I believe it's 2019,
what the state of AI looked like back then, with LLMs,
and maybe just bring us forward to today
as a lot has changed, to your point, in the course of four years.
Yeah.
So RoBERTa, you know, I think as I kind of mentioned earlier,
when I was in the research group at Facebook,
a lot of my focus was on trying to build kind of larger-scale engineering systems.
But a lot of that actually started with translation systems.
So obviously, machine translation, automatically translating between different languages.
It's like a hugely important problem at Facebook.
It runs in production.
And one of the highest leverage ways we found to make those systems better
was to train them on more data and with more compute.
And I think in some ways that sounds like an obvious idea now.
But I think actually back then it was somewhat controversial.
And I think there's almost this kind of perception that like in order to make big advances in AI,
we were going to need really big algorithmic breakthroughs.
I think it was kind of underappreciated how far you could get by just increasing the amount of data,
improving the data quality, and scaling up the amount of compute.
I think in late 2018, Google came out with something called BERT,
which is also a transformer model, but used a slightly kind of clever training objective
and got kind of state-of-the-art performance in all of these natural language understanding tasks, right?
So making classifications about particular text input or something.
And RoBERTa was really kind of taking BERT and scaling it up, right?
And I think we trained it on something like 10x more data and with a lot more compute.
And, you know, what we found is that there was this big algorithmic jump from kind of the stuff
before BERT going to BERT and then an almost equally sized step by just scaling it up, right?
And so I think that has been really, in many ways, the story of the last few years, too,
is that by scaling up these systems, there's actually really, really
substantial gains, like qualitatively different behavior and performance that we can get,
accuracy that we can get out of these models. So I think that's like a really kind of
fruitful direction to explore. And I think there's probably still more to explore there going
forward.
Yeah, absolutely. I mean, it's fascinating because I think that relationship today,
we almost take for granted.
If we extrapolate that relationship, more data equals more powerful
models. And as these models do become more powerful, the reaction of many is to question whether
this technology will take our jobs.
Or an even further extrapolation,
whether it'll make us as humans completely obsolete.
And while people often cite games like chess, Go, or StarCraft,
as examples of where bots have definitively beaten humans,
there's actually another story that can be told.
For one, people are still playing chess
decades after the infamous 1997 match between Deep Blue and Kasparov.
In fact, you could argue that chess is more popular than ever.
Here is Barry McCardel, founder of Hex, a data science platform that's integrating AI,
illuminating how the story is much more dynamic than human versus bot,
and how there's a much more helpful lens of what we can achieve together.
In 1997, Deep Blue, the IBM chess bot, beat Garry Kasparov in this very famous televised game.
Here's a photo of our grandmaster holding his head in his hands.
And this was a really seminal moment in AI research,
and it inspired a whole generation of computer nerds, myself included.
And it also spawned a ton of headlines about AI taking over and the end of humans.
While I was researching this, I found one article
that had a big picture of a Terminator killer robot on it.
Well, it's been a while, and it didn't quite work out that way.
20 years later, in 2017, a robot kicked our ass at Go.
Here's our human champion, also holding his head in his hands.
Apparently, this is the universal surrender pose
when you have been defeated by a computer in a game.
And Go is a famously sophisticated and nuanced game.
So this was a really big deal.
And once again, it spawned all these articles
about the end of humans and all this stuff.
A couple years after that, the same research lab
developed a model that could beat humans in StarCraft.
Here we go.
Another photo of a human surrendering by holding his head in his hands.
I don't know about y'all.
I played hundreds of hours of StarCraft
in high school and college. So this was a really big deal for me. And it was also a really big deal
because StarCraft is famously complex. You have imperfect information, multiple races, units,
you have to balance scouting and resource collection and all-out combat. It's great. And that
an AI could play it at a human level was really, really impressive. And once again, it spawned all of
these articles about humans. And with the twist that this time we had trained a bot to engage us
in space combat, which I think seems especially alarming. So that's three out of three.
Oh, we have a bad record, right?
Computers are beating us.
They're superior to humans.
We are on our way to becoming obsolete.
But as it turns out, it's a little more complicated than that.
And there is something that actually does a better job than a human alone or a computer
alone, and that is a human with a computer.
And when you look at these games again through that lens, you actually get some different
and more nuanced results.
So let's go back to chess.
A few years after that game, or
actually many years after the game,
Garry Kasparov organized a tournament
where humans could play with computer support.
And there were some really highly ranked grandmasters playing,
and at the time the cutting-edge chess AIs that had been developed.
But there were two amateurs who swept the whole field,
and they did this not by having better human chess skills,
and not by having a better chess bot.
They had developed a model that they had programmed to be able to work with.
They were working in tandem with it.
It was effectively inferior humans and an inferior model,
but they had found a way to work together
to beat superior humans and superior models.
The same exact thing just happened in Go a few weeks ago.
I don't know if folks caught the headlines for this,
but there was an amateur, it's always the amateurs, right,
who developed a model that he was able to work with
that understood and studied the weaknesses in the leading Go bot
and was able to defeat it.
And it wasn't, again, that this amateur had like a better Go model
that he had developed.
He had figured out a way to partner
together with this AI to enhance their performance.
And in StarCraft, it's still not the case
that AIs are routinely able to beat our human professional players.
In fact, a big reason for that is because human pros
now are developing and relying on techniques
that were first pioneered by bots.
We're using these models to understand strategies
that humans can then uniquely go and execute against.
And so three cases in a row of AI actually elevating,
not eliminating human performance.
Humans are better because of AI.
We're able to work with AI to improve.
I think it's also worth mentioning all of these games
are as popular as ever.
Humans clearly still enjoy this,
even though there's an AI that might be better
than them as individuals.
So this is an example of something that sounds a little sci-fi,
but it's called Human Computer Symbiosis.
This was first proposed by this guy, J.C.R. Licklider in 1960.
And he has this awesome paper.
And for something written like 63 years ago, it really holds up.
And he has this quote that I have drawn inspiration from for quite a while,
which is, computers can do the routinizable work
to prepare the way for insights and decisions
in technical and scientific thinking.
This was 63 years ago, and I think this is exactly what we're seeing happen now.
It's not about computers replacing humans.
It's about them working together cooperatively to solve a problem.
And I think this is the next step for AI.
I think it's the next step for humans.
You can have the computer doing the routine tedious work so humans can do the creative, interesting stuff.
We're a room of humans.
Our most fulfilled amazing days as humans are the days that we are spending doing creative and interesting work and not doing the tedious drudgery stuff.
And I think AI is here to help us achieve that state of fulfillment.
Now, I'm going to bring this into the domain that I think a lot about.
I've been working in data, data science, data analytics my whole career.
I am now the founder and CEO of a company that builds a data science and analytics tool.
And our product is used by thousands of data practitioners every day.
And we see them do some really creative, interesting stuff.
I think data practitioners are creatives.
I know it's not the first thing that comes to mind when I say creative.
You think of artists or whatever.
But if you think about what data scientists do in their day,
they're asking questions, they're forming hypotheses,
they're testing new things.
They're building narratives.
They're taking risks.
They're telling stories.
This is good data science.
It's good data analytics.
And it's what we expect from our data teams.
And it's an art and a science and a great use of human time.
But data work can also be really tedious.
Spend a lot of time writing boilerplate and fixing dependencies
and tracking down missing parentheses in a query.
It can be more plumbing than science sometimes.
And this is where I think people wind up spending a lot of their time
and really is a blocker to them doing their best work.
And so this really feels like a perfect opportunity
to bring human computer symbiosis
into this creative profession.
Now, when most people, when they think of this,
they assume it means kind of just replacing data teams
with a magic insights text box.
Like the next step is we'll all buy solutions
that then our stakeholders or executives
will come in, they'll write a question,
it'll give them a magic response back,
properly formatted charts,
and well-reasoned explanations and full business context.
But that doesn't really work.
And it doesn't work, one, because these models aren't perfect.
They can hallucinate,
They're missing a lot of context.
They don't understand the full situation of things.
But also that humans want to be able to hear a story and understand and ask and answer
questions of a human around these things.
And so we actually tried this.
At Hex, we had built a UI that was really sort of a little more black box.
You type a question, we'd bring you an answer back.
And you got pretty good results, but it was missing the human element.
And we learned the same lesson, the same thing J.C.R. Licklider posited, the same thing
we learned through all these games,
that for now at least, the best approach
is one where humans and computers can work together
to elevate performance.
And so the features we launched in our product last month
were built around these principles,
and I think there's a lot of takeaways here.
We built these features, they're called Hex Magic,
and they're built directly into the UI
that thousands of data scientists and data analysts
already use every day.
They bring the powerful large language models,
the latest models from OpenAI integrated directly in our product.
And you can ask it to do all sorts of things,
from writing queries to building visualizations.
Or my personal favorite is called Magic Fix: when you have an error in your code,
It will automatically detect and fix it.
And as someone who has more and more errors in my code every day, that is a very useful thing.
But the key thing here and the thing we really realize is that the thing that we are in the business of doing is to enhance and benefit humans.
It's to work with humans, not replace them.
We've found that we can elevate and accelerate human intuition.
And that's what our users tell us.
We had a user tell us they can spend more of their time
doing the creative, interesting part of their job,
and less time doing the tedious plumbing.
And that is so exciting to me
because I think that is a little beginning.
It's a foothill of the ultimate value that AI can provide in our lives.
It's human-computer symbiosis in action.
All right.
That is all for these exclusive segments from our data and AI forum.
Hopefully, that gets your wheels spinning
and illuminates how much opportunity there still is to build here.
We've got lots more AI coverage to come,
as this field moves very quickly,
but for now we'd encourage you to go check out the companies that participated here.
So that's coactive.ai, character.ai, and hex.tech.
We'll include all of that in the show notes,
but I also wanted to call out if you like these kinds of episodes,
this one being a compilation episode, please let us know.
You can always email us at podpitches@a16z.com,
and if you haven't noticed already, we're doing a lot of testing here in format, ideas,
and guests. So if you like something, if you hate something, if there's certain topics you'd
like to see more or less of, different guests you'd like to see on the podcast, please do let us
know. We love hearing your feedback and thank you so much for listening. Thanks for listening
to the a16z podcast. If you like this episode, don't forget to subscribe, leave a review, or tell a friend.
We also recently launched on YouTube at YouTube.com/a16z_video, where you'll find exclusive
video content. I'll see you next time.
