a16z Podcast - From Promise to Reality: Inside a16z's Data and AI Forum
Episode Date: May 2, 2023

Nvidia's CEO Jensen Huang declared in a recent keynote, "we are in the iPhone moment of AI." This special episode will give you an inside look into a16z's Data and AI Forum, hosted the day GPT-4 came out, featuring many of the most influential builders in the space – from the companies building foundational models like OpenAI to those building the underlying infrastructure like AWS.

Resources:
Check out CoactiveAI: https://coactive.ai/
Check out CharacterAI: http://character.ai/
Check out Hex: https://hex.tech/
Follow Cody on Twitter: https://twitter.com/codyaustun
Follow Myle on Twitter: https://twitter.com/myleott
Follow Barry on Twitter: https://twitter.com/barrald

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. For more details please see a16z.com/disclosures.
Transcript
Hi everyone. Welcome back to the A16Z podcast. This is your host, Steph Smith, and today we continue the conversation around AI.
Now, this conversation is ongoing, but today we have exclusive footage from our data and AI forum held last month.
So what we've done is we went through all of the presentations from that forum, and we've aggregated what we think are the best segments delivered right to your ears.
As a reminder, the content here is for informational purposes only. None of the
following is investment, business, legal, or tax advice, and please note that a16z and its affiliates
may maintain investments in the companies discussed in this podcast. Please see a16z.com
slash disclosures for more important information, including a link to a list of our investments.
So the promises of AI have echoed for decades, starting as early as the 50s,
when artificial intelligence blossomed into its own discipline.
And since then, the field has held so much promise,
although AI skeptics would claim the technology has fallen short of expectations.
That is, of course, until recently.
A string of technology unlocks from the deep learning renaissance around 2012,
the "Attention Is All You Need" paper in 2017,
and compute moving down the cost curve
have resulted in numerous applications now in the hands of the masses,
including, of course, ChatGPT, which reached 100 million users this January.
As Nvidia's CEO, Jensen Huang, declared in a recent keynote,
we are in the iPhone moment of AI.
In a matter of months, this technology has gone from what seemed like a distant promise
to the everyday internet user leveraging ChatGPT to write emails,
Midjourney to generate images for their next presentation,
or Runway ML to edit their videos.
We've become so accustomed to the rapid pace of innovation,
we may even forget that DALL-E 2 was teased just one year ago,
or the sheer sense of wonder that some of these early experiences brought.
As Arthur C. Clarke said, any sufficiently advanced technology is indistinguishable from magic.
Well, the magic continues.
Just a few weeks ago, on the precise date that GPT-4 was released,
a16z held a data and AI forum, featuring many of the most
influential builders in the space, from the companies building the foundational models like
OpenAI to those building the underlying infrastructure like AWS. And in today's episode,
you'll get a sneak peek into a few of the most important conversations from the forum. So to kick
things off, one of the most interesting unlocks that AI services is the ability to interface with
unstructured data, something that the average internet user happens to produce a lot of from the
pictures we take to the emails we send. Here, Cody Coleman, CEO and co-founder of Coactive,
a platform that helps make sense of your unstructured image and video data, talks about the
potential to leverage the vast amount of data already being collected by companies who are
already paying a small fortune to retain it. You might even call this a data dividend.
Right now, 80% of internet traffic is unstructured video data. And you can see this when you think
about the number of Zoom meetings that are on your calendar, or the rise of e-commerce, where
we're making purchasing decisions based off of images and videos. And if you have kids,
they're probably addicted to TikTok or Instagram. And it's not just internet traffic. It's projected
that 80% of worldwide data will be unstructured data. So that's things such as video and
audio, and 85% of all worldwide data will be unstructured by 2025. So it's
massive; it's the dominant form of data that's out there in the world,
and it actually influences not just the content that we consume, but also the products that we buy.
So nearly 80% of people say that user-generated content, UGC, impacts their decisions to purchase.
And this is just the beginning.
You can already see that the barrier to creating content, whether it be text or visual, has dropped
dramatically. And that means that there's going to be a waterfall, or a fire hose, of the amount
of content that can be generated. So it's not a question anymore of, you know, if or when content
is going to be king. Content is king right now and is affecting and capturing every part of our life.
Now, the question that we should be asking is, what type of king is your content? So we all want
the legendary content where you have a good king. You know, you have the right content and it'll
lift your sales and engagement by delivering the right content to the right person at the right
time. You can also use all this content that you might be creating in-house or collecting
from your users to actually answer questions about customer behavior and illuminate trends more
broadly. That's the ideal picture. Unfortunately, most businesses can't actually realize that
version of their content and actually derive that much value out of it. Instead, we see a lot of
people with maybe a lazier king. You know, when you think about unstructured data and
content, a lot of it sits underutilized on cloud storage right now. So in S3 or Google Cloud Storage
for serving, or it's archived in backups. And basically, this just costs you a small fortune
just because of the sheer volume
when we think about image, audio, and video data.
And that's basically just taxing your organizations,
taxing your business,
just to store this data, just to keep it there,
kind of archival, which is really expensive.
But things can get even worse
when we think about what bad content looks like.
Bad content exposes your businesses and organizations
to a wide variety of risks.
So one, there's a risk of violating user privacy.
When we think about this unstructured data,
when we think about text, when we think about video, when we think about audio,
the additional context that it captures is so personal,
and it potentially exposes you to the risk of violating privacy there.
Also, you might not have control over all the content that is shared on your platform
or kind of put out there.
And that risks corrupting the safety of the online communities that we are a part of.
And that, in turn, can end up meaning that we lose trust in those platforms that provide
those online communities, or the brands and things like that that are talked about,
if we actually don't have the right content and the right message being delivered with the
content in our businesses.
Luckily, we found the key.
We found the way that we can actually make this data useful, and that's with AI.
The piece that actually makes content king is actually AI.
By being able to actually process and understand this data at scale, which before this
moment really wasn't possible.
It was a very hard manual process to actually go
through all this content and understand it. But thanks to the work of folks like OpenAI, we can now
actually understand and ingest and appreciate this content. And I've seen that value and how that
can generate so much value firsthand from my own experiences. So I did my PhD at Stanford at the
intersection of machine learning and systems. I've worked as a data scientist in industry, ranging from
finance to education to big tech companies like Pinterest and Meta. And I've seen just how much
they can leverage AI to actually make their content better,
to improve ads, to improve recommendation, to improve search,
and so many other kind of vital and critical use cases to businesses.
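To make that concrete, here is a minimal sketch, assuming the Hugging Face transformers library and OpenAI's public CLIP checkpoint (neither of which is named in the talk), of how an off-the-shelf model can score an unstructured image against plain-text labels. The file name and label set are purely illustrative.

```python
# Hedged sketch: score an image against text labels with OpenAI's CLIP,
# via the Hugging Face `transformers` library. Model choice, file name,
# and labels are illustrative assumptions, not Coactive's actual stack.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_asset.jpg")  # any image sitting in object storage
labels = ["beach vacation", "office meeting", "product shot", "pet video still"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```

Embeddings and similarity scores like these are what let a system tag, search, or cluster millions of images and video frames without hand labeling each one.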
But on the other side, from being hands-on and from doing my PhD in this,
I know that it is really, really difficult to get it right.
There's like no such thing as a free lunch.
And I think that there's kind of four main challenges
that prevent organizations and businesses from really being able to unlock this
data right now. And the first is one of just scale. When we think about unstructured text and
visual data, it's orders of magnitude larger than today's big data. So to put that into perspective,
if we were to think about tabular data, so we had 10 million rows of tabular data, that's around
40 megabytes. And to put that into perspective, we can think of that as being like all of the
water and all the area of Lake Tahoe in California, which is around 496 square kilometers.
If we were to think about 10 million documents, text documents, we go from 40 megabytes to 40
gigabytes. And now we have something that's more on the scale of the Caspian Sea, so 371,000
square kilometers of space when we put that in perspective and scale it up. It's three orders
of magnitude more data in terms of volume than when we think about tabular data.
And then when we think about visual data, if we had 10 million images, that would be 20
terabytes of data. That's another three orders of magnitude bigger. And that's like the Pacific
Ocean when we think about it in terms of just the sheer scale of data that that is in terms of
volume. The Pacific Ocean is 168 million square kilometers. Now, right now, when we think about
kind of big data and our data lakes, we have kind of these tools and vehicles that kind of can
process that efficiently. But that's kind of like having a rowboat or canoe. You know, it'll get
you across the lake, but I wouldn't trust that if you were trying to cross the Pacific Ocean.
So in order to actually be able to unlock kind of the value from this richer, kind of more
context that we get in content, we actually need to create kind of tools and infrastructure
in order to process that.
Now, it's going to be probably a similar shape in terms of like, just like how a sea-going
boat looks somewhat similar to a rowboat, but the scale and the processing of it will just
have to be kind of completely different.
And we'll need to prepare ourselves just for the fact of the sheer volume and scale that
we're thinking about when we move from a tabular view of the world.
to more of a content view of the world.
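As a rough companion to the rowboat-versus-ocean analogy, here is a minimal back-of-envelope sketch using the round figures quoted above (roughly 40 MB for 10 million tabular rows, 40 GB for 10 million text documents, 20 TB for 10 million images); the point is the orders-of-magnitude jump, not the exact byte counts.

```python
# Back-of-envelope sketch using the round numbers quoted in the talk.
import math

sizes_in_bytes = {
    "10M tabular rows":   40 * 10**6,   # ~40 MB
    "10M text documents": 40 * 10**9,   # ~40 GB
    "10M images":         20 * 10**12,  # ~20 TB
}

baseline = sizes_in_bytes["10M tabular rows"]
for name, size in sizes_in_bytes.items():
    orders = math.log10(size / baseline)
    print(f"{name}: {size:,} bytes (~10^{orders:.1f} x the tabular baseline)")
```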
Given how much data is being created every single minute,
you can imagine all the new infrastructure opportunities and challenges
there will be in order to make use of it.
But you also may wonder, if we're collecting and processing so much data,
will we ever run out, both in terms of our ability to store it,
but also to continuously update models with new data.
Here's a16z's general partner, Sarah Wang,
asking that question to Myle Ott, a longtime AI researcher previously leading the LLM efforts at Facebook
and now part of the Character.AI founding team, an AI platform seeking to give consumers access
to personalized AI systems. And fun fact, one of the other founders of Character
was one of the authors of the "Attention Is All You Need" paper from 2017, a truly foundational
piece of research, underpinning many of the AI advancements since.
There's sort of this question around, are we running out of data?
And I think what's really interesting for this room is that there are a bunch of execs here
with access to a ton of proprietary data, right?
So this question may not pertain to that as much, although I think it'd be interesting to loop that in.
But there's sort of this question of, you know, as these models get bigger, they ingest more data,
are we actually running out of publicly available static web data?
And what do we do about that?
How do you guys think about that at Character, and how has that informed the approach that you've taken?
Yeah, it's a good question. I think, so obviously, most of the kind of AI systems that are being
trained today are trained on these public data sets, right? So, you know, mostly kind of data crawled from
the web. I think there's actually still like a decent amount of public data available. I think,
you know, even if we're kind of reaching the limits, say, of text, I think there's other
modalities that, you know, folks are starting to explore audio, video, images. I think there's a lot
of really rich data sources out there still on the web. I think there's then, I don't know,
of the exact magnitudes, but I imagine, you know, roughly similar scale of private
data sets out there, right? And I think that's going to be really important in certain
applications. You know, I imagine if you have a code generation system, it's great that it's
trained on all of public GitHub, but, you know, it might be even more useful if it's trained
on my own code base, right, on my private code base. So I think figuring out, like, how
to blend these public and private data sets is going to be really interesting. And I think
it's going to open up a whole bunch of new applications, too. From Character's perspective,
and I guess more generally, one of the things that we're starting to see that is pretty exciting is this move from, you know, you could call it like static data sets, but data that kind of exists already out there independent of AI systems.
We're moving now, I think, towards data sets that are being built with AI in the loop, right?
And so you have, you know, people often refer to these as data flywheels, but you basically can imagine, say, for Character, we have all these rich interactions where Character is having a conversation with
someone, and we get feedback on that conversation from the user, either explicitly or
implicitly, and that's really like the perfect data to use to make that AI system better, right?
And so we have these loops that I think are going to be really kind of exciting and provide
both richer and perhaps much larger data sources for the sort of next generation of systems.
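For a concrete picture of that flywheel, here is a hypothetical sketch in which each model interaction is logged alongside explicit or implicit user feedback, and the positively rated interactions are kept as candidates for the next round of training. The class and field names are illustrative, not Character.AI's actual pipeline.

```python
# Hypothetical sketch of a "data flywheel": log interactions plus feedback,
# then harvest the well-rated ones as future training examples.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InteractionRecord:
    prompt: str                              # what the user said
    response: str                            # what the model replied
    explicit_rating: Optional[int] = None    # e.g. thumbs up (+1) / down (-1)
    implicit_signal: Optional[float] = None  # e.g. did the user keep chatting?

@dataclass
class FlywheelBuffer:
    records: List[InteractionRecord] = field(default_factory=list)

    def log(self, record: InteractionRecord) -> None:
        self.records.append(record)

    def training_candidates(self) -> List[InteractionRecord]:
        # Keep interactions with positive explicit or implicit feedback.
        return [
            r for r in self.records
            if (r.explicit_rating or 0) > 0 or (r.implicit_signal or 0.0) > 0.5
        ]

buffer = FlywheelBuffer()
buffer.log(InteractionRecord("Tell me a story", "Once upon a time...", explicit_rating=1))
print(len(buffer.training_candidates()))  # -> 1
```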
Yeah, very exciting.
Well, I think we've been talking a little bit about the future, but I actually want to bring us
back. Since you've been working in large language models for quite some time now, getting a little
bit of a history lesson from you would actually be very interesting. And I think even though Michael
had listed a long list of accomplishments and things that you'd worked on, it was still, you know,
frankly, in my view, very humble. And I think one of the most significant contributions of yours is
the development of the RoBERTa model. And rather than hearing me define it, could you take us back to,
I believe it's 2019, what the state of AI looked like back then, LLMs, and maybe
just bring us forward to today, as a lot has changed, to your point, in the course of four years.
Yeah. So RoBERTa, you know, I think as I kind of mentioned earlier, when I was in the research
group at Facebook, a lot of my focus was on trying to build kind of larger scale engineering
systems. But a lot of that actually started with translation systems. So obviously machine translation,
automatically translating between different languages. It's like a hugely important problem
at Facebook. It runs in production. And one of the kind of highest leverage ways we found to make
those systems better was to train them on more data and with more compute. And, you know, I think
in some ways that sounds like an obvious idea now, but I think actually back then it was somewhat
controversial. And I think there's almost this kind of perception that, like, in order to make
big advances in AI, we were going to need really big algorithmic breakthroughs. And I think it was
kind of underappreciated how far you could get by just increasing the amount of data,
improving the data quality, and scaling up the amount of compute.
I think in late 2018, Google came out with something called BERT, which is also a transformer model,
but used a slightly cleverer training objective and got kind of state-of-the-art performance
in all of these natural language understanding tasks, right?
So making classifications about particular text input or something.
And RoBERTa was really kind of taking BERT and scaling it up, right?
I think we trained it on something like 10x more data and with a lot more compute.
And, you know, what we found is that there was this big algorithmic jump from kind of the stuff before BERT going to BERT, and then an almost equally sized step by just scaling it up, right?
And so I think that has been really, in many ways, the story of the last few years, too, is that by scaling up these systems, there's actually really substantial gains, like qualitatively different behavior and performance that we can get, accuracy that we can get out of these models.
So I think that's like a really kind of fruitful direction to explore,
and I think there's probably still more to explore there going forward.
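For reference, here is a minimal sketch of the masked-language-modeling objective that BERT introduced and RoBERTa scaled up, written with the Hugging Face transformers library and the public roberta-base checkpoint; the tooling is an assumption on our part, not something mentioned in the conversation.

```python
# Hedged sketch: RoBERTa's masked-language-modeling objective in action,
# using the Hugging Face `transformers` pipeline (an assumed tool choice).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its mask token; the model predicts the hidden word.
for prediction in fill_mask("The capital of France is <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```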
Yeah, absolutely.
I mean, it's fascinating because I think that relationship today we almost take for granted.
If we extrapolate that relationship, more data equals more powerful models.
And as these models do become more powerful,
the reaction of many is to question whether this technology will take our jobs.
Or an even further extrapolation, whether it'll make us as humans
completely obsolete. And while people often cite games like chess, Go, or StarCraft,
as examples of where bots have definitively beat humans, there's actually another story that
can be told. For one, people are still playing chess decades after the infamous 1997 match
between Deep Blue and Kasparov. In fact, you could argue that chess is more popular than ever.
Here is Barry McCardel, founder of Hex, a data science platform that's integrating AI,
illuminating how the story is much more dynamic than human versus bot,
and how there's a much more helpful lens of what we can achieve together.
In 1997, Deep Blue, the IBM chess bot, beat Garry Kasparov in this very famous televised game.
Here's a photo of our grandmaster holding his head in his hands.
And this was a really seminal moment in AI research,
and it inspired a whole generation of computer nerds, myself included.
And it also spawned a ton of headlines about AI taking over and the end of humans.
I found, while I was researching this, one article that had a big picture of a Terminator killer robot on it.
Well, it's been a while and it didn't quite work out that way.
20 years later, in 2017, a robot kicked our ass at Go.
Here's our human champion.
Also holding his head in his hands.
Apparently this is the universal surrender pose when you have been defeated by a computer in a game.
And Go is a famously sophisticated and nuanced game.
So this was a really big deal.
And once again, it spawned all these articles
about the end of humans and all this stuff.
A couple years after that,
the same research lab developed a model
that could beat humans in StarCraft.
Here we go.
Another photo of a human surrendering
by holding his head in his hands.
I don't know about y'all, but
I played hundreds of hours of StarCraft in high school and college.
So this was a really big deal for me.
And it was also a really big deal
because StarCraft is famously complex.
You have imperfect information,
multiple races, units,
you have to balance scouting
and resource collection and all-out combat.
It's great.
And that an AI could play it at a human level
was really, really impressive.
And once again,
it spawned all of these articles about the end of humans,
with the twist that this time
we had trained a bot to engage us in space combat,
which I think seems especially alarming.
So three in a row,
we have a bad record, right?
Computers are beating us. They're superior to humans. We are on our way to becoming obsolete.
But as it turns out, it's a little more complicated than that. And there is something that actually
does a better job than a human alone or a computer alone, and that is a human with a computer.
And when you look at these games again through that lens, you actually get some different and more nuanced
results. So let's go back to chess. A few years after that game, Garry Kasparov, or actually many years
after the game, Garry Kasparov organized a tournament where humans could play with computer support.
And there were some really highly ranked grandmasters playing, and at the time, the cutting-edge
chess AIs that had been developed. But there were two amateurs who swept the whole field,
and they did this not by having better human chess skills, and not by having a better chess bot.
They had developed a model that they had programmed to be able to work with. They were working in tandem with it.
It was effectively inferior humans and an inferior model,
but they had found a way to work together
to beat superior humans and superior models.
The same exact thing just happened in Go a few weeks ago.
I don't know if folks caught the headlines for this,
but there was an amateur, it's always the amateurs, right,
who developed a model that he was able to work with
that understood and studied the weaknesses in the leading Go bot
and was able to defeat it.
And it wasn't, again, that this amateur had like a better Go model
that he had developed.
He had figured out a way to partner together
with this AI to enhance their performance.
And in StarCraft, it's still not the case
that AIs are routinely able to beat
our human professional players.
In fact, a big reason for that is
because human pros now are developing and relying
on techniques that were first pioneered by bots.
We're using these models to understand strategies
that humans can then uniquely go and execute against.
And so three cases in a row of AI actually elevating,
not eliminating human performance.
Humans are better because of AI.
We're able to work with AI to improve.
I think it's also worth mentioning
all of these games are as popular as ever.
Humans clearly still enjoy them,
even though there's an AI that might be better
than them as individuals.
So this is an example of something that sounds a little sci-fi,
but it's called Human Computer Symbiosis.
This was first proposed by this guy,
J.C.R. Licklider, in 1960,
and he has this awesome paper.
And for something written, like, 63 years ago, it really holds up.
And he has this quote that I have drawn inspiration from for quite a while,
which is, computers can do the routinizable work
to prepare the way for insights and decisions in technical and scientific thinking.
This was 63 years ago, and I think this is exactly what we're seeing happen now.
It's not about computers replacing humans.
It's about them working together cooperatively to solve a problem.
And I think this is the next step for AI.
I think it's the next step for humans.
You can have the computer doing the routine tedious work so humans can do the creative, interesting stuff.
We're a room of humans.
Our most fulfilled amazing days as humans are the days that we are spending doing creative and interesting work
and not doing the tedious drudgery stuff.
And I think AI is here to help us achieve that state of fulfillment.
Now, I'm going to bring this into the domain that I think a lot about.
I've been working in data, data science, data analytics my whole career.
I am now the founder and CEO of a company that builds a data science and analytics tool,
and our product is used by thousands of data practitioners every day,
and we see them do some really creative, interesting stuff.
I think data practitioners are creatives.
I know it's not the first thing that comes to mind
when I say creative, you think of artists or whatever,
but think about what data scientists do in their day:
they're asking questions, they're forming hypotheses,
they're testing new things,
they're building narratives, they're taking risks, they're telling stories.
This is good data science, it's good data analytics,
and it's what we expect from our data teams.
And it's an art and a science and a great use of human time.
But data work can also be really tedious.
They spend a lot of time writing boilerplate and fixing dependencies
and tracking down missing parentheses in a query.
It can be more plumbing than science sometimes.
And this is where I think people wind up spending a lot of their time
and really is a blocker to them doing their best work.
And so this really feels like a perfect opportunity
to bring human computer symbiosis
into this creative profession.
Now, when most people think of this,
they assume it means kind of just replacing data teams
with a magic insights text box.
Like, the next step is we'll all buy solutions
that then our stakeholders or executives
will come in, they'll write a question,
it'll give them a magic response back,
properly formatted charts,
and well-reasoned explanations
and full business context.
But that doesn't really work.
And it doesn't work, one,
because these models aren't perfect.
They can hallucinate,
they're missing a lot of context,
they don't understand the full situation of things,
but also that humans want to be able to hear a story
and understand and ask and answer questions
of a human around these things.
And so we actually tried this.
At Hex, we had built a UI
that was really sort of a little more black box.
You typed a question and it would bring you an answer back.
And it got pretty good results,
but it was missing the human element.
And we learned the same lesson,
the same thing J.C.R. Licklider posited,
the same thing we learned through all these games,
that for now at least, the best approach
is one where humans and computers
can work together to elevate performance.
And so the features we launched in our product last month
were built around these principles,
and I think there's a lot of takeaways here.
We built these features, they're called Hex Magic,
and they're built directly into the UI
that thousands of data scientists and data analysts
already use every day.
They bring powerful large language models,
the latest models from OpenAI,
integrated directly into our product.
And you can ask it to do all sorts of things,
from writing queries to building visualizations.
My personal favorite is called Magic Fix: when you have an error in your code,
it will automatically detect and fix it.
And as someone who has more and more errors in my code every day,
that is a very useful thing.
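To illustrate the general pattern behind a feature like Magic Fix, here is a hypothetical sketch: run the code, catch the failure, and hand the code plus the traceback to a language model for a corrected version. The call_llm helper is a stand-in for whatever model API a product would use; this is not Hex's actual implementation.

```python
# Hypothetical sketch of a "detect the error, ask an LLM to fix it" loop.
import traceback

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (wire up your provider of choice)."""
    raise NotImplementedError

def run_with_auto_fix(code: str, max_attempts: int = 2) -> None:
    for _ in range(max_attempts):
        try:
            exec(code, {})  # run the user's analysis snippet
            return
        except Exception:
            error = traceback.format_exc()
            # Send the failing code and its traceback back to the model,
            # ask for a corrected version, then retry.
            code = call_llm(
                "Fix this Python snippet so it runs without errors.\n\n"
                f"Code:\n{code}\n\nError:\n{error}"
            )
    raise RuntimeError("could not automatically fix the snippet")
```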
But the key thing here and the thing we really realize
is that the thing that we are in the business of doing
is to enhance and benefit humans.
It's to work with humans, not replace them.
We've found that we can elevate and accelerate human intuition.
And that's what our users tell us.
We had a user tell us they can spend more of their time
doing the creative, interesting part of their job
and less time doing the tedious plumbing.
And that is so exciting to me
because I think that is a little beginning.
It's a foothill of the ultimate value
that AI can provide in our lives.
It's human-computer symbiosis in action.
All right.
That is all for these exclusive segments
from our data and AI forum.
Hopefully, that gets your wheels spinning
and illuminates how much opportunity
there still is to build here.
We've got lots more AI coverage to come
as this field moves very
quickly, but for now we'd encourage you to go check out the companies that participated here.
So that's coactive.ai, character.ai, and hex.tech.
We'll include all of that in the show notes, but I also wanted to call out if you like these
kinds of episodes. This one being a compilation episode, please let us know. You can always email
us at podpitches@a16z.com. And if you haven't noticed already, we're doing a lot of testing
here in format, ideas, guests. So if you like something, if you hate something, if there's
certain topics you'd like to see more or less of, different guests you'd like to see on the
podcast, please do let us know. We love hearing your feedback and thank you so much for listening.
Thanks for listening to the A16Z podcast. If you like this episode, don't forget to subscribe,
leave a review, or tell a friend. We also recently launched on YouTube at youtube.com/a16z_video,
where you'll find exclusive video content. We'll see you next time.