The a16z Show - From Promise to Reality: Inside a16z's Data and AI Forum

Starting point is 00:00:00 Hi everyone. Welcome back to the a16z Podcast. This is your host, Steph Smith, and today we continue the conversation around AI. Now, this conversation is ongoing, but today we have exclusive footage from our Data and AI Forum held last month. So what we've done is we went through all of the presentations from that forum, and we've aggregated what we think are the best segments, delivered right to your ears. As a reminder, the content here is for informational purposes only. None of the following is investment, business, legal, or tax advice, and please note that a16z and its affiliates may maintain investments in the companies discussed in this podcast. Please see a16z.com

Starting point is 00:00:39 slash disclosures for more important information, including a link to a list of our investments. So the promises of AI have echoed for decades, starting as early as the 50s when artificial intelligence blossomed into its own discipline. And since then, the field has held so much promise, although AI skeptics would claim that the technology has fallen short of expectations.

Starting point is 00:01:12 That is, of course, until recently. A string of technology unlocks, from the deep learning renaissance around 2012, to the "Attention Is All You Need" paper in 2017, to compute moving down the cost curve, has resulted in numerous applications now in the hands of the masses, including, of course, ChatGPT, which reached 100 million users this January. As Nvidia's CEO Jensen Huang declared in a recent keynote, we are in the iPhone moment of AI.

Starting point is 00:01:44 In a matter of months, this technology has gone from what seemed like a distant promise to the everyday internet user leveraging ChatGPT to write emails, Midjourney to generate images for their next presentation, or Runway ML to edit their videos. We've become so accustomed to the rapid pace of innovation, we may even forget that DALL-E 2 was teased just one year ago, or the sheer sense of wonder that some of these early experiences brought. As Arthur C. Clarke said, any sufficiently advanced technology is indistinguishable from magic.

Starting point is 00:02:18 Well, the magic continues. Just a few weeks ago, on the precise date that GPT-4 was released, a16z held a Data and AI Forum, featuring many of the most influential builders in the space, from the companies building the foundational models like OpenAI, to those building the underlying infrastructure like AWS. And in today's episode, you'll get a sneak peek into a few of the most important conversations from the forum. So to kick things off, one of the most interesting unlocks that AI surfaces is the ability to interface with unstructured data, something that the average internet user happens to produce a lot of, from the pictures we take to the emails we send. Here, Cody Coleman, CEO and co-founder of Coactive, a platform that helps

Starting point is 00:03:04 make sense of your unstructured image and video data, talks about the potential to leverage the vast amount of data already being collected by companies who are already paying a small fortune to retain it. You might even call this a data dividend. Right now, 80% of internet traffic is unstructured video data. And you can see this when you think about the number of Zoom meetings that are on your calendar, or the rise of e-commerce, where we're making purchasing decisions based off of images and videos. And if you have kids, they're probably addicted to TikTok or Instagram. And it's not just internet traffic.

Starting point is 00:03:38 It's projected that 80% of worldwide data will be unstructured data. So that's things such as video and audio. 85% of all worldwide data will be unstructured data by 2025. So it's a massive, it's the dominant form, the vast majority of data that's out there in the world, and it actually influences not just the content that we consume, but also the products that we buy.

Starting point is 00:04:05 So nearly 80% of people say that user-generated content, UGC, impacts their decisions to purchase. And this is just the beginning. You can already see that the barrier to creating content, whether it be text or visual, has dropped dramatically. And that means that there's going to be a waterfall or a fire hose of content that can be generated. So it's not a question anymore of, you know,

Starting point is 00:04:32 if or when content is going to be king, content is king right now and it's affecting and capturing every part of our life. Now, the question that we should be asking is, what type of king is your content? So we all want the legendary content where you have a good king. You know, you have the right content

Starting point is 00:04:52 and it will lift your sales and engagement by delivering the right content to the right person at the right time. You can also use all this content that you might be creating in-house or collecting from your users to actually answer questions about customer behavior and illuminate trends more broadly. That's the ideal picture. Unfortunately, most businesses can't actually realize that version

Starting point is 00:05:19 of their content and actually derive that much value out of it. Instead, we see a lot of people with maybe a lazier king. You know, when you think about unstructured data and content, a lot of it sits underutilized on cloud storage right now, so in S3 or Google Cloud Storage, for serving, or it's archived in backups. And basically, this just costs you a small fortune

Starting point is 00:05:44 just because of the sheer volume when we think about image, audio, and video data. And that's basically just taxing your organizations, taxing your business, just to store this data, to keep it there kind of archival, which is really expensive. But things can get even worse when we think about what bad content looks like. Bad content exposes your businesses and organizations to a wide variety of risks.

Starting point is 00:06:09 So one, there's the risk of violating user privacy. When we think about this unstructured data, when we think about text, when we think about video, when we think about audio, the additional context that it captures is so personal that it potentially exposes you to the risk of violating privacy there. Also, you might not have control over all the content that is shared on your platform or kind of put out there. And that risks corrupting the safety of the online communities that we are a part of. And that, in turn, can end up meaning that we lose trust in those platforms that provide

Starting point is 00:06:44 those online communities or the brands or things like that that are talked about. If we actually don't have the right content and the right message being delivered with the content in our businesses. Luckily, we found the key. We found the way that we can actually make this data useful, and that's with AI. The piece that actually makes content king is actually AI. By being able to actually process and understand this data at scale,

Starting point is 00:07:09 which before this moment really wasn't possible. It was a very hard manual process to actually go through all this content and understand it. But thanks to the work of folks like OpenAI, we can now actually understand and, like, in general, and appreciate this content. And I've seen that value and how that can generate so much value

Starting point is 00:07:26 firsthand from my own experiences. So I did my PhD at Stanford at the intersection of machine learning and systems. I've worked as a data scientist in industries ranging from finance to education to big tech companies like Pinterest and Meta. And I've seen just how much they can leverage AI to actually make their content

Starting point is 00:07:47 better, to improve ads, to improve recommendation, to improve search, and so many other kind of vital and critical use cases to businesses. But on the other side, you know, from being hands-on and from doing my PhD in this, I know that it is really, really difficult to get it right. There's like no such thing as a free lunch. And I think that there's kind of four main challenges that prevent organizations and businesses from really being able to unlock this data right now.

Starting point is 00:08:15 And the first is one of just scale. When we think about unstructured text and visual data, it's orders of magnitude larger than today's big data. So to put that into perspective, if we were to think about tabular data, so we had 10 million rows of tabular data, that's around 40 megabytes. And to put that into perspective, we can think of that as being like all of the water and all of the area of Lake Tahoe in California,

Starting point is 00:08:43 which is around 496 square kilometers. If we were to think about 10 million documents, text documents, we go from 40 megabytes to 40 gigabytes. And now we have something that's more on the scale of the Caspian Sea. So 371,000 square kilometers of space, when we put that in perspective and scale it up. It's three orders of magnitude more data in terms of volume than when we think about tabular data. And then when we think about visual data, if we had 10 million images, that would be 20 terabytes of data. That's another three orders of magnitude bigger, and that's like the Pacific Ocean,

Starting point is 00:09:27 when we think about it, in terms of just the sheer scale of data that that is in terms of volume. The Pacific Ocean is 168 million square kilometers. Now, right now, when we think about kind of big data and our data lakes, we have kind of these tools and vehicles that can process that efficiently. But that's kind of like having a rowboat or a canoe. You know, it'll get you across the lake, but I wouldn't trust that if you were trying to cross the Pacific Ocean.
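The size comparison above can be checked with a few lines of arithmetic. The following is only a minimal sketch: it takes the round figures quoted in the talk at face value (40 MB, 40 GB, and 20 TB for 10 million items each) and assumes decimal units. The names `TOTAL_BYTES`, `bytes_per_item`, and `orders_of_magnitude` are illustrative and do not come from the talk.

```python
import math

# Figures quoted in the talk, each for 10 million items.
# Assumption: decimal units (1 MB = 10**6 bytes, and so on).
ITEMS = 10_000_000
MB, GB, TB = 10**6, 10**9, 10**12

TOTAL_BYTES = {
    "tabular rows": 40 * MB,
    "text documents": 40 * GB,
    "images": 20 * TB,
}


def bytes_per_item(total_bytes: int, items: int = ITEMS) -> float:
    """Average size of one item implied by the quoted total."""
    return total_bytes / items


def orders_of_magnitude(smaller: int, larger: int) -> float:
    """Base-10 logarithm of the ratio between two sizes."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    previous = None
    for name, total in TOTAL_BYTES.items():
        line = f"{name}: {bytes_per_item(total):,.0f} bytes per item"
        if previous is not None:
            jump = orders_of_magnitude(previous, total)
            line += f" ({jump:.1f} orders of magnitude above the previous tier)"
        print(line)
        previous = total
```

Under these assumptions, the step from tabular data to text is exactly three orders of magnitude, while the step from text to images is a factor of 500, or about 2.7 orders, which the talk rounds up to three. The implied per-item sizes (a few bytes per row, a few kilobytes per document, a couple of megabytes per image) suggest the quoted totals are meant as rough illustrations rather than measurements.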

Starting point is 00:09:58 So in order to actually be able to unlock the value from this richer context that we get in content, we actually need to create tools and infrastructure in order to process that. Now it's gonna be probably a similar shape, just like how a seagoing boat looks somewhat similar to a rowboat,

Starting point is 00:10:15 but the scale and the processing of it will just have to be kind of completely different. And we'll need to prepare ourselves just for the fact of the sheer volume and scale that we're thinking about when we move from a tabular view of the world to more of a content view of the world. Given how much data is being created every single minute, you can imagine all the new infrastructure opportunities and challenges there will be in order to make use of it. But you also may wonder, if we're collecting and processing so much data, will we ever run out, both in terms of our ability to store it, but also to continuously upgrade new models,

Starting point is 00:10:51 with new data. Here's a16z general partner Sarah Wang asking that question to Myle Ott, a longtime AI researcher who previously led the LLM efforts at Facebook and is now part of the founding team at Character.AI, an AI platform seeking to give consumers access to personalized AI systems. And fun fact, one of the other founders of Character was one of the authors of the "Attention Is All You Need" paper from 2017, a truly foundational piece of research

Starting point is 00:11:22 underpinning many of the AI advancements since. There's sort of this question around are we running out of data? And I think what's really interesting for this room is that there are a bunch of execs here with access to a ton of proprietary data, right? So this question may not pertain to that as much, although I think it'd be interesting to loop that in.

Starting point is 00:11:42 But there's sort of this question of, as these models get bigger, they ingest more data, are we actually running out of publicly available static web data? And what do we do about that? How do you guys think about that at Character? And how has that informed the approach that you've taken? Yeah, it's a good question. I think, so obviously most of the kind of AI systems that are being trained today are trained on these public data sets, right? So, you know, mostly kind of data crawled from the web. I think there's actually

Starting point is 00:12:07 still like a decent amount of public data available. I think, you know, even if we're kind of reaching the limit, say, of text, I think there's other modalities that, you know, folks are starting to explore: audio, video, images. I think there's a lot of really rich data sources out there still on the web. I think there's then, I don't know the exact magnitudes, but I imagine roughly similar scale of private data sets out there, right?

Starting point is 00:12:30 And I think that's gonna be really important in certain applications. You know, I imagine if you have a code generation system, it's great that it's trained on all of public GitHub, but it might be even more useful if it's trained on my own code base, right, on my private code base. So I think figuring out like how to blend these public and private datasets

Starting point is 00:12:46 is going to be really interesting. And I think it's going to open up a whole bunch of new applications, too. From Character's perspective, and I guess more generally, one of the things that we're starting to see that is pretty exciting is this move from, you know, you could call it like static data sets, but data that kind of exists already out there, independent of AI systems. We're moving now, I think, towards data sets that

Starting point is 00:13:07 are being built with AI in the loop, right? And so you have, you know, what people often refer to as data flywheels, but you basically can imagine, say, for Character, we have all these rich interactions where a character is having a conversation with someone, and we get feedback on that conversation from the user, either explicitly or implicitly. And that's really like the perfect data to use to make that AI system better, right? And so we have these loops that I think are going to be really kind of exciting and provide both richer and perhaps much larger data sources for the sort of next generation of systems.

Starting point is 00:13:44 Yeah, very exciting. Well, I think we've been talking a little bit about the future, but I actually want to bring us back, since you've been working in large language models for quite some time now, getting a little bit of a history lesson from you would actually be very interesting. And I think even though Michael had listed a long list of accomplishments and things that you'd worked on, it was still, frankly, in my view, very humble. And I think one of the most significant contributions of yours is the development of the RoBERTa model. And rather than hearing me define it, if you could take us back to, I believe it's 2019, what the state of AI looked like back then, LLMs, and maybe just bring us forward to today, as a lot has changed, to your point, in the course of four years.

Starting point is 00:14:26 Yeah. So RoBERTa, you know, I think as I kind of mentioned earlier, when I was in the research group at Facebook, a lot of my focus was on trying to build kind of larger-scale engineering systems. But a lot of that actually started with translation systems. So obviously, machine translation, automatically translating between different languages. It's like a hugely important problem at Facebook. It runs in production.

Starting point is 00:14:47 And one of the highest leverage ways we found to make those systems better was to train them on more data and with more compute. And I think in some ways that sounds like an obvious idea now. But I think actually back then it was somewhat controversial. And I think there's almost this kind of perception that like in order to make big advances in AI, we were going to need really big algorithmic breakthroughs. I think it was kind of underappreciated how far you could get by just increasing the amount of data, improving the data quality, and scaling up the amount of compute.

Starting point is 00:15:20 I think in late 2018, Google came out with something called BERT, which is also a transformer model, but used a slightly clever training objective and got kind of state-of-the-art performance on all of these natural language understanding tasks, right? So making classifications about particular text input or something. And RoBERTa was really kind of taking BERT and scaling it up, right? And I think we trained it on something like 10x more data and with a lot more compute. And, you know, what we found is that there was this big algorithmic jump from kind of the stuff before BERT going to BERT and then an almost equally sized step by just scaling it up, right?

Starting point is 00:15:59 And so I think that has been really, in many ways, the story of the last few years, too, is that by scaling up these systems, there are actually really substantial gains, like qualitatively different behavior and performance that we can get, accuracy that we can get out of these models. So I think that's a really kind of fruitful direction to explore. And I think there's probably still more to explore there going forward. Yeah, absolutely. I mean, it's fascinating because I think that relationship today, we almost take for granted. If we extrapolate that relationship, more data equals more powerful models. And as these models do become more powerful, the reaction of many is to question whether

Starting point is 00:16:38 this technology will take our jobs. Or an even further extrapolation, whether it'll make us as humans completely obsolete. And while people often cite games like chess, Go, or StarCraft as examples of where bots have definitively beaten humans, there's actually another story that can be told. For one, people are still playing chess decades after the infamous 1997 match between Deep Blue and Kasparov.

Starting point is 00:17:05 In fact, you could argue that chess is more popular than ever. Here is Barry McCardel, founder of Hex, a data science platform that's integrating AI, illuminating how the story is much more dynamic than human versus bot, and how there's a much more helpful lens of what we can achieve together. In 1997, Deep Blue, the IBM chess bot, beat Garry Kasparov in this very famous televised game. Here's a photo of our grandmaster holding his head in his hands. And this was a really seminal moment in AI research, and it inspired a whole generation of computer nerds, myself included.

Starting point is 00:17:43 And it also spawned a ton of headlines about AI taking over and the end of humans. While I was researching this, I found one article that had a big picture of a Terminator killer robot on it. Well, it's been a while, and it didn't quite work out that way. 20 years later, in 2017, a robot kicked our ass at Go. Here's our human champion, also holding his head in his hands. Apparently, this is the universal surrender pose when you have been defeated by a computer in a game.

Starting point is 00:18:10 And Go is a famously sophisticated and nuanced game. So this was a really big deal. And once again, it spawned all these articles about the end of humans and all this stuff. A couple years after that, the same research lab developed a model that could beat humans in StarCraft. Here we go. Another photo of a human surrendering by holding his head in his hands.

Starting point is 00:18:32 I don't know about y'all. I played hundreds of hours of StarCraft in high school and college. So this was a really big deal for me. And it was also a really big deal because StarCraft is famously complex. You have imperfect information, multiple races, units, you have to balance scouting and resource collection and all-out combat. It's great. And that an AI could play it at a human level was really, really impressive. And once again, it spawned all of these articles about the end of humans. And with the twist that this time we had trained a bot to engage us in space combat, which I think seems especially alarming. So, three out of three.

Starting point is 00:19:04 Oh, we have a bad record, right? Computers are beating us. They're superior to humans. We are on our way to becoming obsolete. But as it turns out, it's a little more complicated than that. And there is something that actually does a better job than a human alone or a computer alone, and that is a human with a computer. And when you look at these games again through that lens, you actually get some different

Starting point is 00:19:27 and more nuanced results. So let's go back to chess. A few years after that game, or actually many years after the game, Garry Kasparov organized a tournament where humans could play with computer support. And there were some really highly ranked grandmasters playing, along with what were at the time the cutting-edge chess AIs that had been developed.

Starting point is 00:19:48 But there were two amateurs who swept the whole field, and they did this not by having better human chess skills, and not by having a better chess bot. They had developed a model that they had programmed to be able to work with. They were working in tandem with it. It was effectively inferior humans and an inferior model, but they had found a way to work together to beat superior humans and superior models.

Starting point is 00:20:12 The same exact thing just happened in Go a few weeks ago. I don't know if folks caught the headlines for this, but there was an amateur, it's always the amateurs, right, who developed a model that he was able to work with that understood and studied the weaknesses in the leading Go bot and was able to defeat it. And it wasn't, again, that this amateur had like a better Go model that he had developed.

Starting point is 00:20:31 He had figured out a way to partner together with this AI to enhance their performance. And in StarCraft, it's still not the case that AIs are routinely able to beat our human professional players. In fact, a big reason for that is because human pros now are developing and relying on techniques that were first pioneered by bots. We're using these models to understand strategies

Starting point is 00:20:54 that humans can then uniquely go and execute against. And so three cases in a row of AI actually elevating, not eliminating human performance. Humans are better because of AI. We're able to work with AI to improve. I think it's also worth mentioning all of these games are as popular as ever. Humans clearly still enjoy this,

Starting point is 00:21:14 even though there's an AI that might be better than them as individuals. So this is an example of something that sounds a little sci-fi, but it's called human-computer symbiosis. This was first proposed by this guy, J.C.R. Licklider, in 1960. And he has this awesome paper. And for something written like 63 years ago, it really holds up. And he has this quote that I have drawn inspiration from for quite a while,

Starting point is 00:21:38 which is, computers can do the routinizable work to prepare the way for insights and decisions in technical and scientific thinking. This was 63 years ago, and I think this is exactly what we're seeing happen now. It's not about computers replacing humans. It's about them working together cooperatively to solve a problem. And I think this is the next step for AI. I think it's the next step for humans.

Starting point is 00:22:00 You can have the computer doing the routine tedious work so humans can do the creative, interesting stuff. We're a room of humans. Our most fulfilled amazing days as humans are the days that we are spending doing creative and interesting work and not doing the tedious drudgery stuff. And I think AI is here to help us achieve that state of fulfillment. Now, I'm going to bring this into the domain that I think a lot about. I've been working in data, data science, data analytics my whole career. I am now the founder and CEO of a company that builds a data science and analytics tool. And our product is used by thousands of data practitioners every day.

Starting point is 00:22:34 And we see them do some really creative, interesting stuff. I think data practitioners are creatives. I know it's not the first thing that comes to mind when I say creative. You think of artists or whatever. But think about what data scientists do in their day: they're asking questions, they're forming hypotheses, they're testing new things. They're building narratives.

Starting point is 00:22:51 They're taking risks. They're telling stories. This is good data science. It's good data analytics. And it's what we expect from our data teams. And it's an art and a science and a great use of human time. But data work can also be really tedious. You spend a lot of time writing boilerplate and fixing dependencies

Starting point is 00:23:10 and tracking down missing parentheses in a query. It can be more plumbing than science sometimes. And this is where I think people wind up spending a lot of their time and really is a blocker to them doing their best work. And so this really feels like a perfect opportunity to bring human computer symbiosis into this creative profession. Now, when most people, when they think of this,

Starting point is 00:23:32 they assume it means kind of just replacing data teams with a magic insights text box. Like the next step is we'll all buy solutions that then our stakeholders or executives will come in, they'll write a question, it'll give them a magic response back, properly formatted charts, and well-reasoned explanations and full business context.

Starting point is 00:23:50 But that doesn't really work. And it doesn't work, one, because these models aren't perfect. They can hallucinate, They're missing a lot of context. They don't understand the full situation of things. But also that humans want to be able to hear a story and understand and ask and answer questions of a human around these things. And so we actually tried this.

Starting point is 00:24:09 At Hex, we had built a UI that was really sort of a little more black box. You'd type a question, and we'd bring you an answer back. And you got pretty good results, but it was missing the human element. And we learned the same lesson, the same thing J.C.R. Licklider posited, the same thing we learned through all these games, that for now at least, the best approach is one where humans and computers can work together to elevate performance.

Starting point is 00:24:32 And so the features we launched in our product last month were built around these principles, and I think there's a lot of takeaways here. We built these features, they're called Hex Magic, and they're built directly into the UI that thousands of data scientists and data analysts already use every day. They bring the powerful large language models,

Starting point is 00:24:48 the latest models from OpenAI, integrated directly in our product. And you can ask it to do all sorts of things, from writing queries to building visualizations. Or my personal favorite is called Magic Fix: when you have an error in your code, it will automatically detect and fix it. And as someone who has more and more errors in my code every day, that is a very useful thing. But the key thing here, and the thing we really realized, is that the thing that we are in the business of doing is to enhance and benefit humans. It's to work with humans, not replace them.

Starting point is 00:25:16 We've found that we can elevate and accelerate human intuition. And that's what our users tell us. We had a user tell us they can spend more of their time. doing the creative, interesting part of their job, and less time doing the tedious plumbing. And that is so exciting to me because I think that is a little beginning. It's a foothill of the ultimate value that AI can provide in our lives.

Starting point is 00:25:36 It's human-computer symbiosis in action. All right. That is all for these exclusive segments from our Data and AI Forum. Hopefully, that gets your wheels spinning and illuminates how much opportunity there still is to build here. We've got lots more AI coverage to come, as this field moves very quickly, but for now we'd encourage you to go check out the companies that participated here.

Starting point is 00:25:59 So that's coactive.ai, character.ai, and hex.com. We'll include all of that in the show notes, but I also wanted to call out, if you like these kinds of episodes, this one being a compilation episode, please let us know. You can always email us at podpitches at a16z.com, and if you haven't noticed already, we're doing a lot of testing here in format, ideas, and guests. So if you like something, if you hate something, if there's certain topics you'd like to see more or less of, different guests you'd like to see on the podcast, please do let us

Starting point is 00:26:32 know. We love hearing your feedback, and thank you so much for listening. Thanks for listening to the a16z Podcast. If you like this episode, don't forget to subscribe, leave a review, or tell a friend. We also recently launched on YouTube at YouTube.com slash a16z underscore video, where you'll find exclusive video content. I'll see you next time.
