a16z Podcast - From Promise to Reality: Inside a16z's Data and AI Forum
Episode Date: May 2, 2023

Nvidia's CEO Jensen Huang declared in a recent keynote, "we are in the iPhone moment of AI." This special episode will give you an inside look into a16z's Data and AI Forum, hosted the day GPT-4 came out, featuring many of the most influential builders in the space – from the companies building foundational models like OpenAI to those building the underlying infrastructure like AWS.

Resources:
Check out CoactiveAI: https://coactive.ai/
Check out CharacterAI: http://character.ai/
Check out Hex: https://hex.tech/
Follow Cody on Twitter: https://twitter.com/codyaustun
Follow Myle on Twitter: https://twitter.com/myleott
Follow Barry on Twitter: https://twitter.com/barrald

Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. For more details please see a16z.com/disclosures.
Transcript
Hi everyone. Welcome back to the A16Z podcast. This is your host, Steph Smith, and today we continue the conversation around AI.
Now, this conversation is ongoing, but today we have exclusive footage from our data and AI forum held last month.
So what we've done is we went through all of the presentations from that forum, and we've aggregated what we think are the best segments delivered right to your ears.
As a reminder, the content here is for informational purposes only. None of the
following is investment, business, legal, or tax advice, and please note that a16z and its affiliates
may maintain investments in the companies discussed in this podcast. Please see a16z.com
slash disclosures for more important information, including a link to a list of our investments.
So the promises of AI have echoed for decades, starting as early as the 50s,
when artificial intelligence blossomed into its own discipline.
And since then, the field has held so much promise,
although AI skeptics would claim the technology has fallen short of expectations.
That is, of course, until recently.
A string of technology unlocks from the deep learning renaissance around 2012,
the "Attention Is All You Need" paper in 2017,
and compute moving down the cost curve
have resulted in numerous applications now in the hands of the masses,
including, of course, ChatGPT, which reached 100 million users this January.
As Nvidia's CEO, Jensen Huang, declared in a recent keynote,
we are in the iPhone moment of AI.
In a matter of months, this technology has gone from what seemed like a distant promise
to the everyday internet user leveraging ChatGPT to write emails,
Midjourney to generate images for their next presentation,
or Runway ML to edit their videos.
We've become so accustomed to the rapid pace of innovation,
we may even forget that DALL-E 2 was teased just one year ago,
or the sheer sense of wonder that some of these early experiences brought.
As Arthur C. Clarke said, any sufficiently advanced technology is indistinguishable from magic.
Well, the magic continues.
Just a few weeks ago, on the precise date that GPT-4 was released,
a16z held a data and AI forum, featuring many of the most
influential builders in the space, from the companies building the foundational models like
OpenAI to those building the underlying infrastructure like AWS. And in today's episode,
you'll get a sneak peek into a few of the most important conversations from the forum. So to kick
things off, one of the most interesting unlocks that AI services is the ability to interface with
unstructured data, something that the average internet user happens to produce a lot of from the
pictures we take to the emails we send. Here, Cody Coleman, CEO and co-founder of Coactive,
a platform that helps make sense of your unstructured image and video data, talks about the
potential to leverage the vast amount of data already being collected by companies who are
already paying a small fortune to retain it. You might even call this a data dividend.
Right now, 80% of internet traffic is unstructured video data. And you can see this when you think
about the number of Zoom meetings that are on your calendar, or the rise of e-commerce, where
we're making purchasing decisions based off of images and videos. And if you have kids,
they're probably addicted to TikTok or Instagram. And it's not just internet traffic. It's projected
that 80% of worldwide data will be unstructured data. So that's things such as video and
audio, and 85% of all worldwide data will be unstructured by 2025. So it's
massive; it's the dominant form of data that's out there in the world,
and it actually influences not just the content that we consume, but also the products that we buy.
So nearly 80% of people say that user-generated content, UGC, impacts their decisions to purchase.
And this is just the beginning.
You can already see that the barrier to creating content, whether it be text or visual, has dropped
dramatically. And that means that there's going to be a waterfall, or a fire hose, of the amount
of content that can be generated. So it's not a question anymore of, you know, if or when content
is going to be king. Content is king right now and is affecting and capturing every part of our life.
Now, the question that we should be asking is, what type of king is your content? So we all want
the legendary content where you have a good king. You know, you have the right content and it'll
lift your sales and engagement by delivering the right content to the right person at the right
time. You can also use all this content that you might be creating in-house or collecting
from your users to actually answer questions about customer behavior and illuminate trends more
broadly. That's the ideal picture. Unfortunately, most businesses can't actually realize that
version of their content and actually derive that much value out of it. Instead, we see a lot of
people with maybe a lazier king. You know, when you think about unstructured data and
content, a lot of it sits underutilized on cloud storage right now. So in S3 or Google Cloud Storage
for serving, or it's archived in backups. And basically, this just costs you a small fortune
just because of the sheer volume
when we think about image, audio, and video data.
And that's basically just taxing your organizations,
taxing your business,
just to store this data, just to keep it there,
kind of archival, which is really expensive.
But things can get even worse
when we think about what bad content looks like.
Bad content exposes your businesses and organizations
to a wide variety of risks.
So one, there's a risk of violating user privacy.
When we think about this unstructured data,
when we think about text, when we think about video, when we think about audio,
the additional context that it captures is so personal,
and it potentially exposes you to the risk of violating privacy there.
Also, you might not have control over all the content that is shared on your platform
or kind of put out there.
And that risks corrupting the safety of the online communities that we are a part of.
And that, in turn, can end up meaning that we lose trust in those platforms that provide
those online communities, or the brands and things like that that are talked about,
if we actually don't have the right content and the right message being delivered with the
content in our businesses.
Luckily, we found the key.
We found the way that we can actually make this data useful, and that's with AI.
The piece that actually makes content king is actually AI.
By being able to actually process and understand this data at scale, which before this
moment really wasn't possible.
It was a very hard manual process to actually go
through all this content and understand it. But thanks to the work of folks like OpenAI, we can now
actually understand and ingest and appreciate this content. And I've seen that value and how that
can generate so much value firsthand from my own experiences. So I did my PhD at Stanford at the
intersection of machine learning and systems. I've worked as a data scientist in industry, ranging from
finance to education to big tech companies like Pinterest and Meta. And I've seen just how much
they can leverage AI to actually make their content better,
to improve ads, to improve recommendation, to improve search,
and so many other kind of vital and critical use cases to businesses.
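To make that concrete, here is a minimal sketch, assuming the Hugging Face transformers library and OpenAI's public CLIP checkpoint (neither of which is named in the talk), of how an off-the-shelf model can score an unstructured image against plain-text labels. The file name and label set are purely illustrative.

```python
# Hedged sketch: score an image against text labels with OpenAI's CLIP,
# via the Hugging Face `transformers` library. Model choice, file name,
# and labels are illustrative assumptions, not Coactive's actual stack.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_asset.jpg")  # any image sitting in object storage
labels = ["beach vacation", "office meeting", "product shot", "pet video still"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.2f}")
```

Embeddings and similarity scores like these are what let a system tag, search, or cluster millions of images and video frames without hand labeling each one.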
But on the other side, from being hands-on and from doing my PhD in this,
I know that it is really, really difficult to get it right.
There's like no such thing as a free lunch.
And I think that there's kind of four main challenges
that prevent organizations and businesses from really being able to unlock this
data right now. And the first is one of just scale. When we think about unstructured text and
visual data, it's orders of magnitude larger than today's big data. So to put that into perspective,
if we were to think about tabular data, so we had 10 million rows of tabular data, that's around
40 megabytes. And to put that into perspective, we can think of that as being like all of the
water and all the area of Lake Tahoe in California, which is around 496 square kilometers.
If we were to think about 10 million documents, text documents, we go from 40 megabytes to 40
gigabytes. And now we have something that's more on the scale of the Caspian Sea, so 371,000
square kilometers of space when we put that in perspective and scale it up. It's three orders
of magnitude more data in terms of volume than when we think about tabular data.
And then when we think about visual data, if we had 10 million images, that would be 20
terabytes of data. That's another three orders of magnitude bigger. And that's like the Pacific
Ocean when we think about it in terms of just the sheer scale of data that that is in terms of
volume. The Pacific Ocean is 168 million square kilometers. Now, right now, when we think about
kind of big data and our data lakes, we have kind of these tools and vehicles that kind of can
process that efficiently. But that's kind of like having a rowboat or canoe. You know, it'll get
you across the lake, but I wouldn't trust that if you were trying to cross the Pacific Ocean.
So in order to actually be able to unlock kind of the value from this richer, kind of more
context that we get in content, we actually need to create kind of tools and infrastructure
in order to process that.
Now, it's going to be probably a similar shape in terms of like, just like how a sea-going
boat looks somewhat similar to a rowboat, but the scale and the processing of it will just
have to be kind of completely different.
And we'll need to prepare ourselves just for the fact of the sheer volume and scale that
we're thinking about when we move from a tabular view of the world.
to more of a content view of the world.
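As a rough companion to the rowboat-versus-ocean analogy, here is a minimal back-of-envelope sketch using the round figures quoted above (roughly 40 MB for 10 million tabular rows, 40 GB for 10 million text documents, 20 TB for 10 million images); the point is the orders-of-magnitude jump, not the exact byte counts.

```python
# Back-of-envelope sketch using the round numbers quoted in the talk.
import math

sizes_in_bytes = {
    "10M tabular rows":   40 * 10**6,   # ~40 MB
    "10M text documents": 40 * 10**9,   # ~40 GB
    "10M images":         20 * 10**12,  # ~20 TB
}

baseline = sizes_in_bytes["10M tabular rows"]
for name, size in sizes_in_bytes.items():
    orders = math.log10(size / baseline)
    print(f"{name}: {size:,} bytes (~10^{orders:.1f} x the tabular baseline)")
```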
Given how much data is being created every single minute,
you can imagine all the new infrastructure opportunities and challenges
there will be in order to make use of it.
But you also may wonder, if we're collecting and processing so much data,
will we ever run out, both in terms of our ability to store it,
but also to continuously update models with new data.
Here's a16z's general partner, Sarah Wang,
asking that question to Myle Ott, a longtime AI researcher previously leading the LLM efforts at Facebook
and now part of the Character.AI founding team, an AI platform seeking to give consumers access
to personalized AI systems. And fun fact, one of the other founders of Character
was one of the authors of the "Attention Is All You Need" paper from 2017, a truly foundational
piece of research, underpinning many of the AI advancements since.
There's sort of this question around, are we running out of data?
And I think what's really interesting for this room is that there are a bunch of execs here
with access to a ton of proprietary data, right?
So this question may not pertain to that as much, although I think it'd be interesting to loop that in.
But there's sort of this question of, you know, as these models get bigger, they ingest more data,
are we actually running out of publicly available static web data?
And what do we do about that?
How do you guys think about that at Character, and how has that informed the approach that you've taken?
Yeah, it's a good question. I think, so obviously, most of the kind of AI systems that are being
trained today are trained on these public data sets, right? So, you know, mostly kind of data crawled from
the web. I think there's actually still like a decent amount of public data available. I think,
you know, even if we're kind of reaching the limits, say, of text, I think there's other
modalities that, you know, folks are starting to explore audio, video, images. I think there's a lot
of really rich data sources out there still on the web. I think there's then, I don't know,
of the exact magnitudes, but I imagine, you know, roughly similar scale of private
data sets out there, right? And I think that's going to be really important in certain
applications. You know, I imagine if you have a code generation system, it's great that it's
trained on all of public GitHub, but, you know, it might be even more useful if it's trained
on my own code base, right, on my private code base. So I think figuring out, like, how
to blend these public and private data sets is going to be really interesting. And I think
it's going to open up a whole bunch of new applications, too. From Character's perspective,
and I guess more generally, one of the things that we're starting to see that is pretty exciting is this move from, you know, you could call it like static data sets, but data that kind of exists already out there independent of AI systems.
We're moving now, I think, towards data sets that are being built with AI in the loop, right?
And so you have, you know, people often refer to these as data flywheels, but you basically can imagine, say, for Character, we have all these rich interactions where Character is having a conversation with
someone, and we get feedback on that conversation from the user, either explicitly or
implicitly, and that's really like the perfect data to use to make that AI system better, right?
And so we have these loops that I think are going to be really kind of exciting and provide
both richer and perhaps much larger data sources for the sort of next generation of systems.
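For a concrete picture of that flywheel, here is a hypothetical sketch in which each model interaction is logged alongside explicit or implicit user feedback, and the positively rated interactions are kept as candidates for the next round of training. The class and field names are illustrative, not Character.AI's actual pipeline.

```python
# Hypothetical sketch of a "data flywheel": log interactions plus feedback,
# then harvest the well-rated ones as future training examples.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class InteractionRecord:
    prompt: str                              # what the user said
    response: str                            # what the model replied
    explicit_rating: Optional[int] = None    # e.g. thumbs up (+1) / down (-1)
    implicit_signal: Optional[float] = None  # e.g. did the user keep chatting?

@dataclass
class FlywheelBuffer:
    records: List[InteractionRecord] = field(default_factory=list)

    def log(self, record: InteractionRecord) -> None:
        self.records.append(record)

    def training_candidates(self) -> List[InteractionRecord]:
        # Keep interactions with positive explicit or implicit feedback.
        return [
            r for r in self.records
            if (r.explicit_rating or 0) > 0 or (r.implicit_signal or 0.0) > 0.5
        ]

buffer = FlywheelBuffer()
buffer.log(InteractionRecord("Tell me a story", "Once upon a time...", explicit_rating=1))
print(len(buffer.training_candidates()))  # -> 1
```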
Yeah, very exciting.
Well, I think we've been talking a little bit about the future, but I actually want to bring us
back. Since you've been working in large language models for quite some time now, getting a little
bit of a history lesson from you would actually be very interesting. And I think even though Michael
had listed a long list of accomplishments and things that you'd worked on, it was still, you know,
frankly, in my view, very humble. And I think one of the most significant contributions of yours is
the development of the RoBERTa model. And rather than hearing me define it, could you take us back to,
I believe it's 2019, what the state of AI looked like back then, LLMs, and maybe
just bring us forward to today, as a lot has changed, to your point, in the course of four years.
Yeah. So RoBERTa, you know, I think as I kind of mentioned earlier, when I was in the research
group at Facebook, a lot of my focus was on trying to build kind of larger scale engineering
systems. But a lot of that actually started with translation systems. So obviously machine translation,
automatically translating between different languages. It's like a hugely important problem
at Facebook. It runs in production. And one of the kind of highest leverage ways we found to make
those systems better was to train them on more data and with more compute. And, you know, I think
in some ways that sounds like an obvious idea now, but I think actually back then it was somewhat
controversial. And I think there's almost this kind of perception that, like, in order to make
big advances in AI, we were going to need really big algorithmic breakthroughs. And I think it was
kind of underappreciated how far you could get by just increasing the amount of data,
improving the data quality, and scaling up the amount of compute.
I think in late 2018, Google came out with something called BERT, which is also a transformer model,
but used a slightly cleverer training objective and got kind of state-of-the-art performance
in all of these natural language understanding tasks, right?
So making classifications about particular text input or something.
And RoBERTa was really kind of taking BERT and scaling it up, right?
I think we trained it on something like 10x more data and with a lot more compute.
And, you know, what we found is that there was this big algorithmic jump from kind of the stuff before BERT going to BERT, and then an almost equally sized step by just scaling it up, right?
And so I think that has been really, in many ways, the story of the last few years, too, is that by scaling up these systems, there's actually really substantial gains, like qualitatively different behavior and performance that we can get, accuracy that we can get out of these models.
So I think that's like a really kind of fruitful direction to explore,
and I think there's probably still more to explore there going forward.
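For reference, here is a minimal sketch of the masked-language-modeling objective that BERT introduced and RoBERTa scaled up, written with the Hugging Face transformers library and the public roberta-base checkpoint; the tooling is an assumption on our part, not something mentioned in the conversation.

```python
# Hedged sketch: RoBERTa's masked-language-modeling objective in action,
# using the Hugging Face `transformers` pipeline (an assumed tool choice).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its mask token; the model predicts the hidden word.
for prediction in fill_mask("The capital of France is <mask>."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```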
Yeah, absolutely.
I mean, it's fascinating because I think that relationship today we almost take for granted.
If we extrapolate that relationship, more data equals more powerful models.
And as these models do become more powerful,
the reaction of many is to question whether this technology will take our jobs.
Or an even further extrapolation, whether it'll make us as humans
completely obsolete. And while people often cite games like chess, Go, or StarCraft,
as examples of where bots have definitively beat humans, there's actually another story that
can be told. For one, people are still playing chess decades after the infamous 1997 match
between Deep Blue and Kasparov. In fact, you could argue that chess is more popular than ever.
Here is Barry McCardel, founder of Hex, a data science platform that's integrating AI,
illuminating how the story is much more dynamic than human versus bot,
and how there's a much more helpful lens of what we can achieve together.
In 1997, Deep Blue, the IBM chess bot, beat Garry Kasparov in this very famous televised game.
Here's a photo of our grandmaster holding his head in his hands.
And this was a really seminal moment in AI research,
and it inspired a whole generation of computer nerds, myself included.
And it also spawned a ton of headlines about AI taking over and the end of humans.
I found, while I was researching this, one article that had a big picture of a Terminator killer robot on it.
Well, it's been a while and it didn't quite work out that way.
20 years later, in 2017, a robot kicked our ass at Go.
Here's our human champion.
Also holding his head in his hands.
Apparently this is the universal surrender pose when you have been defeated by a computer in a game.
And Go is a famously sophisticated and nuanced game.
So this was a really big deal.
And once again, it spawned all these articles
about the end of humans and all this stuff.
A couple years after that,
the same research lab developed a model
that could beat humans in StarCraft.
Here we go.
Another photo of a human surrendering
by holding his head in his hands.
I don't know about y'all, but
I played hundreds of hours of StarCraft in high school and college.
So this was a really big deal for me.
And it was also a really big deal
because StarCraft is famously complex.
You have imperfect information,
multiple races, units,
you have to balance scouting
and resource collection and all-out combat.
It's great.
And that an AI could play it at a human level
was really, really impressive.
And once again,
it spawned all of these articles about the end of humans,
with the twist that this time
we had trained a bot to engage us in space combat,
which I think seems especially alarming.
So three in a row,
we have a bad record, right?
Computers are beating us. They're superior to humans. We are on our way to becoming obsolete.
But as it turns out, it's a little more complicated than that. And there is something that actually
does a better job than a human alone or a computer alone, and that is a human with a computer.
And when you look at these games again through that lens, you actually get some different and more nuanced
results. So let's go back to chess. A few years after that game, Garry Kasparov, or actually many years
after the game, Garry Kasparov organized a tournament where humans could play with computer support.
And there were some really highly ranked grandmasters playing, and at the time, the cutting-edge
chess AIs that had been developed. But there were two amateurs who swept the whole field,
and they did this not by having better human chess skills, and not by having a better chess bot.
They had developed a model that they had programmed to be able to work with. They were working in tandem with it.
It was effectively inferior humans and an inferior model,
but they had found a way to work together
to beat superior humans and superior models.
The same exact thing just happened in Go a few weeks ago.
I don't know if folks caught the headlines for this,
but there was an amateur, it's always the amateurs, right,
who developed a model that he was able to work with
that understood and studied the weaknesses in the leading Go bot
and was able to defeat it.
And it wasn't, again, that this amateur had like a better Go model
that he had developed.
He had figured out a way to partner together
with this AI to enhance their performance.
And in StarCraft, it's still not the case
that AIs are routinely able to beat
our human professional players.
In fact, a big reason for that is
because human pros now are developing and relying
on techniques that were first pioneered by bots.
We're using these models to understand strategies
that humans can then uniquely go and execute against.
And so three cases in a row of AI actually elevating,
not eliminating human performance.
Humans are better because of AI.
We're able to work with AI to improve.
I think it's also worth mentioning
all of these games are as popular as ever.
Humans clearly still enjoy them,
even though there's an AI that might be better
than them as individuals.
So this is an example of something that sounds a little sci-fi,
but it's called Human Computer Symbiosis.
This was first proposed by this guy,
J.C.R. Licklider, in 1960,
and he has this awesome paper.
And for something written, like, 63 years ago, it really holds up.
And he has this quote that I have drawn inspiration from for quite a while,
which is, computers can do the routinizable work
to prepare the way for insights and decisions in technical and scientific thinking.
This was 63 years ago, and I think this is exactly what we're seeing happen now.
It's not about computers replacing humans.
It's about them working together cooperatively to solve a problem.
And I think this is the next step for AI.
I think it's the next step for humans.
You can have the computer doing the routine tedious work so humans can do the creative, interesting stuff.
We're a room of humans.
Our most fulfilled amazing days as humans are the days that we are spending doing creative and interesting work
and not doing the tedious drudgery stuff.
And I think AI is here to help us achieve that state of fulfillment.
Now, I'm going to bring this into the domain that I think a lot about.
I've been working in data, data science, data analytics my whole career.
I am now the founder and CEO of a company that builds a data science and analytics tool,
and our product is used by thousands of data practitioners every day,
and we see them do some really creative, interesting stuff.
I think data practitioners are creatives.
I know it's not the first thing that comes to mind
when I say creative, you think of artists or whatever,
but think about what data scientists do in their day:
they're asking questions, they're forming hypotheses,
they're testing new things,
they're building narratives, they're taking risks, they're telling stories.
This is good data science, it's good data analytics,
and it's what we expect from our data teams.
And it's an art and a science and a great use of human time.
But data work can also be really tedious.
They spend a lot of time writing boilerplate and fixing dependencies
and tracking down missing parentheses in a query.
It can be more plumbing than science sometimes.
And this is where I think people wind up spending a lot of their time
and really is a blocker to them doing their best work.
And so this really feels like a perfect opportunity
to bring human computer symbiosis
into this creative profession.
Now, when most people think of this,
they assume it means kind of just replacing data teams
with a magic insights text box.
Like, the next step is we'll all buy solutions
that then our stakeholders or executives
will come in, they'll write a question,
it'll give them a magic response back,
properly formatted charts,
and well-reasoned explanations
and full business context.
But that doesn't really work.
And it doesn't work, one,
because these models aren't perfect.
They can hallucinate,
they're missing a lot of context,
they don't understand the full situation of things,
but also that humans want to be able to hear a story
and understand and ask and answer questions
of a human around these things.
And so we actually tried this.
At Hex, we had built a UI
that was really sort of a little more black box.
You typed a question and it would bring you an answer back.
And it got pretty good results,
but it was missing the human element.
And we learned the same lesson,
the same thing J.C.R. Licklider posited,
the same thing we learned through all these games,
that for now at least, the best approach
is one where humans and computers
can work together to elevate performance.
And so the features we launched in our product last month
were built around these principles,
and I think there's a lot of takeaways here.
We built these features, they're called Hex Magic,
and they're built directly into the UI
that thousands of data scientists and data analysts
already use every day.
They bring powerful large language models,
the latest models from OpenAI,
integrated directly into our product.
And you can ask it to do all sorts of things,
from writing queries to building visualizations.
My personal favorite is called Magic Fix: when you have an error in your code,
it will automatically detect and fix it.
And as someone who has more and more errors in my code every day,
that is a very useful thing.
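To illustrate the general pattern behind a feature like Magic Fix, here is a hypothetical sketch: run the code, catch the failure, and hand the code plus the traceback to a language model for a corrected version. The call_llm helper is a stand-in for whatever model API a product would use; this is not Hex's actual implementation.

```python
# Hypothetical sketch of a "detect the error, ask an LLM to fix it" loop.
import traceback

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call (wire up your provider of choice)."""
    raise NotImplementedError

def run_with_auto_fix(code: str, max_attempts: int = 2) -> None:
    for _ in range(max_attempts):
        try:
            exec(code, {})  # run the user's analysis snippet
            return
        except Exception:
            error = traceback.format_exc()
            # Send the failing code and its traceback back to the model,
            # ask for a corrected version, then retry.
            code = call_llm(
                "Fix this Python snippet so it runs without errors.\n\n"
                f"Code:\n{code}\n\nError:\n{error}"
            )
    raise RuntimeError("could not automatically fix the snippet")
```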
But the key thing here and the thing we really realize
is that the thing that we are in the business of doing
is to enhance and benefit humans.
It's to work with humans, not replace them.
We've found that we can elevate and accelerate human intuition.
And that's what our users tell us.
We had a user tell us they can spend more of their time
doing the creative, interesting part of their job
and less time doing the tedious plumbing.
And that is so exciting to me
because I think that is a little beginning.
It's a foothill of the ultimate value
that AI can provide in our lives.
It's human-computer symbiosis in action.
All right.
That is all for these exclusive segments
from our data and AI forum.
Hopefully, that gets your wheels spinning
and illuminates how much opportunity
there still is to build here.
We've got lots more AI coverage to come
as this field moves very
quickly, but for now we'd encourage you to go check out the companies that participated here.
So that's coactive.ai, character.ai, and hex.tech.
We'll include all of that in the show notes, but I also wanted to call out if you like these
kinds of episodes. This one being a compilation episode, please let us know. You can always email
us at podpitches@a16z.com. And if you haven't noticed already, we're doing a lot of testing
here in format, ideas, guests. So if you like something, if you hate something, if there's
certain topics you'd like to see more or less of, different guests you'd like to see on the
podcast, please do let us know. We love hearing your feedback and thank you so much for listening.
Thanks for listening to the A16Z podcast. If you like this episode, don't forget to subscribe,
leave a review, or tell a friend. We also recently launched on YouTube at youtube.com/a16z_video,
where you'll find exclusive video content. We'll see you next time.