No Priors: Artificial Intelligence | Technology | Startups - The Future is Small Models, with Matei Zaharia, CTO of Databricks

Episode Date: April 6, 2023

If you have 30 dollars, a few hours, and one server, then you are ready to create a ChatGPT-like model that can do what's known as instruction following. Databricks' latest launch, Dolly, foreshadows a potential move in the industry toward smaller and more accessible but extremely capable AIs. Plus, Dolly is open source and requires less computing power and fewer parameters than its counterparts. Matei Zaharia, Cofounder & Chief Technologist at Databricks, joins Sarah and Elad to talk about how big data sets actually need to be, why manual annotation is becoming less necessary to train some models, and how he went from a Berkeley PhD student with a little project called Spark to the founder of a company that is now critical data infrastructure that's increasingly moving into AI.

No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.

Show Links:
Hello Dolly: Democratizing the magic of ChatGPT with open models
Dolly Source Code on GitHub
Matei Zaharia - Chief Technologist & Cofounder - Databricks | LinkedIn
Matei Zaharia - Google Scholar
Databricks debuts ChatGPT-like Dolly, a clone any enterprise can own | VentureBeat

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Databricks | @Matei_Zaharia

Show Notes:
[01:29] - Origin of Databricks
[04:30] - Work at Stanford Lab
[05:29] - Dolly and Role of Open Source
[12:30] - Industry focus on high parameter count, understanding reasoning at small model scale
[18:42] - Enterprise applications for Dolly & chatbots
[25:06] - Making bets as an academic turned CTO
[36:23] - The early stages of AI and future predictions

Transcript
Starting point is 00:00:00 So we really wanted to see whether it's possible to democratize this and to let people build their own models, you know, with their own data, without sending it to some centralized provider that's trying to sort of learn from everyone's data and, you know, kind of control their destiny in this space. This is the No Priors podcast. I'm Sarah Guo. We invest in, advise, and help start technology companies. In this podcast, we're talking with the leading founders and researchers in AI about the biggest questions. If you have $30, a few hours, and one server, then you're ready to create a ChatGPT-like model that can do what's known as instruction following. The latest launch from Databricks, Dolly, which is available open source, foreshadows a potential move in the industry towards smaller and more accessible, but extremely capable, AIs. Matei Zaharia, co-founder and chief technologist at Databricks,
Starting point is 00:01:04 is here to tell us all about Dolly. We'll talk about how big data sets actually need to be, why manual annotations becoming less necessary to train some models and how he went from a Berkeley PhD student with a little project you may have heard of called Spark, to the founder of a company that's now a critical data infrastructure that's increasingly moving into AI. Welcome to the podcast, Matti.
Starting point is 00:01:22 Thanks a lot. Excited to be here. Can you start by telling us a little bit about the origins of Databricks and how it led you to where you are today? Sure, yeah. So Database started from a group of seven researchers at UC Berkeley back in 2013, and we were really excited about democratizing basically the use of large data sets and of machine learning. So we had seen, you know, the web companies at the time were very successful with these things, but most other companies, you know, most other organizations, things like scientific labs and so on weren't. And we were really excited to look at making it easier to do computation on large amounts of data and also to,
Starting point is 00:01:59 to do machine learning at scale with the latest algorithms. So we had started, you know, during our research, we worked with some of the web companies. We also started open source projects, like most notably Apache Spark, which, you know, was essentially, you know, the first version of it was my PhD thesis. And we had seen a lot of interest in these. And we thought, you know, it would be great to start a company to really reach enterprises and make this type of thing much better and, you know, actually allow other companies to use
Starting point is 00:02:27 this stuff. Can you just give us a sense of what data? Databricks looks like today from like a, you know, scale and product suite perspective. Sure, yeah. So Databricks offers a pretty comprehensive data and ML platform in the cloud. It runs on top of the three major cloud providers, Amazon, Microsoft, and Google. And it includes support for, you know, data engineering, data warehousing, machine learning. And most interestingly, all this is integrated into one product.
Starting point is 00:02:53 So, for example, you can have one definition of your business metric that you use in your BI dashboards. And the same exact definition is used as a feature in machine learning. And you don't have this drift or copying data, and you can just kind of go back and forth between these worlds. The company has about 6,000 employees now. And last year, we said that we cost a billion dollars in ARR and we're continuing to go. It's a consumption-based cloud model where, you know, customers that are successful can go over time and bring in new use cases and so on. Did you think the opportunity was as big as it has been when he started the company? Yeah, well, we definitely didn't, you know, anticipate necessarily to go to this size, right?
Starting point is 00:03:34 A lot of things can go on, but we were excited about the confluence of a few trends. So, first of all, you know, it's so easy to collect large amounts of data and people are doing it automatically in, you know, many industries. And second, cloud computing makes it possible to scale up very quickly, do experiments, scale down, and so on, which enables more companies to work with this kind of thing. And then the third one was machine learning. So we thought, you know, these are powerful trends. And the exciting thing for, you know, us as a company is we didn't invent cloud computing. We didn't necessarily invent big data or anything. But we were able to start at a point in time when many companies were thinking to move into this space
Starting point is 00:04:15 and just provide a great platform for that. And there's this migration already happening. And, you know, if you provide the best platform as people are migrating to the cloud, they'll consider it. You still keep roots in research. You have a research group at Stanford. Can you talk about that? Yeah. So I'm a computer science professor there, so I split my time between that and
Starting point is 00:04:35 Databricks. And we work on a bunch of things. We usually like looking farther ahead into the future. And we've worked a lot on scalable systems for machine learning, how to do efficient training on lots of GPUs and stuff like that, or how to do efficient serving. And then another thing I'm really excited about that we started about three years ago, is looking at knowledge-intensive applications where you combine a language model
Starting point is 00:05:00 with something like a search engine or an API you call or something like that and you try to produce a correct result maybe for a complicated task. Like do a literature survey and then like tell me what you found about this thing with a bunch of references or counter-arguments or whatever. And I have a great group of PhD students
Starting point is 00:05:18 that are working on that and, you know, exploring different ways to do it. How did Databricks decide to start working on Dolly. Let's spark that and, you know, how did you first get going on that? Yeah. So we've had customers working with large language models of various forms, you know, even before Chad GPD came out, but they were doing the more standard things like translation or sentiment analysis or things like that. A lot of them were tuning models for their specific domains. I think we had like almost a thousand customers that were using these in
Starting point is 00:05:48 some form. But then when Chad GPD came out in November, it got people interested in, you know, using these for a lot more than just analyzing a bit of data and instead creating entire new interfaces or new types of computer applications, new experiences in them. And so there was an intense interest in this, even at a time when, you know, the industry in general is being conscious about spending and like which things are really required and so on. This was an exciting one. And the really exciting thing about Chad GPD, as you both know, is the instruction following or basically the ability of it to kind of carry on a conversation and listen to the things you're telling it to do and do those,
Starting point is 00:06:27 as opposed to just completing text or just telling you a small amount of information, like this is a positive or negative sentiment. So we really wanted to see whether it's possible to democratize this and to let people build their own models with their own data without sending it to some centralized provider that's trying to sort of learn from everyone's data and kind of control their destiny in this space. We were exploring different ways of doing it,
Starting point is 00:06:52 And in particular, like Dolly is partly based on this great result from some other faculty members at Stanford called Alpaca, where they tested a way to, you know, basically they used the model to generate a bunch of realistic conversations, and then they use this to train another model that can now kind of carry on conversation on its own. And so we tried essentially cloning that approach, but starting with an open source model, and it actually worked pretty well. And so that's what became Dolly. But yeah, we've been looking at the space for a while and seen, you know, incredible demand for these kinds of applications. Yeah, I think the industry has really been very focused on scaling data, parameter size, and flops. And I think you all really have showcased the power of instruction following, even, you know, something that's relatively smaller scale. Could you explain that and how that all works? It's very interesting.
Starting point is 00:07:42 And I think there's actually a lot of research still to be done here because these models have been mostly locked up in these very large, companies for a while, and everyone thought it's too hard to reproduce them. So the interesting thing is, you know, language models had existed for a while. You basically trained them to complete words. You know, here's a missing word in the text, can you fill it in? And then at the beginning, when people tried to apply them to real applications, not just, you know, I erased a word on my homework, like fill it back in, but like actual applications, they had always done various ways of, you know, training something else on top of, you know, say, the feature representation in these. And so there was a lot of domain-specific work, but you could build like a sentiment, classifier, or stuff like that.
Starting point is 00:08:25 Is it positive or negative? Probably like three years ago now. OpenAI published a GPD3 paper, which is called Language Models are a few-shot learners. And they said, number one, like we trained a language model to 175 billion parameters. And we trained it on, I think it's like 45 terabytes of text. So lots of data, lots of parameters. and it's like pretty good at language modeling. And then number two, they said,
Starting point is 00:08:51 you can actually kind of prompt this with a few examples of a task, and it picks up on the task and does it. So lots of people were working on that. How do you prompt it? What's the best example to show? But everyone assumed that for that capability, you need a giant model to begin with.
Starting point is 00:09:06 So even the researchers in academia were calling into GPD3 and trying to build stuff based on it and study this phenomenon. on. And then last year, 2022, OpenAI published this other paper, which was sort of instruction tuning these models, where they said, hey, we used some human feedback and then some reinforcement learning, and we got this GPD3 model to actually just listen to one instruction. It doesn't need a complicated prompt with lots of examples, and it kind of works. And then
Starting point is 00:09:37 they released a version of this as chat GPD. So I think in a lot of people's minds, the scientific view of it was, first, you need a giant model, and then you need this reinforcement learning thing, and only then do you get this conversational capability and broad world knowledge. So it's actually very surprising. In Alpaca, we just had a larger data set of, you know, human-like conversations, and we had this very kind of modest-sized open-source model that's only six billion parameters, only trained on less than one terabyte of text, so like 50 times less data than GPD3, and it still has this behavior. I think it's been pretty surprising to a lot of researchers, the size of model that still
Starting point is 00:10:19 gets you this kind of instruction following ability. So I think this is kind of an open research problem, like what exactly about these data sets is it that makes them good at this? What are the limitations? You know, are they tasked that these are clearly worse at or better at? It's actually kind of hard to evaluate with long answers because it's hard to like automatically score them and say, you know, like, this. This is a good Seinfeld skits that you generated, and this is like a bad, you know, Barack Obama speech.
Starting point is 00:10:47 But I think we'll figure this out. Yeah. Were there anything that emerged from the model that you also found surprising? Like you mentioned one aspect of it just in terms of the approach you took. And, you know, with dramatically more limited data and approach, you ended up with really performant behavior. Were there other things that were unexpected properties of what you did with Dolly? Yeah. I think to me, the most interesting thing is it's surprisingly good at just free-form.
Starting point is 00:11:11 a fluent text generation. So you can tell it to, like, create a story or create a tweet or create a scientific paper abstract. And it does a pretty good job at that. And before that, whenever I talk to my, you know, NLP, like, researcher friends, they thought that that creativity was the thing that required a lot of parameters from something like GPD3. Like, they actually told me, oh, the knowledge-intensive stuff, like remembering facts,
Starting point is 00:11:36 tell me the capital of, like, France and whatever. that's not surprising that a small model with a few parameters can do it, but the creativity, that's, like, really hard. So this one is actually pretty good at the creativity and generation. It's less good at remembering lots of facts, which kind of makes sense, given the parameters. So if you ask it about common topics, you know, it'll be good. If you ask it, like, the author of a book, you know, it might give the wrong one.
Starting point is 00:12:00 I think we had an example because we've actually been building a slightly bigger version of this too, and we had this question with, like, who is the author of Snow Crash? which is Neil Stevenson, and the initial Dolly model said Neil Gaiman. So, you know, it's still a Neal, it's still an author, but it's still a sci-fi writer. Yeah, so it's less good at remembering facts, but pretty good at coherent sort of generation.
Starting point is 00:12:23 Yeah, the name Dolly basically references the first cloned mammal, Dolly the Sheep. Can you explain the reference within the AI space? Yeah, so it's based on, you know, cloning this other model from Stanford called Alpaca, but doing it with an open dataset. And that itself was based on something that meta released, I think, maybe three weeks ago or less, called Lama, which is they took a modest size model, seven billion parameters, and they trained it on a ton of data. I think they said 1.4 trillion tokens or something like that, which is, I don't know how many bytes of data it was, but it was multiple terabytes of data, basically. And they said, hey, by just training this for longer, we got a small model that's actually producing pretty high quality content for its size. So there were all these kind of, you know, woolly sort of animals out there. And we thought it's just too perfect
Starting point is 00:13:12 to like clone it. And there are all these other things like, you know, it's like the Dali Lama. I don't know. There are all these like things. Yeah. That was a great name. Yeah. That's a good name. Yeah. Are there other things that you can share that you all have coming in the background at Databricks or your Stanford lab in terms of this more general area of language models? Yeah. I mean, at Databricks definitely, you know, we're using everything we learn from Dali and we're learning from our customers to just offer a great suite of tools for training and operating LLM applications. We already have a popular MLOPS platform, and we also have this open source project called MLFlow that integrates with a lot of tools out there that our offering is built around. So you can expect some nice integrations into that.
Starting point is 00:13:56 You know, separately, we're also working on Databricks product features that use language models internally and learning a lot from developing those and, you know, feeding that into our products. I think in the next few months, you can expect it. And we also have this big user conference data AI summit coming up in June that will probably have a lot of stuff about this. And I would say, you know, as a researcher and also kind of with my Databricks hat on, the thing I'm most excited about is really connecting these models with reliable data sources and making them really produce reliable results. Because, you know, if you use chat GPD or GPD4, the two big problems with it are number one, Like, the knowledge is not up to date.
Starting point is 00:14:38 You know, it only knows stuff it was strained on. And number two, a lot of the things it says are inaccurate, and it's confident but, like, wrong in various ways. And I think you can tackle both of these by combining some kind of language model with, you know, a system that, that, you know, pulls out, like, vetted data, either from documents, like a search engine or from, you know, APIs and tables and stuff like that inside your company. You know, like, for example, when I talk to the chat,
Starting point is 00:15:06 But in my bank, it should know my latest bank account balance and transactions and stuff. You know, if I'm like, can you cancel the payment I made because I unsubscribe? You should just know what that means. So cracking how exactly to do that isn't easy. It may actually be easier with small models than with big ones to reduce hallucination from them. But I think it's still an open question. But I think if we can figure us out, then these become a much more reliable component in an application. Maybe we'll go from there to just like projecting a little bit.
Starting point is 00:15:36 about, like, architecture and research. You know, so much of the industry is focused on model scaling, right, and improving reasoning that way. Like, how much do you think that matters in terms of, I guess, like real world usage in production with your customers in the near term? Yeah, great question. So to me, at least the relationship between scale of the model versus, you know, quality of the data and supervision you put in versus, like, design of an application
Starting point is 00:16:04 around that. And those things and like overall quality, I think the relationship is not 100% clear yet. Like to get a really reliable model that say, I don't know, can, you know, like make a pharmacy prescription or something like that, maybe you need a trillion parameters,
Starting point is 00:16:20 you know, maybe you actually need a really carefully designed data set and like supervision process, which is kind of traditional sort of ML engineering type work. Or maybe you actually need a clever application where like you're chaining together a couple of models and things and you're saying, well, does this make sense? Can I find a reference?
Starting point is 00:16:38 Can I show this example to a human if it's really hard? So I think it's a little bit open. The thing I can say for sure, especially and Dolly and like other, you know, results like this really highlighted is it does seem that the core tech is getting commoditized very quickly. So if you just want to run, you know, something like today's chat GPD, it will be a lot cheaper because all these hardware manufacturers are building devices that are specialized and much cheaper.
Starting point is 00:17:05 And another thing that's making it less expensive is we're figuring out ways to get a smaller model with less data, fewer parameters and stuff to get similar performance. So that, I think, is happening faster than at least I would have thought, you know, a few months ago. So at least to get something with today's capabilities, I think it will be very affordable
Starting point is 00:17:24 and you might just be able to run it locally on, you know, your phone or something. The question of how large can, you know, if you make a much larger model, is it going to be a lot smarter? I think it's still a bit unknown. I mean, there are people who argue it's going to be very good at reasoning, but at the same time,
Starting point is 00:17:41 this kind of token by token generation we're doing now is not an amazing format for reasoning because you have to linearly say one thing at a time. So it's not really good for making plans or comparing versions. I think to get a really smart application, you'll need to combine today's language modeling with some other sort of framework around it
Starting point is 00:18:00 that uses it multiple times or explores a plan space or whatever, and then you might get something good. And it's also possible that the very largest models are simply memorizing more stuff. So, like, they're impressive in terms of trivia. Like, I can ask it about some random topic and it will know, but they're not really, like,
Starting point is 00:18:17 smarter at solving even a basic, you know, word problem. So, yeah, I'm not sure. Unfortunately, especially with training from the web, it's often very hard to tell apart, like, reasoning from memorization, essentially. did I see that thing before. So I think actually being able to do experiments where you train these on carefully selected data and will lead to better understanding of what they can do.
Starting point is 00:18:42 Yeah, yeah, that makes sense. Maybe if we think a little bit just because you have great visibility from your role at Databricks, like what are they dwelling to companies need, like your enterprise customers or just generally enterprises need to make use of these models? Because you said, you know, we believe the core technology, the models themselves are getting commoditized. Yeah. So definitely the first piece is you need a data platform that could actually build, you know, reliable data, right? So we think that's like the bread and potatoes of like getting anything. You need some, you know, a basis to like sort of build on. So we think that will become really important. And, you know, maybe data platforms will have to evolve a little bit to be better at supporting unstructured data like text and images and so on and to do quality assessment and stuff like that for it. That's one piece. I think another piece you need. is you need the MLOPS piece of being able to experiment with things, deploy them, A, B, test them, and so on, and see what does better and improve it incrementally.
Starting point is 00:19:41 And I also think these models will need a good connection to operational systems inside the company to do really powerful things with, like, the latest data. So, you know, you saw probably the support for tools in chat GPD. You know, before that, there were lots of groups working on at least models integrated into search engines, sometimes in. to calling other tools as well, like calculators. I think it's still a little bit open-ended. There's one extreme where people say the model will figure out what tools to use on its own.
Starting point is 00:20:10 I think for, like, enterprise use cases, that's a little bit, like, more than you really need. You know, you can kind of give it some tools and feed it stuff, and it doesn't have to discover and, like, read the manual to figure out which one to use. But, yeah, I think that's another piece you'll need for, like, really powerful applications. And then I do think infrastructure, like just basic training and serving infrastructure, is important, too, when you start to care about performance, like about latency and speed. And you can see some of the new search engines using these models are not that fast, right? Like a little bit slow, you know, it would be nice to have it faster.
Starting point is 00:20:44 And for automated analytics, it's even more important that it's efficient. So there could be, I think there'll be a lot of activity there. Yeah. Where do you see enterprises getting the most value from investing in, I guess, more traditional ML and then some of the language model stuff to date. Yeah, great question. So traditional ML, we're seeing actually virtually all major enterprises, you know, and all industries are using it.
Starting point is 00:21:09 It's changed a lot in the past decade, actually. So it's very good for forecasting things in general and for automating certain types of decisions. So basically, like, for example, optimizing your supply chain, right? You don't have time to look at, like, exactly everything that's going on and, you know, think about it and have a meeting. But if you do order the right amount of parts to meet your demand this week, or if you minimize the amount of time, you know, an agricultural product like sits in a warehouse and like, you know, degrades in quality or stuff like that, it matters a lot. And it can have a huge impact on the profitability of a company. So we're seeing a lot of that people applying it to automate, you know, supply chain and to automate basically their operations in various ways. And then there are more classic use cases like fraud detection and stuff like that. We're also, you know, it's all. always a norms race and you're trying to do the best you can because every percent of like
Starting point is 00:22:01 accuracy you do better and can translate into, you know, huge impact. With language models specifically and especially with kind of conversational ones, I think the, you know, the really exciting thing is interfaces to people. And I think customer support is a very obvious one. Maybe things like recommendations or asking questions on a product page, you know, in retail, things like search augmented with stuff is one. And we've also found that just internal apps in a company that have a lot of internal data can benefit from this kind of thing. So like one of the things, you know, we've built, for example, is inside Databricks. We have all these resources for, you know, engineers to understand how different parts of the product work, how to operate it, like all the
Starting point is 00:22:45 APIs. And, you know, people used to just ask each other questions in these Slack channels for each team. And we could use that data, like the questions and answers plus the data. you know, in the actual documentation to, you know, essentially automatically answer many, many such questions and just save people a lot of time. So I do think that any app that has kind of business data or like stuff written by humans in it, like your issue tracker for your software development or like your Salesforce or something like that could benefit from these kind of interfaces. Yeah. Yeah, it seems like any type of forum or anything else instantly becomes like data that you can use
Starting point is 00:23:22 to fine tune or train a model that's specific to your customer support use case or you could use an embedding or something to do interesting things with it. So it seems like some really cool stuff to do. Are there any specific areas that Databricks is not focused on that you think would be especially interesting for somebody to build from a tooling perspective for enterprises trying to use some of these technologies? Yeah, I think there are a lot of these. I think it's very early on. Probably one of the most obvious ones is just the domain or vertical-specific models and tools. I actually think even a lot of the enterprises that have a lot of the data in various domains might turn more into data or model vendors of some form in the future as they
Starting point is 00:24:01 use this to build something that no one else can. So I wouldn't be surprised at all if you see like the next wave of companies for, say, security analytics or like, you know, biotech or analyzing financial data or stuff like that, really build around LLM technology in there. And I also think in general, in the app development space, like, how do you develop apps that incorporate these tools? It's very open. It's not clear what the best way to do it is. And, you know, you might end up with, like, really good programming tools that focus on this problem. I would say, you know, for people thinking about startups and so on, like you want your startup to have, you know, a long-term defensible mode, ideally something that grows over time also. So anything around the unique data set,
Starting point is 00:24:44 for example, or unique, like, feedback interaction you have is always good. it, right? Like, honestly, even something like adding ML features in your product that just kind of learn from your users and, you know, do better recommendation and so on could eventually become a motor like, you know, others just can't easily catch up. But I think that, you know, anything that's on custom data sets is sort of safest. Yeah. When you were working on Spark for your PhD, did you think you'd become a founder? Was your intention to start a company? Or did you just think it was interesting research to do or both? No, it really wasn't. Yeah, I mean, as aggressive, you know, I've always been interested in just like doing, you know, things that help people that have an impact, help people do cool things. And, you know, I had seen these open source technologies out there for distributed data processing. I thought, okay, well, I'll try to start one and see how it goes. You know, I wasn't sure that people would really pick it up and use it. But I wasn't looking to be a founder necessarily. I was just looking to do something useful in this like emerging space. And honestly, I
Starting point is 00:25:47 I was at least considering to be a computer science professor, and I thought, if I'm going to be a professor and all the most exciting computing is happening in data centers today, and I don't know how that works, how am I going to teach computer science to people? So I better learn about that stuff. But it turned out to be something more broadly interesting. What was the most unexpected thing about being a founder? There are a lot of challenges along the way. I think just being able to learn about all the aspects of a business and how much complexity there is in each one. You know, starting out as a more technical person, at first, I didn't really know what they expected, but there's a ton of depth in each one. And if you understand them, if you, like,
Starting point is 00:26:27 really try to understand them, get to know the culture of people there, like, really get to know what they're thinking about. You can make much better decisions across multiple aspects of your company. Is there anything that you would advise people coming from a similar background to years. I have a PhD as well, although it's in biology. And I feel like there's certain things that I learned in academia that was really valuable. And then there's a bunch of stuff I really needed to unlearn as I went into industry. There's specific pieces of advice you'd give to technical founders or PhD founders in terms of things that they should unlearn. Let's see. Like a lot of research, at least in computer science, the kind of stuff that I've worked on, a lot of research is basically
Starting point is 00:27:04 is mostly prototyping. It's like, can we showcase an idea? But it's not really software engineering of like, we'll build a thing that can be maintained and, like, runs flawlessly in the future and, like, supports, you know, problems. So I think you should kind of unlearn just the focus on short-term stuff and think about how is this going to go over time. Eventually, right, there is a phase of the company where you're just prototyping to get a good fit, but you should design things so they can evolve into, you know, into something that's very reliable long-term. The other thing is, you know, I think unlearn trying to invent everything from scratch, you should really be careful about like, hey, where am I doing something unique? Or if I'm doing something different
Starting point is 00:27:43 from others, like, why is it? Right? Don't do it just for kicks. So, because in research is very tempting to say, you know, I did this new thing. I'm going to, you know, I'm going to try all the fanciest, like new ideas in each component of it. Was there something that you guys like experimented with being like, you know, first principles unique about that you then said, you know, there are systems for this? A good one early on was deployment infrastructure for like, how do we deploy and update our software across, you know, all the clouds and so on. And we soon realize it's better to go with really standard things like Kubernetes and tools like that than to try to do something custom because they're evolving very quickly. So yeah, that's kind of a good
Starting point is 00:28:23 example where like you, at the beginning you say, ah, how hard can it be? You know, let's just build something. But then you realize, wait, every month there's like new stuff coming out and maybe this isn't where we want to focus on. So maybe just thinking about being like CTO now of a very large company, like how is your lens as a researcher, computer science researcher, informed your thinking as a CTO? I think, first of all, as a researcher, like, you think a lot about the long-term trends. Like, what, you know, what could things look like five or ten years from now? What's the, what's kind of the fundamental things here?
Starting point is 00:28:55 So, for example, this thing about LLMs being commoditized and, or honestly, the thing about them kind of maxing out at more parameters, I think many people hadn't really thought about that. But if you think back, like, you know, there is a lot of room to improve efficiency usually in hardware and software for an application. And this particular application is kind of simple because it is all basically like, you know, two or three different types of matrix operations. So like it's sort of the hardware designer's dream to do this stuff. And also there are usually diminishing returns from scale in terms of quality of models in general. And you can also kind of see it in other areas like in computer vision, for example. we don't have trillion parameter models.
Starting point is 00:29:38 You get actually pretty small models that you can train for a specific task that are good. Self-driving cars is another example. You know, they rapidly improved in quality up to a point and then they kind of plateaued and they're still not really, you know, ready for prime time. Eventually, you hit some limits. There are plenty of people who are researchers in the field
Starting point is 00:29:59 who don't really see an asymptote, right, with scaling. And so where do you believe that limit comes from? like parameters, compute, data, something else. I just think a lot of things like scale sublinearly in general. Now, it's hard to tell for things like reasoning and so on. But certainly in classical machine learning, like, for example, if you're trying to learn a function that like separates positive and negative examples, and as you add more data, like your accuracy doesn't really improve linearly.
Starting point is 00:30:30 Like, you know, with a few examples, you get a pretty good estimate of that boundary. And then with more of them, it gets a little. little bit better, but it doesn't get like that much better. So it's just, I think it's common. That would be my main reason. Now, the one thing, so with language model specifically, I think the part that does go linearly with more parameters or should is ability to just memorize more stuff. So if you wanted to tell you, like, who was on the fifth episode of like friends and like what was the second line they said and stuff like that, like, yeah, more parameters will get you a neural network that just by putting that input, I can tell you that stuff. But that wasn't that
Starting point is 00:31:07 interesting to me because I think the right solution for that is look things up in a database, like do heat evil. I do a search index. Actually, I think from a computation perspective, it's very inefficient to have like a trillion parameters and have to actually load them all and add and multiply by them each time you make an inference because they're just encoding knowledge, most of which you don't need for that inference. So that one I wasn't as excited about, But I think there are people who are just excited about neural networks, like, how do, you know, it's the same kind of people who wonder, like, how do brains work, like, how do animals learn, who are just excited about, wait, I only had some neurons and I put in this stuff and it remembered it.
Starting point is 00:31:45 But as an engineer, I'm not that excited because I'm like, yeah, I could have built a database that did that. But in terms of like, hey, I just trained a network with gradient descent and it did it. That is kind of cool. Yeah, I feel like people are almost the opposite where we're actually quite bad at memorization we're very good at inferring things. And so it's interesting to ask what is the basis for that computationally. Yeah. But the other thing that we're learning, though, from this is it does seem that the type of data you put in and the kind of fine-tuning,
Starting point is 00:32:13 essentially it's like weighing the data has a lot of impact. So this instruction tuning stuff is like, really, we have only a few examples of instruction following, but since we do fine-tune the model, it's as if we put a very high weight on it and had lots of examples of that in our training set. And I mean, I think it's still an open question. Like, for example, if you made a lot of examples of logical puzzles, right? Like you just generate some problems and solutions. Would you get a model that's better at logical reasoning?
Starting point is 00:32:41 There are other things you can do. I also think a big problem with current models, I think I hinted at this before, is we're just calling them to generate one token at a time. So, for example, you've probably seen this chain of thought reasoning thing. Like, if you ask a model a math problem and it just tries to answer, like how many sheep were there. It might say like seven or something. And then it tries to make up the explanation.
Starting point is 00:33:04 And it's like wrong. But if you tell it, do the explanation for things step by step and then answer, it's more likely to be right. But you can imagine other versions of that. Like if it had a scratch pad, if it had a way to backtrack to say, you know, this is kind of a dead end, it might become better. So I think stuff like that that's kind of around the model. It's still an AI system, but it's not just one giant DNN can further improve its ability.
Starting point is 00:33:29 Yeah. And you've seen that work in like really complex and impressive ways. Like we had Noam Brown from the sort of Cicero group on and, you know, they have planning as part of it, right, versus it's just one one very large model. They expect to do all the reasoning. You were actually saying like, you know, you basically make like sometimes controversial like long-term predictions about what's going to happen. Like, you know, there's an asymptote in sort of value of scale. Yeah, as a researcher. And how does that impact like your decisions as CTO? So especially as a company goes, right, it's like actually it becomes slower to change direction super dramatically. So you really want to think about like what will be too long term or, you know, our CEO or Lee has this
Starting point is 00:34:14 decision rule of like with any decision I ask about like, hey, which one am I like sort of more likely to regret like five years from now? Not five months from now, but like, you know, if I don't do this or whatever, like, what's going to happen. So you try to think about where things are going to go. Of course, you do want to collect data and, like, sort of update your thoughts about it, test hypothesis. And that, I think, is something you can get from research, too. Like, in research, we always think when we have an idea, it's sort of a race to, like, figure out, is it a good idea? And can I publish it? Because the research community values novelty a lot, being the force to do something, you know, for better or worse. It's not amazing. But if you just
Starting point is 00:34:52 reproduce a thing that someone else did, unfortunately, you don't get as much credit. So we do think about how can we quickly validate something. But at the same time, and even in research, I had the same thing. You tried to pick topics that will matter. Like, for example, when I was doing my PhD, I didn't do a ton with machine learning. And, you know, I knew people who did it. I helped them out.
Starting point is 00:35:14 I built infrastructure. But I didn't do ML research myself. And then later, I kind of decided, like, yeah, I am going to do some things, especially around this, like, you know, connecting, machine learning to external data sources like search engines. And I know it's going to take a while to really learn about it and get an intuition and stuff, but I think this is going to matter long term because I think the local, like, you know, parsing semantics of what the sentence means is kind of solved already.
Starting point is 00:35:41 And the interesting thing will be like, you know, doing this in a bigger system. Yeah, I have four degrees and no PhD. I've never contributed anything to the corpus of the world's knowledge. a lot, got to ask, does it affect how you do investing? No, not really. The PhD, moot point. Nice. I don't know. I have a math degree as well, and I feel like that actually was a thing that forced me to think slightly differently, or at least it forced a way of very logic.
Starting point is 00:36:10 I felt like there's a groove in your brain for logic that gets carved. So that probably helped, but who knows, I don't know. You've been working in data machine learning for a long time. Like, where do you think we are in this generation of AI? Yeah, I think we're still at the early stages of AI on unstructured data, so things like text and images and so on, really having an impact in applications. So I think chat GPD related features that every application is going to add will change the way we work with computing. And they'll also change data analytics to some extent because you'll be able to use this data. And honestly, I also think that in terms of just basic data infrastructure and ML infrastructure,
Starting point is 00:36:52 We're still pretty early also. It's still many different tools you have to hook together, a lot of complex integration, and you need a lot of sort of specialized people to do it. And I think over time, like I increasingly think that basically, especially because of the capabilities of these AI models, every software engineer will need to become an ML engineer
Starting point is 00:37:13 and a data engineer also as they build their application. And we'll figure out ways of doing them recipes or abstractions or whatever that are actually, easy enough for everyone to do. And one analogy I like is, you know, when I was learning programming, which was sort of like, you know, mid-late 90s, I got these books on, you know, web applications. And it was very complicated. There was a book on MySQL. There was a book on Apache web server, like, CGI bin, all these things you have to hook together. And now, you know, most developers can make a web application in like one function. And even non-programmers can make something like
Starting point is 00:37:49 Google forms or Salesforce or whatever, that's sort of, you know, basically it is a custom application. So I think we're far away from that in data and ML, but it could sort of look like that. It's harder because it depends on this sort of static data that you've got sitting around, but I do think there are going to be a lot more of these applications. Matei, this is a great conversation. Thanks for joining us on No Priors. Thanks so much.
Starting point is 00:38:11 Thanks a lot, Sarah. Thank you. Thank you for listening to this week's episode of No Priors. Follow No Priors for a new guest each week and let us know online what you think and who an AI you want to hear from. You can keep in touch with me and conviction by following at Serenormus.
Starting point is 00:38:28 You can follow me on Twitter at Alad Gill. Thanks for listening. No Pryors is produced in partnership with Pob People. Special thanks to our team, Synthel Galdaya and Pranav Reddy and the production team at Pod People.
Starting point is 00:38:41 Alex Vigmanis, Matt Saab, Amy Machado, Ashton Carter, Danielle Roth, Carter Wogan and Billy Libby. Also, our parents, our children, the Academy, and Shy Hulud, creator of Spice.
