No Priors: Artificial Intelligence | Technology | Startups - What is the role of academia in modern AI research? With Stanford Professor Dr. Percy Liang
Episode Date: March 9, 2023
When AI research is evolving at warp speed and takes significant capital and compute power, what is the role of academia? Dr. Percy Liang, Stanford computer science professor and director of the Stanford Center for Research on Foundation Models, talks about training costs, distributed infrastructure, model evaluation, alignment, and societal impact. Sarah Guo and Elad Gil join Percy at his office to discuss the evolution of research in NLP, why AI developers should aim for superhuman levels of performance, the goals of the Center for Research on Foundation Models, and Together, a decentralized cloud for artificial intelligence.
No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.
Show Links:
See Percy’s Research on Google Scholar
See Percy’s bio on Stanford’s website
Percy on Stanford’s Blog: What to Expect in 2023 in AI
Together, a decentralized cloud for artificial intelligence
Foundation AI models GPT-3 and DALL-E need release standards - Protocol
The Time Is Now to Develop Community Norms for the Release of Foundation Models - Stanford
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @PercyLiang
Show Notes:
[1:44] - How Percy got into machine learning research and started the Center for Research on Foundation Models at Stanford
[7:23] - The role of academia and academia’s competitive advantages
[13:30] - Research on natural language processing and computational semantics
[27:20] - Smaller-scale architectures that are competitive with transformers
[35:08] - HELM, Holistic Evaluation of Language Models, a project whose goal is to evaluate language models
[42:13] - Together, a decentralized cloud for artificial intelligence
Transcript
For ages, human level has been the target for AI, and that has really been kind of a North Star that has fueled many dreams and efforts and so on over the decades.
But I think we're getting to a point where along many axes, it's superhuman or should be superhuman.
And I think we should maybe define more of an objective measure of like what we actually want.
This is more of a general statement about how we should think about technology,
not just chasing after mimicking a human because we have a lot of humans.
This is the No Priors podcast. I'm Sarah Guo.
I'm Elad Gil.
We invest in, advise, and help start technology companies.
In this podcast, we're talking with the leading founders and researchers in AI about the biggest questions.
We're very pleased today to have Dr. Percy Liang,
Professor of Computer Science here at Stanford,
and director of the Center for Research on Foundation Models,
a recently founded center here.
Dr. Liang is the author of over 800 heavily cited research papers
around helping machines understand natural language,
helping humans reason about those models,
and has contributed a number of novel technical approaches and creative uses of data to
the machine learning field. And as a special treat, we're recording here in his office at Stanford
today. Thanks, Percy. Great. Welcome. So I think just to start, can you tell us a little bit about
how you got into the machine learning research field and your personal background? Yeah. So I've
been in the field of machine learning and natural language processing for over 20 years. I started getting
into it in undergrad. I was an undergrad at MIT. I liked theory. I had a fascination with
languages. I was fascinated by how humans could just be exposed to strings of text and
speech, and somehow acquire a very sophisticated understanding of the world and also syntax, and
learn that in a fairly unsupervised way. And my dream was to get computers to do
the same. So then I went to grad school at Berkeley. And then after
that I started at Stanford, and ever since, I've been pursuing the development of systems that can
really truly understand natural language. And of course, in the last four years, this once-upon-a-time
kind of dream has really taken off, in a sense, maybe not in a way that I would necessarily have expected,
but with the coming out of large language models such as GPT-3, it's truly kind of astonishing how much
of the structure of language and the world these models can capture.
In some ways, it kind of harkens back to when I actually first started in NLP.
I was training language models, but of a very different type.
It was based on Hidden Markov models.
And there, the goal was to discover hidden structure in text.
And I was very excited by the fact that it could learn to
tease apart which words were, say, city names versus days of the week, and so on.
But now it's kind of on a completely different level.
You've worked on multiple generations of NLP at this point, pushing the forefront of semantic parsing.
Was there a moment at which you decided that you were going to focus on foundation models and large language models?
Yeah. There was a very decisive moment, and that moment was when GPT-3 came out.
That was in the middle of the pandemic, and it wasn't so much the capabilities of the model that shocked me, but the way that the model was trained,
which is basically taking a massive amount of text and asking a model to predict the next word
over and over again, billions of times.
What arose from it was not only a model that could generate fluent text, but also a model
that could do in-context learning, which means that you can prompt the language model with
instructions, for example, summarize this document, give it some examples, and have the model
on-the-fly in context figure out what the task was.
And this was a paradigm shift, in my opinion, because
it changed the way that we conceptualize machine learning and NLP systems, from these bespoke
systems, each trained to do question answering or some other specific task, to just a general
substrate where you can ask the model to do various things.
And the idea of a task, which is so central to AI, I think, begins to dissolve.
And I find that extremely exciting.
And that's the reason that later, in 2021, we founded the Center for Research on
Foundation Models. We coined the term foundation models because we thought there was something
happening in the world whose significance the term large language models somehow didn't really
capture. It was not just about language, it was about images and multimodality. It was a more
general phenomenon. And then the center started, and it's
been sort of, you know, kind of a roller coaster ride ever since. We're going to be talking a bit
about both your experiences in research and academia,
and then we'll also separately be talking about Together,
which is a company you're involved with now.
Could you tell us a little bit more about what the center does
and what you're focused on?
Yes, so the Center for Research on Foundation Models was founded
about two years ago under the Human-Centered AI Institute at Stanford.
And the main mission of the center is, I would say,
to increase transparency and accessibility to foundation models.
So foundation models are becoming more and more ubiquitous, but at the same time, one thing we have noticed is the lack of transparency and accessibility of these models.
So if you think about the last decade of deep learning, it has profited a lot from having a culture of openness with tools like PyTorch or TensorFlow, data sets that are open, people publishing openly about their research.
And this has led to a lot of community and progress, not just in academia, but also in industry with different startups and hobbyists and whoever, just getting involved.
And what we're seeing now is sort of a retreat from that open culture, where models are now only accessible via APIs.
We don't really know all the secret sauce going on behind them, and access is sort of limited.
What's your diagnosis of why that's happening?
I think that this is very natural, because these models take a lot of capital to train.
You can generate an enormous amount of value with them, and it's a competitive advantage.
So, you know, the incentives are to keep these under control.
There's also another factor, which is, you know, safety reasons.
I think these models are extremely powerful.
And maybe with the models right now, I think, if they were out in the open, it would be maybe okay.
But in the future, these models could be extremely good, and having them completely open, anything goes,
is something we might have to think about a little bit more carefully.
How do you think all this evolves? If you look at the history of ML or NLP or AI,
we've had these waves of innovation in academia, and then we've had waves of innovation and implementation in industry.
And in some cases, we've had both happening simultaneously, but it feels a little
bit like it's ping-ponged over time in different ways. Now that people are starting to be more
closed in terms of some of these models on the industry side, and publishing less and being less
open, how do you view the roles of academia and industry diverging, if at all? Like, do you think
it'll be different types of research that each type of institution tackles? Do you think there'll
be overlap? I'm sort of curious how you view all that evolving. I mean, I think industry and
academia have very distinctive and important functions. And I always tell my students, well,
we should be working on things that lean on academia's competitive advantage.
And historically, I think this has meant different things.
So before ML was that big, I think a lot of academic research was really about developing the tools to make these models work at all.
I remember working on systems and building ML models back in grad school.
And basically, it wasn't working.
I mean, computer vision wasn't working, question answering wasn't working.
And I think the goal of academia there was to make things work.
And a lot of the advances that were born out of academia
then influenced other ideas before things started clicking.
And now we're seeing a lot of the fruits of both academia's and industry's research
fueling this kind of industry drive that you see today.
And now, today, I think the dynamic is quite different,
because academia's job
is no longer just to get things to work,
because you can do that in other ways.
There's a lot of resources going into tech companies,
where if you have data and compute,
you can just sort of scale and blast through a lot of barriers.
And I think a lot of the role of academia is understanding,
because for all their impressive feats,
we just don't understand why these models work, how they work,
what the principles are, how the training data and the model architecture affect the
different behaviors, what the best way to weight data is, what the training
objective should be. Many of these questions, I think, could benefit from a more rigorous, you know,
analysis. The other piece, which is a different type of understanding, is understanding social
impact. And this goes back to the question about what CRFM's role is.
CRFM is a center with over 30 different faculty across 10 different departments at Stanford.
So it's quite interdisciplinary.
So we're looking at foundation models not just from a technical perspective of how do you get these models to work,
but also thinking about their economic impact.
There are challenges when it comes to copyright and legality; we're working on a paper that explores some of those questions.
We're looking at different questions of social biases and thinking
carefully about the impact these models have on issues of, you know, homogenization,
where you have a central model that's perhaps making decisions for a single user across all
the different aspects of their life. So some of these are the types of questions. There are also people
at the center looking at risks of disinformation, monitoring the extent to which these tools are
persuasive, which they are increasingly becoming.
And what are the actual risks when it comes to, let's say, foreign state actors leveraging this technology?
And there are also people at the center who are in medicine, and we're exploring ways of leveraging foundation models and deploying them in actual clinical practice.
How near term do you think some of those deployments are?
Because if you go back to the 70s, there was the MYCIN project here at Stanford, which was an expert system that outperformed Stanford medical school staff at predicting what infectious disease somebody had, for example.
And that was 50 years ago, or almost 50 years ago, and it never really got implemented in the real world.
And so one of my concerns sometimes in terms of the impact of some of these things is, are there industries that are resistant to adoption or resistant to change?
And it is exciting to hear that, you know, at Stanford, they're actually starting to look at how you actually integrate these things into real clinical care.
Do you view those things as very far out on the healthcare side?
Or do you view them as sort of near?
I know that isn't the main topic we're going to cover, but I'm still a curious given how close you are to all this.
Yeah, I think it's a good question.
I think there are a bunch of different issues that need to be resolved.
For example, foundation models are trained on a lot of data.
How do you deal with privacy?
How do you deal with robustness?
Because once you're talking about the healthcare space especially, there are cases where
we know that these models can still hallucinate facts and sound very confident in doing so.
I know some doctors like that, too.
Yeah, there you go.
But you've also taken a point of view that we should, you know,
expect superhuman performance from these models, and that holding them to the
standard of a human doctor is actually insufficient as well. Yeah, I think that's a great point,
is that for ages, human level has been the target for AI. And that has really been kind of a North
Star that has fueled many dreams and efforts and so on over the decades. But I think we're getting
to a point where along many axes, it's superhuman or should be superhuman. And I think we should
maybe define more of an objective measure of what we actually want. We want something that's
very reliable. It's grounded. I often want more statistical evidence when I speak to doctors
and sometimes fail to get that, and to have something that would be sort of a lot more principled
and rational. This is more of a general statement about how we should think about technology,
not just chasing after mimicking a human, because we already have a lot of humans.
It's an interesting point. It's really fascinating to watch all this evolve right now.
You've done extensive research on natural language processing and computational semantics.
Can you explain what those terms mean and how they're relevant to the development of AI?
So computational semantics is the process where you take language, text,
and compute, quote-unquote, meaning from it.
And that is something I'm maybe not going to attempt to define.
There's a huge literature of linguistics and philosophy about what meaning is.
I would say that a lot of my research in the past, maybe five to ten years ago,
was adopting this view that language is a programming language.
It computes.
You can give orders.
You can instruct.
You can do things with
language, and therefore, it was natural to model natural language as a formal language.
So a lot of semantic parsing is about mapping natural language into a formal space so that
machines could execute this.
And so one concrete application of this that I worked on for a while is mapping natural
language questions into essentially SQL queries, which obviously has many different
applications as well.
And what was nice about this framework is that to really do this, you had to understand
how the words contribute to different parts of the SQL query, and then you could get something
that was a program that you could execute to deliver the results, as opposed to many
question answering systems, where you ask a question and it maybe retrieves some document and
extracts the answer from it, or else makes something up, rather than computing it rigorously.
So that was a paradigm I was working in maybe five to ten years ago.
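As a concrete illustration of that paradigm, here is a minimal, hypothetical sketch (not Percy's actual system): the natural-language question is mapped to an executable SQL query, and the answer is computed by running the query rather than being retrieved or guessed. The table, question, and hand-written "parse" are all made up for illustration.

```python
# Illustrative only: the "parse" here is hand-written, standing in for what a
# learned semantic parser would produce from the natural-language question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Tokyo", "Japan", 13960000), ("Paris", "France", 2161000), ("Osaka", "Japan", 2750000)],
)

question = "Which cities in Japan have more than 3 million people?"
# A semantic parser maps the question to an executable formal representation:
parsed_sql = "SELECT name FROM cities WHERE country = 'Japan' AND population > 3000000"

print(question)
print(conn.execute(parsed_sql).fetchall())  # -> [('Tokyo',)]
```

The point of the framework is visible here: the answer is computed by executing a program over data, so every word in the question has to be accounted for in the query.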
But the main problem is that the world isn't a database.
A small part of the world is a database, but most of the world is unstructured.
And then I started thinking about question answering in general,
and we developed the SQuAD question answering benchmark to fuel progress in open-domain question answering.
And that, in turn, along with many other datasets that were developed, both at Stanford and elsewhere,
I think led to the development of these powerful language models, like BERT and RoBERTa and ELMo, back in about 2018,
so then many years ago.
Many years ago.
Ancient history now. And then to the more like 2020 generation of these large foundation models.
So I think there's certainly a place for that type of thinking.
There are cases where you want to just map natural language into, say, what people call tool use.
Like, if you ask some question that requires calculation,
you should just use a calculator rather than try to, sort of, quote-unquote, do it in the Transformer's head.
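To make the tool-use idea concrete, here is a toy, hypothetical sketch of that kind of routing: pure arithmetic goes to a calculator, everything else is handed off to the model. The routing rule and the stand-in "model" path are illustrative assumptions, not any real system.

```python
# A toy, hypothetical illustration of "tool use": route questions that are pure
# arithmetic to a calculator instead of asking the model to do the math in its head.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator(expr):
    """Safely evaluate a plain arithmetic expression like '12 * (7 + 3)'."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("not simple arithmetic")
    return ev(ast.parse(expr, mode="eval").body)

def answer(question):
    expr = question.rstrip("?").replace("What is", "").strip()
    try:
        return calculator(expr)                      # tool path: exact arithmetic
    except (ValueError, SyntaxError, KeyError):
        return "(hand off to the language model)"    # everything else

print(answer("What is 12 * (7 + 3)?"))   # 120, computed by the tool
print(answer("Who wrote Hamlet?"))       # handed off to the model
```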
But there's also a lot of aspects of reasoning, which are not quite formal.
We do this all the time.
And a lot of that happens kind of natively in the language model.
And I think it's still an interesting question how to kind of marry the two.
I feel like the two are still sort of jammed together in a way.
And maybe it's natural, because there are certain things you can do in your head,
and certain things for which you invoke a tool.
But this has also been one of the classic
debates in AI: neural versus symbolic. And for a while, symbolic AI was dominant. Now neural
AI has really taken off and become dominant. But some of those central problems of how you do
planning, how you do reasoning, which were the focus of study in symbolic AI, are now again
really relevant, because now we've moved past just simple classification and entity extraction
to more ambitious tasks.
What do you think of some of the more interesting research programs right now in that area?
I think that it's interesting to remark on what's happening,
because, to a first-order approximation,
larger models trained on the relevant data
seem to do well on various benchmarks.
I think that maybe there isn't enough emphasis on data efficiency,
and on how quickly
and how robustly you can get to these points,
because we know, it has been well documented,
that benchmarks can be gameable,
so even though you do well on a benchmark,
it doesn't mean you've necessarily solved the problem.
So I think one has to be a little bit cautious about that.
So obviously, scale and more data is just one clear direction.
But in terms of orthogonal directions,
what are the methods?
Several things have to happen.
One is we have to have the ability to
handle a greater context length.
If you think about a long reasoning chain,
you know, transformers have a fixed context and there are ways to extend it,
but fundamentally, it's sort of a fixed model.
Or take advanced problem solving,
for example, if you want to solve a math problem,
you'll prove something.
The language model generates sort of this chain of thought,
generating token by token,
and then it produces something.
But we know that when humans solve a problem,
it's much more that you try different things, you backtrack.
It's much more flexible, iterative, and it can last a lot longer
than just going for a few iterations.
And what is the architecture that can handle that level of complexity?
I think that is still an outstanding question.
Are there any aspects of foundation or large language models that are emergent
that you didn't anticipate or that really surprised you?
I think, going back to GPT-3,
in-context learning is something that surprised many people, including me.
So here you're prompting a language model with an instruction and input-output pairs.
You know, here's a sentence, it's classified positive; here's a sentence, it's classified negative.
And the model is somehow able to latch on to these examples and sort of figure out what you're
trying to do and solve the task.
And this is really intriguing because it's emergent.
It wasn't hand-coded by the designers to, oh, I want to do in-context learning this way.
Now, of course, you could have done that, but I think the real sort of magic is you didn't have to do that, and yet it still does something.
It's not completely reliable, but it sort of can get better with better models and, you know, better data.
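A minimal sketch of what such a few-shot prompt might look like; the labeled sentences are made up for illustration, and no particular model or API is assumed.

```python
# A minimal, hypothetical few-shot prompt of the kind described above.
# The labeled examples are invented; any model could consume this text as-is.
prompt = """Classify the sentiment of each sentence as Positive or Negative.

Sentence: The food was delicious and the staff were friendly.
Sentiment: Positive

Sentence: The movie dragged on and I nearly fell asleep.
Sentiment: Negative

Sentence: I can't wait to come back to this place.
Sentiment:"""

# A model that has "learned to learn in context" should continue with "Positive"
# without any gradient updates -- the task is specified entirely by the prompt.
print(prompt)
```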
Then there's chain of thought.
Do you want to explain what that is?
So the idea is, if a question is presented to a language model, the language model could just answer, and it'll maybe get it right or wrong.
But if you ask the language model to generate an explanation of how it would solve the problem, kind of thinking out loud, then it's much more likely to get the answer right.
And it's very natural that this would be the case for humans as well.
But again, the chain-of-thought capability is something that, you know, emerges.
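For concreteness, here is a small, hypothetical chain-of-thought prompt in the same spirit: a worked example shows the model "thinking out loud" before the answer, and the new question is left for it to continue. The questions and numbers are made up.

```python
# A minimal, hypothetical chain-of-thought prompt.
prompt = """Q: A train travels 60 miles per hour for 2 hours, then 30 miles per hour
for 1 hour. How far does it travel in total?
A: Let's think step by step. In the first leg it covers 60 * 2 = 120 miles.
In the second leg it covers 30 * 1 = 30 miles. Total: 120 + 30 = 150 miles.
The answer is 150.

Q: A store sells pencils in packs of 12. If a classroom needs 90 pencils,
how many packs must it buy?
A: Let's think step by step."""

# Prompted this way, a capable model tends to write out intermediate steps
# (90 / 12 = 7.5, so 8 packs) rather than guessing the final number directly.
print(prompt)
```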
The other thing I think is really wild is this.
And I think it's maybe a general principle, which is the ability to mix and match.
So you can ask the model to explain the quicksort algorithm in the style of Shakespeare.
And it will actually construct something that is semantically pretty on point, but also stylistically, you know, much, much
better than what many people could come up with, which means that it has learned different
concepts of what Shakespeare and what quicksort are and is able to fuse them. So if you think about
creativity, I think this is sort of an example of creative use. People say that sometimes all
language models do is memorize, because they're so big and trained on clearly a lot of text. But
these examples, I think, really indicate that there's no way that these language models are just
memorizing, because this text just doesn't exist, and you have to have some creative juice and
invent something new. And to kind of riff on that a little bit, I think the creative
aspects of these language models, with the potential for scientific discovery or doing research
or pushing the boundaries beyond what humans can do, are really, really fascinating.
Because up until now, again, remember, the AI dream tops out at
humans, but now we can actually go beyond in many, many ways. And I think that unlocks a lot of
possibilities. Yeah, there are a lot of really interesting examples. I mean, you could actually argue
that connecting concepts in any novel way is creativity, but I love the one that is just discovering,
like, new tactics in Go that humans haven't discovered after thousands of years of play.
Maybe we'll ask if you'll risk making a prediction that is impossible. Emergent behaviors of models
at the next level of scale: anything you might predict?
Emerging capabilities, like we wouldn't have thought
chain of thought or in-context learning would work.
So I can give you an example of something I think is emerging,
and I can give you an example of a hope,
but I don't know what I would call a prediction.
So what we're seeing today is the ability to instruct a model
using natural language to do certain things.
You see a lot of this online with ChatGPT and Bing
Chat, and some of Anthropic's work as well, where you can instruct a model
to be succinct, generate three paragraphs in the style of so-and-so, and so on; you can lay out these
guidelines and have the model actually follow them. So this instruction-following ability is getting
extremely good. Now, I will say that how much is emergent and how much is not is hard
to tell, because with a lot of these models, it's not just the language model that's trained
to predict the next word. There's a lot of secret sauce that goes on under the hood. And if you define
emergence as, you know, something that was not intended by the designers, I don't know how much of that is
emergent, but at least it's a capability that I think is very striking. The hope is this:
language models currently make stuff up. They hallucinate. And this is clearly a big problem, and
almost, in some ways, a very difficult problem to crack.
The hope is that as models get better, that some of this will actually go away.
I don't know if that will happen.
I guess the way I think about these models is this:
if you think about predicting the next word, it seems very simple,
but you have to really internalize a lot of what is going on in the context.
What are the previous words?
What's the syntax?
Who's saying them?
And all of that information and context has to get compressed.
And then that allows you to predict the next word.
If you're able to do this extremely well, then you sort of have a model of what's happening in the world,
at least the world that you've captured in text.
And so while the notion of truth might be ambiguous in
many cases, I think the model can get an idea of which parts of the internet are
maybe reliable and which parts of the internet are not, and, you know, the idea of
having entities and dates and locations and what activities there are, I think that
will maybe become more salient in the model. Like, think of a language model that's
just predicting the next word, and it's only trained to do that, and you
say, "Elad traveled to blank."
Of course it's going to make something up without further context.
But if it has a better understanding of what's happening, and, of course, with more context,
then maybe it can use that context to actually know that, well, okay, I don't know.
Maybe I should ask where he went.
So scale is basically increasing the statistical accuracy of the prediction of the next word,
because you have more context and more data with which to predict what's coming,
and therefore it will reduce hallucinations,
because you're increasing accuracy.
Yeah, so I think there's pre-training,
which is predicting the next word
and developing a world model, so to speak.
And with those capabilities,
then you still have to say don't hallucinate,
but it will be much easier to control that model
if it has a notion of what hallucination even is.
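As a toy illustration of the pre-training objective he describes, here is a tiny bigram "language model" that literally just counts which word follows which; real foundation models use transformers over billions of tokens, but the predict-the-next-word objective is the same in spirit. The corpus is invented.

```python
# A toy illustration of the pre-training objective: predict the next word, over and
# over, from raw text. This bigram counter just makes the objective itself concrete.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):   # every adjacent pair is a training example
    counts[prev][nxt] += 1                  # "given prev, I saw nxt"

def predict_next(word):
    """Return the most likely next word under the counted distribution."""
    dist = counts[word]
    return dist.most_common(1)[0][0] if dist else None

print(predict_next("the"))   # 'cat' (all continuations are tied; ties keep insertion order)
print(predict_next("sat"))   # 'on'
```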
I was talking to somebody who was close
to the development of the transformer model,
and his claim was that one of the reasons it's done so well is, to your point around scale, right?
Eventually, you hit enough scale that you see that it clearly has these really interesting
emergent properties, so you keep scaling it up, and you keep sort of growing it.
And so therefore, it's like a self-reinforcing loop to keep using these types of models.
And his claim was that it's expensive to do that sort of scale.
And so therefore, there may be other architectures or approaches that we've just never scaled up
sufficiently in order to actually see if they have the same emergent properties
or certain characteristics that may be superior.
How do you think about that from the perspective of just going down the path of the transformer versus other architectures that may be really interesting and may be neglected because we just haven't thrown enough compute at them, because it's expensive?
Yeah, I really hope that in 10 years we won't still be using the transformer, because I think the transformer is, I mean, it's a very good architecture.
People have tried to improve it, but it's sort of, like, kind of good enough for people to press ahead.
But scientifically, there's no reason to believe that this is the one,
and there have been some efforts.
So one of my colleagues, Chris Ré, and his students
have developed other architectures,
which are actually, at smaller scales, competitive with Transformers
and actually don't require the central operation of attention.
And I would love to see much more research exploring other alternatives to Transformers.
This is something, again, that academia, I think, is very well-suited to do
because it involves kind of challenging the status quo.
You're not really trying to just get it to work and get it out there.
But you're trying to reflect on what are the principles?
What can we learn from Transformers?
What is it trying to do?
And how can we incorporate them in a much more principled way?
At some level, it's still going to be about compute.
Right.
So people have shown scaling laws for LSTMs, which show that if you were able to scale up LSTMs,
maybe they would work
pretty well as well,
but the amount of compute
is many times more,
and given a fixed compute budget,
we're always in a compute-constrained
environment. Is it an efficient enough
architecture to keep trying? Yeah, you would
not use an LSTM. The Transformer strictly
dominates an LSTM from the perspective
of a fixed compute budget. So this question of
what if I could scale the LSTM
becomes a little bit sort of irrelevant.
So for the things
where you see transformer-like performance, what sort of compute budget would you need in order
to be able to test them out? Is it the scale of a million dollars, $10 million, $100 million of
compute? I know it changes based on compute pricing. And I'm just trying to get a rough sense
of, you know, how expensive is it to try today? And then if we extrapolate down a compute curve
three years from now, maybe it's tractable again or something. Yeah, it really depends on the
gaps that you're seeing. Right now in academia, you can train one-billion-parameter models. I mean,
it's not cheap by academia standards, but you can do it.
And here at CRFM, we're training like six- or seven-billion-parameter models.
And I think it's enough to be able to try out some ideas.
But ultimately, because of emergent properties and the importance of scale, you can only make a hypothesis.
You can find something like, oh, this seems promising at smaller scales, but
you still have to go out and test whether it really pans out or the gap just closes.
And maybe this is a good segue to talk about compute and Together.
So we founded Together on the premise that compute is a central bottleneck in foundation models.
On the other hand, there's a lot of compute that's decentralized, that's maybe underutilized or idle.
And if we could harness that compute and bring it to bear for both research and also commercial purposes,
then we could actually do a lot more.
There are some pretty hefty technical challenges around doing that,
because foundation models are typically trained in very high-end data center environments
where the interconnect between devices is extremely good.
Whereas if you just grab your average desktop
or home interconnect, it's 100 times or more slower.
But Chris Ré and Ce Zhang and others, really, they deserve most of the credit for this.
We've developed some techniques that allow you to leverage this weakly connected compute
and actually get pretty interesting training going.
So hopefully with that type of infrastructure, we can begin to unlock a bit more of compute,
both for academic research, but also for, you know, other startups and so on.
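For intuition, here is a generic, simplified sketch of training over weakly connected workers using local steps and periodic averaging (often called local SGD): each worker trains on its own data shard and only occasionally syncs, so far less traffic crosses the slow links. This is an assumption-level illustration of the general idea, not Together's actual algorithm; their scheduling and compression techniques are more sophisticated.

```python
# Local SGD sketch: 4 "workers" fit a shared linear model, syncing only once per round.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])

def make_shard(n=200):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

shards = [make_shard() for _ in range(4)]          # 4 workers, each with local data
weights = [np.zeros(2) for _ in shards]            # each worker's local model copy

def local_step(w, X, y, lr=0.05):
    grad = 2 * X.T @ (X @ w - y) / len(y)          # least-squares gradient
    return w - lr * grad

for round_ in range(20):                           # each round = 1 slow network sync
    for _ in range(10):                            # 10 cheap local steps per sync
        weights = [local_step(w, X, y) for w, (X, y) in zip(weights, shards)]
    avg = np.mean(weights, axis=0)                 # the only cross-worker communication
    weights = [avg.copy() for _ in weights]

print(np.round(weights[0], 2))                     # should be close to [ 2. -3.]
```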
That's really cool.
So it sounds a little bit like earlier predecessors of this may be things like Folding@home,
where people did protein folding collectively on their computers, or SETI@home,
where there was a search through different astronomical data.
And now you can actually do this for training an AI system on your desktop or, you know,
excess compute that exists at data centers or in other places.
So Folding@home is, I think, a great inspiration for a lot of this work.
At some point during the middle of the pandemic, they actually had the world's largest supercomputer in terms of flop count because it was used to do molecular dynamics simulations for COVID.
The main challenge with foundation models is that there's a lot of big models and big data that needs to be shuffled around.
So the task decomposition is much, much harder.
So that's why many of the technical things that we've been doing around scheduling and compression enable us to
overcome these hurdles. And then there's the question of incentives. So I think there are two
aspects of what Together is building. One is sort of what I will call a research computer,
which is for academic research purposes, where people can contribute compute. And in the process
of contributing compute when they're not using it,
they are able to use the decentralized cloud for doing training,
and when they are using it,
they can use much more of it.
So the hope is that it provides
a much more efficient use of the compute,
because you're spreading it across a larger set of people.
And then on the commercial side,
the hope is that, with the open models
that are developed in the open-source ecosystem,
the Together platform can allow people
to fine-tune and adapt these models to various different use cases.
One thing I think is noteworthy is that we think of foundation models today as, maybe there are
a few foundation models that are very good and exist.
But I think in the future there are going to be many different ones for different kinds of
use cases as this space takes off.
Many of them will be derived from maybe existing foundation
models, but many of them will also perhaps be trained from scratch as well.
I think this is actually a pretty uncommon viewpoint right now. Can you talk a little bit about,
like, where you or, you know, research efforts you're associated with choose to train models,
like maybe on PubMed or whatever else you think is relevant here? So, foundation models
are a pretty broad category, and much of the sort of core center of it is, you know, large
language models that are trained on lots of internet data. We've trained a model here at CRFM,
in collaboration with MosaicML, called BioMedLM. It's not a huge model, but it's trained on
PubMed articles, and it exhibits pretty good performance on various benchmarks. For a while, we were
able to be state-of-the-art on the U.S. medical licensing exam. You know, Google did
come up with a model that was, I think, 200 times larger, and they beat that model. So, you know,
scale does matter. But I think there are many cases where, for efficiency reasons,
maybe you do want a smaller model, since cost, I think, is a, you know, a big concern.
I want to talk about some of the, I think, like, most important, or hopefully most important, work
that the center's done so far. Can you explain what HELM is and what the goal has been?
Yeah. So HELM stands for Holistic Evaluation
of Language Models, which is this project that happened
over the last year.
And the goal is to evaluate language models.
So the trouble is that a language model is a very generic thing.
It's like saying evaluate the internet.
What does that mean?
The language model takes text in and puts text out.
And one of the features of a language model
is that it can be used for a myriad of different
applications. And so what we did in that paper is to be as systematic and as rigorous as we
could in laying out the different scenarios in which language models could be used, and also
measure aspects of these uses, which include not just accuracy, which a lot of benchmarks focus
on, but also issues of how robust it is, how well it's calibrated, meaning
does the model know what it doesn't know,
whether the models are fair according to some definition of fairness,
whether they're biased, whether they spew out toxic content,
how efficient they are.
And then we went and basically grabbed every prominent language model that we could access,
which includes open-source models like OPT and BLOOM,
but also getting access to APIs from Cohere, AI21,
OpenAI, and also Anthropic and Microsoft.
So overall, there were 30 different models,
42 scenarios, and seven metrics,
and we ran the same evaluations on all of that.
We've put all the results on the HELM website,
so that you can see the top-level statistics and accuracies,
but also you can drill down into
a particular benchmark: what are the instances,
what are the predictions that these models are making, all the way down to what prompts are being used for the language models?
So the idea here is that we're trying to provide transparency to this space, right?
We know that these models are powerful.
They have some deficiencies, and we're trying to lay that all out in a kind of a scientific manner.
So I'm pretty excited about this project.
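Schematically, the evaluation is a grid: every model runs on every scenario and gets scored on several metrics. The sketch below uses trivial stub "models" and made-up scenarios just to show the shape of that loop; the real HELM run calls actual model APIs and uses far richer scenarios and metrics than exact match.

```python
# A schematic, hypothetical sketch of a HELM-style evaluation grid.
def echo_model(prompt):
    return prompt.split()[-1]       # stand-in "model" 1: repeats the last word
def confident_model(prompt):
    return "Paris"                  # stand-in "model" 2: always answers "Paris"

models = {"echo-stub": echo_model, "confident-stub": confident_model}

scenarios = {
    "geography-qa": [("What is the capital of France?", "Paris")],
    "copy-last-word": [("Repeat the last word: hello world", "world")],
}

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

results = {}
for model_name, model in models.items():
    for scenario_name, examples in scenarios.items():
        scores = [exact_match(model(q), gold) for q, gold in examples]
        results[(model_name, scenario_name)] = sum(scores) / len(scores)

for (m, s), acc in results.items():
    print(f"{m:15s} {s:15s} accuracy={acc:.2f}")
```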
The challenging thing about this project is that since we put
out the paper maybe three months ago, a bunch of different models have come out, including
ChatGPT and LLaMA, and, you know, Cohere and AI21 have updated their models. GPT-4 might come out at some
point. So what this project has evolved into is basically this dynamically updating benchmark,
where every two weeks we refresh it with new models that are coming out as well as new
scenarios, because one thing
we also realized, which was made clear by
ChatGPT, is that
the type of thing that we ask of a
language model is changing. We don't ask it
just to do question answering or just to do
sentiment. Increasing capabilities. Now they can do a lot more.
They can, you know, write an email, or give
you, you know, life advice on
XYZ if you put in a scenario,
or write, you know, an essay
about XYZ.
And I think
what we need to do with the benchmark is also add scenarios that accordingly capture
these capabilities as well as kind of new risks. So we are definitely interested in benchmarking
how persuasive these language models are, which governs, you know, what the risks are,
and also how secure they are. One thing I'm actually also worried about is, given all
the jailbreaking that is extremely common with these models, where you can basically
bypass safety controls, if these models start interacting with the world and accepting external
inputs, now you can not only just sort of jailbreak your own model, but you can jailbreak
other people's models and get them to do various things.
And so that could lead to sort of a cascade of errors.
So some of these are the concerns that we hope to also capture with the benchmark.
I should also mention we're also trying to look at multimodal models, which I think is going
to be pretty pertinent. So lots to do. A bunch of the things that you've described as sort of the role
you see for the center, or even, like, academia broadly, in the age of foundation models,
like, they have more of an intersection with policy than machine learning research traditionally has.
How do you think about that? Yeah. Actually, I'm glad you asked that, because we've been
thinking a lot about the social implications of these models, and sort of
not the models themselves, which we focus a lot on talking about, but the environment in which
these models are built. So there are a few players in the space with different opinions about
how models should be built. Some are more closed, some are more open. And there's also, again,
this sort of lack of transparency, where we have a model that's produced and it's aligned,
apparently, to human values.
But then once you start
kind of questioning, you can ask:
okay, well, which values?
Which humans are we talking about?
Who determines these values?
What legitimacy does that have?
And what's the sort of accountability?
Then you start
noticing that, well, a lot of this
is just kind of completely a black box.
So one thing that we've been
working on at the center is
developing norms,
starting with transparency. I think transparency
is necessary but not sufficient.
You need some level of transparency
to even have a conversation
about any of the policy issues.
So making sure that the public
can understand how these models are built,
at least some notion of what the data is,
what are the instructions that are given
to align the models.
We're trying to advocate for greater transparency
there. And I think this will be really important as these models really get deployed at scale and start
impacting our lives. You know, the kind of analogy I like to think about is, you know,
nutrition labels, or any sort of specification sheets on electronic devices. There's some sort of
obligation, I think, that producers of some products should have to make sure that their product is
used properly and has some bounds on it.
I guess I'll ask two questions.
One is if people wanted to participate in Together,
is there a client that they can download and install or use?
Or how can people help support the Together efforts?
Yeah.
So we are developing a client that will be made available.
both from the perspective of joining the Together cloud,
so that you can contribute your compute,
but also we have an API that we're developing
so that people can use Together's infrastructure
to do inference and fine-tune models.
We are also training some open models,
so we have this thing called OpenChatKit
that we're releasing soon,
and this is built on top of EleutherAI's NeoX model,
but improved to include various different types of capabilities.
You should think about it as really a work in progress.
What we're trying to do is open it up so that people can play with it, give feedback, and have the community improve this together, rather than us trying to produce some finished product and putting it out there.
This goes back to the point about embracing the spirit of open source and involving the community to build these foundation models together, as opposed to someone unilaterally building them.
While we're talking timelines and predictions that you don't quite feel comfortable making, how do you think as a rigorous scientist about AGI?
I must say that my opinions about AGI have changed over time.
I think that for a while, it was perceived by most of the community as laughable.
I will say that in the last 10 years, I have been aware that, you know, there's kind of a certain community who think about AGI
and also existential risk and things like that.
So I've been in touch with people who think about these.
I think I see the world maybe differently.
I think, perhaps, certainly these are powerful technologies
and could have extreme social consequences.
But there's a lot of more near-term issues.
I've focused a lot on kind of the robustness of ML systems
in the last, you know, five years.
But, you know, one thing I've learned about foundation models,
because of their emergent qualities, is to be very kind of open-minded, I would say.
I was asking earlier about No Priors, where that comes from.
And I think it's a fitting way to think about, you know, the world.
Because I think everyone, including scientists, often get sort of drawn into a particular worldview and paradigm.
And I think that, you know, the world is changing, both on the technical side, but also in how we conceive of AI,
and, you know, maybe even humans at some level.
And I think we have to be open-minded to, you know,
how that's going to evolve over the next few years.
Awesome.
Thanks for doing this conversation with us, Percy.
It's great.
Yeah, thanks for joining us.
Yeah, thank you very much.
Thank you for listening to this week's episode of No Priors.
Follow No Priors for new guests each week
and let us know online what you think and who in AI you want to hear from.
You can keep in touch with me and Conviction by following
@Saranormous.
You can follow me on Twitter
@EladGil.
Thanks for listening.
No Priors is produced
in partnership with Pod People.
Special thanks to our team,
Synthel Galdea and Pranav Reddy
and the production team
at Pod People.
Alex McManus, Matt Saab,
Amy Machado,
Ashton Carter,
Danielle Roth,
Carter Wogan, and Billy Libby.
Also, our parents,
our children,
the Academy,
and tyranny.m.L.,
just your average-friendly
AGI world government.
Thank you.