No Priors: Artificial Intelligence | Technology | Startups - What is the role of academia in modern AI research? With Stanford Professor Dr. Percy Liang
Episode Date: March 9, 2023
When AI research is evolving at warp speed and takes significant capital and compute power, what is the role of academia? Dr. Percy Liang, Stanford computer science professor and director of the Stanford Center for Research on Foundation Models, talks about training costs, distributed infrastructure, model evaluation, alignment, and societal impact. Sarah Guo and Elad Gil join Percy at his office to discuss the evolution of research in NLP, why AI developers should aim for superhuman levels of performance, the goals of the Center for Research on Foundation Models, and Together, a decentralized cloud for artificial intelligence.
No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.
Show Links:
See Percy’s Research on Google Scholar
See Percy’s bio on Stanford’s website
Percy on Stanford’s Blog: What to Expect in 2023 in AI
Together, a decentralized cloud for artificial intelligence
Foundation AI models GPT-3 and DALL-E need release standards - Protocol
The Time Is Now to Develop Community Norms for the Release of Foundation Models - Stanford
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @PercyLiang
Show Notes:
[1:44] - How Percy got into machine learning research and started the Center for Research on Foundation Models at Stanford
[7:23] - The role of academia and academia’s competitive advantages
[13:30] - Research on natural language processing and computational semantics
[27:20] - Smaller-scale architectures that are competitive with transformers
[35:08] - HELM, Holistic Evaluation of Language Models, a project whose goal is to evaluate language models
[42:13] - Together, a decentralized cloud for artificial intelligence
Transcript
For ages, human level has been the target for AI, and that has really been kind of a North Star that has fueled many dreams and efforts and so on over the decades.
But I think we're getting to a point where along many axes, it's superhuman or should be superhuman.
And I think we should maybe define more of an objective measure of like what we actually want.
This is more of a general statement about how we should think about technology,
not just chasing after mimicking a human because we have a lot of humans.
This is the No Priors podcast. I'm Sarah Guo.
I'm Elad Gil.
We invest in, advise, and help start technology companies.
In this podcast, we're talking with the leading founders and researchers in AI about the biggest questions.
We're very pleased today to have Dr. Percy Liang,
Professor of Computer Science here at Stanford,
and director of the Center for Research on Foundation Models,
a recently founded center here.
Dr. Liang is the author of over 800 heavily cited research papers
around helping machines understand natural language,
helping humans reason about those models,
and has contributed a number of novel technical approaches and creative uses of data to
the machine learning field. And as a special treat, we're recording here in his office at Stanford
today. Thanks, Percy. Great. Welcome. So I think just to start, can you tell us a little bit about
how you got into the machine learning research field and your personal background? Yeah. So I've
been in the field of machine learning and natural language processing for over 20 years. I started getting
into it in undergrad. I was an undergrad at MIT. I liked theory. I had a fascination with
languages. I was fascinated by how humans could just be exposed to strings of text and
speech, and somehow acquire a very sophisticated understanding of the world and also syntax, and
learn that in a fairly unsupervised way. And my dream was to get computers to do
the same. So then I went to grad school at Berkeley. And then after
that I started at Stanford, and ever since, I've been pursuing the development of systems that can
really truly understand natural language. And of course, in the last four years, this once-upon-a-time
kind of dream has really taken off, in a sense, maybe not in a way that I would necessarily have expected,
but with the coming out of large language models such as GPT-3, it's truly kind of astonishing how much
of the structure of language and the world these models can capture.
In some ways, it kind of harkens back to when I actually first started in NLP.
I was training language models, but of a very different type.
It was based on Hidden Markov models.
And there, the goal was to discover hidden structure in text.
And I was very excited by the fact that it could learn to
tease apart which words were, say, city names versus days of the week, and so on.
But now it's kind of on a completely different level.
You've worked on multiple generations of NLP at this point, pushing the forefront of semantic parsing.
Was there a moment at which you decided that you were going to focus on foundation models and large language models?
Yeah. There was a very decisive moment, and that moment was when GPT-3 came out.
That was in the middle of the pandemic, and it wasn't so much the capabilities of the model that shocked me, but the way that the model was trained,
which is basically taking a massive amount of text and asking a model to predict the next word
over and over again, billions of times.
What arose from it was not only a model that could generate fluent text, but also a model
that could do in-context learning, which means that you can prompt the language model with
instructions, for example, summarize this document, give it some examples, and have the model
on-the-fly in context figure out what the task was.
And this was a paradigm shift, in my opinion, because
it changed the way that we conceptualize machine learning and NLP systems, from these bespoke
systems, each trained to do question answering or some other specific task, to just a general
substrate where you can ask the model to do various things.
And the idea of a task, which is so central to AI, I think, begins to dissolve.
And I find that extremely exciting.
And that's the reason that later, in 2021, we founded the Center for Research on
Foundation Models. We coined the term foundation models because we thought there was something
happening in the world whose significance the term large language models somehow didn't really
capture. It was not just about language, it was about images and multimodality. It was a more
general phenomenon. And then the center started, and it's
been sort of, you know, kind of a roller coaster ride ever since. We're going to be talking a bit
about both your experiences in research and academia,
and then we'll also separately be talking about Together,
which is a company you're involved with now.
Could you tell us a little bit more about what the center does
and what you're focused on?
Yes, so the Center for Research on Foundation Models was founded
about two years ago under the Human-Centered AI Institute at Stanford.
And the main mission of the center is, I would say,
to increase transparency and accessibility to foundation models.
So foundation models are becoming more and more ubiquitous, but at the same time, one thing we have noticed is the lack of transparency and accessibility of these models.
So if you think about the last decade of deep learning, it has profited a lot from having a culture of openness with tools like PyTorch or TensorFlow, data sets that are open, people publishing openly about their research.
And this has led to a lot of community and progress, not just in academia, but also in industry with different startups and hobbyists and whoever, just getting involved.
And what we're seeing now is sort of a retreat from that open culture, where models are now only accessible via APIs.
We don't really know all the secret sauce going on behind them, and access is sort of limited.
What's your diagnosis of why that's happening?
I think that this is very natural, because these models take a lot of capital to train.
You can generate an enormous amount of value with them, and it's a competitive advantage.
So, you know, the incentives are to keep these under control.
There's also another factor, which is, you know, safety reasons.
I think these models are extremely powerful.
And maybe with the models right now, I think, if they were out in the open, it would be maybe okay.
But in the future, these models could be extremely good, and having them completely open, anything goes,
is something we might have to think about a little bit more carefully.
How do you think all this evolves? If you look at the history of ML or NLP or AI,
we've had these waves of innovation in academia, and then we've had waves of innovation and implementation in industry.
And in some cases, we've had both happening simultaneously, but it feels a little
bit like it's ping-ponged over time in different ways. Now that people are starting to be more
closed in terms of some of these models on the industry side, and publishing less and being less
open, how do you view the roles of academia and industry diverging, if at all? Like, do you think
it'll be different types of research that each type of institution tackles? Do you think there'll
be overlap? I'm sort of curious how you view all that evolving. I mean, I think industry and
academia have very distinctive and important functions. And I always tell my students, well,
we should be working on things that lean on academia's competitive advantage.
And historically, I think this has meant different things.
So before ML was that big, I think a lot of academic research was really about developing the tools to make these models work at all.
I remember working on systems and building ML models back in grad school.
And basically, it wasn't working.
I mean, computer vision wasn't working, question answering wasn't working.
And I think the goal of academia there was to make things work.
And a lot of the advances that were born out of academia
then influenced other ideas before things started clicking.
And now we're seeing a lot of the fruits of both academia's and industry's research
fueling this kind of industry drive that you see today.
And now, today, I think the dynamic is quite different,
because academia's job
is no longer just to get things to work,
because you can do that in other ways.
There's a lot of resources going into tech companies,
where if you have data and compute,
you can just sort of scale and blast through a lot of barriers.
And I think a lot of the role of academia is understanding,
because for all their impressive feats,
we just don't understand why these models work, how they work,
what the principles are, how the training data and the model architecture affect the
different behaviors, what the best way to weight data is, what the training
objective should be. Many of these questions, I think, could benefit from a more rigorous, you know,
analysis. The other piece, which is a different type of understanding, is understanding social
impact. And this goes back to the question about what CRFM's role is.
CRFM is a center with over 30 different faculty across 10 different departments at Stanford.
So it's quite interdisciplinary.
So we're looking at foundation models not just from a technical perspective of how do you get these models to work,
but also thinking about their economic impact.
There are challenges when it comes to copyright and legality; we're working on a paper that explores some of those questions.
We're looking at different questions of social biases and thinking
carefully about the impact these models have on issues of, you know, homogenization,
where you have a central model that's perhaps making decisions for a single user across all
the different aspects of their life. So some of these are the types of questions. There are also people
at the center looking at risks of disinformation, monitoring the extent to which these tools are
persuasive, which they are increasingly becoming.
And what are the actual risks when it comes to, let's say, foreign state actors leveraging this technology?
And there are also people at the center who are in medicine, and we're exploring ways of leveraging foundation models and deploying them in actual clinical practice.
How near term do you think some of those deployments are?
Because if you go back to the 70s, there was the MYCIN project here at Stanford, which was an expert system that outperformed Stanford medical school staff at predicting what infectious disease somebody had, for example.
And that was 50 years ago, or almost 50 years ago, and it never really got implemented in the real world.
And so one of my concerns sometimes in terms of the impact of some of these things is, are there industries that are resistant to adoption or resistant to change?
And it is exciting to hear that, you know, at Stanford, they're actually starting to look at how you actually integrate these things into real clinical care.
Do you view those things as very far out on the healthcare side?
Or do you view them as sort of near?
I know that isn't the main topic we're going to cover, but I'm still a curious given how close you are to all this.
Yeah, I think it's a good question.
I think there are a bunch of different issues that need to be resolved.
For example, foundation models are trained on a lot of data.
How do you deal with privacy?
How do you deal with robustness?
Because once you're talking about the healthcare space especially, there are cases where
we know that these models can still hallucinate facts and sound very confident in doing so.
I know some doctors like that, too.
Yeah, there you go.
But you've also taken a point of view that we should, you know,
expect superhuman performance from these models, and that holding them to the
standard of a human doctor is actually insufficient as well. Yeah, I think that's a great point,
is that for ages, human level has been the target for AI. And that has really been kind of a North
Star that has fueled many dreams and efforts and so on over the decades. But I think we're getting
to a point where along many axes, it's superhuman or should be superhuman. And I think we should
maybe define more of an objective measure of what we actually want. We want something that's
very reliable. It's grounded. I often want more statistical evidence when I speak to doctors
and sometimes fail to get that, and to have something that would be sort of a lot more principled
and rational. This is more of a general statement about how we should think about technology,
not just chasing after mimicking a human, because we already have a lot of humans.
It's an interesting point. It's really fascinating to watch all this evolve right now.
You've done extensive research on natural language processing and computational semantics.
Can you explain what those terms mean and how they're relevant to the development of AI?
So computational semantics is the process where you take language, text,
and compute, quote-unquote, meaning from it.
And that is something I'm maybe not going to attempt to define.
There's a huge literature of linguistics and philosophy about what meaning is.
I would say that a lot of my research in the past, maybe five to ten years ago,
was adopting this view that language is a programming language.
It computes.
You can give orders.
You can instruct.
You can do things with
language, and therefore, it was natural to model natural language as a formal language.
So a lot of semantic parsing is about mapping natural language into a formal space so that
machines could execute this.
And so one concrete application of this that I worked on for a while is mapping natural
language questions into essentially SQL queries, which obviously has many different
applications as well.
And what was nice about this framework is that to really do this, you had to understand
how the words contribute to different parts of the SQL query, and then you could get something
that was a program that you could execute to deliver the results, as opposed to many
question answering systems, where you ask a question and it maybe retrieves some document and
extracts the answer from it, or else makes something up, rather than computing it rigorously.
So that was a paradigm I was working in maybe five to ten years ago.
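As a concrete illustration of that paradigm, here is a minimal, hypothetical sketch (not Percy's actual system): the natural-language question is mapped to an executable SQL query, and the answer is computed by running the query rather than being retrieved or guessed. The table, question, and hand-written "parse" are all made up for illustration.

```python
# Illustrative only: the "parse" here is hand-written, standing in for what a
# learned semantic parser would produce from the natural-language question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Tokyo", "Japan", 13960000), ("Paris", "France", 2161000), ("Osaka", "Japan", 2750000)],
)

question = "Which cities in Japan have more than 3 million people?"
# A semantic parser maps the question to an executable formal representation:
parsed_sql = "SELECT name FROM cities WHERE country = 'Japan' AND population > 3000000"

print(question)
print(conn.execute(parsed_sql).fetchall())  # -> [('Tokyo',)]
```

The point of the framework is visible here: the answer is computed by executing a program over data, so every word in the question has to be accounted for in the query.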
But the main problem is that the world isn't a database.
A small part of the world is a database, but most of the world is unstructured.
And then I started thinking about question answering in general,
and we developed the SQuAD question answering benchmark to fuel progress in open-domain question answering.
And that, in turn, along with many other datasets that were developed, both at Stanford and elsewhere,
I think led to the development of these powerful language models, like BERT and RoBERTa and ELMo, back in about 2018,
so then many years ago.
Many years ago.
Ancient history now. And then to the more like 2020 generation of these large foundation models.
So I think there's certainly a place for that type of thinking.
There are cases where you want to just map natural language into, say, what people call tool use.
Like, if you ask some question that requires calculation,
you should just use a calculator rather than try to, sort of, quote-unquote, do it in the Transformer's head.
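To make the tool-use idea concrete, here is a toy, hypothetical sketch of that kind of routing: pure arithmetic goes to a calculator, everything else is handed off to the model. The routing rule and the stand-in "model" path are illustrative assumptions, not any real system.

```python
# A toy, hypothetical illustration of "tool use": route questions that are pure
# arithmetic to a calculator instead of asking the model to do the math in its head.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def calculator(expr):
    """Safely evaluate a plain arithmetic expression like '12 * (7 + 3)'."""
    def ev(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("not simple arithmetic")
    return ev(ast.parse(expr, mode="eval").body)

def answer(question):
    expr = question.rstrip("?").replace("What is", "").strip()
    try:
        return calculator(expr)                      # tool path: exact arithmetic
    except (ValueError, SyntaxError, KeyError):
        return "(hand off to the language model)"    # everything else

print(answer("What is 12 * (7 + 3)?"))   # 120, computed by the tool
print(answer("Who wrote Hamlet?"))       # handed off to the model
```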
But there's also a lot of aspects of reasoning, which are not quite formal.
We do this all the time.
And a lot of that happens kind of natively in the language model.
And I think it's still an interesting question how to kind of marry the two.
I feel like the two are still sort of jammed together in a way.
And maybe it's natural, because there are certain things you can do in your head,
and certain things for which you invoke a tool.
But this has also been one of the classic
debates in AI: neural versus symbolic. And for a while, symbolic AI was dominant. Now neural
AI has really taken off and become dominant. But some of those central problems of how you do
planning, how you do reasoning, which were the focus of study in symbolic AI, are now again
really relevant, because now we've moved past just simple classification and entity extraction
to more ambitious tasks.
What do you think of some of the more interesting research programs right now in that area?
I think that it's interesting to remark on what's happening,
because, to a first-order approximation,
larger models trained on the relevant data
seem to do well on various benchmarks.
I think that maybe there isn't enough emphasis on data efficiency,
and on how quickly
and how robustly you can get to these points,
because we know, it has been well documented,
that benchmarks can be gameable,
so even though you do well on a benchmark,
it doesn't mean you've necessarily solved the problem.
So I think one has to be a little bit cautious about that.
So obviously, scale and more data is just one clear direction.
But in terms of orthogonal directions,
what are the methods?
Several things have to happen.
One is we have to have the ability to
handle a greater context length.
If you think about a long reasoning chain,
you know, transformers have a fixed context and there are ways to extend it,
but fundamentally, it's sort of a fixed model.
Or take advanced problem solving,
for example, if you want to solve a math problem,
you'll prove something.
The language model generates sort of this chain of thought,
generating token by token,
and then it produces something.
But we know that when humans solve a problem,
it's much more that you try different things, you backtrack.
It's much more flexible, iterative, and it can last a lot longer
than just going for a few iterations.
And what is the architecture that can handle that level of complexity?
I think that is still an outstanding question.
Are there any aspects of foundation or large language models that are emergent
that you didn't anticipate or that really surprised you?
I think, going back to GPT-3,
in-context learning is something that surprised many people, including me.
So here you're prompting a language model with an instruction and input-output pairs.
You know, here's a sentence, it's classified positive; here's a sentence, it's classified negative.
And the model is somehow able to latch on to these examples and sort of figure out what you're
trying to do and solve the task.
And this is really intriguing because it's emergent.
It wasn't hand-coded by the designers to, oh, I want to do in-context learning this way.
Now, of course, you could have done that, but I think the real sort of magic is you didn't have to do that, and yet it still does something.
It's not completely reliable, but it sort of can get better with better models and, you know, better data.
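A minimal sketch of what such a few-shot prompt might look like; the labeled sentences are made up for illustration, and no particular model or API is assumed.

```python
# A minimal, hypothetical few-shot prompt of the kind described above.
# The labeled examples are invented; any model could consume this text as-is.
prompt = """Classify the sentiment of each sentence as Positive or Negative.

Sentence: The food was delicious and the staff were friendly.
Sentiment: Positive

Sentence: The movie dragged on and I nearly fell asleep.
Sentiment: Negative

Sentence: I can't wait to come back to this place.
Sentiment:"""

# A model that has "learned to learn in context" should continue with "Positive"
# without any gradient updates -- the task is specified entirely by the prompt.
print(prompt)
```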
Then there's chain of thought.
Do you want to explain what that is?
So the idea is, if a question is presented to a language model, the language model could just answer, and it'll maybe get it right or wrong.
But if you ask the language model to generate an explanation of how it would solve the problem, kind of thinking out loud, then it's much more likely to get the answer right.
And it's very natural that this would be the case for humans as well.
But again, the chain-of-thought capability is something that, you know, emerges.
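For concreteness, here is a small, hypothetical chain-of-thought prompt in the same spirit: a worked example shows the model "thinking out loud" before the answer, and the new question is left for it to continue. The questions and numbers are made up.

```python
# A minimal, hypothetical chain-of-thought prompt.
prompt = """Q: A train travels 60 miles per hour for 2 hours, then 30 miles per hour
for 1 hour. How far does it travel in total?
A: Let's think step by step. In the first leg it covers 60 * 2 = 120 miles.
In the second leg it covers 30 * 1 = 30 miles. Total: 120 + 30 = 150 miles.
The answer is 150.

Q: A store sells pencils in packs of 12. If a classroom needs 90 pencils,
how many packs must it buy?
A: Let's think step by step."""

# Prompted this way, a capable model tends to write out intermediate steps
# (90 / 12 = 7.5, so 8 packs) rather than guessing the final number directly.
print(prompt)
```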
The other thing I think is really wild is this.
And I think it's maybe a general principle, which is the ability to mix and match.
So you can ask the model to explain the quicksort algorithm in the style of Shakespeare.
And it will actually construct something that is semantically pretty on point, but also stylistically, you know, much, much
better than what many people could come up with, which means that it has learned different
concepts of what Shakespeare and what quicksort are and is able to fuse them. So if you think about
creativity, I think this is sort of an example of creative use. People say that sometimes all
language models do is memorize, because they're so big and trained on clearly a lot of text. But
these examples, I think, really indicate that there's no way that these language models are just
memorizing, because this text just doesn't exist, and you have to have some creative juice and
invent something new. And to kind of riff on that a little bit, I think the creative
aspects of these language models, with the potential for scientific discovery or doing research
or pushing the boundaries beyond what humans can do, are really, really fascinating.
Because up until now, again, remember, the AI dream tops out at
humans, but now we can actually go beyond in many, many ways. And I think that unlocks a lot of
possibilities. Yeah, there are a lot of really interesting examples. I mean, you could actually argue
that connecting concepts in any novel way is creativity, but I love the one that is just discovering,
like, new tactics in Go that humans haven't discovered after thousands of years of play.
Maybe we'll ask if you'll risk making a prediction that is impossible. Emergent behaviors of models
at the next level of scale: anything you might predict?
Emerging capabilities, like we wouldn't have thought
chain of thought or in-context learning would work.
So I can give you an example of something I think is emerging,
and I can give you an example of a hope,
but I don't know what I would call a prediction.
So what we're seeing today is the ability to instruct a model
using natural language to do certain things.
You see a lot of this online with ChatGPT and Bing
Chat, and some of Anthropic's work as well, where you can instruct a model
to be succinct, generate three paragraphs in the style of so-and-so, and so on; you can lay out these
guidelines and have the model actually follow them. So this instruction-following ability is getting
extremely good. Now, I will say that how much is emergent and how much is not is hard
to tell, because with a lot of these models, it's not just the language model that's trained
to predict the next word. There's a lot of secret sauce that goes on under the hood. And if you define
emergence as, you know, something that was not intended by the designers, I don't know how much of that is
emergent, but at least it's a capability that I think is very striking. The hope is this:
language models currently make stuff up. They hallucinate. And this is clearly a big problem, and
almost, in some ways, a very difficult problem to crack.
The hope is that as models get better, that some of this will actually go away.
I don't know if that will happen.
I guess the way I think about these models is this:
if you think about predicting the next word, it seems very simple,
but you have to really internalize a lot of what is going on in the context.
What are the previous words?
What's the syntax?
Who's saying them?
And all of that information and context has to get compressed.
And then that allows you to predict the next word.
If you're able to do this extremely well, then you sort of have a model of what's happening in the world,
at least the world that you've captured in text.
And so while the notion of truth might be ambiguous in
many cases, I think the model can get an idea of which parts of the internet are
maybe reliable and which parts of the internet are not, and, you know, the idea of
having entities and dates and locations and what activities there are, I think that
will maybe become more salient in the model. Like, think of a language model that's
just predicting the next word, and it's only trained to do that, and you
say, "Elad traveled to blank."
Of course it's going to make something up without further context.
But if it has a better understanding of what's happening, and, of course, with more context,
then maybe it can use that context to actually know that, well, okay, I don't know.
Maybe I should ask where he went.
So scale is basically increasing the statistical accuracy of the prediction of the next word,
because you have more context and more data with which to predict what's coming,
and therefore it will reduce hallucinations,
because you're increasing accuracy.
Yeah, so I think there's pre-training,
which is predicting the next word
and developing a world model, so to speak.
And with those capabilities,
then you still have to say don't hallucinate,
but it will be much easier to control that model
if it has a notion of what hallucination even is.
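As a toy illustration of the pre-training objective he describes, here is a tiny bigram "language model" that literally just counts which word follows which; real foundation models use transformers over billions of tokens, but the predict-the-next-word objective is the same in spirit. The corpus is invented.

```python
# A toy illustration of the pre-training objective: predict the next word, over and
# over, from raw text. This bigram counter just makes the objective itself concrete.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):   # every adjacent pair is a training example
    counts[prev][nxt] += 1                  # "given prev, I saw nxt"

def predict_next(word):
    """Return the most likely next word under the counted distribution."""
    dist = counts[word]
    return dist.most_common(1)[0][0] if dist else None

print(predict_next("the"))   # 'cat' (all continuations are tied; ties keep insertion order)
print(predict_next("sat"))   # 'on'
```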
I was talking to somebody who was close
to the development of the transformer model,
and his claim was that one of the reasons it's done so well is, to your point around scale, right?
Eventually, you hit enough scale that you see that it clearly has these really interesting
emergent properties, so you keep scaling it up, and you keep sort of growing it.
And so therefore, it's like a self-reinforcing loop to keep using these types of models.
And his claim was that it's expensive to do that sort of scale.
And so therefore, there may be other architectures or approaches that we've just never scaled up
sufficiently in order to actually see if they have the same emergent properties
or certain characteristics that may be superior.
How do you think about that from the perspective of just going down the path of the transformer versus other architectures that may be really interesting and may be neglected because we just haven't thrown enough compute at them, because it's expensive?
Yeah, I really hope that in 10 years we won't still be using the transformer, because I think the transformer is, I mean, it's a very good architecture.
People have tried to improve it, but it's sort of, like, kind of good enough for people to press ahead.
But scientifically, there's no reason to believe that this is the one,
and there have been some efforts.
So one of my colleagues, Chris Ré, and his students
have developed other architectures,
which are actually, at smaller scales, competitive with Transformers
and actually don't require the central operation of attention.
And I would love to see much more research exploring other alternatives to Transformers.
This is something, again, that academia, I think, is very well-suited to do
because it involves kind of challenging the status quo.
You're not really trying to just get it to work and get it out there.
But you're trying to reflect on what are the principles?
What can we learn from Transformers?
What is it trying to do?
And how can we incorporate them in a much more principled way?
At some level, it's still going to be about compute.
Right.
So people have shown scaling laws for LSTMs, which show that if you were able to scale up LSTMs,
maybe they would work
pretty well as well,
but the amount of compute
is many times more,
and given a fixed compute budget,
we're always in a compute-constrained
environment. Is it an efficient enough
architecture to keep trying? Yeah, you would
not use an LSTM. The Transformer strictly
dominates an LSTM from the perspective
of a fixed compute budget. So this question of
what if I could scale the LSTM
becomes a little bit sort of irrelevant.
So for the things
where you see transformer-like performance, what sort of compute budget would you need in order
to be able to test them out? Is it the scale of a million dollars, $10 million, $100 million of
compute? I know it changes based on compute pricing. And I'm just trying to get a rough sense
of, you know, how expensive is it to try today? And then if we extrapolate down a compute curve
three years from now, maybe it's tractable again or something. Yeah, it really depends on the
gaps that you're seeing. Right now in academia, you can train one-billion-parameter models. I mean,
it's not cheap by academia standards, but you can do it.
And here at CRFM, we're training like six- or seven-billion-parameter models.
And I think it's enough to be able to try out some ideas.
But ultimately, because of emergent properties and the importance of scale, you can only make a hypothesis.
You can find something like, oh, this seems promising at smaller scales, but
you still have to go out and test whether it really pans out or the gap just closes.
And maybe this is a good segue to talk about compute and Together.
So we founded Together on the premise that compute is a central bottleneck in foundation models.
On the other hand, there's a lot of compute that's decentralized, that's maybe underutilized or idle.
And if we could harness that compute and bring it to bear for both research and also commercial purposes,
then we could actually do a lot more.
There are some pretty hefty technical challenges around doing that,
because foundation models are typically trained in very high-end data center environments
where the interconnect between devices is extremely good.
Whereas if you just grab your average desktop
or home interconnect, it's 100 times or more slower.
But Chris Ré and Ce Zhang and others, really, they deserve most of the credit for this.
We've developed some techniques that allow you to leverage this weakly connected compute
and actually get pretty interesting training going.
So hopefully with that type of infrastructure, we can begin to unlock a bit more of compute,
both for academic research, but also for, you know, other startups and so on.
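For intuition, here is a generic, simplified sketch of training over weakly connected workers using local steps and periodic averaging (often called local SGD): each worker trains on its own data shard and only occasionally syncs, so far less traffic crosses the slow links. This is an assumption-level illustration of the general idea, not Together's actual algorithm; their scheduling and compression techniques are more sophisticated.

```python
# Local SGD sketch: 4 "workers" fit a shared linear model, syncing only once per round.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])

def make_shard(n=200):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    return X, y

shards = [make_shard() for _ in range(4)]          # 4 workers, each with local data
weights = [np.zeros(2) for _ in shards]            # each worker's local model copy

def local_step(w, X, y, lr=0.05):
    grad = 2 * X.T @ (X @ w - y) / len(y)          # least-squares gradient
    return w - lr * grad

for round_ in range(20):                           # each round = 1 slow network sync
    for _ in range(10):                            # 10 cheap local steps per sync
        weights = [local_step(w, X, y) for w, (X, y) in zip(weights, shards)]
    avg = np.mean(weights, axis=0)                 # the only cross-worker communication
    weights = [avg.copy() for _ in weights]

print(np.round(weights[0], 2))                     # should be close to [ 2. -3.]
```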
That's really cool.
So it sounds a little bit like earlier predecessors of this may be things like Folding@home,
where people did protein folding collectively on their computers, or SETI@home,
where there was a search through different astronomical data.
And now you can actually do this for training an AI system on your desktop or, you know,
excess compute that exists at data centers or in other places.
So Folding@home is, I think, a great inspiration for a lot of this work.
At some point during the middle of the pandemic, they actually had the world's largest supercomputer in terms of flop count because it was used to do molecular dynamics simulations for COVID.
The main challenge with foundation models is that there's a lot of big models and big data that needs to be shuffled around.
So the task decomposition is much, much harder.
So that's why many of the technical things that we've been doing around scheduling and compression enable us to
overcome these hurdles. And then there's the question of incentives. So I think there are two
aspects of what Together is building. One is sort of what I will call a research computer,
which is for academic research purposes, where people can contribute compute. And in the process
of contributing compute when they're not using it,
they are able to use the decentralized cloud for doing training,
and when they are using it,
they can use much more of it.
So the hope is that it provides
a much more efficient use of the compute,
because you're spreading it across a larger set of people.
And then on the commercial side,
the hope is that, with the open models
that are developed in the open-source ecosystem,
the Together platform can allow people
to fine-tune and adapt these models to various different use cases.
One thing I think is noteworthy is that we think of foundation models today as, maybe there are
a few foundation models that are very good and exist.
But I think in the future there are going to be many different ones for different kinds of
use cases as this space takes off.
Many of them will be derived from maybe existing foundation
models, but many of them will also perhaps be trained from scratch as well.
I think this is actually a pretty uncommon viewpoint right now. Can you talk a little bit about,
like, where you or, you know, research efforts you're associated with choose to train models,
like maybe on PubMed or whatever else you think is relevant here? So, foundation models
are a pretty broad category, and much of the sort of core center of it is, you know, large
language models that are trained on lots of internet data. We've trained a model here at CRFM,
in collaboration with MosaicML, called BioMedLM. It's not a huge model, but it's trained on
PubMed articles, and it exhibits pretty good performance on various benchmarks. For a while, we were
able to be state-of-the-art on the U.S. medical licensing exam. You know, Google did
come up with a model that was, I think, 200 times larger, and they beat that model. So, you know,
scale does matter. But I think there are many cases where, for efficiency reasons,
maybe you do want a smaller model, since cost, I think, is a, you know, a big concern.
I want to talk about some of the, I think, like, most important, or hopefully most important, work
that the center's done so far. Can you explain what HELM is and what the goal has been?
Yeah. So HELM stands for Holistic Evaluation
of Language Models, which is this project that happened
over the last year.
And the goal is to evaluate language models.
So the trouble is that a language model is a very generic thing.
It's like saying evaluate the internet.
What does that mean?
The language model takes text in and puts text out.
And one of the features of a language model
is that it can be used for a myriad of different
applications. And so what we did in that paper is to be as systematic and as rigorous as we
could in laying out the different scenarios in which language models could be used, and also
measure aspects of these uses, which include not just accuracy, which a lot of benchmarks focus
on, but also issues of how robust it is, how well it's calibrated, meaning
does the model know what it doesn't know,
whether the models are fair according to some definition of fairness,
whether they're biased, whether they spew out toxic content,
how efficient they are.
And then we went and basically grabbed every prominent language model that we could access,
which includes open-source models like OPT and BLOOM,
but also getting access to APIs from Cohere, AI21,
OpenAI, and also Anthropic and Microsoft.
So overall, there were 30 different models,
42 scenarios, and seven metrics,
and we ran the same evaluations on all of that.
We've put all the results on the HELM website,
so that you can see the top-level statistics and accuracies,
but also you can drill down into
a particular benchmark: what are the instances,
what are the predictions that these models are making, all the way down to what prompts are being used for the language models?
So the idea here is that we're trying to provide transparency to this space, right?
We know that these models are powerful.
They have some deficiencies, and we're trying to lay that all out in a kind of a scientific manner.
So I'm pretty excited about this project.
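Schematically, the evaluation is a grid: every model runs on every scenario and gets scored on several metrics. The sketch below uses trivial stub "models" and made-up scenarios just to show the shape of that loop; the real HELM run calls actual model APIs and uses far richer scenarios and metrics than exact match.

```python
# A schematic, hypothetical sketch of a HELM-style evaluation grid.
def echo_model(prompt):
    return prompt.split()[-1]       # stand-in "model" 1: repeats the last word
def confident_model(prompt):
    return "Paris"                  # stand-in "model" 2: always answers "Paris"

models = {"echo-stub": echo_model, "confident-stub": confident_model}

scenarios = {
    "geography-qa": [("What is the capital of France?", "Paris")],
    "copy-last-word": [("Repeat the last word: hello world", "world")],
}

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

results = {}
for model_name, model in models.items():
    for scenario_name, examples in scenarios.items():
        scores = [exact_match(model(q), gold) for q, gold in examples]
        results[(model_name, scenario_name)] = sum(scores) / len(scores)

for (m, s), acc in results.items():
    print(f"{m:15s} {s:15s} accuracy={acc:.2f}")
```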
The challenging thing about this project is that since we put
out the paper maybe three months ago, a bunch of different models have come out, including
ChatGPT and LLaMA, and, you know, Cohere and AI21 have updated their models. GPT-4 might come out at some
point. So what this project has evolved into is basically this dynamically updating benchmark,
where every two weeks we refresh it with new models that are coming out as well as new
scenarios, because one thing
we also realized, which was made clear by
ChatGPT, is that
the type of thing that we ask of a
language model is changing. We don't ask it
just to do question answering or just to do
sentiment. Increasing capabilities. Now they can do a lot more.
They can, you know, write an email, or give
you, you know, life advice on
XYZ if you put in a scenario,
or write, you know, an essay
about XYZ.
And I think
what we need to do with the benchmark is also add scenarios that accordingly capture
these capabilities as well as kind of new risks. So we are definitely interested in benchmarking
how persuasive these language models are, which governs, you know, what the risks are,
and also how secure they are. One thing I'm actually also worried about is, given all
the jailbreaking that is extremely common with these models, where you can basically
bypass safety controls, if these models start interacting with the world and accepting external
inputs, now you can not only just sort of jailbreak your own model, but you can jailbreak
other people's models and get them to do various things.
And so that could lead to sort of a cascade of errors.
So some of these are the concerns that we hope to also capture with the benchmark.
I should also mention we're also trying to look at multimodal models, which I think is going
to be pretty pertinent. So lots to do. A bunch of the things that you've described as sort of the role
you see for the center, or even, like, academia broadly, in the age of foundation models,
like, they have more of an intersection with policy than machine learning research traditionally has.
How do you think about that? Yeah. Actually, I'm glad you asked that, because we've been
thinking a lot about the social implications of these models, and sort of
not the models themselves, which we focus a lot on talking about, but the environment in which
these models are built. So there are a few players in the space with different opinions about
how models should be built. Some are more closed, some are more open. And there's also, again,
this sort of lack of transparency, where we have a model that's produced and it's aligned,
apparently, to human values.
But then once you start
kind of questioning, you can ask:
okay, well, which values?
Which humans are we talking about?
Who determines these values?
What legitimacy does that have?
And what's the sort of accountability?
Then you start
noticing that, well, a lot of this
is just kind of completely a black box.
So one thing that we've been
working on at the center is
developing norms,
starting with transparency. I think transparency
is necessary but not sufficient.
You need some level of transparency
to even have a conversation
about any of the policy issues.
So making sure that the public
can understand how these models are built,
at least some notion of what the data is,
what are the instructions that are given
to align the models.
We're trying to advocate for greater transparency
there. And I think this will be really important as these models really get deployed at scale and start
impacting our lives. You know, the kind of analogy I like to think about is, you know,
nutrition labels, or any sort of specification sheets on electronic devices. There's some sort of
obligation, I think, that producers of some products should have to make sure that their product is
used properly and has some bounds on it.
I guess I'll ask two questions.
One is if people wanted to participate in Together,
is there a client that they can download and install or use?
Or how can people help support the Together efforts?
Yeah.
So we are developing a client that will be made available.
both from the perspective of joining the Together cloud,
so that you can contribute your compute,
but also we have an API that we're developing
so that people can use Together's infrastructure
to do inference and fine-tune models.
We are also training some open models,
so we have this thing called OpenChatKit
that we're releasing soon,
and this is built on top of EleutherAI's NeoX model,
but improved to include various different types of capabilities.
You should think about it as really a work in progress.
What we're trying to do is open it up so that people can play with it, give feedback, and have the community improve this together, rather than us trying to produce some finished product and putting it out there.
This goes back to the point about embracing the spirit of open source and involving the community to build these foundation models together, as opposed to someone unilaterally building them.
While we're talking timelines and predictions that you don't quite feel comfortable making, how do you think as a rigorous scientist about AGI?
I must say that my opinions about AGI have changed over time.
I think that for a while, it was perceived by most of the community as laughable.
I will say that in the last 10 years, I have been aware that, you know, there's kind of a certain community who think about AGI
and also existential risk and things like that.
So I've been in touch with people who think about these.
I think I see the world maybe differently.
I think, perhaps, certainly these are powerful technologies
and could have extreme social consequences.
But there's a lot of more near-term issues.
I've focused a lot on kind of the robustness of ML systems
in the last, you know, five years.
But, you know, one thing I've learned about foundation models,
because of their emergent qualities, is to be very kind of open-minded, I would say.
I was asking earlier about No Priors, where that comes from.
And I think it's a fitting way to think about, you know, the world.
Because I think everyone, including scientists, often get sort of drawn into a particular worldview and paradigm.
And I think that, you know, the world is changing, both on the technical side, but also in how we conceive of AI,
and, you know, maybe even humans at some level.
And I think we have to be open-minded to, you know,
how that's going to evolve over the next few years.
Awesome.
Thanks for doing this conversation with us, Percy.
It's great.
Yeah, thanks for joining us.
Yeah, thank you very much.
Thank you for listening to this week's episode of No Priors.
Follow No Priors for new guests each week
and let us know online what you think and who in AI you want to hear from.
You can keep in touch with me and Conviction by following
@Saranormous.
You can follow me on Twitter
@EladGil.
Thanks for listening.
No Priors is produced
in partnership with Pod People.
Special thanks to our team,
Synthel Galdea and Pranav Reddy
and the production team
at Pod People.
Alex McManus, Matt Saab,
Amy Machado,
Ashton Carter,
Danielle Roth,
Carter Wogan, and Billy Libby.
Also, our parents,
our children,
the Academy,
and tyranny.m.L.,
just your average-friendly
AGI world government.
Thank you.