No Priors: Artificial Intelligence | Technology | Startups - The Future is Small Models, with Matei Zaharia, CTO of Databricks
Episode Date: April 6, 2023

If you have 30 dollars, a few hours, and one server, then you are ready to create a ChatGPT-like model that can do what's known as instruction-following. Databricks' latest launch, Dolly, foreshadows a potential move in the industry toward smaller and more accessible but extremely capable AIs. Plus, Dolly is open source, requires less computing power, and uses fewer parameters than its counterparts. Matei Zaharia, Cofounder & Chief Technologist at Databricks, joins Sarah and Elad to talk about how big data sets actually need to be, why manual annotation is becoming less necessary to train some models, and how he went from a Berkeley PhD student with a little project called Spark to the founder of a company that is now critical data infrastructure and is increasingly moving into AI.

No Priors is now on YouTube! Subscribe to the channel on YouTube and like this episode.

Show Links:
Hello Dolly: Democratizing the magic of ChatGPT with open models
Dolly Source Code on GitHub
Matei Zaharia - Chief Technologist & Cofounder - Databricks | LinkedIn
Matei Zaharia - Google Scholar
Databricks debuts ChatGPT-like Dolly, a clone any enterprise can own | VentureBeat

Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Databricks | @Matei_Zaharia

Show Notes:
[01:29] - Origin of Databricks
[04:30] - Work at Stanford Lab
[05:29] - Dolly and Role of Open Source
[12:30] - Industry focus on high parameter count; understanding reasoning at small model scale
[18:42] - Enterprise applications for Dolly & chatbots
[25:06] - Making bets as an academic turned CTO
[36:23] - The early stages of AI and future predictions
Transcript
So we really wanted to see whether it's possible to democratize this and to let people build their own models, you know, with their own data, without sending it to some centralized provider that's trying to sort of learn from everyone's data and, you know, kind of control their destiny in this space.
This is the No Priors podcast. I'm Sarah Guo.
We invest in, advise, and help start technology companies.
In this podcast, we're talking with the leading founders and researchers in AI about the biggest questions.
If you have $30, a few hours, and one server, then you're ready to create a ChatGPT-like model that can do what's known as instruction following.
The latest launch from Databricks, Dolly, which is available in open source, foreshadows a potential move in the industry towards smaller and more accessible,
but extremely capable AIs.
Matei Zaharia, co-founder and chief technologist at Databricks,
is here to tell us all about Dolly.
We'll talk about how big data sets actually need to be,
why manual annotation is becoming less necessary to train some models,
and how he went from a Berkeley PhD student
with a little project you may have heard of called Spark,
to the founder of a company that's now a critical data infrastructure
that's increasingly moving into AI.
Welcome to the podcast, Matei.
Thanks a lot.
Excited to be here.
Can you start by telling us a little bit about the origins of Databricks
and how it led you to where you are today?
Sure, yeah.
So Databricks started from a group of seven researchers at UC Berkeley back in 2013, and we were really excited about democratizing, basically, the use of large data sets and of machine learning.
So we had seen, you know, the web companies at the time were very successful with these things, but most other companies, you know, most other organizations, things like scientific labs and so on weren't.
And we were really excited to look at making it easier to do computation on large amounts of data and also to do machine learning at scale with the latest algorithms.
So during our research, we worked with some of the web companies.
We also started open source projects, most notably Apache Spark, which, you know,
the first version of it was essentially my PhD thesis.
And we had seen a lot of interest in these.
And we thought, you know, it would be great to start a company to really reach enterprises
and make this type of thing much better and, you know, actually allow other companies to use
this stuff.
Can you just give us a sense of what Databricks looks like today, from, like, a scale and product suite perspective?
Sure, yeah.
So Databricks offers a pretty comprehensive data and ML platform in the cloud.
It runs on top of the three major cloud providers, Amazon, Microsoft, and Google.
And it includes support for, you know, data engineering, data warehousing, machine learning.
And most interestingly, all this is integrated into one product.
So, for example, you can have one definition of your business metric that you use in your BI dashboards.
And the same exact definition is used as a feature in machine learning.
And you don't have this drift or copying data, and you can just kind of go back and forth between these worlds.
The company has about 6,000 employees now.
And last year, we said that we crossed a billion dollars in ARR, and we're continuing to grow.
It's a consumption-based cloud model where, you know, customers that are successful can go over time and bring in new use cases and so on.
Did you think the opportunity was as big as it has been when you started the company?
Yeah, well, we definitely didn't, you know, anticipate necessarily to go to this size, right?
A lot of things can go on, but we were excited about the confluence of a few trends.
So, first of all, you know, it's so easy to collect large amounts of data and people are doing it automatically in, you know, many industries.
And second, cloud computing makes it possible to scale up very quickly, do experiments, scale down, and so on, which enables more companies to work with this kind of thing.
And then the third one was machine learning.
So we thought, you know, these are powerful trends.
And the exciting thing for, you know, us as a company is we didn't invent cloud computing.
We didn't necessarily invent big data or anything.
But we were able to start at a point in time when many companies were thinking to move into this space
and just provide a great platform for that.
And there's this migration already happening.
And, you know, if you provide the best platform as people are migrating to the cloud, they'll consider it.
You still keep roots in research.
You have a research group at Stanford.
Can you talk about that?
Yeah.
So I'm a computer science professor there, so I split my time between that and
Databricks.
And we work on a bunch of things.
We usually like looking farther ahead into the future.
And we've worked a lot on scalable systems for machine learning, how to do efficient
training on lots of GPUs and stuff like that, or how to do efficient serving.
And then another thing I'm really excited about that we started about three years ago,
is looking at knowledge-intensive applications
where you combine a language model
with something like a search engine
or an API you call or something like that
and you try to produce a correct result
maybe for a complicated task.
Like do a literature survey and then like tell me
what you found about this thing
with a bunch of references or counter-arguments or whatever.
And I have a great group of PhD students
that are working on that
and, you know, exploring different ways to do it.
How did Databricks decide to start working on Dolly? What sparked that, and, you know, how did you first get going on that?
Yeah. So we've had customers working with large language models of various forms, you know,
even before ChatGPT came out, but they were doing the more standard things like
translation or sentiment analysis or things like that. A lot of them were tuning models for their
specific domains. I think we had like almost a thousand customers that were using these in
some form. But then when ChatGPT came out in November, it got people interested in, you know,
using these for a lot more than just analyzing a bit of data and instead creating
entire new interfaces or new types of computer applications, new experiences in them.
And so there was an intense interest in this, even at a time when, you know, the industry
in general is being conscious about spending and like which things are really required and
so on. This was an exciting one. And the really exciting thing about ChatGPT, as you both
know, is the instruction following or basically the ability of it to kind of carry on a conversation
and listen to the things you're telling it to do and do those,
as opposed to just completing text or just telling you a small amount of information,
like this is a positive or negative sentiment.
So we really wanted to see whether it's possible to democratize this
and to let people build their own models with their own data
without sending it to some centralized provider
that's trying to sort of learn from everyone's data
and kind of control their destiny in this space.
We were exploring different ways of doing it,
And in particular, Dolly is partly based on this great result from some other faculty members at Stanford called Alpaca, where they tested a way to, you know, basically use a model to generate a bunch of realistic conversations, and then they used those to train another model that can now kind of carry on a conversation on its own.
And so we tried essentially cloning that approach, but starting with an open source model, and it actually worked pretty well.
And so that's what became Dolly.
But yeah, we've been looking at the space for a while and seen, you know, incredible demand for these kinds of applications.
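To make that recipe concrete, here is a minimal sketch of the approach Matei describes: fine-tune a small open-source causal language model on machine-generated instruction/response pairs. The model and dataset IDs below (GPT-J 6B and the Alpaca dataset, which the Dolly announcement names) are assumptions for illustration, not the actual Databricks training script, and the hyperparameters are placeholders.

```python
# A sketch of Dolly-style instruction tuning, assuming the Hugging Face
# transformers/datasets stack, the GPT-J 6B base model, and the Alpaca
# instruction dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL = "EleutherAI/gpt-j-6B"  # ~6B-parameter open-source base model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL)

# ~52K machine-generated instruction/response pairs (the Alpaca dataset).
data = load_dataset("tatsu-lab/alpaca", split="train")

def to_features(example):
    # Serialize each pair into one training string (the dataset's optional
    # context field is ignored here for brevity).
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}")
    tokens = tokenizer(text, truncation=True, max_length=512,
                       padding="max_length")
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: next-token loss
    return tokens

train_set = data.map(to_features, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-sketch", num_train_epochs=1,
                           per_device_train_batch_size=1, fp16=True),
    train_dataset=train_set,
).train()  # a few hours on one GPU server, per the episode
```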
Yeah, I think the industry has really been very focused on scaling data, parameter size, and flops.
And I think you all really have showcased the power of instruction following, even, you know, something that's relatively smaller scale.
Could you explain that and how that all works?
It's very interesting.
And I think there's actually a lot of research still to be done here, because these models have been mostly locked up in these very large companies for a while, and everyone thought it's too hard to reproduce them.
So the interesting thing is, you know, language models had existed for a while.
You basically trained them to complete words.
You know, here's a missing word in the text, can you fill it in?
And then at the beginning, when people tried to apply them to real applications, not just, you know, I erased a word on my homework, like fill it back in, but like actual applications,
they had always done various ways of, you know, training something else on top of, you know, say, the feature representation in these.
And so there was a lot of domain-specific work, but you could build, like, a sentiment classifier or stuff like that.
Is it positive or negative?
Probably like three years ago now, OpenAI published the GPT-3 paper, which is called "Language Models are Few-Shot Learners."
And they said, number one, like we trained a language model to 175 billion parameters.
And we trained it on, I think it's like 45 terabytes of text.
So lots of data, lots of parameters.
and it's like pretty good at language modeling.
And then number two, they said,
you can actually kind of prompt this
with a few examples of a task,
and it picks up on the task and does it.
So lots of people were working on that.
How do you prompt it?
What's the best example to show?
But everyone assumed that for that capability,
you need a giant model to begin with.
So even the researchers in academia
were calling into GPT-3
and trying to build stuff based on it
and study this phenomenon.
And then last year, 2022, OpenAI published this other paper, which was sort of
instruction tuning these models, where they said, hey, we used some human feedback and then
some reinforcement learning, and we got this GPT-3 model to actually just listen to one instruction.
It doesn't need a complicated prompt with lots of examples, and it kind of works. And then
they released a version of this as ChatGPT. So I think in a lot of people's minds, the scientific
view of it was, first, you need a giant model, and then you need this reinforcement learning
thing, and only then do you get this conversational capability and broad world knowledge. So it's
actually very surprising. In Alpaca, we just had a larger data set of, you know, human-like
conversations, and we had this very kind of modest-sized open-source model that's only
six billion parameters, only trained on less than one terabyte of text, so like 50 times less
data than GPT-3, and it still has this behavior.
I think it's been pretty surprising to a lot of researchers, the size of model that still
gets you this kind of instruction following ability.
So I think this is kind of an open research problem, like what exactly about these
data sets is it that makes them good at this?
What are the limitations?
You know, are there tasks that these are clearly worse at or better at?
It's actually kind of hard to evaluate with long answers, because it's hard to, like, automatically score them and say, you know, this is a good Seinfeld skit that you generated, and this is, like, a bad Barack Obama speech.
But I think we'll figure this out.
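To make the distinction he's drawing concrete, here are the two prompting styles side by side, shown simply as Python strings. The few-shot example is the translation demo from the GPT-3 paper itself; the one-line instruction is all an instruction-tuned model like ChatGPT or Dolly needs.

```python
# Few-shot prompting (base GPT-3 style): demonstrate the task with examples
# and let the model continue the pattern.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
plush giraffe => girafe peluche
cheese =>"""

# Instruction following (ChatGPT/Dolly style): a single bare request suffices,
# no demonstrations needed.
instruction_prompt = "Translate 'cheese' into French."
```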
Yeah.
Was there anything that emerged from the model that you also found surprising?
Like you mentioned one aspect of it just in terms of the approach you took.
And, you know, with dramatically more limited data and approach, you ended up with really performant behavior.
Were there other things that were unexpected properties of what you did with Dolly?
Yeah.
I think to me, the most interesting thing is it's surprisingly good at just free-form, fluent text generation.
So you can tell it to, like, create a story or create a tweet or create a scientific paper
abstract.
And it does a pretty good job at that.
And before that, whenever I talk to my, you know, NLP, like, researcher friends, they
thought that that creativity was the thing that required a lot of parameters from something
like GPT-3.
Like, they actually told me, oh, the knowledge-intensive stuff, like remembering facts,
tell me the capital of, like, France and whatever.
that's not surprising that a small model with a few parameters can do it,
but the creativity, that's, like, really hard.
So this one is actually pretty good at the creativity and generation.
It's less good at remembering lots of facts,
which kind of makes sense, given the parameters.
So if you ask it about common topics, you know, it'll be good.
If you ask it, like, the author of a book, you know, it might give the wrong one.
I think we had an example because we've actually been building a slightly bigger version of this too,
and we had this question with, like, who is the author of Snow Crash?
which is Neal Stephenson,
and the initial Dolly model said Neil Gaiman.
So, you know, it's still a Neal, it's still an author, and it's still a sci-fi writer.
Yeah, so it's less good at remembering facts,
but pretty good at coherent sort of generation.
Yeah, the name Dolly basically references
the first cloned mammal, Dolly the Sheep.
Can you explain the reference within the AI space?
Yeah, so it's based on, you know,
cloning this other model from Stanford called Alpaca,
but doing it with an open dataset.
And that itself was based on something that Meta released, I think maybe three weeks ago or less, called LLaMA, where they took a modest-sized model, seven billion parameters, and trained it on a ton of data. I think they said 1.4 trillion tokens or something like that, which is, I don't know how many bytes of data it was, but it was multiple terabytes of data, basically. And they said, hey, by just training this for longer, we got a small model that's actually producing pretty high-quality content for its size. So there were all
these kind of, you know, woolly sort of animals out there. And we thought it's just too perfect
to, like, clone it. And there are all these other things, like, you know, it's like the Dalai Lama.
I don't know. There are all these like things. Yeah. That was a great name. Yeah. That's a good
name. Yeah. Are there other things that you can share that you all have coming in the background
at Databricks or your Stanford lab in terms of this more general area of language models?
Yeah. I mean, at Databricks, definitely, you know, we're using everything we learn from Dolly
and we're learning from our customers to just offer a great suite of tools for training and operating LLM applications.
We already have a popular MLOps platform, and we also have this open source project called MLflow that integrates with a lot of tools out there that our offering is built around.
So you can expect some nice integrations into that.
You know, separately, we're also working on Databricks product features that use language models internally and learning a lot from developing those and, you know, feeding that into our products.
I think in the next few months, you can expect to see some of that.
And we also have this big user conference data AI summit coming up in June that will probably have a lot of stuff about this.
And I would say, you know, as a researcher and also kind of with my Databricks hat on,
the thing I'm most excited about is really connecting these models with reliable data sources
and making them really produce reliable results.
Because, you know, if you use ChatGPT or GPT-4, the two big problems with it are, number one,
Like, the knowledge is not up to date.
You know, it only knows stuff it was trained on.
And number two, a lot of the things it says are inaccurate,
and it's confident but, like, wrong in various ways.
And I think you can tackle both of these by combining some kind of language model
with, you know, a system that pulls out, like, vetted data,
either from documents, like a search engine or from, you know, APIs and tables and stuff like that
inside your company.
You know, like, for example, when I talk to the chatbot in my bank, it should know my latest bank account balance and transactions and stuff.
You know, if I'm like, can you cancel the payment I made because I unsubscribed?
It should just know what that means.
So cracking how exactly to do that isn't easy.
It may actually be easier with small models than with big ones to reduce hallucination from them.
But I think it's still an open question.
But I think if we can figure this out, then these become a much more reliable component in an application.
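Here is a minimal sketch of that pattern, often called retrieval-augmented generation, with made-up bank records standing in for vetted enterprise data and a simple TF-IDF retriever standing in for a real search index:

```python
# Ground the model's answer in vetted, current data instead of its parameters.
# The documents and question below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Your checking account balance is $1,204.33 as of April 5.",
    "Payment #8841 to StreamCo for $15.99 is pending and can be canceled.",
    "Wire transfers over $10,000 require two-factor confirmation.",
]

def retrieve(question, docs, k=1):
    # Rank documents by TF-IDF cosine similarity to the question.
    vec = TfidfVectorizer().fit(docs + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(docs))[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

question = "Can you cancel the payment I made? I unsubscribed."
context = "\n".join(retrieve(question, documents))

# The language model sees only vetted context, which curbs stale or
# hallucinated answers.
prompt = f"Answer using ONLY this data:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # feed this to whatever LLM you use, e.g. Dolly
```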
Maybe we'll go from there to just, like, projecting a little bit about, like, architecture and research.
You know, so much of the industry is focused on model scaling, right,
and improving reasoning that way.
Like, how much do you think that matters in terms of, I guess,
like real world usage in production with your customers in the near term?
Yeah, great question.
So to me, at least, the relationship between scale of the model, versus, you know, quality of the data and supervision you put in, versus, like, design of the application around that, and overall quality, I think that relationship is not 100% clear yet.
Like to get a really reliable model that say,
I don't know,
can, you know,
like make a pharmacy prescription or something like that,
maybe you need a trillion parameters,
you know,
maybe you actually need a really carefully designed data set
and like supervision process,
which is kind of traditional sort of ML engineering type work.
Or maybe you actually need a clever application
where like you're chaining together a couple of models and things
and you're saying, well, does this make sense?
Can I find a reference?
Can I show this example to a human if it's really hard?
So I think it's a little bit open.
The thing I can say for sure, and what Dolly and, like, other results like this really highlighted, is it does seem that the core tech is getting commoditized
very quickly.
So if you just want to run, you know, something like today's ChatGPT,
it will be a lot cheaper because all these hardware manufacturers are building devices
that are specialized and much cheaper.
And another thing that's making it less expensive
is we're figuring out ways to get a smaller model
with less data, fewer parameters and stuff
to get similar performance.
So that, I think, is happening faster
than at least I would have thought, you know, a few months ago.
So at least to get something with today's capabilities,
I think it will be very affordable
and you might just be able to run it locally on, you know, your phone or something.
The question is, you know,
if you make a much larger model,
is it going to be a lot smarter?
I think it's still a bit unknown.
I mean, there are people who argue
it's going to be very good at reasoning,
but at the same time,
this kind of token by token generation
we're doing now is not an amazing format for reasoning
because you have to linearly say one thing at a time.
So it's not really good for making plans
or comparing versions.
I think to get a really smart application,
you'll need to combine today's language modeling
with some other sort of framework around it
that uses it multiple times
or explores a plan space or whatever,
and then you might get something good.
And it's also possible that the very largest models
are simply memorizing more stuff.
So, like, they're impressive in terms of trivia.
Like, I can ask it about some random topic
and it will know, but they're not really, like,
smarter at solving even a basic, you know, word problem.
So, yeah, I'm not sure.
Unfortunately, especially with training from the web,
it's often very hard to tell apart, like, reasoning from memorization, essentially: did it see that thing before?
So I think actually being able to do experiments where you train these on carefully selected data will lead to better understanding of what they can do.
Yeah, yeah, that makes sense.
Maybe if we think a little bit, just because you have great visibility from your role at Databricks: what tooling do companies need, like your enterprise customers or just generally enterprises, to make use of these models?
Because you said, you know, we believe the core technology, the models themselves are getting commoditized.
Yeah. So definitely the first piece is you need a data platform that can actually build, you know, reliable data, right? So we think that's, like, the bread and potatoes of, like, getting anything; you need, you know, a basis to sort of build on. So we think that will become really important. And, you know, maybe data platforms will have to evolve a little bit to be better at supporting unstructured data like text and images and so on, and to do quality assessment and stuff like that for it. That's one piece. I think another piece you need is the MLOps piece of being able to experiment with things, deploy them,
A/B test them, and so on, and see what does better and improve it incrementally.
And I also think these models will need a good connection to operational systems inside the
company to do really powerful things with, like, the latest data.
So, you know, you saw, probably, the support for tools in ChatGPT. You know, before that, there were lots of groups working on language models integrated into search engines, sometimes into calling other tools as well, like calculators.
I think it's still a little bit open-ended.
There's one extreme where people say the model will figure out what tools to use on its own.
I think for, like, enterprise use cases, that's a little bit, like, more than you really need.
You know, you can kind of give it some tools and feed it stuff, and it doesn't have to
discover and, like, read the manual to figure out which one to use.
But, yeah, I think that's another piece you'll need for, like, really powerful applications.
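As a toy sketch of "give it some tools and feed it stuff": the application, not the model, owns a small tool registry and executes whatever structured call the model emits. The JSON format and tool names here are invented for illustration.

```python
# The app defines the tools; the model only has to emit a structured call.
import json

TOOLS = {
    # Toy calculator: eval with builtins stripped, for simple arithmetic only.
    "calculator": lambda arg: str(eval(arg, {"__builtins__": {}})),
    "account_balance": lambda arg: "$1,204.33",  # stub for an internal API
}

def run_tool_call(model_output: str) -> str:
    # Expect the model to emit, e.g., {"tool": "calculator", "arg": "17*23"}.
    call = json.loads(model_output)
    return TOOLS[call["tool"]](call["arg"])

print(run_tool_call('{"tool": "calculator", "arg": "17*23"}'))  # -> 391
```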
And then I do think infrastructure, like just basic training and serving infrastructure,
is important, too, when you start to care about performance, like about latency and speed.
And you can see some of the new search engines using these models are not that fast, right?
Like a little bit slow, you know, it would be nice to have it faster.
And for automated analytics, it's even more important that it's efficient.
So there could be, I think there'll be a lot of activity there.
Yeah.
Where do you see enterprises getting the most value from investing in, I guess, more traditional ML and
then some of the language model stuff to date?
Yeah, great question.
So traditional ML, we're seeing actually virtually all major enterprises, you know,
and all industries are using it.
It's changed a lot in the past decade, actually.
So it's very good for forecasting things in general and for automating certain types of
decisions.
So basically, like, for example, optimizing your supply chain, right?
You don't have time to look at, like, exactly everything that's going on and, you know,
think about it and have a meeting.
But if you do order the right amount of parts to meet your demand this week, or if you minimize the amount of time, you know, an agricultural product sits in a warehouse and, like, degrades in quality or stuff like that, it matters a lot. And it can have a huge impact on the profitability of a company. So we're seeing a lot of that, people applying it to automate, you know, supply chain and to automate basically their operations in various ways. And then there are more classic use cases like fraud detection and stuff like that, where it's always an arms race and you're trying to do the best you can, because every percent of accuracy you do better can translate into, you know, huge impact.
With language models specifically,
and especially with kind of conversational ones, I think the, you know, the really exciting thing
is interfaces to people. And I think customer support is a very obvious one. Maybe things like
recommendations or asking questions on a product page, you know, in retail, things like search
augmented with stuff is one. And we've also found that just internal apps in a company that have
a lot of internal data can benefit from this kind of thing. So like one of the things, you know,
we've built, for example, is inside Databricks. We have all these resources for, you know,
engineers to understand how different parts of the product work, how to operate it, like all the
APIs. And, you know, people used to just ask each other questions in these Slack channels for
each team. And we could use that data, like the questions and answers, plus the data,
you know, in the actual documentation, to
essentially automatically answer many, many such questions and just save people a lot of time.
So I do think that any app that has kind of business data or like stuff written by humans in it,
like your issue tracker for your software development or like your Salesforce or something like that
could benefit from these kind of interfaces. Yeah.
Yeah, it seems like any type of forum or anything else instantly becomes like data that you can use
to fine tune or train a model that's specific to your customer support use case or you could
use an embedding or something to do interesting things with it. So it seems like some really cool
stuff to do. Are there any specific areas that Databricks is not focused on that you think
would be especially interesting for somebody to build from a tooling perspective for
enterprises trying to use some of these technologies? Yeah, I think there are a lot of these. I think
it's very early on. Probably one of the most obvious ones is just the domain or vertical-specific
models and tools. I actually think even a lot of the enterprises that have a lot of the data
in various domains might turn more into data or model vendors of some form in the future as they
use this to build something that no one else can. So I wouldn't be surprised at all if you see like
the next wave of companies for, say, security analytics or like, you know, biotech or analyzing
financial data or stuff like that, really build around LLM technology in there. And I also think
in general, in the app development space, like, how do you develop apps that incorporate these
tools? It's very open. It's not clear what the best way to do it is. And, you know, you might end up
with, like, really good programming tools that focus on this problem. I would say, you know,
for people thinking about startups and so on, like you want your startup to have, you know, a long-term
defensible moat, ideally something that grows over time also. So anything around a unique data set, for example, or a unique, like, feedback interaction you have, is always good, right? Like, honestly, even something like adding ML features in your product that just kind of learn from your users and, you know, do better recommendations and so on could eventually become a moat, like, you know, others just can't easily catch up. But I think that, you know,
anything that's on custom data sets is sort of safest. Yeah. When you were working on Spark for
your PhD, did you think you'd become a founder? Was your intention to start a company? Or did you
just think it was interesting research to do or both?
No, it really wasn't. Yeah, I mean, as a grad student, you know, I've always been interested in just, like, doing things that help people, that have an impact, help people do cool things. And, you know, I had seen these open source technologies out there for distributed data processing. I thought, okay, well, I'll try to start one and see how it goes. You know, I wasn't sure that people would really pick it up and use it. But I wasn't looking to be a founder necessarily. I was just looking to do something useful in this, like, emerging space. And honestly,
I was at least considering becoming a computer science professor, and I thought, if I'm going to be a
professor and all the most exciting computing is happening in data centers today, and I don't
know how that works, how am I going to teach computer science to people? So I better learn about that
stuff. But it turned out to be something more broadly interesting. What was the most unexpected
thing about being a founder? There are a lot of challenges along the way. I think just being able to
learn about all the aspects of a business and how much complexity there is in each one.
You know, starting out as a more technical person, at first I didn't really know what to expect, but there's a ton of depth in each one. And if you understand them, if you, like,
really try to understand them, get to know the culture of people there, like, really get to know
what they're thinking about. You can make much better decisions across multiple aspects of your
company. Is there anything that you would advise people coming from a similar background to yours? I have a PhD as well, although it's in biology. And I feel like there's certain things that
I learned in academia that was really valuable. And then there's a bunch of stuff I really needed
to unlearn as I went into industry. Are there specific pieces of advice you'd give to technical founders or PhD founders in terms of things that they should unlearn? Let's see. Like, a lot of research,
at least in computer science, the kind of stuff that I've worked on, a lot of research is basically prototyping. It's like, can we showcase an idea? But it's not really software engineering of
like, we'll build a thing that can be maintained and, like, runs flawlessly in the future and,
like, supports, you know, problems. So I think you should kind of unlearn just the focus on
short-term stuff and think about how is this going to go over time. Eventually, right, there is a
phase of the company where you're just prototyping to get a good fit, but you should design things
so they can evolve into, you know, into something that's very reliable long-term. The other thing
is, you know, I think unlearn trying to invent everything from scratch, you should really be
careful about like, hey, where am I doing something unique? Or if I'm doing something different
from others, like, why is that? Right? Don't do it just for kicks. Because in research, it's very tempting to say, you know, I did this new thing. I'm going to, you know, try all
the fanciest, like, new ideas in each component of it. Was there something that you guys, like, experimented with being, you know, first-principles unique about, that you then said, you know, there are systems for this? A good one early on was deployment infrastructure, for, like, how do we
deploy and update our software across, you know, all the clouds and so on. And we soon realized
it's better to go with really standard things like Kubernetes and tools like that than to try
to do something custom because they're evolving very quickly. So yeah, that's kind of a good
example where, like, at the beginning you say, ah, how hard can it be? You know, let's just
build something. But then you realize, wait, every month there's like new stuff coming out and
maybe this isn't where we want to focus. So maybe just thinking about being, like, CTO now of a very large company: how has your lens as a researcher, a computer science researcher, informed your thinking as a CTO?
I think, first of all, as a researcher, like, you think a lot about the long-term trends.
Like, what, you know, what could things look like five or ten years from now?
What's the, what's kind of the fundamental things here?
So, for example, this thing about LLMs being commoditized, or, honestly, the thing about them
kind of maxing out at more parameters, I think many people hadn't really thought about that.
But if you think back, like, you know, there is a lot of room to improve efficiency usually in hardware and software for an application.
And this particular application is kind of simple because it is all basically like, you know, two or three different types of matrix operations.
So like it's sort of the hardware designer's dream to do this stuff.
And also there are usually diminishing returns from scale in terms of quality of models in general.
And you can also kind of see it in other areas, like in computer vision, for example:
we don't have trillion-parameter models.
You get actually pretty small models
that you can train for a specific task that are good.
Self-driving cars is another example.
You know, they rapidly improved in quality up to a point
and then they kind of plateaued
and they're still not really, you know, ready for prime time.
Eventually, you hit some limits.
There are plenty of people who are researchers in the field
who don't really see an asymptote, right, with scaling.
And so where do you believe that limit comes from?
Like, parameters, compute, data, something else?
I just think a lot of things like scale sublinearly in general.
Now, it's hard to tell for things like reasoning and so on.
But certainly in classical machine learning, like, for example,
if you're trying to learn a function that like separates positive and negative examples,
and as you add more data, like your accuracy doesn't really improve linearly.
Like, you know, with a few examples, you get a pretty good estimate of that boundary.
And then with more of them, it gets a little.
little bit better, but it doesn't get like that much better. So it's just, I think it's common.
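Here is a quick, self-contained illustration of that sublinear curve on a linearly separable toy problem; the exact numbers will vary by seed, but accuracy climbs fast on the first examples and then flattens.

```python
# Learning a linear decision boundary: accuracy vs. training-set size.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_test = rng.normal(size=(2000, 2))
y_test = X_test.sum(axis=1) > 0  # true boundary: x1 + x2 = 0

for n in [10, 100, 1000, 10000]:
    X = rng.normal(size=(n, 2))
    y = X.sum(axis=1) > 0
    acc = LogisticRegression().fit(X, y).score(X_test, y_test)
    print(f"{n:>6} examples -> test accuracy {acc:.3f}")
# Each 10x more data buys less and less once the boundary estimate is good.
```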
Now, the one thing, so with language models specifically, I think the part
that does scale linearly with more parameters, or should, is the ability to just memorize more stuff.
So if you want it to tell you, like, who was in the fifth episode of, like, Friends, and what was the second line they said, and stuff like that, then yeah, more parameters will get you a neural network that, just by putting in that input, can tell you that stuff. But that wasn't that
interesting to me, because I think the right solution for that is to look things up in a database, like, do retrieval, or do a search index. Actually, I think from a computation perspective,
it's very inefficient to have like a trillion parameters and have to actually load them all
and add and multiply by them each time you make an inference because they're just encoding
knowledge, most of which you don't need for that inference. So that one I wasn't as excited about,
But I think there are people who are just excited about neural networks, like, how do, you know,
it's the same kind of people who wonder, like, how do brains work, like, how do animals learn,
who are just excited about, wait, I only had some neurons and I put in this stuff and it remembered it.
But as an engineer, I'm not that excited because I'm like, yeah, I could have built a database that did that.
But in terms of like, hey, I just trained a network with gradient descent and it did it.
That is kind of cool.
Yeah, I feel like people are almost the opposite, where we're actually quite bad at memorization but very good at inferring things.
And so it's interesting to ask what is the basis for that computationally.
Yeah.
But the other thing that we're learning from this, though, is it does seem that the type of data you put in and the kind of fine-tuning, essentially it's like weighting the data, has a lot of impact.
So this instruction tuning stuff is like, really, we have only a few examples of instruction following,
but since we do fine-tune the model, it's as if we put a very high weight on it
and had lots of examples of that in our training set.
And I mean, I think it's still an open question.
Like, for example, if you made a lot of examples of logical puzzles, right?
Like you just generate some problems and solutions.
Would you get a model that's better at logical reasoning?
There are other things you can do.
I also think a big problem with current models, I think I hinted at this before,
is we're just calling them to generate one token at a time.
So, for example, you've probably seen this chain of thought reasoning thing.
Like, if you ask a model a math problem and it just tries to answer,
like how many sheep were there.
It might say like seven or something.
And then it tries to make up the explanation.
And it's like wrong.
But if you tell it, do the explanation for things step by step and then answer,
it's more likely to be right.
But you can imagine other versions of that.
Like if it had a scratch pad, if it had a way to backtrack to say, you know,
this is kind of a dead end, it might become better.
So I think stuff like that, that's kind of around the model, where it's still an AI system but not just one giant DNN, can further improve its ability.
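The chain-of-thought trick he mentions can be shown in two prompts; the word problem below is invented for illustration.

```python
# Asked directly, a model often guesses an answer and then rationalizes it;
# asked to show its steps first, it is more likely to get the arithmetic right.
direct = ("Q: A farmer has 3 pens with 5 sheep each, and 4 sheep escape. "
          "How many are left? A:")

chain_of_thought = ("Q: A farmer has 3 pens with 5 sheep each, and 4 sheep "
                    "escape. How many are left?\n"
                    "A: Let's think step by step.")
# The model now writes out 3 * 5 = 15, 15 - 4 = 11, and only then answers 11.
```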
Yeah. And you've seen that work in like really complex and impressive ways. Like we had Noam Brown
from the sort of Cicero group on and, you know, they have planning as part of it, right,
versus it's just one one very large model. They expect to do all the reasoning. You were actually
saying like, you know, you basically make like sometimes controversial like long-term
predictions about what's going to happen. Like, you know, there's an asymptote in sort of value of
scale. Yeah, as a researcher. And how does that impact like your decisions as CTO? So especially as a
company goes, right, it's like actually it becomes slower to change direction super dramatically. So
you really want to think about like what will be too long term or, you know, our CEO or Lee has this
decision rule of like with any decision I ask about like, hey, which one am I like sort of more likely
to regret like five years from now? Not five months from now, but like, you know, if I don't do
this or whatever, like, what's going to happen. So you try to think about where things are going to go.
Of course, you do want to collect data and, like, sort of update your thoughts about it, test
hypothesis. And that, I think, is something you can get from research, too. Like, in research,
we always think when we have an idea, it's sort of a race to, like, figure out, is it a good
idea? And can I publish it? Because the research community values novelty a lot, being the first to do something, you know, for better or worse. It's not amazing, but if you just reproduce a thing that someone else did, unfortunately,
you don't get as much credit.
So we do think about how can we quickly validate something.
But at the same time, and even in research, I had the same thing.
You tried to pick topics that will matter.
Like, for example, when I was doing my PhD, I didn't do a ton with machine learning.
And, you know, I knew people who did it.
I helped them out.
I built infrastructure.
But I didn't do ML research myself.
And then later, I kind of decided, like, yeah, I am going to do some things,
especially around this, like, you know, connecting machine learning to external data sources like search engines.
And I know it's going to take a while to really learn about it and get an intuition and
stuff, but I think this is going to matter long term because I think the local, like,
you know, parsing semantics of what the sentence means is kind of solved already.
And the interesting thing will be like, you know, doing this in a bigger system.
Yeah, I have four degrees and no PhD.
I've never contributed anything to the corpus of the world's knowledge.
Elad, got to ask, does it affect how you do investing?
No, not really.
The PhD, moot point. Nice.
I don't know. I have a math degree as well, and I feel like that actually was a thing that
forced me to think slightly differently, or at least it forced a very logical way of thinking.
I felt like there's a groove in your brain for logic that gets carved.
So that probably helped, but who knows, I don't know.
You've been working in data machine learning for a long time.
Like, where do you think we are in this generation of AI?
Yeah, I think we're still at the early stages of AI on unstructured data, so things like text and images and so on, really having an impact in applications.
So I think ChatGPT-related features that every application is going to add will change the way we work with computing.
And they'll also change data analytics to some extent because you'll be able to use this data.
And honestly, I also think that in terms of just basic data infrastructure and ML infrastructure,
we're still pretty early also.
There are still many different tools you have to hook together,
a lot of complex integration,
and you need a lot of sort of specialized people to do it.
And I think over time,
like I increasingly think that basically,
especially because of the capabilities of these AI models,
every software engineer will need to become an ML engineer
and a data engineer also as they build their application.
And we'll figure out ways of doing them, recipes or abstractions or whatever, that are actually
easy enough for everyone to do. And one analogy I like is, you know, when I was learning programming,
which was sort of like, you know, mid-late 90s, I got these books on, you know, web applications.
And it was very complicated. There was a book on MySQL. There was a book on Apache web server,
like, CGI bin, all these things you have to hook together. And now, you know, most developers can
make a web application in like one function. And even non-programmers can make something like
Google Forms or Salesforce or whatever, that's sort of, you know, basically a custom
application.
So I think we're far away from that in data and ML, but it could sort of look like that.
It's harder because it depends on this sort of static data that you've got sitting around,
but I do think there are going to be a lot more of these applications.
Matei, this is a great conversation.
Thanks for joining us on No Priors.
Thanks so much.
Thanks a lot, Sarah.
Thank you.
Thank you for listening to this week's episode of No Priors.
Follow No Priors for a new guest each week
and let us know online what you think
and who in AI you want to hear from.
You can keep in touch with me and Conviction by following @Saranormous.
You can follow me on Twitter
at @EladGil.
Thanks for listening.
No Priors is produced in partnership with Pod People.
Special thanks to our team,
Synthel Galdaya and Pranav Reddy
and the production team at Pod People.
Alex Vigmanis, Matt Saab,
Amy Machado,
Ashton Carter, Danielle Roth,
Carter Wogan and Billy Libby.