The Changelog: Software Development, Open Source - ANTHOLOGY — Open source AI (Interview)
Episode Date: May 24, 2023
This week on The Changelog we're taking you to the hallway track of The Linux Foundation's Open Source Summit North America 2023 in Vancouver, Canada. Today's anthology episode features: Beyang Liu (Co-founder and CTO at Sourcegraph), Denny Lee (Developer Advocate at Databricks), and Stella Biderman (Executive Director and Head of Research at EleutherAI). Special thanks to our friends at GitHub for sponsoring us to attend this conference as part of Maintainer Month.
Transcript
Welcome back, friends. This week on The Changelog we're taking you to the hallway track of
the Linux Foundation's Open Source Summit North America 2023 in Vancouver, Canada.
This episode is part of our maintainer month celebration along with GitHub and many others.
Check it out at maintainermonth.github.com.
Today's anthology episode features Beyang Liu,
co-founder and CTO at Sourcegraph,
Denny Lee, developer advocate at Databricks,
and Stella Biderman, executive director
and head of research at EleutherAI.
The common denominator of these conversations
is open source AI.
Beyang Liu and his team at Sourcegraph
are focused on enabling more developers
to understand code
and their approach to a completely open source
model agnostic coding assistant called Cody
is of significant interest to us.
Denny Lee and the team at Databricks
recently released Dolly 2.0.
This is the first open source instruction following LLM
that has been fine-tuned on a human-generated instruction dataset
and is licensed for research and commercial use.
And Stella Biderman gave the keynote address on generative AI
and works at the base layer doing open source research,
model training, and AI ethics.
She trained the EleutherAI Pythia model family that Databricks used to create Dolly 2.0.
A massive thank you to our friends at DevCycle, and their CTO and co-founder Jonathan Norris. So Jonathan, my main question, I guess, if I'm handing off my feature flags to you
all, is my uptime dependent on your uptime? Like if you're down, am I down?
We've designed for that in all the SDKs and all the APIs. APIs fail, right? That's a cardinal rule
of the internet. So all the SDKs have been designed with kind of defaults and caching
mechanisms and all that stuff in place so that, yeah, if our CDN is down or APIs are down, it'll sort of fall back to
those defaults or those cache values in those SDKs. So that handles for those blips pretty
easily. And then we rely on Cloudflare as our sort of main high load edge provider. So all of our
edge APIs are through Cloudflare and they're also operating as our CDN for assets. So obviously
relying on a large provider like that, that runs such a large percentage of the internet,
means that, yeah, you're not relying on our ability
to keep AWS instances running properly.
You're relying on sort of Cloudflare
and their ability to sort of make sure the internet still works
as they control such a large percentage of it.
So yeah, we've architected it in a way
that it doesn't sort of rely on our APIs
to be up all the time and our databases
to be up all the time to have that good reliability.
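To make the fallback behavior Jonathan describes concrete, here is a minimal sketch in Python, assuming a generic feature flag client rather than DevCycle's actual SDK; the config URL, the bundled FLAG_DEFAULTS, and the cache TTL are all hypothetical stand-ins for the defaults-plus-cache pattern he's describing.

```python
import json
import time
import urllib.request

# Hypothetical values -- not DevCycle's real endpoint or API.
CONFIG_URL = "https://config.example-flags.com/v1/flags.json"
FLAG_DEFAULTS = {"new-checkout": False, "dark-mode": True}  # bundled defaults
CACHE_TTL_SECONDS = 60

_cache = {"flags": None, "fetched_at": 0.0}

def get_flags():
    """Return feature flags, preferring fresh config, then cache, then defaults."""
    now = time.time()
    if _cache["flags"] is not None and now - _cache["fetched_at"] < CACHE_TTL_SECONDS:
        return _cache["flags"]  # recent cached config
    try:
        with urllib.request.urlopen(CONFIG_URL, timeout=2) as resp:
            flags = json.load(resp)
        _cache["flags"], _cache["fetched_at"] = flags, now
        return flags
    except Exception:
        # CDN/API blip: fall back to the last cached value, or the bundled defaults.
        return _cache["flags"] if _cache["flags"] is not None else FLAG_DEFAULTS

def is_enabled(flag_name: str) -> bool:
    return bool(get_flags().get(flag_name, FLAG_DEFAULTS.get(flag_name, False)))

if __name__ == "__main__":
    print("new-checkout enabled:", is_enabled("new-checkout"))
```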
Well, that's good news.
Okay, so how do you accomplish that?
One of the core sort of architectural decisions we made with our platform
when we designed it was trying to move the decisioning logic
of your feature flags as close to the end user and end device as possible.
So we did that with those local bucketing server SDKs
that are using sort of a shared WebAssembly core.
And then we have edge-based APIs
that are also powered by WebAssembly
to serve sort of those client SDK usages.
So things like web and mobile apps.
So that's one of our core principles
is to try to get that decisioning logic
as close to the end device as possible.
And this is probably one of the only use cases where performance really matters because you
want your feature flags to load really, really quickly so you can render your website or
you can render your mobile app really quickly.
And so, yeah, we definitely understand that your feature flagging tool needs to be fast
and needs to be really, really performant.
So if you want a fast feature flagging tool that's performant and is not going to impact
your uptime, check out our friends at DevCycle. That's devcycle.com slash changelogpod. And for
those curious, they have a free forever tier that you can try out and prove to yourself and your
team that this is going to work for you. So check it out, devcycle.com slash changelogpod and tell them
we sent you.
So, Cody.
Yeah, Cody.
Is Cody...
Is this a big deal?
We think it is.
Seems like it.
Wasn't it last year that Sourcegraph 4.0 was relaunched as the intelligence platform?
Yep.
Is that right?
Because before it was just code search, which was cool, but it was hard to really map out the ecosystem.
And you want all the space in there,
but there was a limit to code search.
And you had to expand into the insights and the intelligence.
And now, obviously, Cody is just like one more layer on top of insights.
Yeah, totally.
So, as you know, Sourcegraph historically has been focused
on the problem of code understanding.
So, heavily inspired by tools like code search inside Google
or TBGS inside Facebook.
Right.
These kind of systems that indexed your company-wide code base as well as your open source dependencies
and made that easy to search and navigate.
And that's what's been powering the business for the past 10 years.
This is actually the 10th year of building Sourcegraph.
Congratulations.
Thank you.
I was just wondering about that.
Wow. Because when we first met you,
it had to be about a decade ago.
I think Sourcegraph either didn't exist or had just
come into existence. Sourcegraph existed when we met.
This was like GopherCon. I think it was like
2014. The first or second GopherCon.
GopherCon. Yeah. And you had
this vision of Sourcegraph.
And I'm wondering 10 years later,
have you achieved that vision?
Has the vision changed, et cetera?
You know, our mission was always to enable everyone to code.
And we actually took a look at our seed deck recently.
Was it quaint?
It was very quaint.
We were very bad at PowerPoint.
You're probably a lot better at it now.
Not really.
Better at the pitch, maybe.
Maybe.
You were fine at your pitch.
Largely, I could deliver that pitch today off that deck.
It's basically the same.
It's just the pitch of Sourcegraph,
which is there's never been more code in the world.
Most of your job as an engineer
or software creator
is understanding all the code
that already exists in your organization.
Yeah.
Because that is all upstream
of figuring out what code you want to write.
And then once we actually figure out
what you need to build,
like, that's almost the easy part.
It's also the fun part, right?
Because you're building new things
and shipping stuff.
But we help you get to that point of,
you know, creation and enjoyment
by helping you pick
up all that context.
Right.
Traditionally, that's been like search, right?
Just like Google has been for web search, but then these large language models have now come
on the scene.
Yeah.
In some ways, they're disruptive to kind of like search engines, but in other ways, they're
highly complementary.
So anyone who's used ChatGPT...
I'm still Googling.
I just, less.
It's just less
right
it's more like
the last thing you do
when you can't get
the answer elsewhere
right
I guess I'll go
Google it
yeah
although technically
Google's a weird thing
because I will search
a product
and they think
I want to buy it
not research it
right
it's like
I want to learn
about the thing
and those who are
teaching about the thing
and how it integrates
other things
not where can I buy it and for how much.
Yeah.
So there's like zero context there.
Like they're incentivized, it seems, to point you to places that you can purchase it, not
learn how to use it.
Yeah, yeah.
I mean, I think there's an interesting discussion.
Which is the opposite of ChatGPT.
Yeah.
So there's kind of like pluses and minuses to both, right?
Like with Google, you get results to actual web pages
and you can kind of judge them based on the domain.
It's kind of like more primary source material, which is useful.
It's also live.
You know, you get results from 2023 rather than 2021.
Sure.
Whereas ChatGPT...
That'll change.
That's a temporary thing, right?
I mean, the delay will be temporary.
Eventually, it'll catch up.
Well, I mean, GPT-4 is still, it came out recently.
It's still 2021.
Right, but isn't the plug-ins and all that stuff where it's like,
okay, the model is old, but it has access to new data.
So the plug-ins is actually where it gets interesting
because that's where things get really powerful, in my opinion.
Because if you ask ChatGPT with the plug-ins enabled,
it can go and browse the web on your behalf.
So it's not just the base model trying
to answer your question from memory anymore.
It's actually going and essentially Googling
for things.
It has access to what you would do.
Behind the scenes.
Exactly.
So it's the best of both worlds.
And essentially, we're doing that with Cody,
but in your editor, for developers. So basically combining large language models like GPT-4
or Anthropic's Claude model,
and then combining that power with the most advanced code search engine in the world.
So it's the best of all worlds.
It gives you highly context-aware and specific answers about your code,
and it can also generate code that's kind of tuned to the specific patterns in your code base,
not just the kind of like median Stack Overflow or open source code.
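A rough sketch of the pattern Beyang is describing, not Cody's actual implementation: retrieve relevant snippets with a code search backend, stuff them into the prompt, and let the language model answer. The search_code and complete callables below are hypothetical stand-ins for a code search engine and an LLM completion API.

```python
from typing import Callable, List

def build_prompt(question: str, snippets: List[str]) -> str:
    """Stuff retrieved code snippets into the prompt so the model answers
    from the user's own code base rather than from memory alone."""
    context = "\n\n".join(f"Relevant code:\n{s}" for s in snippets)
    return (
        "You are a coding assistant. Use the context below when answering.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question: str,
           search_code: Callable[[str], List[str]],   # stand-in for a code search engine
           complete: Callable[[str], str]) -> str:    # stand-in for an LLM completion API
    snippets = search_code(question)           # 1. retrieve context for the question
    prompt = build_prompt(question, snippets)  # 2. assemble the prompt
    return complete(prompt)                    # 3. let the language model answer

# Toy usage with stubbed-out dependencies:
if __name__ == "__main__":
    fake_search = lambda q: ["def render_pdf(md: str) -> bytes: ..."]
    fake_llm = lambda p: "(model output would appear here)"
    print(answer("How do we compile a PDF from Markdown?", fake_search, fake_llm))
```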
How did you get there?
How did you think, wow, I mean, obviously, LLMs are a big deal, right?
This new wave of intelligence that we have access to. How far back is this in the making?
Has this been years, or has it been like,
wow, ChatGPT is crazy.
November.
ChatGPT, GPT-3.5, came out in November.
It's like, okay, we've got to move.
How far back does this go?
Yeah, good question.
Yeah, so for me personally, it's kind of a bit of a homecoming.
So my first interest in computer science, actually,
was machine learning and artificial intelligence. That's what I spent a lot of my undergrad doing. I was actually
part of the Stanford AI Lab doing vision research in those days under Professor Daphne Koller.
She was my advisor. And so I did a lot of work there. It was super interesting and I felt
really passionate about it. There's just a lot of elegant math that goes into things
and it feels like you're kind of like poking at some of the hidden truths of the universe a little bit.
But the technology at that point was just,
it was nowhere near commercializable.
And so I decided to pursue my other passion,
which is developer productivity and dev tools,
and kind of like stayed on top of the research
as it was coming along.
And I think one of the inflection points for us
was the release of GPT-3
because that was kind of a step function increase
in the quality of the language models.
And we started to see some potential applications
to developer tools and code.
And we really started in earnest
maybe a little over a year ago,
maybe 12 to 18 months ago,
experimenting with the kind of like internal representations
of language models as a way to enhance code search.
So we actually put out an experiment called CodeSearch.ai
that uses embeddings to enhance the quality of code search results that you get.
And that was pretty successful as an experiment.
I think we released that probably middle of last
year, so about a year ago.
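The embeddings idea can be sketched in a few lines, with the caveat that this illustrates the general technique rather than that actual experiment: embed the query and each snippet, then rank snippets by cosine similarity. The embed function here is a toy placeholder; a real system would call an embedding model.

```python
import math
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text: str) -> List[float]:
    """Placeholder embedding: a bag-of-characters vector. A real system would
    call an embedding model instead."""
    vec = [0.0] * 128
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def rank_snippets(query: str, corpus: Dict[str, str], top_k: int = 3) -> List[str]:
    """Rank code snippets by similarity of their embeddings to the query embedding."""
    q = embed(query)
    scored = sorted(corpus, key=lambda name: cosine(q, embed(corpus[name])), reverse=True)
    return scored[:top_k]

if __name__ == "__main__":
    corpus = {
        "pdf.go": "func RenderPDF(md string) ([]byte, error) { ... }",
        "auth.go": "func Login(user, pass string) error { ... }",
    }
    print(rank_snippets("how do we render a pdf from markdown", corpus))
```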
And that kind of started us down the road, and then
of course when ChatGPT came out,
that was also another big inflection
point, and that's
when we started to think very seriously
about kind of like a chat-based
interaction that could
happen in your editor, have all the advantages of ChatGPT,
but know
about the specific context of your code.
And so for Cody specifically, I think the first commit was December 1 or something like that.
And by February, we basically had a version that we're having users and customers try.
And then March was when we rolled out to our first enterprise customer.
So it's just been like this whirlwind of development activity.
And I don't know, I cannot remember a time where I've been more excited
and just eager to build stuff because we're living through interesting times right now.
It is.
This is the eureka moment that we've all been waiting for, basically, right?
I mean, this is the invention of the Internet all over again,
potentially the iPhone level
invention.
I think it's a dramatic paradigm shift in how we think as engineers and software developers.
Like, how do we learn?
How do we leverage?
How do we augment?
You know, it's just insane what is available to somebody who doesn't have an understanding
to quickly get understanding and then be, you know, performing in a certain task or whatever, because of the LLMs that are available
and how it works. It's so crazy. The chat interface is pretty simple though, right?
Like the simplicity of a chat interface. Did you expect this eureka moment
to be simply chat? Like, you know what I mean? It's a web app.
Yeah.
It's not something else.
It's a web interface.
It's a chat interface.
I think, so, you know,
I'm a programmer by background,
so I've been, like, pushing,
I've been trying to spread
the gospel of textual-based input
for, you know,
as long as I can remember.
Obviously, it's mostly
fallen on deaf ears
because, you know,
the non-programming world
is like, you know, command line? What is this, like, the 1980s? But I actually think
philosophically, like textual input, the reason I like it is because if you think about just like
the IO, like bit rate of human-computer interaction, it's like we live in a time where we have 4K screens
running at 60 or 120 hertz.
The sheer amount of data that computers can feed into us
through our eyeballs is huge.
Whereas in kind of like the point-and-click mouse world,
it's like how many bits per second
can you really feed into the computer as a human?
And now, textual input doesn't get us all the way there to 4K times 60 hertz, but it
basically 10x's or more the input bit rate of what we can do to instruct machines.
I think it's a great win for human agency.
We want to be programming computers, not the other way around.
And I think a lot of the technology that has emerged
over the past 10, 15 years
has been computers programming
us as humans a little bit in terms of
all the stuff that we consume.
And so, yeah, I'm
super excited for textual-based inputs.
I think chat is kind of like
a subset of that.
The way we think about Cody evolving is
really it's going to evolve in the direction of just like this
rich REPL. So it's not necessarily
going to be like, oh, it's a human-like
thing that you talk with conversationally. It's more like
if you want to do a search, you type something that looks like a search query, it knows
that you want to do a search, shows you search results. If you ask a
high-level question, it knows that you're asking a high-level question,
it gives you an answer that integrates
the context of your code base. If you want
to ask a question about your production logs,
or maybe something about
something someone said in chat, or
an issue or a code review,
it should pull context from those sources
and integrate that and
both synthesize an answer to
your specific question, but also refer
you back to the primary sources
so that you can go and dig deeper
and understand more fully how it got to its answer.
So we think chat is just the starting point.
It's really just this rich REPL
that's going to integrate all sorts of contexts,
like whatever piece of information
is relevant to you creating software.
This is kind of like the thing that focuses that
and pulls it all in.
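One way to picture that "rich REPL" idea is a router that inspects the input and dispatches to the right capability. The heuristics below are a toy illustration of the concept, not how Cody actually classifies intents.

```python
def route(user_input: str) -> str:
    """Very rough heuristic router: decide whether the input looks like a search
    query, a high-level question, or a request that needs context from other
    systems (production logs, issues, code review)."""
    text = user_input.lower().strip()
    if any(phrase in text for phrase in ("production log", "error rate", "stack trace")):
        return "fetch-logs-context"
    if text.endswith("?") or text.split()[:1] in (["how"], ["why"], ["what"], ["where"]):
        return "answer-question-with-codebase-context"
    return "run-code-search"

if __name__ == "__main__":
    for q in ("repo:backend renderPDF",
              "How does the auth middleware work?",
              "Why is the error rate in the production logs spiking?"):
        print(q, "->", route(q))
```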
It really seems like that, at least as an interface,
you're seeing that as
the future of what Sourcegraph is, isn't it?
Or is there more to Sourcegraph than that
in the future? So the way we think about it
is like we spent the past 10 years
building the world's most advanced code understanding
tool. So we have the best code search, we have
the best code graph, so the
global reference graph across
all the different languages in
the world.
We have a large scale code modification, refactoring system, and a system to track high level insights.
So there's all these back end capabilities that are really, really powerful.
And what language models have done is given us a really, really nice beginner friendly
interface to all that power.
And I think you're going to see this
across all kinds of software.
It's like historically,
building power user tools has been difficult
because the on-ramp to getting full,
taking full advantage of those tools
has been a little steep.
Requires education, yeah.
Yeah, and so if you're worried about the on-ramp,
maybe you end up constraining your product a little bit
just to make it simpler,
dumb it down for the beginning user, but you lose out on the power.
I think that trade-off is no longer going to be as severe now with language models.
And so at Sourcegraph, we're basically rethinking the user interaction of the entire experience.
The underlying capabilities and underlying tech is not changing.
That's still, if anything,
that's gotten more valuable now because you can feed it into
the language model and instantly get value out of it.
But the entire user interaction layer
I think needs to be rethought.
And Cody, as your AI editor
assistant, is kind of like the
first iteration of that
thought process. How did you
iterate to the interface you're at now, and is
it a constant evolution?
Yeah, I mean it's pretty much like, hmm, I think that would be a good idea, let me
go hack it together and see how it plays.
And you play around with it and then you kind of experience it yourself and you build conviction
in your own mind and then you maybe share it with one or two other teammates and see
if they have the same wow moment and if they do, that's usually a pretty good sign that
you're on to something and there might be more details to hammer out
to make it more accessible to everyone,
but if you can convince yourself
and at least two or three other smart people out there
that there's something worth investigating,
I think that's typically a pretty good sign
that you're onto something.
How do you get access to Cody?
Not so much get access, but how do you use it
in the Sourcegraph world?
How does it appear? How do you conjure it? Yeah, so it's just an editor extension.
You can download it from the VS Code Marketplace. It's available now and it's free to use.
And we have other editors on the way. IntelliJ is very high priority for us. Also NeoVim
and of course my editor of choice, Emacs. Of course.
And we're developing it completely in the open as well.
So Cody itself is completely open source and Apache licensed.
And to get access to it, to start using it, you just install the extension into your editor and start using it.
It opens up in a sidebar. You can chat with it.
We also do inline completions.
So as you're typing, we can complete code.
Again, taking advantage of the kind of, like,
baked-in knowledge of the language model
plus the context of your specific code base.
So generating, like, very high-quality completions.
And, yeah, it's generally just as simple
as installing the extension,
and then you're off to the races.
Probably a Sourcegraph account first, right?
Yeah, so you do have to auth through Sourcegraph
because that's how we...
I mean, we wouldn't be able to provide it for free
if you didn't auth through Sourcegraph
because on the back end,
we're calling out to different language model providers,
and we're also running a couple of our own.
Okay, so accessible then,
not having to install Sourcegraph,
have it scan my repository,
like the traditional way you provide intelligence,
which is to leverage literally Sourcegraph on my repo.
I can just simply auth through Sourcegraph,
have an extension in my VS Coder in the future Emacs.
Exactly.
Then potentially.
They're kind of loosely coupled.
We don't believe in strong coupling
just for the sake of selling you more software.
And I think with Cody, the design philosophy was like,
look, if you connected to Sourcegraph,
it does get a lot better.
It's like if you gave a really smart person access to Google,
they're going to be a lot smarter
about answering your questions.
But if you don't give them Google,
they're still a smart person,
and so Cody will still fetch context
from kind of like your local code
using non-Sourcegraph mechanisms
if you're just running it standalone.
Yeah.
How does it get this intelligence as an extension?
Like, how does that...
Explain how that works.
Like, I've got it on my local repo.
It's an extension.
How does it get the intelligence from my code base?
Yeah, so it's basically...
I mean, think of the way that you would
understand or build a mental
model of what's going on in a code base as a
human. You might
search for some pieces of functionality,
you might read through the readme,
click on a couple search results. It does all that.
It's reading my readme right away? Yeah,
basically. So when you ask a question,
Cody will ping Sourcegraph for,
hey, what are the most relevant pieces
of documentation or source code
in your code base?
And then essentially, quote unquote,
read them as a language model
and use that as context for answering a question.
So if you ask a general purpose question,
it'll typically read the readme.
If you ask a more targeted question,
like, oh, how do you do this,
this one specific thing,
like read a PDF or whatever, it'll go find the places in the source code where, you know, it processes PDFs, read that in, and then interpret that through the lens of answering the question.
In real time, yeah.
Yeah, yeah.
Is there a latency to the question, to the gathering? Like, what's the speed? If I asked that example: how does my application,
you know, compile a PDF from a Markdown file, for example?
Yeah, so it typically gets back to you
within like one or two seconds.
And most of the latency is actually
just the language model latency.
So it depends on what language model
you're choosing to use underneath the hood.
All the Sourcegraph stuff is super fast
because that's just, I mean, there's no like,
yeah, Sourcegraph is fast.
We've spent the past 10 years
making it very fast and
there's no like billions
of linear algebra operations
happening with Sourcegraph. Sourcegraph is just
classical
CPU-based
code and text. What about privacy?
Yeah, so privacy
is extremely important to us, both
in terms of individual developers and our enterprise customers.
The last thing they want to do is have their private code be used as training data for some general purpose model that's going to leak their sensitive IP to the rest of the world.
So we basically negotiated zero retention policies with all our proprietary language model providers, which means that your data is never
going to get used as training data for a model.
And not only that, the language model providers
will forget your data as soon as the request is complete.
So there is no persistence in terms of remembering the code
that you sent over to complete a request.
That just gets forgotten as soon as the language model
generates a response for Cody.
And then for the rest of it, I mean, Sourcegraph has always taken user privacy
and code privacy very seriously.
It's why we've been able to serve the sorts of enterprise customers that we do.
For sure.
I know why that's important, but spell it out.
Why is that important, this zero retention policy?
What's the real breakdown of that privacy?
Why is it important to the main users?
So from a company's point of view, it's important
because you don't want to leak portions of your code base
or have them persist in the logs of some third-party data provider.
As an individual developer, I think it's just important
to give you control over your own data.
And I think that's going to be an especially important thing
in this new world that we're living in,
where before private data was valuable,
it carries value, it tells you things about a certain person
or the way they work,
and it can be used for purposes both good and bad.
Search history.
It's like search history, right?
Yeah, exactly.
You can tell a lot about a person by their search history,
their watch history, their like history.
Totally.
But now it's used for a whole other reason, right?
Yeah, and I think it's important to grant our users
and customers control and ownership over that data
because it is your data.
And I think with language models,
like language models just, they're like 10x the value
and the sensitivity of that data.
Because now, instead of, you know, just, like, feeding it into, like, a Gen 1 AI model or exposing it to some other human,
you can feed it into one of these large language models that can, you know, kind of, like, memorize everything about you as a person or a programmer.
And, you know, in some ways, maybe that's good. Like, if you're open to that, if you're
willing to share your data, we could potentially train
language models that, you know, emulate
some of the best and brightest programmers in
existence. But ultimately, we think that
should be your personal...
Opt-in.
How explicit is that in the sign-up or
the acceptance of the Cody license,
or the, you know, this
GA to now, you know, widespread usage?
How explicit are you with a new sign-up that says, I want to use Cody?
Do you say privacy and all these things you just said, basically?
How clear is that?
So when you first install it, there is kind of like a terms of use that pops up and you cannot use Cody unless you read through and accept it.
How many words is in that TOS?
It fits on like basically one page without scrolling.
Okay, so 1,000 words maybe.
500.
250.
Maybe not 250.
I think it's probably 250 to 500.
I had to go back and check specifically.
Digestible in a minute.
Yeah, we're not trying to be one of those companies that tries to hide stuff.
Well, what I mean by that is, let's try to say, are you hiding it?
But more, how clear are you being?
Because it seems like you care to be clear.
Yeah. So is that like a paramount thing for you all to be so clear that you say, hey, privacy matters.
Yes.
We don't collect.
There's zero retention.
It's spelled out really clear.
It's a bullet list saying, basically saying exactly what you said.
Privacy matters.
We don't collect data.
I wrote it for you.
We're not using.
Yeah.
Basically.
Well, Tammy, our wonderful legal counsel.
I didn't write it.
I'm just kidding around.
We all know ChatGPT wrote it, okay?
Let's be serious here.
Actually, that's a great use case for ChatGPT.
If you're asked to accept one of these lengthy end user license agreements.
Paste it in there and summarize it for me.
Paste it in there and summarize it.
Tell me if there's anything fishy.
Yes.
That would be cool for sure.
That's the best.
I cannot wait, honestly, for that to come out.
What are the loopholes in this contract?
I have nefarious action on the other side.
What are my loopholes to get out?
Right.
You know what I mean?
Yep.
For bad or good.
I guess you could use that in the bad side or the good side.
GPT for X, where X is literally everything, is going to be there.
There's going to be one specifically trained for lawyering.
Yeah, yeah.
I think language models will be a huge democratizing force in many domains.
It's democratizing understanding of legal concepts,
democratizing access to software creation.
I think it's going to be
a huge expansion of the
percentage of people
that's going to be able to access those
knowledge domains.
So let's say I'm a happy
GitHub co-pilot user.
Would I install
Cody alongside this and be happier?
Would I be less happy?
Is this a zero-sum game?
Do I need to go all in on Cody?
What are your thoughts on that?
I think it's the exact opposite of a zero-sum game.
I think there's so much left to build that the market is huge
and vastly growing.
We do have features that Copilot doesn't have.
So currently, they don't have kind of like a chat-based
textual input to ask
high-level questions about the code.
I think that's coming
in Copilot X to some extent.
Yeah, I think they announced that, but it's not out yet.
It's not out yet. If you look at the video,
the kind of context fetching they're doing, it's basically like
your currently open file, explain that.
And Cody is already doing much, much more than that.
It's reading, even if you ask it a question
about the current file, it'll actually go
and read other files in your code base
that it thinks are related and use that
to inform your answer.
So we think the power of Sourcegraph
gives us a bit of a competitive edge there
with the kind of high-level questions
and onboarding and kind of like rubber ducking use case.
And then for completions, you know,
I think Copilot is great.
But for completions, we're essentially doing the same thing.
So like the completions that Cody generates,
it takes into account that same context
when it's completing code.
So that means it's better able to kind of mimic
or emulate the patterns and best practices
in your specific code base.
And again, because we're kind of open source and model agnostic, we are just integrating
all the best language models as they come online.
So I think Anthropic, I don't know when this episode's going out, but Anthropic today just-
Pretty quick.
Okay, pretty quick?
The 24th.
Yeah, so Anthropic just announced today that they have a new version of Claude that has
an incredible
100,000 token context
window. It's just like
I think like orders of magnitude
more than
what was previously available.
And that should be, by the time this episode
goes online, it should be
available in Cody. Whereas
Copilot, I think they're
maybe someone from GitHub can correct me if I'm wrong,
but I think they're still using the Codex model,
which was released in like 2021 or something.
And so it's a much smaller model
that only has around like 2000 tokens of context window
and much more basic context fetching.
It's already incredibly useful,
but I think we're kind of taking it
to the next level a little bit.
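To put those context window numbers in rough perspective, a common rule of thumb is about four characters per token for English text and code (an approximation; real tokenizers vary by model), so the gap between 2,000 and 100,000 tokens is roughly the difference between a couple of source files of context and a hundred of them.

```python
CHARS_PER_TOKEN = 4        # rough heuristic; real tokenizers vary by model and content
AVG_FILE_CHARS = 4_000     # assume an average source file of ~100 lines x ~40 chars

for window_tokens in (2_000, 100_000):
    files_that_fit = (window_tokens * CHARS_PER_TOKEN) // AVG_FILE_CHARS
    print(f"A {window_tokens:,}-token window holds roughly {files_that_fit} such files of context.")
```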
So open source and model agnostic.
Open source, model agnostic.
We're not locking you in to a vertical proprietary platform.
Proxy friendly.
Proxy friendly.
Also enterprise friendly.
Sourcegraph, we've made ourselves easy to use in both cloud and on-premises environments.
So we're just trying to do the best thing for our customers
and for developers at large.
So because you're model agnostic,
does that mean that you're not doing any of the training
of the base layer models?
So do you also sidestep legal concerns?
Because I know with Codex and Copilot,
there's at least one high-profile lawsuit that's pending.
There's legal things happening.
There's going to be things litigated.
I'm wondering if you're a target for that now with Cody,
or if you're just not because there's other people's models.
No, we're very mindful of that.
And we actually integrate models in a couple different ways.
So we do it for kind of like the chat-based autocomplete.
There's a separate model we use for code completions,
and there's another model that we use for embeddings-based code search and information retrieval.
And it's kind of like a mix and match.
Sometimes we'll use a proprietary off-the-shelf model.
Other times we'll use a model that we fine-tuned.
But for the models that we do rely on external service providers for,
we're very mindful of the kind of evolving legal and IP landscape.
And so one of the things that we're currently building is basically copyrighted code or copied code detection.
And if you think about it,
Sourcegraph as a code search engine
is kind of in a great position to build this feature.
It's like if you emit a line of code
or you write a line of code
that is verbatim copied from somewhere else in open source
or even in your own proprietary code base,
you might be worried about just code duplication.
We can flag that for you
because we've been building code search for the past 10 years.
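The copied-code idea Beyang sketches can be illustrated with a toy fingerprinting scheme: normalize each line, hash it, and look new code up against an index of known open source or proprietary lines. Sourcegraph's actual detection is surely more involved; this is only meant to show the shape of the feature.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

def normalize(line: str) -> str:
    """Collapse whitespace so formatting differences don't hide a verbatim copy."""
    return " ".join(line.split())

def fingerprint(line: str) -> str:
    return hashlib.sha256(normalize(line).encode()).hexdigest()

def build_index(corpus: Dict[str, Iterable[str]]) -> Dict[str, str]:
    """Map line fingerprints to the file they came from (open source or internal code)."""
    index = {}
    for filename, lines in corpus.items():
        for line in lines:
            if len(normalize(line)) > 20:  # skip trivial lines like braces
                index[fingerprint(line)] = filename
    return index

def flag_copies(new_code: Iterable[str], index: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (line, source file) pairs for lines that match the index verbatim."""
    return [(line, index[fingerprint(line)])
            for line in new_code if fingerprint(line) in index]

if __name__ == "__main__":
    index = build_index({"vendor/lib.py": ["result = [transform(x) for x in items if x.ok]"]})
    print(flag_copies(["result = [transform(x)  for x in items if x.ok]"], index))
```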
Cool stuff, man.
So moving fast, what comes next?
When are you going to drop Cody 2?
It's probably like a week from now, right?
Yeah, that's a great question.
I mean, we are just kind of like firing on all cylinders here.
We have a lot of interesting directions to explore.
One direction or one dimension that we're expanding in
is just integrating more pieces of context.
So one of the reasons why we wanted to open source Cody
is because we just want to be able to integrate context
from wherever it is and not be limited by a single code host
or a single platform.
There's so much institutional knowledge
that's in many different systems.
It might be in Slack.
It might be in GitHub issues.
It might be in your code review tool.
It might be in your production logs.
And so we want to build integrations into Cody
that just pull in all this context.
And I think the best way to do that
is just to make this kind of like platform,
this orchestrator of sorts,
like open source and accessible to everyone.
The other dimension that is very exciting to us
is going deeper into the model layer.
So we've already started to do this
for the embeddings-based code retrieval.
But I think we're exploring some models
that are related to code generation
and potentially even the chat-based completions
at some point.
And that's going to be interesting
because it's going to allow us to incorporate
pieces of Sourcegraph into the actual training process. And there's been some research there that shows
that incorporating search engines into training language models actually, you know,
yields very nice properties in terms of lower latency but higher quality. And it's
also important to a lot of our customers, because a lot of them are, you know, large corporations. They
deploy on premises,
and even the zero retention policy where the code is forgotten as soon as it's sent back
over is not good enough for some of our customers.
So they want to completely be able to self-host this and we plan to serve them as well.
How high up the stack, the conceptual stack, do you think Cody can get or maybe any AI tooling
with CodeGen
with regards to how I instruct it as a
developer? Yeah. You know, because right now
we're very much like, okay, it's autocomplete.
There's a function here, right? I can
tell it, write me a thing that connects
to an API and parses the JSON
or whatever. And it can spit that out.
But like, how high up the stack can I get?
Can I say, you know,
write me a Facebook for dogs?
And be done?
For instance. Or like user stories? Can I write some user
stories and go from there? What do you think?
That's a great question. I mean, we've all seen
the Twitter demos by now where
someone is like, you know, GPT-4,
Like, build me an app. And, you know,
it creates a working app.
I think if you've actually gone through and tried that in practice yourself, you soon realize, hey, you can get to a working
app pretty quickly just by instructing it using English or natural language. But then
you get a little bit further down that path and you're like, oh, I wanted to do this, I wanted to
do that, can you add this bell or whistle? There's kind of this combinatorial complexity that emerges
as you add different features and you're diverging from the common
path. And then it falls apart. I actually tried this myself. I tried to write
a complete app. It was actually a prototype for the next version of Cody. I tried
to do it without writing a single line of code, just by writing English. And I got like 80% of the way
there in like 30 minutes. And I was like, this is amazing. This is the future. I'm
never going to code again. And then the remaining 20% literally took like four hours, and I was
banging my head against the wall, because I asked it to do one thing and it
did it. But then it kind of screwed up this other thing, and it became kind of
this whack-a-mole problem. So we're not all the way there yet, but I think the way we think about it is, Cody right now
is at the point where, if you ask it... this is another thing I tried the other day. I wanted
to add a new feature to Cody. Cody has these things called recipes, which are kind of like
templated interactions with Cody. So like write a unit test, or generate a doc string, or,
you know, smell my code, give me some feedback. Yeah. I wanted to add a new recipe, and I basically asked Cody, hey, I want to add a new
recipe to Cody, what parts of the code should I modify? And it basically showed me all the parts
of the code that were relevant. And then it generated the code for the new recipe using
the existing recipes as a reference point. And I basically got it done in like five minutes,
and it was amazing. So I was still obviously in the hot seat there.
I was still calling the shots,
but it turned something that probably would have been
at least 30 minutes, maybe an hour,
if I got frustrated or distracted
into something that was like five minutes.
And that was actually the interview question we were using
for interviewing on the AI team.
So after that, we had to go back and revamp that.
It's like, this is too easy.
Too easy now.
Everything just got easier.
Yeah.
Do you think this is like
a step change
in what we can do
and then we're going to plateau
right here for a while
and like refine
and, you know,
do more stuff
but kind of like
stay at this level
of quote unquote intelligence
or do you think it's like
just the sky is the limit
from here on out?
Like,
which, I mean, obviously it's just conjecture at this point.
Challenging to predict.
I mean, it's very challenging to predict. You know, I might be eating my words
in another six months. But, you know, on the spectrum of, you know,
oh, it's just glorified autocomplete and it doesn't really know anything, all
the way to, like, you know, AGI doomer, let's nuke the GPU data centers.
Oh my gosh.
Where do you fall?
Yeah.
Don't give them ideas.
Cancel, cancel, cancel.
Honestly, I think a lot of the discourse on that end of the spectrum
has just gotten kind of crazy.
Yeah.
Like, the way I view it is this is a really powerful tool.
It's an amazing new technology.
And, you know, it can be used for
evil, certainly, as any technology can. But I'm a techno-optimist and I think this will largely be
positively impactful for the world. And I don't really see it replacing programmers. It might
change the way we think about programming or software creation. There's certainly going to
be a lot more people that are going to be empowered to create software now.
And I think there'll be kind of a spectrum of people
from those who write software
just by describing it in natural language
all the way to the people who are kind of like
building the core kernels
of kind of like the operating systems of the future
that form like the solid foundation
that pack in the really important data structures
and algorithms and core architecture
around which everyone else can throw their ideas and stuff.
So there'll be like a huge spectrum.
I think we'll almost think of it
in terms of like the way we think of like reading and writing now
where like you have many different forms of reading and writing.
People just like reading, writing stuff on Twitter, that's one form of writing.
And then there's other people who write long books that span many years of intense research.
And I think the future of code looks something like that.
It's the ultimate flattener.
You see that book, The World is Flat? Yeah. Yeah. It's like
that. Like for a while there, it was outsourcing and now it's sort of like just accessibility to
everybody. Now, you know, people who don't know much about code can learn about code and level
up pretty quickly. And so the access, the catered access to have a patient, whether person or not,
like I have conversations with ChatGPT and I swear, I'm like,
I tell my wife, I'm like, I'm literally talking to a machine and I get it,
but we go 30, 40 rounds back and forth through whatever it might be.
And it's very much like a conversation I have with Jared.
If you would give me the time and patience and if you wouldn't get frustrated,
you know what I mean? And so I have this very patient... well, not necessarily,
but the world now has access to a patient sidecar
that's quite intelligent, that will get even more intelligent,
whether you call it artificial intelligence or not.
It has intelligence behind it, some knowledge,
and it's accessible right now.
I agree. Humans are still necessary, thank the Lord. But wow, it's super flat now. And a lot more people have access to
what could be and what might be because of this. And that's a fantastic thing. I think of, you
know, there's that Steve Jobs quote where he said, computers are amazing because they're like a
bicycle for the human mind. They allow a much more... I think he's drawing comparisons to, you know, how different animals
get around, and a human walking is very inefficient, but a human on a bicycle is more
efficient than the fastest cheetah or whatever, right? I think what language
models are capable of doing is, instead of a bicycle, now we each have like a race car or a
rocket ship. Now we're still in the driver's seat, right? Like we're still steering it and telling
it where to go, but it's just, it's way more leverage for any given individual. So great thing
if, you know, you love being creative, you love dreaming up, you know, new ideas and ways to
solve problems. One more question on the business side of things.
How has growth been because of Cody?
That's a great question.
Cody is, you almost would not believe it if I described it to you,
but Cody is literally like the most magical thing to happen to the Sourcegraph go-to-market
or sales
motion since basically when we started the company. I've been paying attention for a
while. So I asked that question. You've had trouble getting growth because you got to install a server
or go cloud and you got to examine the code base. Then you got to learn how to search the code,
which is all like friction points. So one of the, like transparently, one of the challenges that we had as a business is,
you know, we had a couple of subsets
of the programmer population
that were very eager to adopt Sourcegraph.
It's basically if you use a tool like Sourcegraph before,
you want to use it again.
So if you're an ex-Googler, ex-Facebooker,
ex-Dropboxer, or, you know, ex-Microsofter
in a couple of teams,
you kind of got it immediately. And then everyone else is like, oh, is it like
grep or is it like control F?
And we would lose a lot of people along the way.
I think with Cody, it's at the point where not only
does any programmer get it right away, they're like, oh, holy shit,
you just asked it to explain this
very complex code in English and it gave me a really good explanation. Even non-technical
stakeholders. So as we sell to larger and larger companies, a lot of times, you know,
in the room is someone like a, I don't know, a CEO, or the board of directors, or,
you know, someone non-technical who's pretty distant from the code, traditionally speaking.
And they get it too because, you know,
we were in a pitch meeting the other week
where it was like a large kind of Fortune 500 energy company,
and there was not a programmer in the room.
It was just kind of like, you know, high-level business owners
who were all very skeptical until we got to Cody.
We opened up, you know, one of their open source libraries and asked Cody to explain what was
going on, and one person leaned in and they were like, you know, I haven't coded in like
30 years and even I would get value out of this. So yeah, it's, it's just absolutely incredible.
Your total addressable market got a lot bigger. Yeah. Yeah. Because, like, what is
an engineer now? I think in a couple of years, almost every human in the
world will be empowered to create software in some fashion. You said before that Cody
leverages all that Sourcegraph is today, the intelligence. Yep. Will that always be true? I
guess is maybe the more basic way to answer that or ask that question.
Because at some point, if this is the largest arc in your hockey stick growth and all the
up from here is not so much Cody related, but Cody driven really, does what Sourcegraph
do at large now eventually become less and less important. And the primary interface really is this natural language coding interface that explains my code.
That's a great question.
It's like, you know, does AI just like swallow all of programming at some point?
Like at some point, do we cease to write kind of like old traditional like systems oriented software in the Von Neumann tradition?
You hand wrote that code?
What?
You wrote a for loop
instead of just asking it nicely
to repeat something?
Forget code search.
I don't even read code.
Why are you reading code?
Let alone searching.
Right.
Yeah.
This is still very early days,
so it's very difficult to predict,
but the way I think about it is in terms of, like, maybe there are different types of computers that can exist in the world.
Like a traditional, you know, like PC, that's one type of computer.
You could maybe say, like, the human brain is another type of computer.
And then these language models, I think they're a new type of computer.
And they do some things a lot better
than the PC type of computer did.
And then some things much worse.
Like they're far less precise.
I think I saw a tweet the other day
where someone repeatedly asked GPT-4
whether four was greater than one.
And then at some point,
GPT-4 got
unsure of itself and said, oh, no, actually, I was mistaken. You know, one is greater than four.
I apologize.
Yeah, exactly. Exactly.
Yeah, I apologize.
So I think these two types of computers are actually very complementary. And so like the
most powerful systems are going to be the ones that combine both and feed the inputs of one
and the outputs of the other
and synthesize them in a way that's truly powerful.
And we're already seeing early examples of this.
Like Cody is one.
We use kind of like the Chomsky style
like code understanding tech
with the more Norvig style language models.
Bing search is another
where they're using chat GPT for the
AI part of it, but they're still relying on
traditional Bing web search. And so I think we'll
see a lot of hybrid systems emerge that
combine the best of both worlds.
Exciting times. Thanks for
talking to us. Yeah, thanks for having me on.
Good seeing you again. Good talking. Pleasure chatting
with you. Oh, that was fun.
You guys are good at this. I'm excited for you.
So in this sponsored minisode here in the breaks, I'm here with Tom Hu, dev advocate at Sentry on the Codecov team.
So Tom, tell me about Sentry's acquisition of Codecov.
And in particular, how is this improving the Sentry platform?
When I think about the acquisition, when I think about how does Sentry use Codecov, or conversely, how does Codecov use Sentry?
I think of Codecov and I think of the time of deploy. When you're a
software developer, you have your lifecycle: you write your code, you test your code, you deploy,
and then your code goes into production and then you sort of fix the bugs. And I sort of think of
that split in time as like when you actually do that deploy. Now, where Codecov is really useful
is before deploy time. It's when you are developing your code. It's when you're saying,
hey, like I want to make sure this is going to work. I want to make sure that I have as few bugs as possible.
I want to make sure that I've thought of all the errors and all the edge cases and whatnot.
And Sentry is the flip side of that.
It says, hey, what happens when you hit production, right?
When you have a bug and you need to understand what's happening in that bug, you need to understand the context around it.
You need to understand where it's happening, what the stack trace looks like, what other local variables exist at that time so that you can debug that and hopefully you don't see that error case
again. When I think of like, oh, what can Sentry do with Codecov? What can Codecov do with
Sentry? It's sort of taking that entire spectrum of the developer lifecycle of, hey, what can we
do to make sure that you ship the least buggy code that you can. And when you do come to a bug that is unexpected,
you can fix it as quickly as possible, right?
Because, you know, as developers,
we want to write good code.
We want to make sure that people can use
the code that we've written.
We want to make sure that they're happy with the product,
they're happy with the software,
and it works the way that we expect it to.
If we can build a product, you know,
the Sentry plus Codecov thing
to make sure that you are de-risking your code changes and de-risking your software, then, you know, we've hopefully done the developer community a service.
So, Tom, you say bring your tests and you'll handle the rest.
Break it down for me.
How does a team get started with Codecov?
You know, what you bring to the table is like testing and you bring your coverage reports.
And what Codecov does is we say, hey, give us your coverage reports, give us access to your code base so that we can, you know, overlay code coverage on top of it, and give us access to your CI/CD.
And so with those things, what we do and what Codecov is really powerful at is that it's not just, hey, this is your code coverage number. It's, hey, here's a code coverage number, and your reviewer also knows, and other parts of your organization know as well.
So it's not just you dealing with code coverage and saying, I don't really know what to do with
this. Because we take your code coverage, we analyze it, and we throw it back to you into
your developer workflow. And by developer workflow, I mean your pull request, your merge request.
And we give it to you as a comment so that you can see, oh, great, this was my code coverage
change.
But not only do you see this sort of information, but your reviewer also sees it and they can
tell, oh, great, you've tested your code or you haven't tested your code.
And we also give you a status check, which says, hey, like you've met whatever your team's
decision on what your code coverage should be, or you haven't met that goal, whatever
it happens to be.
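The status check Tom mentions boils down to a simple gate in CI. Here is a hedged sketch of the idea; the threshold, the report shape, and the exit-code convention are illustrative assumptions, not Codecov's actual implementation.

```python
import json
import sys

COVERAGE_TARGET = 80.0  # an example team-agreed threshold, in percent

def check_coverage(report_path: str, target: float = COVERAGE_TARGET) -> bool:
    """Fail the check if total line coverage in a simple JSON report is below target."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)  # assumed shape: {"lines_covered": int, "lines_total": int}
    pct = 100.0 * report["lines_covered"] / max(report["lines_total"], 1)
    print(f"Coverage: {pct:.1f}% (target {target:.1f}%)")
    return pct >= target

if __name__ == "__main__":
    ok = check_coverage(sys.argv[1] if len(sys.argv) > 1 else "coverage.json")
    sys.exit(0 if ok else 1)  # a non-zero exit marks the status check as failed
```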
And so Codecov is particularly powerful in making sure that code coverage is not just a
thing that you're doing on your own island as a developer, but that your entire team can get
involved with and can make decisions. Very cool. Thank you, Tom. So hey, listeners, head to Sentry
and check them out, Sentry.io, and use our code changelog. So the cool thing is, our listeners get the team plan for free for three months, not one month, not two months, three months.
Yes.
The team plan for free for three months.
Use the code changelog again.
Sentry.io.
That's S-E-N-T-R-Y dot I-O. And use the code changelog. Also check out our friends over
at Codecov. That's Codecov dot I-O.
Like code coverage, but just shortened to Codecov. Codecov dot I-O.
Enjoy.
So now we're fine-tuned here. We're ready to go.
Okay, I see what you did there.
Swine-tuned, I think, is what you were trying to say.
Well, no, I think it was a Dolly reference, fine-tuned.
So, yeah.
It was a pun.
It was a pun.
Work with us, Jared.
I mean, Adam and I are already on the same page.
What the heck, man?
Adam's puns are on point always.
He never misses with a pun.
All right.
Thank you.
All right. So, we have Denny Lee from Databricks or Databricks.
Databricks.
Databricks.
Is that the official stance?
It's not a Canadian or American thing.
It's just Databricks.
It's just Databricks.
Here to talk about Dolly 2.
But first, I hear you're a just-in-time conference presenter.
Tell us what this means.
Well, I think the context was that you were asking me, hey, what's your presentation?
That's what you asked me first.
I did.
And I was actually responding, I don't remember the name, nor do I remember.
I do remember the concepts.
At least I do have that part.
But I don't remember the name.
Nor.
Nor are the slides done yet.
And this is.
Normal.
And it starts in 30 minutes.
No, no, no, no, no, no, no.
Tomorrow.
No, no, tomorrow.
Tomorrow.
Okay. I'm just simply saying that it is common for me to go ahead and not do a thing until 30 minutes before the actual presentation to create the slides.
So you're a procrastinator.
Yes.
I'm a very good one.
No, that's not procrastination.
No, efficiency.
That's optimization.
Efficiency.
Pure efficiency.
Why sweat over the details until you have to?
Exactly.
Exactly.
Because what if you start 30 minutes before,
but you realize the details required 45 minutes. So I had this one time where actually a buddy of mine, Thomas Kaiser, he and I went ahead and did a presentation where he, so he's from Denmark.
I'm from Seattle. We're both in, I don't know where, some other city to do the presentation.
Somewhere in the world. Somewhere in the world. So we actually got together, but we realized we
actually hadn't done squat on the slides
until 30 minutes before the actual session.
And guess what?
30 minutes before, put together the slides, bam, we're good to go.
Has it ever bit you?
I'm sure.
Tomorrow.
I'm sure at some point it will bite me.
I guess the context is I've gotten away with it so far. So I'm going to go
with it. And enough times that you have full confidence.
Yes. Fair enough.
Yes. Or at least I know how to fake it.
So what would you like to know about Dolly?
About Dolly 1, how we came about with Dolly 1.0
or Dolly 2.0? Let's start with why.
And then how. Alright, so let's go backwards
a little bit. That's when. No, you're talking when.
Yeah, when?
All the way back three weeks ago. Okay?
Okay.
Roughly.
No, sorry.
In the days of yore.
Yeah, in the days of yore four weeks ago.
All right?
Yes.
So, one of the things that... And I want to give credit where credit's due.
Mike Conover is the guy who actually figured it out.
Okay.
Now, we were using a much older particular model, and we're going like, eh, would this
work?
Right? And what it boiled down to is that there's a supposition
that could you take an older model,
fine-tune it with good data,
and still actually end up getting good results,
with the key point being that, hey,
we're only going to pay $30 to actually train the data
as opposed to, oh, the tens of millions of dollars
that you'd have to do.
And could you do it?
That was the supposition for Dolly 1.0.
And sure enough, we were right.
Basically, it was about $30 worth of training time
on what is not considered public data.
So that's why it's Dolly 1.0.
So we could give you the weights, we could give you the model,
but we couldn't give you the data
because the data itself was actually not public.
But you owned it.
No, no, no.
In fact, I believe it was the same data that ChatGPT was using.
So we could give you the weights.
Again, that's open source, but we can't do the data because the data is actually ChatGPT.
Gotcha.
All right.
And so then we're going, wait, we actually used only a tiny amount of data and it still
came out with some pretty decent results.
Okay, let's go ahead and say, why don't we generate our own data?
So again, take credit where credit is due.
Our founders went ahead and said, hey, why don't
we just get, we have about 5,000 employees
at Databricks now. This is my favorite part. Yeah.
Let's just go ahead and generate our own
data. So for two weeks, that's literally
all we did. We had basically a bunch
of employees dumping in
data in a Q&A style format.
We had seven different categories.
It's all listed out there, so I don't remember all those details anymore.
I worked on the t-shirts, so at least I was helpful on that part.
Love the t-shirt.
That's a good one.
No one's seen this right now, but it is a podcast.
That's right.
Draw a word picture, Adam.
Dude, a sheep.
Come on, man.
It's a sheep.
Dolly.
Dolly.
Dolly.
Oh, my goodness.
I knew you thought he was on point.
Oh. Okay. So Dolly, the sheep, a clone, right? It's a clone, right?
So that's the whole context. Yes.
So we go ahead and actually get that up and running.
And then we're like, hey, now we've got 15,000
plus of
Q&A style new
information, all brand new,
and we're publicly giving it away,
right? So the actual dataset, if you go to Hugging Face or databrickslabs/dolly, or whatever the GitHub site is, basically all that data is there. Okay?
All 15,000 lines. Oh, sorry, not lines, 15,000 Q&As. Okay. And then we trained that dataset again, using the same old model from two years ago.
Okay?
Okay.
And we ran that.
And then basically what was really cool about this is that it cost us about $100 worth of training.
But it's pretty good.
And if you ask some pointed questions on this stuff, it actually responds really, really well.
For example, I've got some examples where I'm actually asking coffee questions.
And the coffee questions answers are,
okay, I'll give ChatGPT-4 a lot of credit.
It is much more verbose than what Dolly 2.0 can provide.
But in terms of correctness, it is correct. They both are the same level of correctness
between Dolly 2.0 and ChatGPT-4.
I actually have it on my own GitHub somewhere,
like a review where I actually explain all that.
Mainly because I was actually running it on an M1 Mac, too, because I was goofing off.
Which is fine.
Well, that's amazing right there.
Yeah. Let me first just say, as a daily user of ChatGPT, sometimes verbose is not desirable. I'm like, dude, I actually will tell it to be brief, or in one sentence, because I'm so sick of the word salad it spits out. I'm like, I just want the answer.
The answers are useful.
But sometimes you're waiting for it to tell me the whole history of the thing.
No.
Well, don't you want to know the retrospective while you're at it?
I'm being very sarcastic about it, yes.
People can't tell it's a podcast, but we're all eye-rolling each other on that one.
We are.
That was major eye-rolls.
So using it.
Let's say I've never used anything but ChatGPT's web UI, but I'm a developer.
Sure.
And I want my own, I want Dolly to answer my questions.
Yes.
What does that process look like for folks?
Okay, so you've got two choices, or no, no, I should rephrase it slightly.
You've got many choices, in fact.
But the most common choices are we have a Databricks notebook that's in the Dolly GitHub that you can just download for free and run it.
Now then you're going to tell me, but Denny, I don't want to use Databricks.
That's fair.
I would prefer you to, but I understand if you don't.
That's fine.
Go to Hugging Face.
The instructions are all right there on how to use it.
In fact, like I was saying, I was actually playing with it so that way I could optimize for an M1 Mac, and so that the answers could
come back faster. My only problem
was that when I started testing it, there was an
obvious bug in PyTorch.
Because basically
when we told it to go ahead and use the M1,
it was giving us back garbage answers.
It wasn't even actual answers.
It was literally
nonsensical characters.
And when we
used CPU mode, it worked perfectly fine.
But then just as I was about to
create a new issue on PyTorch,
they fixed it. No, that's a good thing.
I know, but I also had the fix.
Oh, you had the fix.
Okay, that's it. I get you.
You were about to have a control.
You wasted my time.
Damn it.
But it's fun.
But basically the idea is that obviously,
I shouldn't say obviously,
you probably don't want to train within M1,
but you can definitely do inference within M1.
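For anyone wanting to try what Denny describes, here's a minimal sketch of running Dolly 2.0 inference locally with Hugging Face transformers, using Apple's MPS backend when it's available and falling back to CPU. It is not the official recipe: the checkpoint name, dtype, and device handling are assumptions, and the model card on Hugging Face has the current instructions.

```python
# A minimal sketch (not the official instructions) of running Dolly 2.0 inference
# locally with Hugging Face transformers. The checkpoint name, dtype, and device
# handling are assumptions; the model card on Hugging Face has the current recipe.
import torch
from transformers import pipeline

# Smaller checkpoints (e.g. dolly-v2-3b) are friendlier to a laptop than dolly-v2-12b.
device = "mps" if torch.backends.mps.is_available() else "cpu"  # Apple Silicon or CPU fallback

generate_text = pipeline(
    model="databricks/dolly-v2-3b",   # assumed checkpoint name
    torch_dtype=torch.float32,        # float32 is the safe choice on CPU/MPS
    trust_remote_code=True,           # Dolly ships a custom instruction-following pipeline
    device=device,
)

print(generate_text("What are the particular features of great espresso?"))
```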
The Q&A, so you got your data.
So how do you collect that data and how do you format it so that Dolly can understand it?
No joke.
I'm assuming you're saying,
so don't use Databricks data.
You could do the same thing like you did with the Q&A.
Yes, absolutely.
Literally, when we ask people to fill out, it was a Google form.
Okay.
That's literally it.
And what were the questions?
Oh, no, no.
They could produce the questions and then the answers.
They would ask a question, and then it would spit out.
Provide a detailed answer for it.
I see.
So that way, Dolly can train it.
So how do you make an espresso?
How do you make, since you use coffee, so.
It wouldn't even be how do you make an espresso? For example, let's be very specific, okay?
It would say, what are the particular features of great espresso? Okay. And then we would talk about, okay, you're required to have a fine grind. You're required to, using a conical burr grinder.
There's a religious war between
flat burr grinders and conical burr grinders. I put in
conical burr grinders, so yeah, I'm sure
the flat burr grinders are pissed off that that's not the
answer that they're going to get from Dolly. That's bias. You're putting
bias into this. Yes, absolutely. There's absolutely
100% bias. Let's not pretend there isn't, okay?
Okay. So, it also requires you
to actually have coffee beans roasted
in a particular way. It also requires
you to have the espresso water
boiled at a particular temperature.
Okay.
So you put all of those details down.
That's the idea.
So in other words, it's not just like,
okay, hi, what's great espresso?
You buy it from Espresso Vivace in Seattle.
I mean, while that's true,
and I'm basically,
I don't own any stock in them, by the way,
but they are easily the best coffee.
Who's the brand again?
Espresso Vivace in Seattle.
Espresso Vivace.
Yeah.
David Schomer is a magician when it comes to espresso.
Okay.
But the context is like, well, as much as I want to just provide an answer like that, the reality is no.
Obviously, we can't train that bad.
We actually need to have verbosity to provide context, provide proof, if you want to put it that way.
Because there's going to be other people putting other answers, too.
So, for example, in this case, I'm just going to call a buddy of mine, Rob Reed.
He's a fellow cyclist.
He's also a fellow coffee addict.
I know he also put some coffee answers inside there as well.
Okay.
So, between everybody that put coffee answers in there, that's actually literally you're getting data from myself,
from Rob, and a few other folks from, well, Databricks.
Right.
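For a sense of what one of those Q&A records ends up looking like once the form responses are collected, here's a sketch using the field names from the publicly released databricks-dolly-15k dataset; the espresso record itself and the category label are invented for illustration.

```python
# A sketch of one instruction-following record, using the field names from the
# publicly released databricks-dolly-15k dataset. The espresso record and the
# category label are invented for illustration.
import json

record = {
    "instruction": "What are the particular features of great espresso?",
    "context": "",  # optional supporting text the answer should draw from
    "response": (
        "Great espresso starts with freshly roasted beans ground very fine on a "
        "conical burr grinder, water at a consistent brewing temperature, and "
        "careful extraction timing."
    ),
    "category": "general_qa",  # assumed category label
}

# Collections like this are commonly stored as JSON Lines: one record per line.
with open("instructions.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```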
And how many instructions are in there that you guys put in?
The 5,000 employees?
5,000 employees put 15,000.
15,000.
So it's remarkable.
If you think about it, that's remarkably small.
We were always under the impression when we started this process
that we would require hundreds of thousands, or like millions.
How does it know you gave it coffee instructions?
Yeah, yeah. No, we know something totally different. Like I said, Dolly 1.0 shocked us. Like, it really shocked us, because we thought we would need to put in a lot more data. We thought we would need to do a lot more training. And then they were like, wow, this is not bad. I mean, it's not perfect, but it's not bad, actually. Right?
And so from a business perspective,
what ends up happening is if you have your own business,
now your data, you don't need a million things.
You've got 15,000 pieces of information.
Now, the great thing,
and I'm not telling you to use Dolly, by the way.
I mean, obviously, go use it if you want to,
but I'm saying use any open source model.
I don't care which one.
That way, you get to go ahead and keep it and have your data as your IP.
So you as a business end up using the data actually in a good way.
Right.
Where you actually make it advantageous for you, yet also keeping the privacy for the users that make up that data at the exact same time.
So the move is you have these, I don't know if this is technically what a foundational model is,
or you have these models that are large enough language models. Right. Right. And then
each company or each org or each use case says, okay, now we're going to fine tune it. I don't
know if that's the right language or not. And apply it to us. Right. And there's all sorts of
models out there. There are already... Like, a lot of people were asking me originally, like,
hey, okay, well, then, you need to use Dolly.
I'm like, no, no, no, no.
Dolly was just us proving that it can be done.
That's all it was.
So there are a lot of really good companies,
whether it's Hugging Face or anybody else,
that produces solid, open-source, large language models.
Use those, too.
Because the whole point is that you can use it yourself, run it with smaller amounts of
data, have really good answers, and you're paying a hundred bucks.
At least in our case, we did.
A hundred bucks to train it.
Right.
So we're like, okay, that's actually worth your business.
You're protecting the privacy of your users.
You're going ahead and actually having relatively solid answers.
And you're not basically giving your data away to another service.
Because that's the key thing about when you use a service.
Right.
That you're basically giving away your data so they can go train against the two.
Right.
Right?
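The recipe Denny is describing, taking a small open model and running standard instruction fine-tuning over a modest dataset, looks roughly like the sketch below. To be clear, this is not Databricks' actual Dolly training code: the base model, prompt template, and hyperparameters are placeholder assumptions, and real runs add GPU-specific settings.

```python
# A rough sketch of the recipe being described: standard causal-LM fine-tuning of a
# small open model on an instruction dataset with Hugging Face transformers.
# This is NOT Databricks' Dolly training code; the base model, prompt template,
# and hyperparameters are placeholder assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "EleutherAI/pythia-160m"              # small stand-in base model (assumption)
dataset_id = "databricks/databricks-dolly-15k"   # the publicly released instruction data

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token        # GPT-style tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

def to_prompt(example):
    # Flatten one instruction/response pair into a single training string.
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

train_data = load_dataset(dataset_id, split="train").map(to_prompt).map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dolly-style-finetune",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM labels
)
trainer.train()
```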
Now, I know Microsoft and OpenAI, for example, you're calling those two out in a positive way, not a negative.
Usually, I'm a former Microsoft employee, so I'm allowed to be negative if I want to, but this is actually me being positive.
They actually have introduced concepts saying you can pay more to train
and that they'll never actually use your data.
But I don't remember the cost, but it is definitely paying more.
Yeah.
Yeah.
Well, it's not as valuable to them, so it makes sense as a transaction.
So that becomes more of a transaction that way.
Exactly.
Right.
So have you seen the Googler's leaked memo
about we have no moat?
Yeah, everybody talks about that memo.
And what's interesting about that whole concept
is that, I know it sounds sideways,
but I was about to actually give you another context.
And this is actually, again, Mike Conover,
I want to give credit attribution
to the guy who actually said it.
What's really interesting about this whole thing,
when they talk about moats
and talk about everything else,
is that more fundamentally, we could have done this two years ago. We could have
taken this concept of basically saying, small amount of data, foundational model, fine-tune it,
and actually have good results. So all of us were focusing on, I need a bigger model. I need to bump more data.
I need to scrape the entire freaking Internet and chuck it all into the gigantic model.
Spend tens of millions of dollars, warp every single GPU until Azure basically melts in order to go ahead and train this thing.
Until the heat death of the universe.
Right, exactly.
And then meanwhile, it's like, or we literally could have taken a foundational model that was okay to good,
$100, and bam, we get something good.
So when they talk about there's no moat and all this other stuff between open source and not,
literally my attitude toward this whole thing is like, no, just step backwards for a second.
The reality is we could have done this.
We all got attracted to the idea, the shiny thing of, ooh, bigger, more, bigger,
more, larger, more. That's all we got
attracted to. And so
in the end, I'm going, I don't
care.
These companies,
the ones that, quote unquote, are trying to build a moat
around themselves, what they're doing, they're
trying to make sure that they have a service
in which you will
give them your data.
And then by definition, you will give away your competitive advantage.
Right.
Simple as that.
For the folks that don't want to do that, which I think is the vast majority, then my attitude is quite simple.
Then don't do that and build your own model.
Now, how about if I'm the general consumer?
I just want to pump out a good blog template for me to work with.
Yeah, absolutely.
Why not?
Seriously, I'm not trying to say these services aren't worthwhile.
Quite the opposite.
ChatGPT is fun.
Very valuable.
Oh, yeah.
It's extremely valuable.
In fact, I've already had it pumping out code for me just for shits and giggles.
So my Rust is-
It's going to pump out some slides for you here soon, for tomorrow.
Oh, that's a good idea.
I should test out that.
Yeah, yeah.
Take that 30 minutes, turn it into 12.
Oh, yeah.
That'd be perfect.
Yeah, yeah.
But see, you get my drift.
Yeah, totally.
Yeah, so my Rust code is Rusty.
And so basically, I was using ChatGPT to basically pump out a bunch of Rust code for me.
I'm like, hey, this is a great boilerplate.
Now I've got something to work with, and boom, now I can start writing again.
Right.
So what is Databricks' play in this chess game?
Like, what's your guys' angle?
Our angle's quite simple.
You've got a ton of data.
You need to ETL it, process it in the first place.
Then you need to have a platform to run machine learning
or data science or AI
or whatever frickin' wording you want to use.
Whether it's LLMs today,
deep learning yesterday
or tomorrow,
image optical
resolutions,
object recognition, I don't care.
The point is that you have a ton
of data. You need to be able to
process it. You need to be able to access
every single open source system
or service.
Databricks play is quite simple.
We just make it easy for you to do any of it.
That's it. That's our only play.
Let's make it easy.
Are you for, I guess, then
people owning their own data?
It seems that that's your...
So here's the thing. I'm absolutely
for both from a Databricks perspective,
but also from an open source perspective, right?
So I'm an open source contributor.
I contributed to Apache Spark and MLflow,
and I'm also a maintainer for Delta Lake, okay?
And so, yeah, by definition,
I'm always going to lean toward open source,
which means you should own your data.
Data should be a competitive advantage.
Everything else should be open source, basically, for all intents and purposes. I'm even
for things like differential privacy and privacy preserving histograms to basically protect your
data. And I can go on a diatribe on that, so let's not do that. But the context is, I'm not
saying, though, that these services like OpenAI's offering or whatever else aren't worthwhile. They are. They're cheap. They're helpful. In fact, training other systems isn't necessarily a bad thing either. For me, it's not about don't do it. It's about knowing what you're doing. Right?
That's it. Yeah. Transparency. Exactly. That's it. That's my whole point. If you want to use OpenAI
within the Databricks platform, we make it easy.
For crying out loud, we added SQL syntax directly,
so you can literally write Spark SQL,
which basically at this point is ANSI SQL compliant.
You literally write SQL to go ahead and access your OpenAI
to run an LLM directly against your data.
So literally, party hardy, have fun.
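As a rough illustration of the SQL-from-Databricks idea Denny mentions: Databricks exposes AI functions that can be called from Spark SQL, for example ai_query against a serving endpoint. The function name, endpoint name, and table below are assumptions to check against current Databricks documentation; this is a sketch, not the canonical syntax.

```python
# A hedged sketch of calling an LLM from Spark SQL inside a Databricks notebook
# (where `spark` is predefined). The ai_query function name, endpoint name, and
# table are assumptions; check the current Databricks AI Functions docs.
df = spark.sql("""
    SELECT
      review_text,
      ai_query(
        'my-llm-endpoint',  -- assumed model serving endpoint
        CONCAT('Summarize this review in one sentence: ', review_text)
      ) AS summary
    FROM product_reviews   -- assumed table
""")
df.show(truncate=False)
```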
So it's not, our attitude isn't so much like,
don't use one versus the other.
Our attitude is very much, no, no, just know what you're doing.
Understand when you're using something like a service.
Understand when it makes sense for you to build your own model.
And we also make it easy for you to build, maintain, train,
infer against that model.
That's it.
So I mentioned we have our transcripts as open source, right?
Yeah.
Everything we're saying here, when it hits the podcast, it's going to be transcribed into words.
What are ways we can use Dolly 2.0, this open model that you're talking about, this direction?
How can we leverage these transcripts for our personal betterment as a podcast company?
For example, as a podcast company, one of the first things, in fact, I'm actually already
doing this technically for Delta Lake, is that we also have podcasts ourselves.
So what are we doing, though?
I'm spending time and effort to generate blogs based off of the podcasts.
Why?
Because it's better for Google SEO search.
It's not like I'm trying to just repeat the same thing.
I'm just trying to summarize because we talked about, we talked about barbecue in the beginning, right? We talked
about coffee. We probably don't need all of those details inside the transcript of the podcast of
our blog. You want people to go ahead and actually understand what they're talking about when it
comes to Dolly. Cool. We generate a blog based off of this conversation. It can summarize it,
get to the key points.
Boom, there you go.
It simplifies the whole process,
so that way you're not spending exorbitant hours
trying to figure out how to basically synthesize
the key points out of our conversation right now.
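The transcript-to-blog workflow Denny describes is, in shape, just prompting an instruction-following model with a chunk of transcript and asking for blog-ready key points. A tiny sketch, reusing the assumed Dolly checkpoint from earlier; the model id and prompt wording are illustrative, not prescriptive.

```python
# A tiny sketch of the transcript-to-blog workflow: prompt an instruction-following
# model with a chunk of transcript and ask for blog-ready key points. Model id and
# prompt wording are illustrative assumptions, as before.
from transformers import pipeline

summarize = pipeline(model="databricks/dolly-v2-3b", trust_remote_code=True)

transcript_chunk = open("episode-transcript.txt", encoding="utf-8").read()[:4000]  # stay within context
prompt = (
    "Summarize the following podcast transcript as a short blog post with "
    "three key takeaways:\n\n" + transcript_chunk
)
print(summarize(prompt))
```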
So there's still time for you to review and look
to make sure the model isn't giving you garbage.
There's still time for a producer or for any other person who is knowledgeable in this field to validate the statements.
Maybe I'm full of BS of all I know, right?
And then so you get an extra.
Sometimes.
Oh, yeah, yeah.
I don't know.
Denny's full of it.
Forget it.
It most likely would be the conical versus flat burr grinder.
But, again, that's a whole other story.
The whole summary will just be Adam and I talking in color.
I'm Conical.
I'm on your team. Conical is me.
I'm Conical. Team Conical.
There you go. Perfect. See?
But the context is that we can go ahead
and actually use these systems to simplify it.
Would it be cheaper and easier if we just went ahead
and did like ChatGPT to do it?
Yeah. Go for it.
Would it be worthwhile to do it
in your own Dolly model? Absolutely.
Because you have your own style, right?
So if you have your own style, if Dolly or any other open source model,
again, I want to be very clear here,
is going ahead and be trained against your transcripts,
it will then be able to start writing blogs based off of your style, right?
That's the cool thing about it.
Is it cool to actually chain like that,
or is it better to go with a foundational model
and then just our stuff,
or to be cool or to be like,
well, start with Dolly because it has instructions,
and then add our style,
and then maybe add something else?
Literally, my answer is all of the above,
because we don't know.
Just whatever you want.
We don't know.
We don't know, because that's the whole point.
Different foundational models will provide different,
will be better at different things.
As simple as that.
Some models will be better at, for example, conversations.
Some models will be better for writing purposes.
Nat.dev.
I'm forgetting the guy's name.
Nat Friedman.
Thank you.
Oh, my God.
I don't believe I spaced out on that.
He's a nobody.
He's a small guy.
Okay, so Nat Friedman, former CEO of GitHub.
So slightly important guy.
Nat.dev is an awesome playground, for example,
where you can test out a lot of different models already.
And you're literally just chucking, like,
hey, let me try with GPT-3.
Let me try with Vicuna.
Whatever else.
And literally you will see with the same question,
especially if we do the compare playground section,
different answers from the different models.
So, yeah, like literally you got to play a little bit to figure out which model makes sense for you.
Yeah.
Interesting.
Love it.
Well, thanks for talking with us, Denny.
Glad to, always.
Aside from your opinions on coffee and whatnot, you're pretty good.
Pretty solid dude, yeah.
You know, those are fighting words.
I just want to say that, okay?
Those are fighting words.
Oh, that's good.
All right.
Gentlemen, thank you very much.
Yes, thank you.
All right.
All right. Hey, friends.
This episode is brought to you by CIQ, the founding sponsor and partner of Rocky Linux, Enterprise Linux, the open source community way.
And I'm here with Gregory Kurtzer, the founder and CEO of CIQ and the creator of Rocky Linux.
So Greg, I know that a lot of people are still sort of catching up to some degree with what went down with CentOS,
the Red Hat acquisition, and just the massive shift that required everyone using CentOS to do.
Can you give me a glimpse into what happened there? We've seen a number of cases in the open source community where projects were pivoted due to
business agenda or commercial needs. We saw that happen with CentOS. CentOS was one of the primary,
one of the biggest enterprise operating systems ever. People were using it all over the place.
Enterprise organizations and professional IT teams were all leveraging CentOS.
For CentOS to be stripped away from the community and removed as a suitable option to meet their
needs created a massive pain point and a gap within the industry. As one of the founders of
CentOS, I really took this to heart and I wanted to ensure that this does not happen again. And that is what we created
with Rocky Linux and the RESF. Okay. You mentioned the RESF. What is that? And what is its relationship
to Rocky Linux? The RESF is the Rocky Enterprise Software Foundation. And it is an organization
that we created to hold ourselves responsible to what it is that we've promised that we're going to do with the community.
It is community run. It is community led.
We have a board of directors, which is comprised of a number of people that have a huge amount of experience, both with Linux as well as open source and community. And from this organization, we solidify the governance of how we are to manage Rocky Linux
and any other projects that come and join in this vision.
Sounds good, Greg.
I love it.
So Enterprise Linux, the open source way, the community way has a home at Rocky Linux
and the RESF.
Check it out and learn more at RockyLinux.org
slash changelog. Again,
RockyLinux.org slash
changelog. All right, Stella Biderman.
Yeah.
And you're with, I'm going to also butcher the name of the org.
Eleuther AI.
Eleuther.
Eleuther AI.
Yes.
Okay, what is this?
What is Eleuther AI?
Y'all were just talking with Databricks about Dolly.
This is right.
Yes, correct.
So that was
built on top of an open source language
model. Okay, yes. I trained
that. Okay, so you're
underneath Dolly. Yes.
Okay. So you personally trained
it. Yes. Okay.
What's the model? It's called Pythia.
Pythia. It's a
suite of language models, actually, that we put out a couple months ago.
Okay.
But in general, EleutherAI has trained several of the largest open source language models in the world in the past three years.
Okay.
Very nice.
So what do you want to tell the world then?
What do I want to tell the world?
Honestly, didn't think that far in advance.
Okay.
All right. Well, what should the world know?
What should the world know? About what you do
in terms of training models that Databricks
uses, that's open source, etc.
Honestly, especially like the open source
world should really know that
the AI world really needs help
from the open source community at large.
That's actually, broadly speaking,
why I'm here at the Linux Open Source Summit.
Okay.
You know, we're struggling with a lot of issues
about maintainability,
issues about licensing,
issues about regulation,
issues about building sustainable ecosystems
that the open source community writ large
has been working on for years, if not decades.
Yeah.
And a lot of people in the AI world
are a little too proud to ask for help from non-AI people,
which is definitely a real systemic problem.
But there's, I think, a lot of...
If people are excited about foundation models,
large language models, whatever you want to call them,
and want to get involved and don't know, or want to help and don't know that much about AI,
there's a ton of open source work that needs to be done that we need help with
to build a robust and enduring ecosystem.
Where is the money coming from?
Where's the money coming from? Great question.
So at Eleuther AI, we recently formed a nonprofit.
And we have donations from a number of companies, most prominently Google, Stability AI, and Hugging Face.
Okay.
And CoreWeave are among our biggest sponsors. We have also been applying for grants from mostly the U.S. government
to pay for our forthcoming research and work.
In terms of computing resources,
it's actually like training these really large language models
is not that expensive, which is like...
Is that a secret?
I don't know if it's a secret or what,
but I think that the CS world
kind of got used to the idea
that anything can be done on a personal laptop,
and that that's kind of what constitutes
a reasonable amount of money to spend on a paper.
And that's great.
There's a huge accessibility boon for doing that.
But training these large language models,
it is pricey.
It's not something that anyone can do on their own.
But it's not ruinously expensive.
There are thousands of companies
around the world that can afford to do this.
There are dozens of universities
that can afford to do this.
And by and large, they just haven't been. Okay. So there's a model that you trained.
Yeah. How much did that cost? So we trained, so it's part of a suite of models that had like
28 in it total. But altogether, that was like less than $800,000. The largest model,
one training run would probably be like $200,000.
Not bad.
That's more than a laptop.
Which is more than a laptop.
But it's less than.
It's not like a mind-boggling amount of money.
It's less than a Super Bowl commercial.
It's true.
Yeah.
So right now, the largest open source,
well, okay, the second largest open source English language model in the world
is called GPT-NeoX.
We trained that, I trained that, my organization.
And
that cost us about
$350,000.
Or what it would have cost, if we weren't given the compute for free.
But like $350,000
for the second largest open source
language model in the world. And at the time we released it,
it was the largest.
Later, someone else trained a bigger model with sponsorship from the Russian government.
But it's for...
So GPT-3 came out in 2020.
And for about two years,
almost nobody was training and open sourcing language models.
Google was doing it with similar models,
but not like the same kinds of models that GPT-3 is.
And we were doing it.
It was really not that expensive.
We got into it on compute that we got for free
through a Google research computing program
called the TensorFlow Research Cloud.
And with that, we trained a 6 billion parameter language model,
the one that underpins the first version of Dolly
that he was talking about.
That's been extremely widely used,
deployed in a whole bunch of different industry
and research contexts and been hugely successful.
And it was literally just like Google gave us for free.
It ran preemptively on their research,
basically the idea of TRC is that they have
a research cluster that they don't always use all of.
And so other researchers, independent researchers,
academics, non-profits, can apply to be able to run
preemptible jobs on their research cluster
and just use the compute that they're not using at the time.
And using that, we trained this model in like two and a half months.
And it was a really big deal when it came out.
It was the largest model of its type in the world by a sizable margin.
It was about three times the size of the, four?
Four times the size of the largest open source model of its type in the world. Yeah. And the Pythia models we trained on like 120,
800 GPUs for a couple of weeks, which is certainly a lot of computing resources,
but it's not like mind boggling amounts of compute. There are lots and lots and lots of
companies that have that, that could, you know, it's less about it actually being too expensive
and more about kind of having the political will to actually
go do it. Yeah. Are you focused on training open source models? Is that your focus? So our focus
is on open source AI research in general. Our kind of area of expertise is large scale AI.
And most of what we do is language models. But we've also worked on training and releasing other kinds of large-scale AI models.
We were part of the OpenFold project.
DeepMind
created an
algorithm for modeling protein
interactions called AlphaFold.
That was a really big deal.
We helped some
academics scale up their research and
replicate that and release it
open source. We've done some stuff in the text-to-image space, both on our own, and some of our staff have
kind of gone on and worked at Stability on some of their language, uh, sorry, image models.
And we are a big proponent of open source research in general. So the reason we decided to
start training these large language models was back in the summer of 2020,
we thought this GPT-3 thing is going to be a major player in the future of AI.
And it's going to be really essential if you want to be doing something meaningful in AI,
you probably want to know how these things work.
You want to be able to experiment with them. You want to have access to them. And back then, you couldn't even pay
OpenAI to let you use the model. They announced that they had it, and that was it. And so we said,
well, what the hell? Let's try to train a model like that. We'll learn something along the way.
And so we started building an open source infrastructure for training large language
models. We created a dataset called The Pile, which is now kind of the
de facto standard for training large language
models. We created an
evaluation
suite for
consistently evaluating
language models because everyone runs their evaluations
a little differently and there's huge reproducibility
issues. So we
built a framework that we could release open source
and run on our own
models, run on other people's models, and actually have kind of meaningful apples to
apples comparisons. And we started training large language models. We trained a 2.7 billion
parameter model, which is like a little bit bigger than GPT-2 was at the time. And then
we started training larger models. 6 billion parameters was the largest open source GPT-3
style language model in the world. 20 billion parameters was the largest language model of any sort to be released
open source in the world. Since then, there's been a lot more investment and willingness
to train and release models. There's several companies that are now doing it. So Mosaic
is a company that released a nine, I want to say,
something, a large language model.
That seems really excellent, like last week.
There is Meta, which has been training
and releasing sort of models.
They'll tell you that they're open source releasing models,
but that's just not actually correct.
They're under non-commercial licenses
and they're not open source,
despite their rhetoric to the contrary.
But there's a whole bunch of companies.
Stability AI is training large language models.
So now there's a lot more people in this space
and doing it and releasing it.
And honestly, from my point of view,
we got into training large language models
mostly because we wanted to study them.
We wanted to enable people to do essential research on interpretability, ethics, alignment,
understanding how these models work, why these models work, and what they're doing,
so that we can design better models and so that we can know what appropriate and inappropriate
deployment contexts for them are.
And so now that there's a lot more people working in kind of this open source training space,
we're moving more towards, you
know, doing that kind of scientific research that we've always wanted to do. So in the past six
months, we've been doing a lot of work in interpreting language models and kind of understanding
why they behave the way they do. My personal kind of area of focus is tracing the behavior of
language models back to their actual training data.
So the models that Dolly 2.0 is trained on, the Pythia suite, what kind of makes that special is that most language model suites are very ad hoc constructed. I'm calling them suites because
you have several models that are similar of different sizes. So like the OPT suite by Meta, for example, ranges from 125 million parameters to 175
billion parameters.
But they're not actually very consistent between them.
Some of them even have different architectures.
They have different data order.
There's a lot of stuff that kind of limits your ability to understand, to do controlled
experiments on these models.
And so we sat down and we said, if we wanted to design from the ground up a suite of large language models that was designed to
enable scientific research, what would it look like? What kinds of properties would it have?
What kinds of experiments do we think people are going to want to do that we're going to need to
enable? And we built this list of requirements and then created a model suite that satisfies that.
So it was trained on entirely publicly available data.
All of the training, it was trained on the same data. Every model in the suite was trained on
the same data in the same order. And we have a whole lot of intermediate checkpoints that are
saved. So if you want to know, you know, after 10 billion tokens, how each model in the suite
is performing, you can go and grab those checkpoints after 10 billion tokens. And then
you can say, okay, what's the next data point
it saw during training after 10 billion tokens?
What was the 10-billion-and-first token?
And you can actually use some stuff we've uploaded
to the internet to actually load that data
in the same order it's seen by the models.
You can study kind of how being exposed
to particular training data influences model behavior.
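As a concrete illustration of what Stella describes, the Pythia suite publishes intermediate checkpoints that can be loaded by revision. Here is a minimal sketch; the exact repo id and step-branch naming are assumptions to verify against the EleutherAI model cards.

```python
# A minimal sketch of loading an intermediate Pythia checkpoint by revision.
# The repo id and the "stepN" branch naming are assumptions to verify against
# the EleutherAI model cards on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m"   # one member of the suite (assumed)
revision = "step10000"                # an intermediate checkpoint branch (assumed)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)

# Paired with the training-order metadata EleutherAI publishes for the suite, this
# checkpoint can be lined up against exactly the batches the model had seen by that step.
```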
So we've been using this right now primarily
to study memorization,
understanding, because language models have a propensity for reproducing long exact sequences from their training corpora. And we're interested in understanding what causes memorization,
why certain strings get memorized and others don't. Right now, I'm wrapping up our kind of
first paper on that. We have some more research in the works, trying to understand, you know,
looking at the actual models
throughout the course of training
and looking at kind of the training data points that they see
and trying to reverse engineer
what that actual interaction between the model
and the data is.
And yeah, this is something I'm personally really high on.
Most interpretability research right now
is kind of focused on final trained models
as like pre-existing artifacts.
So you have this trained model and you want to understand what behaviors it has. But, you know,
my perspective as someone who trains these models is much more focused on kind of where they come
from and what, especially like my overarching goal is to kind of, you know, if I as a person
who trains a large language model have a particular desire for a property the model has, a property the model doesn't have, what decisions can I make to actually influence that and to make the model have the properties I want it to have?
So if there's data I don't want it to memorize, is there a way that I can know ahead of time what's going to be memorized?
That's the paper that we actually just released on arXiv about forecasting what is going to be memorized before you actually train the model.
Is that to make it less black box,
more like you deploy it and you don't know what it can do so that you can
sort of understand,
okay,
here's the data,
here's how it's trained to sort of have a,
a more clarity of what the box actually contains versus this black box.
Is that why that's important?
That is what the field of interpretability is about in general.
Okay.
And I would say, kind of building on that,
what my research is about in particular
is not just opening up that black box
and looking inside and understanding
what the model is actually doing,
but understanding where it came from
and how we can build boxes
that are more transparent from the ground up.
Predictable maybe even?
Yeah.
Yeah?
Because I mean, that's one of the fears is,
you know, especially with, like, Bing.
Yeah.
When they put that out there, I think it threatened the person, like, there was some sort of, like, threat on humanity, essentially.
And it's like you deploy this thing out into the world and you don't understand what it can actually do.
Is that to be more predictable, more controlled to some degree?
Sorry?
And even designable, like, say, well, forget these things, remember these things.
Yeah.
Designability is a really big component, I think,
that's going to become huge in the future.
Right.
And really, it hasn't been studied primarily
because people haven't had the tools.
Very few model suites have intermediate checkpoints at all.
A lot of publicly released models
weren't trained on publicly released data sets.
Or if they were trained on publicly released data sets, they didn't tell you what order it was trained on.
And it turns out that matters a lot. What it saw early in training, what it saw late in training.
And so there's really a huge reproducibility issue in terms of, if you want to dig in and
really understand how data by data, data point by data point, the model is learning to behave.
You need to be able to basically fully reproduce the training.
Not actually, because you're not going to spend a couple hundred thousand dollars,
but at least in principle, you need to be able to inspect individual data points,
know when it's going to get loaded, understand kind of how it works.
And this is something that we've put a huge amount of resources into,
both on the training side as well as kind of on the engineering side.
It was not easy, but you can actually reproduce our model training exactly.
So if you take the code base that we used to train these Pythia models
and you pick a checkpoint and you load that checkpoint
and you resume training from that checkpoint,
you will end up with the same fully trained model that we did.
Exactly.
That's important.
That is really important. It's important because if you want to understand how to design models, you need to
understand how they're changing over the course of training. And that is really persnickety and
really sensitive to a lot of implementation specific details that tend to not get released.
How far in the future do you think, since you're at the training level, you're like the ground level of if this is the eureka moment for humanity.
Yeah.
Right?
How far in the future do you think and do you have fear, trepidation, hope?
Like where will this take us as humanity?
I really don't know.
My kind of attitude is that the recent, like there was a really big paradigm shift in 2020
with the release of GPT-3 and the aggressive focus on scaling.
And people really changed their attitudes towards how to design language models
and how they can be used and what they can be used for.
In a sense, we got really lucky because it wasn't that dangerous.
There were a lot of fears about what GPT-3 could do.
And by and large, it turned out to be pretty safe.
There wasn't all that much harm done, and a lot of the fears turned out not to come to fruition.
And looking forward, I think the really important thing to think about is we obviously can't predict the next paradigm shift.
But we can build tools that allow us to hopefully more readily adapt and respond to future paradigm shifts
in large scale AI.
So that one day there probably will be something that gets developed that is dangerous and
we want to be able to be, I guess, ready for that.
Yeah.
Yeah.
Cool.
Well, what are some touch points, people who are interested in what you're up to, want
to help out, want to give money, want to read more, where can people connect with you?
So the best place to connect with us is our Discord server.
We are a research institute,
but we actually operate basically entirely in the public view.
We're distributed all over the world,
and we do our research in a public Discord.
And anyone can join, anyone can drop in, read about what we're getting up to, hang out with us, chat with us about AI.
So our Discord server is discord.gg slash Eleuther AI.
There's also a link on our website, which is Eleuther.ai.
Shockingly.
We'll link it up in the show notes for sure.
Yeah.
And, yeah, we're always happy to take on more volunteers.
We have a small professional staff and a large number of volunteers that help out as well.
How small is small?
Like 10 full-time employees.
Okay.
And if they go to the Discord server, what can they do there?
What can they expect from the Discord server?
Like, you're there, others are there.
Yeah, so you can chat about AI.
We have a bunch of discussion channels where people talk about kind of cutting edge trends in
artificial intelligence. Honestly,
I don't really follow
AI publication news anymore because I just
follow my Discord server and everything
that's important shows up for me. There you go.
Which is a really nice place to be.
You can talk with us. You can talk with other researchers.
We have a large amount of researchers
at the cutting edge of AI. I can't count the number
of times that someone's posted a paper and been like, hey, this is really cool. Like, does anyone know anything
about this? And someone just like tags the guy who wrote the paper. That happens all the time.
We have people from OpenAI, Anthropic, Meta, like all the major labs who come, DeepMind,
come in and chat about language models, give advice, give, you know, perspectives on research
and talk about kind of how things are going. You can also get involved with ongoing research projects. So we have a dozen-ish ongoing
research projects ranging from learning to train, figuring out how to train better language models
to training language models in other languages. So if you look at like the list of the hundred
largest language models in the world, basically all of them are English or Chinese.
Yeah.
And so if you want to spread the benefits of this technology and the ability to kind of use and understand this technology
to the world at large,
like not everyone speaks English and Chinese,
and even the people who do often also speak other languages
that they care about.
So we're training,
we've trained and released several Korean language models.
We're currently training with the plan of releasing some Indic language models, as well
as some Romance language models.
So yeah, on the developing new model side, we do research like that.
On the interpretability side, we do a lot of different stuff, understanding training
dynamics, understanding how to evaluate language models, understanding how to kind of extract
the best information from them.
We recently started up some work on kind of red teaming them and trying to understand,
you know, there's a lot of stuff out there right now about prompt hacking, about how
people are trying to put filters on language models and they're kind of not really very
successful and trying to understand like what the dynamics of that is like,
whether you can, uh, build meaningful safeguards around these things or whether it's always going
to be subverted. We do a lot of work like that as well. Very cool. Well, thanks for coming on
the show, Stella. It was awesome having this deep dive with you. I love that. Thank you.
Great to meet you guys. Yeah. So if you'd have told me a few years ago that I'd be going to an open source summit
and talking about AI in open source at this level, from Cody, a coding assistant, to Databricks
and training models on small data sets, to Stella's work and EleutherAI's work on open AI research
and all these things that'd be real,
that'd be touchable, that'd be usable today
to transform my work, to transform your work,
to transform the world around me.
I would not have believed it, but it's true.
We're here and this show was awesome.
So hope you enjoyed it.
Once again, a big thank you to our friends at GitHub
for sponsoring us to go to this conference
as part of maintainer month.
There is a small bonus for our plus plus subscribers.
So stick around for that.
If you're not a plus plus subscriber, it's too easy.
changelog.com slash plus plus.
We drop the ads.
We obviously give you bonus content.
We bring you a little closer to the metal and the best part, you directly support us.
10 bucks a month, 100 bucks a year.
changelog.com slash plus plus.
That's it.
The show's done.
Thanks for tuning in. We will see you on Friday.