Latent Space: The AI Engineer Podcast - RWKV: Reinventing RNNs for the Transformer Era — with Eugene Cheah of UIlicious
Episode Date: August 30, 2023

The AI Engineer Summit Expo has been announced, presented by AutoGPT (and future guest Toran Bruce-Richards!) Stay tuned for more updates on the Summit livestream and Latent Space University.

This post... was on HN for 10 hours.

What comes after the Transformer? This is one of the Top 10 Open Challenges in LLM Research that has been the talk of the AI community this month. Jon Frankle (friend of the show!) has an ongoing bet with Sasha Rush on whether Attention is All You Need, and the most significant challenger to emerge this year has been RWKV - Receptance Weighted Key Value models, which revive the RNN for GPT-class LLMs, inspired by a 2021 paper on Attention Free Transformers from Apple (surprise!).

What this means practically is that RWKV models tend to scale in all directions (both in training and inference) much better than Transformers-based open source models, while remaining competitive on standard reasoning benchmarks.

swyx was recently in Singapore for meetings with AI government and industry folks, and grabbed 2 hours with RWKV committee member Eugene Cheah for a deep dive, the full recording of which is now up on Latent Space TV.

Today we release both the 2hr video and an edited 1hr audio version, to cater to the different audiences and provide “ablation opportunities” on RWKV interest level.

The Eleuther Mafia?

The RWKV project is notable not merely because of the credible challenge to the Transformers dominance. It is also a distributed, international, mostly uncredentialed community reminiscent of early 2020s Eleuther AI:

* Primarily Discord, pseudonymous, GPU-poor volunteer community somehow coordinating enough to train >10B, OPT/BLOOM-competitive models
* Being driven by the needs of its community, it is extremely polyglot (e.g. English, Chinese, Japanese, Arabic) not because it needs to beat some benchmarks, but because its users want it to be for their own needs.
* “Open Source” in both the good and the bad way - properly Apache 2.0 licensed (not “open but restricted”), yet trained on data taken from commercially compromised sources like the Pile (where Shawn Presser’s Books3 dataset has been recently taken down) and Alpaca (taking from Steven Tey’s ShareGPT which is technically against OpenAI TOS)

The threadboi class has loved tracking the diffusion of Transformers paper authors out into the industry, but perhaps the underdog version of this is tracking the emerging Eleuther AI mafia.

It will be fascinating to see how both Eleuther and Eleuther alums fare as they build out the future of both LLMs and open source AI.

Audio Version Timestamps

assisted by smol-podcaster. Different timestamps vs the 2hr YouTube

* [00:05:35] Eugene's path into AI at UIlicious
* [00:07:33] Tokenizer penalty and data efficiency of Transformers
* [00:08:02] Using Salesforce CodeGen
* [00:10:17] The limitations of Transformers for handling large context sizes
* [00:13:17] RWKV compute costs compared to Transformers
* [00:16:06] How Eugene found RWKV early
* [00:18:52] RWKV's focus on supporting many languages, not just English
* [00:21:24] Using the RWKV model for fine-tuning for specific languages
* [00:24:45] What is RWKV?
* [00:33:46] Overview of the different RWKV models like World, Raven, Novel
* [00:41:34] Background of Blink, the creator of RWKV
* [00:49:55] The linear vs quadratic scaling of RWKV vs Transformers
* [00:53:29] RWKV matching Transformer performance on reasoning tasks
* [00:54:31] The community's lack of marketing for RWKV
* [00:57:00] The English-language bias in AI models
* [01:00:33] Plans to improve RWKV's memory and context handling
* [01:03:10] Advice for AI engineers wanting to get more technical knowledge

Show Notes

Companies/Organizations:
* RWKV - HF blog, paper, docs, GitHub, Huggingface
* Raven 14B (finetuned on Alpaca+ShareGPT+...) Demo
* World 7B (supports 100+ world languages) Demo
* How RWKV works in 100 LOC, RWKV overview
* EleutherAI - Decentralized open source AI research group
* Stability AI - Creators of Stable Diffusion
* Conjecture - Spun off from EleutherAI

People:
* Eugene Cheah - CTO of UIlicious, member of RWKV committee (GitHub, Twitter)
* Blink/Bo Peng - Creator of RWKV architecture
* Quentin Anthony - our Latent Space pod on Eleuther, coauthor on RWKV
* Sharif Shameem - our Latent Space pod on being early to Stable Diffusion
* Tri Dao - our Latent Space pod on FlashAttention making Attention subquadratic
* Linus Lee - our Latent Space pod in NYC
* Jonathan Frankle - our Latent Space pod about Transformers longevity
* Chris Re - Genius at Stanford working on state-space models
* Andrej Karpathy - Zero to Hero series
* Justine Tunney ("Justine.lol") - mmap trick

Models/Papers:
* Top 10 Open Challenges in LLM Research
* Retentive Network: A Successor to Transformer for Large Language Models
* GPT-NeoX - Open source replica of GPT-3 by EleutherAI
* Salesforce CodeGen and CodeGen 2
* Attention Free Transformers paper
* The Pile
* RedPajama dataset
* Monarch Mixer - Revisiting BERT, Without Attention or MLPs

Misc Notes

RWKV is not without known weaknesses - Transformers do well in reasoning because they are expressive in the forward pass, yet the RWKV docs already note that it is sensitive to prompt formatting and poor at lookback tasks. We also asked pointed questions about RWKV’s challenges in the full podcast.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Hey listeners. Today we have a very special episode for you. There's been a recent paper on the top 10 open challenges in LLM research that has consolidated a lot of intense debate.
Today we're going to talk about RWKV models, Receptance Weighted Key Value models, with Eugene Cheah, who is part of the core RWKV team, CTO of a low-code AI test automation platform, and an active member of the Latent Space Discord.
The RWKV architecture has the potential to solve three of the top 10 open LLM challenges: increasing context length, making LLMs faster and cheaper,
and designing a new model architecture.
What is particularly appealing about it
is that it does so by reviving the recurrent neural network,
which even I have argued has been obsoleted by the transformer.
It rejects the idea that attention is all you need
and replaces multi-head attention and feed-forward networks
with new concepts called Time Mix and Channel Mix, respectively.
It has been trained up to 14 billion parameters
(the team is getting help from Eleuther and Stability AI to scale up even more),
and shows competitive results on reasoning benchmarks,
the same benchmarks we covered in our Benchmarks 101 episode,
with similar size models, yet with linear costs and speed curves instead of quadratic ones.
In a way, RWKV is promising the room-temperature superconductor of LLM architectures.
In other words, the parallelizability and performance of transformers without the quadratic cost.
Obviously, the topic of what happens after the transformer is, in finance terminology,
what we call a low-delta outcome.
It's probably not going to happen, but if it does, it will be very, very big.
We even discussed this a little bit in our episode with Jonathan Frankle of MosaicML.
But since the RWKV paper was published, the idea has been somewhat independently validated,
with Microsoft Research putting out RetNet, or the Retentive Network, which has similar
aims in what it is trying to do.
And it also, of course, competes with other alternatives to the transformer, like the state-space
models coming out of Chris Ré's group at Stanford (S4, H3) and the Monarch Mixer that was recently
announced.
However, RWKV is so far the most validated of all these ideas, because
it is already trained up to 14 billion parameters, with multiple models that you can download
and generate text with today. As podcasters, we want to be the first place that you hear about
new things in AI that you'll be using in your work or personal life as AI engineers and enjoyers.
So this presents us with a problem. We have to be early on consequential topics and things,
but also high signal. Some of our favorite compliments so far, which, by the way, I've
added to our About page if you want to check that: "Your pods are a legitimate highlight of life
for me. They're amazing," from McKay Wrigley. And from the AI
Safety Memes Twitter account, which is always fun: they just simply said we're the highest signal
pod for them. So we're very proud of this and want to keep it up while taking risks, because even
though we crossed a quarter million downloads just five months into the podcast, we're still very
young and trying to figure out what kind of podcast we want to be and what kind of audience we want to
have. Today is going to be one of the riskier pods for a few reasons. One, it is the first pod we're
doing on a non-traditional architecture with no large Western institutional backing. Two, it is the
first pod we recorded outside of the US without a regular studio, so the audio is not as good.
And three, finally, it is the first pod we're doing with my Singaporean accent.
I'll address each of these in turn.
One, there's a significant institutional bias in Western media coverage when it comes to AI,
in that if you don't come from Stanford or Oxford, or you don't work at Google or Facebook,
your ideas have trouble getting attention.
However, by far, the most admired organization that our guests have repeatedly mentioned
is Eleuther AI, which spun up as a decentralized
Discord community that independently trained the first GPT-class LLMs without a prestigious
background. Eleuther has since spun off organizations like Stability AI and Conjecture.
But I suspect that the RWKV community is working like the early days of Eleuther, and we have a
rare opportunity to capture an oral history of it live rather than two years after the fact.
Two, part of the latent space magic is that we try to get to know our guests and be in person
with them to establish rapport, like we did in New York in our Notion AI episode with Linus Lee.
Despite the lower audio quality, I think we got a much better interview with Eugene because we were able to interact with each other in person.
Three, part of the joy of audio is that you get to hear the full diversity of humanity, and neither Alessio nor I are American,
and I like to showcase the broader world through our AI lens wherever we can.
As you'll see, the RWKV story is also about the non-English rest of the world, self-organizing to build the LLMs that the English-centric West has neglected and doing so relatively successfully.
We've aggressively edited this interview down to one hour for the audio podcast,
but for those who are interested in RWKV, head over to the new Latent Space TV YouTube channel,
which has the full two-hour interview, including a screen share walkthrough of the RWKV models, paper, and Discord,
as well as digressions on Eugene's background, discussions on diffusion models and the token crisis,
and the relationship between open source AI and the AI waifu/husbando community.
I hope you enjoy our conversation with Eugene Cheah of RWKV, and if you liked it, reach out to him at @picocreator
on Discord or on Twitter and let him know.
Okay, so I'm here with Eugene.
We are in Singapore.
This is the first time I'm podcasting in Singapore.
This is the first time I'm podcasting with my Singaporean accent.
Eugene has been a very valued part of our Latent Space Discord for a while,
and also diving deep into RWKV.
I think you're actually the first person that brought it to my attention as a potential
transformers alternative.
You're also CTO of UIlicious, which is a UI testing company that's
in Singapore here, which is a low-code platform.
I got the first demo maybe four years ago.
Yes.
And I was like, OK, fine, you know, you're doing testing.
There wasn't an obvious AI angle.
I mean, now that you explained it, it was great.
But like, what was your personal, like, "OK, I'm going to be a dedicated AI guy
for UIlicious"?
OK, so one thing that I found very interesting with the huge
transformer boom right now is that traditionally,
when you tell companies that you want to build your own AI,
you need a really large dataset.
And over time, the amount of data that you need has actually scaled down, because you can just fine-tune foundation models.
Yeah, fine-tuning foundation models.
And when we started UIlicious,
we always knew, because a lot of the other companies that were launched at the same time were dealing with neural networks, that at some point the data that we've been collecting on, let's say, how to test a website, would be useful.
It's just a very specific focus.
Basically, every single test that has run on our platform,
unless our customer has opted out
or deleted their account
(basically privacy-related stuff),
we actually still retain the test data.
And that's something that we always felt
was useful in the long run,
to be able to actually train a large model.
The irony of it was that
even though we were building all those datasets,
as the threshold came in
and the transformer boom happened,
we realized we don't actually need that
big of a dataset anymore
to actually get a functional AI.
One of the key insights, especially for people who are trying to build on top of transformer models:
pre-transformer, pre-large-language-model, we would always be thinking in terms of, like, 100 gigabytes of data,
one terabyte of data, like millions of records for all the different examples.
Post-transformer, you literally only need probably a thousand or 10,000 examples, little enough data that you can literally get an intern to do it in a few weeks.
Right.
Just get it done.
And you have a working model.
It may not be that great.
But frankly, every piece of data you add after that has diminishing returns.
And because it's a language model,
it doesn't actually have any inherent understanding that it is automating the browser.
So it's presented as a prompt-answer pair, like a question-answer pair.
At least for our internal model that our users are using, it's presented as: here's the prompt, describe your test or what you want to modify in the code.
And then it subsequently generates the code for you.
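As a rough illustration of the prompt-answer framing described here, a small fine-tuning set could be stored as one JSON object per line. Everything in this sketch is hypothetical: the field names `prompt` and `completion`, the example wording, and the placeholder script lines are assumptions for illustration, not UIlicious's actual data format or test syntax.

```python
import json

# Hypothetical records: the "prompt" describes the test, the "completion" is
# the script the model should write. The script lines are placeholders, not
# UIlicious's real test syntax.
RECORDS = [
    {
        "prompt": "Describe your test: open the shop and add the first item to the cart.",
        "completion": "goto('https://shop.example')\nclick('Add to cart')\nsee('1 item')",
    },
    {
        "prompt": "Modify the code: also check that the checkout button is visible.",
        "completion": "see('Checkout')",
    },
]


def to_jsonl(records):
    """Serialize one JSON object per line (a common shape for fine-tuning files)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse the JSONL text back into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(RECORDS)
print(jsonl_text)
```

A few thousand such lines is the scale mentioned in the conversation; the model never sees a browser, only text pairs like these.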
In hindsight, it's now basically Copilot.
Yeah, yeah.
Yeah, I think now Copilot is adding that chat widget.
Are they fully launched yet?
Yes, I actually downloaded it yesterday.
I haven't actually used it yet, but it is a separate VS Code extension.
So there are now three Copilot extensions shipped by GitHub, because they have shipped their org chart.
I'm quite friendly with that team, but it's very funny.
But just to come back to you, so did you implement this with GPT-3?
So we based it off the Salesforce CodeGen model.
Okay, right.
So that was the foundation model that we built on top.
We are looking into replacing it in parts, but that becomes a longer conversation.
CodeGen being the first really credible, open source, code-specific language model that was released by literally anyone, I think about three years ago.
Yeah.
And then they recently released CodeGen 2.
Correct.
Any opinions on CodeGen 2, just while we're on this topic?
In terms of CodeGen, one big appeal of the CodeGen and even the CodeGen 2 models is that Salesforce took a very clear and clean approach to the
licensing. Meaning they were very, very clear that everything that they trained on was open source.
Yeah, MIT. They didn't touch the problematic licenses. So, you can imagine.
And you think that Copilot did?
Knowing Microsoft's statements on how liberal they were about GitHub data, and they were saying, they used the term, that it is under fair use.
I see. Yeah. I have no reason to believe that they didn't. But this same problem happens to actually a lot of existing
code generation models, and that was actually the main appeal for me for
actually building on top of the Salesforce CodeGen model. Mostly also
because, for us, we deploy on-premise into enterprises, yeah, in Europe,
and they ask questions. So what does deploying on-premise mean? Like, you
pack UIlicious into a container? Yeah. You give it to them? Yeah. And then they
pay, like, a license fee or something? Correct. Okay, cool. That's very
interesting. Yeah. Okay, I don't know if I have any other questions based on that.
Anything else before we go into the reasons for alternative models?
Okay, so, anything after that? No, I don't really have much.
Right, for alternative models. So, as I said, the premise, right:
transformers have won for now. They've slain the [recurrent] neural networks. Yes. And, you know,
it seems like you have had a history with machine learning since before
transformers, and now they're kind of at the peak of
their power. And I see that, you know, there's a desire for alternative models for a number
of reasons. But I'm very curious as to what drives your personal interest in alternative models.
So first thing, to be clear, the majority of our AI is still based on Transformers, at least within
my company. Yes. But what drove me into alternatives beyond Transformers? In essence, once we actually
managed to get our bot to generate UI testing code, the most obvious next thing that our customers
started asking was: hey, let's say the test failed. Can your AI now analyze my website and then tell
me what's wrong and tell me what to change? Basically, they're getting crazier and crazier.
Yeah, yeah, yeah. Humans are very good at moving goalposts. And we had something working for toy websites.
But the one thing that we do internally is that we look at the, I think, what's the list, top 100,
top 1,000 websites. And we basically just run, we actually do run, our test platform against them to see
that our code works against any front-end platform.
Well, what do you mean run your test platform, right?
Because you don't have tests for them?
Yeah, we have some very rudimentary ones, basically:
go to website, see something, click something, add to cart.
Yeah, that's it.
That's it.
The idea is more like, because there's so many frameworks out there.
And our...
You want to make sure you cover all of them.
Yeah.
And so we did the same thing for our AI.
And the first thing that it died on was literally Amazon.
Why?
Oh, 5 megabytes.
Yeah, I think you've heard me mention it.
So when you are trying to analyze a website, we've been talking about increasing token count size, right?
But for e-commerce websites in particular, even if you strip out the CSS, even if you strip out the JavaScript,
having the entire HTML in megabyte size is not unheard of.
And that's where it's like, how am I supposed to solve this from an AI point of view?
How many tokens would that be?
Like, oh my gosh, you could easily be looking at over a million tokens.
I see.
Which is still too much even for today.
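The "over a million tokens" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes an average of roughly 3 to 5 bytes per token, which is a rule of thumb for English-like text and not a measurement on real HTML; `estimate_tokens` is an illustrative helper, not part of any tokenizer library.

```python
def estimate_tokens(num_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Rough token estimate from raw size; bytes_per_token is an assumed average."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return round(num_bytes / bytes_per_token)


html_size = 5 * 1_000_000  # the 5 megabyte page mentioned in the conversation

# Try a few assumed averages to see how sensitive the estimate is.
for bytes_per_token in (3.0, 4.0, 5.0):
    tokens = estimate_tokens(html_size, bytes_per_token)
    print(f"{bytes_per_token} bytes/token -> about {tokens:,} tokens")
```

Under any of these assumptions the page lands at a million tokens or more, which is consistent with the estimate given above.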
Yeah.
Did you look into making your own tokenizer?
That's something that we explored.
I think what we found more realistic was to actually parse the HTML into a more token-friendly
format.
So this way we can still build on top of existing models.
But yeah, we are exploring that as well.
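One way to picture a "more token-friendly format" is a pass that drops scripts, styles, and most attributes while keeping tag structure and visible text. This is only a sketch using Python's standard `html.parser`; the class name, the list of attributes worth keeping, and the output shape are assumptions, not a description of what UIlicious actually does.

```python
from html.parser import HTMLParser


class CompactHTML(HTMLParser):
    """Keep tag names, a few attributes, and visible text; drop everything else."""

    SKIP = {"script", "style"}                   # elements whose content is discarded
    KEEP_ATTRS = {"id", "name", "href", "type"}  # assumed to matter for UI tests

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(f' {k}="{v}"' for k, v in attrs if k in self.KEEP_ATTRS and v)
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.out.append(text)


def compact_html(html: str) -> str:
    parser = CompactHTML()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = ('<div class="a b c" style="color:red"><script>var x = 1;</script>'
        '<a href="/cart" class="btn">Add to cart</a></div>')
print(compact_html(page))
```

The design choice is to keep working with existing models and tokenizers while shrinking the input, rather than training a custom tokenizer.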
But back to the topic.
So the key thing for me was, at that point, and subsequently I think I showed you
the experiments with the English compiler and things like that, right?
AI agents generating code.
You also have your own smol developer there.
What's that?
The context size is a real problem.
And the transformer, inherently by nature, at least the vanilla transformer
(I know there's Transformer-XL and some other attempts), scales quadratically with the context size.
So if we scale to, let's say, 100,000 tokens, that's already requiring
a shit ton of compute and VRAM, and I don't even want to imagine what happens at 1 million or 10 million.
And that's where I was like, okay, this is a fundamental problem that needs to be
changed. If not, we will not go past this. And I think there's also now a lot of people who are
very interested in models that can handle a large context size, because they also want to be
able to use it in use cases where they never need to fine-tune, because fine-tuning is a pain, apparently. Yes, that's...
Okay, well, there's issues with just throwing everything in context, right?
It's been shown that retrieval is only best when the item that's relevant is at the front or at the back of the context window.
So basically I'm just like, maybe we've just tapped out.
Context is working memory, and maybe Transformers are very similar to humans in that
our working memory is only of a given size.
If you try to artificially extend it, you just make it very lossy.
Yeah.
So that's where I ended up
landing on the RWKV model. Because, in that sense, right, one thing that I always
found very weird for transformers, but I mean it's by design, is that as you infer each token,
you are re-computing everything up to that point, right? That's the quadratic part. And,
well, you were mentioning the working memory problem. In theory, with enough
attention heads on it, and people seem to be trying to cram more and more
attention heads into the process,
it could scale that way.
Ignoring compute costs.
Okay.
And ignoring compute costs is just, like, very liberal.
It's just: throw as many H100s at it.
It doesn't make sense.
But RWKV, because it is still fundamentally a [recurrent] neural network at its core,
ends up scaling linearly as it goes through the tokens.
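The quadratic versus linear contrast can be made concrete by counting abstract units of work. In this sketch one unit is either one token looking back at one earlier token (attention) or one update of a fixed-size state (recurrent); model width, layer count, and hardware effects are deliberately ignored, and both function names are made up for illustration.

```python
def attention_total_work(context_length: int) -> int:
    """Token-to-token interactions when token t looks back at all t tokens so far."""
    return context_length * (context_length + 1) // 2


def recurrent_total_work(context_length: int) -> int:
    """State updates when each token touches a fixed-size state exactly once."""
    return context_length


# 4K was the common limit at the time; 100K and 1M are the sizes discussed above.
for n in (4_000, 100_000, 1_000_000):
    ratio = attention_total_work(n) / recurrent_total_work(n)
    print(f"{n:>9,} tokens: attention is about {ratio:,.0f}x the recurrent work")
```

The ratio grows with the context length itself, which is why the gap is tolerable at a few thousand tokens and prohibitive at a million.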
It will still suffer from the memory issue.
So within RWKV, we do measure two separate things.
One, we call it the perfect memory.
So in the model, we have only a certain amount of capacity
where it can remember things perfectly, just like humans.
And then beyond that, that is where it will start to discard things from its perfect memory.
Right.
And I felt that this was actually a lot more in line with our goals commercially.
And also what I felt was more useful in the long run, because it's cheaper compute
and it could potentially be parallelizable in the really long run.
Right.
So we're going to go into the RWKV paper in a bit.
But one thing I wanted to ask, you kind of glossed over how you found it in the first place.
Because you're not a researcher.
You're not like, I don't imagine you're like reading papers every day or something.
Until recently.
Until recently.
How do you find it?
How do you find it?
How do you know like this is the one to bet on versus there's a bunch of other alternatives, right?
I think it was quick.
I think it was rather quick after I concluded that
Transformers, as they are, will not scale to 10 million tokens.
Okay.
And so, by the way, you mentioned Transformer-XL.
We also did an episode on FlashAttention, which helps to make part of it subquadratic at least.
Yeah, but that was way, way after I had already dived into RWKV.
So history-wise, at that point in time, Transformer-XL, we are talking about
when 4K was the limit that everyone knew.
Right.
And this was last year.
I mean, just to set context.
Okay.
Okay.
And then, yeah, so you just kind of were searching around, you found RWKV. Presumably, like, did you go straight into the Discord?
Was it, like, primarily a GitHub repo?
Like, what was it?
Because as far as I can tell, there was no paper until maybe about two months ago.
Oh, and I talked about it before the paper, right?
Yes.
So you found it before they did any publicity, which is weird.
It's not normal.
So what happened?
What did you do?
So what I did, okay, so it was basically, I believe, a mixture of things, because I was searching
GitHub, I was searching forums, other Discords, and also blogs, actually.
I was just getting all of them, because everyone was just creating lists of lists, right?
Yeah, yeah.
And I believe you also have a list of lists somewhere.
Yeah, but mine is very, so I would consider myself very trad, in the sense that I would just follow the large model labs,
whereas the kind of list that you have to follow in order to get to something like
RWKV before they've done any publicity is the non-trad, like, you know, the kind of people that
are working on things like Hermes, Wizard, you know, like, no credentials. I don't even know
who the hell they are, but they're just working on it. Oh, this is all from vague memory, and I
might be hallucinating this because there were too many lists, but I believe the list that
actually brought me to RWKV was about how, beyond OpenAI's models,
beyond ChatGPT and Claude, the two big models, right, outside of the English-speaking nations, a lot of the open-source models really fall flat.
And that is why, when you actually go through lists for doing things in other languages,
RWKV actually stood out at that point.
And just on the basic premise, and we're not even talking about architectural advantages, it's just the basic premise that they included datasets in other languages
in the training data.
Yeah.
And...
Was that a...
Because, I mean, I imagine 99% of your customers are English.
Yeah.
Was that really a driver for you?
It wasn't a driver, but...
Are you just trying to explain it?
Yeah, that's how I landed on all these blogs and technical posts.
And when you say fall flat, the main one that I know about is that there's a tokenizer
penalty for non-English.
Yeah, that's that.
Right?
So, like, Chinese is up to...
Chinese or Japanese or Thai or something,
it's like 16 times the number of tokens for a typical English sentence.
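Part of the tokenizer penalty is visible before any tokenizer is involved: byte-level tokenizers start from UTF-8 bytes, and many non-Latin scripts need about three bytes per character. The sketch below only counts bytes, so it is a rough proxy and does not reproduce the 16x figure quoted above; the sample sentences and the helper name `bytes_per_char` are illustrative.

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    return len(text.encode("utf-8")) / len(text)


# Illustrative sample sentences; any text in each script would do.
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
    "Thai": "สวัสดี สบายดีไหม",
}

for language, sentence in samples.items():
    print(f"{language}: {bytes_per_char(sentence):.2f} bytes per character")
```

If the learned merges in a vocabulary come mostly from English text, those extra bytes are rarely merged into larger tokens, which compounds the gap.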
Yeah, but even before
that, right? Because, I mean, I think you understand, a lot of community users want to not use the commercial APIs.
Okay. So they try to find open source models. Yes. And we'll talk about the not-safe-for-work people.
I really want to, because you've actually talked to them. I have never talked to these people. But, like, when I
discovered them, they are a huge community. They're extremely passionate. And they're actually good.
Yeah, they're really good. They're good at this. So let's talk about them, right? Yeah, we can talk about it later.
So they don't want to use the commercial models, and they want to use the open source model,
and there is a tokenizer penalty, which is true.
But I think on the more fundamental basis, right, if we look through the datasets,
and this is also partially important because of the way we set up our evals,
all evals are written in English, at least the majority of them.
And if we are racing towards building AI models, at least right now,
as you see with all the companies,
as they build their open source models
and they just want to narrowly
focus on the evals,
adding in a foreign-language dataset
is actually a loss.
Because once you are below a certain parameter count,
so we're talking about 7 and 14 billion,
the more you add
that's not in line with your evals,
the more they'll degrade.
So they just exclude it.
So the model just...
The priority is English.
Yeah, I get it. The model just fundamentally...
So what's the trade-off?
Like, I mean, okay, so English
and Chinese or, you know, there's all these other languages.
What do you pick?
So RWKV started...
Also, for context, the main person leading the RWKV project, Blink, is from China.
So he naturally has an interest in making sure it supports Chinese.
Yes.
Yeah.
So English.
And there are a fair amount of bilingual models, essentially, that are English and Chinese
from the major universities in China.
So we started from basically English, Chinese, Japanese, Korean.
Frankly, this is in large part because there were fans in those communities that came on board,
and then subsequently we tried to onboard other languages as well. Yeah, but these people are,
like, again, not researchers, no money, training on their home GPU lab or whatever, right?
Partially true. But how I see it work out for a lot of the other languages is that we have
the foundation model, and this is the foundation model where we just kind of say, evals be damned,
let's just make sure to include all the other languages.
Okay.
And when we included the other languages, right, the model works, for the most part, in the other languages.
Subsequently, these individuals who wanted to use these models for their respective use cases
would then fine-tune it accordingly.
Because it's easier to fine-tune in another language for your use case than to...
I mean, this is classic fine-tuning, than to train the language from scratch.
And I think more recently, and this model is not 100% trained yet,
RWKV has released what we call the World model,
where we go the next step of even including all the translation datasets that we can find,
even for minority languages that people raise in our Discord.
Because the long-term goal for us, at least internally, is
that we want an AI model for everyone.
And everyone does not mean the USA.
It means the world.
So there are a lot of languages in there.
Is it Asia-biased or,
you know,
give me a sense?
It's probably, no offence,
it's probably still going to be
US-biased in terms of knowledge,
because what we are doing is
still The Pile and RedPajama for the knowledge.
But in terms of language,
we add all the other languages'
wikis and translation sets.
So it's hard.
I mean, we haven't fully evaluated
the bias here, but I'm quite sure that when the knowledge is still disproportionately within the
English universe, there's a bias there. But frankly, we are still at the stage of
getting it to support the other languages.
Yeah.
And I think I mentioned this, this is one of the interesting parallels that sometimes
I have, right, is that I can be in the Eleuther forums and all that,
and then we're talking about alignment, and we're talking about it in very big...
Which is, yeah, Eleuther is very keen on safety and all that, which is great,
but it's not your role as the RWKV community.
Yeah and when you talk to like members of the community that came on board,
they're like, oh I want to get this to work for Korean, Japanese, Thai, Arabic languages and so on.
So they just want something that worked.
Yes.
They don't want it to be, they're not after the big model that does everything, they just want something that they can play with in their language.
And that was very important to them.
Yeah.
And these are literally just hackers.
Literally just hackers doing it for personal enjoyment.
Correct.
Not yet for work.
Or maybe some of them for work.
You don't know?
We don't know.
I mean the core character AI category, there's quite a number of them using it for that.
So professionally.
Professionally.
Okay.
As in they run character companies.
Yeah.
Let's call it.
Should we pause here and then I'll switch to the screen?
Sure, sure.
Okay.
All right.
So we have it pulled up.
We are going to screen share for the bulk of this.
If you're listening on audio, it might be a good time to switch to the YouTube channel.
So we're just going to start with an intro.
What is RWKV?
So RWKV is a modern recurrent neural network with transformer-level LLM performance,
which can be trained in a transformer mode.
And this part has already been benchmarked against GPT-NeoX in the paper,
and it has similar training performance compared to transformer models with the same dataset and parameter count.
So specifically the GPT-NeoX model.
So the key thing is that even though it's matching in performance, trading blows with GPT-NeoX,
it's doing all this without attention layers. And in the process,
it has substantially lower compute, based on its design and also because
it's a recurrent neural network (we'll dive into why later),
in both training and inference. And this is back to what I mentioned previously: transformers,
traditionally, until we found out about Transformer-XL
and things like that, tend to scale quadratically based on the context size.
And this applies not just in inference, but in training.
And because this is still a neural network at heart,
even though it can train like a transformer,
it's able to do so much more efficiently and faster,
especially when you hit context sizes of 8K, 16K, and above.
And once you compare quadratic and linear,
the differences start to go crazy once you scale the numbers up.
And that was the main benefit of the RWKV model, per se.
There were a few prominent researchers who reviewed the arXiv paper when it came out.
They did highlight an important question: is this evidence that, literally,
maybe all that really matters is that we need a large dataset and a scalable model?
That makes sense obviously to some approximation,
but you are still using attention?
No, we don't use attention inside.
Okay, yeah, maybe let's rewind a little bit.
Oh, specifically attention as you understood it.
Yeah.
Okay.
Tell us more.
So we use weighted receptance, and if there are any diagrams I should pull out, let me know.
Oh, okay.
Let's, okay, so we are using AFT.
So this is the Attention Free Transformer, and this paper was written by Apple.
What the hell is an attention free transformer?
Okay, this is unusual.
Yeah.
We use the weighted receptance weights and we compute over them.
And in essence, this is like the classic stacking of more layers.
Once you do that on top of it, you don't really need attention,
once you have enough weights and layers stacked on it.
Okay.
I don't know whether we want to go into the deep dive on AFT.
Sure.
But that's interesting.
I've never heard of this paper.
Yeah.
So this was written by Apple.
And subsequently, Blink, the creator of RWKV, took this, applied it to a language model, and scaled it up.
Right.
And that is how we landed on RWKV, which doesn't use attention.
So sometimes within the community, we use the term light attention, because what happens
is that these layers and these weights will still play the role of attention.
I was going to say, you end up approximating attention.
Exactly. So it ends up looking at the other parts of the memory and then applying it to the output.
And the key benefit is that, because, remember, the attention model is multi-head,
it needs to scan all the tokens back and forth.
This removes that requirement and hence reduces the overall compute count.
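The point about not rescanning every earlier token can be sketched with the weighted-average term from the RWKV paper, restricted to a single channel. This is a simplified illustration, not the project's code: the names `wkv_parallel` and `wkv_recurrent` are made up for this sketch, `w` is assumed to be a positive decay, `u` a bonus for the current token, and the numerical-stability tricks a real implementation needs are left out. The first function rescans all earlier tokens for every position; the second carries two running sums forward and should give the same numbers.

```python
import math


def wkv_parallel(k, v, w, u):
    """Weighted average per position, rescanning all earlier tokens each time."""
    out = []
    for t in range(len(k)):
        # The current token gets the bonus u instead of a decay.
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            # Older tokens are decayed by w for each step of distance.
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(k, v, w, u):
    """Same quantity, computed with a constant-size running state."""
    out = []
    a, b = 0.0, 0.0  # running numerator and denominator over past tokens
    for kt, vt in zip(k, v):
        e_cur = math.exp(u + kt)
        out.append((a + e_cur * vt) / (b + e_cur))
        # Decay the past by one step, then fold in the current token.
        a = math.exp(-w) * a + math.exp(kt) * vt
        b = math.exp(-w) * b + math.exp(kt)
    return out


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.2]
    values = [1.0, 2.0, -1.0, 0.5]
    print(wkv_parallel(keys, values, w=0.5, u=0.3))
    print(wkv_recurrent(keys, values, w=0.5, u=0.3))
```

The recurrent form does a fixed amount of work per token, which is the property the rest of the conversation leans on.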
I might be jumping back and forth a bit, but that's one of the key essences of RWKV.
And we call it light attention.
And this is the part where I would disagree with the RWKV community in some parts.
I think that was a bad name.
Because it's cute.
Why is it a bad name?
Because when the RWKV paper came out, right, and then we talked about, like, we use
this and we call it light attention, but by design, it's really nothing like your existing
attention head models.
And it ended up sidetracking the Hacker News debate: in one corner, it's like,
no, this is technically attention, approximately attention, and then another group is like, no,
this is not attention.
I see.
But I'm like, propose a better name, because I have no idea what to call it.
Okay.
What else should people know?
Maybe we can explain what R, W, K, and V stand for?
Receptance Weighted Key Value.
Okay.
Yeah.
And each of these are like actual things that you model in the code, right?
Correct.
So we can go into that.
Which attention historically is like query key value.
Correct.
Okay.
So, so do you want to jump straight into the layer architecture?
Should we, should we cover something else first?
Anything like high level?
High level, okay, there's a 7B, there's a 14B,
there's one of the assets or the artifacts.
Okay, so before we go into the nitty-gritty
of how the layering and everything works.
On the high level, right, currently RWKV, architecturally,
as a model,
what we have already proven is that it can be scaled
and trained like a transformer.
How we do so, we'll cover later,
and this can be scaled to as many parameters as we want.
Currently,
our main models are the 7B model and the 14B model,
which you can find on Hugging Face or in our respective demos.
We also have the RWKV Raven models.
These are also instruction-tuned.
Okay, so there's World, there's Raven, there's music.
Oh my God, this novel.
What is all this?
Okay, so before that, the current main models are RWKV-4
Pile and Raven.
So Pile is basically just a Pile Plus model.
What is Pile Plus?
I know about the Pile, but what is Pile Plus?
Random datasets that are added in combination with the Pile.
How many tokens was it?
I would just say slightly more, 1.1 or 1.2 times the Pile.
Okay.
Yeah.
This is not instruction-tuned and stuff.
Yeah, the plus one is typically all the other languages.
Subsequently, the Raven models are the instruction-tuned ones.
These are the current main complete models.
We subsequently have...
And the instruction data sets are from...
Typically, GPT-4, but then we scrub it and remove all the...
"As a large language model..."
So, yeah, this would be the uncensored.
There's some other project that's kind of doing something similar,
and they call it uncensored, but really they just scrubbed the "As a large language model..."
Correct.
So that makes it technically breaking the TOS of OpenAI, right?
Yeah.
Okay, but yeah.
But that's a, I mean...
That's a later problem.
Frankly, let's be honest, if we...
Even if we don't remove it, someone is going to remove it.
I mean, so there's ways around this,
which is you get it, you get clean datasets that are not GPT4.
So the one that I typically mention is Yannic Kilcher's Open Assistant.
I believe that was included subsequently as well.
Yeah, obviously all these release orders are all over the place.
Yeah.
So, okay, Raven World.
So Raven is the instruction-tuned one.
And then subsequently, the world model is a new model that we are training.
It's not 100% complete yet.
With the focus on a new tokenizer and all the languages.
So what means?
All the languages.
All the languages that we can grab from the internet.
All the wikis in all the respective languages.
Like what do you mean when you say all languages?
100 languages.
Okay, fine.
So 100 languages.
It wasn't really a very precise science.
We just basically took whatever the wiki tool allows us
to download, the X-language wikis.
If it works, it's in the set.
If it doesn't work, skip.
And all the major prominent OSCAR translation sets.
So as you can see, Pile, RedPajama.
What is OSCAR?
OSCAR is just a common term that we use,
and you can just search OSCAR in Hugging Face Datasets.
And it just means translations.
Okay.
So you can find, like, English-X pairs.
I see.
Yeah, all the respective pairs.
Okay.
So, and then all the ChatGPT data I can find.
Okay.
So 70% English, 15% multi-lang, 15% code.
Is there a strong grounding for why 15% code?
No.
It was just, it was already there.
Yeah.
So the focus of the world model was not to improve everything else.
It was literally that 15% multi-lang.
We wanted to increase.
It was English and code and then you just added multi-lang.
Yeah, we had fair bit of multi-lang, but we wanted to bump it up.
Right.
So this is primarily English?
Whatever.
Okay.
Yeah.
What I would like is basically like a visual of like, here's all the building blocks and here's how they combine to create all these things.
So we have the RWKV architecture code.
So that's the main model building block, and basically we feed it the data:
Pile Plus, RedPajama, and subsequently some of the code data.
For the world model, we subsequently add on top of that all the OSCAR translation sets and so on.
And so you're training these things.
You've mentioned that you're intentionally taking a hit on evals,
on traditional evals, like MMLU or whatever?
I wouldn't say intentionally.
Also to clarify, like, I am not training it.
I'm just part of the community.
The community and Blink are the ones training it.
But I would say it's more like the lack of care for the evals.
So the reason why we add things to the dataset
was never about improving evals.
It's directly in response to user feedback.
It's like, oh, it's not good enough at this.
So they're like, okay, just add it.
Yes, literally.
Along those lines, take, for example, even for Raven and the world model:
as we go through the training stages, we specifically ask people of other nationalities within our Discord community to test it for their language.
And our informal rule is that the only person who can decide whether this improved world model is better in Japanese or Thai or whatever it is, is a native speaker.
Where does it take place?
So it's mostly within the linguistics channel, but sometimes we do a shout-out in general as well.
Okay, linguistics.
Yep.
So why don't, so do you have like an appointed ambassador, like you have a hundred languages?
Yeah.
You just have like a czar of Japanese, a czar of Thai.
It's not so appointed; it's more like, hey, this is the Japanese model, please try.
It's not...
There's no the Japanese model.
There's one model.
There's a world model.
So if you go to world model, I don't know whether it's inside here.
No, four, sorry.
Five is, you should never pull five from the top, because five is fully experimental.
So under Files and versions.
I see, I see, yes, yes.
So you see there's Japanese specific tune, Chinese, Arabic.
Then for all the other smaller languages, we actually asked them from the base world model.
Yeah.
A bit itself, so feedback on that.
So we actually released previously like 10% train, 15%, 20%, like as it goes through the stages and then it's like, hey, is this working?
Is it regressing?
So it's like evals, but real...
Done by real humans and not systematically.
Is there a reason that you release?
So you mentioned 7B, 14B, but I see also 0.1B, 0.4B, 3B, 1.5B.
Is that useful for people or is it just for research or...
0.1 and 0.4 is frankly more for research.
But some people do try to make use of them, nothing stopping them.
Well, I mean, it's extra, like these are just different architectures, different dimensions.
Yeah.
So it's actually extra costs to you to provide these things.
Oh, but specifically for the world model, because we are trying a new tokenizer... and the reason why we're trying a new tokenizer is that one thing that we found, or more like I found, surprisingly frustrating
in the existing tokenizer was that it was very English-centric.
And the existing tokenizer you took from GPT-NeoX?
Yeah.
And just to, I need to backtrack a little bit,
just for people who are not following along.
GPT-J was the original Eleuther reproduction of GPT-3.
And then GPT-NeoX was the bigger GPT-J?
Yeah, you can pretty much say that.
20B, something like that.
Yeah, I do believe there's a bit more in between, though.
And, I mean, for those outside of the open-source
space, in particular for transformers,
I think one significant thing about GPT-NeoX was that it was one of the major models that had
everything fully documented, including why they made each change in the architecture and so on and so forth.
And that became basically a reference for all other subsequent open-source models,
because they were the early ones doing a good transformer model,
at least for large language models.
So, GPT-2 was actually open source.
People didn't find that useful?
No, people do reference it as well, but it's like the code is there.
And why did you do this?
Oh, I see.
It's not documented.
I see, yes.
So in that sense, was OPT from Facebook useful?
Because I've heard very good things about the logbook of OPT,
where they had the daily logbook and they just published that.
Yeah, those were useful as well.
Yeah, okay.
I think one thing that NeoX had going for it, especially in the Eleuther community, is that
it's not just a logbook; you could just go to them and ask, hey, why did you do this?
Right.
And the person who trained it will tell you.
Yep, someone there, okay, might?
Hopefully.
One off the self.
So that's why we had the 0.1 and 0.4 models, because we were just in uncharted waters here.
So a lot of existing tokenizers took the space as a major delimiter to split on,
and the tokenizer we are using is actually a lot more simplified.
So existing tokenizers scan all the text and build a statistical model of what pairs well with what, and so on and so forth.
We did a similar approach, but instead of using rules like this token pairs well with this and should be paired with that,
we just made it a trie.
So basically the trie data structure.
Yeah, so we just find the longest
matching string that we have trained inside our token list, and then we just
use that token.
It's a drastically simplified tokenizer and it doesn't use spaces as an assumption which
I know.
Which is good.
Yeah.
And that helps a lot with Japanese, Chinese, and other character-based languages.
They don't have spaces.
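The longest-match lookup described here can be sketched as a greedy trie tokenizer. This is a minimal illustration, not the actual RWKV World tokenizer: the class name `TrieTokenizer`, the toy vocabulary, and the choice to work on Unicode characters are all assumptions made for the example.

```python
class TrieTokenizer:
    """Greedy longest-match tokenizer backed by a nested-dict trie."""

    def __init__(self, vocab):
        self.root = {}
        for token_id, token in enumerate(vocab):
            node = self.root
            for ch in token:
                node = node.setdefault(ch, {})
            # The None key marks "a token ends here" and stores its id.
            node[None] = token_id

    def encode(self, text):
        ids = []
        pos = 0
        while pos < len(text):
            node = self.root
            best_id, best_end = None, pos
            i = pos
            # Walk the trie as far as the text allows, remembering
            # the longest complete token seen along the way.
            while i < len(text) and text[i] in node:
                node = node[text[i]]
                i += 1
                if None in node:
                    best_id, best_end = node[None], i
            if best_id is None:
                raise ValueError(f"no token matches text at position {pos}")
            ids.append(best_id)
            pos = best_end
        return ids


if __name__ == "__main__":
    # Toy vocabulary; note that no special handling of spaces is needed.
    vocab = ["a", "b", "ab", "abc", " ", "日", "本", "日本", "語"]
    tokenizer = TrieTokenizer(vocab)
    print(tokenizer.encode("abcab 日本語"))  # [3, 2, 4, 7, 8]
```

Because the lookup never splits on whitespace first, text without spaces goes through exactly the same path as English.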
And I would even argue, to be fair, that if you look at the really large
models, be it OpenAI or Claude, right?
tokenizers are not really a thing.
I mean, in the sense that the model can work even if you feed it character by character.
It's maybe inefficient.
Someone tried it.
I mean, there was that jailbreak where, you know, in the system prompt you put a character,
then enter, enter, enter; you remember that jailbreak?
No, I didn't see that one.
Yeah, so literally, instead of left to right, you can go top to down.
Okay.
And you're just eating tokens for every character.
Actually, you're eating two, because there's also the new line.
And the model understood it because there's enough
dumb data on the internet
that it has learned how to deal with this kind of formatting.
Got it, okay.
And if these models are already understanding things
at the character level, everything else is just improved compute.
Okay.
Because we jump the multiple tokens.
Do you have any idea of your dictionary size
when you use this trie data structure?
Yeah.
Because the typical tokenizer is like 80,000 tokens, dictionary size.
I presume you'll be bigger.
Yeah, I can't remember offhand.
Our previous tokenizer, the NeoX tokenizer, is around 50,000.
Then subsequently, I believe this is around the same size.
It's not bad.
Yeah, pretty good.
We didn't want to change too much on that size, but we just wanted to just change the format.
Yeah, cool.
I actually kind of want to establish the credentials of this thing.
So who is Blink?
Is he a rando on the internet?
Again, never heard of this guy until he published.
This is real name.
And you had, like, I have this paper to work with, but it was only published in May.
Yeah.
You found this before the paper.
And so I think it's very unusual for a researcher to effectively launch to the wider public without a paper and just get some kind of pretty decent community.
going and then publish the paper.
I think a few years back,
once, with GPT-2,
transformers started to pick up steam.
And I guess the whole world started to think,
let's just abandon recurrent neural networks.
So we haven't even gone into the code part,
but the main reason why neural networks were bad
compared to transformers was that
when you train, let's say you just input a token
and train on a token for a data sample,
you have to wait for the compute to finish for that token,
take the state, and then you train the next token.
And we'll get into how RWKV solves that,
but basically the whole world at that point
just concluded, yeah, neural networks cannot scale as well as
transformers, let us abandon them,
and everyone just went in that direction.
And Blink, or Bo Peng, his actual name,
decided, basically as an individual,
literally on the EleutherAI forum,
that hey, I think we can
modify recurrent neural networks,
based on the Apple paper,
the light attention that I showed previously,
to scale this up,
to make neural networks
scalable and parallelizable
in the same way transformers are.
Because the reason why the world branched away
and focused on transformers is that neural networks
were slow to train.
I mean, it wasn't so much about
whether it was good or not.
It was just that no one wants to wait
100 years for their model to finish training,
even if they can throw a GPU farm at it.
And that's
where he started looking into how
to make neural networks
trainable in parallel.
And specifically RNNs?
Yes. And subsequently
EleutherAI, and I believe there were
also a few others, because he
was doing it very publicly there,
came on board to sponsor
the GPU compute required.
Because even though I mentioned that
on large context sizes it is
substantially cheaper, I think,
especially if you run
an open-source Discord forum for
an AI model, every
day there'll be someone coming in who thinks
that they can train a 20B model on a single
GPU.
The scale is
still large; even though it's
like 1/5th or 1/10th compared to a transformer,
it still needs a lot of GPU.
So that's where EleutherAI and
the rest, Stability I believe is also
involved, stepped up
and donated the A100s
needed to train the
base models that RWKV had.
And so before those models were trained,
we only had, in theory, the toy models, or the smaller models, showing that this can match transformers.
We had no idea whether it could match transformers at that scale.
Yeah.
And subsequently, with the larger models, the 14B models and all that,
we can compare it directly with the NeoX model, and that's where this paper came out.
So that's the history behind it:
he wasn't really doing it in silence; he was doing it from Eleuther, then he branched out.
Because this became a big project on its own, and that's where other people started coming in.
So this is the part where we say that RWKV is a neural network that can be scaled and can be rolled out like a transformer, right?
The key thing that you want to see, right, is this diagram here.
This should be in the paper, should not sorry, yeah, accordingly.
So what you get?
So when you are running in inference mode,
ideally you should run it as a neural network.
So this is a layer.
So with classic neural networks, you have a state;
the state could start from blank;
you process a token, you output a state,
and then you rinse and repeat,
and as it keeps producing the output,
it makes a prediction.
So subsequently for RWKV, what happens here, right,
is that
we can roll out this neural network side by side,
and then it runs similar to a transformer.
The key thing here is that the states are split across the layers.
So in this diagram here specifically,
this is what we call the time mix and channel mix.
These are operations within the layer.
Depending on how you want to view it,
you could view these as individual layers,
or as how we view it:
we view this collection of layers as one layer block.
And each layer block passes the states to its sibling
subsequently down the road, as you process the next token, which is a similar RNN-type feature.
Correct.
However, the key thing is you do not need to wait
for the upper layers to complete before you can go to the next token.
So what happens in practice, if you're able to jump to the diagram like this,
this graphic here... This is not 100% how it runs behind the scenes.
I like it.
Yeah, whoever put time into this, kudos.
I made it.
So this is how you can visualize it.
So the first layer is the layer norm.
The layer norm is standard layer normalization; it
just doesn't need to wait for the other layers.
But if you notice, right, subsequently to the right and to the top, these blocks
need to wait for the blocks on the left.
And once you go past the first few tokens, right, this cascades very rapidly, especially
since this is only like one, two, three, four layers.
Most models have like 20, 40-plus layers.
And the cascading patterns keep happening.
And in practice, once you start cascading there, you just saturate the GPU.
And that's how it starts being parallelizable to train.
You no longer need to train in slices like traditional RNNs.
That was one of the key things.
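The cascade can be sketched as a small dependency model. Assume, purely for illustration, that the block at (layer, token) needs the block below it at (layer - 1, token) and the same-layer block for the previous token at (layer, token - 1). Under that assumption the earliest step a block can run at is layer + token, so whole diagonals can run together. The function name `wavefront_schedule` is hypothetical, and this is not how the real kernels are scheduled.

```python
def wavefront_schedule(num_layers, num_tokens):
    """Group (layer, token) blocks by the earliest step at which they can run."""
    steps = {}
    for layer in range(num_layers):
        for token in range(num_tokens):
            # With dependencies on (layer-1, token) and (layer, token-1),
            # the earliest possible step for a block is layer + token.
            steps.setdefault(layer + token, []).append((layer, token))
    return [steps[s] for s in sorted(steps)]


if __name__ == "__main__":
    schedule = wavefront_schedule(num_layers=4, num_tokens=6)
    for step, blocks in enumerate(schedule):
        print(f"step {step}: {len(blocks)} block(s) in parallel -> {blocks}")
    total_blocks = sum(len(blocks) for blocks in schedule)
    print(f"{total_blocks} blocks finish in {len(schedule)} steps")
```

With 4 layers and 6 tokens, the 24 blocks fit into 9 steps, and after the first few steps every step has 4 blocks running at once. With 40 layers and thousands of tokens the diagonals get wide enough to keep a GPU busy.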
What else is the key thing?
So other things is that.
So I think you're familiar with LSTM, right?
This is how
traditional neural networks keep
things within memory.
In RWKV, we have two channels;
we call them the channel mix and the time mix respectively.
Is there a formal definition of channel mix and time mix?
Yeah, you can see the data
from the respective time mix and channel mix
move to the next
segment.
How time mix is designed, per se,
is
in how it retains state,
similar to LSTMs, right,
where it processes a state
and the input, and it may decide to
discard certain states and keep new things in the state. Time mix does the same thing but with a
different formula. So it replaces the LSTM in that sense, and it can decide to keep things
indefinitely. So this represents the long-term memory, if you want to view it that way.
But classically the problem with that is that it struggles with long distance.
Correct.
This has the same issue.
So that's subsequent. It struggles with long distance because it also needs to keep track of both
near-term memory and long-term memory.
So you split it up.
Yeah, effectively split up.
So channel mix is the perfect memory?
Yeah, this is closer to the perfect memory; it's the short term.
So time mix has trainable weights on what it decides to keep and discard.
Okay.
Channel mix has a very strong bias towards just the next token.
So subsequently, as memories
are stored in the lower layers, they just slowly shift upwards through the channel mix.
And this is the short-term memory, which at some point, as it shifts all the way up,
will just disappear into the void. At that point, subsequently,
time mix should be retaining the longer-term memory.
So we took a break for a bit, but now we're trying to cover, like, what is the big aha moment
for you? And you said it was something to do with cost.
Correct. So we have this chart on screen. There's literally a chart of quadratic
scaling versus linear scaling in terms of GPU time spent in text generation.
And you said it was at training time and at inference time.
Just basically in everything that matters.
So, I mean, look back at how an RNN works from a high level.
We do an O(1) operation on a token, create a state; O(1) operation, create a state.
So this just scales linearly.
You want to throw a thousand tokens at it?
On inference, it just scales linearly.
Subsequently, for transformers, you take in a token,
you process your first token.
It may be O(1) here.
Subsequently, when you generate your third token,
you need to compute your second and first,
and it repeats, so when you do your 1,000th token,
you need to compute back your 999 previous tokens.
And as this keeps growing and growing,
this is your quadratic scaling,
and this is why we had this
graph of the amount of cumulative GPU time that you need to spend to generate all these tokens
respectively. And this is fundamentally just transformers versus neural networks. Yeah, on inference.
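The arithmetic behind that chart can be sketched by counting token visits. The unit here is an assumption made for illustration: the RNN touches one token per generated token, while attention looks back over every token so far, so generating n tokens costs 1 + 2 + ... + n visits. Real GPU time also depends on model size, caching, and hardware, none of which is modeled here.

```python
def rnn_token_visits(num_tokens):
    """One constant-size state update per generated token."""
    return num_tokens


def attention_token_visits(num_tokens):
    """Token t attends over all t tokens so far: 1 + 2 + ... + n."""
    return num_tokens * (num_tokens + 1) // 2


if __name__ == "__main__":
    for n in (10, 100, 1_000, 10_000):
        rnn = rnn_token_visits(n)
        attn = attention_token_visits(n)
        print(f"{n:>6} tokens: rnn={rnn:>6}  attention={attn:>10}  ratio={attn / rnn:.1f}x")
```

At 1,000 tokens the attention count is 500,500 against 1,000 for the recurrent form, and the gap keeps widening as the context grows.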
And subsequently, neural networks did have the disadvantage of, let's
say, not being able to parallelize well in training, but as I covered, RWKV kind of solved that
by effectively splitting the layers, allowing you to train different parts in parallel.
Some people will go into the academic debate of, technically, the second and third token are not parallelizable until the first is done, but once you get into "I can saturate a GPU" land,
It's just way better.
It's just academic debate.
We are done.
And so training, in essence, has always been, I mean, be it for transformers
or neural networks: I need to do an inference pass.
I look at the logits.
I then backprop to see what went wrong and I update the weights.
Yeah.
So the inference is the forward pass.
It's still part of the training course.
And as you backprop as well,
having needed to only look at the current tokens
and the state, instead of everything,
also reduces the amount of things that you need to backprop.
So it's just that, it's like there's so many factors involved
in just reducing the overall inference and training time.
And that was something that appealed to me
because in the long run, I mean,
all of us wants our model just run blazingly fast, right?
Yeah.
And also on minimal hardware.
Oh yes, which, as far as I understand, you still have 14 billion parameters.
That's not going away.
You still need the RAM to store 14 billion parameters' worth of weights and stuff.
That's not going away.
Yeah.
Okay.
So RAM is unchanged.
Yeah, on the RAM side, but the working memory is reduced.
So typically you need more than 14 for transformer.
I mean, let's not touch quantization.
But in this case, we don't need to keep.
like if you really, really want to like save RAM,
it is possible for you to do token by token inference
so that you don't need to keep your states in history.
You only need to keep your current token state and your next.
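A rough memory estimate shows why this matters. The figures below are illustrative assumptions, not measurements: a 40-layer model with width 5120, 16-bit values, a transformer that caches one key and one value vector per layer per token, and a recurrent model that keeps a handful of state vectors per layer (5 is used here as an assumed count). The function names are made up for this sketch.

```python
def kv_cache_bytes(num_layers, d_model, context_len, bytes_per_value=2):
    """Transformer working memory: keys and values for every past token."""
    return 2 * num_layers * d_model * context_len * bytes_per_value


def rnn_state_bytes(num_layers, d_model, vectors_per_layer=5, bytes_per_value=2):
    """Recurrent working memory: a fixed number of vectors per layer."""
    return num_layers * vectors_per_layer * d_model * bytes_per_value


if __name__ == "__main__":
    layers, width = 40, 5120  # assumed, illustrative model shape
    state_mib = rnn_state_bytes(layers, width) / 2**20
    for context in (2_048, 8_192, 32_768):
        cache_gib = kv_cache_bytes(layers, width, context) / 2**30
        print(f"context {context:>6}: kv cache ~{cache_gib:.2f} GiB, "
              f"recurrent state ~{state_mib:.2f} MiB")
```

The cache grows with every token of context, while the recurrent state stays the same size. The weights themselves are not counted in either number and still need to fit in RAM.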
Yeah.
Yeah.
And yeah, and that's actually like one segment of our community.
It's just purely porting RWKV to C++-based models.
Oh, and the next.
Yeah, and running it on Pis and stuff.
Raspberry Pis.
Yeah.
It's interesting.
There's a chart about performance, and it shows that RWKV is competitive or actually better in some of the reasoning challenges, which is something I definitely would look for, right?
Like, it's fine if like your speed is faster and all that, but if the reasoning quality sucks, then it's not a very useful language model.
Exactly.
So.
So this is like literally us saying there's no trade-offs.
Yeah, you don't lose out in that process.
Okay.
Big question then.
Why isn't RWKV a bigger deal right now?
So, one, we
are not a commercial organization.
This is literally the pure open-source play.
But you could have done
the stable diffusion thing.
Which, you know,
stable diffusion launched. It was
by a bunch of nobodies before that.
It's from, like, literally
split out from Eleuther.
But they definitely had some hype.
They definitely, like, you know, I interviewed
Sharif Shameem.
The reason I ask you so many things about
how you found out about RWKV is
because I think the generalizable skill is how to be early in AI.
Because being early in AI is very valuable.
Because then you were there to see how things developed
instead of like picking it up later like me.
Anyway, so, yeah, why is it not a big deal?
You don't need to be frank.
Yeah.
We just suck at marketing.
Okay.
That's fair.
I mean, this is part of it.
Yeah, this is part of it.
Like, so like maybe.
But I don't think that is entirely the cause.
Yeah, I'm sure.
Definitely, I think the other major segment right now as well is that we were
really late on the paper. Okay. One of the weirdest things right now,
I feel, is that I think RWKV is starting to have its
moment right now. Okay. Ever since that initial paper came out, there was
RetNet, and I think there are a few more additional papers
coming out, one from Microsoft, one from other organizations, that are
literally exploring the whole idea
once again of scalable neural networks.
And they are citing RWKV as well.
And I think for most people
the question is,
why switch to this model?
Even though we have proven that yes,
it's scalable to 7B and 14B
and that it can match transformers
at similar params and training size,
all this is very academic.
Because the community, right, the community at large, especially for the English-speaking community, right, they don't really care about this.
They care about what's the best model that I can run on my computer, at least within the open-source space.
And by that, even though we match in performance for things in the same dataset, the keyword is same dataset.
Like, this benchmark is not even RedPajama yet.
It's the Pile.
And when you have models that are being trained on much larger data set,
especially for an English use case, it makes more sense to use that.
I see, so there will be another paper coming that is RWKV trained on RedPajama.
And that will presumably be a larger dataset, yeah, and so on so forth.
So I think that's the, we are still in the stages of reaching that point where we train on the larger data set.
The only reason why we have an outsized impact compared to the other
models is frankly because half of our Discord came in not for English; it's for other
languages.
Yeah, that's great.
And there is a definite very US and English-centric bias towards these models, and it's, to
me, kind of poetic.
Like there's nothing in the architecture of RWKV that particularly biases it to be really
good at other languages.
It's just that as a community, you decided to prioritize it in your tokenization and your
datasets.
That's it.
Yeah, that's it.
I would even argue that I'm surprised, more surprised that,
especially on the European side of things,
that we don't have more models that actually focus on even the European languages.
Because that is a softer jump than to Japanese and Chinese characters; they all use the Roman alphabet.
But I think back to the benchmark, what excites me more still about this is that it just means that we just need to scale.
We just need to scale this model, and feed it the right data, to like 40B.
40B, 60B.
I mean, params is one thing.
It's data sets and GPU time.
Yeah.
So you and I talked offline about ideas for getting data, getting compute and all this.
Okay.
So this is like a project that's ongoing.
Okay, anything else for the future of RWKV?
The biggest one would be.
Okay, so this is back to how, remember I said,
evals don't highlight everything. Being realistic about another weakness on
RWKV's side: now, with the rise of, let's say, 100K or 32K context windows in
transformer models, RWKV currently is trained to handle, let's say, 8K, or even, some people
have already trained it to 16K sizes. And, well, as a neural network, it will
happily keep going on for infinite context length. It will just keep generating. Does it do well?
The answer is no, because you didn't train it to handle that situation. And there's actually a chart here.
So for example, the prediction, the Pile test loss, right, does improve as we go down the
context length, but this is if we train it. And what's not seen here is that if we were to, let's say, run it further,
it'll just go back up, because it was not trained to handle that. Well, it technically can run, but it suffers from the longer context length.
and that
that's the part where
RWKV struggles, especially in Q&A tasks
on huge documents,
like when you get close to summarizing giant documents.
none of this is fundamental, it's just you need more money
yeah that's it
and no there is actually a fundamental part
so what one of the things that I was doing
and I am actively helping within the community right now
is that we found that
the existing way to scale the memory
was not that efficient,
and we were just being
realistic with ourselves. If we want to hit 100K, we need to change this. So one thing that I'm
actually looking forward to right now is those experiments. We have already started scaling
things to be able to handle things at transformer scale, be it 4K or 8K, in terms of how it
handles memory really well. And we want to extend that to be like
16K, 32K and 64K. And that is within our roadmap. And that's the exciting thing. Because once we have that,
it's able to handle long-term memory within those sizes.
It removes what many people in the community felt was the last architectural limit.
Because once it's able to handle memory, like context length, the same as a Transformer...
You know how people currently do long conversations with Transformers?
They just discard the rest with a sliding window.
This is like the better version of a sliding window.
The model can handle the sliding window perfectly, and it can also keep remnants of what is behind it.
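To make the sliding-window comparison concrete, here is a minimal sketch. The `decayed_state` function is a hypothetical toy (an exponentially decayed running sum), used only as a stand-in for a recurrent state; it is not RWKV's actual update rule. It shows the difference between discarding old tokens outright and keeping a fading remnant of them.

```python
from collections import deque

def sliding_window(tokens, window):
    """Keep only the last `window` tokens; everything older is discarded."""
    buf = deque(maxlen=window)
    for t in tokens:
        buf.append(t)
    return list(buf)

def decayed_state(values, decay=0.9):
    """Toy recurrent state: an exponentially decayed running sum.

    Hypothetical stand-in for an RNN state, not RWKV's real formula.
    Old inputs fade (lossy) but are never dropped outright.
    """
    state = 0.0
    for v in values:
        state = decay * state + v
    return state

tokens = list(range(1, 11))                 # ten "tokens"
window_view = sliding_window(tokens, window=4)
print(window_view)                          # the first six tokens are gone

# Contribution of the very first input after nine more steps
with_first = decayed_state([1.0] + [0.0] * 9)
print(with_first)                           # small but non-zero remnant
```

In this toy, the window forgets the first token completely, while the decayed state still carries a weakened trace of it, which is the "lossy but not discarded" behaviour described above.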
Sure.
And that's something that I'm really excited about and invested in,
because this is back to the full circle of how I came into RWKV.
I want my model to handle 100K tokens.
Whole megabytes of HTML.
Whatever I throw at it, it should be able to process it.
But it will be lossy.
The later half will be lossy,
but the key thing is extending the non-lossy part,
and we are aiming to extend the non-lossy part.
So, you know, you have displayed today an impressive amount of knowledge across all this stuff, and you don't have, like, a research background.
What's your advice to AI engineers who want to get as deep as you?
So I think your article articulated very well that there are going to be divisions in how we approach this.
So: AI engineers, model trainers and dataset curators, and ML scientists.
So, loosely defined as three; I ignore the full stack because every company needs it.
So within these three spaces, there are actually a lot of ways anyone can come in without knowing anything.
So let's just start with AI engineers.
Even though in this whole talk we dived into how the layers work, and we also showed how the math works, frankly, for an AI engineer, you don't need it.
The main thing you need to do is, frankly, just play around with ChatGPT or all the alternatives. Be aware of the alternatives, and just be very mercenary: swap out to Claude if it's better for you, or swap out to an open source model if it's better for you, and just play around with the prompts.
Learn prompting techniques like one-shot, two-shot, few-shot, and then from there on you can start building your agents, stacking your prompts in sequences and stuff like that.
You are able to build applications that do anything in the AI space, all without knowing all this nerdy stuff or the hard engineering, because that's all you really need to actually build a product for the user.
Remember, you are supposed to focus on making it for the user. They don't care if it's RWKV or Transformer underneath the hood. They just care that it helps them.
And I will say Notion is probably one good example of how they use it, because we know underneath the hood it's OpenAI, but it's really useful for the user.
Yeah.
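As a rough illustration of the few-shot prompting and "swap out the model" advice, here is a minimal sketch. All names in it (`build_few_shot_prompt`, `run`, `echo_backend`) are hypothetical and invented for this example; `echo_backend` is only a stand-in for whatever real model client you would plug in.

```python
from typing import Callable, List, Tuple

def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format worked examples, then the new query with an empty answer slot."""
    parts = [f"Input: {q}\nOutput: {a}" for q, a in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

def run(prompt: str, backend: Callable[[str], str]) -> str:
    """Send the prompt to whichever backend is currently plugged in."""
    return backend(prompt)

def echo_backend(prompt: str) -> str:
    """Hypothetical stand-in for a real model client."""
    return f"[{len(prompt)} chars received]"

examples = [("2 + 2", "4"), ("3 + 5", "8")]   # two-shot
prompt = build_few_shot_prompt(examples, "7 + 1")
print(prompt)
print(run(prompt, echo_backend))
```

Because the application only depends on a function from prompt to text, switching between hosted and open source models is a matter of passing a different `backend`.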
No, so I obviously agree with all that.
Let's just say that people are there already and they're just curious; they want to do what you did.
So that's where you start going down the layers.
So the next layer you go down into is training the model from scratch, fine-tuning, and incorporating the dataset.
And this is where you still do not need to know the math, but you need to have a rough sense of how the model works.
Even within the open source Transformer space, certain models are better trained in certain sequences with certain learning rates, and you just need to get a feel for it.
So this is just like: collect the dataset, try it, see the loss.
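The "try it, see the loss" loop can be sketched on a toy problem. This is a minimal, assumed example (a one-parameter linear fit with plain gradient descent), nowhere near a real fine-tuning run; the dataset and learning rates are made up for illustration.

```python
def train(lr, steps=50):
    """Fit y = w * x with plain gradient descent; return the final loss."""
    data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # toy dataset, true w = 3
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Sweep a few learning rates and compare the final losses
results = {lr: train(lr) for lr in (0.001, 0.01, 0.1, 0.5)}
for lr, loss in results.items():
    print(f"lr={lr:<6} final loss={loss:.6g}")

best_lr = min(results, key=results.get)
print("best:", best_lr)
```

Even in this toy, one rate is too slow, one diverges, and the usable range is found only by running it, which is the kind of feel being described.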
You literally did this.
Yeah, at least for RWKV and the CodeGen models.
Yeah, it's not cheap work, too, because you need GPUs.
Okay. And that took you how long?
I think CodeGen alone was like six months, and then with RWKV I've been doing this for like another six months.
And it's just pure experimentation.
There's no right or wrong, especially if it's in a different domain.
Like recently I was helping someone on the EleutherAI Discord regarding the music generation domain,
and my assumptions for learning rate and all the patterns were just completely thrown out of the window, because the music model is just fundamentally different in that sense.
So that is the exciting thing: because it doesn't really have any specific rules and guidelines until you trial-and-error your way to a certain space, it also means that you coming in now is as fresh as anyone else coming in last year.
It's really that kind of uncharted space for everyone.
And especially as you start exploring new domains, your existing knowledge may actually matter, because, I mean, I think a few papers already covered this, how you train your model in certain sequences also matters.
Like, you want to train a certain set of knowledge, and then you extend that knowledge subsequently.
But if you're talking about material science or genetics, how am I supposed to know what is foundational and what is extended knowledge? I have no idea. Maybe you do.
I'm just picking an example.
And the same thing for music and so on.
So those are places where, even though you're outside the space, you can come in just at the dataset level.
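One way to picture "foundation first, then extended knowledge" at the dataset level is a simple ordering step. This is a sketch under assumptions: the `stage_of` labels are hypothetical and would have to come from someone who knows the domain, and real training pipelines may mix stages rather than sort strictly.

```python
def order_by_curriculum(samples, stage_of):
    """Sort samples so earlier stages come first (stable within a stage)."""
    return sorted(samples, key=lambda s: stage_of[s["topic"]])

# Hypothetical stage labels that a domain expert would supply
stage_of = {"atoms": 0, "bonding": 1, "crystal defects": 2}

samples = [
    {"topic": "crystal defects", "text": "sample A"},
    {"topic": "atoms", "text": "sample B"},
    {"topic": "bonding", "text": "sample C"},
    {"topic": "atoms", "text": "sample D"},
]

ordered = order_by_curriculum(samples, stage_of)
topics = [s["topic"] for s in ordered]
print(topics)
```

The code is trivial; the hard part is the `stage_of` mapping, which is exactly where domain knowledge from outside machine learning comes in.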
Now you want to peel off to the next layer, let's say.
Let's just say you want to look into modifying the model,
the foundations of it.
I think one of the beauties of this current boom is that even though I dipped my toes in early, like before the Transformer wave and into an early neural network phase, frankly, almost everything that matters was basically in the past four years.
There were a lot of things in academia before that, and they were mostly dealing with models that were under a billion parameters.
They pretty much no longer matter.
Can you be more specific?
Like, okay, I know I'm shooting myself in the foot, because RWKV is a neural network, but if you're just trying to get Transformers to work, you don't need to know LSTM.
Yes.
You don't need... yeah, there's a lot of pre-knowledge in neural networks that is irrelevant in the Transformer era.
And maybe some of it will have a resurgence, but to get up and running, it's not a requirement.
And I think this is where you could go the very academic way of reading papers and stuff, but frankly, what I found way more useful was Karpathy.
Yeah, his series of videos, Zero to Hero.
Yeah, that is really, really good.
I think even though I read some of the papers and guides before that, it really helps that it starts from zero, because you can see how it happens part by part.
And even though we will not use the exact same code he uses, because he re-implements the backprop and all that, and we're just going to use Torch for that, that's where you get the aha moments on how these building blocks work and how they fall into place.
And, like, I had a fundamental misunderstanding of how backprop worked until I actually watched this video.
Oh, really?
Yeah. And I think this is the scariest and craziest thing about AI models sometimes: you can actually have fundamental misunderstandings, but as long as you make the building blocks and they connect and, okay, the loss is great, it works.
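For a sense of what re-implementing backprop by hand looks like at the smallest scale, here is a minimal sketch: one neuron, with the chain rule written out step by step and checked against finite differences. It is an assumed toy example, not code from the videos mentioned above.

```python
import math

def forward(w, b, x, y):
    """One neuron with tanh activation and squared-error loss."""
    z = w * x + b
    a = math.tanh(z)
    return (a - y) ** 2

def backward(w, b, x, y):
    """Chain rule, written out one step at a time."""
    z = w * x + b
    a = math.tanh(z)
    dloss_da = 2 * (a - y)
    da_dz = 1 - a ** 2               # derivative of tanh
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz    # (dloss/dw, dloss/db)

def numeric_grad(f, w, b, x, y, eps=1e-6):
    """Central finite differences, used only to check the manual gradients."""
    dw = (f(w + eps, b, x, y) - f(w - eps, b, x, y)) / (2 * eps)
    db = (f(w, b + eps, x, y) - f(w, b - eps, x, y)) / (2 * eps)
    return dw, db

w, b, x, y = 0.5, -0.2, 1.5, 1.0
manual = backward(w, b, x, y)
numeric = numeric_grad(forward, w, b, x, y)
print(manual, numeric)
```

A gradient check like this is one cheap way to catch a misunderstanding that a decreasing loss alone would not reveal.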
Yeah, well, so, you know, even the gods of the industry... I don't know if you read the SwiGLU paper.
So there are all these alternative activation functions.
Like, there's ReLU, and then people are always looking for different slopes.
And very famously, the SwiGLU paper had this line in there that was like, yeah, we don't know why this works, but it works.
Can't explain it.
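For readers who have not seen it, a SwiGLU unit multiplies a Swish-activated projection by a second, purely linear projection. The sketch below is a minimal pure-Python version that omits biases and the output projection of the full feed-forward layer; the weights are hypothetical toy values chosen only for illustration.

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x), a smooth alternative to ReLU."""
    return x / (1.0 + math.exp(-beta * x))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def swiglu(x, W, V):
    """Swish(xW) multiplied elementwise by the linear branch xV."""
    return [swish(g) * v for g, v in zip(matvec(x, W), matvec(x, V))]

# Different "slopes": ReLU is flat below zero, Swish dips slightly negative
print(relu(-1.0), swish(-1.0))

# Hypothetical toy weights
x = [1.0, -2.0]
W = [[0.5, 1.0], [0.25, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu(x, W, V)
print(out)
```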
Yeah, it literally happens here and there, you know, to all these gods too.
One of the funny things that I'm doing right now in the RWKV-5 experiments is that, okay, we are going to do this change, we're going to run this train.
Make your prediction: will this model beat this model in this loss curve?
As a game, as a bet.
It's very informal. It's literally a buddy kind of bet.
But the fact that we can do these kinds of bets, even though we understand the code, just goes to show how often it's like, oh wait, this didn't go the way anyone predicted.
And that's why, even if, let's say, you don't have a PhD and so on and so forth, even if math is not your specialization, you can come in as a developer.
I'm going to say frankly, I didn't come from the research side.
The extremely math-heavy stuff is what I struggle with.
What I do sometimes is copy and paste the math into GPT-4 and ask it to explain it to me in plainer language.
It's very good at that.
Yeah.
Yeah.
And so, but the thing is, there is lots of value beyond that.
One thing that I realized, and this is not specific to RWKV, this also happens across a lot of open-source models, is that when a lot of ML scientists really build this stuff, the focus is more like: always get it to work.
It was never about getting it to work efficiently, or getting the code documented or organized.
And Stable Diffusion literally went through this whole journey.
They had the code and the model that worked.
And the community, engineers that came in with zero machine learning background, started picking it apart.
It's like, no, you could replace this with this, which does the exact same thing and is more efficient.
Like, one of the major breakthroughs, for example, for GGML, and this happened some time back for the LLaMA models, was that someone external to the AI community went in and implemented memory mapping.
Yes, I saw that, yeah.
I forget her name, but yeah.
Justine.
Dot lol is her URL.
Yeah.
And she didn't come in as an AI expert.
She came in as a software engineer.
Yeah.
And these are all just very, very straightforward.
You know, in her world, this is normal.
Whereas for the researchers, they will be like,
I don't want that.
Wait, what is memory map?
Yeah, exactly.
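For anyone asking the same question, here is a minimal sketch of memory mapping using Python's standard `mmap` module. The file written here is a small stand-in; the point is that the operating system pages data in lazily on access instead of the program reading the whole file up front. This illustrates the general technique, not the actual GGML change.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Write a stand-in "weights" file (real model files are far larger)
    path = os.path.join(tmp, "weights.bin")
    with open(path, "wb") as f:
        f.write(bytes(range(256)) * 4096)      # 1 MiB of dummy bytes

    with open(path, "rb") as f:
        # Map the file read-only; pages are loaded by the OS when touched
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = len(mm)
            chunk = mm[1000:1004]              # copies just these 4 bytes

print(total, list(chunk))
```

Slicing the map returns ordinary `bytes`, so the rest of a program can treat the mapped file much like an in-memory buffer.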
Yeah, and there are a lot of things like that. One of the jokes that I have right now is that every month there is a researcher, an ML scientist, who is rediscovering the number 32.
Why?
Because, be it them or someone in the community writing the inference code, GPUs, especially NVIDIA GPUs, tend to work really well when things are aligned to batch sizes that are multiples of 32.
Oh.
And if you've been in the gaming industry, especially when you write shader code, this is well known, just given knowledge.
And people are just constantly rediscovering: oh, maybe if I just adjust my dataset or my data size to fit this batch size, suddenly I get a 10% improvement.
And yeah, these are things that, once again, because they were so focused on just making it work, they won't know from outside their space.
And that's why I would say, if anything, now is the best time for people who don't know AI deeply, coming from different backgrounds, to come in, because your contribution could be anywhere from the dataset level, how to train the knowledge, to shader code, to, heck, how to memory map, how to cache data. There are so many gaps.
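The "number 32" trick usually comes down to rounding a size up to the next multiple of 32, which matches the 32-thread groups (warps) that NVIDIA GPUs schedule. Here is a minimal sketch; the helper names are hypothetical, and whether padding actually helps depends on the kernel and the hardware.

```python
def round_up(n: int, multiple: int = 32) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

def pad_batch(batch: list, multiple: int = 32, pad_value=0) -> list:
    """Append padding items so the batch length is a multiple of `multiple`."""
    return batch + [pad_value] * (round_up(len(batch), multiple) - len(batch))

print(round_up(50))                  # next multiple of 32 at or above 50
padded = pad_batch(list(range(50)))
print(len(padded))
```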
Cool, great. So yeah, thanks so much for being very willing to get on and talk with no prep.
We did some prep, but it's a very unusual podcast episode, and I really enjoyed it.
We literally just met yesterday in Singapore.
But I know you've been on the Discord for a while, and I can tell you're very serious about all this.
I think it's very unusual for someone.
Like, you have a job, but this is like a second job, essentially.
Yes.
But you are really enthusiastic and passionate about it, and I think that's very rare, and I do want to encourage more people to do it.
And so thanks for sharing.
Yeah, bye. Thanks for having me here.
