Latent Space: The AI Engineer Podcast - RWKV: Reinventing RNNs for the Transformer Era — with Eugene Cheah of UIlicious

Starting point is 00:00:00 Hey listeners. Today we have a very special episode for you. There's been a recent paper on the top 10 open challenges in LLM research that has consolidated a lot of intense debate. Today we're going to talk about RWKV models, Receptance Weighted Key Value models, with Eugene Cheah, who is part of the core RWKV team, CTO of a low-code AI test automation platform, and an active member of the Latent Space Discord. The RWKV architecture has the potential to solve three of the top 10 open LLM challenges: increasing context length, making LLMs faster and cheaper, and designing a new model architecture. What is particularly appealing about it is that it does so by reviving the recurrent neural network, which even I have argued has been obsoleted by the transformer. It rejects the idea that attention is all you need

Starting point is 00:00:42 and replaces multi-head attention and feed-forward networks with new concepts called Time Mix and Channel Mix, respectively. It has been trained up to 14 billion parameters, they're getting help from EleutherAI and Stability AI to scale up even more, and it shows competitive results on reasoning benchmarks, the same benchmarks we covered in our Benchmarks 101 episode, against similar-size models, yet with linear cost and speed curves instead of quadratic ones. In a way, RWKV is promising to be the room temperature superconductor of LLM architectures.

Starting point is 00:01:09 In other words, the parallelizability and performance of transformers without the quadratic cost. Obviously, the topic of what happens after the transformer is, in finance terminology, what we call a low delta outcome. It's probably not going to happen, but if it does, it will be very, very big. We even discussed this a little bit in our episode with Jonathan Frankle of MosaicML. But since the RWKV paper was published, the idea has been somewhat independently validated, with Microsoft Research putting out RetNet, or the Retentive Network, which has similar aims in what it is trying to do.

Starting point is 00:01:41 And it also, of course, competes with other alternatives to the transformer, like the state-space models coming out of Chris Ré's group at Stanford, S4 and H3, and the Monarch Mixer that was recently announced. However, RWKV is so far the most validated of all these ideas, because it is already trained up to 14 billion parameters, with multiple models that you can download and generate text with today. As podcasters, we want to be the first place that you hear about new things in AI that you'll be using in your work or personal life as AI engineers and enjoyers. So this presents us with a problem. We have to be early on consequential topics and things,

Starting point is 00:02:13 but also high signal. Some of our favorite compliments so far, which, by the way, I've added to our About page if you want to check that: "Your pods are a legitimate highlight of life for me. They're amazing," from McKay Wrigley. And from the AI safety memes Twitter account, which is always fun, they just simply said we're the highest signal pod for them. So we're very proud of this and want to keep it up while taking risks, because even though we crossed a quarter million downloads just five months into the podcast, we're still very young and trying to figure out what kind of podcast we want to be and what kind of audience we want to have. Today is going to be one of the riskier pods for a few reasons. One, it is the first pod we're

Starting point is 00:02:48 doing on a non-traditional architecture with no large Western institutional backing. Two, it is the first pod we recorded outside of the US without a regular studio, so the audio is not as good. And three, finally, it is the first pod we're doing with my Singaporean accent. I'll address each of these in turn. One, there's a significant institutional bias in Western media coverage when it comes to AI, in that if you don't come from Stanford or Oxford, or you don't work at Google or Facebook, your ideas have trouble getting attention. However, by far, the most admired organization that our guests have repeatedly mentioned

Starting point is 00:03:22 is EleutherAI, which spun up as a decentralized Discord community that independently trained the first GPT-class LLMs without a prestigious background. Eleuther has since spun off organizations like Stability AI and Conjecture. But I suspect that the RWKV community is working like the early days of Eleuther, and we have a rare opportunity to capture an oral history of it live rather than two years after the fact. Two, part of the Latent Space magic is that we try to get to know our guests and be in person with them to establish rapport, like we did in New York in our Notion AI episode with Linus Lee. Despite the lower audio quality, I think we got a much better interview with Eugene because we were able to interact with each other in person.

Starting point is 00:04:00 Three, part of the joy of audio is that you get to hear the full diversity of humanity, and neither Alessio nor I are American, and I like to showcase the broader world through our AI lens wherever we can. As you'll see, the RWKV story is also about the non-English rest of the world self-organizing to build the LLMs that the English-centric West has neglected, and doing so relatively successfully. We've aggressively edited this interview down to one hour for the audio podcast, but for those who are interested in RWKV, head over to the new Latent Space TV YouTube channel, which has the full two-hour interview, including a screen share walkthrough of the RWKV models, paper, and Discord, as well as digressions on Eugene's background, discussions on diffusion models and the token crisis, and the relationship between open source AI and the AI waifu/husbando community.

Starting point is 00:04:46 I hope you enjoy our conversation with Eugene Cheah of RWKV, and if you liked it, reach out to him at @picocreator on Discord or on Twitter and let him know. Okay, so I'm here with Eugene. We are in Singapore. This is the first time I'm podcasting in Singapore. This is the first time I'm podcasting with my Singaporean accent. Eugene has been a very valued part of our Latent Space Discord for a while. And also diving deep into RWKV.

Starting point is 00:05:10 I think you're actually the first person that brought it to my attention as a potential transformers alternative. You're also CTO of UIlicious, which is a UI testing company that's in Singapore here, which is a local platform. I got the first demo maybe four years ago. Yes. And I was like, okay, fine, you know, you're doing testing. There wasn't an obvious AI angle.

Starting point is 00:05:30 I mean, now that you explained it, it was great. But like, what was your personal, like, "Okay, I'm going to be a dedicated AI guy for UIlicious"? Okay, so one thing that I found very interesting with the huge transformer boom right now is that traditionally, right, when you tell companies that you want to build your own AI, you need a really large dataset. And over time, actually, the amount of data that you need has scaled down, because you can just now fine-tune foundation models.

Starting point is 00:05:58 Yeah, fine-tuning foundation models. And when we started Neurin. We always knew at that time, because a lot of the other companies that were launched at the same time were dealing with neural networks, that at some point, the data that we've been collecting on, let's say, how to do website testing. It's just a very specific focus. Basically every single test that has run on our platform, unless our customer has opted out or deleted their account, basically privacy-related stuff,

Starting point is 00:06:23 we actually still retain the test data, and that's something that we always felt was useful in the long run to be able to actually build a huge training model. The irony of it was that even though we were building all those datasets, as the threshold came in and the transformer boom happened,

Starting point is 00:06:38 we realized we don't actually need that big of a dataset anymore to actually get a functional AI. This is one of the key insights, especially for people who are trying to build on top of a transformer model. Pre-transformer, pre-large-language-model, we would always be thinking in terms of like 100 gigabytes of data, one terabyte of data, like millions of records for all the different examples. Post-transformer, you probably need only like a thousand or 10,000 good-enough data points, which you can literally get an intern to produce in a few weeks. Right.

Starting point is 00:07:09 Just get it done. And you have a working model. It may not be that great, but frankly every piece of data you add after that has diminishing returns. And because it's a language model, it doesn't actually have any inherent understanding that it is automating the browser. So it's presented as like a prompt-answer pair, like a question-answer pair. At least for our internal model that our users are using, it's presented as: here's the prompt, describe your test or what you want to modify in the code.

Starting point is 00:07:35 And then subsequently it generates the code for you. In hindsight, it's now basically Copilot. Yeah, yeah. Yeah, I think now Copilot is adding that chat widget. Are they fully launched yet? Yes, I actually downloaded it yesterday. I haven't actually used it yet, but it is a separate VS Code extension. So there are now three Copilot extensions shipped by GitHub, because they have shipped their org chart.

Starting point is 00:07:56 I'm quite friendly with that team, but it's very funny. But just to come back to you, so did you implement this with GPT-3? So we based it off the Salesforce CodeGen model. Okay, right. So that was the foundation model that we built on top of. We are looking into replacing it in parts, but that becomes a longer conversation. CodeGen being the first really credible, open source, code-specific language model that was released by literally anyone, I think about three years ago. Yeah.

Starting point is 00:08:25 And then they recently released CodeGen 2. Correct. Any opinions on CodeGen 2, just while we're on this topic? In terms of CodeGen, one big appeal of the CodeGen and even the CodeGen 2 model is that Salesforce took a very clear and clean approach to the licensing. Meaning they were very, very clear that everything that they trained on was open source. Yeah, MIT. They didn't touch the problematic licenses. So, and you can imagine. And you think that Copilot did? Knowing Microsoft's statement on how liberal they were about GitHub data, and they were saying, they used the term, that it is under fair use.

Starting point is 00:09:01 I see. Yeah. I have no reason to believe that they didn't. But this same problem happens to actually a lot of existing code-gen models, and that was actually the main appeal for me for building on top of the Salesforce CodeGen model. Mostly also because, for us, we deploy on-premise into enterprises. Yeah. In Europe, and they ask questions. So what does deploy on-premise mean? Like, you pack UIlicious into a container? Yeah. You give it to them? Yeah. And then it's like a license fee or something? Correct. Okay, cool. That's very interesting. Yeah. Okay, I don't know if I have any other questions based on that.

Starting point is 00:09:40 Anything else before we go into, like, the reasons for alternative models? Okay, so, anything after... no, I don't really have much. Right, for alternative models. So let me set the premise, right? Transformers have won for now. They've slain the neural networks. Yes. And, you know, it seems like you have had a history with machine learning since before transformers, and now they're kind of at the peak of their power. And I see that, you know, there's a desire for alternative models for a number of reasons. But I'm very curious as to what drives your personal interest in alternative models.

Starting point is 00:10:18 So, first thing to be clear, the majority of our AI is still based on Transformer, at least within my company. Yes. But what drove me into alternatives beyond Transformer? In essence, once we actually managed to get our bot to generate UI testing code, the most obvious next thing that our customers started asking was: hey, let's say the test failed. Can your AI now analyze my website and then tell me what's wrong and tell me what to change? Basically, they're getting crazy and case. Yeah, yeah, yeah. Humans are very good at moving goalposts. And we had something working for toy websites. But one thing that we do internally is that we look at the, I think, what's the list, top 100, top 1,000 websites. And we basically just run, or we actually do run, our test platform against them to see

Starting point is 00:11:05 that our code works against any front-end platform. Well, what do you mean run your test platform, right? Because you don't have tests for them? Yeah, we have some very rudimentary basically. Go to website, see something, click something, add to cart. Yeah, that's it. That's it. The idea is more like, because there's so many frameworks out there.

Starting point is 00:11:22 And our... You want to make sure you cover all of them. Yeah. And so we did the same thing for our AI. And the first thing that it died on was literally Amazon. Why? Oh, 5 megabytes. Yeah, I think you've heard me mentioned it.

Starting point is 00:11:34 So when you are trying to analyze your website, we've been talking about increasing token count size, right? But for e-commerce websites in particular, even if you strip off the CSS, even if you strip off the JavaScript, having the entire HTML in megabyte size is not unheard of. And that's where it's like, how am I supposed to solve this from an AI point of view? How many tokens would that be? Like, oh my gosh, you could easily be looking at over a million tokens. I see. Which is still too much even for today.
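To put a rough number on the "over a million tokens" remark, here is a back-of-the-envelope sketch. The figure of about 4 bytes per token is an assumption (a common rule of thumb for BPE tokenizers on English-like text), not something measured in the conversation, and real HTML may well tokenize less efficiently.

```python
# Rough token estimate for a large HTML page.
# ASSUMPTION: ~4 bytes of text per token on average. This is a rule of thumb,
# not a measurement; markup-heavy HTML can produce more tokens per byte.
BYTES_PER_TOKEN = 4.0  # hypothetical average


def estimate_tokens(size_bytes: int, bytes_per_token: float = BYTES_PER_TOKEN) -> int:
    """Return an approximate token count for `size_bytes` of text."""
    return round(size_bytes / bytes_per_token)


html_size = 5 * 1024 * 1024  # the ~5 MB page mentioned in the conversation
tokens = estimate_tokens(html_size)
print(f"{html_size:,} bytes is roughly {tokens:,} tokens")
```

Under that assumption, a 5 MB page already lands above the million-token mark before any prompt or instructions are added.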

Starting point is 00:12:04 Yeah. Did you look into making your own tokenizer? That's something that we explored. I think what we found more realistic was to actually parse the HTML into a more token-friendly format. So this way we can still build on top of existing models. But yeah, we are exploring that as well. But back to the point.
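As an illustration of what converting HTML into a more token-friendly format could look like, here is a minimal sketch using only Python's standard-library `html.parser`. It is not UIlicious's actual pipeline; the class name, the function name, and the choice of which attributes to keep are all hypothetical.

```python
from html.parser import HTMLParser


class HtmlSimplifier(HTMLParser):
    """Drops <script>/<style> content and most attributes; keeps tags and text."""

    SKIP_TAGS = {"script", "style"}
    KEEP_ATTRS = {"id", "name", "href", "type"}  # hypothetical selection

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {key}="{value}"' for key, value in attrs
            if key in self.KEEP_ATTRS and value
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.out.append(text)


def simplify_html(html: str) -> str:
    """Return a compact version of `html` with scripts, styles, and noisy attributes removed."""
    parser = HtmlSimplifier()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = (
    '<div class="x" id="cart"><style>.a{}</style><script>var a=1;</script>'
    '<button id="add" onclick="go()">Add to cart</button></div>'
)
print(simplify_html(page))
```

The trade-off in a sketch like this is that anything stripped out (classes, inline handlers) is invisible to the model, so which attributes to keep depends on what the tests need to target.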

Starting point is 00:12:26 So the key thing for me at that point, and subsequently, I think I showed you the experiments with the English compiler and things like that, right? AI agents generating code. You also have your own smol dev. What's that? The context size is a real problem. And the transformer inherently, by nature, at least the vanilla transformer (I know there's Transformer-XL and some other attempts), scales quadratically with the context size.

Starting point is 00:12:58 So if we scale to, let's say, 100,000, that's already requiring a shit ton of compute and VRAM, and I don't even want to imagine what happens at 1 million or 10. And that's where I was like, okay, this is a fundamental problem that needs to be changed. If not, we will not go past this. And I think there's also now a lot of people who are very interested in models that can handle large context sizes, because they also want to be able to use them in use cases where they never need to fine-tune, because fine-tuning is a pain, apparently. Yes. Okay, well, there's issues with just throwing everything in context, right? It's shown that retrieval is only best when the item that's relevant is in front or in the back of the context window.
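The quadratic scaling with context size described above can be made concrete with a small counting sketch. It only counts token-to-token interactions for full causal self-attention versus one update per token for a recurrent model; it ignores hidden size, layer count, and constant factors, so the numbers are illustrative rather than a real cost model. The function names are made up for this example.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Token-to-token interactions in full causal self-attention: n(n+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


def recurrent_step_count(n_tokens: int) -> int:
    """A recurrent model performs one state update per token."""
    return n_tokens


for n in (4_096, 100_000, 1_000_000):
    ratio = attention_pair_count(n) / recurrent_step_count(n)
    print(f"{n:>9,} tokens: attention needs about {ratio:,.0f}x more interactions")
```

Going from 100,000 to 1,000,000 tokens multiplies the attention count by roughly 100 but the recurrent count by only 10, which is the gap the conversation is pointing at.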

Starting point is 00:13:46 So basically I'm just like, maybe we've just tapped out. Context is working memory, and maybe Transformers are very similar to humans in that. Our working memory is only of a given size. If you try to artificially extend it, you just make it very lossy. Yeah. So that's where I ended up landing on the RWKV model, because in that sense, right, one thing that I always found very weird for transformers, but I mean it's by design, is that as you infer each token

Starting point is 00:14:15 you are recomputing everything up to that point, right? That's the quadratic part. And while you're mentioning the working memory problem, in theory, with enough attention heads on it, and people seem to be trying to cram more and more attention heads into the process, it could scale that way. Ignoring compute costs. Okay. And ignoring compute costs is just, like, very liberal.

Starting point is 00:14:41 That's just throwing as many H100s at it. It doesn't make sense. But RWKV, because it is still fundamentally a neural network at its core, ends up scaling linearly as it goes through the tokens. It will still suffer from the memory issue. So within RWKV, we do measure two separate things. One, we call it the perfect memory. So in the model, we have only a certain amount of capacity

Starting point is 00:15:08 where it can remember things perfectly, just like humans. And then beyond that, that is where it will start to discard things from its perfect memory. Right. And I felt that this was actually a lot more in line with our goals commercially. And also what I felt was more useful in the long run, because it's cheaper compute and it could be potentially parallelizable for a real long run. Right. So we're going to go into the RWKV paper in a bit.
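To illustrate the general idea of a fixed-capacity, lossy memory that costs the same per token no matter how long the sequence is, here is a toy recurrent update. This is not the RWKV architecture; the `decay` value, the two-channel state, and the function names are assumptions chosen only to show that the state never grows and that old information fades.

```python
from typing import List


def step(state: List[float], x: List[float], decay: float) -> List[float]:
    """One recurrent update: the old state fades by `decay`, the new input is added."""
    return [decay * s + xi for s, xi in zip(state, x)]


def run(inputs: List[List[float]], decay: float = 0.9) -> List[float]:
    """Process a sequence with one constant-cost update per token (linear overall)."""
    state = [0.0] * len(inputs[0])
    for x in inputs:
        state = step(state, x, decay)
    return state


# A "fact" appears only in the first token; the next 50 tokens carry nothing.
sequence = [[1.0, 0.0]] + [[0.0, 0.0]] * 50
final_state = run(sequence)
print(len(final_state), final_state[0])  # state size is unchanged; the old fact has faded
```

The state stays at two numbers regardless of sequence length, and the contribution of the first token shrinks by a factor of `decay` at every step, which is one simple way a recurrent model ends up with bounded, imperfect memory.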

Starting point is 00:15:37 But one thing I wanted to ask, you kind of glossed over how you found it in the first place. Because you're not a researcher. You're not like, I don't imagine you're like reading papers every day or something. Until recently. Until recently. How do you find it? How do you find it? How do you know like this is the one to bet on versus there's a bunch of other alternatives, right?

Starting point is 00:15:56 I think it was rather quick, after I concluded that the Transformer, as it is, will not scale to 10 million tokens. Okay. And so, by the way, you mentioned Transformer-XL. We also did an episode on Flash Attention, which helps to make part of it sublinear at least. Yeah, but that is like way, way after I already dived into RWKV. So history-wise, at that point in time, Transformer-XL, we are talking about like the,

Starting point is 00:16:26 the time when 4K was the limit that everyone knew. Right. And this was last year. I mean, just to set context. Okay. Okay. And then, yeah, so you were just kind of searching around, and you found RWKV. Presumably, like, did you go straight into the Discord? Was it, like, primarily a GitHub repo?

Starting point is 00:16:45 Like, what was it? Because as far as I can tell, there was no paper until maybe about two months ago. Oh, and I talked about it before the paper, right? Yes. So you found it before they did any publicity, which is weird. It's not normal. So what happened? What did you do?

Starting point is 00:17:03 So what I did, okay, so it was basically, I believe, a mixture of things, because I was searching GitHub, I was searching forums, other Discords, and also blogs, actually. I was just getting all the lists, because everyone was just creating lists of lists, right? Yeah, yeah. And I believe you also have a list of lists somewhere. Yeah, but mine is very, so I would consider myself very trad in the sense that I would just follow the large model labs, whereas the kind of list that you have to follow in order to get to something like RWKV before they've done any publicity is the non-trad, like, you know, the kind of people that

Starting point is 00:17:41 are working on those Hermes, Wizard, you know, like no credentials, I don't even know who the hell they are, but they're just working on it. Oh, this is all from memory, and I might be hallucinating this because there were too many lists, but I believe the list that actually brought me to RWKV was about how, beyond OpenAI's models, and beyond ChatGPT and Claude, the two big models, right, outside of the English-speaking nations, a lot of the open-source models really fall flat. And that is why, when you actually go through lists for doing things in other languages, RWKV actually stood out at that point. And just on the basic premise, and we're not even talking about architectural advantages, it's just the basic premise that they included datasets in other languages

Starting point is 00:18:31 in the training data. Yeah. And... Was that a... Because, I mean, I imagine 99% of your customers are English. Yeah. Was that really a driver for you? It wasn't a driver, but...

Starting point is 00:18:40 Are you just trying to explain it? Yeah, that's how I landed onto, like, all these blogs and technical stuff. And can you say, when you say fall flat, the main one that I know about is there's a tokenizer penalty for non-English. Yeah, that's that. Right? So, like, Chinese is up to... Chinese or Japanese or Thai or something.

Starting point is 00:18:55 It's like 16 times the number of tokens for a typical English sentence. Yeah, but even before that, right? Because, I mean, I think you understand, a lot of community users, they want to not use the commercial APIs. Okay. So they try to find open source models. Yes. And we'll talk about the not-safe-for-work people. I really want to, because you've actually talked to them. I have never talked to these people. But when I discovered them, they are a huge community. They're extremely passionate. And they're actually good. Yeah, they're really good. They're good at this. So let's talk about them, right? Yeah, we can talk about it later. So they don't want to use the commercial models, and they want to use the open source model,

Starting point is 00:19:35 and there is a tokenizer penalty, which is true. But I think on the more fundamental basis, right, if we look through the datasets, and this is also partially important because of the way we set up evals: all evals are written in English, or at least the majority of them. And if we are racing towards building AI models, at least right now, as you see all the companies as they build their open source model and they just want to narrowly

Starting point is 00:20:00 focus on the evals, adding in a foreign dataset is actually a loss. Because once you are below a certain param count, so we're talking about 7B and 14B, the more you add that's not in line with your evals, the more you'll degrade.

Starting point is 00:20:15 and they just exclude it so the model just the priority is English yeah I get it the model just fundamentally so what's the trade-off like I mean okay so English and Chinese or, you know, there's all these other languages. What do you pick?

Starting point is 00:20:31 So, RWKV started... Also, for context, the main person leading the RWKV project, Blink, is from China. So he naturally has an interest in making sure it supports Chinese. Yes. Yeah. So English. And there are a fair amount of bilingual models, essentially, that are English and Chinese, from the major universities in China.

Starting point is 00:20:50 So we started from basically English, Chinese, Japanese, Korean. Frankly, this is in large part because there were fans in those communities that came on board, and then subsequently we tried to onboard other languages as well. Yeah, but these people are, again, not researchers, no money, like training on their home GPU lab or whatever, right? Partially true, but also how I see it work out for a lot of the other languages is that we have the foundation model, and this is the foundation model where we just kind of say, evals be damned. Let's just make sure to include all the other languages. Okay.

Starting point is 00:21:27 And when we included the other languages, right, the model works for the most part for the other languages. Subsequently, these individuals who wanted to use these models for their respective use cases will then fine-tune respectively. Because it's easier to fine-tune in another language for your use case than to, I mean, this is classic fine-tuning, than to train the language from scratch. And I think more recently, and this model is not 100% trained yet, RWKV has released what we call the World model, where we go the next step of even including all the translation datasets that we can find,

Starting point is 00:22:12 even for minority languages that people ask for in our Discord. Because the goal for them, the long-term goal for us, at least internally, is that we wanted an AI model for everyone, and everyone does not mean the USA, it means the world. So there are a lot of languages in there. Is it Asia biased or, you know,

Starting point is 00:22:32 give me a sense. It's probably, no offence, it's probably still going to be US biased in terms of knowledge, because what we are doing is still the Pile and RedPajama for the knowledge. But in terms of language, we add all the other languages'

Starting point is 00:22:47 Wiki and translation sets. So it's hard. I mean, we haven't fully evaluated the bias here, but I'm quite sure that when knowledge is still disproportionately within the English universe, there's a bias there. But frankly, we are still at the stage where we can support the other languages. Yeah. And I think I mentioned this, this is one of the interesting parallels that sometimes

Starting point is 00:23:10 I have, right, is that I can be in the EleutherAI forums and all that, and then we're talking about alignment, and we're talking about it in very big terms. Which is, yeah, very keen on safety and all that, which is great, but like, it's not your role as the RWKV community. Yeah, and when you talk to members of the community that came on board, they're like, oh, I want to get this to work for Korean, Japanese, Thai, Arabic languages and so on. So they just want something that works. Yes.

Starting point is 00:23:40 They don't want it to be, they're not after the big model that does everything, they just want something that they can play with in their language. And that was very important to them. Yeah. And these are literally just hackers. Literally just hackers doing it for personal enjoyment. Correct. Not yet for work. Or maybe some of them for work.

Starting point is 00:23:58 You don't know? We don't know. I mean the core character AI category, there's quite a number of them using it for that. So professionally. Professionally. Okay. As in they run character companies. Yeah.

Starting point is 00:24:13 Let's call it. Should we pause here and then I'll switch to the screen? Sure, sure. Okay. All right. So we have it pulled up. We are going to screen share for the bulk of this. If you're listening on audio, it might be a good time to switch to the YouTube channel.

Starting point is 00:24:24 So we're just going to start with an intro. What is RWKV? So RWKV is a modern recurrent neural network with transformer-like level of LLM performance, which can be trained in a transformer mode. And this part has already been benchmarked against GPT-NeoX in the paper, and it has similar training performance compared to transformer models of the same dataset and param count. So specifically the GPT-NeoX model. So the key thing is that even though it's matching in performance, while trading

Starting point is 00:24:56 blows with GPT-NeoX, it's doing all this without attention layers. And in the process, it actually has a substantially lower compute cost based on its design, and also because it's a neural network, and we'll dive into later why that's substantially lower in both training and inference. And this is back to what I mentioned previously: the transformer, traditionally, until we found out about Transformer-XL and things like that, tends to scale quadratically based on the context size. And this applies not just in inference, but in training. And due to how this is still a neural network at heart,

Starting point is 00:25:33 even though it can train like a transformer, it's able to do so much more efficiently and faster, especially when you hit context sizes of 8K, 16K and above. And once you compare quadratic and linear, the differences start to go crazy once you scale the numbers up. And that was the main benefit of the RWKV model, per se. There were a few prominent researchers who, when they actually reviewed the arXiv paper when it came out, did highlight an important question of like, is this evidence that, literally,

Starting point is 00:26:06 maybe all that really matters is that we need a large data set and a scalable model. That makes sense obviously to some approximation, but you are still using attention? No, we don't use attention inside. Okay, yeah, maybe let's rewind a little bit. Oh, specifically attention as you understood it. Yeah. Okay.

Starting point is 00:26:30 Tell us more. So we use weighted receptance, and if there's any diagrams I should pull out, let me know. Oh, okay. Okay, so we are using AFT. So this is the Attention Free Transformer, and this paper was written by Apple. What the hell is an attention free transformer? Okay, this is unusual. Yeah.

Starting point is 00:26:52 We use the weighted receptance weights and we compute over it. And in essence, this is like the classic stacking more layers: once you stack enough on top of it, you don't really need attention, once you have enough weights and layers stacked on it. Okay. I don't know whether we want to go into the deep dive on AFT. Sure. But that's interesting.

Starting point is 00:27:19 I've never heard of this paper. Yeah. So this was written by Apple. And subsequently, Blink, the creator of RWKV, took this and applied it to a language model and scaled it up.

Starting point is 00:27:32 Right. And that is how we landed on RWKB that doesn't use attention. So sometimes within the community, we use the word light attention, because what happens is that these layers and these weights will still play the role

Starting point is 00:27:48 of attention. I was going to say, you end up approximating attention. Exactly. So it ends up looking at the other parts of the memory and then applying it to the output. And the key benefit is that, because, remember, the attention model is a multi-head part, it will need to scan all the tokens back and forth. This removes that requirement, and hence it reduces the overall compute count. I might be jumping back and forth quite a bit, but that's one of the key essences of the WKV segments.
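For readers who want to see how an attention-like weighted average can be computed without scanning back over every token, here is a single-channel sketch of the WKV computation as formulated in the RWKV-4 paper, written from that published formula rather than taken from the official code. The variable names (`k`, `v`, a decay `w`, a current-token bonus `u`) follow the paper; the two functions and the sample values are hypothetical, and the sketch omits the numerical-stability tricks a real implementation needs.

```python
import math
from typing import List


def wkv_direct(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Reference form: at each step, re-scan all earlier tokens (quadratic overall)."""
    out = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]  # current token gets the bonus u
        den = math.exp(u + k[t])
        for i in range(t):               # earlier tokens decay with distance
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Same result from two running sums, one constant-cost update per token."""
    a = 0.0  # running decayed sum of exp(k_i) * v_i
    b = 0.0  # running decayed sum of exp(k_i)
    out = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)
        out.append((a + bonus * vt) / (b + bonus))
        a = math.exp(-w) * a + math.exp(kt) * vt
        b = math.exp(-w) * b + math.exp(kt)
    return out


keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5]
print(wkv_direct(keys, values, w=0.5, u=0.2))
print(wkv_recurrent(keys, values, w=0.5, u=0.2))
```

Both functions produce the same outputs; the recurrent one only carries two numbers per channel between tokens, which is why inference cost grows linearly with sequence length instead of quadratically.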

Starting point is 00:28:14 And we call it light attention. I, and this is the part where I would disagree with the RWKB community in some parts. I think that was a bad name. name. Because it's cute. Why is it a bad name? Because when the RWKV paper came out, right, and then we talked about, like, we use this and we call it light attention, but by design, it's really nothing like your existing

Starting point is 00:28:40 attention head models. And it ended up sidetracking the Hacker News debate: in one corner, it's like, no, this is technically attention, approximately attention, and then another group is like, no, this is not attention. I see. But I'm like, propose a better name, because I have no idea what to call it. Okay. What else should people know?

Starting point is 00:29:01 Maybe we can explain what R, W, K, and V stand for? Receptance Weighted Key Value. Okay. Yeah. And each of these are like actual things that you model in the code, right? Correct. So we can go into that. Which, attention historically is like query, key, value.

Starting point is 00:29:16 Correct. Okay. So do you want to jump straight into the layer architecture? Should we cover something else first? Anything like high level? High level, okay, there's a 7B, there's a 14B, what are the assets or the artifacts? Okay, so before we go into the nitty-gritty

Starting point is 00:29:34 of how the layering and everything works. On the high level, right, currently RWKV architecturally, as a model, what we have already proven is that it can be scaled and trained like a transformer. How we do so, we'll cover later, and this can be scaled to as many parameters as we want. Currently, what we have is dominant.

Starting point is 00:29:54 Our main models are the 7B model and the 14B model, which you can find on Hugging Face or in our respective demos. We also have the RWKV Raven models. These are instruction tuned. Okay, so there's World, there's Raven, there's music. Oh my God, there's novel. What is all this? Okay, so the current main models are RWKV-4

Starting point is 00:30:21 Pile and Raven. So Pile is basically just a Pile-plus model. What is Pile plus? I know about the Pile, but what is Pile plus? Random datasets that the community added to the Pile. How many tokens was that? I would say roughly 1.1 or 1.2 times the Pile. Okay.

Starting point is 00:30:40 Yeah. This is not instruction tuned and stuff. Yeah, the plus part is typically all the other languages. Subsequently, the Raven models are the instruction-tuned ones. These are the current main complete models. We subsequently have... And the instruction datasets are from... Typically GPT-4, but then we scrub it and remove all the...

Starting point is 00:31:04 "As a large..." So, yeah, this would be the uncensored one. There's some other project that's kind of doing something similar, and they call it uncensored, but really they just scrubbed the "As a large..." Correct. So that makes it technically breaking the TOS of OpenAI, right? Yeah. Okay, but yeah.

Starting point is 00:31:21 But that's a, I mean... That's a later problem. Frankly, let's be honest, even if we don't remove it, someone is going to remove it. I mean, so there's ways around this, which is you get clean datasets that are not GPT-4. So the one that I typically mention is Yannic Kilcher's Open Assistant. I believe that was included subject to NACA as well.

Starting point is 00:31:43 Yeah, obviously all these release orders are all over the place. Yeah. So, okay, Raven, World. So Raven is the instruction-tuned one. And then subsequently, the World model is a new model that we are training. It's not 100% complete yet. With the focus on a new tokenizer and all the languages. So what does that mean?

Starting point is 00:32:03 All the languages. All the languages that we can grab from the internet. All the wikis in all the respective languages. Like what do you mean when you say all languages? 100 languages. Okay, fine. So 100 languages. It wasn't really a very precise science.

Starting point is 00:32:16 We just basically took whatever the wiki tool allows us to download, the X-wiki languages. If it works, it's in the set. If it doesn't work, skip. And all the major prominent OSCAR translation sets. So as you can see, the Pile, RedPajama. What is OSCAR? OSCAR is just a common term that we use in,

Starting point is 00:32:37 and you can just search OSCAR in Hugging Face Datasets. And it just means translations. Okay. So you can find like English-X pairs. I see. Yeah, all the respective pairs. Okay. So, and then all challenging.

Starting point is 00:32:49 Did I can find? Okay. So 70% English, 15% multi-lang, 15% code. Is there a strong grounding for why 15% code? No. It was just, it was already there. Yeah. So the focus of the world model was not to improve everything else.

Starting point is 00:33:05 It was literally that 15% multi-lang. We wanted to increase. It was English and code and then you just added multi-lang. Yeah, we had fair bit of multi-lang, but we wanted to bump it up. Right. So this is primarily English? Whatever. Okay.

Starting point is 00:33:19 Yeah. What I would like is basically like a visual of like, here's all the building blocks and here's how they combine to create all these things. So we have the RWKV architecture code. So that's the main model building block, and basically we feed it the data. Pile plus, RedPajama, and subsequently some of the code data. For the World model, we subsequently add on top of that all the translation OSCAR sets and so on. And so you're training these things. You've mentioned that you're intentionally taking a hit on evals,

Starting point is 00:33:49 on traditional evals, like MMLU or whatever? I wouldn't say intentionally. Also to clarify, like, I am not training it. I'm just part of the community. The community and Blink are the ones training. But I would say it's more of like the lack of care for the evals. So the reason why we add things to the dataset was never about improving evals.

Starting point is 00:34:10 It's directly in response to user feedback. It's like, oh, it's not good enough at this. So then, okay, just add it. Yes, literally. Along those lines, take, for example, right, even for Raven and the World model, as we go through the training stages, right, we specifically ask people of other nationalities within our Discord community to test it for their language. And our informal rule is that the only person who can decide whether this improved World model is better in Japanese or Thai or whatever it is, is a native speaker. Where does it take place?

Starting point is 00:34:51 So it's mostly within the linguistics channel, but sometimes we do a shout-out in general as well. Okay, linguistics. Yep. So do you have like an appointed ambassador, like you have a hundred languages? Yeah. You just have like a czar of Japanese, a czar of Thai. It's not so appointed, it's more of like, hey, this is the Japanese model, please try. It's not...

Starting point is 00:35:16 There's no the Japanese model. There's one model. There's a world model. So if you go to world model, I don't know whether it's inside here. No, four, sorry. Five is, you should never put five from top because five is fully experimental. So under file semblance. I see, I see, yes, yes.

Starting point is 00:35:33 So you see there's Japanese specific tune, Chinese, Arabic. Then for all the other smaller languages, we actually asked them from the base world model. Yeah. A bit itself, so feedback on that. So we actually released previously like 10% train, 15%, 20%, like as it goes through the stages and then it's like, hey, is this working? Is it regressing? So it's like evals, but real... Done by real humans and not systematically.

Starting point is 00:36:02 Is there a reason that you release these? So you mentioned 7B, 14B, but I see also 0.1B, 0.4B, 3B, 1.5B. Is that useful for people or is it just for research or... 0.1 and 0.4 are frankly more for research. But some people do try to make use of them, nothing stopping them. Well, I mean, it's extra, like these are just different architectures, different dimensions. Yeah. So it's actually extra cost to you to provide these things.

Starting point is 00:36:29 Oh, but specifically for the World model, because we are trying a new tokenizer, and the reason why we're trying a new tokenizer is that one thing that we found, more like I found surprisingly frustrating in the existing tokenizer, was that it was very English-centric. And the existing tokenizer you took from GPT-NeoX? Yeah. And I need to backtrack a little bit, just for people who are not following along. GPT-J was the original Eleuther reproduction of GPT-3. And GPT-NeoX was the bigger GPT-J?

Starting point is 00:37:05 Yeah, you can pretty much say that. 20B, something like that. Yeah, I do believe there between me more, though. And there's a cheat for, I mean, for those outside of the open source space, in particular for the transformer. I think one thing significant for GPT-NeoX was that it was one of the major models that had everything fully documented, including why they made each change in the architecture and so on and so forth. And that became basically a reference note for all other subsequent open source models

Starting point is 00:37:35 because they were the early ones that were doing a good transformer model, at least for the large language model. So, GPT-2 was actually open source. People didn't find that useful? No, people do reference it as well, but it's like the code is there. And why did you do this? Oh, I see. It's not documented.

Starting point is 00:37:59 I see, yes. So in that sense, was OPT from Facebook useful? Because I've heard very good things about the logbook of OPT, where they had the daily logbook and they just published that. Yeah, those were useful as well. Yeah, okay. I think one thing that NeoX had going for it, especially with the Eleuther community, is that it's not just a logbook; you could just go to them and ask, hey, why did you do this?

Starting point is 00:38:23 Right. And the person who trained it will tell you. Yep, someone there, okay, might? Hopefully. One off the self. So that's why we had the 0.1 and 0.4 models, because we were just in uncharted waters here. So a lot of existing tokenizers take the space as a major delimiter to detect and split, and the tokenizer we are using is actually a lot more simplified.

Starting point is 00:38:46 So the existing tokenizers scan all the text and build a statistical model of what pairs well with what, and so on and so forth. We took a similar approach, but instead of using "this token pairs well with this and should be paired with that", we just made it a trie. So basically, the trie data structure. Yeah, so we just find the longest matching string that we have trained inside our token list, and then we just use that token. It's a drastically simplified tokenizer and it doesn't use spaces as an assumption which

Starting point is 00:39:24 I know. Which is good. Yeah. And that helps a lot with Japanese, Chinese and other character-based languages. They don't have spaces. And I would even argue, it's fair to say, if you look at the really large models, be it OpenAI or Claude, right, tokenizers are not really a thing.

Starting point is 00:39:45 I mean, in the sense that the model can work even if you feed it character by character. It may be inefficient. Someone tried. I mean, there was that jailbreak where, you know, in the system prompt you put a character, then enter, enter, enter. You remember that jailbreak? No, I didn't see that one. Yeah, so you can literally, instead of left to right, go top to down. Okay.

Starting point is 00:40:08 And you're just eating tokens for every character. Actually you're eating two, because there's also the newline. And the model understood it because there's enough dumb data on the internet that it has learned how to deal with this kind of formatting. Got it, okay. And if these models are already understanding things at the character level, everything else is just improved compute.
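
The longest-match tokenizer described a moment ago can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual RWKV World tokenizer code: the `Trie` class, the `tokenize` function, the `<end>` marker and the toy vocabulary are all made up for this sketch, and falling back to a single raw character for text outside the vocabulary is an assumption.

```python
class Trie:
    """Minimal trie over vocabulary strings (hypothetical helper for this sketch)."""

    def __init__(self, vocab):
        self.root = {}
        for token in vocab:
            node = self.root
            for ch in token:
                node = node.setdefault(ch, {})
            node["<end>"] = token  # marks that a full token ends at this node

    def longest_match(self, text, start):
        """Return the longest vocabulary token that begins at text[start], or None."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "<end>" in node:
                best = node["<end>"]
        return best


def tokenize(text, trie):
    """Greedy longest-match tokenization; spaces get no special treatment."""
    tokens, pos = [], 0
    while pos < len(text):
        match = trie.longest_match(text, pos)
        if match is None:
            match = text[pos]  # assumed fallback: emit the raw character
        tokens.append(match)
        pos += len(match)
    return tokens


# Toy vocabulary, made up for illustration only.
vocab = ["a", "ab", "abc", "b", "c", " ", "日本", "日本語", "語"]
trie = Trie(vocab)
print(tokenize("abcab 日本語", trie))  # ['abc', 'ab', ' ', '日本語']
```

Because the walk is purely character by character, text without spaces (such as the Japanese sample) is handled the same way as English text.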

Starting point is 00:40:31 Okay. Because we jump multiple tokens. Do you have any idea of your dictionary size when you use this trie data structure? Yeah. Because the typical tokenizer is like 80,000 tokens, dictionary size. I presume you'll be bigger. Yeah, I can't remember offhand.

Starting point is 00:40:47 Our previous tokenizer, the NeoX tokenizer, is around 50,000. Then subsequently, I believe this is around the same size. It's not bad. Yeah, pretty good. We didn't want to change too much on that size, we just wanted to change the format. Yeah, cool. I actually kind of want to establish the credentials of this thing. So who is Blink?

Starting point is 00:41:06 Is he a random person on the internet? Again, I never heard of this guy until he published. Is this his real name? And, like, I have this paper to work with, but it was only published in May. Yeah. You found this before the paper. And so I think it's very unusual for a researcher to effectively launch to the wider public without a paper, and just get some kind of pretty decent community going, and then publish the paper.

Starting point is 00:41:38 I think a few years back, with GPT-2, transformers started to pick up steam. And I guess the whole world started to think, let's just abandon neural networks. So we haven't even gone into the code part, but the main reason why neural networks were bad compared to transformers was that

Starting point is 00:41:55 when you train, let's say you just input a token and train on a token for a data sample, you have to wait for the compute to finish for that token, take the state, and then you train the next token. And we'll get into how RWKV solves that, but basically the whole world at that point just concluded, yeah, neural networks cannot scale as well as transformers, let us abandon it

Starting point is 00:42:15 and everyone just went in that direction. And Blink, or Bo Peng, his actual name, decided basically as an individual, literally at the EleutherAI forum, that hey, I think we can modify recurrent neural networks, based on the Apple paper, the light attention that I showed previously,

Starting point is 00:42:35 to scale this up, to make neural networks scalable and parallelizable in the same way transformers work. Because the reason why we branched away and focused on transformers is because neural networks were slow to train. It was never, I mean, it wasn't so much about

Starting point is 00:42:54 whether it was good or not. It was just, no one wants to wait 100 years for their training to finish, even if they can throw a GPU farm at it. And that's where he started looking into how to make the neural networks trainable in parallel.

Starting point is 00:43:11 And specifically RNNs? Yes. And subsequently EleutherAI, and I believe there were also a few others, because he was doing it very publicly there, came on board to sponsor the GPU compute required, because even though I mentioned that

Starting point is 00:43:27 on large context size, it is substantially cheaper. I think especially if you run an open source discord forum for an AI model, every day there'll be someone who thinks that they can train a 20B model on a single GPU coming in.

Starting point is 00:43:44 The scale is still large. Even though it's like one-fifth or one-tenth compared to a transformer, it still needs a large GPU. So that's where EleutherAI and the rest, Stability I believe is also involved, stepped up and donated the A100s

Starting point is 00:43:59 needed to train the base models that RWKV had. And so before those models were trained, we only had, in theory,

Starting point is 00:44:10 the toy models or the smaller models that this can match transformer we have no idea whether it can match transformer at that scale

Starting point is 00:44:18 Yeah. And subsequently, with the larger models, the 14B models and all that, we can compare it directly with the NeoX model,

Starting point is 00:44:26 and that's where this paper came out. So that's the history behind it: he wasn't really doing it in silence, he was doing it from Eleuther, then he branched out.

Starting point is 00:44:40 Because this became a big project on its own, and that's where other people started coming in. So the part where we say that RWKV is a neural network that can be scaled, can be rolled out like a transformer, right? The key thing that you want to see, right, is this diagram here. This should be in the paper. So what do you get? So when you are running in inference mode, ideally you should run it as a neural network. So this is a layer.

Starting point is 00:45:12 So with classic neural networks, you have a state, the state could start from blank, you process a token, you output a state, and then you rinse and repeat, and as it keeps producing output, it makes a prediction. So subsequently for RWKV, what happens here, right,

Starting point is 00:45:32 is that we can roll out this neural network side by side, and then it runs similar to a transformer. The key thing here is that the states are split across the layers. So in this diagram here specifically, this is what we call the time mix and channel mix. These are operations within the layer. Depending on how you want to view it,

Starting point is 00:45:51 you could view these as individual layers, or, as we view it, this collection of layers as one layer block. And each layer block passes its state to its sibling down the road as you process the next token, which is a similar RNN-type feature. Correct. However, the key thing is you do not need to wait for the upper layers to complete before you can go to the next token. So what happens in practice, if you jump to the diagram like this,

Starting point is 00:46:23 this graphic here, this is not 100% how it runs behind the scenes. I like it. Yeah, whoever put time into this, kudos. I made it. So this is how you can visualize it. So the first layer is the layer norm. This is standard layer normalization; it just doesn't need to wait for the other layers. But if you notice, right, subsequently to the right and to the top, these blocks need to wait for the blocks on the left.

Starting point is 00:46:54 And once you go past the first few tokens, right, this cascades very rapidly, especially since this is only like one, two, three, four layers. Most models have like 20, 40 plus layers. And the cascading patterns keep happening. And in practice, once you start cascading there, you just saturate the GPU. And that's how it starts being parallelizable to train. You no longer need to train in slices like traditional RNNs. That was one of the key things.
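
The cascading pattern described above can be pictured as a dependency schedule. A minimal sketch, assuming each block (layer, token) only waits for the block below it (previous layer, same token) and the block to its left (same layer, previous token). The function name `wavefront_schedule` is made up for this illustration, and real GPU kernels do not literally schedule work this way.

```python
def wavefront_schedule(num_layers, num_tokens):
    """Group (layer, token) blocks by the earliest step at which they can run."""
    steps = {}
    for layer in range(num_layers):
        for token in range(num_tokens):
            # A block is ready one step after both of its dependencies,
            # which works out to step = layer + token.
            steps.setdefault(layer + token, []).append((layer, token))
    return [steps[s] for s in sorted(steps)]


schedule = wavefront_schedule(num_layers=4, num_tokens=8)
for step, blocks in enumerate(schedule):
    print(step, len(blocks), blocks)

sequential_steps = 4 * 8        # one block at a time
parallel_steps = len(schedule)  # num_layers + num_tokens - 1
print(sequential_steps, parallel_steps)
```

After the first few steps, every step has as many independent blocks as there are layers, which is the point at which the speakers say the GPU becomes saturated.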

Starting point is 00:47:22 What else is the key thing? So I think you're familiar with LSTM, right? This is how traditional neural networks keep things within memory. In RWKV, we have two channels. We call them the channel mix and the time mix respectively.

Starting point is 00:47:38 Is there a formal definition of channel mix and time mix? Yeah, you can see the data from the respective time mix and channel mix move to the next to the next segment how time mix is designed per se was that it's how it retains

Starting point is 00:47:53 similar to LSTMs, right, where it processes a state and the input, and it may decide to discard certain states and keep new things in the state. Time mix does the same thing but with a different formula. So it replaces the LSTM in that sense, and it can decide to keep things indefinitely. So this represents the long-term memories, if you want to view it that way. But classically the problem with that is that it struggles with long distance. Correct.

Starting point is 00:48:20 This has the same issue. It struggles with long distance because it also needs to keep track of both near-term memory and long-term memory. So you split it up. Yeah, effectively split up. So channel mix is the perfect memory? Yeah, this is closer to the perfect memory, and it's the short term. So, time mix has trainable weights on what it decides to keep and discard.
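
To make the "keep or discard" idea concrete, here is a heavily simplified, single-channel sketch of a decaying weighted average in the style of the WKV operator as it is commonly written for RWKV-4. Everything here is an assumption for illustration: the names `wkv_recurrent`, `wkv_direct`, `w` (a learned decay) and `u` (a bonus for the current token) are not taken from the conversation, and the receptance gate, token shift and numerical-stability tricks are left out. The point is only that the same output can be computed either step by step with a small running state (RNN style) or by looking back over all previous tokens (transformer style).

```python
import math


def wkv_recurrent(ks, vs, w, u):
    """RNN-style: carry only two running numbers (num, den) between tokens."""
    num, den, out = 0.0, 0.0, []
    decay = math.exp(-w)  # larger w means older tokens are forgotten faster
    for k, v in zip(ks, vs):
        cur = math.exp(u + k)  # weight given to the current token
        out.append((num + cur * v) / (den + cur))
        # Fade the old state, then fold the current token into it.
        num = decay * num + math.exp(k) * v
        den = decay * den + math.exp(k)
    return out


def wkv_direct(ks, vs, w, u):
    """Lookback-style: recompute the weighted average over all earlier tokens."""
    out = []
    for t in range(len(ks)):
        cur = math.exp(u + ks[t])
        num, den = cur * vs[t], cur
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out


ks = [0.1, -0.3, 0.5, 0.0]
vs = [1.0, 2.0, -1.0, 0.5]
print(wkv_recurrent(ks, vs, w=0.5, u=0.2))
print(wkv_direct(ks, vs, w=0.5, u=0.2))
```

Both functions print the same values up to floating-point error. The recurrent form does constant work per token, while the direct form does work proportional to the position of the token.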

Starting point is 00:48:45 Okay. Channel mix has a very strong bias towards just the next token. So subsequently, as memories are stored in the lower layers, it just slowly shifts upwards through the channel mix. And this is the short-term memory, which at some point, as it shifts all the way up, will just disappear into the void. At that point, subsequently, time mix should be retaining the longer-term memory. So we took a break for a bit, but now we're trying to cover, like, what is the big aha moment

Starting point is 00:49:19 for you? And you said it was something to do with cost. Correct. So we have this chart on screen. There's literally a chart of quadratic scaling versus linear scaling in terms of GPU time spent in text generation. And you said it was at training time and at inference time. Just basically in everything that matters. So I mean, look back to how an RNN works from a high level. We do an O(1) operation on a token, create a state, O(1) operation, create a state. So this just scales linearly.

Starting point is 00:49:50 You want to throw a thousand tokens at it? On inference, it just scales linearly. Subsequently, for transformers, you take in a token, you process your first token. It may be O(1) here. Subsequently, when you generate your third token, you need to compute your second and first, and then it repeats, so when you do your 1000th token,

Starting point is 00:50:12 you need to compute back your 999 previous tokens. And as this keeps growing and growing, this is your quadratic scaling, and this is why we had this graph of the amount of cumulative GPU time that you need to spend to generate all these tokens respectively. And this is fundamentally just transformers versus neural networks. Yeah, on inference. And subsequently, neural networks did have the disadvantage of, let's say, not being able to parallelize during training, but as I covered, RWKV kind of solved that

Starting point is 00:50:47 by effectively splitting the layers, allowing you to train different parts in parallel. Some people will go into the academic debate of, like, technically the second and third token is not parallelizable until the first is done, but once you get into "I can saturate a GPU" land, it's just way better. It's just academic debate. We are done. And so training in essence has always been, I mean, this is both for transformers and neural networks: I need to do an inference pass, I look at the logits,

Starting point is 00:51:16 I then backprop to see what went wrong and I update the weights. Yeah. So the inference is the forward pass. It's part of the training process. As you backprop as well, needing to look only at the current token and the state, instead of everything, also reduces the amount of things that you need to backprop through.
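
The linear-versus-quadratic argument above comes down to simple counting. The sketch below counts abstract "units of work" while generating tokens one at a time. These are not real FLOPs or GPU seconds, both function names are made up for illustration, and the assumption is that an RNN step touches only its current state while a transformer step looks over every token seen so far.

```python
def rnn_generation_cost(num_tokens):
    """Each new token touches only the current state: one unit per token."""
    return sum(1 for _ in range(num_tokens))


def transformer_generation_cost(num_tokens):
    """Token t attends over all t tokens seen so far (itself included)."""
    return sum(t for t in range(1, num_tokens + 1))


for n in (10, 100, 1000):
    print(n, rnn_generation_cost(n), transformer_generation_cost(n))
```

For 1000 tokens this gives 1000 units for the RNN-style loop and 500500 for the lookback-style loop, which is the gap the cumulative GPU time chart is illustrating.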

Starting point is 00:51:35 So there are so many factors involved in reducing the overall inference and training time. And that was something that appealed to me, because in the long run, I mean, all of us want our models to just run blazingly fast, right? Yeah. And also on minimal hardware. Oh yes. Which, as far as I understand, you still have 14 billion parameters.

Starting point is 00:51:55 That's not going away. You still need the RAM to store 14 billion parameters work and stuff. That's not going away. Yeah. Okay. So RAM is unchanged. Yeah, on the RAM side, but the working memory is reduced. So typically you need more than 14 for transformer.

Starting point is 00:52:14 I mean, let's not touch quantization. But in this case, we don't need to keep. like if you really, really want to like save RAM, it is possible for you to do token by token inference so that you don't need to keep your states in history. You only need to keep your current token state and your next. Yeah. Yeah.

Starting point is 00:52:34 And yeah, that's actually one segment of our community. It's just purely porting RWKV to C++-based models. Oh, and the next. Yeah, and running it on Pis and stuff. Raspberry Pis. Yeah. It's interesting. There's a chart about performance, and it shows that RWKV is competitive or actually better in some of the reasoning challenges, which is something I definitely would look for, right?

Starting point is 00:53:00 Like, it's fine if your speed is faster and all that, but if the reasoning quality sucks, then it's not a very useful language model. Exactly. So this is like literally us saying there's no trade-offs. Yeah, you don't rule out in that process. Okay. Big question then: why isn't

Starting point is 00:53:19 RWKV a bigger deal right now? So, one, we are not a commercial organization. This is literally the pure open-source play. But you could have done the Stable Diffusion thing. Which, you know, Stable Diffusion launched. It was

Starting point is 00:53:35 by a bunch of nobodies before that. It literally split out from Eleuther. But they definitely had some hype. You know, I interviewed Sharif Shameem. The reason I asked you so many things about how you found out about RWKV?

Starting point is 00:53:49 Because I think the generalizable skill is how to be early in AI. Because being early in AI is very valuable. Because then you were there to see how things developed instead of like picking it up later like me. Anyway, so, yeah, why is it not a big deal? You don't need to be frank. Yeah. We just suck at marketing.

Starting point is 00:54:08 Okay. That's fair. I mean, this is part of it. Yeah, this is part of it. But I don't think that is entirely the cause. Yeah, I'm sure. Definitely, I think the other major segment right now as well is that we were

Starting point is 00:54:23 really late on the paper. Okay. One of the weirdest things right now, I feel, is that I think RWKV is starting to have its moment right now. Okay. Ever since that initial paper came out, there was RetNet, and I think there's a few more additional papers coming out, one from Microsoft, one from other organizations, that are literally exploring the whole idea once again of scalable neural networks. And they are citing RWKV as well.

Starting point is 00:54:53 And I think for most people it's, why switch to this model? Even though we have proven that yes, it's scalable to 7B and 14B, and that it can match transformers at similar params and training size, all this is very academic.

Starting point is 00:55:16 Because the community, right, the community at large, especially the English-speaking community, they don't really care about this. They care about what's the best model that I can run on my computer, at least within the open-source space. And even though we match in performance for things in the same dataset, the keyword is same dataset. Like, this benchmark is not even RedPajama yet. It's the Pile. And when you have models that are being trained on much larger datasets, especially for an English use case, it makes more sense to use those. I see, so there will be another paper coming that is RWKV trained on RedPajama.

Starting point is 00:56:02 And that will presumably be a larger dataset, yeah, and so on and so forth. So I think we are still in the stages of reaching that point where we train on the larger dataset. The only reason why we have an outsized impact compared to the other models is frankly because half of our Discord came in not for English, it's for other languages. Yeah, that's great. And there is a definite, very US and English-centric bias towards these models, and it's, to me, kind of poetic.

Starting point is 00:56:33 Like there's nothing in the architecture of RWKV that particularly biases it to be really good at other languages. It's just that as a community, you decided to prioritize it in your tokenization and your datasets. That's it. Yeah, that's it. I would even argue that I'm more surprised that, especially on the European side of things,

Starting point is 00:56:54 that we don't have more models that actually focus on even the European languages. Because that is like a softer jump to character, Japanese and Chinese characters are all romantic. But I think back to the benchmark, what excites me more still about this is that it just means that we just need to scale. We just need to scale this model and we derive data to like 40B. 40B, 60B. I mean, params is one thing. It's datasets and GPU time. Yeah.

Starting point is 00:57:27 So you and I have talked offline about ideas for getting data, getting compute and all this. Okay. So this is like a project that's ongoing. Okay, anything else for the future of RWKV? The biggest one would be... Okay, so this is back to how, remember I said, evals don't highlight everything. Being realistic, another weakness on the RWKV side is that now, with the rise of, let's say, 100K or 32K context size windows,

Starting point is 00:57:56 for transformer models, RWKV currently is trained to handle, let's say, 8K, or even some people have already trained it to 16K sizes. And well, as a neural network, it will happily keep going on for infinite context length. It will just keep generating. Does it do well? The answer is no, because you didn't train it to handle that situation. And there's actually a chart for that. So for example, the Pile test loss, right, it does improve, let's say, as we go down the context length, but this is if we train it. And what is not seen here is that if we were to, let's say, run it further, it'll just go back up, because it was not trained to handle that. While it technically can run, it suffers from the longer context length, and that

Starting point is 00:58:44 that's the part where RWKV struggles, especially in Q&A tasks on huge documents, or when you get to summarizing giant documents. None of this is fundamental, it's just you need more money? Yeah, that's it. And no, there is actually a fundamental part. So one of the things that I was doing,

Starting point is 00:59:02 and I am actively helping with within the community right now, is that we found that the existing way to scale the memory was not that efficient, and we were just being realistic with ourselves. If we want to hit 100K, we need to change this. So one thing that I'm actually looking forward to right now is those experiments. We have already started scaling things to be able to handle things at transformer scale, be it the 4K, 8K, in terms of how it

Starting point is 00:59:30 handles memory really well. And we want to extend that to be like 16, 32 and 64. And that is within our roadmap. And that's the exciting thing. Because once we have that, it is able to handle long-term memory within those sizes. It removes what many people in the community felt, right, was the last architectural limit. Because once it's able to handle memories, like context length, the same as a transformer, we normally do all the,

Starting point is 01:00:00 like, you know how people currently do long conversations in transformers? They just discard the rest with the sliding window. This is like the better version of sliding window. The model can handle the sliding window perfectly, where it can keep remnants behind it. Sure. And that's something that I'm really excited about and invested in, because this is back to the full circle of how I came into RWKV.

Starting point is 01:00:23 I want my model to handle 100K tokens. All the megabytes of HTML, whatever I throw at it, and be able to process it. But it will be lossy. The later half will be lossy, but the key thing is extending the non-lossy part, and we are aiming to extend the non-lossy part. So, you know, you have displayed today an impressive amount of knowledge just across, you know, all this stuff, and you don't have like a research background.

Starting point is 01:00:53 Your advice to AI engineers who want to get as deep as you? So I think your article articulated very well that there are going to be divisions within how we approach this. So AI engineers, model trainers and dataset curators, and ML scientists. So, loosely defined as three; I'll ignore the full stack because every company needs it. So within these three spaces, there are actually a lot of ways anyone can come in without knowing anything. So let's just start with AI engineers. Even though, in this whole topic, we dived into how the layers work and also showed how the math works,

Starting point is 01:01:33 Frankly, for an AI engineer, you don't need it. The main thing you need to do is to, frankly, just play around with ChatGPT or all the alternatives. Be aware of the alternatives, just be very mercenary: swap out to Claude if it's better for you, or swap out to an open source model if it's better for you, and just play around with the prompts. Learn basic prompting techniques like one-shot, two-shot, few-shot, and then from there onwards you can start building your agents, stacking your prompts in sequences and stuff like that, and you are able to build applications that do anything in

Starting point is 01:02:11 terms of the AI space, and all this without knowing all this nerdy stuff or all the hard engineering, because that's all you really need to actually build a product for the user. Remember, you are supposed to focus on making it for the user. They don't care if it's RWKV or a transformer underneath the hood, they just care that it helps them. And I will say Notion is probably one good example of how they use it, because we know underneath the hood it is OpenAI, but it's really useful for the user. Yeah.
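The one-shot and few-shot prompting mentioned above can be sketched without any model at all, since it is just string assembly. The helper name `build_few_shot_prompt`, the "Input:/Output:" layout, and the sentiment examples are assumptions for illustration; the resulting string would be sent to whichever model you are trying out.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt from an instruction, worked examples, and a new input."""
    parts = [instruction.strip(), ""]
    for text, label in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {label}")
        parts.append("")
    # The final input is left unanswered so the model completes it.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


examples = [
    ("I love this product", "positive"),
    ("It broke after a day", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input.", examples, "Works as advertised"
)
print(prompt)
```

Passing one example gives a one-shot prompt, two gives two-shot, and an empty list gives a zero-shot prompt, so swapping between them (or between model providers) does not change the rest of the application.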

Starting point is 01:02:43 No, so I obviously agree with all that. Let's just say that people are there already and they're just curious, they want to do what you did. So that's where you start going down the layers. So the next layer you go down into is subsequently

Starting point is 01:02:59 training the model from scratch, fine-tuning and incorporating the dataset. And this is where you still do not need to know the math, but you need to have a rough sense of how the model works, and how, even within the open source transformer space, certain models are better trained in certain sequences with certain learning rates,

Starting point is 01:03:23 and you just need to get a feel of it. So this is just like: collect the dataset, try it, see the loss. You literally did this. Yeah, at least for RWKV and the CodeGen models. Yeah, it's not cheap work either, because you need GPUs. Okay. And that took you how long? I think CodeGen alone was like six months, and then with RWKV I've been doing this

Starting point is 01:03:45 for like another six months. And it's just pure experimentation. There's no right or wrong, especially if it's in a different domain. Like recently I was helping someone on the EleutherAI Discord regarding the music generation domain, and my assumptions for learning rate and all the patterns were just completely thrown out of the window

Starting point is 01:04:07 because the music model just fundamentally is different in that sense. So that is the exciting thing: because it doesn't really have any specific rules

Starting point is 01:04:17 and guidelines until you trial-and-error your way to a certain space, it also means that you coming in now is as fresh as anyone else coming in last year.

Starting point is 01:04:27 It's really that kind of uncharted space for everyone, and especially as you start exploring new domains, your existing knowledge may actually matter. Because, and I think a few papers already covered this, how you train your model in certain sequences also matters: you want to train a certain set of knowledge and then you extend that knowledge subsequently. But if you're talking about material science or genetics, how am I supposed to know what is foundational or

Starting point is 01:04:56 what is extended knowledge? I have no idea. Maybe you do, and I'm just picking an example. And the same thing for music and so on. So those are things where, even though you're outside the space, you can come in just at the dataset level. Now you want to peel off to the next layer, let's say. Let's just say you want to look into modifying the model, the foundations of it.

Starting point is 01:05:22 I think one of the beauties of this current boom is that even though I dipped my toes in early, before the transformer wave and into an early neural network phase, frankly, almost everything that matters was basically in the past four years. There were a lot of things in academia before that, and they were mostly dealing with models that were under a billion parameters. They pretty much no longer matter. Can you be more specific? Like, okay, I know I'm shooting myself in the foot because of how RWKV is a neural network,

Starting point is 01:05:59 but if you're just trying to get transformers to work, you don't need to know LSTMs. Yes. You don't need, yeah, there's a lot of pre-knowledge in neural networks that is irrelevant in the transformer era. And maybe some of it will have a resurgence, but to get up and running, it's not a requirement.

Starting point is 01:06:21 And I think this is where you could go the very academic way of reading papers and stuff, but frankly, what I found way more useful was Karpathy. Yeah, his series of videos. Zero to Hero. Yeah, that is really, really good. Even though I had read some of the papers and guides before that, it really helps that it starts from zero, because you can see how it happens part by part. And even though we will not use the exact same code he uses, because he re-implements backprop and all that and we're just going to use torch for that,

Starting point is 01:06:59 Yeah, that's where you get the aha moments on how these building blocks work and how they fall into place. Like, I had a fundamental misunderstanding of how backprop worked until I actually watched this video. Oh, really? Yeah. And I think this is the scariest and craziest thing about AI models sometimes: you can actually have fundamental misunderstandings, but as long as you make the building blocks and they connect and, okay, the loss is great, it works. Yeah, well, so, you know, even the gods of the industry... I don't know if you read the SwiGLU paper. So there are all these alternative activation functions. Like there's ReLU, and then people are always looking for different slopes.
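For readers who have not met these activation functions, here is a minimal pure-Python sketch of ReLU next to the Swish-gated unit that SwiGLU is built on. The lists `a` and `b` are made-up stand-ins for two learned projections of the same input (x times W and x times V); real implementations would use tensors in a framework such as torch, and a full SwiGLU feed-forward layer also applies a further output projection that is omitted here.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity for positive ones."""
    return max(0.0, x)


def swish(x):
    """Swish with beta = 1 (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(a, b):
    """Gated unit: Swish of one projection, multiplied elementwise by the other."""
    return [swish(ai) * bi for ai, bi in zip(a, b)]


a = [-2.0, 0.0, 2.0]  # stands in for x @ W (hypothetical values)
b = [1.0, 1.0, 0.5]   # stands in for x @ V (hypothetical values)
print([relu(v) for v in a])
print(swiglu(a, b))
```

Unlike ReLU, Swish is smooth and lets small negative values through, which is one example of the "different slopes" people keep experimenting with.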

Starting point is 01:07:41 And very famously, the SwiGLU paper had this line in there that was like, yeah, we don't know why this works, but it works. Can't explain it. Yeah, it literally happens here and there, you know, to all these gods too. One of the funny things that I'm doing right now in the RWKV v5 experiments is: okay, we are going to make this change, we're going to run this training, make your prediction. Will this model beat this model in this loss curve?

Starting point is 01:08:07 As a game, as betting. It's very informal. It's literally a buddy kind of bet. But the fact that we can do this kind of bet even though we understand the code,

Starting point is 01:08:23 it just goes to show how often it's like, oh wait, this didn't go the way anyone predicted. And that's why, even if, let's say, you don't have a PhD and so on and so forth, even if math is not your specialisation, you're coming in as a

Starting point is 01:08:39 developer. I'm going to say frankly, I didn't come from the research side. The extremely math-heavy stuff is what I struggle with. What I do sometimes is copy and paste the math into GPT-4 and ask it to explain it to me in plainer language. It's very good at that. Yeah.

Starting point is 01:08:57 Yeah. And so, the thing is, there is lots of value beyond that. One thing that I realized, and this is not specific to RWKV, this also happens across a lot of open-source models, is that for a lot of ML scientists, when they build this stuff,

Starting point is 01:09:14 the focus was always more on getting it to work. It was never about getting it to work efficiently, or getting the code documented or organized. And Stable Diffusion literally went through this whole journey. They had the code and the model that worked. And the community, engineers that came in with zero machine learning background, just started picking it apart. It's like, no, you could replace this with this, which does the exact same thing.

Starting point is 01:09:42 It's more efficient. Like one of the major breakthroughs, for example, for GGML, and this happened some time back, for LLaMA, was that someone external to the AI community went in and implemented memory mapping. Yes, I saw that, yeah. I forget her name, but yeah.

Starting point is 01:10:02 Justine. Dot lol is her URL. Yeah. And she didn't come in as an AI expert. She came in as a software engineer. Yeah. And these are all just very, very straightforward. You know, in her world, this is normal.
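For anyone asking the same question about memory mapping, here is a minimal sketch using Python's standard `mmap` module. The tiny file of four float32 values and the helper `read_weight` are made up for illustration and have nothing to do with the actual GGML file format or the code that was contributed; the point is only that a mapped file can be read at an offset without first loading the whole file into memory.

```python
import mmap
import os
import struct
import tempfile

# Write a small file of float32 "weights" to stand in for a model file.
weights = [0.5, -1.25, 3.0, 8.0]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(weights)}f", *weights))


def read_weight(path, index):
    """Read one float32 by mapping the file instead of reading all of it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Each float32 is 4 bytes; the OS pages in only what is touched.
            (value,) = struct.unpack_from("<f", mm, index * 4)
            return value


print(read_weight(path, 2))  # 3.0
```

With a multi-gigabyte weights file the same idea means the process can start immediately and let the operating system decide which pages to keep resident.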

Starting point is 01:10:17 For the researchers, though, they will be like, I don't want that. Wait, what is memory map? Yeah, exactly. Yeah, and there are a lot of things like that. One of the jokes that I have right now is that every month there is a researcher, an ML scientist, who is rediscovering the number 32. Why? Because, be it a researcher or someone in the community writing the inference code, GPUs,

Starting point is 01:10:39 especially, tend to work really well when things are aligned to batch sizes that are multiples of 32. Oh. And if you've been in the gaming industry, especially when you write shader code, this is well known, just given knowledge. And people are just constantly rediscovering: oh, maybe if I just adjust my dataset or my data size to fit this batch size, suddenly I get a 10% improvement. And yeah, these are things that, once again, because they were so focused on just making it work, they won't know outside their

Starting point is 01:11:18 space. And that's why I would say, if anything, now is the best time for people who don't know AI deeply, from different backgrounds, to come in, because your contribution could be anywhere from the dataset level, how to train the knowledge, to shader code, to, heck, how to memory map, how to cache data. There are so many gaps. Cool, great. So yeah, thanks so much for being very willing to get on and talk with no prep. We did some prep, but it's a very unusual podcast episode, and I really enjoyed it. We literally just met yesterday in Singapore. But I know you've been on the Discord for a while, and I can tell you're very serious about all this.

Starting point is 01:11:54 I think it's very unusual for someone. Like, you have a job, but this is like a second job, essentially. Yes. But you are really enthusiastic and passionate about it, and I think that's very rare, and I do want to encourage more people to do it. And so thanks for sharing. Yeah, bye. Thanks for having me here.
