No Priors: Artificial Intelligence | Technology | Startups - The evolution and promise of RAG architecture with Tengyu Ma from Voyage AI

Episode Date: June 6, 2024

After Tengyu Ma spent years at Stanford researching AI optimization, embedding models, and transformers, he took a break from academia to start Voyage AI, which allows enterprise customers to have the most accurate retrieval possible through the most useful foundational data. Tengyu joins Sarah on this week's episode of No Priors to discuss why RAG systems are winning as the dominant architecture in enterprise and the evolution of foundational data that has allowed RAG to flourish. And while fine-tuning is still in the conversation, Tengyu argues that RAG will continue to evolve as the cheapest, quickest, and most accurate system for data retrieval. They also discuss methods for growing context windows and managing latency budgets, how Tengyu's research has informed his work at Voyage, and the role academia should play as AI grows as an industry.

Show Links:
- Voyage AI
- Stanford Assistant Professor of Computer Science Tengyu Ma

Key Research Papers:
- Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
- Non-convex optimization for machine learning: design, analysis, and understanding
- Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss
- Larger language models do in-context learning differently, 2023
- Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning
- On the Optimization Landscape of Tensor Decompositions

Sign up for new podcasts every week.
Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @tengyuma

Show Notes:
(0:00) Introduction
(1:59) Key points of Tengyu's research
(4:28) Academia compared to industry
(6:46) Voyage AI overview
(9:44) Enterprise RAG use cases
(15:23) LLM long-term memory and token limitations
(18:03) Agent chaining and data management
(22:01) Improving enterprise RAG
(25:44) Latency budgets
(27:48) Advice for building RAG systems
(31:06) Learnings as an AI founder
(32:55) The role of academia in AI

Transcript
Starting point is 00:00:00 Welcome to No Priors. Today we're talking to Tengyu Ma, Assistant Professor of Computer Science at Stanford, and the co-founder and CEO of Voyage. Voyage trains state-of-the-art components for next-generation retrieval systems, including embedding models and re-rankers. We're really excited to talk about his research and the RAG debate today. Welcome, Tengyu. Yeah, thanks so much. Thanks for having me here. We're looking forward to the debate. Yeah. Why don't we start with just a little bit of an overview of your research agenda to date, because I think uniquely it covers a broad range of fields within and around deep learning, from theory
Starting point is 00:00:40 to RL to embeddings and optimizers. So can you talk a little bit about how you picked the directions you have? Yeah. So I think most of the papers I wrote have some theoretical thinking in them. I guess maybe that's the commonality. And besides that, I think I worked on quite a few topics, as you mentioned, ranging from theoretical understanding and mathematical proofs of deep learning systems all the way to practical large language models and deep reinforcement learning. And these days, I think what we are working on is more centered on the efficiency of training large language models and improving reasoning tasks for large language
Starting point is 00:01:21 models. So my vision is that in the future, efficiency is very important, because we are running out of data and compute, so we have to either use the data much better or use the compute much better. And also, reasoning tasks seem to be a pretty important direction, and in some sense a risky direction, in the sense that we don't know exactly how fast we can solve those challenging reasoning questions yet. Can you mention a few of the key papers or work that you or students in your lab have done, just so our listeners can look them up? In the very early days, I think I worked on optimization for matrix completion; that's like 10 years ago.
Starting point is 00:02:11 And then I move on to embedding models, like sentence embeddings, vector embeddings. one of the papers we wrote is a very actually simple paper where we average the word embeddings to get sentence embeddings. And then we did some of these transformations using PCA to make the performance much better. That was even before Transformer came out. And then I think I move on to Transformers, Large Language Models, and Contrastive Learning, which is the new way of training the embedding models, especially the direction started with some of the papers on using contrastive learning for images and we work on improving those and understanding why contrasting brain can work. And recently we work on optimizers for large language models.
Starting point is 00:02:58 For example, one of the papers we wrote last year was Sophia, where we found a new optimizer which can improve the training efficiency by 2x for pre-training. This is great. Adam is very old at this point. Yeah, it's 10 years old now. I think that's the interesting part about it. So optimizers, you know, people have tried so many times in the last 10 years. There were so many papers published with improvements over Adam in various cases. But so far, Adam is still the default algorithm for training large language models. And that's why we thought it was the time to really spend a lot of time on this.
Starting point is 00:03:40 Like I think I started probably around 2018, 2019. I asked a few students work on this and finally we had one paper out after a few years, after a few failed projects and failed ideas. And recently, I think one of the Facebook friends actually used this in their large-scale multi-model training. And they found that on that scale, I don't know exactly how many parameters they are, but I assume it's kind of more than 100 billion parameters. They found that on that scale, there is a 1.6x improvement. in the efficiency of the training.
Starting point is 00:04:17 in the efficiency of the training. So that's like $10 million versus $16 million. That's super exciting. Yeah, I think, you know, Sophia has an opportunity to be really, really impactful. You started a company last year, taking leave from Stanford. Given that your work has been theoretical, but with practical applications, what drove you to do that? I think I came to Stanford partly because there's a very, very
Starting point is 00:04:57 And in terms of the timing, I felt that this is the right timing in the sense that the technologies are more and more mature so that it seems that the commercialization is the right timing right now. So for example, I think one story I have. have is that, you know, I look up some of my slides deck for my lectures at Stanford, CS-29, seven years ago when I started to teach at Stanford. At that point, machine learning, we have a lecture with Chris Ray and the machine learning on applied machine learning. So how do you apply machine learning industry? And there are seven steps there. So the first step is you define
Starting point is 00:05:37 your problem. The second step is you collect your data, you choose the loss function, you train it, and you iterate, so on and so forth. So it was pretty complicated at that point. And now foundation models have risen to power. And in the new foundation model era, the only thing you have to do is that, you know, someone will train the foundation model for you, and then you tune a prompt and you add retrieval-augmented generation on top of it.
Starting point is 00:06:04 And that's pretty much it. So applying machine learning and AI in an industry environment is much, much easier than seven years ago. And that's why I felt that this is probably the right time to commercialize many of these technologies: because the technologies are more mature. Yeah. This is actually a core premise even for the investing fund that I started, Conviction: that, you know, somebody's doing the bulk of the work for you in a more general way. And so the application of AI in industry is just much, much cheaper, right? Because you only do the last few steps, or a different set, but the last few steps in essence. So maybe you can talk about, just given your wide range of research,
Starting point is 00:06:48 the problem you focus on with Voyage that you saw with customers. Yeah. So with Voyage, we are mostly building these two components, rerankers and embeddings, for improving the quality of the retrieval or search system. The reason why we focus on this is because we talked to so many customers and we found that right now, for implementing RAG, it's not very hard to implement it
Starting point is 00:07:15 where you can just connect the components and have your RAG system ready very quickly, but the bottleneck seems to be the quality of the response, and the quality of the response is heavily affected, almost bottlenecked, by the quality of the retrieval part
Starting point is 00:07:31 if the large language model sees very relevant documents, then it can synthesize very good answers; even Llama 70B can do that very well. Can you just give a general intuition for what a RAG system is and some of its applications? Yeah, so just a little bit of background. In retrieval-augmented generation, the idea is that there's a retrieval step and there's a generation step. The main point here is that if you just use a large language model as a black box, you know, as is,
Starting point is 00:07:58 then the large language model wouldn't know anything about the proprietary information inside the company, and it doesn't know enough context about the use cases. And the retrieval-augmented generation stack is about first retrieving some knowledge from, for example, inside a company, and then giving that knowledge to the large language model so that the large language model can generate or synthesize a good answer without hallucination. This has been found to be very, very useful
Starting point is 00:08:29 to reduce the hallucination rate. And so there are two steps. The first step is to retrieve some relevant information given the query, and then this relevant information is given to the large language model. The retrieval step is important because once the large language model sees the relevant information, it can reduce the hallucination rate dramatically, because it
Starting point is 00:08:50 uses the relevant information as an anchor to refine the answers, in some sense. And what we are doing here is that we want to improve the quality of the retrieval, that is, the relevancy or accuracy of the retrieved documents and information. And the way to do this is in two steps. The first step is that you vectorize all of your documents, all of your knowledge base. So you turn the documents into vectors, you turn the videos into vectors, you turn your code into vectors. Code into vectors, everything into vectors.
Starting point is 00:09:21 And so the vectors are the representations of each piece of knowledge or each document, and they serve as the indices. And then you put these vectors into a vector database, and then you search for the right information using the vectors as indices. Where are you seeing RAG applications today? What are customers building, or what are the most common systems? Yeah. So we have a lot of users and they are all over the place. We even have a customer, a chemistry company, who is building this RAG system to
Starting point is 00:09:55 understand their chemistry documents and product descriptions. And it's almost everywhere: finance, legal, code retrieval, code generation, so on and so forth. I think it can be applied in almost any case. And also, even for individual users, where you have a lot of personal information and you want to have a RAG system on your phone so that you can access your past information in a much easier way and retrieve it. For example, we have all seen that when you search your documents on your laptop, it's actually pretty hard. You have to use the exact file name. It would be much easier if this search could be semantic-based.
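To make the two-step pipeline Tengyu describes concrete (vectorize the knowledge base once, then search with the query vector), here is a minimal, self-contained sketch. The bag-of-words `embed` function is a toy stand-in for a real embedding model such as Voyage's, and the documents are invented:

```python
import numpy as np

documents = [
    "quarterly revenue report for the finance team",
    "synthesis procedure for a new polymer compound",
    "unit tests for the payment service code",
]

# Toy featurizer standing in for a real embedding model: a normalized
# bag-of-words vector over the corpus vocabulary. A production system
# would call a trained embedding model here instead.
vocab = sorted({w for d in documents for w in d.lower().split()})

def embed(text: str) -> np.ndarray:
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab.index(tok)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# The "vector database": one row per document, built once up front.
doc_matrix = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 1) -> list[str]:
    # Embed the query, score by cosine similarity (vectors are unit norm),
    # and return the top-k documents.
    scores = doc_matrix @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

print(retrieve("polymer synthesis procedure"))
```

The retrieved text would then be passed to the LLM as grounding context for generation.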
Starting point is 00:10:35 You have to use the exact file name. It will be much easier if this search can be semantic-based. Right. Rag is a relatively new architecture. I think your average enterprise technology leader had not heard the term before the last year or so. And it became popularized in a researcher for the last few years. But there is already a debate, I think, you know, in terms of opinions from people at different large labs and in academia about whether or not you need a rag architecture to work. on proprietary data. And just to sort of describe some of the alternative views, I think there's kind of two alternative points of view given. One is a sort of agent chaining architecture where you are inputting your data and knowledge, you know, chemistry, code, law, finance, whatever documents into a series of LLMs that just operate with instruction on it, for example, to summarize or categorize it,
Starting point is 00:11:41 or you simply feed everything into LLMs with infinite context, or actively managed context, versus explicitly vectorizing anything. And so I would love to get your reaction to that as an alternative to RAG. Actually, there was also a debate last year
Starting point is 00:11:59 about RAG versus fine-tuning. And I think that debate has kind of reached a consensus. Now, it sounds like RAG is much easier than fine-tuning, and fine-tuning in many cases doesn't work, because you need a lot of data to see results and there are still hallucinations even after fine-tuning.
Starting point is 00:12:20 And now, as you said, the debate becomes RAG versus agent chaining or long context. So maybe let's talk about long context first. I think there are probably two answers to this, from different angles, because long context is not practical yet, right? So we have to anticipate what a long-context transformer can do and then have the debate at a future time, in some sense, or anticipate the debate at a future time.
Starting point is 00:12:49 In the near term, the long-context transformer, where you just put all the proprietary data, 1 billion tokens, into the context of the transformer, will be very, very expensive, right? At the prices right now, it's going to be just impossible
Starting point is 00:13:16 for example, one approach is you catch the activations of all of the internal operations of the documents you put in the context. So that will bring the cost down by a lot, but I think still if you do the calculation, theoretically, it's still much more expensive than REC. So I think that's the more practical answer. So in terms of cost, it's going to be much more expensive than REG because you have to save all of these activations or intermediate computations in the GPU memory, most likely, or maybe in CPU
Starting point is 00:13:50 memory, for the whole 1-billion-token context. You may argue that, okay, over time everything will become cheaper and cheaper. But RAG will become cheaper as well, right? Because many of the technologies in RAG are neural-network-based, and the GPUs will become cheaper and the neural networks will become smaller. So my prediction is that RAG will be much cheaper than long context going forward. And another way to think about this is from first principles. My analogy for long context is that, in some sense, the context is the short-term memory, right?
Starting point is 00:14:28 And RAG is more like long-term memory, in some sense. So the question is, for example, when you answer a question, why do you have to go through the entire library every time, loading the entire library into your short-term memory to answer a single question? It sounds like the right approach should be that for every single question, you retrieve some subset of the information and use that to answer the question. That seems to be the most efficient way to do it. There should be some kind of hierarchy, in some sense, in terms of how we solve the problem
Starting point is 00:15:01 so that we can get the best efficiency. Even in computer architecture, the hardware stuff, you have different levels of caching, right? You have disk, you have CPU cache, and so forth. So in that sense, I feel like a more hierarchical, two-level kind of system like RAG is more cost-efficient. Yeah.
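The cost argument can be put into back-of-the-envelope numbers. The per-token price below is an illustrative assumption, not any provider's actual pricing:

```python
# Rough per-query cost of stuffing a whole corpus into the context versus
# retrieving a few chunks. The price is an assumed, illustrative figure.
PRICE_PER_MILLION_TOKENS = 1.0   # dollars; assumption for illustration

corpus_tokens = 100_000_000      # e.g. a company with 100 million tokens
retrieved_tokens = 10_000        # RAG: a few retrieved chunks per query

long_context_cost = corpus_tokens / 1e6 * PRICE_PER_MILLION_TOKENS
rag_cost = retrieved_tokens / 1e6 * PRICE_PER_MILLION_TOKENS

print(f"long context: ${long_context_cost:,.2f} per query")
print(f"RAG:          ${rag_cost:,.4f} per query")
print(f"ratio:        {long_context_cost / rag_cost:,.0f}x")
```

At these assumed numbers the gap is four orders of magnitude per query, which is the shape of the multi-order-of-magnitude difference described in the conversation.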
Starting point is 00:15:22 I mean, the analogy certainly makes sense. I think there is another thread of discussion about what long-term memory for LLMs looks like, where it is something managed by the LLM itself, but I do not think that is a well-answered question, and RAG may just be a part of that answer. So the embedding model and the reranker are, in some sense, the components
Starting point is 00:15:44 that are managing the long-term memory. Of course, there might be variants and other ways to manage the long-term memory, but I think it will be somewhat similar. The technology always evolves gradually, right? So maybe two years later, Voyage, or maybe other companies, will have a new version of the
Starting point is 00:16:04 long-term memory, which is based on embedding models but extends them in some way; that's entirely possible. Yeah. I do think it's useful to contextualize, for people who are not working with data sources for LLMs at scale every day, what the token limitations are, right? We have gone from a few thousand tokens to something like Gemini 1.5 Pro, with a context window of a million tokens. And if you put that in word count, that's maybe five books, or 25 to 30,000 lines of code, and obviously a limited amount of video and audio. And so I think the ability to make reasoning decisions on more than that amount of data is obviously going to be needed. And the questions to me are really: does efficiency matter, both from a cost perspective and a speed or latency perspective? How much can you push the context window? And,
Starting point is 00:17:10 you know, does hallucination management matter? And so I think there are lots of arguments for RAG being very persistent here. Yeah, exactly. And just to add a little bit to that: one million tokens is five books, right? But many companies have 100 million tokens. That's a 100x difference. Right. And 100x in cost is a big difference. That could be, you know, $100K versus $10 million. Right. $10 million is unacceptable, but $100K sounds okay. Yeah, I think that's probably what's going to happen. At least for many of the companies, if they have 100 million tokens right now, I don't think they can use long-context transformers at all, because it's way too expensive. Right. And the simplest
Starting point is 00:17:51 thing for me is actually for a system to look at the entire code base, or some representation of the entire code base, versus the portion of it that can fit into context today. Yeah. What about the other piece, the idea of agent chaining and using LLMs to manage the data in that form? So agent chaining, this is a growing area and many people are doing research on it. I think it's a little bit less well-defined, in some sense. At the first level, I would say that I think it's kind of orthogonal to embedding
Starting point is 00:18:23 models and rerankers to some degree, because even when you have a lot of agent chaining, you still probably use embedding models as part of the chain, right? You probably do iterative retrieval as part of the chain. And of course, you use large language models as part of the chain as well. In some sense, it's an orthogonal direction. So I would probably rephrase agent chaining as more like an iterative, multi-step retrieval-augmented, large-language-model-augmented system.
Starting point is 00:18:53 So, and some part of this retrieval probably is done by a large language model. Sometimes part of the system is done by a small, large language model, and some part of the system is then by embedding model, so on and so forth. So in that sense, I feel like it's somewhat kind of orthogonal. Yeah, and I feel like some of the motivation for agent chaining to begin with is the same efficiency motivation as RAG. Yeah, exactly, right? But if you use a very, very large language model to manage the system,
Starting point is 00:19:22 in the knowledge system, I think you again lose the efficiency, right? So it has to be a somewhat smaller model that manages the knowledge. And at that point, an embedding model might be the right thing to use in that agent-chaining framework. Maybe another angle to look at this is whether we should do iterative retrieval versus retrieving just once. I think iterative retrieval is definitely useful, especially because there is still a lot of headroom in embedding model performance.
Starting point is 00:19:55 So that's why sometimes you have to retrieve multiple times: because the models are not clever enough. However, in the long run, my suspicion is that iterative retrieval will still be useful, but it will be a bit less useful as the embedding models become more and more clever. Once the embedding models are more clever, then maybe one round or two rounds is going to be enough. If we go ahead and just assume that RAG is at least a dominant architecture for enterprise use cases, where you care about proprietary data that is large, with reliability, how do you go about improving a RAG system, right? You can improve the LLM itself, but what are the other components that you are working on, or what are the challenges, from the user's or builder's perspective, in improving retrieval quality? Yeah, so I guess there are a few ways, right?
Starting point is 00:20:47 One way is that you improve the prompting of the large language models. For example, you could tell the large language model to abstain if there's no relevant information in the retrieved documents. But because the large language models are so good these days, I think you don't need a lot of prompting anymore; they just respond to the instructions so well. And then the next thing is to improve the retrieval part, which is the bottleneck, in my opinion,
Starting point is 00:21:13 because most of our users found that if they improve the retrieval quality, that directly affects the response quality. And for improving the retrieval part, I think there are two ways. One way is you improve the embedding model; the other is you improve some of the things on top of that: for example, how you chunk the data, whether you do iterative retrieval, whether you put some of the meta information into the data, so on and so forth. So basically, I would say there are two ways of improving.
Starting point is 00:21:40 One way is you improve the networks, either the embedding models or the rerankers, and the other is you improve the ways you use the networks with software engineering, right? Better chunking, iterations, or other kinds of heuristics or tricks on top of that. What we specialize in is improving the networks, because that requires a lot of heavy lifting. That's a very data-driven approach.
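As a concrete picture of the chunking step in that software-engineering layer, here is a hedged sketch that splits a document into fixed-size, overlapping windows before embedding. Whitespace tokenization and the 512/64 sizes are illustrative choices, not Voyage's actual preprocessing:

```python
def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows of at most `size` tokens."""
    tokens = text.split()          # real systems use a model tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1200))
pieces = chunk(doc)
print(len(pieces))  # 3 overlapping windows covering all 1200 tokens
```

Each window would then be embedded and indexed separately; the overlap keeps sentences that straddle a boundary retrievable from at least one chunk.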
Starting point is 00:22:04 We train our neural networks on trillions of tokens at least, and we fine-tune them for special use cases. And this is something that a company should probably do, instead of every end user optimizing it themselves. And my long-term vision here is that some of the software engineering layers on top of the networks will be less and less needed as the networks get more and more clever. So for example, right now we already see that chunking is becoming less needed, because the context window is becoming longer and longer and we have the relatively long-
Starting point is 00:22:40 context embedding model. Long context here means, for example, 10K, maybe 16,000 tokens, so that you can put a 50-page PDF into it. Because these long-context embedding models have become much better, there's less of a need to chunk the documents into pieces of, say, 512 tokens. And I think this will happen in other dimensions as well, right? So maybe in the future you won't have to turn your images
Starting point is 00:23:05 into descriptions of images and then give them to the text embedding model. That's what people are doing right now: everything is turned into text, and then a text embedding model is used. But when the embedding models are more clever and multimodal, then you won't have to do that anymore. Can you talk a little bit about the intuition for how fine-tuning or domain-specific embeddings improve performance? Yeah, fine-tuning and domain-specific embedding models are what we are very good at at Voyage. So just to give some context: what we do is start with a general-purpose base embedding model, which is also something we trained from scratch.
Starting point is 00:23:44 And from there, we first fine-tune, or continue pre-training, whatever you call it, on some domain-specific data. So, for example, we fine-tune on two trillion tokens of code snippets, and then we get the code embedding model. And we do the fine-tuning on one trillion legal tokens, and that's how we got the legal embedding model. For these domain-specific embedding models, we didn't use any proprietary data, so that everyone can use them, but they really excel in one particular domain,
Starting point is 00:24:12 so that everyone can use them, but they really excel in one particular domain, and the performance in other domains are not changed much. And the reason why we do this is because the number of parameters in the embedding model is limited. So because you only have a latency budget, something like maybe one second, sometimes like 200 milliseconds, you know, some people even want 50 milliseconds,
Starting point is 00:24:35 it's basically impossible to use more than 10 billion parameters for embedding models. And since we have limited parameters, customization is very important, because customization means that you use the limited number of parameters on the right tasks and the right domain, so that you excel in that domain. There's no way you can use these 10 billion parameters to excel in everything. So that's why you have to specialize in one domain.
Starting point is 00:25:02 And we have seen 5 to 20% improvement from this domain-specific fine-tuning, depending on the particular domain. For code, we have seen 15 to 20% improvement, partly because we have a lot of data there, and the headroom there is also bigger, because code retrieval requires a lot of deep understanding of the algorithmic part of the code. For the legal domain, the baseline is a little better, so the headroom is slightly smaller; that's why we see 5 to 15% improvement depending on the data sets. For some of the very complex legal data sets, we have seen bigger improvements. Just to make sure that our listeners can picture exactly where the latency cost is coming from in a search system: your data has been vectorized by an embeddings model, but then every query also needs to be translated into an embedding and then compared to the embeddings of your knowledge in order to feed the LLM for the generation that you want, right?
Starting point is 00:26:04 And so there's inference-time latency here as well. I just think that's not obvious if somebody hasn't built a RAG system. Yeah, exactly, exactly. So basically, at inference time, you have to first turn the query into a vector and then do the search with the vector database. And actually, related to this, the dimension of the vectors you produce also affects the latency of the vector-based search. If the dimension of the embedding is only 100, then it's going to be much, much faster than when the dimension of the embeddings is 1,000. And actually, this is something we are very good at as well. We produce embeddings with a 3x to 4x smaller dimension than some of the competitors.
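The dimension effect is easy to see from the shape of a brute-force similarity scan: scoring a query against N stored vectors costs roughly N times d multiply-adds, so a 4x smaller dimension is roughly 4x less work per query. A small sketch, with illustrative sizes:

```python
import numpy as np

def scan_cost(num_vectors: int, dim: int) -> int:
    # Multiply-adds for one brute-force query scan over the database.
    return num_vectors * dim

print(scan_cost(1_000_000, 1024) // scan_cost(1_000_000, 256))  # 4

# The scan itself, on a small synthetic database of unit vectors.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 256))
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = db[42] + 0.01 * rng.normal(size=256)   # query near a known vector
query /= np.linalg.norm(query)

best = int(np.argmax(db @ query))              # cosine-similarity scan
print(best)  # 42: the nearest stored vector is recovered
```

Real vector databases use approximate indexes rather than a full scan, but the per-comparison cost still grows with the embedding dimension, which is why smaller dimensions cut latency.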
Starting point is 00:26:46 Yep, that makes sense. Intuitively, you are creating embeddings models that use a limited number of parameters and dimensions, given the latency budget that any application has, to create the best possible representation of proprietary or domain-specific data. Yeah, exactly. And going back to domain specificity and fine-tuning, the second level of customization is that we can customize to a particular company, right? So we fine-tune on the proprietary data of a particular company, and we can see a 10 to 20% improvement on top of the domain-specific fine-tuning as well.
Starting point is 00:27:30 So, of course, there's a total budget in terms of how much additive improvement you have there. Right. So if you start with 50% accuracy, then you have 50% headroom. But if you start with 90%, you only have 10% headroom. So the absolute improvement varies a little bit across domains. Maybe just give some advice to people who are building RAG systems. At what point do they begin to invest in some of these retrieval components? Yeah. I think they can do it even from day one, as long as they have a prototype available. So basically, my default suggestion for our users is that when they have the RAG system, first of all, of course, you connect the components and at least see some responses, and then probably do some basic profiling in terms of the latency and the quality, right?
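One simple form of that retrieval-quality profiling is recall@k over a small labeled query set: the fraction of queries for which at least one relevant document shows up in the top k results. The data and helper below are illustrative:

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        1 for q, ranked in retrieved.items() if set(ranked[:k]) & relevant[q]
    )
    return hits / len(retrieved)

# Ranked document ids a retriever returned per query, plus gold labels.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d8", "d5", "d6"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d4"}}

print(round(recall_at_k(retrieved, relevant, 1), 2))  # 0.33: only q2 hits at rank 1
print(round(recall_at_k(retrieved, relevant, 2), 2))  # 0.67: q1 joins in the top 2
```

Comparing this number against the end-to-end answer quality is what reveals whether retrieval or generation is the bottleneck.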
Starting point is 00:28:20 So you can check the retrieval quality, meaning how often you retrieve relevant documents. There are some standard ways to evaluate the retrieval quality, and then you also do the end-to-end evaluation of the responses, and then you can see which part is the bottleneck. In many cases, people find that the retrieval quality is not good, so the final response is not good. And then you can swap some of the components. You can say, I'm going to try Voyage embeddings.
Starting point is 00:28:49 I can try the Voyage rerankers, which we haven't discussed much, and you can try various different embeddings, and possibly various different large language models as well. Maybe just zooming out: you started by saying that in order to have the debate about RAG versus alternative architectures for working on proprietary data, you need to predict forward, right? Any predictions for how these systems change as LLMs improve dramatically? If we look at the next generations of OpenAI and GPT, and Claude, and the Mistral models, and Llama, and such? Yeah, so my prediction is that the systems will become simpler and simpler. Maybe this is my biased view.
Starting point is 00:29:34 So at least this is something that we are working towards. So the idea would be that it's a very, very simple system. So you just have three components, like a large language model, a vector database, and embedding models, and maybe four components, with another reranker, which refines the retrieved results, and you connect all of this, and each of the neural networks
Starting point is 00:29:59 does everything else. You don't have to worry about anything like chunking, multi-modality, or changing the data format, because the neural networks can do most of that. So seven years ago, if you talked to any of the
Starting point is 00:30:11 so-called language models, you had to turn the input into a very, very clean format. And now you talk to GPT-4, you can have typos, you can have all kinds of, like, weird formats, you can even dump JSON
Starting point is 00:30:23 files to it, right? So the same thing would happen for embedding models as well. So my vision is that in the future, AI will just be a very simple software layer on top of a few very strong neural network components. Yes, I think your bias toward it actually all going to be AI versus complex, you know, discretized software systems is clear, but I believe it's directionally right. Maybe zooming out to just get a little bit of your perspective as a founder: like, you know, what are one or two top learnings you have about starting the company as an academic, even, you know, despite your work with Google and other companies before? Yeah, I think it's very, very different. Founding a company is very different from doing research at big tech, and actually it's a little bit closer to being in academia, because to run a university lab, I'm the
Starting point is 00:31:21 CEO, CFO, and HR for the university lab, right? So you touch on a little bit of everything, but at a vastly different scale, right? So I think one of the biggest things I learned, actually, is from one of our angel investors: that I should read some of the books. Even though, I think, for a probably experienced entrepreneur, many of the books are very basic, for me they are very, very useful when I read some of them, even the basic books, including Elad's book, by the way. But his book is a little bit advanced in the sense
Starting point is 00:31:54 that his book is talking about how to scale from 10 people to a thousand people, and I only read a few chapters of that because we are about 10 people right now. And also talking to a lot of angel investors, talking to Sarah and my other lead investors, I think all of this helped me a lot in reducing the unforced errors
Starting point is 00:32:17 in this process. To me, I think it's really about how to reduce the number of errors you make so that you can maximize the efficiency. At least this is what happened to me. And also how to correct the mistakes as fast as possible, right? If you can correct mistakes one week after you made them versus, like, one month after you made them, then that's a 4x efficiency improvement. Very theoretically consistent with your, you know, vein of research. Last question: you know, you've been personally productive, you run a productive research lab, and you've started a company. What do you think the role of academia in AI is in this age of, like, scaling?
Starting point is 00:33:02 Because most of your former students, like, they essentially all work at OpenAI or Anthropic, with, you know, a few professors and Citadel folks in the mix. And the ones working with you. Yes, yes. In academia, this is a little bit of a controversial topic; I think different people have different views. My view is that academia probably should work on some questions different from what industry is good at, right? So if we are only working on how to scale up the system, then obviously the incentive is not right. We don't have enough capital there. And even OpenAI, I guess Sam Altman argues that you need a lot of capital to start to do this, in some sense.
Starting point is 00:33:46 So, you know, like at the very beginning, I think the point is that, you know, you first have, it cannot be non-profit because if it's nonprofit, then you don't have enough capital and you cannot scale up enough. I think I kind of agree with that. And that's why in academia is very hard to scale up and have enough resources to do the large-scale research. However, I think in academia, there are many, many other things that we can do on a smaller scale. And we probably should focus on more long-term innovations. So what I told my students at the lab is that we should think about what will be the breakthrough in three to five years as opposed to how do you help open eye to improve their large language models in the next in GPT5. So that's why we work on optimizers, which is like 10 years old. The item is a 10 years old optimizer and we say, okay, that sounds like a long-term project. Maybe in five years we can improve the optimization efficiency by 5 to 10x. that's going to be a game changer for the whole landscape, right? So if we improve the efficiency by 10x,
Starting point is 00:34:51 I guess that's like $100 million, so it's $10 million for training GPD5, then I think that would change the landscape a lot in the industry. So efficiency is one of the things I spend a lot of time on. Another thing is that there's reasoning tasks. I think the reason why I identify that as one of my lab's direction is because it's challenging and it requires a lot of very innovative research. It's very unclear whether you can really, the scaling law is really enough
Starting point is 00:35:21 to get you to prove the Riemann hypothesis or any of the math conjectures. So, you know, you also have to reach superhuman performance in some sense, right? So if you train on just the Common Crawl data on the web, can you be a good mathematician? It's kind of very hard to believe that. So we need more innovations there. So that's pretty much what we are doing at the university lab: we try to work on the three-to-five-year agenda and on a smaller scale.
Starting point is 00:35:50 I think that's an inspiring note to end on, and, like, a very open-minded one about what is still to be figured out. Thanks so much for doing this, Tengyu. Thanks so much. Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
